Blended Document Similarity based on Text & Image Features

Author(s): Anand Jha


Document similarity is a building block for many useful applications, including Information Retrieval, Document Clustering, and Question-Answering Systems, to name a few. In the modern digital world, informative documents are composed of text, images and videos. In such a scenario, similarity based purely on text, image or video alone may not be adequate. Hence a metric that blends similarity across all these aspects should be used. In this paper, a weighted similarity measure based on text and images is developed using some popular open-source Machine Learning (ML) libraries. The resulting method is flexible and easy to apply, and it does not require the large training datasets that ML tasks often need.
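
As a rough illustration of what a weighted text-and-image similarity can look like, the sketch below blends two cosine similarities with a single weight. It is a minimal sketch under stated assumptions, not the measure developed in this paper: the function names `cosine_similarity` and `blended_similarity`, the parameter `text_weight`, and the use of cosine similarity over precomputed feature vectors are all hypothetical choices made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def blended_similarity(text_a, text_b, image_a, image_b, text_weight=0.5):
    """Weighted blend of text and image similarity.

    Assumption: each document is already represented by one text feature
    vector and one image feature vector. `text_weight` (hypothetical name)
    is the share given to text; the remainder goes to images.
    """
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must be between 0 and 1")
    text_sim = cosine_similarity(text_a, text_b)
    image_sim = cosine_similarity(image_a, image_b)
    return text_weight * text_sim + (1.0 - text_weight) * image_sim
```

With `text_weight=1.0` the score reduces to text-only similarity, and with `text_weight=0.0` to image-only similarity, so the weight can be tuned to how much each modality matters for a given collection.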