Blended Document Similarity based on Text & Image Features

Author(s): Anand Jha


Document similarity is a building block for many useful applications, including information retrieval, document clustering, and question-answering systems. In the modern digital world, informative documents are composed of text, images, and videos. In such a scenario, similarity based purely on text, images, or video may not be adequate; a metric that blends similarity across these aspects should be used instead. In this paper, a weighted similarity measure based on text and images is developed using popular open-source machine learning (ML) libraries. The approach is flexible and easy to apply, and it does not require the large training datasets that ML tasks often demand.
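The abstract does not spell out the weighting scheme, so the following is only a minimal sketch of one way a blended score could be formed: a convex combination of a text similarity and an image similarity. The function names, the `text_weight` parameter, and the bag-of-words text model are assumptions made for illustration and are not taken from the paper; in practice the text and image features would more likely come from the open-source ML libraries the paper mentions.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def text_similarity(doc_a, doc_b):
    """Cosine similarity of bag-of-words counts (an assumed stand-in text model)."""
    counts_a = Counter(doc_a.lower().split())
    counts_b = Counter(doc_b.lower().split())
    vocab = sorted(set(counts_a) | set(counts_b))
    return cosine([counts_a[t] for t in vocab], [counts_b[t] for t in vocab])


def blended_similarity(text_sim, image_sim, text_weight=0.5):
    """Weighted blend of two scores; text_weight is a hypothetical tuning knob."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    return text_weight * text_sim + (1.0 - text_weight) * image_sim


if __name__ == "__main__":
    # Image feature vectors are assumed to be precomputed elsewhere,
    # for example by a pretrained vision model; these values are made up.
    image_features_a = [0.2, 0.8, 0.1]
    image_features_b = [0.25, 0.7, 0.0]

    t_sim = text_similarity("blended document similarity",
                            "document similarity with images")
    i_sim = cosine(image_features_a, image_features_b)
    print(blended_similarity(t_sim, i_sim, text_weight=0.6))
```

Setting `text_weight` to 1.0 or 0.0 reduces the blend to a purely text-based or purely image-based score, which makes it easy to compare the blended measure against either single-modality baseline.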
