Weighted clustering on fast sentence embeddings to determine themes from large unstructured data

Author(s): Paritosh Sinha, Mohan Krishna Askani

Abstract

Most engineering product improvements are driven by feedback from users and engineers. B2C products, such as those used to target customers, send personalised communications, or manage order requests, track event-level actions and failures to improve product performance. However, the volume of failure logs (often on the order of a billion) and their unstructured nature (machine logs that are hard for humans to read) often hinder the detection of underlying themes in event failures. This paper presents an efficient approach to tuning and using a language model for embedding generation. The embeddings are then grouped with a weighted clustering technique so that failures fall into automatically detected themes. The paper also proposes methods for managing embeddings that improve the algorithm's performance while keeping the focus on efficiency and computation time. Our experiments show that the proposed technique performs comparably to the latest language models while taking less than one-tenth of the overall computation time.
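The general idea of clustering embeddings with weights can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the language model, the weighting scheme, or the clustering algorithm. The sketch assumes that identical log messages are collapsed into one record whose weight is its occurrence count, uses a toy hashed bag-of-tokens vector (`embed`, hypothetical) in place of a tuned language model, and uses a plain weighted k-means (`weighted_kmeans`, hypothetical) as the clustering step. The log lines are made up.

```python
import math
import zlib
from collections import Counter

DIM = 32  # toy embedding size; an assumption for this sketch


def embed(text, dim=DIM):
    """Hashed bag-of-tokens vector, L2-normalised.

    Stand-in for the tuned language model, which the abstract does not specify.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 gives a hash that is stable across runs, unlike built-in hash()
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def weighted_kmeans(points, weights, k, iters=20):
    """K-means in which each point pulls its centroid in proportion to its weight."""
    # Deterministic init: heaviest point first, then farthest-point selection.
    first = max(range(len(points)), key=lambda i: weights[i])
    centroids = [list(points[first])]
    while len(centroids) < k:
        far = max(range(len(points)),
                  key=lambda i: min(sq_dist(points[i], c) for c in centroids))
        centroids.append(list(points[far]))

    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: weighted mean of the members of each cluster.
        for j in range(k):
            members = [i for i, lab in enumerate(labels) if lab == j]
            total = sum(weights[i] for i in members)
            if total == 0:
                continue  # leave an empty cluster's centroid unchanged
            centroids[j] = [
                sum(weights[i] * points[i][d] for i in members) / total
                for d in range(len(points[0]))
            ]
    return labels, centroids


# Made-up failure logs; real logs would be far more numerous and varied.
logs = (["timeout connecting to db host 1"] * 500
        + ["timeout connecting to db host 2"] * 300
        + ["invalid email address for user"] * 40
        + ["invalid email address for customer"] * 10)

counts = Counter(logs)                   # collapse duplicate messages
messages = list(counts)                  # unique messages, first-seen order
weights = [counts[m] for m in messages]  # weight = occurrence count
points = [embed(m) for m in messages]    # one embedding per unique message

labels, _ = weighted_kmeans(points, weights, k=2)

# Total number of raw log lines that fall under each theme.
theme_weight = Counter()
for lab, w in zip(labels, weights):
    theme_weight[lab] += w

for m, lab, w in zip(messages, labels, weights):
    print(f"theme {lab}  x{w:<4} {m}")
```

Under these assumptions, only the unique messages are embedded and clustered, so the cost scales with the number of distinct messages rather than the raw log volume, while the weights keep frequent failures dominant when centroids are computed.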
