Time Expression Extraction and Normalization in Industrial Setting

Author(s): Piyush Arora, Bharath Venkatesh, Salil Rajeev Joshi, Rahul Ghosh

Abstract

We present TEEN, an industry-grade solution to the problem of time expression extraction and normalization (Timex). Extracting and normalizing temporal units is challenging for several reasons, e.g., (i) the same time unit may be expressed in different ways, (ii) the inherent ambiguity of natural language leads to multiple interpretations, and (iii) natural language is context-sensitive. While various academic and industrial approaches have proposed solutions to Timex, building an industry-strength solution involves additional challenges in the form of user expectations, the need to deliver high precision, and the lack of training corpora. We describe how TEEN mitigates these challenges. We show how the proposed approach compares with various state-of-the-art baselines on textual data from the finance industry, and we categorize the inadequacies of these baselines in an industrial setting. Finally, we share the observations we made and the lessons we learned while designing TEEN to work in an industrial setting.
