Enhancing Text Data Quality: A Guide to Detecting Issues with Cleanlab

Improve text data quality with Cleanlab for better LLMs.

In the era of big data, dataset quality can make or break a machine-learning model. For LLMs in particular, clean and error-free text data is paramount. This guide explores how Cleanlab can improve your text data quality by identifying and addressing common issues such as mislabeled data, noisy labels, and outliers.

Table of contents

  1. Importance of Data Quality
  2. Understanding Text Data Issues
  3. How does Cleanlab help in Data Quality?
  4. Enhancing Text Data Quality with Cleanlab

Let’s understand the importance of data quality and issues in text data.

Importance of Data Quality

Data quality is especially critical in NLP tasks, where the subtleties and complexities of human language must be accurately captured and understood by models. Text data, unlike numerical data, often contains nuances such as context, tone, and ambiguity that require careful handling. High-quality text data ensures that these nuances are preserved, allowing models to perform more effectively in tasks such as sentiment analysis, machine translation, and information retrieval.

Consequences of Poor Data Quality

The repercussions of poor data quality can be severe and far-reaching:

  • Inaccurate Predictions: Models trained on flawed data are likely to produce erroneous results. For instance, mislabeled data can lead to incorrect classifications, which undermine the model’s reliability.
  • Model Bias: Inaccurate or incomplete data can introduce biases that skew the model’s outputs. This is particularly problematic in NLP, where biases can perpetuate stereotypes or reinforce incorrect assumptions.
  • Increased Costs: Low-quality data increases costs in both time and resources. Cleaning and preprocessing poor-quality data takes additional effort, which delays model development.
  • Reduced Trust: Users and stakeholders must have confidence in the outputs of machine learning models. Poor data quality erodes this trust, making it difficult to justify the use of these models in critical applications.

Understanding Text Data Issues

Before we can improve data quality, it’s essential to understand the common issues that plague text datasets. Text data, while rich in information, is inherently messy and often requires significant preprocessing to be useful for machine learning models. Here, we will explore some of the most prevalent problems found in text datasets and their impacts.

Common Problems in Text Datasets

Mislabeled Data

Mislabeled data occurs when the labels associated with text samples are incorrect or inconsistent. For example, in a sentiment analysis dataset, a positive review might be mistakenly labelled as negative. Mislabeled data can lead to misleading model training, resulting in inaccurate predictions and poor model performance.

Noisy Labels

Noise in labels refers to variability or errors in the labelling process. This can happen due to human error, subjective interpretations, or automated labelling systems that aren’t perfect. Noisy labels can confuse models during training, reducing their ability to learn meaningful patterns and degrading overall accuracy.

Outliers

Outliers are data points that deviate significantly from other observations. In text data, this could be rare words, unusual phrases, or sentences that don’t fit the general context of the dataset. Outliers can skew the training process, causing models to overfit these anomalies and perform poorly on more typical data.

Incomplete Data

Incomplete data refers to missing values or incomplete sentences and phrases within the dataset. This can occur due to data collection errors or preprocessing steps that unintentionally remove parts of the text. Incomplete data can lead to gaps in the training process, where models fail to learn from complete contexts, thus affecting their ability to make accurate predictions.

Duplicate Data

Duplicate data consists of identical or nearly identical text samples that appear multiple times within the dataset. This often happens due to data collection methods or the merging of datasets from different sources. Duplicates can cause models to overfit certain examples, leading to an unbalanced understanding of the data and poor generalization to new, unseen text.

Ambiguous Data

Ambiguity in data occurs when text samples can be interpreted in multiple ways. For instance, a sentence like “I can’t recommend this product enough” is idiomatically strong praise, yet read literally it could be taken as a refusal to recommend, so annotators may label it as either positive or negative. Ambiguous data introduces uncertainty into model training, making it difficult for models to learn clear distinctions and leading to inconsistent predictions.

Impact on Machine Learning Models

The presence of these issues in text datasets can severely hamper the performance of machine-learning models. Models trained on flawed data are likely to produce unreliable and biased results. This not only affects the accuracy and efficiency of the models but also undermines trust in their outputs. For instance, an NLP model trained on mislabeled or noisy data may fail to accurately classify sentiments, detect entities, or perform translations.

How does Cleanlab help in Data Quality?

Cleanlab, a powerful open-source tool, offers advanced capabilities to identify and correct issues in datasets, ensuring that models are trained on the best possible data. This overview explores how Cleanlab enhances data quality by addressing common problems such as mislabeled data, noisy labels, and outliers.

Cleanlab is designed to automatically find and fix label errors in datasets, making it an invaluable tool for data scientists and machine learning engineers. Its core functionality revolves around identifying mislabeled data and improving the overall quality of datasets with minimal manual intervention.

Key Features and Benefits

  • Automated Label Error Detection: Cleanlab employs sophisticated algorithms to detect mislabeled data points. By analyzing the consistency of each label within the context of the entire dataset, it identifies labels that are likely incorrect.
  • Noise Reduction: The tool helps reduce noise in datasets by flagging data points with high uncertainty. This allows users to focus on the most reliable data, leading to more accurate model training.
  • Outlier Detection: Cleanlab identifies outliers that may distort model training. By highlighting these anomalies, it enables users to either remove or treat them appropriately.
  • Confidence Scores: For each data point, Cleanlab provides a label quality score between 0 and 1 (the label_score column used later in this guide). Lower scores indicate labels that are more likely to be wrong, which helps prioritize the data points that need attention and streamlines the data cleaning process.
  • Integration with Existing Workflows: Cleanlab is designed to integrate seamlessly with popular machine learning frameworks and tools, making it easy to incorporate into existing data preprocessing and model training pipelines. It can handle various data types, including images, text, tabular data, and more.

How Cleanlab Detects Issues

Cleanlab uses a combination of statistical techniques and machine learning models to detect issues in datasets:

  • Consistency Analysis: Cleanlab evaluates the consistency of each label by comparing it with similar data points. If a label significantly deviates from the expected pattern, it is flagged as potentially incorrect.
  • Confidence-Based Filtering: By assigning confidence scores to each label, Cleanlab helps identify data points that are more likely to be correct, allowing users to filter out unreliable data.
  • Cross-Validation: Cleanlab relies on out-of-sample predictions, which are typically obtained through cross-validation. The dataset is split into multiple folds and each example is scored by a model that did not see it during training, so flagged issues reflect genuine disagreement between the label and the data rather than a model that has memorized a wrong label.
  • Probabilistic Modeling: Cleanlab leverages probabilistic models to estimate the likelihood of label errors, helping to identify both obvious and subtle inconsistencies in datasets.

By employing these techniques, Cleanlab provides a comprehensive solution for improving data quality. It not only identifies existing issues but also offers actionable insights on how to correct them, leading to cleaner and more reliable datasets.
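
To make the confidence-based and probabilistic ideas above more concrete, here is a minimal NumPy sketch of how predicted probabilities can be used to flag suspicious labels. It is a simplified illustration in the spirit of confident learning, not Cleanlab's actual implementation. The function name `flag_likely_label_errors` and the toy numbers are assumptions made for this example, and the sketch expects integer labels 0..K-1 with at least one example per class.

```python
import numpy as np

def flag_likely_label_errors(pred_probs, labels):
    """Flag examples that look confidently like a class other than their given label.

    Simplified sketch: the threshold for class c is the average predicted
    probability of c over the examples that are labelled c.
    """
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels)
    num_classes = pred_probs.shape[1]

    # Per-class threshold: mean self-confidence of examples carrying that label
    thresholds = np.array(
        [pred_probs[labels == c, c].mean() for c in range(num_classes)]
    )

    # Which classes does each example reach the threshold for?
    above = pred_probs >= thresholds
    # Among those classes, take the most probable one (-1 if none qualifies)
    masked = np.where(above, pred_probs, -np.inf)
    confident_class = np.where(above.any(axis=1), masked.argmax(axis=1), -1)

    # A likely error: confidently some class, and that class is not the given label
    return (confident_class != -1) & (confident_class != labels)

# Toy data (hypothetical): the last example is labelled 0 but looks like class 1
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1, 0]
flags = flag_likely_label_errors(toy_probs, toy_labels)
print(flags)
```

Using a per-class threshold rather than a single global cutoff keeps classes that the model predicts less confidently from being flagged wholesale.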

Enhancing Text Data Quality with Cleanlab

In this section, we will walk through the practical steps of enhancing text data quality using Cleanlab. We will cover installation, setup, and a step-by-step guide to using Cleanlab’s features effectively. The following code blocks and explanations will help you understand how to leverage Cleanlab to detect and correct issues in your text datasets.

Installation and Setup

First, let’s install Cleanlab and upgrade other necessary libraries. Ensure you have them installed in your environment before proceeding. Here is the code snippet.

!pip install -U scikit-learn sentence-transformers datasets
!pip install -U "cleanlab[datalab]"

The above commands install or upgrade the essential Python libraries. scikit-learn provides tools for data mining and analysis, sentence-transformers offers state-of-the-art sentence embeddings for turning text into numerical vectors, and datasets by Hugging Face gives access to a variety of datasets for NLP and other tasks. The cleanlab[datalab] install brings in Cleanlab together with the extra dependencies needed by Datalab, the interface used below to detect label issues and other data issues.

Loading the Data

For this demonstration, we will use a 1,000-example subset of the Banking77 dataset (loaded from Hugging Face as PolyAI/banking77). It contains customer service requests classified by intent, and the rows selected here cover 11 intent categories. Here is the code snippet.

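from datasets import load_dataset
import pandas as pd
# Load the training split, then keep rows 1000 to 1999 as a DataFrame
data = load_dataset("PolyAI/banking77", split="train")
data_util = pd.DataFrame(data[1000:2000])
data_util.head()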

The code first loads the training split of the “banking77” dataset from the Hugging Face Datasets library. It then takes rows 1000 to 1999 and converts them into a Pandas DataFrame for easier manipulation and analysis. This gives a manageable portion of the dataset for the subsequent data quality checks and cleaning.

Let’s understand the dataset’s structure and contents. Here is the code snippet. 

# Extract the texts and labels as arrays and count the distinct classes
raw_texts, labels = data_util["text"].values, data_util["label"].values
num_classes = len(set(labels))

The dataset has two columns: one holds the text and the other holds the label. This subset contains 1,000 records across 11 categories, with label IDs 33, 1, 4, 36, 41, 12, 14, 47, 49, 23, and 56.
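
A quick way to check figures like these is to count the rows per label. The sketch below uses a tiny hypothetical DataFrame with the same `text` and `label` columns. The helper name `summarize_labels` and the toy rows are assumptions for illustration; on the real data you would pass `data_util` instead.

```python
import pandas as pd

def summarize_labels(df: pd.DataFrame, label_col: str = "label") -> pd.Series:
    """Return the number of rows per label, most frequent first."""
    return df[label_col].value_counts()

# Hypothetical stand-in for data_util (same column names, made-up rows)
toy = pd.DataFrame({
    "text": ["card not arrived", "where is my card", "what is the exchange rate", "top up failed"],
    "label": [11, 11, 32, 59],
})

counts = summarize_labels(toy)
print(len(toy), "records,", counts.size, "distinct labels")
print(counts)
```

A heavily skewed count is worth noting at this stage, since classes with very few examples give the classifier used later little to learn from.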

Converting the Text to Embeddings

Next, we convert the text data into numeric vectors that can be fed to ML models as inputs. We will use representations from a pre-trained Transformer model as embeddings of our text. Here is the code snippet.

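from sentence_transformers import SentenceTransformer
# Encode each request into a fixed-length embedding vector
transformer = SentenceTransformer('google/electra-small-discriminator')
text_embeddings = transformer.encode(raw_texts)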

In the above code block, we initialize a SentenceTransformer model from Google’s pre-trained ‘electra-small-discriminator’ checkpoint. We then use this model to encode the raw text data into numerical vector embeddings. These embeddings capture the semantic meaning of the texts, making them suitable for machine-learning tasks.

Calculating Predicted Probabilities

Typically, when using pre-trained networks for a specific classification task, one approach is to add a linear output layer and fine-tune the entire network on the new data. However, this method is computationally intensive and often requires substantial GPU resources. 

An alternative approach is to freeze the pre-trained network’s weights and only train the output layer, which can be done without heavy reliance on GPUs. In this scenario, we simplify the process by fitting a linear model on top of the embeddings extracted from the pre-trained network.

To detect label issues, Cleanlab needs probabilistic predictions for each data point from your model. However, predictions for data points the model was trained on tend to be overfit and unreliable. Cleanlab is designed to work with out-of-sample predicted class probabilities, meaning it should be used on data points that were held out from the model during training to ensure reliable identification of label issues.

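from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from cleanlab import Datalab
# Out-of-sample class probabilities from cross-validation (5 folds by default)
model = LogisticRegression(max_iter=400)
pred_probs = cross_val_predict(model, text_embeddings, labels, method="predict_proba")
data_dict = {"texts": raw_texts, "labels": labels}
lab = Datalab(data_dict, label_name="labels")
lab.find_issues(pred_probs=pred_probs, features=text_embeddings)
lab.report()  # print a summary of the issues found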
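
Before relying on the flagged issues, it can help to sanity-check the predicted probabilities. The sketch below is one possible check written with NumPy: it verifies the array shape, confirms that each row sums to 1, and reports how often the most probable class matches the given label. The helper name `check_pred_probs` and the toy values are assumptions for illustration. It relies on scikit-learn ordering the probability columns by sorted class label, which matters here because the label IDs in this subset are not contiguous.

```python
import numpy as np

def check_pred_probs(pred_probs, labels):
    """Run basic checks on predicted probabilities and return the agreement rate."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)  # sorted, matching scikit-learn's column order

    assert pred_probs.shape == (len(labels), len(classes)), \
        "expected one row per example and one column per class"
    assert np.allclose(pred_probs.sum(axis=1), 1.0), "each row should sum to 1"

    # Map each column index back to the original label ID
    predicted = classes[pred_probs.argmax(axis=1)]
    return float((predicted == labels).mean())

# Toy values (hypothetical): two classes with non-contiguous IDs 4 and 33
toy_labels = [4, 4, 33, 33]
toy_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
agreement = check_pred_probs(toy_probs, toy_labels)
print(agreement)
```

A very low agreement rate suggests the classifier itself is weak, in which case many flagged labels may reflect model error rather than labelling error.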

Analysis of Issues

The report indicates that Cleanlab identified many label and outlier issues in the dataset. Let’s dive deeper into the results and look at the examples flagged as likely mislabeled. Here is the code snippet.

label_issues = lab.get_issues("label")
label_issues[:3]

The above snippet returns a DataFrame with columns such as is_label_issue, label_score, given_label, and predicted_label. The predicted_label comes from the linear model’s predicted probabilities. The most likely errors are the rows where is_label_issue is True, ranked from the lowest label_score upward.
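
The ranking by label_score and is_label_issue can be sketched with pandas as follows. The helper name `top_label_issues` and the toy table are assumptions for illustration; on the real data you would pass the `label_issues` DataFrame returned by `lab.get_issues("label")`.

```python
import pandas as pd

def top_label_issues(label_issues: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Return the k flagged rows with the lowest label_score (most likely errors)."""
    flagged = label_issues[label_issues["is_label_issue"]]
    return flagged.sort_values("label_score").head(k)

# Toy table (hypothetical values) with the same columns as the label issues DataFrame
toy_issues = pd.DataFrame({
    "is_label_issue": [False, True, True, False],
    "label_score": [0.95, 0.02, 0.10, 0.60],
    "given_label": [4, 33, 4, 33],
    "predicted_label": [4, 4, 33, 33],
})

worst = top_label_issues(toy_issues, k=5)
print(worst)
```

Because the index of the result is the row position in the original data, it can be used to look up the corresponding texts in `raw_texts` for manual review.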

Similarly, let’s look at the examples with outlier issues, which Datalab reports under the “outlier” issue type.
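
One way to inspect the outliers is sketched below. It assumes the outlier results come from `lab.get_issues("outlier")`, which returns the columns is_outlier_issue and outlier_score, with lower scores meaning more atypical examples. The helper name `top_outliers` and the toy rows are assumptions for illustration.

```python
import pandas as pd

def top_outliers(outlier_issues: pd.DataFrame, texts, k: int = 5) -> pd.DataFrame:
    """Attach the text to flagged rows and return the k lowest-scoring ones."""
    flagged = outlier_issues[outlier_issues["is_outlier_issue"]]
    flagged = flagged.sort_values("outlier_score").head(k)
    # The index holds row positions, so it can be used to look up the texts
    return flagged.assign(text=[texts[i] for i in flagged.index])

# On the real data: top_outliers(lab.get_issues("outlier"), raw_texts)
# Toy stand-ins (hypothetical values)
toy_texts = ["how do I top up", "my card is lost", "asdf qwerty 12345"]
toy_outliers = pd.DataFrame({
    "is_outlier_issue": [False, False, True],
    "outlier_score": [0.81, 0.77, 0.03],
})

result = top_outliers(toy_outliers, toy_texts)
print(result)
```

Reading the flagged texts directly is usually the fastest way to decide whether an outlier is noise to remove or a rare but legitimate request to keep.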

Similarly, we could find near-duplicates, data drift, and other anomalies in our labelled text data using Cleanlab.
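
To see which other checks found something, one option is to filter the per-check summary. The sketch below assumes the summary comes from `lab.get_issue_summary()`, which lists each issue type with its score and number of flagged examples. The helper name `issues_found` and the toy numbers are assumptions for illustration. The `non_iid` check is taken here as the closest match to the data drift mentioned above, which is also an assumption.

```python
import pandas as pd

def issues_found(summary: pd.DataFrame) -> pd.DataFrame:
    """Keep only issue types with at least one flagged example, largest first."""
    found = summary[summary["num_issues"] > 0]
    return found.sort_values("num_issues", ascending=False).reset_index(drop=True)

# On the real data: issues_found(lab.get_issue_summary())
# Toy summary (hypothetical numbers, not results from the Banking77 subset)
toy_summary = pd.DataFrame({
    "issue_type": ["label", "outlier", "near_duplicate", "non_iid"],
    "score": [0.96, 0.99, 0.99, 0.85],
    "num_issues": [40, 12, 6, 0],
})

found = issues_found(toy_summary)
print(found)
```

Each issue type listed can then be examined with `lab.get_issues(...)`, for example `lab.get_issues("near_duplicate")`.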

Conclusion

Ensuring high-quality data is a critical step in the ML pipeline, especially for NLP tasks where the nuances of text data can significantly impact model performance. Incorporating Cleanlab into the data quality pipeline helps ensure that ML/DL models are trained on cleaner data, leading to more accurate predictions and more reliable outcomes.


Sourabh Mehta
