Lightweight Text Extraction with NuExtract – A Deep Dive

NuMind's NuExtract model for zero-shot or fine-tuned structured data extraction.

NuMind released NuExtract, a foundation model designed specifically for structured data extraction from text. The model can be used in a zero-shot setting or fine-tuned for specific extraction problems. The open-source model is available on Hugging Face for experimentation and research. The following article explains NuExtract with a hands-on implementation.

Table of Contents

  1. Understanding NuExtract 
  2. NuExtract Model Variants
  3. Comparison of NuExtract with Popular LLMs 
  4. Hands-on Implementation of NuExtract

Understanding NuExtract

Structured data extraction is the process of automatically identifying specific pieces of information in unstructured text and converting them into a well-organised format such as JSON or XML. NuExtract's core function is exactly this transformation: it identifies and extracts key information from text documents, such as research papers, and outputs it as JSON.

NuExtract goes beyond a simple conversion of text into JSON: it understands the language well enough to pinpoint the details a user defines with a template (names, dates, locations, or any other data points, depending on the input text).

An Example of Structured Extraction 

NuExtract was created by fine-tuning Phi-3, a generic small language model, on synthetic data generated by the Llama 3 large language model, yielding a model specialised in the task of structured extraction. NuExtract can be used for zero-shot inference or fine-tuned for specific tasks.

NuExtract Creation Process

NuExtract Model Variants

NuExtract comes in three variants – NuExtract-tiny, NuExtract and NuExtract-large – obtained by training language models ranging from 0.5B to 7B parameters on an LLM-generated structured extraction dataset.

NuExtract-tiny, NuExtract and NuExtract-large are versions of the Qwen1.5-0.5B, Phi-3-mini and Phi-3-small LLMs respectively, fine-tuned on a private, high-quality synthetic dataset for information extraction.

In the zero-shot comparison, NuExtract-tiny beats GPT-3.5 while being 100 times smaller, NuExtract outperforms Llama3-70B while being 35 times smaller, and NuExtract-large reaches GPT-4o levels while being 100 times smaller, as shown below:

Comparison in Zero-Shot Setting

On a chemical-extraction problem, fine-tuned NuExtract models substantially outperform a fine-tuned GPT-4o while being at least 100 times smaller, as shown in the image below:

Comparison in Fine-tuned Setting

Hands-on Implementation of NuExtract

Step 1: Importing the required libraries – 

  • json provides functions for working with JSON data. 
  • AutoModelForCausalLM retrieves a pre-trained causal language model from the Hugging Face model hub. 
  • AutoTokenizer retrieves the tokenizer associated with the chosen causal language model. 
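The article's original import listing is not reproduced here, but based on the libraries described above it would look like this sketch:

```python
# json handles the structured (JSON) template and output.
import json

# The Auto-classes resolve the right model and tokenizer classes
# from a Hugging Face hub repository id.
from transformers import AutoModelForCausalLM, AutoTokenizer
```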

Step 2: Defining the function predict_NuExtract, which wraps the functionality of NuExtract. The function preprocesses the input, performs tokenization and model inference, and extracts the predicted output from the generated text according to the given user schema.
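Since the listing is not reproduced in the article, here is a sketch following the usage example on the NuExtract Hugging Face model card; the `<|input|>`/`<|output|>` prompt markers and the 4,000-token truncation limit come from that card, while `build_prompt` is a helper name introduced here for clarity:

```python
import json

import torch


def build_prompt(text: str, schema: str) -> str:
    """Assemble NuExtract's prompt: the JSON template first, then the text."""
    pretty_schema = json.dumps(json.loads(schema), indent=4)
    return (
        "<|input|>\n### Template:\n" + pretty_schema + "\n"
        "### Text:\n" + text + "\n<|output|>\n"
    )


def predict_NuExtract(model, tokenizer, text: str, schema: str) -> str:
    """Tokenize the prompt, run generation, and return the part after <|output|>."""
    prompt = build_prompt(text, schema)
    inputs = tokenizer(
        prompt, return_tensors="pt", truncation=True, max_length=4000
    ).to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=512)
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # Everything after the <|output|> marker is the extracted JSON.
    return decoded.split("<|output|>")[1]
```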

Step 3: Loading the NuExtract model and tokenizer from the Hugging Face model hub and setting parameters – model.to("cuda") moves the loaded model to the CUDA device, if available, to speed up computation, and model.eval() puts the model in evaluation mode, since no training is done and only predictions are made.
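A sketch of this step, using the lightweight tiny variant for illustration (swap in "numind/NuExtract" or "numind/NuExtract-large" for the bigger models):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "numind/NuExtract-tiny"  # or "numind/NuExtract" / "numind/NuExtract-large"

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Move the weights to the GPU when one is available to speed up inference.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Evaluation mode: no training is done, only predictions.
model.eval()
```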

Step 4: Passing the input text and schema for structured extraction. The schema dictates the JSON keys used to structure the output.
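An illustrative (hypothetical) input text and schema might look like the following; the empty strings mark the fields NuExtract should fill in. The final call assumes the model, tokenizer and predict_NuExtract function from the previous steps, so it is shown commented out:

```python
import json

# Hypothetical example input and template (schema).
text = (
    "NuMind released NuExtract, a structured-extraction model fine-tuned "
    "from Phi-3 on synthetic data generated by Llama 3."
)

schema = """{
    "Model": "",
    "Base model": "",
    "Synthetic data generator": "",
    "Task": ""
}"""

# The template must be valid JSON; its keys dictate the keys of the output.
template = json.loads(schema)
print(list(template))  # the keys the model will try to fill

# With the model and tokenizer loaded, extraction is a single call:
# prediction = predict_NuExtract(model, tokenizer, text, schema)
# print(prediction)
```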

Output

The output is accurate JSON based on the input text and schema template: NuExtract identified and extracted the information for each key defined in the template.

Final Words

NuExtract is a promising step towards effortless information extraction. Its lightweight design, zero-shot capabilities and open-source release make it a viable solution for structured extraction requirements. However, further research and development are required to explore NuExtract's limitations and refine its accuracy.

References

  1. NuExtract Release Post
  2. NuExtract Hugging Face Model Collection


Sachin Tripathi

Sachin Tripathi is the Manager of AI Research at AIM, with over a decade of experience in AI and Machine Learning. An expert in generative AI and large language models (LLMs), Sachin excels in education, delivering effective training programs. His expertise also includes programming, big data analytics, and cybersecurity. Known for simplifying complex concepts, Sachin is a leading figure in AI education and professional development.
