Database Search: Text2SQL using dynamic few-shot prompting with self-consistency using LLM

Author(s): Utkarsh Tripathi, Jeff Shelman

Text2SQL, the task of converting natural language into Structured Query Language (SQL), is a disruptive application of Large Language Models (LLMs) with the potential to radically transform how humans interact with data. This paper proposes a new approach to SQL generation that significantly improves the contextual understanding of the LLM. It uses two-layer dynamic few-shot prompting with self-consistency to sharpen model attention. Redundant information in the few-shot examples is masked, and each example is categorized into a respective use case (domain). First, we store the list of masked few-shot examples, along with their metadata and vector embeddings, in a database; the domain of each example is recorded in its metadata. Masking steers the LLM's attention toward the actual context of the question rather than the irrelevant information present in it. Second, the similarity threshold is also fetched dynamically from the database, based on the solution in which the database search is being used.
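The masking and storage step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the masking rules, the example schema, and the hash-based stand-in for a real embedding model are all assumptions.

```python
import re
from dataclasses import dataclass, field

@dataclass
class FewShotExample:
    question: str                 # natural-language question, literals masked
    sql: str                      # reference SQL, masked the same way
    domain: str                   # use-case category kept as metadata
    embedding: list = field(default_factory=list)

def mask_literals(text: str) -> str:
    """Replace quoted strings and numbers with placeholders so the model
    attends to question structure rather than incidental values."""
    text = re.sub(r"'[^']*'", "'<VAL>'", text)
    text = re.sub(r"\b\d+(\.\d+)?\b", "<NUM>", text)
    return text

def toy_embedding(text: str, dim: int = 32) -> list:
    """Hypothetical stand-in for a sentence-embedding model."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

store = []  # stand-in for the vector database

def add_example(question: str, sql: str, domain: str) -> None:
    masked_q = mask_literals(question)
    store.append(FewShotExample(masked_q, mask_literals(sql),
                                domain, toy_embedding(masked_q)))

add_example("How many orders over 100 were placed in 'March'?",
            "SELECT COUNT(*) FROM orders WHERE amount > 100 AND month = 'March'",
            "sales")
```

In a production setting the toy pieces above would be replaced by a real embedding model and a vector store, with the domain and per-solution threshold kept alongside each record as metadata.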

This dynamic selection of similar few-shot examples and of the similarity threshold per solution increases the tool's flexibility and makes it scalable across different sets of tables. When the user asks a question, it is first categorized into one of the use cases present in the few-shot example set. Once the category of the question is determined, the top 5 most similar few-shot examples from that category are selected, keeping only those whose similarity score is above the threshold. These few-shot examples, along with additional instructions, are passed to the LLM to generate the SQL query. This step is performed only if at least 2 similar examples clear the threshold; requiring a minimum number of few-shot examples restrains the LLM from hallucinating and keeps the model output consistent.
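The selection logic above can be sketched as a small function: filter the stored examples by the predicted domain, keep the top 5 whose cosine similarity clears the per-solution threshold, and abstain when fewer than 2 survive. The function and tuple layout are illustrative assumptions, not the paper's code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_examples(query_vec, examples, domain, threshold,
                    top_k=5, min_examples=2):
    """examples: list of (embedding, domain, payload) tuples.
    Returns up to top_k payloads above the threshold, best first,
    or None if fewer than min_examples qualify."""
    scored = [(cosine(query_vec, vec), payload)
              for vec, dom, payload in examples if dom == domain]
    kept = sorted((s, p) for s, p in scored if s >= threshold)[-top_k:]
    if len(kept) < min_examples:
        return None  # abstain rather than let the LLM hallucinate
    return [p for _, p in reversed(kept)]
```

Returning `None` when too few examples qualify is what implements the cutoff described above: the solution can then fall back or refuse instead of prompting the LLM with weak context.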

Finally, as a refinement module, self-consistency is used to select the final answer and SQL query. Adding categorization and masking of irrelevant information in the few-shot examples increases accuracy by ~7%, resulting in a final accuracy of ~94% when tested on 1100 questions across multiple solutions.

Access this Lattice journal:

Association of Data Scientists
