Code Search with Vector Embeddings using Qdrant Vector Database

Practical insights to enhance search accuracy and developer productivity in large codebases.

In the rapidly evolving world of software development, efficiently locating relevant code snippets within vast codebases is a significant challenge. Traditional keyword-based search methods often fall short of understanding the context and semantics of code, leading to suboptimal search results. In this article, we will dive deep into the mechanics of code search using vector embeddings and explore how the Qdrant Vector Database facilitates efficient and effective search solutions. 

Table of contents

  1. Why Code Search with Vector Embeddings?
  2. Understanding Vector Embeddings
  3. The architecture of Qdrant Vector Database
  4. Implementation of the Code

Let’s start by looking at the problems with traditional search in a codebase.

Why Code Search with Vector Embeddings?

In the realm of software development, the ability to quickly and accurately search through vast codebases is crucial for productivity and efficiency. Traditional code search methods typically rely on keyword-based searches, which have several limitations. These methods often fall short of understanding the context and semantics of the code, leading to irrelevant or incomplete search results. This is where vector embeddings come into play, offering a revolutionary approach to code search.

Limitations of Traditional Code Search

  • Lack of Contextual Understanding: Traditional keyword-based searches do not capture the meaning or context of the code snippets. For instance, searching for a function name might return numerous unrelated results if the same name is used in different contexts. This can lead to confusion and wasted time as developers sift through irrelevant results.
  • Inefficiency with Large Codebases: As codebases grow in size and complexity, traditional search methods become increasingly inefficient. Keyword searches may return too many results, making it difficult to find the exact piece of code needed. This inefficiency can significantly slow down the development process.
  • Inability to Handle Code Variability: Code can often be written in various ways to achieve the same functionality. Traditional searches struggle to recognize these variations, resulting in missed relevant snippets that use different terminology or structures to achieve the same goal.
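The variability problem is easy to reproduce. The following is a minimal sketch, assuming a hypothetical three-file corpus and a naive keyword matcher (neither comes from a real project): a search for "average" finds the function that happens to use that word, returns an unrelated title helper, and misses an equivalent function named mean.

```python
# Hypothetical mini-corpus: two snippets compute the same thing under different names.
snippets = {
    "utils.py": "def calculate_average(values): return sum(values) / len(values)",
    "stats.py": "def mean(xs): return sum(xs) / len(xs)",
    "report.py": "def average_report_title(): return 'Average report'",
}


def keyword_search(query: str, corpus: dict) -> list:
    """Return the files whose text contains every word of the query."""
    words = query.lower().split()
    return [
        name
        for name, code in corpus.items()
        if all(word in code.lower() for word in words)
    ]


matches = keyword_search("average", snippets)
print(matches)  # stats.py is missing although mean() does the same job
```

The matcher is deliberately simplistic, but real keyword tools share the same blind spot: they compare characters, not behavior.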

Understanding Vector Embeddings

Vector embeddings are numerical representations of data that encode semantic information. Unlike traditional representations, which might rely on keywords or tokens, vector embeddings place data points in a high-dimensional space where similar items are closer together. This proximity in the vector space reflects the semantic similarity between the items.

For instance, in the context of code, two functions that perform similar tasks but use different variable names or structures will have similar vector embeddings. This allows a search system to retrieve relevant results based on meaning rather than just matching keywords.
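To make "closer together" concrete, here is a small sketch that ranks snippets by cosine similarity. The three-dimensional vectors are hand-made, hypothetical values chosen for illustration; a real embedding model would produce hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: the two averaging functions point in a similar
# direction, the networking function points elsewhere.
embeddings = {
    "calculate_average": [0.9, 0.1, 0.0],
    "mean": [0.8, 0.2, 0.1],
    "open_socket": [0.0, 0.1, 0.9],
}

query = embeddings["calculate_average"]
ranked = sorted(
    (name for name in embeddings if name != "calculate_average"),
    key=lambda name: cosine_similarity(query, embeddings[name]),
    reverse=True,
)
print(ranked)  # most similar first
```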

How are Vector Embeddings Generated?

Generating vector embeddings involves using machine learning models, particularly those based on deep learning. These models are trained on large datasets to learn the relationships and context of the data. Here’s a general overview of the process:

  1. Training Data Preparation: A large corpus of data, such as code snippets or text documents, is collected and prepared for training.
  2. Model Training: Deep learning models, such as transformers, are trained on this data. These models learn to understand the context and relationships within the data, allowing them to generate meaningful embeddings.
  3. Embedding Generation: Once trained, the model can convert new data into embeddings. For example, a code snippet is passed through the model, which outputs a vector representation that captures its semantic meaning.

Applied to code search, vector embeddings offer several benefits:

  • Capturing Context and Meaning: Vector embeddings can capture the context and semantic meaning of code snippets, which traditional keyword-based methods often miss. This is particularly useful for code search, where understanding the functionality and purpose of code is crucial.
  • Handling Synonyms and Variations: Different developers might write the same functionality using different variable names or structures. Vector embeddings recognize these variations, ensuring that relevant code snippets are retrieved regardless of the exact wording.
  • Enhanced Search Accuracy: By representing code snippets as vectors in a high-dimensional space, searches can be conducted based on semantic similarity. This leads to more accurate and relevant search results compared to traditional methods.
  • Scalability: Vector embeddings allow for efficient indexing and searching even in large codebases. High-dimensional vector spaces enable quick similarity searches, making it feasible to handle extensive and complex projects.
  • Flexibility Across Data Types: While this article focuses on code, vector embeddings are versatile and can be applied to various data types, including text, images, and more. This makes them a powerful tool for a wide range of search and retrieval applications.

The architecture of Qdrant Vector Database

Qdrant is an open-source vector database specifically designed for managing and querying high-dimensional vector data. It is particularly effective for applications in natural language processing, computer vision, and recommendation systems, where large-scale vector embeddings are common.

Key Features of Qdrant

  • Scalability: Qdrant supports seamless horizontal scaling, allowing users to increase storage and processing capabilities as data volumes grow.
  • Efficient Vector Search: It excels in vector similarity search, enabling quick and accurate retrieval of similar vectors within large datasets.
  • Flexible Query Language: Qdrant offers a flexible query language that allows users to express complex search criteria, accommodating diverse application needs.
  • Real-time Updates: The database can handle real-time updates, ensuring that it remains synchronized with changing data sources.
  • Distance Metrics: Qdrant supports various distance metrics, including cosine similarity, dot product, and Euclidean distance, to measure the similarity between vectors effectively.
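The three distance metrics listed above can be written out in a few lines of plain Python. This is only a sketch of the formulas, not Qdrant's implementation, and the two example vectors are arbitrary.

```python
import math


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot_product(v, v))
    return [x / length for x in v]


a, b = [3.0, 4.0], [4.0, 3.0]
print("dot:", dot_product(a, b))
print("euclidean:", euclidean_distance(a, b))
print("cosine:", cosine_similarity(a, b))

# For unit-length vectors the dot product and cosine similarity coincide.
print("dot of normalized:", dot_product(normalize(a), normalize(b)))
```

Note that dot product and cosine similarity are "higher is better" scores, while Euclidean distance is "lower is better".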

Architecture of Qdrant

Qdrant’s architecture is built around several core components:

Collections and Points

  • Collections: These are named sets of data points, where each point is a vector associated with metadata (payload). All vectors in a collection share the same dimensionality and can be compared using a selected distance metric.
  • Points: Each point consists of an identifier (ID), a vector, and an optional payload that provides additional context in JSON format.
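As a mental model, collections and points can be pictured as two small data classes. The sketch below is a toy illustration only: the class names mirror the concepts, but these are not the classes shipped in qdrant-client, and the collection name and payload values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Point:
    id: int
    vector: List[float]
    payload: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Collection:
    name: str
    vector_size: int
    distance: str = "cosine"
    points: Dict[int, Point] = field(default_factory=dict)

    def upsert(self, point: Point) -> None:
        """Insert or overwrite a point, enforcing the shared dimensionality."""
        if len(point.vector) != self.vector_size:
            raise ValueError(
                f"expected {self.vector_size} dimensions, got {len(point.vector)}"
            )
        self.points[point.id] = point


collection = Collection(name="code-snippets", vector_size=3)
collection.upsert(Point(id=1, vector=[0.9, 0.1, 0.0], payload={"name": "mean"}))
print(len(collection.points))
```

The dimensionality check is the important part: because every vector in a collection has the same length, any two of them can be compared with the collection's distance metric.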

Storage Mechanisms

Qdrant offers two storage options for vectors:

  • In-memory Storage: This option keeps all vectors in RAM for the fastest data access.
  • Memmap Storage: This method links a virtual address space with a file on disk, balancing memory usage and access speed.
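The general idea behind memory-mapped storage can be shown with Python's standard mmap module. This sketch does not reproduce Qdrant's on-disk format; the file layout (consecutive little-endian float32 values), the dimensionality, and the file name are all assumptions made for illustration.

```python
import mmap
import os
import struct
import tempfile

DIM = 4                 # hypothetical vector dimensionality
VECTOR_BYTES = DIM * 4  # each component is stored as a 4-byte float32

vectors = [[float(i + j) for j in range(DIM)] for i in range(1000)]

# Write all vectors to a flat binary file, one after another.
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
with open(path, "wb") as fp:
    for vec in vectors:
        fp.write(struct.pack(f"<{DIM}f", *vec))


def read_vector(mm, index):
    """Decode one vector straight from the mapped file by its byte offset."""
    return list(struct.unpack_from(f"<{DIM}f", mm, index * VECTOR_BYTES))


# Map the file into the address space; the OS loads pages only when touched.
with open(path, "rb") as fp:
    with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        vec_42 = read_vector(mm, 42)

print(vec_42)
```

Reading vector 42 touches only the part of the file that holds it, which is why this approach trades a little access speed for a much smaller resident memory footprint.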

Indexing

Qdrant employs advanced indexing structures, such as Hierarchical Navigable Small World (HNSW), to organize vectors hierarchically. This facilitates efficient nearest-neighbor searches and enhances query performance.
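The core move in a navigable graph index is a greedy walk: start at an entry point and keep stepping to whichever neighbor is closer to the query. The sketch below shows that walk on a single layer only. It is a simplification for intuition, not HNSW itself (which stacks several layers and tracks a list of candidates) and not Qdrant's code; the grid data and the brute-force graph construction are assumptions for the example.

```python
import math


def build_knn_graph(points, k):
    """Link every point to its k nearest neighbours (brute force, sketch only)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        graph[i] = others[:k]
    return graph


def greedy_search(points, graph, query, entry):
    """Walk the graph, always moving to the neighbour closest to the query."""
    current = entry
    while True:
        best = min(graph[current], key=lambda j: math.dist(points[j], query))
        if math.dist(points[best], query) >= math.dist(points[current], query):
            return current  # no neighbour improves on the current point
        current = best


# Hypothetical 2-D data: a 5 x 5 grid of points.
points = [(float(x), float(y)) for x in range(5) for y in range(5)]
graph = build_knn_graph(points, k=4)

query = (3.2, 0.9)
nearest = greedy_search(points, graph, query, entry=0)
print(points[nearest])
```

On this grid the walk reaches the true nearest point after visiting only a handful of nodes. On real data a greedy walk can stop at a local optimum, which is one reason practical indexes add layers and candidate lists.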

API and Client Libraries

Qdrant provides a user-friendly API and supports various programming languages, including Python, Go, Rust, and TypeScript, allowing developers to interact with the database in their preferred language.


Implementation of the Code

In this section, we’ll walk through the implementation of code search with vector embeddings using Qdrant Vector Database. We’ll explain the steps in the provided Jupyter notebook to help you understand how to set up and use this powerful search system.

Setting Up Your Environment

First, you need to install the necessary packages. This includes inflection for text manipulation, qdrant-client for interacting with the Qdrant database, and the fastembed package for generating vector embeddings. Here is the code snippet.

!pip install inflection qdrant-client https://github.com/qdrant/fastembed/archive/main.zip

Downloading and Preparing Data

Next, we download a sample dataset of code snippets in JSONL format and load it into a Python list for further processing. Here is the code snippet.

!wget https://storage.googleapis.com/tutorial-attachments/code-search/structures.jsonl

import json

# Parse the JSONL file: one JSON object per line
structures = []
with open("structures.jsonl", "r") as fp:
    for row in fp:
        entry = json.loads(row)
        structures.append(entry)

The Python code above uses the json library to read the JSONL file line by line, converting each line from a JSON string to a Python dictionary and appending it to the structures list. This prepares the raw data for further manipulation and analysis.
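To see what a parsed entry looks like without downloading the dataset, here is a small sketch that parses JSONL from an in-memory string. The field names follow the ones this article's code accesses (name, code_type, docstring, context); the values themselves are hypothetical and not taken from the real file.

```python
import io
import json

# Two hypothetical records in JSONL form: one JSON object per line.
sample_jsonl = io.StringIO(
    '{"name": "calculate_avg", "code_type": "Function", "docstring": null, '
    '"context": {"module": "common", "file_name": "stats.rs", "struct_name": null}}\n'
    '{"name": "Counter", "code_type": "Struct", "docstring": "Counts events", '
    '"context": {"module": "common", "file_name": "counter.rs", "struct_name": null}}\n'
)

# Skip blank lines so a trailing newline does not break json.loads.
records = [json.loads(line) for line in sample_jsonl if line.strip()]

print(len(records))
print(records[0]["context"]["file_name"])
```

JSON null becomes Python None, which is why checks such as `if chunk["docstring"]:` work for entries without a docstring.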

Transforming Code Snippets into Text Representations

To generate meaningful embeddings, we first need to convert the code snippets into human-readable text representations. This involves transforming variable names and function signatures from camel case or snake case into readable text and combining relevant information about the code. Here is the code snippet.

import re
from typing import Any, Dict

import inflection


def textify(chunk: Dict[str, Any]) -> str:
    # Get rid of all the camel case / snake case
    # - inflection.underscore changes the camel case to snake case
    # - inflection.humanize converts the snake case to human readable form
    name = inflection.humanize(inflection.underscore(chunk["name"]))
    signature = inflection.humanize(inflection.underscore(chunk["signature"]))

    # Check if docstring is provided
    docstring = ""
    if chunk["docstring"]:
        docstring = f"that does {chunk['docstring']} "

    # Extract the location of that snippet of code
    context = (
        f"module {chunk['context']['module']} " f"file {chunk['context']['file_name']}"
    )
    if chunk["context"]["struct_name"]:
        struct_name = inflection.humanize(
            inflection.underscore(chunk["context"]["struct_name"])
        )
        context = f"defined in struct {struct_name} {context}"

    # Combine all the bits and pieces together
    text_representation = (
        f"{chunk['code_type']} {name} "
        f"{docstring}"
        f"defined as {signature} "
        f"{context}"
    )

    # Remove any special characters and concatenate the tokens
    tokens = re.split(r"\W", text_representation)
    tokens = filter(lambda x: x, tokens)
    return " ".join(tokens)

The textify function takes a dictionary representing a code snippet and converts various elements like function names and signatures into readable text. It uses the inflection library to transform variable names from camel case or snake case to human-readable formats and combines this information into a single string. The re library is used to split the text into tokens, which are then concatenated to form the final text representation.
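To show what the name conversion and the final tokenization do without installing anything, here is a standard-library approximation. The underscore and humanize functions below are simplified stand-ins written for this sketch; they imitate the behavior of the inflection functions of the same name for simple identifiers but are not the library's code.

```python
import re


def underscore(word: str) -> str:
    """Approximate inflection.underscore: 'CamelCase' -> 'camel_case'."""
    word = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", word)
    word = re.sub(r"([a-z\d])([A-Z])", r"\1_\2", word)
    return word.replace("-", "_").lower()


def humanize(word: str) -> str:
    """Approximate inflection.humanize: 'camel_case' -> 'Camel case'."""
    word = word.replace("_", " ").strip()
    return word[:1].upper() + word[1:]


def tokens_only(text: str) -> str:
    """Drop non-word characters and join what is left, like textify's last step."""
    return " ".join(token for token in re.split(r"\W", text) if token)


readable = humanize(underscore("calculateAvg"))
cleaned = tokens_only("fn calculate_avg(&self) -> f32")
print(readable)
print(cleaned)
```

The point of both steps is the same: the embedding model was trained on natural language, so identifiers and punctuation are turned into plain words before embedding.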

Generating Vector Embeddings

Using a pre-trained model from fastembed, we generate vector embeddings for the text representations of our code snippets. 

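from fastembed import TextEmbedding

# Build the text representation of every structure with textify
text_representations = list(map(textify, structures))
batch_size = 5

nlp_model = TextEmbedding("sentence-transformers/all-MiniLM-L6-v2", threads=0)
nlp_embeddings = nlp_model.embed(text_representations, batch_size=batch_size)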

The TextEmbedding class from the fastembed library is initialized with a pre-trained model (sentence-transformers/all-MiniLM-L6-v2). The embed method generates embeddings for the text representations, batch processing them to improve efficiency. These embeddings represent the semantic meaning of the code snippets in a high-dimensional vector space.
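One practical detail is that embeddings produced in batches by a generator are computed lazily, only as they are consumed. The following sketch imitates that behavior in plain Python. The fake_embed_batch function is a hypothetical stand-in for a real model (it just counts characters and spaces), and the embed function here is written for this example; it is not fastembed's implementation.

```python
from typing import Iterable, Iterator, List

calls = []  # records the size of each batch the stand-in "model" processes


def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for a model: one 2-number vector per text."""
    calls.append(len(batch))
    return [[float(len(text)), float(text.count(" "))] for text in batch]


def embed(texts: Iterable[str], batch_size: int) -> Iterator[List[float]]:
    """Yield one embedding per text, computing them one batch at a time."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == batch_size:
            yield from fake_embed_batch(batch)
            batch = []
    if batch:  # final, possibly smaller batch
        yield from fake_embed_batch(batch)


texts = [f"function number {i}" for i in range(12)]
embeddings = embed(texts, batch_size=5)
calls_before = list(calls)       # nothing has been computed yet

first = next(embeddings)
calls_after_first = list(calls)  # exactly one batch has been processed

rest = list(embeddings)          # consuming the generator runs the other batches
print(calls_before, calls_after_first, calls)
```

A consequence worth remembering is that such a generator can be iterated only once; wrap it in list() if the embeddings are needed more than once.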

Indexing Embeddings in Qdrant

We initialize the Qdrant client and upload the generated embeddings along with their corresponding metadata to the Qdrant database. Here is the code snippet.

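from qdrant_client import models
from tqdm import tqdm

# Assumed to exist from earlier notebook cells (not shown in this article):
# - client: a QdrantClient backed by an in-memory database
# - COLLECTION_NAME: a collection with named vectors "text" and "code"
# - code_embeddings: a second set of embeddings, one per structure
points = []
total = len(structures)

for idx, (text_embedding, code_embedding, structure) in tqdm(
    enumerate(zip(nlp_embeddings, code_embeddings, structures)), total=total
):
    # FastEmbed returns generators. Embeddings are computed as consumed.
    points.append(
        models.PointStruct(
            id=idx,
            vector={
                "text": text_embedding,
                "code": code_embedding,
            },
            payload=structure,
        )
    )

    # Upload points in batches
    if len(points) >= batch_size:
        client.upload_points(COLLECTION_NAME, points=points, wait=True)
        points = []

# Ensure any remaining points are uploaded
if points:
    client.upload_points(COLLECTION_NAME, points=points, wait=True)

print(f"Total points in collection: {client.count(COLLECTION_NAME).count}")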

The QdrantClient is initialized with an in-memory database for quick setup and testing. The loop builds a PointStruct for each snippet, holding its two named vectors ("text" and "code") and the original structure as payload (metadata), and the upload_points method sends these points to the specified collection in batches. The collection's distance parameter specifies cosine similarity for measuring the distance between vectors, which is suitable for semantic search.

Finally, we perform a code search by querying the Qdrant database with a vector representation of the search term. We process the results to display relevant code snippets.

import pandas as pd

query = "function to calculate factorial"

hits = client.query_points(
    COLLECTION_NAME,
    query=next(nlp_model.query_embed(query)).tolist(),
    using="text",
    limit=3,
).points

rows = []
for point in hits:
    row = {
        'module': point.payload['context'].get('module'),
        'file_name': point.payload['context'].get('file_name'),
        'score': point.score,
        'signature': point.payload.get('signature')
    }
    rows.append(row)

df = pd.DataFrame(rows)
df

The query_points method is used to search the Qdrant database for vectors similar to the query vector, which is generated by embedding the search term. The method returns the top matching vectors along with their metadata. The results are then processed into a DataFrame for easy viewing, displaying relevant information such as module, file name, relevance score, and function signature.

The function fn calculate_avg (& self) -> f32 in the common/operation_time_statistics.rs file has the highest relevance score (0.296968). It calculates an average rather than a factorial, so it is a loosely related mathematical function, not an exact match.

The second result, fn scaled_fast_sigmoid (x : ScoreType) -> ScoreType, in the src/math.rs file, is a sigmoid function. It is less relevant to the query, although it is also a mathematical routine.

The third result, fn deserialize_factor < 'de , D > (deserializer : D) -> Result < usize , D :: Error > where D : serde :: Deserializer < 'de >, in the operations/consistency_params.rs file, deals with deserialization and is the least relevant to calculating a factorial.

Conclusion

Implementing code search with vector embeddings using Qdrant Vector Database represents a significant advancement in the way developers interact with and retrieve code. By transforming code snippets into semantic vector representations, this approach overcomes the limitations of traditional keyword-based searches, providing more accurate and context-aware results. 

References

  1. Link to the above code
  2. Qdrant Documentation
Sourabh Mehta
