RAG with Milvus Vector Database and LangChain

Integrating Milvus and LangChain to build RAG systems that return better-grounded responses.

Retrieval Augmented Generation (RAG) combines the strengths of retrieval-based methods with large language models (LLMs). By drawing on external knowledge sources at query time, it helps a model produce responses that are more accurate and more relevant to the context. In this article, we explore how to build a RAG system with the Milvus vector database and the LangChain framework, and how this combination can strengthen AI applications.

Table of contents

  1. Overview of Retrieval Augmented Generation (RAG)
  2. Understanding Milvus Vector Database
  3. The Architecture of Milvus Vector Database
  4. Retrieving information from Milvus Vector DB

Let’s understand the science behind Retrieval Augmented Generation (RAG).

Overview of Retrieval Augmented Generation (RAG)

RAG is a technique that integrates the retrieval of relevant information from external sources with the generation capabilities of advanced language models. Traditional generative models, such as GPT-3, rely solely on their pre-trained knowledge to generate responses. However, they may struggle with specific, up-to-date, or niche information due to the limitations of their training data. RAG addresses this limitation by incorporating a retrieval component that fetches relevant data from external databases, documents, or web pages, enriching the model’s responses with more precise and current information.

How Does RAG Work?

The RAG process typically involves two main components:

  • Retriever: This component searches and retrieves relevant information from an external knowledge base or database. The retriever uses various techniques such as keyword matching, vector similarity search, or neural retrieval models to find the most pertinent data.
  • Generator: The retrieved information is then passed to a generative model, which uses it to create a response. The generator is typically a pre-trained language model such as GPT-3, or any other model that excels at generating coherent and contextually appropriate text.

The synergy between these two components allows RAG to generate responses that are not only fluent and natural but also grounded in factual and up-to-date information.
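
The two-stage flow described above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the word-overlap scoring, the `retrieve` and `build_prompt` helpers, and the sample knowledge base are all hypothetical, and the generator stage is represented only by the prompt that would be handed to a language model.

```python
import re

# Hypothetical in-memory knowledge base (illustrative content only).
KNOWLEDGE_BASE = [
    "Milvus is a vector database for similarity search.",
    "LangChain is a framework for building LLM applications.",
    "RAG combines retrieval with text generation.",
]

def tokenize(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, k=1):
    """Retriever: rank documents by the number of words shared with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, context_docs):
    """Generator input: combine the retrieved context and the question."""
    context = "\n".join(context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

question = "What is Milvus?"
top_docs = retrieve(question, KNOWLEDGE_BASE, k=1)
prompt = build_prompt(question, top_docs)  # this string would be sent to the LLM
```

A real retriever would usually replace the word-overlap score with vector similarity, which is where a vector database such as Milvus comes in.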

Benefits of RAG

  • Enhanced Accuracy: By leveraging external data, RAG produces more accurate and contextually relevant responses.
  • Up-to-date Information: RAG can incorporate the latest information, making it particularly useful for dynamic fields like news, finance, and technology.
  • Scalability: The retrieval component can be scaled independently, allowing for efficient handling of large datasets.
  • Flexibility: RAG can be tailored to specific domains by customizing the retrieval database, making it versatile across different applications.

Understanding Milvus Vector Database

Milvus is a specialized vector database designed for managing and retrieving unstructured data through vector embeddings. It’s particularly suited for handling complex data types like images, audio, videos, and text, enabling semantic similarity searches using techniques such as Approximate Nearest Neighbor (ANN) algorithms.
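
To make the idea of similarity search over embeddings concrete, here is a small sketch of an exact nearest-neighbor search using cosine similarity. The three-dimensional vectors and their labels are made up for illustration. A system like Milvus would use ANN indexes rather than the exhaustive scan shown here, trading a little accuracy for much faster search on large collections.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, k=2):
    """Exact search: score every stored vector and return the ids of the top k."""
    scored = [(cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [vec_id for _, vec_id in scored[:k]]

# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
embeddings = {
    "cat_photo": [0.9, 0.1, 0.0],
    "dog_photo": [0.8, 0.3, 0.1],
    "invoice_pdf": [0.0, 0.2, 0.9],
}

result = nearest_neighbors([1.0, 0.0, 0.0], embeddings, k=2)
```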

Milvus is particularly useful for applications such as recommender systems, chatbots, multimedia content search, and addressing challenges posed by AI and large language models. It offers advantages over traditional databases and vector search libraries in terms of scalability, multi-tenancy, and comprehensive API support, making it well-suited for handling large-scale, dynamic applications involving unstructured data.

Key features and aspects of Milvus 

  • Scalable and Elastic Architecture: Milvus is built with a service-oriented design that decouples storage, coordinators, and workers. This allows for independent scaling of different components to meet varying workloads.
  • Diverse Index Support: It supports over 10 index types, including HNSW, IVF, Product Quantization, and GPU-based indexing. This variety allows developers to optimize searches for specific performance and accuracy requirements.
  • Versatile Search Capabilities: Milvus offers multiple search types, including top-K Approximate Nearest Neighbor (ANN), Range ANN, and search with metadata filtering. It’s also working on hybrid dense and sparse vector search.
  • Tunable Consistency: It provides a delta consistency model, allowing users to balance query performance and data freshness according to their needs.
  • Hardware-Accelerated Compute Support: Milvus is designed to leverage various hardware capabilities, including AVX512 and Neon for SIMD execution, as well as GPU support for efficient processing.
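
The search types listed above (top-K, range search, and metadata filtering) can be illustrated with a brute-force sketch. This is not Milvus's API or its indexing logic: the `search` function, its parameters, and the sample rows are assumptions made for the example, and a real deployment would answer these queries through an ANN index.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def search(query, rows, top_k=None, radius=None, predicate=None):
    """Brute-force search over rows shaped like {"id", "vector", ...metadata}.

    predicate: optional metadata filter, applied before scoring
    radius:    if set, keep only hits within this distance (range search)
    top_k:     if set, keep only the k closest hits
    """
    candidates = [row for row in rows if predicate is None or predicate(row)]
    hits = sorted((l2_distance(query, row["vector"]), row["id"]) for row in candidates)
    if radius is not None:
        hits = [hit for hit in hits if hit[0] <= radius]
    if top_k is not None:
        hits = hits[:top_k]
    return [row_id for _, row_id in hits]

# Hypothetical rows with one metadata field.
rows = [
    {"id": 1, "vector": [0.0, 0.0], "lang": "en"},
    {"id": 2, "vector": [1.0, 0.0], "lang": "fr"},
    {"id": 3, "vector": [0.0, 2.0], "lang": "en"},
    {"id": 4, "vector": [5.0, 5.0], "lang": "en"},
]

closest_two = search([0.0, 0.0], rows, top_k=2)
within_radius = search([0.0, 0.0], rows, radius=2.0)
english_only = search([0.0, 0.0], rows, top_k=2, predicate=lambda r: r["lang"] == "en")
```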

The architecture of the Milvus Vector Database

The architecture consists of several key layers and components.

Access Layer

The access layer is the front end of Milvus, serving as the initial point of contact for all external requests. It employs stateless proxies that manage client connections efficiently. These proxies perform two types of checks on incoming requests:

  • Static verification: Ensures that requests are properly formatted and contain all necessary information.
  • Dynamic checks: Validates the requests against the current system state and permissions.

The access layer also implements load balancing, distributing incoming requests evenly across available resources to prevent any single point from becoming overwhelmed. This layer is responsible for implementing Milvus’s comprehensive API suite, providing a unified interface for various client applications and programming languages.

Once a request has been processed by the downstream services, the access layer is responsible for routing the response back to the user. This ensures a seamless communication flow between the client and the Milvus system.
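
The behavior of the access layer can be pictured with a toy proxy. The sketch below is purely illustrative and not Milvus code: the `Proxy` class, the required field names, and the worker names are hypothetical. It shows a static check on the request shape, a dynamic check against current system state, and round-robin load balancing.

```python
import itertools

class Proxy:
    """Toy proxy: validate a request, then forward it to workers in turn."""

    REQUIRED_FIELDS = ("collection", "vector")  # hypothetical request schema

    def __init__(self, workers, known_collections):
        self._workers = itertools.cycle(workers)      # round-robin load balancing
        self._known_collections = known_collections   # stands in for system state

    def handle(self, request):
        # Static verification: is the request well formed?
        missing = [field for field in self.REQUIRED_FIELDS if field not in request]
        if missing:
            return {"ok": False, "error": f"missing fields: {missing}"}
        # Dynamic check: does the request make sense for the current state?
        if request["collection"] not in self._known_collections:
            return {"ok": False, "error": "unknown collection"}
        return {"ok": True, "routed_to": next(self._workers)}

proxy = Proxy(workers=["node-a", "node-b"], known_collections={"articles"})
first = proxy.handle({"collection": "articles", "vector": [0.1, 0.2]})
second = proxy.handle({"collection": "articles", "vector": [0.3, 0.4]})
```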

Coordinator Service

The coordinator service acts as the brain of Milvus, orchestrating all operations across the system. It consists of four specialized coordinators, each with distinct responsibilities:

Root Coordinator

  • Handles data definition and data control requests, such as creating or deleting collections, partitions, and indexes.
  • Handles the assignment of global timestamps, which is crucial for maintaining data consistency across the distributed system.
  • Ensures that all operations are properly sequenced and that data integrity is maintained throughout the system.
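
Global timestamp assignment can be sketched as a small timestamp oracle. The layout below (a physical millisecond clock in the high bits and an 18-bit logical counter in the low bits) is an assumption made for this sketch, and the `TimestampOracle` class is hypothetical. The point is only to show how a single allocator can hand out strictly increasing timestamps, even when several requests arrive in the same millisecond.

```python
import time

LOGICAL_BITS = 18  # assumed layout: low 18 bits are a counter, the rest is physical time

class TimestampOracle:
    """Hand out strictly increasing hybrid timestamps."""

    def __init__(self, clock=lambda: int(time.time() * 1000)):
        self._clock = clock          # returns the current time in milliseconds
        self._last_physical = 0
        self._logical = 0

    def allocate(self):
        physical = self._clock()
        if physical > self._last_physical:
            # The clock moved forward: start a new millisecond, reset the counter.
            self._last_physical = physical
            self._logical = 0
        else:
            # Same millisecond, or the clock went backwards: bump the counter.
            # (Counter overflow is ignored in this sketch.)
            self._logical += 1
        return (self._last_physical << LOGICAL_BITS) | self._logical

# Usage with a scripted clock so the behavior is easy to follow.
ticks = iter([1000, 1000, 999, 1001])
oracle = TimestampOracle(clock=lambda: next(ticks))
stamps = [oracle.allocate() for _ in range(4)]
```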

Query Coordinator

  • Oversees all query nodes in the system.
  • Manages the distribution and execution of search operations across these nodes.
  • Determines how to break down the query, which nodes should handle different parts of the query, and how to aggregate the results.
  • Ensures efficient and accurate search operations.
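
The scatter-and-gather pattern described above can be sketched as follows. Everything here is a simplification for illustration: each "segment" is a plain dictionary standing in for the data held by one query node, the vectors are one-dimensional, and the function names are hypothetical.

```python
import heapq
import itertools

def search_segment(segment, query, k):
    """Local top-k on one node: (distance, id) pairs sorted by distance."""
    scored = ((abs(value - query), item_id) for item_id, value in segment.items())
    return heapq.nsmallest(k, scored)

def scatter_gather(segments, query, k):
    """Send the query to every segment, then merge the partial results."""
    partial_results = [search_segment(segment, query, k) for segment in segments]
    merged = heapq.merge(*partial_results)  # each partial list is already sorted
    return [item_id for _, item_id in itertools.islice(merged, k)]

# Hypothetical data spread across three query nodes.
segments = [
    {"a": 1.0, "b": 9.0},
    {"c": 2.5, "d": 3.0},
    {"e": 2.0},
]

top_three = scatter_gather(segments, query=2.2, k=3)
```

Because every node returns its own best k candidates, the global top k is guaranteed to be among the merged partial results.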

Data Coordinator

  • Manages all data nodes in the system.
  • Handles metadata related to data storage and retrieval.
  • Keeps track of where different pieces of data are stored across the system.
  • Manages data replication for fault tolerance and orchestrates data movement operations when necessary (such as during system scaling or data rebalancing).

Index Coordinator

  • Maintains all index nodes and manages metadata related to indexing.
  • Oversees the creation, updating, and deletion of indexes across the system.
  • Ensures that all relevant indexes are updated accordingly to maintain search accuracy and efficiency.

Worker Nodes

Worker nodes are the workhorses of Milvus, responsible for executing the actual tasks assigned by the coordinators. These are designed as scalable pods, meaning their number can be increased or decreased based on the system’s current needs. There are three types of worker nodes corresponding to the main tasks in Milvus:

  • Query Nodes: Execute search and query operations.
  • Data Nodes: Handle data storage and retrieval operations.
  • Index Nodes: Manage the creation and maintenance of indexes.

The scalability of worker nodes allows Milvus to dynamically adjust to changing demands in data processing, querying, and indexing. For instance, if there’s a sudden increase in search requests, more query nodes can be spun up to handle the load.

Object Storage Layer

This layer is fundamental for data persistence in Milvus and consists of three main components:

Meta Store

  • Uses etcd, a distributed key-value store, as its meta store.
  • Stores metadata snapshots, which include critical information about the system’s state, data distribution, and configuration.
  • Plays a crucial role in system health checks, allowing Milvus to quickly detect and respond to any issues in the distributed system.

Log Broker

  • Manages streaming data persistence and recovery.
  • Uses Apache Pulsar in a Milvus cluster deployment and RocksDB in Milvus standalone.
  • Ensures that all operations are logged durably, allowing the system to recover from failures and maintain data consistency.
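
The recovery role of a log can be illustrated with a toy store that records every operation before applying it. This is a generic write-ahead-log sketch, not Milvus's implementation: the `LogBackedStore` class and the operation format are hypothetical, and a Python list stands in for a durable log such as Pulsar.

```python
import json

class LogBackedStore:
    """Toy key-value store that logs every write before applying it."""

    def __init__(self, log=None):
        self.log = log if log is not None else []  # stands in for a durable log
        self.state = {}
        for entry in self.log:                     # recovery: replay existing entries
            self._apply(json.loads(entry))

    def _apply(self, op):
        if op["type"] == "insert":
            self.state[op["key"]] = op["value"]
        elif op["type"] == "delete":
            self.state.pop(op["key"], None)

    def write(self, op):
        self.log.append(json.dumps(op))  # 1. persist the operation
        self._apply(op)                  # 2. apply it to the in-memory state

store = LogBackedStore()
store.write({"type": "insert", "key": "a", "value": 1})
store.write({"type": "insert", "key": "b", "value": 2})
store.write({"type": "delete", "key": "a"})

# Simulate a crash: rebuild a fresh store from the surviving log alone.
recovered = LogBackedStore(log=list(store.log))
```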

Object Storage

  • Used for storing large objects such as log snapshots, index files, and query results.
  • Supports various cloud storage services including Amazon S3, Azure Blob Storage, and MinIO.
  • Provides flexibility for users to choose the most suitable and cost-effective storage solution for their needs.

Retrieving information from Milvus Vector DB

Here we demonstrate how to build a retrieval augmented generation system using Milvus and LangChain. The idea is to take the web links of a few articles, embed their content, and store the embeddings in a vector database. We then retrieve information from that database using RAG.

As described earlier, RAG combines information retrieval with text generation. In this demonstration, the relevant information is retrieved from a knowledge base, which in this case is the Milvus vector database, and provided to the LLM, which generates a response based on both the query and the retrieved context.

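Here are the imports required for the demonstration.

import bs4
from google.colab import userdata  # assumption: the notebook runs in Google Colab, where userdata holds secrets
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_milvus import Milvus, Zilliz
from langchain_openai import ChatOpenAI, OpenAIEmbeddings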

Since we are working from article links, we need to scrape the data from the web pages, for which we require WebBaseLoader and BeautifulSoup. WebBaseLoader loads web pages and converts them into document objects that LangChain can process. BeautifulSoup (the bs4 package) handles the web scraping and data extraction from those pages. To store the data in a vector database and integrate it with LangChain, we use the Milvus and Zilliz classes.
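
To make the scraping step more concrete, here is a minimal sketch of class-based text extraction that uses only Python's standard library. It is not how BeautifulSoup or WebBaseLoader works internally: the `ClassTextExtractor` class and the sample HTML are assumptions for illustration, and the sketch ignores edge cases (such as unclosed tags like `<br>`) that a real HTML library handles.

```python
from html.parser import HTMLParser

class ClassTextExtractor(HTMLParser):
    """Collect text inside elements whose class attribute matches exactly."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1  # nested tag inside a matching element
        elif dict(attrs).get("class") == self.wanted_class:
            self.depth = 1   # entering a matching element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())

# Hypothetical page fragment.
sample_html = (
    '<div><h1 class="elementor-heading-title elementor-size-default">'
    'Title <b>here</b></h1><p class="other">skip</p></div>'
)

extractor = ClassTextExtractor("elementor-heading-title elementor-size-default")
extractor.feed(sample_html)
extracted_text = " ".join(extractor.chunks)
```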

Now let’s create a document loader. It is web-based, since we access the articles through their web links. We load two articles for reference: Multimodal Knowledge Graphs and Can MultiModal LLMs be a key to AGI? Here is the code snippet.

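webloader = WebBaseLoader(
    web_paths=("<link to article 1>", "<link to article 2>"),  # placeholders: the links are not preserved here
    bs_kwargs={"parse_only": bs4.SoupStrainer(class_=(  # assumed reconstruction: keep only these CSS classes
        "elementor-heading-title elementor-size-default",
        "elementor-element elementor-element-73d7af23 elementor-widget elementor-widget-theme-post-title elementor-page-title elementor-widget-heading"))})
docs = webloader.load()  # the documents used to build the vector store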

Now we would set up a vector database using Milvus (or potentially Zilliz) with OpenAI embeddings. First, we would initialize the OpenAIEmbeddings object, using an API key retrieved from user data. This embedding object will be used to convert text into vector representations. Here is the code snippet.

embeddings = OpenAIEmbeddings(openai_api_key=userdata.get('OPENAI_API_KEY'))

Then, we would create a Milvus vector store from a collection of documents (stored in the docs variable). Here is the code snippet.

vectorstore = Milvus.from_documents(  # or Zilliz.from_documents
    documents=docs, embedding=embeddings,
    connection_args={"uri": "./milvus_articles.db"},  # local Milvus database file
)

The from_documents method processes these documents, converting them into vector embeddings using the specified OpenAI embeddings. The vector store is configured to connect to a local Milvus database file named “milvus_articles.db”. This setup allows for efficient similarity searches and retrieval of document vectors, which is crucial for applications like semantic search or retrieval-augmented generation systems.
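
The ingest-then-search pattern described above can be mimicked with a toy in-memory store. This is only a sketch of the idea, not the Milvus or LangChain implementation: `ToyVectorStore` is hypothetical, a bag-of-words counter stands in for the OpenAI embedding model, and the second sample document is made up.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # list of (vector, document text) pairs

    @classmethod
    def from_documents(cls, documents, embed):
        """Embed every document once and keep the vectors for later searches."""
        store = cls(embed)
        store.rows = [(embed(doc), doc) for doc in documents]
        return store

    def similarity_search(self, query, k=1):
        """Embed the query and return the k most similar documents."""
        q = self.embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]

sample_docs = [
    "Role of Multimodal LLMs in Advancing Towards AGI",
    "Knowledge graphs store entities and relations",
]
store = ToyVectorStore.from_documents(sample_docs, toy_embed)
best_match = store.similarity_search("What are multimodal LLMs?", k=1)
```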

Let’s run a similarity search against the vector database to check that it retrieves the relevant document. Here is the code snippet.

query = "What are multimodal LLMs?"
vectorstore.similarity_search(query, k=1)

[Document(metadata={'source': '', 'pk': 451021951537774601}, page_content='Role of Multimodal LLMs in Advancing Towards AGI')]

In the output above, we can see that the search identifies the relevant document and returns it together with its metadata, where the source field holds the link to the article.

The LLM that we use in the demonstration is GPT-3.5 Turbo by OpenAI. The temperature parameter is set to zero to make the output as deterministic as possible, which helps keep the response close to the retrieved context. Now, we set the template for the prompt. Here is the code snippet.

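llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0, openai_api_key=userdata.get('OPENAI_API_KEY'))  # GPT-3.5 Turbo, temperature 0
# PROMPT_TEMPLATE is not shown here; it must contain {context} and {question} placeholders
prompt = PromptTemplate(
    template=PROMPT_TEMPLATE, input_variables=["context", "question"]
)
retriever = vectorstore.as_retriever()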
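
The contents of PROMPT_TEMPLATE are not shown in this article, so the template below is purely a hypothetical example of what a RAG prompt with the two input variables might look like. The sketch uses plain string formatting to show how the context and question are substituted at run time; the wording and the `fill_template` helper are assumptions.

```python
# Hypothetical template: the article's actual PROMPT_TEMPLATE is not shown.
EXAMPLE_TEMPLATE = (
    "Use the context below to answer the question. "
    "If the context does not contain the answer, say you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def fill_template(template, context, question):
    """Substitute the two input variables, as a prompt template does at run time."""
    return template.format(context=context, question=question)

filled = fill_template(
    EXAMPLE_TEMPLATE,
    context="Role of Multimodal LLMs in Advancing Towards AGI",
    question="What are multimodal LLMs?",
)
```

Telling the model to rely on the supplied context, and to admit when the context is insufficient, is a common way to keep RAG answers grounded in the retrieved documents.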

We are now all set to define the RAG chain, which retrieves context for the query and generates the response.

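def format_docs(docs):
    # Concatenate the content of the retrieved documents (the separator is an assumption)
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
response = rag_chain.invoke(query)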

In the code above, the template is converted into a PromptTemplate object, and the previously created vector store is transformed into a retriever that can fetch relevant documents. A helper function, format_docs, concatenates the content of the retrieved documents.

After that, the RAG chain is constructed using LangChain’s composable components: it retrieves relevant documents, formats them, combines them with the user’s question using the prompt template, passes this to an LLM, and then parses the output to a string. Finally, the chain is invoked with a specific query, generating an AI response based on the retrieved context. This setup allows for context-aware, knowledge-grounded responses to user queries.
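
The pipe-style composition used by the chain can be pictured with a small stand-in. The sketch below is not LangChain's implementation: the `Step` class, the `parallel` helper, and the stub retriever and LLM are all hypothetical, and exist only to show how chaining with the | operator passes each step's output to the next.

```python
class Step:
    """Wrap a function so that steps can be chained with the | operator."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Accept either another Step or a plain function on the right-hand side.
        other = other if isinstance(other, Step) else Step(other)
        return Step(lambda value: other.func(self.func(value)))

    def invoke(self, value):
        return self.func(value)

def parallel(**branches):
    """Run several Steps on the same input and collect their results in a dict."""
    return Step(lambda value: {name: branch.invoke(value) for name, branch in branches.items()})

# Stubs standing in for the real retriever, prompt, and LLM.
fake_retriever = Step(lambda question: ["doc about " + question])
format_docs = lambda docs: "\n\n".join(docs)
passthrough = Step(lambda value: value)
prompt = Step(lambda d: f"Context: {d['context']} | Question: {d['question']}")
fake_llm = Step(lambda text: text.upper())

chain = parallel(context=fake_retriever | format_docs, question=passthrough) | prompt | fake_llm
answer = chain.invoke("milvus")
```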


By leveraging Milvus’s scalable architecture and efficient vector search capabilities, RAG systems can quickly retrieve relevant information from vast knowledge bases. This retrieval process, when coupled with LangChain’s versatile framework for building AI applications, enables the creation of sophisticated systems that can generate responses grounded in up-to-date and domain-specific knowledge.



Sourabh Mehta
