Advancing Multilingual Text Embeddings through Nomic Embed Text V2

Nomic Embed Text V2 revolutionizes text embeddings with Mixture-of-Experts (MoE), enhancing efficiency, multilingual support, and scalability

Modern AI applications rely heavily on embedding models for tasks like clustering, retrieval-augmented generation (RAG), and semantic search. Nomic Embed Text V2 introduces a new approach by incorporating the Mixture-of-Experts (MoE) architecture into text embeddings, improving efficiency and multilingual capability. This article examines the model's design, training process, benchmark performance, and practical uses.

Table of Contents

  1. The Need for High-Performance Embeddings
  2. Key Innovations of Nomic Embed Text V2
  3. Mixture-of-Experts Architecture
  4. Multilingual Training Dataset
  5. Performance Benchmarks
  6. Integrations

Let's begin by understanding the need for high-performance embeddings.

The Need for High-Performance Embeddings

Traditional embedding models struggle with scalability, efficiency, and multilingual generalization. Nomic Embed Text V2 addresses these issues by improving multilingual support, including Indic languages such as Hindi and Marathi, through training on 1.6 billion high-quality text pairs; by boosting inference efficiency without compromising performance; and by using sparse activation (MoE) to lower computational overhead. These enhancements make Nomic Embed Text V2 well suited to high-volume applications in large-scale NLP pipelines, search, and retrieval.

Key Innovations of Nomic Embed Text V2

Building on Nomic Embed Text V1, this release introduces the first Mixture-of-Experts (MoE) embedding model for optimized parameter efficiency, multilingual embeddings covering dozens of languages, state-of-the-art (SOTA) performance on the BEIR and MIRACL benchmarks, and flexible dimensionality reduction that allows embeddings to be truncated from 768 to 256 dimensions without sacrificing quality. Together, these improvements give Nomic Embed Text V2 better multilingual comprehension, lower memory consumption, and faster inference.


Mixture-of-Experts Architecture

Why MoE for Embeddings?

Most embedding models incur high compute costs because they activate all parameters for every input. MoE reduces the number of active parameters per inference by dynamically routing each input to a small set of specialized expert layers.

How It Works

  • 8 Experts per MoE Layer: Only the top 2 experts are activated per input.
  • Total Model Size: 475M parameters, but only 305M are active at any time.
  • Result: Lower latency and efficient parameter utilization for large-scale applications.

By reducing active parameters, Nomic Embed Text V2 achieves 30-40% lower inference costs while maintaining SOTA accuracy.
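To make the routing concrete, here is a minimal top-2 MoE feed-forward layer in PyTorch. The class name, layer sizes, and router design are illustrative assumptions for exposition, not Nomic's actual implementation; the point is only that each token passes through 2 of the 8 experts rather than all of them.

```python
import torch
import torch.nn as nn


class Top2MoELayer(nn.Module):
    """Illustrative top-2 Mixture-of-Experts feed-forward layer (hypothetical sizes)."""

    def __init__(self, d_model=768, d_hidden=3072, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # gating network scores each expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x)                            # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)     # keep only the top-2 experts per token
        weights = torch.softmax(weights, dim=-1)           # normalize the selected gate scores
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                      # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Only the two selected experts run for each token, which is why the 475M-parameter model activates roughly 305M parameters per inference.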

Multilingual Training Dataset

Training on a wide variety of languages ensures strong cross-lingual generalization. The dataset contains 1.6 billion high-quality multilingual text pairs curated from the mC4 and multilingual CC-News corpora, with low-quality pairs removed by consistency filtering. This strategy lets the model handle low-resource languages while improving performance in high-resource settings.
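As an illustration of consistency filtering (a sketch under stated assumptions, not Nomic's exact pipeline): each candidate pair is scored with an auxiliary embedding model, and a pair is kept only if its paired document ranks among the query's nearest neighbors within the batch.

```python
import numpy as np


def consistency_filter(query_embs, doc_embs, top_k=2):
    """Keep a (query, document) pair only if the paired document ranks
    within the top_k nearest documents for its query.

    query_embs, doc_embs: aligned (N, d) arrays from an auxiliary embedding
    model (assumption: embeddings are L2-normalized). In practice this is
    run over mini-batches rather than building one giant N x N matrix.
    """
    sims = query_embs @ doc_embs.T                 # cosine similarities, shape (N, N)
    ranks = np.argsort(-sims, axis=1)[:, :top_k]   # indices of top_k documents per query
    keep = np.array([i in ranks[i] for i in range(len(query_embs))])
    return keep                                    # boolean mask over the pairs
```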

Breakdown of 1.6 billion data pairs used for multilingual contrastive pretraining


Performance Benchmarks

Benchmarking Against SOTA Models

Nomic Embed Text V2 against other multilingual embedding models


Dimension Reduction

Nomic Embed Text V2 supports Matryoshka Representation Learning, allowing dimensionality reduction from 768 to 256 while retaining 97% of performance.

Dimension Reduction Comparison

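A minimal sketch of Matryoshka-style truncation, assuming the embeddings are returned as a PyTorch tensor: keep the leading 256 dimensions and re-normalize so cosine similarity can still be computed with a dot product.

```python
import torch.nn.functional as F


def truncate_embeddings(embeddings, dim=256):
    """Keep the first `dim` Matryoshka dimensions and L2-normalize the result."""
    return F.normalize(embeddings[:, :dim], p=2, dim=1)
```

Smaller vectors cut index memory and similarity-search cost roughly in proportion to the dimension, which is what makes the 768-to-256 trade-off attractive for large deployments.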

Integrations

Nomic Embed Text V2 is designed for easy integration with popular libraries and frameworks such as Transformers, SentenceTransformers, LangChain, and LlamaIndex.
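A minimal retrieval sketch using SentenceTransformers. Assumptions: the checkpoint is published on the Hugging Face Hub as nomic-ai/nomic-embed-text-v2-moe and follows the search_query:/search_document: prefix convention of earlier Nomic Embed releases; consult the model card for the exact usage.

```python
from sentence_transformers import SentenceTransformer

# Assumed model ID and task prefixes; verify against the official model card.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)

queries = ["search_query: What is retrieval-augmented generation?"]
documents = [
    "search_document: Retrieval-augmented generation pairs a retriever with a generator.",
    "search_document: Mixture-of-Experts layers route each token to a few expert networks.",
]

query_emb = model.encode(queries, normalize_embeddings=True)
doc_emb = model.encode(documents, normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized embeddings.
scores = query_emb @ doc_emb.T
print(scores)
```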

Final Words

With flexible dimension reduction for cost-effective deployments, multilingual support built on 1.6 billion high-quality text pairs, SOTA results on the BEIR and MIRACL benchmarks, and efficient Mixture-of-Experts routing, Nomic Embed Text V2 is a major advancement in embedding technology.


