Can MultiModal LLMs be a key to AGI?

By integrating textual, visual, and other modalities, MultiModal LLMs pave the way for human-like intelligence.

Over the years, Large Language Models (LLMs) have made enormous leaps, gaining remarkable abilities such as instruction following, In-Context Learning (ICL), and Chain of Thought (CoT). These improvements, realized by scaling up data and model size, have led to impressive zero-shot and few-shot reasoning performance on most Natural Language Processing (NLP) tasks. Nonetheless, LLMs are intrinsically designed to understand only discrete text, so they are “blind” to visual information. Conversely, Large Vision Models (LVMs) are good at vision tasks but weaker at reasoning. This complementarity has given rise to a new field, Multimodal Large Language Models (MLLMs), which combine LLMs and LVMs to process, reason over, and generate outputs across several modalities. By exploring the potential of MLLMs in this article, we can better understand their role in the pursuit of Artificial General Intelligence (AGI).

Table of contents

  1. The Evolution of Large Language Models (LLMs)
  2. Overview of Multimodal Large Language Models (MLLMs)
  3. The architecture of Multimodal Large Language Models
  4. How are Multimodal LLMs connected to AGI?

Let’s connect the dots by understanding the evolution of large language models. 

The Evolution of Large Language Models (LLMs)

The development of Large Language Models (LLMs) has been a significant milestone in the field of artificial intelligence, particularly in natural language processing (NLP). These models have evolved rapidly, showcasing extraordinary capabilities and transforming how machines understand and generate human language.

Key Milestones in LLM Development

The journey of LLMs began with simpler models designed to understand and generate text. Over time, the models grew in complexity and size, leading to remarkable advancements:

  • Early NLP Models: Early models focused on basic language tasks such as part-of-speech tagging and named entity recognition, using techniques like Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs).
  • Introduction of Word Embeddings: The rise of word embeddings, like Word2Vec and GloVe, was a big leap. These methods represent words as dense vectors in a continuous vector space, so that words with similar meanings end up close together.
  • Transformer Architecture: The introduction of the Transformer architecture by Vaswani et al. in 2017 changed NLP forever. With their self-attention mechanism, Transformers made it possible to process the words of a sequence in parallel, making models more efficient and powerful.
  • BERT and Beyond: Bidirectional Encoder Representations from Transformers (BERT) took this further by pre-training on large text corpora and fine-tuning for specific tasks. This method improved performance considerably across a range of NLP benchmarks.
  • GPT Series: The Generative Pre-trained Transformer (GPT) series demonstrated how LLMs can produce text that is both coherent and contextually appropriate. With each iteration from GPT through GPT-3, the models became larger and more capable, culminating in GPT-4, which integrated multimodal capabilities.
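To make the self-attention idea from the Transformer milestone concrete, here is a minimal sketch in plain Python. It assumes identity projections (queries, keys, and values are the token vectors themselves), whereas a real Transformer learns separate projection matrices. The function names and the toy 2-dimensional embeddings are hypothetical, chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    Simplifying assumption: queries, keys, and values are the token
    vectors themselves; a trained model would project them first.
    """
    d = len(tokens[0])
    outputs = []
    for query in tokens:
        # Score this query against every token in the sequence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in tokens]
        weights = softmax(scores)
        # The output is the attention-weighted average of all value vectors.
        outputs.append([sum(w * value[i] for w, value in zip(weights, tokens))
                        for i in range(d)])
    return outputs

toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 2-d embeddings
attended = self_attention(toy_tokens)
```

Each pass through the outer loop depends only on the input tokens, not on the result for any other position. That independence is what allows the computation to be run in parallel across the whole sequence.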

Major Breakthroughs: Instruction Following, In-Context Learning, and Chain of Thought

As LLMs evolved, they demonstrated several groundbreaking abilities:

  • Instruction Following: Modern LLMs can follow complex instructions, making them versatile tools for various applications. By understanding and executing user commands, they can perform a wide range of tasks, from writing essays to generating code.
  • In-Context Learning (ICL): LLMs can pick up a task from examples or information supplied in the prompt itself, without any change to their weights. Their responses adapt to the context provided, which improves accuracy and relevance.
  • Chain of Thought (CoT): This refers to an LLM's ability to reason through a sequence of intermediate steps, much as humans do when solving problems. It enables models to tackle more complex problems that require multi-step reasoning and decision-making.
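The last two abilities are largely a matter of how the prompt is written, which a short sketch can show. The helper names, the sentiment task, and the example reviews below are hypothetical; the code only builds prompt strings and does not call any model.

```python
def build_few_shot_prompt(examples, query):
    """In-context learning: show solved examples, then the new input."""
    blocks = [f"Review: {text}\nSentiment: {label}\n" for text, label in examples]
    # The final block leaves the label empty for the model to complete.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n".join(blocks)

def build_cot_prompt(question):
    """Chain of thought: ask for step-by-step reasoning before the answer."""
    return f"Q: {question}\nA: Let's think step by step."

examples = [
    ("Great battery life.", "positive"),
    ("Screen cracked in a week.", "negative"),
]
few_shot = build_few_shot_prompt(examples, "Fast shipping and works well.")
cot = build_cot_prompt("A shop sells pens at 3 for $2. How much do 12 pens cost?")
```

In the few-shot prompt the model sees the task only through the two labelled examples, and no weights are updated. In the second prompt, the trailing cue invites the model to write out intermediate steps before committing to an answer.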

Impact on Natural Language Processing (NLP) Tasks

The advancements in LLMs have significantly impacted various NLP tasks, including:

  • Text Generation: LLMs can generate high-quality, coherent text for applications such as content creation, storytelling, and chatbot responses.
  • Machine Translation: The improved understanding of language nuances has enhanced the accuracy of machine translation systems.
  • Text Summarization: LLMs can effectively summarize long texts, making information more accessible and digestible.
  • Sentiment Analysis: By accurately interpreting the sentiment expressed in text, LLMs support applications in customer feedback analysis and social media monitoring.

The evolution of LLMs has set the stage for the development of Multimodal Large Language Models (MLLMs), which integrate visual and textual information to push the boundaries of what artificial intelligence can achieve.

Overview of Multimodal Large Language Models (MLLMs)

MLLMs leverage the strengths of both Large Language Models (LLMs) and Large Vision Models (LVMs), using multimodal instruction tuning to follow new instructions and perform tasks that were previously out of reach. From writing website code based on images to understanding the deeper meaning of memes and OCR-free math reasoning, MLLMs demonstrate capabilities that text-only models lack. The release of GPT-4 has spurred a research frenzy in this field, highlighting the transformative potential of MLLMs.

MLLMs are designed to overcome the limitations of traditional LLMs, which are primarily text-based and cannot interpret visual information. By integrating multimodal data, MLLMs enhance their cognitive abilities and enable more complex and human-like interactions.

Key Components: LLMs and LVMs

  1. Large Language Models (LLMs):
  • Function: LLMs are designed to understand, generate, and manipulate text. They have demonstrated exceptional performance in tasks such as language translation, text generation, and sentiment analysis.
  • Examples: GPT-3, BERT, and T5 are prominent examples of LLMs that have revolutionized the field of natural language processing.
  2. Large Vision Models (LVMs):
  • Function: LVMs are specialized in interpreting visual data, such as images and videos. They excel in tasks like object detection, image classification, and visual scene understanding.
  • Examples: Models like CLIP and Vision Transformers (ViTs) have set new benchmarks in the field of computer vision.

Integration of Large Language Models and Large Vision Models

The fusion of LLMs and LVMs in MLLMs allows these models to process and understand both textual and visual information simultaneously. This integration is achieved through advanced training techniques that align the representations of different modalities, enabling seamless information exchange between them.

Core Traits and Capabilities of MLLMs

Multimodal Instruction Tuning

  • Description: MLLMs are trained using multimodal instruction tuning, which involves providing instructions that encompass various types of data. This training paradigm enhances the model’s ability to follow complex instructions that require understanding and reasoning across multiple modalities.
  • Impact: This capability allows MLLMs to perform tasks such as writing code based on images, understanding the nuances of memes, and solving mathematical problems without optical character recognition (OCR).

Real-World Applications

  • Image-Based Coding: MLLMs can generate website code or other types of programming code based on visual input, bridging the gap between design and development.
  • Meme Understanding: By interpreting both the textual and visual elements of memes, MLLMs can understand and generate humorous or contextually relevant responses.
  • OCR-Free Math Reasoning: MLLMs can solve math problems presented in images without needing traditional OCR, showcasing their advanced reasoning capabilities.

Enhanced Cognitive Abilities

  • Description: The integration of multimodal data significantly boosts the cognitive abilities of MLLMs, enabling them to perform more complex and human-like tasks.
  • Examples: From detailed scene descriptions to context-aware responses in chatbots, MLLMs are pushing the boundaries of what AI can achieve.

The architecture of Multimodal Large Language Models

A typical MLLM can be abstracted into three modules: a pre-trained modality encoder, a pre-trained LLM, and a modality interface that connects them. Drawing an analogy to humans, modality encoders such as image or audio encoders act like eyes and ears that receive and preprocess optical or acoustic signals, while the LLM acts like the brain that understands and reasons over the processed signals. In between, the modality interface aligns the different modalities. Some MLLMs also include a generator to output modalities other than text.

Modality Encoder

The encoders compress raw information, such as images or audio, into a more compact representation. Rather than training from scratch, a common approach is to use a pre-trained encoder that has been aligned to other modalities. For example, CLIP incorporates a visual encoder semantically aligned with the text through large-scale pretraining on image-text pairs. Therefore, it is easier to use such initially pre-aligned encoders to align with LLMs through alignment pre-training.
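What "semantically aligned" means can be illustrated with a small sketch: when image and text embeddings live in the same space, the matching caption is the one whose vector points in nearly the same direction as the image vector. The 3-dimensional embeddings below are made-up toy values, not outputs of CLIP or any real encoder, and the function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def best_caption(image_embedding, caption_embeddings):
    """Return the caption whose embedding is most similar to the image's."""
    return max(
        caption_embeddings,
        key=lambda caption: cosine_similarity(image_embedding,
                                              caption_embeddings[caption]),
    )

# Toy vectors standing in for encoder outputs (assumed values).
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.9, 0.3],
}
match = best_caption(image_embedding, caption_embeddings)
```

An encoder that already places images near their descriptions gives the later alignment step with the LLM a useful starting point, which is the motivation the paragraph above describes.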

Commonly used image encoders include vanilla CLIP image encoders, EVA-CLIP (ViT-G/14) encoders, and convolution-based ConvNeXt-L encoders, each chosen based on factors like resolution, parameter size, and pretraining corpus. Scaling up the input resolution has been found to bring notable performance gains, using approaches such as direct scaling and patch division.
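The patch-division idea can be sketched as follows: a high-resolution image is cut into tiles, so that each tile can be passed through an encoder built for a lower resolution. This is a simplified illustration on a nested list of pixel values. It assumes the image dimensions are exact multiples of the patch size, and the function name is hypothetical.

```python
def split_into_patches(image, patch_size):
    """Split a 2-D image (list of rows) into non-overlapping square patches.

    Assumes height and width are exact multiples of patch_size; a real
    pipeline would typically pad or resize the image first.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    # Walk the image in row-major order, one patch-sized window at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([row[left:left + patch_size]
                            for row in image[top:top + patch_size]])
    return patches

# A toy 4x4 "image" whose pixel values are 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
```

Here the 4x4 image yields four 2x2 patches. Each patch would be encoded separately, and the resulting features combined before being handed to the LLM.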

Pre-trained LLM

Instead of training an LLM from scratch, it is more efficient and practical to start with a pre-trained one. Through extensive pre-training on web corpora, LLMs acquire rich world knowledge and demonstrate strong generalization and reasoning capabilities. Commonly used LLMs include the FlanT5 series, the LLaMA series, and the Vicuna family, with larger parameter sizes bringing additional gains. Recent explorations of the Mixture of Experts (MoE) architecture for LLMs have garnered attention because it scales up the total parameter count without a matching increase in computational cost, since only a subset of experts is activated for each input.
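The reason MoE keeps compute low can be seen in a minimal routing sketch: a gate scores every expert, but only the top-scoring ones are actually run. Everything here is a toy assumption made for illustration, including the three lambda "experts", the gate weights, and the function names; real MoE layers use learned feed-forward networks and a learned gate.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_weights, experts, top_k=1):
    """Route input x to the top_k experts picked by a linear gate."""
    # One gate logit per expert: dot product of its gate row with x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    chosen = sorted(range(len(experts)),
                    key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalise the gate scores over the chosen experts only.
    weights = softmax([logits[i] for i in chosen])
    output = [0.0] * len(x)
    for weight, index in zip(weights, chosen):
        expert_out = experts[index](x)  # only the chosen experts are evaluated
        output = [o + weight * y for o, y in zip(output, expert_out)]
    return output, chosen

experts = [
    lambda v: [2.0 * t for t in v],   # toy expert 0
    lambda v: [t + 1.0 for t in v],   # toy expert 1
    lambda v: [-t for t in v],        # toy expert 2
]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
output, chosen = moe_forward([3.0, 1.0], gate_weights, experts, top_k=1)
```

With top_k=1, only one of the three experts runs for this input, even though all three contribute to the total parameter count.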

Modality Interface

Since LLMs can only perceive text, bridging the gap between natural language and other modalities is necessary. A practical way is to introduce a learnable connector between the pre-trained visual encoder and the LLM, or to use expert models that translate images into language. Learnable connectors project visual information into a space the LLM can understand, and are implemented through token-level or feature-level fusion. Expert models, such as image captioning models, convert multimodal inputs into language, so the LLM can understand other modalities through the converted text.
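One simple form of a learnable connector with token-level fusion is sketched below: a linear projection maps each visual feature to the LLM's embedding size, and the projected "visual tokens" are placed in the same sequence as the text token embeddings. The weights, the dimensions (2-d visual features, 3-d LLM embeddings), and the function names are hypothetical; in practice the projection is learned during alignment training and connectors can be more elaborate.

```python
def project(feature, weight, bias):
    """Linear projection of one visual feature into the LLM embedding space."""
    return [sum(w * f for w, f in zip(row, feature)) + b
            for row, b in zip(weight, bias)]

def fuse_tokens(visual_features, text_embeddings, weight, bias):
    """Token-level fusion: projected visual tokens are prepended to text tokens."""
    visual_tokens = [project(f, weight, bias) for f in visual_features]
    return visual_tokens + text_embeddings

# Assumed toy parameters: maps 2-d visual features to 3-d embeddings.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.5]
visual_features = [[1.0, 2.0], [3.0, 4.0]]   # e.g. one vector per image patch
text_embeddings = [[0.1, 0.2, 0.3]]          # one text token embedding
sequence = fuse_tokens(visual_features, text_embeddings, weight, bias)
```

After fusion, every element of the sequence has the LLM's embedding size, so the LLM can attend over visual and text tokens alike.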

How are Multimodal LLMs connected to AGI?

Artificial General Intelligence (AGI) seeks to emulate human intelligence, encompassing the ability to reason, learn from experience, understand natural language, and perceive and interpret sensory inputs (such as vision and sound). While current AI systems excel in specific tasks, achieving true AGI requires integrating these capabilities seamlessly and flexibly across different domains.

Role of Multimodal LLMs in Advancing Towards AGI

  • Integration of Modalities: MLLMs integrate textual, visual, and potentially other modalities, enabling a deeper and more holistic understanding of information. This integration mirrors human cognition, where sensory inputs are combined to form a unified understanding of the world.
  • Contextual Reasoning: MLLMs can learn and reason based on contextual information, adapting their responses and behaviours dynamically. This capability is crucial for AGI, as it enables AI systems to understand and respond appropriately in diverse situations.
  • Multimodal Interaction: By processing and generating outputs in multiple modalities, MLLMs facilitate more natural interactions with users and the environment. This capability is essential for applications ranging from human-computer interaction to complex problem-solving scenarios.
  • Transfer Learning and Adaptability: MLLMs, pre-trained on vast datasets, can generalize their knowledge and adapt to new tasks with minimal additional training. This ability is fundamental for AGI, as it allows AI systems to continuously learn and improve across diverse domains and environments.


Multimodal Large Language Models (MLLMs) represent a significant advancement in AI towards achieving Artificial General Intelligence (AGI). By integrating textual, visual, and other modalities, enhancing contextual reasoning, and enabling flexible interaction and adaptation, MLLMs pave the way for future AI systems capable of performing diverse tasks with human-like intelligence.


Sourabh Mehta
