Janus-Pro enhances multimodal AI by effectively balancing textual-visual understanding (e.g., image captioning, document processing) and generation (e.g., text-to-image synthesis). Traditional models struggle with conflicts between these tasks, leading to performance trade-offs. As an evolution of the original Janus model, Janus-Pro introduces architectural refinements and optimized training methodologies to mitigate these challenges. This article explores its core design, training strategies, and technical innovations, highlighting how it achieves state-of-the-art performance across diverse multimodal tasks.
Table of Contents
- Introduction to Janus-Pro
- Key Features
- Understanding Janus-Pro Architecture
- Overview of Janus-Pro's Optimized Training Pipeline
- Scaling Data and Model for Performance Gains
- Hands-on Implementation
Let’s start by understanding what Janus-Pro is.
Introduction to Janus-Pro
Janus-Pro is a unified multimodal model, meaning it supports both understanding and generation tasks within a single framework. Unlike previous approaches that use a shared visual encoder for both tasks (which causes interference), Janus-Pro decouples visual encoding into separate components for each task.
Comparison between Janus-Pro and Janus.
Key Features
- Unified Transformer Architecture: Processes multimodal inputs with autoregressive modeling.
- Decoupled Encoding: Independent encoding paths for understanding (SigLIP-based) and generation (VQ-tokenizer-based).
- Scalability: Available in 1B and 7B parameter configurations for different computational constraints.
- Improved Instruction Following: Outperforms state-of-the-art models like TokenFlow-XL, DALL-E 3, and SDXL in text-to-image benchmarks.

Key Features of Janus-Pro
Understanding Janus-Pro Architecture
Janus-Pro refines the autoregressive transformer architecture, focusing on modular design for efficient multimodal processing.
Janus-Pro’s Architecture
Dual Encoding for Multimodal Understanding and Generation
- Understanding Encoder: Uses SigLIP to extract high-dimensional semantic embeddings from images and maps them into the language model's (LLM) input space.
- Generation Encoder: Employs a VQ-tokenizer to convert images into discrete latent representations, which are then processed autoregressively for text-to-image synthesis.
The decoupling of these encoders eliminates interference, significantly improving accuracy in both tasks; the sketch below illustrates the idea.
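The following is a minimal, runnable sketch of this decoupling. It uses random tensors and placeholder dimensions (VISION_DIM, LLM_DIM, CODEBOOK_SIZE are illustrative values, not Janus-Pro's real hyperparameters) to show how the two paths produce embeddings of the same width for the shared language model.
import torch
import torch.nn as nn

# Placeholder sizes for illustration only (not Janus-Pro's actual hyperparameters)
VISION_DIM, LLM_DIM, CODEBOOK_SIZE, NUM_IMG_TOKENS = 1024, 2048, 16384, 576

# Understanding path: a SigLIP-style encoder yields continuous patch features;
# an MLP adaptor projects them into the LLM input space.
und_adaptor = nn.Sequential(nn.Linear(VISION_DIM, LLM_DIM), nn.GELU(), nn.Linear(LLM_DIM, LLM_DIM))
siglip_features = torch.randn(1, NUM_IMG_TOKENS, VISION_DIM)   # stand-in for SigLIP output
und_embeds = und_adaptor(siglip_features)                       # -> (1, 576, LLM_DIM)

# Generation path: a VQ tokenizer turns the image into discrete codebook ids;
# a separate embedding table maps those ids into the LLM input space.
gen_embedding = nn.Embedding(CODEBOOK_SIZE, LLM_DIM)
vq_ids = torch.randint(0, CODEBOOK_SIZE, (1, NUM_IMG_TOKENS))   # stand-in for VQ tokenizer output
gen_embeds = gen_embedding(vq_ids)                              # -> (1, 576, LLM_DIM)

print(und_embeds.shape, gen_embeds.shape)  # both feed the same transformer, via separate paths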
Autoregressive Multimodal Fusion
- The model concatenates processed text and image features into a unified feature sequence, enabling cross-domain reasoning.
- Uses separate prediction heads for textual and visual outputs, improving generation quality (see the sketch below).
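Continuing the sketch above (same placeholder dimensions, with a tiny TransformerEncoder standing in for the real backbone), this is roughly how the fused sequence and the two prediction heads fit together; it is illustrative only, not Janus-Pro's actual code.
import torch
import torch.nn as nn

LLM_DIM, TEXT_VOCAB, IMAGE_VOCAB = 2048, 100_000, 16384   # placeholder sizes

text_embeds = torch.randn(1, 32, LLM_DIM)    # embedded text tokens
image_embeds = torch.randn(1, 576, LLM_DIM)  # embedded image features (from either encoding path)

# Fusion: text and image features are concatenated into a single autoregressive sequence
fused = torch.cat([text_embeds, image_embeds], dim=1)   # -> (1, 608, LLM_DIM)

# One-layer stand-in for the shared transformer backbone
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(LLM_DIM, nhead=8, batch_first=True), num_layers=1
)
hidden = backbone(fused)

# Separate prediction heads: one over the text vocabulary, one over the image codebook
text_head = nn.Linear(LLM_DIM, TEXT_VOCAB)
gen_head = nn.Linear(LLM_DIM, IMAGE_VOCAB)
print(text_head(hidden[:, -1, :]).shape)   # next-text-token logits
print(gen_head(hidden[:, -1, :]).shape)    # next-image-token logits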
Scalable Transformer Backbone
- Model sizes: Janus-Pro-1B (compact) and Janus-Pro-7B (high-performance).
- Uses a context length of 4,096 tokens, ensuring long-sequence handling.
- Employs multi-layer MLP adaptors to efficiently integrate multimodal embeddings (a rough configuration sketch follows).
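As a rough configuration sketch, the two released scales could be described as follows; only the 1B/7B labels and the 4,096-token context length come from the article, while the adaptor depth is a placeholder.
from dataclasses import dataclass

@dataclass
class JanusProConfig:
    # Illustrative sketch; placeholder fields are not official numbers
    scale: str               # "1B" or "7B", the two released sizes
    context_length: int      # 4096 tokens, as stated above
    adaptor_mlp_layers: int  # placeholder depth for the multi-layer MLP adaptors

janus_pro_1b = JanusProConfig(scale="1B", context_length=4096, adaptor_mlp_layers=2)
janus_pro_7b = JanusProConfig(scale="7B", context_length=4096, adaptor_mlp_layers=2)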
Overview of Janus-Pro's Optimized Training Pipeline
Janus-Pro introduces refinements in its three-stage training process:
Stage I: Early Visual Representation Learning
- Increased training steps on ImageNet-based pixel modeling, improving generalization.
- Adopts contrastive learning to enhance multimodal alignment.
Stage II: Unified Pretraining with Focused Data
- Switches from ImageNet data to full text-to-image data, increasing efficiency.
- Pretraining adds 90M samples from diverse datasets.
Stage III: Fine-Tuning for Enhanced Instruction Following
- Adjusts the multimodal : pure-text : text-to-image data mix ratio from 7:3:10 to 5:1:4 for better balance (illustrated below).
- Trains on additional datasets from DeepSeek-VL2, incorporating document processing, OCR, and image-text retrieval tasks.
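To make the ratio change concrete, here is a tiny illustrative snippet that converts both mixes into per-source sampling probabilities:
# Stage III data mix: multimodal : pure text : text-to-image
old_ratio = {"multimodal": 7, "pure_text": 3, "text_to_image": 10}   # original Janus
new_ratio = {"multimodal": 5, "pure_text": 1, "text_to_image": 4}    # Janus-Pro

def sampling_probs(ratio):
    total = sum(ratio.values())
    return {name: weight / total for name, weight in ratio.items()}

print(sampling_probs(old_ratio))  # {'multimodal': 0.35, 'pure_text': 0.15, 'text_to_image': 0.5}
print(sampling_probs(new_ratio))  # {'multimodal': 0.5, 'pure_text': 0.1, 'text_to_image': 0.4}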
Performance benchmarks
These refinements eliminate data inefficiencies in the original Janus model, allowing faster convergence and better generalization.
Scaling Data and Model for Performance Gains
Multimodal Data Scaling
- Increases training corpus by 90M samples, including captioned images, documents, and synthetic aesthetic datasets.
- Introduces a 1:1 real-to-synthetic data ratio, stabilizing text-to-image generation.
Detailed hyperparameters of Janus-Pro
Model Scaling from 1B to 7B Parameters
- Expands vocabulary size to 100K, improving multimodal reasoning.
- Larger models demonstrate faster convergence and superior visual-text alignment.
- Uses HAI-LLM for distributed training on A100 (40GB) GPU clusters; training takes roughly 9 days for the 1B model and 14 days for the 7B model.
Hands-on Implementation
Step 1: Install Required Libraries
Ensure all necessary libraries are available before execution. The janus.models package used below ships with DeepSeek's Janus GitHub repository (it is not the unrelated "janus" package on PyPI), while accelerate and bitsandbytes are required for device_map="auto" and 4-bit loading.
!pip install torch transformers accelerate bitsandbytes
!pip install git+https://github.com/deepseek-ai/Janus.git
Step 2: Import Required Modules
Load essential Python libraries for processing and model execution.
import os
import PIL.Image
import torch
import numpy as np
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
Step 3: Verify GPU Availability
Ensure GPU acceleration is enabled for faster processing.
assert torch.cuda.is_available(), "Enable GPU in Colab: Runtime > Change runtime type > Hardware accelerator = GPU"
Step 4: Load the Model Efficiently
Load the model with memory-saving options (bfloat16 weights plus 4-bit quantization) so the 7B checkpoint fits on a single Colab GPU.
model_path = "deepseek-ai/Janus-Pro-7B"
vl_chat_processor = VLChatProcessor.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,    # Janus-Pro ships custom modeling code
    torch_dtype=torch.bfloat16,
    device_map="auto",
    load_in_4bit=True          # 4-bit quantization (via bitsandbytes) reduces memory usage
)
Step 5: Prepare the Conversation Template
Define a structured prompt for image generation.
conversation = [{
    "role": "<|User|>",
    "content": "A stunning princess from Kabul in red, white traditional clothing, blue eyes, brown hair",
}, {"role": "<|Assistant|>", "content": ""}]

prompt = vl_chat_processor.apply_sft_template_for_multi_turn_prompts(
    conversations=conversation,
    sft_format=vl_chat_processor.sft_format,
    system_prompt="",
) + vl_chat_processor.image_start_tag
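Before generating, it can help to sanity-check the formatted prompt that will be fed to the tokenizer:
print(prompt)  # the SFT-formatted conversation, ending with the image start tag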
Step 6: Define the Image Generation Function
Implement a modified function optimized for Colab.
@torch.inference_mode()
def generate_colab(
    prompt: str,
    temperature: float = 1,
    parallel_size: int = 4,
    cfg_weight: float = 5,
    image_token_num_per_image: int = 576,
    img_size: int = 384,
    patch_size: int = 16,
):
    # Tokenize the prompt and move it to the model's device
    input_ids = vl_chat_processor.tokenizer.encode(prompt)
    input_ids = torch.LongTensor(input_ids).to(model.device)

    # Build paired conditional/unconditional sequences for classifier-free guidance (CFG)
    tokens = torch.zeros((parallel_size*2, len(input_ids)), dtype=torch.int, device=model.device)
    for i in range(parallel_size*2):
        tokens[i, :] = input_ids
        if i % 2 != 0:
            tokens[i, 1:-1] = vl_chat_processor.pad_id  # mask the prompt for the unconditional branch

    inputs_embeds = model.language_model.get_input_embeddings()(tokens)
    generated_tokens = torch.zeros((parallel_size, image_token_num_per_image), dtype=torch.int, device=model.device)
    past_key_values = None

    # Autoregressively sample the image tokens one at a time
    for i in range(image_token_num_per_image):
        outputs = model.language_model.model(
            inputs_embeds=inputs_embeds,
            use_cache=True,
            past_key_values=past_key_values
        )
        past_key_values = outputs.past_key_values
        logits = model.gen_head(outputs.last_hidden_state[:, -1, :])

        # CFG: combine conditional and unconditional logits
        logit_cond = logits[0::2, :]
        logit_uncond = logits[1::2, :]
        logits = logit_uncond + cfg_weight * (logit_cond - logit_uncond)

        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        generated_tokens[:, i] = next_token.squeeze(dim=-1)

        # Feed the sampled token back (duplicated for both CFG branches) as the next input embedding
        next_token = torch.cat([next_token.unsqueeze(1), next_token.unsqueeze(1)], dim=1).view(-1)
        img_embeds = model.prepare_gen_img_embeds(next_token)
        inputs_embeds = img_embeds.unsqueeze(1)

    # Decode the discrete image tokens back into pixels with the VQ decoder
    dec = model.gen_vision_model.decode_code(
        generated_tokens.to(dtype=torch.int),
        shape=[parallel_size, 8, img_size//patch_size, img_size//patch_size]
    )
    dec = dec.to(torch.float32).cpu().numpy().transpose(0, 2, 3, 1)
    dec = np.clip((dec + 1) / 2 * 255, 0, 255).astype(np.uint8)

    # Save each generated image to disk
    os.makedirs('generated_samples', exist_ok=True)
    for i in range(parallel_size):
        PIL.Image.fromarray(dec[i]).save(f'generated_samples/img_{i}.jpg')
    print(f"Generated {parallel_size} images in 'generated_samples' directory")
Step 7: Execute the Image Generation Function
Run the function to generate images based on the provided prompt.
generate_colab(prompt)
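Since the images are written to disk, you can optionally preview them inline in the notebook, for example:
from IPython.display import Image, display

# Preview the generated images inside Colab
for i in range(4):  # parallel_size defaults to 4 in generate_colab
    display(Image(filename=f"generated_samples/img_{i}.jpg"))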
Output
Final Words
Janus-Pro represents a significant leap forward in multimodal AI by decoupling visual processing, optimizing training strategies, and scaling model size. Key takeaways include:
- Improved multimodal understanding with SigLIP encoding.
- Enhanced text-to-image generation through VQ tokenization and synthetic data augmentation.
- Efficient scaling from 1B to 7B models, ensuring adaptability to diverse applications.