Multi-modal and multi-agent architectures have become prominent in the current LLM landscape. These systems process and synthesize information from diverse sources while orchestrating the collaborative efforts of multiple autonomous agents, and they hold immense potential for solving complex real-world problems. Understanding the inner workings of these agents presents significant challenges, which agent observability and tracing platforms like Portkey help address. This article explains multi-modal, multi-agent development and its observation in detail.
Table of Contents
- Understanding Multi-Modal and Multi-Agent Development
- Comprehending Replicate and its Usability
- AI Agent Observability and Tracing
- Hands-on Implementation of Portkey for Observing Agents
Understanding Multi-Modal and Multi-Agent Development
A multimodal agent is a system that can process and understand information from multiple types of data, such as text, images, audio, video, and sensor data. These systems integrate information from different modalities to achieve a more comprehensive understanding and broader capability. They combine specialized encoders for each modality with alignment techniques that create a unified semantic representation capturing the relationships between different data types.
The architecture of multimodal systems involves modality-specific processing pathways that eventually converge into shared representations. Modern implementations often use a transformer-based approach, which is highly effective at handling cross-modal attention. Applications of multimodal agents have expanded rapidly, ranging from content-understanding systems that interpret text and images together to generative systems that create images from textual descriptions or captions for visual content. The ability to process multiple modalities simultaneously has enabled more natural human-computer interactions and more comprehensive data analysis capabilities across industries.
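To make the cross-modal attention idea concrete, here is a minimal sketch using PyTorch (an illustrative assumption; PyTorch is not part of this tutorial's stack). Text token embeddings act as queries that attend over image patch embeddings, producing a fused representation; all tensors are random stand-ins.

import torch
import torch.nn as nn

d_model = 256
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

text_tokens = torch.randn(1, 12, d_model)    # 12 text-token embeddings (random stand-ins)
image_patches = torch.randn(1, 49, d_model)  # a 7x7 grid of image-patch embeddings

# Each text token attends over every image patch; the output mixes visual
# information into the text representation.
fused, attn_weights = cross_attn(query=text_tokens, key=image_patches, value=image_patches)
print(fused.shape)  # torch.Size([1, 12, 256])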
On the other hand, multi-agent systems are composed of multiple intelligent agents that interact with each other and their environment to achieve a common goal. The key characteristics of multi-agent systems are autonomy, interaction, and collaboration. Autonomy refers to the ability of agents to make decisions independently; interaction is the process by which agents communicate and coordinate with each other; and collaboration is when agents work together to solve complex problems.
The effectiveness of multi-agent systems depends on their coordination mechanisms, which include task allocation, information sharing, and conflict resolution. Agents may interact in cooperative frameworks, where they work towards shared goals and objectives. Agents in multi-agent systems can operate sequentially, one after another, or hierarchically, where a supervisor agent directs other agents to complete tasks.
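As a rough illustration, the two orchestration patterns can be sketched in plain Python with hypothetical stand-in agents (no framework involved):

def research_agent(task: str) -> str:
    return f"notes on {task}"

def writing_agent(material: str) -> str:
    return f"article based on {material}"

# Sequential: each agent's output feeds directly into the next agent.
def run_sequential(task: str) -> str:
    return writing_agent(research_agent(task))

# Hierarchical: a supervisor decides which worker agent should handle the task.
def run_hierarchical(task: str) -> str:
    worker = research_agent if "research" in task else writing_agent
    return worker(task)

print(run_sequential("research multi-agent systems"))
print(run_hierarchical("research multi-agent systems"))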
Frameworks like CrewAI, AutoGen, and LangGraph are among the most widely used for multi-agent, multimodal development. These frameworks provide infrastructure for defining agent roles, communication channels, shared knowledge management, and collaborative workflow orchestration. They abstract much of the complexity involved in building such sophisticated architectures, making agentic development more approachable and efficient. Tools such as AgentOps and Portkey complement them by monitoring, tracing, and evaluating agents.
Comprehending Replicate and its Usability
Replicate is a platform for running AI models in the cloud, giving users access to advanced models through API tokens. It serves as a hub for deploying, sharing, and using AI models without the complex infrastructure normally required to run them. Replicate enables users to run open-source models, create fine-tuned models, or build and publish custom models.
Replicate hosts a wide range of models covering diverse applications such as image generation, language modeling, and audio and video processing. It also offers API access, enabling users to integrate models into their applications with ease; the API simplifies sending inputs to models and receiving outputs.
Replicate provides several building blocks: models, predictions, deployments, and webhooks. Models are trained, packaged, and published software programs that accept user input and return output. Predictions represent the execution of a model, including the inputs provided and the outputs generated. Deployments allow for more control over how models are executed, and webhooks provide real-time updates about user predictions. The snippet below sketches the basic API call pattern.
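A minimal sketch of the API pattern, with a placeholder model reference (substitute a real "owner/model:version" string from replicate.com; inputs are model-specific):

import replicate  # requires REPLICATE_API_TOKEN in the environment

output = replicate.run(
    "owner/model-name:version-hash",  # placeholder, not a real model
    input={"prompt": "a watercolor painting of a lighthouse"},
)
print(output)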
AI Agent Observability and Tracing
AI agent observability and tracing refers to the practice of tracking and understanding the internal workings and behavior of agents. It extends beyond simple monitoring, which focuses on external metrics like uptime and resource utilization, to capture the agent's decision-making processes, interactions, and states. For a complex, LLM-powered agent, observability becomes essential for debugging and for ensuring that the agent's actions align with its intended goals and objectives.
Observing an AI agent involves collecting and analyzing various forms of data, including metrics, logs, and traces, to build a complete view of the agent's operation. Tracing is the component that tracks the path of a specific request or action as it flows through the agent system, providing a granular, step-by-step breakdown of how the agent processes information and interacts with other components or agents.
By following the trace of a request, users can understand the sequence of actions and events that led to an outcome and analyze the interactions between different agents. This level of detail is important for optimizing performance, improving agent reliability, and building trust in AI agent systems. The importance of observability and tracing grows with system complexity: as agents take on more critical tasks, ensuring their reliability and trustworthiness is of prime importance.
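As a generic illustration (not tied to any particular platform), a trace for one request through a two-agent system might record spans like these; the field names here are invented for the example:

trace = {
    "trace_id": "req-001",
    "spans": [
        {"name": "router_agent.llm_call", "latency_ms": 420, "tokens": 96},
        {"name": "retriever_agent.tool_call", "tool": "text2image", "latency_ms": 5300},
    ],
}

# Walking the spans reconstructs the request's path through the system.
for span in trace["spans"]:
    print(span["name"], "->", span["latency_ms"], "ms")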
Portkey is a comprehensive platform designed to act as a unified interface for interacting with AI models, observing and tracing agents, and implementing guardrails with ease. It also offers advanced capabilities such as multimodal support, load balancing, virtual keys, conditional routing, caching, and budget limits. Its most important feature for our purposes is observability and logging: it helps users gain real-time insights, track important metrics, and streamline debugging.
Hands-on Implementation of Portkey for Observing Agents
Let’s implement a multi-modal, multi-agent system using CrewAI and Replicate, and add agent observability through Portkey.
Pre-requisites:
- A Tavily account is needed for creating and implementing the web-search agent.
- A Replicate account is required for the model behind the text-to-image agent.
- A Portkey account is needed for monitoring agent runs and tracking execution phases.
Step 1: Library Installation
!pip install -qU langchain langchain_community tavily-python langchain-groq groq replicate crewai crewai[tools]
!pip install portkey-ai
Step 2: API Initialization
from google.colab import userdata
import os

# Keys are read from Colab's Secrets manager; the names must match your saved secrets.
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
os.environ["REPLICATE_API_TOKEN"] = userdata.get("REPLICATE_API_TOKEN")
os.environ["TAVILY_API_KEY"] = userdata.get("TAVILY_API_KEY")
PORTKEY_API_KEY = userdata.get("PORTKEY_API_KEY")
Step 3: Implement a web search tool helper function using Tavily
from langchain_community.tools.tavily_search import TavilySearchResults

def web_search_tool(question: str) -> str:
    """This tool is useful when we want web search for current events."""
    websearch = TavilySearchResults()
    response = websearch.invoke({"query": question})
    return response
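Optionally, a quick sanity check of the tool before wiring it into an agent (assumes TAVILY_API_KEY is set; the query is arbitrary):

print(web_search_tool("Who won the most recent FIFA World Cup?"))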
Step 4: Create a text-to-image helper function using Replicate; the model we will use is “adirik/flux-cinestill:216a43b9975de9768114644bbf8cd0cba54a923c6d0f65adceaccfc9383a938f”
import replicate

def text2image(text: str) -> str:
    """This tool is useful when we want to generate images from textual descriptions."""
    output = replicate.run(
        "adirik/flux-cinestill:216a43b9975de9768114644bbf8cd0cba54a923c6d0f65adceaccfc9383a938f",
        input={
            "steps": 28,
            "prompt": text,
            "lora_url": "",
            "control_type": "depth",
            "control_image": "https://replicate.delivery/pbxt/LUSNInCegT0XwStCCJjXOojSBhPjpk2Pzj5VNjksiP9cER8A/ComfyUI_02172_.png",
            "lora_strength": 1,
            "output_format": "webp",
            "guidance_scale": 2.5,
            "output_quality": 100,
            "negative_prompt": "low quality, ugly, distorted, artefacts",
            "control_strength": 0.45,
            "depth_preprocessor": "DepthAnything",
            "soft_edge_preprocessor": "HED",
            "image_to_image_strength": 0,
            "return_preprocessed_image": False
        }
    )
    print(output)
    return output[0]  # first item is the generated image URL
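Optionally, the helper can be tested in isolation; note that this triggers a billable Replicate prediction, and the prompt is just an example:

test_url = text2image("a foggy mountain village at sunrise, cinematic lighting")
print(test_url)  # URL of the generated image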
Step 5: Create an image-to-text helper function using Replicate
def image2text(image_url: str, prompt: str) -> str:
    """This tool is useful when we want to generate textual descriptions from images."""
    # Image-to-text needs a vision-language model rather than the flux-cinestill
    # text-to-image model; we assume LLaVA-13B hosted on Replicate here, whose
    # input schema matches the fields below.
    output = replicate.run(
        "yorickvp/llava-13b",
        input={
            "image": image_url,
            "top_p": 1,
            "prompt": prompt,
            "max_tokens": 1024,
            "temperature": 0.2
        }
    )
    return "".join(output)
Step 6: Set up a Router Tool
from crewai.tools import tool

@tool("router tool")
def router_tool(question: str) -> str:
    """Router Function"""
    # `llm` is the Portkey-routed model defined in Step 8; it must be created
    # before the crew is kicked off.
    prompt = f"""Based on the Question provided below, determine the following:
    1. Is the question directed at generating an image?
    2. Is the question directed at describing an image?
    3. Is the question a generic one that needs to be answered by searching the web?
    Question: {question}
    RESPONSE INSTRUCTIONS:
    - Answer either 1 or 2 or 3.
    - The answer should strictly be a string.
    - Do not provide any preamble or explanations except for 1 or 2 or 3.
    OUTPUT FORMAT:
    1
    """
    response = llm.invoke(prompt).content
    if response == "1":
        return 'text2image'
    elif response == "3":
        return 'web_search'
    else:
        return 'image2text'
Step 7: Set up a Retriever Tool
@tool("retriever tool")
def retriever_tool(router_response: str, question: str, image_url: str) -> str:
    """Retriever Function"""
    if router_response == 'text2image':
        return text2image(question)
    elif router_response == 'image2text':
        return image2text(image_url, question)
    else:
        return web_search_tool(question)
Step 8: Portkey Setup
from langchain_openai import ChatOpenAI
from portkey_ai import createHeaders, PORTKEY_GATEWAY_URL

portkey_headers = createHeaders(
    api_key=PORTKEY_API_KEY,
    virtual_key="open-ai-virtual-07f788",
)

# With a virtual key, the provider's OpenAI key is stored in Portkey itself, so
# every request is routed through the Portkey gateway, where it is logged and traced.
llm = ChatOpenAI(api_key=PORTKEY_API_KEY, base_url=PORTKEY_GATEWAY_URL, default_headers=portkey_headers)
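Optionally, Portkey's createHeaders also accepts a trace ID and metadata (per Portkey's documentation), which groups every LLM call from one crew run under a single trace in the dashboard; the values below are arbitrary examples:

portkey_headers = createHeaders(
    api_key=PORTKEY_API_KEY,
    virtual_key="open-ai-virtual-07f788",
    trace_id="multimodal-crew-run-1",           # arbitrary example trace ID
    metadata={"app": "crewai-multimodal-demo"},  # arbitrary example metadata
)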
Step 9: Create a Router Agent
from crewai import Agent

Router_Agent = Agent(
    role='Router',
    goal='Route the user question to text-to-image, image-to-text, or web search',
    backstory=(
        "You are an expert at routing a user question to text-to-image, image-to-text, or web search."
        "Use text-to-image to generate images from textual descriptions."
        "Use image-to-text to generate text describing an image."
        "Use web search to search for current events."
        "You do not need to be stringent with the keywords in the question related to these topics. Otherwise, use web search."
    ),
    verbose=True,
    allow_delegation=False,
    llm=llm,
    tools=[router_tool],
)
Step 10: Create a Retriever Agent
Retriever_Agent = Agent(
    role="Retriever",
    goal="Use the information retrieved from the Router to answer the question using the image url provided.",
    backstory=(
        "You are an assistant for directing tasks to respective agents based on the response from the Router."
        "Use the information from the Router to perform the respective task."
        "Do not provide any other explanation."
    ),
    verbose=True,
    allow_delegation=False,
    llm=llm,
    tools=[retriever_tool],
)
Step 11: Create the Router Task and Retriever Task
from crewai import Task

router_task = Task(
    description=(
        "Analyse the keywords in the question {question}."
        "If the question {question} instructs to describe an image, then use the image url {image_url}."
        "Based on the keywords, decide whether it is eligible for text-to-image, image-to-text, or web search."
        "Return a single word 'text2image' if it is eligible for generating images from a textual description."
        "Return a single word 'image2text' if it is eligible for describing the image based on the question {question} and image url {image_url}."
        "Return a single word 'web_search' if it is eligible for web search."
        "Do not provide any other explanation."
    ),
    expected_output=(
        "Give a choice 'web_search', 'text2image', or 'image2text' based on the question {question} and image url {image_url}."
        "Do not provide any preamble or explanations except for 'text2image', 'web_search', or 'image2text'."
    ),
    agent=Router_Agent,
)

retriever_task = Task(
    description=(
        "Based on the response from the 'router_task', generate a response for the question {question} with the help of the respective tool."
        "Use the web_search_tool to retrieve information from the web in case the router task output is 'web_search'."
        "Use the text2image tool to generate an image from the textual description in case the router task output is 'text2image'."
        "Use the image2text tool to describe the image provided in the image url in case the router task output is 'image2text'."
    ),
    expected_output=(
        "You should analyse the output of the 'router_task'."
        "If the response is 'web_search', then use the web_search_tool to retrieve information from the web."
        "If the response is 'text2image', then use the text2image tool to generate a detailed, high-quality image covering all the nuances described in the textual description provided in the question {question}."
        "If the response is 'image2text', then use the image2text tool to describe the image based on the question {question} and {image_url}."
    ),
    agent=Retriever_Agent,
    context=[router_task],
)
Step 12: Initiate the Crew
from crewai import Crew, Process

crew = Crew(
    agents=[Router_Agent, Retriever_Agent],
    tasks=[router_task, retriever_task],
    verbose=True,
)

inputs = {
    "question": "Generate an image based upon this text: a cinematic portrait of a majestic black panther, piercing yellow eyes, moody lighting, soft bokeh, 85mm lens, deep blue jungle background",
    "image_url": " "
}
result = crew.kickoff(inputs=inputs)
Step 13: Check the generated image as per the executed agents
import requests
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt

# The retriever task returns the generated image URL as the crew's raw output.
image_url = result.raw
response = requests.get(image_url)
if response.status_code == 200:
    img = Image.open(BytesIO(response.content))
    plt.imshow(img)
    plt.axis('off')  # hide the axes
    plt.show()
else:
    print("Failed to retrieve image. Status code:", response.status_code)
Output:
Check the Portkey Web UI for agent tracing and logs.
Final Words
Continuous improvement of AI agents, driven by valuable feedback that refines their behavior and capabilities, is of prime importance in a multi-agent, multi-modal setting. Observability and tracing become both more complex and more necessary when multiple data types are processed, since data must be tracked from different sources; transparency and accountability are also crucial for building public trust in AI. Observability and tracing of AI agents provide the means to achieve these goals and objectives by offering important insights into the agents' inner workings.