Hello World to Multimodal GPT

With Azure OpenAI GPT-4V, LlamaIndex and LangChain

Xin Cheng
3 min read · Jan 17, 2024

Last year was the year of “GPT”. This year, “multimodal” models (which go beyond text to handle vision, audio, and more) could be the next hot trend. Here is a quick way to use a multimodal GPT model to analyze an image.

TLDR: LlamaIndex and LangChain both integrate with Azure OpenAI GPT-4V for image analysis tasks.

Steps

  1. Create an Azure OpenAI resource
  2. Deploy a GPT-4V model in that Azure OpenAI resource

Code

We will set up environment variables to configure Azure OpenAI. Create a configuration file called .env with the following values.

.env

OPENAI_API_TYPE=azure
AZURE_OPENAI_ENDPOINT=https://<resource-name>.openai.azure.com/
AZURE_OPENAI_API_VERSION=2023-12-01-preview
AZURE_OPENAI_API_KEY=<api key>
AZURE_OPENAI_MODEL_NAME=gpt-4-vision-preview
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4-vision-preview

Make sure your Python environment has the required packages installed:

pip install llama-index openai langchain python-dotenv

LlamaIndex

Load the values from the .env file into environment variables

from dotenv import load_dotenv
load_dotenv()
import os

os.environ["OPENAI_API_VERSION"] = os.environ['AZURE_OPENAI_API_VERSION']
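Once the values are loaded, a quick check can confirm that nothing in the .env file was left out or misspelled, since a missing value can otherwise show up later as an error that is harder to trace. The sketch below is only an illustration: `missing_settings` and `REQUIRED_SETTINGS` are hypothetical names, not part of any library, and the list of variables simply mirrors the .env file above.

```python
import os

# Variable names mirror the .env file above.
REQUIRED_SETTINGS = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_MODEL_NAME",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or blank (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
```

Calling `missing_settings()` right after `load_dotenv()` and stopping when the list is non-empty is one way to fail early.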

Load image (U.S. 30-year fixed-rate mortgage from 2014–2023)

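import base64
import requests
from llama_index.schema import ImageDocument

image_url = "https://www.visualcapitalist.com/wp-content/uploads/2023/10/US_Mortgage_Rate_Surge-Sept-11-1.jpg"

# verify=False skips TLS certificate verification; remove it unless your network requires it
response = requests.get(image_url, verify=False)
if response.status_code != 200:
    raise ValueError("Error: Could not retrieve image from URL.")

# ImageDocument expects the image as a base64-encoded string
base64str = base64.b64encode(response.content).decode("utf-8")
image_document = ImageDocument(image=base64str, image_mimetype="image/jpeg")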
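The MIME type above is hardcoded to "image/jpeg", which matches this particular chart. For other images, one option is to derive it from the download itself. The following is a minimal sketch under that assumption: `guess_image_mimetype` is a hypothetical helper, and it relies only on the standard-library `mimetypes` module.

```python
import mimetypes


def guess_image_mimetype(url, content_type=None, default="image/jpeg"):
    """Pick a MIME type from the Content-Type header, else from the URL extension.

    Hypothetical helper: falls back to `default` when neither source
    yields an image type.
    """
    if content_type:
        # Drop any parameters that follow the type, e.g. "; charset=binary".
        header_type = content_type.split(";")[0].strip().lower()
        if header_type.startswith("image/"):
            return header_type
    guessed, _ = mimetypes.guess_type(url)
    if guessed and guessed.startswith("image/"):
        return guessed
    return default
```

With `requests`, the header value would come from `response.headers.get("Content-Type")`, and the result could be passed as `image_mimetype`.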

Create LLM

from llama_index.multi_modal_llms.azure_openai import AzureOpenAIMultiModal

azure_openai_mm_llm = AzureOpenAIMultiModal(
    engine=os.environ['AZURE_OPENAI_DEPLOYMENT_NAME'],
    api_version=os.environ["OPENAI_API_VERSION"],
    model=os.environ['AZURE_OPENAI_MODEL_NAME'],
    max_new_tokens=300,
)

Analyze image

complete_response = azure_openai_mm_llm.complete(
    prompt="Describe the images as an alternative text",
    image_documents=[image_document],
)
print(complete_response)

Result

The image is a line graph showing the U.S. 30-year fixed-rate mortgage and existing home sales from 2014 to 2023. The mortgage rate is represented by a red line, while the home sales are represented by a blue line. The graph shows that the mortgage rate has reached its highest level in over 20 years, while home sales have fluctuated over time. There is also a note that the data is sourced from the U.S. Federal Reserve, Trading Economics, and Visual Capitalist.

LangChain

Load the values from the .env file into environment variables

from dotenv import load_dotenv
load_dotenv()
import os
os.environ["OPENAI_API_VERSION"] = os.environ['AZURE_OPENAI_API_VERSION']

Load image to base64 string

import base64
import requests

image_url = "https://www.visualcapitalist.com/wp-content/uploads/2023/10/US_Mortgage_Rate_Surge-Sept-11-1.jpg"

# verify=False skips TLS certificate verification; remove it unless your network requires it
response = requests.get(image_url, verify=False)
if response.status_code != 200:
    raise ValueError("Error: Could not retrieve image from URL.")
base64str = base64.b64encode(response.content).decode("utf-8")
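The chat request in this section embeds the image as a data URL of the form `data:<mimetype>;base64,<payload>`. As a small illustration of that encoding step, the sketch below wraps it in a function. `to_data_url` is a hypothetical helper name, and the default MIME type is an assumption that fits the JPEG chart used here.

```python
import base64


def to_data_url(image_bytes, mimetype="image/jpeg"):
    """Encode raw image bytes as a base64 data URL (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mimetype};base64,{encoded}"
```

For the downloaded chart, `to_data_url(response.content)` would produce the same string as the f-string used in the request.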

Create LLM

from langchain.chat_models import AzureChatOpenAI
from langchain.schema import HumanMessage

chat = AzureChatOpenAI(
    azure_deployment=os.environ['AZURE_OPENAI_DEPLOYMENT_NAME'],
    openai_api_version=os.environ["OPENAI_API_VERSION"],
    max_tokens=256,
)

Analyze image

chat.invoke(
    [
        HumanMessage(
            content=[
                {"type": "text", "text": "Describe the images as an alternative text"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64str}",
                        "detail": "auto",
                    },
                },
            ]
        )
    ]
)
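The message content is a plain list of dictionaries: one text part followed by one part per image. To make that structure easier to reuse, for example when sending more than one image in a single message, it could be assembled by a small function. This is a sketch only: `build_vision_content` is a hypothetical helper, and it assumes the same part format shown in the request above.

```python
def build_vision_content(prompt, image_urls, detail="auto"):
    """Build a message content list: one text part plus one image part per URL.

    Hypothetical helper; each URL may be an https URL or a base64 data URL.
    """
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append(
            {"type": "image_url", "image_url": {"url": url, "detail": detail}}
        )
    return content
```

The returned list would be passed as `HumanMessage(content=...)` in place of the literal list above.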

Result

AIMessage(content='The image is a graph titled "The U.S. Mortgage Rate Surge" showing the U.S. 30-year fixed-rate mortgage versus existing home sales from 2014 to 2023. The graph has two lines, one representing the mortgage rate (in percentage) and the other representing existing home sales (in millions). The mortgage rate line fluctuates between 2% and 8%, while the existing home sales line fluctuates between 3M and 7M. The graph indicates that in 2023, the U.S. 30-year fixed-rate mortgage has reached its highest level in over 20 years, with high mortgage rates, rising home prices, and a constrained housing inventory leading to U.S. housing affordability being at its lowest point since 1989. The source of the data is the National Association of Realtors, and the collaborators of the graph are Visual Capitalist with research and writing by Selin Oquz, art direction and design by Joyce Ma. The Visual Capitalist logo and social media icons are displayed at the bottom.')

Xin Cheng

Multi/Hybrid-cloud, Kubernetes, cloud-native, big data, machine learning, IoT developer/architect, 3x Azure-certified, 3x AWS-certified, 2x GCP-certified