Azure OpenAI GPT-4o with Langchain
In the previous post, we interacted with GPT-4o through the UI. How do we access it programmatically? We can follow the OpenAI sample notebook (I notice the audio functionality is provided by the Whisper model; hopefully it will be supported natively by GPT-4o in the future). In this post, I will access it using LangChain.
Set up environment variables
.env
OPENAI_API_TYPE=azure
AZURE_OPENAI_ENDPOINT=https://<azure openai resource name>.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-05-01-preview
AZURE_OPENAI_API_KEY=<azure openai api key>
AZURE_OPENAI_GPT4O_MODEL_NAME=gpt-4o
AZURE_OPENAI_GPT4O_DEPLOYMENT_NAME=gpt-4o
Load environment variables
import os
from dotenv import load_dotenv
load_dotenv()
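Optionally, a quick sanity check (just a sketch) to confirm the required variables were loaded before creating the client:
# Optional sanity check: fail fast if any required variable is missing
required = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_GPT4O_DEPLOYMENT_NAME",
]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {missing}")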
Create LLM
Use a chat-type LLM. We need to provide a system prompt along with user and assistant messages.
from langchain.chat_models import AzureChatOpenAI
llm = AzureChatOpenAI(
    openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    azure_deployment=os.getenv("AZURE_OPENAI_GPT4O_DEPLOYMENT_NAME"),
    temperature=0,
)
Basic text chat
from langchain_core.prompts.chat import ChatPromptTemplate
chat_template = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant. Help me with my math homework!"),
        ("human", "{user_input}"),
    ]
)
messages = chat_template.format_messages(
    user_input="Hello! Could you solve 2+2?"
)
ai_message = llm.invoke(messages)
print(ai_message.content)
Result
Of course! The sum of 2 + 2 is 4.
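The same call can also be made with message objects instead of a prompt template. A minimal sketch using langchain_core.messages (the AIMessage shows how a previous assistant turn would be included; the follow-up question is just an illustration):
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

messages = [
    SystemMessage(content="You are a helpful assistant. Help me with my math homework!"),
    HumanMessage(content="Hello! Could you solve 2+2?"),
    AIMessage(content="Of course! The sum of 2 + 2 is 4."),
    HumanMessage(content="Thanks! And what is 3+3?"),
]
print(llm.invoke(messages).content)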
Image understanding
from IPython.display import Image, display, Audio, Markdown
import base64
IMAGE_PATH = "data/US_Mortgage_Rate_Surge-Sept-11-1.jpg"
# Preview image for context
display(Image(IMAGE_PATH))
# Open the image file and encode it as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
base64_image = encode_image(IMAGE_PATH)
messages = [
    {"role": "system", "content": "You are a helpful assistant that responds in Markdown."},
    {"role": "user", "content": [
        {"type": "text", "text": "Describe the images as an alternative text"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
    ]},
]
ai_message = llm.invoke(messages)
print(ai_message.content)
Result
The image is a line graph titled "The U.S. Mortgage Rate Surge" with a subtitle "U.S. 30-Year Fixed-Rate Mortgage vs. Existing Home Sales." The graph shows data from 2014 to 2023. The y-axis on the left represents the mortgage rate percentage, ranging from 0% to 8%, while the y-axis on the right represents existing home sales in millions, ranging from 3M to 7M.
Two lines are plotted:
- The red line represents the mortgage rate, which fluctuates around 4% from 2014 to 2020, then spikes sharply in 2022, reaching above 6%.
- The blue line represents existing home sales, which fluctuates between 4M and 6M, with a noticeable dip in 2020 and a peak in 2021, followed by a decline in 2022 and 2023.
A text box on the right side of the graph states: "2023: With high mortgage rates, rising home prices, and a constrained housing inventory, U.S. housing affordability is at its lowest point since 1989." The sources of the data are FreddieMac and Trading Economics. The graph is created by Visual Capitalist, with collaborators listed as Selin Oguz (Research & Writing) and Joyce Ma (Art Direction & Design).
Video understanding
The video is transformed into frames first, since the model does not accept video input directly. Currently, Azure OpenAI GPT-4o only supports up to 10 images per request (a batching sketch for longer videos follows the summary result below).
import cv2
from moviepy.editor import VideoFileClip
import time
import base64
# We'll be using the OpenAI DevDay Keynote Recap video. You can review the video here: https://www.youtube.com/watch?v=h02ti0Bl6zk
VIDEO_PATH = "data/keynote_recap.mp4"
def process_video(video_path, seconds_per_frame=2):
    base64Frames = []
    base_video_path, _ = os.path.splitext(video_path)

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_to_skip = int(fps * seconds_per_frame)
    curr_frame = 0

    # Loop through the video and extract frames at the specified sampling rate
    while curr_frame < total_frames - 1:
        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
        curr_frame += frames_to_skip
    video.release()

    # Extract audio from the video
    audio_path = f"{base_video_path}.mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")
    clip.audio.close()
    clip.close()

    print(f"Extracted {len(base64Frames)} frames")
    print(f"Extracted audio to {audio_path}")
    return base64Frames, audio_path
# Extract 1 frame per second. You can adjust the `seconds_per_frame` parameter to change the sampling rate
base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=1)
## Display the frames and audio for context
from IPython.display import Image, display, Audio, Markdown
import base64
display_handle = display(None, display_id=True)
for img in base64Frames[:9]:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8")), width=600))
    time.sleep(0.025)
Audio(audio_path)
We need to format each element in the user content as {"type": "<type>", "<type name>": "<input>"}. Otherwise, there will be an unexpected input type error.
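To make the expected shape concrete, here is a small sketch; the rejected form is only a hypothetical example of an element without a "type" key, not a quote from any particular notebook:
# Accepted: every element carries an explicit "type" key
good_part = {"type": "image_url",
             "image_url": {"url": f"data:image/jpg;base64,{base64Frames[0]}", "detail": "low"}}

# Rejected (hypothetical example): a bare element without a "type" key
# bad_part = {"image": base64Frames[0]}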
# visual summary
# azure gpt-4o max image limit is 10
messages = [
    {"role": "system", "content": "You are generating a video summary. Please provide a summary of the video. Respond in Markdown."},
    {"role": "user", "content": [
        {"type": "text", "text": "These are the frames from the video."},
        *map(lambda x: {"type": "image_url",
                        "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}},
             base64Frames[:9]),
    ]},
]
ai_message = llm.invoke(messages)
print(ai_message.content)
Result
The video appears to be a summary of an event called "OpenAI DevDay." It starts with a title screen displaying "OpenAI DevDay" and transitions to a "Keynote Recap" screen. The video then shows the exterior of the event venue with a sign that reads "OpenAI DevDay," followed by a close-up of the OpenAI logo. The final frame depicts a bustling interior space with attendees moving around, suggesting a lively and well-attended event.
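Because of the 10-image limit, a longer video has to be summarized in batches. Below is a rough sketch of one possible approach; the map-then-merge strategy and the prompts are my own assumptions, not from the Azure documentation: summarize each batch of frames separately, then ask the model to merge the partial summaries.
def summarize_frames_in_batches(llm, frames, batch_size=10):
    # Summarize each batch of up to `batch_size` frames
    partial_summaries = []
    for i in range(0, len(frames), batch_size):
        batch = frames[i:i + batch_size]
        messages = [
            {"role": "system", "content": "You are generating a video summary. Respond in Markdown."},
            {"role": "user", "content": [
                {"type": "text", "text": "These are consecutive frames from the video."},
                *[{"type": "image_url",
                   "image_url": {"url": f"data:image/jpg;base64,{frame}", "detail": "low"}}
                  for frame in batch],
            ]},
        ]
        partial_summaries.append(llm.invoke(messages).content)

    # Ask the model to merge the per-batch summaries into one coherent summary
    merge_messages = [
        {"role": "system", "content": "Combine the partial summaries into one coherent video summary."},
        {"role": "user", "content": "\n\n".join(partial_summaries)},
    ]
    return llm.invoke(merge_messages).content

print(summarize_frames_in_batches(llm, base64Frames))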
Appendix
Model Availability
API version