Azure OpenAI GPT-4o with Langchain
In the previous post, we interacted with GPT-4o through the UI. How do we access it programmatically? We can follow the OpenAI sample notebook (I notice the audio functionality is provided by the Whisper model; hopefully it will be supported natively by GPT-4o in the future). In this post, I will access it using LangChain.
Set up environment variables
.env
OPENAI_API_TYPE=azure
AZURE_OPENAI_ENDPOINT=https://<azure openai resource name>.openai.azure.com/
AZURE_OPENAI_API_VERSION=2024-05-01-preview
AZURE_OPENAI_API_KEY=<azure openai api key>
AZURE_OPENAI_GPT4O_MODEL_NAME=gpt-4o
AZURE_OPENAI_GPT4O_DEPLOYMENT_NAME=gpt-4o
Load environment variables
import os
from dotenv import load_dotenv
load_dotenv()
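Optionally, a quick sanity check (just a sketch) to confirm the required variables were loaded before creating the client:
# Optional sanity check: fail fast if any required variable is missing
required = [
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_GPT4O_DEPLOYMENT_NAME",
]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {missing}")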
Create LLM
Use a chat-type LLM. We need to provide a system prompt along with user and assistant messages.
from langchain.chat_models import AzureChatOpenAI
llm = AzureChatOpenAI(
    openai_api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    azure_deployment=os.getenv("AZURE_OPENAI_GPT4O_DEPLOYMENT_NAME"),
    temperature=0,
)
Basic text chat
from langchain_core.prompts.chat import ChatPromptTemplate
chat_template = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant. Help me with my math homework!"),
        ("human", "{user_input}"),
    ]
)
messages = chat_template.format_messages(
    user_input="Hello! Could you solve 2+2?"
)
ai_message = llm.invoke(messages)
print(ai_message.content)
Result
Of course! The sum of 2 + 2 is 4.
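The same call can also be made with message objects instead of a prompt template. A minimal sketch using langchain_core.messages (the AIMessage shows how a previous assistant turn would be included; the follow-up question is just an illustration):
from langchain_core.messages import SystemMessage, HumanMessage, AIMessage

messages = [
    SystemMessage(content="You are a helpful assistant. Help me with my math homework!"),
    HumanMessage(content="Hello! Could you solve 2+2?"),
    AIMessage(content="Of course! The sum of 2 + 2 is 4."),
    HumanMessage(content="Thanks! And what is 3+3?"),
]
print(llm.invoke(messages).content)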
Image understanding
from IPython.display import Image, display, Audio, Markdown
import base64
IMAGE_PATH = "data/US_Mortgage_Rate_Surge-Sept-11-1.jpg"
# Preview image for context
display(Image(IMAGE_PATH))
# Open the image file and encode it as a base64 string
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")
base64_image = encode_image(IMAGE_PATH)
messages = [
    {"role": "system", "content": "You are a helpful assistant that responds in Markdown."},
    {"role": "user", "content": [
        {"type": "text", "text": "Describe the images as an alternative text"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
    ]},
]
ai_message = llm.invoke(messages)
print(ai_message.content)
Result
The image is a line graph titled "The U.S. Mortgage Rate Surge" with a subtitle "U.S. 30-Year Fixed-Rate Mortgage vs. Existing Home Sales." The graph shows data from 2014 to 2023. The y-axis on the left represents the mortgage rate percentage, ranging from 0% to 8%, while the y-axis on the right represents existing home sales in millions, ranging from 3M to 7M.
Two lines are plotted:
- The red line represents the mortgage rate, which fluctuates around 4% from 2014 to 2020, then spikes sharply in 2022, reaching above 6%.
- The blue line represents existing home sales, which fluctuates between 4M and 6M, with a noticeable dip in 2020 and a peak in 2021, followed by a decline in 2022 and 2023.
A text box on the right side of the graph states: "2023: With high mortgage rates, rising home prices, and a constrained housing inventory, U.S. housing affordability is at its lowest point since 1989." The sources of the data are FreddieMac and Trading Economics. The graph is created by Visual Capitalist, with collaborators listed as Selin Oguz (Research & Writing) and Joyce Ma (Art Direction & Design).
Video understanding
The video is transformed into frames first, since the model does not accept video input directly. Currently, Azure OpenAI GPT-4o only supports up to 10 images per request (a batching sketch for longer videos follows the summary result below).
import cv2
from moviepy.editor import VideoFileClip
import time
import base64
# We'll be using the OpenAI DevDay Keynote Recap video. You can review the video here: https://www.youtube.com/watch?v=h02ti0Bl6zk
VIDEO_PATH = "data/keynote_recap.mp4"
def process_video(video_path, seconds_per_frame=2):
    base64Frames = []
    base_video_path, _ = os.path.splitext(video_path)

    video = cv2.VideoCapture(video_path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = video.get(cv2.CAP_PROP_FPS)
    frames_to_skip = int(fps * seconds_per_frame)
    curr_frame = 0

    # Loop through the video and extract frames at the specified sampling rate
    while curr_frame < total_frames - 1:
        video.set(cv2.CAP_PROP_POS_FRAMES, curr_frame)
        success, frame = video.read()
        if not success:
            break
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
        curr_frame += frames_to_skip
    video.release()

    # Extract audio from the video
    audio_path = f"{base_video_path}.mp3"
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path, bitrate="32k")
    clip.audio.close()
    clip.close()

    print(f"Extracted {len(base64Frames)} frames")
    print(f"Extracted audio to {audio_path}")
    return base64Frames, audio_path
# Extract 1 frame per second. You can adjust the `seconds_per_frame` parameter to change the sampling rate
base64Frames, audio_path = process_video(VIDEO_PATH, seconds_per_frame=1)
## Display the frames and audio for context
from IPython.display import Image, display, Audio, Markdown
import base64
display_handle = display(None, display_id=True)
for img in base64Frames[:9]:
    display_handle.update(Image(data=base64.b64decode(img.encode("utf-8")), width=600))
    time.sleep(0.025)
Audio(audio_path)
We need to format each element in the user content as {"type": "<type>", "<type name>": "<input>"}. Otherwise, there will be an unexpected input type error.
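To make the expected shape concrete, here is a small sketch; the rejected form is only a hypothetical example of an element without a "type" key, not a quote from any particular notebook:
# Accepted: every element carries an explicit "type" key
good_part = {"type": "image_url",
             "image_url": {"url": f"data:image/jpg;base64,{base64Frames[0]}", "detail": "low"}}

# Rejected (hypothetical example): a bare element without a "type" key
# bad_part = {"image": base64Frames[0]}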
# visual summary
# azure gpt-4o max image limit is 10
messages = [
    {"role": "system", "content": "You are generating a video summary. Please provide a summary of the video. Respond in Markdown."},
    {"role": "user", "content": [
        {"type": "text", "text": "These are the frames from the video."},
        *map(lambda x: {"type": "image_url",
                        "image_url": {"url": f'data:image/jpg;base64,{x}', "detail": "low"}},
             base64Frames[:9]),
    ]},
]
ai_message = llm.invoke(messages)
print(ai_message.content)
Result
The video appears to be a summary of an event called "OpenAI DevDay." It starts with a title screen displaying "OpenAI DevDay" and transitions to a "Keynote Recap" screen. The video then shows the exterior of the event venue with a sign that reads "OpenAI DevDay," followed by a close-up of the OpenAI logo. The final frame depicts a bustling interior space with attendees moving around, suggesting a lively and well-attended event.
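Because of the 10-image limit, a longer video has to be summarized in batches. Below is a rough sketch of one possible approach; the map-then-merge strategy and the prompts are my own assumptions, not from the Azure documentation: summarize each batch of frames separately, then ask the model to merge the partial summaries.
def summarize_frames_in_batches(llm, frames, batch_size=10):
    # Summarize each batch of up to `batch_size` frames
    partial_summaries = []
    for i in range(0, len(frames), batch_size):
        batch = frames[i:i + batch_size]
        messages = [
            {"role": "system", "content": "You are generating a video summary. Respond in Markdown."},
            {"role": "user", "content": [
                {"type": "text", "text": "These are consecutive frames from the video."},
                *[{"type": "image_url",
                   "image_url": {"url": f"data:image/jpg;base64,{frame}", "detail": "low"}}
                  for frame in batch],
            ]},
        ]
        partial_summaries.append(llm.invoke(messages).content)

    # Ask the model to merge the per-batch summaries into one coherent summary
    merge_messages = [
        {"role": "system", "content": "Combine the partial summaries into one coherent video summary."},
        {"role": "user", "content": "\n\n".join(partial_summaries)},
    ]
    return llm.invoke(merge_messages).content

print(summarize_frames_in_batches(llm, base64Frames))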
Appendix
Model Availability
API version