Building Large Language Model-powered AI Applications
Challenges of building LLM-powered apps
In this article, I want to survey how to build AI applications powered by large language models and related emerging technologies. I have written several articles (1, 2, 3) on large language models and generative AI. There are two main challenges in building applications powered by LLMs:
- LLMs have no memory or state. How can we provide the LLM with the proper context from our own data?
- LLMs have a token limit (usually several thousand tokens). We cannot feed all our data at once (the limit is by design, and doing so would be expensive even if there were no such limit).
Emerging LLM Tech stack
The following article describes the main components of the emerging LLM tech stack. It places an embedding layer and an LLM programming framework in front of the LLM endpoint.
Machine learning tech stack with Large language model
A similar architecture is mentioned in this article.
However, how does this architecture solve the above challenges? Let’s look at the application flow described in the following article.
Separate knowledge from language model. This allows us to leverage the semantic understanding of our language model while also providing our users with the most relevant information.
The approach for this would be as follows:
- User asks a question
- Application finds the most relevant text that (most likely) contains the answer
- A concise prompt with relevant document text is sent to the LLM
- User receives an answer or a ‘No answer found’ response (a minimal sketch of this flow follows the list)
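A minimal sketch of this flow might look like the following; `search_knowledge_base` and `ask_llm` are hypothetical helpers standing in for the vector search and LLM endpoint discussed below.

```python
# Minimal sketch of the retrieve-then-answer flow described above.
# `search_knowledge_base` and `ask_llm` are hypothetical helpers standing in
# for your vector search and LLM API of choice.

def answer_question(question: str) -> str:
    # 1. Find the most relevant text for the question (semantic search).
    relevant_chunks = search_knowledge_base(question, top_k=3)

    # 2. Build a concise prompt that contains only the relevant document text.
    context = "\n\n".join(relevant_chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, reply 'No answer found'.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # 3. Send the prompt to the LLM and return its answer (or 'No answer found').
    return ask_llm(prompt)
```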
From the above article, we know that context is key. To ensure the language model has the right information to work with, we need to build a knowledge base that can be used to find the most relevant documents through semantic search. We also need to provide context within the limit the LLM can accept (we cannot just throw all our data at the LLM and hope it magically returns what we want). To do this, we need:
- Embeddings, which encapsulate the semantic relationships of text strings
- Vector search technology that can search based on semantic similarity
- A knowledge layer, which could combine embeddings/vector search and knowledge graph technology to find the correct context (a small similarity-search sketch follows this list)
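As a small illustration of embeddings and vector search, the following ranks documents by cosine similarity to a query. It assumes the open-source sentence-transformers library, but any embedding model or API could be substituted.

```python
# Sketch of semantic search over embeddings, assuming the open-source
# sentence-transformers library (any embedding model/API could be used instead).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our refund policy allows returns within 30 days.",
    "The data centre is located in Frankfurt.",
    "Support is available 24/7 via chat and email.",
]
doc_vectors = model.encode(documents)  # one vector per document

def semantic_search(query: str, top_k: int = 2) -> list[str]:
    q = model.encode([query])[0]
    # Cosine similarity between the query vector and every document vector.
    sims = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
    best = np.argsort(-sims)[:top_k]
    return [documents[i] for i in best]

print(semantic_search("Where are your servers hosted?"))
```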
With the right context provided, the LLM can now return what you want. However, users may not just want a text response but actions, so there is the concept of an autonomous agent that takes actions (connecting the digital world to the physical world). Once a workflow is involved, you also need some framework to orchestrate the different steps in that workflow.
Other articles on common problems building LLM apps
- Intra-conversation/short-term memory: the LLM does not keep state, and there is a context limit (e.g. GPT-3.5 has a context window of roughly 4K tokens). You can use buffer window memory, which works like ChatGPT: simply discard any messages that fall outside the context window, either by number of messages or by tokens (a sketch follows this list); or summarization, which summarizes past messages and attaches the summary as context for the conversation; or create a knowledge graph of the entities, their attributes, and their relationships; or use a vector store/database to save the entire conversation and query the top_k most relevant messages as context (which loses the sequential order of the conversation turns)
- Long-term memory with vector databases: chunking (fixed-size by tokens, split by sentence, overlapping chunking, recursive chunking, chunk by document format), embedding (fastText, SentenceTransformers, commercial APIs), storing to a vector database, retrieving relevant chunks, and sending them to the LLM in prompts
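As a sketch of the buffer-window option above, the following keeps only the most recent messages that fit within a token budget; tiktoken is assumed for token counting, and messages are simplified to plain strings.

```python
# Sketch of buffer-window memory: keep only the most recent messages that fit
# within a token budget. tiktoken is used for counting; a rough word count
# would also work.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def buffer_window(messages: list[str], max_tokens: int = 3000) -> list[str]:
    kept, used = [], 0
    # Walk backwards from the newest message and stop once the budget is spent.
    for msg in reversed(messages):
        n = len(enc.encode(msg))
        if used + n > max_tokens:
            break
        kept.append(msg)
        used += n
    return list(reversed(kept))  # restore chronological order
```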
Components of LLM-powered apps
Core components involved: LLMs, vector databases, agents
Use case 1: Using large language models with your own data to build a “corporate brain” for your organisation
Embeddings are numeric representations of text, held in multi-dimensional vectors, that measure the relatedness of text strings (a sketch of one of the use cases below follows the list).
Embeddings are typically used in the following use cases (as per OpenAI):
- Search (where results are ranked by relevance to a query string)
- Clustering (where text strings are grouped by similarity)
- Recommendations (where items with related text strings are recommended)
- Anomaly detection (where outliers with little relatedness are identified)
- Diversity measurement (where similarity distributions are analysed)
- Classification (where text strings are classified by their most similar label)
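As one illustration, the following sketches the "classification by most similar label" use case with OpenAI embeddings. It assumes the classic openai v0.x Python client; newer client versions expose client.embeddings.create instead.

```python
# Sketch: classify text by its most similar label using OpenAI embeddings
# (classic openai v0.x client assumed).
import numpy as np
import openai

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

labels = ["billing question", "technical issue", "feature request"]
label_vecs = [embed(label) for label in labels]

def classify(text: str) -> str:
    v = embed(text)
    # Pick the label whose embedding has the highest cosine similarity.
    sims = [v @ lv / (np.linalg.norm(v) * np.linalg.norm(lv)) for lv in label_vecs]
    return labels[int(np.argmax(sims))]

print(classify("My invoice shows the wrong amount"))
```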
Use case 2: Integrating “tools” into LLMs
Agents use an LLM to determine which actions to take, and in what order, when analysing a user query, whereas tools are functions that agents can use to interact with the world. Examples of tools include:
- python_repl (Python shell to execute Python commands)
- serpapi (search engine)
- wolfram-alpha (search engine for answering questions about math, science, technology, culture, society and everyday life)
- requests (get content from a URL)
- terminal (execute a command)
- llm-math (answer questions about math)
- open-meteo-api (get weather information from the OpenMeteo API)
- news-api (get information about the top headlines of current news stories)
- google-search (wrapper around Google Search)
- wikipedia (wrapper around Wikipedia)
The author gives an example that uses serpapi to look up the date of an event and llm-math to calculate how many days ago it was (a sketch follows).
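A minimal sketch of that example using LangChain's classic agent API might look like the following; module paths and agent names vary across LangChain versions, and SERPAPI_API_KEY / OPENAI_API_KEY are assumed to be set in the environment.

```python
# Sketch of the serpapi + llm-math example with LangChain's classic agent API.
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)

agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)

# The agent decides to call the search tool for the event date, then the math
# tool to compute how many days ago that was.
agent.run("On what date was the last FIFA World Cup final, and how many days ago was that?")
```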
Techniques to overcome the token limit of LLMs
Chunking
During data preprocessing, you need to break very large documents down into chunks because, later, when you create embeddings, there is also a limit on how long the input can be.
Since large language models have token limits, you need to apply chunking strategies to break large documents down to fit within the LLM's token limit. Chunking strategies (a small sketch follows the list):
- By paragraph with no overlapping, e.g. three paragraphs as a chunk (spaCy's doc.sents splits a document into sentences, which can be grouped into paragraph-sized chunks)
- By paragraph with overlapping, e.g. 1–3, 2–4, etc.
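A minimal sketch of paragraph chunking with optional overlap, assuming paragraphs are blank-line-separated blocks (spaCy or another splitter could be used instead):

```python
# Sketch of chunking by paragraph, with and without overlap. Paragraphs are
# assumed to be blank-line-separated blocks of text.

def split_paragraphs(text: str) -> list[str]:
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def chunk_paragraphs(paragraphs: list[str], size: int = 3, overlap: int = 0) -> list[str]:
    step = max(1, size - overlap)  # size=3, overlap=2 -> paragraphs 1-3, 2-4, 3-5, ...
    return [
        "\n\n".join(paragraphs[i:i + size])
        for i in range(0, len(paragraphs), step)
    ]

# Example usage (document_text is the raw text of a large document):
# chunks = chunk_paragraphs(split_paragraphs(document_text), size=3, overlap=2)
```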
Similar techniques are mentioned in the article below.
The following articles share a similar pattern of using text embeddings, a vector database, similarity search, and a GPT/large language model (knowledge embedding) to create a more intelligent chatbot:
Customize OpenAI’s GPT-3 to give an accurate answer based on your knowledge base and stay on a specific topic.
- Create a knowledge base using embeddings (stored in Chroma, the AI-native open-source embedding database); a minimal Chroma sketch follows this list.
- Run a semantic search of the knowledge base using the user's question.
- Include the semantic search result(s) in the prompt with the same user question.
- Ask OpenAI GPT-3 to find the answer within the semantic search result(s).
- If GPT-3 finds an answer, it returns the answer.
- If GPT-3 does not find an answer, it returns “I’m sorry, but the given context does not provide information on …..”
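A minimal sketch of the first two steps using Chroma might look like this (Chroma's default embedding function is used here; an OpenAI embedding function could be plugged in instead):

```python
# Sketch: build a small knowledge base in Chroma and run a semantic search for
# the user's question.
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="knowledge_base")

collection.add(
    documents=[
        "Our premium plan costs $49 per month.",
        "Refunds are processed within 5 business days.",
    ],
    ids=["doc1", "doc2"],
)

question = "How long do refunds take?"
results = collection.query(query_texts=[question], n_results=1)
context = results["documents"][0][0]  # the most relevant chunk

# The remaining steps include `context` in the prompt and ask GPT-3 to answer
# from it, returning "I'm sorry, but the given context does not provide
# information on ..." when the answer is not present.
```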
There are currently two main ways to extend the GPT models with your knowledge base:
- Finetuning (covered in this post): a straightforward approach, but you have no control over the model's responses apart from the initial prompt engineering.
- Embeddings: a better approach to extend the model's domain-specific knowledge, allowing more flexibility and control over the generated output.
Process: Use the OpenAI embeddings API to get embeddings for the document, store them in a vector database (e.g. Pinecone, Weaviate) where you can search for similar text based on your question, use your existing knowledge base as the ground truth/context, and then pass this as a prompt to OpenAI ChatGPT (a sketch of this last step follows).
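A sketch of that last step, passing the retrieved context to OpenAI ChatGPT, might look like this; the classic openai v0.x Python client is assumed (newer clients use client.chat.completions.create).

```python
# Sketch: answer a question with retrieved knowledge-base text as context
# (classic openai v0.x client assumed).
import openai

def answer_with_context(question: str, context: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the answer is not in the context, say you don't know."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response["choices"][0]["message"]["content"]
```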
To create TaxGPT, the author takes the following steps (a sketch of the two-stage retrieval follows the list):
- An embedding database of the Internal Revenue Codes was created, which was scraped from Bloomberg Tax.
- An embedding database of the Internal Revenue Regulations was created, which was scraped from Internal Revenue Service.
- The embedding database of Internal Revenue Codes was queried using the tax question, which yields a list of applicable Internal Revenue Codes (I.R.C.). This was done to assist in querying the Internal Revenue Regulations, since that database is large.
- This list of I.R.C. was then appended to the tax question, and the embedding database of Internal Revenue Regulations was queried.
- Finally, using GPT on search results, an answer that includes relevant citations can be generated.
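A sketch of this two-stage retrieval, with hypothetical helpers `query_collection` and `generate_answer` standing in for the two embedding databases and the GPT call, might look like this:

```python
# Sketch of the two-stage retrieval described above. `query_collection` and
# `generate_answer` are hypothetical helpers over the embedding databases and
# the GPT API.

def tax_answer(question: str) -> str:
    # Stage 1: find applicable Internal Revenue Codes for the question.
    irc_sections = query_collection("internal_revenue_codes", question, top_k=5)

    # Stage 2: append the I.R.C. list to the question to narrow the search over
    # the much larger regulations database.
    expanded_query = question + "\nApplicable I.R.C.: " + ", ".join(irc_sections)
    regulations = query_collection("internal_revenue_regulations", expanded_query, top_k=5)

    # Finally, ask GPT to answer with citations drawn from the retrieved text.
    return generate_answer(question, context=irc_sections + regulations)
```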
In the next few articles, I will look into the details of these enabling technologies.