Open Source LLMs

Large Language Models that can run locally

Xin Cheng
5 min read · Oct 8, 2024

While ChatGPT, GPT-4o, Anthropic Claude, and Google Gemini are popular LLMs, they are proprietary and accessible only through the cloud and APIs. Sometimes we want to run an LLM locally, and that is where open-source LLMs come in. Here are some articles about open-source LLMs.

Overview

Definition of open-source LLMs: the code and weights are publicly available, the model can be customized without restrictions, there is transparency into the model design, and innovation is community-driven.

Benefit: avoid vendor lock-in

Open-source LLM use cases: Text Content Creation, Conversational Assistants, Data Analysis and Insights Generation, Edge Computing/On-Device Inference

Benefits of open-source LLMs: Transparency (insight into how the model is trained), Data Security and Privacy (avoid sending sensitive data to cloud LLMs)

More benefits: Collaboration and Innovation (researchers and developers from various industries can contribute), Customization (you can easily adapt the model to your domain data), and Cost (you can combine small and large LLMs in your use case instead of sending every request to an expensive, powerful cloud LLM).

Downsides: Hallucination, Hidden Costs (you need technical experts in data science, machine learning, and software engineering), and IP concerns (who has rights to proprietary data, and how can the app be monetized under the open-source license?)

The most well-known open-source LLM is Meta’s Llama, first released 2–3 months after ChatGPT. Llama 2 followed on July 18, 2023, and Llama 3.1 on July 23, 2024; the latest version, 3.2, arrived on September 25, 2024.

Examples of popular open-source LLMs include:

  • Llama series (2, 3)
  • Mistral series (7B, 8x7B, etc.)
  • Falcon 180B
  • Grok AI
  • MPT series (7B, 30B)
  • Hugging Face BLOOM

Examples of popular closed-source LLMs include:

  • OpenAI GPT series (3.5, 4, 4o, o1)
  • Google Gemini
  • Anthropic Claude series (2, 3, 3.5)

Llama 3.1 is available in 8B and 70B sizes (better performance than Claude Sonnet and GPT-3.5), while the 405B model is competitive with leading foundation models across a range of tasks, including GPT-4, GPT-4o, and Claude 3.5 Sonnet.

DBRX is an LLM built on a transformer-based, decoder-only framework with a sophisticated mixture-of-experts (MoE) architecture; it also has a specialized version designed for instruction-following tasks, known as DBRX Instruct.
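To make “mixture-of-experts” concrete, here is a toy top-k routing sketch in plain NumPy. It is illustrative only, not DBRX’s actual implementation; the dimensions and random weights are made up.

```python
# Toy top-k mixture-of-experts routing, the core idea behind MoE blocks:
# a router scores experts per token and only the top-k experts run.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2
router_w = rng.normal(size=(d_model, n_experts))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                      # router scores, one per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

print(moe_forward(rng.normal(size=d_model)))
```

The payoff is that only top_k of n_experts run per token, so parameter count grows much faster than per-token compute.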

The article splits RAG evaluation by context length: Short Context RAG (SCR) (< 5,000 tokens), Medium Context RAG (MCR) (5,000 to 25,000 tokens), and Long Context RAG (LCR) (40,000 to 100,000 tokens). It leveraged ChainPoll to detect hallucinations.
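ChainPoll’s core recipe is to poll a judge LLM several times with chain-of-thought prompting and aggregate the verdicts. Here is a minimal sketch of that idea, where `ask_llm` is a hypothetical stand-in for any chat-completion call:

```python
# Minimal sketch of the ChainPoll idea: poll a judge LLM several times with
# chain-of-thought and score by the fraction of "hallucinated" verdicts.
# `ask_llm` is a hypothetical helper standing in for any chat-completion call.

JUDGE_PROMPT = """Context:
{context}

Answer:
{answer}

Think step by step: is every claim in the answer supported by the context?
Finish with a single line: VERDICT: SUPPORTED or VERDICT: HALLUCINATED."""

def chainpoll_score(context: str, answer: str, ask_llm, polls: int = 5) -> float:
    """Return the fraction of polls that judge the answer as hallucinated."""
    votes = 0
    for _ in range(polls):
        judgment = ask_llm(JUDGE_PROMPT.format(context=context, answer=answer))
        if "VERDICT: HALLUCINATED" in judgment.upper():
            votes += 1
    return votes / polls
```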

Most open-source LLMs do not support long contexts, while Llama 3.1 and Qwen2 are the best-performing open-source LLMs in short and medium contexts.

Tips from the article regarding summarization with open-source LLMs: the Mistral-based OpenChat and Zephyr LLMs, though small, produce excellent results; if their training were scaled up to bigger LLMs, they would likely outperform all other LLMs.

To get close to ChatGPT, and even surpass it in quality, always fine-tune your open-source LLMs for your domain, its terminology, and its content structures! Base open-source LLMs can never produce the best-quality summaries you need because of their generic training. Instead, plan for both supervised and RLHF fine-tuning to condition the LLM to your domain’s concepts as well as your users’ expectations of summary structure and information quality.
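As a rough illustration of the supervised half of that plan, here is a minimal fine-tuning sketch using Hugging Face’s trl library (assuming a recent trl version; the model name, dataset file, and output directory are placeholders for your own domain corpus):

```python
# Minimal supervised fine-tuning sketch with Hugging Face TRL.
# The model, dataset, and output dir are placeholders; the dataset should be
# your domain (document, summary) pairs rendered into a "text" column.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="domain_summaries.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B",       # any open-source base model
    train_dataset=dataset,                  # expects a "text" column by default
    args=SFTConfig(output_dir="llama3-summarizer"),
)
trainer.train()
```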

Another technique to improve the quality of summaries is to be smart about how you chunk your documents. The first level of chunking should break the document into logical sections that are natural to your domain’s documents. Subsequent levels can target subsections if possible or default to token-based chunking.
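A minimal two-level chunking sketch along those lines follows; the heading regex, the word-count token proxy, and the chunk size are assumptions to adapt to your own documents:

```python
# Two-level chunking sketch: first split on heading-like boundaries that are
# natural section breaks, then fall back to fixed-size token chunks.
import re

def chunk_document(text: str, max_tokens: int = 512) -> list[str]:
    # Level 1: split before markdown-style or underlined headings.
    sections = re.split(r"\n(?=#+ |[A-Z][^\n]{0,80}\n[-=]{3,})", text)
    chunks = []
    for section in sections:
        words = section.split()              # crude token proxy
        if len(words) <= max_tokens:
            chunks.append(section.strip())
        else:                                # Level 2: token-based split
            for i in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[i:i + max_tokens]))
    return [c for c in chunks if c]
```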

Finally, better prompting strategies like PEARL, which first reason about the content before summarizing, can improve overall quality.
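The real PEARL prompts are more elaborate, but the plan-then-execute shape looks roughly like this (again with a hypothetical `ask_llm` helper):

```python
# Rough shape of plan-then-execute prompting in the spirit of PEARL:
# first ask the model to plan what the summary must cover, then execute
# the plan. Not the actual PEARL prompts, just the overall pattern.
PLAN_PROMPT = (
    "Document:\n{document}\n\n"
    "Before summarizing, list the key entities, events, and their "
    "relations as a short numbered plan of what the summary must cover."
)

EXECUTE_PROMPT = (
    "Document:\n{document}\n\nPlan:\n{plan}\n\n"
    "Follow the plan step by step and write a faithful summary."
)

def pearl_style_summary(document: str, ask_llm) -> str:
    plan = ask_llm(PLAN_PROMPT.format(document=document))
    return ask_llm(EXECUTE_PROMPT.format(document=document, plan=plan))
```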

Multimodal

https://scifilogic.com/best-local-multimodal-llm

MoE-LLaVA, LLaVA, MiniGPT, Fuyu-8B

Llama 3.2 includes small and medium-sized vision LLMs (11B and 90B) and lightweight, text-only models (1B and 3B), all with a 128K-token context length.
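One easy way to try a local vision model is through the official ollama Python client. A sketch, assuming `pip install ollama`, a running Ollama server, a pulled `llava` model, and a local image file `photo.jpg`:

```python
# Query a local multimodal model through Ollama's Python client.
# Assumes the Ollama server is running and `ollama pull llava` has been done.
import ollama

response = ollama.chat(
    model="llava",
    messages=[{
        "role": "user",
        "content": "Describe what is happening in this image.",
        "images": ["./photo.jpg"],          # local file path or raw bytes
    }],
)
print(response["message"]["content"])
```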

https://klu.ai/blog/open-source-llm-models

Options to run open source LLMs locally

Ollama, LM Studio, Jan, Llama.cpp

Ollama is an open-source project with an active community, Docker container support, support for many models (including multimodal ones), and a web UI.

The article walks through running the Ollama Docker container and the Ollama Web UI container, using Ollama’s model library, and testing the Mixtral and LLaVA 34B models.
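The article drives everything through Docker and the web UI; as an alternative, the same model library is reachable from Ollama’s Python client. A sketch, assuming the server is already running locally:

```python
# Pull a model from Ollama's library and run a prompt against it.
# Assumes `pip install ollama` and a local Ollama server on the default port.
import ollama

ollama.pull("mixtral")                      # downloads the model if missing
result = ollama.generate(
    model="mixtral",
    prompt="Summarize the trade-offs of running LLMs locally.",
)
print(result["response"])
```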

Serving LLMs in the cloud: while Ollama performs well at lower request rates (up to 0.6 req/s), its performance in high-load scenarios, which are typical of real cloud deployments, may be inadequate in throughput, Time to First Token (TTFT), Time per Output Token (TPOT), and Token Generation Rate (TGR). BentoML’s OpenLLM has the advantage over Ollama here.
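For intuition about those metrics, TTFT and TPOT can be measured by hand against a local streaming endpoint. Here is a rough sketch against Ollama’s `/api/generate` (a single-request probe, not a rigorous load test; the model name is a placeholder):

```python
# Measure Time to First Token (TTFT) and Time per Output Token (TPOT)
# against a local Ollama server's streaming endpoint (default port 11434).
import json
import time
import urllib.request

def measure(model: str, prompt: str, host: str = "http://localhost:11434"):
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": True}).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    first, tokens = None, 0
    with urllib.request.urlopen(req) as resp:
        for line in resp:                    # one JSON object per streamed chunk
            chunk = json.loads(line)
            if first is None:
                first = time.perf_counter() - start   # TTFT
            if not chunk.get("done"):
                tokens += 1
    total = time.perf_counter() - start
    tpot = (total - first) / max(tokens - 1, 1)       # TPOT
    print(f"TTFT {first:.2f}s, TPOT {tpot * 1000:.0f}ms over {tokens} tokens")

measure("llama3.1", "Explain mixture-of-experts in one paragraph.")
```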

Appendix

The leaderboard of open-source LLMs is constantly changing; use it to guide model selection.

Model directory


Written by Xin Cheng

Multi/Hybrid-cloud, Kubernetes, cloud-native, big data, machine learning, IoT developer/architect, 3x Azure-certified, 3x AWS-certified, 2x GCP-certified