Machine Learning stories roundup 2024.7
Local LLMs
RAG using the local LLM GPT4All to avoid sending sensitive data to a third-party service
GPT4All contains a number of models: a commercially licensable model based on GPT-J, a non-commercially licensable model based on Llama 13B, and a non-commercially licensable chat model based on MPT. The author tried two of them, ggml-gpt4all-j-v1.3-groovy.bin and ggml-gpt4all-l13b-snoozy.bin, and found ggml-gpt4all-l13b-snoozy.bin (an 8.14 GB model) to be much more accurate.
Example code
from gpt4all import GPT4All

gpt = GPT4All("ggml-gpt4all-l13b-snoozy.bin")
messages = [{"role": "user", "content": "What is the national flower of Canada?"}]
response = gpt.chat_completion(messages)  # returns an OpenAI-style response dict in the older gpt4all bindings
The article also shows a Gradio web UI integrating GPT4All; a sketch of that kind of front end follows.
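A minimal sketch of such a Gradio front end, assuming the older gpt4all Python bindings used above (the OpenAI-style response format is an assumption and may differ across versions):

import gradio as gr
from gpt4all import GPT4All

gpt = GPT4All("ggml-gpt4all-l13b-snoozy.bin")

def answer(question):
    # Ask the local model and pull the text out of the response dict
    response = gpt.chat_completion([{"role": "user", "content": question}])
    return response["choices"][0]["message"]["content"]

demo = gr.Interface(fn=answer, inputs="text", outputs="text", title="Local GPT4All chat")
demo.launch()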
Computer vision
VoxelGPT combines the power of large language models (LLMs) with FiftyOne's flexible computer vision query language to get insights about your image and video datasets without writing code. It provides a chat-like interface for interacting with your computer vision dataset by translating natural language queries into FiftyOne Python syntax and constructing the resulting DatasetView. This helps flatten the steep learning curve of mastering the FiftyOne query language.
Stable Diffusion and ControlNet: better control for text-to-image and image-to-image generation. ControlNet is a neural network structure that controls diffusion models by adding extra conditions. It augments Stable Diffusion with conditional inputs such as scribbles, edge maps, segmentation maps, and pose keypoints during text-to-image generation, so the generated image stays much closer to the input image. This is a big improvement over chaining traditional methods such as image-to-image generation, e.g. converting the image to a Canny edge image and generating from that edge image with your own text, or converting the image to an OpenPose bone image and generating from the pose.
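As a rough illustration (not from the article), Canny-conditioned generation with the diffusers library typically looks like the following; the checkpoint names are the commonly used public ones and the prompt and file paths are placeholders:

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Turn the input image into a Canny edge map to use as the extra condition
image = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(image, 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

result = pipe("a futuristic city at night", image=canny_image, num_inference_steps=30)
result.images[0].save("output.png")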
The VoxelGPT article focuses on the use case of understanding visual data: the open-source FiftyOne library lets you programmatically filter, sort, and semantically slice datasets of images, videos, and 3D point clouds, but that means learning the Python library's syntax. Can we use natural language to accomplish the same understanding?
Example prompts to generate FiftyOne code using natural language
Your task is to convert input natural language queries into Python code to generate ViewStages for the computer vision library FiftyOne.
Here is your first natural language query: “Images that only contain dogs”
Give me the FiftyOne code.
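For the query above, the generated view code might look roughly like the following sketch (assuming the detections live in a field called ground_truth; the exact ViewStages VoxelGPT emits will differ):

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

# Keep samples that have at least one "dog" detection and no non-dog detections
only_dogs = dataset.match(
    F("ground_truth.detections.label").contains("dog")
    & (F("ground_truth.detections").filter(F("label") != "dog").length() == 0)
)
print(only_dogs)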
Online object detection services and the open-source YOLO model: a performance comparison
https://ai.plainenglish.io/meta-empowered-ai-with-vision-and-released-open-source-data-910e276a1a41
Segment Anything Model (SAM) can cut any object out of any image with a single click. Capabilities:
- SAM can segment objects effortlessly by clicking on them, interactively selecting points to include or exclude from the object, or by drawing bounding boxes and using the polygon tool to segment regions, which snap to the object.
- In situations where there is ambiguity in identifying the object to be segmented, SAM can produce multiple valid masks, which is impressive.
- SAM can automatically identify and generate masks for all objects present in an image, which is a valuable feature.
Impact: improve AI-assisted labelling and decrease the need for manual labour in tasks involving image segmentation, which can have a significant impact on industries such as agriculture, retail, medical imagery, and geospatial imagery.
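A minimal sketch of the point-prompt workflow described above, using Meta's segment-anything package (the checkpoint path and the example click coordinates are assumptions):

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (downloaded separately from the segment-anything repo)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One foreground click; label 1 means include, 0 means exclude
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)  # multiple candidate masks when the click is ambiguous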
DINOv2 is a cutting-edge method for training computer vision models using self-supervised learning. It allows the model to learn from any collection of images without needing labels or metadata, so no large amounts of labeled data are required.
Real-world applications: object identification, depth estimation, object classification, object retrieval, image data curation
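A small sketch of using DINOv2 as a frozen feature extractor via torch.hub (the image path and the ImageNet-style preprocessing values are generic assumptions):

import torch
from PIL import Image
from torchvision import transforms

# Load a small DINOv2 backbone from torch.hub (weights download on first use)
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    embedding = model(img)  # (1, 384) global image feature for ViT-S/14
print(embedding.shape)

These frozen embeddings can then feed downstream tasks such as classification, retrieval, or data curation without fine-tuning the backbone.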
Video generation platform
Gen-2: RunwayML's contribution
An open-source model that performs complex vision-language tasks (e.g. generating detailed image descriptions and creating websites from handwritten drafts), utilizing the open-source Vicuna model as its language decoder and the vision components of the BLIP-2 vision-language model as its visual encoder
Capabilities
- Detailed image description generation
- Website creation from hand-written drafts
- Writing stories and poems inspired by given images
- Providing solutions to problems shown in images
- Teaching users how to cook based on food photos
Use the open-source OnnxStream project to generate images with Stable Diffusion XL Turbo on a Raspberry Pi
https://medium.com/mlearning-ai/how-to-run-dreambooth-locally-a-step-by-step-gyu-88c028ab01a4
DreamBooth is a tool to fine-tune an existing text-to-image model like Stable Diffusion using only a few of your own images. You can install it as an extension in the Stable Diffusion WebUI.
Large Language Models
Talk-to-your-codebase solution. It follows the typical LangChain-based Q&A recipe, treating the code base as text files (see the sketch below):
- Ingestion: load and split the documents, chunk the files, and add them to DeepLake to create embeddings
- Conversational retrieval / retrieval-augmented generation: initialize the DeepLake retriever, configure the retrieval settings, and use ConversationalRetrievalChain to wire up the retriever and the LLM
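A condensed sketch of that recipe using the classic LangChain and DeepLake APIs (the file path, dataset path, and chunk sizes are placeholder assumptions):

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

# Ingestion: load, chunk, and embed the code files into a DeepLake dataset
docs = TextLoader("src/main.py").load()
chunks = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(docs)
db = DeepLake.from_documents(chunks, OpenAIEmbeddings(), dataset_path="./deeplake_code")

# Conversational retrieval: wire the retriever and the LLM together
retriever = db.as_retriever(search_kwargs={"k": 4})
qa = ConversationalRetrievalChain.from_llm(ChatOpenAI(temperature=0), retriever=retriever)
print(qa({"question": "What does main.py do?", "chat_history": []})["answer"])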
Multimodal large language models: using language models as general-purpose interfaces to perceive beyond text (e.g. visual and audio inputs)
Problem: different vision problems often require different models, which have to be manually selected and composed for each use case. One way to address this is to combine the vision and language modules into a single end-to-end model; research systems such as Flamingo and PaLM-E explore this direction by encoding vision signals into special text tokens or features that the language module can understand, enabling the system to use the language module to understand user queries and provide responses.
Microsoft MM-REACT instead enhances ChatGPT's visual understanding by composing numerous vision experts (computer vision models, especially Azure Cognitive Services vision APIs, e.g. image captioning, object tagging, face recognition, OCR/form recognizer, celebrity recognition, Bing search/visual search).
At its core, MM-REACT combines individual vision models with the language model in a more flexible manner to handle complicated visual understanding (hence the name MM, for multimodal). A file path is used as a placeholder and passed to ChatGPT so that the system can accept images as input. Whenever the system requires specific information from the image, such as a celebrity name or box coordinates, ChatGPT asks a specific vision component from Azure, e.g. OCR, for help. The vision component's output is then serialized as text and combined with the input to prompt ChatGPT further (a sketch of this pattern follows).
If you go to the MM-REACT repo, you will see a few environment variables defined for the Azure Cognitive Services resource URLs and API keys.
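An illustrative sketch of that serialization pattern (not MM-REACT's actual code; the expert functions are hypothetical stand-ins for the Azure vision calls):

def run_ocr(image_path: str) -> str:
    # Hypothetical stand-in for an Azure OCR / Form Recognizer call
    return "EXAMPLE RECEIPT  TOTAL: $12.50"

def run_celebrity_recognition(image_path: str) -> str:
    # Hypothetical stand-in for the Azure celebrity recognition service
    return "no celebrities found"

VISION_EXPERTS = {"ocr": run_ocr, "celebrity": run_celebrity_recognition}

def invoke_expert(name: str, image_path: str, chat_history: list) -> list:
    """Run a vision expert, serialize its output as text, and append it to the chat."""
    observation = VISION_EXPERTS[name](image_path)
    chat_history.append({"role": "system", "content": f"<{name} output> {observation}"})
    return chat_history  # the updated history is sent back to ChatGPT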
For complex tasks, the low-code LLM interaction can be completed in four steps; it helps in situations where you can break a task down into subtasks with clear jump logic (a rough sketch follows the list):
I. A Planning LLM generates a highly structured workflow for complex tasks.
II. Users edit the workflow with predefined low-code operations (including adding/removing steps by clicking buttons; modifying step names or descriptions by clicking and text editing; adding/removing a jump logic by clicking; changing the processing order by dragging; extending a step in the flowchart by clicking the button; regeneration and confirmation by clicking buttons).
III. An Executing LLM generates responses with the reviewed workflow.
IV. Users continue to refine the workflow until satisfactory results are obtained.
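A rough sketch of the Planning-LLM / Executing-LLM split (the call_llm helper is hypothetical; the paper's actual prompts and workflow format differ):

import json

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your chat-completion API of choice."""
    raise NotImplementedError  # replace with a real API call returning the model's text

def plan(task: str) -> list:
    # Step I: the Planning LLM drafts a structured workflow (steps plus jump logic) as JSON
    prompt = ("Break the following task into numbered steps with optional jump logic. "
              f"Return a JSON list of steps.\nTask: {task}")
    return json.loads(call_llm(prompt))

def execute(task: str, workflow: list) -> str:
    # Step III: the Executing LLM generates a response that follows the reviewed workflow
    prompt = f"Task: {task}\nFollow exactly this reviewed workflow:\n{json.dumps(workflow, indent=2)}"
    return call_llm(prompt)

# Step II happens between the two calls: the user edits the workflow through low-code
# operations (add/remove steps, edit text, change jump logic, reorder), and step IV
# repeats the loop until the result is satisfactory.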
https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate
Accelerator.split_between_processes() is a context manager (which also exists in PartialState and AcceleratorState). It will automatically split whatever data you pass to it (be it a prompt, a set of tensors, a dictionary of the prior data, etc.) across all the processes (with a potential to be padded) for you to use right away.
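A minimal sketch following the distributed-inference guide linked above; each process receives its own slice of the prompts:

from accelerate import PartialState

state = PartialState()
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

with state.split_between_processes(prompts) as my_prompts:
    for prompt in my_prompts:
        # run the model on this process's share of the prompts
        print(f"rank {state.process_index}: {prompt}")

Launch it with accelerate launch --num_processes 2 script.py and each rank only sees and processes its own subset.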
https://www.youtube.com/watch?v=9G9GY8s-tHY
InstructEval paper: MMLU (problem solving), DROP (math-based discrete reasoning), BBH (logical deduction, fallacy detection), CRASS (causal reasoning capability), HumanEval (code-generation models), the InstructEval benchmark leaderboard from declare-lab, and HELM from Stanford
7 Patterns mentioned
- Evals: To measure performance
- RAG: To add recent, external knowledge
- Fine-tuning: To get better at specific tasks
- Caching: To reduce latency & cost
- Guardrails: To ensure output quality
- Defensive UX: To anticipate & manage errors gracefully
- Collect user feedback: To build our data flywheel
Creative
Create music lyrics with GPT-4, example prompts
Create music chords with GPT-4, example prompts
Prompt: create melody in guitar tab format, pre-chorus
Using GPT models to create music
Processing music is not at all easy for several reasons:
- Data: getting human-labeled speech data is a much more expensive and time-consuming task than scraping web text, and there is much less material, hence less data.
- Compute: processing audio is computationally much more expensive.
One solution is to use an LLM as an interface: the LLM talks to foundation models dedicated to audio and to a speech input/output interface (ASR, TTS), as in HuggingGPT. ChatGPT is used to understand the user's intention and map the task to a specific model; that model's results are sent back to ChatGPT and then on to the user.
Interesting use cases: image to audio, audio to face, sound event detection, noise elimination
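A toy sketch of that dispatch pattern using Hugging Face pipelines (the hard-coded task registry is a simplification; HuggingGPT selects tasks and models dynamically via ChatGPT):

from transformers import pipeline

# Hypothetical, hard-coded registry of audio "expert" models
EXPERTS = {
    "automatic-speech-recognition": lambda: pipeline("automatic-speech-recognition"),
    "audio-classification": lambda: pipeline("audio-classification"),  # e.g. sound event detection
}

def run_task(task_name: str, audio_path: str) -> str:
    """Run the selected expert and serialize its output as text for the LLM."""
    expert = EXPERTS[task_name]()  # lazily build the pipeline
    return str(expert(audio_path))

# In HuggingGPT, ChatGPT would pick the task; here it would be hard-coded, e.g.:
# transcript = run_task("automatic-speech-recognition", "speech_sample.wav")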
Others
Families of anomaly/outlier detection methods (a sketch of a few of them follows):
- Rule-based methods
- Statistics-based methods: IQR, z-score, Minimum Covariance Determinant (MCD), Gaussian Mixture (GM)
- Distance-based methods
- Linear models
- Density-based methods
- Tree-based methods
- Graph-based models
- Autoencoders
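As a hedged illustration of a few of these families on toy data (the method choices and parameters are mine, not from the source):

import numpy as np
from sklearn.covariance import EllipticEnvelope  # statistics-based (MCD)
from sklearn.ensemble import IsolationForest  # tree-based
from sklearn.neighbors import LocalOutlierFactor  # density-based

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),  # inliers
    rng.uniform(-6, 6, size=(10, 2)),  # a few outliers
])

for name, detector in [
    ("MCD / EllipticEnvelope", EllipticEnvelope(contamination=0.05)),
    ("Isolation Forest", IsolationForest(contamination=0.05, random_state=0)),
    ("Local Outlier Factor", LocalOutlierFactor(contamination=0.05)),
]:
    labels = detector.fit_predict(X)  # -1 = outlier, 1 = inlier
    print(name, "flagged", int((labels == -1).sum()), "points")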