Machine Learning stories roundup 2024.7
Local LLMs
RAG using the local LLM GPT4All to avoid sending sensitive data to a third-party service
GPT4All contains a number of models: a commercially licensable model based on GPT-J, a non-commercially licensable model based on Llama 13B, and a non-commercially licensable chat model based on MPT. The author tried two of them, ggml-gpt4all-j-v1.3-groovy.bin and ggml-gpt4all-l13b-snoozy.bin, and found ggml-gpt4all-l13b-snoozy.bin (an 8.14 GB model) to be much more accurate.
Example code
from gpt4all import GPT4All

gpt = GPT4All("ggml-gpt4all-l13b-snoozy.bin")
messages = [{"role": "user", "content": "What is the national flower of Canada?"}]
response = gpt.chat_completion(messages)  # returns an OpenAI-style response dict in the older gpt4all bindings
The article also shows a Gradio web UI integrating GPT4All; a sketch of that kind of front end follows.
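A minimal sketch of such a Gradio front end, assuming the older gpt4all Python bindings used above (the OpenAI-style response format is an assumption and may differ across versions):

import gradio as gr
from gpt4all import GPT4All

gpt = GPT4All("ggml-gpt4all-l13b-snoozy.bin")

def answer(question):
    # Ask the local model and pull the text out of the response dict
    response = gpt.chat_completion([{"role": "user", "content": question}])
    return response["choices"][0]["message"]["content"]

demo = gr.Interface(fn=answer, inputs="text", outputs="text", title="Local GPT4All chat")
demo.launch()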
Computer vision
VoxelGPT combines the power of large language models (LLMs) with FiftyOne's flexible computer vision query language to get insights about your image and video datasets without writing code. It provides a chat-like interface for interacting with your computer vision dataset by translating natural language queries into FiftyOne Python syntax and constructing the resulting DatasetView. This helps flatten the steep learning curve of mastering the FiftyOne query language.
Stable Diffusion and ControlNet: better control for text-to-image and image-to-image generation. ControlNet is a neural network structure that controls diffusion models by adding extra conditions. It augments Stable Diffusion with conditional inputs such as scribbles, edge maps, segmentation maps, and pose keypoints during text-to-image generation, so the generated image stays much closer to the input image. This is a big improvement over chaining traditional methods such as image-to-image generation, e.g. converting the image to a Canny edge image and generating from that edge image with your own text, or converting the image to an OpenPose bone image and generating from the pose.
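As a rough illustration (not from the article), Canny-conditioned generation with the diffusers library typically looks like the following; the checkpoint names are the commonly used public ones and the prompt and file paths are placeholders:

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Turn the input image into a Canny edge map to use as the extra condition
image = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(image, 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

result = pipe("a futuristic city at night", image=canny_image, num_inference_steps=30)
result.images[0].save("output.png")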
The VoxelGPT article focuses on the use case of understanding visual data: the open-source FiftyOne library lets you programmatically filter, sort, and semantically slice datasets of images, videos, and 3D point clouds, but that means learning the Python library's syntax. Can we use natural language to accomplish the same understanding?
Example prompts to generate FiftyOne code using natural language
Your task is to convert input natural language queries into Python code to generate ViewStages for the computer vision library FiftyOne.
Here is your first natural language query: “Images that only contain dogs”
Give me the FiftyOne code.
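For the query above, the generated view code might look roughly like the following sketch (assuming the detections live in a field called ground_truth; the exact ViewStages VoxelGPT emits will differ):

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

dataset = foz.load_zoo_dataset("quickstart")

# Keep samples that have at least one "dog" detection and no non-dog detections
only_dogs = dataset.match(
    F("ground_truth.detections.label").contains("dog")
    & (F("ground_truth.detections").filter(F("label") != "dog").length() == 0)
)
print(only_dogs)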
Online object detection services and the open-source YOLO model: a performance comparison
https://ai.plainenglish.io/meta-empowered-ai-with-vision-and-released-open-source-data-910e276a1a41
Segment Anything Model (SAM) can cut any object out of any image with a single click. Capabilities:
- SAM can segment objects effortlessly by clicking on them, interactively selecting points to include or exclude from the object, or by drawing bounding boxes and using the polygon tool to segment regions, which snap to the object.
- In situations where there is ambiguity in identifying the object to be segmented, SAM can produce multiple valid masks, which is impressive.
- SAM can automatically identify and generate masks for all objects present in an image, which is a valuable feature.
Impact: improve AI-assisted labelling and decrease the need for manual labour in tasks involving image segmentation, which can have a significant impact on industries such as agriculture, retail, medical imagery, and geospatial imagery.
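A minimal sketch of the point-prompt workflow described above, using Meta's segment-anything package (the checkpoint path and the example click coordinates are assumptions):

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (downloaded separately from the segment-anything repo)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One foreground click; label 1 means include, 0 means exclude
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)  # multiple candidate masks when the click is ambiguous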
DINOv2 is a cutting-edge method for training computer vision models using self-supervised learning. It allows the model to learn from any collection of images without needing labels or metadata, so no large amounts of labeled data are required.
Real-world applications: object identification, depth estimation, object classification, object retrieval, image data curation
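A small sketch of using DINOv2 as a frozen feature extractor via torch.hub (the image path and the ImageNet-style preprocessing values are generic assumptions):

import torch
from PIL import Image
from torchvision import transforms

# Load a small DINOv2 backbone from torch.hub (weights download on first use)
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    embedding = model(img)  # (1, 384) global image feature for ViT-S/14
print(embedding.shape)

These frozen embeddings can then feed downstream tasks such as classification, retrieval, or data curation without fine-tuning the backbone.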
Video generation platform
Gen-2: RunwayML's contribution
An open-source model that performs complex vision-language tasks (e.g. generating detailed image descriptions and creating websites from handwritten drafts), utilizing the open-source Vicuna model as its language decoder and the vision components of the BLIP-2 vision-language model as its visual encoder
Capabilities
- Detailed image description generation
- Website creation from hand-written drafts
- Writing stories and poems inspired by given images
- Providing solutions to problems shown in images
- Teaching users how to cook based on food photos
Use the open-source OnnxStream project to generate images with Stable Diffusion XL Turbo on a Raspberry Pi
https://medium.com/mlearning-ai/how-to-run-dreambooth-locally-a-step-by-step-gyu-88c028ab01a4
DreamBooth is a tool to fine-tune an existing text-to-image model like Stable Diffusion using only a few of your own images. You can install it as an extension in the Stable Diffusion WebUI.
Large Language Models
Talk-to-your-codebase solution. It follows the typical LangChain-based Q&A recipe, treating the code base as text files (see the sketch below):
- Ingestion: load and split the documents, chunk the files, and add them to DeepLake to create embeddings
- Conversational retrieval / retrieval-augmented generation: initialize the DeepLake retriever, configure the retrieval settings, and use ConversationalRetrievalChain to wire up the retriever and the LLM
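A condensed sketch of that recipe using the classic LangChain and DeepLake APIs (the file path, dataset path, and chunk sizes are placeholder assumptions):

from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

# Ingestion: load, chunk, and embed the code files into a DeepLake dataset
docs = TextLoader("src/main.py").load()
chunks = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0).split_documents(docs)
db = DeepLake.from_documents(chunks, OpenAIEmbeddings(), dataset_path="./deeplake_code")

# Conversational retrieval: wire the retriever and the LLM together
retriever = db.as_retriever(search_kwargs={"k": 4})
qa = ConversationalRetrievalChain.from_llm(ChatOpenAI(temperature=0), retriever=retriever)
print(qa({"question": "What does main.py do?", "chat_history": []})["answer"])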
Multimodal large language models: using language models as general-purpose interfaces to perceive beyond text (e.g. visual and audio inputs)
Problem: different vision problems often require different models, which have to be manually selected and composed for each use case. One way to address this is to combine the vision and language modules into a single end-to-end model; research systems such as Flamingo and PaLM-E explore this direction by encoding vision signals into special text tokens or features that the language module can understand, enabling the system to use the language module to understand user queries and provide responses.
Microsoft MM-REACT instead enhances ChatGPT's visual understanding by composing numerous vision experts (computer vision models, especially Azure Cognitive Services vision APIs, e.g. image captioning, object tagging, face recognition, OCR/form recognizer, celebrity recognition, Bing search/visual search).
At its core, MM-REACT combines individual vision models with the language model in a more flexible manner to handle complicated visual understanding (hence the name MM, for multimodal). A file path is used as a placeholder and passed to ChatGPT so that the system can accept images as input. Whenever the system requires specific information from the image, such as a celebrity name or box coordinates, ChatGPT asks a specific vision component from Azure, e.g. OCR, for help. The vision component's output is then serialized as text and combined with the input to prompt ChatGPT further (a sketch of this pattern follows).
If you go to the MM-REACT repo, you will see a few environment variables defined for the Azure Cognitive Services resource URLs and API keys.
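An illustrative sketch of that serialization pattern (not MM-REACT's actual code; the expert functions are hypothetical stand-ins for the Azure vision calls):

def run_ocr(image_path: str) -> str:
    # Hypothetical stand-in for an Azure OCR / Form Recognizer call
    return "EXAMPLE RECEIPT  TOTAL: $12.50"

def run_celebrity_recognition(image_path: str) -> str:
    # Hypothetical stand-in for the Azure celebrity recognition service
    return "no celebrities found"

VISION_EXPERTS = {"ocr": run_ocr, "celebrity": run_celebrity_recognition}

def invoke_expert(name: str, image_path: str, chat_history: list) -> list:
    """Run a vision expert, serialize its output as text, and append it to the chat."""
    observation = VISION_EXPERTS[name](image_path)
    chat_history.append({"role": "system", "content": f"<{name} output> {observation}"})
    return chat_history  # the updated history is sent back to ChatGPT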
For complex tasks, the low-code LLM interaction can be completed in four steps; it helps in situations where you can break a task down into subtasks with clear jump logic (a rough sketch follows the list):
I. A Planning LLM generates a highly structured workflow for complex tasks.
II. Users edit the workflow with predefined low-code operations (including adding/removing steps by clicking buttons; modifying step names or descriptions by clicking and text editing; adding/removing a jump logic by clicking; changing the processing order by dragging; extending a step in the flowchart by clicking the button; regeneration and confirmation by clicking buttons).
III. An Executing LLM generates responses with the reviewed workflow.
IV. Users continue to refine the workflow until satisfactory results are obtained.
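A rough sketch of the Planning-LLM / Executing-LLM split (the call_llm helper is hypothetical; the paper's actual prompts and workflow format differ):

import json

def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around your chat-completion API of choice."""
    raise NotImplementedError  # replace with a real API call returning the model's text

def plan(task: str) -> list:
    # Step I: the Planning LLM drafts a structured workflow (steps plus jump logic) as JSON
    prompt = ("Break the following task into numbered steps with optional jump logic. "
              f"Return a JSON list of steps.\nTask: {task}")
    return json.loads(call_llm(prompt))

def execute(task: str, workflow: list) -> str:
    # Step III: the Executing LLM generates a response that follows the reviewed workflow
    prompt = f"Task: {task}\nFollow exactly this reviewed workflow:\n{json.dumps(workflow, indent=2)}"
    return call_llm(prompt)

# Step II happens between the two calls: the user edits the workflow through low-code
# operations (add/remove steps, edit text, change jump logic, reorder), and step IV
# repeats the loop until the result is satisfactory.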
https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference#distributed-inference-with-accelerate
Accelerator.split_between_processes() is a context manager (which also exists in PartialState and AcceleratorState). It will automatically split whatever data you pass to it (be it a prompt, a set of tensors, a dictionary of the prior data, etc.) across all the processes (with a potential to be padded) for you to use right away.
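A minimal sketch following the distributed-inference guide linked above; each process receives its own slice of the prompts:

from accelerate import PartialState

state = PartialState()
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]

with state.split_between_processes(prompts) as my_prompts:
    for prompt in my_prompts:
        # run the model on this process's share of the prompts
        print(f"rank {state.process_index}: {prompt}")

Launch it with accelerate launch --num_processes 2 script.py and each rank only sees and processes its own subset.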
https://www.youtube.com/watch?v=9G9GY8s-tHY
InstructEval paper: MMLU (problem solving), DROP (math-based discrete reasoning), BBH (logical deduction, fallacy detection), CRASS (causal reasoning capability), HumanEval (code-generation models), the InstructEval benchmark leaderboard from declare-lab, and HELM from Stanford
7 Patterns mentioned
- Evals: To measure performance
- RAG: To add recent, external knowledge
- Fine-tuning: To get better at specific tasks
- Caching: To reduce latency & cost
- Guardrails: To ensure output quality
- Defensive UX: To anticipate & manage errors gracefully
- Collect user feedback: To build our data flywheel
Creative
Create music lyrics with GPT-4, example prompts
Create music chords with GPT-4, example prompts
Prompt: create melody in guitar tab format, pre-chorus
Using GPT models to create music
Processing music is not at all easy for several reasons:
- Data: getting human-labeled speech data is a much more expensive and time-consuming task than scraping web text, and there is much less material, hence less data.
- Compute: processing audio is computationally much more expensive.
One solution is to use an LLM as an interface: the LLM talks to foundation models dedicated to audio and to a speech input/output interface (ASR, TTS), as in HuggingGPT. ChatGPT is used to understand the user's intention and map the task to a specific model; that model's results are sent back to ChatGPT and then on to the user.
Interesting use cases: image to audio, audio to face, sound event detection, noise elimination
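A toy sketch of that dispatch pattern using Hugging Face pipelines (the hard-coded task registry is a simplification; HuggingGPT selects tasks and models dynamically via ChatGPT):

from transformers import pipeline

# Hypothetical, hard-coded registry of audio "expert" models
EXPERTS = {
    "automatic-speech-recognition": lambda: pipeline("automatic-speech-recognition"),
    "audio-classification": lambda: pipeline("audio-classification"),  # e.g. sound event detection
}

def run_task(task_name: str, audio_path: str) -> str:
    """Run the selected expert and serialize its output as text for the LLM."""
    expert = EXPERTS[task_name]()  # lazily build the pipeline
    return str(expert(audio_path))

# In HuggingGPT, ChatGPT would pick the task; here it would be hard-coded, e.g.:
# transcript = run_task("automatic-speech-recognition", "speech_sample.wav")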
Others
Families of anomaly/outlier detection methods (a sketch of a few of them follows):
- Rule-based methods
- Statistics-based methods: IQR, z-score, Minimum Covariance Determinant (MCD), Gaussian Mixture (GM)
- Distance-based methods
- Linear models
- Density-based methods
- Tree-based methods
- Graph-based models
- Autoencoders
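As a hedged illustration of a few of these families on toy data (the method choices and parameters are mine, not from the source):

import numpy as np
from sklearn.covariance import EllipticEnvelope  # statistics-based (MCD)
from sklearn.ensemble import IsolationForest  # tree-based
from sklearn.neighbors import LocalOutlierFactor  # density-based

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),  # inliers
    rng.uniform(-6, 6, size=(10, 2)),  # a few outliers
])

for name, detector in [
    ("MCD / EllipticEnvelope", EllipticEnvelope(contamination=0.05)),
    ("Isolation Forest", IsolationForest(contamination=0.05, random_state=0)),
    ("Local Outlier Factor", LocalOutlierFactor(contamination=0.05)),
]:
    labels = detector.fit_predict(X)  # -1 = outlier, 1 = inlier
    print(name, "flagged", int((labels == -1).sum()), "points")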