Semantic search in Machine Learning, part 2

Search with meaning, intent and context

Xin Cheng
11 min read · Jul 20, 2023

This is the fifth article in the series on building LLM-powered AI applications. In the previous article, we introduced semantic search and its applications. In this article, we dive into the technical details of semantic search.

Technical mechanisms

Concepts

Vector similarity

A vector’s proximity to another vector in the vector space determines how similar the two are.

Distance metric

You need a way to determine whether two vectors are similar. Vectors are represented as numbers, and a “distance” indicates how close two vectors are. Three common vector similarity metrics are mentioned: Euclidean distance, cosine similarity, and dot product similarity. (The basic rule of thumb in selecting the best similarity metric for your vector index is to match it to the one used to train your embedding model; for example, the all-MiniLM-L6-v2 model was trained using cosine similarity, so using cosine similarity for the index will produce the most accurate results.)

Cosine similarity is probably not suitable when you have data where the magnitude of the vectors is important and should be taken into account when determining similarity. For example, it is not appropriate for comparing the similarity of image embeddings based on pixel intensities.
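To make the three metrics concrete, here is a minimal NumPy sketch (toy vectors; note that the two vectors point in the same direction, so cosine similarity is exactly 1.0 even though their magnitudes differ):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

euclidean = np.linalg.norm(a - b)                        # lower = more similar
dot = np.dot(a, b)                                       # higher = more similar; sensitive to magnitude
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))   # 1.0: direction only, magnitude ignored

print(euclidean, dot, cosine)
```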

Vector search strategies

You need to search for the vectors closest to your query vector.

Linear search: a naive solution is to compute the distance between your vector and every other vector, then order by distance. However, this is extremely slow. All other strategies are collectively called “indexing”. According to this article, there are four types of vector search algorithms:

  • hash-based (e.g. locality-sensitive hashing),
  • tree-based (e.g. ANNOY),
  • cluster-based (e.g. product quantization), and
  • graph-based (e.g. HNSW).

Space partitioning: a family of algorithms that share the same concept. K-dimensional trees (kd-trees) continuously bisect the search space (splitting the vectors into “left” and “right” buckets) in a manner similar to binary search trees. An inverted file index (IVF) works by assigning each vector to its nearest centroid; searches are then conducted by first determining the query vector’s closest centroid and searching around there.
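As a small illustration of the space-partitioning idea, SciPy ships a kd-tree implementation (the random data below is purely illustrative):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
vectors = rng.random((10_000, 16))   # 10k database vectors, 16 dimensions
tree = KDTree(vectors)               # recursively bisects the space along coordinate axes

query = rng.random(16)
distances, indices = tree.query(query, k=5)  # 5 nearest neighbors by Euclidean distance
```

Note that kd-trees degrade quickly as dimensionality grows, which is why embedding-scale libraries rely on IVF, quantization, or graph methods instead.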

Quantization: scalar quantization (SQ) works by multiplying high-precision floating-point vectors by a scalar value, then casting the elements of the resulting vector to their nearest integers. Product quantization (PQ) works similarly to dictionary compression: all vectors are split into equally sized subvectors, and each subvector is then replaced with a centroid.
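A minimal sketch of scalar quantization in NumPy (real libraries use per-dimension statistics and more careful schemes; this only shows the float-to-int8 idea):

```python
import numpy as np

vectors = np.random.default_rng(0).normal(size=(1000, 128)).astype(np.float32)

lo, hi = vectors.min(), vectors.max()
scale = 255.0 / (hi - lo)
quantized = np.round((vectors - lo) * scale - 128).astype(np.int8)  # 4x smaller than float32

# Approximate reconstruction, used when computing distances:
restored = (quantized.astype(np.float32) + 128) / scale + lo
```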

Hierarchical Navigable Small Worlds (HNSW): HNSW creates a multi-layer graph from the original data. Upper layers contain only “long connections” while lower layers contain only “short connections” between vectors in the database. During search, we greedily traverse the uppermost graph (the one with the longest inter-vector connections) for the vector closest to our query vector. We then do the same for the second layer, using the result of the first layer search as the starting point. This continues until we complete the search at the bottommost layer, the result of which becomes the nearest neighbor of the query vector.
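A hedged sketch with the hnswlib library (the M and ef values are illustrative, not recommendations):

```python
import numpy as np
import hnswlib

dim, num_elements = 128, 10_000
data = np.float32(np.random.random((num_elements, dim)))

index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)  # M: links per node
index.add_items(data, np.arange(num_elements))
index.set_ef(50)  # search-time breadth vs. accuracy trade-off

labels, distances = index.knn_query(data[:1], k=5)
```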

Approximate Nearest Neighbors Oh Yeah (ANNOY): ANNOY works by converting the vector search space into a binary tree. It first randomly selects two vectors in the database and bisects the search space along the hyperplane separating those two vectors; this is done iteratively until there are fewer elements per node than a predefined parameter NUM_MAX_ELEMS.

ANN search libraries

ANN stands for approximate nearest neighbor. ANN search is a technique used to find the points in a high-dimensional space that are closest to a given query point. ANN search libraries mentioned: Faiss (Facebook), Annoy (Spotify), Hnswlib, NMSLIB.

Besides the “index build” example with AnnoyIndex, get_nns_by_vector returns the n closest items to a passed-in vector.
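A minimal Annoy sketch showing both the index build and get_nns_by_vector (the 'angular' metric and tree count are illustrative choices):

```python
import random
from annoy import AnnoyIndex

dim = 64
index = AnnoyIndex(dim, 'angular')
for i in range(1000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])
index.build(10)  # 10 trees; more trees = better recall, bigger index

query = [random.gauss(0, 1) for _ in range(dim)]
nearest = index.get_nns_by_vector(query, 5)  # ids of the 5 closest items to the query vector
```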

FAISS is used for efficient indexing of semantic vectors, and Sentence Transformers for encoding sentences into those vectors. FAISS is an outstanding library designed for fast retrieval of nearest neighbors in high-dimensional spaces, enabling quick semantic nearest-neighbor search even at large scale. Sentence Transformers, a deep learning model, generates dense vector representations of sentences, effectively capturing their semantic meaning.
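A minimal sketch of the combination (the model name and sentences are placeholders; normalizing the embeddings makes inner product equal to cosine similarity):

```python
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
corpus = ['A man is eating food.', 'A monkey is playing drums.', 'A cheetah chases its prey.']
embeddings = model.encode(corpus, normalize_embeddings=True)  # unit vectors

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product search
index.add(np.asarray(embeddings, dtype=np.float32))

query = model.encode(['Which animal makes music?'], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype=np.float32), 2)
print([corpus[i] for i in ids[0]])
```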

FAISS supports various index structures optimized for medium-dimensional, high-dimensional, and very large use cases (an IVF sketch follows the list):

  • Inverted files (IVF): Indexes clusters of similar vectors. Suitable for medium-dimensional vectors.
  • Product quantization (PQ): Encodes vectors into quantized subspaces. Suitable for high-dimensional vectors.
  • Cluster-based strategies: Organizes vectors into a hierarchical set of clusters for multi-level search. Suitable for very large datasets.
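For example, the IVF structure from the list above can be built in FAISS roughly like this (nlist and nprobe are illustrative):

```python
import numpy as np
import faiss

d, nb, nlist = 128, 100_000, 1024
xb = np.random.random((nb, d)).astype(np.float32)

quantizer = faiss.IndexFlatL2(d)                 # used to assign vectors to centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                  # learns the nlist centroids
index.add(xb)

index.nprobe = 16                                # how many clusters to visit per query
D, I = index.search(xb[:1], 5)
```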

Product quantization at a high level

  1. Each vector is divided into sub-vectors, e.g. a 1,000-dimension vector is divided into 10 chunks of 100 dimensions each
  2. A number of clusters is chosen; all chunks are then clustered, and each chunk is assigned to a centroid
  3. At query time, the query vector is also divided into sub-vectors, and the corresponding nearest centroid is found for each
  4. For each corresponding centroid, the partial distances to the database vectors are looked up, and a simple sum gives the overall distance

It is a simple mechanism, but computing the sum of distances over all vectors in the target centroids still seems like quite a lot of work.
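The same mechanism is available off the shelf as FAISS’s IndexPQ; a minimal sketch (128-d vectors split into m=8 sub-vectors, each encoded with 8 bits, i.e. 256 centroids per subspace):

```python
import numpy as np
import faiss

d, m, nbits = 128, 8, 8
xb = np.random.random((50_000, d)).astype(np.float32)

index = faiss.IndexPQ(d, m, nbits)
index.train(xb)  # learns 256 centroids in each of the 8 subspaces
index.add(xb)    # each vector is stored as just 8 bytes of centroid ids

D, I = index.search(xb[:1], 5)  # distances are summed from per-subspace lookup tables
```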

Various data type similarity search

Technical details differ across data types (e.g. text, image, video, audio). Let’s understand at a high level how each can be handled (e.g. ETL, embedding, which vector index to use, etc.).

Text

The article first covers traditional search and related techniques: PageRank to rank web pages, TF-IDF to rank words in documents, color histograms and SURF as simple image descriptors, inverted indexes to index content efficiently, crawling to find pages on the web, and item-item similarity as a classic method to find similar items using ratings and trends.

The article then notes that a semantic search system is composed of two parts: an encoding pipeline that builds indices, and a search pipeline that lets the user query those indices to search for items.

Encoding pipeline

When encoding data, the representation can be multimodal (a combination of visual, audio, and text signals). For example, the Star Wars character C-3PO can be encoded with a picture of it, a description, how it appears in a graph (it is in the Star Wars movies, appearing at the dates of those movies, …), how popular it is, but also the fact that it often appears along with R2-D2 and that it has a robotic voice. Different attributes perform differently in different systems: for a recommendation system, the co-occurrence information might work best, while for a visual search system, the picture might be the most relevant.

Image encoder: networks like ResNet or EfficientNet are really good feature extractors for images. It is also possible to apply segmentation or object detection before the image encoder: segmentation can extract parts of the image pixel by pixel (relevant, for example, for extracting shirts and pants from a fashion picture), while detection is useful for extracting rectangular zones from the images.
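A hedged sketch of turning ResNet into a feature extractor with torchvision ('photo.jpg' is a placeholder path):

```python
import torch
from torchvision import models
from PIL import Image

weights = models.ResNet50_Weights.DEFAULT
model = models.resnet50(weights=weights)
model.fc = torch.nn.Identity()  # drop the classifier head to expose the 2048-d features
model.eval()

preprocess = weights.transforms()  # the resizing/normalization the model was trained with
image = preprocess(Image.open('photo.jpg').convert('RGB')).unsqueeze(0)

with torch.no_grad():
    embedding = model(image)  # shape (1, 2048); ready to insert into a vector index
```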

Text encoders: word2vec, Transformers, BERT

Other data types: the Jina AI VectorHub is mentioned, but it is no longer accessible.

Embeddings composition: concatenation (concatenating the embeddings) is a basic method that works surprisingly well. Multimodal models: vision-and-language deep learning is becoming really popular, and many models (ImageBERT, ViLBERT, UNITER, VL-BERT) propose learning from both images and text to produce cross-modal representations.

Training embeddings: image-specific training: GrokNet, FaceNet. Text-specific training: the Hugging Face transformers library and the sentence-transformers library built on it are great for fine-tuning a text model for a specific use case. StarSpace is a Facebook project to learn embeddings from images, text, graphs, and distributions for various objectives.

Indexing pipeline

Vector indexing libraries for efficient vector search:

  • Faiss: a very broad library that implements many algorithms, with clean interfaces to build indices and search them
  • Hnswlib: currently the fastest implementation of HNSW; highly specialized and optimized
  • Annoy: another ANN algorithm, implemented by Spotify
  • ScaNN (Google): a new, state-of-the-art method that beats HNSW in speed and recall using anisotropic quantization
  • Catalyzer (Facebook): proposes training the quantizer with a neural network for a specific task

Search pipeline

Encode the query, then search through the index.

Open-source solutions for building scalable semantic search: Jina, Milvus, Elasticsearch with HNSW integration, VectorHub, Haystack.

Some considerations in choosing semantic search technologies:

Embedding: whether you are doing symmetric semantic search (the query and the entries in your corpus are of about the same length and have the same amount of content) or asymmetric semantic search (a short query used to find a longer paragraph answering it). Models tuned for cosine similarity will prefer retrieving shorter documents, while models tuned for dot product will prefer retrieving longer documents (the latest documentation seems to de-emphasize this difference).

Vector index storage: Elasticsearch, Faiss, Annoy

Fine-tuning SBERT (sentence-transformers, the most popular text embedding model): you need query-and-relevant-passage pairs to fine-tune the model, and the author mentions “synthetic query generation” (using a model to generate queries from passages) as a way to get them.

Computer vision

Embedding: the SqueezeNet model, a very small, basic model trained on millions of images across 1,000 classes.

Reverse video search is similar to reverse image search. In simple words, it takes a video as input and searches for similar videos. Towhee is used to extract features and generate the embeddings (using the X3D model).

Search image with natural language

OpenAI’s CLIP (Contrastive Language-Image Pretraining) is a deep learning model that combines vision and language to understand and interpret images. It learns to associate images and their textual descriptions by maximizing the similarity between corresponding pairs of image and text representations. You can perform a semantic search over open-source images using natural language descriptions. Below is the pipeline, followed by a short code sketch:

  1. Download the CLIP model and the Unsplash dataset.
  2. Use CLIP’s image encoder to encode all the images in the Unsplash dataset and store them.
  3. Use CLIP’s text encoder to encode a text query.
  4. Compute the cosine similarity between the query embedding and all the image embeddings.
  5. Retrieve the top N images with the highest similarity and show them to the user.
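A compact sketch of steps 2–5 using the CLIP checkpoint exposed through sentence-transformers (the image paths and query are placeholders):

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')

image_paths = ['beach.jpg', 'city.jpg', 'forest.jpg']
image_embeddings = model.encode([Image.open(p) for p in image_paths])  # step 2

query_embedding = model.encode(['two dogs playing in the snow'])       # step 3
scores = util.cos_sim(query_embedding, image_embeddings)[0]            # step 4

for i in scores.argsort(descending=True)[:2]:                          # step 5
    print(image_paths[int(i)], float(scores[i]))
```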

Pipeline of image similarity search

  1. The sentence-transformers library to load the pretrained CLIP model and generate image embeddings
  2. Faiss to index vector embeddings
  3. Faiss to search similar images

A similar article searches photos with natural language: CLIP is a multimodal vision and text model that attempts to encode both images and the textual descriptions of those images in the same latent space. This means that if a text sentence is a good description of an image, both items will be represented as very close points in that vector space. CLIP is usually applied to measure the similarity between some images and some text descriptions.

Two models are used: one for extracting faces (MTCNN, which is a popular choice due to its ability to accurately detect and align faces in images despite variations in pose and appearance) and another for generating vector embeddings of the face (VGGFace2, which is a deep learning model for facial recognition that was trained on the VGGFace2 dataset).
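A hedged sketch using the facenet-pytorch package, which bundles an MTCNN detector and an Inception-ResNet encoder pretrained on VGGFace2 (the file path is a placeholder):

```python
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)                             # detects, crops, and aligns the face
encoder = InceptionResnetV1(pretrained='vggface2').eval()

face = mtcnn(Image.open('person.jpg'))                    # face tensor, or None if no face found
if face is not None:
    with torch.no_grad():
        embedding = encoder(face.unsqueeze(0))            # (1, 512) face embedding
```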

Audio

Audio embeddings: PANNs (Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition); Pinecone uses HNSW for the index.

On that page, the embedding model is also PANNs, but the Jupyter notebook is no longer available; Milvus supports IVF, HNSW, Annoy, and quantization-based indexes.
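A hedged sketch of extracting a PANNs embedding with the panns-inference package (the return signature follows that project's README; the file path is a placeholder):

```python
import librosa
from panns_inference import AudioTagging

model = AudioTagging(checkpoint_path=None, device='cpu')  # downloads a pretrained PANNs checkpoint

audio, _ = librosa.load('clip.wav', sr=32000, mono=True)  # PANNs expects 32 kHz audio
audio = audio[None, :]                                    # add a batch dimension

clipwise_output, embedding = model.inference(audio)       # embedding: (1, 2048) vector for search
```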

Appendix

Understanding Semantic Search

A series of ten articles about semantic search.

Part 1: Machine Reading Comprehension, SQuAD, BERT, Allen NLP

Part 2: the limitation of machine reading comprehension is the model’s input size. The resolution is a retriever-reader architecture (a document splitter turns documents into chunks of text). Retriever algorithms: classical or sparse retrieval algorithms (TF-IDF and BM25: the higher the frequency of a word in a piece of text, the more likely the text is about that word), and neural or dense retrieval algorithms (DPR and SBERT).

Part 3: knowledge graphs, with three main NLP tasks: named entity extraction, relationship extraction, and coreference resolution (whether different words in the text refer to the same entity, e.g. alternative labels (CEO and Chief Executive Officer) or synonyms (House, Home, and Residence)). To query a knowledge graph, the query text is converted into a structured query.

Part 4: answer quality metrics, i.e. evaluating reader models for the machine reading comprehension task: lexical or keyword-based evaluation metrics (Exact Match (EM), F1-score) and neural evaluation metrics (BERTScore, Bi-Encoder Score).

Part 5: ranking metrics for evaluating question answering and recommendation systems: Top-N Accuracy, Precision over k, Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), and Normalized Discounted Cumulative Gain (NDCG). NDCG is the most common metric; it accounts for some answers being more relevant than others, rewarding rankings that place the most relevant responses first, then the less relevant ones, and finally the least relevant answers.

Part 7: mentions vector databases for running compute-intensive dense retriever algorithms against frequently changing data at scale.

It also mentions k-NN. The difference between k-NN and ANN is that in the prediction phase, all training points are involved in the search for the k nearest neighbors in the k-NN algorithm, whereas in ANN the search starts from only a small subset of candidate points. If N is set to the size of the training set, ANN reduces to k-NN, with enormous time spent in the training (index-building) phase.

Elasticsearch 8.0 introduces support for fast, approximate nearest neighbor (ANN) search. It uses an ANN algorithm called Hierarchical Navigable Small World graphs (HNSW), which organizes vectors into a graph based on their similarity to each other. HNSW shows strong search performance across a variety of ann-benchmarks datasets.
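A hedged sketch of using it from the Python client (index and field names are placeholders, and the top-level knn option in _search assumes a recent 8.x cluster; early 8.0 releases exposed a separate _knn_search endpoint instead):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

es.indices.create(index='docs', mappings={'properties': {
    'embedding': {'type': 'dense_vector', 'dims': 384,
                  'index': True, 'similarity': 'cosine'}  # HNSW-backed vector field
}})

es.index(index='docs', document={'embedding': [0.1] * 384}, refresh=True)

resp = es.search(index='docs', knn={'field': 'embedding',
                                    'query_vector': [0.1] * 384,
                                    'k': 5, 'num_candidates': 50})
```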

Embeddings are generated with sentence-transformers, and the semantic index is built with the Faiss library. The semantic index is updated automatically by a microservice to reflect new additions to your DynamoDB table (vectors stored in JSON format (how is the performance?), sparse index). A second microservice is responsible for querying the index.

https://learn.deeplearning.ai/large-language-models-semantic-search
