Semantic search in Machine Learning, part 2
This is the fifth article of building LLM-powered AI applications series. From the previous article, we introduced semantic search and applications. In this article, we will dive into technical details of semantic search.
Technical mechanisms
Concepts
Vector similarity
Vector’s proximity in the vector space determines how similar it is.
You need a way to determine if two vectors are similar. Vectors are represented as numbers and “distance” indicates closeness of vectors. Three common vector similarity metrics are mentioned: Euclidean distance, cosine similarity, and dot product similarity. (The basic rule of thumb in selecting the best similarity metric for your vector index is to match it to the one used to train your embedding model, all-MiniLM-L6-v2 model was trained using cosine similarity — so using cosine similarity for the index will produce the most accurate result.)
Cosine similarity is probably not suitable when you have data where the magnitude of the vectors is important and should be taken into account when determining similarity. For example, it is not appropriate for comparing the similarity of image embeddings based on pixel intensities.
Vector search strategies
You need to search closest vectors to your vector.
Linear search: A native solution would be computing distance between all other vectors and your vector, then order by distance. However, it will be extremely slow. All other strategies are also called “Indexing”. According to this article, there are four types of vector search algorithms
- hash-based (e.g. locality-sensitive hashing),
- tree-based (e.g. ANNOY),
- cluster-based (e.g. product quantization), and
- graph-based (e.g. HNSW).
Space partitioning: family of algorithms that all use the same concept, K-dimensional trees (kd-trees): continuously bisecting the search space (splitting the vectors into “left” and “right” buckets) in a manner similar to binary search trees; Inverted file index (IVF): works by assigning each vector to its nearest centroid — searches are then conducted by first determining the query vector’s closest centroid and conducting the search around there.
Quantization: Scalar quantization (SQ): works by multiplying high-precision floating point vectors with a scalar value, then casting the elements of the resultant vector to their nearest integers; Product quantization (PQ): works similar to dictionary compression. In PQ, all vectors are split into equally-sized subvectors, and each subvector is then replaced with a centroid.
Hierarchical Navigable Small Worlds (HNSW): HNSW creates a multi-layer graph from the original data. Upper layers contain only “long connections” while lower layers contain only “short connections” between vectors in the database. During search, we greedily traverse the uppermost graph (the one with the longest inter-vector connections) for the vector closest to our query vector. We then do the same for the second layer, using the result of the first layer search as the starting point. This continues until we complete the search at the bottommost layer, the result of which becomes the nearest neighbor of the query vector.
Approximate Nearest Neighbors Oh Yeah: ANNOY works by converting vector search space into binary-tree, by first randomly selecting two vectors in the database and bisecting the search space along the hyperplane separating those two vectors. This is done iteratively until there are fewer than some predefined parameter NUM_MAX_ELEMS per node.
ANN search libraries
ANN is approximate nearest neighbor search (ANN). ANN search is a technique used to find points in a high-dimensional space that are closest to a given query point. ANN search library mentioned: Faiss (Facebook), Annoy (Spotify), Hnswlib, NMSLIB.
Beside “index build” example with Annoyindex, get_nns_by_vector can get back n closest item from passed-in vector.
FAISS for efficient indexing of semantic vectors and Sentence Transformers for encoding sentences into these vectors. FAISS is an outstanding library designed for the fast retrieval of nearest neighbors in high-dimensional spaces, enabling quick semantic nearest neighbor search even at a large scale. Sentence Transformers, a deep learning model, generates dense vector representations of sentences, effectively capturing their semantic meanings.
FAISS supports various index structures optimized for medium, high-dimensional, very large use cases.
- Inverted files (IVF): Indexes clusters of similar vectors. Suitable for medium-dimensional vectors.
- Product quantization (PQ): Encodes vectors into quantized subspaces. Suitable for high-dimensional vectors.
- Cluster-based strategies: Organizes vectors into a hierarchical set of clusters for multi-level search. Suitable for very large datasets.
High-level of Product Quantization
- Each vector is divided into sub-vectors, e.g. 1,000-dimension vector is divided into 10 chunks
- The number of clusters is chosen, then all chunks will be clustered and assigned a centroid
- During query, query vector is also divided into sub-vectors and then find corresponding centroid
- For each corresponding centroid, find partial distances of database vectors, use simple sum to calculate overall distance
Simple mechanism, but computing sum of distances of all vectors in target centroids seem still quite a lot of work.
Various data type similarity search
Technical details are different for different data types (e.g. text, image, video, audio). Let’s understand on high-level how they can be done (e.g. ETL, embedding, vector index to use, etc.)
Text
The article first starts with traditional search and related techniques, including Page rank to rank web pages, Tf idf to rank words in documents, Color histogram and Surf are simple image descriptors, Inverted index to index content efficiently, Crawling to find pages in the web, Item item similarities is a classic method to find similar items using rating and trends.
Then article mentions that semantic search system is composed of two parts: an encoding pipeline that builds indices, and a search pipeline that lets the user use these indices to search for items.
Encoding pipeline
When encoding data, the data representation can be multimodal (combination of visual, audio, text), e.g. Star Wars character C-3PO can be encoded with a picture of it, a description, how it appears in a graph (it’s in a star wars movie, appearing at the date of these movies, …), how popular it is but also that it appears along with R2-D2 often, and it has a robotic voice. Different attributes can have different performance in different systems, e.g. for a recommendation system, the co-occurrence information might work the best, but for a visual search system, the picture might be the most relevant.
Image encoder: Networks like ResNet or EfficientNet are really good feature extractors for images, it’s also possible to use segmentation or object detection before applying the image encoder: Segmentation can be used to extract part of the image pixel by pixel, it can be relevant to extract shirts and pants from a fashion picture; Detection is useful to extract rectangular zones from the images.
Text encoder: Word2vec, Transformers, Bert
Other data type: Jina ai vector hub is mentioned, but not accessible anymore
Embeddings composition: Concatenation: concatenating the embeddings is a basic method that works surprisingly well; Multimodal model: vision and language deep learning is becoming really popular, and many models (imagebert, vilbert, uniter, vl-bert, see this interesting demo) propose to learn from both language and text, to produce cross model representations.
Training embeddings: Image specific training: Groknet, FaceNet; Text specific training: huggingface transformers library and the sentence transformers library based on it are great to fine-tune a text model for a specific use case. StarSpace a facebook project to learn embeddings from images, text, graph, and distribution for various objectives
Indexing pipeline
Vector indexing libraries for efficient vector search: Faiss A very broad library that implements many algorithms and clean interfaces to build them and search from them, Hnswlib is currently the fastest implementation of HNSW. Highly specialized and optimized, Annoy is another knn algorithm, implemented by Spotify, Scann from Google is a new method that is state of the art, beating HNSW in speed and recall using anisotropic quantization, Catalyzer from Facebook that proposes to train the quantizer with a neural network for a specific task
Search pipeline
Encode the query, search through index
Open source solution for building scalable semantic search: Jina, Milvus, Elastic search with hnsw integration, Vectorhub, Haystack
Some considerations in choosing semantic search technologies:
Embedding: whether you are doing symmetric semantic search ( query and the entries in your corpus are of about the same length and have the same amount of content), or asymmetric semantic search (short query and you want to find a longer paragraph answering the query). Models tuned for cosine-similarity will prefer the retrieval of short documents, while models tuned for dot-product will prefer the retrieval of longer documents (latest documentation seems not emphasizing the difference).
Vector index storage: elasticsearch, faiss, annoy
Fine-tuning sbert (sentence transformer, most popular text embedding model): need query & relevant passages information to fine-tune the model, and author mentions “synthetic query generation” (use a model to generate query based on passage) to get that.
Computer vision
Embedding: SqueezeNet model, very small model and basic model that has been trained on millions of images across 1000 classes
Reverse video search is similar like reverse image search. In simple words, it takes a video as input to search for similar videos. Embedding provided by Towhee to extract features and generate embeddings (using X3D model)
Search image with natural language
OpenAI’s CLIP (Contrastive Language-Image Pretraining) is a deep learning model that combines vision and language to understand and interpret images. It learns to associate images and their textual descriptions by maximizing the similarity between corresponding pairs of image and text representations. You can perform a semantic search for open-source images using natural language descriptions. Below is the pipeline
- Download the CLIP model and the Unsplash dataset.
- Use CLIP’s image encoder to encode all the images in the Unsplash dataset and store them.
- Use CLIP’s text encoder to encode a text query.
- Compute the cosine similarity between the query embedding and all the images embeddings.
- Retrieve the top N images with the highest similarity and show them to the user.
Pipeline of image similarity search
- sentence-transformers library to load the pretrained Clip model to generate image embeddings
- Faiss to index vector embeddings
- Faiss to search similar images
Similar article to search photos with natural language: CLIP is a multimodal vision and text model that attempts to encode both images and the textual descriptions of those images in the same latent space. This means that if a text sentence is a good description of an image, both items will be represented as very close points in that vector space. CLIP is usually applied to measure the similarity scores between some images and some text descriptions.
Two models are used: one for extracting faces (MTCNN, which is a popular choice due to its ability to accurately detect and align faces in images despite variations in pose and appearance) and another for generating vector embeddings of the face (VGGFace2, which is a deep learning model for facial recognition that was trained on the VGGFace2 dataset).
Audio
audio embeddings: PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition, Pinecone uses HNSW for index
From the page, embedding is also pann, but Jupyter notebook is not available now, Milvus support IVF, HNSW, Annoy, Quantization.
Appendix
A series of articles about semantic search.
Part 1: Machine Reading Comprehension, SQuAD, BERT, Allen NLP
Part 2: limitation of Machine Reading Comprehension: input size of the model. Resolution: reader and retriever architecture (document splitter to turn document into chunks of text). Retriever algorithms: Classical or Sparse Retrieval Algorithms (TF-IDF and BM25) (the higher the frequency of a word in a piece of text, the higher the likelihood of the piece of text to be about that word), Neural or Dense Retrieval Algorithms (DPR and SBERT)
Part 3: knowledge graph, main three NLP tasks: Named Entity Extraction, Relationship Extraction, Coreference Resolution (if different words in the text refer to the same entity (e.g. Alternative Labels (CEO and Chief Executive Officer), mapping Synonyms (House, Home, and Residence)). To query knowledge graph, query text is converted into structured query
Part 4: Answer Quality Metrics — Evaluating the Reader Models for Machine Reading Comprehension Task: Lexical or Keyword-based Evaluation Metrics (Exact Match (EM), F1-Score), Neural Based Evaluation Metrics (BertScore, Bi-Encoder Score)
Part 5: Ranking Metrics for Evaluating Question Answering Systems and Recommendation Systems, Top-N Accuracy, Precision over k, Mean Reciprocal Rank (MRR), Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG) (the most common metric, NDCG considers that some answers are more relevant than others, ranking the most relevant responses first, then those of lesser relevant ones, and finally, the least relevant answers)
Part 7: mentions vector database to run compute-intensive dense retriever algorithms against frequently changing data at scale
Mentioned KNN. The difference between KNN and ANN is that in the prediction phase, all training points are involved in searching k-nearest neighbors in the KNN algorithm, but in ANN this search starts only on a small subset of candidates points. if N is set to the size of the training set, the ANN reduces to KNN with enormous time spent in the training phase.
Elasticsearch 8.0 starts to support fast, approximate nearest neighbor search (ANN). Elasticsearch 8.0 uses an ANN algorithm called Hierarchical Navigable Small World graphs (HNSW), which organizes vectors into a graph based on their similarity to each other. HNSW shows strong search performance across a variety of ann-benchmarks datasets.
Embedding is generated with sentence transformers. Semantic index is generated with faiss library. The semantic index will be updated automatically by a microservice to reflect new additions to your DynamoDB table (vector in json format (how is the performance?), sparse index). A second microservice will be responsible for querying the index.
https://learn.deeplearning.ai/large-language-models-semantic-search