Semantic search in Machine Learning, part 1
This is the fourth article in the building LLM-powered AI applications series. In the previous article, we introduced embeddings/vectors, the enabling technology for semantic search.
We first introduced semantic search in the second article of the series. A quick recap: semantic search understands the searcher’s intent and the contextual meaning of terms. Another article on semantic search.
Overview
The article introduces semantic search and covers two kinds of embeddings in NLP: word embeddings and sentence embeddings (embeddings can also represent images, audio, or video, as long as they are numerical representations of a piece of information), along with considerations for deciding whether to use semantic search at all.
The article uses the txtai.embeddings library to demo semantic search.
Transformer-based models can better understand the context of words and phrases, making it easier to provide more relevant search results. Multi-modal search refers to the ability to search for information across multiple modes, such as text, images, and videos.
Applications of Semantic Search
- Intelligent Query/Search/Question Answering: Semantic search enables users to enter search queries consisting of words, semantic expressions, or sample documents, and retrieve results ranked based on semantic similarity. It helps in finding pieces of information, understanding user intent, and providing relevant search results.
- Metadata Extraction/Indexing/Categorization: Semantic search aids in extracting metadata from various sources such as images and documents. It allows for categorizing information based on its intent and contextual meaning, enabling efficient management and organization of unstructured data.
- Knowledge Management: Semantic search solutions facilitate knowledge management within organizations. They enable search and retrieval of organization-wide information, sentiment analysis of reviews and comments, and seamless integration of business intelligence strategies.
- Spelling Tolerance: Semantic search can handle spelling errors and variations in search queries. It automatically corrects simple errors such as the insertion, omission, substitution, and transposition of characters, enhancing search accuracy and user experience.
Build semantic search applications
The article talks about building a question answering solution using the open-source Haystack framework, which powers the following use cases:
- Extractive Question Answering, where answers are “extracted” from the body of text.
- Generative Question Answering, where answers are “generated” word by word by a language model, rather than extracted verbatim from the text.
- FAQ search: answering questions by matching them against a corpus of existing question-answer pairs.
- Search in text-based internal systems, like financial reports or legal case search systems (Document Search).
Haystack supports many backend datastores (Elasticsearch as well as popular vector databases) and serves as the indexing and querying frontend. It also provides a pipeline abstraction (e.g. an Elasticsearch retriever node followed by a question answering node).
More components in haystack-based semantic search:
- EmbeddingRetriever
- FAISSDocumentStore
- DocumentSearchPipeline
If you are unhappy with the retriever’s results, you can add a “ranker” node to rerank documents.
A simple small-scale search application uses the OpenAI embedding API to convert documents (the resulting embeddings are stored in a new column of the dataframe and saved to a CSV file); at query time, cosine_similarity is used to compute the similarity between the query embedding and each document embedding. To enable semantic search, you need:
- Embedding generator
- Embedding store
- Embedding indexing (for efficient search, like relational database indexing)
- Embedding retriever/query engine
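The query side of this setup boils down to scoring each stored embedding against the query embedding. A dependency-free sketch, where the toy 3-dimensional vectors stand in for real model embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embedding store": in the article this is a dataframe column saved to CSV.
store = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.0, 0.8, 0.6],
    "doc3": [0.7, 0.7, 0.1],
}

def search(query_embedding, store, top_k=2):
    """Rank stored documents by cosine similarity to the query embedding."""
    scored = [(doc_id, cosine_similarity(query_embedding, emb))
              for doc_id, emb in store.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

print(search([1.0, 0.0, 0.0], store))
```

This brute-force scan is fine at small scale; the indexing component in the list above (e.g. an ANN index) exists precisely to avoid scoring every document once the corpus grows.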
A series about embeddings. The first article talks about transformers and where embeddings fit in the transformer architecture, then covers the two main model families:
- BERT-like, only using the encoder part of the transformer. Good at classification, summarization, and entity recognition.
- GPT family, decoder only. Good at generative tasks like translation and QA.
MTEB and BEIR are benchmarks for evaluating information retrieval models. Current state-of-the-art embedding models:
- SBERT models (all-MiniLM-L6-v2, all-MiniLM-L12-v2, and all-mpnet-base-v2) are striking a good balance between simplicity and ranking quality.
- SGPT (5.8B, 2.7B, 1.3B) is a recent take on the LoRA-finetuned open-source GPT-NeoX model for ranking.
- GTR-T5 is Google’s open-source embedding model for semantic search, using the T5 LLM as a base.
- E5 (v1 and v2) is the newest embedding model from Microsoft.
The majority of these models have multilingual versions available:
- E5: multilingual-e5-base, which is an aligned version of multilingual XLM-RoBERTa-base.
- SGPT: sgpt-bloom, which is based on a BLOOM model.
- SBERT: multilingual-MiniLM-L12-v2 based on mUSE.
Benefit of metadata: if text data comes with metadata, we can easily exploit that fact in a question answering task by passing a filter to our retriever-reader pipeline. The retriever then only preselects documents that match our filter, greatly reducing the search space for the rest of the pipeline. For example, if you want to look at one competitor at a time, you pass the retriever a filter with the company’s name and the financial years you’d like to investigate.
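In a Haystack-style pipeline, such a filter is just a dictionary passed to the retriever at query time. A small sketch of building those parameters (the company name, years, and metadata field names are hypothetical):

```python
def build_retriever_params(company, years, top_k=10):
    """Build the `params` dict for a Haystack-style retriever-reader pipeline.

    The filters restrict the retriever to documents whose metadata matches,
    shrinking the search space before the reader ever runs.
    """
    return {
        "Retriever": {
            "top_k": top_k,
            "filters": {
                "company": [company],             # hypothetical metadata field
                "year": [str(y) for y in years],  # hypothetical metadata field
            },
        }
    }

params = build_retriever_params("ACME Corp", [2021, 2022])
# With a real pipeline this would be passed as:
# pipeline.run(query="What was revenue growth?", params=params)
print(params)
```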
Metadata filtering: an eCommerce store lists its products on its website; however, you’d also see additional filters such as price range, color, and material. The article talks about using deep learning for filter recommendation (e.g. from the query “golden pendant”, predict the recommended facets: price_range, chain_length, metal_color).
Traditional search solution with embeddings
Background: in the search world, Elasticsearch is the market leader for full-text search. It was created before embeddings were widely used, so how does it incorporate semantic search?
The article mentions the 3 most important moving parts in vector search: embeddings, similarity scores/distances, and the ANN algorithm (how to search efficiently in high-dimensional embedding spaces, at scale).
Steps to add vector search to Elasticsearch:
- Elasticsearch processes the initial data with an inference processor that adds an embedding for each passage. For this, we create a text-embedding ingest pipeline and then reindex our initial data with this pipeline.
- We need to create and define a mapping for our destination index, in particular for the field text_embedding.predicted_value where the ingest processor will store embeddings. If we don’t do that, embeddings will be indexed into regular float fields and can’t be used for vector similarity search. Then reindex the collection with the pipeline.
- At query time, get the embedding for the query, then plug the resulting dense vector into _knn_search.
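The two Elasticsearch-specific pieces of these steps, the destination-index mapping and the kNN search request, can be sketched as plain request bodies (the dimensions and vector values are illustrative; with the official Python client you would pass these to `es.indices.create(...)` and `es.search(...)`):

```python
# 1) Destination-index mapping: the embedding field must be a dense_vector,
#    otherwise the ingested embeddings are indexed as regular floats and
#    cannot be used for kNN similarity search.
mapping = {
    "properties": {
        "text": {"type": "text"},
        "text_embedding.predicted_value": {
            "type": "dense_vector",
            "dims": 384,              # illustrative; must match the model
            "index": True,
            "similarity": "cosine",
        },
    }
}

# 2) kNN search request: embed the query with the same model, then plug the
#    resulting dense vector into the knn clause.
knn_query = {
    "knn": {
        "field": "text_embedding.predicted_value",
        "query_vector": [0.12, -0.07, 0.33],  # truncated for illustration
        "k": 10,
        "num_candidates": 100,
    }
}
```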
Elasticsearch 8.8 adds support for native hybrid search.
Programmatic way to index and query with vectors in Elasticsearch.
Appendix
Keyword search is still relevant. One benefit of keyword search is facets: filters that refine search results to only view items of particular interest based on a common characteristic. Think of the left-hand side of an Amazon search results page. Facets are based on the metadata associated with your documents, so the richer your metadata, the better options you can provide to your users. In an enterprise setting, common facets are geography-based (State, Country), enterprise-based (Department, Business Unit), and time-based (Published Date, Modified Date).
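In Elasticsearch-style engines, facets are typically implemented as terms aggregations over keyword metadata fields. A sketch of such a request body (the field names are hypothetical):

```python
# Faceted-search request body, Elasticsearch style: each facet is a terms
# aggregation over a keyword metadata field (field names are hypothetical).
facet_request = {
    "query": {"match": {"content": "quarterly report"}},
    "aggs": {
        "by_country": {"terms": {"field": "country.keyword"}},
        "by_department": {"terms": {"field": "department.keyword"}},
        "by_year": {"terms": {"field": "published_year"}},
    },
}
print(sorted(facet_request["aggs"]))
```

The aggregation buckets in the response give both the facet values and their document counts, which is exactly what powers the filter sidebar.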
Azure cognitive service semantic search