Semantic search in Machine Learning, part 1
This is the fourth article in the building LLM-powered AI applications series. In the previous article, we introduced embeddings/vectors, the enabling technology for semantic search.
We first introduced semantic search in the second article of the series. A quick recap: semantic search understands the searcher’s intent and the contextual meaning of terms. Another article on semantic search.
Overview
The article introduces semantic search and covers two kinds of embeddings in NLP: word embeddings and sentence embeddings (embeddings can also represent images, audio, or video, as long as they are numerical representations of a piece of information), along with considerations for deciding whether to use semantic search at all.
The article uses the txtai.embeddings library to demo semantic search.
Transformer-based models can better understand the context of words and phrases, making it easier to provide more relevant search results. Multi-modal search refers to the ability to search for information across multiple modes, such as text, images, and videos.
Applications of Semantic Search
- Intelligent Query/Search/Question Answering: Semantic search enables users to enter search queries consisting of words, semantic expressions, or sample documents, and retrieve results ranked based on semantic similarity. It helps in finding pieces of information, understanding user intent, and providing relevant search results.
- Metadata Extraction/Indexing/Categorization: Semantic search aids in extracting metadata from various sources such as images and documents. It allows for categorizing information based on its intent and contextual meaning, enabling efficient management and organization of unstructured data.
- Knowledge Management: Semantic search solutions facilitate knowledge management within organizations. They enable search and retrieval of organization-wide information, sentiment analysis of reviews and comments, and seamless integration of business intelligence strategies.
- Spelling Tolerance: Semantic search can handle spelling errors and variations in search queries. It automatically corrects simple errors such as the insertion, omission, substitution, and transposition of characters, enhancing search accuracy and user experience.
Build semantic search applications
The article talks about building a question answering solution using the open-source Haystack framework, which powers the following use cases:
- Extractive Question Answering, where answers are “extracted” from the body of text.
- Generative Question Answering, where answers are “generated” word by word by a language model, rather than extracted verbatim from the text.
- FAQ search: answering questions by matching them against a corpus of existing question-answer pairs.
- Search in text-based internal systems, like financial reports or legal case search systems (Document Search).
Haystack supports many backend datastores (Elasticsearch as well as popular vector databases) and serves as the indexing and querying frontend. It also provides a pipeline abstraction (e.g. an Elasticsearch retriever node followed by a question answering node).
More components in haystack-based semantic search:
- EmbeddingRetriever
- FAISSDocumentStore
- DocumentSearchPipeline
If you are unhappy with the retriever’s results, you can add a “ranker” node to rerank documents.
A simple small-scale search application uses the OpenAI embedding API to convert documents (the resulting embeddings are stored in a new column of the dataframe and saved to a CSV file); at query time, cosine_similarity is used to compute the similarity between the query embedding and each document embedding. To enable semantic search, you need:
- Embedding generator
- Embedding store
- Embedding indexing (for efficient search, like relational database indexing)
- Embedding retriever/query engine
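The query side of this setup boils down to scoring each stored embedding against the query embedding. A dependency-free sketch, where the toy 3-dimensional vectors stand in for real model embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embedding store": in the article this is a dataframe column saved to CSV.
store = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.0, 0.8, 0.6],
    "doc3": [0.7, 0.7, 0.1],
}

def search(query_embedding, store, top_k=2):
    """Rank stored documents by cosine similarity to the query embedding."""
    scored = [(doc_id, cosine_similarity(query_embedding, emb))
              for doc_id, emb in store.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

print(search([1.0, 0.0, 0.0], store))
```

This brute-force scan is fine at small scale; the indexing component in the list above (e.g. an ANN index) exists precisely to avoid scoring every document once the corpus grows.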
A series about embeddings. The first article talks about transformers and where embeddings fit in the transformer architecture, then covers the two main model families:
- BERT-like, only using the encoder part of the transformer. Good at classification, summarization, and entity recognition.
- GPT family, decoder only. Good at generative tasks like translation and QA.
MTEB and BEIR are benchmarks for evaluating information retrieval models. Current state-of-the-art embedding models:
- SBERT models (all-MiniLM-L6-v2, all-MiniLM-L12-v2, and all-mpnet-base-v2) are striking a good balance between simplicity and ranking quality.
- SGPT (5.8B, 2.7B, 1.3B) is a recent take on the LoRA-finetuned open-source GPT-NeoX model for ranking.
- GTR-T5 is Google’s open-source embedding model for semantic search, using the T5 LLM as a base.
- E5 (v1 and v2) is the newest embedding model from Microsoft.
The majority of these models have multilingual versions available:
- E5: multilingual-e5-base, which is an aligned version of multilingual XLM-RoBERTa-base.
- SGPT: sgpt-bloom, which is based on a BLOOM model.
- SBERT: multilingual-MiniLM-L12-v2 based on mUSE.
Benefit of metadata: if text data comes with metadata, we can easily exploit that fact in a question answering task by passing a filter to our retriever-reader pipeline. The retriever then only preselects documents that match our filter, greatly reducing the search space for the rest of the pipeline. For example, if you want to look at one competitor at a time, you pass the retriever a filter with the company’s name and the financial years you’d like to investigate.
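In a Haystack-style pipeline, such a filter is just a dictionary passed to the retriever at query time. A small sketch of building those parameters (the company name, years, and metadata field names are hypothetical):

```python
def build_retriever_params(company, years, top_k=10):
    """Build the `params` dict for a Haystack-style retriever-reader pipeline.

    The filters restrict the retriever to documents whose metadata matches,
    shrinking the search space before the reader ever runs.
    """
    return {
        "Retriever": {
            "top_k": top_k,
            "filters": {
                "company": [company],             # hypothetical metadata field
                "year": [str(y) for y in years],  # hypothetical metadata field
            },
        }
    }

params = build_retriever_params("ACME Corp", [2021, 2022])
# With a real pipeline this would be passed as:
# pipeline.run(query="What was revenue growth?", params=params)
print(params)
```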
Metadata filtering: an eCommerce store lists its products on its website; however, you’d also see additional filters such as price range, color, and material. The article talks about using deep learning for filter recommendation (e.g. from the query “golden pendant”, predict the recommended facets: price_range, chain_length, metal_color).
Traditional search solution with embeddings
Background: in the search world, Elasticsearch is the market leader for full-text search. It was created before embeddings were widely used, so how does it incorporate semantic search?
The article mentions the 3 most important moving parts in vector search: embeddings, similarity scores/distances, and the ANN algorithm (how to search efficiently in high-dimensional embedding spaces, at scale).
Steps to add vector search to Elasticsearch:
- Elasticsearch processes the initial data with an inference processor that adds an embedding for each passage. For this, we create a text-embedding ingest pipeline and then reindex our initial data with this pipeline.
- We need to create and define a mapping for our destination index, in particular for the field text_embedding.predicted_value where the ingest processor will store embeddings. If we don’t do that, embeddings will be indexed into regular float fields and can’t be used for vector similarity search. Then reindex the collection with the pipeline.
- At query time, get the embedding for the query, then plug the resulting dense vector into _knn_search.
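The two Elasticsearch-specific pieces of these steps, the destination-index mapping and the kNN search request, can be sketched as plain request bodies (the dimensions and vector values are illustrative; with the official Python client you would pass these to `es.indices.create(...)` and `es.search(...)`):

```python
# 1) Destination-index mapping: the embedding field must be a dense_vector,
#    otherwise the ingested embeddings are indexed as regular floats and
#    cannot be used for kNN similarity search.
mapping = {
    "properties": {
        "text": {"type": "text"},
        "text_embedding.predicted_value": {
            "type": "dense_vector",
            "dims": 384,              # illustrative; must match the model
            "index": True,
            "similarity": "cosine",
        },
    }
}

# 2) kNN search request: embed the query with the same model, then plug the
#    resulting dense vector into the knn clause.
knn_query = {
    "knn": {
        "field": "text_embedding.predicted_value",
        "query_vector": [0.12, -0.07, 0.33],  # truncated for illustration
        "k": 10,
        "num_candidates": 100,
    }
}
```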
Elasticsearch 8.8 adds support for native hybrid search.
Programmatic way to index and query with vectors in Elasticsearch.
Appendix
Keyword search is still relevant. One benefit of keyword search is facets: filters that refine search results to only view items of particular interest based on a common characteristic. Think of the left-hand side of an Amazon search results page. Facets are based on the metadata associated with your documents, so the richer your metadata, the better options you can provide to your users. In an enterprise setting, common facets are geography-based (State, Country), enterprise-based (Department, Business Unit), and time-based (Published Date, Modified Date).
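In Elasticsearch-style engines, facets are typically implemented as terms aggregations over keyword metadata fields. A sketch of such a request body (the field names are hypothetical):

```python
# Faceted-search request body, Elasticsearch style: each facet is a terms
# aggregation over a keyword metadata field (field names are hypothetical).
facet_request = {
    "query": {"match": {"content": "quarterly report"}},
    "aggs": {
        "by_country": {"terms": {"field": "country.keyword"}},
        "by_department": {"terms": {"field": "department.keyword"}},
        "by_year": {"terms": {"field": "published_year"}},
    },
}
print(sorted(facet_request["aggs"]))
```

The aggregation buckets in the response give both the facet values and their document counts, which is exactly what powers the filter sidebar.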
Azure cognitive service semantic search