Vector Database Part 1

Vector search at scale

Xin Cheng
9 min read · Jan 29, 2024

This is the sixth article in the building-LLM-powered-AI-applications series. In the previous article, we introduced semantic search and its applications, as well as the technical details of semantic search. This article focuses on how to scale semantic search with a vector database (much as a traditional relational database enables keyword search at scale).

Overview

Capabilities: vector data management (inserting, deleting, updating, and querying); storing metadata associated with each vector entry and letting users query the database with additional metadata filters; scalability and efficiency; real-time updates; and, as in a traditional database, backup, security, and access control.

Pipeline:

  1. Indexing: the vector database indexes vectors using an algorithm such as PQ (product quantization), LSH (locality-sensitive hashing), or HNSW (Hierarchical Navigable Small World). This step maps the vectors to a data structure that enables faster searching.
  2. Querying: the vector database compares the indexed query vector to the indexed vectors in the dataset to find the nearest neighbors, applying the similarity metric used by that index.
  3. Post-processing: in some cases, the vector database retrieves the final nearest neighbors from the dataset and post-processes them to return the final results. This step can include re-ranking the nearest neighbors using a different similarity measure.
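A minimal sketch of this three-stage pipeline in NumPy, using brute-force cosine search in place of a real index (the toy dataset, candidate count, and re-ranking metric are illustrative assumptions):

import numpy as np

# --- Indexing: store unit-normalized vectors so dot product = cosine similarity ---
rng = np.random.default_rng(0)
dataset = rng.normal(size=(10_000, 128)).astype("float32")     # toy corpus
index = dataset / np.linalg.norm(dataset, axis=1, keepdims=True)

# --- Querying: find candidate nearest neighbors by cosine similarity ---
query = rng.normal(size=128).astype("float32")
query /= np.linalg.norm(query)
scores = index @ query                     # cosine similarity to every vector
candidates = np.argsort(-scores)[:50]      # top-50 candidates

# --- Post-processing: re-rank candidates with a different measure (L2 distance) ---
l2 = np.linalg.norm(dataset[candidates] - query, axis=1)
top_k = candidates[np.argsort(l2)[:10]]    # final top-10 results
print(top_k)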

Vector index: the most important piece of a vector database, enabling fast querying. Common algorithms:

Random Projection. Creating: project the high-dimensional vectors into a lower-dimensional space using a random projection matrix. Querying: use the same projection matrix to project the query vector into the lower-dimensional space, then compare the projected query vector to the projected vectors in the database to find the nearest neighbors. Since the dimensionality of the data is reduced, the search process is significantly faster than searching the entire high-dimensional space.
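A rough sketch of the idea in NumPy (the dimensions and the Gaussian projection matrix are illustrative choices, not prescribed by any particular database):

import numpy as np

rng = np.random.default_rng(42)
d_high, d_low = 1024, 64

# Random projection matrix, created once and reused for every vector
P = rng.normal(size=(d_high, d_low)) / np.sqrt(d_low)

vectors = rng.normal(size=(10_000, d_high)).astype("float32")
projected = vectors @ P                 # database vectors in the reduced space

query = rng.normal(size=d_high)
q_proj = query @ P                      # same matrix applied to the query

# Nearest neighbors are found in the cheaper low-dimensional space
dists = np.linalg.norm(projected - q_proj, axis=1)
print(dists.argsort()[:5])              # ids of the 5 closest vectors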

Vector embeddings are used to capture the semantic or contextual meaning of the data, allowing the data to be compared and analyzed in a meaningful way. For example, in natural language processing, words can be represented as vectors in a high-dimensional space, where words with similar meanings are closer together than words with different meanings. (Embeddings can be created with various methods such as neural networks, word2vec, and other machine learning techniques.)
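For instance, with the sentence-transformers library (the model name below is one common choice, not one the article prescribes):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings
emb = model.encode(["The cat sat on the mat.",
                    "A kitten rested on the rug.",
                    "Quarterly revenue grew 8%."])

# Semantically similar sentences end up close together in the vector space
print(util.cos_sim(emb[0], emb[1]))  # high similarity
print(util.cos_sim(emb[0], emb[2]))  # low similarity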

A vector database is a specialized type of database that stores data as high-dimensional vectors. Vectors are mathematical representations of features or attributes and can range from tens to thousands of dimensions, depending on the complexity of the data. The vectors are generated by applying an embedding function to the raw data, such as text, images, audio, and video. Vector databases offer fast and accurate similarity search and retrieval of data based on vector distance or similarity.

Indexing involves breaking down a document or website into smaller segments and converting these segments into vectors that can be stored in a vector database. When a user submits a query, the indexing module calculates the vector similarity between the embedded query and each vector in the database. Finally, the indexing module fetches the top k most similar embeddings to generate the response.
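A hedged sketch of this indexing/query flow; the embed() function below is a placeholder for a real embedding model:

import numpy as np

# Hypothetical embedding function standing in for a real model
def embed(texts):
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384)).astype("float32")

# 1. Break the document into smaller segments (fixed-size word windows here)
document = "Vector databases index embeddings so that semantically similar content can be retrieved quickly."
words = document.split()
segments = [" ".join(words[i:i + 50]) for i in range(0, len(words), 50)]

# 2. Convert the segments to vectors and store them
segment_vecs = embed(segments)

# 3. Embed the user query and fetch the top-k most similar segments
query_vec = embed(["How do vector databases retrieve data?"])[0]
sims = segment_vecs @ query_vec / (
    np.linalg.norm(segment_vecs, axis=1) * np.linalg.norm(query_vec))
top_k = np.argsort(-sims)[:3]
context = [segments[i] for i in top_k]   # passed on to generate the response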

Lists the top 10 vector databases and libraries: Elasticsearch, Faiss, Milvus, Weaviate, Pinecone, Qdrant, Vespa, Vald, ScaNN, pgvector.

Milvus, Weaviate, and Qdrant all support RESTful and gRPC APIs; Weaviate also supports GraphQL.

The landscape of vector search falls into four categories, based on open source vs. commercial and dedicated vector database vs. general database:

  1. Open-source dedicated vector databases: Milvus, Chroma
  2. Commercial vector databases: Pinecone, Weaviate
  3. Open-source databases that support vector search: PostgreSQL, ClickHouse
  4. Commercial databases that support vector search: Elasticsearch, Rockset

Traditionally, databases fill simple query needs over structured data, and Elasticsearch fills full-text query needs over semi-structured data. Now there are more advanced needs for search over unstructured data (e.g. image/video/audio similarity search, natural-language search of image/video/audio). Whether different needs will require dedicated technology or a unified stack will emerge, we will have to wait and see.

High-level introduction to index-creation algorithms

  1. Random Projection: a dimensionality-reduction technique that approximates high-dimensional data by projecting it onto a lower-dimensional space. A random projection matrix is created once; the dot product of the input vectors and the random projection matrix yields a projected matrix with fewer dimensions. During querying, the same projection matrix is applied to project the query vector into the lower-dimensional space, enabling faster search.
  2. Product Quantization: the PQ process involves four main steps: splitting (high-dimensional vectors are divided into smaller segments), training (a “codebook” is created by finding the center points of clusters through k-means clustering applied to each segment), encoding (the PQ code for a segment is the identifier of the corresponding value in the codebook), and querying (the algorithm breaks the query down into sub-vectors, quantizes them using the same codebook, and then leverages the indexed codes to find the nearest vectors). See the sketch after this list.
  3. Hierarchical Navigable Small World (HNSW): creates a hierarchical, graph-based structure where each node represents a set of vectors and the edges between nodes represent the similarity between vectors. During querying, it navigates through this hierarchy, visiting the nodes most likely to contain the vectors closest to the query vector.
  4. Locality-Sensitive Hashing (LSH): maps similar vectors into “buckets” using a set of hashing functions. During querying, the query vector is bucketed into the hash table using the same hashing functions and then compared with the other vectors in that bucket to find the closest matches.
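As referenced in item 2, here is a condensed sketch of the PQ splitting/training/encoding steps using scikit-learn's KMeans (the segment count and codebook size are illustrative choices):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5000, 128)).astype("float32")

m, k = 8, 256                       # 8 segments, 256 centroids per codebook
seg_len = vectors.shape[1] // m     # 16 dimensions per segment

# Splitting + training: one k-means codebook per segment
codebooks = []
for i in range(m):
    seg = vectors[:, i * seg_len:(i + 1) * seg_len]
    codebooks.append(KMeans(n_clusters=k, n_init=1).fit(seg))

# Encoding: each vector becomes m small integer codes
codes = np.stack(
    [cb.predict(vectors[:, i * seg_len:(i + 1) * seg_len])
     for i, cb in enumerate(codebooks)], axis=1).astype(np.uint8)

print(codes.shape)   # (5000, 8): 8 bytes per vector instead of 512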

The post claims that it is a challenge to get a vector database like Pinecone even running, and that the “retrieve-filter-hydrate workflow” was often a big bottleneck in productivity and app latency. It also mentions LangChain integration and a roadmap including LlamaIndex integration, integrations into OpenAI plugins / AutoGPT, etc., a gallery of generative AI apps powered by LanceDB, and solutions for cloud deployment.

Semantic search

Pinecone is a managed vector database. It enables efficient and accurate retrieval of similar vectors, making it suitable for recommendation systems, anomaly detection, and search engines.

Question answering with the Pinecone vector database and GPT-4

Other semantic search articles 1, 2, 3

Reduce 200K sentences of YouTube transcription to 50K contexts, using context windows of size 20 with stride 4: every 4 sentences, a “context” is created by concatenating the next 20 sentences together. Then: convert each context to an embedding; merge the embeddings (vectors) with the original data and write them to storage; create an ANN index; search for the relevant context with the query vector; and send the retrieved context to the LLM, with a prompt like “answer the question based on context: <context>”.
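A sketch of the windowing step (window 20, stride 4), assuming sentences is the transcription already split into sentences:

def build_contexts(sentences, window=20, stride=4):
    # Start a new context every `stride` sentences; each one concatenates
    # the next `window` sentences.
    return [" ".join(sentences[i:i + window])
            for i in range(0, len(sentences), stride)]

sentences = [f"sentence {i}" for i in range(200_000)]
contexts = build_contexts(sentences)
print(len(contexts))   # 50,000 contexts from 200K sentences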

Lists embeddings already created for existing text corpora, which are good for semantic search, e.g. the Wikipedia Article Embeddings in Many Languages created by Cohere.ai.

Case-based reasoning with a vector database can be transformed into searching for similar cases and similar responses with vector search.

Vector databases vs. traditional relational databases

Relational databases store structured data, which is generally easy to search and analyze. In contrast, unstructured data is more complex and requires more work to understand.

Machine learning and deep learning models can help us understand this unstructured data by transforming it into vector embeddings. These embeddings are high-dimensional vectors that describe complex data as numerical values in different dimensions.

A relational database retrieves results that are an exact/keyword match, while a vector database offers more complex search capabilities (semantic search/match).

Vector Databases Use Cases: Recommendation Engines, Semantic Search, Similarity Search.

Pinecone vs. Chroma: Pinecone is a managed database persistence service, while Chroma is open source and you can host your own instance.

Semantic search example: uses the arXiv metadata dataset on Kaggle, which contains metadata such as title and abstract; the OpenAI embedding model text-embedding-ada-002; and Pinecone.
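A rough sketch of that setup (the index name, document id, and metadata fields are placeholders; API keys are assumed to be set in the environment):

from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()                    # reads OPENAI_API_KEY from the environment
pc = Pinecone()                      # reads PINECONE_API_KEY from the environment
index = pc.Index("arxiv-articles")   # hypothetical pre-created index

def embed(text):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return resp.data[0].embedding

# Upsert a paper's title + abstract embedding, then query semantically
index.upsert(vectors=[("paper-1", embed("Attention Is All You Need ..."),
                       {"title": "Attention Is All You Need"})])
matches = index.query(vector=embed("transformer architectures"),
                      top_k=3, include_metadata=True)
print(matches)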

Vector database indexing

  • hash-based indexing (e.g. locality-sensitive hashing),
  • tree-based indexing (e.g. ANNOY),
  • cluster-based indexing (e.g. product quantization), and
  • graph-based indexing (e.g. HNSW).

The purpose of a vector index is to search and retrieve data from a large set of vectors (like a table index in a relational database). The class of algorithms used to build and search vector indexes is called Approximate Nearest Neighbor (ANN) search. ANN algorithms rely on a similarity measure to determine the nearest neighbors.
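Typical similarity measures, written as plain NumPy functions:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def dot_product(a, b):
    return np.dot(a, b)

def euclidean_distance(a, b):        # a distance: lower = more similar
    return np.linalg.norm(a - b)

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 3.0, 4.0])
print(cosine_similarity(a, b), dot_product(a, b), euclidean_distance(a, b))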

Flat indexing: the similarity between the query vector and every other vector in the index is computed. Exact, but slow.
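With Faiss (which the article's later snippet also uses), a flat index takes a few lines; the data here is random and only illustrative:

import faiss
import numpy as np

d = 128
xb = np.random.random((10_000, d)).astype("float32")   # database vectors
xq = np.random.random((5, d)).astype("float32")        # query vectors

index = faiss.IndexFlatL2(d)          # exact, exhaustive L2 search
index.add(xb)
distances, ids = index.search(xq, 4)  # 4 nearest neighbors per query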

Locality Sensitive Hashing (LSH) indexes: the index is built using a hashing function. Vector embeddings that are near each other are hashed to the same bucket. We can then store all these similar vectors in a single table or bucket. When a query vector is provided, its nearest neighbors are found by hashing the query vector and then computing the similarity metric against all the vectors that hashed to the same value.
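A minimal LSH index via Faiss (the number of hash bits is an illustrative choice):

import faiss
import numpy as np

d, nbits = 128, 256                  # 256-bit hash codes (assumption)
xb = np.random.random((10_000, d)).astype("float32")

index = faiss.IndexLSH(d, nbits)     # nearby vectors get similar hash codes
index.train(xb)
index.add(xb)
D, I = index.search(xb[:3], 5)       # approximate neighbors of the first 3 vectors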

Inverted file (IVF) indexes: the vector space is first partitioned or clustered, and then centroids of each cluster are found. For a given query vector, we then find the closest centroid. Then for that centroid, we search all the vectors in the associated cluster. When the query vector is near the edge of multiple clusters, the nearest vectors may be in the neighboring cluster. In these cases, we generally need to search multiple clusters.
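An IVF sketch in Faiss; nprobe controls how many neighboring clusters are searched for queries near cluster edges (the cluster count is an illustrative choice):

import faiss
import numpy as np

d, nlist = 128, 100                        # 100 clusters (assumption)
xb = np.random.random((10_000, d)).astype("float32")

quantizer = faiss.IndexFlatL2(d)           # used to find the closest centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                            # learns the cluster centroids
index.add(xb)

index.nprobe = 8    # search 8 nearby clusters to cover edge-of-cluster queries
D, I = index.search(xb[:3], 5)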

Hierarchical Navigable Small Worlds (HNSW) indexes: fast queries via a multi-layered graph. In a single layer, points are connected based on their similarity, and data points in each layer are also connected to data points in the layer below. To search the index, we first search the highest layer of the graph. The closest match from that layer is then taken to the next layer down, where we again find the closest matches to the query vector. We continue this process until we reach the lowest layer of the graph.
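An HNSW sketch in Faiss (the M, efConstruction, and efSearch values are illustrative; efSearch is the ef speed/quality trade-off discussed below):

import faiss
import numpy as np

d, M = 128, 32                       # M = per-node graph connectivity (assumption)
xb = np.random.random((10_000, d)).astype("float32")

index = faiss.IndexHNSWFlat(d, M)    # multi-layered graph index
index.hnsw.efConstruction = 200      # build-time candidate list size
index.add(xb)                        # no separate training step

index.hnsw.efSearch = 64             # larger ef = better recall, slower query
D, I = index.search(xb[:3], 5)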

Weaviate supports hnsw (default) and flat index types. Available distance metrics: cosine, dot, l2-squared, hamming, manhattan.

The memory cache only stores the highest layer instead of storing all of the data objects in the lowest layer. When the search moves from a higher layer to a lower one, HNSW only adds the data objects that are closest to the search query. This means HNSW uses a relatively small amount of memory compared to other search algorithms.

HNSW parameters: the ef parameter is a critical setting for balancing the trade-off between search speed and quality; it controls the size of the candidate (ANN) list during the search process. A higher ef value results in a more extensive search, enhancing accuracy but potentially slowing down the query.

Product quantization is a multi-step quantization technique that is available for use with hnsw indexes. Quantization techniques represent numbers with lower-precision numbers; in PQ, the original vector is represented as a product of smaller vectors called “segments” or “subspaces.”

Faiss supports the LSH, IVF, and PQ index types, as well as GPU indexes:

import faiss

d = 128                              # vector dimensionality
res = faiss.StandardGpuResources()   # allocate GPU resources
flat_config = faiss.GpuIndexFlatConfig()
flat_config.device = 0               # which GPU to use
index = faiss.GpuIndexFlatL2(res, d, flat_config)

Appendix
