Vector Database Part 1

Vector search at scale

Xin Cheng
9 min read · Jan 29, 2024

This is the sixth article in the building-LLM-powered-AI-applications series. In the previous article, we introduced semantic search and its applications, as well as the technical details of semantic search. This article focuses on how to scale semantic search with a vector database (just as a traditional relational database enables keyword search at scale).

Overview

Capabilities: vector data management (inserting, deleting, updating, and querying); storing metadata associated with each vector entry and letting users add metadata filters to queries; scalability and efficiency; real-time updates; and, like traditional databases, backup, security, and access control.

A typical vector database pipeline:

  1. Indexing: the vector database indexes vectors using an algorithm such as PQ (product quantization), LSH (locality-sensitive hashing), or HNSW (Hierarchical Navigable Small World). This step maps the vectors to a data structure that enables faster searching.
  2. Querying: the vector database compares the indexed query vector to the indexed vectors in the dataset to find the nearest neighbors, applying the similarity metric used by that index.
  3. Post-processing: in some cases, the vector database retrieves the final nearest neighbors from the dataset and post-processes them to return the final results, e.g. re-ranking the nearest neighbors using a different similarity measure. A toy sketch of this query-then-re-rank flow follows the list.
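To make the pipeline concrete, here is a toy NumPy sketch of steps 2 and 3, with a brute-force dot-product search standing in for a real index (all sizes and the re-ranking metric are illustrative choices, not from the original post):

import numpy as np

# Toy pipeline: the "index" is just a matrix; querying takes the top 20 by dot
# product, and post-processing re-ranks those candidates by cosine similarity.
rng = np.random.default_rng(1)
vectors = rng.normal(size=(1000, 128)).astype("float32")   # indexed embeddings
query = rng.normal(size=(128,)).astype("float32")

# Querying: coarse top-20 candidates by dot product
scores = vectors @ query
top20 = np.argsort(scores)[::-1][:20]

# Post-processing: re-rank the candidates with a different measure (cosine)
cand = vectors[top20]
cosine = (cand @ query) / (np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
final = top20[np.argsort(cosine)[::-1][:5]]
print(final)                                               # ids of the final top 5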

The vector index is the most important piece of a vector database, enabling fast querying. Algorithms include:

Random projection: to build the index, project the high-dimensional vectors into a lower-dimensional space using a random projection matrix. To query, use the same projection matrix to project the query vector into the lower-dimensional space, then compare the projected query vector to the projected vectors in the database to find the nearest neighbors. Since the dimensionality of the data is reduced, the search process is significantly faster than searching the entire high-dimensional space.
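A minimal NumPy sketch of this index-then-query flow (the dimensions and the Gaussian projection matrix are illustrative choices):

import numpy as np

# Build: one shared random projection matrix maps 768-dim vectors to 64 dims.
rng = np.random.default_rng(42)
d, k = 768, 64
R = rng.normal(size=(d, k)) / np.sqrt(k)       # random projection matrix

vectors = rng.normal(size=(10000, d))
projected = vectors @ R                        # store only the low-dim projections

# Query: project with the SAME matrix, then search in the reduced space.
query = rng.normal(size=(d,))
q_proj = query @ R
dists = np.linalg.norm(projected - q_proj, axis=1)
print(np.argsort(dists)[:5])                   # approximate 5 nearest neighbors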

Vector embeddings are used to capture the semantic or contextual meaning of the data, allowing the data to be compared and analyzed in a meaningful way. For example, in natural language processing, words can be represented as vectors in a high-dimensional space, where words with similar meanings are closer together than words with different meanings. (Embeddings can be created with various methods such as neural networks, word2vec, and other machine learning techniques.)
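For instance, a hedged sketch using the sentence-transformers package (the model name is one common choice and an assumption here; any embedding model would do):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed model choice
emb = model.encode(["king", "queen", "banana"])

print(util.cos_sim(emb[0], emb[1]))   # related words: higher cosine similarity
print(util.cos_sim(emb[0], emb[2]))   # unrelated words: lower cosine similarity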

A vector database is a specialized type of database that stores data as high-dimensional vectors. Vectors are mathematical representations of features or attributes that can range from tens to thousands of dimensions depending on the complexity of the data. The vectors are generated by applying an embedding function to the raw data, such as text, images, audio, or video. Vector databases offer the advantage of fast and accurate similarity search and retrieval of data based on vector distance or similarity.

Indexing involves breaking down a document or website into smaller segments and converting these segments into vectors that can be stored in a vector database. When a user submits a query, the indexing module calculates the vector similarity between the embedded query and each vector in the database. Finally, the indexing module fetches the top k most similar embeddings to generate the response.

A list of the top 10 vector databases and libraries: Elasticsearch, Faiss, Milvus, Weaviate, Pinecone, Qdrant, Vespa, Vald, ScaNN, pgvector.

Milvus, Weaviate, and Qdrant all support RESTful and gRPC APIs; Weaviate also supports GraphQL.

The landscape of vector search falls into 4 categories, based on open-source vs. commercial and dedicated vector database vs. general database with vector support:

  1. Open-source dedicated vector databases: Milvus, Chroma
  2. Commercial vector databases: Pinecone, Weaviate
  3. Open-source databases that support vector search: PostgreSQL, ClickHouse
  4. Commercial databases that support vector search: Elasticsearch, Rockset

Traditionally, relational databases fill simple query needs over structured data, and Elasticsearch fills full-text query needs over semi-structured data. Now there are more advanced needs for search over unstructured data (e.g. image/video/audio similarity search, natural-language search of image/video/audio). Whether each need will require a dedicated technology or a unified stack will emerge, we have to wait and see.

High-level introduction to index-creation algorithms

  1. Random projection: a dimensionality-reduction technique used to approximate high-dimensional data by projecting it onto a lower-dimensional space. A random projection matrix is created, and the dot product of the input vectors and the random projection matrix yields a projected matrix with fewer dimensions. During querying, the same projection matrix is applied to the query vector so that the search runs in the lower-dimensional space (see the NumPy sketch earlier).
  2. Product quantization (PQ): the PQ process involves four main steps: splitting (high-dimensional vectors are divided into smaller segments), training (a "codebook" is created by finding the center points of clusters through k-means clustering applied to each segment), encoding (the PQ code for a segment is the identifier of the corresponding value in the codebook), and querying (the algorithm breaks the vectors down into sub-vectors and quantizes them using the same codebook, then leverages the indexed codes to find the vectors nearest to the query vector). A Faiss PQ sketch follows this list.
  3. Hierarchical Navigable Small World (HNSW): creates a hierarchical, tree-like structure in which each node represents a set of vectors and the edges between nodes represent similarity between vectors. During querying, the algorithm navigates this graph, visiting the nodes most likely to contain the vectors closest to the query vector.
  4. Locality-sensitive hashing (LSH): maps similar vectors into "buckets" using a set of hashing functions. During querying, the query vector is hashed to a bucket using the same hashing functions and is then compared with the other vectors in that bucket to find the closest matches.
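A minimal product-quantization sketch with Faiss (all parameter values are illustrative):

import faiss
import numpy as np

d, m, nbits = 64, 8, 8                          # 8 segments, 256 centroids per segment
xb = np.random.rand(10000, d).astype("float32") # vectors to index
xq = np.random.rand(5, d).astype("float32")     # query vectors

index = faiss.IndexPQ(d, m, nbits)              # each vector stored as m 8-bit codes
index.train(xb)                                 # k-means per segment builds the codebook
index.add(xb)                                   # encode vectors into PQ codes

D, I = index.search(xq, 10)                     # approximate top-10 per query
print(I[0])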

The post claims it is a challenge to get a vector database like Pinecone even running, and that the "retrieve-filter-hydrate workflow" was often a big bottleneck for productivity and app latency. It also mentions LangChain integration and a roadmap: LlamaIndex integration, integrations into OpenAI plugins / AutoGPT, etc., a gallery of generative AI apps powered by LanceDB, and solutions for cloud deployment.

Semantic search

Pinecone is a managed vector database. It enables efficient and accurate retrieval of similar vectors, making it suitable for recommendation systems, anomaly detection, and search engines.

Question answering with the Pinecone vector database and GPT-4

Other semantic search articles: 1, 2, 3

Reduce 200K sentences of YouTube transcription to 50K contexts, using context windows of size 20 with stride 4: every 4 sentences, create a "context" by concatenating the next 20 sentences together (a sketch of this windowing follows). Then convert each context to an embedding; merge the embeddings (vectors) with the original data and write the result to storage; create an ANN index; search contexts with the query vector; and send the retrieved context to the LLM with a prompt like "answer the question based on context: <context>".
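A minimal sketch of that windowing scheme (a 200-sentence stand-in transcript shows the same 4x reduction):

# Every `stride` sentences, concatenate the next `window` sentences into a context.
def make_contexts(sentences, window=20, stride=4):
    return [
        " ".join(sentences[start:start + window])
        for start in range(0, len(sentences), stride)
    ]

sentences = [f"sentence {i}" for i in range(200)]   # stand-in transcript
contexts = make_contexts(sentences)
print(len(sentences), "->", len(contexts))          # 200 -> 50, mirroring 200K -> 50K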

Lists embeddings created for existing text corpora, useful for semantic search, e.g. Wikipedia Article Embeddings in Many Languages created by Cohere.ai.

With a vector database, case-based reasoning can be transformed into searching for similar cases and similar responses with vector search.

Vector databases vs. traditional relational databases

Relational databases store structured data, which is generally easy to search and analyze. In contrast, unstructured data is more complex and requires more work to understand.

Machine learning and deep learning models can help us understand this unstructured data by transforming it into vector embeddings. These embeddings are high-dimensional vectors that describe complex data as numerical values in different dimensions.

A relational database retrieves results that are an exact/keyword match, while a vector database offers more complex search capabilities (semantic search/match).

Vector Databases Use Cases: Recommendation Engines, Semantic Search, Similarity Search.

Pinecone vs. Chroma: Pinecone is a managed database persistence service; Chroma is open source, and you can host your own instance.

Semantic search example: uses the arXiv metadata dataset on Kaggle (which contains metadata such as title and abstract), the OpenAI embedding model text-embedding-ada-002, and Pinecone.
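A hedged sketch of that workflow, assuming the openai v1 Python package and the classic pinecone-client API (the index name, environment, key, and text are placeholders):

from openai import OpenAI
import pinecone

client = OpenAI()                                   # reads OPENAI_API_KEY from the environment
pinecone.init(api_key="YOUR_KEY", environment="YOUR_ENV")
index = pinecone.Index("arxiv-abstracts")           # assumed 1536-dim index

abstract = "We study approximate nearest neighbor search ..."
emb = client.embeddings.create(
    model="text-embedding-ada-002", input=abstract
).data[0].embedding                                 # 1536-dim embedding

index.upsert([("paper-1", emb, {"title": "ANN search"})])
print(index.query(vector=emb, top_k=5, include_metadata=True))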

Vector database indexing

  • hash-based indexing (e.g. locality-sensitive hashing),
  • tree-based indexing (e.g. ANNOY),
  • cluster-based indexing (e.g. product quantization), and
  • graph-based indexing (e.g. HNSW).

The purpose of a vector index is to search and retrieve data from a large set of vectors (like a table index in a relational database). The class of algorithms used to build and search vector indexes is called Approximate Nearest Neighbor (ANN) search. ANN algorithms rely on a similarity measure to determine the nearest neighbors.

Flat indexing: the similarity between the query vector and every other vector in the index is computed. Exact, but slow.

Locality-sensitive hashing (LSH) indexes: the index is built using a hashing function. Vector embeddings that are near each other are hashed to the same bucket, so all of these similar vectors can be stored in a single table or bucket. When a query vector is provided, its nearest neighbors are found by hashing the query vector and then computing the similarity metric against all the vectors that hashed to the same value. A sketch follows.
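A minimal random-hyperplane LSH sketch in NumPy (a single hash table for illustration; production systems typically use several):

import numpy as np

rng = np.random.default_rng(0)
d, n_planes = 128, 8
planes = rng.normal(size=(n_planes, d))           # hashing hyperplanes

def lsh_bucket(v):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple((planes @ v > 0).astype(int))

vectors = rng.normal(size=(10000, d))
buckets = {}
for i, v in enumerate(vectors):
    buckets.setdefault(lsh_bucket(v), []).append(i)

query = rng.normal(size=(d,))
candidates = buckets.get(lsh_bucket(query), [])   # compare only within this bucket
print(len(candidates), "candidates instead of 10000")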

Inverted file (IVF) indexes: the vector space is first partitioned or clustered, and the centroid of each cluster is found. For a given query vector, we find the closest centroid and then search all the vectors in the associated cluster. When the query vector is near the edge of multiple clusters, the nearest vectors may be in a neighboring cluster; in these cases, we generally need to search multiple clusters. A Faiss sketch follows.
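A minimal IVF sketch with Faiss (the nlist and nprobe values are illustrative):

import faiss
import numpy as np

d, nlist = 64, 100                                  # dimensionality, number of clusters
xb = np.random.rand(10000, d).astype("float32")
xq = np.random.rand(5, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)                    # assigns vectors to the nearest centroid
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                                     # k-means learns the nlist centroids
index.add(xb)

index.nprobe = 8                                    # search 8 clusters to cover edge cases
D, I = index.search(xq, 10)
print(I[0])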

Hierarchical Navigable Small Worlds (HNSW) indexes: fast queries via a multi-layered graph. Within a single layer, points are connected based on their similarity, and data points in each layer are also connected to data points in the layer below. To search the index, we first search the highest layer of the graph; the closest match is then taken to the next layer down, where we again find the closest matches to the query vector, continuing until we reach the lowest layer. A Faiss sketch follows.
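A minimal HNSW sketch with Faiss (the M and efSearch values are illustrative):

import faiss
import numpy as np

d, M = 64, 32                                       # dimensionality, links per node
xb = np.random.rand(10000, d).astype("float32")
xq = np.random.rand(5, d).astype("float32")

index = faiss.IndexHNSWFlat(d, M)                   # graph index; no training step
index.hnsw.efSearch = 64                            # candidate-list size during descent
index.add(xb)

D, I = index.search(xq, 10)
print(I[0])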

Weaviate supports the hnsw (default) and flat index types. Available distance metrics: cosine, dot, l2-squared, hamming, manhattan.
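A hedged sketch of configuring this, assuming the v3-style weaviate-client Python package (the class name and endpoint are placeholders):

import weaviate

client = weaviate.Client("http://localhost:8080")
client.schema.create_class({
    "class": "Article",
    "vectorIndexType": "hnsw",                    # or "flat"
    "vectorIndexConfig": {"distance": "cosine"},  # dot, l2-squared, hamming, manhattan
})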

The memory cache only stores the highest layer instead of storing all of the data objects in the lowest layer. When the search moves from a higher layer to a lower one, HNSW only adds the data objects that are closest to the search query. This means HNSW uses a relatively small amount of memory compared to other search algorithms.

HNSW parameters: the ef parameter is a critical setting for balancing the trade-off between search speed and quality; it controls the size of the candidate (ANN) list during the search process. A higher ef value results in a more extensive search, enhancing accuracy but potentially slowing down the query.
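A brief hnswlib sketch of turning that dial (the values are illustrative):

import hnswlib
import numpy as np

d = 64
data = np.random.rand(10000, d).astype("float32")

index = hnswlib.Index(space="l2", dim=d)
index.init_index(max_elements=10000, ef_construction=200, M=16)
index.add_items(data)

index.set_ef(200)                       # larger ef: wider search, better recall, slower
labels, dists = index.knn_query(data[:5], k=10)
print(labels[0])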

Product quantization is a multi-step quantization technique that is available for use with hnsw indexes. Quantization techniques represent numbers with lower-precision numbers; in PQ, the original vector is represented as a product of smaller vectors called "segments" or "subspaces."

Faiss supports the LSH, IVF, and PQ index types, as well as GPU acceleration:

import faiss

d = 128                                  # vector dimensionality
res = faiss.StandardGpuResources()       # GPU resources object (missing in the original snippet)
flat_config = faiss.GpuIndexFlatConfig()
flat_config.device = 0                   # run on GPU 0

index = faiss.GpuIndexFlatL2(res, d, flat_config)   # exact (flat) L2 index on the GPU
