Vector Database Part 2: Index

Vector search at scale

Xin Cheng
9 min read · Apr 7, 2024

This is the 7th article in the series on building LLM-powered AI applications. In the previous article, we introduced vector databases and gave an overview of the different index types. Let's look at the various indexes in more detail to understand how they differ.

Vector database indexing types

Flat

Description: exhaustive kNN search is referred to as "flat" in Faiss. There is no approximation or clustering, so it produces the most accurate results: the query vector xq is compared against every other full-size vector to compute distances, and the nearest k vectors to xq are returned.

When to use: search quality is a very high priority, and search time does not matter or the index is small (<10K vectors).

faiss.IndexFlatL2: similarity is computed as Euclidean (L2) distance.

faiss.IndexFlatIP: similarity is computed as an inner product.
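
As a quick illustration (the dimensions and random data below are made up, not from the article), a flat index in Faiss can be built and queried like this:

    import faiss
    import numpy as np

    d = 128                                               # vector dimensionality (arbitrary)
    xb = np.random.random((10_000, d)).astype("float32")  # database vectors
    xq = np.random.random((5, d)).astype("float32")       # query vectors

    index = faiss.IndexFlatL2(d)    # exhaustive search with Euclidean (L2) distance
    # index = faiss.IndexFlatIP(d)  # or inner-product similarity instead
    index.add(xb)                   # a flat index needs no training step

    k = 4
    distances, ids = index.search(xq, k)  # exact k nearest neighbors for each query

Because every query is compared against every stored vector, latency grows linearly with the size of the index.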

Ideas to make vector search faster

  1. Reduce vector size — through dimensionality reduction or by reducing the number of bits representing our vectors' values.
  2. Reduce search scope — by hashing (Locality Sensitive Hashing), clustering (Inverted File Index), or organizing vectors into tree structures based on certain attributes, similarity, or distance — and restricting our search to the closest clusters or most similar branches.

Locality Sensitive Hashing: faiss.IndexLSH. Best for low-dimensional data or small datasets.

Inverted File Index: faiss.IndexIVFFlat. A good, scalable option: high quality at reasonable speed and memory usage.

IVF (an inverted index maps from content to document location, in contrast to a forward index, which maps from documents to content): assigns each database vector to the partition with the closest centroid, where the centroids are determined by unsupervised clustering (typically k-means). Generally a solid choice for small- to medium-size datasets.

Product quantization: high-dimensional vectors are mapped to low-dimensional quantized vectors by assigning each fixed-length chunk of the original vector to a single quantized value. The process involves splitting the vectors, applying k-means clustering to each split (across all vectors), and replacing each chunk with the index of its nearest centroid.

Hierarchical Navigable Small World Graphs: faiss.IndexHNSWFlat. Very good quality and high speed, but large memory usage.

Approximate Nearest Neighbors Oh Yeah (Annoy): generally not recommended. It partitions the vector space recursively to create a binary tree, where each node is split by a hyperplane equidistant from two randomly selected child vectors. The splitting process continues until leaf nodes have fewer than a predefined number of elements. Querying involves iteratively traversing the tree, determining at each node which side of the hyperplane the query vector falls on.

Decision on choosing which index:

  • 100% recall required or index_size < 10MB: use FLAT.
  • 10MB < index_size < 2GB: use IVF.
  • 2GB < index_size < 20GB: PQ allows you to use significantly less memory at the expense of lower recall, while HNSW often gives you 95%+ recall at the expense of high memory usage.
  • 20GB < index_size < 200GB: use composite indexes, IVF_PQ for memory-constrained applications and HNSW_SQ for applications that require high recall.
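
As a rough illustration of these options (the factory strings and parameters here are examples, not recommendations from the article), Faiss can build most of these index types from a short spec via index_factory:

    import faiss

    d = 128  # vector dimensionality (example value)

    flat_index = faiss.index_factory(d, "Flat")          # exact search, small indexes
    ivf_index = faiss.index_factory(d, "IVF100,Flat")    # IVF with 100 Voronoi cells
    ivfpq_index = faiss.index_factory(d, "IVF100,PQ8")   # IVF + PQ codes, memory-constrained
    hnsw_index = faiss.index_factory(d, "HNSW32")        # HNSW with 32 links per vertex, high recall

    # IVF- and PQ-based indexes must be trained on representative vectors before add()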

Locality Sensitive Hashing (LSH)

Shingling, MinHashing, and banded LSH (traditional approach)

3-step traditional approach

  1. Shingling: encoding original texts into vectors.
  2. MinHashing: transforming vectors into a special representation called signature which can be used to compare similarity between them.
  3. LSH function: hashing signature blocks into different buckets. If the signatures of a pair of vectors fall into the same bucket at least once, they are considered candidates.
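
A minimal, self-contained sketch of these three steps (the shingle size, signature length, and band layout are arbitrary choices for illustration):

    from collections import defaultdict

    def shingles(text, k=3):
        # Step 1: encode the text as a set of k-shingles
        return {text[i:i + k] for i in range(len(text) - k + 1)}

    def minhash_signature(shingle_set, seeds):
        # Step 2: one minimum per hash function forms the signature
        return [min(hash((seed, s)) for s in shingle_set) for seed in seeds]

    def candidate_pairs(signatures, bands=20, rows=5):
        # Step 3: hash each band of the signature; sharing a bucket once => candidates
        buckets = defaultdict(set)
        for doc_id, sig in signatures.items():
            for b in range(bands):
                buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
        return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}

    docs = {
        "a": "flying fish flew by the space station",
        "b": "flying fish flew by the space shuttle",
        "c": "the quick brown fox jumps over the lazy dog",
    }
    seeds = range(100)  # 100 hash functions = 20 bands x 5 rows
    sigs = {i: minhash_signature(shingles(t), seeds) for i, t in docs.items()}
    print(candidate_pairs(sigs))  # "a" and "b" should show up as a candidate pair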

The interesting property of the hashing function is that the probability of collision (i.e., two items being hashed to the same value) should be higher for items that are more similar according to your chosen metric; in other words, it is a function that maps higher-dimensional data to a lower-dimensional space where similar items end up closer together. The key difference is that with dictionaries, our goal is to minimize the chances of multiple key-values being mapped to the same bucket — we minimize collisions. LSH is almost the opposite: in LSH, we want to maximize collisions, although ideally only for similar inputs.

LSH has a very similar process to that used in Python dictionaries. We have a key-value pair which we feed into the dictionary. The key is processed through the dictionary hash function and mapped to a specific bucket.

The Pinecone article is a good resource for people who want to go deep into vector databases, as it provides some native implementations for better understanding.

Random hyperplanes with dot-product and Hamming distance

Reduce high-dimensional vectors to low-dimensional binary vectors using random hyperplanes: the sign of the dot product between a vector and each hyperplane's normal gives one bit. Hamming distance between these binary codes is then used to compare vectors.
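
A small numpy sketch of this idea (the number of hyperplanes, the dimensionality, and the data are arbitrary illustration values):

    import numpy as np

    d, nbits = 128, 16                    # original dimensionality, number of hyperplanes/bits
    rng = np.random.default_rng(0)
    planes = rng.normal(size=(nbits, d))  # random hyperplanes through the origin

    xb = rng.normal(size=(10_000, d))
    codes = (xb @ planes.T > 0).astype(np.uint8)  # dot-product sign -> one bit per hyperplane

    xq = rng.normal(size=d)
    q_code = (planes @ xq > 0).astype(np.uint8)

    # Hamming distance between binary codes approximates closeness of the original vectors
    hamming = np.count_nonzero(codes != q_code, axis=1)
    nearest = np.argsort(hamming)[:10]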

Inverted File Index

One implementation uses Voronoi diagrams, creating several non-intersecting regions, each containing a subset of the dataset points. Each region has its own centroid, which points to the center of that region. During a query, the distances to all the centroids of the Voronoi partitions are calculated. Then the centroid with the lowest distance is chosen, and the vectors contained in its partition are taken as candidates. The final answer is obtained by computing the distances to the candidates and choosing the top k nearest of them.

Downside: when the queried object is located near the border of its region, its actual nearest neighbour may lie in a different region. The mitigation is to increase the search scope and choose several regions to search for candidates, based on the top m closest centroids to the object.

faiss.IndexIVFFlat

  • nlist: defines a number of regions (Voronoi cells) to create during training.
  • nprobe: determines how many regions to take for the search of candidates. Changing nprobe parameter does not require retraining.
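
A minimal sketch using these two parameters (nlist, nprobe, and the data sizes are illustrative values, not recommendations):

    import faiss
    import numpy as np

    d, nlist = 128, 100
    xb = np.random.random((10_000, d)).astype("float32")
    xq = np.random.random((5, d)).astype("float32")

    quantizer = faiss.IndexFlatL2(d)     # used to assign vectors to Voronoi cells
    index = faiss.IndexIVFFlat(quantizer, d, nlist)

    index.train(xb)      # learns the nlist centroids (Voronoi regions)
    index.add(xb)

    index.nprobe = 8     # search the 8 closest cells; changing this needs no retraining
    distances, ids = index.search(xq, 4)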

Product Quantization

Product quantization can dramatically compress high-dimensional vectors to use 97% less memory and make nearest-neighbor search 5.5x faster.

  • Splitting high-dimensional vector into equally sized chunks — our subvectors,
  • Assigning each of these subvectors to its nearest centroid (also called reproduction/reconstruction values),
  • Replacing these centroid values with unique IDs — each ID represents a centroid

Each dataset vector is converted into a short, memory-efficient representation (called a PQ code). During training, each vector is divided into several equal parts (subvectors), which form subspaces. Centroids are found in each subspace, and each subvector is encoded with the ID of its nearest centroid.

  • Original vector: 1024 dimensions * 32 bits = 4096 bytes.
  • Encoded vector: 8 subvectors * 8 bits = 8 bytes.
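
A minimal Faiss sketch matching these numbers (a 1024-dimensional vector encoded into 8 one-byte codes; the dataset here is random placeholder data):

    import faiss
    import numpy as np

    d, m, nbits = 1024, 8, 8    # 8 subvectors, 8 bits each -> 8 bytes per stored vector
    xb = np.random.random((20_000, d)).astype("float32")   # 4096 bytes per raw vector

    index = faiss.IndexPQ(d, m, nbits)
    index.train(xb)             # k-means per subspace learns the codebooks
    index.add(xb)               # only the 8-byte PQ codes are stored

    xq = np.random.random((1, d)).astype("float32")
    distances, ids = index.search(xq, 5)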

Inference

A query vector is divided into subvectors. For each of its subvectors, distances to all the centroids of the corresponding subspace are computed. Ultimately, this information is stored in table d.

Then we calculate approximate distances for all database rows and search for vectors with the smallest values. Approximate distance:

  1. For each subvector of a database vector, the closest centroid j is found (by using the mapping values from the PQ code), and the partial distance d[i][j] from that centroid to the query subvector i is taken (by using the precomputed table d).
  2. All the partial distances are squared and summed up. By taking the square root of this value, the approximate Euclidean distance is obtained; other distance metrics can be approximated in a similar way.
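
A numpy sketch of this asymmetric distance computation (the codebooks and PQ codes below are random placeholders standing in for trained values):

    import numpy as np

    m, ds, k, n = 8, 128, 256, 10_000           # subvectors, subvector dim, centroids per subspace, database size
    rng = np.random.default_rng(0)
    codebooks = rng.normal(size=(m, k, ds))     # placeholder: would come from k-means training
    pq_codes = rng.integers(0, k, size=(n, m))  # placeholder: centroid IDs per database vector

    query = rng.normal(size=m * ds)
    q_sub = query.reshape(m, ds)

    # table d: squared partial distance from query subvector i to centroid j of subspace i
    d_table = ((codebooks - q_sub[:, None, :]) ** 2).sum(axis=-1)   # shape (m, k)

    # look up one partial distance per subspace via the PQ code, sum, then take the square root
    approx_sq = d_table[np.arange(m), pq_codes].sum(axis=1)         # shape (n,)
    approx_dist = np.sqrt(approx_sq)                                # approximate Euclidean distance
    top10 = np.argsort(approx_dist)[:10]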

To combine PQ with IVF, an inverted file index is constructed that divides the set of vectors into n Voronoi partitions. Inside each Voronoi partition, the coordinates of the centroid are subtracted from each vector and the residuals are stored. After that, the product quantization algorithm is run on the vectors from all the partitions.

Inference

For a given query, the k nearest centroids of the Voronoi partitions are found, and all the points inside these regions are considered as candidates. The query residual is then split into subvectors, and the distances to the candidates are approximated with Product Quantization as described above.

faiss.IndexIVFPQ

In IVFPQ, an Inverted File index (IVF) is integrated with Product Quantization (PQ) to facilitate rapid and effective approximate nearest neighbor search: an initial broad-stroke IVF step narrows down the scope of vectors in our search.

After this, we continue our PQ search as we did before — but with a significantly reduced number of vectors. By minimizing our search scope, we can expect significantly improved search speeds.
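
A minimal sketch of a combined IVF+PQ index in Faiss (all sizes here are illustrative):

    import faiss
    import numpy as np

    d, nlist, m, nbits = 128, 100, 8, 8
    xb = np.random.random((50_000, d)).astype("float32")
    xq = np.random.random((5, d)).astype("float32")

    quantizer = faiss.IndexFlatL2(d)
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

    index.train(xb)      # learns both the coarse centroids and the PQ codebooks
    index.add(xb)        # stores 8-byte PQ codes of the residuals, not full vectors

    index.nprobe = 8     # number of Voronoi cells to visit per query
    distances, ids = index.search(xq, 4)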

Hierarchical Navigable Small Worlds (HNSW)

HNSW is based on the same principles as the skip list and the navigable small world graph. Its structure is a multi-layered graph with fewer connections on the top layers and denser regions on the bottom layers. Search starts from the highest layer and proceeds one level down each time the local nearest neighbour has been greedily found among the layer's nodes. Instead of finding only one nearest neighbour on each layer, the efSearch (a hyperparameter) closest neighbours to the query vector are found, and each of these neighbours is used as an entry point on the next layer.

faiss.IndexHNSWFlat
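
A minimal sketch (the M, efConstruction, and efSearch values are illustrative, not tuned recommendations):

    import faiss
    import numpy as np

    d, M = 128, 32                      # M = number of links per vertex per layer
    xb = np.random.random((10_000, d)).astype("float32")
    xq = np.random.random((5, d)).astype("float32")

    index = faiss.IndexHNSWFlat(d, M)
    index.hnsw.efConstruction = 64      # candidate-list size while building the graph
    index.hnsw.efSearch = 32            # candidate-list size while searching

    index.add(xb)                       # no training step; the graph is built on add
    distances, ids = index.search(xq, 4)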

Navigable Small World Graphs

When searching an NSW graph, we begin at a pre-defined entry-point. This entry point connects to several nearby vertices. We identify which of these vertices is the closest to our query vector and move there. We repeat the greedy-routing search process of moving from vertex to vertex by identifying the nearest neighboring vertices in each friend list. Eventually, we will find no nearer vertices than our current vertex — this is a local minimum and acts as our stopping condition.

In HNSW, each layer is an NSW graph, and the layers are connected hierarchically using a probabilistic skip-list structure. During search, we traverse the edges in each layer just as we did for NSW, greedily moving to the nearest vertex until we find a local minimum. Unlike NSW, at this point we shift to the current vertex in a lower layer and begin searching again. We repeat this process until finding the local minimum of our bottom layer — layer 0.

Graph construction starts at the top layer. At each layer, the algorithm greedily traverses the edges, finding the ef nearest neighbors to the inserted vector q. After finding the local minimum, it moves down to the next layer (just as is done during search). This process is repeated until reaching the chosen insertion layer. There, the efConstruction nearest neighbors are found; these become candidates for the links of the newly inserted element q and serve as entry points to the next layer.

Vector database comparison

LanceDB is positioned as an embedded database, since sending a beefy vector over HTTP often takes longer than finding its nearest neighbor. LanceDB uses IVF_PQ, while Qdrant uses HNSW.

Appendix

To compute the nearest neighbors, the set of points is split in half recursively until each subset contains about k items; usually k should be around 100.

https://blog.qdrant.tech/batch-vector-search-with-qdrant-8c4d598179d5
