Semantic search with a vector database
Recently I have been hearing a lot about “vector databases”. After some research, I found that they are closely related to “semantic search”. Let’s look at these concepts and why we need them.
Semantic search: the term means different things to different people. Here I mean “search by query intent or query meaning”. The goal is to return more accurate results than simple keyword matching can. Suppose you search for “who is number 1 soccer player in the world”. What you actually mean is “who is the best soccer player in the world” or “top soccer players in the world”, not “which soccer player in the world wears the ‘number 1’ shirt”. This is useful in almost any search scenario (not only natural language, but also audio, images, etc.).
Vector: a vector is just a list of numbers. In natural language processing, you can tokenize a sentence into words and translate it into a list of word indices.
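To make the “list of word indices” idea concrete, here is a minimal sketch. It assumes the simplest possible tokenizer (lowercase, split on whitespace); the helper names `build_vocab` and `encode` are made up for illustration, and real tokenizers are more involved.

```python
def build_vocab(sentences):
    """Assign each distinct lowercase word an integer index, in first-seen order."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Translate a sentence into a list of word indices (unknown words are skipped)."""
    return [vocab[word] for word in sentence.lower().split() if word in vocab]


corpus = ["who is the best soccer player", "the best player in the world"]
vocab = build_vocab(corpus)
encoded = encode("the best soccer player in the world", vocab)

print(vocab)
print(encoded)
```

Note that these indices are arbitrary labels: word 4 is not “closer” to word 5 than to word 0 in any meaningful sense.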
Embedding: if your sentences are long and your vocabulary is large, the vectors become very long, and with one-hot encoding a sentence turns into sparse vectors that suffer from the curse of dimensionality. An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs such as sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space.
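The difference between a one-hot vector and an embedding is easier to see in code. The sketch below is only about shapes: the five-word vocabulary is made up, and the embedding table is filled with random numbers, whereas a real table is learned during training so that similar words end up close together.

```python
import random

vocab = ["best", "player", "soccer", "top", "world"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Sparse representation: one slot per vocabulary word, with a single 1."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec


# Hypothetical embedding table: one dense row per word.
# Random values are an assumption for illustration; real values are learned.
EMBEDDING_DIM = 3
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1, 1) for _ in range(EMBEDDING_DIM)] for _ in vocab
]


def embed(word):
    """Dense representation: look up the word's row in the embedding table."""
    return embedding_table[word_to_index[word]]


def project(one_hot_vec):
    """Multiply a one-hot row vector by the embedding table."""
    return [
        sum(one_hot_vec[i] * embedding_table[i][j] for i in range(len(vocab)))
        for j in range(EMBEDDING_DIM)
    ]


print(len(one_hot("soccer")), one_hot("soccer"))  # length grows with the vocabulary
print(len(embed("soccer")), embed("soccer"))      # length stays at EMBEDDING_DIM
```

With a realistic vocabulary the one-hot vector would have tens of thousands of slots, while the embedding keeps its fixed, much smaller size. The `project` helper shows that the table lookup is the same thing as multiplying the one-hot vector by the table.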
Vector similarity search: when you search with the embedding of a query, you are not looking for an exact match. You are looking for nearby embedding vectors, which should have a similar semantic meaning.
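One common way to measure “nearby” is cosine similarity, which compares the direction of two vectors. Here is a minimal brute-force sketch. The 3-dimensional vectors are invented by hand to mimic the soccer example; a real system would get them from an embedding model, and other distance measures (such as Euclidean distance or dot product) are also used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up embeddings, chosen by hand for illustration only.
vectors = {
    "best soccer player": [0.9, 0.1, 0.0],
    "top soccer players": [0.8, 0.2, 0.1],
    "number 1 shirt": [0.1, 0.9, 0.2],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "number 1 soccer player"

results = nearest(query, vectors, k=2)
for name, score in results:
    print(f"{score:.3f}  {name}")
```

The two “best player” phrases come out on top, and the shirt phrase is left out, even though the query text literally contains “number 1”.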
Vector database: a vector database indexes and stores vector embeddings for fast retrieval and similarity search, with capabilities like CRUD operations, metadata filtering, and horizontal scaling. Traditional databases are not designed to store, index, and search vectors easily.
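To get a feel for what those capabilities mean, here is a toy in-memory sketch. `ToyVectorStore` and its method names are hypothetical and are not the API of Pinecone or any other product. It scans every record on each query, whereas real vector databases typically rely on approximate nearest-neighbor indexes to stay fast at scale.

```python
import math


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """In-memory stand-in for a vector database (illustration only)."""

    def __init__(self):
        self._records = {}  # id -> (vector, metadata)

    def upsert(self, record_id, vector, metadata=None):
        """Create the record, or overwrite it if the id already exists."""
        self._records[record_id] = (list(vector), dict(metadata or {}))

    def fetch(self, record_id):
        """Return (vector, metadata) for an id, or None if it is missing."""
        return self._records.get(record_id)

    def delete(self, record_id):
        """Remove a record; deleting a missing id is a no-op."""
        self._records.pop(record_id, None)

    def query(self, vector, top_k=3, where=None):
        """Return the top_k most similar ids whose metadata matches `where`."""
        where = where or {}
        hits = []
        for record_id, (vec, meta) in self._records.items():
            if all(meta.get(key) == value for key, value in where.items()):
                hits.append((record_id, _cosine(vector, vec)))
        hits.sort(key=lambda hit: hit[1], reverse=True)
        return hits[:top_k]


store = ToyVectorStore()
store.upsert("a", [1.0, 0.0], {"lang": "en"})
store.upsert("b", [0.9, 0.1], {"lang": "de"})
store.upsert("c", [0.0, 1.0], {"lang": "en"})

all_hits = store.query([1.0, 0.0], top_k=2)
english_hits = store.query([1.0, 0.0], top_k=2, where={"lang": "en"})
store.delete("a")
after_delete = store.query([1.0, 0.0], top_k=2)

print(all_hits)
print(english_hits)
print(after_delete)
```

The metadata filter changes which records are even considered: without it the query returns `a` and `b`, and with `lang == "en"` it returns `a` and `c`.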
Pinecone is a vector database, and as always, the best way to learn something new is a working code sample.
This notebook shows how to do semantic search with Pinecone. It is pretty clear, so I will just describe the main steps; you can follow along in the notebook:
- Use the Quora question duplicates dataset, which contains pairs of questions that are worded differently but share the same meaning
- Connect to Pinecone and create an index to store vectors
- For each Quora question, generate an (id, vector, metadata) tuple and insert it into the Pinecone index
- Use a sample Quora question to search for similar questions in the index
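
The shape of these steps can be sketched without any external service. This is not the notebook's code and it does not use the Pinecone client: a plain dict stands in for the index, the three questions are made up rather than taken from the Quora dataset, and `embed` is a toy bag-of-words counter that only captures word overlap, where the notebook would use a real embedding model.

```python
import math
from collections import Counter

# Tiny stand-in for the dataset (made-up questions, not real Quora data).
questions = [
    "how can i learn python quickly",
    "what is the fastest way to learn python",
    "what is the capital of france",
]

# Toy embedding (assumption): word counts over the corpus vocabulary.
vocab = sorted({word for q in questions for word in q.split()})


def embed(text):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocab]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# One (id, vector, metadata) tuple per question.
records = [(f"q{i}", embed(q), {"text": q}) for i, q in enumerate(questions)]

# "Insert" into the index; a dict plays the role of the remote index here.
index = {record_id: (vector, metadata) for record_id, vector, metadata in records}


def search(query, top_k=2):
    """Embed the query and rank every stored question by similarity."""
    query_vector = embed(query)
    scored = [
        (record_id, cosine(query_vector, vector), metadata["text"])
        for record_id, (vector, metadata) in index.items()
    ]
    scored.sort(key=lambda hit: hit[1], reverse=True)
    return scored[:top_k]


hits = search("how can i learn python fast")
for record_id, score, text in hits:
    print(f"{record_id}  {score:.3f}  {text}")
```

Keeping the original text in the metadata is what lets the search print readable questions instead of bare ids and scores.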