Unraveling the Search Puzzle: Keyword vs. Fuzzy vs. Full-Text vs. Semantic
This is the second article in the series on building LLM-powered AI applications. We are going to look at a few enabling technologies.
If you read the first article, you know that context is important when working with an LLM. However, LLMs have a token limit, so you need to provide only the most relevant information. How, then, can you sift through the plethora of information your company has? You need search/query technologies, and there are four main search techniques: keyword, fuzzy, full-text, and semantic.
The first three are all forms of “lexical search”, since they rely on matching words, but they differ in what counts as a match. Keyword search finds exact matches. Fuzzy search tolerates some typos (e.g. investment, invesment) and, depending on the implementation, different word forms (e.g. invest, investing, invested). Full-text search goes further and supports queries such as whether “investment” appears near “vanguard”, whether a word ends with “coin”, or whether a document contains synonyms of stock, like share or security.
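To make these differences concrete, here is a minimal sketch using only the Python standard library. It is not how a real search engine works internally; the function names (keyword_match, fuzzy_match, near, suffix_match), the 0.85 similarity threshold, and the 3-token proximity window are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_match(word, text):
    """Keyword search: the word must appear exactly as a token."""
    return word.lower() in tokenize(text)


def fuzzy_match(word, text, threshold=0.85):
    """Fuzzy search: some token is similar enough to the word.

    The threshold is an illustrative assumption, not a standard value.
    """
    target = word.lower()
    return any(
        SequenceMatcher(None, target, token).ratio() >= threshold
        for token in tokenize(text)
    )


def near(word_a, word_b, text, max_distance=3):
    """Full-text style proximity: both words occur within max_distance tokens."""
    tokens = tokenize(text)
    positions_a = [i for i, t in enumerate(tokens) if t == word_a.lower()]
    positions_b = [i for i, t in enumerate(tokens) if t == word_b.lower()]
    return any(abs(i - j) <= max_distance for i in positions_a for j in positions_b)


def suffix_match(suffix, text):
    """Full-text style wildcard: return every token that ends with the suffix."""
    return [t for t in tokenize(text) if t.endswith(suffix.lower())]


typo_text = "Our invesment in Vanguard funds"
print(keyword_match("investment", typo_text))  # False: the typo breaks exact matching
print(fuzzy_match("investment", typo_text))    # True: one missing letter is tolerated
print(near("investment", "vanguard", "investment in Vanguard funds"))  # True
print(suffix_match("coin", "bitcoin and dogecoin rallied"))  # ['bitcoin', 'dogecoin']
```

Synonym expansion (stock, share, security) is left out here because it needs a thesaurus or a synonym list, which search products typically let you configure.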
However, if you want to find all articles related to a more complex query such as “impact of inflation on cryptocurrency”, these techniques are not enough. Suppose article 1 contains “impact of inflation on bitcoin” and article 2 contains “cryptocurrency impact on economy”. You would want to return only article 1, since it is more relevant: bitcoin is a type of cryptocurrency. Lexical search technologies struggle with this type of query, because answering it requires understanding the semantic meaning of the words and sentences. That is where semantic search, or semantic similarity, comes into play. To enable semantic search, you need something called an embedding, which will be covered in the next article.
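The limitation described above can be seen with a small sketch that scores each article by how many content words it shares with the query. The scoring function (lexical_score) and the tiny stopword list are assumptions made for this illustration; real lexical engines use more elaborate ranking, but they still match on words rather than on meaning.

```python
import re

# A tiny, illustrative stopword list (an assumption for this sketch).
STOPWORDS = {"of", "on", "the", "a", "an", "in"}


def content_words(text):
    """Return the set of lowercase words in the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def lexical_score(query, document):
    """Count the content words shared by the query and the document."""
    return len(content_words(query) & content_words(document))


query = "impact of inflation on cryptocurrency"
article_1 = "impact of inflation on bitcoin"
article_2 = "cryptocurrency impact on economy"

score_1 = lexical_score(query, article_1)  # shares "impact" and "inflation"
score_2 = lexical_score(query, article_2)  # shares "cryptocurrency" and "impact"
print(score_1, score_2)  # 2 2
```

Both articles get the same score, so word matching alone cannot tell that article 1 is the better answer: nothing in the words themselves says that “bitcoin” is a kind of “cryptocurrency”.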