Semantic Search With MongoDB

Vector search in a NoSQL database

Xin Cheng
7 min read · Aug 12, 2024

If your application heavily relies on vector search and similar operations, dedicated vector databases like Pinecone, Weaviate, and Milvus might provide better performance and more specialized features compared to MongoDB. However, if vector search is just one aspect of your application and you’re already using MongoDB for other purposes, you can also leverage a familiar database enhanced with vector search capability.

MongoDB Atlas

Get Atlas connection string

Get database user password
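
The connection string and the database user password are combined into one URI. A minimal sketch of that step, assuming the usual `mongodb+srv://` form; the user name, password, and host below are hypothetical placeholders, and `build_atlas_uri` is a helper invented for this illustration. Percent-encoding the credentials matters when the password contains characters such as `@` or `/`.

```python
from urllib.parse import quote_plus


def build_atlas_uri(user: str, password: str, host: str) -> str:
    """Return a mongodb+srv URI with the credentials percent-encoded."""
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}/?retryWrites=true&w=majority"
    )


# Hypothetical values for illustration only; use your own cluster host and user.
ATLAS_CONNECTION_STRING = build_atlas_uri(
    "app_user", "p@ss/word", "cluster0.example.mongodb.net"
)
print(ATLAS_CONNECTION_STRING)
```

In practice the password would likely come from `getpass.getpass()` or an environment variable rather than a literal in the source.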

Create Azure OpenAI embeddings, LLM

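import os

from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings

# Embedding model used to vectorize documents and queries
embeddings = AzureOpenAIEmbeddings(
    model=os.getenv("AZURE_OPENAI_EMBEDDINGS_MODEL_NAME"),
    deployment=os.getenv("AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"),
)
# Chat model used to generate answers
llm = AzureChatOpenAI(
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    api_version="2024-05-01-preview",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # other params...
)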
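
Because every setting above comes from `os.getenv`, an unset variable silently becomes `None` and only fails later. A small sketch of a pre-flight check, assuming the variable names used in this article; `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_API_KEY` are included on the assumption that the client classes read them from the environment, and `missing_vars` is a helper invented for this illustration.

```python
import os

# Names taken from the snippets in this article; the last two are an assumption
# about what the Azure OpenAI clients read implicitly.
REQUIRED_VARS = [
    "AZURE_OPENAI_EMBEDDINGS_MODEL_NAME",
    "AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
]


def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_vars(REQUIRED_VARS)
if missing:
    print("Set these variables before creating the clients:", ", ".join(missing))
```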

Atlas Vector Search index JSON

{
  "mappings": {
    "dynamic": true
  }
}

Query tester

Error when running a query

query = "MongoDB Atlas security"
results = vector_store.similarity_search(query)
pprint.pprint(results)

PlanExecutor error during aggregation :: caused by :: embedding is not indexed as knnVector,
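
The error is raised by an aggregation that runs on the server, which is why it complains about how the `embedding` field is indexed. As a rough sketch of what such a query looks like, assuming a `$vectorSearch` stage: the index name, the result count `k`, and the `numCandidates` heuristic below are illustrative choices, not values taken from the library.

```python
def vector_search_pipeline(query_vector, index_name="vector_index",
                           path="embedding", k=4, num_candidates=None):
    """Build an aggregation pipeline with a $vectorSearch stage.

    `path` must name a field that the search index maps as a vector;
    otherwise the server rejects the query.
    """
    if num_candidates is None:
        # Illustrative heuristic: consider ten times more candidates than results.
        num_candidates = k * 10
    return [
        {
            "$vectorSearch": {
                "index": index_name,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": k,
            }
        },
        # Expose the similarity score next to each returned document.
        {"$set": {"score": {"$meta": "vectorSearchScore"}}},
    ]


pipeline = vector_search_pipeline([0.1, 0.2, 0.3], k=3)
print(pipeline[0]["$vectorSearch"]["path"])
```

With a real collection, the pipeline would be passed to `collection.aggregate(pipeline)`, using a query vector produced by the same embedding model as the stored documents.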

Correct JSON (the embedding field is mapped explicitly as knnVector)

{
  "mappings": {
    "dynamic": true,
    "fields": {
      "embedding": {
        "dimensions": 1536,
        "similarity": "cosine",
        "type": "knnVector"
      }
    }
  }
}
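
The `dimensions` value has to match the length of the vectors the embedding model produces; 1536 is the output size of models such as `text-embedding-ada-002`, which is assumed here to be the model in use. A minimal sketch that builds the same definition in Python and checks a vector against it; both helper functions are invented for this illustration.

```python
import json


def knn_index_definition(field="embedding", dimensions=1536, similarity="cosine"):
    """Return an Atlas Search index definition that maps `field` as knnVector."""
    allowed = {"cosine", "euclidean", "dotProduct"}
    if similarity not in allowed:
        raise ValueError(f"similarity must be one of {sorted(allowed)}")
    return {
        "mappings": {
            "dynamic": True,
            "fields": {
                field: {
                    "dimensions": dimensions,
                    "similarity": similarity,
                    "type": "knnVector",
                }
            },
        }
    }


def check_dimensions(vector, definition, field="embedding"):
    """Return True when the vector length matches the indexed dimensions."""
    expected = definition["mappings"]["fields"][field]["dimensions"]
    return len(vector) == expected


definition = knn_index_definition()
print(json.dumps(definition, indent=2))
```

Calling `check_dimensions(embeddings.embed_query("test"), definition)` before creating the index is one way to catch a mismatch early.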

Basic RAG result

Question: How can I secure my MongoDB Atlas cluster?
Answer: To secure your MongoDB Atlas cluster, you can take the following steps based on the provided context:

1. **Enable Authentication and IP Address Whitelisting**: Authentication and IP address whitelisting are automatically enabled to ensure a secure system right out of the box.

2. **Encryption**: Utilize the encryption features provided by MongoDB Atlas. Data at rest is encrypted with encrypted storage volumes, and you can optionally configure an additional layer of encryption on your data at rest.

3. **Security Features**: Leverage the built-in security features of MongoDB Atlas to protect access to your data. This includes defining authorization rules for administrators.

4. **Avoid Public IP Addresses**: Configure your connectivity to avoid using public IP addresses and reduce the need to whitelist every client in your MongoDB Atlas group.

5. **Monitor and Manage**: Ensure that the MongoDB Atlas team is monitoring the underlying infrastructure to keep it in a healthy state. Additionally, review application logs and database logs for any suspicious activity.

By following these steps, you can enhance the security of your MongoDB Atlas cluster.

Source documents:
[Document(metadata={'_id': '66a97a44549d46b76f5c95cf', 'source': 'https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP', 'page': 17}, page_content='To ensure a secure system right out of the b ox,\nauthentication and I P Address whitelisting are\nautomatically enabled.\nReview the security section of the MongoD B Atlas'),
Document(metadata={'_id': '66a97a44549d46b76f5c95dd', 'source': 'https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP', 'page': 18}, page_content='Atlas provides encryption of data at rest with encrypted\nstorage volumes.\nOptionally , Atlas users can configure an additional layer of\nencryption on their data at rest using the MongoD B'),
Document(metadata={'_id': '66a97a44549d46b76f5c9473', 'source': 'https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP', 'page': 2}, page_content='instance size, region, and f eatures you need. MongoD B\nAtlas provides:\n•Security f eatures to protect access to your data\n•Built in replication for always-on availability , tolerating'),
Document(metadata={'_id': '66a97a44549d46b76f5c95d3', 'source': 'https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP', 'page': 17}, page_content='connectivity without using public I P addresses, and without\nneeding to whitelist every client in your MongoD B Atlas\ngroup.\nAuthorization\nMongoD B Atlas allows administrators to define'),
Document(metadata={'_id': '66a97a44549d46b76f5c95ca', 'source': 'https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP', 'page': 17}, page_content='Security\nAs with all software, MongoD B administrators must\nconsider security and risk e xposure for a MongoD B\ndeployment. T here are no magic solutions for risk'),
Document(metadata={'_id': '66a97a44549d46b76f5c959a', 'source': 'https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP', 'page': 15}, page_content='MongoD B Atlas team are also monitoring the underlying\ninfrastructure, ensuring that it is always in a healthy state.\nApplication L ogs And Database L ogs'),
Document(metadata={'_id': '66a97a44549d46b76f5c9588', 'source': 'https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP', 'page': 14}, page_content='All the user needs to do in order for MongoD B Atlas to\nautomatically deploy the cluster is to select a handful of\noptions:\n•Instance size\n•Storage size (optional)\n•Storage speed (optional)'),
Document(metadata={'_id': '66a97a44549d46b76f5c9478', 'source': 'https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP', 'page': 2}, page_content='your databases onto your own infrastructure and manage\nthem using MongoD B Ops Manager or MongoD B Cloud\nManager . The user e xperience across MongoD B Atlas,'),
Document(metadata={'_id': '66a97a44549d46b76f5c946f', 'source': 'https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP', 'page': 1}, page_content='Table of Contents\n1 Introduction\n2 Preparing for a MongoD B Deployment\n9 Scaling a MongoD B Atlas Cluster\n11 Continuous A vailability & Data Consistency\n12 Managing MongoD B\n16 Security'),
Document(metadata={'_id': '66a97a44549d46b76f5c957d', 'source': 'https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4HkJP', 'page': 13}, page_content='MongoD B.\nMongoD B Atlas incorporates best practices to help keep\nmanaged databases healthy and optimized. T hey ensure\noperational continuity by converting comple x manual tasks')]
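
The answer above is grounded in the retrieved chunks listed under "Source documents". The chain that produced it is not shown, so the following is only a sketch of the general idea, assuming the retrieved text is concatenated into the prompt; the `Doc` class, the prompt wording, and `build_prompt` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document with page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_prompt(question, docs):
    """Concatenate the retrieved chunks into a context block ahead of the question."""
    context = "\n\n".join(
        f"[page {d.metadata.get('page', '?')}] {d.page_content}" for d in docs
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


docs = [
    Doc("Atlas provides encryption of data at rest.", {"page": 18}),
    Doc("Authentication is automatically enabled.", {"page": 17}),
]
print(build_prompt("How can I secure my MongoDB Atlas cluster?", docs))
```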

LlamaIndex imports

import getpass, os, pymongo, pprint
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext
from llama_index.core.settings import Settings
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, ExactMatchFilter, FilterOperator
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

Set embeddings, LLM

import openai

azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
api_version = os.getenv("AZURE_OPENAI_API_VERSION")
api_key = os.getenv("AZURE_OPENAI_API_KEY")
openai.api_key = api_key

Settings.llm = AzureOpenAI(
    model=os.getenv("AZURE_OPENAI_MODEL_NAME"),
    deployment_name=os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME"),
    temperature=0,
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)
Settings.embed_model = AzureOpenAIEmbedding(
    model=os.getenv("AZURE_OPENAI_EMBEDDINGS_MODEL_NAME"),
    deployment_name=os.getenv("AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME"),
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)
Settings.chunk_size = 100
Settings.chunk_overlap = 10
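
The last two settings control how documents are cut up before embedding: each chunk holds up to 100 units, and consecutive chunks share 10 of them. The sketch below illustrates that sliding window on a plain list of words. It is a simplification: the real splitter counts tokens and tries to respect sentence boundaries, and `chunk_tokens` is a helper invented for this illustration.

```python
def chunk_tokens(tokens, chunk_size=100, chunk_overlap=10):
    """Split a token list into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = [f"w{i}" for i in range(250)]
chunks = chunk_tokens(words)
print([len(c) for c in chunks])  # [100, 100, 70]
```

Small chunks such as these give precise matches but little surrounding context, which is consistent with the short `page_content` fragments seen in the retrieved documents earlier.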

Connect to Atlas vector index

# Connect to your Atlas cluster
mongodb_client = pymongo.MongoClient(ATLAS_CONNECTION_STRING)

# Instantiate the vector store
atlas_vector_store = MongoDBAtlasVectorSearch(
    mongodb_client,
    db_name="<db name>",
    collection_name="test",
    vector_index_name="vector_index",
)
vector_store_context = StorageContext.from_defaults(vector_store=atlas_vector_store)

Azure Cosmos DB for MongoDB

Choose a vCore cluster. The RU-based type, which is serverless and easy to set up, does not seem to support vector search.

import os

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores.azure_cosmos_db import (
    AzureCosmosDBVectorSearch,
    CosmosDBSimilarityType,
    CosmosDBVectorSearchType,
)
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from pymongo import MongoClient

SOURCE_FILE_NAME = "./texts/state_of_the_union.txt"

# Load the source text and split it into chunks
loader = TextLoader(SOURCE_FILE_NAME)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# OpenAI Settings
model_deployment = os.getenv("AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME")
model_name = os.getenv("AZURE_OPENAI_EMBEDDINGS_MODEL_NAME")

embeddings = AzureOpenAIEmbeddings(
    deployment=model_deployment, model=model_name, chunk_size=1
)

# CONNECTION_STRING, DB_NAME, COLLECTION_NAME and INDEX_NAME are assumed to be defined beforehand
client: MongoClient = MongoClient(CONNECTION_STRING)
collection = client[DB_NAME][COLLECTION_NAME]

# Embed the chunks and store them in the collection
vectorstore = AzureCosmosDBVectorSearch.from_documents(
    docs,
    embeddings,
    collection=collection,
    index_name=INDEX_NAME,
)

# Read more about these variables in detail here. https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/vcore/vector-search
num_lists = 100
dimensions = 1536
similarity_algorithm = CosmosDBSimilarityType.COS
kind = CosmosDBVectorSearchType.VECTOR_IVF
m = 16
ef_construction = 64
ef_search = 40
score_threshold = 0.1

vectorstore.create_index(
    num_lists, dimensions, similarity_algorithm, kind, m, ef_construction
)
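
The variables above mix settings for two index kinds: `num_lists` belongs to IVF indexes, while `m` and `ef_construction` belong to HNSW indexes, so with `VECTOR_IVF` only `num_lists` should matter. The sketch below shows the shape of the underlying `createIndexes` command for each kind, as described in the Microsoft documentation linked above. The `vectorContent` field name is an assumption about where the embeddings are stored, and `cosmos_vector_index_command` is a helper invented for this illustration.

```python
def cosmos_vector_index_command(collection_name, index_name, kind="vector-ivf",
                                dimensions=1536, similarity="COS",
                                num_lists=100, m=16, ef_construction=64,
                                embedding_key="vectorContent"):
    """Build a createIndexes command document for a Cosmos DB vCore vector index."""
    if kind == "vector-ivf":
        # IVF partitions the vectors into num_lists clusters.
        options = {"kind": kind, "numLists": num_lists}
    elif kind == "vector-hnsw":
        # HNSW builds a graph; m and efConstruction control its density and build effort.
        options = {"kind": kind, "m": m, "efConstruction": ef_construction}
    else:
        raise ValueError(f"unsupported index kind: {kind}")
    options.update({"similarity": similarity, "dimensions": dimensions})
    return {
        "createIndexes": collection_name,
        "indexes": [
            {
                "name": index_name,
                "key": {embedding_key: "cosmosSearch"},
                "cosmosSearchOptions": options,
            }
        ],
    }


print(cosmos_vector_index_command("test", "vector_index"))
```

With pymongo, a command document like this would be sent with `client[DB_NAME].command(...)`; `vectorstore.create_index` wraps that step.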
# perform a similarity search between the embedding of the query and the embeddings of the documents
query = "What did the president say about Ketanji Brown Jackson"
docs = vectorstore.similarity_search(query)
print(docs[0].page_content)

Result

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.

We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.

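Once the documents are loaded and the index exists, querying Azure Cosmos DB vector search is similar: create the vector store from the connection string and run the search.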

vectorstore = AzureCosmosDBVectorSearch.from_connection_string(
    CONNECTION_STRING, NAMESPACE, embeddings, index_name=INDEX_NAME
)

# perform a similarity search between a query and the ingested documents
query = "What did the president say about Ketanji Brown Jackson"
docs = vectorstore.similarity_search(query)

print(docs[0].page_content)
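
The `score_threshold = 0.1` defined earlier is never passed to the calls shown. As a sketch of what such a threshold does, assuming results arrive as `(document, score)` pairs with higher scores meaning closer matches (the shape returned by methods like `similarity_search_with_score`); the sample data and `filter_by_score` are made up for this illustration.

```python
def filter_by_score(scored_docs, score_threshold=0.1):
    """Keep (doc, score) pairs at or above the threshold, best match first."""
    kept = [(doc, score) for doc, score in scored_docs if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up results for illustration only.
results = [("chunk about borders", 0.05), ("chunk about the nominee", 0.83),
           ("chunk about patrols", 0.42)]
for doc, score in filter_by_score(results):
    print(f"{score:.2f}  {doc}")
```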
