- generate-embeddings.ipynb - Learn how embeddings work (Steps 1-5, optional)
- vector-search.ipynb - Implement vector search (Steps 6+, start here if short on time)
By the end of this module, you will:
- Understand what vector embeddings are and how they enable semantic search
- Load and prepare data for vector search
- Generate embeddings using OpenAI's text-embedding model
- Create vector search indexes in DocumentDB
- Implement semantic search with similarity scoring
- Apply filters to refine search results
You'll implement a semantic search system that allows users to search for Airbnb listings using natural language. Instead of exact keyword matching, your search will understand the meaning and context of queries.
- "cozy place near downtown with parking" β finds listings matching the vibe, not just keywords
- "family-friendly home with backyard" β understands intent and returns relevant results
- "quiet retreat for remote work" β captures context and lifestyle needs
What are embeddings?
- Numerical representations of text that capture semantic meaning
- Each embedding is a list of numbers (vector) - typically 1536 dimensions for OpenAI's text-embedding-3-small
- Similar concepts have similar vectors, even if they use different words
Example:
"beach house" β [0.23, -0.45, 0.12, ..., 0.67] (1536 numbers)
"oceanfront property" β [0.21, -0.43, 0.15, ..., 0.69] (similar vector!)
"mountain cabin" β [-0.45, 0.67, -0.23, ..., 0.12] (different vector)
How Vector Search Works:
- Convert text (listings, queries) into embeddings
- Store embeddings in a database with vector search capabilities
- When searching, convert the query to an embedding
- Find the most similar vectors using cosine similarity or other distance metrics
- Return the corresponding listings
┌──────────────────────────────────────────────────────────────────┐
│                          Module 1 Flow                           │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  1. Load Data        2. Generate         3. Store & Index        │
│  ┌─────────┐         ┌──────────┐        ┌────────────────┐      │
│  │  JSON   │         │  OpenAI  │        │   DocumentDB   │      │
│  │  File   │────────▶│ Embedding│───────▶│   + Vector     │      │
│  │         │         │   API    │        │     Index      │      │
│  │         │         │(1536-dim)│        │ (cosmosSearch) │      │
│  └─────────┘         └──────────┘        └────────────────┘      │
│                                                  │               │
│                                                  │               │
│  4. Search Query                                 ▼               │
│  ┌─────────────┐     ┌──────────┐        ┌────────────────┐      │
│  │ "cozy place"│────▶│ Convert  │───────▶│ Vector Search  │      │
│  │ "near beach"│     │ to Vector│        │ (cosmosSearch) │      │
│  └─────────────┘     └──────────┘        └────────────────┘      │
│                                                  │               │
│                                                  ▼               │
│                                          ┌────────────────┐      │
│                                          │  Top Similar   │      │
│                                          │   Listings     │      │
│                                          └────────────────┘      │
└──────────────────────────────────────────────────────────────────┘
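The flow above hinges on comparing vectors with cosine similarity. The toy sketch below (our own illustration, using assumed 4-dimensional vectors instead of the real 1536) shows why "oceanfront property" ranks closer to "beach house" than "mountain cabin" does:

```python
import math

def cosine_similarity(a, b):
    """dot(a, b) / (|a| * |b|): 1.0 for identical directions, negative for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy 4-dimensional "embeddings" (real ones have 1536 dimensions)
beach_house = [0.23, -0.45, 0.12, 0.67]
oceanfront = [0.21, -0.43, 0.15, 0.69]
mountain = [-0.45, 0.67, -0.23, 0.12]

print(f"beach house vs oceanfront: {cosine_similarity(beach_house, oceanfront):.3f}")
print(f"beach house vs mountain:   {cosine_similarity(beach_house, mountain):.3f}")
```

The first pair scores near 1.0 while the second scores far lower, which is exactly the ranking a vector search returns.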
⏭️ Skip Ahead? Steps 1-5 are a learning demonstration to help you understand how embeddings work. They do not modify the application or database. If you're short on time, you can skip to Step 6 to start working with the pre-embedded data.
Let's first explore the dataset structure in the Jupyter Notebook.
Our dataset contains Airbnb listings with the following key fields:
| Field | Type | Description | Example |
|---|---|---|---|
| `id` | number | Unique identifier | 360 |
| `listing_url` | string | URL to the listing | "https://www.airbnb.com/rooms/360" |
| `name` | string | Property title | "Chickadee Cottage in LoHi" |
| `description` | string | Full description | Text used for embeddings |
| `neighborhood_overview` | string | Area information | "Located in Lower Highlands..." |
| `amenities` | array | List of amenities | ["Wifi", "Kitchen", "TV", ...] |
| `property_type` | string | Type of property | "Entire guesthouse", "Apartment", etc. |
| `room_type` | string | Room configuration | "Entire home/apt" |
| `bedrooms` | number | Number of bedrooms | 1, 2, 3, etc. |
| `beds` | number | Number of beds | 1, 2, 3, etc. |
| `price` | number | Nightly price | 161.0 |
| `latitude` | number | Latitude coordinate | 39.766414 |
| `longitude` | number | Longitude coordinate | -105.002098 |
💡 Key Insight: The description field is what we'll convert into vector embeddings for semantic search.
For this workshop, we provide pre-embedded data in data/embedded_data.json that already contains the descriptionVector field. This saves time and API costs during the workshop.
Understanding how embeddings are generated is essential! The next steps demonstrate embedding 50 sample documents in the notebook as a learning exercise.
Run the cells in generate-embeddings.ipynb to load and explore the raw data:
# Load raw data (without embeddings)
with open('../data/raw_data.json', 'r', encoding='utf-8') as f:
    data = json.load(f)

print(f"📄 Loaded {len(data)} listings from raw_data.json")

# Examine the first listing
sample = data[0]
print(f"\n📋 Sample Listing:")
print(f"   ID: {sample['id']}")
print(f"   Name: {sample['name']}")
print(f"   Property Type: {sample['property_type']}")
print(f"   Bedrooms: {sample.get('bedrooms', 'N/A')}")
print(f"   Price: ${sample.get('price', 'N/A')}")
print(f"   Amenities: {', '.join(sample.get('amenities', [])[:5])}...")
print(f"\n📝 Description Preview:")
print(f"   {sample.get('description', '')[:200]}...")

We'll use OpenAI's text-embedding-3-small model to generate 1536-dimension vectors that capture semantic meaning.
Each number in the 1536-dimension vector represents a learned feature. The model has discovered that certain combinations of these numbers correspond to semantic concepts like "cozy", "parking", "downtown", etc.
The notebook contains the generate_embedding() function:
def generate_embedding(text):
    """
    Generate a vector embedding for the given text using OpenAI.

    Args:
        text (str): The text to embed

    Returns:
        list: A 1536-dimension vector representing the text
    """
    if not text or not isinstance(text, str):
        return None
    try:
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=text
        )
        return response.data[0].embedding
    except Exception as e:
        print(f"Error generating embedding: {e}")
        return None

# Test the function with a sample query
test_text = "Cozy apartment near downtown with free parking"
test_embedding = generate_embedding(test_text)
print(f"\n🧪 Testing Embedding Generation:")
print(f"   Input: '{test_text}'")
print(f"   ✅ Generated embedding")
print(f"   📏 Dimensions: {len(test_embedding)}")
print(f"   🔢 First 5 values: {test_embedding[:5]}")

Now let's embed 50 documents from raw_data.json to understand the full process. Note: This is a learning exercise - we won't write to the database since pre-embedded data is already available.
The notebook contains the embed_documents() function:
def embed_documents(documents, limit=50):
    """
    Generate embeddings for a list of documents.

    Args:
        documents (list): List of listing documents
        limit (int): Maximum number of documents to process

    Returns:
        list: Documents with descriptionVector added
    """
    docs_to_process = documents[:limit]
    embedded_docs = []
    print(f"\n🔄 Generating embeddings for {len(docs_to_process)} documents...")
    for idx, doc in enumerate(docs_to_process):
        description = doc.get('description', '')
        embedding = generate_embedding(description)
        if embedding:
            doc_copy = doc.copy()
            doc_copy['descriptionVector'] = embedding
            embedded_docs.append(doc_copy)
        # Progress update every 10 documents
        if (idx + 1) % 10 == 0:
            print(f"   ✅ Processed {idx + 1}/{len(docs_to_process)} documents...")
    print(f"\n✅ Generated embeddings for {len(embedded_docs)} documents")
    return embedded_docs

# Run the embedding process
embedded_documents = embed_documents(data, limit=50)

# Show results
print(f"\n📊 Results Summary:")
print(f"   Documents processed: {len(embedded_documents)}")
print(f"   Embedding dimensions: {len(embedded_documents[0]['descriptionVector'])}")

# Show a sample embedded document
sample_embedded = embedded_documents[0]
print(f"\n📋 Sample Embedded Document:")
print(f"   Name: {sample_embedded['name']}")
print(f"   Has embedding: {'descriptionVector' in sample_embedded}")
print(f"   Vector preview: {sample_embedded['descriptionVector'][:3]}...")

Run the embedding cells in the notebook and verify the output:
Expected Output:
✅ Libraries imported and environment loaded
📄 Loaded 1000 listings from raw_data.json

📋 Sample Listing:
   ID: 360
   Name: Sit in the Peaceful Garden of the Chickadee Cottage in LoHi
   ...

🔄 Generating embeddings for 50 documents...
   ✅ Processed 10/50 documents...
   ✅ Processed 20/50 documents...
   ...
✅ Generated embeddings for 50 documents

📊 Results Summary:
   Documents processed: 50
   Embedding dimensions: 1536
💡 Note: The full dataset is already embedded in data/embedded_data.json. This exercise demonstrates the embedding process without the cost of re-embedding all 1,000 listings.
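One way to lower that cost further: `embed_documents()` above makes one API call per document, but the OpenAI embeddings endpoint also accepts a list of inputs per request. A batched sketch (the helpers `chunked` and `embed_documents_batched` are our own names, not part of the notebook):

```python
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

def embed_documents_batched(documents, client, batch_size=16):
    """Embed listing descriptions one batch per API call instead of one per doc."""
    embedded = []
    for batch in chunked(documents, batch_size):
        # The API rejects empty strings, so substitute a single space
        texts = [doc.get('description') or ' ' for doc in batch]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=texts,  # the endpoint accepts a list of inputs
        )
        # response.data preserves input order, so zip docs back to vectors
        for doc, item in zip(batch, response.data):
            doc_copy = doc.copy()
            doc_copy['descriptionVector'] = item.embedding
            embedded.append(doc_copy)
    return embedded
```

With `batch_size=16`, embedding the 50 sample documents takes 4 requests instead of 50.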
👉 Start Here if you skipped the embedding demonstration (Steps 1-5).
Now that your data with embeddings is loaded in DocumentDB, you need to create a vector search index to enable fast similarity searches.
1. Open the DocumentDB Extension in VS Code (click the database icon in the sidebar)
2. Navigate to your Scrapbook:
   - Right-click on your collection `listings`
   - Select "New Scrapbook"
3. Run the following commands in your scrapbook (select each block and press `Ctrl+Enter` or click "Run"):
// Create vector search index on the descriptionVector field
db.runCommand({
    createIndexes: "listings",
    indexes: [{
        key: { "descriptionVector": "cosmosSearch" },
        name: "vectorSearchIndex",
        cosmosSearchOptions: {
            kind: "vector-ivf",
            numLists: 100,
            similarity: "COS",
            dimensions: 1536
        }
    }]
})

// Check all indexes on the collection
db.listings.getIndexes()

Expected Output:

[
    { "name": "_id_", "key": { "_id": 1 } },
    { "name": "vectorSearchIndex", "key": { "descriptionVector": "cosmosSearch" } }
]

| Parameter | Value | Description |
|---|---|---|
| `kind` | `"vector-ivf"` | Uses an Inverted File Index for fast approximate search |
| `numLists` | `100` | Number of clusters (higher = more accurate but slower) |
| `similarity` | `"COS"` | Cosine similarity (range: 0 to 1, where 1 = identical) |
| `dimensions` | `1536` | Must match your embedding size (OpenAI text-embedding-3-small) |
DocumentDB supports native vector search with two index types:
- IVF (Inverted File Index): Fast, approximate search suitable for large datasets
- HNSW (Hierarchical Navigable Small World): More accurate but uses more memory
For this workshop, we'll use IVF for better performance with our dataset.
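If you prefer to create the index from Python rather than the scrapbook, the same command can be issued through pymongo's `db.command()`. The helper below (our own sketch) builds the command dict for either index kind; the HNSW options shown (`m`, `efConstruction`) are illustrative values, so check your DocumentDB version's documentation before tuning them:

```python
def vector_index_command(collection="listings", kind="vector-ivf",
                         dimensions=1536, num_lists=100,
                         m=16, ef_construction=64):
    """Build a createIndexes command for a DocumentDB vector index.

    kind is "vector-ivf" or "vector-hnsw"; HNSW parameters here are
    illustrative defaults, not tuned recommendations.
    """
    options = {"kind": kind, "similarity": "COS", "dimensions": dimensions}
    if kind == "vector-ivf":
        options["numLists"] = num_lists
    else:
        options["m"] = m
        options["efConstruction"] = ef_construction
    return {
        "createIndexes": collection,
        "indexes": [{
            "key": {"descriptionVector": "cosmosSearch"},
            "name": "vectorSearchIndex",
            "cosmosSearchOptions": options,
        }],
    }

# With pymongo you would run: db.command(vector_index_command())
```

Swapping to HNSW is then a one-argument change: `db.command(vector_index_command(kind="vector-hnsw"))`.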
📓 Follow along in vector-search.ipynb - Steps 1-4
def search_listings(query, limit=5):
    """
    Search for listings using semantic similarity.

    Args:
        query (str): Natural language search query
        limit (int): Maximum number of results to return

    Returns:
        list: Matching listings with similarity scores
    """
    # Generate embedding for the query
    query_embedding = generate_embedding(query)
    if not query_embedding:
        print("❌ Failed to generate query embedding")
        return []

    # Perform vector search using cosmosSearch
    pipeline = [
        {
            "$search": {
                "cosmosSearch": {
                    "vector": query_embedding,
                    "path": "descriptionVector",
                    "k": limit  # Number of nearest neighbors
                },
                "returnStoredSource": True
            }
        },
        {
            "$project": {
                "_id": 1,
                "name": 1,
                "description": 1,
                "property_type": 1,
                "bedrooms": 1,
                "beds": 1,
                "price": 1,
                "neighborhood_overview": 1,
                "amenities": 1,
                "searchScore": {"$meta": "searchScore"}
            }
        }
    ]

    results = list(collection.aggregate(pipeline))
    return results

# Test the search
query = "cozy apartment with parking near downtown"
results = search_listings(query, limit=5)

print(f"\n🔍 Search Query: '{query}'")
print(f"📊 Found {len(results)} results\n")
for idx, result in enumerate(results, 1):
    print(f"{idx}. {result['name']}")
    print(f"   Property Type: {result.get('property_type', 'N/A')}")
    print(f"   Neighborhood: {result.get('neighborhood_overview', 'N/A')[:80]}")
    print(f"   Bedrooms: {result.get('bedrooms', 'N/A')} | Price: ${result.get('price', 'N/A')}")
    print(f"   Similarity Score: {result.get('searchScore', 0):.4f}")
    print(f"   Preview: {result.get('description', '')[:100]}...")
    print()

Expected Output:
🔍 Search Query: 'cozy apartment with parking near downtown'
📊 Found 5 results
1. Downtown Studio with Parking
Property Type: Apartment
Neighborhood: Located in a vibrant area near downtown Denver...
Bedrooms: 1 | Price: $95.0
Similarity Score: 0.8523
Preview: Cozy studio apartment in the heart of downtown. Free parking included. Walking distance to...
2. City Center Apartment
Property Type: Apartment
Neighborhood: This quiet neighborhood is close to restaurants, shops, and parks...
Bedrooms: 1 | Price: $120.0
Similarity Score: 0.8201
Preview: Modern apartment with dedicated parking spot. Located near downtown shopping and dining...
- Scores range from 0 to 1 (with cosine similarity)
- Higher scores = more similar to the query
- Scores above 0.75 typically indicate strong semantic relevance
- Scores between 0.5-0.75 are moderately relevant
- Scores below 0.5 may be weak matches
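These score bands can be applied directly to the output of `search_listings()`. A small sketch (`filter_by_relevance` is our own helper, and the mock results are illustrative, not from the dataset):

```python
def filter_by_relevance(results, min_score=0.75):
    """Keep only results whose similarity score clears the threshold."""
    return [r for r in results if r.get('searchScore', 0) >= min_score]

# Mock results in the same shape search_listings() returns
mock_results = [
    {'name': 'Downtown Studio', 'searchScore': 0.85},      # strong match
    {'name': 'City Center Apartment', 'searchScore': 0.62},  # moderate
    {'name': 'Suburban House', 'searchScore': 0.41},       # weak
]

print([r['name'] for r in filter_by_relevance(mock_results)])
```

With the default 0.75 threshold only the strong match survives; lowering `min_score` to 0.5 would also keep the moderate one.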
📓 Follow along in vector-search.ipynb - Step 5
def search_listings_with_filters(query, filters=None, limit=5):
    """
    Search for listings with semantic similarity and additional filters.

    Args:
        query (str): Natural language search query
        filters (dict): Optional filters (bedrooms, price_max, neighborhood, amenities)
        limit (int): Maximum number of results to return

    Returns:
        list: Matching listings with similarity scores
    """
    # Generate embedding for the query
    query_embedding = generate_embedding(query)
    if not query_embedding:
        print("❌ Failed to generate query embedding")
        return []

    # Build match stage for filters
    match_conditions = {}
    if filters:
        if 'bedrooms' in filters:
            match_conditions['bedrooms'] = {"$gte": filters['bedrooms']}
        if 'price_max' in filters:
            match_conditions['price'] = {"$lte": filters['price_max']}
        if 'neighborhood' in filters:
            match_conditions['neighborhood_overview'] = {
                "$regex": filters['neighborhood'],
                "$options": "i"
            }
        if 'amenities' in filters:
            # Amenities is a list, so we check that all required amenities are present
            match_conditions['amenities'] = {"$all": filters['amenities']}

    # Build aggregation pipeline
    pipeline = [
        {
            "$search": {
                "cosmosSearch": {
                    "vector": query_embedding,
                    "path": "descriptionVector",
                    "k": limit * 10  # Fetch more to account for filtering
                },
                "returnStoredSource": True
            }
        }
    ]

    # Add filter stage if we have conditions
    if match_conditions:
        pipeline.append({"$match": match_conditions})

    # Add projection and limit
    pipeline.extend([
        {
            "$project": {
                "_id": 1,
                "name": 1,
                "description": 1,
                "property_type": 1,
                "bedrooms": 1,
                "beds": 1,
                "price": 1,
                "neighborhood_overview": 1,
                "amenities": 1,
                "searchScore": {"$meta": "searchScore"}
            }
        },
        {"$limit": limit}
    ])

    results = list(collection.aggregate(pipeline))
    return results

# Test with filters
query = "family-friendly home with outdoor space"
filters = {
    "bedrooms": 3,
    "price_max": 200,
    "amenities": ["Wifi", "Kitchen"]
}
results = search_listings_with_filters(query, filters, limit=5)

print(f"\n🔍 Search Query: '{query}'")
print(f"🎯 Filters:")
print(f"   - Bedrooms: {filters['bedrooms']}+")
print(f"   - Max Price: ${filters['price_max']}")
print(f"   - Amenities: {', '.join(filters['amenities'])}")
print(f"\n📊 Found {len(results)} results\n")
for idx, result in enumerate(results, 1):
    print(f"{idx}. {result['name']}")
    print(f"   Property Type: {result.get('property_type', 'N/A')}")
    print(f"   Neighborhood: {result.get('neighborhood_overview', 'N/A')[:80]}")
    print(f"   Bedrooms: {result.get('bedrooms', 'N/A')} | Price: ${result.get('price', 'N/A')}")
    print(f"   Similarity Score: {result.get('searchScore', 0):.4f}")
    amenities_preview = ', '.join(result.get('amenities', [])[:5])
    print(f"   Amenities: {amenities_preview}...")
    print()

📓 Follow along in vector-search.ipynb - Step 6
Try these queries to see how semantic search works:
# Test various semantic queries
test_queries = [
    "romantic getaway for couples",
    "pet-friendly place near parks",
    "business travel with home office",
    "beachfront property for surfing",
    "quiet retreat for meditation and yoga"
]

print("🧪 Testing Semantic Search Capabilities\n")
print("=" * 80)
for query in test_queries:
    results = search_listings(query, limit=3)
    print(f"\n🔍 Query: '{query}'")
    print(f"📊 Top 3 Results:")
    for idx, result in enumerate(results, 1):
        print(f"\n   {idx}. {result['name']}")
        print(f"      Score: {result.get('searchScore', 0):.4f}")
        print(f"      {result.get('property_type', 'N/A')} | "
              f"{result.get('bedrooms', 'N/A')} bed | "
              f"${result.get('price', 'N/A')}/night")
    print("\n" + "-" * 80)

💡 Observations:
- Notice how the search understands context (e.g., "romantic getaway" finds properties with ambiance descriptions)
- "Pet-friendly" matches listings that mention pets, animals, or outdoor areas
- "Business travel" finds properties with workspaces, desks, and good wifi
- The semantic understanding goes beyond exact keyword matching
Now let's start the application to see it in action!
The backend is a FastAPI application that provides the search and chat APIs.
pip install -r src/api/requirements.txt
uvicorn src.api.main:app --reload --host 0.0.0.0 --port 8000

You should see output like:
INFO: Uvicorn running on http://0.0.0.0:8000
INFO: Started reloader process
✅ Connected to DocumentDB: db.listings

💡 Tip: In Codespaces, click the "Open in Browser" button when prompted, or go to the Ports tab and click the globe icon for port 8000 to access the API docs at /docs.
Open a new terminal (Terminal β New Terminal) and run:
cd src/frontend
npm install
npm start

You should see:
Compiled successfully!
You can now view the app in the browser.
Local: http://localhost:3000
💡 Tip: In Codespaces, click the "Open in Browser" button when prompted for port 3000 to view the application.
1. Check the Backend Health:
   - Open: http://localhost:8000/health (or the Codespaces URL)
   - You should see a JSON response with "status": "ok" or "status": "degraded"
2. Check the Frontend:
   - Open: http://localhost:3000 (or the Codespaces URL)
   - You should see the booking search interface with a map
3. Test the Connection:
   - The frontend header shows a connection indicator
   - Green = connected to backend
   - Yellow = demo mode (backend not connected yet)
✅ Vector Embeddings: How to convert text into numerical representations
✅ OpenAI Embeddings API: Using text-embedding-3-small for semantic encoding
✅ DocumentDB Vector Indexes: Creating IVF indexes for efficient similarity search
✅ Semantic Search: Implementing cosine similarity search with cosmosSearch
✅ Search Filters: Combining vector search with traditional filters
✅ Query Understanding: How embeddings capture meaning and context
Now it's your turn! Enhance the search_listings_with_filters function with these features:
Instead of just price_max, support both price_min and price_max.
Requirements:
- Accept `price_min` and `price_max` in the filters dict
- Add proper MongoDB query conditions
- Test with: `{"price_min": 50, "price_max": 150}`
💡 Hint
if 'price_min' in filters or 'price_max' in filters:
    price_condition = {}
    if 'price_min' in filters:
        price_condition['$gte'] = filters['price_min']
    if 'price_max' in filters:
        price_condition['$lte'] = filters['price_max']
    match_conditions['price'] = price_condition

Add support for filtering by property type (e.g., "House", "Apartment", "Condominium").
Requirements:
- Accept `property_type` in the filters dict
- Can be a single string or a list of types
- Test with: `{"property_type": "House"}` and `{"property_type": ["House", "Apartment"]}`
💡 Hint
if 'property_type' in filters:
    if isinstance(filters['property_type'], list):
        match_conditions['property_type'] = {"$in": filters['property_type']}
    else:
        match_conditions['property_type'] = filters['property_type']

Add support for searching within a radius of a given location.
Requirements:
- Accept `location` (coordinates as `[lng, lat]`) and `radius_km` in filters
- Note: Our data uses separate `latitude`/`longitude` fields, so use a bounding-box approach
- Test with Denver coordinates: `{"location": [-104.9903, 39.7392], "radius_km": 10}`
💡 Hint
Since our data has separate latitude/longitude fields (not GeoJSON), use a bounding box approach:
import math

if 'location' in filters and 'radius_km' in filters:
    lng, lat = filters['location']
    # Approximate degrees per km at this latitude
    lat_delta = filters['radius_km'] / 111.0
    lng_delta = filters['radius_km'] / (111.0 * abs(math.cos(math.radians(lat))))
    match_conditions['latitude'] = {"$gte": lat - lat_delta, "$lte": lat + lat_delta}
    match_conditions['longitude'] = {"$gte": lng - lng_delta, "$lte": lng + lng_delta}

Combine semantic similarity with price preference (favor cheaper listings).
Requirements:
- Calculate a hybrid score: `final_score = semantic_score * 0.7 + price_score * 0.3`
- Price score: normalize price to a 0-1 range (lower price = higher score)
- Re-sort results by hybrid score
💡 Hint
# After getting results, calculate hybrid scores
for result in results:
    semantic_score = result.get('searchScore', 0)
    price = result.get('price', 100)
    # Normalize price (assuming max price is 500)
    price_score = 1 - (min(price, 500) / 500)
    # Calculate hybrid score
    result['hybridScore'] = semantic_score * 0.7 + price_score * 0.3

# Sort by hybrid score
results.sort(key=lambda x: x.get('hybridScore', 0), reverse=True)

Once you're comfortable with the search functionality, try loading the full dataset:
# Load all 35K listings (this will take several minutes)
full_documents = load_data_with_embeddings(
    'data/datasets without embeddings/large_35K.json',
    limit=None  # Process all documents
)

# Insert into DocumentDB
insert_documents(full_documents)

# Recreate indexes
create_vector_index()

# Test search on full dataset
results = search_listings("luxury penthouse with city views", limit=10)

Running this will:
- Take approximately 10-15 minutes
- Cost around $0.05-0.10 in OpenAI API usage
- Require proper rate limit handling (already built into our function)
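For reference, rate limit handling usually means retrying with exponential backoff. A generic sketch of the idea (our own helper, not the workshop's implementation; with the openai package you would pass `retriable=(openai.RateLimitError,)`):

```python
import random
import time

def with_backoff(fn, retriable=(Exception,), max_retries=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter on retriable errors."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retriable:
            if attempt == max_retries - 1:
                raise  # Out of retries: surface the error to the caller
            # Backoff: base_delay * 1, 2, 4, ... with jitter to spread out retries
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Usage sketch: wrap each embedding call
# embedding = with_backoff(lambda: generate_embedding(text), base_delay=2.0)
```

The jitter keeps many parallel workers from retrying in lockstep against the same rate limit window.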
Before moving to Module 2, ensure you have:
- Successfully connected to DocumentDB
- Generated embeddings using OpenAI's API
- Created a vector search index (IVF)
- Implemented basic semantic search
- Added filters to refine search results
- Tested with various natural language queries
- Completed at least one challenge exercise
In Module 2: RAG Pattern Implementation, you'll learn how to:
- Build a conversational AI that uses your vector search
- Implement Retrieval-Augmented Generation (RAG) with LangChain
- Create context-aware responses using retrieved listings
- Handle conversation memory and follow-up questions
- Optimize prompts for better AI responses
💬 Questions or Issues?
If you're stuck, check the troubleshooting section in Module 0, or ask your instructor for help!