alexmillerdb/databricks-vector-search-examples

Databricks Vector Search Examples

A collection of examples for running batch queries against Databricks Vector Search. It covers two processing approaches (Python async and Ray), two query types (ANN and HYBRID), and a centralized configuration system.

🎯 Overview

This repository provides production-ready examples for scaling vector search operations across large datasets using:

  • Python Async Processing (< 1M records, serverless CPU)
  • Ray Distributed Processing (> 1M records, multi-node clusters)
  • Centralized Configuration Management (environment-specific settings)
  • Multiple Query Types (ANN vector-only, HYBRID text+vector)

🚀 Quick Start

Prerequisites

  • Databricks workspace with Vector Search enabled
  • Unity Catalog configured
  • Python libraries: databricks-vectorsearch, httpx, ray[default]

1. Data Preparation

# Download and prepare the IMDB dataset
Run: vector_search_batch/01-download-dataset.ipynb

2. Index Creation

# Create vector search index with embeddings
Run: vector_search_batch/02-create-vector-search-index.ipynb

3. Configuration Setup

# Choose your configuration approach
from vector_search_batch.config import VectorSearchConfig, ConfigPresets, load_config

# Option A: Default configuration (quickest start)
config = VectorSearchConfig()

# Option B: Environment preset
config = ConfigPresets.development()  # or staging, production

# Option C: Custom configuration
config = load_config(
    uc_catalog="your_catalog",
    uc_schema="your_schema",
    default_query_type="ANN"
)

4. Run Batch Processing

# For datasets < 1M records
Run: vector_search_batch/03-vs-async-batch-python.py

# For datasets > 1M records
Run: vector_search_batch/03-vs-async-batch-ray.py

📁 Repository Structure

databricks-vector-search-examples/
├── README.md                          # This file
├── databricks.yml                     # Databricks configuration
└── vector_search_batch/               # Main examples directory
    ├── README.md                      # Detailed documentation
    ├── config.py                      # 🆕 Configuration management system
    ├── 01-download-dataset.ipynb      # Data preparation
    ├── 02-create-vector-search-index.ipynb # Index creation
    ├── 03-vs-async-batch-python.py    # Python async processing
    ├── 03-vs-async-batch-ray.py       # Ray distributed processing
    └── [legacy files]                 # Older implementations

🔧 Configuration System (NEW!)

Key Features

  • ✅ Type-safe configuration with dataclass validation
  • ✅ Multiple configuration sources (default, presets, environment variables, overrides)
  • ✅ Environment-specific settings (dev, staging, production)
  • ✅ Easy customization without editing core code
  • ✅ CI/CD integration via environment variables
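
To illustrate what "type-safe configuration with dataclass validation" can look like, here is a minimal sketch. It is not the repository's `config.py`: the class name `SketchConfig`, the catalog and schema defaults, and the 30-second timeout are assumptions made for the example. Only the field names and the default concurrency of 100 come from this README.

```python
from dataclasses import dataclass

VALID_QUERY_TYPES = ("ANN", "HYBRID")


@dataclass
class SketchConfig:
    """Hypothetical stand-in for VectorSearchConfig; defaults are illustrative."""

    uc_catalog: str = "main"          # assumed default
    uc_schema: str = "default"        # assumed default
    default_query_type: str = "ANN"
    default_concurrency: int = 100
    request_timeout: int = 30         # seconds, assumed default

    def __post_init__(self) -> None:
        # Dataclasses do not enforce annotations at runtime, so check explicitly.
        if self.default_query_type not in VALID_QUERY_TYPES:
            raise ValueError(
                f"default_query_type must be one of {VALID_QUERY_TYPES}, "
                f"got {self.default_query_type!r}"
            )
        for name in ("default_concurrency", "request_timeout"):
            value = getattr(self, name)
            # bool is a subclass of int, so reject it separately.
            if isinstance(value, bool) or not isinstance(value, int):
                raise TypeError(f"{name} must be an int, got {type(value).__name__}")
            if value <= 0:
                raise ValueError(f"{name} must be positive, got {value}")
```

With this shape, a bad value such as `SketchConfig(default_query_type="KNN")` fails at construction time rather than in the middle of a batch run.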

Configuration Options

Default Configuration

from vector_search_batch.config import VectorSearchConfig
config = VectorSearchConfig()  # Uses sensible defaults

Environment Presets

from vector_search_batch.config import ConfigPresets

config = ConfigPresets.development()  # Lower concurrency, smaller samples
config = ConfigPresets.staging()      # Medium settings
config = ConfigPresets.production()   # High performance settings

Environment Variables

# Shell: set the variables before starting the job
export UC_CATALOG="prod"
export UC_SCHEMA="vector_search"
export VS_INDEX_NAME="prod_vs_index"
export DEFAULT_QUERY_TYPE="HYBRID"
export DEFAULT_CONCURRENCY="100"

# Python: load the configuration from those variables
from vector_search_batch.config import load_config
config = load_config(use_env=True)
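
Environment variables are always strings, so a loader has to convert values such as `DEFAULT_CONCURRENCY` to the right type. The sketch below shows one way to do that. The variable names are the ones exported above; the helper `read_env_settings`, the field names it maps to, and the choice to skip empty values are assumptions, not the behavior of `load_config(use_env=True)`.

```python
import os
from typing import Any, Dict, Mapping, Optional

# Environment variable -> (config field, converter).
# Field names on the right are assumed for illustration.
ENV_FIELDS = {
    "UC_CATALOG": ("uc_catalog", str),
    "UC_SCHEMA": ("uc_schema", str),
    "VS_INDEX_NAME": ("vs_index_name", str),
    "DEFAULT_QUERY_TYPE": ("default_query_type", str),
    "DEFAULT_CONCURRENCY": ("default_concurrency", int),
}


def read_env_settings(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Collect config values from the environment, skipping unset variables."""
    env = os.environ if environ is None else environ
    settings: Dict[str, Any] = {}
    for var, (field_name, convert) in ENV_FIELDS.items():
        raw = env.get(var)
        if raw is None or raw == "":
            continue  # leave the field at its default
        try:
            settings[field_name] = convert(raw)
        except ValueError as exc:
            raise ValueError(f"{var}={raw!r} is not a valid {convert.__name__}") from exc
    return settings
```

Passing a plain dict instead of `os.environ` makes the function easy to exercise in tests or CI without touching the real environment.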

Custom Overrides

from vector_search_batch.config import load_config
config = load_config(
    uc_catalog="my_catalog",
    default_query_type="ANN",
    default_concurrency=50,
    max_sample_size=1000
)
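
Keyword overrides like the ones above are easy to mistype. A minimal sketch of how overrides could be applied on top of a base configuration while rejecting unknown field names follows. `BaseSettings` and `apply_overrides` are hypothetical names, and the default values are illustrative; the real `load_config` may behave differently.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class BaseSettings:
    """Hypothetical base configuration; defaults are illustrative."""

    uc_catalog: str = "main"
    default_query_type: str = "ANN"
    default_concurrency: int = 100
    max_sample_size: int = 1000


def apply_overrides(base: BaseSettings, **overrides) -> BaseSettings:
    """Return a copy of base with overrides applied, rejecting unknown names."""
    known = {f.name for f in fields(base)}
    unknown = sorted(set(overrides) - known)
    if unknown:
        raise TypeError(f"unknown config field(s): {', '.join(unknown)}")
    # replace() builds a new instance, so the base object is never mutated.
    return replace(base, **overrides)
```

Because the base object is frozen and copied, one preset can be shared across jobs and each job applies only its own overrides.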

🎯 Processing Approaches

Python Async Processing

Best for: < 1M records, serverless CPU, rapid prototyping

# Features:
- Configurable concurrency (default: 100)
- Automatic retry logic with exponential backoff
- Memory-efficient processing
- All query types supported (ANN, HYBRID)
- Serverless CPU compatible
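
The two central ideas in this list, bounded concurrency and retry with exponential backoff, can be sketched with the standard library alone. This is an illustration of the pattern, not the code in `03-vs-async-batch-python.py`: `with_retries`, `run_batch`, the attempt count, and the delay values are all assumptions, and `query` stands in for whatever coroutine issues the actual search request.

```python
import asyncio
import random
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def with_retries(call: Callable[[], Awaitable[R]],
                       max_attempts: int = 4,
                       base_delay: float = 0.5) -> R:
    """Await call(), retrying failures with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        try:
            return await call()
        except Exception:
            attempt += 1
            if attempt >= max_attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, 4x, ... plus a little random jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)


async def run_batch(items: Sequence[T],
                    query: Callable[[T], Awaitable[R]],
                    concurrency: int = 100,
                    base_delay: float = 0.5) -> List[R]:
    """Run query(item) for every item with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(item: T) -> R:
        async with semaphore:
            return await with_retries(lambda: query(item), base_delay=base_delay)

    # gather returns results in the same order as the inputs.
    return list(await asyncio.gather(*(bounded(item) for item in items)))
```

Lowering `concurrency` is the first lever to try when the endpoint starts rate limiting; the backoff only smooths over transient failures.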

Ray Distributed Processing

Best for: > 1M records, multi-node clusters, production scale

# Features:
- Distributed processing across multiple worker nodes
- Memory-efficient batch processing
- Automatic fault tolerance and retries
- Scales to very large datasets
- Advanced resource management
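
Batch processing keeps memory bounded by handing each worker a fixed-size slice of the data instead of the whole dataset. The helper below sketches that slicing step in plain Python. `iter_batches` is a hypothetical name, and how `03-vs-async-batch-ray.py` actually partitions its input is not shown in this README.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(rows: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of rows, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


batches = list(iter_batches(list(range(10)), batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

With Ray, each batch would typically be submitted to a function decorated with `@ray.remote` and the results collected with `ray.get`, so that only one batch per worker is resident in memory at a time.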

🔍 Query Types

ANN (Vector-Only Search)

  • Use Case: Pure vector similarity search
  • API: Only query_vector parameter
  • Best For: Finding semantically similar content based on embeddings

HYBRID (Text + Vector Search)

  • Use Case: Combines keyword-based text matching with vector similarity
  • API: Both query_text and query_vector parameters
  • Best For: Balanced search combining text and vector matching
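
The difference between the two query types comes down to which parameters are sent. The sketch below assembles the keyword arguments for each case. `build_query_kwargs` is a hypothetical helper, and the `columns` default is made up; the keyword names are intended to match the `similarity_search` method of a `databricks-vectorsearch` index object, which should be checked against the installed SDK version.

```python
from typing import Any, Dict, Optional, Sequence


def build_query_kwargs(query_type: str,
                       query_vector: Sequence[float],
                       query_text: Optional[str] = None,
                       columns: Sequence[str] = ("id",),
                       num_results: int = 10) -> Dict[str, Any]:
    """Assemble search keyword arguments for an ANN or HYBRID query."""
    kind = query_type.upper()
    if kind not in ("ANN", "HYBRID"):
        raise ValueError(f"query_type must be 'ANN' or 'HYBRID', got {query_type!r}")
    kwargs: Dict[str, Any] = {
        "columns": list(columns),
        "query_vector": list(query_vector),
        "num_results": num_results,
        "query_type": kind,
    }
    if kind == "HYBRID":
        # HYBRID needs the raw text in addition to the embedding.
        if not query_text:
            raise ValueError("HYBRID queries need query_text as well as query_vector")
        kwargs["query_text"] = query_text
    return kwargs
```

The resulting dict would then be unpacked into the search call, for example `index.similarity_search(**kwargs)`, where `index` is an index handle obtained elsewhere.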

📊 Performance Comparison

| Aspect           | Python Async | Ray Distributed |
|------------------|--------------|-----------------|
| Dataset Size     | < 1M records | > 1M records    |
| Memory Usage     | Moderate     | Very Low        |
| Setup Complexity | Simple       | Moderate        |
| Fault Tolerance  | Basic retry  | Advanced        |
| Scalability      | Vertical     | Horizontal      |
| Compute Type     | Single-node  | Multi-node      |
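
The dataset-size row can be turned into a simple dispatch rule. The function below is a hypothetical sketch: the 1M threshold comes from the comparison above, but the README does not say which approach applies at exactly 1M records, so sending that case to Ray is an assumption.

```python
ASYNC_MAX_RECORDS = 1_000_000  # threshold from the comparison above


def choose_approach(num_records: int) -> str:
    """Pick a processing approach from the dataset size."""
    if num_records < 0:
        raise ValueError("num_records cannot be negative")
    # Exactly 1M is not covered by the README; this sketch routes it to Ray.
    return "python_async" if num_records < ASYNC_MAX_RECORDS else "ray_distributed"
```

Record count is only one input. Cluster availability and setup cost may justify staying on the async path for somewhat larger datasets.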

🎯 Use Case Examples

Movie Recommendation System

config = load_config(
    default_query_type="HYBRID",    # Text + vector similarity
    default_num_results=10,         # Top 10 recommendations
    default_concurrency=50
)

Content Deduplication

config = load_config(
    default_query_type="ANN",       # Pure vector similarity
    default_num_results=5,          # Find top duplicates
    default_concurrency=100
)

Large-Scale Production Processing

config = ConfigPresets.production()
# Use Ray distributed processing for > 1M records

🔧 Advanced Configuration

Environment-Specific Deployments

# Development
config = ConfigPresets.development()
# - Lower concurrency (20)
# - Smaller sample size (100)
# - Faster timeouts (15s)

# Production  
config = ConfigPresets.production()
# - High concurrency (100)
# - Large sample size (10000)
# - Longer timeouts (60s)
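
For reference, the preset values listed in the comments above can be written down as data. This sketch is not the `ConfigPresets` class: `PresetSketch`, `PRESETS`, and `get_preset` are hypothetical names, and the staging preset is left out because its values are not stated here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PresetSketch:
    """Hypothetical container for the three settings the presets vary."""

    default_concurrency: int
    max_sample_size: int
    request_timeout: int  # seconds


# Values copied from the comments above; staging is omitted because its
# numbers are not given in this README.
PRESETS = {
    "development": PresetSketch(default_concurrency=20, max_sample_size=100, request_timeout=15),
    "production": PresetSketch(default_concurrency=100, max_sample_size=10_000, request_timeout=60),
}


def get_preset(name: str) -> PresetSketch:
    """Look up a preset by name with a readable error for unknown names."""
    try:
        return PRESETS[name]
    except KeyError:
        raise ValueError(
            f"unknown preset {name!r}; expected one of {sorted(PRESETS)}"
        ) from None
```

Keeping presets as frozen data makes the differences between environments easy to review in a single place.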

Custom Configuration for Specific Use Cases

# High-throughput processing
config = load_config(
    default_concurrency=200,
    max_sample_size=50000,
    request_timeout=60
)

# Memory-constrained environment
config = load_config(
    default_concurrency=20,
    max_sample_size=100,
    request_timeout=15
)

🛠️ Installation and Setup

1. Clone Repository

git clone https://github.com/alexmillerdb/databricks-vector-search-examples.git
cd databricks-vector-search-examples

2. Install Dependencies

%pip install databricks-vectorsearch httpx ray[default]

3. Configure Environment (Optional)

# Set environment variables for your deployment
export UC_CATALOG="your_catalog"
export UC_SCHEMA="your_schema"
export VS_INDEX_NAME="your_index"
export VECTOR_SEARCH_ENDPOINT="your_endpoint"

4. Run Examples

# Start with data preparation
# Then run either Python async or Ray distributed processing

🔍 Troubleshooting

Configuration Issues

  • Config not found: Ensure config.py is in the correct directory
  • Environment variables not loading: Use load_config(use_env=True)
  • Validation errors: Check parameter types and ranges

Performance Issues

  • Memory errors: Use Ray distributed approach or reduce max_sample_size
  • Rate limiting: Reduce default_concurrency or use development preset
  • Timeouts: Increase request_timeout or use production preset

📚 Documentation

Detailed documentation is in vector_search_batch/README.md.

🀝 Contributing

  1. Use the configuration system for all new examples
  2. Test with all presets (development, staging, production)
  3. Follow existing naming conventions
  4. Include comprehensive documentation
  5. Add performance benchmarks

📄 License

This repository is provided as examples for Databricks customers and field engineering teams.

🆘 Support

For issues and questions:
