A comprehensive collection of examples for performing batch vector search operations using Databricks Vector Search with different processing approaches, query types, and configuration management.
This repository provides production-ready examples for scaling vector search operations across large datasets using:
- Python Async Processing (< 1M records, serverless CPU)
- Ray Distributed Processing (> 1M records, multi-node clusters)
- Centralized Configuration Management (environment-specific settings)
- Multiple Query Types (ANN vector-only, HYBRID text+vector)

Prerequisites:
- Databricks workspace with Vector Search enabled
- Unity Catalog configured
- Python libraries: `databricks-vectorsearch`, `httpx`, `ray[default]`

Download and prepare the IMDB dataset by running `vector_search_batch/01-download-dataset.ipynb`.

Create the vector search index with embeddings by running `vector_search_batch/02-create-vector-search-index.ipynb`.

Choose your configuration approach:

```python
from vector_search_batch.config import VectorSearchConfig, ConfigPresets, load_config

# Option A: Default configuration (quickest start)
config = VectorSearchConfig()

# Option B: Environment preset
config = ConfigPresets.development()  # or staging, production

# Option C: Custom configuration
config = load_config(
    uc_catalog="your_catalog",
    uc_schema="your_schema",
    default_query_type="ANN"
)
```

Then run the batch script that matches your dataset size:
- For datasets < 1M records, run `vector_search_batch/03-vs-async-batch-python.py`
- For datasets > 1M records, run `vector_search_batch/03-vs-async-batch-ray.py`

Repository structure:

```
databricks-vector-search-examples/
├── README.md                                # This file
├── databricks.yml                           # Databricks configuration
└── vector_search_batch/                     # Main examples directory
    ├── README.md                            # Detailed documentation
    ├── config.py                            # Configuration management system
    ├── 01-download-dataset.ipynb            # Data preparation
    ├── 02-create-vector-search-index.ipynb  # Index creation
    ├── 03-vs-async-batch-python.py          # Python async processing
    ├── 03-vs-async-batch-ray.py             # Ray distributed processing
    └── [legacy files]                       # Older implementations
```

- Type-safe configuration with dataclass validation
- Multiple configuration sources (default, presets, environment variables, overrides)
- Environment-specific settings (dev, staging, production)
- Easy customization without editing core code
- CI/CD integration via environment variables
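
As a rough illustration of the ideas listed above, the sketch below shows how dataclass validation, environment-variable loading, and keyword overrides can fit together. `SketchConfig` and `load_sketch_config` are hypothetical stand-ins, not the contents of `config.py`. The field names and environment variable names follow this README; the default values, the validation rules, and the precedence order (defaults, then environment, then overrides) are assumptions.

```python
import os
from dataclasses import dataclass, fields

VALID_QUERY_TYPES = {"ANN", "HYBRID"}


@dataclass
class SketchConfig:
    """Hypothetical stand-in for VectorSearchConfig (defaults are assumed)."""

    uc_catalog: str = "main"
    uc_schema: str = "default"
    default_query_type: str = "ANN"
    default_concurrency: int = 100
    request_timeout: int = 30

    def __post_init__(self):
        # Reject values that would only fail later, at query time.
        if self.default_query_type not in VALID_QUERY_TYPES:
            raise ValueError(f"unknown query type: {self.default_query_type!r}")
        if self.default_concurrency < 1:
            raise ValueError("default_concurrency must be >= 1")
        if self.request_timeout <= 0:
            raise ValueError("request_timeout must be positive")


def load_sketch_config(use_env=False, environ=None, **overrides):
    """Build a config from defaults, then environment, then explicit overrides."""
    values = {}
    if use_env:
        env = os.environ if environ is None else environ
        for f in fields(SketchConfig):
            raw = env.get(f.name.upper())  # e.g. UC_CATALOG, DEFAULT_CONCURRENCY
            if raw is not None:
                # Cast the string to the type of the field's default value.
                values[f.name] = type(f.default)(raw)
    values.update(overrides)  # explicit keyword arguments win
    return SketchConfig(**values)
```

Validating in `__post_init__` means a bad value raises as soon as the config object is built, whichever source it came from.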

Default:

```python
from vector_search_batch.config import VectorSearchConfig

config = VectorSearchConfig()  # Uses sensible defaults
```

Presets:

```python
from vector_search_batch.config import ConfigPresets

config = ConfigPresets.development()  # Lower concurrency, smaller samples
config = ConfigPresets.staging()      # Medium settings
config = ConfigPresets.production()   # High performance settings
```

Environment variables:

```bash
export UC_CATALOG="prod"
export UC_SCHEMA="vector_search"
export VS_INDEX_NAME="prod_vs_index"
export DEFAULT_QUERY_TYPE="HYBRID"
export DEFAULT_CONCURRENCY="100"
```

```python
from vector_search_batch.config import load_config

config = load_config(use_env=True)
```

Overrides:

```python
from vector_search_batch.config import load_config

config = load_config(
    uc_catalog="my_catalog",
    default_query_type="ANN",
    default_concurrency=50,
    max_sample_size=1000
)
```

Python async processing is best for: < 1M records, serverless CPU, rapid prototyping

Features:
- Configurable concurrency (default: 100)
- Automatic retry logic with exponential backoff
- Memory-efficient processing
- All query types supported (ANN, HYBRID)
- Serverless CPU compatible

Ray distributed processing is best for: > 1M records, multi-node clusters, production scale

Features:
- Distributed processing across multiple worker nodes
- Memory-efficient batch processing
- Automatic fault tolerance and retries
- Scales to very large datasets
- Advanced resource management

ANN (vector-only) queries:
- Use Case: Pure vector similarity search
- API: Only the `query_vector` parameter
- Best For: Finding semantically similar content based on embeddings

HYBRID (text + vector) queries:
- Use Case: Combines semantic text matching with vector similarity
- API: Both the `query_text` and `query_vector` parameters
- Best For: Balanced search combining text and vector matching

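
The difference between the two query types comes down to which parameters are sent. The helper below is a minimal sketch of that rule; `build_query_kwargs` is a hypothetical name, not part of this repository. It is assumed that the resulting dictionary would be forwarded to the index's search call (for example as keyword arguments to a `similarity_search`-style method), so check the parameter names against the SDK version you use.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_query_kwargs(
    query_type: str,
    query_vector: Sequence[float],
    query_text: Optional[str] = None,
    num_results: int = 10,
    columns: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Assemble search parameters for an ANN or HYBRID query (hypothetical helper)."""
    query_type = query_type.upper()
    if query_type == "ANN":
        # ANN: only the embedding is sent; any query_text is ignored.
        kwargs: Dict[str, Any] = {"query_vector": list(query_vector)}
    elif query_type == "HYBRID":
        # HYBRID: both the raw text and its embedding are sent.
        if not query_text:
            raise ValueError("HYBRID queries need query_text as well as query_vector")
        kwargs = {"query_text": query_text, "query_vector": list(query_vector)}
    else:
        raise ValueError(f"unsupported query type: {query_type!r}")

    kwargs["query_type"] = query_type
    kwargs["num_results"] = num_results
    kwargs["columns"] = list(columns) if columns else []
    return kwargs
```

Failing fast on a HYBRID query without text avoids sending a request that silently behaves like a vector-only search.
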
| Aspect | Python Async | Ray Distributed |
|---|---|---|
| Dataset Size | < 1M records | > 1M records |
| Memory Usage | Moderate | Very Low |
| Setup Complexity | Simple | Moderate |
| Fault Tolerance | Basic retry | Advanced |
| Scalability | Vertical | Horizontal |
| Compute Type | Single-node | Multi-node |
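
For the Python async column, the two mechanisms this README names (a configurable concurrency limit and retries with exponential backoff) can be sketched with the standard library alone. This is an illustrative outline, not the code in `03-vs-async-batch-python.py`; `query_fn` is a placeholder for whatever coroutine issues a single search request, and the retry count, base delay, and jitter are assumptions.

```python
import asyncio
import random


async def query_with_retry(query_fn, item, max_retries=3, base_delay=0.5):
    """Call query_fn(item), retrying failures with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return await query_fn(item)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Wait base_delay, 2*base_delay, 4*base_delay, ... plus a little jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)


async def run_batch(query_fn, items, concurrency=100, **retry_kwargs):
    """Run query_fn over items with at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(item):
        async with semaphore:
            return await query_with_retry(query_fn, item, **retry_kwargs)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(bounded(item) for item in items))
```

Lowering `concurrency` is the knob to reach for when the endpoint starts rate limiting; the backoff only spreads out retries after a failure has already happened.
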

```python
from vector_search_batch.config import load_config

config = load_config(
    default_query_type="HYBRID",  # Text + vector similarity
    default_num_results=10,       # Top 10 recommendations
    default_concurrency=50
)
```

```python
config = load_config(
    default_query_type="ANN",     # Pure vector similarity
    default_num_results=5,        # Find top duplicates
    default_concurrency=100
)
```

```python
from vector_search_batch.config import ConfigPresets

config = ConfigPresets.production()
# Use Ray distributed processing for > 1M records
```

```python
# Development
config = ConfigPresets.development()
# - Lower concurrency (20)
# - Smaller sample size (100)
# - Faster timeouts (15s)

# Production
config = ConfigPresets.production()
# - High concurrency (100)
# - Large sample size (10000)
# - Longer timeouts (60s)
```

High-throughput processing:

```python
config = load_config(
    default_concurrency=200,
    max_sample_size=50000,
    request_timeout=60
)
```

Memory-constrained environment:

```python
config = load_config(
    default_concurrency=20,
    max_sample_size=100,
    request_timeout=15
)
```

```bash
git clone https://github.com/databricks/databricks-vector-search-examples.git
cd databricks-vector-search-examples
```

```python
%pip install databricks-vectorsearch httpx ray[default]
```

```bash
# Set environment variables for your deployment
export UC_CATALOG="your_catalog"
export UC_SCHEMA="your_schema"
export VS_INDEX_NAME="your_index"
export VECTOR_SEARCH_ENDPOINT="your_endpoint"
```

Start with data preparation, then run either Python async or Ray distributed processing.

- Config not found: Ensure `config.py` is in the correct directory
- Environment variables not loading: Use `load_config(use_env=True)`
- Validation errors: Check parameter types and ranges
- Memory errors: Use Ray distributed approach or reduce `max_sample_size`
- Rate limiting: Reduce `default_concurrency` or use development preset
- Timeouts: Increase `request_timeout` or use production preset

- Detailed Documentation: See vector_search_batch/README.md
- Configuration Reference: See vector_search_batch/config.py
- API Documentation: Databricks Vector Search
- Use the configuration system for all new examples
- Test with all presets (development, staging, production)
- Follow existing naming conventions
- Include comprehensive documentation
- Add performance benchmarks
This repository is provided as examples for Databricks customers and field engineering teams.
For issues and questions:
- Check the troubleshooting section in vector_search_batch/README.md
- Review configuration documentation
- Open an issue in this repository