Skip to content

[FEATURE] ChromaProvider for Chroma Vector Database #29

@sajeerzeji

Description

@sajeerzeji

Is your feature request related to a problem? Please describe.

Chroma is a popular open-source vector database that's easy to self-host and provides excellent performance. Many developers prefer Chroma for:

  • Local development with persistent storage
  • Self-hosted deployments without cloud dependencies
  • Docker-based deployments
  • Python/JavaScript interoperability

Currently, toolpack-knowledge doesn't support Chroma, forcing users to either use in-memory storage (no persistence) or PostgreSQL (requires database setup).

Describe the solution you'd like

Implement ChromaProvider that integrates with Chroma's JavaScript client for vector storage and retrieval.

Implementation Details

1. Create ChromaProvider Class

File: packages/toolpack-knowledge/src/providers/chroma.ts

Implement a provider that:

  • Uses the official chromadb JavaScript client
  • Supports connection to local or remote Chroma instances
  • Implements collection management
  • Handles authentication (token and basic auth)
  • Supports multi-tenancy via tenant/database isolation

Configuration Options:

  • url - Chroma server URL (default: 'http://localhost:8000')
  • collection - Collection name (required)
  • tenant - Tenant for multi-tenancy (default: 'default_tenant')
  • database - Database name (default: 'default_database')
  • auth - Authentication configuration (optional)
    • provider: 'token' or 'basic'
    • credentials: token string or {username, password} object

2. Collection Management

Initialization:

  • Initialize ChromaClient with URL and auth config
  • Try to get existing collection
  • If collection doesn't exist, create it with cosine similarity
  • Configure HNSW index parameters

Collection Configuration:

  • Use cosine distance for similarity metric
  • Set appropriate HNSW parameters for performance

3. Core Methods Implementation

validateDimensions():

  • Get collection count to check if it has data
  • If data exists, fetch a sample vector to check dimensions
  • Compare with embedder dimensions
  • Throw DimensionMismatchError if mismatch detected
  • Store dimensions for future validation

add():

  • Extract IDs, embeddings, metadatas, and documents from chunks
  • Validate that all chunks have vectors
  • Use Chroma's upsert() method (automatic upsert behavior)
  • Handle errors with KnowledgeProviderError

query():

  • Build where clause for metadata filtering
  • Use Chroma's query() method with query embeddings
  • Specify nResults (limit)
  • Include documents, metadatas, distances, and optionally embeddings
  • Convert Chroma's distance to similarity score (1 - distance)
  • Filter results by threshold
  • Return formatted QueryResult array

delete():

  • Use Chroma's delete() method with array of IDs
  • Handle empty arrays gracefully

clear():

  • Delete the entire collection
  • Recreate it with same configuration
  • Reset stored dimensions

getAllChunks():

  • Use Chroma's get() method to fetch all data
  • Include documents, metadatas, and embeddings
  • Map results to Chunk format

close():

  • Clean up resources (Chroma client doesn't require explicit cleanup)
  • Set collection reference to null

4. Metadata Filtering

Implement buildWhereClause() helper to convert MetadataFilter to Chroma format:

  • Simple equality: {key: value}
  • $in operator: {key: {$in: [values]}}
  • $gt operator: {key: {$gt: value}}
  • $lt operator: {key: {$lt: value}}
  • Chroma natively supports these operators

5. Multi-Tenancy Support

Leverage Chroma's built-in multi-tenancy:

  • Use tenant parameter to isolate data by user/organization
  • Use database parameter for additional isolation
  • Same collection name can exist in different tenants
  • Enables efficient multi-user deployments

6. Export from Index

File: packages/toolpack-knowledge/src/index.ts

Add exports for ChromaProvider and ChromaProviderOptions.

7. Add Dependencies

File: packages/toolpack-knowledge/package.json

Add dependency:

  • chromadb - Official Chroma JavaScript client

8. Testing

File: packages/toolpack-knowledge/src/__tests__/chroma-provider.test.ts

Create comprehensive tests (requires running Chroma instance):

  • Add and retrieve chunks - Basic CRUD operations
  • Metadata filtering - Test equality, $in, $gt, $lt operators
  • Dimension validation - Test mismatch detection
  • Delete operations - Test chunk deletion
  • Upsert behavior - Test updating existing chunks (automatic in Chroma)
  • Tenant isolation - Test multi-tenancy with same collection name
  • Error handling - Test failure scenarios

Use environment variable CHROMA_URL for test server connection.
Skip tests if Chroma not available.

9. Documentation

File: packages/toolpack-knowledge/README.md

Add section on Chroma setup:

  • Docker setup instructions
  • Local installation guide
  • Connection configuration examples
  • Multi-tenancy patterns
  • Authentication setup

10. Docker Compose Example

File: packages/toolpack-knowledge/examples/docker-compose.yml

Provide Docker Compose configuration for:

  • Chroma service with persistent volume
  • Health check configuration
  • Port mapping
  • Environment variables
  • Volume management

Acceptance Criteria

  • ChromaProvider implements all KnowledgeProvider methods
  • Automatic collection creation
  • HNSW index with cosine similarity
  • Metadata filtering with $in, $gt, $lt operators
  • Upsert support (automatic in Chroma)
  • Tenant and database isolation support
  • Authentication support (token and basic)
  • Dimension validation
  • Comprehensive tests (requires Chroma instance)
  • Documentation with setup instructions
  • Docker Compose example
  • Migration guide from other providers

Describe alternatives you've considered

  1. Chroma Python client with bridge: Use Python client via child process - rejected because it adds complexity and latency
  2. REST API directly: Implement REST calls without SDK - rejected because the official SDK is well-maintained
  3. Embedded Chroma: Run Chroma in-process - not available in JavaScript

Additional context

  • Chroma is designed for ease of use and developer experience
  • HNSW index provides excellent performance
  • Native multi-tenancy support via tenant/database isolation
  • Docker deployment is straightforward
  • Good choice for self-hosted deployments without cloud dependencies
  • Interoperable with Python applications using Chroma

Dependencies:

  • chromadb - Official Chroma JavaScript client
  • Requires Chroma server running (Docker or local)

Related Issues:

  • Complements PgVectorProvider for different deployment scenarios
  • Provides alternative to cloud-based vector databases
  • Enables easy local development with persistence

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or requestmedium-priorityMedium priority issuestoolpack-knowledgeIssues related to toolpack-knowledge package

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions