Merged
17 commits
fec389a
feat: Add BM25 full-text search with pg_textsearch (BRIC-7)
Kaiohz Apr 7, 2026
d60e233
refactor: simplify BM25 implementation and remove docling patch
Kaiohz Apr 7, 2026
40c5574
feat: Add Alembic migrations and lifespan for BM25 support (BRIC-7)
Kaiohz Apr 7, 2026
bdbcda9
test: Add tests for Alembic lifespan, BM25 modes, and close() method …
Kaiohz Apr 7, 2026
804912d
style: Fix lint issues - combine with statements, trailing whitespace…
Kaiohz Apr 7, 2026
050e18b
fix: Address code review critical and high-severity issues (BRIC-7)
Kaiohz Apr 7, 2026
9c9f722
refactor: Simplify code after code review (BRIC-7)
Kaiohz Apr 7, 2026
c3e5cb3
refactor: Reduce cognitive complexity in RRF combiner (sonar S3776) (…
Kaiohz Apr 7, 2026
c8751d5
chore: Update uv.lock for alembic dependency (BRIC-7)
Kaiohz Apr 7, 2026
b917fd6
fix: Use asyncpg driver URL in Alembic env.py (BRIC-7)
Kaiohz Apr 8, 2026
9b7cbcd
fix: Create chunks table in migration instead of ALTER TABLE (BRIC-7)
Kaiohz Apr 8, 2026
e15d418
fix: Use separate alembic version table for raganything (BRIC-7)
Kaiohz Apr 8, 2026
bc71c0c
feat: Add pg_textsearch extension to migration and DB config (BRIC-7)
Kaiohz Apr 8, 2026
7eb474e
docs: Update README with BM25, hybrid+, Alembic migration docs (BRIC-7)
Kaiohz Apr 8, 2026
294630e
fix: hybrid+ mode now correctly merges BM25 and vector results by chu…
Kaiohz Apr 8, 2026
0791dcf
fix: use configurable French text_config for BM25 search
Kaiohz Apr 8, 2026
4c725a5
refactor: remove score/bm25_rank/vector_rank/combined_score from chun…
Kaiohz Apr 8, 2026
5 changes: 5 additions & 0 deletions .env.example
@@ -30,6 +30,11 @@ COSINE_THRESHOLD=0.2
MAX_CONCURRENT_FILES=1
MAX_WORKERS=1

# BM25 Configuration
BM25_ENABLED=true
BM25_TEXT_CONFIG=english
BM25_RRF_K=60

# Server Configuration
MCP_TRANSPORT=sse
ALLOWED_ORIGINS=["*"]
3 changes: 2 additions & 1 deletion .gitignore
@@ -14,4 +14,5 @@ trivy-report.json
trivy-report-fixed.json
coverage.xml
.ruff_cache
.pytest_cache
.pytest_cache
trivy-report-current.json
4 changes: 0 additions & 4 deletions Dockerfile
@@ -34,10 +34,6 @@ COPY --from=builder /app/.venv /app/.venv
COPY src/ /app/src/
COPY .env.example /app/.env

# Patch docling to fix TXT file format detection (PR #3161 incomplete)
COPY patch_docling_txt.py /tmp/patch_docling_txt.py
RUN /app/.venv/bin/python /tmp/patch_docling_txt.py && rm /tmp/patch_docling_txt.py

# Set Python path to include src directory
ENV PYTHONPATH=/app/src:$PYTHONPATH
ENV PATH="/app/.venv/bin:$PATH"
19 changes: 15 additions & 4 deletions Dockerfile.db
@@ -10,13 +10,24 @@ RUN apt-get update && apt-get install -y \
bison \
&& rm -rf /var/lib/apt/lists/*

# Install Apache AGE (v1.6.0 for PG17) and cleanup
# Install Apache AGE (v1.6.0 for PG17)
RUN cd /tmp && \
git clone --branch PG17/v1.6.0-rc0 https://github.com/apache/age.git && \
cd age && \
make PG_CONFIG=/usr/lib/postgresql/17/bin/pg_config install || \
(echo "Failed to build AGE" && exit 1) && \
rm -rf /tmp/age
(echo "Failed to build AGE" && exit 1)

# Install pg_textsearch extension for BM25 full-text search
RUN cd /tmp && \
git clone https://github.com/timescale/pg_textsearch.git && \
cd pg_textsearch && \
make PG_CONFIG=/usr/lib/postgresql/17/bin/pg_config || \
(echo "Failed to build pg_textsearch" && exit 1) && \
make PG_CONFIG=/usr/lib/postgresql/17/bin/pg_config install || \
(echo "Failed to install pg_textsearch" && exit 1)

# Cleanup build artifacts
RUN rm -rf /tmp/age /tmp/pg_textsearch

# Switch back to non-root user for security
USER postgres
USER postgres
214 changes: 173 additions & 41 deletions README.md
@@ -7,42 +7,47 @@ Multi-modal RAG service exposing a REST API and MCP server for document indexing
```
Clients
(REST / MCP / Claude)
|
+-----------------------+
| FastAPI App |
+-----------+-----------+
|
+---------------+---------------+
| |
Application Layer MCP Tools
+------------------------------+ (FastMCP)
| api/ | |
| indexing_routes.py | |
| query_routes.py | |
| health_routes.py | |
| use_cases/ | |
| IndexFileUseCase | |
| IndexFolderUseCase | |
| requests/ responses/ | |
+------------------------------+ |
| | |
v v v
Domain Layer (ports)
+--------------------------------------+
| RAGEnginePort StoragePort |
+--------------------------------------+
| |
v v
Infrastructure Layer (adapters)
+--------------------------------------+
| LightRAGAdapter MinioAdapter |
| (RAGAnything) (minio-py) |
+--------------------------------------+
| |
v v
PostgreSQL MinIO
(pgvector + (object
Apache AGE) storage)
+-----------------------+
| FastAPI App |
+-----------+-----------+
|
+---------------+---------------+
| |
Application Layer MCP Tools
+------------------------------+ (FastMCP)
| api/ | |
| indexing_routes.py | |
| query_routes.py | |
| health_routes.py | |
| use_cases/ | |
| IndexFileUseCase | |
| IndexFolderUseCase | |
| QueryUseCase | |
| requests/ responses/ | |
+------------------------------+ |
| | | |
v v v v
Domain Layer (ports)
+------------------------------------------+
| RAGEnginePort StoragePort BM25EnginePort|
+------------------------------------------+
| | |
v v v
Infrastructure Layer (adapters)
+------------------------------------------+
| LightRAGAdapter MinioAdapter |
| (RAGAnything) (minio-py) |
| |
| PostgresBM25Adapter RRFCombiner |
| (pg_textsearch) (hybrid+ fusion) |
+------------------------------------------+
| | |
v v v
PostgreSQL MinIO
(pgvector + (object
Apache AGE storage)
pg_textsearch)
```

## Prerequisites
@@ -220,8 +225,97 @@ Response (`200 OK`):
|-------|------|----------|---------|-------------|
| `working_dir` | string | yes | -- | RAG workspace directory for this project |
| `query` | string | yes | -- | The search query |
| `mode` | string | no | `"naive"` | Search mode (see Query Modes below) |
| `top_k` | integer | no | `10` | Number of chunks to retrieve |
| `mode` | string | no | `"naive"` | Search mode: `naive`, `local`, `global`, `hybrid`, `hybrid+`, `mix`, `bm25`, `bypass` |

#### BM25 query mode

Returns results ranked by PostgreSQL full-text search using `pg_textsearch`. Each chunk includes a `score` field with the BM25 relevance score.

```bash
curl -X POST http://localhost:8000/api/v1/query \
-H "Content-Type: application/json" \
-d '{
"working_dir": "project-alpha",
"query": "quarterly revenue growth",
"mode": "bm25",
"top_k": 10
}'
```

Response (`200 OK`):

```json
{
"status": "success",
"message": "",
"data": {
"entities": [],
"relationships": [],
"chunks": [
{
"chunk_id": "abc123",
"content": "Quarterly revenue grew 12% year-over-year...",
"file_path": "reports/financials-q4.pdf",
"score": 3.456,
"metadata": {}
}
],
"references": []
},
"metadata": {
"query_mode": "bm25",
"total_results": 10
}
}
```
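
The same request can also be built from Python. The following is a minimal sketch using only the standard library; `QueryMode` and `build_query_request` are hypothetical names that mirror the documented modes and endpoint, not the service's own classes.

```python
import json
import urllib.request
from enum import Enum


class QueryMode(str, Enum):
    """Hypothetical mirror of the search modes documented in this README."""

    NAIVE = "naive"
    LOCAL = "local"
    GLOBAL = "global"
    HYBRID = "hybrid"
    HYBRID_PLUS = "hybrid+"
    MIX = "mix"
    BM25 = "bm25"
    BYPASS = "bypass"


def build_query_request(base_url, working_dir, query, mode=QueryMode.NAIVE, top_k=10):
    """Build (but do not send) a POST request for the /api/v1/query endpoint."""
    mode = QueryMode(mode)  # raises ValueError for an undocumented mode
    payload = {
        "working_dir": working_dir,
        "query": query,
        "mode": mode.value,
        "top_k": top_k,
    }
    return urllib.request.Request(
        f"{base_url}/api/v1/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request(
    "http://localhost:8000", "project-alpha", "quarterly revenue growth", mode="bm25"
)
# With the service running, send it with: urllib.request.urlopen(request)
```

Validating the mode on the client side surfaces a typo such as `"bm-25"` before any network call is made.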

#### Hybrid+ query mode

Runs BM25 and vector search in parallel, then merges results using Reciprocal Rank Fusion (RRF). Each chunk includes `bm25_rank`, `vector_rank`, and `combined_score` fields.

```bash
curl -X POST http://localhost:8000/api/v1/query \
-H "Content-Type: application/json" \
-d '{
"working_dir": "project-alpha",
"query": "quarterly revenue growth",
"mode": "hybrid+",
"top_k": 10
}'
```

Response (`200 OK`):

```json
{
"status": "success",
"message": "",
"data": {
"entities": [],
"relationships": [],
"chunks": [
{
"chunk_id": "abc123",
"content": "Quarterly revenue grew 12% year-over-year...",
"file_path": "reports/financials-q4.pdf",
"score": 0.0323,
"bm25_rank": 1,
"vector_rank": 3,
"combined_score": 0.0323,
"metadata": {}
}
],
"references": []
},
"metadata": {
"query_mode": "hybrid+",
"total_results": 10,
"rrf_k": 60
}
}
```

The `combined_score` is the sum of `bm25_score` and `vector_score`, each computed as `1 / (k + rank)`, where `k` is `BM25_RRF_K` and `rank` is the chunk's position in the corresponding result set. Results are sorted by `combined_score` descending. A chunk that appears in both result sets scores higher than one that appears in only one whenever both of its ranks are at most `k`.
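
The fusion step described above can be sketched in a few lines of Python. This is an illustrative version of Reciprocal Rank Fusion, not the project's `RRFCombiner`; the function name `rrf_combine`, its signature, and the assumption that ranks are 1-based are all hypothetical.

```python
from collections import defaultdict


def rrf_combine(bm25_ids, vector_ids, k=60, top_k=10):
    """Merge two ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Assumes ranks start at 1 and that a chunk missing from a list
    contributes nothing for that list.
    """
    scores = defaultdict(float)
    ranks = defaultdict(dict)
    for field, ids in (("bm25_rank", bm25_ids), ("vector_rank", vector_ids)):
        for rank, chunk_id in enumerate(ids, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
            ranks[chunk_id][field] = rank

    merged = [
        {
            "chunk_id": chunk_id,
            "bm25_rank": ranks[chunk_id].get("bm25_rank"),
            "vector_rank": ranks[chunk_id].get("vector_rank"),
            "combined_score": score,
        }
        for chunk_id, score in scores.items()
    ]
    # Highest fused score first; keep only the requested number of chunks.
    merged.sort(key=lambda chunk: chunk["combined_score"], reverse=True)
    return merged[:top_k]


results = rrf_combine(["a", "b", "c"], ["c", "a", "d"])
```

In this example `a` and `c` appear in both lists and therefore rank ahead of `b` and `d`, which each appear in only one.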

## MCP Server

@@ -233,7 +327,7 @@ The MCP server is mounted at `/mcp` and exposes a single tool: `query_knowledge_
|-----------|------|---------|-------------|
| `working_dir` | string | required | RAG workspace directory for this project |
| `query` | string | required | The search query |
| `mode` | string | `"naive"` | Search mode: `naive`, `local`, `global`, `hybrid`, `mix`, `bypass` |
| `mode` | string | `"naive"` | Search mode: `naive`, `local`, `global`, `hybrid`, `hybrid+`, `mix`, `bm25`, `bypass` |
| `top_k` | integer | `10` | Number of chunks to retrieve |

### Transport modes
@@ -321,6 +415,16 @@ All configuration is via environment variables, loaded through Pydantic Settings
| `ENABLE_TABLE_PROCESSING` | `true` | Process tables during indexing |
| `ENABLE_EQUATION_PROCESSING` | `true` | Process equations during indexing |

### BM25 (`BM25Config`)

| Variable | Default | Description |
|----------|---------|-------------|
| `BM25_ENABLED` | `true` | Enable BM25 full-text search |
| `BM25_TEXT_CONFIG` | `english` | PostgreSQL text search configuration |
| `BM25_RRF_K` | `60` | RRF constant K for hybrid search (must be >= 1) |

When `BM25_ENABLED` is `false` or the pg_textsearch extension is not available, `hybrid+` mode falls back to `naive` (vector-only) and `bm25` mode returns an error.
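
The fallback rule above can be expressed as a small decision function. This is a sketch of the documented behaviour only; `resolve_mode` and `BM25UnavailableError` are hypothetical names, and the real service may structure this check differently.

```python
class BM25UnavailableError(RuntimeError):
    """Hypothetical error raised when `bm25` mode is requested but BM25 cannot run."""


def resolve_mode(requested_mode, bm25_enabled, extension_available):
    """Return the mode to execute, applying the documented BM25 fallback."""
    bm25_ready = bm25_enabled and extension_available
    if bm25_ready:
        return requested_mode
    if requested_mode == "hybrid+":
        # Without BM25 there is nothing to fuse, so run vector-only search.
        return "naive"
    if requested_mode == "bm25":
        raise BM25UnavailableError(
            "bm25 mode requires BM25_ENABLED=true and the pg_textsearch extension"
        )
    # All other modes do not depend on BM25.
    return requested_mode
```

For example, `resolve_mode("hybrid+", bm25_enabled=False, extension_available=True)` returns `"naive"`, while the same call with `"bm25"` raises the error.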

### MinIO (`MinioConfig`)

| Variable | Default | Description |
@@ -339,7 +443,9 @@ All configuration is via environment variables, loaded through Pydantic Settings
| `local` | Entity-focused search using the knowledge graph |
| `global` | Relationship-focused search across the knowledge graph |
| `hybrid` | Combines local + global strategies |
| `hybrid+` | Runs BM25 and vector search in parallel and merges the results with Reciprocal Rank Fusion (RRF), combining keyword and semantic matching |
| `mix` | Knowledge graph + vector chunks combined |
| `bm25` | BM25 full-text search only, via the PostgreSQL `pg_textsearch` extension |
| `bypass` | Direct LLM query without retrieval |

## Development
@@ -361,6 +467,22 @@ docker compose logs -f raganything-api # Follow API logs
docker compose down -v # Stop and remove volumes
```

## Database Migrations

Alembic migrations run automatically at startup via the `db_lifespan` context manager in `main.py`. The migration state is tracked in the `raganything_alembic_version` table, which is separate from the `composable-agents` Alembic table to avoid conflicts.

The initial migration (`001_add_bm25_support`) creates the `chunks` table with a `tsvector` column for full-text search, GIN and BM25 indexes, and an auto-update trigger.

### Production requirements

The PostgreSQL server must have the `pg_textsearch` extension installed and loaded. In production, this requires:

1. **Dockerfile.db** builds a custom PostgreSQL image that compiles `pg_textsearch` from source (along with `pgvector` and `Apache AGE`).

2. **docker-compose.yml** must configure `shared_preload_libraries=pg_textsearch` for the `bricks-db` service. The local dev `docker-compose.yml` in this repository includes this by default.

3. The Alembic migration `001_add_bm25_support` will fail if `pg_textsearch` is not available. Ensure the database image is built from `Dockerfile.db` and the shared library is preloaded.
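
Both requirements can be checked before the migration runs. The sketch below assumes a synchronous DB-API style cursor (`execute` / `fetchone`), whereas the service itself uses asyncpg, and the helper name `check_pg_textsearch` is hypothetical. It relies only on the standard `pg_available_extensions` view and the `shared_preload_libraries` setting.

```python
def check_pg_textsearch(cursor):
    """Return a list of problems that would make the BM25 migration fail.

    `cursor` is assumed to be a DB-API style cursor connected to the target
    PostgreSQL server. An empty list means both checks passed.
    """
    problems = []

    # Is the extension available to CREATE EXTENSION on this server?
    cursor.execute(
        "SELECT 1 FROM pg_available_extensions WHERE name = 'pg_textsearch'"
    )
    if cursor.fetchone() is None:
        problems.append("pg_textsearch is not installed on the server")

    # Is the shared library preloaded at server start?
    cursor.execute("SHOW shared_preload_libraries")
    row = cursor.fetchone()
    preloaded = [name.strip() for name in (row[0] if row else "").split(",")]
    if "pg_textsearch" not in preloaded:
        problems.append("pg_textsearch is missing from shared_preload_libraries")

    return problems
```

Running such a check at startup would produce a clearer message than a failed migration, at the cost of two extra queries.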

## Project Structure

```
@@ -374,25 +496,35 @@ src/
ports/
rag_engine.py -- RAGEnginePort (abstract)
storage_port.py -- StoragePort (abstract)
bm25_engine.py -- BM25EnginePort (abstract)
application/
api/
health_routes.py -- GET /health
indexing_routes.py -- POST /file/index, /folder/index
query_routes.py -- POST /query
mcp_tools.py -- MCP tool: query_knowledge_base
query_routes.py -- POST /query
mcp_tools.py -- MCP tool: query_knowledge_base
requests/
indexing_request.py -- IndexFileRequest, IndexFolderRequest
query_request.py -- QueryRequest
query_request.py -- QueryRequest, QueryMode
responses/
query_response.py -- QueryResponse, QueryDataResponse
use_cases/
index_file_use_case.py -- Downloads from MinIO, indexes single file
index_folder_use_case.py -- Downloads from MinIO, indexes folder
query_use_case.py -- Query with bm25/hybrid+ support
infrastructure/
rag/
lightrag_adapter.py -- LightRAGAdapter (RAGAnything/LightRAG)
storage/
minio_adapter.py -- MinioAdapter (minio-py client)
bm25/
pg_textsearch_adapter.py -- PostgresBM25Adapter (pg_textsearch)
hybrid/
rrf_combiner.py -- RRFCombiner (Reciprocal Rank Fusion)
alembic/
env.py -- Alembic migration environment (async)
versions/
001_add_bm25_support.py -- BM25 table, indexes, triggers
```

## License