Merged
9 changes: 5 additions & 4 deletions .env.template
Original file line number Diff line number Diff line change
@@ -29,12 +29,13 @@ DATABASE_URL=postgresql://eu_fact_force:eu_fact_force@localhost:5432/eu_fact_for
# -----------------------------------------------------------------------------

# If AWS_STORAGE_BUCKET_NAME is set, Django uses S3 for default file storage.
# For local dev with LocalStack, set USE_LOCAL_STACK=1 and the endpoint below.
# For local dev with RustFS (docker-compose): set AWS_S3_ENDPOINT_URL and use the same credentials.
# RustFS Web Console: http://localhost:9001 | S3 API: http://localhost:9000
# USE_LOCAL_STACK=1
# AWS_ACCESS_KEY_ID=test
# AWS_SECRET_ACCESS_KEY=test
# AWS_ACCESS_KEY_ID=minioadmin
# AWS_SECRET_ACCESS_KEY=minioadmin
# AWS_STORAGE_BUCKET_NAME=eu-fact-force-files
# AWS_S3_REGION_NAME=eu-west-1
# AWS_S3_ENDPOINT_URL=http://localhost:4566
# AWS_S3_ENDPOINT_URL=http://localhost:9000

# In production: set real AWS credentials and do not set AWS_S3_ENDPOINT_URL.
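As a sketch of the convention this template encodes (a hypothetical helper, not project code): S3-backed storage applies only when `AWS_STORAGE_BUCKET_NAME` is set, and a local endpoint such as RustFS is selected only when `AWS_S3_ENDPOINT_URL` is also set:

```python
def resolve_storage(env: dict) -> dict:
    """Mirror the template's convention: no bucket name means local file storage;
    a bucket plus an endpoint URL means a local S3 stand-in (RustFS); a bucket
    without an endpoint means real AWS."""
    bucket = env.get("AWS_STORAGE_BUCKET_NAME")
    if not bucket:
        return {"backend": "filesystem"}
    return {
        "backend": "s3",
        "bucket": bucket,
        "endpoint": env.get("AWS_S3_ENDPOINT_URL"),  # None => real AWS
        "region": env.get("AWS_S3_REGION_NAME", "eu-west-1"),
    }

# Local dev (RustFS): endpoint set, placeholder credentials in .env
local = resolve_storage({
    "AWS_STORAGE_BUCKET_NAME": "eu-fact-force-files",
    "AWS_S3_ENDPOINT_URL": "http://localhost:9000",
})
# Production: no endpoint override, so boto3 talks to real AWS
prod = resolve_storage({"AWS_STORAGE_BUCKET_NAME": "eu-fact-force-files"})
```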
6 changes: 3 additions & 3 deletions .gitignore
@@ -174,7 +174,7 @@ eu_fact_force/ingestion/parsing/output/analysis/
s3/
eu_fact_force/exploration/
annotated_pdf/

# docker volumes
eu_fact_force/exploration/docling/results/annotated_pdf/
postgres_data/
rustfs_data
rustfs_data/
data/
53 changes: 44 additions & 9 deletions README.md
@@ -112,13 +112,13 @@ uv run pytest

### Deploying the application

The application consists of a Django server, a PostgreSQL database (with pgvector), and LocalStack for S3 storage.
The application consists of a Django server, a PostgreSQL database (with pgvector), and **RustFS** for S3-compatible (AWS-compatible) storage, with a web interface for uploading files manually.
To deploy and use the application locally:

**1. Prerequisites**

- [Python 3.12+](https://www.python.org/) and [uv](https://docs.astral.sh/uv/)
- [Docker](https://www.docker.com/) and Docker Compose (for Postgres and LocalStack)
- [Docker](https://www.docker.com/) and Docker Compose (for Postgres and RustFS)

**2. Environment variables**

@@ -130,15 +130,15 @@ cp .env.template .env

For local use with the Docker services, the defaults in `.env.template` (notably `DATABASE_URL=postgresql://eu_fact_force:eu_fact_force@localhost:5432/eu_fact_force`) are suitable.

**3. Start the services (Postgres and LocalStack)**
**3. Start the services (Postgres and RustFS)**

At the project root:

```bash
docker compose up -d
```

This starts PostgreSQL (port 5432) and LocalStack S3 (port 4566). The configured bucket is created automatically when LocalStack starts.
This starts PostgreSQL (port 5432) and RustFS (S3 API on port 9000). The configured bucket is created automatically on first startup. **RustFS web console**: [http://localhost:9001](http://localhost:9001), using the S3 credentials (Access Key / Secret Key) defined in `.env` (default `minioadmin`). From there you can create buckets and folders and upload files manually.

**4. Install dependencies and apply migrations**

@@ -165,14 +165,49 @@ The application is then available at [http://127.0.0.1:8000/](http://127.0.0.1

**Using S3 storage locally**

For Django to use LocalStack for file storage, uncomment and fill in the S3 variables in `.env` (see `.env.template`), for example:
With `docker compose`, the app is configured to use RustFS. To run Django on the host (without the app container) and point it at RustFS, uncomment the S3 variables in `.env` (see `.env.template`) and set, for example:

```bash
USE_LOCAL_STACK=1
AWS_ACCESS_KEY_ID=test
AWS_SECRET_ACCESS_KEY=test
AWS_S3_ENDPOINT_URL=http://localhost:9000
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
AWS_STORAGE_BUCKET_NAME=eu-fact-force-files
AWS_S3_REGION_NAME=eu-west-1
```

Without these variables, the application uses the default local file storage.

## Performance test

The project provides a set of documents on the links between vaccines and autism.
These documents allow end-to-end testing of the pipeline:
- PDF parsing,
- chunk extraction,
- chunk vectorization,
- the search mechanism.

Since not all documents are easily accessible via APIs, the documents and their metadata are bundled in an archive (then in an S3 bucket as a second step).
The archive contains:
- the list of the most relevant paragraphs to extract, in `vaccins_annotated.json`,
- the PDF files,
- one JSON file per PDF containing its metadata.

Each JSON file has the following structure:

```json
{
  "tags_pubmed": [
    "tag1",
    "tag2",
    "tag3"
  ],
  "title": "Title",
  "category": "category",
  "type": "type",
  "journal": "journal",
  "authors": ["first author", "second author"],
  "year": 2022,
  "url": "http",
  "doi": "test_doi"
}
```
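A minimal loader for this metadata format might look like the following (the required-field set is an assumption drawn from the example above, not a documented schema):

```python
import json

# Assumed-required fields, inferred from the example structure above
REQUIRED = {"tags_pubmed", "title", "authors", "year", "doi"}


def load_metadata(raw: str) -> dict:
    """Parse one per-PDF metadata file and check the fields shown above."""
    meta = json.loads(raw)
    missing = REQUIRED - meta.keys()
    if missing:
        raise ValueError(f"missing metadata fields: {sorted(missing)}")
    if not isinstance(meta["year"], int):
        raise ValueError("year must be an integer")
    return meta


sample = """{
  "tags_pubmed": ["tag1", "tag2"],
  "title": "Title",
  "category": "category",
  "type": "type",
  "journal": "journal",
  "authors": ["first author", "second author"],
  "year": 2022,
  "url": "http",
  "doi": "test_doi"
}"""
meta = load_metadata(sample)
```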
52 changes: 33 additions & 19 deletions docker-compose.yml
@@ -12,15 +12,15 @@ services:
DEBUG: ${DEBUG:-0}
SECRET_KEY: ${SECRET_KEY:-dev-secret-key-change-in-production}
DATABASE_URL: ${DATABASE_URL:-postgresql://eu_fact_force:eu_fact_force@postgres:5432/eu_fact_force}
AWS_ENDPOINT_URL: ${AWS_ENDPOINT_URL:-http://localstack:4566}
AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-test}
AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-test}
AWS_S3_ENDPOINT_URL: ${AWS_S3_ENDPOINT_URL:-http://rustfs:9000}
AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-minioadmin}
AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-minioadmin}
AWS_S3_REGION_NAME: ${AWS_S3_REGION_NAME:-eu-west-1}
AWS_STORAGE_BUCKET_NAME: ${AWS_STORAGE_BUCKET_NAME:-eu-fact-force-files}
depends_on:
postgres:
condition: service_healthy
localstack:
rustfs:
condition: service_started
labels:
- traefik.enable=true
@@ -30,6 +30,7 @@ services:
- traefik.docker.network=d4g-internal
- traefik.http.services.eu-fact-force.loadbalancer.server.port=8000

# PostgreSQL 18+: mount at /var/lib/postgresql (the image manages the per-version subdirectory).
postgres:
image: pgvector/pgvector:pg18-trixie
environment:
@@ -39,29 +40,42 @@
ports:
- 5432
volumes:
- postgres_data:/var/lib/postgresql
- ./postgres_data:/var/lib/postgresql
- ./docker/postgres/init:/docker-entrypoint-initdb.d:ro
healthcheck:
test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-eu_fact_force} -d ${POSTGRES_DB:-eu_fact_force}"]
interval: 5s
timeout: 5s
retries: 5

localstack:
image: localstack/localstack:latest
# RustFS: S3-compatible object storage with web console (Apache 2.0).
# Console UI: http://localhost:9001 | S3 API: http://localhost:9000
rustfs:
image: rustfs/rustfs:latest
restart: unless-stopped
ports:
- 4566
- "9000:9000"
- "9001:9001"
environment:
SERVICES: s3
PERSISTENCE: 1
AWS_DEFAULT_REGION: ${AWS_S3_REGION_NAME:-eu-west-1}
AWS_STORAGE_BUCKET_NAME: ${AWS_STORAGE_BUCKET_NAME:-eu-fact-force-files}
AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-test}
AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-test}
DEBUG: ${DEBUG:-0}
RUSTFS_ACCESS_KEY: ${AWS_ACCESS_KEY_ID:-minioadmin}
RUSTFS_SECRET_KEY: ${AWS_SECRET_ACCESS_KEY:-minioadmin}
RUSTFS_CONSOLE_ENABLE: "true"
command: ["--console-enable", "/data"]
volumes:
- ./s3:/var/lib/localstack
- ./docker/localstack/init-ready.d:/etc/localstack/init/ready.d:ro
- ./rustfs_data:/data

volumes:
postgres_data:
# Create default S3 bucket on first run (depends on RustFS).
rustfs-init:
image: amazon/aws-cli:latest
depends_on:
- rustfs
entrypoint: ["/bin/sh", "-c"]
command:
- |
sleep 5
aws s3 mb s3://$${AWS_STORAGE_BUCKET_NAME} --endpoint-url http://rustfs:9000 2>/dev/null || true
echo "Bucket ready."
environment:
AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID:-minioadmin}
AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY:-minioadmin}
AWS_STORAGE_BUCKET_NAME: ${AWS_STORAGE_BUCKET_NAME:-eu-fact-force-files}
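The init container waits a fixed 5 seconds before calling `aws s3 mb`; a polling loop is a more robust variant of the same idea. A sketch with a fake probe standing in for a real connectivity check (the probe function here is illustrative, not part of the compose setup):

```python
import time


def wait_until_ready(probe, attempts: int = 10, delay: float = 0.0) -> bool:
    """Poll `probe()` until it returns True, instead of a fixed `sleep 5`.
    Returns False if the service never comes up within `attempts` polls."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False


calls = {"n": 0}

def probe() -> bool:
    # pretend RustFS answers on the third poll
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until_ready(probe)
```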
20 changes: 12 additions & 8 deletions docker/localstack/init-ready.d/01-create-bucket.py
@@ -1,10 +1,12 @@
#!/usr/bin/env python3
"""Create the default S3 bucket when LocalStack is ready."""
"""Create the default S3 bucket and the performances bucket when LocalStack is ready."""

import os

import boto3

PERFORMANCES_BUCKET_NAME = "performances"

bucket = os.environ.get("AWS_STORAGE_BUCKET_NAME", "eu-fact-force")
region = os.environ.get("AWS_S3_REGION_NAME", "eu-west-1")

@@ -15,10 +17,12 @@
aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY", "test"),
region_name=region,
)
try:
    client.create_bucket(Bucket=bucket)
    print(f"Created bucket: {bucket}")
except client.exceptions.BucketAlreadyOwnedByYou:
    print(f"Bucket already exists: {bucket}")
except Exception as e:
    print(f"Bucket creation skipped: {e}")

for name in (bucket, PERFORMANCES_BUCKET_NAME):
    try:
        client.create_bucket(Bucket=name)
        print(f"Created bucket: {name}")
    except client.exceptions.BucketAlreadyOwnedByYou:
        print(f"Bucket already exists: {name}")
    except Exception as e:
        print(f"Bucket creation skipped for {name}: {e}")
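The create-then-tolerate-existing pattern used here is idempotent by construction. A self-contained sketch with a stubbed client, so it runs without LocalStack (the fake client mimics only the bits of the boto3 interface the pattern needs):

```python
class BucketAlreadyOwnedByYou(Exception):
    pass


class FakeS3Client:
    """Stand-in for a boto3 S3 client, just enough to exercise the pattern."""

    class exceptions:
        BucketAlreadyOwnedByYou = BucketAlreadyOwnedByYou

    def __init__(self):
        self.buckets = set()

    def create_bucket(self, Bucket):
        if Bucket in self.buckets:
            raise BucketAlreadyOwnedByYou(Bucket)
        self.buckets.add(Bucket)


def ensure_buckets(client, names):
    """Create each bucket, treating 'already owned' as success
    (the same pattern as the init script above)."""
    created, existing = [], []
    for name in names:
        try:
            client.create_bucket(Bucket=name)
            created.append(name)
        except client.exceptions.BucketAlreadyOwnedByYou:
            existing.append(name)
    return created, existing


client = FakeS3Client()
first = ensure_buckets(client, ["eu-fact-force-files", "performances"])
second = ensure_buckets(client, ["eu-fact-force-files", "performances"])
```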
41 changes: 27 additions & 14 deletions eu_fact_force/app/settings.py
Collaborator Author
Here I let the AI handle the config for a first pass. It will need a second pass once we have decided how to deploy S3 properly.

@@ -11,6 +11,7 @@
"""

import os
import sys
from pathlib import Path
from urllib.parse import urlparse

@@ -25,6 +26,13 @@
if _load_env.exists():
load_dotenv(_load_env)

# Under pytest: force local S3 (RustFS) to avoid InvalidAccessKeyId from .env credentials
_run_by_pytest = "pytest" in sys.argv[0] or "pytest" in str(sys.argv)
if _run_by_pytest:
    os.environ["AWS_S3_ENDPOINT_URL"] = "http://localhost:9000"
    os.environ["AWS_ACCESS_KEY_ID"] = "minioadmin"
    os.environ["AWS_SECRET_ACCESS_KEY"] = "minioadmin"
    os.environ["AWS_STORAGE_BUCKET_NAME"] = "eu-fact-force-files"
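The detection heuristic above can be expressed as one predicate over `sys.argv` (an equivalent reformulation, shown here in isolation; `run_by_pytest` is a hypothetical name, not a settings.py symbol):

```python
def run_by_pytest(argv: list[str]) -> bool:
    """True when the runner binary or any argument mentions pytest,
    matching the argv[0] / str(argv) check in settings.py."""
    return any("pytest" in a for a in argv)
```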

# Quick-start development settings - unsuitable for production
# See https://docs.djangoproject.com/en/6.0/howto/deployment/checklist/
@@ -106,21 +114,15 @@ def _get_databases():
"PASSWORD": parsed.password,
"HOST": parsed.hostname,
"PORT": parsed.port or "5432",
"TEST": {
"NAME": f"test_{parsed.path.lstrip('/')}",
},
}
}


DATABASES = _get_databases()

# Use a dedicated test database name so tests do not overwrite dev/prod data
if (
    "default" in DATABASES
    and DATABASES["default"]["ENGINE"] == "django.db.backends.postgresql"
):
    _db_name = DATABASES["default"].get("NAME", "eu_fact_force")
    DATABASES["default"].setdefault("TEST", {})["NAME"] = f"test_{_db_name}"


# Password validation
# https://docs.djangoproject.com/en/6.0/ref/settings/#auth-password-validators

@@ -159,17 +161,28 @@ def _get_databases():
# Must be an absolute filesystem path (string) for collectstatic / StaticFilesStorage
STATIC_ROOT = str((BASE_DIR.parent / "staticfiles").resolve())

# S3 / LocalStack storage (switch via AWS_S3_ENDPOINT_URL or USE_LOCAL_STACK)
# S3 / RustFS / MinIO storage (selected via AWS_S3_ENDPOINT_URL)
# django-storages reads AWS_S3_ENDPOINT_URL from this module
AWS_S3_ENDPOINT_URL = os.environ.get("AWS_S3_ENDPOINT_URL") or (
"http://localhost:4566" if os.environ.get("USE_LOCAL_STACK") else None
# Default: local RustFS (port 9000) when AWS_S3_ENDPOINT_URL is unset
AWS_S3_ENDPOINT_URL = os.environ.get("AWS_S3_ENDPOINT_URL") or "http://localhost:9000"
if AWS_S3_ENDPOINT_URL and (
    "localhost" in AWS_S3_ENDPOINT_URL or "127.0.0.1" in AWS_S3_ENDPOINT_URL
):
    os.environ.setdefault("AWS_ACCESS_KEY_ID", "minioadmin")
    os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "minioadmin")
# Must match eu_fact_force.ingestion.s3.save_file_to_s3 / get_default_bucket(): uploads go through
# boto3 with this default bucket even when AWS_STORAGE_BUCKET_NAME is unset, so default_storage
# must use the same bucket; otherwise reads fall back to FileSystemStorage and raise FileNotFoundError.
_DEFAULT_FILES_BUCKET = "eu-fact-force-files"
_AWS_STORAGE_BUCKET_NAME = (
    os.environ.get("AWS_STORAGE_BUCKET_NAME") or _DEFAULT_FILES_BUCKET
)
if os.environ.get("AWS_STORAGE_BUCKET_NAME"):
if _AWS_STORAGE_BUCKET_NAME:
STORAGES = {
"default": {
"BACKEND": "storages.backends.s3boto3.S3Boto3Storage",
"OPTIONS": {
"bucket_name": os.environ.get("AWS_STORAGE_BUCKET_NAME"),
"bucket_name": _AWS_STORAGE_BUCKET_NAME,
"region_name": os.environ.get("AWS_S3_REGION_NAME", "eu-west-1"),
"custom_domain": False,
},
40 changes: 40 additions & 0 deletions eu_fact_force/ingestion/management/commands/ingest_vaccins.py
@@ -0,0 +1,40 @@
import json
import logging
from pathlib import Path

from django.core.management.base import BaseCommand

from eu_fact_force.ingestion.embedding import add_embeddings
from eu_fact_force.ingestion.parsing import parse_file
from eu_fact_force.ingestion.services import save_chunks, save_to_s3_and_postgres

logger = logging.getLogger(__name__)

PERFORMANCES_BUCKET_NAME = "performances"
VACCINS_ANNOTATED_KEY = "vaccins_annotated.json"
PDF_PREFIX = "pdf"


class Command(BaseCommand):
    help = (
        "Read vaccins_annotated.json from S3 bucket performances, "
        "download PDF + JSON per entry, run full ingestion pipeline."
    )

    def handle(self, *args, **options):
        performance_dir = Path(__file__).resolve().parents[4] / "data" / "vaccine_perfs"
        pdfs = list(performance_dir.glob("*.pdf"))
        for pdf_path in pdfs:
            logger.info(f"Processing {pdf_path.stem}")
            key = pdf_path.stem
            # read the sidecar metadata without leaking an open file handle
            metadata = json.loads(pdf_path.with_suffix(".json").read_text())
            source_file = save_to_s3_and_postgres(
                pdf_path,
                tags_pubmed=metadata.get("tags_pubmed", []),
                doi=key,
            )
            document_parts = parse_file(source_file)
            chunks = save_chunks(source_file, document_parts)
            add_embeddings(chunks)

        self.stdout.write(self.style.SUCCESS(f"Done. Processed: {len(pdfs)}"))