A CamemBERT-based NLP System for French Train Itinerary Generation
An end-to-end AI pipeline that transforms free-form French natural language queries into optimal train itineraries on the SNCF rail network.
Features · Architecture · Getting Started · Performance · API Reference
Converting a French sentence like "je voudrais aller d'Épinal à Paris" into a concrete train itinerary is harder than it appears:
- Variable word order: "de X à Y" vs "à Y depuis X"
- French elisions: d'Épinal → de Épinal, l'arrivée → la arrivée
- Ambiguous names: Albert, Paris, Florence — first names or cities?
- Compound city names: Port-Boulet = "port" + "boulet" (common words)
- User errors: missing accents, typos, inconsistent casing
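Regex-based approaches fail to generalize across these cases. This project handles them with two fine-tuned CamemBERT models, one for intent classification and one for named entity recognition.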
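To make the elision problem concrete, here is a minimal sketch of an elision-expansion step. It is an illustration only, not the project's actual preprocessing code: the `ELISIONS` mapping and the `expand_elisions` name are assumptions, and `l'` is always expanded to `la` (matching the `l'arrivée → la arrivée` example above) because gender cannot be resolved without a lexicon.

```python
import re

# Hypothetical mapping of elided forms to full forms (an assumption, not the
# project's table). "l'" is ambiguous (le/la); this sketch always picks "la".
ELISIONS = {"d": "de", "l": "la", "j": "je", "n": "ne", "qu": "que"}

# Match an elided word at a word boundary, followed by a straight or curly
# apostrophe and then a word character. "aujourd'hui" is left untouched
# because its "d" does not start a word.
_ELISION_RE = re.compile(r"\b(qu|[dljn])['’](?=\w)", re.IGNORECASE)


def expand_elisions(text: str) -> str:
    """Replace elided forms such as d' or l' with their expanded form plus a space."""
    return _ELISION_RE.sub(lambda m: ELISIONS[m.group(1).lower()] + " ", text)


if __name__ == "__main__":
    print(expand_elisions("je voudrais aller d'Épinal à Paris"))
```

Expanding elisions before tokenization keeps the city name as a separate word, which is what the downstream tagger needs to label.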
- Intent Classification — 5 classes (VALID, INCOMPLETE, NOT_TRIP, GARBAGE, OTHER_LANG) with 99.8% accuracy
- Named Entity Recognition — BIO-tagged departure/arrival extraction with 96.81% token F1
- Optimal Pathfinding — A* algorithm with Haversine heuristic over 3,497 SNCF stations
- Voice Input — Speech-to-text via Voxtral Mini (Mistral API)
- Interactive Map — Real-time route visualization with animated train on Leaflet
- Full-stack App — FastAPI backend + React/TypeScript frontend
- CLI Tools — Standalone NLP, pathfinding, and full pipeline CLIs
- Docker Ready — One-command deployment with Docker Compose
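As an illustration of the BIO-tagged extraction listed above, the sketch below groups tagged tokens into a departure and an arrival span. It assumes word-level tokens and the tag names `B-DEP`, `I-DEP`, `B-ARR`, `I-ARR`, `O`; the `decode_bio` helper is hypothetical, and a real CamemBERT model emits subword tokens that would first need to be merged back into words.

```python
def decode_bio(tokens: list[str], tags: list[str]) -> dict[str, str]:
    """Group BIO-tagged tokens into one text span per entity type.

    If an entity type appears more than once, the first span is kept.
    """
    spans: list[tuple[str, list[str]]] = []
    open_type = None  # entity type of the span currently being extended

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            open_type = tag[2:]
            spans.append((open_type, [token]))
        elif tag.startswith("I-") and tag[2:] == open_type:
            spans[-1][1].append(token)
        else:
            open_type = None  # "O" or a stray I- tag closes the current span

    result: dict[str, str] = {}
    for entity_type, words in spans:
        result.setdefault(entity_type, " ".join(words))
    return result


if __name__ == "__main__":
    tokens = ["je", "veux", "aller", "de", "Lyon", "à", "Aix", "en", "Provence"]
    tags = ["O", "O", "O", "O", "B-DEP", "O", "B-ARR", "I-ARR", "I-ARR"]
    print(decode_bio(tokens, tags))
```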
┌──────────────────────────────────────────────────────────────────┐
│ Frontend — React + TypeScript + Vite │
│ SearchView · ResultView · TrainMap (Leaflet) │
└──────────────────────────────┬───────────────────────────────────┘
│ HTTP (REST)
┌──────────────────────────────▼───────────────────────────────────┐
│ API — FastAPI │
│ POST /api/search · POST /api/search/voice · GET /health │
└──────┬──────────────────┬──────────────────┬─────────────────────┘
│ │ │
┌──────▼──────┐ ┌───────▼───────┐ ┌──────▼──────┐
│ CamemBERT │ │ CamemBERT │ │ GTFS Graph │
│ Intent │ │ NER │ │ + A* │
│ (99.8%) │ │ (96.81% F1) │ │ (3497 stn) │
└─────────────┘ └───────────────┘ └─────────────┘
User Input (text/voice) → Preprocessing (elision expansion) → Intent Classification (5-class softmax) → NER Extraction (B-DEP/B-ARR/O) → A* Pathfinding (Haversine heuristic) → Optimal Itinerary (stations + trains)
If the intent is not VALID, the pipeline returns an error message instead of continuing.
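The control flow of this pipeline can be sketched as follows. This is a simplified illustration, not the code in `src/nlp/pipeline.py`: `PipelineResult`, `run_pipeline`, and the four stage callables are hypothetical names, and each stage is passed in as a plain function so the gating logic can be read on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PipelineResult:
    intent: str
    departure: Optional[str] = None
    arrival: Optional[str] = None
    route: Optional[list] = None
    error: Optional[str] = None


def run_pipeline(
    text: str,
    preprocess: Callable[[str], str],
    classify_intent: Callable[[str], str],
    extract_entities: Callable[[str], tuple[str, str]],
    find_route: Callable[[str, str], list],
) -> PipelineResult:
    """Run the stages in order, stopping early when the intent is not VALID."""
    cleaned = preprocess(text)
    intent = classify_intent(cleaned)
    if intent != "VALID":
        # NER and pathfinding are skipped for rejected queries.
        return PipelineResult(intent=intent, error=f"Query rejected: intent is {intent}")
    departure, arrival = extract_entities(cleaned)
    return PipelineResult(intent, departure, arrival, find_route(departure, arrival))


if __name__ == "__main__":
    # Stub stages standing in for the real models and graph search.
    result = run_pipeline(
        "  je veux aller de Lyon à Marseille ",
        preprocess=str.strip,
        classify_intent=lambda t: "VALID",
        extract_entities=lambda t: ("Lyon", "Marseille"),
        find_route=lambda dep, arr: [dep, arr],
    )
    print(result)
```

Running intent classification first means the more expensive NER and graph search only run on queries that actually describe a trip.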
| Metric | Baseline (TF-IDF + Regex) | CamemBERT | Improvement |
|---|---|---|---|
| Intent Accuracy | 60.3% | 99.8% | +39.5 pp |
| NER Exact Match | 33.3% | 80.7% | +47.3 pp |
| Departure Similarity | 70.4% | 92.9% | +22.6 pp |
| Arrival Similarity | 43.6% | 92.1% | +48.6 pp |
| Latency / sample | 0.045 ms | 18.3 ms | ×407 slower |
| Metric | Value |
|---|---|
| Stations (nodes) | 3,497 |
| Connections (edges) | 10,770 |
| A* speedup over Dijkstra | 33% |
| Path optimality | 100% identical |
| Metric | Value |
|---|---|
| Full pipeline latency | ~100 ms |
| Dataset size | 12,000 intent + 6,000 NER |
| Data augmentation strategies | 7 |
- Python 3.12+
- Node.js 20+ (via nvm)
- SNCF GTFS data (included in `data/sncf/gtfs/`)
# Clone the repository
git clone https://github.com/iamhmh/train_order_resolver.git
cd train_order_resolver
# Backend setup
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Frontend setup
cd frontend
nvm use 20
npm install
cd ..

Backend:
source .venv/bin/activate
python -m uvicorn api.main:app --port 8000

Frontend:
cd frontend
npm run dev

Open http://localhost:5173 in your browser.

Docker:
docker-compose up --build

CLI tools:
# NLP only (intent + NER)
python -m cli.cli_nlp
# Pathfinding only
python -m cli.cli_pathfinding
# Full pipeline (NLP → Pathfinding)
python -m cli.cli_full

| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/api/search` | Process a text query and return the optimal itinerary |
| `POST` | `/api/search/voice` | Process an audio file (STT → NLP → Pathfinding) |
| `POST` | `/api/transcribe` | Transcribe audio to text via Voxtral |
| `GET` | `/api/stations` | List all available SNCF stations |
| `GET` | `/health` | Health check |

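For callers who prefer Python over curl, the sketch below builds the same `POST /api/search` request with the standard library. It assumes the backend runs on `http://localhost:8000` and accepts a JSON body with a single `text` field, as in the curl example in this section; `build_search_request` and `search` are hypothetical helper names, not part of the project.

```python
import json
import urllib.request


def build_search_request(text: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for /api/search."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(text: str, base_url: str = "http://localhost:8000", timeout: float = 10.0) -> dict:
    """Send the query and decode the JSON response; requires a running backend."""
    request = build_search_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    request = build_search_request("je veux aller de Lyon à Marseille")
    print(request.get_method(), request.full_url)
```

Splitting request construction from sending keeps the first half usable in tests without a live server.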
curl -X POST http://localhost:8000/api/search \
-H "Content-Type: application/json" \
-d '{"text": "je veux aller de Lyon à Marseille"}'

Response:
{
"intent": "VALID",
"departure": "Lyon Part Dieu",
"arrival": "Marseille Saint-Charles",
"route": [
{
"from": "Lyon Part Dieu",
"to": "Marseille Saint-Charles",
"train_type": "TGV INOUI",
"duration_min": 100,
"lat_from": 45.7606,
"lon_from": 4.8593,
"lat_to": 43.3026,
"lon_to": 5.3803
}
],
"total_duration_min": 100,
"num_connections": 0
}

train_order_resolver/
│
├── api/
│ └── main.py # FastAPI application & endpoints
│
├── src/
│ ├── nlp/
│ │ ├── camembert_intent.py # CamemBERT intent classification
│ │ ├── camembert_ner.py # CamemBERT NER extraction
│ │ ├── intent_classifier.py # Baseline TF-IDF classifier
│ │ ├── ner_extractor.py # Baseline regex NER
│ │ ├── language_detector.py # Language detection module
│ │ ├── metrics.py # Evaluation metrics
│ │ └── pipeline.py # End-to-end NLP pipeline
│ ├── pathfinding/
│ │ ├── gtfs_graph.py # GTFS graph construction + A*
│ │ ├── graph.py # Graph data structures
│ │ └── benchmark_algorithms.py # Dijkstra vs A* benchmarks
│ ├── stt/
│ │ └── voxtral_stt.py # Voxtral speech-to-text
│ └── data_generation/
│ ├── generate_dataset.py # Synthetic dataset generation
│ ├── generate_dataset_v2.py # V2 with augmentation
│ └── extract_cities.py # French cities extraction
│
├── cli/
│ ├── cli_nlp.py # NLP-only CLI
│ ├── cli_pathfinding.py # Pathfinding-only CLI
│ └── cli_full.py # Full pipeline CLI
│
├── scripts/
│ ├── compare_intent.py # Intent model comparison
│ ├── compare_models.py # Full model comparison
│ └── evaluate_baseline.py # Baseline evaluation
│
├── frontend/ # React + TypeScript + Vite
│ └── src/
│ ├── App.tsx # Main application
│ └── components/
│ ├── SearchView.tsx # Natural language search bar
│ ├── ResultView.tsx # Itinerary timeline display
│ ├── TrainMap.tsx # Leaflet map with train animation
│ ├── LoadingView.tsx # Loading state
│ └── ErrorView.tsx # Error handling
│
├── models/ # Fine-tuned model weights (gitignored)
│ ├── camembert_intent/
│ └── camembert_ner/
│
├── data/
│ ├── sncf/gtfs/ # SNCF GTFS schedule data
│ ├── datasets/ # Generated training data
│ ├── cities/ # French cities database
│ └── templates/ # Sentence templates
│
├── docs/ # Project documentation
├── reports/ # Benchmark reports
├── Dockerfile
├── docker-compose.yml
└── requirements.txt
CamemBERT (camembert-base, 110M parameters) fine-tuned with:
| | Intent | NER |
|---|---|---|
| Architecture | `CamembertForSequenceClassification` | `CamembertForTokenClassification` |
| Learning Rate | 2×10⁻⁵ | 5×10⁻⁵ |
| Epochs | 3 | 3 |
| Batch Size | 16 | 16 |
| Optimizer | AdamW | AdamW |
| Warmup | 10% linear | 10% linear |
| Best Metric | 99.8% accuracy | 96.81% macro F1 |
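The "10% linear" warmup row can be illustrated with a small learning-rate schedule. This sketch assumes the rate ramps linearly from 0 to the peak over the first 10% of steps and then decays linearly back to 0, which is a common pairing but is not stated in the table; `linear_warmup_lr` and the step count in the example are hypothetical.

```python
def linear_warmup_lr(step: int, total_steps: int, peak_lr: float, warmup_ratio: float = 0.10) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to zero.

    The decay phase is an assumption; the table only specifies the warmup.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)


if __name__ == "__main__":
    # Hypothetical run of 1,000 optimizer steps at the intent model's peak rate.
    for step in (0, 50, 100, 550, 1000):
        print(step, linear_warmup_lr(step, total_steps=1000, peak_lr=2e-5))
```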
- 105,474 generated sentences from 546 templates × 8,973 French cities
- 12,000 intent samples (50% VALID, 20% NOT_TRIP, 10% each INCOMPLETE/GARBAGE/OTHER_LANG)
- 6,000 NER samples with BIO annotation (B-DEP, I-DEP, B-ARR, I-ARR, O)
- 7 augmentation strategies: original, lowercase, no_accents, combined, typo, punctuation, random_case
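The seven strategy names above suggest simple string transforms. The sketch below is one plausible reading, not the logic in `generate_dataset_v2.py`: in particular, treating `combined` as lowercase plus accent removal, `typo` as an adjacent-character swap, and `punctuation` as a normalized trailing question mark are all assumptions, as are the function names.

```python
import random
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after NFD decomposition (é becomes e, ç becomes c)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def swap_typo(text: str, rng: random.Random) -> str:
    """Simulate a typo by swapping two adjacent characters at a random position."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def random_case(text: str, rng: random.Random) -> str:
    """Upper- or lower-case each character with equal probability."""
    return "".join(ch.upper() if rng.random() < 0.5 else ch.lower() for ch in text)


def augment(text: str, seed: int = 0) -> dict[str, str]:
    """Return one variant per strategy name; the seed makes the output reproducible."""
    rng = random.Random(seed)
    return {
        "original": text,
        "lowercase": text.lower(),
        "no_accents": strip_accents(text),
        "combined": strip_accents(text.lower()),  # assumed: lowercase + no accents
        "typo": swap_typo(text, rng),
        "punctuation": text.rstrip(" .!?") + " ?",  # assumed: normalized trailing mark
        "random_case": random_case(text, rng),
    }


if __name__ == "__main__":
    for name, variant in augment("Je vais d'Épinal à Paris").items():
        print(f"{name:12} {variant}")
```

Variants like these keep the entity positions unchanged, so the BIO labels of the original sentence can be reused for every variant except where a transform changes tokenization.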
- Graph: 3,497 stop areas, 10,770 weighted edges (duration in minutes)
- Algorithm: A* with Haversine heuristic
- Heuristic: h(n) = d_haversine(n, goal) / v_max, where v_max = 5.5 km/min (admissible)
- Multi-station search: all candidate pairs evaluated, minimum effective cost selected
- Penalties: +30 min on exurban hub stations (CDG, Massy TGV, etc.)
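The heuristic above can be sketched in a few lines of pure Python. This is a minimal illustration, not the code in `gtfs_graph.py`: the adjacency-list graph format, the `a_star` signature, the Valence coordinates, and the edge durations in the example are assumptions, and the multi-station search and hub penalties are left out. Dividing the great-circle distance by `v_max` gives a travel time no train can beat, so the estimate never overestimates the remaining cost.

```python
import heapq
import math

V_MAX_KM_PER_MIN = 5.5  # upper bound on train speed, as quoted above
EARTH_RADIUS_KM = 6371.0


def haversine_km(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = math.sin((lat2 - lat1) / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def a_star(graph, coords, start, goal):
    """Return (total_minutes, path) or None.

    graph:  {station: [(neighbor, minutes), ...]}
    coords: {station: (lat, lon)}
    """
    def h(node):
        return haversine_km(coords[node], coords[goal]) / V_MAX_KM_PER_MIN

    open_heap = [(h(start), 0.0, start)]  # entries are (f = g + h, g, node)
    best_cost = {start: 0.0}
    parent = {start: None}
    while open_heap:
        _, cost, node = heapq.heappop(open_heap)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return cost, path[::-1]
        if cost > best_cost[node]:
            continue  # stale heap entry; a cheaper route to this node was found
        for neighbor, minutes in graph.get(node, []):
            new_cost = cost + minutes
            if new_cost < best_cost.get(neighbor, math.inf):
                best_cost[neighbor] = new_cost
                parent[neighbor] = node
                heapq.heappush(open_heap, (new_cost + h(neighbor), new_cost, neighbor))
    return None


if __name__ == "__main__":
    coords = {"Lyon": (45.7606, 4.8593), "Valence": (44.93, 4.89), "Marseille": (43.3026, 5.3803)}
    graph = {"Lyon": [("Marseille", 100), ("Valence", 35)], "Valence": [("Marseille", 70)]}
    print(a_star(graph, coords, "Lyon", "Marseille"))
```

Because nodes are re-queued whenever a cheaper route is found, the search stays optimal with any admissible heuristic, which is consistent with the "100% identical" path result reported against Dijkstra.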
| | Name | Role |
|---|---|---|
| 👨‍💻 | Gouia Hichem | Lead Developer (NLP, Pathfinding, API, Frontend) |
| 👨‍💻 | Denis Melvyn | Developer |
Epitech Technology — MSc Pro Artificial Intelligence — Promotion 2026
This project was developed as part of the T-AIA-911 module at Epitech Technology.