iamhmh/train_order_resolver

🚆 Travel Order Resolver

A CamemBERT-based NLP System for French Train Itinerary Generation

An end-to-end AI pipeline that transforms free-form French natural language queries into optimal train itineraries on the SNCF rail network.

Features · Architecture · Getting Started · Performance · API Reference


📌 Problem Statement

Converting a French sentence like "je voudrais aller d'Épinal à Paris" into a concrete train itinerary is harder than it appears:

  • Variable word order: "de X à Y" vs "à Y depuis X"
  • French elisions: d'Épinal → de Épinal, l'arrivée → la arrivée
  • Ambiguous names: Albert, Paris, Florence — first names or cities?
  • Compound city names: Port-Boulet = "port" + "boulet" (common words)
  • User errors: missing accents, typos, inconsistent casing

Regex-based approaches fail to generalize. This project solves it with deep learning.
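
To illustrate the elision case listed above, here is a minimal sketch of how elided prefixes could be expanded before the text reaches the models. The function name `expand_elisions` and the prefix table are hypothetical and not taken from the project's code. Mapping l' to "la" follows the example above and ignores grammatical gender.

```python
import re

# Hypothetical prefix table (assumption): only the elisions shown in this README
# plus "j'". Mapping "l'" to "la" mirrors the README example and ignores gender.
ELISIONS = {"d": "de", "l": "la", "j": "je"}

# Match an elided prefix at a word boundary, followed by a straight or curly
# apostrophe and then a word character.
_ELISION_RE = re.compile(r"\b([dlj])['’](?=\w)", flags=re.IGNORECASE)


def expand_elisions(text: str) -> str:
    """Replace elided prefixes such as d' with their full form plus a space."""

    def _replace(match: re.Match) -> str:
        prefix = match.group(1)
        full = ELISIONS[prefix.lower()]
        # Preserve a leading capital, e.g. "D'Épinal" -> "De Épinal".
        if prefix.isupper():
            full = full.capitalize()
        return full + " "

    return _ELISION_RE.sub(_replace, text)


if __name__ == "__main__":
    print(expand_elisions("je voudrais aller d'Épinal à Paris"))
```

The word-boundary anchor keeps words such as "aujourd'hui" untouched, because the "d" there is not at the start of a word.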


✨ Features

  • Intent Classification — 5 classes (VALID, INCOMPLETE, NOT_TRIP, GARBAGE, OTHER_LANG) with 99.8% accuracy
  • Named Entity Recognition — BIO-tagged departure/arrival extraction with 96.81% token F1
  • Optimal Pathfinding — A* algorithm with Haversine heuristic over 3,497 SNCF stations
  • Voice Input — Speech-to-text via Voxtral Mini (Mistral API)
  • Interactive Map — Real-time route visualization with animated train on Leaflet
  • Full-stack App — FastAPI backend + React/TypeScript frontend
  • CLI Tools — Standalone NLP, pathfinding, and full pipeline CLIs
  • Docker Ready — One-command deployment with Docker Compose

🏗️ Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    Frontend — React + TypeScript + Vite           │
│          SearchView  ·  ResultView  ·  TrainMap (Leaflet)        │
└──────────────────────────────┬───────────────────────────────────┘
                               │ HTTP (REST)
┌──────────────────────────────▼───────────────────────────────────┐
│                        API — FastAPI                             │
│    POST /api/search  ·  POST /api/search/voice  ·  GET /health  │
└──────┬──────────────────┬──────────────────┬─────────────────────┘
       │                  │                  │
┌──────▼──────┐   ┌───────▼───────┐   ┌──────▼──────┐
│  CamemBERT  │   │  CamemBERT    │   │ GTFS Graph  │
│   Intent    │   │     NER       │   │   + A*      │
│  (99.8%)    │   │  (96.81% F1)  │   │ (3497 stn)  │
└─────────────┘   └───────────────┘   └─────────────┘

Pipeline Flow

User Input ──→ [Preprocessing] ──→ [Intent Classification] ──→ [NER Extraction] ──→ [A* Pathfinding] ──→ Optimal Itinerary
   text/voice     elision expansion     5-class softmax        B-DEP/B-ARR/O        Haversine heuristic     stations + trains
                                            │
                                     if NOT valid → error message
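
The gating logic in the flow above can be sketched as follows. This is an illustration only: `resolve`, `PipelineResult`, and the three injected callables are hypothetical names, and the real pipeline in `src/nlp/pipeline.py` may be organized differently.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class PipelineResult:
    intent: str
    departure: Optional[str] = None
    arrival: Optional[str] = None
    route: Optional[List[str]] = None
    error: Optional[str] = None


def resolve(
    text: str,
    classify_intent: Callable[[str], str],
    extract_entities: Callable[[str], Tuple[str, str]],
    find_route: Callable[[str, str], List[str]],
) -> PipelineResult:
    """Run intent classification first; only VALID queries reach NER and pathfinding."""
    intent = classify_intent(text)
    if intent != "VALID":
        # Any other class (INCOMPLETE, NOT_TRIP, GARBAGE, OTHER_LANG) stops here.
        return PipelineResult(intent=intent, error=f"Query rejected with intent {intent}")

    departure, arrival = extract_entities(text)
    route = find_route(departure, arrival)
    return PipelineResult(intent=intent, departure=departure, arrival=arrival, route=route)


if __name__ == "__main__":
    # Stub components standing in for the CamemBERT models and the A* search.
    result = resolve(
        "je veux aller de Lyon à Marseille",
        classify_intent=lambda t: "VALID",
        extract_entities=lambda t: ("Lyon", "Marseille"),
        find_route=lambda dep, arr: [dep, arr],
    )
    print(result)
```

Passing the components as callables keeps the sketch independent of how the models are loaded, and makes the early exit easy to exercise with stubs.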

📊 Performance

NLP — Baseline vs CamemBERT

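| Metric               | Baseline (TF-IDF + Regex) | CamemBERT | Change      |
|----------------------|---------------------------|-----------|-------------|
| Intent Accuracy      | 60.3%                     | 99.8%     | +39.5 pp    |
| NER Exact Match      | 33.3%                     | 80.7%     | +47.3 pp    |
| Departure Similarity | 70.4%                     | 92.9%     | +22.6 pp    |
| Arrival Similarity   | 43.6%                     | 92.1%     | +48.6 pp    |
| Latency / sample     | 0.045 ms                  | 18.3 ms   | ×407 slower |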

Pathfinding — Dijkstra vs A*

| Metric                   | Value          |
|--------------------------|----------------|
| Stations (nodes)         | 3,497          |
| Connections (edges)      | 10,770         |
| A* speedup over Dijkstra | 33%            |
| Path optimality          | 100% identical |

End-to-End

| Metric                       | Value                     |
|------------------------------|---------------------------|
| Full pipeline latency        | ~100 ms                   |
| Dataset size                 | 12,000 intent + 6,000 NER |
| Data augmentation strategies | 7                         |

🚀 Getting Started

Prerequisites

  • Python 3.12+
  • Node.js 20+ (via nvm)
  • SNCF GTFS data (included in data/sncf/gtfs/)

Installation

# Clone the repository
git clone https://github.com/iamhmh/train_order_resolver.git
cd train_order_resolver

# Backend setup
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Frontend setup
cd frontend
nvm use 20
npm install
cd ..

Running the Application

Backend:

source .venv/bin/activate
python -m uvicorn api.main:app --port 8000

Frontend:

cd frontend
npm run dev

Open http://localhost:5173 in your browser.

Docker

docker-compose up --build

CLI Tools

# NLP only (intent + NER)
python -m cli.cli_nlp

# Pathfinding only
python -m cli.cli_pathfinding

# Full pipeline (NLP → Pathfinding)
python -m cli.cli_full

📡 API Reference

| Method | Endpoint          | Description                                           |
|--------|-------------------|-------------------------------------------------------|
| POST   | /api/search       | Process a text query and return the optimal itinerary |
| POST   | /api/search/voice | Process an audio file (STT → NLP → Pathfinding)       |
| POST   | /api/transcribe   | Transcribe audio to text via Voxtral                  |
| GET    | /api/stations     | List all available SNCF stations                      |
| GET    | /health           | Health check                                          |

Example Request

curl -X POST http://localhost:8000/api/search \
  -H "Content-Type: application/json" \
  -d '{"text": "je veux aller de Lyon à Marseille"}'

Example Response

{
  "intent": "VALID",
  "departure": "Lyon Part Dieu",
  "arrival": "Marseille Saint-Charles",
  "route": [
    {
      "from": "Lyon Part Dieu",
      "to": "Marseille Saint-Charles",
      "train_type": "TGV INOUI",
      "duration_min": 100,
      "lat_from": 45.7606,
      "lon_from": 4.8593,
      "lat_to": 43.3026,
      "lon_to": 5.3803
    }
  ],
  "total_duration_min": 100,
  "num_connections": 0
}
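
The same request can be issued from Python. The sketch below uses only the standard library and relies on the endpoint, payload shape, and response fields shown in the example above. The helper names `build_search_request`, `search`, and `summarize` are illustrative and not part of the project.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/search"  # backend address from "Running the Application"


def build_search_request(text: str, url: str = API_URL) -> urllib.request.Request:
    """Build the POST request with a JSON body of the form {"text": ...}."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(text: str) -> dict:
    """Send the query to a running backend and decode the JSON response."""
    with urllib.request.urlopen(build_search_request(text)) as resp:
        return json.load(resp)


def summarize(response: dict) -> str:
    """Format the fields shown in the example response as a one-line summary."""
    return (
        f"{response['departure']} -> {response['arrival']}: "
        f"{response['total_duration_min']} min, "
        f"{response['num_connections']} connection(s)"
    )


if __name__ == "__main__":
    # Requires the backend to be running on port 8000.
    print(summarize(search("je veux aller de Lyon à Marseille")))
```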

📁 Project Structure

train_order_resolver/
│
├── api/
│   └── main.py                     # FastAPI application & endpoints
│
├── src/
│   ├── nlp/
│   │   ├── camembert_intent.py     # CamemBERT intent classification
│   │   ├── camembert_ner.py        # CamemBERT NER extraction
│   │   ├── intent_classifier.py    # Baseline TF-IDF classifier
│   │   ├── ner_extractor.py        # Baseline regex NER
│   │   ├── language_detector.py    # Language detection module
│   │   ├── metrics.py              # Evaluation metrics
│   │   └── pipeline.py             # End-to-end NLP pipeline
│   ├── pathfinding/
│   │   ├── gtfs_graph.py           # GTFS graph construction + A*
│   │   ├── graph.py                # Graph data structures
│   │   └── benchmark_algorithms.py # Dijkstra vs A* benchmarks
│   ├── stt/
│   │   └── voxtral_stt.py          # Voxtral speech-to-text
│   └── data_generation/
│       ├── generate_dataset.py     # Synthetic dataset generation
│       ├── generate_dataset_v2.py  # V2 with augmentation
│       └── extract_cities.py       # French cities extraction
│
├── cli/
│   ├── cli_nlp.py                  # NLP-only CLI
│   ├── cli_pathfinding.py          # Pathfinding-only CLI
│   └── cli_full.py                 # Full pipeline CLI
│
├── scripts/
│   ├── compare_intent.py           # Intent model comparison
│   ├── compare_models.py           # Full model comparison
│   └── evaluate_baseline.py        # Baseline evaluation
│
├── frontend/                       # React + TypeScript + Vite
│   └── src/
│       ├── App.tsx                 # Main application
│       └── components/
│           ├── SearchView.tsx      # Natural language search bar
│           ├── ResultView.tsx      # Itinerary timeline display
│           ├── TrainMap.tsx        # Leaflet map with train animation
│           ├── LoadingView.tsx     # Loading state
│           └── ErrorView.tsx       # Error handling
│
├── models/                         # Fine-tuned model weights (gitignored)
│   ├── camembert_intent/
│   └── camembert_ner/
│
├── data/
│   ├── sncf/gtfs/                  # SNCF GTFS schedule data
│   ├── datasets/                   # Generated training data
│   ├── cities/                     # French cities database
│   └── templates/                  # Sentence templates
│
├── docs/                           # Project documentation
├── reports/                        # Benchmark reports
├── Dockerfile
├── docker-compose.yml
└── requirements.txt

🔬 Technical Details

NLP Models

CamemBERT (camembert-base, 110M parameters) fine-tuned with:

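| Setting       | Intent                             | NER                             |
|---------------|------------------------------------|---------------------------------|
| Architecture  | CamembertForSequenceClassification | CamembertForTokenClassification |
| Learning Rate | 2×10⁻⁵                             | 5×10⁻⁵                          |
| Epochs        | 3                                  | 3                               |
| Batch Size    | 16                                 | 16                              |
| Optimizer     | AdamW                              | AdamW                           |
| Warmup        | 10% linear                         | 10% linear                      |
| Best Metric   | 99.8% accuracy                     | 96.81% macro F1                 |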
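
As an illustration of the "10% linear" warmup setting, the sketch below computes a learning rate that rises linearly from zero over the first 10% of training steps. The linear decay to zero afterwards is an assumption (a common pairing with linear warmup), not something this README states. The function name and the step count used in the demo are hypothetical.

```python
def linear_warmup_lr(step: int, total_steps: int, base_lr: float,
                     warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` with linear warmup, then linear decay (assumed)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase (assumption): scale linearly from base_lr down to 0.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    total = 1000  # illustrative step count, not the project's actual value
    for step in (0, 50, 100, 550, 1000):
        print(step, linear_warmup_lr(step, total, base_lr=2e-5))
```

With the intent model's learning rate of 2×10⁻⁵, the rate peaks at the end of warmup (step 100 of 1000 in this demo) and is half of the peak at step 50.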

Dataset

  • 105,474 generated sentences from 546 templates × 8,973 French cities
  • 12,000 intent samples (50% VALID, 20% NOT_TRIP, 10% each INCOMPLETE/GARBAGE/OTHER_LANG)
  • 6,000 NER samples with BIO annotation (B-DEP, I-DEP, B-ARR, I-ARR, O)
  • 7 augmentation strategies: original, lowercase, no_accents, combined, typo, punctuation, random_case
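
To make the BIO annotation scheme concrete, here is a minimal sketch that turns a tagged token sequence into departure and arrival strings. It assumes word-level tokens with one tag each. Aligning CamemBERT subword tokens back to words is skipped, and `decode_bio` is a hypothetical helper rather than the project's own function.

```python
from typing import Dict, List


def decode_bio(tokens: List[str], tags: List[str]) -> Dict[str, str]:
    """Group B-/I- tagged tokens into one string per label (DEP, ARR)."""
    entities: Dict[str, List[str]] = {}
    open_label = None  # label of the span currently being built, if any

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span; a later span of the same
            # label overwrites an earlier one in this simplified version.
            open_label = tag[2:]
            entities[open_label] = [token]
        elif tag.startswith("I-") and tag[2:] == open_label:
            # An I- tag only continues a span of the same label.
            entities[open_label].append(token)
        else:
            # "O" or an inconsistent I- tag closes the current span.
            open_label = None

    return {label: " ".join(words) for label, words in entities.items()}


if __name__ == "__main__":
    tokens = ["je", "veux", "aller", "de", "Lyon", "à", "Aix", "en", "Provence"]
    tags = ["O", "O", "O", "O", "B-DEP", "O", "B-ARR", "I-ARR", "I-ARR"]
    print(decode_bio(tokens, tags))
```

The I- tags are what let multi-word names such as "Aix en Provence" come out as a single arrival.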

Pathfinding

  • Graph: 3,497 stop areas, 10,770 weighted edges (duration in minutes)
  • Algorithm: A* with Haversine heuristic
  • Heuristic: h(n) = d_haversine(n, goal) / v_max, where v_max = 5.5 km/min (330 km/h); admissible as long as no connection covers great-circle distance faster than v_max
  • Multi-station search: all candidate pairs evaluated, minimum effective cost selected
  • Penalties: +30 min on exurban hub stations (CDG, Massy TGV, etc.)
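
The search described above can be sketched on a toy graph. This is a simplified illustration and not the code in `src/pathfinding/gtfs_graph.py`: it leaves out multi-station search and hub penalties. The graph, the intermediate "Stop X" with its coordinates, and the 35 and 70 minute durations are made up for the demo. The Lyon and Marseille coordinates and the 100 minute duration come from the example API response.

```python
import heapq
import math
from typing import Dict, List, Optional, Tuple

V_MAX_KM_PER_MIN = 5.5      # v_max from the heuristic definition above
EARTH_RADIUS_KM = 6371.0

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1 = math.radians(a[0]), math.radians(a[1])
    lat2, lon2 = math.radians(b[0]), math.radians(b[1])
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def a_star(graph: Dict[str, Dict[str, float]], coords: Dict[str, Coord],
           start: str, goal: str) -> Optional[Tuple[float, List[str]]]:
    """Return (total minutes, path), or None if the goal is unreachable."""

    def h(node: str) -> float:
        # Lower bound on remaining minutes: straight-line distance at top speed.
        return haversine_km(coords[node], coords[goal]) / V_MAX_KM_PER_MIN

    best = {start: 0.0}
    heap = [(h(start), 0.0, start, [start])]  # entries are (f = g + h, g, node, path)
    while heap:
        _, g, node, path = heapq.heappop(heap)
        if node == goal:
            return g, path
        if g > best.get(node, math.inf):
            continue  # stale entry: a cheaper route to this node was already found
        for neighbor, minutes in graph.get(node, {}).items():
            new_g = g + minutes
            if new_g < best.get(neighbor, math.inf):
                best[neighbor] = new_g
                heapq.heappush(heap, (new_g + h(neighbor), new_g, neighbor, path + [neighbor]))
    return None


# Toy data: "Stop X" and the 35/70 minute edges are hypothetical.
COORDS = {
    "Lyon Part Dieu": (45.7606, 4.8593),
    "Stop X": (44.9, 4.9),
    "Marseille Saint-Charles": (43.3026, 5.3803),
}
GRAPH = {
    "Lyon Part Dieu": {"Marseille Saint-Charles": 100, "Stop X": 35},
    "Stop X": {"Marseille Saint-Charles": 70},
}

if __name__ == "__main__":
    print(a_star(GRAPH, COORDS, "Lyon Part Dieu", "Marseille Saint-Charles"))
```

Because the heuristic never overestimates the remaining travel time on this toy graph, the first time the goal is popped from the heap its cost is optimal, which is the property behind the "100% identical" paths reported against Dijkstra.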

👥 Authors

| Name            | Role                                             |
|-----------------|--------------------------------------------------|
| 👨‍💻 Gouia Hichem | Lead Developer: NLP, Pathfinding, API, Frontend  |
| 👨‍💻 Denis Melvyn | Developer                                        |

Epitech Technology — MSc Pro Artificial Intelligence — Promotion 2026


📄 License

This project was developed as part of the T-AIA-911 module at Epitech Technology.
