🚆 Travel Order Resolver

A CamemBERT-based NLP System for French Train Itinerary Generation

An end-to-end AI pipeline that transforms free-form French natural language queries into optimal train itineraries on the SNCF rail network.

Features · Architecture · Getting Started · Performance · API Reference

📌 Problem Statement

Converting a French sentence like "je voudrais aller d'Épinal à Paris" into a concrete train itinerary is harder than it appears:

Variable word order: "de X à Y" vs "à Y depuis X"
French elisions: d'Épinal → de Épinal, l'arrivée → la arrivée
Ambiguous names: Albert, Paris, Florence — first names or cities?
Compound city names: Port-Boulet = "port" + "boulet" (common words)
User errors: missing accents, typos, inconsistent casing
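
The elision case above can be illustrated with a small preprocessing sketch. It is not the project's actual preprocessing code: the function name `expand_elisions` is hypothetical, and only `d'` is handled here, since expanding `l'` to `le` or `la` needs gender information that a regex alone does not have.

```python
import re

# Matches "d'" or "D'" at the start of a word, with a straight or
# typographic apostrophe, when followed by a word character.
_ELISION_D = re.compile(r"\b([dD])['’](?=\w)")


def expand_elisions(text: str) -> str:
    """Rewrite d'X as 'de X' so the city name becomes its own token.

    Hypothetical helper: covers only the d' case from the list above.
    """
    return _ELISION_D.sub(lambda m: m.group(1) + "e ", text)


print(expand_elisions("je voudrais aller d'Épinal à Paris"))
```

The `\b` anchor keeps words such as "aujourd'hui" untouched, because the `d` there is not at a word start.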

Regex-based approaches fail to generalize across these cases. This project addresses the problem with fine-tuned CamemBERT models for intent classification and entity extraction.

✨ Features

Intent Classification — 5 classes (VALID, INCOMPLETE, NOT_TRIP, GARBAGE, OTHER_LANG) with 99.8% accuracy
Named Entity Recognition — BIO-tagged departure/arrival extraction with 96.81% token F1
Optimal Pathfinding — A* algorithm with Haversine heuristic over 3,497 SNCF stations
Voice Input — Speech-to-text via Voxtral Mini (Mistral API)
Interactive Map — Real-time route visualization with animated train on Leaflet
Full-stack App — FastAPI backend + React/TypeScript frontend
CLI Tools — Standalone NLP, pathfinding, and full pipeline CLIs
Docker Ready — One-command deployment with Docker Compose

🏗️ Architecture

┌──────────────────────────────────────────────────────────────────┐
│                    Frontend — React + TypeScript + Vite           │
│          SearchView  ·  ResultView  ·  TrainMap (Leaflet)        │
└──────────────────────────────┬───────────────────────────────────┘
                               │ HTTP (REST)
┌──────────────────────────────▼───────────────────────────────────┐
│                        API — FastAPI                             │
│    POST /api/search  ·  POST /api/search/voice  ·  GET /health  │
└──────┬──────────────────┬──────────────────┬─────────────────────┘
       │                  │                  │
┌──────▼──────┐   ┌───────▼───────┐   ┌──────▼──────┐
│  CamemBERT  │   │  CamemBERT    │   │ GTFS Graph  │
│   Intent    │   │     NER       │   │   + A*      │
│  (99.8%)    │   │  (96.81% F1)  │   │ (3497 stn)  │
└─────────────┘   └───────────────┘   └─────────────┘

Pipeline Flow

User Input ──→ [Preprocessing] ──→ [Intent Classification] ──→ [NER Extraction] ──→ [A* Pathfinding] ──→ Optimal Itinerary
   text/voice     elision expansion     5-class softmax        B-DEP/B-ARR/O        Haversine heuristic     stations + trains
                                            │
                                     if NOT valid → error message
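
The flow above could be sketched as follows. This is a minimal illustration rather than the code in `src/nlp/pipeline.py`: the names `decode_bio` and `resolve`, the error payload shape, and the stub components are all assumptions. The models and the pathfinder are passed in as plain callables so the gating logic can be read on its own.

```python
from typing import Callable, Dict, List, Optional, Tuple

Tagged = Tuple[List[str], List[str]]  # (tokens, BIO tags), same length


def decode_bio(tokens: List[str], tags: List[str]) -> Dict[str, Optional[str]]:
    """Group B-/I- tagged tokens into one departure and one arrival string."""
    spans: Dict[str, List[str]] = {"DEP": [], "ARR": []}
    open_label: Optional[str] = None  # label of the span currently being built
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" and label in spans:
            spans[label] = [token]  # assumption: a later B- tag replaces an earlier span
            open_label = label
        elif prefix == "I" and label == open_label:
            spans[label].append(token)
        else:
            open_label = None  # "O" or a stray I- tag closes the span
    return {
        "departure": " ".join(spans["DEP"]) or None,
        "arrival": " ".join(spans["ARR"]) or None,
    }


def resolve(
    text: str,
    classify_intent: Callable[[str], str],
    tag_tokens: Callable[[str], Tagged],
    find_route: Callable[[str, str], list],
) -> dict:
    """Run intent gating, then NER decoding, then pathfinding."""
    intent = classify_intent(text)
    if intent != "VALID":
        # Hypothetical error shape; the real API may differ.
        return {"intent": intent, "error": f"no itinerary for a {intent} query"}
    entities = decode_bio(*tag_tokens(text))
    if entities["departure"] is None or entities["arrival"] is None:
        return {"intent": intent, "error": "departure or arrival not found"}
    route = find_route(entities["departure"], entities["arrival"])
    return {"intent": intent, **entities, "route": route}


# Stub components standing in for the real models (hypothetical).
demo = resolve(
    "je veux aller de Lyon à Marseille",
    classify_intent=lambda _text: "VALID",
    tag_tokens=lambda t: (t.split(), ["O", "O", "O", "O", "B-DEP", "O", "B-ARR"]),
    find_route=lambda dep, arr: [f"{dep} -> {arr}"],
)
print(demo)
```

Checking the intent first means the NER model and the graph search never run on queries that cannot produce an itinerary.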

📊 Performance

NLP — Baseline vs CamemBERT

Metric	Baseline (TF-IDF + Regex)	CamemBERT	Improvement
Intent Accuracy	60.3%	99.8%	+39.5 pp
NER Exact Match	33.3%	80.7%	+47.3 pp
Departure Similarity	70.4%	92.9%	+22.6 pp
Arrival Similarity	43.6%	92.1%	+48.6 pp
Latency / sample	0.045 ms	18.3 ms	×407 slower

Pathfinding — Dijkstra vs A*

Metric	Value
Stations (nodes)	3,497
Connections (edges)	10,770
A* speedup over Dijkstra	33%
Path optimality	100% identical

End-to-End

Metric	Value
Full pipeline latency	~100 ms
Dataset size	12,000 intent + 6,000 NER
Data augmentation strategies	7

🚀 Getting Started

Prerequisites

Python 3.12+
Node.js 20+ (via nvm)
SNCF GTFS data (included in data/sncf/gtfs/)

Installation

# Clone the repository
git clone https://github.com/iamhmh/train_order_resolver.git
cd train_order_resolver

# Backend setup
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Frontend setup
cd frontend
nvm use 20
npm install
cd ..

Running the Application

Backend:

source .venv/bin/activate
python -m uvicorn api.main:app --port 8000

Frontend:

cd frontend
npm run dev

Open http://localhost:5173 in your browser.

Docker

docker-compose up --build

CLI Tools

# NLP only (intent + NER)
python -m cli.cli_nlp

# Pathfinding only
python -m cli.cli_pathfinding

# Full pipeline (NLP → Pathfinding)
python -m cli.cli_full

📡 API Reference

Method	Endpoint	Description
`POST`	`/api/search`	Process a text query and return the optimal itinerary
`POST`	`/api/search/voice`	Process an audio file (STT → NLP → Pathfinding)
`POST`	`/api/transcribe`	Transcribe audio to text via Voxtral
`GET`	`/api/stations`	List all available SNCF stations
`GET`	`/health`	Health check

Example Request

curl -X POST http://localhost:8000/api/search \
  -H "Content-Type: application/json" \
  -d '{"text": "je veux aller de Lyon à Marseille"}'

Example Response

{
  "intent": "VALID",
  "departure": "Lyon Part Dieu",
  "arrival": "Marseille Saint-Charles",
  "route": [
    {
      "from": "Lyon Part Dieu",
      "to": "Marseille Saint-Charles",
      "train_type": "TGV INOUI",
      "duration_min": 100,
      "lat_from": 45.7606,
      "lon_from": 4.8593,
      "lat_to": 43.3026,
      "lon_to": 5.3803
    }
  ],
  "total_duration_min": 100,
  "num_connections": 0
}
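
The same request can be sent from Python using only the standard library. This is a sketch that assumes the backend is running locally on port 8000 as described above; the helper names `build_search_request`, `search`, and `summarize` are illustrative, and `summarize` relies only on the fields shown in the example response.

```python
import json
import urllib.request

API_URL = "http://localhost:8000/api/search"  # local backend from "Running the Application"


def build_search_request(text: str, url: str = API_URL) -> urllib.request.Request:
    """Build the POST request without sending it."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(text: str, timeout: float = 10.0) -> dict:
    """Send the query and return the decoded JSON body (needs a running server)."""
    with urllib.request.urlopen(build_search_request(text), timeout=timeout) as resp:
        return json.load(resp)


def summarize(result: dict) -> str:
    """Format a response shaped like the example above as a single line."""
    legs = ", ".join(
        f'{leg["from"]} -> {leg["to"]} ({leg["train_type"]}, {leg["duration_min"]} min)'
        for leg in result["route"]
    )
    return f'{result["departure"]} to {result["arrival"]}: {result["total_duration_min"]} min; {legs}'
```

With the server up, `print(summarize(search("je veux aller de Lyon à Marseille")))` would print the itinerary on one line.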

📁 Project Structure

train_order_resolver/
│
├── api/
│   └── main.py                     # FastAPI application & endpoints
│
├── src/
│   ├── nlp/
│   │   ├── camembert_intent.py     # CamemBERT intent classification
│   │   ├── camembert_ner.py        # CamemBERT NER extraction
│   │   ├── intent_classifier.py    # Baseline TF-IDF classifier
│   │   ├── ner_extractor.py        # Baseline regex NER
│   │   ├── language_detector.py    # Language detection module
│   │   ├── metrics.py              # Evaluation metrics
│   │   └── pipeline.py             # End-to-end NLP pipeline
│   ├── pathfinding/
│   │   ├── gtfs_graph.py           # GTFS graph construction + A*
│   │   ├── graph.py                # Graph data structures
│   │   └── benchmark_algorithms.py # Dijkstra vs A* benchmarks
│   ├── stt/
│   │   └── voxtral_stt.py          # Voxtral speech-to-text
│   └── data_generation/
│       ├── generate_dataset.py     # Synthetic dataset generation
│       ├── generate_dataset_v2.py  # V2 with augmentation
│       └── extract_cities.py       # French cities extraction
│
├── cli/
│   ├── cli_nlp.py                  # NLP-only CLI
│   ├── cli_pathfinding.py          # Pathfinding-only CLI
│   └── cli_full.py                 # Full pipeline CLI
│
├── scripts/
│   ├── compare_intent.py           # Intent model comparison
│   ├── compare_models.py           # Full model comparison
│   └── evaluate_baseline.py        # Baseline evaluation
│
├── frontend/                       # React + TypeScript + Vite
│   └── src/
│       ├── App.tsx                 # Main application
│       └── components/
│           ├── SearchView.tsx      # Natural language search bar
│           ├── ResultView.tsx      # Itinerary timeline display
│           ├── TrainMap.tsx        # Leaflet map with train animation
│           ├── LoadingView.tsx     # Loading state
│           └── ErrorView.tsx       # Error handling
│
├── models/                         # Fine-tuned model weights (gitignored)
│   ├── camembert_intent/
│   └── camembert_ner/
│
├── data/
│   ├── sncf/gtfs/                  # SNCF GTFS schedule data
│   ├── datasets/                   # Generated training data
│   ├── cities/                     # French cities database
│   └── templates/                  # Sentence templates
│
├── docs/                           # Project documentation
├── reports/                        # Benchmark reports
├── Dockerfile
├── docker-compose.yml
└── requirements.txt

🔬 Technical Details

NLP Models

CamemBERT (camembert-base, 110M parameters) fine-tuned with:

Setting	Intent	NER
Architecture	`CamembertForSequenceClassification`	`CamembertForTokenClassification`
Learning Rate	2×10⁻⁵	5×10⁻⁵
Epochs	3	3
Batch Size	16	16
Optimizer	AdamW	AdamW
Warmup	10% linear	10% linear
Best Metric	99.8% accuracy	96.81% macro F1
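
To make the "10% linear" warmup row concrete, here is a plain-Python sketch of such a schedule. The table only states the warmup; the linear decay to zero afterwards is an assumption (it is a common default for this kind of fine-tuning), and the function name and the 1,000-step total are illustrative.

```python
def linear_warmup_lr(
    step: int, total_steps: int, base_lr: float, warmup_frac: float = 0.10
) -> float:
    """Learning rate at a given optimizer step.

    Ramps linearly from 0 to base_lr over the first warmup_frac of training,
    then (assumption) decays linearly back to 0 at total_steps.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Intent model settings from the table: base learning rate 2e-5.
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_lr(step, total_steps=1000, base_lr=2e-5))
```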

Dataset

105,474 generated sentences from 546 templates × 8,973 French cities
12,000 intent samples (50% VALID, 20% NOT_TRIP, 10% each INCOMPLETE/GARBAGE/OTHER_LANG)
6,000 NER samples with BIO annotation (B-DEP, I-DEP, B-ARR, I-ARR, O)
7 augmentation strategies: original, lowercase, no_accents, combined, typo, punctuation, random_case
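
The seven strategy names above could map to transformations like the ones below. Only the names come from this README; every behavior here is an interpretation, in particular that `combined` means lowercase plus accent removal, that `typo` swaps two adjacent characters, and that `punctuation` normalizes the trailing punctuation. All of them keep the number of whitespace-separated tokens unchanged, which would keep BIO labels aligned.

```python
import random
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after NFD decomposition (É -> E, à -> a)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def swap_typo(text: str, rng: random.Random) -> str:
    """Swap two adjacent non-space characters at a random position (assumed typo model)."""
    candidates = [
        i for i in range(len(text) - 1)
        if not text[i].isspace() and not text[i + 1].isspace()
    ]
    if not candidates:
        return text
    i = rng.choice(candidates)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def random_case(text: str, rng: random.Random) -> str:
    """Upper- or lowercase each character with equal probability."""
    return "".join(ch.upper() if rng.random() < 0.5 else ch.lower() for ch in text)


def augment(text: str, seed: int = 0) -> dict:
    """Return one variant per strategy name, keyed by that name."""
    rng = random.Random(seed)  # seeded so the variants are reproducible
    return {
        "original": text,
        "lowercase": text.lower(),
        "no_accents": strip_accents(text),
        "combined": strip_accents(text.lower()),
        "typo": swap_typo(text, rng),
        "punctuation": text.rstrip(" .!?") + ".",
        "random_case": random_case(text, rng),
    }


for name, variant in augment("Je veux aller de Épinal à Paris").items():
    print(f"{name:12s} {variant}")
```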

Pathfinding

Graph: 3,497 stop areas, 10,770 weighted edges (duration in minutes)
Algorithm: A* with Haversine heuristic
Heuristic: h(n) = d_haversine(n, goal) / v_max, where v_max = 5.5 km/min (admissible)
Multi-station search: all candidate pairs evaluated, minimum effective cost selected
Penalties: +30 min on exurban hub stations (CDG, Massy TGV, etc.)
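
The heuristic above can be sketched on a toy graph. This is not the code in `src/pathfinding/gtfs_graph.py`: the function names are illustrative, "Hub X" and its coordinates are invented for the example, and the 40 and 55 minute edges are made-up durations. Only the two station coordinates, the 100-minute direct leg, and v_max = 5.5 km/min come from this README. Multi-station search and hub penalties are left out.

```python
import heapq
import math
from typing import Dict, List, Optional, Tuple

V_MAX_KM_PER_MIN = 5.5    # speed bound from the README, keeps the heuristic admissible
EARTH_RADIUS_KM = 6371.0

Coord = Tuple[float, float]                  # (latitude, longitude) in degrees
Graph = Dict[str, List[Tuple[str, float]]]   # node -> [(neighbor, minutes)]


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def a_star(graph: Graph, coords: Dict[str, Coord], start: str,
           goal: str) -> Optional[Tuple[float, List[str]]]:
    """Return (total minutes, path) for the fastest route, or None if unreachable."""
    def h(node: str) -> float:
        # Lower bound on remaining minutes: straight-line distance at top speed.
        return haversine_km(coords[node], coords[goal]) / V_MAX_KM_PER_MIN

    best = {start: 0.0}                       # cheapest known cost to each node
    parent: Dict[str, Optional[str]] = {start: None}
    heap = [(h(start), 0.0, start)]           # entries are (f = g + h, g, node)
    while heap:
        _, g, node = heapq.heappop(heap)
        if g > best[node]:
            continue                          # stale entry, a cheaper one was pushed later
        if node == goal:
            path: List[str] = []
            cur: Optional[str] = node
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return g, path[::-1]
        for nxt, minutes in graph.get(node, []):
            new_g = g + minutes
            if new_g < best.get(nxt, math.inf):
                best[nxt] = new_g
                parent[nxt] = node
                heapq.heappush(heap, (new_g + h(nxt), new_g, nxt))
    return None


coords = {
    "Lyon Part Dieu": (45.7606, 4.8593),
    "Marseille Saint-Charles": (43.3026, 5.3803),
    "Hub X": (44.5, 5.0),                     # hypothetical intermediate stop
}
graph: Graph = {
    "Lyon Part Dieu": [("Marseille Saint-Charles", 100), ("Hub X", 40)],
    "Hub X": [("Marseille Saint-Charles", 55)],
}
print(a_star(graph, coords, "Lyon Part Dieu", "Marseille Saint-Charles"))
```

On this toy graph the search prefers the two-leg route (95 minutes) over the direct edge (100 minutes). Setting `h` to always return 0 turns the same loop into Dijkstra's algorithm, which is why both return identical paths.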

👥 Authors

	Name	Role
👨‍💻	Gouia Hichem	Lead Developer — NLP, Pathfinding, API, Frontend
👨‍💻	Denis Melvyn	Developer

Epitech Technology — MSc Pro Artificial Intelligence — Promotion 2026

📄 License

This project was developed as part of the T-AIA-911 module at Epitech Technology.
