Syed Ibrahim Omer edited this page Apr 12, 2026 · 4 revisions

Migration Surge Prediction Wiki

Welcome to the knowledge graph for the Migration Surge Prediction Early Warning System. This wiki documents the full project — data sources, pipeline stages, ML models, analysis methods, source modules, key findings, and infrastructure.


Quick Start

  • Project Overview — Goals, research questions, methodology, and team
  • Glossary — Key terms used throughout this wiki

Data Sources

Raw inputs that feed the prediction system.

| Page | Description |
|---|---|
| Visa Data | US Department of State visa issuance statistics (108 monthly PDFs) |
| Encounter Data | CBP Southwest border encounter statistics (FY2019–2026) |
| Google News | 170K+ news articles across 15 countries × 8 topics |
| Google Trends | Monthly search-interest time series (15 countries × 8 keywords) |
| Exchange Rates | IMF Real Effective Exchange Rate for 6 countries |

Pipeline

The end-to-end flow from raw data to production forecasts.

Collection → Processing → NLP Enrichment → Panel Construction → Training → Inference

| Page | Description |
|---|---|
| Data Collection | Ingestion layer: async scraping, bounded concurrency, retry logic |
| Data Processing | PDF parsing, JSON→Parquet, encounter merging |
| NLP Enrichment | Embedding → Clustering → Labeling → Sentiment |
| Panel Construction | Feature engineering: 18 lag features, 6 lead targets |
| Training Pipeline | Out-of-time train/test split, 4 architectures |
| Inference Pipeline | Horizon-aware ensemble, production prediction flow |
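
The lag/lead idea behind Panel Construction can be sketched as follows. This is a minimal illustration, not the project's `build_panel` code: the function name `build_lag_lead_rows`, the column names, and the default of 3 lags are assumptions chosen for the example. Only the 6 lead targets come from the table above, and how the 18 lag features are split across variables is not stated on this page.

```python
from typing import Dict, List, Optional


def build_lag_lead_rows(
    series: List[float], n_lags: int = 3, n_leads: int = 6
) -> List[Dict[str, Optional[float]]]:
    """Turn one monthly series into rows of lag features and lead targets.

    lag_k holds the value k months before row t (a feature),
    lead_h holds the value h months after row t (a forecast target).
    Positions that fall outside the series are left as None.
    """
    rows: List[Dict[str, Optional[float]]] = []
    n = len(series)
    for t in range(n):
        row: Dict[str, Optional[float]] = {"t": t, "value": series[t]}
        for k in range(1, n_lags + 1):
            row[f"lag_{k}"] = series[t - k] if t - k >= 0 else None
        for h in range(1, n_leads + 1):
            row[f"lead_{h}"] = series[t + h] if t + h < n else None
        rows.append(row)
    return rows


if __name__ == "__main__":
    monthly = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
    for row in build_lag_lead_rows(monthly)[3:5]:
        print(row)
```

Rows near the start lack lags and rows near the end lack leads, so a real panel would have to drop or fill those rows before training.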

Models

Machine learning architectures and their roles in the ensemble.

| Page | Description |
|---|---|
| Random Forest | cuML GPU Random Forest; best at short horizons (Lead 1–2) |
| LSTM | MigrationLSTM; country-aware with SurgeJointLoss |
| Transformer | MigrationTransformer; best at long horizons (Lead 5–6) |
| Horizon-Aware Ensemble | Dynamic weighting: RF→short, Transformer→long |
| SurgeJointLoss | Dual-objective loss: Huber + BCE for crisis detection |
| Jina v5 Embeddings | TensorRT INT8 news article embeddings (768-dim) |
| Flan-T5 Summarization | TensorRT INT8 cluster labeling engine |
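
To make the "Huber + BCE" description of SurgeJointLoss concrete, here is a plain-Python sketch of a dual-objective loss. It is an illustration under assumptions, not the project's implementation: the weighting factor `alpha`, the Huber threshold `delta`, and the function names are hypothetical, and the real loss may combine or scale the two terms differently.

```python
import math
from typing import Sequence


def huber(pred: float, target: float, delta: float = 1.0) -> float:
    """Huber loss: quadratic for small errors, linear for large ones."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)


def bce(prob: float, label: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one predicted surge probability."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))


def joint_loss(
    preds: Sequence[float],
    targets: Sequence[float],
    surge_probs: Sequence[float],
    surge_labels: Sequence[float],
    alpha: float = 0.5,  # hypothetical weight on the classification term
    delta: float = 1.0,
) -> float:
    """Mean Huber loss on the level forecast plus alpha times mean BCE on the surge flag."""
    n = len(preds)
    regression = sum(huber(p, t, delta) for p, t in zip(preds, targets)) / n
    classification = sum(bce(q, y) for q, y in zip(surge_probs, surge_labels)) / n
    return regression + alpha * classification


if __name__ == "__main__":
    print(joint_loss([0.5, 3.0], [0.0, 0.0], [0.9, 0.2], [1.0, 0.0]))
```

The Huber term keeps the regression robust to extreme months, while the BCE term pushes the model to flag surge months explicitly.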

Analysis Methods

Statistical techniques driving the lead-lag and surge analysis.

| Page | Description |
|---|---|
| Lead-Lag Analysis | Pearson correlation at 0–6 month offsets |
| Surge Detection | Quantile-based and σ-threshold spike identification |
| Sentiment Analysis | Rule-based lexicon scoring for migration-relevant news |
| Event Clustering | HDBSCAN GPU clustering + LED label generation |
| Cross-Correlation Analysis | CCF analysis, VAR benchmarking, ADF stationarity tests |
| Multiple Comparison Correction | Benjamini-Hochberg FDR for 58 significant signals |
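
Two of the methods above are short enough to sketch in standard-library Python: Pearson correlation at 0–6 month offsets and the Benjamini-Hochberg procedure. This is a hedged illustration only. The function names, the toy series, and the FDR level `q=0.05` are assumptions, and the project's own analysis code may differ, for instance in how it handles missing months.

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def lead_lag(
    signal: Sequence[float], target: Sequence[float], max_lag: int = 6
) -> List[Tuple[int, float]]:
    """Correlate signal at month t with target at month t + lag, for lag 0..max_lag."""
    results = []
    for lag in range(max_lag + 1):
        x = signal[: len(signal) - lag]  # signal, trimmed at the end
        y = target[lag:]                 # target, shifted forward by lag
        results.append((lag, pearson(x, y)))
    return results


def benjamini_hochberg(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Return a reject flag per p-value, controlling the false discovery rate at q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            cutoff = rank  # largest rank whose p-value passes its threshold
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


if __name__ == "__main__":
    signal = [1, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13]
    target = [0, 0] + signal[:-2]  # toy target that trails the signal by 2 months
    print(max(lead_lag(signal, target), key=lambda pair: pair[1]))
    print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))
```

Scanning many country, signal, and lag combinations produces many p-values, which is why a correction such as Benjamini-Hochberg is applied before calling any single correlation significant.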

Key Findings

What the system discovered about migration predictability.

| Page | Description |
|---|---|
| Event-Visa Findings | News events as leading indicators (r=0.617 at 3-month lag) |
| Exchange Rate Findings | Exchange rate signals (DR r=0.498 at 2-month lag) |
| Model Performance | Ensemble results: F1=0.96 at Lead 1, F1=0.86 at Lead 6 |

Source Modules

Reference documentation for every src/ subpackage and key files.

| Page | Description |
|---|---|
| Main Entry Point | src/main.py CLI: bootstrap, collect-live, sync-data |
| Collection Module | src/collection/*: visa, encounter, news, trends, HF sync |
| Processing Module | src/processing/*: parse, merge, build_panel, summarize |
| Analysis Module | src/analysis/*: events, exchange_rate, trends_analysis, plots |
| Models Module | src/models/*: surge_model, train_and_evaluate, inference |
| News Scraper | Deep dive: batch decoding, checkpoint recovery, throttling |
| PDF Parser | Deep dive: PyMuPDF table extraction, VISA_MAP normalization |
| TensorRT Engines | Deep dive: Jina-v5, Flan-T5, LED TensorRT engines |
| Build Panel Detail | Deep dive: lag/lead construction, forward-fill strategies |
| HF Sync | Deep dive: bidirectional Hugging Face Hub sync |

Infrastructure

Compute, reproducibility, and operational details.

| Page | Description |
|---|---|
| GPU Acceleration | TensorRT INT8, cuML, CUDA streams, NVML profiling |
| Reproducibility | HF bootstrap, run.sh pipeline, dependency checking |
