Title: A Comparative Study of Neural Models for Direction-Aware NER and Relation Extraction in Transit Service Alerts
Objective:
The primary goal is to develop and compare machine learning models that can extract structured information from noisy MTA transit service alerts.
- NER: Identifying ROUTE (e.g., "M15") and DIRECTION (e.g., "Northbound") spans.
- RE: Linking each ROUTE to its correct DIRECTION.
Key Challenge:
Transit alerts are short (mean length: approximately 22 tokens) but dense, often containing multiple routes and directions (e.g., "Northbound M15 and Southbound M103 delays"). The model must resolve which direction applies to which route, i.e., the scope and context of each mention.
I implemented a Weak Supervision pipeline, detailed below, and reference the specific code modules at each step.
Code: 1_data_preprocessing.py
- Source: MTA Service Alerts (NYC Open Data).
- My Design Choice: Strict filtering to remove noise.
- Agency Filter: Kept only "NYCT Bus" and "NYCT Subway".
- Heuristics: Removed rows containing "this bus" (often generic), "bound track" (infrastructure, not service), and very short headers (fewer than 3 words); a minimal sketch of these filters follows.
- Result: A clean dataset of 227,210 unique alerts with 0% missing values (verified in MTA_eda_analysis.ipynb).
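To make the filtering concrete, here is a minimal pandas sketch of these heuristics. The column names ("agency", "header") and exact string checks are assumptions for illustration; the actual logic lives in 1_data_preprocessing.py.

```python
import pandas as pd

def filter_alerts(df: pd.DataFrame) -> pd.DataFrame:
    """Minimal sketch of the preprocessing heuristics (column names are assumed)."""
    # Agency filter: keep only NYCT Bus and NYCT Subway rows.
    df = df[df["agency"].isin({"NYCT Bus", "NYCT Subway"})].copy()

    text = df["header"].fillna("").str.lower()
    mask = (
        ~text.str.contains("this bus", regex=False)       # often generic phrasing
        & ~text.str.contains("bound track", regex=False)  # infrastructure, not service
        & (text.str.split().str.len() >= 3)               # drop very short headers
    )
    # Deduplicate and reindex to obtain the final clean set.
    return df[mask].drop_duplicates(subset="header").reset_index(drop=True)
```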
To overcome the lack of labeled training data, I implemented a two-stage, rule-based pipeline to generate Silver labels.
Code: 2_route_span_labeling.ipynb
First, I identified the exact spans of Routes (e.g., "Q44", "2 Train") mentioned in the text.
My Method: Agency-specific regex and context rules (sketched below):
- Bus: Matches standard patterns (e.g., Q, Bx, M prefixes) and handles suffixes (e.g., -SBS).
- Subway: Distinguishes between single letters (A, Q), digits (1-7), and special codes (SIR, shuttles). Crucially, it filters false positives (e.g., "E 149 St" is a street, not the E train).
- Normalization: Maps variants to canonical forms (e.g., "Q44-SBS" to "Q44").
- Stats: I updated the affected column for approximately 4% of alerts, where the text contained route codes not present in the original alert record.
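To illustrate the flavor of these route rules, here is a simplified sketch. The patterns and the street-name check are assumptions; the actual regexes and context rules in 2_route_span_labeling.ipynb are more extensive.

```python
import re

# Simplified patterns; the notebook uses richer, agency-specific rules.
BUS_ROUTE = re.compile(r"\b(?:Q|Bx|M|B|S)\d{1,3}(?:-SBS)?\b")
SUBWAY_ROUTE = re.compile(r"\b(?:[1-7]|[A-Z]|SIR)\b")
STREET = re.compile(r"\b[A-Z] \d+ (?:St|Ave)\b")  # "E 149 St" is a street, not the E train

def find_route_spans(text: str):
    street_spans = [m.span() for m in STREET.finditer(text)]
    spans = [(m.start(), m.end(), m.group()) for m in BUS_ROUTE.finditer(text)]
    for m in SUBWAY_ROUTE.finditer(text):
        # Reject single letters that sit inside a street expression.
        if any(s <= m.start() < e for s, e in street_spans):
            continue
        spans.append((m.start(), m.end(), m.group()))
    return sorted(spans)

def normalize_route(code: str) -> str:
    # Map variants to canonical forms, e.g. "Q44-SBS" -> "Q44".
    return code.removesuffix("-SBS")
```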
Code: 3_direction_labeling_silver.ipynb
Next, I detected and classified directions relative to the routes.
My Design Choice: I defined a strict Priority Hierarchy for labeling to resolve ambiguities (sketched after this list):
- BOROUGH ("Queens-bound") - Highest specificity.
- LOCAL ("Uptown", "Downtown").
- COMPASS ("Northbound", "Southbound").
- PLACE_BOUND ("JFK-bound"):
- Uses a Leftward Scan Algorithm starting from "bound".
- Includes Abbreviation Awareness (treats "St." as a single token rather than a scan-terminating period).
- Grammatical Rejection: Filters out misleading phrases such as "in both direction of".
- BOTH/EITHER - Valid directionality for alerts affecting both directions.
Outcome: ~254k silver-labeled direction spans loaded into the RE pipeline.
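To illustrate how such a priority hierarchy can be applied, here is a minimal sketch in which higher-specificity types claim spans before lower ones. The keyword lists are abbreviated, and the PLACE_BOUND pattern stands in for the full leftward scan with abbreviation awareness and grammatical rejection.

```python
import re

# Abbreviated keyword patterns; the notebook's scan and rejection rules are richer.
PRIORITY = [
    ("BOROUGH",     re.compile(r"\b(?:Queens|Brooklyn|Bronx|Manhattan|Staten Island)-bound\b", re.I)),
    ("LOCAL",       re.compile(r"\b(?:uptown|downtown)\b", re.I)),
    ("COMPASS",     re.compile(r"\b(?:north|south|east|west)bound\b", re.I)),
    ("PLACE_BOUND", re.compile(r"\b[\w.]+-bound\b", re.I)),           # e.g. "JFK-bound"
    ("BOTH",        re.compile(r"\b(?:both|either)\s+directions?\b", re.I)),
]

def label_directions(text: str):
    """Return (start, end, surface, type); higher-priority types claim spans first."""
    labeled, taken = [], []
    for dtype, pattern in PRIORITY:
        for m in pattern.finditer(text):
            if any(s < m.end() and m.start() < e for s, e in taken):
                continue  # span already claimed by a more specific type
            labeled.append((m.start(), m.end(), m.group(), dtype))
            taken.append(m.span())
    return sorted(labeled)

print(label_directions("Queens-bound M60 and Northbound M15 delays in both directions"))
```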
Code: 4_baseline_re.ipynb
This notebook generates the silver relation labels (linking each Route to its Direction).
- My Design Choice: A Two-Pass Segment Algorithm (sketched below):
Pass 1 (Left-to-Right): Tracks the Active Direction until a Breakpoint (newline, parenthesis, colon) resets the context.
Pass 2 (Look-Ahead): Recovers unpaired routes by searching forward within the same segment (e.g., linking "M15" to "both directions" in "M15 delays in both directions").
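A compact sketch of this two-pass pairing is below. The span format and the breakpoint handling are simplified assumptions; 4_baseline_re.ipynb implements the full logic.

```python
import re

def segment_bounds(text):
    # Breakpoints (newline, parenthesis, colon) split the alert into segments.
    cuts = [0] + [m.end() for m in re.finditer(r"[\n():]", text)] + [len(text)]
    return list(zip(cuts[:-1], cuts[1:]))

def pair_routes_with_directions(text, routes, directions):
    """routes / directions are lists of (start, end, surface) spans."""
    pairs, unpaired = [], []
    for seg_start, seg_end in segment_bounds(text):
        seg_routes = [r for r in routes if seg_start <= r[0] < seg_end]
        seg_dirs = [d for d in directions if seg_start <= d[0] < seg_end]

        # Pass 1 (left-to-right): the most recent direction is "active" for later routes.
        events = sorted([(d[0], "DIR", d) for d in seg_dirs] +
                        [(r[0], "ROUTE", r) for r in seg_routes])
        active, seg_unpaired = None, []
        for _, kind, span in events:
            if kind == "DIR":
                active = span
            elif active is not None:
                pairs.append((span[2], active[2]))
            else:
                seg_unpaired.append(span)

        # Pass 2 (look-ahead): unpaired routes grab the next direction in the same segment.
        for r in seg_unpaired:
            following = [d for d in seg_dirs if d[0] > r[1]]
            if following:
                pairs.append((r[2], following[0][2]))
            else:
                unpaired.append(r[2])
    return pairs, unpaired
```

For "M15 delays in both directions", Pass 1 leaves "M15" unpaired (no direction precedes it) and Pass 2 then links it to "both directions".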
EDA Statistics for RE:
- Coverage: 79.4% of alerts contain extractable relations.
- Total Pairs: 293,474 relation pairs.
- Distribution: Southbound (30.4%) and Northbound (29.8%) dominate, but Both Directions (21.5%) is also a major category that required explicit handling.
Code: 5_create_gold_dataset.py
- Status: Annotation in progress (not yet complete).
- My Design Choice: I stratified sampling by complexity, since random sampling would be dominated by simple one-route, one-direction cases (see the sketch after this list).
I enforced a specific distribution:
- Simple (50%): at most 2 entities (e.g., "M1 delays").
- Moderate (30%): 3-4 entities.
- Complex (20%): 5+ entities (e.g., multiple routes and directions).
- I also ensured diversity of direction types within each bucket. Total samples: 600.
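A rough sketch of the stratified draw is below, assuming a precomputed entity count per alert (the column name n_entities is illustrative).

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, total: int = 600, seed: int = 42) -> pd.DataFrame:
    """Draw a complexity-stratified gold annotation sample (50/30/20 split)."""
    buckets = {
        "simple":   (df["n_entities"] <= 2,          0.50),
        "moderate": (df["n_entities"].between(3, 4), 0.30),
        "complex":  (df["n_entities"] >= 5,          0.20),
    }
    parts = [df[mask].sample(n=int(total * frac), random_state=seed)
             for mask, frac in buckets.values()]
    # Shuffle so annotators do not see all the simple cases first.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)
```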
Code: 6_bilstm_ner.ipynb, 6_deberta_ner.ipynb
Both models are trained on the Silver data to learn the labeling logic.
- My Design Choice for Class Imbalance Handling: The dataset is dominated by "O" tags (approximately 85%). I implemented Weighted Cross-Entropy Loss with specific boost factors (sketched after this list):
- I-ROUTE & I-DIRECTION: 1.5x boost (to encourage continuity)
- B-DIRECTION: 1.5x boost (rare compared to routes)
- B-ROUTE: 1.2x boost
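A minimal sketch of this weighted loss setup follows; the BIO tag order is illustrative, and the real index mapping comes from the label vocabulary.

```python
import torch
import torch.nn as nn

# Illustrative BIO tag order; boosts follow the factors listed above.
TAGS = ["O", "B-ROUTE", "I-ROUTE", "B-DIRECTION", "I-DIRECTION"]
BOOST = {"B-ROUTE": 1.2, "I-ROUTE": 1.5, "B-DIRECTION": 1.5, "I-DIRECTION": 1.5}

weights = torch.tensor([BOOST.get(tag, 1.0) for tag in TAGS])
criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)

# logits: (batch, seq_len, num_tags); labels: (batch, seq_len), padding marked -100
# loss = criterion(logits.view(-1, len(TAGS)), labels.view(-1))
```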
Code: 6_bilstm_ner.ipynb
I implemented a custom PyTorch model trained from scratch.
1. Architecture Details:
- Embeddings:
- Words: 128-dim embeddings (Learned from scratch).
- Characters: 50-dim embeddings -> CharCNN (50 filters, kernels [3,4,5]) to capture morphology (e.g. "-bound").
- Encoder: Bidirectional LSTM (Hidden Dim: 256, Layers: 2, Dropout: 0.3).
- Decoder: CRF (Conditional Random Field) for global sequence validity.
2. Hyperparameters:
- Optimizer: Adam (LR: 1e-3) with ReduceLROnPlateau.
- Batch Size: 64.
- Loss: Weighted Cross-Entropy (Fallback) / CRF Loss.
- Epochs: 3
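The skeleton below mirrors these settings. It is a sketch, not the notebook code: the CRF decoder (e.g., from the pytorch-crf package) is omitted, so the module only produces per-token emission scores.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """BiLSTM encoder with word + CharCNN embeddings; dimensions follow the report."""

    def __init__(self, vocab_size, char_vocab_size, num_tags,
                 word_dim=128, char_dim=50, char_filters=50,
                 hidden_dim=256, num_layers=2, dropout=0.3):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim, padding_idx=0)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
        # CharCNN: parallel convolutions with kernel sizes 3, 4, 5 over characters.
        self.char_convs = nn.ModuleList(
            [nn.Conv1d(char_dim, char_filters, k, padding=k // 2) for k in (3, 4, 5)]
        )
        self.lstm = nn.LSTM(word_dim + 3 * char_filters, hidden_dim,
                            num_layers=num_layers, bidirectional=True,
                            dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.emissions = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq); char_ids: (batch, seq, max_word_len)
        b, s, w = char_ids.shape
        chars = self.char_emb(char_ids).view(b * s, w, -1).transpose(1, 2)
        char_feats = torch.cat(
            [conv(chars).max(dim=2).values for conv in self.char_convs], dim=1
        ).view(b, s, -1)
        x = torch.cat([self.word_emb(word_ids), char_feats], dim=2)
        out, _ = self.lstm(self.dropout(x))
        return self.emissions(self.dropout(out))  # emission scores for a CRF or softmax loss
```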
Code: 6_deberta_ner.ipynb
I fine-tuned microsoft/deberta-v3-base using a manual training loop.
1. Architecture & Training:
- Model: Pretrained Transformer with a Token Classification Head.
2. Differential Learning Rates:
- Base Model: 3e-5 (Preserve pretrained knowledge).
- Classifier Head: 1e-4 (Learn task fast).
3. Optimization:
- Batch Size: 128.
- Scheduler: Linear Warmup (10% of steps).
- Precision: Mixed Precision (FP16) via GradScaler.
- Gradient Clipping: Norm 1.0 (Stability).
- Epochs: 3
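The sketch below shows how these choices fit together (differential learning rates, linear warmup, FP16, and gradient clipping) using the Hugging Face transformers API. Batch construction, num_labels, and total_steps are placeholders.

```python
import torch
from torch.cuda.amp import GradScaler, autocast
from transformers import AutoModelForTokenClassification, get_linear_schedule_with_warmup

model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=5   # placeholder label count
).cuda()

# Differential learning rates: small LR for the pretrained encoder, larger for the head.
optimizer = torch.optim.AdamW([
    {"params": model.deberta.parameters(), "lr": 3e-5},
    {"params": model.classifier.parameters(), "lr": 1e-4},
])
total_steps = 1_000  # placeholder: len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.10 * total_steps), num_training_steps=total_steps
)
scaler = GradScaler()

def training_step(batch):
    optimizer.zero_grad()
    with autocast():                             # FP16 forward pass
        loss = model(**batch).loss
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip gradients at norm 1.0
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
    return loss.item()
```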
I compared the models on their ability to reproduce the Silver labels (test set: 15% split).
Test scores:
| Model | Precision | Recall | F1-Score |
|---|---|---|---|
| BiLSTM-CRF | 0.9973 | 0.9916 | 0.9944 |
| DeBERTa-v3 | 0.9847 | 0.9993 | 0.9920 |
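For reference, entity-level precision/recall/F1 of this kind can be computed with seqeval; whether the notebooks use seqeval or a custom scorer is not stated, so treat this as an illustrative check.

```python
from seqeval.metrics import classification_report, f1_score

# Toy example: each inner list is the BIO tag sequence for one alert.
y_true = [["B-DIRECTION", "B-ROUTE", "O", "O"]]
y_pred = [["B-DIRECTION", "B-ROUTE", "O", "O"]]

print(f1_score(y_true, y_pred))           # entity-level F1
print(classification_report(y_true, y_pred))
```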
My Analysis:
The F1 and other scores are near-perfect, but I believe this is mostly because the data is simple and uses highly standardized, templated language. The regex labeling I used matches these text patterns exactly (I may have leaned too far toward feature engineering). Since about 72% of alerts place a direction immediately next to a route name, as in "Northbound M15 delays," the models learn the pattern too easily. The direction-labeling rules resolve over 80% of cases by word proximity alone.
I suspect the models are memorizing the templates: they likely overfit to where words sit in the sentence rather than understanding the text, especially in the roughly 8% of alerts that are more ambiguous.
When I evaluate on the gold dataset with its harder examples, I expect the scores to drop to around 0.90. That should show whether the models actually generalize when the simple regex-style patterns do not apply.
- Completed - Data Pipeline: Silver dataset is generated and robust.
- Completed - Design: Stratified sampling design is implemented to ensure rigorous evaluation.
- Completed - Neural NER: Implementation complete and validated on Silver data.
- In Progress - Gold Annotation: Currently in progress (600 samples).
- Next Step - Comparison of heuristic RE against the gold dataset.
- Next Step - Using the gold dataset to compare the neural models against the weak supervision baseline.