A machine learning pipeline that predicts cloud-to-ground lightning strikes over Hong Kong up to 5 minutes in advance, using historical LLIS sensor data and an LSTM neural network.
- Overview
- Features
- Project Structure
- Requirements
- Installation
- Usage
- How It Works
- Model Architecture
- Evaluation Metrics
- Configuration
- Known Limitations
- License
This project trains a separate binary classifier for each of five prediction windows (1, 2, 3, 4, and 5 minutes). Given a 10-minute history of lightning activity (one feature vector per minute) within a bounding box covering Hong Kong, each LSTM model predicts whether a cloud-to-ground strike will occur in the next N minutes. Using an LSTM allows the model to capture temporal patterns in lightning activity over the history window.
| Prediction Window | Task |
|---|---|
| 1 minute | Very short-term nowcasting |
| 2 – 3 minutes | Short-term warning |
| 4 – 5 minutes | Early advisory |
- Loads and parses raw LLIS (Lightning Location and Information System) data files
- Filters for cloud-to-ground strikes only (`cloud_indicator = 0`)
- Creates temporal sequences with 1-minute resolution over a 10-minute history window
- Spatial filtering to the Hong Kong bounding box
- Handles severe class imbalance via computed class weights
- LSTM architecture with BatchNormalization, Dropout, and L2 regularisation
- Early stopping and learning rate scheduling via callbacks
- Saves both the trained model (`.keras`) and fitted scaler (`.pkl`) per window
- Prints a final side-by-side comparison table across all five windows
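The class-weight step listed above can be sketched as follows. This README does not state how the weights are computed, so the sketch assumes the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is the formula scikit-learn uses for `class_weight="balanced"`. The function name `balanced_class_weights` is hypothetical.

```python
import numpy as np


def balanced_class_weights(labels: np.ndarray) -> dict[int, float]:
    """Return {class: weight} with weight = n_samples / (n_classes * class_count).

    Assumption: the "balanced" heuristic; the actual script may differ.
    """
    classes, counts = np.unique(labels, return_counts=True)
    n_samples = labels.size
    return {
        int(c): n_samples / (len(classes) * int(n))
        for c, n in zip(classes, counts)
    }


# Toy labels with roughly the 3% positive rate mentioned in this README.
y_train = np.array([0] * 97 + [1] * 3)
class_weight = balanced_class_weights(y_train)
# The rare positive class receives a much larger weight than the negative class.
```

A dictionary of this form can be passed to Keras through `model.fit(..., class_weight=class_weight)`, so that errors on the rare positive class contribute more to the loss.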
.
├── llis_20250315/ # Raw LLIS lightning data files (not included)
├── model/ # Saved models and scalers (auto-created)
│ ├── lightning_model_1min.keras
│ ├── scaler_1min.pkl
│ └── ...
├── .logs/fit/ # TensorBoard logs (auto-created)
├── lightning_forecast.py
└── README.md
- Python 3.10+
- TensorFlow / Keras 3.x
- scikit-learn
- pandas
- numpy
# Clone the repository
git clone https://github.com/phching/lightning-forecast.git
cd lightning-forecast
# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
# Install dependencies
pip install tensorflow scikit-learn pandas numpy
- Place your LLIS data files inside the `llis_20250315/` folder.
- Run the pipeline:
uv run lightning_forecast.py
- Monitor training in TensorBoard (optional):
tensorboard --logdir .logs/fit
Trained models and scalers are saved automatically to the `model/` directory.
import pickle

import numpy as np
import keras

# Load the trained 3-minute model and its matching scaler.
model = keras.models.load_model("model/lightning_model_3min.keras")
with open("model/scaler_3min.pkl", "rb") as f:
    scaler = pickle.load(f)

# Example: 10 minutes of history x 9 features
# features: [strike_count, avg_peak_current, max_peak_current, std_peak_current,
#            avg_num_sensors, avg_chi_square, hour, day_of_week, is_weekend]
history_sequence = np.random.rand(10, 9).astype(np.float32)  # replace with real data
# Scale each minute's feature row, then restore the (batch, time steps, features) shape.
history_scaled = scaler.transform(history_sequence.reshape(-1, 9)).reshape(1, 10, 9)
prob = model.predict(history_scaled, verbose=0)[0][0]
print(f"Probability of strike in next 3 minutes: {prob:.2%}")

Raw LLIS files
│
▼
Load & filter cloud_indicator = 0 (CG strikes only)
│
▼
Sequence extraction 10-min history (1 vector per minute)
│
▼
Labelling Did any CG strike occur in the next N minutes?
│
▼
Train / test split Stratified 80/20 split per window
│
▼
StandardScaler Fitted on train set only — applied to test set
│
▼
LSTM training Class weights · EarlyStopping · ReduceLROnPlateau
│
▼
Evaluation Precision · Recall · F1 · ROC-AUC · confusion matrix
│
▼
Save .keras model + .pkl scaler (one pair per window)
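The sequence extraction and labelling steps in the flow above can be illustrated with a small sliding-window sketch. It assumes the data has already been aggregated to one feature row and one strike count per minute; the function name `make_sequences` and its arguments are hypothetical, not taken from the script.

```python
import numpy as np


def make_sequences(features, strike_counts, history=10, horizon=3):
    """Build (X, y) from per-minute data.

    features:      (n_minutes, n_features) array, one row per minute.
    strike_counts: (n_minutes,) CG strike count per minute.
    X[i] holds `history` consecutive minutes; y[i] is 1 if any strike
    occurs in the `horizon` minutes that immediately follow.
    """
    X, y = [], []
    for end in range(history, len(features) - horizon + 1):
        X.append(features[end - history:end])
        y.append(int(strike_counts[end:end + horizon].sum() > 0))
    return np.asarray(X, dtype=np.float32), np.asarray(y, dtype=np.int64)


# Toy example: 20 minutes, 9 features, a single strike at minute 12.
rng = np.random.default_rng(0)
features = rng.random((20, 9))
strike_counts = np.zeros(20, dtype=int)
strike_counts[12] = 1

X, y = make_sequences(features, strike_counts, history=10, horizon=3)
```

Only the three windows whose 3-minute horizon covers minute 12 are labelled positive here, which also shows why positives are rare in practice.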
| Boundary | Value |
|---|---|
| Latitude min | 22.15° N |
| Latitude max | 22.55° N |
| Longitude min | 113.85° E |
| Longitude max | 114.45° E |
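A minimal sketch of the spatial filter implied by the table above. Treating the box edges as inclusive is an assumption, and the function name `in_hong_kong` is hypothetical.

```python
import numpy as np

# Hong Kong bounding box from the table above (degrees).
LAT_MIN, LAT_MAX = 22.15, 22.55
LON_MIN, LON_MAX = 113.85, 114.45


def in_hong_kong(lat: np.ndarray, lon: np.ndarray) -> np.ndarray:
    """Boolean mask: True where a strike lies inside the box (edges inclusive)."""
    return (
        (lat >= LAT_MIN) & (lat <= LAT_MAX)
        & (lon >= LON_MIN) & (lon <= LON_MAX)
    )


# Four example strike locations; only the first lies inside the box.
lat = np.array([22.30, 22.60, 22.20, 21.90])
lon = np.array([114.17, 114.00, 113.80, 114.20])
mask = in_hong_kong(lat, lon)
```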
| Feature | Description |
|---|---|
| `strike_count` | Number of CG strikes in the past 10 minutes |
| `avg_peak_current` | Mean absolute peak current (kA) |
| `max_peak_current` | Maximum absolute peak current (kA) |
| `std_peak_current` | Standard deviation of peak current |
| `avg_num_sensors` | Average number of sensors that detected each strike |
| `avg_chi_square` | Average chi-square value (location accuracy indicator) |
| `hour` | Hour of day (0–23) |
| `day_of_week` | Day of week (0 = Monday, 6 = Sunday) |
| `is_weekend` | Binary flag: 1 if Saturday or Sunday |
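One way the nine features could be derived is a per-minute aggregation in pandas, sketched below. The input column names (`time`, `peak_current`, `num_sensors`, `chi_square`) and the function name `minute_features` are hypothetical, since this README does not describe the LLIS file layout. Aggregating per minute, filling empty minutes with zeros, and using the sample standard deviation are likewise assumptions.

```python
import pandas as pd


def minute_features(strikes: pd.DataFrame) -> pd.DataFrame:
    """Aggregate individual CG strikes into one feature row per minute."""
    df = strikes.copy()
    df["abs_current"] = df["peak_current"].abs()
    minute = df["time"].dt.floor("min")

    feats = df.groupby(minute).agg(
        strike_count=("abs_current", "count"),
        avg_peak_current=("abs_current", "mean"),
        max_peak_current=("abs_current", "max"),
        std_peak_current=("abs_current", "std"),
        avg_num_sensors=("num_sensors", "mean"),
        avg_chi_square=("chi_square", "mean"),
    )

    # Insert rows for minutes without strikes; undefined values become 0
    # (this also covers the std of a minute with a single strike).
    full_index = pd.date_range(feats.index.min(), feats.index.max(), freq="min")
    feats = feats.reindex(full_index).fillna(0.0)
    feats["strike_count"] = feats["strike_count"].astype(int)

    # Calendar features taken from the minute timestamp.
    feats["hour"] = feats.index.hour
    feats["day_of_week"] = feats.index.dayofweek  # 0 = Monday
    feats["is_weekend"] = (feats.index.dayofweek >= 5).astype(int)
    return feats


# Three made-up strikes on 2025-03-15 (a Saturday).
strikes = pd.DataFrame({
    "time": pd.to_datetime([
        "2025-03-15 12:00:10", "2025-03-15 12:00:40", "2025-03-15 12:02:05",
    ]),
    "peak_current": [-20.0, 30.0, -10.0],
    "num_sensors": [4, 6, 5],
    "chi_square": [1.0, 2.0, 1.5],
})
feats = minute_features(strikes)
```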
Input (shape = (10, 9)) # time steps x features
│
LSTM(64, return_sequences=True) + BatchNormalization + Dropout(0.3)
│
LSTM(32, return_sequences=False) + BatchNormalization + Dropout(0.3)
│
Dense(16) + ReLU + Dropout(0.2)
│
Dense(1) + Sigmoid
│
Output: probability of CG strike in next N minutes
- Optimiser: Adam (lr = 0.001)
- Loss: Binary cross-entropy
- Regularisation: L2 (λ = 0.002) + Dropout
- Early stopping: patience = 7 epochs on val_loss
- LR scheduler: ReduceLROnPlateau (factor = 0.5, patience = 3)
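The architecture and training settings above could be expressed in Keras 3 roughly as follows. This is a sketch rather than the script's actual code: the function name `build_model`, applying L2 to the kernels of the LSTM and hidden Dense layers, the AUC metric, and `restore_best_weights=True` are all assumptions.

```python
import keras
from keras import layers, regularizers


def build_model(timesteps=10, n_features=9, l2=0.002, learning_rate=0.001):
    """Stacked LSTM binary classifier following the diagram above."""
    reg = regularizers.l2(l2)  # assumption: L2 on kernel weights only
    model = keras.Sequential([
        keras.Input(shape=(timesteps, n_features)),
        layers.LSTM(64, return_sequences=True, kernel_regularizer=reg),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.LSTM(32, return_sequences=False, kernel_regularizer=reg),
        layers.BatchNormalization(),
        layers.Dropout(0.3),
        layers.Dense(16, activation="relu", kernel_regularizer=reg),
        layers.Dropout(0.2),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=[keras.metrics.AUC(name="auc")],  # assumption: extra metric
    )
    return model


# Callbacks matching the early-stopping and LR-scheduler settings listed above.
callbacks = [
    keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=7, restore_best_weights=True
    ),
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
]

model = build_model()
```

The callbacks list would be passed to `model.fit(..., callbacks=callbacks)` together with a validation split, since both callbacks monitor `val_loss`.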
Because lightning events are rare (~3% of samples are positive), accuracy alone is misleading. The pipeline reports:
| Metric | Why it matters |
|---|---|
| Precision | Of all predicted strikes, how many were real? |
| Recall | Of all real strikes, how many did we catch? |
| F1 score | Harmonic mean of precision and recall |
| ROC-AUC | Model's ability to rank positives above negatives |
| Confusion matrix | Breakdown of true/false positives and negatives |
Recall is the most safety-critical metric — missing a real strike is worse than a false alarm.
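A minimal sketch of how these metrics could be computed with scikit-learn from predicted probabilities. The helper name `evaluate` and the toy arrays are hypothetical; the default threshold of 0.5 matches the configured value.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate(y_true, y_prob, threshold=0.5):
    """Compute the reported metrics at a given decision threshold."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        # ROC-AUC uses the raw probabilities, so it does not depend on the threshold.
        "roc_auc": roc_auc_score(y_true, y_prob),
        "confusion_matrix": confusion_matrix(y_true, y_pred, labels=[0, 1]),
    }


# Made-up labels and probabilities for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4, 0.1, 0.9]

report = evaluate(y_true, y_prob)                       # threshold 0.5
report_low = evaluate(y_true, y_prob, threshold=0.35)   # favours recall
```

On this toy data, lowering the threshold catches the strike that was missed at 0.5, which is the precision-for-recall trade-off that matters when missed strikes are the costlier error.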
All hyperparameters are centralised in the Config dataclass at the top of the script:
from dataclasses import dataclass, field

@dataclass
class Config:
    data_folder: str = "llis_20250315"
    time_windows: list = field(default_factory=lambda: [1, 2, 3, 4, 5])
    history_window_minutes: int = 10
    epochs: int = 50
    early_stopping_patience: int = 7
    learning_rate: float = 0.001
    dropout_rate: float = 0.3
    threshold: float = 0.5
    ...

No magic numbers are scattered across the code: edit one place to change any setting.
- Small dataset: the data covers a single date (`20250315`), which limits generalisation and makes the class imbalance more pronounced.
- No spatial resolution: the model treats all strikes within the HK bounding box as equivalent; it does not predict where a strike will land.
- Static threshold: the 0.5 decision threshold is fixed; adjusting it can trade precision for recall depending on the use case.
- No temporal cross-validation: the train/test split is random, which can allow future data to leak into training. A time-based split would be more rigorous.
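The time-based split suggested in the last limitation could look like the sketch below, which trains on the earliest samples and tests on the latest. The function name `chronological_split` and the `timestamps` argument (one value per sample, for example the minute at which its history window ends) are hypothetical and not part of the current script.

```python
import numpy as np


def chronological_split(X, y, timestamps, test_fraction=0.2):
    """Split so that every test sample is later than every training sample."""
    order = np.argsort(timestamps, kind="stable")
    cut = int(round(len(order) * (1 - test_fraction)))
    train_idx, test_idx = order[:cut], order[cut:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Toy data: ten samples given in shuffled time order.
timestamps = np.array([5, 3, 9, 0, 7, 1, 8, 2, 6, 4])
X = (timestamps * 10).reshape(-1, 1)
y = (timestamps >= 8).astype(int)

X_train, X_test, y_train, y_test = chronological_split(X, y, timestamps)
```

Because consecutive history windows overlap, samples near the cut still share minutes; dropping a gap of at least one history window between the two sets would reduce that residual leakage. Unlike a stratified split, a chronological split does not guarantee positives in both sets, which matters with data from a single day.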
This project is released under the MIT License.