This repository presents a physics-motivated machine learning study for event classification, focusing on the relationship between variance structure, linear separability, and nonlinear decision boundaries.
The analysis emphasizes interpretability, robustness, and physics-aware evaluation rather than black-box optimization.
A formal technical research report documenting the full physics-driven machine learning analysis, methodology, and results is available below:
In high-energy physics analyses, classification performance is often constrained by the structure of the available observables rather than by model complexity.
This project challenges a common implicit assumption in data-driven analyses: that directions of large variance necessarily carry discriminative information. We systematically investigate the relationship between variance structure, linear separability, and nonlinear decision boundaries in a physics context.
The workflow follows a structured, research-oriented pipeline:
- Data cleaning and preprocessing using reusable utility functions
- Exploratory data analysis
- Dimensionality reduction via Principal Component Analysis (PCA)
- Linear classification as an interpretable baseline
- Nonlinear classification to probe weak nonlinear structure
- Physics-motivated evaluation using weighted metrics and signal significance
- Initial data inspection and exploratory analysis
- Feature distributions and class imbalance
- Validation of event weights and basic statistics
- Logistic Regression as a linear baseline
- Weighted ROC and Precision–Recall evaluation
- Threshold scans and physics-motivated signal significance
- Serves as an interpretable reference model
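A minimal sketch of the weighted linear baseline described above, using a synthetic dataset as a stand-in for the collider events (the real feature columns and weights come from the Kaggle dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the event sample: features X, labels y,
# and per-event weights w (uniform here for simplicity).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
w = np.ones(len(y))

X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X, y, w, test_size=0.3, random_state=0)

# Logistic Regression trained with per-event weights.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr, sample_weight=w_tr)

# Weighted ROC AUC: each event contributes according to its weight.
scores = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, scores, sample_weight=w_te)
print(f"weighted ROC AUC: {auc:.3f}")
```

The same `sample_weight` mechanism extends to Precision–Recall curves, which keeps the linear baseline directly comparable to the weighted nonlinear models later in the pipeline.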
- Principal Component Analysis of the feature space
- Variance explained by leading components
- 2D and 3D PCA visualizations
- Demonstrates misalignment between variance-dominant directions and class separation
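The PCA step can be sketched as follows; the synthetic features are a placeholder for the dataset's observables, and standardization before PCA is assumed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Standardize features so no single observable dominates the variance.
X_std = StandardScaler().fit_transform(X)

# Fit PCA and inspect how much variance the leading components capture.
pca = PCA(n_components=3).fit(X_std)
print("explained variance ratio:", pca.explained_variance_ratio_)
print("cumulative:", np.cumsum(pca.explained_variance_ratio_))

# 2D projection for visualization (e.g. a scatter plot colored by class);
# large captured variance here does not imply class separation.
X_2d = pca.transform(X_std)[:, :2]
```

Plotting `X_2d` colored by label is what exposes the misalignment: components that explain most of the variance can still leave the two classes heavily overlapping.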
- Linear SVM as a robust baseline
- RBF kernel SVM to probe nonlinear discriminative structure
- Validation-based hyperparameter selection
- ROC, Precision–Recall, and physics-motivated threshold optimization
- Explicit discussion of computational cost and overfitting considerations
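A compact sketch of the validation-based kernel comparison, again on synthetic data; the candidate `C`/`gamma` values are illustrative, not the grid used in the study:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1500, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Compare a linear SVM against RBF kernels; hyperparameters are chosen
# on the validation split only, never on the test set.
best = None
for kernel, params in [("linear", {"C": 1.0}),
                       ("rbf", {"C": 1.0, "gamma": "scale"}),
                       ("rbf", {"C": 10.0, "gamma": 0.1})]:
    clf = SVC(kernel=kernel, **params).fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, clf.decision_function(X_val))
    if best is None or auc > best[0]:
        best = (auc, kernel, params)
print("best on validation:", best)
```

If the RBF kernels beat the linear SVM only marginally, as found here, that is direct evidence of weak nonlinear structure; the comparison also makes the extra computational cost of kernel methods explicit.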
This study uses the "Particle Physics Event Classification Dataset" available on Kaggle:
The dataset contains 250,001 collider-inspired events labeled as signal or background, with per-event weights used in physics-driven evaluation.
- Centralized data cleaning and preprocessing utilities
- Feature selection, scaling, and weight handling
- Ensures consistency across all notebooks
- Designed for reuse and modularity
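A sketch of what such a centralized utility might look like; the function name `prepare_events` and the column names are hypothetical, not the repository's actual API:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def prepare_events(df, feature_cols, label_col="label", weight_col="weight"):
    """Hypothetical cleaning utility: drop incomplete rows, scale the
    selected features, and return (X, y, w) arrays for any notebook."""
    df = df.dropna(subset=feature_cols + [label_col, weight_col])
    X = StandardScaler().fit_transform(df[feature_cols].to_numpy())
    y = df[label_col].to_numpy()
    w = df[weight_col].to_numpy(dtype=float)
    return X, y, w

# Minimal usage on a toy frame; the row with a missing feature is dropped.
toy = pd.DataFrame({"pt": [10.0, 20.0, np.nan],
                    "eta": [0.1, -1.2, 0.5],
                    "label": [1, 0, 1],
                    "weight": [0.8, 1.2, 1.0]})
X, y, w = prepare_events(toy, ["pt", "eta"])
print(X.shape, y, w)
```

Routing every notebook through one function like this is what guarantees the consistency claimed above: the same rows are dropped and the same scaling is applied everywhere.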
- The leading PCA components capture a large fraction of the total variance but do not provide strong class separation.
- Linear models already capture most of the discriminative power in the data.
- Nonlinear kernels introduce only modest performance improvements, indicating weak nonlinear structure.
- Optimal physics performance is achieved through decision threshold optimization rather than reliance on global metrics alone.
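The threshold optimization can be sketched with the common approximate significance $S/\sqrt{S+B}$, where $S$ and $B$ are the weighted signal and background yields passing the cut; the toy scores below stand in for a real classifier's output:

```python
import numpy as np

def significance_scan(scores, labels, weights, thresholds):
    """Approximate significance S / sqrt(S + B) at each score threshold,
    with S and B computed as weighted signal/background yields."""
    out = []
    for t in thresholds:
        sel = scores >= t
        S = weights[sel & (labels == 1)].sum()
        B = weights[sel & (labels == 0)].sum()
        out.append(S / np.sqrt(S + B) if S + B > 0 else 0.0)
    return np.array(out)

# Toy classifier output: signal scores shifted upward, uniform weights.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 5000)
scores = rng.normal(loc=labels * 1.5, scale=1.0)
weights = np.ones(5000)

thresholds = np.linspace(-2, 4, 61)
sig = significance_scan(scores, labels, weights, thresholds)
best_t = thresholds[np.argmax(sig)]
print(f"best threshold: {best_t:.2f}, significance: {sig.max():.2f}")
```

The threshold maximizing this curve generally differs from the one maximizing accuracy or F1, which is why the study reports significance-optimized working points rather than relying on global metrics alone.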
- All models are evaluated using weighted metrics to account for physical cross sections.
- Model selection is performed exclusively on validation data.
- Final performance is reported on an independent test set.
- Emphasis is placed on robustness and interpretability rather than aggressive hyperparameter tuning.
- The study is limited to a specific dataset and feature representation; results may not generalize to higher-dimensional or more complex observables.
- Only classical ML models are considered; future work may explore physics-informed neural networks or representation learning.
- The relationship between detector effects and feature separability is not explicitly modeled.
This project is intended as a research-oriented study suitable for graduate-level work in particle physics and machine learning. It prioritizes physical insight and methodological clarity over benchmark-driven performance. The scope is intentionally focused on interpretable models and controlled analysis rather than large-scale optimization or complex model architectures.
Future work will extend this study to include more advanced machine learning methods and richer model classes.