This repository presents a physics-motivated machine learning study for event classification, focusing on the relationship between variance structure, linear separability, and nonlinear decision boundaries.
The analysis emphasizes interpretability, robustness, and physics-aware evaluation rather than black-box optimization.
A formal technical research report documenting the full physics-driven machine learning analysis, methodology, and results is available below:
In high-energy physics analyses, classification performance is often constrained by the structure of the available observables rather than by model complexity.
This project challenges a common implicit assumption in data-driven analyses: that directions of large variance necessarily carry discriminative information. We systematically investigate the relationship between variance structure, linear separability, and nonlinear decision boundaries in a physics context.
The workflow follows a structured, research-oriented pipeline:
- Data cleaning and preprocessing using reusable utility functions
- Exploratory data analysis
- Dimensionality reduction via Principal Component Analysis (PCA)
- Linear classification as an interpretable baseline
- Nonlinear classification to probe weak nonlinear structure
- Physics-motivated evaluation using weighted metrics and signal significance
- Initial data inspection and exploratory analysis
- Feature distributions and class imbalance
- Validation of event weights and basic statistics
- Logistic Regression as a linear baseline
- Weighted ROC and Precision–Recall evaluation
- Threshold scans and physics-motivated signal significance
- Serves as an interpretable reference model
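A minimal sketch of the weighted linear baseline described above, using a synthetic dataset as a stand-in for the collider events (the real feature columns and weights come from the Kaggle dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the event sample: features X, labels y,
# and per-event weights w (uniform here for simplicity).
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
w = np.ones(len(y))

X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X, y, w, test_size=0.3, random_state=0)

# Logistic Regression trained with per-event weights.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr, sample_weight=w_tr)

# Weighted ROC AUC: each event contributes according to its weight.
scores = clf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, scores, sample_weight=w_te)
print(f"weighted ROC AUC: {auc:.3f}")
```

The same `sample_weight` mechanism extends to Precision–Recall curves, which keeps the linear baseline directly comparable to the weighted nonlinear models later in the pipeline.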
- Principal Component Analysis of the feature space
- Variance explained by leading components
- 2D and 3D PCA visualizations
- Demonstrates misalignment between variance-dominant directions and class separation
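The PCA step can be sketched as follows; the synthetic features are a placeholder for the dataset's observables, and standardization before PCA is assumed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Standardize features so no single observable dominates the variance.
X_std = StandardScaler().fit_transform(X)

# Fit PCA and inspect how much variance the leading components capture.
pca = PCA(n_components=3).fit(X_std)
print("explained variance ratio:", pca.explained_variance_ratio_)
print("cumulative:", np.cumsum(pca.explained_variance_ratio_))

# 2D projection for visualization (e.g. a scatter plot colored by class);
# large captured variance here does not imply class separation.
X_2d = pca.transform(X_std)[:, :2]
```

Plotting `X_2d` colored by label is what exposes the misalignment: components that explain most of the variance can still leave the two classes heavily overlapping.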
- Linear SVM as a robust baseline
- RBF kernel SVM to probe nonlinear discriminative structure
- Validation-based hyperparameter selection
- ROC, Precision–Recall, and physics-motivated threshold optimization
- Explicit discussion of computational cost and overfitting considerations
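A compact sketch of the validation-based kernel comparison, again on synthetic data; the candidate `C`/`gamma` values are illustrative, not the grid used in the study:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1500, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Compare a linear SVM against RBF kernels; hyperparameters are chosen
# on the validation split only, never on the test set.
best = None
for kernel, params in [("linear", {"C": 1.0}),
                       ("rbf", {"C": 1.0, "gamma": "scale"}),
                       ("rbf", {"C": 10.0, "gamma": 0.1})]:
    clf = SVC(kernel=kernel, **params).fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, clf.decision_function(X_val))
    if best is None or auc > best[0]:
        best = (auc, kernel, params)
print("best on validation:", best)
```

If the RBF kernels beat the linear SVM only marginally, as found here, that is direct evidence of weak nonlinear structure; the comparison also makes the extra computational cost of kernel methods explicit.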
This study uses the "Particle Physics Event Classification Dataset" available on Kaggle:
The dataset contains 250,001 collider-inspired events labeled as signal or background, with per-event weights used in physics-driven evaluation.
- Centralized data cleaning and preprocessing utilities
- Feature selection, scaling, and weight handling
- Ensures consistency across all notebooks
- Designed for reuse and modularity
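A sketch of what such a centralized utility might look like; the function name `prepare_events` and the column names are hypothetical, not the repository's actual API:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def prepare_events(df, feature_cols, label_col="label", weight_col="weight"):
    """Hypothetical cleaning utility: drop incomplete rows, scale the
    selected features, and return (X, y, w) arrays for any notebook."""
    df = df.dropna(subset=feature_cols + [label_col, weight_col])
    X = StandardScaler().fit_transform(df[feature_cols].to_numpy())
    y = df[label_col].to_numpy()
    w = df[weight_col].to_numpy(dtype=float)
    return X, y, w

# Minimal usage on a toy frame; the row with a missing feature is dropped.
toy = pd.DataFrame({"pt": [10.0, 20.0, np.nan],
                    "eta": [0.1, -1.2, 0.5],
                    "label": [1, 0, 1],
                    "weight": [0.8, 1.2, 1.0]})
X, y, w = prepare_events(toy, ["pt", "eta"])
print(X.shape, y, w)
```

Routing every notebook through one function like this is what guarantees the consistency claimed above: the same rows are dropped and the same scaling is applied everywhere.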
- The leading PCA components capture a large fraction of the total variance but do not provide strong class separation.
- Linear models already capture most of the discriminative power in the data.
- Nonlinear kernels introduce only modest performance improvements, indicating weak nonlinear structure.
- Optimal physics performance is achieved through decision threshold optimization rather than reliance on global metrics alone.
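The threshold optimization can be sketched with the common approximate significance $S/\sqrt{S+B}$, where $S$ and $B$ are the weighted signal and background yields passing the cut; the toy scores below stand in for a real classifier's output:

```python
import numpy as np

def significance_scan(scores, labels, weights, thresholds):
    """Approximate significance S / sqrt(S + B) at each score threshold,
    with S and B computed as weighted signal/background yields."""
    out = []
    for t in thresholds:
        sel = scores >= t
        S = weights[sel & (labels == 1)].sum()
        B = weights[sel & (labels == 0)].sum()
        out.append(S / np.sqrt(S + B) if S + B > 0 else 0.0)
    return np.array(out)

# Toy classifier output: signal scores shifted upward, uniform weights.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 5000)
scores = rng.normal(loc=labels * 1.5, scale=1.0)
weights = np.ones(5000)

thresholds = np.linspace(-2, 4, 61)
sig = significance_scan(scores, labels, weights, thresholds)
best_t = thresholds[np.argmax(sig)]
print(f"best threshold: {best_t:.2f}, significance: {sig.max():.2f}")
```

The threshold maximizing this curve generally differs from the one maximizing accuracy or F1, which is why the study reports significance-optimized working points rather than relying on global metrics alone.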
- All models are evaluated using weighted metrics to account for physical cross sections.
- Model selection is performed exclusively on validation data.
- Final performance is reported on an independent test set.
- Emphasis is placed on robustness and interpretability rather than aggressive hyperparameter tuning.
- The study is limited to a specific dataset and feature representation; results may not generalize to higher-dimensional or more complex observables.
- Only classical ML models are considered; future work may explore physics-informed neural networks or representation learning.
- The relationship between detector effects and feature separability is not explicitly modeled.
This project is intended as a research-oriented study suitable for graduate-level work in particle physics and machine learning. It prioritizes physical insight and methodological clarity over benchmark-driven performance. The scope is intentionally focused on interpretable models and controlled analysis rather than large-scale optimization or complex model architectures.
Future work will extend this study to include more advanced machine learning methods and richer model classes.