🔬 MatPropNet: Data-Driven Material Property Prediction and Attribution System

A rigorous, reproducible machine learning pipeline for predicting and attributing material mechanical properties from small experimental datasets. It automates the full chain of property prediction, hyperparameter optimization, and physical-mechanism attribution.

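🎯 Motivation: Why This Project?

Experimental materials-science datasets are typically small (n ~ 10¹–10²), relatively high-dimensional, and of mixed feature types (continuous formulation parameters plus categorical processing conditions). Applying a standard ML workflow naively to such data runs systematically into three failure modes: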

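| Failure Mode | Root Cause | Solution in This Framework |
| --- | --- | --- |
| 📉 Overfitting | High p/n ratio | Nested CV + regularized models |
| 🎰 Optimistic Bias | Test-label leakage during HPO | Structurally separated inner and outer loops |
| 🕳️ Black-box Opacity | No link to physical mechanisms | SHAP game-theoretic attribution |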

MatPropNet is purpose-built to address all three failure modes with a production-grade, academically rigorous pipeline.

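✨ Key Features

🏗️ Dual-Track Experiment Management (Hydra)

Hydra-based, configuration-driven execution fully decouples debug runs (outputs/) from archived experiments (experiments/). Switching models, toggling HPO, and sweeping over multiple configurations all happen on the command line, with zero code changes.

🔐 Nested Cross-Validation: Unbiased Generalization Estimates

The outer CV provides an honest estimate of generalization, while the inner CV drives the hyperparameter search. This is the gold standard for evaluating ML on small datasets. With a K-fold outer loop and a J-fold inner HPO loop, the total number of model fits is: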

$$ N_{\text{fits}} = K \times n_{\text{trials}} \times J + K $$

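where $n_{\text{trials}}$ is the number of Optuna HPO trials per outer fold, and the trailing $+K$ accounts for one final fit per outer fold with the selected configuration. The test set of each outer fold is never seen during inner optimization, which structurally rules out label leakage.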
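
The fit count and the "never seen" property can be checked on indices alone, without training anything. The sketch below is a minimal illustration under assumptions: the helper names `k_fold_indices` and `nested_cv_plan` are hypothetical and are not part of MatPropNet, and plain shuffled folds stand in for whatever splitter the project actually uses.

```python
import random


def k_fold_indices(indices, n_splits, seed):
    """Shuffle `indices` and split them into `n_splits` nearly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_splits] for i in range(n_splits)]


def nested_cv_plan(n_samples, k_outer, j_inner, n_trials, seed=0):
    """Enumerate every model fit of a nested CV run (hypothetical helper).

    Returns (n_fits, leaked); `leaked` is True if any outer test index
    ever shows up inside an inner-loop fold.
    """
    n_fits = 0
    leaked = False
    outer_folds = k_fold_indices(range(n_samples), k_outer, seed)
    for k, test_idx in enumerate(outer_folds):
        test_set = set(test_idx)
        # Outer training set: everything except the held-out fold.
        train_idx = [i for i in range(n_samples) if i not in test_set]
        # Inner folds are built from the outer training set only.
        inner_folds = k_fold_indices(train_idx, j_inner, seed + k + 1)
        for _trial in range(n_trials):
            for val_idx in inner_folds:
                leaked = leaked or bool(test_set & set(val_idx))
                n_fits += 1  # one fit per (trial, inner fold)
        n_fits += 1  # final fit on the outer training set with the best config
    return n_fits, leaked


n_fits, leaked = nested_cv_plan(n_samples=60, k_outer=5, j_inner=3, n_trials=50)
print(n_fits, leaked)  # 755 False
```

With K = 5, J = 3 and 50 trials, the count matches 5 × 50 × 3 + 5 = 755, which is worth keeping in mind when budgeting trials for slow models.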

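The cross-validated generalization estimate and its standard deviation are defined as:

$$ \hat{\mu}_{\text{CV}} = \frac{1}{K}\sum_{k=1}^{K} L\bigl(f_{\theta_k^*},\, D_k^{\text{test}}\bigr) $$

$$ \hat{\sigma}_{\text{CV}} = \sqrt{\frac{1}{K-1}\sum_{k=1}^{K}\left(L_k - \hat{\mu}_{\text{CV}}\right)^2} $$

where $L_k = L\bigl(f_{\theta_k^*},\, D_k^{\text{test}}\bigr)$ is the loss on outer fold $k$. Both $\hat{\mu}_{\text{CV}}$ and $\hat{\sigma}_{\text{CV}}$ are reported to expose the uncertainty inherent in small-sample evaluation.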
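
As a small worked example of the two estimators, the sketch below computes them from a list of per-fold losses using only the standard library. The fold values are made up for illustration, and `cv_summary` is a hypothetical helper name rather than a MatPropNet function.

```python
from statistics import mean, stdev


def cv_summary(fold_losses):
    """Return (mu_cv, sigma_cv) for per-fold losses L_1..L_K.

    `stdev` uses the K-1 divisor, matching the sample standard deviation above.
    """
    if len(fold_losses) < 2:
        raise ValueError("need at least two outer folds to estimate sigma_cv")
    return mean(fold_losses), stdev(fold_losses)


# Illustrative RMSE values for K = 5 outer folds (not real results).
fold_rmse = [4.0, 5.0, 6.0, 5.0, 5.0]
mu_cv, sigma_cv = cv_summary(fold_rmse)
print(f"{mu_cv:.2f} +/- {sigma_cv:.2f}")  # 5.00 +/- 0.71
```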

🧠 Bayesian HPO via Optuna

Optuna replaces grid/random search with Tree-structured Parzen Estimator (TPE) sampling. TPE maintains two density models over the hyperparameter space Λ:

$$ p(\lambda \mid H) = \begin{cases} \ell(\lambda) & \text{if } f(\lambda) < f^{*} \\ g(\lambda) & \text{otherwise} \end{cases} $$

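Here $H$ is the history of evaluated trials and $f^{*}$ is a loss threshold, typically a quantile of the observed losses. The acquisition function maximizes the ratio $\ell(\lambda)/g(\lambda)$, concentrating trials on the most promising regions of Λ. The optimal hyperparameter configuration is:

$$ \theta^* = \arg\min_{\lambda \in \Lambda}\; E_{D_{\text{val}}}\bigl[L(f_\lambda, D_{\text{val}})\bigr] $$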
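
To make the ratio rule concrete, here is a toy one-dimensional sketch of a single TPE-style suggestion step. It is not Optuna's implementation (which handles priors, adaptive bandwidths, and mixed search spaces); the functions `kde` and `tpe_suggest`, the fixed bandwidth, and the quadratic objective are all assumptions chosen for illustration.

```python
import math


def kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `points`."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    return density


def tpe_suggest(history, candidates, gamma=0.25, bandwidth=0.1):
    """Pick the candidate that maximizes l(x) / g(x) given (x, loss) pairs."""
    ranked = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, math.ceil(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]  # losses below the threshold f*
    bad = [x for x, _ in ranked[n_good:]]   # all remaining trials
    l, g = kde(good, bandwidth), kde(bad, bandwidth)
    # Small constant avoids division by zero far away from every "bad" point.
    return max(candidates, key=lambda x: l(x) / (g(x) + 1e-12))


def objective(x):
    """Toy loss with its minimum at x = 0.7."""
    return (x - 0.7) ** 2


history = [(i / 10, objective(i / 10)) for i in range(11)]  # 11 past trials
candidates = [i / 100 for i in range(101)]
suggestion = tpe_suggest(history, candidates)
print(suggestion)
```

The suggested point lands close to the region where past trials had the lowest loss, which is the behavior the ratio $\ell/g$ is meant to produce.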

HPO trials are persisted to SQLite, so studies are fully resumable after an interruption.

🔍 SHAP-Based Physical Attribution

SHAP, grounded in cooperative game theory, decomposes each prediction into per-feature contributions. The Shapley value of feature $i$ is defined as:

$$ \phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \bigl[ v(S \cup \{i\}) - v(S) \bigr] $$

where $F$ is the full feature set and $v(S)$ is the model output when only the features in subset $S$ are known. Global feature importance is aggregated as the mean absolute Shapley value:

$$ \bar{\phi}_i = \frac{1}{n}\sum_{j=1}^{n} \left|\phi_i^{(j)}\right| $$

Outputs include bar charts (the global ranking by $\bar{\phi}_i$) and beeswarm plots (the per-sample distribution of $\phi_i^{(j)}$).
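
The two formulas above can be verified by brute force on a tiny model. The sketch below enumerates every coalition, which is only feasible for a handful of features; SHAP's TreeExplainer uses a far more efficient tree-specific algorithm. As a simplifying assumption, $v(S)$ is computed by replacing features outside $S$ with a fixed baseline, and the toy `model`, the samples, and all helper names are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating all coalitions S of F minus {i}."""
    phi = []
    for i in range(n_features):
        others = [f for f in range(n_features) if f != i]
        total = 0.0
        for size in range(len(others) + 1):
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value_fn(s | {i}) - value_fn(s))
        phi.append(total)
    return phi


def model(x):
    """Toy model with one interaction term between features 0 and 2."""
    return 2.0 * x[0] + 1.0 * x[1] + x[0] * x[2]


def make_value_fn(x, baseline):
    """v(S): model output with features outside S set to the baseline."""
    def value_fn(subset):
        masked = [x[j] if j in subset else baseline[j] for j in range(len(x))]
        return model(masked)
    return value_fn


samples = [(1.0, 2.0, 3.0), (-1.0, 0.0, 1.0)]
baseline = (0.0, 0.0, 0.0)
per_sample = [shapley_values(make_value_fn(x, baseline), 3) for x in samples]

# Global importance: mean absolute Shapley value per feature.
global_importance = [
    sum(abs(phi[i]) for phi in per_sample) / len(per_sample) for i in range(3)
]
print([round(v, 6) for v in global_importance])  # [3.0, 1.0, 1.0]
```

For the first sample the values are 3.5, 2.0 and 1.5: the interaction term is split evenly between features 0 and 2, and the three values sum to the model output minus the baseline output.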

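📊 Publication-Ready Visualizations

All figures are saved as both .png (for on-screen viewing) and .pdf (vector graphics, suitable for LaTeX submissions). The exporter module automatically mirrors run figures to paper_figures/ and generates a JSON manifest containing a config snapshot and an HPO summary.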
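
The mirroring step might look roughly like the sketch below. This is an assumption-laden illustration, not the project's exporter: the README names `exporter.mirror_figures()` but does not show its signature, so the arguments, the manifest keys, and the directory layout used here are all hypothetical.

```python
import json
import shutil
import tempfile
from pathlib import Path


def mirror_figures(run_dir, dest_dir, config, hpo_summary):
    """Copy .png/.pdf figures from run_dir to dest_dir and write manifest.json.

    Hypothetical signature; the real exporter may differ.
    """
    run_dir, dest_dir = Path(run_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(run_dir.rglob("*")):
        if path.is_file() and path.suffix.lower() in {".png", ".pdf"}:
            relative = path.relative_to(run_dir)
            target = dest_dir / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(relative.as_posix())
    manifest = {"figures": copied, "config": config, "hpo_summary": hpo_summary}
    (dest_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return manifest


# Demo on a throwaway directory tree with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run"
    shap_dir = run_dir / "figures" / "shap"
    shap_dir.mkdir(parents=True)
    (shap_dir / "bar.png").write_bytes(b"png placeholder")
    (shap_dir / "bar.pdf").write_bytes(b"pdf placeholder")
    (run_dir / "train.log").write_text("not a figure", encoding="utf-8")

    dest_dir = Path(tmp) / "paper_figures" / "latest"
    manifest = mirror_figures(
        run_dir, dest_dir, config={"model": "xgboost"}, hpo_summary={"n_trials": 50}
    )
    written = json.loads((dest_dir / "manifest.json").read_text(encoding="utf-8"))
    log_copied = (dest_dir / "train.log").exists()

print(manifest["figures"])  # ['figures/shap/bar.pdf', 'figures/shap/bar.png']
```

Keeping the config snapshot and HPO summary next to the figures means a figure in a paper draft can be traced back to the run that produced it.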

🔄 MLflow Integration

Every training run logs parameters, metrics, and artifacts to MLflow, enabling comparison across runs and experiments.

🗂️ Project Structure

MatPropNet/
│
├── conf/                          # Hydra configuration (YAML)
│   ├── config.yaml                #     root config: data path, HPO switch, run flags
│   └── model/
│       ├── xgboost.yaml           #     XGBoost parameters & search space
│       └── random_forest.yaml     #     Random Forest parameters & search space
│
├── src/
│   ├── data/
│   │   ├── loader.py              #     data ingestion + data quality report
│   │   └── schema.py              #     column contracts (TARGET_COLUMN, etc.)
│   ├── features/
│   │   └── processor.py           #     automatic feature-type inference, preprocessor construction
│   ├── models/
│   │   ├── hpo.py                 #     Optuna tuner registry (pluggable)
│   │   └── train.py               #     nested CV training + MLflow logging
│   ├── explainability/
│   │   └── shap_runner.py         #     SHAP bar + beeswarm attribution analysis
│   ├── visualization/
│   │   └── plots.py               #     all matplotlib figure generators
│   └── export/
│       └── exporter.py            #     figure mirroring -> paper_figures/ + manifest
│
├── data/
│   ├── raw/                       #     place raw datasets (.xlsx / .csv) here
│   └── processed/                 #     preprocessed intermediate data (auto-generated)
│
├── outputs/          # DEBUG ZONE: managed automatically by Hydra, gitignored
│   ├── optuna/                    #     Optuna SQLite study files
│   └── paper_figures/             #     mirrored final figures for publication
│       ├── latest/                #     current figures
│       └── archive/               #     archived earlier versions
│
├── experiments/      # ARCHIVE ZONE: curated archive, may be version-controlled
│   └── <experiment_name>/
│       └── YYYY-MM-DD_HH-MM-SS/   #     timestamp-isolated, one directory per multirun
│           ├── 0_model=xgboost/
│           │   ├── .hydra/        #     full config snapshot
│           │   └── figures/
│           │       └── shap/      #     SHAP PNG + PDF figures
│           └── 1_model=random_forest/
│               ├── .hydra/
│               └── figures/
│                   └── shap/
│
├── models/                        #     persisted model files (.pkl / .json)
├── notebooks/                     #     exploratory-analysis Jupyter notebooks
├── tests/                         #     unit tests
├── mlruns/                        #     MLflow tracking store (gitignored)
├── main.py                        # pipeline entry point
└── requirements.txt

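`outputs/` vs `experiments/`: Dual-Track Decoupling

| | `outputs/` (debug zone) | `experiments/` (archive zone) |
| --- | --- | --- |
| Purpose | Debug sandbox | Curated scientific archive |
| Managed by | Hydra (fully automatic) | `exporter.mirror_figures()` |
| Git status | `.gitignore`d, untracked | ✅ Committed selectively as needed |
| Contents | Raw logs, intermediate artifacts, `.hydra/` snapshots | Final figures, JSON manifest |
| Lifecycle | Ephemeral, safe to delete | Permanent record |

This decoupling keeps the repository clean while every meaningful experiment remains fully traceable.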

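🔄 Pipeline Architecture

The diagram below shows the full execution path from raw data input to archived experiment output. The HPO switch determines whether the inner search is activated; the outer K-fold loop always runs so that the generalization estimate stays unbiased.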

flowchart TD
    A["Raw Dataset (.xlsx / .csv)"] --> B["Feature Processor: Impute / Scale / Encode"]
    B --> C{"HPO Enabled?"}
    C -- "Yes" --> D["Outer K-fold Loop"]
    C -- "No" --> D
    D --> E["Inner CV Loop: Search Space"]
    E --> F["Optuna TPE Sampling: n_trials x J fits"]
    F --> G["Best config: theta* = argmin E_Dval"]
    G --> D
    D --> H["Outer Fold Eval: RMSE / R2 on Test Set"]
    H --> I["K-Fold Aggregate: mu_CV +/- sigma_CV"]
    I --> J["SHAP Attribution: phi_i per sample"]
    I --> K["Pred vs Actual: +/- 10% Error Band"]
    I --> L["MLflow Log: Params / Metrics / Artifacts"]
    J --> M["experiments/ Archive: Figures + Manifest"]
    K --> M
    L --> M
    style A fill:#f0f4ff,stroke:#4a6cf7
    style G fill:#fff3e0,stroke:#ff9800
    style M fill:#e8f5e9,stroke:#4caf50

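🧭 Data Flow: Feature Processing Details

Numerical features go through median imputation and standardization; categorical features go through constant imputation and one-hot encoding; the two branches are merged at the model layer. The preprocessor is part of the sklearn Pipeline, which ensures that SHAP attributions correspond directly to the original input variables without semantic drift.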

flowchart LR
    subgraph INPUT["Input Features"]
        FN["Numerical features"]
        FC["Categorical features"]
    end
    subgraph PROC["Preprocessor - sklearn Pipeline"]
        FN --> NI["Median Imputer"]
        NI --> NS["Standard Scaler: x = (x - mean) / std"]
        FC --> CI["Constant Imputer"]
        CI --> CE["One-Hot Encoder"]
    end
    subgraph MODEL["Estimator"]
        NS --> EST["XGBoost / RandomForest"]
        CE --> EST
    end
    EST --> SHAP["SHAP TreeExplainer: phi_i per sample"]
    EST --> PRED["y_hat = f(x)"]

⚡ Quick Start

1. Installation

git clone https://github.com/liqinglq666/MatPropNet.git
cd MatPropNet
pip install -r requirements.txt

2. Prepare Your Data

Place your dataset in data/raw/ and update the path in conf/config.yaml:

data:
  path: data/raw/your_data.xlsx   # replace with your filename

3. Single Debug Run (no HPO, fast)

python main.py model=random_forest hpo.enabled=false

Outputs land in outputs/<date>/<time>/. Inspect logs and figures before committing to a full experiment.

4. Full Run with Bayesian HPO

python main.py model=xgboost hpo.enabled=true experiment_name=exp_v1

5. Multi-Model Sweep (Hydra multirun)

python main.py --multirun model=xgboost,random_forest hpo.enabled=true experiment_name=sweep_01

All runs execute sequentially; figures are archived per model under experiments/sweep_01/.

6. Inspect Results in MLflow

mlflow ui
# Open http://localhost:5000 in your browser

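📦 Dependencies

Core dependencies are listed below (see requirements.txt for the full pinned versions):

| Package | Purpose |
| --- | --- |
| `hydra-core >= 1.3` | Configuration management + multirun sweeps |
| `optuna >= 3.0` | Bayesian hyperparameter optimization (TPE) |
| `xgboost >= 2.0` | Gradient-boosted tree models |
| `scikit-learn >= 1.3` | Random forest, preprocessing, cross-validation |
| `shap >= 0.44` | SHAP TreeExplainer attribution analysis |
| `mlflow >= 2.10` | Experiment tracking and artifact management |
| `matplotlib >= 3.7` | Publication-quality figure rendering |
| `pandas >= 2.0` | Data loading and manipulation |
| `openpyxl` | `.xlsx` file support |

Python 3.9–3.11 is recommended, with environment isolation via conda or venv.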

🔬 Methodology and Academic Integrity Statement

This project implements a statistically rigorous ML pipeline designed to meet the evidentiary standards expected of materials-science publications.

Core methodological guarantees:

No label leakage: the inner HPO loop and the outer generalization-evaluation loop are structurally separated. Reported metrics reflect true out-of-sample performance.
Unbiased variance estimation: repeated stratified k-fold CV quantifies both $\hat{\mu}_{\text{CV}}$ and $\hat{\sigma}_{\text{CV}}$. For small n, reporting a point estimate alone is statistically insufficient.
Reproducibility: every run records a full config snapshot (.hydra/), random seeds, and a dependency manifest. Given fixed seeds, results are deterministically reproducible.
Interpretability over accuracy: SHAP attribution operates on the full pipeline (including preprocessing), so that $\phi_i$ maps back to physically meaningful input variables $x_i$ rather than latent transformed representations.

⚠️ A note on small-data ML: when n ≈ 100, even cross-validated metrics carry substantial variance ($\hat{\sigma}_{\text{CV}} \gg 0$). The pipeline surfaces this uncertainty explicitly, as a distribution of CV scores rather than a point estimate, to prevent over-confident claims in downstream analysis or publication.

🤝 Contributing

Contributions are welcome! To add a new model type:

Create conf/model/<your_model>.yaml with default parameters and an HPO search space.
Register the corresponding tuner in src/models/hpo.py by extending _TUNER_REGISTRY.
The rest of the pipeline (training, visualization, SHAP, export) adapts automatically; no further changes are needed.

Please open an issue before submitting large PRs.

📄 License

This repository is published openly for academic reference and reproducibility purposes only. Redistribution, commercial use, or derivative works are not permitted without explicit written permission from the author.

📬 Citation

If you use MatPropNet in academic work, please cite:

@software{matpropnet2026,
  author  = {liqinglq666},
  title   = {MatPropNet: A Rigorous Data-Driven Pipeline for Material Property Prediction},
  year    = {2026},
  url     = {https://github.com/liqinglq666/MatPropNet}
}
