This repository contains analysis code used in the study "A pan-cancer compendium of 1,294 plasma cell-free DNA methylomes and fragmentomes." It supports three linked workflows: methylation feature generation/processing, fragmentomics feature generation/processing, and downstream machine-learning analyses (cancer-vs-normal, cancer-type, and subtype classification). The codebase is optimized for HPC-style batch execution and assumes curated input tables from the study data resource.
- 1_Methylation_Scripts/: methylation feature generation (cluster runners) and downstream R analysis scripts.
- 2_Fragmentomics_Scripts/: fragmentomics feature generation from BAM files and downstream R analysis scripts.
- 3_Machine_Learning_Scripts/: feature selection/PCA, classifier training, and result plotting.
- docs/: end-to-end usage documentation, input/output schema notes, and reproducibility guidance.
- config/: local configuration templates for path overrides.
- Makefile: lightweight wrappers for setup, validation, and common run commands.
The repository does not include full study input data. Quickstart therefore means "configure environment + run a minimal command against your own prepared inputs".
# 1) Clone and enter repo
git clone <repo-url>
cd cfMeDIP-seq_Data_Resource_Codes
# 2) Python environment (for cancer-type/subtype classifiers)
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# 3) Optional R environment bootstrap
R -q -e 'install.packages("renv"); renv::init(bare = TRUE)'
# 4) Validate expected files/entrypoints
make validate
# 5) Run one cancer-type classifier job (requires prepared data directory)
export CFMEDIP_MAIN_DIR=/absolute/path/to/ML_workspace
python 3_Machine_Learning_Scripts/3_Scripts_for_running_ML_pipelines_PE_data_cancer_type_or_subtype/cancer_origin_clf.py motif "Breast Cancer"
Recommended: R 4.1+ on Linux, with explicit package snapshots via renv.
R -q -e 'install.packages("renv")'
R -q -e 'renv::init(bare = TRUE)'
R -q -e 'renv::install(c("tidyverse","dplyr","ggplot2","caret","pROC","doParallel","DESeq2","sva"))'
R -q -e 'renv::snapshot()'
Note: many scripts rely on additional Bioconductor and visualization packages. See docs/REPRODUCIBILITY.md for expanded package guidance.
Use requirements.txt:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Use config/pipeline_paths.example.env as a local template:
cp config/pipeline_paths.example.env config/pipeline_paths.env
# edit values for your system
source config/pipeline_paths.env
Supported overrides:
- CFMEDIP_MAIN_DIR: root containing data/ and results_* for Python classifiers.
- CFMEDIP_MODELS: optional comma-separated model subset for Python scripts (example: lr,rf).
- CFMEDIP_CN_CONFIG, CFMEDIP_CN_OUTPUT_DIR, CFMEDIP_CN_SLURM_DIR, CFMEDIP_CN_RUNNER_SCRIPT, CFMEDIP_CN_SCRIPTS_DIR: overrides for cancer-vs-normal shell/R runners.
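The two Python-facing overrides can be read as in the sketch below. This is only an illustration of the documented semantics (a required root directory and an optional comma-separated model list). The helper name read_overrides and the returned dictionary keys are hypothetical and are not taken from the repository's scripts, which may parse these variables differently.

```python
import os
from pathlib import Path


def read_overrides(env=None):
    """Collect the Python-facing CFMEDIP_* overrides into plain values.

    Hypothetical helper for illustration; not the repository's own parser.
    """
    env = os.environ if env is None else env

    main_dir = env.get("CFMEDIP_MAIN_DIR")
    if not main_dir:
        raise RuntimeError("CFMEDIP_MAIN_DIR is not set")

    # CFMEDIP_MODELS is optional; "lr,rf" becomes ["lr", "rf"].
    # An empty list here means "no subset requested".
    raw_models = env.get("CFMEDIP_MODELS", "")
    models = [m.strip() for m in raw_models.split(",") if m.strip()]

    return {"main_dir": Path(main_dir), "models": models}
```

Passing a plain dictionary instead of relying on os.environ makes it easy to check a configuration before submitting a batch job.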
- Generate methylation features from raw inputs:
1_Methylation_Scripts/Shell_scripts_to_generate_features/1_sbatch_methylation_analysis.sh
- Process methylation features in numbered order (1_... through 5.6_...):
1_Methylation_Scripts/R_scripts_to_process_features/
- Generate fragmentomics features from BAMs (choose runner for your cohort):
2_Fragmentomics_Scripts/Shell_scripts_to_generate_features_from_bams/1_Runner_scripts/
- Process fragmentomics features in numbered order (1.x to 5.x, then optional 9.x/10.x statistics):
2_Fragmentomics_Scripts/R_scripts_to_process_features/
- Build train/validation feature matrices and PCA transforms:
3_Machine_Learning_Scripts/1_Select_and_PCA_transform_features/6.2A_...PE...R
3_Machine_Learning_Scripts/1_Select_and_PCA_transform_features/6.2B_...SE...R
- Run cancer-vs-normal classifiers:
3_Machine_Learning_Scripts/2_Shell_and_R_scripts_for_running_ML_pipelines_PE_data_cancer_vs_normal_classifier/1_Runner_scripts/Run_CN_classifier.sh
.../Run_CN_classifier_SE.sh
- Run cancer-type and subtype classifiers (Python):
3_Machine_Learning_Scripts/3_Scripts_for_running_ML_pipelines_PE_data_cancer_type_or_subtype/cancer_origin_clf.py
3_Machine_Learning_Scripts/3_Scripts_for_running_ML_pipelines_PE_data_cancer_type_or_subtype/subtype_clf.py
- Generate downstream summary plots:
3_Machine_Learning_Scripts/4_Machine_learning_plotting_scripts/
Detailed workflow text and flowchart are in docs/PIPELINE_OVERVIEW.md.
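Before starting a run, it can help to confirm that the entry points listed above are present in your checkout. The sketch below is a hypothetical stand-in and does not reproduce what make validate actually checks. It only uses paths that are spelled out in full in this README; abbreviated paths are left out rather than guessed. The function name missing_entrypoints is an assumption for illustration.

```python
from pathlib import Path

# Paths copied from the workflow list in this README (complete paths only).
ML_DIR = "3_Machine_Learning_Scripts"
PY_DIR = f"{ML_DIR}/3_Scripts_for_running_ML_pipelines_PE_data_cancer_type_or_subtype"
CN_DIR = (
    f"{ML_DIR}/2_Shell_and_R_scripts_for_running_ML_pipelines_PE_data"
    "_cancer_vs_normal_classifier/1_Runner_scripts"
)

ENTRYPOINTS = [
    "1_Methylation_Scripts/Shell_scripts_to_generate_features/1_sbatch_methylation_analysis.sh",
    "1_Methylation_Scripts/R_scripts_to_process_features",
    "2_Fragmentomics_Scripts/Shell_scripts_to_generate_features_from_bams/1_Runner_scripts",
    "2_Fragmentomics_Scripts/R_scripts_to_process_features",
    f"{ML_DIR}/1_Select_and_PCA_transform_features",
    f"{CN_DIR}/Run_CN_classifier.sh",
    f"{PY_DIR}/cancer_origin_clf.py",
    f"{PY_DIR}/subtype_clf.py",
    f"{ML_DIR}/4_Machine_learning_plotting_scripts",
]


def missing_entrypoints(repo_root):
    """Return the listed entry points that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in ENTRYPOINTS if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_entrypoints("."):
        print(f"missing: {rel}")
```

Run it from the repository root; no output means every listed path was found.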
Primary outputs include:
- Processed feature matrices (.rds, .csv) from methylation/fragmentomics processing.
- PCA-transformed training/validation matrices for ML.
- Classifier outputs (*.sum.csv, *.roc.csv, model objects .rds, kappa summaries .txt/.pdf).
- Publication-style figure panels from plotting scripts.
See docs/OUTPUTS.md for file naming conventions and interpretation.
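To take a quick inventory of classifier outputs after a run, a sketch like the following groups files by the *.sum.csv and *.roc.csv suffixes named above. It deliberately does not open the files, since their column layout is described in docs/OUTPUTS.md and is not assumed here. The function name collect_classifier_outputs and the idea of pointing it at a single results directory are assumptions for illustration.

```python
from pathlib import Path

# Suffixes taken from the output list in this README.
OUTPUT_SUFFIXES = (".sum.csv", ".roc.csv")


def collect_classifier_outputs(results_dir):
    """Map each known suffix to the sorted files found under results_dir."""
    root = Path(results_dir)
    found = {}
    for suffix in OUTPUT_SUFFIXES:
        # rglob searches the directory and all of its subdirectories.
        found[suffix] = sorted(root.rglob(f"*{suffix}"))
    return found


if __name__ == "__main__":
    # "results_example" is a placeholder directory name, not a repository path.
    for suffix, paths in collect_classifier_outputs("results_example").items():
        print(f"{suffix}: {len(paths)} file(s)")
```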
- File not found errors: most scripts contain hardcoded absolute paths from the original compute environment; replace these with your local paths before running.
- Empty intersections of metadata/features: ensure sample IDs are normalized consistently (for example, remove _dedup where expected).
- Cross-validation failures (n_splits too large): class sizes are too small for requested folds. Reduce cohort scope or use CFMEDIP_MODELS to run simpler checks first.
- R package missing: install package into your renv or system library and rerun.
- HPC module errors: cluster runner scripts assume module load and scheduler directives are available.
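Two of the issues above (empty metadata/feature intersections and n_splits being too large) can be checked before submitting a job. The sketch below is a minimal pre-flight check under stated assumptions: that the only ID difference is a trailing _dedup suffix, and that each fold should contain at least one sample of every class, which makes the smallest class size a conservative upper bound on the fold count. All function names are hypothetical and not part of the repository.

```python
from collections import Counter


def normalize_sample_id(sample_id, suffix="_dedup"):
    """Strip surrounding whitespace and one trailing suffix, if present."""
    sample_id = sample_id.strip()
    if suffix and sample_id.endswith(suffix):
        return sample_id[: -len(suffix)]
    return sample_id


def shared_samples(metadata_ids, feature_ids):
    """Return the sorted sample IDs present in both tables after normalization."""
    meta = {normalize_sample_id(s) for s in metadata_ids}
    feat = {normalize_sample_id(s) for s in feature_ids}
    return sorted(meta & feat)


def max_folds_per_class(labels):
    """Smallest class size: the most folds that can each hold every class."""
    counts = Counter(labels)
    return min(counts.values()) if counts else 0
```

If shared_samples returns an empty list, the ID conventions of the two tables still differ. If max_folds_per_class returns a value below the requested number of folds (or below 2), the cohort is too small for that cross-validation setting.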
- docs/PIPELINE_OVERVIEW.md
- docs/INPUTS.md
- docs/OUTPUTS.md
- docs/REPRODUCIBILITY.md
- docs/DOCS_IMPROVEMENTS_SUMMARY.md