Cross-platform transcriptomic integration workflow for biomarker discovery and machine learning applications
This project provides a reproducible workflow for integrating transcriptomic datasets generated by different studies and technologies (microarrays and RNA-seq) in order to produce machine-learning-ready datasets.
The workflow focuses on reducing technical variability while preserving biological signals relevant to CDK inhibitor resistance studies.
The repository is structured as a tutorial-style analysis designed to explain each preprocessing step rather than simply executing a black-box pipeline.
The objectives of the workflow are:
- Integrate transcriptomic datasets from independent studies
- Harmonize data from different experimental platforms
- Reduce technical variability and batch effects
- Preserve biological information relevant to drug resistance
- Generate clean datasets suitable for downstream machine learning
The workflow relies on the following tools and R packages:
- R
- tidyverse
- limma
- sva
- edgeR
- AnnotationDbi
- preprocessCore
- biomaRt
- ggplot2
The microarray workflow includes the following steps:
- Probe annotation
- Probe summarization
- Quantile normalization
- Log2 transformation
- Low-expression filtering
- Cross-study gene matching
- Batch effect correction (ComBat / SVA)
- Exploratory quality control
- ML-ready matrix generation
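The probe summarization, quantile normalization, log2 transformation, and low-expression filtering steps listed above can be sketched in base R as follows. This is a minimal illustration on simulated intensities, not the code from the repository: the probe-to-gene map, the helper names `summarize_probes` and `quantile_normalize`, and the filtering threshold are all hypothetical. In practice, packages from the tool list such as limma or preprocessCore offer equivalent, better-tested routines.

```r
# Simulated probe-level intensities: 6 probes x 4 samples (not real data)
set.seed(1)
probe_expr <- matrix(
  rexp(24, rate = 1 / 200) + 20,
  nrow = 6,
  dimnames = list(paste0("probe", 1:6), paste0("sample", 1:4))
)

# Hypothetical probe-to-gene annotation; a real one would come from an
# annotation package or a biomaRt query
probe2gene <- c(probe1 = "GENE_A", probe2 = "GENE_A", probe3 = "GENE_B",
                probe4 = "GENE_C", probe5 = "GENE_C", probe6 = "GENE_D")

# Probe summarization: average all probes that map to the same gene
summarize_probes <- function(x, genes) {
  sums <- rowsum(x, group = genes)                    # per-gene column sums
  n_probes <- as.vector(table(genes)[rownames(sums)]) # probes per gene
  sums / n_probes                                     # per-gene means
}

# Quantile normalization: force every sample onto the same distribution,
# defined as the mean of the sorted values across samples
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  reference <- rowMeans(apply(x, 2, sort))
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

gene_expr <- summarize_probes(probe_expr, probe2gene[rownames(probe_expr)])
qn_expr   <- quantile_normalize(gene_expr)
log_expr  <- log2(qn_expr)

# Low-expression filtering: keep genes whose mean log2 intensity exceeds an
# illustrative threshold (the value 7 is an assumption, not a recommendation)
min_mean_log2 <- 7
filtered <- log_expr[rowMeans(log_expr) > min_mean_log2, , drop = FALSE]
```

After quantile normalization every sample shares the same set of values and only their assignment to genes differs, which is why the step removes sample-wide intensity shifts but cannot by itself remove study-specific effects on individual genes.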
The RNA-seq workflow includes the following steps:
- Gene annotation
- Count summarization
- Low-expression filtering
- TMM normalization
- Log transformation
- Cross-study integration
- Batch correction
- Exploratory PCA analysis
- ML-ready matrix generation
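For count data, the low-expression filtering, TMM normalization, and log transformation steps listed above might look like the following edgeR sketch. It assumes edgeR is installed, uses simulated counts, and the `sensitive`/`resistant` group labels are hypothetical placeholders; the repository's own notebooks may use different settings.

```r
library(edgeR)

# Simulated count matrix: 500 genes x 6 samples (not real data).
# Each gene gets a low, medium, or high mean so that filtering has an effect.
set.seed(42)
n_genes <- 500
counts <- matrix(
  rnbinom(n_genes * 6, mu = rep(c(2, 50, 500), length.out = n_genes), size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), paste0("sample", 1:6))
)

# Hypothetical phenotype labels, used only to guide the expression filter
group <- factor(rep(c("sensitive", "resistant"), each = 3))

y <- DGEList(counts = counts, group = group)

# Low-expression filtering: drop genes without enough counts in enough samples
keep <- filterByExpr(y, group = group)
y <- y[keep, , keep.lib.sizes = FALSE]

# TMM normalization: estimate per-sample scaling factors
y <- calcNormFactors(y, method = "TMM")

# Log transformation: log2 counts per million, with a prior count to avoid log(0)
log_cpm <- cpm(y, log = TRUE, prior.count = 1)
```

Filtering before normalization keeps very low-count genes from influencing the scaling factors, and the resulting `log_cpm` matrix (genes in rows, samples in columns) is on a scale that can be combined with other studies in the integration steps.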
The repository is organized as follows:
.
├── Microarrays_data
│   ├── merging_samples.Rmd
│   └── index.html
│
├── RNA-seq_data
│   ├── merging_samples_rnaseq.Rmd
│   └── index.html
│
├── index.html
└── README.md
The workflow covers the following topics:
- Multi-cohort transcriptomic integration
- Cross-platform harmonization
- Feature engineering
- Batch effect correction
- Exploratory data analysis
- Reproducible workflows
- Machine learning preprocessing
The final outputs generated by this workflow include:
- Annotated expression matrices
- Normalized datasets
- Batch-corrected datasets
- Exploratory QC visualizations
- ML-ready matrices for downstream predictive modeling
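To show how these outputs relate to one another, the sketch below matches genes across two simulated studies, removes a study-level shift, runs a PCA for quality control, and builds a samples-by-genes table. It is written in base R under stated assumptions: the data are simulated, the helper names `make_study` and `center_by_batch` are hypothetical, and per-batch gene-wise mean centering is used as a deliberately simple stand-in for ComBat, which is the method named in the workflow.

```r
# Two simulated studies with partially overlapping genes (not real data);
# study2 carries an artificial shift that plays the role of a batch effect
set.seed(7)
make_study <- function(genes, n_samples, prefix, shift) {
  matrix(rnorm(length(genes) * n_samples, mean = 8 + shift),
         nrow = length(genes),
         dimnames = list(genes, paste0(prefix, "_", seq_len(n_samples))))
}
study1 <- make_study(paste0("GENE", 1:50), 5, "study1", shift = 0)
study2 <- make_study(paste0("GENE", 11:60), 4, "study2", shift = 2)

# Cross-study gene matching: keep only genes measured in both studies
common_genes <- intersect(rownames(study1), rownames(study2))
merged <- cbind(study1[common_genes, ], study2[common_genes, ])
batch <- factor(rep(c("study1", "study2"), c(ncol(study1), ncol(study2))))

# Simplified batch correction: for each gene, replace the batch mean with the
# overall mean. This is NOT ComBat; it only illustrates the idea.
center_by_batch <- function(x, batch) {
  corrected <- x
  grand_mean <- rowMeans(x)
  for (b in levels(batch)) {
    idx <- batch == b
    batch_mean <- rowMeans(x[, idx, drop = FALSE])
    corrected[, idx] <- x[, idx, drop = FALSE] - batch_mean + grand_mean
  }
  corrected
}
corrected <- center_by_batch(merged, batch)

# Exploratory QC: PCA on samples; after correction the first components
# should no longer separate samples by study
pca <- prcomp(t(corrected), center = TRUE, scale. = FALSE)
pc_scores <- pca$x[, 1:2]

# ML-ready matrix: one row per sample, one column per gene, plus metadata
ml_ready <- data.frame(sample = colnames(corrected), batch = batch,
                       t(corrected), check.names = FALSE)
```

Mean centering removes any biological difference that happens to coincide with study membership, so it is only safe when phenotypes are balanced across studies. ComBat from the sva package accepts a model matrix of covariates to preserve, which matters when the goal is to retain signal related to drug resistance.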
This repository is designed as a guided, educational workflow that explains the rationale behind each preprocessing step. The focus is on interpretability and reproducibility rather than on providing executable scripts alone.