LucianoAD/Transcriptomic-Data-Integration-for-ML

Transcriptomic Data Integration for Machine Learning

Cross-platform transcriptomic integration workflow for biomarker discovery and machine learning applications


📖 Overview

This project provides a reproducible workflow for integrating transcriptomic datasets from different studies and technologies (microarrays and RNA-seq) into machine-learning-ready datasets.

The workflow focuses on reducing technical variability while preserving biological signals relevant to CDK inhibitor resistance studies.

The repository is structured as a tutorial-style analysis designed to explain each preprocessing step rather than simply executing a black-box pipeline.


🎯 Objectives

  • Integrate transcriptomic datasets from independent studies
  • Harmonize data from different experimental platforms
  • Reduce technical variability and batch effects
  • Preserve biological information relevant to drug resistance
  • Generate clean datasets suitable for downstream machine learning

🛠 Technologies and Packages

Programming

  • R

Main packages

  • tidyverse
  • limma
  • sva
  • edgeR
  • AnnotationDbi
  • preprocessCore
  • biomaRt
  • ggplot2

🔬 Workflow

Microarray workflow

  1. Probe annotation
  2. Probe summarization
  3. Quantile normalization
  4. Log2 transformation
  5. Low-expression filtering
  6. Cross-study gene matching
  7. Batch effect correction (ComBat / SVA)
  8. Exploratory quality control
  9. ML-ready matrix generation
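
The middle steps of the microarray workflow (probe summarization, quantile normalization, log2 transformation, low-expression filtering) can be sketched in base R as follows. This is a minimal illustration on made-up intensities, not the code from `merging_samples.Rmd`: the probe-to-gene map, the helper `quantile_normalize`, and the 25% filtering cutoff are all assumptions chosen for the example. In practice, functions such as `preprocessCore::normalize.quantiles` or `limma::normalizeQuantiles` would typically handle the normalization step, and they may treat ties differently from this sketch.

```r
# Toy probe-level intensities (rows = probes, columns = samples); values are made up.
set.seed(1)
probe_expr <- matrix(rexp(24, rate = 1 / 200) + 20, nrow = 6,
                     dimnames = list(paste0("probe", 1:6), paste0("S", 1:4)))

# Hypothetical probe-to-gene annotation.
probe_to_gene <- c(probe1 = "GENE_A", probe2 = "GENE_A", probe3 = "GENE_B",
                   probe4 = "GENE_C", probe5 = "GENE_C", probe6 = "GENE_D")

# Probe summarization: average all probes that map to the same gene.
sums <- rowsum(probe_expr, group = probe_to_gene[rownames(probe_expr)])
n_probes <- as.vector(table(probe_to_gene)[rownames(sums)])
gene_expr <- sums / n_probes

# Quantile normalization: force every sample to share the same distribution.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  ref <- rowMeans(apply(x, 2, sort))   # mean of the k-th smallest value across samples
  out <- apply(ranks, 2, function(r) ref[r])
  dimnames(out) <- dimnames(x)
  out
}
norm_expr <- quantile_normalize(gene_expr)

# Log2 transformation (intensities here are strictly positive).
log_expr <- log2(norm_expr)

# Low-expression filtering: drop genes whose mean falls below the 25th percentile
# (an illustrative cutoff, not a recommendation).
gene_means <- rowMeans(log_expr)
keep <- gene_means > quantile(gene_means, 0.25)
filtered <- log_expr[keep, , drop = FALSE]
```

After normalization, every column of `norm_expr` contains the same set of values in a different order, which is the defining property of quantile normalization.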

RNA-seq workflow

  1. Gene annotation
  2. Count summarization
  3. Low-expression filtering
  4. TMM normalization
  5. Log transformation
  6. Cross-study integration
  7. Batch correction
  8. Exploratory PCA analysis
  9. ML-ready matrix generation
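
A base R sketch of the filtering, normalization, and log transformation steps for count data is shown below. It is a simplified stand-in: it scales by library size (counts per million) rather than applying TMM, which in a real analysis would usually come from `edgeR::calcNormFactors`. The count matrix, `min_count`, and `min_samples` are hypothetical values picked for illustration.

```r
# Toy count matrix (rows = genes, columns = samples); values are made up.
counts <- matrix(c(
    0,    1,   0,    2,    0,   1,
   15,   22,  18,   30,   25,  12,
  250,  310, 190,  420,  380, 205,
    3,    0,  12,    1,    0,   2,
  900, 1200, 800, 1500, 1400, 700),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), paste0("R", 1:6)))

# Low-expression filtering: keep genes with at least `min_count` reads
# in at least `min_samples` samples (both thresholds are assumptions).
min_count <- 10
min_samples <- 3
keep <- rowSums(counts >= min_count) >= min_samples
filtered <- counts[keep, , drop = FALSE]

# Library-size normalization (CPM) with a small offset so that zero counts
# stay finite after the log. Note: this is NOT TMM, only library-size scaling.
lib_size <- colSums(filtered)
log_cpm <- log2(t((t(filtered) + 0.5) / (lib_size + 1)) * 1e6)
```

Recomputing `lib_size` after filtering keeps the scaling consistent with the genes that remain in the matrix.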

📂 Repository Structure

.
├── Microarrays_data
│   ├── merging_samples.Rmd
│   └── index.html
│
├── RNA-seq_data
│   ├── merging_samples_rnaseq.Rmd
│   └── index.html
│
├── index.html
└── README.md

📊 Key concepts

  • Multi-cohort transcriptomic integration
  • Cross-platform harmonization
  • Feature engineering
  • Batch effect correction
  • Exploratory data analysis
  • Reproducible workflows
  • Machine learning preprocessing
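
To make the idea of batch effect correction concrete, the sketch below uses a deliberately simple stand-in: centering each gene within each study. This is not ComBat or SVA, and it is not taken from the notebooks. The simulated data, the helper names `center_by_batch` and `pc1_gap`, and the size of the study shift are assumptions for illustration. Per-study centering also removes any biological difference that is confounded with study, which is one reason model-based tools such as `sva::ComBat` accept a covariate model (`mod`) describing the signal to preserve.

```r
# Simulated log-expression matrix (rows = genes, columns = samples).
set.seed(3)
n_genes <- 50
batch <- rep(c("studyA", "studyB"), each = 6)
expr <- matrix(rnorm(n_genes * 12), nrow = n_genes,
               dimnames = list(paste0("g", seq_len(n_genes)), paste0("s", 1:12)))

# Simulate an additive technical shift affecting every gene in studyB.
expr[, batch == "studyB"] <- expr[, batch == "studyB"] + 3

# Simplified correction: subtract each gene's mean within each study.
center_by_batch <- function(x, batch) {
  out <- x
  for (b in unique(batch)) {
    idx <- batch == b
    out[, idx] <- x[, idx, drop = FALSE] - rowMeans(x[, idx, drop = FALSE])
  }
  out
}
corrected <- center_by_batch(expr, batch)

# Exploratory check: distance between study means along the first principal component.
pc1_gap <- function(x, batch) {
  pc1 <- prcomp(t(x))$x[, 1]
  as.numeric(abs(diff(tapply(pc1, batch, mean))))
}
gap_before <- pc1_gap(expr, batch)
gap_after <- pc1_gap(corrected, batch)
```

Comparing `gap_before` with `gap_after` mirrors the PCA-based quality control listed in the workflow: samples should stop separating by study once the technical shift is removed.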

🚀 Outputs

The final outputs generated by this workflow include:

  • Annotated expression matrices
  • Normalized datasets
  • Batch-corrected datasets
  • Exploratory QC visualizations
  • ML-ready matrices for downstream predictive modeling
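
As a rough illustration of what an ML-ready matrix can look like, the base R sketch below matches genes across two studies, merges the samples, and reshapes the result into a samples-by-features table with a label column. The gene names, sample names, expression values, and the `sensitive`/`resistant` labels are all hypothetical placeholders, not data from this repository.

```r
# Two toy expression matrices (rows = genes, columns = samples); values are made up.
study_a <- matrix(1:12, nrow = 4,
                  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                                  c("A1", "A2", "A3")))
study_b <- matrix(13:21, nrow = 3,
                  dimnames = list(c("geneD", "geneB", "geneE"),
                                  c("B1", "B2", "B3")))

# Cross-study gene matching: keep only genes measured in both studies.
shared <- intersect(rownames(study_a), rownames(study_b))
merged <- cbind(study_a[shared, , drop = FALSE], study_b[shared, , drop = FALSE])

# Hypothetical phenotype labels, keyed by sample name.
labels <- c(A1 = "sensitive", A2 = "resistant", A3 = "resistant",
            B1 = "sensitive", B2 = "sensitive", B3 = "resistant")

# ML-ready layout: one row per sample, one column per gene, plus the label.
ml_data <- data.frame(t(merged),
                      label = factor(labels[colnames(merged)]),
                      check.names = FALSE)
```

Indexing both matrices by `shared` guarantees the gene order is identical before `cbind`, which avoids silently misaligned features.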

📚 Notes

This repository is a guided, educational workflow that explains the rationale behind each preprocessing step. It prioritizes interpretability and reproducibility over simply providing executable scripts.

About

Reproducible transcriptomic data integration workflow for multi-cohort and cross-platform datasets including annotation, normalization, batch correction and ML-ready feature preparation.
