Transcriptomic Data Integration for Machine Learning

Cross-platform transcriptomic integration workflow for biomarker discovery and machine learning applications

📖 Overview

This project provides a reproducible workflow for integrating transcriptomic datasets from different studies and technologies (microarrays and RNA-seq) into machine-learning-ready datasets.

The workflow focuses on reducing technical variability while preserving biological signals relevant to CDK inhibitor resistance studies.

The repository is structured as a tutorial-style analysis designed to explain each preprocessing step rather than simply executing a black-box pipeline.

🎯 Objectives

Integrate transcriptomic datasets from independent studies
Harmonize data from different experimental platforms
Reduce technical variability and batch effects
Preserve biological information relevant to drug resistance
Generate clean datasets suitable for downstream machine learning

🛠 Technologies and Packages

Programming

R

Main packages

tidyverse
limma
sva
edgeR
AnnotationDbi
preprocessCore
biomaRt
ggplot2
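
A minimal sketch for checking that the packages listed above are available before knitting the notebooks. It assumes that installation goes through BiocManager, which this README does not state, so the install call is left commented out.

```r
# Packages named in this README, split by their usual source.
cran_pkgs <- c("tidyverse", "ggplot2")
bioc_pkgs <- c("limma", "sva", "edgeR", "AnnotationDbi",
               "preprocessCore", "biomaRt")
all_pkgs <- c(cran_pkgs, bioc_pkgs)

# requireNamespace() returns TRUE/FALSE without attaching the package.
installed <- vapply(all_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- all_pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Assumption: BiocManager is the installer of choice.
  # install.packages("BiocManager")
  # BiocManager::install(missing_pkgs)
} else {
  message("All listed packages are available.")
}
```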

🔬 Workflow

Microarray workflow

Probe annotation
Probe summarization
Quantile normalization
Log2 transformation
Low-expression filtering
Cross-study gene matching
Batch effect correction (ComBat / SVA)
Exploratory quality control
ML-ready matrix generation
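
The first microarray steps listed above can be sketched in base R on simulated data. This is an illustration under stated assumptions, not the code in `merging_samples.Rmd`: the intensities, the probe-to-gene map, and the filtering threshold are all hypothetical, and the quantile normalization is written out by hand to show what the step does (in practice a package function such as `preprocessCore::normalize.quantiles` would be the usual choice).

```r
set.seed(42)

# Simulated raw intensities: 8 probes x 4 samples (hypothetical data).
probe_expr <- matrix(
  rlnorm(8 * 4, meanlog = 7, sdlog = 1), nrow = 8,
  dimnames = list(paste0("probe", 1:8), paste0("sample", 1:4))
)

# Hypothetical probe annotation: two probes per gene.
probe2gene <- rep(c("GENE_A", "GENE_B", "GENE_C", "GENE_D"), each = 2)

# Probe summarization: average all probes that map to the same gene.
gene_sum <- rowsum(probe_expr, group = probe2gene)
gene_n <- as.vector(table(probe2gene)[rownames(gene_sum)])
gene_expr <- gene_sum / gene_n

# Quantile normalization: force every sample onto the same distribution,
# namely the mean of the sorted columns.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")
  ref <- rowMeans(apply(x, 2, sort))
  normalized <- apply(ranks, 2, function(r) ref[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}
norm_expr <- quantile_normalize(gene_expr)

# Log2 transformation; the +1 offset avoids log2(0).
log_expr <- log2(norm_expr + 1)

# Low-expression filtering: keep genes above a hypothetical threshold
# (5 on the log2 scale) in at least half of the samples.
keep <- rowSums(log_expr > 5) >= ncol(log_expr) / 2
filtered <- log_expr[keep, , drop = FALSE]
```

After normalization every sample column contains the same set of values, only assigned to different genes, which is what makes intensities comparable across arrays.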

RNA-seq workflow

Gene annotation
Count summarization
Low-expression filtering
TMM normalization
Log transformation
Cross-study integration
Batch correction
Exploratory PCA
ML-ready matrix generation
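
The filtering, TMM normalization, and log transformation steps listed above might look like the following with edgeR. This is a sketch on simulated counts, not the code in `merging_samples_rnaseq.Rmd`: the count matrix, the `sensitive`/`resistant` group labels, and the `prior.count` value are assumptions made for illustration.

```r
library(edgeR)

set.seed(1)
n_genes <- 200
n_samples <- 6

# Simulated counts: 150 expressed genes and 50 near-silent genes.
gene_means <- c(rep(50, 150), rep(0.5, 50))
counts <- matrix(
  rnbinom(n_genes * n_samples, mu = rep(gene_means, times = n_samples), size = 5),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)),
                  paste0("sample", seq_len(n_samples)))
)

# Hypothetical phenotype labels.
group <- factor(rep(c("sensitive", "resistant"), each = 3))

y <- DGEList(counts = counts, group = group)

# Low-expression filtering, using the group sizes stored in the DGEList.
keep <- filterByExpr(y)
y <- y[keep, , keep.lib.sizes = FALSE]

# TMM normalization: computes per-sample scaling factors.
y <- calcNormFactors(y, method = "TMM")

# Log transformation: log2 counts per million using the TMM-adjusted
# library sizes; prior.count damps the variance of small counts.
logcpm <- cpm(y, log = TRUE, prior.count = 1)
```

Filtering before normalization keeps the near-silent genes from influencing the TMM scaling factors.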

📂 Repository Structure

.
├── Microarrays_data
│   ├── merging_samples.Rmd
│   └── index.html
│
├── RNA-seq_data
│   ├── merging_samples_rnaseq.Rmd
│   └── index.html
│
├── index.html
└── README.md

📊 Key concepts

Multi-cohort transcriptomic integration
Cross-platform harmonization
Feature engineering
Batch effect correction
Exploratory data analysis
Reproducible workflows
Machine learning preprocessing
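
To make the integration and batch-correction concepts above concrete, here is a minimal sketch that matches genes across two simulated studies, applies `ComBat` from the sva package, and runs a PCA on the result. The study matrices, the shift between them, and the `condition` labels are all hypothetical. Passing the condition through `mod` is one common way to ask ComBat to protect a biological covariate while removing the study effect; whether the notebooks do this is not stated here.

```r
library(sva)

set.seed(123)
n_genes <- 200

# Two hypothetical studies on the log2 scale with partially overlapping
# gene sets and a systematic shift between them (the "batch effect").
study1 <- matrix(rnorm(n_genes * 6, mean = 8), nrow = n_genes,
                 dimnames = list(paste0("gene", 1:200), paste0("s1_", 1:6)))
study2 <- matrix(rnorm(n_genes * 6, mean = 9.5), nrow = n_genes,
                 dimnames = list(paste0("gene", 51:250), paste0("s2_", 1:6)))

# Cross-study gene matching: keep only genes measured in both studies.
common <- intersect(rownames(study1), rownames(study2))
merged <- cbind(study1[common, ], study2[common, ])

# Batch = study of origin; condition = hypothetical phenotype,
# balanced within each study so it is not confounded with batch.
batch <- factor(rep(c("study1", "study2"), each = 6))
condition <- factor(rep(rep(c("sensitive", "resistant"), each = 3), times = 2))
mod <- model.matrix(~ condition)

# Batch correction while modelling the biological covariate.
corrected <- ComBat(dat = merged, batch = batch, mod = mod)

# Exploratory PCA on samples; plot pca$x[, 1:2] coloured by batch
# to check whether the studies still separate.
pca <- prcomp(t(corrected), scale. = TRUE)
```

If batch and condition were fully confounded (for example, all resistant samples in one study), the two effects could not be separated, so the design should be checked before correcting.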

🚀 Outputs

The final outputs generated by this workflow include:

Annotated expression matrices
Normalized datasets
Batch-corrected datasets
Exploratory QC visualizations
ML-ready matrices for downstream predictive modeling
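
As an illustration of what an ML-ready matrix could look like, the sketch below reshapes a genes-by-samples expression matrix into a samples-by-features table with a label column and writes it to CSV. The expression values, the `sensitive`/`resistant` labels, and the output path are hypothetical; the actual file names and formats produced by the notebooks are not specified in this README.

```r
set.seed(7)

# Hypothetical batch-corrected expression matrix: genes x samples.
expr <- matrix(rnorm(5 * 4), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("sample", 1:4)))

# Hypothetical class labels, keyed by sample ID.
labels <- c(sample1 = "sensitive", sample2 = "resistant",
            sample3 = "sensitive", sample4 = "resistant")

# Most ML tools expect one row per sample and one column per feature,
# so the matrix is transposed and the label is matched by sample ID.
ml_data <- data.frame(
  sample_id = colnames(expr),
  label = unname(labels[colnames(expr)]),
  t(expr),
  check.names = FALSE
)
rownames(ml_data) <- NULL

# Written to a temporary file here; a real run would use a project path.
out_file <- tempfile(fileext = ".csv")
write.csv(ml_data, out_file, row.names = FALSE)
```

Matching labels by sample ID rather than by position guards against silent misalignment when the expression matrix has been reordered or subset.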

📚 Notes

This repository was designed as a guided and educational workflow intended to explain the rationale behind each preprocessing step. The focus is on interpretability and reproducibility rather than only providing executable scripts.
