How to run the Transportome Profiler locally

Clone the repo

For any Linux system, including WSL:

git clone git@github.com:TCP-Lab/transportome_profiler.git
cd ./transportome_profiler

Install R requirements

Rscript ./src/helper_scripts/install_r_pkgs.R

(Create and) Activate a Python virtual environment and install requirements

python -m venv env
source ./env/bin/activate
pip install -r ./src/requirements.txt

# To update internal packages (i.e., bonsai, gene_ranker, metasplit, and panid)
# from the respective git repos  
pip install --force-reinstall -r ./src/requirements.txt

# When finished:
deactivate

Install `xsv`

This is a program required by metasplit for the fast reshaping of very large CSV files.

sudo pacman -Syu xsv

Install `fast-cohen`

This is a program written in Rust that performs a fast computation of the Cohen's d statistics.

cargo install --git https://github.com/MrHedmad/fast-cohen.git

Install `Kerblam!`

This is our workflow manager.

# Install a prebuilt binary
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/MrHedmad/kerblam/releases/latest/download/kerblam-installer.sh | sh

# Or, alternatively, install it from source with cargo
cargo install kerblam

# Check it with
kerblam --version

Fetch MTP-DB, TCGA and GTEx datasets from remote (through XENA)

kerblam data fetch

will download in /data/in the following objects:

expression_matrix.tsv.gz, an archive containing all the log2 raw counts for transcript abundance quantification, as provided by TCGA and GTEx projects;
expression_matrix_metadata.tsv.gz, an archive containing the related metadata for each sample (i.e., patient or, better, specimen);
MTPDB.sqlite.gz, an archive containing our Membrane Transport Protein Database used for gene set definition;
ensg_data.csv, giving for each ENSG ID the corresponding HGNC gene symbol;
geo, a folder containing the tables of raw counts and metadata for individual small studies retrieved from GEO for validation purposes.

Raw counts

The expression_matrix.tsv.gz archive is our Zenodo copy of the gene expression RNAseq - RSEM expected_count data set by XENA, containing both TCGA and GTEx harmonized transcriptomics data.

author: UCSC TOIL RNA-seq recompute
unit: log2(expected_count+1)
size: 60,499 identifiers (genes) x 19,109 samples
download: https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/TcgaTargetGtex_gene_expected_count.gz

Sample check

zcat expression_matrix.tsv.gz | head -n1 | grep -oE 'TCGA-|GTEX-|TARGET-|\sK-' | wc -l
zcat expression_matrix.tsv.gz | head -n1 | grep -oE 'TCGA-' | wc -l
zcat expression_matrix.tsv.gz | head -n1 | grep -oE 'GTEX-' | wc -l
zcat expression_matrix.tsv.gz | head -n1 | grep -oE 'TARGET-' | wc -l
zcat expression_matrix.tsv.gz | head -n1 | grep -oE '\sK-' | wc -l

returned:

Source	Sample size
TOT	19,109
TCGA	10,530
GTEX	7,775
TARGET	734
K	70

Gene check

zcat expression_matrix.tsv.gz | wc -l

returned: 60499.

Run `heatmaps` workflow on test data set

Generate the reduced data set for testing

This pipeline makes use of metasample.py

kerblam run gen_test_data

Note

The accuracy of this step is verified by the metasample section of the profiler_tests.R script.

Run the analysis pipeline

Edit the ./data/in/config/heatmaps_runtime_options.json JSON file based on the desired options

{
    "rank_method": "norm_fold_change",
    "threads": 3,
    "save_extra_plots": false,
    "prune_similarity": 0.9,
    "prune_direction": "bottomup",
    "run_unweighted": false,
    "alpha_threshold": 0.20,
    "cluster_heatmap_cols": false
}

In particular, use generanker --list-methods within the Python virtual environment, to see the available ranking metrics implemented by Gene Ranker. Then

kerblam run heatmaps -l --profile test

this will run the following modules (from ./src/modules/) along with the related dependencies:

ranking/select_and_run.py
- metasplit
- gene_ranker
  - fast-cohen
make_genesets.py
- bonsai
run_gsea.R
plotting/plot_large_heatmap.R

Rank genes

metasplit is the program used by select_and_run.py to parse the JSON query file (./data/in/config/DEA_queries/dea_queries.json) and extract within-group (i.e., for each cancer type) case and control submatrices from the global expression matrix. Then, for each cancer type, gene_ranker is run to calculate the ranking metric selected through the rank_method JSON property. Final ranks are saved in the ./data/deas/ directory as two-coulmn (sample,ranking) CSV tables.

Make the gene sets

1. Generate large tables

make_genesets.py uses make_large_tables() Ariadne's function to make 9 fundamental "large tables" based on the queries hardcoded in ./data/in/config/gene_lists/basic.json

whole_transportome
|___pores
|	|___channels
|	|___aquaporins
|___transporters
	|___solute_carriers
	|___atp_driven
		|___ABC
		|___pumps

2. Generate lists from large tables

For each large table, bonsai is used to generate tree structures representing all the possible gene sets, based on the the 3 parameters of the function generate_gene_list_trees(), with the following default values:

min_pop_score: float = 0.5,
min_set_size: int = 10,
min_recurse_set_size: int = 40,

with the following meaning:

min_pop_score: minimum portion of non-NA values in a column to be considered for gene lists.
min_set_size: Minimum number of genes to produce a valid gene set.
min_recurse_set_size: minimum parent-geneset size to have before running recursion on it (effective if recurse boolean is True).

These parameters cannot be assigned by editing the heatmaps_runtime_options.json JSON file because they are considered lower-level parameters, however they can be passed as arguments to the make_genesets.py script within the heatmaps.makefile workflow.

# E.g.,
python $(mods)/make_genesets.py ./data/MTPDB.sqlite ./data/in/config/gene_lists/basic.json \
	./data/genesets.json ./data/genesets_repr.txt \
	--min_pop_score 0.7 \
	--min_set_size 15 \
	--min_recurse_set_size 20 \
	--prune_direction $(PRUNE_DIRECTION) \
	--prune_similarity $(PRUNE_SIMILARITY) \
	--verbose

3. Make the union of the genesets following the structure

All gene sets are merged together into a large tree structure, then the peune() function is used to remove redundancy. The two parameters of prune() function (similarity and direction) are set in the ./data/in/config/heatmaps_runtime_options.json file, with the following defaults:

"prune_similarity": 0.5,
"prune_direction": "bottomup",

Notes on DESeq2

You have to set the option minReplicatesForReplace to Inf in DESeq in order to never replace outliers (and so have the baseMeans exactly equal to the mean of the MOR-normalized counts across all samples) dds2 <- DESeq2::DESeq(dds, minReplicatesForReplace = Inf)
log2 Fold Change in DESeq2 is not identical to FC calculated from normalized count (https://support.bioconductor.org/p/p134193/) ...turning off fold change shrinkage should make log2foldchange from DESeq2 be simply equal to (mean of normalized counts group B) / (mean of normalized counts group A). However, it seems that some degree of fold change moderation is done even when betaPrior is False. It's not always equal to the ratio of the mean of normalized counts depending on the fit of the GLM, but close (when no other factors are present in the design). Michael Love

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
.gitignore		.gitignore
README.md		README.md
auxiliary_functions.R		auxiliary_functions.R
profiler_tests.R		profiler_tests.R
run_mg.sh		run_mg.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

How to run the Transportome Profiler locally

Clone the repo

Install R requirements

(Create and) Activate a Python virtual environment and install requirements

Install `xsv`

Install `fast-cohen`

Install `Kerblam!`

Fetch MTP-DB, TCGA and GTEx datasets from remote (through XENA)

Raw counts

Sample check

Gene check

Run `heatmaps` workflow on test data set

Generate the reduced data set for testing

Run the analysis pipeline

Rank genes

Make the gene sets

1. Generate large tables

2. Generate lists from large tables

3. Make the union of the genesets following the structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

How to run the Transportome Profiler locally

Clone the repo

Install R requirements

(Create and) Activate a Python virtual environment and install requirements

Install xsv

Install fast-cohen

Install Kerblam!

Fetch MTP-DB, TCGA and GTEx datasets from remote (through XENA)

Raw counts

Sample check

Gene check

Run heatmaps workflow on test data set

Generate the reduced data set for testing

Run the analysis pipeline

Rank genes

Make the gene sets

1. Generate large tables

2. Generate lists from large tables

3. Make the union of the genesets following the structure

About

Resources

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Install `xsv`

Install `fast-cohen`

Install `Kerblam!`

Run `heatmaps` workflow on test data set

Packages