diff --git a/README.md b/README.md
index bedd52a..d4e5983 100644
--- a/README.md
+++ b/README.md
@@ -20,7 +20,7 @@
4. Evaluation & Downstream Analysis: The trained model is evaluated using the test dataset by calculating metrics such as precision, recall, f1-score, and accuracy. Various visualizations, such as ROC curve of class annotation, feature rank plots, heatmap of top genes per class, [DGE analysis](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/analysis/differential_gene_expression/dge.ipynb), and [gene recall curves](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/analysis/gene_recall_curve/gene_recall_curve.ipynb), are generated.
-The following flowchart explains the major steps of the scaLR platform.
+**The following flowchart illustrates the major steps of the scaLR platform.**

@@ -29,7 +29,6 @@ The following flowchart explains the major steps of the scaLR platform.
- ScaLR can be installed using git or pip. It is tested with Python 3.10, and using that environment is recommended.
-
```
conda create -n scaLR_env python=3.10
@@ -47,9 +46,9 @@ pip install -r requirements.txt
```
pip install pyscaLR
```
-*Note* If the user wants to run the entire pipeline via installing pip pyscalr, they should clone/download these files(`pipeline.py` and `config.yaml`) from the git repository.
+**Note:** To run the entire pipeline after installing `pyscaLR` via pip, clone/download `pipeline.py` and `config.yaml` from the git repository.
-## Input Data
+## Input data format
- Currently the pipeline expects all datasets in [anndata](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html) formats (`.h5ad` files only).
- The anndata object should contain cell samples as `obs` and genes as `var`.
- `adata.X`: contains normalized gene counts/expression values (`log1p` normalization with range `0-10` expected).
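As a quick sanity check of the normalization expectation above, raw counts can be `log1p`-transformed into the expected `0-10` range. A minimal NumPy sketch with toy counts (not part of the scaLR API):

```python
import numpy as np

# Toy raw counts standing in for a cell-by-gene matrix (hypothetical data).
raw_counts = np.random.default_rng(0).poisson(5.0, size=(100, 50)).astype(np.float32)

# log1p normalization, as expected in adata.X by the pipeline.
normalized = np.log1p(raw_counts)

# Values should land in the expected ~0-10 range.
print(normalized.min(), normalized.max())
```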
@@ -60,15 +59,192 @@ pip install pyscaLR
## How to run
1. Modify the configuration file for each stage of the pipeline, available inside the config folder [config.yml], as per your requirements. Simply omit/comment out stages of the pipeline you do not wish to run.
-2. Refer config.yml & it's detailed config [README](https://github.com/infocusp/scaLR/blob/main/config/README.md) file on how to use different parameters and files.
+2. Refer to **config.yml** and **its detailed config** [README](https://github.com/infocusp/scaLR/blob/main/config/README.md) on how to use different parameters and files.
3. Then use the `pipeline.py` file to run the entire pipeline according to your configurations. It takes the config path as an argument (`-c | --config`), along with optional flags to log all parts of the pipeline (`-l | --log`) and to profile memory usage (`-m | --memoryprofiler`).
4. `python pipeline.py --config /path/to/config.yaml -l -m` to run scaLR.
-## Examples configs
+## Example configs
+
+### Config for cell type classification and biomarker identification
+
+NOTE: The model parameters below are only suggestions. Feel free to experiment with them to tune the model and improve the results.
+
+An example configuration file for the current dataset, incorporating the edits below, can be found at `scaLR/tutorials/pipeline/config_celltype.yaml`. Set the device to `cuda` or `cpu` as required.
+
+- **Device setup**
+    - Set device: 'cuda' for GPU-enabled runs, or device: 'cpu' for CPU-only runs.
+- **Experiment Config**
+    - The default exp_run number is 0. If not changed, the cell type classification experiment will be exp_run_0, containing all pipeline results.
+- **Data Config**
+    - Update the full_datapath to `data/modified_adata.h5ad` (as we will include GeneRecallCurve in the downstream analysis).
+ - Specify the num_workers value for effective parallelization.
+ - Set target to cell_type.
+- **Feature Selection**
+ - Specify the num_workers value for effective parallelization.
+ - Update the model layers to [5000, 10], as there are only 10 cell types in the dataset.
+ - Change epoch to 10.
+- **Final Model Training**
+ - Update the model layers to the same as for feature selection: [5000, 10].
+ - Change epoch to 100.
+- **Analysis**
+ - Downstream Analysis
+ - Uncomment the test_samples_downstream_analysis section.
+ - Update the reference_genes_path to `scaLR/tutorials/pipeline/grc_reference_gene.csv`.
+ - Refer to the section below:
+ ```
+ # Config file for pipeline run for cell type classification.
+
+ # DEVICE SETUP.
+ device: 'cuda'
+
+ # EXPERIMENT.
+ experiment:
+ dirpath: 'scalr_experiments'
+ exp_name: 'exp_name'
+ exp_run: 0
+
+ # DATA CONFIG.
+ data:
+ sample_chunksize: 20000
+
+ train_val_test:
+ full_datapath: 'data/modified_adata.h5ad'
+ num_workers: 2
+
+ splitter_config:
+ name: GroupSplitter
+ params:
+ split_ratio: [7, 1, 2.5]
+ stratify: 'donor_id'
+
+ # split_datapaths: ''
+
+ # preprocess:
+ # - name: SampleNorm
+ # params:
+ # **args
+
+ # - name: StandardScaler
+ # params:
+ # **args
+
+ target: cell_type
+
+ # FEATURE SELECTION.
+ feature_selection:
+
+ # score_matrix: '/path/to/matrix'
+ feature_subsetsize: 5000
+ num_workers: 2
+
+ model:
+ name: SequentialModel
+ params:
+ layers: [5000, 10]
+ weights_init_zero: True
+
+ model_train_config:
+ trainer: SimpleModelTrainer
+
+ dataloader:
+ name: SimpleDataLoader
+ params:
+ batch_size: 25000
+ padding: 5000
+
+ optimizer:
+ name: SGD
+ params:
+ lr: 1.0e-3
+ weight_decay: 0.1
+
+ loss:
+ name: CrossEntropyLoss
+
+ epochs: 10
+
+ scoring_config:
+ name: LinearScorer
+
+ features_selector:
+ name: AbsMean
+ params:
+ k: 5000
+
+ # FINAL MODEL TRAINING.
+ final_training:
+
+ model:
+ name: SequentialModel
+ params:
+ layers: [5000, 10]
+ dropout: 0
+ weights_init_zero: False
+
+ model_train_config:
+ resume_from_checkpoint: null
+
+ trainer: SimpleModelTrainer
+
+ dataloader:
+ name: SimpleDataLoader
+ params:
+ batch_size: 15000
+
+ optimizer:
+ name: Adam
+ params:
+ lr: 1.0e-3
+ weight_decay: 0
+
+ loss:
+ name: CrossEntropyLoss
+
+ epochs: 100
+
+ callbacks:
+ - name: TensorboardLogger
+ - name: EarlyStopping
+ params:
+ patience: 3
+ min_delta: 1.0e-4
+ - name: ModelCheckpoint
+ params:
+ interval: 5
+ analysis:
+
+ model_checkpoint: ''
-### Config edits (For clinical condition-specific biomarker identification and DGE analysis)
+ dataloader:
+ name: SimpleDataLoader
+ params:
+ batch_size: 15000
+
+ gene_analysis:
+ scoring_config:
+ name: LinearScorer
+
+ features_selector:
+ name: ClasswisePromoters
+ params:
+ k: 100
+ test_samples_downstream_analysis:
+ - name: GeneRecallCurve
+ params:
+ reference_genes_path: 'scaLR/tutorials/pipeline/grc_reference_gene.csv'
+ top_K: 300
+ plots_per_row: 3
+ features_selector:
+ name: ClasswiseAbs
+ params: {}
+ - name: Heatmap
+ params: {}
+ - name: RocAucCurve
+ params: {}
+ ```
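For intuition, the `layers: [5000, 10]` setting maps the 5000 selected features to the 10 cell-type classes. Assuming `SequentialModel` builds a plain linear stack, the shapes work out as in this hypothetical NumPy sketch (not scaLR's implementation):

```python
import numpy as np

# layers: [5000, 10] ~ one linear layer: 5000 selected genes -> 10 class logits.
n_features, n_classes = 5000, 10
W = np.zeros((n_features, n_classes), dtype=np.float32)  # weights_init_zero: True
b = np.zeros(n_classes, dtype=np.float32)

x = np.random.default_rng(0).random((3, n_features), dtype=np.float32)  # 3 cells
logits = x @ W + b
print(logits.shape)  # one logit per class, per cell
```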
+### Config for clinical condition-specific biomarker identification and DGE analysis
-An example configuration file for the current dataset, incorporating the edits below, can be found at: scaLR/tutorials/pipeline/config_clinical.yaml.Please update the device as CUDA or CPU as per runtype
+An example configuration file is available at `scaLR/tutorials/pipeline/config_clinical.yaml`. Set the device to `cuda` or `cpu` as required.
- Experiment Config
  - Make sure to change the exp_run number if an earlier experiment (here, cell type classification) used the same number. Since we have already run one experiment, we'll change the number to '1'.
@@ -83,10 +259,10 @@ An example configuration file for the current dataset, incorporating the edits b
- epoch as 100.
- Analysis
- Downstream Analysis
- - Uncomment the full_samples_downstream_analysis section.
+  - Uncomment the full_samples_downstream_analysis section in the example config file.
  - We are not performing the 'gene_recall_curve' analysis in this case. It could be performed if COVID-19/normal-specific reference genes were available, but for the normal condition the set of candidate genes is too broad.
- - There are two options to perform differential gene expression (DGE) analysis: DgePseudoBulk and DgeLMEM. The parameters are updated as follows. Note that DgeLMEM may take a bit more time, as the multiprocessing is not very efficient with only 2 CPUs in the current Colab runtime.
- - Please refer to the section below:
+ - There are two options to perform differential gene expression (DGE) analysis: **DgePseudoBulk and DgeLMEM**. The parameters are updated as follows. Note that DgeLMEM may take a bit more time, as the multiprocessing is not very efficient with only 2 CPUs in the current Colab runtime.
+ - Refer to the section below:
```
analysis:
@@ -102,67 +278,6 @@ An example configuration file for the current dataset, incorporating the edits b
scoring_config:
name: LinearScorer
- features_selector:
- name: ClasswisePromoters
- params:
- k: 100
- full_samples_downstream_analysis:
- - name: Heatmap
- params:
- top_n_genes: 100
- - name: RocAucCurve
- params: {}
- - name: DgePseudoBulk
- params:
- celltype_column: 'cell_type'
- design_factor: 'disease'
- factor_categories: ['COVID-19', 'normal']
- sum_column: 'donor_id'
- cell_subsets: ['conventional dendritic cell', 'natural killer cell']
- - name: DgeLMEM
- params:
- fixed_effect_column: 'disease'
- fixed_effect_factors: ['COVID-19', 'normal']
- group: 'donor_id'
- celltype_column: 'cell_type'
- cell_subsets: ['conventional dendritic cell']
- gene_batch_size: 1000
- coef_threshold: 0.1
- ```
-### Config edits (For clinical condition-specific biomarker identification and DGE analysis)
- An example configuration file for the current dataset, incorporating the edits below, can be found at: scaLR/tutorials/pipeline/config_clinical.yaml.Please update the device as cuda or cpu as per runtype
-
-- Experiment Config
- - Make sure to change the exp_run number if you have an experiment with the same number earlier related to cell classification.As we have done one experiment earlier, we'll change the number now to '1'.
-- Data Config
- - The full_datapath remains the same as above.
- - Change the target to disease (this column contains data for clinical conditions, COVID-19/normal).
-- Feature Selection
- - Update the model layers to [5000, 2], as there are only two types of clinical conditions.
- - epoch as 10.
-- Final Model Training
- - Update the model layers to the same as for feature selection: [5000, 2].
- - epoch as 100.
-- Analysis
- - Downstream Analysis
- - Uncomment the full_samples_downstream_analysis section.
- - We are not performing the 'gene_recall_curve' analysis in this case. It can be performed if the COVID-19/normal specific genes are available, but there are many possibilities of genes in the case of normal conditions.
- - There are two options to perform differential gene expression (DGE) analysis: DgePseudoBulk and DgeLMEM. The parameters are updated as follows. Note that DgeLMEM may take a bit more time, as the multiprocessing is not very efficient with only 2 CPUs in the current Colab runtime.
- - Please refer to the section below:
- ```
- analysis:
-
- model_checkpoint: ''
-
- dataloader:
- name: SimpleDataLoader
- params:
- batch_size: 15000
-
- gene_analysis:
- scoring_config:
- name: LinearScorer
-
features_selector:
name: ClasswisePromoters
params:
@@ -192,16 +307,17 @@ An example configuration file for the current dataset, incorporating the edits b
```
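The pseudobulk idea behind `DgePseudoBulk` can be sketched in plain pandas: per-cell counts are summed per donor (the `sum_column`) within a cell type before testing the design factor. The toy data and column names below mirror the config above; this is not scaLR's implementation:

```python
import pandas as pd

# Toy per-cell counts for one cell type (hypothetical data).
cells = pd.DataFrame({
    "donor_id": ["d1", "d1", "d2", "d2", "d2"],
    "disease":  ["COVID-19", "COVID-19", "normal", "normal", "normal"],
    "GENE_A":   [3, 5, 2, 1, 4],
})

# Pseudobulk: sum counts per donor (sum_column='donor_id'),
# keeping the design factor (design_factor='disease') for the DGE test.
bulk = cells.groupby(["donor_id", "disease"], as_index=False)["GENE_A"].sum()
print(bulk)
```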
## Interactive tutorials
-Detailed tutorials have been made on how to use some functionalities as a scaLR library. Find the links below.
+Detailed tutorials on how to use some pipeline functionalities as a scaLR library are available via the links below.
- **scaLR pipeline** [](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/pipeline/scalr_pipeline.ipynb)
- **Differential gene expression analysis** [](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/analysis/differential_gene_expression/dge.ipynb)
- **Gene recall curve** [](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/analysis/gene_recall_curve/gene_recall_curve.ipynb)
- **Normalization** [](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/preprocessing/normalization.ipynb)
- **Batch correction** [](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/preprocessing/batch_correction.ipynb)
-- **SHAP analysis** [](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/analysis/shap_analysis/shap_heatmap.ipynb)
-## Experiment Output Structure
+- **An example Jupyter notebook to [run scaLR on a local machine](https://github.com/infocusp/scaLR/blob/main/tutorials/pipeline/scalr_pipeline_local_run.ipynb)**.
+
+## Experiment output structure
- **pipeline.py**:
The main script that performs an end-to-end run.
- `exp_dir`: root experiment directory for the storage of all step outputs of the platform specified in the config.
@@ -256,8 +372,6 @@ Performs evaluation of best model trained on user-defined metrics on the test se
- `lmemDGE_celltype.csv`: contains LMEM DGE results between selected factor categories for a celltype.
- `lmemDGE_fixed_effect_factor_X.svg`: volcano plot of coefficient vs -log10(p-value) of genes.
-
-
## Citation
Jogani Saiyam, Anand Santosh Pol, Mayur Prajapati, Amit Samal, Kriti Bhatia, Jayendra Parmar, Urvik Patel, Falak Shah, Nisarg Vyas, and Saurabh Gupta. "scaLR: a low-resource deep neural network-based platform for single cell analysis and biomarker discovery." bioRxiv (2024): 2024-09.
diff --git a/tutorials/pipeline/scalr_pipeline_local_run.ipynb b/tutorials/pipeline/scalr_pipeline_local_run.ipynb
new file mode 100644
index 0000000..c9fe5ec
--- /dev/null
+++ b/tutorials/pipeline/scalr_pipeline_local_run.ipynb
@@ -0,0 +1,1766 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "dfGECxsGN9bo"
+ },
+ "source": [
+    "\n",
+ "\n",
+ "# Single-cell analysis using Low Resource (scaLR)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Xna7qg2PgjJm"
+ },
+ "source": [
+ "\n",
+ "\n",
+ "**Note:** \n",
+ "1. If scaLR is intended to be run on a local system, please ensure that an `ipy kernel` with Python version `3.10` is selected. Then, all the required installations can be performed as mentioned in the section below.\n",
+ "\n",
+ "2. If scaLR has already been installed as mentioned in [Pre-requisites and installation scaLR](https://github.com/infocusp/scaLR), the repository cloning and requirement installation steps below can be skipped. Selecting the `ipy kernel` can be done as follows:\n",
+ "\n",
+ " - Open the terminal and run: \n",
+ " \n",
+ " ```\n",
+ " conda install -c anaconda ipykernel\n",
+ " python -m ipykernel install --user --name=scaLR_env\n",
+ " ```\n",
+ " - Select `scaLR_env` as the `ipy kernel` in `scalr_pipeline.ipynb`. \n",
+ " - Finally, update the system path for scaLR, as mentioned in the shell before data download. e.g.: \n",
+ " ```\n",
+ " sys.path.append('path/to/scaLR/')\n",
+ " ``` \n",
+ "## Cloning scaLR"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "CdutIWiy8xJb"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Cloning into 'scaLR'...\n",
+ "remote: Enumerating objects: 3452, done.\u001b[K\n",
+ "remote: Counting objects: 100% (372/372), done.\u001b[K\n",
+ "remote: Compressing objects: 100% (181/181), done.\u001b[K\n",
+ "remote: Total 3452 (delta 243), reused 261 (delta 189), pack-reused 3080 (from 1)\u001b[K\n",
+ "Receiving objects: 100% (3452/3452), 170.03 MiB | 2.80 MiB/s, done.\n",
+ "Resolving deltas: 100% (2073/2073), done.\n"
+ ]
+ }
+ ],
+ "source": [
+ "!git clone https://github.com/infocusp/scaLR.git"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "MLJo_0EugjJq"
+ },
+ "source": [
+ "Install all requirements after cloning the repository, excluding packages that are pre-installed in Colab."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "9dQLPmLwPL0C"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Defaulting to user installation because normal site-packages is not writeable\n",
+ "Requirement already satisfied: anndata==0.10.9 in /home/amit.samal/.local/lib/python3.10/site-packages (0.10.9)\n",
+ "Requirement already satisfied: isort==5.13.2 in /home/amit.samal/.local/lib/python3.10/site-packages (5.13.2)\n",
+ "Collecting loky==3.4.1\n",
+ " Downloading loky-3.4.1-py3-none-any.whl.metadata (6.4 kB)\n",
+ "Requirement already satisfied: pillow==10.4.0 in /home/amit.samal/.local/lib/python3.10/site-packages (10.4.0)\n",
+ "Requirement already satisfied: pydeseq2==0.4.11 in /home/amit.samal/.local/lib/python3.10/site-packages (0.4.11)\n",
+ "Requirement already satisfied: pyparsing==3.2.0 in /home/amit.samal/.local/lib/python3.10/site-packages (3.2.0)\n",
+ "Requirement already satisfied: pytest==8.3.3 in /home/amit.samal/.local/lib/python3.10/site-packages (8.3.3)\n",
+ "Requirement already satisfied: PyYAML==6.0.2 in /home/amit.samal/.local/lib/python3.10/site-packages (6.0.2)\n",
+ "Requirement already satisfied: scanpy==1.10.3 in /home/amit.samal/.local/lib/python3.10/site-packages (1.10.3)\n",
+ "Requirement already satisfied: scikit-learn==1.5.2 in /home/amit.samal/.local/lib/python3.10/site-packages (1.5.2)\n",
+ "Requirement already satisfied: shap==0.46.0 in /home/amit.samal/.local/lib/python3.10/site-packages (0.46.0)\n",
+ "Requirement already satisfied: tensorboard==2.17.0 in /home/amit.samal/.local/lib/python3.10/site-packages (2.17.0)\n",
+ "Requirement already satisfied: toml==0.10.2 in /home/amit.samal/.local/lib/python3.10/site-packages (0.10.2)\n",
+ "Requirement already satisfied: tqdm==4.66.5 in /home/amit.samal/.local/lib/python3.10/site-packages (4.66.5)\n",
+ "Requirement already satisfied: yapf==0.40.2 in /home/amit.samal/.local/lib/python3.10/site-packages (0.40.2)\n",
+ "Requirement already satisfied: array-api-compat!=1.5,>1.4 in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (1.5.1)\n",
+ "Requirement already satisfied: exceptiongroup in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (1.2.0)\n",
+ "Requirement already satisfied: h5py>=3.1 in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (3.10.0)\n",
+ "Requirement already satisfied: natsort in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (8.4.0)\n",
+ "Requirement already satisfied: numpy>=1.23 in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (1.26.3)\n",
+ "Requirement already satisfied: packaging>=20.0 in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (24.0)\n",
+ "Requirement already satisfied: pandas!=2.1.0rc0,!=2.1.2,>=1.4 in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (1.5.3)\n",
+ "Requirement already satisfied: scipy>1.8 in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (1.12.0)\n",
+ "Requirement already satisfied: cloudpickle in /home/amit.samal/.local/lib/python3.10/site-packages (from loky==3.4.1) (3.0.0)\n",
+ "Requirement already satisfied: matplotlib>=3.6.2 in /home/amit.samal/.local/lib/python3.10/site-packages (from pydeseq2==0.4.11) (3.8.3)\n",
+ "Requirement already satisfied: iniconfig in /home/amit.samal/.local/lib/python3.10/site-packages (from pytest==8.3.3) (2.0.0)\n",
+ "Requirement already satisfied: pluggy<2,>=1.5 in /home/amit.samal/.local/lib/python3.10/site-packages (from pytest==8.3.3) (1.5.0)\n",
+ "Requirement already satisfied: tomli>=1 in /home/amit.samal/.local/lib/python3.10/site-packages (from pytest==8.3.3) (2.1.0)\n",
+ "Requirement already satisfied: joblib in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (1.3.2)\n",
+ "Requirement already satisfied: legacy-api-wrap>=1.4 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (1.4)\n",
+ "Requirement already satisfied: networkx>=2.7 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (3.2.1)\n",
+ "Requirement already satisfied: numba>=0.56 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (0.59.1)\n",
+ "Requirement already satisfied: patsy in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (0.5.6)\n",
+ "Requirement already satisfied: pynndescent>=0.5 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (0.5.11)\n",
+ "Requirement already satisfied: seaborn>=0.13 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (0.13.2)\n",
+ "Requirement already satisfied: session-info in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (1.0.0)\n",
+ "Requirement already satisfied: statsmodels>=0.13 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (0.14.1)\n",
+ "Requirement already satisfied: umap-learn!=0.5.0,>=0.5 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (0.5.5)\n",
+ "Requirement already satisfied: threadpoolctl>=3.1.0 in /home/amit.samal/.local/lib/python3.10/site-packages (from scikit-learn==1.5.2) (3.4.0)\n",
+ "Requirement already satisfied: slicer==0.0.8 in /home/amit.samal/.local/lib/python3.10/site-packages (from shap==0.46.0) (0.0.8)\n",
+ "Requirement already satisfied: absl-py>=0.4 in /home/amit.samal/.local/lib/python3.10/site-packages (from tensorboard==2.17.0) (2.1.0)\n",
+ "Requirement already satisfied: grpcio>=1.48.2 in /home/amit.samal/.local/lib/python3.10/site-packages (from tensorboard==2.17.0) (1.70.0)\n",
+ "Requirement already satisfied: markdown>=2.6.8 in /home/amit.samal/.local/lib/python3.10/site-packages (from tensorboard==2.17.0) (3.7)\n",
+ "Requirement already satisfied: protobuf!=4.24.0,<5.0.0,>=3.19.6 in /home/amit.samal/.local/lib/python3.10/site-packages (from tensorboard==2.17.0) (4.25.6)\n",
+ "Requirement already satisfied: setuptools>=41.0.0 in /usr/lib/python3/dist-packages (from tensorboard==2.17.0) (59.6.0)\n",
+ "Requirement already satisfied: six>1.9 in /usr/lib/python3/dist-packages (from tensorboard==2.17.0) (1.16.0)\n",
+ "Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /home/amit.samal/.local/lib/python3.10/site-packages (from tensorboard==2.17.0) (0.7.2)\n",
+ "Requirement already satisfied: werkzeug>=1.0.1 in /home/amit.samal/.local/lib/python3.10/site-packages (from tensorboard==2.17.0) (3.1.3)\n",
+ "Requirement already satisfied: importlib-metadata>=6.6.0 in /home/amit.samal/.local/lib/python3.10/site-packages (from yapf==0.40.2) (8.6.1)\n",
+ "Requirement already satisfied: platformdirs>=3.5.1 in /home/amit.samal/.local/lib/python3.10/site-packages (from yapf==0.40.2) (4.2.0)\n",
+ "Requirement already satisfied: zipp>=3.20 in /home/amit.samal/.local/lib/python3.10/site-packages (from importlib-metadata>=6.6.0->yapf==0.40.2) (3.21.0)\n",
+ "Requirement already satisfied: contourpy>=1.0.1 in /home/amit.samal/.local/lib/python3.10/site-packages (from matplotlib>=3.6.2->pydeseq2==0.4.11) (1.2.0)\n",
+ "Requirement already satisfied: cycler>=0.10 in /home/amit.samal/.local/lib/python3.10/site-packages (from matplotlib>=3.6.2->pydeseq2==0.4.11) (0.12.1)\n",
+ "Requirement already satisfied: fonttools>=4.22.0 in /home/amit.samal/.local/lib/python3.10/site-packages (from matplotlib>=3.6.2->pydeseq2==0.4.11) (4.50.0)\n",
+ "Requirement already satisfied: kiwisolver>=1.3.1 in /home/amit.samal/.local/lib/python3.10/site-packages (from matplotlib>=3.6.2->pydeseq2==0.4.11) (1.4.5)\n",
+ "Requirement already satisfied: python-dateutil>=2.7 in /home/amit.samal/.local/lib/python3.10/site-packages (from matplotlib>=3.6.2->pydeseq2==0.4.11) (2.9.0.post0)\n",
+ "Requirement already satisfied: llvmlite<0.43,>=0.42.0dev0 in /home/amit.samal/.local/lib/python3.10/site-packages (from numba>=0.56->scanpy==1.10.3) (0.42.0)\n",
+ "Requirement already satisfied: pytz>=2020.1 in /usr/lib/python3/dist-packages (from pandas!=2.1.0rc0,!=2.1.2,>=1.4->anndata==0.10.9) (2022.1)\n",
+ "Requirement already satisfied: MarkupSafe>=2.1.1 in /home/amit.samal/.local/lib/python3.10/site-packages (from werkzeug>=1.0.1->tensorboard==2.17.0) (3.0.2)\n",
+ "Requirement already satisfied: stdlib-list in /home/amit.samal/.local/lib/python3.10/site-packages (from session-info->scanpy==1.10.3) (0.10.0)\n",
+ "Downloading loky-3.4.1-py3-none-any.whl (54 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m54.6/54.6 kB\u001b[0m \u001b[31m1.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n",
+ "\u001b[?25hInstalling collected packages: loky\n",
+ "Successfully installed loky-3.4.1\n",
+ "\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
+ "Defaulting to user installation because normal site-packages is not writeable\n",
+ "Requirement already satisfied: memory-profiler==0.61.0 in /home/amit.samal/.local/lib/python3.10/site-packages (0.61.0)\n",
+ "Requirement already satisfied: psutil in /home/amit.samal/.local/lib/python3.10/site-packages (from memory-profiler==0.61.0) (5.9.8)\n",
+ "\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n",
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
+ ]
+ }
+ ],
+ "source": [
+ "import sys\n",
+ "imported_packages = {pkg.split('.')[0] for pkg in sys.modules.keys()}\n",
+ "ignore_libraries = \"|\".join(imported_packages)\n",
+ "\n",
+ "!pip install $(grep -ivE \"$ignore_libraries\" scaLR/requirements.txt)\n",
+ "!pip install memory-profiler==0.61.0"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# # Uncomment and run the following if the scaLR pipeline is to be executed locally after installation, as explained in Note 2.\n",
+ "# import sys\n",
+ "# sys.path.append('path/to/scaLR/')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0DvyBaoIPdnX"
+ },
+ "source": [
+ "## Downloading input anndata from `cellxgene`\n",
+ "- Currently the pipeline expects all datasets in [anndata](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html) formats (`.h5ad` files only).\n",
+ "- The anndata object should contain cell samples as `obs` and genes as `var`.\n",
+ "- `adata.X`: contains normalized gene counts/expression values (Typically `log1p` normalized, data ranging from 0-10).\n",
+ "- `adata.obs`: contains any metadata regarding cells, including a column for `target` which will be used for classification. The index of `adata.obs` is cell_barcodes.\n",
+ "- `adata.var`: contains all gene_names as Index.\n",
+ "\n",
+    "The dataset we are about to download contains two clinical conditions (COVID-19 and normal) and links variations in immune response to disease severity and outcomes over time [(Liu et al. (2021))](https://doi.org/10.1016/j.cell.2021.02.018)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {
+ "id": "loCfvnwt9ei1"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "--2025-02-27 18:52:02-- https://datasets.cellxgene.cziscience.com/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad\n",
+ "Resolving datasets.cellxgene.cziscience.com (datasets.cellxgene.cziscience.com)... 18.239.111.15, 18.239.111.109, 18.239.111.30, ...\n",
+ "Connecting to datasets.cellxgene.cziscience.com (datasets.cellxgene.cziscience.com)|18.239.111.15|:443... connected.\n",
+ "HTTP request sent, awaiting response... 200 OK\n",
+ "Length: 980103606 (935M) [binary/octet-stream]\n",
+ "Saving to: ‘data/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad’\n",
+ "\n",
+ "21ef2ea2-cbed-4b6c- 100%[===================>] 934.70M 3.21MB/s in 4m 48s \n",
+ "\n",
+ "2025-02-27 18:56:51 (3.25 MB/s) - ‘data/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad’ saved [980103606/980103606]\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+    "# This cell may take a few minutes to run (the download is ~935 MB).\n",
+ "!wget -P data https://datasets.cellxgene.cziscience.com/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tSiYIOo8P_3b"
+ },
+ "source": [
+ "## Data exploration"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "id": "23C87j3PR9ox"
+ },
+ "outputs": [],
+ "source": [
+ "from IPython.display import SVG, display\n",
+ "import warnings\n",
+ "import anndata as ad\n",
+ "from anndata import AnnData\n",
+ "import numpy as np\n",
+ "import pandas as pd"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {
+ "id": "eDH3GxXr-er6"
+ },
+ "outputs": [],
+ "source": [
+    "adata = ad.read_h5ad(\"data/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad\", backed='r')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {
+ "id": "SS4oTWW6Xn8c"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "The anndata has '125117' cells and '30695' genes\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(f\"\\nThe anndata has '{adata.n_obs}' cells and '{adata.n_vars}' genes\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {
+ "id": "z1u-kctbSStJ"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " dsm_severity_score_group disease_ontology_term_id \\\n",
+ "AAACCTGAGAAACCTA-1_1 DSM_low MONDO:0100096 \n",
+ "AAACCTGAGGGTTTCT-1_1 DSM_high MONDO:0100096 \n",
+ "AAACCTGCACCTGGTG-1_1 DSM_high MONDO:0100096 \n",
+ "AAACCTGGTCCGAGTC-1_1 DSM_high MONDO:0100096 \n",
+ "AAACCTGGTGCCTTGG-1_1 DSM_low MONDO:0100096 \n",
+ "\n",
+ " severity tissue_ontology_term_id timepoint outcome \\\n",
+ "AAACCTGAGAAACCTA-1_1 Moderate UBERON:0000178 T0 alive \n",
+ "AAACCTGAGGGTTTCT-1_1 Critical UBERON:0000178 T0 alive \n",
+ "AAACCTGCACCTGGTG-1_1 Critical UBERON:0000178 T0 alive \n",
+ "AAACCTGGTCCGAGTC-1_1 Critical UBERON:0000178 T0 deceased \n",
+ "AAACCTGGTGCCTTGG-1_1 Critical UBERON:0000178 T0 alive \n",
+ "\n",
+ " dsm_severity_score days_since_hospitalized age \\\n",
+ "AAACCTGAGAAACCTA-1_1 -1.950858 1.0 55.0 \n",
+ "AAACCTGAGGGTTTCT-1_1 -0.092375 13.0 40.0 \n",
+ "AAACCTGCACCTGGTG-1_1 2.954350 1.0 60.0 \n",
+ "AAACCTGGTCCGAGTC-1_1 3.276233 6.0 76.0 \n",
+ "AAACCTGGTGCCTTGG-1_1 -0.348888 1.0 70.0 \n",
+ "\n",
+ " donor_id ... tissue_type \\\n",
+ "AAACCTGAGAAACCTA-1_1 HGR0000083 ... tissue \n",
+ "AAACCTGAGGGTTTCT-1_1 HGR0000078 ... tissue \n",
+ "AAACCTGCACCTGGTG-1_1 HGR0000098 ... tissue \n",
+ "AAACCTGGTCCGAGTC-1_1 HGR0000141 ... tissue \n",
+ "AAACCTGGTGCCTTGG-1_1 HGR0000093 ... tissue \n",
+ "\n",
+ " cell_type \\\n",
+ "AAACCTGAGAAACCTA-1_1 non-classical monocyte \n",
+ "AAACCTGAGGGTTTCT-1_1 classical monocyte \n",
+ "AAACCTGCACCTGGTG-1_1 CD16-positive, CD56-dim natural killer cell, h... \n",
+ "AAACCTGGTCCGAGTC-1_1 classical monocyte \n",
+ "AAACCTGGTGCCTTGG-1_1 classical monocyte \n",
+ "\n",
+ " assay disease organism sex tissue \\\n",
+ "AAACCTGAGAAACCTA-1_1 10x 5' v1 COVID-19 Homo sapiens male blood \n",
+ "AAACCTGAGGGTTTCT-1_1 10x 5' v1 COVID-19 Homo sapiens female blood \n",
+ "AAACCTGCACCTGGTG-1_1 10x 5' v1 COVID-19 Homo sapiens male blood \n",
+ "AAACCTGGTCCGAGTC-1_1 10x 5' v1 COVID-19 Homo sapiens male blood \n",
+ "AAACCTGGTGCCTTGG-1_1 10x 5' v1 COVID-19 Homo sapiens male blood \n",
+ "\n",
+ " self_reported_ethnicity development_stage \\\n",
+ "AAACCTGAGAAACCTA-1_1 European 55-year-old stage \n",
+ "AAACCTGAGGGTTTCT-1_1 European 40-year-old stage \n",
+ "AAACCTGCACCTGGTG-1_1 European 60-year-old stage \n",
+ "AAACCTGGTCCGAGTC-1_1 European 76-year-old stage \n",
+ "AAACCTGGTGCCTTGG-1_1 European 70-year-old stage \n",
+ "\n",
+ " observation_joinid \n",
+ "AAACCTGAGAAACCTA-1_1 !9L}G4hgnw \n",
+ "AAACCTGAGGGTTTCT-1_1 YRcUzlVyg0 \n",
+ "AAACCTGCACCTGGTG-1_1 )*azge@M0l \n",
+ "AAACCTGGTCCGAGTC-1_1 E 10 or min_val < 0:\n",
+ " warnings.warn(f\"Warning: Expression Value out of range! Max: {max_val}, Min: {min_val}. Expected range is 0-10.\", UserWarning)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {
+ "id": "bd2fTv0gdluU"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " mvp.mean | \n",
+ " mvp.dispersion | \n",
+ " mvp.dispersion.scaled | \n",
+ " mvp.variable | \n",
+ " feature_is_filtered | \n",
+ " feature_name | \n",
+ " feature_reference | \n",
+ " feature_biotype | \n",
+ " feature_length | \n",
+ " feature_type | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | ENSG00000168454 | \n",
+ " 0.000380 | \n",
+ " 1.168876 | \n",
+ " 0.181734 | \n",
+ " False | \n",
+ " False | \n",
+ " TXNDC2 | \n",
+ " NCBITaxon:9606 | \n",
+ " gene | \n",
+ " 1703 | \n",
+ " protein_coding | \n",
+ "
\n",
+ " \n",
+ " | ENSG00000197852 | \n",
+ " 0.035995 | \n",
+ " 1.634179 | \n",
+ " 0.886458 | \n",
+ " False | \n",
+ " False | \n",
+ " INKA2 | \n",
+ " NCBITaxon:9606 | \n",
+ " gene | \n",
+ " 1217 | \n",
+ " protein_coding | \n",
+ "
\n",
+ " \n",
+ " | ENSG00000196878 | \n",
+ " 0.008862 | \n",
+ " 1.617729 | \n",
+ " 0.861545 | \n",
+ " False | \n",
+ " False | \n",
+ " LAMB3 | \n",
+ " NCBITaxon:9606 | \n",
+ " gene | \n",
+ " 3931 | \n",
+ " protein_coding | \n",
+ "
\n",
+ " \n",
+ " | ENSG00000256540 | \n",
+ " 0.000022 | \n",
+ " 1.660993 | \n",
+ " 0.927070 | \n",
+ " False | \n",
+ " False | \n",
+ " IQSEC3-AS1 | \n",
+ " NCBITaxon:9606 | \n",
+ " gene | \n",
+ " 1065 | \n",
+ " lncRNA | \n",
+ "
\n",
+ " \n",
+ " | ENSG00000139180 | \n",
+ " 0.090100 | \n",
+ " 1.184720 | \n",
+ " 0.205731 | \n",
+ " False | \n",
+ " False | \n",
+ " NDUFA9 | \n",
+ " NCBITaxon:9606 | \n",
+ " gene | \n",
+ " 782 | \n",
+ " protein_coding | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " mvp.mean mvp.dispersion mvp.dispersion.scaled \\\n",
+ "ENSG00000168454 0.000380 1.168876 0.181734 \n",
+ "ENSG00000197852 0.035995 1.634179 0.886458 \n",
+ "ENSG00000196878 0.008862 1.617729 0.861545 \n",
+ "ENSG00000256540 0.000022 1.660993 0.927070 \n",
+ "ENSG00000139180 0.090100 1.184720 0.205731 \n",
+ "\n",
+ " mvp.variable feature_is_filtered feature_name \\\n",
+ "ENSG00000168454 False False TXNDC2 \n",
+ "ENSG00000197852 False False INKA2 \n",
+ "ENSG00000196878 False False LAMB3 \n",
+ "ENSG00000256540 False False IQSEC3-AS1 \n",
+ "ENSG00000139180 False False NDUFA9 \n",
+ "\n",
+ " feature_reference feature_biotype feature_length \\\n",
+ "ENSG00000168454 NCBITaxon:9606 gene 1703 \n",
+ "ENSG00000197852 NCBITaxon:9606 gene 1217 \n",
+ "ENSG00000196878 NCBITaxon:9606 gene 3931 \n",
+ "ENSG00000256540 NCBITaxon:9606 gene 1065 \n",
+ "ENSG00000139180 NCBITaxon:9606 gene 782 \n",
+ "\n",
+ " feature_type \n",
+ "ENSG00000168454 protein_coding \n",
+ "ENSG00000197852 protein_coding \n",
+ "ENSG00000196878 protein_coding \n",
+ "ENSG00000256540 lncRNA \n",
+ "ENSG00000139180 protein_coding "
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "#Gene metadata\n",
+ "adata.var.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "QLTg-WK-hTS7"
+ },
+ "source": [
+ "### Modifying `var` index (Optional)\n",
+ "- The `index` values in this AnnData object are the `gene_ids`. To retrieve the literature genes associated with a particular cell type, we need the gene symbols, which are present in `feature_name` column. Therefore, we'll replace the index values with gene symbols.\n",
+ "- This will be helpful when analyzing the `GeneRecallCurve` later.\n",
+ "- This step can be skipped if the `reference_genes.csv` already contains gene IDs corresponding to each cell type, or if the user does not want to perform the `GeneRecallCurve` analysis.\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {
+ "id": "qoSHdJtwgPaA"
+ },
+ "outputs": [],
+ "source": [
+ "adata.var.set_index('feature_name',inplace=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {
+ "id": "p3LvDmZmhJ_c"
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "\n",
+ "
\n",
+ " \n",
+ " \n",
+ " | \n",
+ " mvp.mean | \n",
+ " mvp.dispersion | \n",
+ " mvp.dispersion.scaled | \n",
+ " mvp.variable | \n",
+ " feature_is_filtered | \n",
+ " feature_reference | \n",
+ " feature_biotype | \n",
+ " feature_length | \n",
+ " feature_type | \n",
+ "
\n",
+ " \n",
+ " | feature_name | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ " | \n",
+ "
\n",
+ " \n",
+ " \n",
+ " \n",
+ " | TXNDC2 | \n",
+ " 0.000380 | \n",
+ " 1.168876 | \n",
+ " 0.181734 | \n",
+ " False | \n",
+ " False | \n",
+ " NCBITaxon:9606 | \n",
+ " gene | \n",
+ " 1703 | \n",
+ " protein_coding | \n",
+ "
\n",
+ " \n",
+ " | INKA2 | \n",
+ " 0.035995 | \n",
+ " 1.634179 | \n",
+ " 0.886458 | \n",
+ " False | \n",
+ " False | \n",
+ " NCBITaxon:9606 | \n",
+ " gene | \n",
+ " 1217 | \n",
+ " protein_coding | \n",
+ "
\n",
+ " \n",
+ " | LAMB3 | \n",
+ " 0.008862 | \n",
+ " 1.617729 | \n",
+ " 0.861545 | \n",
+ " False | \n",
+ " False | \n",
+ " NCBITaxon:9606 | \n",
+ " gene | \n",
+ " 3931 | \n",
+ " protein_coding | \n",
+ "
\n",
+ " \n",
+ " | IQSEC3-AS1 | \n",
+ " 0.000022 | \n",
+ " 1.660993 | \n",
+ " 0.927070 | \n",
+ " False | \n",
+ " False | \n",
+ " NCBITaxon:9606 | \n",
+ " gene | \n",
+ " 1065 | \n",
+ " lncRNA | \n",
+ "
\n",
+ " \n",
+ " | NDUFA9 | \n",
+ " 0.090100 | \n",
+ " 1.184720 | \n",
+ " 0.205731 | \n",
+ " False | \n",
+ " False | \n",
+ " NCBITaxon:9606 | \n",
+ " gene | \n",
+ " 782 | \n",
+ " protein_coding | \n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " mvp.mean mvp.dispersion mvp.dispersion.scaled mvp.variable \\\n",
+ "feature_name \n",
+ "TXNDC2 0.000380 1.168876 0.181734 False \n",
+ "INKA2 0.035995 1.634179 0.886458 False \n",
+ "LAMB3 0.008862 1.617729 0.861545 False \n",
+ "IQSEC3-AS1 0.000022 1.660993 0.927070 False \n",
+ "NDUFA9 0.090100 1.184720 0.205731 False \n",
+ "\n",
+ " feature_is_filtered feature_reference feature_biotype \\\n",
+ "feature_name \n",
+ "TXNDC2 False NCBITaxon:9606 gene \n",
+ "INKA2 False NCBITaxon:9606 gene \n",
+ "LAMB3 False NCBITaxon:9606 gene \n",
+ "IQSEC3-AS1 False NCBITaxon:9606 gene \n",
+ "NDUFA9 False NCBITaxon:9606 gene \n",
+ "\n",
+ " feature_length feature_type \n",
+ "feature_name \n",
+ "TXNDC2 1703 protein_coding \n",
+ "INKA2 1217 protein_coding \n",
+ "LAMB3 3931 protein_coding \n",
+ "IQSEC3-AS1 1065 lncRNA \n",
+ "NDUFA9 782 protein_coding "
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "# Now the index values are the gene symbols.\n",
+ "adata.var.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {
+ "id": "6yCi6UQ-kh0Q"
+ },
+ "outputs": [],
+ "source": [
+ "# Saving file for further analysis\n",
+ "# This shell will take approximately 00:00:47 (hh:mm:ss) to run.\n",
+ "adata.obs.index = adata.obs.index.astype(str)\n",
+ "adata.var.index = adata.var.index.astype(str)\n",
+ "AnnData(X=adata.X,obs=adata.obs,var=adata.var).write('data/modified_adata.h5ad',compression='gzip')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "e1WBarmdY0h5"
+ },
+ "source": [
+ "## scaLR pipeline \n",
+ "\n",
+ "1. The **scaLR** pipeline consists of four stages:\n",
+ " - Data ingestion\n",
+ " - Feature selection\n",
+ " - Final model training\n",
+ " - Analysis\n",
+ "\n",
+ "2. The user needs to modify the configuration file (`config.yml`) available at `scaLR/config` for each stage of the pipeline according to the requirements. Simply omit or comment out the stages of the pipeline that you do not wish to run.\n",
+ "\n",
+ "3. Refer to `config.yml` and its detailed configuration [README](https://github.com/infocusp/scaLR/blob/main/config/README.md) file for instructions on how to use different parameters and files.\n",
+ "\n",
+ "### Config edits (For Cell Type Classification and Biomarker Identification)\n",
+ "\n",
+ "NOTE: Below are just suggestions for the model parameters. Feel free to play around with them for tuning the model & improving the results.\n",
+ "\n",
+ "*An example configuration file for the current dataset, incorporating the edits below, can be found at `scaLR/tutorials/pipeline/config_celltype.yaml`. Please update the device as `cuda` or `cpu` as per runtype.*\n",
+ "\n",
+ "- **Device setup**.\n",
+ " -Update `device: 'cuda'` for `GPU` enabled runtype, else `device: 'cpu'` for `CPU` enabled runtype.\n",
+ "- **Experiment Config**\n",
+ " - The default `exp_run` number is `0`.If not changed, the celltype classification experiment would be `exp_run_0` with all the pipeline results.\n",
+ "- **Data Config**\n",
+ " - Update the `full_datapath` to `data/modified_adata.h5ad` (as we will include `GeneRecallCurve` in the downstream).\n",
+ " - Specify the `num_workers` value for effective parallelization.\n",
+ " - Set `target` to `cell_type`.\n",
+ "- **Feature Selection**\n",
+ " - Specify the `num_workers` value for effective parallelization.\n",
+ " - Update the model layers to `[5000, 10]`, as there are only 10 cell types in the dataset.\n",
+ " - Change `epoch` to `10`.\n",
+ "- **Final Model Training**\n",
+ " - Update the model layers to the same as for feature selection: `[5000, 10]`.\n",
+ " - Change `epoch` to `100`.\n",
+ "- **Analysis**\n",
+ " - **Downstream Analysis**\n",
+ " - Uncomment the `test_samples_downstream_analysis` section.\n",
+ " - Update the `reference_genes_path` to `scaLR/tutorials/pipeline/grc_reference_gene.csv`.\n",
+ " - Please refer to the section below:\n",
+ "\n",
+ " ```\n",
+ " analysis:\n",
+ "\n",
+ " model_checkpoint: ''\n",
+ "\n",
+ " dataloader:\n",
+ " name: SimpleDataLoader\n",
+ " params:\n",
+ " batch_size: 15000\n",
+ "\n",
+ " gene_analysis:\n",
+ " scoring_config:\n",
+ " name: LinearScorer\n",
+ "\n",
+ " features_selector:\n",
+ " name: ClasswisePromoters\n",
+ " params:\n",
+ " k: 100\n",
+ " test_samples_downstream_analysis:\n",
+ " - name: GeneRecallCurve\n",
+ " params:\n",
+ " reference_genes_path: 'scaLR/tutorials/pipeline/grc_reference_gene.csv'\n",
+ " top_K: 300\n",
+ " plots_per_row: 3\n",
+ " features_selector:\n",
+ " name: ClasswiseAbs\n",
+ " params: {}\n",
+ " - name: Heatmap\n",
+ " params: {}\n",
+ " - name: RocAucCurve\n",
+ " params: {}\n",
+ "\n",
+ "\n",
+ "\n",
+ "### Config edits (For clinical condition specific biomarker identification and DGE analysis) \n",
+ "\n",
+ "*An example configuration file for the current dataset, incorporating the edits below, can be found at : `scaLR/tutorials/pipeline/config_clinical.yaml`.Please update the device as `cuda` or `cpu` as per runtype*\n",
+ "\n",
+ "- **Experiment Config**\n",
+ " - Make sure to change the `exp_run` number if you have an experiment with the same number earlier related to cell classification.As we have done one experiment earlier, we'll change the number now to '1'.\n",
+ "- **Data Config**\n",
+ " - The `full_datapath` remains the same as above.\n",
+ " - Change the `target` to `disease` (this column contains data for clinical conditions, `COVID-19/normal`).\n",
+ "- **Feature Selection**\n",
+ " - Update the model layers to `[5000, 2]`, as there are only two types of clinical conditions.\n",
+ " -`epoch` as 10.\n",
+ "- **Final Model Training**\n",
+ " - Update the model layers to the same as for feature selection: `[5000, 2]`.\n",
+ " - `epoch` as 100.\n",
+ "- **Analysis**\n",
+ " - **Downstream Analysis**\n",
+ " - Uncomment the `full_samples_downstream_analysis` section.\n",
+ " - We are not performing the 'gene_recall_curve' analysis in this case. It can be performed if the `COVID-19/normal` specific genes are available, but there are many possibilities of genes in the case of normal conditions.\n",
+ " - There are two options to perform differential gene expression (DGE) analysis: `DgePseudoBulk` and `DgeLMEM`. The parameters are updated as follows. Note that `DgeLMEM` may take a bit more time, as the multiprocessing is not very efficient with only 2 CPUs in the current Colab runtime.\n",
+ " - Please refer to the section below:\n",
+ " ```\n",
+ " analysis:\n",
+ "\n",
+ " model_checkpoint: ''\n",
+ "\n",
+ " dataloader:\n",
+ " name: SimpleDataLoader\n",
+ " params:\n",
+ " batch_size: 15000\n",
+ "\n",
+ " gene_analysis:\n",
+ " scoring_config:\n",
+ " name: LinearScorer\n",
+ "\n",
+ " features_selector:\n",
+ " name: ClasswisePromoters\n",
+ " params:\n",
+ " k: 100\n",
+ " full_samples_downstream_analysis:\n",
+ " - name: Heatmap\n",
+ " params:\n",
+ " top_n_genes: 100\n",
+ " - name: RocAucCurve\n",
+ " params: {}\n",
+ " - name: DgePseudoBulk\n",
+ " params:\n",
+ " celltype_column: 'cell_type'\n",
+ " design_factor: 'disease'\n",
+ " factor_categories: ['COVID-19', 'normal']\n",
+ " sum_column: 'donor_id'\n",
+ " cell_subsets: ['conventional dendritic cell', 'natural killer cell']\n",
+ " - name: DgeLMEM\n",
+ " params:\n",
+ " fixed_effect_column: 'disease'\n",
+ " fixed_effect_factors: ['COVID-19', 'normal']\n",
+ " group: 'donor_id'\n",
+ " celltype_column: 'cell_type'\n",
+ " cell_subsets: ['conventional dendritic cell']\n",
+ " gene_batch_size: 1000\n",
+ " coef_threshold: 0.1\n",
+ " "
+ ]
+ },
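+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Optional sketch: apply the config edits above programmatically instead of\n",
+ "# editing the YAML by hand. This assumes PyYAML is available in the runtime,\n",
+ "# and touches only keys used elsewhere in this tutorial\n",
+ "# (`device`, `experiment.exp_run`).\n",
+ "import yaml\n",
+ "\n",
+ "cfg_path = 'scaLR/tutorials/pipeline/config_celltype.yaml'\n",
+ "with open(cfg_path) as f:\n",
+ "    cfg = yaml.safe_load(f)\n",
+ "\n",
+ "cfg['device'] = 'cpu'  # or 'cuda' on a GPU runtime\n",
+ "cfg['experiment']['exp_run'] = 0  # bump for each new experiment\n",
+ "\n",
+ "with open(cfg_path, 'w') as f:\n",
+ "    yaml.safe_dump(cfg, f)"
+ ]
+ },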
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Wny28AQQm6xB"
+ },
+ "source": [
+ "### Run Pipeline "
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {
+ "id": "uLgN7MDv7hV-"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "/bin/bash: line 1: python: command not found\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Possible flags using 'scaLR/pipeline.py'\n",
+ "!python scaLR/pipeline.py --help"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kTAOOj1CgjJy"
+ },
+ "source": [
+ "#### Cell type classification"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "metadata": {
+ "id": "xqvT9AiQFVGq"
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "2025-02-27 19:02:51,535 - ROOT - INFO : Experiment directory: `scalr_experiments/exp_name_0`\n",
+ "2025-02-27 19:02:51,544 - ROOT - INFO : Data Ingestion pipeline running\n",
+ "2025-02-27 19:02:51,544 - DataIngestion - INFO : Generating Train, Validation and Test sets\n",
+ "2025-02-27 19:03:35,769 - DataIngestion - INFO : Generate label mappings for all columns in metadata\n",
+ "2025-02-27 19:03:36,946 - ROOT - INFO : Feature Extraction pipeline running\n",
+ "2025-02-27 19:03:36,946 - File Utils - INFO : Data Loaded from Final datapaths\n",
+ "2025-02-27 19:03:37,467 - FeatureExtraction - INFO : Feature subset models training\n",
+ "2025-02-27 19:05:09,181 - ModelTraining - INFO : Building model training artifacts\n",
+ "2025-02-27 19:05:09,253 - ModelTraining - INFO : Building model training artifacts\n",
+ "2025-02-27 19:05:09,295 - ModelTraining - INFO : Building model training artifacts\n",
+ "2025-02-27 19:05:09,393 - ModelTraining - INFO : Building model training artifacts\n",
+ "2025-02-27 19:05:09,750 - ModelTraining - INFO : Training the model\n",
+ "2025-02-27 19:05:09,751 - ModelTraining - INFO : Training the model\n",
+ "2025-02-27 19:05:09,770 - ModelTraining - INFO : Training the model\n",
+ "2025-02-27 19:05:09,881 - ModelTraining - INFO : Training the model\n",
+ "2025-02-27 19:05:16,105 - ModelTraining - INFO : Building model training artifacts\n",
+ "2025-02-27 19:05:16,106 - ModelTraining - INFO : Training the model\n",
+ "2025-02-27 19:05:16,153 - ModelTraining - INFO : Building model training artifacts\n",
+ "2025-02-27 19:05:16,154 - ModelTraining - INFO : Training the model\n",
+ "2025-02-27 19:05:16,168 - ModelTraining - INFO : Building model training artifacts\n",
+ "2025-02-27 19:05:16,174 - ModelTraining - INFO : Training the model\n",
+ "2025-02-27 19:05:20,327 - FeatureExtraction - INFO : Feature scoring\n",
+ "2025-02-27 19:05:20,712 - FeatureExtraction - INFO : Top features extraction\n",
+ "2025-02-27 19:05:20,719 - FeatureExtraction - INFO : Writing feature-subset data onto disk\n",
+ "2025-02-27 19:05:51,902 - ROOT - INFO : Final Model Training pipeline running\n",
+ "2025-02-27 19:05:51,905 - File Utils - INFO : Data Loaded from Feature subset datapaths\n",
+ "2025-02-27 19:05:52,382 - ModelTraining - INFO : Building model training artifacts\n",
+ "2025-02-27 19:05:52,841 - ModelTraining - INFO : Training the model\n",
+ "2025-02-27 19:05:59,278 - ROOT - INFO : Analysis pipeline running\n",
+ "2025-02-27 19:05:59,281 - File Utils - INFO : Data Loaded from Feature subset datapaths\n",
+ "2025-02-27 19:05:59,676 - File Utils - INFO : Data Loaded from Feature subset datapaths\n",
+ "2025-02-27 19:05:59,805 - File Utils - INFO : Data Loaded from Feature subset datapaths\n",
+ "2025-02-27 19:06:00,379 - Eval&Analysis - INFO : Calculating accuracy and generating classification report on test set\n",
+ "/home/amit.samal/.local/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
+ " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n",
+ "/home/amit.samal/.local/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
+ " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n",
+ "/home/amit.samal/.local/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n",
+ " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n",
+ "2025-02-27 19:06:03,433 - Eval&Analysis - INFO : Performing gene analysis\n",
+ "2025-02-27 19:06:03,433 - FeatureExtraction - INFO : Feature scoring\n",
+ "2025-02-27 19:06:03,471 - FeatureExtraction - INFO : Top features extraction\n",
+ "2025-02-27 19:06:03,540 - Eval&Analysis - INFO : Performing Downstream Analysis on test samples\n",
+ "2025-02-27 19:06:03,540 - Eval&Analysis - INFO : Performing GeneRecallCurve\n",
+ "2025-02-27 19:06:04,781 - Eval&Analysis - INFO : Performing Heatmap\n",
+ "2025-02-27 19:06:09,548 - Eval&Analysis - INFO : Performing RocAucCurve\n",
+ "2025-02-27 19:06:09,929 - ROOT - INFO : Total time taken: 198.401921749115 s\n",
+ "2025-02-27 19:06:09,929 - ROOT - INFO : Maximum memory usage: 1915.5625 MB\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Command to run end to end pipeline.\n",
+ "# This shell will take approximately 00:21:15 (hh:mm:ss) on GPU to run.()\n",
+ "\n",
+ "!python3 scaLR/pipeline.py --config scaLR/tutorials/pipeline/config_celltype.yaml -l -m"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "0IRSOT64gjJy"
+ },
+ "source": [
+ "#### Clinical condition specific biomarker identification and differential gene expression analysis"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "e71LHxUvgjJy"
+ },
+ "outputs": [],
+ "source": [
+ "## It takes 01:16:58 (hh:mm:ss) to run on the CPU for clinical condition-specific biomarker identification.\n",
+ "## To reduce the runtime, please comment out the 'DgeLMEM' section under the 'full_samples_downstream_analysis.\n",
+ "\n",
+ "!python scaLR/pipeline.py --config scaLR/tutorials/pipeline/config_clinical.yaml -l -m"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "yviraKXXgjJy"
+ },
+ "source": [
+ "Pipeline logs can be found at `scalr_experiments/exp_name_0/logs.txt` (cell type classification)\n",
+ "\n",
+ "For clinical condition specific biomarker identification, the logs can be found at `scalr_experiments/exp_name_1/logs.txt`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "oe4d74mjIcgW"
+ },
+ "source": [
+ "### Results \n",
+ "We have done the celltype classification and biomarker discovery with name `exp_name_0`.\n",
+ "\n",
+ "- The classification report can be found at `scalr_experiments/exp_name_0/analysis/classification_report.csv`\n",
+ "\n",
+ "- Top-5k Biomarkers can be found at `scalr_experiments/exp_name_0/analysis/gene_analysis/top_features.json`.\n",
+ "\n",
+ "- `Heatmaps` for each class(cell types) can be found at `scalr_experiments/exp_name_0/analysis/test_samples/heatmaps`\n",
+ "\n",
+ "- `Gene_recall_curve`, and `roc_auc` data can be found at `scalr_experiments/exp_name_0/analysis/test_samples/`.\n",
+ "\n",
+ "- `score_matrix.csv` with gene scores for all classes can be found at `scalr_experiments/exp_name_0/analysis/gene_analysis/score_matrix.csv`"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "MM5v5OTcQocC"
+ },
+ "outputs": [],
+ "source": [
+ "#Classification report\n",
+ "pd.read_csv('/content/scalr_experiments/exp_name_0/analysis/classification_report.csv',index_col=0)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "rNZt8t-_gjJz"
+ },
+ "outputs": [],
+ "source": [
+ "#ROC_AUC\n",
+ "display(SVG('/content/scalr_experiments/exp_name_0/analysis/test_samples/roc_auc.svg'))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "JBYVFclUgjJz"
+ },
+ "outputs": [],
+ "source": [
+ "# Heatmap for cell type 'classical monocyte'\n",
+ "display(SVG('/content/scalr_experiments/exp_name_0/analysis/test_samples/heatmaps/classical monocyte.svg'))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "zbui27nxIh_J"
+ },
+ "outputs": [],
+ "source": [
+ "# Gene recall curve\n",
+ "display(SVG('scalr_experiments/exp_name_0/analysis/test_samples/gene_recall_curve.svg'))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "52n0PSr87FjJ"
+ },
+ "source": [
+ "\n",
+ "For clinical condition-specific biomarker identification and DGE analysis with the experiment name `exp_name_1`. All analysis results can be viewed in the `exp_name_1` directory, as explained above for cell type classification. The difference is that we have results for only two classes in `exp_name_1`, namely `COVID-19` and `normal`, along with the results for DGE analysis."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Fgu3MIxggjJ3"
+ },
+ "outputs": [],
+ "source": [
+ "# DgePseudoBulk results for 'conventional dendritic cell' in 'COVID-19' w.r.t. 'normal' samples\n",
+ "pd.read_csv('/content/scalr_experiments/exp_name_1/analysis/full_samples/pseudobulk_dge_result/pbkDGE_conventionaldendriticcell_COVID-19_vs_normal.csv')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "7n_AczPkgjJ3"
+ },
+ "outputs": [],
+ "source": [
+ "# Volcano plot of `log2FoldChange` vs `-log10(pvalue)` in gene expression for\n",
+ "# 'conventional dendritic cell' in 'COVID-19' w.r.t. 'normal' samples.\n",
+ "display(SVG('/content/scalr_experiments/exp_name_1/analysis/full_samples/pseudobulk_dge_result/pbkDGE_conventionaldendriticcell_COVID-19_vs_normal.svg'))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Js1lFjQagjJ3"
+ },
+ "source": [
+ "*Note*: A `Fold Change (FC)` of 1.5 units in the figure above is equivalent to a `log2 Fold Change` of 0.584."
+ ]
+ },
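+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Verifying the conversion quoted above: a fold change of 1.5\n",
+ "# corresponds to log2(1.5) = 0.5849... on the log2 scale.\n",
+ "import numpy as np\n",
+ "print(np.log2(1.5))"
+ ]
+ },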
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "RL5n6rqzR4Sc"
+ },
+ "source": [
+ "## Running scaLR in modules"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "6jypX2axToza"
+ },
+ "source": [
+ "### Imports"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "yqnxGZnHIiJr"
+ },
+ "outputs": [],
+ "source": [
+ "import sys\n",
+ "sys.path.append('scaLR/')\n",
+ "import os\n",
+ "from os import path\n",
+ "\n",
+ "from scalr.data_ingestion_pipeline import DataIngestionPipeline\n",
+ "from scalr.eval_and_analysis_pipeline import EvalAndAnalysisPipeline\n",
+ "from scalr.feature_extraction_pipeline import FeatureExtractionPipeline\n",
+ "from scalr.model_training_pipeline import ModelTrainingPipeline\n",
+ "from scalr.utils import read_data\n",
+ "from scalr.utils import write_data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "tObhEJKkT0Ew"
+ },
+ "source": [
+ "### Load Config\n",
+ "\n",
+ "Running with example config files with required edits. Make sure to change the experiment name if required."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "dbrUCh-LTxbl"
+ },
+ "outputs": [],
+ "source": [
+ "config = read_data('scaLR/tutorials/pipeline/config_celltype.yaml')\n",
+ "# config = read_data('scaLR/tutorials/pipeline/config_clinical.yaml')\n",
+ "config"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "XU-FLwPlULd1"
+ },
+ "outputs": [],
+ "source": [
+ "dirpath = config['experiment']['dirpath']\n",
+ "exp_name = config['experiment']['exp_name']\n",
+ "exp_run = config['experiment']['exp_run']\n",
+ "dirpath = os.path.join(dirpath, f'{exp_name}_{exp_run}')\n",
+ "os.makedirs(dirpath, exist_ok=True)\n",
+ "device = config['device']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "C44uQoNiUe4M"
+ },
+ "source": [
+ "### Data Ingestion"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "JX5nB5gzUh7L"
+ },
+ "outputs": [],
+ "source": [
+ "# This shell will take approximately 00:01:23 (hh:mm:ss) to run.\n",
+ "\n",
+ "data_dirpath = path.join(dirpath, 'data')\n",
+ "os.makedirs(data_dirpath, exist_ok=True)\n",
+ "\n",
+ "# Initialize Data Ingestion object\n",
+ "ingest_data = DataIngestionPipeline(config['data'], data_dirpath)\n",
+ "\n",
+ "# Generate Train, Validation and Test Splits for pipeline\n",
+ "ingest_data.generate_train_val_test_split()\n",
+ "\n",
+ "# Apply pre-processing on data\n",
+ "# Fit on Train data, and then apply on the entire data\n",
+ "ingest_data.preprocess_data()\n",
+ "\n",
+ "# We generate label mapings from the metadata, which is used for\n",
+ "# labels, etc.\n",
+ "ingest_data.generate_mappings()\n",
+ "\n",
+ "# All the additional data generated (label mappings, data splits, etc.)\n",
+ "# are passed onto the config for future use in pipeline\n",
+ "config['data'] = ingest_data.get_updated_config()\n",
+ "write_data(config, path.join(dirpath, 'config.yaml'))\n",
+ "del ingest_data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "qc76-jFSVmfY"
+ },
+ "source": [
+ "### Feature Selection"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "w4CfG8YQVoTJ"
+ },
+ "outputs": [],
+ "source": [
+ "# This shell will take approximately 00:19:02 (hh:mm:ss) to run.\n",
+ "\n",
+ "feature_extraction_dirpath = path.join(dirpath, 'feature_extraction')\n",
+ "os.makedirs(feature_extraction_dirpath, exist_ok=True)\n",
+ "\n",
+ "# Initialize Feature Extraction object\n",
+ "extract_features = FeatureExtractionPipeline(\n",
+ " config['feature_selection'], feature_extraction_dirpath, device)\n",
+ "extract_features.load_data_and_targets_from_config(config['data'])\n",
+ "\n",
+ "# Train feature subset models and get scores for each feature/genes\n",
+ "extract_features.feature_subsetted_model_training()\n",
+ "extract_features.feature_scoring()\n",
+ "\n",
+ "# Extract top features by some algorithm, and write a feature-subsetted\n",
+ "# dataset\n",
+ "extract_features.top_feature_extraction()\n",
+ "config['data'] = extract_features.write_top_features_subset_data(\n",
+ " config['data'])\n",
+ "\n",
+ "# All the additional data generated (subset data splits, etc.)\n",
+ "# are passed onto the config for future use in pipeline\n",
+ "config['feature_selection'] = extract_features.get_updated_config()\n",
+ "write_data(config, path.join(dirpath, 'config.yaml'))\n",
+ "del extract_features"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "z-Scub2RVtqi"
+ },
+ "source": [
+ "### Final Model Training"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "Roc1gACAVoY6"
+ },
+ "outputs": [],
+ "source": [
+ "# This shell will take approximately 00:06:20 (hh:mm:ss) to run.\n",
+ "\n",
+ "model_training_dirpath = path.join(dirpath, 'model')\n",
+ "os.makedirs(model_training_dirpath, exist_ok=True)\n",
+ "\n",
+ "# Initialize Final Model Training object\n",
+ "model_trainer = ModelTrainingPipeline(\n",
+ " config['final_training']['model'],\n",
+ " config['final_training']['model_train_config'],\n",
+ " model_training_dirpath, device)\n",
+ "model_trainer.load_data_and_targets_from_config(config['data'])\n",
+ "\n",
+ "# Build the training artifacts from config, and train the model\n",
+ "model_trainer.build_model_training_artifacts()\n",
+ "model_trainer.train()\n",
+ "\n",
+ "# All the additional data generated (model defaults filled, etc.)\n",
+ "# are passed onto the config for future use in pipeline\n",
+ "model_config, model_train_config = model_trainer.get_updated_config()\n",
+ "config['final_training']['model'] = model_config\n",
+ "config['final_training']['model_train_config'] = model_train_config\n",
+ "write_data(config, path.join(dirpath, 'config.yaml'))\n",
+ "del model_trainer"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "GZFd8R8QWpmS"
+ },
+ "source": [
+ "### Evaluation and Analysis"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "id": "w71AS8mXVob9"
+ },
+ "outputs": [],
+ "source": [
+ "# This shell will take approximately 00:00:26 (hh:mm:ss) to run.\n",
+ "\n",
+ "analysis_dirpath = path.join(dirpath, 'analysis')\n",
+ "os.makedirs(analysis_dirpath, exist_ok=True)\n",
+ "\n",
+ "# Get path of the best trained model\n",
+ "config['analysis']['model_checkpoint'] = path.join(\n",
+ " model_training_dirpath, 'best_model')\n",
+ "\n",
+ "# Initialize Evaluation and Analysis Pipeline object\n",
+ "analyser = EvalAndAnalysisPipeline(config['analysis'], analysis_dirpath,\n",
+ " device)\n",
+ "analyser.load_data_and_targets_from_config(config['data'])\n",
+ "\n",
+ "# Perform evaluation of trained model on test data and generate\n",
+ "# classification report\n",
+ "analyser.evaluation_and_classification_report()\n",
+ "\n",
+ "# Perform gene analysis based on the trained model to get\n",
+ "# top genes / biomarker analysis\n",
+ "analyser.gene_analysis()\n",
+ "\n",
+ "# Perform downstream analysis on all samples / test samples\n",
+ "analyser.full_samples_downstream_anlaysis()\n",
+ "analyser.test_samples_downstream_anlaysis()\n",
+ "\n",
+ "# All the additional data generated\n",
+ "# are passed onto the config for future use in pipeline\n",
+ "config['analysis'] = analyser.get_updated_config()\n",
+ "write_data(config, path.join(dirpath, 'config.yaml'))\n",
+ "del analyser"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "XCThcOt8gjJ5"
+ },
+ "source": [
+ "Analysis results can be viewed inside `scalr_experiments` under the `exp_name` specified in the `config.yaml`, as mentioned above."
+ ]
+ }
+ ],
+ "metadata": {
+ "accelerator": "GPU",
+ "colab": {
+ "gpuType": "T4",
+ "provenance": []
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.12"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}