diff --git a/README.md b/README.md index bedd52a..d4e5983 100644 --- a/README.md +++ b/README.md @@ -20,7 +20,7 @@ 4. Evaluation & Downstream Analysis: The trained model is evaluated using the test dataset by calculating metrics such as precision, recall, f1-score, and accuracy. Various visualizations, such as ROC curve of class annotation, feature rank plots, heatmap of top genes per class, [DGE analysis](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/analysis/differential_gene_expression/dge.ipynb), and [gene recall curves](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/analysis/gene_recall_curve/gene_recall_curve.ipynb), are generated. -The following flowchart explains the major steps of the scaLR platform. +**The flowchart below also explains the major steps of the scaLR platform.** ![image.jpg](img/Schematic-of-scPipeline.jpg) @@ -29,7 +29,6 @@ The following flowchart explains the major steps of the scaLR platform. - ScaLR can be installed using git or pip. It is tested in Python 3.10 and it is recommended to use that environment. - ``` conda create -n scaLR_env python=3.10 @@ -47,9 +46,9 @@ pip install -r requirements.txt ``` pip install pyscaLR ``` -*Note* If the user wants to run the entire pipeline via installing pip pyscalr, they should clone/download these files(`pipeline.py` and `config.yaml`) from the git repository. +**Note:** To run the entire pipeline after installing `pyscaLR` via pip, clone or download `pipeline.py` and `config.yaml` from the git repository. -## Input Data +## Input data format - Currently the pipeline expects all datasets in [anndata](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html) formats (`.h5ad` files only). - The anndata object should contain cell samples as `obs` and genes as `var`. - `adata.X`: contains normalized gene counts/expression values (`log1p` normalization with range `0-10` expected). 
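The input expectations listed in the hunk above (an `.h5ad` file, `log1p`-normalized values roughly in the `0-10` range) can be checked before a run. The following is a minimal sketch and not part of scaLR: the helper names `input_problems` and `problems_for_h5ad` are hypothetical, the 0-10 bounds are taken from the README text and treated as a soft expectation, and the check that the target column exists in `adata.obs` is an assumption based on how the config names a `target`.

```python
def input_problems(filename, x_min, x_max, obs_columns, target):
    """Return a list of human-readable problems; an empty list means the basic checks pass."""
    problems = []
    # The README states that only .h5ad files are supported.
    if not str(filename).endswith(".h5ad"):
        problems.append("expected an .h5ad file")
    # Assumption: values outside 0-10 suggest the data is not log1p-normalized.
    if x_min < 0 or x_max > 10:
        problems.append(
            f"adata.X spans [{x_min}, {x_max}]; log1p-normalized values in 0-10 are expected"
        )
    # Assumption: the target named in the config must be a column of adata.obs.
    if target not in obs_columns:
        problems.append(f"target column '{target}' not found in adata.obs")
    return problems


def problems_for_h5ad(path, target):
    """Load an .h5ad file fully into memory and run the checks above."""
    import anndata as ad  # imported lazily so the pure helper has no dependencies

    adata = ad.read_h5ad(path)
    return input_problems(
        path,
        float(adata.X.min()),
        float(adata.X.max()),
        list(adata.obs.columns),
        target,
    )
```

For example, `problems_for_h5ad('data/modified_adata.h5ad', 'cell_type')` would return an empty list when the file meets these expectations. Loading the whole file into memory is a simplification that may not suit large datasets.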
@@ -60,15 +59,192 @@ pip install pyscaLR ## How to run 1. It is necessary that the user modify the configuration file, and each stage of the pipeline is available inside the config folder [config.yml] as per your requirements. Simply omit/comment out stages of the pipeline you do not wish to run. -2. Refer config.yml & it's detailed config [README](https://github.com/infocusp/scaLR/blob/main/config/README.md) file on how to use different parameters and files. +2. Refer to **config.yml** and **its detailed config** [README](https://github.com/infocusp/scaLR/blob/main/config/README.md) file on how to use different parameters and files. 3. Then use the `pipeline.py` file to run the entire pipeline according to your configurations. This file takes as argument the path to config (`-c | --config`), along with optional flags to log all parts of the pipelines (`-l | --log`) and to analyze memory usage (`-m | --memoryprofiler`). 5. `python pipeline.py --config /path/to/config.yaml -l -m` to run the scaLR. -## Examples configs +## Example configs + +### Config for cell type classification and biomarker identification + +NOTE: Below are just suggestions for the model parameters. Feel free to play around with them for tuning the model & improving the results. + +An example configuration file for the current dataset, incorporating the edits below, can be found at `scaLR/tutorials/pipeline/config_celltype.yaml`. Set the device to cuda or cpu as required. + +- **Device setup** + - Set device: 'cuda' for a GPU-enabled runtime, or device: 'cpu' for a CPU-only runtime. +- **Experiment Config** + - The default exp_run number is 0. If it is not changed, the cell type classification experiment is saved as exp_run_0 with all the pipeline results. +- **Data Config** + - Update the full_datapath to `data/modified_adata.h5ad` (as we will include GeneRecallCurve in the downstream). + - Specify the num_workers value for effective parallelization. + - Set target to cell_type. 
+- **Feature Selection** + - Specify the num_workers value for effective parallelization. + - Update the model layers to [5000, 10], as there are only 10 cell types in the dataset. + - Change epoch to 10. +- **Final Model Training** + - Update the model layers to the same as for feature selection: [5000, 10]. + - Change epoch to 100. +- **Analysis** + - Downstream Analysis + - Uncomment the test_samples_downstream_analysis section. + - Update the reference_genes_path to `scaLR/tutorials/pipeline/grc_reference_gene.csv`. + - Refer to the section below: + ``` + # Config file for pipeline run for cell type classification. + + # DEVICE SETUP. + device: 'cuda' + + # EXPERIMENT. + experiment: + dirpath: 'scalr_experiments' + exp_name: 'exp_name' + exp_run: 0 + + # DATA CONFIG. + data: + sample_chunksize: 20000 + + train_val_test: + full_datapath: 'data/modified_adata.h5ad' + num_workers: 2 + + splitter_config: + name: GroupSplitter + params: + split_ratio: [7, 1, 2.5] + stratify: 'donor_id' + + # split_datapaths: '' + + # preprocess: + # - name: SampleNorm + # params: + # **args + + # - name: StandardScaler + # params: + # **args + + target: cell_type + + # FEATURE SELECTION. + feature_selection: + + # score_matrix: '/path/to/matrix' + feature_subsetsize: 5000 + num_workers: 2 + + model: + name: SequentialModel + params: + layers: [5000, 10] + weights_init_zero: True + + model_train_config: + trainer: SimpleModelTrainer + + dataloader: + name: SimpleDataLoader + params: + batch_size: 25000 + padding: 5000 + + optimizer: + name: SGD + params: + lr: 1.0e-3 + weight_decay: 0.1 + + loss: + name: CrossEntropyLoss + + epochs: 10 + + scoring_config: + name: LinearScorer + + features_selector: + name: AbsMean + params: + k: 5000 + + # FINAL MODEL TRAINING. 
+ final_training: + + model: + name: SequentialModel + params: + layers: [5000, 10] + dropout: 0 + weights_init_zero: False + + model_train_config: + resume_from_checkpoint: null + + trainer: SimpleModelTrainer + + dataloader: + name: SimpleDataLoader + params: + batch_size: 15000 + + optimizer: + name: Adam + params: + lr: 1.0e-3 + weight_decay: 0 + + loss: + name: CrossEntropyLoss + + epochs: 100 + + callbacks: + - name: TensorboardLogger + - name: EarlyStopping + params: + patience: 3 + min_delta: 1.0e-4 + - name: ModelCheckpoint + params: + interval: 5 + analysis: + +     model_checkpoint: '' -### Config edits (For clinical condition-specific biomarker identification and DGE analysis) +     dataloader: +         name: SimpleDataLoader +         params: +             batch_size: 15000 + +     gene_analysis: +         scoring_config: +             name: LinearScorer + +         features_selector: +             name: ClasswisePromoters +             params: +                 k: 100 +     test_samples_downstream_analysis: +         - name: GeneRecallCurve +           params: +             reference_genes_path: 'scaLR/tutorials/pipeline/grc_reference_gene.csv' +             top_K: 300 +             plots_per_row: 3 +             features_selector: +                 name: ClasswiseAbs +                 params: {} +         - name: Heatmap +           params: {} +         - name: RocAucCurve +           params: {} + ``` +### Config for clinical condition-specific biomarker identification and DGE analysis -An example configuration file for the current dataset, incorporating the edits below, can be found at: scaLR/tutorials/pipeline/config_clinical.yaml.Please update the device as CUDA or CPU as per runtype +An example configuration file (`scaLR/tutorials/pipeline/config_clinical.yaml`). Update the device as CUDA or CPU as per the requirement. 
 - Experiment Config - Make sure to change the exp_run number if you have an experiment with the same number earlier related to cell classification. As we have done one experiment earlier, we'll change the number now to '1'. @@ -83,10 +259,10 @@ An example configuration file for the current dataset, incorporating the edits b - epoch as 100. - Analysis - Downstream Analysis - - Uncomment the full_samples_downstream_analysis section. + - Uncomment the full_samples_downstream_analysis section in the example config file. - We are not performing the 'gene_recall_curve' analysis in this case. It can be performed if the COVID-19/normal specific genes are available, but there are many possibilities of genes in the case of normal conditions. - - There are two options to perform differential gene expression (DGE) analysis: DgePseudoBulk and DgeLMEM. The parameters are updated as follows. Note that DgeLMEM may take a bit more time, as the multiprocessing is not very efficient with only 2 CPUs in the current Colab runtime. - - Please refer to the section below: + - There are two options to perform differential gene expression (DGE) analysis: **DgePseudoBulk and DgeLMEM**. The parameters are updated as follows. Note that DgeLMEM may take a bit more time, as the multiprocessing is not very efficient with only 2 CPUs in the current Colab runtime. 
+ - Refer to the section below: ``` analysis: @@ -102,67 +278,6 @@ An example configuration file for the current dataset, incorporating the edits b       scoring_config:           name: LinearScorer -       features_selector: -           name: ClasswisePromoters -           params: -               k: 100 -   full_samples_downstream_analysis: -       - name: Heatmap -         params: -           top_n_genes: 100 -       - name: RocAucCurve -         params: {} -       - name: DgePseudoBulk -         params: -             celltype_column: 'cell_type' -             design_factor: 'disease' -             factor_categories: ['COVID-19', 'normal'] -             sum_column: 'donor_id' -             cell_subsets: ['conventional dendritic cell', 'natural killer cell'] -       - name: DgeLMEM -         params: -           fixed_effect_column: 'disease' -           fixed_effect_factors: ['COVID-19', 'normal'] -           group: 'donor_id' -           celltype_column: 'cell_type' -           cell_subsets: ['conventional dendritic cell'] -           gene_batch_size: 1000 -           coef_threshold: 0.1 - ``` -### Config edits (For clinical condition-specific biomarker identification and DGE analysis) - An example configuration file for the current dataset, incorporating the edits below, can be found at: scaLR/tutorials/pipeline/config_clinical.yaml.Please update the device as cuda or cpu as per runtype - -- Experiment Config - - Make sure to change the exp_run number if you have an experiment with the same number earlier related to cell classification.As we have done one experiment earlier, we'll change the number now to '1'. -- Data Config - - The full_datapath remains the same as above. - - Change the target to disease (this column contains data for clinical conditions, COVID-19/normal). -- Feature Selection - - Update the model layers to [5000, 2], as there are only two types of clinical conditions. - - epoch as 10. 
-- Final Model Training - - Update the model layers to the same as for feature selection: [5000, 2]. - - epoch as 100. -- Analysis - - Downstream Analysis - - Uncomment the full_samples_downstream_analysis section. - - We are not performing the 'gene_recall_curve' analysis in this case. It can be performed if the COVID-19/normal specific genes are available, but there are many possibilities of genes in the case of normal conditions. - - There are two options to perform differential gene expression (DGE) analysis: DgePseudoBulk and DgeLMEM. The parameters are updated as follows. Note that DgeLMEM may take a bit more time, as the multiprocessing is not very efficient with only 2 CPUs in the current Colab runtime. - - Please refer to the section below: - ``` - analysis: - -   model_checkpoint: '' - -   dataloader: -       name: SimpleDataLoader -       params: -           batch_size: 15000 - -   gene_analysis: -       scoring_config: -           name: LinearScorer -       features_selector:           name: ClasswisePromoters           params: @@ -192,16 +307,17 @@ An example configuration file for the current dataset, incorporating the edits b ``` ## Interactive tutorials -Detailed tutorials have been made on how to use some functionalities as a scaLR library. Find the links below. +Detailed tutorials have been made on how to use some pipeline functionalities as a scaLR library. Find the links below. 
 - **scaLR pipeline** [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/pipeline/scalr_pipeline.ipynb) - **Differential gene expression analysis** [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/analysis/differential_gene_expression/dge.ipynb) - **Gene recall curve** [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/analysis/gene_recall_curve/gene_recall_curve.ipynb) - **Normalization** [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/preprocessing/normalization.ipynb) - **Batch correction** [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/preprocessing/batch_correction.ipynb) -- **SHAP analysis** [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/infocusp/scaLR/blob/main/tutorials/analysis/shap_analysis/shap_heatmap.ipynb) -## Experiment Output Structure +- **An example Jupyter notebook to [run scaLR on a local machine](https://github.com/infocusp/scaLR/blob/main/tutorials/pipeline/scalr_pipeline_local_run.ipynb)**. + +## Experiment output structure - **pipeline.py**: The main script that performs an end-to-end run. - `exp_dir`: root experiment directory for the storage of all step outputs of the platform specified in the config. @@ -256,8 +372,6 @@ Performs evaluation of best model trained on user-defined metrics on the test se - `lmemDGE_celltype.csv`: contains LMEM DGE results between selected factor categories for a celltype. 
- `lmemDGE_fixed_effect_factor_X.svg`: volcano plot of coefficient vs -log10(p-value) of genes. - - ## Citation Jogani Saiyam, Anand Santosh Pol, Mayur Prajapati, Amit Samal, Kriti Bhatia, Jayendra Parmar, Urvik Patel, Falak Shah, Nisarg Vyas, and Saurabh Gupta. "scaLR: a low-resource deep neural network-based platform for single cell analysis and biomarker discovery." bioRxiv (2024): 2024-09. diff --git a/tutorials/pipeline/scalr_pipeline_local_run.ipynb b/tutorials/pipeline/scalr_pipeline_local_run.ipynb new file mode 100644 index 0000000..c9fe5ec --- /dev/null +++ b/tutorials/pipeline/scalr_pipeline_local_run.ipynb @@ -0,0 +1,1766 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "dfGECxsGN9bo" + }, + "source": [ + "\n", + "\n", + "# Single-cell analysis using Low Resource (scaLR)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Xna7qg2PgjJm" + }, + "source": [ + "\n", + "\n", + "**Note:** \n", + "1. If scaLR is intended to be run on a local system, please ensure that an `ipy kernel` with Python version `3.10` is selected. Then, all the required installations can be performed as mentioned in the section below.\n", + "\n", + "2. If scaLR has already been installed as mentioned in [Pre-requisites and installation scaLR](https://github.com/infocusp/scaLR), the repository cloning and requirement installation steps below can be skipped. Selecting the `ipy kernel` can be done as follows:\n", + "\n", + " - Open the terminal and run: \n", + " \n", + " ```\n", + " conda install -c anaconda ipykernel\n", + " python -m ipykernel install --user --name=scaLR_env\n", + " ```\n", + " - Select `scaLR_env` as the `ipy kernel` in `scalr_pipeline.ipynb`. \n", + " - Finally, update the system path for scaLR, as mentioned in the shell before data download. 
e.g.: \n", + " ```\n", + " sys.path.append('path/to/scaLR/')\n", + " ``` \n", + "## Cloning scaLR" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CdutIWiy8xJb" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Cloning into 'scaLR'...\n", + "remote: Enumerating objects: 3452, done.\u001b[K\n", + "remote: Counting objects: 100% (372/372), done.\u001b[K\n", + "remote: Compressing objects: 100% (181/181), done.\u001b[K\n", + "remote: Total 3452 (delta 243), reused 261 (delta 189), pack-reused 3080 (from 1)\u001b[K\n", + "Receiving objects: 100% (3452/3452), 170.03 MiB | 2.80 MiB/s, done.\n", + "Resolving deltas: 100% (2073/2073), done.\n" + ] + } + ], + "source": [ + "!git clone https://github.com/infocusp/scaLR.git" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MLJo_0EugjJq" + }, + "source": [ + "Install all requirements after cloning the repository, excluding packages that are pre-installed in Colab." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9dQLPmLwPL0C" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Defaulting to user installation because normal site-packages is not writeable\n", + "Requirement already satisfied: anndata==0.10.9 in /home/amit.samal/.local/lib/python3.10/site-packages (0.10.9)\n", + "Requirement already satisfied: isort==5.13.2 in /home/amit.samal/.local/lib/python3.10/site-packages (5.13.2)\n", + "Collecting loky==3.4.1\n", + " Downloading loky-3.4.1-py3-none-any.whl.metadata (6.4 kB)\n", + "Requirement already satisfied: pillow==10.4.0 in /home/amit.samal/.local/lib/python3.10/site-packages (10.4.0)\n", + "Requirement already satisfied: pydeseq2==0.4.11 in /home/amit.samal/.local/lib/python3.10/site-packages (0.4.11)\n", + "Requirement already satisfied: pyparsing==3.2.0 in /home/amit.samal/.local/lib/python3.10/site-packages (3.2.0)\n", + "Requirement already satisfied: pytest==8.3.3 in /home/amit.samal/.local/lib/python3.10/site-packages (8.3.3)\n", + "Requirement already satisfied: PyYAML==6.0.2 in /home/amit.samal/.local/lib/python3.10/site-packages (6.0.2)\n", + "Requirement already satisfied: scanpy==1.10.3 in /home/amit.samal/.local/lib/python3.10/site-packages (1.10.3)\n", + "Requirement already satisfied: scikit-learn==1.5.2 in /home/amit.samal/.local/lib/python3.10/site-packages (1.5.2)\n", + "Requirement already satisfied: shap==0.46.0 in /home/amit.samal/.local/lib/python3.10/site-packages (0.46.0)\n", + "Requirement already satisfied: tensorboard==2.17.0 in /home/amit.samal/.local/lib/python3.10/site-packages (2.17.0)\n", + "Requirement already satisfied: toml==0.10.2 in /home/amit.samal/.local/lib/python3.10/site-packages (0.10.2)\n", + "Requirement already satisfied: tqdm==4.66.5 in /home/amit.samal/.local/lib/python3.10/site-packages (4.66.5)\n", + "Requirement already satisfied: yapf==0.40.2 in 
/home/amit.samal/.local/lib/python3.10/site-packages (0.40.2)\n", + "Requirement already satisfied: array-api-compat!=1.5,>1.4 in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (1.5.1)\n", + "Requirement already satisfied: exceptiongroup in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (1.2.0)\n", + "Requirement already satisfied: h5py>=3.1 in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (3.10.0)\n", + "Requirement already satisfied: natsort in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (8.4.0)\n", + "Requirement already satisfied: numpy>=1.23 in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (1.26.3)\n", + "Requirement already satisfied: packaging>=20.0 in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (24.0)\n", + "Requirement already satisfied: pandas!=2.1.0rc0,!=2.1.2,>=1.4 in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (1.5.3)\n", + "Requirement already satisfied: scipy>1.8 in /home/amit.samal/.local/lib/python3.10/site-packages (from anndata==0.10.9) (1.12.0)\n", + "Requirement already satisfied: cloudpickle in /home/amit.samal/.local/lib/python3.10/site-packages (from loky==3.4.1) (3.0.0)\n", + "Requirement already satisfied: matplotlib>=3.6.2 in /home/amit.samal/.local/lib/python3.10/site-packages (from pydeseq2==0.4.11) (3.8.3)\n", + "Requirement already satisfied: iniconfig in /home/amit.samal/.local/lib/python3.10/site-packages (from pytest==8.3.3) (2.0.0)\n", + "Requirement already satisfied: pluggy<2,>=1.5 in /home/amit.samal/.local/lib/python3.10/site-packages (from pytest==8.3.3) (1.5.0)\n", + "Requirement already satisfied: tomli>=1 in /home/amit.samal/.local/lib/python3.10/site-packages (from pytest==8.3.3) (2.1.0)\n", + "Requirement already satisfied: joblib in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) 
(1.3.2)\n", + "Requirement already satisfied: legacy-api-wrap>=1.4 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (1.4)\n", + "Requirement already satisfied: networkx>=2.7 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (3.2.1)\n", + "Requirement already satisfied: numba>=0.56 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (0.59.1)\n", + "Requirement already satisfied: patsy in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (0.5.6)\n", + "Requirement already satisfied: pynndescent>=0.5 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (0.5.11)\n", + "Requirement already satisfied: seaborn>=0.13 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (0.13.2)\n", + "Requirement already satisfied: session-info in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (1.0.0)\n", + "Requirement already satisfied: statsmodels>=0.13 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (0.14.1)\n", + "Requirement already satisfied: umap-learn!=0.5.0,>=0.5 in /home/amit.samal/.local/lib/python3.10/site-packages (from scanpy==1.10.3) (0.5.5)\n", + "Requirement already satisfied: threadpoolctl>=3.1.0 in /home/amit.samal/.local/lib/python3.10/site-packages (from scikit-learn==1.5.2) (3.4.0)\n", + "Requirement already satisfied: slicer==0.0.8 in /home/amit.samal/.local/lib/python3.10/site-packages (from shap==0.46.0) (0.0.8)\n", + "Requirement already satisfied: absl-py>=0.4 in /home/amit.samal/.local/lib/python3.10/site-packages (from tensorboard==2.17.0) (2.1.0)\n", + "Requirement already satisfied: grpcio>=1.48.2 in /home/amit.samal/.local/lib/python3.10/site-packages (from tensorboard==2.17.0) (1.70.0)\n", + "Requirement already satisfied: markdown>=2.6.8 in /home/amit.samal/.local/lib/python3.10/site-packages (from tensorboard==2.17.0) (3.7)\n", + "Requirement 
already satisfied: protobuf!=4.24.0,<5.0.0,>=3.19.6 in /home/amit.samal/.local/lib/python3.10/site-packages (from tensorboard==2.17.0) (4.25.6)\n", + "Requirement already satisfied: setuptools>=41.0.0 in /usr/lib/python3/dist-packages (from tensorboard==2.17.0) (59.6.0)\n", + "Requirement already satisfied: six>1.9 in /usr/lib/python3/dist-packages (from tensorboard==2.17.0) (1.16.0)\n", + "Requirement already satisfied: tensorboard-data-server<0.8.0,>=0.7.0 in /home/amit.samal/.local/lib/python3.10/site-packages (from tensorboard==2.17.0) (0.7.2)\n", + "Requirement already satisfied: werkzeug>=1.0.1 in /home/amit.samal/.local/lib/python3.10/site-packages (from tensorboard==2.17.0) (3.1.3)\n", + "Requirement already satisfied: importlib-metadata>=6.6.0 in /home/amit.samal/.local/lib/python3.10/site-packages (from yapf==0.40.2) (8.6.1)\n", + "Requirement already satisfied: platformdirs>=3.5.1 in /home/amit.samal/.local/lib/python3.10/site-packages (from yapf==0.40.2) (4.2.0)\n", + "Requirement already satisfied: zipp>=3.20 in /home/amit.samal/.local/lib/python3.10/site-packages (from importlib-metadata>=6.6.0->yapf==0.40.2) (3.21.0)\n", + "Requirement already satisfied: contourpy>=1.0.1 in /home/amit.samal/.local/lib/python3.10/site-packages (from matplotlib>=3.6.2->pydeseq2==0.4.11) (1.2.0)\n", + "Requirement already satisfied: cycler>=0.10 in /home/amit.samal/.local/lib/python3.10/site-packages (from matplotlib>=3.6.2->pydeseq2==0.4.11) (0.12.1)\n", + "Requirement already satisfied: fonttools>=4.22.0 in /home/amit.samal/.local/lib/python3.10/site-packages (from matplotlib>=3.6.2->pydeseq2==0.4.11) (4.50.0)\n", + "Requirement already satisfied: kiwisolver>=1.3.1 in /home/amit.samal/.local/lib/python3.10/site-packages (from matplotlib>=3.6.2->pydeseq2==0.4.11) (1.4.5)\n", + "Requirement already satisfied: python-dateutil>=2.7 in /home/amit.samal/.local/lib/python3.10/site-packages (from matplotlib>=3.6.2->pydeseq2==0.4.11) (2.9.0.post0)\n", + "Requirement already 
satisfied: llvmlite<0.43,>=0.42.0dev0 in /home/amit.samal/.local/lib/python3.10/site-packages (from numba>=0.56->scanpy==1.10.3) (0.42.0)\n", + "Requirement already satisfied: pytz>=2020.1 in /usr/lib/python3/dist-packages (from pandas!=2.1.0rc0,!=2.1.2,>=1.4->anndata==0.10.9) (2022.1)\n", + "Requirement already satisfied: MarkupSafe>=2.1.1 in /home/amit.samal/.local/lib/python3.10/site-packages (from werkzeug>=1.0.1->tensorboard==2.17.0) (3.0.2)\n", + "Requirement already satisfied: stdlib-list in /home/amit.samal/.local/lib/python3.10/site-packages (from session-info->scanpy==1.10.3) (0.10.0)\n", + "Downloading loky-3.4.1-py3-none-any.whl (54 kB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m54.6/54.6 kB\u001b[0m \u001b[31m1.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n", + "\u001b[?25hInstalling collected packages: loky\n", + "Successfully installed loky-3.4.1\n", + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n", + "Defaulting to user installation because normal site-packages is not writeable\n", + "Requirement already satisfied: memory-profiler==0.61.0 in /home/amit.samal/.local/lib/python3.10/site-packages (0.61.0)\n", + "Requirement already satisfied: psutil in /home/amit.samal/.local/lib/python3.10/site-packages (from memory-profiler==0.61.0) (5.9.8)\n", + "\n", + "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.0\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n", + 
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" + ] + } + ], + "source": [ + "import sys\n", + "imported_packages = {pkg.split('.')[0] for pkg in sys.modules.keys()}\n", + "ignore_libraries = \"|\".join(imported_packages)\n", + "\n", + "!pip install $(grep -ivE \"$ignore_libraries\" scaLR/requirements.txt)\n", + "!pip install memory-profiler==0.61.0" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# # Uncomment and run the following if the scaLR pipeline is to be executed locally after installation, as explained in Note 2.\n", + "# import sys\n", + "# sys.path.append('path/to/scaLR/')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0DvyBaoIPdnX" + }, + "source": [ + "## Downloading input anndata from `cellxgene`\n", + "- Currently the pipeline expects all datasets in [anndata](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html) formats (`.h5ad` files only).\n", + "- The anndata object should contain cell samples as `obs` and genes as `var`.\n", + "- `adata.X`: contains normalized gene counts/expression values (Typically `log1p` normalized, data ranging from 0-10).\n", + "- `adata.obs`: contains any metadata regarding cells, including a column for `target` which will be used for classification. The index of `adata.obs` is cell_barcodes.\n", + "- `adata.var`: contains all gene_names as Index.\n", + "\n", + "The dataset we are about to download contains two clinical conditions (COVID-19 and normal) and links variations in immune response to disease severity and outcomes over time[(Liu et al. 
(2021))](https://doi.org/10.1016/j.cell.2021.02.018)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "id": "loCfvnwt9ei1" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "--2025-02-27 18:52:02-- https://datasets.cellxgene.cziscience.com/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad\n", + "Resolving datasets.cellxgene.cziscience.com (datasets.cellxgene.cziscience.com)... 18.239.111.15, 18.239.111.109, 18.239.111.30, ...\n", + "Connecting to datasets.cellxgene.cziscience.com (datasets.cellxgene.cziscience.com)|18.239.111.15|:443... connected.\n", + "HTTP request sent, awaiting response... 200 OK\n", + "Length: 980103606 (935M) [binary/octet-stream]\n", + "Saving to: ‘data/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad’\n", + "\n", + "21ef2ea2-cbed-4b6c- 100%[===================>] 934.70M 3.21MB/s in 4m 48s \n", + "\n", + "2025-02-27 18:56:51 (3.25 MB/s) - ‘data/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad’ saved [980103606/980103606]\n", + "\n" + ] + } + ], + "source": [ + "# This shell will take approximately 00:00:53 (hh:mm:ss) to run.\n", + "!wget -P data https://datasets.cellxgene.cziscience.com/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tSiYIOo8P_3b" + }, + "source": [ + "## Data exploration" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "id": "23C87j3PR9ox" + }, + "outputs": [], + "source": [ + "from IPython.display import SVG, display\n", + "import warnings\n", + "import anndata as ad\n", + "from anndata import AnnData\n", + "import numpy as np\n", + "import pandas as pd" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "id": "eDH3GxXr-er6" + }, + "outputs": [], + "source": [ + "adata = ad.read_h5ad(\"data/21ef2ea2-cbed-4b6c-a572-0ddd1d9020bc.h5ad\",backed='r')" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "id": "SS4oTWW6Xn8c" + }, + "outputs": [ + 
{ + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "The anndata has '125117' cells and '30695' genes\n" + ] + } + ], + "source": [ + "print(f\"\\nThe anndata has '{adata.n_obs}' cells and '{adata.n_vars}' genes\")" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "id": "z1u-kctbSStJ" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | dsm_severity_score_group | disease_ontology_term_id | severity | tissue_ontology_term_id | timepoint | outcome | dsm_severity_score | days_since_hospitalized | age | donor_id | ... | tissue_type | cell_type | assay | disease | organism | sex | tissue | self_reported_ethnicity | development_stage | observation_joinid |
| AAACCTGAGAAACCTA-1_1 | DSM_low | MONDO:0100096 | Moderate | UBERON:0000178 | T0 | alive | -1.950858 | 1.0 | 55.0 | HGR0000083 | ... | tissue | non-classical monocyte | 10x 5' v1 | COVID-19 | Homo sapiens | male | blood | European | 55-year-old stage | !9L}G4hgnw |
| AAACCTGAGGGTTTCT-1_1 | DSM_high | MONDO:0100096 | Critical | UBERON:0000178 | T0 | alive | -0.092375 | 13.0 | 40.0 | HGR0000078 | ... | tissue | classical monocyte | 10x 5' v1 | COVID-19 | Homo sapiens | female | blood | European | 40-year-old stage | YRcUzlVyg0 |
| AAACCTGCACCTGGTG-1_1 | DSM_high | MONDO:0100096 | Critical | UBERON:0000178 | T0 | alive | 2.954350 | 1.0 | 60.0 | HGR0000098 | ... | tissue | CD16-positive, CD56-dim natural killer cell, h... | 10x 5' v1 | COVID-19 | Homo sapiens | male | blood | European | 60-year-old stage | )*azge@M0l |
| AAACCTGGTCCGAGTC-1_1 | DSM_high | MONDO:0100096 | Critical | UBERON:0000178 | T0 | deceased | 3.276233 | 6.0 | 76.0 | HGR0000141 | ... | tissue | classical monocyte | 10x 5' v1 | COVID-19 | Homo sapiens | male | blood | European | 76-year-old stage | E<FU`+QN&T |
| AAACCTGGTGCCTTGG-1_1 | DSM_low | MONDO:0100096 | Critical | UBERON:0000178 | T0 | alive | -0.348888 | 1.0 | 70.0 | HGR0000093 | ... | tissue | classical monocyte | 10x 5' v1 | COVID-19 | Homo sapiens | male | blood | European | 70-year-old stage | 2MZ#6SX}{g |
\n", + "

5 rows × 32 columns

\n", + "
" + ], + "text/plain": [ + " dsm_severity_score_group disease_ontology_term_id \\\n", + "AAACCTGAGAAACCTA-1_1 DSM_low MONDO:0100096 \n", + "AAACCTGAGGGTTTCT-1_1 DSM_high MONDO:0100096 \n", + "AAACCTGCACCTGGTG-1_1 DSM_high MONDO:0100096 \n", + "AAACCTGGTCCGAGTC-1_1 DSM_high MONDO:0100096 \n", + "AAACCTGGTGCCTTGG-1_1 DSM_low MONDO:0100096 \n", + "\n", + " severity tissue_ontology_term_id timepoint outcome \\\n", + "AAACCTGAGAAACCTA-1_1 Moderate UBERON:0000178 T0 alive \n", + "AAACCTGAGGGTTTCT-1_1 Critical UBERON:0000178 T0 alive \n", + "AAACCTGCACCTGGTG-1_1 Critical UBERON:0000178 T0 alive \n", + "AAACCTGGTCCGAGTC-1_1 Critical UBERON:0000178 T0 deceased \n", + "AAACCTGGTGCCTTGG-1_1 Critical UBERON:0000178 T0 alive \n", + "\n", + " dsm_severity_score days_since_hospitalized age \\\n", + "AAACCTGAGAAACCTA-1_1 -1.950858 1.0 55.0 \n", + "AAACCTGAGGGTTTCT-1_1 -0.092375 13.0 40.0 \n", + "AAACCTGCACCTGGTG-1_1 2.954350 1.0 60.0 \n", + "AAACCTGGTCCGAGTC-1_1 3.276233 6.0 76.0 \n", + "AAACCTGGTGCCTTGG-1_1 -0.348888 1.0 70.0 \n", + "\n", + " donor_id ... tissue_type \\\n", + "AAACCTGAGAAACCTA-1_1 HGR0000083 ... tissue \n", + "AAACCTGAGGGTTTCT-1_1 HGR0000078 ... tissue \n", + "AAACCTGCACCTGGTG-1_1 HGR0000098 ... tissue \n", + "AAACCTGGTCCGAGTC-1_1 HGR0000141 ... tissue \n", + "AAACCTGGTGCCTTGG-1_1 HGR0000093 ... tissue \n", + "\n", + " cell_type \\\n", + "AAACCTGAGAAACCTA-1_1 non-classical monocyte \n", + "AAACCTGAGGGTTTCT-1_1 classical monocyte \n", + "AAACCTGCACCTGGTG-1_1 CD16-positive, CD56-dim natural killer cell, h... 
\n", + "AAACCTGGTCCGAGTC-1_1 classical monocyte \n", + "AAACCTGGTGCCTTGG-1_1 classical monocyte \n", + "\n", + " assay disease organism sex tissue \\\n", + "AAACCTGAGAAACCTA-1_1 10x 5' v1 COVID-19 Homo sapiens male blood \n", + "AAACCTGAGGGTTTCT-1_1 10x 5' v1 COVID-19 Homo sapiens female blood \n", + "AAACCTGCACCTGGTG-1_1 10x 5' v1 COVID-19 Homo sapiens male blood \n", + "AAACCTGGTCCGAGTC-1_1 10x 5' v1 COVID-19 Homo sapiens male blood \n", + "AAACCTGGTGCCTTGG-1_1 10x 5' v1 COVID-19 Homo sapiens male blood \n", + "\n", + " self_reported_ethnicity development_stage \\\n", + "AAACCTGAGAAACCTA-1_1 European 55-year-old stage \n", + "AAACCTGAGGGTTTCT-1_1 European 40-year-old stage \n", + "AAACCTGCACCTGGTG-1_1 European 60-year-old stage \n", + "AAACCTGGTCCGAGTC-1_1 European 76-year-old stage \n", + "AAACCTGGTGCCTTGG-1_1 European 70-year-old stage \n", + "\n", + " observation_joinid \n", + "AAACCTGAGAAACCTA-1_1 !9L}G4hgnw \n", + "AAACCTGAGGGTTTCT-1_1 YRcUzlVyg0 \n", + "AAACCTGCACCTGGTG-1_1 )*azge@M0l \n", + "AAACCTGGTCCGAGTC-1_1 E 10 or min_val < 0:\n", + " warnings.warn(f\"Warning: Expression Value out of range! Max: {max_val}, Min: {min_val}. Expected range is 0-10.\", UserWarning)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "id": "bd2fTv0gdluU" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| | mvp.mean | mvp.dispersion | mvp.dispersion.scaled | mvp.variable | feature_is_filtered | feature_name | feature_reference | feature_biotype | feature_length | feature_type |
| ENSG00000168454 | 0.000380 | 1.168876 | 0.181734 | False | False | TXNDC2 | NCBITaxon:9606 | gene | 1703 | protein_coding |
| ENSG00000197852 | 0.035995 | 1.634179 | 0.886458 | False | False | INKA2 | NCBITaxon:9606 | gene | 1217 | protein_coding |
| ENSG00000196878 | 0.008862 | 1.617729 | 0.861545 | False | False | LAMB3 | NCBITaxon:9606 | gene | 3931 | protein_coding |
| ENSG00000256540 | 0.000022 | 1.660993 | 0.927070 | False | False | IQSEC3-AS1 | NCBITaxon:9606 | gene | 1065 | lncRNA |
| ENSG00000139180 | 0.090100 | 1.184720 | 0.205731 | False | False | NDUFA9 | NCBITaxon:9606 | gene | 782 | protein_coding |
\n", + "
" + ], + "text/plain": [ + " mvp.mean mvp.dispersion mvp.dispersion.scaled \\\n", + "ENSG00000168454 0.000380 1.168876 0.181734 \n", + "ENSG00000197852 0.035995 1.634179 0.886458 \n", + "ENSG00000196878 0.008862 1.617729 0.861545 \n", + "ENSG00000256540 0.000022 1.660993 0.927070 \n", + "ENSG00000139180 0.090100 1.184720 0.205731 \n", + "\n", + " mvp.variable feature_is_filtered feature_name \\\n", + "ENSG00000168454 False False TXNDC2 \n", + "ENSG00000197852 False False INKA2 \n", + "ENSG00000196878 False False LAMB3 \n", + "ENSG00000256540 False False IQSEC3-AS1 \n", + "ENSG00000139180 False False NDUFA9 \n", + "\n", + " feature_reference feature_biotype feature_length \\\n", + "ENSG00000168454 NCBITaxon:9606 gene 1703 \n", + "ENSG00000197852 NCBITaxon:9606 gene 1217 \n", + "ENSG00000196878 NCBITaxon:9606 gene 3931 \n", + "ENSG00000256540 NCBITaxon:9606 gene 1065 \n", + "ENSG00000139180 NCBITaxon:9606 gene 782 \n", + "\n", + " feature_type \n", + "ENSG00000168454 protein_coding \n", + "ENSG00000197852 protein_coding \n", + "ENSG00000196878 protein_coding \n", + "ENSG00000256540 lncRNA \n", + "ENSG00000139180 protein_coding " + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#Gene metadata\n", + "adata.var.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sfgBeaLumPuV" + }, + "source": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QLTg-WK-hTS7" + }, + "source": [ + "### Modifying `var` index (Optional)\n", + "- The `index` values in this AnnData object are the `gene_ids`. To retrieve the literature genes associated with a particular cell type, we need the gene symbols, which are present in `feature_name` column. 
Therefore, we'll replace the index values with gene symbols.\n", + "- This will be helpful when analyzing the `GeneRecallCurve` later.\n", + "- This step can be skipped if the `reference_genes.csv` already contains gene IDs corresponding to each cell type, or if the user does not want to perform the `GeneRecallCurve` analysis.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "id": "qoSHdJtwgPaA" + }, + "outputs": [], + "source": [ + "adata.var.set_index('feature_name',inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "id": "p3LvDmZmhJ_c" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
| feature_name | mvp.mean | mvp.dispersion | mvp.dispersion.scaled | mvp.variable | feature_is_filtered | feature_reference | feature_biotype | feature_length | feature_type |
| TXNDC2 | 0.000380 | 1.168876 | 0.181734 | False | False | NCBITaxon:9606 | gene | 1703 | protein_coding |
| INKA2 | 0.035995 | 1.634179 | 0.886458 | False | False | NCBITaxon:9606 | gene | 1217 | protein_coding |
| LAMB3 | 0.008862 | 1.617729 | 0.861545 | False | False | NCBITaxon:9606 | gene | 3931 | protein_coding |
| IQSEC3-AS1 | 0.000022 | 1.660993 | 0.927070 | False | False | NCBITaxon:9606 | gene | 1065 | lncRNA |
| NDUFA9 | 0.090100 | 1.184720 | 0.205731 | False | False | NCBITaxon:9606 | gene | 782 | protein_coding |
\n", + "
" + ], + "text/plain": [ + " mvp.mean mvp.dispersion mvp.dispersion.scaled mvp.variable \\\n", + "feature_name \n", + "TXNDC2 0.000380 1.168876 0.181734 False \n", + "INKA2 0.035995 1.634179 0.886458 False \n", + "LAMB3 0.008862 1.617729 0.861545 False \n", + "IQSEC3-AS1 0.000022 1.660993 0.927070 False \n", + "NDUFA9 0.090100 1.184720 0.205731 False \n", + "\n", + " feature_is_filtered feature_reference feature_biotype \\\n", + "feature_name \n", + "TXNDC2 False NCBITaxon:9606 gene \n", + "INKA2 False NCBITaxon:9606 gene \n", + "LAMB3 False NCBITaxon:9606 gene \n", + "IQSEC3-AS1 False NCBITaxon:9606 gene \n", + "NDUFA9 False NCBITaxon:9606 gene \n", + "\n", + " feature_length feature_type \n", + "feature_name \n", + "TXNDC2 1703 protein_coding \n", + "INKA2 1217 protein_coding \n", + "LAMB3 3931 protein_coding \n", + "IQSEC3-AS1 1065 lncRNA \n", + "NDUFA9 782 protein_coding " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Now the index values are the gene symbols.\n", + "adata.var.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "id": "6yCi6UQ-kh0Q" + }, + "outputs": [], + "source": [ + "# Saving file for further analysis\n", + "# This shell will take approximately 00:00:47 (hh:mm:ss) to run.\n", + "adata.obs.index = adata.obs.index.astype(str)\n", + "adata.var.index = adata.var.index.astype(str)\n", + "AnnData(X=adata.X,obs=adata.obs,var=adata.var).write('data/modified_adata.h5ad',compression='gzip')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e1WBarmdY0h5" + }, + "source": [ + "## scaLR pipeline \n", + "\n", + "1. The **scaLR** pipeline consists of four stages:\n", + " - Data ingestion\n", + " - Feature selection\n", + " - Final model training\n", + " - Analysis\n", + "\n", + "2. The user needs to modify the configuration file (`config.yml`) available at `scaLR/config` for each stage of the pipeline according to the requirements. 
Simply omit or comment out the stages of the pipeline that you do not wish to run.\n", + "\n", + "3. Refer to `config.yml` and its detailed configuration [README](https://github.com/infocusp/scaLR/blob/main/config/README.md) file for instructions on how to use different parameters and files.\n", + "\n", + "### Config edits (For Cell Type Classification and Biomarker Identification)\n", + "\n", + "NOTE: The values below are only suggestions for the model parameters. Feel free to adjust them to tune the model and improve the results.\n", + "\n", + "*An example configuration file for the current dataset, incorporating the edits below, can be found at `scaLR/tutorials/pipeline/config_celltype.yaml`. Set the device to `cuda` or `cpu` to match your runtime type.*\n", + "\n", + "- **Device setup**\n", + " - Update `device: 'cuda'` for a GPU-enabled runtime, else `device: 'cpu'` for a CPU-only runtime.\n", + "- **Experiment Config**\n", + " - The default `exp_run` number is `0`. If not changed, all pipeline results for the cell type classification experiment are written to `exp_name_0`.\n", + "- **Data Config**\n", + " - Update the `full_datapath` to `data/modified_adata.h5ad` (as we will include `GeneRecallCurve` in the downstream analysis).\n", + " - Specify the `num_workers` value for effective parallelization.\n", + " - Set `target` to `cell_type`.\n", + "- **Feature Selection**\n", + " - Specify the `num_workers` value for effective parallelization.\n", + " - Update the model layers to `[5000, 10]`, as there are only 10 cell types in the dataset.\n", + " - Change `epoch` to `10`.\n", + "- **Final Model Training**\n", + " - Update the model layers to the same as for feature selection: `[5000, 10]`.\n", + " - Change `epoch` to `100`.\n", + "- **Analysis**\n", + " - **Downstream Analysis**\n", + " - Uncomment the `test_samples_downstream_analysis` section.\n", + " - Update the `reference_genes_path` to `scaLR/tutorials/pipeline/grc_reference_gene.csv`.\n", + " - Please refer to 
the section below:\n", + "\n", + " ```\n", + " analysis:\n", + "\n", + " model_checkpoint: ''\n", + "\n", + " dataloader:\n", + " name: SimpleDataLoader\n", + " params:\n", + " batch_size: 15000\n", + "\n", + " gene_analysis:\n", + " scoring_config:\n", + " name: LinearScorer\n", + "\n", + " features_selector:\n", + " name: ClasswisePromoters\n", + " params:\n", + " k: 100\n", + " test_samples_downstream_analysis:\n", + " - name: GeneRecallCurve\n", + " params:\n", + " reference_genes_path: 'scaLR/tutorials/pipeline/grc_reference_gene.csv'\n", + " top_K: 300\n", + " plots_per_row: 3\n", + " features_selector:\n", + " name: ClasswiseAbs\n", + " params: {}\n", + " - name: Heatmap\n", + " params: {}\n", + " - name: RocAucCurve\n", + " params: {}\n", + " ```\n", + "\n", + "\n", + "\n", + "### Config edits (For clinical condition specific biomarker identification and DGE analysis)\n", + "\n", + "*An example configuration file for the current dataset, incorporating the edits below, can be found at `scaLR/tutorials/pipeline/config_clinical.yaml`. Set the device to `cuda` or `cpu` to match your runtime type.*\n", + "\n", + "- **Experiment Config**\n", + " - Make sure to change the `exp_run` number if an earlier experiment (for example, the cell type classification run) already uses the same number. Since we have already run one experiment, we change the number to `1`.\n", + "- **Data Config**\n", + " - The `full_datapath` remains the same as above.\n", + " - Change the `target` to `disease` (this column contains data for clinical conditions, `COVID-19/normal`).\n", + "- **Feature Selection**\n", + " - Update the model layers to `[5000, 2]`, as there are only two types of clinical conditions.\n", + " - Set `epoch` to `10`.\n", + "- **Final Model Training**\n", + " - Update the model layers to the same as for feature selection: `[5000, 2]`.\n", + " - Set `epoch` to `100`.\n", + "- **Analysis**\n", + " - **Downstream Analysis**\n", + " - Uncomment the `full_samples_downstream_analysis` section.\n", + " - We are 
not performing the `GeneRecallCurve` analysis in this case. It can be performed if `COVID-19/normal` specific reference genes are available, but the set of candidate genes for the normal condition is very broad.\n", + " - There are two options to perform differential gene expression (DGE) analysis: `DgePseudoBulk` and `DgeLMEM`. The parameters are updated as follows. Note that `DgeLMEM` may take longer, as multiprocessing is not very efficient with only 2 CPUs in the current Colab runtime.\n", + " - Please refer to the section below:\n", + " ```\n", + " analysis:\n", + "\n", + " model_checkpoint: ''\n", + "\n", + " dataloader:\n", + " name: SimpleDataLoader\n", + " params:\n", + " batch_size: 15000\n", + "\n", + " gene_analysis:\n", + " scoring_config:\n", + " name: LinearScorer\n", + "\n", + " features_selector:\n", + " name: ClasswisePromoters\n", + " params:\n", + " k: 100\n", + " full_samples_downstream_analysis:\n", + " - name: Heatmap\n", + " params:\n", + " top_n_genes: 100\n", + " - name: RocAucCurve\n", + " params: {}\n", + " - name: DgePseudoBulk\n", + " params:\n", + " celltype_column: 'cell_type'\n", + " design_factor: 'disease'\n", + " factor_categories: ['COVID-19', 'normal']\n", + " sum_column: 'donor_id'\n", + " cell_subsets: ['conventional dendritic cell', 'natural killer cell']\n", + " - name: DgeLMEM\n", + " params:\n", + " fixed_effect_column: 'disease'\n", + " fixed_effect_factors: ['COVID-19', 'normal']\n", + " group: 'donor_id'\n", + " celltype_column: 'cell_type'\n", + " cell_subsets: ['conventional dendritic cell']\n", + " gene_batch_size: 1000\n", + " coef_threshold: 0.1\n", + " ```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wny28AQQm6xB" + }, + "source": [ + "### Run Pipeline " + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "uLgN7MDv7hV-" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/bin/bash: line 1: python: command not 
found\n" + ] + } + ], + "source": [ + "# Possible flags using 'scaLR/pipeline.py'\n", + "!python scaLR/pipeline.py --help" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kTAOOj1CgjJy" + }, + "source": [ + "#### Cell type classification" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "id": "xqvT9AiQFVGq" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2025-02-27 19:02:51,535 - ROOT - INFO : Experiment directory: `scalr_experiments/exp_name_0`\n", + "2025-02-27 19:02:51,544 - ROOT - INFO : Data Ingestion pipeline running\n", + "2025-02-27 19:02:51,544 - DataIngestion - INFO : Generating Train, Validation and Test sets\n", + "2025-02-27 19:03:35,769 - DataIngestion - INFO : Generate label mappings for all columns in metadata\n", + "2025-02-27 19:03:36,946 - ROOT - INFO : Feature Extraction pipeline running\n", + "2025-02-27 19:03:36,946 - File Utils - INFO : Data Loaded from Final datapaths\n", + "2025-02-27 19:03:37,467 - FeatureExtraction - INFO : Feature subset models training\n", + "2025-02-27 19:05:09,181 - ModelTraining - INFO : Building model training artifacts\n", + "2025-02-27 19:05:09,253 - ModelTraining - INFO : Building model training artifacts\n", + "2025-02-27 19:05:09,295 - ModelTraining - INFO : Building model training artifacts\n", + "2025-02-27 19:05:09,393 - ModelTraining - INFO : Building model training artifacts\n", + "2025-02-27 19:05:09,750 - ModelTraining - INFO : Training the model\n", + "2025-02-27 19:05:09,751 - ModelTraining - INFO : Training the model\n", + "2025-02-27 19:05:09,770 - ModelTraining - INFO : Training the model\n", + "2025-02-27 19:05:09,881 - ModelTraining - INFO : Training the model\n", + "2025-02-27 19:05:16,105 - ModelTraining - INFO : Building model training artifacts\n", + "2025-02-27 19:05:16,106 - ModelTraining - INFO : Training the model\n", + "2025-02-27 19:05:16,153 - ModelTraining - INFO : Building model training artifacts\n", 
+ "2025-02-27 19:05:16,154 - ModelTraining - INFO : Training the model\n", + "2025-02-27 19:05:16,168 - ModelTraining - INFO : Building model training artifacts\n", + "2025-02-27 19:05:16,174 - ModelTraining - INFO : Training the model\n", + "2025-02-27 19:05:20,327 - FeatureExtraction - INFO : Feature scoring\n", + "2025-02-27 19:05:20,712 - FeatureExtraction - INFO : Top features extraction\n", + "2025-02-27 19:05:20,719 - FeatureExtraction - INFO : Writing feature-subset data onto disk\n", + "2025-02-27 19:05:51,902 - ROOT - INFO : Final Model Training pipeline running\n", + "2025-02-27 19:05:51,905 - File Utils - INFO : Data Loaded from Feature subset datapaths\n", + "2025-02-27 19:05:52,382 - ModelTraining - INFO : Building model training artifacts\n", + "2025-02-27 19:05:52,841 - ModelTraining - INFO : Training the model\n", + "2025-02-27 19:05:59,278 - ROOT - INFO : Analysis pipeline running\n", + "2025-02-27 19:05:59,281 - File Utils - INFO : Data Loaded from Feature subset datapaths\n", + "2025-02-27 19:05:59,676 - File Utils - INFO : Data Loaded from Feature subset datapaths\n", + "2025-02-27 19:05:59,805 - File Utils - INFO : Data Loaded from Feature subset datapaths\n", + "2025-02-27 19:06:00,379 - Eval&Analysis - INFO : Calculating accuracy and generating classification report on test set\n", + "/home/amit.samal/.local/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n", + " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n", + "/home/amit.samal/.local/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. 
Use `zero_division` parameter to control this behavior.\n", + " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n", + "/home/amit.samal/.local/lib/python3.10/site-packages/sklearn/metrics/_classification.py:1531: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.\n", + " _warn_prf(average, modifier, f\"{metric.capitalize()} is\", len(result))\n", + "2025-02-27 19:06:03,433 - Eval&Analysis - INFO : Performing gene analysis\n", + "2025-02-27 19:06:03,433 - FeatureExtraction - INFO : Feature scoring\n", + "2025-02-27 19:06:03,471 - FeatureExtraction - INFO : Top features extraction\n", + "2025-02-27 19:06:03,540 - Eval&Analysis - INFO : Performing Downstream Analysis on test samples\n", + "2025-02-27 19:06:03,540 - Eval&Analysis - INFO : Performing GeneRecallCurve\n", + "2025-02-27 19:06:04,781 - Eval&Analysis - INFO : Performing Heatmap\n", + "2025-02-27 19:06:09,548 - Eval&Analysis - INFO : Performing RocAucCurve\n", + "2025-02-27 19:06:09,929 - ROOT - INFO : Total time taken: 198.401921749115 s\n", + "2025-02-27 19:06:09,929 - ROOT - INFO : Maximum memory usage: 1915.5625 MB\n" + ] + } + ], + "source": [ + "# Command to run the end-to-end pipeline.\n", + "# This shell will take approximately 00:21:15 (hh:mm:ss) on GPU to run.\n", + "\n", + "!python3 scaLR/pipeline.py --config scaLR/tutorials/pipeline/config_celltype.yaml -l -m" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0IRSOT64gjJy" + }, + "source": [ + "#### Clinical condition specific biomarker identification and differential gene expression analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "e71LHxUvgjJy" + }, + "outputs": [], + "source": [ + "## It takes 01:16:58 (hh:mm:ss) to run on the CPU for clinical condition-specific biomarker identification.\n", + "## To reduce the runtime, please comment out the 
'DgeLMEM' section under 'full_samples_downstream_analysis'.\n", + "\n", + "!python scaLR/pipeline.py --config scaLR/tutorials/pipeline/config_clinical.yaml -l -m" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yviraKXXgjJy" + }, + "source": [ + "Pipeline logs for cell type classification can be found at `scalr_experiments/exp_name_0/logs.txt`.\n", + "\n", + "For clinical condition specific biomarker identification, the logs can be found at `scalr_experiments/exp_name_1/logs.txt`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oe4d74mjIcgW" + }, + "source": [ + "### Results \n", + "The cell type classification and biomarker discovery results are stored under the experiment name `exp_name_0`.\n", + "\n", + "- The classification report can be found at `scalr_experiments/exp_name_0/analysis/classification_report.csv`.\n", + "\n", + "- Top-5k Biomarkers can be found at `scalr_experiments/exp_name_0/analysis/gene_analysis/top_features.json`.\n", + "\n", + "- `Heatmaps` for each class (cell type) can be found at `scalr_experiments/exp_name_0/analysis/test_samples/heatmaps`.\n", + "\n", + "- `Gene_recall_curve` and `roc_auc` data can be found at `scalr_experiments/exp_name_0/analysis/test_samples/`.\n", + "\n", + "- `score_matrix.csv` with gene scores for all classes can be found at `scalr_experiments/exp_name_0/analysis/gene_analysis/score_matrix.csv`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MM5v5OTcQocC" + }, + "outputs": [], + "source": [ + "#Classification report\n", + "pd.read_csv('/content/scalr_experiments/exp_name_0/analysis/classification_report.csv',index_col=0)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rNZt8t-_gjJz" + }, + "outputs": [], + "source": [ + "#ROC_AUC\n", + "display(SVG('/content/scalr_experiments/exp_name_0/analysis/test_samples/roc_auc.svg'))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JBYVFclUgjJz" 
+ }, + "outputs": [], + "source": [ + "# Heatmap for cell type 'classical monocyte'\n", + "display(SVG('/content/scalr_experiments/exp_name_0/analysis/test_samples/heatmaps/classical monocyte.svg'))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zbui27nxIh_J" + }, + "outputs": [], + "source": [ + "# Gene recall curve\n", + "display(SVG('scalr_experiments/exp_name_0/analysis/test_samples/gene_recall_curve.svg'))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "52n0PSr87FjJ" + }, + "source": [ + "\n", + "The clinical condition-specific biomarker identification and DGE analysis use the experiment name `exp_name_1`. All analysis results can be viewed in the `exp_name_1` directory, as explained above for cell type classification. The difference is that `exp_name_1` contains results for only two classes, namely `COVID-19` and `normal`, along with the results of the DGE analysis." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Fgu3MIxggjJ3" + }, + "outputs": [], + "source": [ + "# DgePseudoBulk results for 'conventional dendritic cell' in 'COVID-19' w.r.t. 'normal' samples\n", + "pd.read_csv('/content/scalr_experiments/exp_name_1/analysis/full_samples/pseudobulk_dge_result/pbkDGE_conventionaldendriticcell_COVID-19_vs_normal.csv')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7n_AczPkgjJ3" + }, + "outputs": [], + "source": [ + "# Volcano plot of `log2FoldChange` vs `-log10(pvalue)` in gene expression for\n", + "# 'conventional dendritic cell' in 'COVID-19' w.r.t. 
'normal' samples.\n", + "display(SVG('/content/scalr_experiments/exp_name_1/analysis/full_samples/pseudobulk_dge_result/pbkDGE_conventionaldendriticcell_COVID-19_vs_normal.svg'))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Js1lFjQagjJ3" + }, + "source": [ + "*Note*: A `Fold Change (FC)` of 1.5 in the figure above is equivalent to a `log2 Fold Change` of approximately 0.585." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RL5n6rqzR4Sc" + }, + "source": [ + "## Running scaLR in modules" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6jypX2axToza" + }, + "source": [ + "### Imports" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yqnxGZnHIiJr" + }, + "outputs": [], + "source": [ + "import sys\n", + "sys.path.append('scaLR/')\n", + "import os\n", + "from os import path\n", + "\n", + "from scalr.data_ingestion_pipeline import DataIngestionPipeline\n", + "from scalr.eval_and_analysis_pipeline import EvalAndAnalysisPipeline\n", + "from scalr.feature_extraction_pipeline import FeatureExtractionPipeline\n", + "from scalr.model_training_pipeline import ModelTrainingPipeline\n", + "from scalr.utils import read_data\n", + "from scalr.utils import write_data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tObhEJKkT0Ew" + }, + "source": [ + "### Load Config\n", + "\n", + "The cells below run with the example config files, which already include the required edits. Make sure to change the experiment name if required." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dbrUCh-LTxbl" + }, + "outputs": [], + "source": [ + "config = read_data('scaLR/tutorials/pipeline/config_celltype.yaml')\n", + "# config = read_data('scaLR/tutorials/pipeline/config_clinical.yaml')\n", + "config" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XU-FLwPlULd1" + }, + "outputs": [], + "source": [ + "dirpath = config['experiment']['dirpath']\n", + "exp_name = config['experiment']['exp_name']\n", + "exp_run = config['experiment']['exp_run']\n", + "dirpath = os.path.join(dirpath, f'{exp_name}_{exp_run}')\n", + "os.makedirs(dirpath, exist_ok=True)\n", + "device = config['device']" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C44uQoNiUe4M" + }, + "source": [ + "### Data Ingestion" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JX5nB5gzUh7L" + }, + "outputs": [], + "source": [ + "# This shell will take approximately 00:01:23 (hh:mm:ss) to run.\n", + "\n", + "data_dirpath = path.join(dirpath, 'data')\n", + "os.makedirs(data_dirpath, exist_ok=True)\n", + "\n", + "# Initialize Data Ingestion object\n", + "ingest_data = DataIngestionPipeline(config['data'], data_dirpath)\n", + "\n", + "# Generate Train, Validation and Test Splits for pipeline\n", + "ingest_data.generate_train_val_test_split()\n", + "\n", + "# Apply pre-processing on data\n", + "# Fit on Train data, and then apply on the entire data\n", + "ingest_data.preprocess_data()\n", + "\n", + "# Generate label mappings from the metadata, which are used for\n", + "# labels, etc.\n", + "ingest_data.generate_mappings()\n", + "\n", + "# All the additional data generated (label mappings, data splits, etc.)\n", + "# are passed onto the config for future use in pipeline\n", + "config['data'] = ingest_data.get_updated_config()\n", + "write_data(config, path.join(dirpath, 'config.yaml'))\n", + "del ingest_data" + ] + }, + { + 
"cell_type": "markdown", + "metadata": { + "id": "qc76-jFSVmfY" + }, + "source": [ + "### Feature Selection" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "w4CfG8YQVoTJ" + }, + "outputs": [], + "source": [ + "# This shell will take approximately 00:19:02 (hh:mm:ss) to run.\n", + "\n", + "feature_extraction_dirpath = path.join(dirpath, 'feature_extraction')\n", + "os.makedirs(feature_extraction_dirpath, exist_ok=True)\n", + "\n", + "# Initialize Feature Extraction object\n", + "extract_features = FeatureExtractionPipeline(\n", + " config['feature_selection'], feature_extraction_dirpath, device)\n", + "extract_features.load_data_and_targets_from_config(config['data'])\n", + "\n", + "# Train feature subset models and get scores for each feature/genes\n", + "extract_features.feature_subsetted_model_training()\n", + "extract_features.feature_scoring()\n", + "\n", + "# Extract top features by some algorithm, and write a feature-subsetted\n", + "# dataset\n", + "extract_features.top_feature_extraction()\n", + "config['data'] = extract_features.write_top_features_subset_data(\n", + " config['data'])\n", + "\n", + "# All the additional data generated (subset data splits, etc.)\n", + "# are passed onto the config for future use in pipeline\n", + "config['feature_selection'] = extract_features.get_updated_config()\n", + "write_data(config, path.join(dirpath, 'config.yaml'))\n", + "del extract_features" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z-Scub2RVtqi" + }, + "source": [ + "### Final Model Training" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Roc1gACAVoY6" + }, + "outputs": [], + "source": [ + "# This shell will take approximately 00:06:20 (hh:mm:ss) to run.\n", + "\n", + "model_training_dirpath = path.join(dirpath, 'model')\n", + "os.makedirs(model_training_dirpath, exist_ok=True)\n", + "\n", + "# Initialize Final Model Training object\n", + "model_trainer = 
ModelTrainingPipeline(\n", + " config['final_training']['model'],\n", + " config['final_training']['model_train_config'],\n", + " model_training_dirpath, device)\n", + "model_trainer.load_data_and_targets_from_config(config['data'])\n", + "\n", + "# Build the training artifacts from config, and train the model\n", + "model_trainer.build_model_training_artifacts()\n", + "model_trainer.train()\n", + "\n", + "# All the additional data generated (model defaults filled, etc.)\n", + "# are passed on to the config for later use in the pipeline\n", + "model_config, model_train_config = model_trainer.get_updated_config()\n", + "config['final_training']['model'] = model_config\n", + "config['final_training']['model_train_config'] = model_train_config\n", + "write_data(config, path.join(dirpath, 'config.yaml'))\n", + "del model_trainer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GZFd8R8QWpmS" + }, + "source": [ + "### Evaluation and Analysis" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "w71AS8mXVob9" + }, + "outputs": [], + "source": [ + "# This cell will take approximately 00:00:26 (hh:mm:ss) to run.\n", + "\n", + "analysis_dirpath = path.join(dirpath, 'analysis')\n", + "os.makedirs(analysis_dirpath, exist_ok=True)\n", + "\n", + "# Get path of the best trained model\n", + "config['analysis']['model_checkpoint'] = path.join(\n", + " model_training_dirpath, 'best_model')\n", + "\n", + "# Initialize Evaluation and Analysis Pipeline object\n", + "analyser = EvalAndAnalysisPipeline(config['analysis'], analysis_dirpath,\n", + " device)\n", + "analyser.load_data_and_targets_from_config(config['data'])\n", + "\n", + "# Perform evaluation of trained model on test data and generate\n", + "# classification report\n", + "analyser.evaluation_and_classification_report()\n", + "\n", + "# Perform gene analysis based on the trained model to get\n", + "# top genes / biomarker analysis\n", + "analyser.gene_analysis()\n", + "\n", + 
"# Perform downstream analysis on all samples / test samples\n", + "analyser.full_samples_downstream_anlaysis()\n", + "analyser.test_samples_downstream_anlaysis()\n", + "\n", + "# All the additional data generated\n", + "# are passed on to the config for later use in the pipeline\n", + "config['analysis'] = analyser.get_updated_config()\n", + "write_data(config, path.join(dirpath, 'config.yaml'))\n", + "del analyser" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XCThcOt8gjJ5" + }, + "source": [ + "Analysis results can be viewed inside `scalr_experiments` under the `exp_name` specified in the `config.yaml`, as mentioned above." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0V2-AKThIaks" + }, + "source": [] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "gpuType": "T4", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}