General developer documentation for robopubdata.
Robopubdata is the back-end pipeline of the BioTools public data analysis module.
The pipeline is designed to run as a cron job: it automatically picks up job submissions from the BioTools module and kicks off the corresponding processes.
The pipeline can also be run from the CLI with the required parameters:
/home/compbio_svc/miniconda3/envs/R-SECUNDO3/bin/python /n/ngs/tools/robopubdata/run_robopub.py \
    --download_dir /PathTo/dir/to/download \
    --sra_list /path/To/textfile/of/SRA/forDownload \
    --lab labID \
    --requester userID \
    --genomeVer refgenomeVersion \
    --genomeAnnotation refGenomeAnnotation \
    --analysisType Download_RNAseq_SingleCell
Note that only genomes under /n/analysis/genome can be picked up by the pipeline.
Using the BioTools module rather than the CLI is recommended.
The source code for robopub is stored under: /n/ngs/tools/robopubdata
The pipeline contains three Nextflow pipelines: nf-core-fetchngs, scRNAseq_roboPub, and Scundo_roboPub.
- nf-core-fetchngs is an nf-core pipeline; its GitHub page is at https://github.com/nf-core/fetchngs
- scRNAseq_roboPub is an in-house pipeline for single-cell RNA-seq analysis:
  - main.nf defines the general logic
  - workflows/ contains all workflows defined in main.nf
  - modules/ contains all processes defined in main.nf and in the workflows under workflows/
  - bin/ contains the Python and R scripts used by processes under modules/ and workflows/
  - nextflow.config contains the per-process Slurm resource allocations
  - assets/ contains the files needed for HTML report and Shiny app generation
- Scundo_roboPub is an in-house pipeline for bulk RNA-seq analysis:
  - main.nf defines the general logic
  - workflows/ contains all workflows defined in main.nf
  - modules/ contains all processes defined in main.nf and in the workflows under workflows/
  - bin/ contains the Python and R scripts used by processes under modules/ and workflows/
  - nextflow.config contains the per-process Slurm resource allocations
  - assets/ contains the files needed for R Markdown (rmd) HTML report generation
The three Nextflow pipelines are tied together by the manager script run_robopub.py.
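In outline, the manager's job is to chain fetchngs with whichever analysis pipeline matches the requested analysisType. The sketch below illustrates that dispatch idea only; the function names, parameter names, and the "SingleCell" dispatch rule are assumptions for illustration, not the actual run_robopub.py API.

```python
import subprocess

def build_nextflow_cmd(pipeline_dir, params):
    """Assemble a `nextflow run` command for one pipeline stage."""
    cmd = ["nextflow", "run", f"{pipeline_dir}/main.nf"]
    for key, value in sorted(params.items()):
        cmd += [f"--{key}", str(value)]
    return cmd

def select_pipeline(analysis_type):
    """Pick the downstream analysis pipeline for a request (hypothetical rule)."""
    # Assumed rule: single-cell requests go to scRNAseq_roboPub,
    # everything else to the bulk pipeline Scundo_roboPub.
    return "scRNAseq_roboPub" if "SingleCell" in analysis_type else "Scundo_roboPub"

def run_robopub_sketch(sra_list, download_dir, analysis_type):
    """Chain the two stages: fetch the data, then analyze it."""
    # Stage 1: fetch the raw fastq data (nf-core-fetchngs).
    subprocess.run(
        build_nextflow_cmd("nf-core-fetchngs",
                           {"input": sra_list, "outdir": download_dir}),
        check=True)
    # Stage 2: run the matching downstream analysis pipeline.
    subprocess.run(
        build_nextflow_cmd(select_pipeline(analysis_type),
                           {"input": download_dir}),
        check=True)
```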
A cron job is used to automatically kick off robopubdata; it runs under the compbio_svc account.
Crontab entry (runs every minute):
* * * * * /home/compbio_svc/miniconda3/envs/R-SECUNDO3/bin/python /n/ngs/tools/robopubdata/CronJob_PDataDIY.py >> /n/core/Bioinformatics/PDataDIY/logs/CronJob.log 2>&1
The cron job looks for new CSV files under /n/core/Bioinformatics/PDataDIY/.
Log files are stored under /n/core/Bioinformatics/PDataDIY/logs:
- CronJob.log : Direct output of the CronJob script
- BioTools_PDataDIY_Orders.log : Orders detected in previous runs, with timestamps
- nextflow_run_logs/ : Pipeline outputs, one subdirectory per flowcell
- Python environment for the manager script and CronJob: /home/compbio_svc/miniconda3/envs/R-SECUNDO3/bin/python
- Conda environments for nf-core-fetchngs: the pipeline uses multiple conda environments internally; refer to https://github.com/nf-core/fetchngs
- Conda environment for scRNAseq_roboPub and Scundo_roboPub: /home/compbio_svc/miniconda3/envs/R-SECUNDO3
The pipeline downloads public fastq data to /n/core/Bioinformatics/PublicData, into directories named after the SRA project IDs.
All intermediate files are stored under /n/core/Bioinformatics/PDataDIY, in directories named after the BioTools PubData module job IDs.