
robopubdata

General developer documentation for robopubdata.

Robopubdata is the back-end pipeline of the BioTools public data analysis module.


Running the pipeline

The pipeline is designed to run as a CronJob: it automatically picks up job submissions from the BioTools module and kicks off the corresponding processes.

The pipeline can also be run from the CLI with the required parameters:

/home/compbio_svc/miniconda3/envs/R-SECUNDO3/bin/python /n/ngs/tools/robopubdata/run_robopub.py \
  --download_dir /PathTo/dir/to/download \
  --sra_list /path/To/textfile/of/SRA/forDownload \
  --lab labID \
  --requester userID \
  --genomeVer refgenomeVersion \
  --genomeAnnotation refGenomeAnnotation \
  --analysisType Download_RNAseq_SingleCell

Note: only genomes under /n/analysis/genome can be picked up by the pipeline.
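A minimal sketch of that constraint (the genome root path comes from this document; the helper name and version-to-directory mapping are assumptions, not the actual run_robopub.py logic):

```python
from pathlib import Path

# Only genomes under this root are accepted by the pipeline.
GENOME_ROOT = Path("/n/analysis/genome")

def resolve_genome(genome_ver: str) -> Path:
    """Hypothetical helper: map a --genomeVer value to its directory
    under GENOME_ROOT, rejecting anything outside that root."""
    genome_dir = GENOME_ROOT / genome_ver
    if not genome_dir.is_dir():
        raise FileNotFoundError(
            f"Genome '{genome_ver}' not found under {GENOME_ROOT}; "
            "only genomes under /n/analysis/genome can be used."
        )
    return genome_dir
```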

Using the BioTools module rather than the CLI is recommended.


Code Structure and logic

The source code for robopub is stored under /n/ngs/tools/robopubdata.

The pipeline comprises three Nextflow pipelines: nf-core-fetchngs, scRNAseq_roboPub, and Scundo_roboPub.

  1. nf-core-fetchngs is an nf-core pipeline. Its GitHub page can be found at https://github.com/nf-core/fetchngs

  2. scRNAseq_roboPub is an in-house pipeline for single-cell RNA-seq analysis:

    • main.nf defines the general logic
    • workflows/ contains all workflows defined in main.nf
    • modules/ contains all processes defined in main.nf and in the workflows under workflows/
    • bin/ contains all the Python and R scripts used by processes under modules/ and workflows/
    • nextflow.config contains all process parameters for Slurm resource allocation
    • assets/ contains files needed for HTML report and shinyApp generation
  3. Scundo_roboPub is an in-house pipeline for bulk RNA-seq analysis:

    • main.nf defines the general logic
    • workflows/ contains all workflows defined in main.nf
    • modules/ contains all processes defined in main.nf and in the workflows under workflows/
    • bin/ contains all the Python and R scripts used by processes under modules/ and workflows/
    • nextflow.config contains all process parameters for Slurm resource allocation
    • assets/ contains files needed for Rmd HTML report generation

The three Nextflow pipelines are tied together by the manager script run_robopub.py.
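As a rough sketch of that chaining (the pipeline paths, Nextflow arguments, and the bulk analysisType name here are illustrative assumptions, not the actual run_robopub.py implementation):

```python
import subprocess
import sys

# Hypothetical mapping from the --analysisType parameter to the
# in-house analysis pipeline that runs after nf-core-fetchngs.
# Only Download_RNAseq_SingleCell appears in the docs; the bulk
# key is a guess for illustration.
PIPELINES = {
    "Download_RNAseq_SingleCell": "scRNAseq_roboPub/main.nf",
    "Download_RNAseq_Bulk": "Scundo_roboPub/main.nf",
}

def run_nextflow(pipeline: str, *extra_args: str) -> None:
    """Launch one Nextflow pipeline and stop the chain if it fails."""
    cmd = ["nextflow", "run", pipeline, *extra_args]
    result = subprocess.run(cmd)
    if result.returncode != 0:
        sys.exit(f"{pipeline} failed with exit code {result.returncode}")

def run_job(analysis_type: str, sra_list: str, download_dir: str) -> None:
    # Step 1: fetch the raw fastq data with nf-core-fetchngs.
    run_nextflow("nf-core-fetchngs/main.nf",
                 "--input", sra_list, "--outdir", download_dir)
    # Step 2: hand the downloads to the matching in-house pipeline.
    run_nextflow(PIPELINES[analysis_type], "--input_dir", download_dir)
```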


CronJobs

CronJobs are used to automatically kick off robopubdata. They run under the compbio_svc account.

Command:

* * * * * /home/compbio_svc/miniconda3/envs/R-SECUNDO3/bin/python /n/ngs/tools/robopubdata/CronJob_PDataDIY.py >> /n/core/Bioinformatics/PDataDIY/logs/CronJob.log 2>&1

The cron job looks for new CSV files under /n/core/Bioinformatics/PDataDIY.


Log Files

Log files for PRIME are stored under /n/core/Bioinformatics/PDataDIY/logs

  • CronJob.log : Direct output of the CronJob script
  • BioTools_PDataDIY_Orders.log : Orders detected in previous runs, with timestamps
  • nextflow_run_logs/ : Outputs of the pipeline per flowcell

Pipeline Environment

  1. Python env for the manager script and CronJob: /home/compbio_svc/miniconda3/envs/R-SECUNDO3/bin/python
  2. Conda env for nf-core-fetchngs: uses multiple conda environments internally; refer to https://github.com/nf-core/fetchngs
  3. Conda env for scRNAseq_roboPub & Scundo_roboPub: /home/compbio_svc/miniconda3/envs/R-SECUNDO3

Run Outputs

The pipeline downloads public fastq data to /n/core/Bioinformatics/PublicData, into directories named after SRA project IDs.

All intermediate files are stored under /n/core/Bioinformatics/PDataDIY, in directories named after BioTools PubData module job IDs.
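A small sketch of that layout (the two root paths come from this document; the helper names and the example IDs are hypothetical):

```python
from pathlib import Path

# Downloads: one directory per SRA project ID.
FASTQ_ROOT = Path("/n/core/Bioinformatics/PublicData")
# Intermediates: one directory per BioTools PubData job ID.
WORK_ROOT = Path("/n/core/Bioinformatics/PDataDIY")

def download_dir(sra_project_id: str) -> Path:
    """Where public fastq files for one SRA project land."""
    return FASTQ_ROOT / sra_project_id

def work_dir(job_id: str) -> Path:
    """Where intermediate files for one BioTools PubData job live."""
    return WORK_ROOT / job_id
```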

About

Backend analysis pipeline for BioTools Public Data analysis module
