This tutorial guides you through the first step of a long-read sequencing analysis workflow — basecalling — using Oxford Nanopore data on the OSPool high-throughput computing ecosystem. You will learn how to:
- Basecall raw Nanopore sequenced reads using the latest Dorado basecaller
- Use the OSPool's GPU capacity to accelerate basecalling with Dorado
- Break down massive bioinformatics workflows into many independent smaller tasks
- Submit hundreds to thousands of jobs with a few simple commands
- Use the Open Science Data Federation (OSDF) to manage file transfers during job submission
The steps of this tutorial run across hundreds (or thousands) of jobs using the HTCondor workload manager and Apptainer containers to execute your software reliably and reproducibly at scale. The tutorial uses realistic genomics data and emphasizes performance, reproducibility, and portability. You will work with real data and see how high-throughput computing (HTC) can accelerate your workflows.
Note
If you are new to running jobs on the OSPool, complete the HTCondor "Hello World" tutorial and our "Submit Jobs to the OSPool" guide before starting this tutorial.
Let’s get started!
This tutorial assumes that you:
- Have basic command-line experience (e.g., navigating directories, using bash, editing text files)
- Have a working OSPool account and can log into an Access Point (e.g., ap40.uw.osg-htc.org)
- Are familiar with HTCondor job submission, including writing simple `.sub` files and tracking job status with `condor_q`
- Understand the general workflow of long-read sequencing analysis
- Have sufficient disk quota and file permissions in your OSPool home and OSDF directories
Tip
You do not need to be a genomics expert to follow this tutorial. The commands and scripts are beginner-friendly and self-contained while reflecting real-world research workflows.
Log into your OSPool account:
```bash
ssh user.name@ap##.uw.osg-htc.org
```
To obtain a copy of the tutorial files, you can:

- Clone the repository:

  ```bash
  git clone https://github.com/dmora127/tutorial-ONT-Basecalling.git
  cd tutorial-ONT-Basecalling/
  bash tutorial-setup.sh <username>
  ```

  This script creates the directory structure in your home directory (`/home/<user.name>/tutorial-ONT-Basecalling/`) and your OSPool data directory (`/ospool/ap40/data/<user.name>/tutorial-ONT-Basecalling/`), along with several subdirectories used in this tutorial.

- Or download the toy dataset using Pelican (CURRENTLY UNAVAILABLE - WORK IN PROGRESS):

  ```bash
  pelican object get pelican://osg-htc.org/ospool/uw-shared/OSG-Staff/osg-training/tutorial-ospool-genomics/data/path/to/pod5/files ./
  ```
In Oxford Nanopore Technologies (ONT) sequencing, the instrument does not directly read DNA or RNA bases. Instead, it measures subtle changes in ionic current as a single-stranded molecule passes through a nanopore embedded in a membrane. These electrical current traces—often called raw signals or squiggles—reflect the sequence of the nucleotides inside the pore.
Basecalling converts continuous electrical signals into nucleotide sequences (A, C, G, T, or U). Basecalling algorithms interpret the temporal and amplitude patterns in the current signal to infer the most probable sequence of bases. This step is one of the most computationally demanding parts of the ONT analysis pipeline.
Dorado uses deep neural network models trained to map signal patterns to their corresponding base sequences. These models rely on GPU acceleration to perform millions of operations per second. GPUs dramatically shorten inference time by parallelizing the math behind neural network prediction, enabling accurate and high-throughput decoding of long-read data.
Because each read or POD5 file can be basecalled independently, Dorado workflows scale perfectly on the OSPool, where thousands of GPU-enabled jobs can run simultaneously. The OSPool’s distributed architecture provides the compute, memory, and data-staging infrastructure (via OSDF) needed to handle large sequencing runs reproducibly and efficiently. You can turn hours of local computation into minutes of parallel basecalling across the national HTC fabric.
Each Oxford Nanopore flow cell contains hundreds to thousands of channels that generate independent signal traces, meaning the data in a single POD5 file can be subdivided into smaller, channel-specific subsets. By splitting the data so that each channel’s reads are in their own POD5 file, you enable true parallel basecalling: each channel file becomes an independent job that can run simultaneously across hundreds or thousands of OSPool execution points. This design aligns with the principles of High Throughput Computing (HTC)—many small, independent jobs working together to accelerate large workflows. Channel-level splitting also enables duplex basecalling, which requires organizing data to pair complementary reads, maximizing accuracy and yield.
To split reads by channel, you can use the POD5 package available inside your dorado.sif container to generate per-channel subsets and organize them into a split_pod5_subsets directory. Once subdivided, each new POD5 file can be basecalled individually, dramatically reducing time-to-results while maintaining reproducibility and scalability on the OSPool.
Use a layout like the one below (directories may already exist if you ran the setup script):

```
./tutorial-ONT-Basecalling/
├── executables                 # scripts for preprocessing and running dorado jobs
│   ├── preprocessing_pod5s.sh
│   └── run_dorado.sh
├── inputs                      # POD5 inputs (or OSDF paths to them)
├── list_of_pod5_files.txt      # one POD5 path per line
├── logs                        # condor .log/.out/.err
│   ├── err
│   └── out
├── outputs                     # basecalled outputs (BAM/FASTQ)
├── README.md                   # tutorial step-by-step documentation and instructions
├── run_dorado_at_CHTC.sub
├── run_dorado.sub              # HTCondor submit file (GPU)
├── software                    # container recipes
│   └── dorado.def
└── tutorial-setup.sh           # optional helper to set up the tree
```

You also have a companion OSDF directory for storing large files, such as containers and Dorado models:

```
/ospool/<ap##>/data/<your-username>/tutorial-ONT-Basecalling/
├── data/
│   ├── dna_r10.4.1_e8.2_400bps_fast@v4.2.0_5mCG_5hmCG@v2.tar.gz
│   ├── dna_r10.4.1_e8.2_400bps_fast@v4.2.0.tar.gz
│   └── rna004_130bps_sup@v5.2.0.tar.gz
└── software/
    └── dorado_build1.2.0_27OCT2025_v1.sif
```

Run the included tutorial-setup.sh script in the companion repository to create this structure.
Before basecalling, set up your software environment to run Dorado inside an Apptainer container.
1. Set your temporary Apptainer build directory to `$HOME/tmp`. Once you've built your container, you can delete the contents of this directory to reduce quota usage on `/home`. On the Access Point (AP), run:

   ```bash
   mkdir -p $HOME/tmp
   export TMPDIR=$HOME/tmp
   export APPTAINER_TMPDIR=$HOME/tmp
   export APPTAINER_CACHEDIR=$HOME/tmp
   ```
Caution
Run these commands every time you log in or build a new container. Building Apptainer containers without setting these variables places excessive strain on shared storage resources and violates OSPool usage policies. Failure to follow these steps may result in restricted access.
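Since these variables must be set in every login session, you may prefer to persist them. The sketch below appends them to `~/.bashrc` (an assumption — use your own shell's startup file) with a guard against duplicate entries:

```bash
# Persist the Apptainer temp/cache settings so each new login picks them up.
# NOTE: ~/.bashrc is an assumption; adjust if you use a different shell.
RC_FILE="$HOME/.bashrc"
if ! grep -q 'APPTAINER_TMPDIR' "$RC_FILE" 2>/dev/null; then
    {
        echo 'export TMPDIR=$HOME/tmp'
        echo 'export APPTAINER_TMPDIR=$HOME/tmp'
        echo 'export APPTAINER_CACHEDIR=$HOME/tmp'
    } >> "$RC_FILE"
fi
mkdir -p "$HOME/tmp"
```

After appending, run `source ~/.bashrc` (or log out and back in) for the settings to take effect in your current session.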
2. Change to the `software/` directory in your tutorial folder:

   ```bash
   cd ~/tutorial-ONT-Basecalling/software/
   ```
3. Create a definition file for Apptainer to build your Dorado container. Open a text editor, such as `vim` or `nano`, and save the following as `software/dorado.def`:

   ```
   Bootstrap: docker
   From: nvidia/cuda:13.0.1-cudnn-runtime-ubuntu22.04

   %post
       DEBIAN_FRONTEND=noninteractive

       # system packages
       apt-get update -y
       apt-get install -y python3-minimal curl
       curl -sS https://bootstrap.pypa.io/get-pip.py -o get-pip.py
       python3 get-pip.py --break-system-packages
       rm get-pip.py
       apt-get install -y bedtools

       # install Dorado and POD5
       cd /opt/
       curl -L https://cdn.oxfordnanoportal.com/software/analysis/dorado-1.2.0-linux-x64.tar.gz -o ./dorado-1.2.0-linux-x64.tar.gz
       tar -zxvf dorado-1.2.0-linux-x64.tar.gz
       rm dorado-1.2.0-linux-x64.tar.gz

       # install POD5 using pip
       pip install pod5 --break-system-packages

   %environment
       # set up environment for when using the container
       # add Dorado to the $PATH variable for ease of use
       export PATH="/opt/dorado-1.2.0-linux-x64/bin/:$PATH"
   ```

   This definition file uses the NVIDIA CUDA 13.0.1 libraries on an Ubuntu 22.04 base image and installs the packages needed to run Dorado and POD5 in an Apptainer container.
4. Build your Apptainer container on the Access Point (AP):

   ```bash
   apptainer build dorado_build1.2.0_27OCT2025_v1.sif dorado.def
   ```

   Warning

   This will take a few minutes depending on system usage. If you encounter any errors during the build process, double-check that you have set your temporary directories correctly (see step 1). Contact the Research Computing Facilitation (RCF) team if issues persist.
5. Move your finalized container image, `dorado_build1.2.0_27OCT2025_v1.sif`, to your OSDF directory:

   ```bash
   mv dorado_build1.2.0_27OCT2025_v1.sif /ospool/ap40/data/<user.name>/tutorial-ONT-Basecalling/software/
   ```
This section covers two important preparatory steps before submitting your basecalling jobs to the OSPool: downloading the Dorado basecalling models and splitting your POD5 files by sequencing channel.
Dorado basecalling models are not included in the Dorado Apptainer container image by default. As a result, we must download the models we wish to use for basecalling prior to submitting our basecalling jobs. We can download the models directly using the Dorado command line interface (CLI) within our Dorado Apptainer container.
1. Launch an interactive Apptainer shell session on the Access Point (AP) using your `dorado_build1.2.0_27OCT2025_v1.sif` container:

   ```bash
   cd /ospool/ap40/data/<user.name>/tutorial-ONT-Basecalling/
   apptainer shell --bind "$PWD" software/dorado_build1.2.0_27OCT2025_v1.sif
   ```
2. Download your desired Dorado model (e.g., `dna_r10.4.1_e8.2_400bps_hac@v5.2.0`) to the `data/` directory in your OSPool tutorial directory:

   ```bash
   dorado download --model dna_r10.4.1_e8.2_400bps_hac@v5.2.0 --models-directory /ospool/ap40/data/<user.name>/tutorial-ONT-Basecalling/data/
   ```
Note
This step downloads the requested Dorado basecalling model. To download a different model, replace the `--model dna_r10.4.1_e8.2_400bps_hac@v5.2.0` argument with `--model <model_name>`, where `<model_name>` is the name of the model you wish to download (e.g., `duplex_sup@v5.2.0`). You can also download all available models (not recommended, as there are many) using the `--model all` option. Not sure which model to use? Refer to the Model Selection - Dorado documentation for more information on selecting the appropriate model for your data.
3. Exit the Apptainer shell:

   ```bash
   exit
   ```

4. Compress the models into tar.gz files for efficient transfer to the Execution Point (EP) during job submission:

   ```bash
   cd /ospool/ap40/data/<user.name>/tutorial-ONT-Basecalling/data/
   tar -czf dna_r10.4.1_e8.2_400bps_hac@v5.2.0.tar.gz dna_r10.4.1_e8.2_400bps_hac@v5.2.0 && rm -rf dna_r10.4.1_e8.2_400bps_hac@v5.2.0/
   ```
When basecalling our sequencing data with Dorado, we can subdivide our POD5 files into smaller individual subsets. This subdivision enables us to take advantage of the OSPool's High Throughput Computing (HTC) principles, significantly decreasing the time-to-results for our basecalling. We will use the POD5 package installed in our dorado.sif container—if you still need to build the dorado.sif Apptainer image, refer to Setting up our software environment.
1. Move to the `inputs/` directory where your POD5 files are located:

   ```bash
   cd ~/tutorial-ONT-Basecalling/inputs/
   ```
2. Launch an interactive Apptainer shell session on the Access Point (AP) using your `dorado_build1.2.0_27OCT2025_v1.sif` container:

   ```bash
   apptainer shell --bind "$PWD" /ospool/ap40/data/<user.name>/tutorial-ONT-Basecalling/software/dorado_build1.2.0_27OCT2025_v1.sif
   ```
3. Create a table mapping each read ID to its channel:

   ```bash
   pod5 view <pod5_dir> --include "read_id, channel" --output summary.tsv
   ```

   This generates a TSV table mapping each read_id to its channel for basecalling. You can replace `<pod5_dir>` with `./` if your POD5 files are in the current directory.

4. Subset your POD5 reads by channel using the `summary.tsv` table created in the previous step, writing the POD5 subsets to `split_pod5_subsets/`:

   ```bash
   pod5 subset <pod5_dir> --summary summary.tsv --columns channel --output split_pod5_subsets
   ```
Warning
This step will take a while depending on the size of your POD5 files and the number of channels present in your data. It will generate one pod5 file per channel in the split_pod5_subsets output directory. Most MinION Flow Cells have 512 channels, so you should expect to see ~512 POD5 files in your output directory. Never output your subsetted POD5 files to your OSPool directory, as it will cause excessive strain on shared storage resources and violate OSPool usage policies. Only output your subsetted POD5 files to your home directory.
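After the split completes, it is worth counting the per-channel files before building your job list. The helper below is a small sketch of our own (not part of the tutorial scripts); it prints the number of `.pod5` files in the output directory:

```bash
# count_pod5 DIR → print the number of .pod5 files in DIR
count_pod5() {
    ls "$1" 2>/dev/null | grep -c '\.pod5$'
}

# For a MinION flow cell, expect up to ~512 files (one per active channel);
# channels that produced no reads will have no file.
count_pod5 split_pod5_subsets || true
```

If the count is far below the number of channels you expect, re-check the `pod5 subset` step before submitting jobs.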
5. Exit the Apptainer shell:

   ```bash
   exit
   ```

   You may have to run `exit` multiple times to exit the Apptainer shell and return to your normal bash shell.

6. Create a list of POD5 files to iterate through during job submission:

   ```bash
   ls split_pod5_subsets > ~/tutorial-ONT-Basecalling/list_of_pod5_files.txt
   ```
   If you `head` this new file, you should see output similar to this:

   ```
   [user.name@ap40 user.name]$ head ~/tutorial-ONT-Basecalling/list_of_pod5_files.txt
   channel-100.pod5
   channel-101.pod5
   channel-102.pod5
   channel-103.pod5
   channel-104.pod5
   channel-105.pod5
   channel-106.pod5
   channel-107.pod5
   channel-108.pod5
   channel-109.pod5
   [user.name@ap40 user.name]$
   ```
Now that your data are organized into channel-specific subsets, you’re ready to submit your jobs to the OSPool.
1. Change to your `tutorial-ONT-Basecalling/` directory:

   ```bash
   cd ~/tutorial-ONT-Basecalling/
   ```
2. Create your Dorado simplex basecalling executable `~/tutorial-ONT-Basecalling/executables/run_dorado.sh`. No changes should be necessary if you use the same Dorado script as in the example below.

   ```bash
   #!/bin/bash
   # Run Dorado on the EP for each POD5 file (non-resumeable)
   # Usage: ./run_dorado.sh "<dorado_args>" <dorado_model_name> <input_pod5_file>
   set -euo pipefail

   DORADO_ARGS="$1"
   DORADO_MODEL_NAME="$2"
   INPUT_POD5="$3"
   BAM_FILE="${INPUT_POD5}.bam"
   FASTQ_FILE="${BAM_FILE}.fastq"

   echo "Running Dorado with args: ${DORADO_ARGS}"
   echo "Model Name: ${DORADO_MODEL_NAME}"
   echo "Input POD5: ${INPUT_POD5}"
   echo "Output BAM: ${BAM_FILE}"

   # untar your Dorado basecalling models
   tar -xvzf "${DORADO_MODEL_NAME}.tar.gz"
   rm "${DORADO_MODEL_NAME}.tar.gz"
   echo "Completed model extraction."

   # Run Dorado
   dorado ${DORADO_ARGS} "${DORADO_MODEL_NAME}" "${INPUT_POD5}" > "${BAM_FILE}"
   echo "Completed Dorado basecalling."

   # Convert BAM to FASTQ
   bedtools bamtofastq -i "${BAM_FILE}" -fq "${FASTQ_FILE}"
   echo "Completed BAM → FASTQ conversion."
   ```
3. Create your submit file `/home/<user.name>/tutorial-ONT-Basecalling/run_dorado.sub`. You will need to modify the `container_image` and `transfer_input_files` attributes to reflect your OSPool username and the Dorado model you wish to use for basecalling. You may also need to modify the `arguments` attribute if you wish to change the Dorado basecalling parameters.

   ```
   container_image = osdf:///ospool/ap40/data/<user.name>/tutorial-ONT-Basecalling/software/dorado_build1.2.0_27OCT2025_v1.sif
   DORADO_MODEL = dna_r10.4.1_e8.2_400bps_hac@v5.2.0

   executable = ./executables/run_dorado.sh
   # Place your Dorado command (following the Dorado invocation) in the first positional argument
   arguments = "'basecaller --device cuda:all --batchsize 16 hac@v5.0.0 --models-directory' $(DORADO_MODEL) $(POD5_input_file)"

   transfer_input_files = inputs/split_pod5_subsets/$(POD5_input_file), osdf:///ospool/ap40/data/<user.name>/tutorial-ONT-Basecalling/data/$(DORADO_MODEL).tar.gz
   transfer_output_files = $(POD5_input_file).bam, $(POD5_input_file).bam.fastq
   transfer_output_remaps = "$(POD5_input_file).bam = ./outputs/basecalledBAMs/$(POD5_input_file).bam; \
                             $(POD5_input_file).bam.fastq = ./outputs/basecalledFASTQs/$(POD5_input_file).bam.fastq"

   output = ./logs/out/$(POD5_input_file)_$(Process)_basecalling_step2.out
   error = ./logs/err/$(POD5_input_file)_$(Process)_basecalling_step2.err
   log = ./logs/$(POD5_input_file)_basecalling_step2.log

   request_cpus = 1
   request_disk = 8 GB
   request_memory = 4 GB
   retry_request_memory = 24 GB
   request_gpus = 1

   queue POD5_input_file from list_of_pod5_files.txt
   ```
This submit file reads the contents of /home/<user.name>/tutorial-ONT-Basecalling/list_of_pod5_files.txt, iterates through each line, and assigns the value of each line to the variable $(POD5_input_file). This allows you to programmatically submit N jobs, where N equals the number of POD5 subset files you previously created. Each job processes one POD5 channel subset (e.g., channel-100.pod5) using your specified Dorado model (e.g., dna_r10.4.1_e8.2_400bps_hac@v5.2.0.tar.gz), both of which are transferred to the Execution Point (EP) by HTCondor. Additionally, the container_image attribute ensures your dorado_build1.2.0_27OCT2025_v1.sif Apptainer image is transferred and started for each job.
The submit file instructs the EP to run your executable run_dorado.sh and pass it the values in the arguments attribute. The arguments attribute lets you customize the parameters passed to Dorado directly in your submit file, without editing your executable.
The submit file also specifies resource requests for each job, including 1 CPU, 1 GPU, 8 GB of disk space, and 4 GB of memory (with a retry request of 24 GB if the job fails due to insufficient memory). Finally, the transfer_output_remaps attribute ensures that the output files are organized into specific directories within your tutorial folder.
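Because `queue POD5_input_file from list_of_pod5_files.txt` queues one job per line of the list file, you can predict the batch size before submitting. A small helper of our own (the function name is hypothetical, not part of the tutorial scripts):

```bash
# expected_jobs LIST_FILE → print the number of jobs condor_submit will queue,
# i.e. the number of lines in the list file
expected_jobs() {
    wc -l < "$1" | tr -d ' '
}
```

For example, `expected_jobs list_of_pod5_files.txt` should print roughly 512 for a full MinION run split by channel.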
Note
The example submit file above runs the hac@v5.0.0 model for simplex basecalling. To perform duplex basecalling instead, change the Dorado invocation in the arguments line (e.g., to duplex sup --models-directory). For additional usage information, refer to the Dorado User Documentation.
4. Submit your basecalling jobs:

   ```bash
   condor_submit run_dorado.sub
   ```

5. Track your job progress:

   ```bash
   condor_watch_q
   ```
Tip
If you see held jobs (e.g., for memory overruns), use condor_qedit to increase resources, then condor_release to resume them.
For more details, see our guide on Monitor and Review Jobs With condor_q and condor_history.
You should now have a directory of basecalled FASTQ and BAM files in your outputs folder. You'll likely want to perform additional steps after basecalling, such as checking read quality, mapping to a reference genome, or calling variants. We recommend merging your basecalled FASTQ files into a single file for downstream analysis. You can do this using the cat command:
```bash
for f in outputs/basecalledFASTQs/*.fastq; do
    cat "$f" >> outputs/merged_basecalled_reads.fastq
done
```
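To confirm the merge succeeded, you can count the reads in the merged file. A FASTQ record is always exactly four lines, so the read count is the line count divided by four; the helper name below is our own, not part of the tutorial scripts:

```bash
# count_fastq_reads FILE → print the number of reads in FILE
# (a FASTQ record is exactly 4 lines: header, sequence, '+', qualities)
count_fastq_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}
```

For example, `count_fastq_reads outputs/merged_basecalled_reads.fastq` should print the total number of basecalled reads across all channels.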
You can use this merged FASTQ file for running FastQC, mapping, or variant calling in the next sections. The recommended next step is to run FastQC to assess the quality of your basecalled reads. You can find a step-by-step tutorial on running FastQC on the OSPool on our documentation portal.
Now that you've completed this long-read basecalling tutorial on the OSPool, you're ready to adapt these workflows for your own data and research questions. Here are some suggestions for what you can do next:
🧬 Apply the Workflow to Your Own Data
- Replace the tutorial datasets with your own POD5 files and reference genome.
- Modify the basecalling submit files to fit your data size, read type (e.g., simplex vs. duplex), and resource needs.
🧰 Customize or Extend the Workflow
- Incorporate quality control steps (e.g., filtering or read statistics) using FastQC.
- Add downstream tools for annotation, comparison, or visualization (e.g., IGV, bedtools, SURVIVOR).
📦 Create Your Own Containers
- Extend the Apptainer containers used here with additional tools, reference data, or dependencies.
- For help with this, see our Containers Guide.
🚀 Run Larger Analyses
- Submit thousands of basecalling or alignment jobs across the OSPool.
- Explore data staging best practices using the OSDF for large-scale genomics workflows.
- Consider using workflow managers (e.g., DAGMan or Pegasus) with HTCondor.
🧑💻 Get Help or Collaborate
- Reach out to support@osg-htc.org for one-on-one help with scaling your research.
- Attend office hours or training sessions—see the OSPool Help Page for details.
In this tutorial, we created a starter Apptainer container that includes tools like Dorado, POD5, and bedtools. This container can serve as a jumping-off point if you need to install additional software for your workflows.
Our recommendation for most users is to use Apptainer containers for deploying their software. For instructions on how to build an Apptainer container, see our guide Using Apptainer/Singularity Containers. If you are familiar with Docker, or want to learn how to use Docker, see our guide Using Docker Containers.
This information can also be found in our guide Using Software on the Open Science Pool.
The ecosystem for moving data to, from, and within the HTC system can be complex, especially if trying to work with large data (> gigabytes). For guides on how data movement works on the HTC system, see our Data Staging and Transfer to Jobs guides.
The OSPool has GPU nodes available for common use, like the ones used in this tutorial. If you would like to learn more about our GPU capacity, please visit our GPU Guide on the OSPool Documentation Portal.
The OSPool Research Computing Facilitators are here to help researchers using the OSPool for their research. We provide a broad swath of research facilitation services, including:
- Web guides: OSPool Guides - instructions and how-tos for using the OSPool and OSDF.
- Email support: get help within 1-2 business days by emailing support@osg-htc.org.
- Virtual office hours: live discussions with facilitators - see the Email, Office Hours, and 1-1 Meetings page for current schedule.
- One-on-one meetings: dedicated meetings to help new users and groups get started on the system; email support@osg-htc.org to request a meeting.
This information, and more, is provided in our Get Help page.

