Overview
Dataset tiers
Data generation workflow
File types
SV placement
SV size
Read characteristics
VARium is an extensive suite of synthetic genomes designed to systematically assess the performance of structural variant (SV) discovery tools as a function of key domain parameters and confounders. The collection provides multi-platform, multi-coverage simulated WGS datasets with SVs stratified by type, size, genomic context, and genome complexity.
VARium enables rigorous, reproducible, and fine-grained stress-testing of recall and precision across diverse biological and technical conditions, while establishing an interpretable upper bound on achievable method performance.
All datasets are publicly accessible via Google Cloud: VARium Google Bucket
VARium currently comprises 119 synthetic genomes organized into two tiers:
- Tier 1 (Recall-focused) - 79 genomes designed to evaluate simple SV recall
  - 4 variant types: DEL, DUP, INS, INV
  - 7 size ranges: 50-150bp, 150-500bp, 500-1kbp, 1k-10kbp, 10k-20kbp, 20k-100kbp, and 100k-1Mbp
  - 6 genomic contexts: MAPPABLE, NONUNIQUE, SEGDUP*, Alu**, L1HS**, TR** (*variants up to 20kbp only; **DEL only)
  - 3 platforms: Illumina, PacBio, ONT
  - 5 coverages: 0.5x, 5x, 10x, 15x, 30x
- Tier 2 (Precision-focused) - 40 genomes designed to evaluate simple SV precision in the presence of complex variants
  - 4 complex SV types: dDUP, nrTRA, INV_dDUP, INV_nrTRA
  - 5 size ranges: 50-150bp, 150-500bp, 500-1kbp, 1k-10kbp, 10k-20kbp
  - 2 dispersion distances: 10-50kbp, 1-10Mbp
  - 3 platforms: Illumina, PacBio, ONT
  - 1 coverage: 30x
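The genome counts follow from the design grid. The sketch below enumerates one breakdown that reproduces the stated totals. The Tier 2 grid (type x size x dispersion distance) comes directly from the list above. The Tier 1 breakdown is an assumption: it treats Alu, L1HS, and TR as contributing a single DEL genome each and leaves their size range unspecified. Function names are hypothetical.

```python
from itertools import product

SV_TYPES = ["DEL", "DUP", "INS", "INV"]
SIZES = ["50-150bp", "150-500bp", "500-1kbp", "1k-10kbp",
         "10k-20kbp", "20k-100kbp", "100k-1Mbp"]
SIZES_UP_TO_20K = SIZES[:5]  # SEGDUP and Tier 2 stop at 20 kbp


def tier1_genomes():
    """Enumerate (context, sv_type, size) combinations for Tier 1."""
    genomes = []
    for context in ("MAPPABLE", "NONUNIQUE"):
        genomes += [(context, sv, size) for sv, size in product(SV_TYPES, SIZES)]
    genomes += [("SEGDUP", sv, size)
                for sv, size in product(SV_TYPES, SIZES_UP_TO_20K)]
    # ASSUMPTION: one DEL-only genome per repeat context; size not specified.
    genomes += [(context, "DEL", None) for context in ("Alu", "L1HS", "TR")]
    return genomes


def tier2_genomes():
    """Enumerate (complex_type, size, dispersion_distance) for Tier 2."""
    complex_types = ["dDUP", "nrTRA", "INV_dDUP", "INV_nrTRA"]
    distances = ["10-50kbp", "1-10Mbp"]
    return list(product(complex_types, SIZES_UP_TO_20K, distances))


print(len(tier1_genomes()), len(tier2_genomes()))  # 79 40
```

Platforms and coverages are not multiplied in here because, as described in the file table, each genome carries alignments for all platforms.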
Each genome is generated by simulating specific SVs and performing context-aware SV placement into the hg38 reference genome using insilicoSV, followed by platform-specific read simulation, alignment, and assembly. The overall workflow is illustrated below:
Each simulated genome includes:
| Data type | Description | File name | Size |
|---|---|---|---|
| Genome sequence | Modified hg38 reference genome with embedded SVs | sim.hap[AB].fa | 100M-1.6G |
| SVs | Truth set of simulated variants | sim.sorted.vcf.gz | 19k-146M |
| Illumina reads | Short-read alignment (0.5x-30x) | sim.ilmn.minimap2.sorted.bam | 2.6G-36G (30x) |
| HiFi reads | HiFi alignment (0.5x-30x) | sim.hifi.minimap2.sorted.bam | 2.7G-35G (30x) |
| ONT reads | ONT alignment (0.5x-30x) | sim.ont.minimap2.sorted.bam | 2.5G-33G (30x) |
| HiFi de novo Assembly | Assembly from HiFi reads (30× or 15×) | sim.hifi.pbsim.hifiasm.bp.hap[12].p_ctg.fa.gz | 35M-486M |
| ONT de novo Assembly | Assembly from ONT reads (30× or 15×) | sim.ont.pbsim.hifiasm.bp.hap[12].p_ctg.fa.gz | 38M-503M |
| Metadata | insilicoSV design parameters | *.yaml | 290-363 |
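As a quick sanity check on a downloaded truth set, the SV types in `sim.sorted.vcf.gz` can be tallied with the standard library alone. This is a minimal sketch that assumes each record carries an `SVTYPE=` entry in the INFO column, as is conventional for structural variant VCFs; the function names and the toy records are hypothetical, not taken from VARium.

```python
import gzip
from collections import Counter


def count_sv_types(lines):
    """Tally SVTYPE values from an iterable of VCF text lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # malformed record: no INFO column
        for entry in fields[7].split(";"):
            if entry.startswith("SVTYPE="):
                counts[entry.split("=", 1)[1]] += 1
                break
    return counts


def count_sv_types_in_file(path):
    """Same tally, reading a bgzip/gzip-compressed VCF from disk."""
    with gzip.open(path, "rt") as handle:
        return count_sv_types(handle)


# Toy records for illustration only.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t10000\tsv1\tN\t<DEL>\t.\tPASS\tEND=10200;SVTYPE=DEL;SVLEN=-200\n",
    "chr1\t50000\tsv2\tN\t<INV>\t.\tPASS\tEND=50400;SVTYPE=INV;SVLEN=400\n",
    "chr2\t70000\tsv3\tN\t<DEL>\t.\tPASS\tEND=70300;SVTYPE=DEL;SVLEN=-300\n",
]
print(count_sv_types(example))
```

For a real file, `count_sv_types_in_file("sim.sorted.vcf.gz")` would return the same kind of tally.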
In Tier 1, SVs are placed within predefined genomic contexts to reflect real-world mapping challenges. insilicoSV configuration files used for SV simulation and placement are provided in the workflows/variants folder. Genomic contexts are defined as follows.
- MAPPABLE
Regions outside RepeatMasker (RMSK) annotations, where mapping is relatively easy. In this regime, no SV breakpoint is allowed to overlap RMSK intervals.
- NONUNIQUE
Regions from the GIAB mappability stratification group GRCh38_nonunique_l250_m0_e0.bed.gz. Intervals ≥150bp were retained to increase contextual difficulty. SVs are placed such that at least one breakpoint overlaps a NONUNIQUE interval.
- SEGDUP
Regions from GIAB Segmental Duplications stratification group GRCh38_segdups.bed.gz. SVs are placed such that all breakpoints are fully contained within a single segmental duplication interval.
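The three placement rules above can be written as simple interval predicates. The sketch below is illustrative only and is not insilicoSV's implementation. The function names, the single-chromosome half-open `(start, end)` interval representation, and the treatment of a breakpoint as a single coordinate are all assumptions.

```python
def overlaps(pos, intervals):
    """True if position falls inside any half-open (start, end) interval."""
    return any(start <= pos < end for start, end in intervals)


def satisfies_mappable(breakpoints, rmsk_intervals):
    """MAPPABLE: no breakpoint may overlap a RepeatMasker interval."""
    return not any(overlaps(bp, rmsk_intervals) for bp in breakpoints)


def satisfies_nonunique(breakpoints, nonunique_intervals, min_len=150):
    """NONUNIQUE: at least one breakpoint overlaps a retained interval.

    Only intervals of at least min_len bp are retained, as described above.
    """
    kept = [(s, e) for s, e in nonunique_intervals if e - s >= min_len]
    return any(overlaps(bp, kept) for bp in breakpoints)


def satisfies_segdup(breakpoints, segdup_intervals):
    """SEGDUP: all breakpoints lie within one and the same interval."""
    return any(all(s <= bp < e for bp in breakpoints)
               for s, e in segdup_intervals)


# Toy coordinates for illustration only.
rmsk = [(1000, 2000)]
nonunique = [(500, 560), (3000, 3400)]      # first interval is < 150 bp
segdups = [(10000, 20000), (30000, 40000)]

print(satisfies_mappable((100, 400), rmsk))        # True
print(satisfies_nonunique((520, 800), nonunique))  # False: short interval dropped
print(satisfies_segdup((15000, 35000), segdups))   # False: spans two intervals
```

Note the asymmetry between the rules: MAPPABLE is an exclusion over all breakpoints, NONUNIQUE needs only one breakpoint to hit, and SEGDUP requires every breakpoint to share a single interval.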
Example:
A 150-500bp homozygous deletion placed in different genomic contexts across three sequencing platforms:

SV size distributions for Tier 1 and Tier 2 are shown below:
Our simulated reads aim to approximate empirical sequencing characteristics.
Long-read datasets
Reads were simulated using pbsim3, sampling from the following publicly available HiFi and ONT datasets:
- HiFi: AJtrio_PacBio_CCS_15kb_20kb_chemistry2_02112020 (https://sra-pub-src-2.s3.amazonaws.com/SRR10382244/m64011_190901_095311.fastq.1)
- ONT: R10 hac 5khz (https://42basepairs.com/download/s3/ont-open-data/giab_2025.01/basecalling/hac/HG002/PAW70337/calls.sorted.bam)
Read length comparison between real data (used for sampling) and simulated data at different coverages:

Illumina datasets
Fragment characteristics were modeled using Broad Clinical Labs sequencing runs on NovaSeq X:
- Read length: 151 bp
- Mean fragment length: 440 bp
- Fragment standard deviation: 200 bp
- Error rate: 0.001 (≈ Q30)
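The relationships between these parameters are simple arithmetic. The sketch below shows how an error rate of 0.001 maps to Q30, how much unsequenced insert remains between the two reads of a pair, and how many read pairs a target coverage implies. The helper names and the 1 Mbp toy sequence length are assumptions for illustration; they are not taken from the VARium scripts.

```python
import math


def phred_quality(error_rate):
    """Phred score Q = -10 * log10(p) for a per-base error probability p."""
    return -10 * math.log10(error_rate)


def inner_distance(mean_fragment_length, read_length):
    """Unsequenced bases between the two reads of a pair (negative = overlap)."""
    return mean_fragment_length - 2 * read_length


def read_pairs_for_coverage(sequence_length, coverage, read_length=151):
    """Read pairs needed so that 2 * read_length * pairs / length >= coverage."""
    return math.ceil(sequence_length * coverage / (2 * read_length))


print(round(phred_quality(0.001)))   # 30
print(inner_distance(440, 151))      # 138
# Toy example: 15x over a 1 Mbp sequence (one haplotype's share of 30x).
print(read_pairs_for_coverage(1_000_000, 15))
```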
All resources required to reproduce the simulated reads, alignments, and de novo assemblies included in VARium are provided in the workflows/ folder.
Reads were simulated using scripts in:
reads/
├── dwgsim_ilmn.sh # Illumina short-read simulation
├── pbsim_hifi.sh # PacBio HiFi simulation
└── pbsim_ont.sh # ONT simulation
All simulators use seed=100 to ensure reproducibility. Each script:
- Takes the modified reference genome generated by insilicoSV as input
- Simulates reads at 15x per haplotype to achieve overall 30x coverage
- Produces FASTQ files for downstream alignment or de novo assembly
Reads are processed using scripts in:
alignment_and_assembly/
├── process_hifi.sh # Alignment (minimap2), downsampling (rasusa), assembly (hifiasm)
├── process_ilmn.sh # Alignment (minimap2), downsampling (rasusa)
└── process_ont.sh # Alignment (minimap2), downsampling (rasusa), assembly (hifiasm)
- minimap2 presets are set per data type (HiFi, ONT, Illumina).
- rasusa performs coverage downsampling using seed=100.
- De novo assemblies are generated with hifiasm using 15× and 30× HiFi and ONT datasets.
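To make the role of the fixed seed concrete, the sketch below computes the fraction of a 30x dataset to keep for each lower coverage tier and shows that a seeded random subsample is repeatable. This is a simplified illustration of seeded downsampling in general, not rasusa's actual algorithm; the function names and the toy read IDs are hypothetical.

```python
import random


def downsample_fraction(target_coverage, source_coverage=30.0):
    """Fraction of reads to keep to go from source to target coverage."""
    if not 0 < target_coverage <= source_coverage:
        raise ValueError("target must be in (0, source_coverage]")
    return target_coverage / source_coverage


def subsample_reads(read_ids, fraction, seed=100):
    """Keep each read with probability `fraction`, using a fixed seed."""
    rng = random.Random(seed)
    return [read_id for read_id in read_ids if rng.random() < fraction]


for target in (0.5, 5, 10, 15):
    print(target, round(downsample_fraction(target), 4))

# Toy read IDs for illustration only.
reads = [f"read{i}" for i in range(1000)]
first = subsample_reads(reads, downsample_fraction(15))
second = subsample_reads(reads, downsample_fraction(15))
print(first == second)  # True: same seed, same subset
```

Because the subset depends only on the input order and the seed, rerunning the downsampling step on identical input yields identical reads, which is the property the fixed seed is meant to guarantee.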
To fully regenerate the VARium dataset:
- Select the simulated genome FASTA from the VARium Google bucket
- Run the appropriate script in reads/
- Process reads using the corresponding script in alignment_and_assembly/
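The three steps above pair one simulation script with one processing script per platform. The sketch below only records that pairing, using the script names from the folder listings; it does not run anything. Script arguments are not documented here, so none are shown, and the `SCRIPTS` table and `regeneration_plan` helper are hypothetical names.

```python
# Script paths are relative to the workflows/ folder, as listed above.
SCRIPTS = {
    "ilmn": ("reads/dwgsim_ilmn.sh", "alignment_and_assembly/process_ilmn.sh"),
    "hifi": ("reads/pbsim_hifi.sh", "alignment_and_assembly/process_hifi.sh"),
    "ont": ("reads/pbsim_ont.sh", "alignment_and_assembly/process_ont.sh"),
}


def regeneration_plan(platform):
    """Return the ordered steps to regenerate data for one platform."""
    if platform not in SCRIPTS:
        raise ValueError(f"unknown platform: {platform!r}")
    simulate, process = SCRIPTS[platform]
    return [
        "select the simulated genome FASTA from the VARium Google bucket",
        f"run {simulate}",
        f"run {process}",
    ]


for step in regeneration_plan("hifi"):
    print(step)
```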




