mohimanilab/DistributionSensitiveBucketing

Distribution Sensitive Bucketing (DSB)

Author(s): Chengze Shen, Mihir Mongia, Arash Gholami Davoodi, Guillaume Marcais, Hosein Mohimani

What It Does

DSB is a C++ program that finds overlaps among sequences and aligns queries to a given genome. It takes two input files in FASTA format: a reference and a query. The reference can be either a set of reads or a reference genome. For each query sequence, the program outputs the reference sequences that overlap with it. When the query and reference are the same file, the program discards self-overlaps.

The goal is to find as many true overlapping sequences and alignments as possible while minimizing false positives.

Requirements

C++

  • Linux: We tested our program on Ubuntu 18.04 with g++ 7.5.0 and above and ISO standard -std=c++11.
  • macOS: We also tested our program on macOS 10.14.6 with g++ 4.2.1 and above, ISO standard -std=c++11 and Apple LLVM 10.0.1.

Python (3.7.5 and above)

Currently, we use a Python script to process the raw output files from our program. Please refer to Example 1 for the actual usage. In a future update, we plan to integrate this functionality into the main C++ program.

Installation

  1. Make sure the correct versions of the C++ compiler and Python are installed.
  2. By default, run make to build everything (DSBMain, DataGeneration). The build should take only a few seconds.
  3. The two binary executables can also be built separately.
    • To build DSBMain (for DSB), run make main.
    • To build DataGeneration (for generating simulation data), run make gen.
  4. To remove the installation, run make clean. This removes everything generated after your initial download, except data files you generated using DataGeneration.

How To Run

DSBMain

./DSBMain -q [query file] -r [reference file] -i [insertion rate] -d [deletion rate] -m [mutation rate] \
          -a [add threshold] -k [kill threshold] -o [name] -A [alignment threshold] -K [kmer threshold] \
          -M [min map length] -L [kmer filter threshold] {-vh}

-h              Print this block of information.
-v              Verbose mode.
-i [0<=i<1]     Insertion rate.
-d [0<=d<1]     Deletion rate.
-m [0<=m<0.5]   Mutation rate when neither insertion/deletion happens.
-a [a>0]        Threshold for a node to be considered a bucket.
-k [k>0]        Threshold for a node to be pruned.
-A [0<=A<=1]    Threshold for filtering maps with global alignment (default: 0.7).
-K [0<=K<=1]    Threshold for filtering maps with percentage shared kmers (default: 0.25).
-M [M>0]        Minimum reported map length (default: 250bp).
-L [L>M]        Lower bound map length to start using shared kmer for filtering (default: 1000bp).
-b [path]       If specified, DSB will use the given buckets file produced from a previously curated run.
-s [path]       If specified, DSB will save the buckets from this run to [path] and terminate early.

Other settings:
-v              Verbose mode.
-o [name]       Specify output file name.
-q [path/to/q]  Path to the query file.
-r [path/to/r]  Path to the reference file.

To view the full help message on the command line, run ./DSBMain -h.
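
The documented ranges can be checked before the binary is launched. The following is a minimal Python sketch, not part of DSB itself: the function name build_dsb_command and the RATE_LIMITS table are hypothetical, and only the flags and ranges listed in the help text above are used. It returns an argument list suitable for subprocess.run.

```python
# Hypothetical helper, not shipped with DSB. Flags and ranges are taken
# from the help text; every identifier below is an illustrative assumption.

# Each rate must satisfy 0 <= rate < limit.
RATE_LIMITS = {"-i": 1.0, "-d": 1.0, "-m": 0.5}


def build_dsb_command(query, reference, i, d, m, a, k,
                      output=None, binary="./DSBMain"):
    """Validate arguments against the documented ranges and return argv."""
    rates = {"-i": i, "-d": d, "-m": m}
    for flag, value in rates.items():
        limit = RATE_LIMITS[flag]
        if not 0 <= value < limit:
            raise ValueError(
                f"{flag} must satisfy 0 <= value < {limit}, got {value}")
    # The add (-a) and kill (-k) thresholds must be strictly positive.
    for flag, value in (("-a", a), ("-k", k)):
        if value <= 0:
            raise ValueError(f"{flag} must be positive, got {value}")

    argv = [binary, "-q", query, "-r", reference]
    for flag, value in rates.items():
        argv += [flag, str(value)]
    argv += ["-a", str(a), "-k", str(k)]
    if output is not None:
        argv += ["-o", output]
    return argv


if __name__ == "__main__":
    cmd = build_dsb_command(
        "data/pacbio_reads_5000.fasta", "data/ecoli_genome_full.fasta",
        i=0.12, d=0.02, m=0.01, a=25000, k=250000000)
    print(" ".join(cmd))
```

Returning a list rather than a single string avoids shell quoting issues when file paths contain spaces.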

Example 1

Using the query data/pacbio_reads_5000.fasta and the reference data/ecoli_genome_full.fasta, we set a 12% insertion rate, a 2% deletion rate, and a 1% mutation rate for the PacBio sequencing data:

./DSBMain -q data/pacbio_reads_5000.fasta -r data/ecoli_genome_full.fasta -i 0.12 -d 0.02 -m 0.01 -a 25000 -k 250000000

If no output name is specified, the program writes the results to a default file named output.txt, in which each line represents a mapped region. The columns are:

query index, target index, query map start, query map end, target map start, target map end, identity (alignment or shared kmer), e-value (if alignment), filter type

The expected output file from this example is included at output/example_1_output.txt.
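
To work with the mapped regions programmatically, each line can be labeled by column position. The sketch below is an assumption-laden illustration rather than the project's own post-processing script: the field separator is not stated here, so the code accepts commas and/or whitespace, and the column identifiers are hypothetical names derived from the list above. If the e-value column is omitted for rows filtered by shared kmers, the positional labels after identity would shift, so check the result against output/example_1_output.txt before relying on it.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical identifiers, in the column order documented above.
COLUMNS = [
    "query_index", "target_index",
    "query_start", "query_end",
    "target_start", "target_end",
    "identity", "e_value", "filter_type",
]


def parse_output(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Label the fields of each non-empty line by position.

    Assumption: fields are separated by commas and/or whitespace.
    Values are kept as strings; lines with fewer than nine fields keep
    only the leading columns that are present.
    """
    records = []
    for line in lines:
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if not fields:
            continue  # skip blank lines
        records.append(dict(zip(COLUMNS, fields)))
    return records
```

A file object can be passed directly, for example parse_output(open("output.txt")), since iterating over a file yields its lines.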

DataGeneration

./DataGeneration -i [insertion rate] -d [deletion rate] -m [mutation rate] -n [number of sequences] -s [initial length of a sequence] -p [path] -vh

-h              Print this block of information.
-v              Verbose mode.
-i [0<=i<1]     Insertion rate.
-d [0<=d<1]     Deletion rate.
-m [0<=m<0.5]   Mutation rate when neither insertion/deletion happens.
-n [n>0]        Number of sequence pairs to generate.
-s [s>0]        Initial length of a sequence pair.
-p [path]       Where generated data will be written to. Default: data/

To view the full help message on the command line, run ./DataGeneration -h.

Example 2

We generate 1000 sequence pairs (each pair contains a reference r and a query q) with insertion, deletion, and mutation rates of 5% each and a length of approximately 100 bp using the following command:

./DataGeneration -i 0.05 -d 0.05 -m 0.05 -n 1000 -s 100 -p data

The command generates two files under the data/ directory, namely:

  • q_1000_100_0.05_0.05.fasta
  • r_1000_100_0.05_0.05.fasta

where q_1000_100_0.05_0.05.fasta corresponds to the queries in the sequence pairs and r_1000_100_0.05_0.05.fasta corresponds to the references. The expected output files from this example are included in the data/ directory.
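
For intuition about how a query can differ from its reference under the three rates, here is a small Python sketch of one possible per-base noise model. It is an assumption for illustration only and is not the DataGeneration implementation: the function name mutate, the single draw per base, and the uniform choice of inserted or substituted bases are all hypothetical. It follows the help text in applying the mutation rate only when neither an insertion nor a deletion happens.

```python
import random

BASES = "ACGT"


def mutate(reference: str, ins: float, dele: float, mut: float,
           rng: random.Random) -> str:
    """Derive a noisy query from a reference under a simple per-base model.

    For each reference base, one uniform draw decides between:
      - insertion (probability ins): emit a random base, then the original;
      - deletion (probability dele): emit nothing;
      - otherwise: substitute with probability mut, else copy the base.
    This model is an illustrative assumption, not DSB's generator.
    """
    if ins < 0 or dele < 0 or ins + dele > 1 or not 0 <= mut <= 1:
        raise ValueError("rates must be probabilities with ins + dele <= 1")
    out = []
    for base in reference:
        u = rng.random()
        if u < ins:
            out.append(rng.choice(BASES))
            out.append(base)
        elif u < ins + dele:
            continue  # base deleted
        elif rng.random() < mut:
            # Substitute with a base different from the original.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    reference = "".join(rng.choice(BASES) for _ in range(100))
    query = mutate(reference, ins=0.05, dele=0.05, mut=0.05, rng=rng)
    print(len(reference), len(query))
```

Under this model the expected query length is the reference length times (1 + ins - dele), so equal insertion and deletion rates leave the length roughly unchanged on average.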
