ryandkuster/guffaw

guffaw

Converts a pangenome graph (GFA format) into per-path/walk coordinate tables and/or a node presence-absence variation (PAV) matrix. For each path or walk in the graph, it can produce a TSV file mapping every node to its chromosome start and end position, and/or a CSV summarizing which haplotypes carry each node.

Overview

Input: a GFA file that defines:

  • Segment (S) lines — nodes with their DNA sequences
  • Path (P) lines — ordered lists of nodes representing a haplotype, used by tools like PGGB
  • Walk (W) lines — the same concept with explicit genomic start coordinates, used in newer minigraph-Cactus output

guffaw reads the node sequences to determine their lengths, then walks each path/walk to compute cumulative positions, and writes one TSV per path/walk. It can also produce a PAV matrix CSV recording which haplotypes contain each node.
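The two steps described above (measure each segment, then accumulate positions along a path) can be sketched roughly as follows. This is an illustrative sketch, not guffaw's actual source: the function names `node_lengths` and `path_coords` are hypothetical, and it assumes every S line carries its full sequence (no `*` placeholder) and that positions are 0-based and end-exclusive, as in the example output further down.

```rust
use std::collections::HashMap;

/// Collect node lengths from GFA segment lines: `S <name> <sequence> [tags]`.
/// Assumption: the sequence field holds real bases, so its length is the node length.
fn node_lengths(gfa: &str) -> HashMap<String, usize> {
    let mut lengths = HashMap::new();
    for line in gfa.lines() {
        let mut fields = line.split('\t');
        if fields.next() != Some("S") {
            continue; // not a segment line
        }
        if let (Some(name), Some(seq)) = (fields.next(), fields.next()) {
            lengths.insert(name.to_string(), seq.len());
        }
    }
    lengths
}

/// Walk the comma-separated segment list of a P line (e.g. "5+,6+,7-") and
/// return (start, end, node) rows with cumulative positions.
/// Returns None if a step names a node that has no recorded length.
fn path_coords(
    steps: &str,
    lengths: &HashMap<String, usize>,
) -> Option<Vec<(usize, usize, String)>> {
    let mut rows = Vec::new();
    let mut pos = 0;
    for step in steps.split(',') {
        // Drop the trailing orientation sign to get the node name.
        let node = step.trim_end_matches(|c: char| c == '+' || c == '-');
        let len = *lengths.get(node)?;
        rows.push((pos, pos + len, node.to_string()));
        pos += len;
    }
    Some(rows)
}
```

With nodes 5, 6, and 7 of lengths 25, 2, and 2, the segment list `5+,6+,7-` yields the intervals 0-25, 25-27, and 27-29.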

Note: The GFA file must list all S (segment) lines before any P (path) or W (walk) lines. guffaw exits with an error if an S line appears after a P or W line.
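To check a file before running guffaw, a small ordering test along these lines could be used. It is a sketch under assumptions: the helper name `check_segment_order` is hypothetical, and how guffaw itself detects the problem is not documented here.

```rust
/// Return Err with the 1-based line number of the first S line that appears
/// after a P or W line; return Ok(()) if all segment lines come first.
fn check_segment_order(gfa: &str) -> Result<(), usize> {
    let mut seen_path_or_walk = false;
    for (i, line) in gfa.lines().enumerate() {
        // The record type is the first tab-separated field.
        match line.split('\t').next() {
            Some("P") | Some("W") => seen_path_or_walk = true,
            Some("S") if seen_path_or_walk => return Err(i + 1),
            _ => {}
        }
    }
    Ok(())
}
```

Other record types (H, L, and so on) are ignored, since the note above only constrains S lines relative to P and W lines.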

Installation

Requires Rust.

git clone <repo>
cd guffaw
cargo build --release
# binary is at target/release/guffaw

Usage

guffaw -g <GFA> [-o <OUTDIR>] [-c <CSV>] [-w] [-t <THREADS>]
Flag  Long       Value   Required               Description
-g    --gfa      GFA     Yes                    Input GFA file
-o    --coords   OUTDIR  At least one of -o/-c  Output directory for per-path/walk coordinate TSVs (must already exist)
-c    --core     CSV     At least one of -o/-c  Output CSV path for node presence-absence matrix
-w    --walks    flag    No                     Input uses W (walk) lines instead of P (path) lines
-t    --threads  int     No                     Number of threads (default: 1, 0 = all available CPUs)
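Two of the rules in this table, "0 = all available CPUs" and "at least one of -o/-c", can be illustrated with a short sketch. The helper names `resolve_threads` and `has_output` are hypothetical and are not taken from guffaw's source. The fallback to one thread when the CPU count cannot be queried is also an assumption.

```rust
use std::thread;

/// Resolve a -t value: 0 means "use every available CPU".
/// Assumption: fall back to 1 thread if the CPU count cannot be determined.
fn resolve_threads(requested: usize) -> usize {
    if requested == 0 {
        thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1)
    } else {
        requested
    }
}

/// At least one of -o (coordinate directory) or -c (PAV CSV) must be given.
fn has_output(coords: Option<&str>, core: Option<&str>) -> bool {
    coords.is_some() || core.is_some()
}
```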

Output

One TSV file per path or walk, named after the path/walk identifier.

Files for paths are named <sample>#<haplotype>#<chromosome>.tsv:

chr         start   end     node
Scaffold_01 0       25      5
Scaffold_01 25      27      6
Scaffold_01 27      29      7
Scaffold_01 29      69      8
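As a rough illustration of this layout, the sketch below renders coordinate rows for one path. It assumes the columns are tab-separated and that the chr column is the last `#`-delimited field of the path name. The sample above shows neither detail explicitly, and `coords_tsv` is a hypothetical name.

```rust
/// Render (start, end, node) rows as TSV text with a header line.
/// Assumption: `path_name` looks like "sample#haplotype#chromosome" and the
/// chr column is its final '#'-separated field.
fn coords_tsv(path_name: &str, rows: &[(usize, usize, String)]) -> String {
    // rsplit always yields at least one piece, so the fallback is only a safeguard.
    let chr = path_name.rsplit('#').next().unwrap_or(path_name);
    let mut out = String::from("chr\tstart\tend\tnode\n");
    for (start, end, node) in rows {
        out.push_str(&format!("{}\t{}\t{}\t{}\n", chr, start, end, node));
    }
    out
}
```

Under the naming rule above, the text would be written to a file called `<path_name>.tsv` inside the directory passed to -o.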

Files for walks are named <sample>#<haplotype>#<chromosome>_<start>-<end>.tsv, with positions offset by the walk's genomic start coordinate:

chr         start    end      node
Scaffold_01 9884149  9884174  5
Scaffold_01 9884174  9884176  6
Scaffold_01 9884176  9884178  7
Scaffold_01 9884178  9884218  8
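The offset logic for walks can be sketched as below. This assumes the GFA 1.1 walk layout `W <sample> <hap_index> <seq_id> <seq_start> <seq_end> <walk>`, with steps written as `>5>6<7`. It also assumes the `<start>` and `<end>` in the file name come straight from the W line. The function name `walk_coords` is hypothetical, and a `*` start coordinate is simply rejected.

```rust
use std::collections::HashMap;

/// Parse one walk line and return the output file name plus
/// (chr, start, end, node) rows offset by the walk's start coordinate.
fn walk_coords(
    line: &str,
    lengths: &HashMap<String, usize>,
) -> Option<(String, Vec<(String, usize, usize, String)>)> {
    let f: Vec<&str> = line.split('\t').collect();
    if f.len() < 7 || f[0] != "W" {
        return None;
    }
    // Positions begin at the walk's genomic start rather than at 0.
    let start: usize = f[4].parse().ok()?;
    let file = format!("{}#{}#{}_{}-{}.tsv", f[1], f[2], f[3], f[4], f[5]);
    let mut rows = Vec::new();
    let mut pos = start;
    // Split on the orientation markers; the piece before the first marker is empty.
    for node in f[6]
        .split(|c: char| c == '>' || c == '<')
        .filter(|s| !s.is_empty())
    {
        let len = *lengths.get(node)?;
        rows.push((f[3].to_string(), pos, pos + len, node.to_string()));
        pos += len;
    }
    Some((file, rows))
}
```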

PAV matrix (--core)

A single CSV with one row per node and one column per haplotype, followed by node and length columns. Haplotype values are 0 (absent) or 1 (present). Haplotype names are derived from path/walk identifiers in the form <sample>_<haplotype>.

hap1_0,hap2_0,hap3_1,node,length
1,0,1,node5,250
0,1,1,node6,27
...
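A matrix of this shape could be assembled as in the following sketch. The names `hap_label` and `pav_csv` are hypothetical, the sorted row and column order is an assumption made for reproducibility, and node names are written as they appear in the graph rather than with the `node` prefix shown in the sample above.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Derive "sample_haplotype" from a "sample#haplotype#chromosome" identifier.
fn hap_label(path_name: &str) -> Option<String> {
    let mut parts = path_name.split('#');
    let sample = parts.next()?;
    let hap = parts.next()?;
    Some(format!("{}_{}", sample, hap))
}

/// Build PAV CSV text: one 0/1 column per haplotype, then `node` and `length`.
/// `haps` maps a haplotype label to the set of nodes its paths/walks visit.
fn pav_csv(
    haps: &BTreeMap<String, BTreeSet<String>>,
    lengths: &BTreeMap<String, usize>,
) -> String {
    let names: Vec<&str> = haps.keys().map(|s| s.as_str()).collect();
    let mut out = names.join(",");
    out.push_str(",node,length\n");
    for (node, len) in lengths {
        // One presence flag per haplotype, in the same order as the header.
        for nodes in haps.values() {
            out.push(if nodes.contains(node) { '1' } else { '0' });
            out.push(',');
        }
        out.push_str(&format!("{},{}\n", node, len));
    }
    out
}
```

Collecting nodes per haplotype label, rather than per path, means that several chromosomes belonging to the same sample and haplotype share a single column.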

Examples

Paths GFA, coordinate TSVs only, 4 threads:

guffaw -g pangenome.gfa -o output/ -t 4

Walks GFA, coordinate TSVs only, all available CPUs:

guffaw -g pangenome.gfa -o output/ -w -t 0

Paths GFA, PAV matrix only:

guffaw -g pangenome.gfa -c pav_matrix.csv

Walks GFA, both coordinate TSVs and PAV matrix, 8 threads:

guffaw -g pangenome.gfa -o output/ -c pav_matrix.csv -w -t 8

Running tests

cargo test
