Converts a pangenome graph (GFA format) into per-path/walk coordinate tables and/or a node presence-absence variation (PAV) matrix. For each path or walk in the graph, it can produce a TSV file mapping every node to its chromosome start and end position, and/or a CSV summarizing which haplotypes carry each node.
Input: a GFA file that defines:
- Segment (S) lines — nodes with their DNA sequences
- Path (P) lines — ordered lists of nodes representing a haplotype, used by tools like PGGB
- Walk (W) lines — the same concept with explicit genomic start coordinates, used in newer minigraph-Cactus output
guffaw reads the node sequences to determine their lengths, then walks each path/walk to compute cumulative positions, and writes one TSV per path/walk. It can also produce a PAV matrix CSV recording which haplotypes contain each node.
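The coordinate computation described above can be sketched roughly as follows. This is an illustrative sketch, not guffaw's actual source: the function names `segment_lengths` and `path_rows` are hypothetical, and it assumes GFA1-style S lines with the sequence in the third column and P-line steps written as comma-separated node names with a trailing `+` or `-`.

```rust
use std::collections::HashMap;

/// One row of a coordinate table: (start, end, node).
type Row = (u64, u64, String);

/// Collect node lengths from S lines (`S<TAB>name<TAB>sequence`).
/// Hypothetical helper; assumes the sequence is stored inline.
fn segment_lengths(gfa: &str) -> HashMap<String, u64> {
    let mut lengths = HashMap::new();
    for line in gfa.lines() {
        let fields: Vec<&str> = line.split('\t').collect();
        if fields.len() >= 3 && fields[0] == "S" {
            lengths.insert(fields[1].to_string(), fields[2].len() as u64);
        }
    }
    lengths
}

/// Walk the steps of a P line (e.g. "5+,6+,7-") and accumulate positions.
/// Returns None if a step has no orientation or names an unknown node.
fn path_rows(steps: &str, lengths: &HashMap<String, u64>) -> Option<Vec<Row>> {
    let mut pos = 0u64;
    let mut rows = Vec::new();
    for step in steps.split(',') {
        // Drop the single trailing orientation character.
        let node = step
            .strip_suffix('+')
            .or_else(|| step.strip_suffix('-'))?;
        let len = *lengths.get(node)?;
        rows.push((pos, pos + len, node.to_string()));
        pos += len;
    }
    Some(rows)
}
```

Each node's end position becomes the next node's start, so a path's final end position equals the summed length of its nodes.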
Note: The GFA file must contain all S (segment) lines before any P (path) or W (walk) lines. guffaw exits with an error if the lines are not in this order.
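To check a file against this ordering requirement before running guffaw, something like the sketch below may help. The function `first_misplaced_segment` is a hypothetical helper written for this README, not part of guffaw, and it only inspects the record type in the first tab-separated column.

```rust
/// Returns the 1-based line number of the first S line that appears after
/// a P or W line, or None if every S line precedes all P and W lines.
fn first_misplaced_segment(gfa: &str) -> Option<usize> {
    let mut seen_path_or_walk = false;
    for (i, line) in gfa.lines().enumerate() {
        match line.split('\t').next() {
            Some("P") | Some("W") => seen_path_or_walk = true,
            Some("S") if seen_path_or_walk => return Some(i + 1),
            _ => {} // headers, links, and other record types are ignored
        }
    }
    None
}
```

A return value of `None` means the file satisfies the ordering described in the note; `Some(n)` points at the first offending S line.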
Requires Rust.
git clone <repo>
cd guffaw
cargo build --release
# binary is at target/release/guffaw
guffaw -g <GFA> [-o <OUTDIR>] [-c <CSV>] [-w] [-t <THREADS>]
| Flag | Long | Value | Required | Description |
|---|---|---|---|---|
| -g | --gfa | GFA | Yes | Input GFA file |
| -o | --coords | OUTDIR | At least one of -o/-c | Output directory for per-path/walk coordinate TSVs (must already exist) |
| -c | --core | CSV | At least one of -o/-c | Output CSV path for node presence-absence matrix |
| -w | --walks | flag | No | Input uses W (walk) lines instead of P (path) lines |
| -t | --threads | int | No | Number of threads (default: 1, 0 = all available CPUs) |
One TSV file per path or walk, named after the path/walk identifier.
Paths are named <sample>#<haplotype>#<chromosome>.tsv:
chr start end node
Scaffold_01 0 25 5
Scaffold_01 25 27 6
Scaffold_01 27 29 7
Scaffold_01 29 69 8
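As a rough illustration of how a path identifier relates to the `chr` column and the table layout shown above, consider the sketch below. Both helpers (`split_path_name`, `render_tsv`) are hypothetical, and the tab-separated layout is an assumption based on the TSV file extension rather than a statement about guffaw's internals.

```rust
/// Split a path identifier of the form "sample#haplotype#chromosome".
/// Returns None if fewer than three '#'-separated parts are present.
fn split_path_name(name: &str) -> Option<(&str, &str, &str)> {
    let mut parts = name.splitn(3, '#');
    let sample = parts.next()?;
    let haplotype = parts.next()?;
    let chromosome = parts.next()?;
    Some((sample, haplotype, chromosome))
}

/// Render coordinate rows as tab-separated text with a header line.
/// Each row is (start, end, node).
fn render_tsv(chr: &str, rows: &[(u64, u64, &str)]) -> String {
    let mut out = String::from("chr\tstart\tend\tnode\n");
    for (start, end, node) in rows {
        out.push_str(&format!("{}\t{}\t{}\t{}\n", chr, start, end, node));
    }
    out
}
```

For a path named `sampleA#1#Scaffold_01` (a made-up identifier), the third part supplies the `chr` value written on every row.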
Walks are named <sample>#<haplotype>#<chromosome>_<start>-<end>.tsv, with positions offset by the walk's genomic start coordinate:
chr start end node
Scaffold_01 9884149 9884174 5
Scaffold_01 9884174 9884176 6
Scaffold_01 9884176 9884178 7
Scaffold_01 9884178 9884218 8
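The walk offset can be sketched as below. This is a hedged illustration, not guffaw's code: `walk_nodes` and `walk_table` are hypothetical names, the W-line column layout follows the GFA 1.1 convention (sample, haplotype index, sequence id, start, end, walk), and it is an assumption that the `<start>-<end>` part of the file name comes directly from the W line's start and end columns.

```rust
use std::collections::HashMap;

/// Split a walk string such as ">5>6<7" into node names, dropping the
/// orientation markers ('>' forward, '<' reverse).
fn walk_nodes(walk: &str) -> Vec<&str> {
    walk.split(|c: char| c == '>' || c == '<')
        .filter(|s| !s.is_empty())
        .collect()
}

/// Turn one W line into (file name, rows), where each row is
/// (start, end, node) offset by the walk's start coordinate.
/// Returns None on malformed input or unknown nodes.
fn walk_table(
    line: &str,
    lengths: &HashMap<String, u64>,
) -> Option<(String, Vec<(u64, u64, String)>)> {
    let f: Vec<&str> = line.split('\t').collect();
    if f.len() < 7 || f[0] != "W" {
        return None;
    }
    let start: u64 = f[4].parse().ok()?;
    let end: u64 = f[5].parse().ok()?;
    let file = format!("{}#{}#{}_{}-{}.tsv", f[1], f[2], f[3], start, end);

    // Positions begin at the walk's start instead of 0.
    let mut pos = start;
    let mut rows = Vec::new();
    for node in walk_nodes(f[6]) {
        let len = *lengths.get(node)?;
        rows.push((pos, pos + len, node.to_string()));
        pos += len;
    }
    Some((file, rows))
}
```

With node lengths 25, 2, 2 and 40 and a start of 9884149, this reproduces the positions in the example table, and the last end position (9884218) matches the walk's end coordinate.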
A single CSV with one row per node and one column per haplotype. Values are 0 (absent) or 1 (present). Haplotype names are derived from path/walk identifiers (sample_haplotype).
hap1_0,hap2_0,hap3_1,node,length
1,0,1,node5,250
0,1,1,node6,27
...
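A minimal sketch of how such a matrix could be assembled is shown below. The names `haplotype_name` and `pav_csv` are hypothetical, and the row and column order here (lexicographic, via `BTreeMap`) is a simplification of this sketch; guffaw's actual ordering may differ.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Derive a haplotype column name ("sample_haplotype") from a path
/// identifier of the form "sample#haplotype#chromosome".
fn haplotype_name(path_name: &str) -> Option<String> {
    let mut parts = path_name.split('#');
    Some(format!("{}_{}", parts.next()?, parts.next()?))
}

/// Build presence-absence CSV text.
/// `paths` pairs a haplotype name with the nodes one of its paths visits;
/// several paths may share a haplotype name and are merged into one column.
fn pav_csv(paths: &[(&str, Vec<&str>)], lengths: &BTreeMap<&str, u64>) -> String {
    // haplotype -> set of nodes it carries
    let mut carried: BTreeMap<&str, BTreeSet<&str>> = BTreeMap::new();
    for (hap, nodes) in paths {
        carried.entry(*hap).or_default().extend(nodes.iter().copied());
    }

    // Header: one column per haplotype, then node and length.
    let mut out = String::new();
    for hap in carried.keys() {
        out.push_str(hap);
        out.push(',');
    }
    out.push_str("node,length\n");

    // One row per node: 1 if the haplotype carries it, 0 otherwise.
    for (node, len) in lengths {
        for nodes in carried.values() {
            out.push_str(if nodes.contains(*node) { "1," } else { "0," });
        }
        out.push_str(&format!("{},{}\n", node, len));
    }
    out
}
```

Because a haplotype's nodes are collected into a set, a node visited several times by the same haplotype still yields a single `1`.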
Paths GFA, coordinate TSVs only, 4 threads:
guffaw -g pangenome.gfa -o output/ -t 4
Walks GFA, coordinate TSVs only, all available CPUs:
guffaw -g pangenome.gfa -o output/ -w -t 0
Paths GFA, PAV matrix only:
guffaw -g pangenome.gfa -c pav_matrix.csv
Walks GFA, both coordinate TSVs and PAV matrix, 8 threads:
guffaw -g pangenome.gfa -o output/ -c pav_matrix.csv -w -t 8
cargo test