feat: split output tool into two #39
Merged
11 commits
f8162ba chore: remove old create_fasta_and_index tool (ameynert)
f9c9b55 feat: add haplotype_coordinates function (ameynert)
ee67991 feat: split create_fasta_and_index into two tools (ameynert)
8199960 feat: add new tools to workflow (ameynert)
f1c0389 feat: use unify and specify null (ameynert)
96f658b chore: move pyarrow into pyproject.toml (ameynert)
9e453ec feat: consolidate frequency filters (ameynert)
659ee6e fix: add contig to database, fix schema overrides for population AFs (ameynert)
1b63371 chore: fix up docstring in config (ameynert)
a1c8bda fix: argmax calculation (ameynert)
8abb405 feat: fix haplotype coordinates (ameynert)
@@ -0,0 +1,51 @@
+"""Tool to write per-chromosome FASTA files from a DivRef DuckDB index."""
+
+import logging
+from pathlib import Path
+
+import duckdb
+import polars
+from fgpyo.io import assert_path_is_readable
+from fgpyo.io import assert_path_is_writable
+
+logger = logging.getLogger(__name__)
+
+
+def _write_fasta_files(df: polars.DataFrame, output_base: Path) -> None:
+    """
+    Write one FASTA file per chromosome to {output_base}.{chrom}.fasta.
+
+    Args:
+        df: DataFrame with sequence_id, sequence, and variants columns.
+        output_base: Base path; chromosome name is appended as a suffix.
+    """
+    for chrom in sorted(df["contig"].unique().to_list()):
+        logger.info("Creating FASTA for chromosome %s", chrom)
+        df_chrom = df.filter(df["contig"] == chrom)
+        out_path = Path(f"{output_base}.{chrom}.fasta")
+        with open(out_path, "w") as fasta_out:
+            for sequence_id, sequence in df_chrom.select("sequence_id", "sequence").iter_rows():
+                fasta_out.write(f">{sequence_id}\n{sequence}\n")
+
+
+def create_divref_fasta(
+    *,
+    duckdb_path: Path,
+    output_base: Path,
+) -> None:
+    """
+    Write per-chromosome FASTA files from a DivRef DuckDB index.
+
+    Reads sequence_id, sequence, and variants from the sequences table and writes one FASTA
+    file per chromosome to {output_base}.{chrom}.fasta.
+
+    Args:
+        duckdb_path: Path to an existing DivRef DuckDB index.
+        output_base: Base path for output FASTA files; chromosome name is appended as a suffix.
+    """
+    assert_path_is_readable(duckdb_path)
+    assert_path_is_writable(output_base)
+    con = duckdb.connect(str(duckdb_path), read_only=True)
+    df = con.execute("SELECT sequence_id, sequence, contig FROM sequences").pl()
+    con.close()
+    _write_fasta_files(df, output_base)
Don’t skip configured contigs that end up with zero sequences.
This loop only writes FASTAs for contigs present in df["contig"], but workflows/generate_divref.smk (lines 271-274) declares one output per configured chromosome. If a chromosome is filtered down to zero rows, no file is created for it and the workflow fails on missing outputs. Pass the expected contig list into this tool and emit empty FASTAs for absent contigs.
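A minimal sketch of what this suggestion could look like, not the change that was merged. To stay self-contained it takes plain `(sequence_id, sequence, contig)` tuples instead of a polars DataFrame. The function name `write_fasta_files_for_contigs` and the `expected_contigs` parameter are hypothetical and do not appear in the PR.

```python
from collections import defaultdict
from pathlib import Path
from typing import Iterable


def write_fasta_files_for_contigs(
    rows: Iterable[tuple[str, str, str]],
    output_base: Path,
    expected_contigs: list[str],
) -> list[Path]:
    """
    Write one FASTA per expected contig to {output_base}.{chrom}.fasta.

    Hypothetical variant of _write_fasta_files: contigs with no rows still get
    an (empty) file, so a workflow that declares one output per configured
    chromosome finds every file it expects.
    """
    # Group records by contig up front so each output file is opened once.
    by_contig: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for sequence_id, sequence, contig in rows:
        by_contig[contig].append((sequence_id, sequence))

    written: list[Path] = []
    # Iterate over the expected list rather than the contigs seen in the data.
    # Assumption: rows on contigs outside expected_contigs are simply not written.
    for chrom in expected_contigs:
        out_path = Path(f"{output_base}.{chrom}.fasta")
        with open(out_path, "w") as fasta_out:
            # .get avoids inserting a key; an absent contig yields an empty file.
            for sequence_id, sequence in by_contig.get(chrom, []):
                fasta_out.write(f">{sequence_id}\n{sequence}\n")
        written.append(out_path)
    return written
```

Driving the loop from the configured contig list keeps the set of output files independent of what survives filtering, which is what the workflow's declared outputs assume.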