Merged
49 commits
076eb17
feat: add genome_kit.df subpackage
declanyewlim Mar 17, 2026
62a806e
chore: remove unused import
declanyewlim Mar 17, 2026
3f16e0f
chore: remove unused arg and update docstring
declanyewlim Mar 17, 2026
7bef461
chore: remove unused arg and add extract helper
declanyewlim Mar 18, 2026
7b5130e
feat: add subpackage discovery and optional polars dependency
declanyewlim Mar 18, 2026
518e456
docs: add df subpackage docs
declanyewlim Mar 18, 2026
6de6f85
feat: add lazy polars import and ensure gk annotation lifetime
declanyewlim Mar 20, 2026
1d43334
fix: add fix for strenum import errors
declanyewlim Mar 20, 2026
0497eec
docs: update with_columns_seq comments and docstring
declanyewlim Mar 20, 2026
6f9c71d
feat: add optional polars configuration
declanyewlim Mar 24, 2026
a905a2b
chore: update syntax and change polars version
declanyewlim Mar 24, 2026
6f87c23
fix: correct python env version
declanyewlim Mar 24, 2026
fc4d0dd
fix: add null check for registry functions
declanyewlim Mar 24, 2026
5700485
fix: remove space in array
declanyewlim Mar 24, 2026
3939f67
fix: downgrade polars again
declanyewlim Mar 24, 2026
257504f
fix: downgrade polars version for conda-forge
declanyewlim Mar 25, 2026
cf4ff13
chore: support path or str
declanyewlim Mar 25, 2026
4b89f79
fix: add metadata check
declanyewlim Mar 25, 2026
57d5b74
chore: make chrom name more clear
declanyewlim Mar 25, 2026
17edfb2
fix: don't init reference genomes
declanyewlim Mar 25, 2026
cb6fd93
docs: update comments
declanyewlim Mar 25, 2026
eb125d4
test: add gkdf tests
declanyewlim Mar 25, 2026
0437b70
test: skip tests when no polars
declanyewlim Mar 25, 2026
7565646
chore: ruff formatting
declanyewlim Mar 25, 2026
42338a0
fix: make build and runner info consistent
declanyewlim Mar 26, 2026
77f6f3c
revert: revert github workflow
declanyewlim Mar 26, 2026
575b7c5
feat: add list of gk support
declanyewlim Mar 31, 2026
325f530
Merge branch 'main' into feat/gk-serialization
declanyewlim Mar 31, 2026
8e07c2a
ci: run gkdf tests
declanyewlim Mar 31, 2026
59cfb99
chore: change serde functions names
declanyewlim Mar 31, 2026
5eaf0c9
feat: add schema inference on multiple rows
declanyewlim Mar 31, 2026
37867e3
fix: fix type inference and add test
declanyewlim Mar 31, 2026
ef2cdab
feat: additional macos rosetta checks
declanyewlim Apr 1, 2026
96a81b3
ci: add additional test config with rosetta
declanyewlim Apr 1, 2026
9cd8f08
docs: update docs
declanyewlim Apr 1, 2026
8e7b240
docs: format and update docstrings
declanyewlim Apr 1, 2026
0516acf
respond to PR comments
declanyewlim Apr 1, 2026
e730a08
ci: add missing var for config
declanyewlim Apr 2, 2026
240606c
chore: simplify extras installation [skip ci]
declanyewlim Apr 7, 2026
bbf8962
misc: change struct from string to enum
declanyewlim Apr 7, 2026
8d03a0c
feat: add gk version check
declanyewlim Apr 7, 2026
13168b6
fix: address PR comments
declanyewlim Apr 15, 2026
9f1d32c
chore: remove unused variable
declanyewlim Apr 16, 2026
6102e29
fix: add missing struct value
declanyewlim Apr 16, 2026
92b1113
fix: change expr for lists
declanyewlim Apr 16, 2026
da54475
chore: use auto for enums
declanyewlim Apr 21, 2026
6ced9e2
chore: update const name and add comment
declanyewlim Apr 21, 2026
07b2102
tests: simplify logic and add tests
declanyewlim Apr 21, 2026
5eaa280
fix: catch relevant errors
declanyewlim Apr 24, 2026
26 changes: 23 additions & 3 deletions .github/workflows/run-tests.yaml
@@ -40,6 +40,8 @@ jobs:
matrix:
python-version: ['3.10', '3.11', '3.12']
platform: ["linux-64", "osx-arm64", "osx-64"]
extras: ["none", "df"]
rosetta: ["false"]
include: # specify additional fields for all configs
- python-version: "3.10"
pyver-short: "310"
@@ -53,6 +55,13 @@ jobs:
runs-on: "macos-26-intel"
- platform: "linux-64"
runs-on: "ubuntu-latest"
# single test for running intel build on arm64 macos runner with rosetta
- python-version: "3.10"
pyver-short: "310"
platform: "osx-64" # intel build
runs-on: "macos-latest" # defaults to arm64 runner with M1 chip
extras: "df"
rosetta: "true"
runs-on: ${{ matrix.runs-on }}
steps:
- uses: actions/checkout@v4
@@ -100,6 +109,9 @@ jobs:
- name: run unittests
id: run_unittests
shell: bash -l -e {0}
env:
GK_EXTRAS: ${{ matrix.extras }}
ROSETTA: ${{ matrix.rosetta }}
run: |
set -x
micromamba activate test
@@ -108,7 +120,15 @@ jobs:
if [ ! -e "${files[0]}" ]; then
echo "No files matched for py${{ matrix.pyver-short }}"
exit 1
fi
conda mambabuild --croot /tmp/conda-bld -t $files --extra-deps python=${{ matrix.python-version }}
fi
extra_deps=(python=${{ matrix.python-version }})
# run command with optional extra deps
if [ "$GK_EXTRAS" = "df" ]; then
extra_deps+=("polars=1.39.3")
fi
if [ "$ROSETTA" = "true" ]; then
extra_deps+=("polars-runtime-compat=1.39.3")
fi
conda mambabuild --croot /tmp/conda-bld -t "${files[@]}" --extra-deps "${extra_deps[@]}"
conda clean -it
set +x
set +x
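
The dependency selection in the `run unittests` step can be read as a small pure function of the two matrix values. The sketch below mirrors the shell branches in Python for illustration only; `build_extra_deps` is a hypothetical name that does not exist in the workflow, and the pinned versions are copied from the diff above.

```python
def build_extra_deps(python_version, extras="none", rosetta="false"):
    """Mirror the shell logic that assembles the --extra-deps arguments.

    The arguments are strings because GitHub Actions passes matrix values
    to the step as environment variables (GK_EXTRAS and ROSETTA).
    """
    deps = [f"python={python_version}"]
    if extras == "df":
        # Optional DataFrame tests need Polars in the test environment.
        deps.append("polars=1.39.3")
    if rosetta == "true":
        # Intel build on an arm64 runner: add the compat runtime as well.
        deps.append("polars-runtime-compat=1.39.3")
    return deps


# The single Rosetta configuration added to the matrix's include list.
print(build_extra_deps("3.10", extras="df", rosetta="true"))
```

Only the one `include` entry sets `rosetta: "true"`, so the compat runtime is installed for that configuration alone.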
67 changes: 67 additions & 0 deletions docs-src/df.rst
@@ -0,0 +1,67 @@
.. _df:

DataFrame Utilities
===================

The :py:mod:`genome_kit.df` subpackage contains utilities for working with Polars DataFrames that contain GenomeKit objects. This includes utilities for serializing DataFrames with GenomeKit objects to Parquet and deserializing them back to GenomeKit objects. This is useful when sharing tabular data sets, or when saving intermediate DataFrames to disk during data processing.

.. important::

    ``genome_kit.df`` depends on optional ``polars`` dependencies, which are not installed by default. These can be installed with the ``[df]`` extra:

    .. code-block:: bash

        pip install "genomekit[df]"

    The ``[df]`` extra is not included in the default ``genomekit`` installation.

    If you are running an x86 version of Python on an Apple Silicon Mac (e.g. M1 chip), this will also install the ``polars-runtime-compat`` package, which is required to run Polars under Rosetta translation due to AVX feature compatibility issues.


Quickstart
-----------
The serialization and deserialization entry points are :py:func:`~genome_kit.df.write_parquet` and :py:func:`~genome_kit.df.read_parquet`, respectively:

.. code-block:: python

    import polars as pl
    import genome_kit as gk

    genome = gk.Genome("ncbi_refseq.v110")
    df = pl.DataFrame(
        {
            "gene": [genome.genes[0], genome.genes[1]],
            "score": [0.1, 0.8],
        }
    )

    gk.write_parquet(df, "genes.parquet")
    ...
    ...
    restored_df = gk.read_parquet("genes.parquet")


.. note::

    The written parquet files can be read by any software that supports the parquet format, but the GenomeKit objects will only be restored when read with :py:func:`genome_kit.df.read_parquet`.


Supported GenomeKit Objects
---------------------------
The currently supported GenomeKit objects for serialization are:

- :py:class:`genome_kit.Genome`
- :py:class:`genome_kit.Interval`
- :py:class:`genome_kit.Transcript`
- :py:class:`genome_kit.Gene`
- :py:class:`genome_kit.Exon`
- :py:class:`genome_kit.Intron`
- :py:class:`genome_kit.Cds`
- :py:class:`genome_kit.Utr`

Public API
----------------
.. automodule:: genome_kit.df
    :members:
    :undoc-members:
    :show-inheritance:
1 change: 1 addition & 0 deletions docs-src/index.rst
@@ -73,6 +73,7 @@ Contents:
anchors
api
genomes
df
develop
data_org

3 changes: 3 additions & 0 deletions genome_kit/__init__.py
@@ -49,6 +49,7 @@
from .variant_genome import VariantGenome
from .vcf_table import VCFTable, VCFVariant
from . import serialize
from .df import write_parquet, read_parquet

#########################################################################

@@ -93,6 +94,7 @@
"JunctionTable",
"ReadAlignments",
"ReadDistributions",
"read_parquet",
"Transcript",
"TranscriptTable",
"Utr",
@@ -102,6 +104,7 @@
"VariantTable",
"VCFTable",
"VCFVariant",
"write_parquet",
]

#########################################################################
70 changes: 70 additions & 0 deletions genome_kit/_optional.py
@@ -0,0 +1,70 @@
from __future__ import annotations

from importlib.metadata import PackageNotFoundError, version


def require_polars():
    """Import Polars if available, otherwise provide helpful error messages.

    Also checks for compatibility on MacOS with Apple Silicon, which may require
    an additional package if running Python under Rosetta translation.
    """
    try:
        import polars as pl

        if check_under_rosetta():
            if not check_rtcompat():
                raise ImportError(
                    "Polars is not compatible with Apple Silicon.\n"
                    "Please install with `pip install genomekit[df-mac]` to include "
                    "the polars-runtime-compat package required for Rosetta "
                    "translation."
                )
    except ModuleNotFoundError as e:
        raise ImportError(
            "Optional dependency 'polars' is required for this functionality. Please "
            "install with `pip install genomekit[df]`.\n"
            "If you are running this on MacOS with Apple Silicon, please install with "
            "`pip install genomekit[df-mac]` to include the polars-runtime-compat "
            "package required for Rosetta translation."
        ) from e

    return pl
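
The pattern used by `require_polars` (import inside the function, then re-raise a missing module as an `ImportError` with an install hint) can be written generically. The following is a minimal sketch, not part of the PR: `require_optional` is a hypothetical helper, and the standard-library `json` module stands in for an optional dependency so the example runs anywhere.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional module, or raise ImportError with an install hint.

    Hypothetical generalisation of require_polars: the import happens at call
    time, so merely importing the parent package never pulls in the module.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as e:
        raise ImportError(
            f"Optional dependency '{module_name}' is required for this "
            f"functionality. Please install with `pip install genomekit[{extra}]`."
        ) from e


def to_text(record):
    # Deferred import: resolved only when this function is actually called.
    json = require_optional("json", "df")
    return json.dumps(record, sort_keys=True)


print(to_text({"gene": 0, "anno": "example"}))
```

Chaining with `from e` keeps the original `ModuleNotFoundError` available as `__cause__`, which helps when the failing import is a transitive dependency rather than the named package.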


def check_under_rosetta():
    """Check if program is running under Rosetta translation on Apple Silicon.

    The default version of Polars is incompatible with Rosetta, and requires
    polars-runtime-compat to be installed.

    Can be checked with the sysctl.proc_translated flag in sysctl.
    See https://developer.apple.com/documentation/apple-silicon/about-the-rosetta-translation-environment#Determine-Whether-Your-App-Is-Running-as-a-Translated-Binary
    """
    import subprocess

    try:
        result = subprocess.run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True,
            text=True,
            check=True,
        )
        # output will be 0 if running natively on Apple Silicon, and 1 if running under
        # Rosetta translation
        return result.stdout.strip() == "1"
    except (subprocess.CalledProcessError, OSError):
        # sysctl.proc_translated won't exist on non-Apple Silicon machines
        return False
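
The three outcomes that `check_under_rosetta` distinguishes (translated, native, flag unavailable) are easier to see when the subprocess call is injectable. This is a sketch under that assumption: `is_translated` and the `fake_*` runners are hypothetical names for illustration and are not part of `_optional.py`.

```python
import subprocess
from types import SimpleNamespace


def is_translated(run=subprocess.run):
    """Return True only when sysctl reports a Rosetta-translated process."""
    try:
        result = run(
            ["sysctl", "-n", "sysctl.proc_translated"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        # Flag or binary missing: treated as "not under Rosetta".
        return False
    return result.stdout.strip() == "1"


# Hypothetical stand-ins for subprocess.run covering each outcome.
def fake_translated(cmd, **kwargs):
    return SimpleNamespace(stdout="1\n")


def fake_native(cmd, **kwargs):
    return SimpleNamespace(stdout="0\n")


def fake_unknown_key(cmd, **kwargs):
    raise subprocess.CalledProcessError(1, cmd)


def fake_no_sysctl(cmd, **kwargs):
    raise FileNotFoundError("sysctl")


for runner in (fake_translated, fake_native, fake_unknown_key, fake_no_sysctl):
    print(runner.__name__, is_translated(runner))
```

Catching `OSError` covers platforms without a `sysctl` binary, since `FileNotFoundError` is a subclass of it.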


def check_rtcompat():
    """Check if polars-runtime-compat is installed.

    Required for Polars to run on MacOS machines under Rosetta translation.
    """
    try:
        version("polars-runtime-compat")
        return True
    except PackageNotFoundError:
        return False
3 changes: 3 additions & 0 deletions genome_kit/df/__init__.py
@@ -0,0 +1,3 @@
from .serialization import read_parquet, write_parquet

__all__ = ["read_parquet", "write_parquet"]
153 changes: 153 additions & 0 deletions genome_kit/df/gk_structs.py
@@ -0,0 +1,153 @@
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING

from genome_kit._optional import require_polars

if TYPE_CHECKING:  # import polars for type checking
    import polars as pl

# minimal shim for python <3.11 compatibility
try:
    from enum import StrEnum, auto
except ImportError:
    from enum import Enum, auto

    class StrEnum(str, Enum):
        def __str__(self):
            return str(self.value)

        @staticmethod
        def _generate_next_value_(name, start, count, last_values):
            return name.lower()
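
To show what the fallback gives on Python 3.10, the sketch below defines the same shim unconditionally under a hypothetical name (`LowercaseStrEnum`) and derives a small demo enum from it. `DemoType` is illustrative only and is not the PR's `GkDfType`.

```python
from enum import Enum, auto


class LowercaseStrEnum(str, Enum):
    """Same body as the fallback StrEnum above, under a hypothetical name."""

    def __str__(self):
        return str(self.value)

    @staticmethod
    def _generate_next_value_(name, start, count, last_values):
        # auto() calls this hook; the member value becomes the lowercased name.
        return name.lower()


class DemoType(LowercaseStrEnum):
    GENOME = auto()
    INTERVAL = auto()


print(DemoType.GENOME.value)        # value derived from the member name
print(str(DemoType.INTERVAL))       # __str__ returns the bare value
print(DemoType("genome").name)      # lookup by value returns the member
```

Because members are also `str` instances, they compare equal to their plain-string values, which is convenient when the values are written to and read back from file metadata.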

# serializable representations of the supported GKDF types, with a one-to-one mapping
# between GkDfType and GenomeKit object types. Serves as the key for struct and function
# definitions in registry.py, keeping serialization and deserialization paths symmetric.
class GkDfType(StrEnum):
Contributor

Did I get the motivation right?

Suggested change
class GkDfType(StrEnum):
# an intermediate mapping (GK types -> GkDfType -> PL types) is added as a defensive
# coding measure to avoid an accidental import of pl from the main package.
class GkDfType(StrEnum):

Contributor Author

I wouldn’t classify this as defensive programming. The part that is defensive with regard to polars imports is the require_polars function in _optional.py, which gets called in any function that needs polars and provides informative messages if polars is not installed.

The main idea is to use GkDfType as a stable key for the serialization layer: it is used to look up the Polars struct and the serializer/deserializer functions, and for column inference. In theory, these mappings could use the GK types (or their representation) directly, but a separate intermediate key should make the mappings more symmetric and easier to maintain.
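
The "stable key" idea described in this reply can be sketched with the standard library alone. Everything below is hypothetical and for illustration: `DemoKey` stands in for `GkDfType`, `Interval` is a toy tuple rather than `genome_kit.Interval`, and `REGISTRY` is not the PR's `registry.py`.

```python
from enum import Enum
from typing import Callable, NamedTuple


class DemoKey(str, Enum):  # hypothetical stand-in for GkDfType
    INTERVAL = "interval"


class Interval(NamedTuple):  # toy object, not genome_kit.Interval
    chromosome: str
    strand: str
    start: int
    end: int


class Codec(NamedTuple):
    py_type: type          # used for column inference
    fields: tuple          # stands in for the Polars struct definition
    serialize: Callable    # object -> plain dict
    deserialize: Callable  # plain dict -> object


# One entry per key: struct, serializer and deserializer live side by side.
REGISTRY = {
    DemoKey.INTERVAL: Codec(
        py_type=Interval,
        fields=Interval._fields,
        serialize=lambda obj: obj._asdict(),
        deserialize=lambda row: Interval(**row),
    ),
}


def infer_key(value):
    """Column inference: find the key whose Python type matches the value."""
    for key, codec in REGISTRY.items():
        if isinstance(value, codec.py_type):
            return key
    return None


iv = Interval("chr1", "+", 100, 200)
key = infer_key(iv)
row = REGISTRY[key].serialize(iv)
print(key.value, row, REGISTRY[key].deserialize(row) == iv)
```

Keeping all per-type pieces under a single key means adding a type touches one registry entry, and the write and read paths cannot drift apart silently.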

    GENOME = auto()
    INTERVAL = auto()
    TRANSCRIPT = auto()
    GENE = auto()
    EXON = auto()
    INTRON = auto()
    CDS = auto()
    UTR = auto()


class CellType(StrEnum):
    SCALAR = auto()
    LIST = auto()


@dataclass(frozen=True)
class ColumnInfo:
    """Dataclass to store metadata about a single column in a dataframe.

    Assumes that all cells in a column have the same type. If the cell contains a list,
    assumes all items in the list are of the same type.
    """

    cell_type: CellType
    gkdf_type: GkDfType

    def to_dict(self) -> dict:
        return {
            "cell_type": self.cell_type.value,
            "gkdf_type": self.gkdf_type.value,
        }
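
Since `to_dict` emits only plain strings, per-column metadata can be carried as text. The sketch below assumes JSON as the encoding (the diff shown here does not say how the metadata is stored) and adds a hypothetical inverse, `from_dict`, that is not in the PR. The enums are trimmed copies with explicit values so the example runs on any supported Python version.

```python
import json
from dataclasses import dataclass
from enum import Enum


class CellType(str, Enum):  # trimmed copy with explicit values
    SCALAR = "scalar"
    LIST = "list"


class GkDfType(str, Enum):  # trimmed copy with explicit values
    GENE = "gene"
    EXON = "exon"


@dataclass(frozen=True)
class ColumnInfo:
    cell_type: CellType
    gkdf_type: GkDfType

    def to_dict(self) -> dict:
        return {
            "cell_type": self.cell_type.value,
            "gkdf_type": self.gkdf_type.value,
        }

    @classmethod
    def from_dict(cls, data: dict) -> "ColumnInfo":
        # Hypothetical inverse of to_dict: look the enum members up by value.
        return cls(CellType(data["cell_type"]), GkDfType(data["gkdf_type"]))


schema = {
    "gene": ColumnInfo(CellType.SCALAR, GkDfType.GENE),
    "exons": ColumnInfo(CellType.LIST, GkDfType.EXON),
}

# Assumed encoding: one JSON document describing every GenomeKit column.
encoded = json.dumps({name: info.to_dict() for name, info in schema.items()})
decoded = {name: ColumnInfo.from_dict(d) for name, d in json.loads(encoded).items()}
print(encoded)
```

An unknown value such as `"cell_type": "matrix"` would raise `ValueError` in `from_dict`, which is one way a reader could reject metadata written by a newer schema.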


class GkDfVersion(StrEnum):
    V1 = "1.0"


CURRENT_VERSION = GkDfVersion.V1


def get_structs() -> dict[GkDfType, pl.Struct]:
    """Return a mapping of GkDfType to their corresponding Polars Struct definitions."""
    pl = require_polars()

    GenomeStruct = pl.Struct(
        [
            pl.Field("schema_version", pl.Utf8),
            pl.Field("genome_name", pl.Utf8),  # reference or annotation genome
        ]
    )

    IntervalStruct = pl.Struct(
        [
            pl.Field("schema_version", pl.Utf8),
            pl.Field("chromosome", pl.Utf8),
            pl.Field("strand", pl.Utf8),
            pl.Field("start", pl.Int32),
            pl.Field("end", pl.Int32),
            pl.Field("refg", pl.Utf8),  # reference genome
        ]
    )

    TranscriptStruct = pl.Struct(
        [
            pl.Field("schema_version", pl.Utf8),
            # index of transcript within annotation genome transcript table
            # Int32 matches index type in C++ backend (see src/table.h:22)
            pl.Field("transcript_table_index", pl.Int32),
Collaborator

these can be gk version specific (depending if gk processes the anno file with more/less/rearranged features)...

kind of a weakpoint of gk, but not sure if you want to be explicit about the dganno version

Contributor Author

@declanyewlim declanyewlim Apr 7, 2026

Edit: I'll add a check for mismatching gk versions when reading the parquet file. The metadata already contains the gk version used to write the file.

On second thought, adding the specific dganno version(s) to the metadata isn't going to work here. The reference/annotation genomes are not collated when writing the file to disk, since this is done lazily in Polars. Collecting the unique genomes before writing to file is possible, but it either duplicates work or materializes the dataframe in memory.

I would assume that a change to the dganno version here would also result in a version bump for gk, so I think the check for gk version is sufficient here.
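
A read-time version check of the kind proposed here might look like the following. This is a sketch only: `check_written_version` is a hypothetical name, the version strings are made up, and whether the real check warns or raises is not shown in this thread (a warning is assumed).

```python
import warnings


def check_written_version(written, current):
    """Return True when the versions match; warn and return False otherwise.

    Table indices stored in the file are only meaningful if the reader builds
    the same annotation tables as the writer, hence the comparison.
    """
    if written == current:
        return True
    warnings.warn(
        f"File was written with GenomeKit {written} but is being read with "
        f"{current}; table indices may not refer to the same features.",
        UserWarning,
        stacklevel=2,
    )
    return False


# Illustrative version strings, not real release numbers.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    matches = check_written_version("0.0.1", "0.0.2")

print(matches, len(caught))
```

Comparing for exact equality is the conservative choice; relaxing it to a compatible range would require knowing which releases change annotation table ordering.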

            pl.Field("anno", pl.Utf8),  # annotation genome
        ]
    )

    GeneStruct = pl.Struct(
        [
            pl.Field("schema_version", pl.Utf8),
            pl.Field("gene_table_index", pl.Int32),
            pl.Field("anno", pl.Utf8),  # annotation genome
        ]
    )

    ExonStruct = pl.Struct(
        [
            pl.Field("schema_version", pl.Utf8),
            pl.Field("exon_table_index", pl.Int32),
            pl.Field("anno", pl.Utf8),  # annotation genome
        ]
    )

    IntronStruct = pl.Struct(
        [
            pl.Field("schema_version", pl.Utf8),
            pl.Field("intron_table_index", pl.Int32),
            pl.Field("anno", pl.Utf8),  # annotation genome
        ]
    )

    CdsStruct = pl.Struct(
        [
            pl.Field("schema_version", pl.Utf8),
            pl.Field("cds_table_index", pl.Int32),
            pl.Field("anno", pl.Utf8),  # annotation genome
        ]
    )

    UtrType = pl.Enum(["5prime", "3prime"])

    UtrStruct = pl.Struct(
        [
            pl.Field("schema_version", pl.Utf8),
            pl.Field("utr_type", UtrType),
            pl.Field("utr_table_index", pl.Int64),
            pl.Field("anno", pl.Utf8),  # annotation genome
        ]
    )

    return {
        GkDfType.GENOME: GenomeStruct,
        GkDfType.INTERVAL: IntervalStruct,
        GkDfType.TRANSCRIPT: TranscriptStruct,
        GkDfType.GENE: GeneStruct,
        GkDfType.EXON: ExonStruct,
        GkDfType.INTRON: IntronStruct,
        GkDfType.CDS: CdsStruct,
        GkDfType.UTR: UtrStruct,
    }