Skip to content

dkajtoch/tandem-repeat

Repository files navigation

tandem-repeat

Rust-backed Python package for exact tandem-repeat detection in Unicode text.

from tandem_repeat import find, find_longest, find_primitive

find("abcabcabc")
# [TandemRepeat(start=0, end=9, substring='abc', repeats=3, ...), ...]

find("aaaaa")
# [TandemRepeat(start=0, end=5, substring='a', repeats=5, ...),
#  TandemRepeat(start=0, end=4, substring='aa', repeats=2, ...), ...]

find_primitive("aaaaa")
# [TandemRepeat(start=0, end=5, substring='a', repeats=5, ...)]

find_longest("xxabcabcabc")
# TandemRepeat(start=2, end=11, substring='abc', repeats=3, ...)

API

find(
    text: str,
    *,
    min_length: int = 1,
    min_repeats: int = 2,
    max_results: int | None = None,
) -> list[TandemRepeat]

find_primitive(
    text: str,
    *,
    min_length: int = 1,
    min_repeats: int = 2,
    max_results: int | None = None,
) -> list[TandemRepeat]

find_longest(
    text: str,
    *,
    min_length: int = 1,
    min_repeats: int = 2,
) -> TandemRepeat | None

TandemRepeat is a frozen dataclass:

@dataclass(frozen=True, slots=True)
class TandemRepeat:
    start: int       # Python character offset, inclusive
    end: int         # Python character offset, exclusive
    substring: str   # repeated unit
    repeats: int     # number of consecutive copies
    period: int      # len(substring) in Python characters
    byte_start: int  # UTF-8 byte offset, inclusive
    byte_end: int    # UTF-8 byte offset, exclusive

The detector works on the exact input code points. It does not normalize text, so precomposed and decomposed Unicode sequences are treated as different strings.

Complexity

find returns a compact vocabulary-style representation, not every occurrence position. This follows the distinction in Stoye and Gusfield's tandem repeat work: the occurrence set can be quadratic, while the vocabulary of tandem repeat types is linear-size. For example, "aaaaa" includes tandem types such as "a" repeated five times and "aa" repeated twice, but it does not expand every overlapping occurrence into separate Python objects.

find_primitive exposes the practical smaller-subunit view explicitly. It normalizes each block to its smallest repeated unit, so "aaaaa" is represented as "a" repeated five times.

The Rust scanner avoids the old all-periods-by-all-starts search for larger inputs. It scans short periods directly and discovers longer-period candidates from repeated fixed-width seeds, then verifies candidates exactly before returning them. This gives near-linear scaling on sparse realistic text while preserving exact compact blocks for emitted results. Dense fully periodic text uses a separate linear fast path for find_longest.

find_longest returns one result and avoids result-set explosion. On sparse text the benchmark suite now covers inputs up to 100k characters; on dense text it covers 100k-character fully periodic inputs.

Reference:

  • Jens Stoye and Dan Gusfield, "Linear time algorithms for finding and representing all the tandem repeats in a string", JCSS 2004.

Development

This project uses uv, maturin, Rust, Ruff, mypy, and pre-commit.

uv sync --all-extras --dev
uv run maturin develop
uv run pre-commit run --all-files
cargo test
cargo clippy --all-targets -- -D warnings
cargo fmt --check
uv run pytest -q
uv run python benches/bench_scaling.py
uv run python benches/capture_scaling.py

For contribution rules, test expectations, and the release checklist, see CONTRIBUTING.md.

Release

Releases are tag-driven and published by GitHub Actions with PyPI Trusted Publishing. Configure the PyPI trusted publisher for repository dkajtoch/tandem-repeat, workflow .github/workflows/release.yaml, and the protected GitHub environment pypi.

git tag v0.1.0
git push origin v0.1.0

The workflow runs the full Python and Rust quality gates, builds an sdist plus Linux and macOS wheels, validates artifacts with twine check, and publishes from the protected pypi environment.

Benchmarks

The benchmark scripts cover:

  • scaling with text length,
  • scaling with the number of inserted repeat blocks,
  • dense periodic input such as "a" * n,
  • multilingual Arabic, Hebrew, Hindi, CJK, emoji, and mixed-direction text.

The plotted scaling run writes artifacts to bench-results/scaling.csv, bench-results/summary.md, and bench-results/scaling.svg.

Latest local scaling run:

  • Python: 3.11.7
  • Platform: macOS-26.4.1-arm64-arm-64bit
  • Median of 3 trials where noted
  • find and find_primitive capped at 50,000 returned blocks
scenario operation 1k chars 10k chars 50k chars 100k chars 100k results note
sparse random text with one inserted repeat find 0.0050s 0.0577s 0.3145s 0.6650s 1,520 compact vocabulary output
sparse random text with one inserted repeat find_longest 0.0050s 0.0583s 0.3134s 0.6624s 1 same sparse scan, one result
dense single-character periodic text find_longest 0.00008s 0.00070s 0.00355s 0.00686s 1 whole-text fast path
truncated periodic text: ("aab" * n)[:chars] find_primitive 0.0073s 0.1054s 0.4404s 0.9101s 33,336 output-sensitive primitive blocks
dense single-character periodic text find 0.0012s 0.0881s 2.0210s timeout - many vocabulary blocks; 15s timeout

These timings are machine-local measurements, not portable guarantees. The right bound for result-returning methods is output-sensitive: O(n + output) is the practical target for find_primitive, while find can return many compact vocabulary blocks on low-entropy input. find_longest avoids result-set growth because it returns only one block.

About

Fast Rust-backed tandem repeat detection for Python strings

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors