tandem-repeat

Rust-backed Python package for exact tandem-repeat detection in Unicode text.

from tandem_repeat import find, find_longest, find_primitive

find("abcabcabc")
# [TandemRepeat(start=0, end=9, substring='abc', repeats=3, ...), ...]

find("aaaaa")
# [TandemRepeat(start=0, end=5, substring='a', repeats=5, ...),
#  TandemRepeat(start=0, end=4, substring='aa', repeats=2, ...), ...]

find_primitive("aaaaa")
# [TandemRepeat(start=0, end=5, substring='a', repeats=5, ...)]

find_longest("xxabcabcabc")
# TandemRepeat(start=2, end=11, substring='abc', repeats=3, ...)

API

find(
    text: str,
    *,
    min_length: int = 1,
    min_repeats: int = 2,
    max_results: int | None = None,
) -> list[TandemRepeat]

find_primitive(
    text: str,
    *,
    min_length: int = 1,
    min_repeats: int = 2,
    max_results: int | None = None,
) -> list[TandemRepeat]

find_longest(
    text: str,
    *,
    min_length: int = 1,
    min_repeats: int = 2,
) -> TandemRepeat | None

TandemRepeat is a frozen dataclass:

@dataclass(frozen=True, slots=True)
class TandemRepeat:
    start: int       # Python character offset, inclusive
    end: int         # Python character offset, exclusive
    substring: str   # repeated unit
    repeats: int     # number of consecutive copies
    period: int      # len(substring) in Python characters
    byte_start: int  # UTF-8 byte offset, inclusive
    byte_end: int    # UTF-8 byte offset, exclusive

The detector works on the exact input code points. It does not normalize text, so precomposed and decomposed Unicode sequences are treated as different strings.

Complexity

find returns a compact vocabulary-style representation, not every occurrence position. This follows the distinction in Stoye and Gusfield's tandem repeat work: the occurrence set can be quadratic, while the vocabulary of tandem repeat types is linear-size. For example, "aaaaa" includes tandem types such as "a" repeated five times and "aa" repeated twice, but it does not expand every overlapping occurrence into separate Python objects.

find_primitive exposes the practical smaller-subunit view explicitly. It normalizes each block to its smallest repeated unit, so "aaaaa" is represented as "a" repeated five times.

The Rust scanner avoids the old all-periods-by-all-starts search for larger inputs. It scans short periods directly and discovers longer-period candidates from repeated fixed-width seeds, then verifies candidates exactly before returning them. This gives near-linear scaling on sparse realistic text while preserving exact compact blocks for emitted results. Dense fully periodic text uses a separate linear fast path for find_longest.

find_longest returns one result and avoids result-set explosion. On sparse text the benchmark suite now covers inputs up to 100k characters; on dense text it covers 100k-character fully periodic inputs.

Reference:

Jens Stoye and Dan Gusfield, "Linear time algorithms for finding and representing all the tandem repeats in a string", JCSS 2004.

Development

This project uses uv, maturin, Rust, Ruff, mypy, and pre-commit.

uv sync --all-extras --dev
uv run maturin develop
uv run pre-commit run --all-files
cargo test
cargo clippy --all-targets -- -D warnings
cargo fmt --check
uv run pytest -q
uv run python benches/bench_scaling.py
uv run python benches/capture_scaling.py

For contribution rules, test expectations, and the release checklist, see CONTRIBUTING.md.

Release

Releases are tag-driven and published by GitHub Actions with PyPI Trusted Publishing. Configure the PyPI trusted publisher for repository dkajtoch/tandem-repeat, workflow .github/workflows/release.yaml, and the protected GitHub environment pypi.

git tag v0.1.0
git push origin v0.1.0

The workflow runs the full Python and Rust quality gates, builds an sdist plus Linux and macOS wheels, validates artifacts with twine check, and publishes from the protected pypi environment.

Benchmarks

The benchmark scripts cover:

scaling with text length,
scaling with the number of inserted repeat blocks,
dense periodic input such as "a" * n,
multilingual Arabic, Hebrew, Hindi, CJK, emoji, and mixed-direction text.

The plotted scaling run writes artifacts to bench-results/scaling.csv, bench-results/summary.md, and bench-results/scaling.svg.

Latest local scaling run:

Python: 3.11.7
Platform: macOS-26.4.1-arm64-arm-64bit
Median of 3 trials where noted
find and find_primitive capped at 50,000 returned blocks

scenario	operation	1k chars	10k chars	50k chars	100k chars	100k results	note
sparse random text with one inserted repeat	`find`	0.0050s	0.0577s	0.3145s	0.6650s	1,520	compact vocabulary output
sparse random text with one inserted repeat	`find_longest`	0.0050s	0.0583s	0.3134s	0.6624s	1	same sparse scan, one result
dense single-character periodic text	`find_longest`	0.00008s	0.00070s	0.00355s	0.00686s	1	whole-text fast path
truncated periodic text: `("aab" * n)[:chars]`	`find_primitive`	0.0073s	0.1054s	0.4404s	0.9101s	33,336	output-sensitive primitive blocks
dense single-character periodic text	`find`	0.0012s	0.0881s	2.0210s	timeout	-	many vocabulary blocks; 15s timeout

These timings are machine-local measurements, not portable guarantees. The right bound for result-returning methods is output-sensitive: O(n + output) is the practical target for find_primitive, while find can return many compact vocabulary blocks on low-entropy input. find_longest avoids result-set growth because it returns only one block.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.github/workflows		.github/workflows
benches		benches
python/tandem_repeat		python/tandem_repeat
src		src
tests		tests
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

tandem-repeat

API

Complexity

Development

Release

Benchmarks

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

tandem-repeat

API

Complexity

Development

Release

Benchmarks

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages