Rust-backed Python package for exact tandem-repeat detection in Unicode text.
from tandem_repeat import find, find_longest, find_primitive
find("abcabcabc")
# [TandemRepeat(start=0, end=9, substring='abc', repeats=3, ...), ...]
find("aaaaa")
# [TandemRepeat(start=0, end=5, substring='a', repeats=5, ...),
# TandemRepeat(start=0, end=4, substring='aa', repeats=2, ...), ...]
find_primitive("aaaaa")
# [TandemRepeat(start=0, end=5, substring='a', repeats=5, ...)]
find_longest("xxabcabcabc")
# TandemRepeat(start=2, end=11, substring='abc', repeats=3, ...)find(
text: str,
*,
min_length: int = 1,
min_repeats: int = 2,
max_results: int | None = None,
) -> list[TandemRepeat]
find_primitive(
text: str,
*,
min_length: int = 1,
min_repeats: int = 2,
max_results: int | None = None,
) -> list[TandemRepeat]
find_longest(
text: str,
*,
min_length: int = 1,
min_repeats: int = 2,
) -> TandemRepeat | NoneTandemRepeat is a frozen dataclass:
@dataclass(frozen=True, slots=True)
class TandemRepeat:
start: int # Python character offset, inclusive
end: int # Python character offset, exclusive
substring: str # repeated unit
repeats: int # number of consecutive copies
period: int # len(substring) in Python characters
byte_start: int # UTF-8 byte offset, inclusive
byte_end: int # UTF-8 byte offset, exclusiveThe detector works on the exact input code points. It does not normalize text, so precomposed and decomposed Unicode sequences are treated as different strings.
find returns a compact vocabulary-style representation, not every occurrence
position. This follows the distinction in Stoye and Gusfield's tandem repeat
work: the occurrence set can be quadratic, while the vocabulary of tandem repeat
types is linear-size. For example, "aaaaa" includes tandem types such as
"a" repeated five times and "aa" repeated twice, but it does not expand every
overlapping occurrence into separate Python objects.
find_primitive exposes the practical smaller-subunit view explicitly. It
normalizes each block to its smallest repeated unit, so "aaaaa" is represented
as "a" repeated five times.
The Rust scanner avoids the old all-periods-by-all-starts search for larger
inputs. It scans short periods directly and discovers longer-period candidates
from repeated fixed-width seeds, then verifies candidates exactly before
returning them. This gives near-linear scaling on sparse realistic text while
preserving exact compact blocks for emitted results. Dense fully periodic text
uses a separate linear fast path for find_longest.
find_longest returns one result and avoids result-set explosion. On sparse text
the benchmark suite now covers inputs up to 100k characters; on dense text it
covers 100k-character fully periodic inputs.
Reference:
- Jens Stoye and Dan Gusfield, "Linear time algorithms for finding and representing all the tandem repeats in a string", JCSS 2004.
This project uses uv, maturin, Rust, Ruff, mypy, and pre-commit.
uv sync --all-extras --dev
uv run maturin develop
uv run pre-commit run --all-files
cargo test
cargo clippy --all-targets -- -D warnings
cargo fmt --check
uv run pytest -q
uv run python benches/bench_scaling.py
uv run python benches/capture_scaling.pyFor contribution rules, test expectations, and the release checklist, see CONTRIBUTING.md.
Releases are tag-driven and published by GitHub Actions with PyPI Trusted
Publishing. Configure the PyPI trusted publisher for repository
dkajtoch/tandem-repeat, workflow .github/workflows/release.yaml, and the
protected GitHub environment pypi.
git tag v0.1.0
git push origin v0.1.0The workflow runs the full Python and Rust quality gates, builds an sdist plus
Linux and macOS wheels, validates artifacts with twine check, and publishes
from the protected pypi environment.
The benchmark scripts cover:
- scaling with text length,
- scaling with the number of inserted repeat blocks,
- dense periodic input such as
"a" * n, - multilingual Arabic, Hebrew, Hindi, CJK, emoji, and mixed-direction text.
The plotted scaling run writes artifacts to bench-results/scaling.csv,
bench-results/summary.md, and bench-results/scaling.svg.
Latest local scaling run:
- Python: 3.11.7
- Platform: macOS-26.4.1-arm64-arm-64bit
- Median of 3 trials where noted
findandfind_primitivecapped at 50,000 returned blocks
| scenario | operation | 1k chars | 10k chars | 50k chars | 100k chars | 100k results | note |
|---|---|---|---|---|---|---|---|
| sparse random text with one inserted repeat | find |
0.0050s | 0.0577s | 0.3145s | 0.6650s | 1,520 | compact vocabulary output |
| sparse random text with one inserted repeat | find_longest |
0.0050s | 0.0583s | 0.3134s | 0.6624s | 1 | same sparse scan, one result |
| dense single-character periodic text | find_longest |
0.00008s | 0.00070s | 0.00355s | 0.00686s | 1 | whole-text fast path |
truncated periodic text: ("aab" * n)[:chars] |
find_primitive |
0.0073s | 0.1054s | 0.4404s | 0.9101s | 33,336 | output-sensitive primitive blocks |
| dense single-character periodic text | find |
0.0012s | 0.0881s | 2.0210s | timeout | - | many vocabulary blocks; 15s timeout |
These timings are machine-local measurements, not portable guarantees. The
right bound for result-returning methods is output-sensitive: O(n + output) is
the practical target for find_primitive, while find can return many compact
vocabulary blocks on low-entropy input. find_longest avoids result-set growth
because it returns only one block.