feat: add expand, shift, up/dnstream_of, within, strand methods to DisjointIntervalSequence#212
Conversation
There was a problem hiding this comment.
Pull request overview
This PR extends DisjointIntervalSequence with strand-orientation helpers and positional manipulation/relationship methods, and updates documentation and tests to cover the new behaviors (including negative-strand transcripts).
Changes:
- Add
shift,expand,upstream_of,dnstream_of,within, and strand conversion/flip helpers toDisjointIntervalSequence. - Add extensive unit tests for strand handling, shifting/expanding, and relative-position comparisons (including negative-strand transcript coordinate spaces).
- Expand DIS documentation with new sections covering strand methods, shifting/expanding, and positional comparisons.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
genome_kit/diseq.py |
Implements new DIS manipulation, comparison, and strand-orientation APIs. |
tests/test_diseq.py |
Adds coverage for negative-strand transcripts and all new DIS APIs. |
docs-src/diseq.rst |
Documents the new APIs with examples and conceptual explanations. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…n different strands
| Strand methods only affect the interval layer. The coordinate | ||
| intervals always remain unchanged. | ||
|
|
||
| Shifting and Expanding |
There was a problem hiding this comment.
Are the anchors on the coordinate space intervals respected? If not, then this should be explicit and the DIS initializer should fail if any interval specifies an anchor.
In addition (or alternatively), are anchors supported on the DIS interval? This will complicate the implementation and the API. I don't know if there's demand for it.
@AliceDG what do you think?
There was a problem hiding this comment.
That's a good question! Looking at the current API, we support creating DIS from a list of intervals, but we haven't added any .dna methods (sorry if I missed it - I have not reviewed line-by-line), which is where anchor and anchor_offset would matter.
I can imagine a scenario where:
- we have a
VariantGenomewith an insertion or deletion at15 - we create DIS from
[(0, 5), (10, 20)] - we want to extract local context around a point of interest
12on DIS, with 5bp context on both sides - this overlaps the insertion/deletion point - in order for the extracted sequence to be fixed (e.g. for some ML applications), we'll need anchor defined
- in this case the most nature choice will be anchoring at "point of interest"
12
I think we should keep v1 API simple, i.e. not supporting VariantGenome with indels, not supporting anchor/anchor_offset. But maybe we can ask a few more comp bio users to see how often these cases arise?
There was a problem hiding this comment.
+1
To keep this moving, let's explicitly fail on any intervals with anchors. We can add support later if needed.
| Strand Methods | ||
| ============== | ||
|
|
||
| A DIS interval can sit on either 'virtual' strand independently of the coordinate |
There was a problem hiding this comment.
biologically, is this useful? if you spliced all the exons for mrna, what would you need the opposite strand for?
There was a problem hiding this comment.
This was requested as the virtual opposite-strand is used for some machine learning applications. My understanding is it helps to represent something meant to bind to the sequence represented by the DIS. It is also useful to get the complement of a sequence on the DIS.
@AliceDG feel free to chime in
There was a problem hiding this comment.
Yes to what @SophiaPerzan-DG said. An oligo binding to the mRNA will be represented as DIS interval on the opposite strand. This is convenient for getting the oligo's sequence, or shifting the oligo around.
| Strand methods only affect the interval layer. The coordinate | ||
| intervals always remain unchanged. | ||
|
|
||
| Shifting and Expanding |
There was a problem hiding this comment.
I think this needs a bit more colour. if you shift/expand the original slice, what do you get? the unspliced nts? or it clamps?
There was a problem hiding this comment.
I'm not sure I fully follow. What do you mean by 'nts' and 'clamps'? I also assume that 'original slice' refers to the set of intervals used to create the coordinate-space?
There was a problem hiding this comment.
Got a bit of context from @ovesh, I'll make an attempt at answering. Lmk if I misunderstood something.
If the DIS interval is expanded beyond the boundaries of the coordinate space, then currently nothing will happen (since the DIS interval is just defined as 2 index vars). However, when the DIS interval is lowered to genomic-space (such as with genome.dna()), then the sequence returned would be equivalent to the coordinate intervals with the 5'-most and 3'-most intervals being expanded up/downstream by as many bases as the DIS interval exceeds the original DIS coordinate space by.
Simplified code example
ivs = [Interval(1000, 1010), Interval(1020, 1030), Interval(1040, 1050)]
dis = DIS.from_intervals(ivs)
# dis.start == 0
# dis.end == 30
dis.expand(50)
# dis.start == -50
# dis.end == 80
lowered = dis.lower()
# lowered == [ivs[0].expand(50, 0), ivs[1], ivs[2].expand(0, 50)] == [Interval(950, 1010), Interval(1020, 1030), Interval(1040, 1100)]
There was a problem hiding this comment.
I think this deserves a special section in the docs.
The dna part sounds a bit confusing, might need a motivating use case in documentation?
There was a problem hiding this comment.
Agreed, but I think it's ok to defer to the PR that adds .lower() and .dna().
Followup to #210