Skip to content

feat: add expand, shift, up/dnstream_of, within, strand methods to DisjointIntervalSequence#212

Merged
ovesh merged 18 commits intomainfrom
diseq-pt2
Apr 27, 2026
Merged

feat: add expand, shift, up/dnstream_of, within, strand methods to DisjointIntervalSequence#212
ovesh merged 18 commits intomainfrom
diseq-pt2

Conversation

@SophiaPerzan-DG
Copy link
Copy Markdown
Collaborator

@SophiaPerzan-DG SophiaPerzan-DG commented Apr 15, 2026

Followup to #210

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends DisjointIntervalSequence with strand-orientation helpers and positional manipulation/relationship methods, and updates documentation and tests to cover the new behaviors (including negative-strand transcripts).

Changes:

  • Add shift, expand, upstream_of, dnstream_of, within, and strand conversion/flip helpers to DisjointIntervalSequence.
  • Add extensive unit tests for strand handling, shifting/expanding, and relative-position comparisons (including negative-strand transcript coordinate spaces).
  • Expand DIS documentation with new sections covering strand methods, shifting/expanding, and positional comparisons.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

File Description
genome_kit/diseq.py Implements new DIS manipulation, comparison, and strand-orientation APIs.
tests/test_diseq.py Adds coverage for negative-strand transcripts and all new DIS APIs.
docs-src/diseq.rst Documents the new APIs with examples and conceptual explanations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread docs-src/diseq.rst Outdated
Comment thread docs-src/diseq.rst Outdated
Comment thread docs-src/diseq.rst Outdated
Comment thread genome_kit/diseq.py
Comment thread genome_kit/diseq.py Outdated
Comment thread genome_kit/diseq.py Outdated
Comment thread genome_kit/diseq.py Outdated
SophiaPerzan-DG and others added 4 commits April 15, 2026 15:43
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@SophiaPerzan-DG SophiaPerzan-DG marked this pull request as ready for review April 15, 2026 19:56
Comment thread docs-src/diseq.rst Outdated
Comment thread docs-src/diseq.rst Outdated
Comment thread docs-src/diseq.rst
Strand methods only affect the interval layer. The coordinate
intervals always remain unchanged.

Shifting and Expanding
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the anchors on the coordinate space intervals respected? If not, then this should be explicit and the DIS initializer should fail if any interval specifies an anchor.
In addition (or alternatively), are anchors supported on the DIS interval? This will complicate the implementation and the API. I don't know if there's demand for it.

@AliceDG what do you think?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good question! Looking at the current API, we support creating DIS from a list of intervals, but we haven't added any .dna methods (sorry if I missed it - I have not reviewed line-by-line), which is where anchor and anchor_offset would matter.

I can imagine a scenario where:

  • we have a VariantGenome with an insertion or deletion at 15
  • we create DIS from [(0, 5), (10, 20)]
  • we want to extract local context around a point of interest 12 on DIS, with 5bp context on both sides - this overlaps the insertion/deletion point
  • in order for the extracted sequence to be fixed (e.g. for some ML applications), we'll need anchor defined
  • in this case the most nature choice will be anchoring at "point of interest" 12

I think we should keep v1 API simple, i.e. not supporting VariantGenome with indels, not supporting anchor/anchor_offset. But maybe we can ask a few more comp bio users to see how often these cases arise?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

To keep this moving, let's explicitly fail on any intervals with anchors. We can add support later if needed.

Comment thread docs-src/diseq.rst Outdated
Strand Methods
==============

A DIS interval can sit on either 'virtual' strand independently of the coordinate
Copy link
Copy Markdown
Collaborator

@s22chan s22chan Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

biologically, is this useful? if you spliced all the exons for mrna, what would you need the opposite strand for?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was requested as the virtual opposite-strand is used for some machine learning applications. My understanding is it helps to represent something meant to bind to the sequence represented by the DIS. It is also useful to get the complement of a sequence on the DIS.

@AliceDG feel free to chime in

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes to what @SophiaPerzan-DG said. An oligo binding to the mRNA will be represented as DIS interval on the opposite strand. This is convenient for getting the oligo's sequence, or shifting the oligo around.

Comment thread docs-src/diseq.rst
Strand methods only affect the interval layer. The coordinate
intervals always remain unchanged.

Shifting and Expanding
Copy link
Copy Markdown
Collaborator

@s22chan s22chan Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a bit more colour. if you shift/expand the original slice, what do you get? the unspliced nts? or it clamps?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I fully follow. What do you mean by 'nts' and 'clamps'? I also assume that 'original slice' refers to the set of intervals used to create the coordinate-space?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got a bit of context from @ovesh, I'll make an attempt at answering. Lmk if I misunderstood something.

If the DIS interval is expanded beyond the boundaries of the coordinate space, then currently nothing will happen (since the DIS interval is just defined as 2 index vars). However, when the DIS interval is lowered to genomic-space (such as with genome.dna()), then the sequence returned would be equivalent to the coordinate intervals with the 5'-most and 3'-most intervals being expanded up/downstream by as many bases as the DIS interval exceeds the original DIS coordinate space by.

Simplified code example

ivs = [Interval(1000, 1010), Interval(1020, 1030), Interval(1040, 1050)]
dis = DIS.from_intervals(ivs)
# dis.start == 0
# dis.end == 30
dis.expand(50)
# dis.start == -50
# dis.end == 80
lowered = dis.lower()
# lowered == [ivs[0].expand(50, 0), ivs[1], ivs[2].expand(0, 50)] == [Interval(950, 1010), Interval(1020, 1030), Interval(1040, 1100)]

Copy link
Copy Markdown
Collaborator

@s22chan s22chan Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this deserves a special section in the docs.

The dna part sounds a bit confusing, might need a motivating use case in documentation?

Copy link
Copy Markdown
Contributor

@ovesh ovesh Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but I think it's ok to defer to the PR that adds .lower() and .dna().

@ovesh ovesh merged commit cd22aeb into main Apr 27, 2026
13 checks passed
@ovesh ovesh deleted the diseq-pt2 branch April 27, 2026 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants