Hi GenomeKit team! This is a feature request regarding hg38 patch compatibility. Imagine we are using an annotation on hg38.p12 and define the following three intervals:
import genome_kit as gk
genome = gk.Genome('ncbi_refseq.v109') # uses hg38.p12
itv = gk.Interval('chr7', '-', 65967466, 65968201, 'hg38')
itv_p12 = gk.Interval('chr7', '-', 65967466, 65968201, 'hg38.p12')
itv_p13 = gk.Interval('chr7', '-', 65967466, 65968201, 'hg38.p13')
Only the interval itv_p12 will work with the object genome. For example, if we try to retrieve the sequence:
genome.dna(itv) # error
genome.dna(itv_p12) # works
genome.dna(itv_p13) # error
Similar errors occur when trying to find overlapping genes and transcripts, or creating new intervals based on a combination of these three intervals.
However, the sequence information on the main chromosome does not change between patches, and the intervals are actually compatible with each other. It would be useful to have either:
- Support for combining main chromosome intervals across patches
  - In practice, this would mean that the above operations do not error out, but take advantage of the fact that the coordinates are the same across patches and return the same result as if itv_p12 were used.
- A way of explicitly lifting/translating intervals across patches
  - Imagine we had something like a genome.make_compatible(itv) function that returns an interval on the same patch if the interval is on the main chromosome of the same major assembly.
  - In this case, genome.make_compatible(itv) and genome.make_compatible(itv_p13) should return itv_p12.
  - It would be easiest if genome.make_compatible(itv_p12) still returned itv_p12, so we can call the function without checking the reference patch first.
  - If the intervals are not compatible (e.g. different major assemblies, or non-main chromosome), the function should throw an error.
This would be especially useful when dealing with intervals saved in a database. Currently, we are restricted to always working with the same patch that the interval was saved on, which limits our choice of annotations. This problem will get worse over time.
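To make the second option more concrete, here is a rough sketch of the behaviour I have in mind. It is not GenomeKit code: the Interval class below is a small stand-in for gk.Interval so that the example runs on its own, and the names major_assembly and MAIN_CHROMOSOME, as well as passing the target reference as a string instead of calling a method on the genome object, are assumptions made for illustration. The main-chromosome check also assumes UCSC-style names (chr1 to chr22, chrX, chrY) and deliberately leaves out chrM.

```python
import re
from typing import NamedTuple


class Interval(NamedTuple):
    """Stand-in for gk.Interval with only the fields this sketch needs."""
    chromosome: str
    strand: str
    start: int
    end: int
    reference_genome: str


# Assumption: "main chromosome" means chr1..chr22, chrX or chrY.
# Alt and fix contigs (names containing "_") are rejected.
MAIN_CHROMOSOME = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y)$")


def major_assembly(reference_genome: str) -> str:
    """Strip the patch suffix: 'hg38.p12' -> 'hg38', 'hg38' -> 'hg38'."""
    return reference_genome.split(".", 1)[0]


def make_compatible(interval: Interval, target_reference: str) -> Interval:
    """Return `interval` re-labelled onto `target_reference` if that is safe."""
    # Already on the requested patch: nothing to do, return unchanged.
    if interval.reference_genome == target_reference:
        return interval

    # Different major assemblies (e.g. hg19 vs hg38) have different coordinates.
    if major_assembly(interval.reference_genome) != major_assembly(target_reference):
        raise ValueError(
            f"{interval.reference_genome} and {target_reference} "
            "belong to different major assemblies"
        )

    # Only main chromosomes are assumed to be stable across patches.
    if not MAIN_CHROMOSOME.match(interval.chromosome):
        raise ValueError(
            f"{interval.chromosome} is not a main chromosome; "
            "cannot translate across patches"
        )

    # Same coordinates, only the reference label changes.
    return interval._replace(reference_genome=target_reference)


itv = Interval("chr7", "-", 65967466, 65968201, "hg38")
itv_p12 = Interval("chr7", "-", 65967466, 65968201, "hg38.p12")
itv_p13 = Interval("chr7", "-", 65967466, 65968201, "hg38.p13")

# All three intervals end up on hg38.p12.
converted = [make_compatible(i, "hg38.p12") for i in (itv, itv_p12, itv_p13)]
```

With something like this, intervals loaded from a database could be passed through make_compatible once before being used with whichever annotation we pick, without us having to check the patch version first.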
Let me know if any clarifications are needed. Thank you!