Feature being added - Filtering!
Filter reads and alignments based on a flexible "mini language" specified in the TOML.
Would go into the config TOML under the section for the step being filtered? I.e. `caller_settings`/`mapper_settings`
Applied in the tight targets loop
# for base calling
filter = [
"metadata.sequence_length > 0",
"metadata.sequence_length < 1000",
]
# for alignment
filter = [
"is_primary",
"mapq > 40",
"strand == -1",
]
This is parsed into magic Enums and Classes in `_filter.py`
chunks = read_until_client.get_read_batch(...)
filtered_calls, calls = partition(basecall_filter, basecall(chunks))
filtered_aligns, aligns = partition(alignment_filter, align(calls))
for result in aligns:
    print("boo these alignments are trash")
for filtered_item in filtered_calls + filtered_aligns:
    print("Woohoo we have great success in filtering")
I suppose we would store these on the respective classes? _PluginModule or something I forget
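A rough sketch of what the parsing into Enums and Classes could look like, to make the idea concrete. Everything here is an assumption rather than the actual contents of `_filter.py`: the `Op` enum, the `Filter` dataclass, the regex, and the `SimpleNamespace` stand-in for an alignment are all hypothetical names. A bare attribute such as `is_primary` is treated as "must be truthy".

```python
import operator
import re
from dataclasses import dataclass
from enum import Enum
from types import SimpleNamespace
from typing import Any, Optional


class Op(Enum):
    """Comparison operators understood by this sketch."""
    LT = "<"
    LE = "<="
    GT = ">"
    GE = ">="
    EQ = "=="
    NE = "!="


_OP_FUNCS = {
    Op.LT: operator.lt, Op.LE: operator.le, Op.GT: operator.gt,
    Op.GE: operator.ge, Op.EQ: operator.eq, Op.NE: operator.ne,
}

# "<dotted.attr>" optionally followed by "<op> <value>"
_EXPR = re.compile(r"^\s*([A-Za-z_][\w.]*)\s*(?:(<=|>=|==|!=|<|>)\s*(\S+))?\s*$")


def _coerce(raw: str) -> Any:
    """Turn the right-hand side into an int or float, else a bare string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw.strip("\"'")


@dataclass(frozen=True)
class Filter:
    attr: str
    op: Optional[Op] = None  # None means "attribute must be truthy"
    value: Any = None

    @classmethod
    def parse(cls, expr: str) -> "Filter":
        match = _EXPR.match(expr)
        if match is None:
            raise ValueError(f"cannot parse filter expression: {expr!r}")
        attr, op, raw = match.groups()
        return cls(attr) if op is None else cls(attr, Op(op), _coerce(raw))

    def __call__(self, obj: Any) -> bool:
        actual = operator.attrgetter(self.attr)(obj)
        if self.op is None:
            return bool(actual)
        return bool(_OP_FUNCS[self.op](actual, self.value))


# Stand-in for an alignment record; the real object type is not decided here.
aln = SimpleNamespace(is_primary=True, mapq=60, strand=-1)
alignment_filters = [Filter.parse(e) for e in ("is_primary", "mapq > 40", "strand == -1")]
passed = all(f(aln) for f in alignment_filters)
```

Making each `Filter` callable means a list of them can be wrapped into a single predicate and handed to something like `partition`, as in the loop above.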
Ideas
- Extend language to startsWith/endsWith
- and/or/not logical operators
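One possible shape for the two ideas above, sketched as plain predicate combinators rather than new syntax. The helper names (`all_of`, `any_of`, `negate`, `starts_with`, `ends_with`) and the `contig` attribute are hypothetical, and how these would be spelled in the TOML is left open.

```python
import operator
from types import SimpleNamespace
from typing import Any, Callable

Predicate = Callable[[Any], bool]


def all_of(*preds: Predicate) -> Predicate:
    """Logical AND: every predicate must pass."""
    return lambda obj: all(p(obj) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Logical OR: at least one predicate must pass."""
    return lambda obj: any(p(obj) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """Logical NOT of a single predicate."""
    return lambda obj: not pred(obj)


def starts_with(attr: str, prefix: str) -> Predicate:
    """True when the (dotted) attribute, as a string, starts with prefix."""
    get = operator.attrgetter(attr)
    return lambda obj: str(get(obj)).startswith(prefix)


def ends_with(attr: str, suffix: str) -> Predicate:
    """True when the (dotted) attribute, as a string, ends with suffix."""
    get = operator.attrgetter(attr)
    return lambda obj: str(get(obj)).endswith(suffix)


# Stand-in alignment; "contig" is an illustrative field name only.
aln = SimpleNamespace(contig="chr1", is_primary=True, mapq=12)
keep = all_of(
    lambda a: a.is_primary,
    any_of(starts_with("contig", "chr"), lambda a: a.mapq > 40),
    negate(ends_with("contig", "_alt")),
)
result = keep(aln)
```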
Issues that need resolving/clarification
- VERY footgunny - for example `sequence.metadata.length < 0`, `mapq < 0` and goodbye all reads. How can we safeguard against this? I suggest maybe starting only with PAF, and maybe adding some checks in validation.
- Where do we add the tracking of filtering status? Do we add it directly to the `Result` object, or do we add it straight into the plugin `basecall`/`map_reads` methods (would involve having to write separate implementations for new plugins), and have the plugin return two Iterables, one of reads/`Results` instances that passed and one that failed?
- What do we do with Results that fail validations, `unblock` or `proceed`?
  - Fails basecalling filtering
  - Fails alignment filtering
  - Do we add a `fails_validation` to the toml/Conditions section, which defaults to `proceed`? This then relies on the exceeded max chunk behaviour
- How and where do we log this?
- What will it look like/where will it be placed in the config?
- What will the API between targets and plug-ins look like?
- How will we ensure that targets doesn’t miss any data?
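
On the footgun point above, one check that validation could run is to reject filter sets whose numeric bounds can never be satisfied. This is only a sketch under assumptions: conditions are taken as already-parsed `(attribute, operator, value)` tuples, and `KNOWN_MINIMUMS` is a hypothetical table of fields assumed to be non-negative.

```python
import math
from typing import Iterable, List, Tuple

# Assumed lower limits for fields that can never be negative (hypothetical names).
KNOWN_MINIMUMS = {"mapq": 0, "metadata.sequence_length": 0}

Condition = Tuple[str, str, float]  # (attribute, operator, numeric value)


def unsatisfiable(conditions: Iterable[Condition]) -> List[str]:
    """Return attributes whose combined numeric bounds admit no value."""
    bounds = {}  # attr -> [low, low_inclusive, high, high_inclusive]
    for attr, op, value in conditions:
        b = bounds.setdefault(
            attr, [KNOWN_MINIMUMS.get(attr, -math.inf), True, math.inf, True]
        )
        if op in (">", ">=", "=="):  # tighten the lower bound
            inclusive = op != ">"
            if value > b[0] or (value == b[0] and not inclusive):
                b[0], b[1] = value, inclusive
        if op in ("<", "<=", "=="):  # tighten the upper bound
            inclusive = op != "<"
            if value < b[2] or (value == b[2] and not inclusive):
                b[2], b[3] = value, inclusive
    return sorted(
        attr
        for attr, (low, low_inc, high, high_inc) in bounds.items()
        if low > high or (low == high and not (low_inc and high_inc))
    )


bad = unsatisfiable([("mapq", "<", 0), ("strand", "==", -1)])
ok = unsatisfiable([
    ("metadata.sequence_length", ">", 0),
    ("metadata.sequence_length", "<", 1000),
])
```

This only catches contradictions in simple numeric comparisons; a filter that is satisfiable but still discards nearly every read would need a runtime warning instead.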
### Tasks
- [ ] #304
- [ ] Describe mini-language
- [ ] Needs mad tests