Hi all,
My apologies for getting to this conversation late, but I have some questions about the implications of the compression/linearization capabilities afforded by this new implementation. There is an existing discussion here, but to my reading it largely deals with proximal details, i.e. the implementation itself and its immediate consequences. My concerns relate to implications that sit a bit further downstream.
This may be an arbitrary delineation, but to me it seems like the streamline-focused, lossy compression methods under discussion can be divided into two tiers of “lossiness”:
- Insignificantly to mildly lossy
- Conversion to less precise data formats (e.g. float 16)
- Implementation of dictionary-based storage schemas (e.g. sphere dictionary; QFib)
- Moderately to appreciably lossy
- “Downsampling”
- Linearization
To me, these two categories are distinct in that the first entails a slight loss of precision (and thus introduces a small amount of noise), while the second actually modifies the data representation itself through the removal of nodes/vertices.
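To make that distinction concrete, here is a purely illustrative sketch (NumPy only; the tolerance, names, and toy streamline are my own, not anything from the proposed format): a float16 round trip perturbs coordinates slightly but keeps every vertex, whereas a linearization pass (a crude Ramer–Douglas–Peucker-style simplification here) actually removes vertices and leaves irregular spacing.

```python
# Purely illustrative (NumPy only): one toy streamline run through both "tiers".
import numpy as np

# A toy streamline: an (N, 3) array of vertex coordinates in mm, stored as float32.
rng = np.random.default_rng(0)
streamline = np.cumsum(rng.normal(0.0, 0.5, size=(100, 3)), axis=0).astype(np.float32)

# Tier 1: precision reduction. Every vertex survives; coordinates pick up a small
# rounding error.
as_f16 = streamline.astype(np.float16).astype(np.float32)
precision_error = np.abs(as_f16 - streamline).max()

# Tier 2: linearization. Vertices are actually removed, so spacing becomes irregular
# even though the simplified path stays within roughly `tol` mm of the original.
def linearize(points, tol=0.2):
    """Crude Ramer-Douglas-Peucker-style simplification (illustrative only)."""
    if len(points) < 3:
        return points
    start, end = points[0], points[-1]
    chord = end - start
    # Distance of every vertex to the start-to-end chord.
    t = np.clip(((points - start) @ chord) / (chord @ chord), 0.0, 1.0)
    dists = np.linalg.norm(points - (start + t[:, None] * chord), axis=1)
    split = int(np.argmax(dists))
    if dists[split] <= tol:
        return np.vstack([start, end])
    return np.vstack([linearize(points[:split + 1], tol)[:-1],
                      linearize(points[split:], tol)])

simplified = linearize(streamline)
print(f"{len(streamline)} -> {len(simplified)} vertices; "
      f"max float16 error: {precision_error:.4f} mm")
```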
The implications of these latter strategies are partially determined by how “decompression” is implemented (if at all). For example:
- Are we enforcing or pushing regular sampling rates?
- If so, it seems that we would then be “regenerating” these dropped nodes. Is there a standard procedure for this? (A minimal interpolation sketch follows this list.)
- If not, this seems to preclude the use of dictionary-based compression algorithms (which, to my understanding, are predicated upon standard step sizes).
- Are we preventing down-sampling which alters the path or trajectory of streamlines?
- (E.g. the compression found in DIPY’s methods).
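For what it’s worth, when a regular step size is enforced, the “regeneration” is usually just arc-length parameterized linear interpolation over the surviving vertices (DIPY exposes resampling utilities along these lines, e.g. set_number_of_points, if I recall correctly). A minimal NumPy-only sketch with hypothetical names and step size:

```python
# Minimal sketch (NumPy only) of "regenerating" dropped nodes: resample a sparsely
# sampled polyline back to a fixed step size via arc-length parameterized linear
# interpolation. The function name and step size are hypothetical.
import numpy as np

def resample_fixed_step(points, step=0.5):
    """Resample an (N, 3) polyline so consecutive vertices sit ~`step` mm apart."""
    seg_lengths = np.linalg.norm(np.diff(points, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg_lengths)])          # cumulative arc length
    new_arc = np.append(np.arange(0.0, arc[-1], step), arc[-1])    # keep the endpoint
    # Interpolate each coordinate against arc length.
    return np.column_stack([np.interp(new_arc, arc, points[:, i]) for i in range(3)])

# A linearized (irregularly spaced) streamline pushed back to ~0.5 mm spacing.
sparse = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [10.0, 8.0, 0.0]])
dense = resample_fixed_step(sparse, step=0.5)
print(dense.shape)   # many more vertices, but the "new" ones are only piecewise-linear guesses
```

Note that the regenerated vertices carry no information beyond piecewise-linear interpolation, which is part of what I mean by “regenerating” dropped nodes.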
I understand that there are significant efficiencies to be gained through some of these methods, and that for some of us (e.g. those doing connectomics) the resulting quantifications are unaltered. Moreover, I understand that some of us have gone so far as to quantify the impact of such approaches to varying degrees (e.g. here and here). That being said, I have some concerns about the implications of such compression methods for contemporary tractogram segmentation methods themselves (explored to some extent in the latter of the aforementioned papers), as opposed to tractometry or derived measurements.
Specifically, there seem to be a plentiful number of cases wherein these methods would impact the outcome of a segmentation non-trivially (as discussed in the caveats at the bottom of this page). Namely:
- Uneven sampling rates could allow planar ROIs (highly common and widely utilized) to “slip through” linearized areas (see the vertex-vs-segment sketch after this list).
- Given that the composite streamlines of some particularly linear tracts have highly similar architectures (e.g. tracts like the IFOF or ILF), it seems that:
- Some structures may be impacted more than others by this issue (e.g. non-random impact across structures)
- Those structures that are impacted will likely have particular sub-components of their structure/morphology that are differentially impacted (e.g. non-random impact within structures)
- This is even more true of down-sampling algorithms which (1) introduce these same spacing issues and (2) impact the trajectory / volumetric occupancy of streamlines.
- Assuming that streamlines aren’t “up-sampled” back to some higher spatial resolution, quantifications like density masks and atlases (and group variants thereof) are likely impacted as well, with all of the complications noted above (i.e. differential within- and across-structure impacts); see the density-count sketch after this list.
- This paper describes some methods which partially address this, but it should be noted that:
- Not all (or even most?) packages implement this alteration
- It’s not simple, either algorithmically (in terms of computational implementation) or intuitively (in terms of how one would interact with a streamline) (subjective assessments, admittedly).
- It achieves performance increases at the cost of increased algorithmic opacity, and it shifts developmental burden onto those who develop downstream uses (e.g. segmentations or segmentation tools; a potential source of personal bias, I admit).
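To illustrate the planar-ROI point from the first bullet above (a hypothetical sketch, not any existing tool’s selection logic): a vertex-only inclusion test against a thin planar slab misses a linearized streamline whose surviving vertices straddle the plane, whereas a segment-aware test still catches it.

```python
# Hypothetical sketch: a thin planar slab at x = 5 mm, one densely sampled streamline
# that crosses it, and a linearized copy where only the endpoints survive. Function
# names and the slab definition are mine.
import numpy as np

def vertices_hit_slab(points, x0=5.0, half_thickness=0.5):
    """Vertex-only test: does any vertex fall inside the slab x0 +/- half_thickness?"""
    return bool(np.any(np.abs(points[:, 0] - x0) <= half_thickness))

def segments_cross_plane(points, x0=5.0):
    """Segment-aware test: does any segment cross the plane x = x0?"""
    d = points[:, 0] - x0
    return bool(np.any(d[:-1] * d[1:] <= 0))

dense = np.column_stack([np.linspace(0.0, 10.0, 21),    # a vertex every 0.5 mm in x
                         np.zeros(21), np.zeros(21)])
linearized = dense[[0, -1]]                              # linearization keeps only the endpoints

print(vertices_hit_slab(dense))          # True: a vertex lands inside the slab
print(vertices_hit_slab(linearized))     # False: the ROI "slips through" the long segment
print(segments_cross_plane(linearized))  # True: a segment-aware test still catches it
```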
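And to illustrate the density-map point (again hypothetical, and deliberately naive about how track-density maps are actually built): counting vertices per voxel gives the same streamline a different footprint before and after thinning, unless it is first resampled back to a fine, regular step (as in the interpolation sketch earlier).

```python
# Hypothetical sketch of the density-map concern: a deliberately naive scheme that
# counts streamline vertices per voxel (not any particular package's TDI method).
# The same streamline leaves a different voxel footprint after thinning.
import numpy as np

def vertex_density(streamlines, shape=(12, 12, 12), voxel_size=1.0):
    """Count vertices per voxel; assumes coordinates are already in the volume frame."""
    dmap = np.zeros(shape, dtype=int)
    for sl in streamlines:
        idx = np.clip(np.floor(sl / voxel_size).astype(int), 0, np.array(shape) - 1)
        np.add.at(dmap, tuple(idx.T), 1)   # accumulate counts, handling repeated voxels
    return dmap

dense = np.column_stack([np.linspace(0.5, 10.5, 41)] * 3)   # one diagonal streamline
thinned = dense[::8]                                        # keep every 8th vertex

print((vertex_density([dense]) > 0).sum())     # 11 voxels visited
print((vertex_density([thinned]) > 0).sum())   # 6 voxels visited: the map has "holes"
```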
To some extent, I recognize that these concerns don’t actually arise from the compression methods being proposed here, but are rather downstream consequences of their application. As such, they may be beyond the scope of concerns we are intending to address with this format implementation. That being said, if the goal is to implement a tractography standard that can be used across the community, then it stands to reason that we are designing for a range of potential users, some non-trivial proportion of whom won’t be familiar with the nuanced implications discussed above. As such, it is quite possible that such users could unintentionally run afoul of these issues and/or encounter difficulty replicating results using “the same data set”.
Are we at all concerned about these possibilities, or are we adopting a more laissez-faire approach to use of this format and its features? If we are concerned about these possibilities, how do we make them salient, or how do we attempt to shape users’ default dispositions?