
Function determining dataset store backing incorrectly assumes new numpy array is always created by reader #693

@yousefmoazzam

Description


For any section that is not the last section, the reader for section n+1 is created from the writer used in section n. Creating the reader only allocates a new numpy array if non-zero padding is required for section n+1:

```python
self._padding = (0, 0) if padding is None else padding
if self._padding != (0, 0) and not source.is_file_based:
    self._exchange_neighbourhoods()
```

```python
def _exchange_neighbourhoods(self):
    # we have the core of the chunk in RAM, but without the padding areas,
    # so we construct the full area with padding in RAM and exchange with MPI
    self._data = self._extend_data_for_padding(self._data)

def _extend_data_for_padding(self, core_data: np.ndarray) -> np.ndarray:
    padded_shape = list(self._chunk_shape)
    padded_shape[self.slicing_dim] += self._padding[0] + self._padding[1]
    padded_data = np.empty(padded_shape, self._data.dtype)
```
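To make the allocation behaviour concrete, here is a minimal, hypothetical sketch (the function `make_reader_data` and its parameters are illustrative, not the actual httomo API) of the two paths the reader can take: zero padding hands back the source array itself, while non-zero padding allocates a new, larger array.

```python
import numpy as np


def make_reader_data(source_data, chunk_shape, slicing_dim, padding):
    """Hypothetical sketch of the reader's handling of padding.

    With zero padding the source array is used as-is (no allocation);
    with non-zero padding a new, larger array is allocated.
    """
    if padding == (0, 0):
        return source_data  # plain reference, no copy made
    padded_shape = list(chunk_shape)
    padded_shape[slicing_dim] += padding[0] + padding[1]
    return np.empty(padded_shape, source_data.dtype)


chunk = np.zeros((8, 16, 16), dtype=np.float32)

unpadded = make_reader_data(chunk, chunk.shape, 0, (0, 0))
padded = make_reader_data(chunk, chunk.shape, 0, (2, 2))

print(unpadded is chunk)  # True: same array, no extra memory
print(padded.shape)       # (12, 16, 16): a new allocation
```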

However, the determine_store_backing() function assumes that creating the reader always allocates a new numpy array, so it always accounts for that array's size even though the array only exists in the non-zero padding case:

```python
return reduce_decorator(_non_last_section_in_pipeline)(
    memory_limit_bytes=memory_limit_bytes,
    write_chunk_bytes=current_chunk_bytes,
    read_chunk_bytes=next_chunk_bytes,
)
```
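The consequence is easy to see with a toy calculation. The sketch below (the function name and parameters are hypothetical, not httomo's actual signature) shows a padding-aware variant of the check: when padding is zero, only the writer's chunk occupies memory, so counting the reader's chunk as well can wrongly push the estimate over the limit.

```python
def non_last_section_fits(memory_limit_bytes, write_chunk_bytes,
                          read_chunk_bytes, padding):
    """Hypothetical padding-aware check: only count the reader's
    chunk when non-zero padding forces a new allocation."""
    required = write_chunk_bytes
    if padding != (0, 0):
        required += read_chunk_bytes  # padded copy allocated by the reader
    return required <= memory_limit_bytes


# with a 6 GB limit and 4 GB chunks:
limit = 6 * 2**30
chunk = 4 * 2**30

print(non_last_section_fits(limit, chunk, chunk, (0, 0)))  # True: one array in RAM
print(non_last_section_fits(limit, chunk, chunk, (2, 2)))  # False: padded copy needed too
```

Under the current code's assumption, the zero-padding case above would also be rejected, even though only one 4 GB array would actually exist.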

Extra info

For some more info on why I think no new numpy array is created for the reader of section n+1 when there is zero padding: in that case, the reader's self._data attribute is assigned the writer's self._data attribute (which is a numpy array), and nothing further happens to the reader's self._data:

```python
self._data = source_data
```

Meaning, I think that in the zero-padding case the reader of section n+1 simply gets a reference to the numpy array from the writer of section n and nothing else (ie, no copy is made, no new array is created), so there's no reason for more memory to be allocated when creating the reader for section n+1.
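This reference-sharing behaviour is just ordinary Python/numpy assignment semantics, and can be checked directly (the `writer_data`/`reader_data` names below are illustrative stand-ins for the two self._data attributes):

```python
import numpy as np

writer_data = np.arange(12, dtype=np.float32).reshape(3, 4)

# zero-padding case: plain assignment, the reader just takes a reference
reader_data = writer_data

print(reader_data is writer_data)             # True: same object
print(np.shares_memory(reader_data, writer_data))  # True: same buffer

# a write through one name is visible through the other
reader_data[0, 0] = 99.0
print(writer_data[0, 0])  # 99.0
```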

Note that the above excludes the case of a reslice: a reslice will of course cause allocations inside the reslice algorithm, but that is separate from what the writer and reader themselves do with the numpy arrays that represent the chunks associated with a section.
