Description
We discussed memory usage on Friday's community call. https://github.com/TomAugspurger/zarr-python-memory-benchmark has some initial benchmarks looking into it.
https://rawcdn.githack.com/TomAugspurger/zarr-python-memory-benchmark/refs/heads/main/reports/memray-flamegraph-read-uncompressed.html has the memray flamegraph for reading an uncompressed array (400 MB total, split into 10 chunks of 40 MB each). I think the optimal memory usage here is about 400 MB. Our peak memory is about 2x that.
https://rawcdn.githack.com/TomAugspurger/zarr-python-memory-benchmark/refs/heads/main/reports/memray-flamegraph-read-compressed.html has the flamegraph for the zstd-compressed version. Peak memory is about 1.1 GiB.
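For reference, the read being measured is roughly this (a sketch, not the actual benchmark script from the repo; it assumes zarr-python 3's `zarr.create_array` API and that `compressors=None` disables compression):

```python
# Rough sketch of the benchmark setup (the real scripts are in the
# repo linked above).
import numpy as np
import zarr

# 400 MB of float64, split into 10 chunks of 40 MB each.
arr = zarr.create_array(
    store="data.zarr",
    shape=(50_000_000,),
    chunks=(5_000_000,),
    dtype="float64",
    compressors=None,  # drop this for the zstd-compressed variant
)
arr[:] = np.random.default_rng(0).random(50_000_000)

# The read under measurement; run under memray, e.g.
#   memray run read.py && memray flamegraph memray-read.py.*.bin
out = arr[:]
```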
I haven't looked too closely at the code, but I wonder if we could be smarter about a few things in certain cases:
- For the uncompressed case, we might be able to do a `readinto` directly into (an appropriate slice of) the `out` array. We might need to expand the Store API to add some kind of `readinto` method, where the caller provides the buffer to read into rather than the store allocating new memory (see the sketch after this list).
- For the compressed case, we might be able to improve things once we know the size of the output buffers. I see that numcodecs' `zstd.decode` takes an output buffer that we could maybe use (also sketched below). And past that point, maybe all the codecs could reuse one or two buffers, rather than allocating a new buffer for each stage of the codec pipeline (one buffer if a stage can run in place, two buffers if it can't)?
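For the first point, here's a minimal sketch of what a `readinto`-style read could look like. Everything in it is hypothetical: the function name, the file layout, and the chunk path are made up for illustration, and this is not the current Store API.

```python
# Hypothetical readinto sketch: the caller owns the output buffer and
# the store writes chunk bytes straight into a view of it, instead of
# returning a freshly allocated bytes object per chunk.
from pathlib import Path

import numpy as np


def store_readinto(path: Path, out: memoryview) -> int:
    """Read the object at `path` directly into a caller-provided buffer."""
    with open(path, "rb") as f:
        return f.readinto(out)


# Read one 40 MB uncompressed chunk into its slice of the final array,
# with no intermediate allocation. "data.zarr/c/0" is a made-up chunk path.
out = np.empty(50_000_000, dtype="float64")
chunk_view = memoryview(out[:5_000_000]).cast("B")
n_read = store_readinto(Path("data.zarr/c/0"), chunk_view)
```

A remote store could do the same thing by streaming the response body into the provided buffer rather than materializing it as `bytes` first.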
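For the second point, numcodecs codecs already accept an `out=` argument on `decode`, so decoding straight into a preallocated buffer works today at the single-codec level. A sketch (the sizes just mirror the benchmark; whether the codec pipeline can thread the buffer through is the open question):

```python
# Decode a zstd-compressed chunk into a caller-provided buffer using
# numcodecs' existing `out=` argument, avoiding a fresh allocation
# per decode call.
import numpy as np
from numcodecs import Zstd

codec = Zstd(level=1)
chunk = np.random.default_rng(0).random(5_000_000)  # one 40 MB chunk
compressed = codec.encode(chunk)

# Decode directly into the right slice of the final 400 MB array.
out = np.empty(50_000_000, dtype="float64")
codec.decode(compressed, out=out[:5_000_000])
```

Past that, a pipeline-level version might keep two scratch buffers and ping-pong between them for stages that can't run in place.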
I'm not too familiar with the codec pipeline stuff, but I'll look into this as I have time. Others should feel free to take this if it scratches an itch, though. There's some work to be done :)