Storage to S3, LIST operations #460

Open
@Cedric-LG

I have a zarr file on S3 to which I write data every ten minutes. I'm using zarr version 2.3.1 and s3fs to connect to the AWS bucket. The zarr file has the following structure:

/
 └── yyyy
      └── mm
           └── dd
                └── HHMM
                     ├── variable
                     │    └── array (1, 1440, 1440) int32
                     └── variable
                          └── array (1, 1440, 1440) int32
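
To make the layout concrete, here is a minimal sketch of how a timestamp maps to a group path in this hierarchy. The helper name `group_path` is hypothetical (it is not part of zarr or s3fs), and `cth` is used only because it appears as a variable name in the logs further down.

```python
from datetime import datetime


def group_path(timestamp: datetime, variable: str) -> str:
    """Build the yyyy/mm/dd/HHMM/variable path for one ten-minute slot.

    Hypothetical helper for illustration; not a zarr or s3fs function.
    """
    return timestamp.strftime("%Y/%m/%d/%H%M") + "/" + variable


# One slot from 12 June 2019, 10:10, for the variable "cth".
print(group_path(datetime(2019, 6, 12, 10, 10), "cth"))
```

Each new ten-minute slot therefore adds one HHMM group plus one group per variable beneath it.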

As my zarr file grows, I'm noticing an increase in costs due to LIST operations. Digging into the log files, I noticed that creating a zarr array with zarr.create() on S3 involves listing all the groups in the zarr file. A LIST operation on S3 is expensive, and the number of requests grows with the number of groups in the zarr file, so the cost of these (unnecessary?) LIST operations is becoming unsustainable. See an excerpt from the logs:

urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F1010%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F1010%2Fcth%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
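
To see which prefixes are being listed, the request lines can be decoded with a short script. This is a minimal sketch that assumes the urllib3 debug format shown above; the function name `list_prefixes` and the regular expression are my own and not part of any library.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the quoted request part of a urllib3.connectionpool debug line.
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/1\.1"')


def list_prefixes(log_lines):
    """Return the decoded prefix of every S3 ListObjectsV2 request found."""
    prefixes = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match is None:
            continue  # e.g. the urllib3.util.retry lines
        query = parse_qs(urlsplit(match.group(1)).query)
        # list-type=2 marks a ListObjectsV2 call; other GETs are skipped.
        if query.get("list-type") == ["2"]:
            prefixes.append(query.get("prefix", [""])[0])
    return prefixes


sample = [
    'urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False)',
    'urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2'
    '&prefix=file.zarr%2F2019%2F06%2F12%2F1010%2Fcth%2F&delimiter=%2F'
    '&encoding-type=url HTTP/1.1" 200 None',
]
print(list_prefixes(sample))
```

Counting the returned prefixes per array write gives a rough measure of how many LIST requests each write triggers.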

Is there a workaround that doesn't require listing all the groups when pushing an array to a new group? Or is there another way of saving an array to zarr that doesn't require the array to exist already?

Thanks

Cedric

Labels: performance (Potential issues with Zarr performance (I/O, memory, etc.))
