Description
I have a zarr file on S3 where I am storing data every ten minutes. I'm using zarr version 2.3.1 and s3fs to connect to the AWS bucket. The zarr file has the following structure:
```
/
└── yyyy
    └── mm
        └── dd
            └── HHMM
                ├── variable
                │   └── array (1, 1440, 1440) int32
                └── variable
                    └── array (1, 1440, 1440) int32
```
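To illustrate where the LIST calls come from: S3 has no real directories, so a hierarchy like the one above is stored as flat object keys, and s3fs discovers "directories" by issuing a prefix/delimiter LIST per path level. A minimal sketch (the keys are hypothetical, modeled on the paths in my logs):

```python
# Hypothetical flat S3 object keys for the hierarchy above. S3 has no real
# directories, so group discovery means LIST requests with a prefix and
# delimiter, one per level of the path.
keys = [
    "file.zarr/.zgroup",
    "file.zarr/2019/.zgroup",
    "file.zarr/2019/06/.zgroup",
    "file.zarr/2019/06/12/.zgroup",
    "file.zarr/2019/06/12/1010/.zgroup",
    "file.zarr/2019/06/12/1010/cth/.zgroup",
    "file.zarr/2019/06/12/1010/cth/array/.zarray",
]

def list_prefix(keys, prefix, delimiter="/"):
    """Emulate one S3 LIST (list-type=2) call: return the immediate
    children under *prefix*, the way s3fs walks the hierarchy."""
    children = set()
    for key in keys:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            children.add(rest.split(delimiter, 1)[0])
    return sorted(children)

# Checking that a group exists at 2019/06/12 costs one LIST per level:
print(list_prefix(keys, "file.zarr/2019/06/12/"))  # ['.zgroup', '1010']
```

Each of those emulated calls corresponds to one billed LIST request in the logs below, so the cost scales with both the depth of the path and the number of groups already in the store.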
As my zarr file grows, I'm noticing an increase in costs due to LIST operations. Digging into the log files, I noticed that creating a zarr array with zarr.create() on S3 involves listing all the groups in the zarr file. LIST operations on S3 are expensive, and the number of requests grows with the number of groups in the zarr file, so I'm in an unsustainable situation in terms of costs related to these (unnecessary?) LIST operations. See an excerpt of the logs:
```
urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F1010%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
urllib3.util.retry DEBUG Converted retries value: False -> Retry(total=False, connect=None, read=None, redirect=0, status=None)
urllib3.connectionpool DEBUG https://amazonaws.com:443 "GET /?list-type=2&prefix=file.zarr%2F2019%2F06%2F12%2F1010%2Fcth%2F&delimiter=%2F&encoding-type=url HTTP/1.1" 200 None
```
Is there a workaround for this that doesn't require listing all the groups when pushing an array to a new group? Or is there another way of saving an array to zarr that doesn't first check whether the array already exists?
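To illustrate what I mean by listing-free writes: the zarr v2 on-disk layout is deterministic, so in principle the array metadata and chunks could be PUT directly as objects without any existence checks. A minimal sketch, with an in-memory dict standing in for S3 PUTs (the zlib codec, chunking, and key names here are my own assumptions, not what zarr.create() does today):

```python
import json
import zlib

import numpy as np

store = {}  # stands in for S3 put_object calls; no LISTs involved

path = "2019/06/12/1010/cth/array"   # path taken from the logs above
data = np.zeros((1, 1440, 1440), dtype="<i4")

# zarr v2 array metadata, written as a single PUT
store[f"{path}/.zarray"] = json.dumps({
    "zarr_format": 2,
    "shape": list(data.shape),
    "chunks": list(data.shape),      # one chunk covering the whole array
    "dtype": "<i4",
    "compressor": {"id": "zlib", "level": 1},
    "fill_value": 0,
    "filters": None,
    "order": "C",
}).encode()

# the single chunk, compressed with the codec named in the metadata
store[f"{path}/0.0.0"] = zlib.compress(data.tobytes(), 1)

# round-trip check: decode the chunk the way a zarr reader would
meta = json.loads(store[f"{path}/.zarray"])
chunk = np.frombuffer(zlib.decompress(store[f"{path}/0.0.0"]),
                      dtype=meta["dtype"]).reshape(meta["shape"])
assert (chunk == data).all()
```

For zarr itself to open this, the intermediate `.zgroup` objects at each parent level would still be needed; the question is really whether zarr can be told to write them (and the array) without first listing what already exists.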
Thanks
Cedric