Description
I have a dataset of a few TB with 11 parameters and about 100000 chunks, and I am storing it in Azure Blob Storage using the ABSStore mutable mapping. When I call zarr.open_group(store=store, mode='r') with store set to zarr.LRUStoreCache(max_size=2**33, store=zarr.storage.ABSStore('testcontainer', 'mydataset', BLOB_ACCOUNT_NAME, BLOB_ACCOUNT_KEY)), it takes about 45 seconds to open the group. Without the LRU wrapper, the open_group operation is instantaneous.

I traced the problem to the __contains__ method of the LRUStoreCache mutable mapping wrapper (open_group calls the contains_array method). The __contains__ method in LRUStoreCache is implemented by listing all the keys of the underlying store, so all 100000 chunks are listed before the existence check is made. With cloud storage this can cause significant overhead.
This is the current __contains__ method of LRUStoreCache:
    def __contains__(self, key):
        with self._mutex:
            if self._contains_cache is None:
                self._contains_cache = set(self._keys())
            return key in self._contains_cache
When I changed it to this:
    def __contains__(self, key):
        return key in self._store
the open_group operation is almost instantaneous, because the __contains__ method of the underlying ABSStore class uses the exists operation on Azure Blob and so doesn't have to list all keys.
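
To make the difference concrete without needing an Azure account, here is a minimal sketch that compares the two strategies against an in-memory stand-in. CountingStore, ListingContains and DelegatingContains are hypothetical names made up for this illustration; they are not part of zarr, and the real LRUStoreCache also does locking and value caching that is left out here.

```python
from collections.abc import MutableMapping


class CountingStore(MutableMapping):
    """Hypothetical in-memory stand-in for a remote store such as ABSStore.

    It counts full key listings and single-key existence checks so the
    two __contains__ strategies can be compared.
    """

    def __init__(self, data=None):
        self._data = dict(data or {})
        self.list_calls = 0    # number of full key listings
        self.exists_calls = 0  # number of single-key existence checks

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        # Iterating the store stands in for listing every blob.
        self.list_calls += 1
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def __contains__(self, key):
        # A direct membership test stands in for a cheap "exists" request.
        self.exists_calls += 1
        return key in self._data


class ListingContains:
    """Mimics the reported behaviour: list all keys on the first check."""

    def __init__(self, store):
        self._store = store
        self._contains_cache = None

    def __contains__(self, key):
        if self._contains_cache is None:
            self._contains_cache = set(self._store)  # full listing
        return key in self._contains_cache


class DelegatingContains:
    """Mimics the proposed change: ask the underlying store directly."""

    def __init__(self, store):
        self._store = store

    def __contains__(self, key):
        return key in self._store


def make_store(n_chunks=1000):
    # Key names are illustrative only.
    data = {".zgroup": b"{}"}
    data.update({f"param/{i}": b"" for i in range(n_chunks)})
    return CountingStore(data)


listing_store = make_store()
delegating_store = make_store()

found_listing = ".zgroup" in ListingContains(listing_store)
found_delegating = ".zgroup" in DelegatingContains(delegating_store)

print("listing:   ", listing_store.list_calls, "listings,",
      listing_store.exists_calls, "exists checks")
print("delegating:", delegating_store.list_calls, "listings,",
      delegating_store.exists_calls, "exists checks")
```

One trade-off, as far as I can tell: the delegating version sends every membership test to the underlying store, so many repeated checks would each cost a request, whereas the listing version pays once up front and then answers from memory.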