
Optimisation for the __contains__ method of storage.LRUStoreCache #295

Open
@shikharsg


I have a dataset of a few TB with 11 parameters and about 100000 chunks, stored in Azure Blob Storage through the ABSStore mutable mapping. When I call zarr.open_group(store=store, mode='r') with store set to zarr.LRUStoreCache(max_size=2**33, store=zarr.storage.ABSStore('testcontainer', 'mydataset', BLOB_ACCOUNT_NAME, BLOB_ACCOUNT_KEY)), it takes about 45 seconds to open the group. Without the LRU wrapper, the open_group operation is instantaneous. I traced the problem to the __contains__ method of the LRUStoreCache mutable mapping wrapper (open_group calls the contains_array method). The __contains__ method in LRUStoreCache is implemented by listing all the keys of the underlying store, so all 100000 chunks are listed before the existence check is made. With cloud storage, this listing causes significant overhead.

This is the current __contains__ method of LRUStoreCache:

    def __contains__(self, key):
        with self._mutex:
            if self._contains_cache is None:
                self._contains_cache = set(self._keys())
            return key in self._contains_cache

When I changed it to this:

    def __contains__(self, key):
        return key in self._store

the open_group operation is almost instantaneous, because the __contains__ method of the underlying ABSStore class uses the exists call on Azure Blob and so doesn't have to list all keys.
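To make the difference concrete, here is a minimal sketch that does not use zarr or Azure at all. CountingStore, ListingContains and DelegatingContains are hypothetical names invented for this illustration. CountingStore is an in-memory stand-in for a remote store that counts how often it is asked to list its keys versus check a single key. The two wrappers mirror the two __contains__ variants quoted above.

```python
from collections.abc import MutableMapping
from threading import RLock


class CountingStore(MutableMapping):
    """Hypothetical stand-in for a remote store; counts listings and existence checks."""

    def __init__(self, data):
        self._data = dict(data)
        self.list_calls = 0    # full key listings (expensive on cloud storage)
        self.exists_calls = 0  # single-key existence checks (cheap)

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        self.list_calls += 1
        return iter(self._data)

    def __len__(self):
        return len(self._data)

    def __contains__(self, key):
        self.exists_calls += 1
        return key in self._data


class ListingContains:
    """Mirrors the current behaviour: list every key once, then answer from the set."""

    def __init__(self, store):
        self._store = store
        self._mutex = RLock()
        self._contains_cache = None

    def __contains__(self, key):
        with self._mutex:
            if self._contains_cache is None:
                self._contains_cache = set(self._store.keys())
            return key in self._contains_cache


class DelegatingContains:
    """Mirrors the proposed change: ask the underlying store about this one key."""

    def __init__(self, store):
        self._store = store

    def __contains__(self, key):
        return key in self._store


if __name__ == "__main__":
    store = CountingStore({f"arr/{i}": b"" for i in range(100000)})
    store[".zgroup"] = b"{}"

    print(".zgroup" in ListingContains(store), store.list_calls, store.exists_calls)
    print(".zgroup" in DelegatingContains(store), store.list_calls, store.exists_calls)
```

In this sketch the listing variant pays for one full listing up front and then answers later lookups locally, while the delegating variant makes one call to the underlying store per lookup and caches nothing. Which one is cheaper overall depends on how many membership tests follow. For a single open_group on a store with many chunks, the per-key check avoids the listing entirely.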

Labels: performance (Potential issues with Zarr performance: I/O, memory, etc.)
