@@ -729,6 +729,9 @@ group (requires `lmdb <http://lmdb.readthedocs.io/>`_ to be installed)::
>>> z[:] = 42
>>> store.close()

+ Distributed/cloud storage
+ ~~~~~~~~~~~~~~~~~~~~~~~~~
+
It is also possible to use distributed storage systems. The Dask project has
implementations of the ``MutableMapping`` interface for Amazon S3 (`S3Map
<http://s3fs.readthedocs.io/en/latest/api.html#s3fs.mapping.S3Map>`_), Hadoop
@@ -767,6 +770,37 @@ Here is an example using S3Map to read an array created previously::
>>> z[:].tostring()
b'Hello from the cloud!'

+ Note that retrieving data from a remote service over the network can be significantly
+ slower than retrieving data from a local file system, and performance will depend on
+ the network latency and bandwidth between the client and server systems. If you are
+ experiencing poor performance, there are several things you can try. One option is to
+ increase the array chunk size, which will reduce the number of chunks and hence the
+ number of network round-trips required to retrieve data for an array, reducing the
+ impact of network latency. Another option is to increase the compression ratio, either
+ by changing compression options or by trying a different compressor, which will reduce
+ the impact of limited network bandwidth; a brief sketch of both options appears after
+ the caching example below. As of version 2.2, Zarr also provides
+ :class:`zarr.storage.LRUStoreCache`, which can be used to implement a local in-memory
+ cache layer over a remote store. E.g.::
+
+ >>> s3 = s3fs.S3FileSystem(anon=True, client_kwargs=dict(region_name='eu-west-2'))
+ >>> store = s3fs.S3Map(root='zarr-demo/store', s3=s3, check=False)
+ >>> cache = zarr.LRUStoreCache(store, max_size=2**28)
+ >>> root = zarr.group(store=cache)
+ >>> z = root['foo/bar/baz']
+ >>> from timeit import timeit
+ >>> # first data access is relatively slow, retrieved from store
+ ... timeit('print(z[:].tostring())', number=1, globals=globals())  # doctest: +SKIP
+ b'Hello from the cloud!'
+ 0.1081731989979744
+ >>> # second data access is faster, uses cache
+ ... timeit('print(z[:].tostring())', number=1, globals=globals())  # doctest: +SKIP
+ b'Hello from the cloud!'
+ 0.0009490990014455747
+
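+ For reference, here is a minimal sketch of the chunking and compression suggestions
+ above. It assumes write access to the store, and the shape, chunk size, compressor
+ and path used here are illustrative assumptions only, not recommendations::
+
+ >>> from numcodecs import Blosc
+ >>> # larger chunks mean fewer chunks and fewer network round-trips per read
+ ... z = zarr.zeros((10000, 10000), chunks=(2500, 2500), dtype='i4',
+ ...                compressor=Blosc(cname='zstd', clevel=5),
+ ...                store=store, path='foo/bar/big', overwrite=True)  # doctest: +SKIP
+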
+ If you are still experiencing poor performance with distributed/cloud storage, please
+ raise an issue on the GitHub issue tracker with any profiling data you can provide, as
+ there may be opportunities to optimise further either within Zarr or within the mapping
+ interface to the storage.
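+
+ For example, profiling data could be gathered with the standard library's cProfile
+ module (a minimal sketch; the array ``z`` is assumed to be open as in the example
+ above)::
+
+ >>> import cProfile
+ >>> # profile a whole-array read; output is sorted by cumulative time
+ ... cProfile.run('z[:]', sort='cumtime')  # doctest: +SKIP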
.. _tutorial_copy: