Zarr version
v3.0.7
Numcodecs version
Python Version
3.11.12
Operating System
Windows
Installation
uv add zarr
Description
Hi Zarr team,
I’m trying to convert a large .npy file to Zarr v3 format and enable sharding, but I can’t figure out how to correctly set up the ShardingCodec in my script. I’ve attached the full script below. It creates a large random .npy file, then attempts to load it and write it to a Zarr store with chunking and (ideally) sharding enabled.
I’m using Zarr v3, and I saw that sharding is supported via ShardingCodec, but I can’t tell where and how to specify the codec during array creation. I tried importing it and defining a codec instance, but I’m not sure how or where to actually apply it.
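For reference, this is roughly as far as I got on my own. Constructing the codec seems fine, but I don't know which argument of zarr.open_array / zarr.create_array it is meant to be passed to (the chunk_shape and inner codecs here are just my guess at a sensible setup):
from zarr.codecs import ShardingCodec, BytesCodec, ZstdCodec

# Inner chunks of 128**3 elements within each shard, serialized and Zstd-compressed
sharding = ShardingCodec(
    chunk_shape=(128, 128, 128),
    codecs=[BytesCodec(), ZstdCodec()],
)
# ...but where is `sharding` supposed to go when the array is created?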
Could you advise how to modify this script to properly use sharding? Thanks in advance!
Here's some explanation of what I've tried.
I can't load the whole array into memory, since the real use case will be huge. However, zarr.open_array doesn't let me specify sharding.
If I use zarr.create_array, it is extremely slow with shards but fast without shards.
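Concretely, here is a minimal sketch of the two calls I compared, using fp and folder as defined in the script below; the unsharded variant is just the same call without shards:
# Fast to write: chunked only
z = zarr.create_array(
    store=folder / 'test.zarr',
    shape=fp.shape,
    chunks=(128, 128, 128),
    dtype=fp.dtype,
    overwrite=True,
)

# Extremely slow to write: same chunks, grouped into 512**3 shards
z = zarr.create_array(
    store=folder / 'test.zarr',
    shape=fp.shape,
    chunks=(128, 128, 128),
    shards=(512, 512, 512),
    dtype=fp.dtype,
    overwrite=True,
)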
Update:
I looked into this a bit more, and the main issue I notice is that the sharded array gets written much more slowly than when not using sharding. I believe it should be exactly the opposite.
See the attached script below.
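For what it's worth, this is how I compared the two write paths (my own helper, not part of zarr; it just times slab-wise writes from the memmapped .npy):
import time

def timed_write(z, fp, step):
    # Copy the source array into the zarr array in slabs of `step` rows
    # along the first axis and return the elapsed wall-clock time.
    t0 = time.perf_counter()
    for i in range(0, fp.shape[0], step):
        z[i:i + step] = fp[i:i + step]
    return time.perf_counter() - t0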
Steps to reproduce
#%%
from pathlib import Path
import numpy as np
import zarr
# Path to folder containing .npy file
folder = Path(r'data')
folder.mkdir(parents=True, exist_ok=True) # Create folder if it doesn't exist
assert folder.exists(), "Folder does not exist"
filename = folder / 'test.npy'
# create a big random npy file.
shape = (2**12, 2**9, 2**9)
#%%
fp = np.lib.format.open_memmap(
    filename,
    mode='w+',
    dtype=np.int16,
    shape=shape,
)
for i in range(shape[0]):
    fp[i, :, :] = (np.random.random(shape[1:]) * 2**14).astype(np.int16)
    print(i)
#%%
fp.flush()
#%%
# Restart the kernel here before running the next part.
#%%
from pathlib import Path
import numpy as np
import zarr
# Path to folder containing .npy file
folder = Path(r'data')
folder.mkdir(parents=True, exist_ok=True) # Create folder if it doesn't exist
assert folder.exists(), "Folder does not exist"
filename = folder / 'test.npy'
#%%
fp = np.load(filename)
# npy_file = next(folder.glob('*.npy'))
# print(f"Using file: {npy_file}")
# Load using memory mapping
# data = np.load(npy_file, mmap_mode='r')
print(f"Data shape: {fp.shape}, dtype: {fp.dtype}")
# Define chunk and shard sizes
from zarr.codecs import ShardingCodec, ZstdCodec  # imported, but I don't know where to pass these
chunk_size = 128
shard_depth = chunk_size * 4  # each shard spans 4 chunks per axis (512 elements)
z = zarr.open_array(
    store=folder / 'test.zarr',
    mode='w',
    shape=fp.shape,
    chunks=(chunk_size, chunk_size, chunk_size),
    dtype=fp.dtype,
    zarr_version=3,
)
# Sharded variant via zarr.create_array: accepts shards, but writing is extremely slow
# z = zarr.create_array(
#     store=folder / 'test.zarr',
#     shape=fp.shape,
#     chunks=(chunk_size, chunk_size, chunk_size),
#     shards=(shard_depth, shard_depth, shard_depth),
#     dtype=fp.dtype,
#     overwrite=True,
# )
# Write the data one shard-sized slab at a time
for i in range(0, fp.shape[0], shard_depth):
    z[i:i + shard_depth] = fp[i:i + shard_depth]
    print(f"Wrote slice {i} to {i + shard_depth}")
# Last slab (redundant here, since shape[0] is a multiple of shard_depth)
z[i:] = fp[i:]
print("Zarr conversion complete.")
# %%
Additional output
Either the slow sharded writes are the bug, or the way sharding gets enabled in create_array is; I'm not sure which is the correct framing.
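For scale, the chunk and shard sizes in the script work out as follows (plain arithmetic, nothing zarr-specific):
import numpy as np

chunk_size, shard_depth = 128, 512
itemsize = np.dtype(np.int16).itemsize              # 2 bytes per element
chunk_mib = chunk_size**3 * itemsize / 2**20        # 4 MiB per chunk
shard_mib = shard_depth**3 * itemsize / 2**20       # 256 MiB per shard
chunks_per_shard = (shard_depth // chunk_size)**3   # 64 chunks inside each shard
print(chunk_mib, shard_mib, chunks_per_shard)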
Here's my pyproject.toml
[project]
name = "some-name"
version = "0.1.0"
description = "Add your description here"
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "ipykernel>=6.29.5",
    "matplotlib>=3.10.1",
    "numpy>=2.2.5",
    "pydantic>=2.11.3",
    "scipy>=1.15.2",
    "zarr>=2.18.3",
]