Description
When writing a few contiguous elements to an uncompressed array (or to a single chunk of an array), it should be possible to do a simple seek-and-write directly on disk. That operation should be nearly free and nearly independent of the chunk size. However, the cost seems to scale with the chunk size, just as it does for compressed arrays; I assume this means that, under the hood, zarr reads the entire chunk into memory, modifies it, and writes the whole chunk back to disk.
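To make the expectation concrete, here is a minimal sketch of the kind of seek-and-write I have in mind. It works on a plain raw file of little-endian float64 values in C order with no header, which is what I assume an uncompressed, unfiltered chunk looks like on disk. The helper `write_in_place` and the file name `chunk.bin` are hypothetical, made up for this illustration; this is not zarr's API.

```python
import os
import tempfile

import numpy as np


def write_in_place(path, start, values):
    """Overwrite float64 elements beginning at element index `start`.

    Hypothetical helper: assumes `path` is a headerless raw file of
    little-endian float64 values stored contiguously.
    """
    values = np.ascontiguousarray(values, dtype="<f8")
    # "r+b" opens for update without truncating the existing file
    with open(path, "r+b") as f:
        f.seek(start * values.itemsize)  # byte offset of the first element
        f.write(values.tobytes())        # only these bytes are written


n = 1_000_000
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunk.bin")
    np.zeros(n, dtype="<f8").tofile(path)  # stand-in for an uncompressed chunk

    patch = np.arange(1.0, 11.0)           # 10 values to write
    write_in_place(path, n // 2, patch)

    result = np.fromfile(path, dtype="<f8")
    file_size = os.path.getsize(path)
```

The number of bytes passed to `write` depends only on the size of `patch`, not on `n`, which is why I would expect such a write to be roughly independent of the chunk size.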
Here is a simple script to showcase the effect:
import zarr
import numpy as np
import time
import os
import shutil

# Sizes to test
array_sizes = [10**3, 10**4, 10**5, 10**6, 10**7, 10**8]

# Data to write: an array of 10 random floats
data_to_write = np.random.rand(10)

# Results storage
results = []

for size in array_sizes:
    print(f"Testing array size: {size}")

    # Define paths for Zarr arrays
    uncompressed_path = f'zarr_uncompressed_{size}.zarr'
    compressed_path = f'zarr_compressed_{size}.zarr'

    # Remove existing arrays if they exist
    for path in [uncompressed_path, compressed_path]:
        if os.path.exists(path):
            shutil.rmtree(path)

    # Create uncompressed Zarr array
    uncompressed_array = zarr.open(
        uncompressed_path,
        mode='w',
        shape=(size,),
        chunks=(size,),  # Single chunk covering the entire array
        dtype='float64',
        compressor=None  # No compression
    )

    # Create compressed Zarr array (default compressor; Blosc in zarr-python 2.x)
    compressed_array = zarr.open(
        compressed_path,
        mode='w',
        shape=(size,),
        chunks=(size,),  # Single chunk
        dtype='float64'  # Using default compression
    )

    # Initialize arrays with zeros
    uncompressed_array[:] = 0.0
    compressed_array[:] = 0.0

    # Define the index range for writing
    start_index = size // 2  # Middle of the array
    end_index = start_index + 10

    # Measure write speed for uncompressed array
    start_time = time.time()
    uncompressed_array[start_index:end_index] = data_to_write
    uncompressed_write_time = time.time() - start_time

    # Measure write speed for compressed array
    start_time = time.time()
    compressed_array[start_index:end_index] = data_to_write
    compressed_write_time = time.time() - start_time

    # Store the results
    results.append({
        'array_size': size,
        'uncompressed_write_time': uncompressed_write_time,
        'compressed_write_time': compressed_write_time
    })

    print(f"Uncompressed write time: {uncompressed_write_time:.6f} seconds")
    print(f"Compressed write time: {compressed_write_time:.6f} seconds\n")
Testing array size: 1000
Uncompressed write time: 0.000219 seconds
Compressed write time: 0.000234 seconds

Testing array size: 10000
Uncompressed write time: 0.000454 seconds
Compressed write time: 0.000268 seconds

Testing array size: 100000
Uncompressed write time: 0.001391 seconds
Compressed write time: 0.000645 seconds

Testing array size: 1000000
Uncompressed write time: 0.015249 seconds
Compressed write time: 0.001800 seconds

Testing array size: 10000000
Uncompressed write time: 0.196500 seconds
Compressed write time: 0.029215 seconds

Testing array size: 100000000
Uncompressed write time: 1.841548 seconds
Compressed write time: 0.242806 seconds
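
To put a number on "scales with the chunk size", the short sketch below estimates the log-log slope of write time against chunk size from the timings reported above. The helper `loglog_slope` is a hypothetical name introduced here, and a two-point slope from a single run is only a rough indication, not a proper benchmark.

```python
import math

# Timings copied from the output above (seconds)
sizes = [10**3, 10**4, 10**5, 10**6, 10**7, 10**8]
uncompressed = [0.000219, 0.000454, 0.001391, 0.015249, 0.196500, 1.841548]
compressed = [0.000234, 0.000268, 0.000645, 0.001800, 0.029215, 0.242806]


def loglog_slope(x0, x1, t0, t1):
    """Exponent k in t ~ x**k estimated from two (size, time) points."""
    return math.log(t1 / t0) / math.log(x1 / x0)


# Use the largest sizes (10**6 to 10**8), where fixed overhead matters least
slope_unc = loglog_slope(sizes[3], sizes[5], uncompressed[3], uncompressed[5])
slope_cmp = loglog_slope(sizes[3], sizes[5], compressed[3], compressed[5])

# How much slower the uncompressed write is at each size
ratios = [u / c for u, c in zip(uncompressed, compressed)]

print(f"uncompressed slope: {slope_unc:.2f}")
print(f"compressed slope:   {slope_cmp:.2f}")
print("uncompressed/compressed:", [f"{r:.1f}" for r in ratios])
```

A slope near 1 means the write time grows in proportion to the chunk size, whereas a true seek-and-write would give a slope near 0. With these numbers both slopes come out close to 1 over the largest sizes, and the uncompressed write is several times slower than the compressed one.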