Unnecessarily slow writes to uncompressed arrays #2518

Open
@jbuckman

Description

When writing some contiguous elements to an uncompressed array (or chunk of an array), it is possible to use a simple seek-and-write directly on disk. This operation should be nearly free, and nearly independent of the chunk size. However, the cost seems to scale with the chunk size, just as for compressed arrays; I assume this means that under the hood, zarr is reading the entire chunk into memory and writing it all back to disk again.
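For reference, a minimal sketch of the seek-and-write approach described above, operating on a plain binary file on local disk (the filename `chunk.bin` and the helper `write_slice` are illustrative, not part of zarr's API):

```python
import numpy as np

def write_slice(path, start, values, dtype=np.float64):
    # Overwrite `values` at element offset `start` in an existing raw
    # binary file. Only the changed bytes are written, so the cost is
    # independent of the total file size.
    arr = np.asarray(values, dtype=dtype)
    with open(path, "r+b") as f:
        f.seek(start * arr.itemsize)
        f.write(arr.tobytes())

# Create a 10**6-element file of zeros, then patch 10 elements in the middle.
np.zeros(10**6, dtype=np.float64).tofile("chunk.bin")
write_slice("chunk.bin", 500_000, np.random.rand(10))
```

This is the cost model the issue expects for uncompressed chunks: one seek plus one 80-byte write, regardless of chunk size.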

Here is a simple script to showcase the effect:

import zarr
import numpy as np
import time
import os
import shutil

# Sizes to test
array_sizes = [10**3, 10**4, 10**5, 10**6, 10**7, 10**8]

# Data to write: an array of 10 random floats
data_to_write = np.random.rand(10)

# Results storage
results = []

for size in array_sizes:
    print(f"Testing array size: {size}")
    # Define paths for Zarr arrays
    uncompressed_path = f'zarr_uncompressed_{size}.zarr'
    compressed_path = f'zarr_compressed_{size}.zarr'
    # Remove existing arrays if they exist
    for path in [uncompressed_path, compressed_path]:
        if os.path.exists(path):
            shutil.rmtree(path)
    # Create uncompressed Zarr array
    uncompressed_array = zarr.open(
        uncompressed_path,
        mode='w',
        shape=(size,),
        chunks=(size,),  # Single chunk covering the entire array
        dtype='float64',
        compressor=None  # No compression
    )
    # Create compressed Zarr array (default compressor is Blosc)
    compressed_array = zarr.open(
        compressed_path,
        mode='w',
        shape=(size,),
        chunks=(size,),  # Single chunk
        dtype='float64'  # Using default compression
    )
    # Initialize arrays with zeros
    uncompressed_array[:] = 0.0
    compressed_array[:] = 0.0
    # Define the index range for writing
    start_index = size // 2  # Middle of the array
    end_index = start_index + 10
    # Measure write speed for uncompressed array
    start_time = time.time()
    uncompressed_array[start_index:end_index] = data_to_write
    uncompressed_write_time = time.time() - start_time
    # Measure write speed for compressed array
    start_time = time.time()
    compressed_array[start_index:end_index] = data_to_write
    compressed_write_time = time.time() - start_time
    # Store the results
    results.append({
        'array_size': size,
        'uncompressed_write_time': uncompressed_write_time,
        'compressed_write_time': compressed_write_time
    })
    print(f"Uncompressed write time: {uncompressed_write_time:.6f} seconds")
    print(f"Compressed write time:   {compressed_write_time:.6f} seconds\n")
Output:

Testing array size: 1000
Uncompressed write time: 0.000219 seconds
Compressed write time:   0.000234 seconds

Testing array size: 10000
Uncompressed write time: 0.000454 seconds
Compressed write time:   0.000268 seconds

Testing array size: 100000
Uncompressed write time: 0.001391 seconds
Compressed write time:   0.000645 seconds

Testing array size: 1000000
Uncompressed write time: 0.015249 seconds
Compressed write time:   0.001800 seconds

Testing array size: 10000000
Uncompressed write time: 0.196500 seconds
Compressed write time:   0.029215 seconds

Testing array size: 100000000
Uncompressed write time: 1.841548 seconds
Compressed write time:   0.242806 seconds
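The suspected read-modify-write behavior can be reproduced outside zarr by timing both strategies on a raw file of the largest size tested (the filename `bench.bin` and the variable names are illustrative):

```python
import time
import numpy as np

N = 10**7
patch = np.random.rand(10)
np.zeros(N, dtype=np.float64).tofile("bench.bin")

# What zarr appears to do internally: read the whole chunk into memory,
# modify the slice, and write the whole chunk back.
t0 = time.perf_counter()
buf = np.fromfile("bench.bin", dtype=np.float64)
buf[N // 2 : N // 2 + 10] = patch
buf.tofile("bench.bin")
full_time = time.perf_counter() - t0

# Direct seek-and-write: only the 80 changed bytes touch the file.
t0 = time.perf_counter()
with open("bench.bin", "r+b") as f:
    f.seek((N // 2) * patch.itemsize)
    f.write(patch.tobytes())
direct_time = time.perf_counter() - t0

print(f"read-modify-write: {full_time:.6f}s  seek-and-write: {direct_time:.6f}s")
```

Both leave the file in the same state; only the direct write stays cheap as the chunk grows, which is consistent with the timings reported above.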

Labels: V2 (Affects the v2 branch), performance (Potential issues with Zarr performance (I/O, memory, etc.))
