ENH: Add Zstd Compression Option for Parquet Output #55571

Open
@alexozwald

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The current default compression method when writing a Parquet file is 'snappy', a somewhat dated codec compared with 'zstandard' ('zstd'). I would like an option that sets the default compression method for Parquet write operations (i.e. pandas.options.io.parquet.compression). Adding this option should also start a conversation about later switching the default compression method from snappy to zstandard.

Considerations for zstd over snappy:

  • Offers a higher compression ratio
  • Offers compression/decompression overhead similar to snappy
  • It is a newer format (potentially a con)

Feature Description

Add a new option, io.parquet.compression, that sets the default compression algorithm for Parquet writes, so that 'zstd' can be chosen as the default. Document the option so that pandas.describe_option("io.parquet.compression") lists the available values: snappy, gzip, brotli, lz4, zstd.

Current behavior of io.parquet options:

>>> pandas.describe_option("io.parquet")
io.parquet.engine : string
    The default parquet reader/writer engine. Available options:
    'auto', 'pyarrow', 'fastparquet', the default is 'auto'
    [default: auto] [currently: auto]

Proposed documentation for pandas.describe_option("io.parquet.compression"):

io.parquet.compression : string
    The default parquet writer compression type. Available options:
    'snappy', 'gzip', 'brotli', 'lz4', 'zstd'. The default is 'snappy'
    [default: snappy] [currently: snappy]
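
To illustrate how such an option might interact with the existing compression argument, here is a minimal standalone sketch. It is not pandas internals: _options, set_default_compression, and resolve_compression are hypothetical names used only for illustration. A sentinel is used because compression=None already means "no compression" in to_parquet, so None cannot double as "not passed".

```python
# Hypothetical sketch of the proposed behaviour; none of these names exist in pandas.
VALID_COMPRESSIONS = ("snappy", "gzip", "brotli", "lz4", "zstd")
_UNSET = object()  # sentinel: distinguishes "argument not passed" from compression=None

# Stand-in for the proposed io.parquet.compression option store.
_options = {"io.parquet.compression": "snappy"}


def set_default_compression(value):
    """Validate and store the default, as the proposed option would."""
    if value not in VALID_COMPRESSIONS:
        raise ValueError(
            f"compression must be one of {VALID_COMPRESSIONS}, got {value!r}"
        )
    _options["io.parquet.compression"] = value


def resolve_compression(compression=_UNSET):
    """Return the explicit argument if one was given, else the configured default."""
    if compression is _UNSET:
        return _options["io.parquet.compression"]
    return compression  # may be None, meaning "write uncompressed"
```

With this precedence, existing calls that pass compression explicitly would keep their behaviour, and only calls that omit it would pick up the configured default.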

Alternative Solutions

It is already possible to specify Zstandard compression manually on each Parquet write:

df.to_parquet('data.parquet', compression='zstd')
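
Until an option exists, the repetition can be reduced with a small project-level helper. The sketch below is one possible workaround, not part of pandas; to_parquet_zstd and DEFAULT_COMPRESSION are hypothetical names. It works with any object that exposes a to_parquet(path, **kwargs) method, such as a DataFrame.

```python
DEFAULT_COMPRESSION = "zstd"  # assumption: a project-wide choice, not a pandas setting


def to_parquet_zstd(df, path, **kwargs):
    """Call df.to_parquet, using zstd unless the caller passes compression."""
    # setdefault leaves an explicit compression=... untouched, including None.
    kwargs.setdefault("compression", DEFAULT_COMPRESSION)
    return df.to_parquet(path, **kwargs)
```

Calling to_parquet_zstd(df, 'data.parquet') then writes with zstd, while to_parquet_zstd(df, 'data.parquet', compression='snappy') still honours the explicit choice.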

Additional Context

No response
