Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
The current default compression method for writing Parquet files is 'snappy', a somewhat dated format compared with 'zstandard/zstd'. I want to add an option for setting a default compression method for Parquet write operations (i.e. pandas.options.io.parquet.compression). Adding this option should also start a conversation about later switching the default compression method from snappy to zstandard.
Considerations for zstd compared with snappy:
- Offers a higher compression ratio
- Offers similar overhead time to snappy for compression/decompression
- It is a newer format (potentially a con)
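The fallback logic the proposal implies can be sketched as follows. This is only an illustration: the helper name `resolve_parquet_compression` is hypothetical, and in the actual proposal the `default` argument would come from `pd.get_option("io.parquet.compression")` rather than a plain parameter.

```python
def resolve_parquet_compression(compression=None, default="snappy"):
    """Hypothetical helper sketching how to_parquet could pick its
    compression: an explicit argument wins, otherwise fall back to a
    configurable default (the proposed io.parquet.compression option)."""
    valid = {"snappy", "gzip", "brotli", "lz4", "zstd"}
    if compression is None:
        compression = default
    if compression not in valid:
        raise ValueError(f"unsupported parquet compression: {compression!r}")
    return compression

# Today's behavior: no argument resolves to 'snappy'.
print(resolve_parquet_compression())              # snappy
# With the proposed option set to 'zstd', the same call would resolve to:
print(resolve_parquet_compression(default="zstd"))  # zstd
```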
Feature Description
Add a new io.parquet.compression option that sets the default compression algorithm for Parquet files, with 'zstd' among the supported values. Create the option and its documentation, pandas.describe_option("io.parquet.compression"), listing the available choices: snappy, gzip, brotli, lz4, zstd.
Current behavior of io.parquet options:
>>> pandas.describe_option("io.parquet")
io.parquet.engine : string
The default parquet reader/writer engine. Available options:
'auto', 'pyarrow', 'fastparquet', the default is 'auto'
[default: auto] [currently: auto]
Proposed addition:
Proposed documentation for pandas.describe_option("io.parquet.compression"):
io.parquet.compression : string
The default parquet writer compression type. Available options:
'snappy', 'gzip', 'brotli', 'lz4', 'zstd'. The default is 'snappy'
[default: snappy] [currently: snappy]
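For comparison, this is how the existing io.parquet.engine option is set and read today; the commented lines show what the equivalent calls for the proposed option would look like (they do not work yet, since the option does not exist):

```python
import pandas as pd

# Existing option in the same io.parquet namespace:
pd.set_option("io.parquet.engine", "auto")
print(pd.get_option("io.parquet.engine"))  # auto

# Hypothetical usage once the proposed option lands (not yet in pandas):
# pd.set_option("io.parquet.compression", "zstd")
# df.to_parquet("data.parquet")  # would now compress with zstd by default
```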
Alternative Solutions
Today, ZStandard compression can only be selected by passing it explicitly to every parquet-write operation:
df.to_parquet('data.parquet', compression='zstd')
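To avoid repeating the argument at every call site, the workaround above can be wrapped once with functools.partial. The name `to_parquet_zstd` is purely illustrative, not a pandas API:

```python
import functools

import pandas as pd

# A partially-applied writer with zstd baked in, so individual call
# sites no longer need to repeat compression='zstd'.
to_parquet_zstd = functools.partial(pd.DataFrame.to_parquet, compression="zstd")

# Usage (actually writing the file requires a parquet engine such as pyarrow):
# df = pd.DataFrame({"a": [1, 2, 3]})
# to_parquet_zstd(df, "data.parquet")
```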
Additional Context
No response