
Add filters parameter to pandas.read_parquet() to enable PyArrow/Parquet partition filtering #26551

Closed
@rjurney

Description


Code Sample (copy-pastable example)

import pandas as pd

ticker = 'AA'

stocks_close_df = pd.read_parquet(
    'data/v4.parquet',
    columns=['DateTime', 'Close', 'Ticker'],
    engine='pyarrow',
    filters=[('Ticker', '=', ticker)]
)

# The filters argument above should have the same effect as this
# in-memory filter, without reading the other partitions:
stocks_close_df = stocks_close_df[stocks_close_df.Ticker == ticker]

This results in the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-7-450ddb513430> in <module>
      6     columns=['DateTime', 'Close', 'Ticker'],
      7     engine='pyarrow',
      8     filters=[('Ticker','=',ticker)]
      9 )
     10 stocks_close_df.index = stocks_close_df['DateTime']

~/anaconda3/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
    280 
    281     impl = get_engine(engine)
    282     return impl.read(path, columns=columns, **kwargs)

~/anaconda3/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
    127         kwargs['use_pandas_metadata'] = True
    128         result = self.api.parquet.read_table(path, columns=columns,
    129                                              **kwargs).to_pandas()
    130         if should_close:
    131             try:

TypeError: read_table() got an unexpected keyword argument 'filters'

Problem description

I would like to pass a filters argument from pandas.read_parquet through to the pyarrow engine in order to filter on partitions in Parquet files. The pyarrow engine already has this capability; it is just a matter of passing the filters argument through.
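
One possible shape for the change, sketched against the PyArrowImpl.read method shown in the traceback above. This is a minimal sketch, not the final implementation; routing filtered reads through ParquetDataset (whose filters keyword already exists) is an assumption about how the pass-through would be wired:

def read(self, path, columns=None, filters=None, **kwargs):
    kwargs['use_pandas_metadata'] = True
    if filters is not None:
        # Assumption: ParquetDataset's filters keyword prunes Hive-style
        # partition directories, so filtered partitions are never read.
        dataset = self.api.parquet.ParquetDataset(path, filters=filters)
        return dataset.read(columns=columns, **kwargs).to_pandas()
    return self.api.parquet.read_table(path, columns=columns,
                                       **kwargs).to_pandas()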

From a discussion on [email protected]:

But, filtering could also be done when reading the parquet file(s), to actually prevent reading everything into memory. However, this is only partly implemented in pyarrow at this moment. If you have a dataset consisting of partitioned files in nested directories (Hive like), pyarrow can filter on which files to read. See the "filters" keyword of ParquetDataset (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html). I am only not fully sure you can already use this through the pandas interface, it might be you need to use the pyarrow interface directly (in which case, feel free to open an issue on the pandas issue tracker).

Note that pyarrow.parquet.ParquetDataset (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html) takes a filters argument of type List[Tuple], List[List[Tuple]], or None (default).
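
As a workaround until such a pass-through exists, the same filtering can be done through pyarrow directly. A minimal sketch, reusing the path and column names from the example above and assuming the dataset is Hive-partitioned with Ticker as a partition key:

import pyarrow.parquet as pq

# filters is in disjunctive normal form: tuples in an inner list are
# ANDed together, and multiple inner lists are ORed, e.g.
# [[('Ticker', '=', 'AA')], [('Ticker', '=', 'AAPL')]] selects AA or AAPL.
# In pyarrow 0.13 this prunes partition directories; it does not filter
# individual rows inside a file.
dataset = pq.ParquetDataset('data/v4.parquet',
                            filters=[('Ticker', '=', 'AA')])
stocks_close_df = dataset.read(columns=['DateTime', 'Close', 'Ticker'],
                               use_pandas_metadata=True).to_pandas()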

Expected Output

A filtered pandas.DataFrame.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-50-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: 0.13.0
xarray: None
IPython: 7.5.0
sphinx: 2.0.1
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: 0.3.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None
