Description
Code Sample, a copy-pastable example if possible
import pandas as pd

ticker = 'AA'
stocks_close_df = pd.read_parquet(
    'data/v4.parquet',
    columns=['DateTime', 'Close', 'Ticker'],
    engine='pyarrow',
    filters=[('Ticker', '=', ticker)]
)
# This is the filtering the call above should perform
stocks_close_df = stocks_close_df[stocks_close_df.Ticker == ticker]
This results in the following exception:
TypeError Traceback (most recent call last)
<ipython-input-7-450ddb513430> in <module>
6 columns=['DateTime', 'Close', 'Ticker'],
7 engine='pyarrow',
8 filters=[('Ticker','=',ticker)]
9 )
10 stocks_close_df.index = stocks_close_df['DateTime']
~/anaconda3/lib/python3.6/site-packages/pandas/io/parquet.py in read_parquet(path, engine, columns, **kwargs)
280
281 impl = get_engine(engine)
282 return impl.read(path, columns=columns, **kwargs)
~/anaconda3/lib/python3.6/site-packages/pandas/io/parquet.py in read(self, path, columns, **kwargs)
127 kwargs['use_pandas_metadata'] = True
128 result = self.api.parquet.read_table(path, columns=columns,
129 **kwargs).to_pandas()
130 if should_close:
131 try:
TypeError: read_table() got an unexpected keyword argument 'filters'
Problem description
I would like to pass a filters argument from pandas.read_parquet through to the pyarrow engine to do filtering on partitions in Parquet files. The pyarrow engine already has this capability; it is just a matter of passing the filters argument through.
From a discussion on [email protected]:
But, filtering could also be done when reading the parquet file(s), to actually prevent reading everything into memory. However, this is only partly implemented in pyarrow at this moment. If you have a dataset consisting of partitioned files in nested directories (Hive like), pyarrow can filter on which files to read. See the "filters" keyword of ParquetDataset (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html). I am only not fully sure you can already use this through the pandas interface, it might be you need to use the pyarrow interface directly (in which case, feel free to open an issue on the pandas issue tracker).
Note that pyarrow.parquet.ParquetDataset (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html) takes a filters (List[Tuple] or List[List[Tuple]] or None (default)) argument.
Expected Output
A filtered pandas.DataFrame.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-50-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 41.0.1
Cython: 0.29.7
numpy: 1.16.3
scipy: 1.2.1
pyarrow: 0.13.0
xarray: None
IPython: 7.5.0
sphinx: 2.0.1
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.3
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: 0.3.1
pandas_gbq: None
pandas_datareader: None
gcsfs: None