Closed
Description
- I have checked that this issue has not already been reported.
Two variants of this bug have been reported: "BUG: pd.read_parquet with pyarrow fails when row number is 0 and contains Pandas extensions type" (#35436) and "BUG: read-only buffer failures in datetime parsing" (#34857).
EDIT: I read into those two issues a bit more. They don't seem similar, but I'll keep the references here.
- I have confirmed this bug exists on the latest version of pandas.
The bug exists in pandas 1.1.1.
Code Sample, a copy-pastable example
import pandas as pd
import pyarrow as pa
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "species": ["setosa", "setosa", "setosa", "setosa", "setosa"],
    }
)
context = pa.default_serialization_context()
data = context.serialize(df).to_buffer().to_pybytes()
df_new = context.deserialize(data)
# This fails with "ValueError: buffer source array is read-only".
df_new.groupby(["species"]).agg({"sepal_length": "sum"})
# This works when the frame is copied first:
# df_new.copy().groupby(["species"]).agg({"sepal_length": "sum"})
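The error is about a read-only buffer, and copying the frame avoids it. One plausible reading, which is an assumption and not something confirmed in this report, is that the deserialized columns are NumPy views over the immutable bytes payload. The sketch below uses NumPy only (no pyarrow, and not pyarrow's actual wire format) to show the general behaviour: an array wrapped around a `bytes` object is read-only, and `.copy()` yields a writable one.

```python
import numpy as np

# Pack five float64 values into an immutable bytes object. This stands in
# for a serialized payload; it is NOT pyarrow's real serialization format.
payload = np.array([5.1, 4.9, 4.7, 4.6, 5.0]).tobytes()

# np.frombuffer wraps the bytes without copying, so the array is read-only.
view = np.frombuffer(payload, dtype=np.float64)
print(view.flags.writeable)  # False

try:
    view[0] = 0.0  # writing into read-only memory is rejected
except ValueError as exc:
    print("write rejected:", exc)

# Copying allocates fresh memory owned by the new array, which is writable.
owned = view.copy()
owned[0] = 0.0
print(owned.flags.writeable)  # True
```

If the same thing happens to `df_new`, that would be consistent with `df_new.copy()` working while `df_new` itself does not.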
Problem description
Running the code sample produces this traceback:
Traceback (most recent call last):
  File "demo.py", line 16, in <module>
    df_new.groupby(["species"]).agg({"sepal_length": "sum"})
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 949, in aggregate
    result, how = self._aggregate(func, *args, **kwargs)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/base.py", line 416, in _aggregate
    result = _agg(arg, _agg_1dim)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/base.py", line 383, in _agg
    result[fname] = func(fname, agg_how)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/base.py", line 367, in _agg_1dim
    return colg.aggregate(how)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 240, in aggregate
    return getattr(self, func)(*args, **kwargs)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1539, in sum
    return self._agg_general(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 999, in _agg_general
    return self._cython_agg_general(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1033, in _cython_agg_general
    result, agg_names = self.grouper.aggregate(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 584, in aggregate
    return self._cython_operation(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 537, in _cython_operation
    result = self._aggregate(result, counts, values, codes, func, min_count)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 599, in _aggregate
    agg_func(result, counts, values, comp_ids, min_count)
  File "pandas/_libs/groupby.pyx", line 475, in pandas._libs.groupby._group_add
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only
In the failing .agg line, a min, max, median, or count aggregation works.
A sum or mean aggregation fails.
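To check which aggregations are affected on a given frame, each one can be tried in turn. The helper below is a rough sketch: `probe_aggregations` is a hypothetical name, not a pandas API, and the frame used here is freshly constructed (not deserialized), so every aggregation is expected to succeed on it. Passing the deserialized `df_new` from the code sample instead should show which aggregations raise in an affected environment.

```python
import pandas as pd


def probe_aggregations(df, key, column,
                       hows=("min", "max", "median", "count", "sum", "mean")):
    """Try each aggregation; map its name to None on success or the error text."""
    outcomes = {}
    for how in hows:
        try:
            df.groupby([key]).agg({column: how})
            outcomes[how] = None
        except ValueError as exc:
            # The report shows ValueError for read-only buffers.
            outcomes[how] = str(exc)
    return outcomes


# A freshly built frame with ordinary, writable data (not the deserialized one).
df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "species": ["setosa", "setosa", "setosa", "setosa", "setosa"],
    }
)

outcomes = probe_aggregations(df, "species", "sepal_length")
for how, error in outcomes.items():
    print(how, "ok" if error is None else f"failed: {error}")
```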
Expected Output
I expected the aggregation to succeed without any error.
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : f2ca0a2665b2d169c97de87b8e778dbed86aea07
python : 3.8.5.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-7642-generic
Version : #46~1597422484~20.04~e78f762-Ubuntu SMP Wed Aug 19 14:35:06 UTC
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.17.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None