Skip to content

BUG: groupby and agg on read-only array gives ValueError: buffer source array is read-only #36014

Closed
@jeet-parekh

Description

@jeet-parekh

Code Sample, a copy-pastable example

import pandas as pd
import pyarrow as pa

df = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
        "species": ["setosa", "setosa", "setosa", "setosa", "setosa"],
    }
)

context = pa.default_serialization_context()
data = context.serialize(df).to_buffer().to_pybytes()
df_new = context.deserialize(data)

# this fails
df_new.groupby(["species"]).agg({"sepal_length": "sum"})

# this works
# df_new.copy().groupby(["species"]).agg({"sepal_length": "sum"})

Problem description

This is the traceback.

Traceback (most recent call last):
  File "demo.py", line 16, in <module>
    df_new.groupby(["species"]).agg({"sepal_length": "sum"})
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 949, in aggregate
    result, how = self._aggregate(func, *args, **kwargs)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/base.py", line 416, in _aggregate
    result = _agg(arg, _agg_1dim)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/base.py", line 383, in _agg
    result[fname] = func(fname, agg_how)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/base.py", line 367, in _agg_1dim
    return colg.aggregate(how)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/generic.py", line 240, in aggregate
    return getattr(self, func)(*args, **kwargs)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1539, in sum
    return self._agg_general(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 999, in _agg_general
    return self._cython_agg_general(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 1033, in _cython_agg_general
    result, agg_names = self.grouper.aggregate(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 584, in aggregate
    return self._cython_operation(
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 537, in _cython_operation
    result = self._aggregate(result, counts, values, codes, func, min_count)
  File "/home/jeet/miniconda3/envs/rnd/lib/python3.8/site-packages/pandas/core/groupby/ops.py", line 599, in _aggregate
    agg_func(result, counts, values, comp_ids, min_count)
  File "pandas/_libs/groupby.pyx", line 475, in pandas._libs.groupby._group_add
  File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
  File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
ValueError: buffer source array is read-only

In the .agg line that fails, if you do a min, max, median, or count aggregation, then it's going to work.

But if you do a sum or mean, then it fails.

Expected Output

I expected the aggregation to succeed without any error.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : f2ca0a2665b2d169c97de87b8e778dbed86aea07
python           : 3.8.5.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.4.0-7642-generic
Version          : #46~1597422484~20.04~e78f762-Ubuntu SMP Wed Aug 19 14:35:06 UTC 
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.1
numpy            : 1.19.1
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.2.2
setuptools       : 49.6.0.post20200814
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.8.5 (dt dec pq3 ext lo64)
jinja2           : 2.11.2
IPython          : 7.17.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 1.0.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.5.2
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

Metadata

Metadata

Assignees

No one assigned

    Labels

    GroupbyRegressionFunctionality that used to work in a prior pandas version

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions