
Categorical in GroupBy with aggregations raises an error under specific conditions #36698

Closed

Description

@ant1j
import pandas as pd

ticks = pd.DataFrame.from_dict({
    'cid':    [1, 1, 2, 2, 3],
    'date':   ['2019-01-01', '2020-01-02', '2020-01-03', '2019-01-04', '2020-01-05'],
    'tid':    [1, 2, 3, 4, 5],
    'amount': [1, 1, 2, 2, 3],
})
ticks['date'] = pd.to_datetime(ticks['date'])
ticks['year'] = ticks['date'].dt.year
ticks['year'] = ticks['year'].astype('category')

ticks.groupby(['cid', 'year'], as_index=False, observed=False).agg({'amount': sum})

This raises: ValueError: Length of values (5) does not match length of index (6)

Full traceback

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-15-502fa0f135e2> in <module>
      9 
     10 
---> 11         ticks.groupby(['cid', 'year'], as_index=False, observed=False).agg({'amount': sum})
     12 )

c:\users\a.jouanjean\htdocs\factbook-py\.venv\lib\site-packages\pandas\core\groupby\generic.py in aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
    992 
    993         if not self.as_index:
--> 994             self._insert_inaxis_grouper_inplace(result)
    995             result.index = np.arange(len(result))
    996 

c:\users\a.jouanjean\htdocs\factbook-py\.venv\lib\site-packages\pandas\core\groupby\generic.py in _insert_inaxis_grouper_inplace(self, result)
   1716             # When using .apply(-), name will be in columns already
   1717             if in_axis and name not in columns:
-> 1718                 result.insert(0, name, lev)
   1719 
   1720     def _wrap_aggregated_output(

c:\users\a.jouanjean\htdocs\factbook-py\.venv\lib\site-packages\pandas\core\frame.py in insert(self, loc, column, value, allow_duplicates)
   3620         """
   3621         self._ensure_valid_index(value)
-> 3622         value = self._sanitize_column(column, value, broadcast=False)
   3623         self._mgr.insert(loc, column, value, allow_duplicates=allow_duplicates)
   3624 

c:\users\a.jouanjean\htdocs\factbook-py\.venv\lib\site-packages\pandas\core\frame.py in _sanitize_column(self, key, value, broadcast)
   3761 
   3762             # turn me into an ndarray
-> 3763             value = sanitize_index(value, self.index)
   3764             if not isinstance(value, (np.ndarray, Index)):
   3765                 if isinstance(value, list) and len(value) > 0:

c:\users\a.jouanjean\htdocs\factbook-py\.venv\lib\site-packages\pandas\core\internals\construction.py in sanitize_index(data, index)
    745     """
    746     if len(data) != len(index):
--> 747         raise ValueError(
    748             "Length of values "
    749             f"({len(data)}) "

ValueError: Length of values (5) does not match length of index (6)

Problem description

After quite some time spent narrowing down the origin of a ValueError: Length of values (N) does not match length of index (M), it seems to occur only when all of the following conditions are met:

  • groupby() is done using a categorical variable in the by list
  • as_index=False (as_index=True is OK)
  • observed=False (observed=True is OK)
  • aggregate() is performed (applying sum() directly on the DataFrameGroupBy is OK)

See the different combinations in detail below.

Expected Output

Would it be possible to perform an early check and raise a clear error/exception when these conditions are met?

It would definitely help users understand where the problem comes from and how to correct it.

I am aware that some parts of the issue are being addressed (see PR #35967), but that alone will not help users understand what is actually going on when all of these conditions are met.
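
For illustration only, such an early check could look roughly like the user-side guard sketched below. The helper name check_categorical_groupby is hypothetical and not part of the pandas API; this is just a sketch of the kind of validation being suggested.

import pandas as pd
from pandas.api.types import is_categorical_dtype

def check_categorical_groupby(df, by, as_index, observed):
    # Hypothetical guard: refuse the combination that currently breaks .agg()
    # (categorical grouping key + as_index=False + observed=False).
    has_categorical_key = any(is_categorical_dtype(df[col]) for col in by)
    if has_categorical_key and not as_index and not observed:
        raise ValueError(
            "groupby with a categorical key, as_index=False and observed=False "
            "currently fails in .agg(); use as_index=True or observed=True instead."
        )

# Example usage against the frame above (raises the descriptive error):
check_categorical_groupby(ticks, ['cid', 'year'], as_index=False, observed=False)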

Conditions under which the error is not raised

# with observed=True
ticks.groupby(['cid', 'year'], as_index=False, observed=True).agg({'amount': sum})

# with as_index=True
ticks.groupby(['cid', 'year'], as_index=True, observed=False).agg({'amount': sum})

# calling sum() directly instead of aggregate() [this also sums tid, but raises no error]
ticks.groupby(['cid', 'year'], as_index=False, observed=False).sum()
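
As a side note (not part of the original report, just a sketch based on the working combinations above), grouping with as_index=True and resetting the index afterwards appears to give the result that as_index=False was meant to produce:

# Workaround sketch: use as_index=True, which works, then flatten the index.
result = (
    ticks.groupby(['cid', 'year'], as_index=True, observed=False)
         .agg({'amount': sum})
         .reset_index()
)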

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 2a7d332
python : 3.8.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : fr_FR.cp1252

pandas : 1.1.2
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
