Skip to content

BUG: pandas 1.5 fails to groupby on (nullable) Int64 column with dropna=False #48794

Closed
@sm-Fifteen

Description

@sm-Fifteen

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a':np.random.randn(150),
    'b':np.repeat( [25,50,75,100,np.nan], 30),
    'c':np.random.choice( ['panda','python','shark'], 150),
})

df['b'] = df['b'].astype("Int64")
df = df.set_index(['b', 'c'])

groups = df.groupby(axis='rows', level='b', dropna=False)

# Crashes here with something like "index 91 is out of bounds for axis 0 with size 4"
groups.ngroups

Issue Description

Attempting to run df.groupby() with an Int64Dtype column (nullable integers) no longer works as of Pandas 1.5.0. Seems to be caused by #46601, and I'm unsure if the patch at #48702 fixes this issue.

Traceback (most recent call last):
  File "pandas15_groupby_bug.py", line 17, in <module>
    groups.ngroups
  File "venv\lib\site-packages\pandas\core\groupby\groupby.py", line 992, in __getattribute__
    return super().__getattribute__(attr)
  File "venv\lib\site-packages\pandas\core\groupby\groupby.py", line 671, in ngroups
    return self.grouper.ngroups
  File "pandas\_libs\properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "venv\lib\site-packages\pandas\core\groupby\ops.py", line 983, in ngroups
    return len(self.result_index)
  File "pandas\_libs\properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "venv\lib\site-packages\pandas\core\groupby\ops.py", line 994, in result_index
    return self.groupings[0].result_index.rename(self.names[0])
  File "pandas\_libs\properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "venv\lib\site-packages\pandas\core\groupby\grouper.py", line 648, in result_index
    return self.group_index
  File "pandas\_libs\properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "venv\lib\site-packages\pandas\core\groupby\grouper.py", line 656, in group_index
    uniques = self._codes_and_uniques[1]
  File "pandas\_libs\properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
  File "venv\lib\site-packages\pandas\core\groupby\grouper.py", line 693, in _codes_and_uniques
    codes, uniques = algorithms.factorize(  # type: ignore[assignment]
  File "venv\lib\site-packages\pandas\core\algorithms.py", line 792, in factorize
    codes, uniques = values.factorize(  # type: ignore[call-arg]
  File "venv\lib\site-packages\pandas\core\arrays\masked.py", line 918, in factorize
    uniques = np.insert(uniques, na_code, 0)
  File "<__array_function__ internals>", line 180, in insert
  File "venv\lib\site-packages\numpy\lib\function_base.py", line 5332, in insert
    raise IndexError(f"index {obj} is out of bounds for axis {axis} "
IndexError: index 91 is out of bounds for axis 0 with size 4

Expected Behavior

Groupby shouldn't throw, and behave the same as on Pandas 1.4.

Installed Versions

commit : 87cfe4e python : 3.8.10.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19042 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : French_Canada.1252 pandas : 1.5.0 numpy : 1.23.3 pytz : 2022.2.1 dateutil : 2.8.2 setuptools : 60.2.0 pip : 21.3.1 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : 3.0.3 lxml.etree : None html5lib : None pymysql : None psycopg2 : 2.9.3 jinja2 : 3.1.2 IPython : None pandas_datareader: None bs4 : None bottleneck : None brotli : None fastparquet : None fsspec : 2022.8.2 gcsfs : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : None pandas_gbq : None pyarrow : 9.0.0 pyreadstat : None pyxlsb : None s3fs : None scipy : None snappy : None sqlalchemy : 1.3.24 tables : None tabulate : 0.8.10 xarray : None xlrd : None xlwt : None zstandard : None tzdata : None

Metadata

Metadata

Assignees

Labels

BugGroupbyNA - MaskedArraysRelated to pd.NA and nullable extension arraysRegressionFunctionality that used to work in a prior pandas version

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions