Closed
Description
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd
df = pd.DataFrame({
'a':np.random.randn(150),
'b':np.repeat( [25,50,75,100,np.nan], 30),
'c':np.random.choice( ['panda','python','shark'], 150),
})
df['b'] = df['b'].astype("Int64")
df = df.set_index(['b', 'c'])
groups = df.groupby(axis='rows', level='b', dropna=False)
# Crashes here with something like "index 91 is out of bounds for axis 0 with size 4"
groups.ngroups
Issue Description
Attempting to run df.groupby()
with an Int64Dtype
column (nullable integers) no longer works as of Pandas 1.5.0. Seems to be caused by #46601, and I'm unsure if the patch at #48702 fixes this issue.
Traceback (most recent call last):
File "pandas15_groupby_bug.py", line 17, in <module>
groups.ngroups
File "venv\lib\site-packages\pandas\core\groupby\groupby.py", line 992, in __getattribute__
return super().__getattribute__(attr)
File "venv\lib\site-packages\pandas\core\groupby\groupby.py", line 671, in ngroups
return self.grouper.ngroups
File "pandas\_libs\properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
File "venv\lib\site-packages\pandas\core\groupby\ops.py", line 983, in ngroups
return len(self.result_index)
File "pandas\_libs\properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
File "venv\lib\site-packages\pandas\core\groupby\ops.py", line 994, in result_index
return self.groupings[0].result_index.rename(self.names[0])
File "pandas\_libs\properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
File "venv\lib\site-packages\pandas\core\groupby\grouper.py", line 648, in result_index
return self.group_index
File "pandas\_libs\properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
File "venv\lib\site-packages\pandas\core\groupby\grouper.py", line 656, in group_index
uniques = self._codes_and_uniques[1]
File "pandas\_libs\properties.pyx", line 36, in pandas._libs.properties.CachedProperty.__get__
File "venv\lib\site-packages\pandas\core\groupby\grouper.py", line 693, in _codes_and_uniques
codes, uniques = algorithms.factorize( # type: ignore[assignment]
File "venv\lib\site-packages\pandas\core\algorithms.py", line 792, in factorize
codes, uniques = values.factorize( # type: ignore[call-arg]
File "venv\lib\site-packages\pandas\core\arrays\masked.py", line 918, in factorize
uniques = np.insert(uniques, na_code, 0)
File "<__array_function__ internals>", line 180, in insert
File "venv\lib\site-packages\numpy\lib\function_base.py", line 5332, in insert
raise IndexError(f"index {obj} is out of bounds for axis {axis} "
IndexError: index 91 is out of bounds for axis 0 with size 4
Expected Behavior
Groupby shouldn't throw, and behave the same as on Pandas 1.4.
Installed Versions
commit : 87cfe4e
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : French_Canada.1252
pandas : 1.5.0
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 60.2.0
pip : 21.3.1
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.1.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.8.2
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 9.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : 1.3.24
tables : None
tabulate : 0.8.10
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None