Description
Code Sample
Using the latest version of pandas (v0.25.1):
import numpy as np
import pandas as pd
np.random.seed(12345)
df1 = pd.DataFrame({
    'category': ['A', 'A', 'A', 'A',
                 'B', 'B', 'B', 'B',
                 ],
    'value': np.random.randint(1, 10, 8)
})
df1.groupby("category").value.quantile([0.25, 0.75])
produces
category
A         0.25    2.75
          0.75    5.25
B         0.25    2.75
          0.75    6.25
Name: value, dtype: float64
as expected. However, running this
np.random.seed(12345)
df2 = pd.DataFrame({
    'category': ['A', 'A', 'A', 'A',
                 'B', 'B', 'B', 'B',
                 'C', 'C', 'C', 'C',
                 ],
    'value': np.random.randint(1, 10, 12)
})
df2.groupby("category").value.quantile([0.25, 0.75])
produces this error instead:
IndexError Traceback (most recent call last)
<ipython-input-60-12c4dbb665fc> in <module>
8 })
9
---> 10 df2.groupby("category").value.quantile([0.25, 0.75])
~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in quantile(self, q, interpolation)
1951 indices = np.concatenate(arrays)
1952 assert len(indices) == len(result)
-> 1953 return result.take(indices)
1954
1955 @Substitution(name="groupby")
~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/series.py in take(self, indices, axis, is_copy, **kwargs)
4430
4431 indices = ensure_platform_int(indices)
-> 4432 new_index = self.index.take(indices)
4433
4434 if is_categorical_dtype(self):
~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/indexes/multi.py in take(self, indices, axis, allow_fill, fill_value, **kwargs)
2030 allow_fill=allow_fill,
2031 fill_value=fill_value,
-> 2032 na_value=-1,
2033 )
2034 return MultiIndex(
~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/indexes/multi.py in _assert_take_fillable(self, values, indices, allow_fill, fill_value, na_value)
2058 taken = masked
2059 else:
-> 2060 taken = [lab.take(indices) for lab in self.codes]
2061 return taken
2062
~/anaconda3/envs/dabest-dev-py3.7/lib/python3.7/site-packages/pandas/core/indexes/multi.py in <listcomp>(.0)
2058 taken = masked
2059 else:
-> 2060 taken = [lab.take(indices) for lab in self.codes]
2061 return taken
2062
IndexError: index 6 is out of bounds for size 6
The expected output is produced with pandas 0.24:
df2.groupby("category").value.quantile([0.25, 0.75])
category
A         0.25    2.75
          0.75    5.25
B         0.25    2.75
          0.75    6.25
C         0.25    1.75
          0.75    7.25
I am not sure how to work around this.
I understand a related bug was patched with #28285 and #27526.
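One possible stopgap, offered as a sketch and not a confirmed fix: pass the list of quantiles to Series.quantile inside GroupBy.apply, so each group is handled separately and GroupBy.quantile (the call that raises in the traceback above) is never reached. This assumes the bug is limited to GroupBy.quantile with a list-like q. The names quantiles and workaround are only illustrative.

```python
import numpy as np
import pandas as pd

# Same data as df2 in the report above.
np.random.seed(12345)
df2 = pd.DataFrame({
    "category": list("AAAABBBBCCCC"),
    "value": np.random.randint(1, 10, 12),
})

quantiles = [0.25, 0.75]  # illustrative name, not part of the original report

# Series.quantile accepts a list and returns a Series indexed by quantile.
# GroupBy.apply stacks those per-group results under the group key, giving
# a (category, quantile) MultiIndex like the pandas 0.24 output.
workaround = (
    df2.groupby("category")["value"]
    .apply(lambda s: s.quantile(quantiles))
)
print(workaround)
```

This is likely slower than a native GroupBy.quantile on large frames, because apply calls the Python lambda once per group, so it is only meant as a temporary measure until a fixed release is available.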
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
pandas : 0.25.1
numpy : 1.16.2
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : 4.3.0
hypothesis : None
sphinx : 2.2.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.2.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : None
tables : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None