Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
UPD: the bug no longer exists on the latest master branch, but it is still present in 1.1.3 installed from pip.
Updated version details:
INSTALLED VERSIONS
commit : db08276
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.1.3
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.0
pip : 20.2.3
setuptools : 40.8.0
Cython : 0.29.21
pytest : 6.0.1
hypothesis : 5.36.0
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.4
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.0
numexpr : None
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None
Edit: removed everything related to categorical rhetoric, updated with findings. Corrected several typos.
Code Sample, a copy-pastable example
UPD: A clearer sample provided by @AlexanderCecile, which shows that the bug is triggered by tuple group names returned from the lambda, not by tuples in cell values.
import random
import pandas as pd
df = pd.DataFrame({"col_1": [random.randint(0, 5) for _ in range(5)]})
print(df, end="\n\n")
grp_tuples = [("a", "b"), ("c", "d"), ("e", "f")]
# grp_res = df.groupby(by="col_1")
grp_res = df.groupby(by=lambda x: grp_tuples[x % len(grp_tuples)])
for key, grp in grp_res:
    print(f"key: {key}\ngroup:\n{grp}\n")
print(grp_res.indices)
print(grp_res.grouper.indices)
Original sample
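import numpy as np
import pandas as pd
df = pd.DataFrame({'Tuples': ((x, y)
                              for x in [0, 1]
                              for y in np.random.randint(3, 5, 5))})
# group by the column of tuples
df.groupby('Tuples').grouper.indices
# group by a lambda returning the same tuples; on pandas 1.1.3 this raises
# NotImplementedError: isna is not defined for MultiIndex
df.groupby(lambda x: df.iloc[x,0]).indices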
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 'Tuples' | (0,3) | (0,3) | (0,3) | (0,3) | (0,3) | (1,4) | (1,4) | (1,4) | (1,3) | (1,4) |
Problem description
Problem: .indices does not work on a lambda-made GroupBy whose group names are tuples.
In the sample above, the two .groupby calls do, in crude terms, the same thing and produce groups with the same characteristics: they group the DataFrame by column values that are tuples. The sample uses a single-column DataFrame, but the issue remains the same if other columns are added. The two GroupBy objects agree in some respects:
- Both have same output for .groups:
{(0, 3): [0, 1, 2, 3, 4], (1, 3): [8], (1, 4): [5, 6, 7, 9]}
- Both show the same .head(), .size() and some other methods listed here. Groups can be selected with a tuple key such as (0, 3), with identical output in both cases.
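A minimal sketch of the .groups comparison described above, using the fixed values from the table instead of random ones. The helper name as_plain_dict is introduced here only for illustration, and the sketch assumes a pandas version on which both .groups calls succeed.

```python
import pandas as pd

# Same values as in the table above, hard-coded so the run is reproducible.
tuples = [(0, 3)] * 5 + [(1, 4), (1, 4), (1, 4), (1, 3), (1, 4)]
df = pd.DataFrame({"Tuples": tuples})

by_column = df.groupby("Tuples")
by_lambda = df.groupby(lambda x: df.iloc[x, 0])


def as_plain_dict(groups):
    """Convert a .groups mapping into {key: list of row labels} for comparison."""
    return {key: list(labels) for key, labels in groups.items()}


groups_by_column = as_plain_dict(by_column.groups)
groups_by_lambda = as_plain_dict(by_lambda.groups)

# Both GroupBy objects are expected to report the same groups.
print(groups_by_column == groups_by_lambda)
```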
And here come the differences:
gb.transform('size')
+ just shows the index of a DF
- IndexError: Column(s) Tuples already selected
gb1.sum() / gb1.agg(sum)
+ shows column 'Tuples' with recurring tuples, e.g. (0, 3), (1, 3)
+ in case GroupBy is prepared with as_index=False, also shows index
- show index and 'Tuples' column where tuples are added per group
- on each line: (0, 3, 0, 3, 0, 3), (1, 3, 1, 3)
gb.all()
+ index with tuples per line, e.g. (0, 3), (1, 3)
+ if as_index=False:
+ ValueError: Length mismatch: Expected axis has 0 elements, new values have 3 elements
- table with tuples (indexes if as_index=False) and True values
gb.count()
+ Tuples column and no counts
+ if as_index=False:
+ IndexError: list index out of range
- table with tuples (indexes if as_index=False) and actual count of groups
Basically all aggregation methods behave as shown above. However, if the GroupBy is made on a Series, df['Tuples'].groupby(df['Tuples']), the majority of aggregation functions start working as they do for the lambda-made GroupBy: the output is the same, but it is shown as plain text in Jupyter rather than as a table, which I guess is unexpected behavior. That's curious, but these are differences one can live with.
Main issue
Tuple group names in a GroupBy made with a lambda or named function break the .get_group() method and the .indices attribute.
As mentioned in the docs, GroupBy works pretty well with two-dimensional keys, although it lacks the handy level selectors that MultiIndex has. Perhaps that's a topic for a feature request, not an issue.
The sample in the docs produces the same kind of tuple group names, dict_keys( [ ('Store_1', 'Product_1'), ... () ], so this should not be a problem. Also, a GroupBy made with a lambda works fine with numeric / string labels.
With tuples, .indices and .get_group fail with the error NotImplementedError: isna is not defined for MultiIndex.
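The error itself can be looked at in isolation. The sketch below is only an illustration of the message in the tracebacks, not of the groupby code path: it calls pd.isna on a hand-built MultiIndex and on the same tuples flattened into a plain object Index. The variable names are made up for the example, and whether the first call raises may depend on the pandas version, so the result is recorded rather than assumed.

```python
import pandas as pd

# A MultiIndex built from the same kind of tuple keys as the group names.
mi = pd.MultiIndex.from_tuples([(0, 3), (1, 3), (1, 4)])

# On pandas 1.1.3 the traceback ends in this NotImplementedError.
try:
    pd.isna(mi)
    isna_raised = False
except NotImplementedError as exc:
    isna_raised = True
    print(f"NotImplementedError: {exc}")

# The same keys as a flat object Index of tuples can be checked for NaN.
flat = mi.to_flat_index()
flat_mask = pd.isna(flat)
print(isna_raised, flat_mask)
```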
.get_group()
.get_group() Traceback
ERROR
df.groupby(lambda x: df.iloc[x,0]).get_group((0,3))
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-3-c37a9ee93781> in <module>
----> 1 df.groupby(lambda x: df.iloc[x,0]).get_group((0,3))

Part 1 => .get_indices

c:\python37\lib\site-packages\pandas\core\groupby\groupby.py in get_group(self, name, obj)
    806             obj = self._selected_obj
    807
--> 808         inds = self._get_index(name)
    809         if not len(inds):
    810             raise KeyError(name)

c:\python37\lib\site-packages\pandas\core\groupby\groupby.py in _get_index(self, name)
    628         Safe get index, translate keys for datelike to underlying repr.
    629         """
--> 630         return self._get_indices([name])[0]
    631
    632     @cache_readonly

c:\python37\lib\site-packages\pandas\core\groupby\groupby.py in _get_indices(self, names)
    593             return []
    594
--> 595         if len(self.indices) > 0:
    596             index_sample = next(iter(self.indices))
    597         else:

c:\python37\lib\site-packages\pandas\core\groupby\groupby.py in indices(self)
    572         """
    573         self._assure_grouper()
--> 574         return self.grouper.indices
    575
    576     def _get_indices(self, names):

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

Part 2 => Categories nan

c:\python37\lib\site-packages\pandas\core\groupby\ops.py in indices(self)
    222         """ dict {group name -> group indices} """
    223         if len(self.groupings) == 1:
--> 224             return self.groupings[0].indices
    225         else:
    226             codes_list = [ping.codes for ping in self.groupings]

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\groupby\grouper.py in indices(self)
    558             return self.grouper.indices
    559
--> 560         values = Categorical(self.grouper)
    561         return values._reverse_indexer()
    562

c:\python37\lib\site-packages\pandas\core\arrays\categorical.py in __init__(self, values, categories, ordered, dtype, fastpath)
    360
    361             # we're inferring from values
--> 362             dtype = CategoricalDtype(categories, dtype.ordered)
    363
    364         elif is_categorical_dtype(values.dtype):

c:\python37\lib\site-packages\pandas\core\dtypes\dtypes.py in __init__(self, categories, ordered)
    161
    162     def __init__(self, categories=None, ordered: Ordered = False):
--> 163         self._finalize(categories, ordered, fastpath=False)
    164
    165     @classmethod

c:\python37\lib\site-packages\pandas\core\dtypes\dtypes.py in _finalize(self, categories, ordered, fastpath)
    315
    316         if categories is not None:
--> 317             categories = self.validate_categories(categories, fastpath=fastpath)
    318
    319         self._categories = categories

c:\python37\lib\site-packages\pandas\core\dtypes\dtypes.py in validate_categories(categories, fastpath)
    487         if not fastpath:
    488
--> 489             if categories.hasnans:
    490                 raise ValueError("Categorical categories cannot be null")
    491

Part 3 => nan, Multiindex

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\indexes\base.py in hasnans(self)
   2046         """
   2047         if self._can_hold_na:
-> 2048             return bool(self._isnan.any())
   2049         else:
   2050             return False

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\indexes\base.py in _isnan(self)
   2026         """
   2027         if self._can_hold_na:
-> 2028             return isna(self)
   2029         else:
   2030             # shouldn't reach to this condition by checking hasnans beforehand

c:\python37\lib\site-packages\pandas\core\dtypes\missing.py in isna(obj)
    122     Name: 1, dtype: bool
    123     """
--> 124     return _isna(obj)
    125
    126

c:\python37\lib\site-packages\pandas\core\dtypes\missing.py in _isna(obj, inf_as_na)
    151     # hack (for now) because MI registers as ndarray
    152     elif isinstance(obj, ABCMultiIndex):
--> 153         raise NotImplementedError("isna is not defined for MultiIndex")
    154     elif isinstance(obj, type):
    155         return False

NotImplementedError: isna is not defined for MultiIndex
>>> End of Traceback <<<
.indices
.indices Traceback
ERROR
df.groupby(lambda x: df.iloc[x,0]).indices
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-5-fd5a9f33be4e> in <module>
----> 1 df.groupby(lambda x: df.iloc[x,0]).indices

Part 1 => groupings, Categorical

c:\python37\lib\site-packages\pandas\core\groupby\groupby.py in indices(self)
    572         """
    573         self._assure_grouper()
--> 574         return self.grouper.indices
    575
    576     def _get_indices(self, names):

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\groupby\ops.py in indices(self)
    222         """ dict {group name -> group indices} """
    223         if len(self.groupings) == 1:
--> 224             return self.groupings[0].indices
    225         else:
    226             codes_list = [ping.codes for ping in self.groupings]

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\groupby\grouper.py in indices(self)
    558             return self.grouper.indices
    559
--> 560         values = Categorical(self.grouper)
    561         return values._reverse_indexer()
    562

c:\python37\lib\site-packages\pandas\core\arrays\categorical.py in __init__(self, values, categories, ordered, dtype, fastpath)
    360
    361             # we're inferring from values
--> 362             dtype = CategoricalDtype(categories, dtype.ordered)
    363
    364         elif is_categorical_dtype(values.dtype):

c:\python37\lib\site-packages\pandas\core\dtypes\dtypes.py in __init__(self, categories, ordered)
    161
    162     def __init__(self, categories=None, ordered: Ordered = False):
--> 163         self._finalize(categories, ordered, fastpath=False)
    164
    165     @classmethod

c:\python37\lib\site-packages\pandas\core\dtypes\dtypes.py in _finalize(self, categories, ordered, fastpath)
    315
    316         if categories is not None:
--> 317             categories = self.validate_categories(categories, fastpath=fastpath)
    318
    319         self._categories = categories

Part 2 => Categories nan

c:\python37\lib\site-packages\pandas\core\dtypes\dtypes.py in validate_categories(categories, fastpath)
    487         if not fastpath:
    488
--> 489             if categories.hasnans:
    490                 raise ValueError("Categorical categories cannot be null")
    491

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\indexes\base.py in hasnans(self)
   2046         """
   2047         if self._can_hold_na:
-> 2048             return bool(self._isnan.any())
   2049         else:
   2050             return False

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\indexes\base.py in _isnan(self)
   2026         """
   2027         if self._can_hold_na:
-> 2028             return isna(self)
   2029         else:
   2030             # shouldn't reach to this condition by checking hasnans beforehand

c:\python37\lib\site-packages\pandas\core\dtypes\missing.py in isna(obj)
    122     Name: 1, dtype: bool
    123     """
--> 124     return _isna(obj)
    125
    126

c:\python37\lib\site-packages\pandas\core\dtypes\missing.py in _isna(obj, inf_as_na)
    151     # hack (for now) because MI registers as ndarray
    152     elif isinstance(obj, ABCMultiIndex):
--> 153         raise NotImplementedError("isna is not defined for MultiIndex")
    154     elif isinstance(obj, type):
    155         return False

NotImplementedError: isna is not defined for MultiIndex
>>> End of Traceback <<<
In both cases the problem seems to start at Categorical(self.grouper), where no categories are found; that's just my guess.
The problem starts during grouper init (line 413): if a lambda or function is used, grouper is assigned the lambda or named function respectively, and line 440 is skipped because it does not see the MultiIndex behind it.
Later, line 512 triggers the conversion and line 518 assigns the actual value (a MultiIndex) in place of the function; however, this path skips the cast on line 440, and the bug appears.
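The step where a function turns into a MultiIndex can be reproduced with public API only. This is a sketch under the assumption that the grouper init effectively maps the function over the index, which is how I read the lines referenced above; it does not touch the internals. It reuses grp_tuples from the clearer sample.

```python
import pandas as pd

grp_tuples = [("a", "b"), ("c", "d"), ("e", "f")]
index = pd.RangeIndex(5)

# Index.map returns a MultiIndex when the mapped function returns tuples,
# so a tuple-returning grouping function ends up as a MultiIndex.
mapped = index.map(lambda x: grp_tuples[x % len(grp_tuples)])

print(type(mapped).__name__)
print(mapped.tolist())
```

If that reading is right, any function returning tuples ends up as a MultiIndex, which is the type the isna check refuses.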
Most of the attributes listed in grouper.py work for the grouper of a lambda-made GroupBy, to name a few: codes, ngroups, result_index, and everything BaseGrouper has. Only indices and get_group do not work.
Pandas is a great module and I'm happy to use it without .indices, especially since I can get the same output with a for loop over the .groups PrettyDict. Still, this is a bug, and the use of multi-level group names is a potential growth area for the module.
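For reference, a minimal sketch of the for-loop workaround over .groups mentioned above. The small fixed DataFrame and the name positions are made up for the example, and the sketch assumes .groups works for the lambda-made GroupBy, as it does in my tests.

```python
import pandas as pd

df = pd.DataFrame({"Tuples": [(0, 3), (0, 3), (1, 4), (1, 3), (1, 4)]})
gb = df.groupby(lambda x: df.iloc[x, 0])

# .groups maps each group key to the row labels of that group;
# Index.get_indexer converts those labels into integer positions,
# which is what .indices is expected to return.
positions = {}
for key, labels in gb.groups.items():
    positions[key] = df.index.get_indexer(labels)

for key, pos in positions.items():
    print(key, pos)
```

With the default RangeIndex the labels and positions coincide, so get_indexer only matters when the DataFrame has a non-default index.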
I will be happy to keep digging into this problem on my own and will appreciate any input on the cause of this error.
Expected Output
>>> df.groupby(lambda x: df.iloc[x,0]).indices
{(0, 3): array([0, 1, 2, 3, 4], dtype=int64),
(1, 3): array([8], dtype=int64),
(1, 4): array([5, 6, 7, 9], dtype=int64)}
Output of pd.show_versions()
INSTALLED VERSIONS
commit : d9fff27
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.0
pip : 20.2.2
setuptools : 40.8.0
Cython : None
pytest : 6.0.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.4
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.0
numexpr : None
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None