
BUG: indices and get_group raise exception with GroupBy initialized using lambda or named function with tuple values as group names (Multi-dimensional groups) #36158

Closed
@dimithras

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.

UPD: the bug no longer exists on the latest master branch, but it is still present in 1.1.3 from pip.
Updated version details:

INSTALLED VERSIONS

commit : db08276
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.1.3
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.0
pip : 20.2.3
setuptools : 40.8.0
Cython : 0.29.21
pytest : 6.0.1
hypothesis : 5.36.0
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.4
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.0
numexpr : None
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None


Edit: removed everything related to the categorical discussion, updated with findings. Corrected several typos.


Code Sample, a copy-pastable example

UPD: A clearer sample provided by @AlexanderCecile, which shows that the bug is triggered by tuple group names returned from the lambda, not by tuples in cell values.

import random

import pandas as pd

df = pd.DataFrame({"col_1": [random.randint(0, 5) for _ in range(5)]})

print(df, end="\n\n")

grp_tuples = [("a", "b"), ("c", "d"), ("e", "f")]

# grp_res = df.groupby(by="col_1")
grp_res = df.groupby(by=lambda x: grp_tuples[x % len(grp_tuples)])

for key, grp in grp_res:
    print(f"key: {key}\ngroup:\n{grp}\n")

print(grp_res.indices)
print(grp_res.grouper.indices)
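
Original sample

import numpy as np
import pandas as pd

# One column whose cell values are tuples (x, y)
df = pd.DataFrame({'Tuples': [(x, y)
                              for x in [0, 1]
                              for y in np.random.randint(3, 5, 5)]})
df.groupby('Tuples').grouper.indices         # grouping by column name
df.groupby(lambda x: df.iloc[x, 0]).indices  # grouping by lambda: raises on 1.1.x

Sample data:

index     0      1      2      3      4      5      6      7      8      9
'Tuples'  (0,3)  (0,3)  (0,3)  (0,3)  (0,3)  (1,4)  (1,4)  (1,4)  (1,3)  (1,4)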

Problem description

Problem: .indices does not work on a GroupBy built with a lambda that returns tuples.

In the sample above, the two .groupby calls do roughly the same thing and produce groups with the same characteristics: they group the DataFrame by column values that are tuples. Technically a single-column DataFrame is close to a Series, but the issue remains the same if other columns are added. However, these groups behave oddly in the following ways:

  • Both have same output for .groups:

{(0, 3): [0, 1, 2, 3, 4], (1, 3): [8], (1, 4): [5, 6, 7, 9]}

  • Both show the same .head(), .size() and some other methods listed here. Groups can be selected with a tuple key such as (0, 3), with identical output.

And here come the differences:

gb.transform('size')
+ just shows the index of a DF
- IndexError: Column(s) Tuples already selected 

gb1.sum() / gb1.agg(sum)
+ shows column 'Tuples' with recurring tuples, e.g. (0, 3), (1, 3)
+ in case GroupBy is prepared with as_index=False, also shows index
- show index and 'Tuples' column where tuples are added per group
- on each line: (0, 3, 0, 3, 0, 3), (1, 3, 1, 3)

gb.all()
+ index with tuples per line, e.g. (0, 3), (1, 3)
+ if as_index=False:
+ ValueError: Length mismatch: Expected axis has 0 elements, new values have 3 elements 
- table with tuples (indexes if as_index=False) and True values

gb.count()
+ Tuples column and no counts
+ if as_index=False:
+ IndexError: list index out of range
- table with tuples (indexes if as_index=False) and actual count of groups

Basically all aggregation methods behave as shown above. However, if the GroupBy is built on a Series, df['Tuples'].groupby(df['Tuples']), most aggregation functions start working as they do for the lambda-built GroupBy. The output is the same, but it is rendered as plain text rather than as a table in Jupyter, which I guess is unexpected behavior. These differences are curious, but one can live with them.
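
The plain-text rendering mentioned above is probably explained by the result type: an aggregation on a Series GroupBy returns a Series, which Jupyter prints as text, while the same aggregation on a DataFrame GroupBy returns a DataFrame. A minimal sketch, assuming a pandas version where the lambda-based grouping works (the fixed data below mirrors the sample table and is not taken from a real run):

```python
import pandas as pd

# Fixed data mirroring the sample table in this report.
df = pd.DataFrame({"Tuples": [(0, 3)] * 5 + [(1, 4), (1, 4), (1, 4), (1, 3), (1, 4)]})

# Aggregating a Series grouped by itself yields a Series.
series_result = df["Tuples"].groupby(df["Tuples"]).count()

# Aggregating the DataFrame grouped by a lambda yields a DataFrame.
frame_result = df.groupby(lambda x: df.iloc[x, 0]).count()

print(type(series_result).__name__)
print(type(frame_result).__name__)
```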

Main issue

Tuple group names in a lambda or named-function GroupBy break the .get_group() method and the .indices attribute.

As mentioned in the docs, GroupBy works pretty well with two-dimensional keys, although it lacks a handful of the level selectors that MultiIndex has. Perhaps that is a topic for a feature request rather than an issue.

The sample in the docs produces the same kind of tuple groups, dict_keys([('Store_1', 'Product_1'), ... ()]), so this should not be a problem. A GroupBy made with a lambda also works fine with numeric or string labels.

With tuples, .indices and .get_group fail with NotImplementedError: isna is not defined for MultiIndex.


.get_group()

.get_group() Traceback
ERROR
df.groupby(lambda x: df.iloc[x,0]).get_group((0,3))
-   ---------------------------------------------------------------------------
-   NotImplementedError                       Traceback (most recent call last)
-   <ipython-input-3-c37a9ee93781> in <module>
-   ----> 1 df.groupby(lambda x: df.iloc[x,0]).get_group((0,3))
Part 1 => .get_indices
c:\python37\lib\site-packages\pandas\core\groupby\groupby.py in get_group(self, name, obj)
    806             obj = self._selected_obj
    807 
--> 808         inds = self._get_index(name)
    809         if not len(inds):
    810             raise KeyError(name)
c:\python37\lib\site-packages\pandas\core\groupby\groupby.py in _get_index(self, name)
    628         Safe get index, translate keys for datelike to underlying repr.
    629         """
--> 630         return self._get_indices([name])[0]
    631 
    632     @cache_readonly
c:\python37\lib\site-packages\pandas\core\groupby\groupby.py in _get_indices(self, names)
    593             return []
    594 
--> 595         if len(self.indices) > 0:
    596             index_sample = next(iter(self.indices))
    597         else:
c:\python37\lib\site-packages\pandas\core\groupby\groupby.py in indices(self)
    572         """
    573         self._assure_grouper()
--> 574         return self.grouper.indices
    575 
    576     def _get_indices(self, names):

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

Part 2 => Categories nan
c:\python37\lib\site-packages\pandas\core\groupby\ops.py in indices(self)
    222         """ dict {group name -> group indices} """
    223         if len(self.groupings) == 1:
--> 224             return self.groupings[0].indices
    225         else:
    226             codes_list = [ping.codes for ping in self.groupings]

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\groupby\grouper.py in indices(self)
    558             return self.grouper.indices
    559 
--> 560         values = Categorical(self.grouper)
    561         return values._reverse_indexer()
    562 
c:\python37\lib\site-packages\pandas\core\arrays\categorical.py in __init__(self, values, categories, ordered, dtype, fastpath)
    360 
    361             # we're inferring from values
--> 362             dtype = CategoricalDtype(categories, dtype.ordered)
    363 
    364         elif is_categorical_dtype(values.dtype):
c:\python37\lib\site-packages\pandas\core\dtypes\dtypes.py in __init__(self, categories, ordered)
    161 
    162     def __init__(self, categories=None, ordered: Ordered = False):
--> 163         self._finalize(categories, ordered, fastpath=False)
    164 
    165     @classmethod
c:\python37\lib\site-packages\pandas\core\dtypes\dtypes.py in _finalize(self, categories, ordered, fastpath)
    315 
    316         if categories is not None:
--> 317             categories = self.validate_categories(categories, fastpath=fastpath)
    318 
    319         self._categories = categories
c:\python37\lib\site-packages\pandas\core\dtypes\dtypes.py in validate_categories(categories, fastpath)
    487         if not fastpath:
    488 
--> 489             if categories.hasnans:
    490                 raise ValueError("Categorical categories cannot be null")
    491 
Part 3 => nan, Multiindex

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\indexes\base.py in hasnans(self)
   2046         """
   2047         if self._can_hold_na:
-> 2048             return bool(self._isnan.any())
   2049         else:
   2050             return False

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\indexes\base.py in _isnan(self)
   2026         """
   2027         if self._can_hold_na:
-> 2028             return isna(self)
   2029         else:
   2030             # shouldn't reach to this condition by checking hasnans beforehand
c:\python37\lib\site-packages\pandas\core\dtypes\missing.py in isna(obj)
    122     Name: 1, dtype: bool
    123     """
--> 124     return _isna(obj)
    125 
    126 
c:\python37\lib\site-packages\pandas\core\dtypes\missing.py in _isna(obj, inf_as_na)
    151     # hack (for now) because MI registers as ndarray
    152     elif isinstance(obj, ABCMultiIndex):
--> 153         raise NotImplementedError("isna is not defined for MultiIndex")
    154     elif isinstance(obj, type):
    155         return False

NotImplementedError: isna is not defined for MultiIndex

>>> End of Traceback <<<

.indices

.indices Traceback
ERROR
df.groupby(lambda x: df.iloc[x,0]).indices
-   ---------------------------------------------------------------------------
-   NotImplementedError                       Traceback (most recent call last)
-   <ipython-input-5-fd5a9f33be4e> in <module>
-   ----> 1 df.groupby(lambda x: df.iloc[x,0]).indices
Part 1 => groupings, Categorical
c:\python37\lib\site-packages\pandas\core\groupby\groupby.py in indices(self)
    572         """
    573         self._assure_grouper()
--> 574         return self.grouper.indices
    575 
    576     def _get_indices(self, names):

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\groupby\ops.py in indices(self)
    222         """ dict {group name -> group indices} """
    223         if len(self.groupings) == 1:
--> 224             return self.groupings[0].indices
    225         else:
    226             codes_list = [ping.codes for ping in self.groupings]

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\groupby\grouper.py in indices(self)
    558             return self.grouper.indices
    559 
--> 560         values = Categorical(self.grouper)
    561         return values._reverse_indexer()
    562 
c:\python37\lib\site-packages\pandas\core\arrays\categorical.py in __init__(self, values, categories, ordered, dtype, fastpath)
    360 
    361             # we're inferring from values
--> 362             dtype = CategoricalDtype(categories, dtype.ordered)
    363 
    364         elif is_categorical_dtype(values.dtype):
c:\python37\lib\site-packages\pandas\core\dtypes\dtypes.py in __init__(self, categories, ordered)
    161 
    162     def __init__(self, categories=None, ordered: Ordered = False):
--> 163         self._finalize(categories, ordered, fastpath=False)
    164 
    165     @classmethod
c:\python37\lib\site-packages\pandas\core\dtypes\dtypes.py in _finalize(self, categories, ordered, fastpath)
    315 
    316         if categories is not None:
--> 317             categories = self.validate_categories(categories, fastpath=fastpath)
    318 
    319         self._categories = categories
Part 2 => Categories nan
c:\python37\lib\site-packages\pandas\core\dtypes\dtypes.py in validate_categories(categories, fastpath)
    487         if not fastpath:
    488 
--> 489             if categories.hasnans:
    490                 raise ValueError("Categorical categories cannot be null")
    491 

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\indexes\base.py in hasnans(self)
   2046         """
   2047         if self._can_hold_na:
-> 2048             return bool(self._isnan.any())
   2049         else:
   2050             return False

pandas\_libs\properties.pyx in pandas._libs.properties.CachedProperty.__get__()

c:\python37\lib\site-packages\pandas\core\indexes\base.py in _isnan(self)
   2026         """
   2027         if self._can_hold_na:
-> 2028             return isna(self)
   2029         else:
   2030             # shouldn't reach to this condition by checking hasnans beforehand
c:\python37\lib\site-packages\pandas\core\dtypes\missing.py in isna(obj)
    122     Name: 1, dtype: bool
    123     """
--> 124     return _isna(obj)
    125 
    126 
c:\python37\lib\site-packages\pandas\core\dtypes\missing.py in _isna(obj, inf_as_na)
    151     # hack (for now) because MI registers as ndarray
    152     elif isinstance(obj, ABCMultiIndex):
--> 153         raise NotImplementedError("isna is not defined for MultiIndex")
    154     elif isinstance(obj, type):
    155         return False

NotImplementedError: isna is not defined for MultiIndex

>>> End of Traceback <<<


It seems that in both cases the problem starts at Categorical(self.grouper), where no categories are found, though that is only my guess.
The problem starts during grouper init (line 413): if a lambda or named function is used, grouper is assigned that function, and line 440 is skipped because it does not see the MultiIndex behind it.
Later, line 512 triggers the conversion and line 518 assigns the actual value (a MultiIndex) in place of the function. However, this path skips the cast on line 440, and the bug appears.
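
The mechanism described above can be reproduced outside GroupBy. The sketch below assumes only the documented behavior of Index.map, which returns a MultiIndex when the mapped function returns tuples. The names index, mapped and flat are illustrative, and the final conversion to an object array of tuples is my guess at what the skipped cast amounts to, not a copy of the pandas source:

```python
import pandas as pd

grp_tuples = [("a", "b"), ("c", "d"), ("e", "f")]
index = pd.RangeIndex(6)

# Mapping a tuple-returning function over an index yields a MultiIndex.
mapped = index.map(lambda x: grp_tuples[x % len(grp_tuples)])
print(type(mapped).__name__)

# isna on a MultiIndex is what the traceback ends in; guarded here so the
# sketch still runs if a given pandas version behaves differently.
try:
    pd.isna(mapped)
except NotImplementedError as exc:
    print(f"isna failed: {exc}")

# Flattening to a 1-D object array of tuples avoids the MultiIndex entirely.
flat = mapped.values
print(flat.dtype, flat[0])
```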

Most of the attributes listed in grouper.py work for a lambda-built GroupBy grouper, to name a few: codes, ngroups, result_index. Everything that BaseGrouper has works. Only indices and get_group do not.

Pandas is a great module, and I'm happy to use it without those indices, especially since I can get the same output by looping over the .groups PrettyDict. Still, this is a bug, and support for multi-level group names is a potential growth area for the module.

I will be happy to keep digging into this problem on my own and would appreciate any input on the cause of this error.
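
The workaround of looping over .groups could look roughly like the sketch below. The helper name indices_from_groups is hypothetical, and the approach assumes a unique index, since Index.get_indexer requires one. The fixed data mirrors the sample table in this report:

```python
import pandas as pd

# Fixed data mirroring the sample table in this report.
df = pd.DataFrame({"Tuples": [(0, 3)] * 5 + [(1, 4), (1, 4), (1, 4), (1, 3), (1, 4)]})
gb = df.groupby(lambda x: df.iloc[x, 0])


def indices_from_groups(grouped, index):
    """Rebuild {group name -> positional indices} from .groups (hypothetical helper)."""
    # .groups maps each group name to index labels; get_indexer turns
    # those labels into integer positions within the original index.
    return {key: index.get_indexer(labels) for key, labels in grouped.groups.items()}


workaround = indices_from_groups(gb, df.index)
for key, positions in workaround.items():
    print(key, positions.tolist())
```

Because only .groups is touched, this avoids the .indices code path shown in the traceback.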

Expected Output

>>> df.groupby(lambda x: df.iloc[x,0]).indices
{(0, 3): array([0, 1, 2, 3, 4], dtype=int64),
 (1, 3): array([8], dtype=int64),
 (1, 4): array([5, 6, 7, 9], dtype=int64)}
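
Since the issue is labeled as needing tests, the expected output could be turned into a small regression check along these lines. This is a sketch that assumes a pandas version in which the bug is fixed; on 1.1.x the lambda-based call is expected to raise instead. The helper as_lists is hypothetical and only makes the dictionaries comparable:

```python
import pandas as pd

# Fixed data mirroring the sample table in this report.
df = pd.DataFrame({"Tuples": [(0, 3)] * 5 + [(1, 4), (1, 4), (1, 4), (1, 3), (1, 4)]})

by_column = df.groupby("Tuples").indices
by_lambda = df.groupby(lambda x: df.iloc[x, 0]).indices


def as_lists(indices):
    """Convert {name -> ndarray} into {name -> list} for plain equality checks."""
    return {key: positions.tolist() for key, positions in indices.items()}


expected = {(0, 3): [0, 1, 2, 3, 4], (1, 3): [8], (1, 4): [5, 6, 7, 9]}

# Both ways of grouping should report the same positional indices.
assert as_lists(by_lambda) == as_lists(by_column) == expected
```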

Output of pd.show_versions()

INSTALLED VERSIONS

commit : d9fff27
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.1.0
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.0
pip : 20.2.2
setuptools : 40.8.0
Cython : None
pytest : 6.0.1
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.4
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.0
numexpr : None
odfpy : None
openpyxl : 3.0.4
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : None

Labels

Bug, Groupby, Needs Tests (unit test(s) needed to prevent regressions)
