Skip to content

Odd behaviour of groupby-agg with certain boolean columns and multi-index #21240

Closed
@Khris777

Description

@Khris777

Problem Description

I stumbled over some very specific off behaviour when trying to do a multi-index groupby over several boolean columns using agg with a dictionary to apply a function to each column that would count the number of True-values by using the boolean series as a boolean index on itself and getting the length of that.

Example code:

NN = 1
l0 = NN*['a','a','b','b','b','b','b','b','b']
l1 = NN*['x','y','x','x','x','y','y','z','z']
l2 = NN*[True,True,False,True,False,False,False,True,True]
l3 = NN*[True,False,False,True,False,True,False,True,False]
l4 = NN*[False,False,True,True,True,True,False,True,False]

df = pd.DataFrame({"C0":l0,'C1':l1,'C2':l2,'C3':l3,'C4':l4})
print(df.groupby(["C0","C1"])[["C2","C3","C4"]]\
.agg({"C2":lambda x: len(x[x]),"C3":lambda x: len(x[x]),"C4":lambda x: len(x[x])}))

The result I'm getting is:

	   C2     C3  C4
C0 C1               
a  x    1   True   0
   y    1  False   0
b  x    1   True   2
   y    0   True   1
   z    2   True   1

As you can see, the second column isn't properly aggregated for some reason.

If I set NN = 10, I'm getting a proper result:

	   C2  C3  C4
C0 C1            
a  x   10  10   0
   y   10   0   0
b  x   10  10  30
   y    0  10  10
   z   20  10  10

The problem only happens when using a multi-index, only with the C3 column, only when NN=1 and only when using the agg function (it works fine with apply on the single column), so it can be reduced to this call:

print(df.groupby(["C0","C1"])["C3"].agg(lambda x: len(x[x])))

It seems like the specific combination of the multi-index and the boolean values causes some odd behaviour.

EDIT: I did not write the C3 data specifically to cause the error, I just played with random combinations around until the error appeared, so I have no clue, why exactly that combination causes the problem.

In my real application this happens to two of three columns in a dataframe with 570 rows, so it's not limited to very short dataframes like in the example.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.0.1
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: 0.1.5
pandas_gbq: None
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    ApplyApply, Aggregate, Transform, MapBugDtype ConversionsUnexpected or buggy dtype conversionsGroupby

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions