BUG: Binned groupby median function calculates median on empty bins and outputs random numbers #13629

Closed
@Khris777

Code Sample, a copy-pastable example if possible

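import pandas as pd

# Sample values, binned into width-5 intervals; several bins stay empty.
d = pd.DataFrame([1, 2, 5, 6, 9, 3, 6, 5, 9, 7, 11, 36, 4, 7, 8, 25, 8, 24])
b = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
g = d.groupby(pd.cut(d[0], b))

print(g.mean())    # empty bins are NaN
print(g.median())  # empty bins show arbitrary numbers
print(g.get_group('(0, 5]').median())    # non-empty bin: works
print(g.get_group('(40, 45]').median())  # empty bin: raises KeyError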

Actual Output (expected: NaN for empty bins in the median, as in the mean)

                  0
0                  
(0, 5]     3.333333
(5, 10]    7.500000
(10, 15]  11.000000
(15, 20]        NaN
(20, 25]  24.500000
(25, 30]        NaN
(30, 35]        NaN
(35, 40]  36.000000
(40, 45]        NaN
(45, 50]        NaN
(50, 55]        NaN
             0
0             
(0, 5]     3.5
(5, 10]    7.5
(10, 15]  11.0
(15, 20]  18.0
(20, 25]  24.5
(25, 30]  30.5
(30, 35]  30.5
(35, 40]  36.0
(40, 45]  18.0
(45, 50]  18.0
(50, 55]  18.0
0    3.5
dtype: float64
Traceback (most recent call last):

  File "<ipython-input-9-0663486889da>", line 1, in <module>
    runfile('C:/PythonDir/test04.py', wdir='C:/PythonDir')

  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
    execfile(filename, namespace)

  File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 74, in execfile
    exec(compile(scripttext, filename, 'exec'), glob, loc)

  File "C:/PythonDir/test04.py", line 20, in <module>
    print g.get_group('(40, 45]').median()

  File "C:\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 587, in get_group
    raise KeyError(name)

KeyError: '(40, 45]'

This example shows that the median function of the groupby object returns arbitrary numbers for empty bins instead of NaN, which is what the mean function returns. Accessing such a bin directly by its key raises a KeyError because the group does not exist, yet the full median output suggests that it does exist and that its value might even be meaningful (as in the (15, 20] or (30, 35] bins). The wrong numbers can change from run to run. Another possible output from the same code looks like this:

(0, 5]     3.500000e+00
(5, 10]    7.500000e+00
(10, 15]   1.100000e+01
(15, 20]   1.800000e+01
(20, 25]   2.450000e+01
(25, 30]   3.050000e+01
(30, 35]   3.050000e+01
(35, 40]   3.600000e+01
(40, 45]  4.927210e+165
(45, 50]  4.927210e+165
(50, 55]  4.927210e+165
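
For comparison, the per-bin medians can be computed without going through the groupby median at all. The following is only a minimal sketch of such a reference computation, not a fix for the underlying bug: the helper name `median_by_bin` is hypothetical, and it assumes that `pd.cut` on a Series returns a categorical Series whose `.cat.codes` number the bins in order. Empty bins are explicitly set to NaN.

```python
import numpy as np
import pandas as pd


def median_by_bin(values, edges):
    """Median of `values` per bin; empty bins yield NaN (hypothetical helper)."""
    binned = pd.cut(values, edges)
    codes = binned.cat.codes  # integer position of each value's bin
    labels, medians = [], []
    for code, category in enumerate(binned.cat.categories):
        members = values[codes == code]
        labels.append(str(category))
        # Only compute a median when the bin actually contains values.
        medians.append(members.median() if len(members) else np.nan)
    return pd.Series(medians, index=labels)


values = pd.Series([1, 2, 5, 6, 9, 3, 6, 5, 9, 7, 11, 36, 4, 7, 8, 25, 8, 24])
edges = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
reference = median_by_bin(values, edges)
print(reference)
```

Any bin where the groupby median disagrees with this reference, in particular any bin that is NaN here but numeric there, is affected by the problem described above.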

output of pd.show_versions()

pandas: 0.18.1
