Description
Code Sample, a copy-pastable example if possible
import pandas as pd

df = pd.DataFrame([[1, 10], [1, 20], [2, 40], [2, 30]], columns=['a', 'b'])
print("%s\n" % df)
print("Output of df.groupby('a')['b'].transform('rank'):\n%s\n" %
      df.groupby('a')['b'].transform('rank'))
print("Output of df.groupby('a')['b'].rank():\n%s\n" %
      df.groupby('a')['b'].rank())
print("Output of df.groupby('a')['b'].transform(lambda x: x.rank()):\n%s\n" %
      df.groupby('a')['b'].transform(lambda x: x.rank()))
print("Output of df.groupby('a')['b'].transform('cumcount'):\n%s\n" %
      df.groupby('a')['b'].transform('cumcount'))
print("Output of df.groupby('a')['b'].cumcount():\n%s\n" %
      df.groupby('a')['b'].cumcount())
Problem description
For simplicity, I will explain the issue for SeriesGroupBy, though the bug is present in DataFrameGroupBy as well.
When calling transform on a SeriesGroupBy object with a string input 'string_input' (e.g., .transform('mean')), the relevant code inside pandas/pandas/core/groupby/generic.py ends up calling SeriesGroupBy._transform_fast on the function func = getattr(SeriesGroupBy, 'string_input') (unless 'string_input' is in base.cython_transforms, which currently consists of ['cumprod', 'cumsum', 'shift', 'cummin', 'cummax']). Inside _transform_fast, the result of applying func to the SeriesGroupBy object is then broadcast to the entire index of the original object. This works as expected if func returns a single value per group in the GroupBy object (e.g., for functions like 'mean', 'std', etc.). However, for functions like rank, cumcount, etc., that return several values per group, the result of the broadcast is nonsensical as far as I can tell.
Note: Index broadcasting works correctly if transform is called with a (non-cython) function, for example lambda x: x.rank(). In that case, _transform_fast is never called and the result is a simple concatenation of the results for each group.
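To make the broadcasting problem concrete, here is a minimal sketch of the "index the result by group id" step as I understand it. It is not the actual pandas source: broadcast_by_group_id is a hypothetical helper written for illustration, and it uses only public API (ngroup, to_numpy) from a recent pandas rather than the internals of _transform_fast.

```python
import pandas as pd

df = pd.DataFrame([[1, 10], [1, 20], [2, 40], [2, 30]], columns=['a', 'b'])
gb = df.groupby('a')['b']


def broadcast_by_group_id(grouped, result, index):
    """Hypothetical helper: pick result[group_id] for every original row."""
    ids = grouped.ngroup().to_numpy()  # group number per row, here [0, 0, 1, 1]
    return pd.Series(result.to_numpy()[ids], index=index, name=result.name)


# Reduction: one value per group, so indexing by group id broadcasts correctly.
mean_broadcast = broadcast_by_group_id(gb, gb.mean(), df.index)
print(mean_broadcast.tolist())  # [15.0, 15.0, 35.0, 35.0]

# Row-wise result: already one value per row, so indexing by group id
# just repeats the values found at positions 0 and 1.
rank_broadcast = broadcast_by_group_id(gb, gb.rank(), df.index)
print(rank_broadcast.tolist())  # [1.0, 1.0, 2.0, 2.0]
print(gb.rank().tolist())       # [1.0, 2.0, 2.0, 1.0]
```

The second call shows why the output is meaningless for rank: the group ids are used as positions into an array that already has one entry per row.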
Expected Output
The results of df.groupby('a')['b'].transform('rank') and df.groupby('a')['b'].rank() should be identical. The same holds for 'cumcount' and possibly other GroupBy functions.
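A small check along these lines could serve as a regression test for the expected behavior. This is only a sketch: the variable names are mine, and it covers 'rank' only. Going by the description above, the first comparison should print False on an affected version such as 0.23.4, while the lambda comparison should print True.

```python
import pandas as pd

df = pd.DataFrame([[1, 10], [1, 20], [2, 40], [2, 30]], columns=['a', 'b'])
gb = df.groupby('a')['b']

expected = gb.rank()

# String alias: the code path described in this report.
string_matches = gb.transform('rank').equals(expected)
# Callable: the path that concatenates the per-group results.
lambda_matches = gb.transform(lambda x: x.rank()).equals(expected)

print("transform('rank') matches rank():", string_matches)
print("transform(lambda) matches rank():", lambda_matches)
```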
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.3-arch1-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.7
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None