PERF: groupby rank is slow when tie count is big #21237

Closed
@peterpanmj

Description

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3] * 10000, "B": [1] * 30000})

In [31]: %%timeit
    ...: t = df.groupby("B").rank()

608 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [32]: %%timeit
    ...: t = df.A.rank()
1.27 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [33]: %%timeit
    ...: t = df.groupby("B").apply(pd.Series.rank)
    ...:
6.51 ms ± 141 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Problem description

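groupby rank is much slower than rank without groupby when there are a lot of ties. In the example above, with a single group and only three distinct values across 30000 rows, df.groupby("B").rank() takes about 608 ms, while df.A.rank() takes about 1.27 ms and df.groupby("B").apply(pd.Series.rank) about 6.51 ms.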
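
For readers working outside IPython, the comparison can be sketched as a standalone script. This is a minimal sketch, assuming only pandas and the standard-library timeit module; the helper name best_of is hypothetical, and absolute timings will differ by machine and pandas version. It also checks that, with a single group, the grouped and ungrouped ranks agree, so both calls are computing the same result.

```python
import timeit

import pandas as pd

# Same data as in the report: three distinct values, each repeated 10000 times,
# and a single group covering every row.
df = pd.DataFrame({"A": [1, 2, 3] * 10000, "B": [1] * 30000})


def best_of(func, repeat=3):
    """Hypothetical helper: best wall-clock time of `repeat` single calls, in ms."""
    return min(timeit.repeat(func, number=1, repeat=repeat)) * 1000


grouped = df.groupby("B")["A"].rank()
plain = df["A"].rank()

# With one group, both calls should assign identical (average) ranks.
same_ranks = bool((grouped == plain).all())

grouped_ms = best_of(lambda: df.groupby("B")["A"].rank())
plain_ms = best_of(lambda: df["A"].rank())

print(f"ranks agree: {same_ranks}")
print(f"groupby rank: {grouped_ms:.2f} ms, plain rank: {plain_ms:.2f} ms")
```

Taking the best of a few runs is only one way to reduce noise; %%timeit in the report uses the mean over 7 runs instead.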

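Expected Output

When the values contain no ties, groupby rank is fast (faster than the apply version below), so the tied case should take a comparable time: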

In [42]: df1 = pd.DataFrame({"A": np.random.rand(30000), "B": [1] * 30000})

In [44]: %%timeit
    ...: t = df1.groupby("B").apply(pd.Series.rank)
    ...:
10.1 ms ± 203 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [46]: %%timeit
    ...: t = df1.groupby("B").rank()
    ...:
4.77 ms ± 111 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
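
To see how the cost depends on the tie count, one option is to hold the row count fixed and vary the number of distinct values. The sketch below assumes that approach; time_groupby_rank and the chosen n_unique values are hypothetical and not from the report, and no particular timings are implied.

```python
import timeit

import numpy as np
import pandas as pd

N_ROWS = 30000  # same row count as in the report


def time_groupby_rank(n_unique, repeat=3):
    """Hypothetical helper: time groupby rank with `n_unique` distinct values.

    Fewer distinct values means more rows tied at each value.
    Returns the best of `repeat` single runs, in milliseconds.
    """
    values = np.arange(N_ROWS) % n_unique
    df = pd.DataFrame({"A": values, "B": np.ones(N_ROWS, dtype=int)})
    best = min(timeit.repeat(lambda: df.groupby("B").rank(), number=1, repeat=repeat))
    return best * 1000


# From heavy ties (3 distinct values) to no ties (every value distinct).
timings = {n_unique: time_groupby_rank(n_unique) for n_unique in (3, 300, N_ROWS)}
for n_unique, ms in timings.items():
    print(f"{n_unique:>6} distinct values: {ms:8.2f} ms")
```

If the slowdown is tied to the number of duplicates per value, the first row should stand out on an affected pandas build.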

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: 3b770fa
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: zh_CN.UTF-8
LOCALE: None.None

pandas: 0.24.0.dev0+32.g3b770fa07
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.6.7
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels

Groupby, Performance (memory or execution speed performance)
