Skip to content

DataFrame groupby.first() is much slower than groupby.nth(0) on categorical dtypes #25397

Closed
@joseortiz3

Description

@joseortiz3

groupby.first() is much slower than groupby.nth(0) for categorical columns in a very specific way.

Consider carefully the example below, where a dataframe has two columns c1 and c2. The number of unique values in the c1 column (regardless of its datatype) increases the runtime of first when and only when c2 is a categorical column.

import pandas as pd, numpy as np, timeit

def test(N_CATEGORIES, cat_cols = ['c2']):
    # creates dataframe df with categorical column c1, optional categorical column c2.
    # Times how long grouping by c1 and calling nth(0) and first takes
    global df 
    print(N_CATEGORIES)
    df = pd.DataFrame({
        'c1':np.arange(0,10000)%N_CATEGORIES,
        'c2':np.arange(0,10000)
        })
    for col in cat_cols:
        df[col] = df[col].astype('category')
    t_nth = timeit.timeit("x1 =df.groupby(['c1']).nth(0, dropna='all')", setup="from __main__ import df", number=1)
    t_first = timeit.timeit("x2 = df.groupby(['c1']).first()", setup="from __main__ import df", number=1)
    return t_nth, t_first

test_N_categories = [1,10,100,1000,10000]

# Test when column c2 is categorical
results_c2_cat = pd.DataFrame([test(N, cat_cols=['c2']) for N in test_N_categories], 
                       index=test_N_categories, 
                       columns=['nth','first'])

# Test when column c2 is not categorical
results_c2_not_cat = pd.DataFrame([test(N, cat_cols=[]) for N in test_N_categories], 
                       index=test_N_categories, 
                       columns=['nth','first'])

print(results_c2_cat)
print(results_c2_not_cat)

The results are

>>> print(results_c2_cat)
            nth     first
1      0.004204  0.010677
10     0.005890  0.015701
100    0.005305  0.052130
1000   0.004365  0.581036
10000  0.004358  2.847709
>>> print(results_c2_not_cat)
            nth     first
1      0.003479  0.001110
10     0.003027  0.000993
100    0.003297  0.001089
1000   0.003297  0.001057
10000  0.003952  0.001382

As shown, when the column c2 is categorical, the runtime of first grows very rapidly as a function of the number of unique values in the column c1 (and thus the "width" of the groupby).

first and last both suffer from this problem.

(see also #19283 and #19598)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.12.0
xarray: 0.11.0
IPython: None
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.6
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    GroupbyPerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions