Description
groupby.first()
is much slower than groupby.nth(0)
for categorical columns in a very specific way.
Consider carefully the example below, where a dataframe has two columns c1
and c2
. The number of unique values in the c1
column (regardless of its datatype) increases the runtime of first
when and only when c2
is a categorical column.
import pandas as pd, numpy as np, timeit
def test(N_CATEGORIES, cat_cols = ['c2']):
# creates dataframe df with categorical column c1, optional categorical column c2.
# Times how long grouping by c1 and calling nth(0) and first takes
global df
print(N_CATEGORIES)
df = pd.DataFrame({
'c1':np.arange(0,10000)%N_CATEGORIES,
'c2':np.arange(0,10000)
})
for col in cat_cols:
df[col] = df[col].astype('category')
t_nth = timeit.timeit("x1 =df.groupby(['c1']).nth(0, dropna='all')", setup="from __main__ import df", number=1)
t_first = timeit.timeit("x2 = df.groupby(['c1']).first()", setup="from __main__ import df", number=1)
return t_nth, t_first
test_N_categories = [1,10,100,1000,10000]
# Test when column c2 is categorical
results_c2_cat = pd.DataFrame([test(N, cat_cols=['c2']) for N in test_N_categories],
index=test_N_categories,
columns=['nth','first'])
# Test when column c2 is not categorical
results_c2_not_cat = pd.DataFrame([test(N, cat_cols=[]) for N in test_N_categories],
index=test_N_categories,
columns=['nth','first'])
print(results_c2_cat)
print(results_c2_not_cat)
The results are
>>> print(results_c2_cat)
nth first
1 0.004204 0.010677
10 0.005890 0.015701
100 0.005305 0.052130
1000 0.004365 0.581036
10000 0.004358 2.847709
>>> print(results_c2_not_cat)
nth first
1 0.003479 0.001110
10 0.003027 0.000993
100 0.003297 0.001089
1000 0.003297 0.001057
10000 0.003952 0.001382
As shown, when the column c2
is categorical, the runtime of first
grows very rapidly as a function of the number of unique values in the column c1
(and thus the "width" of the groupby).
first
and last
both suffer from this problem.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.12.0
xarray: 0.11.0
IPython: None
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.6
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None