Description
Code Sample
import time
import numpy as np
import pandas as pd
nrows, ncols = 1000, 100
# data frame with random values and a key to be grouped by
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = range(nrows)
numeric_columns = list(range(ncols))
grouping = df.groupby(by="key")
# performance regression in apply()
start = time.time()
grouping[numeric_columns].apply(lambda x: x - x.mean())
end = time.time()
print("[pandas=={}] execution time: {:.4f} seconds".format(pd.__version__, end - start))
# [pandas==0.23.4] execution time: 0.8700 seconds
# [pandas==0.24.0] execution time: 24.3790 seconds
# [pandas==0.24.2] execution time: 23.9600 seconds
Problem description
GroupBy.apply is roughly 25x slower in version 0.24.0 than in the 0.23.4 release, and the problem persists in the latest 0.24.2 release.
The code sample above demonstrates the regression; its purpose is to subtract the group mean from every element of each group.
The slowdown occurs only when the lambda passed to apply() returns a data frame.
There is no performance issue with scalar return values, e.g. lambda x: x.mean().
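As a side note (not part of the original report), the same de-meaning can be expressed with GroupBy.transform, which broadcasts the group means back to the original shape and avoids the DataFrame-returning apply() path entirely. A minimal sketch, using a smaller key so that groups contain more than one row:

```python
import numpy as np
import pandas as pd

nrows, ncols = 1000, 100
df = pd.DataFrame(np.random.rand(nrows, ncols))
df["key"] = [i % 10 for i in range(nrows)]  # 10 groups, unlike the one-row-per-key repro
numeric_columns = list(range(ncols))

# slow path from the report: apply() with a DataFrame-returning lambda
demeaned_apply = df.groupby("key", group_keys=False)[numeric_columns].apply(
    lambda x: x - x.mean()
)

# equivalent result via transform(), which computes the per-group means
# once and aligns them with the original index
demeaned_transform = df[numeric_columns] - df.groupby("key")[numeric_columns].transform("mean")

# both approaches produce the same de-meaned values
assert np.allclose(demeaned_apply.sort_index(), demeaned_transform)
```

This does not fix the regression in apply() itself, but it sidesteps the affected code path for this particular computation.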
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.16.2
scipy: 1.2.1
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None