Description
Code Sample
import pandas as pd
import numpy as np

n = 80000
g = 5
index = pd.MultiIndex.from_product([
    np.arange(g),
    pd.to_timedelta(np.arange(n), unit='s')
])
data = pd.DataFrame(
    np.random.randint(0, 1000, size=len(index)),
    index=index
)

%timeit data.groupby(level=0).resample('10s', level=1).mean()
# 3.93 s ± 295 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit data.reset_index(1).groupby(level=0).resample('10s', on='level_1').mean()
# 157 ms ± 3.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Problem description
resample seems to take much longer when resampling on a level of a MultiIndex than on a regular data column (about 3.93 s versus 157 ms in the timings above, roughly 25 times slower). The second, faster approach is more convoluted and is not the first thing that comes to mind.
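To make the comparison easier to reproduce outside IPython, here is a minimal sketch that wraps the workaround in a helper and checks that both approaches give the same means on a smaller dataset. The helper name resample_level_via_column and the reduced sizes (n=600, g=3) are my own choices for illustration, not part of pandas, and the helper assumes a two-level MultiIndex. It does not measure timings.

```python
import numpy as np
import pandas as pd


def resample_level_via_column(df, rule, time_level=1):
    """Hypothetical helper (not a pandas API): move one index level to a
    column, then resample on that column within each remaining group.
    Assumes a two-level MultiIndex."""
    name = df.index.names[time_level]
    if name is None:
        # reset_index names an unnamed MultiIndex level "level_<position>"
        name = f"level_{time_level}"
    flat = df.reset_index(level=time_level)
    return flat.groupby(level=0).resample(rule, on=name)


# Smaller than the sizes in the report so the check runs quickly.
n = 600
g = 3
index = pd.MultiIndex.from_product([
    np.arange(g),
    pd.to_timedelta(np.arange(n), unit="s"),
])
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.integers(0, 1000, size=len(index)), index=index)

# Approach from the report: resample directly on the index level.
direct = data.groupby(level=0).resample("10s", level=1).mean()

# Workaround from the report, via the helper above.
via_column = resample_level_via_column(data, "10s").mean()

# Index names differ (None vs "level_1"), so compare the values only.
same = np.allclose(direct[0].to_numpy(), via_column[0].to_numpy())
print("same values:", same)
```

If the two results agree, the helper can stand in for the direct call wherever the speed difference matters, at the cost of the resampled level being named level_1 (or its original name) in the result's index.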
Expected Output
Both operations should take around the same amount of time, with the second possibly taking slightly longer because of the additional reset_index operation. If the difference is expected, then the first operation should emit a warning hinting at the faster alternative.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 0.25.1
numpy : 1.16.4
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.2
setuptools : 41.0.1
Cython : 0.29.13
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : 1.3.7
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8