
Description
Code Sample, a copy-pastable example if possible
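import pandas as pd
import numpy as np
from time import time
# One million evenly spaced timestamps covering a single day.
df = pd.date_range(start="1/1/2018", end="1/2/2018", periods=10**6).to_frame()
# Aggregate into one-second bins with resample.
start = time()
dfr = df.resample("1s").last()
print(time() - start)
print("Length:", len(dfr))
print()
# Same aggregation via groupby, keyed on the timestamps rounded to whole seconds.
group_index = np.round(df.index.astype("int64") / 1e9)
start = time()
dfr = df.groupby(group_index).last()
print(time() - start)
print("Length:", len(dfr))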
Problem description
In my current project, I use both groupby and resample on the same data frames with the same aggregations. I have noticed that resample is much faster than groupby. While I understand that groupby is more flexible, it would still be nice if the performance were comparable. In the example above, resample is more than 50 times faster:
Length: 86401
0.023558616638183594
Length: 86401
1.264981746673584
I am aware that they don't result in the exact same data frames, but this does not matter for this discussion.
Expected Output
Better performance for groupby.
I haven't looked at the groupby implementation, so I don't know whether there is a good reason for the difference. Even if there is, some common cases could still be improved a lot. For example, here we could first check whether the by-argument is monotonically increasing or decreasing. If it is, the operation can be implemented without a hash map at all.
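To illustrate the monotonic idea, here is a rough NumPy sketch of "last per group" that never hashes the keys. It only demonstrates the suggestion and says nothing about how pandas implements groupby internally. The helper name `last_per_sorted_key` is made up for this example.

```python
import numpy as np
import pandas as pd


def last_per_sorted_key(keys, values):
    """Return (unique_keys, last_values) for keys that are already monotonic.

    Hypothetical helper for illustration only; not part of pandas.
    """
    keys = np.asarray(keys)
    values = np.asarray(values)
    if len(keys) == 0:
        return keys, values
    diffs = np.diff(keys)
    # The shortcut is only valid for monotonic keys (either direction).
    if not (np.all(diffs >= 0) or np.all(diffs <= 0)):
        raise ValueError("keys must be monotonic")
    # A group ends wherever the next key differs; the final row always ends a group.
    ends = np.append(np.flatnonzero(diffs != 0), len(keys) - 1)
    return keys[ends], values[ends]


# Same keys as in the code sample above, with a plain integer column as the values.
index = pd.date_range(start="1/1/2018", end="1/2/2018", periods=10**6)
keys = np.round(index.astype("int64") / 1e9)
values = np.arange(len(index))

uniq, last = last_per_sorted_key(keys, values)

# Cross-check against groupby on the same keys.
expected = pd.Series(values).groupby(keys).last()
print(np.array_equal(last, expected.to_numpy()))
```

The monotonicity check and the boundary search are both single linear passes over the keys, so a fast path like this would add little overhead when the check fails and the general code path has to be used anyway.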
Output of pd.show_versions()
INSTALLED VERSIONS
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.4.0-17134-Microsoft
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.1
numpy : 1.17.1
pytz : 2019.2
dateutil : 2.8.0
pip : 19.2.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.8.0
pandas_datareader: 0.7.4
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.14.1
pytables : None
s3fs : None
scipy : 1.3.1
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None