Code Sample, a copy-pastable example if possible
import pandas as pd
dft = pd.read_csv('pandas_gb_min_test.csv.gz', parse_dates=['split_lutevent_time_offset'])
# An unsigned dtype is slow (23 s) on 0.25.1, but fast on 0.24.2.
# Commenting out this line on 0.25.1 restores performance
dft['split_event_parent_event_id'] = dft.split_event_parent_event_id.astype('uint64')
for mdvn in dft.columns: print(mdvn, dft[mdvn].dtype, type(dft[mdvn]))
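# The groupby().min() call below is the slow operation on 0.25.1 when the uint64 column is present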
mindata = dft.groupby('xyidx').min()
Problem description
In pandas 0.25.1, df.groupby().min() is about 30x slower than in 0.24.2 when one of the columns has a uint64 dtype.
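A self-contained sketch of the comparison with synthetic data standing in for the attached file; the row count, group count, and value ranges below are made up for illustration, and only the column names come from my data:

import time
import numpy as np
import pandas as pd

# Synthetic stand-in for the attached data: a group key plus one uint64 column.
n = 1_000_000
rng = np.random.RandomState(0)
df = pd.DataFrame({
    'xyidx': rng.randint(0, 10_000, size=n),
    'split_event_parent_event_id': rng.randint(0, 10**9, size=n).astype('uint64'),
    'split_lutevent_min_flash_area': rng.rand(n),
})

# groupby().min() with the uint64 column present (slow on 0.25.1)
t0 = time.perf_counter()
df.groupby('xyidx').min()
print('uint64 column:', time.perf_counter() - t0, 's')

# Same data with that column cast to int64 (fast on both versions)
t0 = time.perf_counter()
df.astype({'split_event_parent_event_id': 'int64'}).groupby('xyidx').min()
print('int64 column: ', time.perf_counter() - t0, 's')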
I saved the exact dataframe min_data_in from my original code with min_data_in.to_csv('pandas_gb_min_test.csv'). The columns in the original code had the following data types:
>>> for mdvn in min_data_in.columns: print(mdvn, min_data_in[mdvn].dtype, type(min_data_in[mdvn]))
split_event_parent_event_id uint64 <class 'pandas.core.series.Series'>
split_lutevent_time_offset datetime64[ns] <class 'pandas.core.series.Series'>
split_event_mesh_x_idx int64 <class 'pandas.core.series.Series'>
split_event_mesh_y_idx int64 <class 'pandas.core.series.Series'>
split_lutevent_min_flash_area float64 <class 'pandas.core.series.Series'>
xyidx int64 <class 'pandas.core.series.Series'>
After loading with read_csv, the dtype of split_event_parent_event_id was signed. If I restore the uint64 dtype, the v0.25.1 runtime is 23 s, while the v0.24.2 runtime is 0.7 s. With a signed dtype, v0.25.1 has the same performance as 0.24.2.
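For reference, a minimal sketch of loading the attachment with the original unsigned dtype restored at read time, via read_csv's dtype argument rather than casting afterwards:

import pandas as pd

# Restore the uint64 dtype while reading, instead of astype('uint64') after the fact.
dft = pd.read_csv(
    'pandas_gb_min_test.csv.gz',
    parse_dates=['split_lutevent_time_offset'],
    dtype={'split_event_parent_event_id': 'uint64'},
)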
The data file, cProfile capture, and snakeviz output are attached.
Output of pd.show_versions()
The pd.show_versions() output for 0.24.2 and 0.25.1 is in the attached pd0242.pdf and pd0251.pdf.
Attachments:
pd0242.pdf
pd0251.pdf
pd0251.prof.zip
pd0242.prof.zip
pandas_gb_min_test.csv.gz