
Performance regression uint64 groupby().min() in 0.25.1 vs. 0.24.2 #28122

Closed
@deeplycloudy


Code Sample, a copy-pastable example if possible

import pandas as pd
dft = pd.read_csv('pandas_gb_min_test.csv.gz', parse_dates=['split_lutevent_time_offset'])

# An unsigned dtype is slow (23 s) on 0.25.1, but fast on 0.24.2.
# Commenting out this line on 0.25.1 restores performance
dft['split_event_parent_event_id'] = dft.split_event_parent_event_id.astype('uint64')

# Print each column's dtype, then run the operation that regressed.
for mdvn in dft.columns:
    print(mdvn, dft[mdvn].dtype, type(dft[mdvn]))
mindata = dft.groupby('xyidx').min()

Problem description

In pandas 0.25.1, df.groupby().min() is roughly 30x slower than in 0.24.2 when one of the columns has a dtype of uint64.

I saved the exact dataframe min_data_in from my original code with min_data_in.to_csv('pandas_gb_min_test.csv'). The columns in the original code had the following data types:

>>> for mdvn in min_data_in.columns: print(mdvn, min_data_in[mdvn].dtype, type(min_data_in[mdvn]))

split_event_parent_event_id uint64 <class 'pandas.core.series.Series'>
split_lutevent_time_offset datetime64[ns] <class 'pandas.core.series.Series'>
split_event_mesh_x_idx int64 <class 'pandas.core.series.Series'>
split_event_mesh_y_idx int64 <class 'pandas.core.series.Series'>
split_lutevent_min_flash_area float64 <class 'pandas.core.series.Series'>
xyidx int64 <class 'pandas.core.series.Series'>

After loading with read_csv, the dtype of split_event_parent_event_id was signed, which is why the code sample above casts it back to uint64.
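The dtype change on reload can be seen with a small in-memory CSV. This is only a sketch: the CSV text below is made up for illustration and is not the attached data. It relies on the `dtype` argument of `read_csv`, which should let the unsigned dtype be requested at load time instead of being cast afterwards.

```python
import io

import pandas as pd

# Made-up stand-in for the attached file; only the column names come from this report.
csv_text = "split_event_parent_event_id,xyidx\n10,0\n11,1\n12,0\n"

# Default inference: integer columns come back as a signed dtype.
inferred = pd.read_csv(io.StringIO(csv_text))

# Requesting the unsigned dtype explicitly at load time.
explicit = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"split_event_parent_event_id": "uint64"},
)

print(inferred.dtypes)
print(explicit.dtypes)
```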

If I restore the uint64 dtype, v0.25.1 runtime is 23 s. Runtime for v0.24.2 is 0.7 s.
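For anyone without the attachment, a comparison along these lines may help check whether a given pandas version is affected. It is a rough sketch on synthetic data: the row count, group count, and the column names `parent_id` and `area` are assumptions, so the timings will not match the 23 s and 0.7 s figures above.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the attached data; sizes and column names are assumptions.
rng = np.random.default_rng(0)
n_rows, n_groups = 200_000, 5_000
df = pd.DataFrame({
    "xyidx": rng.integers(0, n_groups, size=n_rows),
    "parent_id": rng.integers(0, 1_000_000, size=n_rows),
    "area": rng.random(n_rows),
})


def time_groupby_min(frame):
    """Return (elapsed seconds, result) for frame.groupby('xyidx').min()."""
    start = time.perf_counter()
    result = frame.groupby("xyidx").min()
    return time.perf_counter() - start, result


# Baseline: every integer column is signed.
signed_seconds, signed_min = time_groupby_min(df)

# Same data, with one column cast to uint64 as in the report.
unsigned_df = df.assign(parent_id=df["parent_id"].astype("uint64"))
unsigned_seconds, unsigned_min = time_groupby_min(unsigned_df)

print(f"pandas {pd.__version__}")
print(f"int64 column:  {signed_seconds:.3f} s")
print(f"uint64 column: {unsigned_seconds:.3f} s")
```

Running the same script under both pandas versions and comparing the two printed timings should show whether the gap is specific to the unsigned column.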

A signed dtype in v0.25.1 has the same performance as 0.24.2.
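Since the signed dtype is not affected, one possible stopgap is to do the groupby on a signed copy and cast the result back. The helper below is hypothetical (not part of pandas), uses toy data, has not been timed against 0.25.1, and assumes every value fits in int64.

```python
import numpy as np
import pandas as pd


def groupby_min_via_signed(frame, key, unsigned_cols):
    """Hypothetical helper: run groupby(key).min() with uint64 columns viewed as int64."""
    int64_max = int(np.iinfo("int64").max)
    for col in unsigned_cols:
        # Compare as Python ints so the check is exact near the int64 limit.
        if len(frame) and int(frame[col].max()) > int64_max:
            raise OverflowError(f"{col} has values that do not fit in int64")
    signed = frame.astype({col: "int64" for col in unsigned_cols})
    result = signed.groupby(key).min()
    # Restore the original unsigned dtype on the aggregated result.
    return result.astype({col: "uint64" for col in unsigned_cols})


# Toy data for illustration only.
df = pd.DataFrame({
    "xyidx": [0, 0, 1, 1],
    "parent_id": np.array([7, 3, 9, 4], dtype="uint64"),
})
mindata = groupby_min_via_signed(df, "xyidx", ["parent_id"])
print(mindata)
```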

The data file, cProfile capture, and snakeviz output are attached.

Output of pd.show_versions()

0.24.2 and 0.25.1

pd0242.pdf
pd0251.pdf
pd0251.prof.zip
pd0242.prof.zip
pandas_gb_min_test.csv.gz

Labels

Groupby
Performance: Memory or execution speed performance
Regression: Functionality that used to work in a prior pandas version
