PERF: cache sorted data in GroupBy? #51077

Open
@jbrockmendel

Description

When we do a groupby transform/reduce that requires operating group-by-group, we construct a sorted (DataFrame|Series) so that we can iterate over it efficiently. That construction is cached within a DataSplitter class, but the splitter itself is not cached. IIUC we can get some mileage by caching the DataSplitter, at the possible cost of having a copy hang around longer than we might want.
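To make the caching idea concrete, here is a minimal sketch using toy classes. `ToySplitter` and `ToyGroupBy` are hypothetical stand-ins, not the actual pandas internals; the only point is that caching the splitter on the groupby object lets repeated group-by-group operations share one sorted copy.

```python
from functools import cached_property

import numpy as np
import pandas as pd


class ToySplitter:
    """Hypothetical stand-in for DataSplitter; not pandas' real class."""

    def __init__(self, data, codes):
        self.data = data
        self.codes = np.asarray(codes)

    @cached_property
    def sort_idx(self):
        # Stable sort keeps the original row order within each group.
        return np.argsort(self.codes, kind="stable")

    @cached_property
    def sorted_data(self):
        # This is the copy we would like to build only once.
        return self.data.take(self.sort_idx)

    def __iter__(self):
        if len(self.codes) == 0:
            return
        sorted_codes = self.codes[self.sort_idx]
        # Positions where the group code changes mark the group starts.
        starts = np.flatnonzero(np.r_[True, sorted_codes[1:] != sorted_codes[:-1]])
        ends = np.r_[starts[1:], len(sorted_codes)]
        for start, end in zip(starts, ends):
            yield self.sorted_data.iloc[start:end]


class ToyGroupBy:
    """Hypothetical groupby-like holder that caches its splitter."""

    def __init__(self, data, codes):
        self.data = data
        self.codes = codes

    @cached_property
    def splitter(self):
        # Cached, so every later operation reuses the same sorted_data.
        return ToySplitter(self.data, self.codes)


ser = pd.Series([10, 20, 30, 40])
gb = ToyGroupBy(ser, [1, 0, 1, 0])
group_sums = [int(group.sum()) for group in gb.splitter]
```

The trade-off mentioned above shows up directly: as long as `gb` is alive, `gb.splitter.sorted_data` keeps the sorted copy alive too.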

Also, we have a separate construct-a-sorted-object path in _numba_prep that might be able to re-use some of this code.

Final thought: we could check in DataSplitter.sorted_data whether _sort_idx is monotonic, in which case the (DataFrame|Series) is already sorted and we don't need to make a copy.
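A rough sketch of that monotonic check, written against public NumPy/pandas calls only. `take_if_needed` is a hypothetical helper name, and the sketch assumes the sort indexer is a full permutation of `0..n-1`, in which case a strictly increasing indexer can only be the identity.

```python
import numpy as np
import pandas as pd


def take_if_needed(obj, sort_idx):
    """Hypothetical helper: reorder obj by sort_idx, skipping the copy if already sorted."""
    sort_idx = np.asarray(sort_idx)
    # Assumption: sort_idx is a permutation of 0..n-1. If it is strictly
    # increasing and has full length, it is the identity, so obj is already
    # in group order and no copy is needed.
    already_sorted = len(sort_idx) == len(obj) and bool(
        np.all(sort_idx[1:] > sort_idx[:-1])
    )
    if already_sorted:
        return obj
    return obj.take(sort_idx)


df_sorted = pd.DataFrame({"key": [0, 0, 1, 2, 2], "val": [1, 2, 3, 4, 5]})
idx_sorted = np.argsort(df_sorted["key"].to_numpy(), kind="stable")
res_sorted = take_if_needed(df_sorted, idx_sorted)  # same object, no copy

df_unsorted = pd.DataFrame({"key": [2, 0, 1, 0, 2], "val": [1, 2, 3, 4, 5]})
idx_unsorted = np.argsort(df_unsorted["key"].to_numpy(), kind="stable")
res_unsorted = take_if_needed(df_unsorted, idx_unsorted)  # new, reordered object
```

One thing to keep in mind with this approach: returning the original object means callers must not mutate the result in place, since it may alias the user's data.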

Labels

Enhancement, Groupby, Internals, Performance
