Skip to content

ENH: add efficient groupby for sorted dataframe #43011

Open
@AdeBC

Description

@AdeBC

Feature description

I wish I could use Pandas to efficiently group a large data frame by a sorted column. Since many sorting algorithms have time complexities of O(nlog(n)) and splitting a column according to its consecutive same value requires time complexity of just O(n), given that the current groupby of Pandas has the time complexity of O(n^2), it would be better if we can perform dataframe.groupby by the sorted column, by means of splitting the dataframe according to consecutive same values in the sorted column. In this way, the groupby operation has the time complexity of O(n). Together with the time complexity of sorting, the total time complexity is O(nlog(n)).

Here is my solution

DataFrame.groupby should get a new parameter sorted or something intuitive to perform groupby with O(n) time complexity (see the code below, though the returned object should be a DataFrameGroupBy object, not a dict.)

Existing similar implementation

see here.

def groupby(dataframe, by):
    """
    Groupby the dataframe (:param dataframe) according to the values of the column by (:param by).
    low_memory (:param low_memory) mode can only work when the input dataframe are sorted by the column by.
    :return: extracted Groups.
    """
    df = dataframe[~dataframe[by].isna()]
    col = df[by]
    group_sizes = pd.Series(perGroupSize(col))  # get size of each group according to the consecutive same values, `O(nlog(n))`
    ends = group_sizes.cumsum()
    begins = ends.shift(1).fillna(0).astype(int)
    group_names = tqdm(group_sizes.index,
                       desc='Generating groups from original dataframe, progress: ',
                       total=group_sizes.shape[0])
    groups = {name: df.iloc[begins[name]:ends[name], :].reset_index(drop=True) for name in group_names}
    return groups

Does this make sense? If so, I can implement this in Pandas and create a PR.

Metadata

Metadata

Assignees

No one assigned

    Labels

    EnhancementGroupbyNeeds DiscussionRequires discussion from core team before further actionPerformanceMemory or execution speed performance

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions