ENH: add efficient `groupby` for sorted dataframe

#### Feature description

I wish I could use Pandas to efficiently group a large data frame by a sorted column. Since many sorting algorithms have time complexities of `O(nlog(n))` and splitting a column according to its consecutive same value requires time complexity of just `O(n)`, given that the current `groupby` of Pandas has the time complexity of O(n^2), it would be better if we can perform `dataframe.groupby` by the sorted column, by means of splitting the dataframe according to consecutive same values in the sorted column. In this way, the groupby operation has the time complexity of `O(n)`. Together with the time complexity of sorting, the total time complexity is `O(nlog(n))`.

#### Here is my solution

`DataFrame.groupby` should get a new parameter `sorted` or something intuitive to perform `groupby` with `O(n)`  time complexity (see the code below, though the returned object should be a DataFrameGroupBy object, not a dict.)

#### Existing similar implementation

see [here](https://github.com/AdeBC/easyGroup/blob/be17515b3280a44d536cbe36f621656fa3e5d4d1/easyGroup/src/groupper.py#L5).

```python
def groupby(dataframe, by):
    """
    Groupby the dataframe (:param dataframe) according to the values of the column by (:param by).
    low_memory (:param low_memory) mode can only work when the input dataframe are sorted by the column by.
    :return: extracted Groups.
    """
    df = dataframe[~dataframe[by].isna()]
    col = df[by]
    group_sizes = pd.Series(perGroupSize(col))  # get size of each group according to the consecutive same values, `O(nlog(n))`
    ends = group_sizes.cumsum()
    begins = ends.shift(1).fillna(0).astype(int)
    group_names = tqdm(group_sizes.index,
                       desc='Generating groups from original dataframe, progress: ',
                       total=group_sizes.shape[0])
    groups = {name: df.iloc[begins[name]:ends[name], :].reset_index(drop=True) for name in group_names}
    return groups
```
Does this make sense? If so, I can implement this in Pandas and create a PR.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: add efficient `groupby` for sorted dataframe #43011

Feature description

Here is my solution

Existing similar implementation

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

ENH: add efficient groupby for sorted dataframe #43011

Description

Feature description

Here is my solution

Existing similar implementation

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

ENH: add efficient `groupby` for sorted dataframe #43011