Description
Feature description
I wish I could use Pandas to efficiently group a large data frame by a sorted column. Comparison sorts take `O(n log n)` time, while splitting a column into runs of consecutive equal values takes only `O(n)`. Given that the current Pandas `groupby` has `O(n^2)` time complexity, it would be better if we could perform `dataframe.groupby` on the sorted column by splitting the dataframe at the boundaries between consecutive equal values. The groupby step then takes `O(n)` time, and together with the sort the total time complexity is `O(n log n)`.
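The `O(n)` split itself is straightforward. Here is a minimal sketch in NumPy on an already-sorted array (the data is just illustrative):

```python
import numpy as np

key = np.array([1, 1, 2, 2, 2, 5, 7, 7])              # already sorted
boundaries = np.flatnonzero(key[1:] != key[:-1]) + 1  # one O(n) pass over the values
runs = np.split(np.arange(len(key)), boundaries)      # row positions of each group
# runs -> [array([0, 1]), array([2, 3, 4]), array([5]), array([6, 7])]
```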
Here is my solution
`DataFrame.groupby` should get a new parameter `sorted` (or something similarly intuitive) that performs the groupby in `O(n)` time (see the code below, though the returned object should be a `DataFrameGroupBy` object, not a `dict`).
Existing similar implementation
See here.
```python
import numpy as np
import pandas as pd
from tqdm import tqdm


def perGroupSize(col):
    """One possible O(n) implementation of the helper: map each run of
    consecutive equal values in `col` to the length of that run."""
    values = col.to_numpy()
    boundaries = np.flatnonzero(values[1:] != values[:-1]) + 1  # indices where a new run starts
    starts = np.concatenate(([0], boundaries))
    ends = np.concatenate((boundaries, [len(values)]))
    return {values[s]: e - s for s, e in zip(starts, ends)}


def groupby(dataframe, by):
    """
    Group the dataframe (:param dataframe) according to the values of the column by (:param by).
    Only correct when the input dataframe is sorted by the column by,
    so that equal values are consecutive.
    :return: dict mapping group name -> group dataframe.
    """
    df = dataframe[~dataframe[by].isna()]
    col = df[by]
    group_sizes = pd.Series(perGroupSize(col))  # size of each run of consecutive equal values, O(n)
    ends = group_sizes.cumsum()
    begins = ends.shift(1).fillna(0).astype(int)
    group_names = tqdm(group_sizes.index,
                       desc='Generating groups from original dataframe, progress: ',
                       total=group_sizes.shape[0])
    groups = {name: df.iloc[begins[name]:ends[name], :].reset_index(drop=True) for name in group_names}
    return groups
```
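For example, on a frame already sorted by the key column, the sketch above returns one sub-frame per run of equal keys (the data here is just illustrative):

```python
df = pd.DataFrame({"key": ["a", "a", "b", "c", "c"],
                   "val": [1, 2, 3, 4, 5]})
groups = groupby(df, by="key")  # the sketch above, not Pandas' own groupby
print(groups["b"])
#   key  val
# 0   b    3
```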
Does this make sense? If so, I can implement this in Pandas and create a PR.