BUG: DataFrameGroupBy.value_counts() fails if as_index=False and there are duplicate column labels #45160
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,7 +26,10 @@ | |
|
||
import numpy as np | ||
|
||
from pandas._libs import reduction as libreduction | ||
from pandas._libs import ( | ||
lib, | ||
reduction as libreduction, | ||
) | ||
from pandas._typing import ( | ||
ArrayLike, | ||
Manager, | ||
|
@@ -1731,7 +1734,7 @@ def value_counts( | |
observed=self.observed, | ||
dropna=self.dropna, | ||
) | ||
result = cast(Series, gb.size()) | ||
result = gb.size() | ||
|
||
if normalize: | ||
# Normalize the results by dividing by the original group sizes. | ||
|
@@ -1750,13 +1753,32 @@ def value_counts( | |
if sort: | ||
# Sort the values and then resort by the main grouping | ||
index_level = range(len(self.grouper.groupings)) | ||
result = result.sort_values(ascending=ascending).sort_index( | ||
level=index_level, sort_remaining=False | ||
result = ( | ||
cast(Series, result) | ||
.sort_values(ascending=ascending) | ||
.sort_index(level=index_level, sort_remaining=False) | ||
) | ||
|
||
if not self.as_index: | ||
# Convert to frame | ||
result = result.reset_index(name="proportion" if normalize else "count") | ||
name = "proportion" if normalize else "count" | ||
columns = result.index.names | ||
if name in columns: | ||
raise ValueError( | ||
f"Column label '{name}' is duplicate of result column" | ||
|
||
) | ||
columns = com.fill_missing_names(columns) | ||
values = result.values | ||
Review comment: result is a Series at this point? ._values is generally preferable to .values, as the latter will cast dt64tz to ndarray[object].
Reply: Thanks! |
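The point about ._values versus .values can be illustrated with a small sketch. The Series below is made-up data, and ._values is a private pandas attribute, so this is only meant to show the general behaviour rather than what the PR code does. The exact NumPy dtype that .values produces may differ between accessors and pandas versions; what matters here is that the tz-aware dtype does not survive the conversion.

```python
import numpy as np
import pandas as pd

# Made-up tz-aware data; its dtype is a pandas extension dtype.
s = pd.Series(pd.date_range("2022-01-01", periods=3, tz="UTC"))

# For tz-aware data, .values returns a plain NumPy array,
# which cannot carry the timezone.
as_numpy = s.values

# ._values (private, used inside pandas) and .array (public)
# keep the extension array and therefore the tz-aware dtype.
internal = s._values
public = s.array

print(type(as_numpy).__name__, as_numpy.dtype)
print(type(internal).__name__, internal.dtype)
```

Outside pandas internals, .array is the public way to get the same tz-preserving array.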
||
result_frame = DataFrame() | ||
for i, column in enumerate(columns): | ||
level_values = result.index.get_level_values(i)._values | ||
if level_values.dtype == np.object_: | ||
level_values = lib.maybe_convert_objects( | ||
cast(np.ndarray, level_values) | ||
) | ||
result_frame.insert(i, column, level_values, allow_duplicates=True) | ||
Review comment: This is almost O(n^2), which is inefficient. Please don't conflate this issue with reset_index or allow_duplicates; they are completely orthogonal. This is not going to move forward; this keeps re-inventing the wheel.
Reply: Not O(n^2), but O(n), and of course I would rather not be iterating through an axis. That is the whole point of pandas! concat does not work, as I explained many times. The best I could do was:
which fails 6 of my tests due to bool/object problems in MultiIndex. These will probably be fixed by #45061, but I see that has been deferred until 1.5. Meanwhile, I employed your column renaming suggestion and there is no more looping. It is now all green (apart from the usual CI problems). |
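For readers following the thread, here is a minimal sketch of the looping approach under discussion: one DataFrame.insert call per index level, with allow_duplicates=True so that repeated labels are accepted. The index, labels, and counts are hypothetical, and this is not the final implementation, which according to the reply above moved to column renaming without a loop.

```python
import pandas as pd

# Hypothetical grouped counts whose two index levels share one name.
index = pd.MultiIndex.from_tuples(
    [("x", 1), ("x", 2), ("y", 1)], names=["key", "key"]
)
counts = pd.Series([3, 1, 2], index=index)

frame = pd.DataFrame()
for i, label in enumerate(counts.index.names):
    # One insert per index level (not per row). Levels are looked up
    # by position because the names are ambiguous.
    level_values = counts.index.get_level_values(i).to_numpy()
    frame.insert(i, label, level_values, allow_duplicates=True)

# Append the counts as the last column.
frame.insert(len(frame.columns), "count", counts.to_numpy())

print(list(frame.columns))
```

The loop runs once per grouping level, so its cost grows with the number of levels rather than the number of rows.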
||
result = result_frame.assign(**{name: values}) | ||
|
||
return result.__finalize__(self.obj, method="value_counts") | ||
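The duplicate-label guard added in this hunk can be mirrored in a standalone sketch. The helper name result_column_name is hypothetical and does not exist in pandas; it only reproduces the naming rule and the error message shown in the diff, so the check can be tried without a grouped frame.

```python
def result_column_name(index_names, normalize=False):
    """Hypothetical helper mirroring the check in the diff above."""
    # The result column is called "proportion" when normalizing,
    # otherwise "count".
    name = "proportion" if normalize else "count"
    # Refuse to build a frame in which the result column would
    # collide with one of the grouping/value column labels.
    if name in index_names:
        raise ValueError(f"Column label '{name}' is duplicate of result column")
    return name


print(result_column_name(["group", "value"]))

try:
    result_column_name(["group", "count"])
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

A column named "count" is only a problem when normalize is false; with normalize=True the reserved label is "proportion" instead.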