Fixed Inconsistent GroupBy Output Shape with Duplicate Column Labels #29124
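For context, a minimal sketch of the behavior this PR addresses (the frame, grouper, and values here are illustrative, not taken from the original issue): with duplicate column labels, a groupby aggregation should preserve the duplicated labels in the output rather than collapsing or misaligning them.

```python
import pandas as pd

# Hypothetical reproduction: a DataFrame whose columns contain a
# duplicated label "b"; aggregating after a groupby should keep both
# "b" columns in the result.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "b"])
result = df.groupby([0, 0]).sum()
print(list(result.columns))  # with the fix, duplicate labels survive
```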
Changes from 56 commits
@@ -10,7 +10,17 @@
 from functools import partial
 from textwrap import dedent
 import typing
-from typing import Any, Callable, FrozenSet, Iterable, Sequence, Type, Union, cast
+from typing import (
+    Any,
+    Callable,
+    FrozenSet,
+    Iterable,
+    Mapping,
+    Sequence,
+    Type,
+    Union,
+    cast,
+)
 import warnings

 import numpy as np
@@ -322,29 +332,108 @@ def _aggregate_multiple_funcs(self, arg, _level):

         return DataFrame(results, columns=columns)

-    def _wrap_series_output(self, output, index, names=None):
-        """ common agg/transform wrapping logic """
-        output = output[self._selection_name]
-        if names is not None:
-            return DataFrame(output, index=index, columns=names)
+    def _wrap_series_output(
+        self,
+        output: Mapping[int, Union[Series, np.ndarray]],
+        index: Index,
+        columns: Index,
+    ) -> Union[Series, DataFrame]:
+        """
+        Wraps the output of a SeriesGroupBy operation into the expected result.
+
+        Parameters
+        ----------
+        output : Mapping[int, Union[Series, np.ndarray]]
+            Dict where the key represents the columnar-index and the values are
+            the actual results. Must be ordered from 0..n.
+        index : pd.Index
+            Index to apply to the output.
+        columns : pd.Index
+            Columns to apply to the output.
+
+        Returns
+        -------
+        Series or DataFrame
+
+        Notes
+        -----
+        In the vast majority of cases output and columns will only contain one
+        element. The exception is operations that expand dimensions, like ohlc.
+        """
+        assert len(output) == len(columns)
+        assert list(output.keys()) == sorted(output.keys())
Review thread on these assertions:

> "this is by definition always true, right, as this is a property of a mapping? an assertion here is confusing"

> "Isn't this what you had asked me to add asserts for previously? If not then I am confused as to what asserts you wanted."

> "FWIW I don't consider this to be a long term approach anyway. Probably easier to make a named tuple that holds the position and label of each item in the result and construct from there; happy to do as a follow up."

> "not at all. an assert about a property of a dict is not very useful. I asked about an assert about the sortedness of output. You are assigning .columns = columns; the keys must be in the same order. There is no guarantee on this at all (except that dicts are sorted in > 3.6, and I think you must be adding them in the same order), but this is non-obvious, and non-trivial."

> "This isn't a dict property though - any chance you are mixing up a dict maintaining insertion order versus being sorted? Sorting the output would add a behavior change that I don't think is desirable either. I do agree this isn't the clearest way of managing in the long run but I'm trying to limit the scope of the PR. Any chance we can move forward as is and I can redo the dict insertion in a follow up? I think we should just use a namedtuple or something similar as the dict key that holds the position and label together."

> "you are still missing my point. you are assigning [comment truncated]"

> "Right, there is no easy way to do this in the current iteration. I was planning to do as a follow up - is that a requirement to do here?"

> "Sure there is: you are creating an enumeration of labels (0, 1, ... etc); you just need to assert that the columns are in the same order here."
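The disagreement in the thread above hinges on insertion order versus sortedness. A small self-contained illustration:

```python
# A dict preserves *insertion* order (guaranteed from Python 3.7), but
# its keys are not automatically sorted -- which is why the PR asserts
# sortedness explicitly instead of relying on a "property of a mapping".
d = {}
d[1] = "later position, inserted first"
d[0] = "earlier position, inserted second"

print(list(d.keys()))                      # [1, 0]: insertion order
print(list(d.keys()) == sorted(d.keys()))  # False: not sorted
```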
-        else:
-            name = self._selection_name
-            if name is None:
-                name = self._selected_obj.name
-            return Series(output, index=index, name=name)
+        result: Union[Series, DataFrame]
+        if len(output) > 1:
+            result = DataFrame(output, index=index)
+            result.columns = columns
+        else:
+            result = Series(output[0], index=index, name=columns[0])
+
+        return result
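The two-step construction above (build the DataFrame from the positionally keyed dict, then assign `columns`) matters when labels are duplicated. A sketch with invented data:

```python
import numpy as np
import pandas as pd

# output is keyed 0..n positionally, as in _wrap_series_output
output = {0: np.array([1.0, 2.0]), 1: np.array([3.0, 4.0])}
index = pd.Index(["x", "y"])
columns = pd.Index(["b", "b"])  # duplicate labels are legal here

# Passing columns= to the constructor would try to select dict keys by
# label; assigning afterwards keeps the positional alignment intact.
result = pd.DataFrame(output, index=index)
result.columns = columns
```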
+    def _wrap_aggregated_output(
+        self, output: Mapping[int, Union[Series, np.ndarray]], columns: Index
+    ) -> Union[Series, DataFrame]:
+        """
+        Wraps the output of a SeriesGroupBy aggregation into the expected result.
+
+        Parameters
+        ----------
+        output : Mapping[int, Union[Series, np.ndarray]]
+            Dict where the key represents the columnar-index and the values are
+            the actual results.
+        columns : pd.Index
+            Columns to apply to the output.
+
+        Returns
+        -------
+        Series or DataFrame
+
+        Notes
+        -----
+        In the vast majority of cases output and columns will only contain one
+        element. The exception is operations that expand dimensions, like ohlc.
+        """
+        assert list(output.keys()) == sorted(output.keys())

Review comment on the assert: "why is this not always true?"

-    def _wrap_aggregated_output(self, output, names=None):
         result = self._wrap_series_output(
-            output=output, index=self.grouper.result_index, names=names
+            output=output, index=self.grouper.result_index, columns=columns
         )
         return self._reindex_output(result)._convert(datetime=True)

-    def _wrap_transformed_output(self, output, names=None):
-        return self._wrap_series_output(
-            output=output, index=self.obj.index, names=names
-        )
+    def _wrap_transformed_output(
+        self, output: Mapping[int, Union[Series, np.ndarray]], columns: Index
+    ) -> Series:
+        """
+        Wraps the output of a SeriesGroupBy aggregation into the expected result.
+
+        Parameters
+        ----------
+        output : dict[int, Union[Series, np.ndarray]]
+            Dict with a sole key of 0 and a value of the result values.
+        columns : pd.Index
+            Columns to apply to the output.
+
+        Returns
+        -------
+        Series
+
+        Notes
+        -----
+        output and columns should only contain one element. These are containers
+        for generic compatibility with the DataFrameGroupBy class.
+        """
+        assert list(output.keys()) == sorted(output.keys())
+
+        result = self._wrap_series_output(
+            output=output, index=self.obj.index, columns=columns
+        )
+
+        # No transformations increase the ndim of the result
+        assert isinstance(result, Series)
+        return result

     def _wrap_applied_output(self, keys, values, not_indexed_same=False):
         if len(keys) == 0:
             # GH #6265
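The docstrings' note about dimension-expanding operations can be seen with `ohlc`, which turns a single grouped Series into four output columns (the data below is illustrative):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
result = s.groupby([0, 0, 1, 1]).ohlc()  # Series in, DataFrame out

print(list(result.columns))  # ['open', 'high', 'low', 'close']
```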
@@ -1104,17 +1193,6 @@ def _aggregate_item_by_item(self, func, *args, **kwargs) -> DataFrame:

         return DataFrame(result, columns=result_columns)

-    def _decide_output_index(self, output, labels):
-        if len(output) == len(labels):
-            output_keys = labels
-        else:
-            output_keys = sorted(output)
-
-        if isinstance(labels, MultiIndex):
-            output_keys = MultiIndex.from_tuples(output_keys, names=labels.names)
-
-        return output_keys
-
     def _wrap_applied_output(self, keys, values, not_indexed_same=False):
         if len(keys) == 0:
             return DataFrame(index=keys)
@@ -1579,27 +1657,66 @@ def _insert_inaxis_grouper_inplace(self, result):
         if in_axis:
             result.insert(0, name, lev)

-    def _wrap_aggregated_output(self, output, names=None):
-        agg_axis = 0 if self.axis == 1 else 1
-        agg_labels = self._obj_with_exclusions._get_axis(agg_axis)
+    def _wrap_aggregated_output(
+        self, output: Mapping[int, Union[Series, np.ndarray]], columns: Index
+    ) -> DataFrame:
+        """
+        Wraps the output of DataFrameGroupBy aggregations into the expected result.
+
+        Parameters
+        ----------
+        output : dict[int, Union[Series, np.ndarray]]
+            Dict where the key represents the columnar-index and the values are
+            the actual results.
+        columns : pd.Index
+            Column names to apply.
+
+        Returns
+        -------
+        DataFrame
+        """
+        assert list(output.keys()) == sorted(output.keys())

-        output_keys = self._decide_output_index(output, agg_labels)
+        result = DataFrame(output)
+        result.columns = columns

         if not self.as_index:
-            result = DataFrame(output, columns=output_keys)
             self._insert_inaxis_grouper_inplace(result)
             result = result._consolidate()
         else:
             index = self.grouper.result_index
-            result = DataFrame(output, index=index, columns=output_keys)
+            result.index = index

         if self.axis == 1:
             result = result.T

         return self._reindex_output(result)._convert(datetime=True)

-    def _wrap_transformed_output(self, output, names=None) -> DataFrame:
-        return DataFrame(output, index=self.obj.index)
+    def _wrap_transformed_output(
+        self, output: Mapping[int, Union[Series, np.ndarray]], columns: Index
Review comment: "here and elsewhere the [comment truncated]"

Reply: "I think it does. I had a go at trying to avoid the cast, and needed an instance EA in _wrap_series_output to get the tests to pass."

diff --git a/pandas/core/groupby/generic.py b/pandas/core/groupby/generic.py
index 92378ca91..20cbc9c96 100644
index 92378ca91..20cbc9c96 100644
--- a/pandas/core/groupby/generic.py
+++ b/pandas/core/groupby/generic.py
@@ -20,6 +20,7 @@ from typing import (
Type,
Union,
cast,
+ overload,
)
import warnings
@@ -50,6 +51,7 @@ from pandas.core.dtypes.missing import _isna_ndarraylike, isna, notna
from pandas._typing import FrameOrSeries
import pandas.core.algorithms as algorithms
+from pandas.core.arrays import ExtensionArray
from pandas.core.base import DataError, SpecificationError
import pandas.core.common as com
from pandas.core.frame import DataFrame
@@ -332,9 +334,29 @@ class SeriesGroupBy(GroupBy):
return DataFrame(results, columns=columns)
+ # TODO: _OUTPUT should be Union[Series, np.ndarray, ExtensionArray] but since
+ # np.ndarray resolves to Any, can't be used with overloads without producing
+ # error: Overloaded function signatures 1 and 2 overlap with incompatible return
+ # types
+ # while np.ndarray resolves to Any, it will be compatible with Series and
+ # ExtensionArray
+ _OUTPUT = Union[Series, ExtensionArray]
+
+ @overload
+ def _wrap_series_output(
+ self, output: Mapping[int, _OUTPUT], index: Index, columns: Index
+ ) -> DataFrame:
+ ...
+
+ @overload # noqa: F811
def _wrap_series_output(
+ self, output: _OUTPUT, index: Index, columns: Index
+ ) -> Series:
+ ...
+
+ def _wrap_series_output( # noqa: F811
self,
- output: Mapping[int, Union[Series, np.ndarray]],
+ output: Union[_OUTPUT, Mapping[int, _OUTPUT]],
index: Index,
columns: Index,
) -> Union[Series, DataFrame]:
@@ -360,17 +382,18 @@ class SeriesGroupBy(GroupBy):
In the vast majority of cases output and columns will only contain one
element. The exception is operations that expand dimensions, like ohlc.
"""
- assert len(output) == len(columns)
- assert list(output.keys()) == sorted(output.keys())
- result: Union[Series, DataFrame]
- if len(output) > 1:
+ if (
+ not isinstance(output, (Series, np.ndarray, ExtensionArray))
+ and len(output) > 1
+ ):
+ assert len(output) == len(columns)
+ assert list(output.keys()) == sorted(output.keys())
result = DataFrame(output, index=index)
result.columns = columns
+ return result
else:
- result = Series(output[0], index=index, name=columns[0])
-
- return result
+ return Series(output, index=index, name=columns[0])
def _wrap_aggregated_output(
self, output: Mapping[int, Union[Series, np.ndarray]], columns: Index
@@ -397,8 +420,13 @@ class SeriesGroupBy(GroupBy):
"""
assert list(output.keys()) == sorted(output.keys())
+ if len(output) == 1:
+ _output = output[0]
+ else:
+ _output = output
+
result = self._wrap_series_output(
- output=output, index=self.grouper.result_index, columns=columns
+ output=_output, index=self.grouper.result_index, columns=columns
)
return self._reindex_output(result)._convert(datetime=True)
@@ -426,13 +454,12 @@ class SeriesGroupBy(GroupBy):
"""
assert list(output.keys()) == sorted(output.keys())
+ assert len(output) == 1
+
result = self._wrap_series_output(
- output=output, index=self.obj.index, columns=columns
+ output=output[0], index=self.obj.index, columns=columns
)
- # No transformations increase the ndim of the result
- # Unfortunately need to cast for mypy to know this
- result = cast(Series, result)
return result
    def _wrap_applied_output(self, keys, values, not_indexed_same=False):

Follow-up comment: "p.s. I don't recommend the above since still needed [comment truncated]"
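The suggestion in the thread above leans on `typing.overload`. A self-contained sketch of the pattern (the function name and behavior here are hypothetical, not the PR's code):

```python
from collections.abc import Mapping as ABCMapping
from typing import Mapping, overload


@overload
def wrap_output(output: Mapping[int, list]) -> dict:
    ...


@overload
def wrap_output(output: list) -> list:  # noqa: F811
    ...


def wrap_output(output):  # noqa: F811
    # Single runtime implementation; the overloads above only tell mypy
    # that a Mapping input yields a dict and a list input yields a list,
    # avoiding a cast at each call site.
    if isinstance(output, ABCMapping):
        return dict(output)
    return list(output)
```

The `# noqa: F811` comments mirror the diff above: flake8 otherwise flags the repeated `def` of the same name as a redefinition.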
+    ) -> DataFrame:
+        """
+        Wraps the output of DataFrameGroupBy transformations into the expected result.
+
+        Parameters
+        ----------
+        output : dict[int, Union[Series, np.ndarray]]
+            Dict where the key represents the columnar-index and the values are
+            the actual results.
+        columns : pd.Index
+            Column names to apply.
+
+        Returns
+        -------
+        DataFrame
+        """
+        assert list(output.keys()) == sorted(output.keys())

Review comment on the assert: "same comment as above, I think you need to sort the columns (index is ok)"

+        result = DataFrame(output)
+        result.columns = columns
+        result.index = self.obj.index
+
+        return result

     def _wrap_agged_blocks(self, items, blocks):
         if not self.as_index:
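The `result.index = self.obj.index` assignment above is what keeps transformed output aligned with the original rows rather than the group keys. For example (data invented):

```python
import pandas as pd

df = pd.DataFrame({"g": [0, 0, 1], "v": [1, 2, 3]}, index=[10, 11, 12])
out = df.groupby("g")["v"].transform("sum")

# transform output is indexed like the input, not like the groups
print(out.index.tolist())  # [10, 11, 12]
```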
@@ -1719,9 +1836,11 @@ def groupby_series(obj, col=None):
     if isinstance(obj, Series):
         results = groupby_series(obj)
     else:
+        # TODO: this is duplicative of how GroupBy naturally works
+        # Try to consolidate with normal wrapping functions
         from pandas.core.reshape.concat import concat

-        results = [groupby_series(obj[col], col) for col in obj.columns]
+        results = [groupby_series(content, label) for label, content in obj.items()]
         results = concat(results, axis=1)
         results.columns.names = obj.columns.names
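The switch from `obj[col]` to `obj.items()` in the hunk above matters precisely for duplicate labels: selecting by a duplicated label returns a DataFrame, while `items()` yields one `(label, Series)` pair per column. A sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["a", "b", "b"])

# Label-based selection collapses both "b" columns into one DataFrame,
# which would break per-column iteration.
print(type(df["b"]).__name__)  # DataFrame

# items() walks positionally, yielding each column once, duplicates included.
labels = [label for label, _ in df.items()]
print(labels)  # ['a', 'b', 'b']
```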
@@ -1763,7 +1882,7 @@ def _normalize_keyword_aggregation(kwargs):
     """
     Normalize user-provided "named aggregation" kwargs.

-    Transforms from the new ``Dict[str, NamedAgg]`` style kwargs
+    Transforms from the new ``Mapping[str, NamedAgg]`` style kwargs
     to the old OrderedDict[str, List[scalar]]].

     Parameters

@@ -1784,7 +1903,7 @@ def _normalize_keyword_aggregation(kwargs):
     >>> _normalize_keyword_aggregation({'output': ('input', 'sum')})
     (OrderedDict([('input', ['sum'])]), ('output',), [('input', 'sum')])
     """
-    # Normalize the aggregation functions as Dict[column, List[func]],
+    # Normalize the aggregation functions as Mapping[column, List[func]],
     # process normally, then fixup the names.
     # TODO(Py35): When we drop python 3.5, change this to
     # defaultdict(list)
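`_normalize_keyword_aggregation` backs pandas' public named-aggregation syntax, where keyword names map to `(column, aggfunc)` pairs. A usage example (data invented):

```python
import pandas as pd

df = pd.DataFrame({"kind": ["cat", "dog", "cat"], "height": [9.1, 6.0, 9.5]})

# keyword name -> (column, aggfunc) pairs are what get normalized
out = df.groupby("kind").agg(min_height=("height", "min"))
print(list(out.columns))  # ['min_height']
```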