groupby/transform with NaNs in grouped column #9941

Closed
@evanpw

Description


What's the expected behavior when grouping on a column containing NaN and then applying transform? For a Series, the current result is to throw an exception:

>>> df = pd.DataFrame({
...     'a' : range(10),
...     'b' : [1, 1, 2, 3, np.nan, 4, 4, 5, 5, 5]})
>>> 
>>> df.groupby(df.b)['a'].transform(max)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/groupby.py", line 2422, in transform
    return self._transform_fast(cyfunc)
  File "pandas/core/groupby.py", line 2463, in _transform_fast
    return self._set_result_index_ordered(Series(values))
  File "pandas/core/groupby.py", line 498, in _set_result_index_ordered
    result.index = self.obj.index
  File "pandas/core/generic.py", line 1997, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/src/properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:41301)
    obj._set_axis(self.axis, value)
  File "pandas/core/series.py", line 273, in _set_axis
    self._data.set_axis(axis, labels)
  File "pandas/core/internals.py", line 2219, in set_axis
    'new values have %d elements' % (old_len, new_len))
ValueError: Length mismatch: Expected axis has 9 elements, new values have 10 elements
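One way to sidestep the length mismatch is to avoid `transform` altogether: aggregate per group, then map the result back through the key column. Rows whose key is NaN then naturally come back as NaN. A sketch of this workaround:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': range(10),
    'b': [1, 1, 2, 3, np.nan, 4, 4, 5, 5, 5]})

# Compute the per-group max once, then broadcast it back onto the
# original rows via the key column; the NaN-keyed row maps to NaN
# instead of raising a length-mismatch error.
result = df['b'].map(df.groupby('b')['a'].max())
```

This keeps the result the same length as the input, at the cost of upcasting to float so the NaN can be represented.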

For a DataFrame, the missing value gets filled in with what looks like an uninitialized value from np.empty_like:

>>> df.groupby(df.b).transform(max)
   a
0  1
1  1
2  2
3  3
4 -1
5  6
6  6
7  9
8  9
9  9

It seems that it should either fill in the missing values with NaN (which might require a dtype change) or drop those rows from the result entirely (which changes the shape). Either solution has the potential to surprise.
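For context, later pandas releases added a third option: keep NaN keys as their own group. A sketch assuming pandas >= 1.1, where `groupby` gained a `dropna` keyword:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': range(10),
    'b': [1, 1, 2, 3, np.nan, 4, 4, 5, 5, 5]})

# With dropna=False the NaN key forms its own group, so transform
# returns a full-length result with no uninitialized values: the
# NaN-keyed row gets the max of its own one-row group.
out = df.groupby('b', dropna=False)['a'].transform('max')
```

With the default `dropna=True`, modern pandas instead fills the NaN-keyed rows of the `transform` result with NaN, matching the first behavior proposed above.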
