What's the expected behavior when grouping on a column containing NaN and then applying transform? For a Series, the current result is to throw an exception:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'a' : range(10),
... 'b' : [1, 1, 2, 3, np.nan, 4, 4, 5, 5, 5]})
>>>
>>> df.groupby(df.b)['a'].transform(max)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pandas/core/groupby.py", line 2422, in transform
return self._transform_fast(cyfunc)
File "pandas/core/groupby.py", line 2463, in _transform_fast
return self._set_result_index_ordered(Series(values))
File "pandas/core/groupby.py", line 498, in _set_result_index_ordered
result.index = self.obj.index
File "pandas/core/generic.py", line 1997, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/src/properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:41301)
obj._set_axis(self.axis, value)
File "pandas/core/series.py", line 273, in _set_axis
self._data.set_axis(axis, labels)
File "pandas/core/internals.py", line 2219, in set_axis
'new values have %d elements' % (old_len, new_len))
ValueError: Length mismatch: Expected axis has 9 elements, new values have 10 elements
For a DataFrame, the missing value gets filled in with what looks like an uninitialized value from np.empty_like:
>>> df.groupby(df.b).transform(max)
a
0 1
1 1
2 2
3 3
4 -1
5 6
6 6
7 9
8 9
9 9
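For reference, a minimal sketch of why an arbitrary integer like -1 can show up, assuming the result buffer is allocated with np.empty_like on the original integer column (the variable names here are illustrative, not pandas internals):

```python
import numpy as np

# np.empty_like allocates a buffer with the same shape and dtype as
# the source, but does NOT initialize its contents.  For an integer
# dtype there is no NaN, so any slot that is never written (here, the
# row whose group key was NaN) holds arbitrary leftover memory.
src = np.arange(10)
out = np.empty_like(src)  # contents unspecified until assigned

# The shape and dtype match the source, but the values are garbage
# until each position is explicitly written by the transform loop.
```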
It seems like either it should fill in the missing values with NaN (which might require a change of dtype), or just drop those rows from the result (which requires the shape to change). Either solution has the potential to surprise.
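One way to get the NaN-filling behavior today is to compute the per-group aggregate and map it back onto the key column; this is a workaround sketch (the names group_max and result are illustrative), and note the result is upcast to float so it can hold NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': range(10),
    'b': [1, 1, 2, 3, np.nan, 4, 4, 5, 5, 5]})

# Per-group max of 'a'; groupby drops the NaN key by default.
group_max = df.groupby('b')['a'].max()

# Map each row's key back to its group max.  Rows whose key is NaN
# find no match and come out as NaN, so the dtype becomes float64.
result = df['b'].map(group_max)
# result: [1.0, 1.0, 2.0, 3.0, NaN, 6.0, 6.0, 9.0, 9.0, 9.0]
```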