Description
xref #9772
import numpy as np
import pandas as pd

df = pd.DataFrame({i: pd.Series(np.random.normal(size=10),
                                index=range(10)) for i in range(11)})
g = df.groupby(['a']*6 + ['b']*5, axis=1)
g.apply(lambda x: x.sum())
now raises, but used to give the (perhaps) surprising output
           a         b
0  -0.381070       NaN
1  -1.214075       NaN
2  -1.496252       NaN
3   3.392565       NaN
4  -0.782376       NaN
5   1.306043       NaN
6        NaN -1.772334
7        NaN  4.125280
8        NaN  1.992329
9        NaN  4.283854
10       NaN -4.791092
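To illustrate where the staggered NaN pattern comes from, here is a minimal sketch that does not go through groupby at all. The `groups` dict is a hypothetical stand-in for the grouping machinery; it assumes, as described in this issue, that each group reaches the UDF as an untransposed sub-frame of the selected columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10, 11)))
labels = ['a'] * 6 + ['b'] * 5

# Hypothetical stand-in for the groupby machinery: split the columns by
# label and keep each sub-frame untransposed.
groups = {
    key: df.loc[:, [lab == key for lab in labels]]
    for key in ('a', 'b')
}

# sub.sum() reduces over rows, so each result is indexed by column label
# (0..5 for 'a', 6..10 for 'b'); combining them leaves NaN holes.
col_sums = pd.concat({key: sub.sum() for key, sub in groups.items()}, axis=1)

# sub.sum(axis=1) reduces over the group's columns: one value per row.
row_sums = pd.concat({key: sub.sum(axis=1) for key, sub in groups.items()}, axis=1)

print(col_sums)
print(row_sums)
```

Here `col_sums` has 11 rows with the same staggered NaN layout as the output above, while `row_sums` has one row per original row and no missing values.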
A fix for this (today and previously) would be to pass axis=1 into the call to sum, but again I think that is viewed as unintuitive. In #9772 (comment) I argued:
...when pandas feeds a group of values into the UDF, they are not transposed. It seems reasonable to me to argue that they should be, but one technical hurdle here is what happens with a frame where the columns are different dtypes. Upon transposing, you now have columns of mixed dtypes, which are coerced to object type. So upon transposing the result back you lose type information. Since the UDF can return anything, there is no way to reliably determine what the resulting dtypes should be.
Of course, an argument against transposing the group when passing it to the UDF is that this would be a rather large change for what seems to me to be of little value.
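The dtype concern in the quoted comment can be reproduced with a small mixed-dtype frame. This is only a sketch: the frame and its column names are made up for illustration, and the `infer_objects` step shows one possible way to recover dtypes in a simple case, not what pandas does internally.

```python
import pandas as pd

# Illustrative frame with one integer column and one string column.
mixed = pd.DataFrame({'n': [1, 2, 3], 's': ['x', 'y', 'z']})

# Each transposed column now holds an int and a str, so it becomes object.
transposed = mixed.T

# Transposing back does not restore the original integer dtype.
roundtrip = transposed.T

# In this simple case the dtype of 'n' can be inferred again after the fact.
recovered = roundtrip.infer_objects()

print(mixed.dtypes)
print(roundtrip.dtypes)
print(recovered.dtypes)
```

Whether inference like this is reliable enough for arbitrary UDF results is exactly the open question raised in the quoted comment.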
A few counterpoints that I've realized in the meantime:
- The case of multiple dtypes seems to me to be a very minor one, to the point of insignificance. Is there an example of a (somewhat natural) function that is applied to multiple dtypes where the resulting dtype does not get coerced correctly? I've played around with this in the code, and the only such example in the tests is the identity function.
- This is notably not how groupby(..., axis=1).transform works, nor is it how apply/transform/agg with no groupby and axis=1 work. These methods all feed in the row as a Series so that supplying axis=1 results in an error.
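The non-groupby behavior mentioned in the last bullet can be checked with a short sketch. The frame and the `row_sum` helper are made up for illustration; the point is that `DataFrame.apply(..., axis=1)` hands each row to the function as a Series, so a reduction inside the function cannot take axis=1.

```python
import pandas as pd

df = pd.DataFrame({'p': [1.0, 2.0], 'q': [3.0, 4.0]})

seen_types = []

def row_sum(row):
    # Record what kind of object apply passes in for each row.
    seen_types.append(type(row).__name__)
    return row.sum()

# Each row arrives as a Series, so a plain sum() gives the per-row total.
per_row = df.apply(row_sum, axis=1)

# A Series has only axis 0, so asking for axis=1 inside the function fails.
try:
    df.apply(lambda row: row.sum(axis=1), axis=1)
    raised = False
except ValueError:
    raised = True

print(per_row)
print(set(seen_types), raised)
```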
I'm now of the opinion that transposing the inputs and results is more maintainable and easier to grok for users.