Closed
Description
Using the built-in sum
function gives correct result in df.agg(sum, axis=0)
, but wrong result in df.agg(sum, axis=1)
.
>>> df = pd.DataFrame([[np.nan, 2], [3, 4]])
>>> df.agg(sum, axis=0)
0 3.0
1 6.0
dtype: float64
>>> df.agg(sum, axis=1)
0 NaN
1 7.0
dtype: float64
The NaN
in the last example should be 2.0.
Also, operation using the builtin sum
in agg
with axis=1
are very slow:
>>> n = 1_000
>>> df = pd.DataFrame({'a': range(n), 'b': range(n)})
>>> %timeit df.agg(sum, axis=1)
16.5 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df.T.agg(sum) # correct result *and* faster
312 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Problem description
Currently, df.agg(func, axis=1)
defers to df.apply(func, axis=1)
. This is not done for axis=0
, and the operation may therefore give unexpected results and slow the operation down (because df.apply
can be very slow).
Expected Output
The expected output is:
>>> df.agg(sum, axis=1)
0 2.0
1 7.0
dtype: float64
Solution proposal
I'm thinking about putting in df.T.agg(func, axis=0)
rather than df.apply(func, axis=1)
in a few strategic places. This should ensure both getting correct results and faster operations. will report back if this succeeds.