Description
As the subject says, calling .agg() with a dictionary on a column that is entirely np.nan falls back to the Python aggregation functions instead of the Cython/NumPy-optimized ones.
To reproduce (my dataset is 60 columns by 100,000 rows): I imported a CSV in which one column was entirely null (np.nan), and that column's dtype came out as object. (That is one issue in itself: why use the large object container just to store np.nan? A way to sidestep the upcast at read time is sketched just after the snippet below.)
import collections
import numpy as np
import pandas as pd

sq = pd.read_table(sqFile, sep='\t', skiprows=1, nrows=None, header=0)
sq_g = sq.groupby(all_key_cols, as_index=False, sort=False)
sq_g.agg(sum)
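As an aside, one way to avoid the object container at import time is to declare the dtype up front when reading the file. A hedged sketch, assuming ColumnRef53 is the all-null column and that read_table's dtype argument is available in your pandas version:

# assumption: ColumnRef53 is the all-null column; declaring its dtype at
# read time keeps it float64 instead of object
sq = pd.read_table(sqFile, sep='\t', skiprows=1, header=0,
                   dtype={'ColumnRef53': np.float64})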
Without specifying a dictionary, calling sum over the entire grouped frame correctly uses the Cython-optimized numpy.sum:
10 loops, best of 3: 48.3 ms per loop
sq_g.agg(collections.OrderedDict({'ColumnRef53':'sum'}))
Specifying a dictionary with a column of dtype object whose rows are entirely np.nan falls back to Python (bad):
1 loops, best of 3: 7.26 s per loop
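For anyone without my file, here is a minimal self-contained sketch (the key and column names are made up to mirror the dataset) that should reproduce the same fast/slow split:

import collections
import numpy as np
import pandas as pd

# stand-in for the real data: one grouping key, one float column,
# and one all-NaN column deliberately stored as object dtype
n = 100000
df = pd.DataFrame({
    'key': np.random.randint(0, 1000, size=n),
    'ColumnRef20': np.random.randn(n),
    'ColumnRef53': pd.Series([np.nan] * n, dtype=object),
})
g = df.groupby('key', as_index=False, sort=False)

# fast: whole-frame sum takes the Cython path
%timeit g.agg(sum)
# slow: dict agg on the object column falls back to Python
%timeit g.agg(collections.OrderedDict({'ColumnRef53': 'sum'}))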
For reference (ColumnRef20 contains floats, ColumnRef53 is entirely np.nan):
sq.dtypes
# Row float64
Rowdesc.iphost object
...
ColumnRef20 float64
ColumnRef53 object
...
dtype: object
My workaround is to cast these np.nan-filled columns back down to float64; the dictionary aggregation then correctly uses the NumPy-optimized functions rather than Python:
# workaround for the groupby issue:
# downcast columns that are all NaN from object to float64 so agg()
# doesn't fall back to Python.
# first find all columns whose rows are all np.nan
data_cols = [x for x in sq.columns if x.startswith('Column')]
all_nan = pd.isnull(sq[data_cols]).all()
all_nan_cols = all_nan[all_nan].index.tolist()
# only need to downcast if the dtype is object
obj_downcast = sq[all_nan_cols].dtypes == object
obj_downcast_cols = obj_downcast[obj_downcast].index.tolist()
# downcast object to np.float64
for nan_col in obj_downcast_cols:
    sq[nan_col] = sq[nan_col].astype(np.float64)
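The same cast can be written more compactly; a sketch, assuming the object-dtype all-NaN columns are the only ones you want to touch:

# find and cast every object-dtype, all-NaN column in one pass
nan_obj_cols = [c for c in sq.columns
                if sq[c].dtype == object and sq[c].isnull().all()]
sq[nan_obj_cols] = sq[nan_obj_cols].astype(np.float64)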
Then the dictionary .agg() works as expected:
sq_g.agg(collections.OrderedDict({'ColumnRef53':'sum'}))
100 loops, best of 3: 6.2 ms per loop
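And a quick sanity check that the cast stuck, mirroring the dtypes listing above:

sq['ColumnRef53'].dtype
# dtype('float64')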