Aggregating with a dictionary that specifies an all-NaN column fails to use numpy

As the subject says, calling `.agg()` with a dictionary that names a column containing only `np.nan` values falls back to the Python aggregation functions instead of the numpy ones.

To reproduce (my dataset is 60 columns by 100,000 rows):

I imported a CSV in which one column was entirely null (`np.nan`). That column's dtype was set to object. (That is an issue in itself: why upcast to the larger object container just to store `np.nan`?)

```
# sqFile and all_key_cols (the grouping columns) are defined elsewhere
sq = pd.read_table(sqFile, sep='\t', skiprows=1, nrows=None, header=0)
sq_g = sq.groupby(all_key_cols, as_index=False, sort=False)
```
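The report does not show why the all-NaN column came back as object. As a minimal sketch, assuming the column names are known in advance, the dtype can be pinned at read time with the `dtype` argument of `pd.read_csv`. The inline TSV text and its three columns are made up for illustration and are not the reporter's file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real file: ColumnRef53 is empty in every row.
tsv_text = (
    "key\tColumnRef20\tColumnRef53\n"
    "a\t1.5\t\n"
    "b\t2.5\t\n"
    "a\t3.0\t\n"
)

# Pin the dtype of the column that is expected to be all-NaN.
df = pd.read_csv(
    io.StringIO(tsv_text),
    sep="\t",
    dtype={"ColumnRef53": np.float64},
)

print(df.dtypes)
```

With the dtype pinned, the empty fields are stored as float64 NaN, so no cast is needed later.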

`sq_g.agg(sum)`
Without a dictionary, summing over the entire DataFrame correctly uses the Cython-optimized `numpy.sum`:
`10 loops, best of 3: 48.3 ms per loop`

`sq_g.agg(collections.OrderedDict({'ColumnRef53':'sum'}))`
With a dictionary that names an object-dtype column consisting entirely of `np.nan`, the aggregation falls back to Python (bad):
`1 loops, best of 3: 7.26 s per loop`

For reference (`ColumnRef20` holds floats, `ColumnRef53` is entirely `np.nan`):

```
sq.dtypes
# Row               float64
Rowdesc.iphost       object
...
ColumnRef20         float64
ColumnRef53          object
...
dtype: object
```
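Since the original data is not attached, here is a rough sketch of a synthetic frame that mimics the layout described above, together with a small timing helper. The names `make_frame` and `time_agg`, the single integer key, and the sizes are assumptions for illustration. The object dtype is forced on purpose to imitate the reported column.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows=100_000, n_keys=1_000, seed=0):
    """Build a frame with one key, one float column and one all-NaN object column."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "key": rng.integers(0, n_keys, size=n_rows),
        "ColumnRef20": rng.random(n_rows),
        # dtype=object is forced to mimic the column from the report
        "ColumnRef53": pd.Series([np.nan] * n_rows, dtype=object),
    })


def time_agg(df, column, repeat=3):
    """Return the best time in seconds of a dictionary-style sum on one column."""
    grouped = df.groupby("key", as_index=False, sort=False)
    return min(
        timeit.repeat(lambda: grouped.agg({column: "sum"}), number=1, repeat=repeat)
    )
```

Comparing `time_agg(df, "ColumnRef20")` with `time_agg(df, "ColumnRef53")` on `df = make_frame()` should show whether the slow path is still taken on a given pandas version. No timings are claimed here.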

My workaround is to cast these all-NaN columns from object back to float64; the dictionary aggregation then uses the numpy-optimized functions instead of Python:

```
# Workaround for the groupby issue: cast all-NaN columns from object to
# float64 so that agg() does not fall back to Python.
import numpy as np
import pandas as pd

# first find the data columns whose rows are all np.nan
data_cols = [x for x in sq_concat.columns if x.startswith('Column')]
all_nan = pd.isnull(sq_concat[data_cols]).all()
all_nan_cols = all_nan[all_nan].index.tolist()

# only columns whose dtype is object need the cast
obj_downcast = sq_concat[all_nan_cols].dtypes == object
obj_downcast_cols = obj_downcast[obj_downcast].index.tolist()

# cast object to np.float64
for nan_col in obj_downcast_cols:
    sq_concat[nan_col] = sq_concat[nan_col].astype(np.float64)
```
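The same workaround can be packaged as a reusable helper. This is only a sketch: the function names `all_nan_object_columns` and `cast_all_nan_to_float` are hypothetical. Unlike the snippet above, it does not filter on the `Column` name prefix, and it returns a new frame rather than modifying the input.

```python
import numpy as np
import pandas as pd


def all_nan_object_columns(df):
    """List the object-dtype columns whose values are all missing."""
    return [c for c in df.columns if df[c].dtype == object and df[c].isna().all()]


def cast_all_nan_to_float(df):
    """Return a copy of df with all-NaN object columns cast to float64."""
    cols = all_nan_object_columns(df)
    if not cols:
        return df.copy()
    return df.astype({c: np.float64 for c in cols})


# Small made-up frame: only "b" is both object dtype and entirely NaN.
example = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": pd.Series([np.nan, np.nan, np.nan], dtype=object),
    "c": pd.Series([np.nan, "x", np.nan], dtype=object),
})
converted = cast_all_nan_to_float(example)
print(converted.dtypes)
```

Returning a copy leaves the original frame untouched, which makes it easy to time the aggregation before and after the cast.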

After the cast, the dictionary `.agg()` runs at the expected speed:
`sq_g.agg(collections.OrderedDict({'ColumnRef53':'sum'}))`
`100 loops, best of 3: 6.2 ms per loop`
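As a sanity check on the values rather than the speed, the sketch below aggregates a tiny made-up frame after the cast. It rebuilds the GroupBy from the converted frame, which is an assumption on my part, since the report reuses `sq_g`. With the default `min_count=0`, summing an all-NaN float64 group yields `0.0`; passing `min_count=1` to `sum` keeps `NaN` instead.

```python
import numpy as np
import pandas as pd

# Made-up data: two groups, one all-NaN column stored as object.
df = pd.DataFrame({
    "key": ["a", "b", "a", "b"],
    "ColumnRef53": pd.Series([np.nan] * 4, dtype=object),
})

# Cast first, then build the GroupBy from the converted frame.
df["ColumnRef53"] = df["ColumnRef53"].astype(np.float64)
result = df.groupby("key", as_index=False, sort=False).agg({"ColumnRef53": "sum"})

# Same sum, but requiring at least one non-NaN value per group.
with_min_count = df.groupby("key", sort=False)["ColumnRef53"].sum(min_count=1)

print(result)
print(with_min_count)
```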

