aggregating with a dictionary that specifies a column that has all nan's fails to use numpy #9149

Closed
@flyingchipmunk

As the subject says, if I call .agg() with a dictionary that names a column containing only np.nan values, it falls back to Python aggregation functions instead of the NumPy ones.

To reproduce (my dataset is 60 columns by 100,000 rows):

I imported a CSV in which one column was entirely null (np.nan), and that column's dtype was set to object. (That is an issue in itself: why upcast to the large object container just to store np.nan?)

import collections
import numpy as np
import pandas as pd

sq = pd.read_table(sqFile, sep='\t', skiprows=1, nrows=None, header=0)
sq_g = sq.groupby(all_key_cols, as_index=False, sort=False)

sq_g.agg(sum)
Without a dictionary, applying sum across the entire DataFrame correctly uses the Cython-optimized numpy.sum:
10 loops, best of 3: 48.3 ms per loop

sq_g.agg(collections.OrderedDict({'ColumnRef53':'sum'}))
With a dictionary that names a column of dtype object whose rows are all np.nan, it falls back to Python (bad):
1 loops, best of 3: 7.26 s per loop

For reference, ColumnRef20 holds floats and ColumnRef53 holds only np.nan:

sq.dtypes
# Row               float64
Rowdesc.iphost       object
...
ColumnRef20         float64
ColumnRef53          object
...
dtype: object
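
The comparison above can be probed on synthetic data. This is only a rough sketch, not the original dataset: the column names `key`, `all_nan_obj`, and `all_nan_float`, the row and group counts, and the use of `timeit` are all assumptions made for illustration. Whether the timing gap shows up, and how large it is, will likely depend on the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sizes; the original report used 60 columns by 100,000 rows.
n_rows, n_groups = 10_000, 500
rng = np.random.default_rng(0)

df = pd.DataFrame({
    "key": rng.integers(0, n_groups, size=n_rows),
    # All-NaN column stored as object, mimicking ColumnRef53 in the report.
    "all_nan_obj": pd.Series([np.nan] * n_rows, dtype=object),
})
# The same values stored as float64, for comparison.
df["all_nan_float"] = df["all_nan_obj"].astype(np.float64)

grouped = df.groupby("key", as_index=False, sort=False)

# Dictionary aggregation on each variant of the column.
obj_result = grouped.agg({"all_nan_obj": "sum"})
float_result = grouped.agg({"all_nan_float": "sum"})

# Time a few calls of each; absolute numbers vary by machine and version.
repeats = 3
obj_time = timeit.timeit(lambda: grouped.agg({"all_nan_obj": "sum"}), number=repeats)
float_time = timeit.timeit(lambda: grouped.agg({"all_nan_float": "sum"}), number=repeats)
print(f"object dtype:  {obj_time / repeats:.4f} s per call")
print(f"float64 dtype: {float_time / repeats:.4f} s per call")
```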

My workaround is to cast these np.nan-filled columns from object back to float64; the dictionary aggregation then correctly uses the NumPy-optimized functions instead of Python:

# Workaround for the groupby issue: cast all-NaN columns from object to
# float64 so agg() doesn't fall back to Python.

# First, find the data columns in which every row is np.nan
data_cols = [x for x in sq_concat.columns if x.startswith('Column')]
all_nan = sq_concat[data_cols].isnull().all()
all_nan_cols = all_nan[all_nan].index.tolist()

# Only columns of dtype object need casting
obj_downcast = sq_concat[all_nan_cols].dtypes == object
obj_downcast_cols = obj_downcast[obj_downcast].index.tolist()

# Cast object to np.float64 (astype converts the whole column at once)
for nan_col in obj_downcast_cols:
    sq_concat[nan_col] = sq_concat[nan_col].astype(np.float64)

Then the dictionary .agg() works as expected:
sq_g.agg(collections.OrderedDict({'ColumnRef53':'sum'}))
100 loops, best of 3: 6.2 ms per loop
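
The workaround above could also be wrapped in a small reusable helper. This is a minimal sketch under assumptions: the function name `cast_all_nan_object_columns` is hypothetical, it checks every column rather than only those starting with 'Column', and the tiny DataFrame exists only to make the example self-contained.

```python
import numpy as np
import pandas as pd


def cast_all_nan_object_columns(df):
    """Return a copy of df with all-NaN object columns cast to float64.

    Hypothetical helper: columns that hold any non-null value, or that are
    not of dtype object, are left unchanged.
    """
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == object and out[col].isna().all():
            out[col] = out[col].astype(np.float64)
    return out


# Illustrative data; ColumnRef53 mimics the all-NaN object column.
df = pd.DataFrame({
    "key": ["a", "a", "b"],
    "ColumnRef20": [1.0, 2.0, 3.0],
    "ColumnRef53": pd.Series([np.nan, np.nan, np.nan], dtype=object),
})

fixed = cast_all_nan_object_columns(df)
print(fixed.dtypes)

# The groupby must be created from the converted frame.
result = fixed.groupby("key", as_index=False, sort=False).agg({"ColumnRef53": "sum"})
print(result)
```

Note that the groupby object has to be rebuilt after the conversion, since one created beforehand would still refer to the original object-dtype column.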
