
Serializing Pandas Functions #12021

Closed
@mrocklin


While using pandas across multiple machines, I've found that some of its functions are tricky to serialize. This appears to be because they are generated at runtime inside factory functions. Here are a few examples of serialization breaking, occasionally in unpleasant ways:

In [1]: import pandas as pd
In [2]: import pickle
In [3]: pd.read_csv
Out[3]: <function pandas.io.parsers._make_parser_function.<locals>.parser_f>
In [4]: pickle.loads(pickle.dumps(pd.read_csv))
AttributeError: Can't pickle local object '_make_parser_function.<locals>.parser_f'
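The failure can be reproduced without pandas. The sketch below is only an illustration: `make_parser` and `parser_f` are hypothetical names, not the actual pandas code. pickle serializes functions by reference (module plus qualified name), and a function defined inside another function has `<locals>` in its qualified name, so it cannot be looked up again at load time.

```python
import pickle


def make_parser(sep):
    # Hypothetical factory, loosely modelled on the pattern the error
    # message points at: the returned function is defined inside another
    # function, so its qualified name contains "<locals>".
    def parser_f(text):
        return text.split(sep)
    return parser_f


read_csv_like = make_parser(",")
print(read_csv_like.__qualname__)  # make_parser.<locals>.parser_f


def try_pickle(func):
    """Return None on success, or the exception raised by pickle.dumps."""
    try:
        pickle.dumps(func)
    except (AttributeError, pickle.PicklingError) as exc:
        # The exact exception type differs between Python versions and
        # between the C and pure-Python picklers, so both are caught.
        return exc
    return None


error = try_pickle(read_csv_like)
print(type(error).__name__, error)
```

The function itself works fine when called; only the lookup by name fails.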

Lest you think that this is just a problem with pickle (which has many flaws), dill, a much more robust function serialization library, also fails, although only on Python 3.5. (cc @mmckerns)

In [5]: import dill
In [6]: dill.loads(dill.dumps(pd.read_csv))
PicklingError: Can't pickle <function _make_parser_function.<locals>.parser_f at 0x7f71f5ec1158>: it's not found as pandas.io.parsers._make_parser_function.<locals>.parser_f

In this particular case, though, cloudpickle works.

Other functions have this problem as well. Consider the Series methods:

In [7]: pickle.loads(pickle.dumps(pd.Series.sum))
AttributeError: Can't pickle local object '_make_stat_function.<locals>.stat_func'

In this case, worryingly, cloudpickle completes but returns the wrong result:

In [9]: import cloudpickle
In [11]: pd.Series.sum
Out[11]: <function pandas.core.generic._make_stat_function.<locals>.stat_func>

In [12]: cloudpickle.loads(cloudpickle.dumps(pd.Series.sum))
Out[12]: <function stat_func>

I've been able to fix some of these in cloudpipe/cloudpickle#46, but generally speaking I'm running into a number of problems here. It would be useful if, when generating these functions, we at least assigned metadata like __name__ correctly. This one in particular confused me for a while:

In [15]: pd.Series.cumsum.__name__
Out[15]: 'sum'
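A rough sketch of what assigning the metadata correctly could look like. Everything here is hypothetical: `Column` and this `_make_stat_function` are stand-ins written for illustration, not the pandas implementation. The idea is that if `__name__`, `__qualname__` and `__module__` describe where the generated function actually lives, plain pickle can find it again by reference.

```python
import pickle


class Column:
    """Hypothetical stand-in for a class whose methods are generated."""

    def __init__(self, values):
        self.values = list(values)


def _make_stat_function(owner, name, reducer):
    # Build the method at runtime, then overwrite the metadata that
    # serializers inspect so it matches where the function really lives.
    def stat_func(self):
        return reducer(self.values)

    stat_func.__name__ = name
    stat_func.__qualname__ = f"{owner.__name__}.{name}"
    stat_func.__module__ = owner.__module__
    return stat_func


# Attach the generated functions under the same names the metadata claims.
for _name, _reducer in [("sum", sum), ("max", max)]:
    setattr(Column, _name, _make_stat_function(Column, _name, _reducer))

# pickle now resolves "Column.sum" by attribute lookup on the module.
restored = pickle.loads(pickle.dumps(Column.sum))
print(Column.sum.__name__, restored is Column.sum)
```

This assumes the owning class is importable at module level; a class that is itself created inside a function would hit the same `<locals>` problem.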

What would help?

  • Testing that most of the API is serializable
  • Looking at what metadata the serialization libraries use, and making sure that this metadata is enough to properly identify the function. Some relevant snippets from cloudpickle follow:
    def save_instancemethod(self, obj):
        # Memoization rarely is ever useful due to python bounding
        if obj.__self__ is None:
            self.save_reduce(getattr, (obj.im_class, obj.__name__))
        else:
            if PY3:
                self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
            else:
                self.save_reduce(types.MethodType, (obj.__func__, obj.__self__, obj.__self__.__class__),
                         obj=obj)

    def _reduce_method_descriptor(obj):
        return (getattr, (obj.__objclass__, obj.__name__))
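The first suggestion above (testing that most of the API is serializable) could start from something like the following sketch. The helper `find_unpicklable` and the `demo` namespace are made up for illustration; in practice the namespace would be a real module, and the identity check is an assumption that only suits objects pickled by reference, such as functions and classes.

```python
import pickle
import types


def find_unpicklable(namespace):
    """Return sorted names of public callables that fail a pickle round trip."""
    failures = []
    for name in dir(namespace):
        if name.startswith("_"):
            continue
        obj = getattr(namespace, name)
        if not callable(obj):
            continue
        try:
            restored = pickle.loads(pickle.dumps(obj))
        except Exception:
            failures.append(name)
            continue
        # A round trip that "succeeds" but yields a different object also
        # counts as a failure here. This identity check fits functions and
        # classes, which pickle stores by reference.
        if restored is not obj:
            failures.append(name)
    return sorted(failures)


def _make_local():
    def local_func():
        return 1
    return local_func


# Hypothetical namespace standing in for a real module such as pandas.
demo = types.SimpleNamespace(length=len, local=_make_local(), version="1.0")
print(find_unpicklable(demo))  # ['local']
```

The same loop could be run over class attributes (for example the methods of a single class) to catch cases like the Series methods shown earlier.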

Metadata


    Labels

    Compat: pandas objects compatibility with NumPy or Python functions
