Description
In recent efforts using Pandas on multiple machines, I've found that some of its functions are tricky to serialize. This appears to be because they are generated at runtime. Here are a few examples of serialization breaking, occasionally in unpleasant ways:
In [1]: import pandas as pd
In [2]: import pickle
In [3]: pd.read_csv
Out[3]: <function pandas.io.parsers._make_parser_function.<locals>.parser_f>
In [4]: pickle.loads(pickle.dumps(pd.read_csv))
AttributeError: Can't pickle local object '_make_parser_function.<locals>.parser_f'
Lest you think that this is just a problem with pickle (which has many flaws), dill, a much more robust function serialization library, also fails, though the failure here occurs only on Python 3.5. (cc @mmckerns)
In [5]: import dill
In [6]: dill.loads(dill.dumps(pd.read_csv))
PicklingError: Can't pickle <function _make_parser_function.<locals>.parser_f at 0x7f71f5ec1158>: it's not found as pandas.io.parsers._make_parser_function.<locals>.parser_f
In this particular case, though, cloudpickle will work.
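For context, the same failure can be reproduced without pandas. pickle stores a function as a reference (module name plus qualified name) rather than by value, so a function created inside another function carries <locals> in its __qualname__ and cannot be looked up again. A minimal sketch, where make_parser and parser_f are hypothetical stand-ins and not the pandas implementation:

```python
import pickle


def make_parser(sep):
    # Hypothetical factory, standing in for a function generated at runtime.
    def parser_f(text):
        return text.split(sep)

    return parser_f


read_csv_like = make_parser(",")

# The qualified name records that the function lives inside make_parser.
print(read_csv_like.__qualname__)

try:
    pickle.dumps(read_csv_like)
    pickled = True
except (AttributeError, pickle.PicklingError) as exc:
    # The exact exception type and message vary between Python versions.
    pickled = False
    print(type(exc).__name__, exc)
```

cloudpickle sidesteps this by serializing such functions by value, which is presumably why it succeeds on read_csv.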
Other functions have this problem as well. Consider the series methods:
In [7]: pickle.loads(pickle.dumps(pd.Series.sum))
AttributeError: Can't pickle local object '_make_stat_function.<locals>.stat_func'
In this case, concerningly, cloudpickle completes but returns a wrong result:
In [9]: import cloudpickle
In [11]: pd.Series.sum
Out[11]: <function pandas.core.generic._make_stat_function.<locals>.stat_func>
In [12]: cloudpickle.loads(cloudpickle.dumps(pd.Series.sum))
Out[12]: <function stat_func>
I've been able to fix some of these in cloudpipe/cloudpickle#46, but generally speaking I'm running into a number of problems here. It would be useful if, during the generation of these functions, we could at least pay attention to assigning metadata like __name__ correctly. This one in particular confused me for a while:
In [15]: pd.Series.cumsum.__name__
Out[15]: 'sum'
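To make the request concrete, here is a rough sketch of what assigning that metadata during generation might look like. The Series class, the factory and survives_pickle below are all hypothetical and much simpler than the real pandas code; the only point is that setting __name__, __qualname__ and __module__ lets a serializer find the function again as an attribute of its class.

```python
import itertools
import pickle


class Series:
    """Hypothetical stand-in for a class whose methods are generated at runtime."""

    def __init__(self, values):
        self.values = list(values)


def _make_stat_function(cls, name, op):
    """Build a method and label it so it can be found again as cls.<name>."""

    def stat_func(self):
        return op(self.values)

    # Without these assignments the function reports itself as
    # '_make_stat_function.<locals>.stat_func', which cannot be looked up.
    stat_func.__name__ = name
    stat_func.__qualname__ = f"{cls.__qualname__}.{name}"
    stat_func.__module__ = cls.__module__
    return stat_func


_OPS = [("sum", sum), ("cumsum", lambda v: list(itertools.accumulate(v)))]
for _name, _op in _OPS:
    setattr(Series, _name, _make_stat_function(Series, _name, _op))


def survives_pickle(func):
    """Return True if pickle can round-trip func back to the same object."""
    try:
        return pickle.loads(pickle.dumps(func)) is func
    except (AttributeError, pickle.PicklingError):
        return False


print(Series.cumsum.__name__, Series.cumsum.__qualname__)
```

With the metadata in place, survives_pickle(Series.cumsum) should succeed whenever Series itself is importable from its module, because pickle only needs to record the reference.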
What would help?
- Testing that most of the API is serializable
- Looking at what metadata the serialization libraries use, and making sure that this metadata is enough to properly identify the function. Some relevant snippets from cloudpickle follow:
def save_instancemethod(self, obj):
    # Memoization rarely is ever useful due to python bounding
    if obj.__self__ is None:
        self.save_reduce(getattr, (obj.im_class, obj.__name__))
    else:
        if PY3:
            self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
        else:
            self.save_reduce(types.MethodType, (obj.__func__, obj.__self__, obj.__self__.__class__),
                             obj=obj)

def _reduce_method_descriptor(obj):
    return (getattr, (obj.__objclass__, obj.__name__))
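As for the first suggestion above, a test of API serializability could be sketched roughly as follows. The helper unpicklable_callables is hypothetical, and the stdlib json module is used only as a stand-in target so that the example runs anywhere. The idea would be to point it at pandas and at classes such as pandas.Series instead, and possibly to repeat the check with dill and cloudpickle.

```python
import json
import pickle


def unpicklable_callables(namespace):
    """Return names of public callables in namespace that pickle cannot round-trip."""
    failures = []
    for name in sorted(dir(namespace)):
        if name.startswith("_"):
            continue
        obj = getattr(namespace, name)
        if not callable(obj):
            continue
        try:
            restored = pickle.loads(pickle.dumps(obj))
        except (AttributeError, TypeError, pickle.PicklingError):
            failures.append(name)
            continue
        # Functions and classes are pickled by reference, so a correct
        # round trip should hand back the very same object.
        if restored is not obj:
            failures.append(name)
    return failures


# json is only a placeholder for the package under test.
print(unpicklable_callables(json))
```

The identity check is deliberate: it would also flag the cloudpickle case shown earlier, where serialization completes but the result is not the original function.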