Description
The following code fails:
import pandas as pd
pdf = pd.DataFrame(dict(A=[1,1,1,2,2,2], B = [1,2,3,4,5,6]))
pdf['A'] = pdf['A'].astype('category')
pdf.set_index('A', inplace = True)
pdf.to_msgpack('/some/path')
pdf2 = pd.read_msgpack('/some/path')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-6-cab186b6bdcd> in <module>()
4 pdf.set_index('A', inplace = True)
5 pdf.to_msgpack('/some/path')
----> 6 pdf2 = pd.read_msgpack('/some/path')
/.../Anaconda2/lib/python2.7/site-packages/pandas/io/packers.pyc in read_msgpack(path_or_buf, encoding, iterator, **kwargs)
200 if exists:
201 with open(path_or_buf, 'rb') as fh:
--> 202 return read(fh)
203
204 # treat as a binary-like
/.../Anaconda2/lib/python2.7/site-packages/pandas/io/packers.pyc in read(fh)
185
186 def read(fh):
--> 187 l = list(unpack(fh, encoding=encoding, **kwargs))
188 if len(l) == 1:
189 return l[0]
pandas/msgpack/_unpacker.pyx in pandas.msgpack._unpacker.Unpacker.__next__ (pandas/msgpack/_unpacker.cpp:5618)()
pandas/msgpack/_unpacker.pyx in pandas.msgpack._unpacker.Unpacker._unpack (pandas/msgpack/_unpacker.cpp:4602)()
/.../Anaconda2/lib/python2.7/site-packages/pandas/io/packers.pyc in decode(obj)
557 data = unconvert(obj[u'data'], dtype,
558 obj.get(u'compress'))
--> 559 return globals()[obj[u'klass']](data, dtype=dtype, name=obj[u'name'])
560 elif typ == u'range_index':
561 return globals()[obj[u'klass']](obj[u'start'],
KeyError: u'CategoricalIndex'
Problem description
read_msgpack does not appear to support a CategoricalIndex, even though a DataFrame with a CategoricalIndex can be saved with to_msgpack without any error.
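The KeyError in the traceback is raised by a lookup of the form globals()[obj[u'klass']], i.e. the reader resolves the class name stored in the file against the names available in its own module. Below is a minimal, pandas-free sketch of that pattern. It is not the actual packers.py code; KNOWN_CLASSES, encode, decode, and the two stand-in classes are hypothetical names used only for illustration. It shows how a writer can record a class name that the reader's lookup does not contain.

```python
# Illustrative sketch only: mimics a name-based class lookup like the
# globals()[obj[u'klass']] call in the traceback. All names are hypothetical.


class Index(object):
    """Stand-in for a plain index type."""

    def __init__(self, data, name=None):
        self.data = list(data)
        self.name = name


class CategoricalIndex(Index):
    """Stand-in for an index type the reader does not know about."""


# The reader only knows the classes registered here.
# CategoricalIndex is deliberately missing, as in the reported failure.
KNOWN_CLASSES = {"Index": Index}


def encode(index):
    # The writer records whatever class name the object has.
    return {"klass": type(index).__name__, "name": index.name, "data": index.data}


def decode(obj):
    # The reader resolves the class by name; an unknown name raises KeyError.
    cls = KNOWN_CLASSES[obj["klass"]]
    return cls(obj["data"], name=obj["name"])


# A known class round-trips without problems.
plain = decode(encode(Index([1, 2, 3], name="A")))

# An unknown class is written successfully but fails on read.
try:
    decode(encode(CategoricalIndex([1, 1, 2], name="A")))
    error = None
except KeyError as exc:
    error = exc.args[0]
```

If the real reader works this way, writing succeeds because encoding only needs the class name, while reading fails because that name cannot be resolved.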
Background: I am currently using the to_msgpack method to save a dask dataframe whose index is (something like) a timestamp and is not unique. Overall I am very satisfied with the performance of to_msgpack. However, when it comes to space efficiency, a categorical index would probably provide a significant improvement.
Or maybe it is supported and I am just using it incorrectly?
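To illustrate the space argument from the background above, here is a rough, pandas-free sketch of categorical (dictionary) encoding applied to a non-unique index. The timestamps are made-up sample values, and the byte counts only compare raw label bytes; the actual saving with to_msgpack would depend on how pandas serializes the data, which this sketch does not measure.

```python
# Illustrative sketch only: dictionary ("categorical") encoding of a
# non-unique index. The timestamps are made-up sample values.

timestamps = [
    "2017-01-01 00:00:00",
    "2017-01-01 00:00:01",
    "2017-01-01 00:00:02",
]
# Each label is repeated 1000 times, like a non-unique time index.
index = [ts for ts in timestamps for _ in range(1000)]

# Store each distinct label once, plus one small integer code per row.
categories = sorted(set(index))
position = {label: i for i, label in enumerate(categories)}
# One byte per row is enough here because there are fewer than 256 categories.
codes = bytes(bytearray(position[label] for label in index))

# Decoding restores the original labels.
decoded = [categories[c] for c in bytearray(codes)]

# Rough size comparison in characters/bytes of label data.
raw_size = sum(len(label) for label in index)
encoded_size = len(codes) + sum(len(label) for label in categories)
```

The fewer distinct values the index has relative to its length, the larger the gap between raw_size and encoded_size becomes.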
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.16.60-0.42.5-smp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.2
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.2.2
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: 1.4.4
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None