support CategoricalIndex for read_msgpack #15487

Closed
@abast

The following code fails:

import pandas as pd
pdf = pd.DataFrame(dict(A=[1,1,1,2,2,2], B = [1,2,3,4,5,6]))
pdf['A'] = pdf['A'].astype('category')
pdf.set_index('A', inplace = True)
pdf.to_msgpack('/some/path')
pdf2 = pd.read_msgpack('/some/path')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-6-cab186b6bdcd> in <module>()
      4 pdf.set_index('A', inplace = True)
      5 pdf.to_msgpack('/some/path')
----> 6 pdf2 = pd.read_msgpack('/some/path')

/.../Anaconda2/lib/python2.7/site-packages/pandas/io/packers.pyc in read_msgpack(path_or_buf, encoding, iterator, **kwargs)
    200         if exists:
    201             with open(path_or_buf, 'rb') as fh:
--> 202                 return read(fh)
    203 
    204     # treat as a binary-like

/.../Anaconda2/lib/python2.7/site-packages/pandas/io/packers.pyc in read(fh)
    185 
    186     def read(fh):
--> 187         l = list(unpack(fh, encoding=encoding, **kwargs))
    188         if len(l) == 1:
    189             return l[0]

pandas/msgpack/_unpacker.pyx in pandas.msgpack._unpacker.Unpacker.__next__ (pandas/msgpack/_unpacker.cpp:5618)()

pandas/msgpack/_unpacker.pyx in pandas.msgpack._unpacker.Unpacker._unpack (pandas/msgpack/_unpacker.cpp:4602)()

/.../Anaconda2/lib/python2.7/site-packages/pandas/io/packers.pyc in decode(obj)
    557         data = unconvert(obj[u'data'], dtype,
    558                          obj.get(u'compress'))
--> 559         return globals()[obj[u'klass']](data, dtype=dtype, name=obj[u'name'])
    560     elif typ == u'range_index':
    561         return globals()[obj[u'klass']](obj[u'start'],

KeyError: u'CategoricalIndex'

Problem description

read_msgpack does not appear to support a CategoricalIndex, even though to_msgpack saves a dataframe with a CategoricalIndex without complaint.
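Judging from the last frame of the traceback, decode looks the index class up by name in the module's globals(), so the KeyError would mean that the name CategoricalIndex is simply not present in that namespace. The toy model below only illustrates that lookup pattern; the Index class and the decode function here are hypothetical stand-ins, not the pandas source.

```python
# Toy model of the name-based lookup seen in the traceback.
# Everything here is a stand-in for illustration, not pandas code.

class Index(object):
    """Minimal placeholder for an index class that IS known to the module."""
    def __init__(self, data, name=None):
        self.data = list(data)
        self.name = name

# Note: no class called CategoricalIndex is defined in this namespace.

def decode(obj):
    # Same shape as: globals()[obj[u'klass']](data, ..., name=obj[u'name'])
    klass = globals()[obj['klass']]
    return klass(obj['data'], name=obj['name'])

# A class name that exists in the namespace decodes fine.
ok = decode({'klass': 'Index', 'data': [1, 2], 'name': 'A'})

# A class name that was written out but is unknown on the reading side fails.
try:
    decode({'klass': 'CategoricalIndex', 'data': [1, 2], 'name': 'A'})
    error = None
except KeyError as exc:
    error = exc
```

If this reading is right, the writer records the class name of the index without checking that the reader can resolve it, which would explain why saving succeeds and only loading fails.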

Background: I am currently using to_msgpack to save a dask dataframe whose index is (something like) a timestamp and is not unique. Overall I am very satisfied with the performance of to_msgpack; however, when it comes to space efficiency, a categorical index would probably provide a significant improvement.

Or maybe it works, but I am using it wrong?
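One possible workaround, sketched here under an untested assumption: if categorical columns (as opposed to a categorical index) survive a to_msgpack / read_msgpack round trip, the index could be moved into a column before saving and restored after loading. The snippet only demonstrates the reset_index / set_index part; the msgpack step is left as a comment because whether it preserves the category dtype in 0.19.2 has not been verified here.

```python
import pandas as pd

# Same frame as in the report, with a CategoricalIndex named 'A'.
pdf = pd.DataFrame(dict(A=[1, 1, 1, 2, 2, 2], B=[1, 2, 3, 4, 5, 6]))
pdf['A'] = pdf['A'].astype('category')
pdf = pdf.set_index('A')

# Move the categorical index into an ordinary column before saving.
flat = pdf.reset_index()

# Assumption: `flat` would be written with to_msgpack and read back here,
# and the 'A' column would come back with its category dtype intact.

# Rebuild the CategoricalIndex from the categorical column after loading.
restored = flat.set_index('A')
```

In memory, restored again has a CategoricalIndex and matches the original frame; whether the same holds after an actual msgpack round trip depends on the assumption above.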

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.16.60-0.42.5-smp
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.2
nose: 1.3.7
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.11.1
scipy: 0.18.1
statsmodels: 0.6.1
xarray: None
IPython: 5.2.2
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: 1.4.4
bottleneck: 1.1.0
tables: 3.2.3.1
numexpr: 2.6.1
matplotlib: 1.5.3
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
httplib2: 0.9.2
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.42.0
pandas_datareader: None
