BUG: HDFStore mishandles pd.Categorical when initialized with custom categories #38131

Open
@PaulAmosKreiner

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

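import pandas as pd

# Categorical column whose explicit category order differs from sorted order
df = pd.DataFrame(columns=['col1'])
df['col1'] = pd.Categorical(["a","b","c","a"], categories=["c","b","a"])

# Store col1 as a queryable data column, then filter on it
store = pd.HDFStore('bug.h5')
store.append('df', df, data_columns=df.columns)
store.select('df', "col1='a'")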

Problem description

The select call yields the wrong row:

col1
2	c

Expected Output

The select call should yield the two rows where col1 is 'a':

col1
0	a
3	a
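
For comparison, the same filter applied in memory, without HDFStore, returns the expected rows. This is only a sanity-check sketch; the variable name `expected` is illustrative and not part of the report above.

```python
import pandas as pd

# Same data as in the code sample, built in one step
df = pd.DataFrame({
    "col1": pd.Categorical(["a", "b", "c", "a"], categories=["c", "b", "a"])
})

# Plain boolean mask on the categorical column, no HDFStore involved
expected = df[df["col1"] == "a"]
print(expected.index.tolist())  # [0, 3]
```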

Workaround

Don't pass categories to pd.Categorical(). Everything works fine with the inferred categories, at least from what I have seen in my project.
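
A minimal sketch of what differs between the two cases: the integer codes behind the same values. Whether these codes are what trips up the HDFStore query is an assumption, not something confirmed in this report; the names `custom` and `inferred` are illustrative.

```python
import pandas as pd

values = ["a", "b", "c", "a"]

# Explicit, non-sorted category order (the failing case)
custom = pd.Categorical(values, categories=["c", "b", "a"])

# Categories inferred from the data (the workaround); pandas sorts them
inferred = pd.Categorical(values)

print(list(custom.categories), custom.codes.tolist())      # ['c', 'b', 'a'] [2, 1, 0, 2]
print(list(inferred.categories), inferred.codes.tolist())  # ['a', 'b', 'c'] [0, 1, 2, 0]
```

Note that the row wrongly returned by the query (index 2, value 'c') is the one with code 0 under the custom category order.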

Reference & questions

Related issue in PyTables (where I don't think it belongs at this point): PyTables/PyTables#714

The bug occurs regardless of the following HDFStore settings:

  • choosing "fixed" or "table" format
  • using compression

Are categoricals even officially supported in HDFStore at this point? The IO guide still does not list the type as supported (https://pandas.pydata.org/pandas-docs/dev/user_guide/io.html#storing-types). If it really is still unsupported, maybe raise a warning.

Output of pd.show_versions()

Default Google Colaboratory environment as of 28.11.2020

commit           : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python           : 3.6.9.final.0
python-bits      : 64
OS               : Linux
OS-release       : 4.19.112+
Version          : #1 SMP Thu Jul 23 08:00:38 PDT 2020
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.4
numpy            : 1.18.5
pytz             : 2018.9
dateutil         : 2.8.1
pip              : 19.3.1
setuptools       : 50.3.2
Cython           : 0.29.21
pytest           : 3.6.4
hypothesis       : None
sphinx           : 1.8.5
blosc            : None
feather          : 0.4.1
xlsxwriter       : None
lxml.etree       : 4.2.6
html5lib         : 1.0.1
pymysql          : None
psycopg2         : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2           : 2.11.2
IPython          : 5.5.0
pandas_datareader: 0.9.0
bs4              : 4.6.3
bottleneck       : 1.3.2
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.2.2
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 2.5.9
pandas_gbq       : 0.13.3
pyarrow          : 0.14.1
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.20
tables           : 3.4.4
tabulate         : 0.8.7
xarray           : 0.15.1
xlrd             : 1.1.0
xlwt             : 1.3.0
numba            : 0.48.0
