crosstab gives wrong result if a categorical Series contains NaNs #21565

Closed
@simon-anders

Description

Test code:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict( {"objcol": ("A", "B", np.nan, "C", "C", "A", "D" ) })
df["catcol"] = df.objcol.astype('category')

pd.crosstab( df.objcol, 1 )
pd.crosstab( df.catcol, 1 )

Problem description

We have this data frame:

>>> df
  objcol catcol
0      A      A
1      B      B
2    NaN    NaN
3      C      C
4      C      C
5      A      A
6      D      D

The first column is of dtype object, the second column of dtype 'category'. Running crosstab on the two columns gives different results:

>>> pd.crosstab( df.objcol, 1 )
col_0   1
objcol   
A       2
B       1
C       2
D       1

>>> pd.crosstab( df.catcol, 1 )
col_0   1
catcol   
A       2
B       1
NaN     2
C       1

The second result is wrong: it reports a NaN row with count 2, gives "C" a count of 1 instead of 2, and omits "D" entirely.

value_counts, on the other hand, works correctly:

>>> df.objcol.value_counts()
C    2
A    2
D    1
B    1
Name: objcol, dtype: int64

>>> df.catcol.value_counts()
C    2
A    2
D    1
B    1
Name: catcol, dtype: int64

Expected Output

pd.crosstab( df.catcol, 1 ) should give the same output as pd.crosstab( df.objcol, 1 ).
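A minimal sketch of a check and a possible workaround, assuming that casting the categorical column back to object before tabulating is acceptable. The helper `crosstab_counts` and the constant column `ones` are names introduced here for illustration only; they are not part of pandas or of the original report.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"objcol": ["A", "B", np.nan, "C", "C", "A", "D"]})
df["catcol"] = df["objcol"].astype("category")

# Constant column, so the crosstab reduces to one count per row label.
ones = pd.Series(1, index=df.index, name="const")


def crosstab_counts(rows):
    """Hypothetical helper: return crosstab counts as a {label: count} dict."""
    table = pd.crosstab(rows, ones)
    return {str(label): int(n) for label, n in zip(table.index, table.iloc[:, 0])}


# Reference result from the object column, which the report shows to be correct.
expected = crosstab_counts(df["objcol"])

# Workaround sketch: cast the categorical back to object before tabulating.
workaround = crosstab_counts(df["catcol"].astype(object))

# value_counts is reported to work on the categorical column, so it can serve
# as an independent cross-check of the counts.
via_value_counts = {str(k): int(v) for k, v in df["catcol"].value_counts().items()}

print(expected)
print(workaround == expected, via_value_counts == expected)
```

On an affected pandas version, calling `crosstab_counts(df["catcol"])` directly would be the case expected to disagree with `expected`.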

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2014.10
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.7.3
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels

Categorical, Duplicate Report, Reshaping
