Description
Test code:
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict( {"objcol": ("A", "B", np.nan, "C", "C", "A", "D" ) })
df["catcol"] = df.objcol.astype('category')
pd.crosstab( df.objcol, 1 )
pd.crosstab( df.catcol, 1 )
Problem description
We have this data frame:
>>> df
  objcol catcol
0      A      A
1      B      B
2    NaN    NaN
3      C      C
4      C      C
5      A      A
6      D      D
The first column is of dtype object, the second column of dtype 'category'. Running crosstab
on the two columns gives different results:
>>> pd.crosstab( df.objcol, 1 )
col_0   1
objcol
A       2
B       1
C       2
D       1
>>> pd.crosstab( df.catcol, 1 )
col_0   1
catcol
A       2
B       1
NaN     2
C       1
Clearly, the second result is wrong: "C" has a count of 1 instead of 2, a NaN row appears with a count of 2, and "D" is missing entirely.
value_counts, on the other hand, works correctly:
>>> df.objcol.value_counts()
C 2
A 2
D 1
B 1
Name: objcol, dtype: int64
>>> df.catcol.value_counts()
C 2
A 2
D 1
B 1
Name: catcol, dtype: int64
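The comparison above can also be automated. The following is a minimal sketch, not part of the original report: crosstab_counts and value_counts_dict are hypothetical helper names, and an explicit array of ones is assumed as a stand-in for the scalar 1 passed to crosstab above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"objcol": ["A", "B", np.nan, "C", "C", "A", "D"]})
df["catcol"] = df["objcol"].astype("category")


def crosstab_counts(series):
    """Per-label counts from pd.crosstab as a plain dict (hypothetical helper)."""
    # Assumption: an array of ones stands in for the scalar 1 used in the report.
    ones = np.ones(len(series), dtype=int)
    table = pd.crosstab(series, ones)
    # Cast labels to str so object and categorical indexes compare equal.
    return {str(label): int(count)
            for label, count in zip(table.index, table.iloc[:, 0])}


def value_counts_dict(series):
    """Per-label counts from value_counts as a plain dict (hypothetical helper)."""
    return {str(label): int(count)
            for label, count in series.value_counts().items()}


# value_counts is taken as the reference because it is correct in the report.
reference = value_counts_dict(df["objcol"])
obj_counts = crosstab_counts(df["objcol"])
cat_counts = crosstab_counts(df["catcol"])

print("object column matches value_counts:  ", obj_counts == reference)
print("category column matches value_counts:", cat_counts == reference)
```

On an affected version, the second comparison would be expected to fail, in line with the wrong table shown earlier. On a version where both crosstab calls agree, both comparisons succeed.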
Expected Output
pd.crosstab( df.catcol, 1 ) should give the same output as pd.crosstab( df.objcol, 1 ).
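Until the two calls agree, one possible workaround is to cast the categorical column back to object dtype before calling crosstab, since the object column yields the right counts in this report. This is a sketch under that assumption, not a confirmed fix: the array of ones again stands in for the scalar 1, and the rename is only there so that the index names match for the comparison.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# dtype=object keeps the column as object dtype, as in the report.
objcol = pd.Series(["A", "B", np.nan, "C", "C", "A", "D"],
                   dtype=object, name="objcol")
df = pd.DataFrame({"objcol": objcol})
df["catcol"] = df["objcol"].astype("category")

# Assumption: an array of ones stands in for the scalar 1 used in the report.
ones = np.ones(len(df), dtype=int)

# Counts from the object column, which the report shows to be correct.
expected = pd.crosstab(df["objcol"], ones)

# Workaround: drop the categorical dtype before cross-tabulating.
as_object = df["catcol"].astype(object).rename("objcol")
workaround = pd.crosstab(as_object, ones)

# Raises AssertionError if the two tables differ.
assert_frame_equal(workaround, expected)
print(workaround)
```

The cast discards the category ordering and any unused categories, so it only suits cases where plain label counts are wanted.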
Output of pd.show_versions()
pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2014.10
blosc: None
bottleneck: None
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.7.3
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None