Description
Code Sample, a copy-pastable example if possible
import pandas as pd
# create our dataframe
m = 5
temp = pd.DataFrame({
'a': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] * m,
'b': ['t', 'w', 'x', 'y', 'z'] * 2 * m,
'c': [letter
for each in ['m', 'n', 'u', 'p', 'o']
for letter in [each] * 2 * m],
'd': [letter
for each in ['aa', 'bb', 'cc', 'dd', 'ee',
'ff', 'gg', 'hh','ii', 'jj']
for letter in [each] * m],
})
# change them all to categorical variables
for c in temp.columns:
temp[c] = temp[c].astype('category')
# get the dimensions before we do anything
print(temp.shape)
# drop duplicates to make sure this is unqiue
# it should be unique
id_df = temp.drop_duplicates()
print(id_df.shape)
# join a row-wise unique dataset to itself on all variables
# when they're categorical variables it duplicates rows
# when they're strings things behave as they're suppposed to
temp1 = pd.merge(temp, id_df, on = list(temp.columns))
print(temp1.shape)
Problem description
Using merge on Categorical dtypes doesn't appear to be checking equality correctly. Merging a unique dataframe to itself on 4 Categorical columns appears to duplicate rows. The above code example is simpler than what I experienced the issue on but the behavior is there.
The dataframe as it is created is a 50 row by 4 column dataframe of strings. Casting the strings to Categoricals to save on RAM appears to work well. Running the drop_duplicates method and checking the dimensions shows that each row is unique. Then simply merging the dataframes together results in a 54 row by 4 column dataframe.
My guess is that there is something about the way the values are assigned that underlie the labels differs and that the underlying values may be equal when the labels aren't. It appears to be a fairly specific case, as commenting any of those columns out results in what I'd expect in terms of output.
Expected Output
Running the same code and steps as illustrated in the prior paragraph without casting the columns to a Categorical dtype results in what I would expect: a 50 row by 4 column dataframe.
Output of pd.show_versions()
pandas: 0.20.2
pytest: 2.8.5
pip: 9.0.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.13.0
scipy: 0.17.0
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None