Skip to content

pd.merge on Categoricals duplicating unique rows #16767

Closed
@cdknox

Description

@cdknox

Code Sample, a copy-pastable example if possible

import pandas as pd

# create our dataframe
m = 5
temp = pd.DataFrame({
        'a': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'] * m,
        'b': ['t', 'w', 'x', 'y', 'z'] * 2 * m,
        'c': [letter 
              for each in ['m', 'n', 'u',  'p', 'o']
              for letter in [each] * 2 * m],
        'd': [letter 
              for each in ['aa', 'bb', 'cc', 'dd', 'ee',
                           'ff', 'gg', 'hh','ii', 'jj']
              for letter in [each] * m],

    })

# change them all to categorical variables
for c in temp.columns:
    temp[c] = temp[c].astype('category')

# get the dimensions before we do anything
print(temp.shape)

# drop duplicates to make sure this is unqiue
# it should be unique
id_df = temp.drop_duplicates()
print(id_df.shape)

# join a row-wise unique dataset to itself on all variables
# when they're categorical variables it duplicates rows
# when they're strings things behave as they're suppposed to
temp1 = pd.merge(temp, id_df, on = list(temp.columns))
print(temp1.shape)

Problem description

Using merge on Categorical dtypes doesn't appear to be checking equality correctly. Merging a unique dataframe to itself on 4 Categorical columns appears to duplicate rows. The above code example is simpler than what I experienced the issue on but the behavior is there.

The dataframe as it is created is a 50 row by 4 column dataframe of strings. Casting the strings to Categoricals to save on RAM appears to work well. Running the drop_duplicates method and checking the dimensions shows that each row is unique. Then simply merging the dataframes together results in a 54 row by 4 column dataframe.

My guess is that there is something about the way the values are assigned that underlie the labels differs and that the underlying values may be equal when the labels aren't. It appears to be a fairly specific case, as commenting any of those columns out results in what I'd expect in terms of output.

Expected Output

Running the same code and steps as illustrated in the prior paragraph without casting the columns to a Categorical dtype results in what I would expect: a 50 row by 4 column dataframe.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.1.final.0 python-bits: 64 OS: Windows OS-release: 10 machine: AMD64 processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.20.2
pytest: 2.8.5
pip: 9.0.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.13.0
scipy: 0.17.0
xarray: None
IPython: 4.1.2
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugCategoricalCategorical Data TypeRegressionFunctionality that used to work in a prior pandas versionReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions