BUG: merging with a boolean/int categorical column  #17187

Closed
@lvphj

Description

Code Sample, a copy-pastable example if possible

import pandas as pd

dfA = pd.DataFrame({'id':[1,2,3,4,5,6,7,8,9,10],'colA':[3,4,2,4,3,4,5,4,5,6],'colB':[7,6,5,6,5,7,8,7,6,7],'colC':[False,True,True,False,False,True,False,True,True,True]})
# Convert colC to an ordered categorical (astype keyword form as used with pandas 0.20.1)
dfA['colC'] = dfA['colC'].astype('category',categories=[True,False],ordered=True)
dfB = pd.DataFrame({'id':[2,5,7,8],'colD':[1,9,7,3]})

print("Before\n====")
print('dfA dtypes\n------')
print(dfA.dtypes)
print('\ndfA\n---')
print(dfA)
print('\ndfB\n---')
print(dfB)

dfA = pd.merge(left=dfA,right=dfB,how='left',on='id')
print("\nAfter\n=====")
print(dfA)

Problem description

This problem was raised on Stack Overflow at https://stackoverflow.com/questions/45538092/merging-pandas-dataframes-containing-a-categorical-variable-fails-with-valueerr where it was suggested that it is a bug.

Two dataframes containing different columns can be combined using the pandas.merge() function. This normally works, but in the example above, converting one of the columns in dfA to a categorical variable causes the merge to fail with the following error:

/Users/.../env3/lib/python3.4/site-packages/pandas/core/internals.py in __init__(self, values, placement, ndim, fastpath)
    104             ndim = values.ndim
    105         elif values.ndim != ndim:
--> 106             raise ValueError('Wrong number of dimensions')
    107         self.ndim = ndim
    108 

ValueError: Wrong number of dimensions

Checking the ndim attribute (df.ndim) shows that both dataframes have 2 dimensions.
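The dimension check can be sketched as follows. This is a minimal illustration rather than the original session: the frames are cut down to a few rows, and the categorical column is built with pd.Categorical instead of the astype call above. Note that ndim is an attribute, not a method.

```python
import pandas as pd

# Reduced versions of dfA and dfB (fewer rows than the report, for illustration only).
dfA = pd.DataFrame({
    'id': [1, 2, 3],
    'colC': pd.Categorical([False, True, True], categories=[True, False], ordered=True),
})
dfB = pd.DataFrame({'id': [2, 3], 'colD': [1, 9]})

# ndim is an attribute: both dataframes report 2 dimensions.
print(dfA.ndim, dfB.ndim)

# The categorical column on its own is one-dimensional.
print(dfA['colC'].ndim)
```

So the dataframe-level ndim does not reveal anything unusual about the categorical column.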

Expected Output

The expected output can be generated by commenting out the line in the above code that converts colC to a categorical variable (the astype call).

   colA  colB   colC  id  colD
0     3     7  False   1   NaN
1     4     6   True   2   1.0
2     2     5   True   3   NaN
3     4     6  False   4   NaN
4     3     5  False   5   9.0
5     4     7   True   6   NaN
6     5     8  False   7   7.0
7     4     7   True   8   3.0
8     5     6   True   9   NaN
9     6     7   True  10   NaN
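A possible workaround, sketched under the assumption that the categorical dtype is only needed after the merge: merge while colC is still a plain boolean column, then convert the result. The name merged is illustrative, and pd.Categorical is used here in place of the astype call from the code sample.

```python
import pandas as pd

dfA = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'colA': [3, 4, 2, 4, 3, 4, 5, 4, 5, 6],
    'colB': [7, 6, 5, 6, 5, 7, 8, 7, 6, 7],
    'colC': [False, True, True, False, False, True, False, True, True, True],
})
dfB = pd.DataFrame({'id': [2, 5, 7, 8], 'colD': [1, 9, 7, 3]})

# Merge while colC is still a plain boolean column.
merged = pd.merge(left=dfA, right=dfB, how='left', on='id')

# Convert colC to an ordered categorical only after the merge.
merged['colC'] = pd.Categorical(merged['colC'], categories=[True, False], ordered=True)

print(merged.dtypes)
print(merged)
```

This avoids the failing code path rather than fixing it, so it does not help if the merge itself needs to operate on a categorical column.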

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 34.1.0
Cython: None
numpy: 1.12.1
scipy: 0.16.1
xarray: None
IPython: 4.1.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 1.5.3
openpyxl: 2.4.7
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

Labels: Bug, Categorical, Reshaping
