BUG: inconsistent behaviors for Index.union() and Index.intersection() with duplicates #31326

Closed
@jeffzi

Description

While working on #31312, I noticed that the behavior of Index.union() and Index.intersection() is inconsistent when one of the indexes contains duplicates.

import pandas as pd

a = pd.Index([1, 2, 2, 3])
b = pd.Index([3, 3, 4])

def test_setops(left, right):
    for op in ["intersection", "union"]:
        for sort in [None, False]:
            result = getattr(left, op)(right, sort=sort)
            print(f"sort = {sort}, {op}: {result} -> has duplicates: {result.has_duplicates}")

test_setops(a, b)
#> sort = None, intersection: Int64Index([3, 3], dtype='int64') -> has duplicates: True
#> sort = False, intersection: Int64Index([3, 3], dtype='int64') -> has duplicates: True
#> sort = None, union: Int64Index([1, 2, 2, 3, 3, 4], dtype='int64') -> has duplicates: True
#> sort = False, union: Int64Index([1, 2, 2, 3, 4], dtype='int64') -> has duplicates: True

arrays = [['a', 'b', 'b', 'c'],
          ['1', '2', '2', '1']]
a_mi = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
arrays = [['c', 'c', 'd'],
          ['1', '1', '2']]
b_mi = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

test_setops(a_mi, b_mi)
#> sort = None, intersection: MultiIndex([('c', '1')],
#>            names=['first', 'second']) -> has duplicates: False
#> sort = False, intersection: MultiIndex([('c', '1')],
#>            names=['first', 'second']) -> has duplicates: False
#> sort = None, union: MultiIndex([('a', '1'),
#>             ('b', '2'),
#>             ('c', '1'),
#>             ('d', '2')],
#>            names=['first', 'second']) -> has duplicates: False
#> sort = False, union: MultiIndex([('a', '1'),
#>             ('b', '2'),
#>             ('c', '1'),
#>             ('d', '2')],
#>            names=['first', 'second']) -> has duplicates: False

Created on 2020-01-26 by the reprexpy package

Problem description

  1. The behavior of intersection() and union() when duplicates are present is not consistent between Index and MultiIndex: both operations return duplicates for Index but not for MultiIndex. The documentation does not clearly state what to expect.

  2. When duplicates are present, the size of the result of Index.union() depends on whether sort is None or False.

  3. If duplicates are present on only one side, Index.intersection() always returns duplicates.

Here are more succinct examples for points 2 and 3.

import pandas as pd

a = pd.Index([1, 2, 2, 3])
b = pd.Index([3, 3, 4])

# expected [1, 2, 2, 3, 3, 3, 4]
a.union(b, sort=None) 
#> Int64Index([1, 2, 2, 3, 3, 4], dtype='int64')
a.union(b, sort=False) 
#> Int64Index([1, 2, 2, 3, 4], dtype='int64')

# expected [3]
a.intersection(b, sort=None)
#> Int64Index([3, 3], dtype='int64')
a.intersection(b, sort=False)
#> Int64Index([3, 3], dtype='int64')

# expected [3, 3]
b.intersection(a, sort=None)
#> Int64Index([3, 3], dtype='int64')
b.intersection(a, sort=False)
#> Int64Index([3, 3], dtype='int64')

Created on 2020-01-26 by the reprexpy package
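
For comparison only, here is how the same values behave under plain multiset arithmetic using the standard library's collections.Counter. This is a sketch to help read the outputs above, not a description of how pandas implements these methods. The variable names are illustrative.

```python
from collections import Counter

# Same values as the Index objects above, as plain lists.
ca = Counter([1, 2, 2, 3])
cb = Counter([3, 3, 4])

# Multiset sum: multiplicities are added.
multiset_sum = sorted((ca + cb).elements())
# Multiset union: the maximum multiplicity of each value is kept.
multiset_union = sorted((ca | cb).elements())
# Multiset intersection: the minimum multiplicity of each value is kept.
multiset_intersection = sorted((ca & cb).elements())

print(multiset_sum)           # [1, 2, 2, 3, 3, 3, 4]
print(multiset_union)         # [1, 2, 2, 3, 3, 4]
print(multiset_intersection)  # [3]
```

Under these definitions, the sort=None union output matches the multiset union, while the sort=False output matches none of the three.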

Expected Output

For consistency and clarity, I think it would be better to enforce uniqueness in the index returned by set operations. Index.union() and Index.intersection() are the only ones that allow duplicates.
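
Until the behavior is settled, a possible user-side workaround is to deduplicate both operands before applying the operation. The sketch below assumes that dropping duplicates up front is acceptable for the caller. unique_setop is a hypothetical helper name, not a pandas API.

```python
import pandas as pd


def unique_setop(left, right, op, sort=None):
    """Hypothetical helper: deduplicate both operands, then apply the set operation."""
    # Index.unique() and MultiIndex.unique() both return an index without duplicates.
    return getattr(left.unique(), op)(right.unique(), sort=sort)


a = pd.Index([1, 2, 2, 3])
b = pd.Index([3, 3, 4])

union_result = unique_setop(a, b, "union")
intersection_result = unique_setop(a, b, "intersection")

print(list(union_result), union_result.has_duplicates)
print(list(intersection_result), intersection_result.has_duplicates)
```

With unique inputs, the union contains 1, 2, 3, 4 and the intersection contains only 3, so neither result has duplicates regardless of the sort argument.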

Output of pd.show_versions()

INSTALLED VERSIONS

commit : ca3bfcc
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.2.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.0rc0+212.gca3bfcc54
numpy : 1.17.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.1
setuptools : 45.1.0.post20200119
Cython : 0.29.14
pytest : 5.3.4
hypothesis : 5.3.0
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : 0.8.6
xarray : 0.14.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.47.0

Metadata

Labels: Bug; Index (related to the Index class or subclasses)