Skip to content

BUG: names on union and intersection for Index were inconsistent (GH9943 GH9862) #19849

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Nov 6, 2018
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.23.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -837,6 +837,8 @@ Indexing
- Bug in :func:`IntervalIndex.symmetric_difference` where the symmetric difference with a non-``IntervalIndex`` did not raise (:issue:`18475`)
- Bug in :class:`IntervalIndex` where set operations that returned an empty ``IntervalIndex`` had the wrong dtype (:issue:`19101`)
- Bug in :meth:`DataFrame.drop_duplicates` where no ``KeyError`` is raised when passing in columns that don't exist on the ``DataFrame`` (issue:`19726`)
- Bug in :func:`Index.union` and :func:`Index.intersection` where name of the `Index` of the result was not computed correctly for certain cases (:issue:`9943`, :issue:`9862`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double-tick Index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

- Bug in :func:`Index.difference` when taking difference of set and itself as type was not preserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add this PR number is the issue, e.g. (:issue:`19849`)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should "set" be "index"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made both of those changes.



MultiIndex
Expand Down
76 changes: 48 additions & 28 deletions pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,7 @@
import pandas.core.algorithms as algos
import pandas.core.sorting as sorting
from pandas.io.formats.printing import pprint_thing
from pandas.core.ops import make_invalid_op
from pandas.core.ops import make_invalid_op, get_op_result_name
from pandas.core.config import get_option
from pandas.core.strings import StringMethods

Expand Down Expand Up @@ -1191,7 +1191,7 @@ def _convert_can_do_setop(self, other):
other = Index(other, name=self.name)
result_name = self.name
else:
result_name = self.name if self.name == other.name else None
result_name = get_op_result_name(self, other)
return other, result_name

def _convert_for_op(self, value):
Expand Down Expand Up @@ -1240,9 +1240,9 @@ def set_names(self, names, level=None, inplace=False):
Examples
--------
>>> Index([1, 2, 3, 4]).set_names('foo')
Int64Index([1, 2, 3, 4], dtype='int64')
Int64Index([1, 2, 3, 4], dtype='int64', name='foo')
>>> Index([1, 2, 3, 4]).set_names(['foo'])
Int64Index([1, 2, 3, 4], dtype='int64')
Int64Index([1, 2, 3, 4], dtype='int64', name='foo')
>>> idx = MultiIndex.from_tuples([(1, u'one'), (1, u'two'),
(2, u'one'), (2, u'two')],
names=['foo', 'bar'])
Expand Down Expand Up @@ -2263,21 +2263,37 @@ def __or__(self, other):
def __xor__(self, other):
return self.symmetric_difference(other)

def _get_consensus_name(self, other):
def _get_reconciled_name_object(self, other):
"""
Given 2 indexes, give a consensus name meaning
we take the not None one, or None if the names differ.
Return a new object if we are resetting the name
If the result of a set operation will be self,
return self, unless the name changes, in which
case make a shallow copy of self.
"""
if self.name != other.name:
if self.name is None or other.name is None:
name = self.name or other.name
else:
name = None
if self.name != name:
return self._shallow_copy(name=name)
name = get_op_result_name(self, other)
if self.name != name:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think might be able to dispense with this entirely, and just always do

from pandas.core.ops import get_op_result_name

....

return self._shallow_copy(name=get_op_result_name(self, other))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do this in _wrap_setup_result

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I eliminate self.get_setop_name_object and use _wrap_setop_result, other tests fail, because those tests assume the corner cases return the same object.

return self._shallow_copy(name=name)
return self

def _union_corner_case(self, other):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you are creatijng this as a separate method? why not inline it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for inlining it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback @TomAugspurger The subclasses for DateTimeIndex, RangeIndex and TimeDeltaIndex need to call this method from their union implementations. Should I inline it there too and have repeated code?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this PR looks a lot better. I still don't like this hlper function. can we refactor to not use it

"""
If self or other have no length, or self and other
are the same, then return self, after reconciling names

Returns
-------
Tuple of (is_corner, result), where is_corner is True if
it is a corner case, and result is the reconciled result

"""
# GH 9943 9862
if len(other) == 0 or self.equals(other):
return (True, self._get_reconciled_name_object(other))

if len(self) == 0:
return (True, other._get_reconciled_name_object(self))

return (False, None)

def union(self, other):
"""
Form the union of two Index objects and sorts if possible.
Expand All @@ -2302,11 +2318,9 @@ def union(self, other):
self._assert_can_do_setop(other)
other = _ensure_index(other)

if len(other) == 0 or self.equals(other):
return self._get_consensus_name(other)

if len(self) == 0:
return other._get_consensus_name(self)
is_corner_case, corner_result = self._union_corner_case(other)
if is_corner_case:
return corner_result

# TODO: is_dtype_union_equal is a hack around
# 1. buggy set ops with duplicates (GH #13432)
Expand Down Expand Up @@ -2369,11 +2383,10 @@ def union(self, other):
stacklevel=3)

# for subclasses
return self._wrap_union_result(other, result)
return self._wrap_setop_result(other, result)

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
return self.__class__(result, name=name)
def _wrap_setop_result(self, other, result):
return self.__class__(result, name=get_op_result_name(self, other))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use self._shallow_copy this propagates meta-data, allows you to remove some sublcass implementations of this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I've done some testing with self.shallow_copy() and it broke things with respect to CategoricalIndex, and then the discussion in #10186 becomes relevant, as we'd need to decide what to do when taking the union of two categorical indexes that have different categories. So I'm going to leave this implementation as it was.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats fine

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we consistently use self.__class__ or type(self) in this module? (I prefer the latter.

can you use ._shallow_copy() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no other usage of self.__class__ in this module. There is a _constructor() method that is the same as type(self), so I have switched to using that.

We had discussion 4 months ago about why _shallow_copy() doesn't work here.


def intersection(self, other):
"""
Expand Down Expand Up @@ -2403,7 +2416,7 @@ def intersection(self, other):
other = _ensure_index(other)

if self.equals(other):
return self._get_consensus_name(other)
return self._get_reconciled_name_object(other)

if not is_dtype_equal(self.dtype, other.dtype):
this = self.astype('O')
Expand All @@ -2423,7 +2436,7 @@ def intersection(self, other):
if self.is_monotonic and other.is_monotonic:
try:
result = self._inner_indexer(lvals, rvals)[0]
return self._wrap_union_result(other, result)
return self._wrap_setop_result(other, result)
except TypeError:
pass

Expand All @@ -2441,6 +2454,13 @@ def intersection(self, other):
taken.name = None
return taken

def _create_empty_index(self, name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't create new API, this is just self._shallow_copy([], name=name)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback So the issue here is that self._shallow_copy([], name=name) doesn't work correctly for RangeIndex. I can modify RangeIndex._self_shallow_copy to allow [] as an argument. For an index of strings (objects) or an index of dtype int64, self._shallow_copy([], name=name) doesn't preserve the dtype. And I haven't even looked at all the types yet. So now the changes get more involved to handle this corner case of creating an empty index, which is why I created the create_empty_index. The idea here is to have an empty index that preserves all the other attributes (dtype, categories, range step, etc.) of the class of self. self._shallow_copy([], name=name) does not do that without a lot of modifications. If you want me to do those modifications, I can look into that, but they might be pretty substantial.

"""
Returns an empty index. Overridden as necessary by
subclasses that have different constructors.
"""
return self.__class__([], name=name)

def difference(self, other):
"""
Return a new Index with elements from the index that are not in
Expand Down Expand Up @@ -2469,7 +2489,7 @@ def difference(self, other):
self._assert_can_do_setop(other)

if self.equals(other):
return Index([], name=self.name)
return self._create_empty_index(get_op_result_name(self, other))

other, result_name = self._convert_can_do_setop(other)

Expand Down Expand Up @@ -3552,7 +3572,7 @@ def _join_monotonic(self, other, how='left', return_indexers=False):
return join_index

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
return Index(joined, name=name)

def _get_string_slice(self, key, use_lhs=True, use_rhs=True):
Expand Down
14 changes: 14 additions & 0 deletions pandas/core/indexes/category.py
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
import pandas.core.base as base
import pandas.core.missing as missing
import pandas.core.indexes.base as ibase
from pandas.core.ops import get_op_result_name

_index_doc_kwargs = dict(ibase._index_doc_kwargs)
_index_doc_kwargs.update(dict(target_klass='CategoricalIndex'))
Expand Down Expand Up @@ -300,6 +301,12 @@ def itemsize(self):
# Size of the items in categories, not codes.
return self.values.itemsize

def _wrap_setop_result(self, other, result):
name = get_op_result_name(self, other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use _shallow_copy here, we never call _simple_new directly. (another reason why the super method should use _shallow_copy, but that's for new PR)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move this method to Index.base? does it break anything?

return self._simple_new(result, name=name,
categories=self.categories,
ordered=self.ordered)

def get_values(self):
""" return the underlying data as an ndarray """
return self._data.get_values()
Expand Down Expand Up @@ -716,6 +723,13 @@ def insert(self, loc, item):
codes = np.concatenate((codes[:loc], code, codes[loc:]))
return self._create_from_codes(codes)

def _create_empty_index(self, name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

"""
Returns an empty index using categories and ordered of self
"""
return CategoricalIndex([], categories=self.categories,
ordered=self.ordered, name=name)

def _concat(self, to_concat, name):
# if calling index is category, don't check dtype of others
return CategoricalIndex._concat_same_dtype(self, to_concat, name)
Expand Down
16 changes: 13 additions & 3 deletions pandas/core/indexes/datetimes.py
Original file line number Diff line number Diff line change
Expand Up @@ -38,6 +38,7 @@

from pandas.core.indexes.base import Index, _index_shared_docs
from pandas.core.indexes.numeric import Int64Index, Float64Index
from pandas.core.ops import get_op_result_name
import pandas.compat as compat
from pandas.tseries.frequencies import to_offset, get_period_alias, Resolution
from pandas.core.indexes.datetimelike import (
Expand Down Expand Up @@ -1136,6 +1137,11 @@ def union(self, other):
y : Index or DatetimeIndex
"""
self._assert_can_do_setop(other)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you just call the super .union() here (after any conversion isneeded)? then you can inline the union_corner bizness

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

is_corner_case, corner_result = self._union_corner_case(other)
if is_corner_case:
return corner_result

if not isinstance(other, DatetimeIndex):
try:
other = DatetimeIndex(other)
Expand Down Expand Up @@ -1237,7 +1243,7 @@ def _maybe_utc_convert(self, other):
return this, other

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
if (isinstance(other, DatetimeIndex) and
self.offset == other.offset and
self._can_fast_union(other)):
Expand Down Expand Up @@ -1333,8 +1339,8 @@ def __iter__(self):
box="timestamp")
return iter(converted)

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
def _wrap_setop_result(self, other, result):
name = get_op_result_name(self, other)
if not timezones.tz_compare(self.tz, other.tz):
raise ValueError('Passed item and index have different timezone')
return self._simple_new(result, name=name, freq=None, tz=self.tz)
Expand All @@ -1353,6 +1359,10 @@ def intersection(self, other):
y : Index or DatetimeIndex
"""
self._assert_can_do_setop(other)

if self.equals(other):
return self._get_reconciled_name_object(other)

if not isinstance(other, DatetimeIndex):
try:
other = DatetimeIndex(other)
Expand Down
3 changes: 2 additions & 1 deletion pandas/core/indexes/interval.py
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
from pandas.core.indexes.base import (
Index, _ensure_index,
default_pprint, _index_shared_docs)
from pandas.core.ops import get_op_result_name

from pandas._libs import Timestamp, Timedelta
from pandas._libs.interval import (
Expand Down Expand Up @@ -1351,7 +1352,7 @@ def func(self, other):
raise TypeError(msg.format(op=op_name))

result = getattr(self._multiindex, op_name)(other._multiindex)
result_name = self.name if self.name == other.name else None
result_name = get_op_result_name(self, other)

# GH 19101: ensure empty results have correct dtype
if result.empty:
Expand Down
4 changes: 2 additions & 2 deletions pandas/core/indexes/multi.py
Original file line number Diff line number Diff line change
Expand Up @@ -2733,7 +2733,7 @@ def intersection(self, other):
other_tuples = other._ndarray_values
uniq_tuples = sorted(set(self_tuples) & set(other_tuples))
if len(uniq_tuples) == 0:
return MultiIndex(levels=[[]] * self.nlevels,
return MultiIndex(levels=self.levels,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are you changing this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback IMHO, if the resulting intersection is empty, the levels should be preserved. Here's a simple example:

mi = pd.MultiIndex.from_product([['a','b','c'],[1,2]], names=['ll','nn'])
print(mi.drop(mi))
print(mi.intersection([]))

This produces (with v0.22.0):

MultiIndex(levels=[['a', 'b', 'c'], [1, 2]],
           labels=[[], []],
           names=['ll', 'nn'])
MultiIndex(levels=[[], []],
           labels=[[], []],
           names=['ll', 'nn'])

Doing this change as I documented it makes the set algebra consistent.

labels=[[]] * self.nlevels,
names=result_names, verify_integrity=False)
else:
Expand All @@ -2755,7 +2755,7 @@ def difference(self, other):
return self

if self.equals(other):
return MultiIndex(levels=[[]] * self.nlevels,
return MultiIndex(levels=self.levels,
labels=[[]] * self.nlevels,
names=result_names, verify_integrity=False)

Expand Down
6 changes: 3 additions & 3 deletions pandas/core/indexes/numeric.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
from pandas.util._decorators import Appender, cache_readonly
import pandas.core.dtypes.concat as _concat
import pandas.core.indexes.base as ibase

from pandas.core.ops import get_op_result_name

_num_index_shared_docs = dict()

Expand Down Expand Up @@ -187,7 +187,7 @@ def _convert_scalar_indexer(self, key, kind=None):
._convert_scalar_indexer(key, kind=kind))

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
return Int64Index(joined, name=name)

@classmethod
Expand Down Expand Up @@ -264,7 +264,7 @@ def _convert_index_indexer(self, keyarr):
return keyarr

def _wrap_joined_index(self, joined, other):
name = self.name if self.name == other.name else None
name = get_op_result_name(self, other)
return UInt64Index(joined, name=name)

@classmethod
Expand Down
11 changes: 9 additions & 2 deletions pandas/core/indexes/period.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
_ensure_object)
from pandas.core.dtypes.dtypes import PeriodDtype
from pandas.core.dtypes.generic import ABCSeries
from pandas.core.ops import get_op_result_name

import pandas.tseries.frequencies as frequencies
from pandas.tseries.frequencies import get_freq_code as _gfc
Expand Down Expand Up @@ -962,6 +963,12 @@ def _convert_tolerance(self, tolerance, target):
'target index size')
return self._maybe_convert_timedelta(tolerance)

def _create_empty_index(self, name):
"""
Returns an empty index using freq of self
"""
return PeriodIndex([], freq=self.freq, name=name)

def insert(self, loc, item):
if not isinstance(item, Period) or self.freq != item.freq:
return self.astype(object).insert(loc, item)
Expand Down Expand Up @@ -996,8 +1003,8 @@ def _assert_can_do_setop(self, other):
msg = _DIFFERENT_FREQ_INDEX.format(self.freqstr, other.freqstr)
raise IncompatibleFrequency(msg)

def _wrap_union_result(self, other, result):
name = self.name if self.name == other.name else None
def _wrap_setop_result(self, other, result):
name = get_op_result_name(self, other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think could just remove this entirely here if _wrap_setop_result is defined in Index.base AND you move _apply_meta functionaily to inside of _shallow_copy for PI. (could also do this as a followup), but want to try to avoid duplicating _wrap_setup_result everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we can't move it to base from category is because in the case of category, we need to preserve the categories, and in the case of period, we need to preserve frequency.

So we could investigate in a separate PR about moving the _apply_meta() functionality.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that's fine

result = self._apply_meta(result)
result.name = name
return result
Expand Down
21 changes: 16 additions & 5 deletions pandas/core/indexes/range.py
Original file line number Diff line number Diff line change
Expand Up @@ -256,7 +256,8 @@ def tolist(self):
@Appender(_index_shared_docs['_shallow_copy'])
def _shallow_copy(self, values=None, **kwargs):
if values is None:
return RangeIndex(name=self.name, fastpath=True,
name = kwargs.get("name", self.name)
return RangeIndex(name=name, fastpath=True,
**dict(self._get_data_as_items()))
else:
kwargs.setdefault('name', self.name)
Expand Down Expand Up @@ -337,6 +338,10 @@ def intersection(self, other):
-------
intersection : Index
"""

if self.equals(other):
return self._get_reconciled_name_object(other)

if not isinstance(other, RangeIndex):
return super(RangeIndex, self).intersection(other)

Expand Down Expand Up @@ -417,10 +422,10 @@ def union(self, other):
union : Index
"""
self._assert_can_do_setop(other)
if len(other) == 0 or self.equals(other):
return self
if len(self) == 0:
return other
is_corner_case, corner_result = self._union_corner_case(other)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

if is_corner_case:
return corner_result

if isinstance(other, RangeIndex):
start_s, step_s = self._start, self._step
end_s = self._start + self._step * (len(self) - 1)
Expand Down Expand Up @@ -474,6 +479,12 @@ def join(self, other, how='left', level=None, return_indexers=False,
def _concat_same_dtype(self, indexes, name):
return _concat._concat_rangeindex_same_dtype(indexes).rename(name)

def _create_empty_index(self, name):
"""
Returns an empty index using step size of self
"""
return RangeIndex(start=None, stop=None, step=self._step, name=name)

def __len__(self):
"""
return the length of the RangeIndex
Expand Down
Loading