Skip to content

IntervalDtype inconsistencies and bugs #18980

Closed
@jschendel

Description

@jschendel

1. Inconsistent comparisons versus string 'interval':

In [2]: IntervalDtype() == 'interval'
Out[2]: True

In [3]: IntervalDtype('interval') == 'interval'
Out[3]: False

In [4]: IntervalDtype('int64') == 'interval'
Out[4]: False

I'd expect all of these to return True, like how CategoricalDtype(*, *) == 'category' always returns True.


2. Inconsistent comparisons versus IntervalDtype(None):

In [5]: IntervalDtype(None) == IntervalDtype('interval')
Out[5]: False

In [6]: IntervalDtype(None) == IntervalDtype('int64')
Out[6]: False

I'd expect all of these to return True, like how CDT(None, None) == CDT(*, *) always returns True.


3. IntervalDtype.name attribute changes

In [7]: IntervalDtype().name
Out[7]: 'interval'

In [8]: IntervalDtype('interval').name
Out[8]: 'interval[]'

In [9]: IntervalDtype('int64').name
Out[9]: 'interval[int64]'

CategoricalDtype.name attribute is always the same:

In [10]: CategoricalDtype(list('abc'), True).name
Out[10]: 'category'

In [11]: CategoricalDtype(list('wxyz'), False).name
Out[11]: 'category'

I'd expect IntervalDtype.name to always return 'interval', like how CDT.name always returns 'category'. This makes the code for checking equality against strings (i.e. what I described in 1) simpler. I don't think the behavior of str(IntervalDtype) should change, which is currently the same as IntervalDtype.name, so I'd still have that return strings specifying the subtype.


4. CategoricalDtype gets cached incorrectly: (No longer an issue due to #19022)

In [12]: idt1 = IntervalDtype(CategoricalDtype(list('abc'), True))

In [13]: idt2 = IntervalDtype(CategoricalDtype(list('wxyz'), False))

In [14]: idt2.subtype
Out[14]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)

This looks to be caused by the caching being done by string representation, and str(CDT(*, *)) always returns 'category':

try:
return cls._cache[str(subtype)]
except KeyError:
u = object.__new__(cls)
u.subtype = subtype
cls._cache[str(subtype)] = u
return u

Can caching be removed entirely for IntervalDtype, or is there some need/advantage that I'm not seeing? Looking at the other dtypes, CategoricalDtype appears to have had the caching code removed, but PeriodDtype and DatetimeTZDtype are using it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    CategoricalCategorical Data TypeCompatpandas objects compatability with Numpy or Python functionsDtype ConversionsUnexpected or buggy dtype conversionsIntervalInterval data type

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions