DOC: Error in pd.cut documentation example regarding IntervalIndex usage. #27319

Open
@msznajder

Description

Code Sample, a copy-pastable example if possible

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins)
[NaN, (0, 1], NaN, (2, 3], (4, 5]]
Categories (3, interval[int64]): [(0, 1] < (2, 3] < (4, 5]]

Problem description

The example in the IntervalIndex section of the pd.cut documentation does not account for how pd.cut actually treats non-contiguous bins: any value that falls between two intervals becomes NaN in the result. In the docs example above, values such as 2 and 4 are NOT included in any bin, because each interval is open on the left, so actual values of 2 and 4 in the data produce NaN after cutting with pd.cut.
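To make the gap concrete, here is a minimal sketch that reuses the bins from the docs example. The input list `values` and the helper name `missing` are my own choices for illustration and do not appear in the documentation.

```python
import pandas as pd

# Bins copied from the pd.cut docs example; (1, 2] and (3, 4] are not covered.
bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])

# Illustrative input (an assumption, not taken from the docs).
values = [1, 2, 3, 4, 5]
result = pd.cut(values, bins)

# Intervals are closed on the right only, so 2 and 4 match no bin.
missing = [v for v, is_na in zip(values, result.isna()) if is_na]
print(missing)  # [2, 4]
```

Any value inside the uncovered ranges, such as 1.5 or 3.5, is dropped to NaN in the same way.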

I assume that in 99% of cases a user who bucketizes a range of values with cut wants every value in that range to land in a bin. The current usage example can lead to data loss.

A better example would be along these lines:

bins = pd.IntervalIndex.from_tuples([(0, 1), (1, 3), (3, 5)])

resulting in the following bins:

Categories (3, interval[int64]): [(0, 1] < (1, 3] < (3, 5]]
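As a rough check of the proposed bins, the sketch below cuts a few sample values with them. The `values` list and the `n_missing` name are assumptions chosen for illustration; only the bins come from the proposal above.

```python
import pandas as pd

# Contiguous bins from the proposal: each interval starts where the last ends.
bins = pd.IntervalIndex.from_tuples([(0, 1), (1, 3), (3, 5)])

# Illustrative input (an assumption), including the boundary values 2 and 4.
values = [0, 0.5, 1.5, 2, 2.5, 4, 4.5]
result = pd.cut(values, bins)

# Count the values that did not fall into any bin.
n_missing = int(result.isna().sum())
print(n_missing)  # 1
```

The single remaining NaN is the value 0: the first interval (0, 1] is still open on the left, so the lower edge of the whole range is excluded even with contiguous bins.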

EDIT:
The IntervalIndex docs already contain a correct example:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.IntervalIndex.from_tuples.html#pandas.IntervalIndex.from_tuples
