ENH: pd.cut closed intervals #51534

Open
@Gabriel-Kissin

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

pd.cut and pd.qcut create intervals and partition the dataset according to those intervals. The intervals are generally open on the left and closed on the right, which works fine for continuous data.

However with discrete data this can sometimes give slightly ridiculous intervals. Integers are a case in point. If we have numbers from 0 to 99, pd.cut makes intervals like (-0.099, 14.1429], (14.1429, 28.2857].

Here is an MRE:

import pandas as pd
import numpy as np

# Integers 0..99 as example discrete data
interval_testing = pd.DataFrame({'data': np.arange(0, 100)})

# Equal-width binning; the bin edges come out as non-integers
interval_testing['interval'] = pd.cut(interval_testing['data'], bins=7, precision=4)
# interval_testing['interval'] = pd.qcut(interval_testing['data'], q=7, precision=5)

# Smallest value, largest value and number of rows in each bin
interval_testing.groupby('interval').aggregate(['min', 'max', 'count'])

which outputs the following:
[image: groupby output showing min, max and count per interval]

It would be great if there were an option to specify that the data is discrete, so that the intervals would be like (14, 28], (28, 42] etc. (Not just integers: for data measured to one decimal place, for example, it would give (14.3, 28.7], (28.7, 42.4] or similar.) The level of discreteness can be inferred from the data. [EDIT: At the moment you can achieve something similar using the precision= parameter, but this doesn't automatically infer it from the data. The main point of this suggestion is the next one, that intervals should be closed.]
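As a rough illustration of inferring the discreteness from the data, here is a minimal sketch using only the existing pd.cut API. The helpers infer_step and snapped_cut are hypothetical names, not part of pandas, and the sketch assumes the smallest gap between distinct values is a good proxy for the data's resolution (for floats, rounding noise in that gap would need extra care).

```python
import numpy as np
import pandas as pd


def infer_step(values):
    """Hypothetical helper: smallest positive gap between distinct values."""
    uniq = np.unique(np.asarray(values, dtype=float))
    gaps = np.diff(uniq)
    # Fall back to 1.0 when there is only one distinct value
    return float(gaps.min()) if gaps.size else 1.0


def snapped_cut(values, bins):
    """Hypothetical helper: equal-width bins with edges snapped to the data's step."""
    arr = np.asarray(values, dtype=float)
    step = infer_step(arr)
    edges = np.linspace(arr.min(), arr.max(), bins + 1)
    # Snap every edge to the nearest multiple of the inferred step
    edges = np.round(edges / step) * step
    # Move the first edge one step down so the minimum falls inside the
    # first (left-open) interval
    edges[0] = arr.min() - step
    # Drop edges that collapsed onto each other after snapping
    edges = np.unique(edges)
    return pd.cut(values, bins=edges)


data = pd.Series(np.arange(0, 100))
binned = snapped_cut(data, bins=7)
print(binned.cat.categories)
```

The intervals are still open on the left and closed on the right; only the edges change.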

A further improvement would then be a parameter that makes the intervals fully closed, e.g. [15, 28], [29, 42] etc.

For example if the data is "number of times something happened", intervals like those described would be more intuitive.

It isn't particularly hard to work round this, but this might be a useful feature to add.
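For the equal-width case, one possible workaround with the current API might look like the sketch below. integer_cut is a hypothetical name, and the sketch assumes integer data and fewer bins than distinct values (otherwise the rounded edges can collide and the intervals become invalid). Non-integer values that fall between two intervals would come back as NaN.

```python
import numpy as np
import pandas as pd


def integer_cut(x, bins):
    """Hypothetical helper: equal-width, fully closed integer intervals."""
    lo, hi = int(np.min(x)), int(np.max(x))
    # Split the inclusive integer range [lo, hi] into `bins` nearly equal parts
    edges = np.linspace(lo, hi + 1, bins + 1).round().astype(int)
    left = edges[:-1]
    # Each interval ends one integer before the next one starts
    right = edges[1:] - 1
    intervals = pd.IntervalIndex.from_arrays(left, right, closed="both")
    return pd.cut(x, bins=intervals)


data = pd.Series(np.arange(0, 100))
binned = integer_cut(data, bins=7)
print(binned.cat.categories)
```

Passing an IntervalIndex as bins is what allows closed='both' here; pd.cut's own right= parameter only switches between left-closed and right-closed intervals.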

Feature Description

Some rough code:

import pandas as pd
import numpy as np

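def integer_qcut(x, q):
    # Only the quantile edges are needed; the binned result is discarded
    _, bins = pd.qcut(x, q, duplicates='drop', retbins=True)
    # Snap the edges down to integers
    bins = np.floor(bins).astype(int)
    bins_left = bins[:-1]
    # Each right edge stops one short of the next left edge; the last keeps the maximum
    bins_right = bins[1:] - np.array([1] * (len(bins) - 2) + [0])
    # Fully closed, non-overlapping integer intervals
    bins = pd.IntervalIndex.from_arrays(left=bins_left, right=bins_right, closed='both')
    return pd.cut(x=x, bins=bins)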


interval_testing = pd.DataFrame({'data': np.arange(0, 100)})
interval_testing['interval'] = integer_qcut(interval_testing['data'], q=7)

interval_testing.groupby('interval').aggregate(['min', 'max', 'count'])

gives

[image: groupby output showing min, max and count per closed integer interval]

Not perfect, because the final group is larger than the others (as a result of using np.floor), but it illustrates what I'm getting at.
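One way the imbalance might be reduced is to round the quantile edges to the nearest integer instead of flooring them. The sketch below is only a variant of the rough code above; integer_qcut_rounded is a hypothetical name, and it assumes the rounded edges stay distinct (with heavily tied data they may not, and IntervalIndex.from_arrays would then raise).

```python
import numpy as np
import pandas as pd


def integer_qcut_rounded(x, q):
    """Hypothetical variant: quantile edges rounded to the nearest integer."""
    _, edges = pd.qcut(x, q, duplicates="drop", retbins=True)
    # Round to nearest instead of always rounding down
    edges = np.rint(edges).astype(int)
    left = edges[:-1].copy()
    right = edges[1:].copy()
    # Every interval except the last ends one integer before the next starts
    right[:-1] -= 1
    intervals = pd.IntervalIndex.from_arrays(left, right, closed="both")
    return pd.cut(x, bins=intervals)


data = pd.Series(np.arange(0, 100))
binned = integer_qcut_rounded(data, q=7)
# Number of rows per interval, in interval order
print(np.bincount(binned.cat.codes))
```

On the 0 to 99 example the group sizes then differ by at most one, rather than the last group absorbing the whole remainder.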

Alternative Solutions

Potentially multiple ways of solving this...

Additional Context

No response
