pd.cut returning incorrect output in some cases #31586

Open
@dsaxton

(Below is from master)

import numpy as np
import pandas as pd

arr = np.arange(10).astype(object)
arr[::2] = np.nan

print(arr)
# [nan 1 nan 3 nan 5 nan 7 nan 9]

result = pd.cut(arr, 2)

print(result)
# [NaN, (0.992, 5.0], NaN, (0.992, 5.0], NaN, (0.992, 5.0], NaN, (0.992, 5.0], NaN, (0.992, 5.0]]
# Categories (2, interval[float64]): [(0.992, 5.0] < (5.0, 9.0]]

print(result.unique())
# [NaN, (0.992, 5.0]]
# Categories (1, interval[float64]): [(0.992, 5.0]]

Using cut with an object-dtype array containing missing values seems to return the wrong intervals in some cases. In the example above only the first interval appears in the result, although 7 and 9 belong in (5.0, 9.0]. The only situation where I've been able to reproduce this is when the NaN values are evenly spaced, which is strange.
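As a possible workaround, casting the array to float before calling cut appears to avoid the problem. This is only a sketch, and it assumes the object array holds nothing but numbers and NaN; the name `as_float` is just for illustration.

```python
import numpy as np
import pandas as pd

arr = np.arange(10).astype(object)
arr[::2] = np.nan

# Assumption: every element is numeric or NaN, so the cast cannot fail.
as_float = arr.astype(float)

result = pd.cut(as_float, 2)

# Codes index into the categories; -1 marks a missing value.
print(result.codes)
# [-1  0 -1  0 -1  0 -1  1 -1  1]
```

With a float array, 7 and 9 land in the second interval as expected.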

Looks like the problem is due to searchsorted in numpy:

import numpy as np

arr = np.array([1, 2, 3, 4, 5], dtype=object)
arr[::2] = np.nan

print(arr)
# [nan 2 nan 4 nan]

bins = np.array([1, 3, 5])

# Object dtype: 2 and 4 are both inserted at position 1 (incorrect)
bins.searchsorted(arr)
# array([0, 1, 0, 1, 0])

# Float dtype: 2 and 4 are inserted at different positions, NaN sorts last (correct)
bins.searchsorted(arr.astype(float))
# array([3, 1, 3, 2, 3])

np.__version__
# '1.17.5'
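One way the alternating NaN pattern could matter is sketched below in pure Python. It is a simplified model, not numpy's code: it assumes the search runs a left binary search per key and reuses the previous key's bounds whenever the previous key is not less than the current one. The function name `searchsorted_left_reusing_bounds` is hypothetical. Because every comparison against NaN is False, each NaN key collapses the window to the start, and the next real key is then searched in a window that is too small.

```python
def searchsorted_left_reusing_bounds(sorted_values, keys):
    """Left binary search that narrows its window using the previous key.

    Assumed model for illustration only; not taken from numpy's source.
    """
    n = len(sorted_values)
    lo, hi = 0, n
    out = []
    if not keys:
        return out
    last_key = keys[0]
    for key in keys:
        if last_key < key:
            # Keys look ascending: keep the lower bound, reset the upper one.
            hi = n
        else:
            # Keys look non-ascending: restart low, widen the upper bound by one.
            lo = 0
            hi = hi + 1 if hi < n else n
        last_key = key
        while lo < hi:
            mid = lo + (hi - lo) // 2
            if sorted_values[mid] < key:
                lo = mid + 1
            else:
                hi = mid
        out.append(lo)
    return out


nan = float("nan")
with_nan = searchsorted_left_reusing_bounds([1, 3, 5], [nan, 2, nan, 4, nan])
without_nan = searchsorted_left_reusing_bounds([1, 3, 5], [2, 4])

print(with_nan)     # [0, 1, 0, 1, 0]
print(without_nan)  # [1, 2]
```

Under this model the key 4 is only ever compared against the first bin edge, which reproduces the object-dtype output above; without the interleaved NaN keys the same search returns the expected positions.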
