(The output below is from pandas master.)
import numpy as np
import pandas as pd
arr = np.arange(10).astype(object)
arr[::2] = np.nan
print(arr)
# [nan 1 nan 3 nan 5 nan 7 nan 9]
result = pd.cut(arr, 2)
print(result)
# [NaN, (0.992, 5.0], NaN, (0.992, 5.0], NaN, (0.992, 5.0], NaN, (0.992, 5.0], NaN, (0.992, 5.0]]
# Categories (2, interval[float64]): [(0.992, 5.0] < (5.0, 9.0]]
print(result.unique())
# [NaN, (0.992, 5.0]]
# Categories (1, interval[float64]): [(0.992, 5.0]]
Using pd.cut with an array of object dtype containing missing values seems to return the wrong intervals in some cases (e.g., in the example above only the first interval appears in the result, even though 7 and 9 belong in (5.0, 9.0]). So far I have only been able to reproduce this when the NaN values are evenly spaced, which is strange.
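A possible workaround, sketched below, is to cast the array to float before binning. This assumes every non-missing value is numeric, so that astype(float) succeeds; it is not a fix for the underlying behaviour.

```python
import numpy as np
import pandas as pd

arr = np.arange(10).astype(object)
arr[::2] = np.nan

# Workaround sketch: cast to float so the NaN values are handled by the
# float code path rather than by object comparisons.
# Assumes all non-missing values are numeric.
result = pd.cut(arr.astype(float), 2)

# Code -1 marks a missing value; 0 and 1 index the two intervals.
codes = [int(c) for c in result.codes]
print(codes)
```

With the cast in place, 1, 3 and 5 should land in the first interval and 7 and 9 in the second.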
Looks like the problem is due to searchsorted in numpy:
import numpy as np
arr = np.array([1, 2, 3, 4, 5], dtype=object)
arr[::2] = np.nan
print(arr)
# [nan 2 nan 4 nan]
bins = np.array([1, 3, 5])
# Inserts into same position (incorrect)
bins.searchsorted(arr)
# array([0, 1, 0, 1, 0])
# Now inserts into different positions (correct)
bins.searchsorted(arr.astype(float))
# array([3, 1, 3, 2, 3])
np.__version__
# '1.17.5'
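A likely explanation, offered as a guess rather than a confirmed diagnosis: every ordering comparison against NaN is False, so a binary search that relies on plain object comparisons has no consistent position for it, whereas the float path uses an ordering that places NaN after all other values. A minimal sketch of that difference:

```python
import numpy as np

bins = np.array([1, 3, 5])
arr = np.array([1, 2, 3, 4, 5], dtype=object)
arr[::2] = np.nan

# Ordering comparisons against NaN are always False, which is what an
# object-level comparison would see during the search.
nan = float("nan")
comparisons = (nan < 1, nan > 1, nan == nan)
print(comparisons)  # (False, False, False)

# Casting to float first lets NumPy apply its NaN-aware ordering,
# in which NaN sorts after every other value.
positions = bins.searchsorted(arr.astype(float))
print(positions)  # [3 1 3 2 3]
```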