qcut not respecting precision values #32127

Open
@m-dz

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({"a": np.random.rand(20)})
for prec in range(1, 10):
    print(f"Prec: {prec}, first interval: {pd.qcut(df['a'], q=10, precision=prec).cat.categories[0]}")

# Prec: 1, first interval: (0.011000000000000001, 0.15]
# Prec: 2, first interval: (0.011000000000000001, 0.15]
# Prec: 3, first interval: (0.0196, 0.146]
# Prec: 4, first interval: (0.02048, 0.1462]
# Prec: 5, first interval: (0.020574000000000002, 0.1462]
# Prec: 6, first interval: (0.020583499999999998, 0.146203]
# Prec: 7, first interval: (0.02058439, 0.1462034]
# Prec: 8, first interval: (0.020584483999999997, 0.14620343]
# Prec: 9, first interval: (0.0205844933, 0.14620343]

Problem description

The example above shows inconsistent behaviour of qcut when the precision argument changes. I can partly understand the output for precisions 1 and 2 (perhaps it is not possible to create enough distinct intervals at such low precision, although in that case an error or warning should be raised). The output for precisions 5, 6 and 8 is harder to explain: the left edge of the first interval is printed with far more digits than requested (e.g. 0.020583499999999998 instead of 0.0205835), which causes problems when the intervals are used as labels for printing or plotting.
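
Until this is resolved, one possible workaround for the labelling problem is to format the interval edges explicitly and rename the categories. The sketch below is only an illustration: `format_interval` is a hypothetical helper (not part of pandas), it assumes right-closed intervals as produced by the default qcut call, and it assumes the formatted labels stay unique.

```python
import numpy as np
import pandas as pd


def format_interval(interval, precision):
    """Render an interval with at most `precision` significant digits per edge."""
    # Hypothetical helper, not a pandas API; assumes closed='right'.
    return f"({interval.left:.{precision}g}, {interval.right:.{precision}g}]"


np.random.seed(42)
df = pd.DataFrame({"a": np.random.rand(20)})

precision = 6
binned = pd.qcut(df["a"], q=10, precision=precision)

# Build string labels from the interval edges and swap them in.
labels = [format_interval(iv, precision) for iv in binned.cat.categories]
relabelled = binned.cat.rename_categories(labels)
print(relabelled.cat.categories[0])
```

Note that the renamed categories are plain strings, so interval-specific operations (such as membership tests) are no longer available on the result; this only addresses display.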

As far as I can see, this does not happen for all intervals, only for the first one. See the full set of categories for precisions 6 and 7 below (printed with the same "first interval" label as above):

Prec: 6, first interval: IntervalIndex([(0.020583499999999998, 0.146203], (0.146203, 0.176664], (0.176664, 0.203659], (0.203659, 0.299037], (0.299037, 0.403243], (0.403243, 0.554317], (0.554317, 0.633202], (0.633202, 0.752084], (0.752084, 0.87463], (0.87463, 0.96991]],
              closed='right',
              dtype='interval[float64]')
Prec: 7, first interval: IntervalIndex([(0.02058439, 0.1462034], (0.1462034, 0.1766637], (0.1766637, 0.2036587], (0.2036587, 0.299037], (0.299037, 0.4032426], (0.4032426, 0.5543173], (0.5543173, 0.6332023], (0.6332023, 0.7520837], (0.7520837, 0.87463], (0.87463, 0.9699099]],
              closed='right',
              dtype='interval[float64]')
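
A quick way to spot which edges are affected is to check whether each edge survives a round trip through formatting at the requested precision. This is a minimal sketch using edge values copied from the precision 6 output above; `survives_rounding` is a hypothetical helper name, and significant-digit formatting (`g`) is an assumption about what `precision` is meant to control.

```python
def survives_rounding(value, precision):
    """Return True if `value` equals itself rounded to `precision` significant digits."""
    return float(f"{value:.{precision}g}") == value


# Edge values taken from the precision=6 output in this report.
edges = [0.020583499999999998, 0.146203, 0.176664]
flags = [survives_rounding(edge, 6) for edge in edges]
print(flags)  # [False, True, True]
```

Only the left edge of the first interval fails the check, which matches the observation that the other intervals are printed as expected.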

I have checked issues for "qcut precision", but none of the existing issues seems to cover this.

Expected Output

# Prec: 1, first interval: (0.01, 0.2]
# Prec: 2, first interval: (0.011, 0.15]
# Prec: 3, first interval: (0.0196, 0.146]
# Prec: 4, first interval: (0.02048, 0.1462]
# Prec: 5, first interval: (0.020574, 0.1462]
# Prec: 6, first interval: (0.0205835, 0.146203]
# Prec: 7, first interval: (0.02058439, 0.1462034]
# Prec: 8, first interval: (0.020584484, 0.14620343]
# Prec: 9, first interval: (0.0205844933, 0.14620343]

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.6.10.final.0
python-bits : 64
OS : Linux
OS-release : 4.14.165-102.185.amzn1.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : None
pyxlsb : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : 1.2.7
numba : None
