Feature: Qcut when passed labels and duplicates='drop' should drop corresponding labels #22669

Open
@olveirap

Description

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
def add_quantiles(data, column, quantiles=4):
    """
    Return the given DataFrame with dummy columns for the quantiles of a given column. Quantiles can be an int,
    to specify equally spaced quantiles, or a list of quantiles.
    :param data: DataFrame :type data: DataFrame
    :param column: column to compute quantiles for :type column: string
    :param quantiles: number of quantiles to generate or list of quantiles :type quantiles: Union[int, list of float]
    :return: DataFrame
    """
    if isinstance(quantiles, int):
        labels = [column + "_" + str(int(quantile / quantiles * 100)) + "q" for quantile in range(1, quantiles + 1)]
    elif isinstance(quantiles, list):
        labels = [column + "_" + str(int(quantile * 100)) + "q" for quantile in quantiles]
        del labels[0]  # Bin labels must be one fewer than the number of bin edges
    data = pd.concat([data, pd.get_dummies(pd.qcut(x=data[column],
                                                   q=quantiles,
                                                   labels=labels, duplicates='drop'))], axis=1)
    return data

zs = np.zeros(3)
rs = np.random.randint(1, 100, size=3)
arr = np.concatenate((zs, rs))
ser = pd.Series(arr)
df = pd.DataFrame({'numbers':ser})
print(df)
#numbers
#0      0.0
#1      0.0
#2      0.0
#3     33.0
#4     81.0
#5     13.0
print(add_quantiles(df, 'numbers'))
Traceback (most recent call last):
  File "pandas_qcut.py", line 29, in <module>
    print(add_quantiles(df, 'numbers'))
  File "pandas_qcut.py", line 20, in add_quantiles
    labels=labels, duplicates='drop'))], axis=1)
  File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 206, in qcut
    dtype=dtype, duplicates=duplicates)
  File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 252, in _bins_to_cuts
    raise ValueError('Bin labels must be one fewer than '
ValueError: Bin labels must be one fewer than the number of bin edges

Problem description

When pd.qcut is called with quantiles that produce duplicate bin edges, passing duplicates='drop' together with "labels" raises "ValueError: Bin labels must be one fewer than the number of bin edges": the duplicate edges are dropped, but the labels that belonged to them are not. As a result, the only way to pass a valid "labels" argument alongside "duplicates" is to check for duplicate bins beforehand, which means repeating the bin calculation outside of qcut.
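
The mismatch can be seen in isolation with the sketch below. It uses the six values printed above; the label names are illustrative placeholders and not the ones built by add_quantiles.

```python
import pandas as pd

values = pd.Series([0.0, 0.0, 0.0, 33.0, 81.0, 13.0])  # data from the report
labels = ["q25", "q50", "q75", "q100"]  # illustrative labels, one per requested quantile

# Without labels, duplicates="drop" succeeds; retbins=True exposes the surviving edges.
codes, edges = pd.qcut(values, q=4, labels=False, retbins=True, duplicates="drop")
n_bins = len(edges) - 1  # the 25% quantile equals the minimum (0.0), so one edge is dropped

# With all four labels, the same call fails because fewer than four bins remain.
try:
    pd.qcut(values, q=4, labels=labels, duplicates="drop")
    error = None
except ValueError as exc:
    error = str(exc)

print(n_bins, len(labels), error)
```

Here n_bins is 3 while four labels were supplied, which is exactly the condition the ValueError reports.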

Expected Output

pd.qcut should return the quantized column, labeled with the labels that correspond to the indices of the unique bins.
E.g., the output of add_quantiles:

   numbers  numbers_50q  numbers_75q  numbers_100q
0      0.0            1            0             0
1      0.0            1            0             0
2      0.0            1            0             0
3     33.0            0            0             1
4     81.0            0            0             1
5     13.0            0            1             0
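
Until qcut does this itself, the behaviour can be approximated outside pandas. The following is a minimal sketch of one possible workaround, not a proposed implementation: qcut_drop_labels is a hypothetical helper, and it assumes the labels are given one per requested bin and that a label should be dropped when its bin has zero width.

```python
import numpy as np
import pandas as pd


def qcut_drop_labels(values, q, labels):
    """Hypothetical helper (not part of pandas): drop the labels of zero-width bins."""
    # Recompute the quantile edges that qcut would use for an int or a list of quantiles.
    probs = np.linspace(0, 1, q + 1) if isinstance(q, int) else np.asarray(q)
    raw_edges = values.quantile(probs).to_numpy()
    # A bin whose right edge equals its left edge disappears under duplicates="drop".
    keep = np.diff(raw_edges) > 0
    kept_labels = [label for label, kept in zip(labels, keep) if kept]
    return pd.qcut(values, q=q, labels=kept_labels, duplicates="drop")


numbers = pd.Series([0.0, 0.0, 0.0, 33.0, 81.0, 13.0])  # data from the report
labels = ["numbers_25q", "numbers_50q", "numbers_75q", "numbers_100q"]
binned = qcut_drop_labels(numbers, 4, labels)
print(binned.tolist())
```

For the data above this drops numbers_25q and assigns the remaining three labels as in the expected table. The duplicated quantile computation is the repetition the report asks pandas to make unnecessary.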

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None

pandas: 0.22.0
pytest: 3.5.0
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
