Description
Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np
def add_quantiles(data, column, quantiles=4):
    """
    Returns the given dataframe with dummy columns for quantiles of a given column. Quantiles can be an int to
    specify equally spaced quantiles or a list of quantiles
    :param data: DataFrame :type data: DataFrame
    :param column: column for which to add quantiles :type column: string
    :param quantiles: number of quantiles to generate or list of quantiles :type quantiles: Union[int, list of float]
    :return: DataFrame
    """
    if isinstance(quantiles, int):
        labels = [column + "_" + str(int(quantile / quantiles * 100)) + "q" for quantile in range(1, quantiles + 1)]
    if isinstance(quantiles, list):
        labels = [column + "_" + str(int(quantile * 100)) + "q" for quantile in quantiles]
        del labels[0]  # Bin labels must be one fewer than the number of bin edges
    data = pd.concat([data, pd.get_dummies(pd.qcut(x=data[column],
                                                   q=quantiles,
                                                   labels=labels, duplicates='drop'))], axis=1)
    return data
zs = np.zeros(3)
rs = np.random.randint(1, 100, size=3)
arr = np.concatenate((zs, rs))
ser = pd.Series(arr)
df = pd.DataFrame({'numbers':ser})
print(df)
#    numbers
# 0      0.0
# 1      0.0
# 2      0.0
# 3     33.0
# 4     81.0
# 5     13.0
print(add_quantiles(df, 'numbers'))
Traceback (most recent call last):
  File "pandas_qcut.py", line 29, in <module>
    print(add_quantiles(df, 'numbers'))
  File "pandas_qcut.py", line 20, in add_quantiles
    labels=labels, duplicates='drop'))], axis=1)
  File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 206, in qcut
    dtype=dtype, duplicates=duplicates)
  File "/home/mindcraft/anaconda3/lib/python3.6/site-packages/pandas/core/reshape/tile.py", line 252, in _bins_to_cuts
    raise ValueError('Bin labels must be one fewer than '
ValueError: Bin labels must be one fewer than the number of bin edges
Problem description
When this function is called with quantiles that produce repeated bin edges, it raises "ValueError: Bin labels must be one fewer than the number of bin edges". With the optional parameter duplicates='drop', the duplicate edges are removed before the labels are checked, so the number of bins is no longer known in advance. The only way to pass a valid "labels" argument is to check for duplicate bins beforehand, which means repeating the bin calculation that qcut already performs.
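For illustration, the workaround described above might look roughly like the sketch below. The helper name `qcut_keep_labels` is hypothetical and not part of pandas. It assumes one label per requested bin and that `Series.quantile` yields the same edges `qcut` computes internally.

```python
import numpy as np
import pandas as pd


def qcut_keep_labels(values, q, labels):
    """Hypothetical helper: qcut with duplicates='drop' and a matching label list."""
    # Quantile levels: q equal steps for an int, or the given list of levels.
    levels = np.linspace(0, 1, q + 1) if isinstance(q, int) else np.asarray(q, dtype=float)
    if len(labels) != len(levels) - 1:
        raise ValueError("expected one label per requested bin")
    # Recompute the bin edges by hand (the duplicated work the report describes).
    edges = values.quantile(levels).values
    # A bin survives duplicate dropping only if its right edge exceeds its left edge.
    survives = np.diff(edges) > 0
    kept = [label for label, ok in zip(labels, survives) if ok]
    return pd.qcut(values, q, labels=kept, duplicates="drop")


numbers = pd.Series([0.0, 0.0, 0.0, 33.0, 81.0, 13.0])
binned = qcut_keep_labels(numbers, 4, ["25q", "50q", "75q", "100q"])
print(binned.tolist())  # ['50q', '50q', '50q', '100q', '100q', '75q']
```

Here the "25q" label is discarded because the 0% and 25% quantiles are both 0, so that bin collapses. This is the bookkeeping the report suggests qcut could do itself.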
Expected Output
pd.qcut should return the quantized column, with the labels corresponding to the indices of the unique bins.
E.g. output of add_quantiles:
   numbers  numbers_50q  numbers_75q  numbers_100q
0      0.0            1            0             0
1      0.0            1            0             0
2      0.0            1            0             0
3     33.0            0            0             1
4     81.0            0            0             1
5     13.0            0            1             0
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
pandas: 0.22.0
pytest: 3.5.0
pip: 18.0
setuptools: 40.0.0
Cython: 0.28.1
numpy: 1.14.2
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.7.2
patsy: 0.5.0
dateutil: 2.7.2
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.1
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.0.2
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None