
BUG: groupby upon categorical and sort=False triggers ValueError #13179

Closed
@mpschr

Code that triggers ValueError

The bug is triggered by combining sort=False with a category that is declared but missing from the data (see below).

First off, see this notebook which showcases the bug nicely: github.com/mpschr/pandas_missing_cat_bug

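import random

import numpy as np
import pandas as pd

random.seed(88)  # seeds the chromosome labels only; the numeric columns do not matter here
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
chromosomes = [str(x) for x in range(1, 23)] + ["X", "Y"]
df.insert(0, 'chromosomes', sorted([random.choice(chromosomes) for x in range(100)]))
# astype call as accepted by pandas 0.18.1 (the version reported below)
df.chromosomes = df.chromosomes.astype('category', categories=chromosomes, ordered=True)

# Dropping chromosome '1' leaves a declared category with no rows
for c, g in df.query("chromosomes != '1'").groupby('chromosomes', sort=False):
    print(c, g.chromosomes.cat.categories, g.shape)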


/home/michi/bin/anaconda3/lib/python3.4/site-packages/pandas/core/groupby.py in __init__(self, index, grouper, obj, name, level, sort, in_axis)
   2181                     cat = self.grouper.unique()
   2182                     self.grouper = self.grouper.reorder_categories(
-> 2183                         cat.categories)
   2184 
   2185                 # we make a CategoricalIndex out of the cat grouper

/home/michi/bin/anaconda3/lib/python3.4/site-packages/pandas/core/categorical.py in reorder_categories(self, new_categories, ordered, inplace)
    756         """
    757         if set(self._categories) != set(new_categories):
--> 758             raise ValueError("items in new_categories are not the same as in "
    759                              "old categories")
    760         return self.set_categories(new_categories, ordered=ordered,

ValueError: items in new_categories are not the same as in old categories
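
The traceback points at reorder_categories being handed a category list that does not match the declared one. The snippet below is a minimal sketch of that mismatch in isolation, not the groupby code path itself. It assumes the observed categories are a strict subset of the declared ones, and it uses remove_unused_categories() as a stand-in for the self.grouper.unique() call in the traceback, because the result of unique() on a categorical differs between pandas versions.

```python
import pandas as pd

# Three categories are declared, but '1' never occurs in the data.
cat = pd.Categorical(["2", "3", "2"], categories=["1", "2", "3"], ordered=True)

# Stand-in (assumption) for the categories that groupby derived from the data.
observed = cat.remove_unused_categories().categories

try:
    cat.reorder_categories(observed)
    raised = False
    message = ""
except ValueError as exc:
    # reorder_categories requires exactly the same set of categories.
    raised = True
    message = str(exc)

print(raised, message)
```

Under this reading, any declared category with no rows is enough to make the two sets differ, which would match the "chromosome 1 filtered out" scenarios.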

Summary of the scenarios in which the bug appears:

Bug scenarios with ordered categories:

  • Default (sort = True): No error
  • chromosome 1 filtered out and sort=True: No error
  • chromosome 1 filtered out and sort=False: Error
  • sort = False: Error

Bug scenarios without ordered categories (the same four scenarios):

  • Default (sort = True): No error
  • chromosome 1 filtered out and sort=True: No error
  • sort = False: No error
  • chromosome 1 filtered out and sort=False: Error
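
The scenario lists above can be checked mechanically with a small harness along the following lines. This is a sketch under stated assumptions, not the code from the linked notebook: run_scenarios is a hypothetical helper, the categorical column is built with pd.CategoricalDtype (available in later pandas releases than the 0.18.1 reported here), and observed=False is passed explicitly to groupby, which also requires a later release. Which scenarios report an error depends on the installed pandas version.

```python
import random

import pandas as pd


def run_scenarios(ordered):
    """Try the four groupby scenarios and record whether each raises ValueError."""
    rng = random.Random(88)
    chromosomes = [str(x) for x in range(1, 23)] + ["X", "Y"]
    values = sorted(rng.choice(chromosomes) for _ in range(100))
    dtype = pd.CategoricalDtype(categories=chromosomes, ordered=ordered)
    df = pd.DataFrame({"chromosomes": pd.Series(values, dtype=dtype),
                       "A": range(100)})
    filtered = df[df["chromosomes"] != "1"]  # chromosome 1 filtered out

    results = {}
    for label, frame in (("all data", df),
                         ("chromosome 1 filtered out", filtered)):
        for sort in (True, False):
            try:
                # Iterating forces the groupby machinery to build the groups.
                for _name, _group in frame.groupby("chromosomes", sort=sort,
                                                   observed=False):
                    pass
                outcome = "No error"
            except ValueError:
                outcome = "Error"
            results[(label, sort)] = outcome
    return results


for ordered in (True, False):
    for (label, sort), outcome in run_scenarios(ordered).items():
        print(f"ordered={ordered}, {label}, sort={sort}: {outcome}")
```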

Expected Output

Not an error, but this:


1 Index(['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13',
       '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y'],
      dtype='object') (7, 5)

output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-34-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.4
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.22
numpy: 1.10.4
scipy: 0.16.0
statsmodels: 0.6.0.dev-9ce1605
xarray: None
IPython: 4.1.2
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.6.7
lxml: 3.4.2
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.36.0
pandas_datareader: None
