Groupby indices error with datetime categorical #26859

Closed
@alexifm

Description

Code Sample

import pandas as pd

df = pd.DataFrame({
    'a': pd.Series(list('abc')),
    'b': pd.Series(pd.to_datetime(['2018-01-01', '2018-02-01', '2018-03-01']), dtype='category'),
    'c': pd.Categorical.from_codes([-1, 0, 1], categories=[0, 1])
})

df.groupby(['a', 'b']).indices

Problem description

This raises an error. You can play around with different choices of columns, but it happens whenever 'b' is grouped together with at least one of the other columns. Grouping on 'b' alone is okay.

>> df.groupby(['a', 'b']).indices
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-c4de90de974e> in <module>
      1 gb = df.groupby(['a', 'b'])
----> 2 gb.indices

/opt/conda/lib/python3.6/site-packages/pandas/core/groupby/groupby.py in indices(self)
    401         """
    402         self._assure_grouper()
--> 403         return self.grouper.indices
    404 
    405     def _get_indices(self, names):

pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()

/opt/conda/lib/python3.6/site-packages/pandas/core/groupby/ops.py in indices(self)
    204             keys = [com.values_from_object(ping.group_index)
    205                     for ping in self.groupings]
--> 206             return get_indexer_dict(label_list, keys)
    207 
    208     @property

/opt/conda/lib/python3.6/site-packages/pandas/core/sorting.py in get_indexer_dict(label_list, keys)
    331     group_index = group_index.take(sorter)
    332 
--> 333     return lib.indices_fast(sorter, group_index, keys, sorted_labels)
    334 
    335 

pandas/_libs/lib.pyx in pandas._libs.lib.indices_fast()

TypeError: Cannot convert DatetimeIndex to numpy.ndarray
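
For contrast with the traceback above, here is a minimal sketch of the case reported as working: grouping on 'b' alone. Passing observed=True is my own addition so the snippet behaves the same regardless of the default for categorical groupers; it is not part of the original report.

```python
import pandas as pd

df = pd.DataFrame({
    'a': pd.Series(list('abc')),
    'b': pd.Series(pd.to_datetime(['2018-01-01', '2018-02-01', '2018-03-01']), dtype='category'),
})

# With a single grouper, .indices is answered by the grouping itself
# and never goes through get_indexer_dict.
single = df.groupby('b', observed=True).indices
print(len(single))
```

Only the multi-column path builds the keys list, which is why the error needs 'b' plus at least one other column.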

Expected Output

Not an error.

Cause

If we inspect BaseGrouper.indices, we see that keys gets passed to get_indexer_dict here:

def indices(self):
    """ dict {group name -> group indices} """
    if len(self.groupings) == 1:
        return self.groupings[0].indices
    else:
        label_list = [ping.labels for ping in self.groupings]
        keys = [com.values_from_object(ping.group_index)
                for ping in self.groupings]
        return get_indexer_dict(label_list, keys)

get_indexer_dict eventually passes the elements of keys to get_value_at found here:

cdef inline object get_value_at(ndarray arr, object loc):
    cdef:
        Py_ssize_t i
    i = validate_indexer(arr, loc)
    return arr[i]

The problem is that, to build keys, the get_values method is called on each group index (you can see in BaseGrouper.indices why this isn't an issue when there's a single grouper). When grouping on a categorical-datetime column like df['b'], get_values is called on the underlying categorical array. Within that method, the following branch of the if statement is triggered, so a DatetimeIndex is returned instead of a numpy array, and the Cython code, which expects an ndarray, raises the TypeError above.

return self.categories.take(self._codes, fill_value=np.nan)
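
To see what that branch returns, here is a small sketch that reproduces the take call on a datetime categorical using public attributes only. It uses cat.codes in place of the private self._codes and leaves out fill_value for simplicity, so it illustrates the return type rather than copying the internal method.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(pd.to_datetime(['2018-01-01', '2018-02-01', '2018-03-01']))

# The categories of a datetime categorical are a DatetimeIndex,
# and Index.take returns another Index of the same kind.
taken = cat.categories.take(cat.codes)

print(type(taken).__name__)
print(isinstance(taken, np.ndarray))
```

The result is an Index subclass, not an ndarray, which is what the ndarray-typed Cython argument rejects.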

Solution

Now, the Categorical.get_values docstring does state that an Index object could be returned rather than a numpy array. The simplest fix is to introduce a line like this before the call to get_indexer_dict:

keys = [np.array(key) for key in keys]
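
As a quick check of what that line does, here is a sketch that applies it to a hand-built keys list. The list contents are made up for illustration and stand in for what BaseGrouper.indices would build for columns 'a' and 'b'.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for keys: one plain ndarray and one DatetimeIndex.
keys = [
    np.array(['a', 'b', 'c'], dtype=object),
    pd.to_datetime(['2018-01-01', '2018-02-01', '2018-03-01']),
]

# The proposed conversion: every element becomes a numpy array.
keys = [np.array(key) for key in keys]

print([type(key).__name__ for key in keys])
print(keys[1].dtype.kind)
```

Elements that are already arrays are simply copied, so the extra line should be harmless for the non-categorical groupers.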

A pull request for this will be created imminently.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-21-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.14.3
scipy: 1.2.1
pyarrow: 0.13.0
xarray: None
IPython: 7.4.0
sphinx: 2.0.0
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.3.1
pandas_gbq: None
pandas_datareader: 0.7.0
gcsfs: None
