Description
Code Sample
import pandas as pd

df = pd.DataFrame({
    'a': pd.Series(list('abc')),
    'b': pd.Series(pd.to_datetime(['2018-01-01', '2018-02-01', '2018-03-01']), dtype='category'),
    'c': pd.Categorical.from_codes([-1, 0, 1], categories=[0, 1])
})
df.groupby(['a', 'b']).indices
Problem description
This raises an error. You can play around with different choices of columns, but the error occurs whenever 'b' is combined with one of the other columns. Grouping on 'b' alone is fine.
>> df.groupby(['a', 'b']).indices
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-c4de90de974e> in <module>
1 gb = df.groupby(['a', 'b'])
----> 2 gb.indices
/opt/conda/lib/python3.6/site-packages/pandas/core/groupby/groupby.py in indices(self)
401 """
402 self._assure_grouper()
--> 403 return self.grouper.indices
404
405 def _get_indices(self, names):
pandas/_libs/properties.pyx in pandas._libs.properties.CachedProperty.__get__()
/opt/conda/lib/python3.6/site-packages/pandas/core/groupby/ops.py in indices(self)
204 keys = [com.values_from_object(ping.group_index)
205 for ping in self.groupings]
--> 206 return get_indexer_dict(label_list, keys)
207
208 @property
/opt/conda/lib/python3.6/site-packages/pandas/core/sorting.py in get_indexer_dict(label_list, keys)
331 group_index = group_index.take(sorter)
332
--> 333 return lib.indices_fast(sorter, group_index, keys, sorted_labels)
334
335
pandas/_libs/lib.pyx in pandas._libs.lib.indices_fast()
TypeError: Cannot convert DatetimeIndex to numpy.ndarray
Expected Output
No error: indices should return a dict mapping each group key to the positions of its rows.
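For comparison, here is a minimal sketch of what indices returns in a case that does work, grouping on the single string column. The one-column frame is a cut-down stand-in for the df above, not part of the original report.

```python
import pandas as pd

# Cut-down stand-in for the frame in the code sample: only column 'a'.
df_single = pd.DataFrame({'a': list('abc')})

# With one grouping key, indices is a dict: group key -> array of row positions.
indices = df_single.groupby('a').indices

for key, positions in indices.items():
    print(key, positions.tolist())
```

The multi-key call in the code sample would be expected to return the same kind of dict, keyed by tuples instead of scalars.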
Cause
If we inspect BaseGrouper.indices, we see that keys gets passed to get_indexer_dict here:
pandas/pandas/core/groupby/ops.py
Lines 227 to 235 in 430f0fd
get_indexer_dict eventually passes the elements of keys to get_value_at, found here:
Lines 94 to 99 in 2b32e41
The problem is that, to build keys, the get_values method is called on each group index (you can see in BaseGrouper.indices why this isn't an issue when there's a single grouper). When grouping on a categorical-datetime column like df['b'], the get_values method of the underlying categorical array is called, and within that method the following branch of the if statement is triggered, causing a DatetimeIndex to be returned instead of a numpy array:
pandas/pandas/core/arrays/categorical.py
Line 1504 in 2b32e41
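The effect of that branch can be sketched with public attributes only. This is an approximation, assuming the datetime-like branch amounts to taking the categories at each code; it does not call get_values itself, which may not exist in other pandas versions.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2018-01-01', '2018-02-01', '2018-03-01'])
cat = pd.Categorical(dates)

# Assumed approximation of the datetime-like branch: materialise the values
# by taking the categories at each code. The categories are a DatetimeIndex,
# so the result is a DatetimeIndex as well.
materialised = cat.categories.take(cat.codes)

print(type(materialised).__name__)
print(isinstance(materialised, np.ndarray))
```

The result is an Index subclass rather than a numpy.ndarray, which is the kind of object that lib.indices_fast rejects in the traceback above.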
Solution
Now, the Categorical.get_values docstring does state that an Index object may be returned rather than a numpy array. The simplest fix is to introduce a line like this before the call to get_indexer_dict:
keys = [np.array(key) for key in keys]
A pull request for this will be created imminently.
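A small sketch of what that coercion does, run outside pandas internals. The keys list here is a hand-built stand-in for the one assembled in BaseGrouper.indices, mixing a plain object array with a DatetimeIndex.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the keys built in BaseGrouper.indices:
# one ordinary ndarray and one DatetimeIndex.
keys = [
    np.array(['a', 'b', 'c'], dtype=object),
    pd.to_datetime(['2018-01-01', '2018-02-01', '2018-03-01']),
]

# The proposed coercion: every key becomes a plain numpy array.
keys = [np.array(key) for key in keys]

for key in keys:
    print(type(key).__name__, key.dtype)
```

Keys that are already arrays are copied, and the DatetimeIndex is converted to a datetime64 array, so every key has the type the Cython routine expects.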
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.18.0-21-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.2
pytest: 4.3.1
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.14.3
scipy: 1.2.1
pyarrow: 0.13.0
xarray: None
IPython: 7.4.0
sphinx: 2.0.0
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: 2.6.1
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.5
lxml.etree: 4.3.3
bs4: 4.7.1
html5lib: 1.0.1
sqlalchemy: 1.3.1
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: 0.3.1
pandas_gbq: None
pandas_datareader: 0.7.0
gcsfs: None