REGR: Column with datetime values too big to be converted to pd.Timestamp leads to assertion error in groupby #36003

Closed
@Khris777

  • [x] I have checked that this issue has not already been reported.

  • [x] I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

Two different dates, one within the range of what pd.Timestamp can handle, the other outside of that range:

import pandas as pd
import datetime
df = pd.DataFrame({'A': ['X', 'Y'], 'B': [datetime.datetime(2005, 1, 1, 10, 30, 23, 540000),
                                          datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)]})
print(df.groupby('A').B.max())
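
The range limit mentioned above can be checked directly. The following is a minimal sketch, not part of the original report: the helper name `fits_ns_timestamp` is hypothetical, and it assumes naive `datetime.datetime` values and the default nanosecond-resolution bounds exposed as `pd.Timestamp.min` and `pd.Timestamp.max`.

```python
import datetime

import pandas as pd

in_range = datetime.datetime(2005, 1, 1, 10, 30, 23, 540000)
out_of_range = datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)

# Convert the bounds to plain datetimes once; warn=False silences the
# notice that the nanosecond part is dropped in the conversion.
_LOWER = pd.Timestamp.min.to_pydatetime(warn=False)
_UPPER = pd.Timestamp.max.to_pydatetime(warn=False)


def fits_ns_timestamp(value):
    """Hypothetical helper: True if a naive datetime fits the ns Timestamp range."""
    # The lower bound was truncated towards an earlier instant, so compare
    # strictly; the upper bound was truncated towards an earlier instant too,
    # so an inclusive comparison is safe there.
    return _LOWER < value <= _UPPER


print(fits_ns_timestamp(in_range))
print(fits_ns_timestamp(out_of_range))
```

Only the second date in the example falls outside these bounds, which is why it cannot be held as a nanosecond-resolution pd.Timestamp.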

Problem description

pd.Timestamp cannot represent a date as far out as the year 3005 (its upper bound, pd.Timestamp.max, falls in the year 2262), so I have to use datetime.datetime for such values. Before 1.1.1 (or possibly 1.1.0) this was not a problem, but now the code above raises an AssertionError:

Traceback (most recent call last):

  File "<ipython-input-38-8b8ec5e4e179>", line 5, in <module>
    print(df.groupby('A').B.max())

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1558, in max
    numeric_only=numeric_only, min_count=min_count, alias="max", npfunc=np.max

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1015, in _agg_general
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\generic.py", line 261, in aggregate
    func, *args, engine=engine, engine_kwargs=engine_kwargs, **kwargs

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1083, in _python_agg_general
    result, counts = self.grouper.agg_series(obj, f)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\ops.py", line 644, in agg_series
    return self._aggregate_series_fast(obj, func)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\ops.py", line 669, in _aggregate_series_fast
    result, counts = grouper.get_result()

  File "pandas\_libs\reduction.pyx", line 256, in pandas._libs.reduction.SeriesGrouper.get_result

  File "pandas\_libs\reduction.pyx", line 74, in pandas._libs.reduction._BaseGrouper._apply_to_group

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1060, in <lambda>
    f = lambda x: func(x, *args, **kwargs)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\groupby\groupby.py", line 1015, in <lambda>
    result = self.aggregate(lambda x: npfunc(x, axis=self.axis))

  File "<__array_function__ internals>", line 6, in amax

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\numpy\core\fromnumeric.py", line 2706, in amax
    keepdims=keepdims, initial=initial, where=where)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\numpy\core\fromnumeric.py", line 85, in _wrapreduction
    return reduction(axis=axis, out=out, **passkwargs)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\generic.py", line 11460, in stat_func
    func, name=name, axis=axis, skipna=skipna, numeric_only=numeric_only

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\series.py", line 4220, in _reduce
    delegate = self._values

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\series.py", line 572, in _values
    return self._mgr.internal_values()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\internals\managers.py", line 1615, in internal_values
    return self._block.internal_values()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\internals\blocks.py", line 2019, in internal_values
    return self.array_values()

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\internals\blocks.py", line 2022, in array_values
    return self._holder._simple_new(self.values)

  File "C:\Users\My.Name\AppData\Local\Continuum\miniconda3\envs\main\lib\site-packages\pandas\core\arrays\datetimes.py", line 290, in _simple_new
    assert values.dtype == "i8"

AssertionError

From testing with mixed pd.Timestamp and datetime.datetime values, I presume pandas converts the dates that fit (the first date in the example) to pd.Timestamp while leaving the others as datetime.datetime, which produces a mixed-type result column and triggers the assertion error.

Expected Output

Since I am explicitly working with datetime.datetime values, there should be no implicit conversion to pd.Timestamp unless it is certain that all values fall within the range pd.Timestamp supports.
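
Until the regression is resolved, one possible way to get the expected per-group maxima is to do the reduction in plain Python, so that no conversion to pd.Timestamp can take place. This is only a sketch of a workaround under that assumption, not the fix adopted by pandas; the names `rows`, `maxima`, and `result` are illustrative.

```python
import datetime

import pandas as pd

# Same data as the code sample, kept as plain (key, datetime) pairs.
rows = [
    ("X", datetime.datetime(2005, 1, 1, 10, 30, 23, 540000)),
    ("Y", datetime.datetime(3005, 1, 1, 10, 30, 23, 540000)),
]

# Track the largest datetime seen for each group key.
maxima = {}
for key, value in rows:
    if key not in maxima or value > maxima[key]:
        maxima[key] = value

# dtype=object asks pandas to store the values as generic Python objects
# rather than inferring a datetime64 dtype.
result = pd.Series(maxima, dtype=object, name="B")
result.index.name = "A"
print(result)
```

The trade-off is that this bypasses groupby entirely, so it is only practical when the data is small enough to iterate over in Python.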

Output of pd.show_versions()

commit : f2ca0a2
python : 3.7.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19041
machine : AMD64
processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : None.None

pandas : 1.1.1
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 50.0.0.post20200830
Cython : 0.29.21
pytest : None
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : 1.3.3
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : 0.8.0
fastparquet : 0.4.1
gcsfs : None
matplotlib : 3.3.1
numexpr : None
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : None
tabulate : 0.8.7
xarray : None
xlrd : 1.2.0
xlwt : None
numba : 0.51.1

    Labels

    Datetime (Datetime data dtype), Regression (Functionality that used to work in a prior pandas version)
