BUG: Fix bad datetime64[ns] array to str dtype conversion in Series ctor which causes odd read_csv behaviour #57937

1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -412,6 +412,7 @@ Numeric

Conversion
^^^^^^^^^^
- Bug in :class:`Series` constructor that caused bad datetime to str dtype conversions in ``read_csv`` (:issue:`57512`)
- Bug in :meth:`DataFrame.astype` not casting ``values`` for Arrow-based dictionary dtype correctly (:issue:`58479`)
- Bug in :meth:`DataFrame.update` bool dtype being converted to object (:issue:`55509`)
- Bug in :meth:`Series.astype` might modify read-only array inplace when casting to a string dtype (:issue:`57212`)
3 changes: 2 additions & 1 deletion pandas/_libs/lib.pyx
@@ -758,7 +758,8 @@ cpdef ndarray[object] ensure_string_array(
         # dtype check to exclude DataFrame
         # GH#41409 TODO: not a great place for this
         out = arr.astype(str).astype(object)
-        out[arr.isna()] = na_value
+        if convert_na_value:
+            out[arr.isna()] = na_value
         return out
     arr = arr.to_numpy(dtype=object)
 elif not util.is_array(arr):
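To make the effect of the new guard concrete, here is a minimal pure-Python sketch of this branch (the `ensure_string_array_sketch` helper is hypothetical; the real implementation is the cython function above): with `convert_na_value=False`, missing entries keep their stringified `'NaT'` form instead of being overwritten with `na_value`.

```python
import numpy as np
import pandas as pd

def ensure_string_array_sketch(arr, na_value=np.nan, convert_na_value=True):
    # Mirrors the patched branch: stringify the datetimelike EA, then only
    # overwrite the missing entries with na_value when convert_na_value is set.
    out = arr.astype(str).astype(object)
    if convert_na_value:
        out[arr.isna()] = na_value
    return out

dta = pd.array(["2024-03-17", None], dtype="datetime64[ns]")
print(ensure_string_array_sketch(dta))                          # NaT slot -> nan
print(ensure_string_array_sketch(dta, convert_na_value=False))  # NaT slot -> 'NaT'
```

This is what lets `_try_cast` in `construction.py` below pass `convert_na_value=False` and get back `'NaT'` strings rather than NaN.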
1 change: 1 addition & 0 deletions pandas/core/construction.py
@@ -799,6 +799,7 @@ def _try_cast(
    shape = arr.shape
    if arr.ndim > 1:
        arr = arr.ravel()
    arr = ensure_wrapped_if_datetimelike(arr)
Member

why is this necessary? seems weird to do this cast before calling ensure_string_array

Member

does the test added in this PR address the comment on L792?

Contributor Author (@ruimamaral, Apr 16, 2024)

ensure_string_array treats EAs specially: in this case, both checks on lines 748 and 750 pass, and the array gets converted to object by two consecutive astype calls, first to str and then to object, which works as expected.

However, if we do not cast it to an EA, the check on line 748 (in ensure_string_array) will fail, which leads to the array being cast directly to object dtype by np.asarray on line 760:

pandas/_libs/lib.pyx, lines 748 to 760 in bb0fcc2:

if hasattr(arr, "to_numpy"):
    if hasattr(arr, "dtype") and arr.dtype.kind in "mM":
        # dtype check to exclude DataFrame
        # GH#41409 TODO: not a great place for this
        out = arr.astype(str).astype(object)
        out[arr.isna()] = na_value
        return out
    arr = arr.to_numpy(dtype=object)
elif not util.is_array(arr):
    arr = np.array(arr, dtype="object")

result = np.asarray(arr, dtype="object")

This is especially problematic for M8[ns] arrays, since numpy converts the dates into integer nanosecond timestamps (strangely, this doesn't seem to happen with any other M8 dtype array, just the nanosecond one).

The resulting Series will then contain these timestamps instead of the usual YYYY-MM-DD HH:MM:SS format, and cannot be easily converted back to dt64.
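The numpy behaviour described here can be checked directly (a small standalone demo; not part of the PR): casting datetime64 to object yields datetime.datetime objects for second/milli/microsecond units, but nanosecond values do not fit in datetime.datetime, so numpy falls back to plain integers.

```python
import numpy as np

dt_ms = np.array(["2024-03-17"], dtype="M8[ms]")
dt_ns = np.array(["2024-03-17"], dtype="M8[ns]")

# s/ms/us precision converts to datetime.datetime objects...
print(dt_ms.astype(object))  # [datetime.datetime(2024, 3, 17, 0, 0)]

# ...but ns precision cannot be represented by datetime.datetime,
# so numpy falls back to integer nanoseconds since the epoch
print(dt_ns.astype(object))  # [1710633600000000000]
```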

I thought casting it to DatetimeArray was a good way of preventing this from happening, but you know better than me so I'm open to suggestions.

Contributor Author

> does the test added in this PR address the comment on L792?

I guess it does; would you like me to remove it?

Contributor Author

@jbrockmendel Hi, sorry to be a bother, but would it be possible to get some feedback?
Thanks!

Member

Where does the nanosecond behavior diverge from other units for M8 arrays? I see you hinting at that but I'm not sure if it has been identified already.

I do agree it is strange for only nanoseconds to diverge from the other units

Contributor Author (@ruimamaral, May 19, 2024)

Yeah so, currently, whenever we cast an M8 array to string using the Series constructor we get a Series with object dtype containing stringified dates. We can recover the original timestamps from this Series easily.

Here is the described behaviour using an M8[ms] array (there is no particular reason for picking milliseconds; from what I have seen, this behaviour is observed with any precision other than nanoseconds):

>>> arr = np.array([
...     ('2024-03-17T00:00:00.000000000'),
...     ('2024-03-18T00:00:00.000000000'),
... ], dtype='M8[ms]')
>>> arr
array(['2024-03-17T00:00:00.000', '2024-03-18T00:00:00.000'],
      dtype='datetime64[ms]')
>>> s1 = pd.Series(arr, dtype=str)
>>> s1
0    2024-03-17 00:00:00
1    2024-03-18 00:00:00
dtype: object
>>> s2 = pd.Series(s1, dtype='M8[ns]')
>>> s2
0   2024-03-17
1   2024-03-18
dtype: datetime64[ns]

However, if (and only if) the array has nanosecond precision, we get a Series with object dtype containing unix timestamps instead of the usual stringified date format. The original timestamps cannot be easily recovered from the resulting Series:

>>> arr = np.array([
...     ('2024-03-17T00:00:00.000000000'),
...     ('2024-03-18T00:00:00.000000000'),
... ], dtype='M8[ns]')
>>> arr
array(['2024-03-17T00:00:00.000000000', '2024-03-18T00:00:00.000000000'],
      dtype='datetime64[ns]')
>>> s1 = pd.Series(arr, dtype=str)
>>> s1
0    1710633600000000000
1    1710720000000000000
dtype: object
>>> s2 = pd.Series(s1, dtype='M8[ns]')
>>> s2
0    1710633600000000000
1    1710720000000000000
dtype: object

This behaviour seems to stem from numpy's conversions as I mentioned in another comment.
I am not sure whether this is intentional, but it seems a bit odd to me, and I doubt the user would expect such a divergence.
This PR addresses this quirk so that nanoseconds are handled the same way as the other M8 types.

    else:
        shape = (len(arr),)
    return lib.ensure_string_array(arr, convert_na_value=False, copy=copy).reshape(
17 changes: 17 additions & 0 deletions pandas/tests/series/test_constructors.py
@@ -79,6 +79,23 @@ def test_infer_with_date_and_datetime(self):
expected = Index(vals, dtype=object)
tm.assert_index_equal(idx, expected)

@pytest.mark.parametrize("dtype", ["M8[s]", "M8[ms]", "M8[us]", "M8[ns]"])
def test_constructor_str_object_dtypes_dt64_array(self, dtype):
    # GH 57512
    dt_arr = np.array(
        [
            "2024-01-03T00:00:00.000000000",
            "2024-01-01T00:00:00.000000000",
        ],
        dtype=dtype,
    )
    result = Series(Series(dt_arr, dtype=str), dtype=dtype)
Member

I am a little unsure about this change and the issue from the OP - what is the reason for trying to have these losslessly convert between strings and datetimes? If you remove the string conversions, things work fine, right?

Contributor Author (@ruimamaral, May 13, 2024)

Thanks for the feedback and apologies for the delayed response.

While investigating the original problem, I tried to find the origin of the timestamps we see in the OP, and I found that the odd read_csv behaviour started because of some refactoring that was done to it.

I narrowed it down to the Series constructor in this refactored code, which ends up generating the timestamps when called with a dt64[ns] nparray and dtype=str:

if dtype is not None:
    new_col_dict = {}
    for k, v in col_dict.items():
        d = (
            dtype[k]
            if pandas_dtype(dtype[k]) in (np.str_, np.object_)
            else None
        )
        new_col_dict[k] = Series(v, index=index, dtype=d, copy=False)

I thought this behaviour from the Series constructor was not intended, since it is isolated and stands out from other combinations of types:

For example, the exact same ctor call with a dt64 nparray of any precision other than nanoseconds works fine, and the resulting Series can be losslessly converted back to a dt64 Series (without the fix, the test case I added already passes for every M8 precision except nanoseconds).
The same behaviour can be observed when calling it with a dt64[ns] nparray and dtype=object instead of str.

This seemed to be the standard behaviour, and it made sense to me that we should be able to recover the dates from the resulting strings.

I thought the best solution to this would be to deal with the root cause (assuming it's unintended behaviour) instead of trying to patch read_csv to work as before.

I'm not 100% sure I understood what you meant by removing the string conversions. However, if we do not call the ctor with dtype=str, everything works as expected, since the problem only arises when converting dt64[ns] dates in an nparray to string dtype through the Series ctor.
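For reference, the read_csv symptom from the OP can be reproduced along these lines (a hypothetical repro; the CSV content is made up, and what the column contains depends on whether the fix is applied):

```python
import io
import pandas as pd

# dtype=str sends the parsed datetime column through the
# Series(..., dtype=str) path discussed above; before the fix, a
# dt64[ns] column came back as epoch-nanosecond strings.
csv = io.StringIO("date\n2024-03-17\n2024-03-18\n")
df = pd.read_csv(csv, parse_dates=["date"], dtype=str)
print(df["date"])
```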

Member

> I'm not 100% sure if I understood what you meant by removing the string conversions, however, if we do not call the ctor with dtype=str, everything works as expected since the problem only arises when we try to convert dt64[ns] dates in an nparray to string dtype using the Series ctor.

Yea, ultimately I am trying to figure out what a user expects to happen when specifying dtype=str alongside parse_dates= - are they trying to get back a date or a string? I understand this may have worked previously, but on the flip side, if it's ambiguous what to expect, we may decide we want to warn / deprecate when receiving conflicting arguments like that.

Contributor Author (@ruimamaral, May 14, 2024)

Ah I see, and I totally agree; I feel it would be better if the user were at least warned about the incompatible arguments they specified.
But I'd be interested in hearing your thoughts on the Series constructor behaviour by itself, since, to me, it seems a bit unintuitive how it treats M8[ns] arrays differently from other M8 type arrays.

    expected = Series(dt_arr, dtype=dtype)
    tm.assert_series_equal(result, expected)

    result = Series(Series(dt_arr, dtype=object), dtype=dtype)
    tm.assert_series_equal(result, expected)

def test_unparsable_strings_with_dt64_dtype(self):
    # pre-2.0 these would be silently ignored and come back with object dtype
    vals = ["aa"]