BUG: Fix bad datetime64[ns] array to str dtype conversion in Series ctor which causes odd read_csv behaviour #57937
@@ -79,6 +79,23 @@ def test_infer_with_date_and_datetime(self):
        expected = Index(vals, dtype=object)
        tm.assert_index_equal(idx, expected)

    @pytest.mark.parametrize("dtype", ["M8[s]", "M8[ms]", "M8[us]", "M8[ns]"])
    def test_constructor_str_object_dtypes_dt64_array(self, dtype):
        # GH 57512
        dt_arr = np.array(
            [
                "2024-01-03T00:00:00.000000000",
                "2024-01-01T00:00:00.000000000",
            ],
            dtype=dtype,
        )
        result = Series(Series(dt_arr, dtype=str), dtype=dtype)
I am a little unsure about this change and the issue from the OP - what is the reason for trying to have these losslessly convert between strings and datetimes? If you remove the string conversions, things work fine, right?

Thanks for the feedback and apologies for the delayed response. While investigating the original problem, I tried to find the origin of the timestamps we see in the OP, and I found out that the odd read_csv behaviour started because of some refactoring that was done to it. I narrowed it down to the Series constructor in this refactored code, which ends up generating the timestamps when called with a dt64[ns] nparray and dtype=str: pandas/pandas/io/parsers/readers.py, lines 1872 to 1880 in fb05cc7.

I thought this behaviour from the Series constructor was not intended, since it seems to be isolated and stands out from other combinations of types. For example, doing the exact same ctor call with a dt64 nparray of any precision other than nanoseconds works fine, and the resulting Series can be losslessly converted back to a dt64 Series (without the fix, the test case I added already passes whenever it runs with an array of an M8 type that is not the nanosecond one). This seemed like the standard, and it made sense to me that we could recover the dates from the resulting strings. I thought the best solution would be to deal with the root cause (assuming it is unintended behaviour) instead of trying to patch read_csv to work as before.

I'm not 100% sure I understood what you meant by removing the string conversions; however, if we do not call the ctor with dtype=str, everything works as expected, since the problem only arises when we try to convert dt64[ns] dates in an nparray to string dtype using the Series ctor.
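To make the lossless-round-trip point concrete, a minimal sketch (assuming the same kind of values as the added test; the described behaviour is that of the affected versions):

```python
import numpy as np
from pandas import Series
import pandas._testing as tm

dt_arr = np.array(["2024-01-03", "2024-01-01"], dtype="M8[ns]")

# For M8[s]/M8[ms]/M8[us] this round-trip already recovers the original
# datetimes; before the fix, the M8[ns] case instead stringifies the raw
# nanosecond timestamps, so the assertion below fails for ns only.
result = Series(Series(dt_arr, dtype=str), dtype="M8[ns]")
expected = Series(dt_arr, dtype="M8[ns]")
tm.assert_series_equal(result, expected)  # fails before the fix, passes after
```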
Yea, ultimately I am trying to figure out what a user is expecting to have happen when specifying […]

Ah I see, and I totally agree. I feel like it would be better if the user was at least warned about the incompatible arguments they specified.
        expected = Series(dt_arr, dtype=dtype)
        tm.assert_series_equal(result, expected)

        result = Series(Series(dt_arr, dtype=object), dtype=dtype)
        tm.assert_series_equal(result, expected)

    def test_unparsable_strings_with_dt64_dtype(self):
        # pre-2.0 these would be silently ignored and come back with object dtype
        vals = ["aa"]
why is this necessary? seems weird to do this cast before calling ensure_string_array
does the test added in this PR address the comment on L792?
ensure_string_array treats EAs specially, and in this case both checks on lines 748 and 750 would pass, so the array would get converted to object by two consecutive astype calls: first to string and finally to object, which works as expected.

However, if we do not cast it to an EA, the check on line 748 (in ensure_string_array) fails, which leads to the array being cast directly to object dtype by np.asarray on line 760: pandas/pandas/_libs/lib.pyx, lines 748 to 760 in bb0fcc2.

This is especially problematic for M8[ns] arrays, since numpy converts the dates into nanosecond timestamps (strangely, this doesn't seem to happen with any other M8 dtype array, just the nanosecond one). The resulting Series will then contain these timestamps instead of the usual YYYY-MM-DD HH:MM:SS format, and they cannot be easily converted back to dt64.

I thought casting it to DatetimeArray was a good way of preventing this from happening, but you know better than me, so I'm open to suggestions.
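For reference, the numpy conversion being described can be reproduced without pandas at all (a minimal sketch; the printed values correspond to the example date used):

```python
import numpy as np

arr_ms = np.array(["2024-01-01"], dtype="M8[ms]")
arr_ns = np.array(["2024-01-01"], dtype="M8[ns]")

# s/ms/us values become datetime.datetime objects, which stringify nicely...
print(np.asarray(arr_ms, dtype=object))  # [datetime.datetime(2024, 1, 1, 0, 0)]

# ...but datetime.datetime cannot represent nanoseconds, so for M8[ns]
# numpy falls back to the raw int64 epoch values.
print(np.asarray(arr_ns, dtype=object))  # [1704067200000000000]
```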
I guess it does, would you like me to remove it?
@jbrockmendel Hi, sorry to be a bother, but would it be possible to get some feedback?
Thanks!
cc @WillAyd @mroeschke
Where does the nanosecond behavior diverge from other units for M8 arrays? I see you hinting at that but I'm not sure if it has been identified already.
I do agree it is strange for only nanoseconds to diverge from the other units
Yeah so, currently, whenever we cast an M8 array to string using the Series constructor we get a Series with object dtype containing stringified dates. We can recover the original timestamps from this Series easily.
Here is the described behaviour using an M8[ms] array (there is no particular reason for picking milliseconds; this behaviour is observed with any precision other than nanoseconds, from what I have seen):
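A sketch of the M8[ms] case (the printed output is illustrative of the behaviour described above):

```python
import numpy as np
from pandas import Series

ms_arr = np.array(["2024-01-03", "2024-01-01"], dtype="M8[ms]")

s = Series(ms_arr, dtype=str)
print(s)
# 0    2024-01-03 00:00:00
# 1    2024-01-01 00:00:00
# dtype: object

# the original datetimes can be recovered losslessly
print(Series(s, dtype="M8[ms]"))
# 0   2024-01-03
# 1   2024-01-01
# dtype: datetime64[ms]
```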
However, if (and only if) the array has nanosecond precision, we get a Series with object dtype containing unix timestamps instead of the usual stringified date format. The original timestamps cannot be easily recovered from the resulting Series:
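And the same call with nanosecond precision (pre-fix; the printed output is illustrative of the behaviour described above):

```python
import numpy as np
from pandas import Series

ns_arr = np.array(["2024-01-03", "2024-01-01"], dtype="M8[ns]")

s = Series(ns_arr, dtype=str)
print(s)
# 0    1704240000000000000
# 1    1704067200000000000
# dtype: object
```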
This behaviour seems to stem from numpy's conversions as I mentioned in another comment.
I am not sure whether or not this is intentional but, to me, it seems a bit odd and I feel like the user would not be expecting to see such a divergence.
This PR addresses this quirk and makes it so nanoseconds are handled the same way as the other M8 types.