-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Issues with DatetimeTZ values in where and combine_first (#21469 + #21546) #21660
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
One note: in the old code, combining first a datetime-valued Series with a string-valued Series would return a datetime-valued Series:
There was a unit test to this effect, so I kept the behavior. However, combining a datetime-valued Series with an int-valued Series did not promote:
There was no unit test to this effect, however, and it seems inconsistent with the string-promoting behavior above. A side effect of my current implementation is that now combining first a datetime-valued Series with an int-valued Series now generates a datetime index:
Arguably I think combine_first shouldn't be in the business of type promotion at all in either the string or integer case-- I'd just leave it as object. Thoughts? Should we:
|
Codecov Report
@@ Coverage Diff @@
## master #21660 +/- ##
=========================================
Coverage ? 91.9%
=========================================
Files ? 154
Lines ? 49651
Branches ? 0
=========================================
Hits ? 45633
Misses ? 4018
Partials ? 0
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a whatnew note in 0.24.0
pandas/tests/frame/test_indexing.py
Outdated
@@ -2936,6 +2936,20 @@ def test_where_callable(self): | |||
tm.assert_frame_equal(result, | |||
(df + 2).where((df + 2) > 8, (df + 2) + 10)) | |||
|
|||
def test_where_tz_values(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use the tz_naive_fixture here
@@ -551,6 +551,19 @@ def test_where_datetime_conversion(): | |||
assert_series_equal(rs, expected) | |||
|
|||
|
|||
def test_where_dt_tz_values(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use the tz_naive_fixture here
@@ -170,6 +170,19 @@ def get_result_type(dtype, dtype2): | |||
]).dtype | |||
assert result.kind == expected | |||
|
|||
def test_combine_first_dt_tz_values(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you use the tz_naive_fixture here
@@ -580,12 +580,11 @@ def test_where_series_datetime64(self, fill_val, exp_dtype): | |||
values = pd.Series(pd.date_range(fill_val, periods=4)) | |||
if fill_val.tz: | |||
exp = pd.Series([pd.Timestamp('2011-01-01'), | |||
pd.Timestamp('2012-01-02 05:00'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what changed here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previous behavior was was treating other
for DatetimeTZ as ints (see the writeup for #21546). As a result was getting incorrectly merged with naive datetime Series as naive UTC timestamps. Fixing the behavior of tz-aware other
also fixed this and so now the result of combine_first
is a promotion of the whole Series to object
(as was desired per the deleted xfail/TODO).
pandas/tests/frame/test_indexing.py
Outdated
mask = DataFrame(True, index=df1.index, columns=df2.columns) | ||
mask.iloc[3:] = False | ||
result = df1.where(mask, df2) | ||
dts3 = DatetimeIndex(['20150101', '20150102', '20150103', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally just contruct what you can w/o using temporaries.
ideally call this expected
dts2 = pd.date_range('20150103', '20150107', tz='America/New_York') | ||
df2 = pd.DataFrame({'date': dts2}) | ||
result = df1.date.where(df1.date < df1.date[3], df2.date) | ||
exp_vals = pd.DatetimeIndex(['20150101', '20150102', '20150103', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment as above
dts2 = pd.date_range('20160514', '20160518', tz='America/New_York') | ||
df2 = pd.DataFrame({'date': dts2}, index=range(3, 8)) | ||
result = df1.date.combine_first(df2.date) | ||
exp_vals = pd.DatetimeIndex(['20150101', '20150102', '20150103', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
pandas/core/series.py
Outdated
@@ -2303,10 +2305,12 @@ def combine_first(self, other): | |||
new_index = self.index.union(other.index) | |||
this = self.reindex(new_index, copy=False) | |||
other = other.reindex(new_index, copy=False) | |||
if is_datetimelike(this) and not is_datetimelike(other): | |||
other = to_datetime(other) | |||
|
|||
# TODO: do we need name? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this comment still relevant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment no longer relevant, I'll remove.
What do you think of datetime coercion of other? As I noted my (too long) comment above, using to_datetime coerces ints as well as strings. Old code used the Series constructor which only coerced strings but not ints. Alternatively, we could remove the datetimelike coercion entirely.
cond = getattr(cond, 'values', cond) | ||
|
||
# If the default broadcasting would go in the wrong direction, then | ||
# explicitly reshape other instead | ||
if getattr(other, 'ndim', 0) >= 1: | ||
if values.ndim - 1 == other.ndim and axis == 1: | ||
other = other.reshape(tuple(other.shape + (1, ))) | ||
elif transpose and values.ndim == self.ndim - 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we really need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without lines 1487-1488 cond
and values
have opposite dimension, which is the root cause of (the DataFrame part of) #21544.
The issue is DatetimeTZBlock._try_coerce_args returns values
as a 2-D array. I've updated _try_coerce_args so that other
also is a 2-D array for consistency, which fixes part of the issue.
However, we don't run _try_coerce_args until the interior of func at 1499. Thus lines 1476-1477 (values = values.T
) is a no-op given a DatetimeTZBlock (since at that point values
is 1-D). other._values
is also a 1-D DatetimeIndex. So to make cond
consistent with the values
and other
we get out of _try_coerce_args it needs to be transposed.
An alternative would be changing `DatetimeTZBlock._try_coerce_args to return 1-D arrays for values and values_mask (changing 2875-2877). But if I do that it breaks setitem.
@@ -410,19 +410,6 @@ def _apply_if_callable(maybe_callable, obj, **kwargs): | |||
return maybe_callable | |||
|
|||
|
|||
def _where_compat(mask, arr1, arr2): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice!
thanks @Liam3851 nice patch! keep em coming! |
git diff upstream/master -u -- "*.py" | flake8 --diff
CC: @jreback. Replacement for PR #21544; fixes for #21469 (
Series.combine_first
fails on tz values input) and #21546 (where
fails on tz values input). Series.combine_first now delegates towhere
for the bulk of its operations.