Skip to content

BUG: Issues with DatetimeTZ values in where and combine_first (#21469 + #21546) #21660

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jul 2, 2018

Conversation

Liam3851
Copy link
Contributor

CC: @jreback. Replacement for PR #21544; fixes for #21469 (Series.combine_first fails on tz values input) and #21546 (where fails on tz values input). Series.combine_first now delegates to where for the bulk of its operations.

@Liam3851
Copy link
Contributor Author

Liam3851 commented Jun 27, 2018

One note: in the old code, combining first a datetime-valued Series with a string-valued Series would return a datetime-valued Series:

ser = pd.Series(pd.date_range('20110101', '20110102'))
mask = np.array([True, False])
ser.where(mask).combine_first(pd.Series(['20120101', '20120202']))
# result: 
# 0   2011-01-01
# 1   2012-02-02
# dtype: datetime64[ns]

There was a unit test to this effect, so I kept the behavior.

However, combining a datetime-valued Series with an int-valued Series did not promote:

ser = pd.Series(pd.date_range('20110101', '20110102'))  
mask = np.array([True, False])                          
ser.where(mask).combine_first(pd.Series([1, 2]))        
# result:
# 0    2011-01-01 00:00:00
# 1                      2
# dtype: object

There was no unit test to this effect, however, and it seems inconsistent with the string-promoting behavior above.

A side effect of my current implementation is that now combining first a datetime-valued Series with an int-valued Series now generates a datetime index:

ser = pd.Series(pd.date_range('20110101', '20110102'))  
mask = np.array([True, False])                          
ser.where(mask).combine_first(pd.Series([1, 2]))        
# result: 
# 0   2011-01-01 00:00:00.000000000
# 1   1970-01-01 00:00:00.000000002
# dtype: datetime64[ns]

Arguably I think combine_first shouldn't be in the business of type promotion at all in either the string or integer case-- I'd just leave it as object.

Thoughts? Should we:

  • promote on both strings and ints
  • promote on strings but not on ints
  • leave combining datetimes with either strings or ints as object dtype?

@gfyoung gfyoung added Bug Indexing Related to indexing on series/frames, not to indexes themselves Timedelta Timedelta data type MultiIndex Timezones Timezone data dtype and removed Timedelta Timedelta data type labels Jun 27, 2018
@gfyoung gfyoung requested a review from jreback June 27, 2018 23:07
@codecov
Copy link

codecov bot commented Jun 27, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@823478c). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #21660   +/-   ##
=========================================
  Coverage          ?    91.9%           
=========================================
  Files             ?      154           
  Lines             ?    49651           
  Branches          ?        0           
=========================================
  Hits              ?    45633           
  Misses            ?     4018           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.28% <100%> (?)
#single 41.9% <20%> (?)
Impacted Files Coverage Δ
pandas/core/internals.py 95.59% <100%> (ø)
pandas/core/series.py 94.2% <100%> (ø)
pandas/core/common.py 92.19% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 823478c...c97b951. Read the comment docs.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a whatnew note in 0.24.0

@@ -2936,6 +2936,20 @@ def test_where_callable(self):
tm.assert_frame_equal(result,
(df + 2).where((df + 2) > 8, (df + 2) + 10))

def test_where_tz_values(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use the tz_naive_fixture here

@@ -551,6 +551,19 @@ def test_where_datetime_conversion():
assert_series_equal(rs, expected)


def test_where_dt_tz_values():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use the tz_naive_fixture here

@@ -170,6 +170,19 @@ def get_result_type(dtype, dtype2):
]).dtype
assert result.kind == expected

def test_combine_first_dt_tz_values(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you use the tz_naive_fixture here

@@ -580,12 +580,11 @@ def test_where_series_datetime64(self, fill_val, exp_dtype):
values = pd.Series(pd.date_range(fill_val, periods=4))
if fill_val.tz:
exp = pd.Series([pd.Timestamp('2011-01-01'),
pd.Timestamp('2012-01-02 05:00'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what changed here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous behavior was was treating other for DatetimeTZ as ints (see the writeup for #21546). As a result was getting incorrectly merged with naive datetime Series as naive UTC timestamps. Fixing the behavior of tz-aware other also fixed this and so now the result of combine_first is a promotion of the whole Series to object (as was desired per the deleted xfail/TODO).

mask = DataFrame(True, index=df1.index, columns=df2.columns)
mask.iloc[3:] = False
result = df1.where(mask, df2)
dts3 = DatetimeIndex(['20150101', '20150102', '20150103',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally just contruct what you can w/o using temporaries.

ideally call this expected

dts2 = pd.date_range('20150103', '20150107', tz='America/New_York')
df2 = pd.DataFrame({'date': dts2})
result = df1.date.where(df1.date < df1.date[3], df2.date)
exp_vals = pd.DatetimeIndex(['20150101', '20150102', '20150103',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment as above

dts2 = pd.date_range('20160514', '20160518', tz='America/New_York')
df2 = pd.DataFrame({'date': dts2}, index=range(3, 8))
result = df1.date.combine_first(df2.date)
exp_vals = pd.DatetimeIndex(['20150101', '20150102', '20150103',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

@@ -2303,10 +2305,12 @@ def combine_first(self, other):
new_index = self.index.union(other.index)
this = self.reindex(new_index, copy=False)
other = other.reindex(new_index, copy=False)
if is_datetimelike(this) and not is_datetimelike(other):
other = to_datetime(other)

# TODO: do we need name?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this comment still relevant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment no longer relevant, I'll remove.

What do you think of datetime coercion of other? As I noted my (too long) comment above, using to_datetime coerces ints as well as strings. Old code used the Series constructor which only coerced strings but not ints. Alternatively, we could remove the datetimelike coercion entirely.

cond = getattr(cond, 'values', cond)

# If the default broadcasting would go in the wrong direction, then
# explicitly reshape other instead
if getattr(other, 'ndim', 0) >= 1:
if values.ndim - 1 == other.ndim and axis == 1:
other = other.reshape(tuple(other.shape + (1, )))
elif transpose and values.ndim == self.ndim - 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we really need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without lines 1487-1488 cond and values have opposite dimension, which is the root cause of (the DataFrame part of) #21544.

The issue is DatetimeTZBlock._try_coerce_args returns values as a 2-D array. I've updated _try_coerce_args so that other also is a 2-D array for consistency, which fixes part of the issue.

However, we don't run _try_coerce_args until the interior of func at 1499. Thus lines 1476-1477 (values = values.T) is a no-op given a DatetimeTZBlock (since at that point values is 1-D). other._values is also a 1-D DatetimeIndex. So to make cond consistent with the values and other we get out of _try_coerce_args it needs to be transposed.

An alternative would be changing `DatetimeTZBlock._try_coerce_args to return 1-D arrays for values and values_mask (changing 2875-2877). But if I do that it breaks setitem.

@@ -410,19 +410,6 @@ def _apply_if_callable(maybe_callable, obj, **kwargs):
return maybe_callable


def _where_compat(mask, arr1, arr2):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

@jreback jreback added this to the 0.24.0 milestone Jul 2, 2018
@jreback jreback merged commit 8b2070a into pandas-dev:master Jul 2, 2018
@jreback
Copy link
Contributor

jreback commented Jul 2, 2018

thanks @Liam3851

nice patch! keep em coming!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex Timezones Timezone data dtype
Projects
None yet
3 participants