PERF: Implement DataFrame-with-scalar ops block-wise #28583


Closed · wants to merge 36 commits

Commits
9a617e3
REF: implement logical and comparison array ops
jbrockmendel Sep 12, 2019
3f414b1
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Sep 13, 2019
396b4a8
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Sep 13, 2019
56dff20
implement arithmetic_op
jbrockmendel Sep 13, 2019
77e3241
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Sep 13, 2019
148a8e8
add comments, types
jbrockmendel Sep 13, 2019
fcf9735
typo fixup
jbrockmendel Sep 13, 2019
fec86de
revert types
jbrockmendel Sep 14, 2019
2abdccb
add types
jbrockmendel Sep 17, 2019
121d783
docstrings
jbrockmendel Sep 17, 2019
5faa820
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Sep 17, 2019
267c7ca
ignore type
jbrockmendel Sep 17, 2019
0b5aa34
revert technically-incorrect type
jbrockmendel Sep 17, 2019
ad6da57
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Sep 18, 2019
8ced97b
REF: move na_op out
jbrockmendel Sep 18, 2019
6b9bce0
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Sep 18, 2019
b0d6263
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Sep 19, 2019
524a1fb
Checkpoint, 5 expressions tests failing
jbrockmendel Sep 20, 2019
504a12d
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Sep 20, 2019
e968517
revert
jbrockmendel Sep 20, 2019
709b1db
revert
jbrockmendel Sep 20, 2019
274188a
tests passing
jbrockmendel Sep 20, 2019
8f8f527
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Sep 20, 2019
7561f05
OK
jbrockmendel Sep 20, 2019
837f028
revert
jbrockmendel Sep 20, 2019
a6eada6
revert
jbrockmendel Sep 20, 2019
936be5f
revert
jbrockmendel Sep 20, 2019
5176a59
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Sep 23, 2019
26d1696
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Sep 23, 2019
01e4922
Fix tests by passing eval_kwargs
jbrockmendel Sep 23, 2019
16587e2
update tests
jbrockmendel Sep 23, 2019
b735d71
reenable check
jbrockmendel Sep 23, 2019
4dd8944
lint fixup
jbrockmendel Sep 23, 2019
829e72a
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Sep 27, 2019
e9d6cd9
blackify
jbrockmendel Sep 28, 2019
4c65d37
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Oct 9, 2019
33 changes: 28 additions & 5 deletions pandas/core/arrays/datetimelike.py
@@ -328,6 +328,32 @@ class DatetimeLikeArrayMixin(ExtensionOpsMixin, AttributesMixin, ExtensionArray)
_generate_range
"""

@property
def ndim(self):
return self._data.ndim

@property
def shape(self):
return self._data.shape

def __len__(self):
return len(self._data)

@property
def T(self):
# Note: we drop any freq
return type(self)(self._data.T, dtype=self.dtype)

def reshape(self, *args, **kwargs):
# Note: we drop any freq
data = self._data.reshape(*args, **kwargs)
return type(self)(data, dtype=self.dtype)

def ravel(self, *args, **kwargs):
# Note: we drop any freq
data = self._data.ravel(*args, **kwargs)
return type(self)(data, dtype=self.dtype)

@property
def _box_func(self):
"""
@@ -396,9 +422,6 @@ def size(self) -> int:
"""The number of elements in this array."""
return np.prod(self.shape)

def __len__(self):
return len(self._data)

def __getitem__(self, key):
"""
This getitem defers to the underlying array, which by-definition can
@@ -1032,7 +1055,7 @@ def _add_nat(self):

# GH#19124 pd.NaT is treated like a timedelta for both timedelta
# and datetime dtypes
result = np.zeros(len(self), dtype=np.int64)
result = np.zeros(self.shape, dtype=np.int64)
result.fill(iNaT)
return type(self)(result, dtype=self.dtype, freq=None)

@@ -1046,7 +1069,7 @@ def _sub_nat(self):
# For datetime64 dtypes by convention we treat NaT as a datetime, so
# this subtraction returns a timedelta64 dtype.
# For period dtype, timedelta64 is a close-enough return dtype.
result = np.zeros(len(self), dtype=np.int64)
result = np.zeros(self.shape, dtype=np.int64)
result.fill(iNaT)
return result.view("timedelta64[ns]")
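The `len(self)` → `self.shape` change in `_add_nat`/`_sub_nat` is what makes these helpers 2D-safe; a minimal numpy illustration (not part of the PR):

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# len() sees only the first axis, so np.zeros(len(arr)) builds a 1D
# result of length 2; np.zeros(arr.shape) matches the full 2D shape.
assert np.zeros(len(arr)).shape == (2,)
assert np.zeros(arr.shape).shape == (2, 3)

# For 1D input the two spellings agree, so the change is backward-compatible.
vec = np.arange(4)
assert np.zeros(len(vec)).shape == np.zeros(vec.shape).shape
```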

15 changes: 14 additions & 1 deletion pandas/core/arrays/datetimes.py
@@ -76,6 +76,17 @@
"""


def compat_2d(meth):
Contributor:

Presumably this is going to be applied to other methods (else you'd just do the reshaping in _add_offset)?

Member (author):

I implemented this at a stage when it was needed in several places (for e.g. getting __repr__ to work in debugging), then trimmed the usages back to the bare minimum. So we could get by without it now.

Contributor:

Hmm, OK good to know. Let's hear what others prefer (my slight preference is to inline it in the one place it's used).

def new_meth(self, *args, **kwargs):
if self.ndim > 1:
result = meth(self.ravel(), *args, **kwargs)
return result.reshape(self.shape)
return meth(self, *args, **kwargs)

new_meth.__name__ = meth.__name__
Contributor:

Use functools.wraps instead?

Member (author):

sure

return new_meth
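For reference, a sketch of the decorator with the suggested `functools.wraps` swapped in; this is an illustrative standalone version, not the PR's final code:

```python
import functools

def compat_2d(meth):
    """Allow a 1D-only method to accept 2D input by ravelling,
    applying, and reshaping the result back (any freq is dropped)."""

    @functools.wraps(meth)  # preserves __name__, __doc__, __module__, etc.
    def new_meth(self, *args, **kwargs):
        if self.ndim > 1:
            # operate on the flattened values, then restore the shape
            result = meth(self.ravel(), *args, **kwargs)
            return result.reshape(self.shape)
        return meth(self, *args, **kwargs)

    return new_meth
```

Any object exposing `ndim`, `shape`, and `ravel` (as the DatetimeLikeArrayMixin additions above do) can be wrapped this way.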


def tz_to_dtype(tz):
"""
Return a datetime64[ns] dtype appropriate for the given timezone.
@@ -361,7 +372,7 @@ def __init__(self, values, dtype=_NS_DTYPE, freq=None, copy=False):
"ndarray, or Series or Index containing one of those."
)
raise ValueError(msg.format(type(values).__name__))
if values.ndim != 1:
if values.ndim not in [1, 2]:
Member:

Why was this changed? Seems to conflict with the error message directly below

Member:

Applicable in a few places

Member (author):

Yeah, this is kludge-adjacent. We don't really support 2D, so we don't want to tell users it's OK.

raise ValueError("Only 1-dimensional input arrays are supported.")

if values.dtype == "i8":
@@ -818,13 +829,15 @@ def _sub_datetime_arraylike(self, other):
new_values[arr_mask] = iNaT
return new_values.view("timedelta64[ns]")

@compat_2d
def _add_offset(self, offset):
assert not isinstance(offset, Tick)
try:
if self.tz is not None:
values = self.tz_localize(None)
else:
values = self

result = offset.apply_index(values)
if self.tz is not None:
result = result.tz_localize(self.tz)
4 changes: 2 additions & 2 deletions pandas/core/arrays/timedeltas.py
@@ -222,7 +222,7 @@ def __init__(self, values, dtype=_TD_DTYPE, freq=None, copy=False):
"ndarray, or Series or Index containing one of those."
)
raise ValueError(msg.format(type(values).__name__))
if values.ndim != 1:
if values.ndim not in [1, 2]:
raise ValueError("Only 1-dimensional input arrays are supported.")

if values.dtype == "i8":
@@ -1076,7 +1076,7 @@ def sequence_to_td64ns(data, copy=False, unit="ns", errors="raise"):
)

data = np.array(data, copy=copy)
if data.ndim != 1:
if data.ndim not in [1, 2]:
raise ValueError("Only 1-dimensional input arrays are supported.")

assert data.dtype == "m8[ns]", data
71 changes: 64 additions & 7 deletions pandas/core/ops/__init__.py
@@ -12,7 +12,11 @@
from pandas._libs import Timedelta, Timestamp, lib
from pandas.util._decorators import Appender

from pandas.core.dtypes.common import is_list_like, is_timedelta64_dtype
from pandas.core.dtypes.common import (
is_extension_array_dtype,
is_list_like,
is_timedelta64_dtype,
)
from pandas.core.dtypes.generic import (
ABCDataFrame,
ABCExtensionArray,
@@ -24,6 +28,7 @@
from pandas.core.construction import extract_array
from pandas.core.ops.array_ops import (
arithmetic_op,
array_op,
comparison_op,
define_na_arithmetic_op,
logical_op,
@@ -350,7 +355,7 @@ def fill_binop(left, right, fill_value):
# Dispatch logic


def dispatch_to_series(left, right, func, str_rep=None, axis=None):
def dispatch_to_series(left, right, func, str_rep=None, axis=None, eval_kwargs=None):
"""
Evaluate the frame operation func(left, right) by evaluating
column-by-column, dispatching to the Series implementation.
@@ -369,11 +374,57 @@ def dispatch_to_series(left, right, func, str_rep=None, axis=None):
"""
# Note: we use iloc to access columns for compat with cases
# with non-unique columns.
eval_kwargs = eval_kwargs or {}

import pandas.core.computation.expressions as expressions

right = lib.item_from_zerodim(right)

if lib.is_scalar(right) or np.ndim(right) == 0:

new_blocks = []
Contributor:

I would rather actually do this in the block manager, no? (Maybe put it in internals/ops.py.) We should move all block-type ops there (e.g. from groupby as well). Or isolate this code in ops somewhere (maybe internals.py), so things are separated.

Member (author):

Generally agree on getting Block/BlockManager-touching things isolated in internals. Will give some thought to how/when to move this. ATM this is in Proof of Concept phase while I figure out how to handle the remaining cases.

mgr = left._data
for blk in mgr.blocks:
# Reshape for EA Block
blk_vals = blk.values
Member:

Should this be using to_numpy instead of .values? Not super well versed on what types actually get preserved this way

Member (author):

we specifically want to get the .values attribute, which can be either an ndarray or an EA (also I don't think Block has to_numpy)

if hasattr(blk_vals, "reshape"):
# ndarray, DTA/TDA/PA
blk_vals = blk_vals.reshape(blk.shape)
blk_vals = blk_vals.T

new_vals = array_op(blk_vals, right, func, str_rep, eval_kwargs)

# Reshape for EA Block
if is_extension_array_dtype(new_vals.dtype):
Contributor:

For my own understanding: the if and the elif would be unnecessary if all EAs were allowed to be 2D?

Member (author):

Yep. Everything from 489-524 would boil down to something like:

new_vals = array_op(blk.values.T, right, func, str_rep, eval_kwargs)
nb = blk.make_block(new_vals.T)
new_blocks.append(nb)

from pandas.core.internals.blocks import make_block

if hasattr(new_vals, "reshape"):
# ndarray, DTA/TDA/PA
new_vals = new_vals.reshape(blk.shape[::-1])
assert new_vals.shape[-1] == len(blk.mgr_locs)
for i in range(new_vals.shape[-1]):
nb = make_block(new_vals[..., i], placement=[blk.mgr_locs[i]])
new_blocks.append(nb)
else:
# Categorical, IntegerArray
assert len(blk.mgr_locs) == 1
assert new_vals.shape == (blk.shape[-1],)
nb = make_block(new_vals, placement=blk.mgr_locs, ndim=2)
new_blocks.append(nb)
elif blk.values.ndim == 1:
Contributor:

I'm curious: what hits this case? An op on an EA that returns an ndarray? Say DataFrame[Categorical] == 0?

Member (author):

Exactly. Block[Categorical].values == 0 needs to become a 2D ndarray[bool] block before we're done.

# need to bump up to 2D
new_vals = new_vals.reshape(-1, 1)
assert new_vals.T.shape == blk.shape
nb = blk.make_block(new_vals.T)
new_blocks.append(nb)
else:
assert new_vals.T.shape == blk.shape
nb = blk.make_block(new_vals.T)
new_blocks.append(nb)

bm = type(mgr)(new_blocks, mgr.axes)
return type(left)(bm)

def column_op(a, b):
return {i: func(a.iloc[:, i], b) for i in range(len(a.columns))}
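A stripped-down numpy sketch of the two dispatch strategies (the names here are illustrative, not pandas API): the block-wise path hits each homogeneous 2D block with one vectorized call, while the `column_op` path above makes one call per column.

```python
import numpy as np

def columnwise(values_2d, right, func):
    # old path: one func call per column, then re-stack
    return np.column_stack([func(values_2d[:, i], right)
                            for i in range(values_2d.shape[1])])

def blockwise(blocks, right, func):
    # new path for scalar `right`: one vectorized call per 2D block
    # (pandas stores block values transposed, shape (n_columns, n_rows))
    return [func(vals, right) for vals in blocks]

frame_vals = np.arange(12.0).reshape(4, 3)
col_result = columnwise(frame_vals, 10.0, np.add)
blk_result = blockwise([frame_vals.T], 10.0, np.add)[0].T
assert np.array_equal(col_result, blk_result)
```

The results agree; the win is fewer Python-level calls and per-call overhead when a DataFrame has many columns sharing one block.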

@@ -526,7 +577,7 @@ def wrapper(self, other):
lvalues = extract_array(self, extract_numpy=True)
rvalues = extract_array(other, extract_numpy=True)

res_values = comparison_op(lvalues, rvalues, op)
res_values = comparison_op(lvalues, rvalues, op, None, {})
Contributor @topper-123, Oct 2, 2019:

For readability, I'd prefer the last two arguments to be keyword arguments (it's relatively self-evident what the first three arguments are for, but that's IMO not the case for the last two...). Same for calls to comparison_op below

Member (author):

Will do.


return _construct_result(self, res_values, index=self.index, name=res_name)

@@ -552,7 +603,7 @@ def wrapper(self, other):
lvalues = extract_array(self, extract_numpy=True)
rvalues = extract_array(other, extract_numpy=True)

res_values = logical_op(lvalues, rvalues, op)
res_values = logical_op(lvalues, rvalues, op, None, {})
return _construct_result(self, res_values, index=self.index, name=res_name)

wrapper.__name__ = op_name
@@ -723,7 +774,9 @@ def f(self, other, axis=default_axis, level=None, fill_value=None):
if fill_value is not None:
self = self.fillna(fill_value)

new_data = dispatch_to_series(self, other, op)
new_data = dispatch_to_series(
self, other, op, str_rep=str_rep, eval_kwargs=eval_kwargs
)
return self._construct_result(new_data)

f.__name__ = op_name
@@ -749,7 +802,9 @@ def f(self, other, axis=default_axis, level=None):
# Another DataFrame
if not self._indexed_same(other):
self, other = self.align(other, "outer", level=level, copy=False)
new_data = dispatch_to_series(self, other, op, str_rep)
new_data = dispatch_to_series(
self, other, op, str_rep=str_rep, eval_kwargs={}
)
return self._construct_result(new_data)

elif isinstance(other, ABCSeries):
@@ -781,7 +836,9 @@ def f(self, other):
raise ValueError(
"Can only compare identically-labeled DataFrame objects"
)
new_data = dispatch_to_series(self, other, op, str_rep)
new_data = dispatch_to_series(
self, other, op, str_rep=str_rep, eval_kwargs={}
)
return self._construct_result(new_data)

elif isinstance(other, ABCSeries):
23 changes: 17 additions & 6 deletions pandas/core/ops/array_ops.py
@@ -56,7 +56,7 @@ def comp_method_OBJECT_ARRAY(op, x, y):

result = libops.vec_compare(x, y, op)
else:
result = libops.scalar_compare(x, y, op)
result = libops.scalar_compare(x.ravel(), y, op).reshape(x.shape)
return result
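`scalar_compare` is a 1D-only kernel, so the 2D case above is routed through the same ravel/apply/reshape trick; schematically, with a plain Python comparison standing in for `libops.scalar_compare`:

```python
import numpy as np

x = np.array([["a", "b"], ["c", "d"]], dtype=object)

def scalar_compare_1d(values, scalar):
    # stand-in for the 1D-only libops.scalar_compare kernel
    assert values.ndim == 1
    return np.array([v == scalar for v in values], dtype=bool)

# flatten, run the 1D kernel, then restore the original shape
result = scalar_compare_1d(x.ravel(), "c").reshape(x.shape)
assert result.tolist() == [[False, False], [True, False]]
```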


@@ -156,12 +156,22 @@ def na_arithmetic_op(left, right, op, str_rep: str, eval_kwargs):
return missing.dispatch_fill_zeros(op, left, right, result)


def array_op(left, right, op, str_rep, eval_kwargs):
op_name = op.__name__.strip("_")
if op_name in {"eq", "ne", "lt", "le", "gt", "ge"}:
return comparison_op(left, right, op, str_rep, eval_kwargs)
elif op_name in {"and", "or", "xor", "rand", "ror", "rxor"}:
return logical_op(left, right, op, str_rep, eval_kwargs)
else:
return arithmetic_op(left, right, op, str_rep, eval_kwargs)
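`array_op` keys its dispatch off `op.__name__`; the `strip("_")` is needed because the `operator` module's logical functions carry underscores to dodge Python keywords (and pandas' reversed variants like `rand_` follow suit). A quick check:

```python
import operator

# operator.and_/or_ are named with a trailing underscore because `and`
# and `or` are keywords; strip("_") normalizes them for name-based dispatch.
for op, expected in [(operator.eq, "eq"), (operator.and_, "and"),
                     (operator.or_, "or"), (operator.xor, "xor"),
                     (operator.add, "add")]:
    assert op.__name__.strip("_") == expected
```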


def arithmetic_op(
left: Union[np.ndarray, ABCExtensionArray],
right: Any,
op,
str_rep: str,
eval_kwargs: Dict[str, str],
eval_kwargs: Dict[str, bool],
):
"""
Evaluate an arithmetic operation `+`, `-`, `*`, `/`, `//`, `%`, `**`, ...
@@ -218,7 +228,7 @@


def comparison_op(
left: Union[np.ndarray, ABCExtensionArray], right: Any, op
left: Union[np.ndarray, ABCExtensionArray], right: Any, op, str_rep, eval_kwargs
Contributor:

str_rep and eval_kwargs don't seem to be used here, is that right? Could they be removed, or have default arguments of None?

Contributor:

Also, if they need to stay, could you add type hints in order to ease understanding? I assume they have types Optional[str] and dict, respectively?

Member (author):

Will add types, double-check whether these can be removed.

) -> Union[np.ndarray, ABCExtensionArray]:
"""
Evaluate a comparison operation `=`, `!=`, `>=`, `>`, `<=`, or `<`.
@@ -256,10 +266,11 @@

elif is_scalar(rvalues) and isna(rvalues):
# numpy does not like comparisons vs None
# TODO: Should we be using invalid_comparison here?
if op is operator.ne:
res_values = np.ones(len(lvalues), dtype=bool)
res_values = np.ones(lvalues.shape, dtype=bool)
else:
res_values = np.zeros(len(lvalues), dtype=bool)
res_values = np.zeros(lvalues.shape, dtype=bool)
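The full-shape ones/zeros above encode the usual missing-value comparison semantics, consistent with numpy's elementwise NaN behavior; a quick sanity check:

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0])

# Comparisons against a missing scalar are all-False, except `!=`,
# which is all-True -- matching elementwise NaN comparisons in numpy.
assert not (arr == np.nan).any()
assert (arr != np.nan).all()
assert not (arr < np.nan).any()

# Hence the branches: np.ones(shape, bool) for `ne`, np.zeros otherwise.
assert np.ones(arr.shape, dtype=bool).all()
```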

elif is_object_dtype(lvalues.dtype):
res_values = comp_method_OBJECT_ARRAY(op, lvalues, rvalues)
@@ -323,7 +334,7 @@ def na_logical_op(x: np.ndarray, y, op):


def logical_op(
left: Union[np.ndarray, ABCExtensionArray], right: Any, op
left: Union[np.ndarray, ABCExtensionArray], right: Any, op, str_rep, eval_kwargs
) -> Union[np.ndarray, ABCExtensionArray]:
"""
Evaluate a logical operation `|`, `&`, or `^`.
8 changes: 4 additions & 4 deletions pandas/tests/arithmetic/test_period.py
@@ -755,18 +755,18 @@ def test_pi_sub_isub_offset(self):
rng -= pd.offsets.MonthEnd(5)
tm.assert_index_equal(rng, expected)

def test_pi_add_offset_n_gt1(self, box_transpose_fail):
def test_pi_add_offset_n_gt1(self, box_with_array):
# GH#23215
# add offset to PeriodIndex with freq.n > 1
box, transpose = box_transpose_fail
box = box_with_array

per = pd.Period("2016-01", freq="2M")
pi = pd.PeriodIndex([per])

expected = pd.PeriodIndex(["2016-03"], freq="2M")

pi = tm.box_expected(pi, box, transpose=transpose)
expected = tm.box_expected(expected, box, transpose=transpose)
pi = tm.box_expected(pi, box)
expected = tm.box_expected(expected, box)

result = pi + per.freq
tm.assert_equal(result, expected)
5 changes: 2 additions & 3 deletions pandas/tests/arrays/test_datetimes.py
@@ -23,9 +23,8 @@ def test_from_sequence_invalid_type(self):
def test_only_1dim_accepted(self):
arr = np.array([0, 1, 2, 3], dtype="M8[h]").astype("M8[ns]")

with pytest.raises(ValueError, match="Only 1-dimensional"):
# 2-dim
DatetimeArray(arr.reshape(2, 2))
# 2-dim allowed for ops compat
DatetimeArray(arr.reshape(2, 2))
Contributor:

Do we want to make any assertion about the ._data of this result?

Member (author):

sure, will update


with pytest.raises(ValueError, match="Only 1-dimensional"):
# 0-dim
5 changes: 2 additions & 3 deletions pandas/tests/arrays/test_timedeltas.py
@@ -11,9 +11,8 @@ def test_only_1dim_accepted(self):
# GH#25282
arr = np.array([0, 1, 2, 3], dtype="m8[h]").astype("m8[ns]")

with pytest.raises(ValueError, match="Only 1-dimensional"):
# 2-dim
TimedeltaArray(arr.reshape(2, 2))
# 2-dim allowed for ops compat
TimedeltaArray(arr.reshape(2, 2))

with pytest.raises(ValueError, match="Only 1-dimensional"):
# 0-dim
3 changes: 2 additions & 1 deletion pandas/tests/frame/test_query_eval.py
@@ -122,7 +122,8 @@ def test_ops(self):
result = getattr(df, rop)(m)
assert_frame_equal(result, expected)

# GH7192
# GH7192: Note we need a large number of rows to ensure this
# goes through the numexpr path
df = DataFrame(dict(A=np.random.randn(25000)))
df.iloc[0:5] = np.nan
expected = 1 - np.isnan(df.iloc[0:25])