Skip to content

API/REGR: (re-)allow neg/pos unary operation on object dtype #21590

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Jun 29, 2018

Conversation

jorisvandenbossche
Copy link
Member

closes #21380

This is an easy fix to simply be more forgiveable and just try the operation:

  • For object dtype, this is IMO the better way: simply let the scalar values decide if it can do those operations or not.
    The scalar values (eg strings) already raise an informative error message. Eg for a series with string you get TypeError: bad operand type for unary -: 'str' for unary negative operation.
  • But with two caveats to discuss:
    • For datetime64 data, the error message is not as user friendly as our custom one: "TypeError: ufunc 'negative' did not contain a loop with signature matching types dtype('<M8[ns]') dtype('<M8[ns]')"
    • For the unary positive operation, numpy has apparently different rules: it does allow datetime64 data and returns itself (the same for object string data). Because of this, the current tests fail (it expects TypeError for +s_string)
Details with examples
In [14]: s_string = pd.Series(['a', 'b'])

In [15]: -s_string
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-15-91308cf04ad2> in <module>()
----> 1 -s_string

~/scipy/pandas/pandas/core/generic.py in __neg__(self)
   1119             arr = operator.inv(values)
   1120         else:
-> 1121             arr = operator.neg(values)
   1122         return self.__array_wrap__(arr)
   1123 

TypeError: bad operand type for unary -: 'str'

In [16]: +s_string
Out[16]: 
0    a
1    b
dtype: object

In [17]: s_datetime = pd.Series(pd.date_range('2012', periods=3))

In [18]: -s_datetime
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-b460860ba74c> in <module>()
----> 1 -s_datetime

~/scipy/pandas/pandas/core/generic.py in __neg__(self)
   1119             arr = operator.inv(values)
   1120         else:
-> 1121             arr = operator.neg(values)
   1122         return self.__array_wrap__(arr)
   1123 

TypeError: ufunc 'negative' did not contain a loop with signature matching types dtype('<M8[ns]') dtype('<M8[ns]')

In [19]: +s_datetime
Out[19]: 
0   2012-01-01
1   2012-01-02
2   2012-01-03
dtype: datetime64[ns]

In [20]: 

In [20]: from decimal import Decimal
    ...: s_decimal = pd.Series([Decimal(1)])

In [21]: -s_decimal
Out[21]: 
0    -1
dtype: object

In [22]: +s_decimal
Out[22]: 
0    1
dtype: object


@jorisvandenbossche jorisvandenbossche added Dtype Conversions Unexpected or buggy dtype conversions Regression Functionality that used to work in a prior pandas version labels Jun 22, 2018
@jorisvandenbossche jorisvandenbossche added this to the 0.23.2 milestone Jun 22, 2018
@jorisvandenbossche
Copy link
Member Author

cc @deniederhut

For the unary positive, if we care I can check for datetime64 and raise a TypeError for that (although we can also follow numpy's example). For the object dtype (eg strings), that's not really possible since it is exactly the idea to allow it for those types in an object column that allow it.

@jreback
Copy link
Contributor

jreback commented Jun 22, 2018

code change looks good (though you simply may need to add is_object_type to the check as we are strict on it).

@jbrockmendel
Copy link
Member

+1. Thoughts on applying the same pass-through approach to Index.__sub__?

@jorisvandenbossche
Copy link
Member Author

though you simply may need to add is_object_type to the check as we are strict on it

Do you mean to only specifically call operator.neg(values) for object dtype (as we do for numeric), so it still fails for datetime64?

@jreback
Copy link
Contributor

jreback commented Jun 25, 2018

Do you mean to only specifically call operator.neg(values) for object dtype (as we do for numeric), so it still fails for datetime64?

I meant more just to extend the if from the original code, like

        elif (is_numeric_dtype(values) or is_timedelta64_dtype(values) or is_object_type(values)):
            arr = operator.neg(values)
       else:
            raise TypeError

@jreback
Copy link
Contributor

jreback commented Jun 27, 2018

@jorisvandenbossche I pushed a fix to make this work

@jorisvandenbossche
Copy link
Member Author

yeah, sorry for the delay, getting to it today

I am fine with the is_object_dtype check (to explicitly not allow the datetime64), but personally I don't think we should infer anything if it is object. The error message is already fine in case of __neg__, and I don't really care about +s_string working (it is consistent with numpy). So I would rather do what you posted above in #21590 (comment)

@jreback
Copy link
Contributor

jreback commented Jun 27, 2018

I am fine with the is_object_dtype check (to explicitly not allow the datetime64), but personally I don't think we should infer anything if it is object. The error message is already fine in case of neg, and I don't really care about +s_string working (it is consistent with numpy). So I would rather do what you posted above in #21590 (comment)

I tried that and it failed your tests :> So maybe they are too strict. I didn't dive too deep here.

@jreback
Copy link
Contributor

jreback commented Jun 27, 2018

in particular this (on master)

In [9]: df = pd.DataFrame({'A': ['a', 'b']})

In [10]: + df
Out[10]: 
   A
0  a
1  b

In [11]: - df
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-42e20adaf419> in <module>()
----> 1 - df

~/pandas/pandas/core/generic.py in __neg__(self)
   1125               not is_string_like_dtype(np.array(values))):#lib.infer_dtype(values) not in ['string', 'bytes', 'unicode']):
   1126             # explicity allow object dtypes that are not strings, gh-21380
-> 1127             arr = operator.neg(values)
   1128         else:
   1129             raise TypeError("Unary negative expects numeric dtype, not {}"

TypeError: bad operand type for unary -: 'str'

@jorisvandenbossche
Copy link
Member Author

Yes, not inferring strings means that we need to update the test (but only for __pos__), because +s_strings would now work while it fails on master.
I didn't do that initially because I first wanted to get an idea if you were OK with allowing something like +s_strings.

@codecov
Copy link

codecov bot commented Jun 27, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@c45bb0b). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master   #21590   +/-   ##
=========================================
  Coverage          ?    91.9%           
=========================================
  Files             ?      154           
  Lines             ?    49555           
  Branches          ?        0           
=========================================
  Hits              ?    45542           
  Misses            ?     4013           
  Partials          ?        0
Flag Coverage Δ
#multiple 90.27% <100%> (?)
#single 42.03% <0%> (?)
Impacted Files Coverage Δ
pandas/core/generic.py 96.21% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c45bb0b...453b9bd. Read the comment docs.

@jorisvandenbossche
Copy link
Member Author

Thoughts on applying the same pass-through approach to Index.sub?

Yes, I think that is a good idea, but let's leave that for a separate PR / 0.24.0

@jreback
Copy link
Contributor

jreback commented Jun 28, 2018

seems numpy allows this which I find odd.

In [1]: arr = np.array(['foo'])

In [2]: +arr
Out[2]: array(['foo'], dtype='<U3')

In [3]: arr = np.array(['foo'], dtype=object)

In [4]: +arr                                 
Out[4]: array(['foo'], dtype=object)

ok with matching numpy (for pos only).

@jorisvandenbossche
Copy link
Member Author

I think the policy of numpy regarding + is simply to return the original object regardless whether it would otherwise support the notion of negative/positive or addition.

ok with matching numpy (for pos only).

Yes (for object dtype now, for datetime we still raise), and for neg our behaviour was already consistent with numpy before.

@jreback jreback merged commit 8cb6be0 into pandas-dev:master Jun 29, 2018
@jreback
Copy link
Contributor

jreback commented Jun 29, 2018

thank you sir!

@jorisvandenbossche jorisvandenbossche deleted the unary-op-object branch June 29, 2018 08:36
jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Jul 2, 2018
jorisvandenbossche added a commit that referenced this pull request Jul 5, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Dtype Conversions Unexpected or buggy dtype conversions Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pandas 0.23 broke unary negative expression on Decimal data type
3 participants