Fill in column a bit #141
Conversation
```
Notes
-----
Includes anything with NaN-like semantics, e.g. np.datetime64("NaT").
```
What is the plan for Pandas with regards to `NaT` and moving to Arrow types and Arrow compute? Arrow doesn't have a concept of `NaT`, and there isn't an actual binary value for `NaT`; numpy uses `min(int64)` as a sentinel value for it:

```python
In [6]: np.datetime64(np.iinfo(np.int64).min, "ns")
Out[6]: numpy.datetime64('NaT')

In [7]: np.datetime64(np.iinfo(np.int64).min + 1, "ns")
Out[7]: numpy.datetime64('1677-09-21T00:12:43.145224193')
```
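For contrast, a minimal sketch (assuming pyarrow) of how Arrow models a missing timestamp: it is recorded as a null in the validity bitmap, with no sentinel in the values buffer.

```python
import datetime as dt

import pyarrow as pa

# Arrow has no NaT: a missing timestamp is simply a null entry,
# tracked in the validity bitmap rather than in the values buffer.
arr = pa.array([None, dt.datetime(2012, 1, 1)], type=pa.timestamp("ns"))
print(arr.null_count)  # 1
print(arr.is_null())   # [true, false]
```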
That's fine; this is just for anything NaN-like, and there's no requirement that `NaT` exists. It's the same as `DataFrame.isnan`.
I guess my point is that `NaN` has a binary representation and is a universally agreed-upon standard and concept. `NaT` is not, and different libraries will have different support, different behaviors, different representations, etc.

Are we saying that supporting the case of having both `NaT` and `null` values in datetime columns is within the scope of what we're building here? I.e., we explicitly didn't include `NaT` in the interchange protocol as a null option, since in reality it's a sentinel value of `min(int64)`: https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py#L64-L86
I agree with excluding `NaT`, since it's numpy-specific. I don't think that is different here from the interchange protocol. `Column.isnan` should handle only actual floating-point `nan` values, I'd say.
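For what it's worth, numpy itself already keeps the two checks separate (a small illustration, not spec text):

```python
import numpy as np

# Floating-point NaN and datetime NaT have distinct predicates in numpy:
np.isnan(np.array([1.0, np.nan]))                                  # [False,  True]
np.isnat(np.array(["NaT", "2012-01-01"], dtype="datetime64[ns]"))  # [ True, False]
```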
We'll need to update it here too, then:

dataframe-api/spec/API_specification/dataframe_api/dataframe_object.py, lines 583 to 601 in 86634ea:

```python
def isnan(self) -> DataFrame:
    """
    Check for nan-like entries.

    Returns
    -------
    DataFrame

    See also
    --------
    isnull

    Notes
    -----
    Includes anything with NaN-like semantics, e.g. np.datetime64("NaT").
    Does *not* include 'missing' or 'null' entries.
    """
    ...
```
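For concreteness, one possible rewording of that Notes section along those lines (a sketch, not the merged text):

```
Notes
-----
Checks only for actual floating-point NaN values.
Does *not* include 'missing' or 'null' entries, nor
dtype-specific sentinels such as np.datetime64("NaT").
```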
I'm trying to think about whether that is relevant to including/excluding `NaT` in `isnan`.

I agree it's not relevant for `isnan`, and that it sounds best to keep `isnan` strictly for floating dtypes.

But I would say it does matter for considering it as null or not. If we specify how a null should behave in comparisons, and if that's to propagate as a missing value, then `NaT` (if you rely purely on the numpy behaviour in your DataFrame library) does not behave as a null. So should `isnull()` then return True for `NaT`?
I'd think `isnull([np.NaT])` should indeed return True. numpy doesn't know about missing values at all, so it can't be right or wrong here.
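pandas' existing null check already behaves that way, for reference:

```python
import numpy as np
import pandas as pd

# pd.isna treats NaT, in either flavour, the same as other missing values:
pd.isna(pd.NaT)                # True
pd.isna(np.datetime64("NaT"))  # True
pd.isna(pd.NA)                 # True
```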
So to be explicit: you are then OK with `Column.from_sequence([np.NaT], dtype="datetime") > val` and `Column.from_sequence([null], dtype="some_numeric_dtype") > val` returning different results? (And essentially saying that the behaviour of nulls is implementation-dependent?)
> and essentially say that the behaviour of nulls is implementation-dependent?

I think we're saying that in a few places already, in particular related to ordering/sorting, which is what your `> val` example seems to get at? Isn't the end result in both cases expected to be the same though? Both cases compare "missing" to a value, so I'd expect a similar outcome.
> in particular related to ordering/sorting

No, we said that this is implementation-defined for NaNs, not for nulls (for nulls, the keyword controls the behaviour):

dataframe-api/spec/API_specification/dataframe_api/dataframe_object.py, lines 268 to 272 in 7059081:

```python
nulls_position : ``{'first', 'last'}``
    Whether null values should be placed at the beginning
    or at the end of the result.
    Note that the position of NaNs is unspecified and may
    vary based on the implementation.
```
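For comparison, pandas exposes the analogous knob as `na_position` (pandas API, shown purely as an illustration):

```python
import pandas as pd

s = pd.Series([pd.NA, 2, 1], dtype="Int64")
# Nulls go wherever the keyword says; the sort itself stays deterministic.
s.sort_values(na_position="last")   # 1, 2, <NA>
s.sort_values(na_position="first")  # <NA>, 1, 2
```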
> which is what your `> val` example seems to get at?

For NaNs this is related to sorting, but not for nulls, I think. At least if we specify that `null > val` always returns `null`. So the question is whether we want to specify this, or whether we want to say that the behaviour of null is implementation-defined.
> Isn't the end result in both cases expected to be the same though? Both cases compare "missing" to a value, so I'd expect a similar outcome.

Yes, I would also expect the same result if both are considered as a null. But that was my original question: `NaT` does not give the same result:

```python
In [19]: pd.NaT > pd.Timestamp("2012-01-01")
Out[19]: False

In [20]: pd.NA > pd.Timestamp("2012-01-01")
Out[20]: <NA>
```

So either it should not be considered as null, or implementations should hide that behaviour of `NaT`? (Or we say this is not covered by the spec and is implementation-defined.)
To make it a bit more concrete:

```python
s1 = pd.Series([pd.NaT, "2012-01-01"], dtype="datetime64[ns]")
s2 = pd.Series([pd.NA, 2], dtype="Int64")

In [14]: s1
Out[14]:
0          NaT
1   2012-01-01
dtype: datetime64[ns]

In [15]: s2
Out[15]:
0    <NA>
1       2
dtype: Int64

In [17]: s1 >= s1[1]
Out[17]:
0    False
1     True
dtype: bool

In [18]: s2 >= s2[1]
Out[18]:
0    <NA>
1    True
dtype: boolean
```

That's with current pandas, where we have a mix of dtypes that are "nullable" (in the sense of following the behaviour of `pd.NA`) and dtypes that are not (that have the behaviour of NaN / NaT).

So the question for a spec-compliant DataFrame implementation: which behaviour do we want/require? Are both fine? (And so we leave this to the implementation.) Or should a spec-compliant implementation hide this different behaviour of NaT compared to NA? (E.g. an implementation could still use NaT as a missing-value sentinel under the hood, but in certain operations have custom handling for it to ensure it behaves spec-compliantly.)
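A minimal sketch of what that last option could look like in pandas terms (the helper name and the use of the nullable boolean dtype are my assumptions, not spec text):

```python
import pandas as pd

def spec_compliant_ge(col: pd.Series, val) -> pd.Series:
    """Hypothetical helper: '>=' that propagates NaT positions as nulls."""
    mask = col.isna()                        # True where the NaT sentinel sits
    result = (col >= val).astype("boolean")  # move to the nullable boolean dtype
    result[mask] = pd.NA                     # overwrite sentinel positions with NA
    return result

s1 = pd.Series([pd.NaT, "2012-01-01"], dtype="datetime64[ns]")
print(spec_compliant_ge(s1, s1[1]))
# 0    <NA>
# 1    True
# dtype: boolean
```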
This looks about ready to merge, thanks Marco! Just a comment on the dtype-related behavior of the reductions.
I did also consider whether reductions should return a size-1 Column rather than a scalar. That may still make sense, but I suspect that returning a scalar value is more practical.
```
Notes
-----
Includes anything with NaN-like semantics, e.g. np.datetime64("NaT").
```
Sounds good. Could you update the description here?
Sum allows timedelta
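For reference, current pandas behaviour illustrating why sum is meaningful for timedelta but not for datetime (an illustration, not spec text):

```python
import pandas as pd

td = pd.Series([pd.Timedelta("1d"), pd.Timedelta("2d")])
td.sum()  # Timedelta('3 days 00:00:00'): summing durations is meaningful

ts = pd.Series(pd.to_datetime(["2012-01-01", "2012-01-02"]))
ts.min()  # Timestamp('2012-01-01 00:00:00'): min/max keep the column's dtype family
# ts.sum() raises TypeError in current pandas: adding datetimes has no meaning
```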
…On Wed, Apr 26, 2023 at 4:00 PM Ralf Gommers wrote:

In spec/API_specification/dataframe_api/column_object.py (#141 (comment)):
```diff
+        ------
+        ValueError
+            If column is not boolean.
+        """
+
+    def all(self, skip_nulls: bool = True) -> bool:
+        """
+        Reduction returns a bool.
+
+        Raises
+        ------
+        ValueError
+            If column is not boolean.
+        """
+
+    def min(self, skip_nulls: bool = True) -> float:
```
All these reductions should return the same data type as the column already has, I think, rather than a float. And the description should say that these reductions are only defined for numerical and datetime data types, except for prod and sum, where I don't think datetime support makes sense.
In spec/API_specification/dataframe_api/column_object.py (#141 (comment)):
```diff
+
+    def isnan(self) -> Column:
+        """
+        Check for nan-like entries.
+
+        Returns
+        -------
+        Column
+
+        See also
+        --------
+        isnull
+
+        Notes
+        -----
+        Includes anything with NaN-like semantics, e.g. np.datetime64("NaT").
```
Sounds good. Could you update the description here?
Discussed with @MarcoGorelli; we're both happy with the way things look now and with the extra details on which reductions return which dtypes. So I'll merge this. This is a large PR, so if there's a detail we missed somewhere, let's deal with that in a follow-up PR.