
Fill in column a bit #141


Merged: 12 commits into data-apis:main, Apr 27, 2023

Conversation

MarcoGorelli (Contributor)

No description provided.


Notes
-----
Includes anything with NaN-like semantics, e.g. np.datetime64("NaT").
Collaborator

What is the plan for pandas with regard to NaT as it moves to Arrow types and Arrow compute? Arrow doesn't have a concept of NaT, and there isn't an actual binary value for NaT; numpy uses min(int64) as a sentinel value for it:

In [6]: np.datetime64(np.iinfo(np.int64).min, "ns")
Out[6]: numpy.datetime64('NaT')

In [7]: np.datetime64(np.iinfo(np.int64).min + 1, "ns")
Out[7]: numpy.datetime64('1677-09-21T00:12:43.145224193')

MarcoGorelli (Contributor Author)

That's fine; this is just for anything NaN-like, and there's no requirement that NaT exists.

It's the same as DataFrame.isnan.

Collaborator

I guess my point is that NaN has a binary representation and is a universally agreed-upon standard and concept. NaT is not, and different libraries will have different support, different behaviors, different representations, etc.

Are we saying that supporting the case of having both NaT and null values in datetime columns is within the scope of what we're building here?

I.e., we explicitly didn't include NaT in the interchange protocol as a null option, since in reality it's a sentinel value of min(int64): https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py#L64-L86
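
For reference (not from the thread), a minimal illustration of the distinction, assuming CPython on an IEEE 754 platform: NaN has a defined bit-level representation that any library can recognise, whereas NaT is only a convention built on the min(int64) sentinel shown above.

import math
import struct

# A quiet NaN packs to a well-defined IEEE 754 bit pattern
# (0x7ff8000000000000 on typical platforms).
nan_bits = struct.unpack("<Q", struct.pack("<d", float("nan")))[0]
hex(nan_bits)             # '0x7ff8000000000000'
math.isnan(float("nan"))  # True, regardless of which library produced the NaN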

Member

I agree with excluding NaT, since it's numpy-specific; I don't think that is any different here than in the interchange protocol. Column.isnan should handle only actual floating-point NaN values, I'd say.

MarcoGorelli (Contributor Author)

We'll need to update it here too then:

def isnan(self) -> DataFrame:
    """
    Check for nan-like entries.

    Returns
    -------
    DataFrame

    See also
    --------
    isnull

    Notes
    -----
    Includes anything with NaN-like semantics, e.g. np.datetime64("NaT").
    Does *not* include 'missing' or 'null' entries.
    """
    ...
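
A possible rewording, following the suggestion above to keep isnan strictly for floating-point NaN values (this is only a sketch, not the wording that was merged):

def isnan(self) -> DataFrame:
    """
    Check for NaN entries.

    Returns
    -------
    DataFrame

    See also
    --------
    isnull

    Notes
    -----
    Checks only for actual floating-point NaN values.
    Does *not* include 'missing' or 'null' entries, and does *not*
    include NaN-like sentinels such as np.datetime64("NaT").
    """
    ...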

jorisvandenbossche (Member), Apr 27, 2023

I'm trying to think about whether that is relevant to including/excluding NaT in isnan.

I agree it's not relevant for isnan, and that it sounds best to keep isnan strictly for floating dtypes.
But I would say it does matter for considering it as null or not. If we specify how a null should behave in comparisons, and if that's to propagate as a missing value, then NaT (if you rely purely on the numpy behaviour in your DataFrame library) does not behave as a null. So should isnull() then return True for NaT?

Member

I'd think isnull([np.NaT]) should indeed return True. numpy doesn't know about missing values at all, so it can't be right or wrong here.
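
For what it's worth (an illustration, not from the thread), both libraries' own missing-value checks already flag NaT:

import numpy as np
import pandas as pd

pd.isna(pd.NaT)                 # True: pandas counts NaT as missing
np.isnat(np.datetime64("NaT"))  # True: numpy's dedicated NaT check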

Member

So to be explicit, you are then OK with Column.from_sequence([np.NaT], dtype="datetime") > val and Column.from_sequence([null], dtype="some_numeric_dtype") > val returning different results? (and essentially saying that the behaviour of nulls is implementation-dependent?)

Member

> and essentially say that the behaviour of nulls is implementation-dependent?)

I think we're saying that in a few places already, in particular related to ordering/sorting - which is what your > val example seems to get at? Isn't the end result in both cases expected to be the same though? Both cases compare "missing" to a value, so I'd expect a similar outcome.

Member

> in particular related to ordering/sorting

No, we said that this is implementation-defined for NaNs, not for nulls (for nulls the keyword controls the behaviour):

nulls_position : ``{'first', 'last'}``
    Whether null values should be placed at the beginning
    or at the end of the result.
    Note that the position of NaNs is unspecified and may
    vary based on the implementation.

> which is what your > val example seems to get at?

For NaNs this is related to sorting, but not for nulls, I think, at least if we specify that null > val always returns null. So the question is whether we want to specify this, or whether we want to say that the behaviour of null is implementation-defined.

> Isn't the end result in both cases expected to be the same though? Both cases compare "missing" to a value, so I'd expect a similar outcome.

Yes, I would also expect the same result if both are considered as null. But that was my original question: NaT does not give the same result:

In [19]: pd.NaT > pd.Timestamp("2012-01-01")
Out[19]: False

In [20]: pd.NA > pd.Timestamp("2012-01-01")
Out[20]: <NA>

So either it should not be considered as null, or implementations should hide that behaviour of NaT? (or we say this is not covered by the spec and is implementation-defined)

To make it a bit more concrete:

s1 = pd.Series([pd.NaT, "2012-01-01"], dtype="datetime64[ns]")
s2 = pd.Series([pd.NA, 2], dtype="Int64")

In [14]: s1
Out[14]: 
0          NaT
1   2012-01-01
dtype: datetime64[ns]

In [15]: s2
Out[15]: 
0    <NA>
1       2
dtype: Int64

In [17]: s1 >= s1[1]
Out[17]: 
0    False
1     True
dtype: bool

In [18]: s2 >= s2[1]
Out[18]: 
0    <NA>
1    True
dtype: boolean

That's with current pandas, where we have a mix of dtypes that are "nullable" (in the sense of following the behaviour of pd.NA) and dtypes that are not (that have the behaviour of NaN / NaT).
So the question for a spec-compliant DataFrame implementation is: which behaviour do we want/require? Are both fine (and so we leave this to the implementation)? Or should a spec-compliant implementation hide this different behaviour of NaT compared to NA? (E.g. an implementation could still use NaT as a missing-value sentinel under the hood, but in certain operations have custom handling for it to ensure it behaves in a spec-compliant way.)
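
As a rough sketch of the "hide it in certain operations" option, using pandas as a stand-in (spec_compliant_gt is a hypothetical helper, not part of the standard): an implementation backed by NaT sentinels could re-mask the NaT positions after a comparison so the result propagates as missing, matching the pd.NA behaviour.

import pandas as pd

def spec_compliant_gt(values: pd.Series, scalar) -> pd.Series:
    # Compare a NaT-backed datetime column to a scalar, then re-mask the
    # NaT positions so the result propagates as missing instead of False.
    missing = values.isna()                       # True where NaT
    result = (values > scalar).astype("boolean")  # nullable boolean dtype
    result[missing] = pd.NA                       # hide NaT's "always False"
    return result

s1 = pd.Series([pd.NaT, "2012-01-01"], dtype="datetime64[ns]")
spec_compliant_gt(s1, pd.Timestamp("2011-06-01"))
# 0    <NA>
# 1    True
# dtype: boolean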

MarcoGorelli marked this pull request as ready for review on April 19, 2023, 10:59
rgommers (Member) left a comment

This looks about ready to merge, thanks Marco! Just a comment on the dtype-related behavior of the reductions.

I did also consider whether reductions should return a size-1 Column rather than a scalar. That may still make sense, but I suspect that returning a scalar value is more practical.
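
To illustrate the practical difference with pandas as a stand-in (hypothetical, not part of the spec): a scalar result plugs straight into ordinary Python expressions, whereas a size-1 Column would need an extra extraction step first.

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
s.mean()         # 2.0: a plain scalar, directly usable in Python expressions
s.mean() > 1.5   # True
# A size-1 Column result would instead require pulling the value out
# (however the spec chooses to expose that) before such comparisons.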


Notes
-----
Includes anything with NaN-like semantics, e.g. np.datetime64("NaT").
Member

Sounds good. Could you update the description here?

jbrockmendel (Contributor)

jbrockmendel commented Apr 26, 2023 via email

rgommers (Member) left a comment

Discussed with @MarcoGorelli; we're both happy with the way things look now and with the extra details on which dtypes the reductions return. So I'll merge this. This is a large PR, so if there's a detail we missed somewhere, let's deal with that in a follow-up PR.

rgommers merged commit 7059081 into data-apis:main on Apr 27, 2023