Fill in column a bit #141
Conversation
```
Notes
-----
Includes anything with NaN-like semantics, e.g. np.datetime64("NaT").
```
What is the plan for Pandas with regards to `NaT` and moving to Arrow types and Arrow compute? Arrow doesn't have a concept of `NaT`, and there isn't an actual binary value for `NaT`; numpy uses `min(int64)` as a sentinel value for it:

```python
In [6]: np.datetime64(np.iinfo(np.int64).min, "ns")
Out[6]: numpy.datetime64('NaT')

In [7]: np.datetime64(np.iinfo(np.int64).min + 1, "ns")
Out[7]: numpy.datetime64('1677-09-21T00:12:43.145224193')
```
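For contrast, a minimal sketch (assuming pyarrow) of how Arrow models a missing timestamp: it is recorded as a null in the validity bitmap, with no sentinel in the values buffer.

```python
import datetime as dt

import pyarrow as pa

# Arrow has no NaT: a missing timestamp is simply a null entry,
# tracked in the validity bitmap rather than in the values buffer.
arr = pa.array([None, dt.datetime(2012, 1, 1)], type=pa.timestamp("ns"))
print(arr.null_count)  # 1
print(arr.is_null())   # [true, false]
```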
That's fine; this is just for anything NaN-like, and there's no requirement that `NaT` exists. It's the same as `DataFrame.isnan`.
I guess my point is that `NaN` has a binary representation and is a universally agreed-upon standard and concept. `NaT` is not, and different libraries will have different support, different behaviors, different representations, etc.

Are we saying that supporting the case of having both `NaT` and `null` values in datetime columns is within the scope of what we're building here? I.e., we explicitly didn't include `NaT` in the interchange protocol as a null option, since in reality it's a sentinel value of `min(int64)`: https://github.com/data-apis/dataframe-api/blob/main/protocol/dataframe_protocol.py#L64-L86
I agree with excluding `NaT`, since it's numpy-specific. I don't think that is different here from the interchange protocol. `Column.isnan` should handle only actual floating-point `nan` values, I'd say.
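For what it's worth, numpy itself already keeps the two checks separate (a small illustration, not spec text):

```python
import numpy as np

# Floating-point NaN and datetime NaT have distinct predicates in numpy:
np.isnan(np.array([1.0, np.nan]))                                  # [False,  True]
np.isnat(np.array(["NaT", "2012-01-01"], dtype="datetime64[ns]"))  # [ True, False]
```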
We'll need to update it here too, then:

dataframe-api/spec/API_specification/dataframe_api/dataframe_object.py, lines 583 to 601 in 86634ea:

```python
def isnan(self) -> DataFrame:
    """
    Check for nan-like entries.

    Returns
    -------
    DataFrame

    See also
    --------
    isnull

    Notes
    -----
    Includes anything with NaN-like semantics, e.g. np.datetime64("NaT").
    Does *not* include 'missing' or 'null' entries.
    """
    ...
```
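For concreteness, one possible rewording of that Notes section along those lines (a sketch, not the merged text):

```
Notes
-----
Checks only for actual floating-point NaN values.
Does *not* include 'missing' or 'null' entries, nor
dtype-specific sentinels such as np.datetime64("NaT").
```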
I'm trying to think about whether that is relevant to including/excluding `NaT` in `isnan`.

I agree it's not relevant for `isnan`, and that it sounds best to keep `isnan` strictly for floating dtypes.

But I would say it does matter for considering it as null or not. If we specify how a null should behave in comparisons, and if that's to propagate as a missing value, then `NaT` (if you rely purely on the numpy behaviour in your DataFrame library) does not behave as a null. So should `isnull()` then return True for `NaT`?
I'd think `isnull([np.NaT])` should indeed return True. numpy doesn't know about missing values at all, so it can't be right or wrong here.
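pandas' existing null check already behaves that way, for reference:

```python
import numpy as np
import pandas as pd

# pd.isna treats NaT, in either flavour, the same as other missing values:
pd.isna(pd.NaT)                # True
pd.isna(np.datetime64("NaT"))  # True
pd.isna(pd.NA)                 # True
```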
So to be explicit: you are then OK with `Column.from_sequence([np.NaT], dtype="datetime") > val` and `Column.from_sequence([null], dtype="some_numeric_dtype") > val` returning different results? (And essentially saying that the behaviour of nulls is implementation-dependent?)
> and essentially say that the behaviour of nulls is implementation-dependent?

I think we're saying that in a few places already, in particular related to ordering/sorting, which is what your `> val` example seems to get at? Isn't the end result in both cases expected to be the same though? Both cases compare "missing" to a value, so I'd expect a similar outcome.
> in particular related to ordering/sorting

No, we said that this is implementation-defined for NaNs, not for nulls (for nulls, the keyword controls the behaviour):

dataframe-api/spec/API_specification/dataframe_api/dataframe_object.py, lines 268 to 272 in 7059081:

```python
nulls_position : ``{'first', 'last'}``
    Whether null values should be placed at the beginning
    or at the end of the result.
    Note that the position of NaNs is unspecified and may
    vary based on the implementation.
```
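For comparison, pandas exposes the analogous knob as `na_position` (pandas API, shown purely as an illustration):

```python
import pandas as pd

s = pd.Series([pd.NA, 2, 1], dtype="Int64")
# Nulls go wherever the keyword says; the sort itself stays deterministic.
s.sort_values(na_position="last")   # 1, 2, <NA>
s.sort_values(na_position="first")  # <NA>, 1, 2
```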
> which is what your `> val` example seems to get at?

For NaNs this is related to sorting, but not for nulls, I think. At least if we specify that `null > val` always returns `null`. So the question is whether we want to specify this, or whether we want to say that the behaviour of null is implementation-defined.
> Isn't the end result in both cases expected to be the same though? Both cases compare "missing" to a value, so I'd expect a similar outcome.

Yes, I would also expect the same result if both are considered as a null. But that was my original question: `NaT` does not give the same result:

```python
In [19]: pd.NaT > pd.Timestamp("2012-01-01")
Out[19]: False

In [20]: pd.NA > pd.Timestamp("2012-01-01")
Out[20]: <NA>
```

So either it should not be considered as null, or implementations should hide that behaviour of `NaT`? (Or we say this is not covered by the spec and is implementation-defined.)
To make it a bit more concrete:

```python
s1 = pd.Series([pd.NaT, "2012-01-01"], dtype="datetime64[ns]")
s2 = pd.Series([pd.NA, 2], dtype="Int64")

In [14]: s1
Out[14]:
0          NaT
1   2012-01-01
dtype: datetime64[ns]

In [15]: s2
Out[15]:
0    <NA>
1       2
dtype: Int64

In [17]: s1 >= s1[1]
Out[17]:
0    False
1     True
dtype: bool

In [18]: s2 >= s2[1]
Out[18]:
0    <NA>
1    True
dtype: boolean
```

That's with current pandas, where we have a mix of dtypes that are "nullable" (in the sense of following the behaviour of `pd.NA`) and dtypes that are not (that have the behaviour of NaN / NaT).

So the question for a spec-compliant DataFrame implementation: which behaviour do we want/require? Are both fine? (And so we leave this to the implementation.) Or should a spec-compliant implementation hide this different behaviour of NaT compared to NA? (E.g. an implementation could still use NaT as a missing-value sentinel under the hood, but in certain operations have custom handling for it to ensure it behaves spec-compliantly.)
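A minimal sketch of what that last option could look like in pandas terms (the helper name and the use of the nullable boolean dtype are my assumptions, not spec text):

```python
import pandas as pd

def spec_compliant_ge(col: pd.Series, val) -> pd.Series:
    """Hypothetical helper: '>=' that propagates NaT positions as nulls."""
    mask = col.isna()                        # True where the NaT sentinel sits
    result = (col >= val).astype("boolean")  # move to the nullable boolean dtype
    result[mask] = pd.NA                     # overwrite sentinel positions with NA
    return result

s1 = pd.Series([pd.NaT, "2012-01-01"], dtype="datetime64[ns]")
print(spec_compliant_ge(s1, s1[1]))
# 0    <NA>
# 1    True
# dtype: boolean
```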
This looks about ready to merge, thanks Marco! Just a comment on the dtype-related behavior of the reductions.
I did also consider whether reductions should return a size-1 Column rather than a scalar. That may still make sense, but I suspect that returning a scalar value is more practical.
```
Notes
-----
Includes anything with NaN-like semantics, e.g. np.datetime64("NaT").
```
Sounds good. Could you update the description here?
Sum allows timedelta
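For reference, current pandas behaviour illustrating why sum is meaningful for timedelta but not for datetime (an illustration, not spec text):

```python
import pandas as pd

td = pd.Series([pd.Timedelta("1d"), pd.Timedelta("2d")])
td.sum()  # Timedelta('3 days 00:00:00'): summing durations is meaningful

ts = pd.Series(pd.to_datetime(["2012-01-01", "2012-01-02"]))
ts.min()  # Timestamp('2012-01-01 00:00:00'): min/max keep the column's dtype family
# ts.sum() raises TypeError in current pandas: adding datetimes has no meaning
```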
…On Wed, Apr 26, 2023 at 4:00 PM Ralf Gommers wrote:

In spec/API_specification/dataframe_api/column_object.py (#141 (comment)):
```diff
+        ------
+        ValueError
+            If column is not boolean.
+        """
+
+    def all(self, skip_nulls: bool = True) -> bool:
+        """
+        Reduction returns a bool.
+
+        Raises
+        ------
+        ValueError
+            If column is not boolean.
+        """
+
+    def min(self, skip_nulls: bool = True) -> float:
```
All these reductions should return the same data type as the column already has, I think, rather than a float. And the description should say that these reductions are only defined for numerical and datetime data types, except for prod and sum, where I don't think datetime support makes sense.
In spec/API_specification/dataframe_api/column_object.py (#141 (comment)):
```diff
+
+    def isnan(self) -> Column:
+        """
+        Check for nan-like entries.
+
+        Returns
+        -------
+        Column
+
+        See also
+        --------
+        isnull
+
+        Notes
+        -----
+        Includes anything with NaN-like semantics, e.g. np.datetime64("NaT").
```
Sounds good. Could you update the description here?
Discussed with @MarcoGorelli; we're both happy with the way things look now and with the extra details on which reductions return which dtypes. So I'll merge this. This is a large PR, so if there's a detail we missed somewhere, let's deal with that in a follow-up PR.