Bug in Series constructor returning wrong missing values #43026

Merged
merged 2 commits into from
Aug 19, 2021

phofl
Member

@phofl phofl commented Aug 13, 2021

This ensures consistency with the DataFrame constructor

@phofl phofl added the Constructors, Missing-data, and Series labels Aug 13, 2021

# GH#43018
ser = Series(np.nan, dtype="object")
result = ser.astype("bool")
expected = Series(True, dtype="bool")
Contributor

this is True?

Member

bool(np.nan) returns True

Member Author

Yes exactly.
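
For context, the truthiness rule behind this expectation can be checked directly. The following is a minimal sketch that assumes only NumPy is installed; the variable names are illustrative and it is not part of the PR's test suite.

```python
import numpy as np

# NaN is a non-zero float, so Python treats it as truthy.
nan_is_truthy = bool(np.nan)

# Casting an object array that holds NaN to bool applies the same rule
# to each element.
obj_arr = np.array([np.nan], dtype=object)
cast = obj_arr.astype(bool)

print(nan_is_truthy, cast.tolist())
```

This is why a missing value that is forced into a plain bool dtype ends up as True rather than False.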

def test_constructor_bool_dtype_missing_values(self):
    # GH#43018
    result = Series(index=[0], dtype="bool")
    expected = Series(True, index=[0], dtype="bool")
Contributor

this is True?

@jreback jreback added this to the 1.4 milestone Aug 19, 2021
@jreback jreback merged commit 098661e into pandas-dev:master Aug 19, 2021
@jreback
Contributor

jreback commented Aug 19, 2021

thanks @phofl

@simonjayhawkins
Member

@phofl we have had reports regarding both these cases. Is the new behavior now consistent?

import pandas as pd

print(pd.__version__)
s = pd.Series(dtype="int", index=[0])
print(s)
s2 = pd.Series(dtype="bool", index=[0])
print(s2)
1.3.5
0    0
dtype: int64
0    False
dtype: bool

1.4.1
0   NaN
dtype: float64
0    True
dtype: bool

It appears to me (and maybe others from the issues) that the changes are confusing.

We have changed the int case and now create a float array instead, on the grounds that a missing value cannot be held in an integer array. Yet for the bool case we keep the bool dtype, even though a boolean array cannot represent a missing value either?

I appreciate that int(np.nan) raises ValueError: cannot convert float NaN to integer and that bool(np.nan) is True, but the users are not specifying np.nan; they are not supplying any data. I wonder whether we ought to revert this until we change the constructors to use nullable dtypes?
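
One way to sidestep the version-dependent defaults shown above is to pass the fill value explicitly instead of leaving the data out. The following is a minimal sketch, assuming pandas is installed; the variable names are illustrative and this is a workaround, not something proposed in the PR.

```python
import pandas as pd

# Supplying a scalar makes the result independent of how the constructor
# fills a Series when no data is given.
s_int = pd.Series(0, index=[0], dtype="int")
s_bool = pd.Series(False, index=[0], dtype="bool")

# int(nan) raises while bool(nan) is True, which is the asymmetry
# behind the differing int and bool results.
try:
    int(float("nan"))
    int_of_nan_raises = False
except ValueError:
    int_of_nan_raises = True

print(s_int.dtype.kind, s_bool.dtype, int_of_nan_raises)
```

With explicit data the requested dtype is kept in both cases, so the question of what a "missing" int or bool should be never arises.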

@jbrockmendel
Member

> I wonder whether we ought to revert this until we change the constructors to use nullable dtypes?

no opinion on the reversion, but I advise against the implicit assumption that constructors are going to default to nullable dtypes. support has improved, but it's still a mess.
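
For readers unfamiliar with the nullable dtypes mentioned here, the following is a minimal sketch of how they hold a missing value without changing dtype. It passes pd.NA explicitly rather than relying on any constructor default, and assumes pandas 1.0 or later; the variable names are illustrative.

```python
import pandas as pd

# The nullable extension dtypes use pd.NA as their missing-value marker,
# so the integer and boolean dtypes survive a missing entry.
s_int = pd.Series([pd.NA], index=[0], dtype="Int64")
s_bool = pd.Series([pd.NA], index=[0], dtype="boolean")

print(s_int.dtype, s_bool.dtype)
print(s_int.isna().iloc[0], s_bool.isna().iloc[0])
```

Whether the constructors should ever select these dtypes by default is exactly the open question raised in this thread; the sketch only shows the opt-in behavior.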

Development

Successfully merging this pull request may close these issues.

BUG: Default value for Series(dtype=int) is not pd.NA
4 participants