BUG: Create empty dataframe with string dtype fails #33651

Merged
2 changes: 2 additions & 0 deletions pandas/core/arrays/integer.py
@@ -184,6 +184,8 @@ def coerce_to_array(
-------
tuple of (values, mask)
"""
+ values = [] if values is np.nan else values
Contributor

Can you please look at the caller? This indicates that we're passing np.nan to a place where we shouldn't be (probably IntegerArray._from_sequence). That means there may be other ExtensionArrays facing the same issue. I'd much rather fix it at the source.

Contributor Author

> That means there may be other ExtensionArrays facing the same issue

Should we handle those in separate issues?

Contributor

Depends on the size of the required changes to get this working.

I'm not comfortable merging this until the problem is better understood. We should not be passing np.nan to _from_sequence. We should be passing [].

Member

making changes here may not be necessary once the changes to sanitize_array in #33846 are merged.

Member

pd.DataFrame(columns=["a"], dtype="Int64").dtypes now works on master following #33846. I have reverted this change.
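
To make the point of this thread concrete, here is a minimal sketch using only public pandas API. It assumes `pd.array` as a stand-in for the private `_from_sequence` path discussed above, and a pandas version that already includes the change from #33846. It is an illustration, not the code that was merged.

```python
import pandas as pd

# An empty sequence is the natural "no values" input for an extension array;
# np.nan is a scalar and is not a sequence at all.
empty = pd.array([], dtype="Int64")
print(len(empty), empty.dtype)

# The constructor call from the last comment in this thread: no data, only a
# column label and a nullable-integer dtype.
df = pd.DataFrame(columns=["a"], dtype="Int64")
print(df.dtypes)
```

If the constructor hands `[]` to the array type, the array code needs no special case for `np.nan`, which is why the reviewers preferred fixing the caller.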


# if values is integer numpy array, preserve its dtype
if dtype is None and hasattr(values, "dtype"):
if is_integer_dtype(values.dtype):
6 changes: 5 additions & 1 deletion pandas/core/internals/construction.py
@@ -242,7 +242,11 @@ def init_dict(data, index, columns, dtype=None):

# no obvious "empty" int column
if missing.any() and not is_integer_dtype(dtype):
- if dtype is None or np.issubdtype(dtype, np.flexible):
+ if (
+ dtype is None
+ or is_extension_array_dtype(dtype)
+ or np.issubdtype(dtype, np.flexible)
+ ):
Member

I think changing this fixes the interval case

Suggested change
- if (
- dtype is None
- or is_extension_array_dtype(dtype)
- or np.issubdtype(dtype, np.flexible)
- ):
+ if dtype is None or (
+ not is_extension_array_dtype(dtype)
+ and np.issubdtype(dtype, np.flexible)
+ ):

Contributor Author

This suggestion doesn't work...

Member

I've pushed this change; it works on my machine. Can you elaborate?

Contributor Author

Thanks @simonjayhawkins. It works in my environment too.

# GH#1783
nan_dtype = object
else:
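
The condition from the suggested change in this hunk can be tried in isolation. The sketch below is hedged: `falls_back_to_object` is a hypothetical helper written for this illustration (it is not part of pandas), it uses the public `pandas.api.types.is_extension_array_dtype`, and it leaves out the outer `missing.any() and not is_integer_dtype(dtype)` check.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_extension_array_dtype


def falls_back_to_object(dtype) -> bool:
    """Hypothetical helper mirroring the suggested condition.

    Returns True when the branch that sets ``nan_dtype = object`` would run.
    """
    return dtype is None or (
        # Test for extension dtypes first, so np.issubdtype is only asked
        # about dtypes that NumPy itself can interpret.
        not is_extension_array_dtype(dtype)
        and np.issubdtype(dtype, np.flexible)
    )


candidates = (None, str, np.float64, pd.Int64Dtype(), pd.IntervalDtype("int64"))
for candidate in candidates:
    print(repr(candidate), falls_back_to_object(candidate))
```

Under these assumptions only `None` and NumPy flexible dtypes such as `str` take the object branch; extension dtypes, including the interval case mentioned in the thread, skip it.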
4 changes: 4 additions & 0 deletions pandas/tests/extension/arrow/test_bool.py
@@ -55,6 +55,10 @@ def test_from_dtype(self, data):
def test_from_sequence_from_cls(self, data):
super().test_from_sequence_from_cls(data)

+ @pytest.mark.xfail(reason="bad is-na for empty data")
+ def test_construct_empty_dataframe(self, dtype):
+ super().test_construct_empty_dataframe(dtype)


class TestReduce(base.BaseNoReduceTests):
def test_reduce_series_boolean(self):
6 changes: 6 additions & 0 deletions pandas/tests/extension/base/constructors.py
@@ -83,3 +83,9 @@ def test_pandas_array_dtype(self, data):
result = pd.array(data, dtype=np.dtype(object))
expected = pd.arrays.PandasArray(np.asarray(data, dtype=object))
self.assert_equal(result, expected)

+ def test_construct_empty_dataframe(self, dtype):
+ # GH 33623
+ result = pd.DataFrame(columns=["a"], dtype=dtype)
+ expected = pd.DataFrame(data=[], columns=["a"], dtype=dtype)
Contributor

Suggested change
- expected = pd.DataFrame(data=[], columns=["a"], dtype=dtype)
+ expected = pd.DataFrame({"a": pd.array([], dtype=dtype)})

This seems a slightly safer way to build the expected result.

+ self.assert_frame_equal(result, expected)
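
Outside the extension test suite, the comparison this test makes can be sketched as a standalone script. `build_empty_frames` is a hypothetical helper introduced only for this illustration, and the three dtype strings are examples rather than the fixture values the real test runs against. The sketch compares shape and column dtype only, because the index type of an empty frame may differ between pandas versions.

```python
import pandas as pd


def build_empty_frames(dtype):
    """Hypothetical helper: build the two frames that the test compares."""
    # Frame under test: no data, only a column label and a dtype.
    result = pd.DataFrame(columns=["a"], dtype=dtype)
    # Reference frame, built from an explicitly typed empty array as the
    # reviewer suggested.
    expected = pd.DataFrame({"a": pd.array([], dtype=dtype)})
    return result, expected


for dtype in ("Int64", "boolean", "string"):
    result, expected = build_empty_frames(dtype)
    # Check only what the issue is about: an empty column of the right dtype.
    assert result.shape == expected.shape == (0, 1)
    assert result.dtypes["a"] == expected.dtypes["a"]
    print(dtype, "->", result.dtypes["a"])
```

Building `expected` from `pd.array([], dtype=dtype)` keeps the reference value off the constructor path that is being tested, which is presumably why the reviewer called it safer.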
4 changes: 3 additions & 1 deletion pandas/tests/extension/test_interval.py
@@ -83,7 +83,9 @@ class TestCasting(BaseInterval, base.BaseCastingTests):


class TestConstructors(BaseInterval, base.BaseConstructorsTests):
- pass
+ @pytest.mark.xfail(reason="bad is-na for empty data")
Contributor

why is this xfailed?

Contributor Author

  • object dtype is not supported for IntervalArray.
  • The na_value of IntervalArray is a float, so construct_1d_arraylike_from_scalar() raises AttributeError: 'float' object has no attribute 'dtype'.

Contributor

I think this would ideally be fixed here.

Contributor Author

If nan_dtype is set to dtype (the IntervalDtype), the DataFrame can be created:

if is_interval_dtype(dtype):
    nan_dtype = dtype
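
The AttributeError quoted earlier in this thread comes down to the kind of object used as the missing-value marker. The following sketch only inspects that marker through the public `na_value` attribute of `pd.IntervalDtype`. It assumes the attribute behaves as the comment describes, and it does not call the internal `construct_1d_arraylike_from_scalar` function.

```python
import numpy as np
import pandas as pd

interval_dtype = pd.IntervalDtype("int64")
na = interval_dtype.na_value

# The missing-value marker is a plain Python float (NaN).
print(type(na), np.isnan(na))

# A plain float carries no dtype attribute, whereas a NumPy scalar does.
print(hasattr(na, "dtype"), hasattr(np.float64("nan"), "dtype"))
```

Code that reads `.dtype` from the fill value therefore fails for the interval case unless the dtype is supplied separately, as the `nan_dtype = dtype` proposal above does.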

+ def test_construct_empty_dataframe(self, dtype):
+ super().test_construct_empty_dataframe(dtype)


class TestGetitem(BaseInterval, base.BaseGetitemTests):