BUG: Adding Series to empty DataFrame can reset dtype to float64 #42099

Closed
@JBGreisman

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas (1.3.0rc1).

  • (optional) I have confirmed this bug exists on the master branch of pandas


Code Sample, a copy-pastable example

import pandas as pd
data = pd.array([0, 1, 2, 3], dtype="Int32")
df = expected = pd.DataFrame({"data": pd.Series(data)})
result = pd.DataFrame(index=df.index)
result.loc[df.index, "data"] = df["data"]

print(df["data"].dtype)     # prints: Int32
print(result["data"].dtype) # prints: float64 <--

Problem description

This behavior seems unexpected to me: the provided dtype should be preserved rather than coerced to the default dtype of an empty Series. It occurs for the nullable integer dtypes as well as Float32/Float64.
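To see which dtypes are affected on a given installation, the minimal example can be wrapped in a small loop. This is only a sketch: the helper name `dtype_after_loc_expansion` and the list `NULLABLE_DTYPES` are hypothetical names introduced here, and the printed result depends on the pandas version, so no expected output is shown.

import pandas as pd

# Nullable dtypes mentioned in this report (illustrative selection).
NULLABLE_DTYPES = ["Int8", "Int16", "Int32", "Int64", "Float32", "Float64"]


def dtype_after_loc_expansion(dtype):
    """Return (source dtype, dtype after .loc expansion) as strings."""
    values = [0, 1, 2, 3]  # no missing values, as in the minimal example
    df = pd.DataFrame({"data": pd.Series(pd.array(values, dtype=dtype))})
    result = pd.DataFrame(index=df.index)
    # Same setitem-with-expansion pattern as in the code sample above.
    result.loc[df.index, "data"] = df["data"]
    return str(df["data"].dtype), str(result["data"].dtype)


report = {name: dtype_after_loc_expansion(name) for name in NULLABLE_DTYPES}
for name, (before, after) in report.items():
    print(f"{name}: source={before}, after .loc expansion={after}")

A line where the two dtypes differ indicates that the coercion described here is present for that dtype.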

I came across this while implementing an ExtensionDtype that ended up failing BaseSetitemTests.test_setitem_with_expansion_dataframe_column:

def test_setitem_with_expansion_dataframe_column(self, data, full_indexer):
    # https://github.com/pandas-dev/pandas/issues/32395
    df = expected = pd.DataFrame({"data": pd.Series(data)})
    result = pd.DataFrame(index=df.index)
    key = full_indexer(df)
    result.loc[key, "data"] = df["data"]
    self.assert_frame_equal(result, expected)

Interestingly, the test data for IntegerArray and FloatingArray includes missing values, and in that case the dtype is not coerced to float64:

import pandas as pd
data = pd.array([0, pd.NA, 2, 3], dtype="Int32")
df = expected = pd.DataFrame({"data": pd.Series(data)})
result = pd.DataFrame(index=df.index)
result.loc[df.index, "data"] = df["data"]

print(df["data"].dtype)     # prints: Int32
print(result["data"].dtype) # prints: Int32 <--

My expectation was that the dtype would be preserved in both cases, with or without missing values.

Expected Output

I would expect the dtype of the pd.Series being added to result to be preserved. In the minimal example, result["data"] should have dtype Int32Dtype.

print(df["data"].dtype)     # prints: Int32
print(result["data"].dtype) # prints: Int32 <--
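Until the .loc path preserves the dtype, two possible workarounds are sketched below. Both are assumptions on my part rather than a confirmed fix: the first avoids .loc by assigning the column directly, the second keeps the .loc assignment and casts back to the source dtype afterwards. The variable names `direct` and `recast` are only illustrative.

import pandas as pd

data = pd.array([0, 1, 2, 3], dtype="Int32")
df = pd.DataFrame({"data": pd.Series(data)})

# Option 1: assign the column directly instead of going through .loc.
direct = pd.DataFrame(index=df.index)
direct["data"] = df["data"]

# Option 2: keep the .loc assignment, then cast back to the source dtype.
recast = pd.DataFrame(index=df.index)
recast.loc[df.index, "data"] = df["data"]
recast["data"] = recast["data"].astype(df["data"].dtype)

print(direct["data"].dtype)
print(recast["data"].dtype)

Option 2 relies on the coerced values still being whole numbers, which holds here because the source column contains no missing values.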

Output of pd.show_versions()

This was generated from the latest release candidate, but the bug also appears to occur on the master branch (1.4.0.dev0+56.g648eb40abc).

INSTALLED VERSIONS

commit : 2dd9e9b
python : 3.8.5.final.0
python-bits : 64
OS : Darwin
OS-release : 17.7.0
Version : Darwin Kernel Version 17.7.0: Fri Oct 30 13:34:27 PDT 2020; root:xnu-4570.71.82.8~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.0rc1
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.2
setuptools : 49.6.0.post20210108
Cython : None
pytest : 6.2.1
hypothesis : None
sphinx : 3.3.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.24.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.05.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.3
sqlalchemy : None
tables : None
tabulate : 0.8.7
xarray : None
xlrd : None
xlwt : None
numba : None

Labels: Bug; Indexing (related to indexing on series/frames, not to indexes themselves); Regression (functionality that used to work in a prior pandas version)
