combine_first not retaining dtypes - unmatched indexes #24357

Closed
@jmarshall9120

Description

import pandas as pd

df1 = pd.DataFrame([["one", i] for i in range(3)], columns=["a", "b"])
df2 = pd.DataFrame([["one", i] for i in range(5)], columns=["a", "b"])
df3 = df1.combine_first(df2)

# All of the statements below should show a dtype of int64 for column b
df1.dtypes
df2.dtypes
df3.dtypes

#Actual Output
>>> df1.dtypes
a    object
b     int64
dtype: object
>>> df2.dtypes
a    object
b     int64
dtype: object
>>> df3.dtypes
a     object
b    float64
dtype: object

I'm not sure whether this is intended behavior, but as the output shows, the dtype of column b changes to float64 when combine_first is called.

I've seen an old open issue: combine_first not retaining dtypes

That issue is from 2014 and explains why dtypes are coerced to float64 when the result contains a NaN. In the example above, however, there is no NaN in the result. The coercion may happen because, under the hood, NaNs are introduced where the first index doesn't match the second. Still, this leaves combine_first in a weird state: if I can't trust dtypes not to be coerced when appending data, then I need to guarantee matching indexes beforehand. If I have to do that, I end up doing half of the work of combine_first manually, which makes it far less useful.
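To illustrate the suspected cause, here is a minimal sketch that aligns df1 to the union of both indexes by hand. It is only an assumption that combine_first does something equivalent internally; the names union_index and aligned are made up for this illustration.

```python
import pandas as pd

# Same frames as in the report above.
df1 = pd.DataFrame([["one", i] for i in range(3)], columns=["a", "b"])
df2 = pd.DataFrame([["one", i] for i in range(5)], columns=["a", "b"])

# Reindex df1 onto the union of both indexes. Rows 3 and 4 do not exist in
# df1, so they are filled with NaN, and an int64 column cannot hold NaN.
union_index = df1.index.union(df2.index)
aligned = df1.reindex(union_index)

print(aligned["b"].dtype)         # float64
print(aligned["b"].isna().sum())  # 2
```

If an intermediate frame like this is what gets filled from df2, the final result has no NaN left, but the column has already been upcast to float64 by that point.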

Expected Output

>>> df1.dtypes
a    object
b     int64
dtype: object
>>> df2.dtypes
a    object
b     int64
dtype: object
>>> df3.dtypes
a     object
b     int64
dtype: object
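A possible workaround that produces the expected output is to cast columns back after combining. The sketch below uses a hypothetical helper, combine_first_keep_dtypes, which is not part of pandas. It assumes that restoring a dtype is only safe when both inputs agree on it and the combined column contains no missing values.

```python
import pandas as pd


def combine_first_keep_dtypes(left, right):
    """Hypothetical helper: combine_first, then restore dtypes where safe."""
    result = left.combine_first(right)
    for col in result.columns:
        # Collect the dtype of this column from every input that has it.
        dtypes = {f[col].dtype for f in (left, right) if col in f.columns}
        if len(dtypes) != 1:
            continue  # inputs disagree, so leave the combined dtype alone
        target = dtypes.pop()
        # Cast back only if the dtype changed and no value is missing.
        if result[col].dtype != target and not result[col].isna().any():
            result[col] = result[col].astype(target)
    return result


df1 = pd.DataFrame([["one", i] for i in range(3)], columns=["a", "b"])
df2 = pd.DataFrame([["one", i] for i in range(5)], columns=["a", "b"])
df3 = combine_first_keep_dtypes(df1, df2)

print(df3["b"].dtype)  # int64
```

Columns that still contain NaN after combining are left untouched, so this does not hide genuinely missing data.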

Output of pd.show_versions()

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.1
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: Dtype Conversions (unexpected or buggy dtype conversions), Reshaping (concat, merge/join, stack/unstack, explode)
