Inconsistent behavior for df.replace() with NaN, NaT and None #29024

Open
@K3UL

Code Sample, a copy-pastable example if possible

import datetime as dt
import pandas as pd
import numpy as np
data = [
    ['one', 1.0, pd.NaT],
    ['two', np.NaN, dt.datetime(2019, 2, 2)],
    [None, 3.0, dt.datetime(2019, 3, 3)]
    ]
df = pd.DataFrame(data, columns=["Name", "Value", "Event_date"])
>>> df
   Name  Value Event_date
0   one    1.0        NaT
1   two    NaN 2019-02-02
2  None    3.0 2019-03-03

>>> df.replace({pd.NaT: None})
   Name Value           Event_date
0   one     1                 None
1   two  None  2019-02-02 00:00:00
2  None     3  2019-03-03 00:00:00
>>> df.replace({np.NaN: None})
   Name Value           Event_date
0   one     1                 None
1   two  None  2019-02-02 00:00:00
2  None     3  2019-03-03 00:00:00

>>> df.replace({np.NaN: None}).replace({np.NaN: None})
   Name  Value           Event_date
0   one    1.0                 None
1   two    NaN  2019-02-02 00:00:00
2  None    3.0  2019-03-03 00:00:00

>>> df.replace({np.NaN: None}).replace({np.NaN: None}).replace({np.NaN: None})
   Name Value           Event_date
0   one     1                 None
1   two  None  2019-02-02 00:00:00
2  None     3  2019-03-03 00:00:00

>>> df.replace({pd.NaT: None, np.NaN: None})
   Name  Value           Event_date
0   one    1.0                 None
1   two    NaN  2019-02-02 00:00:00
2  None    3.0  2019-03-03 00:00:00

Problem description

This may be related to #17494. Here I am using a dict to replace (the approach recommended in that issue), but I suspect the function calls itself and passes None (the replacement value) as the value argument, where it cannot be told apart from the argument's default.
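
To illustrate the suspicion above, here is a toy sketch of why a `None` default makes "no value given" indistinguishable from "replace with None". It is not pandas' actual implementation; `toy_replace_none_default`, `toy_replace_sentinel` and `_NO_VALUE` are hypothetical names introduced only for this example.

```python
# Toy illustration only; this is NOT how pandas implements replace().
_NO_VALUE = object()  # hypothetical sentinel meaning "argument not supplied"


def toy_replace_none_default(items, to_replace, value=None):
    """With a None default, value=None looks the same as 'not passed'."""
    if value is None:
        # The function cannot know whether the caller really asked for None.
        return "ambiguous"
    return [value if x == to_replace else x for x in items]


def toy_replace_sentinel(items, to_replace, value=_NO_VALUE):
    """A private sentinel keeps None available as a real replacement value."""
    if value is _NO_VALUE:
        raise TypeError("value is required")
    return [value if x == to_replace else x for x in items]


ambiguous_result = toy_replace_none_default(["a", "b"], "a", None)
sentinel_result = toy_replace_sentinel(["a", "b"], "a", None)
print(ambiguous_result)  # ambiguous
print(sentinel_result)   # [None, 'b']
```

If something like the first pattern is involved, an explicit None replacement could be lost on an internal call, which would be consistent with the toggling shown in the code sample.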

When calling df.replace() to replace NaN or NaT with None, I found several behaviours that don't seem right to me:

  • Replacing only NaT with None also replaces NaN with None.
  • Replacing NaN with None also replaces NaT with None.
  • Replacing both NaT and NaN with None replaces NaT but leaves NaN.
  • Related to the previous point, repeatedly replacing NaN or NaT with None toggles the float column between NaN and None: an even number of calls leaves NaN, an odd number of calls leaves None.

This is a problem because I'm unable to replace only NaT or only NaN. It is also a problem because, to replace both, I intuitively call replace with the dict {pd.NaT: None, np.NaN: None} but end up with NaNs.
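
Until replace() behaves as expected, one possible workaround is to convert missing values column by column without going through replace() at all. This is only a sketch: `missing_to_none` is a hypothetical helper, not a pandas function, and it builds an object-dtype column in plain Python, so it may be slow on large frames.

```python
import datetime as dt

import numpy as np
import pandas as pd


def missing_to_none(series):
    """Return an object-dtype copy of `series` with missing entries set to None.

    Hypothetical helper for illustration; not part of pandas.
    """
    values = [None if pd.isna(v) else v for v in series]
    return pd.Series(values, index=series.index, dtype=object)


df = pd.DataFrame(
    [
        ["one", 1.0, pd.NaT],
        ["two", np.nan, dt.datetime(2019, 2, 2)],
        [None, 3.0, dt.datetime(2019, 3, 3)],
    ],
    columns=["Name", "Value", "Event_date"],
)

# Convert only the float column; Event_date keeps its NaT.
only_value = df.assign(Value=missing_to_none(df["Value"]))

# Convert only the datetime column; Value keeps its NaN.
only_date = df.assign(Event_date=missing_to_none(df["Event_date"]))
```

Because each column is handled separately, the choice of which missing-value marker to convert is made by picking the column rather than by matching on NaN or NaT.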

I suspect two problems here: NaN, NaT and None are all treated as equal, and replace() calls itself with None as the value argument.
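
The first suspicion is at least consistent with how pandas detects missing values in general. A small check, using only `pd.isna` and not claiming anything about the internals of replace(), shows that all three markers count as missing even though they are distinct objects:

```python
import numpy as np
import pandas as pd

markers = (None, np.nan, pd.NaT)

# pd.isna() reports every one of the three markers as missing.
all_missing = all(pd.isna(m) for m in markers)
print(all_missing)  # True

# They are still different objects, and NaN does not even equal itself,
# so matching them by ordinary equality is not straightforward.
print(pd.NaT is None)    # False
print(np.nan == np.nan)  # False
```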

Expected Output

>>> df.replace({pd.NaT: None, np.NaN: None})
   Name  Value           Event_date
0   one    1.0                 None
1   two   None  2019-02-02 00:00:00
2  None    3.0  2019-03-03 00:00:00

>>> df.replace({pd.NaT: None})
   Name Value           Event_date
0   one     1                 None
1   two   NaN  2019-02-02 00:00:00
2  None     3  2019-03-03 00:00:00

>>> df.replace({np.NaN: None})
   Name Value           Event_date
0   one     1                  NaT
1   two  None  2019-02-02 00:00:00
2  None     3  2019-03-03 00:00:00

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.4.final.0
python-bits: 64
OS: Darwin
OS-release: 18.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.2.2
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.7.0
feather: None
matplotlib: None
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.2.5
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: 2.8.3 (dt dec pq3 ext lo64)
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Metadata

Labels: Bug; Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); replace (replace method)