Description
Code Sample, a copy-pastable example if possible
import datetime as dt
import pandas as pd
import numpy as np
data = [
['one', 1.0, pd.NaT],
['two', np.NaN, dt.datetime(2019, 2, 2)],
[None, 3.0, dt.datetime(2019, 3, 3)]
]
df = pd.DataFrame(data, columns=["Name", "Value", "Event_date"])
>>> df
Name Value Event_date
0 one 1.0 NaT
1 two NaN 2019-02-02
2 None 3.0 2019-03-03
>>> df.replace({pd.NaT: None})
Name Value Event_date
0 one 1 None
1 two None 2019-02-02 00:00:00
2 None 3 2019-03-03 00:00:00
>>> df.replace({np.NaN: None})
Name Value Event_date
0 one 1 None
1 two None 2019-02-02 00:00:00
2 None 3 2019-03-03 00:00:00
>>> df.replace({np.NaN: None}).replace({np.NaN: None})
Name Value Event_date
0 one 1.0 None
1 two NaN 2019-02-02 00:00:00
2 None 3.0 2019-03-03 00:00:00
>>> df.replace({np.NaN: None}).replace({np.NaN: None}).replace({np.NaN: None})
Name Value Event_date
0 one 1 None
1 two None 2019-02-02 00:00:00
2 None 3 2019-03-03 00:00:00
>>> df.replace({pd.NaT: None, np.NaN: None})
Name Value Event_date
0 one 1.0 None
1 two NaN 2019-02-02 00:00:00
2 None 3.0 2019-03-03 00:00:00
Problem description
This might be related to #17494. Here I am using a dict to replace (the approach recommended in that issue), but I suspect that replace() calls itself and passes None (the replacement value) as the value argument, which is also that argument's default.
When calling df.replace() to replace NaN or NaT with None, I found several behaviours which don't seem right to me:
- Replacing NaT with None (only) also replaces NaN with None.
- Replacing NaN with None also replaces NaT with None.
- Replacing both NaT and NaN with None replaces NaT but leaves NaN.
- Related to the previous point: chaining several replacements of NaN or NaT with None toggles the float column between NaN and None. An even number of calls leaves NaN; an odd number of calls leaves None.
This is a problem because I'm unable to replace only NaT or only NaN. It is also a problem because, if I want to replace both, the intuitive call is replace() with the dict {pd.NaT: None, np.NaN: None}, but that leaves the NaNs in place.
I suspect two problems here: NaN, NaT and None all being considered equal, and replace() calling itself with None as the value argument.
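To illustrate the first suspicion, here is a small sketch showing that pandas' public missing-value check treats all three values alike, while identity and type checks can still tell them apart. This does not prove that replace() uses pd.isna internally; that part remains an assumption on my side.

```python
import numpy as np
import pandas as pd

missing = [np.nan, pd.NaT, None]

# pd.isna() reports every one of them as missing.
flags = [bool(pd.isna(v)) for v in missing]

# Identity and type checks still distinguish the three values.
is_float_nan = [isinstance(v, float) and bool(np.isnan(v)) for v in missing]
is_nat = [v is pd.NaT for v in missing]
is_none = [v is None for v in missing]

for v, flag in zip(missing, flags):
    print(repr(v), flag)
```

If replace() matches keys with a check of this kind, a key of pd.NaT would also match NaN and None, which would be consistent with the first two behaviours listed above.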
Expected Output
>>> df.replace({pd.NaT: None, np.NaN: None})
Name Value Event_date
0 one 1.0 None
1 two None 2019-02-02 00:00:00
2 None 3.0 2019-03-03 00:00:00
>>> df.replace({pd.NaT: None})
Name Value Event_date
0 one 1 None
1 two NaN 2019-02-02 00:00:00
2 None 3 2019-03-03 00:00:00
>>> df.replace({np.NaN: None})
Name Value Event_date
0 one 1 NaT
1 two None 2019-02-02 00:00:00
2 None 3 2019-03-03 00:00:00
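As a stopgap, a sketch like the following seems to give the behaviour I expect without going through replace(). The helper name missing_to_none is hypothetical and not part of pandas. It relies on NaT only appearing in the datetime column and NaN only in the float column, so choosing the columns is what separates the two cases.

```python
import datetime as dt

import numpy as np
import pandas as pd


def missing_to_none(frame, columns):
    """Return a copy of `frame` with missing values in `columns` set to None.

    Hypothetical helper for illustration; not a pandas API.
    """
    out = frame.copy()
    for col in columns:
        # Build an object-dtype column so that None is stored as-is
        # instead of being coerced back to NaN or NaT.
        values = [None if pd.isna(v) else v for v in frame[col]]
        out[col] = pd.Series(values, index=frame.index, dtype=object)
    return out


data = [
    ["one", 1.0, pd.NaT],
    ["two", np.nan, dt.datetime(2019, 2, 2)],
    [None, 3.0, dt.datetime(2019, 3, 3)],
]
df = pd.DataFrame(data, columns=["Name", "Value", "Event_date"])

only_nat = missing_to_none(df, ["Event_date"])       # NaN in Value is kept
only_nan = missing_to_none(df, ["Value"])            # NaT in Event_date is kept
both = missing_to_none(df, ["Value", "Event_date"])  # both become None
```

The touched columns end up with object dtype, which is the price of storing None; the untouched columns keep their original dtype.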
Output of pd.show_versions()
pandas: 0.24.2
pytest: None
pip: 19.2.2
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.5.1
numexpr: 2.7.0
feather: None
matplotlib: None
openpyxl: 2.6.2
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.8
lxml.etree: 4.2.5
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: 2.8.3 (dt dec pq3 ext lo64)
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None