Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd
from IPython.display import display  # display() is only a builtin inside notebooks

def a_cleaning(value: object) -> object:
    # Strip thousands separators from strings; pass everything else through.
    if isinstance(value, str):
        return value.replace(',','')
    else:
        return value

data = pd.DataFrame({
    'A':['1,200',np.nan,'400','200',np.nan]
})
data.to_csv('data.csv',index=False)

# converters
df = pd.read_csv('data.csv',
                 converters={
                     'A': a_cleaning
                 })
print('converters:')
display(df)

# apply
print('apply:')
df = pd.read_csv('data.csv')
df['A'] = df['A'].apply(a_cleaning)
display(df)
Issue Description
I'm wondering why passing my function to read_csv through converters returns the missing values as '' (empty strings), while reading the file without converters and then running the same function with apply keeps them as NaN.
My function:
def a_cleaning(value: object) -> object:
    if isinstance(value, str):
        return value.replace(',','')
    else:
        return value
My Dataframe:
data = pd.DataFrame({
    'A':['1,200',np.nan,'400','200',np.nan]
})
data.to_csv('data.csv',index=False)
My code when using converters:
df = pd.read_csv('data.csv',
                 converters={
                     'A': a_cleaning
                 })
display(df)
My code when using apply:
df = pd.read_csv('data.csv')
df['A'] = df['A'].apply(a_cleaning)
display(df)
Why are the results different?
I'm not sure if this issue will occur with other read functions. I've only tested it with read_csv so far.
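One way to narrow this down is to record what the function actually receives in each case. The sketch below is only a diagnostic, not a statement about pandas internals: the helper names record_converter and record_apply are made up for illustration, and an in-memory buffer stands in for data.csv. It relies on the assumption that converters are handed the raw text of each field, so an empty field would arrive as the string '' rather than as NaN.

```python
import io

import numpy as np
import pandas as pd

data = pd.DataFrame({"A": ["1,200", np.nan, "400", "200", np.nan]})
csv_text = data.to_csv(index=False)  # in-memory stand-in for data.csv

seen_by_converter = []  # values passed to the function by read_csv(converters=...)
seen_by_apply = []      # values passed to the function by Series.apply


def record_converter(value):
    # Hypothetical helper: remember the input, return it unchanged.
    seen_by_converter.append(value)
    return value


def record_apply(value):
    # Hypothetical helper: remember the input, return it unchanged.
    seen_by_apply.append(value)
    return value


pd.read_csv(io.StringIO(csv_text), converters={"A": record_converter})
pd.read_csv(io.StringIO(csv_text))["A"].apply(record_apply)

print("converter saw:", [type(v).__name__ for v in seen_by_converter])
print("apply saw:    ", [type(v).__name__ for v in seen_by_apply])
```

If the converter only ever sees strings, the isinstance(value, str) branch of a_cleaning runs for the empty fields too, and ''.replace(',', '') is still '', which would account for the difference.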
Expected Behavior
It should produce the same result as using apply.
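Until the behaviour is clarified, a possible workaround is to have the converter map empty strings back to NaN itself. This is a minimal sketch under that assumption: a_cleaning_keep_nan is a hypothetical variant of a_cleaning, and an in-memory buffer replaces data.csv. It treats only the empty string as missing, not the other default NA markers such as 'NA' or 'null'.

```python
import io

import numpy as np
import pandas as pd


def a_cleaning_keep_nan(value: object) -> object:
    # Hypothetical variant of a_cleaning: treat an empty field as missing.
    if isinstance(value, str):
        if value == "":
            return np.nan
        return value.replace(",", "")
    return value


data = pd.DataFrame({"A": ["1,200", np.nan, "400", "200", np.nan]})
csv_text = data.to_csv(index=False)  # in-memory stand-in for data.csv

# Path 1: clean while parsing, through converters.
via_converters = pd.read_csv(
    io.StringIO(csv_text), converters={"A": a_cleaning_keep_nan}
)

# Path 2: parse first, then clean with apply.
via_apply = pd.read_csv(io.StringIO(csv_text))
via_apply["A"] = via_apply["A"].apply(a_cleaning_keep_nan)

print(via_converters["A"].isna().tolist())
print(via_apply["A"].isna().tolist())
```

With this variant both paths should mark the same rows as missing, which is the result described above as expected.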
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.133+
Version : #1 SMP Tue Dec 19 13:14:11 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : POSIX
LANG : C.UTF-8
LOCALE : None.None
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.0.3
pip : 23.3.2
Cython : 3.0.8
pytest : 8.2.1
hypothesis : None
sphinx : None
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 5.2.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.20.0
pandas_datareader : 0.10.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.2
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
gcsfs : 2024.3.1
matplotlib : 3.7.5
numba : 0.59.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.3
pandas_gbq : None
pyarrow : 14.0.2
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : 2024.3.1
scipy : 1.11.4
sqlalchemy : 2.0.25
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.5.0
xlrd : None
zstandard : 0.19.0
tzdata : 2024.1
qtpy : None
pyqt5 : None