Description
Code Sample, a copy-pastable example if possible
In pandas v0.20.2, the following code
import pandas as pd
from io import StringIO
data = "c1,c2\nfalse,1\n,1"
pd.read_csv(StringIO(data), dtype={'c1': 'float32'})['c1']
gives output
0 0.0
1 1.0
Name: c1, dtype: float32
Problem description
In this example, the column of boolean data contains a missing value. If I read the column as booleans (either explicitly via dtype or by allowing pandas to infer the type), then the missing value is given as NaN, as it should be. If I force the column type to be a float (or an integer) via the dtype argument to read_csv, then the missing value is given as 1.0, the same as True.
Expected Output
The output of
pd.read_csv(StringIO(data), dtype={'c1': 'float32'})['c1']
should be the same as the output of
pd.read_csv(StringIO(data))['c1'].astype('float32')
which is
0 0.0
1 NaN
Name: c1, dtype: float32
I.e., the missing value in the input CSV should be cast to NaN rather than 1.0.
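For anyone hitting the same problem, the comparison above also works as a stopgap: let pandas infer the column type first, then cast. This is only a minimal sketch. The helper name read_c1_as_float32 is made up for illustration and is not part of pandas, and it assumes the column is named c1 as in the sample data.

```python
import math
from io import StringIO

import pandas as pd

data = "c1,c2\nfalse,1\n,1"


def read_c1_as_float32(text):
    """Hypothetical helper: read the CSV with type inference, then cast c1."""
    # No dtype argument here, so the parser infers the column type itself
    # and keeps the empty field as a missing value.
    frame = pd.read_csv(StringIO(text))
    # Cast after parsing; the missing value stays NaN in the float column.
    return frame["c1"].astype("float32")


c1 = read_c1_as_float32(data)
print(c1)
```

The cost is that the column is first materialised with the inferred type before being cast, so this avoids the dtype argument rather than fixing it.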
Output of pd.show_versions()
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.2
pytest: 3.0.7
pip: 9.0.1
setuptools: 33.1.1.post20170320
Cython: 0.25.2
numpy: 1.13.0
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None