
read_csv: Casting boolean columns as floats turns missing values into 1.0 #16698

Closed
@stephen-hoover


Code Sample, a copy-pastable example if possible

In pandas v0.20.2, the following code

import pandas as pd
from io import StringIO
data = "c1,c2\nfalse,1\n,1"
pd.read_csv(StringIO(data), dtype={'c1': 'float32'})['c1']

gives output

0    0.0
1    1.0
Name: c1, dtype: float32

Problem description

In this example, the column of boolean data contains a missing value. If I read the column as booleans (either explicitly via dtype or by allowing pandas to infer the type), the missing value is read as NaN, as it should be. If I instead force the column to a float (or integer) type via the dtype argument to read_csv, the missing value becomes 1.0, which is indistinguishable from True.
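As a quick check of the inferred-type path described above, the following minimal sketch reads the same data without a dtype argument and inspects which rows pandas treats as missing. It reuses the `data` string from the code sample; the names `inferred` and `missing_mask` are only illustrative.

```python
import pandas as pd
from io import StringIO

data = "c1,c2\nfalse,1\n,1"

# Let pandas infer the type of c1 instead of forcing a float dtype.
inferred = pd.read_csv(StringIO(data))["c1"]

# The empty field in the second data row should be flagged as missing.
missing_mask = inferred.isna().tolist()
print(missing_mask)
```

If the second entry of the mask is True, the parser itself recognizes the empty field as missing, which suggests the problem lies in the cast to float rather than in missing-value detection.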

Expected Output

The output of

pd.read_csv(StringIO(data), dtype={'c1': 'float32'})['c1']

should be the same as the output of

pd.read_csv(StringIO(data))['c1'].astype('float32')

which is

0    0.0
1    NaN
Name: c1, dtype: float32

I.e., the missing value in the input CSV should be cast to NaN rather than 1.0.
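Until the dtype path behaves this way, one possible workaround is to read the column as text and map the boolean literals explicitly. This is only a sketch: `read_bool_column_as_float` is a hypothetical helper, not part of pandas, and it assumes the column contains only `true`/`false` literals (in any letter case) or empty fields.

```python
import math
from io import StringIO

import pandas as pd


def read_bool_column_as_float(csv_text, column):
    """Hypothetical helper: parse a true/false column into float32, keeping gaps as NaN."""
    # Read the column as text so that no implicit bool-to-float cast happens.
    raw = pd.read_csv(StringIO(csv_text), dtype={column: str})[column]
    # Map the literals explicitly; missing or unrecognized entries become NaN.
    mapping = {"true": 1.0, "false": 0.0}
    return raw.str.lower().map(mapping).astype("float32")


data = "c1,c2\nfalse,1\n,1"
c1 = read_bool_column_as_float(data, "c1")
print(c1)
```

The simpler alternative is the one already shown above: read without a dtype and call `.astype('float32')` afterwards. The explicit mapping just makes the handling of each literal visible.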

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.0.7
pip: 9.0.1
setuptools: 33.1.1.post20170320
Cython: 0.25.2
numpy: 1.13.0
scipy: 0.19.0
xarray: None
IPython: 6.0.0
sphinx: 1.5.5
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.5.3
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
