
Default behavior of read_csv for data values of NA and NaN #10647

Closed

Description

@johne13

I was surprised to learn that the default behavior for reading in a CSV like this:

%more foo.txt
x,y
"NA",NA
"foo",foo

is the following:

In [56]: pd.read_csv('foo.txt')
Out[56]: 
     x    y
0  NaN  NaN
1  foo  foo

I realize this can be changed by using pd.read_csv('foo.txt', keep_default_na=False), but this doesn't seem like ideal default behavior for the following reasons.

  • NA can be a valid value in numerous cases:
    • It is an abbreviation for North America, the element sodium, and lots of other things.
    • It is a possible value for any two-character alphabetic code.
    • NA may well be a valid response (distinct from truly missing) in survey data, even when it literally means something like "not applicable".
  • This behavior is likely to be unexpected in many cases, and dangerous for that reason. Here is a recent example from Stack Overflow. As a counter-example, I checked the behavior of Excel and Stata, and both read NA in as a text value, not as missing.
  • It seems like better practice to handle the interpretation of NA and NaN after reading the data in as text, rather than doing it on the fly; when this behavior is unwanted, it can be very hard to detect after the fact (see the sketch after this list).
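For reference, here is a minimal sketch of the keep_default_na workaround mentioned above, using an in-memory StringIO in place of foo.txt; passing na_values=[''] alongside it is one assumed choice for which cells should still count as missing:

import pandas as pd
from io import StringIO

csv_data = 'x,y\n"NA",NA\n"foo",foo\n'

# Default behavior: both the quoted and unquoted NA become NaN.
print(pd.read_csv(StringIO(csv_data)))

# keep_default_na=False disables the built-in NA string list, so "NA"
# survives as text; na_values=[''] still maps truly empty cells to NaN.
print(pd.read_csv(StringIO(csv_data), keep_default_na=False, na_values=['']))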

Edit to add: At the risk of belaboring the point, note that the following round trip, with either read_csv or from_csv (using all default values), converts 'NA' to NaN:

import pandas as pd

df = pd.DataFrame({'x': ['NA', 'foo']})
df.to_csv('test.csv')
df1 = pd.read_csv('test.csv')            # the string 'NA' comes back as NaN
df2 = pd.DataFrame.from_csv('test.csv')  # 'NA' is likewise converted to NaN
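
For completeness, here is a sketch of one way to read test.csv (written just above) back in without the conversion; index_col=0 is only there because to_csv writes the index by default:

# keep_default_na=False leaves the literal string 'NA' as text on the way back in,
# while na_values=[''] still maps genuinely empty cells to NaN.
df3 = pd.read_csv('test.csv', index_col=0, keep_default_na=False, na_values=[''])
print(df3)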


Labels

IO CSV (read_csv, to_csv), Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
