Closed
Description
I was surprised to learn that default behavior for reading in a CSV like this:
%more foo.txt
x,y
"NA",NA
"foo",foo
is the following:
In [56]: pd.read_csv('foo.txt')
Out[56]:
x y
0 NaN NaN
1 foo foo
I realize this can be changed by using pd.read_csv('foo.txt', keep_default_na=False)
, but this doesn't seem like ideal default behavior for the following reasons.
- NA can be a valid value in numerous cases
- Abbreviation for North America, sodium, lots of other things
- Possible value for any code with two alphabetic characters
- NA might very well be a valid response (distinct from true missing) in survey data even if it literally means something like "not applicable".
- This behavior is likely unexpected in many cases and dangerous for that reason. Here is a recent example from stack overflow. As a counter-example, I checked the behavior of Excel and Stata and both will read in NA as a text value, not as missing.
- It just seems like better practice would be to handle interpretation of NA and NaN after reading in the data as text rather than doing this on the fly. When this behavior is unwanted, it may be very hard to detect after the fact.
Edit to add: At risk of belaboring the point, note that the following round trips with either read_csv
or from_csv
(using all default values) will convert 'NA' to NaN:
df=pd.DataFrame({ 'x':['NA','foo'] })
df.to_csv('test.csv')
df1 = pd.read_csv('test.csv')
df2 = pd.DataFrame.from_csv('test.csv')