Description
Writing a dataframe to CSV using df.to_csv("/path/to/file.csv")
creates an "unnamed" column containing the row index. Repeatedly writing to and reading from this file produces more and more "unnamed" columns. Once there are three unnamed columns, you can no longer use loc to replace values; what's more, it fails silently.
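For reference, the round trip can be reproduced in isolation (here with io.StringIO standing in for the file): by default to_csv writes the row index as an unnamed first column, which read_csv then surfaces as "Unnamed: 0", while passing index_col=0 absorbs it back into the index.

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 501]})

# Default to_csv includes the row index as an unnamed leading column.
buf = StringIO()
df.to_csv(buf)
csv_text = buf.getvalue()

# Reading back naively turns that leading column into "Unnamed: 0".
df2 = pd.read_csv(StringIO(csv_text))
print(list(df2.columns))  # ['Unnamed: 0', 'A', 'B']

# Treating the first column as the index round-trips cleanly instead.
df3 = pd.read_csv(StringIO(csv_text), index_col=0)
print(list(df3.columns))  # ['A', 'B']
```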
In [2]: df = DataFrame({'A': [1,2,3],'B': [4,5,501]})
In [3]: df
Out[3]:
A B
0 1 4
1 2 5
2 3 501
In [4]: df.B.loc[df.B > 500]
Out[4]:
2 501
Name: B, dtype: int64
In [5]: df.B.loc[df.B > 500] = None
In [6]: df
Out[6]:
A B
0 1 4.0
1 2 5.0
2 3 NaN
So far so good: I was able to replace all values over 500 in df.B with NaN. I then write this out and read it back in
In [7]: df.to_csv("./test.csv")
In [8]: df = read_csv("./test.csv")
In [9]: df.columns
Out[9]: Index([u'Unnamed: 0', u'A', u'B'], dtype='object')
In [10]: df
Out[10]:
Unnamed: 0 A B
0 0 1 4.0
1 1 2 5.0
2 2 3 NaN
As you can see, this has created an unnamed column, but let's continue
In [14]: df.B.fillna(501, inplace=True)
This is just to put 501 back in place of the NaN I created earlier, which I forgot to do before writing out
In [15]: df.B
Out[15]:
0 4.0
1 5.0
2 501.0
Name: B, dtype: float64
In [16]: df.B.loc[df.B > 500]
Out[16]:
2 501.0
Name: B, dtype: float64
In [17]: df.B.loc[df.B > 500] = None
...SettingWithCopyWarning...
In [18]: df.B
Out[18]:
0 4.0
1 5.0
2 NaN
Name: B, dtype: float64
Everything is still working fine, apart from the warning
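The warning presumably comes from the chained indexing: df.B first extracts the column as a Series, and the assignment then targets whatever that returned, which may be a copy of the frame's data. A sketch of the single-call form pandas recommends, which selects rows and column in one .loc so the write goes through the frame itself:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 501.0]})

# One .loc call on the frame selects rows and the column together,
# so the assignment writes through df rather than through an
# intermediate Series that may be a copy.
df.loc[df.B > 500, 'B'] = None

print(df.B.tolist())  # [4.0, 5.0, nan]
```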
In [19]: df.fillna(501, inplace=True)
In [20]: df.to_csv("./test.csv")
In [21]: df = read_csv("./test.csv")
In [22]: df.columns
Out[22]: Index([u'Unnamed: 0', u'Unnamed: 0.1', u'A', u'B'], dtype='object')
In [23]: df
Out[23]:
Unnamed: 0 Unnamed: 0.1 A B
0 0 0 1 4.0
1 1 1 2 5.0
2 2 2 3 501.0
Writing and reading again creates a second unnamed column
In [24]: df.B.loc[df.B > 500]
Out[24]:
2 501.0
Name: B, dtype: float64
In [25]: df.B.loc[df.B > 500] = None
In [26]: df.B
Out[26]:
0 4.0
1 5.0
2 NaN
Name: B, dtype: float64
This is no problem, everything still works so far... however
In [27]: df.fillna(501, inplace=True)
In [28]: df
Out[28]:
Unnamed: 0 Unnamed: 0.1 A B
0 0 0 1 4.0
1 1 1 2 5.0
2 2 2 3 501.0
In [29]: df.to_csv("./test.csv")
In [30]: df = read_csv("./test.csv")
In [31]: df.columns
Out[31]: Index([u'Unnamed: 0', u'Unnamed: 0.1', u'Unnamed: 0.1', u'A', u'B'], dtype='object')
In [32]: df
Out[32]:
Unnamed: 0 Unnamed: 0.1 Unnamed: 0.1 A B
0 0 0 0 1 4.0
1 1 1 1 2 5.0
2 2 2 2 3 501.0
We now have three unnamed columns
In [33]: df.B.loc[df.B > 500]
Out[33]:
2 501.0
Name: B, dtype: float64
In [34]: df.B.loc[df.B > 500] = None
In [35]: df.B
Out[35]:
0 4.0
1 5.0
2 501.0
Name: B, dtype: float64
Replacing all values over 500 with NaN no longer works, and it raises no errors or warnings.
You CAN get around this using df.loc[df.B > 500, 'B'] = None,
but obviously you shouldn't have to.
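As an aside, writing with index=False sidesteps the accumulation of unnamed columns in the first place. A sketch, again with StringIO standing in for the file:

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 501]})

# Three write/read cycles; index=False never emits the index column,
# so no "Unnamed" columns accumulate.
for _ in range(3):
    buf = StringIO()
    df.to_csv(buf, index=False)
    df = pd.read_csv(StringIO(buf.getvalue()))

print(list(df.columns))  # ['A', 'B']
```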