Using a namedtuple as a column name for read_csv in Pandas 0.14 results in NaNs being loaded.
Here is a simple demonstration of the problem (this code works in Pandas 0.13.1):
import pandas as pd
from collections import namedtuple
from StringIO import StringIO
TestTuple = namedtuple('test', ['a'])
CSV = """10
20
30"""
pd.read_csv(StringIO(CSV), header=None, names=[TestTuple('foo')],
            tupleize_cols=True)
In Pandas 0.14, this is the output:
(foo,)
0 NaN
1 NaN
2 NaN
Strangely enough, Pandas 0.14 works fine if we use a tuple instead of a namedtuple:
pd.read_csv(StringIO(CSV), header=None, names=[('foo')], tupleize_cols=False)
Here is the output:
foo
0 10
1 20
2 30
So, for some reason, read_csv in Pandas 0.14 doesn't like using a namedtuple as a column name. (The ugly fix is to not pass any column names to read_csv and then, once the DataFrame is loaded, replace the column names with df.columns = [TestTuple('foo')].)
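For reference, here is that workaround spelled out as a minimal sketch; it just reuses the TestTuple and CSV defined above and renames the default integer column after loading:

import pandas as pd
from collections import namedtuple
from StringIO import StringIO

TestTuple = namedtuple('test', ['a'])
CSV = """10
20
30"""

# Load without passing any names, so the values come through intact...
df = pd.read_csv(StringIO(CSV), header=None)
# ...then replace the default integer label with the namedtuple afterwards.
df.columns = [TestTuple('foo')]
print(df)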
(Really love Pandas by the way, thanks so much for all your work!)
My software versions:
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-30-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
pandas: 0.14.0
nose: 1.3.3
Cython: 0.20
numpy: 1.8.1
scipy: 0.13.3
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: 1.2.1
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.2.2
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: 3.3.3
bs4: 4.2.1
html5lib: 0.999
bq: None
apiclient: None
rpy2: 2.3.8
sqlalchemy: None
pymysql: None
psycopg2: 2.5.3 (dt dec pq3 ext)