Strict parsing for read_fwf #17245

Open
@eoghanmurray

Description

I'm importing a fixed-width file that contains two types of records, each with its own field layout.

>>> import pandas
>>> from StringIO import StringIO
>>> print "good:\n", pandas.read_fwf(StringIO('T1001\nT1020'), 
    widths=[2,1,1,1], names=['TYPE', 'A', 'B', 'C'])
good:
  TYPE  A  B  C
0   T1  0  0  1
1   T1  0  2  0
>>> print "good:\n", pandas.read_fwf(StringIO('T2XY\nT2XZ'),  
    widths=[2,1,1], names=['TYPE', 'D', 'E'])
good:
  TYPE  D  E
0   T2  X  Y
1   T2  X  Z
>>> print "silently dropped data from first 2 rows:\n", pandas.read_fwf(StringIO('T1001\nT1020\nT2XY\nT2XZ'), 
    widths=[2,1,1], names=['TYPE', 'D', 'E'])
silently dropped data from first 2 rows:
  TYPE  D  E
0   T1  0  0
1   T1  0  2
2   T2  X  Y
3   T2  X  Z
>>> print "unexpected NaN fields:\n", pandas.read_fwf(StringIO('T1001\nT1020\nT2XY\nT2XZ'), 
    widths=[2,1,1,1], names=['TYPE', 'A', 'B', 'C'])
unexpected NaN fields:
  TYPE  A  B    C
0   T1  0  0  1.0
1   T1  0  2  0.0
2   T2  X  Y  NaN
3   T2  X  Z  NaN
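
The two surprising results above are what plain offset-based slicing would produce. The following is a simplified model of that slicing, not pandas' actual implementation; `slice_fixed_width` is a hypothetical helper written only for illustration.

```python
def slice_fixed_width(line, widths):
    """Cut `line` into fields at fixed offsets, ignoring the line's real length."""
    fields, start = [], 0
    for width in widths:
        # Slicing past the end of a string yields '', never an error.
        fields.append(line[start:start + width])
        start += width
    return fields


# A 5-character line read with a 4-character spec: the trailing '1' is never read.
print(slice_fixed_width('T1001', [2, 1, 1]))    # ['T1', '0', '0']

# A 4-character line read with a 5-character spec: the last field comes back empty.
print(slice_fixed_width('T2XY', [2, 1, 1, 1]))  # ['T2', 'X', 'Y', '']
```

Under this model nothing ever compares the length of the line with the total width of the spec, which would explain why extra characters vanish and missing ones show up as NaN.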

Problem description

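read_fwf silently truncates lines that are longer than the passed-in spec, and fills the missing fields with NaN for lines that are shorter. I expected lines that do not match the spec to raise a 'bad line' error for that line, with an option to skip those lines instead.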

Expected Output

I expected these lines to raise an error, with the error_bad_lines option available to skip the offending lines and emit warnings instead (the warnings could in turn be silenced with warn_bad_lines).

>>> pandas.read_fwf(StringIO('T1001\nT1020\nT2XY\nT2XZ'), 
    widths=[2,1,1], names=['TYPE', 'D', 'E'])
ParserError: Error tokenizing data. Expected 4 characters in line 1, saw 5

>>> pandas.read_fwf(StringIO('T1001\nT1020\nT2XY\nT2XZ'), 
    widths=[2,1,1,1], names=['TYPE', 'A', 'B', 'C'])
ParserError: Error tokenizing data. Expected 5 characters in line 3, saw 4

>>> pandas.read_fwf(StringIO('T1001\nT1020\nT2XY\nT2XZ'), 
    widths=[2,1,1], names=['TYPE', 'D', 'E'], error_bad_lines=False)
Skipping Line 1: expected 4 characters, saw 5
Skipping Line 2: expected 4 characters, saw 5

  TYPE  D  E
0   T2  X  Y
1   T2  X  Z

>>> pandas.read_fwf(StringIO('T1001\nT1020\nT2XY\nT2XZ'), 
    widths=[2,1,1,1], names=['TYPE', 'A', 'B', 'C'], error_bad_lines=False)
Skipping Line 3: expected 5 characters, saw 4
Skipping Line 4: expected 5 characters, saw 4

  TYPE  A  B  C
0   T1  0  0  1
1   T1  0  2  0
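
Until something like this exists, the requested behaviour can be approximated by filtering lines before they reach read_fwf. This is a minimal sketch, assuming that a "bad line" is any line whose length differs from `sum(widths)`; `filter_fixed_width_lines` is a hypothetical helper, not part of pandas, and its messages simply mirror the wording proposed above.

```python
def filter_fixed_width_lines(text, widths):
    """Keep only lines whose length equals sum(widths).

    Returns (kept_text, warnings), where warnings has one message per
    skipped line, numbered from 1.
    """
    expected = sum(widths)
    kept, warnings = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) == expected:
            kept.append(line)
        else:
            warnings.append('Skipping Line %d: expected %d characters, saw %d'
                            % (number, expected, len(line)))
    return '\n'.join(kept), warnings


kept, warnings = filter_fixed_width_lines('T1001\nT1020\nT2XY\nT2XZ', [2, 1, 1])
for message in warnings:
    print(message)
print(kept)
```

The `kept` string can then be wrapped in StringIO and passed to pandas.read_fwf with the same widths and names. An exact-length comparison is strict: a file whose trailing spaces have been stripped would have its short lines rejected, so the check may need loosening for such data.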

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-87-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_IE.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 2.9.1
pip: 9.0.1
setuptools: 27.3.0
Cython: None
numpy: 1.13.1
scipy: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 1.5
pytz: 2014.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.1.0b1
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.7.2
s3fs: None
pandas_gbq: None
pandas_datareader: None
