Skip to content

Column inference in read_fwf() has unexpected outcome for sparse files #14945

Open
@jonemo

Description

@jonemo

Code Sample

import io
import pandas as pd

filecontent = u"""
Qty Value                                Parts                        Description                       MAN_PART_NO        VENDOR
1   200kOhm                              R5                           Resistor                          MCR01MRTF2003      Digikey
5   FACE_BOARD_CONNECTOR                 U$10, U$11, U$12, U$13, U$14                                   1-84981-0          Digikey
1   I2C_VOLTAGE_LEVEL_TRANSLATOR         U$27                                                           PCA9306DCUR        Digikey
2   LED                                  GREEN, ORGNE                 Small inidicator LED              SML-P11MTT86       Digikey
1   MEGA164A/PA/324A/PA/644A/PA/1284/PAU U1                           TQFP44 package of Atmega1284      ATMEGA1284-AUR     Digikey
1   POWER_REG5.0V-SOT23                  VR5.0                        Linear Power Regulator            MIC5219-5.0YM5 TR  Digikey
2   RESISTOR_ARRAY                       RA1, RA2                                                       CRA04S08310K0JTD   Digikey
"""

print pd.read_fwf(io.StringIO(filecontent), colspecs='infer', skiprows=1)

Problem description

In electronics, some older CAD softwares export the Bill of Materials (BOM) document as fixed width text file. Importing such files using pd.read_fwf(..., colspec='infer') often results in unexpected outcomes. This is because the colspec inference isn't written with files in mind that exhibit the following two properties:

  • high variance in content length of cells in a column
  • space (or other delimiter) characters within the cell content

The code sample uses a simplified example BOM document from an open source electronics project that exhibits these properties: Note that the "Parts" column is split up into multiple columns.

Actual Output

   Qty                                 Value              Parts Unnamed: 3  \
0    1                               200kOhm                 R5        NaN   
1    5                  FACE_BOARD_CONNECTOR  U$10, U$11, U$12,      U$13,   
2    1          I2C_VOLTAGE_LEVEL_TRANSLATOR               U$27        NaN   
3    2                                   LED       GREEN, ORGNE        NaN   
4    1  MEGA164A/PA/324A/PA/644A/PA/1284/PAU                 U1        NaN   
5    1                   POWER_REG5.0V-SOT23              VR5.0        NaN   
6    2                        RESISTOR_ARRAY           RA1, RA2        NaN   

  Unnamed: 4                   Description        MAN_PART_NO   VENDOR  
0        NaN                      Resistor      MCR01MRTF2003  Digikey  
1       U$14                           NaN          1-84981-0  Digikey  
2        NaN                           NaN        PCA9306DCUR  Digikey  
3        NaN          Small inidicator LED       SML-P11MTT86  Digikey  
4        NaN  TQFP44 package of Atmega1284     ATMEGA1284-AUR  Digikey  
5        NaN        Linear Power Regulator  MIC5219-5.0YM5 TR  Digikey  
6        NaN                           NaN   CRA04S08310K0JTD  Digikey

Expected Output

   Qty                                 Value                         Parts   \
0    1                               200kOhm                            R5  
1    5                  FACE_BOARD_CONNECTOR  U$10, U$11, U$12, U$13, U$14   
2    1          I2C_VOLTAGE_LEVEL_TRANSLATOR                          U$27   
3    2                                   LED                  GREEN, ORGNE   
4    1  MEGA164A/PA/324A/PA/644A/PA/1284/PAU                            U1   
5    1                   POWER_REG5.0V-SOT23                         VR5.0   
6    2                        RESISTOR_ARRAY                      RA1, RA2   

                    Description        MAN_PART_NO   VENDOR  
0                      Resistor      MCR01MRTF2003  Digikey  
1                           NaN          1-84981-0  Digikey  
2                           NaN        PCA9306DCUR  Digikey  
3          Small inidicator LED       SML-P11MTT86  Digikey  
4  TQFP44 package of Atmega1284     ATMEGA1284-AUR  Digikey  
5        Linear Power Regulator  MIC5219-5.0YM5 TR  Digikey  
6                           NaN   CRA04S08310K0JTD  Digikey

Possible Improvements

Right now any set of neighboring columns of the text file that contain delimiter characters only across all rows is considered a boundary. This is prone to include false positives. Possible alternative strategies for colspec inference:

  • Require more than one column to transition from delimiter to non-delimiter character. A possible cutoff might be the median number of occurrences of this transition across all columns of the file.
  • Consider the header row (if present) separately from the content rows and give it additional weight.

I realize that this use case falls outside the usual/common use cases of pandas and wouldn't be surprised if this gets triaged as #wontfix or if my suggested improvements interfere with more common use cases that I am not aware of. If maintainers agree that my suggestions would add value, I'd be happy to contribute the PR.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 32.1.2
Cython: None
numpy: 1.11.3
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions