Description
Code Sample
import io
import pandas as pd
filecontent = u"""
Qty Value Parts Description MAN_PART_NO VENDOR
1 200kOhm R5 Resistor MCR01MRTF2003 Digikey
5 FACE_BOARD_CONNECTOR U$10, U$11, U$12, U$13, U$14 1-84981-0 Digikey
1 I2C_VOLTAGE_LEVEL_TRANSLATOR U$27 PCA9306DCUR Digikey
2 LED GREEN, ORGNE Small inidicator LED SML-P11MTT86 Digikey
1 MEGA164A/PA/324A/PA/644A/PA/1284/PAU U1 TQFP44 package of Atmega1284 ATMEGA1284-AUR Digikey
1 POWER_REG5.0V-SOT23 VR5.0 Linear Power Regulator MIC5219-5.0YM5 TR Digikey
2 RESISTOR_ARRAY RA1, RA2 CRA04S08310K0JTD Digikey
"""
print pd.read_fwf(io.StringIO(filecontent), colspecs='infer', skiprows=1)
Problem description
In electronics, some older CAD softwares export the Bill of Materials (BOM) document as fixed width text file. Importing such files using pd.read_fwf(..., colspec='infer')
often results in unexpected outcomes. This is because the colspec inference isn't written with files in mind that exhibit the following two properties:
- high variance in content length of cells in a column
- space (or other delimiter) characters within the cell content
The code sample uses a simplified example BOM document from an open source electronics project that exhibits these properties: Note that the "Parts" column is split up into multiple columns.
Actual Output
Qty Value Parts Unnamed: 3 \
0 1 200kOhm R5 NaN
1 5 FACE_BOARD_CONNECTOR U$10, U$11, U$12, U$13,
2 1 I2C_VOLTAGE_LEVEL_TRANSLATOR U$27 NaN
3 2 LED GREEN, ORGNE NaN
4 1 MEGA164A/PA/324A/PA/644A/PA/1284/PAU U1 NaN
5 1 POWER_REG5.0V-SOT23 VR5.0 NaN
6 2 RESISTOR_ARRAY RA1, RA2 NaN
Unnamed: 4 Description MAN_PART_NO VENDOR
0 NaN Resistor MCR01MRTF2003 Digikey
1 U$14 NaN 1-84981-0 Digikey
2 NaN NaN PCA9306DCUR Digikey
3 NaN Small inidicator LED SML-P11MTT86 Digikey
4 NaN TQFP44 package of Atmega1284 ATMEGA1284-AUR Digikey
5 NaN Linear Power Regulator MIC5219-5.0YM5 TR Digikey
6 NaN NaN CRA04S08310K0JTD Digikey
Expected Output
Qty Value Parts \
0 1 200kOhm R5
1 5 FACE_BOARD_CONNECTOR U$10, U$11, U$12, U$13, U$14
2 1 I2C_VOLTAGE_LEVEL_TRANSLATOR U$27
3 2 LED GREEN, ORGNE
4 1 MEGA164A/PA/324A/PA/644A/PA/1284/PAU U1
5 1 POWER_REG5.0V-SOT23 VR5.0
6 2 RESISTOR_ARRAY RA1, RA2
Description MAN_PART_NO VENDOR
0 Resistor MCR01MRTF2003 Digikey
1 NaN 1-84981-0 Digikey
2 NaN PCA9306DCUR Digikey
3 Small inidicator LED SML-P11MTT86 Digikey
4 TQFP44 package of Atmega1284 ATMEGA1284-AUR Digikey
5 Linear Power Regulator MIC5219-5.0YM5 TR Digikey
6 NaN CRA04S08310K0JTD Digikey
Possible Improvements
Right now any set of neighboring columns of the text file that contain delimiter characters only across all rows is considered a boundary. This is prone to include false positives. Possible alternative strategies for colspec inference:
- Require more than one column to transition from delimiter to non-delimiter character. A possible cutoff might be the median number of occurrences of this transition across all columns of the file.
- Consider the header row (if present) separately from the content rows and give it additional weight.
I realize that this use case falls outside the usual/common use cases of pandas and wouldn't be surprised if this gets triaged as #wontfix or if my suggested improvements interfere with more common use cases that I am not aware of. If maintainers agree that my suggestions would add value, I'd be happy to contribute the PR.
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.1
nose: None
pip: 9.0.1
setuptools: 32.1.2
Cython: None
numpy: 1.11.3
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None