@@ -1373,8 +1373,7 @@ Files with fixed width columns
 
 While :func:`read_csv` reads delimited data, the :func:`read_fwf` function works
 with data files that have known and fixed column widths. The function parameters
-to ``read_fwf`` are largely the same as ``read_csv`` with two extra parameters, and
-a different usage of the ``delimiter`` parameter:
+to ``read_fwf`` are largely the same as ``read_csv`` with four extra parameters:
 
 * ``colspecs``: A list of pairs (tuples) giving the extents of the
   fixed-width fields of each line as half-open intervals (i.e., [from, to[ ).
@@ -1383,47 +1382,65 @@ a different usage of the ``delimiter`` parameter:
   behavior, if not specified, is to infer.
 * ``widths``: A list of field widths which can be used instead of 'colspecs'
   if the intervals are contiguous.
-* ``delimiter``: Characters to consider as filler characters in the fixed-width file.
-  Can be used to specify the filler character of the fields
-  if it is not spaces (e.g., '~').
+* ``keep_whitespace``: A boolean (default True). When False, leading and trailing
+  whitespace is stripped from each field.
+* ``whitespace_chars``: A string of characters to treat as whitespace when
+  ``keep_whitespace`` is False. Defaults to [space] and [tab] characters.
 
 Consider a typical fixed-width data file:
 
 .. ipython:: python
 
    data1 = (
-       "id8141 360.242940 149.910199 11950.7\n"
-       "id1594 444.953632 166.985655 11788.4\n"
-       "id1849 364.136849 183.628767 11806.2\n"
-       "id1230 413.836124 184.375703 11916.8\n"
-       "id1948 502.953953 173.237159 12468.3"
+       "Amy BBYBC 38BC1052AF____test_1_____\n"
+       "Bob VANBC 7290603ED _ _test_2__ _ \n"
+       "ChrisVICBC 0005473D1B N/A \n"
+       "Dave KAMBC 315395AC $150.00\n"
    )
-   with open("bar.csv", "w") as f:
+   with open("bar.dat", "w") as f:
        f.write(data1)
 
 In order to parse this file into a ``DataFrame``, we simply need to supply the
-column specifications to the ``read_fwf`` function along with the file name:
+column specifications (or widths) to the ``read_fwf`` function along with the file name:
 
 .. ipython:: python
 
-   # Column specifications are a list of half-intervals
-   colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
-   df = pd.read_fwf("bar.csv", colspecs=colspecs, header=None, index_col=0)
+   df = pd.read_fwf("bar.dat",
+       # Column specifications are a list of half-intervals
+       # colspecs=[(0,5), (5, 8), (8,10), (11,22), (22,37)],
+       widths=[5, 3, 2, 12, 15],
+       names=["fname", "city", "prov", "month_$_flags", "test_whitespace"],
+       header=None,
+       index_col=None,
+       # Do not convert "N/A" to NaN:
+       keep_default_na=False,
+   )
    df
+   df.values
 
-Note how the parser automatically picks column names X.<column number> when
-``header=None`` argument is specified. Alternatively, you can supply just the
-column widths for contiguous columns:
+Note that ``names`` supplies the column names. Alternatively, the column names can be
+read from the first row of data with the ``header=0`` option.
+Otherwise, the parser automatically assigns column numbers as column names.
 
-.. ipython:: python
+Also note that whitespace has been preserved inside the fields. To remove whitespace
+from the beginning and end of fields, use ``keep_whitespace=False`` and, optionally,
+specify ``whitespace_chars`` if other than the default ([space] and [tab] characters):
 
-   # Widths are a list of integers
-   widths = [6, 14, 13, 10]
-   df = pd.read_fwf("bar.csv", widths=widths, header=None)
-   df
+.. ipython:: python
 
-The parser will take care of extra white spaces around the columns
-so it's ok to have extra separation between the columns in the file.
+   df = pd.read_fwf("bar.dat",
+       # Column specifications are a list of half-intervals
+       # colspecs=[(0,5), (5, 8), (8,10), (11,22), (22,37)],
+       widths=[5, 3, 2, 12, 15],
+       names=["fname", "city", "prov", "month_$_flags", "test_whitespace"],
+       header=None,
+       index_col=None,
+       # Do not convert "N/A" to NaN:
+       keep_default_na=False,
+       keep_whitespace=False,
+       whitespace_chars=" _",
+   )
+   df.values
 
 By default, ``read_fwf`` will try to infer the file's ``colspecs`` by using the
 first 100 rows of the file. It can do it only in cases when the columns are
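The relationship between ``widths``, ``colspecs``, and the proposed whitespace handling in the hunk above can be illustrated with a small pure-Python sketch. This is not the pandas implementation: `widths_to_colspecs` and `split_fixed_width` are hypothetical helpers, the sample line is padded by assumption to exactly 37 characters, and `strip_chars` only approximates what `keep_whitespace=False` with `whitespace_chars` is described as doing.

```python
from itertools import accumulate


def widths_to_colspecs(widths):
    """Turn contiguous field widths into half-open (start, end) pairs."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


def split_fixed_width(line, colspecs, strip_chars=None):
    """Slice one line by half-open intervals.

    With strip_chars=None every field is kept verbatim. Otherwise the given
    characters are stripped from both ends of each field (a hypothetical
    stand-in for keep_whitespace=False plus whitespace_chars).
    """
    fields = [line[start:end] for start, end in colspecs]
    if strip_chars is not None:
        fields = [field.strip(strip_chars) for field in fields]
    return fields


colspecs = widths_to_colspecs([5, 3, 2, 12, 15])

# Hypothetical sample line, assembled field by field so the padding is explicit.
line = "Amy  " + "BBY" + "BC" + "  38BC1052AF" + "____test_1_____"

raw = split_fixed_width(line, colspecs)
stripped = split_fixed_width(line, colspecs, strip_chars=" _")

print(colspecs)
print(raw)
print(stripped)
```

Because each interval is half-open, the end of one field is the start of the next, which is why a plain list of widths is enough whenever the fields are contiguous. Stripping only touches the ends of a field, so the underscore inside `test_1` survives.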
@@ -1432,16 +1449,16 @@ is whitespace).
 
 .. ipython:: python
 
-   df = pd.read_fwf("bar.csv", header=None, index_col=0)
+   df = pd.read_fwf("bar.dat", header=None, index_col=0)
    df
 
 ``read_fwf`` supports the ``dtype`` parameter for specifying the types of
 parsed columns to be different from the inferred type.
 
 .. ipython:: python
 
-   pd.read_fwf("bar.csv", header=None, index_col=0).dtypes
-   pd.read_fwf("bar.csv", header=None, dtype={2: "object"}).dtypes
+   pd.read_fwf("bar.dat", header=None, index_col=0).dtypes
+   pd.read_fwf("bar.dat", header=None, dtype={2: "object"}).dtypes
 
 .. ipython:: python
    :suppress:
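The documentation text in this diff says that ``colspecs`` can be inferred only when the columns are aligned and separated by whitespace. A rough pure-Python sketch of that idea follows. It is an illustration under stated assumptions, not the pandas code: `infer_colspecs` is a hypothetical helper, the sample rows are made up, and the approach (mark every character position that holds a non-delimiter character in any sampled row, then read off the runs) is one plausible way to do it.

```python
def infer_colspecs(rows, delimiter=" \t"):
    """Infer half-open column extents from aligned sample rows."""
    width = max(len(row) for row in rows)
    # occupied[i] is True if any row has a non-delimiter character at index i.
    # The extra trailing slot stays False so the last run is always closed.
    occupied = [False] * (width + 1)
    for row in rows:
        for i, ch in enumerate(row):
            if ch not in delimiter:
                occupied[i] = True

    colspecs = []
    start = None
    for i, flag in enumerate(occupied):
        if flag and start is None:
            start = i  # a run of occupied positions begins
        elif not flag and start is not None:
            colspecs.append((start, i))  # the run ends just before i
            start = None
    return colspecs


# Hypothetical aligned rows, padded explicitly so the spacing is unambiguous.
rows = [
    "Amy".ljust(6) + "BBY".ljust(5) + "38".rjust(2),
    "Chris".ljust(6) + "VIC".ljust(5) + "7".rjust(2),
]
aligned = infer_colspecs(rows)

# Fields that touch each other cannot be told apart by this approach.
touching = infer_colspecs(["ChrisVICBC"])

print(aligned)
print(touching)
```

The second call shows why inference is of limited use for a file like the new example data, where adjacent fields such as `ChrisVICBC` have no separator between them: explicit ``widths`` or ``colspecs`` are needed there.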