API/ENH: read_csv handling of bad lines (too many/few fields) #15122

Currently `read_csv` has some ways to deal with "bad lines" (bad in the sense of too many or too few fields compared to the determined number of columns):

- by default, it will error for too many fields, and fill with NaNs for too few fields
- with `error_bad_lines=False`, rows with too many fields will be dropped instead of raising an error (and in that case, `warn_bad_lines` controls whether a warning is emitted)
- with `usecols` you can select certain columns, and in this way deal with rows with too many fields.

Some possibilities are missing in this scheme:

- "process" bad lines with too many fields, i.e. drop the excess fields instead of either raising an error or dropping the full row (discussed in #9549)
- getting a warning or error with too few fields instead of automatically filling with NaNs (asked for in #9729), or dropping those rows

Apart from that, https://github.com/pandas-dev/pandas/issues/5686 makes the request to be able to specify a custom function to process a bad line, to have even more control.
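To make the custom-function idea concrete, here is a minimal sketch using only the standard-library `csv` module, not pandas. The names `parse_with_handler` and `merge_extra_into_last` and the handler signature are hypothetical, made up for illustration; nothing like this exists in `read_csv` as described in this issue. The handler receives the fields of a bad line and either returns a repaired row or `None` to drop the line.

```python
import csv
import io


def parse_with_handler(text, handler):
    """Hypothetical sketch: call `handler` for every row of unexpected width."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for fields in reader:
        if len(fields) != len(header):
            # Let the user decide: return a repaired row, or None to drop it.
            fields = handler(fields, len(header))
            if fields is None:
                continue
        rows.append(fields)
    return header, rows


def merge_extra_into_last(fields, width):
    """Example handler: drop short rows, fold excess fields into the last column."""
    if len(fields) < width:
        return None
    return fields[: width - 1] + [",".join(fields[width - 1:])]


data = "name,city,note\nann,Oslo,ok\nbob,Rome,late,again\ncat,Bern\n"
header, rows = parse_with_handler(data, merge_extra_into_last)
print(rows)
```

With this handler the row with four fields is kept with its last two fields merged, and the row with two fields is dropped.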

In https://github.com/pandas-dev/pandas/issues/9549#issuecomment-76498787 (and surrounding comments) there was some discussion about how to integrate this. One idea from there, from @jreback and @selasley:

Provide more fine-grained control in a new keyword (and deprecate `error_bad_lines`):

```
bad_lines='error'|'warn'|'skip'|'process'
```

or leave out `'warn'` and keep `warn_bad_lines` to be able to combine a warning with both `'skip'` and `'process'`.
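As a rough illustration of what the four values could mean for rows with too many fields, here is a small sketch built on the standard-library `csv` module. The function `parse_rows` is hypothetical and is not part of pandas. It also assumes that `'warn'` means "skip the row and emit a warning", which the discussion does not pin down. Rows with too few fields are passed through unchanged here.

```python
import csv
import io
import warnings


def parse_rows(text, bad_lines="error"):
    """Hypothetical sketch of a bad_lines policy for rows with too many fields."""
    if bad_lines not in {"error", "warn", "skip", "process"}:
        raise ValueError(f"unknown bad_lines policy: {bad_lines!r}")
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    width = len(header)
    rows = []
    for fields in reader:
        if len(fields) > width:
            msg = f"line {reader.line_num}: expected {width} fields, saw {len(fields)}"
            if bad_lines == "error":
                raise ValueError(msg)
            if bad_lines == "warn":
                # Assumption: 'warn' drops the row after warning about it.
                warnings.warn(msg)
                continue
            if bad_lines == "skip":
                continue
            # 'process': keep the row but drop the excess fields.
            fields = fields[:width]
        rows.append(fields)
    return header, rows


data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"
print(parse_rows(data, bad_lines="skip")[1])
print(parse_rows(data, bad_lines="process")[1])
```

Keeping `warn_bad_lines` as a separate flag, the alternative mentioned above, would replace the `'warn'` branch with a check inside the `'skip'` and `'process'` branches.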

We should further think about whether we can integrate this with the case of too few fields and not only too many.

I think it would be nice to have some better control here, but we should think a bit about the best API for this.
 
