API/ENH: read_csv handling of bad lines (too many/few fields) #15122

Closed
@jorisvandenbossche

Currently read_csv has some ways to deal with "bad lines" (bad in the sense of too many or too few fields compared to the determined number of columns):

  • by default, it raises an error for rows with too many fields, and fills rows with too few fields with NaNs
  • with error_bad_lines=False, rows with too many fields are dropped instead of raising an error (and in that case, warn_bad_lines controls whether a warning is emitted)
  • with usecols you can select certain columns, and in this way deal with rows with too many fields.
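
To make the first bullet concrete, here is a minimal sketch of the two default behaviours. It only relies on read_csv defaults and pandas.errors.ParserError; the sample data and variable names are made up for illustration.

```python
import io

import pandas as pd
from pandas.errors import ParserError

# Too few fields: the short row is kept and the missing value becomes NaN.
too_few = io.StringIO("a,b,c\n1,2,3\n4,5\n")
df = pd.read_csv(too_few)
print(df)

# Too many fields: with default settings the parser raises.
too_many = io.StringIO("a,b,c\n1,2,3\n4,5,6,7\n")
try:
    pd.read_csv(too_many)
    raised = False
except ParserError as exc:
    raised = True
    print("ParserError:", exc)
```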

Some possibilities are missing in this scheme:

Apart from that, #5686 makes the request to be able to specify a custom function to process a bad line, to have even more control.

In #9549 (comment) (and surrounding comments) there was some discussion about how to integrate this, and some ideas from @jreback and @selasley:

Provide more fine grained control in a new keyword (and deprecate error_bad_lines):

bad_lines='error'|'warn'|'skip'|'process'

or leave out 'warn' and keep warn_bad_lines to be able to combine a warning with both 'skip' and 'process'.
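
To help discuss the semantics, here is a rough pure-Python sketch of what the four modes could mean. This is not pandas code: read_rows and its process argument are hypothetical names, 'warn' is assumed to mean "skip with a warning", and rows with too few fields are padded to mirror the current NaN filling.

```python
import csv
import io
import warnings


def read_rows(text, bad_lines="error", process=None):
    """Hypothetical reader illustrating the proposed modes (not pandas API)."""
    if bad_lines not in {"error", "warn", "skip", "process"}:
        raise ValueError(f"unknown bad_lines value: {bad_lines!r}")
    if bad_lines == "process" and process is None:
        raise ValueError("bad_lines='process' needs a callable")

    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    width = len(header)
    rows = []
    for lineno, row in enumerate(reader, start=2):
        if len(row) <= width:
            # Too few fields: pad with None, like the current NaN filling.
            rows.append(row + [None] * (width - len(row)))
            continue
        # From here on the row has too many fields.
        if bad_lines == "error":
            raise ValueError(
                f"line {lineno}: expected {width} fields, saw {len(row)}"
            )
        if bad_lines == "process":
            fixed = process(row)
            # Assumption: returning None from the callable drops the row.
            if fixed is not None:
                rows.append(fixed)
            continue
        if bad_lines == "warn":
            warnings.warn(f"skipping line {lineno}: saw {len(row)} fields")
        # 'warn' and 'skip' both drop the row.
    return header, rows


data = "a,b,c\n1,2,3\n4,5\n6,7,8,9\n"
header, skipped = read_rows(data, bad_lines="skip")
_, processed = read_rows(data, bad_lines="process", process=lambda r: r[:3])
print(skipped)
print(processed)
```

With 'process' the callable decides what to do with the offending row (here: truncate to the first three fields), which is roughly the request in #5686.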

We should further think about whether we can integrate this with the case of too few fields and not only too many.

I think it would be nice to have some better control here, but we should think a bit about the best API for this.

Labels: Deprecate (Functionality to remove in pandas), IO CSV (read_csv, to_csv)