encoding/csv: "LazyQuotes" not lazy enough, need an "IgnoreQuotes" mode for highly messy csv input #3150

Closed
@gopherbot

by philipp.schumann:

---What steps will reproduce the problem?---

1. Download and extract http://download.geonames.org/export/dump/allCountries.zip --
contains a single tab-separated file about 940MB

2. Set up a TSV reader in Go:

        tsv = csv.NewReader(txtFile)
        tsv.Comma = '\t'
        tsv.Comment = '#'
        tsv.LazyQuotes = true
        tsv.TrailingComma = true // retain rather than remove empty slots
        tsv.TrimLeadingSpace = false // keep leading whitespace inside fields

3. Iterate through the records returned by tsv.Read() (after each read, set
tsv.FieldsPerRecord = 0) until line 2293755 of the file, which begins like this:

3376027 ”S” Falls   "S" Falls     4.533......
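
For reference, steps 2 and 3 can be sketched roughly as below. This is only a sketch under assumptions, not the exact program that produced the report: the helper names newTSVReader, readRecords and readString are made up for illustration, TrailingComma is left out because newer Go releases ignore it, and the loop collects records in memory, which is only sensible for small samples rather than the full dump.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"strings"
)

// newTSVReader (hypothetical helper) applies the reader settings from step 2.
func newTSVReader(r io.Reader) *csv.Reader {
	tsv := csv.NewReader(r)
	tsv.Comma = '\t'
	tsv.Comment = '#'
	tsv.LazyQuotes = true
	tsv.TrimLeadingSpace = false
	return tsv
}

// readRecords (hypothetical helper) reads until EOF and resets
// FieldsPerRecord after every read, so rows may differ in length.
func readRecords(r io.Reader) ([][]string, error) {
	tsv := newTSVReader(r)
	var records [][]string
	for {
		rec, err := tsv.Read()
		if err == io.EOF {
			return records, nil
		}
		if err != nil {
			return records, err
		}
		records = append(records, rec)
		tsv.FieldsPerRecord = 0
	}
}

// readString is a convenience wrapper for trying small in-memory samples.
func readString(data string) ([][]string, error) {
	return readRecords(strings.NewReader(data))
}

func main() {
	// Made-up sample input without any quote characters.
	recs, err := readString("1\ta\tb\n2\tc\n")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for i, rec := range recs {
		fmt.Printf("record %d: %d fields\n", i+1, len(rec))
	}
}
```

To run it against the real file, pass the opened file to readRecords instead of going through readString, and process each record inside the loop rather than appending it.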

---What is the expected result?---

With LazyQuotes set, the reader should return a string array containing the
tab-separated items of only this line.

---What do you see instead?---

The reader packs all fields of the current line, starting from field 2 (counting
0-based), PLUS all subsequent lines up to line 3043730 (record 6489131 B&B "a
Casa di Griffi"    B&B "a Casa di Griffi"    ...), into a single 91MB
string value.

6g, weekly.2012-02-22 under openSuSE 12.1 64bit.

NOTES: I'm guessing this is due to mismatched quote characters, so this behaviour
may well be caused by non-sanitized input data. However, such is the nature of 99.9%
of real-world CSV files out there. If I have to run custom code to scan and sanitize
this 950MB file myself before feeding it to encoding/csv, I might as well parse it
manually in the first place. Ideally, the csv package would offer an "IgnoreQuotes"
mode for use-cases where I *know* that there are no multi-line records and where I
*know* that all newlines and commas (or tabs in this case) absolutely *need* to take
the strictest precedence over any quotes that may or may not appear in the data. Those
quotes should be taken as they are: nobody ever bothered to escape them properly, and
they are 100% part of the data, not record or field delimiters.
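
To make the request concrete, here is a minimal sketch of the behaviour an "IgnoreQuotes" mode would give, written as the manual parsing mentioned above. It is not part of encoding/csv: splitLine and readTSV are hypothetical names, the 1 MiB line limit is an arbitrary assumption, and the sample line in main is a shortened stand-in for the record quoted earlier. Every newline ends a record, every tab ends a field, and quote characters are passed through as ordinary data.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// splitLine (hypothetical helper) splits on every tab and leaves any
// quote characters in the fields untouched. Empty fields are kept.
func splitLine(line string) []string {
	return strings.Split(line, "\t")
}

// readTSV (hypothetical helper) calls fn once per line-based record.
// Like the csv.Reader settings in the report, it skips empty lines
// and lines starting with '#'.
func readTSV(r io.Reader, fn func(fields []string) error) error {
	sc := bufio.NewScanner(r)
	// Assumed limit: allow single lines of up to 1 MiB.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	for sc.Scan() {
		line := sc.Text()
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		if err := fn(splitLine(line)); err != nil {
			return err
		}
	}
	return sc.Err()
}

func main() {
	// Shortened, made-up sample resembling the problematic record.
	sample := "# comment\n3376027\t\"S\" Falls\t\"S\" Falls\t\t4.533\n"
	err := readTSV(strings.NewReader(sample), func(fields []string) error {
		fmt.Printf("%d fields: %q\n", len(fields), fields)
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

The trade-off is the one stated in the notes: this only works when records never span multiple lines and the delimiter never appears inside a field.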
