Closed
Description
by philipp.schumann:
---What steps will reproduce the problem?---
1. Download and extract http://download.geonames.org/export/dump/allCountries.zip. It contains a single tab-separated file of about 940MB.
2. Set up a TSV reader in Go:
    tsv = csv.NewReader(txtFile)
    tsv.Comma = '\t'
    tsv.Comment = '#'
    tsv.LazyQuotes = true
    tsv.TrailingComma = true // retain rather than remove empty slots
    tsv.TrimLeadingSpace = false // retain rather than remove empty slots
3. Iterate through the records returned by tsv.Read() (after each read, set tsv.FieldsPerRecord = 0) until the file's line 2293755, which begins like this:
    3376027 ”S” Falls "S" Falls 4.533......

---What is the expected result?---
With LazyQuotes set, the reader should return a slice of strings containing the tab-separated fields of this line only.

---What do you see instead?---
The reader packs all fields of the current line, starting from field 2 (counting 0-based), PLUS all subsequent lines up to line 3043730 (record 6489131 B&B "a Casa di Griffi" B&B "a Casa di Griffi" ...) into a single 91MB string value.

6g, weekly.2012-02-22 under openSuSE 12.1 64bit.

NOTES: I am guessing this is caused by mismatched quote characters, so the behaviour may well be due to non-sanitized input data. However, such is the nature of 99.9% of real-world CSV files out there. If I have to run custom code to scan and sanitize this 950MB file myself before feeding it to encoding/csv, I can just parse it manually in the first place.

Ideally, the csv package would offer an "IgnoreQuotes" mode for use cases where I *know* that there are no multi-line records, and where I *know* that all newlines and commas (or tabs, in this case) absolutely *need* to take the strictest precedence over any quotes that may appear in the data. Those quotes should be taken as they are, since nobody ever bothered to escape them properly: they are 100% part of the data, not record or field delimiters...