
Segmentation fault or UnicodeDecodeError when reading csv-file depending on chunksize. #5291


Description

@Hedendahl

I have encountered an issue with the csv parser, pandas.io.parsers.read_csv. I get a segmentation fault or a UnicodeDecodeError when reading a csv-file in chunks, and the problem seems to depend on the size of the chunks.
Consider the following code:

# Python 2 reproduction script (uses xrange and the print statement).
import codecs
import csv
import pandas as pd


def create_csv_file(columns, rows):
    # Write `rows` lines, each containing `columns` identical float values.
    csv_file_name = 'csv_test_file.csv'
    with codecs.open(csv_file_name, mode='w', encoding='utf_8') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',')

        for row in xrange(rows):
            csv_writer.writerow(
                [float(row)] * columns)

    return csv_file_name


def main():
    """
    """
    columns = 20
    rows = 10000
    chunksize = 999
    csv_file_name = create_csv_file(columns, rows)
    reader = pd.io.parsers.read_csv(csv_file_name,
                                    header=None,
                                    chunksize=chunksize,
                                    encoding='utf_8')

    for x, dataframe in enumerate(reader, 1):
        # Report progress: rows requested so far (the last chunk may be shorter).
        print x * chunksize


if __name__ == "__main__":
    main()

The attached code produces a segmentation fault when the chunksize is 999 rows. If the chunksize is decreased to 998 rows, I instead get a UnicodeDecodeError. If the chunksize is increased to 1000 rows, the file is read without any problems. My first guess was that the problem appears when the last chunk contains too few rows, but I was surprised to find that reading the csv-file with the following settings,

    columns = 20
    rows = 1000
    chunksize = 99

worked properly, even though the last chunk is equally short in that case (1000 = 10 * 99 + 10 rows, just as 10000 = 10 * 999 + 10 rows).
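
Since a segmentation fault kills the interpreter outright, it cannot be caught with try/except. As a rough sketch (assuming the script above is saved as repro.py and modified to take the chunksize from sys.argv[1]; both the file name and the argument are my additions, not part of the original script), one can sweep chunk sizes in separate child processes so that a crash in one run does not abort the sweep:

import subprocess
import sys

# Sketch only: repro.py and its sys.argv[1] parameter are assumed here.
# Each chunksize runs in its own child process, so a segfault only kills
# that child, and the sweep continues with the next chunksize.
for chunksize in (997, 998, 999, 1000, 1001):
    status = subprocess.call([sys.executable, 'repro.py', str(chunksize)])
    # A negative status means the child died from a signal (e.g. -11 is
    # SIGSEGV on Linux); any other non-zero status means a Python exception.
    if status == 0:
        print chunksize, 'ok'
    else:
        print chunksize, 'failed with status', status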
