Description
I have encountered an issue with the csv parser, pandas.io.parsers.read_csv. I get a segmentation fault or a UnicodeDecodeError when reading a csv file in chunks, and the problem seems to depend on the size of the chunks.
Consider the following code:
import codecs
import csv

import pandas as pd


def create_csv_file(columns, rows):
    """Write a utf-8 encoded csv file with the given number of
    float-valued columns and rows, and return its file name."""
    csv_file_name = 'csv_test_file.csv'
    with codecs.open(csv_file_name, mode='w', encoding='utf_8') as csv_file:
        csv_writer = csv.writer(csv_file, delimiter=',')
        for row in xrange(rows):
            csv_writer.writerow([float(row)] * columns)
    return csv_file_name


def main():
    columns = 20
    rows = 10000
    chunksize = 999
    csv_file_name = create_csv_file(columns, rows)
    # Read the file back in chunks of `chunksize` rows.
    reader = pd.io.parsers.read_csv(csv_file_name,
                                    header=None,
                                    chunksize=chunksize,
                                    encoding='utf_8')
    # Iterate over the chunks, printing how many rows have been
    # requested so far.
    for x, dataframe in enumerate(reader, 1):
        print x * chunksize


if __name__ == "__main__":
    main()
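Since only some chunk sizes fail, a quick way to map out the failing region is to run each read in a separate subprocess (a segmentation fault in the parser would otherwise kill the probing script itself). This is a minimal sketch of my own, not part of the original report, and it assumes csv_test_file.csv has already been created by the script above:

import subprocess
import sys

# Template for a chunked read at one chunksize; run in a child
# interpreter so a segfault there does not abort the sweep.
SNIPPET = """
import pandas as pd
reader = pd.io.parsers.read_csv('csv_test_file.csv',
                                header=None,
                                chunksize=%d,
                                encoding='utf_8')
for dataframe in reader:
    pass
"""

for chunksize in (997, 998, 999, 1000, 1001):
    # Return code 0 means the read succeeded; a positive code means an
    # exception was raised; a negative code means the interpreter was
    # killed by a signal (-11 is SIGSEGV on Linux).
    code = subprocess.call([sys.executable, '-c', SNIPPET % chunksize])
    print chunksize, code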
I get a segmentation fault from the attached code when the chunksize is 999 rows. If the chunksize is decreased to 998 rows, I instead get a UnicodeDecodeError. If the chunksize is increased to 1000 rows, the csv file is read without any problems. My first guess was that the problem appears when the last chunk contains too few rows, but I was surprised to find that reading the csv file with the following settings worked properly:
columns = 20
rows = 1000
chunksize = 99
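For reference, the size of the final chunk in each configuration is just rows modulo chunksize, which is why the last result is surprising: the working 1000/99 case leaves a final chunk of exactly the same size as the segfaulting 10000/999 case.

# Plain arithmetic: the number of rows in the final chunk is
# rows % chunksize (a result of 0 means the file divides evenly).
print 10000 % 999   # 10 rows in the last chunk -> segmentation fault
print 10000 % 998   # 20 rows in the last chunk -> UnicodeDecodeError
print 10000 % 1000  # 0, no short final chunk   -> works
print 1000 % 99     # 10 rows in the last chunk -> works anyway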