Skip to content

to_csv() surrogates not allowed #22610

Closed
@obilodeau

Description

@obilodeau

Code Sample

s = '\ud800'
srs = pd.Series()
srs.loc[ 0 ] = s
srs.to_csv('testcase.csv')

Stack trace:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-50-769583baba38> in <module>()
      4 srs = pd.Series()
      5 srs.loc[ 0 ] = s
----> 6 srs.to_csv('testcase.csv')

/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in to_csv(self, path, index, sep, na_rep, float_format, header, index_label, mode, encoding, compression, date_format, decimal)
   3779                            index_label=index_label, mode=mode,
   3780                            encoding=encoding, compression=compression,
-> 3781                            date_format=date_format, decimal=decimal)
   3782         if path is None:
   3783             return result

/opt/conda/lib/python3.6/site-packages/pandas/core/frame.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
   1743                                  doublequote=doublequote,
   1744                                  escapechar=escapechar, decimal=decimal)
-> 1745         formatter.save()
   1746 
   1747         if path_or_buf is None:

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in save(self)
    169                 self.writer = UnicodeWriter(f, **writer_kwargs)
    170 
--> 171             self._save()
    172 
    173         finally:

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in _save(self)
    284                 break
    285 
--> 286             self._save_chunk(start_i, end_i)
    287 
    288     def _save_chunk(self, start_i, end_i):

/opt/conda/lib/python3.6/site-packages/pandas/io/formats/csvs.py in _save_chunk(self, start_i, end_i)
    311 
    312         libwriters.write_csv_rows(self.data, ix, self.nlevels,
--> 313                                   self.cols, self.writer)

pandas/_libs/writers.pyx in pandas._libs.writers.write_csv_rows()

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 2: surrogates not allowed

Problem description

The presence of Unicode surrogates in a dataframe (or Series) causes an error in .to_csv(). This has already been fixed in .to_hdf() by allowing the errors= argument to be used where we can use the surrogatepass or surrogateescape error handler.

See the original bug report and the PR that fixed it.

Expected Output

No error.

Output of pd.show_versions()

I forgot to grab this before the end of my workshop and I destroyed the cloud instance. Sorry. It was Python 3.6 and pandas 0.23.4 I think.

Metadata

Metadata

Labels

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions