ENH: Add encoding errors option in pandas.read_csv #39017

Closed
@davidleejy

Description

Related to problem:

import pandas as pd

# df is an existing DataFrame whose strings contain surrogate code points (assumed).
df.to_csv('abc.csv', errors='surrogatepass')  # Saving works fine.

# Try to load:

# Attempt 1:
pd.read_csv('abc.csv')
# Fails. UnicodeEncodeError: 'utf-8' codec can't encode characters in position 30682-30685: surrogates not allowed

# Attempt 2:
pd.read_csv('abc.csv', errors='surrogatepass')
# Fails. read_csv has no `errors` parameter.

# Attempt 3:
with open('abc.csv', errors='surrogatepass') as _file:
    df = pd.read_csv(_file)
# Fails. UnicodeEncodeError: 'utf-8' codec can't encode characters in position 30682-30685: surrogates not allowed
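For anyone who wants to reproduce the underlying behaviour without pandas, here is a minimal standard-library sketch. It is not the pandas code path: the sample rows and the temporary file are made up for illustration, and a strict read here fails with UnicodeDecodeError rather than the UnicodeEncodeError reported above. It only shows that a file written with errors='surrogatepass' needs the same handler on the way back in.

```python
import csv
import os
import tempfile

# Hypothetical sample data: the second row holds a lone (unpaired) surrogate,
# which is valid in a Python str but not in strict UTF-8.
rows = [["id", "text"], ["1", "bad \ud83d char"]]

path = os.path.join(tempfile.mkdtemp(), "abc.csv")

# Writing needs errors='surrogatepass', much like df.to_csv(..., errors='surrogatepass').
with open(path, "w", encoding="utf-8", errors="surrogatepass", newline="") as f:
    csv.writer(f).writerows(rows)

# A strict read rejects the surrogate bytes.
try:
    with open(path, encoding="utf-8", newline="") as f:
        list(csv.reader(f))
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Reading with the same error handler round-trips the data.
with open(path, encoding="utf-8", errors="surrogatepass", newline="") as f:
    loaded = list(csv.reader(f))
```

In this sketch strict_ok ends up False and loaded equals rows, which is the round trip the requested read_csv option would make possible.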

Describe the solution you'd like

Recently, `errors` was added as a parameter to `to_csv` in this merged PR. Can we do the same for `read_csv`? That would make Attempt 2 work.

(Not sure why Attempt 3 doesn't work, since read_csv accepts a file object.)
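One possible reading of the Attempt 3 error, offered as an assumption rather than something confirmed in this issue: the file handle decodes the surrogates into a str without complaint, and some later step encodes that str back to UTF-8 with the default strict handler. The sketch below only demonstrates the Python-level behaviour that produces the same "surrogates not allowed" message; it says nothing about where pandas does this.

```python
# A lone high surrogate, as it would appear after decoding with errors='surrogatepass'.
text = "\ud83d"

# Strict encoding (the default) refuses surrogates.
try:
    text.encode("utf-8")
    message = None
except UnicodeEncodeError as exc:
    message = str(exc)

# The same handler used for decoding also lets the string be encoded again.
encoded = text.encode("utf-8", errors="surrogatepass")
```

Here message contains "surrogates not allowed", matching the wording of the traceback above, while the surrogatepass call succeeds.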

API breaking implications

Should not break anything.

Describe alternatives you've considered

See the (futile) Attempt 3 above.

Additional context

The section "Error handlers" in https://docs.python.org/3/library/codecs.html describes the available handlers, including surrogatepass.

[Screenshot of the "Error handlers" section, taken 2021-01-07]

Example of encoding & decoding:

x = "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')
print(x)
# prints 🙏
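To spell out what happens in that example: the two escapes are a high and a low surrogate, and the UTF-16 round trip joins them into a single code point. A small sketch (variable names are just for illustration) that also contrasts this with a UTF-8 round trip, where the surrogates come back unpaired:

```python
pair = "\ud83d\ude4f"  # high surrogate followed by low surrogate

# UTF-16 round trip: the pair is written as-is and decoded as one code point.
joined = pair.encode("utf-16", "surrogatepass").decode("utf-16")

# UTF-8 round trip: each surrogate is encoded on its own and comes back unchanged.
kept = pair.encode("utf-8", "surrogatepass").decode("utf-8", "surrogatepass")

print(len(pair), len(joined), len(kept))
print(joined == "\U0001F64F")
```

So the utf-16 trick repairs paired surrogates into the real character, whereas surrogatepass with utf-8 simply preserves whatever surrogates were there.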
