Closed
Description
Related to problem:
df.to_csv('abc.csv', errors='surrogatepass') # saving works fine.
# Try to load:
# Attempt 1:
pd.read_csv('abc.csv')
# Fails. UnicodeEncodeError: 'utf-8' codec can't encode characters in position 30682-30685: surrogates not allowed
# Attempt 2:
pd.read_csv('abc.csv', errors='surrogatepass')
# Fails. No `errors` parameter.
# Attempt 3:
with open('abc.csv', errors='surrogatepass') as _file:
    df = pd.read_csv(_file)
# Fails. UnicodeEncodeError: 'utf-8' codec can't encode characters in position 30682-30685: surrogates not allowed
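The error message above can be reproduced without pandas. The following is a minimal standard-library sketch, assuming the offending cells contain lone surrogate code points like the ones in the example under "Additional context"; the variable names are illustrative only.

```python
# Two lone surrogate code points, as in the "Additional context" example.
text = "\ud83d\ude4f"

# Strict UTF-8 encoding rejects surrogates; this is the same failure
# reason that appears in the read_csv traceback.
try:
    text.encode("utf-8")
    strict_reason = None
except UnicodeEncodeError as exc:
    strict_reason = exc.reason

# With 'surrogatepass' the surrogates survive both directions.
raw = text.encode("utf-8", "surrogatepass")
round_tripped = raw.decode("utf-8", "surrogatepass")

print(strict_reason)
print(round_tripped == text)
```

This suggests why opening the file with errors='surrogatepass' is not enough on its own: the decoded text still contains surrogates, so any later strict UTF-8 encode step would fail the same way.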
Describe the solution you'd like
Recently, we added `errors` as a function parameter to `to_csv` in this merged PR. Can we do the same for `read_csv`? This solution would make Attempt 2 work.
(Not sure why Attempt 3 doesn't work, since `read_csv` accepts an open file object.)
API breaking implications
None expected: `errors` would be a new optional parameter, so existing calls should keep working.
Describe alternatives you've considered
See the (futile) Attempt 3 above.
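Another possible stopgap is to bypass `read_csv` and parse the file with the standard-library `csv` module. This is only a sketch: `read_csv_rows` is a hypothetical helper, not a pandas function, and it assumes the file is UTF-8 and small enough to hold in memory. Whether the resulting rows behave well in later pandas operations has not been checked here.

```python
import csv
import os
import tempfile


def read_csv_rows(path, errors="surrogatepass"):
    """Hypothetical helper: read a UTF-8 CSV into a list of rows,
    letting surrogate code points through instead of raising."""
    with open(path, newline="", encoding="utf-8", errors=errors) as handle:
        return list(csv.reader(handle))


# Round-trip demo on a temporary file containing lone surrogates.
rows = [["id", "text"], ["1", "\ud83d\ude4f"]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "abc.csv")
    with open(path, "w", newline="", encoding="utf-8",
              errors="surrogatepass") as handle:
        csv.writer(handle).writerows(rows)
    loaded = read_csv_rows(path)

print(loaded == rows)
```

The first row could then serve as column names and the rest as data when building a DataFrame, but that step is left out because it depends on the pandas version in use.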
Additional context
See the "Error handlers" section of https://docs.python.org/3/library/codecs.html for the `surrogatepass` handler.
Example of encoding & decoding:
x = "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')
print(x)
# prints 🙏