Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
pd.DataFrame({"mø": [1, 2, 3, 4]}).to_stata("data.dta", version=119)
#.venv/lib/python3.9/site-packages/pandas/io/stata.py:2491: InvalidColumnName:
#Not all pandas column names were valid Stata variable names.
#The following replacements have been made:
#
# mø -> m_
#
#If this is not what you expect, please make sure you have Stata-compliant
#column names in your DataFrame (strings only, max 32 characters, only
#alphanumerics and underscores, no Stata reserved words)
#
# warnings.warn(ws, InvalidColumnName)
pd.read_stata("data.dta")
# index m_
# 0 0 1
# 1 1 2
# 2 2 3
# 3 3 4
Issue Description
StataWriterUTF8 replaces every character in the range 128 <= ord(c) < 256 with an underscore, but only those with (128 <= ord(c) < 192) or ord(c) in {215, 247} need to be replaced. The rest (ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ) are perfectly valid in variable names in Stata version >= 118.
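To make the rule concrete, here is a minimal sketch that enumerates the Latin-1 characters considered valid under the ranges stated above. The name `INVALID_ORDINALS` is hypothetical and the ranges come from this report, not from the pandas source.

```python
# Rule as stated in this issue (an assumption, not copied from pandas):
# within 128..255, only 128..191 plus 215 (×) and 247 (÷) are invalid
# in variable names for Stata >= 118.
INVALID_ORDINALS = set(range(128, 192)) | {215, 247}

# Everything else in the Latin-1 supplement should survive unchanged.
valid_chars = "".join(
    chr(o) for o in range(128, 256) if o not in INVALID_ORDINALS
)

print(len(valid_chars))
print(valid_chars)
```

The resulting string should match the list of characters quoted above, running from À to ÿ without × and ÷.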
Expected Behavior
The characters ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ should not be replaced in variable names when saving with StataWriterUTF8. Happy to submit a pull request.
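A rough sketch of the expected replacement behavior, written as a standalone helper rather than a patch to pandas. `sanitize_name` is a hypothetical name, and leaving code points >= 256 untouched is an assumption made here for illustration.

```python
def sanitize_name(name: str) -> str:
    """Hypothetical helper: replace characters this issue treats as invalid with "_"."""

    def is_invalid(c: str) -> bool:
        o = ord(c)
        if o < 128:
            # ASCII: keep only letters, digits, and underscore.
            return not (c.isalnum() or c == "_")
        # 128..191 and the two math signs (× = 215, ÷ = 247) are rejected;
        # anything else, including code points >= 256, is left as-is (assumption).
        return o < 192 or o in (215, 247)

    return "".join("_" if is_invalid(c) else c for c in name)


print(sanitize_name("mø"))   # kept unchanged under the proposed rule
print(sanitize_name("m×"))   # the multiplication sign is replaced
```

With a rule like this, the column from the reproducible example would be written as mø instead of m_.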
Installed Versions
pandas : 1.4.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.2
setuptools : 59.6.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : 1.1.6
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None