BUG: StataWriterUTF8 is needlessly strict when converting variable names #47276

Closed
@eirki

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
pd.DataFrame({"mø": [1, 2, 3, 4]}).to_stata("data.dta", version=119)

#.venv/lib/python3.9/site-packages/pandas/io/stata.py:2491: InvalidColumnName: 
#Not all pandas column names were valid Stata variable names.
#The following replacements have been made:
#
#    mø   ->   m_
#
#If this is not what you expect, please make sure you have Stata-compliant
#column names in your DataFrame (strings only, max 32 characters, only
#alphanumerics and underscores, no Stata reserved words)
#
#  warnings.warn(ws, InvalidColumnName)

pd.read_stata("data.dta")

#    index  m_
# 0      0   1
# 1      1   2
# 2      2   3
# 3      3   4

Issue Description

StataWriterUTF8 replaces every character in the range 128 <= ord(c) < 256 with an underscore, but only characters with 128 <= ord(c) < 192 or ord(c) in {215, 247} need to be replaced. The rest (ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ) are valid in variable names in Stata versions >= 118.
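The narrower rule described above can be sketched in a few lines. This is only an illustration of the proposed check, not the code in pandas/io/stata.py; the helper names `needs_replacement` and `sanitize_latin1` are hypothetical, and the sketch only looks at characters in the 128..255 range.

```python
def needs_replacement(ch: str) -> bool:
    """Hypothetical helper: True if ch falls in the part of the
    128..255 range that this issue says must still be replaced."""
    code = ord(ch)
    # 128..191 are controls, punctuation and symbols;
    # 215 is the multiplication sign, 247 the division sign.
    return 128 <= code < 192 or code in {215, 247}


def sanitize_latin1(name: str) -> str:
    """Replace only the characters flagged above with an underscore.

    Assumption: every other character is left untouched; checks such as
    name length or reserved words are out of scope for this sketch.
    """
    return "".join("_" if needs_replacement(c) else c for c in name)


print(sanitize_latin1("mø"))   # ø is ord 248, so the name is unchanged
print(sanitize_latin1("a×b"))  # × is ord 215, so it becomes an underscore
```

Under this rule the column name from the reproducible example would be written as-is instead of being converted to m_.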

Expected Behavior

The characters ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ should not be removed from variable names when saving with StataWriterUTF8. Happy to submit a pull request.
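For reference, the list of characters above can be regenerated from the code-point ranges given in the issue description. A minimal sketch, assuming nothing beyond the ranges stated there (the variable name `kept` is illustrative):

```python
# Code points 192..255, minus the multiplication sign (215)
# and the division sign (247).
kept = "".join(chr(c) for c in range(192, 256) if c not in {215, 247})

print(kept)
print(len(kept))
```

This yields 62 characters, starting at À and ending at ÿ, which could serve as the input for a parametrized round-trip test if a pull request is opened.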

Installed Versions

INSTALLED VERSIONS
------------------
commit : 4bfe3d0
python : 3.10.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-33-generic
Version : #34-Ubuntu SMP Wed May 18 13:34:26 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.22.4
pytz : 2022.1
dateutil : 2.8.2
pip : 22.0.2
setuptools : 59.6.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.4.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
markupsafe : None
matplotlib : 3.5.2
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 8.0.0
pyreadstat : 1.1.6
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
