Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# This occurs when importing an external file, so it cannot be reproduced
# without a data file; the file is attached to this issue (see "Test files").
import pandas as pd
df = pd.read_stata("stata-intlimits-111.dta")
Issue Description
Stata version 8 (corresponding to the dta formats 112 and 113) introduced multiple missing value codes [1]. To support this for integer types, the valid limits were narrowed so that the values at the top of each range could be reused as missing codes. The limits changed from [2] [3]:
byte
    minimum nonmissing    -128
    maximum nonmissing    +126
    code for .            +127
int
    minimum nonmissing    -32768
    maximum nonmissing    +32766
    code for .            +32767
long
    minimum nonmissing    -2,147,483,648
    maximum nonmissing    +2,147,483,646
    code for .            +2,147,483,647
to [4]:
byte
    minimum nonmissing    -127
    maximum nonmissing    +100
    code for .            +101
    code for .a           +102
    code for .b           +103
    ...
    code for .z           +127
int
    minimum nonmissing    -32767
    maximum nonmissing    +32740
    code for .            +32741
    code for .a           +32742
    code for .b           +32743
    ...
    code for .z           +32767
long
    minimum nonmissing    -2,147,483,647
    maximum nonmissing    +2,147,483,620
    code for .            +2,147,483,621
    code for .a           +2,147,483,622
    code for .b           +2,147,483,623
    ...
    code for .z           +2,147,483,647
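Both tables follow a simple pattern in terms of the storage width, which the sketch below reproduces. The function name `integer_limits` and its parameters are hypothetical, introduced only for illustration; they are not part of pandas or Stata.

```python
def integer_limits(bits, multiple_missing):
    """Return (min_nonmissing, max_nonmissing, code_for_dot) for a Stata integer type.

    bits: storage width (8 for byte, 16 for int, 32 for long).
    multiple_missing: False for the older layout with a single missing code,
    True for the layout with the 27 codes ., .a, ..., .z.
    """
    top = 2 ** (bits - 1) - 1  # largest value the storage type can hold
    if multiple_missing:
        n_codes = 27  # . plus .a through .z
        return -top, top - n_codes, top - n_codes + 1
    return -top - 1, top - 1, top


for name, bits in [("byte", 8), ("int", 16), ("long", 32)]:
    print(name, integer_limits(bits, False), integer_limits(bits, True))
```

In both layouts the largest storable value is a missing code (the old `.`, the new `.z`); the newer layout additionally reserves the 26 values below it and the most negative value.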
When Stata reads an older-format file, it retains the repurposed values by promoting the variable to the next larger type (i.e. byte->int, int->long, long->double). Pandas, on the other hand, interprets the values as if they used the newer coding and converts them to missing:
>>> df = pd.read_stata("stata-intlimits-111.dta")
>>> df
   index  byte  int  long
0      1   0.0  0.0   0.0
1      2   NaN  NaN   NaN
2      3   NaN  NaN   NaN
3      4   NaN  NaN   NaN
>>> df.dtypes
index      int16
byte     float64
int      float64
long     float64
dtype: object
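The promotion that Stata applies to older files can be sketched in plain Python as follows. This is only an illustration of the behaviour described in this issue, not how pandas or Stata implement it; `read_old_column`, `PROMOTION` and `OLD_MISSING` are hypothetical names, and `None` stands in for a missing value.

```python
# Next larger storage type for each old integer type, as described above.
PROMOTION = {"byte": "int", "int": "long", "long": "double"}

# The single missing code of the older layout (top of each storage range).
OLD_MISSING = {"byte": 127, "int": 32767, "long": 2147483647}


def read_old_column(stata_type, raw_values):
    """Return (promoted_type, values) for a column from an older-format file.

    Only the old missing code is mapped to None; every other stored value is
    kept unchanged, which is safe because the promoted type can hold it.
    """
    missing = OLD_MISSING[stata_type]
    values = [None if v == missing else v for v in raw_values]
    return PROMOTION[stata_type], values


print(read_old_column("byte", [0, -128, 126, 127]))
```

Under these assumptions, -128 and 126 survive as ordinary values and only 127 becomes missing.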
Most of the time this isn't a problem, because the new .z missing value code has the same value as the old single missing value code. However, non-missing values in an older file that coincide with the new codes (for example, byte values 101 to 126) are incorrectly read as missing.
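To make the affected values concrete, the sketch below lists the byte values that are valid in the older layout but fall outside the newer non-missing range. The limits are copied from the tables in this issue; the variable names are illustrative only.

```python
# Byte limits copied from the two tables in this issue.
OLD_BYTE = range(-128, 126 + 1)  # non-missing values in the older layout
NEW_BYTE = range(-127, 100 + 1)  # non-missing values in the newer layout

# Values an older file may legitimately contain that a reader applying
# the newer limits would not treat as ordinary data.
misread = [v for v in OLD_BYTE if v not in NEW_BYTE]

print(misread[0], misread[1], misread[-1], len(misread))
# -> -128 101 126 27
```

This agrees with the pandas output shown earlier, where the rows holding the old minimum and maximum both come back as NaN.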
References:
[1] What's new in release 8.0, What's new in data management, section 20 (https://www.stata.com/help.cgi?whatsnew7to8)
[2] Stata 1 reference manual, page 4 (https://www.statalist.org/forums/forum/general-stata-discussion/general/1638352-stata-1-reference-manual-now-available-to-anyone-who-wants-it)
[3] Description of the .dta file format 110, representation of numbers (dta_110.txt)
[4] Description of the .dta file format 113, representation of numbers (https://www.stata.com/help.cgi?dta_113#numbers)
Test files:
stata-intlimits.zip
Expected Behavior
I would expect the data to appear as it does in Stata (the following is output from Stata version 18):
. dtaversion "stata-intlimits-111.dta"
(file "C:\Users\edcmjc\stata-intlimits-111.dta" is .dta-format 111 from Stata 7)
. use "stata-intlimits-111.dta"
(Integer limits for 111 dta format)
. list
     +-------------------------------------+
     | index   byte      int          long |
     |-------------------------------------|
  1. |     1      0        0             0 |
  2. |     2   -128   -32768   -2147483648 |
  3. |     3    126    32766    2147483646 |
  4. |     4      .        .             . |
     +-------------------------------------+
. describe
Contains data from stata-intlimits_111.dta
Observations: 4 Integer limits for 111 dta format
Variables: 4
-----------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-----------------------------------------------------------------------------------------------
index           int     %8.0g
byte            int     %8.0g                 byte variable
int             long    %8.0g                 int variable
long            double  %12.0g                long variable
-----------------------------------------------------------------------------------------------
Sorted by:
Note: Dataset has changed since last saved.
Installed Versions
INSTALLED VERSIONS
commit : bdc79c1
python : 3.12.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252
pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.2.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.3
numba : None
numexpr : 2.9.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None