BUG: Integers with extreme values are incorrectly interpreted as missing when importing older versions (pre-version 8) of the Stata dta file format #58130

Closed
@cmjcharlton

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# This only occurs when importing an external file, so it cannot be reproduced
# without a data file. The file is attached to this issue (see "Test files").
import pandas as pd
df = pd.read_stata("stata-intlimits-111.dta")

Issue Description

Stata version 8 (corresponding to dta formats 112 and 113) introduced multiple missing value codes [1]. To support this for integer types, the valid limits were narrowed so that certain values could be reused as missing value codes. The limits changed from [2] [3]:

        byte
            minimum nonmissing    -128
            maximum nonmissing    +126
            code for .            +127

        int
            minimum nonmissing    -32768
            maximum nonmissing    +32766
            code for .            +32767

        long
            minimum nonmissing    -2,147,483,648
            maximum nonmissing    +2,147,483,646
            code for .            +2,147,483,647

to [4]:

        byte
            minimum nonmissing    -127
            maximum nonmissing    +100
            code for .            +101
            code for .a           +102
            code for .b           +103
            ...
            code for .z           +127

        int
            minimum nonmissing    -32767
            maximum nonmissing    +32740
            code for .            +32741
            code for .a           +32742
            code for .b           +32743
            ...
            code for .z           +32767

        long
            minimum nonmissing    -2,147,483,647
            maximum nonmissing    +2,147,483,620
            code for .            +2,147,483,621
            code for .a           +2,147,483,622
            code for .b           +2,147,483,623
            ...
            code for .z           +2,147,483,647
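
The difference between the two sets of limits can be worked out mechanically. The following is a minimal sketch, not pandas code: the dictionaries simply restate the tables above, and `repurposed_ranges` is a hypothetical helper name used here for illustration. It lists the values that are valid in the older formats but fall outside the valid range of formats 112 and later.

```python
# Valid (non-missing) integer ranges, copied from the tables above.
OLD_VALID = {  # dta formats before 112
    "byte": (-128, 126),
    "int": (-32768, 32766),
    "long": (-2147483648, 2147483646),
}
NEW_VALID = {  # dta formats 112 and later
    "byte": (-127, 100),
    "int": (-32767, 32740),
    "long": (-2147483647, 2147483620),
}


def repurposed_ranges(type_name):
    """Return inclusive ranges that are valid in the old format but not in the new."""
    old_lo, old_hi = OLD_VALID[type_name]
    new_lo, new_hi = NEW_VALID[type_name]
    ranges = []
    if old_lo < new_lo:
        ranges.append((old_lo, new_lo - 1))
    if old_hi > new_hi:
        ranges.append((new_hi + 1, old_hi))
    return ranges


for name in OLD_VALID:
    print(name, repurposed_ranges(name))
```

For `byte`, for example, this gives -128 and the range 101 to 126. These are the values at risk when an older file is read with the newer limits.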

When Stata reads an older format file, it retains the repurposed values by promoting the variable to the next larger type (i.e. byte->int, int->long, long->double). Pandas, on the other hand, interprets the values as if they used the newer coding and converts them to missing:

>>> df = pd.read_stata("stata-intlimits-111.dta")
>>> df
   index  byte  int  long
0      1   0.0  0.0   0.0
1      2   NaN  NaN   NaN
2      3   NaN  NaN   NaN
3      4   NaN  NaN   NaN
>>> df.dtypes
index      int16
byte     float64
int      float64
long     float64
dtype: object

Most of the time this isn't a problem, because the new .z missing value code has the same value as the old single missing value code. However, any non-missing values in an older file that fall outside the new valid range (for example, those that coincide with the new missing value codes) are incorrectly read as missing.
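
As a rough illustration of the decoding I would expect for older files, here is a minimal pure-Python sketch. It is not how pandas is implemented, and `decode_legacy_column` is a hypothetical helper: only the single old missing code becomes missing, every other value is kept, and the type is promoted in the same way Stata does it.

```python
import math

# Single missing code per integer type in dta formats before 112.
OLD_MISSING = {"byte": 127, "int": 32767, "long": 2147483647}
# Promotion Stata applies when it reads an older file.
PROMOTED = {"byte": "int", "int": "long", "long": "double"}


def decode_legacy_column(values, type_name):
    """Hypothetical helper: decode raw integers read from a pre-112 dta column.

    Returns the promoted Stata type name and the decoded values, with the old
    missing code replaced by NaN and every other value left unchanged.
    """
    missing_code = OLD_MISSING[type_name]
    decoded = [math.nan if v == missing_code else v for v in values]
    return PROMOTED[type_name], decoded


new_type, column = decode_legacy_column([0, -128, 126, 127], "byte")
print(new_type, column)
```

With the byte values from the test file, only the last entry (127) becomes missing, while -128 and 126 are kept.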

References:
[1] What's new in release 8.0, What's new in data management, section 20 (https://www.stata.com/help.cgi?whatsnew7to8)
[2] Stata 1 reference manual, page 4 (https://www.statalist.org/forums/forum/general-stata-discussion/general/1638352-stata-1-reference-manual-now-available-to-anyone-who-wants-it)
[3] Description of the .dta file format 110, representation of numbers (dta_110.txt)
[4] Description of the .dta file format 113, representation of numbers (https://www.stata.com/help.cgi?dta_113#numbers)

Test files:
stata-intlimits.zip

Expected Behavior

I would expect the data to appear as it does in Stata (the following is output from Stata version 18):

. dtaversion "stata-intlimits-111.dta"
  (file "C:\Users\edcmjc\stata-intlimits-111.dta" is .dta-format 111 from Stata 7)

. use "stata-intlimits-111.dta" 
(Integer limits for 111 dta format)

. list

     +-------------------------------------+
     | index   byte      int          long |
     |-------------------------------------|
  1. |     1      0        0             0 |
  2. |     2   -128   -32768   -2147483648 |
  3. |     3    126    32766    2147483646 |
  4. |     4      .        .             . |
     +-------------------------------------+

. describe

Contains data from stata-intlimits_111.dta
 Observations:             4                  Integer limits for 111 dta format
    Variables:             4                  
-----------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-----------------------------------------------------------------------------------------------
index           int     %8.0g                 
byte            int     %8.0g                 byte variable
int             long    %8.0g                 int variable
long            double  %12.0g                long variable
-----------------------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

Installed Versions

INSTALLED VERSIONS

commit : bdc79c1
python : 3.12.2.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19045
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United Kingdom.1252

pandas : 2.2.1
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.2.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.8.3
numba : None
numexpr : 2.9.0
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

Labels

Bug, Needs Triage