to_stata always stores strings as str244 #8969

Closed
@dmsul


Copied from #7858:

I am still getting this bug (pandas 0.15.1, numpy 1.9.1, Stata 13.1 on Windows 7). The written DTA file still stores the strings as str244 even though the strings themselves have length 1. Because the values survive the round trip, reading the file back into pandas gives a DataFrame that looks identical to the original, so I think the only way to detect the problem from within pandas is to look at the size of the DTA file itself.

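import pandas as pd

df = pd.DataFrame(['a', 'b', 'c'], columns=['alpha'])
df.to_stata('test.dta')
df2 = pd.read_stata('test.dta')
# The round trip preserves the values, so this passes despite the str244 storage
assert (df['alpha'] == df2['alpha']).all()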
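
A rough sketch of the file-size check mentioned above, assuming nothing beyond `os.path.getsize` and `DataFrame.to_stata`. The helper name `dta_bytes_per_row` and the file names are hypothetical. If one-character strings are being padded to str244, the short-string file should come out at a few hundred bytes per row rather than a handful. The DTA header adds fixed overhead, so this is only a coarse signal, not an exact measure of column width.

```python
import os
import tempfile

import pandas as pd


def dta_bytes_per_row(df, path):
    """Write df to a DTA file and return on-disk bytes per row (header included)."""
    df.to_stata(path)
    return os.path.getsize(path) / len(df)


with tempfile.TemporaryDirectory() as tmp:
    # Same one-character strings as in the report, repeated so the
    # fixed header overhead matters less.
    short = pd.DataFrame({'alpha': ['a', 'b', 'c'] * 1000})
    # A column that genuinely needs wide storage, for comparison.
    wide = pd.DataFrame({'alpha': ['a' * 200, 'b' * 200, 'c' * 200] * 1000})

    short_bpr = dta_bytes_per_row(short, os.path.join(tmp, 'short.dta'))
    wide_bpr = dta_bytes_per_row(wide, os.path.join(tmp, 'wide.dta'))

print(f"1-char strings:   {short_bpr:.1f} bytes/row")
print(f"200-char strings: {wide_bpr:.1f} bytes/row")
```

If the two figures come out nearly equal, the short strings are probably being written at a fixed wide width.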

But when loading in Stata:

. use test
. describe

Contains data from D:\data\Pollution\test.dta
  obs:             3                          
 vars:             2                          02 Dec 2014 10:30
 size:           744                          
--------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------------------
index           long    %12.0g                
alpha           str244  %1s                   
--------------------------------------------------------------------------------------------
Sorted by:  

. compress
  index was long now byte
  alpha was str244 now str1
  (738 bytes saved)

. describe

Contains data from D:\data\Pollution\test.dta
  obs:             3                          
 vars:             2                          02 Dec 2014 10:30
 size:             6                          
--------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
--------------------------------------------------------------------------------------------
index           byte    %12.0g                
alpha           str1    %9s                   
--------------------------------------------------------------------------------------------
Sorted by:  

I don't know how critical this is since there are workarounds. You can either run compress followed by save, replace in Stata after every export from pandas or, if the str244 padding makes the DTA file exceed your memory limit (which causes quite the system error lightshow, as I just experienced), pass the data through a CSV file instead.
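
For the CSV route, a minimal sketch on the pandas side might look like the following. The file name is a placeholder, and the Stata command in the comment assumes Stata 13 or later, where `import delimited` is available.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(['a', 'b', 'c'], columns=['alpha'])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'test.csv')
    # index=False leaves out the pandas index column.
    df.to_csv(csv_path, index=False)
    # In Stata (assumed 13+):  import delimited using test.csv, clear
    # Read the file back here only to confirm the values were written as expected.
    round_trip = pd.read_csv(csv_path)

print(round_trip)
```

CSV carries no type information, so Stata chooses the storage types itself on import.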

It's just a matter of convenience.

Labels: Bug; IO Stata (read_stata, to_stata); Output-Formatting (__repr__ of pandas objects, to_string)
