Skip to content

BUG: Fixed incorrect string length calculation when writing strings in Stata #7862

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 1, 2014

Conversation

bashtage
Copy link
Contributor

Strings were incorrectly written using 244 character irrespective of the actual
length of the underlying due to changes in pandas where the underlying NumPy
datatype of strings is always np.object_, and never np.string_. Closes #7858

@jreback jreback added this to the 0.15.0 milestone Jul 29, 2014
@jreback
Copy link
Contributor

jreback commented Jul 29, 2014

@bashtage I would generate some files in stata in both formats and add to the repo as well. Is this a bug fix or an API change?

@bashtage
Copy link
Contributor Author

@jreback I'm not sure a Stata-generated file would add much - the bug is in StataWriter which no longer correctly identifies string sizes, and so defaults to the maximum size 244 for all strings.

I think I've decided this is a bug since the code does not appear to do what is intended (even though it is functional).

@jreback
Copy link
Contributor

jreback commented Jul 29, 2014

ok, that's fine then

@jreback jreback added Bug and removed Enhancement labels Jul 29, 2014
@bashtage
Copy link
Contributor Author

Hold off a while on the merge - I need to check what happens for longer strings (> 244, which are a problem)

@jreback
Copy link
Contributor

jreback commented Jul 29, 2014

I c, it was being actually set before. never saw that.

@jreback
Copy link
Contributor

jreback commented Jul 29, 2014

@bashtage np. lmk

you truncate / warn / raise for too long strings

@bashtage
Copy link
Contributor Author

Just want to make sure it is still working correctly, and be certain there is a test for that case.

@bashtage
Copy link
Contributor Author

Definitely not working for longer strings, so more work needed. The files generated actually crash Stata 12...

@jreback
Copy link
Contributor

jreback commented Jul 29, 2014

looks like you were truncating before (which is fine I think) (maybe with a warning thougH)

@bashtage
Copy link
Contributor Author

I have raised it as an error in the PR. Not completely sure that it the correct behavior though.

@jreback
Copy link
Contributor

jreback commented Jul 29, 2014

stata has max of 244, right? so then that is ok

@bashtage
Copy link
Contributor Author

I just checked 0.14.1 - attempting to write a 500 character string produced no warning or error, and outputs a corrupt stata file (that segfaults Stata)

@jreback
Copy link
Contributor

jreback commented Jul 29, 2014

ok gr8! so raising is good (put this fix in API changes, even though its really a bug -fix). in theory its more visible there.

@bashtage
Copy link
Contributor Author

@jreback I think this is ready, subject to doc/other checks

- ``DataFrame.to_stata`` and ``StataWriter`` check string length for
compatibility with limitations imposed in dta files where fixed-width
strings must contain 244 or fewer characters. Attempting to write Stata
dta files with strings longer than 244 characters raises an error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the issue number here as well, say that this raises ValueError? (or whatever it does)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a very similar note or warning to the io.rst / stata section as well (and maybe on the doc-string of to_stata)

…o Stata

Strings were incorrectly written using 244 character irrespective of the actual
length of the underlying due to changes in pandas where the underlying NumPy
datatype of strings is always np.object_, and never np.string_. Closes pandas-dev#7858

String types were also not being checked for excessive length, and DataFrames
with strings containing more then 244 characters were producing invalid
dta files.  Attempting to write long strings raises an error now.
@bashtage
Copy link
Contributor Author

@jreback Hopefully ready now

@jreback
Copy link
Contributor

jreback commented Jul 31, 2014

gr8 ping when green

jreback added a commit that referenced this pull request Aug 1, 2014
BUG: Fixed incorrect string length calculation when writing strings in Stata
@jreback jreback merged commit 9f42640 into pandas-dev:master Aug 1, 2014
@jreback
Copy link
Contributor

jreback commented Aug 1, 2014

@bashtage awesome as usual!

@bashtage bashtage deleted the stata-minimal-width-strings branch August 20, 2014 15:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Stata read_stata, to_stata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_stata results in unnecessarily long strings
2 participants