-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
BUG: Fixed incorrect string length calculation when writing strings in Stata #7862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fixed incorrect string length calculation when writing strings in Stata #7862
Conversation
@bashtage I would generate some files in stata in both formats and add to the repo as well. Is this a bug fix or an API change? |
@jreback I'm not sure a Stata-generated file would add much - the bug is in I think I've decided this is a bug since the code does not appear to do what is intended (even though it is functional). |
ok, that's fine then |
Hold off a while on the merge - I need to check what happens for longer strings (> 244, which are a problem) |
I c, it was being actually set before. never saw that. |
@bashtage np. lmk you truncate / warn / raise for too long strings |
Just want to make sure it is still working correctly, and be certain there is a test for that case. |
Definitely not working for longer strings, so more work needed. The files generated actually crash Stata 12... |
looks like you were truncating before (which is fine I think) (maybe with a warning thougH) |
I have raised it as an error in the PR. Not completely sure that it the correct behavior though. |
stata has max of 244, right? so then that is ok |
I just checked 0.14.1 - attempting to write a 500 character string produced no warning or error, and outputs a corrupt stata file (that segfaults Stata) |
ok gr8! so raising is good (put this fix in API changes, even though its really a bug -fix). in theory its more visible there. |
@jreback I think this is ready, subject to doc/other checks |
- ``DataFrame.to_stata`` and ``StataWriter`` check string length for | ||
compatibility with limitations imposed in dta files where fixed-width | ||
strings must contain 244 or fewer characters. Attempting to write Stata | ||
dta files with strings longer than 244 characters raises an error. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the issue number here as well, say that this raises ValueError
? (or whatever it does)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a very similar note or warning to the io.rst / stata section as well (and maybe on the doc-string of to_stata)
…o Stata Strings were incorrectly written using 244 character irrespective of the actual length of the underlying due to changes in pandas where the underlying NumPy datatype of strings is always np.object_, and never np.string_. Closes pandas-dev#7858 String types were also not being checked for excessive length, and DataFrames with strings containing more then 244 characters were producing invalid dta files. Attempting to write long strings raises an error now.
@jreback Hopefully ready now |
gr8 ping when green |
BUG: Fixed incorrect string length calculation when writing strings in Stata
@bashtage awesome as usual! |
Strings were incorrectly written using 244 character irrespective of the actual
length of the underlying due to changes in pandas where the underlying NumPy
datatype of strings is always np.object_, and never np.string_. Closes #7858