Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Refactor string methods for StringArray + return IntegerArray for numeric results #29640
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor string methods for StringArray + return IntegerArray for numeric results #29640
Changes from 12 commits
14ca804
07fc53d
ddef378
a147081
36b1bf7
d01ddfe
22a2c79
686de87
0ea2d6b
358ce59
7b25d8d
46dfc20
1b96e53
3b20ffc
e4bae5e
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need any updating an/or reference to your new section that you added in text.rst?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger you can probably update the sentence "The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype." 20 lines below this line, to include that it can also return IntegerDtype in certain cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you sync the signatures up with _map, also rename _map -> _map_arraylike (or similar)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, don't follow this. I think they do match, aside from @jbrockmendel's request to call it
func
rather thanf
.I guess
na_map
calls itna_result
while_map
calls itna_value
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am really -1 on 2 different branches here. If they have exactly the same signature a little less negative. again I would rename _map to be more informative here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean with "-1 on 2 different branches here" ?
The whole purpose of this PR is to add a separate branch in case of StringArray (because we can be more efficient and want to be more specific in the result dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why my initial PR was so much more complex, since it tried to handle both cases similarly. I think that was more complex than this.
As Joris says, the main point of divergence is that for StringArray we usually know the result dtype exactly. It doesn't depend on the presence of NAs. Additionally,
map_infer_dtype
for both, so the core implementation is the same..str
on object-dtype, so we will end up with just this implementation.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, at the very least these signatures for what you call _map and _stringarray_map should be exactly the same.
and _map -> _map_object and _stringarray_map -> _map_stringarray.
I think this is crucial for not adding technical debt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback is your suggestion to add
na_mask
to the _stringarray_map signature and have it just not be used? I think this relates to the "this should be a StringArray method" discussionThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_map
and_na_map
already have inconsistent signatures. I'm not sure why it's that way on master, but I'm a bit against adding unused arguments in this case.What's the technical debt we're adding here? By definition, we need to handle StringArray differently, since its result type doesn't depend on the presence of NAs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And there is also a good reason for that, as
_map
has an additional argumentna_mask
that is used internally in_map
(for a recursive call).I think refactoring
_map
is outside of the scope of this PR.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you create an issue to clean this up (_map and _na_map), and/or refactor of this, post this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename these as indicated above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you be explicit in what exact names you want to propose? (I am getting a bit lost in all the comments)
Tom already renamed
_ea_map
to_stringarray_map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right we should change _map -> _object_map (though I actually prefer the opposite, _map_object and _map_stringarray), but either is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, make sense. I renamed to
_map_object
and_map_stringarray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
na_value restricted to scalar?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't think so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Elsewhere we say "np.dtype or ExtensionDtype". Is this the new policy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if we have a policy. If it matters, this is an internal docstring, so I'm OK to use things from
pandas._typing
that we wouldn't have in a public docstring yet.