Description
History: when EAs were originally designed, the hope was that many methods could be implemented in terms of a small number of core methods, of which _values_for_factorize (vff) and _values_for_argsort (vfa) were two of the main ones. Over time we found that many of the places we used these, other than factorize/argsort themselves, were causing problems, and those uses got pruned.
At this point we are down to only a few internal uses of each. _from_factorized is used only in EA.factorize. vfa is used in EA.argsort, EA.rank, and nargminmax (which in turn is used in EA.argmin/argmax). vff is used in EA.factorize and merge._factorize_keys. #53475 would restore its use in hash_pandas_object.
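For readers unfamiliar with the pattern, here is a minimal pure-Python sketch of how argsort and factorize can be derived from the vfa/vff hooks. ToyArray is a hypothetical stand-in, not a real ExtensionArray, and the method bodies are illustrative assumptions rather than the pandas implementations. Only the hook names and the (values, na_value) return shape of _values_for_factorize mirror the real interface.

```python
class ToyArray:
    """Toy container (hypothetical); None marks a missing value."""

    def __init__(self, data):
        self._data = list(data)

    # --- the small set of "core" hooks ---
    def _values_for_argsort(self):
        # Return sortable keys; missing values map to +inf so they sort last.
        return [float("inf") if x is None else x for x in self._data]

    def _values_for_factorize(self):
        # Return (values, na_value): values to hash plus the missing marker.
        return self._data, None

    @classmethod
    def _from_factorized(cls, values, original):
        # Rebuild an array of the same type from the unique values.
        return cls(values)

    # --- methods derived from the hooks ---
    def argsort(self):
        keys = self._values_for_argsort()
        return sorted(range(len(keys)), key=keys.__getitem__)

    def factorize(self):
        values, na_value = self._values_for_factorize()
        codes, uniques, seen = [], [], {}
        for v in values:
            if v is na_value:
                codes.append(-1)  # missing values are coded as -1
                continue
            if v not in seen:
                seen[v] = len(uniques)
                uniques.append(v)
            codes.append(seen[v])
        return codes, type(self)._from_factorized(uniques, self)


arr = ToyArray([3, None, 1, 3])
order = arr.argsort()            # positions that would sort the array
codes, uniques = arr.factorize()  # integer codes plus a ToyArray of uniques
```

The point of the sketch is that an author only writes the three hooks and gets argsort/factorize for free; the proposal here is to stop relying on that indirection.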
We should deprecate these patterns entirely, for a few reasons:
- The status quo regarding whether these are required/encouraged is confusing. The solution is to have less stuff.
- Because the default vff casts to object, any place we use it on an EA that doesn't override it is slow.
  - In factorize_keys we special-case MaskedDtype and ArrowDtype to avoid this performance hit. That special-casing is a code smell.
- The merge._factorize_keys usage means authors cannot override a cast to numpy. This would be a huge pain point for potential GPU/distributed EAs.
Implementation-wise, a deprecation could look like:
- Deprecate EA.factorize saying in the future it will raise AbstractMethodError.
- Move nargminmax to an EA method _nanargminmax.
- Deprecate EA._nanargminmax, EA.argsort, EA.rank saying in the future they will raise AbstractMethodError. Can suggest the vfa pattern for interested authors.
- Not sure exactly about merge._factorize_keys, will look into it.
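
A rough sketch of what the argsort part of that deprecation cycle could look like, using toy classes only. Everything here is an assumption made for illustration: the class names, the argsort_via_vfa helper, the warning text, and the choice of FutureWarning are all hypothetical, and AbstractMethodError is a local stand-in for pandas.errors.AbstractMethodError.

```python
import warnings


class AbstractMethodError(NotImplementedError):
    """Toy stand-in for pandas.errors.AbstractMethodError."""


def argsort_via_vfa(arr):
    # Hypothetical helper: the "vfa pattern" an author could opt into.
    keys = arr._values_for_argsort()
    return sorted(range(len(keys)), key=keys.__getitem__)


class ToyBaseArray:
    """During the deprecation period: the default still works but warns."""

    def __init__(self, data):
        self._data = list(data)

    def _values_for_argsort(self):
        return self._data

    def argsort(self):
        warnings.warn(
            "The default argsort is deprecated; in the future it will raise "
            "AbstractMethodError. Implement argsort on the subclass.",
            FutureWarning,  # assumed category, not decided in this proposal
            stacklevel=2,
        )
        return argsort_via_vfa(self)


class OptedInArray(ToyBaseArray):
    """An author who overrides argsort explicitly sees no warning."""

    def argsort(self):
        return argsort_via_vfa(self)


class ToyFutureBaseArray(ToyBaseArray):
    """After the deprecation is enforced: no default implementation."""

    def argsort(self):
        raise AbstractMethodError("argsort must be implemented by the subclass")
```

Authors who like the vfa approach keep it by calling the helper from their own argsort; everyone else is free to implement argsort without ever producing a numpy-castable key array.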