Description
I've been giving some thought to how we can move towards having nullable integer/bool dtypes by default (from the ice cream agreement last August).
Terminology note: I am using "nullable" to mean "supports some missing sentinel without taking a stance on what that sentinel is or what semantics it has".
On the user end, I think it will need to be opt-in for a while. This can mirror the pyarrow-hybrid string future option. In the medium term, we can implement hybrid Integer/Boolean dtypes/EAs that use nan as their sentinel. This will minimize the behavior changes users see and avoid introducing mixed-propagation behavior. A subsequent deprecation cycle can move to all-propagating.
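To make the propagation distinction concrete, here is a small sketch of how comparisons behave today with a NaN sentinel versus the existing masked `Int64` dtype. It only uses behavior that already exists in pandas and NumPy; it does not show the proposed hybrid dtype, which is not implemented.

```python
import numpy as np
import pandas as pd

# Existing masked dtype: missing values are pd.NA and propagate through comparisons.
masked = pd.array([1, None, 3], dtype="Int64")
masked_eq = masked == 1   # BooleanArray: [True, <NA>, False]

# NaN sentinel (float64 here, since numpy ints cannot hold NaN):
# comparisons with NaN evaluate to False rather than propagating.
floats = np.array([1.0, np.nan, 3.0])
float_eq = floats == 1    # ndarray: [True, False, False]

print(masked_eq)
print(float_eq)
```

A hybrid dtype using nan as its sentinel would presumably keep the second behavior, so users opting in would not see comparison results change until the later deprecation cycle.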
Open Questions
- Do we disallow numpy int/bool dtypes entirely?
- Lots of users have legacy code that says `dtype=np.int64`; do we warn/raise or map that to the future dtype (assuming the user has opted in)?
- Similarly if they do `df.dtypes == np.int64`?
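For reference, a quick sketch of how the `df.dtypes == np.int64` pattern behaves today against a numpy-backed frame versus one using the existing masked `Int64` dtype. The masked dtype stands in for whatever the future default ends up being, which is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Legacy-style construction with an explicit numpy dtype.
df_numpy = pd.DataFrame({"a": [1, 2, 3]}, dtype=np.int64)

# Same data with the existing masked extension dtype.
df_masked = df_numpy.astype("Int64")

# The legacy check passes for numpy-backed columns...
numpy_check = (df_numpy.dtypes == np.int64).all()

# ...and silently evaluates to False for the extension dtype.
masked_check = (df_masked.dtypes == np.int64).all()

# A check that does not depend on the backing implementation.
agnostic_check = pd.api.types.is_integer_dtype(df_masked["a"].dtype)

print(numpy_check, masked_check, agnostic_check)
```

If `dtype=np.int64` were mapped to the future dtype on input, a comparison like this would need matching treatment, or user code would construct with one spelling and fail to match on the other.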
Now that I write that out, I'm talking myself into being strict on this front and avoiding headaches down the road.
Thoughts?