Skip to content

CLN: unify NumpyBlock, ObjectBlock, and NumericBlock #52817

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
May 24, 2023

Conversation

ngoldbaum
Copy link
Contributor

@ngoldbaum ngoldbaum commented Apr 20, 2023

This PR unifies NumpyBlock, ObjectBlock, and NumericBlock into NumpyBlock.

xref #52413

This is part of my effort to make it easier for pandas to support new-style dtypes like the variable width string dtype I'm working on that will hopefully replace usages of object string arrays in pandas (#47884). New-style dtypes have dtype.kind set to a null character, so using dtype.kind to discriminate between numpy dtypes as pandas.core.internals.blocks.get_block_type currently does makes it more difficult to support new-style dtypes, as I'd need to add additional checks there that will hurt performance in a hot code path. @jbrockmendel suggested this approach instead.

It wasn't clear to me what the backward compatibility guarantees are for pandas.core.internals. I think adding an ObjectBlock and NumericBlock that are (deprecated?) aliases to NumpyBlock will be sufficient to maintain back compat if that's desired.

As a bonus, it's no longer necessary to check dtype.kind twice in the block creation fast path - I moved the second check into the new NumpyBlock.is_numeric cached property - so I think this may be a tiny bit faster for benchmarks sensitive to block creation overhead.

@rhshadrach rhshadrach added Internals Related to non-user accessible pandas implementation Clean labels Apr 21, 2023
@jbrockmendel
Copy link
Member

It wasn't clear to me what the backward compatibility guarantees are for pandas.core.internals. I think adding an ObjectBlock and NumericBlock that are (deprecated?) aliases to NumpyBlock will be sufficient to maintain back compat if that's desired.

Its complicated. We've landed on calling pandas.core.internals.api "pseudo-public" and acknowledging that some packages (pyarrow) access them even though they're not really public. Last time we removed a Block subclass there were complaints and we ended up reverting it and changing the removal to a deprecation (which was enforced in 2.0, so now it is gone).

@ngoldbaum
Copy link
Contributor Author

OK, I’ll go ahead and add deprecated aliases. Thanks for the context and the prior art pointer. Will hopefully be pushing an update today that fixes the tests and addresses the comments.

@ngoldbaum ngoldbaum force-pushed the unify-numpy-blocks branch from b8c3fa0 to 7aa66d5 Compare April 21, 2023 18:01
@ngoldbaum
Copy link
Contributor Author

Is there anything I can do on my end to move this forward?

@ngoldbaum ngoldbaum force-pushed the unify-numpy-blocks branch from 9dffc8a to 7819691 Compare May 1, 2023 16:24
@ngoldbaum ngoldbaum force-pushed the unify-numpy-blocks branch from 648cc6c to f140bbd Compare May 10, 2023 14:48
@ngoldbaum
Copy link
Contributor Author

Rebased to fix a conflict.

@@ -387,7 +384,7 @@ def test_constructor_no_pandas_array(self):
result = DataFrame({"A": arr})
expected = DataFrame({"A": [1, 2, 3]})
tm.assert_frame_equal(result, expected)
assert isinstance(result._mgr.blocks[0], NumericBlock)
assert isinstance(result._mgr.blocks[0], NumpyBlock)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check is_numeric here?


@cache_readonly
def is_object(self) -> bool: # type: ignore[override]
return self.values.dtype.kind == "O"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there dtypes other than object-dtype that have kind=="O"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ngoldbaum ngoldbaum force-pushed the unify-numpy-blocks branch from f140bbd to e6f9b3f Compare May 11, 2023 17:23
@ngoldbaum
Copy link
Contributor Author

Would appreciate another round of review. The test failure is unrelated.

@jbrockmendel
Copy link
Member

LGTM @ngoldbaum can you merge main one more time to be on the safe side

@mroeschke mroeschke added this to the 2.1 milestone May 24, 2023
@mroeschke mroeschke merged commit 5a0230b into pandas-dev:main May 24, 2023
@mroeschke
Copy link
Member

Nice work @ngoldbaum

topper-123 pushed a commit to topper-123/pandas that referenced this pull request May 27, 2023
* CLN: unify NumpyBlock, ObjectBlock, and NumericBlock

* CLN: respond to review comments

* CLN: deprecate ObjectBlock and NumericBlock

* CLN: appease mypy

* CLN: remove out-of-date reference to maybe_split

* CLN: respond to review comments

* CLN: remove NumpyBlock from the semi-public API

* CLN: test for is_numeric in block internals tests
topper-123 pushed a commit to topper-123/pandas that referenced this pull request Jun 5, 2023
* CLN: unify NumpyBlock, ObjectBlock, and NumericBlock

* CLN: respond to review comments

* CLN: deprecate ObjectBlock and NumericBlock

* CLN: appease mypy

* CLN: remove out-of-date reference to maybe_split

* CLN: respond to review comments

* CLN: remove NumpyBlock from the semi-public API

* CLN: test for is_numeric in block internals tests
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
* CLN: unify NumpyBlock, ObjectBlock, and NumericBlock

* CLN: respond to review comments

* CLN: deprecate ObjectBlock and NumericBlock

* CLN: appease mypy

* CLN: remove out-of-date reference to maybe_split

* CLN: respond to review comments

* CLN: remove NumpyBlock from the semi-public API

* CLN: test for is_numeric in block internals tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean Internals Related to non-user accessible pandas implementation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants