Skip to content

gh-116909: fix data race with versions in typeobject #134651

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

duaneg
Copy link
Contributor

@duaneg duaneg commented May 25, 2025

Global state _PyRuntime.types.next_version_tag is being accessed without synchronization or atomics. This could potentially result in the same version being used in two different type objects, incrementing past the maximum limit, and the usual non-atomic memory access issues.

Fix this by using atomics to ensure the version is accessed and updated in a race-free manner, while also ensuring it is never incremented past the expected (maximum + 1).

Note there is a theoretical change in behaviour with the second use, for static builtin types, if their versions exceed the maximum, with assertions disabled: previously they would have continued incrementing past the maximum and using the increasing version number; now they will all use the maximum. It might be better to replace the assert with an abort(): I assume we would be crashing shortly if we continue after this, both before and after this change.

Global state `_PyRuntime.types.next_version_tag` is being accessed without
synchronization or atomics. This could potentially result in the same version
being used in two different type objects, incrementing past the maximum limit,
and the usual non-atomic memory access issues.

Fix this by using atomics to ensure the version is accessed and updated in a
race-free manner, while also ensuring it is never incremented past the expected
(maximum + 1).

Note there is a theoretical change in behaviour with the second use, for static
builtin types, if their versions exceed the maximum, with assertions disabled:
previously they would have continued incrementing past the maximum and using
the increasing version number; now they will all use the maximum. It might be
better to replace the `assert` with an `abort()`: I assume we would be crashing
shortly if we continue after this, both before and after this change.
Copy link
Member

@ZeroIntensity ZeroIntensity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this. I'm not sure we're able to skip the subinterpreter test yet, because there are still races when doing non-atomic stores of the version on static types. I've looked into fixing those myself, and it's signficantly more complex.

@duaneg
Copy link
Contributor Author

duaneg commented May 26, 2025

Thanks for doing this

My pleasure!

I'm not sure we're able to skip the subinterpreter test yet, because there are still races when doing non-atomic stores of the version on static types. I've looked into fixing those myself, and it's signficantly more complex.

Hmm, yes. I had tested running it with thread sanitizer enabled, with-and-without the GIL, and didn't trigger any data race reports. However, adding a new test specifically to try and exercise static type initialization indeed shows a bunch of problems remain. Not only with version, but various other accesses (most often tp_flags).

I suppose those problems may also be hit with the existing test, depending on timing. I'll drop re-enabling that test for now. I might come back and try and fix some of these issues separately, if I have time and no-one else gets there first.

For reference, the test I used:

    @requires_subinterpreters
    @threading_helper.requires_working_threading()
    def test_static_type_initialization(self):

        # This is specifically testing static type initialization thread safety
        # and is intended to be run with the thread-santizer enabled
        def init(barrier):
            interpid = _interpreters.create()
            barrier.wait()
            try:
                _interpreters.run_string(interpid, "import datetime")
            finally:
                _interpreters.destroy(interpid)

        N = 3
        ITERATIONS = 10
        barrier = threading.Barrier(N)
        for _ in range(ITERATIONS):
            threads = [threading.Thread(target=init, args=(barrier,))
                       for _ in range(N)]
            with threading_helper.start_threads(threads):
                pass

I'm sure using subinterpreters is not strictly required, but they seemed like an easy way to reliably run static type initialisation. Note that if assertions are enabled this test immediately fails:

Objects/typeobject.c:208: managed_static_type_state_init: Assertion `!managed_static_type_index_is_set(self)' failed.

It definitely looks like this is all very racy.

Copy link
Member

@ZeroIntensity ZeroIntensity left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants