-
Notifications
You must be signed in to change notification settings - Fork 13.5k
[mlir][python] Make the Context/Operation capsule creation methods work as documented. #76010
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[mlir][python] Make the Context/Operation capsule creation methods work as documented. #76010
Conversation
…rk as documented.
@llvm/pr-subscribers-mlir Author: Stella Laurenzo (stellaraccident) ChangesThis fixes a longstanding bug in the This patch corrects the semantics and will only do an ownership transfer for It also adds some Full diff: https://github.com/llvm/llvm-project/pull/76010.diff 4 Files Affected:
diff --git a/mlir/lib/Bindings/Python/IRCore.cpp b/mlir/lib/Bindings/Python/IRCore.cpp
index 5412c3dec4b1b6..39757dfad5be1d 100644
--- a/mlir/lib/Bindings/Python/IRCore.cpp
+++ b/mlir/lib/Bindings/Python/IRCore.cpp
@@ -602,7 +602,7 @@ py::object PyMlirContext::createFromCapsule(py::object capsule) {
MlirContext rawContext = mlirPythonCapsuleToContext(capsule.ptr());
if (mlirContextIsNull(rawContext))
throw py::error_already_set();
- return forContext(rawContext).releaseObject();
+ return stealExternalContext(rawContext).releaseObject();
}
PyMlirContext *PyMlirContext::createNewContextForInit() {
@@ -615,18 +615,35 @@ PyMlirContextRef PyMlirContext::forContext(MlirContext context) {
auto &liveContexts = getLiveContexts();
auto it = liveContexts.find(context.ptr);
if (it == liveContexts.end()) {
- // Create.
- PyMlirContext *unownedContextWrapper = new PyMlirContext(context);
- py::object pyRef = py::cast(unownedContextWrapper);
- assert(pyRef && "cast to py::object failed");
- liveContexts[context.ptr] = unownedContextWrapper;
- return PyMlirContextRef(unownedContextWrapper, std::move(pyRef));
+ throw std::runtime_error(
+ "Cannot use a context that is not owned by the Python bindings.");
}
+
// Use existing.
py::object pyRef = py::cast(it->second);
return PyMlirContextRef(it->second, std::move(pyRef));
}
+PyMlirContextRef PyMlirContext::stealExternalContext(MlirContext context) {
+ py::gil_scoped_acquire acquire;
+ auto &liveContexts = getLiveContexts();
+ auto it = liveContexts.find(context.ptr);
+ if (it != liveContexts.end()) {
+ throw std::runtime_error(
+ "Cannot transfer ownership of the context to Python "
+ "as it is already owned by Python.");
+ }
+
+ PyMlirContext *unownedContextWrapper = new PyMlirContext(context);
+ // Note that the default return value policy on cast is automatic_reference,
+ // which does not take ownership (delete will not be called).
+ // Just be explicit.
+ py::object pyRef =
+ py::cast(unownedContextWrapper, py::return_value_policy::take_ownership);
+ assert(pyRef && "cast to py::object failed");
+ return PyMlirContextRef(unownedContextWrapper, std::move(pyRef));
+}
+
PyMlirContext::LiveContextMap &PyMlirContext::getLiveContexts() {
static LiveContextMap liveContexts;
return liveContexts;
@@ -1145,6 +1162,18 @@ PyOperationRef PyOperation::forOperation(PyMlirContextRef contextRef,
return PyOperationRef(existing, std::move(pyRef));
}
+PyOperationRef PyOperation::stealExternalOperation(PyMlirContextRef contextRef,
+ MlirOperation operation) {
+ auto &liveOperations = contextRef->liveOperations;
+ auto it = liveOperations.find(operation.ptr);
+ if (it != liveOperations.end()) {
+ throw std::runtime_error(
+ "Cannot transfer ownership of the operation to Python "
+ "as it is already owned by Python.");
+ }
+ return createInstance(std::move(contextRef), operation, py::none());
+}
+
PyOperationRef PyOperation::createDetached(PyMlirContextRef contextRef,
MlirOperation operation,
py::object parentKeepAlive) {
@@ -1316,7 +1345,8 @@ py::object PyOperation::createFromCapsule(py::object capsule) {
if (mlirOperationIsNull(rawOperation))
throw py::error_already_set();
MlirContext rawCtxt = mlirOperationGetContext(rawOperation);
- return forOperation(PyMlirContext::forContext(rawCtxt), rawOperation)
+ return stealExternalOperation(PyMlirContext::forContext(rawCtxt),
+ rawOperation)
.releaseObject();
}
@@ -2548,6 +2578,16 @@ void mlir::python::populateIRCore(py::module &m) {
.def("_get_live_operation_count", &PyMlirContext::getLiveOperationCount)
.def("_clear_live_operations", &PyMlirContext::clearLiveOperations)
.def("_get_live_module_count", &PyMlirContext::getLiveModuleCount)
+ .def_static("_testing_create_raw_context_capsule",
+ []() {
+ // Creates an MlirContext not known to the Python bindings
+ // and puts it in a capsule. Used to test interop. Using
+ // this without passing it back to the capsule creation
+ // API will leak.
+ return py::reinterpret_steal<py::object>(
+ mlirPythonContextToCapsule(
+ mlirContextCreateWithThreading(false)));
+ })
.def_property_readonly(MLIR_PYTHON_CAPI_PTR_ATTR,
&PyMlirContext::getCapsule)
.def(MLIR_PYTHON_CAPI_FACTORY_ATTR, &PyMlirContext::createFromCapsule)
@@ -2973,8 +3013,7 @@ void mlir::python::populateIRCore(py::module &m) {
py::arg("binary") = false, kOperationPrintStateDocstring)
.def("print",
py::overload_cast<std::optional<int64_t>, bool, bool, bool, bool,
- bool, py::object, bool>(
- &PyOperationBase::print),
+ bool, py::object, bool>(&PyOperationBase::print),
// Careful: Lots of arguments must match up with print method.
py::arg("large_elements_limit") = py::none(),
py::arg("enable_debug_info") = false,
@@ -3046,6 +3085,25 @@ void mlir::python::populateIRCore(py::module &m) {
.def_property_readonly(MLIR_PYTHON_CAPI_PTR_ATTR,
&PyOperation::getCapsule)
.def(MLIR_PYTHON_CAPI_FACTORY_ATTR, &PyOperation::createFromCapsule)
+ .def_static(
+ "_testing_create_raw_capsule",
+ [](std::string sourceStr) {
+ // Creates a raw context and an operation via parsing the given
+ // source and returns them in a capsule. Error handling is
+ // minimal as this is purely intended for testing interop with
+ // operation creation from capsule functions.
+ MlirContext context = mlirContextCreateWithThreading(false);
+ MlirOperation op = mlirOperationCreateParse(
+ context, toMlirStringRef(sourceStr), toMlirStringRef("temp"));
+ if (mlirOperationIsNull(op)) {
+ mlirContextDestroy(context);
+ throw std::invalid_argument("Failed to parse");
+ }
+ return py::make_tuple(py::reinterpret_steal<py::object>(
+ mlirPythonContextToCapsule(context)),
+ py::reinterpret_steal<py::object>(
+ mlirPythonOperationToCapsule(op)));
+ })
.def_property_readonly("operation", [](py::object self) { return self; })
.def_property_readonly("opview", &PyOperation::createOpView)
.def_property_readonly(
diff --git a/mlir/lib/Bindings/Python/IRModule.h b/mlir/lib/Bindings/Python/IRModule.h
index 79b7e0c96188c1..04164b78b3e250 100644
--- a/mlir/lib/Bindings/Python/IRModule.h
+++ b/mlir/lib/Bindings/Python/IRModule.h
@@ -176,8 +176,19 @@ class PyMlirContext {
static PyMlirContext *createNewContextForInit();
/// Returns a context reference for the singleton PyMlirContext wrapper for
- /// the given context.
+ /// the given context. It is only valid to call this on an MlirContext that
+ /// is already owned by the Python bindings. Typically this will be because
+ /// it came in some fashion from createNewContextForInit(). However, it
+ /// is also possible to explicitly transfer ownership of an existing
+ /// MlirContext to the Python bindings via stealExternalContext().
static PyMlirContextRef forContext(MlirContext context);
+
+ /// Explicitly takes ownership of an MlirContext that must not already be
+ /// known to the Python bindings. Once done, the life-cycle of the context
+ /// will be controlled by the Python bindings, and it will be destroyed
+ /// when the reference count goes to zero.
+ static PyMlirContextRef stealExternalContext(MlirContext context);
+
~PyMlirContext();
/// Accesses the underlying MlirContext.
@@ -606,6 +617,12 @@ class PyOperation : public PyOperationBase, public BaseContextObject {
forOperation(PyMlirContextRef contextRef, MlirOperation operation,
pybind11::object parentKeepAlive = pybind11::object());
+ /// Explicitly takes ownership of an operation that must not already be known
+ /// to the Python bindings. Once done, the life-cycle of the operation
+ /// will be controlled by the Python bindings.
+ static PyOperationRef stealExternalOperation(PyMlirContextRef contextRef,
+ MlirOperation operation);
+
/// Creates a detached operation. The operation must not be associated with
/// any existing live operation.
static PyOperationRef
diff --git a/mlir/test/python/ir/context_lifecycle.py b/mlir/test/python/ir/context_lifecycle.py
index c20270999425ee..fbd1851ba70aee 100644
--- a/mlir/test/python/ir/context_lifecycle.py
+++ b/mlir/test/python/ir/context_lifecycle.py
@@ -45,5 +45,46 @@
c4 = mlir.ir.Context()
c4_capsule = c4._CAPIPtr
assert '"mlir.ir.Context._CAPIPtr"' in repr(c4_capsule)
-c5 = mlir.ir.Context._CAPICreate(c4_capsule)
-assert c4 is c5
+# Because the context is already owned by Python, it cannot be created
+# a second time.
+try:
+ c5 = mlir.ir.Context._CAPICreate(c4_capsule)
+except RuntimeError:
+ pass
+else:
+ raise AssertionError(
+ "Should have gotten a RuntimeError when attempting to "
+ "re-create an already owned context"
+ )
+c4 = None
+c4_capsule = None
+gc.collect()
+assert mlir.ir.Context._get_live_count() == 0
+
+# Use a private testing method to create an unowned context capsule and
+# import it.
+c6_capsule = mlir.ir.Context._testing_create_raw_context_capsule()
+c6 = mlir.ir.Context._CAPICreate(c6_capsule)
+assert mlir.ir.Context._get_live_count() == 1
+c6_capsule = None
+c6 = None
+gc.collect()
+assert mlir.ir.Context._get_live_count() == 0
+
+# Also test operation import/export as it is tightly coupled to the context.
+(
+ raw_context_capsule,
+ raw_operation_capsule,
+) = mlir.ir.Operation._testing_create_raw_capsule("builtin.module {}")
+assert '"mlir.ir.Operation._CAPIPtr"' in repr(raw_operation_capsule)
+# Attempting to import an operation for an unknown context should fail.
+try:
+ mlir.ir.Operation._CAPICreate(raw_operation_capsule)
+except RuntimeError:
+ pass
+else:
+ raise AssertionError("Expected exception for unknown context")
+
+# Try again having imported the context.
+c7 = mlir.ir.Context._CAPICreate(raw_context_capsule)
+op7 = mlir.ir.Operation._CAPICreate(raw_operation_capsule)
diff --git a/mlir/test/python/ir/operation.py b/mlir/test/python/ir/operation.py
index 04f8a9936e31f7..f59b1a26ba48b5 100644
--- a/mlir/test/python/ir/operation.py
+++ b/mlir/test/python/ir/operation.py
@@ -844,19 +844,6 @@ def testOperationName():
print(op.operation.name)
-# CHECK-LABEL: TEST: testCapsuleConversions
-@run
-def testCapsuleConversions():
- ctx = Context()
- ctx.allow_unregistered_dialects = True
- with Location.unknown(ctx):
- m = Operation.create("custom.op1").operation
- m_capsule = m._CAPIPtr
- assert '"mlir.ir.Operation._CAPIPtr"' in repr(m_capsule)
- m2 = Operation._CAPICreate(m_capsule)
- assert m2 is m
-
-
# CHECK-LABEL: TEST: testOperationErase
@run
def testOperationErase():
|
It was previously thought that this was only an issue on Windows, but some subtle bugs in the ownership semantics upstream were merely manifesting there most prominently. Includes llvm/llvm-project#76010 for testing on CI.
It was previously thought that this was only an issue on Windows, but some subtle bugs in the ownership semantics upstream were merely manifesting there most prominently. Includes llvm/llvm-project#76010 for testing on CI. ci-extra: build_test_all_windows
It was previously thought that this was only an issue on Windows, but some subtle bugs in the ownership semantics upstream were merely manifesting there most prominently. Includes llvm/llvm-project#76010 for testing on CI. ci-extra: build_test_all_windows
// Note that the default return value policy on cast is automatic_reference, | ||
// which does not take ownership (delete will not be called). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious: is this specific to py::cast
and/or a bug in pybind? The docs say about return_value_policy::automatic
:
This policy falls back to the policy return_value_policy::take_ownership when the return value is a pointer... This is the default policy for
py::class_
-wrapped types.
and about return_value_policy::automatic_reference
:
As above, but use policy
return_value_policy::reference
when the return value is a pointer. This is the default conversion policy for function arguments when calling Python functions manually from C++ code (i.e. viahandle::operator()
) and the casters inpybind11/stl.h
.
Are we somehow in either of those two cases here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's somewhat poorly documented -- the behavior is specific to cast whereas the documentation is talking about the more common cases of function args/returns.
Note that I got that comment from the cast in fromOperation
, which gets it right. That path is used heavily and was fixed long ago. This branch of context creation, though, had no upstream uses and was just wrong.
Looking for the reference on how I know this. I remember researching it some time ago.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple of links below - explanation by the author and the source.
TIL:
Okay will review with a fine-toothed comb tomorrow! |
It was previously thought that this was only an issue on Windows, but some subtle bugs in the ownership semantics upstream were merely manifesting there most prominently. Includes llvm/llvm-project#76010 for testing on CI. ci-extra: build_test_all_windows
Thanks. I beefed up the downstream tests for this and enabled on all platforms. Afaict, this is the first time this feature has worked reliably: iree-org/iree#15982 I think I got the vital characteristics of it tested upstream now, so should be an overall improvement to reliability. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - I actually think this will break me in some experimental stuff in mlir-aie but solid fix nonetheless :)
If your experience mirrors mine, you may have realized soon that what you were doing was already subtly broken but just hadn't been caught yet :) |
I definitely ran into the occasional ownership issue but I've always been able to brush it under the rug by fiddling with |
@makslevental could you give an example of the wrong things downstreams were doing and how to make it right? This will help downstreams to fix themselves after they bump LLVM version past this commit. |
Well I'm doing super-gnarly stuff like shipping arbitrary return static_cast<PySwitchboxOp &>(
PyOperation::forOperation(DefaultingPyMlirContext::resolve().getRef(),
wrap(switchboxOp.getOperation()))
->getOperation()); I highly doubt anyone else downstream is doing this (I don't think anyone else knows how to...). I was actually curious what exactly @stellaraccident was experiencing that prompted this PR (there's a linked PR in iree but it doesn't say much).
My plan is to change |
Just to be clear: I don't think normal users will be broken by this - I was speaking specifically about myself because I am using |
Yes, without
|
…thods work as documented. (#76010)" This reverts commit bbc2976. This change seems to be at odds with the non-owning part semantics of MlirOperation in C API. Since downstream clients can only take and return MlirOperation, it does not sound correct to force all returns of MlirOperation transfer ownership. Specifically, this makes it impossible for downstreams to implement IR-traversing functions that, e.g., look at neighbors of an operation. The following patch triggers the exception, and there does not seem to be an alternative way for a downstream binding writer to express this: ``` diff --git a/mlir/lib/Bindings/Python/IRCore.cpp b/mlir/lib/Bindings/Python/IRCore.cpp index 39757df..2ce640674245 100644 --- a/mlir/lib/Bindings/Python/IRCore.cpp +++ b/mlir/lib/Bindings/Python/IRCore.cpp @@ -3071,6 +3071,11 @@ void mlir::python::populateIRCore(py::module &m) { py::arg("successors") = py::none(), py::arg("regions") = 0, py::arg("loc") = py::none(), py::arg("ip") = py::none(), py::arg("infer_type") = false, kOperationCreateDocstring) + .def("_get_first_in_block", [](PyOperation &self) -> MlirOperation { + MlirBlock block = mlirOperationGetBlock(self.get()); + MlirOperation first = mlirBlockGetFirstOperation(block); + return first; + }) .def_static( "parse", [](const std::string &sourceStr, const std::string &sourceName, diff --git a/mlir/test/python/ir/operation.py b/mlir/test/python/ir/operation.py index f59b1a2..6b12b8da5c24 100644 --- a/mlir/test/python/ir/operation.py +++ b/mlir/test/python/ir/operation.py @@ -24,6 +24,25 @@ def expect_index_error(callback): except IndexError: pass +@run +def testCustomBind(): + ctx = Context() + ctx.allow_unregistered_dialects = True + module = Module.parse( + r""" + func.func @F1(%arg0: i32) -> i32 { + %1 = "custom.addi"(%arg0, %arg0) : (i32, i32) -> i32 + return %1 : i32 + } + """, + ctx, + ) + add = module.body.operations[0].regions[0].blocks[0].operations[0] + op = add.operation + # This will get a reference to itself. + f1 = op._get_first_in_block() + + # Verify iterator based traversal of the op/region/block hierarchy. # CHECK-LABEL: TEST: testTraverseOpRegionBlockIterators ```
Reverting for now. I think it's either a matter of (a) updating the caster in PybindAdaptors.h so that it can interpret |
I see static handle cast(MlirOperation v, return_value_policy, handle) {
return py::module::import(MAKE_MLIR_PYTHON_QUALNAME("ir"))
.attr("Operation")
.attr(MLIR_PYTHON_CAPI_FACTORY_ATTR)(capsule)
.release();
} which could be changed to static handle cast(MlirOperation v, return_value_policy p, handle h) {
// call MLIR_PYTHON_CAPI_FACTORY_ATTR somehow differently depending on p/h
} ( This would enable downstream users to .def("_get_first_in_block", [](MlirOperation op) -> py::object {
MlirBlock block = mlirOperationGetBlock(op);
MlirOperation first = mlirBlockGetFirstOperation(block);
return py::cast(first, return_value_policy::reference);
}) in order to communicate intent. But I don't actually know what
I'll take this opportunity to beat my favorite dead horse: we could just start exposing the utilities in |
Darn - I was trying to not break that case... Which was the reason for the disjoint implementations between context and op import. Thanks for the test case. That will help in rolling this forward. Probably next week at this point. |
I might add an arg to indicate whether ownership must be transferred. The case I was running into was subtle non-determinism based on access pattern when a completely external API with its own ownership semantics was negotiating to give ownership to the python API. The implicit ownership capture of context and operation makes this very easy to get wrong with perfectly valid Python sequences. I thought it was a Windows ghost for months but it just turned out to be more sensitive to the fallout. I finally had evidence that a real program was also triggering the issue on Linux (but it was in the bowels of an integration, and what I sent upstream was my reduced test case). That is what made me trade it seriously. |
Wouldn't help me. I was dealing with something that was using the raw C API via ctypes. The contract here must be built in to the API and enforced correctly. |
Yeah, this is likely the sanest approach. Maybe we were wrong about not exposing ownership in C API types since it makes it harder to map the API to languages that do have explicit ownership. One thing that I find tricky here is the "transitive" ownership of IR objects. On the C++ side, the caller has ownership of the top-level module and potentially some detached operations. In Python, we seem to be modeling that Python user owns any operation they have a reference to, which isn't strictly necessary. Though we might not want to traverse the IR every time something returns a Same may happen with contexts by the way. We just never have a need to return an I don't know offhand if we can just mimic C++ ownership rules and somehow require Python client to keep the top-level module alive rather than say that any reference into the connected IR blob maintains all of it alive.
We can go the other way around and eat our own dogfood by actually using PybindAdaptors.h upstream... |
This fixes a longstanding bug in the
Context._CAPICreate
method whereby it was not taking ownership of the PyMlirContext wrapper when casting to a Python object. The result was minimally that all such contexts transferred in that way would leak. In addition, counter to the documentation for the_CAPICreate
helper (seemlir-c/Bindings/Python/Interop.h
) and theforContext
/forOperation
methods, we were silently upgrading any unknown context/operation pointer to steal-ownership semantics. This is dangerous and was causing some subtle bugs downstream where this facility is getting the most use.This patch corrects the semantics and will only do an ownership transfer for
_CAPICreate
, and it will further require that it is an ownership transfer (if already transferred, it was just silently succeeding). Removing the mis-aligned behavior made it clear where the downstream was doing the wrong thing.It also adds some
_testing_
functions to create unowned context and operation capsules so that this can be fully tested upstream, reworking the tests to verify the behavior.In some torture testing downstream, I was not able to trigger any memory corruption with the newly enforced semantics. When getting it wrong, a regular exception is raised.