Skip to content

[mlir][python] Make the Context/Operation capsule creation methods work as documented. #76010

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 20, 2023

Conversation

stellaraccident
Copy link
Contributor

@stellaraccident stellaraccident commented Dec 20, 2023

This fixes a longstanding bug in the Context._CAPICreate method whereby it was not taking ownership of the PyMlirContext wrapper when casting to a Python object. The result was minimally that all such contexts transferred in that way would leak. In addition, counter to the documentation for the _CAPICreate helper (see mlir-c/Bindings/Python/Interop.h) and the forContext / forOperation methods, we were silently upgrading any unknown context/operation pointer to steal-ownership semantics. This is dangerous and was causing some subtle bugs downstream where this facility is getting the most use.

This patch corrects the semantics and will only do an ownership transfer for _CAPICreate, and it will further require that it is an ownership transfer (if already transferred, it was just silently succeeding). Removing the mis-aligned behavior made it clear where the downstream was doing the wrong thing.

It also adds some _testing_ functions to create unowned context and operation capsules so that this can be fully tested upstream, reworking the tests to verify the behavior.

In some torture testing downstream, I was not able to trigger any memory corruption with the newly enforced semantics. When getting it wrong, a regular exception is raised.

@stellaraccident stellaraccident marked this pull request as ready for review December 20, 2023 05:40
@llvmbot llvmbot added the mlir label Dec 20, 2023
@llvmbot
Copy link
Member

llvmbot commented Dec 20, 2023

@llvm/pr-subscribers-mlir

Author: Stella Laurenzo (stellaraccident)

Changes

This fixes a longstanding bug in the Context._CAPICreate method whereby it was not taking ownership of the PyMlirContext wrapper when casting to a Python object. The result was minimally that all such contexts transferred in that way would leak. In addition, counter to the documentation for the _CAPICreate helper (see mlir-c/Bindings/Python/Interop.h) and the forContext / forOperation methods, we were silently upgraded any unknown context/operation pointer to steal-ownership semantics. This is dangerous and was causing some subtle bugs downstream where this facility is getting the most use.

This patch corrects the semantics and will only do an ownership transfer for _CAPICreate, and it will further require that it is an ownership transfer (if already transferred, it was just silently succeeding). Removing the mis-aligned behavior made it clear where the downstream was doing the wrong thing.

It also adds some _testing_ functions to create unowned context and operation capsules so that this can be fully tested upstream, reworking the tests to verify the behavior.


Full diff: https://github.com/llvm/llvm-project/pull/76010.diff

4 Files Affected:

  • (modified) mlir/lib/Bindings/Python/IRCore.cpp (+68-10)
  • (modified) mlir/lib/Bindings/Python/IRModule.h (+18-1)
  • (modified) mlir/test/python/ir/context_lifecycle.py (+43-2)
  • (modified) mlir/test/python/ir/operation.py (-13)
diff --git a/mlir/lib/Bindings/Python/IRCore.cpp b/mlir/lib/Bindings/Python/IRCore.cpp
index 5412c3dec4b1b6..39757dfad5be1d 100644
--- a/mlir/lib/Bindings/Python/IRCore.cpp
+++ b/mlir/lib/Bindings/Python/IRCore.cpp
@@ -602,7 +602,7 @@ py::object PyMlirContext::createFromCapsule(py::object capsule) {
   MlirContext rawContext = mlirPythonCapsuleToContext(capsule.ptr());
   if (mlirContextIsNull(rawContext))
     throw py::error_already_set();
-  return forContext(rawContext).releaseObject();
+  return stealExternalContext(rawContext).releaseObject();
 }
 
 PyMlirContext *PyMlirContext::createNewContextForInit() {
@@ -615,18 +615,35 @@ PyMlirContextRef PyMlirContext::forContext(MlirContext context) {
   auto &liveContexts = getLiveContexts();
   auto it = liveContexts.find(context.ptr);
   if (it == liveContexts.end()) {
-    // Create.
-    PyMlirContext *unownedContextWrapper = new PyMlirContext(context);
-    py::object pyRef = py::cast(unownedContextWrapper);
-    assert(pyRef && "cast to py::object failed");
-    liveContexts[context.ptr] = unownedContextWrapper;
-    return PyMlirContextRef(unownedContextWrapper, std::move(pyRef));
+    throw std::runtime_error(
+        "Cannot use a context that is not owned by the Python bindings.");
   }
+
   // Use existing.
   py::object pyRef = py::cast(it->second);
   return PyMlirContextRef(it->second, std::move(pyRef));
 }
 
+PyMlirContextRef PyMlirContext::stealExternalContext(MlirContext context) {
+  py::gil_scoped_acquire acquire;
+  auto &liveContexts = getLiveContexts();
+  auto it = liveContexts.find(context.ptr);
+  if (it != liveContexts.end()) {
+    throw std::runtime_error(
+        "Cannot transfer ownership of the context to Python "
+        "as it is already owned by Python.");
+  }
+
+  PyMlirContext *unownedContextWrapper = new PyMlirContext(context);
+  // Note that the default return value policy on cast is automatic_reference,
+  // which does not take ownership (delete will not be called).
+  // Just be explicit.
+  py::object pyRef =
+      py::cast(unownedContextWrapper, py::return_value_policy::take_ownership);
+  assert(pyRef && "cast to py::object failed");
+  return PyMlirContextRef(unownedContextWrapper, std::move(pyRef));
+}
+
 PyMlirContext::LiveContextMap &PyMlirContext::getLiveContexts() {
   static LiveContextMap liveContexts;
   return liveContexts;
@@ -1145,6 +1162,18 @@ PyOperationRef PyOperation::forOperation(PyMlirContextRef contextRef,
   return PyOperationRef(existing, std::move(pyRef));
 }
 
+PyOperationRef PyOperation::stealExternalOperation(PyMlirContextRef contextRef,
+                                                   MlirOperation operation) {
+  auto &liveOperations = contextRef->liveOperations;
+  auto it = liveOperations.find(operation.ptr);
+  if (it != liveOperations.end()) {
+    throw std::runtime_error(
+        "Cannot transfer ownership of the operation to Python "
+        "as it is already owned by Python.");
+  }
+  return createInstance(std::move(contextRef), operation, py::none());
+}
+
 PyOperationRef PyOperation::createDetached(PyMlirContextRef contextRef,
                                            MlirOperation operation,
                                            py::object parentKeepAlive) {
@@ -1316,7 +1345,8 @@ py::object PyOperation::createFromCapsule(py::object capsule) {
   if (mlirOperationIsNull(rawOperation))
     throw py::error_already_set();
   MlirContext rawCtxt = mlirOperationGetContext(rawOperation);
-  return forOperation(PyMlirContext::forContext(rawCtxt), rawOperation)
+  return stealExternalOperation(PyMlirContext::forContext(rawCtxt),
+                                rawOperation)
       .releaseObject();
 }
 
@@ -2548,6 +2578,16 @@ void mlir::python::populateIRCore(py::module &m) {
       .def("_get_live_operation_count", &PyMlirContext::getLiveOperationCount)
       .def("_clear_live_operations", &PyMlirContext::clearLiveOperations)
       .def("_get_live_module_count", &PyMlirContext::getLiveModuleCount)
+      .def_static("_testing_create_raw_context_capsule",
+                  []() {
+                    // Creates an MlirContext not known to the Python bindings
+                    // and puts it in a capsule. Used to test interop. Using
+                    // this without passing it back to the capsule creation
+                    // API will leak.
+                    return py::reinterpret_steal<py::object>(
+                        mlirPythonContextToCapsule(
+                            mlirContextCreateWithThreading(false)));
+                  })
       .def_property_readonly(MLIR_PYTHON_CAPI_PTR_ATTR,
                              &PyMlirContext::getCapsule)
       .def(MLIR_PYTHON_CAPI_FACTORY_ATTR, &PyMlirContext::createFromCapsule)
@@ -2973,8 +3013,7 @@ void mlir::python::populateIRCore(py::module &m) {
            py::arg("binary") = false, kOperationPrintStateDocstring)
       .def("print",
            py::overload_cast<std::optional<int64_t>, bool, bool, bool, bool,
-                             bool, py::object, bool>(
-               &PyOperationBase::print),
+                             bool, py::object, bool>(&PyOperationBase::print),
            // Careful: Lots of arguments must match up with print method.
            py::arg("large_elements_limit") = py::none(),
            py::arg("enable_debug_info") = false,
@@ -3046,6 +3085,25 @@ void mlir::python::populateIRCore(py::module &m) {
       .def_property_readonly(MLIR_PYTHON_CAPI_PTR_ATTR,
                              &PyOperation::getCapsule)
       .def(MLIR_PYTHON_CAPI_FACTORY_ATTR, &PyOperation::createFromCapsule)
+      .def_static(
+          "_testing_create_raw_capsule",
+          [](std::string sourceStr) {
+            // Creates a raw context and an operation via parsing the given
+            // source and returns them in a capsule. Error handling is
+            // minimal as this is purely intended for testing interop with
+            // operation creation from capsule functions.
+            MlirContext context = mlirContextCreateWithThreading(false);
+            MlirOperation op = mlirOperationCreateParse(
+                context, toMlirStringRef(sourceStr), toMlirStringRef("temp"));
+            if (mlirOperationIsNull(op)) {
+              mlirContextDestroy(context);
+              throw std::invalid_argument("Failed to parse");
+            }
+            return py::make_tuple(py::reinterpret_steal<py::object>(
+                                      mlirPythonContextToCapsule(context)),
+                                  py::reinterpret_steal<py::object>(
+                                      mlirPythonOperationToCapsule(op)));
+          })
       .def_property_readonly("operation", [](py::object self) { return self; })
       .def_property_readonly("opview", &PyOperation::createOpView)
       .def_property_readonly(
diff --git a/mlir/lib/Bindings/Python/IRModule.h b/mlir/lib/Bindings/Python/IRModule.h
index 79b7e0c96188c1..04164b78b3e250 100644
--- a/mlir/lib/Bindings/Python/IRModule.h
+++ b/mlir/lib/Bindings/Python/IRModule.h
@@ -176,8 +176,19 @@ class PyMlirContext {
   static PyMlirContext *createNewContextForInit();
 
   /// Returns a context reference for the singleton PyMlirContext wrapper for
-  /// the given context.
+  /// the given context. It is only valid to call this on an MlirContext that
+  /// is already owned by the Python bindings. Typically this will be because
+  /// it came in some fashion from createNewContextForInit(). However, it
+  /// is also possible to explicitly transfer ownership of an existing
+  /// MlirContext to the Python bindings via stealExternalContext().
   static PyMlirContextRef forContext(MlirContext context);
+
+  /// Explicitly takes ownership of an MlirContext that must not already be
+  /// known to the Python bindings. Once done, the life-cycle of the context
+  /// will be controlled by the Python bindings, and it will be destroyed
+  /// when the reference count goes to zero.
+  static PyMlirContextRef stealExternalContext(MlirContext context);
+
   ~PyMlirContext();
 
   /// Accesses the underlying MlirContext.
@@ -606,6 +617,12 @@ class PyOperation : public PyOperationBase, public BaseContextObject {
   forOperation(PyMlirContextRef contextRef, MlirOperation operation,
                pybind11::object parentKeepAlive = pybind11::object());
 
+  /// Explicitly takes ownership of an operation that must not already be known
+  /// to the Python bindings. Once done, the life-cycle of the operation
+  /// will be controlled by the Python bindings.
+  static PyOperationRef stealExternalOperation(PyMlirContextRef contextRef,
+                                               MlirOperation operation);
+
   /// Creates a detached operation. The operation must not be associated with
   /// any existing live operation.
   static PyOperationRef
diff --git a/mlir/test/python/ir/context_lifecycle.py b/mlir/test/python/ir/context_lifecycle.py
index c20270999425ee..fbd1851ba70aee 100644
--- a/mlir/test/python/ir/context_lifecycle.py
+++ b/mlir/test/python/ir/context_lifecycle.py
@@ -45,5 +45,46 @@
 c4 = mlir.ir.Context()
 c4_capsule = c4._CAPIPtr
 assert '"mlir.ir.Context._CAPIPtr"' in repr(c4_capsule)
-c5 = mlir.ir.Context._CAPICreate(c4_capsule)
-assert c4 is c5
+# Because the context is already owned by Python, it cannot be created
+# a second time.
+try:
+    c5 = mlir.ir.Context._CAPICreate(c4_capsule)
+except RuntimeError:
+    pass
+else:
+    raise AssertionError(
+        "Should have gotten a RuntimeError when attempting to "
+        "re-create an already owned context"
+    )
+c4 = None
+c4_capsule = None
+gc.collect()
+assert mlir.ir.Context._get_live_count() == 0
+
+# Use a private testing method to create an unowned context capsule and
+# import it.
+c6_capsule = mlir.ir.Context._testing_create_raw_context_capsule()
+c6 = mlir.ir.Context._CAPICreate(c6_capsule)
+assert mlir.ir.Context._get_live_count() == 1
+c6_capsule = None
+c6 = None
+gc.collect()
+assert mlir.ir.Context._get_live_count() == 0
+
+# Also test operation import/export as it is tightly coupled to the context.
+(
+    raw_context_capsule,
+    raw_operation_capsule,
+) = mlir.ir.Operation._testing_create_raw_capsule("builtin.module {}")
+assert '"mlir.ir.Operation._CAPIPtr"' in repr(raw_operation_capsule)
+# Attempting to import an operation for an unknown context should fail.
+try:
+    mlir.ir.Operation._CAPICreate(raw_operation_capsule)
+except RuntimeError:
+    pass
+else:
+    raise AssertionError("Expected exception for unknown context")
+
+# Try again having imported the context.
+c7 = mlir.ir.Context._CAPICreate(raw_context_capsule)
+op7 = mlir.ir.Operation._CAPICreate(raw_operation_capsule)
diff --git a/mlir/test/python/ir/operation.py b/mlir/test/python/ir/operation.py
index 04f8a9936e31f7..f59b1a26ba48b5 100644
--- a/mlir/test/python/ir/operation.py
+++ b/mlir/test/python/ir/operation.py
@@ -844,19 +844,6 @@ def testOperationName():
         print(op.operation.name)
 
 
-# CHECK-LABEL: TEST: testCapsuleConversions
-@run
-def testCapsuleConversions():
-    ctx = Context()
-    ctx.allow_unregistered_dialects = True
-    with Location.unknown(ctx):
-        m = Operation.create("custom.op1").operation
-        m_capsule = m._CAPIPtr
-        assert '"mlir.ir.Operation._CAPIPtr"' in repr(m_capsule)
-        m2 = Operation._CAPICreate(m_capsule)
-        assert m2 is m
-
-
 # CHECK-LABEL: TEST: testOperationErase
 @run
 def testOperationErase():

stellaraccident added a commit to iree-org/iree that referenced this pull request Dec 20, 2023
It was previously thought that this was only an issue on Windows, but some subtle bugs in the ownership semantics upstream were merely manifesting there most prominently.

Includes llvm/llvm-project#76010 for testing on CI.
stellaraccident added a commit to iree-org/iree that referenced this pull request Dec 20, 2023
It was previously thought that this was only an issue on Windows, but some subtle bugs in the ownership semantics upstream were merely manifesting there most prominently.

Includes llvm/llvm-project#76010 for testing on CI.

ci-extra: build_test_all_windows
stellaraccident added a commit to iree-org/iree that referenced this pull request Dec 20, 2023
It was previously thought that this was only an issue on Windows, but some subtle bugs in the ownership semantics upstream were merely manifesting there most prominently.

Includes llvm/llvm-project#76010 for testing on CI.

ci-extra: build_test_all_windows
Comment on lines +638 to +639
// Note that the default return value policy on cast is automatic_reference,
// which does not take ownership (delete will not be called).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious: is this specific to py::cast and/or a bug in pybind? The docs say about return_value_policy::automatic:

This policy falls back to the policy return_value_policy::take_ownership when the return value is a pointer... This is the default policy for py::class_-wrapped types.

and about return_value_policy::automatic_reference:

As above, but use policy return_value_policy::reference when the return value is a pointer. This is the default conversion policy for function arguments when calling Python functions manually from C++ code (i.e. via handle::operator()) and the casters in pybind11/stl.h.

Are we somehow in either of those two cases here?

Copy link
Contributor Author

@stellaraccident stellaraccident Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's somewhat poorly documented -- the behavior is specific to cast whereas the documentation is talking about the more common cases of function args/returns.

Note that I got that comment from the cast in fromOperation, which gets it right. That path is used heavily and was fixed long ago. This branch of context creation, though, had no upstream uses and was just wrong.

Looking for the reference on how I know this. I remember researching it some time ago.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple of links below - explanation by the author and the source.

@stellaraccident
Copy link
Contributor Author

@makslevental
Copy link
Contributor

makslevental commented Dec 20, 2023

pybind/pybind11#287 (comment)

TIL:

(You can enforce a custom return value policy as a template argument to cast.)

Okay will review with a fine-toothed comb tomorrow!

stellaraccident added a commit to iree-org/iree that referenced this pull request Dec 20, 2023
It was previously thought that this was only an issue on Windows, but some subtle bugs in the ownership semantics upstream were merely manifesting there most prominently.

Includes llvm/llvm-project#76010 for testing on CI.

ci-extra: build_test_all_windows
@stellaraccident
Copy link
Contributor Author

pybind/pybind11#287 (comment)

TIL:

(You can enforce a custom return value policy as a template argument to cast.)

Okay will review with a fine-toothed comb tomorrow!

Thanks. I beefed up the downstream tests for this and enabled on all platforms. Afaict, this is the first time this feature has worked reliably: iree-org/iree#15982

I think I got the vital characteristics of it tested upstream now, so should be an overall improvement to reliability.

Copy link
Contributor

@makslevental makslevental left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - I actually think this will break me in some experimental stuff in mlir-aie but solid fix nonetheless :)

@stellaraccident
Copy link
Contributor Author

LGTM - I actually think this will break me in some experimental stuff in mlir-aie but solid fix nonetheless :)

If your experience mirrors mine, you may have realized soon that what you were doing was already subtly broken but just hadn't been caught yet :)

@stellaraccident stellaraccident merged commit bbc2976 into llvm:main Dec 20, 2023
@stellaraccident stellaraccident deleted the py_steal_from_capsule branch December 20, 2023 20:19
@makslevental
Copy link
Contributor

If your experience mirrors mine, you may have realized soon that what you were doing was already subtly broken but just hadn't been caught yet :)

I definitely ran into the occasional ownership issue but I've always been able to brush it under the rug by fiddling with return_value_policy 😁. I look forward to trying out this API.

@ftynse
Copy link
Member

ftynse commented Dec 21, 2023

@makslevental could you give an example of the wrong things downstreams were doing and how to make it right? This will help downstreams to fix themselves after they bump LLVM version past this commit.

@makslevental
Copy link
Contributor

makslevental commented Dec 21, 2023

@makslevental could you give an example of the wrong things downstreams were doing and how to make it right? This will help downstreams to fix themselves after they bump LLVM version past this commit.

Well I'm doing super-gnarly stuff like shipping arbitrary mlir::Operations across the boundary:

 return static_cast<PySwitchboxOp &>(
      PyOperation::forOperation(DefaultingPyMlirContext::resolve().getRef(),
                                wrap(switchboxOp.getOperation()))
          ->getOperation());

I highly doubt anyone else downstream is doing this (I don't think anyone else knows how to...). return_value_policy I've had to reach for when returning arbitrary C++ functions across the boundary (another thing I doubt anyone else is doing).

I was actually curious what exactly @stellaraccident was experiencing that prompted this PR (there's a linked PR in iree but it doesn't say much).

This will help downstreams to fix themselves after they bump LLVM version past this commit.

My plan is to change forOperation -> stealExternalOperation and see what happens. Not very scientific unfortunately.

@makslevental
Copy link
Contributor

Just to be clear: I don't think normal users will be broken by this - I was speaking specifically about myself because I am using forOperation outside of lib/Bindings (which isn't normally possible).

@ftynse
Copy link
Member

ftynse commented Dec 21, 2023

Yes, without forOpeation, users are broken. This basically makes it impossible for "normal" users to return a non-owning reference to an operation. I don't think that's what we want. Here's the minimal example that breaks upstream. Note that anybody downstream cannot return something else than MlirOperation AFAIK. Upstream doesn't have a problem because it can, and does, return a PyOpreationRef.

diff --git a/mlir/lib/Bindings/Python/IRCore.cpp b/mlir/lib/Bindings/Python/IRCore.cpp
index 39757dfad5be..2ce640674245 100644
--- a/mlir/lib/Bindings/Python/IRCore.cpp
+++ b/mlir/lib/Bindings/Python/IRCore.cpp
@@ -3071,6 +3071,11 @@ void mlir::python::populateIRCore(py::module &m) {
                   py::arg("successors") = py::none(), py::arg("regions") = 0,
                   py::arg("loc") = py::none(), py::arg("ip") = py::none(),
                   py::arg("infer_type") = false, kOperationCreateDocstring)
+      .def("_get_first_in_block", [](PyOperation &self) -> MlirOperation {
+        MlirBlock block = mlirOperationGetBlock(self.get());
+        MlirOperation first = mlirBlockGetFirstOperation(block);
+        return first;
+      })
       .def_static(
           "parse",
           [](const std::string &sourceStr, const std::string &sourceName,
diff --git a/mlir/test/python/ir/operation.py b/mlir/test/python/ir/operation.py
index f59b1a26ba48..6b12b8da5c24 100644
--- a/mlir/test/python/ir/operation.py
+++ b/mlir/test/python/ir/operation.py
@@ -24,6 +24,25 @@ def expect_index_error(callback):
     except IndexError:
         pass
 
+@run
+def testCustomBind():
+    ctx = Context()
+    ctx.allow_unregistered_dialects = True
+    module = Module.parse(
+        r"""
+    func.func @f1(%arg0: i32) -> i32 {
+      %1 = "custom.addi"(%arg0, %arg0) : (i32, i32) -> i32
+      return %1 : i32
+    }
+  """,
+        ctx,
+    )
+    add = module.body.operations[0].regions[0].blocks[0].operations[0]
+    op = add.operation
+    # This will get a reference to itself.
+    f1 = op._get_first_in_block()
+
+
 
 # Verify iterator based traversal of the op/region/block hierarchy.
 # CHECK-LABEL: TEST: testTraverseOpRegionBlockIterators

ftynse added a commit that referenced this pull request Dec 21, 2023
…thods work as documented. (#76010)"

This reverts commit bbc2976.

This change seems to be at odds with the non-owning part semantics of
MlirOperation in C API. Since downstream clients can only take and
return MlirOperation, it does not sound correct to force all returns of
MlirOperation transfer ownership. Specifically, this makes it impossible
for downstreams to implement IR-traversing functions that, e.g., look at
neighbors of an operation.

The following patch triggers the exception, and there does not seem to
be an alternative way for a downstream binding writer to express this:

```
diff --git a/mlir/lib/Bindings/Python/IRCore.cpp b/mlir/lib/Bindings/Python/IRCore.cpp
index 39757df..2ce640674245 100644
--- a/mlir/lib/Bindings/Python/IRCore.cpp
+++ b/mlir/lib/Bindings/Python/IRCore.cpp
@@ -3071,6 +3071,11 @@ void mlir::python::populateIRCore(py::module &m) {
                   py::arg("successors") = py::none(), py::arg("regions") = 0,
                   py::arg("loc") = py::none(), py::arg("ip") = py::none(),
                   py::arg("infer_type") = false, kOperationCreateDocstring)
+      .def("_get_first_in_block", [](PyOperation &self) -> MlirOperation {
+        MlirBlock block = mlirOperationGetBlock(self.get());
+        MlirOperation first = mlirBlockGetFirstOperation(block);
+        return first;
+      })
       .def_static(
           "parse",
           [](const std::string &sourceStr, const std::string &sourceName,
diff --git a/mlir/test/python/ir/operation.py b/mlir/test/python/ir/operation.py
index f59b1a2..6b12b8da5c24 100644
--- a/mlir/test/python/ir/operation.py
+++ b/mlir/test/python/ir/operation.py
@@ -24,6 +24,25 @@ def expect_index_error(callback):
     except IndexError:
         pass

+@run
+def testCustomBind():
+    ctx = Context()
+    ctx.allow_unregistered_dialects = True
+    module = Module.parse(
+        r"""
+    func.func @F1(%arg0: i32) -> i32 {
+      %1 = "custom.addi"(%arg0, %arg0) : (i32, i32) -> i32
+      return %1 : i32
+    }
+  """,
+        ctx,
+    )
+    add = module.body.operations[0].regions[0].blocks[0].operations[0]
+    op = add.operation
+    # This will get a reference to itself.
+    f1 = op._get_first_in_block()
+
+

 # Verify iterator based traversal of the op/region/block hierarchy.
 # CHECK-LABEL: TEST: testTraverseOpRegionBlockIterators
```
@ftynse
Copy link
Member

ftynse commented Dec 21, 2023

Reverting for now.

I think it's either a matter of (a) updating the caster in PybindAdaptors.h so that it can interpret MlirOperation as reference when the corresponding operation itself (and maybe one of its ancestors) is already owned by python or (b) somehow providing a mechanism for downstream binding implementers to indicate whether they intend to transfer ownership. But I'm rusty on this whole ownership business.

@makslevental
Copy link
Contributor

makslevental commented Dec 21, 2023

so that it can interpret MlirOperation as reference when the corresponding operation itself (and maybe one of its ancestors) is already owned by python or (b) somehow providing a mechanism for downstream binding implementers to indicate whether they intend to transfer ownership.

I see

static handle cast(MlirOperation v, return_value_policy, handle) {
  return py::module::import(MAKE_MLIR_PYTHON_QUALNAME("ir"))
        .attr("Operation")
        .attr(MLIR_PYTHON_CAPI_FACTORY_ATTR)(capsule)
        .release();
}

which could be changed to

static handle cast(MlirOperation v, return_value_policy p, handle h) {
  // call MLIR_PYTHON_CAPI_FACTORY_ATTR somehow differently depending on p/h
}

(h is the parent object).

This would enable downstream users to

.def("_get_first_in_block", [](MlirOperation op) -> py::object {
  MlirBlock block = mlirOperationGetBlock(op);
  MlirOperation first = mlirBlockGetFirstOperation(block);
  return py::cast(first, return_value_policy::reference);
})

in order to communicate intent.

But I don't actually know what // call MLIR_PYTHON_CAPI_FACTORY_ATTR somehow differently depending on p/h should be - the call to _CAPICreate is the important piece and it's a call through the python API (which we don't want to expand to support something like a return_value_policy arg?).

Upstream doesn't have a problem because it can, and does, return a PyOpreationRef.

I'll take this opportunity to beat my favorite dead horse: we could just start exposing the utilities in lib/Bindings, such as PyOpreationRef, in order to avoid the inherent difficulties of relying on only PybindAdaptors.h to always do the right things.

@stellaraccident
Copy link
Contributor Author

Yes, without forOpeation, users are broken. This basically makes it impossible for "normal" users to return a non-owning reference to an operation. I don't think that's what we want. Here's the minimal example that breaks upstream. Note that anybody downstream cannot return something else than MlirOperation AFAIK. Upstream doesn't have a problem because it can, and does, return a PyOpreationRef.

diff --git a/mlir/lib/Bindings/Python/IRCore.cpp b/mlir/lib/Bindings/Python/IRCore.cpp
index 39757dfad5be..2ce640674245 100644
--- a/mlir/lib/Bindings/Python/IRCore.cpp
+++ b/mlir/lib/Bindings/Python/IRCore.cpp
@@ -3071,6 +3071,11 @@ void mlir::python::populateIRCore(py::module &m) {
                   py::arg("successors") = py::none(), py::arg("regions") = 0,
                   py::arg("loc") = py::none(), py::arg("ip") = py::none(),
                   py::arg("infer_type") = false, kOperationCreateDocstring)
+      .def("_get_first_in_block", [](PyOperation &self) -> MlirOperation {
+        MlirBlock block = mlirOperationGetBlock(self.get());
+        MlirOperation first = mlirBlockGetFirstOperation(block);
+        return first;
+      })
       .def_static(
           "parse",
           [](const std::string &sourceStr, const std::string &sourceName,
diff --git a/mlir/test/python/ir/operation.py b/mlir/test/python/ir/operation.py
index f59b1a26ba48..6b12b8da5c24 100644
--- a/mlir/test/python/ir/operation.py
+++ b/mlir/test/python/ir/operation.py
@@ -24,6 +24,25 @@ def expect_index_error(callback):
     except IndexError:
         pass
 
+@run
+def testCustomBind():
+    ctx = Context()
+    ctx.allow_unregistered_dialects = True
+    module = Module.parse(
+        r"""
+    func.func @f1(%arg0: i32) -> i32 {
+      %1 = "custom.addi"(%arg0, %arg0) : (i32, i32) -> i32
+      return %1 : i32
+    }
+  """,
+        ctx,
+    )
+    add = module.body.operations[0].regions[0].blocks[0].operations[0]
+    op = add.operation
+    # This will get a reference to itself.
+    f1 = op._get_first_in_block()
+
+
 
 # Verify iterator based traversal of the op/region/block hierarchy.
 # CHECK-LABEL: TEST: testTraverseOpRegionBlockIterators

Darn - I was trying to not break that case... Which was the reason for the disjoint implementations between context and op import. Thanks for the test case. That will help in rolling this forward. Probably next week at this point.

@stellaraccident
Copy link
Contributor Author

Reverting for now.

I think it's either a matter of (a) updating the caster in PybindAdaptors.h so that it can interpret MlirOperation as reference when the corresponding operation itself (and maybe one of its ancestors) is already owned by python or (b) somehow providing a mechanism for downstream binding implementers to indicate whether they intend to transfer ownership. But I'm rusty on this whole ownership business.

I might add an arg to indicate whether ownership must be transferred. The case I was running into was subtle non-determinism based on access pattern when a completely external API with its own ownership semantics was negotiating to give ownership to the python API. The implicit ownership capture of context and operation makes this very easy to get wrong with perfectly valid Python sequences. I thought it was a Windows ghost for months but it just turned out to be more sensitive to the fallout. I finally had evidence that a real program was also triggering the issue on Linux (but it was in the bowels of an integration, and what I sent upstream was my reduced test case). That is what made me trade it seriously.

@stellaraccident
Copy link
Contributor Author

I'll take this opportunity to beat my favorite dead horse: we could just start exposing the utilities in lib/Bindings, such as PyOpreationRef, in order to avoid the inherent difficulties of relying on only PybindAdaptors.h to always do the right things.

Wouldn't help me. I was dealing with something that was using the raw C API via ctypes. The contract here must be built in to the API and enforced correctly.

@ftynse
Copy link
Member

ftynse commented Dec 22, 2023

I might add an arg to indicate whether ownership must be transferred.

Yeah, this is likely the sanest approach. Maybe we were wrong about not exposing ownership in C API types since it makes it harder to map the API to languages that do have explicit ownership.

One thing that I find tricky here is the "transitive" ownership of IR objects. On the C++ side, the caller has ownership of the top-level module and potentially some detached operations. In Python, we seem to be modeling that Python user owns any operation they have a reference to, which isn't strictly necessary. Though we might not want to traverse the IR every time something returns a MlirOperation to check whether a parent is already owned.

Same may happen with contexts by the way. We just never have a need to return an MlirContext from functions because all Python code is executed in a with Context block anyway.

I don't know offhand if we can just mimic C++ ownership rules and somehow require Python client to keep the top-level module alive rather than say that any reference into the connected IR blob maintains all of it alive.

I'll take this opportunity to beat my favorite dead horse: we could just start exposing the utilities in lib/Bindings, such as PyOpreationRef, in order to avoid the inherent difficulties of relying on only PybindAdaptors.h to always do the right things.

We can go the other way around and eat our own dogfood by actually using PybindAdaptors.h upstream...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants