
Disallow dual-sync-async persistence without restarting #3737


Open
TheBlueMatt wants to merge 3 commits into main from 2025-04-no-dual-sync-async

Conversation

TheBlueMatt
Collaborator

In general, we don't expect users to persist
`ChannelMonitor[Update]`s both synchronously and asynchronously for
a single `ChannelManager` instance. If a user has implemented
asynchronous persistence, they should generally always use that,
as there is then no advantage to occasionally persisting
synchronously.

Even still, in 920d96edb6595289902f287419de2d002e2dc2ee we fixed
some bugs related to such an operation, and noted that "there isn't
much cost to supporting it". Sadly, this is not true.

Specifically, the dual-sync-async persistence flow is ill-defined
and difficult to define in a way that a user can realistically
implement.

Consider the case of a `ChannelMonitorUpdate` which is persisted
asynchronously and while it is still being persisted a new
`ChannelMonitorUpdate` is created. If the second
`ChannelMonitorUpdate` is persisted synchronously, the
`ChannelManager` will be left with a single pending
`ChannelMonitorUpdate` which is not the latest.

If we were to then restart, the latest copy of the `ChannelMonitor`
would be that without any updates, but the `ChannelManager` has a
pending `ChannelMonitorUpdate` for the next update, but not the one
after that. The user would then have to handle the replayed
`ChannelMonitorUpdate` and then find the second
`ChannelMonitorUpdate` on disk and somehow know to replay that one
as well.

Further, we currently have a bug in handling this scenario as we'll
complete all pending post-update actions when the second
`ChannelMonitorUpdate` gets persisted synchronously, even though
the first `ChannelMonitorUpdate` is still pending. While we could
rather trivially fix these issues, addressing the larger API
question above is difficult and, as we don't anticipate this
use-case being important, we simply disable it here.

Note that we continue to support it internally as some 39 tests
rely on it.

Issue highlighted by (changes to the) chanmon_consistency fuzz
target (in the next commit).
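
To make the described race concrete, here is a minimal, self-contained model of the interleaving; the types below are illustrative stand-ins for the manager's pending-update tracking, not LDK's actual API:

```rust
/// Stand-in for `ChannelMonitorUpdateStatus`.
enum PersistStatus {
    /// Persistence continues in the background (asynchronous).
    InProgress,
    /// Persistence finished before returning (synchronous).
    Completed,
}

/// Stand-in for the manager's view of unconfirmed monitor updates.
struct Manager {
    pending_update_ids: Vec<u64>,
}

impl Manager {
    fn on_persist_result(&mut self, update_id: u64, status: PersistStatus) {
        match status {
            PersistStatus::InProgress => self.pending_update_ids.push(update_id),
            // The bug described above: synchronous completion of a *later*
            // update releases all pending post-update actions, including
            // those of an earlier update that is still in flight.
            PersistStatus::Completed => self.pending_update_ids.clear(),
        }
    }
}

fn main() {
    let mut mgr = Manager { pending_update_ids: Vec::new() };
    // Update 1 is persisted asynchronously and is still being written...
    mgr.on_persist_result(1, PersistStatus::InProgress);
    // ...when update 2 is created and persisted synchronously.
    mgr.on_persist_result(2, PersistStatus::Completed);
    // The manager now believes nothing is pending, even though update 1
    // may never reach disk before a crash.
    assert!(mgr.pending_update_ids.is_empty());
}
```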

@TheBlueMatt TheBlueMatt added this to the 0.2 milestone Apr 15, 2025
@ldk-reviews-bot

ldk-reviews-bot commented Apr 15, 2025

👋 Thanks for assigning @valentinewallace as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

@valentinewallace
Contributor

We may be able to remove the test test_sync_async_persist_doesnt_hang since it specifically covers async + sync persistence.

@TheBlueMatt TheBlueMatt force-pushed the 2025-04-no-dual-sync-async branch from 4e41cb7 to 70f9409 Compare April 16, 2025 14:37
@valentinewallace
Contributor

Would it be a good take-a-Friday issue to fix the legacy tests that use both sync + async so we don't need to gate the panics anymore?

Comment on lines +2572 to +2573
/// We only support using one of [`ChannelMonitorUpdateStatus::InProgress`] and
/// [`ChannelMonitorUpdateStatus::Completed`] without restarting. Because the API does not
Contributor

Some thoughts: is there a way to enforce this with the compiler? And is it a goal of the project to continue supporting both sync + async persistence forever?

Collaborator Author

I suppose we could have some kind of flag you set on startup which indicates which persistence method you're using, but I'm not really sure that it would be an issue in practice - we expect users to use either `Persist` or some future `AsyncPersist` which will handle this stuff for them, and I don't really see why someone would ever persist sometimes-sync-sometimes-async - either your disk is slow and you need async or you don't.

We definitely want to support both at the project level, though, for the same reason we want to support async and sync everywhere in the project - Rust-async-native projects are becoming more common, but Rust-non-async projects also exist (as do platforms where non-async is required), and our current bindings are not async.
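
For illustration, a startup-flag approach might look roughly like the sketch below; `PersistenceModeGuard` and its methods are invented for this example, not the enforcement this PR actually adds (which stores the in-use mode and checks it outside of test builds):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const MODE_UNSET: u8 = 0;
const MODE_ASYNC: u8 = 1;
const MODE_SYNC: u8 = 2;

/// Records the first persistence mode observed and panics if a later
/// persist call reports the other one.
struct PersistenceModeGuard {
    mode: AtomicU8,
}

impl PersistenceModeGuard {
    fn new() -> Self {
        Self { mode: AtomicU8::new(MODE_UNSET) }
    }

    /// Call with the mode of every persist result.
    fn observe(&self, is_async: bool) {
        let observed = if is_async { MODE_ASYNC } else { MODE_SYNC };
        match self.mode.compare_exchange(
            MODE_UNSET, observed, Ordering::AcqRel, Ordering::Acquire,
        ) {
            // First persist call: the mode is now locked in.
            Ok(_) => {},
            // Later calls must match the recorded mode.
            Err(existing) => assert_eq!(
                existing, observed,
                "mixed sync and async ChannelMonitor persistence is not supported"
            ),
        }
    }
}

fn main() {
    let guard = PersistenceModeGuard::new();
    guard.observe(true); // first update persisted asynchronously
    guard.observe(true); // fine: same mode as before
    // guard.observe(false); // would panic: mixed modes
}
```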

Contributor

> We definitely want to support both at the project level, though, for the same reason we want to support async and sync everywhere in the project - Rust-async-native projects are becoming more common, but Rust-non-async projects also exist (as do platforms where non-async is required), and our current bindings are not async.

To clarify, by sync + async persistence I mean supporting either returning `ChannelMonitorUpdateStatus::Completed` or `::InProgress`. It feels like we expect most production users to persist data asynchronously and then later call `channel_monitor_updated`, so it could be reasonable to drop support for the `::Completed` option down the line. Seems like users that are the exception to that can just call `channel_monitor_updated` immediately, as well.
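
For illustration, a persister in that world could do its write synchronously yet still report `InProgress` and immediately confirm completion, so only one code path is ever exercised. A hypothetical sketch; the type and method names below are invented, not LDK's actual `Persist` trait signatures:

```rust
/// Stand-in for LDK's status enum; the real one also has `Completed`,
/// which this persister never returns.
enum ChannelMonitorUpdateStatus {
    InProgress,
}

struct MyPersister;

impl MyPersister {
    /// Persist a serialized monitor, always via the "asynchronous" path.
    fn persist_monitor(&self, serialized_monitor: &[u8]) -> ChannelMonitorUpdateStatus {
        // The write itself completes synchronously...
        self.write_to_disk(serialized_monitor);
        // ...but we still report `InProgress` and immediately signal
        // completion, so the node only ever exercises one flow.
        self.signal_monitor_updated();
        ChannelMonitorUpdateStatus::InProgress
    }

    fn write_to_disk(&self, _data: &[u8]) {
        // Stand-in for the actual storage backend.
    }

    fn signal_monitor_updated(&self) {
        // In a real node this is roughly where you would call
        // `ChainMonitor::channel_monitor_updated` for the persisted update.
    }
}

fn main() {
    let persister = MyPersister;
    let _status = persister.persist_monitor(&[0u8; 32]);
}
```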

Collaborator Author

Maybe? I imagine many small nodes will still want the simplicity of sync.

Contributor

Maybe, seems worth exploring though. Bookmarking to discuss further.

Collaborator Author

Yea, maybe when we're 100% sure of the stability of async persist? It wouldn't free up a lot of code, though, because it's all the same pathway now anyway.

Comment on lines 968 to 970
let empty_node_a_ser = node_a.encode();
let empty_node_b_ser = node_b.encode();
let empty_node_c_ser = node_c.encode();
Contributor

These don't seem to be used anywhere?

Collaborator Author

Oops, part of a later patch :)

@TheBlueMatt TheBlueMatt force-pushed the 2025-04-no-dual-sync-async branch from 70f9409 to ddccbda Compare April 16, 2025 19:58

codecov bot commented Apr 16, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 89.82%. Comparing base (83e9e80) to head (ddccbda).
Report is 19 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3737      +/-   ##
==========================================
+ Coverage   89.12%   89.82%   +0.69%     
==========================================
  Files         156      156              
  Lines      123514   129338    +5824     
  Branches   123514   129338    +5824     
==========================================
+ Hits       110086   116176    +6090     
+ Misses      10749    10554     -195     
+ Partials     2679     2608      -71     


@TheBlueMatt
Collaborator Author

> Would it be a good take-a-Friday issue to fix the legacy tests that use both sync + async so we don't need to gate the panics anymore?

Yea, I imagine it's a good bit of work, though.

@wpaulino
Contributor

Needs a squash

In general, we don't expect users to persist
`ChannelMonitor[Update]`s both synchronously and asynchronously for
a single `ChannelManager` instance. If a user has implemented
asynchronous persistence, they should generally always use that,
as there is then no advantage to occasionally persisting
synchronously.

Even still, in 920d96e we fixed
some bugs related to such an operation, and noted that "there isn't
much cost to supporting it". Sadly, this is not true.

Specifically, the dual-sync-async persistence flow is ill-defined
and difficult to define in a way that a user can realistically
implement.

Consider the case of a `ChannelMonitorUpdate` which is persisted
asynchronously and while it is still being persisted a new
`ChannelMonitorUpdate` is created. If the second
`ChannelMonitorUpdate` is persisted synchronously, the
`ChannelManager` will be left with a single pending
`ChannelMonitorUpdate` which is not the latest.

If we were to then restart, the latest copy of the `ChannelMonitor`
would be that without any updates, but the `ChannelManager` has a
pending `ChannelMonitorUpdate` for the next update, but not the one
after that. The user would then have to handle the replayed
`ChannelMonitorUpdate` and then find the second
`ChannelMonitorUpdate` on disk and somehow know to replay that one
as well.

Further, we currently have a bug in handling this scenario as we'll
complete all pending post-update actions when the second
`ChannelMonitorUpdate` gets persisted synchronously, even though
the first `ChannelMonitorUpdate` is still pending. While we could
rather trivially fix these issues, addressing the larger API
question above is difficult and, as we don't anticipate this
use-case being important, we simply disable it here.

Note that we continue to support it internally as some 39 tests
rely on it. We do, however, remove
`test_sync_async_persist_doesnt_hang`, which was added specifically
to test this now-unsupported use-case.

Issue highlighted by (changes to the) chanmon_consistency fuzz
target (in the next commit).

When we reload a node in the `chanmon_consistency` fuzzer, we
always reload with the latest `ChannelMonitor` state which was
confirmed as persisted to the running `ChannelManager`. This is
nice in that it tests losing the latest `ChannelMonitor`, but there
may also be bugs in the on-startup `ChannelMonitor` replay.

Thus, here, we optionally reload with a newer `ChannelMonitor` than
the last-persisted one.

`chanmon_consistency` was originally written with lots of macros
due to some misguided concept of code being unrolled at
compile-time. This is, of course, a terrible idea not just for
compile times but also for performance.

Here, we make `reload_node` a function in anticipation of it being
used in additional places in future work.
@TheBlueMatt TheBlueMatt force-pushed the 2025-04-no-dual-sync-async branch from ddccbda to d009b39 Compare April 29, 2025 13:42
@TheBlueMatt
Collaborator Author

Squashed with a minor comment fix:

$ git diff-tree -U1 ddccbdabd d009b3918
diff --git a/lightning/src/ln/channelmanager.rs b/lightning/src/ln/channelmanager.rs
index c86eed77b..54b607d88 100644
--- a/lightning/src/ln/channelmanager.rs
+++ b/lightning/src/ln/channelmanager.rs
@@ -2573,4 +2573,4 @@ where
 	/// [`ChannelMonitorUpdateStatus::Completed`] without restarting. Because the API does not
-	/// otherwise directly enforce this, we enforce it in debug builds here by storing which one is
-	/// in use.
+	/// otherwise directly enforce this, we enforce it in non-test builds here by storing which one
+	/// is in use.
 	#[cfg(not(test))]

@wpaulino
Contributor

These externalized tests seem to be failing:

- test_batch_funding_close_after_funding_signed
- test_batch_channel_open
