Wake background-processor on async monitor update completion #2090
Conversation
@@ -175,6 +119,9 @@ impl FutureState {
}

/// A simple future which can complete once, and calls some callback(s) when it does so.
///
/// Clones can be made and all futures cloned from the same source will complete at the same time.
#[derive(Clone)]
pub struct Future {
A bit confusing to see `Future` != `StdFuture`. Maybe rename to `Notified` or `NotifiedFuture`?
Yea, I agree, let's change it in a followup though, this PR is already kinda big.
Yeah, I certainly don't want to add more complexity / maintenance burden to LDK especially given that we're not actually using the upstream background processor. As I wrote in #2052:
In other words, the main guarantee that we're interested in is that if any event is generated which needs processing, an associated future (either the chain monitor or channel manager future will do) will complete, which will allow our background processor to immediately begin processing events (in order to minimize latency and response time). Previously, before these futures were exposed, the 100ms sleeper future acted as a catch-all to ensure that any generated events are processed within at most 100ms, which meant that the average minimum response time was about 50ms, rather than the 0ms that we want.

Now that we have the channel manager and chain monitor futures, (IIUC) the only reason the 100ms sleeper future still needs to exist is to detect if the node has been starved of CPU cycles on some platforms in order to reconnect sockets, NOT as a catch-all to process events. So it seems that the BGP could be rewritten so that events are processed ONLY IF one of those two futures completed, rather than any time the 100ms sleeper future completes. This would be a strong signal to us that the guarantee we're interested in has been satisfied.

Also, Lexe likely has more stringent CPU and performance requirements than most LDK users, so I wouldn't want you all to take on the extra complexity to reduce the CPU usage by just a factor of 2 in a BGP that we don't even use, just for us. So if you feel that most users won't need the 2x performance improvement, definitely feel free to remove the last bits which introduce complexity - as emphasized above, our primary interest in the upstream BGP is mostly as a reference to see which guarantees LDK has implicitly promised to uphold.
This is absolutely a goal, was before, and continues to be. Notably even the 100ms that provides us a backstop is an impossibly long delay to have to wait twice per payment! #2052 is, after all, a bug :)
Indeed, largely. Sadly I totally misread the constants, they say 1 second for testing, but I skimmed that as 1 second in all builds :). Thus, the proposed rewrite here makes no sense, really. Rather, we should just always sleep on the PING_TIMER's 10 second interval, and expose the concept of "am I on mobile".
I don't think Lexe is the only LDK user who will ever run a ton of nodes in one process on one server, rather y'all are just the furthest along towards building that. It's definitely a use-case we want to support, even if we have ten other use-cases we also have to support :)
Btw, we actually run our nodes as separate processes, for SGX reasons. But yeah, all the above sounds good to me!
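As a rough illustration of the loop shape described above, here is a minimal sketch of a background loop that sleeps only on the manager/monitor futures plus a coarse liveness backstop. The function and closure names here are assumptions for illustration, not LDK's actual API:

```rust
// Illustrative sketch of the event-driven loop discussed above (not LDK's actual API):
// routine event processing is driven by the channel-manager and chain-monitor futures
// completing, and the long timeout is only a liveness backstop, not a catch-all poll.
use std::time::Duration;

async fn background_loop<GetMgrFut, GetMonFut, Fut1, Fut2>(
    mut get_manager_future: GetMgrFut, // hypothetical getter for a "manager needs attention" future
    mut get_monitor_future: GetMonFut, // hypothetical getter for a "monitor update completed" future
    mut process_pending_events: impl FnMut(),
) where
    GetMgrFut: FnMut() -> Fut1,
    GetMonFut: FnMut() -> Fut2,
    Fut1: std::future::Future<Output = ()>,
    Fut2: std::future::Future<Output = ()>,
{
    loop {
        tokio::select! {
            // Either future completing means there may be events to process right now.
            _ = get_manager_future() => process_pending_events(),
            _ = get_monitor_future() => process_pending_events(),
            // Backstop sleep for periodic maintenance (e.g. noticing the process was
            // suspended and sockets need reconnecting), not for event processing.
            _ = tokio::time::sleep(Duration::from_secs(10)) => {},
        }
    }
}
```

The point of the sketch is that routine event processing is driven entirely by the two futures completing, while the timeout branch only exists for liveness and maintenance work.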
Force-pushed from 003be28 to 4f2b20b.
Replaced the last commit with one that simply bumps the sleep time to 10s based on a new option, which should suffice.
Force-pushed from 7eba7ef to 2f1fb7c.
Codecov Report
Patch coverage:

@@            Coverage Diff             @@
##             main    #2090      +/-   ##
==========================================
- Coverage   91.38%   91.37%    -0.02%
==========================================
  Files         102      102
  Lines       49767    49832       +65
  Branches    49767    49832       +65
==========================================
+ Hits        45482    45532       +50
- Misses       4285     4300       +15

... and 2 files with indirect coverage changes. View full report in Codecov by Sentry.
Will have to wait on #2107.
Force-pushed from 74775fe to c827d88.
Mh, seems we now also need to add `wait_while`/`wait_timeout_while` to `nostd_sync::Condvar`. Should be trivial I believe, as all of the methods there are dummies anyway?
Force-pushed from c827d88 to 42a630a.
Mmm, actually, yea, that's nonsense, I opted to remove the dummy `Condvar` instead and make the relevant wait methods std-only (which they should have been to begin with).
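For illustration, a sketch of the resulting shape, with the blocking wait gated on `std` where a real `Condvar` exists. The names here are assumptions rather than the actual LDK types:

```rust
// Sketch (assumed names, not LDK's real code): the blocking wait helpers only exist
// with `std`, where a real `Condvar` is available; `no-std` builds are left with the
// callback/poll-based `Future` interface only.
#[cfg(feature = "std")]
mod blocking_waits {
    use std::sync::{Condvar, Mutex};
    use std::time::Duration;

    pub struct Notifier {
        notified: Mutex<bool>,
        condvar: Condvar,
    }

    impl Notifier {
        pub fn new() -> Self {
            Notifier { notified: Mutex::new(false), condvar: Condvar::new() }
        }

        pub fn notify(&self) {
            *self.notified.lock().unwrap() = true;
            self.condvar.notify_all();
        }

        /// Blocks until notified or until `timeout` elapses; returns whether we were notified.
        pub fn wait_timeout(&self, timeout: Duration) -> bool {
            let guard = self.notified.lock().unwrap();
            let (mut guard, _timeout_result) = self
                .condvar
                .wait_timeout_while(guard, timeout, |notified| !*notified)
                .unwrap();
            std::mem::replace(&mut *guard, false)
        }
    }
}
```

Note that `Condvar::wait_timeout_while` is the std method mentioned elsewhere in this thread as newly usable under the bumped MSRV.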
Will do a full pass after squash
.wait_timeout(Duration::from_millis(10)));

// Test ordering somewhat more.
notifier_a.notify();
I would have expected this call to only complete the previous future because we were already waiting on it, rather than also completing the next future below. Is there a reason behind this?
In general, our goal is to ensure that after any notification the user can see it. The sleepers that were already running, but that we'd given up on by this point absolutely shouldn't count for that - the user has given up on them, dropped them, and moved on. We need to ensure that their next wait completes so that they will persist ASAP.
Coming back to this after re-visiting the `Notifier` (and I may be missing something), it seems this is not the case there. The future may be dropped by the user, but not by the `Notifier` itself. A `notify` would wake the "new" future, but it's really just a clone of the future they previously dropped (we test this in `notifier_future_completes_wake`). With the `Sleeper`, I guess the new `Sleeper` future could be polling futures from a distinct set of `Notifier`s than the previous `Sleeper` future dropped, so we'd definitely want to wake the new future there?
Indeed, there's no such thing as a "new" future (until we set `callbacks_made` in the old one's state), but that's not really observable to the user. As far as the user is concerned, they got a new future, which completes instantly because a notification came in after they dropped the previous future (or after the last time they poll'ed it).

At a high level, our goal is "after we call `notify`, the user will be notified" - if the user has a future and they don't sleep on it (or have already been woken on a sleeper on it for a different future), or they stopped poll'ing already, or... we will notify them the next time they go to poll an associated future. This applies no matter what the internal state of the future is - we always make sure to wake/complete a future at some point after the `notify`.

Concretely, in this test, you can imagine the BP - it is polling two notifiers - the `ChannelManager` and the `ChainMonitor`. It got notified by both, and both caused a wakeup; then it went to poll one of the futures again, but there wasn't a wake, and we timed out instead. At this point we're gonna go around the loop. If we get a notify to one of our objects while we're in the loop body processing some events, when we get back to the top we don't want to sleep - we want the new future to complete instantly.

Just because we polled before and it timed out (ignoring whether the future was drop'd - we treat it the same whether it was or not) doesn't mean we don't need to be woken up now - if some notify happens, we need to make sure the next poll/sleep wakes instantly.
That was very helpful, thanks! I think we could do a bit better with documentation here, even if it's mostly internal, just noting some of the requirements.
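A toy model of the guarantee described above (this is not the real `Notifier`, just the latching semantics): any `notify()` that has not yet been observed by a completed wait is remembered, so the next future the user polls or sleeps on completes immediately.

```rust
// Minimal model of the semantics discussed above (illustrative only): a pending
// notification is latched until *some* wait observes it, so a future obtained after
// `notify()` completes immediately even if earlier futures were dropped or timed out.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[derive(Clone, Default)]
struct ToyNotifier {
    pending: Arc<AtomicBool>,
}

impl ToyNotifier {
    fn notify(&self) {
        self.pending.store(true, Ordering::Release);
    }

    /// Models polling/sleeping on a future: returns true (and clears the latch)
    /// if a notification happened since the last successful wait.
    fn poll_future(&self) -> bool {
        self.pending.swap(false, Ordering::AcqRel)
    }
}

fn main() {
    let notifier = ToyNotifier::default();

    // A wait that times out (the "given up on" future) consumes nothing...
    assert!(!notifier.poll_future());

    // ...so a notify that lands afterwards must complete the *next* future instantly,
    // even though the user had already moved on from the previous one.
    notifier.notify();
    assert!(notifier.poll_future());
}
```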
Force-pushed from 42a630a to 8a2a9c1.
Went ahead and squashed down all the fixups.
Force-pushed from 7cd102c to 90587bc.
Had to re-add the loop-based waits in a fixup and then remove them later - reordering the commits here is... kinda tough. We want to remove the no-std condvar and make all the sleeps std-only, except first we add the sleeper, which is then used to move the pub wait methods on `ChannelManager`, which then includes moving to futures, which then allows us to remove the no-std waits... I figured it wasn't worth bothering trying to rewrite half the commits here just to reorder them; rather, adding code that's probably correct and then removing it to simplify in a later commit seemed fine.
Feel free to squash, mostly LGTM but will give it a final pass after.
LGTM after squash, but CI is not happy. Will likely do a final pass once @tnull comes around.
LGTM, only some non-blocking nits.
Force-pushed from 8e19f08 to 23325ba.
LGTM, would be ready for the squashing from my side.
Force-pushed from 23325ba to afa548b.
Squashed without further changes.
Could also use a rebase to get the MSRV builds running |
The `lightning-net-tokio` crate-level example contained a carryover from when it was the primary notifier of the background processor and now just shows an "example" of creating a method to call another method with the same parameters and then do event processing (which doesn't make sense, the BP should do that). Instead, the examples are simply removed and the documentation is tweaked to include recent changes.
Force-pushed from 128bb01 to e1ba7ff.
Nope, still not fixed in all commits:
Force-pushed from e1ba7ff to 7f1bbf5.
Lol sorry I was "fixing" the issue in the wrong place, actually fixed now.
Needs squash.
These are useful, but we previously couldn't use them due to our MSRV. Now that we can, we should use them, so we expose them via our normal debug_sync wrappers.
`Send` is rather useless on a `no-std` target - we don't have threads and are just needlessly restricting ourselves, so here we skip it for the wakers callback.
In the next commits we'll be adding a second notify pipeline - from the `ChainMonitor` back to the background processor. This will cause the `background-processor` to need to await multiple wakers at once, which we cannot do in the current scheme without taking on a full async runtime. Building a multi-future waiter isn't so bad, and notably will allow us to remove the existing sleep pipeline entirely, reducing the complexity of our wakers implementation by only having one notify path to consider.
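For intuition, here is a rough sketch of what a blocking multi-future waiter can look like when built on `std`'s `Wake` and `Condvar`. This is an illustration of the approach, not the `Sleeper` implementation added in this PR:

```rust
// Sketch of the idea behind a multi-future waiter (assumed structure): block the
// current thread until *any* of a set of futures completes, by handing each future a
// waker that signals a shared condvar.
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Condvar, Mutex};
use std::task::{Context, Poll, Wake, Waker};

struct CondvarWaker {
    notified: Mutex<bool>,
    condvar: Condvar,
}

impl Wake for CondvarWaker {
    fn wake(self: Arc<Self>) {
        *self.notified.lock().unwrap() = true;
        self.condvar.notify_all();
    }
}

/// Blocks until any of the given futures completes, returning its index.
fn sleep_on_any(mut futures: Vec<Pin<Box<dyn Future<Output = ()>>>>) -> usize {
    assert!(!futures.is_empty(), "need at least one future to wait on");
    let waker_state = Arc::new(CondvarWaker { notified: Mutex::new(false), condvar: Condvar::new() });
    let waker = Waker::from(Arc::clone(&waker_state));
    let mut cx = Context::from_waker(&waker);
    loop {
        // Poll everything once; return as soon as any future is ready.
        for (idx, fut) in futures.iter_mut().enumerate() {
            if let Poll::Ready(()) = fut.as_mut().poll(&mut cx) {
                return idx;
            }
        }
        // Nothing was ready; sleep until some waker fires, then re-poll everything.
        let mut notified = waker_state.notified.lock().unwrap();
        while !*notified {
            notified = waker_state.condvar.wait(notified).unwrap();
        }
        *notified = false;
    }
}
```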
Rather than having three ways to await a `ChannelManager` being persistable, this moves to just exposing the awaitable `Future` and having sleep functions on that.
In `no-std`, we exposed `wait` functions which rely on a dummy `Condvar` which never actually sleeps. This is somewhat nonsensical, not to mention confusing to users. Instead, we simply remove the `wait` methods in `no-std` builds.
If the `ChainMonitor` gets an async monitor update completion, this means the `ChannelManager` needs to be polled for event processing. Here we wake it using the new multi-`Future`-await `Sleeper`, or the existing `select` block in the async BP. Fixes lightningdevkit#2052.
Some users have suggested that waking every 100ms can be CPU-intensive in deployments with hundreds or thousands of nodes all running on the same machine. Thus, we add an option to the futures-based `background-processor` to avoid waking every 100ms to check for iOS having backgrounded our app and cut our TCP sockets. This cuts the normal sleep time down from 100ms to 10s, for those who turn it on.
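A sketch of the flag's intended effect (the parameter and constant names here are illustrative, not necessarily those added in this PR):

```rust
// Illustrative only: callers who don't need the 100ms mobile/backgrounding check opt
// into a much longer backstop sleep, so an idle node wakes far less often while event
// processing remains driven by the manager/monitor futures completing.
use std::time::Duration;

const DEFAULT_SLEEP: Duration = Duration::from_millis(100);
const SLOW_SLEEP: Duration = Duration::from_secs(10);

fn backstop_interval(mobile_interruptable_platform: bool) -> Duration {
    if mobile_interruptable_platform {
        // Keep the short interval so we can promptly detect the app being backgrounded
        // (and our TCP sockets being cut).
        DEFAULT_SLEEP
    } else {
        // The backstop only needs to run often enough for timer-based maintenance.
        SLOW_SLEEP
    }
}
```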
Force-pushed from 7f1bbf5 to 94a11f7.
Squashed without further changes.
0.0.115 - Apr 24, 2023 - "Rebroadcast the Bugfixes"

API Updates
===========
 * The MSRV of the main LDK crates has been increased to 1.48 (lightningdevkit#2107).
 * Attempting to claim an un-expired payment on a channel which has closed no longer fails. The expiry time of payments is exposed via `PaymentClaimable::claim_deadline` (lightningdevkit#2148).
 * `payment_metadata` is now supported in `Invoice` deserialization, sending, and receiving (via a new `RecipientOnionFields` struct) (lightningdevkit#2139, lightningdevkit#2127).
 * `Event::PaymentFailed` now exposes a failure reason (lightningdevkit#2142).
 * BOLT12 messages now support stateless generation and validation (lightningdevkit#1989).
 * The `NetworkGraph` is now pruned of stale data after RGS processing (lightningdevkit#2161).
 * Max inbound HTLCs in-flight can be changed in the handshake config (lightningdevkit#2138).
 * `lightning-transaction-sync` feature `esplora-async-https` was added (lightningdevkit#2085).
 * A `ChannelPending` event is now emitted after the initial handshake (lightningdevkit#2098).
 * `PaymentForwarded::outbound_amount_forwarded_msat` was added (lightningdevkit#2136).
 * `ChannelManager::list_channels_by_counterparty` was added (lightningdevkit#2079).
 * `ChannelDetails::feerate_sat_per_1000_weight` was added (lightningdevkit#2094).
 * `Invoice::fallback_addresses` was added to fetch `bitcoin` types (lightningdevkit#2023).
 * The offer/refund description is now exposed in `Invoice{,Request}` (lightningdevkit#2206).

Backwards Compatibility
=======================
 * Payments sent with the legacy `*_with_route` methods on LDK 0.0.115+ will no longer be retryable via the LDK 0.0.114- `retry_payment` method (lightningdevkit#2139).
 * `Event::PaymentPathFailed::retry` was removed and will always be `None` for payments initiated on 0.0.115 which fail on an earlier version (lightningdevkit#2063).
 * `Route`s and `PaymentParameters` with blinded path information will not be readable on prior versions of LDK. Such objects are not currently constructed by LDK, but may be when processing BOLT12 data in a coming release (lightningdevkit#2146).
 * Providing `ChannelMonitorUpdate`s generated by LDK 0.0.115 to a `ChannelMonitor` on 0.0.114 or before may panic (lightningdevkit#2059). Note that this is in general unsupported, and included here only for completeness.

Bug Fixes
=========
 * Fixed a case where `process_events_async` may `poll` a `Future` which has already completed (lightningdevkit#2081).
 * Fixed deserialization of `u16` arrays. This bug may have previously corrupted the historical buckets in a `ProbabilisticScorer`. Users relying on the historical buckets may wish to wipe their scorer on upgrade to remove corrupt data rather than waiting on it to decay (lightningdevkit#2191).
 * The `process_events_async` task is now `Send` and can thus be polled on a multi-threaded runtime (lightningdevkit#2199).
 * Fixed a missing macro export causing `impl_writeable_tlv_based_enum{,_upgradable}` calls to not compile (lightningdevkit#2091).
 * Fixed compilation of `lightning-invoice` with both `no-std` and serde (lightningdevkit#2187).
 * Fix an issue where the `background-processor` would not wake when a `ChannelMonitorUpdate` completed asynchronously, causing delays (lightningdevkit#2090).
 * Fix an issue where `process_events_async` would exit immediately (lightningdevkit#2145).
 * `Router` calls from the `ChannelManager` now call `find_route_with_id` rather than `find_route`, as was intended and described in the API (lightningdevkit#2092).
 * Ensure `process_events_async` always exits if any sleep future returns true, not just if all sleep futures repeatedly return true (lightningdevkit#2145).
 * `channel_update` messages no longer set the disable bit unless the peer has been disconnected for some time. This should resolve cases where channels are disabled for extended periods of time (lightningdevkit#2198).
 * We no longer remove CLN nodes from the network graph for violating the BOLT spec in some cases after failing to pay through them (lightningdevkit#2220).
 * Fixed a debug assertion which may panic under heavy load (lightningdevkit#2172).
 * `CounterpartyForceClosed::peer_msg` is now wrapped in `UntrustedString` (lightningdevkit#2114).
 * Fixed a potential deadlock in `funding_transaction_generated` (lightningdevkit#2158).

Security
========
 * Transaction re-broadcasting is now substantially more aggressive, including a new regular rebroadcast feature called on a timer from the `background-processor` or from `ChainMonitor::rebroadcast_pending_claims`. This should substantially increase transaction confirmation reliability without relying on downstream `TransactionBroadcaster` implementations for rebroadcasting (lightningdevkit#2203, lightningdevkit#2205, lightningdevkit#2208).
 * Implemented the changes from BOLT PRs lightningdevkit#1031, lightningdevkit#1032, and lightningdevkit#1040 which resolve a privacy vulnerability which allows an intermediate node on the path to discover the final destination for a payment (lightningdevkit#2062).

In total, this release features 110 files changed, 11928 insertions, 6368 deletions in 215 commits from 21 authors, in alphabetical order:
 * Advait
 * Alan Cohen
 * Alec Chen
 * Allan Douglas R. de Oliveira
 * Arik Sosman
 * Elias Rohrer
 * Evan Feenstra
 * Jeffrey Czyz
 * John Cantrell
 * Lucas Soriano del Pino
 * Marc Tyndel
 * Matt Corallo
 * Paul Miller
 * Steven
 * Steven Williamson
 * Steven Zhao
 * Tony Giorgio
 * Valentine Wallace
 * Wilmer Paulino
 * benthecarman
 * munjesi
Fixes #2052.
Open to feedback especially on the last commit - it's nice to reduce our polling interval, but the complexity here just to reduce wakeups by something like 2x (3 times per second plus once every 500ms, or something like that, down from 10 times per second) seems...not worth it?
cc @MaxFangX