Background processing of ChannelManager and ChannelMonitor events #920


Merged (7 commits, May 25, 2021)

Conversation

jkczyz (Contributor) commented May 12, 2021:

Supports processing Events from ChannelManager and ChainMonitor in the background. Adds an EventHandler trait for users to implement and pass to BackgroundProcessor. Refactors the EventsProvider trait to take such an event handler.

This also addresses a race condition identified in #899 by making use of PersistenceNotifierGuard in ChannelManager when processing events.
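For orientation, here is a minimal sketch of the shape the API converges on by the end of the review (per the discussion below, handle_event ends up returning nothing rather than a Result; exact bounds and paths are approximations of the era's lightning::util::events, not a verbatim copy):

```rust
use std::ops::Deref;

use lightning::util::events::Event;

/// Implemented by users and passed to BackgroundProcessor (or invoked
/// directly via process_pending_events).
pub trait EventHandler {
    fn handle_event(&self, event: Event);
}

/// Implemented by ChannelManager and ChainMonitor: instead of returning
/// pending events for the caller to drain, they are fed to the handler.
pub trait EventsProvider {
    fn process_pending_events<H: Deref>(&self, handler: H)
        where H::Target: EventHandler;
}
```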

codecov bot commented May 12, 2021:

Codecov Report

Merging #920 (db0f707) into main (3a0356f) will increase coverage by 0.27%.
The diff coverage is 96.51%.

❗ Current head db0f707 differs from pull request most recent head a1f95de. Consider uploading reports for the commit a1f95de to get more accurate results.

@@            Coverage Diff             @@
##             main     #920      +/-   ##
==========================================
+ Coverage   90.55%   90.83%   +0.27%     
==========================================
  Files          59       59              
  Lines       30015    33235    +3220     
==========================================
+ Hits        27181    30188    +3007     
- Misses       2834     3047     +213     
Impacted Files Coverage Δ
lightning/src/chain/channelmonitor.rs 91.16% <ø> (-4.30%) ⬇️
lightning/src/ln/chanmon_update_fail_tests.rs 97.64% <ø> (-0.14%) ⬇️
lightning/src/ln/functional_test_utils.rs 95.06% <ø> (ø)
lightning/src/ln/functional_tests.rs 97.97% <ø> (+1.14%) ⬆️
lightning/src/ln/onion_route_tests.rs 96.80% <ø> (ø)
lightning/src/ln/reorg_tests.rs 99.62% <ø> (ø)
lightning-background-processor/src/lib.rs 94.73% <93.65%> (-1.38%) ⬇️
lightning/src/ln/channelmanager.rs 88.45% <97.56%> (+5.53%) ⬆️
lightning-net-tokio/src/lib.rs 75.92% <100.00%> (-0.34%) ⬇️
lightning/src/chain/chainmonitor.rs 95.80% <100.00%> (+0.28%) ⬆️
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

-fn get_and_clear_pending_events(&self) -> Vec<Event> {
+fn process_pending_events<H: Deref>(&self, handler: H) -> Result<(), E>
+where H::Target: EventHandler<E> {
+	let _persistence_guard = PersistenceNotifierGuard::new(&self.total_consistency_lock, &self.persistence_notifier);
TheBlueMatt (Collaborator):

After #918 this probably should figure out if any events were processed and not always persist.

jkczyz (author):

I think this meant to say #916.

jkczyz (author) left a comment:

Added some comments that I could use some feedback on.

-	mem::swap(&mut ret, &mut *pending_events);
-	ret
+	for event in pending_events.drain(..) {
+		handler.handle_event(event)?;
jkczyz (author):

The code as written here will pass back the first error encountered and stop processing events. However, any remaining events are lost because of the drain. This leads to some further questions:

  • Should event processing continue if handling one event fails?
  • Can a user reasonably recover from an error while handling an event?
  • Should unprocessed events remain pending for the next call?

The answer to these will determine what the interface will ultimately look like.
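To make the failure mode concrete, here is the problematic shape in isolation (a sketch using the names from the snippet above):

```rust
let mut pending_events = self.pending_events.lock().unwrap();
for event in pending_events.drain(..) {
    // `?` returns on the first failure, but Vec::drain's destructor then
    // clears every element it has not yet yielded, so all events after the
    // failing one are removed from the queue without ever being handled.
    handler.handle_event(event)?;
}
```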

TheBlueMatt (Collaborator):

Right, if we take an error from users we definitely shouldn't just drop events, we should keep feeding them to users. That said, I'm a little dubious that we should let users provide Errors either. We shouldn't take an error unless we can do something reasonable with it - maybe try the event again later or something, but short of that I'm not sure what an error even means.

jkczyz (author):

Yeah, I'm starting to think not allowing errors is the best approach. The user can always write an EventHandler to do whatever they want (including collecting errors or remaining events) and act upon it after process_pending_events returns. But, depending on how it's implemented, it could leave them in a state where ChannelManager is persisted without those events but the user hasn't "fully" handled them either. May just need to spell it out in the documentation.

That said, if they are using BackgroundProcessor there would be no way for them to examine the handler after processing.
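For illustration, a hypothetical user-side handler along those lines, usable when the user drives process_pending_events directly; MyError and try_handle are stand-ins for the user's own types, not LDK APIs:

```rust
use std::sync::Mutex;

struct CollectingHandler {
    // Events whose handling failed, recorded for inspection after
    // process_pending_events returns.
    failures: Mutex<Vec<(Event, MyError)>>,
}

impl EventHandler for CollectingHandler {
    fn handle_event(&self, event: Event) {
        // try_handle stands in for the user's actual event logic.
        if let Err(e) = try_handle(&event) {
            // Swallow the error here; the caller examines `failures` later.
            self.failures.lock().unwrap().push((event, e));
        }
    }
}
```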

TheBlueMatt (Collaborator):

Right, I guess the only case where errors make sense is some kind of transitory issue that we know will resolve very soon, allowing us to handle the error momentarily while holding off persisting the ChannelManager. That said, I think you could just block in the handler in that case, as long as we only hold an event-specific lock?

Maybe we can have a separate lock allowing us to block any other event handlers from entering while still allowing adding new events from other things going on?

jkczyz (author):

> Right, I guess the only case where errors make sense is some kind of transitory issue that we know will resolve very soon, allowing us to handle the error momentarily while holding off persisting the ChannelManager. That said, I think you could just block in the handler in that case, as long as we only hold an event-specific lock?

Will update this to remove the return type from handle_event.

> Maybe we can have a separate lock allowing us to block any other event handlers from entering while still allowing adding new events from other things going on?

pending_events is already behind a Mutex. But I guess if we want to allow other event producers to add more events, we could immediately drop the lock after either draining fully into a Vec or swapping as before.

TheBlueMatt (Collaborator):

> pending_events is already behind a Mutex. But I guess if we want to allow other event producers to add more events, we could immediately drop the lock after either draining fully into a Vec or swapping as before.

Right, I guess I was thinking that if a user were handling events in parallel it would be a bug. Now that I think about it more, maybe it's fine, as long as that's what the user wants.

jkczyz (author):

FWIW, swapping retains the same behavior as get_and_clear_pending_events since there the lock was held for the duration of the method and then the events were handled by the user. I updated the code accordingly.
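A minimal sketch of that swap approach (field names assumed): the Mutex is held only long enough to move the queue out, then the handler runs unlocked, so other producers can keep enqueuing events:

```rust
use std::mem;

fn process_pending_events<H: Deref>(&self, handler: H)
    where H::Target: EventHandler
{
    let mut events = Vec::new();
    {
        let mut pending_events = self.pending_events.lock().unwrap();
        mem::swap(&mut events, &mut *pending_events);
    } // Lock dropped here, before any handler runs; this mirrors the old
      // get_and_clear_pending_events, where handling happened after return.
    for event in events {
        handler.handle_event(event);
    }
}
```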

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch 2 times, most recently from 11b9e60 to be39d80 Compare May 13, 2021 21:48
@TheBlueMatt TheBlueMatt added this to the 0.0.15 milestone May 15, 2021
@jkczyz jkczyz force-pushed the 2021-05-event-processing branch 4 times, most recently from 9ae9261 to 405c1f3 Compare May 15, 2021 18:46
@jkczyz jkczyz marked this pull request as ready for review May 15, 2021 18:47
valentinewallace (Contributor):

Could also be useful to document somewhere what we discussed offline re: expected behavior if we crash in the middle of handling a series of events.

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from 405c1f3 to 0cc378f Compare May 18, 2021 00:35
jkczyz (author) left a comment:

> Could also be useful to document somewhere what we discussed offline re: expected behavior if we crash in the middle of handling a series of events.

Done.

-//! for _event in chain_monitor.get_and_clear_pending_events().drain(..) {
-//! 	// Handle the event!
-//! }
+//! channel_manager.process_pending_events(&|event| {
TheBlueMatt (Collaborator):

We can do it in a followup if you prefer, but I think now that we're doing an event check every time the ChannelManager wakes us up we can drop the stupid mpsc channel hack here entirely.

jkczyz (author):

Sure, I can remove it. But I don't quite understand why (if?) my change caused this to no longer be necessary.

TheBlueMatt (Collaborator):

The point of the channel here is to have something that you can use to wake up every time an Event may be generated (by hooking network reads). It's a stupid hack, but it works. Since we now do the handling based on the ChannelManager's own notification interface, I don't think there's any need to do it by hooking network reads anymore.

jkczyz (author):

Ah, yeah, that's the piece I was missing. Still need to call that interface here to avoid busy waiting.
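Roughly, the BackgroundProcessor loop without the channel hack looks like this (a sketch: await_persistable_update_timeout is assumed to be the ChannelManager wake-up API of this era, and persist_manager/stop_requested are hypothetical stand-ins):

```rust
use std::time::Duration;

loop {
    // Flush any messages the handlers or timers queued up.
    peer_manager.process_events();
    channel_manager.process_pending_events(&event_handler);
    chain_monitor.process_pending_events(&event_handler);
    // Block until the ChannelManager signals that something happened which
    // may require persistence (and may have generated events), or until the
    // timeout fires; no busy waiting and no network-read hook needed.
    if channel_manager.await_persistable_update_timeout(Duration::from_millis(100)) {
        persist_manager(&channel_manager);
    }
    if stop_requested() { break; }
}
```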

/// A trait indicating an object may generate events.
///
/// Events are processed by a handler given to [`process_pending_events`]. Therefore, implementors
/// should be mindful to avoid lock reentrancy if a handler may call back into the provider.
TheBlueMatt (Collaborator):

Should we make it a part of the contract that callbacks into the provider are allowed/don't run locked, except for (possibly) serialization (in the case of a ChannelManager)? It's kinda hard to work with this API without that guarantee, but it's kinda annoying to specify it.

jkczyz (author):

For ChannelManager, both process_pending_events and handle_event take a read lock. The write lock is only taken after notification (i.e., after process_pending_events has returned) -- the handler itself doesn't serialize ChannelManager.

So I think the general advice is "don't introduce a deadlock". But now that I think of it, does RwLock allow reentrant reads? It appears so though the docs say it might panic.

https://doc.rust-lang.org/std/sync/struct.RwLock.html#panics
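A quick std-only illustration of the caveat:

```rust
use std::sync::RwLock;

fn main() {
    let lock = RwLock::new(0u32);
    let first = lock.read().unwrap();
    // Usually succeeds, but std makes no guarantee: the docs allow read()
    // to panic if this thread already holds the lock, and on
    // write-preferring platforms a writer queued between the two calls can
    // deadlock the reentrant read.
    let second = lock.read().unwrap();
    println!("reentrant reads: {} {}", *first, *second);
}
```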

TheBlueMatt (Collaborator):

> So I think the general advice is "don't introduce a deadlock".

Right, but the docs don't really give you enough information to know what would and what would not cause a deadlock. Reading it naively, I'd think I need to queue all the events and then call back to ChannelManager separately as calling back to it may cause a deadlock. Providing more information would let users do more without fear, though we have to be precise and decide what we're willing to commit to in the API and what not.

> does RwLock allow reentrant reads?

I assumed it would...I mean you're allowed to take as many read locks as you want....

jkczyz (author):

> Right, but the docs don't really give you enough information to know what would and what would not cause a deadlock. Reading it naively, I'd think I need to queue all the events and then call back to ChannelManager separately as calling back to it may cause a deadlock. Providing more information would let users do more without fear, though we have to be precise and decide what we're willing to commit to in the API and what not.

Understood, but these aren't ChannelManager's docs. And note that this line is talking about someone implementing EventsProvider, not EventHandler. It's really up to an implementor of EventsProvider to determine what is safe for a handler to call. We don't need to restrict it at the trait level.

TheBlueMatt (Collaborator):

Right, I guess then the docs are somewhat confusing to me - I read them as "user, you should be careful of reentrancy when calling methods here", not "note to RL implementors, be careful of reentrancy". We should have some docs per-implementor that talks about reentrancy, I guess, though sadly Rust really hides the impl-level docs for a trait implementation, making them hard to find :(.

jkczyz (author):

Rewrote the docs to separate implementation from usage. Also added docs to the ChannelManager and ChainMonitor implementations. Let me know if anything isn't clear or could be expanded upon.

TheBlueMatt (Collaborator) commented May 20, 2021:

Having a whole section dedicated to Implementations that contains instructions for how to implement just feels really wrong in public docs here? In practice, I can't imagine we ever expect users to implement this, so it becomes just a list of instructions to ourselves in public docs? Maybe there's a way to rephrase it to just talk about the requirements on both sides without losing detail?

jkczyz (author):

Combined and rephrased in terms of requirements.

@@ -43,32 +41,30 @@
//!
//! // Connect to node with pubkey their_node_id at addr:
//! async fn connect_to_node(peer_manager: PeerManager, chain_monitor: Arc<ChainMonitor>, channel_manager: ChannelManager, their_node_id: PublicKey, addr: SocketAddr) {
//! let (sender, mut receiver) = mpsc::channel(2);
valentinewallace (Contributor):

Hmm, so could this change affect new users of the sample if it isn't updated? Been wondering if the sample should be pinned to a specific commit for this reason. Or maybe the Cargo.lock makes it ok?

jkczyz (author):

Good question! I think Cargo.lock handles this for us since it will add the revision to the package source.

/// A trait indicating an object may generate events.
///
/// Events are processed by a handler given to [`process_pending_events`]. Therefore, implementors
/// should be mindful to avoid lock reentrancy if a handler may call back into the provider.
valentinewallace (Contributor):

Is "the provider" here the event-provider? Could possibly be clarified and I think an example of how such reentrancy could occur could make the docs more actionable (also for my understanding, because I'm not 100% clear on what reentrancy users should be aware of).

jkczyz (author):

Rewrote as per earlier discussion. Let me know if anything could be cleared up further.

valentinewallace (Contributor) left a comment:

Submitted some comments too early. Just some docs nits + the reentrancy clarification, otherwise LGTM

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from 0d8400e to 6905217 Compare May 19, 2021 21:48
}
}

/// Provides events that must be periodically handled.
TheBlueMatt (Collaborator):

These docs are collapsed by default on the ChannelManager page, beside four other trait implementations, making them basically invisible IMO. I don't know if there's a way to change rustdoc's behavior here, but I wonder if it makes sense to explicitly link to the known concrete implementations in the EventsProvider docs and say, eg, "eg see [`ChannelManager::process_pending_events`] and ..." which seems to handle opening the panel.

jkczyz (author):

Couldn't find any way to configure this, unfortunately.

Weird that linking to the method implementations will expand both the trait impl and method documentation but linking to the trait implementation won't do the former. Added links as suggested and moved the docs from impl to the methods.

TheBlueMatt (Collaborator) left a comment:

Aside from two functional comments here and the two doc comments, this all looks good.

NotifyOption::SkipPersist
} else {
result.replace(pending_events);
NotifyOption::DoPersist
TheBlueMatt (Collaborator):

Because message events aren't persisted as a part of ChannelManager (and we never assume that messages will be delivered between when we create them and give them to peers), I don't think this needs a persistence. However, we should be checking if a persist is required as a result of process_pending_monitor_events (both here and in process_pending_events below).

jkczyz (author):

Ah, didn't realize that. Fixed.
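A rough sketch of the resulting check (assuming process_pending_monitor_events is changed to report whether it did any work; the NotifyOption names are from the snippet above):

```rust
// Message events need no persist: they aren't part of the serialized
// ChannelManager, and delivery before a restart is never assumed. Monitor
// events, however, can mutate channel state, which must be persisted.
let mut result = NotifyOption::SkipPersist;
if self.process_pending_monitor_events() {
    result = NotifyOption::DoPersist;
}
```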

@@ -218,11 +202,10 @@ impl Connection {
}
if let Disconnect::PeerDisconnected = disconnect_type {
peer_manager_ref.socket_disconnected(&our_descriptor);
Self::event_trigger(&mut us.lock().unwrap());
TheBlueMatt (Collaborator):

Oops, I forgot that we're also relying on the event trigger thing here to call peer_manager.process_events() (which is also dumb...). Can you add a call to it after the select in the loop and maybe here after the disconnect as well? (while you're at it, we don't need the peer_manager_ref clone anymore, it seems?).

jkczyz (author):

Done.

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from 6905217 to db0f707 Compare May 21, 2021 21:36
@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from db0f707 to 197bf3c Compare May 24, 2021 17:56
TheBlueMatt (Collaborator):

D'oh. Sorry, I missed that this conflicted with #851 before I hit merge. Could you do a quick rebase?

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from 197bf3c to 15b3e54 Compare May 24, 2021 21:14
jkczyz (author) commented May 24, 2021:

> D'oh. Sorry, I missed that this conflicted with #851 before I hit merge. Could you do a quick rebase?

Sure thing. One sec.

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from 15b3e54 to 4e8bed2 Compare May 24, 2021 21:20

self.check_free_holding_cells();
self.check_free_holding_cells();
TheBlueMatt (Collaborator):

I think we need to check check_free_holding_cells for updates to persist as well.

jkczyz (author):

Ah, sorry. Didn't look at that closely enough. Figured it was a simple merge conflict. Let me fix it.

jkczyz (author):

To confirm my understanding: should we persist whenever check_free_holding_cells gets any failed HTLCs back from maybe_free_holding_cell_htlcs? Or is the condition more complicated?

https://github.com/rust-bitcoin/rust-lightning/blob/4e8bed20760ee1b53cf2fd15d4a9897780c4fab5/lightning/src/ln/channelmanager.rs#L3524

TheBlueMatt (Collaborator):

We should also persist if we get a ChannelMonitorUpdate back from maybe_free_holding_cell_htlcs (ChannelMonitorUpdates are always a surefire reason to persist - if we're out of sync when we come back up we have to force-close). Reading the code in channel that should suffice, but maybe add a comment in maybe_free_holding_cell_htlcs and/or free_holding_cell_htlcs that notes that we won't persist unless we get some values back in the return.

jkczyz (author):

Left the comment out since it's really a concern of ChannelManager. Made the logic in check_free_holding_cells explicit, though, so hopefully this is clear.
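Illustratively, the explicit rule reads something like this (helper shapes such as channels_with_pending_holding_cells are assumed for the sketch, not the exact merged code):

```rust
/// Returns whether a persist is needed as a result of freeing holding cells.
fn check_free_holding_cells(&self) -> bool {
    let mut needs_persist = false;
    for chan in self.channels_with_pending_holding_cells() {
        let (monitor_update_opt, failed_htlcs) = chan.maybe_free_holding_cell_htlcs();
        // A ChannelMonitorUpdate is always a surefire reason to persist:
        // restarting with an out-of-sync monitor means force-closing the
        // channel. Failed HTLCs handed back must likewise be acted on.
        needs_persist |= monitor_update_opt.is_some() || !failed_htlcs.is_empty();
    }
    needs_persist
}
```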

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from 4e8bed2 to a1f95de Compare May 25, 2021 07:32
@TheBlueMatt TheBlueMatt merged commit f8450a7 into lightningdevkit:main May 25, 2021
TheBlueMatt added a commit that referenced this pull request May 25, 2021
Rebroadcast channel_announcements when we broadcast a node_announce
3 participants