Background processing of ChannelManager and ChannelMonitor events #920


Merged (7 commits, May 25, 2021)

Conversation

jkczyz (Contributor) commented May 12, 2021:

Supports processing Events from ChannelManager and ChainMonitor in the background. Adds an EventHandler trait for users to implement and pass to BackgroundProcessor. Refactors the EventsProvider trait to take such an event handler.

This also addresses a race condition identified in #899 by making use of PersistenceNotifierGuard in ChannelManager when processing events.
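For orientation, here is a minimal sketch of the shape the API converges on by the end of the review (per the discussion below, handle_event ends up returning nothing rather than a Result; exact bounds and paths are approximations of the era's lightning::util::events, not a verbatim copy):

```rust
use std::ops::Deref;

use lightning::util::events::Event;

/// Implemented by users and passed to BackgroundProcessor (or invoked
/// directly via process_pending_events).
pub trait EventHandler {
    fn handle_event(&self, event: Event);
}

/// Implemented by ChannelManager and ChainMonitor: instead of returning
/// pending events for the caller to drain, they are fed to the handler.
pub trait EventsProvider {
    fn process_pending_events<H: Deref>(&self, handler: H)
        where H::Target: EventHandler;
}
```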

codecov bot commented May 12, 2021:

Codecov Report

Merging #920 (db0f707) into main (3a0356f) will increase coverage by 0.27%.
The diff coverage is 96.51%.

❗ Current head db0f707 differs from pull request most recent head a1f95de. Consider uploading reports for the commit a1f95de to get more accurate results.

@@            Coverage Diff             @@
##             main     #920      +/-   ##
==========================================
+ Coverage   90.55%   90.83%   +0.27%     
==========================================
  Files          59       59              
  Lines       30015    33235    +3220     
==========================================
+ Hits        27181    30188    +3007     
- Misses       2834     3047     +213     
Impacted Files Coverage Δ
lightning/src/chain/channelmonitor.rs 91.16% <ø> (-4.30%) ⬇️
lightning/src/ln/chanmon_update_fail_tests.rs 97.64% <ø> (-0.14%) ⬇️
lightning/src/ln/functional_test_utils.rs 95.06% <ø> (ø)
lightning/src/ln/functional_tests.rs 97.97% <ø> (+1.14%) ⬆️
lightning/src/ln/onion_route_tests.rs 96.80% <ø> (ø)
lightning/src/ln/reorg_tests.rs 99.62% <ø> (ø)
lightning-background-processor/src/lib.rs 94.73% <93.65%> (-1.38%) ⬇️
lightning/src/ln/channelmanager.rs 88.45% <97.56%> (+5.53%) ⬆️
lightning-net-tokio/src/lib.rs 75.92% <100.00%> (-0.34%) ⬇️
lightning/src/chain/chainmonitor.rs 95.80% <100.00%> (+0.28%) ⬆️
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

-fn get_and_clear_pending_events(&self) -> Vec<Event> {
+fn process_pending_events<H: Deref>(&self, handler: H) -> Result<(), E>
+where H::Target: EventHandler<E> {
+	let _persistence_guard = PersistenceNotifierGuard::new(&self.total_consistency_lock, &self.persistence_notifier);
TheBlueMatt (Collaborator):

After #918 this probably should figure out if any events were processed and not always persist.

jkczyz (author):

I think this meant to say #916.

jkczyz (author) left a comment:

Added some comments that I could use some feedback on.

-	mem::swap(&mut ret, &mut *pending_events);
-	ret
+	for event in pending_events.drain(..) {
+		handler.handle_event(event)?;
jkczyz (author):

The code as written here will pass back the first error encountered and stop processing events. However, any remaining events are lost because of the drain. This leads to some further questions:

  • Should event processing continue if handling one event fails?
  • Can a user reasonably recover from an error while handling an event?
  • Should unprocessed events remain pending for the next call?

The answer to these will determine what the interface will ultimately look like.
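To make the failure mode concrete, here is the problematic shape in isolation (a sketch using the names from the snippet above):

```rust
let mut pending_events = self.pending_events.lock().unwrap();
for event in pending_events.drain(..) {
    // `?` returns on the first failure, but Vec::drain's destructor then
    // clears every element it has not yet yielded, so all events after the
    // failing one are removed from the queue without ever being handled.
    handler.handle_event(event)?;
}
```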

TheBlueMatt (Collaborator):

Right, if we take an error from users we definitely shouldn't just drop events, we should keep feeding them to users. That said, I'm a little dubious that we should let users provide Errors either. We shouldn't take an error unless we can do something reasonable with it - maybe try the event again later or something, but short of that I'm not sure what an error even means.

jkczyz (author):

Yeah, I'm starting to think not allowing errors is the best approach. The user can always write an EventHandler to do whatever they want (including collecting errors or remaining events) and act upon it after process_pending_events returns. But, depending on how it's implemented, it could leave them in a state where ChannelManager is persisted without those events but the user hasn't "fully" handled them either. May just need to spell it out in the documentation.

That said, if they are using BackgroundProcessor there would be no way for them to examine the handler after processing.
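For illustration, a hypothetical user-side handler along those lines, usable when the user drives process_pending_events directly; MyError and try_handle are stand-ins for the user's own types, not LDK APIs:

```rust
use std::sync::Mutex;

struct CollectingHandler {
    // Events whose handling failed, recorded for inspection after
    // process_pending_events returns.
    failures: Mutex<Vec<(Event, MyError)>>,
}

impl EventHandler for CollectingHandler {
    fn handle_event(&self, event: Event) {
        // try_handle stands in for the user's actual event logic.
        if let Err(e) = try_handle(&event) {
            // Swallow the error here; the caller examines `failures` later.
            self.failures.lock().unwrap().push((event, e));
        }
    }
}
```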

TheBlueMatt (Collaborator):

Right, I guess the only case where errors make sense is some kind of transitory issue that we know will resolve very soon, allowing us to handle the error momentarily while holding off persisting the ChannelManager. That said, I think you could just block in the handler in that case, as long as we only hold an event-specific lock?

Maybe we can have a separate lock allowing us to block any other event handlers from entering while still allowing adding new events from other things going on?

jkczyz (author):

> Right, I guess the only case where errors make sense is some kind of transitory issue that we know will resolve very soon, allowing us to handle the error momentarily while holding off persisting the ChannelManager. That said, I think you could just block in the handler in that case, as long as we only hold an event-specific lock?

Will update this to remove the return type from handle_event.

> Maybe we can have a separate lock allowing us to block any other event handlers from entering while still allowing adding new events from other things going on?

pending_events is already behind a Mutex. But I guess if we want to allow other event producers to add more events, we could immediately drop the lock after either draining fully into a Vec or swapping as before.

TheBlueMatt (Collaborator):

> pending_events is already behind a Mutex. But I guess if we want to allow other event producers to add more events, we could immediately drop the lock after either draining fully into a Vec or swapping as before.

Right, I guess I was thinking that if a user were handling events in parallel it would be a bug. Now that I think about it more, maybe it's fine, as long as that's what the user wants.

jkczyz (author):

FWIW, swapping retains the same behavior as get_and_clear_pending_events since there the lock was held for the duration of the method and then the events were handled by the user. I updated the code accordingly.
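A minimal sketch of that swap approach (field names assumed): the Mutex is held only long enough to move the queue out, then the handler runs unlocked, so other producers can keep enqueuing events:

```rust
use std::mem;

fn process_pending_events<H: Deref>(&self, handler: H)
    where H::Target: EventHandler
{
    let mut events = Vec::new();
    {
        let mut pending_events = self.pending_events.lock().unwrap();
        mem::swap(&mut events, &mut *pending_events);
    } // Lock dropped here, before any handler runs; this mirrors the old
      // get_and_clear_pending_events, where handling happened after return.
    for event in events {
        handler.handle_event(event);
    }
}
```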

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch 2 times, most recently from 11b9e60 to be39d80 Compare May 13, 2021 21:48
@TheBlueMatt TheBlueMatt added this to the 0.0.15 milestone May 15, 2021
@jkczyz jkczyz force-pushed the 2021-05-event-processing branch 4 times, most recently from 9ae9261 to 405c1f3 Compare May 15, 2021 18:46
@jkczyz jkczyz marked this pull request as ready for review May 15, 2021 18:47
valentinewallace (Contributor):

Could also be useful to document somewhere what we discussed offline re: expected behavior if we crash in the middle of handling a series of events.

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from 405c1f3 to 0cc378f Compare May 18, 2021 00:35
jkczyz (author) left a comment:

> Could also be useful to document somewhere what we discussed offline re: expected behavior if we crash in the middle of handling a series of events.

Done.

-//! for _event in chain_monitor.get_and_clear_pending_events().drain(..) {
-//! 	// Handle the event!
-//! }
+//! channel_manager.process_pending_events(&|event| {
TheBlueMatt (Collaborator):

We can do it in a followup if you prefer, but I think now that we're doing an event check every time the ChannelManager wakes us up we can drop the stupid mpsc channel hack here entirely.

jkczyz (author):

Sure, I can remove it. But I don't quite understand why (if?) my change caused this to no longer be necessary.

TheBlueMatt (Collaborator):

The point of the channel here is to have something that you can use to wake up every time an Event may be generated (by hooking network reads). It's a stupid hack, but it works. Since we now do the handling based on the ChannelManager's own notification interface, I don't think there's any need to do it by hooking network reads anymore.

jkczyz (author):

Ah, yeah, that's the piece I was missing. Still need to call that interface here to avoid busy waiting.
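Roughly, the BackgroundProcessor loop without the channel hack looks like this (a sketch: await_persistable_update_timeout is assumed to be the ChannelManager wake-up API of this era, and persist_manager/stop_requested are hypothetical stand-ins):

```rust
use std::time::Duration;

loop {
    // Flush any messages the handlers or timers queued up.
    peer_manager.process_events();
    channel_manager.process_pending_events(&event_handler);
    chain_monitor.process_pending_events(&event_handler);
    // Block until the ChannelManager signals that something happened which
    // may require persistence (and may have generated events), or until the
    // timeout fires; no busy waiting and no network-read hook needed.
    if channel_manager.await_persistable_update_timeout(Duration::from_millis(100)) {
        persist_manager(&channel_manager);
    }
    if stop_requested() { break; }
}
```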

/// A trait indicating an object may generate events.
///
/// Events are processed by a handler given to [`process_pending_events`]. Therefore, implementors
/// should be mindful to avoid lock reentrancy if a handler may call back into the provider.
TheBlueMatt (Collaborator):

Should we make it a part of the contract that callbacks into the provider are allowed/don't run locked, except for (possibly) serialization (in the case of a ChannelManager)? It's kinda hard to work with this API without that guarantee, but it's kinda annoying to specify it.

jkczyz (author):

For ChannelManager, both process_pending_events and handle_event take a read lock. The write lock is only taken after notification (i.e., after process_pending_events has returned) -- the handler itself doesn't serialize ChannelManager.

So I think the general advice is "don't introduce a deadlock". But now that I think of it, does RwLock allow reentrant reads? It appears so though the docs say it might panic.

https://doc.rust-lang.org/std/sync/struct.RwLock.html#panics
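A quick std-only illustration of the caveat:

```rust
use std::sync::RwLock;

fn main() {
    let lock = RwLock::new(0u32);
    let first = lock.read().unwrap();
    // Usually succeeds, but std makes no guarantee: the docs allow read()
    // to panic if this thread already holds the lock, and on
    // write-preferring platforms a writer queued between the two calls can
    // deadlock the reentrant read.
    let second = lock.read().unwrap();
    println!("reentrant reads: {} {}", *first, *second);
}
```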

TheBlueMatt (Collaborator):

> So I think the general advice is "don't introduce a deadlock".

Right, but the docs don't really give you enough information to know what would and what would not cause a deadlock. Reading it naively, I'd think I need to queue all the events and then call back to ChannelManager separately as calling back to it may cause a deadlock. Providing more information would let users do more without fear, though we have to be precise and decide what we're willing to commit to in the API and what not.

> does RwLock allow reentrant reads?

I assumed it would...I mean you're allowed to take as many read locks as you want....

jkczyz (author):

> Right, but the docs don't really give you enough information to know what would and what would not cause a deadlock. Reading it naively, I'd think I need to queue all the events and then call back to ChannelManager separately as calling back to it may cause a deadlock. Providing more information would let users do more without fear, though we have to be precise and decide what we're willing to commit to in the API and what not.

Understood, but these aren't ChannelManager's docs. And note that this line is talking about someone implementing EventsProvider, not EventHandler. It's really up to an implementor of EventsProvider to determine what is safe for a handler to call. We don't need to restrict it at the trait level.

TheBlueMatt (Collaborator):

Right, I guess then the docs are somewhat confusing to me - I read them as "user, you should be careful of reentrancy when calling methods here", not "note to RL implementors, be careful of reentrancy". We should have some docs per-implementor that talks about reentrancy, I guess, though sadly Rust really hides the impl-level docs for a trait implementation, making them hard to find :(.

jkczyz (author):

Rewrote the docs to separate implementation from usage. Also added docs to the ChannelManager and ChainMonitor implementations. Let me know if anything isn't clear or could be expanded upon.

TheBlueMatt (Collaborator) commented May 20, 2021:

Having a whole section dedicated to Implementations that contains instructions for how to implement just feels really wrong in public docs here? In practice, I can't imagine we ever expect users to implement this, so it becomes just a list of instructions to ourselves in public docs? Maybe there's a way to rephrase it to just talk about the requirements on both sides without losing detail?

jkczyz (author):

Combined and rephrased in terms of requirements.

@@ -43,32 +41,30 @@
//!
//! // Connect to node with pubkey their_node_id at addr:
//! async fn connect_to_node(peer_manager: PeerManager, chain_monitor: Arc<ChainMonitor>, channel_manager: ChannelManager, their_node_id: PublicKey, addr: SocketAddr) {
//! let (sender, mut receiver) = mpsc::channel(2);
valentinewallace (Contributor):

Hmm, so could this change affect new users of the sample if it isn't updated? Been wondering if the sample should be pinned to a specific commit for this reason. Or maybe the Cargo.lock makes it ok?

jkczyz (author):

Good question! I think Cargo.lock handles this for us since it will add the revision to the package source.

/// A trait indicating an object may generate events.
///
/// Events are processed by a handler given to [`process_pending_events`]. Therefore, implementors
/// should be mindful to avoid lock reentrancy if a handler may call back into the provider.
valentinewallace (Contributor):

Is "the provider" here the event-provider? Could possibly be clarified and I think an example of how such reentrancy could occur could make the docs more actionable (also for my understanding, because I'm not 100% clear on what reentrancy users should be aware of).

jkczyz (author):

Rewrote as per earlier discussion. Let me know if anything could be cleared up further.

valentinewallace (Contributor) left a comment:

Submitted some comments too early. Just some docs nits + the reentrancy clarification, otherwise LGTM

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from 0d8400e to 6905217 Compare May 19, 2021 21:48
}
}

/// Provides events that must be periodically handled.
TheBlueMatt (Collaborator):

These docs are collapsed by default on the ChannelManager page, beside four other trait implementations, making them basically invisible IMO. I don't know if there's a way to change rustdoc's behavior here, but I wonder if it makes sense to explicitly link to the known concrete implementations in the EventsProvider docs and say, eg, "eg see [`ChannelManager::process_pending_events`] and ..." which seems to handle opening the panel.

jkczyz (author):

Couldn't find any way to configure this, unfortunately.

Weird that linking to the method implementations will expand both the trait impl and method documentation but linking to the trait implementation won't do the former. Added links as suggested and moved the docs from impl to the methods.

TheBlueMatt (Collaborator) left a comment:

Aside from two functional comments here and the two doc comments, this all looks good.

NotifyOption::SkipPersist
} else {
result.replace(pending_events);
NotifyOption::DoPersist
TheBlueMatt (Collaborator):

Because message events aren't persisted as a part of ChannelManager (and we never assume that messages will be delivered between when we create them and give them to peers), I don't think this needs a persistence. However, we should be checking if a persist is required as a result of process_pending_monitor_events (both here and in process_pending_events below).

jkczyz (author):

Ah, didn't realize that. Fixed.
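A rough sketch of the resulting check (assuming process_pending_monitor_events is changed to report whether it did any work; the NotifyOption names are from the snippet above):

```rust
// Message events need no persist: they aren't part of the serialized
// ChannelManager, and delivery before a restart is never assumed. Monitor
// events, however, can mutate channel state, which must be persisted.
let mut result = NotifyOption::SkipPersist;
if self.process_pending_monitor_events() {
    result = NotifyOption::DoPersist;
}
```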

@@ -218,11 +202,10 @@ impl Connection {
}
if let Disconnect::PeerDisconnected = disconnect_type {
peer_manager_ref.socket_disconnected(&our_descriptor);
Self::event_trigger(&mut us.lock().unwrap());
TheBlueMatt (Collaborator):

Oops, I forgot that we're also relying on the event trigger thing here to call peer_manager.process_events() (which is also dumb...). Can you add a call to it after the select in the loop and maybe here after the disconnect as well? (while you're at it, we don't need the peer_manager_ref clone anymore, it seems?).

jkczyz (author):

Done.

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from 6905217 to db0f707 Compare May 21, 2021 21:36
@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from db0f707 to 197bf3c Compare May 24, 2021 17:56
TheBlueMatt (Collaborator):

D'oh. Sorry, I missed that this conflicted with #851 before I hit merge. Could you do a quick rebase?

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from 197bf3c to 15b3e54 Compare May 24, 2021 21:14
jkczyz (author) commented May 24, 2021:

> D'oh. Sorry, I missed that this conflicted with #851 before I hit merge. Could you do a quick rebase?

Sure thing. One sec.

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from 15b3e54 to 4e8bed2 Compare May 24, 2021 21:20

self.check_free_holding_cells();
self.check_free_holding_cells();
TheBlueMatt (Collaborator):

I think we need to check check_free_holding_cells for updates to persist as well.

jkczyz (author):

Ah, sorry. Didn't look at that closely enough. Figured it was a simple merge conflict. Let me fix it.

jkczyz (author):

To confirm my understanding: should we persist whenever check_free_holding_cells gets any failed HTLCs back from maybe_free_holding_cell_htlcs? Or is the condition more complicated?

https://github.com/rust-bitcoin/rust-lightning/blob/4e8bed20760ee1b53cf2fd15d4a9897780c4fab5/lightning/src/ln/channelmanager.rs#L3524

TheBlueMatt (Collaborator):

We should also persist if we get a ChannelMonitorUpdate back from maybe_free_holding_cell_htlcs (ChannelMonitorUpdates are always a surefire reason to persist - if we're out of sync when we come back up we have to force-close). Reading the code in channel that should suffice, but maybe add a comment in maybe_free_holding_cell_htlcs and/or free_holding_cell_htlcs that notes that we won't persist unless we get some values back in the return.

jkczyz (author):

Left the comment out since it's really a concern of ChannelManager. Made the logic in check_free_holding_cells explicit, though, so hopefully this is clear.
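Illustratively, the explicit rule reads something like this (helper shapes such as channels_with_pending_holding_cells are assumed for the sketch, not the exact merged code):

```rust
/// Returns whether a persist is needed as a result of freeing holding cells.
fn check_free_holding_cells(&self) -> bool {
    let mut needs_persist = false;
    for chan in self.channels_with_pending_holding_cells() {
        let (monitor_update_opt, failed_htlcs) = chan.maybe_free_holding_cell_htlcs();
        // A ChannelMonitorUpdate is always a surefire reason to persist:
        // restarting with an out-of-sync monitor means force-closing the
        // channel. Failed HTLCs handed back must likewise be acted on.
        needs_persist |= monitor_update_opt.is_some() || !failed_htlcs.is_empty();
    }
    needs_persist
}
```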

@jkczyz jkczyz force-pushed the 2021-05-event-processing branch from 4e8bed2 to a1f95de Compare May 25, 2021 07:32
@TheBlueMatt TheBlueMatt merged commit f8450a7 into lightningdevkit:main May 25, 2021
TheBlueMatt added a commit that referenced this pull request May 25, 2021
Rebroadcast channel_announcements when we broadcast a node_announce
3 participants