Background processing of ChannelManager and ChannelMonitor events #920
Conversation
Codecov Report
@@ Coverage Diff @@
## main #920 +/- ##
==========================================
+ Coverage 90.55% 90.83% +0.27%
==========================================
Files 59 59
Lines 30015 33235 +3220
==========================================
+ Hits 27181 30188 +3007
- Misses 2834 3047 +213
Continue to review full report at Codecov.
lightning/src/ln/channelmanager.rs
Outdated
fn get_and_clear_pending_events(&self) -> Vec<Event> {
fn process_pending_events<H: Deref>(&self, handler: H) -> Result<(), E>
	where H::Target: EventHandler<E> {
	let _persistence_guard = PersistenceNotifierGuard::new(&self.total_consistency_lock, &self.persistence_notifier);
After #918 this probably should figure out if any events were processed and not always persist.
I think this meant to say #916.
Added some comments that I could use some feedback on.
lightning/src/ln/channelmanager.rs
Outdated
mem::swap(&mut ret, &mut *pending_events);
ret
for event in pending_events.drain(..) {
	handler.handle_event(event)?;
The code as written here will pass back the first error encountered and stop processing events. However, any remaining events are lost because of the drain. So this leads to some other questions:
- Should event processing continue if handling one event fails?
- Can a user reasonably recover from an error while handling an event?
- Should unprocessed events remain pending for the next call?
The answer to these will determine what the interface will ultimately look like.
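For illustration, here is a minimal sketch of the behavior being described, using placeholder Event and HandlerError types rather than the real lightning ones: an early return via `?` drops whatever remains in the Drain iterator, so those events are gone by the time the caller sees the error.

```rust
use std::sync::Mutex;

// Placeholder types standing in for the real lightning Event and error types.
struct Event;
struct HandlerError;

struct Provider {
    pending_events: Mutex<Vec<Event>>,
}

impl Provider {
    // Mirrors the structure in the diff above: the first failure stops the
    // loop, and because drain(..) removes the remaining elements when the
    // iterator is dropped, every unprocessed event is lost.
    fn process_pending_events<F>(&self, handler: F) -> Result<(), HandlerError>
    where
        F: Fn(Event) -> Result<(), HandlerError>,
    {
        let mut pending_events = self.pending_events.lock().unwrap();
        for event in pending_events.drain(..) {
            handler(event)?; // early return here discards the rest of the queue
        }
        Ok(())
    }
}
```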
Right, if we take an error from users we definitely shouldn't just drop events, we should keep feeding them to users. That said, I'm a little dubious that we should let users provide Errors either. We shouldn't take an error unless we can do something reasonable with it - maybe try the event again later or something, but short of that I'm not sure what an error even means.
Yeah, I'm starting to think not allowing errors is the best approach. The user can always write an EventHandler to do whatever they want (including collecting errors or remaining events) and act upon it after process_pending_events returns. But, depending on how it's implemented, it could leave them in a state where ChannelManager is persisted without those events but the user hasn't "fully" handled them either. May just need to spell it out in the documentation.
That said, if they are using BackgroundProcessor there would be no way for them to examine the handler after processing.
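As a hedged sketch of the "user can always write an EventHandler" point (placeholder types, not the real trait): a handler can swallow errors itself and stash the offending events for inspection after processing returns.

```rust
use std::sync::Mutex;

// Placeholder event type; the real code would hand us lightning's Event enum.
struct Event;

// A user-side handler that never propagates errors: events it could not
// handle are stashed so the caller can retry or log them after
// process_pending_events returns.
struct CollectingHandler {
    unhandled: Mutex<Vec<Event>>,
}

impl CollectingHandler {
    fn handle_event(&self, event: Event) {
        // try_handle stands in for whatever application-specific work handling involves.
        if let Err(event) = try_handle(event) {
            self.unhandled.lock().unwrap().push(event);
        }
    }
}

fn try_handle(event: Event) -> Result<(), Event> {
    // Pretend handling always succeeds in this sketch.
    let _ = event;
    Ok(())
}
```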
Right, I guess the only case where errors make sense is some kind of transitory issue that we know will resolve very soon, allowing us to handle the error momentarily while holding off persisting the ChannelManager. That said, I think you could just block in the handler in that case, as long as we only hold an event-specific lock?
Maybe we can have a separate lock allowing us to block any other event handlers from entering while still allowing adding new events from other things going on?
Right, I guess the only case where errors make sense is some kind of transitory issue that we know will resolve very soon, allowing us to handle the error momentarily while holding off persisting the ChannelManager. That said, I think you could just block in the handler in that case, as long as we only hold an event-specific lock?
Will update this to remove the return type from handle_event.
Maybe we can have a separate lock allowing us to block any other event handlers from entering while still allowing adding new events from other things going on?
pending_events is already behind a Mutex. But I guess if we want to allow other event producers to add more events, we could immediately drop the lock after either draining fully into a Vec or swapping as before.
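A minimal sketch of the swap-then-drop-the-lock idea (placeholder types again): the Mutex is held only long enough to take the queued events, so other producers can keep pushing while the handler runs.

```rust
use std::mem;
use std::sync::Mutex;

struct Event;

fn process_pending_events<F: Fn(Event)>(pending_events: &Mutex<Vec<Event>>, handler: F) {
    // Swap the queue out while holding the lock only briefly...
    let mut events = Vec::new();
    mem::swap(&mut events, &mut *pending_events.lock().unwrap());
    // ...then handle the events with the lock released, so producers can
    // enqueue new events (which will be picked up on the next call).
    for event in events {
        handler(event);
    }
}
```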
pending_events is already behind a Mutex. But I guess if we want to allow other event producers to add more events, we could immediately drop the lock after either draining fully into a Vec or swapping as before.
Right, I guess I was thinking that if a user were handling events in parallel it would be a bug. Now that I think about it more, maybe it's fine, as long as that's what the user wants.
FWIW, swapping retains the same behavior as get_and_clear_pending_events since there the lock was held for the duration of the method and then the events were handled by the user. I updated the code accordingly.
(force-pushed from 11b9e60 to be39d80)
(force-pushed from 9ae9261 to 405c1f3)
Could also be useful to document somewhere what we discussed offline re: expected behavior if we crash in the middle of handling a series of events.
(force-pushed from 405c1f3 to 0cc378f)
Could also be useful to document somewhere what we discussed offline re: expected behavior if we crash in the middle of handling a series of events.
Done.
lightning-net-tokio/src/lib.rs
Outdated
//! for _event in chain_monitor.get_and_clear_pending_events().drain(..) {
//!     // Handle the event!
//! }
//! channel_manager.process_pending_events(&|event| {
We can do it in a followup if you prefer, but I think now that we're doing an event check every time the ChannelManager wakes us up we can drop the stupid mpsc channel hack here entirely.
Sure, I can remove it. But I don't quite understand why (or if?) my change caused this to no longer be necessary.
The point of the channel here is to have something that you can use to wake up every time an Event may be generated (by hooking network reads). It's a stupid hack, but it works. Since we now do the handling based on the ChannelManager's own notification interface, I don't think there's any need to do it by hooking network reads anymore.
Ah, yeah, that's the piece I was missing. Still need to call that interface here to avoid busy waiting.
lightning/src/util/events.rs
Outdated
/// A trait indicating an object may generate events.
///
/// Events are processed by a handler given to [`process_pending_events`]. Therefore, implementors
/// should be mindful to avoid lock reentrancy if a handler may call back into the provider.
Should we make it a part of the contract that callbacks into the provider are allowed/don't run locked, except for (possibly) serialization (in the case of a ChannelManager)? It's kinda hard to work with this API without that guarantee, but it's kinda annoying to specify it.
For ChannelManager, both process_pending_events and handle_event take a read lock. The write lock is only taken after notification (i.e., after process_pending_events has returned) -- the handler itself doesn't serialize ChannelManager.
So I think the general advice is "don't introduce a deadlock". But now that I think of it, does RwLock allow reentrant reads? It appears so, though the docs say it might panic.
https://doc.rust-lang.org/std/sync/struct.RwLock.html#panics
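For reference, the reentrancy question in miniature; per the std docs this is not guaranteed to work, since read() may panic (or deadlock) if the lock is already held by the current thread:

```rust
use std::sync::RwLock;

fn main() {
    let lock = RwLock::new(0u32);
    let first = lock.read().unwrap();
    // Taking a second read lock on the same thread often works in practice,
    // but std::sync::RwLock leaves this implementation-defined: it may panic,
    // or deadlock if a writer is queued between the two reads.
    let second = lock.read().unwrap();
    assert_eq!(*first, *second);
}
```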
So I think the general advice is "don't introduce a deadlock".
Right, but the docs don't really give you enough information to know what would and what would not cause a deadlock. Reading it naively, I'd think I need to queue all the events and then call back to ChannelManager separately as calling back to it may cause a deadlock. Providing more information would let users do more without fear, though we have to be precise and decide what we're willing to commit to in the API and what not.
does RwLock allow reentrant reads?
I assumed it would...I mean you're allowed to take as many read locks as you want....
Right, but the docs don't really give you enough information to know what would and what would not cause a deadlock. Reading it naively, I'd think I need to queue all the events and then call back to ChannelManager separately as calling back to it may cause a deadlock. Providing more information would let users do more without fear, though we have to be precise and decide what we're willing to commit to in the API and what not.
Understood, but these aren't ChannelManager's docs. And note that this line is talking about someone implementing EventsProvider, not EventHandler. It's really up to an implementor of EventsProvider to determine what is safe for a handler to call. We don't need to restrict it at the trait level.
Right, I guess then the docs are somewhat confusing to me - I read them as "user, you should be careful of reentrancy when calling methods here", not "note to RL implementors, be careful of reentrancy". We should have some docs per-implementor that talks about reentrancy, I guess, though sadly Rust really hides the impl-level docs for a trait implementation, making them hard to find :(.
Re-wrote the docs to separate implementation and uses. Also added docs to the ChannelManager and ChainMonitor implementations. Let me know if anything isn't clear or could be expanded upon.
Having a whole section dedicated to Implementations that contains instructions for how to implement just feels really wrong in public docs here? In practice, I can't imagine we ever expect users to implement this, so it becomes just a list of instructions to ourselves in public docs? Maybe there's a way to rephrase it to just talk about the requirements on both sides without losing detail?
Combined and rephrased in terms of requirements.
@@ -43,32 +41,30 @@
//!
//! // Connect to node with pubkey their_node_id at addr:
//! async fn connect_to_node(peer_manager: PeerManager, chain_monitor: Arc<ChainMonitor>, channel_manager: ChannelManager, their_node_id: PublicKey, addr: SocketAddr) {
//!     let (sender, mut receiver) = mpsc::channel(2);
Hmm, so could this change affect new users of the sample if it isn't updated? Been wondering if the sample should be pinned to a specific commit for this reason. Or maybe the Cargo.lock makes it OK?
Good question! I think Cargo.lock handles this for us since it will add the revision to the package source.
lightning/src/util/events.rs
Outdated
/// A trait indicating an object may generate events.
///
/// Events are processed by a handler given to [`process_pending_events`]. Therefore, implementors
/// should be mindful to avoid lock reentrancy if a handler may call back into the provider.
Is "the provider" here the event-provider? Could possibly be clarified and I think an example of how such reentrancy could occur could make the docs more actionable (also for my understanding, because I'm not 100% clear on what reentrancy users should be aware of).
Rewrote as per earlier discussion. Let me know if anything could be cleared up further.
Submitted some comments too early. Just some docs nits + the reentrancy clarification, otherwise LGTM
(force-pushed from 0d8400e to 6905217)
lightning/src/ln/channelmanager.rs
Outdated
}
}

/// Provides events that must be periodically handled.
These docs are collapsed by default on the ChannelManager page, beside four other trait implementations, making them basically invisible IMO. I don't know if there's a way to change rustdoc's behavior here, but I wonder if it makes sense to explicitly link to the known concrete implementations in the EventsProvider docs and say, e.g., "see [`ChannelManager::process_pending_events`] and ..." which seems to handle opening the panel.
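Something along these lines could do it; a hedged sketch only, and the link-target paths below are assumptions about the crate layout rather than copied from the diff:

```rust
/// A trait indicating an object may generate events.
///
/// See, for example, [`ChannelManager::process_pending_events`] and
/// [`ChainMonitor::process_pending_events`] for the provided implementations.
///
/// [`ChannelManager::process_pending_events`]: crate::ln::channelmanager::ChannelManager::process_pending_events
/// [`ChainMonitor::process_pending_events`]: crate::chain::chainmonitor::ChainMonitor::process_pending_events
pub trait EventsProvider {
    // trait items elided in this sketch
}
```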
Couldn't find any way to configure this, unfortunately.
Weird that linking to the method implementations will expand both the trait impl and method documentation but linking to the trait implementation won't do the former. Added links as suggested and moved the docs from impl to the methods.
Aside from two functional comments here and the two doc comments, this all looks good.
lightning/src/ln/channelmanager.rs
Outdated
NotifyOption::SkipPersist
} else {
	result.replace(pending_events);
	NotifyOption::DoPersist
Because message events aren't persisted as a part of ChannelManager (and we never assume that messages will be delivered between when we create them and give them to peers), I don't think this needs a persistence. However, we should be checking if a persist is required as a result of process_pending_monitor_events (both here and in process_pending_events below).
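A hedged sketch of the persistence decision being suggested, using the NotifyOption values shown in the diff above; the assumption (not taken from the PR) is that monitor-event processing can report whether it changed any state that ChannelManager serializes.

```rust
// Simplified stand-ins; the real code returns these from the persistence-guard
// machinery shown in the diff above.
enum NotifyOption {
    DoPersist,
    SkipPersist,
}

// process_monitor_events is a placeholder for process_pending_monitor_events;
// here it is assumed to report whether it changed any channel state.
fn decide_persistence(process_monitor_events: impl FnOnce() -> bool) -> NotifyOption {
    // Message events are regenerated as needed and are not serialized with
    // ChannelManager, so they alone never force a persist. Monitor events,
    // however, can update channel state that must hit disk.
    if process_monitor_events() {
        NotifyOption::DoPersist
    } else {
        NotifyOption::SkipPersist
    }
}
```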
Ah, didn't realize that. Fixed.
@@ -218,11 +202,10 @@ impl Connection {
}
if let Disconnect::PeerDisconnected = disconnect_type {
	peer_manager_ref.socket_disconnected(&our_descriptor);
	Self::event_trigger(&mut us.lock().unwrap());
Oops, I forgot that we're also relying on the event trigger thing here to call peer_manager.process_events() (which is also dumb...). Can you add a call to it after the select in the loop and maybe here after the disconnect as well? (While you're at it, we don't need the peer_manager_ref clone anymore, it seems?)
Done.
(force-pushed from 6905217 to db0f707)
(force-pushed from db0f707 to 197bf3c)
D'oh. Sorry, I missed that this conflicted with #851 before I hit merge. Could you do a quick rebase?
(force-pushed from 197bf3c to 15b3e54)
Sure thing. One sec.
(force-pushed from 15b3e54 to 4e8bed2)
lightning/src/ln/channelmanager.rs
Outdated
self.check_free_holding_cells();
I think we need to check check_free_holding_cells for updates to persist as well.
Ah, sorry. Didn't look at that closely enough. Figured it was a simple merge conflict. Let me fix it.
To confirm my understanding, should we persist whenever check_free_holding_cells gets any failed HTLCs back from maybe_free_holding_cell_htlcs? Or is the condition more complicated?
We should also persist if we get a ChannelMonitorUpdate back from maybe_free_holding_cell_htlcs (ChannelMonitorUpdates are always a surefire reason to persist - if we're out of sync when we come back up we have to force-close). Reading the code in channel that should suffice, but maybe add a comment in maybe_free_holding_cell_htlcs and/or free_holding_cell_htlcs that notes that we won't persist unless we get some values back in the return.
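A hedged sketch of that condition with placeholder types (not the actual Channel/ChannelManager signatures): persistence is needed whenever freeing the holding cell hands back failed HTLCs or a ChannelMonitorUpdate.

```rust
// Placeholder types; the real values come back from Channel's
// maybe_free_holding_cell_htlcs.
struct ChannelMonitorUpdate;
struct FailedHtlc;

// Returns true when the caller must persist the ChannelManager: either HTLCs
// were failed back, or a monitor update was produced (monitor updates always
// require a persist, since being out of sync on restart forces a close).
fn holding_cell_requires_persist(
    failed_htlcs: &[FailedHtlc],
    monitor_update: &Option<ChannelMonitorUpdate>,
) -> bool {
    !failed_htlcs.is_empty() || monitor_update.is_some()
}
```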
Left the comment out since it's really a concern of ChannelManager. Made the logic in check_free_holding_cells explicit, though, so hopefully this is clear.
It is unnecessary since ChannelManager has a notification interface as of 12c735a.
(force-pushed from 4e8bed2 to a1f95de)
Rebroadcast channel_announcements when we broadcast a node_announce
Support processing Events from ChannelManager and ChainMonitor in the background. Adds an EventHandler trait for users to implement and pass to BackgroundProcessor. Refactors the EventsProvider trait to take such an event handler.
This also addresses a race condition identified in #899 by making use of PersistenceNotifierGuard in ChannelManager when processing events.
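For readers skimming the description, a hedged sketch of the rough shape of that refactor. This is simplified: the diff earlier in the thread shows the actual trait taking the handler through a Deref bound (so references and smart pointers both work), and by this point in review handle_event no longer returns a Result.

```rust
/// Placeholder for lightning's Event enum.
pub enum Event {}

/// Implemented by users (or closures, via the blanket impl below) to handle
/// one event at a time.
pub trait EventHandler {
    fn handle_event(&self, event: Event);
}

// Lets call sites like `channel_manager.process_pending_events(&|event| { ... })`
// pass a plain closure.
impl<F> EventHandler for F
where
    F: Fn(Event),
{
    fn handle_event(&self, event: Event) {
        self(event)
    }
}

/// Implemented by ChannelManager and ChainMonitor; each implementor decides
/// how its pending events are stored and when the handler runs.
pub trait EventsProvider {
    fn process_pending_events<H: EventHandler>(&self, handler: H);
}
```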