Add ChannelManager support for monitor update failure in one place #213
Conversation
Force-pushed from c884ffe to 6a90619
route: route.clone(),
session_priv: session_priv.clone(),
}, onion_packet).map_err(|he| APIError::ChannelUnavailable{err: he.err})?
};
Could we remove update and commitment_signed from the returns of chan.send_htlc_and_commit() and chan.get_last_commitment_update() here? It seems that could simplify the code here.
Hmm, I would have to think harder about that refactor, but I don't think it would help here - we have to get the new ChannelMonitor out of chan, then drop the chan/channel_state reference, then call add_update_monitor, then (possibly) call handle_monitor_update_fail either way, so even if we don't have to pass the commitment update up we'd still have (most of) the awkward structure here.
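Roughly the shape I mean, as a self-contained toy (none of these names, eg mark_monitor_update_failed, are our actual API - it's just the "build update, hand monitor to user code, maybe freeze" pattern):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct ChannelMonitor; // stand-in for the real ChannelMonitor
struct Channel { frozen: bool }

impl Channel {
    fn send_htlc_and_commit(&mut self) -> ChannelMonitor { ChannelMonitor }
    fn mark_monitor_update_failed(&mut self) { self.frozen = true; }
}

// Stand-in for the user-provided monitor-update call, which may fail.
fn add_update_monitor(_monitor: ChannelMonitor, ok: bool) -> Result<(), ()> {
    if ok { Ok(()) } else { Err(()) }
}

struct Manager {
    channels: Mutex<HashMap<[u8; 32], Channel>>,
}

impl Manager {
    fn send_payment(&self, chan_id: [u8; 32], monitor_ok: bool) -> Result<(), ()> {
        let mut channel_state = self.channels.lock().unwrap();
        // Build the commitment update and pull the new monitor out of the
        // channel; the mutable borrow of the channel entry ends here.
        let new_monitor = {
            let chan = channel_state.get_mut(&chan_id).ok_or(())?;
            chan.send_htlc_and_commit()
        };
        // Hand the monitor to user code (eg a watchtower) while still holding
        // the channel_state lock, so no other messages can be generated for
        // this peer in the meantime.
        if add_update_monitor(new_monitor, monitor_ok).is_err() {
            // Monitor update failed: freeze the channel instead of handing the
            // commitment update to the peer.
            channel_state.get_mut(&chan_id).unwrap().mark_monitor_update_failed();
            return Ok(());
        }
        // ...otherwise the update_add/commitment_signed would go to the peer here.
        Ok(())
    }
}
```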
The add_update_monitor operation seems to require an individual lock for each channel, held while add_update_monitor (and get_last_commitment_update()) is called. Without that lock, it seems we'd need rollback logic on the channel state.
Indeed, that's why it's currently all under the channel_state lock (or did I misunderstand your point?).
Yeah, that's basically the right point. I was thinking more along the lines of finer-grained locks - perhaps two per channel, one for the remote and one for the local state. And we don't generate messages if we can't update the channel_monitor; the client API call blocks (or fails/queues after a try-lock).
For the receiving side, I guess incoming messages would be buffered (in a bounded-size buffer) while update_monitor() is on retry logic.
I wonder if that's too pessimistic performance-wise, but there isn't much to do/progress on without committing state to the monitor?
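Something like this toy bounded buffer is what I have in mind (the cap and all the names are purely illustrative, not a concrete proposal): messages received while update_monitor() is being retried get queued, and once the cap is hit further messages are rejected so memory stays bounded.

```rust
use std::collections::VecDeque;

// Assumed cap, purely illustrative.
const MAX_BUFFERED_MSGS: usize = 50;

struct InboundBuffer<M> {
    pending: VecDeque<M>,
}

impl<M> InboundBuffer<M> {
    fn new() -> Self {
        InboundBuffer { pending: VecDeque::new() }
    }

    /// Queue a message received while the monitor update is pending.
    /// Returns false if the buffer is full and the message was rejected.
    fn push(&mut self, msg: M) -> bool {
        if self.pending.len() >= MAX_BUFFERED_MSGS {
            return false;
        }
        self.pending.push_back(msg);
        true
    }

    /// Drain everything for processing once update_monitor() finally succeeds.
    fn drain(&mut self) -> impl Iterator<Item = M> + '_ {
        self.pending.drain(..)
    }
}
```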
Yea, as I noted at #213 (comment) there are kinda two approaches that could have been taken here - I took the more complicated approach that doesn't require message queueing, but I could have just queued messages and handled the DoS concerns some other way. Eventually we should move towards per-channel locks or so to fix the parallelism issue here.
What is the whole thought process behind moving the channel_monitor updates inside the channel_state lock? I mean, is it to be sure the channel is stuck and can't go forward if the monitor fails to register an update?
src/ln/msgs.rs
Outdated
@@ -488,6 +488,17 @@ pub enum HTLCFailChannelUpdate {
},
}

/// For events which result in both a RevokeAndACK and a CommitmentUpdate, by default they should
/// be sent in the order they appear in the return value, however sometimes the order needs to be
/// variable at rumtime. In those cases, this enum is also returned.
*runtime
I kinda prefer rum-time, but, ok, fixed :p
Ah nope, we need to get 0.1 out before we get there ;)
src/ln/msgs.rs
Outdated
@@ -488,6 +488,17 @@ pub enum HTLCFailChannelUpdate {
},
}

/// For events which result in both a RevokeAndACK and a CommitmentUpdate, by default they should
/// be sent in the order they appear in the return value, however sometimes the order needs to be
Maybe give a concrete example of a case in which the order needs to be variable?
Added.
src/ln/channel.rs
Outdated
@@ -296,6 +296,12 @@ pub(super) struct Channel {
cur_local_commitment_transaction_number: u64,
cur_remote_commitment_transaction_number: u64,
value_to_self_msat: u64, // Excluding all pending_htlcs, excluding fees
/// Unon recipt of a channel_reestablish we have to figure out whether to send a revoke_and_ack
*upon, *receipt
src/ln/channel.rs
Outdated
/// Unon recipt of a channel_reestablish we have to figure out whether to send a revoke_and_ack
/// first or a commitment update first. Generally, we prefer to send revoke_and_ack first, but
/// if we had a pending commitment update of our own waiting on a remote revoke when we
/// received the latest commitment update from the remote we have to make sure that gets resent
*it gets?
Tested and reviewed the message ordering part, seems good. Started to review the more consequential ChannelMonitor part, will finish later today/tomorrow.
Force-pushed from 6a90619 to 22d89eb
The move inside the lock is, sadly, required to keep things consistent. Actually, right now things are kinda fucked - messages must be sent to the remote end in the order they are generated, but we don't actually have any mechanism to enforce that when they are delivered in different ways - we either deliver messages by putting them in events (and assume that those events get processed before any further calls into ChannelManager generate messages for the same peer) or by returning them. This could absolutely lead to bugs, and it's made even worse by having callbacks into the user's code in the middle of this process, where those callbacks could take variable lengths of time (eg if they're sending to a watchtower over the network). Moving things inside the lock at least helps address this. All that to say: I'm not too worried about breaking the locking crap right now, cause it's gonna need to be re-written anyway.

That aside, there are kinda two ways I could have approached channel pausing - either (a) store inbound/outbound messages for a given channel in some Vec in ChannelManager and don't deliver anything until the monitor is updated, or (b) the way it's done here - deliver things and process them, but don't generate responses until we get a monitor update in. Option (a) is way simpler, but also leads to DoS questions about peers sending us a bunch of messages that we can't process/check but have to keep around in memory until some time in the future. Option (b) is way more complicated, but I think it's overall (if it's actually implementable without too much complexity blowup) a better approach.

Option (b) here, however, does mean that we have to call Channel::monitor_update_failed before any other calls into the same Channel happen after the call which resulted in a monitor update failing. We'd have a similar issue for option (a) in terms of making sure further messages aren't sent to the peer, but it may be simpler. Sorry, I guess I should have described a bunch of that in the PR description.
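To make option (b) a bit more concrete, here's a toy sketch of the process-but-hold-responses idea. All the type and method names here are made up for illustration; they are not the actual Channel API.

```rust
struct RevokeAndACK; // stand-in for the real message type

#[derive(Default)]
struct ToyChannel {
    monitor_update_failed: bool,
    held_raa: Option<RevokeAndACK>,
}

impl ToyChannel {
    /// Handle a (pretend) commitment_signed: channel state is always advanced,
    /// but the response RAA is withheld while a monitor update is pending.
    fn commitment_signed(&mut self) -> Option<RevokeAndACK> {
        let raa = RevokeAndACK; // would be built from the advanced state
        if self.monitor_update_failed {
            self.held_raa = Some(raa);
            None
        } else {
            Some(raa)
        }
    }

    /// Called by the manager when a monitor update fails, before any further
    /// calls into this channel happen.
    fn mark_monitor_update_failed(&mut self) {
        self.monitor_update_failed = true;
    }

    /// Called once the monitor finally accepts the update; returns any response
    /// that was held back so it can now be sent to the peer.
    fn monitor_updating_restored(&mut self) -> Option<RevokeAndACK> {
        self.monitor_update_failed = false;
        self.held_raa.take()
    }
}
```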
chan
};
mem::drop(channel_state_lock);
self.finish_force_close_channel(chan.force_shutdown());
Maybe we could modify finish_force_close_channel's parameters to take an Option<channel_state> to avoid the drop-then-retake-the-same-lock dance in case there are HTLCs to fail backward?
Hmm, sadly fail_htlc_backwards_internal eats the lock, so it still has to be taken multiple times inside the fail loop in finish_force_close_channel. If we move towards per-channel locks it'll be a different lock anyway, so that won't help.
/// instead return PermanentFailure to force closure of the channel ASAP.
///
/// Even when a channel has been "frozen" updates to the ChannelMonitor can continue to occur
/// (eg if an inbound HTLC which we forwarded was claimed upstream resulting in us attempting
Hmm, in the case of a "frozen" channel, could we have a ChannelMonitor trying (and failing) to claim ghost HTLC outputs on the remote commitment tx, that tx being up-to-date and no longer containing the HTLC output in question because we passed it the preimage? Seems not to me, because the remote commitment tx needs a commitment_signed, but I wonder if there are other weird cases like that.
Not sure what you mean by "ghost htlc outputs"? The unupdated ChannelMonitor should ignore any offered HTLC outputs that it doesn't have the preimage for assuming the remote side is just gonna claim them via timeout, whereas our local/updated copy should do the claim.
It wasn't exactly the case I had in mind; the one I meant was: what if we have a channel in ChannelMonitorUpdateErr::TemporaryFailure and we pass the preimage backward via claim_funds? Harmless in fact, because the other node needs a commitment_signed. Go ahead, I was just trying to speculate on weird on-chain consequences of a frozen channel.
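For reference, the kind of client-side behaviour being discussed here - a monitor implementation returning TemporaryFailure when a (hypothetical) watchtower write fails, so the channel gets frozen rather than force-closed - might look roughly like this. Only the ChannelMonitorUpdateErr variant names come from the code under review; everything else is made up, and the enum is re-declared locally just to keep the example self-contained.

```rust
enum ChannelMonitorUpdateErr {
    TemporaryFailure,
    PermanentFailure,
}

struct WatchtowerMonitor;

impl WatchtowerMonitor {
    /// Try to persist the updated monitor; on a transient failure, ask the
    /// ChannelManager to freeze the channel and retry later.
    fn add_update_monitor(&self, serialized_monitor: &[u8]) -> Result<(), ChannelMonitorUpdateErr> {
        match self.send_to_watchtower(serialized_monitor) {
            Ok(()) => Ok(()),
            Err(_) => Err(ChannelMonitorUpdateErr::TemporaryFailure),
        }
    }

    fn send_to_watchtower(&self, _data: &[u8]) -> Result<(), std::io::Error> {
        // Stand-in for a network write to the watchtower.
        Ok(())
    }
}
```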
src/ln/channel.rs
Outdated
@@ -1685,6 +1709,10 @@ impl Channel {
}
}
}
if self.channel_state & (ChannelState::MonitorUpdateFailed as u32) == 0 {
// This is a response to our post-monitor-failed unfreeze messages...
"Due to being in a post-monitor-failed state, we ensure that order is reset to normal after all freeze messages habe been sent" ? Sorry don't find the comment enough explicit..
src/ln/channel.rs
Outdated
@@ -1843,6 +1879,10 @@ impl Channel {
self.their_cur_commitment_point = Some(msg.next_per_commitment_point);
self.cur_remote_commitment_transaction_number -= 1;
self.received_commitment_while_awaiting_raa = false;
if self.channel_state & (ChannelState::MonitorUpdateFailed as u32) == 0 {
// This is a response to our post-monitor-failed unfreeze messages...
Same as above
Side-thought: if the channel becomes unconfirmed due to a reorg, maybe we should notify SimpleManyChannelMonitor to prune the corresponding ChannelMonitor?
Well, apart from do_test_monitor_temporary_update_fail (see comment) I've reviewed the whole thing. As a temporary solution while waiting for fine-grained locks, I'm okay with this approach and the logic seems good to me. For the details please see the bunch of comments. Option (a) seems to me the second-best choice; yes, per-channel locking maybe needs more thought at first, but once done it'll spare us an intermediate logic layer between ChannelManager and PeerHandler to buffer/prune/resend messages.
Force-pushed from 22d89eb to 71d666e
Re: side-note: indeed, ChannelMonitor pruning should be implemented...somewhere. I don't believe we do any pruning at all right now? Note that the (a)/(b) options above don't (really) have a ton to do with per-channel locking - it's mostly a question of whether we want an inbound-message buffer in ChannelManager.
Force-pushed from 71d666e to b30511d
This really, really sucks as it defeats almost all of the cross-channel parallelism we'd intended to have - waiting on a client to update a watchtower for an unrelated channel before we can process any messages is really shitty. We should revisit this with per-channel locks as a compile-time option post-0.1.
Force-pushed from b30511d to 497643a
Gonna go ahead and take this to move #217/0.0.6 forward, though review of the test and any other bits that folks want to do post-merge is obviously also welcome.
@@ -2964,6 +3100,9 @@ impl Channel {
if (self.channel_state & (ChannelState::PeerDisconnected as u32)) == (ChannelState::PeerDisconnected as u32) {
panic!("Cannot create commitment tx while disconnected, as send_htlc will have returned an Err so a send_commitment precondition has been violated");
}
if (self.channel_state & (ChannelState::MonitorUpdateFailed as u32)) == (ChannelState::PeerDisconnected as u32) {
oops?
This is the first tiny step towards #61, mostly implementing the logic at the Channel level and getting the structure in place. It only actually adds support for failed ChannelMonitor updates in one place (send_payment's generated monitor update). I'm mostly PR'ing this cause it got large and I want to shift focus to 0.0.6 but don't want this to bitrot.
Based on #212.
You can tell how confident I am in this all being correct by the size of the tests...