Add ChannelManager support for monitor update failure in one place #213
Conversation
Force-pushed from c884ffe to 6a90619
route: route.clone(),
session_priv: session_priv.clone(),
}, onion_packet).map_err(|he| APIError::ChannelUnavailable{err: he.err})?
};
Could we remove update and commitment_signed from the returns of chan.send_htlc_and_commit() and chan.get_last_commitment_update() here? It seems that could simplify the code here.
Hmm, I would have to think harder about that refactor, but I don't think it would help here - we have to get the new ChannelMonitor out of chan, then drop the chan/channel_state reference, then call add_update_monitor, then (possibly) call handle_monitor_update_fail either way, so even if we don't have to pass the commitment update up we'd still have (most of) the awkward structure here.
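Roughly the shape I mean, as a self-contained toy (none of these names, eg mark_monitor_update_failed, are our actual API - it's just the "build update, hand monitor to user code, maybe freeze" pattern):

```rust
use std::collections::HashMap;
use std::sync::Mutex;

struct ChannelMonitor; // stand-in for the real ChannelMonitor
struct Channel { frozen: bool }

impl Channel {
    fn send_htlc_and_commit(&mut self) -> ChannelMonitor { ChannelMonitor }
    fn mark_monitor_update_failed(&mut self) { self.frozen = true; }
}

// Stand-in for the user-provided monitor-update call, which may fail.
fn add_update_monitor(_monitor: ChannelMonitor, ok: bool) -> Result<(), ()> {
    if ok { Ok(()) } else { Err(()) }
}

struct Manager {
    channels: Mutex<HashMap<[u8; 32], Channel>>,
}

impl Manager {
    fn send_payment(&self, chan_id: [u8; 32], monitor_ok: bool) -> Result<(), ()> {
        let mut channel_state = self.channels.lock().unwrap();
        // Build the commitment update and pull the new monitor out of the
        // channel; the mutable borrow of the channel entry ends here.
        let new_monitor = {
            let chan = channel_state.get_mut(&chan_id).ok_or(())?;
            chan.send_htlc_and_commit()
        };
        // Hand the monitor to user code (eg a watchtower) while still holding
        // the channel_state lock, so no other messages can be generated for
        // this peer in the meantime.
        if add_update_monitor(new_monitor, monitor_ok).is_err() {
            // Monitor update failed: freeze the channel instead of handing the
            // commitment update to the peer.
            channel_state.get_mut(&chan_id).unwrap().mark_monitor_update_failed();
            return Ok(());
        }
        // ...otherwise the update_add/commitment_signed would go to the peer here.
        Ok(())
    }
}
```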
The add_update_monitor operation seems to require an individual lock for each channel, held while add_update_monitor (and get_last_commitment_update()) is called. Without that lock, it seems we'd need rollback logic on the channel state.
Indeed, that's why it's currently all under the channel_state lock (or did I misunderstand your point?).
Yeah, that's basically the right point. I was thinking more along the lines of finer-grained locks - perhaps two per channel, one for the remote and one for the local state. And we don't generate messages if we can't update the channel_monitor; the client API call blocks (or fails/queues after a try-lock).
For the receiving side, I guess incoming messages would be buffered (in a bounded-size buffer) while update_monitor() is on retry logic.
I wonder if that's too pessimistic performance-wise, but there isn't much to do/progress on without committing state to the monitor?
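Something like this toy bounded buffer is what I have in mind (the cap and all the names are purely illustrative, not a concrete proposal): messages received while update_monitor() is being retried get queued, and once the cap is hit further messages are rejected so memory stays bounded.

```rust
use std::collections::VecDeque;

// Assumed cap, purely illustrative.
const MAX_BUFFERED_MSGS: usize = 50;

struct InboundBuffer<M> {
    pending: VecDeque<M>,
}

impl<M> InboundBuffer<M> {
    fn new() -> Self {
        InboundBuffer { pending: VecDeque::new() }
    }

    /// Queue a message received while the monitor update is pending.
    /// Returns false if the buffer is full and the message was rejected.
    fn push(&mut self, msg: M) -> bool {
        if self.pending.len() >= MAX_BUFFERED_MSGS {
            return false;
        }
        self.pending.push_back(msg);
        true
    }

    /// Drain everything for processing once update_monitor() finally succeeds.
    fn drain(&mut self) -> impl Iterator<Item = M> + '_ {
        self.pending.drain(..)
    }
}
```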
Yea, as I noted at #213 (comment) there are kinda two approaches that could have been taken here - I took the more complicated approach that doesn't require message queueing, but I could have just queued messages and handled the DoS concerns some other way. Eventually we should move towards per-channel locks or so to fix the parallelism issue here.
What is the whole thought process behind moving the channel_monitor updates inside the channel_state lock? I mean, is it to be sure the channel is stuck and can't go forward if the monitor fails to register an update?
src/ln/msgs.rs
Outdated
@@ -488,6 +488,17 @@ pub enum HTLCFailChannelUpdate {
},
}

/// For events which result in both a RevokeAndACK and a CommitmentUpdate, by default they should
/// be sent in the order they appear in the return value, however sometimes the order needs to be
/// variable at rumtime. In those cases, this enum is also returned.
*runtime
I kinda prefer rum-time, but, ok, fixed :p
Ah nope, we need to get 0.1 out before we get there ;)
src/ln/msgs.rs
Outdated
@@ -488,6 +488,17 @@ pub enum HTLCFailChannelUpdate {
},
}

/// For events which result in both a RevokeAndACK and a CommitmentUpdate, by default they should
/// be sent in the order they appear in the return value, however sometimes the order needs to be
Maybe give a concrete example of a case in which the order needs to be variable?
Added.
src/ln/channel.rs
Outdated
@@ -296,6 +296,12 @@ pub(super) struct Channel {
cur_local_commitment_transaction_number: u64,
cur_remote_commitment_transaction_number: u64,
value_to_self_msat: u64, // Excluding all pending_htlcs, excluding fees
/// Unon recipt of a channel_reestablish we have to figure out whether to send a revoke_and_ack
*upon, *receipt
src/ln/channel.rs
Outdated
/// Unon recipt of a channel_reestablish we have to figure out whether to send a revoke_and_ack
/// first or a commitment update first. Generally, we prefer to send revoke_and_ack first, but
/// if we had a pending commitment update of our own waiting on a remote revoke when we
/// received the latest commitment update from the remote we have to make sure that gets resent
*it gets?
Tested and reviewed the message ordering part, seems good. Started to review the more consequential ChannelMonitor part, will finish later today/tomorrow.
Force-pushed from 6a90619 to 22d89eb
The move inside the lock is, sadly, required to keep things consistent. Actually, right now things are kinda fucked - messages must be sent to the remote end in the order they are generated, but we don't actually have any mechanism to enforce that when they are delivered in different ways - we either deliver messages by putting them in events (and assume that those events get processed before any further calls into ChannelManager generate messages for the same peer) or by returning them. This could absolutely lead to bugs, and it's made even worse by having callbacks into the user's code in the middle of this process, where those callbacks could take variable lengths of time (eg if they're sending to a watchtower over the network). Moving things inside the lock at least helps address this. All that to say: I'm not too worried about breaking the locking crap right now, cause it's gonna need to be re-written anyway.

That aside, there are kinda two ways I could have approached channel pausing - either (a) store inbound/outbound messages for a given channel in some Vec in ChannelManager and don't deliver anything until the monitor is updated, or (b) the way it's done here - deliver things and process them, but don't generate responses until we get a monitor update in. Option (a) is way simpler, but also leads to DoS questions about peers sending us a bunch of messages that we can't process/check but have to keep around in memory until some time in the future. Option (b) is way more complicated, but I think it's overall (if it's actually implementable without too much complexity blowup) a better approach.

Option (b) here, however, does mean that we have to call Channel::monitor_update_failed before any other calls into the same Channel happen after the call which resulted in a monitor update failing. We'd have a similar issue for option (a) in terms of making sure further messages aren't sent to the peer, but it may be simpler. Sorry, I guess I should have described a bunch of that in the PR description.
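To make option (b) a bit more concrete, here's a toy sketch of the process-but-hold-responses idea. All the type and method names here are made up for illustration; they are not the actual Channel API.

```rust
struct RevokeAndACK; // stand-in for the real message type

#[derive(Default)]
struct ToyChannel {
    monitor_update_failed: bool,
    held_raa: Option<RevokeAndACK>,
}

impl ToyChannel {
    /// Handle a (pretend) commitment_signed: channel state is always advanced,
    /// but the response RAA is withheld while a monitor update is pending.
    fn commitment_signed(&mut self) -> Option<RevokeAndACK> {
        let raa = RevokeAndACK; // would be built from the advanced state
        if self.monitor_update_failed {
            self.held_raa = Some(raa);
            None
        } else {
            Some(raa)
        }
    }

    /// Called by the manager when a monitor update fails, before any further
    /// calls into this channel happen.
    fn mark_monitor_update_failed(&mut self) {
        self.monitor_update_failed = true;
    }

    /// Called once the monitor finally accepts the update; returns any response
    /// that was held back so it can now be sent to the peer.
    fn monitor_updating_restored(&mut self) -> Option<RevokeAndACK> {
        self.monitor_update_failed = false;
        self.held_raa.take()
    }
}
```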
chan
};
mem::drop(channel_state_lock);
self.finish_force_close_channel(chan.force_shutdown());
Maybe we could modify finish_force_close_channel's parameters to take an Option<channel_state> to avoid the drop-then-retake-the-same-lock dance in case there are HTLCs to fail backward?
Hmm, sadly fail_htlc_backwards_internal eats the lock, so it still has to be taken multiple times inside the fail loop in finish_force_close_channel. If we move towards per-channel locks it'll be a different lock anyway, so that won't help.
/// instead return PermanentFailure to force closure of the channel ASAP.
///
/// Even when a channel has been "frozen" updates to the ChannelMonitor can continue to occur
/// (eg if an inbound HTLC which we forwarded was claimed upstream resulting in us attempting
Hmm, in the case of a "frozen" channel, could we have a ChannelMonitor trying (and failing) to claim ghost HTLC outputs on the remote commitment tx, that tx being up-to-date and no longer containing the HTLC output in question because we passed it the preimage? Seems not to me, because the remote commitment tx needs a commitment_signed, but I wonder if there are other weird cases like that.
Not sure what you mean by "ghost htlc outputs"? The unupdated ChannelMonitor should ignore any offered HTLC outputs that it doesn't have the preimage for assuming the remote side is just gonna claim them via timeout, whereas our local/updated copy should do the claim.
It wasn't exactly the case I had in mind; the one I meant was: what if we have a channel in ChannelMonitorUpdateErr::TemporaryFailure and we pass the preimage backward via claim_funds? Harmless in fact, because the other node needs a commitment_signed. Go ahead, I was just trying to speculate on weird on-chain consequences of a frozen channel.
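For reference, the kind of client-side behaviour being discussed here - a monitor implementation returning TemporaryFailure when a (hypothetical) watchtower write fails, so the channel gets frozen rather than force-closed - might look roughly like this. Only the ChannelMonitorUpdateErr variant names come from the code under review; everything else is made up, and the enum is re-declared locally just to keep the example self-contained.

```rust
enum ChannelMonitorUpdateErr {
    TemporaryFailure,
    PermanentFailure,
}

struct WatchtowerMonitor;

impl WatchtowerMonitor {
    /// Try to persist the updated monitor; on a transient failure, ask the
    /// ChannelManager to freeze the channel and retry later.
    fn add_update_monitor(&self, serialized_monitor: &[u8]) -> Result<(), ChannelMonitorUpdateErr> {
        match self.send_to_watchtower(serialized_monitor) {
            Ok(()) => Ok(()),
            Err(_) => Err(ChannelMonitorUpdateErr::TemporaryFailure),
        }
    }

    fn send_to_watchtower(&self, _data: &[u8]) -> Result<(), std::io::Error> {
        // Stand-in for a network write to the watchtower.
        Ok(())
    }
}
```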
src/ln/channel.rs
Outdated
@@ -1685,6 +1709,10 @@ impl Channel {
}
}
}
if self.channel_state & (ChannelState::MonitorUpdateFailed as u32) == 0 {
// This is a response to our post-monitor-failed unfreeze messages...
"Due to being in a post-monitor-failed state, we ensure that order is reset to normal after all freeze messages habe been sent" ? Sorry don't find the comment enough explicit..
src/ln/channel.rs
Outdated
@@ -1843,6 +1879,10 @@ impl Channel {
self.their_cur_commitment_point = Some(msg.next_per_commitment_point);
self.cur_remote_commitment_transaction_number -= 1;
self.received_commitment_while_awaiting_raa = false;
if self.channel_state & (ChannelState::MonitorUpdateFailed as u32) == 0 {
// This is a response to our post-monitor-failed unfreeze messages...
Same as above
Side-thought: if the channel becomes unconfirmed due to a reorg, maybe we should notify SimpleManyChannelMonitor to prune the corresponding ChannelMonitor?
Well, apart from do_test_monitor_temporary_update_fail (see comment) I've reviewed the whole thing. As a temporary solution while waiting for fine-grained locks, I'm okay with this approach and the logic seems good to me. For the details please see the bunch of comments. Option (a) seems to me the second-best choice; yes, per-channel locking maybe needs more thought at first, but once done it'll spare us an intermediate logic layer between ChannelManager and PeerHandler to buffer/prune/resend messages.
Force-pushed from 22d89eb to 71d666e
Re: side-note: indeed, ChannelMonitor pruning should be implemented...somewhere. I don't believe we do any pruning at all right now? Note that the (a)/(b) options above don't (really) have a ton to do with per-channel locking - it's mostly a question of whether we want an inbound-message buffer in ChannelManager.
Force-pushed from 71d666e to b30511d
This really, really sucks as it defeats almost all of the cross-channel parallelism we'd intended to have - waiting on a client to update a watchtower for an unrelated channel before we can process any messages is really shitty. We should revisit this with per-channel locks as a compile-time option post-0.1.
Force-pushed from b30511d to 497643a
Gonna go ahead and take this to move #217/0.0.6 forward, though review of the test and any other bits that folks want to do post-merge is obviously also welcome.
@@ -2964,6 +3100,9 @@ impl Channel {
if (self.channel_state & (ChannelState::PeerDisconnected as u32)) == (ChannelState::PeerDisconnected as u32) {
panic!("Cannot create commitment tx while disconnected, as send_htlc will have returned an Err so a send_commitment precondition has been violated");
}
if (self.channel_state & (ChannelState::MonitorUpdateFailed as u32)) == (ChannelState::PeerDisconnected as u32) {
oops?
This is the first tiny step towards #61, mostly implementing the logic at the Channel level and getting the structure in place. It only actually adds support for failed ChannelMonitor updates in one place (send_payment's generated monitor update). I'm mostly PR'ing this cause it got large and I want to shift focus to 0.0.6 but don't want this to bitrot.
Based on #212.
You can tell how confident I am in this all being correct by the size of the tests...