
Allow get_per_commitment_point to fail. #2487


Closed

Conversation

@waterson (Contributor)

This changes `ChannelSigner::get_per_commitment_point` to return a `Result<PublicKey, ()>`, which means that it can fail. In order to accommodate failure at the callsites that otherwise assumed it was infallible, we cache the next per-commitment point each time the state advances in the `ChannelContext` and change the callsites to use the cached value instead.
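
For context, a minimal sketch of the signature change (the trait lives in `lightning/src/sign/mod.rs`; surrounding trait items elided, and the exact doc wording here is illustrative):

	use bitcoin::secp256k1::{All, PublicKey, Secp256k1};

	pub trait ChannelSigner {
		/// Gets the per-commitment point for a specific commitment number.
		/// After this PR it returns a `Result`, so a remote or asynchronous
		/// signer can report that the point is not yet available.
		fn get_per_commitment_point(
			&self, idx: u64, secp_ctx: &Secp256k1<All>,
		) -> Result<PublicKey, ()>;

		// ...remaining trait methods unchanged...
	}

`ChannelContext` then carries a cached `next_per_commitment_point`, refreshed whenever the channel state advances, so most callsites read the cache instead of calling the now-fallible signer.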

@devrandom (Member) left a comment

I think reducing complexity by caching the next point may be a good approach, since it will reduce the spots where we need to do something clever. But see below about attempting to force-close.

// *next* state. We recompute it each time the state changes because the state changes in places
// that might be fallible: in particular, if the commitment point must be fetched from a remote
// source, we want to ensure it happens at a point where we can actually fail somewhat gracefully;
// i.e., force-closing a channel is better than a panic!
@devrandom (Member) commented Aug 10, 2023

Well, if you can't compute the commitment point, then you can't force-close either. And once you can force-close, the signer is back, and you can just continue as normal instead. So it doesn't seem like `ChannelError::Close` is the right action in this case. Crashing may actually be more reasonable until we have a retry mechanism (crash, and you'll retry when you restart).

@waterson (Contributor, Author) replied

🤦 Yes, of course... you are right about this.

@TheBlueMatt (Collaborator) commented Aug 14, 2023

At least during channel open, erroring seems fine enough, better than nothing, but indeed during normal channel operation, force-closing the channel sucks. Maybe we could limit this PR to just the opening parts and handle the during-run parts separately?

@waterson force-pushed the fallible-per-commitment-point branch from e86381f to 4c12822 on August 18, 2023 at 13:44
@codecov-commenter commented Aug 18, 2023

Codecov Report

Patch coverage: 81.98% and project coverage change: -0.11% ⚠️

Comparison is base (1f2ee21) 90.58% compared to head (b13e4e8) 90.47%.
Report is 17 commits behind head on main.


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2487      +/-   ##
==========================================
- Coverage   90.58%   90.47%   -0.11%     
==========================================
  Files         110      110              
  Lines       57422    57889     +467     
==========================================
+ Hits        52017    52377     +360     
- Misses       5405     5512     +107     
Files Changed Coverage Δ
lightning/src/util/test_utils.rs 73.61% <ø> (ø)
lightning/src/ln/channelmanager.rs 86.31% <57.89%> (-0.89%) ⬇️
lightning/src/chain/onchaintx.rs 90.78% <66.66%> (+0.37%) ⬆️
lightning/src/util/test_channel_signer.rs 84.17% <76.92%> (-2.93%) ⬇️
lightning/src/ln/functional_test_utils.rs 89.03% <78.57%> (+0.04%) ⬆️
lightning/src/ln/channel.rs 89.37% <83.33%> (-0.47%) ⬇️
lightning/src/chain/channelmonitor.rs 94.64% <100.00%> (ø)
lightning/src/ln/functional_tests.rs 98.15% <100.00%> (-0.02%) ⬇️
lightning/src/sign/mod.rs 81.47% <100.00%> (ø)

... and 12 files with indirect coverage changes


@@ -2633,7 +2633,7 @@ impl<Signer: WriteableEcdsaChannelSigner> ChannelMonitorImpl<Signer> {
 				per_commitment_number: htlc.per_commitment_number,
 				per_commitment_point: self.onchain_tx_handler.signer.get_per_commitment_point(
 					htlc.per_commitment_number, &self.onchain_tx_handler.secp_ctx,
-				),
+				).unwrap(),
Collaborator

I think we can swallow this one - in general all the values returned in `get_repeated_events` can be lost and it's okay, as long as they're only lost for a few minutes at a time. We should definitely `log_error`, though.
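
A sketch of what swallowing here might look like (hedged; `logger` and the surrounding event-building loop are assumed from context, not shown in the diff):

	match self.onchain_tx_handler.signer.get_per_commitment_point(
		htlc.per_commitment_number, &self.onchain_tx_handler.secp_ctx,
	) {
		Ok(per_commitment_point) => {
			// ...build the repeated event with `per_commitment_point`, as before...
		},
		Err(()) => {
			// Losing this event is fine: repeated events are regenerated on the
			// next call, so we only drop it until the signer comes back.
			log_error!(logger, "Signer failed to return per-commitment point for {}; skipping event",
				htlc.per_commitment_number);
		},
	}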

-			let expected_point = self.context.holder_signer.get_per_commitment_point(INITIAL_COMMITMENT_NUMBER - msg.next_remote_commitment_number + 1, &self.context.secp_ctx);
+			let state_index = INITIAL_COMMITMENT_NUMBER - msg.next_remote_commitment_number + 1;
+			let expected_point = self.context.holder_signer.get_per_commitment_point(state_index, &self.context.secp_ctx)
+				.map_err(|_| ChannelError::Close(format!("Unable to retrieve per-commitment point for state {state_index}")))?;
Collaborator

Hmm, this is kinda awkward, we're force-closing because we failed to double-check something our peer sent us that we don't need. I guess there's kinda a broader question here - are we expecting these calls to just sometimes fail, or always fail with the expectation that they're resolving async and we'll come back to it? If it's the first, we can probably just swallow the error here (and decline to log_and_panic even if they set a high state counter, below); if it's the second, we may want to do some kind of async verification of this value.

@@ -1441,13 +1448,14 @@ impl<Signer: ChannelSigner> ChannelContext<Signer> {
 	/// our counterparty!)
 	/// The result is a transaction which we can revoke broadcastership of (ie a "local" transaction)
 	/// TODO Some magic rust shit to compile-time check this?
 	fn build_holder_transaction_keys(&self, commitment_number: u64) -> TxCreationKeys {
-		let per_commitment_point = self.holder_signer.get_per_commitment_point(commitment_number, &self.secp_ctx);
Collaborator

Can we fetch this in #[cfg(any(test, fuzzing))] and debug_assert it?
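
One way to read this suggestion (a sketch, assuming the context's cached `next_per_commitment_point` is the value normally used): keep the fallible signer call out of the production path, but cross-check the cache in test and fuzz builds:

	fn build_holder_transaction_keys(&self, commitment_number: u64) -> TxCreationKeys {
		let per_commitment_point = self.next_per_commitment_point;
		#[cfg(any(test, fuzzing))]
		{
			// Test signers are always available, so we can verify the cache.
			let expected = self.holder_signer
				.get_per_commitment_point(commitment_number, &self.secp_ctx)
				.expect("test signer should not fail");
			debug_assert_eq!(per_commitment_point, expected);
		}
		// ...derive the remaining TxCreationKeys from `per_commitment_point` as before...
	}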

self.context.next_per_commitment_point =
self.context.holder_signer.get_per_commitment_point(
self.context.cur_holder_commitment_transaction_number, &self.context.secp_ctx
).map_err(|_| ChannelError::Close("Unable to generate commitment point".to_owned()))?;
Collaborator

Right, what are we thinking on this one? Should we try to push off the next-point generation until the monitor-updating-unpaused call? Or somehow set this to None and try again later on a timer?
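
A sketch of the None-and-retry option (the helper name is hypothetical; a later commit in this PR does make the cached point an `Option<PublicKey>`):

	/// Hypothetical timer hook: re-poll the signer for the next commitment
	/// point if the last attempt failed, then resume whatever was blocked.
	fn retry_next_per_commitment_point(&mut self) {
		if self.context.next_per_commitment_point.is_none() {
			if let Ok(point) = self.context.holder_signer.get_per_commitment_point(
				self.context.cur_holder_commitment_transaction_number,
				&self.context.secp_ctx,
			) {
				self.context.next_per_commitment_point = Some(point);
				// ...re-attempt the blocked message send here...
			}
		}
	}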

@waterson force-pushed the fallible-per-commitment-point branch 2 times, most recently from b9a76e9 to 745be58 on August 25, 2023 at 18:31
@TheBlueMatt (Collaborator) commented

Let me know when this is ready for another round of review.

@waterson force-pushed the fallible-per-commitment-point branch 2 times, most recently from 552e9cb to 74da67a on August 29, 2023 at 16:07
This changes `ChannelSigner::get_per_commitment_point` to return a
`Result<PublicKey, ()>`, which means that it can fail. Similarly, it changes
`ChannelSigner::release_commitment_secret`. In order to accommodate failure at
the callsites that otherwise assumed it was infallible, we cache the next
per-commitment point each time the state advances in the `ChannelContext` and
change the callsites to instead use the cached value.
@waterson force-pushed the fallible-per-commitment-point branch from 74da67a to b886679 on August 29, 2023 at 16:26
 				next_per_commitment_point,
 				short_channel_id_alias: Some(self.context.outbound_scid_alias),
 			});
 		if let Ok(next_per_commitment_point) = self.context.holder_signer.as_ref().get_per_commitment_point(INITIAL_COMMITMENT_NUMBER - 1, &self.context.secp_ctx) {
Member

I think this means we don't respond with ChannelReady unless the signer happens to be available when a block comes in. This may, depending on luck, not happen for a long time. Unless there's another retry mechanism that I'm not thinking of?

@@ -5641,6 +5696,9 @@ impl<SP: Deref> OutboundV1Channel<SP> where SP::Target: SignerProvider {
 
 		let temporary_channel_id = ChannelId::temporary_from_entropy_source(entropy_source);
 
+		let next_per_commitment_point = holder_signer.get_per_commitment_point(INITIAL_COMMITMENT_NUMBER, &secp_ctx)
+			.map_err(|_| APIError::ChannelUnavailable { err: "Unable to generate initial commitment point".to_owned()})?;
Member

Here and elsewhere it would be good if either the specific signer error was included, or this said something about the signer not being available; otherwise it might be puzzling for the dev.

@waterson (Contributor, Author) replied

Yep... as I mentioned in the Discord chat, I haven't gotten to several of these. Right now, they're basically just modified to handle the changes to the signer's signature.
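
For illustration, the kind of message being asked for here (a sketch; the exact wording is hypothetical):

	let next_per_commitment_point = holder_signer.get_per_commitment_point(INITIAL_COMMITMENT_NUMBER, &secp_ctx)
		.map_err(|()| APIError::ChannelUnavailable {
			// Name the signer as the culprit so the failure isn't puzzling.
			err: "Signer unavailable: could not derive the initial commitment point".to_owned(),
		})?;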

@@ -5622,6 +5622,12 @@ where
 				Some(inbound_chan) => {
 					match inbound_chan.funding_created(msg, best_block, &self.signer_provider, &self.logger) {
 						Ok(res) => res,
+						Err((inbound_chan, ChannelError::Ignore(_))) => {
+							// If we get an `Ignore` error then something transient went wrong. Put the channel
+							// back into the table and bail.
Member

When does this get retried (would be good to doc here)?

@waterson (Contributor, Author) replied

I am assuming that there will be a to-be-determined mechanism that allows the remote signer's implementation to initiate a restart.

@@ -2980,6 +3029,10 @@ impl<SP: Deref> Channel<SP> where
 		self.context.holder_signer.as_ref().validate_holder_commitment(&holder_commitment_tx, commitment_stats.preimages)
 			.map_err(|_| ChannelError::Close("Failed to validate our commitment".to_owned()))?;
 
+		// Retrieve the next commitment point: if this results in a transient failure we'll unwind here
+		// and rely on retry to complete the commitment operation.
Collaborator

Similarly here, I think we should do the state update, but unset `next_per_commitment_point`. Then, when we try to send the `RevokeAndACK` message (when we need it), we can simply fail and try again when the signer is ready for us.
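
A sketch of that flow (hedged: it assumes the cached fields become `Option`s, including a `prev_commitment_secret` field a later commit in this PR adds, and that the send path can retry on `None`):

	/// Build the RevokeAndACK from cached values only; returns None (and the
	/// caller retries later) if the signer had not produced them yet.
	fn get_last_revoke_and_ack(&self) -> Option<msgs::RevokeAndACK> {
		Some(msgs::RevokeAndACK {
			channel_id: self.context.channel_id(),
			per_commitment_secret: self.context.prev_commitment_secret?,
			next_per_commitment_point: self.context.next_per_commitment_point?,
		})
	}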

@@ -2522,6 +2567,9 @@ impl<SP: Deref> Channel<SP> where
 		self.context.holder_signer.as_ref().validate_holder_commitment(&holder_commitment_tx, Vec::new())
 			.map_err(|_| ChannelError::Close("Failed to validate our commitment".to_owned()))?;
 
+		// Retrieve the next commitment point: if this results in a transient failure we'll unwind here
+		// and rely on retry to complete the funding_signed operation.
+		let (next_holder_commitment_transaction_number, next_per_commitment_point) = self.context.get_next_holder_per_commitment_point(logger)?;
Collaborator

This is a message-processing function, and once we return we throw away the message. Thus, I think we should actually do the full state update and not fail at all, but rather, if we fail to get the next point, just refuse to make progress when we try to access it (when sending our ChannelReady message, which we can pretty easily retry, I think).
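
Concretely, the ChannelReady path could then look like this (a sketch under the same Option-caching assumptions as above; helper name hypothetical):

	fn get_channel_ready(&self) -> Option<msgs::ChannelReady> {
		Some(msgs::ChannelReady {
			channel_id: self.context.channel_id(),
			// A missing cached point simply defers the message; the caller
			// retries once the signer has produced it.
			next_per_commitment_point: self.context.next_per_commitment_point?,
			short_channel_id_alias: Some(self.context.outbound_scid_alias),
		})
	}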

We'll just always retry when we get an Err.
This allows us to resume a channel from where we left it suspended when the
signer returned an error.
When accepting a new inbound channel, we need to acquire the first
per-commitment point. If the signer is not immediately available to do so,
then unwind and allow for retry later.

This changes the next_per_commitment_point to be an Option<PublicKey> which
will only be None while waiting for the first per-commitment point.
As elsewhere, when reestablishing a channel we need to get a commitment
point. Here, we need an _arbitrary_ point, so the code in the channel
context was refactored appropriately.
Rather than assume that we can release the commitment secret arbitrarily, this
releases the secret eagerly: as soon as the counterparty has committed to the
new state. The secret is then stored as part of the channel context and can be
accessed without a subsequent API request as needed.

Also adds support for serializing and deserializing the previous commitment
secret and next per-commitment point with the channel state.
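
A sketch of the eager-release step described above (assuming `release_commitment_secret` now returns `Result<[u8; 32], ()>`, a `prev_commitment_secret: Option<[u8; 32]>` field on the context, and a hypothetical `prev_commitment_number` index for the state being revoked):

	// As soon as the counterparty commits to the new state, release the
	// secret for the state being revoked and cache it. On transient signer
	// failure we cache None and retry before the RevokeAndACK is sent.
	self.context.prev_commitment_secret = self.context.holder_signer
		.release_commitment_secret(prev_commitment_number)
		.ok();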
@TheBlueMatt (Collaborator) commented

Let us know when you want another review pass on this!

@waterson (Contributor, Author) commented Sep 6, 2023

Ok, just pushed up a set of changes to cover outbound channel open, but I think I'm going to abandon this approach in favor of something like #2554.

@TheBlueMatt (Collaborator) commented

Gonna go ahead and close this, then. If you want to reopen this one as-is, just comment.

@TheBlueMatt closed this Oct 5, 2023