
Allow get_per_commitment_point to fail. #2487


Closed

Conversation

@waterson (Contributor)

This changes `ChannelSigner::get_per_commitment_point` to return a `Result<PublicKey, ()>`, which means that it can fail. In order to accommodate failure at the callsites that otherwise assumed it was infallible, we cache the next per-commitment point each time the state advances in the `ChannelContext` and change the callsites to use the cached value instead.
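
For context, a minimal sketch of the signature change (the trait lives in `lightning/src/sign/mod.rs`; surrounding trait items elided, and the exact doc wording here is illustrative):

	use bitcoin::secp256k1::{All, PublicKey, Secp256k1};

	pub trait ChannelSigner {
		/// Gets the per-commitment point for a specific commitment number.
		/// After this PR it returns a `Result`, so a remote or asynchronous
		/// signer can report that the point is not yet available.
		fn get_per_commitment_point(
			&self, idx: u64, secp_ctx: &Secp256k1<All>,
		) -> Result<PublicKey, ()>;

		// ...remaining trait methods unchanged...
	}

`ChannelContext` then carries a cached `next_per_commitment_point`, refreshed whenever the channel state advances, so most callsites read the cache instead of calling the now-fallible signer.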

@devrandom (Member) left a comment

I think reducing complexity by caching the next point may be a good approach, since it will reduce the spots where we need to do something clever. But see below about attempting to force-close.

// *next* state. We recompute it each time the state changes because the state changes in places
// that might be fallible: in particular, if the commitment point must be fetched from a remote
// source, we want to ensure it happens at a point where we can actually fail somewhat gracefully;
// i.e., force-closing a channel is better than a panic!
@devrandom (Member) commented Aug 10, 2023

Well, if you can't compute the commitment point, then you can't force-close either. And once you can force-close, the signer is back, and you can just continue as normal instead. So it doesn't seem like `ChannelError::Close` is the right action in this case. Crashing may actually be more reasonable until we have a retry mechanism (crash, and you'll retry when you restart).

@waterson (Contributor, Author) replied

🤦 Yes, of course... you are right about this.

@TheBlueMatt (Collaborator) commented Aug 14, 2023

At least during channel open, erroring seems fine enough, better than nothing, but indeed during normal channel operation, force-closing the channel sucks. Maybe we could limit this PR to just the opening parts and handle the during-run parts separately?

@waterson force-pushed the fallible-per-commitment-point branch from e86381f to 4c12822 on August 18, 2023 at 13:44
@codecov-commenter commented Aug 18, 2023

Codecov Report

Patch coverage: 81.98% and project coverage change: -0.11% ⚠️

Comparison is base (1f2ee21) 90.58% compared to head (b13e4e8) 90.47%.
Report is 17 commits behind head on main.


Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2487      +/-   ##
==========================================
- Coverage   90.58%   90.47%   -0.11%     
==========================================
  Files         110      110              
  Lines       57422    57889     +467     
==========================================
+ Hits        52017    52377     +360     
- Misses       5405     5512     +107     
Files Changed Coverage Δ
lightning/src/util/test_utils.rs 73.61% <ø> (ø)
lightning/src/ln/channelmanager.rs 86.31% <57.89%> (-0.89%) ⬇️
lightning/src/chain/onchaintx.rs 90.78% <66.66%> (+0.37%) ⬆️
lightning/src/util/test_channel_signer.rs 84.17% <76.92%> (-2.93%) ⬇️
lightning/src/ln/functional_test_utils.rs 89.03% <78.57%> (+0.04%) ⬆️
lightning/src/ln/channel.rs 89.37% <83.33%> (-0.47%) ⬇️
lightning/src/chain/channelmonitor.rs 94.64% <100.00%> (ø)
lightning/src/ln/functional_tests.rs 98.15% <100.00%> (-0.02%) ⬇️
lightning/src/sign/mod.rs 81.47% <100.00%> (ø)

... and 12 files with indirect coverage changes


@@ -2633,7 +2633,7 @@ impl<Signer: WriteableEcdsaChannelSigner> ChannelMonitorImpl<Signer> {
 				per_commitment_number: htlc.per_commitment_number,
 				per_commitment_point: self.onchain_tx_handler.signer.get_per_commitment_point(
 					htlc.per_commitment_number, &self.onchain_tx_handler.secp_ctx,
-				),
+				).unwrap(),
Collaborator

I think we can swallow this one - in general all the values returned in `get_repeated_events` can be lost and it's okay, as long as they're only lost for a few minutes at a time. We should definitely `log_error`, though.
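
A sketch of what swallowing here might look like (hedged; `logger` and the surrounding event-building loop are assumed from context, not shown in the diff):

	match self.onchain_tx_handler.signer.get_per_commitment_point(
		htlc.per_commitment_number, &self.onchain_tx_handler.secp_ctx,
	) {
		Ok(per_commitment_point) => {
			// ...build the repeated event with `per_commitment_point`, as before...
		},
		Err(()) => {
			// Losing this event is fine: repeated events are regenerated on the
			// next call, so we only drop it until the signer comes back.
			log_error!(logger, "Signer failed to return per-commitment point for {}; skipping event",
				htlc.per_commitment_number);
		},
	}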

-			let expected_point = self.context.holder_signer.get_per_commitment_point(INITIAL_COMMITMENT_NUMBER - msg.next_remote_commitment_number + 1, &self.context.secp_ctx);
+			let state_index = INITIAL_COMMITMENT_NUMBER - msg.next_remote_commitment_number + 1;
+			let expected_point = self.context.holder_signer.get_per_commitment_point(state_index, &self.context.secp_ctx)
+				.map_err(|_| ChannelError::Close(format!("Unable to retrieve per-commitment point for state {state_index}")))?;
Collaborator

Hmm, this is kinda awkward, we're force-closing because we failed to double-check something our peer sent us that we don't need. I guess there's kinda a broader question here - are we expecting these calls to just sometimes fail, or always fail with the expectation that they're resolving async and we'll come back to it? If it's the first, we can probably just swallow the error here (and decline to log_and_panic even if they set a high state counter, below); if it's the second, we may want to do some kind of async verification of this value.

@@ -1441,13 +1448,14 @@ impl<Signer: ChannelSigner> ChannelContext<Signer> {
 	/// our counterparty!)
 	/// The result is a transaction which we can revoke broadcastership of (ie a "local" transaction)
 	/// TODO Some magic rust shit to compile-time check this?
 	fn build_holder_transaction_keys(&self, commitment_number: u64) -> TxCreationKeys {
-		let per_commitment_point = self.holder_signer.get_per_commitment_point(commitment_number, &self.secp_ctx);
Collaborator

Can we fetch this in #[cfg(any(test, fuzzing))] and debug_assert it?
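
One way to read this suggestion (a sketch, assuming the context's cached `next_per_commitment_point` is the value normally used): keep the fallible signer call out of the production path, but cross-check the cache in test and fuzz builds:

	fn build_holder_transaction_keys(&self, commitment_number: u64) -> TxCreationKeys {
		let per_commitment_point = self.next_per_commitment_point;
		#[cfg(any(test, fuzzing))]
		{
			// Test signers are always available, so we can verify the cache.
			let expected = self.holder_signer
				.get_per_commitment_point(commitment_number, &self.secp_ctx)
				.expect("test signer should not fail");
			debug_assert_eq!(per_commitment_point, expected);
		}
		// ...derive the remaining TxCreationKeys from `per_commitment_point` as before...
	}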

self.context.next_per_commitment_point =
self.context.holder_signer.get_per_commitment_point(
self.context.cur_holder_commitment_transaction_number, &self.context.secp_ctx
).map_err(|_| ChannelError::Close("Unable to generate commitment point".to_owned()))?;
Collaborator

Right, what are we thinking on this one? Should we try to push off the next-point generation until the monitor-updating-unpaused call? Or somehow set this to None and try again later on a timer?
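
A sketch of the None-and-retry option (the helper name is hypothetical; a later commit in this PR does make the cached point an `Option<PublicKey>`):

	/// Hypothetical timer hook: re-poll the signer for the next commitment
	/// point if the last attempt failed, then resume whatever was blocked.
	fn retry_next_per_commitment_point(&mut self) {
		if self.context.next_per_commitment_point.is_none() {
			if let Ok(point) = self.context.holder_signer.get_per_commitment_point(
				self.context.cur_holder_commitment_transaction_number,
				&self.context.secp_ctx,
			) {
				self.context.next_per_commitment_point = Some(point);
				// ...re-attempt the blocked message send here...
			}
		}
	}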

@waterson force-pushed the fallible-per-commitment-point branch 2 times, most recently from b9a76e9 to 745be58 on August 25, 2023 at 18:31
@TheBlueMatt (Collaborator) commented

Let me know when this is ready for another round of review.

@waterson force-pushed the fallible-per-commitment-point branch 2 times, most recently from 552e9cb to 74da67a on August 29, 2023 at 16:07
This changes `ChannelSigner::get_per_commitment_point` to return a
`Result<PublicKey, ()>`, which means that it can fail. Similarly, it changes
`ChannelSigner::release_commitment_secret`. In order to accommodate failure at
the callsites that otherwise assumed it was infallible, we cache the next
per-commitment point each time the state advances in the `ChannelContext` and
change the callsites to instead use the cached value.
@waterson force-pushed the fallible-per-commitment-point branch from 74da67a to b886679 on August 29, 2023 at 16:26
 				next_per_commitment_point,
 				short_channel_id_alias: Some(self.context.outbound_scid_alias),
 			});
 		if let Ok(next_per_commitment_point) = self.context.holder_signer.as_ref().get_per_commitment_point(INITIAL_COMMITMENT_NUMBER - 1, &self.context.secp_ctx) {
Member

I think this means we don't respond with ChannelReady unless the signer happens to be available when a block comes in. This may, depending on luck, not happen for a long time. Unless there's another retry mechanism that I'm not thinking of?

@@ -5641,6 +5696,9 @@ impl<SP: Deref> OutboundV1Channel<SP> where SP::Target: SignerProvider {
 
 		let temporary_channel_id = ChannelId::temporary_from_entropy_source(entropy_source);
 
+		let next_per_commitment_point = holder_signer.get_per_commitment_point(INITIAL_COMMITMENT_NUMBER, &secp_ctx)
+			.map_err(|_| APIError::ChannelUnavailable { err: "Unable to generate initial commitment point".to_owned()})?;
Member

Here and elsewhere it would be good if either the specific signer error was included, or this said something about the signer not being available; otherwise it might be puzzling for the dev.

@waterson (Contributor, Author) replied

Yep... as I mentioned in the Discord chat, I haven't gotten to several of these. Right now, they're basically just modified to handle the changes to the signer's signature.
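
For illustration, the kind of message being asked for here (a sketch; the exact wording is hypothetical):

	let next_per_commitment_point = holder_signer.get_per_commitment_point(INITIAL_COMMITMENT_NUMBER, &secp_ctx)
		.map_err(|()| APIError::ChannelUnavailable {
			// Name the signer as the culprit so the failure isn't puzzling.
			err: "Signer unavailable: could not derive the initial commitment point".to_owned(),
		})?;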

@@ -5622,6 +5622,12 @@ where
 				Some(inbound_chan) => {
 					match inbound_chan.funding_created(msg, best_block, &self.signer_provider, &self.logger) {
 						Ok(res) => res,
+						Err((inbound_chan, ChannelError::Ignore(_))) => {
+							// If we get an `Ignore` error then something transient went wrong. Put the channel
+							// back into the table and bail.
Member

When does this get retried (would be good to doc here)?

@waterson (Contributor, Author) replied

I am assuming that there will be a to-be-determined mechanism that allows the remote signer's implementation to initiate a restart.

@@ -2980,6 +3029,10 @@ impl<SP: Deref> Channel<SP> where
 		self.context.holder_signer.as_ref().validate_holder_commitment(&holder_commitment_tx, commitment_stats.preimages)
 			.map_err(|_| ChannelError::Close("Failed to validate our commitment".to_owned()))?;
 
+		// Retrieve the next commitment point: if this results in a transient failure we'll unwind here
+		// and rely on retry to complete the commitment operation.
Collaborator

Similarly here, I think we should do the state update, but unset `next_per_commitment_point`. Then, when we try to send the `RevokeAndACK` message (when we need it), we can simply fail and try again when the signer is ready for us.
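
A sketch of that flow (hedged: it assumes the cached fields become `Option`s, including a `prev_commitment_secret` field a later commit in this PR adds, and that the send path can retry on `None`):

	/// Build the RevokeAndACK from cached values only; returns None (and the
	/// caller retries later) if the signer had not produced them yet.
	fn get_last_revoke_and_ack(&self) -> Option<msgs::RevokeAndACK> {
		Some(msgs::RevokeAndACK {
			channel_id: self.context.channel_id(),
			per_commitment_secret: self.context.prev_commitment_secret?,
			next_per_commitment_point: self.context.next_per_commitment_point?,
		})
	}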

@@ -2522,6 +2567,9 @@ impl<SP: Deref> Channel<SP> where
 		self.context.holder_signer.as_ref().validate_holder_commitment(&holder_commitment_tx, Vec::new())
 			.map_err(|_| ChannelError::Close("Failed to validate our commitment".to_owned()))?;
 
+		// Retrieve the next commitment point: if this results in a transient failure we'll unwind here
+		// and rely on retry to complete the funding_signed operation.
+		let (next_holder_commitment_transaction_number, next_per_commitment_point) = self.context.get_next_holder_per_commitment_point(logger)?;
Collaborator

This is a message-processing function, and once we return we throw away the message. Thus, I think we should actually do the full state update and not fail at all, but rather, if we fail to get the next point, just refuse to make progress when we try to access it (when sending our ChannelReady message, which we can pretty easily retry, I think).
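
Concretely, the ChannelReady path could then look like this (a sketch under the same Option-caching assumptions as above; helper name hypothetical):

	fn get_channel_ready(&self) -> Option<msgs::ChannelReady> {
		Some(msgs::ChannelReady {
			channel_id: self.context.channel_id(),
			// A missing cached point simply defers the message; the caller
			// retries once the signer has produced it.
			next_per_commitment_point: self.context.next_per_commitment_point?,
			short_channel_id_alias: Some(self.context.outbound_scid_alias),
		})
	}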

We'll just always retry when we get an Err.
This allows us to resume a channel from where we left it suspended when the
signer returned an error.
When accepting a new inbound channel, we need to acquire the first
per-commitment point. If the signer is not immediately available to do so,
then unwind and allow for retry later.

This changes the next_per_commitment_point to be an Option<PublicKey> which
will only be None while waiting for the first per-commitment point.
As elsewhere, when reestablishing a channel we need to get a commitment
point. Here, we need an _arbitrary_ point, so the code in the channel
context was refactored appropriately.
Rather than assume that we can release the commitment secret arbitrarily, this
releases the secret eagerly: as soon as the counterparty has committed to the
new state. The secret is then stored as part of the channel context and can be
accessed without a subsequent API request as needed.

Also adds support for serializing and deserializing the previous commitment
secret and next per-commitment point with the channel state.
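
A sketch of the eager-release step described above (assuming `release_commitment_secret` now returns `Result<[u8; 32], ()>`, a `prev_commitment_secret: Option<[u8; 32]>` field on the context, and a hypothetical `prev_commitment_number` index for the state being revoked):

	// As soon as the counterparty commits to the new state, release the
	// secret for the state being revoked and cache it. On transient signer
	// failure we cache None and retry before the RevokeAndACK is sent.
	self.context.prev_commitment_secret = self.context.holder_signer
		.release_commitment_secret(prev_commitment_number)
		.ok();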
@TheBlueMatt (Collaborator) commented

Let us know when you want another review pass on this!

@waterson (Contributor, Author) commented Sep 6, 2023

Ok, just pushed up a set of changes to cover outbound channel open, but I think I'm going to abandon this approach in favor of something like #2554.

@TheBlueMatt (Collaborator) commented

Gonna go ahead and close this, then. If you want to reopen this one as-is, just comment.

@TheBlueMatt closed this Oct 5, 2023