Allow get_per_commitment_point to fail. #2487
Conversation
I think reducing complexity by caching the next point may be a good approach, since it will reduce the spots where we need to do something clever. But see below about attempting to force-close.
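A minimal sketch of the caching idea being discussed, using simplified stand-in types (the `Signer` trait, field names, and method names here are illustrative, not the PR's actual code):

```rust
// Simplified stand-ins; the real code uses secp256k1::PublicKey and LDK's signer traits.
type PublicKey = [u8; 33];

trait Signer {
    /// Fallible: a remote signer may be temporarily unreachable.
    fn get_per_commitment_point(&self, idx: u64) -> Result<PublicKey, ()>;
}

struct ChannelContext<S: Signer> {
    signer: S,
    cur_commitment_number: u64,
    /// Cached point for the *next* state, refreshed only when the state advances,
    /// i.e. at call sites where failure can be surfaced to the caller.
    next_per_commitment_point: PublicKey,
}

impl<S: Signer> ChannelContext<S> {
    /// Advance the channel state; this is the only place the signer is consulted,
    /// so a transient signer failure unwinds here instead of panicking later.
    fn advance_holder_commitment(&mut self) -> Result<(), ()> {
        // Commitment numbers count down in LDK, hence the decrement.
        let next = self.signer.get_per_commitment_point(self.cur_commitment_number - 1)?;
        self.cur_commitment_number -= 1;
        self.next_per_commitment_point = next;
        Ok(())
    }

    /// Infallible read used everywhere else.
    fn next_point(&self) -> PublicKey {
        self.next_per_commitment_point
    }
}
```

With this shape, every other call site reads the cached value infallibly; only the state-advancing path has to handle a signer error.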
// *next* state. We recompute it each time the state changes because the state changes in places
// that might be fallible: in particular, if the commitment point must be fetched from a remote
// source, we want to ensure it happens at a point where we can actually fail somewhat gracefully;
// i.e., force-closing a channel is better than a panic!
Well, if you can't compute the commitment point, then you can't force-close either. And once you can force-close, the signer is back, and you can just continue as normal instead. So it doesn't seem like `ChannelError::Close` is the right action in this case. Crashing may actually be more reasonable until we have a retry mechanism (crash, and you'll retry when you restart).
🤦 Yes, of course... you are right about this.
At least during channel open, an error seems fine enough, better than nothing, but indeed during normal channel operation, force-closing the channel sucks. Maybe we could limit this PR to just the opening parts and handle the during-run parts separately?
Force-pushed from e86381f to 4c12822
Codecov Report
@@            Coverage Diff             @@
##             main    #2487      +/-   ##
==========================================
- Coverage   90.58%   90.47%   -0.11%
==========================================
  Files         110      110
  Lines       57422    57889     +467
==========================================
+ Hits        52017    52377     +360
- Misses       5405     5512     +107
☔ View full report in Codecov by Sentry.
@@ -2633,7 +2633,7 @@ impl<Signer: WriteableEcdsaChannelSigner> ChannelMonitorImpl<Signer> {
                per_commitment_number: htlc.per_commitment_number,
                per_commitment_point: self.onchain_tx_handler.signer.get_per_commitment_point(
                    htlc.per_commitment_number, &self.onchain_tx_handler.secp_ctx,
-               ),
+               ).unwrap(),
I think we can swallow this one - in general all the values returned in `get_repeated_events` can be lost and it's okay, as long as they're only lost for a few minutes at a time. We should definitely `log_error`, though.
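A sketch of what "swallow and `log_error`" could look like, with stand-in types in place of LDK's real HTLC/event structures and logging macros:

```rust
// Sketch only: `Htlc`, `Event`, and `log_error` stand in for LDK's real types/macros.
struct Htlc { per_commitment_number: u64 }
struct Event { per_commitment_point: [u8; 33] }

fn log_error(msg: &str) { eprintln!("ERROR: {}", msg); }

// Stand-in for the fallible signer call.
fn get_per_commitment_point(_idx: u64) -> Result<[u8; 33], ()> { Err(()) }

/// Build events, skipping (but loudly logging) any HTLC whose point the signer
/// could not produce; the event is regenerated on a later pass anyway.
fn repeated_events(htlcs: &[Htlc]) -> Vec<Event> {
    htlcs.iter().filter_map(|htlc| {
        match get_per_commitment_point(htlc.per_commitment_number) {
            Ok(point) => Some(Event { per_commitment_point: point }),
            Err(()) => {
                log_error("Signer did not return a per-commitment point; dropping event for now");
                None
            }
        }
    }).collect()
}
```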
lightning/src/ln/channel.rs (Outdated)
-           let expected_point = self.context.holder_signer.get_per_commitment_point(INITIAL_COMMITMENT_NUMBER - msg.next_remote_commitment_number + 1, &self.context.secp_ctx);
+           let state_index = INITIAL_COMMITMENT_NUMBER - msg.next_remote_commitment_number + 1;
+           let expected_point = self.context.holder_signer.get_per_commitment_point(state_index, &self.context.secp_ctx)
+               .map_err(|_| ChannelError::Close(format!("Unable to retrieve per-commitment point for state {state_index}")))?;
Hmm, this is kinda awkward, we're force-closing because we failed to double-check something our peer sent us that we don't need. I guess there's kinda a broader question here - are we expecting these calls to just sometimes fail, or always fail with the expectation that they're resolving async and we'll come back to it? If it's the first, we can probably just swallow the error here (and decline to log_and_panic even if they set a high state counter, below); if it's the second, we may want to do some kind of async verification of this value.
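If the answer is "these calls sometimes fail transiently", the peer-value check could be made best-effort instead of force-closing; a sketch with illustrative names:

```rust
type PublicKey = [u8; 33];

// Stand-in for the fallible signer call.
fn get_per_commitment_point(_state_index: u64) -> Result<PublicKey, ()> { Err(()) }

/// Best-effort sanity check of a value the peer sent: if the signer is unavailable
/// we simply skip the check (and would also skip any panic-on-mismatch path),
/// since the value is not needed for our own state.
fn check_peer_claimed_point(state_index: u64, peer_point: &PublicKey) -> Result<(), String> {
    match get_per_commitment_point(state_index) {
        Ok(expected) if &expected != peer_point =>
            Err(format!("Peer sent a bogus per-commitment point for state {}", state_index)),
        Ok(_) => Ok(()),
        Err(()) => Ok(()), // signer unavailable: decline to verify rather than force-close
    }
}
```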
lightning/src/ln/channel.rs (Outdated)
@@ -1441,13 +1448,14 @@ impl<Signer: ChannelSigner> ChannelContext<Signer> {
    /// our counterparty!)
    /// The result is a transaction which we can revoke broadcastership of (ie a "local" transaction)
    /// TODO Some magic rust shit to compile-time check this?
    fn build_holder_transaction_keys(&self, commitment_number: u64) -> TxCreationKeys {
        let per_commitment_point = self.holder_signer.get_per_commitment_point(commitment_number, &self.secp_ctx);
Can we fetch this in `#[cfg(any(test, fuzzing))]` and `debug_assert` it?
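Roughly what that test-only cross-check might look like (stand-in types; the real function returns `TxCreationKeys` and takes a secp context):

```rust
type PublicKey = [u8; 33];

struct ChannelContext {
    cached_point: PublicKey,
}

impl ChannelContext {
    // Stand-in for the real signer call; here it trivially agrees with the cache.
    fn signer_point(&self, _commitment_number: u64) -> Result<PublicKey, ()> {
        Ok(self.cached_point)
    }

    fn build_holder_transaction_keys(&self, commitment_number: u64) -> PublicKey {
        let per_commitment_point = self.cached_point;
        let _ = commitment_number; // used only in the cfg-gated check below
        // Only consult the signer in test/fuzzing builds, purely to catch cache drift;
        // release builds never pay for (or fail on) the extra signer call.
        #[cfg(any(test, fuzzing))]
        {
            if let Ok(expected) = self.signer_point(commitment_number) {
                debug_assert_eq!(expected, per_commitment_point,
                    "cached per-commitment point diverged from the signer's");
            }
        }
        per_commitment_point
    }
}
```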
lightning/src/ln/channel.rs
Outdated
self.context.next_per_commitment_point = | ||
self.context.holder_signer.get_per_commitment_point( | ||
self.context.cur_holder_commitment_transaction_number, &self.context.secp_ctx | ||
).map_err(|_| ChannelError::Close("Unable to generate commitment point".to_owned()))?; |
Right, what are we thinking on this one? Should we try to push off the next-point generation until the monitor-updating-unpaused call? Or somehow set this to `None` and try again later on a timer?
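The second option, sketched with illustrative names: hold `None` and refill it from a timer tick or a "signer is back" notification.

```rust
type PublicKey = [u8; 33];

trait Signer {
    fn get_per_commitment_point(&self, idx: u64) -> Result<PublicKey, ()>;
}

struct ChannelContext<S: Signer> {
    holder_signer: S,
    cur_holder_commitment_transaction_number: u64,
    /// `None` means "signer didn't answer yet"; messages needing it are held back.
    next_per_commitment_point: Option<PublicKey>,
}

impl<S: Signer> ChannelContext<S> {
    /// Called from a periodic timer (or a "signer is back" notification):
    /// retry filling the cache; harmless if it is already populated.
    fn maybe_refresh_next_point(&mut self) {
        if self.next_per_commitment_point.is_none() {
            self.next_per_commitment_point = self
                .holder_signer
                .get_per_commitment_point(self.cur_holder_commitment_transaction_number)
                .ok();
        }
    }
}
```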
Force-pushed from b9a76e9 to 745be58
Let me know when this is ready for another round of review.
Force-pushed from 552e9cb to 74da67a
This changes `ChannelSigner::get_per_commitment_point` to return a `Result<PublicKey, ()>`, which means that it can fail. Similarly, it changes `ChannelSigner::release_commitment_secret`. In order to accommodate failure at the callsites that otherwise assumed it was infallible, we cache the next per-commitment point each time the state advances in the `ChannelContext` and change the callsites to instead use the cached value.
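In trait form, the change described in this commit message amounts to roughly the following (a sketch, not the exact LDK trait; the real methods also take a `Secp256k1` context argument):

```rust
// Stand-in types; LDK uses secp256k1::PublicKey and 32-byte commitment secrets.
type PublicKey = [u8; 33];
type CommitmentSecret = [u8; 32];

/// Both methods become fallible so that a remote or asynchronous signer can
/// decline to answer immediately; callers either use a value cached at the
/// last state transition or arrange to retry later.
trait ChannelSignerSketch {
    fn get_per_commitment_point(&self, idx: u64) -> Result<PublicKey, ()>;
    fn release_commitment_secret(&self, idx: u64) -> Result<CommitmentSecret, ()>;
}
```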
Force-pushed from 74da67a to b886679
            next_per_commitment_point,
            short_channel_id_alias: Some(self.context.outbound_scid_alias),
        });
        if let Ok(next_per_commitment_point) = self.context.holder_signer.as_ref().get_per_commitment_point(INITIAL_COMMITMENT_NUMBER - 1, &self.context.secp_ctx) {
I think this means we don't respond with `ChannelReady` unless the signer happens to be available when a block comes in. This may, depending on luck, not happen for a long time, unless there's another retry mechanism that I'm not thinking of?
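One possible shape for such a retry mechanism, sketched with invented names (a pending flag plus a hook that can be driven by something other than block connection):

```rust
type PublicKey = [u8; 33];

struct ChannelReadyMsg { next_per_commitment_point: PublicKey }

struct Channel {
    signer_available: bool,      // stand-in for "the signer answered"
    channel_ready_pending: bool, // remember that we owe the peer a ChannelReady
}

impl Channel {
    // Stand-in for the fallible per-commitment-point fetch.
    fn try_get_point(&self) -> Result<PublicKey, ()> {
        if self.signer_available { Ok([2u8; 33]) } else { Err(()) }
    }

    /// Called on block connection *and* from an explicit "signer is back" hook,
    /// so sending ChannelReady does not depend on the signer racing a new block.
    fn maybe_send_channel_ready(&mut self) -> Option<ChannelReadyMsg> {
        if !self.channel_ready_pending { return None; }
        match self.try_get_point() {
            Ok(point) => {
                self.channel_ready_pending = false;
                Some(ChannelReadyMsg { next_per_commitment_point: point })
            }
            Err(()) => None, // stay pending; the hook will call us again
        }
    }
}
```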
lightning/src/ln/channel.rs (Outdated)
@@ -5641,6 +5696,9 @@ impl<SP: Deref> OutboundV1Channel<SP> where SP::Target: SignerProvider {

        let temporary_channel_id = ChannelId::temporary_from_entropy_source(entropy_source);

+       let next_per_commitment_point = holder_signer.get_per_commitment_point(INITIAL_COMMITMENT_NUMBER, &secp_ctx)
+           .map_err(|_| APIError::ChannelUnavailable { err: "Unable to generate initial commitment point".to_owned()})?;
Here and elsewhere it would be good if either the specific signer error was included, or this said something about the signer not being available - otherwise it might be puzzling for the dev.
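For the error-message point, something along these lines (a sketch; `SignerError` and the message text are invented for illustration):

```rust
#[derive(Debug)]
enum SignerError { Unavailable, Rejected }

// Stand-in for APIError::ChannelUnavailable { err }.
struct ApiErrorSketch { err: String }

// Stand-in for the fallible initial-point fetch.
fn initial_point(signer_ok: bool) -> Result<[u8; 33], SignerError> {
    if signer_ok { Ok([2u8; 33]) } else { Err(SignerError::Unavailable) }
}

fn open_channel_sketch(signer_ok: bool) -> Result<[u8; 33], ApiErrorSketch> {
    // Surface *why* the point could not be produced, so the developer sees at a
    // glance that the signer (not the channel) is the problem.
    initial_point(signer_ok).map_err(|e| ApiErrorSketch {
        err: format!("Signer could not produce the initial commitment point: {:?}", e),
    })
}
```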
Yep... as I mentioned in the Discord chat I haven't gotten to several of these. Right now, they're basically just modified to handle the changes to the signer's signature.
@@ -5622,6 +5622,12 @@ where
            Some(inbound_chan) => {
                match inbound_chan.funding_created(msg, best_block, &self.signer_provider, &self.logger) {
                    Ok(res) => res,
+                   Err((inbound_chan, ChannelError::Ignore(_))) => {
+                       // If we get an `Ignore` error then something transient went wrong. Put the channel
+                       // back into the table and bail.
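The pattern in this hunk, reduced to a self-contained sketch (a plain `HashMap` stands in for the channel table, and the `Ignore`/`Close` split mirrors transient vs. fatal errors):

```rust
use std::collections::HashMap;

enum ChannelError { Ignore(String), Close(String) }
struct InboundChannel { id: u64 }

// Stand-in: pretend the signer was unavailable, a transient condition.
fn funding_created(chan: InboundChannel) -> Result<InboundChannel, (InboundChannel, ChannelError)> {
    Err((chan, ChannelError::Ignore("signer unavailable".to_owned())))
}

fn handle_funding_created(table: &mut HashMap<u64, InboundChannel>, chan: InboundChannel) {
    match funding_created(chan) {
        Ok(chan) => { table.insert(chan.id, chan); /* continue as normal */ }
        Err((chan, ChannelError::Ignore(_))) => {
            // Transient: keep the channel around so a later retry can finish the handshake.
            table.insert(chan.id, chan);
        }
        Err((_, ChannelError::Close(reason))) => {
            // Fatal: drop the channel and (in the real code) send an error to the peer.
            eprintln!("closing channel: {}", reason);
        }
    }
}
```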
When does this get retried (would be good to doc here)?
I am assuming that there will be a to-be-determined mechanism that allows the remote signer's implementation to initiate a restart.
lightning/src/ln/channel.rs (Outdated)
@@ -2980,6 +3029,10 @@ impl<SP: Deref> Channel<SP> where
        self.context.holder_signer.as_ref().validate_holder_commitment(&holder_commitment_tx, commitment_stats.preimages)
            .map_err(|_| ChannelError::Close("Failed to validate our commitment".to_owned()))?;

+       // Retrieve the next commitment point: if this results in a transient failure we'll unwind here
+       // and rely on retry to complete the commitment_operation.
Similarly here, I think we should do the state update, but unset `next_per_commitment_point`. Then, when we try to send the RevokeAndACK message (when we need it), we can simply fail and try again when the signer is ready for us.
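A sketch of "do the state update, unset the point, and only block when building the RevokeAndACK" (names are illustrative):

```rust
type PublicKey = [u8; 33];

struct RevokeAndACK { next_per_commitment_point: PublicKey }

struct Channel {
    next_per_commitment_point: Option<PublicKey>,
    raa_owed: bool, // we advanced state and owe the peer a RevokeAndACK
}

impl Channel {
    /// Commitment processing always succeeds; only the cached point may be missing.
    fn on_commitment_signed(&mut self, point: Result<PublicKey, ()>) {
        self.raa_owed = true;
        self.next_per_commitment_point = point.ok();
    }

    /// Building the RAA is where we block: return None now, retry when the signer is ready.
    fn maybe_build_raa(&mut self) -> Option<RevokeAndACK> {
        if !self.raa_owed { return None; }
        let point = self.next_per_commitment_point?;
        self.raa_owed = false;
        Some(RevokeAndACK { next_per_commitment_point: point })
    }
}
```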
@@ -2522,6 +2567,9 @@ impl<SP: Deref> Channel<SP> where
        self.context.holder_signer.as_ref().validate_holder_commitment(&holder_commitment_tx, Vec::new())
            .map_err(|_| ChannelError::Close("Failed to validate our commitment".to_owned()))?;

+       // Retrieve the next commitment point: if this results in a transient failure we'll unwind here
+       // and rely on retry to complete the funding_signed operation.
+       let (next_holder_commitment_transaction_number, next_per_commitment_point) = self.context.get_next_holder_per_commitment_point(logger)?;
This is a message processing function, and once we return we throw away the message. Thus, I think we should actually do the full state update, and not fail at all but rather if we fail to get the next point we just refuse to make progress when we try to access it (when sending our `ChannelReady` message, which we can pretty easily retry I think).
We'll just always retry when we get an Err.
This allows us to resume a channel from where we left it suspended when the signer returned an error.
When accepting a new inbound channel, we need to acquire the first per-commitment point. If the signer is not immediately available to do so, then unwind and allow for retry later. This changes the next_per_commitment_point to be an Option<PublicKey> which will only be None while waiting for the first per-commitment point.
As elsewhere, when reestablishing a channel we need to get a commitment point. Here, we need an _arbitrary_ point, so the relevant code in the channel context has been refactored appropriately.
Rather than assume that we can release the commitment secret arbitrarily, this releases the secret eagerly: as soon as the counterparty has committed to the new state. The secret is then stored as part of the channel context and can be accessed without a subsequent API request as needed. Also adds support for serializing and deserializing the previous commitment secret and next per-commitment point with the channel state.
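A sketch of the eager-release idea from this commit, with stand-in types (the real change also serializes these cached fields with the channel state):

```rust
type PublicKey = [u8; 33];
type Secret = [u8; 32];

trait Signer {
    fn release_commitment_secret(&self, idx: u64) -> Result<Secret, ()>;
}

struct ChannelContext<S: Signer> {
    signer: S,
    /// Released as soon as the counterparty commits to the new state, then kept
    /// here so later message building never has to call the signer again.
    prev_commitment_secret: Option<Secret>,
    next_per_commitment_point: Option<PublicKey>,
}

impl<S: Signer> ChannelContext<S> {
    fn on_counterparty_committed(&mut self, old_state_idx: u64) -> Result<(), ()> {
        // Eager: fetch and cache the secret now, at a point where we can still
        // unwind gracefully if the signer is unavailable.
        self.prev_commitment_secret = Some(self.signer.release_commitment_secret(old_state_idx)?);
        Ok(())
    }
}
```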
Let us know when you want another review pass on this!
Ok, just pushed up a set of changes to cover outbound channel open, but I think I'm going to abandon this approach in favor of something like #2554.
Gonna go ahead and close this, then. If you want to reopen this one as-is just comment.
This changes `ChannelSigner::get_per_commitment_point` to return a `Result<PublicKey, ()>`, which means that it can fail. In order to accommodate failure at the callsites that otherwise assumed it was infallible, we cache the next per-commitment point each time the state advances in the `ChannelContext` and change the callsites to instead use the cached value.