BaseSign + ChannelMonitor temporary error support #1538

Closed

Conversation

devrandom
Member

Early draft for discussion.

@devrandom force-pushed the 2022-06-sign-error branch from 164f6c8 to 018dd52 on June 13, 2022 12:33
@TheBlueMatt
Collaborator

Hmm, yea, I don't think this is really the right direction to go. Instead of passing errors around everywhere, I'd really strongly prefer if we first aggressively split all the ChannelMonitor logic into two passes: first process all transactions, build all packages, etc., and only once that's done, in a single function, sign all the packages and actually send the transactions. That way the first pass is entirely infallible, and only the second relies on the signer being available, so it can be trivially retried. E.g., as a first step, split OnchainTxHandler::generate_claim_tx into two?
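
For concreteness, a minimal sketch of that two-pass shape, using made-up stand-in types (ClaimRequest, SignedClaim, SignerError) rather than the real OnchainTxHandler internals:

    struct ClaimRequest { outpoints: Vec<String>, feerate_sat_per_kw: u64 }
    struct SignedClaim { raw_tx: Vec<u8> }
    enum SignerError { Temporary, Internal }

    struct OnchainTxHandler { pending_claims: Vec<ClaimRequest> }

    impl OnchainTxHandler {
        /// Pass 1: infallible. Walk the confirmed transactions, aggregate the
        /// outputs we can claim into requests, and queue them. No signer needed.
        fn build_claim_requests(&mut self, claimable_outpoints: Vec<String>, feerate_sat_per_kw: u64) {
            if !claimable_outpoints.is_empty() {
                self.pending_claims.push(ClaimRequest { outpoints: claimable_outpoints, feerate_sat_per_kw });
            }
        }

        /// Pass 2: fallible. Ask the signer to finalize every queued request.
        /// On SignerError::Temporary the queue is left intact, so this pass can
        /// simply be re-run later without repeating pass 1.
        fn sign_queued_claims<F>(&mut self, mut sign: F) -> Result<Vec<SignedClaim>, SignerError>
        where
            F: FnMut(&ClaimRequest) -> Result<SignedClaim, SignerError>,
        {
            let mut signed = Vec::with_capacity(self.pending_claims.len());
            for req in &self.pending_claims {
                signed.push(sign(req)?);
            }
            self.pending_claims.clear();
            Ok(signed)
        }
    }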

@devrandom
Copy link
Member Author

OK, I see - so we build the packages, not the transactions; that makes more sense. Let me try that.

@codecov-commenter

Codecov Report

Merging #1538 (0d3e990) into main (716539e) will increase coverage by 0.33%.
The diff coverage is 87.90%.

❗ Current head 0d3e990 differs from pull request most recent head dffb6c9. Consider uploading reports for the commit dffb6c9 to get more accurate results

@@            Coverage Diff             @@
##             main    #1538      +/-   ##
==========================================
+ Coverage   90.94%   91.27%   +0.33%     
==========================================
  Files          80       80              
  Lines       43469    45380    +1911     
  Branches    43469    45380    +1911     
==========================================
+ Hits        39533    41422    +1889     
- Misses       3936     3958      +22     
Impacted Files Coverage Δ
lightning/src/chain/channelmonitor.rs 91.02% <82.35%> (-0.23%) ⬇️
lightning/src/chain/onchaintx.rs 92.79% <88.13%> (-1.19%) ⬇️
lightning/src/chain/package.rs 92.86% <91.66%> (ø)
lightning/src/chain/keysinterface.rs 93.41% <92.30%> (-0.18%) ⬇️
lightning/src/ln/channelmanager.rs 84.32% <100.00%> (ø)
lightning/src/ln/functional_tests.rs 97.03% <100.00%> (-0.11%) ⬇️
lightning/src/util/enforcing_trait_impls.rs 89.47% <100.00%> (ø)
lightning/src/ln/features.rs 99.28% <0.00%> (-0.19%) ⬇️
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 716539e...dffb6c9.

let txid = tx.txid();
for k in req.outpoints() {
log_info!(logger, "Registering claiming request for {}:{}", k.txid, k.vout);
// XXX this cannot be reordered to later than previous block because of data dependency - tests are failing
Member Author

Tests start failing if this finalization loop is moved to this point or later - there seems to be a data dependency on the previous block of code. Not sure if this dependency is spurious or real.

Member Author

Hmm... so it looks like the original code is trying to claim an outpoint in two different ways - and that's even alluded to in the failing test (test_static_spendable_outputs_justice_tx_revoked_htlc_success_tx). Moving the loop as I did in the current PR causes one of them (originally an entry in bump_candidates) to not be created. It's not clear what the right strategy is, and I think I'm missing some context.

Contributor

The dependency is real, and it's good to see the test caught this. When a revoked commitment confirms, the broadcasting party may still be able to sweep a pending HTLC, but they still incur the additional to_self delay, which gives the remote party time to claim through the revocation path on the new output from the HTLC success/timeout transaction. We need to keep tracking the pending claim request so that we can detect such a case, split the now-spent output from it, and start tracking the new one.

As for the end goal of this PR, it seems like we should split PackageTemplate::finalize_package into two: create_package and finalize/sign_package, where the former takes care of crafting the transaction(s) and the latter signs them.
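
Roughly, that split might look like this (simplified stand-in types, not the real PackageTemplate signatures):

    struct UnsignedPackage { tx_template: Vec<u8> }
    struct SignedPackage { tx: Vec<u8> }
    enum SignError { Temporary, Internal }

    struct PackageTemplate;

    impl PackageTemplate {
        // Infallible: craft the claiming transaction(s) without touching the signer.
        fn create_package(&self) -> UnsignedPackage {
            UnsignedPackage { tx_template: Vec::new() }
        }

        // Fallible: produce the signatures. A Temporary error leaves the
        // UnsignedPackage untouched so the signing step can be retried later.
        fn sign_package(&self, pkg: &UnsignedPackage) -> Result<SignedPackage, SignError> {
            Ok(SignedPackage { tx: pkg.tx_template.clone() })
        }
    }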


From what I can tell from tweaking test_static_spendable_outputs_justice_tx_revoked_htlc_success_tx, it sounds to me like the second, correct justice transaction is not generated.

What I believe is happening, without looking that much into the details, is that now that we no longer push the request into pending_claim_requests, this map is not updated to reflect the commitment transaction being spent by our counterparty, which is what the loop above does. Therefore we forget to generate the revoked_htlc_txn as a bump_candidates entry.

A solution could be to put this "conflicting-outpoints-reconciliation" loop before generate_claim_tx as a preprocessing step, and therefore ensure that all the claim requests we have are sane. Doing so, you might have a few tests failing, which I would say is just because they exercise pathological cases like test_static_...success_tx.

Temporary,
/// A signer internal error
Internal
}
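
For context, the hunk above presumably completes to something like the enum below; the Temporary doc comment and the illustrative BaseSign change are guesses, not the actual diff:

    /// Errors a signer can return.
    pub enum SignError {
        /// The signer is temporarily unavailable; the operation can be retried.
        Temporary,
        /// A signer internal error
        Internal,
    }

    // Illustrative only: BaseSign methods would then return SignError instead of
    // the bare () they use today, e.g. roughly:
    // fn sign_justice_revoked_output(&self, ...) -> Result<Signature, SignError>;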

Just to confirm that we all agree on the direction: in case of a SignError::Temporary yielded by finalize_claim_tx, we return a PendingEvent::SignTransaction to ChannelMonitor, queued on a new internal buffer?

How would we dry up the queue? Inside block_confirmed or with a new monitor_timer_tick_occured()? I lean towards the latter, as elapsing blocks are a bit precious in Lightning.

If we get a SignError::Internal, maybe we should flow it back up to ChannelManager with a ChannelMonitor::TemporaryFailure and stop accepting HTLCs we won't be able to claim anyway?

Member Author

Just to confirm that we all agree on the direction: in case of a SignError::Temporary yielded by finalize_claim_tx, we return a PendingEvent::SignTransaction to ChannelMonitor, queued on a new internal buffer?

Initially, I thought we would just call ChannelMonitor.update_claims_view with empty inputs and the current logic would have it retry the requests in self.pending_claim_requests. It seemed like the least invasive approach.

However, it seems that @TheBlueMatt prefers if we just retry the signing/broadcasting part, and not the rest of the logic. Maybe he can clarify how he envisions the triggering of the retry.

As to implementation details - there are already three buffers in ChannelMonitorImpl:

	pending_monitor_events: Vec<MonitorEvent>,
	pending_events: Vec<Event>,
	onchain_events_awaiting_threshold_conf: Vec<OnchainEventEntry>,

I'm wondering if any of these could be used for this purpose.

How would we dry up the queue? Inside block_confirmed or with a new monitor_timer_tick_occured()? I lean towards the latter, as elapsing blocks are a bit precious in Lightning.

The signer may connect at any time, which may not overlap with the block arrival time. So it's probably best that the BackgroundProcessor tick gets triggered by the signer connecting so we can catch up on all housekeeping tasks. ChannelMonitor already implements EventsProvider::process_pending_events, so perhaps it makes sense to add the code there. Or it may be cleaner to add another call as you say.
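
For illustration, the user-side trigger could be as simple as the sketch below, assuming a hypothetical retry_pending_signing entry point (no such method exists today):

    // Hypothetical stand-in; the real ChannelMonitor exposes no such method yet.
    struct ChannelMonitorHandle;

    impl ChannelMonitorHandle {
        fn retry_pending_signing(&self) {
            // Re-run only the signing/broadcasting pass for claims that
            // previously failed with a temporary signer error.
        }
    }

    // Called by the signer transport when it reconnects, independently of block
    // arrival, so temporarily-failed claims are retried promptly.
    fn on_signer_reconnected(monitors: &[ChannelMonitorHandle]) {
        for monitor in monitors {
            monitor.retry_pending_signing();
        }
    }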

If we get a SignError::Internal, maybe we should flow it back up to ChannelManager with a ChannelMonitor::TemporaryFailure and stop accepting HTLCs we won't be able to claim anyway?

The Internal error is just a renaming of the existing Result<_, ()> error returns from BaseSign. It seems like we currently almost always panic, or we should, because it's caused by a key derivation failing, which has an infinitesimal chance of happening and probably indicates a hardware / RNG problem. In any case, it's out of scope of the current effort to change the handling of this case.

Collaborator

However, it seems that @TheBlueMatt prefers if we just retry the signing/broadcasting part, and not the rest of the logic. Maybe he can clarify how he envisions the triggering of the retry.

I'd imagine we'd leave that up to the user or maybe even do it on every block (though that will basically break ~every test, so maybe we don't do that by default in tests?). If the signer is connected intermittently, hopefully the user has some mechanism to learn when the signer is available so they can re-sign.


I'd imagine we'd leave that up to the user or maybe even do it on every block (though that will basically break ~every test, so maybe we don't do that by default in tests?).

IIUC the idea is to just retry the signing/broadcasting part - I'm good with that. The update_claims_view logic is likely to get more complex in the future (e.g. locking fee-bumping UTXOs), though, so we should be careful about how we break it up, to avoid dependency bugs like the one above.

As to implementation details - there are already three buffers in ChannelMonitorImpl:

I think the last one, onchain_events_awaiting_threshold_conf, would fit; all those events are destined to be consumed by ChannelMonitorImpl. Maybe it could be renamed to something like `monitor_events_awaiting_time_point`, where the time point is either a blockchain height or the signer waking up.

The signer may connect at any time, which may not overlap with the block arrival time. So it's probably best that the BackgroundProcessor tick gets triggered by the signer connecting so we can catch up on all housekeeping tasks. ChannelMonitor already implements EventsProvider::process_pending_events, so perhaps it makes sense to add the code there. Or it may be cleaner to add another call as you say.

Yeah, ChannelMonitorImpl::get_and_clear_pending_events sounds like a good temporary location for now.

The Internal error is just a renaming of the existing Result<_, ()> error returns from BaseSign. It seems like we currently almost always panic, or we should, because it's caused by a key derivation failing, which has an infinitesimal chance of happening and probably indicates a hardware / RNG problem. In any case, it's out of scope of the current effort to change the handling of this case.

Sure, we can defer how we react to catastrophic hardware failures to the future.

@TheBlueMatt
Collaborator

In a discussion with @wpaulino yesterday, I noted that as we start moving towards anchors we may want to shift the way our API works here entirely as a first step - instead of calling into the signer, generate events for transactions that need to be signed and let the user do it at their convenience. That would make signer disconnection entirely the user's concern rather than something we have to handle in LDK itself.
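
Purely illustrative of that direction - the event name and fields below are invented, nothing like them exists in LDK today:

    // Hypothetical event surface: LDK hands out an unsigned transaction and the
    // user returns the signed version whenever their signer is reachable.
    enum Event {
        SignTransactionRequired { unsigned_tx: Vec<u8>, claim_id: u64 },
        // ... the real, existing events elided
    }

    fn handle_event(ev: Event) {
        match ev {
            Event::SignTransactionRequired { unsigned_tx, claim_id } => {
                // Forward unsigned_tx to the (possibly remote) signer at the
                // user's convenience, then hand the signed transaction back to
                // LDK via some completion call keyed by claim_id.
                let _ = (unsigned_tx, claim_id);
            }
        }
    }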

@devrandom
Member Author

instead of calling into the signer, generate events for transactions that need to be signed and let the user do it at their convenience.

This comment is regarding L1 signing, not L2 (commitment tx) signing, right?

@TheBlueMatt
Collaborator

Right, but all ChannelMonitor stuff is L1 signing, kinda-sorta, or at least it's "finalize signing" where we never need to update the L2 state anymore. You're right that it's a bit different in that we also still do the commitment tx signing, so maybe I'm being too rosy in my view here, but we could also imagine pushing that signing out via an event? Maybe that's too much code duplication on the signer side, so it's just a bit of a harebrained idea.

@TheBlueMatt
Collaborator

Post-#1897 this should get way easier - after that we persist the ChannelMonitor updates before we actually sign the counterparty commitment. That will, at a minimum, ensure we cannot ever request the signer sign two different things at the same commitment number, even across crashes, but with some work it could allow us to retry the signing without too much effort, I think.

In the meantime, do you want to close this PR, or will it get picked back up after that?

@devrandom
Member Author

I'll close this for now, until we get a chance to reevaluate. Thank you for the update.
