BaseSign + ChannelMonitor temporary error support #1538

Closed

Conversation

devrandom
Member

Early draft for discussion.

@devrandom force-pushed the 2022-06-sign-error branch from 164f6c8 to 018dd52 on June 13, 2022 12:33
@TheBlueMatt
Collaborator

Hmm, yea, I don't think this is really the right direction to go. Instead of passing errors around everywhere, I'd really strongly prefer if we first aggressively split all the ChannelMonitor logic into two passes: first process all transactions, build all packages, etc., and only once that's done, in a single function, sign all the packages and actually send the transactions. That way the first pass is entirely infallible, and only the second relies on the signer being available, so it can be trivially retried. E.g., as a first step, split OnchainTxHandler::generate_claim_tx into two?
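
For concreteness, a minimal sketch of that two-pass shape, using made-up stand-in types (ClaimRequest, SignedClaim, SignerError) rather than the real OnchainTxHandler internals:

    struct ClaimRequest { outpoints: Vec<String>, feerate_sat_per_kw: u64 }
    struct SignedClaim { raw_tx: Vec<u8> }
    enum SignerError { Temporary, Internal }

    struct OnchainTxHandler { pending_claims: Vec<ClaimRequest> }

    impl OnchainTxHandler {
        /// Pass 1: infallible. Walk the confirmed transactions, aggregate the
        /// outputs we can claim into requests, and queue them. No signer needed.
        fn build_claim_requests(&mut self, claimable_outpoints: Vec<String>, feerate_sat_per_kw: u64) {
            if !claimable_outpoints.is_empty() {
                self.pending_claims.push(ClaimRequest { outpoints: claimable_outpoints, feerate_sat_per_kw });
            }
        }

        /// Pass 2: fallible. Ask the signer to finalize every queued request.
        /// On SignerError::Temporary the queue is left intact, so this pass can
        /// simply be re-run later without repeating pass 1.
        fn sign_queued_claims<F>(&mut self, mut sign: F) -> Result<Vec<SignedClaim>, SignerError>
        where
            F: FnMut(&ClaimRequest) -> Result<SignedClaim, SignerError>,
        {
            let mut signed = Vec::with_capacity(self.pending_claims.len());
            for req in &self.pending_claims {
                signed.push(sign(req)?);
            }
            self.pending_claims.clear();
            Ok(signed)
        }
    }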

@devrandom
Copy link
Member Author

OK, I see - so we build the packages, not the transactions; that makes more sense. Let me try that.

@codecov-commenter

Codecov Report

Merging #1538 (0d3e990) into main (716539e) will increase coverage by 0.33%.
The diff coverage is 87.90%.

❗ Current head 0d3e990 differs from pull request most recent head dffb6c9. Consider uploading reports for the commit dffb6c9 to get more accurate results

@@            Coverage Diff             @@
##             main    #1538      +/-   ##
==========================================
+ Coverage   90.94%   91.27%   +0.33%     
==========================================
  Files          80       80              
  Lines       43469    45380    +1911     
  Branches    43469    45380    +1911     
==========================================
+ Hits        39533    41422    +1889     
- Misses       3936     3958      +22     
Impacted Files Coverage Δ
lightning/src/chain/channelmonitor.rs 91.02% <82.35%> (-0.23%) ⬇️
lightning/src/chain/onchaintx.rs 92.79% <88.13%> (-1.19%) ⬇️
lightning/src/chain/package.rs 92.86% <91.66%> (ø)
lightning/src/chain/keysinterface.rs 93.41% <92.30%> (-0.18%) ⬇️
lightning/src/ln/channelmanager.rs 84.32% <100.00%> (ø)
lightning/src/ln/functional_tests.rs 97.03% <100.00%> (-0.11%) ⬇️
lightning/src/util/enforcing_trait_impls.rs 89.47% <100.00%> (ø)
lightning/src/ln/features.rs 99.28% <0.00%> (-0.19%) ⬇️
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 716539e...dffb6c9.

let txid = tx.txid();
for k in req.outpoints() {
log_info!(logger, "Registering claiming request for {}:{}", k.txid, k.vout);
// XXX this cannot be reordered to later than previous block because of data dependency - tests are failing
Member Author

Tests start failing if this finalization loop is moved to this point or later - there seems to be a data dependency on the previous block of code. Not sure if this dependency is spurious or real.

Member Author

Hmm... so it looks like the original code is trying to claim an outpoint in two different ways - and that's even alluded to in the failing test (test_static_spendable_outputs_justice_tx_revoked_htlc_success_tx). Moving the loop as I did in the current PR causes one of them (originally an entry in bump_candidates) to not be created. It's not clear what the right strategy is, and I think I'm missing some context.

Contributor

The dependency is real, and it's good to see the test caught this. When a revoked commitment confirms, the broadcasting party may still be able to sweep a pending HTLC, but they still incur the additional to_self delay, which gives the remote party time to claim through the revocation path on the new output from the HTLC success/timeout transaction. We need to keep tracking the pending claim request so that we can detect such a case, split the now-spent output from it, and start tracking the new one.

As for the end goal of this PR, it seems like we should split PackageTemplate::finalize_package into two: create_package and finalize/sign_package, where the former takes care of crafting the transaction(s) and the latter signs them.
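
Roughly, that split might look like this (simplified stand-in types, not the real PackageTemplate signatures):

    struct UnsignedPackage { tx_template: Vec<u8> }
    struct SignedPackage { tx: Vec<u8> }
    enum SignError { Temporary, Internal }

    struct PackageTemplate;

    impl PackageTemplate {
        // Infallible: craft the claiming transaction(s) without touching the signer.
        fn create_package(&self) -> UnsignedPackage {
            UnsignedPackage { tx_template: Vec::new() }
        }

        // Fallible: produce the signatures. A Temporary error leaves the
        // UnsignedPackage untouched so the signing step can be retried later.
        fn sign_package(&self, pkg: &UnsignedPackage) -> Result<SignedPackage, SignError> {
            Ok(SignedPackage { tx: pkg.tx_template.clone() })
        }
    }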


From what I can tell from tweaking test_static_spendable_outputs_justice_tx_revoked_htlc_success_tx, it sounds to me like the second, correct justice transaction is not generated.

What I believe is happening, without looking that much into the details, is that now that we no longer push the request into pending_claim_requests, this map is not updated to reflect the commitment transaction being spent by our counterparty, which is what the loop above does. Therefore we forget to generate the revoked_htlc_txn as a bump_candidates entry.

A solution could be to put this "conflicting-outpoints-reconciliation" loop before generate_claim_tx as a preprocessing step, and therefore ensure that all the claim requests we have are sane. Doing so, you might have a few tests failing, which I would say is just because they exercise pathological cases like test_static_...success_tx.

Temporary,
/// A signer internal error
Internal
}
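
For context, the hunk above presumably completes to something like the enum below; the Temporary doc comment and the illustrative BaseSign change are guesses, not the actual diff:

    /// Errors a signer can return.
    pub enum SignError {
        /// The signer is temporarily unavailable; the operation can be retried.
        Temporary,
        /// A signer internal error
        Internal,
    }

    // Illustrative only: BaseSign methods would then return SignError instead of
    // the bare () they use today, e.g. roughly:
    // fn sign_justice_revoked_output(&self, ...) -> Result<Signature, SignError>;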

Just to confirm that we all agree on the direction: in case of a SignError::Temporary yielded by finalize_claim_tx, we return a PendingEvent::SignTransaction to ChannelMonitor, queued on a new internal buffer?

How would we dry up the queue? Inside block_confirmed or with a new monitor_timer_tick_occured()? I lean towards the latter, as elapsing blocks are a bit precious in Lightning.

If we get a SignError::Internal, maybe we should flow it back up to ChannelManager with a ChannelMonitor::TemporaryFailure and stop accepting HTLCs we won't be able to claim anyway?

Member Author

Just to confirm that we all agree on the direction: in case of a SignError::Temporary yielded by finalize_claim_tx, we return a PendingEvent::SignTransaction to ChannelMonitor, queued on a new internal buffer?

Initially, I thought we would just call ChannelMonitor.update_claims_view with empty inputs and the current logic would have it retry the requests in self.pending_claim_requests. It seemed like the least invasive approach.

However, it seems that @TheBlueMatt prefers if we just retry the signing/broadcasting part, and not the rest of the logic. Maybe he can clarify how he envisions the triggering of the retry.

As to implementation details - there are already three buffers in ChannelMonitorImpl:

	pending_monitor_events: Vec<MonitorEvent>,
	pending_events: Vec<Event>,
	onchain_events_awaiting_threshold_conf: Vec<OnchainEventEntry>,

I'm wondering if any of these could be used for this purpose.

How would we dry up the queue? Inside block_confirmed or with a new monitor_timer_tick_occured()? I lean towards the latter, as elapsing blocks are a bit precious in Lightning.

The signer may connect at any time, which may not overlap with the block arrival time. So it's probably best that the BackgroundProcessor tick gets triggered by the signer connecting so we can catch up on all housekeeping tasks. ChannelMonitor already implements EventsProvider::process_pending_events, so perhaps it makes sense to add the code there. Or it may be cleaner to add another call as you say.
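
For illustration, the user-side trigger could be as simple as the sketch below, assuming a hypothetical retry_pending_signing entry point (no such method exists today):

    // Hypothetical stand-in; the real ChannelMonitor exposes no such method yet.
    struct ChannelMonitorHandle;

    impl ChannelMonitorHandle {
        fn retry_pending_signing(&self) {
            // Re-run only the signing/broadcasting pass for claims that
            // previously failed with a temporary signer error.
        }
    }

    // Called by the signer transport when it reconnects, independently of block
    // arrival, so temporarily-failed claims are retried promptly.
    fn on_signer_reconnected(monitors: &[ChannelMonitorHandle]) {
        for monitor in monitors {
            monitor.retry_pending_signing();
        }
    }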

If we get a SignError::Internal, maybe we should flow it back up to ChannelManager with a ChannelMonitor::TemporaryFailure and stop accepting HTLCs we won't be able to claim anyway?

The Internal error is just a renaming of the existing Result<_, ()> error returns from BaseSign. It seems like we currently almost always panic, or we should, because it's caused by a key derivation failing, which has an infinitesimal chance of happening and probably indicates a hardware / RNG problem. In any case, it's out of scope of the current effort to change the handling of this case.

Collaborator

However, it seems that @TheBlueMatt prefers if we just retry the signing/broadcasting part, and not the rest of the logic. Maybe he can clarify how he envisions the triggering of the retry.

I'd imagine we'd leave that up to the user or maybe even do it on every block (though that will basically break ~every test, so maybe we don't do that by default in tests?). If the signer is connected intermittently, hopefully the user has some mechanism to learn when the signer is available so they can re-sign.


I'd imagine we'd leave that up to the user or maybe even do it on every block (though that will basically break ~every test, so maybe we don't do that by default in tests?).

IIUC the idea is to just retry the signing/broadcasting part - I'm good with that. The update_claims_view logic is likely to get more complex in the future (e.g. locking fee-bumping UTXOs), though, so we should be careful about how we break it up, to avoid dependency bugs like the one above.

As to implementation details - there are already three buffers in ChannelMonitorImpl:

I think the last one, onchain_events_awaiting_threshold_conf, would fit; all those events are destined to be consumed by ChannelMonitorImpl. Maybe it could be renamed to something like `monitor_events_awaiting_time_point`, where the time point is either a blockchain height or the signer waking up.

The signer may connect at any time, which may not overlap with the block arrival time. So it's probably best that the BackgroundProcessor tick gets triggered by the signer connecting so we can catch up on all housekeeping tasks. ChannelMonitor already implements EventsProvider::process_pending_events, so perhaps it makes sense to add the code there. Or it may be cleaner to add another call as you say.

Yeah, ChannelMonitorImpl::get_and_clear_pending_events sounds like a good temporary location for now.

The Internal error is just a renaming of the existing Result<_, ()> error returns from BaseSign. It seems like we currently almost always panic, or we should, because it's caused by a key derivation failing, which has an infinitesimal chance of happening and probably indicates a hardware / RNG problem. In any case, it's out of scope of the current effort to change the handling of this case.

Sure, we can defer how we react to catastrophic hardware failures to the future.

@TheBlueMatt
Collaborator

In a discussion with @wpaulino yesterday, I noted that as we start moving towards anchors we may want to shift the way our API works here entirely as a first step - instead of calling into the signer, generate events for transactions that need to be signed and let the user do it at their convenience. That would make signer disconnection entirely the user's concern rather than something we have to handle in LDK itself.
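
Purely illustrative of that direction - the event name and fields below are invented, nothing like them exists in LDK today:

    // Hypothetical event surface: LDK hands out an unsigned transaction and the
    // user returns the signed version whenever their signer is reachable.
    enum Event {
        SignTransactionRequired { unsigned_tx: Vec<u8>, claim_id: u64 },
        // ... the real, existing events elided
    }

    fn handle_event(ev: Event) {
        match ev {
            Event::SignTransactionRequired { unsigned_tx, claim_id } => {
                // Forward unsigned_tx to the (possibly remote) signer at the
                // user's convenience, then hand the signed transaction back to
                // LDK via some completion call keyed by claim_id.
                let _ = (unsigned_tx, claim_id);
            }
        }
    }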

@devrandom
Member Author

instead of calling into the signer, generate events for transactions that need to be signed and let the user do it at their convenience.

This comment is regarding L1 signing, not L2 (commitment tx) signing, right?

@TheBlueMatt
Collaborator

Right, but all ChannelMonitor stuff is L1 signing, kinda-sorta, or at least it's "finalize signing" where we never need to update the L2 state anymore. You're right that it's a bit different in that we also still do the commitment tx signing, so maybe I'm being too rosy in my view here, but we could also imagine pushing that signing out via an event? Maybe that's too much code duplication on the signer side, so it's just a bit of a harebrained idea.

@TheBlueMatt
Collaborator

Post-#1897 this should get way easier - after that we persist the ChannelMonitor updates before we actually sign the counterparty commitment. That will, at a minimum, ensure we cannot ever request the signer sign two different things at the same commitment number, even across crashes, but with some work it could allow us to retry the signing without too much effort, I think.

In the meantime, do you want to close this PR, or will it get picked back up after that?

@devrandom
Member Author

I'll close this for now, until we get a chance to reevaluate. Thank you for the update.
