BaseSign + ChannelMonitor temporary error support #1538
Conversation
Hmm, yea, I don't think this is really the right direction to go - instead of passing errors around everywhere, I'd really strongly prefer if we first aggressively split all the ChannelMonitor logic into two passes - first process all transactions, build all packages, etc, and only once that's done, in a single function, sign all the packages and actually send transactions. That way the first parts are all infallible, and the second relies on the signer being available, and can be trivially retried. eg as a first pass, split
OK, I see. So we build the packages, not the transactions; that makes more sense. Let me try that.
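A minimal sketch of that two-pass split (all names, types, and signatures below are illustrative stand-ins, not rust-lightning's actual API):

// Illustrative sketch only -- names and types are stand-ins, not LDK's API.
struct Transaction;     // stand-in for a signed bitcoin transaction
struct PackageTemplate; // a claim fully described but not yet signed

enum SignError {
    Temporary, // signer unreachable; the request can be retried
    Internal,  // unrecoverable signer-internal failure
}

trait Signer {
    fn sign_package(&self, pkg: &PackageTemplate) -> Result<Transaction, SignError>;
}

trait Broadcaster {
    fn broadcast(&self, tx: &Transaction);
}

struct OnchainTxHandler {
    // Packages built during block processing that still need signatures.
    pending_packages: Vec<PackageTemplate>,
}

impl OnchainTxHandler {
    // First pass: infallible. Process the confirmed transactions, decide what
    // needs claiming, and queue fully-described packages, without touching the
    // signer at all.
    fn build_claim_packages(&mut self, _confirmed_txids: &[[u8; 32]]) {
        // ... push PackageTemplates onto self.pending_packages ...
    }

    // Second pass: fallible. Sign and broadcast everything queued; a package
    // that hits a temporary signer error stays queued and is retried later.
    fn sign_and_broadcast<S: Signer, B: Broadcaster>(&mut self, signer: &S, broadcaster: &B) {
        self.pending_packages.retain(|pkg| match signer.sign_package(pkg) {
            Ok(tx) => { broadcaster.broadcast(&tx); false } // claimed; drop it
            Err(SignError::Temporary) => true,              // keep for retry
            Err(SignError::Internal) => panic!("unrecoverable signer error"),
        });
    }
}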
Codecov Report

@@            Coverage Diff             @@
##             main    #1538      +/-   ##
==========================================
+ Coverage   90.94%   91.27%   +0.33%
==========================================
  Files          80       80
  Lines       43469    45380    +1911
  Branches    43469    45380    +1911
==========================================
+ Hits        39533    41422    +1889
- Misses       3936     3958      +22

Continue to review full report at Codecov.
let txid = tx.txid();
for k in req.outpoints() {
    log_info!(logger, "Registering claiming request for {}:{}", k.txid, k.vout);
    // XXX this cannot be reordered to later than previous block because of data dependency - tests are failing
Tests start failing if this finalization loop is moved to this point or later - it seems to be a data dependency on the previous block of code. Not sure if this dependency is spurious or real.
Hmm... so it looks like the original code is trying to claim an outpoint in two different ways - and that's even alluded to in the failing test (test_static_spendable_outputs_justice_tx_revoked_htlc_success_tx). Moving the loop as I did in the current PR causes one of them (originally an entry in bump_candidates) not to be created. It's not clear what the right strategy is, and I think I'm missing some context.
The dependency is real and it's good to see the test caught this. When a revoked commitment confirms, the broadcasting party may still be able to sweep a pending HTLC, but they still incur the additional to_self delay, which gives the remote party time to claim through the revocation path on the new output from the HTLC success/timeout transaction. We need to track the pending claim request so that we can detect such a case, split the now-spent output from it, and start tracking the new one.
As for the end goal of this PR, it seems like we should split PackageTemplate::finalize_package into two: create_package and finalize/sign_package; the former takes care of crafting the transaction(s), while the latter signs them.
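Continuing the earlier illustrative sketch, that method-level split could look roughly like the following; UnsignedClaim is a hypothetical holder for the crafted-but-unsigned transaction, not an existing type:

// Sketch only, building on the stand-in Signer/SignError/Transaction types above.
struct UnsignedClaim; // crafted-but-unsigned claim tx plus the metadata
                      // (input amounts, witness scripts) the signer will need

impl PackageTemplate {
    // Infallible: craft the claiming transaction(s) -- select inputs/outputs,
    // pick the feerate -- but produce no signatures.
    fn create_package(&self) -> UnsignedClaim {
        UnsignedClaim
    }

    // Fallible: ask the signer for witnesses. SignError::Temporary means
    // "keep the UnsignedClaim and call this again once the signer is back".
    fn sign_package(&self, _claim: &UnsignedClaim, signer: &dyn Signer)
        -> Result<Transaction, SignError>
    {
        // In a real split this would attach the returned witnesses to the
        // crafted transaction; elided here.
        signer.sign_package(self)
    }
}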
From what I can tell from tweaking test_static_spendable_outputs_justice_tx_revoked_htlc_success_tx, it sounds like the second, correct justice transaction is not generated.
What I believe is happening, without looking too much into the details, is that now that we no longer push the request into pending_claim_requests, this map is not updated to reflect the commitment transaction being spent by our counterparty, which is what the above loop is doing. Therefore we forget to generate the revoked_htlc_txn as a bump_candidates entry.
A solution could be to put this "conflicting-outpoints-reconciliation" loop before generate_claim_tx as a preprocessing step, and thereby ensure that all the claim requests we have are sane. Doing so, you might have a few tests failing, which I would say is just because they exercise pathological cases like test_static_...success_tx.
    Temporary,
    /// A signer internal error
    Internal,
}
Just to confirm we all agree on the direction: in the case of a SignError::Temporary yielded by finalize_claim_tx, we return a PendingEvent::SignTransaction to ChannelMonitor, queued on a new internal buffer?
How would we dry up the queue? Inside block_confirmed, or with a new monitor_timer_tick_occured()? I lean towards the latter, as blocks elapsing are a bit precious in Lightning.
If we get a SignError::Internal, maybe we should flow it back up to ChannelManager with a ChannelMonitor::TemporaryFailure and stop accepting HTLCs we won't be able to claim anyway?
Just to confirm we all agree on the direction: in the case of a SignError::Temporary yielded by finalize_claim_tx, we return a PendingEvent::SignTransaction to ChannelMonitor, queued on a new internal buffer?
Initially, I thought we would just call ChannelMonitor.update_claims_view with empty inputs and let the current logic retry the requests in self.pending_claim_requests. It seemed like the least invasive approach.
However, it seems that @TheBlueMatt prefers if we just retry the signing/broadcasting part, and not the rest of the logic. Maybe he can clarify how he envisions the triggering of the retry.
As to implementation details - there are already three buffers in ChannelMonitorImpl:
pending_monitor_events: Vec<MonitorEvent>,
pending_events: Vec<Event>,
onchain_events_awaiting_threshold_conf: Vec<OnchainEventEntry>,
I'm wondering if any of these could be used for this purpose.
How would we dry up the queue? Inside block_confirmed, or with a new monitor_timer_tick_occured()? I lean towards the latter, as blocks elapsing are a bit precious in Lightning.
The signer may connect at any time, which may not overlap with the block arrival time. So it's probably best that the BackgroundProcessor tick gets triggered by the signer connecting, so we can catch up on all housekeeping tasks. ChannelMonitor already implements EventsProvider::process_pending_events, so perhaps it makes sense to add the code there. Or it may be cleaner to add another call, as you say.
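Continuing the illustrative sketch from above, the retry entry point could look roughly like this, whether it ends up being driven from get_and_clear_pending_events, a new timer tick, or a user call made when the signer reconnects (the method name below is hypothetical):

// Sketch only; retry_pending_claims is not an existing LDK method.
struct ChannelMonitorImpl {
    onchain_tx_handler: OnchainTxHandler, // still holds the unsigned packages
}

impl ChannelMonitorImpl {
    // Called when the signer reconnects, or from a periodic tick such as the
    // proposed monitor_timer_tick_occured(): re-run only the fallible
    // sign-and-broadcast pass. Package construction and claim tracking are
    // not redone here.
    fn retry_pending_claims<S: Signer, B: Broadcaster>(&mut self, signer: &S, broadcaster: &B) {
        self.onchain_tx_handler.sign_and_broadcast(signer, broadcaster);
    }
}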
If we get a SignError::Internal, maybe we should flow it back up to ChannelManager with a ChannelMonitor::TemporaryFailure and stop accepting HTLCs we won't be able to claim anyway?
The Internal error is just a renaming of the existing Result<_, ()> error returns from BaseSign. It seems like we currently almost always panic, or we should, because it's caused by a key derivation failing, which has an infinitesimal chance of happening and probably indicates a hardware / RNG problem. In any case, it's out of scope of the current effort to change the handling of this case.
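As a small, hypothetical illustration of that renaming (not the PR's actual call sites), the old opaque error would simply be tagged as Internal, keeping Temporary reserved for the signer-unreachable case:

// Hypothetical helper, reusing SignError from the sketch/diff above.
struct Signature; // stand-in for secp256k1::ecdsa::Signature

fn map_legacy_sign_result(res: Result<Signature, ()>) -> Result<Signature, SignError> {
    // The old opaque () error becomes Internal; only an explicit Temporary
    // from the signer is treated as retryable.
    res.map_err(|()| SignError::Internal)
}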
However, it seems that @TheBlueMatt prefers if we just retry the signing/broadcasting part, and not the rest of the logic. Maybe he can clarify how he envisions the triggering of the retry.
I'd imagine we'd leave that up to the user or maybe even do it on every block (though that will basically break ~every test, so maybe we don't do that by default in tests?). If the signer is connected intermittently, hopefully the user has some mechanism to learn when the signer is available so they can re-sign.
I'd imagine we'd leave that up to the user or maybe even do it on every block (though that will basically break ~every test, so maybe we don't do that by default in tests?).
IIUC the idea is to retry just the signing/broadcasting part; I'm good with that. The update_claims_view logic is likely to get more complex in the future (e.g. locking fee-bumping UTXOs), though, which will make it harder to break apart without dependency bugs like the one above.
As to implementation details - there are already three buffers in ChannelMonitorImpl:
I think the last one, onchain_events_awaiting_threshold_conf, would fit; all of its events are destined to be consumed by ChannelMonitorImpl. Maybe it could be renamed to something like monitor_events_awaiting_time_point, where the time point is either a blockchain height or the signer waking up.
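A tiny, hypothetical illustration of that "time point" idea:

// Hypothetical only: an entry in the renamed buffer would be held until either
// a block height is reached or the signer becomes reachable again.
enum TimePoint {
    BlockHeight(u32),
    SignerAvailable,
}

struct MonitorEventAwaitingTimePoint {
    ready_at: TimePoint,
    // ... event payload ...
}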
The signer may connect at any time, which may not overlap with the block arrival time. So it's probably best that the BackgroundProcessor tick gets triggered by the signer connecting so we can catch up on all housekeeping tasks. ChannelMonitor already implements EventsProvider::process_pending_events, so perhaps it makes sense to add the code there. Or it may be cleaner to add another call as you say.
Yeah, ChannelMonitorImpl::get_and_clear_pending_events sounds like a good temporary location for now.
The Internal error is just a renaming of the existing Result<_, ()> error returns from BaseSign. It seems like we currently almost always panic, or we should, because it's caused by a key derivation failing, which has an infinitesimal chance of happening and probably indicates a hardware / RNG problem. In any case, it's out of scope of the current effort to change the handling of this case.
Sure, we can defer how we react to catastrophic hardware failures to the future.
In discussion with @wpaulino yesterday about moving towards anchors, I noted that we may want to shift the way our API works here entirely as a first step - instead of calling into the signer, generate events for transactions that need to be signed and let the user do it at their convenience. That would make signer disconnection entirely up to the user rather than something we have to handle in LDK itself.
This comment is regarding L1 signing, not L2 (commitment tx) signing, right?
Right, but all
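A rough, hypothetical sketch of that event-based flow, reusing the earlier illustrative types; the event and method names are invented, not an existing LDK API:

// Hypothetical sketch: LDK surfaces what needs signing as an event, and the
// user hands back the finished transaction whenever their signer is reachable.
enum SigningEvent {
    // An on-chain claim needs signatures before it can be broadcast; carries
    // the unsigned transaction plus per-input signing metadata.
    SignClaimTransaction { claim: UnsignedClaim },
    // ... other events ...
}

impl ChannelMonitorImpl {
    // Called by the user, at their convenience, once the claim is signed.
    fn provide_signed_claim(&mut self, _signed_tx: Transaction) {
        // queue the transaction for broadcast and keep tracking the claim
    }
}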
Post-#1897 this should get way easier - after that we persist the
In the meantime, do you want to close this PR or will it get picked back up after that?
I'll close this for now, until we get a chance to reevaluate. Thank you for the update.
Early draft for discussion.