2019 04 in flight txn tracking clean #336
Conversation
Force-pushed from d3967e7 to d74361c
Hmm, I had these comments pending; I figure they're worth publishing until I get a chance to re-review, hopefully tomorrow.
@@ -390,6 +393,9 @@ pub struct ChannelMonitor {
	destination_script: Script,

	onchain_events_waiting_threshold_conf: HashMap<u32, Vec<OnchainEvent>>,
Can you add a comment describing exactly how this is used?
Better?
The meaning of the u32 still isn't explained.
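For reference, here is a documented sketch of how such a field might read, assuming (as suggested later in the thread) that the u32 key is the block height at which the buffered events mature; the types are simplified stand-ins, not the crate's actual definitions:

```rust
use std::collections::HashMap;

/// Simplified stand-in for the real OnchainEvent in channelmonitor.rs.
#[derive(Clone, PartialEq)]
enum OnchainEvent {
    /// An HTLC to fail backward once the triggering on-chain tx has enough
    /// confirmations.
    HTLCUpdate { payment_hash: [u8; 32] },
}

struct MonitorSketch {
    /// Events buffered until a confirmation threshold is reached.
    /// Key: the block height at which the events mature, i.e. the height of
    /// the block confirming the triggering transaction plus the anti-reorg
    /// delay minus one. Value: the events to act on once that height connects.
    onchain_events_waiting_threshold_conf: HashMap<u32, Vec<OnchainEvent>>,
}

impl MonitorSketch {
    /// On each connected block, drain and return any events that matured.
    fn block_connected(&mut self, height: u32) -> Vec<OnchainEvent> {
        self.onchain_events_waiting_threshold_conf
            .remove(&height)
            .unwrap_or_default()
    }
}
```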
src/ln/channelmonitor.rs
Outdated
/// once they mature to enough confirmations (HTLC_FAIL_ANTI_REORG_DELAY)
#[derive(Clone, PartialEq)]
enum OnchainEvent {
	/// Outpoint under claim process by our own tx, at maturation we remove it from our_claim_txn_waiting_first_conf
Hmm, there is no variable called our_claim_txn_waiting_first_conf.
Yes, it's introduced in a later commit. What's best practice here: refer to it even though it isn't present in this commit, or use a generic reference?
From IRC:
Force-pushed from d74361c to 3bb62d1
Mostly done reviewing, but here's some initial feedback.
src/ln/channelmonitor.rs
Outdated
@@ -2046,7 +2046,17 @@ impl ChannelMonitor {
				payment_preimage.0.copy_from_slice(&input.witness[1]);
				htlc_updated.push((source, Some(payment_preimage), payment_hash));
			} else {
				htlc_updated.push((source, None, payment_hash));
				log_trace!(self, "Failing HTLC with payment_hash {} timeout by a spend tx, waiting confirmation until {} height", log_bytes!(payment_hash.0), height + HTLC_FAIL_ANTI_REORG_DELAY - 1);
Hmmm...we need some kind of mechanism to prevent us from waiting until after an HTLC expires to fail it backwards. In fact, currently, we only close-to-chain when we're CLTV_CLAIM_BUFFER (6 blocks) away from expiry, and we only fail back after HTLC_FAIL_ANTI_REORG_DELAY (6 blocks), so we're kinda screwed. We should significantly increase CLTV_CLAIM_BUFFER, add some static asserts, and then make sure we always fail back maybe one or two blocks before expiry if we have an unconfirmed resolving tx pending.
Hmmm, isn't it more CLTV_EXPIRY_DELTA and HTLC_FAIL_ANTI_REORG_DELAY which are intertwined? Say a remote commitment transaction hits the chain and we resolve a received HTLC with a timeout tx; after HTLC_FAIL_ANTI_REORG_DELAY we pass the failure backward. As that's well below CLTV_EXPIRY_DELTA, it should be safe, assuming would_broadcast_at_height reacts in time.
Ah, yea, sorry, you're right. We should still try to check how far we are from the point by which we need to have failed back, but, indeed, it's a question of CLTV_EXPIRY_DELTA.
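As a rough illustration of the check being discussed (the constant values and the helper name below are invented, not the crate's): before scheduling a delayed fail-back, verify that waiting out the anti-reorg delay still leaves a claim buffer before the inbound HTLC expires.

```rust
// Hypothetical values for illustration only.
const CLTV_CLAIM_BUFFER: u32 = 6;
const HTLC_FAIL_ANTI_REORG_DELAY: u32 = 6;

/// Returns true if we can afford to wait HTLC_FAIL_ANTI_REORG_DELAY blocks
/// before failing backward, given the inbound HTLC's absolute expiry height.
fn can_delay_failure(current_height: u32, inbound_htlc_expiry: u32) -> bool {
    // We must have failed backward before the inbound HTLC expires, keeping
    // CLTV_CLAIM_BUFFER blocks of margin to get a claim confirmed upstream.
    current_height + HTLC_FAIL_ANTI_REORG_DELAY + CLTV_CLAIM_BUFFER <= inbound_htlc_expiry
}
```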
Force-pushed from 3bb62d1 to fc06077
Force-pushed from fc06077 to 47db5be
Ugh, I'll finish this review this week, but this is what I came up with when looking a week ago or so.
src/ln/channelmonitor.rs
Outdated
		if let &Some(ref local_tx) = &self.current_local_signed_commitment_tx {
			for &(ref htlc, _, ref source) in &local_tx.htlc_outputs {
Shouldn't this block only run in case either commitment_txid == current local txid or previous local txid? (ie doesn't this hunk, as written, always fail dust HTLCs backwards?)
Oh yes, modified to only fail backward if the commitment tx is the latest or previous local one.
But we still need to iterate over both HTLC sets, as you may receive an update_fail_htlc + commitment_signed from your counterparty. If the latest local tx hits the chain, our in-memory latest local tx data won't contain the failed dust HTLC output. That's an edge case, but test_failure_delay_dust_htlc_local_commitment explicitly tests it.
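A simplified sketch of that iteration, with invented stand-in types, once we already know the confirmed transaction is one of our local commitments:

```rust
/// Simplified stand-in for a local commitment; the real structure carries the
/// full HTLC and source data.
struct LocalCommitmentTx {
    dust_htlc_payment_hashes: Vec<[u8; 32]>,
}

/// Scan the dust HTLCs of both the latest and the previous local commitment:
/// after an update_fail_htlc + commitment_signed exchange their HTLC sets can
/// differ, and the confirmed tx alone may not cover the HTLC we must fail.
fn dust_htlcs_to_fail(
    current: &Option<LocalCommitmentTx>,
    previous: &Option<LocalCommitmentTx>,
) -> Vec<[u8; 32]> {
    let mut failed = Vec::new();
    for local_tx in current.iter().chain(previous.iter()) {
        failed.extend(local_tx.dust_htlc_payment_hashes.iter().cloned());
    }
    failed
}
```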
Force-pushed from b33d1df to e3d42ed
Force-pushed from e3d42ed to ca03df9
Didn't review individual commits, just the total diff. Didn't get materially to tests yet either, but it does look like the tests could use some love, definitely a few more new tests would be much appreciated (eg to fix the issues I point out).
src/ln/channelmanager.rs
Outdated
@@ -332,6 +332,8 @@ pub struct ChannelManager {
	channel_state: Mutex<ChannelHolder>,
	our_network_key: SecretKey,

	channel_closing_waiting_threshold_conf: Mutex<HashMap<u32, Vec<[u8; 32]>>>,
docs please. Also, looks like right now this is always only ever taken while channel_state is taken. It could either be moved into the channel_state mutex, or the places it's taken could be broken up and scoped so that the locks don't always overlap. Don't sweat making fine-grained locks yet anyway, until we go around and split channels into their own locks.
Updated the doc and moved it into ChannelHolder; does that work? IIRC this buffer is just there to avoid inconsistencies between channel_manager and channel_monitor in case of a reorg where an HTLC-timeout gets swept by a preimage tx; in that case we need to pass the success backward, and the channel needs to still be open. Do you think it's okay as-is until we put more thought into channel lock refactoring?
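For illustration, a minimal sketch (with invented stand-in types) of the layout being suggested: the pending-close buffer sits inside the structure that the channel_state mutex already guards, so no second lock is needed.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Invented minimal stand-ins for the real ChannelHolder / ChannelManager.
struct ChannelHolderSketch {
    // ... other per-channel state lives here ...
    /// Channels whose closing we delay until the keyed block height, so a
    /// reorged-out commitment tx doesn't leave us unable to pass a preimage
    /// backward. Key: maturation height, value: channel ids.
    channel_closing_waiting_threshold_conf: HashMap<u32, Vec<[u8; 32]>>,
}

struct ChannelManagerSketch {
    /// One mutex covers both the channel map and the pending-close buffer,
    /// instead of a second lock that would always be taken alongside it.
    channel_state: Mutex<ChannelHolderSketch>,
}
```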
src/ln/channelmonitor.rs
Outdated
				assert!(predicted_weight >= single_htlc_tx.get_weight());
				assert_eq!(single_htlc_tx.input.len(), 1);
				self.our_claim_txn_waiting_first_conf.insert(single_htlc_tx.input[0].previous_output.clone(), (height + 3, TxMaterial::Revoked { script: redeemscript, pubkey: Some(revocation_pubkey), key: revocation_key, is_htlc: true, amount: htlc.amount_msat / 1000 }, fee_estimator.get_est_sat_per_1000_weight(ConfirmationTarget::HighPriority)));
Hmm, probably want something smarter than just height + 3, given the HTLC may be expiring in < 3 blocks.
Also, need to make sure we add to the map with the feerate actually used, not a new one fetched from the fee_estimator. Further, and this is somewhat of a strange edge-case, but in case we got called twice for the same block, but the feerate information changed between calls, we don't want to overwrite the entry in the map with new info.
Note that these comments apply equally to several other similar lines below.
On feerate, updated to cache only the actual value used. In case we get the same block twice, I kept only the first version seen; it doesn't seem efficient, when a bump is needed, to send two bumped transactions at the same height even if their respective previous states had different feerate bases.
And you may have the tx_broadcaster discard duplicate claim txs based on the spent outpoint.
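A sketch of the agreed behaviour using the std entry API, with hypothetical key and value types: store the feerate the claim was actually built with, and keep the first entry if the same block is connected twice.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Hypothetical key/value types: (txid, vout) of the claimed outpoint, and
/// (height timer for the next bump, feerate actually used, sat/1000-weight).
type OutPointKey = ([u8; 32], u32);
type ClaimInfo = (u32, u64);

fn track_claim(
    pending: &mut HashMap<OutPointKey, ClaimInfo>,
    outpoint: OutPointKey,
    height_timer: u32,
    used_feerate: u64,
) {
    // Record the feerate the claim tx was actually built with, and only if we
    // aren't already tracking this outpoint: if the same block is connected
    // twice we keep the first entry rather than overwriting it.
    if let Entry::Vacant(e) = pending.entry(outpoint) {
        e.insert((height_timer, used_feerate));
    }
}
```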
src/ln/channelmanager.rs
Outdated
				msg: update
			});
			log_trace!(self, "Detected channel-closing tx {} spending {}:{}, waiting until {} to close channel {}", tx.txid(), inp.previous_output.txid, inp.previous_output.vout, height + HTLC_FAIL_ANTI_REORG_DELAY - 1, log_bytes!(channel_id[..]));
			match channel_closing_lock.entry(height + HTLC_FAIL_ANTI_REORG_DELAY - 1) {
(this and a bunch of other places): Is it possible that we need to fail this backwards before we reach HTLC_FAIL_ANTI_REORG_DELAY? ie because the inbound HTLC is close to timing out (and, thus, will result in a needless channel closure upstream). I think we need to check if the expiry + CLTV_EXPIRY_DELTA is at least HTLC_FAIL_ANTI_REORG_DELAY + CLTV_CLAIM_BUFFER (maybe + some headroom) away?
Not something to do now, but I'm more and more thinking that channel_manager shouldn't be a ChainListener; it should get all blockchain events through ChannelMonitor and register outpoints to watch as needed. That would avoid inconsistencies between the two due to different handling of the block height timer.
Hmmm, see comment at https://github.com/rust-bitcoin/rust-lightning/pull/336/files#r304483221. I'd agree if block_connected in ChannelManager grows, but as long as we can keep it small, I don't see why it's necessary.
src/ln/channelmonitor.rs
Outdated
			add_dynamic_output!(htlc_timeout_tx, 0);
			//TODO: we should delay broadcast of HTLC-timeout tx until locktime - 1 to avoid being bounced off by the mempool
Not actually sure what you're saying here?
Going through Core's mempool rules the other day, I read that a transaction with a locktime is rejected from the mempool unless its locktime is no greater than tip + 1 (to avoid DoS vectors), which means we should broadcast timeout txn only at that height to avoid having to rebuild them a few blocks later, and therefore track them with an OnchainEvent. Is that how the mempool works?
Ohh, I see your point. Still, broadcast_transaction should imply re-announcement (potentially at the appropriate time), if only because mempool transactions may time out. The docs on broadcast_transaction should make sure this is clear, but we shouldn't bother dealing with that.
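A sketch of the deferral under discussion, assuming the policy that a height-locktimed transaction is only accepted into the mempool once its nLockTime is at most tip + 1; types and names here are invented:

```rust
use std::collections::BTreeMap;

/// Pending HTLC-timeout transactions, keyed by their absolute nLockTime, held
/// back until the mempool will accept them.
struct DelayedBroadcaster {
    pending: BTreeMap<u32, Vec<Vec<u8>>>, // locktime -> serialized txs
}

impl DelayedBroadcaster {
    /// Queue a transaction whose nLockTime (absolute block height) is `locktime`.
    fn hold(&mut self, locktime: u32, raw_tx: Vec<u8>) {
        self.pending.entry(locktime).or_default().push(raw_tx);
    }

    /// Drain every transaction that can be included in the block following
    /// `tip_height`, i.e. whose locktime is <= tip_height + 1.
    fn block_connected(&mut self, tip_height: u32) -> Vec<Vec<u8>> {
        let mature: Vec<u32> = self
            .pending
            .range(..=tip_height + 1)
            .map(|(locktime, _)| *locktime)
            .collect();
        let mut ready = Vec::new();
        for locktime in mature {
            if let Some(txs) = self.pending.remove(&locktime) {
                ready.extend(txs);
            }
        }
        ready
    }
}
```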
src/ln/channelmonitor.rs
Outdated
		}

		// HTLCs set may differ between last and previous local commitment txn, in case of one hitting chain, ensure we cancel all of them backward
		let is_local_tx = if self.current_local_signed_commitment_tx.is_some() || self.prev_local_signed_commitment_tx.is_some() { true } else { false };
This isn't enough to verify that the given transaction is a local transaction, is it? I think you also need to check that commitment_txid == local_tx.txid for either prev or current?
Probably worth trying to add a test case for this.
Yeah, it was broken but not spotted by any other tests; it seems it's really an edge case, because if the tx is spending the funding output, dust HTLCs are going to be cancelled on behalf of the remote commitment anyway. Never mind, I switched the logic to make sure the confirmed tx matches the previous or latest local commitment, added test_no_failure_dust_htlc_local_commitment, and set is_local_tx to true in check_spend_local_transaction.
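In other words, something along these lines (a simplified signature, not the crate's actual code): treat the confirmed tx as local only when its txid matches the current or previous local commitment txid.

```rust
/// Returns true only if `commitment_txid` actually matches the txid of the
/// current or previous local commitment, rather than merely checking that
/// those fields are populated.
fn is_local_commitment(
    commitment_txid: &[u8; 32],
    current_local_txid: Option<&[u8; 32]>,
    prev_local_txid: Option<&[u8; 32]>,
) -> bool {
    current_local_txid.map_or(false, |txid| txid == commitment_txid)
        || prev_local_txid.map_or(false, |txid| txid == commitment_txid)
}
```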
Force-pushed from df16818 to 0d3575a
Updated at 0d3575a per your last review. I think we need to think more about how we handle timelocks given #336 (comment), #336 (comment) and #336 (comment).
src/ln/channelmanager.rs
Outdated
	/// before to close channel. Doing it to early would block us to pass backward
	/// preimage in case of successful claim on onward channel. Key is block
	/// height timer, value is array of channels id, waiting their closing.
	pub(super) channel_closing_waiting_threshold_conf: HashMap<u32, Vec<[u8; 32]>>,
Wait, why? If our counterparty broadcasts a commitment transaction and it gets reorged out, we shouldn't try to relay payments over the channel; that commitment transaction is probably gonna be mined again in a block soon, so we should just move forward happily with a closed channel.
Oh, I'm dumb, that's exactly what my test was trying to do, i.e. testing some edge cases which is doable but doesn't make sense at a high level. And since that's where this nonsense struct in channel_manager came from, I've dropped the commit and rewrote test_sweep_outbound_htlc_failure_update.
Broadcasting a commitment tx means we have to fail the inbound HTLCs on the backward channel. Doing it prematurely would put us at risk in case of a reorg, so we delay passing the failure update upstream until the resolving tx matures to HTLC_FAIL_ANTI_REORG_DELAY. Requirements differ depending on whether the HTLC is revoked, non-revoked dust, or non-revoked non-dust. Add connect_blocks in test_utils to fix tests broken by the anti-reorg delay enforcement. Remove the anti-duplicate HTLC update stuff in ManySimpleChannelMonitor.
Modify ChainListener API by adding height field to block_disconnect
Fix tests broken by introduced change
Add test_failure_delay_htlc_local_commitment and test_no_failure_dust_htlc_local_commitment. Move some bits of check_spend_remote, as we need to fail dust HTLCs which can be spread across both the previous and latest local commitment txs.
Add test_sweep_outbound_htlc_failure_update
We also need to track claim txs until their maturation, to know when we may safely remove them from the could-be-bumped-txn buffer.
Force-pushed from ce5135f to ba511d5
I think this is pretty much ready to go.
src/ln/channelmonitor.rs
Outdated
			//- htlc update there as failure-trigger tx (revoked commitment tx, non-revoked commitment tx, HTLC-timeout tx) has been disconnected
			//- our claim tx on a commitment tx output
		}
		self.our_claim_txn_waiting_first_conf.retain(|_, ref mut v| if v.0 == height + 3 { false } else { true });
Isn't this now wrong given get_height_timer? Maybe also explicitly track the block where we saw the thing being spent, to avoid any confusion.
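A sketch of that suggestion with simplified types: record the height at which each claim was generated and drop exactly those entries when that block is disconnected, instead of reconstructing the height from a fixed offset.

```rust
use std::collections::HashMap;

/// Simplified tracking entry for a pending claim tx: the height at which we
/// generated it and the height at which we next intend to bump it.
struct ClaimEntry {
    created_at_height: u32,
    height_timer: u32,
}

/// On a block disconnection, forget the claims generated at the disconnected
/// height itself; the bump timer (`height_timer`) is left out of the decision.
fn block_disconnected(
    pending: &mut HashMap<([u8; 32], u32), ClaimEntry>, // keyed by (txid, vout)
    disconnected_height: u32,
) {
    pending.retain(|_, entry| entry.created_at_height != disconnected_height);
}
```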
src/ln/channelmanager.rs
Outdated
@@ -28,7 +28,7 @@ use secp256k1;
use chain::chaininterface::{BroadcasterInterface,ChainListener,ChainWatchInterface,FeeEstimator};
use chain::transaction::OutPoint;
use ln::channel::{Channel, ChannelError};
use ln::channelmonitor::{ChannelMonitor, ChannelMonitorUpdateErr, ManyChannelMonitor, CLTV_CLAIM_BUFFER, HTLC_FAIL_TIMEOUT_BLOCKS, HTLC_FAIL_ANTI_REORG_DELAY};
use ln::channelmonitor::{ChannelMonitor, ChannelMonitorUpdateErr, ManyChannelMonitor, CLTV_CLAIM_BUFFER, GRACE_PERIOD_BLOCKS, ANTI_REORG_DELAY};
New names are nice, but GRACE_PERIOD_BLOCKS isn't sufficiently clear, IMO; maybe NETWORK_SYNC_GRACE_PERIOD_BLOCKS?
Force-pushed from ba511d5 to 9d8bfa2
Rename HTLC_FAIL_ANTI_REORG_DELAY to ANTI_REORG_DELAY because we are also going to rely on it to remove bump-candidate outpoints from the tracker once a claim gets enough depth. Rename HTLC_FAIL_TIMEOUT_BLOCKS to LATENCY_GRACE_PERIOD_BLOCKS because it conveys more meaningfully that we are doing our peer a favor instead of ruthlessly enforcing the contract. CLTV_EXPIRY_DELTA should be > LATENCY_GRACE_PERIOD_BLOCKS + CLTV_CLAIM_BUFFER + ANTI_REORG_DELAY + LATENCY_GRACE_PERIOD_BLOCKS. When we reach height + LATENCY_GRACE_PERIOD_BLOCKS and we have a pending unresolved outbound HTLC, we fail on-chain with our local commitment tx. At this point we expect to get into the chain within a worst-case delay of CLTV_CLAIM_BUFFER. Once our HTLC-timeout is confirmed to ANTI_REORG_DELAY we may safely fail backward the corresponding inbound output.
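The stated relation between these constants can be written down as a simple check; the values below are placeholders for illustration, not the crate's real ones.

```rust
// Placeholder values, for illustration only.
const LATENCY_GRACE_PERIOD_BLOCKS: u32 = 3;
const CLTV_CLAIM_BUFFER: u32 = 6;
const ANTI_REORG_DELAY: u32 = 6;
const CLTV_EXPIRY_DELTA: u32 = 72;

/// The budget described above: one grace period to notice the unresolved HTLC,
/// a claim buffer to get our HTLC-timeout confirmed, the anti-reorg delay
/// before failing backward, and the upstream peer's own grace period.
fn cltv_expiry_delta_is_sufficient() -> bool {
    CLTV_EXPIRY_DELTA
        > LATENCY_GRACE_PERIOD_BLOCKS + CLTV_CLAIM_BUFFER + ANTI_REORG_DELAY + LATENCY_GRACE_PERIOD_BLOCKS
}
```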
When we generate a justice tx, an HTLC tx on a remote commitment, or an HTLC tx on a local commitment, we track it until its first confirmation.
		for (input, info) in spend_tx.input.iter_mut().zip(inputs_info.iter()) {
			let (redeemscript, revocation_key) = sign_input!(sighash_parts, input, info.0, info.1);
			let height_timer = Self::get_height_timer(height, info.2);
			match self.our_claim_txn_waiting_first_conf.entry(input.previous_output.clone()) {
Doesn't need to happen in this PR, but in #347 this is gonna bite us - we don't track that all of these inputs are spent in the same transaction, with different expiries, so we may try to bump only one input, and then end up getting that confirmed, leaving nothing in the mempool trying to punish other inputs.
We must adapt the delay between two bumps of a claim tx with respect to the timelock encumbering the targeted outpoint. If the HTLC or revoked output is near expiry, we should try to get our claim in every block. If it's reasonably far in the future, we can give ourselves more latency before bumping.
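A sketch of that scheduling idea with invented thresholds (the real get_height_timer may choose different cut-offs): the closer the encumbering timelock is to expiring, the sooner the next bump attempt is scheduled.

```rust
/// Pick the height at which to next (re)broadcast or fee-bump a claim tx,
/// based on how soon the claimed output's timelock expires.
fn get_height_timer(current_height: u32, timelock_expiration: u32) -> u32 {
    let remaining = timelock_expiration.saturating_sub(current_height);
    if remaining <= 3 {
        // Expiring imminently: retry at the very next block.
        current_height + 1
    } else if remaining <= 15 {
        // Moderately close: bump every few blocks.
        current_height + 3
    } else {
        // Comfortably far away: a lazier bump cadence is fine.
        current_height + 15
    }
}
```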
Force-pushed from 9d8bfa2 to 757bcc2
#333 without WIP Bump Tx stuff