Skip to content

Include short channel id in PaymentPathFailed #1077

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Oct 13, 2021

Conversation

jkczyz
Copy link
Contributor

@jkczyz jkczyz commented Sep 17, 2021

Update process_onion_failure to return a short channel id of the responsible channel and include it in PaymentPathFailed events. This will allow for scoring channels based on past performance when finding the best payments paths in get_route.

@codecov
Copy link

codecov bot commented Sep 17, 2021

Codecov Report

Merging #1077 (32c89cc) into main (ad819ea) will increase coverage by 1.42%.
The diff coverage is 90.74%.

❗ Current head 32c89cc differs from pull request most recent head 202acd9. Consider uploading reports for the commit 202acd9 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1077      +/-   ##
==========================================
+ Coverage   90.64%   92.06%   +1.42%     
==========================================
  Files          65       66       +1     
  Lines       34226    41090    +6864     
==========================================
+ Hits        31024    37831    +6807     
- Misses       3202     3259      +57     
Impacted Files Coverage Δ
lightning/src/util/events.rs 37.38% <40.00%> (+5.38%) ⬆️
lightning/src/ln/channelmanager.rs 88.88% <75.00%> (+4.02%) ⬆️
lightning/src/ln/onion_utils.rs 95.37% <90.47%> (+0.45%) ⬆️
lightning/src/ln/onion_route_tests.rs 96.64% <96.00%> (-0.46%) ⬇️
lightning/src/ln/functional_tests.rs 98.29% <100.00%> (+0.85%) ⬆️
lightning/src/routing/network_graph.rs 93.83% <100.00%> (+2.29%) ⬆️
lightning/src/chain/mod.rs 55.55% <0.00%> (-3.27%) ⬇️
lightning-block-sync/src/lib.rs 93.49% <0.00%> (-1.69%) ⬇️
lightning-net-tokio/src/lib.rs 76.69% <0.00%> (-0.59%) ⬇️
lightning/src/util/ser_macros.rs 86.51% <0.00%> (-0.33%) ⬇️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad819ea...202acd9. Read the comment docs.

/// A path that included the failing [`RouteHop`].
pub path: Vec<RouteHop>,

/// The index of the failing [`RouteHop`] within the path.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The semantics of this are really confusing. Most of that is just cause RouteHop confusingly refers to both a node and a channel, which is itself confusing, but its compounded here.

In general, I think we have three choices here

  • take the complicated approach and have logic here to indicate if we're blaming a node or a channel.
  • take the moderate and decide which of the inbound or outbound channels to blame based on what the responding node said or,
  • take the easy approach and always blame the outbound channel from the node that we received the failure from.

I think the easy approach is basically what you do here, and ok for a firstpass - we'll never take the same path, which is good, but we may disable a channel that we really shouldn't.

In any case, I think this needs to be much clearer, and its unclear to me why we need this full struct when we could just return a short channel id.

Copy link
Contributor Author

@jkczyz jkczyz Sep 18, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case, I think this needs to be much clearer, and its unclear to me why we need this full struct when we could just return a short channel id.

We'll need the path for retry (to compute the amount), but I tend to agree it may be better to keep the path separate. It would at least be symmetric with the forthcoming event for a successful payment path.

As far as the scoring API goes, I had planned on giving a reference to the path and to the failing hop. But the latter could be the short channel id instead. Users could always scan the path if they want to find the matching hop, I suppose.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I guess could go either way, but it does feel cleaner imo to just give the SCID instead of a reference to a path id - you're just going to go update the SCID in the graph anyway, right?

failing_route_hop = if payment_failed { None } else {
Some(FailingRouteHop {
path: path.clone(),
index: if is_from_final_node { route_hop_idx } else { route_hop_idx + 1 } as u16,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should at least make a tiny bit of effort to decide if the error code seems to be blaming the inbound or outbound channel, instead of always saying outbound for not-final-node and inbound for final-node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was primarily matching the logic used for the NetworkUpdate above. Would we want to do something similar there?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, we should maybe do something similar there, and also map some things that are currently channel failures to other types of failures. The code here is pretty old and admittedly a bit lazy. This comment is maybe on the wrong line, though - I guess the bigger question is later on in the non-PERM blocks. Maybe its ok to handle this stuff later and just land it as-is for now, as long as we get the API where we want.

@TheBlueMatt
Copy link
Collaborator

Could use a rebase, not that we have the path itself now in PaymentPathFailed, just not the failing hop.

@jkczyz jkczyz force-pushed the 2021-09-failing-route-hop branch from 02b4f09 to 6c23617 Compare September 28, 2021 21:52
@TheBlueMatt TheBlueMatt added this to the 0.0.102 milestone Sep 29, 2021
@jkczyz jkczyz force-pushed the 2021-09-failing-route-hop branch 2 times, most recently from 735414a to b6536a1 Compare September 30, 2021 07:08
Comment on lines 36 to 39
pub struct Scorer<S: routing::Score> {
scoring: S,
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have this struct rather than just always using the generic inner object?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I'm a bit confused why we have two structs here instead of just defining a new trait that requires both EventHandler and routing::Score?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally did this, but then the user would have to implement EventHandler on their own. Figure we can correctly implement the boilerplate for them and let them focus on implementing their scoring logic. One less thing for them to do and get wrong. Especially if the events need to be processed differently in the future.

There are other advantages to this approach. It lets us vary Scorer's interface without users needing to update their code. Any "default" scoring could be pushed directly into Scorer if desired. Plus, Scorer doesn't require payment_path_failed and payment_path_successful, so those doesn't need to be part of the interface.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I guess as a user it just looks confusing to expose all the wrapper guts - I've got a score object and a scorer trait and I have to construct the scorer object with my score-implementing thing then use the score wrapper that does.... magic?

Does it make more sense to do the adapting between events and path-fail-succeed calls in a wrapper in the background processor? Then we'd still get the nice adapted interface but wouldn't have to expose an awkward wrapper to the world.

More generally, though, I think we're optimizing for the wrong thing here - the common user behavior will be one of (a) constant penalty, (b) our default that uses payment data, (c) data fetched from a server that does active probing.

I don't know how much demand there really is for using payment data with a user-provider scorer that we need to go through the effort to have a public event-score-thing wrapper.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I guess as a user it just looks confusing to expose all the wrapper guts - I've got a score object and a scorer trait and I have to construct the scorer object with my score-implementing thing then use the score wrapper that does.... magic?

That's overcomplicating what's happening, IMHO. Our interfaces take Scorer. Docs say it is parameterized by scoring logic. User either (a) decides they want custom logic and implement the parameterized trait or (b) use Scorer::default().

Does it make more sense to do the adapting between events and path-fail-succeed calls in a wrapper in the background processor? Then we'd still get the nice adapted interface but wouldn't have to expose an awkward wrapper to the world.

Hmmm... if anything given InvoicePayer takes a Router maybe it's best to push the event handling logic there? The Scorer (or whatever it ends up being) will need to make its way there somehow. So then it seems a Router implementation would need the scoring parameterization and path-fail-succeed functions, and InvoicePayer could call the latter as needed.

Thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's overcomplicating what's happening, IMHO.

Yea, that's probably right, I just wanted to highlight by being a bit facetious that we've seen some user confusion and we should probably lean hard towards "simpler" APIs where possible.

Hmmm... if anything given InvoicePayer takes a Router maybe it's best to push the event handling logic there?

I'd buy that. That feels right.

@@ -203,6 +203,8 @@ pub enum Event {
all_paths_failed: bool,
/// The payment path that failed.
path: Vec<RouteHop>,
/// The channel responsible for the failed payment path.
short_channel_id: Option<u64>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it seems like network_update already effectively contains the short_channel_id(s). Explicitly if NetworkUpdate::ChannelUpdateMsg or ::ChannelClosed, and implicitly if NodeFailure (i.e. the relevant SCIDs would be all of that node's channels)

Possible to omit this field for those reasons?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are a few cases in test_onion_failure that test None for network_update and Some for short_channel_id. Then again, maybe I'm setting it when not appropriate?

@TheBlueMatt For NodeFailure, would negatively scoring all the nodes channels be appropriate here? We'd need to either include them all or have Scorer have a reference to the NetworkGraph so that it can call Score::payment_path_failed for each channel. But that would be strange since the given path wouldn't necessarily contain the short_channel_id. And there would be no simple way to improve the score for all the node's channels if a path containing the node is eventually successful.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, CHannelUpdateMsg is set when the failure-generating node sets the UPDATE flag and we were able to parse a valid ChannelUpdate from that peer. eg if the peer fails to return a ChannelUpdate we definitely want to de-score the channel, cause they're returning garbage :).

For NodeFailure, would negatively scoring all the nodes channels be appropriate here?

I don't think there's a "right" answer for any of this. I don't have a gut feel for whether nodes will return, eg, temporary node failures that are per-channel or per-node, so for now lets keep it simple?

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to do the scorer and get_route->get_route_and_payment_hash changes in a new PR, to better parallelize landing the first few commits?

Comment on lines 36 to 39
pub struct Scorer<S: routing::Score> {
scoring: S,
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I'm a bit confused why we have two structs here instead of just defining a new trait that requires both EventHandler and routing::Score?

@jkczyz
Copy link
Contributor Author

jkczyz commented Oct 4, 2021

Does it make sense to do the scorer and get_route->get_route_and_payment_hash changes in a new PR, to better parallelize landing the first few commits?

SGTM

@jkczyz jkczyz force-pushed the 2021-09-failing-route-hop branch 2 times, most recently from f050f64 to 44156ba Compare October 7, 2021 04:26
jkczyz added 2 commits October 6, 2021 23:31
With channel scoring and payment retries, it is no longer necessary to
have expiry_too_far imply a payment failure.
This simplifies failing route hop calculation, which will be useful for
later identifying the failing hop for PaymentFailed events.
@jkczyz jkczyz force-pushed the 2021-09-failing-route-hop branch from 44156ba to d4c85b3 Compare October 7, 2021 04:36
@jkczyz jkczyz changed the title Include FailingRouteHop in PaymentFailed Include short channel id in PaymentPathFailed Oct 7, 2021
@jkczyz jkczyz marked this pull request as ready for review October 7, 2021 04:48
Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, one comment then I'm ACK, will need some test tweaks, I think.

@@ -446,10 +452,14 @@ pub(super) fn process_onion_failure<T: secp256k1::Signing, L: Deref>(secp_ctx: &
});
}

if short_channel_id.is_none() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I don't think we want to set scid when payment_failed as its things like "unknown payment hash" which shouldn't factor into scoring (when they come from the last-hop). However, we'll need to exempt the 18|19 cases from this which imply someone on the route screwed with the payment, probably the second-to-last-hop.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, latest fixup should reflect this, IIUC. Think I'm getting a better feel for this code.

run_onion_failure_test("fee_insufficient", 0, &nodes, &route, &payment_hash, &payment_secret, |msg| {
msg.amount_msat -= 1;
}, || {}, true, Some(UPDATE|12), Some(NetworkUpdate::ChannelClosed { short_channel_id: channels[0].0.contents.short_channel_id, is_permanent: true}));
}, || {}, true, Some(UPDATE|12), Some(NetworkUpdate::ChannelClosed { short_channel_id, is_permanent: true}), None);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect fee_insufficient to include an SCID, especially if we're marking the channel as closed - a few things here:

  • are we generating invalid fee_insufficient responses (ie not including the ChannelUpdate object)?
  • whether we are or not, we probably should have a SCID to blame if we get a failure with UPDATE but no ChannelUpdate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd expect fee_insufficient to include an SCID, especially if we're marking the channel as closed - a few things here:

  • are we generating invalid fee_insufficient responses (ie not including the ChannelUpdate object)?

Looks like we are here: https://github.com/rust-bitcoin/rust-lightning/blob/d66574803e08c65db0e5cec66a3ac7ba5a68bf4f/lightning/src/ln/channelmanager.rs#L1847

  • whether we are or not, we probably should have a SCID to blame if we get a failure with UPDATE but no ChannelUpdate.

Updated accordingly.

@@ -362,7 +373,7 @@ fn test_onion_failure() {
let session_priv = SecretKey::from_slice(&[3; 32]).unwrap();
let onion_keys = onion_utils::construct_onion_keys(&Secp256k1::new(), &route.paths[0], &session_priv).unwrap();
msg.reason = onion_utils::build_first_hop_failure_packet(&onion_keys[0].shared_secret[..], NODE|2, &[0;0]);
}, ||{}, true, Some(NODE|2), Some(NetworkUpdate::NodeFailure{node_id: route.paths[0][0].pubkey, is_permanent: false}));
}, ||{}, true, Some(NODE|2), Some(NetworkUpdate::NodeFailure{node_id: route.paths[0][0].pubkey, is_permanent: false}), None);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - temporary_node_failure definitely needs an SCID to blame, otherwise we'll probably retry the same path through a node we dont know how to talk to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. That should have been an obvious one lol

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise we'll probably retry the same path through a node we dont know how to talk to.

Why does this apply to temporary node failures but not e.g. the BADONION permanent ones?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely applies to BADONION as well! Good catch. It should apply any time the payment is retryable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sg, FYI I interpret retryable == "not rejected by dest"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, we may need to debate changing that name, but the docs basically say "you can retry unless this is set"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added handling of BADONION but couldn't figure out a way to test this. 😕

Will update the docs but I'm afraid the logic is complicated enough that I'm not certain short_channel_id is always set when process_onion_failure returns that it is retryable. Seems there are a few test cases where it is true but we expect None, in fact.

Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not too familiar with this code area but it's looking pretty sane! The redundant short_channel_id field makes me uncomfy but I can live with it for expediency :P

} else {
Some(NetworkUpdate::ChannelUpdateMessage {
if failing_route_hop.short_channel_id == chan_update.contents.short_channel_id {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a comment here could be useful, to know the reasoning for when to set short_channel_id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

18|19 => Some(route_hop.short_channel_id),
_ => None,
};
} else {
// We can't understand their error messages and they failed to
// forward...they probably can't understand our forwards so its
// really not worth trying any further.
network_update = Some(NetworkUpdate::NodeFailure {
node_id: route_hop.pubkey,
is_permanent: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem odd to generate a permanent node failure without banning all of that node's channels. Should there be a todo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think of it, processing this NetworkUpdate should result in the node and all its channels being removed from NetworkGraph when the event is handled by NetGraphMsgHandler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I'm mistaken here. This happens through NetworkGraph's interface but our code for processing NodeFailure just has a TODO already.

@jkczyz jkczyz force-pushed the 2021-09-failing-route-hop branch from 6553205 to 38a928f Compare October 11, 2021 19:29
Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the onion tests in more detail, seem to make sense. Two small questions then I'm ACK 💯

Appreciate the cleanups in process_onion_failure! Readability 🚀

let (_, payment_hash, payment_secret) = get_payment_preimage_hash!(nodes[2]);

run_onion_failure_test("invalid_onion_version", 0, &nodes, &route, &payment_hash, &payment_secret, |msg| { msg.onion_routing_packet.version = 1; }, ||{}, true,
Some(BADONION|PERM|4), None);
Some(BADONION|PERM|4), None, None);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, these tests seem to have retryable but no update, so we'd always retry the same path in a loop.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! Jeff pointed out these errors are coming from our immediate peer, not a few-hops away one. Ideally we'd have tests for both cases, but these specific cases should not include an SCID I think - punishing our own channel isn't really helpful, we should be closing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment here as discussed offline.

@jkczyz jkczyz force-pushed the 2021-09-failing-route-hop branch from b301aae to cfb8030 Compare October 12, 2021 19:10
@@ -2985,6 +2987,7 @@ impl<Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref> ChannelMana
network_update: None,
all_paths_failed,
path: path.clone(),
short_channel_id: Some(path.first().unwrap().short_channel_id),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really a huge fan of "punishing" our own channels - on one hand its nice that the payment goes through, but on the other hand it seems like we're addressing the problem in the wrong way - we have a peer who isn't able to read our HTLC-add messages, we shouldn't just stop sending payments their way, we should close the channel with them and disconnect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Punishment only occurs during event handling, so this is just signaling which is the responsible channel. The event handler could check if short_channel_id is the same as the first hop in the path and act accordingly. For instance, InvoicePayer could not pass the hop to the scorer in this case. And another event handler (perhaps in BackgroundProcessor?) could close the channel. Thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would make sense, I suppose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could make it an enum

OurChannel(u64)
NetworkChannel(u64)
NotRetryable

Just some inspo, I'm fine with it as is

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this either way, but do want to discuss punishing our own channels.

Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this either way, but do want to discuss punishing our own channels.

run_onion_failure_test("fee_insufficient", 0, &nodes, &route, &payment_hash, &payment_secret, |msg| {
msg.amount_msat -= 1;
}, || {}, true, Some(UPDATE|12), Some(NetworkUpdate::ChannelClosed { short_channel_id: channels[0].0.contents.short_channel_id, is_permanent: true}));
Copy link
Contributor

@valentinewallace valentinewallace Oct 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this was already here but why would UPDATE|12 result in NetworkUpdate::ChannelClosed instead of NetworkUpdate::ChannelUpdate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, I had to go look up the same cause it was confusing - its because the update we got back shouldn't have resulted in a payment failure. They're telling us the fee they charged was, in fact, enough for the payment, implying the previous-hop node malleated the payment before forwarding, taking funds they weren't supposed to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. @jkczyz think we could add a comment for this one since we're here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, this is the case where fee_insufficient is returned but shouldn't have been based on our calculation using the data given to us in the update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, didn't notice last two comments before replying. I'll add a comment.

Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, I think after these comments + one outstanding I'm happy

@@ -333,9 +343,10 @@ fn test_onion_failure() {
// describing a length-1 TLV payload, which is obviously bogus.
new_payloads[0].data[0] = 1;
msg.onion_routing_packet = onion_utils::construct_onion_packet_bogus_hopdata(new_payloads, onion_keys, [0; 32], &payment_hash);
}, ||{}, true, Some(PERM|22), Some(NetworkUpdate::ChannelClosed{short_channel_id: channels[1].0.contents.short_channel_id, is_permanent: true}));
}, ||{}, true, Some(PERM|22), Some(NetworkUpdate::ChannelClosed{short_channel_id, is_permanent: true}), Some(short_channel_id));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also already here but I think NetworkUpdate::ChannelClosed needs a rename or a docs update at some point if it's used in situations like these, given no channel was closed iiuc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. ChannelFailure would be a better name. We either disable or remove it from the NetworkGraph depending on is_permanent, which I guess affects gossip. But, indeed, the channel may not actually be closed.

let (_, payment_hash, payment_secret) = get_payment_preimage_hash!(nodes[2]);

// Our immediate peer sent UpdateFailMalformedHTLC because it couldn't understand the onion in
// the UpdateAddHTLC that we sent. Thus, short_channel_id is None because there's no sense
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment needs updating for Some short_channel_id

Could also add a note for why NetworkUpdate is None (is it the same reason, i think?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... @TheBlueMatt for these test cases presumably we don't give a NetworkUpdate since it's our immediate peer. But for the case of BADONION inside process_onion_failure, should we return NetworkUpdate::ChannelClosed? The latter case is currently untested as noted earlier.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this corresponds to the TODO at https://github.com/rust-bitcoin/rust-lightning/blob/main/lightning/src/ln/channelmanager.rs#L3099 noting that updating the network graph won't do anything. For BADONION from a later node, I'd agree a ChannelClosed makes sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Come to think of it, since all BADONION failures are also PERM then we could probably drop the BADONION handling code in process_onion_failure and let the PERM handling code take care of this.

let (_, payment_hash, payment_secret) = get_payment_preimage_hash!(nodes[2]);

run_onion_failure_test("final_expiry_too_soon", 1, &nodes, &route, &payment_hash, &payment_secret, |msg| {
let height = msg.cltv_expiry - CLTV_CLAIM_BUFFER - LATENCY_GRACE_PERIOD_BLOCKS + 1;
connect_blocks(&nodes[0], height - nodes[0].best_block_info().1);
connect_blocks(&nodes[1], height - nodes[1].best_block_info().1);
connect_blocks(&nodes[2], height - nodes[2].best_block_info().1);
}, || {}, true, Some(17), None);
}, || {}, true, Some(17), None, None);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

final_expiry_to_soon case should be retryable/have a Some scid IIUC

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, yea, I think we should have a Some here. Specifically, lightning/bolts#922 may cause a node to bounce our payment because of simple incompatibility, but we should retry the payment via another path. This means we should punish the node, but sadly not because they're actually misbehaving, just cause lightning sucks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is using test case 1 which indicates the final node failed, so it is not retryable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is why our definition of retryable/rejected_by_dest needs a little bit of nuance/description since that still sounds retryable to me, even though it was the final node that made that call. Fine for now, though

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC we already give a buffer of a block or two, so its possible we hit that because some intermediary node delayed the payment by two blocks, but short of that the recipient is rejecting the payment for a bogus reason. Agree all of this is super subtle and some things can have multiple meanings. Can't solve that here, though :/

run_onion_failure_test("fee_insufficient", 0, &nodes, &route, &payment_hash, &payment_secret, |msg| {
msg.amount_msat -= 1;
}, || {}, true, Some(UPDATE|12), Some(NetworkUpdate::ChannelClosed { short_channel_id: channels[0].0.contents.short_channel_id, is_permanent: true}));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see. @jkczyz think we could add a comment for this one since we're here?

@TheBlueMatt
Copy link
Collaborator

Created #1117 to track any followup stuff.

@jkczyz
Copy link
Contributor Author

jkczyz commented Oct 12, 2021

Ok, I think I addressed the remaining comments, but let me know if I missed something. Lots of moving parts. 😬

Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fielding all this feedback! LGTM

@jkczyz jkczyz mentioned this pull request Oct 12, 2021
Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs squash.

This will be useful for scoring channels when a payment fails.
@jkczyz jkczyz force-pushed the 2021-09-failing-route-hop branch from 32c89cc to 202acd9 Compare October 12, 2021 23:41
Copy link
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One issue, but lets look at it in a followup, we need to move forward here.

@TheBlueMatt TheBlueMatt merged commit fe8c10d into lightningdevkit:main Oct 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants