Initiate gossip_queries #736


Merged: 12 commits into lightningdevkit:main on Dec 15, 2020

Conversation

@bmancini55 (Contributor) commented Oct 3, 2020

Part 1 for initiating gossip_queries for #615. This PR includes:

  • Adds feature flag
  • Adds msg types to wire enum
  • Adds SendMessageEvents (a rough sketch of the new event variants follows this list)
  • Adds message handlers to RoutingMessageHandler
  • Implements querying logic in NetGraphMsgHandler
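For orientation, here is a rough, self-contained sketch of the two gossip-query event variants this adds to the send-message event plumbing. It is not the exact diff: the types below are stand-ins, and the variant names are illustrative (the review below quotes a SendChannelsQuery variant and a msgs::QueryShortChannelIds field from an earlier iteration).

// Sketch only: events the routing handler uses to ask the peer manager to
// send gossip queries to a specific peer. All types here are stand-ins.
struct PublicKey;              // stand-in for the peer's node_id type
struct QueryChannelRange;      // stand-in wire message
struct QueryShortChannelIds;   // stand-in wire message

enum MessageSendEvent {
    /// Ask a peer to enumerate the channels confirmed in a block range.
    SendChannelRangeQuery { node_id: PublicKey, msg: QueryChannelRange },
    /// Ask a peer for the routing gossip of specific short_channel_ids.
    SendShortIdsQuery { node_id: PublicKey, msg: QueryShortChannelIds },
}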

@julianknutsen commented Oct 4, 2020

I would suggest basing your peer_handler changes on top of #714

It's a major restructure with unit tests that will save you some merge headaches down the road.

FYI: @ariard @arik-so

@TheBlueMatt (Collaborator) left a comment


Structure looks good modulo one note; it would be nice to get at least some tests of the message handler implementation in NetGraphMsgHandler.

@@ -826,6 +826,33 @@ pub trait RoutingMessageHandler : Send + Sync {
fn should_request_full_sync(&self, node_id: &PublicKey) -> bool;
}


/// A trait that describes an object that can receive gossip queries messages
Collaborator

I think we should just integrate this into RoutingMessageHandler. Indeed, it would make RoutingMessageHandler much more complex, but that's the way the protocol went, and the existing functions are not very useful without this, too. That would also create significantly less conflicts with other work, as Julian notes.

Contributor Author

Sounds good, I've been thinking the same thing. Thanks!

@bmancini55 (Contributor Author)

I would suggest basing your peer_handler changes on top of #714

@julianknutsen, thanks I'll check it out.

@ariard left a comment

Overall looks good! Mainly see my point on DisconnectPeer: I would be more liberal with gossip queries errors from remote peers, as severing connections has consequences for channel processing. Also, in those cases the BOLT 7 recommendation is a MAY, so we really have room here. I wouldn't be surprised if this protocol is a bit buggy in other implementations, and we should avoid error contamination.

MessageSendEvent::SendChannelsQuery { ref node_id, ref msg } => {
log_trace!(self.logger, "Handling SendChannelsQuery event in peer_handler");
let (mut descriptor, peer) = get_peer_for_forwarding!(node_id, {
//TODO: Do whatever we're gonna do for handling dropped messages

I think we can just do nothing here. I assume that gossip queries will be scheduled by NetGraphMsgHandler until it is "satisfied" with its view of the network, and it should be ready to deal with peers that have a poor view of the network, which is the same as not finding the peer.

If this is too slow and peers are too unreliable, we may add an Event, but honestly let's not overload the number of events to manage.

/// The node_id of the node which should receive this message
node_id: PublicKey,
/// The message which should be sent.
msg: msgs::QueryShortChannelIds,

I think a worthy follow-up would be to coordinate better between query-gossips-at-connection and query-gossips-when-network-view-is-bad. AFAICT this patchset aims to query on each connection, which might be too bandwidth-hungry. I was thinking you could have a timer_refresh_network_view_with_queries in NetGraphMsgHandler on the model of PeerHandler::timer_tick_occured().
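To make that suggestion concrete, a hypothetical, self-contained sketch of such a timer hook (the staleness window, the peer-picking, and all helper names are assumptions, not code from this PR):

// Hypothetical sketch: re-query the network view on a timer rather than on
// every connection. Stand-in types; the 144-block staleness window is arbitrary.
type NodeId = [u8; 33];

struct NetGraphRefreshSketch {
    last_full_sync_height: u32,
    current_height: u32,
    sync_peers: Vec<NodeId>,
}

impl NetGraphRefreshSketch {
    // Called on a fixed interval, on the model of PeerHandler::timer_tick_occured().
    fn timer_refresh_network_view_with_queries(&mut self) {
        let stale = self.current_height.saturating_sub(self.last_full_sync_height) > 144;
        if stale {
            if let Some(peer) = self.sync_peers.first().copied() {
                // Full range: ask for every channel from genesis onward.
                self.send_query_channel_range(peer, 0, 0xffff_ffff);
                self.last_full_sync_height = self.current_height;
            }
        }
    }

    fn send_query_channel_range(&self, _peer: NodeId, _first_blocknum: u32, _number_of_blocks: u32) {
        // In the real handler this would enqueue a gossip-query send event.
    }
}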

Contributor Author

Ah yes, sorry about that. I was doing some dev testing and left code that performs a sync on each connection. I still need to clean up the PeerHandler code to respect should_request_full_sync.

Thanks, I'll check out PeerHandler::timer_tick_occured(). For periodic anti-entropy, it might be nice to connect to a random node from the network view and attempt a sync.

@bmancini55 (Contributor Author)

Updates:

  • Changed the handler methods to use IgnoreError instead of DisconnectPeer
  • Added tests for gossip-related methods in NetGraphMsgHandler. Some are broader in scope than unit tests since I wanted to ensure the task setup wasn't faked. Some of the test code is also a bit repetitive; I like making test code as straightforward as possible. Happy to try to reduce some of the repetitiveness if that fits the project style better.

More to follow!

@ariard commented Oct 15, 2020

@bmancini55 Is this ready for another round of review :) ?

EDIT: I'll have a look at the new tests anyway.

@bmancini55 (Contributor Author)

@bmancini55 Is this ready for another round of review :) ?

EDIT: I'll have a look at the new tests anyway.

@ariard I'll be adding more commits this week, so not quite there yet for a complete look. The tests should be good to review though, thanks!

@ariard left a comment

So far the code is pretty well-documented and coverage seems to be exhaustive (missing handle_query_channel_range/handle_query_short_channels_ids IIUC).

Still, have a look at the mutation I provided and the point about the task chain hash; I'm not sure it's correct.

If you have follow-ups in mind, you're welcome to track them in an appropriate issue; let's avoid this PR becoming obese.

let first_blocknum = 0;
let number_of_blocks = 0xffff_ffff;

// When no active query exists for the node, it should send a query message and generate a task

I wonder if the active query should time out at some point in case of an unresponsive peer?

Contributor Author

Definitely agree on this point. Also need to clean up tasks on peer disconnection.

let mut query_range_tasks_lock = self.query_channel_range_tasks.lock().unwrap();
let query_range_tasks = &mut *query_range_tasks_lock;
if query_range_tasks.contains_key(their_node_id) {
return Err(LightningError {

If it's a caller requirement to track the channel_range query, I think this might be a hint of a higher-level logic bug. Though I don't have a clear model of this method's usage.

Contributor Author

My thinking is that this will be called from peer_handler after init negotiation, when should_request_full_sync returns true. This guard prevents multiple in-flight query_channel_range messages, which would be spec non-compliant. This could be changed to queue channel range queries, but I thought outright rejection was simpler for now.

Relatedly, my plan is to make full_syncs_requested decrement on any failure in handle_reply_channel_range so that a full sync can be requested with the next peer.
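For readers following along, a minimal, self-contained sketch of the guard described here as it stood in this iteration (stand-in types; the real code keys a Mutex<HashMap<..>> by the peer's node_id, and this state tracking was later dropped in favor of the stateless approach):

use std::collections::HashMap;

// Stand-in types for illustration only.
type NodeId = [u8; 33];
struct ChanRangeQueryTask { first_blocknum: u32, number_of_blocks: u32 }
struct LightningError { err: String }

struct RangeQueryGuard { tasks: HashMap<NodeId, ChanRangeQueryTask> }

impl RangeQueryGuard {
    // Reject a second query_channel_range to the same peer while one is still
    // in flight, since the spec allows only one outstanding query per type.
    fn query_channel_range(&mut self, their_node_id: NodeId, first_blocknum: u32, number_of_blocks: u32)
        -> Result<(), LightningError>
    {
        if self.tasks.contains_key(&their_node_id) {
            return Err(LightningError { err: "query_channel_range already in flight".to_owned() });
        }
        self.tasks.insert(their_node_id, ChanRangeQueryTask { first_blocknum, number_of_blocks });
        // ...then enqueue the query_channel_range message for sending...
        Ok(())
    }
}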


Right, we can switch it later to an assert when higher sync logic is implemented.

@bmancini55 (Contributor Author)

@ariard 🤦 major apologies for that sloppiness!! I have fixed the issues with chain_hash and reply fields. Also took your advice to harden by asserting the message. I added tests for handle_query_channel_range and handle_query_short_channels_ids and renamed send_channels_query as requested.

If you have follow-ups in mind, you're welcome to track them in an appropriate issue; let's avoid this PR becoming obese.

If you're ok with it, this PR can be limited to:

  • add gossip_queries feature (will change so it is not a default)
  • wire enum changes for gossip queries messages
  • add gossip_queries methods to RoutingMessageHandler
  • implement handler methods in NetGraphMsgHandler + tests

Subsequent PRs can:

  • initiate gossip_query sync in peer_handler on peer connection
  • respond to peer's gossip queries
  • clean up orphaned/timed out tasks
  • initiate periodic anti-entropy synchronization

If that sounds good, I will remove temporary code from peer_handler, squash commits appropriately, and flip this to ready for review.

@ariard commented Oct 16, 2020

If that sounds good, I will remove temporary code from peer_handler, squash commits appropriately, and flip this to ready for review.

SGTM with proposed scope!

@bmancini55 force-pushed the gossip_queries branch 3 times, most recently from 24ebffc to a2325a5 on October 23, 2020 19:54
@codecov bot commented Oct 23, 2020

Codecov Report

Merging #736 (c026764) into main (a008464) will decrease coverage by 0.17%.
The diff coverage is 78.24%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #736      +/-   ##
==========================================
- Coverage   91.33%   91.16%   -0.18%     
==========================================
  Files          37       37              
  Lines       22423    22623     +200     
==========================================
+ Hits        20481    20624     +143     
- Misses       1942     1999      +57     
Impacted Files Coverage Δ
lightning/src/ln/channelmanager.rs 84.94% <0.00%> (-0.09%) ⬇️
lightning/src/ln/msgs.rs 89.60% <ø> (ø)
lightning/src/util/events.rs 23.68% <0.00%> (-0.87%) ⬇️
lightning/src/util/test_utils.rs 84.83% <20.00%> (-3.80%) ⬇️
lightning/src/ln/peer_handler.rs 51.02% <23.07%> (-3.73%) ⬇️
lightning/src/ln/wire.rs 61.61% <28.57%> (-4.55%) ⬇️
lightning/src/routing/network_graph.rs 92.19% <97.70%> (+0.88%) ⬆️
lightning/src/ln/features.rs 98.80% <100.00%> (+0.08%) ⬆️
lightning/src/ln/functional_test_utils.rs 95.13% <100.00%> (ø)
lightning/src/routing/router.rs 95.89% <100.00%> (ø)
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a008464...c026764.

@bmancini55 marked this pull request as ready for review on October 23, 2020 20:44
@bmancini55 changed the title from "Initiate gossip_queries" to "Initiate gossip_queries part 1" on Oct 26, 2020
@ariard left a comment

Sounds almost good to me, just calls for better naming/comments. Thanks for the great code comments; they make understanding this fuzzy query protocol easier.

Note to self: review the test coverage again at the next round and then it's done.

/// added to it while we are waiting for a reply_short_channel_ids_end.
/// The pending short_channel_ids will be chunked and sent to the remote
/// peer.
pub struct GossipQueryShortChannelIdsTask {

Can you add a comment on why we need to maintain state for the queried remote peer and can't simply send queries in a stateless flow? AFAIU, it's only used to sanitize ReplyShortChannelIdsEnd. Maybe in the future it should be extended with a timestamp to know when to expire queries and retry with another peer.

Contributor Author

Comment added, hopefully that clarifies the state.

if task_received_first_block.is_none() {
// The replies must include our range but can be a superset of the
// queried block range. We start by validating the reply's
// first_blocknum is less than or equal to the start of our query.

You mean "superior or equal" ? At least that was the check msg.first_blocknum > task_first_blocknum is enforcing.

Contributor Author

Clarified the comment here


/// Maximum number of reply_channel_range messages we will allow in
/// reply to a query_channel_range.
const MAX_REPLY_CHANNEL_RANGE_PER_QUERY: usize = 250;

Are those values from BOLT 7? I can't find them there.

@bmancini55 (Contributor Author) commented Nov 2, 2020

They are not defined in BOLT 7. I updated the comments with more information. 8000 is used in other implementations as the raw batch size that fits inside the 65535-byte message limit (8,000 SCIDs at 8 bytes each is 64,000 bytes, leaving room for the remaining message fields). I can fill you in on max replies over another channel if you want some additional insight.


Good, the 65535 as an upper bound makes sense to me.

@julianknutsen

@ariard What was the final conclusion on the dependency between this and #691? These patches look to be based on master and I didn't see a resolution in any of the comments about the ordering.

@bmancini55 (Contributor Author)

@julianknutsen @ariard I'm happy to rebase this PR if desired. The changes in peer_manager are fairly minimal for this PR.

@ariard, I'm looking over your comments and will make those changes shortly. Thanks!

@ariard commented Nov 2, 2020

@bmancini55 It can go first; it's closer to being merge-ready than #691 :)

@julianknutsen What's the real issue with the ordering? AFAICT there is only one potentially conflicting file, lightning-net-tokio/src/lib.rs, and in any case it would be a minor conflict? I might be wrong; in that case, could you point out the conflict and how much complexity it would take to fix it if #691 goes second?

@julianknutsen

The merge conflicts should be easy to resolve as they all follow existing patterns. A more interesting question to answer is how much test coverage is needed in the peer_manager code and should there be an effort to increase code coverage over time when new features are added.

The tests in #692 introduce easy-to-use testing frameworks and exhaustive unit tests of the callback code that help ensure all the code is tested at some level. Those will likely need to be extended to include the new branches this PR introduces.

If #692 were to go first, this patch would result in a code coverage drop in the report which would be an automated flag to help catch the testing issues.

Not hard, just something to add to the TODO list and it is easy for these things to fall through the cracks with multiple changes to the same code.

I'd recommend this goes first to limit the patch maintenance as 692 is unlikely to be merged in the current form, but I wanted to share the pros/cons to help make the best decision.

@TheBlueMatt (Collaborator) left a comment

Sorry about the delay, was off for the last week. I think we can cut back a lot of code here by being a bit more willing to respond to bogus-but-not-harmful data from peers.

@@ -824,6 +824,27 @@ pub trait RoutingMessageHandler : Send + Sync {
fn get_next_node_announcements(&self, starting_point: Option<&PublicKey>, batch_amount: u8) -> Vec<NodeAnnouncement>;
/// Returns whether a full sync should be requested from a peer.
fn should_request_full_sync(&self, node_id: &PublicKey) -> bool;
/// Queries a peer for a list of channels with a funding UTXO in the requested
/// chain and range of blocks.
fn query_channel_range(&self, their_node_id: &PublicKey, chain_hash: BlockHash, first_blocknum: u32, number_of_blocks: u32) -> Result<(), LightningError>;
Collaborator

I think these fn interfaces can be cut back a bit - at least chain_hash can be dropped because NetworkGraph objects only actually support one chain (though maybe we should track it in the NetworkGraph!) and I'm not sure exactly how this would be used so not sure why we need number_of_blocks - can we just always set it to MAX_INT? It may help to document exactly how these intend to be used, though I've left some further motivating comments below.

Contributor Author

There are three use cases we may be interested in:

  1. A full sync
  2. A partial sync
  3. Perform an SCID query if we receive an unknown channel_update

Focusing on the full sync, we can drop the first_blocknum and number_of_blocks arguments and simplify the interface to a single method perform_routing_sync(&self, their_node_id: &PublicKey) or similarly named. This could initiate the message flow that covers both channel discovery and request of info. It will depend on how we want to handle state (more on that below).


I think you may want to keep number_of_blocks, you might have clients with limited storage and thus building their network map iteratively (request channel announcement/updates for a range, discard not-interesting channels, move forward). I agree on the NetworkGraph+chain_hash, though it can be done as a follow-up.

Collaborator

Right, I guess my thinking here is in part because we haven't implemented the peer_handler part here, and I don't know how interesting it would be to have a generic API which allows all kinds of requests if we only ever make one or two request types in our own code - my gut says probably not so interesting.


// First we obtain a lock on the task hashmap. In order to avoid borrowing issues
// we will access the task as needed.
let mut query_range_tasks = self.chan_range_query_tasks.lock().unwrap();
Collaborator

I'm not convinced we need the state tracking/validation at all - we query a peer for all the channel IDs in a given range, and it feeds us back channel IDs, we then do a ton of validation to make sure they line up with exactly the query we sent before asking for the channel announcements. In the case the peer is malicious and trying to make us waste upload bandwidth by sending more queries, they could tell us about fake channels in the right range (or just send us pings), and in the case they're buggy we should find that out when they send us the channel announcements in question and they fail to validate.


Yeah, I had the same question about the purpose of the state tracking: #736 (comment). My conclusion was that it's required by the way the spec is structured. Now, is the overall gossip spec DoS-safe? I don't think so, if you assume disk I/O on unbounded channel_announcement reception.

I don't think out-of-range channel announcements will necessarily fail to validate if we don't yet have the block state available?

let task = query_range_tasks.get_mut(their_node_id).unwrap();
let mut short_channel_ids = Vec::new();
std::mem::swap(&mut short_channel_ids, &mut task.short_channel_ids);
self.query_short_channel_ids(their_node_id, task.chain_hash, short_channel_ids)?;
Collaborator

Why not send the queries as we go? With that I think we can drop all of the state tracking here, which would clean up the code a ton and make reasoning about state and DoS issues much simpler.

Contributor Author

I don't think we can make the SCID queries entirely stateless. The spec only allows a single query of each type to be in-flight at a time. A large channel range query will stream multiple batches of SCIDs. Because we can only have a single SCID query in-flight, we need to store the subsequent batches and wait until we get an _end message before we can initiate another SCID query. For example:

|------ query_channel_range               ----->|
|<----- reply_channel_range (batch 1)     ------|
|<----- reply_channel_range (batch 2)     ------|
|<----- reply_channel_range (batch n)     ------|
|                                               |
|------ query_short_channel_ids (batch 1) ----->|
|<----- many chan_ann, etc...             ------|
|<----- reply_short_channel_ids_end       ------|
|                                               |
|------ query_short_channel_ids (batch 2) ----->|
|<----- many chan_ann, etc...             ------|
|<----- reply_short_channel_ids_end       ------|
|                                               |
|------ query_short_channel_ids (batch n) ----->|
|<----- many chan_ann, etc...             ------|
|<----- reply_short_channel_ids_end       ------|

So at a bare minimum we'll need to serialize and defer SCID query events somehow, even if we initiate the first SCID query as soon as we receive the first batch.

The path I went down allows us to handle the three use cases described above, but is overkill if we only plan to support full syncs. I'm happy to come up with a few options to simplify things and report back!
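A minimal, self-contained sketch of the constraint described above (stand-in types; in this iteration the real struct is GossipQueryShortChannelIdsTask): SCIDs from reply_channel_range batches accumulate, only one query_short_channel_ids stays in flight, and the next batch is released when reply_short_channel_ids_end arrives.

// Sketch only: one in-flight SCID query per peer, remaining batches deferred.
const MAX_SCIDS_PER_QUERY: usize = 8000; // raw-encoding batch size discussed above

#[derive(Default)]
struct ScidQueryState {
    pending_scids: Vec<u64>,   // accumulated from reply_channel_range batches
    query_in_flight: bool,     // spec: only one query_short_channel_ids outstanding
}

impl ScidQueryState {
    /// Called for every reply_channel_range we receive. Returns the SCIDs to
    /// put in a query_short_channel_ids now, if one can be sent.
    fn on_reply_channel_range(&mut self, scids: Vec<u64>) -> Option<Vec<u64>> {
        self.pending_scids.extend(scids);
        self.maybe_next_batch()
    }

    /// Called when reply_short_channel_ids_end arrives for the in-flight query.
    fn on_reply_short_channel_ids_end(&mut self) -> Option<Vec<u64>> {
        self.query_in_flight = false;
        self.maybe_next_batch()
    }

    fn maybe_next_batch(&mut self) -> Option<Vec<u64>> {
        if self.query_in_flight || self.pending_scids.is_empty() {
            return None;
        }
        let n = self.pending_scids.len().min(MAX_SCIDS_PER_QUERY);
        let batch: Vec<u64> = self.pending_scids.drain(..n).collect();
        self.query_in_flight = true;
        Some(batch)
    }
}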

Collaborator

Ugh. I wonder if we should just get the spec to relax the query limit for short_channel_ids data so that we can make all our responses completely stateless (modulo network buffers). The alternative would be to tightly limit the response set size and maybe even do requests one block at a time to make sure we can set a tight limit and not have to worry about it. I do think we can still cut back a bunch on the amount of state we track to enforce checks that aren't really DoS risks.

@ariard commented Nov 4, 2020

Thanks @julianknutsen for the information, added a direct reference to your comment in #707 backlog.

@ariard commented Nov 4, 2020

Thanks for the great new comments at 6a59483.

@TheBlueMatt We might bound the size of the scid query after reply_channel_range batch completion to avoid upload bandwidth/disk-IO DoS? Assuming we pick a per-peer bound, we could use the historical average number of channel openings per block over the last year as a heuristic, plus some safety margin. The hard limit is max_number_of_p2wsh_per_block.

Also, you assume a DoS adversary is able to game our gossip-sync peer selection. We may recommend picking peers randomly as a first mitigation step.

That said, I think we should defer this DoS discussion to future PRs; cutting the implementation back to what the spec requires isn't going to solve the issue, as ultimately it's a question of receiving non-sanitized data from untrusted peers. IMO, we should first a) know the default bandwidth/disk resources we assume on an RL client (but maybe configurable?), b) define how we select gossip-sync peers, and c) allocate a gossip budget to each peer (expressed in byte_per_block_scanned?).

@TheBlueMatt (Collaborator)

We might bound the size of the scid query after reply_channel_range batch completion to avoid upload bandwidth/disk-IO DoS?

I can't say I'm super concerned about upload bandwidth - assuming we use a similar approach to the existing map sync stuff, where we only send such messages with low priority and let channel-related messages go first, it's not any worse than what a peer can do with ping/pongs anyway. As for disk, for now our map is all in memory, though it would be something worth considering if we ever move to a disk-backed version.

Also, you assume a DoS adversary is able to game our gossip-sync peer selection. We may recommend picking peers randomly as a first mitigation step.

Maybe? I assume we'll want to sync with multiple peers at least, eventually doing some reconciliation. I think we can pretty tightly limit things and then we don't have to worry about it, though.

That said, I think we should defer this DoS discussion to future PRs

I don't see why? We have to deal with it sooner or later, might as well get it right in the first go? I think it should be pretty doable to make things pretty cheap here.

@bmancini55 (Contributor Author)

I wonder if we should just get the spec to relax the query limit for short_channel_ids data

It would simplify things on the query end. IIRC, it's a simple DoS prevention mechanism so the query recipient can throttle by taking time responding to a query.

The alternative would be to tightly limit the response set size and maybe even do requests one block at a time to make sure we can set a tight limit and not have to worry about it.

One-by-one is a possibility, though our peers might not appreciate that haha. Alternatively, we could process just the first reply in a sequence. When the scid query completes for the first reply, we issue a new range query starting next in the sequence. The only state we would need in both of these options is the block height of where we are in the sync process.

To simplify as-is, we could prune the interface to only a sync method and use a single SyncTask struct to track completion of the range query and hold ScidQueryEvents that get flushed in sequence. It's not substantially different than the current code, but simplifies some of the moving pieces.

Generally I wasn't overly concerned with state because we're only going to initiate a sync with a small number of peers and we have an upper-bound on allocation of 16MB (250 messages limit * ~8000 scid * 8-bytes per scid) before we abort and clean up. But I certainly understand a desire to go stateless from an allocation and complexity standpoint!

Regardless of state, I also agree that we could simplify reply_channel_range validation logic without loss of functionality. If a node is going to send bogus data, the sequence validation isn't going to help us. We'll still need to validate whether the responder failed the query and if the query is complete. If we would rather leave the validation in, I'm ok with that too haha

@TheBlueMatt (Collaborator)

IIRC, it's a simple DoS prevention mechanism so the query recipient can throttle by taking time responding to a query.

Right, I see the motivation there, but it doesn't really work that way - in practice you're never going to implement it by not processing a new inbound message until you've finished uploading the responses from the previous message (and checking with the kernel that it's already been sent?), so a peer can almost always send several in a row.

I think maybe the thinking might have been more along the lines of making sure other messages can be sent in between by encoding into the spec the idea of not having too much in-flight at once, though that also doesn't solve the problem (there's plenty of other issues around fresh announcements flooding message queues) and you really need a solution like ours where you have (effectively, in our case) a separate outbound message queue for "background" messages and one for channel messages.

I don't think it's unreasonable to change the spec here, though other implementations may disagree. I'd be curious to know if any implementations actually have the ability to enforce the "one-in-flight" rule, anyway.

@bmancini55 (Contributor Author) commented Dec 1, 2020

Sorry for the delay, took some time off with the holiday and just got a chance to circle back to this. Latest update pushed as described above.

A few notes about adding the BlockHash to NetworkGraph

Happy to modify this if needed.

Support for the gossip_queries feature flag (bits 6/7) is added to the
Features struct. This feature is available in the Init and Node
contexts. The gossip_queries feature is not fully implemented so this
feature is disabled when sent to peers in the Init message.
To enable gossip_queries message decoding, this commit implements the
wire module's Encoding trait for each message type. It also adds these
messages to the wire module's Message enum and the read function to
enable decoding of a buffer.
@bmancini55 force-pushed the gossip_queries branch 2 times, most recently from 66a8917 to d3f7a9d on December 1, 2020 22:54
@TheBlueMatt (Collaborator) left a comment

This is looking good. Mostly trivial nitpicky stuff here. I didn't test it against any live nodes, have you gotten a chance to?

@@ -827,6 +827,25 @@ pub trait RoutingMessageHandler : Send + Sync {
fn get_next_node_announcements(&self, starting_point: Option<&PublicKey>, batch_amount: u8) -> Vec<NodeAnnouncement>;
/// Returns whether a full sync should be requested from a peer.
fn should_request_full_sync(&self, node_id: &PublicKey) -> bool;
/// Initiates routing gossip sync by querying a peer to discover channels
Collaborator

While you're at this can you update the docs on the above method to be more specific that this is just about the initial sync bit in Init messages, and that many lightning implementations ignore this completely - maybe even rename it should_set_initial_sync_bit or so.

/// Handling requests with first_blocknum very far away may trigger repeated
/// disk I/O if the NetworkGraph is not fully in-memory.
fn handle_query_channel_range(&self, _their_node_id: &PublicKey, _msg: &QueryChannelRange) -> Result<(), LightningError> {
// TODO
Collaborator

Is it safe to presume you'll be working on this sooner or later? :)

Contributor Author

Indeed, I want to get outbound queries buttoned up before diving into responding to them.

/// Initiates routing gossip sync by querying a peer to discover channels
/// and their associated routing gossip messages. This method will use a
/// sync strategy defined by the implementor.
fn sync_routing_table(&self, their_node_id: &PublicKey);
Collaborator

It may be useful to pass in the Init message here so that the NetGraphMessageHandler can analyze the InitFeatures. Same goes for should_request_full_sync - that way we can only return true in the second when the peer doesn't support gossip queries and send a query_channel_range when they do.

if msg.features.supports_data_loss_protect() { "supported" } else { "not supported"},
if msg.features.initial_routing_sync() { "requested" } else { "not requested" },
if msg.features.supports_upfront_shutdown_script() { "supported" } else { "not supported"},
if msg.features.supports_gossip_queries() { "supported" } else { "not supported" },
Collaborator

I presume we should call sync_routing_table at the bottom of this block where we call peer_connected?

@bmancini55 (Contributor Author)

This is looking good. Mostly trivial nitpicky stuff here. I didn't test it against any live nodes, have you gotten a chance to?

Yep, tested against LND with minor modifications to PeerManager to call sync_routing_table when the peer connects, as you noted above.

On that, I was trying to keep the touches on PeerManager to a minimum and was planning to submit another PR to conditionally send the query or set the sync feature bit. Should I just make those changes now so this PR can button up initiating queries?

@bmancini55 (Contributor Author) commented Dec 3, 2020

New updates:

  • changed handler methods to pass ownership of gossip messages and cleaned up the comments as suggested
  • pass Init message to sync_routing_table
  • call sync_routing_table after receipt of Init from peer

In terms of adding Init to should_request_full_sync, we use this method in two places:

  1. when outbound, we send Init immediately after finalizing noise. In this case we do not yet have knowledge of the peer's features
  2. after we receive an Init message and the peer is an inbound peer

What I'm doing is using should_request_full_sync to set the initial_routing_sync feature bit and flag the peer as requesting a sync. Then once we receive the peer's Init message, we use that flag to control whether sync_routing_table should be called.

This bit of orchestration could be removed, along with removing should_request_full_sync from the RoutingMessageHandler trait if we follow the path of other implementations and never set the initial_routing_sync feature bit.
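A self-contained sketch of that orchestration, to make the two call sites concrete (stand-in types and method names; an illustration only, not the actual PeerManager code):

// Sketch only: coordinate should_request_full_sync (outbound Init, before we
// know the peer's features) with sync_routing_table (after their Init arrives).
struct PeerSketch { sync_requested: bool }

struct PeerManagerSketch;

impl PeerManagerSketch {
    fn should_request_full_sync(&self, _peer: &PeerSketch) -> bool { true }

    // Outbound: Init is sent right after the noise handshake, so all we can do
    // is set the initial_routing_sync bit and remember that we want a sync.
    fn build_outbound_init(&self, peer: &mut PeerSketch) -> bool {
        peer.sync_requested = self.should_request_full_sync(peer);
        peer.sync_requested // whether to set the initial_routing_sync feature bit
    }

    // On receipt of the peer's Init, use the earlier flag to decide whether to
    // kick off a gossip_queries sync via the routing message handler.
    fn on_init_received(&self, peer: &PeerSketch) {
        if peer.sync_requested {
            self.sync_routing_table(peer);
        }
    }

    fn sync_routing_table(&self, _peer: &PeerSketch) {
        // In the real code this calls RoutingMessageHandler::sync_routing_table.
    }
}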

@bmancini55 changed the title from "Initiate gossip_queries part 1" to "Initiate gossip_queries" on Dec 3, 2020
@ariard left a comment

Sorry for the delay, the stateless approach is so much simpler!

What I'm doing is using should_request_full_sync to set the initial_routing_sync feature bit and flag the peer as requesting a sync. Then once we receive the peer's Init message, we use that flag to control whether sync_routing_table should be called.

IIUC, setting initial_routing_sync in our Init message should trigger the receiving peer to send us a complete routing table dump, at least that's the behavior recommended by the spec? But how many implementations still support this behavior? It sounds like a huge upstream bandwidth leak. I'm of the opinion we should remove it completely and only rely on gossip queries.

Also have a look at yesterday's lightning-dev meeting (http://gnusha.org/lightning-dev/2020-12-07.log), where we discussed both #804/#811

  • it has been proposed to repurpose full_information of the channel range query as a clear EOF, thus standardizing behavior among implementations
  • we discussed introducing a new no-zlib-support feature, easing discovery of peers that support raw encoding for gossip queries

And we still have to make the stateless approach lawful by allowing concurrent gossip queries.

What I'm thinking of is landing this PR once the minor points are solved and then writing down the spec updates to propose changes matching our requirements?

/// Returns whether a full sync should be requested from a peer.
/// Returns whether a full sync should be requested from a peer. The result is
/// used to set the initial_routing_sync feature bit when sending an Init message.
/// This feature bit is not used by some implementations in favor of using gossip_queries.

You can add the implementations and versions which don't support initial_routing_sync as part of the comment; helpful for debugging on our side :)

/// Handles when a peer asks us to send routing gossip messages for a
/// list of short_channel_ids. There are potential DoS vectors when handling
/// inbound queries. Handling requests with first_blocknum very far away may
/// trigger repeated disk I/O if the NetworkGraph is not fully in-memory.

Alternatively, you could gather the potential DoS vectors in an "Implementor DoS Warnings" section in the RoutingMessageHandler trait doc, with the trade-offs if the graph isn't in memory. I'm fine with it being in the trait function docs; your call if you want to remove the reference to NetworkGraph.

@bmancini55 (Contributor Author) commented Dec 9, 2020

IIUC, setting initial_routing_sync in our Init message should trigger the receiving peer to send us a complete routing table dump, at least that's the behavior recommended by the spec? But how many implementations still support this behavior? It sounds like a huge upstream bandwidth leak. I'm of the opinion we should remove it completely and only rely on gossip queries.

When we enable the gossip_queries feature in our init message, the remote peer will ignore the initial_routing_sync flag. I know LND dropped support for initial_routing_sync entirely, regardless of whether you signal gossip_queries to it. I haven't recently tested Eclair or CL to see if they still support initial_routing_sync when gossip_queries isn't negotiated.

For outbound sync, I'm in agreement that we should not set initial_routing_sync and only rely on gossip_queries. This has the benefit of removing that conditional logic from PeerManager and removing should_request_full_sync from RoutingMessageHandler. Sync can then be fully encapsulated inside the implementation of RoutingMessageHandler.

Also have a look at yesterday's lightning-dev meeting (http://gnusha.org/lightning-dev/2020-12-07.log), where we discussed both #804/#811

Thanks, unfortunately I had a conflict and was only able to log on for the second half! Both of those proposals sound good though.

What I'm thinking of is landing this PR once the minor points are solved and then writing down the spec updates to propose changes matching our requirements?

Sounds good, I'll make these changes tomorrow!

This change enables initiating gossip queries with a peer using the
SendMessageEvent enum. Specifically we add an event for sending
query_channel_range to discover the existence of channels and an event
for sending query_short_channel_ids to request routing gossip messages
for a set of channels. These events are handled inside the process_events
method of PeerManager which sends the serialized message to the peer.
Defines message handlers for gossip_queries messages in the RoutingMessageHandler
trait. The MessageSendEventsProvider supertrait is added to RoutingMessageHandler
so that the implementor can use SendMessageEvents to send messages to a
peer at the appropriate time.

The trait methods are stubbed in NetGraphMsgHandler, which implements
RoutingMessageHandler, and return a "not implemented" error.
Performing a sync of routing gossip messages with a peer requires a two-step
process where we first initiate a channel range query to discover
channels in a block range. Next we request the routing gossip messages
for discovered channels.

This code implements logic in NetGraphMsgHandler for performing these two
tasks while taking into account the specification and variance in
implementation.
This change adds the genesis block hash as a BlockHash to the
NetworkGraph struct. Making the NetworkGraph chain-aware allows the message
handler to validate the chain_hash of received messages. This change
also adds the hash value to the Writeable and Readable methods.
This commit simplifies the sync process for routing gossip messages. When
a sync is initiated, the process is handled statelessly by immediately
issuing SCID queries as channel range replies are received. This greatly
simplifies the state machine at the cost of full validation of, and strict
conformance to, the current spec.
This change modifies gossip_queries methods in RoutingMessageHandler to
move the message instead of passing a reference. This allows the message
handler to be more efficient by not requiring a full copy of SCIDs
passed in messages.
This commit modifies sync_routing_table in RoutingMessageHandler to
accept a reference to the Init message received by the peer. This allows
the method to use the Peer's features to drive the operations of the
gossip_queries routing table sync.
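Putting the commit messages above together, here is a hedged, self-contained sketch of the resulting flow (stand-in types; the FULL_SYNCS_TO_REQUEST value and helper names are assumptions, only the overall shape follows the commits and the trait signature quoted below):

// Sketch only: stateless gossip_queries sync as described by the commits above.
type NodeId = [u8; 33];

struct InitSketch { gossip_queries_supported: bool }

struct NetGraphMsgHandlerSketch { full_syncs_requested: u8 }

const FULL_SYNCS_TO_REQUEST: u8 = 5; // illustrative value

impl NetGraphMsgHandlerSketch {
    // Called by the peer manager after the peer's Init message is received.
    fn sync_routing_table(&mut self, peer: NodeId, init: &InitSketch) {
        // Rely on gossip_queries only; initial_routing_sync is effectively deprecated.
        if !init.gossip_queries_supported { return; }
        // Only ask a limited number of peers for a full dump of the graph.
        if self.full_syncs_requested >= FULL_SYNCS_TO_REQUEST { return; }
        self.full_syncs_requested += 1;
        // Full range: every channel from the genesis block onward.
        self.send_query_channel_range(peer, 0, 0xffff_ffff);
        // From here the flow is stateless: each reply_channel_range received is
        // answered immediately with a query_short_channel_ids for its SCIDs.
    }

    fn send_query_channel_range(&self, _peer: NodeId, _first_blocknum: u32, _number_of_blocks: u32) {
        // In the real handler this pushes a gossip-query send event for the peer.
    }
}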
@TheBlueMatt (Collaborator)

What I'm thinking of is landing this PR once the minor points are solved and then writing down the spec updates to propose changes matching our requirements?

This sounds good to me. Especially given it works in the real world, I see little reason to sit on it forever.

@TheBlueMatt (Collaborator) left a comment

LGTM. One note about docs but I'm happy with it.

/// Initiates routing gossip sync by querying a peer to discover channels
/// and their associated routing gossip messages. This method will use a
/// sync strategy defined by the implementor.
fn sync_routing_table(&self, their_node_id: &PublicKey, init: &Init);
Collaborator

Since it's ultimately up to the implementer what they want to do here, it may make sense for the docs to be a little less prescriptive (i.e., after the first few nodes you may not want to always sync the routing table) and just describe that this is called on connection and that implementers should decide if they want to sync the table.

@ariard left a comment

Thanks for the changes, just a small test coverage issue and SGTM :)

{
let (secp_ctx, net_graph_msg_handler) = create_net_graph_msg_handler();
let init_msg = Init { features: InitFeatures::known() };
for n in 1..6 {

You need to iterate the for loop until 7, otherwise it doesn't test the sync limitation to FULL_SYNCS_TO_REQUEST?

Diff:

diff --git a/lightning/src/routing/network_graph.rs b/lightning/src/routing/network_graph.rs
index d6c72d08..4b0c2087 100644
--- a/lightning/src/routing/network_graph.rs
+++ b/lightning/src/routing/network_graph.rs
@@ -235,10 +235,10 @@ impl<C: Deref + Sync + Send, L: Deref + Sync + Send> RoutingMessageHandler for N
                        return ();
                }
 
-               // Check if we need to perform a full synchronization with this peer
-               if !self.should_request_full_sync(their_node_id) {
-                       return ();
-               }
+               //// Check if we need to perform a full synchronization with this peer
+               //if !self.should_request_full_sync(their_node_id) {
+               //      return ();
+               //}
 
                let first_blocknum = 0;
                let number_of_blocks = 0xffffffff;

Contributor Author

Dag-nabbit. Fixed! Thanks for catching that. A minor refactor of the test bit me.

This commit changes outbound routing table sync to use gossip_queries
instead of the effectively deprecated initial_routing_sync feature.

This change removes setting of initial_routing_sync in our outbound Init
message. Instead we now call sync_routing_table after receiving an Init
message from a peer. If the peer supports gossip_queries and
should_request_full_sync returns true, we initiate a full gossip_queries
sync.
This method was used to set the initial_routing_sync flag when sending
an outbound Init message to a peer. Since we are now relying on
gossip_queries instead of initial_routing_sync, synchronization can be
fully encapsulated in RoutingMessageHandler via sync_routing_table.

This commit removes should_request_full_sync from the trait
RoutingMessageHandler. The implementation is still used in
NetGraphMsgHandler and has been converted into a private method instead
of a trait function.
@ariard commented Dec 15, 2020

Code Review ACK e0bb63b

Corrects the comment for sync_routing_table in RoutingMessageHandler to
be less prescriptive about the implementor's actions.
@TheBlueMatt merged commit 74cd96f into lightningdevkit:main on Dec 15, 2020