Initiate gossip_queries #736


Merged: 12 commits into lightningdevkit:main on Dec 15, 2020

Conversation

@bmancini55 (Contributor) commented Oct 3, 2020

Part 1 for initiating gossip_queries for #615. This PR includes:

  • Adds feature flag
  • Adds msg types to wire enum
  • Adds SendMessageEvents (a rough sketch of the new event variants follows this list)
  • Adds message handlers to RoutingMessageHandler
  • Implements querying logic in NetGraphMsgHandler
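For orientation, here is a rough, self-contained sketch of the two gossip-query event variants this adds to the send-message event plumbing. It is not the exact diff: the types below are stand-ins, and the variant names are illustrative (the review below quotes a SendChannelsQuery variant and a msgs::QueryShortChannelIds field from an earlier iteration).

// Sketch only: events the routing handler uses to ask the peer manager to
// send gossip queries to a specific peer. All types here are stand-ins.
struct PublicKey;              // stand-in for the peer's node_id type
struct QueryChannelRange;      // stand-in wire message
struct QueryShortChannelIds;   // stand-in wire message

enum MessageSendEvent {
    /// Ask a peer to enumerate the channels confirmed in a block range.
    SendChannelRangeQuery { node_id: PublicKey, msg: QueryChannelRange },
    /// Ask a peer for the routing gossip of specific short_channel_ids.
    SendShortIdsQuery { node_id: PublicKey, msg: QueryShortChannelIds },
}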

@julianknutsen commented Oct 4, 2020

I would suggest basing your peer_handler changes on top of #714

It's a major restructure with unit tests that will save you some merge headaches down the road.

FYI: @ariard @arik-so

@TheBlueMatt (Collaborator) left a comment


Structure looks good modulo one note; it would be nice to get at least some tests of the message handler implementation in NetGraphMsgHandler.

@@ -826,6 +826,33 @@ pub trait RoutingMessageHandler : Send + Sync {
fn should_request_full_sync(&self, node_id: &PublicKey) -> bool;
}


/// A trait that describes an object that can receive gossip queries messages
Collaborator

I think we should just integrate this into RoutingMessageHandler. Indeed, it would make RoutingMessageHandler much more complex, but that's the way the protocol went, and the existing functions are not very useful without this, too. That would also create significantly less conflicts with other work, as Julian notes.

Contributor Author

Sounds good, I've been thinking the same thing. Thanks!

@bmancini55 (Contributor Author)

I would suggest basing your peer_handler changes on top of #714

@julianknutsen, thanks I'll check it out.

@ariard left a comment

Overall looks good! Mainly see my point on DisconnectPeer: I would be more liberal with gossip queries errors from remote peers, as severing connections has consequences for channel processing. Also, in those cases the BOLT 7 recommendation is a MAY, so we really have room here. I wouldn't be surprised if this protocol is a bit buggy in other implementations, and we should avoid error contamination.

MessageSendEvent::SendChannelsQuery { ref node_id, ref msg } => {
log_trace!(self.logger, "Handling SendChannelsQuery event in peer_handler");
let (mut descriptor, peer) = get_peer_for_forwarding!(node_id, {
//TODO: Do whatever we're gonna do for handling dropped messages

I think we can just do nothing here. I assume that gossip queries will be scheduled by NetGraphMsgHandler until it is "satisfied" with its view of the network, and it should be ready to deal with peers that have a poor view of the network, which is the same as not finding the peer.

If this is too slow and peers are too unreliable, we may add an Event, but honestly let's not overload the number of events to manage.

/// The node_id of the node which should receive this message
node_id: PublicKey,
/// The message which should be sent.
msg: msgs::QueryShortChannelIds,

I think a worthy follow-up would be to coordinate better between query-gossips-at-connection and query-gossips-when-network-view-is-bad. AFAICT this patchset aims to query on each connection, which might be too bandwidth-hungry. I was thinking you could have a timer_refresh_network_view_with_queries in NetGraphMsgHandler on the model of PeerHandler::timer_tick_occured().
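To make that suggestion concrete, a hypothetical, self-contained sketch of such a timer hook (the staleness window, the peer-picking, and all helper names are assumptions, not code from this PR):

// Hypothetical sketch: re-query the network view on a timer rather than on
// every connection. Stand-in types; the 144-block staleness window is arbitrary.
type NodeId = [u8; 33];

struct NetGraphRefreshSketch {
    last_full_sync_height: u32,
    current_height: u32,
    sync_peers: Vec<NodeId>,
}

impl NetGraphRefreshSketch {
    // Called on a fixed interval, on the model of PeerHandler::timer_tick_occured().
    fn timer_refresh_network_view_with_queries(&mut self) {
        let stale = self.current_height.saturating_sub(self.last_full_sync_height) > 144;
        if stale {
            if let Some(peer) = self.sync_peers.first().copied() {
                // Full range: ask for every channel from genesis onward.
                self.send_query_channel_range(peer, 0, 0xffff_ffff);
                self.last_full_sync_height = self.current_height;
            }
        }
    }

    fn send_query_channel_range(&self, _peer: NodeId, _first_blocknum: u32, _number_of_blocks: u32) {
        // In the real handler this would enqueue a gossip-query send event.
    }
}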

Contributor Author

Ah yes, sorry about that. I was doing some dev testing and left code that performs a sync on each connection. I still need to clean up the PeerHandler code to respect should_request_full_sync.

Thanks, I'll check out PeerHandler::timer_tick_occured(). For periodic anti-entropy, it might be nice to connect to a random node from the network view and attempt a sync.

@bmancini55 (Contributor Author)

Updates:

  • Changed the handler methods to use IgnoreError instead of DisconnectPeer
  • Added tests for gossip-related methods in NetGraphMsgHandler. Some are broader in scope than unit tests since I wanted to ensure the task setup wasn't faked. Some of the test code is also a bit repetitive; I like making test code as straightforward as possible. Happy to try to reduce some of the repetitiveness if that fits the project style better.

More to follow!

@ariard commented Oct 15, 2020

@bmancini55 Is this ready for another round of review :) ?

EDIT: I'll have a look at the new tests anyway.

@bmancini55 (Contributor Author)

@bmancini55 Is this ready for another round of review :) ?

EDIT: I'll have a look at the new tests anyway.

@ariard I'll be adding more commits this week, so not quite there yet for a complete look. The tests should be good to review though, thanks!

@ariard left a comment

So far the code is pretty well-documented and coverage seems to be exhaustive (missing handle_query_channel_range/handle_query_short_channels_ids IIUC).

Still, have a look at the mutation I provided and the point about the task chain hash; I'm not sure it's correct.

If you have follow-ups in mind, you're welcome to track them in an appropriate issue; let's avoid this PR becoming obese.

let first_blocknum = 0;
let number_of_blocks = 0xffff_ffff;

// When no active query exists for the node, it should send a query message and generate a task

I wonder if the active query should time out at some point in case of an unresponsive peer?

Contributor Author

Definitely agree on this point. Also need to clean up tasks on peer disconnection.

let mut query_range_tasks_lock = self.query_channel_range_tasks.lock().unwrap();
let query_range_tasks = &mut *query_range_tasks_lock;
if query_range_tasks.contains_key(their_node_id) {
return Err(LightningError {

If it's a caller requirement to track the channel_range query, I think this might be a hint of a higher-level logic bug. Though I don't have a clear model of this method's usage.

Contributor Author

My thinking is that this will be called from peer_handler after init negotiation, when should_request_full_sync returns true. This guard prevents multiple in-flight query_channel_range messages, which would be spec non-compliant. This could be changed to queue channel range queries, but I thought outright rejection was simpler for now.

Relatedly, my plan is to make full_syncs_requested decrement on any failure in handle_reply_channel_range so that a full sync can be requested with the next peer.
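For readers following along, a minimal, self-contained sketch of the guard described here as it stood in this iteration (stand-in types; the real code keys a Mutex<HashMap<..>> by the peer's node_id, and this state tracking was later dropped in favor of the stateless approach):

use std::collections::HashMap;

// Stand-in types for illustration only.
type NodeId = [u8; 33];
struct ChanRangeQueryTask { first_blocknum: u32, number_of_blocks: u32 }
struct LightningError { err: String }

struct RangeQueryGuard { tasks: HashMap<NodeId, ChanRangeQueryTask> }

impl RangeQueryGuard {
    // Reject a second query_channel_range to the same peer while one is still
    // in flight, since the spec allows only one outstanding query per type.
    fn query_channel_range(&mut self, their_node_id: NodeId, first_blocknum: u32, number_of_blocks: u32)
        -> Result<(), LightningError>
    {
        if self.tasks.contains_key(&their_node_id) {
            return Err(LightningError { err: "query_channel_range already in flight".to_owned() });
        }
        self.tasks.insert(their_node_id, ChanRangeQueryTask { first_blocknum, number_of_blocks });
        // ...then enqueue the query_channel_range message for sending...
        Ok(())
    }
}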


Right, we can switch it later to an assert when higher sync logic is implemented.

@bmancini55 (Contributor Author)

@ariard 🤦 major apologies for that sloppiness!! I have fixed the issues with chain_hash and reply fields. Also took your advice to harden by asserting the message. I added tests for handle_query_channel_range and handle_query_short_channels_ids and renamed send_channels_query as requested.

If you have follow-ups in mind, you're welcome to track them in an appropriate issue; let's avoid this PR becoming obese.

If you're ok with it, this PR can be limited to:

  • add gossip_queries feature (will change so it is not a default)
  • wire enum changes for gossip queries messages
  • add gossip_queries methods to RoutingMessageHandler
  • implement handler methods in NetGraphMsgHandler + tests

Subsequent PRs can:

  • initiate gossip_query sync in peer_handler on peer connection
  • respond to peer's gossip queries
  • clean up orphaned/timed out tasks
  • initiate periodic anti-entropy synchronization

If that sounds good, I will remove temporary code from peer_handler, squash commits appropriately, and flip this to ready for review.

@ariard commented Oct 16, 2020

If that sounds good, I will remove temporary code from peer_handler, squash commits appropriately, and flip this to ready for review.

SGTM with proposed scope!

@bmancini55 force-pushed the gossip_queries branch 3 times, most recently from 24ebffc to a2325a5 on October 23, 2020 19:54
@codecov bot commented Oct 23, 2020

Codecov Report

Merging #736 (c026764) into main (a008464) will decrease coverage by 0.17%.
The diff coverage is 78.24%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #736      +/-   ##
==========================================
- Coverage   91.33%   91.16%   -0.18%     
==========================================
  Files          37       37              
  Lines       22423    22623     +200     
==========================================
+ Hits        20481    20624     +143     
- Misses       1942     1999      +57     
Impacted Files Coverage Δ
lightning/src/ln/channelmanager.rs 84.94% <0.00%> (-0.09%) ⬇️
lightning/src/ln/msgs.rs 89.60% <ø> (ø)
lightning/src/util/events.rs 23.68% <0.00%> (-0.87%) ⬇️
lightning/src/util/test_utils.rs 84.83% <20.00%> (-3.80%) ⬇️
lightning/src/ln/peer_handler.rs 51.02% <23.07%> (-3.73%) ⬇️
lightning/src/ln/wire.rs 61.61% <28.57%> (-4.55%) ⬇️
lightning/src/routing/network_graph.rs 92.19% <97.70%> (+0.88%) ⬆️
lightning/src/ln/features.rs 98.80% <100.00%> (+0.08%) ⬆️
lightning/src/ln/functional_test_utils.rs 95.13% <100.00%> (ø)
lightning/src/routing/router.rs 95.89% <100.00%> (ø)
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a008464...c026764.

@bmancini55 marked this pull request as ready for review on October 23, 2020 20:44
@bmancini55 changed the title from "Initiate gossip_queries" to "Initiate gossip_queries part 1" on Oct 26, 2020
@ariard left a comment

Sounds almost good to me, just calls for better naming/comments. Thanks for the great code comments; they make understanding this fuzzy query protocol easier.

Note to self: review the test coverage again at the next round and then it's done.

/// added to it while we are waiting for a reply_short_channel_ids_end.
/// The pending short_channel_ids will be chunked and sent to the remote
/// peer.
pub struct GossipQueryShortChannelIdsTask {

Can you add a comment on why we need to maintain state for the queried remote peer and can't simply send queries in a stateless flow? AFAIU, it's only used to sanitize ReplyShortChannelIdsEnd. Maybe in the future it should be extended with a timestamp to know when to expire queries and retry with another peer.

Contributor Author

Comment added, hopefully that clarifies the state.

if task_received_first_block.is_none() {
// The replies must include our range but can be a superset of the
// queried block range. We start by validating the reply's
// first_blocknum is less than or equal to the start of our query.

You mean "superior or equal" ? At least that was the check msg.first_blocknum > task_first_blocknum is enforcing.

Contributor Author

Clarified the comment here


/// Maximum number of reply_channel_range messages we will allow in
/// reply to a query_channel_range.
const MAX_REPLY_CHANNEL_RANGE_PER_QUERY: usize = 250;

Are those values from BOLT 7? I can't find them there.

@bmancini55 (Contributor Author) commented Nov 2, 2020

They are not defined in BOLT 7. I updated the comments with more information. 8000 is used in other implementations as the raw batch size that fits inside the 65535-byte message limit (8,000 SCIDs at 8 bytes each is 64,000 bytes, leaving room for the remaining message fields). I can fill you in on max replies over another channel if you want some additional insight.


Good, the 65535 as an upper bound makes sense to me.

@julianknutsen

@ariard What was the final conclusion on the dependency between this and #691? These patches look to be based on master and I didn't see a resolution in any of the comments about the ordering.

@bmancini55 (Contributor Author)

@julianknutsen @ariard I'm happy to rebase this PR if desired. The changes in peer_manager are fairly minimal for this PR.

@ariard, I'm looking over your comments and will make those changes shortly. Thanks!

@ariard commented Nov 2, 2020

@bmancini55 It can go first; it's closer to being merge-ready than #691 :)

@julianknutsen What's the real issue with the ordering? AFAICT there is only one potentially conflicting file, lightning-net-tokio/src/lib.rs, and in any case it would be a minor conflict? I might be wrong; in that case, could you point out the conflict and how much complexity it would take to fix it if #691 goes second?

@julianknutsen

The merge conflicts should be easy to resolve as they all follow existing patterns. A more interesting question to answer is how much test coverage is needed in the peer_manager code and should there be an effort to increase code coverage over time when new features are added.

The tests in #692 introduce easy-to-use testing frameworks and exhaustive unit tests of the callback code that help ensure all the code is tested at some level. Those will likely need to be extended to include the new branches this PR introduces.

If #692 were to go first, this patch would result in a code coverage drop in the report which would be an automated flag to help catch the testing issues.

Not hard, just something to add to the TODO list and it is easy for these things to fall through the cracks with multiple changes to the same code.

I'd recommend this goes first to limit the patch maintenance as 692 is unlikely to be merged in the current form, but I wanted to share the pros/cons to help make the best decision.

@TheBlueMatt (Collaborator) left a comment

Sorry about the delay, was off for the last week. I think we can cut back a lot of code here by being a bit more willing to respond to bogus-but-not-harmful data from peers.

@@ -824,6 +824,27 @@ pub trait RoutingMessageHandler : Send + Sync {
fn get_next_node_announcements(&self, starting_point: Option<&PublicKey>, batch_amount: u8) -> Vec<NodeAnnouncement>;
/// Returns whether a full sync should be requested from a peer.
fn should_request_full_sync(&self, node_id: &PublicKey) -> bool;
/// Queries a peer for a list of channels with a funding UTXO in the requested
/// chain and range of blocks.
fn query_channel_range(&self, their_node_id: &PublicKey, chain_hash: BlockHash, first_blocknum: u32, number_of_blocks: u32) -> Result<(), LightningError>;
Collaborator

I think these fn interfaces can be cut back a bit - at least chain_hash can be dropped because NetworkGraph objects only actually support one chain (though maybe we should track it in the NetworkGraph!) and I'm not sure exactly how this would be used so not sure why we need number_of_blocks - can we just always set it to MAX_INT? It may help to document exactly how these intend to be used, though I've left some further motivating comments below.

Contributor Author

There are three use cases we may be interested in:

  1. A full sync
  2. A partial sync
  3. Perform an SCID query if we receive an unknown channel_update

Focusing on the full sync, we can drop the first_blocknum and number_of_blocks arguments and simplify the interface to a single method perform_routing_sync(&self, their_node_id: &PublicKey) or similarly named. This could initiate the message flow that covers both channel discovery and request of info. It will depend on how we want to handle state (more on that below).


I think you may want to keep number_of_blocks, you might have clients with limited storage and thus building their network map iteratively (request channel announcement/updates for a range, discard not-interesting channels, move forward). I agree on the NetworkGraph+chain_hash, though it can be done as a follow-up.

Collaborator

Right, I guess my thinking here is in part because we haven't implemented the peer_handler part here, and I don't know how interesting it would be to have a generic API which allows all kinds of requests if we only ever make one or two request types in our own code - my gut says probably not so interesting.


// First we obtain a lock on the task hashmap. In order to avoid borrowing issues
// we will access the task as needed.
let mut query_range_tasks = self.chan_range_query_tasks.lock().unwrap();
Collaborator

I'm not convinced we need the state tracking/validation at all - we query a peer for all the channel IDs in a given range, and it feeds us back channel IDs, we then do a ton of validation to make sure they line up with exactly the query we sent before asking for the channel announcements. In the case the peer is malicious and trying to make us waste upload bandwidth by sending more queries, they could tell us about fake channels in the right range (or just send us pings), and in the case they're buggy we should find that out when they send us the channel announcements in question and they fail to validate.


Yeah, I had the same question about the purpose of the state tracking: #736 (comment). My conclusion was that it's required by the way the spec is structured. Now, is the overall gossip spec DoS-safe? I don't think so, if you assume disk I/O on unbounded channel_announcement reception.

I don't think out-of-range channel announcements will necessarily fail to validate if we don't yet have the block state available?

let task = query_range_tasks.get_mut(their_node_id).unwrap();
let mut short_channel_ids = Vec::new();
std::mem::swap(&mut short_channel_ids, &mut task.short_channel_ids);
self.query_short_channel_ids(their_node_id, task.chain_hash, short_channel_ids)?;
Collaborator

Why not send the queries as we go? With that I think we can drop all of the state tracking here, which would clean up the code a ton and make reasoning about state and DoS issues much simpler.

Contributor Author

I don't think we can make the SCID queries entirely stateless. The spec only allows a single query of each type to be in-flight at a time. A large channel range query will stream multiple batches of SCIDs. Because we can only have a single SCID query in-flight, we need to store the subsequent batches and wait until we get an _end message before we can initiate another SCID query. For example:

|------ query_channel_range               ----->|
|<----- reply_channel_range (batch 1)     ------|
|<----- reply_channel_range (batch 2)     ------|
|<----- reply_channel_range (batch n)     ------|
|                                               |
|------ query_short_channel_ids (batch 1) ----->|
|<----- many chan_ann, etc...             ------|
|<----- reply_short_channel_ids_end       ------|
|                                               |
|------ query_short_channel_ids (batch 2) ----->|
|<----- many chan_ann, etc...             ------|
|<----- reply_short_channel_ids_end       ------|
|                                               |
|------ query_short_channel_ids (batch n) ----->|
|<----- many chan_ann, etc...             ------|
|<----- reply_short_channel_ids_end       ------|

So at a bare minimum we'll need to serialize and defer SCID query events somehow, even if we initiate the first SCID query as soon as we receive the first batch.

The path I went down allows us to handle the three use cases described above, but is overkill if we only plan to support full syncs. I'm happy to come up with a few options to simplify things and report back!
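A minimal, self-contained sketch of the constraint described above (stand-in types; in this iteration the real struct is GossipQueryShortChannelIdsTask): SCIDs from reply_channel_range batches accumulate, only one query_short_channel_ids stays in flight, and the next batch is released when reply_short_channel_ids_end arrives.

// Sketch only: one in-flight SCID query per peer, remaining batches deferred.
const MAX_SCIDS_PER_QUERY: usize = 8000; // raw-encoding batch size discussed above

#[derive(Default)]
struct ScidQueryState {
    pending_scids: Vec<u64>,   // accumulated from reply_channel_range batches
    query_in_flight: bool,     // spec: only one query_short_channel_ids outstanding
}

impl ScidQueryState {
    /// Called for every reply_channel_range we receive. Returns the SCIDs to
    /// put in a query_short_channel_ids now, if one can be sent.
    fn on_reply_channel_range(&mut self, scids: Vec<u64>) -> Option<Vec<u64>> {
        self.pending_scids.extend(scids);
        self.maybe_next_batch()
    }

    /// Called when reply_short_channel_ids_end arrives for the in-flight query.
    fn on_reply_short_channel_ids_end(&mut self) -> Option<Vec<u64>> {
        self.query_in_flight = false;
        self.maybe_next_batch()
    }

    fn maybe_next_batch(&mut self) -> Option<Vec<u64>> {
        if self.query_in_flight || self.pending_scids.is_empty() {
            return None;
        }
        let n = self.pending_scids.len().min(MAX_SCIDS_PER_QUERY);
        let batch: Vec<u64> = self.pending_scids.drain(..n).collect();
        self.query_in_flight = true;
        Some(batch)
    }
}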

Collaborator

Ugh. I wonder if we should just get the spec to relax the query limit for short_channel_ids data so that we can make all our responses completely stateless (modulo network buffers). The alternative would be to tightly limit the response set size and maybe even do requests one block at a time to make sure we can set a tight limit and not have to worry about it. I do think we can still cut back a bunch on the amount of state we track to enforce checks that aren't really DoS risks.

@ariard commented Nov 4, 2020

Thanks @julianknutsen for the information, added a direct reference to your comment in #707 backlog.

@ariard commented Nov 4, 2020

Thanks for the great new comments at 6a59483.

@TheBlueMatt We might bound the size of the scid query after reply_channel_range batch completion to avoid upload bandwidth/disk-IO DoS? Assuming we pick a per-peer bound, we could use the historical average number of channel openings per block over the last year as a heuristic, plus some safety margin. The hard limit is max_number_of_p2wsh_per_block.

Also, you assume a DoS adversary is able to game our gossip-sync peer selection. We may recommend picking peers randomly as a first mitigation step.

That said, I think we should defer this DoS discussion to future PRs; cutting the implementation back to what the spec requires isn't going to solve the issue, as ultimately it's a question of receiving non-sanitized data from untrusted peers. IMO, we should first a) know the default bandwidth/disk resources we assume on an RL client (but maybe configurable?), b) define how we select gossip-sync peers, and c) allocate a gossip budget to each peer (expressed in byte_per_block_scanned?).

@TheBlueMatt (Collaborator)

We might bound the size of the scid query after reply_channel_range batch completion to avoid upload bandwidth/disk-IO DoS?

I can't say I'm super concerned about upload bandwidth - assuming we use a similar approach to the existing map sync stuff, where we only send such messages with low priority and let channel-related messages go first, it's not any worse than what a peer can do with ping/pongs anyway. As for disk, for now our map is all in memory, though it would be something worth considering if we ever move to a disk-backed version.

Also, you assume a DoS adversary is able to game our gossip-sync peer selection. We may recommend picking peers randomly as a first mitigation step.

Maybe? I assume we'll want to sync with multiple peers at least, eventually doing some reconciliation. I think we can pretty tightly limit things and then we don't have to worry about it, though.

That said, I think we should defer this DoS discussion to future PRs

I don't see why? We have to deal with it sooner or later, might as well get it right in the first go? I think it should be pretty doable to make things pretty cheap here.

@bmancini55 (Contributor Author)

I wonder if we should just get the spec to relax the query limit for short_channel_ids data

It would simplify things on the query end. IIRC, it's a simple DoS prevention mechanism so the query recipient can throttle by taking time responding to a query.

The alternative would be to tightly limit the response set size and maybe even do requests one block at a time to make sure we can set a tight limit and not have to worry about it.

One-by-one is a possibility, though our peers might not appreciate that haha. Alternatively, we could process just the first reply in a sequence. When the scid query completes for the first reply, we issue a new range query starting next in the sequence. The only state we would need in both of these options is the block height of where we are in the sync process.

To simplify as-is, we could prune the interface to only a sync method and use a single SyncTask struct to track completion of the range query and hold ScidQueryEvents that get flushed in sequence. It's not substantially different than the current code, but simplifies some of the moving pieces.

Generally I wasn't overly concerned with state because we're only going to initiate a sync with a small number of peers and we have an upper-bound on allocation of 16MB (250 messages limit * ~8000 scid * 8-bytes per scid) before we abort and clean up. But I certainly understand a desire to go stateless from an allocation and complexity standpoint!

Regardless of state, I also agree that we could simplify reply_channel_range validation logic without loss of functionality. If a node is going to send bogus data, the sequence validation isn't going to help us. We'll still need to validate whether the responder failed the query and if the query is complete. If we would rather leave the validation in, I'm ok with that too haha

@TheBlueMatt (Collaborator)

IIRC, it's a simple DoS prevention mechanism so the query recipient can throttle by taking time responding to a query.

Right, I see the motivation there, but it doesn't really work that way - in practice you're never going to implement it by not processing a new inbound message until you've finished uploading the responses from the previous message (and checking with the kernel that it's already been sent?), so a peer can almost always send several in a row.

I think maybe the thinking might have been more along the lines of making sure other messages can be sent in between by encoding into the spec the idea of not having too much in-flight at once, though that also doesn't solve the problem (there's plenty of other issues around fresh announcements flooding message queues) and you really need a solution like ours where you have (effectively, in our case) a separate outbound message queue for "background" messages and one for channel messages.

I don't think it's unreasonable to change the spec here, though other implementations may disagree. I'd be curious to know if any implementations actually have the ability to enforce the "one-in-flight" rule, anyway.

@bmancini55 (Contributor Author) commented Dec 1, 2020

Sorry for the delay, took some time off with the holiday and just got a chance to circle back to this. Latest update pushed as described above.

A few notes about adding the BlockHash to NetworkGraph

Happy to modify this if needed.

Support for the gossip_queries feature flag (bits 6/7) is added to the
Features struct. This feature is available in the Init and Node
contexts. The gossip_queries feature is not fully implemented so this
feature is disabled when sent to peers in the Init message.
To enable gossip_queries message decoding, this commit implements the
wire module's Encoding trait for each message type. It also adds these
messages to the wire module's Message enum and the read function to
enable decoding of a buffer.
@bmancini55 force-pushed the gossip_queries branch 2 times, most recently from 66a8917 to d3f7a9d on December 1, 2020 22:54
@TheBlueMatt (Collaborator) left a comment

This is looking good. Mostly trivial nitpicky stuff here. I didn't test it against any live nodes, have you gotten a chance to?

@@ -827,6 +827,25 @@ pub trait RoutingMessageHandler : Send + Sync {
fn get_next_node_announcements(&self, starting_point: Option<&PublicKey>, batch_amount: u8) -> Vec<NodeAnnouncement>;
/// Returns whether a full sync should be requested from a peer.
fn should_request_full_sync(&self, node_id: &PublicKey) -> bool;
/// Initiates routing gossip sync by querying a peer to discover channels
Collaborator

While you're at this can you update the docs on the above method to be more specific that this is just about the initial sync bit in Init messages, and that many lightning implementations ignore this completely - maybe even rename it should_set_initial_sync_bit or so.

/// Handling requests with first_blocknum very far away may trigger repeated
/// disk I/O if the NetworkGraph is not fully in-memory.
fn handle_query_channel_range(&self, _their_node_id: &PublicKey, _msg: &QueryChannelRange) -> Result<(), LightningError> {
// TODO
Collaborator

Is it safe to presume you'll be working on this sooner or later? :)

Contributor Author

Indeed, I want to get outbound queries buttoned up before diving into responding to them.

/// Initiates routing gossip sync by querying a peer to discover channels
/// and their associated routing gossip messages. This method will use a
/// sync strategy defined by the implementor.
fn sync_routing_table(&self, their_node_id: &PublicKey);
Collaborator

It may be useful to pass in the Init message here so that the NetGraphMessageHandler can analyze the InitFeatures. Same goes for should_request_full_sync - that way we can only return true in the second when the peer doesn't support gossip queries and send a query_channel_range when they do.

if msg.features.supports_data_loss_protect() { "supported" } else { "not supported"},
if msg.features.initial_routing_sync() { "requested" } else { "not requested" },
if msg.features.supports_upfront_shutdown_script() { "supported" } else { "not supported"},
if msg.features.supports_gossip_queries() { "supported" } else { "not supported" },
Collaborator

I presume we should call sync_routing_table at the bottom of this block where we call peer_connected?

@bmancini55 (Contributor Author)

This is looking good. Mostly trivial nitpicky stuff here. I didn't test it against any live nodes, have you gotten a chance to?

Yep, tested against LND with minor modifications to PeerManager to call sync_routing_table when the peer connects, as you noted above.

On that, I was trying to keep the touches on PeerManager to a minimum and was planning to submit another PR to conditionally send the query or set the sync feature bit. Should I just make those changes now so this PR can button up initiating queries?

@bmancini55 (Contributor Author) commented Dec 3, 2020

New updates:

  • changed handler methods to pass ownership of gossip messages and cleaned up the comments as suggested
  • pass Init message to sync_routing_table
  • call sync_routing_table after receipt of Init from peer

In terms of adding Init to should_request_full_sync, we use this method in two places:

  1. when outbound, we send Init immediately after finalizing noise. In this case we do not yet have knowledge of the peer's features
  2. after we receive an Init message and the peer is an inbound peer

What I'm doing is using should_request_full_sync to set the initial_routing_sync feature bit and flag the peer as requesting a sync. Then once we receive the peer's Init message, we use that flag to control whether sync_routing_table should be called.

This bit of orchestration could be removed, along with removing should_request_full_sync from the RoutingMessageHandler trait if we follow the path of other implementations and never set the initial_routing_sync feature bit.
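A self-contained sketch of that orchestration, to make the two call sites concrete (stand-in types and method names; an illustration only, not the actual PeerManager code):

// Sketch only: coordinate should_request_full_sync (outbound Init, before we
// know the peer's features) with sync_routing_table (after their Init arrives).
struct PeerSketch { sync_requested: bool }

struct PeerManagerSketch;

impl PeerManagerSketch {
    fn should_request_full_sync(&self, _peer: &PeerSketch) -> bool { true }

    // Outbound: Init is sent right after the noise handshake, so all we can do
    // is set the initial_routing_sync bit and remember that we want a sync.
    fn build_outbound_init(&self, peer: &mut PeerSketch) -> bool {
        peer.sync_requested = self.should_request_full_sync(peer);
        peer.sync_requested // whether to set the initial_routing_sync feature bit
    }

    // On receipt of the peer's Init, use the earlier flag to decide whether to
    // kick off a gossip_queries sync via the routing message handler.
    fn on_init_received(&self, peer: &PeerSketch) {
        if peer.sync_requested {
            self.sync_routing_table(peer);
        }
    }

    fn sync_routing_table(&self, _peer: &PeerSketch) {
        // In the real code this calls RoutingMessageHandler::sync_routing_table.
    }
}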

@bmancini55 changed the title from "Initiate gossip_queries part 1" to "Initiate gossip_queries" on Dec 3, 2020
@ariard left a comment

Sorry for the delay, the stateless approach is so much simpler!

What I'm doing is using should_request_full_sync to set the initial_routing_sync feature bit and flag the peer as requesting a sync. Then once we receive the peer's Init message, we use that flag to control whether sync_routing_table should be called.

IIUC, setting initial_routing_sync in our Init message should trigger the receiving peer to send us a complete routing table dump, at least that's the behavior recommended by the spec? But how many implementations still support this behavior? It sounds like a huge upstream bandwidth leak. I'm of the opinion we should remove it completely and only rely on gossip queries.

Also have a look at yesterday's lightning-dev meeting (http://gnusha.org/lightning-dev/2020-12-07.log), where we discussed both #804/#811

  • it has been proposed to repurpose full_information of the channel range query as a clear EOF, thus standardizing behavior among implementations
  • we discussed introducing a new no-zlib-support feature, easing discovery of peers that support raw encoding for gossip queries

And we still have to make the stateless approach lawful by allowing concurrent gossip queries.

What I'm thinking of is landing this PR once the minor points are solved and then writing down the spec updates to propose changes matching our requirements?

/// Returns whether a full sync should be requested from a peer.
/// Returns whether a full sync should be requested from a peer. The result is
/// used to set the initial_routing_sync feature bit when sending an Init message.
/// This feature bit is not used by some implementations in favor of using gossip_queries.

You can add the implementations and versions which don't support initial_routing_sync as part of the comment; helpful for debugging on our side :)

/// Handles when a peer asks us to send routing gossip messages for a
/// list of short_channel_ids. There are potential DoS vectors when handling
/// inbound queries. Handling requests with first_blocknum very far away may
/// trigger repeated disk I/O if the NetworkGraph is not fully in-memory.

Alternatively, you could gather the potential DoS vectors in an "Implementor DoS Warnings" section in the RoutingMessageHandler trait doc, with the trade-offs if the graph isn't in memory. I'm fine with it being in the trait function docs; your call if you want to remove the reference to NetworkGraph.

@bmancini55 (Contributor Author) commented Dec 9, 2020

IIUC, setting initial_routing_sync in our Init message should trigger the receiving peer to send us a complete routing table dump, at least that's the behavior recommended by the spec? But how many implementations still support this behavior? It sounds like a huge upstream bandwidth leak. I'm of the opinion we should remove it completely and only rely on gossip queries.

When we enable the gossip_queries feature in our init message, the remote peer will ignore the initial_routing_sync flag. I know LND dropped support for initial_routing_sync entirely, regardless of whether you signal gossip_queries to it. I haven't recently tested Eclair or CL to see if they still support initial_routing_sync when gossip_queries isn't negotiated.

For outbound sync, I'm in agreement that we should not set initial_routing_sync and only rely on gossip_queries. This has the benefit of removing that conditional logic from PeerManager and removing should_request_full_sync from RoutingMessageHandler. Sync can then be fully encapsulated inside the implementation of RoutingMessageHandler.

Also have a look at yesterday's lightning-dev meeting (http://gnusha.org/lightning-dev/2020-12-07.log), where we discussed both #804/#811

Thanks, unfortunately I had a conflict and was only able to log on for the second half! Both of those proposals sound good though.

What I'm thinking of is landing this PR once the minor points are solved and then writing down the spec updates to propose changes matching our requirements?

Sounds good, I'll make these changes tomorrow!

This change enables initiating gossip queries with a peer using the
SendMessageEvent enum. Specifically we add an event for sending
query_channel_range to discover the existence of channels and an event
for sending query_short_channel_ids to request routing gossip messages
for a set of channels. These events are handled inside the process_events
method of PeerManager which sends the serialized message to the peer.
Defines message handlers for gossip_queries messages in the RoutingMessageHandler
trait. The MessageSendEventsProvider supertrait is added to RoutingMessageHandler
so that the implementor can use SendMessageEvents to send messages to a
peer at the appropriate time.

The trait methods are stubbed in NetGraphMsgHandler, which implements
RoutingMessageHandler, and return a "not implemented" error.
Performing a sync of routing gossip messages with a peer requires a two-step
process where we first initiate a channel range query to discover
channels in a block range. Next we request the routing gossip messages
for discovered channels.

This code implements logic in NetGraphMsgHandler for performing these two
tasks while taking into account the specification and variance in
implementation.
This change adds the genesis block hash as a BlockHash to the
NetworkGraph struct. Making the NetworkGraph chain-aware allows the message
handler to validate the chain_hash of received messages. This change
also adds the hash value to the Writeable and Readable methods.
This commit simplifies the sync process for routing gossip messages. When
a sync is initiated, the process is handled statelessly by immediately
issuing SCID queries as channel range replies are received. This greatly
simplifies the state machine at the cost of full validation of, and strict
conformance to, the current spec.
This change modifies gossip_queries methods in RoutingMessageHandler to
move the message instead of passing a reference. This allows the message
handler to be more efficient by not requiring a full copy of SCIDs
passed in messages.
This commit modifies sync_routing_table in RoutingMessageHandler to
accept a reference to the Init message received by the peer. This allows
the method to use the Peer's features to drive the operations of the
gossip_queries routing table sync.
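Putting the commit messages above together, here is a hedged, self-contained sketch of the resulting flow (stand-in types; the FULL_SYNCS_TO_REQUEST value and helper names are assumptions, only the overall shape follows the commits and the trait signature quoted below):

// Sketch only: stateless gossip_queries sync as described by the commits above.
type NodeId = [u8; 33];

struct InitSketch { gossip_queries_supported: bool }

struct NetGraphMsgHandlerSketch { full_syncs_requested: u8 }

const FULL_SYNCS_TO_REQUEST: u8 = 5; // illustrative value

impl NetGraphMsgHandlerSketch {
    // Called by the peer manager after the peer's Init message is received.
    fn sync_routing_table(&mut self, peer: NodeId, init: &InitSketch) {
        // Rely on gossip_queries only; initial_routing_sync is effectively deprecated.
        if !init.gossip_queries_supported { return; }
        // Only ask a limited number of peers for a full dump of the graph.
        if self.full_syncs_requested >= FULL_SYNCS_TO_REQUEST { return; }
        self.full_syncs_requested += 1;
        // Full range: every channel from the genesis block onward.
        self.send_query_channel_range(peer, 0, 0xffff_ffff);
        // From here the flow is stateless: each reply_channel_range received is
        // answered immediately with a query_short_channel_ids for its SCIDs.
    }

    fn send_query_channel_range(&self, _peer: NodeId, _first_blocknum: u32, _number_of_blocks: u32) {
        // In the real handler this pushes a gossip-query send event for the peer.
    }
}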
@TheBlueMatt (Collaborator)

What I'm thinking of is landing this PR once the minor points are solved and then writing down the spec updates to propose changes matching our requirements?

This sounds good to me. Especially given it works in the real world, I see little reason to sit on it forever.

@TheBlueMatt (Collaborator) left a comment

LGTM. One note about docs but I'm happy with it.

/// Initiates routing gossip sync by querying a peer to discover channels
/// and their associated routing gossip messages. This method will use a
/// sync strategy defined by the implementor.
fn sync_routing_table(&self, their_node_id: &PublicKey, init: &Init);
Collaborator

Since it's ultimately up to the implementer what they want to do here, it may make sense for the docs to be a little less prescriptive (i.e., after the first few nodes you may not want to always sync the routing table) and just describe that this is called on connection and that implementers should decide if they want to sync the table.

@ariard left a comment

Thanks for the changes, just a small test coverage issue and SGTM :)

{
let (secp_ctx, net_graph_msg_handler) = create_net_graph_msg_handler();
let init_msg = Init { features: InitFeatures::known() };
for n in 1..6 {

You need to iterate the for loop until 7, otherwise it doesn't test the sync limitation to FULL_SYNCS_TO_REQUEST?

Diff:

diff --git a/lightning/src/routing/network_graph.rs b/lightning/src/routing/network_graph.rs
index d6c72d08..4b0c2087 100644
--- a/lightning/src/routing/network_graph.rs
+++ b/lightning/src/routing/network_graph.rs
@@ -235,10 +235,10 @@ impl<C: Deref + Sync + Send, L: Deref + Sync + Send> RoutingMessageHandler for N
                        return ();
                }
 
-               // Check if we need to perform a full synchronization with this peer
-               if !self.should_request_full_sync(their_node_id) {
-                       return ();
-               }
+               //// Check if we need to perform a full synchronization with this peer
+               //if !self.should_request_full_sync(their_node_id) {
+               //      return ();
+               //}
 
                let first_blocknum = 0;
                let number_of_blocks = 0xffffffff;

Contributor Author

Dag-nabbit. Fixed! Thanks for catching that. A minor refactor of the test bit me.

This commit changes outbound routing table sync to use gossip_queries
instead of the effectively deprecated initial_routing_sync feature.

This change removes setting of initial_routing_sync in our outbound Init
message. Instead we now call sync_routing_table after receiving an Init
message from a peer. If the peer supports gossip_queries and
should_request_full_sync returns true, we initiate a full gossip_queries
sync.
This method was used to set the initial_routing_sync flag when sending
an outbound Init message to a peer. Since we are now relying on
gossip_queries instead of initial_routing_sync, synchronization can be
fully encapsulated in RoutingMessageHandler via sync_routing_table.

This commit removes should_request_full_sync from the trait
RoutingMessageHandler. The implementation is still used in
NetGraphMsgHandler and has been converted into a private method instead
of a trait function.
@ariard commented Dec 15, 2020

Code Review ACK e0bb63b

Corrects the comment for sync_routing_table in RoutingMessageHandler to
be less prescriptive about the implementor's actions.
@TheBlueMatt merged commit 74cd96f into lightningdevkit:main on Dec 15, 2020