Initiate gossip_queries #736
Conversation
Structure looks good modulo one note; it would be nice to get at least some tests of the message-handler implementation for NetGraphMsgHandler.
lightning/src/ln/msgs.rs
@@ -826,6 +826,33 @@ pub trait RoutingMessageHandler : Send + Sync {
fn should_request_full_sync(&self, node_id: &PublicKey) -> bool;
}

/// A trait that describes an object that can receive gossip queries messages
I think we should just integrate this into RoutingMessageHandler. It does make RoutingMessageHandler more complex, but that's the way the protocol went, and the existing functions are not very useful without this either. That would also create significantly fewer conflicts with other work, as Julian notes.
Sounds good, I've been thinking the same thing. Thanks!
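For illustration, folding the query handlers into the existing routing trait might look roughly like the sketch below; the types here are simplified stand-ins, not the actual rust-lightning signatures.

// Simplified stand-ins for the real types; field sets are illustrative only.
pub type PublicKey = [u8; 33];
pub struct LightningError { pub err: String }
pub struct ReplyChannelRange { pub first_blocknum: u32, pub number_of_blocks: u32, pub short_channel_ids: Vec<u64> }
pub struct ReplyShortChannelIdsEnd { pub full_information: bool }

/// One trait covering both the legacy sync hooks and the gossip_queries flow,
/// rather than a separate "gossip queries" trait.
pub trait RoutingMessageHandler: Send + Sync {
    fn should_request_full_sync(&self, node_id: &PublicKey) -> bool;
    // gossip_queries additions:
    fn query_channel_range(&self, their_node_id: &PublicKey, first_blocknum: u32, number_of_blocks: u32) -> Result<(), LightningError>;
    fn handle_reply_channel_range(&self, their_node_id: &PublicKey, msg: ReplyChannelRange) -> Result<(), LightningError>;
    fn handle_reply_short_channel_ids_end(&self, their_node_id: &PublicKey, msg: ReplyShortChannelIdsEnd) -> Result<(), LightningError>;
}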
@julianknutsen, thanks, I'll check it out.
Overall looks good! Mainly see my point on DisconnectPeer: I would be more liberal with gossip query errors from remote peers, as severing connections has consequences for channel processing. Also, in those cases the BOLT 7 recommendation is a MAY, so we really have room here. I wouldn't be surprised if this protocol is a bit buggy in other implementations, and we should avoid error contamination.
lightning/src/ln/peer_handler.rs
MessageSendEvent::SendChannelsQuery { ref node_id, ref msg } => {
log_trace!(self.logger, "Handling SendChannelsQuery event in peer_handler");
let (mut descriptor, peer) = get_peer_for_forwarding!(node_id, {
//TODO: Do whatever we're gonna do for handling dropped messages
I think we can just do nothing here. I assume gossip queries will be scheduled by NetGraphMsgHandler until it is "satisfied" with its view of the network, and it should be ready to deal with peers that have a poor view of the network, which is the same as not finding the peer.
If this is too slow and peers are too unreliable, we may add an Event, but honestly let's not overload the number of events to manage.
/// The node_id of the node which should receive this message
node_id: PublicKey,
/// The message which should be sent.
msg: msgs::QueryShortChannelIds,
I think a worthy follow-up would be to coordinate better between query-gossips-at-connection and query-gossips-when-network-view-is-bad. AFAICT this patchset aims to query on each connection, which might be too bandwidth-hungry. I was thinking you could have a timer_refresh_network_view_with_queries in NetGraphMsgHandler, on the model of PeerHandler::timer_tick_occured().
Ah yes, sorry about that. I was doing some dev testing and left code that performs a sync on each connection. I still need to clean up the PeerHandler code to respect should_request_full_sync.
Thanks, I'll check out PeerHandler::timer_tick_occured(). For periodic anti-entropy, it might be nice to connect to a random node from the network view and attempt a sync.
Updates:
More to follow!
@bmancini55 Is this ready for another round of review? :) EDIT: I'll have a look at the new tests anyway.
@ariard I'll be adding more commits this week, so it's not quite ready for a complete look yet. The tests should be good to review though, thanks!
So far the code is pretty well-documented and coverage sounds exhaustive (missing handle_query_channel_range/handle_query_short_channels_ids IIUC).
Still have a look at the mutation I provided and the point on the task chain hash; I'm not sure it's correct.
If you have follow-ups in mind, you're welcome to track them in an appropriate issue; let's avoid this PR becoming obese.
let first_blocknum = 0;
let number_of_blocks = 0xffff_ffff;

// When no active query exists for the node, it should send a query message and generate a task
I wonder if the active query should time out at some point in case of an unresponsive peer?
Definitely agree on this point. Also need to clean up tasks on peer disconnection.
let mut query_range_tasks_lock = self.query_channel_range_tasks.lock().unwrap();
let query_range_tasks = &mut *query_range_tasks_lock;
if query_range_tasks.contains_key(their_node_id) {
return Err(LightningError {
If it's a caller requirement to track the channel_range query, I think this might be a hint of a higher-level logic bug. Though I don't have a clear model of this method's usage.
My thinking is that this will be called from peer_handler after init negotiation, when should_request_full_sync returns true. This guard prevents multiple in-flight query_channel_range messages, which would be spec non-compliant. This could be changed to queue channel range queries, but I thought outright rejection was simpler for now.
Relatedly, my plan is to make full_syncs_requested decrement on any failure in handle_reply_channel_range so that a full sync can be requested with the next peer.
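A minimal sketch of the guard being described, under assumed names (the quoted code keys per-peer tasks in a Mutex<HashMap>):

use std::collections::HashMap;
use std::sync::Mutex;

struct ChanRangeQueryTask; // per-peer query state (details omitted)

struct Handler {
    query_channel_range_tasks: Mutex<HashMap<[u8; 33], ChanRangeQueryTask>>,
}

impl Handler {
    /// Reject a second query_channel_range to the same peer while one is in flight,
    /// since the spec allows only one outstanding query of each type per peer.
    fn start_range_query(&self, their_node_id: &[u8; 33]) -> Result<(), String> {
        let mut tasks = self.query_channel_range_tasks.lock().unwrap();
        if tasks.contains_key(their_node_id) {
            return Err("query_channel_range already in flight for this peer".to_string());
        }
        tasks.insert(*their_node_id, ChanRangeQueryTask);
        Ok(())
    }
}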
Right, we can switch it to an assert later when the higher-level sync logic is implemented.
@ariard 🤦 major apologies for that sloppiness!! I have fixed the issues with
If you're ok with it, this PR can be limited to:
Subsequent PRs can:
If that sounds good, I will remove temporary code from peer_handler, squash commits appropriately, and flip this to ready for review.
SGTM with proposed scope!
(force-pushed from 24ebffc to a2325a5)
Codecov Report
@@ Coverage Diff @@
## main #736 +/- ##
==========================================
- Coverage 91.33% 91.16% -0.18%
==========================================
Files 37 37
Lines 22423 22623 +200
==========================================
+ Hits 20481 20624 +143
- Misses 1942 1999 +57
Continue to review full report at Codecov.
(force-pushed from a2325a5 to fabd072)
Sounds almost good to me; the only call is for better naming/comments. Thanks for the great code comments, they make understanding this fuzzy query protocol easier.
Note to self: review test coverage again at the next round and then it's done.
/// added to it while we are waiting for a reply_short_channel_ids_end.
/// The pending short_channel_ids will be chunked and sent to the remote
/// peer.
pub struct GossipQueryShortChannelIdsTask {
Can you add a comment on why we need to maintain state for the queried remote peer and can't simply send the query in a stateless flow? AFAIU, it's only used to sanitize ReplyShortChannelIdsEnd. Maybe in the future it should be extended with a timestamp to know when to expire queries and retry with another peer.
Comment added, hopefully that clarifies the state.
if task_received_first_block.is_none() {
// The replies must include our range but can be a superset of the
// queried block range. We start by validating the reply's
// first_blocknum is less than or equal to the start of our query.
You mean "superior or equal" ? At least that was the check msg.first_blocknum > task_first_blocknum
is enforcing.
Clarified the comment here
/// Maximum number of reply_channel_range messages we will allow
/// in reply to a query_channel_range.
const MAX_REPLY_CHANNEL_RANGE_PER_QUERY: usize = 250;
Are those values from BOLT 7? I can't find them there.
They are not defined in BOLT 7. I updated the comments with more information. 8000 is used in other implementations as the raw batch size to fit inside the 65535-byte message limit. I can fill you in on max replies over another channel if you want some additional insight.
Good, the 65535 as an upper bound makes sense to me.
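For context, a rough sketch of the sizing reasoning (the constant names below are assumptions, not spec values): 8,000 SCIDs at 8 bytes each is 64,000 bytes, which leaves room for the header and other fields under the 65,535-byte message limit.

/// Not defined by BOLT 7; chosen so an uncompressed SCID batch fits in one message.
const MAX_SCIDS_PER_QUERY_MSG: usize = 8000;
/// Implementation-defined cap on reply_channel_range messages accepted per query.
const MAX_REPLY_CHANNEL_RANGE_PER_QUERY: usize = 250;

fn main() {
    let scid_bytes = MAX_SCIDS_PER_QUERY_MSG * 8; // each short_channel_id is 8 bytes
    assert!(scid_bytes < 65_535, "batch must fit inside the lightning message size limit");
    println!("{} SCIDs -> {} bytes per query, up to {} replies per range query",
        MAX_SCIDS_PER_QUERY_MSG, scid_bytes, MAX_REPLY_CHANNEL_RANGE_PER_QUERY);
}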
@julianknutsen @ariard I'm happy to rebase this PR if desired. The changes in peer_manager are fairly minimal for this PR. @ariard, I'm looking over your comments and will make those changes shortly. Thanks!
@bmancini55 It can go first, it's closer to merge-maturity than #691 :) @julianknutsen What's the real issue with the ordering? AFAICT there is only one potentially conflicting file.
The merge conflicts should be easy to resolve as they all follow existing patterns. A more interesting question to answer is how much test coverage is needed in the peer_manager code, and whether there should be an effort to increase code coverage over time when new features are added.
The tests in #692 introduce easy-to-use testing frameworks and exhaustive unit tests of the callback code that help ensure all the code is tested at some level. Those will likely need to be extended to include the new branches this PR introduces. If #692 were to go first, this patch would result in a code coverage drop in the report, which would be an automated flag to help catch the testing issues. Not hard, just something to add to the TODO list, and it is easy for these things to fall through the cracks with multiple changes to the same code.
I'd recommend this goes first to limit the patch maintenance, as 692 is unlikely to be merged in its current form, but I wanted to share the pros/cons to help make the best decision.
(force-pushed from fabd072 to 6a59483)
Sorry about the delay, was off for the last week. I think we can cut back a lot of code here by being a bit more willing to respond to bogus-but-not-harmful data from peers.
lightning/src/ln/msgs.rs
@@ -824,6 +824,27 @@ pub trait RoutingMessageHandler : Send + Sync {
fn get_next_node_announcements(&self, starting_point: Option<&PublicKey>, batch_amount: u8) -> Vec<NodeAnnouncement>;
/// Returns whether a full sync should be requested from a peer.
fn should_request_full_sync(&self, node_id: &PublicKey) -> bool;
/// Queries a peer for a list of channels with a funding UTXO in the requested
/// chain and range of blocks.
fn query_channel_range(&self, their_node_id: &PublicKey, chain_hash: BlockHash, first_blocknum: u32, number_of_blocks: u32) -> Result<(), LightningError>;
I think these fn interfaces can be cut back a bit - at least chain_hash can be dropped, because NetworkGraph objects only actually support one chain (though maybe we should track it in the NetworkGraph!), and I'm not sure exactly how this would be used, so I'm not sure why we need number_of_blocks - can we just always set it to MAX_INT? It may help to document exactly how these are intended to be used, though I've left some further motivating comments below.
There are three use cases we may be interested in:
- A full sync
- A partial sync
- Perform an SCID query if we receive an unknown channel_update
Focusing on the full sync, we can drop the first_blocknum and number_of_blocks arguments and simplify the interface to a single method, perform_routing_sync(&self, their_node_id: &PublicKey) or similarly named. This could initiate the message flow that covers both channel discovery and requesting the info. It will depend on how we want to handle state (more on that below).
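The simplified interface being floated here would be roughly the following (hypothetical name, sketch only):

pub type PublicKey = [u8; 33]; // stand-in for secp256k1::PublicKey

pub trait RoutingMessageHandler: Send + Sync {
    /// Single entry point: kicks off channel discovery and the follow-up SCID
    /// queries using whatever strategy the implementor chooses.
    fn perform_routing_sync(&self, their_node_id: &PublicKey);
}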
I think you may want to keep number_of_blocks; you might have clients with limited storage that thus build their network map iteratively (request channel announcements/updates for a range, discard uninteresting channels, move forward). I agree on the NetworkGraph + chain_hash point, though it can be done as a follow-up.
Right, I guess my thinking here is in part because we haven't implemented the peer_handler part here, and I don't know how interesting it would be to have a generic API which allows all kinds of requests if we only ever make one or two request types in our own code - my gut says probably not so interesting.
// First we obtain a lock on the task hashmap. In order to avoid borrowing issues
// we will access the task as needed.
let mut query_range_tasks = self.chan_range_query_tasks.lock().unwrap();
I'm not convinced we need the state tracking/validation at all - we query a peer for all the channel IDs in a given range, and it feeds us back channel IDs, we then do a ton of validation to make sure they line up with exactly the query we sent before asking for the channel announcements. In the case the peer is malicious and trying to make us waste upload bandwidth by sending more queries, they could tell us about fake channels in the right range (or just send us pings), and in the case they're buggy we should find that out when they send us the channel announcements in question and they fail to validate.
Yeah, I had the same question about the purpose of the state tracking (#736 (comment)); my conclusion was that it's required by the way the spec is structured. Now, is the overall gossip spec DoS-safe? I don't think so if you assume disk I/O on unbounded channel_announcement reception.
I also don't think out-of-range channel announcements will necessarily fail to validate, if we have the block state available?
let task = query_range_tasks.get_mut(their_node_id).unwrap();
let mut short_channel_ids = Vec::new();
std::mem::swap(&mut short_channel_ids, &mut task.short_channel_ids);
self.query_short_channel_ids(their_node_id, task.chain_hash, short_channel_ids)?;
Why not send the queries as we go? With that I think we can drop all of the state tracking here, which would clean up the code a ton and make reasoning about state and DoS issues much simpler.
I don't think we can make the SCID queries entirely stateless. The spec only allows a single query of each type to be in-flight at a time. A large channel range query will stream multiple batches of SCIDs. Because we can only have a single SCID query in-flight, we need to store the subsequent batches and wait until we get an _end message before we can initiate another SCID query. For example:
|------ query_channel_range ----->|
|<----- reply_channel_range (batch 1) ------|
|<----- reply_channel_range (batch 2) ------|
|<----- reply_channel_range (batch n) ------|
| |
|------ query_short_channel_ids (batch 1) ----->|
|<----- many chan_ann, etc... ------|
|<----- reply_short_channel_ids_end ------|
| |
|------ query_short_channel_ids (batch 2) ----->|
|<----- many chan_ann, etc... ------|
|<----- reply_short_channel_ids_end ------|
| |
|------ query_short_channel_ids (batch n) ----->|
|<----- many chan_ann, etc... ------|
|<----- reply_short_channel_ids_end ------|
So at a bare minimum we'll need to serialize and defer SCID query events somehow, even if we initiate the first SCID query as soon as we receive the first batch.
The path I went down allows us to handle the three use cases described above, but it is overkill if we only plan to support full syncs. I'm happy to come up with a few options to simplify things and report back!
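A minimal sketch of the state this flow implies (hypothetical names): SCIDs from each reply_channel_range batch are buffered and drained one query_short_channel_ids at a time, gated on the _end message.

use std::collections::VecDeque;

/// Per-peer gossip-query state: SCIDs learned from reply_channel_range that
/// have not yet been asked about, plus whether an SCID query is outstanding.
struct ScidQueryState {
    pending_scids: VecDeque<u64>,
    scid_query_in_flight: bool,
}

impl ScidQueryState {
    /// Called for each reply_channel_range batch: stash the SCIDs for later.
    fn on_reply_channel_range(&mut self, scids: Vec<u64>) {
        self.pending_scids.extend(scids);
    }

    /// Called when nothing is in flight (and on reply_short_channel_ids_end):
    /// take the next batch and return it for sending, marking a query in flight.
    fn next_scid_batch(&mut self, batch_size: usize) -> Option<Vec<u64>> {
        if self.scid_query_in_flight || self.pending_scids.is_empty() {
            return None;
        }
        let n = batch_size.min(self.pending_scids.len());
        let batch: Vec<u64> = self.pending_scids.drain(..n).collect();
        self.scid_query_in_flight = true;
        Some(batch)
    }

    fn on_reply_short_channel_ids_end(&mut self) {
        self.scid_query_in_flight = false;
    }
}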
Ugh. I wonder if we should just get the spec to relax the query limit for short_channel_ids data so that we can make all our responses completely stateless (modulo network buffers). The alternative would be to tightly limit the response set size and maybe even do requests one block at a time to make sure we can set a tight limit and not have to worry about it. I do think we can still cut back a bunch on the amount of state we track to enforce checks that aren't really DoS risks.
Thanks @julianknutsen for the information, added a direct reference to your comment in the #707 backlog.
Thanks for the great new comments at 6a59483.
@TheBlueMatt We might bound the size of the… Also, you assume a DoS adversary is able to game our gossip-sync peer selection; we may recommend picking those peers randomly to mitigate, as a first step.
That said, I think we should defer this DoS discussion to future PRs; cutting the implementation back to what the spec requires isn't going to solve the issue, as ultimately it's a question of receiving non-sanitized data from untrusted peers. IMO, we should first a) know the default bandwidth/disk resources we assume on a RL client (but maybe configurable?), b) define how we select gossip-sync peers, and c) allocate a gossip budget to each peer (expressed in byte_per_block_scanned?).
I can't say I'm super concerned about upload bandwidth - assuming we use a similar approach to the existing map sync stuff, where we only send such messages with low priority and let channel-related messages go first, it's not any worse than what a peer can do with ping/pongs anyway. As for disk, for now our map is all in memory, though it would be worth considering if we ever move to a disk-backed version.
Maybe? I assume we'll want to sync with multiple peers at least, eventually doing some reconciliation. I think we can pretty tightly limit things and then we don't have to worry about it, though.
I don't see why? We have to deal with it sooner or later, might as well get it right in the first go? I think it should be pretty doable to make things pretty cheap here.
It would simplify things on the query end. IIRC, it's a simple DoS prevention mechanism so the query recipient can throttle by taking time responding to a query.
One-by-one is a possibility, though our peers might not appreciate that haha. Alternatively, we could process just the first reply in a sequence. When the
To simplify as-is, we could prune the interface to only a sync method and use a single
Generally I wasn't overly concerned with state because we're only going to initiate a sync with a small number of peers and we have an upper bound on allocation of 16MB (250 message limit * ~8000
Regardless of state, I also agree that we could simplify
Right, I see the motivation there, but it doesn't really work that way - in practice you're never going to implement it by not processing a new inbound message until you've finished uploading the responses to the previous message (and checking with the kernel that it's already been sent?), so a peer can almost always send several in a row. I think maybe the thinking was more along the lines of making sure other messages can be sent in between, by encoding into the spec the idea of not having too much in flight at once, though that also doesn't solve the problem (there are plenty of other issues around fresh announcements flooding message queues) and you really need a solution like ours where you have (effectively, in our case) a separate outbound message queue for "background" messages and one for channel messages. I don't think it's unreasonable to change the spec here, though other implementations may disagree. I'd be curious to know if any implementations actually have the ability to enforce the "one-in-flight" rule, anyway.
(force-pushed from 50c6b53 to 9b289c0)
Sorry for the delay, took some time off with the holiday and just got a chance to circle back to this. Latest update pushed as described above. A few notes about adding the
Happy to modify this if needed.
Support for the gossip_queries feature flag (bits 6/7) is added to the Features struct. This feature is available in the Init and Node contexts. The gossip_queries feature is not fully implemented so this feature is disabled when sent to peers in the Init message.
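As a toy illustration of the bits-6/7 flag (even bit 6 = required, odd bit 7 = optional); this is a simplified sketch, not the actual Features implementation:

/// gossip_queries occupies feature bits 6 (required) and 7 (optional).
/// Feature bitfields are big-endian, so bits 0-7 live in the last byte.
fn supports_gossip_queries(feature_bytes: &[u8]) -> bool {
    feature_bytes.last().map_or(false, |&b| (b & (1 << 6)) != 0 || (b & (1 << 7)) != 0)
}

fn main() {
    assert!(supports_gossip_queries(&[1u8 << 7])); // optional bit set
    assert!(supports_gossip_queries(&[1u8 << 6])); // required bit set
    assert!(!supports_gossip_queries(&[0u8]));
}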
To enable gossip_queries message decoding, this commit implements the wire module's Encoding trait for each message type. It also adds these messages to the wire module's Message enum and the read function to enable decoding of a buffer.
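A toy version of the decode dispatch described here (the gossip-query message type numbers 261-265 come from BOLT 7; everything else is simplified, with per-message decoding elided):

enum Message {
    QueryShortChannelIds(Vec<u8>),
    ReplyShortChannelIdsEnd(Vec<u8>),
    QueryChannelRange(Vec<u8>),
    ReplyChannelRange(Vec<u8>),
    GossipTimestampFilter(Vec<u8>),
    Unknown(u16),
}

/// Every lightning wire message starts with a big-endian u16 type; dispatch on it
/// and hand the remaining bytes to the per-message decoder.
fn read(buf: &[u8]) -> Option<Message> {
    if buf.len() < 2 {
        return None;
    }
    let msg_type = u16::from_be_bytes([buf[0], buf[1]]);
    let payload = buf[2..].to_vec();
    Some(match msg_type {
        261 => Message::QueryShortChannelIds(payload),
        262 => Message::ReplyShortChannelIdsEnd(payload),
        263 => Message::QueryChannelRange(payload),
        264 => Message::ReplyChannelRange(payload),
        265 => Message::GossipTimestampFilter(payload),
        t => Message::Unknown(t),
    })
}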
(force-pushed from 66a8917 to d3f7a9d)
This is looking good. Mostly trivial nitpicky stuff here. I didn't test it against any live nodes, have you gotten a chance to?
lightning/src/ln/msgs.rs
@@ -827,6 +827,25 @@ pub trait RoutingMessageHandler : Send + Sync {
fn get_next_node_announcements(&self, starting_point: Option<&PublicKey>, batch_amount: u8) -> Vec<NodeAnnouncement>;
/// Returns whether a full sync should be requested from a peer.
fn should_request_full_sync(&self, node_id: &PublicKey) -> bool;
/// Initiates routing gossip sync by querying a peer to discover channels
While you're at this, can you update the docs on the above method to be more specific that this is just about the initial sync bit in Init messages, and that many lightning implementations ignore this completely - maybe even rename it should_set_initial_sync_bit or so.
/// Handling requests with first_blocknum very far away may trigger repeated
/// disk I/O if the NetworkGraph is not fully in-memory.
fn handle_query_channel_range(&self, _their_node_id: &PublicKey, _msg: &QueryChannelRange) -> Result<(), LightningError> {
// TODO
Is it safe to presume you'll be working on this sooner or later? :)
Indeed, I want to get outbound queries buttoned up before diving into responding to them.
lightning/src/ln/msgs.rs
/// Initiates routing gossip sync by querying a peer to discover channels
/// and their associated routing gossip messages. This method will use a
/// sync strategy defined by the implementor.
fn sync_routing_table(&self, their_node_id: &PublicKey);
It may be useful to pass in the Init message here so that the NetGraphMessageHandler can analyze the InitFeatures. Same goes for should_request_full_sync - that way we can return true in the latter only when the peer doesn't support gossip queries, and send a query_channel_range when they do.
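A sketch of the gating being suggested, with simplified stand-in types (the real InitFeatures methods appear elsewhere in this thread):

struct InitFeatures { gossip_queries: bool }
struct Init { features: InitFeatures }

impl InitFeatures {
    fn supports_gossip_queries(&self) -> bool { self.gossip_queries }
}

struct NetGraphMsgHandler;

impl NetGraphMsgHandler {
    /// Called once the peer's Init has been received: only start a gossip_queries
    /// sync when the peer actually advertises support for it.
    fn sync_routing_table(&self, _their_node_id: &[u8; 33], init: &Init) {
        if !init.features.supports_gossip_queries() {
            return;
        }
        // ... queue a query_channel_range send event for this peer ...
    }
}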
if msg.features.supports_data_loss_protect() { "supported" } else { "not supported"},
if msg.features.initial_routing_sync() { "requested" } else { "not requested" },
if msg.features.supports_upfront_shutdown_script() { "supported" } else { "not supported"},
if msg.features.supports_gossip_queries() { "supported" } else { "not supported" },
I presume we should call sync_routing_table at the bottom of this block where we call peer_connected?
Yep, tested against LND with minor modifications to
On that, I was trying to keep the touches on
New updates:
In terms of adding
What I'm doing is using should_request_full_sync to set the initialize_routing_sync feature bit and flag the peer as requesting a sync. Then once we receive the peer's Init message, we use that flag to control if sync_routing_table should be called.
This bit of orchestration could be removed, along with removing
Sorry for the delay; the stateless approach is so much simpler!
What I'm doing is using should_request_full_sync to set the initialize_routing_sync feature bit and flag the peer as requesting a sync. Then once we receive the peer's Init message, we use that flag to control if sync_routing_table should be called.
IIUC, setting initialize_routing_sync on our Init message should trigger the receiving peer to send us a complete routing table dump, at least that's the behavior recommended by the spec? But how many implementations still support this behavior? It sounds like a huge upstream bandwidth leak. I'm of the opinion to remove it completely and rely only on gossip queries.
Also have a look at yesterday's lightning-dev meeting: http://gnusha.org/lightning-dev/2020-12-07.log, where we discussed both #804/#811:
- it has been proposed to repurpose full_information of the channel range query as a clear EOF, thus standardizing behavior among implementations
- we discussed introducing a new no-zlib-support feature, thus easing discovery of peers that support raw encoding for gossip queries
And we still have to make the stateless approach spec-legal by allowing concurrent gossip queries.
What I'm thinking is that we land this PR once the minor points are solved, and then write down spec updates to propose changes matching our requirements?
lightning/src/ln/msgs.rs
/// Returns whether a full sync should be requested from a peer.
/// Returns whether a full sync should be requested from a peer. The result is
/// used to set the initial_routing_sync feature bit when sending an Init message.
/// This feature bit is not used by some implementations in favor of using gossip_queries.
You can add implementations and their versions which don't support initial_routing_sync as part of the comment; helpful for debugging on our side :)
lightning/src/ln/msgs.rs
/// Handles when a peer asks us to send routing gossip messages for a
/// list of short_channel_ids. There are potential DoS vectors when handling
/// inbound queries. Handling requests with first_blocknum very far away may
/// trigger repeated disk I/O if the NetworkGraph is not fully in-memory.
Alternatively, you can gather potential DoS vectors in an "Implementor DoS Warnings" section in the RoutingMessageHandler trait doc, with the trade-offs if the graph isn't in memory. I'm fine with it being in the trait function docs; your call if you want to remove the reference to NetworkGraph.
When we enable
For outbound sync, I'm in agreement that we should not set
Thanks, unfortunately I had a conflict and was only able to log on for the second half! Both of those proposals sound good though.
Sounds good, I'll make these changes tomorrow!
This change enables initiating gossip queries with a peer using the SendMessageEvent enum. Specifically, we add an event for sending query_channel_range to discover the existence of channels, and an event for sending query_short_channel_ids to request routing gossip messages for a set of channels. These events are handled inside the process_events method of PeerManager, which sends the serialized message to the peer.
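A rough sketch of the event plumbing this commit describes (simplified types and variant names, not the exact rust-lightning definitions):

type PublicKey = [u8; 33];
struct QueryChannelRange { first_blocknum: u32, number_of_blocks: u32 }
struct QueryShortChannelIds { short_channel_ids: Vec<u64> }

enum MessageSendEvent {
    SendChannelRangeQuery { node_id: PublicKey, msg: QueryChannelRange },
    SendShortIdsQuery { node_id: PublicKey, msg: QueryShortChannelIds },
}

/// process_events-style handling: look the peer up by node_id, serialize the
/// message, and enqueue it on that peer's outbound buffer (here, via a callback).
fn process_events(events: Vec<MessageSendEvent>, mut send: impl FnMut(PublicKey, Vec<u8>)) {
    for event in events {
        match event {
            MessageSendEvent::SendChannelRangeQuery { node_id, msg } => {
                // a real implementation would encode the wire message here
                send(node_id, format!("query_channel_range {} {}", msg.first_blocknum, msg.number_of_blocks).into_bytes());
            }
            MessageSendEvent::SendShortIdsQuery { node_id, msg } => {
                send(node_id, format!("query_short_channel_ids x{}", msg.short_channel_ids.len()).into_bytes());
            }
        }
    }
}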
Defines message handlers for gossip_queries messages in the RoutingMessageHandler trait. The MessageSendEventsProvider supertrait is added to RoutingMessageHandler so that the implementor can use SendMessageEvents to send messages to a peer at the appropriate time. The trait methods are stubbed in NetGraphMsgHandler, which implements RoutingMessageHandler, and return a "not implemented" error.
Performing a sync of routing gossip messages with a peer requires a two-step process: first we initiate a channel range query to discover channels in a block range, then we request the routing gossip messages for the discovered channels. This code implements logic in NetGraphMsgHandler for performing these two tasks while taking into account the specification and variance in implementations.
This change adds the genesis block hash as a BlockHash to the NetworkGraph struct. Making the NetworkGraph chain-aware allows the message handler to validate the chain_hash for received messages. This change also adds the hash value to the Writeable and Readable methods.
This commit simplifies the sync process for routing gossip messages. When a sync is initiated, the process is handled statelessly by immediately issuing SCID queries as channel range replies are received. This greatly simplifies the state machine at the cost of fully validating and conforming to the current spec.
This change modifies gossip_queries methods in RoutingMessageHandler to move the message instead of passing a reference. This allows the message handler to be more efficient by not requiring a full copy of SCIDs passed in messages.
This commit modifies sync_routing_table in RoutingMessageHandler to accept a reference to the Init message received by the peer. This allows the method to use the Peer's features to drive the operations of the gossip_queries routing table sync.
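Putting the last few commits together, the stateless flow roughly becomes: each reply_channel_range is answered immediately with a query_short_channel_ids for the SCIDs it carried. A simplified sketch under assumed names:

use std::sync::Mutex;

struct ReplyChannelRange { short_channel_ids: Vec<u64> }
struct QueryShortChannelIds { short_channel_ids: Vec<u64> }
enum MessageSendEvent { SendShortIdsQuery { node_id: [u8; 33], msg: QueryShortChannelIds } }

struct NetGraphMsgHandler { pending_events: Mutex<Vec<MessageSendEvent>> }

impl NetGraphMsgHandler {
    /// Stateless handling: no per-peer task tracking; the SCIDs in each reply
    /// are immediately turned into a query_short_channel_ids send event.
    fn handle_reply_channel_range(&self, their_node_id: &[u8; 33], msg: ReplyChannelRange) {
        if msg.short_channel_ids.is_empty() {
            return;
        }
        self.pending_events.lock().unwrap().push(MessageSendEvent::SendShortIdsQuery {
            node_id: *their_node_id,
            msg: QueryShortChannelIds { short_channel_ids: msg.short_channel_ids },
        });
    }
}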
(force-pushed from 105df12 to 3df9fae)
This sounds good to me. Especially given it works in the real world, I see little reason to sit on it forever.
LGTM. One note about docs but I'm happy with it.
/// Initiates routing gossip sync by querying a peer to discover channels
/// and their associated routing gossip messages. This method will use a
/// sync strategy defined by the implementor.
fn sync_routing_table(&self, their_node_id: &PublicKey, init: &Init);
Since it's ultimately up to the implementer what they want to do here, it may make sense for the docs to be a little less prescriptive (i.e., after the first few nodes you may not want to always sync the routing table) and just describe that this is called on connection, and implementers should decide if they want to sync the table.
Thanks for the changes, just a small test coverage issue and SGTM :)
{
let (secp_ctx, net_graph_msg_handler) = create_net_graph_msg_handler();
let init_msg = Init { features: InitFeatures::known() };
for n in 1..6 {
You need to iterate the for-loop up to 7, otherwise it doesn't test the sync limit of FULL_SYNCS_TO_REQUEST?
Diff:
diff --git a/lightning/src/routing/network_graph.rs b/lightning/src/routing/network_graph.rs
index d6c72d08..4b0c2087 100644
--- a/lightning/src/routing/network_graph.rs
+++ b/lightning/src/routing/network_graph.rs
@@ -235,10 +235,10 @@ impl<C: Deref + Sync + Send, L: Deref + Sync + Send> RoutingMessageHandler for N
return ();
}
- // Check if we need to perform a full synchronization with this peer
- if !self.should_request_full_sync(their_node_id) {
- return ();
- }
+ //// Check if we need to perform a full synchronization with this peer
+ //if !self.should_request_full_sync(their_node_id) {
+ // return ();
+ //}
let first_blocknum = 0;
let number_of_blocks = 0xffffffff;
Dag-nabbit. Fixed! Thanks for catching that. A minor refactor of the test bit me.
This commit changes outbound routing table sync to use gossip_queries instead of the effectively deprecated initial_routing_sync feature. This change removes setting of initial_routing_sync in our outbound Init message. Instead we now call sync_routing_table after receiving an Init message from a peer. If the peer supports gossip_queries and should_request_full_sync returns true, we initiate a full gossip_queries sync.
This method was used to set the initial_routing_sync flag when sending an outbound Init message to a peer. Since we are now relying on gossip_queries instead of initial_routing_sync, synchronization can be fully encapsulated into RoutingMessageHandler via sync_routing_table. This commit removes should_request_full_sync from the RoutingMessageHandler trait. The implementation is still used in NetGraphMsgHandler and has been converted into a private method instead of a trait function.
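Putting these last two commits together, the final shape is roughly the sketch below (assumed names; the FULL_SYNCS_TO_REQUEST = 5 value is inferred from the test discussion above, not confirmed):

use std::sync::Mutex;

const FULL_SYNCS_TO_REQUEST: usize = 5; // assumed cap on peers we fully sync from

struct InitFeatures { gossip_queries: bool }
struct Init { features: InitFeatures }

struct NetGraphMsgHandler {
    full_syncs_requested: Mutex<usize>,
}

impl NetGraphMsgHandler {
    /// Formerly a trait method; now an internal detail used only to cap how many
    /// peers we ask for a full routing-table sync.
    fn should_request_full_sync(&self) -> bool {
        let mut requested = self.full_syncs_requested.lock().unwrap();
        if *requested < FULL_SYNCS_TO_REQUEST {
            *requested += 1;
            true
        } else {
            false
        }
    }

    /// Called after the peer's Init is received, instead of setting initial_routing_sync.
    fn sync_routing_table(&self, _their_node_id: &[u8; 33], init: &Init) {
        if !init.features.gossip_queries || !self.should_request_full_sync() {
            return;
        }
        // queue a query_channel_range for the full range:
        // first_blocknum = 0, number_of_blocks = 0xffff_ffff
    }
}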
(force-pushed from 3df9fae to e0bb63b)
Code Review ACK e0bb63b
Corrects the comment for sync_routing_table in RoutingMessageHandler to be less prescriptive about the implementor's actions.
Part 1 of initiating gossip_queries for #615. This PR includes: