Skip to content

(1/4) Enum Dispatch NOISE State Machine & Unit Tests #691

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

julianknutsen
Copy link

@julianknutsen julianknutsen commented Sep 14, 2020

PRs:
(1/4) Enum Dispatch NOISE State Machine & Unit Tests
(2/4) PeerManager Unit Tests and Refactoring
(3/4) Conduit fixups from 494
(4/4) Update OutboundQueue flush behavior and fix existing bugs that it found

Motivation
Refactor & Unit Tests
Fuzz Testing
New Features

Read First

This stack has been living as a PR against Arik's branch over the last 3-4 weeks and builds on his work from February #494. The initial feedback is positive enough to move it against rust-bitcoin/rust-lightning. The initial goal of the stack was to address the outstanding design feedback from 494 to get it into a stable place for merge. It has expanded into a fully tested and modular PeerManager.

Due to the complexity of rebasing 494 against master to make this clean, it has been squashed to a single commit at the beginning of this PR. That allows it to technically build against Master and remove the merge-commit at the end of the stack.

Outstanding issues from the original 494 review and where they are addressed:

  • Heap fragmentation in handshake path
    • perf: Replace Vec w/ static arrays for act buffers
    • perf: Remove Vec from chacha::encrypt/decrypt
  • Bug in decrypt path causing a panic on decryption failure
    • fix: Conduit::Decryptor should not panic on decryption failure
  • Bug that didn't allow receiving of partial ACT message
    • fix: Allow partial act messages to be received
  • Conduit vs. Encryptor/Decryptor
  • Separate crypto library - Unaddressed
    • Seems like an easy separate PR to get all these in the right place

Next Action: Reviewers

  • Final sign-off

History

Update 8-23-2020: Removed new features from PR and created separate discussion issues #672 & #673
Update 8-24-2020 After discussion, #672 is a requirement to not cause a regression. PR now has this feature.
Update-8-26-2020 Updated to use a no allocation design that was unlocked after implementing 673. This makes the allocation policy similar to the existing behavior on master.
Update-9-2-2020 Initial feedback reviewed and review patches added to PR. All outstanding issues are still marked as unresolved to make sure nothing gets lost. Reviewers have next action
Update-9-24-2020 Moved to a PR against rust-bitcoin/rust-lightning
Update-9-25-2020 Squashed 494 at front of PR so this can technically merge into master
Update-10-2-2020 Rebased against master
Update-10-5-2020 Code handoff to @Arik and @ariard for next steps

Intro

Comments/questions/clarifications all welcome. I've done my best to adhere to the style guidelines, but if there are glaring issues I can iterate on them. I'm also available over Slack for any direct communication if needed.

Motivation

I looked around for a good first project to help me orient with the codebase and understand a bit more about the development cycle. I initially started going down #528 and that led me to #494 which looks like it was on the path to making the peer_handler code a bit more modular (and testable), but has been in-progress for almost six months. Since the code is in flux, it would be hard to add any additional testing or refactoring to enable better testing without introducing large merge conflicts as a separate PR.

So, I took that code as a base and wrote a set of tests against the PeerHandshake interface to get a better sense for how all of the pieces fit together and it led me to a thread I kept pulling that eventually ended up in a full set of tests and refactoring for the handshake portion of the code. It follows the spirit of the PR by abstracting the handshake code in a way that enables easier testing and delegation of responsibilities. It also gave me the opportunity to improve the documentation and control flow between states in a way that is easy to test and maintain.

I've also taken the opportunity to clean up many outstanding warnings that the code currently generated to get it into a more complete state.

My hope is that these changes can help motivate integrating #494 in order to unlock future testing and architecture improvements.

Testing and Refactoring

After completing the initial set of tests for the PeerHandshake module there were a few pieces of code that were hard to test or impossible to test based on the current design. The structure of helper functions that used match to determine invalid transitions left room for ambiguity and made the test cases harder to identify and create.

To enable the final bit of testing and create a more maintainable design I moved towards a state pattern utilizing pure functions and enum-dispatch. This enabled more well-defined transitions as well as very simple unit tests.

At a high level, there is a simple state machine interface with one method:

pub(super) trait IHandshakeState {
	fn next(self, input: &[u8]) -> Result<(Option<Vec<u8>>, HandshakeState), String>;
 }

All state implementations follow the same pattern that is similar to a pure function
that consumes all inputs to produce an output. No global state.

impl IHandshakeState for State {
	fn next(self, input: &[u8]) -> Result<(Option<Vec<u8>>, HandshakeState), String> {
		// move all values from previous state
		
		// calculate needed values for state transitions and response act
		
		// build new state
		
		// return (response act, new_state)
	}
}

The PeerHandshake is now a very simple wrapper class that just maintains the current
state and marshals data to the next() functions:

pub fn process_act(&mut self, input: &[u8]) -> Result<(Option<Vec<u8>>, Option<(Conduit, PublicKey)>), String> {
	let cur_state = self.state.take().unwrap();

	let (act_opt, mut next_state) = cur_state.next(input)?;

	let result = match next_state {
		HandshakeState::Complete(ref mut conduit_and_pubkey) => {
			Ok((act_opt, conduit_and_pubkey.take()))
		},
		_ => { Ok((act_opt, None)) }
	};

	self.state = Some(next_state);

	result
}

This design also allows for simple unit testing of the state machine transitions and act byte return values. Here is the test that validates proper Act2 generation and transition:

// Responder::AwaitingActOne -> AwaitingActThree
// RFC test vector: transport-responder successful handshake
#[test]
fn awaiting_act_one_to_awaiting_act_three() {
	let test_ctx = TestCtx::new();
	let (act2, awaiting_act_three_state) = test_ctx.responder.next(&test_ctx.valid_act1).unwrap();

	assert_eq!(act2.unwrap(), test_ctx.valid_act2);
	assert_matches!(awaiting_act_three_state, ResponderAwaitingActThree(_));
}

// Responder::AwaitingActOne -> Error (bad version byte)
// RFC test vector: transport-responder act1 bad version test
#[test]
fn awaiting_act_one_to_awaiting_act_three_input_bad_version() {
	let test_ctx = TestCtx::new();
	let act1 = hex::decode("01036360e856310ce5d294e8be33fc807077dc56ac80d95d9cd4ddbd21325eff73f70df6086551151f58b8afe6c195782c6a").unwrap();

	assert_eq!(test_ctx.responder.next(&act1).err(), Some(String::from("unexpected version")));
}

Fuzz Testing

Write a new fuzz test for the PeerHandshake module that is able to generate failure paths that occur after a partial handshake sequence has been completed.

To enable this, a new testing object FuzzGen has been introduced that can deterministically generate bytes and bools based on the random fuzz input. These building blocks are enough for the test code to generates different execution paths and complete partial phases of the handshake protocol before generating an error.

When going through a test cycle where the handshake completes, it will also verify that the initiator and responder can communicate successfully through the returned conduits sending variable length data and validating the contents.

	// Possibly generate bad data for act1 and ensure that the responder does not panic
	let mut act1 = test_ctx.act1; // valid act1 for success case
	if generator.generate_bool()? { // invalid act1 for failure case
		act1 = (generator.generate_bytes(50)?).to_vec();
		used_generated_data = true;
	}

	let mut act2 = match test_ctx.responder_handshake.process_act(&act1) {
		Ok((Some(act2), None)) => {
			// extremely unlikely we randomly generate a correct act1, but if so.. reset this
			used_generated_data = false;
			act2
		}
		Err(_) => {
			assert!(used_generated_data);
			return Err("generated expected failure with bad data".to_string());
		}
		_ => panic!("responder required to output act bytes and no conduit/pubkey")
	}

@julianknutsen julianknutsen changed the title Enum Dispatch NOISE State Machine & Unit Tests (1/4) Enum Dispatch NOISE State Machine & Unit Tests Sep 14, 2020
@julianknutsen julianknutsen marked this pull request as ready for review September 15, 2020 00:29
// Alias type to help differentiate between temporary key and chaining key when passing bytes around
type ChainingKey = [u8; 32];

// Generate a SHA-256 hash from one or more elements concatenated together
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial @arik-so review: I still think this macro might be more legibly put in a function, or a custom hash object/class. But, as I said before, ultimately this is up to Matt and Antoine.

Overall, I think it makes the code less readable and introduces a wrapper type that has to be unwrapped which I think detracts from the audit-ability. I'm also not sure about the aversion to variadic macros. I needed a way to do the hash for 1 or 2 objects and used a macro that made the code (in my point of view) more legible.

With that said, I did implement it to get your take and see which way the code should go. Just let me know and I can fixup this PR to have the final result.

julianknutsen/rust-lightning@0b4c819

@julianknutsen julianknutsen changed the title (1/4) Enum Dispatch NOISE State Machine & Unit Tests (1/3) Enum Dispatch NOISE State Machine & Unit Tests Sep 15, 2020
@codecov
Copy link

codecov bot commented Sep 15, 2020

Codecov Report

Merging #691 into main will increase coverage by 0.48%.
The diff coverage is 90.07%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #691      +/-   ##
==========================================
+ Coverage   92.01%   92.49%   +0.48%     
==========================================
  Files          37       42       +5     
  Lines       20106    21431    +1325     
==========================================
+ Hits        18500    19823    +1323     
- Misses       1606     1608       +2     
Impacted Files Coverage Δ
lightning/src/ln/peers/handler.rs 55.50% <47.58%> (ø)
lightning/src/ln/peers/handshake/acts.rs 94.64% <94.64%> (ø)
lightning/src/ln/peers/chacha.rs 95.65% <95.65%> (ø)
lightning/src/ln/peers/handshake/mod.rs 95.74% <95.74%> (ø)
lightning/src/ln/peers/conduit.rs 96.42% <96.42%> (ø)
lightning/src/ln/peers/handshake/states.rs 97.58% <97.58%> (ø)
lightning/src/ln/peers/hkdf5869rfc.rs 100.00% <100.00%> (ø)
lightning/src/chain/chainmonitor.rs 100.00% <0.00%> (ø)
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e48f8e3...30286c0. Read the comment docs.

@julianknutsen julianknutsen force-pushed the new-state-machine branch 2 times, most recently from d2fd4ee to 72f0221 Compare September 15, 2020 23:20
@arik-so
Copy link
Contributor

arik-so commented Sep 16, 2020

I don't see any feedback I've left in the upstream iteration that hasn't been addressed at this point, so I'm with this PR to proceed in its current state.

@julianknutsen julianknutsen changed the title (1/3) Enum Dispatch NOISE State Machine & Unit Tests (1/4) Enum Dispatch NOISE State Machine & Unit Tests Sep 18, 2020
@julianknutsen
Copy link
Author

Rebased against upstream and added licenses to new files.

@@ -21,16 +21,11 @@
pub mod channelmanager;
pub mod channelmonitor;
pub mod msgs;
pub mod peer_handler;
pub mod peers;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets move peers up to its own top-level module.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just straight for #494, but it seems fine to make it a top-level. Added #718

//! Conduit enables message encryption and decryption, and automatically handles key rotation.

mod chacha;
pub mod handler;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should more than this mod be pub? Or is the intent to merge first and then make things public later?

Copy link
Author

@julianknutsen julianknutsen Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it was just for fuzz tests. Fixed in: review: Fix peers::handler visibility

I think the right thing is merge first and then fix visibility issues later. This is the only needed by the tokio sample right now.

struct Peer {
channel_encryptor: PeerChannelEncryptor,
encryptor: PeerState,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we ever use their_node_id on outbound peers before we finish connecting? Can we move it into PeerState::Connected and get rid of the Option<>?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, and the asymmetry between when this was set and when it wasn't was confusing. This has been changed in #692 so that it is an output of the NOISE handshake completion.

julianknutsen and others added 11 commits October 2, 2020 12:58
This will make future changes that stabilize this work easier to review
and merge.

Co-authored-by: Arik Sosman <[email protected]>
Co-authored-by: Julian Knutsen <[email protected]>
Lots of build warnings still, but it builds and tests pass.
Add unit tests against the PeerHandshake public interface prior to
changing any of the implementation details.

This is a solid stand-alone patch, even if the subsequent refactor and
design changes are not merged. A future patch uses the test vectors from
the RFC to minimize the code under test.
The implementation logic is duplicated from handshake/mod.rs so that it
can be tested and refactored independent of the current implementation.

This uses enum dispatch coupled with separately 'moved' state objects
that allow for simple testing, easy state inspection, and understandable
transition logic between states. This pattern also removes the need for
 match statements inside implementation logic in favor of the instance
representing the current state and available data.

Every state transition is implemented in a next() function and is a
straight forward translation from an object representing the existing
known data and an input act message (bytes).

Example usage:

let (act_data, next_state) = cur_state.next(input_data);

This state machine will eventually be moved into place inside the
PeerHandshake which will act as a simple wrapper that just maintains the
state machine and marshals the data between the peer_handler and states.
PeerHandshake is now a simple object around the new state machine that
simply marshals data between the callers and state machine states. For
now, keep the APIs identical to make the patches easier to read.

The only interesting piece to note here is the handling of
get_remote_pubkey() to keep parity with the existing implementation.
Future patches will remove this separate function in favor of callers
taking ownership of it once the handshake is complete.
Introduce the sha256!() and concat!() macros that are used to
improve readability of the noise exchanges.
Small inconvenience with the HandshakeHash return value that will be
taken care of by the end of this patch stack.
Fix inconsistent use of concat vs. Sha256::engine() input. These can
be deduplicated at the end.
Identify potential issue regarding the invalid case where
the read_buffer has data after completion.
Use the slice refs from the act array instead.
Write a new fuzz test for the PeerHandshake module that is able
to generate failure paths that occur after a partial handshake sequence
has been completed.

To enable this, a new testing object FuzzGen has been introduced that
can deterministically generate bytes and bools based on the random fuzz
input. These building blocks are enough for the test code to generates
different execution paths and complete partial phases of the handshake
protocol before generating an error.

When going through a test cycle where the handshake completes, it will
also verify that the initiator and responder can communicate
successfully through the returned conduits sending variable length data
and validating the contents.
Functionality moved to:
- peers/handshake/mod.rs
- peers/handshake/states.rs
- peers/conduit.rs

Fuzz testing rewritten here:
- fuzz/src/peer_crypt.rs
Fix regression introduced in eb6a371 that would error and subsequently
cause a disconnect if a partial act was sent to the state machine, even
if future calls could be added to the read buffer to create a
valid act.

Add appropriate unit tests and update fuzz testing to send partial acts
some of the time.
introduced in eb6a371

On decryption failure, the Decryptor should return an error to the
caller so it can disconnect instead of panicking.

Implement this behavior by plumbing Result through the iterator code
path and adding the appropriate tests and comments.
Enhanced the fuzz testing to simulate more test cases:
* Valid or random act data
* Optionally sending garbage data along with act
* Optionally sending act data in multiple segments

Found regression introduced in eb6a371 relating to the Conduit
panicking during an hmac failure.
To avoid tuple returns from a single initialization function, create a
setup_outbound() function that is only callable by the outbound
PeerHandshake, but doesn't leak implementation details.
Proper handshake peers won't send more bytes than required for
act1 and act2 and it is impossible to generate future acts
without first receiving an interleaved response. In those cases, the
peer is misbehaving and we can fail early.

This unlocks a Vec-less design in the handshake code now that
there is an upper limit on read buffer sizes.
After reviewing lightningdevkit#494 there was design feedback on the use of Vec vs.
arrays in this layer of the code for fragmentation reasons. To get ahead
of the same issues in this new state machine and to limit the behavior
changes from master, remove Vec from the state machine in favor of thin
wrappers around fixed-size arrays.

This patch reintroduces the Act object (a thin wrapper around fixed size
arrays) with some convenience features to make them a bit easier to
pass around and build.

The Handshake code is still note Vec-clean as the encryption code uses
Vecs during encryption, but it is one step closer to passing back slices
to the peer_handler.
In order to limit perf/allocation changes from master, remove the last
use of Vec from the handshake code.

The decrypt/encrypt code now takes an out parameter to place the
decrypted bytes. All handshake users know the fixed-size of the output
decryption so they can use arrays. The Conduit code will still allocate
a Vec based on the input message and pass it through.

After this patch, the handshake code is now Vec-less and copies are
minimized.
Clean up small items based on review comments that are expected to be
non-controversial so they can all go in one commit.
Follow existing pattern used in util/ for consistency.
Don't rely on the initialized buffer filled with zeros.
Instead of returning a subslice of input for the caller to process, just
return the number of bytes consume and let them do the accounting.
@julianknutsen
Copy link
Author

Rebased against upstream and resolved merges.

use ln::peers::{chacha, hkdf5869rfc};
use util::byte_utils;

pub(super) type SymmetricKey = [u8; 32];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: better to newtype this (ie SymmetricKey([u8; 32])) so that the typesystem enforces explicit conversion to/from [u8; 32].


}

pub(super) struct Encryptor {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe these can be non-pub (I know they are used in later PRs, but I don't think they should be, see comment there).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dereferenced in handler.rs.

I'm not a fan of the pattern, but that was how 494 existed and I kept it that way until the Transport encapsulation was done.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I think we agree, re: #693 (comment)

}
}

impl IHandshakeState for InitiatorStartingState {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT, the trait is never really used, even in tests - here and for all the individual messages, we could just have a "dumb" impl block and things would work as well, and then the trait is only ever used by the HandshakeState enum, which means it could just be dropped.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a defensive programming technique to ensure all of the individual state enums all implement the same common interface. That way, the wrapper enum can be simple delegation.

It was the idiomatic rust enum dispatch method I found. Other projects leverage macros and crates, but I chose to keep it more verbose to not introduce additional dependencies.

Here is the trait definition that I hope explains it. If you want this removed just let me know, I don't need to bikeshed about it.

// Trait for all individual states to implement that ensure HandshakeState::next() can
// delegate to a common function signature. May transition to the same state in the event there are
// not yet enough bytes to move forward with the handshake.
pub(super) trait IHandshakeState {
	fn next(self, input: &[u8]) -> Result<(Option<Act>, HandshakeState), String>;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I guess I don't have a super strong feeling on using a trait to show identical operation, though it wasn't immediately obvious to me that that was its purpose given its public across the module (but only used in the states.rs file) and implemented by both the top-level enum and structs inside. Maybe best to move it in states.rs and add a comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I see the value of IHandshakeState. Nothing enforces that the enum states must implement this trait.

}

// Handshake state of the Initiator prior to generating Act 1
pub(super) struct InitiatorStartingState {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the structs here need to be pub(super)? It looks like they're only ever accessed in two tests to check state, which could just as easily be a test-only method returning a state name or something.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they are contained by the HandshakeState enums which can't leak private types in the mod.rs usages.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, seems like we could make HandshakeState a struct and move the enum inside it to reduce the exposed surface of states.rs significantly?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of reducing the exposed surface? This is just an enum used inside the handshake module. Would you rather just inline it all in one file? That is sort of what it sounds like.

This is how the enum dispatch pattern works in Rust and it seems like we are just talking past each other. If you are uncomfortable with this pattern in your codebase feel free to have @arik-so or @ariard change it or discard the PRs.

Copy link
Contributor

@jkczyz jkczyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I wasn't able to review this earlier. Leaving some comments here while I go through each commit in my spare cycles. I really like the direction especially with the additional unit tests! I'd like to see more of this in the project going forward.

While I know your timeline and our priorities don't match right now, I'm happy to continue reviewing this and help get it merged. Please don't feel obligated to respond to the comments. Just want you to know this is on my radar.

Comment on lines 453 to 469
// Default Outbound::Uninitiated
#[test]
fn peer_handshake_new_outbound() {
let test_ctx = TestCtx::new();

assert_matches!(test_ctx.outbound_handshake.state, Some(HandshakeState::Uninitiated));
assert_eq!(test_ctx.outbound_handshake.get_remote_pubkey(), Some(test_ctx.inbound_public_key));
}

// Default Inbound::AwaitingActOne
#[test]
fn peer_handshake_new_inbound() {
let test_ctx = TestCtx::new();

assert_matches!(test_ctx.inbound_handshake.state, Some(HandshakeState::AwaitingActOne(_)));
assert!(test_ctx.inbound_handshake.get_remote_pubkey().is_none());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding test naming, we haven't been too consistent. I'm hoping we can come up with a convention describing what is being tested without being overly verbose. To that end, I'd like to propose removing the "peer_handshake_" prefix since it is already part of the fully-qualified function names, which appear in the test output (e.g., ln::peers::handshake::test::peer_handshake_new_outbound).

Then these two may be named new_outbound_uninitiated and new_inbound_awaiting_act_one.

Generally, I'd prefer naming like <action>_<details>[_<outcome>] in a manner that reads well. Consider some of the wire tests like :

https://github.com/rust-bitcoin/rust-lightning/blob/2dd8b3e896374006c898243537b78bb1ecc4fb6d/lightning/src/ln/wire.rs#L316

That said, there is wiggle room especially when it comes to initialization tests and ones below where there are state transitions and different initial states (e.g. inbound vs outbound). Submodules could potentially be used to clarify the latter, but here I think the flow of the tests as they are is likely a better approach.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your reasoning seems fine. To help the next contributor, I would suggest creating a test style guide and then modifying/enforcing that style for existing code or new contributions. As a new contributor, I looked for uniformity in the existing codebase. Since it didn't exist I found something that made sense to me. I would expect others to do the same without more guidelines.

You will notice in later patches that the Transport unit tests utilize comments for the given/when/then pattern to help readers understand the tests. Would likely want to be cleaned up to match your expectations.

I agree that test naming consistency should exist for this project. I just wonder whether or not one-off changes like you suggest here will raise the quality bar long term over a more specific style guide that you can point to for new developers.

Either way, easy to sed all of this to match what you want.


// Outbound::Uninitiated -> AwaitingActTwo
#[test]
fn peer_handshake_outbound_uninitiated_to_awaiting_act_two() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and other state transition tests below, I'd be fine with simply removing the "peer_handshake_" prefix.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, great place for a test style guide. I think the path should include enough relevant information to remove the prefix.

($e:expr, $state_match:pat) => {
match $e {
$state_match => (),
_ => panic!()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to make this output something indicating the mismatch?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect you can just put a format string in the panic with the two items, but I'm not 100% sure how that works with patterns and if it gives you what you want.


// Inbound::AwaitingActOne -> AwaitingActThree
#[test]
fn peer_handshake_new_inbound_awaiting_act_one_to_awaiting_act_three() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the other state transition tests, move the successful test case before the failure cases. Or vice versa, if your prefer.

// Outbound::AwaitingActTwo -> Complete (with extra data)
// Ensures that any remaining data in the read buffer is transferred to the conduit once
// the handshake is complete
// TODO: Is this valid? Don't we expect peers to need ActThree before sending additional data?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in Slack. It is probably best to view this first PR in the "files changed" view instead of commit-by-commit. Otherwise, you will see all of the bugs from 494 before they were fixed in subsequent commits.

}

// Inbound::AwaitingActThree -> None
// TODO: should this transition to Complete instead of None?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likely it should. @arik-so?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed in Slack. It is probably best to view this first PR in the "files changed" view instead of commit-by-commit. Otherwise, you will see all of the bugs from 494 before they were fixed in subsequent commits.

Comment on lines +203 to +208
pub fn read_buffer_length(&self) -> usize {
match &self.read_buffer {
&Some(ref vec) => { vec.len() }
&None => 0
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider having an assert against the content rather than simply exposing the length.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was intentional. My test philosophy is to not add test-assert methods as extensions of existing classes. Those are used primarily in test-only Mock objects. Exposing internal lengths isn't great, but it is better than not having test coverage at all. Another great place for this project have a common pattern or doc that you can point new developers towards. Otherwise, the reviews seem pretty disjoint depending on which reviewer looks at the code and their own personal style.

I agree that this pattern definitely points to a poor design choice in 494 to have the Conduit(Encryptor/Decryptor) created and instantiated as part of the handshake code. This came up in one of Matt's comments later on as well.

An alternative is to have the handshake code return the number of bytes consumed and the higher layers can instantiate the Conduit after the handshake is complete. This would allow the tests to verify the number of bytes read matches expectations and removes this test-only method.

This wasn't done in the original patches due to the existing layering pre-Transport. The PeerManager code handling the number of bytes read as well as the completed handshake state made the code hard to reason about and maintain. This may be much simpler now that more distinct layers exist.

In many of the changes I made here, I chose to keep the existing design choices from 494 to limit the exposed surface for feedback and iteration. I was under the assumption that 494 was reviewed from a design perspective and just needed to be cleaned up a bit. Obviously, that wasn't that case but should explain some of the confusing choices that were made in these commits.

@julianknutsen
Copy link
Author

@jkczyz Glad it is starting to find a champion on the core team. That will hopefully give it some of the push it needs to get across the finish line.

I've added my comments to help the next person to fix this up so they don't get too far off-track.

One of the challenges this set of patches is going to face moving forward is letting perfect be the enemy of good. There has been a lot of duplicate feedback and discussions over the last 6 weeks about specific style and design choices. It is definitely frustrating as a contributor to go through multiple rounds of revisions with one set of reviewers and then have another set of reviewers come in at the 11th hour and bring up old discussion items or ask for core items to be changed.

I would recommend working on the messaging for new developers and calling out which core devs have the final say on patches so that the contributors can target them specifically early on. That will hopefully help others avoid the same fate.

There is a lot of really good testing and refactoring here that drastically improves the maintainability and testing of this section of LDK. Considering the state the code was in prior to the patches, the existing bugs, and the general lack of testing, it may make sense to take an iterative approach to this code instead of trying to get everything perfect in one go.

It would be a shame for it to spend another few months going stale and conflicting code being merged just for it to end up like #494.

@julianknutsen
Copy link
Author

@ariard @arik-so @jkczyz

I thought I would check back after a few weeks and answer any other questions that came up. It looks like this stalled out a bit and there aren't any comments or actionable steps to push it forward.

Would it make sense to just close this stack out? The core team looks to have other priorities and it isn't clear how this interacts with #736 that looks to be moving closer to a merge.

Let me know and I am happy to close this and the corresponding backlog issues so there isn't confusion as to what can/needs to get done in this layer.

@jkczyz
Copy link
Contributor

jkczyz commented Oct 27, 2020

@ariard @arik-so @jkczyz

I thought I would check back after a few weeks and answer any other questions that came up. It looks like this stalled out a bit and there aren't any comments or actionable steps to push it forward.

Would it make sense to just close this stack out? The core team looks to have other priorities and it isn't clear how this interacts with #736 that looks to be moving closer to a merge.

Let me know and I am happy to close this and the corresponding backlog issues so there isn't confusion as to what can/needs to get done in this layer.

Sorry for dropping the ball on this. I had just left for vacation when you responded to my comments, and I haven't had a chance to circle back now that I've returned. Been a little busy catching up on other work but will touch base with the others on how to proceed.

@julianknutsen
Copy link
Author

@ariard @arik-so @jkczyz
I thought I would check back after a few weeks and answer any other questions that came up. It looks like this stalled out a bit and there aren't any comments or actionable steps to push it forward.
Would it make sense to just close this stack out? The core team looks to have other priorities and it isn't clear how this interacts with #736 that looks to be moving closer to a merge.
Let me know and I am happy to close this and the corresponding backlog issues so there isn't confusion as to what can/needs to get done in this layer.

Sorry for dropping the ball on this. I had just left for vacation when you responded to my comments, and I haven't had a chance to circle back now that I've returned. Been a little busy catching up on other work but will touch base with the others on how to proceed.

No worries. I know there is a lot going on and just want to help out in some of the free time I have. I'll check back in a bit to clear anything that comes up.

@ariard
Copy link

ariard commented Nov 2, 2020

@julianknutsen

One of the challenges this set of patches is going to face moving forward is letting perfect be the enemy of good. There has been a lot of duplicate feedback and discussions over the last 6 weeks about specific style and design choices

Pushing for the good instead of the perfect doesn't solve the problem in itself. Please consider, that bitcoin open-source projects don't have core team members to just "buy-in" and actually the matter rests in building this notion of "good" among the set of regular contributors. It doesn't mean a contest of subjective opinions but a fact-based discussion, anchored in a previous context. If it does give you the feeling there is some authority to persuade, bear in mind we're striving as much for trust-minimized software than we aim to "authority"-minimize the corresponding dev process yielding the code, even if it's always a moving target and in the meanwhile you have to rely somehow on contributors weight of experience.

In that light, specific style and design choices does matter a lot as different styles come with different cognitive costs in regards of the actual style of the codebase. It shouldn't not only be understood by few, but also bearable by a reasonable majority of the contributors.

It is definitely frustrating as a contributor to go through multiple rounds of revisions with one set of reviewers and then have another set of reviewers come in at the 11th hour and bring up old discussion items or ask for core items to be changed.

I concede that a more structured review, calling out first contributors agreement on core items and conceptual soundness, then aiming to evaluate code style and ensuring good test coverage would save us all time and that's something we should point out more in our dev process. But still, that's only something which start to be loosely enforced on more advanced projects like Core, given earliness of LDK, it's not that much wrong at this state of the project, IMHO.

It would be a shame for it to spend another few months going stale and conflicting code being merged just for it to end up like #494.

And your timeline might not be the one of the wider contributor community, even if I know how much frustrating it can be to stale in and not giving you the sense of moving forward as you expected. Consider that a contributor community on any project is likely to have already a almost full-bandwidth when you're arriving with a set of contributions and how do you smooth your contribution integration cost is as much your responsibility that its their responsibility to give you clear guidelines on how to onboard on the project.

Finally, and whatever will be the final state of this code, I would like to thank you for your participation on LDK, you pointing crudely the frailty of our dev process is far more valuable than your code in my sense. We're all learning and I'll ensure we update the CONTRIBUTING.md and touch on the subject with other contributors publicly.

@julianknutsen
Copy link
Author

Although we disagree on some of the main issues that got us here, I appreciate your candidness and willingness to help raise the bar for future contributors. Aside from anything else, I hope that my experience will help improve transparency and communication for future developers even if the news is hard to hear. We are all volunteering our time and want the project to succeed and it is hard to read each other's minds from across the globe.

On a more personal note, this experience has introduced a lot of internal hostility and for my own mental health, I'm going to unsubscribe from future notifications so that I don't reintroduce any conflict or tension and all future conversations can just be about the code.

I'll leave these PRs open for now but will likely clean up my remote branches around the new year as I get settled in other projects.

Thanks again, @ariard. I appreciate the leadership you've given this project over my time here. The project is lucky to have you.

@TheBlueMatt
Copy link
Collaborator

The author decided to move on some time ago, and redoing the great work here would likely require a substantial effort to rebase and address the regressions that were being worked on in the later PRs in this set when the author moved on. Going to close.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants