(1/4) Enum Dispatch NOISE State Machine & Unit Tests #691

julianknutsen · 2020-09-14T23:23:05Z

PRs:
(1/4) Enum Dispatch NOISE State Machine & Unit Tests
(2/4) PeerManager Unit Tests and Refactoring
(3/4) Conduit fixups from 494
(4/4) Update OutboundQueue flush behavior and fix existing bugs that it found

Motivation
Refactor & Unit Tests
Fuzz Testing
New Features

Read First

This stack has been living as a PR against Arik's branch over the last 3-4 weeks and builds on his work from February #494. The initial feedback is positive enough to move it against rust-bitcoin/rust-lightning. The initial goal of the stack was to address the outstanding design feedback from 494 to get it into a stable place for merge. It has expanded into a fully tested and modular PeerManager.

Due to the complexity of rebasing 494 against master to make this clean, it has been squashed to a single commit at the beginning of this PR. That allows it to technically build against Master and remove the merge-commit at the end of the stack.

Outstanding issues from the original 494 review and where they are addressed:

Heap fragmentation in handshake path
- perf: Replace Vec w/ static arrays for act buffers
- perf: Remove Vec from chacha::encrypt/decrypt
Bug in decrypt path causing a panic on decryption failure
- fix: Conduit::Decryptor should not panic on decryption failure
Bug that didn't allow receiving of partial ACT message
- fix: Allow partial act messages to be received
Conduit vs. Encryptor/Decryptor
- Addressed in (3/3) Conduit fixups from 494
Separate crypto library - Unaddressed
- Seems like an easy separate PR to get all these in the right place

Next Action: Reviewers

Final sign-off

History

Update 8-23-2020: Removed new features from PR and created separate discussion issues #672 & #673
Update 8-24-2020 After discussion, #672 is a requirement to not cause a regression. PR now has this feature.
Update-8-26-2020 Updated to use a no allocation design that was unlocked after implementing 673. This makes the allocation policy similar to the existing behavior on master.
Update-9-2-2020 Initial feedback reviewed and review patches added to PR. All outstanding issues are still marked as unresolved to make sure nothing gets lost. Reviewers have next action
Update-9-24-2020 Moved to a PR against rust-bitcoin/rust-lightning
Update-9-25-2020 Squashed 494 at front of PR so this can technically merge into master
Update-10-2-2020 Rebased against master
Update-10-5-2020 Code handoff to @Arik and @ariard for next steps

Intro

Comments/questions/clarifications all welcome. I've done my best to adhere to the style guidelines, but if there are glaring issues I can iterate on them. I'm also available over Slack for any direct communication if needed.

Motivation

I looked around for a good first project to help me orient with the codebase and understand a bit more about the development cycle. I initially started going down #528 and that led me to #494 which looks like it was on the path to making the peer_handler code a bit more modular (and testable), but has been in-progress for almost six months. Since the code is in flux, it would be hard to add any additional testing or refactoring to enable better testing without introducing large merge conflicts as a separate PR.

So, I took that code as a base and wrote a set of tests against the PeerHandshake interface to get a better sense for how all of the pieces fit together and it led me to a thread I kept pulling that eventually ended up in a full set of tests and refactoring for the handshake portion of the code. It follows the spirit of the PR by abstracting the handshake code in a way that enables easier testing and delegation of responsibilities. It also gave me the opportunity to improve the documentation and control flow between states in a way that is easy to test and maintain.

I've also taken the opportunity to clean up many outstanding warnings that the code currently generated to get it into a more complete state.

My hope is that these changes can help motivate integrating #494 in order to unlock future testing and architecture improvements.

Testing and Refactoring

After completing the initial set of tests for the PeerHandshake module there were a few pieces of code that were hard to test or impossible to test based on the current design. The structure of helper functions that used match to determine invalid transitions left room for ambiguity and made the test cases harder to identify and create.

To enable the final bit of testing and create a more maintainable design I moved towards a state pattern utilizing pure functions and enum-dispatch. This enabled more well-defined transitions as well as very simple unit tests.

At a high level, there is a simple state machine interface with one method:

pub(super) trait IHandshakeState {
	fn next(self, input: &[u8]) -> Result<(Option<Vec<u8>>, HandshakeState), String>;
 }

All state implementations follow the same pattern that is similar to a pure function
that consumes all inputs to produce an output. No global state.

impl IHandshakeState for State {
	fn next(self, input: &[u8]) -> Result<(Option<Vec<u8>>, HandshakeState), String> {
		// move all values from previous state
		
		// calculate needed values for state transitions and response act
		
		// build new state
		
		// return (response act, new_state)
	}
}

The PeerHandshake is now a very simple wrapper class that just maintains the current
state and marshals data to the next() functions:

pub fn process_act(&mut self, input: &[u8]) -> Result<(Option<Vec<u8>>, Option<(Conduit, PublicKey)>), String> {
	let cur_state = self.state.take().unwrap();

	let (act_opt, mut next_state) = cur_state.next(input)?;

	let result = match next_state {
		HandshakeState::Complete(ref mut conduit_and_pubkey) => {
			Ok((act_opt, conduit_and_pubkey.take()))
		},
		_ => { Ok((act_opt, None)) }
	};

	self.state = Some(next_state);

	result
}

This design also allows for simple unit testing of the state machine transitions and act byte return values. Here is the test that validates proper Act2 generation and transition:

// Responder::AwaitingActOne -> AwaitingActThree
// RFC test vector: transport-responder successful handshake
#[test]
fn awaiting_act_one_to_awaiting_act_three() {
	let test_ctx = TestCtx::new();
	let (act2, awaiting_act_three_state) = test_ctx.responder.next(&test_ctx.valid_act1).unwrap();

	assert_eq!(act2.unwrap(), test_ctx.valid_act2);
	assert_matches!(awaiting_act_three_state, ResponderAwaitingActThree(_));
}

// Responder::AwaitingActOne -> Error (bad version byte)
// RFC test vector: transport-responder act1 bad version test
#[test]
fn awaiting_act_one_to_awaiting_act_three_input_bad_version() {
	let test_ctx = TestCtx::new();
	let act1 = hex::decode("01036360e856310ce5d294e8be33fc807077dc56ac80d95d9cd4ddbd21325eff73f70df6086551151f58b8afe6c195782c6a").unwrap();

	assert_eq!(test_ctx.responder.next(&act1).err(), Some(String::from("unexpected version")));
}

Fuzz Testing

Write a new fuzz test for the PeerHandshake module that is able to generate failure paths that occur after a partial handshake sequence has been completed.

To enable this, a new testing object FuzzGen has been introduced that can deterministically generate bytes and bools based on the random fuzz input. These building blocks are enough for the test code to generates different execution paths and complete partial phases of the handshake protocol before generating an error.

When going through a test cycle where the handshake completes, it will also verify that the initiator and responder can communicate successfully through the returned conduits sending variable length data and validating the contents.

	// Possibly generate bad data for act1 and ensure that the responder does not panic
	let mut act1 = test_ctx.act1; // valid act1 for success case
	if generator.generate_bool()? { // invalid act1 for failure case
		act1 = (generator.generate_bytes(50)?).to_vec();
		used_generated_data = true;
	}

	let mut act2 = match test_ctx.responder_handshake.process_act(&act1) {
		Ok((Some(act2), None)) => {
			// extremely unlikely we randomly generate a correct act1, but if so.. reset this
			used_generated_data = false;
			act2
		}
		Err(_) => {
			assert!(used_generated_data);
			return Err("generated expected failure with bad data".to_string());
		}
		_ => panic!("responder required to output act bytes and no conduit/pubkey")
	}

julianknutsen · 2020-09-15T00:45:00Z

lightning/src/ln/peers/handshake/states.rs

+// Alias type to help differentiate between temporary key and chaining key when passing bytes around
+type ChainingKey = [u8; 32];
+
+// Generate a SHA-256 hash from one or more elements concatenated together


Initial @arik-so review: I still think this macro might be more legibly put in a function, or a custom hash object/class. But, as I said before, ultimately this is up to Matt and Antoine.

Overall, I think it makes the code less readable and introduces a wrapper type that has to be unwrapped which I think detracts from the audit-ability. I'm also not sure about the aversion to variadic macros. I needed a way to do the hash for 1 or 2 objects and used a macro that made the code (in my point of view) more legible.

With that said, I did implement it to get your take and see which way the code should go. Just let me know and I can fixup this PR to have the final result.

julianknutsen/rust-lightning@0b4c819

codecov · 2020-09-15T18:45:34Z

Codecov Report

Merging #691 into main will increase coverage by 0.48%.
The diff coverage is 90.07%.

@@            Coverage Diff             @@
##             main     #691      +/-   ##
==========================================
+ Coverage   92.01%   92.49%   +0.48%     
==========================================
  Files          37       42       +5     
  Lines       20106    21431    +1325     
==========================================
+ Hits        18500    19823    +1323     
- Misses       1606     1608       +2

Impacted Files	Coverage Δ
lightning/src/ln/peers/handler.rs	`55.50% <47.58%> (ø)`
lightning/src/ln/peers/handshake/acts.rs	`94.64% <94.64%> (ø)`
lightning/src/ln/peers/chacha.rs	`95.65% <95.65%> (ø)`
lightning/src/ln/peers/handshake/mod.rs	`95.74% <95.74%> (ø)`
lightning/src/ln/peers/conduit.rs	`96.42% <96.42%> (ø)`
lightning/src/ln/peers/handshake/states.rs	`97.58% <97.58%> (ø)`
lightning/src/ln/peers/hkdf5869rfc.rs	`100.00% <100.00%> (ø)`
lightning/src/chain/chainmonitor.rs	`100.00% <0.00%> (ø)`
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e48f8e3...30286c0. Read the comment docs.

arik-so · 2020-09-16T08:13:35Z

I don't see any feedback I've left in the upstream iteration that hasn't been addressed at this point, so I'm with this PR to proceed in its current state.

julianknutsen · 2020-09-18T17:55:47Z

Rebased against upstream and added licenses to new files.

TheBlueMatt · 2020-09-21T16:11:18Z

lightning/src/ln/mod.rs

@@ -21,16 +21,11 @@
 pub mod channelmanager;
 pub mod channelmonitor;
 pub mod msgs;
-pub mod peer_handler;
+pub mod peers;


Lets move peers up to its own top-level module.

This was just straight for #494, but it seems fine to make it a top-level. Added #718

TheBlueMatt · 2020-09-22T17:11:37Z

lightning/src/ln/peers/mod.rs

+//! Conduit enables message encryption and decryption, and automatically handles key rotation.
+
+mod chacha;
+pub mod handler;


Should more than this mod be pub? Or is the intent to merge first and then make things public later?

~~I guess it was just for fuzz tests. Fixed in: review: Fix peers::handler visibility~~

I think the right thing is merge first and then fix visibility issues later. This is the only needed by the tokio sample right now.

TheBlueMatt · 2020-09-22T18:41:43Z

lightning/src/ln/peers/handler.rs

 struct Peer {
-	channel_encryptor: PeerChannelEncryptor,
+	encryptor: PeerState,


Do we ever use their_node_id on outbound peers before we finish connecting? Can we move it into PeerState::Connected and get rid of the Option<>?

No, and the asymmetry between when this was set and when it wasn't was confusing. This has been changed in #692 so that it is an output of the NOISE handshake completion.

This will make future changes that stabilize this work easier to review and merge. Co-authored-by: Arik Sosman <[email protected]> Co-authored-by: Julian Knutsen <[email protected]>

Lots of build warnings still, but it builds and tests pass.

Add unit tests against the PeerHandshake public interface prior to changing any of the implementation details. This is a solid stand-alone patch, even if the subsequent refactor and design changes are not merged. A future patch uses the test vectors from the RFC to minimize the code under test.

The implementation logic is duplicated from handshake/mod.rs so that it can be tested and refactored independent of the current implementation. This uses enum dispatch coupled with separately 'moved' state objects that allow for simple testing, easy state inspection, and understandable transition logic between states. This pattern also removes the need for match statements inside implementation logic in favor of the instance representing the current state and available data. Every state transition is implemented in a next() function and is a straight forward translation from an object representing the existing known data and an input act message (bytes). Example usage: let (act_data, next_state) = cur_state.next(input_data); This state machine will eventually be moved into place inside the PeerHandshake which will act as a simple wrapper that just maintains the state machine and marshals the data between the peer_handler and states.

PeerHandshake is now a simple object around the new state machine that simply marshals data between the callers and state machine states. For now, keep the APIs identical to make the patches easier to read. The only interesting piece to note here is the handling of get_remote_pubkey() to keep parity with the existing implementation. Future patches will remove this separate function in favor of callers taking ownership of it once the handshake is complete.

Introduce the sha256!() and concat!() macros that are used to improve readability of the noise exchanges.

Small inconvenience with the HandshakeHash return value that will be taken care of by the end of this patch stack.

Fix inconsistent use of concat vs. Sha256::engine() input. These can be deduplicated at the end.

Identify potential issue regarding the invalid case where the read_buffer has data after completion.

Use the slice refs from the act array instead.

Write a new fuzz test for the PeerHandshake module that is able to generate failure paths that occur after a partial handshake sequence has been completed. To enable this, a new testing object FuzzGen has been introduced that can deterministically generate bytes and bools based on the random fuzz input. These building blocks are enough for the test code to generates different execution paths and complete partial phases of the handshake protocol before generating an error. When going through a test cycle where the handshake completes, it will also verify that the initiator and responder can communicate successfully through the returned conduits sending variable length data and validating the contents.

Functionality moved to: - peers/handshake/mod.rs - peers/handshake/states.rs - peers/conduit.rs Fuzz testing rewritten here: - fuzz/src/peer_crypt.rs

Fix regression introduced in eb6a371 that would error and subsequently cause a disconnect if a partial act was sent to the state machine, even if future calls could be added to the read buffer to create a valid act. Add appropriate unit tests and update fuzz testing to send partial acts some of the time.

introduced in eb6a371 On decryption failure, the Decryptor should return an error to the caller so it can disconnect instead of panicking. Implement this behavior by plumbing Result through the iterator code path and adding the appropriate tests and comments.

Enhanced the fuzz testing to simulate more test cases: * Valid or random act data * Optionally sending garbage data along with act * Optionally sending act data in multiple segments Found regression introduced in eb6a371 relating to the Conduit panicking during an hmac failure.

To avoid tuple returns from a single initialization function, create a setup_outbound() function that is only callable by the outbound PeerHandshake, but doesn't leak implementation details.

Proper handshake peers won't send more bytes than required for act1 and act2 and it is impossible to generate future acts without first receiving an interleaved response. In those cases, the peer is misbehaving and we can fail early. This unlocks a Vec-less design in the handshake code now that there is an upper limit on read buffer sizes.

After reviewing lightningdevkit#494 there was design feedback on the use of Vec vs. arrays in this layer of the code for fragmentation reasons. To get ahead of the same issues in this new state machine and to limit the behavior changes from master, remove Vec from the state machine in favor of thin wrappers around fixed-size arrays. This patch reintroduces the Act object (a thin wrapper around fixed size arrays) with some convenience features to make them a bit easier to pass around and build. The Handshake code is still note Vec-clean as the encryption code uses Vecs during encryption, but it is one step closer to passing back slices to the peer_handler.

In order to limit perf/allocation changes from master, remove the last use of Vec from the handshake code. The decrypt/encrypt code now takes an out parameter to place the decrypted bytes. All handshake users know the fixed-size of the output decryption so they can use arrays. The Conduit code will still allocate a Vec based on the input message and pass it through. After this patch, the handshake code is now Vec-less and copies are minimized.

Clean up small items based on review comments that are expected to be non-controversial so they can all go in one commit.

Follow existing pattern used in util/ for consistency.

Don't rely on the initialized buffer filled with zeros.

Instead of returning a subslice of input for the caller to process, just return the number of bytes consume and let them do the accounting.

julianknutsen · 2020-10-02T20:22:20Z

Rebased against upstream and resolved merges.

TheBlueMatt · 2020-10-04T21:20:57Z

lightning/src/ln/peers/conduit.rs

+use ln::peers::{chacha, hkdf5869rfc};
+use util::byte_utils;
+
+pub(super) type SymmetricKey = [u8; 32];


Nit: better to newtype this (ie SymmetricKey([u8; 32])) so that the typesystem enforces explicit conversion to/from [u8; 32].

TheBlueMatt · 2020-10-04T22:30:07Z

lightning/src/ln/peers/conduit.rs

+
+}
+
+pub(super) struct Encryptor {


I believe these can be non-pub (I know they are used in later PRs, but I don't think they should be, see comment there).

Dereferenced in handler.rs.

I'm not a fan of the pattern, but that was how 494 existed and I kept it that way until the Transport encapsulation was done.

Right, I think we agree, re: #693 (comment)

TheBlueMatt · 2020-10-05T00:43:23Z

lightning/src/ln/peers/handshake/states.rs

+	}
+}
+
+impl IHandshakeState for InitiatorStartingState {


AFAICT, the trait is never really used, even in tests - here and for all the individual messages, we could just have a "dumb" impl block and things would work as well, and then the trait is only ever used by the HandshakeState enum, which means it could just be dropped.

It is a defensive programming technique to ensure all of the individual state enums all implement the same common interface. That way, the wrapper enum can be simple delegation.

It was the idiomatic rust enum dispatch method I found. Other projects leverage macros and crates, but I chose to keep it more verbose to not introduce additional dependencies.

Here is the trait definition that I hope explains it. If you want this removed just let me know, I don't need to bikeshed about it.

// Trait for all individual states to implement that ensure HandshakeState::next() can // delegate to a common function signature. May transition to the same state in the event there are // not yet enough bytes to move forward with the handshake. pub(super) trait IHandshakeState { fn next(self, input: &[u8]) -> Result<(Option<Act>, HandshakeState), String>; }

Hmm, I guess I don't have a super strong feeling on using a trait to show identical operation, though it wasn't immediately obvious to me that that was its purpose given its public across the module (but only used in the states.rs file) and implemented by both the top-level enum and structs inside. Maybe best to move it in states.rs and add a comment?

I'm not sure I see the value of IHandshakeState. Nothing enforces that the enum states must implement this trait.

TheBlueMatt · 2020-10-05T00:44:09Z

lightning/src/ln/peers/handshake/states.rs

+}
+
+// Handshake state of the Initiator prior to generating Act 1
+pub(super) struct InitiatorStartingState {


Do the structs here need to be pub(super)? It looks like they're only ever accessed in two tests to check state, which could just as easily be a test-only method returning a state name or something.

Yes, they are contained by the HandshakeState enums which can't leak private types in the mod.rs usages.

Right, seems like we could make HandshakeState a struct and move the enum inside it to reduce the exposed surface of states.rs significantly?

What is the point of reducing the exposed surface? This is just an enum used inside the handshake module. Would you rather just inline it all in one file? That is sort of what it sounds like.

This is how the enum dispatch pattern works in Rust and it seems like we are just talking past each other. If you are uncomfortable with this pattern in your codebase feel free to have @arik-so or @ariard change it or discard the PRs.

jkczyz

Sorry I wasn't able to review this earlier. Leaving some comments here while I go through each commit in my spare cycles. I really like the direction especially with the additional unit tests! I'd like to see more of this in the project going forward.

While I know your timeline and our priorities don't match right now, I'm happy to continue reviewing this and help get it merged. Please don't feel obligated to respond to the comments. Just want you to know this is on my radar.

jkczyz · 2020-10-06T17:36:31Z