Fix nagle, other minor optimizations #8222

Merged

rustyrussell merged 21 commits into ElementsProject:master from guilt/fix-tcp-cork on May 8, 2025

Conversation

rustyrussell (Contributor) commented Apr 9, 2025

I noticed we were not going as fast as we should in a 100-payment test. It turns out our Nagle-suppressing code was wrong!

Then I continued with minor optimizations until our profile basically contained only sqlite operations, resulting in an 8% speedup in bouncing payments (when eatmydata was used).

(I dropped the patch which speeds up CI using eatmydata, as it broke too much!)

@rustyrussell rustyrussell added this to the v25.05 milestone Apr 9, 2025
@rustyrussell rustyrussell changed the title from "Fix nagle, other minor optimizations" to "Fix nagle, other minor optimizations, speed CI" Apr 29, 2025
@rustyrussell rustyrussell changed the title back to "Fix nagle, other minor optimizations" May 1, 2025
@rustyrussell rustyrussell force-pushed the guilt/fix-tcp-cork branch 15 times, most recently from 3eeff79 to 821e111 on May 8, 2025 at 00:26
Once we speed things up, we can *receive* commitment_signed before we disconnect
due to sending ours: in this case, we *will* have a scratch tx.

Signed-off-by: Rusty Russell <[email protected]>
Once we speed things up, this can receive COMMITMENT_SIGNED, and hence succeed.  Let's
drop the connection on receiving the message, not on sending it!

```
2025-05-06T06:07:51.5997333Z         # Peers try RBF, break on initial COMMITMENT_SIGNED
2025-05-06T06:07:51.5997855Z         bump = l1.rpc.openchannel_bump(chan_id, chan_amount, initpsbt['psbt'])
2025-05-06T06:07:51.5998214Z >       with pytest.raises(RpcError):
2025-05-06T06:07:51.5998542Z E       Failed: DID NOT RAISE <class 'pyln.client.lightning.RpcError'>
```

Signed-off-by: Rusty Russell <[email protected]>
Presumably we want to wait for all htlcs to be resolved here.  With the speedup,
this failed under Postgres:

```
>       wait_for(lambda: only_one(l2.rpc.listpeerchannels(l3.info['id'])['channels'])['htlcs'] != [])

tests/test_closing.py:4191:
```

Signed-off-by: Rusty Russell <[email protected]>
…e a short signature.

```
>               assert txinfo['weight'] == prep['estimated_final_weight']
E               assert 755 == 756
```

Signed-off-by: Rusty Russell <[email protected]>
This happens one time in 256:

```
2025-05-06T11:00:43.2201832Z >           assert feerate - 1 < total_feerate_perkw < feerate + 1
2025-05-06T11:00:43.2202133Z E           assert 12005.233318796338 < (12000 + 1)
```

Signed-off-by: Rusty Russell <[email protected]>
We have to make sure the new tx has been received by bitcoind, otherwise we
could be examining the previous:

```
>           check_feerate(l2, total_feerate_perkw, feerate)

tests/test_closing.py:4053: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

node = <fixtures.LightningNode object at 0x7f645789ee90>
actual_feerate = 14000.0, expected_feerate = 15000

    def check_feerate(node, actual_feerate, expected_feerate):
        # Feerate can't be lower.
>       assert actual_feerate > expected_feerate - 2
```

Signed-off-by: Rusty Russell <[email protected]>
We can fail to connect because l3 connects to l2 at the same time l2 connects to l3:

```
        # The plugin in l2 fixes up the connection, so this works!
>       l1.rpc.fetchinvoice(offer['bolt12'])

tests/test_connection.py:4792:
...
        elif "error" in resp:
>           raise RpcError(method, payload, resp['error'])
E           pyln.client.lightning.RpcError: RPC call failed: method: fetchinvoice, payload: {'offer': 'lno1qgsqvgnwgcg35z6ee2h3yczraddm72xrfua9uve2rlrm9deu7xyfzrcgqgqjczs7w3jhxazldahxjmmwd4jhxumpvaj47en0wfmkzunytanxz6tvzcssxhftzxfdlwsnfcgw2sy8t5mxa0ytcdfat2nkdwqvpy9nnsa9mzza'}, error: {'code': 1005, 'message': 'Timeout waiting for response'}
```

From logs:

```
lightningd-2 2025-05-07T21:20:54.367Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: peer_out WIRE_INIT
lightningd-2 2025-05-07T21:20:54.367Z DEBUG   035d2b1192dfba134e10e540875d366ebc8bc353d5aa766b80c090b39c3a5d885d-connectd: Connect IN
```

Signed-off-by: Rusty Russell <[email protected]>
@rustyrussell rustyrussell force-pushed the guilt/fix-tcp-cork branch from 821e111 to fde04af on May 8, 2025 at 01:26
We need to make sure the lease starts at the block we expect:

```
        # Note that l3 has the whole lease delay (minus blocks already mined)
        _, _, l3blocks = l3.wait_for_onchaind_tx('OUR_DELAYED_RETURN_TO_WALLET',
                                                 'OUR_UNILATERAL/DELAYED_OUTPUT_TO_US')
>       assert l3blocks == 4032 - 6 - 2 - 1
E       assert 4022 == (((4032 - 6) - 2) - 1)

tests/test_closing.py:985: AssertionError
```

Signed-off-by: Rusty Russell <[email protected]>
Our CORK logic was wrong, and it's better to use Nagle anyway:
we disable Nagle just before sending timing-critical messages.

Time for 100 (failed) payments:

Before:
	148.8573575

After:
	10.7356977

Note this revealed a timing problem in test_reject_invalid_payload: we would
miss the cause of the sendonion failure, and waitsendpay would be called
*after* it had failed, so it simply returned "Payment failure reason unknown".

Signed-off-by: Rusty Russell <[email protected]>
Changelog-Fixed: Protocol: Removed 200ms latency from sending commit/revoke messages.
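
The change is essentially a socket-option toggle. Here is a minimal sketch of the technique in plain C, assuming a hypothetical `send_urgent()` helper (connectd's actual code is structured differently):

```
/* Leave Nagle enabled by default; set TCP_NODELAY only around a
 * timing-critical message so the kernel flushes it immediately. */
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: send a latency-sensitive message right away. */
static ssize_t send_urgent(int fd, const void *msg, size_t len)
{
	int on = 1, off = 0;
	ssize_t ret;

	/* Without this, Nagle can hold a small write back (interacting
	 * with delayed ACKs, for up to ~200ms) so as to coalesce it
	 * with later writes. */
	setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
	ret = write(fd, msg, len);
	/* Restore Nagle so bulk traffic (e.g. gossip) still batches. */
	setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &off, sizeof(off));
	return ret;
}
```

This gets the latency win for commit/revoke messages without giving up batching elsewhere, and avoids the easy-to-get-wrong flush semantics of TCP_CORK.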
To run, use:
	VALGRIND=0 TEST_BENCH=1 eatmydata pytest tests/test_connection.py::test_bench

Example of running on my laptop (without --enable-debugbuild, over 10 runs):

	FAILED tests/test_connection.py::test_bench - assert 47.403406-49.191765(48.343+/-0.48) == 0

With eatmydata:

	FAILED tests/test_connection.py::test_bench - assert 25.867860-27.549878(26.5762+/-0.5) == 0

You can also run perf on l1 once it's running:

	perf record --call-graph dwarf -q -p $(cat /tmp/ltests-*/test_bench_1/lightning-1/lightningd-regtest.pid)

Then ^C after 10 seconds and run "perf report".

Things which stood out:

1. Tracing in db_exec_prepared_v2 (fixed in another PR)

   31.12%     0.04%  lightningd  lightningd            [.] db_exec_prepared_v2
   - 31.08% db_exec_prepared_v2
      + 22.96% db_sqlite3_exec
      + 4.46% trace_span_end
      + 1.77% trace_span_start
      + 1.11% trace_span_tag
      + 0.72% tal_free

2. Logging:

   - 16.03% logv
      - 8.15% maybe_print
         - log_to_files
            + 4.51% __GI__IO_fflush (inlined)
            + 1.97% tal_fmt_
            + 0.51% __GI___strftime_l (inlined)

3. Notification (when nothing is listening) in notify_log:

      - 6.84% maybe_notify_log
         - notify_log
            + 3.37% notify_send
            + 1.75% notify_start
            + 1.71% log_notification_serialize
        0.56% new_log_entry

Signed-off-by: Rusty Russell <[email protected]>
…hecking.

Before:
92.26076173782349
91.9576666355133
90.92732524871826

After:
90.5989830493927
88.10309219360352
90.07689118385315

If nobody is subscribed, have notify_start return NULL and the caller
can skip serialization.  This is particularly useful for the "log"
notification which can get called a lot.

Signed-off-by: Rusty Russell <[email protected]>
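
The shape of this change, as a self-contained sketch with hypothetical types (lightningd's real notification plumbing differs):

```
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical subscriber count for the "log" topic (the real daemon
 * tracks subscriptions registered by plugins). */
static size_t log_subscribers;

struct notification {
	char *payload;
};

/* Return NULL if nobody is subscribed, so the caller can skip the
 * expensive serialization entirely. */
static struct notification *notify_start(const char *topic)
{
	if (strcmp(topic, "log") == 0 && log_subscribers == 0)
		return NULL;
	struct notification *n = malloc(sizeof(*n));
	n->payload = NULL;
	return n;
}

static void notify_log(const char *msg)
{
	struct notification *n = notify_start("log");
	if (!n)
		return;	/* Hot path when nothing listens: no work done. */

	/* Only now do we pay the serialization cost. */
	n->payload = strdup(msg);
	printf("sending: %s\n", n->payload);
	free(n->payload);
	free(n);
}

int main(void)
{
	notify_log("dropped");		/* log_subscribers == 0: skipped */
	log_subscribers = 1;
	notify_log("delivered");	/* now serialized and sent */
	return 0;
}
```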
This means we don't have to iterate through all plugins, making
our "do we even have to construct this notification" optimization
much more efficient.

Signed-off-by: Rusty Russell <[email protected]>
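
One plausible way to make that check O(1) is a per-topic subscriber counter maintained at subscribe/unsubscribe time; a hypothetical sketch (not the actual lightningd data structure):

```
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-topic counters, bumped when a plugin subscribes or
 * unsubscribes, making "is anyone listening?" a single lookup instead
 * of a walk over every plugin's subscription list. */
static const char *topics[] = { "log", "forward_event" };
static size_t subscriber_count[2];

static size_t topic_index(const char *topic)
{
	for (size_t i = 0; i < 2; i++)
		if (strcmp(topics[i], topic) == 0)
			return i;
	assert(0);
	return 0;
}

static void plugin_subscribe(const char *topic)
{
	subscriber_count[topic_index(topic)]++;
}

static void plugin_unsubscribe(const char *topic)
{
	assert(subscriber_count[topic_index(topic)] > 0);
	subscriber_count[topic_index(topic)]--;
}

/* What notify_start() consults: no plugin iteration needed. */
static bool topic_has_subscribers(const char *topic)
{
	return subscriber_count[topic_index(topic)] != 0;
}

int main(void)
{
	assert(!topic_has_subscribers("log"));
	plugin_subscribe("log");
	assert(topic_has_subscribers("log"));
	plugin_unsubscribe("log");
	assert(!topic_has_subscribers("log"));
	return 0;
}
```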
This speeds logging slightly, but most of the time is actually
spent in fflush() (which we need).

Result (10 runs, eatmydata):
	FAILED tests/test_connection.py::test_bench - assert 24.086638-25.347475(24.6901+/-0.4) == 0

This is an 8% performance improvement (over the entire series), which is not bad.

Signed-off-by: Rusty Russell <[email protected]>
If there are pending channel announcements, they'll look like a leak unless we scan into
the maps.

```
lightningd-2 2025-05-01T07:27:03.922Z **BROKEN** gossipd: MEMLEAK: 0x60d000000478
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:   label=gossipd/gossmap_manage.c:595:struct pending_cannounce
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:   alloc:
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/ccan/ccan/tal/tal.c:488 (tal_alloc_)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/gossipd/gossmap_manage.c:595 (gossmap_manage_channel_announcement)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/gossipd/gossipd.c:205 (handle_recv_gossip)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/gossipd/gossipd.c:300 (connectd_req)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/common/daemon_conn.c:35 (handle_read)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/ccan/ccan/io/io.c:60 (next_plan)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/ccan/ccan/io/io.c:422 (do_plan)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/ccan/ccan/io/io.c:439 (io_ready)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/ccan/ccan/io/poll.c:455 (io_loop)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     /home/runner/work/lightning/lightning/gossipd/gossipd.c:660 (main)
lightningd-2 2025-05-01T07:27:03.923Z **BROKEN** gossipd:     ../sysdeps/nptl/libc_start_call_main.h:58 (__libc_start_call_main)
lightningd-2 2025-05-01T07:27:03.924Z **BROKEN** gossipd:     ../csu/libc-start.c:392 (__libc_start_main_impl)
lightningd-2 2025-05-01T07:27:03.924Z **BROKEN** gossipd:   parents:
lightningd-2 2025-05-01T07:27:03.924Z **BROKEN** gossipd:     gossipd/gossmap_manage.c:475:struct gossmap_manage
```

Signed-off-by: Rusty Russell <[email protected]>
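
For readers unfamiliar with this style of leak detection: the scanner marks every allocation reachable from the structures it is told about and reports the rest, so a live object parked in an unscanned map is a false positive. A toy illustration (not gossipd's actual memleak API):

```
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy tracker: remember every allocation, mark the ones the scan
 * reaches, and report the unmarked ones as leaks. */
#define MAX_ALLOCS 16
static void *allocs[MAX_ALLOCS];
static bool marked[MAX_ALLOCS];
static size_t num_allocs;

static void *tracked_alloc(size_t len)
{
	void *p = malloc(len);
	allocs[num_allocs++] = p;
	return p;
}

static void mark(const void *p)
{
	for (size_t i = 0; i < num_allocs; i++)
		if (allocs[i] == p)
			marked[i] = true;
}

static void report_leaks(void)
{
	for (size_t i = 0; i < num_allocs; i++)
		if (!marked[i])
			printf("MEMLEAK: %p\n", allocs[i]);
}

int main(void)
{
	void *pending_cannounce = tracked_alloc(32);

	/* Without this mark, the pending announcement above is
	 * reported as a leak even though the map still owns it: the
	 * fix is to scan into the pending maps too. */
	mark(pending_cannounce);

	report_leaks();
	free(pending_cannounce);
	return 0;
}
```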
This significantly slows down the cycle when we have flakes (it's good if
the tree is completely broken though!).

Signed-off-by: Rusty Russell <[email protected]>
It's disabled.

Signed-off-by: Rusty Russell <[email protected]>
We don't care about data persistence, and this suppresses all sync calls (speeding
up sqlite3 in particular).

Signed-off-by: Rusty Russell <[email protected]>
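
eatmydata works by LD_PRELOAD-ing a shim that turns sync calls into no-ops. A stripped-down illustration of the idea, as a hypothetical nosync.c (the real libeatmydata intercepts more entry points, such as sync(), msync() and O_SYNC opens):

```
/* Build with `cc -shared -fPIC -o nosync.so nosync.c`, then run
 * `LD_PRELOAD=./nosync.so lightningd ...` to skip disk syncs. */
#include <unistd.h>

int fsync(int fd)
{
	(void)fd;
	return 0;	/* Claim success without touching the disk. */
}

int fdatasync(int fd)
{
	(void)fd;
	return 0;
}
```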
@rustyrussell rustyrussell force-pushed the guilt/fix-tcp-cork branch from fde04af to ad11fe6 on May 8, 2025 at 02:50
@rustyrussell rustyrussell enabled auto-merge (rebase) May 8, 2025 02:51
@rustyrussell rustyrussell merged commit 1521b92 into ElementsProject:master May 8, 2025
36 of 40 checks passed