
Hold PeerState in an RwLock rather than a Mutex #2968

Closed
tnull wants to merge 6 commits

Conversation

@tnull (Contributor) commented Mar 26, 2024

Previously, we held the PeerState in a Mutex, which disallows concurrent read-only operations. Here we switch to an RwLock, making this possible.

To this end, we first switch all instances of Mutex::lock to RwLock::write, and then selectively relax the usages that permit it to RwLock::read.
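A rough sketch of the shape of the change (the types and functions below are illustrative stand-ins, not the actual ChannelManager internals):

```rust
use std::sync::RwLock;

// Stand-in for the real PeerState; the point is only the lock type around it.
struct PeerState {
    channel_count: usize,
}

struct PeerHolder {
    // Previously Mutex<PeerState>; now RwLock<PeerState>.
    peer_state: RwLock<PeerState>,
}

// Mutating paths: every former Mutex::lock becomes RwLock::write (still exclusive).
fn mutating_path(holder: &PeerHolder) {
    let mut peer_state = holder.peer_state.write().unwrap();
    peer_state.channel_count += 1;
}

// Read-only paths can then be selectively relaxed to RwLock::read, allowing
// several of them to run concurrently.
fn read_only_path(holder: &PeerHolder) -> usize {
    let peer_state = holder.peer_state.read().unwrap();
    peer_state.channel_count
}
```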

@tnull marked this pull request as draft March 26, 2024 08:32
@tnull changed the title to Hold PeerState in an RwLock rather than a Mutex Mar 26, 2024
@tnull marked this pull request as ready for review March 26, 2024 14:47
@tnull force-pushed the 2024-03-rwlock-peer-state branch from 5c6cf90 to 4f4fa58 on March 26, 2024 14:55
@TheBlueMatt (Collaborator) left a comment

I'd like to better understand the reported use for this - list_channels, specifically, should be plenty fast enough that the time under a single lock is ~instant and this doesn't change much of anything. If there are other places where we can be read-only for longer (e.g. I/O), then I'd say let's do it.

@tnull force-pushed the 2024-03-rwlock-peer-state branch from 4f4fa58 to 3658581 on March 27, 2024 10:20
@codecov-commenter commented Mar 27, 2024

Codecov Report

Attention: Patch coverage is 95.34884% with 8 lines in your changes missing coverage. Please review.

Project coverage is 89.37%. Comparing base (1d9e541) to head (cf2d418).

| Files | Patch % | Lines |
| --- | --- | --- |
| lightning/src/ln/channelmanager.rs | 95.13% | 1 Missing and 6 partials ⚠️ |
| lightning/src/ln/functional_test_utils.rs | 83.33% | 1 Missing ⚠️ |


Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #2968      +/-   ##
==========================================
+ Coverage   89.31%   89.37%   +0.05%
==========================================
  Files         117      117
  Lines       95618    95618
  Branches    95618    95618
==========================================
+ Hits        85404    85456      +52
+ Misses       7974     7926      -48
+ Partials     2240     2236       -4
```


@tnull (Contributor, Author) commented Mar 27, 2024

> I'd like to better understand the reported use for this - list_channels, specifically, should be plenty fast enough that the time under a single lock is ~instant and this doesn't change much of anything. If there are other places where we can be read-only for longer (e.g. I/O), then I'd say let's do it.

Right, the users who reported this should definitely investigate why it was blocking for so long. I now went through all usages and found several places where we could also just use a RwLockReadGuard (forward_intercepted_htlcs, compute_inflight_htlcs, get_relevant_txids), a particularly promising case being taking a RwLockReadGuard in ChannelManager::write, during channel persistence.

IMO, the other cases make this a worthwhile improvement, independently of the concrete reported issue.
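For illustration, a minimal sketch of what such a read-only accessor looks like with the nested per_peer_state layout (field names and types here are simplified assumptions, not the real code):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

struct PeerState {
    channel_ids: Vec<[u8; 32]>,
}

struct Manager {
    // Outer map of peers, each holding its PeerState behind its own RwLock,
    // loosely mirroring ChannelManager::per_peer_state.
    per_peer_state: RwLock<HashMap<String, RwLock<PeerState>>>,
}

impl Manager {
    // Accessors in the spirit of list_channels / compute_inflight_htlcs /
    // get_relevant_txids only need shared access, so a read guard suffices
    // and multiple such calls can proceed in parallel.
    fn list_channel_ids(&self) -> Vec<[u8; 32]> {
        let per_peer_state = self.per_peer_state.read().unwrap();
        let mut ids = Vec::new();
        for (_peer, peer_state_rwlock) in per_peer_state.iter() {
            let peer_state = peer_state_rwlock.read().unwrap();
            ids.extend_from_slice(&peer_state.channel_ids);
        }
        ids
    }
}
```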

@TheBlueMatt (Collaborator) commented

> a particularly promising case being taking a RwLockReadGuard in ChannelManager::write, during channel persistence.

Writing the manager basically halts the world no matter what due to the consistency write lock.

I'm not entirely convinced that the remaining changes are worth it.

@tnull (Contributor, Author) commented Mar 27, 2024

> Writing the manager basically halts the world no matter what due to the consistency write lock.

Yes, for many operations, but we don't acquire total_consistency_lock in a lot of places, e.g. list_channels and get_relevant_txids. So they could still be executed concurrently during a ChannelManager::write operation (at least while persisting channels; at some point we then acquire the write lock on per_peer_state).

What am I missing? Is not acquiring it a lock order violation we need to fix? The docs seem to indicate so, but I'm not sure taking total_consistency_lock for each list_channels call makes a lot of sense.
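A standalone sketch of the interplay described here (the lock names are borrowed from the discussion; everything else is a simplified assumption):

```rust
use std::sync::RwLock;

struct Manager {
    total_consistency_lock: RwLock<()>,
    // Stand-in for the per-peer PeerState locks.
    per_peer_state: RwLock<Vec<RwLock<u64>>>,
}

impl Manager {
    // Roughly what ChannelManager::write does: it holds the consistency lock
    // for writing, but during channel persistence it only needs shared access
    // to per_peer_state.
    fn write_manager(&self) {
        let _consistency_guard = self.total_consistency_lock.write().unwrap();
        let per_peer_state = self.per_peer_state.read().unwrap();
        for peer_state_rwlock in per_peer_state.iter() {
            let _peer_state = peer_state_rwlock.read().unwrap();
            // ... serialize this peer's channels ...
        }
    }

    // list_channels-style accessors never touch total_consistency_lock, so
    // with RwLocks they could overlap with write_manager's read-only phase.
    fn list_channels(&self) -> usize {
        let per_peer_state = self.per_peer_state.read().unwrap();
        per_peer_state.len()
    }
}
```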

@TheBlueMatt (Collaborator) commented

> Yes, for many operations, but we don't acquire total_consistency_lock in a lot of places, e.g. list_channels and get_relevant_txids. So they could still be executed concurrently during a ChannelManager::write operation (at least while persisting channels; at some point we then acquire the write lock on per_peer_state).

Ah, fair point, duh. Note that write currently takes the top-level per_peer_state write lock, but that's an easy swap in this PR.

> What am I missing? Is not acquiring it a lock order violation we need to fix? The docs seem to indicate so, but I'm not sure taking total_consistency_lock for each list_channels call makes a lot of sense.

No, there's no lock order violation if we skip a lock, only if they're taken in the wrong order.
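As a standalone illustration of that rule (nothing LDK-specific; the lock names are made up):

```rust
use std::sync::Mutex;

// Documented lock order: lock_a before lock_b.
struct State {
    lock_a: Mutex<()>,
    lock_b: Mutex<()>,
}

// Skipping lock_a entirely is fine; the ordering rule only applies when both
// locks are held at the same time.
fn skips_the_root(state: &State) {
    let _b = state.lock_b.lock().unwrap();
}

// Taking them in the documented order is also fine.
fn correct_order(state: &State) {
    let _a = state.lock_a.lock().unwrap();
    let _b = state.lock_b.lock().unwrap();
}

// Taking lock_b and then lock_a inverts the documented order and can deadlock
// against correct_order() running on another thread.
fn lock_order_violation(state: &State) {
    let _b = state.lock_b.lock().unwrap();
    let _a = state.lock_a.lock().unwrap();
}
```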

@tnull (Contributor, Author) commented Mar 28, 2024

> Ah, fair point, duh. Note that write currently takes the top-level per_peer_state write lock, but that's an easy swap in this PR.

I don't think so? During channel data serialization we take the read lock here, which would allow concurrent access with the changes in this PR.

Then later we take the write lock here, but this comment leads me to believe we can't change that to assert we'd never violate lock order?

So, IIUC, we can have concurrent access during parts of the serialization, but not for the full write call?

> No, there's no lock order violation if we skip a lock, only if they're taken in the wrong order.

Ah, okay, the docs' explanation around 'different branches' had me confused about whether we're allowed to skip the root or not. Logically it makes sense; the docs aren't fully clear on that, though, IMO.

@TheBlueMatt (Collaborator) commented

> I don't think so? During channel data serialization we take the read lock here, which would allow concurrent access with the changes in this PR.

Yea, I was figuring that loop would be pretty quick, but I guess it does write all the channels, so it's not free, even if pretty quick.

> Then later we take the write lock here, but this comment leads me to believe we can't change that to assert we'd never violate lock order?

Yea, it's... delicate. IIRC this is the only place we actually take per-peer locks recursively, so the total_consistency_guard saves us, but of course it is a little less fragile to keep relying on the per_peer write lock for that.

@tnull force-pushed the 2024-03-rwlock-peer-state branch from 3658581 to cf2d418 on April 8, 2024 09:46
@tnull (Contributor, Author) commented Apr 8, 2024

Rebased on main to resolve minor conflicts.

@TheBlueMatt (Collaborator) left a comment

Still not entirely convinced this improves parallelism in any practical cases, honestly.

```diff
-let mut peer_state_lock = peer_state_rwlock.write().unwrap();
-let peer_state = &mut *peer_state_lock;
+let peer_state_lock = peer_state_rwlock.read().unwrap();
+let peer_state = &*peer_state_lock;
```
@TheBlueMatt (Collaborator) commented on the diff

Is there a race condition here where we decide whether or not to include the channel at this step, but then the channel gets updated and we write the wrong number of channels? Doesn't seem like it was introduced in this PR, but it's still worth fixing.

@tnull (Contributor, Author) replied

Yeah, I noticed that as well and created a TODO for it, but so far intended to fix it elsewhere. Looking at it again, it seems this could be fixed by a) only iterating the peer states once and persisting to a temporary buffer while keeping track of serializable_peer_count and number_of_funded_channels, or b) taking the per_peer_state.write() lock earlier and holding it for the remainder of the method.

Seems a) is rather inefficient, and b) would mean dropping any chance of concurrent access during at least parts of write. I think switching to an RwLock might still be nice in general and going forward, but if you don't agree, possibly we should just do b) in a separate PR and close this one?
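A rough sketch of what option b) could look like (a simplified stand-in for ChannelManager::write; the real method serializes far more state and uses LDK's serialization macros):

```rust
use std::collections::HashMap;
use std::sync::RwLock;

struct PeerState {
    funded_channel_ids: Vec<u64>,
}

struct Manager {
    per_peer_state: RwLock<HashMap<String, RwLock<PeerState>>>,
}

impl Manager {
    fn write(&self, out: &mut Vec<u8>) {
        // Option b): take the top-level write lock up front and hold it for the
        // remainder of the method, so the set of channels cannot change between
        // counting the funded channels and serializing them.
        let per_peer_state = self.per_peer_state.write().unwrap();

        let mut number_of_funded_channels: u64 = 0;
        for (_peer, peer_state_rwlock) in per_peer_state.iter() {
            let peer_state = peer_state_rwlock.read().unwrap();
            number_of_funded_channels += peer_state.funded_channel_ids.len() as u64;
        }
        out.extend_from_slice(&number_of_funded_channels.to_be_bytes());

        for (_peer, peer_state_rwlock) in per_peer_state.iter() {
            let peer_state = peer_state_rwlock.read().unwrap();
            for channel_id in peer_state.funded_channel_ids.iter() {
                out.extend_from_slice(&channel_id.to_be_bytes());
            }
        }
    }
}
```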

@TheBlueMatt (Collaborator) commented

IIRC we discussed offline and concluded to abandon this in favor of fixing the issue at #2968 (comment) by holding the top-level write lock for longer. Leaving this open until we move that to an issue or open the follow-up PR.

@tnull (Contributor, Author) commented Apr 17, 2024

> IIRC we discussed offline and concluded to abandon this in favor of fixing the issue at #2968 (comment) by holding the top-level write lock for longer. Leaving this open until we move that to an issue or open the follow-up PR.

Now opened to this effect: #2998

Closing this as the benefits of switching to RwLock are not immediately clear, and we can always do so if we find a compelling reason in the future.

@tnull closed this Apr 17, 2024