Skip to content

Rewrite std::sync::TaskPool to be load balancing and panic-resistant #18941

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 16, 2014

Conversation

reem
Copy link
Contributor

@reem reem commented Nov 14, 2014

The previous implementation was very likely to cause panics during
unwinding through this process:

  • child panics, drops its receiver
  • taskpool comes back around and sends another job over to that child
  • the child receiver has hung up, so the taskpool panics on send
  • during unwinding, the taskpool attempts to send a quit message to
    the child, causing a panic during unwinding
  • panic during unwinding causes a process abort

This meant that TaskPool upgraded any child panic to a full process
abort. This came up in Iron when it caused crashes in long-running
servers.

This implementation uses a single channel to communicate between
spawned tasks and the TaskPool, which significantly reduces the complexity
of the implementation and cuts down on allocation. The TaskPool uses
the channel as a single-producer-multiple-consumer queue.

Additionally, through the use of send_opt and recv_opt instead of
send and recv, this TaskPool is robust on the face of child panics,
both before, during, and after the TaskPool itself is dropped.

This TaskPool uses an additional "monitor" task start new child
tasks to replace those that panic.

Due to the TaskPool no longer using an init_fn_factory, this is a

[breaking-change]

otherwise, the API has not changed.

If you used init_fn_factory in your code, and this change breaks for
you, you can instead use an AtomicUint counter and a channel to
move information into child tasks.

@reem
Copy link
Contributor Author

reem commented Nov 14, 2014

This already exists as a library here: https://github.com/reem/rust-resistant-taskpool

r? @huonw

channels: Vec<Sender<Msg<T>>>,
next_index: uint,
///
/// Spawns n + 1 tasks and respawns tasks on subtask panics.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe

Spawns `n` worker tasks and one additional supervisor to replenish the pool after any panics.

@huonw
Copy link
Member

huonw commented Nov 14, 2014

I like it, although I'd like a second opinion, e.g. @alexcrichton.

@alexcrichton
Copy link
Member

This looks great, thanks @reem! Would it be possible to not have a monitoring task at all? Could each Sentinel detect failure via task::failing() and if so it spawns a new thread with another handle to the mutex?

@reem
Copy link
Contributor Author

reem commented Nov 14, 2014

Great idea! I was trying to factor the monitor out, but didn't consider having the Sentinel be the one to actually spawn a new task.

@reem
Copy link
Contributor Author

reem commented Nov 14, 2014

I refactored all of task-re-spawning logic into Sentinel, and it really simplifies the code along with eliminating the monitor task.


// Cancel and destroy this sentinel.
fn cancel(mut self) {
self.active = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh nice idea, I like this better than task::failing()

@alexcrichton
Copy link
Member

r=me with @huonw's comment

The previous implementation was very likely to cause panics during
unwinding through this process:

- child panics, drops its receiver
- taskpool comes back around and sends another job over to that child
- the child receiver has hung up, so the taskpool panics on send
- during unwinding, the taskpool attempts to send a quit message to
  the child, causing a panic during unwinding
- panic during unwinding causes a process abort

This meant that TaskPool upgraded any child panic to a full process
abort. This came up in Iron when it caused crashes in long-running
servers.

This implementation uses a single channel to communicate between
spawned tasks and the TaskPool, which significantly reduces the complexity
of the implementation and cuts down on allocation. The TaskPool uses
the channel as a single-producer-multiple-consumer queue.

Additionally, through the use of send_opt and recv_opt instead of
send and recv, this TaskPool is robust on the face of child panics,
both before, during, and after the TaskPool itself is dropped.

Due to the TaskPool no longer using an `init_fn_factory`, this is a

[breaking-change]

otherwise, the API has not changed.

If you used `init_fn_factory` in your code, and this change breaks for
you, you can instead use an `AtomicUint` counter and a channel to
move information into child tasks.
@barosl
Copy link
Contributor

barosl commented Nov 14, 2014

Cool thing! This fixes #18836. I'm closing it now.

@bors bors merged commit 93c4942 into rust-lang:master Nov 16, 2014
@reem reem deleted the better-task-pool branch November 16, 2014 22:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants