
feat: list all registered schedulers (#1009) #1050


Open: wants to merge 1 commit into main

@clumsy
Contributor

clumsy commented Apr 23, 2025

A simple change that merges the default schedulers with any entrypoint-registered ones, so that all registered schedulers are listed.

Test plan:
[x] all existing tests should pass

@facebook-github-bot added the CLA Signed label Apr 23, 2025
@kiukchung
Contributor

Could you provide more context on what you want to achieve with this?

@clumsy
Contributor Author

clumsy commented Apr 23, 2025

All the details are in the linked #1009, @kiukchung. Please let me know if more details are needed there.

@kiukchung
Contributor

Hi @clumsy, thanks for the pointer. torchx.schedulers.get_scheduler_factories() is a public API, and this change is not backwards-compatible for the case get_scheduler_factories(skip_defaults=False) when entrypoint schedulers are registered. Today that call returns only the configured schedulers; with this change users would get their configured schedulers plus the defaults.

Could you describe your use-case in wanting the list of supported schedulers offered to your users to be dynamic? Usually torchx users want to control the schedulers they configure for their users.
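
To make the compatibility concern concrete, here is a minimal sketch of the two behaviors being discussed. It is not TorchX code: `load_fallback`, `load_additive`, the `my_cluster` scheduler name, and the string values standing in for factories are all hypothetical, and letting registered entries win on a name clash in the additive case is an assumption rather than something this PR specifies.

```python
from typing import Dict

# Illustrative stand-ins: real scheduler factories are callables, not strings.
DEFAULTS: Dict[str, str] = {"local_cwd": "default local_cwd factory"}


def load_fallback(registered: Dict[str, str], defaults: Dict[str, str]) -> Dict[str, str]:
    """Current behavior as described above: defaults apply only if nothing is registered."""
    return dict(registered) if registered else dict(defaults)


def load_additive(registered: Dict[str, str], defaults: Dict[str, str]) -> Dict[str, str]:
    """Proposed behavior: defaults are always included.

    Assumption: a registered entry overrides a default with the same name.
    """
    return {**defaults, **registered}


if __name__ == "__main__":
    registered = {"my_cluster": "hypothetical custom factory"}
    print(sorted(load_fallback(registered, DEFAULTS)))
    print(sorted(load_additive(registered, DEFAULTS)))
```

The two functions agree when nothing is registered and differ as soon as one entrypoint exists, which is exactly the case where existing callers would see a different result.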

@clumsy
Contributor Author

clumsy commented May 1, 2025

Sure, @kiukchung

Take NeMo, for example: NVIDIA ships it with all of its dependencies, including nemo-run (https://github.com/NVIDIA/NeMo/blob/94589bde88fab1997c842be4e000faf69180cffb/nemo/collections/common/parts/nemo_run_utils.py#L18).

Unfortunately, nemo-run unconditionally registers custom schedulers: https://github.com/NVIDIA/NeMo-Run/blob/main/pyproject.toml#L43-L48

Because those registrations replace the defaults, we cannot use local_cwd from within the NeMo container, for example, or in any other environment where nemo-run is installed.

It makes sense to have a feature that restricts the available schedulers, but does it have to be the default behavior?
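
For anyone who wants to check what is registered in their own environment, the following is a small sketch using only the standard library. The helper name `list_entrypoints` is hypothetical, and it does not reproduce the actual nemo-run entries; it simply reports whatever installed packages declare under the `torchx.schedulers` entrypoint group.

```python
from importlib.metadata import entry_points
from typing import Dict


def list_entrypoints(group: str = "torchx.schedulers") -> Dict[str, str]:
    """Return {entrypoint name: 'module:attr' target} for one entrypoint group.

    If two installed distributions register the same name, the later one
    seen here silently wins; this helper is for inspection only.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a select() method.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    for name, target in sorted(list_entrypoints().items()):
        print(f"{name} -> {target}")
```

An empty result means no package in the environment registers schedulers in that group, which is the situation in which the defaults would apply.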

@kiukchung
Contributor

kiukchung commented May 1, 2025

@clumsy ah that's an interesting edge-case. What you basically want is for torchx.schedulers to be additive. We don't treat it as such today. Since Python entrypoint groups don't compound, we have to come up with a convention for the group names.

One thing to note about DEFAULT_SCHEDULER_MODULES is that TorchX treats them as the default if you haven't registered your own (akin to map.get("key", default="DEFAULT_VAL")) rather than "generally useful ones" that get added regardless of whether you have your own registrations. This is generally the case for all the TorchX configurations exposed as entrypoints (see: https://pytorch.org/torchx/latest/advanced.html)

We could do something like {org_name}.torchx.schedulers and, at load time, select all *.torchx.schedulers entrypoint groups. For backwards compatibility we'd also have to keep reading torchx.schedulers.

If you're open to it, you can add support for prefixes in torchx.util.entrypoints.load(group: str, default: Optional[Dict[str, Any]] = None, skip_defaults: bool = False) and make a change in torchx.schedulers.get_schedulers() to call the load() function appropriately.

There are some interesting cases around name conflicts and ordering (e.g. if nemo-run registers a scheduler with the same name as one registered elsewhere, which one wins?)
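
A rough sketch of how the prefixed-group idea could work, operating on plain dictionaries rather than real entrypoints so it stays self-contained. Everything here is an assumption for illustration: `select_groups`, `merge_groups`, the `acme` organization, and in particular the conflict policy (raise on a clash, iterate groups in sorted order), which is only one of several possible answers to the question above.

```python
from fnmatch import fnmatchcase
from typing import Dict, Mapping

BASE_GROUP = "torchx.schedulers"

Groups = Mapping[str, Mapping[str, str]]


def select_groups(all_groups: Groups, base: str = BASE_GROUP) -> Dict[str, Mapping[str, str]]:
    """Keep the legacy group plus every '<org_name>.<base>' group."""
    return {
        name: entries
        for name, entries in all_groups.items()
        if name == base or fnmatchcase(name, f"*.{base}")
    }


def merge_groups(groups: Groups) -> Dict[str, str]:
    """Merge scheduler entries from all selected groups.

    Assumed policy: groups are visited in sorted order so the result is
    deterministic, and two groups mapping the same scheduler name to
    different targets is treated as an error instead of picking a winner.
    """
    merged: Dict[str, str] = {}
    owner: Dict[str, str] = {}
    for group in sorted(groups):
        for name, target in groups[group].items():
            if name in merged and merged[name] != target:
                raise ValueError(
                    f"scheduler {name!r} registered by both {owner[name]!r} and {group!r}"
                )
            merged[name] = target
            owner[name] = group
    return merged


if __name__ == "__main__":
    groups = {
        "torchx.schedulers": {"local_cwd": "pkg_a.sched:create"},
        "acme.torchx.schedulers": {"acme_batch": "acme.sched:create"},
        "torchx.named_resources": {"unrelated": "pkg_b.res:make"},
    }
    print(merge_groups(select_groups(groups)))
```

Raising on conflicts is the most conservative choice; a first-wins or priority-ordered policy would be equally easy to express but would need an agreed ordering between organizations.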

Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
4 participants