
Feature Request: Tensor parallelism (--split-mode row) over RPC #13083

Open
@tobi97h

Description


Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Implement tensor parallelism over RPC. At the moment, setting --split-mode row has no effect when used with the RPC server.

Could you provide me with a rough outline of how I would best go about it?

What steps would I have to take to extend the functionality of the rpc server?
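For concreteness, the setup I am describing (a sketch of the intended invocation; the flag spellings are from my current understanding of the docs and may differ between versions) is roughly: run `rpc-server` on each GPU node, then start the client with something like `llama-cli -m model.gguf --rpc host1:50052,host2:50052 --split-mode row`. The `--rpc` part works today, but as far as I can tell `--split-mode row` has no effect in that configuration.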

Motivation

I love your project; it's everything I was looking for. You guys are true heroes, the antidote to NVIDIA's corporate greed.

At home I am running two Tesla P100s on old gaming mainboards, connected via an InfiniBand NIC in Ethernet mode. The NIC is dirt cheap, as is the Tesla P100; if we can get this to work, you could easily run 8B models at 60+ t/s with just two cards.

This would unlock the full potential of homelabs and smaller enterprises.

Love you guys

Possible Implementation

I just started looking into it and found the existing implementation of row splitting on a single host:

    if (split_mode == LLAMA_SPLIT_MODE_ROW) {
        // ask the device's backend registry whether it exposes a split buffer type
        ggml_backend_reg_t reg = ggml_backend_dev_backend_reg(dev);
        auto ggml_backend_split_buffer_type_fn = (ggml_backend_split_buffer_type_t)
            ggml_backend_reg_get_proc_address(reg, "ggml_backend_split_buffer_type");
        if (ggml_backend_split_buffer_type_fn) {
            // find the index of this device within its backend registry
            size_t dev_index = [&]() {
                auto * reg = ggml_backend_dev_backend_reg(dev);
                for (size_t i = 0; i < ggml_backend_reg_dev_count(reg); ++i) {
                    if (ggml_backend_reg_dev_get(reg, i) == dev) {
                        return i;
                    }
                }
                throw std::runtime_error(format("device %s not found in its backend reg", ggml_backend_dev_name(dev)));
            }();
            // buffer type that splits tensor rows across devices according to tensor_split
            auto * buft = ggml_backend_split_buffer_type_fn(dev_index, tensor_split);
            if (buft != nullptr) {
                buft_list.emplace_back(dev, buft);
            }
        }
    }

The missing piece is distributing the splits via RPC to different hosts for computation. Which files/folders would I need to look at? I am asking for some general guidance.
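
My own rough guess, purely a sketch of an assumption and not existing code: the RPC backend's registry could answer the same "ggml_backend_split_buffer_type" proc-address lookup that the CUDA backend answers, somewhere in the RPC backend source (ggml/src/ggml-rpc/ggml-rpc.cpp in my checkout). The name ggml_backend_rpc_split_buffer_type below is hypothetical, and I am assuming the callback signature matches ggml_backend_split_buffer_type_t as it is used in the snippet above.

    // hypothetical sketch only -- none of this exists in ggml-rpc.cpp today
    #include <cstring> // strcmp

    // assumed to match ggml_backend_split_buffer_type_t:
    //   ggml_backend_buffer_type_t (*)(int main_device, const float * tensor_split)
    // it would build a buffer type that shards tensor rows across the registered RPC devices
    static ggml_backend_buffer_type_t ggml_backend_rpc_split_buffer_type(int main_device, const float * tensor_split);

    static void * ggml_backend_rpc_reg_get_proc_address(ggml_backend_reg_t reg, const char * name) {
        GGML_UNUSED(reg);
        if (std::strcmp(name, "ggml_backend_split_buffer_type") == 0) {
            return (void *) ggml_backend_rpc_split_buffer_type;
        }
        return nullptr;
    }

If that assumption is right, the loader code quoted above should pick it up automatically, since it only asks the device's backend registry for that proc address. Am I on the right track, or is more plumbing needed on the rpc-server side?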
