
Feature Request: Tensor parallelism (--split-mode row) over RPC #13083

Open
@tobi97h

Description


Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Implement tensor parallelism over RPC. At the moment, setting --split-mode row has no effect when used with the RPC server.

Could you provide me with a rough outline of how I would best go about it?

What steps would I have to take to extend the functionality of the rpc server?
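For concreteness, the setup I am describing (a sketch of the intended invocation; the flag spellings are from my current understanding of the docs and may differ between versions) is roughly: run `rpc-server` on each GPU node, then start the client with something like `llama-cli -m model.gguf --rpc host1:50052,host2:50052 --split-mode row`. The `--rpc` part works today, but as far as I can tell `--split-mode row` has no effect in that configuration.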

Motivation

I love your project; it's everything I was looking for. You guys are true heroes, the antidote to NVIDIA's corporate greed.

At home I am running two Tesla P100s on old gaming mainboards, connected via an InfiniBand NIC in Ethernet mode. The NIC is dirt cheap, as is the Tesla P100; if we can get this to work, you could easily run 8B models at 60+ t/s with just two cards.

This would unlock the full potential of homelabs and smaller enterprises.

Love you guys

Possible Implementation

I just started looking into it and found the existing implementation of row splitting on a single host:

    if (split_mode == LLAMA_SPLIT_MODE_ROW) {
        // ask the device's backend registry whether it exposes a split buffer type
        ggml_backend_reg_t reg = ggml_backend_dev_backend_reg(dev);
        auto ggml_backend_split_buffer_type_fn = (ggml_backend_split_buffer_type_t)
            ggml_backend_reg_get_proc_address(reg, "ggml_backend_split_buffer_type");
        if (ggml_backend_split_buffer_type_fn) {
            // find the index of this device within its backend registry
            size_t dev_index = [&]() {
                auto * reg = ggml_backend_dev_backend_reg(dev);
                for (size_t i = 0; i < ggml_backend_reg_dev_count(reg); ++i) {
                    if (ggml_backend_reg_dev_get(reg, i) == dev) {
                        return i;
                    }
                }
                throw std::runtime_error(format("device %s not found in its backend reg", ggml_backend_dev_name(dev)));
            }();
            // buffer type that splits tensor rows across devices according to tensor_split
            auto * buft = ggml_backend_split_buffer_type_fn(dev_index, tensor_split);
            if (buft != nullptr) {
                buft_list.emplace_back(dev, buft);
            }
        }
    }

The missing piece is distributing the splits via RPC to different hosts for computation. Which files/folders would I need to look at? I am asking for some general guidance.
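
My own rough guess, purely a sketch of an assumption and not existing code: the RPC backend's registry could answer the same "ggml_backend_split_buffer_type" proc-address lookup that the CUDA backend answers, somewhere in the RPC backend source (ggml/src/ggml-rpc/ggml-rpc.cpp in my checkout). The name ggml_backend_rpc_split_buffer_type below is hypothetical, and I am assuming the callback signature matches ggml_backend_split_buffer_type_t as it is used in the snippet above.

    // hypothetical sketch only -- none of this exists in ggml-rpc.cpp today
    #include <cstring> // strcmp

    // assumed to match ggml_backend_split_buffer_type_t:
    //   ggml_backend_buffer_type_t (*)(int main_device, const float * tensor_split)
    // it would build a buffer type that shards tensor rows across the registered RPC devices
    static ggml_backend_buffer_type_t ggml_backend_rpc_split_buffer_type(int main_device, const float * tensor_split);

    static void * ggml_backend_rpc_reg_get_proc_address(ggml_backend_reg_t reg, const char * name) {
        GGML_UNUSED(reg);
        if (std::strcmp(name, "ggml_backend_split_buffer_type") == 0) {
            return (void *) ggml_backend_rpc_split_buffer_type;
        }
        return nullptr;
    }

If that assumption is right, the loader code quoted above should pick it up automatically, since it only asks the device's backend registry for that proc address. Am I on the right track, or is more plumbing needed on the rpc-server side?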
