
ggml: improve ggml_backend_cuda_cpy_tensor_async #13818


Closed
koush wants to merge 1 commit from the cuda-memcpy-async branch

Conversation


@koush koush commented May 27, 2025

I'm working on a general tensor parallel backend that leverages asynchronous tensor copies. I found that the pipeline was stalling here because the async call was actually synchronous.

@koush koush force-pushed the cuda-memcpy-async branch from f76085e to 5225eaa on May 27, 2025 03:53
@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on May 27, 2025
@koush koush changed the title from "ggml: fix ggml_backend_cuda_cpy_tensor_async device to device to actually be async" to "ggml: improve ggml_backend_cuda_cpy_tensor_async" on May 27, 2025
Author

koush commented May 27, 2025

The existing row split implementation does the mul mat and then immediately gathers the tensors, and it is implemented per backend. My current approach leaves the tensors on the GPU for further unary and binary ops; they eventually need to be gathered for the RoPE and RMS norm ops. It's around 15% faster than a single GPU (and much faster than row splitting, which is slower than a single GPU on CUDA), but graph execution is currently disabled, and once I get that sorted it should improve significantly.

@koush koush force-pushed the cuda-memcpy-async branch from 5225eaa to 1984136 on May 27, 2025 04:04
@JohannesGaessler
Collaborator

I also started working on tensor parallelism, see #13776. I would be happy to leave the implementation to you if you're interested in working on it.

Comment on lines 1398 to 1401
if (input_backend->iface.synchronize) {
// async copy succeeded, need to synchronize the input backend to ensure the copy is done before the split backend uses it
input_backend->iface.synchronize(input_backend);
}
Member

A synchronization after an async copy is not necessary. The way async copy is intended to work is roughly explained here:

// asynchronous copy
// the copy is performed after all the currently queued operations in backend_src
// backend_dst will wait for the copy to complete before performing other operations
// automatic fallback to sync copy if async is not supported
GGML_API void ggml_backend_tensor_copy_async(ggml_backend_t backend_src, ggml_backend_t backend_dst, struct ggml_tensor * src, struct ggml_tensor * dst);
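
For reference, a minimal sketch of the usage that contract implies (the ggml_backend_graph_compute call and the graph variable are only placeholders for "other operations" queued on the destination backend):

// queued behind all pending work on backend_src; backend_dst will not run
// later work until the copy has completed
ggml_backend_tensor_copy_async(backend_src, backend_dst, src, dst);

// no explicit ggml_backend_synchronize() is needed in between: work submitted
// to backend_dst afterwards is already ordered after the copy
ggml_backend_graph_compute(backend_dst, graph);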

Author

@koush koush May 27, 2025

That makes sense; however, I was getting garbage output without this sync when using layer parallel. I suspect it's because the stream used for the async copy is the source rather than the dest, as specified in the code comments.

How should I proceed here? I was wary of changing the existing stream behavior.

Member

@slaren slaren May 27, 2025

I don't see a problem with the current implementation. The copy is performed on the source stream so that it happens at the end of all queued operations in the source backend. Then the destination stream waits on an event until the copy is complete, which ensures that any operations added later to the destination backend are executed after the copy has completed.
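
At the CUDA level this corresponds roughly to the following event pattern (a sketch only; src_stream, dst_stream, and copy_event are illustrative names, not the actual variables in ggml-cuda):

// 1. queue the copy on the source stream, behind all work already queued there
cudaMemcpyAsync(dst_ptr, src_ptr, nbytes, cudaMemcpyDeviceToDevice, src_stream);
// 2. record an event on the source stream marking the point where the copy is done
cudaEventRecord(copy_event, src_stream);
// 3. make the destination stream wait on that event: anything queued on it later
//    runs only after the copy has finished, and neither host thread blocks
cudaStreamWaitEvent(dst_stream, copy_event, 0);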

Author

@koush koush May 27, 2025

The issue is that I need to manage the dest stream syncing manually: I am queueing multiple asynchronous memcpy calls and then performing one synchronize after the gathers, which allows for concurrent transfers. Otherwise, copying to the dest context is serialized. Without this change, I am unable to get tensor parallel running above 50% utilization per GPU (with 2 GPUs), since each GPU ends up waiting for the other. I had a workaround that used a different thread to avoid the blocking, but that seems like a lot of unnecessary overhead.
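
A sketch of the pattern described above, with hypothetical n_splits/split_backends/srcs/dsts names: all copies are queued up front so the transfers can run concurrently, and a single synchronize per destination backend follows once everything is queued.

// queue one async copy per destination backend; none of these should block
for (int i = 0; i < n_splits; i++) {
    ggml_backend_tensor_copy_async(backend_src, split_backends[i], srcs[i], dsts[i]);
}
// ... queue the gathers / further ops ...
// one synchronize per backend at the end, rather than an implicit wait per copy
for (int i = 0; i < n_splits; i++) {
    ggml_backend_synchronize(split_backends[i]);
}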

Author

@koush koush May 27, 2025

I should note that I am also not using ggml_backend_tensor_copy_async; I am calling the device memcpy and sync directly, since similar sync behavior exists there. Maybe ggml-backend.cpp should call that instead?

ggml_backend_synchronize(backend_src);

If this is the intended behavior, then this may just be a gap in the API: there is no way to start multiple asynchronous memcpys without blocking on the destination for each copy.

Member

I suppose you could remove the event wait on the dst stream at the end of the async copy, and transfer the responsibility of synchronizing the dst backend to the application.

Author

I updated the change to make the async memcpy happen on the dst stream, as that's presumably where further ops on that data will occur. It's the responsibility of the caller to ensure the src is safe to use until the memcpy is complete, i.e., by calling synchronize on the src backend if necessary. In some cases no synchronize is needed at all. I saw that CANN is the only other backend that implements asynchronous memcpy, so I updated that as well.
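
Roughly what this revision amounts to at the CUDA level (a sketch, not the exact diff; the pointer and stream names are illustrative):

// queue the copy on the *destination* stream, where the ops that consume the
// data will run; no event wait is inserted on behalf of the caller
cudaMemcpyAsync(dst_ptr, src_ptr, nbytes, cudaMemcpyDeviceToDevice, dst_stream);
// because the copy no longer runs on the source stream, work queued later on the
// source backend could overwrite src before the copy executes; per the comment
// above, the caller must prevent that (e.g. with a synchronize) when reuse is possible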

Author

@koush koush May 28, 2025

Actually, I took a further look at the pipeline parallel batching implementation, and I think my last change would have negatively affected performance. I've updated the change to leave the src stream copy intact and then issue a final sync once all inputs are sent. This way the input sync doesn't serialize the memcpys.
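
A hedged sketch of that final shape, with illustrative names (input_backends, split_backend, inputs, input_copies) and following the earlier diff hunk in synchronizing the input backends: every input copy is queued first, on its source stream as before, and the sync is issued once at the end instead of after each copy.

// queue all input copies first so the transfers can overlap with each other
for (int i = 0; i < n_inputs; i++) {
    ggml_backend_tensor_copy_async(input_backends[i], split_backend, inputs[i], input_copies[i]);
}
// a single sync once all inputs are sent, instead of one per copy, so the
// sync no longer serializes the memcpys
for (int i = 0; i < n_inputs; i++) {
    ggml_backend_synchronize(input_backends[i]);
}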

@koush koush force-pushed the cuda-memcpy-async branch from daa3f79 to f23582b on May 27, 2025 22:19
Make device to device actually async; right now it syncs on dst.
Implement host to/from device async.
@koush koush force-pushed the cuda-memcpy-async branch from f23582b to 7966d05 on May 28, 2025 01:28
Author

koush commented May 29, 2025

I also started working on tensor parallelism, see #13776 . I would be happy to leave the implementation to you if you're interested in working on it.

I'm definitely interested, but don't let me stop you.

Author

koush commented Jun 1, 2025

@slaren @JohannesGaessler
I've got a decent starting implementation for tensor parallelism now. It is dependent on this change and some other minor ones in the common backend code.

Testing with this model because it's dense enough to saturate the GPU cores, the initial results for bartowski/nvidia_Llama-3_1-Nemotron-Ultra-253B-v1-GGUF:IQ4_NL on 2x RTX Pro 6000 are:

Before on layer split: 9 tokens/sec, GPU usage at 50/50 in nvidia-smi
Existing tensor split: 13 tokens/sec, GPU usage at 65/65 in nvidia-smi
New tensor split backend: 16 tokens/sec, GPU usage at 90/90 in nvidia-smi, a 25% improvement

The new implementation also performs better than the existing one on smaller models: it has no GPU communication except after the RMS norm calls that require a tensor gather (the existing implementation gathers after every mul mat), and that frequent GPU-to-GPU overhead kills performance.

On a smaller Qwen 3 32B dense model:
Before on layer split: 26 tokens/sec
Existing tensor split: 19 tokens/sec
New tensor split backend: 36 tokens/sec

Since the new GPU implementation is a wrapping backend, it should also work with heterogeneous devices (if their respective backends implement the new backend requirements).

The very much work in progress is here: https://github.com/koush/llama.cpp/tree/parallel

Should I use this pull request to consolidate all my backend API changes before opening a pull request for the tensor parallelism backend?

Member

slaren commented Jun 1, 2025

Should I use this pull request to consolidate all my backend API changes before opening a pull request for the tensor parallelism backend?

I would prefer everything to be in the same PR so that the motivation for the changes to the backend interface is clear.

@koush koush closed this Jun 4, 2025