Add Moore Threads GPU support and update GitHub workflow for MUSA build #3069

Merged: 6 commits into ggml-org:master on Apr 28, 2025

Conversation

@yeahdongcn (Contributor) commented Apr 24, 2025

This PR includes the following updates:

  • Adds Moore Threads GPU support to README.md and replaces usage of ./main with whisper-cli.
  • Enables building of ghcr.io/ggml-org/whisper.cpp:main-musa via GitHub workflow for MUSA support.
  • Updates the PATH environment variable in both main and main-cuda Dockerfiles for improved CLI usability.
  • Enhances the Makefile to forward GGML_CUDA and GGML_MUSA environment variables to cmake.

    (Previously, setting export GGML_CUDA=1 on the host had no effect during the build. This change ensures these flags are now respected; see the usage sketch after this list.)
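
For illustration, a minimal sketch of the intended flow after this change (the make target comes from the Testing section below; the whisper-cli path assumes a default cmake build layout and is not part of this PR):

# Previously these exports were silently ignored by make; with this PR
# they are forwarded to the cmake configure step.
export GGML_MUSA=1        # or: export GGML_CUDA=1
make base.en              # now configures the build with -DGGML_MUSA=1
./build/bin/whisper-cli -m models/ggml-base.en.bin -f samples/jfk.wav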

Testing Done

# host
## cmake
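# MUSA_ARCHITECTURES="21" targets the MTT S80 (compute capability 2.1)
# reported in the device log below.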
cmake -B build -DGGML_MUSA=1 -DMUSA_ARCHITECTURES="21"
cmake --build build -j --config Release

## make
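# CMAKE_ARGS passes the flag explicitly; with this PR, exporting
# GGML_MUSA=1 before running make works as well.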
make base.en CMAKE_ARGS="-DGGML_MUSA=1"

# container
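# Build the MUSA image from the in-tree Dockerfile, then run whisper-cli inside it.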
docker build -t whisper.cpp:main-musa -f ./.devops/main-musa.Dockerfile .
docker run -it --rm -v $PWD/samples:/audios -v $PWD/models/:/models whisper.cpp:main-musa "whisper-cli -m /models/ggml-base.en.bin -f ./samples/jfk.wav"
whisper_init_from_file_with_params_no_state: loading model from '/models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu    = 1
whisper_init_with_params_no_state: flash attn = 0
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw        = 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
  Device 0: MTT S80, compute capability 2.1, VMM: yes
whisper_init_with_params_no_state: devices    = 2
whisper_init_with_params_no_state: backends   = 2
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51864
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:        MUSA0 total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_backend_init_gpu: using MUSA0 backend
whisper_init_state: kv self size  =    6.29 MB
whisper_init_state: kv cross size =   18.87 MB
whisper_init_state: kv pad  size  =    3.15 MB
whisper_init_state: compute buffer (conv)   =   17.22 MB
whisper_init_state: compute buffer (encode) =   85.86 MB
whisper_init_state: compute buffer (cross)  =    4.65 MB
whisper_init_state: compute buffer (decode) =   97.27 MB

system_info: n_threads = 4 / 12 | WHISPER : COREML = 0 | OPENVINO = 0 | MUSA : PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...


[00:00:00.000 --> 00:00:11.000]   And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.


whisper_print_timings:     load time =   100.09 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     7.57 ms
whisper_print_timings:   sample time =    47.42 ms /   133 runs (     0.36 ms per run)
whisper_print_timings:   encode time =   121.28 ms /     1 runs (   121.28 ms per run)
whisper_print_timings:   decode time =    68.62 ms /     3 runs (    22.87 ms per run)
whisper_print_timings:   batchd time =  1870.63 ms /   126 runs (    14.85 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (     0.00 ms per run)
whisper_print_timings:    total time =  2222.29 ms
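
Once the workflow publishes the image, the same run should work straight from the registry (a sketch assuming the ghcr.io tag named in the description; the flags mirror the local test above):

docker pull ghcr.io/ggml-org/whisper.cpp:main-musa
docker run -it --rm -v $PWD/samples:/audios -v $PWD/models/:/models ghcr.io/ggml-org/whisper.cpp:main-musa "whisper-cli -m /models/ggml-base.en.bin -f ./samples/jfk.wav"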

Signed-off-by: Xiaodong Ye <[email protected]>
@yeahdongcn requested a review from ggerganov on April 24, 2025, 10:32
@yeahdongcn (Contributor, Author)

@ggerganov @danbev Do you have any other concerns? Can we merge this?

@danbev (Collaborator) commented Apr 28, 2025

> @ggerganov @danbev Do you have any other concerns? Can we merge this?

All good from my side 👍

@ggerganov merged commit 50218b9 into ggml-org:master on Apr 28, 2025
51 checks passed