
sync : ggml #2844

Merged
merged 58 commits into master from sync-ggml-25-02-26 on Feb 27, 2025

Conversation

ggerganov (Member)

No description provided.

retr0reg and others added 30 commits February 26, 2025 22:39
Add bounds checking in `rpc_server::copy_tensor` to prevent out-of-bounds writes
+ Check that `(uint8_t *)dst->data + ggml_nbytes(src)` remains within the destination buffer's allocated region.
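A minimal sketch of the check described above, assuming the server knows the base pointer and size of the destination buffer (the helper name and the buffer parameters are illustrative, not the actual `rpc_server::copy_tensor` code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include "ggml.h"

// Reject the copy if writing ggml_nbytes(src) bytes at dst->data would run
// past the end of the destination buffer's allocated region.
static bool copy_tensor_in_bounds(const struct ggml_tensor * src,
                                  const struct ggml_tensor * dst,
                                  const uint8_t * dst_buf_base,
                                  size_t dst_buf_size) {
    const uint8_t * dst_data = (const uint8_t *) dst->data;
    const size_t    n_bytes  = ggml_nbytes(src);

    return dst_data >= dst_buf_base &&
           n_bytes  <= dst_buf_size &&
           dst_data + n_bytes <= dst_buf_base + dst_buf_size;
}
```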
Some old/vendor-forked versions of LLVM still use 256. Explicitly set it to 1024 to align with upstream LLVM.

Signed-off-by: fxzjshm <[email protected]>
* CUDA: non-contiguous (RMS) norm support

---------

Co-authored-by: Georgi Gerganov <[email protected]>
… (llama/11690)

Avoids breakage in the Nix flake build introduced by b0569130c5e9c671152c913d82803b7c2f014ff9
* vulkan: optimize coopmat2 iq2/iq3 callbacks

* build: trigger CI on GLSL compute shader changes
SYCL does not support non-contiguous tensors for norm operations
* ggml : optimize convert f32<->f16 for loongarch_asx

* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16

* ggml : Fix warnings when running CPU CI locally on LoongArch
After the barrier in the last iteration is executed, the loop termination condition is still evaluated. By that point the main thread may already have destroyed the cgraph object and its nodes, so another thread can end up accessing memory that is already gone. Trouble can also happen when n_nodes == 0 or abort is called, though I'm not sure whether that situation is actually possible.

The last synchronization should therefore be done after the loop, to ensure the cgraph/cplan won't be accessed after the main thread exits the function.
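A self-contained sketch of the pattern being fixed, using hypothetical names (`shared_state`, `worker`, `N_THREADS`) and a plain pthread barrier rather than the actual ggml threadpool code: the worker copies the loop bound locally and places the final synchronization after the loop, so no thread touches the shared object once the main thread is allowed to return.

```c
#include <pthread.h>

#define N_THREADS 4

typedef struct {
    int n_nodes;                  // stands in for cgraph->n_nodes
    pthread_barrier_t barrier;
} shared_state;

static void * worker(void * arg) {
    shared_state * st = (shared_state *) arg;

    // Copy the bound locally: re-reading st->n_nodes after the final barrier
    // would race with the main thread tearing down the shared object.
    const int n_nodes = st->n_nodes;
    for (int i = 0; i < n_nodes; i++) {
        // ... compute node i ...
    }

    // Final synchronization after the loop: past this point no thread may
    // dereference st, so the main thread is free to destroy it.
    pthread_barrier_wait(&st->barrier);
    return NULL;
}

int main(void) {
    shared_state st = { .n_nodes = 8 };
    pthread_barrier_init(&st.barrier, NULL, N_THREADS);

    pthread_t threads[N_THREADS - 1];
    for (int i = 0; i < N_THREADS - 1; i++) {
        pthread_create(&threads[i], NULL, worker, &st);
    }
    worker(&st);                  // the main thread participates too

    for (int i = 0; i < N_THREADS - 1; i++) {
        pthread_join(threads[i], NULL);
    }
    pthread_barrier_destroy(&st.barrier);
    return 0;
}
```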
* Update ggml.c

* Update arg.cpp

* Update speculative.h
* CUDA: use arch list for feature availability check

---------

Co-authored-by: Diego Devesa <[email protected]>
… (llama/11803)

* Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx

* Fix #11802: PR #11803 - keep RegQueryValueExA, remove the TEXT macro; the description needs to be an ANSI string
* Bug fix for clamp_f32

When using tensors larger than 1-D, the clamp operation does not work because the kernel returns early when ith is not 0 (a sketch follows after this commit's notes).

* Bug fix for clamp_f32

* Bug fix for clamp_f32
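A rough sketch of the pattern behind the fix, assuming the usual ggml convention of splitting rows across threads by `ith`/`nth` (the helper below is illustrative, not the actual clamp_f32 kernel): a kernel that returns unless `ith == 0` only processes thread 0's share, leaving multi-row tensors partially clamped, whereas each thread should take every nth row.

```c
#include <stdint.h>

// Each thread ith of nth clamps rows ith, ith + nth, ith + 2*nth, ...
// instead of bailing out when ith != 0.
static void clamp_rows_f32(float * data, int64_t n_rows, int64_t n_cols,
                           float min, float max, int ith, int nth) {
    for (int64_t r = ith; r < n_rows; r += nth) {
        float * row = data + r * n_cols;
        for (int64_t i = 0; i < n_cols; i++) {
            const float v = row[i];
            row[i] = v < min ? min : (v > max ? max : v);
        }
    }
}
```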
* ggml : x2 speed for WASM by optimizing SIMD

* fix bad merging

* rm trailing spaces

* rm redundant clamp

* better quantize_row_q8_K

Co-authored-by: camel-cdr <[email protected]>

* remove memset that causes buffer overflow
Co-authored-by: camel-cdr <[email protected]>

---------

Co-authored-by: camel-cdr <[email protected]>
slaren and others added 28 commits February 26, 2025 22:39
* ggml-cpu : add chunking support to mul_mat_id

* allocate chunk counter in wdata
parallelize src1 quantization by column to allow parallelization even when there is only one row (a sketch follows after this commit's notes)

* disable for arm

* cleanup

* better way to disable for arm

* fix uninitialized counter when using 1 thread only

* revert test-backend-ops changes
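A hypothetical sketch of the column-chunked src1 quantization mentioned above (the function, typedef, and parameter names are illustrative, not the actual mul_mat_id code): splitting each row of src1 into fixed-size column chunks lets every thread pick up work even when src1 has a single row.

```c
#include <stddef.h>
#include <stdint.h>

// Quantizes n float values starting at src into dst (e.g. a row_q8_* routine).
typedef void (*quantize_chunk_fn)(const float * src, void * dst, int64_t n);

// cols_per_chunk is assumed to be a multiple of the quantization block size,
// so every chunk except possibly the last occupies bytes_per_chunk bytes.
static void quantize_row_by_chunks(const float * src, uint8_t * dst_q,
                                   int64_t n_cols, int64_t cols_per_chunk,
                                   size_t bytes_per_chunk,
                                   quantize_chunk_fn quantize,
                                   int ith, int nth) {
    const int64_t n_chunks = (n_cols + cols_per_chunk - 1) / cols_per_chunk;

    // Thread ith of nth handles chunks ith, ith + nth, ith + 2*nth, ...
    for (int64_t c = ith; c < n_chunks; c += nth) {
        const int64_t col0 = c * cols_per_chunk;
        const int64_t n    = (n_cols - col0 < cols_per_chunk) ? n_cols - col0 : cols_per_chunk;
        quantize(src + col0, dst_q + c * bytes_per_chunk, n);
    }
}
```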
* musa: Update MUSA SDK version to rc3.1.1

Signed-off-by: Xiaodong Ye <[email protected]>

* musa: Remove workaround in PR #10042

Signed-off-by: Xiaodong Ye <[email protected]>

---------

Signed-off-by: Xiaodong Ye <[email protected]>
* mm subgroup size

* upload vulkan x86 builds
* Optimize ggml_vec_dot_q3_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q4_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q6_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q5_K_q8_K for LoongArch ASX

* Optimize ggml_vec_dot_q2_K_q8_K for LoongArch ASX

* Optimize mul_sum_i8_pairs_float for LoongArch ASX

* Optimize ggml_vec_dot_iq4_xs_q8_K for LoongArch ASX
* opencl: fix `ROPE`

* opencl: fix `SOFT_MAX`

* Add fp16 variant

* opencl: enforce subgroup size for `soft_max`
* vulkan: initial support for IQ1_S and IQ1_M quantizations

* vulkan: define MMV kernels for IQ1 quantizations

* devops: increase timeout of Vulkan tests again

* vulkan: simplify ifdef for init_iq_shmem
* repo : update links to new url

ggml-ci

* cont : more urls

ggml-ci
* vulkan: support memset_tensor

* vulkan: support GGML_OP_SUM

* vulkan: implement GGML_OP_ARGMAX

* vulkan: implement GGML_OP_SUB

* vulkan: implement GGML_OP_COUNT_EQUAL

* vulkan: implement GGML_OP_OPT_STEP_ADAMW

* vulkan: fix check_results RWKV_WKV6 crash and memory leaks

* vulkan: implement GGML_OP_REPEAT_BACK

* tests: remove invalid test-backend-ops REPEAT_BACK tests

* vulkan: fix COUNT_EQUAL memset using a fillBuffer command
* CUDA: use async data loading for FlashAttention

---------

Co-authored-by: Diego Devesa <[email protected]>
…11917)

* Added SVE implementation for Q3_K kernel in ggml-cpu-quants.c file

* Improved formatting of code in ggml-cpu-quants.c file

* style : minor fixes

* style : less whitespaces

* style : ptr spacing

---------

Co-authored-by: vithulep <[email protected]>
Co-authored-by: Georgi Gerganov <[email protected]>
* ggml-cpu: Add CPU backend support for KleidiAI library

* Add environmental variable GGML_KLEIDIAI_SME

* Add support for multithread LHS conversion

* Switch kernel selection order to dotprod and i8mm

* updates for review comments

* More updates for review comments

* Reorganize and rename KleidiAI files

* Move ggml-cpu-traits.h to source file

* Update cmake for SME build and add alignment for SME

* Remove appending GGML_USE_CPU_KLEIDIAI to the GGML_CDEF_PUBLIC list
* MUSA: support ARM64 and enable __dp4a, etc.

* fix cross entropy loss op for musa

* update

* add cc info log for musa

* add comment for the MUSA .cc calculation block

---------

Co-authored-by: Bodhi Hu <[email protected]>
* CUDA: correct the lowest Maxwell supported by CUDA 12

---------

Co-authored-by: Johannes Gäßler <[email protected]>
* ggml: add s390x ARCH_FLAGS for compilation

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add SIMD for s390x using vector intrinsics

SIMD is activated for:
* ggml_vec_dot_f32
* ggml_vec_dot_f16
* ggml_vec_mad_f32
* ggml_vec_mad_f16
* ggml_vec_mad_f32_unroll
* ggml_vec_scale_f32
* ggml_vec_scale_f16

SIMD is NOT activated for:
* ggml_vec_dot_f16_unroll (pending bugfix)

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix missing escape character in GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add temporary patch for GGML_F32_ARR and GGML_F16_ARR

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix s390x GGML_F32x4_REDUCE

Signed-off-by: Aaron Teo <[email protected]>

* ggml: full SIMD activation for F32,F16 s390x

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add option to disable s390x VXE/VXE2

Signed-off-by: Aaron Teo <[email protected]>

* ggml: change vecintrin.h include to ggml-cpu-impl

* add __VXE__ and __VXE2__ macros

Signed-off-by: Aaron Teo <[email protected]>

* cmake: add s390x target detection for VX/VXE/VXE2

Signed-off-by: Aaron Teo <[email protected]>

* ggml: move s390x vector intrinsics to ggml-cpu-impl.h

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x Q8_0 SIMD

Signed-off-by: Aaron Teo <[email protected]>

* ggml: correct documentation for Q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x reduce code complexity Q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x bugfix typo Q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activated for Q4_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x inline vec_reve

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for Q4_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add VXE backend feature

Signed-off-by: Aaron Teo <[email protected]>

* ggml: remove test.py

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for quantize_row_q8_0

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for quantize_row_q8_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for iq4_xs

Signed-off-by: Aaron Teo <[email protected]>

* ggml: bugfix iq4_xs

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for iq4_nl

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add float, double, and long vector data type

Signed-off-by: Aaron Teo <[email protected]>

* ggml: clean up iq4_xs SIMD

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix improper use of restrict keyword

Signed-off-by: Aaron Teo <[email protected]>

* ggml: update warning message for ggml_vec_tbl

Signed-off-by: Aaron Teo <[email protected]>

* ggml: untested implementation of ggml_vec_dot_iq2_xxs_q8_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: update ggml_vec_dot_q4_1_q8_1 to use typedefs

Signed-off-by: Aaron Teo <[email protected]>

* ggml: switch to restrict for iq4_nl

Signed-off-by: Aaron Teo <[email protected]>

* ggml: slight dot product speed improvement for q4_1_q8_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for q6_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add missing `_t` to ggml_int8x16x4_t

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix missing `_t` for ggml_vec_xl_s8x4

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix more missing `_t`

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add unroll and prefetch to Q8_0

increase of 3.86% for prompt processing and 32.22% for token generation

Signed-off-by: Aaron Teo <[email protected]>

* ggml: patch Q8_0 to use proper vector sizes

Signed-off-by: Aaron Teo <[email protected]>

* ggml: optimise Q8_0 dot prod compute kernel further

Signed-off-by: Aaron Teo <[email protected]>

* ggml: add unroll and prefetch to Q4_1

Signed-off-by: Aaron Teo <[email protected]>

* ggml: refactor Q6_K variable naming for readability

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q6_K typos

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for Q5_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix wrong char*x16_t naming

Signed-off-by: Aaron Teo <[email protected]>

* ggml: Q5_K y0 wrong signedness

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q5_K invalid uchar type

Signed-off-by: Aaron Teo <[email protected]>

* ggml: s390x SIMD activation for Q4_K

Signed-off-by: Aaron Teo <[email protected]>

* ggml: fix Q4_K invalid vector intrinsics

Signed-off-by: Aaron Teo <[email protected]>

* ggml: simplify ggml_padd_s16 compute kernel

Signed-off-by: Aaron Teo <[email protected]>

* ggml: correct ggml-cpu vxe wording

Signed-off-by: Aaron Teo <[email protected]>

* ggml: change ggml_aligned_malloc alignment to 256

256 is the cache line size for s390x platforms

Signed-off-by: Aaron Teo <[email protected]>

* ggml: resolve pr merge via cherry-pick 225bbbf

Signed-off-by: Aaron Teo <[email protected]>

* ggml : fix LoongArch compile error with 128-bit SIMD (llama/11701)

* ggml: resolve pr merge via cherry-pick 4571953

Signed-off-by: Aaron Teo <[email protected]>

* ggml: cmake remove fork when determining s390x machine type

thank you @ericcurtin

Signed-off-by: Aaron Teo <[email protected]>

---------

Signed-off-by: Aaron Teo <[email protected]>
Co-authored-by: Jinyang He <[email protected]>
Co-authored-by: junchao-zhao <[email protected]>
* optimize performance by reordering for Intel GPUs

* detect hw type and save opt feature, and print opt feature

* correct name

* support optimizing the graph once when computing the graph; record the opt status in tensor->extra; make CI pass

* add env variable GGML_SYCL_DISABLE_OPT for debug

* use syclex::architecture to replace the custom hw define; update the guide for GGML_SYCL_DISABLE_OPT

* add performance data

* move getrows functions to separate files

* fix global variables

---------

Co-authored-by: arthw <[email protected]>
* opencl: fix small shape gemv, remove unused extensions

* opencl: fix `transpose_16`, `dump_tensor`, enforce subgroup size

* opencl: fix for token length < 4

* opencl: use wave size of 64 for all Adreno GPUs

---------

Co-authored-by: Shawn Gu <[email protected]>
Co-authored-by: Skyler Szot <[email protected]>
metal: use dequantize_q templates

---------

Co-authored-by: Georgi Gerganov <[email protected]>
… backend (ggml/1121)

* Support float16-to-float16 add/sub/mul/div operations in the CUDA backend

* Add fp16 support for add/sub/mul/div on the CPU backend

* Add test cases for fp16 add/sub/mul/div
ggerganov merged commit 17addf7 into master on Feb 27, 2025
40 of 45 checks passed
ggerganov deleted the sync-ggml-25-02-26 branch on February 27, 2025 at 06:55