[libc++] Vectorize mismatch #73255

philnik777 · 2023-11-23T17:14:33Z

---------------------------------------------------
Benchmark                           old         new
---------------------------------------------------
bm_mismatch<char>/1           0.835 ns      2.37 ns
bm_mismatch<char>/2            1.44 ns      2.60 ns
bm_mismatch<char>/3            2.06 ns      2.83 ns
bm_mismatch<char>/4            2.60 ns      3.29 ns
bm_mismatch<char>/5            3.15 ns      3.77 ns
bm_mismatch<char>/6            3.82 ns      4.17 ns
bm_mismatch<char>/7            4.29 ns      4.52 ns
bm_mismatch<char>/8            4.78 ns      4.86 ns
bm_mismatch<char>/16           9.06 ns      7.54 ns
bm_mismatch<char>/64           31.7 ns      19.1 ns
bm_mismatch<char>/512           249 ns      8.16 ns
bm_mismatch<char>/4096         1956 ns      44.2 ns
bm_mismatch<char>/32768       15498 ns       501 ns
bm_mismatch<char>/262144     123965 ns      4479 ns
bm_mismatch<char>/1048576    495668 ns     21306 ns
bm_mismatch<short>/1          0.710 ns      2.12 ns
bm_mismatch<short>/2           1.03 ns      2.66 ns
bm_mismatch<short>/3           1.29 ns      3.56 ns
bm_mismatch<short>/4           1.68 ns      4.29 ns
bm_mismatch<short>/5           1.96 ns      5.18 ns
bm_mismatch<short>/6           2.59 ns      5.91 ns
bm_mismatch<short>/7           2.86 ns      6.63 ns
bm_mismatch<short>/8           3.19 ns      7.33 ns
bm_mismatch<short>/16          5.48 ns      13.0 ns
bm_mismatch<short>/64          16.6 ns      4.06 ns
bm_mismatch<short>/512          130 ns      13.8 ns
bm_mismatch<short>/4096         985 ns      93.8 ns
bm_mismatch<short>/32768       7846 ns      1002 ns
bm_mismatch<short>/262144     63217 ns     10637 ns
bm_mismatch<short>/1048576   251782 ns     42471 ns
bm_mismatch<int>/1            0.716 ns      1.91 ns
bm_mismatch<int>/2             1.21 ns      2.49 ns
bm_mismatch<int>/3             1.38 ns      3.46 ns
bm_mismatch<int>/4             1.71 ns      4.04 ns
bm_mismatch<int>/5             2.00 ns      4.98 ns
bm_mismatch<int>/6             2.43 ns      5.67 ns
bm_mismatch<int>/7             3.05 ns      6.38 ns
bm_mismatch<int>/8             3.22 ns      7.09 ns
bm_mismatch<int>/16            5.18 ns      12.8 ns
bm_mismatch<int>/64            16.6 ns      5.28 ns
bm_mismatch<int>/512            129 ns      25.2 ns
bm_mismatch<int>/4096          1009 ns       201 ns
bm_mismatch<int>/32768         7776 ns      2144 ns
bm_mismatch<int>/262144       62371 ns     20551 ns
bm_mismatch<int>/1048576     254750 ns     90097 ns

github-actions · 2023-11-23T17:17:04Z

✅ With the latest revision this PR passed the C/C++ code formatter.

libcxx/include/__algorithm/vectorization.h

philnik777 · 2024-02-24T14:18:14Z

This uses extensions, but I don't expect this to land that soon anyways. Worst case I'll have to wait a few weeks until I can land it.

CC @DenisYaroshevskiy in case you are interested in helping review this.

llvmbot · 2024-02-24T14:18:46Z

@llvm/pr-subscribers-libcxx

Author: Nikolas Klauser (philnik777)

Changes

--------------------------------------------------
Benchmark                          old         new
--------------------------------------------------
bm_mismatch<char>/1            10.6 ns     10.5 ns
bm_mismatch<char>/2            18.6 ns     18.6 ns
bm_mismatch<char>/3            21.4 ns     21.7 ns
bm_mismatch<char>/4            22.7 ns     23.0 ns
bm_mismatch<char>/5            23.8 ns     23.9 ns
bm_mismatch<char>/6            24.2 ns     24.5 ns
bm_mismatch<char>/7            24.4 ns     24.9 ns
bm_mismatch<char>/8            24.8 ns     25.1 ns
bm_mismatch<char>/16           26.1 ns     26.6 ns
bm_mismatch<char>/64           31.1 ns     31.3 ns
bm_mismatch<char>/512          82.1 ns     37.6 ns
bm_mismatch<char>/4096          503 ns     70.1 ns
bm_mismatch<char>/32768        3920 ns      386 ns
bm_mismatch<char>/262144      31386 ns     2988 ns
bm_mismatch<char>/1048576    123315 ns    12640 ns
bm_mismatch<short>/1           10.6 ns     10.6 ns
bm_mismatch<short>/2           19.0 ns     18.7 ns
bm_mismatch<short>/3           22.1 ns     21.7 ns
bm_mismatch<short>/4           23.5 ns     23.0 ns
bm_mismatch<short>/5           24.4 ns     23.8 ns
bm_mismatch<short>/6           25.2 ns     24.5 ns
bm_mismatch<short>/7           25.6 ns     25.0 ns
bm_mismatch<short>/8           25.8 ns     25.0 ns
bm_mismatch<short>/16          26.9 ns     26.2 ns
bm_mismatch<short>/64          32.3 ns     34.4 ns
bm_mismatch<short>/512         83.1 ns     43.1 ns
bm_mismatch<short>/4096         511 ns      138 ns
bm_mismatch<short>/32768       3966 ns      940 ns
bm_mismatch<short>/262144     31230 ns     7724 ns
bm_mismatch<short>/1048576   124324 ns    30343 ns
bm_mismatch<int>/1             10.6 ns     10.6 ns
bm_mismatch<int>/2             19.3 ns     19.0 ns
bm_mismatch<int>/3             21.9 ns     21.6 ns
bm_mismatch<int>/4             23.2 ns     22.9 ns
bm_mismatch<int>/5             23.9 ns     24.0 ns
bm_mismatch<int>/6             24.2 ns     24.6 ns
bm_mismatch<int>/7             24.6 ns     25.0 ns
bm_mismatch<int>/8             24.9 ns     25.3 ns
bm_mismatch<int>/16            26.1 ns     31.9 ns
bm_mismatch<int>/64            33.7 ns     36.6 ns
bm_mismatch<int>/512            135 ns     53.0 ns
bm_mismatch<int>/4096           978 ns      199 ns
bm_mismatch<int>/32768         7768 ns     1471 ns
bm_mismatch<int>/262144       62202 ns    12351 ns
bm_mismatch<int>/1048576     244952 ns    50080 ns

Patch is 26.41 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/73255.diff

11 Files Affected:

(modified) libcxx/benchmarks/CMakeLists.txt (+1)
(added) libcxx/benchmarks/algorithms/mismatch.bench.cpp (+31)
(modified) libcxx/include/CMakeLists.txt (+1)
(modified) libcxx/include/__algorithm/mismatch.h (+73-5)
(added) libcxx/include/__algorithm/simd_utils.h (+114)
(modified) libcxx/include/__bit/bit_cast.h (+9)
(modified) libcxx/include/__bit/countr.h (+9-4)
(modified) libcxx/include/libcxx.imp (+1)
(modified) libcxx/include/module.modulemap.in (+5-1)
(modified) libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp (+100-61)
(removed) libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch_pred.pass.cpp (-119)

diff --git a/libcxx/benchmarks/CMakeLists.txt b/libcxx/benchmarks/CMakeLists.txt
index 2434d82c6fd6ba..0337d4fdb73aa2 100644
--- a/libcxx/benchmarks/CMakeLists.txt
+++ b/libcxx/benchmarks/CMakeLists.txt
@@ -182,6 +182,7 @@ set(BENCHMARK_TESTS
     algorithms/make_heap_then_sort_heap.bench.cpp
     algorithms/min.bench.cpp
     algorithms/min_max_element.bench.cpp
+    algorithms/mismatch.bench.cpp
     algorithms/pop_heap.bench.cpp
     algorithms/pstl.stable_sort.bench.cpp
     algorithms/push_heap.bench.cpp
diff --git a/libcxx/benchmarks/algorithms/mismatch.bench.cpp b/libcxx/benchmarks/algorithms/mismatch.bench.cpp
new file mode 100644
index 00000000000000..98725c8690e640
--- /dev/null
+++ b/libcxx/benchmarks/algorithms/mismatch.bench.cpp
@@ -0,0 +1,31 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include <algorithm>
+#include <benchmark/benchmark.h>
+#include <random>
+
+template <class T>
+static void bm_mismatch(benchmark::State& state) {
+  std::vector<T> vec1(state.range(), '1');
+  std::vector<T> vec2(state.range(), '1');
+  std::mt19937_64 rng(std::random_device{}());
+
+  for (auto _ : state) {
+    auto idx  = rng() % vec1.size();
+    vec1[idx] = '2';
+    benchmark::DoNotOptimize(vec1);
+    benchmark::DoNotOptimize(std::mismatch(vec1.begin(), vec1.end(), vec2.begin()));
+    vec1[idx] = '1';
+  }
+}
+BENCHMARK(bm_mismatch<char>)->DenseRange(1, 8)->Range(16, 1 << 20);
+BENCHMARK(bm_mismatch<short>)->DenseRange(1, 8)->Range(16, 1 << 20);
+BENCHMARK(bm_mismatch<int>)->DenseRange(1, 8)->Range(16, 1 << 20);
+
+BENCHMARK_MAIN();
diff --git a/libcxx/include/CMakeLists.txt b/libcxx/include/CMakeLists.txt
index fa3b464e56c4d0..e04e6c899bd161 100644
--- a/libcxx/include/CMakeLists.txt
+++ b/libcxx/include/CMakeLists.txt
@@ -217,6 +217,7 @@ set(files
   __algorithm/shift_right.h
   __algorithm/shuffle.h
   __algorithm/sift_down.h
+  __algorithm/simd_utils.h
   __algorithm/sort.h
   __algorithm/sort_heap.h
   __algorithm/stable_partition.h
diff --git a/libcxx/include/__algorithm/mismatch.h b/libcxx/include/__algorithm/mismatch.h
index d345b6048a7e9b..9cb7c6f9ff55c7 100644
--- a/libcxx/include/__algorithm/mismatch.h
+++ b/libcxx/include/__algorithm/mismatch.h
@@ -11,23 +11,89 @@
 #define _LIBCPP___ALGORITHM_MISMATCH_H
 
 #include <__algorithm/comp.h>
+#include <__algorithm/simd_utils.h>
+#include <__algorithm/unwrap_iter.h>
 #include <__config>
-#include <__iterator/iterator_traits.h>
+#include <__functional/identity.h>
+#include <__type_traits/invoke.h>
+#include <__type_traits/is_constant_evaluated.h>
+#include <__type_traits/is_equality_comparable.h>
+#include <__type_traits/operation_traits.h>
+#include <__utility/move.h>
 #include <__utility/pair.h>
 
 #if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
 #  pragma GCC system_header
 #endif
 
+_LIBCPP_PUSH_MACROS
+#include <__undef_macros>
+
 _LIBCPP_BEGIN_NAMESPACE_STD
 
+template <class _Iter1, class _Sent1, class _Iter2, class _Pred, class _Proj1, class _Proj2>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_Iter1, _Iter2>
+__mismatch_loop(_Iter1 __first1, _Sent1 __last1, _Iter2 __first2, _Pred& __pred, _Proj1& __proj1, _Proj2& __proj2) {
+  while (__first1 != __last1) {
+    if (!std::__invoke(__pred, std::__invoke(__proj1, *__first1), std::__invoke(__proj2, *__first2)))
+      break;
+    ++__first1;
+    ++__first2;
+  }
+  return std::make_pair(std::move(__first1), std::move(__first2));
+}
+
+template <class _Iter1, class _Sent1, class _Iter2, class _Pred, class _Proj1, class _Proj2>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_Iter1, _Iter2>
+__mismatch(_Iter1 __first1, _Sent1 __last1, _Iter2 __first2, _Pred& __pred, _Proj1& __proj1, _Proj2& __proj2) {
+  return std::__mismatch_loop(__first1, __last1, __first2, __pred, __proj1, __proj2);
+}
+
+#if _LIBCPP_VECTORIZE_ALGORIHTMS
+
+template <class _Tp,
+          class _Pred,
+          class _Proj1,
+          class _Proj2,
+          __enable_if_t<is_integral<_Tp>::value && __desugars_to<__equal_tag, _Pred, _Tp, _Tp>::value &&
+                            __is_identity<_Proj1>::value && __is_identity<_Proj2>::value,
+                        int> = 0>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_Tp*, _Tp*>
+__mismatch(_Tp* __first1, _Tp* __last1, _Tp* __first2, _Pred& __pred, _Proj1& __proj1, _Proj2& __proj2) {
+  constexpr size_t __unroll_count = 4;
+  constexpr size_t __vec_size     = __native_vector_size<_Tp>;
+  using __vec                     = __simd_vector<_Tp, __vec_size>;
+  while (!__libcpp_is_constant_evaluated() && static_cast<size_t>(__last1 - __first1) >= __unroll_count * __vec_size) {
+    __vec __lhs[__unroll_count];
+    __vec __rhs[__unroll_count];
+
+    for (size_t __i = 0; __i != __unroll_count; ++__i) {
+      __lhs[__i] = std::__load_vector<__vec>(__first1 + __i * __vec_size);
+      __rhs[__i] = std::__load_vector<__vec>(__first2 + __i * __vec_size);
+    }
+
+    for (size_t __i = 0; __i != __unroll_count; ++__i) {
+      if (auto __cmp_res = __lhs[__i] == __rhs[__i]; !std::__all_of(__cmp_res)) {
+        auto __offset = __i * __vec_size + std::__find_first_not_set(__cmp_res);
+        return {__first1 + __offset, __first2 + __offset};
+      }
+    }
+
+    __first1 += __unroll_count * __vec_size;
+    __first2 += __unroll_count * __vec_size;
+  }
+  return std::__mismatch_loop(__first1, __last1, __first2, __pred, __proj1, __proj2);
+}
+
+#endif // _LIBCPP_VECTORIZE_ALGORIHTMS
+
 template <class _InputIterator1, class _InputIterator2, class _BinaryPredicate>
 _LIBCPP_NODISCARD_EXT inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_InputIterator1, _InputIterator2>
 mismatch(_InputIterator1 __first1, _InputIterator1 __last1, _InputIterator2 __first2, _BinaryPredicate __pred) {
-  for (; __first1 != __last1; ++__first1, (void)++__first2)
-    if (!__pred(*__first1, *__first2))
-      break;
-  return pair<_InputIterator1, _InputIterator2>(__first1, __first2);
+  __identity __proj;
+  auto __res = std::__mismatch(
+      std::__unwrap_iter(__first1), std::__unwrap_iter(__last1), std::__unwrap_iter(__first2), __pred, __proj, __proj);
+  return std::make_pair(std::__rewrap_iter(__first1, __res.first), std::__rewrap_iter(__first2, __res.second));
 }
 
 template <class _InputIterator1, class _InputIterator2>
@@ -59,4 +125,6 @@ mismatch(_InputIterator1 __first1, _InputIterator1 __last1, _InputIterator2 __fi
 
 _LIBCPP_END_NAMESPACE_STD
 
+_LIBCPP_POP_MACROS
+
 #endif // _LIBCPP___ALGORITHM_MISMATCH_H
diff --git a/libcxx/include/__algorithm/simd_utils.h b/libcxx/include/__algorithm/simd_utils.h
new file mode 100644
index 00000000000000..72f1a775dfe9b9
--- /dev/null
+++ b/libcxx/include/__algorithm/simd_utils.h
@@ -0,0 +1,114 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef _LIBCPP___ALGORITHM_SIMD_UTILS_H
+#define _LIBCPP___ALGORITHM_SIMD_UTILS_H
+
+#include <__bit/bit_cast.h>
+#include <__bit/countr.h>
+#include <__config>
+#include <__type_traits/is_arithmetic.h>
+#include <__type_traits/is_same.h>
+#include <__utility/integer_sequence.h>
+#include <cstddef>
+#include <cstdint>
+
+#if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
+#  pragma GCC system_header
+#endif
+
+#if _LIBCPP_STD_VER >= 14 && __has_attribute(__ext_vector_type__) && __has_builtin(__builtin_reduce_and) &&            \
+    __has_builtin(__builtin_convertvector)
+#  define _LIBCPP_HAS_ALGORITHM_VECTOR_UTILS 1
+#else
+#  define _LIBCPP_HAS_ALGORITHM_VECTOR_UTILS 0
+#endif
+
+#if _LIBCPP_HAS_ALGORITHM_VECTOR_UTILS && !defined(__OPTIMIZE_SIZE__)
+#  define _LIBCPP_VECTORIZE_ALGORIHTMS 1
+#else
+#  define _LIBCPP_VECTORIZE_ALGORIHTMS 0
+#endif
+
+#if _LIBCPP_HAS_ALGORITHM_VECTOR_UTILS
+
+_LIBCPP_BEGIN_NAMESPACE_STD
+
+#  if defined(__AVX512F__)
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 64 / sizeof(_Tp);
+#  elif defined(__AVX__)
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 32 / sizeof(_Tp);
+#  elif defined(__SSE__) || defined(__ARM_NEON__)
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 16 / sizeof(_Tp);
+#  elif defined(__MMX__)
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 8 / sizeof(_Tp);
+#  else
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 1;
+#  endif
+
+template <class _Tp, size_t _Np>
+using __simd_vector __attribute__((__ext_vector_type__(_Np))) = _Tp;
+
+template <class _VecT>
+inline constexpr size_t __simd_vector_size_v = []() -> size_t { static_assert(false, "Not a vector!"); }();
+
+template <class _Tp, size_t _Np>
+inline constexpr size_t __simd_vector_size_v<__simd_vector<_Tp, _Np>> = _Np;
+
+template <class _VecT>
+using __simd_vector_underlying_type_t = decltype([]<class _Tp, size_t _Np>() { return _Tp{}; });
+
+template <class _VecT, class _UnderlyingType = __simd_vector_underlying_type_t<_VecT>>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _VecT __load_vector(const _UnderlyingType* __ptr) noexcept {
+  return []<size_t... _Indices>(const _UnderlyingType* __lptr, index_sequence<_Indices...>) static noexcept {
+    return _VecT{__lptr[_Indices]...};
+  }(__ptr, make_index_sequence<__simd_vector_size_v<_VecT>>{});
+}
+
+template <class _Tp, size_t _Np>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI bool __all_of(__simd_vector<_Tp, _Np> __vec) noexcept {
+  return __builtin_reduce_and(__builtin_convertvector(__vec, __simd_vector<bool, _Np>));
+}
+
+template <class _Tp, size_t _Np>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI size_t __find_first_set(__simd_vector<_Tp, _Np> __vec) noexcept {
+  using __mask_vec = __simd_vector<bool, _Np>;
+
+  auto __impl = [&]<class _MaskT>(_MaskT) noexcept {
+    return std::__countr_zero(std::__bit_cast<_MaskT>(__builtin_convertvector(__vec, __mask_vec)));
+  };
+
+  if constexpr (sizeof(__mask_vec) == sizeof(uint8_t)) {
+    return __impl(uint8_t{});
+  } else if constexpr (sizeof(__mask_vec) == sizeof(uint16_t)) {
+    return __impl(uint16_t{});
+  } else if constexpr (sizeof(__mask_vec) == sizeof(uint32_t)) {
+    return __impl(uint32_t{});
+  } else if constexpr (sizeof(__mask_vec) == sizeof(uint64_t)) {
+    return __impl(uint64_t{});
+  } else {
+    static_assert(sizeof(__mask_vec) == 0, "unexpected required size for mask integer type");
+  }
+}
+
+template <class _Tp, size_t _Np>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI size_t __find_first_not_set(__simd_vector<_Tp, _Np> __vec) noexcept {
+  return std::__find_first_set(~__vec);
+}
+
+_LIBCPP_END_NAMESPACE_STD
+
+#endif // _LIBCPP_STD_VER >= 14 && __has_attribute(__ext_vector_type__) && __has_builtin(__builtin_reduce_and) &&
+       // __has_builtin(__builtin_convertvector)
+
+#endif // _LIBCPP___ALGORITHM_SIMD_UTILS_H
diff --git a/libcxx/include/__bit/bit_cast.h b/libcxx/include/__bit/bit_cast.h
index f20b39ae748b10..6298810f373303 100644
--- a/libcxx/include/__bit/bit_cast.h
+++ b/libcxx/include/__bit/bit_cast.h
@@ -19,6 +19,15 @@
 
 _LIBCPP_BEGIN_NAMESPACE_STD
 
+#ifndef _LIBCPP_CXX03_LANG
+
+template <class _ToType, class _FromType>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI constexpr _ToType __bit_cast(const _FromType& __from) noexcept {
+  return __builtin_bit_cast(_ToType, __from);
+}
+
+#endif // _LIBCPP_CXX03_LANG
+
 #if _LIBCPP_STD_VER >= 20
 
 template <class _ToType, class _FromType>
diff --git a/libcxx/include/__bit/countr.h b/libcxx/include/__bit/countr.h
index 0cc679f87a99d9..b6b3ac52ca4e47 100644
--- a/libcxx/include/__bit/countr.h
+++ b/libcxx/include/__bit/countr.h
@@ -35,10 +35,8 @@ _LIBCPP_NODISCARD inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR int __libcpp_ct
   return __builtin_ctzll(__x);
 }
 
-#if _LIBCPP_STD_VER >= 20
-
-template <__libcpp_unsigned_integer _Tp>
-_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_zero(_Tp __t) noexcept {
+template <class _Tp>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 int __countr_zero(_Tp __t) _NOEXCEPT {
   if (__t == 0)
     return numeric_limits<_Tp>::digits;
 
@@ -59,6 +57,13 @@ _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_zero(_Tp __t) n
   }
 }
 
+#if _LIBCPP_STD_VER >= 20
+
+template <__libcpp_unsigned_integer _Tp>
+_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_zero(_Tp __t) noexcept {
+  return std::__countr_zero(__t);
+}
+
 template <__libcpp_unsigned_integer _Tp>
 _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_one(_Tp __t) noexcept {
   return __t != numeric_limits<_Tp>::max() ? std::countr_zero(static_cast<_Tp>(~__t)) : numeric_limits<_Tp>::digits;
diff --git a/libcxx/include/libcxx.imp b/libcxx/include/libcxx.imp
index fbe09fa5e54ab1..8b2f78b713976a 100644
--- a/libcxx/include/libcxx.imp
+++ b/libcxx/include/libcxx.imp
@@ -217,6 +217,7 @@
   { include: [ "<__algorithm/shift_right.h>", "private", "<algorithm>", "public" ] },
   { include: [ "<__algorithm/shuffle.h>", "private", "<algorithm>", "public" ] },
   { include: [ "<__algorithm/sift_down.h>", "private", "<algorithm>", "public" ] },
+  { include: [ "<__algorithm/simd_utils.h>", "private", "<algorithm>", "public" ] },
   { include: [ "<__algorithm/sort.h>", "private", "<algorithm>", "public" ] },
   { include: [ "<__algorithm/sort_heap.h>", "private", "<algorithm>", "public" ] },
   { include: [ "<__algorithm/stable_partition.h>", "private", "<algorithm>", "public" ] },
diff --git a/libcxx/include/module.modulemap.in b/libcxx/include/module.modulemap.in
index e72136a58c0b1b..79245c87b15847 100644
--- a/libcxx/include/module.modulemap.in
+++ b/libcxx/include/module.modulemap.in
@@ -697,7 +697,10 @@ module std_private_algorithm_minmax                                      [system
   export *
 }
 module std_private_algorithm_minmax_element                              [system] { header "__algorithm/minmax_element.h" }
-module std_private_algorithm_mismatch                                    [system] { header "__algorithm/mismatch.h" }
+module std_private_algorithm_mismatch                                    [system] {
+  header "__algorithm/mismatch.h"
+  export std_private_algorithm_simd_utils
+}
 module std_private_algorithm_move                                        [system] { header "__algorithm/move.h" }
 module std_private_algorithm_move_backward                               [system] { header "__algorithm/move_backward.h" }
 module std_private_algorithm_next_permutation                            [system] { header "__algorithm/next_permutation.h" }
@@ -1048,6 +1051,7 @@ module std_private_algorithm_sort                                        [system
   header "__algorithm/sort.h"
   export std_private_debug_utils_strict_weak_ordering_check
 }
+module std_private_algorithm_simd_utils                                  [system] { header "__algorithm/simd_utils.h" }
 module std_private_algorithm_sort_heap                                   [system] { header "__algorithm/sort_heap.h" }
 module std_private_algorithm_stable_partition                            [system] { header "__algorithm/stable_partition.h" }
 module std_private_algorithm_stable_sort                                 [system] { header "__algorithm/stable_sort.h" }
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp
index cc588c095ccfb2..faaef5377863b8 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp
@@ -16,78 +16,117 @@
 // template<InputIterator Iter1, InputIterator Iter2Pred>
 //   constexpr pair<Iter1, Iter2>   // constexpr after c++17
 //   mismatch(Iter1 first1, Iter1 last1, Iter2 first2, Iter2 last2); // C++14
+//
+// template<InputIterator Iter1, InputIterator Iter2,
+//          Predicate<auto, Iter1::value_type, Iter2::value_type> Pred>
+//   requires CopyConstructible<Pred>
+//   constexpr pair<Iter1, Iter2>   // constexpr after c++17
+//   mismatch(Iter1 first1, Iter1 last1, Iter2 first2, Pred pred);
+//
+// template<InputIterator Iter1, InputIterator Iter2, Predicate Pred>
+//   constexpr pair<Iter1, Iter2>   // constexpr after c++17
+//   mismatch(Iter1 first1, Iter1 last1, Iter2 first2, Iter2 last2, Pred pred); // C++14
+
+// ADDITIONAL_COMPILE_FLAGS(has-fconstexpr-steps): -fconstexpr-steps=20000000
+// ADDITIONAL_COMPILE_FLAGS(has-fconstexpr-ops-limit): -fconstexpr-ops-limit=100000000
 
 #include <algorithm>
+#include <array>
 #include <cassert>
+#include <vector>
 
 #include "test_macros.h"
 #include "test_iterators.h"
-
-#if TEST_STD_VER > 17
-TEST_CONSTEXPR bool test_constexpr() {
-    int ia[] = {1, 3, 6, 7};
-    int ib[] = {1, 3};
-    int ic[] = {1, 3, 5, 7};
-    typedef cpp17_input_iterator<int*>         II;
-    typedef bidirectional_iterator<int*> BI;
-
-    auto p1 = std::mismatch(std::begin(ia), std::end(ia), std::begin(ic));
-    if (p1.first != ia+2 || p1.second != ic+2)
-        return false;
-
-    auto p2 = std::mismatch(std::begin(ia), std::end(ia), std::begin(ic), std::end(ic));
-    if (p2.first != ia+2 || p2.second != ic+2)
-        return false;
-
-    auto p3 = std::mismatch(std::begin(ib), std::end(ib), std::begin(ic));
-    if (p3.first != ib+2 || p3.second != ic+2)
-        return false;
-
-    auto p4 = std::mismatch(std::begin(ib), std::end(ib), std::begin(ic), std::end(ic));
-    if (p4.first != ib+2 || p4.second != ic+2)
-        return false;
-
-    auto p5 = std::mismatch(II(std::begin(ib)), II(std::end(ib)), II(std::begin(ic)));
-    if (p5.first != II(ib+2) || p5.second != II(ic+2))
-        return false;
-    auto p6 = std::mismatch(BI(std::begin(ib)), BI(std::end(ib)), BI(std::begin(ic)), BI(std::end(ic)));
-    if (p6.first != BI(ib+2) || p6.second != BI(ic+2))
-        return false;
-
-    return true;
-    }
+#include "type_algorithms.h"
+
+template <class Iter, class Container1, class Container2>
+TEST_CONSTEXPR_CXX20 void check(Container1 lhs, Container2 rhs, size_t offset) {
+  if (lhs.size() == rhs.size()) {
+    assert(std::mismatch(Iter(lhs.data()), Iter(lhs.data() + lhs.size()), Iter(rhs.data())) ==
+           std::make_pair(Iter(lhs.data() + offset), Iter(rhs.data() + offset)));
+
+    assert(std::mismatch(Iter(lhs.data()), Iter(lhs.data() + lhs.size()), Iter(rhs.data()), std::equal_to<int>()) ==
+           std::make_pair(Iter(lhs.data() + offset), Iter(rhs.data() + offset)));
+  }
+
+#if TEST_STD_VER >= 14
+  assert(
+      std::mismatch(Iter(lhs.data()), Iter(lhs.data() + lhs.size()), Iter(rhs.data()), Iter(rhs.data() + rhs.size())) ==
+      std::make_pair(Iter(lhs.data() + offset), Iter(rhs.data() + offset)));
+
+  assert(std::mismatch(Iter(lhs.data()),
+                       Iter(lhs.data() + lhs.size()),
+                       Iter(rhs.data()),
+                       Iter(rhs.data() + rhs.size()),
+                       std::equal_to<int>()) == std::make_pair(Iter(lhs.data() + offset), Iter(rhs.data() + offset)));
 #endif
+}
 
-int main(int, char**)
-{
-    int ia[] = {0, 1, 2, 2, 0, 1, 2, 3};
-    const unsigned sa = sizeof(ia)/sizeof(ia[0]);
-    int ib[] = {0, 1, 2, 3, 0, 1, 2, 3};
-    const unsigned sb = sizeof(ib)/sizeof(ib[0]); ((void)sb); // unused in C++11
-
-    typedef cpp17_input_iterator<const int*> II;
-    typedef random_access_iterator<const int*>  RAI;
-
-    assert(std::mismatch(II(ia), II(ia + sa), II(ib))
-            == (std::pair<II, II>(II(ia+3), II(ib+3))));
-
-    assert(std::mismatch(RAI(ia), RAI(ia + sa), RAI(ib))
-            == (std::pair<RAI, RAI>(RAI(ia+3), RAI(ib+3))));
-
-#if TEST_STD_VER > 11 // We have the four iteration version
-    assert(std::mismatch(II(ia), II(ia + sa), II(ib), II(ib+sb))
-            == (std::pair<II, II>(II(ia+3), II(ib+3))));
-
-    assert(std::mismatch(RAI(ia), RAI(ia + s...
[truncated]
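
The `__find_first_not_set` helper in the patch converts a per-lane comparison result into an integer mask and counts trailing zeros on its inverse. A scalar sketch of that idea follows. It assumes the GCC/Clang builtin `__builtin_ctzll`; `first_mismatch_lane` and `kLanes` are hypothetical names for illustration and are not part of libc++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical lane count for illustration; the patch derives it from the target ISA.
constexpr std::size_t kLanes = 16;

// Returns the index of the first differing lane, or kLanes if all lanes match.
std::size_t first_mismatch_lane(const unsigned char* lhs, const unsigned char* rhs) {
  // Bit i is set when lane i compares equal (the scalar analogue of the mask vector).
  std::uint64_t equal_mask = 0;
  for (std::size_t i = 0; i != kLanes; ++i)
    equal_mask |= static_cast<std::uint64_t>(lhs[i] == rhs[i]) << i;

  // Invert so that set bits mark mismatching lanes, keeping only the low kLanes bits.
  std::uint64_t not_equal = ~equal_mask & ((std::uint64_t{1} << kLanes) - 1);
  if (not_equal == 0)
    return kLanes; // __builtin_ctzll is undefined for 0, so handle it explicitly

  // The lowest set bit is the first mismatching lane.
  return static_cast<std::size_t>(__builtin_ctzll(not_equal));
}
```

Within the unrolled loop, a lane index like this has to be added to the number of elements covered by the preceding vectors (vector index times lanes per vector) to obtain the offset into the range.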

DenisYaroshevskiy

Thanks for tagging me, commendable effort!

DenisYaroshevskiy · 2024-02-24T16:04:22Z

libcxx/benchmarks/algorithms/mismatch.bench.cpp

+#include <random>
+
+template <class T>
+static void bm_mismatch(benchmark::State& state) {


I don't know how much you care about it, but you can try varying the data alignment.

vec1.data() and vec2.data() will be aligned to 16 bytes, which can lead to the loads being aligned.

The difference can be quite large.

There are some things you can do about that; I don't know if they are worth it for two-range algorithms.

At the very least, maybe align your .data() pointers to 64 bytes:

allocate both vectors with +64 bytes.

align the pointers forward - get a span.

I've added a TODO for now. Though even if it makes a difference I'm not sure we can do much about it, can we?

You can align one of the arrays. Also, this makes for a better benchmark.
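
The over-allocate-and-align idea from this thread could look roughly like the sketch below. It uses `std::align` from `<memory>`; the helper name `aligned_begin` and the 64-byte default are assumptions made for illustration and do not come from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical helper: returns a pointer inside `storage` that is aligned to
// `alignment` bytes and has room for `count` elements, or nullptr if the
// buffer is too small. The caller is expected to over-allocate `storage`.
template <class T>
T* aligned_begin(std::vector<T>& storage, std::size_t count, std::size_t alignment = 64) {
  void* ptr         = storage.data();
  std::size_t space = storage.size() * sizeof(T);
  // std::align bumps `ptr` forward to the next aligned address if it fits.
  void* aligned = std::align(alignment, count * sizeof(T), ptr, space);
  return static_cast<T*>(aligned);
}
```

A benchmark could allocate `count + 64 / sizeof(T)` elements, take `aligned_begin(storage, count)` as the start of the range, and then optionally offset one of the two ranges by a single element to measure the unaligned case on purpose.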

libcxx/benchmarks/algorithms/mismatch.bench.cpp

DenisYaroshevskiy · 2024-02-24T16:15:16Z

libcxx/benchmarks/algorithms/mismatch.bench.cpp

+BENCHMARK(bm_mismatch<short>)->DenseRange(1, 8)->Range(16, 1 << 20);
+BENCHMARK(bm_mismatch<int>)->DenseRange(1, 8)->Range(16, 1 << 20);
+
+BENCHMARK_MAIN();


You want to compare against memcmp. For libstdc++ that is the best one, so you'll see how far away you are.
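
A memcmp reference point for such a comparison might be as small as the sketch below. Note that `std::memcmp` only reports whether (and in which direction) the ranges differ, not where, so it is a rough baseline rather than a drop-in for `std::mismatch`. The name `equal_via_memcmp` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical baseline: byte-wise equality of two ranges of trivially
// copyable elements, delegated to std::memcmp.
template <class T>
bool equal_via_memcmp(const T* lhs, const T* rhs, std::size_t count) {
  static_assert(std::is_trivially_copyable<T>::value, "memcmp needs trivially copyable elements");
  return std::memcmp(lhs, rhs, count * sizeof(T)) == 0;
}
```

Called inside the same timing loop as `std::mismatch`, with the same sizes and the same randomly placed difference, it gives an idea of how far the vectorized loop is from the C library routine.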

DenisYaroshevskiy · 2024-02-24T16:21:46Z

libcxx/include/__algorithm/mismatch.h

+    }
+
+    for (size_t __i = 0; __i != __unroll_count; ++__i) {
+      if (auto __cmp_res = __lhs[__i] == __rhs[__i]; !std::__all_of(__cmp_res)) {


I really don't like the dependency here.

Have a look at whether comparing all of them into bool __res[__unroll_count]; would be helpful,

or combining all of them together and then doing one __all_of.

Try which is best.

Look at std::memcmp:
https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memcmpeq-avx2.S.html#220
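
The "combine everything, then test once" variant suggested here can be illustrated without vector extensions. The sketch below is a portable scalar stand-in, not the patch's code: it XORs 8-byte words, ORs the results of one unrolled block together, and branches once per block. `kUnroll`, `load_word`, and `mismatch_index` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kUnroll = 4; // hypothetical; mirrors __unroll_count in the patch

// Loads 8 bytes through memcpy to avoid aliasing and alignment problems.
inline std::uint64_t load_word(const unsigned char* p) {
  std::uint64_t w;
  std::memcpy(&w, p, sizeof(w));
  return w;
}

// Returns the index of the first mismatching byte, or n if the ranges are equal.
std::size_t mismatch_index(const unsigned char* a, const unsigned char* b, std::size_t n) {
  constexpr std::size_t block = kUnroll * sizeof(std::uint64_t);
  std::size_t i = 0;
  while (n - i >= block) {
    // Accumulate the differences of the whole block; no branch per word.
    std::uint64_t any_diff = 0;
    for (std::size_t k = 0; k != kUnroll; ++k)
      any_diff |= load_word(a + i + k * 8) ^ load_word(b + i + k * 8);
    if (any_diff != 0)
      break; // one test per block; the scalar loop below pinpoints the byte
    i += block;
  }
  // Tail and mismatch localization.
  while (i != n && a[i] == b[i])
    ++i;
  return i;
}
```

Whether this beats testing each comparison separately depends on the target and on how early mismatches occur, which is why the comment recommends measuring the variants.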

I've tried a few versions, and I couldn't get anything significant. The benchmark is already very close to memcmp for large arrays, so I'm not sure it makes a ton of sense to try to squeeze out the last few bits right now. I'd much rather look into improving it for small arrays for now.

Sure. You might want to keep memcmp in the benchmark and not delete it.

Also, note that you use powers of 2 everywhere. That will only measure the unrolled main loop. Tails can be significant.

I've made a patch to improve the tail: #83440. I'm not sure about keeping the memcmp benchmark, since our benchmarks are to show performance improvements, and not to compare to kind-of similar algorithms.

libcxx/include/__algorithm/mismatch.h

libcxx/include/__algorithm/simd_utils.h

DenisYaroshevskiy · 2024-02-24T17:12:21Z

At the moment it's not trivial to run benchmarks for this against what I have. Is there something standalone?

philnik777

Thanks for taking a look!

libcxx/include/__algorithm/simd_utils.h

libcxx/include/__algorithm/mismatch.h

philnik777 · 2024-02-25T17:32:52Z

libcxx/include/__algorithm/mismatch.h

+    }
+
+    for (size_t __i = 0; __i != __unroll_count; ++__i) {
+      if (auto __cmp_res = __lhs[__i] == __rhs[__i]; !std::__all_of(__cmp_res)) {


I've tried a few versions, and I couldn't get anything significant. The benchmark is already very close to memcmp for large arrays, so I'm not sure it makes a ton of sense to try to squeeze out the last few bits right now. I'd much rather look into improving it for small arrays for now.

libcxx/include/__algorithm/mismatch.h

libcxx/include/__algorithm/simd_utils.h

philnik777 · 2024-02-26T10:38:10Z

At the moment it's not trivial to run benchmarks for this against what I have. Is there something standalone?

Could you elaborate a bit? I'm not sure what you're trying to do.

DenisYaroshevskiy

Could you elaborate a bit? I'm not sure what you're trying to do.

I was thinking of comparing this implementation using the benchmarks I have. But when it's part of a libc++ fork, it's a bit too much work.

libcxx/include/__algorithm/mismatch.h

DenisYaroshevskiy · 2024-02-26T11:43:59Z

libcxx/include/__algorithm/mismatch.h

+    }
+
+    for (size_t __i = 0; __i != __unroll_count; ++__i) {
+      if (auto __cmp_res = __lhs[__i] == __rhs[__i]; !std::__all_of(__cmp_res)) {


Sure. You might want to keep memcmp in the benchmark and not delete it.

Also, note that you use powers of 2 everywhere. That will only measure the unrolled main loop. Tails can be significant.
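
To see why power-of-two sizes hide the tail, it helps to split a size into the part handled by the unrolled vector loop and the remainder left to the scalar loop. The sketch below assumes 32 lanes per vector (what `__native_vector_size<char>` would be with AVX in the patch) and an unroll count of 4; `Split` and `split_range` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

struct Split {
  std::size_t vectorized; // elements consumed by the unrolled vector loop
  std::size_t tail;       // elements left for the scalar loop
};

// The vector loop only runs while at least vec_size * unroll_count elements remain.
constexpr Split split_range(std::size_t n, std::size_t vec_size, std::size_t unroll_count) {
  return Split{n / (vec_size * unroll_count) * (vec_size * unroll_count), n % (vec_size * unroll_count)};
}
```

Under these assumptions, 4096 elements leave no tail at all, while 4095 elements leave 127 for the scalar loop, so adding sizes that are not multiples of the block size exercises a code path the current benchmark never reaches.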

libcxx/include/__algorithm/simd_utils.h

philnik777 · 2024-02-27T17:30:54Z

Could you elaborate a bit? I'm not sure what you're trying to do.

I was thinking of comparing this implementation using the benchmarks I have. But when it's part of a libc++ fork, it's a bit too much work.

With a bit of playing preprocessor (and not disabling the old operation_traits.h) you should be able to get things working with a 17 release: https://godbolt.org/z/j16P5Kxeb. Just put it before any other includes and everything should work (I think).

ldionne

Not a full review, I had to leave halfway through. I haven't looked at the core of the implementation in detail yet.

libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp

libcxx/include/__algorithm/mismatch.h

DenisYaroshevskiy

As far as the SIMD algorithm is concerned, this LGTM.

There are some potential improvements that we discussed, and I think the benchmark may have some issues, but it's OK.

Unfortunately, I didn't have the energy to test it against my code at the moment.

P.S. Good work.

libcxx/include/__algorithm/simd_utils.h

ldionne

This LGTM, I do have a few comments but I think this is a really good patch. I would strongly encourage you to consider the interaction of these optimizations and the PSTL with unseq. I think we should use the same underlying machinery for both.

ldionne · 2024-03-14T15:39:58Z

libcxx/include/__algorithm/mismatch.h

+  constexpr size_t __vec_size     = __native_vector_size<_Tp>;
+  using __vec                     = __simd_vector<_Tp, __vec_size>;
+  if (!__libcpp_is_constant_evaluated()) {
+    while (static_cast<size_t>(__last1 - __first1) >= __unroll_count * __vec_size) [[__unlikely__]] {


@fhahn @jroelofs According to @philnik777, if we manually unroll the loop with a constant number of iterations, Clang isn't "smart" enough to vectorize the code. So we end up having to use explicit constructs like __load_vector and friends to make it happen. Just a heads up in case you have thoughts since we've been talking about vectorization in the context of libc++.
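To make "explicit constructs" concrete, here is a minimal sketch of the general idea: compare a block of lanes at once, then scan the resulting mask for the first differing lane. It assumes a compiler that supports the GCC/Clang vector_size extension. The names byte_vec, load_vec and sketch_mismatch are made up for illustration; this is not the patch's __load_vector / __simd_vector machinery, and unlike the patch it neither unrolls nor uses a reduction such as std::__all_of.

```cpp
#include <cstddef>
#include <cstring>
#include <utility>

// Assumption: GCC/Clang vector extension, 16 lanes of unsigned char.
typedef unsigned char byte_vec __attribute__((vector_size(16)));

// Load 16 bytes from a possibly unaligned address.
inline byte_vec load_vec(const unsigned char* p) {
  byte_vec v;
  std::memcpy(&v, p, sizeof(v));
  return v;
}

std::pair<const unsigned char*, const unsigned char*>
sketch_mismatch(const unsigned char* first1, const unsigned char* last1,
                const unsigned char* first2) {
  constexpr std::size_t vec_size = 16;
  // Main loop: one vector comparison per 16 elements.
  while (static_cast<std::size_t>(last1 - first1) >= vec_size) {
    // Lane-wise comparison: each lane is all-ones where equal, 0 where different.
    auto cmp = load_vec(first1) == load_vec(first2);
    for (std::size_t i = 0; i != vec_size; ++i) {
      if (cmp[i] == 0)
        return {first1 + i, first2 + i};
    }
    first1 += vec_size;
    first2 += vec_size;
  }
  // Scalar tail for the remaining (fewer than 16) elements.
  while (first1 != last1 && *first1 == *first2) {
    ++first1;
    ++first2;
  }
  return {first1, first2};
}
```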

Can you show the version w/o manual unrolling?

Have you tried #pragma clang loop vectorize(enable) and/or #pragma clang loop unroll_count(N) or #pragma clang loop unroll(full)?
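For reference, a sketch of what the pragma-based alternative asked about here might look like on a plain scalar loop. The function name scalar_mismatch_index is hypothetical. Whether Clang actually vectorizes a loop with an early exit like this one depends on the compiler version and target, and the thread does not report a result for it; compilers other than Clang will typically ignore the pragma, possibly with a warning.

```cpp
#include <cstddef>

// Hypothetical scalar baseline: returns the index of the first differing
// element, or n if the first n elements of both ranges are equal.
std::size_t scalar_mismatch_index(const int* a, const int* b, std::size_t n) {
  std::size_t i = 0;
  // Loop hint: asks Clang to vectorize the loop that follows. It is a request,
  // not a guarantee.
#pragma clang loop vectorize(enable)
  for (; i != n; ++i) {
    if (a[i] != b[i])
      break;
  }
  return i;
}
```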

ldionne · 2024-03-14T15:43:45Z

libcxx/include/__algorithm/mismatch.h

+                            __is_identity<_Proj1>::value && __is_identity<_Proj2>::value,
+                        int> = 0>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_Tp*, _Tp*>
+__mismatch(_Tp* __first1, _Tp* __last1, _Tp* __first2, _Pred& __pred, _Proj1& __proj1, _Proj2& __proj2) {


One thing we could do here is implement this function as std::mismatch(std::execution::unseq, ...), and then we simply have this mismatch call into the PSTL version with unseq if it knows that to be safe. This would provide a nice way to frame these optimizations, since that would also be applicable to other algorithms.

libcxx/include/__algorithm/simd_utils.h

DenisYaroshevskiy · 2024-03-14T17:31:23Z

I think we should use the same underlying machinery for both.

How'd you go about that? You need some compiler extensions to do predicates

ldionne · 2024-03-14T19:39:38Z

I think we should use the same underlying machinery for both.

How'd you go about that? You need some compiler extensions to do predicates

At the very least, std::mismatch(std::execution::unseq, ...) should be able to use the same optimizations as the serial version, so there's something to be shared. Not everything may be shareable, but some of it must be.

DenisYaroshevskiy · 2024-03-14T21:27:41Z

I think we should use the same underlying machinery for both.

How'd you go about that? You need some compiler extensions to do predicates

At the very least, std::mismatch(std::execution::unseq, ...) should be able to use the same optimizations as the serial version, so there's something to be shared. Not everything may be shareable, but some of it must be.

Is it legal for unseq to just call the serial version? I don't know of anything unseq could have done better here.

github-actions · 2024-03-23T12:51:55Z

✅ With the latest revision this PR passed the Python code formatter.

wsehjk · 2025-04-13T09:05:50Z

libcxx/include/__algorithm/mismatch.h

+  constexpr size_t __vec_size     = __native_vector_size<_Tp>;
+  using __vec                     = __simd_vector<_Tp, __vec_size>;
+  if (!__libcpp_is_constant_evaluated()) {
+    while (static_cast<size_t>(__last1 - __first1) >= __unroll_count * __vec_size) [[__unlikely__]] {


Hi, why is this while condition marked with __unlikely__? I think it should be __likely__. @philnik777

Why do you think the loop is likely?

ldionne reviewed Nov 23, 2023

philnik777 force-pushed the optimize_mismatch branch from f5a4e3b to 5251bb2 Compare December 23, 2023 13:05

philnik777 force-pushed the optimize_mismatch branch 5 times, most recently from 1c5041e to 87f2ba2 Compare February 22, 2024 15:03

philnik777 marked this pull request as ready for review February 24, 2024 14:18

philnik777 requested a review from a team as a code owner February 24, 2024 14:18

llvmbot added the libc++ label (libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.) Feb 24, 2024

DenisYaroshevskiy reviewed Feb 24, 2024

philnik777 force-pushed the optimize_mismatch branch from 87f2ba2 to 46ab541 Compare February 26, 2024 10:36

philnik777 commented Feb 26, 2024

DenisYaroshevskiy reviewed Feb 26, 2024

philnik777 force-pushed the optimize_mismatch branch 2 times, most recently from 4bdaba3 to 6c6f71a Compare February 29, 2024 12:19

ldionne reviewed Feb 29, 2024

philnik777 mentioned this pull request Mar 3, 2024

[libc++] Optimize the std::mismatch tail #83440

Merged

philnik777 force-pushed the optimize_mismatch branch from 6c6f71a to c45f2b6 Compare March 3, 2024 16:53

DenisYaroshevskiy approved these changes Mar 3, 2024

philnik777 force-pushed the optimize_mismatch branch from c45f2b6 to ecbc13b Compare March 9, 2024 10:35

philnik777 mentioned this pull request Mar 10, 2024

[libc++] Vectorize all the algorithms #84663

Open

26 tasks

philnik777 force-pushed the optimize_mismatch branch 2 times, most recently from 935ecee to 295200e Compare March 11, 2024 12:47

philnik777 force-pushed the optimize_mismatch branch from 295200e to 3323800 Compare March 11, 2024 13:44

ldionne approved these changes Mar 14, 2024

philnik777 force-pushed the optimize_mismatch branch 6 times, most recently from 0f83e8f to 8d6a002 Compare March 17, 2024 19:51

philnik777 mentioned this pull request Mar 18, 2024

[libc++][ranges] Optimize the performance of ranges::starts_with #84570

Open

philnik777 force-pushed the optimize_mismatch branch 2 times, most recently from f5fe730 to 7b44e8d Compare March 23, 2024 12:49

philnik777 force-pushed the optimize_mismatch branch from 7b44e8d to 6509147 Compare March 23, 2024 14:27

[libc++] Vectorize mismatch

ad2ec50

philnik777 force-pushed the optimize_mismatch branch from 6509147 to ad2ec50 Compare March 23, 2024 14:27

philnik777 merged commit b68e2eb into llvm:main Mar 23, 2024

philnik777 deleted the optimize_mismatch branch March 23, 2024 14:28

wsehjk reviewed Apr 13, 2025
