Skip to content

[libc++] Vectorize mismatch #73255

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 23, 2024
Merged

Conversation

philnik777
Copy link
Contributor

@philnik777 philnik777 commented Nov 23, 2023

---------------------------------------------------
Benchmark                           old         new
---------------------------------------------------
bm_mismatch<char>/1           0.835 ns      2.37 ns
bm_mismatch<char>/2            1.44 ns      2.60 ns
bm_mismatch<char>/3            2.06 ns      2.83 ns
bm_mismatch<char>/4            2.60 ns      3.29 ns
bm_mismatch<char>/5            3.15 ns      3.77 ns
bm_mismatch<char>/6            3.82 ns      4.17 ns
bm_mismatch<char>/7            4.29 ns      4.52 ns
bm_mismatch<char>/8            4.78 ns      4.86 ns
bm_mismatch<char>/16           9.06 ns      7.54 ns
bm_mismatch<char>/64           31.7 ns      19.1 ns
bm_mismatch<char>/512           249 ns      8.16 ns
bm_mismatch<char>/4096         1956 ns      44.2 ns
bm_mismatch<char>/32768       15498 ns       501 ns
bm_mismatch<char>/262144     123965 ns      4479 ns
bm_mismatch<char>/1048576    495668 ns     21306 ns
bm_mismatch<short>/1          0.710 ns      2.12 ns
bm_mismatch<short>/2           1.03 ns      2.66 ns
bm_mismatch<short>/3           1.29 ns      3.56 ns
bm_mismatch<short>/4           1.68 ns      4.29 ns
bm_mismatch<short>/5           1.96 ns      5.18 ns
bm_mismatch<short>/6           2.59 ns      5.91 ns
bm_mismatch<short>/7           2.86 ns      6.63 ns
bm_mismatch<short>/8           3.19 ns      7.33 ns
bm_mismatch<short>/16          5.48 ns      13.0 ns
bm_mismatch<short>/64          16.6 ns      4.06 ns
bm_mismatch<short>/512          130 ns      13.8 ns
bm_mismatch<short>/4096         985 ns      93.8 ns
bm_mismatch<short>/32768       7846 ns      1002 ns
bm_mismatch<short>/262144     63217 ns     10637 ns
bm_mismatch<short>/1048576   251782 ns     42471 ns
bm_mismatch<int>/1            0.716 ns      1.91 ns
bm_mismatch<int>/2             1.21 ns      2.49 ns
bm_mismatch<int>/3             1.38 ns      3.46 ns
bm_mismatch<int>/4             1.71 ns      4.04 ns
bm_mismatch<int>/5             2.00 ns      4.98 ns
bm_mismatch<int>/6             2.43 ns      5.67 ns
bm_mismatch<int>/7             3.05 ns      6.38 ns
bm_mismatch<int>/8             3.22 ns      7.09 ns
bm_mismatch<int>/16            5.18 ns      12.8 ns
bm_mismatch<int>/64            16.6 ns      5.28 ns
bm_mismatch<int>/512            129 ns      25.2 ns
bm_mismatch<int>/4096          1009 ns       201 ns
bm_mismatch<int>/32768         7776 ns      2144 ns
bm_mismatch<int>/262144       62371 ns     20551 ns
bm_mismatch<int>/1048576     254750 ns     90097 ns

Copy link

github-actions bot commented Nov 23, 2023

✅ With the latest revision this PR passed the C/C++ code formatter.

@philnik777 philnik777 force-pushed the optimize_mismatch branch 5 times, most recently from 1c5041e to 87f2ba2 Compare February 22, 2024 15:03
@philnik777
Copy link
Contributor Author

This uses extensions, but I don't expect this to land that soon anyways. Worst case I'll have to wait a few weeks until I can land it.

CC @DenisYaroshevskiy in case you are interested in helping review this.

@philnik777 philnik777 marked this pull request as ready for review February 24, 2024 14:18
@philnik777 philnik777 requested a review from a team as a code owner February 24, 2024 14:18
@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Feb 24, 2024
@llvmbot
Copy link
Member

llvmbot commented Feb 24, 2024

@llvm/pr-subscribers-libcxx

Author: Nikolas Klauser (philnik777)

Changes
--------------------------------------------------
Benchmark                          old         new
--------------------------------------------------
bm_mismatch&lt;char&gt;/1            10.6 ns     10.5 ns
bm_mismatch&lt;char&gt;/2            18.6 ns     18.6 ns
bm_mismatch&lt;char&gt;/3            21.4 ns     21.7 ns
bm_mismatch&lt;char&gt;/4            22.7 ns     23.0 ns
bm_mismatch&lt;char&gt;/5            23.8 ns     23.9 ns
bm_mismatch&lt;char&gt;/6            24.2 ns     24.5 ns
bm_mismatch&lt;char&gt;/7            24.4 ns     24.9 ns
bm_mismatch&lt;char&gt;/8            24.8 ns     25.1 ns
bm_mismatch&lt;char&gt;/16           26.1 ns     26.6 ns
bm_mismatch&lt;char&gt;/64           31.1 ns     31.3 ns
bm_mismatch&lt;char&gt;/512          82.1 ns     37.6 ns
bm_mismatch&lt;char&gt;/4096          503 ns     70.1 ns
bm_mismatch&lt;char&gt;/32768        3920 ns      386 ns
bm_mismatch&lt;char&gt;/262144      31386 ns     2988 ns
bm_mismatch&lt;char&gt;/1048576    123315 ns    12640 ns
bm_mismatch&lt;short&gt;/1           10.6 ns     10.6 ns
bm_mismatch&lt;short&gt;/2           19.0 ns     18.7 ns
bm_mismatch&lt;short&gt;/3           22.1 ns     21.7 ns
bm_mismatch&lt;short&gt;/4           23.5 ns     23.0 ns
bm_mismatch&lt;short&gt;/5           24.4 ns     23.8 ns
bm_mismatch&lt;short&gt;/6           25.2 ns     24.5 ns
bm_mismatch&lt;short&gt;/7           25.6 ns     25.0 ns
bm_mismatch&lt;short&gt;/8           25.8 ns     25.0 ns
bm_mismatch&lt;short&gt;/16          26.9 ns     26.2 ns
bm_mismatch&lt;short&gt;/64          32.3 ns     34.4 ns
bm_mismatch&lt;short&gt;/512         83.1 ns     43.1 ns
bm_mismatch&lt;short&gt;/4096         511 ns      138 ns
bm_mismatch&lt;short&gt;/32768       3966 ns      940 ns
bm_mismatch&lt;short&gt;/262144     31230 ns     7724 ns
bm_mismatch&lt;short&gt;/1048576   124324 ns    30343 ns
bm_mismatch&lt;int&gt;/1             10.6 ns     10.6 ns
bm_mismatch&lt;int&gt;/2             19.3 ns     19.0 ns
bm_mismatch&lt;int&gt;/3             21.9 ns     21.6 ns
bm_mismatch&lt;int&gt;/4             23.2 ns     22.9 ns
bm_mismatch&lt;int&gt;/5             23.9 ns     24.0 ns
bm_mismatch&lt;int&gt;/6             24.2 ns     24.6 ns
bm_mismatch&lt;int&gt;/7             24.6 ns     25.0 ns
bm_mismatch&lt;int&gt;/8             24.9 ns     25.3 ns
bm_mismatch&lt;int&gt;/16            26.1 ns     31.9 ns
bm_mismatch&lt;int&gt;/64            33.7 ns     36.6 ns
bm_mismatch&lt;int&gt;/512            135 ns     53.0 ns
bm_mismatch&lt;int&gt;/4096           978 ns      199 ns
bm_mismatch&lt;int&gt;/32768         7768 ns     1471 ns
bm_mismatch&lt;int&gt;/262144       62202 ns    12351 ns
bm_mismatch&lt;int&gt;/1048576     244952 ns    50080 ns

Patch is 26.41 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/73255.diff

11 Files Affected:

  • (modified) libcxx/benchmarks/CMakeLists.txt (+1)
  • (added) libcxx/benchmarks/algorithms/mismatch.bench.cpp (+31)
  • (modified) libcxx/include/CMakeLists.txt (+1)
  • (modified) libcxx/include/__algorithm/mismatch.h (+73-5)
  • (added) libcxx/include/__algorithm/simd_utils.h (+114)
  • (modified) libcxx/include/__bit/bit_cast.h (+9)
  • (modified) libcxx/include/__bit/countr.h (+9-4)
  • (modified) libcxx/include/libcxx.imp (+1)
  • (modified) libcxx/include/module.modulemap.in (+5-1)
  • (modified) libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp (+100-61)
  • (removed) libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch_pred.pass.cpp (-119)
diff --git a/libcxx/benchmarks/CMakeLists.txt b/libcxx/benchmarks/CMakeLists.txt
index 2434d82c6fd6ba..0337d4fdb73aa2 100644
--- a/libcxx/benchmarks/CMakeLists.txt
+++ b/libcxx/benchmarks/CMakeLists.txt
@@ -182,6 +182,7 @@ set(BENCHMARK_TESTS
     algorithms/make_heap_then_sort_heap.bench.cpp
     algorithms/min.bench.cpp
     algorithms/min_max_element.bench.cpp
+    algorithms/mismatch.bench.cpp
     algorithms/pop_heap.bench.cpp
     algorithms/pstl.stable_sort.bench.cpp
     algorithms/push_heap.bench.cpp
diff --git a/libcxx/benchmarks/algorithms/mismatch.bench.cpp b/libcxx/benchmarks/algorithms/mismatch.bench.cpp
new file mode 100644
index 00000000000000..98725c8690e640
--- /dev/null
+++ b/libcxx/benchmarks/algorithms/mismatch.bench.cpp
@@ -0,0 +1,31 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include <algorithm>
+#include <benchmark/benchmark.h>
+#include <random>
+
+template <class T>
+static void bm_mismatch(benchmark::State& state) {
+  std::vector<T> vec1(state.range(), '1');
+  std::vector<T> vec2(state.range(), '1');
+  std::mt19937_64 rng(std::random_device{}());
+
+  for (auto _ : state) {
+    auto idx  = rng() % vec1.size();
+    vec1[idx] = '2';
+    benchmark::DoNotOptimize(vec1);
+    benchmark::DoNotOptimize(std::mismatch(vec1.begin(), vec1.end(), vec2.begin()));
+    vec1[idx] = '1';
+  }
+}
+BENCHMARK(bm_mismatch<char>)->DenseRange(1, 8)->Range(16, 1 << 20);
+BENCHMARK(bm_mismatch<short>)->DenseRange(1, 8)->Range(16, 1 << 20);
+BENCHMARK(bm_mismatch<int>)->DenseRange(1, 8)->Range(16, 1 << 20);
+
+BENCHMARK_MAIN();
diff --git a/libcxx/include/CMakeLists.txt b/libcxx/include/CMakeLists.txt
index fa3b464e56c4d0..e04e6c899bd161 100644
--- a/libcxx/include/CMakeLists.txt
+++ b/libcxx/include/CMakeLists.txt
@@ -217,6 +217,7 @@ set(files
   __algorithm/shift_right.h
   __algorithm/shuffle.h
   __algorithm/sift_down.h
+  __algorithm/simd_utils.h
   __algorithm/sort.h
   __algorithm/sort_heap.h
   __algorithm/stable_partition.h
diff --git a/libcxx/include/__algorithm/mismatch.h b/libcxx/include/__algorithm/mismatch.h
index d345b6048a7e9b..9cb7c6f9ff55c7 100644
--- a/libcxx/include/__algorithm/mismatch.h
+++ b/libcxx/include/__algorithm/mismatch.h
@@ -11,23 +11,89 @@
 #define _LIBCPP___ALGORITHM_MISMATCH_H
 
 #include <__algorithm/comp.h>
+#include <__algorithm/simd_utils.h>
+#include <__algorithm/unwrap_iter.h>
 #include <__config>
-#include <__iterator/iterator_traits.h>
+#include <__functional/identity.h>
+#include <__type_traits/invoke.h>
+#include <__type_traits/is_constant_evaluated.h>
+#include <__type_traits/is_equality_comparable.h>
+#include <__type_traits/operation_traits.h>
+#include <__utility/move.h>
 #include <__utility/pair.h>
 
 #if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
 #  pragma GCC system_header
 #endif
 
+_LIBCPP_PUSH_MACROS
+#include <__undef_macros>
+
 _LIBCPP_BEGIN_NAMESPACE_STD
 
+template <class _Iter1, class _Sent1, class _Iter2, class _Pred, class _Proj1, class _Proj2>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_Iter1, _Iter2>
+__mismatch_loop(_Iter1 __first1, _Sent1 __last1, _Iter2 __first2, _Pred& __pred, _Proj1& __proj1, _Proj2& __proj2) {
+  while (__first1 != __last1) {
+    if (!std::__invoke(__pred, std::__invoke(__proj1, *__first1), std::__invoke(__proj2, *__first2)))
+      break;
+    ++__first1;
+    ++__first2;
+  }
+  return std::make_pair(std::move(__first1), std::move(__first2));
+}
+
+template <class _Iter1, class _Sent1, class _Iter2, class _Pred, class _Proj1, class _Proj2>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_Iter1, _Iter2>
+__mismatch(_Iter1 __first1, _Sent1 __last1, _Iter2 __first2, _Pred& __pred, _Proj1& __proj1, _Proj2& __proj2) {
+  return std::__mismatch_loop(__first1, __last1, __first2, __pred, __proj1, __proj2);
+}
+
+#if _LIBCPP_VECTORIZE_ALGORIHTMS
+
+template <class _Tp,
+          class _Pred,
+          class _Proj1,
+          class _Proj2,
+          __enable_if_t<is_integral<_Tp>::value && __desugars_to<__equal_tag, _Pred, _Tp, _Tp>::value &&
+                            __is_identity<_Proj1>::value && __is_identity<_Proj2>::value,
+                        int> = 0>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_Tp*, _Tp*>
+__mismatch(_Tp* __first1, _Tp* __last1, _Tp* __first2, _Pred& __pred, _Proj1& __proj1, _Proj2& __proj2) {
+  constexpr size_t __unroll_count = 4;
+  constexpr size_t __vec_size     = __native_vector_size<_Tp>;
+  using __vec                     = __simd_vector<_Tp, __vec_size>;
+  while (!__libcpp_is_constant_evaluated() && static_cast<size_t>(__last1 - __first1) >= __unroll_count * __vec_size) {
+    __vec __lhs[__unroll_count];
+    __vec __rhs[__unroll_count];
+
+    for (size_t __i = 0; __i != __unroll_count; ++__i) {
+      __lhs[__i] = std::__load_vector<__vec>(__first1 + __i * __vec_size);
+      __rhs[__i] = std::__load_vector<__vec>(__first2 + __i * __vec_size);
+    }
+
+    for (size_t __i = 0; __i != __unroll_count; ++__i) {
+      if (auto __cmp_res = __lhs[__i] == __rhs[__i]; !std::__all_of(__cmp_res)) {
+        auto __offset = __i * __unroll_count + std::__find_first_not_set(__cmp_res);
+        return {__first1 + __offset, __first2 + __offset};
+      }
+    }
+
+    __first1 += __unroll_count * __vec_size;
+    __first2 += __unroll_count * __vec_size;
+  }
+  return std::__mismatch_loop(__first1, __last1, __first2, __pred, __proj1, __proj2);
+}
+
+#endif // _LIBCPP_VECTORIZE_ALGORIHTMS
+
 template <class _InputIterator1, class _InputIterator2, class _BinaryPredicate>
 _LIBCPP_NODISCARD_EXT inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_InputIterator1, _InputIterator2>
 mismatch(_InputIterator1 __first1, _InputIterator1 __last1, _InputIterator2 __first2, _BinaryPredicate __pred) {
-  for (; __first1 != __last1; ++__first1, (void)++__first2)
-    if (!__pred(*__first1, *__first2))
-      break;
-  return pair<_InputIterator1, _InputIterator2>(__first1, __first2);
+  __identity __proj;
+  auto __res = std::__mismatch(
+      std::__unwrap_iter(__first1), std::__unwrap_iter(__last1), std::__unwrap_iter(__first2), __pred, __proj, __proj);
+  return std::make_pair(std::__rewrap_iter(__first1, __res.first), std::__rewrap_iter(__first2, __res.second));
 }
 
 template <class _InputIterator1, class _InputIterator2>
@@ -59,4 +125,6 @@ mismatch(_InputIterator1 __first1, _InputIterator1 __last1, _InputIterator2 __fi
 
 _LIBCPP_END_NAMESPACE_STD
 
+_LIBCPP_POP_MACROS
+
 #endif // _LIBCPP___ALGORITHM_MISMATCH_H
diff --git a/libcxx/include/__algorithm/simd_utils.h b/libcxx/include/__algorithm/simd_utils.h
new file mode 100644
index 00000000000000..72f1a775dfe9b9
--- /dev/null
+++ b/libcxx/include/__algorithm/simd_utils.h
@@ -0,0 +1,114 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef _LIBCPP___ALGORITHM_SIMD_UTILS_H
+#define _LIBCPP___ALGORITHM_SIMD_UTILS_H
+
+#include <__bit/bit_cast.h>
+#include <__bit/countr.h>
+#include <__config>
+#include <__type_traits/is_arithmetic.h>
+#include <__type_traits/is_same.h>
+#include <__utility/integer_sequence.h>
+#include <cstddef>
+#include <cstdint>
+
+#if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
+#  pragma GCC system_header
+#endif
+
+#if _LIBCPP_STD_VER >= 14 && __has_attribute(__ext_vector_type__) && __has_builtin(__builtin_reduce_and) &&            \
+    __has_builtin(__builtin_convertvector)
+#  define _LIBCPP_HAS_ALGORITHM_VECTOR_UTILS 1
+#else
+#  define _LIBCPP_HAS_ALGORITHM_VECTOR_UTILS 0
+#endif
+
+#if _LIBCPP_HAS_ALGORITHM_VECTOR_UTILS && !defined(__OPTIMIZE_SIZE__)
+#  define _LIBCPP_VECTORIZE_ALGORIHTMS 1
+#else
+#  define _LIBCPP_VECTORIZE_ALGORIHTMS 0
+#endif
+
+#if _LIBCPP_HAS_ALGORITHM_VECTOR_UTILS
+
+_LIBCPP_BEGIN_NAMESPACE_STD
+
+#  if defined(__AVX512F__)
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 64 / sizeof(_Tp);
+#  elif defined(__AVX__)
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 32 / sizeof(_Tp);
+#  elif defined(__SSE__) || defined(__ARM_NEON__)
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 16 / sizeof(_Tp);
+#  elif defined(__MMX__)
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 8 / sizeof(_Tp);
+#  else
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 1;
+#  endif
+
+template <class _Tp, size_t _Np>
+using __simd_vector __attribute__((__ext_vector_type__(_Np))) = _Tp;
+
+template <class _VecT>
+inline constexpr size_t __simd_vector_size_v = []() -> size_t { static_assert(false, "Not a vector!"); }();
+
+template <class _Tp, size_t _Np>
+inline constexpr size_t __simd_vector_size_v<__simd_vector<_Tp, _Np>> = _Np;
+
+template <class _VecT>
+using __simd_vector_underlying_type_t = decltype([]<class _Tp, size_t _Np>() { return _Tp{}; });
+
+template <class _VecT, class _UnderlyingType = __simd_vector_underlying_type_t<_VecT>>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _VecT __load_vector(const _UnderlyingType* __ptr) noexcept {
+  return []<size_t... _Indices>(const _UnderlyingType* __lptr, index_sequence<_Indices...>) static noexcept {
+    return _VecT{__lptr[_Indices]...};
+  }(__ptr, make_index_sequence<__simd_vector_size_v<_VecT>>{});
+}
+
+template <class _Tp, size_t _Np>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI bool __all_of(__simd_vector<_Tp, _Np> __vec) noexcept {
+  return __builtin_reduce_and(__builtin_convertvector(__vec, __simd_vector<bool, _Np>));
+}
+
+template <class _Tp, size_t _Np>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI size_t __find_first_set(__simd_vector<_Tp, _Np> __vec) noexcept {
+  using __mask_vec = __simd_vector<bool, _Np>;
+
+  auto __impl = [&]<class _MaskT>(_MaskT) noexcept {
+    return std::__countr_zero(std::__bit_cast<_MaskT>(__builtin_convertvector(__vec, __mask_vec)));
+  };
+
+  if constexpr (sizeof(__mask_vec) == sizeof(uint8_t)) {
+    return __impl(uint8_t{});
+  } else if constexpr (sizeof(__mask_vec) == sizeof(uint16_t)) {
+    return __impl(uint16_t{});
+  } else if constexpr (sizeof(__mask_vec) == sizeof(uint32_t)) {
+    return __impl(uint32_t{});
+  } else if constexpr (sizeof(__mask_vec) == sizeof(uint64_t)) {
+    return __impl(uint64_t{});
+  } else {
+    static_assert(sizeof(__mask_vec) == 0, "unexpected required size for mask integer type");
+  }
+}
+
+template <class _Tp, size_t _Np>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI size_t __find_first_not_set(__simd_vector<_Tp, _Np> __vec) noexcept {
+  return std::__find_first_set(~__vec);
+}
+
+_LIBCPP_END_NAMESPACE_STD
+
+#endif // _LIBCPP_STD_VER >= 14 && __has_attribute(__ext_vector_type__) && __has_builtin(__builtin_reduce_and) &&
+       // __has_builtin(__builtin_convertvector)
+
+#endif // _LIBCPP___ALGORITHM_SIMD_UTILS_H
diff --git a/libcxx/include/__bit/bit_cast.h b/libcxx/include/__bit/bit_cast.h
index f20b39ae748b10..6298810f373303 100644
--- a/libcxx/include/__bit/bit_cast.h
+++ b/libcxx/include/__bit/bit_cast.h
@@ -19,6 +19,15 @@
 
 _LIBCPP_BEGIN_NAMESPACE_STD
 
+#ifndef _LIBCPP_CXX03_LANG
+
+template <class _ToType, class _FromType>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI constexpr _ToType __bit_cast(const _FromType& __from) noexcept {
+  return __builtin_bit_cast(_ToType, __from);
+}
+
+#endif // _LIBCPP_CXX03_LANG
+
 #if _LIBCPP_STD_VER >= 20
 
 template <class _ToType, class _FromType>
diff --git a/libcxx/include/__bit/countr.h b/libcxx/include/__bit/countr.h
index 0cc679f87a99d9..b6b3ac52ca4e47 100644
--- a/libcxx/include/__bit/countr.h
+++ b/libcxx/include/__bit/countr.h
@@ -35,10 +35,8 @@ _LIBCPP_NODISCARD inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR int __libcpp_ct
   return __builtin_ctzll(__x);
 }
 
-#if _LIBCPP_STD_VER >= 20
-
-template <__libcpp_unsigned_integer _Tp>
-_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_zero(_Tp __t) noexcept {
+template <class _Tp>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 int __countr_zero(_Tp __t) _NOEXCEPT {
   if (__t == 0)
     return numeric_limits<_Tp>::digits;
 
@@ -59,6 +57,13 @@ _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_zero(_Tp __t) n
   }
 }
 
+#if _LIBCPP_STD_VER >= 20
+
+template <__libcpp_unsigned_integer _Tp>
+_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_zero(_Tp __t) noexcept {
+  return std::__countr_zero(__t);
+}
+
 template <__libcpp_unsigned_integer _Tp>
 _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_one(_Tp __t) noexcept {
   return __t != numeric_limits<_Tp>::max() ? std::countr_zero(static_cast<_Tp>(~__t)) : numeric_limits<_Tp>::digits;
diff --git a/libcxx/include/libcxx.imp b/libcxx/include/libcxx.imp
index fbe09fa5e54ab1..8b2f78b713976a 100644
--- a/libcxx/include/libcxx.imp
+++ b/libcxx/include/libcxx.imp
@@ -217,6 +217,7 @@
   { include: [ "<__algorithm/shift_right.h>", "private", "<algorithm>", "public" ] },
   { include: [ "<__algorithm/shuffle.h>", "private", "<algorithm>", "public" ] },
   { include: [ "<__algorithm/sift_down.h>", "private", "<algorithm>", "public" ] },
+  { include: [ "<__algorithm/simd_utils.h>", "private", "<algorithm>", "public" ] },
   { include: [ "<__algorithm/sort.h>", "private", "<algorithm>", "public" ] },
   { include: [ "<__algorithm/sort_heap.h>", "private", "<algorithm>", "public" ] },
   { include: [ "<__algorithm/stable_partition.h>", "private", "<algorithm>", "public" ] },
diff --git a/libcxx/include/module.modulemap.in b/libcxx/include/module.modulemap.in
index e72136a58c0b1b..79245c87b15847 100644
--- a/libcxx/include/module.modulemap.in
+++ b/libcxx/include/module.modulemap.in
@@ -697,7 +697,10 @@ module std_private_algorithm_minmax                                      [system
   export *
 }
 module std_private_algorithm_minmax_element                              [system] { header "__algorithm/minmax_element.h" }
-module std_private_algorithm_mismatch                                    [system] { header "__algorithm/mismatch.h" }
+module std_private_algorithm_mismatch                                    [system] {
+  header "__algorithm/mismatch.h"
+  export std_private_algorithm_simd_utils
+}
 module std_private_algorithm_move                                        [system] { header "__algorithm/move.h" }
 module std_private_algorithm_move_backward                               [system] { header "__algorithm/move_backward.h" }
 module std_private_algorithm_next_permutation                            [system] { header "__algorithm/next_permutation.h" }
@@ -1048,6 +1051,7 @@ module std_private_algorithm_sort                                        [system
   header "__algorithm/sort.h"
   export std_private_debug_utils_strict_weak_ordering_check
 }
+module std_private_algorithm_simd_utils                                  [system] { header "__algorithm/simd_utils.h" }
 module std_private_algorithm_sort_heap                                   [system] { header "__algorithm/sort_heap.h" }
 module std_private_algorithm_stable_partition                            [system] { header "__algorithm/stable_partition.h" }
 module std_private_algorithm_stable_sort                                 [system] { header "__algorithm/stable_sort.h" }
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp
index cc588c095ccfb2..faaef5377863b8 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp
@@ -16,78 +16,117 @@
 // template<InputIterator Iter1, InputIterator Iter2Pred>
 //   constexpr pair<Iter1, Iter2>   // constexpr after c++17
 //   mismatch(Iter1 first1, Iter1 last1, Iter2 first2, Iter2 last2); // C++14
+//
+// template<InputIterator Iter1, InputIterator Iter2,
+//          Predicate<auto, Iter1::value_type, Iter2::value_type> Pred>
+//   requires CopyConstructible<Pred>
+//   constexpr pair<Iter1, Iter2>   // constexpr after c++17
+//   mismatch(Iter1 first1, Iter1 last1, Iter2 first2, Pred pred);
+//
+// template<InputIterator Iter1, InputIterator Iter2, Predicate Pred>
+//   constexpr pair<Iter1, Iter2>   // constexpr after c++17
+//   mismatch(Iter1 first1, Iter1 last1, Iter2 first2, Iter2 last2, Pred pred); // C++14
+
+// ADDITIONAL_COMPILE_FLAGS(has-fconstexpr-steps): -fconstexpr-steps=20000000
+// ADDITIONAL_COMPILE_FLAGS(has-fconstexpr-ops-limit): -fconstexpr-ops-limit=100000000
 
 #include <algorithm>
+#include <array>
 #include <cassert>
+#include <vector>
 
 #include "test_macros.h"
 #include "test_iterators.h"
-
-#if TEST_STD_VER > 17
-TEST_CONSTEXPR bool test_constexpr() {
-    int ia[] = {1, 3, 6, 7};
-    int ib[] = {1, 3};
-    int ic[] = {1, 3, 5, 7};
-    typedef cpp17_input_iterator<int*>         II;
-    typedef bidirectional_iterator<int*> BI;
-
-    auto p1 = std::mismatch(std::begin(ia), std::end(ia), std::begin(ic));
-    if (p1.first != ia+2 || p1.second != ic+2)
-        return false;
-
-    auto p2 = std::mismatch(std::begin(ia), std::end(ia), std::begin(ic), std::end(ic));
-    if (p2.first != ia+2 || p2.second != ic+2)
-        return false;
-
-    auto p3 = std::mismatch(std::begin(ib), std::end(ib), std::begin(ic));
-    if (p3.first != ib+2 || p3.second != ic+2)
-        return false;
-
-    auto p4 = std::mismatch(std::begin(ib), std::end(ib), std::begin(ic), std::end(ic));
-    if (p4.first != ib+2 || p4.second != ic+2)
-        return false;
-
-    auto p5 = std::mismatch(II(std::begin(ib)), II(std::end(ib)), II(std::begin(ic)));
-    if (p5.first != II(ib+2) || p5.second != II(ic+2))
-        return false;
-    auto p6 = std::mismatch(BI(std::begin(ib)), BI(std::end(ib)), BI(std::begin(ic)), BI(std::end(ic)));
-    if (p6.first != BI(ib+2) || p6.second != BI(ic+2))
-        return false;
-
-    return true;
-    }
+#include "type_algorithms.h"
+
+template <class Iter, class Container1, class Container2>
+TEST_CONSTEXPR_CXX20 void check(Container1 lhs, Container2 rhs, size_t offset) {
+  if (lhs.size() == rhs.size()) {
+    assert(std::mismatch(Iter(lhs.data()), Iter(lhs.data() + lhs.size()), Iter(rhs.data())) ==
+           std::make_pair(Iter(lhs.data() + offset), Iter(rhs.data() + offset)));
+
+    assert(std::mismatch(Iter(lhs.data()), Iter(lhs.data() + lhs.size()), Iter(rhs.data()), std::equal_to<int>()) ==
+           std::make_pair(Iter(lhs.data() + offset), Iter(rhs.data() + offset)));
+  }
+
+#if TEST_STD_VER >= 14
+  assert(
+      std::mismatch(Iter(lhs.data()), Iter(lhs.data() + lhs.size()), Iter(rhs.data()), Iter(rhs.data() + rhs.size())) ==
+      std::make_pair(Iter(lhs.data() + offset), Iter(rhs.data() + offset)));
+
+  assert(std::mismatch(Iter(lhs.data()),
+                       Iter(lhs.data() + lhs.size()),
+                       Iter(rhs.data()),
+                       Iter(rhs.data() + rhs.size()),
+                       std::equal_to<int>()) == std::make_pair(Iter(lhs.data() + offset), Iter(rhs.data() + offset)));
 #endif
+}
 
-int main(int, char**)
-{
-    int ia[] = {0, 1, 2, 2, 0, 1, 2, 3};
-    const unsigned sa = sizeof(ia)/sizeof(ia[0]);
-    int ib[] = {0, 1, 2, 3, 0, 1, 2, 3};
-    const unsigned sb = sizeof(ib)/sizeof(ib[0]); ((void)sb); // unused in C++11
-
-    typedef cpp17_input_iterator<const int*> II;
-    typedef random_access_iterator<const int*>  RAI;
-
-    assert(std::mismatch(II(ia), II(ia + sa), II(ib))
-            == (std::pair<II, II>(II(ia+3), II(ib+3))));
-
-    assert(std::mismatch(RAI(ia), RAI(ia + sa), RAI(ib))
-            == (std::pair<RAI, RAI>(RAI(ia+3), RAI(ib+3))));
-
-#if TEST_STD_VER > 11 // We have the four iteration version
-    assert(std::mismatch(II(ia), II(ia + sa), II(ib), II(ib+sb))
-            == (std::pair<II, II>(II(ia+3), II(ib+3))));
-
-    assert(std::mismatch(RAI(ia), RAI(ia + s...
[truncated]

Copy link

@DenisYaroshevskiy DenisYaroshevskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tagging me, commandable effort!

#include <random>

template <class T>
static void bm_mismatch(benchmark::State& state) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how much you care for it but you can try to mess with data alignment.

ve1.data() and vec2.data() will be aligned to 16 bytes. Which can lead to loads being aligned.

the difference can be quite huge.

There are some things you can do about that, I don't know if they are worth it for 2 range algorithms.

At the very least maybe aligned your .data() pointers to 64 bytes:

  • allocate both vectors with +64 bytes.
  • align the pointers forward - get a span.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a TODO for now. Though even if it makes a difference I'm not sure we can do much about it, can we?

Copy link

@DenisYaroshevskiy DenisYaroshevskiy Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can align onr of the arrays. Also this makes for a better benchmark

BENCHMARK(bm_mismatch<short>)->DenseRange(1, 8)->Range(16, 1 << 20);
BENCHMARK(bm_mismatch<int>)->DenseRange(1, 8)->Range(16, 1 << 20);

BENCHMARK_MAIN();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to compare against memcmp. For libstdc++ that is the best one. So that you'll see how far away are you.

}

for (size_t __i = 0; __i != __unroll_count; ++__i) {
if (auto __cmp_res = __lhs[__i] == __rhs[__i]; !std::__all_of(__cmp_res)) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't like the dependency here.

  1. have a look if maybe comparing all into bool __res[__unroll_count]; would be helpful.

  2. or all of them togehter and then do one __all_of

Try which is best.

Look at std::memcmp:
https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memcmpeq-avx2.S.html#220

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried a few versions, and I couldn't get anything significant. The benchmark is already very close to memcmp for large arrays, so I'm not sure it makes a ton of sense to try to squeeze out the last few bits right now. I'd much rather look into improving it for small arrays for now.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. You might want to keep memcmp in the benchmark and not delete it.

Also - take a note that you use powers of 2 everywhere. That will only measure the very unrolled thing. Tails can be significant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a patch to improve the tail: #83440. I'm not sure about keeping the memcmp benchmark, since our benchmarks are to show performance improvements, and not to compare to kind-of similar algorithms.

@DenisYaroshevskiy
Copy link

At the moment it's not trivial to run benchmarks for this against what I have. Is there something standalone?

Copy link
Contributor Author

@philnik777 philnik777 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look!

}

for (size_t __i = 0; __i != __unroll_count; ++__i) {
if (auto __cmp_res = __lhs[__i] == __rhs[__i]; !std::__all_of(__cmp_res)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried a few versions, and I couldn't get anything significant. The benchmark is already very close to memcmp for large arrays, so I'm not sure it makes a ton of sense to try to squeeze out the last few bits right now. I'd much rather look into improving it for small arrays for now.

@philnik777
Copy link
Contributor Author

At the moment it's not trivial to run benchmarks for this against what I have. Is there something standalone?

Could you elaborate a bit? I'm not sure what you're trying to do.

Copy link

@DenisYaroshevskiy DenisYaroshevskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate a bit? I'm not sure what you're trying to do.

I was thinking to compare this implementation on the benchmarks I have. But when it's a part of a libc++ fork it's a bit too much work.

}

for (size_t __i = 0; __i != __unroll_count; ++__i) {
if (auto __cmp_res = __lhs[__i] == __rhs[__i]; !std::__all_of(__cmp_res)) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure. You might want to keep memcmp in the benchmark and not delete it.

Also - take a note that you use powers of 2 everywhere. That will only measure the very unrolled thing. Tails can be significant.

@philnik777
Copy link
Contributor Author

Could you elaborate a bit? I'm not sure what you're trying to do.

I was thinking to compare this implementation on the benchmarks I have. But when it's a part of a libc++ fork it's a bit too much work.

With a bit of playing preprocessor (and not disabling the old operation_traits.h) you should be able to get things working with a 17 release: https://godbolt.org/z/j16P5Kxeb. Just put it before any other includes and everything should work (I think).

@philnik777 philnik777 force-pushed the optimize_mismatch branch 2 times, most recently from 4bdaba3 to 6c6f71a Compare February 29, 2024 12:19
Copy link
Member

@ldionne ldionne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a full review, I had to leave halfway through. I haven't looked at the core of the implementation in detail yet.

Copy link

@DenisYaroshevskiy DenisYaroshevskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as SIMD algorithm is concerned, this lgtm.

There are some potential improvements that we discussed and I think that the benchmark maybe has some issues but it's OK.

I didn't have energy to test it against my code at the moment unfrotunately.

P.S. Good work.

@philnik777 philnik777 force-pushed the optimize_mismatch branch 2 times, most recently from 935ecee to 295200e Compare March 11, 2024 12:47
Copy link
Member

@ldionne ldionne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, I do have a few comments but I think this is a really good patch. I would strongly encourage you to consider the interaction of these optimizations and the PSTL with unseq. I think we should use the same underlying machinery for both.

constexpr size_t __vec_size = __native_vector_size<_Tp>;
using __vec = __simd_vector<_Tp, __vec_size>;
if (!__libcpp_is_constant_evaluated()) {
while (static_cast<size_t>(__last1 - __first1) >= __unroll_count * __vec_size) [[__unlikely__]] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fhahn @jroelofs According to @philnik777, if we manually unroll the loop with a constant number of iterations, Clang isn't "smart" enough to vectorize the code. So we end up having to use explicit constructs like __load_vector and friends to make it happen. Just a heads up in case you have thoughts since we've been talking about vectorization in the context of libc++.

Copy link
Contributor

@jroelofs jroelofs Mar 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you show the version w/o manual unrolling?

Have you tried #pragma clang loop vectorize(enable) and/or #pragma clang loop unroll_count(N) or #pragma clang loop unroll(full)?

__is_identity<_Proj1>::value && __is_identity<_Proj2>::value,
int> = 0>
_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_Tp*, _Tp*>
__mismatch(_Tp* __first1, _Tp* __last1, _Tp* __first2, _Pred& __pred, _Proj1& __proj1, _Proj2& __proj2) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing we could do here is implement this function as std::mismatch(std::execution::unseq, ...), and then we simply have this mismatch call into the PSTL version with unseq if it knows that to be safe. This would provide a nice way to frame these optimizations, since that would also be applicable to other algorithms.

@DenisYaroshevskiy
Copy link

I think we should use the same underlying machinery for both.

How'd you go about that? You need some compiler extensions to do predicates

@ldionne
Copy link
Member

ldionne commented Mar 14, 2024

I think we should use the same underlying machinery for both.

How'd you go about that? You need some compiler extensions to do predicates

At the very least, std::mismatch(std::execution::unseq, ...) should be able to use the same optimizations as the serial version, so there's something to be shared. Not everything may be shareable, but some of it must be.

@DenisYaroshevskiy
Copy link

I think we should use the same underlying machinery for both.

How'd you go about that? You need some compiler extensions to do predicates

At the very least, std::mismatch(std::execution::unseq, ...) should be able to use the same optimizations as the serial version, so there's something to be shared. Not everything may be shareable, but some of it must be.

Is it legal for unseq to just call serial version? I don't know of anything unseq could have done better here

@philnik777 philnik777 force-pushed the optimize_mismatch branch 6 times, most recently from 0f83e8f to 8d6a002 Compare March 17, 2024 19:51
@philnik777 philnik777 force-pushed the optimize_mismatch branch 2 times, most recently from f5fe730 to 7b44e8d Compare March 23, 2024 12:49
Copy link

✅ With the latest revision this PR passed the Python code formatter.

@philnik777 philnik777 merged commit b68e2eb into llvm:main Mar 23, 2024
@philnik777 philnik777 deleted the optimize_mismatch branch March 23, 2024 14:28
constexpr size_t __vec_size = __native_vector_size<_Tp>;
using __vec = __simd_vector<_Tp, __vec_size>;
if (!__libcpp_is_constant_evaluated()) {
while (static_cast<size_t>(__last1 - __first1) >= __unroll_count * __vec_size) [[__unlikely__]] {
Copy link
Contributor

@wsehjk wsehjk Apr 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, Why this while condition is marked with __unlikely__. I think it should be __likely__ @philnik777

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you think the loop is likely?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants