-
Notifications
You must be signed in to change notification settings - Fork 13.6k
[libc++] Vectorize mismatch #73255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[libc++] Vectorize mismatch #73255
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter. |
f5a4e3b
to
5251bb2
Compare
1c5041e
to
87f2ba2
Compare
This uses extensions, but I don't expect this to land that soon anyways. Worst case I'll have to wait a few weeks until I can land it. CC @DenisYaroshevskiy in case you are interested in helping review this. |
@llvm/pr-subscribers-libcxx Author: Nikolas Klauser (philnik777) Changes
Patch is 26.41 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/73255.diff 11 Files Affected:
diff --git a/libcxx/benchmarks/CMakeLists.txt b/libcxx/benchmarks/CMakeLists.txt
index 2434d82c6fd6ba..0337d4fdb73aa2 100644
--- a/libcxx/benchmarks/CMakeLists.txt
+++ b/libcxx/benchmarks/CMakeLists.txt
@@ -182,6 +182,7 @@ set(BENCHMARK_TESTS
algorithms/make_heap_then_sort_heap.bench.cpp
algorithms/min.bench.cpp
algorithms/min_max_element.bench.cpp
+ algorithms/mismatch.bench.cpp
algorithms/pop_heap.bench.cpp
algorithms/pstl.stable_sort.bench.cpp
algorithms/push_heap.bench.cpp
diff --git a/libcxx/benchmarks/algorithms/mismatch.bench.cpp b/libcxx/benchmarks/algorithms/mismatch.bench.cpp
new file mode 100644
index 00000000000000..98725c8690e640
--- /dev/null
+++ b/libcxx/benchmarks/algorithms/mismatch.bench.cpp
@@ -0,0 +1,31 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#include <algorithm>
+#include <benchmark/benchmark.h>
+#include <random>
+
+template <class T>
+static void bm_mismatch(benchmark::State& state) {
+ std::vector<T> vec1(state.range(), '1');
+ std::vector<T> vec2(state.range(), '1');
+ std::mt19937_64 rng(std::random_device{}());
+
+ for (auto _ : state) {
+ auto idx = rng() % vec1.size();
+ vec1[idx] = '2';
+ benchmark::DoNotOptimize(vec1);
+ benchmark::DoNotOptimize(std::mismatch(vec1.begin(), vec1.end(), vec2.begin()));
+ vec1[idx] = '1';
+ }
+}
+BENCHMARK(bm_mismatch<char>)->DenseRange(1, 8)->Range(16, 1 << 20);
+BENCHMARK(bm_mismatch<short>)->DenseRange(1, 8)->Range(16, 1 << 20);
+BENCHMARK(bm_mismatch<int>)->DenseRange(1, 8)->Range(16, 1 << 20);
+
+BENCHMARK_MAIN();
diff --git a/libcxx/include/CMakeLists.txt b/libcxx/include/CMakeLists.txt
index fa3b464e56c4d0..e04e6c899bd161 100644
--- a/libcxx/include/CMakeLists.txt
+++ b/libcxx/include/CMakeLists.txt
@@ -217,6 +217,7 @@ set(files
__algorithm/shift_right.h
__algorithm/shuffle.h
__algorithm/sift_down.h
+ __algorithm/simd_utils.h
__algorithm/sort.h
__algorithm/sort_heap.h
__algorithm/stable_partition.h
diff --git a/libcxx/include/__algorithm/mismatch.h b/libcxx/include/__algorithm/mismatch.h
index d345b6048a7e9b..9cb7c6f9ff55c7 100644
--- a/libcxx/include/__algorithm/mismatch.h
+++ b/libcxx/include/__algorithm/mismatch.h
@@ -11,23 +11,89 @@
#define _LIBCPP___ALGORITHM_MISMATCH_H
#include <__algorithm/comp.h>
+#include <__algorithm/simd_utils.h>
+#include <__algorithm/unwrap_iter.h>
#include <__config>
-#include <__iterator/iterator_traits.h>
+#include <__functional/identity.h>
+#include <__type_traits/invoke.h>
+#include <__type_traits/is_constant_evaluated.h>
+#include <__type_traits/is_equality_comparable.h>
+#include <__type_traits/operation_traits.h>
+#include <__utility/move.h>
#include <__utility/pair.h>
#if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
# pragma GCC system_header
#endif
+_LIBCPP_PUSH_MACROS
+#include <__undef_macros>
+
_LIBCPP_BEGIN_NAMESPACE_STD
+template <class _Iter1, class _Sent1, class _Iter2, class _Pred, class _Proj1, class _Proj2>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_Iter1, _Iter2>
+__mismatch_loop(_Iter1 __first1, _Sent1 __last1, _Iter2 __first2, _Pred& __pred, _Proj1& __proj1, _Proj2& __proj2) {
+ while (__first1 != __last1) {
+ if (!std::__invoke(__pred, std::__invoke(__proj1, *__first1), std::__invoke(__proj2, *__first2)))
+ break;
+ ++__first1;
+ ++__first2;
+ }
+ return std::make_pair(std::move(__first1), std::move(__first2));
+}
+
+template <class _Iter1, class _Sent1, class _Iter2, class _Pred, class _Proj1, class _Proj2>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_Iter1, _Iter2>
+__mismatch(_Iter1 __first1, _Sent1 __last1, _Iter2 __first2, _Pred& __pred, _Proj1& __proj1, _Proj2& __proj2) {
+ return std::__mismatch_loop(__first1, __last1, __first2, __pred, __proj1, __proj2);
+}
+
+#if _LIBCPP_VECTORIZE_ALGORIHTMS
+
+template <class _Tp,
+ class _Pred,
+ class _Proj1,
+ class _Proj2,
+ __enable_if_t<is_integral<_Tp>::value && __desugars_to<__equal_tag, _Pred, _Tp, _Tp>::value &&
+ __is_identity<_Proj1>::value && __is_identity<_Proj2>::value,
+ int> = 0>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_Tp*, _Tp*>
+__mismatch(_Tp* __first1, _Tp* __last1, _Tp* __first2, _Pred& __pred, _Proj1& __proj1, _Proj2& __proj2) {
+ constexpr size_t __unroll_count = 4;
+ constexpr size_t __vec_size = __native_vector_size<_Tp>;
+ using __vec = __simd_vector<_Tp, __vec_size>;
+ while (!__libcpp_is_constant_evaluated() && static_cast<size_t>(__last1 - __first1) >= __unroll_count * __vec_size) {
+ __vec __lhs[__unroll_count];
+ __vec __rhs[__unroll_count];
+
+ for (size_t __i = 0; __i != __unroll_count; ++__i) {
+ __lhs[__i] = std::__load_vector<__vec>(__first1 + __i * __vec_size);
+ __rhs[__i] = std::__load_vector<__vec>(__first2 + __i * __vec_size);
+ }
+
+ for (size_t __i = 0; __i != __unroll_count; ++__i) {
+ if (auto __cmp_res = __lhs[__i] == __rhs[__i]; !std::__all_of(__cmp_res)) {
+ auto __offset = __i * __unroll_count + std::__find_first_not_set(__cmp_res);
+ return {__first1 + __offset, __first2 + __offset};
+ }
+ }
+
+ __first1 += __unroll_count * __vec_size;
+ __first2 += __unroll_count * __vec_size;
+ }
+ return std::__mismatch_loop(__first1, __last1, __first2, __pred, __proj1, __proj2);
+}
+
+#endif // _LIBCPP_VECTORIZE_ALGORIHTMS
+
template <class _InputIterator1, class _InputIterator2, class _BinaryPredicate>
_LIBCPP_NODISCARD_EXT inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_InputIterator1, _InputIterator2>
mismatch(_InputIterator1 __first1, _InputIterator1 __last1, _InputIterator2 __first2, _BinaryPredicate __pred) {
- for (; __first1 != __last1; ++__first1, (void)++__first2)
- if (!__pred(*__first1, *__first2))
- break;
- return pair<_InputIterator1, _InputIterator2>(__first1, __first2);
+ __identity __proj;
+ auto __res = std::__mismatch(
+ std::__unwrap_iter(__first1), std::__unwrap_iter(__last1), std::__unwrap_iter(__first2), __pred, __proj, __proj);
+ return std::make_pair(std::__rewrap_iter(__first1, __res.first), std::__rewrap_iter(__first2, __res.second));
}
template <class _InputIterator1, class _InputIterator2>
@@ -59,4 +125,6 @@ mismatch(_InputIterator1 __first1, _InputIterator1 __last1, _InputIterator2 __fi
_LIBCPP_END_NAMESPACE_STD
+_LIBCPP_POP_MACROS
+
#endif // _LIBCPP___ALGORITHM_MISMATCH_H
diff --git a/libcxx/include/__algorithm/simd_utils.h b/libcxx/include/__algorithm/simd_utils.h
new file mode 100644
index 00000000000000..72f1a775dfe9b9
--- /dev/null
+++ b/libcxx/include/__algorithm/simd_utils.h
@@ -0,0 +1,114 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef _LIBCPP___ALGORITHM_SIMD_UTILS_H
+#define _LIBCPP___ALGORITHM_SIMD_UTILS_H
+
+#include <__bit/bit_cast.h>
+#include <__bit/countr.h>
+#include <__config>
+#include <__type_traits/is_arithmetic.h>
+#include <__type_traits/is_same.h>
+#include <__utility/integer_sequence.h>
+#include <cstddef>
+#include <cstdint>
+
+#if !defined(_LIBCPP_HAS_NO_PRAGMA_SYSTEM_HEADER)
+# pragma GCC system_header
+#endif
+
+#if _LIBCPP_STD_VER >= 14 && __has_attribute(__ext_vector_type__) && __has_builtin(__builtin_reduce_and) && \
+ __has_builtin(__builtin_convertvector)
+# define _LIBCPP_HAS_ALGORITHM_VECTOR_UTILS 1
+#else
+# define _LIBCPP_HAS_ALGORITHM_VECTOR_UTILS 0
+#endif
+
+#if _LIBCPP_HAS_ALGORITHM_VECTOR_UTILS && !defined(__OPTIMIZE_SIZE__)
+# define _LIBCPP_VECTORIZE_ALGORIHTMS 1
+#else
+# define _LIBCPP_VECTORIZE_ALGORIHTMS 0
+#endif
+
+#if _LIBCPP_HAS_ALGORITHM_VECTOR_UTILS
+
+_LIBCPP_BEGIN_NAMESPACE_STD
+
+# if defined(__AVX512F__)
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 64 / sizeof(_Tp);
+# elif defined(__AVX__)
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 32 / sizeof(_Tp);
+# elif defined(__SSE__) || defined(__ARM_NEON__)
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 16 / sizeof(_Tp);
+# elif defined(__MMX__)
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 8 / sizeof(_Tp);
+# else
+template <class _Tp>
+inline constexpr size_t __native_vector_size = 1;
+# endif
+
+template <class _Tp, size_t _Np>
+using __simd_vector __attribute__((__ext_vector_type__(_Np))) = _Tp;
+
+template <class _VecT>
+inline constexpr size_t __simd_vector_size_v = []() -> size_t { static_assert(false, "Not a vector!"); }();
+
+template <class _Tp, size_t _Np>
+inline constexpr size_t __simd_vector_size_v<__simd_vector<_Tp, _Np>> = _Np;
+
+template <class _VecT>
+using __simd_vector_underlying_type_t = decltype([]<class _Tp, size_t _Np>() { return _Tp{}; });
+
+template <class _VecT, class _UnderlyingType = __simd_vector_underlying_type_t<_VecT>>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _VecT __load_vector(const _UnderlyingType* __ptr) noexcept {
+ return []<size_t... _Indices>(const _UnderlyingType* __lptr, index_sequence<_Indices...>) static noexcept {
+ return _VecT{__lptr[_Indices]...};
+ }(__ptr, make_index_sequence<__simd_vector_size_v<_VecT>>{});
+}
+
+template <class _Tp, size_t _Np>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI bool __all_of(__simd_vector<_Tp, _Np> __vec) noexcept {
+ return __builtin_reduce_and(__builtin_convertvector(__vec, __simd_vector<bool, _Np>));
+}
+
+template <class _Tp, size_t _Np>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI size_t __find_first_set(__simd_vector<_Tp, _Np> __vec) noexcept {
+ using __mask_vec = __simd_vector<bool, _Np>;
+
+ auto __impl = [&]<class _MaskT>(_MaskT) noexcept {
+ return std::__countr_zero(std::__bit_cast<_MaskT>(__builtin_convertvector(__vec, __mask_vec)));
+ };
+
+ if constexpr (sizeof(__mask_vec) == sizeof(uint8_t)) {
+ return __impl(uint8_t{});
+ } else if constexpr (sizeof(__mask_vec) == sizeof(uint16_t)) {
+ return __impl(uint16_t{});
+ } else if constexpr (sizeof(__mask_vec) == sizeof(uint32_t)) {
+ return __impl(uint32_t{});
+ } else if constexpr (sizeof(__mask_vec) == sizeof(uint64_t)) {
+ return __impl(uint64_t{});
+ } else {
+ static_assert(sizeof(__mask_vec) == 0, "unexpected required size for mask integer type");
+ }
+}
+
+template <class _Tp, size_t _Np>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI size_t __find_first_not_set(__simd_vector<_Tp, _Np> __vec) noexcept {
+ return std::__find_first_set(~__vec);
+}
+
+_LIBCPP_END_NAMESPACE_STD
+
+#endif // _LIBCPP_STD_VER >= 14 && __has_attribute(__ext_vector_type__) && __has_builtin(__builtin_reduce_and) &&
+ // __has_builtin(__builtin_convertvector)
+
+#endif // _LIBCPP___ALGORITHM_SIMD_UTILS_H
diff --git a/libcxx/include/__bit/bit_cast.h b/libcxx/include/__bit/bit_cast.h
index f20b39ae748b10..6298810f373303 100644
--- a/libcxx/include/__bit/bit_cast.h
+++ b/libcxx/include/__bit/bit_cast.h
@@ -19,6 +19,15 @@
_LIBCPP_BEGIN_NAMESPACE_STD
+#ifndef _LIBCPP_CXX03_LANG
+
+template <class _ToType, class _FromType>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI constexpr _ToType __bit_cast(const _FromType& __from) noexcept {
+ return __builtin_bit_cast(_ToType, __from);
+}
+
+#endif // _LIBCPP_CXX03_LANG
+
#if _LIBCPP_STD_VER >= 20
template <class _ToType, class _FromType>
diff --git a/libcxx/include/__bit/countr.h b/libcxx/include/__bit/countr.h
index 0cc679f87a99d9..b6b3ac52ca4e47 100644
--- a/libcxx/include/__bit/countr.h
+++ b/libcxx/include/__bit/countr.h
@@ -35,10 +35,8 @@ _LIBCPP_NODISCARD inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR int __libcpp_ct
return __builtin_ctzll(__x);
}
-#if _LIBCPP_STD_VER >= 20
-
-template <__libcpp_unsigned_integer _Tp>
-_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_zero(_Tp __t) noexcept {
+template <class _Tp>
+_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 int __countr_zero(_Tp __t) _NOEXCEPT {
if (__t == 0)
return numeric_limits<_Tp>::digits;
@@ -59,6 +57,13 @@ _LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_zero(_Tp __t) n
}
}
+#if _LIBCPP_STD_VER >= 20
+
+template <__libcpp_unsigned_integer _Tp>
+_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_zero(_Tp __t) noexcept {
+ return std::__countr_zero(__t);
+}
+
template <__libcpp_unsigned_integer _Tp>
_LIBCPP_NODISCARD_EXT _LIBCPP_HIDE_FROM_ABI constexpr int countr_one(_Tp __t) noexcept {
return __t != numeric_limits<_Tp>::max() ? std::countr_zero(static_cast<_Tp>(~__t)) : numeric_limits<_Tp>::digits;
diff --git a/libcxx/include/libcxx.imp b/libcxx/include/libcxx.imp
index fbe09fa5e54ab1..8b2f78b713976a 100644
--- a/libcxx/include/libcxx.imp
+++ b/libcxx/include/libcxx.imp
@@ -217,6 +217,7 @@
{ include: [ "<__algorithm/shift_right.h>", "private", "<algorithm>", "public" ] },
{ include: [ "<__algorithm/shuffle.h>", "private", "<algorithm>", "public" ] },
{ include: [ "<__algorithm/sift_down.h>", "private", "<algorithm>", "public" ] },
+ { include: [ "<__algorithm/simd_utils.h>", "private", "<algorithm>", "public" ] },
{ include: [ "<__algorithm/sort.h>", "private", "<algorithm>", "public" ] },
{ include: [ "<__algorithm/sort_heap.h>", "private", "<algorithm>", "public" ] },
{ include: [ "<__algorithm/stable_partition.h>", "private", "<algorithm>", "public" ] },
diff --git a/libcxx/include/module.modulemap.in b/libcxx/include/module.modulemap.in
index e72136a58c0b1b..79245c87b15847 100644
--- a/libcxx/include/module.modulemap.in
+++ b/libcxx/include/module.modulemap.in
@@ -697,7 +697,10 @@ module std_private_algorithm_minmax [system
export *
}
module std_private_algorithm_minmax_element [system] { header "__algorithm/minmax_element.h" }
-module std_private_algorithm_mismatch [system] { header "__algorithm/mismatch.h" }
+module std_private_algorithm_mismatch [system] {
+ header "__algorithm/mismatch.h"
+ export std_private_algorithm_simd_utils
+}
module std_private_algorithm_move [system] { header "__algorithm/move.h" }
module std_private_algorithm_move_backward [system] { header "__algorithm/move_backward.h" }
module std_private_algorithm_next_permutation [system] { header "__algorithm/next_permutation.h" }
@@ -1048,6 +1051,7 @@ module std_private_algorithm_sort [system
header "__algorithm/sort.h"
export std_private_debug_utils_strict_weak_ordering_check
}
+module std_private_algorithm_simd_utils [system] { header "__algorithm/simd_utils.h" }
module std_private_algorithm_sort_heap [system] { header "__algorithm/sort_heap.h" }
module std_private_algorithm_stable_partition [system] { header "__algorithm/stable_partition.h" }
module std_private_algorithm_stable_sort [system] { header "__algorithm/stable_sort.h" }
diff --git a/libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp b/libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp
index cc588c095ccfb2..faaef5377863b8 100644
--- a/libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp
+++ b/libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp
@@ -16,78 +16,117 @@
// template<InputIterator Iter1, InputIterator Iter2Pred>
// constexpr pair<Iter1, Iter2> // constexpr after c++17
// mismatch(Iter1 first1, Iter1 last1, Iter2 first2, Iter2 last2); // C++14
+//
+// template<InputIterator Iter1, InputIterator Iter2,
+// Predicate<auto, Iter1::value_type, Iter2::value_type> Pred>
+// requires CopyConstructible<Pred>
+// constexpr pair<Iter1, Iter2> // constexpr after c++17
+// mismatch(Iter1 first1, Iter1 last1, Iter2 first2, Pred pred);
+//
+// template<InputIterator Iter1, InputIterator Iter2, Predicate Pred>
+// constexpr pair<Iter1, Iter2> // constexpr after c++17
+// mismatch(Iter1 first1, Iter1 last1, Iter2 first2, Iter2 last2, Pred pred); // C++14
+
+// ADDITIONAL_COMPILE_FLAGS(has-fconstexpr-steps): -fconstexpr-steps=20000000
+// ADDITIONAL_COMPILE_FLAGS(has-fconstexpr-ops-limit): -fconstexpr-ops-limit=100000000
#include <algorithm>
+#include <array>
#include <cassert>
+#include <vector>
#include "test_macros.h"
#include "test_iterators.h"
-
-#if TEST_STD_VER > 17
-TEST_CONSTEXPR bool test_constexpr() {
- int ia[] = {1, 3, 6, 7};
- int ib[] = {1, 3};
- int ic[] = {1, 3, 5, 7};
- typedef cpp17_input_iterator<int*> II;
- typedef bidirectional_iterator<int*> BI;
-
- auto p1 = std::mismatch(std::begin(ia), std::end(ia), std::begin(ic));
- if (p1.first != ia+2 || p1.second != ic+2)
- return false;
-
- auto p2 = std::mismatch(std::begin(ia), std::end(ia), std::begin(ic), std::end(ic));
- if (p2.first != ia+2 || p2.second != ic+2)
- return false;
-
- auto p3 = std::mismatch(std::begin(ib), std::end(ib), std::begin(ic));
- if (p3.first != ib+2 || p3.second != ic+2)
- return false;
-
- auto p4 = std::mismatch(std::begin(ib), std::end(ib), std::begin(ic), std::end(ic));
- if (p4.first != ib+2 || p4.second != ic+2)
- return false;
-
- auto p5 = std::mismatch(II(std::begin(ib)), II(std::end(ib)), II(std::begin(ic)));
- if (p5.first != II(ib+2) || p5.second != II(ic+2))
- return false;
- auto p6 = std::mismatch(BI(std::begin(ib)), BI(std::end(ib)), BI(std::begin(ic)), BI(std::end(ic)));
- if (p6.first != BI(ib+2) || p6.second != BI(ic+2))
- return false;
-
- return true;
- }
+#include "type_algorithms.h"
+
+template <class Iter, class Container1, class Container2>
+TEST_CONSTEXPR_CXX20 void check(Container1 lhs, Container2 rhs, size_t offset) {
+ if (lhs.size() == rhs.size()) {
+ assert(std::mismatch(Iter(lhs.data()), Iter(lhs.data() + lhs.size()), Iter(rhs.data())) ==
+ std::make_pair(Iter(lhs.data() + offset), Iter(rhs.data() + offset)));
+
+ assert(std::mismatch(Iter(lhs.data()), Iter(lhs.data() + lhs.size()), Iter(rhs.data()), std::equal_to<int>()) ==
+ std::make_pair(Iter(lhs.data() + offset), Iter(rhs.data() + offset)));
+ }
+
+#if TEST_STD_VER >= 14
+ assert(
+ std::mismatch(Iter(lhs.data()), Iter(lhs.data() + lhs.size()), Iter(rhs.data()), Iter(rhs.data() + rhs.size())) ==
+ std::make_pair(Iter(lhs.data() + offset), Iter(rhs.data() + offset)));
+
+ assert(std::mismatch(Iter(lhs.data()),
+ Iter(lhs.data() + lhs.size()),
+ Iter(rhs.data()),
+ Iter(rhs.data() + rhs.size()),
+ std::equal_to<int>()) == std::make_pair(Iter(lhs.data() + offset), Iter(rhs.data() + offset)));
#endif
+}
-int main(int, char**)
-{
- int ia[] = {0, 1, 2, 2, 0, 1, 2, 3};
- const unsigned sa = sizeof(ia)/sizeof(ia[0]);
- int ib[] = {0, 1, 2, 3, 0, 1, 2, 3};
- const unsigned sb = sizeof(ib)/sizeof(ib[0]); ((void)sb); // unused in C++11
-
- typedef cpp17_input_iterator<const int*> II;
- typedef random_access_iterator<const int*> RAI;
-
- assert(std::mismatch(II(ia), II(ia + sa), II(ib))
- == (std::pair<II, II>(II(ia+3), II(ib+3))));
-
- assert(std::mismatch(RAI(ia), RAI(ia + sa), RAI(ib))
- == (std::pair<RAI, RAI>(RAI(ia+3), RAI(ib+3))));
-
-#if TEST_STD_VER > 11 // We have the four iteration version
- assert(std::mismatch(II(ia), II(ia + sa), II(ib), II(ib+sb))
- == (std::pair<II, II>(II(ia+3), II(ib+3))));
-
- assert(std::mismatch(RAI(ia), RAI(ia + s...
[truncated]
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tagging me, commandable effort!
#include <random> | ||
|
||
template <class T> | ||
static void bm_mismatch(benchmark::State& state) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how much you care for it but you can try to mess with data alignment.
ve1.data() and vec2.data() will be aligned to 16 bytes. Which can lead to loads being aligned.
the difference can be quite huge.
There are some things you can do about that, I don't know if they are worth it for 2 range algorithms.
At the very least maybe aligned your .data() pointers to 64 bytes:
- allocate both vectors with +64 bytes.
- align the pointers forward - get a span.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a TODO for now. Though even if it makes a difference I'm not sure we can do much about it, can we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can align onr of the arrays. Also this makes for a better benchmark
BENCHMARK(bm_mismatch<short>)->DenseRange(1, 8)->Range(16, 1 << 20); | ||
BENCHMARK(bm_mismatch<int>)->DenseRange(1, 8)->Range(16, 1 << 20); | ||
|
||
BENCHMARK_MAIN(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You want to compare against memcmp
. For libstdc++ that is the best one. So that you'll see how far away are you.
} | ||
|
||
for (size_t __i = 0; __i != __unroll_count; ++__i) { | ||
if (auto __cmp_res = __lhs[__i] == __rhs[__i]; !std::__all_of(__cmp_res)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't like the dependency here.
-
have a look if maybe comparing all into
bool __res[__unroll_count];
would be helpful. -
or all of them togehter and then do one
__all_of
Try which is best.
Look at std::memcmp:
https://codebrowser.dev/glibc/glibc/sysdeps/x86_64/multiarch/memcmpeq-avx2.S.html#220
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried a few versions, and I couldn't get anything significant. The benchmark is already very close to memcmp
for large arrays, so I'm not sure it makes a ton of sense to try to squeeze out the last few bits right now. I'd much rather look into improving it for small arrays for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. You might want to keep memcmp in the benchmark and not delete it.
Also - take a note that you use powers of 2 everywhere. That will only measure the very unrolled thing. Tails can be significant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a patch to improve the tail: #83440. I'm not sure about keeping the memcmp
benchmark, since our benchmarks are to show performance improvements, and not to compare to kind-of similar algorithms.
At the moment it's not trivial to run benchmarks for this against what I have. Is there something standalone? |
87f2ba2
to
46ab541
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look!
} | ||
|
||
for (size_t __i = 0; __i != __unroll_count; ++__i) { | ||
if (auto __cmp_res = __lhs[__i] == __rhs[__i]; !std::__all_of(__cmp_res)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried a few versions, and I couldn't get anything significant. The benchmark is already very close to memcmp
for large arrays, so I'm not sure it makes a ton of sense to try to squeeze out the last few bits right now. I'd much rather look into improving it for small arrays for now.
Could you elaborate a bit? I'm not sure what you're trying to do. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate a bit? I'm not sure what you're trying to do.
I was thinking to compare this implementation on the benchmarks I have. But when it's a part of a libc++ fork it's a bit too much work.
} | ||
|
||
for (size_t __i = 0; __i != __unroll_count; ++__i) { | ||
if (auto __cmp_res = __lhs[__i] == __rhs[__i]; !std::__all_of(__cmp_res)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. You might want to keep memcmp in the benchmark and not delete it.
Also - take a note that you use powers of 2 everywhere. That will only measure the very unrolled thing. Tails can be significant.
With a bit of playing preprocessor (and not disabling the old |
4bdaba3
to
6c6f71a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a full review, I had to leave halfway through. I haven't looked at the core of the implementation in detail yet.
libcxx/test/std/algorithms/alg.nonmodifying/mismatch/mismatch.pass.cpp
Outdated
Show resolved
Hide resolved
6c6f71a
to
c45f2b6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as SIMD algorithm is concerned, this lgtm.
There are some potential improvements that we discussed and I think that the benchmark maybe has some issues but it's OK.
I didn't have energy to test it against my code at the moment unfrotunately.
P.S. Good work.
c45f2b6
to
ecbc13b
Compare
935ecee
to
295200e
Compare
295200e
to
3323800
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM, I do have a few comments but I think this is a really good patch. I would strongly encourage you to consider the interaction of these optimizations and the PSTL with unseq
. I think we should use the same underlying machinery for both.
constexpr size_t __vec_size = __native_vector_size<_Tp>; | ||
using __vec = __simd_vector<_Tp, __vec_size>; | ||
if (!__libcpp_is_constant_evaluated()) { | ||
while (static_cast<size_t>(__last1 - __first1) >= __unroll_count * __vec_size) [[__unlikely__]] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fhahn @jroelofs According to @philnik777, if we manually unroll the loop with a constant number of iterations, Clang isn't "smart" enough to vectorize the code. So we end up having to use explicit constructs like __load_vector
and friends to make it happen. Just a heads up in case you have thoughts since we've been talking about vectorization in the context of libc++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you show the version w/o manual unrolling?
Have you tried #pragma clang loop vectorize(enable)
and/or #pragma clang loop unroll_count(N)
or #pragma clang loop unroll(full)
?
__is_identity<_Proj1>::value && __is_identity<_Proj2>::value, | ||
int> = 0> | ||
_LIBCPP_NODISCARD _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_Tp*, _Tp*> | ||
__mismatch(_Tp* __first1, _Tp* __last1, _Tp* __first2, _Pred& __pred, _Proj1& __proj1, _Proj2& __proj2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing we could do here is implement this function as std::mismatch(std::execution::unseq, ...)
, and then we simply have this mismatch
call into the PSTL version with unseq
if it knows that to be safe. This would provide a nice way to frame these optimizations, since that would also be applicable to other algorithms.
How'd you go about that? You need some compiler extensions to do predicates |
At the very least, |
Is it legal for unseq to just call serial version? I don't know of anything unseq could have done better here |
0f83e8f
to
8d6a002
Compare
f5fe730
to
7b44e8d
Compare
✅ With the latest revision this PR passed the Python code formatter. |
7b44e8d
to
6509147
Compare
6509147
to
ad2ec50
Compare
constexpr size_t __vec_size = __native_vector_size<_Tp>; | ||
using __vec = __simd_vector<_Tp, __vec_size>; | ||
if (!__libcpp_is_constant_evaluated()) { | ||
while (static_cast<size_t>(__last1 - __first1) >= __unroll_count * __vec_size) [[__unlikely__]] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, Why this while
condition is marked with __unlikely__
. I think it should be __likely__
@philnik777
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you think the loop is likely?
Uh oh!
There was an error while loading. Please reload this page.