Description
The (ECMAScript) regular expression [\W\D]
describes a character class that matches the union of (a) all non-word characters and (b) all non-digits. Since every digit is also a word character, that union is simply the set of all non-digits, so the character class should be equivalent to [\D]
and match all non-digits. However, libc++'s regex implementation only matches non-word characters.
Test case:
#include <iostream>
#include <regex>
using namespace std;
int main()
{
regex re(R"([\W\D])");
cout << "matches alphabetic: " << regex_match("a", re) << '\n'
<< "matches digit: " << regex_match("0", re) << '\n'
<< "matches non-alphanumeric: " << regex_match(".", re);
return 0;
}
https://godbolt.org/z/YdvY4Pb6a
This prints:
matches alphabetic: 0
matches digit: 0
matches non-alphanumeric: 1
But it should print (as MSVC STL and libstdc++ do here):
matches alphabetic: 1
matches digit: 0
matches non-alphanumeric: 1
The problem lies in llvm-project/libcxx/include/regex, lines 2139 to 2141 (at commit 215c0d2).
The negated character classes are bitwise or'ed, but De Morgan's law says that (not w) or (not d) = not (w and d), so the bit masks would really have to be bitwise and'ed.
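Here is a small illustration of the semantic difference, using std::regex_traits<char> directly (the variable names are mine, not libc++'s internals): or'ing the two negated masks and testing the result once yields the intersection of \W and \D, whereas ECMAScript requires the union.

#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::regex_traits<char> tr;
    std::string w_name = "w", d_name = "d";
    auto w = tr.lookup_classname(w_name.begin(), w_name.end()); // mask for \w
    auto d = tr.lookup_classname(d_name.begin(), d_name.end()); // mask for \d

    char c = 'a';
    // Or'ing the negated masks and testing once amounts to:
    // "c is neither a word character nor a digit" -- the *intersection* of \W and \D.
    bool intersection = !tr.isctype(c, w | d);
    // What ECMAScript requires for [\W\D]:
    // "c is not a word character OR c is not a digit" -- the *union* of \W and \D.
    bool union_ = !tr.isctype(c, w) || !tr.isctype(c, d);
    std::cout << intersection << ' ' << union_ << '\n'; // prints: 0 1
}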
But bitwise and'ing is problematic as well: the standard only guarantees that bitwise or'ing of class masks works; it doesn't state that bitwise and'ing corresponds to the intersection of the character classes (see [re.grammar]/9). And'ing might still happen to work for libc++'s std::regex_traits<char>
and std::regex_traits<wchar_t>
traits classes (although I haven't checked that), but it might not do the right thing for some user-provided traits classes.
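One way to sidestep this (purely a hypothetical sketch, not the actual libc++ code or a proposed patch) would be to store each negated class mask separately and declare a match as soon as the character fails any one of them; that computes the union of the complements without relying on bitwise and'ing of masks.

#include <regex>
#include <vector>

// Hypothetical helper: 'neg_masks' holds one mask per negated class
// (e.g. the masks looked up for \W and \D). A character matches the
// bracket expression's negated classes if it lies outside *any* of them,
// i.e. the union of the complements, as De Morgan requires.
template <class Traits>
bool matches_any_negated_class(typename Traits::char_type c, const Traits& traits,
                               const std::vector<typename Traits::char_class_type>& neg_masks)
{
    for (auto mask : neg_masks)
        if (!traits.isctype(c, mask))
            return true;
    return false;
}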