-
Notifications
You must be signed in to change notification settings - Fork 7.9k
RFC: Locale-independent case conversion #7506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
123025d
to
d072b91
Compare
main/rfc1867.c
Outdated
@@ -714,7 +713,7 @@ SAPI_API SAPI_POST_HANDLER_FUNC(rfc1867_post_handler) /* {{{ */ | |||
int content_type_len = (int)strlen(content_type_dup); | |||
char *content_type_lcase = estrndup(content_type_dup, content_type_len); | |||
|
|||
php_strtolower(content_type_lcase, content_type_len); | |||
zend_str_tolower(content_type_lcase, content_type_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch - I think this may have made sense regardless for http headers but outside of the scope of this RFC
e.g. this is done in https://golang.org/src/net/http/internal/ascii/print.go - the rfc from 1995 https://datatracker.ietf.org/doc/html/rfc1867 makes no mention of locale so I think it was assuming everything was using the english locale
More of an issue in ext/standard/http_fopen_wrapper.c for parsing request headers than here
EDIT: Though "boundary" would be the same no matter what locale for any locale I know of, so not a bug in this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RFC 1867 says "Note that mime headers are generally required to consist only of 7-bit data in the US-ASCII character set."
Headers should not be processed in a locale-depdendent fashion. Switch from upper to lowercasing because that's the standard for PHP and we provide an ASCII implementation of this operation. This is adapted from GH-7506.
All of the changes to "miscellaneous" extensions you've made here look like plain bug fixes to me -- we should land these independently of this RFC, and I think those should also go into PHP-8.1. Would you mind submitting a PR with just those changes? I landed the mbstring change that was using toupper in 46315de, but the rest looks like it should be directly applicable without the new toupper functionality. This will help reduce the scope of the RFC itself. |
Sure, will do. |
The changes to ext/pdo_dblib and ext/xml require zend_str_toupper(). So I'll leave them out of the misc PR. Or I can put the introduction of zend_str_toupper() in the misc PR if you prefer. |
c00f865
to
fdc8ff3
Compare
The force push removed the misc bug fixes from the first commit so that it won't conflict with #7511. |
Zend/zend_operators.c
Outdated
const __m128i _a = _mm_set1_epi8('a' - 1); | ||
const __m128i z_ = _mm_set1_epi8('z' + 1); | ||
const __m128i delta = _mm_set1_epi8('a' - 'A'); | ||
do { | ||
__m128i op = _mm_loadu_si128((__m128i*)p); | ||
__m128i gt = _mm_cmpgt_epi8(op, _a); | ||
__m128i lt = _mm_cmplt_epi8(op, z_); | ||
__m128i mingle = _mm_and_si128(gt, lt); | ||
__m128i sub = _mm_and_si128(mingle, delta); | ||
__m128i upper = _mm_sub_epi8(op, sub); | ||
_mm_storeu_si128((__m128i *)q, upper); | ||
p += 16; | ||
q += 16; | ||
} while (p + 16 <= end); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can improve slightly on the p + 16 <= end
by defining z
to be end
rounded down to a multiple of 16 and then simply compare p != z
. Like this, if I haven't made a mistake (I'm quite tired, so likely):
const __m128i _a = _mm_set1_epi8('a' - 1); | |
const __m128i z_ = _mm_set1_epi8('z' + 1); | |
const __m128i delta = _mm_set1_epi8('a' - 'A'); | |
do { | |
__m128i op = _mm_loadu_si128((__m128i*)p); | |
__m128i gt = _mm_cmpgt_epi8(op, _a); | |
__m128i lt = _mm_cmplt_epi8(op, z_); | |
__m128i mingle = _mm_and_si128(gt, lt); | |
__m128i sub = _mm_and_si128(mingle, delta); | |
__m128i upper = _mm_sub_epi8(op, sub); | |
_mm_storeu_si128((__m128i *)q, upper); | |
p += 16; | |
q += 16; | |
} while (p + 16 <= end); | |
// round end down to multiple of 16 | |
static const uintptr_t mask = 15u; | |
unsigned char *z = (unsigned char*)(((uintptr_t)end) & ~mask); | |
const __m128i _a = _mm_set1_epi8('a' - 1); | |
const __m128i z_ = _mm_set1_epi8('z' + 1); | |
const __m128i delta = _mm_set1_epi8('a' - 'A'); | |
do { | |
__m128i op = _mm_loadu_si128((__m128i*)p); | |
__m128i gt = _mm_cmpgt_epi8(op, _a); | |
__m128i lt = _mm_cmplt_epi8(op, z_); | |
__m128i mingle = _mm_and_si128(gt, lt); | |
__m128i sub = _mm_and_si128(mingle, delta); | |
__m128i upper = _mm_sub_epi8(op, sub); | |
_mm_storeu_si128((__m128i *)q, upper); | |
p += 16; | |
q += 16; | |
} while (p != z); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assembly of the inner loop at -O2 is just
.L535:
movdqu -16(%rbx), %xmm1
addq $16, %rbx
addq $16, %rax
movdqa %xmm1, %xmm2
movdqa %xmm1, %xmm0
pcmpgtb %xmm4, %xmm2
pcmpgtb %xmm5, %xmm0
pandn %xmm6, %xmm2
pand %xmm2, %xmm0
pand %xmm3, %xmm0
paddb %xmm1, %xmm0
movups %xmm0, -16(%rax)
cmpq %rbx, %rbp
jnb .L535
%rbx holds p+16 which it uses for the loop comparison, and then it loads from %rbx-16. I don't think there's any overhead associated with the -16 offset -- my understanding is that the addressing modes all happen in the same clock cycle.
It's not correct to mask out the lower bits of the end pointer as you suggest, because the input is not aligned, hence the use of the loadu builtin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not correct to mask out the lower bits of the end pointer as you suggest, because the input is not aligned, hence the use of the loadu builtin.
Right, sorry, was tired. The mask can be done on length
instead, I think? Anyway, the ASM is not as clean as you make it out to be on any compiler I've chosen on godbolt; not sure what's different. Here's an A/B comparison using the latest GCC: https://godbolt.org/z/PM5zT9aTE. Whether I did the optimization correctly or not, you can still see that the ASM isn't as clean as what you posted above and can tinker with flags/compilers to figure out the difference.
That's what I was going to suggest next :) We can land the new toupper implementations separately, both for internal use and as an optimization in |
This is #7521. |
This should become a lot simpler after a rebase :) |
b1883a1
to
585d34c
Compare
585d34c
to
641927c
Compare
static bool can_ct_eval_func_call(zend_string *name, uint32_t num_args, zval **args) {
/* Functions in this list must always produce the same result for the same arguments,
* and have no dependence on global state (such as locales). It is okay if they throw
* or warn on invalid arguments, as we detect this and will discard the evaluation result. */
if (false
|| zend_string_equals_literal(name, "array_diff")
|| zend_string_equals_literal(name, "array_diff_assoc")
|| zend_string_equals_literal(name, "array_diff_key") The earlier build failure was from running out of CI minutes, which was fixed. This can be rebased and tested again |
641927c
to
6fdf447
Compare
Done. |
Is this ready to be merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any problems
https://wiki.php.net/rfc/strtolower-ascii means that these functions no longer depend on the current locale in php 8.2. Before that, this was unsafe to evaluate at compile time. Followup to phpGH-7506
…uation. https://wiki.php.net/rfc/strtolower-ascii means that these functions no longer depend on the current locale in php 8.2. Before that, this was unsafe to evaluate at compile time. Followup to phpGH-7506 Add strcmp/strcasecmp/strtolower/strtoupper functions Add bin2hex/hex2bin and related functions Add array_key_first/array_key_last Update test of garbage collection using strtolower - Make sure that this tests the right thing when opcache is enabled. Update this as an example for future tests.
…uation. https://wiki.php.net/rfc/strtolower-ascii means that these functions no longer depend on the current locale in php 8.2. Before that, this was unsafe to evaluate at compile time. Followup to phpGH-7506 Add strcmp/strcasecmp/strtolower/strtoupper functions Add bin2hex/hex2bin and related functions Add array_key_first/array_key_last Update test of garbage collection using strtolower - Make sure that this tests the right thing when opcache is enabled. Update this as an example for future tests.
…uation. https://wiki.php.net/rfc/strtolower-ascii means that these functions no longer depend on the current locale in php 8.2. Before that, this was unsafe to evaluate at compile time. Followup to phpGH-7506 Add strcmp/strcasecmp/strtolower/strtoupper functions Add bin2hex/hex2bin and related functions Add array_key_first/array_key_last Update test of garbage collection using strtolower - Make sure that this tests the right thing when opcache is enabled. Update this as an example for future tests.
…uation, add functions. (#7780) https://wiki.php.net/rfc/strtolower-ascii means that these functions no longer depend on the current locale in php 8.2. Before that, this was unsafe to evaluate at compile time. Followup to GH-7506 Add strcmp/strcasecmp/strtolower/strtoupper functions Add bin2hex/hex2bin and related functions Update test of garbage collection using strtolower to use something else to create a refcounted string
Implementation of RFC https://wiki.php.net/rfc/strtolower-ascii
I wanted to reuse the existing zend_string_tolower(), but there was no zend_string_toupper to go with it. So I added a family of toupper functions to zend_operators.c to go with the tolower functions.
Instead of splitting out a non-inline zend_tolower_ascii(), I realise now that I could have added extern declarations for tolower_map and toupper_map. I think I will try that tomorrow.
I replaced php_strtolower() with zend_str_tolower() in most places for performance reasons. They don't quite do the same thing -- php_strtolower() returns its argument. So it's not just a matter of
#define php_strtolower zend_str_tolower
.zend_str_toupper_impl() is not called by anything and so is untested. But the similar SSE code in zend_string_toupper_ex is tested.