Skip to content

RFC: Locale-independent case conversion #7506

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 15, 2021

Conversation

tstarling
Copy link
Contributor

Implementation of RFC https://wiki.php.net/rfc/strtolower-ascii

  • Commit 1: Use ASCII case conversion consistently. See the RFC for consequences.
  • Commit 2: Add some tests since if I changed the magic numbers in the SSE implementation of strtoupper, nothing failed
  • Commit 3: Add ctype_tolower() and ctype_toupper() as proposed in the RFC.

I wanted to reuse the existing zend_string_tolower(), but there was no zend_string_toupper to go with it. So I added a family of toupper functions to zend_operators.c to go with the tolower functions.

Instead of splitting out a non-inline zend_tolower_ascii(), I realise now that I could have added extern declarations for tolower_map and toupper_map. I think I will try that tomorrow.

I replaced php_strtolower() with zend_str_tolower() in most places for performance reasons. They don't quite do the same thing -- php_strtolower() returns its argument. So it's not just a matter of #define php_strtolower zend_str_tolower.

zend_str_toupper_impl() is not called by anything and so is untested. But the similar SSE code in zend_string_toupper_ex is tested.

main/rfc1867.c Outdated
@@ -714,7 +713,7 @@ SAPI_API SAPI_POST_HANDLER_FUNC(rfc1867_post_handler) /* {{{ */
int content_type_len = (int)strlen(content_type_dup);
char *content_type_lcase = estrndup(content_type_dup, content_type_len);

php_strtolower(content_type_lcase, content_type_len);
zend_str_tolower(content_type_lcase, content_type_len);
Copy link
Contributor

@TysonAndre TysonAndre Sep 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - I think this may have made sense regardless for http headers but outside of the scope of this RFC

e.g. this is done in https://golang.org/src/net/http/internal/ascii/print.go - the rfc from 1995 https://datatracker.ietf.org/doc/html/rfc1867 makes no mention of locale so I think it was assuming everything was using the english locale

More of an issue in ext/standard/http_fopen_wrapper.c for parsing request headers than here

EDIT: Though "boundary" would be the same no matter what locale for any locale I know of, so not a bug in this file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RFC 1867 says "Note that mime headers are generally required to consist only of 7-bit data in the US-ASCII character set."

nikic added a commit that referenced this pull request Sep 23, 2021
Headers should not be processed in a locale-depdendent fashion.
Switch from upper to lowercasing because that's the standard for
PHP and we provide an ASCII implementation of this operation.

This is adapted from GH-7506.
@nikic
Copy link
Member

nikic commented Sep 23, 2021

All of the changes to "miscellaneous" extensions you've made here look like plain bug fixes to me -- we should land these independently of this RFC, and I think those should also go into PHP-8.1. Would you mind submitting a PR with just those changes?

I landed the mbstring change that was using toupper in 46315de, but the rest looks like it should be directly applicable without the new toupper functionality.

This will help reduce the scope of the RFC itself.

@tstarling
Copy link
Contributor Author

All of the changes to "miscellaneous" extensions you've made here look like plain bug fixes to me -- we should land these independently of this RFC, and I think those should also go into PHP-8.1. Would you mind submitting a PR with just those changes?

Sure, will do.

@tstarling
Copy link
Contributor Author

The changes to ext/pdo_dblib and ext/xml require zend_str_toupper(). So I'll leave them out of the misc PR. Or I can put the introduction of zend_str_toupper() in the misc PR if you prefer.

@tstarling
Copy link
Contributor Author

The force push removed the misc bug fixes from the first commit so that it won't conflict with #7511.

Comment on lines 2718 to 2731
const __m128i _a = _mm_set1_epi8('a' - 1);
const __m128i z_ = _mm_set1_epi8('z' + 1);
const __m128i delta = _mm_set1_epi8('a' - 'A');
do {
__m128i op = _mm_loadu_si128((__m128i*)p);
__m128i gt = _mm_cmpgt_epi8(op, _a);
__m128i lt = _mm_cmplt_epi8(op, z_);
__m128i mingle = _mm_and_si128(gt, lt);
__m128i sub = _mm_and_si128(mingle, delta);
__m128i upper = _mm_sub_epi8(op, sub);
_mm_storeu_si128((__m128i *)q, upper);
p += 16;
q += 16;
} while (p + 16 <= end);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can improve slightly on the p + 16 <= end by defining z to be end rounded down to a multiple of 16 and then simply compare p != z. Like this, if I haven't made a mistake (I'm quite tired, so likely):

Suggested change
const __m128i _a = _mm_set1_epi8('a' - 1);
const __m128i z_ = _mm_set1_epi8('z' + 1);
const __m128i delta = _mm_set1_epi8('a' - 'A');
do {
__m128i op = _mm_loadu_si128((__m128i*)p);
__m128i gt = _mm_cmpgt_epi8(op, _a);
__m128i lt = _mm_cmplt_epi8(op, z_);
__m128i mingle = _mm_and_si128(gt, lt);
__m128i sub = _mm_and_si128(mingle, delta);
__m128i upper = _mm_sub_epi8(op, sub);
_mm_storeu_si128((__m128i *)q, upper);
p += 16;
q += 16;
} while (p + 16 <= end);
// round end down to multiple of 16
static const uintptr_t mask = 15u;
unsigned char *z = (unsigned char*)(((uintptr_t)end) & ~mask);
const __m128i _a = _mm_set1_epi8('a' - 1);
const __m128i z_ = _mm_set1_epi8('z' + 1);
const __m128i delta = _mm_set1_epi8('a' - 'A');
do {
__m128i op = _mm_loadu_si128((__m128i*)p);
__m128i gt = _mm_cmpgt_epi8(op, _a);
__m128i lt = _mm_cmplt_epi8(op, z_);
__m128i mingle = _mm_and_si128(gt, lt);
__m128i sub = _mm_and_si128(mingle, delta);
__m128i upper = _mm_sub_epi8(op, sub);
_mm_storeu_si128((__m128i *)q, upper);
p += 16;
q += 16;
} while (p != z);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assembly of the inner loop at -O2 is just

.L535:
	movdqu	-16(%rbx), %xmm1
	addq	$16, %rbx
	addq	$16, %rax
	movdqa	%xmm1, %xmm2
	movdqa	%xmm1, %xmm0
	pcmpgtb	%xmm4, %xmm2
	pcmpgtb	%xmm5, %xmm0
	pandn	%xmm6, %xmm2
	pand	%xmm2, %xmm0
	pand	%xmm3, %xmm0
	paddb	%xmm1, %xmm0
	movups	%xmm0, -16(%rax)
	cmpq	%rbx, %rbp
	jnb	.L535

%rbx holds p+16 which it uses for the loop comparison, and then it loads from %rbx-16. I don't think there's any overhead associated with the -16 offset -- my understanding is that the addressing modes all happen in the same clock cycle.

It's not correct to mask out the lower bits of the end pointer as you suggest, because the input is not aligned, hence the use of the loadu builtin.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not correct to mask out the lower bits of the end pointer as you suggest, because the input is not aligned, hence the use of the loadu builtin.

Right, sorry, was tired. The mask can be done on length instead, I think? Anyway, the ASM is not as clean as you make it out to be on any compiler I've chosen on godbolt; not sure what's different. Here's an A/B comparison using the latest GCC: https://godbolt.org/z/PM5zT9aTE. Whether I did the optimization correctly or not, you can still see that the ASM isn't as clean as what you posted above and can tinker with flags/compilers to figure out the difference.

@nikic
Copy link
Member

nikic commented Sep 24, 2021

The changes to ext/pdo_dblib and ext/xml require zend_str_toupper(). So I'll leave them out of the misc PR. Or I can put the introduction of zend_str_toupper() in the misc PR if you prefer.

That's what I was going to suggest next :) We can land the new toupper implementations separately, both for internal use and as an optimization in strtoupper(), same as we currently do in strtolower().

@tstarling
Copy link
Contributor Author

That's what I was going to suggest next :) We can land the new toupper implementations separately, both for internal use and as an optimization in strtoupper(), same as we currently do in strtolower().

This is #7521.

@nikic
Copy link
Member

nikic commented Sep 29, 2021

This should become a lot simpler after a rebase :)

@TysonAndre
Copy link
Contributor

. strtolower() and strtoupper() are no longer locale-sensitive. They now
perform ASCII case conversion. Use mb_strtolower() if you want localized
case conversion. Similarly, stristr, stripos, strripos, lcfirst, ucfirst,
ucwords, str_ireplace, array_change_key_case and sorting with SORT_FLAG_CASE
use ASCII case conversion.

  1. It may be clearer to qualify that with ASCII case conversion as if the locale were "C". Or ASCII case conversion without localization.
  2. As a followup that can be done outside this PR, Zend/Optimizer/sccp.c can also be changed for the functions that now use ASCII case conversion, for can_ct_eval_func_call
static bool can_ct_eval_func_call(zend_string *name, uint32_t num_args, zval **args) {
	/* Functions in this list must always produce the same result for the same arguments,
	 * and have no dependence on global state (such as locales). It is okay if they throw
	 * or warn on invalid arguments, as we detect this and will discard the evaluation result. */
	if (false
		|| zend_string_equals_literal(name, "array_diff")
		|| zend_string_equals_literal(name, "array_diff_assoc")
		|| zend_string_equals_literal(name, "array_diff_key")

The earlier build failure was from running out of CI minutes, which was fixed. This can be rebased and tested again

@tstarling
Copy link
Contributor Author

tstarling commented Dec 14, 2021

  1. It may be clearer to qualify that with ASCII case conversion as if the locale were "C". Or ASCII case conversion without localization.

Done.

@TysonAndre
Copy link
Contributor

Is this ready to be merged?

Copy link
Member

@dstogov dstogov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any problems

@TysonAndre TysonAndre merged commit 8eee0d6 into php:master Dec 15, 2021
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request Dec 15, 2021
https://wiki.php.net/rfc/strtolower-ascii means that these functions no longer
depend on the current locale in php 8.2. Before that, this was unsafe to
evaluate at compile time.

Followup to phpGH-7506
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request Dec 17, 2021
…uation.

https://wiki.php.net/rfc/strtolower-ascii means that these functions no longer
depend on the current locale in php 8.2. Before that, this was unsafe to
evaluate at compile time.

Followup to phpGH-7506

Add strcmp/strcasecmp/strtolower/strtoupper functions

Add bin2hex/hex2bin and related functions

Add array_key_first/array_key_last

Update test of garbage collection using strtolower

- Make sure that this tests the right thing when opcache is enabled.
  Update this as an example for future tests.
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request Dec 17, 2021
…uation.

https://wiki.php.net/rfc/strtolower-ascii means that these functions no longer
depend on the current locale in php 8.2. Before that, this was unsafe to
evaluate at compile time.

Followup to phpGH-7506

Add strcmp/strcasecmp/strtolower/strtoupper functions

Add bin2hex/hex2bin and related functions

Add array_key_first/array_key_last

Update test of garbage collection using strtolower

- Make sure that this tests the right thing when opcache is enabled.
  Update this as an example for future tests.
TysonAndre added a commit to TysonAndre/php-src that referenced this pull request Dec 19, 2021
…uation.

https://wiki.php.net/rfc/strtolower-ascii means that these functions no longer
depend on the current locale in php 8.2. Before that, this was unsafe to
evaluate at compile time.

Followup to phpGH-7506

Add strcmp/strcasecmp/strtolower/strtoupper functions

Add bin2hex/hex2bin and related functions

Add array_key_first/array_key_last

Update test of garbage collection using strtolower

- Make sure that this tests the right thing when opcache is enabled.
  Update this as an example for future tests.
TysonAndre added a commit that referenced this pull request Dec 20, 2021
…uation, add functions. (#7780)

https://wiki.php.net/rfc/strtolower-ascii means that these functions no longer
depend on the current locale in php 8.2. Before that, this was unsafe to
evaluate at compile time.

Followup to GH-7506

Add strcmp/strcasecmp/strtolower/strtoupper functions

Add bin2hex/hex2bin and related functions

Update test of garbage collection using strtolower to use something else to create a refcounted string
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants