Skip to content

Use locale-independent alternatives to isalpha/isalnum/isctrl #7802

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

TysonAndre
Copy link
Contributor

@TysonAndre TysonAndre commented Dec 21, 2021

  • Avoid registering/detecting stream wrappers in locale-independent ways. (low importance, hopefully, since non-ascii wrappers would still have to be registered by an application to be used. This would just result in notices in some locales and typically failing attempts to read files in others)

  • Avoid this in https://www.php.net/manual/en/function.finfo-file.php for libmagic for detecting magic file headers.

    I don't believe these should be locale dependent - note that ext/fileinfo code is only used in https://www.php.net/manual/en/function.finfo-file.php and other modules are unaffected

  • Avoid locale dependence for http/ftp/network protocols.

  • Avoid locale dependence for Windows drive letter names in zend_virtual_cwd

  • Make parse_url stop depending on locale
    Related to https://bugs.php.net/bug.php?id=52923
    iscntrl is locale-dependent which seems to corrupt certain bytes of utf-8 codepoints in the original bug report when replacing control characters with _ (not completely sure)

  • Make ini file line parsing locale-independent with respect to these functions

  • mb_send_mail is acting on individual bytes, so this seems like it would corrupt the To header

  • other changes

  • filter_var FILTER_VALIDATE_DOMAIN in ext/filter/logical_filters.c

  • TODO: isprint is also different in de_DE - but use in ext/fileinfo is limited to creating human-readable strings. Maybe this should always octally escape things that are non-printable in the C locale, if this assumes multi-byte codepoint representations such as utf-8 is used to display in many locales

Somewhat related to https://wiki.php.net/rfc/strtolower-ascii
but I don't think most of these should have been locale-dependent in the first
place - the code may not have considered locales

Some of these seem like bugs that should be fixed in earlier php versions as well

Note that this affects the non-utf-8 (unicode codepoints represented as single-byte) versions of some locales, e.g. 'de_DE' but not 'de_DE.UTF-8'

E.g. on Linux, setlocale(LC_ALL, 'de_DE');
(if the locale is installed and it succeeds)
will have some values for alpha/cntrl in the range 128-256 where the C locale
has no values.

To avoid this locale-dependence in older php versions,
applications can set setlocale(LC_CTYPE, 'C').

@TysonAndre TysonAndre requested review from nikic and cmb69 December 21, 2021 03:13
- Avoid registering/detecting stream wrappers in locale-independent ways.
- Avoid this in libmagic for detecting magic file headers.

  I don't believe these should be locale dependent.
- Avoid locale dependence for http/ftp/network protocols.
- Avoid locale dependence for Windows drive letter names in zend_virtual_cwd
- Make parse_url stop depending on locale
  Related to https://bugs.php.net/bug.php?id=52923
  iscntrl is locale-dependent which seems to corrupt certain bytes.

Somewhat related to https://wiki.php.net/rfc/strtolower-ascii
but I don't think most of these should have been locale-dependent in the first
place - the code may not have considered locales

E.g. on Linux, `setlocale(LC_ALL, 'de_DE');`
(if the locale is installed and it succeeds)
will have some values for alpha/cntrl in the range 128-256 where the C locale
has no values.

To avoid this locale-dependence in older php versions,
applications can set `setlocale(LC_CTYPE, 'C')`.
@TysonAndre TysonAndre force-pushed the fix-isalpha-locale-dependence branch from 9eed62a to d967be3 Compare December 21, 2021 15:18
Copy link
Member

@cmb69 cmb69 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!

On a relatively quick glance, I think this is mostly correct, but I suggest to split this into multiple PRs which can be evaluated on their own, to check whether they actually are more like bug fixes (and as such might be applied to stable branches), or whether they would be too much of a BC break even for PHP 8.2 (in which case a way forward may be considered).

For instance, the changes to gdImageXbmCtx() should better not be done in the bundled libgd for parity with system libgd. Instead that might be provided upstream (of course, they couldn't use zend_isalnum_ascii()). And while I assume that this change wouldn't affect many use cases, it might affect a few (assuming that XBM is still a thing, what may be an invalid assumption, but in that case support for XBM could be dropped altogether); the respective old comment doesn't look valid for stand-alone XBMs.

On the other hand, the changes to ext/filter look more like a bug fix.

And regarding libmagic: I'm somewhat unsure whether it is a good idea to further patch this library, or whether it wouldn't make more sense to try to push appropriate fixes and improvements (most notably, support for custom memory allocators) upstream.

@ramsey
Copy link
Member

ramsey commented May 27, 2022

Note to @php/release-managers : zend_isalpha_map and zend_isalpha_ascii were split out into #7820, which targets PHP-8.0.

What's the current status of this PR?

@ramsey ramsey added this to the PHP 8.2 milestone May 27, 2022
@github-actions
Copy link

There has not been any recent activity in this PR. It will automatically be closed in 7 days if no further action is taken.

@github-actions github-actions bot added the Stale label Jul 27, 2022
@github-actions github-actions bot closed this Aug 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants