Description
mb_strtoupper changes the byte sequence of some non-alphabetic characters in Shift_JIS (CP932):
var_dump(
"\x81\xE0" === mb_strtoupper("\x87\x90", 'cp932')
);
\x81\xE0 and \x87\x90 both represent ≒ (U+2252: Approximately Equal to or the Image Of).
var_dump(
"\u{2252}" === mb_convert_encoding("\x81\xE0", 'utf-8', 'cp932'),
"\u{2252}" === mb_convert_encoding("\x87\x90", 'utf-8', 'cp932')
);
\x87\x90 is one of the NEC special characters and was registered as a duplicate for historical reasons.
The unintended conversion is caused by a round-trip conversion between Shift_JIS and Unicode: both byte sequences map to U+2252, but U+2252 maps back only to \x81\xE0.
var_dump(
"\x81\xE0" === roundtrip("\x87\x90", 'cp932'),
"\x81\xE0" === roundtrip("\x81\xE0", 'cp932')
);
function roundtrip($char, $enc) {
    // Convert to UTF-8 and back to the original encoding.
    return mb_convert_encoding(mb_convert_encoding($char, 'utf-8', $enc), $enc, 'utf-8');
}
As far as I know, 398 characters in Shift_JIS are affected by round-trip conversion. The test code is here.
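For illustration, here is a minimal sketch of how the affected double-byte sequences could be enumerated. It is not the linked test code: the helper name find_roundtrip_changes and the "0x...." key format are hypothetical, and the count it reports may differ from 398 depending on the PHP version and mapping table in use.

```php
<?php
// Hypothetical helper: returns the CP932 double-byte sequences whose bytes
// differ (or are lost) after a CP932 -> UTF-8 -> CP932 round trip.
function find_roundtrip_changes(string $enc = 'CP932'): array
{
    // Drop unconvertible input instead of emitting "?", so that unassigned
    // code points can be detected; the previous setting is restored below.
    $previous = mb_substitute_character();
    mb_substitute_character('none');

    $changed = [];
    // Lead bytes of double-byte Shift_JIS characters: 0x81-0x9F and 0xE0-0xFC.
    $leads = array_merge(range(0x81, 0x9F), range(0xE0, 0xFC));
    // Trail bytes: 0x40-0x7E and 0x80-0xFC.
    $trails = array_merge(range(0x40, 0x7E), range(0x80, 0xFC));

    foreach ($leads as $lead) {
        foreach ($trails as $trail) {
            $char = chr($lead) . chr($trail);
            $utf8 = mb_convert_encoding($char, 'UTF-8', $enc);
            if (mb_strlen($utf8, 'UTF-8') !== 1) {
                continue; // skip sequences that do not decode to one character
            }
            $back = mb_convert_encoding($utf8, $enc, 'UTF-8');
            if ($back !== $char) {
                $key = '0x' . strtoupper(bin2hex($char));
                $changed[$key] = '0x' . strtoupper(bin2hex($back));
            }
        }
    }

    mb_substitute_character($previous);
    return $changed;
}

$changed = find_roundtrip_changes();
printf("%d sequences changed\n", count($changed));
```

Each key in the returned array is an original byte sequence and each value is what it becomes after the round trip, so $changed['0x8790'] holds the replacement for \x87\x90.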
The same problem applies to mb_scrub, mb_strtolower, and mb_convert_case:
var_dump(
"\x81\xE0" === mb_scrub("\x87\x90", 'cp932'),
"\x81\xE0" === mb_strtolower("\x87\x90", 'cp932'),
"\x81\xE0" === mb_strtoupper("\x87\x90", 'cp932'),
"\x81\xE0" === mb_convert_case("\x87\x90", MB_CASE_TITLE, 'cp932')
);
PHP Version
PHP 8.2.10
Operating System
Debian 11.7 (Google ChromeOS 117.0)