mb_strtoupper converts non-alphabetic characters in Shift_JIS (CP932) #12412

Open
@masakielastic

Description

mb_strtoupper converts non-alphabetic characters in Shift_JIS (CP932)

var_dump(
    "\x81\xE0" === mb_strtoupper("\x87\x90", 'cp932')
);

Both \x81\xE0 and \x87\x90 encode ≒ (U+2252: Approximately Equal to or the Image Of).

var_dump(
    "\u{2252}" === mb_convert_encoding("\x81\xE0", 'utf-8', 'cp932'),
    "\u{2252}" === mb_convert_encoding("\x87\x90", 'utf-8', 'cp932')
);

\x87\x90 is one of the NEC special characters and was registered as a duplicate for historical reasons.

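The unintended conversion is caused by a round trip between Shift_JIS and Unicode: both byte sequences decode to U+2252, but encoding U+2252 back to CP932 yields \x81\xE0, so \x87\x90 is not preserved.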

var_dump(
    "\x81\xE0" === roundtrip("\x87\x90", 'cp932'),
    "\x81\xE0" === roundtrip("\x81\xE0", 'cp932')
);

function roundtrip($char, $enc) {
    return mb_convert_encoding(mb_convert_encoding($char, 'utf-8', $enc), $enc, 'utf-8');
}

As far as I know, 398 characters in Shift_JIS are affected by round-trip conversion. The test code is here.
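
The linked test code is not reproduced here. As a rough illustration only, a scan for affected characters might look like the sketch below. The function names `roundtrip_bytes` and `find_unstable_pairs` are hypothetical, the lead-byte and trail-byte ranges are the usual Shift_JIS double-byte ranges, and the sketch makes no claim about reproducing the count of 398.

```php
<?php
// Hypothetical helper: decode to UTF-8, then encode back to the source encoding.
function roundtrip_bytes(string $bytes, string $enc): string {
    return mb_convert_encoding(mb_convert_encoding($bytes, 'UTF-8', $enc), $enc, 'UTF-8');
}

// Hypothetical scan: collect double-byte sequences that are valid in $enc
// but come back as different bytes after the round trip.
function find_unstable_pairs(string $enc = 'CP932'): array {
    $unstable = [];
    $leads = array_merge(range(0x81, 0x9F), range(0xE0, 0xFC));
    foreach ($leads as $lead) {
        foreach (range(0x40, 0xFC) as $trail) {
            if ($trail === 0x7F) {
                continue; // 0x7F is never a trail byte in Shift_JIS
            }
            $bytes = chr($lead) . chr($trail);
            if (!mb_check_encoding($bytes, $enc)) {
                continue; // skip unassigned or invalid sequences
            }
            $back = roundtrip_bytes($bytes, $enc);
            if ($back !== $bytes) {
                // The '0x' prefix keeps array keys as strings.
                $unstable['0x' . strtoupper(bin2hex($bytes))] = '0x' . strtoupper(bin2hex($back));
            }
        }
    }
    return $unstable;
}

$pairs = find_unstable_pairs('CP932');
printf("%d double-byte sequences change after a round trip\n", count($pairs));
```

Each key in the result is an input sequence and each value is what it becomes after the round trip, so the \x87\x90 case above should appear as `0x8790 => 0x81E0`.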

The same problem applies to mb_scrub, mb_strtolower, and mb_convert_case.

var_dump(
    "\x81\xE0" === mb_scrub("\x87\x90", 'cp932'),
    "\x81\xE0" === mb_strtolower("\x87\x90", 'cp932'),
    "\x81\xE0" === mb_strtoupper("\x87\x90", 'cp932'),
    "\x81\xE0" === mb_convert_case("\x87\x90", MB_CASE_TITLE, 'cp932')
);
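
To check other byte sequences or other PHP versions, the comparison above can be generalized. This is a minimal sketch. The helper name `altered_by` is hypothetical, and the result depends on the installed mbstring version, so no expected output is shown.

```php
<?php
// Hypothetical helper: report which mbstring functions change the given bytes.
// For input with no cased letters and no invalid sequences, every entry
// would ideally be false.
function altered_by(string $bytes, string $enc = 'CP932'): array {
    $candidates = [
        'mb_scrub'        => fn(string $s): string => mb_scrub($s, $enc),
        'mb_strtolower'   => fn(string $s): string => mb_strtolower($s, $enc),
        'mb_strtoupper'   => fn(string $s): string => mb_strtoupper($s, $enc),
        'mb_convert_case' => fn(string $s): string => mb_convert_case($s, MB_CASE_TITLE, $enc),
    ];
    $report = [];
    foreach ($candidates as $name => $fn) {
        // true means the function returned different bytes than it was given
        $report[$name] = $fn($bytes) !== $bytes;
    }
    return $report;
}

var_dump(altered_by("\x87\x90"));
```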

PHP Version

PHP 8.2.10

Operating System

Debian 11.7 (Google ChromeOS 117.0)
