Description
Problem
The ICU library has its own locale format. A locale ID passed to ICU is expected to have been canonicalised using one of two canonicalisation operations. Level 1 is intended to perform minor, isolated changes on locales that are already in ICU format, such as standardising capitalisation. Level 2 may make major changes to the locale string and is designed to translate POSIX/XPG formats as well as nonstandard ICU locale IDs. The ICU User Guide says:
The recommended procedure for client code using locale IDs from outside sources (e.g., POSIX, user input, etc.) is to pass such “foreign IDs” through level 2 canonicalization before use.
PHP's older locale functions (e.g. setlocale) accept only POSIX/XPG format locales, and PHP's modern Locale class functions "are tolerant of" both POSIX/XPG and BCP 47/RFC 4646 formats. However, none of the intl extension Formatter classes perform level 2 canonicalisation by default, so passing a POSIX/XPG locale results in broken behaviour but no error. Confusingly, calling the getLocale method of a formatter created with a BCP 47/RFC 4646 format locale returns a value with an _ separator (per ICU, but also POSIX/XPG format) rather than a - separator (per BCP 47/RFC 4646).
PHP does expose a level 2 canonicalisation function as Locale::canonicalize, but it's undocumented and, crucially, not referenced from any of the pages that accept locale IDs as a parameter. Discoverability is so low that, unless you're intimately familiar with ICU as a library, the function is as good as non-existent to PHP developers.
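For illustration, here is a minimal sketch of the userland workaround this implies: pass the locale ID through Locale::canonicalize before constructing a formatter. The input strings are the same three used in the test code in this report. I haven't listed the output because it depends on the ICU version and data that PHP is built against.

```php
<?php
// Sketch only: normalise "foreign" locale IDs with level 2 canonicalisation
// (Locale::canonicalize) before handing them to a formatter.
$inputs = ['pt', 'pt-PT', 'pt_PT.utf8'];

foreach ($inputs as $input) {
    // Locale::canonicalize() returns null on failure, so fall back to the raw input.
    $canonical = Locale::canonicalize($input) ?? $input;

    $formatter = new NumberFormatter($canonical, NumberFormatter::CURRENCY);

    // Print: raw input, canonical ID, and the locale the formatter reports.
    printf("%-12s => %-6s => %s\n", $input, $canonical, $formatter->getLocale());
}
```

With this in place, the POSIX/XPG input should resolve to the same formatter locale as the BCP 47 input, rather than silently falling back to pt.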
Here are two examples:
IntlDateFormatter
- IntlDateFormatter::__construct seemingly performs explicit level 1 canonicalisation.
- The documentation doesn't specify which locale formats are accepted, but gives examples using a mix of - and _ as the separator.
Test code:
<?php
var_dump((new IntlDateFormatter('pt', timezone: 'Europe/Amsterdam'))->getLocale());
var_dump((new IntlDateFormatter('pt', timezone: 'Europe/Amsterdam'))->format(1691585260));
var_dump((new IntlDateFormatter('pt-PT', timezone: 'Europe/Amsterdam'))->getLocale()); // BCP 47/RFC 4646
var_dump((new IntlDateFormatter('pt-PT', timezone: 'Europe/Amsterdam'))->format(1691585260));
var_dump((new IntlDateFormatter('pt_PT.utf8', timezone: 'Europe/Amsterdam'))->getLocale()); // POSIX/XPG
var_dump((new IntlDateFormatter('pt_PT.utf8', timezone: 'Europe/Amsterdam'))->format(1691585260));
Actual:
string(2) "pt"
string(79) "quarta-feira, 9 de agosto de 2023 14:47:40 Horário de Verão da Europa Central"
string(5) "pt_PT"
string(79) "quarta-feira, 9 de agosto de 2023 às 14:47:40 Hora de verão da Europa Central"
string(2) "pt"
string(79) "quarta-feira, 9 de agosto de 2023 14:47:40 Horário de Verão da Europa Central"
Expected:
string(2) "pt"
string(79) "quarta-feira, 9 de agosto de 2023 14:47:40 Horário de Verão da Europa Central"
string(5) "pt_PT"
string(79) "quarta-feira, 9 de agosto de 2023 às 14:47:40 Hora de verão da Europa Central"
string(5) "pt_PT"
string(79) "quarta-feira, 9 de agosto de 2023 às 14:47:40 Hora de verão da Europa Central"
NumberFormatter
- NumberFormatter::__construct seemingly relies on automatic level 1 canonicalisation, or at least I, as a novice, couldn't find any relevant calls to either Locale::createFromName or uloc_canonicalize (I may have missed them).
- The documentation doesn't specify which locale formats are accepted, but gives the POSIX/XPG style en_CA as an example.
Test code:
<?php
var_dump((new NumberFormatter('pt', NumberFormatter::CURRENCY))->getLocale());
var_dump((new NumberFormatter('pt', NumberFormatter::CURRENCY))->format(10000));
var_dump((new NumberFormatter('pt-PT', NumberFormatter::CURRENCY))->getLocale()); // BCP 47/RFC 4646
var_dump((new NumberFormatter('pt-PT', NumberFormatter::CURRENCY))->format(10000));
var_dump((new NumberFormatter('pt_PT.utf8', NumberFormatter::CURRENCY))->getLocale()); // POSIX/XPG
var_dump((new NumberFormatter('pt_PT.utf8', NumberFormatter::CURRENCY))->format(10000));
Actual:
string(2) "pt"
string(11) "¤10.000,00"
string(5) "pt_PT"
string(15) "10 000,00 €"
string(2) "pt"
string(12) "€10.000,00"
Expected:
string(2) "pt"
string(11) "¤10.000,00"
string(5) "pt_PT"
string(15) "10 000,00 €"
string(5) "pt_PT"
string(15) "10 000,00 €"
Suggested Solutions
- Always call uloc_canonicalize on non-empty PHP developer input when creating Formatters.
- Always call uloc_canonicalize on PHP developer input when creating Formatters and throw an error if it differs from the output of Locale::createFromName.
- Move this to be a documentation issue: document Locale::canonicalize, specify which locale formats are accepted out of the box by the Formatter constructors, and add a note pointing to Locale::canonicalize for other formats.
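To make the first suggestion concrete, here is a rough userland approximation of "canonicalise non-empty input". The helper name canonicalFormatterLocale is hypothetical and not part of PHP or the intl extension. The real change would live in the extension's C/C++ code and call uloc_canonicalize directly, so treat this only as a sketch of the intended behaviour.

```php
<?php
/**
 * Hypothetical helper (not part of PHP): apply level 2 canonicalisation to
 * non-empty locale IDs, leaving '' untouched so that it can keep whatever
 * "use the default locale" meaning the formatter gives it.
 */
function canonicalFormatterLocale(string $locale): string
{
    if ($locale === '') {
        return $locale;
    }

    // Locale::canonicalize() returns null on failure; keep the raw input then.
    return Locale::canonicalize($locale) ?? $locale;
}

// Usage: the POSIX/XPG input from the test code above, canonicalised first.
$dateFormatter = new IntlDateFormatter(
    canonicalFormatterLocale('pt_PT.utf8'),
    timezone: 'Europe/Amsterdam'
);
echo $dateFormatter->getLocale(), "\n";
echo $dateFormatter->format(1691585260), "\n";
```

Skipping empty input mirrors the "non-empty" wording of the first suggestion. The second suggestion (throwing when the level 1 and level 2 results differ) is not sketched here, because userland PHP has no direct equivalent of ICU's Locale::createFromName.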
PHP Version
PHP 8.2.9
Operating System
Linux