Constructors of intl extension formatter classes don't canonicalise locale strings #11942

Open
@lpd-au

Description

Problem

The ICU library has its own locale ID format. A locale ID passed to ICU is expected to have been canonicalised using one of two canonicalisation operations. Level 1 performs minor, isolated changes on locale IDs that are already in ICU format, such as standardising capitalisation. Level 2 may make major changes to the locale string and is designed to translate POSIX/XPG formats as well as nonstandard ICU locale IDs. The ICU User Guide says this:

The recommended procedure for client code using locale IDs from outside sources (e.g., POSIX, user input, etc.) is to pass such “foreign IDs” through level 2 canonicalization before use.

PHP's older locale functions (e.g. setlocale) accept only POSIX/XPG format locales, and PHP's modern Locale class functions "are tolerant of" both POSIX/XPG and BCP 47/RFC 4646 formats. However, none of the intl extension Formatter classes perform level 2 canonicalisation by default, so passing a POSIX/XPG locale silently produces broken behaviour with no error. Somewhat counter-intuitively, calling the getLocale method of a formatter created with a BCP 47/RFC 4646 format locale returns a value with an _ separator (per ICU, but also POSIX/XPG format) rather than a - separator (per BCP 47/RFC 4646).

PHP does expose a level 2 canonicalisation function as Locale::canonicalize, but it's undocumented and, crucially, not referenced from any of the pages for functions that accept a locale ID as a parameter. Discoverability is currently so low that, unless you're intimately familiar with ICU as a library, the function is as good as non-existent to PHP developers.
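For reference, a minimal sketch of what calling Locale::canonicalize on the three locale ID styles used in this report looks like. It assumes the intl extension is loaded; the exact strings returned depend on the ICU version PHP was built against, so no output is asserted here.

```php
<?php
// The three input styles used in the test code of this report.
$inputs = [
    'pt',          // bare language code
    'pt-PT',       // BCP 47/RFC 4646
    'pt_PT.utf8',  // POSIX/XPG, including a codeset suffix
];

foreach ($inputs as $input) {
    // Locale::canonicalize() wraps ICU's level 2 canonicalisation
    // (uloc_canonicalize). It returns null if the input is too long.
    var_dump(Locale::canonicalize($input));
}
```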

Here are two examples of the broken behaviour:

IntlDateFormatter

Test code:

<?php
var_dump((new IntlDateFormatter('pt', timezone: 'Europe/Amsterdam'))->getLocale());
var_dump((new IntlDateFormatter('pt', timezone: 'Europe/Amsterdam'))->format(1691585260));
var_dump((new IntlDateFormatter('pt-PT', timezone: 'Europe/Amsterdam'))->getLocale());         // BCP 47/RFC 4646
var_dump((new IntlDateFormatter('pt-PT', timezone: 'Europe/Amsterdam'))->format(1691585260));
var_dump((new IntlDateFormatter('pt_PT.utf8', timezone: 'Europe/Amsterdam'))->getLocale());    // POSIX/XPG
var_dump((new IntlDateFormatter('pt_PT.utf8', timezone: 'Europe/Amsterdam'))->format(1691585260));

Actual:

string(2) "pt"
string(79) "quarta-feira, 9 de agosto de 2023 14:47:40 Horário de Verão da Europa Central"
string(5) "pt_PT"
string(79) "quarta-feira, 9 de agosto de 2023 às 14:47:40 Hora de verão da Europa Central"
string(2) "pt"
string(79) "quarta-feira, 9 de agosto de 2023 14:47:40 Horário de Verão da Europa Central"

Expected:

string(2) "pt"
string(79) "quarta-feira, 9 de agosto de 2023 14:47:40 Horário de Verão da Europa Central"
string(5) "pt_PT"
string(79) "quarta-feira, 9 de agosto de 2023 às 14:47:40 Hora de verão da Europa Central"
string(5) "pt_PT"
string(79) "quarta-feira, 9 de agosto de 2023 às 14:47:40 Hora de verão da Europa Central"

NumberFormatter

  • NumberFormatter::__construct seemingly relies on automatic level 1 canonicalisation. At least, as a novice in this codebase, I couldn't find any relevant calls to either Locale::createFromName or uloc_canonicalize (though I may have missed them).
  • The documentation doesn't specify which locale formats are accepted but gives the POSIX/XPG style en_CA as an example.

Test code:

<?php
var_dump((new NumberFormatter('pt', NumberFormatter::CURRENCY))->getLocale());
var_dump((new NumberFormatter('pt', NumberFormatter::CURRENCY))->format(10000));
var_dump((new NumberFormatter('pt-PT', NumberFormatter::CURRENCY))->getLocale());              // BCP 47/RFC 4646
var_dump((new NumberFormatter('pt-PT', NumberFormatter::CURRENCY))->format(10000));
var_dump((new NumberFormatter('pt_PT.utf8', NumberFormatter::CURRENCY))->getLocale());         // POSIX/XPG
var_dump((new NumberFormatter('pt_PT.utf8', NumberFormatter::CURRENCY))->format(10000));

Actual:

string(2) "pt"
string(11) "¤10.000,00"
string(5) "pt_PT"
string(15) "10 000,00 €"
string(2) "pt"
string(12) "€10.000,00"

Expected:

string(2) "pt"
string(11) "¤10.000,00"
string(5) "pt_PT"
string(15) "10 000,00 €"
string(5) "pt_PT"
string(15) "10 000,00 €"

Suggested Solutions

  1. Always call uloc_canonicalize on non-empty PHP developer input when creating Formatters.
  2. Always call uloc_canonicalize on PHP developer input when creating Formatters and throw an error if it differs from the output of Locale::createFromName.
  3. Move this to be a documentation issue, document Locale::canonicalize, specify which locale formats are accepted out of the box by the Formatter constructors and add a note pointing to Locale::canonicalize for other formats.
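Until one of these is adopted, option 1 can be approximated in userland. The following is only a sketch under stated assumptions: create_number_formatter is a hypothetical helper name, not part of PHP or the intl extension, and it simply runs non-empty input through Locale::canonicalize before constructing the formatter.

```php
<?php
/**
 * Hypothetical userland helper (not part of the intl extension):
 * applies level 2 canonicalisation to non-empty locale input before
 * creating a NumberFormatter, mirroring suggested solution 1.
 */
function create_number_formatter(string $locale, int $style): NumberFormatter
{
    if ($locale !== '') {
        // Locale::canonicalize() returns null when the input is too long;
        // in that case keep the original string and let the constructor
        // deal with it.
        $locale = Locale::canonicalize($locale) ?? $locale;
    }

    return new NumberFormatter($locale, $style);
}

// Usage with the POSIX/XPG style input from the test code above.
$formatter = create_number_formatter('pt_PT.utf8', NumberFormatter::CURRENCY);
var_dump($formatter->getLocale());
var_dump($formatter->format(10000));
```

The same wrapping pattern would apply to IntlDateFormatter and the other Formatter classes; doing it inside the constructors themselves would remove the need for every caller to know about Locale::canonicalize.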

PHP Version

PHP 8.2.9

Operating System

Linux
