Description
The purpose of this issue is to track tasks related to the effort to refactor string packages handling grapheme clusters to use "base" packages which handle more specialized use cases.
Overview
String packages, such as @stdlib/string/first, have several possible "modes" of operation. When getting the first character, a straightforward approach would use indexing. E.g.,
var str = 'Hello, World!';
var ch = str[ 0 ];
// returns 'H'
This matches user expectation so long as the first character is a relatively common character which can be stored in a single UTF-16 code unit. However, indexing fails to match user intuition when the first visual character is composed of multiple code units, as it returns only a fragment of that character.
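To make the failure mode concrete, the following sketch indexes into a string whose first visual character lies outside the Basic Multilingual Plane. The example string is an assumption chosen purely for illustration.

```javascript
// U+1F337 (tulip) is stored as two UTF-16 code units: 0xD83C 0xDF37.
var str = '\uD83C\uDF37 for you';

// Indexing returns only the first code unit, which is a lone high surrogate:
var ch = str[ 0 ];
// returns '\uD83C'

// The full character spans the first two code units:
var full = str.slice( 0, 2 );
// returns '🌷'
```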
As such, one has three options for resolving the first character:
- code units
- code points (one or more code units)
- grapheme clusters (one or more code points)
The most robust approach for matching user intuition is to resolve grapheme clusters (i.e., user-perceived visual characters), especially for text which may include emojis with skin tones and modified characteristics. However, resolving grapheme clusters is comparatively slow and may lead to unacceptable performance issues, especially when working with simple text.
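As a rough illustration of how the three levels differ, the sketch below counts the same string three ways. It uses the built-in `Intl.Segmenter` as a stand-in for grapheme cluster resolution; this is an assumption made for brevity and is not how the stdlib packages are implemented.

```javascript
// Thumbs up (U+1F44D) followed by a medium skin tone modifier (U+1F3FD).
var str = '\u{1F44D}\u{1F3FD}';

// Code units: each UTF-16 code unit is counted separately.
var codeUnits = str.length;
// returns 4

// Code points: the string iterator walks code points, joining surrogate pairs.
var codePoints = Array.from( str ).length;
// returns 2

// Grapheme clusters: `Intl.Segmenter` applies Unicode text segmentation rules.
var segmenter = new Intl.Segmenter( 'en', { 'granularity': 'grapheme' } );
var graphemes = Array.from( segmenter.segment( str ) ).length;
// returns 1
```

A user sees one character, yet only the grapheme cluster count agrees with that perception.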
Solution
Rather than provide a single API which only processes text as a sequence of grapheme clusters, the proposed solution is to refactor top-level @stdlib/string/* packages which handle grapheme clusters to support different "modes" of operation, whereby a user can choose which type of processing is most appropriate for given input strings.
Internally, packages supporting different modes should rely on separate, specialized "base" packages (@stdlib/string/base/*) which implement appropriate algorithms for resolving code units, code points, and grapheme clusters, respectively.
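The division of labor might be sketched as follows. All function names, the mode names (`'code_unit'`, `'code_point'`, `'grapheme'`), and the default mode are hypothetical and chosen for illustration; the helpers are simplified stand-ins for the base packages, and `Intl.Segmenter` is assumed in place of a dedicated grapheme cluster algorithm.

```javascript
// Hypothetical stand-ins for the specialized "base" implementations. Note that
// they perform no argument validation.
function firstCodeUnit( str, n ) {
	return str.substring( 0, n );
}

function firstCodePoint( str, n ) {
	return Array.from( str ).slice( 0, n ).join( '' );
}

function firstGraphemeCluster( str, n ) {
	var seg = new Intl.Segmenter( 'en', { 'granularity': 'grapheme' } );
	var out = '';
	var i = 0;
	for ( var s of seg.segment( str ) ) {
		if ( i >= n ) {
			break;
		}
		out += s.segment;
		i += 1;
	}
	return out;
}

// Table mapping each (assumed) mode name to a base implementation.
var MODES = {
	'code_unit': firstCodeUnit,
	'code_point': firstCodePoint,
	'grapheme': firstGraphemeCluster
};

// Hypothetical top-level function: validates input, then delegates.
function first( str, n, options ) {
	var mode = ( options && options.mode ) || 'grapheme';
	if ( typeof str !== 'string' ) {
		throw new TypeError( 'invalid argument. First argument must be a string.' );
	}
	if ( !Object.prototype.hasOwnProperty.call( MODES, mode ) ) {
		throw new TypeError( 'invalid option. Unrecognized mode: ' + mode );
	}
	return MODES[ mode ]( str, n );
}

var str = '\u{1F44D}\u{1F3FD} ok';

var a = first( str, 1, { 'mode': 'code_unit' } );
// returns '\uD83D'

var b = first( str, 1, { 'mode': 'code_point' } );
// returns '👍'

var c = first( str, 1 );
// returns '👍🏽'
```

Keeping validation in the top-level function lets callers who already know their input is simple text call a base implementation directly and skip the slower grapheme cluster path.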
Prior Art
For examples of refactorings, see
- @stdlib/string/first
  - @stdlib/string/base/first
  - @stdlib/string/base/first-code-point
  - @stdlib/string/base/first-grapheme-cluster
- @stdlib/string/for-each
  - @stdlib/string/base/for-each
  - @stdlib/string/base/for-each-code-point
  - @stdlib/string/base/for-each-grapheme-cluster
Tasks
The following packages should be refactored to use the proposed solution:
- @stdlib/string/first
- @stdlib/string/for-each
- @stdlib/string/left-trim-n
- @stdlib/string/remove-first
  - feat: refactor string package remove-first #1073
- @stdlib/string/remove-last
  - feat: refactor string package remove-last #1079
- @stdlib/string/reverse
  - feat: refactor string package reverse #1082
- @stdlib/string/right-trim-n
- @stdlib/string/truncate
  - feat: refactor truncate string package #1097
- @stdlib/string/truncate-middle
- @stdlib/string/base/distances/levenshtein
The following package implementation needs to be rewritten:
- @stdlib/string/base/prev-grapheme-cluster
Notes
In general, refactoring should happen in the following order:
- Create the base package for processing grapheme clusters (package name should have a -grapheme-cluster or -grapheme-clusters suffix). This is often similar to the top-level package, but stripped of input argument validation and optional arguments.
- Create the base package for processing Unicode code points (package name should have a -code-point or -code-points suffix).
- Create the base package for processing UTF-16 code units (if necessary; package name should have a -code-unit or -code-units suffix).
- Refactor the top-level package to depend on the base packages and add support for specifying a mode option.