Commit 4f0430a

Add Intl.Segmenter support (#539)
* Add Intl.Segmenter support and some initial tests. (Missing docs, coverage, release notes.)
* Get to 100% coverage
* Document intlSegmenter
* Improve docs
* Add release notes
1 parent 244df82 commit 4f0430a

4 files changed: +65 -2 lines changed

README.md

Lines changed: 5 additions & 0 deletions
@@ -39,6 +39,11 @@ Broadly, jsdiff's diff functions all take an old text and a new text and perform
 
 Options
 * `ignoreCase`: Same as in `diffChars`. Defaults to false.
+* `intlSegmenter`: An optional [`Intl.Segmenter`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter) object (which must have a `granularity` of `'word'`) for `diffWords` to use to split the text into words.
+
+By default, `diffWords` does not use an `Intl.Segmenter`, just some regexes for splitting text into words. This will tend to give worse results than `Intl.Segmenter` would, but ensures the results are consistent across environments; `Intl.Segmenter` behaviour is only loosely specced and the implementations in browsers could in principle change dramatically in future. If you want to use `diffWords` with an `Intl.Segmenter` but ensure it behaves the same whatever environment you run it in, use an `Intl.Segmenter` polyfill instead of the JavaScript engine's native `Intl.Segmenter` implementation.
+
+Using an `Intl.Segmenter` should allow better word-level diffing of non-English text than the default behaviour. For instance, `Intl.Segmenter`s can generally identify via built-in dictionaries which sequences of adjacent Chinese characters form words, allowing word-level diffing of Chinese. By specifying a language when instantiating the segmenter (e.g. `new Intl.Segmenter('sv', {granularity: 'word'})`) you can also support language-specific rules, like treating Swedish's colon separated contractions (like *k:a* for *kyrka*) as single words; by default this would be seen as two words separated by a colon.
 
 * `Diff.diffWordsWithSpace(oldStr, newStr[, options])` - diffs two blocks of text, treating each word, punctuation mark, newline, or run of (non-newline) whitespace as a token.
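As a rough illustration of what the documented option relies on, the sketch below shows the segments a word-granularity `Intl.Segmenter` produces. The helper name `segmentWords` is hypothetical and not part of jsdiff; the commented-out call at the end assumes the package is installed as `diff` and is a version that includes this change.

```javascript
// Hypothetical helper (not part of jsdiff): list the segments that a
// word-granularity Intl.Segmenter produces for a piece of text.
function segmentWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, {granularity: 'word'});
  // With 'word' granularity each item carries the matched string (`segment`)
  // and an `isWordLike` flag that is false for whitespace and punctuation.
  return Array.from(
    segmenter.segment(text),
    ({segment, isWordLike}) => ({segment, isWordLike})
  );
}

const english = segmentWords('foo bar', 'en');
console.log(english.map(s => s.segment)); // → [ 'foo', ' ', 'bar' ]

// Passing a segmenter to diffWords (assumes the `diff` package is installed,
// so it is left commented out here):
// const Diff = require('diff');
// const swedish = new Intl.Segmenter('sv', {granularity: 'word'});
// Diff.diffWords(oldText, newText, {intlSegmenter: swedish});
```

Results for other locales (Chinese, Swedish, Finnish) depend on the segmentation data shipped with the JavaScript engine, which is why no output is shown for them.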

release-notes.md

Lines changed: 1 addition & 0 deletions
@@ -34,6 +34,7 @@
 * The context line immediately before and immediately after an insertion must match exactly between the hunk and the file for a hunk to apply. (Previously this was not required.)
 - [#535](https://github.com/kpdecker/jsdiff/pull/535) **A bug in patch generation functions is now fixed** that would sometimes previously cause `\ No newline at end of file` to appear in the wrong place in the generated patch, resulting in the patch being invalid.
 - [#535](https://github.com/kpdecker/jsdiff/pull/535) **Passing `newlineIsToken: true` to *patch*-generation functions is no longer allowed.** (Passing it to `diffLines` is still supported - it's only functions like `createPatch` where passing `newlineIsToken` is now an error.) Allowing it to be passed never really made sense, since in cases where the option had any effect on the output at all, the effect tended to be causing a garbled patch to be created that couldn't actually be applied to the source file.
+- [#539](https://github.com/kpdecker/jsdiff/pull/539) **`diffWords` now takes an optional `intlSegmenter` option** which should be an `Intl.Segmenter` with word-level granularity. This provides better tokenization of text into words than the default behaviour, even for English but especially for some other languages for which the default behaviour is poor.
 
 ## v5.2.0

src/diff/word.js

Lines changed: 10 additions & 2 deletions
@@ -58,8 +58,16 @@ wordDiff.equals = function(left, right, options) {
   return left.trim() === right.trim();
 };
 
-wordDiff.tokenize = function(value) {
-  let parts = value.match(tokenizeIncludingWhitespace) || [];
+wordDiff.tokenize = function(value, options = {}) {
+  let parts;
+  if (options.intlSegmenter) {
+    if (options.intlSegmenter.resolvedOptions().granularity != 'word') {
+      throw new Error('The segmenter passed must have a granularity of "word"');
+    }
+    parts = Array.from(options.intlSegmenter.segment(value), segment => segment.segment);
+  } else {
+    parts = value.match(tokenizeIncludingWhitespace) || [];
+  }
   const tokens = [];
   let prevPart = null;
   parts.forEach(part => {
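To make the new branch easier to try in isolation, here is a minimal standalone sketch of the segmenter path from the hunk above. The function name `splitWithSegmenter` is hypothetical; the real code lives inside `wordDiff.tokenize` and goes on to post-process `parts` into tokens, which this sketch leaves out. It also uses `!==` where the commit uses `!=`.

```javascript
// Hypothetical standalone version of the segmenter branch of wordDiff.tokenize.
function splitWithSegmenter(value, intlSegmenter) {
  // Reject sentence- or grapheme-level segmenters up front, as the commit does.
  if (intlSegmenter.resolvedOptions().granularity !== 'word') {
    throw new Error('The segmenter passed must have a granularity of "word"');
  }
  // segment() returns an iterable of {segment, index, input, isWordLike};
  // only the matched substring is kept for each part.
  return Array.from(intlSegmenter.segment(value), segment => segment.segment);
}

const wordSegmenter = new Intl.Segmenter('en', {granularity: 'word'});
console.log(splitWithSegmenter('foo bar', wordSegmenter)); // → [ 'foo', ' ', 'bar' ]
```

Because every character of the input ends up in exactly one segment, joining the returned parts reproduces the original string. Whitespace and punctuation come back as their own parts.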

test/diff/word.js

Lines changed: 49 additions & 0 deletions
@@ -209,6 +209,55 @@ describe('WordDiff', function() {
       );
       expect(convertChangesToXML(diffResult)).to.equal('foo<del> </del><ins>\t</ins>bar');
     });
+
+    it('supports tokenizing with an Intl.Segmenter', () => {
+      // Example 1: Diffing Chinese text with no spaces.
+      // I am not a Chinese speaker but I believe these sentences to mean:
+      // 1. "I have (我有) many (很多) tables (桌子)"
+      // 2. "Mei (梅) has (有) many (很多) sons (儿子)"
+      // We want to see that diffWords will get the word counts right and won't try to treat the
+      // trailing 子 as common to both texts (since it's part of a different word each time).
+      // TODO: Check with a Chinese speaker that this example is correct Chinese.
+      const chineseSegmenter = new Intl.Segmenter('zh', {granularity: 'word'});
+      const diffResult = diffWords('我有很多桌子。', '梅有很多儿子。', {intlSegmenter: chineseSegmenter});
+      expect(diffResult).to.deep.equal([
+        { count: 1, added: false, removed: true, value: '我有' },
+        { count: 2, added: true, removed: false, value: '梅有' },
+        { count: 1, added: false, removed: false, value: '很多' },
+        { count: 1, added: false, removed: true, value: '桌子' },
+        { count: 1, added: true, removed: false, value: '儿子' },
+        { count: 1, added: false, removed: false, value: '。' }
+      ]);
+
+      // Example 2: Should understand that a colon in the middle of a word is not a word break in
+      // Finnish (see https://stackoverflow.com/a/76402021/1709587)
+      const finnishSegmenter = new Intl.Segmenter('fi', {granularity: 'word'});
+      expect(convertChangesToXML(diffWords(
+        'USA:n nykyinen presidentti',
+        'USA ja sen presidentti',
+        {intlSegmenter: finnishSegmenter}
+      ))).to.equal('<del>USA:n nykyinen</del><ins>USA ja sen</ins> presidentti');
+
+      // Example 3: Some English text, including contractions, long runs of arbitrary space,
+      // and punctuation, and using case insensitive mode, just to show all normal behaviour of
+      // diffWords still works with a segmenter
+      const englishSegmenter = new Intl.Segmenter('en', {granularity: 'word'});
+      expect(convertChangesToXML(diffWords(
+        "There wasn't time \n \t for all that. He thought...",
+        "There isn't time \n \t left for all that, he thinks.",
+        {intlSegmenter: englishSegmenter, ignoreCase: true}
+      ))).to.equal(
+        "There <del>wasn't</del><ins>isn't</ins> time \n \t <ins>left </ins>"
+        + 'for all that<del>.</del><ins>,</ins> he <del>thought</del><ins>thinks</ins>.<del>..</del>'
+      );
+    });
+
+    it('rejects attempts to use a non-word Intl.Segmenter', () => {
+      const segmenter = new Intl.Segmenter('en', {granularity: 'grapheme'});
+      expect(() => {
+        diffWords('foo', 'bar', {intlSegmenter: segmenter});
+      }).to['throw']('The segmenter passed must have a granularity of "word"');
+    });
   });
 
   describe('#diffWordsWithSpace', function() {
