Commit 7c84960

Add invisible separator / comma (\u2063) to the list of characters treated as spaces. This is one half of #1281; unfortunately, it does not address the crash.
1 parent 40fee82 commit 7c84960

3 files changed, +69089 -68997 lines changed

src/edu/stanford/nlp/process/LexCommon.tokens

+2 -1
@@ -1,5 +1,6 @@
 /* \u3000 is ideographic space; \u205F is medium math space */
-SPACE = [ \t\u00A0\u2000-\u200A\u202F\u20F5\u3000]
+/* \u2063 is an invisible separator */
+SPACE = [ \t\u00A0\u2000-\u200A\u202F\u2063\u20F5\u3000]
 SPACES = {SPACE}+
 NEWLINE = \r|\r?\n|\u2028|\u2029|\u000B|\u000C|\u0085
 SPACENL = ({SPACE}|{NEWLINE})

src/edu/stanford/nlp/process/PTBLexer.flex

+1
@@ -589,6 +589,7 @@ SPLET = &[aeiouAEIOU](acute|grave|uml);
 
 %include LexCommon.tokens
 
+/* SPACE, SPACENL, etc are in LexCommon.tokens */
 SPACENLS = {SPACENL}+
 /* These next ones are useful to get a fixed length trailing context. */
 SPACENL_ONE_CHAR = [ \t\u00A0\u2000-\u200A\u202F\u3000\r\n\u2028\u2029\u000B\u000C\u0085]
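
For illustration only, the effect of the updated SPACE class can be sketched outside JFlex with `java.util.regex`. The class name `SpaceClassSketch` and the helper `isSpace` are hypothetical and not part of this commit. The character class is copied verbatim from the diff above, including `\u20F5`, even though the adjacent comment refers to `\u205F`. The sketch also shows one plausible reason the character has to be listed explicitly: Java's own `Character.isWhitespace` does not classify U+2063 as whitespace.

```java
import java.util.regex.Pattern;

class SpaceClassSketch {
    // Assumption: a java.util.regex transcription of the updated SPACE macro,
    // copied character-for-character from LexCommon.tokens in this diff.
    static final Pattern SPACE =
        Pattern.compile("[ \\t\\u00A0\\u2000-\\u200A\\u202F\\u2063\\u20F5\\u3000]");

    // Hypothetical helper: true if the single code point is in the SPACE class.
    static boolean isSpace(int codePoint) {
        return SPACE.matcher(new String(Character.toChars(codePoint))).matches();
    }

    public static void main(String[] args) {
        // U+2063 INVISIBLE SEPARATOR is now matched by the class...
        System.out.println(isSpace(0x2063));                // prints true
        // ...although Java itself does not treat it as whitespace.
        System.out.println(Character.isWhitespace(0x2063)); // prints false
    }
}
```

This only mirrors the character class; it says nothing about how the generated lexer handles the surrounding rules, and it is unrelated to the crash mentioned in the commit message.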
