Title: Text Boundary Analysis
1Text Boundary Analysis
- Eric Mader
- Advisory Software Engineer
- IBM
2Where do I break lines?
- The rain in Spain stays mainly on the plain.
5Even in English, this can be hard
You owe me 1,234.56... I think.
7Word wrapping vs word selection
Word wrapping
Some characters' behavior is context-dependent.
Searching by words
Some characters' behavior is context-dependent.
9Analysis by pairs
[Table: break decisions for pairs of adjacent characters. Rows give the class of the first character, columns the class of the second, and an X marks a pair that allows a break between its two characters. Classes, in the order the builds add them: ltr, dgt, sp, pun, -, nbs, kji. The positions of the X marks are not recoverable.]
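A minimal Python sketch of pair-based analysis. The class names follow the slide's abbreviations (ltr, dgt, sp, pun, -, nbs, kji), but the entries in `BREAK_PAIRS` are illustrative assumptions, not the slide's actual X marks, and `char_class` is a hypothetical helper.

```python
def char_class(ch):
    """Map a character to one of the abbreviated classes used in the table."""
    if ch == "\u00a0":
        return "nbs"   # non-breaking space (checked before isspace, which also accepts it)
    if ch == "-":
        return "-"
    if "\u4e00" <= ch <= "\u9fff":
        return "kji"   # CJK ideographs (checked before isalpha, which also accepts them)
    if ch.isspace():
        return "sp"
    if ch.isdigit():
        return "dgt"
    if ch.isalpha():
        return "ltr"
    return "pun"


# (first, second) pairs that allow a line break between the two characters.
# Hypothetical entries, chosen only to illustrate the lookup.
BREAK_PAIRS = {
    ("sp", "ltr"), ("sp", "dgt"), ("sp", "pun"), ("sp", "-"), ("sp", "kji"),
    ("-", "ltr"), ("-", "dgt"), ("-", "kji"),
    ("ltr", "kji"), ("dgt", "kji"), ("pun", "kji"),
    ("kji", "ltr"), ("kji", "dgt"), ("kji", "kji"),
}


def line_break_positions(text):
    """Return every offset i where a break may fall between text[i-1] and text[i]."""
    return [
        i
        for i in range(1, len(text))
        if (char_class(text[i - 1]), char_class(text[i])) in BREAK_PAIRS
    ]
```

Each decision looks at exactly two characters, which is why the approach is fast and also why it cannot handle cases that need more context.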
16Where pairs break down
A break position can depend on more than two characters
You owe me 1,234.56... I think.
- "4.5": the period has a digit on each side, so it stays inside the number.
- "6..": the period follows a digit but is not followed by one, so it is not part of the number.
19Where pairs break down
Sentence boundaries require even more lookahead
He asked, "How tall are you?" "I'm about 6 ft. tall." "Wow!"
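A rough Python sketch of why sentence breaking needs lookahead: a period only ends a sentence if, after any closing quotes and spaces, the next word does not start with a lowercase letter or a digit. The heuristic, the character sets, and the function names are assumptions for illustration, and the sample text uses straight quotes.

```python
TERMINATORS = ".?!"
CLOSERS = "\"')]"
OPENERS = "\"'(["


def sentence_breaks(text):
    """Return the offsets at which a new sentence starts."""
    breaks = []
    n = len(text)
    i = 0
    while i < n:
        if text[i] not in TERMINATORS:
            i += 1
            continue
        # Consume the run of terminators, e.g. "..." or "?!".
        j = i
        only_periods = True
        while j < n and text[j] in TERMINATORS:
            only_periods = only_periods and text[j] == "."
            j += 1
        # Closing quotes and brackets belong to the sentence being ended.
        while j < n and text[j] in CLOSERS:
            j += 1
        # Trailing spaces also belong to the sentence being ended.
        k = j
        while k < n and text[k].isspace():
            k += 1
        if j < k < n:
            # Look ahead past opening quotes to the first character of the next word.
            m = k
            while m < n and text[m] in OPENERS:
                m += 1
            continues = m < n and (text[m].islower() or text[m].isdigit())
            if not (only_periods and continues):
                breaks.append(k)
        i = k
    return breaks


def split_sentences(text):
    """Cut the text at the offsets found by sentence_breaks."""
    cuts = [0] + sentence_breaks(text) + [len(text)]
    return [text[a:b].strip() for a, b in zip(cuts, cuts[1:])]
```

On the sample text, the period after "ft" is followed by a lowercase word and is therefore not treated as a sentence end, while the period after "tall" is.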
24An example
- If not otherwise mentioned, each character is a
  word unto itself.
- A run of letters constitutes a word and is kept
  together. Certain punctuation marks may appear
  inside a word, but only if they have a letter on
  each side.
- A run of digits constitutes a number and is
  kept together. Certain punctuation marks may
  appear inside a number, but only if they have a
  digit on each side. In addition, a number may
  have certain optional prefix and suffix
  characters.
- If a word and a number appear in succession
  with nothing between them, they're kept together.
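The four rules above can be sketched as a single Python regular expression. The specific punctuation sets (`MID_WORD`, `MID_NUM`, `PRE_NUM`, `POST_NUM`) are assumptions, since the slide does not list them, and Python's `re` module stands in for whatever engine the talk describes.

```python
import re

LET = r"[^\W\d_]"      # any Unicode letter
DGT = r"\d"            # any decimal digit
MID_WORD = r"[-'.]"    # assumed: punctuation allowed between letters
MID_NUM = r"[.,]"      # assumed: punctuation allowed between digits
PRE_NUM = r"[$#]"      # assumed: optional number prefix
POST_NUM = r"[%]"      # assumed: optional number suffix

WORD = rf"{LET}+(?:{MID_WORD}{LET}+)*"
NUMBER = rf"{DGT}+(?:{MID_NUM}{DGT}+)*"
# Words and numbers that touch are kept together; a number may end with a suffix.
TAIL = rf"(?:{NUMBER}{WORD})*(?:{NUMBER}{POST_NUM}?)?"

TOKEN = re.compile(
    rf"{PRE_NUM}{TAIL}"                  # prefix, then an optional number/word run
    rf"|(?=[^\W_])(?:{WORD})?{TAIL}"     # run that starts with a letter or digit
    rf"|.",                              # otherwise each character is its own word
    re.DOTALL,
)


def tokens(text):
    """Split text into the units described by the four rules."""
    return [m.group() for m in TOKEN.finditer(text)]
```

The lookahead in the second alternative keeps it from matching the empty string, so every match consumes at least one character.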
25The state-machine approach
[Diagram: state machine with the labels start, A, 0, and .]
34The state-machine approach
[Diagram: the same state machine, stepped through the input 1,234.56... over successive builds]
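A small table-driven sketch of the state-machine idea in Python. The state names, the transition table, and the "remember the last accepting position" loop are assumptions meant to illustrate the technique; they are not a transcription of the diagram.

```python
def char_class(ch):
    """Classify a character for the transition table (hypothetical classes)."""
    if ch.isalpha():
        return "let"
    if ch.isdigit():
        return "dgt"
    if ch in ".,":
        return "mid"
    return "other"


# state -> {character class -> next state}; a missing entry means "stop here".
TRANSITIONS = {
    "start":   {"let": "word", "dgt": "number", "mid": "single", "other": "single"},
    "word":    {"let": "word"},
    "number":  {"dgt": "number", "mid": "num_mid"},
    "num_mid": {"dgt": "number"},   # punctuation seen, still waiting for a digit
    "single":  {},
}
ACCEPTING = {"word", "number", "single"}


def next_boundary(text, start):
    """Run the machine from start and return the last position where it could stop."""
    state = "start"
    last_accept = start
    pos = start
    while pos < len(text):
        state = TRANSITIONS[state].get(char_class(text[pos]))
        if state is None:
            break
        pos += 1
        if state in ACCEPTING:
            last_accept = pos
    return last_accept


def tokens(text):
    """Split text by repeatedly asking the machine for the next boundary."""
    out, pos = [], 0
    while pos < len(text):
        end = next_boundary(text, pos)
        out.append(text[pos:end])
        pos = end
    return out
```

On `1,234.56...` the machine enters `num_mid` after the period that follows `56`, finds no digit, and falls back to the last accepting position, so the number ends before that period.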
47Limitations
19921996
49Limitations
1996
54Automatic table building
- If not otherwise mentioned, each character is a
  word unto itself.
- A run of letters constitutes a word and is kept
  together. Certain punctuation marks may appear
  inside a word, but only if they have a letter on
  each side.
- A run of digits constitutes a number and is
  kept together. Certain punctuation marks may
  appear inside a number, but only if they have a
  digit on each side. In addition, a number may
  have certain optional prefix and suffix
  characters.
- If a word and a number appear in succession
  with nothing between them, they're kept together.
55Automatic table building
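let = L
dgt = N
mid-word = Pd \\\.
mid-num = \\\.\,
pre-num = Sc \\.-
post-num = \\
word = (let+ (mid-word let+)*)
number = (dgt+ (mid-num dgt+)*)
word? (number word)* (number post-num?)?
pre-num (number word)* (number post-num?)?
[The "=", "+", and "*" operators are reconstructed from the rules on the previous slide; some literal characters in the character-class definitions are missing.]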
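One way to picture how named definitions such as let, dgt, word, and number compose is macro expansion into a single regular expression. The sketch below uses a hypothetical `{name}` reference syntax and assumed punctuation sets; it shows the composition step only and does not build a state table.

```python
import re

# Named definitions; later entries refer to earlier ones with {name}.
DEFINITIONS = {
    "let": r"[^\W\d_]",
    "dgt": r"\d",
    "mid-word": r"[-'.]",   # assumed punctuation set
    "mid-num": r"[.,]",     # assumed punctuation set
    "word": r"(?:{let}+(?:{mid-word}{let}+)*)",
    "number": r"(?:{dgt}+(?:{mid-num}{dgt}+)*)",
}

NAME_REF = re.compile(r"\{([a-z-]+)\}")


def expand(pattern, definitions=DEFINITIONS):
    """Replace {name} references with their definitions until none remain."""
    while True:
        expanded = NAME_REF.sub(lambda m: definitions[m.group(1)], pattern)
        if expanded == pattern:
            return expanded
        pattern = expanded


# A top-level rule written in terms of the named pieces.
WORD_RULE = re.compile(expand(r"{word}?(?:{number}{word})*(?:{number})?"))
```

For example, `expand("{number}")` yields a plain regular expression that matches `1,234.56` but not `1,234.`, because punctuation must have a digit on each side.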
77Automatic table building
- All regular-expression rules have equal
  precedence.
- The winning rule is decided using a
  longest-possible-match algorithm (except in
  certain well-defined cases).
- Our build algorithm parses the regular
  expressions, builds the state table, and makes
  sure it's deterministic in a single pass.
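The longest-match policy can be sketched in Python by trying every rule at the current position and keeping whichever consumes the most text. The rule set and the tie-breaking choice (first rule listed wins) are assumptions; the talk's engine compiles all rules into one deterministic table rather than running them separately.

```python
import re

# Hypothetical rule set; no rule has priority over another.
RULES = {
    "word": re.compile(r"[^\W\d_]+(?:[-'.][^\W\d_]+)*"),
    "number": re.compile(r"\d+(?:[.,]\d+)*"),
    "any": re.compile(r".", re.DOTALL),
}


def longest_match(text, pos):
    """Return (rule name, end offset) of the rule that matches the most text at pos."""
    best_name, best_end = None, pos
    for name, pattern in RULES.items():
        m = pattern.match(text, pos)
        if m is not None and m.end() > best_end:
            best_name, best_end = name, m.end()
    return best_name, best_end


def tokens(text):
    """Split text by taking the longest match at each position."""
    out, pos = [], 0
    while pos < len(text):
        name, end = longest_match(text, pos)
        out.append((name, text[pos:end]))
        pos = end
    return out
```

Because the "any" rule matches one character at every position, the loop always advances.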
78Sentence-break rules
.? term term period end space
.? period period end space / start sent-start
86Ignore characters
ignore = Mn Me Cf
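A brief Python sketch of what an ignore class does: characters in the Unicode categories Mn, Me, and Cf are treated as transparent, so they neither extend nor end a run. The function name and the choice of letter runs as the example are assumptions.

```python
import unicodedata

IGNORED_CATEGORIES = {"Mn", "Me", "Cf"}


def letter_runs(text):
    """Return runs of letters, treating ignore-class characters as transparent."""
    runs = []
    start = None
    for i, ch in enumerate(text):
        if unicodedata.category(ch) in IGNORED_CATEGORIES:
            continue   # skipped: does not start, extend, or end a run
        if ch.isalpha():
            if start is None:
                start = i
        elif start is not None:
            runs.append(text[start:i])
            start = None
    if start is not None:
        runs.append(text[start:])
    return runs
```

A combining accent (Mn) or a soft hyphen (Cf) therefore stays inside the word it is attached to instead of splitting it.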
87Surrogate support
kanji = \u4e00-\u9fff \udb80-\udb83
ignore = Mn Me Cf \udc00-\udcff
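The surrogate trick can be illustrated in Python by classifying UTF-16 code units: the high surrogate carries the class, and the low surrogate is ignored. The helper names are hypothetical, and the sketch ignores the full low-surrogate range DC00 to DFFF, which is wider than the range written on the slide.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def to_surrogates(code_point):
    """Split a supplementary code point into its (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = code_point - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def unit_class(unit):
    """Classify one UTF-16 code unit."""
    if 0x4E00 <= unit <= 0x9FFF or 0xDB80 <= unit <= 0xDB83:
        return "kanji"    # BMP ideographs, plus the high surrogates named on the slide
    if 0xDC00 <= unit <= 0xDFFF:
        return "ignore"   # low surrogate: transparent, like a combining mark
    return "other"
```

With these ranges, the high surrogates DB80 to DB83 cover the code points U+F0000 to U+F0FFF, so a pair in that block is classified by its first unit alone.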
90Random-access iteration
You owe me 1,234.56... I think.
96Random-access iteration
!sent-start start space end period
!sent-start lc digit start space end term
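Random access can be sketched in Python as "back up to a position that is certainly a boundary, then scan forward". Here the safe position is just after the nearest preceding whitespace, which is an assumption that holds for this toy word tokenizer; the `!` rules above play that role for sentences.

```python
import re

# Toy tokenizer: numbers with internal punctuation, letter runs, or single characters.
TOKEN = re.compile(r"\d+(?:[.,]\d+)*|[^\W\d_]+|.", re.DOTALL)


def safe_start(text, offset):
    """Back up to just after the nearest preceding whitespace, or to 0."""
    pos = min(offset, len(text))
    while pos > 0 and not text[pos - 1].isspace():
        pos -= 1
    return pos


def token_at(text, offset):
    """Return (start, end) of the token containing offset; needs 0 <= offset < len(text)."""
    start = safe_start(text, offset)
    while True:
        end = TOKEN.match(text, start).end()
        if end > offset:
            return start, end
        start = end
```

Starting the scan at the offset itself would report `34.56` for an offset in the middle of `1,234.56`; backing up first recovers the whole number.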
97Dictionary-based iteration
We hold these truths to be self-evident that all
men are created equal, that they are endowed by
their Creator with certain unalienable rights,
that among these are Life, Liberty, and the
Pursuit of Happiness.
98Dictionary-based iteration
Weholdthesetruthstobeself-evidentthatallmenare
createdequal,thattheyareendowedbytheirCreatorwith
certainunalienablerights,thatamongtheseareLife,
Liberty,andthePursuitofHappiness.
99Dictionary-based iteration
dictionary = A-Za-z\-\
103Dictionary-based iteration
themendinetonight
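Dictionary-based breaking on this example can be sketched in Python as a longest-first search with backtracking. The word list and the function name are hypothetical, and a real dictionary would be far larger.

```python
from functools import lru_cache

# Tiny illustrative dictionary (an assumption, not the talk's word list).
WORDS = frozenset({
    "the", "them", "men", "me", "end", "dine", "din", "in",
    "to", "tonight", "night", "on",
})


def segment(text, words=WORDS):
    """Split text into dictionary words, or return None if no full split exists."""
    max_len = max(len(w) for w in words)

    @lru_cache(maxsize=None)
    def solve(pos):
        if pos == len(text):
            return ()
        # Try the longest candidate first, then back off to shorter ones.
        for end in range(min(len(text), pos + max_len), pos, -1):
            if text[pos:end] in words:
                rest = solve(end)
                if rest is not None:
                    return (text[pos:end],) + rest
        return None   # dead end: the caller must try a shorter word

    result = solve(0)
    return None if result is None else list(result)
```

The search first tries "them", "end", "in", reaches a dead end at "etonight", and backs up until it finds "the", "men", "dine", "tonight".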