Text Boundary Analysis - PowerPoint PPT Presentation

About This Presentation
Title: Text Boundary Analysis
Author: Eric Mader
Created: 7/8/1998

Learn more at: https://icu-project.org

Transcript and Presenter's Notes


1
Text Boundary Analysis
  • Eric Mader
  • Advisory Software Engineer
  • IBM

2
Where do I break lines?
  • The rain in Spain stays mainly on the plain.

5
Even in English, this can be hard
You owe me 1,234.56... I think.
8
Word wrapping vs word selection
Word wrapping
Some characters' behavior is context-dependent.
Searching by words
Some characters' behavior is context-dependent.
15
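Analysis by pairs
(rows: first character; columns: second character; X: a break may fall between the pair)

first\second  ltr  dgt  sp   pun  -    nbs  kji
ltr           .    .    .    .    .    .    X
dgt           .    .    .    .    .    .    X
sp            X    X    .    X    X    .    X
pun           .    .    .    .    .    .    X
-             X    .    .    X    .    .    X
nbs           .    .    .    .    .    .    .
kji           (five X and two empty cells; column positions lost)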
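
A minimal Python sketch of pair-based analysis, assuming the X cells mark permitted breaks. The `classify` function, the CJK range standing in for "kji", and the catch-all "pun" class are illustrative assumptions, and the kji row is left out because its cell positions are not recoverable.

```python
def classify(ch: str) -> str:
    """Map a character to one of the table's classes (assumed definitions)."""
    if ch == "\u00a0":
        return "nbs"   # non-breaking space; checked before isspace()
    if ch == "-":
        return "-"
    if ch.isspace():
        return "sp"
    if ch.isdigit():
        return "dgt"
    if "\u4e00" <= ch <= "\u9fff":
        return "kji"   # assumption: this CJK range stands in for "kji"
    if ch.isalpha():
        return "ltr"
    return "pun"       # assumption: everything else is punctuation

# (first, second) pairs marked X in the table; the kji row is omitted.
BREAK_PAIRS = {
    ("sp", "ltr"), ("sp", "dgt"), ("sp", "pun"), ("sp", "-"), ("sp", "kji"),
    ("-", "ltr"), ("-", "pun"), ("-", "kji"),
    ("ltr", "kji"), ("dgt", "kji"), ("pun", "kji"),
}

def break_positions(text: str) -> list[int]:
    """Return every index i where a break may fall between text[i-1] and text[i]."""
    return [
        i for i in range(1, len(text))
        if (classify(text[i - 1]), classify(text[i])) in BREAK_PAIRS
    ]
```

Each decision looks at exactly two characters, so the lookup is a single set membership test per position.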
16
Where pairs break down
A break position can depend on more than two characters:
You owe me 1,234.56... I think.
Compare "4.5" with "6..": a period after a digit stays inside the number only when another digit follows it.
19
Where pairs break down
Sentence boundaries require even more lookahead
He asked, How tall are you? I'm about 6 ft. tall. Wow!
24
An example
  • If not otherwise mentioned, each character is a
    word unto itself.
  • A run of letters constitutes a word and is kept
    together. Certain punctuation marks may appear
    inside a word, but only if they have a letter on
    each side.
  • A run of digits constitutes a number and is
    kept together. Certain punctuation marks may
    appear inside a number, but only if they have a
    digit on each side. In addition, a number may
    have certain optional prefix and suffix
    characters.
  • If a word and a number appear in succession
    with nothing between them, they're kept together.
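
The four rules above can be sketched with Python's `re` module. This is an illustration under stated assumptions, not the presentation's implementation: the names `LET`, `DGT`, `MID_WORD`, `MID_NUM`, `PRE_NUM`, and `POST_NUM` are made up here, and the specific punctuation placed in each set is a guess.

```python
import re

LET = r"[^\W\d_]"      # any Unicode letter
DGT = r"\d"
MID_WORD = r"[-'.]"    # assumed: punctuation allowed between letters
MID_NUM = r"[.,]"      # assumed: punctuation allowed between digits
PRE_NUM = r"[$#]"      # assumed: optional number prefix
POST_NUM = r"%"        # assumed: optional number suffix

# A run of letters, with mid-word punctuation only between letters.
WORD = rf"(?:{LET}+(?:{MID_WORD}{LET}+)*)"
# A run of digits, with mid-number punctuation only between digits.
NUMBER = rf"(?:{DGT}+(?:{MID_NUM}{DGT}+)*)"

TOKEN = re.compile(
    # Starts with a word; adjacent numbers and words stay together.
    rf"{WORD}(?:{NUMBER}{WORD})*(?:{NUMBER}{POST_NUM}?)?"
    # Starts with a number, optionally prefixed.
    rf"|{PRE_NUM}?{NUMBER}(?:{WORD}{NUMBER})*(?:{WORD}|{POST_NUM})?"
    # Anything else is a word unto itself.
    r"|.",
    re.DOTALL,
)

def tokens(text: str) -> list[str]:
    """Split text into the units the four rules keep together."""
    return [m.group() for m in TOKEN.finditer(text)]
```

For example, `tokens("You owe me 1,234.56... I think.")` keeps `1,234.56` together and returns each of the three trailing periods as its own unit.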

25
The state-machine approach
[State diagram; surviving labels: start, A, 0, .]

34
The state-machine approach
[State diagram (surviving labels: start, A, 0, .) traced over the input 1,234.56...]
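
A table-driven sketch of the idea in Python, assuming a hypothetical three-state machine for numbers only. The state names and the transition table are invented for illustration and are not a reconstruction of the slide's diagram. The machine remembers the last accepting position so it can back up when the input stops matching.

```python
DIGIT, MID, OTHER = "dgt", "mid", "other"

def char_class(ch: str) -> str:
    """Assumed classes: digits, mid-number punctuation, everything else."""
    if ch.isdigit():
        return DIGIT
    if ch in ".,":
        return MID
    return OTHER

# state -> character class -> next state; a missing entry stops the machine.
TRANSITIONS = {
    "start":     {DIGIT: "number"},
    "number":    {DIGIT: "number", MID: "separator"},
    "separator": {DIGIT: "number"},
}
ACCEPTING = {"number"}

def match_number(text: str, start: int = 0) -> int:
    """Return the end offset of the number beginning at start (start if none)."""
    state = "start"
    last_accept = start
    pos = start
    while pos < len(text):
        nxt = TRANSITIONS[state].get(char_class(text[pos]))
        if nxt is None:
            break
        state = nxt
        pos += 1
        if state in ACCEPTING:
            last_accept = pos   # remember where a valid number ended
    return last_accept
```

On `1,234.56...` the machine consumes the first trailing period, finds no digit after it, and falls back to offset 8, the end of `1,234.56`.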

47
Limitations
19921996
49
Limitations
1996
54
Automatic table building
  • If not otherwise mentioned, each character is a
    word unto itself.
  • A run of letters constitutes a word and is kept
    together. Certain punctuation marks may appear
    inside a word, but only if they have a letter on
    each side.
  • A run of digits constitutes a number and is
    kept together. Certain punctuation marks may
    appear inside a number, but only if they have a
    digit on each side. In addition, a number may
    have certain optional prefix and suffix
    characters.
  • If a word and a number appear in succession
    with nothing between them, they're kept together.

55
Automatic table building
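let      = [:L:]
dgt      = [:N:]
mid-word = [:Pd:] \ \ \.
mid-num  = \ \ \. \,
pre-num  = [:Sc:] \ \. -
post-num = \ \
word     = (let+ (mid-word let+)*)
number   = (dgt+ (mid-num dgt+)*)
word? (number word)* (number post-num?)?
pre-num (number word)* (number post-num?)?
[The characters after the bare backslashes were lost in extraction; the repetition operators are restored from the prose rules.]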
77
Automatic table building
  • All regular-expression rules have equal
    precedence
  • The winning rule is decided using a
    longest-possible-match algorithm (except in
    certain well-defined cases)
  • Our build algorithm parses the regular
    expressions, builds the state table, and makes
    sure it's deterministic, all in a single pass
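
A small Python sketch of the longest-possible-match idea, assuming a hypothetical three-rule set. The rule names and patterns are invented for illustration, and the "well-defined cases" that override longest match are not modeled. Every rule is tried at the same position and the longest match wins regardless of the order in which the rules are listed.

```python
import re

# Hypothetical rules; "number" is listed first on purpose to show that
# listing order does not decide the winner.
RULES = [
    ("number", re.compile(r"\d+")),
    ("alnum", re.compile(r"[A-Za-z0-9]+")),
    ("any", re.compile(r".", re.DOTALL)),
]

def longest_match(text: str, pos: int) -> tuple[str, int]:
    """Return (rule name, end offset) of the longest match at pos.

    Assumes pos < len(text). Ties go to the rule listed first.
    """
    best_name, best_end = None, pos
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        if m and m.end() > best_end:
            best_name, best_end = name, m.end()
    return best_name, best_end
```

With leftmost-first alternation, `12ab` would stop after `12`; here the `alnum` rule wins because it matches four characters.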

78
Sentence-break rules
.*? term term* end* space*
.*? period period* end* space* / start* sent-start
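
A rough Python rendering of the two rules, with the part after "/" treated as lookahead that must match but is not consumed. The character sets chosen for `TERM`, `END`, `SPACE`, `START`, and `SENT_START` are assumptions made for this sketch; the slide does not define them.

```python
import re

TERM = r"[?!]"              # assumed: unambiguous sentence terminators
PERIOD = r"\."
END = r"""["')\]]"""        # assumed: closing punctuation
SPACE = r"\s"
START = r"""["'(\[]"""      # assumed: opening punctuation
SENT_START = r"[A-Z]"       # assumed: an uppercase letter starts a sentence

RULE = re.compile(
    # Rule 1: a terminator always ends the sentence.
    rf".*?(?:{TERM}+{END}*{SPACE}*"
    # Rule 2: a period ends it only if a sentence start follows.
    rf"|{PERIOD}+{END}*{SPACE}*(?={START}*{SENT_START}))",
    re.DOTALL,
)

def sentences(text: str) -> list[str]:
    """Split text at the boundaries the two rules find."""
    result, pos = [], 0
    while pos < len(text):
        m = RULE.match(text, pos)
        end = m.end() if m else len(text)   # no terminator: take the rest
        result.append(text[pos:end])
        pos = end
    return result
```

On the example sentence, the period in "ft." is followed by a lowercase "t", so the lookahead fails and the sentence continues to "tall."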
86
Ignore characters
ignore = [:Mn:] [:Me:] [:Cf:]
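
One way to picture ignore characters, sketched with Python's standard `unicodedata` module: characters in the general categories Mn, Me, and Cf are skipped, so only the remaining positions take part in boundary decisions. The function name `significant_indices` is made up for this example.

```python
import unicodedata

IGNORE_CATEGORIES = {"Mn", "Me", "Cf"}   # nonspacing mark, enclosing mark, format

def significant_indices(text: str) -> list[int]:
    """Return the indices of characters that are not ignore characters."""
    return [
        i for i, ch in enumerate(text)
        if unicodedata.category(ch) not in IGNORE_CATEGORIES
    ]
```

For "e" followed by U+0301 (combining acute accent), only the index of the "e" is returned, so no break can fall between the base letter and its accent.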
87
Surrogate support
kanji  = \u4e00-\u9fff \udb80-\udb83
ignore = [:Mn:] [:Me:] [:Cf:] \udc00-\udcff
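
The effect of these two definitions can be sketched in Python: the high surrogate carries the class and the low surrogate is ignored, so a pair behaves like one character. The ranges in `unit_class` are copied from the slide; the function names are assumptions, and `to_surrogates` is the standard UTF-16 encoding arithmetic.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Encode a supplementary code point as a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = code_point - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def unit_class(unit: int) -> str:
    """Classify one UTF-16 code unit using the ranges shown on the slide."""
    if 0x4E00 <= unit <= 0x9FFF or 0xDB80 <= unit <= 0xDB83:
        return "kanji"
    if 0xDC00 <= unit <= 0xDCFF:
        return "ignore"
    return "other"
```

For example, U+F0000 encodes as the pair (0xDB80, 0xDC00), which classifies as "kanji" followed by "ignore".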
90
Random-access iteration
You owe me 1,234.56... I think.
96
Random-access iteration
!sent-start start* space* end* period
![sent-start lc digit] start* space* end* term
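
Random access can be sketched as "back up to a position known to be safe, then iterate forward". The Python sketch below uses a simplified word tokenizer and treats the position just after whitespace as safe. Both choices are assumptions made for illustration, not the backward rules on the slide.

```python
import re

# Simplified word rules (assumed): words, numbers, or any single character.
TOKEN = re.compile(r"[A-Za-z]+(?:[-'.][A-Za-z]+)*|\d+(?:[.,]\d+)*|.", re.DOTALL)

def safe_start(text: str, offset: int) -> int:
    """Back up to the start of text or to just after the nearest whitespace."""
    i = min(offset, len(text))
    while i > 0 and not text[i - 1].isspace():
        i -= 1
    return i

def following(text: str, offset: int) -> int:
    """Return the first boundary strictly after offset (len(text) if none)."""
    pos = safe_start(text, offset)
    while pos < len(text):
        pos = TOKEN.match(text, pos).end()   # always matches: "." is a fallback
        if pos > offset:
            return pos
    return len(text)
```

Asking for the boundary after offset 14, which lies inside `1,234.56`, backs up to offset 11 and then moves forward to 19, the end of the number.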
97
Dictionary-based iteration
We hold these truths to be self-evident that all
men are created equal, that they are endowed by
their Creator with certain unalienable rights,
that among these are Life, Liberty, and the
Pursuit of Happiness.
98
Dictionary-based iteration
Weholdthesetruthstobeself-evidentthatallmenare
createdequal,thattheyareendowedbytheirCreatorwith
certainunalienablerights,thatamongtheseareLife,
Liberty,andthePursuitofHappiness.
99
Dictionary-based iteration
dictionary = A-Z a-z \- \
103
Dictionary-based iteration
themendinetonight
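
A minimal backtracking sketch in Python of how a dictionary can divide such a run into words. The word list is an assumption chosen to create false starts such as "theme" and "them end", and the longest-candidate-first search order is one possible strategy rather than a description of the presenter's algorithm.

```python
# Assumed word list for illustration only.
WORDS = {
    "the", "them", "theme", "me", "men", "mend", "end",
    "in", "dine", "to", "ton", "on", "night", "tonight",
}

def segment(text, dictionary, start=0):
    """Return a list of dictionary words covering text[start:], or None."""
    if start == len(text):
        return []
    # Try the longest candidate first, backing up when the rest cannot be covered.
    for end in range(len(text), start, -1):
        word = text[start:end]
        if word in dictionary:
            rest = segment(text, dictionary, end)
            if rest is not None:
                return [word] + rest
    return None
```

With this word list the search first tries "theme" and "them", finds that the remainder cannot be covered, and backs up until it reaches "the men dine tonight".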