Title: Tools and Analyses for Ambiguous Input Streams
1Tools and Analyses for Ambiguous Input Streams
- Andrew Begel and Susan L. Graham
- University of California, Berkeley
- LDTA Workshop - April 3, 2004
2Harmonia: Language-aware Editing
- Programming by Voice
  - Code dictation
  - Voice-based editing commands
- Program Transformations
  - Transformation actions
  - Pattern-matching constructs
- Human Speech
- Embedded Languages
- Each kind of input stream ambiguity requires new language analyses
6Speech Example
- for (int i = 0; i < 10; i++)
- ?
7Ambiguities
- Recognized utterance: 4 int eye equals 0 aye less then 10 i plus plus
- Intended code: for (int i = 0; i < 10; i++)
- ID Spelling?
- KW or ID?
- KW or ?
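The ambiguities above can be pictured as a lexer that returns every plausible reading of a recognized word instead of a single token. The following is a minimal sketch of that idea, not the Harmonia lexer: the `HOMOPHONES` table, the `KEYWORDS` set, and the `alternatives` function are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Token(NamedTuple):
    category: str   # "KW", "ID", or "NUM"
    spelling: str


# Hypothetical homophone sets; a real system would derive these from a
# pronunciation dictionary rather than a hand-written table.
HOMOPHONES = [
    {"for", "4", "four", "fore"},
    {"i", "eye", "aye"},
]

# Hypothetical keyword set for the target language.
KEYWORDS = {"for", "int"}


def alternatives(word: str) -> list[Token]:
    """Return every (category, spelling) reading of one recognized word."""
    spellings = next((s for s in HOMOPHONES if word in s), {word})
    tokens = []
    for spelling in sorted(spellings):
        if spelling in KEYWORDS:
            tokens.append(Token("KW", spelling))
        elif spelling.isdigit():
            tokens.append(Token("NUM", spelling))
        else:
            tokens.append(Token("ID", spelling))
    return tokens
```

Calling `alternatives("4")` yields one number, one keyword, and two identifier spellings, which is the kind of choice a parser downstream has to resolve.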
9Another Utterance
10Many Valid Parses!
for (times ate 0 to 1) ?
fore.times(8).equalsZero(2, plus 1) ?
11Embedded Language Example
- C and Regexps embedded in Flex
- Flex Rule for Identifiers
  - [_a-zA-Z]([_a-zA-Z0-9])* { i++; RETURN_TOKEN(ID); }
- Why not this interpretation?
  - [_a-zA-Z]([_a-zA-Z0-9])* { i++; RETURN_TOKEN(ID); }
  - (Figure: the same text, with the boundary between regexp and C drawn differently.)
13Legacy Language Example
- Fortran
  - Do Loop
    - DO 57 I = 3,10
  - Assignment
    - DO 57 I = 3
    - DO57I = 3
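The Fortran example turns on blanks being insignificant: the same leading characters begin either a DO loop or an assignment to a variable named DO57I, and only the comma later in the statement tells them apart. The sketch below illustrates that with two regular expressions. It assumes the statements are written with `=` as in standard Fortran, it is not how XGLR resolves the ambiguity, and `classify` is a hypothetical helper.

```python
import re

# Both patterns run on the statement with blanks removed.
DO_LOOP = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)")
ASSIGNMENT = re.compile(r"([A-Z][A-Z0-9]*)=(.+)")


def classify(statement: str):
    """Classify a statement as a DO loop or an assignment (illustrative only)."""
    text = statement.replace(" ", "").upper()
    match = DO_LOOP.fullmatch(text)
    if match:
        # (kind, statement label, loop variable)
        return ("do-loop", match.group(1), match.group(2))
    match = ASSIGNMENT.fullmatch(text)
    if match:
        # (kind, assigned variable)
        return ("assignment", match.group(1))
    return ("unknown",)
```

Here `classify("DO 57 I = 3,10")` reports a DO loop over `I`, while `classify("DO 57 I = 3")` reports an assignment to `DO57I`. A lexer reading left to right cannot make that choice when it first sees `DO`.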
18Legacy Language Example
- PL/I
  - Non-reserved Keywords
    - IF IF = THEN
    - THEN THEN = ELSE
    - ELSE ELSE = END END
  - (Figure: tokens labeled ID, ID, KW, ID.)
20Input Stream Classification
Embedded Languages Fall in all Four Categories!
22GLR Analysis Architecture
- Handles syntactic ambiguities
- (Figure: pipeline of Lexer, GLR Parser, and Semantics; the lexer turns the input FOR I into the tokens FOR, I, and ( for the parser.)
24Our Contribution: XGLR Analysis Architecture
- Handles input stream ambiguities
- (Figure: pipeline of Lexer, XGLR Parser, and Semantics; for the spoken input "for i equals zero ..." the lexer offers the alternatives FOR and 4, and I and EYE.)
26LR Parsing
(Figure: an LR parser reads the Input Stream, consults the Parse Table, and maintains a single Parse Stack; states 1 and 3 are shown.)
29GLR Parsing
(Figure: a GLR parser reads the Input Stream and consults the Parse Table, but its Parse Stack branches into several stacks; states 1 through 5 are shown.)
33XGLR in Action
34Parsing Homophones
(Figure: a parser in state 23; the next two inputs are FOR and BAR.)
35XGLR Extension: Multiple Spellings, Single and Multiple Lexical Categories
(Figure: the first input is lexed as FOUR and FORE (ID), FOR (KW), and 4 (NUM).)
36XGLR Extension: Parsers fork due to input ambiguity
(Figure: the parser in state 23 forks into three parsers, one each for ID, KW, and NUM.)
37Each parser shifts its now unambiguous input
(Figure: the three parsers shift into states 26, 29, and 35.)
38The next input is lexed unambiguously
(Figure: BAR is lexed as ID.)
39ID is only a valid lookahead for two parsers
(Figure: two of the parsers shift ID and reach states 49 and 42; the third has no action on ID.)
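The homophone walkthrough can be sketched as one step function: every live parser is tried against every lexical category of the next input, and only the combinations that the parse table accepts survive. This is a simplified illustration under stated assumptions, not the XGLR algorithm itself. It models shifts only, and the `SHIFT` table is hypothetical: its state numbers are taken from the slides, but which category leads to which state is an assumption.

```python
# Hypothetical shift table: (current state, lexical category) -> next state.
SHIFT = {
    (23, "ID"): 26,
    (23, "KW"): 29,
    (23, "NUM"): 35,
    (26, "ID"): 49,
    (29, "ID"): 42,
    # No entry for (35, "ID"): that parser has no valid action on ID.
}


def step(parsers, categories):
    """Advance each parser stack over each lexical category of the next input.

    parsers    -- list of state stacks, each a tuple of ints
    categories -- lexical categories offered for the next input
    A stack forks once per accepted category and is dropped if none is accepted.
    """
    survivors = []
    for stack in parsers:
        for category in categories:
            target = SHIFT.get((stack[-1], category))
            if target is not None:
                survivors.append(stack + (target,))
    return survivors
```

Starting from `[(23,)]`, a call with `["ID", "KW", "NUM"]` produces three stacks, and a second call with `["ID"]` leaves two of them.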
40Parsing Embedded Languages
- Example BNF Grammar
- Contains Languages L and W
  - b_L → loop_L d_W END_L
  - loop_L → LOOP_L | ε
  - d_W → WHILE_W NUM_W do_W
  - do_W → DO_W | ε
- Example input: LOOP WHILE 34 END WHILE 56 DO END
46Parsing Embedded Languages
(Figure: step-by-step XGLR parse of the input prefix LOOP WHILE 34, starting from a single parser in state 0. The captions of slides 46 to 60 follow in order.)
- Current parse state has ambiguous lexical language
- XGLR Extension: Fork parsers, assign one to each lexical language (L and W)
- XGLR Extension: Single spelling, Multiple lexical categories. Lex lookahead both in language L and W (LOOP is a KW in L and an ID in W)
- Only LOOP_L is valid lookahead, and is shifted (into state 4)
- XGLR Extension: State 4 has lexer lookaheads only in language W
- Lex lookahead in language W (WHILE is a KW)
- REDUCE by rule 2 and GOTO state 1
- Shift into state 2
- XGLR Extension: Lex lookahead in language W (34 is a NUM)
- Shift into state 3, which has ambiguous lexical language
- XGLR Extension: Single spelling, Multiple lexical categories. Fork parsers, assign one to each lexical language
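The steps above depend on lexing the same spelling once per candidate language and keeping only the readings that the current parse state accepts. A minimal sketch of that filtering follows. The keyword sets, the `lex` and `valid_lookaheads` functions, and the expected-token sets used in the usage note are assumptions chosen to match the captions, not the generated Harmonia tables.

```python
# Hypothetical keyword sets for the two languages of the example grammar.
KEYWORDS = {"L": {"LOOP", "END"}, "W": {"WHILE", "DO"}}


def lex(word, language):
    """Return a (category, language) token for one spelling in one language."""
    if word in KEYWORDS[language]:
        return (word, language)        # keyword of that language
    if language == "W" and word.isdigit():
        return ("NUM", "W")            # numbers are assumed to exist only in W
    return ("ID", language)            # anything else is an identifier


def valid_lookaheads(word, expected):
    """Lex `word` in every language the parse state expects; keep valid readings.

    expected -- set of (category, language) tokens the state has actions for
    """
    languages = sorted({language for _, language in expected})
    readings = [lex(word, language) for language in languages]
    return [token for token in readings if token in expected]
```

With an assumed start state that expects `{("LOOP", "L"), ("WHILE", "W")}`, the word LOOP is lexed as a keyword in L and an identifier in W, and only the L reading is kept, which mirrors the caption "Only LOOP_L is valid lookahead".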
61GLR Ambiguity Support
- Fork parser on shift-reduce conflict
- Fork parser on reduce-reduce conflict
62XGLR Ambiguity Support
- Fork parser on shift-reduce conflict
- Fork parser on reduce-reduce conflict
- Fork parsers on ambiguous lexical language
  - Single spelling, Multiple lexical categories
- Fork parsers on ambiguous lexical lookahead
  - Single/Multiple Spellings, Multiple lexical categories
- Shift-shift conflict resolution
64XGLR Ambiguities
- Many GLR programming language specs have a small, finite number of ambiguities
- XGLR language specs also have a finite number of ambiguities, though slightly more
- Lexical ambiguity due to ambiguous input does result in more ambiguous parse forests
- Ambiguity causes parsers to fork
- GLR maintains efficiency by merging parsers when the ambiguity is over
66Parser Merging
- GLR Parsers merge when in same parse state
- (Figure: two parser stacks both reach state 5, merge, and continue together.)
- XGLR Parsers merge when in same parse state and same lexical state
- (Figure: the same stacks annotated with lexical states A and W; parsers in the same parse state stay separate until their lexical states also match.)
73Out of Sync Parsers
- XGLR Parsers merge when in same parse state and same lexical state and same input position
- (Figure: two parsers, with lexical states W and A, read the input shown as DO57I3. One lexes DO57I as a single ID and then reads 3; the other consumes the input in smaller pieces, with 57I3, I3, and 3 shown as its remaining input, and lexes I as an ID. Their input positions differ along the way, so they cannot merge until they reach the same position.)
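The merge condition on this slide amounts to a three-part key. The sketch below groups parsers by parse state, lexical state, and input position, and combines those that agree on all three. The `Parser` record and the `merge` function are hypothetical names used for illustration, and the stacks are represented as plain labels rather than a graph-structured stack.

```python
from dataclasses import dataclass, field


@dataclass
class Parser:
    parse_state: int
    lexical_state: str
    position: int                       # offset of the lookahead in the input
    stacks: list = field(default_factory=list)


def merge(parsers):
    """Combine parsers that share parse state, lexical state, and position."""
    merged = {}
    for parser in parsers:
        key = (parser.parse_state, parser.lexical_state, parser.position)
        if key in merged:
            # Same key: fold this parser's stacks into the existing parser.
            merged[key].stacks.extend(parser.stacks)
        else:
            merged[key] = Parser(*key, list(parser.stacks))
    return list(merged.values())
```

Two parsers in the same parse and lexical state but at different input positions keep separate entries, which is the out-of-sync case the slide describes.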
81Implementation
- Keep a map from lookahead to parser, used when looking for parsers to merge with
- Sort parsers by position of lookahead in the input
  - Enables pruning of the map as parsers move past a particular input location
  - Extra memory required is bounded by the dynamic separation between the first and last parsers
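One way to picture the map and its pruning is an index keyed first by input position and then by parse and lexical state. This is a sketch under assumptions, not the Harmonia implementation: the slide keys the map by lookahead, whereas the `MergeIndex` class here uses the lookahead's position as a stand-in, and all of its method names are hypothetical.

```python
class MergeIndex:
    """Index of live parsers, bucketed by the input position of their lookahead."""

    def __init__(self):
        # position -> {(parse_state, lexical_state): parser}
        self._by_position = {}

    def find_or_add(self, position, parse_state, lexical_state, parser):
        """Return the parser already registered under this key, else register `parser`."""
        bucket = self._by_position.setdefault(position, {})
        return bucket.setdefault((parse_state, lexical_state), parser)

    def prune(self, slowest_position):
        """Drop buckets behind the slowest live parser; nothing can merge there."""
        stale = [p for p in self._by_position if p < slowest_position]
        for position in stale:
            del self._by_position[position]

    def positions(self):
        """Input positions that still have registered parsers, in order."""
        return sorted(self._by_position)
```

Because buckets behind the slowest parser are discarded, the index only ever spans the stretch of input between the first and last live parsers.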
82Related Work
- GLR Parsing Algorithm
  - Tomita 1985
  - Farshi 1991
  - Rekers 1992
  - Johnstone et al. 2002
- Incremental GLR
  - Wagner 1997
- GLR Implementations (that I heard of before today)
  - ASF+SDF 1993
  - Elkhound 2004
  - Bison 2003
  - DParser 2002
  - Aycock and Horspool 1999
- Scannerless Parsing (or Context-Free Scanning)
  - Salomon and Cormack 1989
  - Visser 1997; van den Brand 2002
- Ambiguous Input Streams
  - Aycock and Horspool 2001
- Embedded Languages
  - ASF+SDF 1997
  - Van de Vanter and Boshernitsan (CodeProcessor) 2000
83Future Work
- Semantic Analysis of Embedded Languages
- Automated Semantic Disambiguation
84Contributions
- Generalized GLR to handle input stream ambiguities
- Classified input stream ambiguities into four categories
- Implemented XGLR algorithm in Harmonia framework
- Constructed combined lexer and parser generator to support embedded languages and lexical ambiguities at each stage of analysis
- Enabled analysis of embedded languages, programming by voice, and legacy languages