Title: ANTLR v3
1 ANTLR v3 LL(*) Parsing
- Terence Parr
- University of San Francisco
- at Coverity
- July 2006
2 Topics
- Research goals and motivation
- LL(k) parsing background
- LL(*) solution
- How it works
- Auto-backtracking extension
- Some initial results
3 Research goals
- Make top-down LL-based parsers as powerful as possible
- allows more natural grammars
- makes language tools more accessible
- My research is constrained by what most programmers can/will use
- recursive-descent parsers must be the base
- k>1 fixed lookahead
- semantic predicates
- syntactic predicates: controlled backtracking and a means of specifying ambiguity resolution
- And for my next trick: LL(*)
4 Recent advances
- GLR [Tomita]: handles CFGs like Earley but much more efficiently; uses LR(1), forking new parsers at nondeterministic states
- Elkhound [McPeak]: reduces forking further (even in nondeterministic situations)
- PEG (parsing expression grammar) [Ford]: formalizes ordered productions and syntactic predicates from PCCTS/ANTLR; backtracks through alternatives, taking the first match; no strict ordering with GLR
- Packrat parsing [Ford] (see Rats! [Grimm]): memoizes partial parsing results to guarantee linear time, but with a biggish heap
- Dramatic foreshadowing: LL(*) is to packrat as Elkhound GLR is to Earley's algorithm
5 LL(*) Motivation
- Natural grammars are sometimes not LL(k), e.g. abstract vs concrete methods
- From the left edge, lookahead is unbounded to see the ';' vs '{'. We need arbitrary lookahead because of the arg*
- If you have actions after ID, you can't easily refactor
- Lookahead will usually be 5<k<10 for this decision
method : type ID '(' arg* ')' ';'
       | type ID '(' arg* ')' body
       ;
6 Another non-LL(k) grammar
- Can't see past modifier* here
- Could left-factor, but not always possible and it's unnatural!
def : modifier* classDef
    | modifier* interfaceDef
    ;
def : modifiers (classDef|interfaceDef) ;
7 Background LL parsers
- Building a parser generator is easy except for the lookahead analysis
- rule ref -> rule()
- token ref -> match(token)
- rule def ->
void rule() {
    if ( lookahead-expr-alt 1 ) { match alt 1 }
    else if ( lookahead-expr-alt 2 ) { match alt 2 }
    else error
}
- The nature of the lookahead expressions dictates the strength of your parser generator
8 LL(2) parser example
void stat() {
    if ( LA(1)==ID && LA(2)==EQUALS ) {
        match(ID); match(EQUALS); expr();
    }
    else if ( LA(1)==ID && LA(2)==COLON ) {
        match(ID); match(COLON); stat();
    }
    else error;
}
stat : ID '=' expr
     | ID ':' stat
     ;
Lookahead is the set of 2-sequences that indicate which alternative will ultimately succeed
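The slide's pseudocode can be fleshed out into something that compiles. The following is a minimal sketch, not ANTLR's generated code: the class name `LL2StatParser`, the `Tok` enum, the list-based token stream, and the rule `expr : INT ;` are all assumptions made for illustration (the slide does not define `expr`).

```java
import java.util.List;

// Hypothetical hand-written LL(2) recognizer for:
//   stat : ID '=' expr | ID ':' stat ;
//   expr : INT ;            (assumed; not defined on the slide)
class LL2StatParser {
    enum Tok { ID, EQUALS, COLON, INT, EOF }

    private final List<Tok> tokens;
    private int p = 0; // index of the current token

    LL2StatParser(List<Tok> tokens) { this.tokens = tokens; }

    // LA(i): the token type i symbols ahead (1-based); EOF past the end.
    Tok LA(int i) {
        int idx = p + i - 1;
        return idx < tokens.size() ? tokens.get(idx) : Tok.EOF;
    }

    // Consume the current token if it has the expected type, else fail.
    void match(Tok t) {
        if (LA(1) != t) {
            throw new IllegalStateException("expected " + t + " but found " + LA(1));
        }
        p++;
    }

    void stat() {
        if (LA(1) == Tok.ID && LA(2) == Tok.EQUALS) {        // alt 1
            match(Tok.ID); match(Tok.EQUALS); expr();
        } else if (LA(1) == Tok.ID && LA(2) == Tok.COLON) {  // alt 2
            match(Tok.ID); match(Tok.COLON); stat();
        } else {
            throw new IllegalStateException("no viable alternative at token " + p);
        }
    }

    void expr() { match(Tok.INT); }

    // Convenience wrapper: true if the whole input is a stat.
    static boolean recognizes(List<Tok> input) {
        LL2StatParser parser = new LL2StatParser(input);
        try {
            parser.stat();
            parser.match(Tok.EOF);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

Both alternatives start with ID, so one token of lookahead cannot separate them; the second token (EQUALS vs COLON) does.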
9 Lookahead as DFA
void a() {
    int alt=0;
    if ( LA(1)==ID ) {
        if ( LA(2)==EQUALS ) alt=1;
        if ( LA(2)==COLON ) alt=2;
    }
    switch (alt) {
        case 1 : match(ID); match(EQUALS); expr(); break;
        case 2 : match(ID); match(COLON); stat(); break;
        default : error;
    }
}
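The same prediction can also be written as an explicit transition table, which makes the "lookahead as DFA" view literal. This is a sketch under assumptions: the integer token types, the table layout, and the name `LookaheadDfa` are hypothetical and are not how ANTLR encodes its DFAs.

```java
import java.util.Arrays;

// Table-driven lookahead DFA for: stat : ID '=' expr | ID ':' stat ;
class LookaheadDfa {
    // Hypothetical token types.
    static final int ID = 0, EQUALS = 1, COLON = 2, INT = 3, EOF = 4;
    static final int NUM_TOKEN_TYPES = 5;

    // TRANSITION[state][tokenType] = next state, or -1 if there is no edge.
    //   s0 --ID--> s1,  s1 --EQUALS--> s2 (predicts alt 1),  s1 --COLON--> s3 (predicts alt 2)
    static final int[][] TRANSITION = new int[4][NUM_TOKEN_TYPES];
    // ACCEPT_ALT[state] = predicted alternative, or 0 if the state is not an accept state.
    static final int[] ACCEPT_ALT = {0, 0, 1, 2};

    static {
        for (int[] row : TRANSITION) Arrays.fill(row, -1);
        TRANSITION[0][ID] = 1;
        TRANSITION[1][EQUALS] = 2;
        TRANSITION[1][COLON] = 3;
    }

    // Returns the predicted alternative (1 or 2), or 0 if no alternative is viable.
    // Only inspects the input starting at 'start'; nothing is consumed.
    static int predict(int[] tokens, int start) {
        int s = 0;
        int i = start;
        while (ACCEPT_ALT[s] == 0) {
            int t = i < tokens.length ? tokens[i] : EOF;
            s = TRANSITION[s][t];
            if (s < 0) return 0; // no edge on this token: prediction error
            i++;
        }
        return ACCEPT_ALT[s];
    }
}
```

The parser would then switch on the returned alternative number and match that alternative normally.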
10 Solution overview
- Natural extension to LL(k) lookahead DFA: allow cyclic DFA that can skip ahead past the modifiers to the class or interface def
- Don't approximate the entire CFG with a regex, i.e., don't include the class or interface def rules
- Just predict and proceed normally with the LL parse
- DFA yields predicted alt number
- Grammar actions are not sucked into DFAs and aren't executed during prediction
- No need to specify k a priori
11 LL(*) code
- Arbitrary cyclic graphs can't be encoded w/o gotos in Java, but here a simple while is ok
void a() {
    int alt=0;
    while (LA(1) in modifier) consume();
    if ( LA(1)==CLASS ) alt=1;
    if ( LA(1)==INTERFACE ) alt=2;
    switch (alt) {
        case 1 : ...
        case 2 : ...
        default : error;
    }
}
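A compilable version of the cyclic prediction loop might look like the sketch below. The token names, the `MODIFIERS` set, and the class name `CyclicPrediction` are assumptions for illustration. Unlike the slide's pseudocode, this version scans ahead with an index instead of calling consume(), so the parser's position is untouched when prediction finishes.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Cyclic lookahead for: def : modifier* classDef | modifier* interfaceDef ;
class CyclicPrediction {
    // Hypothetical token types.
    enum Tok { PUBLIC, STATIC, FINAL, ABSTRACT, CLASS, INTERFACE, ID, EOF }

    static final Set<Tok> MODIFIERS =
        EnumSet.of(Tok.PUBLIC, Tok.STATIC, Tok.FINAL, Tok.ABSTRACT);

    // Returns 1 (classDef), 2 (interfaceDef), or 0 (no viable alternative).
    static int predictDef(List<Tok> tokens, int start) {
        int i = start;
        // The DFA cycle: stay in the same state while we keep seeing modifiers.
        while (i < tokens.size() && MODIFIERS.contains(tokens.get(i))) {
            i++;
        }
        Tok t = i < tokens.size() ? tokens.get(i) : Tok.EOF;
        if (t == Tok.CLASS) return 1;
        if (t == Tok.INTERFACE) return 2;
        return 0;
    }
}
```

The loop uses as much lookahead as the input requires and no more: with no modifiers present, the decision is made on a single token.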
12 Isn't that just backtracking?
- No. For example, if I can guarantee you will never look ahead more than 10 symbols, it's just LL(10), right?
- Not backtracking with the parser. The DFA is smaller and faster, e.g., a DFA predicting expr does not follow the deep call chain the parser does
- Don't have to avoid or unroll arbitrary user actions in the grammar!
- The DFAs are efficiently coded and automatically throttle down when less lookahead is needed
13 LL(*) DFA Construction Algorithm
14 Algorithm discussion
- need suitable grammar representation
- sample LL(2) lookahead set computation
- LL(*) algorithm outline
- sample LL(*) DFA construction
15 Lookahead NFA Construction
a : b X | b Y ;
b : B | ;
[Figure: lookahead NFA for rules a and b]
16 Sample LL(2) Lookahead Set
- Now, consider a simple fixed-k lookahead computation algorithm
- Depth-first walk of the NFA at the left edge of each alt, popping a state off the stack upon end-of-rule. Terminate paths after traversing the k=2nd non-epsilon edge
- Lookahead for rule a alt 1, state sequences:
4, 2, 21, 16 B 17, 20, 3, pop, 5 X 7
22, 18, 19, 20, 3, pop, 5 X 7, 13, 1, ...
- Yields {BX, X}
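The fixed-k computation can be sketched without building the NFA by walking the grammar itself depth-first. Note the swap: this is a grammar walk, not the NFA walk the slide describes, though the list of pending symbols plays the role of the return stack. It assumes rule b is either B or empty (which is what the lookahead result above implies), that EOF follows rule a, and that the grammar has no left recursion; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FixedKLookahead {
    // Grammar: rule name -> alternatives, each a list of symbols.
    // A symbol is a rule reference if it is a key of this map, otherwise a token.
    static final Map<String, List<List<String>>> GRAMMAR = new LinkedHashMap<>();
    static {
        // a : b X | b Y ;
        GRAMMAR.put("a", Arrays.asList(Arrays.asList("b", "X"), Arrays.asList("b", "Y")));
        // b : B | ;   (assumed: second alternative is empty)
        GRAMMAR.put("b", Arrays.asList(Arrays.asList("B"), Arrays.<String>asList()));
    }

    // Depth-first walk. 'pending' holds the symbols still to be matched; when a rule's
    // alternative is exhausted, the walk continues with what followed the rule reference.
    // A path ends once k tokens have been collected.
    static void walk(List<String> pending, List<String> seen, int k, Set<String> out) {
        if (seen.size() == k || pending.isEmpty()) {
            out.add(String.join(" ", seen));
            return;
        }
        String sym = pending.get(0);
        List<String> rest = pending.subList(1, pending.size());
        if (GRAMMAR.containsKey(sym)) {
            // Rule reference: try each alternative, followed by the rest of the caller.
            for (List<String> alt : GRAMMAR.get(sym)) {
                List<String> next = new ArrayList<>(alt);
                next.addAll(rest);
                walk(next, seen, k, out);
            }
        } else {
            // Token: one more non-epsilon edge traversed.
            List<String> extended = new ArrayList<>(seen);
            extended.add(sym);
            walk(rest, extended, k, out);
        }
    }

    // Lookahead sequences of length k for alternative altIndex (0-based) of 'rule'.
    static Set<String> lookahead(String rule, int altIndex, int k) {
        List<String> pending = new ArrayList<>(GRAMMAR.get(rule).get(altIndex));
        pending.add("EOF"); // assumption: the rule is followed by end of input
        Set<String> out = new TreeSet<>();
        walk(pending, new ArrayList<>(), k, out);
        return out;
    }
}
```

With k=1 both alternatives of a can start with B, so the sets overlap; with k=2 they are disjoint, which is why the decision is LL(2).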
17 LL(*) Algorithm Outline
- idea: perform a breadth-first search of the NFA, carrying the stack context along with it so it knows where to return upon reaching an end-of-rule NFA state
- modify classical NFA-to-DFA conversion (subset construction algorithm)
- DFA state encodes the configurations the NFA could be in after having seen the input sequence, including the call invocation stack
- NFA configuration (s|alt|context) tracks state, predicted alt, and rule invocation stack to get to that state
- terminate algorithm when a state uniquely predicts an alternative or nondeterminism is found: (s|i|ctx) and (s|j|ctx) for same state s but different alts i,j and same/similar context
- verify DFA is reduced and all alternatives have a predict state
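The termination test in the outline can be illustrated in isolation. The sketch below only classifies one DFA state, given as a list of NFA configurations; it does not perform the closure or the subset construction itself. The names `Config`, `Outcome`, and `classify` are hypothetical, and the context comparison is plain equality, a simplification of the "same/similar context" test.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DfaStateAnalysis {
    // One NFA configuration (s|alt|ctx): NFA state, predicted alternative,
    // and the rule invocation stack used to reach that state.
    static final class Config {
        final int state;
        final int alt;
        final List<Integer> ctx;
        Config(int state, int alt, List<Integer> ctx) {
            this.state = state;
            this.alt = alt;
            this.ctx = ctx;
        }
    }

    enum Outcome { UNIQUE, NONDETERMINISTIC, CONTINUE }

    // Decide whether DFA construction can stop at this DFA state.
    static Outcome classify(List<Config> dfaState) {
        Set<Integer> alts = new HashSet<>();
        // (state, ctx) -> first alternative seen for that pair
        Map<List<Object>, Integer> altByStateAndCtx = new HashMap<>();
        for (Config c : dfaState) {
            alts.add(c.alt);
            List<Object> key = Arrays.<Object>asList(c.state, c.ctx);
            Integer prev = altByStateAndCtx.putIfAbsent(key, c.alt);
            if (prev != null && prev != c.alt) {
                // Same NFA state and same context reached for two different alts:
                // no amount of further lookahead can separate them.
                return Outcome.NONDETERMINISTIC;
            }
        }
        // All configurations agree on one alt: this is a predict state.
        return alts.size() == 1 ? Outcome.UNIQUE : Outcome.CONTINUE;
    }
}
```

A state holding the same NFA state under different contexts for different alternatives is classified as CONTINUE: more lookahead may still distinguish the alternatives.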
18 LL(*) DFA Conversion
a : b X | b Y ;
b : B | ;
[Figure: classic DFA and LL(*) DFA for this grammar; annotations: "1 alt"; "same NFA state, diff context"]
19 Successful termination
a : A X R | A Y S ;
a : (A|A) B ;
[Figure: DFA and LL(*) DFA for each grammar]
Stops at ambiguity or unique prediction
20 Can't see past recursion
- LL(*) DFA construction takes the LL stack into consideration, but the resulting DFA will not have a stack; it uses a sequence of states instead
- Example weakness (same language, diff grammar)
// works
a : b X | b Y ;
b : A+ ;
// doesn't work
a : b X | b Y ;
b : A | A b ; // tail recursion
t.g:2:5: Alternative 1: after matching input such as A A A A decision cannot predict what comes next due to recursion overflow to b from b
t.g:2:5: Alternative 2
21 LL(*) analysis fails sometimes
- LL(*) algorithm is exponential in the worst case, like the subset constr. algorithm
- keeps looking for more lookahead to distinguish alternatives; a problem in big grammars
- doesn't like common recursive prefixes
- w/o failsafe would not terminate in our lifetime
- Workarounds
- manually set fixed k lookahead if possible
- syntactic predicates
- auto-backtracking mode
- refactor grammar if ambiguous or to reduce
lookahead requirements
22 Auto-Backtracking
- Idea: when LL(*) analysis fails, simply backtrack at runtime to figure it out
- newbie or rapid prototyping mode
- people dump the craziest stuff into ANTLR
- impl: add syntactic predicate to each alt left edge
- LL(*) alg. uses preds only in nondeterministic states (NFA config. extended to include semantic context)
- Use fixed k lookahead + backtracking to get the grammar working, then optimize with LL(*)
- ANTLR v3 can memoize parsing results to guarantee linear parsing time
- Demo: java parsing with, w/o memoization
23 LL(*) + Auto-Backtracking
grammar r;
options { backtrack=true; }
s : e '' ;
e : '' e | '(' e ')' | INT ;
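To make the effect of backtracking plus memoization concrete, here is a hand-written sketch. It does not use the grammar above; it assumes a hypothetical grammar, s : e '%' | e '!' ; e : '(' e ')' | INT ; where INT is a single digit and the input is a plain string. It is not ANTLR's generated code, only an illustration of speculative parsing with rewind and a per-rule memo table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical grammar (an assumption for this sketch):
//   s : e '%' | e '!' ;
//   e : '(' e ')' | INT ;     INT = one digit
class BacktrackingParser {
    private final String input;
    private final boolean memoize;
    private int p = 0; // current input position
    // Memo table for rule e: start position -> stop position, or -1 if e failed there.
    private final Map<Integer, Integer> eMemo = new HashMap<>();
    int eInvocations = 0; // number of times e was actually parsed (memo misses)

    BacktrackingParser(String input, boolean memoize) {
        this.input = input;
        this.memoize = memoize;
    }

    private boolean match(char c) {
        if (p < input.length() && input.charAt(p) == c) { p++; return true; }
        return false;
    }

    private boolean e() {
        int start = p;
        if (memoize && eMemo.containsKey(start)) {
            int stop = eMemo.get(start);
            if (stop < 0) return false; // known failure at this position
            p = stop;                   // known success: skip ahead without reparsing
            return true;
        }
        eInvocations++;
        boolean ok;
        if (match('(')) {
            ok = e() && match(')');
        } else if (p < input.length() && Character.isDigit(input.charAt(p))) {
            p++;
            ok = true;
        } else {
            ok = false;
        }
        if (!ok) p = start; // rewind on failure
        if (memoize) eMemo.put(start, ok ? p : -1);
        return ok;
    }

    // Try alt 1 speculatively; on failure rewind and try alt 2.
    boolean s() {
        int start = p;
        if (e() && match('%')) return true;
        p = start;
        if (e() && match('!')) return true;
        p = start;
        return false;
    }
}
```

On an input such as "((1))!", the first alternative parses the whole nested e before failing on '%'. Without memoization the second alternative reparses every nested e; with it, the second attempt is a single table lookup.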
24 Java 1.4 Grammar Results
- Tweaked version of Rats!'s Java grammar
- 99 rules, 86 decisions
- LL(1) decisions: 68 (excluding 2 that backtrack)
- LL(2) decisions: 12
- LL(*) decisions: 4
- Backtracking decisions: 2
- No heap wasted on memoization (memo. off)!
- If limited to k=1, 10 decisions backtrack
- Prelim. parsing profile on java/awt/Container.java
- LL(*) lookahead range 1..8 tokens; average 2
- Backtracking range 1..8; average 3.5
25 Can we classify LL(*) strength?
- No strict ordering with CFG (ala GLR): this grammar matches the context-sensitive language A^n B^n C^n
s : (a)=> A+ b EOF
  | A+ b EOF   // this production forces a decision, else the predicate (a) is not used; language unaffected
  ;
a : A a? B ;   // matches A^n B^n
b : B b? C ;   // matches B^n C^n
Adapted from Ford's PEG paper
26 LL(*) vs LR(k)
- LR(k), even with k=1, is generally more powerful than LL(*), or at least more efficient for the same grammar, but there is no strict ordering: add epsilon rule refs to the left edge of our grammar and it's not LR(k) for fixed k (such rules are derived from adding actions)
a : b A+ X R
  | c A+ Y S
  ;
b : ;
c : ;
LL(*) but not LR(k) due to reduce-reduce conflict
27 Summary and Conclusions
- Brazen assertion: LL(*) + syntactic predicates is the most powerful parsing strategy that is accessible/attractive to the average programmer
- LL(*) has the benefits, flexibility, and simplicity of LL but is much stronger; supports natural grammars
- Doesn't alter the recursive descent parser itself at all; just enhances the predictive capabilities
- Unifies lexing, parsing, tree parsing
- Basic algorithm is not that complicated, but making it real and useful is interesting
- Beta-release http://www.antlr.org/v3
- BSD license