ANTLR v3 - PowerPoint PPT Presentation

About This Presentation

Title:

ANTLR v3

Slides: 28

Provided by: terenc2

Transcript and Presenter's Notes

Title: ANTLR v3

1
ANTLR v3 LL(*) Parsing

Terence Parr
University of San Francisco
@
Coverity
July 2006

2
Topics

Research goals and motivation
LL(k) parsing background
LL(*) solution
How it works
Auto-backtracking extension
Some initial results

3
Research goals

Make top-down LL-based parsers as powerful as possible
  allows more natural grammars
  makes language tools more accessible
My research is constrained by what most programmers can/will use
  recursive-descent parsers must be the base
  k>=1 fixed lookahead
  semantic predicates
  syntactic predicates: controlled backtracking and a means of specifying ambiguity resolution
And for my next trick: LL(*)

4
Recent advances

GLR (Tomita): handles CFGs like Earley but much more efficiently; uses LR(1), forking new parsers at nondeterministic states
Elkhound (McPeak): reduces forking further (even in nondeterministic situations)
PEG (parsing expression grammar) (Ford): formalizes ordered productions and syntactic predicates from PCCTS/ANTLR; backtracks through alternatives taking the first match; no strict ordering with GLR
Packrat parsing (Ford; see Rats! by Grimm): memoizes partial parsing results to guarantee linear time, but with a biggish heap
Dramatic foreshadowing: LL(*) is to packrat as Elkhound GLR is to Earley's algorithm

5
LL(*) Motivation

Natural grammars are sometimes not LL(k), e.g., abstract vs concrete methods
From the left edge, the lookahead needed to see the ';' vs '{' is unbounded. We need arbitrary lookahead because of the arg*
If you have actions after ID, you can't easily refactor
Lookahead will usually be 5 to 10 tokens for this decision

method : type ID '(' arg* ')' ';'
       | type ID '(' arg* ')' body
       ;
6
Another non-LL(k) grammar

Can't see past modifier* here
Could left-factor, but that's not always possible and it's unnatural!

def : modifier* classDef
    | modifier* interfaceDef
    ;
// left-factored
def : modifiers (classDef|interfaceDef) ;
7
Background: LL parsers

Building a parser generator is easy except for the lookahead analysis
rule ref  -> rule()
token ref -> match(token)
rule def  ->
    void rule() {
        if ( lookahead-expr-alt 1 ) { match alt 1 }
        else if ( lookahead-expr-alt 2 ) { match alt 2 }
        else error
    }
The nature of the lookahead expressions dictates the strength of your parser generator

8
LL(2) parser example
void stat() {
    if ( LA(1)==ID && LA(2)==EQUALS ) {
        match(ID); match(EQUALS); expr();
    }
    else if ( LA(1)==ID && LA(2)==COLON ) {
        match(ID); match(COLON); stat();
    }
    else error;
}
stat : ID '=' expr | ID ':' stat ;
Lookahead is a set of 2-sequences that indicate which alternative will ultimately succeed
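
As a rough illustration of the LL(2) example above, the following is a minimal, self-contained Java sketch of a hand-written recursive-descent recognizer. The class name Ll2StatParser, the Tok enum, the parses helper, and the definition expr : INT are all assumptions made for this sketch; the slide does not define expr, and real ANTLR-generated code looks different.

```java
import java.util.List;

class Ll2StatParser {
    enum Tok { ID, EQUALS, COLON, INT, EOF }

    private final List<Tok> tokens;
    private int p = 0; // index of the next unconsumed token

    Ll2StatParser(List<Tok> tokens) {
        this.tokens = tokens;
    }

    // LA(i): the i-th token of lookahead (1-based); EOF once past the end.
    Tok LA(int i) {
        int idx = p + i - 1;
        return idx < tokens.size() ? tokens.get(idx) : Tok.EOF;
    }

    // Consume the current token if it is the expected one, else fail.
    void match(Tok expected) {
        if (LA(1) != expected) {
            throw new IllegalStateException(
                "expected " + expected + " but found " + LA(1) + " at " + p);
        }
        p++;
    }

    // stat : ID '=' expr | ID ':' stat ;
    void stat() {
        if (LA(1) == Tok.ID && LA(2) == Tok.EQUALS) {
            match(Tok.ID); match(Tok.EQUALS); expr();
        } else if (LA(1) == Tok.ID && LA(2) == Tok.COLON) {
            match(Tok.ID); match(Tok.COLON); stat();
        } else {
            throw new IllegalStateException("no viable alternative at " + p);
        }
    }

    // expr : INT ;  (hypothetical stand-in, not from the slides)
    void expr() {
        match(Tok.INT);
    }

    // Convenience wrapper: true if the whole token list is a valid stat.
    static boolean parses(List<Tok> tokens) {
        Ll2StatParser parser = new Ll2StatParser(tokens);
        try {
            parser.stat();
            parser.match(Tok.EOF);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }
}
```

Under these assumptions, the token sequence ID COLON ID EQUALS INT is accepted (a labelled assignment), while ID INT is rejected because neither 2-sequence of lookahead matches.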
9
Lookahead as DFA
void a() {
    int alt=0;
    if ( LA(1)==ID ) {
        if ( LA(2)==EQUALS ) alt=1;
        if ( LA(2)==COLON ) alt=2;
    }
    switch (alt) {
        case 1 : match(ID); match(EQUALS); expr(); break;
        case 2 : match(ID); match(COLON); stat(); break;
        default : error;
    }
}
10
Solution overview

Natural extension to LL(k) lookahead DFA: allow a cyclic DFA that can skip ahead past the modifiers to the class or interface def
Don't approximate the entire CFG with a regex, i.e., don't include the class or interface def rules
Just predict and proceed normally with the LL parse
DFA yields predicted alt number
Grammar actions are not sucked into DFAs and aren't executed during prediction
No need to specify k a priori

11
LL(*) code

Arbitrary cyclic graphs can't be encoded w/o gotos in Java, but here a simple while loop is ok

void a() {
    int alt=0;
    while ( LA(1) in modifier ) consume();
    if ( LA(1)==CLASS ) alt=1;
    if ( LA(1)==INTERFACE ) alt=2;
    switch (alt) {
        case 1 : ...  case 2 : ...  default : error;
    }
}
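
The cyclic-DFA idea can be sketched in runnable form as follows. This is only an illustration under stated assumptions: the class CyclicDfaPredictor, the Tok enum, and the particular set of modifier tokens are invented for the example, and the prediction scans with a separate index instead of consuming tokens, so the parser position is untouched when prediction finishes.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

class CyclicDfaPredictor {
    enum Tok { PUBLIC, STATIC, FINAL, ABSTRACT, CLASS, INTERFACE, ID, EOF }

    // Hypothetical modifier set; the slides do not list the modifiers.
    static final Set<Tok> MODIFIERS =
        EnumSet.of(Tok.PUBLIC, Tok.STATIC, Tok.FINAL, Tok.ABSTRACT);

    // Predicts the alternative of: def : modifier* classDef | modifier* interfaceDef ;
    // Returns 1 for classDef, 2 for interfaceDef, 0 if neither is viable.
    static int predict(List<Tok> tokens, int start) {
        int i = start;
        // The DFA cycle: stay in the same state while modifiers go by.
        while (i < tokens.size() && MODIFIERS.contains(tokens.get(i))) {
            i++;
        }
        if (i >= tokens.size()) {
            return 0;
        }
        switch (tokens.get(i)) {
            case CLASS:
                return 1;
            case INTERFACE:
                return 2;
            default:
                return 0;
        }
    }
}
```

The amount of lookahead used depends on the input: CLASS alone needs one token, while PUBLIC STATIC INTERFACE needs three. No k is fixed in advance.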
12
Isn't that just backtracking?

No. For example, if I can guarantee you will never look ahead more than 10 symbols, it's just LL(10), right?
Not backtracking with the parser. The DFA is smaller and faster, e.g., a DFA predicting expr does not follow the deep call chain the parser does
Don't have to avoid or unroll arbitrary user actions in the grammar!
The DFAs are efficiently coded and automatically throttle down when less lookahead is needed

13
LL(*) DFA Construction Algorithm
14
Algorithm discussion

need suitable grammar representation
sample LL(2) lookahead set computation
LL(*) algorithm outline
sample LL(*) DFA construction

15
Lookahead NFA Construction
a : b X | b Y ;  b : B | ;
[Figure: lookahead NFA for rules a and b]
16
Sample LL(2) Lookahead Set

Now, consider a simple fixed-k lookahead computation algorithm
Depth-first walk of the NFA at the left edge of each alt, popping state off the stack upon end-of-rule. Terminate paths after traversing the k=2nd non-epsilon edge
Lookahead for rule a, alt 1, state sequences:
  4, 2, 21, 16 B 17, 20, 3, pop, 5 X 7
  22, 18, 19, 20, 3, pop, 5 X 7, 13, 1, ...
Yields BX, X
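
A fixed-k lookahead computation of this kind can be sketched directly over grammar rules rather than over NFA states. The sketch below is an assumption-laden simplification: it assumes the grammar is a : b X | b Y ; b : B | (empty), which is inferred from the BX, X result; the class LookaheadSets and its methods are invented names; it only works for grammars without left recursion; and unlike the walk described above it stops at the end of the alternative instead of continuing into whatever follows rule a.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class LookaheadSets {
    // Rule name -> alternatives; each alternative is a list of symbols.
    // Any symbol that is not a rule name is treated as a terminal.
    final Map<String, List<List<String>>> rules = new LinkedHashMap<>();

    // Define a rule; each alternative is a space-separated symbol string ("" = empty alt).
    void rule(String name, String... alts) {
        List<List<String>> parsed = new ArrayList<>();
        for (String alt : alts) {
            List<String> symbols = new ArrayList<>();
            if (!alt.trim().isEmpty()) {
                symbols.addAll(Arrays.asList(alt.trim().split("\\s+")));
            }
            parsed.add(symbols);
        }
        rules.put(name, parsed);
    }

    // Terminal sequences of length at most k that can begin a derivation of `symbols`.
    Set<List<String>> firstK(List<String> symbols, int k) {
        Set<List<String>> result = new LinkedHashSet<>();
        if (k == 0 || symbols.isEmpty()) {
            result.add(new ArrayList<>()); // nothing more to collect on this path
            return result;
        }
        String head = symbols.get(0);
        List<String> rest = symbols.subList(1, symbols.size());
        if (!rules.containsKey(head)) {
            // Terminal: one non-epsilon edge traversed, k-1 left to collect.
            for (List<String> tail : firstK(rest, k - 1)) {
                List<String> seq = new ArrayList<>();
                seq.add(head);
                seq.addAll(tail);
                result.add(seq);
            }
        } else {
            // Rule reference: expand each alternative in place, depth first.
            for (List<String> alt : rules.get(head)) {
                List<String> expanded = new ArrayList<>(alt);
                expanded.addAll(rest);
                result.addAll(firstK(expanded, k));
            }
        }
        return result;
    }
}
```

With the assumed grammar, firstK of the symbols b X with k=2 gives the two sequences B X and X, and b Y gives B Y and Y. The sets are disjoint, so two tokens suffice, whereas at k=1 both alternatives can start with B.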

17
LL(*) Algorithm Outline

idea: perform a breadth-first search of the NFA, carrying the stack context along with it so it knows where to return upon reaching an end-of-rule NFA state
modify the classical NFA-to-DFA conversion (subset construction algorithm)
DFA state encodes the configurations the NFA could be in after having seen the input sequence, including the call invocation stack
NFA configuration (s|alt|context) tracks state, predicted alt, and the rule invocation stack used to get to that state
terminate the algorithm when a state uniquely predicts an alternative or nondeterminism is found: (s|i|ctx) and (s|j|ctx) for the same state s but different alts i,j and same/similar context
verify the DFA is reduced and all alternatives have a predict state
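
The two termination tests in the outline can be made concrete with a small data-structure sketch. Everything here is hypothetical naming (DfaStateAnalysis, Config, uniquelyPredicts, hasConflict), not ANTLR's actual classes, and the conflict check uses exact stack equality only; the "similar context" case mentioned above is not modelled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

class DfaStateAnalysis {
    // One NFA configuration: (state | predicted alt | rule invocation stack).
    static final class Config {
        final int state;
        final int alt;
        final List<Integer> context; // return states of pending rule invocations

        Config(int state, int alt, Integer... context) {
            this.state = state;
            this.alt = alt;
            this.context = Arrays.asList(context);
        }
    }

    // The alternatives still predicted by the configurations of one DFA state.
    static Set<Integer> predictedAlts(List<Config> configs) {
        Set<Integer> alts = new TreeSet<>();
        for (Config c : configs) {
            alts.add(c.alt);
        }
        return alts;
    }

    // Termination case 1: every configuration agrees on a single alternative.
    static boolean uniquelyPredicts(List<Config> configs) {
        return predictedAlts(configs).size() == 1;
    }

    // Termination case 2: same NFA state and same context, but different alts.
    // More lookahead cannot separate such configurations.
    static boolean hasConflict(List<Config> configs) {
        for (int i = 0; i < configs.size(); i++) {
            for (int j = i + 1; j < configs.size(); j++) {
                Config a = configs.get(i);
                Config b = configs.get(j);
                if (a.state == b.state && a.alt != b.alt && a.context.equals(b.context)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A DFA state for which neither test holds still has several candidate alternatives that may yet be told apart, so the construction would keep computing its outgoing transitions.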

18
LL(*) DFA Conversion
Classic DFA
a : b X | b Y ;  b : B | ;
LL(*) DFA
1 alt
same NFA state, diff context
19
Successful termination
a : A X R | A Y S ;
a : (A|A) B ;
[Figures: NFA-derived DFA and LL(*) DFA for each grammar]
Stops at ambiguity or unique prediction
20
Can't see past recursion

LL(*) DFA construction takes the LL stack into consideration, but the resulting DFA will not have a stack; it uses a sequence of states instead
Example weakness (same language, diff grammar)

// works
a : b X | b Y ;  b : A+ ;
// doesn't work
a : b X | b Y ;  b : A | A b ;  // tail recursion
t.g:2:5: Alternative 1: after matching input such as A A A A decision cannot predict what comes next due to recursion overflow to b from b
t.g:2:5: Alternative 2
21
LL(*) analysis fails sometimes

LL(*) algorithm is exponential in the worst case, like the subset construction algorithm
  keeps looking for more lookahead to distinguish alternatives; a problem in big grammars
  doesn't like common recursive prefixes
  w/o a failsafe would not terminate in our lifetime
Workarounds
  manually set fixed k lookahead if possible
  syntactic predicates
  auto-backtracking mode
  refactor the grammar if ambiguous or to reduce lookahead requirements

22
Auto-Backtracking

Idea: when LL(*) analysis fails, simply backtrack at runtime to figure it out
  newbie or rapid prototyping mode
  people dump the craziest stuff into ANTLR
  impl: add a syntactic predicate to each alt's left edge
LL(*) alg. uses preds only in nondeterministic states (NFA config. extended to include semantic context)
Use fixed k lookahead + backtracking to get the grammar working, then optimize with LL(*)
ANTLR v3 can memoize parsing results to guarantee linear parsing time
Demo: Java parsing with and w/o memoization

23
LL(*) + Auto-Backtracking
grammar r;
options { backtrack=true; }
s : e '' | e '' ;
e : '(' e ')' | INT ;
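
To show what memoization buys a backtracking parser, here is a minimal hand-written sketch for a grammar of the same shape as the one above. It is not ANTLR output. The terminators '%' and '!' are hypothetical choices (the literals in the transcript are lost), INT is simplified to a single digit, and the names MemoizingBacktracker, eMemo, and eBodyRuns are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

class MemoizingBacktracker {
    static final int FAILED = -1;

    private final String input;
    private final boolean memoize;
    // Memo table for rule e: start index -> index just past the match, or FAILED.
    private final Map<Integer, Integer> eMemo = new HashMap<>();
    int eBodyRuns = 0; // how many times rule e actually executed its body

    MemoizingBacktracker(String input, boolean memoize) {
        this.input = input;
        this.memoize = memoize;
    }

    // s : e '%' | e '!' ;  alternatives are tried in order, rewinding to index 0.
    boolean s() {
        for (char terminator : new char[] {'%', '!'}) {
            int stop = e(0);
            if (stop != FAILED && stop + 1 == input.length()
                    && input.charAt(stop) == terminator) {
                return true;
            }
        }
        return false;
    }

    // e : '(' e ')' | INT ;  returns the index just past the match, or FAILED.
    int e(int start) {
        if (memoize && eMemo.containsKey(start)) {
            return eMemo.get(start); // reuse the earlier result, no reparse
        }
        eBodyRuns++;
        int stop = FAILED;
        if (start < input.length() && input.charAt(start) == '(') {
            int inner = e(start + 1);
            if (inner != FAILED && inner < input.length() && input.charAt(inner) == ')') {
                stop = inner + 1;
            }
        } else if (start < input.length() && Character.isDigit(input.charAt(start))) {
            stop = start + 1;
        }
        if (memoize) {
            eMemo.put(start, stop);
        }
        return stop;
    }
}
```

On the input ((1))! the first alternative parses e and then fails on the terminator. With memoization on, the second alternative finds the result for e at index 0 in the table, so eBodyRuns ends at 3; with it off, the nested e is parsed again and eBodyRuns ends at 6. The saving grows with nesting depth, at the cost of the memo table's heap.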
24
Java 1.4 Grammar Results

Tweaked version of Rats!'s Java grammar
99 rules, 86 decisions
  LL(1) decisions: 68 (excluding 2 that backtrack)
  LL(2) decisions: 12
  LL(*) decisions: 4
  Backtracking decisions: 2
No heap wasted on memoization (memo. off)!
If limited to k=1, 10 decisions backtrack
Prelim. parsing profile on java/awt/Container.java
  LL(*) lookahead range 1..8 tokens, average 2
  Backtracking range 1..8, average 3.5

25
Can we classify LL(*) strength?

No strict ordering with CFG (a la GLR)
Grammar: context-sensitive AnBnCn?

s (a) A b EOF A b EOF a A a
B b B b C

production forces decision
else predicate (a) not used
language unaffected

matches AnBn
matches BnCn
Adapted from Ford's PEG paper
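
How a syntactic predicate can recognize the non-context-free language AnBnCn may be easier to see in code. The sketch below is a hand-written approximation, not the grammar on the slide: it assumes rule shapes a : A a? B and b : B b? C, and it assumes the predicate checks that a matches and is not followed by another B, in the style of the PEG formulation. The class AnBnCnRecognizer and the helper nested are invented names.

```java
class AnBnCnRecognizer {
    static final int FAILED = -1;

    // a : A a? B ;  matches A^n B^n, returns the index just past the match.
    static int a(String in, int p) {
        return nested(in, p, 'A', 'B');
    }

    // b : B b? C ;  matches B^n C^n, returns the index just past the match.
    static int b(String in, int p) {
        return nested(in, p, 'B', 'C');
    }

    // Shared shape of both rules: open nested? close
    private static int nested(String in, int p, char open, char close) {
        if (p >= in.length() || in.charAt(p) != open) {
            return FAILED;
        }
        int inner = nested(in, p + 1, open, close); // the optional recursive part
        int next = (inner == FAILED) ? p + 1 : inner;
        if (next < in.length() && in.charAt(next) == close) {
            return next + 1;
        }
        return FAILED;
    }

    // s : (a, not followed by B)=> A+ b EOF ;
    static boolean s(String in) {
        // Syntactic predicate: speculatively match rule a from the start.
        // Nothing is consumed; only the outcome of the attempt matters.
        int afterA = a(in, 0);
        if (afterA == FAILED) {
            return false;
        }
        if (afterA < in.length() && in.charAt(afterA) == 'B') {
            return false; // more Bs than As
        }
        // The real parse: skip the As, then b must cover the rest of the input.
        int p = 0;
        while (p < in.length() && in.charAt(p) == 'A') {
            p++;
        }
        return b(in, p) == in.length();
    }
}
```

The predicate enforces equal numbers of As and Bs, and rule b enforces equal numbers of Bs and Cs, so AABBCC is accepted while AABBBCCC and AABBC are rejected. Neither check alone is more than context-free; the power comes from applying both to overlapping input.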
26
LL(*) vs LR(k)

LR(k), even with k=1, is generally more powerful than LL(*), or at least more efficient for the same grammar, but there is no strict ordering: add epsilon rule refs (such as those derived from adding actions) to the left edge of our grammar and it's not LR(k) for fixed k

a b A X R c A Y S b c
LL(*) but not LR(k) due to reduce-reduce conflict
27
Summary and Conclusions

Brazen assertion: LL(*) + syntactic predicates is the most powerful parsing strategy that is accessible/attractive to the average programmer
LL(*) has the benefits, flexibility, and simplicity of LL but is much stronger; supports natural grammars
Doesn't alter the recursive-descent parser itself at all; just enhances the predictive capabilities.
Unifies lexing, parsing, tree parsing
Basic algorithm is not that complicated, but making it real and useful is interesting
Beta release: http://www.antlr.org/v3
BSD license