Title: Syntax Analysis
Syntax Analysis
- From Chapter 4, The Dragon Book, 2nd ed.
Contents
- 4.1 Introduction
- 4.2 Context-Free Grammar
- 4.3 Writing a Grammar
- 4.4 Top-Down Parsing
- 4.5 Bottom-Up Parsing
- 4.6 Introduction to LR Parsing: Simple LR
- 4.7 More Powerful LR Parsers
- 4.8 Using Ambiguous Grammars
- 4.9 Parser Generators
4.1 Introduction
- Examine the way the parser fits into a typical
compiler.
4.1.1 The Role of the Parser
- Three general types of parsers for grammars:
  - universal,
    - e.g., the Cocke-Younger-Kasami (CYK) algorithm and Earley's algorithm
    - too inefficient to use in production compilers
  - top-down,
  - bottom-up
- The most efficient top-down and bottom-up methods work only for subclasses of grammars, but several of these classes, particularly the LL and LR grammars, are expressive enough to describe most of the syntactic constructs in modern programming languages.
- Parsers implemented by hand often use LL grammars; for example, the predictive-parsing approach of Sec. 2.4.2 works for LL grammars.
- Parsers for the larger class of LR grammars are usually constructed using automated tools.
4.1.2 Representative Grammars
- Associativity and precedence are captured in grammar (4.1), an LR grammar suitable for bottom-up parsing.
- A non-left-recursive variant of grammar (4.1) is used for top-down parsing.
- Grammar (4.3) below is useful for illustrating techniques for handling ambiguities during parsing:

E → E + E | E * E | ( E ) | id        (4.3)
4.1.3 Syntax Error Handling
- Common programming errors can occur at many different levels.
- Lexical errors
  - misspellings of identifiers, keywords, or operators, and missing quotes around text intended as a string
- Syntactic errors
  - misplaced semicolons or extra or missing braces
  - in C or Java, the appearance of a case statement without an enclosing switch
- Semantic errors
  - type mismatches between operators and operands
  - a return statement in a Java method with result type void
- Logic errors
  - anything from incorrect reasoning on the part of the programmer to the use in a C program of the assignment operator = instead of the comparison operator ==
4.1.3 Syntax Error Handling
- The error handler in a parser has goals that are simple to state but challenging to realize:
  - Report the presence of errors clearly and accurately.
  - Recover from each error quickly enough to detect subsequent errors.
  - Add minimal overhead to the processing of correct programs.
4.1.4 Error-Recovery Strategies
- Once an error is detected, how should the parser recover?
- Although no strategy has proven itself universally acceptable, a few methods have broad applicability.
- Panic-Mode Recovery
  - On discovering an error, the parser discards input symbols one at a time until one of a designated set of synchronizing tokens is found. (A small sketch appears after this list.)
  - The synchronizing tokens are usually delimiters, such as the semicolon or }, whose role in the source program is clear and unambiguous.
- Phrase-Level Recovery
  - On discovering an error, a parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue.
- Error Productions
- Global Correction
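Below is a minimal sketch of panic-mode recovery. The token list, the synchronizing set, and the function name are illustrative assumptions, not from the book.

# Minimal sketch of panic-mode recovery: on an error, skip input tokens
# until one of a designated set of synchronizing tokens is found.
# The token names and the sync set here are illustrative assumptions.

SYNC_TOKENS = {";", "}"}          # delimiters with a clear, unambiguous role

def panic_mode_recover(tokens, pos):
    """Skip tokens starting at pos until a synchronizing token is reached.

    Returns the position just past the synchronizing token, or len(tokens)
    if the input is exhausted first.
    """
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1                   # discard one input symbol at a time
    return pos + 1 if pos < len(tokens) else pos

# Example: after an error inside the first statement, resume after ';'
tokens = ["id", "=", "id", "+", ";", "id", "=", "id", ";"]
print(panic_mode_recover(tokens, 2))   # -> 5, parsing resumes at the next statement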
4.2 Context-Free Grammars
- Review the definition of a context-free grammar.
- Introduce terminology for talking about parsing.
- Derivation
- Parse tree and derivations
- Ambiguity
4.2.1 The Formal Definition of a Context-Free Grammar
- See Section 2.2
- Example 4.5
4.2.2 Notational Conventions
- pp. 198-199, conventions 1 to 7
- Example 4.6
- Using these conventions, the grammar of Example 4.5 can be rewritten as:

E → E + T | E - T | T
T → T * F | T / F | F
F → ( E ) | id
4.2.3 Derivations

E → E + E | E * E | - E | ( E ) | id        (4.7)

E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( id + E ) ⇒ - ( id + id )        (4.8)

E ⇒ - E ⇒ - ( E ) ⇒ - ( E + E ) ⇒ - ( E + id ) ⇒ - ( id + id )        (4.9)

- (4.8) is a leftmost derivation; (4.9) is a rightmost derivation.
4.2.4 Parse Trees and Derivations
4.2.5 Ambiguity
- Section 2.2.4
- Example 4.11
- Two distinct leftmost derivations exist for the sentence id + id * id.
4.2.6 Verifying the Language Generated by a Grammar
- A proof that a grammar G generates a language L has two parts:
  - show that every string generated by G is in L, and conversely that
  - every string in L can indeed be generated by G.
- Example 4.12
  - Consider the grammar S → ( S ) S | ε, which generates all strings of balanced parentheses, and only such strings.
  - We shall show first that every sentence derivable from S is balanced, and then that every balanced string is derivable from S.
  - Both parts are shown by induction.
4.2.7 Context-Free Grammars vs. Regular Expressions
- Grammars are a more powerful notation than regular expressions.
- Every construct that can be described by a regular expression can be described by a grammar, but not vice-versa.
- For example, the regular expression (a|b)*abb and the grammar

A0 → a A0 | b A0 | a A1
A1 → b A2
A2 → b A3
A3 → ε

  describe the same language, the set of strings of a's and b's ending in abb.
- The language L = { a^n b^n | n ≥ 1 } is a prototypical example of a language that can be described by a grammar but not by a regular expression.
  - Intuitively, finite automata cannot count.
4.3 Writing a Grammar
- Grammars are capable of describing most, but not all, of the syntax of programming languages.
  - E.g., the requirement that identifiers be declared before they are used cannot be described by a context-free grammar.
- This section covers:
  - how to divide work between a lexical analyzer and a parser,
  - transformations that can be applied to get a grammar more suitable for parsing: ambiguity elimination, left-recursion elimination, and left factoring,
  - programming-language constructs that cannot be described by any context-free grammar.
4.3.1 Lexical vs. Syntactic Analysis
- Reasons why regular expressions are used to define the lexical syntax of a language:
  - Separating the syntactic structure of a language into lexical and non-lexical parts provides a convenient way of modularizing the front end of a compiler into two manageable-size components.
  - The lexical rules of a language are frequently quite simple, and to describe them we do not need a notation as powerful as grammars.
  - Regular expressions generally provide a more concise and easier-to-understand notation for tokens than grammars.
  - More efficient lexical analyzers can be constructed automatically from regular expressions than from arbitrary grammars.
4.3.2 Eliminating Ambiguity
4.3.3 Elimination of Left Recursion
- Example: the grammar

S → A a | b
A → A c | S d | ε

  After substituting the S-production bodies into A → S d:

A → A c | A a d | b d | ε

  After eliminating the immediate left recursion:

S → A a | b
A → b d A' | A'
A' → c A' | a d A' | ε
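The transformation above follows the general rule that A → Aα | β becomes A → βA', A' → αA' | ε. A small sketch of that rule for a single nonterminal is given below; the dictionary-based grammar representation and the "ε" marker are assumptions made for illustration.

# Sketch: eliminate immediate left recursion for one nonterminal.
# Grammar is a dict mapping a nonterminal to a list of bodies,
# each body a list of symbols; "ε" marks the empty body. (Representation
# is an assumption for illustration, not the book's notation.)

def eliminate_immediate_left_recursion(grammar, a):
    a_prime = a + "'"
    recursive = [body[1:] for body in grammar[a] if body and body[0] == a]
    others    = [body      for body in grammar[a] if not body or body[0] != a]
    if not recursive:
        return grammar                     # nothing to do
    # A -> beta A'   for each non-recursive body beta
    grammar[a] = [([] if body == ["ε"] else body) + [a_prime] for body in others]
    # A' -> alpha A' | ε   for each recursive body A alpha
    grammar[a_prime] = [alpha + [a_prime] for alpha in recursive] + [["ε"]]
    return grammar

g = {"A": [["A", "c"], ["A", "a", "d"], ["b", "d"], ["ε"]]}
print(eliminate_immediate_left_recursion(g, "A"))
# A  -> b d A' | A'
# A' -> c A' | a d A' | ε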
4.3.4 Left Factoring

stmt → if expr then stmt else stmt | if expr then stmt

- In general, a left-factoring step replaces A → α β1 | α β2 by:

A → α A'
A' → β1 | β2
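A sketch of one left-factoring step is given below. It groups alternatives by a one-symbol common prefix only (enough to show the idea on the if-statement example above); the grammar representation is the same illustrative one used in the left-recursion sketch.

# Sketch: one round of left factoring on a single nonterminal, grouping
# alternatives that share the same first symbol. Only a one-symbol common
# prefix is factored here; the full algorithm repeats with the longest
# common prefix. Representation (dict of lists of symbol lists) is assumed.

from collections import defaultdict

def left_factor_once(grammar, a):
    groups = defaultdict(list)
    for body in grammar[a]:
        groups[body[0] if body else "ε"].append(body)
    new_bodies, counter = [], 0
    for first, bodies in groups.items():
        if len(bodies) == 1 or first == "ε":
            new_bodies.extend(bodies)          # nothing to factor
            continue
        counter += 1
        a_new = a + "'" * counter              # fresh name: A', A'', ...
        new_bodies.append([first, a_new])      # A  -> α A'
        grammar[a_new] = [body[1:] or ["ε"] for body in bodies]  # A' -> β1 | β2
    grammar[a] = new_bodies
    return grammar

g = {"stmt": [["if", "expr", "then", "stmt", "else", "stmt"],
              ["if", "expr", "then", "stmt"]]}
print(left_factor_once(g, "stmt"))
# stmt  -> if stmt'
# stmt' -> expr then stmt else stmt | expr then stmt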
4.4 Top-Down Parsing
- Top-down parsing can be viewed as the problem of constructing a parse tree for the input string, starting from the root and creating the nodes of the parse tree in preorder.
- Example 4.27 uses the non-left-recursive expression grammar:

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id
4.4.1 Recursive-Descent Parsing
- Example 4.29
  - Grammar:
    S → c A d
    A → a b | a
  - Input: w = cad (a backtracking sketch follows)
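A backtracking recursive-descent sketch for this example is given below; the function names and string-based input are illustrative assumptions.

# Sketch of recursive-descent parsing with backtracking for
#   S -> c A d,   A -> a b | a
# on input w = "cad". Names are illustrative, not from the book.

def parse_S(w, pos):
    # S -> c A d
    if pos < len(w) and w[pos] == "c":
        after_a = parse_A(w, pos + 1)
        if after_a is not None and after_a < len(w) and w[after_a] == "d":
            return after_a + 1
    return None

def parse_A(w, pos):
    # Try A -> a b first; on failure, backtrack and try A -> a.
    if pos < len(w) and w[pos] == "a":
        if pos + 1 < len(w) and w[pos + 1] == "b":
            return pos + 2          # matched "a b"
        return pos + 1              # backtrack to the alternative A -> a
    return None

w = "cad"
print(parse_S(w, 0) == len(w))      # True: the whole input is derived from S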
4.4.2 FIRST and FOLLOW
- During top-down parsing, FIRST and FOLLOW allow us to choose which production to apply, based on the next input symbol.
- Define FIRST(α), where α is any string of grammar symbols, to be the set of terminals that begin strings derived from α.
  - If α ⇒* ε, then ε is also in FIRST(α).
- How FIRST can be used during predictive parsing:
  - Consider A → α | β, where FIRST(α) and FIRST(β) are disjoint sets.
  - If the next input symbol is a, and a is in FIRST(α), then choose the production A → α.
4.4.2 FIRST and FOLLOW
- Define FOLLOW(A), for a nonterminal A, to be the set of terminals a that can appear immediately to the right of A in some sentential form; that is, the set of terminals a such that there exists a derivation of the form S ⇒* αAaβ for some α and β, as in Fig. 4.14.
4.4.2 FIRST and FOLLOW
- Rules for computing FIRST(X) for all grammar symbols X:
  - If X is a terminal, then FIRST(X) = { X }.
  - If X is a nonterminal and X → Y1 Y2 ... Yk is a production for some k ≥ 1, then place a in FIRST(X) if for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), ..., FIRST(Yi-1); that is, Y1 Y2 ... Yi-1 ⇒* ε. If ε is in FIRST(Yj) for all j = 1, 2, ..., k, then add ε to FIRST(X).
    - For example, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not derive ε, then we add nothing more to FIRST(X), but if Y1 ⇒* ε, then we add FIRST(Y2), and so on.
  - If X → ε is a production, then add ε to FIRST(X).
- To compute FIRST for any string X1 X2 ... Xn:
  - add all non-ε symbols of FIRST(X1);
  - add all non-ε symbols of FIRST(X2) if ε is in FIRST(X1);
  - add all non-ε symbols of FIRST(X3) if ε is in both FIRST(X1) and FIRST(X2);
  - and so on; finally, add ε if ε is in every FIRST(Xi).
4.4.2 FIRST and FOLLOW
- Rules for computing FOLLOW(A) for all nonterminals A:
  - Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
  - If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
  - If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
- A sketch of both computations follows.
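Below is a compact sketch of the FIRST and FOLLOW computations as iterative fixed points, using the expression grammar of Example 4.30. The dictionary representation and the "ε"/"$" markers are assumptions made for illustration.

# Sketch: iterate the FIRST and FOLLOW rules to a fixed point.
# Grammar: dict nonterminal -> list of bodies (lists of symbols);
# "ε" is the empty string, "$" the input endmarker.  Illustrative only.

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["ε"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["ε"]],
    "F":  [["(", "E", ")"], ["id"]],
}
NONTERMINALS = set(GRAMMAR)

def first_of_string(symbols, first):
    """FIRST of a string X1 X2 ... Xn, given FIRST sets of single symbols."""
    result = set()
    for x in symbols:
        fx = first[x] if x in NONTERMINALS else {x}
        result |= fx - {"ε"}
        if "ε" not in fx:
            return result
    return result | {"ε"}           # every Xi can derive ε

def compute_first():
    first = {a: set() for a in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for a, bodies in GRAMMAR.items():
            for body in bodies:
                syms = [] if body == ["ε"] else body
                new = first_of_string(syms, first)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first

def compute_follow(first, start="E"):
    follow = {a: set() for a in NONTERMINALS}
    follow[start].add("$")          # rule 1: $ is in FOLLOW of the start symbol
    changed = True
    while changed:
        changed = False
        for a, bodies in GRAMMAR.items():
            for body in bodies:
                for i, b in enumerate(body):
                    if b not in NONTERMINALS:
                        continue
                    beta = body[i + 1:]
                    fb = first_of_string(beta, first)
                    new = fb - {"ε"}                          # rule 2
                    if "ε" in fb:
                        new |= follow[a]                      # rule 3
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow

FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
print(FIRST["E'"], FOLLOW["F"])     # {'+', 'ε'} and {'+', '*', ')', '$'}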
4.4.2 FIRST and FOLLOW
- Example 4.30, for the grammar:

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

- FIRST(F) = FIRST(T) = FIRST(E) = { (, id }
- FIRST(E') = { +, ε }
- FIRST(T') = { *, ε }
- FOLLOW(E) = FOLLOW(E') = { ), $ }
- FOLLOW(T) = FOLLOW(T') = { +, ), $ }
- FOLLOW(F) = { +, *, ), $ }
4.4.3 LL(1) Grammars
- Predictive parsers, that is, recursive-descent parsers needing no backtracking, can be constructed for the class of grammars called LL(1):
  - the first L stands for scanning the input from left to right, the second L for producing a leftmost derivation, and the 1 for using one input symbol of lookahead at each step to make parsing-action decisions.
- The class of LL(1) grammars is rich enough to cover most programming constructs, although care is needed in writing a suitable grammar for the source language. For example, no left-recursive or ambiguous grammar can be LL(1).
4.4.3 LL(1) Grammars
- A grammar G is LL(1) if and only if whenever A → α | β are two distinct productions of G, the following conditions hold:
  1. For no terminal a do both α and β derive strings beginning with a.
  2. At most one of α and β can derive the empty string.
  3. If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A). Likewise, if α ⇒* ε, then β does not derive any string beginning with a terminal in FOLLOW(A).
- Conditions 1 and 2 say that FIRST(α) and FIRST(β) are disjoint sets.
- Condition 3 says that if ε is in FIRST(β), then FIRST(α) and FOLLOW(A) are disjoint sets, and likewise if ε is in FIRST(α).
4.4.3 LL(1) Grammars
- Predictive parsers can be constructed for LL(1) grammars since the proper production to apply for a nonterminal can be selected by looking only at the current input symbol.
- For example, the next input symbol selects the proper production in:

stmt → if ( expr ) stmt else stmt
     | while ( expr ) stmt
     | { stmt_list }
4.4.3 LL(1) Grammars
- Algorithm 4.31: Construction of a predictive parsing table.
  - INPUT: Grammar G.
  - OUTPUT: Parsing table M.
  - METHOD: For each production A → α of the grammar, do the following:
    1. For each terminal a in FIRST(α), add A → α to M[A, a].
    2. If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add A → α to M[A, b]. If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $] as well.
- A sketch of this construction follows.
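The sketch below assumes a FIRST-of-a-string helper and the FOLLOW sets are already available, for example from the FIRST/FOLLOW sketch in Section 4.4.2 above; the table representation is an illustrative assumption.

# Sketch of Algorithm 4.31: build the predictive parsing table M[A, a].
# first_of_string(body) and follow[A] are assumed to be available, e.g.
# from the FIRST/FOLLOW sketch above. Conflicting entries mean the
# grammar is not LL(1).

def build_ll1_table(grammar, first_of_string, follow):
    table = {}                              # (nonterminal, terminal) -> production body
    def add(a, terminal, body):
        key = (a, terminal)
        if key in table and table[key] != body:
            raise ValueError(f"grammar is not LL(1): conflict at M[{a}, {terminal}]")
        table[key] = body

    for a, bodies in grammar.items():
        for body in bodies:
            syms = [] if body == ["ε"] else body
            fb = first_of_string(syms)
            for terminal in fb - {"ε"}:     # rule 1: a in FIRST(α)
                add(a, terminal, body)
            if "ε" in fb:                   # rule 2: ε in FIRST(α), use FOLLOW(A)
                for b in follow[a]:         # includes $ if $ is in FOLLOW(A)
                    add(a, b, body)
    return table

# Usage with the earlier FIRST/FOLLOW sketch (illustrative):
#   M = build_ll1_table(GRAMMAR, lambda syms: first_of_string(syms, FIRST), FOLLOW)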
4.4.3 LL(1) Grammars
- Example 4.30 (continued), for the grammar:

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

- FIRST(F) = FIRST(T) = FIRST(E) = { (, id }
- FIRST(E') = { +, ε }
- FIRST(T') = { *, ε }
- FOLLOW(E) = FOLLOW(E') = { ), $ }
- FOLLOW(T) = FOLLOW(T') = { +, ), $ }
- FOLLOW(F) = { +, *, ), $ }
4.4.3 LL(1) Grammars
- For every LL(1) grammar, each parsing-table entry uniquely identifies a production or signals an error.
- Example 4.33: the dangling-else problem

S  → i E t S S' | a
S' → e S | ε
E  → b
4.4.4 Nonrecursive Predictive Parsing
- A nonrecursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via recursive calls; a sketch of the table-driven loop follows.
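Below is a sketch of the table-driven predictive parsing loop with an explicit stack; the table format matches the LL(1)-table sketch above, and the "$"/"ε" markers are illustrative assumptions.

# Sketch of nonrecursive (table-driven) predictive parsing. The stack holds
# grammar symbols with the endmarker $ at the bottom; table is the LL(1)
# table M[(A, a)] -> production body from the previous sketch. Illustrative.

def ll1_parse(tokens, table, start, nonterminals):
    tokens = tokens + ["$"]
    stack = ["$", start]
    pos = 0
    while stack[-1] != "$":
        top, a = stack[-1], tokens[pos]
        if top == a:                        # match a terminal
            stack.pop()
            pos += 1
        elif top not in nonterminals:
            raise SyntaxError(f"expected {top}, found {a}")
        elif (top, a) not in table:
            raise SyntaxError(f"no production for M[{top}, {a}]")
        else:                               # expand the nonterminal
            body = table[(top, a)]
            print(f"output {top} -> {' '.join(body)}")
            stack.pop()
            if body != ["ε"]:
                stack.extend(reversed(body))
    return pos == len(tokens) - 1           # all input consumed

# e.g. ll1_parse(["id", "+", "id", "*", "id"], M, "E", set(GRAMMAR))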
4.4.5 Error Recovery in Predictive Parsing
- Panic Mode
- Phrase-level Recovery
4.5 Bottom-Up Parsing
- Introduce a general style of bottom-up parsing known as shift-reduce parsing.
- Sections 4.6 and 4.7 introduce the LR grammars, the largest class of grammars for which shift-reduce parsers can be built.
4.5.1 Reductions
- Think of bottom-up parsing as the process of reducing a string w to the start symbol of the grammar.
- At each reduction step, a specific substring matching the body of a production is replaced by the nonterminal at the head of that production.
- The key decisions during bottom-up parsing, as the parse proceeds, are:
  - when to reduce, and
  - what production to apply.
- Example 4.37
  - A reduction is the reverse of a step in a derivation.
  - The following derivation corresponds to the parse in Fig. 4.25:

E ⇒ T ⇒ T * F ⇒ T * id ⇒ F * id ⇒ id * id
4.5.2 Handle Pruning
- Bottom-up parsing during a left-to-right scan of the input constructs a rightmost derivation in reverse.
- Informally, a handle is a substring that matches the body of a production, and whose reduction represents one step along the reverse of a rightmost derivation.
4.5.2 Handle Pruning
- Formally, if S ⇒*rm αAw ⇒rm αβw (rightmost derivation steps), then production A → β in the position following α is a handle of αβw.
- Alternatively, a handle of a right-sentential form γ is a production A → β and a position in γ where β may be found, such that replacing β at that position by A produces the previous right-sentential form in a rightmost derivation of γ.
- A rightmost derivation in reverse can be obtained by handle pruning.
4.5.3 Shift-Reduce Parsing
- A shift-reduce parser can make four possible actions: shift, reduce, accept, and error.
- The use of a stack in shift-reduce parsing is justified by an important fact:
  - the handle will always eventually appear on top of the stack, never inside it.
4.5.4 Conflicts During Shift-Reduce Parsing
- There are context-free grammars for which shift-reduce parsing cannot be used:
  - shift/reduce conflicts
  - reduce/reduce conflicts
- Example 4.38: the dangling-else grammar

stmt → if expr then stmt
     | if expr then stmt else stmt
     | other

- A shift/reduce conflict arises in the configuration

STACK: ... if expr then stmt        INPUT: else ...

  because the parser cannot tell whether to reduce if expr then stmt or to shift the else.
4.5.4 Conflicts During Shift-Reduce Parsing
- Example 4.39
  - For the statement p(i, j), which appears as the token stream id ( id , id ), after shifting the first three tokens onto the stack a shift-reduce parser would be in the configuration

STACK: ... id ( id        INPUT: , id ) ...

  - A reduce/reduce conflict occurs: it is not evident which production should be used to reduce the id on top of the stack.
4.6 Introduction to LR Parsing: Simple LR
- The most prevalent type of bottom-up parser today is based on a concept called LR(k) parsing:
  - L stands for left-to-right scanning of the input,
  - R for constructing a rightmost derivation in reverse,
  - k for the number of input symbols of lookahead used in making parsing decisions.
4.6.1 Why LR Parsers?
- A grammar for which we can construct a parsing table using one of the methods in Sections 4.6 and 4.7 is said to be an LR grammar.
- For a grammar to be LR, it is sufficient that a left-to-right shift-reduce parser be able to recognize handles of right-sentential forms when they appear on top of the stack.
4.6.1 Why LR Parsers?
- LR parsing is attractive for a variety of reasons:
  - LR parsers can be constructed to recognize virtually all programming-language constructs for which context-free grammars can be written.
  - The LR-parsing method is the most general nonbacktracking shift-reduce parsing method known, yet it can be implemented as efficiently as other, more primitive shift-reduce methods.
  - An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.
  - The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive or LL methods.
- The principal drawback of the LR method is that it is too much work to construct an LR parser by hand for a typical programming-language grammar.
  - An LR parser generator is needed; see Yacc, Section 4.9.
4.6.2 Items and the LR(0) Automaton
- How does a shift-reduce parser know when to shift and when to reduce?
- An LR parser makes shift-reduce decisions by maintaining states to keep track of where we are in a parse.
- States represent sets of items.
- An LR(0) item (item for short) of a grammar G is a production of G with a dot at some position of the body.
- For example, production A → XYZ yields the four items:

A → ·XYZ
A → X·YZ
A → XY·Z
A → XYZ·

- The production A → ε generates only one item, A → ·.
4.6.2 Items and the LR(0) Automaton
- An item indicates how much of a production we have seen at a given point in the parsing process, as in A → ·XYZ, A → X·YZ, and A → XYZ·.
- One collection of sets of LR(0) items, called the canonical LR(0) collection, provides the basis for constructing a deterministic finite automaton that is used to make parsing decisions.
- Such an automaton is called an LR(0) automaton.
- Items in the shaded parts are nonkernel items; the others are kernel items.
4.6.2 Items and the LR(0) Automaton
- To construct the canonical LR(0) collection for a grammar, we define an augmented grammar and two functions, CLOSURE and GOTO.
- Augmented grammar
  - If G is a grammar with start symbol S, then G', the augmented grammar for G, is G with a new start symbol S' and production S' → S.
  - This new starting production indicates to the parser when it should stop parsing and announce acceptance of the input.
4.6.2 Items and the LR(0) Automaton
- Closure of Item Sets
  - If I is a set of items for a grammar G, then CLOSURE(I) is the set of items constructed from I by two rules:
    1. Initially, add every item in I to CLOSURE(I).
    2. If A → α·Bβ is in CLOSURE(I) and B → γ is a production, then add the item B → ·γ to CLOSURE(I), if it is not already there. Apply this rule until no more new items can be added to CLOSURE(I).
- Example 4.40: If I is the set containing the one item [E' → ·E], then CLOSURE(I) contains the set of items I0 in Fig. 4.31.
4.6.2 Items and the LR(0) Automaton
- The Function GOTO
  - GOTO(I, X) is defined to be the closure of the set of all items [A → αX·β] such that [A → α·Xβ] is in I.
- Example 4.41: If I contains the two items [E' → E·] and [E → E·+T], then GOTO(I, +) contains the items:

E → E + ·T
T → ·T * F
T → ·F
F → ·( E )
F → ·id
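A sketch of CLOSURE and GOTO on LR(0) items is given below, with an item represented as a (head, body, dot-position) tuple. The augmented expression grammar and this representation are illustrative assumptions.

# Sketch of CLOSURE and GOTO for LR(0) items. An item is a tuple
# (head, body, dot) meaning head -> body with the dot before body[dot].
# The grammar dict and representation are illustrative assumptions.

AUG_GRAMMAR = {
    "E'": [("E",)],
    "E":  [("E", "+", "T"), ("T",)],
    "T":  [("T", "*", "F"), ("F",)],
    "F":  [("(", "E", ")"), ("id",)],
}

def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(result):
            if dot < len(body) and body[dot] in AUG_GRAMMAR:   # dot before a nonterminal B
                for prod in AUG_GRAMMAR[body[dot]]:
                    item = (body[dot], prod, 0)                # add B -> ·γ
                    if item not in result:
                        result.add(item)
                        changed = True
    return frozenset(result)

def goto(items, x):
    moved = {(head, body, dot + 1)
             for head, body, dot in items
             if dot < len(body) and body[dot] == x}            # advance the dot over X
    return closure(moved)

I = closure({("E'", ("E",), 1), ("E", ("E", "+", "T"), 1)})    # E' -> E·, E -> E·+T
print(sorted(goto(I, "+")))
# contains E -> E+·T, T -> ·T*F, T -> ·F, F -> ·(E), F -> ·id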
4.6.2 Items and the LR(0) Automaton
- Use of the LR(0) Automaton
  - The central idea behind Simple LR, or SLR, parsing is the construction from the grammar of the LR(0) automaton.
4.6.3 The LR-Parsing Algorithm
- Structure of the LR Parsing Table
  - The parsing table consists of two parts, ACTION and GOTO.
  - The ACTION function takes as arguments a state i and a terminal a (or $, the input endmarker). The value of ACTION[i, a] can have one of four forms:
    1. Shift j, where j is a state.
    2. Reduce A → β.
    3. Accept.
    4. Error.
  - We extend the GOTO function, defined on sets of items, to states: if GOTO[Ii, A] = Ij, then GOTO also maps state i and nonterminal A to state j.
4.6.3 The LR-Parsing Algorithm
- LR-Parser Configurations
  - A configuration of an LR parser is a pair

(s0 s1 ... sm, ai ai+1 ... an $)

    where the first component is the stack contents (top on the right), and the second component is the remaining input.
4.6.3 The LR-Parsing Algorithm
- Algorithm 4.44: LR-parsing algorithm.
  - INPUT: An input string w and an LR-parsing table with functions ACTION and GOTO for a grammar G.
  - OUTPUT: If w is in L(G), the reduction steps of a bottom-up parse for w; otherwise, an error indication.
  - METHOD: Initially, the parser has s0 on its stack, where s0 is the initial state, and w$ in the input buffer. The parser then executes the program sketched below.
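Below is a sketch of the driver loop of Algorithm 4.44. The encoding of the ACTION and GOTO tables is an illustrative assumption, not the book's notation.

# Sketch of the LR-parsing driver (Algorithm 4.44). The encoding of table
# entries is an assumption: action[state][terminal] is ("shift", j),
# ("reduce", A, body), or ("accept",); goto_table[state][nonterminal] is a state.

def lr_parse(tokens, action, goto_table, start_state=0):
    tokens = tokens + ["$"]
    stack = [start_state]              # stack of states; s0 is on the bottom
    pos = 0
    while True:
        state, a = stack[-1], tokens[pos]
        entry = action[state].get(a)
        if entry is None:
            raise SyntaxError(f"error in state {state} on input {a}")
        if entry[0] == "shift":
            stack.append(entry[1])     # push state j, advance the input
            pos += 1
        elif entry[0] == "reduce":
            _, head, body = entry
            for _ in body:             # pop |body| states
                stack.pop()
            stack.append(goto_table[stack[-1]][head])
            print(f"reduce by {head} -> {' '.join(body)}")
        else:                          # ("accept",)
            return True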
4.6.4 Constructing SLR-Parsing Tables
- Algorithm 4.46: Constructing an SLR-parsing table.
  - INPUT: An augmented grammar G'.
  - OUTPUT: The SLR-parsing table functions ACTION and GOTO for G'.
  - METHOD:
    1. Construct C = { I0, I1, ..., In }, the collection of sets of LR(0) items for G'.
    2. State i is constructed from Ii. The parsing actions for state i are determined as follows:
       (a) If [A → α·aβ] is in Ii and GOTO(Ii, a) = Ij, then set ACTION[i, a] to "shift j". Here a must be a terminal.
       (b) If [A → α·] is in Ii, then set ACTION[i, a] to "reduce A → α" for all a in FOLLOW(A); here A may not be S'.
       (c) If [S' → S·] is in Ii, then set ACTION[i, $] to "accept".
       If any conflicting actions result from the above rules, we say the grammar is not SLR(1); the algorithm fails to produce a parser in this case.
    3. The GOTO transitions for state i are constructed for all nonterminals A using the rule: if GOTO(Ii, A) = Ij, then GOTO[i, A] = j.
    4. All entries not defined by rules (2) and (3) are made "error".
    5. The initial state of the parser is the one constructed from the set of items containing [S' → ·S].
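Below is a sketch of the table-filling steps of Algorithm 4.46. It assumes the canonical collection C, the GOTO function on item sets, the FOLLOW sets, and the terminal alphabet have already been computed (for instance with the earlier sketches); the table encoding matches the LR-driver sketch above.

# Sketch of Algorithm 4.46 (table filling only). C is the canonical LR(0)
# collection as a list of item sets, goto_fn(I, X) the GOTO on item sets,
# follow the FOLLOW sets, and terminals the terminal alphabet (with "$").
# All of these are assumed to come from earlier steps; illustrative only.

def build_slr_tables(C, goto_fn, follow, terminals, start="S'"):
    action = [dict() for _ in C]
    goto_table = [dict() for _ in C]

    def set_action(i, a, entry):
        if a in action[i] and action[i][a] != entry:
            raise ValueError(f"not SLR(1): conflict in state {i} on {a}")
        action[i][a] = entry

    for i, items in enumerate(C):
        for head, body, dot in items:
            if dot < len(body) and body[dot] in terminals:        # rule 2(a): shift
                j = C.index(goto_fn(items, body[dot]))
                set_action(i, body[dot], ("shift", j))
            elif dot == len(body) and head == start:              # rule 2(c): accept
                set_action(i, "$", ("accept",))
            elif dot == len(body):                                # rule 2(b): reduce
                for a in follow[head]:
                    set_action(i, a, ("reduce", head, body))
        for x in {body[dot] for _, body, dot in items
                  if dot < len(body) and body[dot] not in terminals}:
            goto_table[i][x] = C.index(goto_fn(items, x))         # rule 3: GOTO
    return action, goto_table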
4.6.4 Constructing SLR-Parsing Tables
- Example 4.48
  - Every SLR(1) grammar is unambiguous, but there are many unambiguous grammars that are not SLR(1). For example:

S → L = R | R
L → * R | id
R → L

  - This grammar has a shift/reduce conflict, so it is not SLR(1).
4.6.5 Viable Prefixes
- The prefixes of right-sentential forms that can appear on the stack of a shift-reduce parser are called viable prefixes.
- They are defined as follows: a viable prefix is a prefix of a right-sentential form that does not continue past the right end of the rightmost handle of that sentential form.
4.6.5 Viable Prefixes
- SLR parsing is based on the fact that LR(0) automata recognize viable prefixes.
- We say item A → β1·β2 is valid for a viable prefix αβ1 if there is a derivation S ⇒*rm αAw ⇒rm αβ1β2w.
- The fact that A → β1·β2 is valid for αβ1 tells us a lot about whether to shift or reduce when we find αβ1 on the parsing stack:
  - If β2 ≠ ε, then it suggests that we have not yet shifted the handle onto the stack, so shift is our move.
  - If β2 = ε, then it looks as if A → β1 is the handle, and we should reduce by this production.
- Two valid items may tell us to do different things for the same viable prefix.
  - Some of these conflicts can be resolved by looking at the next input symbol, and others can be resolved by the methods of Sec. 4.8.
  - But we should not suppose that all parsing-action conflicts can be resolved if the LR method is applied to an arbitrary grammar.
4.6.5 Viable Prefixes
- Compute the set of valid items for each viable prefix that can appear on the stack of an LR parser.
- The set of valid items for a viable prefix γ is exactly the set of items reached from the initial state along the path labeled γ in the LR(0) automaton for the grammar.
- Example 4.50 (using the automaton of Fig. 4.31): the items valid for the viable prefix E + T * are those in state 7.
4.7 More Powerful LR Parsers
- Extend the previous LR parsing techniques to use one symbol of lookahead on the input.
  - The canonical-LR or just LR method makes full use of the lookahead symbol(s). This method uses a large set of items, called the LR(1) items.
  - The lookahead-LR or LALR method is based on the LR(0) sets of items, and has many fewer states than typical parsers based on the LR(1) items.
4.7.1 Canonical LR(1) Items
- An LR(1) item [A → α·β, a] is valid for a viable prefix γ if there is a derivation S ⇒*rm δAw ⇒rm δαβw, where
  - γ = δα, and
  - either a is the first symbol of w, or w is ε and a is $.
4.7.2 Constructing LR(1) Sets of Items
- Example grammar:

S' → S
S → C C
C → c C | d
4.7.3 Canonical LR(1) Parsing Tables
4.7.4 Constructing LALR Parsing Tables
4.8 Using Ambiguous Grammars
4.9 Parser Generators
- We shall use the LALR parser generator Yacc as the basis of our discussion.
- The first version of Yacc was created by S. C. Johnson.
- Yacc is available as a command on the UNIX system.
4.9.1 The Parser Generator Yacc
- A Yacc source program has three parts, separated by %% lines:

declarations
%%
translation rules
%%
supporting functions
4.9.2 Using Yacc with Ambiguous Grammars
4.9.3 Creating Yacc Lexical Analyzers with Lex
- Replace the routine yylex() in the third part of the Yacc specification by the statement #include "lex.yy.c".
4.9.4 Error Recovery in Yacc