Title: Scanner
1Scanner
2Grammar
3Language
4Recursive Definition
5Mathematical Expression
6Structure of Expressions
7Formal Language
8Backus Naur Form (BNF)
1960 by J. Backus and P. Naur
9EBNF (Extended BNF)
10BNF → EBNF
BNF
EBNF
11Formalism (Formal notation)
N. Chomsky -
12Differing structural trees for the same expression
13Problem of Different structural trees
14No Ambiguous Sentence
15Context Free Language
- Syntactic equations of the form defined in EBNF
generate context-free languages.
- The term "context-free" is due to Chomsky and
stems from the fact that substitution of the
symbol left of = by a sequence derived from the
expression to the right of = is always permitted,
regardless of the context in which the symbol is
embedded within the sentence.
- It has turned out that this restriction to
context freedom (in the sense of Chomsky) is
quite acceptable for programming languages, and
that it is even desirable.
- Context dependence in another sense, however, is
indispensable. We will return to this topic in
Chapter 8.
16Regular Expression
- A language is regular if its syntax can be
expressed by a single EBNF expression.
- The requirement that a single equation suffices
also implies that only terminal symbols occur in
the expression.
- Such an expression is called a regular
expression.
17Syntax Analysis vs. Regular Expression
- The reason for our interest in regular languages
lies in the fact that programs for the
recognition of regular sentences are particularly
simple and efficient. By "recognition" we mean
the determination of the structure of the
sentence, and thereby naturally the determination
of whether the sentence is well formed, that is,
it belongs to the language. Sentence recognition
is called syntax analysis.
18Regular Expression vs. State Machine
- For the recognition of regular sentences a finite
automaton, also called a state machine, is
necessary and sufficient. In each step the state
machine reads the next symbol and changes state.
The resulting state is solely determined by the
previous state and the symbol read. If the
resulting state is unique, the state machine is
deterministic, otherwise nondeterministic. If the
state machine is formulated as a program, the
state is represented by the current point of
program execution.
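The last point can be illustrated with a minimal Python sketch. It assumes an identifier of the form letter {letter | digit}; that definition and both function names are illustrative assumptions, not taken from the slides. The first function keeps the state in a variable, the second represents the same two states by the current point of program execution.

```python
def is_identifier_table(text: str) -> bool:
    """Explicit-state version: the state is stored in a variable."""
    state = "start"
    for ch in text:
        if state == "start" and ch.isalpha():
            state = "in_ident"
        elif state == "in_ident" and ch.isalnum():
            state = "in_ident"
        else:
            return False  # no transition defined for this symbol: reject
    return state == "in_ident"


def is_identifier_program(text: str) -> bool:
    """Same automaton; the state is the current point of execution."""
    i = 0
    # Point A (start state): exactly one letter is required here.
    if i >= len(text) or not text[i].isalpha():
        return False
    i += 1
    # Point B (accepting state): loop over further letters and digits.
    while i < len(text) and text[i].isalnum():
        i += 1
    return i == len(text)
```

Both functions are deterministic: in every state, the next symbol selects at most one successor state.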
19EBNF → Program
- The analyzing program can be derived directly
from the defining syntax in EBNF. For each EBNF
construct K there exists a translation rule which
yields a program fragment Pr(K). The translation
rules from EBNF to program text are shown below.
Therein sym denotes a global variable always
representing the symbol last read from the source
text by a call to procedure next. Procedure error
terminates program execution, signaling that the
symbol sequence read so far does not belong to
the language.
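The translation table itself is not reproduced in this text. The sketch below therefore assumes the usual recursive-descent mapping: a terminal becomes a test followed by next, a sequence becomes consecutive fragments, an alternative becomes a branch on the first symbol, [ ] becomes an if, and { } becomes a while. The rule A = "a" { "b" | "c" } [ "d" ] ., the class name, and the end-of-text sentinel are hypothetical, chosen only for illustration.

```python
class Recognizer:
    """Recognizes the hypothetical rule  A = "a" { "b" | "c" } [ "d" ] ."""

    END = "\0"  # sentinel marking the end of the source text (assumption)

    def __init__(self, source: str) -> None:
        self.source = source
        self.pos = 0
        self.sym = ""
        self.next()

    def next(self) -> None:
        # Read the next symbol from the source text into sym.
        self.sym = self.source[self.pos] if self.pos < len(self.source) else self.END
        self.pos += 1

    def error(self) -> None:
        # The symbol sequence read so far does not belong to the language.
        raise SyntaxError(f"unexpected symbol {self.sym!r} at position {self.pos - 1}")

    def expect(self, terminal: str) -> None:
        # Pr("x"):  if sym == "x" then next, else error
        if self.sym == terminal:
            self.next()
        else:
            self.error()

    def parse_a(self) -> None:
        self.expect("a")                # sequence: fragments in order
        while self.sym in ("b", "c"):   # { ... }: loop while sym can start the body
            if self.sym == "b":         # alternative: branch on the first symbol
                self.expect("b")
            else:
                self.expect("c")
        if self.sym == "d":             # [ ... ]: optional part
            self.expect("d")
        self.expect(self.END)           # nothing may follow the sentence


def accepts(source: str) -> bool:
    try:
        Recognizer(source).parse_a()
        return True
    except SyntaxError:
        return False
```

For example, accepts("abcd") and accepts("ad") hold, while accepts("adb") does not, because nothing may follow the optional "d".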
20Analyzing program
21EBNF with only 1 rule
22First()
23Precondition
24Lexical Analysis for Identifier
25Lexical Analysis for Integer
26Scanner
- The process of syntax analysis is based on a
procedure to obtain the next symbol. This
procedure in turn is based on the definition of
symbols in terms of sequences of one or more
characters. This latter procedure is called a
scanner, and syntax analysis on this second,
lower level is called lexical analysis.
27Lexical Analysis vs. Syntax Analysis
28A Scanner Example
- As an example we show a scanner for a parser of
EBNF. Its terminal symbols and their definition
in terms of characters are
29Procedure GetSym() (1)
30Procedure GetSym() (2)
31Procedure GetSym() (3)
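The GetSym listing and the symbol definitions are not reproduced in this text, so the following is only a rough Python sketch of what a scanner for EBNF might look like. It assumes three kinds of symbols: identifiers (a letter followed by letters or digits), literals enclosed in double quotes, and the single characters = ( ) [ ] { } | and the period. The function name get_syms and the symbol-kind names are hypothetical.

```python
from typing import Iterator, Tuple

# Assumed single-character symbols of EBNF and illustrative names for them.
SINGLE_CHAR = {"=": "eql", "(": "lparen", ")": "rparen", "[": "lbrak",
               "]": "rbrak", "{": "lbrace", "}": "rbrace", "|": "bar",
               ".": "period"}


def get_syms(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (kind, spelling) pairs for an EBNF source text."""
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():                      # blanks separate symbols
            i += 1
        elif ch.isalpha():                    # identifier
            start = i
            while i < len(text) and text[i].isalnum():
                i += 1
            yield ("ident", text[start:i])
        elif ch == '"':                       # quoted literal
            end = text.find('"', i + 1)
            if end < 0:
                raise SyntaxError(f"unterminated literal at {i}")
            yield ("literal", text[i + 1:end])
            i = end + 1
        elif ch in SINGLE_CHAR:               # one-character symbol
            yield (SINGLE_CHAR[ch], ch)
            i += 1
        else:
            raise SyntaxError(f"illegal character {ch!r} at {i}")
```

A parser would pull one pair at a time from this generator, which plays the role of the procedure that obtains the next symbol.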
32Syntax Analysis Overview
- Goal: determine if the input token stream
satisfies the syntax of the program
- What do we need to do this?
  - An expressive way to describe the syntax
  - A mechanism that determines if the input token
    stream satisfies the syntax description
- For lexical analysis
  - Regular expressions: describe tokens
  - Finite automata: mechanisms to generate tokens
    from the input stream
33Just Use Regular Expressions?
- REs can expressively describe tokens
  - Easy to implement via DFAs
- So just use them to describe the syntax of a
programming language?
  - NO! They don't have enough power to express any
    non-trivial syntax
  - Example: nested constructs (blocks, expressions,
    statements). Detect balanced braces
    . . .
- We need unbounded counting!
  - FSAs cannot count except in a strictly modulo
    fashion
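A minimal Python sketch of why counting is needed: the check below keeps an integer depth that can grow without bound, whereas a finite automaton has only a fixed number of states. The function name is illustrative.

```python
def braces_balanced(text: str) -> bool:
    depth = 0  # unbounded counter: a finite automaton has no equivalent
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:      # a closing brace with no matching opener
                return False
    return depth == 0          # every opener must have been closed
```

Any automaton with k states can distinguish at most k nesting depths, so a sufficiently deep nesting of braces always defeats it.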
34Context-Free Grammars
- Consist of 4 components
  - Terminal symbols: token or ε
  - Non-terminal symbols: syntactic variables
  - Start symbol S: special non-terminal
  - Productions of the form LHS → RHS
    - LHS: single non-terminal
    - RHS: string of terminals and non-terminals
    - Specify how non-terminals may be expanded
- Language generated by a grammar is the set of
strings of terminals derived from the start
symbol by repeatedly applying the productions
  - L(G): language generated by grammar G
S → a S a
S → T
T → b T b
T → ε
35CFG - Example
- Grammar for balanced-parentheses language
  - S → ( S ) S
  - S → ε
- 1 non-terminal: S
- 2 terminals: (, )
- Start symbol: S
- 2 productions
- If grammar accepts a string, there is a
derivation of that string using the productions
  - (())
  - S ⇒* (S) ε ⇒ ((S) S) ε ⇒* ((ε) ε) ε = (())
- Why is the final S required?
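As a minimal sketch, the grammar S → ( S ) S | ε can be turned into a recursive recognizer in Python: one function per non-terminal, with the production chosen by looking at the next symbol. The function names are illustrative.

```python
def balanced(text: str) -> bool:
    pos = 0

    def parse_s() -> None:
        nonlocal pos
        # S -> ( S ) S   is chosen when the next symbol is "(".
        if pos < len(text) and text[pos] == "(":
            pos += 1
            parse_s()
            if pos >= len(text) or text[pos] != ")":
                raise SyntaxError(f"expected ')' at position {pos}")
            pos += 1
            parse_s()  # the final S: whatever follows the closed pair
        # S -> ε   otherwise: consume nothing.

    try:
        parse_s()
    except SyntaxError:
        return False
    return pos == len(text)  # accept only if all input was consumed
```

The string is accepted exactly when a derivation exists, so balanced("(())") holds and balanced("(()") does not.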
36More on CFGs
- Shorthand notation: vertical bar for multiple
productions
  - S → a S a | T
  - T → b T b | ε
- CFGs are powerful enough to express the syntax of
most programming languages
- Derivation: successive application of
productions starting from S
- Acceptance? Determine if there is a derivation
for an input token stream
37A Parser
[Diagram: the parser takes a token stream s (from
the lexer) and a context-free grammar G; it answers
yes if s is in L(G) and no otherwise, and emits
error messages]
- Syntax analyzers (parsers): CFG acceptors which
also output the corresponding derivation when the
token stream is accepted
- Various kinds: LL(k), LR(k), SLR, LALR
38RE is a Subset of CFG
- Can inductively build a grammar for each RE
  ε          S → ε
  a          S → a
  R1 R2      S → S1 S2
  R1 | R2    S → S1 | S2
  R1*        S → S1 S | ε
- Where
  - G1: grammar for R1, with start symbol S1
  - G2: grammar for R2, with start symbol S2
39Grammar for Sum Expression
- Grammar
  - S → E + S | E
  - E → number | (S)
- Expanded
  - S → E + S
  - S → E
  - E → number
  - E → (S)
- 4 productions
- 2 non-terminals (S, E)
- 4 terminals: (, ), +, number
- Start symbol: S
40Constructing a Derivation
- Start from S (the start symbol)
- Use productions to derive a sequence of tokens
- For arbitrary strings α, β, γ and for a
production A → β
  - A single step of the derivation is
  - α A γ ⇒ α β γ (substitute β for A)
- Example
  - S → E + S
  - (S + E) + E ⇒ (E + S + E) + E
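A single derivation step is just a list splice. The Python sketch below assumes the sum grammar S → E + S | E and represents a sentential form as a list of symbols; the function name derive_step is illustrative.

```python
from typing import List


def derive_step(sentential: List[str], index: int, lhs: str, rhs: List[str]) -> List[str]:
    """Apply the production lhs -> rhs to the symbol at position index."""
    if sentential[index] != lhs:
        raise ValueError(f"symbol at {index} is {sentential[index]!r}, not {lhs!r}")
    # alpha A gamma  =>  alpha beta gamma
    return sentential[:index] + rhs + sentential[index + 1:]


before = ["(", "S", "+", "E", ")", "+", "E"]
after = derive_step(before, 1, "S", ["E", "+", "S"])
print(" ".join(after))  # ( E + S + E ) + E
```

Here α is "(", A is the S at index 1, and γ is "+ E ) + E".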
41Class Problem
- S → E + S | E
- E → number | (S)
- Derive: (1 + 2 + (3 + 4)) + 5
42Parse Tree
- Parse tree: tree representation of the
derivation
- Leaves of the tree are terminals
- Internal nodes are non-terminals
- No information about the order of the
derivation steps
[Figure: parse tree for (1 + 2 + (3 + 4)) + 5]
43Parse Tree vs Abstract Syntax Tree
- Parse tree: also called concrete syntax
- AST discards (abstracts) unneeded information;
more compact format
[Figure: parse tree and abstract syntax tree for
(1 + 2 + (3 + 4)) + 5, side by side]
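The difference can be made concrete with a small Python sketch for the sum grammar S → E + S | E, E → number | (S). The parser builds the full parse tree as nested tuples, and to_ast then drops parentheses and single-child nodes. The tuple encoding and all function names are assumptions made for illustration. The tokenizer silently skips characters it does not recognize.

```python
import re
from typing import List, Tuple, Union

Tree = Union[str, Tuple]  # a leaf token, or (label, child, child, ...)


def tokenize(text: str) -> List[str]:
    return re.findall(r"\d+|[()+]", text)


def parse(tokens: List[str]) -> Tree:
    pos = 0

    def parse_s() -> Tree:
        nonlocal pos
        e = parse_e()
        if pos < len(tokens) and tokens[pos] == "+":    # S -> E + S
            pos += 1
            return ("S", e, "+", parse_s())
        return ("S", e)                                  # S -> E

    def parse_e() -> Tree:
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":                                   # E -> ( S )
            s = parse_s()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise SyntaxError("expected ')'")
            pos += 1
            return ("E", "(", s, ")")
        if tok.isdigit():                                # E -> number
            return ("E", tok)
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = parse_s()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree


def to_ast(tree: Tree):
    """Drop parentheses and single-child S/E nodes; keep + and numbers."""
    if isinstance(tree, str):
        return int(tree)
    label, *kids = tree
    if label == "S" and len(kids) == 3:     # E + S  becomes a "+" node
        return ("+", to_ast(kids[0]), to_ast(kids[2]))
    if label == "E" and len(kids) == 3:     # ( S )  becomes just S
        return to_ast(kids[1])
    return to_ast(kids[0])                  # chain node: pass through
```

For the input "1" the parse tree is ("S", ("E", "1")), while the AST is just the number 1.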
44Derivation Order
- Can choose to apply productions in any order:
select a non-terminal and substitute the RHS of a
production
- Two standard orders: leftmost and rightmost
- Leftmost derivation
  - In the string, find the leftmost non-terminal and
    apply a production to it
  - E + S ⇒ 1 + S
- Rightmost derivation
  - Same, but find the rightmost non-terminal
  - E + S ⇒ E + E + S
45Leftmost/Rightmost Derivation Examples
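- S → E + S | E
- E → number | (S)
- Leftmost derive: (1 + 2 + (3 + 4)) + 5
S ⇒ E + S ⇒ (S) + S ⇒ (E + S) + S ⇒ (1 + S) + S
⇒ (1 + E + S) + S ⇒ (1 + 2 + S) + S
⇒ (1 + 2 + E) + S ⇒ (1 + 2 + (S)) + S
⇒ (1 + 2 + (E + S)) + S ⇒ (1 + 2 + (3 + S)) + S
⇒ (1 + 2 + (3 + E)) + S ⇒ (1 + 2 + (3 + 4)) + S
⇒ (1 + 2 + (3 + 4)) + E ⇒ (1 + 2 + (3 + 4)) + 5
- Now, rightmost derive the same input string
S ⇒ E + S ⇒ E + E ⇒ E + 5 ⇒ (S) + 5
⇒ (E + S) + 5 ⇒ (E + E + S) + 5
⇒ (E + E + E) + 5 ⇒ (E + E + (S)) + 5
⇒ (E + E + (E + S)) + 5 ⇒ (E + E + (E + E)) + 5
⇒ (E + E + (E + 4)) + 5 ⇒ (E + E + (3 + 4)) + 5
⇒ (E + 2 + (3 + 4)) + 5 ⇒ (1 + 2 + (3 + 4)) + 5
- Result: same parse tree; same productions chosen,
but in a different order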
46Class Problem
- S → E + S | E
- E → number | (S) | -S
- Do the rightmost derivation of:
1 + (2 + -(3 + 4)) + 5
47Ambiguous Grammars
- In the sum expression grammar, leftmost and
rightmost derivations produced identical parse
trees
- The + operator associates to the right in the
parse tree regardless of derivation order
[Figure: tree for (1 + 2 + (3 + 4)) + 5, with +
associating to the right]
48An Ambiguous Grammar
- + associates to the right because of the
right-recursive production S → E + S
- Consider another grammar
  - S → S + S | S * S | number
- Ambiguous grammar: different derivations produce
different parse trees
  - More specifically, G is ambiguous if there are 2
    distinct leftmost (rightmost) derivations for
    some sentence
49Ambiguous Grammar - Example
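S → S + S | S * S | number
Consider the expression: 1 + 2 * 3
Derivation 1: S ⇒ S + S ⇒ 1 + S ⇒ 1 + S * S
⇒ 1 + 2 * S ⇒ 1 + 2 * 3
Derivation 2: S ⇒ S * S ⇒ S + S * S ⇒ 1 + S * S
⇒ 1 + 2 * S ⇒ 1 + 2 * 3
[Figure: the two resulting parse trees, one grouping
as 1 + (2 * 3) and the other as (1 + 2) * 3]
Obviously not equal!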
50Impact of Ambiguity
- Different parse trees correspond to different
evaluations!
- Thus, program meaning is not defined!!
[Figure: the two parse trees for 1 + 2 * 3; one
evaluates to (1 + 2) * 3 = 9, the other to
1 + (2 * 3) = 7]
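The two evaluations can be checked with a few lines of Python. The trees are written by hand as (operator, left, right) tuples; this encoding and the function name are illustrative assumptions.

```python
def evaluate(tree):
    """Evaluate a tree given as a number or an (op, left, right) tuple."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b


tree_plus_at_root = ("+", 1, ("*", 2, 3))    # from S => S + S => ...
tree_times_at_root = ("*", ("+", 1, 2), 3)   # from S => S * S => ...

print(evaluate(tree_plus_at_root))    # 7
print(evaluate(tree_times_at_root))   # 9
```

The same token sequence 1 + 2 * 3 yields both trees under the ambiguous grammar, so the grammar alone does not fix the result.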
51Can We Get Rid of Ambiguity?
- Ambiguity is a function of the grammar, not the
language!
- A context-free language L is inherently ambiguous
if all grammars for L are ambiguous
- Every deterministic CFL has an unambiguous
grammar
  - So, no deterministic CFL is inherently ambiguous
  - No inherently ambiguous programming languages
    have been invented
- To construct a useful parser, must devise an
unambiguous grammar
52Eliminating Ambiguity
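- Often can eliminate ambiguity by adding
nonterminals and allowing recursion only on right
or left
  - S → S + T | T
  - T → T * num | num
- T non-terminal enforces precedence
- Left-recursion: left associativity
[Figure: parse tree for 1 + 2 * 3 under this
grammar: S → S + T at the root, the left S deriving
1 and the T deriving 2 * 3 via T → T * num]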
53A Closer Look at Eliminating Ambiguity
- Precedence enforced by
  - Introduce distinct non-terminals for each
    precedence level
  - Operators for a given precedence level are
    specified as RHS for the production
  - Higher precedence operators are accessed by
    referencing the next-higher precedence
    non-terminal
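A minimal Python sketch of the layering, assuming the two-level grammar S → S + T | T and T → T * num | num. A recursive-descent procedure cannot call itself on the left without looping forever, so each left-recursive rule is written here as a loop (S = T { + T }, T = num { * num }); that rewrite is an implementation choice for this sketch, and folding to the left preserves left associativity. All names are illustrative.

```python
import re
from typing import List


def parse(text: str):
    """Return a tree of (op, left, right) tuples with int leaves."""
    tokens: List[str] = re.findall(r"\d+|[+*]", text)
    pos = 0

    def parse_num() -> int:
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError(f"number expected at token {pos}")
        pos += 1
        return int(tokens[pos - 1])

    def parse_t():
        # T -> T * num | num, written as the loop  T = num { "*" num }
        nonlocal pos
        tree = parse_num()
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            tree = ("*", tree, parse_num())   # fold to the left
        return tree

    def parse_s():
        # S -> S + T | T, written as the loop  S = T { "+" T }
        nonlocal pos
        tree = parse_t()                      # lower level refers to T
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            tree = ("+", tree, parse_t())     # fold to the left
        return tree

    result = parse_s()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result
```

Because S only reaches * through T, parse("1 + 2 * 3") groups the multiplication first, and parse("1 + 2 + 3") nests to the left.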
54Associativity
- An operator is either left, right or non
associative
  - Left: a + b + c = (a + b) + c
  - Right: a ^ b ^ c = a ^ (b ^ c)
  - Non: a < b < c is illegal (thus undefined)
- Position of the recursion relative to the
operator dictates the associativity
  - Left (right) recursion → left (right)
    associativity
  - Non: don't be recursive, simply reference the next
    higher precedence non-terminal on both sides of
    the operator
55Class Problem (Tough)
S → S + S | S - S | S * S | S / S | (S) | -S | S ^ S
  | number
Enforce the standard arithmetic precedence rules
and remove all ambiguity from the above grammar
Precedence (high to low): (), unary -, ^, * /, + -
Associativity: ^ is right; the rest are left