Title: CS 363 Comparative Programming Languages
1. CS 363 Comparative Programming Languages
- Lecture 3: Syntax Notation
2. Topics
- The General Problem of Describing Syntax
- Formal Methods of Describing Syntax
- Context Free Grammars, BNF
- Parse Trees
3. Introduction
- Who must use language definitions?
  - Language designers
  - Implementors
  - Programmers (the users of the language)
- Syntax: the form or structure of the expressions, statements, and program units (our focus today)
- Semantics: the meaning of the expressions, statements, and program units
4. What is a language?
- Alphabet (Σ): a finite set of basic syntactic elements (characters, tokens)
  - The Σ of C includes while, for, identifiers, integers, and so on
- Sentence: a finite sequence of elements in Σ; it can be λ, the empty string (some texts use ε for the empty string)
  - A legal C program is a single sentence in that language
- Language: a possibly infinite set of sentences over some alphabet; it can be { }, the empty language
  - The set of all legal C programs defines the language
5. Suppose Σ = {a, b, c}. Some languages over Σ could be:
- {aa, ab, ac, bb, bc, cc}
- {ab, abc, abcc, abccc, . . .}
- {λ}, where λ (ε) is the empty string (length 0)
- { }
- {a, b, c, λ}
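Since a language is just a set of strings, the example languages above can be sketched directly in Python. This is only an illustration: the names `lang1`, `in_lang2`, and so on are made up for this sketch, and the Python empty string `""` stands in for λ. Finite languages are written as sets; the infinite one is represented by a membership test.

```python
# Alphabet Σ = {a, b, c}; the empty string "" plays the role of λ.
SIGMA = {"a", "b", "c"}

lang1 = {"aa", "ab", "ac", "bb", "bc", "cc"}   # a finite language
lang3 = {""}                                    # contains only λ
lang4 = set()                                   # the empty language
lang5 = {"a", "b", "c", ""}

def in_lang2(s: str) -> bool:
    """Membership test for the infinite language {ab, abc, abcc, ...}."""
    return s.startswith("ab") and all(ch == "c" for ch in s[2:])

def over_alphabet(s: str) -> bool:
    """True if every symbol of s comes from Σ."""
    return all(ch in SIGMA for ch in s)
```

Note that `lang3` and `lang4` differ: the first has one sentence (of length 0), the second has none.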
6. Recognizing Languages
- Typically the task of a compiler:
  - Find tokens (Σ) from the input
  - See if the tokens are in an appropriate order
  - Determine what that token ordering means
- All of this must be formally specified
7. A Typical Compiler Architecture
[Figure: compiler pipeline. Source language → Scanner (lexical analysis) → tokens → Parser (syntax analysis) → syntactic structure → Semantic Analysis (IC generator) → syntactic/semantic structure → Code Optimizer → Code Generator. A Symbol Table is shared by the phases.]
8. Token
- Lexeme: an indivisible string in an input language
  - e.g. while, (, main, ...
- Token: a (possibly infinite) set of lexemes defining an atomic element with a defined meaning
  - while_token = {while}
  - identifier_token = {main, x, ...}
- Tokens are often describable using a pattern.
- The language of tokens is regular.
9. Lexical Analysis
- Break the input string of characters into tokens.
  - Characters: while (a < limit) a = a + 1
  - Tokens: while  (  a  <  limit  )  a  =  a  +  1
- Remove white space and comments.
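The scanning step can be sketched with regular expressions, which fits the earlier remark that the language of tokens is regular. This is a minimal sketch, not a real C scanner: the pattern table `TOKEN_SPEC`, the token names other than `while_token` and `identifier_token`, and the reading of the slide's example input as `while (a < limit) a = a + 1` are all assumptions made here.

```python
import re

# Hypothetical token patterns for this sketch; a real C scanner has many more.
TOKEN_SPEC = [
    ("while_token", r"while\b"),
    ("identifier_token", r"[A-Za-z_]\w*"),
    ("integer_token", r"\d+"),
    ("symbol_token", r"[()<=+]"),
    ("skip", r"\s+"),          # white space is matched and then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token, lexeme) pairs, dropping white space."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

Listing `while_token` before `identifier_token` matters: the alternatives are tried in order, so the keyword wins over the more general identifier pattern.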
10. Describing Language Syntax
- Enumeration: what are all the possible legal token orderings?
- Formal approaches to describing syntax:
  - Recognizers: used in compilers. Is the given sentence in the language?
  - Generators: generate the sentences of a language
11. Metalanguages for Describing Syntax
- A metalanguage is a language used to describe another language.
- Abstractions are used to represent classes of syntactic structures; they act like syntactic variables (also called nonterminal symbols).
- Both of the following define a class of languages called context-free languages:
  - Context-Free Grammars (Noam Chomsky, mid-1950s)
  - Backus-Naur Form, or BNF (invented by John Backus in 1959 to describe Algol 58)
12. Backus-Naur Form (BNF)
- <while_stmt> → while ( <logic_expr> ) <stmt>
- This is a rule describing the structure of a while statement.
- Non-terminals are placeholders for other rules: <while_stmt>, <logic_expr>, <stmt>
- Tokens (terminal symbols) are part of the language alphabet.
13. BNF Examples
- Vt = {+, -, 0..9}, Vn = {<L>, <D>}, s = <L>
  - <L> → <L> + <D> | <L> - <D> | <D>
  - <D> → 0 | ... | 9
- Vt = {(, )}, Vn = {<L>}, s = <L>
  - <L> → ( <L> ) <L>    (recursion)
  - <L> → λ
14. BNF Examples
- Vt = {a, b, c, d, ;, =, +, -, const}, Vn = {<program>, <stmts>, <stmt>, <var>, <expr>, <term>}
  - <program> → <stmts>
  - <stmts> → <stmt> | <stmt> ; <stmts>
  - <stmt> → <var> = <expr>
  - <var> → a | b | c | d
  - <expr> → <term> + <term> | <term> - <term>
  - <term> → <var> | const
15. Applying BNF rules
- Definition: given a string α A β and a production A → γ, we can replace A with γ.
  - α A β ⇒ α γ β is a single-step derivation.
  - (α, β, and γ are strings of zero or more terminals/non-terminals)
- Examples:
  - <L> + <D> ⇒ <L> - <D> + <D>, using <L> → <L> - <D>
  - ( <L> ) ( <L> ) ⇒ ( ( <L> ) <L> ) ( <L> ), using <L> → ( <L> ) <L>
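A single derivation step is a splice: the chosen non-terminal is cut out of the sentential form and the right-hand side of the production is put in its place. The sketch below assumes a sentential form is stored as a Python list of symbols and a production as an `(lhs, rhs)` pair; the names `derive_step`, `P_NEST`, and `P_EMPTY` are made up for illustration.

```python
def derive_step(form, index, production):
    """Apply production (lhs, rhs) to the symbol at form[index].

    form is a list of grammar symbols; rhs is a list (empty for λ).
    Returns the new sentential form; the input list is left unchanged.
    """
    lhs, rhs = production
    if form[index] != lhs:
        raise ValueError(f"symbol at {index} is {form[index]!r}, not {lhs!r}")
    return form[:index] + list(rhs) + form[index + 1:]

# Productions of the parenthesis grammar: L → ( L ) L  and  L → λ
P_NEST = ("L", ["(", "L", ")", "L"])
P_EMPTY = ("L", [])

# The second example above: ( L ) ( L ) ⇒ ( ( L ) L ) ( L )
before = ["(", "L", ")", "(", "L", ")"]
after = derive_step(before, 1, P_NEST)
```

Here α is `["("]`, A is the `L` at index 1, and β is everything after it.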
16. Derivations
- Definition: a sequence of rule applications
    w0 ⇒ w1 ⇒ ... ⇒ wn
  is a derivation of wn from w0 (written w0 ⇒* wn).
- Example:
    <L>                production <L> → ( <L> ) <L>
    ⇒ ( <L> ) <L>      production <L> → λ
    ⇒ ( ) <L>          production <L> → λ
    ⇒ ( )
  so <L> ⇒* ( )
- If wi has non-terminal symbols, it is referred to as a sentential form.
17. Derivation
- A sentence is a sentential form that has only terminal symbols.
- A leftmost derivation is one in which the leftmost nonterminal in each sentential form is the one that is expanded.
- A derivation may be neither leftmost nor rightmost.
18. Derivation of (())()
Grammar: <L> → ( <L> ) <L>    <L> → λ
  <L>                          production <L> → ( <L> ) <L>
  ⇒ ( <L> ) <L>                production <L> → ( <L> ) <L>
  ⇒ ( <L> ) ( <L> ) <L>        production <L> → λ
  ⇒ ( <L> ) ( <L> )            production <L> → ( <L> ) <L>
  ⇒ ( ( <L> ) <L> ) ( <L> )    production <L> → λ
  ⇒ ( ( ) <L> ) ( <L> )        production <L> → λ
  ⇒ ( ( ) <L> ) ( )            production <L> → λ
  ⇒ ( ( ) ) ( )
So <L> ⇒* (())()
19. Same String, Leftmost Derivation
Grammar: <L> → ( <L> ) <L>    <L> → λ
  <L>                          production <L> → ( <L> ) <L>
  ⇒ ( <L> ) <L>                production <L> → ( <L> ) <L>
  ⇒ ( ( <L> ) <L> ) <L>        production <L> → λ
  ⇒ ( ( ) <L> ) <L>            production <L> → λ
  ⇒ ( ( ) ) <L>                production <L> → ( <L> ) <L>
  ⇒ ( ( ) ) ( <L> ) <L>        production <L> → λ
  ⇒ ( ( ) ) ( ) <L>            production <L> → λ
  ⇒ ( ( ) ) ( )
So <L> ⇒* (())()
20. Same String, Rightmost Derivation
Grammar: <L> → ( <L> ) <L>    <L> → λ
  <L>                          production <L> → ( <L> ) <L>
  ⇒ ( <L> ) <L>                production <L> → ( <L> ) <L>
  ⇒ ( <L> ) ( <L> ) <L>        production <L> → λ
  ⇒ ( <L> ) ( <L> )            production <L> → λ
  ⇒ ( <L> ) ( )                production <L> → ( <L> ) <L>
  ⇒ ( ( <L> ) <L> ) ( )        production <L> → λ
  ⇒ ( ( <L> ) ) ( )            production <L> → λ
  ⇒ ( ( ) ) ( )
So <L> ⇒* (())()
21.
- L(G), the language generated by grammar G, is { w in Vt* | s ⇒* w } for start symbol s.
- Both () and (())() are in L(G) for the previous grammar.
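The definition of L(G) suggests a simple generator: start from s, keep applying productions to the leftmost non-terminal, and collect every form that contains only terminals. The sketch below does this breadth-first for the parenthesis grammar L → ( L ) L | λ, with a length bound so the search stops. The dictionary encoding of the grammar and the name `sentences` are assumptions of this sketch, and the pruning rule is only guaranteed to terminate for grammars like this one, where every non-empty production adds terminals.

```python
from collections import deque

GRAMMAR = {"L": [["(", "L", ")", "L"], []]}   # L → ( L ) L | λ

def sentences(grammar, start, max_len):
    """Return all w in L(G) with len(w) <= max_len, via leftmost derivations."""
    found, seen = set(), set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Index of the leftmost non-terminal, if any.
        idx = next((i for i, s in enumerate(form) if s in grammar), None)
        if idx is None:
            found.add("".join(form))          # only terminals: a sentence
            continue
        for rhs in grammar[form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            terminals = sum(1 for s in new if s not in grammar)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return found

short = sentences(GRAMMAR, "L", 6)
```

Both `"()"` and `"(())()"` appear in `short`, along with the empty string and the other balanced strings of length at most 6.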
22. Parse Trees
- The parse tree for some string in some language is defined by the grammar G as follows:
  - The root is the start symbol of G.
  - The leaves are terminals or λ. When visited from left to right, the leaves form the input string.
  - The interior nodes are non-terminals of G.
  - For every non-terminal A in the tree with children B1 ... Bk, there is some production A → B1 ... Bk.
- If a string is in the given language, a parse tree must exist.
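These properties map naturally onto a small tree data structure. The sketch below is one possible encoding, not the only one: the `Node` class and the helper names `empty` and `nest` are made up here, and they are specific to the parenthesis grammar L → ( L ) L | λ. Reading the leaves left to right (and dropping λ) recovers the input string.

```python
from dataclasses import dataclass, field

LAMBDA = "λ"

@dataclass
class Node:
    symbol: str                       # terminal, non-terminal, or λ
    children: list = field(default_factory=list)

    def leaves(self):
        """Leaf symbols from left to right."""
        if not self.children:
            return [self.symbol]
        return [leaf for child in self.children for leaf in child.leaves()]

    def yield_string(self):
        """The input string: the leaves with λ dropped."""
        return "".join(s for s in self.leaves() if s != LAMBDA)

def empty():
    """Subtree for L → λ."""
    return Node("L", [Node(LAMBDA)])

def nest(inner, rest):
    """Subtree for L → ( L ) L."""
    return Node("L", [Node("("), inner, Node(")"), rest])

# Parse tree for (())(): the root uses L → ( L ) L, as do both of its L children.
tree = nest(nest(empty(), empty()), nest(empty(), empty()))
```

Because every interior node is built by `empty` or `nest`, each one corresponds to a production of the grammar, which is exactly the fourth property in the list above.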
23. Parse Tree for (())()
Derivation:
  <L>
  ⇒ ( <L> ) <L>
  ⇒ ( <L> ) ( <L> ) <L>
  ⇒ ( <L> ) ( <L> )
  ⇒ ( ( <L> ) <L> ) ( <L> )
  ⇒ ( ( ) <L> ) ( <L> )
  ⇒ ( ( ) <L> ) ( )
  ⇒ ( ( ) ) ( )
Parse tree (children indented under their parent):
  <L>
    (
    <L>
      (
      <L>
        λ
      )
      <L>
        λ
    )
    <L>
      (
      <L>
        λ
      )
      <L>
        λ
24. Ambiguity
- A grammar is ambiguous if there are at least two parse trees (or leftmost derivations) for some string in the language.
  - E → E + E
  - E → E * E
  - E → 0 | ... | 9
[Figure: two parse trees for 2 + 3 * 4. One groups the string as (2 + 3) * 4, the other as 2 + (3 * 4).]
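The ambiguity can be checked mechanically by enumerating every parse tree for 2 + 3 * 4. The sketch below assumes the grammar reads E → E + E | E * E | digit; it tries each operator as the root of the tree and recurses on both sides. The function names `parses` and `value` are made up for this illustration, and the brute-force search is exponential, so it is only meant for short inputs.

```python
def parses(tokens):
    """All parse trees for tokens under E → E + E | E * E | digit.

    A tree is either a digit string or a tuple (left, op, right).
    """
    if len(tokens) == 1:
        return [tokens[0]] if tokens[0].isdigit() else []
    trees = []
    for i, tok in enumerate(tokens):
        # Try this operator as the root production of the tree.
        if tok in ("+", "*") and 0 < i < len(tokens) - 1:
            for left in parses(tokens[:i]):
                for right in parses(tokens[i + 1:]):
                    trees.append((left, tok, right))
    return trees

def value(tree):
    """Evaluate a tree bottom-up, so the tree shape decides the grouping."""
    if isinstance(tree, str):
        return int(tree)
    left, op, right = tree
    if op == "+":
        return value(left) + value(right)
    return value(left) * value(right)

trees = parses(list("2+3*4"))
```

The two trees evaluate to different numbers, which is why ambiguity matters in practice: the grammar alone does not say what the expression means.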
25. An Unambiguous Expression Grammar
- Grammars can be written that enforce precedence.
  - <expr> → <expr> + <term> | <term>
  - <term> → <term> * <c> | <c>
  - <c> → 0 | 1 | ... | 9
Parse tree for 2 + 3 * 4 (children indented under their parent):
  <expr>
    <expr>
      <term>
        <c>
          2
    +
    <term>
      <term>
        <c>
          3
      *
      <c>
        4
26. Formal Methods of Describing Syntax
- Operator associativity can also be indicated by a grammar.
  - <expr> → <expr> + <expr> | const    (ambiguous)
  - <expr> → <expr> + const | const     (unambiguous)
[Figure: parse tree of nested <expr> nodes with const leaves.]
27. EBNF
- Extended BNF: shorthand for BNF
- Optional parts are placed in brackets ([ ])
  - <proc_call> → ident [ ( <expr_list> ) ]
- Put alternative parts of RHSs in parentheses and separate them with vertical bars
  - <term> → <term> (+ | -) const
- Put repetitions (0 or more) in braces ({ })
  - <ident> → letter {letter | digit}
28. BNF and EBNF
- BNF
  - <expr> → <expr> + <term>
           | <expr> - <term>
           | <term>
  - <term> → <term> * <factor>
           | <term> / <factor>
           | <factor>
- EBNF
  - <expr> → <term> {(+ | -) <term>}
  - <term> → <factor> {(* | /) <factor>}
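One practical benefit of the EBNF form is that each repetition in braces turns directly into a loop. The sketch below evaluates a token list using the EBNF rules for expr and term. It is an illustration under assumptions: the slides do not define factor, so it is taken here to be a single integer literal, and the function name `parse_expr` is made up.

```python
def parse_expr(tokens):
    """Evaluate tokens using expr → term {(+ | -) term}."""
    pos = 0

    def factor():
        nonlocal pos
        # Assumption of this sketch: a factor is a single integer literal.
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError(f"expected a number at position {pos}")
        pos += 1
        return int(tokens[pos - 1])

    def term():
        nonlocal pos
        result = factor()
        # {(* | /) factor}: zero or more repetitions become a loop.
        while pos < len(tokens) and tokens[pos] in ("*", "/"):
            op = tokens[pos]
            pos += 1
            rhs = factor()
            result = result * rhs if op == "*" else result / rhs
        return result

    def expr():
        nonlocal pos
        result = term()
        # {(+ | -) term}: same pattern one precedence level up.
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            rhs = term()
            result = result + rhs if op == "+" else result - rhs
        return result

    total = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return total
```

Because each loop folds its operands from left to right, the operators come out left-associative, matching the left-recursive BNF rules, and `*` and `/` bind tighter than `+` and `-` because term is called from inside expr.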
29. Lexical and Syntax Analysis
- If a string is in a language, a parse tree can be derived for that string.
- Problem: we need to go from a string of characters (input file) to a legal parse tree to show that a string is in the language.
- From the introduction: compilers, interpreters, hybrid approaches
- Our focus: top-down parsing
30. Parsing
- Take a sequence of tokens and produce a parse tree.
- Two general algorithms (methods): top-down and bottom-up
- The algorithms are derived from the CFG.
- Note: we can't always derive an algorithm from a CFG.
31-38. Top Down
Grammar: <L> → ( <L> ) <L>    <L> → λ
String: (())()
[Figure sequence: the parse tree grows downward from the start symbol, one step per slide.]
- 31: start with the root <L> (the start symbol)
- 32: expand the root using <L> → ( <L> ) <L>
- 33: expand the leftmost unexpanded <L> using <L> → ( <L> ) <L>
- 34: expand the leftmost unexpanded <L> using <L> → λ
- 35: expand the leftmost unexpanded <L> using <L> → λ
- 36: expand the leftmost unexpanded <L> using <L> → ( <L> ) <L>
- 37: expand the leftmost unexpanded <L> using <L> → λ
- 38: expand the last <L> using <L> → λ; the leaves now read (())()
39. Writing a recursive descent parser
- Write a procedure for each non-terminal.
- Use the next token (lookahead) to choose which production for that nonterminal to mimic:
  - for a non-terminal X, call procedure X()
  - for a terminal X, call match(X)
- match(symbol)
    if (symbol == lookahead)
      lookahead = next_token()
    else error()
- Function next_token() gets the next token from the lexical analyzer. It must be called once before the first call to a parsing procedure, to get the first lookahead.
40. Simplified RDP Example
- L → ( L ) L | λ
- L() {
    if (lookahead == '(') {      /* L → ( L ) L */
      match('('); L(); match(')'); L();
    }
    else return;                 /* L → λ */
  }
- main() {
    lookahead = next_token();
    L();
  }
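The pseudocode above translates almost line for line into a runnable recognizer. The sketch below is one way to do it in Python, with a few assumptions of its own: tokens are single characters, `None` is used as the end-of-input marker, errors are reported by raising `SyntaxError`, and the wrapper `accepts` (not in the slides) also checks that no input is left over once L() returns.

```python
class Parser:
    """Recursive descent recognizer for L → ( L ) L | λ."""

    END = None   # assumed end-of-input marker, not part of the grammar

    def __init__(self, text):
        self.tokens = iter(text)
        self.lookahead = self.next_token()   # prime the first lookahead

    def next_token(self):
        return next(self.tokens, self.END)

    def match(self, symbol):
        if symbol == self.lookahead:
            self.lookahead = self.next_token()
        else:
            raise SyntaxError(f"expected {symbol!r}, found {self.lookahead!r}")

    def L(self):
        if self.lookahead == "(":        # L → ( L ) L
            self.match("(")
            self.L()
            self.match(")")
            self.L()
        # otherwise L → λ: consume nothing and return

def accepts(text):
    """True if text is a sentence of the grammar."""
    parser = Parser(text)
    try:
        parser.L()
        parser.match(Parser.END)         # all input must be consumed
        return True
    except SyntaxError:
        return False
```

The final `match(Parser.END)` is needed because L() happily returns via L → λ when it sees a stray `)`; without the check, a string such as `())` would be accepted after reading only its first two characters.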
41-53. Tracing the Recursive Descent Parse
String: ( ( ) ) ( )
[Figure sequence: the finished parse tree for the string is shown on every slide while the lookahead pointer advances through the input and the call trace grows by one call or return per slide.]
Complete call trace (nested calls indented):
  call L()
    call L()
      call L() - return
      call L() - return
    return
    call L()
      call L() - return
      call L() - return
    return
  return
54. Simplified RDP Example
- L → ( L ) L | λ
- L() {
    if (lookahead == '(') {      /* L → ( L ) L */
      match('('); L(); match(')'); L();
    }
    else return;                 /* L → λ */
  }
- main() {
    lookahead = next_token();
    L();
  }
The body of the function for a given non-terminal mimics the productions.
55. Another Grammar
- A → a B
- A → b
- A → c B B
- B → a B
- B → b A
- A() {
    if (lookahead == 'a') {
      lookahead = next_token(); B();
    }
    else if (lookahead == 'b') {
      lookahead = next_token();
    }
    else if (lookahead == 'c') {
      lookahead = next_token(); B(); B();
    }
    else error();
  }
- B() {
    if (lookahead == 'a') {
      lookahead = next_token(); B();
    }
    else if (lookahead == 'b') {
      lookahead = next_token(); A();
    }
    else error();
  }
Key: finding the set of symbols (lookahead) that indicate which production to use!
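For comparison, here is the same parser as a runnable Python sketch. Instead of a global lookahead it passes the current position explicitly and returns the position after the recognized phrase, which is a design choice of this sketch rather than something from the slides; the names `parse_A`, `parse_B`, and `in_language` are likewise made up. Each branch is selected by the first symbol of a production, exactly as in the pseudocode.

```python
def parse_A(tokens, pos=0):
    """Recognize an A starting at tokens[pos]; return the position after it."""
    look = tokens[pos] if pos < len(tokens) else None
    if look == "a":                      # A → a B
        return parse_B(tokens, pos + 1)
    if look == "b":                      # A → b
        return pos + 1
    if look == "c":                      # A → c B B
        return parse_B(tokens, parse_B(tokens, pos + 1))
    raise SyntaxError(f"A: unexpected {look!r} at {pos}")

def parse_B(tokens, pos):
    """Recognize a B starting at tokens[pos]; return the position after it."""
    look = tokens[pos] if pos < len(tokens) else None
    if look == "a":                      # B → a B
        return parse_B(tokens, pos + 1)
    if look == "b":                      # B → b A
        return parse_A(tokens, pos + 1)
    raise SyntaxError(f"B: unexpected {look!r} at {pos}")

def in_language(text):
    """True if the whole of text derives from A."""
    try:
        return parse_A(text, 0) == len(text)
    except SyntaxError:
        return False
```

The approach works here because, for each non-terminal, the productions start with different terminals, so one symbol of lookahead is always enough to pick a branch.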
56. How do we find the lookaheads?
- We can compute lookahead sets for some grammars from FIRST() sets:
  - lookahead(A → α) = FIRST(α)
- For this to work for a given grammar, the lookahead sets for the productions of a given non-terminal must be disjoint.
57. FIRST Sets
- FIRST(α) is the set of all terminal symbols that can begin some sentential form that starts with α.
  - FIRST(α) = { a in Vt | α ⇒* aβ } ∪ { λ } if α ⇒* λ
  - (Remember, α is a string of zero or more terminals and nonterminals.)
- Example:
  - <stmt> → simple | begin <stmts> end
  - FIRST(<stmt>) = {simple, begin}
58. Computing FIRST sets
- Initially FIRST(A) is empty.
- For productions A → a β, where a in Vt:
  - Add {a} to FIRST(A)
- For productions A → λ:
  - Add {λ} to FIRST(A)
- For productions A → α B β, where α ⇒* λ and NOT (B ⇒* λ):
  - Add FIRST(αB) to FIRST(A)
- For productions A → α, where α ⇒* λ:
  - Add FIRST(α) and {λ} to FIRST(A)
59. To compute FIRST across strings of terminals and non-terminals:
- FIRST(λ) = {λ}
- FIRST(Aα) = {A}                            if A is a terminal
            = (FIRST(A) − {λ}) ∪ FIRST(α)    if A ⇒* λ
            = FIRST(A)                       otherwise
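The rules for FIRST can be run as a fixed-point computation: start with empty sets and keep applying the productions until nothing changes. The sketch below is one common way to organize this, not necessarily the procedure intended by the slides. Its assumptions: a grammar is a dictionary from each non-terminal to a list of right-hand sides, an empty list `[]` stands for a λ production, and λ is dropped from the FIRST set of a nullable symbol before moving on to the next symbol, so that λ is only reported when the whole string can vanish. The sample grammar is Example 2 from this lecture as read here.

```python
LAMBDA = "λ"

def first_of_string(symbols, first, nonterminals):
    """FIRST of a string of grammar symbols, given FIRST for non-terminals."""
    result = set()
    for sym in symbols:
        if sym not in nonterminals:      # a terminal begins the string
            result.add(sym)
            return result
        result |= first[sym] - {LAMBDA}
        if LAMBDA not in first[sym]:     # sym cannot vanish: stop here
            return result
    result.add(LAMBDA)                   # every symbol can derive λ
    return result

def first_sets(grammar):
    """Iterate until no FIRST set changes (a fixed point)."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                new = first_of_string(rhs, first, grammar)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

# Example 2 as read here; [] stands for a λ production.
EXAMPLE2 = {
    "P": [["i"], ["c"], ["n", "T", "S"]],
    "Q": [["P"], ["a", "S"], ["b", "S", "c", "S", "T"]],
    "R": [["b"], []],
    "S": [["c"], ["R", "n"], []],
    "T": [["R", "S", "q"]],
}
FIRST2 = first_sets(EXAMPLE2)
```

The loop terminates because sets only ever grow and there are finitely many terminals to add.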
60. Example 1
- S → a S e
- S → B
- B → b B e
- B → C
- C → c C e
- C → d
- FIRST(C) = ?
- FIRST(B) = ?
- FIRST(S) = ?
61. Example 1 (solution)
- FIRST(C) = {c, d}
- FIRST(B) = {b, c, d}
- FIRST(S) = {a, b, c, d}
62. Example 2
- P → i | c | n T S
- Q → P | a S | b S c S T
- R → b | λ
- S → c | R n | λ
- T → R S q
- FIRST(P) = ?
- FIRST(Q) = ?
- FIRST(R) = ?
- FIRST(S) = ?
- FIRST(T) = ?
63. Example 2 (solution)
- FIRST(P) = {i, c, n}
- FIRST(Q) = {i, c, n, a, b}
- FIRST(R) = {b, λ}
- FIRST(S) = {c, b, n, λ}
- FIRST(T) = {b, c, n, q}
64. Example 3
- S → a S e | S T S
- T → R S e | Q
- R → r S r | λ
- Q → S T | λ
- FIRST(S) = ?
- FIRST(R) = ?
- FIRST(T) = ?
- FIRST(Q) = ?
65. Example 3 (solution)
- FIRST(S) = {a}
- FIRST(R) = {r, λ}
- FIRST(T) = {r, a, λ}
- FIRST(Q) = {a, λ}
66. Bottom up Parsing (shift/reduce, LR)
- Less intuitive but more efficient than top down
- Two actions:
  - Shift: move a token from the input to the parse tree forest
  - Reduce: merge zero or more parse trees under a single parent
67-80. Bottom Up
Grammar: <L> → ( <L> ) <L>    <L> → λ
String: (())()
[Figure sequence: a forest of parse trees is built from the leaves upward, one action per slide, until a single tree rooted at <L> remains.]
- 67: nothing read yet
- 68: Shift (
- 69: Shift (
- 70: Reduce L → λ
- 71: Shift )
- 72: Reduce L → λ
- 73: Reduce L → ( L ) L
- 74: Shift )
- 75: Shift (
- 76: Reduce L → λ
- 77: Shift )
- 78: Reduce L → λ
- 79: Reduce L → ( L ) L
- 80: Reduce L → ( L ) L
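The sequence of shifts and reduces for (())() can be reproduced with a small table-driven shift/reduce parser. The slides do not say how the parser decides between shifting and reducing, so the sketch below makes its own assumption: it uses an SLR(1) table built by hand for L → ( L ) L | λ, with `$` as an end-of-input marker and FOLLOW(L) = { ), $ }. The state numbers, the table layout, and the function name `shift_reduce` are all specific to this illustration.

```python
# Hand-built SLR(1) table for:  rule 1: L → ( L ) L    rule 2: L → λ
# (an assumption of this sketch; FOLLOW(L) = { ")", "$" }).
ACTION = {
    (0, "("): ("shift", 2), (0, ")"): ("reduce", 2), (0, "$"): ("reduce", 2),
    (1, "$"): ("accept", None),
    (2, "("): ("shift", 2), (2, ")"): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, ")"): ("shift", 4),
    (4, "("): ("shift", 2), (4, ")"): ("reduce", 2), (4, "$"): ("reduce", 2),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO = {0: 1, 2: 3, 4: 5}                 # state reached after pushing an L
RULES = {1: (4, "L → ( L ) L"), 2: (0, "L → λ")}   # (symbols popped, text)

def shift_reduce(text):
    """Return the list of shift/reduce actions, or raise SyntaxError."""
    tokens = list(text) + ["$"]
    stack, pos, trace = [0], 0, []
    while True:
        look = tokens[pos]
        kind, arg = ACTION.get((stack[-1], look), ("error", None))
        if kind == "shift":
            stack.append(arg)
            pos += 1
            trace.append(f"Shift {look}")
        elif kind == "reduce":
            length, rule_text = RULES[arg]
            if length:
                del stack[-length:]       # pop one state per RHS symbol
            stack.append(GOTO[stack[-1]])
            trace.append(f"Reduce {rule_text}")
        elif kind == "accept":
            return trace
        else:
            raise SyntaxError(f"unexpected {look!r} at position {pos}")

steps = shift_reduce("(())()")
```

Each reduce of L → λ pops nothing and pushes a new L, which corresponds to merging zero trees under a single parent; each reduce of L → ( L ) L pops four entries, merging four trees into one.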