Title: CSCI 435 Compiler Design
1CSCI 435 Compiler Design
- Week 1 Class 3
- Ray Schneider
2Today's Drill
- Grammars
- Closures
- Outline Code, phew!
- Conway paper from 1963 on J
- On to Chapter 2
3Grammars
- context-free grammars (abbreviated CF)
- the basic kind of grammar used in defining programming languages. "Context free" means that
one can substitute for a non-terminal symbol without reference to its context.
- also regular grammars, also called regular expressions, and
- attribute grammars, which are context-free grammars extended with parameters and code
4So what is a grammar?
- a procedure for generating strings of symbols with defined properties
- symbols are also called TOKENS of the language
- strings of symbols are Program Texts
- and the set of strings of symbols is the Programming Language
- ex. BEGIN print ( "Hi!" ) END
- a string with six Tokens
- The strings are constructed in a structured fashion, and semantics can be
attached to this structure
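To make the token count concrete, here is a minimal sketch that splits the example program text into tokens. The token pattern is an assumption made for this one example (string literals, words, single punctuation characters), not the lexical rules of any real language.

```python
import re

# Hypothetical token pattern for the slide's example only: a string literal,
# a word (keyword or identifier), or a single punctuation character.
TOKEN_PATTERN = re.compile(r'"[^"]*"|[A-Za-z_]\w*|[^\sA-Za-z_"]')

def tokenize(text):
    """Split program text into a list of token strings."""
    return TOKEN_PATTERN.findall(text)

tokens = tokenize('BEGIN print ( "Hi!" ) END')
print(tokens)
print(len(tokens))
```

The list holds the six tokens BEGIN, print, (, "Hi!", ), and END.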
5The form of a grammar
- grammar = production rule(s) + S, the Start Symbol
- a production rule HAS 2 parts
- left hand side: name of the syntactic construct
- right hand side: possible forms
- separated by an arrow (→)
- ex. expression → '(' expression operator expression ')'
- The left hand side is a NON TERMINAL symbol
- the right hand side consists of combinations of NON TERMINALS and TERMINALS
- together they make up the set of Grammar Symbols,
collectively called the MEMBERS of the Grammar, G
6Some Conventions
- Want to be able to infer the class of a symbol from its typographical form, so
- NON TERMINALS are denoted by capital letters, ex. A, B, C and N
- Terminals are denoted by lower-case letters near the end of the alphabet, ex. x, y, and z
- Sequences of grammar symbols are denoted by Greek letters near the beginning of the
alphabet, ex. α (alpha), β (beta), γ (gamma)
- Lower-case letters near the beginning of the alphabet (a, b, c, etc.) stand for
themselves as terminals
- the empty sequence is denoted by ε (epsilon)
7The Production Process
- sentential form: the central data structure in the production process
- syntactic structure: added to the sentential form as a tree with leaves of grammar symbols
- production tree: combination of sentential form and syntactic structure
- a string of terminals is produced by a grammar by
applying production steps to a sentential form
8derivation
Each production step finds a Non-Terminal N in
the leaves of the sentential form, finds a
production rule N → α with N as its left hand side,
and replaces N in the sentential form with a tree
with N as the root and the right hand side of the
production rule, α, as the leaves. ex. given N → α
then βNγ can be replaced with βαγ
1 expression → '(' expression operator expression ')'
2 expression → '1'
3 operator → '+'
4 operator → '×'
Notation: R@P means Production rule R applied at
position P
START Symbol
Derivation of the String (1×(1+1)), specifically a
leftmost derivation
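The leftmost derivation can be replayed step by step. The sketch below is only an illustration: it assumes the two operator terminals are '+' and '×', stores the sentential form as a flat list of symbols (without the tree), and the names RULES, STEPS, and apply_rule are hypothetical.

```python
# Rules of the slide's grammar, numbered 1 to 4. The two operator
# terminals are an assumption; any two distinct symbols would do.
RULES = {
    1: ("expression", ["(", "expression", "operator", "expression", ")"]),
    2: ("expression", ["1"]),
    3: ("operator", ["+"]),
    4: ("operator", ["×"]),
}

def apply_rule(form, rule_number, position):
    """Apply rule R at 1-based position P (R@P) of a sentential form."""
    lhs, rhs = RULES[rule_number]
    index = position - 1
    if form[index] != lhs:
        raise ValueError(
            f"{rule_number}@{position}: expected {lhs}, found {form[index]}")
    return form[:index] + rhs + form[index + 1:]

# Leftmost derivation: every step rewrites the leftmost non-terminal.
STEPS = [(1, 1), (2, 2), (4, 3), (1, 4), (2, 5), (3, 6), (2, 7)]

form = ["expression"]  # the start symbol
print(" ".join(form))
for rule_number, position in STEPS:
    form = apply_rule(form, rule_number, position)
    print(f"{rule_number}@{position}: {' '.join(form)}")

result = "".join(form)
```

After the seven steps the sentential form contains only terminals, so the derivation is complete.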
9Parse Tree of our derivation
- Recursion is necessary to the production process
- We need to maintain the production TREE to find
out the semantics of the program which is the
task of the PARSER
10Extended forms of grammars
- non-terminal → zero or more grammar symbols
- basic single grammar rule format
- normally a richer notation is used
N → α
N → β     simple notation: each alternative separately
N → γ
can be combined as N → α | β | γ, where α, β, and γ are the alternatives of N
Thus far the format is BNF, which is good for
expressing nesting and recursion but is not as
effective at expressing repetition and optionality.
11Additional Notation postfix operators
- Extended BNF adds new forms
- R+ is one or more R's, expresses repetition
- R? is an occurrence of zero or one R, optionality
- R* is an occurrence of zero or more R's, optional repetition
- parentheses may be needed to group grammar
symbols so that the operators can operate on more
than one
12Properties of Grammars
- a non-terminal N is left recursive if, starting with a sentential form
N, we can produce another sentential form starting with N. ex.
- expression → expression '+' factor | factor
- a non-terminal N is nullable if, starting with N, we
can produce an empty sentential form.
- a non-terminal is useless if it can never produce a
string of terminal symbols
- a grammar is ambiguous if it can produce two
different production trees with the same leaves
in the same order. This means that the semantics
will differ since the semantics are derived from
the production tree.
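Two of these properties are easy to check mechanically. The following sketch assumes a made-up grammar stored as a dictionary from non-terminal to a list of alternatives; the grammar and the function names are hypothetical, and the left-recursion test covers only the direct case (an alternative that starts with its own left hand side).

```python
# Hypothetical grammar: each non-terminal maps to its alternatives,
# and each alternative is a list of grammar symbols.
GRAMMAR = {
    "S": [["A", "B"], ["x"]],
    "A": [[], ["y", "A"]],            # A has an empty alternative
    "B": [["A", "A"], ["z"]],
    "E": [["E", "+", "F"], ["F"]],    # directly left recursive
    "F": [["x"]],
}

def nullable_nonterminals(grammar):
    """Return the set of non-terminals that can produce the empty sequence."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for nonterminal, alternatives in grammar.items():
            if nonterminal in nullable:
                continue
            # Nullable if some alternative consists only of nullable symbols
            # (an empty alternative satisfies this trivially).
            if any(all(s in nullable for s in alt) for alt in alternatives):
                nullable.add(nonterminal)
                changed = True
    return nullable

def is_directly_left_recursive(grammar, nonterminal):
    """True if some alternative of the non-terminal starts with itself."""
    return any(bool(alt) and alt[0] == nonterminal
               for alt in grammar[nonterminal])

print(sorted(nullable_nonterminals(GRAMMAR)))
print(is_directly_left_recursive(GRAMMAR, "E"))
```

The nullable computation keeps sweeping the rules until nothing new is found, because B only becomes nullable after A has been marked, and S only after B.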
13Formalism
- 1. the basic unit of a grammar is the symbol
- symbols must be distinct, i.e. distinguishable
- examples: N, x, procedure_body, tk
- 2. the next element is the production rule
- given 2 sets of symbols V1 and V2 (a
vocabulary), a production rule is a 2-tuple (a pair)
- (N, α) such that N ∈ V1, α ∈ V2*
- where X* means a sequence of 0 or more
elements drawn from set X.
- the production rule is usually written N → α
14Formalism 2
- Now we can define a context-free grammar G as a
4-tuple
G = (VN, VT, S, P) where
VN is the set of non-terminal symbols
VT is the set of terminal symbols
S is the start symbol
P is the set of production rules
The above is the context-free
portion of the grammar. Real acceptable grammars
have to satisfy three context conditions.
15Formalism 3
1. VN ∩ VT = ∅, i.e. the terminal symbol set and the
non-terminal symbol set have no symbols in common
2. S ∈ VN, i.e. the start symbol is a member
of the non-terminal symbol set
3. P ⊆ {(N, α) | N ∈ VN, α ∈ (VN ∪ VT)*}, i.e. a production
rule must have a left side drawn from the
non-terminal symbol set and a right side drawn
from the combination of the non-terminal and
terminal symbol sets. No other symbols allowed.
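The three context conditions translate almost directly into set operations. This is a minimal sketch, assuming the grammar is given as Python sets plus a list of (left side, right side) pairs; the function name and the sample grammar (with '+' and '×' assumed as the operator terminals) are illustrative only.

```python
def check_context_conditions(vn, vt, s, p):
    """Return a list of violated conditions for the grammar (VN, VT, S, P)."""
    problems = []
    if vn & vt:                      # condition 1: VN and VT are disjoint
        problems.append("1: VN and VT share symbols")
    if s not in vn:                  # condition 2: S is a non-terminal
        problems.append("2: start symbol is not in VN")
    symbols = vn | vt
    for lhs, rhs in p:               # condition 3: rules use known symbols only
        if lhs not in vn or any(sym not in symbols for sym in rhs):
            problems.append(f"3: bad production rule ({lhs}, {rhs})")
    return problems

VN = {"expression", "operator"}
VT = {"(", ")", "1", "+", "×"}
P = [
    ("expression", ("(", "expression", "operator", "expression", ")")),
    ("expression", ("1",)),
    ("operator", ("+",)),
    ("operator", ("×",)),
]

good = check_context_conditions(VN, VT, "expression", P)
# A deliberately broken variant: overlapping sets, unknown start symbol,
# and a rule that uses the undeclared terminal '2'.
bad = check_context_conditions(
    VN, VT | {"operator"}, "term", P + [("expression", ("2",))])
print(good)
print(bad)
```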
16The Language Generated by a Grammar
- Sequences of symbols are called strings
- A string in the grammar may be directly derived
from another string, ex. α ⇒ β
- α ⇒ β means β is directly derivable from α iff
∃ γ, δ1, δ2, N ∈ VN such that α = δ1 N δ2, β = δ1 γ δ2,
(N, γ) ∈ P
- In English that's something like: the string β can
be produced from the string α iff α contains a
non-terminal symbol and there is a production
rule which allows β to be produced by
substitution for the non-terminal.
- this 'replacement' is called a production step
- more generally α ⇒* β iff α = β or if β can be
produced from α by a sequence of production steps
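The definition of a production step can be tried out directly. In this sketch strings are Python tuples of symbols, and the grammar, the function names, and the step bound max_steps are all assumptions for illustration; the bounded search is a simple way to test derivability for small cases, not an efficient parser.

```python
VN = {"expression", "operator"}
P = [
    ("expression", ("(", "expression", "operator", "expression", ")")),
    ("expression", ("1",)),
    ("operator", ("+",)),
    ("operator", ("×",)),   # operator terminals are assumed
]

def direct_derivations(alpha, vn, p):
    """Return every string beta reachable from alpha in one production step."""
    results = []
    for i, symbol in enumerate(alpha):
        if symbol not in vn:
            continue
        delta1, delta2 = alpha[:i], alpha[i + 1:]   # alpha = delta1 N delta2
        for lhs, gamma in p:
            if lhs == symbol:                       # (N, gamma) is in P
                results.append(delta1 + gamma + delta2)
    return results

def derives(alpha, beta, vn, p, max_steps):
    """Bounded check that beta follows from alpha in at most max_steps steps."""
    frontier = {alpha}
    for _ in range(max_steps + 1):
        if beta in frontier:
            return True
        frontier = {b for a in frontier for b in direct_derivations(a, vn, p)}
    return False

start = ("expression",)
target = ("(", "1", "+", "1", ")")
print(direct_derivations(start, VN, P))
print(derives(start, target, VN, P, 4))
```

Deriving the target needs exactly four production steps (rule 1 once, then one step for each of the three non-terminals it introduces), so a bound of 3 is too small.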
17finally
- a sentential form of a grammar G is defined as
any α such that S ⇒* α, i.e. all strings must derive from the
start symbol.
- a terminal production of a grammar G is defined
as a sentential form composed of all terminal
symbols: α such that S ⇒* α ∧ α ∈ VT*
- The Language L(G) = {α | S ⇒* α ∧ α ∈ VT*}
18Closure Algorithms
- information improving algorithms often start by
collecting information, then apply rules to extend
or draw conclusions from it.
- It can be very misleading to consider algorithms
by looking at their pieces in isolation
- We will use the construction of the calling
graph of a program as an example.
19Calling Graph of a Program
- Calling Graph is a directed graph (arrows) which
has a node for each routine (procedure or
function) in the program and an arrow from node A
to node B denotes that A calls B either directly
or indirectly.
void P() { Q(); S(); }
void Q() { R(); T(); }
void R() { P(); }
void S() { }
void T() { }
direct calling graph
20Calling Chains change the picture
- to complete the picture we apply the following rule
- If there is an arrow from node A to node B and
one from B to C, make sure there is an arrow from
A to C. This transitivity axiom is written
- A ⊆ B ∧ B ⊆ C → A ⊆ C
- where ⊆ is read "calls directly or indirectly"
- thus A ⊆ A indicates that "routine A is
recursive"
21General Form of a Closure Algorithm
- Three Elements
- Data Definitions: deriving from the nature of the problem
- Initializations: one or more rules for
initialization of the information from the
specific problem to its representation
- Inference Rules: one or more rules of the form If
I1, I2 then J (i.e. if this information is
present then I infer J)
22Recursion Detection A little more formally
- Data definitions
- Let G be a directed graph with one node for each
routine. The information items are arrows in G.
- An arrow from a node A to a node B means that
routine A calls routine B directly or indirectly.
- Initializations
- If the body of a routine A contains a call
to routine B, an arrow from A to B must be
present.
- Inference Rules
- If there is an arrow from node A to node B
and one from B to C, an arrow from A to C must be
present.
Two Things: 1) this doesn't specify what should
not be present (an additional rule is needed) and 2)
it doesn't guarantee that the algorithm will stop
23Iteration to the rescue
- Implement the closure algorithm through repeated
bottom-up sweeps
SET the flag Something changed TO True;
WHILE Something changed:
    SET Something changed TO False;
    FOR EACH Node 1 IN Graph:
        FOR EACH Node 2 IN Descendants of Node 1:
            FOR EACH Node 3 IN Descendants of Node 2:
                IF there is no arrow from Node 1 to Node 3:
                    Add an arrow from Node 1 to Node 3;
                    SET Something changed TO True;
Algorithm is O(n³) per sweep and could be
as bad as O(n⁵) once the repetitions of the WHILE
loop are counted. Author's handwaving suggests
in practice it tends to run in linear time.
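The sweeps above carry over to Python almost line for line. This is a sketch under stated assumptions: the graph is a dictionary of sets holding the direct calls of the example program (P calls Q and S, Q calls R and T, R calls P), and the names DIRECT_CALLS and calling_graph_closure are hypothetical.

```python
# Direct calls of the example program: P calls Q and S, and so on.
DIRECT_CALLS = {
    "P": {"Q", "S"},
    "Q": {"R", "T"},
    "R": {"P"},
    "S": set(),
    "T": set(),
}

def calling_graph_closure(direct_calls):
    """Add arrows until 'A calls B' and 'B calls C' always imply 'A calls C'."""
    # Initialization: copy the direct arrows so the input is not modified.
    arrows = {node: set(callees) for node, callees in direct_calls.items()}
    something_changed = True
    while something_changed:
        something_changed = False
        for node1 in arrows:
            # Iterate over snapshots because arrows[node1] grows in the loop.
            for node2 in list(arrows[node1]):
                for node3 in list(arrows[node2]):
                    if node3 not in arrows[node1]:
                        arrows[node1].add(node3)
                        something_changed = True
    return arrows

closure = calling_graph_closure(DIRECT_CALLS)
# A routine is recursive when the closure has an arrow from it to itself.
recursive = sorted(node for node, callees in closure.items() if node in callees)
print(recursive)
```

The loop must stop because each sweep either adds at least one arrow or ends the WHILE loop, and a graph with n nodes can hold at most n² arrows.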
24The Outline Code (pseudocode)
- Command lines end in ';'
- Control lines end in ':'
- Body of a Control Structure is indented, with the end
obvious by the return to the former indentation
- KEYWORDS all capitals
- Identifiers generally start with a capital letter
- Type Identifiers and field selectors start with
lower case letters (denote classes)
- Field Selectors marked by a 'dot'
ex. Node .left is a postfix operator
- // starts a comment line
25Summary End of Chapter 1
- Compiler is a file conversion program of a very
specialized nature
- takes Source Language and produces Target
Language and is written in Implementation Language
- Usual form is a series of processes
- lexical analysis, syntactic analysis,
intermediate form, then code generation
- Usual form of semantic representation is the AST
with context and semantic annotations
- Program Generators based on formalisms have
allowed compiler generation to be increasingly
automated.
26Homework for Week 1
- For the week do problems 1 through 21 at the end
of the chapter (Chapter 1). Pace yourself and
turn them in Monday.
- Get the demo compiler running and run some
examples starting with the example on page 18 and
illustrated in figure 1.17
- (2*((3*4)+9))
27References
- Text Modern Compiler Design Figures