Title: CSC 415: Translators and Compilers
1CSC 415 Translators and Compilers
2Course Outline
- Translators and Compilers
- Language Processors
- Compilation
- Syntactic Analysis
- Contextual Analysis
- Run-Time Organization
- Code Generation
- Interpretation
- Major Programming Project
- Project Definition and Planning
- Implementation
- Weekly Status Reports
- Project Presentation
3Project
- Implement a Compiler for the Programming Language Triangle
- Appendix B: Informal Specification of the Programming Language Triangle
- Appendix D: Class Diagrams for the Triangle Compiler
- Present Project Plan
- What and How
- Weekly Status Reports
- Work accomplished during the reporting period
- Deliverable progress, as a percentage of completion
- Problem areas
- Planned activities for the next reporting period
4Chapter 1 Introduction to Programming Languages
- Programming Language: a formal notation for expressing algorithms.
- Programming Language Processors: tools to enter, edit, translate, and interpret programs on machines.
- Machine Code: basic machine instructions
- Keep track of exact address of each data item and each instruction
- Encode each instruction as a bit string
- Assembly Language: symbolic names for operations, registers, and addresses.
5Programming Languages
- High Level Languages: notation similar to familiar mathematical notation
- Expressions: +, -, *, /
- Data Types: truth values, characters, integers, records, arrays
- Control Structures: if, case, while, for
- Declarations: constant values, variables, procedures, functions, types
- Abstraction: separates what is to be performed from how it is to be performed
- Encapsulation (or data abstraction): group together related declarations and selectively hide some
6Programming Language Processors
- Any system that manipulates programs expressed in some particular programming language
- Editors: enter, modify, and save program text
- Translators and Compilers: a translator translates text from one language to another. A compiler translates a program from a high-level language to a low-level language, preparing it to be run on a machine
- Checks program for syntactic and contextual errors
- Interpreters: run a program without compilation
- Command languages
- Database query languages
7Programming Languages Specifications
- Syntax
- Form of the program
- Defines symbols
- How phrases are composed
- Contextual constraints
- Scope: determines the scope of each declaration
- Type
- Semantics
- Meaning of the program
8Representation
- Syntax
- Backus-Naur Form (BNF) context-free grammar
- Terminal symbols (>=, while, ;)
- Non-terminal symbols (Program, Command, Expression, Declaration)
- Start symbol (Program)
- Production rules (define how phrases are composed from terminals and sub-phrases)
- N ::= a | b | ...
- Syntax Tree
- Used to define language in terms of strings and
terminal symbols
9Representation
- Semantics
- Abstract Syntax
- Concentrate on phrase structure alone
- Abstract Syntax Tree
10Contextual Constraints
- Scope
- Binding
- Static: determined by language processor
- Dynamic: determined at run-time
- Type
- Statically typed: language processor can detect all type errors
- Dynamically typed: type errors cannot be detected until run-time
Will assume static binding and static typing
11Semantics
- Concerned with meaning of program
- Behavior when run
- Usually specified informally
- Declarative sentences
- Could include side effects
- Correspond to production rules
12Chapter 2 Language Processors
- Translators and Compilers
- Interpreters
- Real and Abstract Machines
- Interpretive Compilers
- Portable Compilers
- Bootstrapping
- Case Study The Triangle Language Processor
13Translators and Compilers
- Translator: a program that accepts any text expressed in one language (the translator's source language), and generates a semantically-equivalent text expressed in another language (its target language)
- Chinese-into-English
- Java-into-C
- Java-into-x86
- X86 assembler
14Translators and Compilers
- Assembler: translates from an assembly language into the corresponding machine code
- Generates one machine code instruction per source instruction
- Compiler: translates from a high-level language into a low-level language
- Generates several machine-code instructions per source command.
15Translators and Compilers
- Disassembler: translates a machine code into the corresponding assembly language
- Decompiler: translates a low-level language into a high-level language
Question: Why would you want a disassembler or decompiler?
16Translators and Compilers
- Source Program: the source language text
- Object Program: the target language text
[Figure: the compiler performs a syntax check and a contextual-constraints check]
- Object program is semantically equivalent to source program
- If source program is well-formed
17Translators and Compilers
- Why would you want a
- Java-into-C translator?
- C-into-Java translator?
- Assembly-language-into-Pascal decompiler?
18Translators and Compilers
[Figure: tombstone diagrams for a program, a machine, and a translator]
P: Program Name
L: Implementation Language
M: Target Machine
For this to work, L must equal M, that is, the implementation language must be the same as the machine language
S: Source Language
T: Target Language
L: Translator's Implementation Language
An S-into-T translator is itself a program, expressed in implementation language L
19Translators and Compilers
- Translating a source program P
- Expressed in language S,
- Using an S-into-T translator
- Running on machine M
20Translators and Compilers
[Figure: tombstone diagram. The Java program sort is translated by a Java-into-x86 translator, running on an x86 machine, into the x86 program sort]
- Translating a source program sort
- Expressed in language Java,
- Using a Java-into-x86 translator
- Running on an x86 machine
The object program runs on the same machine as the compiler
21Translators and Compilers
[Figure: tombstone diagram. The Java program sort is translated by a Java-into-PPC translator, running on an x86 machine, into the PPC program sort, which is then downloaded to a PPC machine]
- Translating a source program sort
- Expressed in language Java,
- Using a Java-into-PPC translator
- Running on an x86 machine
- Downloaded to a PPC machine
Cross Compiler: the object program runs on a different machine than the compiler
22Translators and Compilers
[Figure: tombstone diagram. The Java program sort is translated into the C program sort, which is then translated into the x86 program sort; both translators run on an x86 machine]
- Translating a source program sort
- Expressed in language Java,
- Using a Java-into-C translator
- Running on an x86 machine
- Then translating the C program
- Using a C-into-x86 compiler
- Running on an x86 machine
- Into x86 object program
Two-stage Compiler: the source program is translated to another language before being translated into the object program
23Translators and Compilers
- Translator Rules
- Can run on machine M only if it is expressed in machine code M
- Source program must be expressed in translator's source language S
- Object program is expressed in the translator's target language T
- Object program is semantically equivalent to the source program
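The translator rules can be checked mechanically. The following is a minimal sketch, assuming a program is described only by its name and the language it is expressed in; the class and function names (Program, Translator, translate) are hypothetical and not part of the Triangle code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Program:
    name: str       # P
    language: str   # language the program text is expressed in


@dataclass(frozen=True)
class Translator:
    source: str     # S
    target: str     # T
    language: str   # L, the language the translator itself is expressed in


def translate(program: Program, translator: Translator, machine: str) -> Program:
    """Apply an S-into-T translator to a program on machine M, enforcing the rules."""
    if translator.language != machine:
        raise ValueError("translator must be expressed in the machine code of M")
    if program.language != translator.source:
        raise ValueError("source program must be expressed in source language S")
    # The object program has the same name (same meaning) but is expressed in T.
    return Program(program.name, translator.target)


# Two-stage compilation of sort: Java-into-C, then C-into-x86, both on an x86 machine.
sort_java = Program("sort", "Java")
sort_c = translate(sort_java, Translator("Java", "C", "x86"), "x86")
sort_x86 = translate(sort_c, Translator("C", "x86", "x86"), "x86")
print(sort_x86)
```

Semantic equivalence cannot be checked this way; the sketch only verifies that the languages in the tombstone diagrams line up.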
24Interpreters
- Accepts any program (source program) expressed in a particular language (source language) and runs that source program immediately
- Does not translate the source program into object code prior to execution
25Interpreters
[Figure: interpreter loop. Fetch Instruction → Analyze Instruction → Execute Instruction, repeated until Program Complete]
- Source program starts to run as soon as the first instruction is analyzed
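The fetch, analyze, execute cycle can be sketched in a few lines. This is only an illustration, assuming a toy stack-machine language whose instruction names (PUSH, ADD, MUL, PRINT) are invented for the example.

```python
def run(program):
    """Interpret a list of (opcode, operand) pairs for a toy stack machine."""
    stack, output, pc = [], [], 0
    while pc < len(program):              # loop until the program is complete
        opcode, operand = program[pc]     # fetch the next instruction
        pc += 1
        if opcode == "PUSH":              # analyze the instruction, then execute it
            stack.append(operand)
        elif opcode == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif opcode == "PRINT":
            output.append(stack.pop())
        else:
            raise ValueError(f"unknown instruction {opcode!r}")
    return output


# (2 + 3) * 4
result = run([("PUSH", 2), ("PUSH", 3), ("ADD", None),
              ("PUSH", 4), ("MUL", None), ("PRINT", None)])
print(result)
```

Each instruction is analyzed again every time it is executed, which is one reason interpretation is slow for programs with loops.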
26Interpreters
- When to Use Interpretation
- Interactive mode: want to see results of instruction before entering next instruction
- Only use program once
- Each instruction expected to be executed only once
- Instructions have simple formats
- Disadvantages
- Slow: up to 100 times slower than in machine code
27Interpreters
- Examples
- Basic
- Lisp
- Unix Command Language (shell)
- SQL
28Interpreters
S interpreter expressed in language L
Program P expressed in language S, using
Interpreter S, running on machine M
Program graph written in Basic running on a Basic
interpreter executed on an x86 machine
29Real and Abstract Machines
- Hardware emulation: using software to execute one set of machine code on another machine
- Can measure everything about the new machine except its speed
- Abstract machine: emulator
- Real machine: actual hardware
An abstract machine is functionally equivalent to a real machine if they both implement the same language L
30Real and Abstract Machines
New Machine Instruction (nmi) interpreter written
in C
nmi interpreter expressed in machine code M
nmi interpreter written in C
The nmi interpreter is translated into machine
code M using the C compiler
Compiler to translate C program into M machine
code
31Interpretive Compilers
- Combination of compiler and interpreter
- Translate source program into an intermediate language
- It is intermediate in level between the source language and ordinary machine code
- Its instructions have simple formats, and therefore can be analyzed easily and quickly
- Translation from the source language into the intermediate language is easy and fast
An interpretive compiler combines fast compilation with tolerable running speed
32Interpretive Compilers
Java into JVM translator running on machine M
JVM code interpreter running on machine M
A Java program P is first translated into
JVM-code, and then the JVM-code object program is
interpreted
33Portable Compilers
- A program is portable if it can be compiled and run on any machine, without change
- A portable program is more valuable than an unportable one, because its development cost can be spread over more copies
- Portability is measured by the proportion of code that remains unchanged when it is moved to a dissimilar machine
- Language affects portability
- Assembly language: 0% portable
- High level language: approaches 100% portability
34Portable Compilers
- Language Processors
- Valuable and widely used programs
- Typically written in high-level language
- Pascal, C, Java
- Part of language processor is machine dependent
- Code generation part
- Language processor is only about 50% portable
- Compiler that generates intermediate code is more
portable than a compiler that generates machine
code
35Portable Compilers
[Figure: tombstone diagrams for a portable Java-into-JVM-code compiler; the JVM-code interpreter is rewritten in C]
36Bootstrapping
- The language processor is used to process itself
- Implementation language is the source language
- Bootstrapping a portable compiler
- A portable compiler can be bootstrapped to make a true compiler, one that generates machine code, by writing an intermediate-language-into-machine-code translator
- Full bootstrap
- Writing the compiler in itself
- Using the latest version to upgrade the next version
- Half bootstrap
- Compiler expressed in itself but targeted for another machine
- Bootstrapping to improve efficiency
- Upgrade the compiler to optimize code generation as well as to improve compile efficiency
37Bootstrapping
Bootstrap an interpretive compiler to generate
machine code
First, write a JVM-coded-into-M translator in Java
Next, compile translator using existing
interpreter
Use translator to translate itself
Two stage Java-into-M compiler
Translate Java-into-JVM-code translator into
machine code
38Bootstrapping
Full bootstrap
[Figure: tombstone diagrams for compiler versions v1, v2, and v3]
Write Ada-S compiler in C
Convert the C version of Ada-S into Ada-S version of Ada-S
Extend Ada-S compiler to (full) Ada compiler
39Bootstrapping
Half bootstrap
40Bootstrapping
Bootstrap to improve efficiency
41Chapter 3 Compilation
- Phases
- Syntactic Analysis
- Contextual Analysis
- Code Generation
- Passes
- Multi-pass Compilation
- One-pass Compilation
- Compiler Design Issues
- Case Study The Triangle Compiler
42Phases
- Syntactic Analysis
- The source program is parsed to check whether it conforms to the source language's syntax, and to determine its phrase structure
- Contextual Analysis
- The parsed program is analyzed to check whether it conforms to the source language's contextual constraints
- Code Generation
- The checked program is translated to an object program, in accordance with the semantics of the source and target languages
43Phases
[Figure: Source Program → Syntactic Analysis → AST → Contextual Analysis → Decorated AST → Code Generation → Object Program. Syntactic Analysis and Contextual Analysis can each produce an Error Report]
44Syntactic Analysis
- To determine the source program's phrase structure
- Parsing
- Contextual analysis and code generation must know how the program is composed
- Commands, expressions, declarations, ...
- Check for conformance to the source language's syntax
- Construct suitable representation of its phrase structure (AST)
- AST
- Terminal nodes corresponding to identifiers, literals, and operators
- Sub trees representing the phrases of the source program
- Blanks and comments not in AST (no meaning)
- Punctuation and brackets not in AST (only separate and enclose)
45Contextual Analysis
- Analyzes the parsed program
- Scope rules
- Type rules
- Produces decorated AST
- AST with information gathered during contextual analysis
- Each applied occurrence of an identifier is linked to the corresponding declaration
- Each expression is decorated by its type T
46Code Generation
- The final translation of the checked program to an object program
- After syntactic and contextual analysis is completed
- Treatment of identifiers
- Constants
- Binds identifier to value
- Replace each occurrence of identifier with value
- Variables
- Binds identifier to some memory address
- Replace each occurrence of identifier by address
- Target language
- Assembly language
- Machine code
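The treatment of identifiers can be illustrated with a small sketch. It assumes a made-up target language with two instructions, LOADL (load a literal value) and LOAD (load from an address); the function names and the declaration format are hypothetical.

```python
def bind_declarations(declarations):
    """Bind each constant to its value and each variable to the next free address."""
    constants, addresses = {}, {}
    for kind, name, value in declarations:
        if kind == "const":
            constants[name] = value
        else:  # "var": allocate consecutive addresses 0, 1, 2, ...
            addresses[name] = len(addresses)
    return constants, addresses


def load(name, constants, addresses):
    """Generate the instruction that fetches the value of an applied identifier."""
    if name in constants:
        return f"LOADL {constants[name]}"   # occurrence replaced by the value
    return f"LOAD {addresses[name]}"        # occurrence replaced by the address


constants, addresses = bind_declarations(
    [("const", "n", 7), ("var", "x", None), ("var", "y", None)]
)
code = [load(name, constants, addresses) for name in ("n", "x", "y")]
print(code)
```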
47Passes
- Multi-pass compilation
- Traverses the program or AST several times
- One-pass compilation
- Single traverse of program
- Contextual analysis and code generation are
performed on the fly during syntactic analysis
48Compiler Design Issues
- Speed
- Compiler run time
- Space
- Storage: size of compiler, files generated
- Modularity
- Multi-pass compiler more modular than one-pass compiler
- Flexibility
- Multi-pass compiler is more flexible because it generates an AST that can be traversed in any order by the other phases
- Semantics-preserving transformations
- To optimize code must have multi-pass compiler
- Source language properties
- May restrict compiler choice: some language constructs may require multi-pass compilers
49Chapter 4 Syntactic Analysis
- Sub-phases of Syntactic Analysis
- Grammars Revisited
- Parsing
- Abstract Syntax Trees
- Scanning
- Case Study Syntactic Analysis in the Triangle
Compiler
50Structure of a Compiler
[Figure: compiler pipeline. Source code → Lexical Analyzer → tokens → Parser and Semantic Analyzer → parse tree → Intermediate Code Generation → intermediate representation → Optimization → intermediate representation → Assembly Code Generation → Assembly code. A Symbol Table accompanies the phases]
51Syntactic Analysis
- Main function
- Parse source program to discover its phrase structure
- Recursive-descent parsing
- Constructing an AST
- Scanning to group characters into tokens
52Sub-phases of Syntactic Analysis
- Scanning (or lexical analysis)
- Source program transformed to a stream of tokens
- Identifiers
- Literals
- Operators
- Keywords
- Punctuation
- Comments and blank spaces discarded
- Parsing
- To determine the source program's phrase structure
- Source program is input as a stream of tokens (from the Scanner)
- Treats each token as a terminal symbol
- Representation of phrase structure
- AST
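The scanning sub-phase can be sketched with regular expressions. This is a simplified illustration, not the Triangle scanner: the token kinds, the keyword set, and the `!` end-of-line comment rule are assumptions made for the example.

```python
import re

# Order matters: ":=" must be tried before ":".
TOKEN_SPEC = [
    ("COMMENT", r"![^\n]*"),               # assumed comment syntax: ! to end of line
    ("BLANK",   r"\s+"),
    ("INTLIT",  r"[0-9]+"),
    ("IDENT",   r"[A-Za-z][A-Za-z0-9]*"),
    ("BECOMES", r":="),
    ("COLON",   r":"),
    ("OP",      r"[+\-*/<>=]"),
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))
KEYWORDS = {"let", "var", "in"}            # a small subset, for illustration only


def scan(text):
    """Group characters into (kind, spelling) tokens, discarding blanks and comments."""
    tokens, pos = [], 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"illegal character {text[pos]!r} at position {pos}")
        pos = match.end()
        kind, spelling = match.lastgroup, match.group()
        if kind in ("COMMENT", "BLANK"):
            continue
        if kind == "IDENT" and spelling in KEYWORDS:
            kind = spelling.upper()        # keywords are found by table lookup
        tokens.append((kind, spelling))
    tokens.append(("EOT", ""))             # end-of-text marker for the parser
    return tokens


tokens = scan("let var y: Integer in y := y + 1")
print(tokens)
```

The parser then sees only the token stream, so it never has to deal with blanks or comments.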
53Lexical Analysis A Simple Example
    main() {
      int a, b, c;
      char number[5];
      /* get user inputs */
      a = atoi(gets(number));
      b = atoi(gets(number));
      /* calculate value for c */
      c = 2*(a+b) + a*(a+b);
      /* print results */
      printf("%d", c);
    }
- Scan the file character by character and group characters into words and punctuation (tokens), remove white space and comments
- Some tokens for this example
- main
- (
- )
- {
- int
- a
- ,
- b
- ,
- c
- ;
54Creating Tokens Mini-Triangle Example
[Figure: the input converter supplies the character string "let var y: Integer in y := y + 1" to the scanner, which produces the tokens let, var, Ident. (y), colon, Ident. (Integer), in, Ident. (y), becomes, Ident. (y), op. (+), Intlit. (1), eot]
55Tokens in Triangle
- // literals, identifiers, operators...
- INTLITERAL = 0, "<int>",
- CHARLITERAL = 1, "<char>",
- IDENTIFIER = 2, "<identifier>",
- OPERATOR = 3, "<operator>",
- // reserved words - must be in alphabetical order...
- ARRAY = 4, "array",
- BEGIN = 5, "begin",
- CONST = 6, "const",
- DO = 7, "do",
- ELSE = 8, "else",
- END = 9, "end",
- FUNC = 10, "func",
- IF = 11, "if",
- IN = 12, "in",
- LET = 13, "let",
- OF = 14, "of",
- PROC = 15, "proc",
- // punctuation...
- DOT = 21, ".",
- COLON = 22, ":",
- SEMICOLON = 23, ";",
- COMMA = 24, ",",
- BECOMES = 25, ":=",
- IS = 26, "~",
- // brackets...
- LPAREN = 27, "(",
- RPAREN = 28, ")",
- LBRACKET = 29, "[",
- RBRACKET = 30, "]",
- LCURLY = 31, "{",
- RCURLY = 32, "}",
- // special tokens...
- EOT = 33, "",
- ERROR = 34, "<error>"
56Grammars Revisited
- Context free grammars
- Generates a set of sentences
- Each sentence is a string of terminal symbols
- An unambiguous sentence has a unique phrase structure embodied in its syntax tree
- Develop parsers from context-free grammars
57Regular Expressions
- A regular expression (RE) is a convenient notation for expressing a set of strings of terminal symbols
- Main features
- | separates alternatives
- * indicates that the previous item may be repeated zero or more times
- ( and ) are grouping parentheses
58Regular Expression Basics
- ε: the empty string, a special string of length 0
- Regular expression operations
- | separates alternatives
- * indicates that the previous item may be repeated zero or more times (repetition)
- ( and ) are grouping parentheses
59Regular Expression Basics
- Algebraic Properties
- | is commutative and associative
- r|s = s|r
- r|(s|t) = (r|s)|t
- Concatenation is associative
- (rs)t = r(st)
- Concatenation distributes over |
- r(s|t) = rs|rt
- (s|t)r = sr|tr
- ε is the identity for concatenation
- εr = r
- rε = r
- * is idempotent
- r** = r*
- r* = (r|ε)*
60Regular Expression Basics
- Common Extensions
- r+: one or more of expression r, same as rr*
- r^k: k repetitions of r
- r^3 = rrr
- [^r]: the characters not in the expression r
- [^\t\n]
- [r-z]: range of characters
- [0-9a-z]
- r?: zero or one copy of expression (used for fields of an expression that are optional)
61Regular Expression Example
- Regular Expression for Representing Months
- Examples of legal inputs
- January represented as 1 or 01
- October represented as 10
- First Try: (0|1|ε)[0-9]
- Matches all legal inputs? Yes
- 1, 2, 3, ..., 10, 11, 12, 01, 02, ..., 09
- Matches any illegal inputs? Yes
- 0, 00, 18
62Regular Expression Example
- Regular Expression for Representing Months
- Examples of legal inputs
- January represented as 1 or 01
- October represented as 10
- Second Try: [1-9]|(0[1-9])|(1[0-2])
- Matches all legal inputs? Yes
- 1, 2, 3, ..., 10, 11, 12, 01, 02, ..., 09
- Matches any illegal inputs? No
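Both attempts can be checked with Python's re module. This is a sketch in Python's regular expression syntax, in which the optional leading 0 or 1 of the first try is written [01]? instead of using an explicit empty alternative; the helper name accepts is made up for the example.

```python
import re

FIRST_TRY = re.compile(r"[01]?[0-9]")
SECOND_TRY = re.compile(r"[1-9]|0[1-9]|1[0-2]")

# 1..12 without a leading zero, plus 01..09
legal = [str(month) for month in range(1, 13)] + [f"{month:02d}" for month in range(1, 10)]
illegal = ["0", "00", "18"]


def accepts(pattern, text):
    """True if the whole input string matches the pattern."""
    return pattern.fullmatch(text) is not None


for name, pattern in (("first try", FIRST_TRY), ("second try", SECOND_TRY)):
    print(name,
          "accepts all legal inputs:", all(accepts(pattern, text) for text in legal),
          "| illegal inputs accepted:", [text for text in illegal if accepts(pattern, text)])
```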
63Regular Expression Example
- Regular Expression for Floating Point Numbers
- Examples of legal inputs
- 1.0, 0.2, 3.14159, -1.0, 2.7e8, 1.0E-6
- Assume that a 0 is required before numbers less than 1, and that extra leading zeros are not prevented, so numbers such as 0011 or 0003.14159 are legal
- Building the regular expression
- Assume
- digit → 0|1|2|3|4|5|6|7|8|9
- Handle simple decimals such as 1.0, 0.2, 3.14159
- digit+.digit+
- Add an optional sign (only minus, no plus)
- (-|ε)digit+.digit+ or -?digit+.digit+
64Regular Expression Example
- Regular Expression for Floating Point Numbers (cont.)
- Building the regular expression (cont.)
- Format for the exponent
- (E|e)(+|-)?(digit+)
- Adding it as an optional expression to the decimal part
- (-|ε)digit+.digit+((E|e)(+|-)?(digit+))?
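The finished expression can be tried out in Python. This sketch assumes that digit is repeated one or more times on both sides of the decimal point, and it escapes the point because an unescaped . matches any character in Python's syntax.

```python
import re

DIGIT = r"[0-9]"
# optional minus sign, decimal part, optional exponent with optional sign
FLOAT = re.compile(rf"-?{DIGIT}+\.{DIGIT}+([Ee][+-]?{DIGIT}+)?")


def is_float(text):
    """True if the whole input string is a floating point number."""
    return FLOAT.fullmatch(text) is not None


for text in ["1.0", "0.2", "3.14159", "-1.0", "2.7e8", "1.0E-6", "0003.14159",
             ".5", "1.", "+1.0", "1.0e"]:
    print(f"{text:>12}  {is_float(text)}")
```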
65Extended BNF
- Extended BNF (EBNF)
- Combination of BNF and RE
- N ::= X, where N is a nonterminal symbol and X is an extended RE, i.e., an RE constructed from both terminal and nonterminal symbols
- EBNF
- Right hand side may use |, *, (, )
- Right hand side may contain both terminal and nonterminal symbols
66Example EBNF
- Expression ::= primary-Expression (Operator primary-Expression)*
- primary-Expression ::= Identifier
- | ( Expression )
- Identifier ::= a|b|c|d|e
- Operator ::= +|-|*|/
- Generates
- e
- a + b
- a - b - c
- a + (b * c)
- a * (b + c) / d
- a - (b - (c - (d - e)))
67Grammar Transformations
- Left Factorization
- XY | XZ is equivalent to X(Y | Z)
- single-Command ::= V-name := Expression
- | if Expression then single-Command
- | if Expression then single-Command
- else single-Command
- single-Command ::= V-name := Expression
- | if Expression then single-Command
- (ε | else single-Command)
68Grammar Transformations
- Elimination of left recursion
- N ::= X | NY is equivalent to N ::= X(Y)*
- Identifier ::= Letter
- | Identifier Letter
- | Identifier Digit
- Identifier ::= Letter
- | Identifier (Letter | Digit)
- Identifier ::= Letter (Letter | Digit)*
69Grammar Transformations
- Substitution of nonterminal symbols
- Given N ::= X, we can substitute each occurrence of N with X
- iff N ::= X is nonrecursive and is the only production rule for N
- single-Command ::= for Control-Variable := Expression To-or-Downto
- Expression do single-Command
- | ...
- Control-Variable ::= Identifier
- To-or-Downto ::= to
- | downto
- single-Command ::= for Identifier := Expression (to | downto)
- Expression do single-Command
- | ...
70Scanning (Lexical Analysis)
- The purpose of scanning is to recognize tokens in the source program. Or, to group input characters (the source program text) into tokens.
- Difference between parsing and scanning
- Parsing groups terminal symbols, which are tokens, into larger phrases such as expressions and commands and analyzes the tokens for correctness and structure
- Scanning groups individual characters into tokens
73What Does a Scanner Do?
- Handle keywords (reserved words)
- Recognizes identifiers and keywords
- Match explicitly
- Write regular expression for each keyword
- Identifier is any alphanumeric string which is not a keyword
- Match as an identifier, perform lookup
- No special regular expressions for keywords
- When an identifier is found, perform lookup into preloaded keyword table
How does Triangle handle keywords? Discuss in terms of efficiency and ease of coding.
74What Does a Scanner Do?
- Remove white space
- Tabs, spaces, new lines
- Remove comments
- Single line
- -- Ada comment
- Multi-line, start and end delimiters
- { Pascal comment }
- /* C comment */
- Nested
- Runaway comments
- Nonterminated comments can't be detected till end of file
75What Does a Scanner Do?
- Perform look ahead
- Multi-character tokens
- 1..10 vs. 1.10
- ,
- <, <=
- etc
- Challenging input languages
- FORTRAN
- Keywords not reserved
- Blanks are not a delimiter
- Example (comma vs. decimal)
- DO10I=1,5: start of a do loop (equivalent to a C for loop)
- DO10I=1.5: an assignment statement, assignment to variable DO10I
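The 1..10 versus 1.10 case shows why a scanner sometimes has to peek past the current character. A minimal sketch, assuming the token kinds INTLIT and REALLIT (names invented here) and that the input position is already on a digit:

```python
def scan_number(text, pos=0):
    """Return (kind, spelling, next_pos) for the numeric token starting at pos."""
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    # A '.' belongs to this token only if a digit follows it,
    # so look one character beyond the dot before consuming it.
    if end + 1 < len(text) and text[end] == "." and text[end + 1].isdigit():
        end += 1
        while end < len(text) and text[end].isdigit():
            end += 1
        return ("REALLIT", text[pos:end], end)
    return ("INTLIT", text[pos:end], end)


print(scan_number("1..10"))   # the integer 1, followed by a ".." token
print(scan_number("1.10"))    # a single real literal
```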
76What Does a Scanner Do?
- Challenging input languages (cont.)
- PL/I, keywords not reserved
- IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;
77What Does a Scanner Do?
- Error Handling
- Error token passed to parser which reports the error
- Recovery
- Delete characters from current token which have been read so far, restart scanning at next unread character
- Delete the first character of the current lexeme and resume scanning from next character.
- Examples of lexical errors
- 3.25e: bad format for a constant
- Var1 illegal character
- Some errors that are not lexical errors
- Mistyped keywords
- Begim
- Mismatched parenthesis
- Undeclared variables
78Scanner Implementation
- Issues
- Simpler design: parser doesn't have to worry about white space, etc.
- Improve compiler efficiency: allows the construction of a specialized and potentially more efficient processor
- Compiler portability is enhanced: input alphabet peculiarities and other device-specific anomalies can be restricted to the scanner
79Scanner Implementation
- What are the keywords in Triangle?
- How are keywords and identifiers implemented in Triangle?
- Is look ahead implemented in Triangle?
- If so, how?
81Parsing
- Given an unambiguous, context free grammar, parsing is
- Recognition of an input string, i.e., deciding whether or not the input string is a sentence of the grammar
- Parsing of an input string, i.e., recognition of the input string plus determination of its phrase structure. The phrase structure can be represented by a syntax tree, or otherwise.
Unambiguous is necessary so that every sentence of the grammar will form exactly one syntax tree.
82Parsing
- The syntax of programming language constructs is described by context-free grammars.
- Advantages of unambiguous, context-free grammars
- A precise, yet easy-to-understand, syntactic specification of the programming language
- For certain classes of grammars we can automatically construct an efficient parser that determines if a source program is syntactically well formed.
- Imparts a structure to a programming language that is useful for the translation of source programs into correct object code and for the detection of errors.
- Easier to add new constructs to the language if the implementation is based on a grammatical description of the language
83Parsing
- Check the syntax (structure) of a program and create a tree representation of the program
- Programming languages have non-regular constructs
- Nesting
- Recursion
- Context-free grammars are used to express the
syntax for programming languages
84Context-Free Grammars
- Comprised of
- A set of tokens or terminal symbols
- A set of non-terminal symbols
- A set of rules or productions which express the legal relationships between symbols
- A start or goal symbol
- Example
- expr → expr - digit
- expr → expr + digit
- expr → digit
- digit → 0|1|2|...|9
- Tokens: -, +, 0, 1, 2, ..., 9
- Non-terminals: expr, digit
- Start symbol: expr
85Context-Free Grammars
- expr → expr - digit
- expr → expr + digit
- expr → digit
- digit → 0|1|2|...|9
Example input: 3 + 8 - 2
86Checking for Correct Syntax
- Given a grammar for a language and a program, how do you know if the syntax of the program is legal?
- A legal program can be derived from the start symbol of the grammar
Grammar must be unambiguous and context-free
87Deriving a String
- The derivation begins with the start symbol
- At each step of a derivation the right hand side of a grammar rule is used to replace a non-terminal symbol
- Continue replacing non-terminals until only terminal symbols remain
expr ⇒ expr - digit (Rule 1) ⇒ expr - 2 (Rule 4) ⇒ expr + digit - 2 (Rule 2)
⇒ expr + 8 - 2 (Rule 4) ⇒ digit + 8 - 2 (Rule 3) ⇒ 3 + 8 - 2 (Rule 4)
88Rightmost Derivation
- The rightmost non-terminal is replaced in each step
Rule 4
expr - digit ⇒ expr - 2
Rule 2
expr - 2 ⇒ expr + digit - 2
Rule 4
expr + digit - 2 ⇒ expr + 8 - 2
Rule 3
expr + 8 - 2 ⇒ digit + 8 - 2
Rule 4
digit + 8 - 2 ⇒ 3 + 8 - 2
89Leftmost Derivation
- The leftmost non-terminal is replaced in each step
Rule 2
expr - digit ⇒ expr + digit - digit
Rule 3
expr + digit - digit ⇒ digit + digit - digit
Rule 4
digit + digit - digit ⇒ 3 + digit - digit
Rule 4
3 + digit - digit ⇒ 3 + 8 - digit
Rule 4
3 + 8 - digit ⇒ 3 + 8 - 2
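A leftmost derivation can be replayed mechanically. The sketch below assumes the rule numbering 1: expr → expr - digit, 2: expr → expr + digit, 3: expr → digit, 4: digit → 0|1|...|9, which is how the rule labels on these slides appear to be used; the function name leftmost_step is made up.

```python
RULES = {
    1: ("expr", ["expr", "-", "digit"]),
    2: ("expr", ["expr", "+", "digit"]),
    3: ("expr", ["digit"]),
}
NONTERMINALS = {"expr", "digit"}


def leftmost_step(form, rule=None, digit=None):
    """Replace the leftmost non-terminal of a sentential form (a list of symbols)."""
    index = next(i for i, symbol in enumerate(form) if symbol in NONTERMINALS)
    if form[index] == "digit":             # rule 4: the caller picks the digit
        replacement = [digit]
    else:
        left_side, replacement = RULES[rule]
        assert left_side == form[index]
    return form[:index] + replacement + form[index + 1:]


form = ["expr"]
steps = [dict(rule=1), dict(rule=2), dict(rule=3),
         dict(digit="3"), dict(digit="8"), dict(digit="2")]
for step in steps:
    form = leftmost_step(form, **step)
    print(" ".join(form))
```

The last line printed is the sentence itself; every earlier line still contains at least one non-terminal.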
90Leftmost Derivation
- The leftmost non-terminal is replaced in each step
Rule 2
expr - digit ⇒ expr + digit - digit
Rule 3
expr + digit - digit ⇒ digit + digit - digit
Rule 4
digit + digit - digit ⇒ 3 + digit - digit
Rule 4
3 + digit - digit ⇒ 3 + 8 - digit
Rule 4
3 + 8 - digit ⇒ 3 + 8 - 2
[Figure: syntax tree for 3 + 8 - 2, with its nodes numbered in the order in which the derivation steps create them]
91Bottom-Up Parsing
- Parser examines terminal symbols of the input string, in order from left to right
- Reconstructs the syntax tree from the bottom (terminal nodes) up (toward the root node)
- Bottom-up parsing reduces a string w to the start symbol of the grammar.
- At each reduction step a particular sub-string matching the right side of a production is replaced by the symbol on the left of that production, and if the sub-string is chosen correctly at each step, a rightmost derivation is traced out in reverse.
92Bottom-Up Parsing
- Types of bottom-up parsing algorithms
- Shift-reduce parsing
- At each reduction step a particular sub-string matching the right side of a production is replaced by the symbol on the left of that production, and if the sub-string is chosen correctly at each step, a rightmost derivation is traced out in reverse.
- LR(k) parsing
- L is for left-to-right scanning of the input, the R is for constructing a right-most derivation in reverse, and the k is for the number of input symbols of look-ahead that are used in making parsing decisions.
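Shift-reduce parsing can be sketched for the small grammar expr → expr + digit | expr - digit | digit, digit → 0|1|...|9. This is a hand-written illustration that hard-codes when to reduce for this one grammar; it is not a general LR(k) parser, and the function name is made up.

```python
def shift_reduce(tokens):
    """Return (accepted, reductions) for a list of single-character tokens."""
    stack, reductions = [], []
    for token in tokens:
        stack.append(token)                               # shift
        if len(token) == 1 and token.isdigit():
            stack[-1] = "digit"                           # reduce by digit -> 0|...|9
            reductions.append("digit -> " + token)
            if len(stack) == 1:
                stack[-1] = "expr"                        # reduce by expr -> digit
                reductions.append("expr -> digit")
            elif len(stack) >= 3 and stack[-3] == "expr" and stack[-2] in ("+", "-"):
                operator = stack[-2]
                stack[-3:] = ["expr"]                     # reduce by expr -> expr op digit
                reductions.append(f"expr -> expr {operator} digit")
    return stack == ["expr"], reductions                  # accept if only the start symbol is left


accepted, reductions = shift_reduce(list("3+8-2"))
print(accepted)
for reduction in reversed(reductions):   # read backwards: a rightmost derivation
    print(reduction)
```

Printed in reverse, the reductions are the rule applications of the rightmost derivation of 3 + 8 - 2.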
93Bottom-Up Parsing Example: 3 + 8 - 2
94Bottom-Up Parsing Example: 3 + 8 - 2
95Bottom-Up Parsing Example: abbcde
abbcde → aAbcde
[Figure: partial syntax tree over the input a b b c d e after the first reduction to A]
96Bottom-Up Parsing Example: abbcde
aAbcde → aAde
[Figure: partial syntax tree after the second reduction to A]
97Bottom-Up Parsing Example: abbcde
aAde → aABe
[Figure: partial syntax tree after the reduction to B]
98Bottom-Up Parsing Example: abbcde
aABe → S
[Figure: completed syntax tree with root S]
99Bottom-Up Parsing Example: the cat sees a rat.
the cat sees a rat. → the Noun sees a rat.
[Figure: partial syntax tree; Noun covers cat]
100Bottom-Up Parsing Example: the cat sees a rat.
the Noun sees a rat. → Subject sees a rat.
[Figure: partial syntax tree; Subject covers the Noun]
101Bottom-Up Parsing Example: the cat sees a rat.
Subject sees a rat. → Subject Verb a rat.
[Figure: partial syntax tree; Verb covers sees]
102Bottom-Up Parsing Example: the cat sees a rat.
Subject Verb a rat. → Subject Verb a Noun.
[Figure: partial syntax tree; a second Noun covers rat]
103Bottom-Up Parsing Example: the cat sees a rat.
Subject Verb a Noun. → Subject Verb Object.
What would happen if we chose Subject → a Noun instead of Object → a Noun?
[Figure: partial syntax tree; Object covers a Noun]
104Bottom-Up Parsing Example: the cat sees a rat.
Subject Verb Object. → Sentence
[Figure: completed syntax tree with root Sentence]
105Top-Down Parsing
- The parser examines the terminal symbols of the input string, in order from left to right.
- The parser reconstructs its syntax tree from the top (root node) down (towards the terminal nodes).
An attempt to find the leftmost derivation for an input string
106Top-Down Parsers
- General rules for top-down parsers
- Start with just a stub for the root node
- At each step the parser takes the leftmost stub
- If the stub is labeled by terminal symbol t, the parser connects it to the next input terminal symbol, which must be t. (If not, the parser has detected a syntactic error.)
- If the stub is labeled by nonterminal symbol N, the parser chooses one of the production rules N ::= X1...Xn, and grows branches from the node labeled by N to new stubs labeled X1, ..., Xn (in order from left to right).
- Parsing succeeds when and if the whole input string is connected up to the syntax tree.
107Top-Down Parsing
- Two forms
- Backtracking parsers
- Guess which rule to apply, then back up and change the choice if they cannot proceed
- Predictive Parsers
- Predict which rule to apply by using look-ahead tokens
Backtracking parsers are not very efficient. We will cover predictive parsers
108Predictive Parsers
- Many types
- LL(1) parsing
- First L is for scanning the input from left to right; second L is for producing a left-most derivation; 1 is for using one input symbol of look-ahead
- Table driven with an explicit stack to maintain the parse tree
- Recursive descent parsing
- Uses recursive subroutines to traverse the parse tree
109Predictive Parsers (Lookahead)
- Lookahead in predictive parsing
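- The lookahead token (next token in the input) is used to determine which rule should be used next
- For example
[Figure: syntax tree after the first token 7 has been read; a term node derives num, which matches 7]
110Predictive Parsers (Lookahead)
[Figure: the syntax tree is extended as the following lookahead tokens are read; each num is derived from a term]
111Predictive Parsers (Lookahead)
[Figure: completed syntax tree over the num tokens 7, 3, and 2; the last term derives ε]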
112Recursive-Descent Parsing
- Top-down parsing algorithm
- Consists of a group of methods (programs) parseN, one for each nonterminal symbol N of the grammar.
- The task of each method parseN is to parse a single N-phrase
- These parsing methods cooperate to parse complete sentences
113Recursive-Descent Parsing
[Figure: syntax tree for the input "the cat sees a rat ." with root Sentence and four stubs: Subject, Verb, Object, and "."]
- Decide which production rule to apply. There is only one, rule 1.
- This step creates four stubs.
114Recursive-Descent Parsing
[Figures, slides 114 to 119: the syntax tree for "the cat sees a rat ." grown step by step as the stubs Subject, Verb, Object, and "." are processed from left to right. A first Noun node appears on slide 114 and a second Noun node on slide 117.]
120Recursive-Descent Parser for Micro-English
- Sentence ::= Subject Verb Object .
- Subject ::= I | a Noun | the Noun
- Object ::= me | a Noun | the Noun
- Noun ::= cat | mat | rat
- Verb ::= like | is | see | sees
- Parsing methods, one per nonterminal symbol
- parseSentence
- parseSubject
- parseObject
- parseVerb
- parseNoun
121Recursive-Descent Parser for Micro-English
- Sentence ::= Subject Verb Object .
- parseSentence
  - parseSubject
  - parseVerb
  - parseObject
  - parseEnd
122Recursive-Descent Parser for Micro-English
- Subject ::= I | a Noun | the Noun
- parseSubject
  - if input is "I"
    - accept
  - else if input is "a"
    - accept
    - parseNoun
  - else if input is "the"
    - accept
    - parseNoun
  - else error
123Recursive-Descent Parser for Micro-English
- Noun ::= cat | mat | rat
- parseNoun
  - if input is "cat"
    - accept
  - else if input is "mat"
    - accept
  - else if input is "rat"
    - accept
  - else error
124Recursive-Descent Parser for Micro-English
- Object ::= me | a Noun | the Noun
- parseObject
  - if input is "me"
    - accept
  - else if input is "a"
    - accept
    - parseNoun
  - else if input is "the"
    - accept
    - parseNoun
  - else error
125Recursive-Descent Parser for Micro-English
- Verb ::= like | is | see | sees
- parseVerb
  - if input is "like"
    - accept
  - else if input is "is"
    - accept
  - else if input is "see"
    - accept
  - else if input is "sees"
    - accept
  - else error
126Recursive-Descent Parser for Micro-English
- parseEnd handles the final "." of Sentence ::= Subject Verb Object .
- parseEnd
  - if input is "."
    - accept
  - else error
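The parsing methods for Micro-English can be put together as one small Java class. This is a minimal sketch, not the course's reference solution: it assumes the input words (including the final ".") are separated by blanks, and the class name MicroEnglishParser and the helpers input, error, and parses are hypothetical names introduced here.

```java
import java.util.Set;

// Minimal sketch of a recursive-descent parser for Micro-English.
// Assumes blank-separated tokens, e.g. "the cat sees a rat ."
final class MicroEnglishParser {
    private static final Set<String> NOUNS = Set.of("cat", "mat", "rat");
    private static final Set<String> VERBS = Set.of("like", "is", "see", "sees");

    private final String[] words;
    private int next = 0;

    private MicroEnglishParser(String text) {
        words = text.trim().split("\\s+");
    }

    // Current input token, or "<end>" when the input is exhausted.
    private String input() {
        return next < words.length ? words[next] : "<end>";
    }

    // Consume the current token.
    private void accept() {
        next++;
    }

    private void error(String expected) {
        throw new IllegalArgumentException("expected " + expected + " but found " + input());
    }

    // Sentence ::= Subject Verb Object .
    private void parseSentence() {
        parseSubject();
        parseVerb();
        parseObject();
        parseEnd();
    }

    // Subject ::= I | a Noun | the Noun
    private void parseSubject() {
        if (input().equals("I")) {
            accept();
        } else if (input().equals("a") || input().equals("the")) {
            accept();
            parseNoun();
        } else {
            error("I, a, or the");
        }
    }

    // Object ::= me | a Noun | the Noun
    private void parseObject() {
        if (input().equals("me")) {
            accept();
        } else if (input().equals("a") || input().equals("the")) {
            accept();
            parseNoun();
        } else {
            error("me, a, or the");
        }
    }

    // Noun ::= cat | mat | rat
    private void parseNoun() {
        if (NOUNS.contains(input())) accept(); else error("a noun");
    }

    // Verb ::= like | is | see | sees
    private void parseVerb() {
        if (VERBS.contains(input())) accept(); else error("a verb");
    }

    private void parseEnd() {
        if (input().equals(".")) accept(); else error("'.'");
    }

    // True if the whole text is a Micro-English sentence.
    static boolean parses(String text) {
        MicroEnglishParser p = new MicroEnglishParser(text);
        try {
            p.parseSentence();
            return p.next == p.words.length;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("the cat sees a rat ."));
        System.out.println(parses("me sees a rat ."));
    }
}
```

The grammar is purely syntactic, so a sentence such as "I sees me ." also parses. Agreement between subject and verb is not something this grammar expresses.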
127Systematic Development of a Recursive-Descent
Parser
- Given a (suitable) context-free grammar
- Express the grammar in EBNF, with a single production rule for each nonterminal symbol, and perform any necessary grammar transformations
- Always eliminate left recursion
- Always left-factorize whenever possible
- Transcribe each EBNF production rule N ::= X to a parsing method parseN, whose body is determined by X
- Make the parser consist of
- A private variable currentToken
- Private parsing methods developed in the previous step
- Private auxiliary methods accept and acceptIt, both of which call the scanner
- A public parse method that calls parseS (where S is the start symbol of the grammar), having first called the scanner to store the first input token in currentToken
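The parser layout described above might look as follows in Java. This is a sketch under stated assumptions: the toy grammar S ::= x (+ x)* is invented for illustration (it is the form obtained by eliminating left recursion from S ::= x | S + x), and TinyParser, TinyScanner, Kind, SyntaxError, and isSentence are hypothetical names. Only currentToken, accept, acceptIt, parseS, and parse come from the outline.

```java
// Sketch of the parser structure: currentToken, accept, acceptIt, parseS, parse.
// Toy grammar (an assumption): S ::= x (+ x)*
final class TinyParser {
    enum Kind { X, PLUS, EOT, ERROR }

    // Hypothetical one-character scanner: 'x' -> X, '+' -> PLUS, end of text -> EOT.
    private static final class TinyScanner {
        private final String text;
        private int pos = 0;

        TinyScanner(String text) { this.text = text; }

        Kind scan() {
            while (pos < text.length() && text.charAt(pos) == ' ') pos++;
            if (pos >= text.length()) return Kind.EOT;
            char c = text.charAt(pos++);
            return c == 'x' ? Kind.X : c == '+' ? Kind.PLUS : Kind.ERROR;
        }
    }

    static final class SyntaxError extends RuntimeException {
        SyntaxError(String message) { super(message); }
    }

    private final TinyScanner scanner;
    private Kind currentToken;

    TinyParser(String text) { scanner = new TinyScanner(text); }

    // Check that the current token is the expected one, then fetch the next token.
    private void accept(Kind expected) {
        if (currentToken != expected) {
            throw new SyntaxError("expected " + expected + " but found " + currentToken);
        }
        currentToken = scanner.scan();
    }

    // Fetch the next token unconditionally (the caller already knows the current one).
    private void acceptIt() {
        currentToken = scanner.scan();
    }

    // S ::= x (+ x)*
    private void parseS() {
        accept(Kind.X);
        while (currentToken == Kind.PLUS) {
            acceptIt();
            accept(Kind.X);
        }
    }

    // Store the first token, parse from the start symbol, then require end of text.
    public void parse() {
        currentToken = scanner.scan();
        parseS();
        accept(Kind.EOT);
    }

    static boolean isSentence(String text) {
        try {
            new TinyParser(text).parse();
            return true;
        } catch (SyntaxError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSentence("x + x + x"));
        System.out.println(isSentence("x +"));
    }
}
```

acceptIt is used where the loop condition has already inspected the token, so a second comparison would be redundant. accept is used where the token still has to be checked.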
128Quote of the Week
- "C makes it easy to shoot yourself in the foot; C++ makes it harder, but when you do, it blows away your whole leg." - Bjarne Stroustrup
129Quote of the Week
- Did you really say that?
- Dr. Bjarne Stroustrup
- Yes, I did say something along the lines of "C makes it easy to shoot yourself in the foot; C++ makes it harder, but when you do, it blows your whole leg off." What people tend to miss is that what I said about C++ is to a varying extent true for all powerful languages. As you protect people from simple dangers, they get themselves into new and less obvious problems. Someone who avoids the simple problems may simply be heading for a not-so-simple one. One problem with very supporting and protective environments is that the hard problems may be discovered too late or be too hard to remedy once discovered. Also, a rare problem is harder to find than a frequent one because you don't suspect it.
- I also said, "Within C++, there is a much smaller and cleaner language struggling to get out." For example, that quote can be found on page 207 of The Design and Evolution of C++. And no, that smaller and cleaner language is not Java or C#. The quote occurs in a section entitled "Beyond Files and Syntax". I was pointing out that the C++ semantics is much cleaner than its syntax. I was thinking of programming styles, libraries and programming environments that emphasized the cleaner and more effective practices over archaic uses focused on the low-level aspects of C.
130Converting EBNF Production Rules to Parsing
Methods
- For production rule N ::= X
- Convert the production rule to a parsing method named parseN
    private void parseN() {
        parse X
    }
- Refine parse ε to a dummy statement
- Refine parse t (where t is a terminal symbol) to accept(t) or acceptIt()
- Refine parse N (where N is a nonterminal symbol) to a call of the corresponding parsing method
    parseN();
- Refine parse X Y to
    {
        parse X
        parse Y
    }
- Refine parse X | Y to
    switch (currentToken.kind) {
        cases in starters[[X]]:
            parse X
            break;
        cases in starters[[Y]]:
            parse Y
            break;
        default:
            report a syntactic error
    }
131Converting EBNF Production Rules to Parsing
Methods
- For X | Y
- Choose parse X only if the current token is one that can start an X-phrase
- Choose parse Y only if the current token is one that can start a Y-phrase
- starters[[X]] and starters[[Y]] must be disjoint
- For X*
- Refine parse X* to
    while (currentToken.kind is in starters[[X]])
        parse X
- starters[[X]] must be disjoint from the set of tokens that can follow X* in this particular context
132Converting EBNF Production Rules to Parsing
Methods
- A grammar that satisfies both these conditions is called an LL(1) grammar
- Recursive-descent parsing is suitable only for LL(1) grammars
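The two disjointness conditions can be checked mechanically once the starter sets are known. The Java sketch below assumes the starter and follower sets have already been computed by hand and are passed in as sets of token spellings. The class StarterSets and its method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Sketch of the two LL(1) conditions, given precomputed starter sets.
final class StarterSets {

    // Condition for X | Y | ...: the starter sets of the alternatives are pairwise disjoint.
    static boolean alternativesDisjoint(List<Set<String>> starters) {
        for (int i = 0; i < starters.size(); i++) {
            for (int j = i + 1; j < starters.size(); j++) {
                if (!Collections.disjoint(starters.get(i), starters.get(j))) {
                    return false;
                }
            }
        }
        return true;
    }

    // Condition for X*: starters of X are disjoint from the tokens that can follow X*.
    static boolean repetitionOk(Set<String> startersOfX, Set<String> followers) {
        return Collections.disjoint(startersOfX, followers);
    }

    public static void main(String[] args) {
        // Subject ::= I | a Noun | the Noun has starters {I}, {a}, {the}.
        System.out.println(alternativesDisjoint(List.of(Set.of("I"), Set.of("a"), Set.of("the"))));
        // A hypothetical rule with alternatives "a Noun" and "a Adjective Noun"
        // fails the test until it is left-factorized.
        System.out.println(alternativesDisjoint(List.of(Set.of("a"), Set.of("a"))));
    }
}
```

When the first check fails, left factorization is the usual remedy: the common prefix is pulled out so that the remaining alternatives start with different tokens.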
133Error Repair
- Good programming languages are designed with a relatively large distance between syntactically correct programs, to increase the likelihood that conceptual mistakes are caught as syntactic errors.
- Error repair usually occurs at two levels
- Local: repairs mistakes with little global import, such as missing semicolons and undeclared variables.
- Scope: repairs the program text so that scopes are correct. Errors of this kind include unbalanced parentheses and begin/end blocks.
134Error Repair
- Repair actions can be divided into insertions and deletions. Typically the compiler will use some look-ahead and backtracking in attempting to make progress in the parse. There is great variation among compilers, though some languages (PL/C) carry a tradition of good error repair.
- Goals of error repair are
- No input should cause the compiler to collapse
- Illegal constructs are flagged
- Frequently occurring errors are repaired gracefully
- Minimal stuttering or cascading of errors
LL-style parsing lends itself well to error repair, since the compiler uses the grammar's rules to predict what should occur next in the input.
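Insertion and deletion repairs can be illustrated with a deliberately small Java sketch. Everything here is an assumption made for illustration: the toy grammar Program ::= Statement (; Statement)* with Statement ::= identifier, the blank-separated tokens, and the names RepairingParser and repair. It is not a description of how any particular compiler repairs errors.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of local error repair by insertion and deletion.
// Assumed grammar: Program ::= Statement (; Statement)*   Statement ::= identifier
final class RepairingParser {
    private final String[] tokens;
    private int pos = 0;
    private final List<String> errors = new ArrayList<>();

    private RepairingParser(String text) {
        String trimmed = text.trim();
        tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    private boolean atEnd() {
        return pos >= tokens.length;
    }

    // A lowercase word is taken to be an identifier, the only starter of Statement.
    private boolean startsStatement() {
        return !atEnd() && tokens[pos].matches("[a-z]+");
    }

    private void parseStatement() {
        if (startsStatement()) {
            pos++;
        } else {
            // Insertion repair: pretend a statement was present and carry on.
            errors.add("inserted missing statement at token " + pos);
        }
    }

    private void parseProgram() {
        parseStatement();
        while (!atEnd()) {
            if (tokens[pos].equals(";")) {
                pos++;
                parseStatement();
            } else if (startsStatement()) {
                // Insertion repair: the grammar predicts ';' here, so act as if it were present.
                errors.add("inserted missing ';' before '" + tokens[pos] + "'");
                parseStatement();
            } else {
                // Deletion repair: skip a token that cannot continue the program.
                errors.add("deleted unexpected '" + tokens[pos] + "'");
                pos++;
            }
        }
    }

    // Parses the text and returns the list of repairs made (empty if none were needed).
    static List<String> repair(String text) {
        RepairingParser p = new RepairingParser(text);
        p.parseProgram();
        return p.errors;
    }

    public static void main(String[] args) {
        System.out.println(repair("a b ; c"));
        System.out.println(repair("a ) ; b"));
    }
}
```

Every loop iteration consumes at least one token, so the parser always terminates, in line with the goal that no input should make the compiler collapse. The sketch makes no attempt to limit cascading errors.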
135Mini-Triangle Production Rules
- Program ::= Command    Program (1.14)
- Command ::= V-name := Expression    AssignCommand (1.15a)
    | Identifier ( Expression )    CallCommand (1.15b)
    | Command ; Command    SequentialCommand (1.15c)
    | if Expression then Command else Command    IfCommand (1.15d)
    | while Expression do Command    WhileCommand (1.15e)
    | let Declaration in Command    LetCommand (1.15f)
- Expression ::= Integer-Literal    IntegerExpression (1.16a)
    | V-name    VnameExpression (1.16b)
    | Operator Expression    UnaryExpression (1.16c)
    | Expression Operator Expression    BinaryExpression (1.16d)
- V-name ::= Identifier    SimpleVname (1.17)
- Declaration ::= const Identifier ~ Expression    ConstDeclaration (1.18a)
    | var Identifier : Type-denoter    VarDeclaration (1.18b)
136Abstract Syntax Trees
- An explicit representation of the source program's phrase structure
- AST for Mini-Triangle
137Abstract Syntax Trees
[Figure: AST node forms. Program has one subtree C (1.14). AssignCommand has subtrees V and E (1.15a). CallCommand has subtrees Identifier (with its spelling) and E (1.15b). SequentialCommand has subtrees C1 and C2 (1.15c).]
- Program ::= Command    Program (1.14)
- Command ::= V-name := Expression    AssignCommand (1.15a)
    | Identifier ( Expression )    CallCommand (1.15b)
    | Command ; Command    SequentialCommand (1.15c)
138Abstract Syntax Trees
[Figure: AST node forms. IfCommand has subtrees E, C1, and C2 (1.15d). WhileCommand has subtrees E and C (1.15e). LetCommand has subtrees D and C (1.15f).]
- Command ::= if Expression then Command else Command    IfCommand (1.15d)
    | while Expression do Command    WhileCommand (1.15e)
    | let Declaration in Command    LetCommand (1.15f)
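One way to represent such ASTs in Java is with one class per production-rule label. The sketch below is a simplified illustration that covers only a few of the Mini-Triangle rules. It is not the class design of the actual Triangle compiler (see Appendix D for that); the wrapper class MiniTriangleAst, the field names, and the example method are assumptions.

```java
// Simplified sketch of AST classes for part of Mini-Triangle.
// Each node class is named after the label of its production rule.
final class MiniTriangleAst {
    interface Expression {}
    interface Command {}

    static final class IntegerExpression implements Expression {      // (1.16a)
        final String spelling;
        IntegerExpression(String spelling) { this.spelling = spelling; }
        @Override public String toString() { return "IntegerExpression(" + spelling + ")"; }
    }

    static final class VnameExpression implements Expression {        // (1.16b), with SimpleVname (1.17)
        final String identifier;
        VnameExpression(String identifier) { this.identifier = identifier; }
        @Override public String toString() { return "VnameExpression(SimpleVname(" + identifier + "))"; }
    }

    static final class BinaryExpression implements Expression {       // (1.16d)
        final Expression left;
        final String operator;
        final Expression right;
        BinaryExpression(Expression left, String operator, Expression right) {
            this.left = left;
            this.operator = operator;
            this.right = right;
        }
        @Override public String toString() {
            return "BinaryExpression(" + left + ", " + operator + ", " + right + ")";
        }
    }

    static final class AssignCommand implements Command {             // (1.15a)
        final String vname;
        final Expression expression;
        AssignCommand(String vname, Expression expression) {
            this.vname = vname;
            this.expression = expression;
        }
        @Override public String toString() {
            return "AssignCommand(SimpleVname(" + vname + "), " + expression + ")";
        }
    }

    static final class WhileCommand implements Command {              // (1.15e)
        final Expression condition;
        final Command body;
        WhileCommand(Expression condition, Command body) {
            this.condition = condition;
            this.body = body;
        }
        @Override public String toString() {
            return "WhileCommand(" + condition + ", " + body + ")";
        }
    }

    // AST for the command: while n > 0 do n := n - 1
    static Command example() {
        return new WhileCommand(
            new BinaryExpression(new VnameExpression("n"), ">", new IntegerExpression("0")),
            new AssignCommand("n",
                new BinaryExpression(new VnameExpression("n"), "-", new IntegerExpression("1"))));
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

Each node keeps exactly one field per subphrase, which mirrors the subtrees E and C of WhileCommand and V and E of AssignCommand.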
139Midterm Review Chapter 1
- Context-free grammar
- A finite set of terminal symbols
- A finite set of non-terminal symbols
- A start symbol
- A finite set of production rules
- Aspects of a programming language that need to be specified
- Syntax: form of programs
- Contextual constraints: scope rules and type rules
- Semantics: meaning of programs
140Midterm Review Chapter 1
- Language specification
- Informal: written in English
- Formal: precise notation (BNF, EBNF)
- Unambiguous
- Consistent
- Complete
- Context-free language
- Syntax tree
- Phrase
- Sentence
141Midterm Review Chapter 1
- Syntax tree
- Terminal nodes labeled by terminal symbols
- Non-terminal nodes labeled by non-terminal symbols
- Abstract Syntax Tree (AST)
- Each non-terminal node is labeled by a production rule
- Each non-terminal node has exactly one subtree for each subphrase
- Does not generate sentences
142Midterm Review Chapter 2
- Translator
- Accepts any text expressed in one language (source language) and generates a semantically-equivalent text expressed in another language (target language)
- Compiler
- Translates from a high-level language into a low-level language
- Interpreter
- A program that accepts any program (source program) expressed in a particular language (source language) and runs that source program immediately
143Midterm Review Chapter 2
- Interpretive compiler
- Combination of compiler and interpreter
- Some of the advantages of each
- Portable compiler
- Compiled and run on any machine, without change
- Portability measured by proportion of