CSC 415: Translators and Compilers

1
CSC 415 Translators and Compilers
  • Dr. Chuck Lillie

2
Course Outline
  • Translators and Compilers
  • Language Processors
  • Compilation
  • Syntactic Analysis
  • Contextual Analysis
  • Run-Time Organization
  • Code Generation
  • Interpretation
  • Major Programming Project
  • Project Definition and Planning
  • Implementation
  • Weekly Status Reports
  • Project Presentation

3
Project
  • Implement a Compiler for the Programming Language
    Triangle
  • Appendix B Informal Specification of the
    Programming Language Triangle
  • Appendix D Class Diagrams for the Triangle
    Compiler
  • Present Project Plan
  • What and How
  • Weekly Status Reports
  • Work accomplished during the reporting period
  • Deliverable progress, as a percentage of
    completion
  • Problem areas
  • Planned activities for the next reporting period

4
Chapter 1 Introduction to Programming Languages
  • Programming Language A formal notation for
    expressing algorithms.
  • Programming Language Processors Tools to enter,
    edit, translate, and interpret programs on
    machines.
  • Machine Code Basic machine instructions
  • Keep track of exact address of each data item and
    each instruction
  • Encode each instruction as a bit string
  • Assembly Language Symbolic names for operations,
    registers, and addresses.

5
Programming Languages
  • High Level Languages Notation similar to
    familiar mathematical notation
  • Expressions +, -, *, /
  • Data Types truth values, characters,
    integers, records, arrays
  • Control Structures if, case, while, for
  • Declarations constant values, variables,
    procedures, functions, types
  • Abstraction separates what is to be performed
    from how it is to be performed
  • Encapsulation (or data abstraction) group
    together related declarations and selectively
    hide some

6
Programming Languages
  • Any system that manipulates programs expressed in
    some particular programming language
  • Editors enter, modify, and save program text
  • Translators and Compilers Translates text from
    one language to another. Compiler translates a
    program from a high-level language to a low-level
    language, preparing it to be run on a machine
  • Checks program for syntactic and contextual
    errors
  • Interpreters Run a program without compilation
  • Command languages
  • Database query languages

7
Programming Languages Specifications
  • Syntax
  • Form of the program
  • Defines symbols
  • How phrases are composed
  • Contextual constraints
  • Scope determine scope of each declaration
  • Type
  • Semantics
  • Meaning of the program

8
Representation
  • Syntax
  • Backus-Naur Form (BNF) context-free grammar
  • Terminal symbols (>=, while, ;)
  • Non-terminal symbols (Program, Command,
    Expression, Declaration)
  • Start symbol (Program)
  • Production rules (define how phrases are
    composed from terminals and sub-phrases)
  • N ::= α | β | ...
  • Syntax Tree
  • Used to define language in terms of strings and
    terminal symbols

9
Representation
  • Semantics
  • Abstract Syntax
  • Concentrate on phrase structure alone
  • Abstract Syntax Tree

10
Contextual Constraints
  • Scope
  • Binding
  • Static determined by language processor
  • Dynamic determined at run-time
  • Type
  • Statically language processor can detect all
    errors
  • Dynamically type errors cannot be detected until
    run-time

Will assume static binding and statically typed
11
Semantics
  • Concerned with meaning of program
  • Behavior when run
  • Usually specified informally
  • Declarative sentences
  • Could include side effects
  • Correspond to production rules

12
Chapter 2 Language Processors
  • Translators and Compilers
  • Interpreters
  • Real and Abstract Machines
  • Interpretive Compilers
  • Portable Compilers
  • Bootstrapping
  • Case Study The Triangle Language Processor

13
Translators Compilers
  • Translator a program that accepts any text
    expressed in one language (the translator's
    source language), and generates a
    semantically-equivalent text expressed in another
    language (its target language)
  • Chinese-into-English
  • Java-into-C
  • Java-into-x86
  • X86 assembler

14
Translators Compilers
  • Assembler translates from an assembly language
    into the corresponding machine code
  • Generates one machine code instruction per source
    instruction
  • Compiler translates from a high-level language
    into a low-level language
  • Generates several machine-code instructions per
    source command.

15
Translators Compilers
  • Disassembler translates a machine code into the
    corresponding assembly language
  • Decompiler translates a low-level language into
    a high-level language

Question: Why would you want a disassembler or
decompiler?
16
Translators Compilers
  • Source Program the source language text
  • Object Program the target language text

Compiler
Syntax Check
Context Constraints
  • Object program semantically equivalent to source
    program
  • If source program is well-formed

17
Translators Compilers
  • Why would you want a
  • Java-into-C translator
  • C-into-Java translator
  • Assembly-language-into-Pascal decompiler

18
Translators Compilers
Program notation:
P = Program Name
L = Implementation Language
M = Target Machine
For this to work, L must equal M; that is, the
implementation language must be the same as the
machine language
Translator notation:
S = Source Language
T = Target Language
L = Translator's Implementation Language
An S-into-T translator is itself a program, expressed
in language L, so it runs only on a machine L
19
Translators Compilers
  • Translating a source program P
  • Expressed in language S,
  • Using an S-into-T translator
  • Running on machine M

20
Translators Compilers
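[Diagram: the source program sort, expressed in Java, is fed to a Java-into-x86 compiler running on an x86 machine, which produces the object program sort expressed in x86 code]
  • Translating a source program sort
  • Expressed in language Java,
  • Using a Java-into-x86 translator
  • Running on an x86 machine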

The object program is running on the same machine
as the compiler
21
Translators Compilers
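[Diagram: the source program sort, expressed in Java, is fed to a Java-into-PPC compiler running on an x86 machine; the object program sort, expressed in PPC code, is then downloaded to a PPC machine]
  • Translating a source program sort
  • Expressed in language Java,
  • Using a Java-into-PPC translator
  • Running on an x86 machine
  • Downloaded to a PPC machine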

Cross Compiler The object program is running on
a different machine than the compiler
22
Translators Compilers
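[Diagram: the source program sort, expressed in Java, is translated by a Java-into-C translator running on an x86 machine into sort expressed in C; a C-into-x86 compiler running on an x86 machine then translates it into sort expressed in x86 code]
  • Translating a source program sort
  • Expressed in language Java,
  • Using a Java-into-C translator
  • Running on an x86 machine
  • Then translating the C program
  • Using a C-into-x86 compiler
  • Running on an x86 machine
  • Into x86 object program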

Two-stage Compiler The source program is
translated to another language before being
translated into the object program
23
Translators Compilers
  • Translator Rules
  • Can run on machine M only if it is expressed in
    machine code M
  • Source program must be expressed in the translator's
    source language S
  • Object program is expressed in the translator's
    target language T
  • Object program is semantically equivalent to the
    source program

24
Interpreters
  • Accepts any program (source program) expressed in
    a particular language (source language) and runs
    that source program immediately
  • Does not translate the source program into object
    code prior to execution

25
Interpreters
[Diagram: interpreter loop. Fetch instruction, analyze instruction, execute instruction; repeat until the program is complete]
  • Source program starts to run as soon as the first
    instruction is analyzed
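
The fetch, analyze, execute cycle can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the instruction set (PUSH, ADD, MUL, PRINT, HALT) and the function name run are hypothetical and do not come from the course material.

```python
def run(program):
    """Interpret a list of instruction strings and return the printed values.

    The stack-machine instruction set used here is a made-up example.
    """
    stack, output, pc = [], [], 0
    while True:
        instruction = program[pc]                 # fetch the next instruction
        pc += 1
        opcode, *operands = instruction.split()   # analyze it
        if opcode == "HALT":                      # program complete
            return output
        if opcode == "PUSH":                      # execute it
            stack.append(int(operands[0]))
        elif opcode == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif opcode == "PRINT":
            output.append(stack.pop())
        else:
            raise ValueError(f"unknown instruction {instruction!r}")


# (2 + 3) * 4, executed one instruction at a time with no translation step.
result = run(["PUSH 2", "PUSH 3", "ADD", "PUSH 4", "MUL", "PRINT", "HALT"])
```

Each instruction is analyzed again every time it is reached, which is one reason interpretation is slower than running compiled machine code.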

26
Interpreters
  • When to Use Interpretation
  • Interactive mode want to see results of
    instruction before entering next instruction
  • Only use program once
  • Each instruction expected to be executed only
    once
  • Instructions have simple formats
  • Disadvantages
  • Slow: up to 100 times slower than running machine code

27
Interpreters
  • Examples
  • Basic
  • Lisp
  • Unix Command Language (shell)
  • SQL

28
Interpreters
S interpreter expressed in language L
Program P expressed in language S, using
Interpreter S, running on machine M
Program graph written in Basic running on a Basic
interpreter executed on an x86 machine
29
Real and Abstract Machines
  • Hardware emulation Using software to execute one
    set of machine code on another machine
  • Can measure everything about the new machine
    except its speed
  • Abstract machine emulator
  • Real machine actual hardware

An abstract machine is functionally equivalent to
a real machine if they both implement the same
language L
30
Real and Abstract Machines
New Machine Instruction (nmi) interpreter written
in C
nmi interpreter expressed in machine code M
nmi interpreter written in C
The nmi interpreter is translated into machine
code M using the C compiler
Compiler to translate C program into M machine
code
31
Interpretive Compilers
  • Combination of compiler and interpreter
  • Translate source program into an intermediate
    language
  • It is intermediate in level between the source
    language and ordinary machine code
  • Its instructions have simple formats, and
    therefore can be analyzed easily and quickly
  • Translation from the source language into the
    intermediate language is easy and fast

An interpretive compiler combines fast
compilation with tolerable running speed
32
Interpretive Compilers
Java into JVM translator running on machine M
JVM code interpreter running on machine M
A Java program P is first translated into
JVM-code, and then the JVM-code object program is
interpreted
33
Portable Compilers
  • A program is portable if it can be compiled and
    run on any machine, without change
  • A portable program is more valuable than an
    unportable one, because its development cost can
    be spread over more copies
  • Portability is measured by the proportion of code
    that remains unchanged when it is moved to a
    dissimilar machine
  • Language affects portability
  • Assembly language 0% portable
  • High level language approaches 100% portability

34
Portable Compilers
  • Language Processors
  • Valuable and widely used programs
  • Typically written in high-level language
  • Pascal, C, Java
  • Part of language processor is machine dependent
  • Code generation part
  • Language processor is only about 50% portable
  • Compiler that generates intermediate code is more
    portable than a compiler that generates machine
    code

35
Portable Compilers
Java
JVM
Java
Rewrite interpreter in C
36
Bootstrapping
  • The language processor is used to process itself
  • Implementation language is the source language
  • Bootstrapping a portable compiler
  • A portable compiler can be bootstrapped to make a
    true compiler (one that generates machine code)
    by writing an intermediate-language-into-machine-code
    translator
  • Full bootstrap
  • Writing the compiler in itself
  • Using the latest version to upgrade the next
    version
  • Half bootstrap
  • Compiler expressed in itself but targeted for
    another machine
  • Bootstrapping to improve efficiency
  • Upgrade the compiler to optimize code generation
    as well as to improve compile efficiency

37
Bootstrapping
Bootstrap an interpretive compiler to generate
machine code
First, write a JVM-coded-into-M translator in Java
Next, compile translator using existing
interpreter
Use translator to translate itself
Two stage Java-into-M compiler
Translate Java-into-JVM-code translator into
machine code
38
Bootstrapping
Full bootstrap
Write Ada-S compiler in C (v1)
Convert the C version of the Ada-S compiler into an
Ada-S version of the Ada-S compiler (v2)
Extend Ada-S compiler to (full) Ada compiler (v3)
39
Bootstrapping
Half bootstrap
40
Bootstrapping
Bootstrap to improve efficiency
41
Chapter 3 Compilation
  • Phases
  • Syntactic Analysis
  • Contextual Analysis
  • Code Generation
  • Passes
  • Multi-pass Compilation
  • One-pass Compilation
  • Compiler Design Issues
  • Case Study The Triangle Compiler

42
Phases
  • Syntactic Analysis
  • The source program is parsed to check whether it
    conforms to the source language's syntax, and to
    determine its phrase structure
  • Contextual Analysis
  • The parsed program is analyzed to check whether
    it conforms to the source language's contextual
    constraints
  • Code Generation
  • The checked program is translated to an object
    program, in accordance with the semantics of the
    source and target languages

43
Phases
Source Program
Syntactic Analysis
Error Report
AST
Contextual Analysis
Error Report
Decorated AST
Code Generation
Object Program
44
Syntactic Analysis
  • To determine the source program's phrase
    structure
  • Parsing
  • Contextual analysis and code generation must know
    how the program is composed
  • Commands, expressions, declarations,
  • Check for conformance to the source language's
    syntax
  • Construct suitable representation of its phrase
    structure (AST)
  • AST
  • Terminal nodes corresponding to identifiers,
    literals, and operators
  • Subtrees representing the phrases of the source
    program
  • Blanks and comments not in AST (no meaning)
  • Punctuation and brackets not in AST (only
    separate and enclose)

45
Contextual Analysis
  • Analyzes the parsed program
  • Scope rules
  • Type rules
  • Produces decorated AST
  • AST with information gathered during contextual
    analysis
  • Each applied occurrence of an identifier is
    linked to the corresponding declaration
  • Each expression is decorated by its type T

46
Code Generation
  • The final translation of the checked program to
    an object program
  • After syntactic and contextual analysis is
    completed
  • Treatment of identifiers
  • Constants
  • Binds identifier to value
  • Replace each occurrence of identifier with value
  • Variables
  • Binds identifier to some memory address
  • Replace each occurrence of identifier by address
  • Target language
  • Assembly language
  • Machine code

47
Passes
  • Multi-pass compilation
  • Traverses the program or AST several times
  • One-pass compilation
  • Single traverse of program
  • Contextual analysis and code generation are
    performed on the fly during syntactic analysis

48
Compiler Design Issues
  • Speed
  • Compiler run time
  • Space
  • Storage size of compiler files generated
  • Modularity
  • Multi-pass compiler more modular than one-pass
    compiler
  • Flexibility
  • Multi-pass compiler is more flexible because it
    generates an AST that can be traversed in any
    order by the other phases
  • Semantics-preserving transformations
  • To optimize code must have multi-pass compiler
  • Source language properties
  • May restrict compiler choice: some language
    constructs may require multi-pass compilers

49
Chapter 4 Syntactic Analysis
  • Sub-phases of Syntactic Analysis
  • Grammars Revisited
  • Parsing
  • Abstract Syntax Trees
  • Scanning
  • Case Study Syntactic Analysis in the Triangle
    Compiler

50
Structure of a Compiler
Lexical Analyzer
Source code
Symbol Table
tokens
Parser Semantic Analyzer
parse tree
Intermediate Code Generation
intermediate representation
Optimization
intermediate representation
Assembly Code Generation
Assembly code
51
Syntactic Analysis
  • Main function
  • Parse source program to discover its phrase
    structure
  • Recursive-descent parsing
  • Constructing an AST
  • Scanning to group characters into tokens

52
Sub-phases of Syntactic Analysis
  • Scanning (or lexical analysis)
  • Source program transformed to a stream of tokens
  • Identifiers
  • Literals
  • Operators
  • Keywords
  • Punctuation
  • Comments and blank spaces discarded
  • Parsing
  • To determine the source program's phrase structure
  • Source program is input as a stream of tokens
    (from the Scanner)
  • Treats each token as a terminal symbol
  • Representation of phrase structure
  • AST

53
Lexical Analysis A Simple Example
main()
{
    int a, b, c;
    char number[5];
    /* get user inputs */
    a = atoi(gets(number));
    b = atoi(gets(number));
    /* calculate value for c */
    c = 2*(a+b) + a*(a+b);
    /* print results */
    printf("%d", c);
}
  • Scan the file character by character and group
    characters into words and punctuation (tokens),
    remove white space and comments
  • Some tokens for this example
  • main
  • (
  • )
  • int
  • a
  • ,
  • b
  • ,
  • c

54
Creating Tokens Mini-Triangle Example
[Diagram: an input converter delivers the source text as a character string; the scanner groups the characters into the token stream let, var, Ident. (y), colon, Ident. (Integer), in, Ident. (y), becomes, Ident. (y), op., Intlit. (1), eot]
55
Tokens in Triangle
  • // literals, identifiers, operators...
  • INTLITERAL = 0, "<int>",
  • CHARLITERAL = 1, "<char>",
  • IDENTIFIER = 2, "<identifier>",
  • OPERATOR = 3, "<operator>",
  • // reserved words - must be in alphabetical
    order...
  • ARRAY = 4, "array",
  • BEGIN = 5, "begin",
  • CONST = 6, "const",
  • DO = 7, "do",
  • ELSE = 8, "else",
  • END = 9, "end",
  • FUNC = 10, "func",
  • IF = 11, "if",
  • IN = 12, "in",
  • LET = 13, "let",
  • OF = 14, "of",
  • PROC = 15, "proc",
  • // punctuation...
  • DOT = 21, ".",
  • COLON = 22, ":",
  • SEMICOLON = 23, ";",
  • COMMA = 24, ",",
  • BECOMES = 25, ":=",
  • IS = 26, "~",
  • // brackets...
  • LPAREN = 27, "(",
  • RPAREN = 28, ")",
  • LBRACKET = 29, "[",
  • RBRACKET = 30, "]",
  • LCURLY = 31, "{",
  • RCURLY = 32, "}",
  • // special tokens...
  • EOT = 33, "",
  • ERROR = 34, "<error>"
56
Grammars Revisited
  • Context free grammars
  • Generates a set of sentences
  • Each sentence is a string of terminal symbols
  • An unambiguous sentence has a unique phrase
    structure embodied in its syntax tree
  • Develop parsers from context-free grammars

57
Regular Expressions
  • A regular expression (RE) is a convenient
    notation for expressing a set of strings of
    terminal symbols
  • Main features
  • | separates alternatives
  • * indicates that the previous item may be
    repeated zero or more times
  • ( and ) are grouping parentheses

58
Regular Expression Basics
  • ε The empty string, a special string of length 0
  • Regular expression operations
  • | separates alternatives
  • * indicates that the previous item may be
    repeated zero or more times (repetition)
  • ( and ) are grouping parentheses

59
Regular Expression Basics
  • Algebraic Properties
  • | is commutative and associative
  • r|s = s|r
  • r|(s|t) = (r|s)|t
  • Concatenation is associative
  • (rs)t = r(st)
  • Concatenation distributes over |
  • r(s|t) = rs|rt
  • (s|t)r = sr|tr
  • ε is the identity for concatenation
  • εr = r
  • rε = r
  • * is idempotent
  • r** = r*
  • r* = (r|ε)*

60
Regular Expression Basics
  • Common Extensions
  • r+ one or more of expression r, same as rr*
  • r^k k repetitions of r
  • r^3 = rrr
  • [^r] the characters not in the expression r
  • [^\t\n]
  • [r-z] range of characters
  • [0-9a-z]
  • r? zero or one copy of expression r (used for
    fields of an expression that are optional)

61
Regular Expression Example
  • Regular Expression for Representing Months
  • Examples of legal inputs
  • January represented as 1 or 01
  • October represented as 10
  • First Try (0|1|ε)[0-9]
  • Matches all legal inputs? Yes
  • 1, 2, 3, ..., 10, 11, 12, 01, 02, ..., 09
  • Matches any illegal inputs? Yes
  • 0, 00, 18

62
Regular Expression Example
  • Regular Expression for Representing Months
  • Examples of legal inputs
  • January represented as 1 or 01
  • October represented as 10
  • Second Try [1-9] | (0[1-9]) | (1[0-2])
  • Matches all legal inputs? Yes
  • 1, 2, 3, ..., 10, 11, 12, 01, 02, ..., 09
  • Matches any illegal inputs? No
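
Both attempts can be checked mechanically. The following Python sketch assumes the two expressions are (0|1|ε)[0-9] and [1-9] | 0[1-9] | 1[0-2], with ε written as an empty alternative; the names FIRST_TRY, SECOND_TRY and accepts are illustrative only.

```python
import re

# First try: an optional leading 0 or 1 followed by any digit.
FIRST_TRY = re.compile(r"(0|1|)[0-9]")
# Second try: 1-9, or 01-09, or 10-12.
SECOND_TRY = re.compile(r"[1-9]|0[1-9]|1[0-2]")

# Legal month spellings: 1..12 plus the zero-padded forms 01..09.
LEGAL = [str(m) for m in range(1, 13)] + [f"0{m}" for m in range(1, 10)]
# Illegal inputs named on the slide.
ILLEGAL = ["0", "00", "18"]


def accepts(pattern, text):
    """True when the whole input string matches the pattern."""
    return pattern.fullmatch(text) is not None


first_accepts_all_legal = all(accepts(FIRST_TRY, s) for s in LEGAL)
first_accepts_illegal = [s for s in ILLEGAL if accepts(FIRST_TRY, s)]
second_accepts_all_legal = all(accepts(SECOND_TRY, s) for s in LEGAL)
second_accepts_illegal = [s for s in ILLEGAL if accepts(SECOND_TRY, s)]
```

Using fullmatch rather than match matters here: the question is whether the entire input is a month, not whether it begins with one.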

63
Regular Expression Example
  • Regular Expression for Floating Point Numbers
  • Examples of legal inputs
  • 1.0, 0.2, 3.14159, -1.0, 2.7e8, 1.0E-6
  • Assume that a 0 is required before numbers less
    than 1 and does not prevent extra leading zeros,
    so numbers such as 0011 or 0003.14159 are legal
  • Building the regular expression
  • Assume
  • digit → 0|1|2|3|4|5|6|7|8|9
  • Handle simple decimals such as 1.0, 0.2, 3.14159
  • digit+.digit+
  • Add an optional sign (only minus, no plus)
  • (-|ε)digit+.digit+ or -?digit+.digit+

64
Regular Expression Example
  • Regular Expression for Floating Point Numbers
    (cont.)
  • Building the regular expression (cont.)
  • Format for the exponent
  • (E|e)(+|-)?(digit+)
  • Adding it as an optional expression to the
    decimal part
  • (-|ε)digit+.digit+((E|e)(+|-)?(digit+))?
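
As a rough check, the finished expression can be written in Python's regular expression syntax. This sketch assumes the intended expression is -?digit+.digit+((E|e)(+|-)?digit+)?, which is a reading of the slide rather than something it states exactly; FLOAT and is_float are illustrative names.

```python
import re

# Optional minus sign, digits, a literal dot, digits, then an optional exponent
# consisting of E or e, an optional sign, and digits.
FLOAT = re.compile(r"-?[0-9]+\.[0-9]+((E|e)(\+|-)?[0-9]+)?")


def is_float(text):
    """True when the whole input string is a floating point literal."""
    return FLOAT.fullmatch(text) is not None


# Legal inputs listed on the slide, plus one with extra leading zeros.
legal = ["1.0", "0.2", "3.14159", "-1.0", "2.7e8", "1.0E-6", "0003.14159"]
# Missing leading 0, missing fraction, plus sign, and empty exponent.
illegal = [".5", "1.", "+1.0", "2.7e"]

all_legal_accepted = all(is_float(s) for s in legal)
all_illegal_rejected = not any(is_float(s) for s in illegal)
```

Note that under this reading a decimal point is mandatory, so a bare integer such as 0011 would not match; accepting it would need the fractional part to be made optional.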

65
Extended BNF
  • Extended BNF (EBNF)
  • Combination of BNF and RE
  • N ::= X, where N is a nonterminal symbol and X is
    an extended RE, i.e., an RE constructed from both
    terminal and nonterminal symbols
  • EBNF
  • Right hand side may use |, *, (, )
  • Right hand side may contain both terminal and
    nonterminal symbols

66
Example EBNF
  • Expression ::= primary-Expression (Operator
    primary-Expression)*
  • primary-Expression ::= Identifier
  • | ( Expression )
  • Identifier ::= a | b | c | d | e
  • Operator ::= + | - | * | /
  • Generates
  • e
  • a + b
  • a - b - c
  • a + (b * c)
  • a * (b + c) / d
  • a - (b - (c - (d - e)))

67
Grammar Transformations
  • Left Factorization
  • XY | XZ is equivalent to X(Y | Z)
  • single-Command ::= V-name := Expression
  • | if Expression then single-Command
  • | if Expression then single-Command
  • else single-Command
  • is transformed to
  • single-Command ::= V-name := Expression
  • | if Expression then single-Command
  • (ε | else single-Command)

68
Grammar Transformations
  • Elimination of left recursion
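  • N ::= X | NY is equivalent to N ::= X(Y)*
  • Identifier ::= Letter
  • | Identifier Letter
  • | Identifier Digit
  • is left-factorized to
  • Identifier ::= Letter
  • | Identifier (Letter | Digit)
  • and eliminating the left recursion gives
  • Identifier ::= Letter (Letter | Digit)*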

69
Grammar Transformations
  • Substitution of nonterminal symbols
  • Given N ::= X, we can substitute each occurrence
    of N with X
  • iff N ::= X is nonrecursive and is the only
    production rule for N
  • single-Command ::= for Control-Variable :=
    Expression To-or-Downto
  • Expression do single-Command
  • Control-Variable ::= Identifier
  • To-or-Downto ::= to
  • | downto
  • single-Command ::= for Identifier := Expression
    (to | downto)
  • Expression do single-Command

70
Scanning (Lexical Analysis)
  • The purpose of scanning is to recognize tokens in
    the source program. Or, to group input
    characters (the source program text) into tokens.
  • Difference between parsing and scanning
  • Parsing groups terminal symbols, which are
    tokens, into larger phrases such as expressions
    and commands and analyzes the tokens for
    correctness and structure
  • Scanning groups individual characters into tokens

71
Structure of a Compiler
Lexical Analyzer
Source code
Symbol Table
tokens
Parser Semantic Analyzer
parse tree
Intermediate Code Generation
intermediate representation
Optimization
intermediate representation
Assembly Code Generation
Assembly code
72
Creating Tokens Mini-Triangle Example
[Diagram: an input converter delivers the source text as a character string; the scanner groups the characters into the token stream let, var, Ident. (y), colon, Ident. (Integer), in, Ident. (y), becomes, Ident. (y), op., Intlit. (1), eot]
73
What Does a Scanner Do?
  • Handles keywords (reserved words)
  • Recognizes identifiers and keywords
  • Match explicitly
  • Write regular expression for each keyword
  • Identifier is any alpha numeric string which is
    not a keyword
  • Match as an identifier, perform lookup
  • No special regular expressions for keywords
  • When an identifier is found, perform lookup into
    preloaded keyword table
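
The second strategy (match as an identifier, then look it up) can be sketched as a small hand-written scanner in Python. This is an assumption-laden illustration, not the Triangle scanner: the keyword subset, the token kind names and the function scan are all hypothetical.

```python
# Hypothetical keyword table, loaded once before scanning starts.
KEYWORDS = {"begin", "const", "do", "else", "end", "if", "in",
            "let", "then", "var", "while"}
PUNCTUATION = {":": "colon", ";": "semicolon", ",": "comma"}


def scan(text):
    """Group the characters of text into (kind, spelling) tokens."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():                       # white space is discarded
            i += 1
        elif ch.isalpha():                     # identifier or keyword
            start = i
            while i < len(text) and text[i].isalnum():
                i += 1
            word = text[start:i]
            # No per-keyword pattern: one table lookup decides the kind.
            kind = "keyword" if word in KEYWORDS else "identifier"
            tokens.append((kind, word))
        elif ch.isdigit():                     # integer literal
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            tokens.append(("intliteral", text[start:i]))
        elif text.startswith(":=", i):         # one character of look-ahead
            tokens.append(("becomes", ":="))
            i += 2
        elif ch in PUNCTUATION:
            tokens.append((PUNCTUATION[ch], ch))
            i += 1
        elif ch in "+-*/<>=":
            tokens.append(("operator", ch))
            i += 1
        else:
            raise ValueError(f"illegal character {ch!r} at position {i}")
    tokens.append(("eot", ""))                 # end-of-text marker
    return tokens


tokens = scan("let var y: Integer in y := y + 1")
```

With a table lookup, adding a keyword is a one-line change to the table, at the cost of one extra lookup per identifier.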

How does Triangle handle keywords? Discuss in
terms of efficiency and ease to code.
74
What Does a Scanner Do?
  • Remove white space
  • Tabs, spaces, new lines
  • Remove comments
  • Single line
  • -- Ada comment
  • Multi-line, start and end delimiters
  • { Pascal comment }
  • /* C comment */
  • Nested
  • Runaway comments
  • Nonterminated comments can't be detected till end
    of file

75
What Does a Scanner Do?
  • Perform look ahead
  • Multi-character tokens
  • 1..10 vs. 1.10
  • ,
  • <, <=
  • etc
  • Challenging input languages
  • FORTRAN
  • Keywords not reserved
  • Blanks are not a delimiter
  • Example (comma vs. decimal)
  • DO10I=1,5 start of a do loop (equivalent to a C
    for loop)
  • DO10I=1.5 an assignment statement, assignment to
    variable DO10I

76
What Does a Scanner Do?
  • Challenging input languages (cont.)
  • PL/I, keywords not reserved
  • IF THEN THEN THEN ELSE ELSE ELSE THEN

77
What Does a Scanner Do?
  • Error Handling
  • Error token passed to parser which reports the
    error
  • Recovery
  • Delete characters from current token which have
    been read so far, restart scanning at next unread
    character
  • Delete the first character of the current lexeme
    and resume scanning from the next character.
  • Examples of lexical errors
  • 3.25e bad format for a constant
  • Var1 illegal character
  • Some errors that are not lexical errors
  • Mistyped keywords
  • Begim
  • Mismatched parenthesis
  • Undeclared variables

78
Scanner Implementation
  • Issues
  • Simpler design: parser doesn't have to worry
    about white space, etc.
  • Improve compiler efficiency allows the
    construction of a specialized and potentially
    more efficient processor
  • Compiler portability is enhanced input alphabet
    peculiarities and other device-specific anomalies
    can be restricted to the scanner

79
Scanner Implementation
  • What are the keywords in Triangle?
  • How are keywords and identifiers implemented in
    Triangle?
  • Is look ahead implemented in Triangle?
  • If so, how?

80
Structure of a Compiler
Lexical Analyzer
Source code
Symbol Table
tokens
Semantic Analyzer
Parser
parse tree
Intermediate Code Generation
intermediate representation
Optimization
intermediate representation
Assembly Code Generation
Assembly code
81
Parsing
  • Given an unambiguous, context free grammar,
    parsing is
  • Recognition of an input string, i.e., deciding
    whether or not the input string is a sentence of
    the grammar
  • Parsing of an input string, i.e., recognition of
    the input string plus determination of its phrase
    structure. The phrase structure can be
    represented by a syntax tree, or otherwise.

Unambiguous is necessary so that every sentence
of the grammar will form exactly one syntax tree.
82
Parsing
  • The syntax of programming language constructs is
    described by context-free grammars.
  • Advantages of unambiguous, context-free grammars
  • A precise, yet easy-to understand, syntactic
    specification of the programming language
  • For certain classes of grammars we can
    automatically construct an efficient parser that
    determines if a source program is syntactically
    well formed.
  • Imparts a structure to a programming language
    that is useful for the translation of source
    programs into correct object code and for the
    detection of errors.
  • Easier to add new constructs to the language if
    the implementation is based on a grammatical
    description of the language

83
Parsing
  • Check the syntax (structure) of a program and
    create a tree representation of the program
  • Programming languages have non-regular constructs
  • Nesting
  • Recursion
  • Context-free grammars are used to express the
    syntax for programming languages

84
Context-Free Grammars
  • Comprised of
  • A set of tokens or terminal symbols
  • A set of non-terminal symbols
  • A set of rules or productions which express the
    legal relationships between symbols
  • A start or goal symbol
  • Example
  • expr → expr - digit
  • expr → expr + digit
  • expr → digit
  • digit → 0 | 1 | 2 | ... | 9
  • Tokens -, +, 0, 1, 2, ..., 9
  • Non-terminals expr, digit
  • Start symbol expr

85
Context-Free Grammars
  1. expr → expr - digit
  2. expr → expr + digit
  3. expr → digit
  4. digit → 0 | 1 | 2 | ... | 9

Example input 3 + 8 - 2
86
Checking for Correct Syntax
  • Given a grammar for a language and a program, how
    do you know if the syntax of the program is
    legal?
  • A legal program can be derived from the start
    symbol of the grammar

Grammar must be unambiguous and context-free
87
Deriving a String
  • The derivation begins with the start symbol
  • At each step of a derivation the right hand side
    of a grammar rule is used to replace a
    non-terminal symbol
  • Continue replacing non-terminals until only
    terminal symbols remain

expr ⇒ expr - digit (Rule 1)
⇒ expr - 2 (Rule 4)
⇒ expr + digit - 2 (Rule 2)
⇒ expr + 8 - 2 (Rule 4)
⇒ digit + 8 - 2 (Rule 3)
⇒ 3 + 8 - 2 (Rule 4)
88
Rightmost Derivation
  • The rightmost non-terminal is replaced in each
    step

Rule 4
expr - digit ⇒ expr - 2
Rule 2
expr - 2 ⇒ expr + digit - 2
Rule 4
expr + digit - 2 ⇒ expr + 8 - 2
Rule 3
expr + 8 - 2 ⇒ digit + 8 - 2
Rule 4
digit + 8 - 2 ⇒ 3 + 8 - 2
89
Leftmost Derivation
  • The leftmost non-terminal is replaced in each step

Rule 2
expr - digit ⇒ expr + digit - digit
Rule 3
expr + digit - digit ⇒ digit + digit - digit
Rule 4
digit + digit - digit ⇒ 3 + digit - digit
Rule 4
3 + digit - digit ⇒ 3 + 8 - digit
Rule 4
3 + 8 - digit ⇒ 3 + 8 - 2
90
Leftmost Derivation
  • The leftmost non-terminal is replaced in each step

[Diagram: syntax tree for 3 + 8 - 2, with its nodes numbered in the order in which the leftmost derivation expands them]
Rule 2
expr - digit ⇒ expr + digit - digit
Rule 3
expr + digit - digit ⇒ digit + digit - digit
Rule 4
digit + digit - digit ⇒ 3 + digit - digit
Rule 4
3 + digit - digit ⇒ 3 + 8 - digit
Rule 4
3 + 8 - digit ⇒ 3 + 8 - 2
91
Bottom-Up Parsing
  • Parser examines terminal symbols of the input
    string, in order from left to right
  • Reconstructs the syntax tree from the bottom
    (terminal nodes) up (toward the root node)
  • Bottom-up parsing reduces a string w to the start
    symbol of the grammar.
  • At each reduction step a particular sub-string
    matching the right side of a production is
    replaced by the symbol on the left of that
    production, and if the sub-string is chosen
    correctly at each step, a rightmost derivation is
    traced out in reverse.

92
Bottom-Up Parsing
  • Types of bottom-up parsing algorithms
  • Shift-reduce parsing
  • At each reduction step a particular sub-string
    matching the right side of a production is
    replaced by the symbol on the left of that
    production, and if the sub-string is chosen
    correctly at each step, a rightmost derivation is
    traced out in reverse.
  • LR(k) parsing
  • L is for left-to-right scanning of the input, the
    R is for constructing a right-most derivation in
    reverse, and the k is for the number of input
    symbols of look-ahead that are used in making
    parsing decisions.
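
A shift-reduce parse can be sketched for a small grammar of the form expr → expr - digit | expr + digit | digit, with digit → 0 | ... | 9. The Python below is a hand-coded illustration, not a table-driven LR(k) parser; the function name shift_reduce and the textual form of the recorded reductions are assumptions made for this example.

```python
def shift_reduce(tokens):
    """Parse single-character tokens bottom-up; return the reductions in order."""
    stack, steps = [], []
    remaining = list(tokens)
    while True:
        if stack and stack[-1].isdigit():
            # Handle on top of the stack: a digit terminal.
            steps.append(f"digit -> {stack[-1]}")
            stack[-1] = "digit"
        elif stack[-3:] in (["expr", "+", "digit"], ["expr", "-", "digit"]):
            # Handle: expr followed by an operator and a digit.
            steps.append(f"expr -> expr {stack[-2]} digit")
            stack[-3:] = ["expr"]
        elif stack == ["digit"]:
            # Only the very first digit is reduced directly to expr.
            steps.append("expr -> digit")
            stack[-1:] = ["expr"]
        elif remaining:
            stack.append(remaining.pop(0))      # shift the next input symbol
        else:
            break                               # nothing left to shift or reduce
    if stack != ["expr"]:
        raise SyntaxError(f"cannot reduce {stack} to the start symbol")
    return steps


reductions = shift_reduce(["3", "+", "8", "-", "2"])
```

Read from last to first, the recorded reductions spell out the rightmost derivation of 3 + 8 - 2, which is what tracing a rightmost derivation in reverse means.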

93
Bottom-Up Parsing Example: 3 + 8 - 2
94
Bottom-Up Parsing Example: 3 + 8 - 2
95
Bottom-Up Parsing Example: abbcde
[Diagram: partial syntax tree over the input a b b c d e; the first b is reduced to A]
abbcde ⇒ aAbcde
96
Bottom-Up Parsing Example: abbcde
[Diagram: partial syntax tree; A b c is reduced to A]
aAbcde ⇒ aAde
97
Bottom-Up Parsing Example: abbcde
[Diagram: partial syntax tree; d is reduced to B]
aAde ⇒ aABe
98
Bottom-Up Parsing Example: abbcde
[Diagram: completed syntax tree; a A B e is reduced to S]
aABe ⇒ S
99
Bottom-Up Parsing Examplethe cat sees a rat.
the
cat
sees
a
rat
.
Noun
.
the
cat
sees
a
rat
the cat sees a rat. ? the Noun sees a rat.
Noun
the
cat
sees
a
rat
.
the Noun sees a rat.
100
Bottom-Up Parsing Examplethe cat sees a rat.
Subject
Noun
the
cat
sees
a
rat
.
the Noun sees a rat. ? Subject sees a rat.
Subject
Noun
.
the
cat
sees
a
rat
Subject sees a rat.
101
Bottom-Up Parsing Examplethe cat sees a rat.
Subject
Noun
Verb
.
the
cat
sees
a
rat
Subject sees a rat. ? Subject Verb a rat.
Subject
Noun
Verb
.
the
cat
sees
a
rat
Subject Verb a rat.
102
Bottom-Up Parsing Examplethe cat sees a rat.
Subject
Noun
Noun
Verb
.
the
cat
sees
a
rat
Subject Verb a rat. ? Subject Verb a Noun.
Subject
Noun
Noun
Verb
.
the
cat
sees
a
rat
Subject Verb a Noun.
103
Bottom-Up Parsing Example: the cat sees a rat.
Subject Verb a Noun. → Subject Verb Object.
What would happen if we chose Subject → a Noun
instead of Object → a Noun?
[Figure: a node Object is built above "a Noun".]
104
Bottom-Up Parsing Example: the cat sees a rat.
Subject Verb Object.
[Figure: the root Sentence is built above Subject, Verb, Object and ".", completing the parse tree.]
105
Top-Down Parsing
  • The parser examines the terminal symbols of the
    input string, in order from left to right.
  • The parser reconstructs its syntax tree from the
    top (root node) down (towards the terminal
    nodes).

An attempt to find the leftmost derivation for an
input string
106
Top-Down Parsers
  • General rules for top-down parsers
  • Start with just a stub for the root node
  • At each step the parser takes the leftmost stub
  • If the stub is labeled by terminal symbol t, the
    parser connects it to the next input terminal
    symbol, which must be t. (If not, the parser has
    detected a syntactic error.)
  • If the stub is labeled by nonterminal symbol N,
    the parser chooses one of the production rules
    N ::= X1 … Xn, and grows branches from the node
    labeled by N to new stubs labeled X1, …, Xn (in
    order from left to right).
  • Parsing succeeds when and if the whole input
    string is connected up to the syntax tree.

107
Top-Down Parsing
  • Two forms
  • Backtracking parsers
  • Guess which rule to apply, then back up and
    change the choice if they cannot proceed
  • Predictive parsers
  • Predict which rule to apply by using look-ahead
    tokens

Backtracking parsers are not very efficient. We
will cover predictive parsers.
108
Predictive Parsers
  • Many types
  • LL(1) parsing
  • The first L is for scanning the input from left
    to right, the second L is for producing a
    leftmost derivation, and the 1 is for using one
    input symbol of look-ahead
  • Table driven, with an explicit stack to maintain
    the parse tree
  • Recursive-descent parsing
  • Uses recursive subroutines to traverse the parse
    tree
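
To make the table-driven variant concrete, here is a minimal sketch of an LL(1) recognizer with an explicit stack. It assumes the Micro-English grammar used later in these slides (Sentence, Subject, Object, Noun, Verb). The class name TableDrivenLL1, the helper rule and the "<eof>" marker are made up for illustration. The sketch only accepts or rejects the input and does not build a parse tree.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TableDrivenLL1 {
    // Parse table: nonterminal -> (look-ahead token -> right side to expand to).
    static final Map<String, Map<String, List<String>>> TABLE = new HashMap<>();

    // Hypothetical helper that fills one table cell.
    static void rule(String nonterminal, String lookahead, String... rightSide) {
        TABLE.computeIfAbsent(nonterminal, k -> new HashMap<>())
             .put(lookahead, List.of(rightSide));
    }

    static {
        for (String t : List.of("I", "a", "the")) {
            rule("Sentence", t, "Subject", "Verb", "Object", ".");
        }
        rule("Subject", "I", "I");
        rule("Subject", "a", "a", "Noun");
        rule("Subject", "the", "the", "Noun");
        rule("Object", "me", "me");
        rule("Object", "a", "a", "Noun");
        rule("Object", "the", "the", "Noun");
        for (String n : List.of("cat", "mat", "rat")) rule("Noun", n, n);
        for (String v : List.of("like", "is", "see", "sees")) rule("Verb", v, v);
    }

    static boolean parse(List<String> tokens) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("Sentence"); // start symbol
        int pos = 0;
        while (!stack.isEmpty()) {
            String top = stack.pop();
            String lookahead = pos < tokens.size() ? tokens.get(pos) : "<eof>";
            Map<String, List<String>> row = TABLE.get(top);
            if (row == null) {
                // top is a terminal: it must match the look-ahead token
                if (!top.equals(lookahead)) return false;
                pos++;
            } else {
                // top is a nonterminal: the look-ahead selects the rule
                List<String> rightSide = row.get(lookahead);
                if (rightSide == null) return false;
                // push in reverse so the leftmost symbol ends up on top
                for (int i = rightSide.size() - 1; i >= 0; i--) {
                    stack.push(rightSide.get(i));
                }
            }
        }
        return pos == tokens.size(); // all input must be consumed
    }

    public static void main(String[] args) {
        System.out.println(parse(List.of("the", "cat", "sees", "a", "rat", ".")));
    }
}
```

One token of look-ahead is enough here because, for every nonterminal, each alternative starts with a different terminal symbol.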

109
Predictive Parsers (Lookahead)
  • Lookahead in predictive parsing
  • The lookahead token (next token in the input) is
    used to determine which rule should be used next
  • For example

[Figure: partial parse tree with nodes term and num above the input token 7.]

110
Predictive Parsers (Lookahead)
[Figure: two snapshots of the growing parse tree, with nodes term and num above the input tokens 7, 3 and -.]
111
Predictive Parsers (Lookahead)
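[Figure: two further snapshots of the parse tree, with nodes term and num above the input tokens 7, 3 and 2; the last snapshot ends with an ε (empty) expansion.]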
112
Recursive-Descent Parsing
  • Top-down parsing algorithm
  • Consists of a group of methods (programs) parseN,
    one for each nonterminal symbol N of the grammar.
  • The task of each method parseN is to parse a
    single N-phrase
  • These parsing methods cooperate to parse complete
    sentences

113
Recursive-Descent Parsing
[Figure: syntax tree for "the cat sees a rat." with root Sentence and four stubs: Subject, Verb, Object and ".".]
  • Decide which production rule to apply. There is
    only one candidate, rule 1.
  • This step creates four stubs.

114
Recursive-Descent Parsing
[Figure: the Sentence tree during the parse of "the cat sees a rat."; a Noun node has been added below Subject.]
115
Recursive-Descent Parsing
[Figure: the same tree at the next parsing step.]
116
Recursive-Descent Parsing
[Figure: the same tree at the next parsing step.]
117
Recursive-Descent Parsing
[Figure: the Sentence tree during the parse of "the cat sees a rat."; a second Noun node has been added below Object.]
118
Recursive-Descent Parsing
[Figure: the same tree at the next parsing step.]
119
Recursive-Descent Parsing
[Figure: the same tree at the next parsing step.]
120
Recursive-Descent Parser for Micro-English
  1. Sentence → Subject Verb Object .
  2. Subject → I | a Noun | the Noun
  3. Object → me | a Noun | the Noun
  4. Noun → cat | mat | rat
  5. Verb → like | is | see | sees
  • parseSentence
  • parseSubject
  • parseObject
  • parseVerb
  • parseNoun

121
Recursive-Descent Parser for Micro-English
  1. Sentence → Subject Verb Object .
  2. Subject → I | a Noun | the Noun
  3. Object → me | a Noun | the Noun
  4. Noun → cat | mat | rat
  5. Verb → like | is | see | sees
  • parseSentence
      • parseSubject
      • parseVerb
      • parseObject
      • parseEnd

[Figure: syntax diagram for Sentence → Subject Verb Object .]
122
Recursive-Descent Parser for Micro-English
  1. Sentence → Subject Verb Object .
  2. Subject → I | a Noun | the Noun
  3. Object → me | a Noun | the Noun
  4. Noun → cat | mat | rat
  5. Verb → like | is | see | sees

  • parseSubject
      • if input = I
          • accept
      • else if input = a
          • accept
          • parseNoun
      • else if input = the
          • accept
          • parseNoun
      • else error

[Figure: syntax diagram for Subject → I | a Noun | the Noun]
123
Recursive-Descent Parser for Micro-English
  1. Sentence → Subject Verb Object .
  2. Subject → I | a Noun | the Noun
  3. Object → me | a Noun | the Noun
  4. Noun → cat | mat | rat
  5. Verb → like | is | see | sees
  • parseNoun
      • if input = cat
          • accept
      • else if input = mat
          • accept
      • else if input = rat
          • accept
      • else error

[Figure: syntax diagram for Noun → cat | mat | rat]
124
Recursive-Descent Parser for Micro-English
  • parseObject
      • if input = me
          • accept
      • else if input = a
          • accept
          • parseNoun
      • else if input = the
          • accept
          • parseNoun
      • else error
  1. Sentence → Subject Verb Object .
  2. Subject → I | a Noun | the Noun
  3. Object → me | a Noun | the Noun
  4. Noun → cat | mat | rat
  5. Verb → like | is | see | sees

[Figure: syntax diagram for Object → me | a Noun | the Noun]
125
Recursive-Descent Parser for Micro-English
  • parseVerb
      • if input = like
          • accept
      • else if input = is
          • accept
      • else if input = see
          • accept
      • else if input = sees
          • accept
      • else error

  1. Sentence → Subject Verb Object .
  2. Subject → I | a Noun | the Noun
  3. Object → me | a Noun | the Noun
  4. Noun → cat | mat | rat
  5. Verb → like | is | see | sees

[Figure: syntax diagram for Verb → like | is | see | sees]
126
Recursive-Descent Parser for Micro-English
  • parseEnd
      • if input = .
          • accept
      • else error
  1. Sentence → Subject Verb Object .
  2. Subject → I | a Noun | the Noun
  3. Object → me | a Noun | the Noun
  4. Noun → cat | mat | rat
  5. Verb → like | is | see | sees

[Figure: syntax diagram for the terminating "."]
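
The pseudocode methods on the preceding slides can be put together into one small Java class. The following is a sketch rather than the course's reference solution: the token list stands in for a scanner, the class name MicroEnglishParser, the helper isSentence and the end-of-text marker are assumptions made for illustration, and parseEnd from the slides is folded into a call of accept(".").

```java
import java.util.Arrays;
import java.util.List;

class MicroEnglishParser {
    private static final String END = "<end of text>"; // hypothetical end marker

    private final List<String> tokens; // stands in for the scanner
    private int next = 0;
    private String currentToken;

    MicroEnglishParser(List<String> tokens) {
        this.tokens = tokens;
        this.currentToken = fetch(); // load the first token
    }

    private String fetch() {
        return next < tokens.size() ? tokens.get(next++) : END;
    }

    private IllegalStateException error(String expected) {
        return new IllegalStateException(
                "expected " + expected + " but found " + currentToken);
    }

    // Checks that the current token is the expected one, then advances.
    private void accept(String expected) {
        if (!currentToken.equals(expected)) throw error(expected);
        currentToken = fetch();
    }

    // Advances without checking; the caller has already inspected the token.
    private void acceptIt() {
        currentToken = fetch();
    }

    // Sentence -> Subject Verb Object .
    private void parseSentence() {
        parseSubject();
        parseVerb();
        parseObject();
        accept(".");
    }

    // Subject -> I | a Noun | the Noun
    private void parseSubject() {
        switch (currentToken) {
            case "I": acceptIt(); break;
            case "a":
            case "the": acceptIt(); parseNoun(); break;
            default: throw error("a Subject");
        }
    }

    // Object -> me | a Noun | the Noun
    private void parseObject() {
        switch (currentToken) {
            case "me": acceptIt(); break;
            case "a":
            case "the": acceptIt(); parseNoun(); break;
            default: throw error("an Object");
        }
    }

    // Noun -> cat | mat | rat
    private void parseNoun() {
        switch (currentToken) {
            case "cat": case "mat": case "rat": acceptIt(); break;
            default: throw error("a Noun");
        }
    }

    // Verb -> like | is | see | sees
    private void parseVerb() {
        switch (currentToken) {
            case "like": case "is": case "see": case "sees": acceptIt(); break;
            default: throw error("a Verb");
        }
    }

    void parse() {
        parseSentence();
        if (!currentToken.equals(END)) throw error("end of text");
    }

    // Convenience wrapper: true if the words form a Micro-English sentence.
    static boolean isSentence(String... words) {
        try {
            new MicroEnglishParser(Arrays.asList(words)).parse();
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSentence("the", "cat", "sees", "a", "rat", "."));
    }
}
```

Each parsing method looks only at currentToken to choose an alternative, which is what makes the parser predictive.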
127
Systematic Development of a Recursive-Descent
Parser
  • Given a (suitable) context-free grammar
  • Express the grammar in EBNF, with a single
    production rule for each nonterminal symbol, and
    perform any necessary grammar transformations
  • Always eliminate left recursion
  • Always left-factorize whenever possible
  • Transcribe each EBNF production rule N ::= X to a
    parsing method parseN, whose body is determined
    by X
  • Make the parser consist of
  • A private variable currentToken
  • Private parsing methods developed in previous
    step
  • Private auxiliary methods accept and acceptIt,
    both of which call the scanner
  • A public parse method that calls parseS (where S
    is the start symbol of the grammar), having first
    called the scanner to store the first input token
    in currentToken

128
Quote of the Week
  • "C makes it easy to shoot yourself in the foot;
    C++ makes it harder, but when you do, it blows
    away your whole leg."
  • Bjarne Stroustrup

129
Quote of the Week
  • Did you really say that?
  • Dr. Bjarne Stroustrup
  • Yes, I did say something along the lines of "C
    makes it easy to shoot yourself in the foot; C++
    makes it harder, but when you do, it blows your
    whole leg off." What people tend to miss is that
    what I said about C++ is to a varying extent true
    for all powerful languages. As you protect people
    from simple dangers, they get themselves into new
    and less obvious problems. Someone who avoids
    the simple problems may simply be heading for a
    not-so-simple one. One problem with very
    supporting and protective environments is that
    the hard problems may be discovered too late or
    be too hard to remedy once discovered. Also, a
    rare problem is harder to find than a frequent
    one because you don't suspect it.
  • I also said, "Within C++, there is a much smaller
    and cleaner language struggling to get out." For
    example, that quote can be found on page 207 of
    The Design and Evolution of C++. And no, that
    smaller and cleaner language is not Java or C#.
    The quote occurs in a section entitled "Beyond
    Files and Syntax". I was pointing out that the
    C++ semantics is much cleaner than its syntax. I
    was thinking of programming styles, libraries and
    programming environments that emphasized the
    cleaner and more effective practices over archaic
    uses focused on the low-level aspects of C.

130
Converting EBNF Production Rules to Parsing
Methods
  • For production rule N ::= X
  • Convert the production rule to a parsing method
    named parseN
  • private void parseN() { parse X }
  • Refine parse ε to a dummy statement
  • Refine parse t (where t is a terminal symbol) to
    accept(t) or acceptIt()
  • Refine parse N (where N is a nonterminal symbol)
    to a call of the corresponding parsing method
  • parseN()
  • Refine parse X Y to
  • parse X
  • parse Y
  • Refine parse X | Y to
  • switch (currentToken.kind)
  • cases in starters(X): parse X; break
  • cases in starters(Y): parse Y; break
  • default: report a syntactic error

131
Converting EBNF Production Rules to Parsing
Methods
  • For X | Y
  • Choose parse X only if the current token is one
    that can start an X-phrase
  • Choose parse Y only if the current token is one
    that can start a Y-phrase
  • starters(X) and starters(Y) must be disjoint
  • For X*
  • Repeat parse X while the current token is one
    that can start an X-phrase
  • while (currentToken.kind is in starters(X))
    parse X
  • starters(X) must be disjoint from the set of
    tokens that can follow X* in this particular
    context
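
As an illustration of these refinement rules, the sketch below parses and evaluates a tiny expression language. Everything about it is an assumption chosen for the example and not part of the course material: the grammar Expression ::= Term ('-' Term)* and Term ::= num | '(' Expression ')', the class name TinyExpressionParser, and the simplification that every character is one token, so num is a single digit. The repetition becomes a while loop guarded by the starter set {'-'}. The alternatives in Term are chosen by inspecting the current token, written here as an if/else chain in place of a switch.

```java
class TinyExpressionParser {
    private final String input; // each character is one token (simplification)
    private int pos = 0;

    TinyExpressionParser(String input) {
        this.input = input;
    }

    // Current token, or '$' as a hypothetical end-of-input marker.
    private char current() {
        return pos < input.length() ? input.charAt(pos) : '$';
    }

    private void accept(char expected) {
        if (current() != expected) {
            throw new IllegalArgumentException(
                    "expected '" + expected + "' at position " + pos);
        }
        pos++;
    }

    private void acceptIt() {
        pos++;
    }

    // Expression ::= Term ('-' Term)*
    private int parseExpression() {
        int value = parseTerm();
        // X* becomes a loop: continue while the current token can start X.
        while (current() == '-') {
            acceptIt();
            value -= parseTerm(); // subtraction groups to the left
        }
        return value;
    }

    // Term ::= num | '(' Expression ')'
    private int parseTerm() {
        char c = current();
        if (Character.isDigit(c)) {          // starters of num
            acceptIt();
            return c - '0';
        } else if (c == '(') {               // starters of '(' Expression ')'
            acceptIt();
            int value = parseExpression();
            accept(')');
            return value;
        } else {
            throw new IllegalArgumentException(
                    "unexpected token at position " + pos);
        }
    }

    static int evaluate(String text) {
        TinyExpressionParser parser = new TinyExpressionParser(text);
        int value = parser.parseExpression();
        if (parser.pos != text.length()) {
            throw new IllegalArgumentException(
                    "unexpected token at position " + parser.pos);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(evaluate("7-3-2"));
    }
}
```

The loop is safe because '-' cannot follow an Expression in this grammar; only ')' or the end of the input can. That is the disjointness condition stated above.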

132
Converting EBNF Production Rules to Parsing
Methods
  • A grammar that satisfies both these conditions is
    called an LL(1) grammar
  • Recursive-descent parsing is suitable only for
    LL(1) grammars

133
Error Repair
  • Good programming languages are designed with a
    relatively large distance between syntactically
    correct programs, to increase the likelihood that
    conceptual mistakes are caught as syntactic
    errors.
  • Error repair usually occurs at two levels
  • Local: repairs mistakes with little global
    import, such as missing semicolons and undeclared
    variables.
  • Scope: repairs the program text so that scopes
    are correct. Errors of this kind include
    unbalanced parentheses and begin/end blocks.

134
Error Repair
  • Repair actions can be divided into insertions and
    deletions. Typically the compiler will use some
    look ahead and backtracking in attempting to make
    progress in the parse. There is great variation
    among compilers, though some languages (PL/C)
    carry a tradition of good error repair. Goals of
    error repair are
  • No input should cause the compiler to collapse
  • Illegal constructs are flagged
  • Frequently occurring errors are repaired
    gracefully
  • Minimal stuttering or cascading of errors.

LL-style parsing lends itself well to error
repair, since the compiler uses the grammar's
rules to predict what should occur next in the
input.
135
Mini-Triangle Production Rules
Program     ::= Command                                  Program (1.14)
Command     ::= V-name := Expression                     AssignCommand (1.15a)
              | Identifier ( Expression )                CallCommand (1.15b)
              | Command ; Command                        SequentialCommand (1.15c)
              | if Expression then Command else Command  IfCommand (1.15d)
              | while Expression do Command              WhileCommand (1.15e)
              | let Declaration in Command               LetCommand (1.15f)
Expression  ::= Integer-Literal                          IntegerExpression (1.16a)
              | V-name                                   VnameExpression (1.16b)
              | Operator Expression                      UnaryExpression (1.16c)
              | Expression Operator Expression           BinaryExpression (1.16d)
V-name      ::= Identifier                               SimpleVname (1.17)
Declaration ::= const Identifier ~ Expression            ConstDeclaration (1.18a)
              | var Identifier : Type-denoter            VarDeclaration (1.18b)

136
Abstract Syntax Trees
  • An explicit representation of the source
    program's phrase structure
  • AST for Mini-Triangle

137
Abstract Syntax Trees
  • Program ASTs (P)

Program ::= Command                      Program (1.14)
[Figure: Program AST with a single subtree C.]
  • Command ASTs (C)

Command ::= V-name := Expression         AssignCommand (1.15a)
          | Identifier ( Expression )    CallCommand (1.15b)
          | Command ; Command            SequentialCommand (1.15c)
[Figure: AssignCommand with subtrees V and E (1.15a); CallCommand with subtrees Identifier (a leaf holding its spelling) and E (1.15b); SequentialCommand with subtrees C1 and C2 (1.15c).]
138
Abstract Syntax Trees
  • Command ASTs (C)

Command ::= if Expression then Command else Command  IfCommand (1.15d)
          | while Expression do Command              WhileCommand (1.15e)
          | let Declaration in Command               LetCommand (1.15f)
[Figure: IfCommand with subtrees E, C1 and C2 (1.15d); WhileCommand with subtrees E and C (1.15e); LetCommand with subtrees D and C (1.15f).]
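
One way to picture these ASTs in Java is as one small class per production rule label, with one field per subtree. The sketch below covers only a few of the Mini-Triangle forms. The container class MiniTriangleAst, the show method and the field names are illustrative assumptions and are not taken from the Triangle compiler's class diagrams.

```java
final class MiniTriangleAst {
    private MiniTriangleAst() {}

    abstract static class Node { abstract String show(); }
    abstract static class Command extends Node {}
    abstract static class Expression extends Node {}

    static final class SimpleVname extends Node {               // (1.17)
        final String spelling;
        SimpleVname(String spelling) { this.spelling = spelling; }
        @Override String show() { return "SimpleVname(" + spelling + ")"; }
    }

    static final class IntegerExpression extends Expression {   // (1.16a)
        final String spelling;
        IntegerExpression(String spelling) { this.spelling = spelling; }
        @Override String show() { return "IntegerExpression(" + spelling + ")"; }
    }

    static final class VnameExpression extends Expression {     // (1.16b)
        final SimpleVname v;
        VnameExpression(SimpleVname v) { this.v = v; }
        @Override String show() { return "VnameExpression(" + v.show() + ")"; }
    }

    static final class BinaryExpression extends Expression {    // (1.16d)
        final Expression e1;
        final String operator;
        final Expression e2;
        BinaryExpression(Expression e1, String operator, Expression e2) {
            this.e1 = e1; this.operator = operator; this.e2 = e2;
        }
        @Override String show() {
            return "BinaryExpression(" + e1.show() + ", " + operator + ", " + e2.show() + ")";
        }
    }

    static final class AssignCommand extends Command {          // (1.15a)
        final SimpleVname v;
        final Expression e;
        AssignCommand(SimpleVname v, Expression e) { this.v = v; this.e = e; }
        @Override String show() {
            return "AssignCommand(" + v.show() + ", " + e.show() + ")";
        }
    }

    static final class WhileCommand extends Command {           // (1.15e)
        final Expression e;
        final Command c;
        WhileCommand(Expression e, Command c) { this.e = e; this.c = c; }
        @Override String show() {
            return "WhileCommand(" + e.show() + ", " + c.show() + ")";
        }
    }

    // AST for the command  n := n + 1
    static Command example() {
        SimpleVname n = new SimpleVname("n");
        return new AssignCommand(
                n,
                new BinaryExpression(new VnameExpression(n), "+", new IntegerExpression("1")));
    }

    public static void main(String[] args) {
        System.out.println(example().show());
    }
}
```

The tree keeps only the phrase structure. Tokens such as := carry no information once the node type is known, so they do not appear as subtrees.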
139
Midterm Review Chapter 1
  • Context-free Grammar
  • A finite set of terminal symbols
  • A finite set of non-terminal symbols
  • A start symbol
  • A finite set of production rules
  • Aspects of a programming language that need to be
    specified
  • Syntax: form of programs
  • Contextual constraints: scope rules and type
    rules
  • Semantics: meaning of programs

140
Midterm Review Chapter 1
  • Language specification
  • Informal: written in English
  • Formal: precise notation (BNF, EBNF)
  • Unambiguous
  • Consistent
  • Complete
  • Context-free language
  • Syntax tree
  • Phrase
  • Sentence

141
Midterm Review Chapter 1
  • Syntax tree
  • Terminal nodes are labeled by terminal symbols
  • Non-terminal nodes are labeled by non-terminal
    symbols
  • Abstract Syntax Tree (AST)
  • Each non-terminal node is labeled by a production
    rule
  • Each non-terminal node has exactly one subtree
    for each subphrase
  • Does not generate sentences

142
Midterm Review Chapter 2
  • Translator
  • Accepts any text expressed in one language
    (source language) and generates a
    semantically-equivalent text expressed in another
    language (target language)
  • Compiler
  • Translates from high-level language into
    low-level language
  • Interpreter
  • A program that accepts any program (source
    program) expressed in a particular language
    (source language) and runs that source program
    immediately

143
Midterm Review Chapter 2
  • Interpretive compiler
  • Combination of compiler and interpreter
  • Some of the advantages of each
  • Portable compiler
  • Compiled and run on any machine, without change
  • Portability measured by proportion of