Title: Compilers: Principles, Techniques, and Tools
1Compilers: Principles, Techniques, and Tools
- Jing-Shin Chang
- Department of Computer Science & Information Engineering
- National Chi-Nan University
2Goals
- What is a Compiler? Why? Applications?
- How to Write a Compiler by Hand?
- Theories and Principles behind compiler construction
- Parsing, Translation & Compiling
- Techniques for Efficient Parsing
- How to Write a Compiler with Tools
3Table of Contents
- 1. Introduction: What, Why & Apps
- 2. How A Simple Compiler
- - What is A Better Typical Compiler
- 3. Lexical Analysis
- - Regular Expression and Scanner
- 4. Syntax Analysis
- - Grammars and Parsing
- 5. Top-Down Parsing LL(1)
- 6. Bottom-Up Parsing LR(1)
4Table of Contents
- 7. Syntax-Directed Translation
- 8. Semantic Processing
- 9. Symbol Tables
- 10. Run-time Storage Organization
5Table of Contents
- 11. Translation of Special Structures
- . Modular Program Structures
- . Declarations
- . Expressions and Data Structure References
- . Control Structures
- . Procedures and Functions
- 12. General Translation Scheme
- - Attribute Grammars
6Table of Contents
- 13. Code Generation
- 14. Global Optimization
- 15. Tools Compiler Compiler
7What is A Compiler?
- Functional blocks
- Forms of compilers
8The Compiler
- What is a compiler?
- A program for translating programming languages into machine languages
- source language => target language
- Why compilers?
- Filling the gaps between a programmer and the computer hardware
9Compiler: A Bridge Between PL and Hardware
Applications (High Level Language)
A = B + C * D
Compiler
Operating System
Assembly Codes:
MOV A, C
MUL A, D
ADD A, B
MOV va, A
Hardware (Low Level Language)
Register-based or Stack-based machines
10Typical Machine Instructions: Register-based Machines
Registers of an Intel 8085 processor: A, B, C, D, E, H, L
- Data Transfer
- MOV A, B
- MOV A, mem
- More: IN/OUT, Push, Pop, ...
- Arithmetic Operation
- ADD A, C // A = A + C
- MUL A, D // A = A * D
- More: ADC, SUB, SBB, INC
- Logical Operation
- AND A, 00001111B // A = A & 00001111B
- More: OR, NOT, XOR, Shift, Rotate
- Program Control
- JMP, JZ, JNZ, Call, ...
- Low Level Instructions: Features
- Mostly Simple Binary Operators (using source & target operands)
11Typical Machine Instructions: Stack-based Machines
(Figure: a stack, with SP pointing at the top element and SP-1 at the one below it)
- Data Transfer
- Push A // SP++; (SP) = A
- Push mem // SP++; (SP) = mem
- Dup // (SP+1) = (SP); SP++
- Pop mem // mem = (SP); SP--
- Arithmetic Operation
- ADD // (SP-1) = (SP) + (SP-1); SP--
- MUL // (SP-1) = (SP) x (SP-1); SP--
- Logical Operation
- Program Control
- Low Level Instructions: Features
- Mostly Simple Binary Operators
- Operations are applied to the topmost 2 source operands
- return results to new stack top (destination operand)
- Almost no general purpose registers
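The stack-machine instructions above can be sketched as a tiny interpreter. This is only an illustration: the function name `run_stack_machine`, the tuple encoding of instructions, and the dictionary used as memory are assumptions made for the example, not part of any real machine.

```python
def run_stack_machine(program, memory):
    """Execute (opcode, operand) pairs on a simple stack machine."""
    stack = []
    for op, arg in program:
        if op == "PUSH":        # push the value stored under a memory name
            stack.append(memory[arg])
        elif op == "DUP":       # duplicate the stack top
            stack.append(stack[-1])
        elif op == "POP":       # store the stack top into memory and drop it
            memory[arg] = stack.pop()
        elif op == "ADD":       # replace the two topmost operands by their sum
            top = stack.pop()
            stack[-1] = top + stack[-1]
        elif op == "MUL":       # replace the two topmost operands by their product
            top = stack.pop()
            stack[-1] = top * stack[-1]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return memory


# Stack code for A = B + C * D
program = [
    ("PUSH", "B"), ("PUSH", "C"), ("PUSH", "D"),
    ("MUL", None), ("ADD", None), ("POP", "A"),
]
memory = run_stack_machine(program, {"B": 2, "C": 3, "D": 4})
print(memory["A"])  # 14
```

Note that no register names appear in the program: every operator finds its operands on the stack.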
12Compiler (1) - Compilation
Source Program/Code (P.L., Formal Spec.): A = B + C * D
Compiler
Error Message
Target Program/Code (P.L., Assembly, Machine Code):
MOV A, C
MUL A, D
ADD A, B
MOV va, A
13Machine Independent Intermediate Instructions
- Low Level Instructions: Features
- Mostly Simple Binary Operators
- Result is often saved to the Accumulator (A register)
- Not intuitive to programmers
- Intermediate instructions
- 3-address codes (for register-based machines)
- A = B op C
- 2 source operands, one destination operand
- Easy to map to machine instructions (share one source & destination operand)
- A = A op B
- Stack machine codes (for stack-based machines)
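As a rough sketch of how 3-address codes can be produced, the following walks an expression tree bottom-up and emits one instruction per operator. The tuple encoding of the tree and the helper names `gen_three_address` and `compile_assignment` are assumptions made for this example.

```python
import itertools


def gen_three_address(expr, code, temps):
    """Return the name holding expr's value, appending instructions to code.

    expr is either a variable name (str) or a tuple (op, left, right).
    """
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    left_name = gen_three_address(left, code, temps)
    right_name = gen_three_address(right, code, temps)
    temp = f"T{next(temps)}"                      # fresh temporary
    code.append(f"{temp} = {left_name} {op} {right_name}")
    return temp


def compile_assignment(target, expr):
    """Translate 'target = expr' into a list of 3-address instructions."""
    code = []
    result = gen_three_address(expr, code, itertools.count(1))
    code.append(f"{target} = {result}")
    return code


# A = B + C * D
tac = compile_assignment("A", ("+", "B", ("*", "C", "D")))
print("\n".join(tac))
```

Each emitted line has at most two source operands and one destination, which is what makes the later mapping to `MOV`/`MUL`/`ADD` style instructions straightforward.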
14Compiler: A Bridge Between PL and Hardware
Applications (High Level Language)
A = B + C * D
Compiler
Intermediate Codes:
T1 = C * D
T2 = B + T1
A = T2
Operating System
Assembly Codes:
MOV A, C
MUL A, D
ADD A, B
MOV va, A
Hardware (Low Level Language)
Register-based or Stack-based machines
15Compiler (1) - Compilation
Source Program/Code (P.L., Formal Spec.): A = B + C * D
Compiler
Intermediate Codes:
T1 = C * D
T2 = B + T1
A = T2
Error Message
Target Program/Code (P.L., Assembly, Machine Code):
MOV A, C
MUL A, D
ADD A, B
MOV va, A
16Compiler (2a) Execution
Running the compiled codes
Input
Output
Target Code
(in Real Machine)
Target code (compiled)
Loader
(load into Real Machine)
17Compiler (2b): Compile & Go
Two working phases in two passes
Source Program
Error Message
Compiler
Target Code
Output
Input
(in Real Machine)
- Compiler: Two independent phases to complete the work
- (1) Compilation Phase: Source to Target compilation
- (2) Execution Phase: run compiled codes & respond to input & produce output
18Compiler (2c): Compile & Go
Two working phases in two passes
Source program ( executable Target code)
Compiler (Loader)
Output
Input
(target loaded into Real Machine)
- Compiler: Two independent phases to complete the work
- (1) Compilation Phase: Source to Target compilation
- (2) Execution Phase: run compiled codes & respond to input & produce output
19Interpreter (1)
Source program
Output
Interpreter
Input
Error Message
- Interpreter: One single pass to complete the two-phase work
- Each source statement is compiled and then executed
- The next statement is then handled in the same way
20Interpreter (2)
- Compile and then execute each incoming statement
- Do not save compiled codes in executable files
- Saves storage
- Re-compiles the same statements when the program loops back
- Slower
- Detects (compilation & runtime) errors as each one occurs during execution time
- Compiler: detects syntax/semantic errors (compilation errors) during compilation time
21Hybrid Compiler Interpreter?
Source program
Error Message
Compiler
Intermediate program
Interpreter
Output
Input
(with/without JIT)
22Hybrid Compiler Interpreter?
Source program
- Intermediate program
- without syntax/semantic errors
- machine independent
- Interpreter
- do not interpret high level source
- but compiled low level code
- easy to interpret & efficient
Compiler
Intermediate program
Interpreter
Output
Input
(with/without JIT)
23Hybrid Method Virtual Machine
Source program
Translator
(Compiler)
Intermediate program
Virtual Machine (VM)
Output
Input
(Interpreter with/without JIT)
24Example Java Compiler Java VM
Java program
(app.java)
Java Compiler
(Javac)
(app.class)
Java Bytecodes
Java Virtual Machine
Output
Input
(Interpreter with/without JIT)
25Hybrid Method Virtual Machine
- Compile source program into a platform independent code
- E.g., Java => Bytecodes (stack-based instructions)
- Execute the code with a virtual machine
- High portability: The platform independent code can be distributed on the web, downloaded and executed on any platform that has the VM pre-installed
- Good for cross-platform applications
26Just-in-time (JIT) Compilation
- Compile a new statement (only once) as it comes for the first time
- And save the compiled codes
- Executed by virtual/real machine
- Do not re-compile when the program loops back
- Example
- Java VM (simple Interpreter version, without JIT): high penalty in performance due to interpretation
- Java VM + JIT: improved by the order of a factor of 10
- JIT: translates bytecodes during run time to the native target machine instruction set
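The "compile once, reuse when looping back" idea can be imitated with Python's built-in `compile` and `exec`. This is only an analogy under stated assumptions: the class name `CachingRunner` is hypothetical, and Python's `compile` produces Python bytecode rather than native machine instructions, so the sketch shows the caching policy, not a real JIT.

```python
class CachingRunner:
    """Compiles each distinct statement once and reuses the code object."""

    def __init__(self):
        self.cache = {}          # statement text -> compiled code object
        self.compile_count = 0   # how many times compile() was actually called

    def run(self, stmt, env):
        code = self.cache.get(stmt)
        if code is None:                         # first time this statement is seen
            code = compile(stmt, "<stmt>", "exec")
            self.cache[stmt] = code
            self.compile_count += 1
        exec(code, env)                          # later visits skip compilation


runner = CachingRunner()
env = {"total": 0}
for i in range(5):                               # the loop body "loops back" 5 times
    env["i"] = i
    runner.run("total = total + i", env)

print(env["total"], runner.compile_count)        # 10 1
```

A simple interpreter corresponds to calling `compile` on every iteration; the cache is what turns five compilations into one.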
27Comparison of Different Compilation-and-Go Schemes
- Normal Compilers
- Will generate codes for all statements whether they will be executed or not
- Separate the compilation phase and execution phase into two different phases
- Syntax & semantic errors are detected at compilation time
- Interpreters and JIT Compilers
- Can generate codes only for statements that are really executed
- Will depend on your input: different execution flows mean different sets of executed codes
- Interpreter: Syntax & semantic errors are detected at run/execution time
- JIT vs. Simple Interpreter
- JIT: saves the target machine codes
- Can be re-used, and compiled at most once
- Interpreter: does not save target machine codes
- Compiled more than once
28Register-Based Virtual Machine for Android Phone: Dalvik VM
- Java VM (JVM): Stack-based Instruction Set
- Normally less efficient than RISC or CISC instructions
- Limited memory organization
- Requires too many swap and copy operations
29Register-Based Virtual Machine for Android Phone: Dalvik VM
- Dalvik VM (for Android OS): Register-based Instruction Set
- Smaller size
- Better memory efficiency
- Good for phone and other embedded systems
- Generation and Execution of Dalvik byte codes
- Compiled/Translated from Java byte code into a new byte code
- app.java (Java source)
- javac (Java Compiler) => app.class (executable by JVM)
- dx (in Android SDK tool) => app.dex (Dalvik Executable)
- compression => apps.apk (Android Application Package)
- Dalvik VM => (execution)
30How To Construct A Compiler
- Language Processing Systems
- High-Level and Intermediate Languages
- Processing Phases
- Quick Review on Syntax & Semantics
- Processing Phases in Detail
- Structure of Compilers
31A Language-Processing System
Source Program
Modified Source Program
Compiler
Target Assembly Program
Assembler
Relocatable Machine Code
Target Machine Code
32Programming Languages vs. Natural Languages
- Natural languages: for communication between native speakers of the same or different languages
- Chinese, English, French, Japanese
- Programming languages: for communication between programmers and computers
- Generic High-Level Programming Languages
- Basic, Fortran, COBOL, Pascal, C/C++, Java
- Typesetting Languages
- TROFF (TBL, EQN, PIC), (La)TeX, PostScript
- Markup Language -- Structured Documents
- SGML, HTML, XML, ...
- Script Languages
- Csh, bsh, awk, perl, python, javascript, asp, jsp, php
35Compiler with Intermediate Codes
Source Program/Code (P.L., Formal Spec.): A = B + C * D
Compiler
Intermediate Codes:
T1 = C * D
T2 = B + T1
A = T2
Error Message
Target Program/Code (P.L., Assembly, Machine Code):
MOV A, C
MUL A, D
ADD A, B
MOV va, A
36Typical Phases of a Compiler
Source: float position, initial, rate; position = initial + rate * 60
Tokens
Parse Tree or Syntax Tree
Syntax Tree or Annotated Syntax Tree
3-address codes, or Stack machine codes
Optimized codes
Assembly (or Machine) Codes
37Analysis-Synthesis Model of a Compiler
- Analysis: Program => Constituents => I.R.
- Lexical Analysis: linear => token
- Syntax Analysis: hierarchical, nested => tree
- Identify relations/actions among tokens, e.g., add(b, mult(c,d))
- Semantic Analysis: check legal constraints / meanings
- By examining attributes associated with tokens & relations
- Synthesis: I.R. => I.R. => Target Language
- Intermediate Code Generation
- generate intermediate representation (I.R.) from syntax
- Code Optimization: generate better equivalent IR
- machine independent & machine dependent
- Code Generation
38Typical Modules of a Compiler
Annotated Syntax Tree
40How To Construct A Compiler
- Language Processing Systems
- High-Level and Intermediate Languages
- Processing Phases
- Quick Review on Syntax & Semantics
- Processing Phases in Detail
- Structure of Compilers
41Syntax Analysis: Structure
- Syntax Analysis (Parsing): match input tokens against a grammar of the language
- To ensure that the input tokens form a legal sentence (statement)
- To build the structure representation of the input tokens
- So the structure can be used for translation (or code generation)
- Knowledge source
- Grammar in CFG (Context-Free Grammar) form
- Additional semantic rules for semantic checks and translation (in later phases)
id1 = id2 + id3 * 60
Grammar
Syntax Analysis
S ? id e S ? e ? id t e ? t ? id n t
?
Parse Tree (Concrete syntax tree)
42Grammar Context Free Grammar
43Context Free Grammar (CFG): Specification for Structures & Constituency
- Parse Tree: graphical representation of structure
- root node (S): a sentential level structure
- internal nodes: constituents of the sentence
- arcs: relationship between parent nodes and their children (constituents)
- terminal nodes: surface forms of the input symbols (e.g., words)
- alternative representation: bracketed notation
- e.g., I saw the girl in the park
- Example
44Parse Tree I saw the girl in the park
45CFG: Components
- CFG: formal specification of parse trees
- G = (Σ, N, P, S)
- Σ: terminal symbols
- N: non-terminal symbols
- P: production rules
- S: start symbol
- Σ: terminal symbols
- the input symbols of the language
- programming language: tokens (reserved words, variables, operators, ...)
- natural languages: words or parts of speech
- pre-terminal: parts of speech (when words are regarded as terminals)
- N: non-terminal symbols
- groups of terminals and/or other non-terminals
- S: start symbol: the largest constituent of a parse tree
- P: production (re-writing) rules
- form: α → β (α: non-terminal, β: string of terminals and non-terminals)
- meaning: α re-writes to (consists of, derives into) β, or β is reduced to α
- start with S-productions (S → β)
46CFG: Example Grammar
- Grammar Rules
- S → NP VP
- NP → Pron | Proper-Noun | Det Norm
- Norm → Noun Norm | Noun
- VP → Verb | Verb NP | Verb NP PP | Verb PP
- PP → Prep NP
- S: sentence, NP: noun phrase, VP: verb phrase
- Pron: pronoun
- Det: determiner, Norm: Nominal
- PP: prepositional phrase, Prep: preposition
- Lexicon (in CFG form)
- Noun → girl | park | desk
- Verb → like | want | is | saw | walk
- Prep → by | in | with | for
- Det → the | a | this | these
- Pron → I | you | he | she | him
- Proper-Noun → IBM | Microsoft | Berkeley
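The example grammar and lexicon can be written down as plain data and checked with a small backtracking recognizer. This is a minimal sketch, assuming each production is a choice between alternatives; the names `GRAMMAR`, `LEXICON`, `ends` and `cfg_accepts` are made up for the illustration, and the method only works because this grammar has no left recursion.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Pron"], ["Proper-Noun"], ["Det", "Norm"]],
    "Norm": [["Noun", "Norm"], ["Noun"]],
    "VP": [["Verb"], ["Verb", "NP"], ["Verb", "NP", "PP"], ["Verb", "PP"]],
    "PP": [["Prep", "NP"]],
}
LEXICON = {
    "Noun": {"girl", "park", "desk"},
    "Verb": {"like", "want", "is", "saw", "walk"},
    "Prep": {"by", "in", "with", "for"},
    "Det": {"the", "a", "this", "these"},
    "Pron": {"I", "you", "he", "she", "him"},
    "Proper-Noun": {"IBM", "Microsoft", "Berkeley"},
}


def ends(symbol, words, start):
    """Positions where `symbol` can end if it starts at position `start`."""
    if symbol in LEXICON:                       # pre-terminal: match one word
        if start < len(words) and words[start] in LEXICON[symbol]:
            return {start + 1}
        return set()
    result = set()
    for rhs in GRAMMAR[symbol]:                 # try every alternative
        positions = {start}
        for part in rhs:                        # match the parts left to right
            positions = {e for p in positions for e in ends(part, words, p)}
        result |= positions
    return result


def cfg_accepts(sentence):
    """True if the whole sentence can be derived from S."""
    words = sentence.split()
    return len(words) in ends("S", words, 0)


print(cfg_accepts("I saw the girl in the park"))  # True
print(cfg_accepts("saw the girl I"))              # False
```

The recognizer also accepts "she like IBM": the grammar describes structure only, so agreement errors of this kind are left to semantic checking.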
47Syntax vs. Semantic Analyses
- Syntax
- What do the input tokens look like? Do they form a legal structure?
- Analysis of relationship between elements
- e.g., operator-operands relationship
- Semantic
- What do they mean? And, thus, how do they act?
- Analysis of detailed attributes of elements and check constraints over them under the given syntax
- Not all knowledge between elements can be conveniently represented by a simple syntactic structure. Various kinds of attributes are associated with sub-structures in the given syntax
48Syntax vs. Semantic Analyses
- Examples
- int a, b, c, d; float f; char *s1, *s2
- a = b + c * d
- a = b + f * d // OK, but not strictly right
- a = b + s1 * s2 // BAD: * is undefined for strings
- a = b + s1 * 3 // OK? if properly defined
- All the above statements have the same look
- Convenient to represent them with the same syntactic structure (grammar/production rules)
- But Semantically
- Not all of them are meaningful (?? string * string ??)
- You have to check their other attributes for meanings
- Not all meaningful statements will mean/act the same and have the same codes (int * int ≠ int * float ≠ string * int)
- You have to generate different codes according to other attributes of the tokens, since instructions are limited
- E.g., INT and FLOAT additions may use different machine instructions, like ADD and ADDF respectively.
semantic analyzer
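A semantic check of this kind can be sketched as a function that looks at the type attributes of the two operands and either rejects the operation or picks an instruction. The function name `select_add` and the rule that mixed int/float operands are promoted to float are assumptions for this example; only the instruction names ADD and ADDF come from the slide.

```python
def select_add(left_type, right_type):
    """Choose an addition instruction and result type from operand types."""
    numeric = {"int", "float"}
    if left_type not in numeric or right_type not in numeric:
        # same syntax as a legal addition, but semantically meaningless
        raise TypeError(f"'+' is undefined for {left_type} and {right_type}")
    if left_type == "int" and right_type == "int":
        return "ADD", "int"
    return "ADDF", "float"      # assumed rule: mixed operands become float


print(select_add("int", "int"))     # ('ADD', 'int')
print(select_add("int", "float"))   # ('ADDF', 'float')
```

All three calls share one syntactic shape (operand, operator, operand); only the attributes decide whether code is generated and which instruction is used.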
49Semantic Analysis Attributes
Parse Tree (Concrete Syntax Tree)
Semantic checks abstraction
Semantic Rules Assoc. with Grammar Productions
Semantic Analysis
Syntax Tree (Abstract Syntax Tree)
id1
id2
id3
60
50How To Construct A Compiler
- Language Processing Systems
- High-Level and Intermediate Languages
- Processing Phases
- Quick Review on Syntax & Semantics
- Processing Phases in Detail
- Structure of Compilers
51Symbol Table Management
- Symbols
- Variable names, procedure names, constant literals (3.14159)
- Symbol Table
- A record for each name describing its attributes
- Managing Information about names
- Variable attributes
- Type, register/storage allocated, scope
- Procedure names
- Number and types of arguments
- Method of argument passing
- By value, address, reference
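A symbol table along these lines can be sketched as a stack of scopes, each mapping a name to a record of attributes. The class name `SymbolTable`, its method names, and the attribute keys used below are illustrative assumptions, not a prescribed layout.

```python
class SymbolTable:
    """A chain of scopes; each scope maps a name to its attribute record."""

    def __init__(self):
        self.scopes = [{}]                  # start with the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, **attrs):
        if name in self.scopes[-1]:
            raise KeyError(f"redeclaration of {name}")
        self.scopes[-1][name] = attrs

    def lookup(self, name):
        for scope in reversed(self.scopes):  # innermost declaration wins
            if name in scope:
                return scope[name]
        return None


table = SymbolTable()
table.declare("rate", type="float", storage="R1")
table.declare("scale", kind="procedure", arg_types=["float"], passing="by value")
table.enter_scope()
table.declare("rate", type="int", storage="R2")   # shadows the outer 'rate'
inner = table.lookup("rate")
table.exit_scope()
outer = table.lookup("rate")
print(inner["type"], outer["type"])               # int float
```

Searching from the innermost scope outward is one common way to realize block-structured scoping; other layouts (one hash table with scope tags, for example) are equally possible.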
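52(1) Lexical Analysis: Tokenization
Natural language input: "I saw the girls" / "I see the girls"
Program input: final = initial + rate * 60 / f = i + r * 60
Both look the same. So you want to represent them with the same normalized token string, and hide detailed features as additional attributes.
Lexical Analysis
Normalized tokens: I(1psg) see (ed) the girl (s) / I(1psg) see (prs) the girl (s)
Normalized tokens: id1 = id2 + id3 * 60
Symbol table (natural language): entry, token, surface form, attributes
1  I     I      1psg
2  see   saw    ed
3  the   the
4  girl  girls  3ppl, s
Symbol table (program): entry, token, name, type, storage/value
1  id1     final    float  R2
2  id2     initial  float  R1
3  id3     rate     float
4  const1  60       const  60.0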
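The normalization step for the program example can be sketched with a regular-expression scanner that replaces every identifier by `id1`, `id2`, ... and records the original names on the side. The pattern `TOKEN_RE`, the function `tokenize`, and the restriction to the operators `=`, `+` and `*` are assumptions made for this small example.

```python
import re

# One token per match: a number, an identifier, or one of the operators = + *
TOKEN_RE = re.compile(
    r"\s*(?:(?P<num>\d+(?:\.\d+)?)|(?P<id>[A-Za-z_]\w*)|(?P<op>[=+*]))"
)


def tokenize(source):
    """Return (normalized tokens, symbol table mapping idN -> original name)."""
    tokens, symbols, names = [], {}, {}
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        pos = match.end()
        if match.group("id"):
            name = match.group("id")
            if name not in names:                  # first occurrence: new entry
                names[name] = f"id{len(names) + 1}"
                symbols[names[name]] = name
            tokens.append(names[name])
        else:
            tokens.append(match.group("num") or match.group("op"))
    return tokens, symbols


tokens, symbols = tokenize("final = initial + rate * 60")
print(" ".join(tokens))   # id1 = id2 + id3 * 60
print(symbols)            # {'id1': 'final', 'id2': 'initial', 'id3': 'rate'}
```

Scanning `f = i + r * 60` yields the same token string, which is the point of the slide: the parser sees one shape, and the differing names live in the table.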
53(2) Syntax Analysis: Structure
I see (ed) the girl (s)
id1 = id2 + id3 * 60
Grammar
Syntax Analysis
Normalized tokens have the same parse/syntax tree whether they were see/saw or girl/girls.
Parse Tree (Concrete syntax tree)
54Syntax Analysis Structure
I see (ed) the girl (s)
id1 id2 id3 60
Syntax Analysis
Sentence
NP
verb
NP
I see (ed) the girl (s)
Syntax Tree (Abstract syntax tree)
55Semantic Analysis Attributes
Syntax Tree (Abstract Syntax Tree)
Semantic Analysis
Syntax Tree (Abstract Syntax Tree)
56(3) Semantic Analysis: Attributes
Semantic checks abstraction
Parse Tree (Concrete Syntax Tree)
Semantic Rules Assoc. with Grammar Productions
Semantic Analysis
Sentence
Syntax Tree (Abstract Syntax Tree)
NP.subject
verb
NP.object
I see (ed) the girl (s)
57Semantic Analysis Attributes
Semantic checks abstraction
Parse Tree (Concrete Syntax Tree)
Semantic Rules Assoc. with Grammar Productions
Semantic Analysis
Syntax Tree (Abstract Syntax Tree)
id1
id2
id3
60
61Semantic Checking
- Semantic Constraints
- Agreement (somewhat syntactic)
- Subject-Verb: I have, she has/had, I do have, she does not
- NP: Quantifier-noun: a book, two books
- Selectional Constraint
- Kill → Animate
- Kiss → Animate
semantic checking
See+ed(I, the girls)
(semantically meaningful)
Kill/Kiss (John, the Stone)
(semantically meaningless unless the Stone refers to an animate entity)
62Parse Tree vs. Syntax Tree
- Parse Tree (aka concrete syntax tree)
- Tree: concrete representation drawn according to a grammar
- For validating correctness of syntax of input
- For easy parsing (or fitting constraints of parsing algorithm)
- Normally constructed incrementally during parsing
- Syntax Tree (aka abstract syntax tree)
- Tree: logical representation that characterizes the abstract relationships between constituents
- For representing semantic relationships & semantic checking
- Normalizing various parse trees of the same meaning (semantics)
- May ignore non-essential syntactic details
- Not always the same as parse tree
- May be constructed in parallel with the parse tree during parsing
- Or converted from parse tree after syntactic parsing
- Annotated Syntax Tree (AST)
- Syntax Tree with annotated attributes
63Parse Tree vs. Syntax Tree
- Parse Tree (depends on grammar)
- Input: T + T + T
- G1: T ((+ T) (+ T))
- E → T R
- R → + T R
- R → <null>
- G2: ((T) + T) + T
- E → E + T
- E → T
- Syntax Tree
- Abstract representation for syntax defined by G1/G2
- Use operations as parent nodes and operands as children nodes
- Operation-operand relationship: Easy for instruction selection in code generation (e.g., ADD R1, R2)
(Figures: Parse Tree for G1; Parse Tree for G2; Syntax Tree, independent of G1 or G2)
64(4) Intermediate Code Generation
Attribute evaluation (assembly codes are attributes for code generation)
Action(anim,anim)
see (ed)
anim
anim
subject
object
I the girl (s)
Intermediate Code Generation
logic form
See+ed(I, the girls)
3-address codes
temp1 = i2r(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
65(4) Intermediate Code Generation
Attribute evaluation (assembly codes are attributes for code generation)
Action(anim,anim)
anim
anim
Action
Intermediate Code Generation
logic form
See+ed(I, the girls)
3-address codes
temp1 = i2r(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
66Syntax-Directed Translation (1)
- Translation from input to target can be regarded as attribute evaluation.
- Evaluate attributes of each node, in a well defined order, based on the particular piece of sub-tree structure (syntax) wherein the attributes are to be evaluated.
- Attributes: the particular properties associated with a tree node (a node may have many attributes)
- Abstract representation of the sub-tree rooted at that node
- The attributes of the root node represent the particular properties of the whole input statement or sentence.
- E.g., value associated with a mathematical sub-expression
- E.g., machine codes associated with a sub-expression
- E.g., language translation associated with a sub-sentence
67Syntax-Directed Translation (2)
- Synthesized Attributes
- Attributes that can be evaluated based on the attributes of children nodes
- E.g., value of math. expression can be acquired from the values of sub-expressions (and the operators being applied)
- a = b + c * d
- (→ a.val = b.val + tmp.val where tmp.val = c.val * d.val)
- girls = girl + s
- (→ tr.girls = tr.girl + tr.s ???? ???)
- Inherited Attributes
- Attributes evaluatable from parent and/or sibling nodes
- E.g., data type of a variable can be acquired from its left-hand side type declaration or from the type of its left-hand side sibling
- int a, b, c (→ a.type = INT, b.type = a.type, ...)
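The two kinds of attributes can be sketched in a few lines. In this illustration, `val` computes a synthesized attribute bottom-up from the children, and `assign_types` hands an inherited attribute down from the declaration to each declared name; both function names and the tuple encoding of the tree are assumptions of the example.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}


def val(node, env):
    """Synthesized attribute: a node's value is computed from its children."""
    if isinstance(node, str):                # leaf: look the variable up
        return env[node]
    op, left, right = node
    return OPS[op](val(left, env), val(right, env))


def assign_types(decl_type, names):
    """Inherited attribute: every declared name receives the declaration's type."""
    return {name: decl_type for name in names}


# a = b + c * d, with tmp = c * d as the inner sub-expression
tree = ("+", "b", ("*", "c", "d"))
a_val = val(tree, {"b": 2, "c": 3, "d": 4})
types = assign_types("INT", ["a", "b", "c"])
print(a_val)    # 14
print(types)    # {'a': 'INT', 'b': 'INT', 'c': 'INT'}
```

The value flows upward from leaves to root, while the type flows downward and sideways from the declaration, which is exactly the distinction between the two attribute classes.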
68Syntax-Directed Translation (3)
- Attribute evaluation order
- Any order that evaluates an attribute AFTER all the attributes it depends on have been evaluated will result in correct evaluation.
- General: topological order
- Analyze the dependency between attributes and construct an attribute tree or forest
- Evaluate the attribute of any leaf node, and mark it as evaluated, thus logically removing it from the attribute tree or forest
- Repeat for any leaf nodes that have not been marked, until no unmarked node remains
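The evaluation order described above can be sketched with the standard-library `graphlib.TopologicalSorter` (Python 3.9 or later), which yields every attribute after the attributes it depends on. The dependency table `deps`, the rule table `rules` and the sample leaf values are hypothetical and chosen to mirror a = b + c * d.

```python
from graphlib import TopologicalSorter

# Each attribute maps to the set of attributes it depends on.
deps = {
    "a.val": {"b.val", "tmp.val"},
    "tmp.val": {"c.val", "d.val"},
    "b.val": set(),
    "c.val": set(),
    "d.val": set(),
}

# Evaluation rule for each attribute; v holds the values computed so far.
rules = {
    "b.val": lambda v: 2,
    "c.val": lambda v: 3,
    "d.val": lambda v: 4,
    "tmp.val": lambda v: v["c.val"] * v["d.val"],
    "a.val": lambda v: v["b.val"] + v["tmp.val"],
}

values = {}
order = list(TopologicalSorter(deps).static_order())
for attr in order:                 # dependencies always come first
    values[attr] = rules[attr](values)

print(values["a.val"])             # 14
```

Several orders are valid here (the three leaves may come in any sequence); any of them produces the same attribute values.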
69(5) Code Optimization: Normalization
temp1 = i2r(60)
temp2 = id3 * temp1
temp3 = id2 + temp2
id1 = temp3
Normalization into better equivalent form (optional)
Was_Killed(Bill, John)
See+ed(I, the girls)
Unify passive/active voices
Code Optimization
Killed(John, Bill)
See+ed(I, the girls)
temp1 = id3 * 60.0
id1 = id2 + temp1
70(6) Code Generation
temp1 = id3 * 60.0
id1 = id2 + temp1
See+ed(I, the girls)
Selection of target words & order of phrases
Code Generation
Selection of usable codes & order of codes
Allocation of available registers
movf id3, r2
mulf 60.0, r2
movf id2, r1
addf r2, r1
movf r1, id1
Lexical ?? ? (?, ?? ?)
Structural ? ?? ?? ? ?
71Objectives of Optimizing Compilers
- Correct codes: preserve meaning
- Better performance
- Maximum Execution Efficiency
- Minimum Code Size
- Embedded systems
- Minimizing Power Consumption
- Mobile devices
- Typically, faster execution also implies lower power
- Reasonable compilation time
- Manageable engineering and maintenance efforts
72Optimization for Computer Architectures (1)
- Parallelism
- Instruction level: multiple operations are executed simultaneously
- Processor: checks dependency in sequential instructions, issues them in parallel
- Hardware scheduler: changes the order of instructions
- Compilers: rearrange instructions to make instruction level parallelism more effective
- Instruction set supports
- Very long instruction word (VLIW): issues multiple operations in parallel
- Instructions that can operate on vector data at the same time
- Compilers: generate codes for such machines from sequential codes
- Processor level: different threads of the same application are run on different processors
- Multiprocessors & multithreaded codes
- Programmers write multithreaded codes, vs.
- Compilers generate parallel codes automatically
73Optimization for Computer Architectures (2)
- Memory Hierarchies
- No storage that is both fast and large
- Registers (tens to hundreds of bytes), caches (KB to MB), main/physical memory (MB to GB), secondary/virtual memory (hard disks) (GB to TB)
- Using registers effectively is probably the single most important problem in optimizing a program
- Cache-management by hardware is not effective in scientific code that has large data structures (arrays)
- Improve effectiveness of memory hierarchies
- By changing layout of data, or
- Changing the order of instructions accessing the data
- Improve effectiveness of instruction cache
- Change the layout of codes
74How To Construct A Compiler
- Language Processing Systems
- High-Level and Intermediate Languages
- Processing Phases
- Quick Review on Syntax & Semantics
- Processing Phases in Detail
- Structure of Compilers
75Structure of a Compiler
- Front End: Source Dependent
- Lexical Analysis
- Syntax Analysis
- Semantic Analysis
- Intermediate Code Generation
- (Code Optimization: machine independent)
- Back End: Target Dependent
- Code Optimization
- Target Code Generation
76Structure of a Compiler
Fortran
Pascal
C
Intermediate Code
MIPS
SPARC
Pentium
77History
- 1st Fortran compiler: 1950s
- efficient? (compared with assembly program)
- not bad, but much easier to write programs
- high-level languages are feasible.
- 18 man-years, ad hoc structure
- Today, we can build a simple compiler in a few months.
- Crafting an efficient and reliable compiler is still challenging.
78Cousins of the Compiler
- Preprocessors: macro definition/expansion
- Interpreters
- Compiler vs. interpreter vs. just-in-time compilation
- Assemblers: 1-pass / 2-pass
- Linkers: link compiled object code with library functions
- Loaders: load executables into memory
- Editors: editing sources (with/without syntax prediction)
- Debuggers: symbolically providing stepwise trace
- Profilers: gprof (call graph and time analysis)
- Project managers: IDE
- Integrated Development Environment
- Disassemblers, Decompilers: low-level to high-level language conversion
79Applications of Compilation Techniques
80Applications of Compilation Techniques
- Virtually any kinds of Programming Languages and
Specification Languages with Regular and
Well-defined Grammatical Structures will need a
kind of compiler (or its variant, or a part of
it) to analyze and then process them.
81Applications of Lexical Analysis
- Text/Pattern Processing
- grep: get lines with specified pattern
- Ex: grep From /var/spool/mail/andy
- sed: stream editor, editing specified patterns
- Ex: ls *.JPG | sed s/JPG/jpg/
- tr: simple translation between patterns (e.g., lowercase to uppercase)
- Ex: tr a-z A-Z < mytext > mytext.uc
- AWK: pattern-action rule processing
- pattern processing based on regular expression
- Ex: awk '$1=="John"{count++} END{print count}' < Students.txt
82Applications of Lexical Analysis
- Search Engines/Information Retrieval
- full text search, keyword matching, fuzzy match
- Database Machine
- fast matching over large database
- database filter
- Fast Multiple Matching Algorithms
- Optimized/specialized lexical analyzers (FSA)
- Examples: KMP, Boyer-Moore (BM), ...
83Applications of Syntax Analysis
- Structured Editor/Word Processor
- Integrated Development Environment (IDE)
- automatic formatting, keyword insertion
- Incremental Parser vs. Full-blown Parsing
- incremental: patching the analysis according to incremental changes, instead of re-parsing or re-compiling
- Pretty Printer: beautify nested structures
- cb (C-beautifier)
- indent (an even more versatile C-beautifier)
84Applications Syntax Analysis
- Static Checker/Debugger lint
- check errors without really running, e.g.,
- statement not reachable
- used before defined
85Application of Optimization Techniques
- Data flow analysis
- Software testing
- Locating errors before running (static checking)
- Locate errors along all possible execution paths
- not only on test data set
- Type Checking
- Dereferencing null or freed pointers
- Dangerous user supplied strings
- Bound Checking
- Security vulnerability buffer over-run attack
- Tracking values of pointers across procedures
- Memory management
- Garbage collection
86Applications of Compilation Techniques
- Pre-processor: Macro definition/expansion
- Active Webpages Processing
- Script or programming languages embedded in webpages for interactive transactions
- Examples: JavaScript, JSP, ASP, PHP
- Compiler Apps: expansion of embedded statements, in addition to web page parsing
- Database Query Language: SQL
87Applications of Compilation Techniques
- Interpreter
- no pre-compilation
- executed on-the-fly
- e.g., BASIC
- Script Languages C-shell, Perl
- Function for batch processing multiple files/databases
- mostly interpreted, some pre-compiled
- Some are interpreted and save compiled codes
88Applications of Compilation Techniques
- Text Formatter
- Troff, LaTeX, Eqn, Pic, Tbl
- VLSI Design: Silicon Compiler
- Hardware Description Languages
- variables => control signals / data
- Circuit Synthesis
- Preliminary Circuit Simulation by Software
89Applications of Compilation Techniques
90Advanced Applications
- Natural Language Processing
- advanced search engines: retrieve relevant documents
- more than keyword matching
- natural language query
- information extraction
- acquire relevant information (into structured form)
- text summarization
- get most brief & relevant paragraphs
- text/web mining
- mining information & rules from text/web
91Advanced Applications
- Machine Translation
- Translating a natural language into another
- Models
- Direct translation
- Transfer-Based Model
- Inter-lingua Model
- Transfer-Based Model
- Analysis-Transfer-Generation (or Synthesis) model
92Tools for Compiler Construction
93Tools: Automatic Generation of Lexical Analyzers and Compilers
- Lexical Analyzer Generator: LEX
- Input: Token Pattern specification (in regular expression)
- Output: a lexical analyzer
- Parser Generator: YACC
- compiler-compiler
- Input: Grammar Specification (in context-free grammar)
- Output: a syntax analyzer (aka parser)
94Tools
- Syntax Directed Translation engines
- translations associated with nodes
- translations defined in terms of translations of children
- Automatic code generation
- translation rules
- template matching
- Data flow analyses
- dependency of variables constructs
95Programming Languages
- Issues about Modern PLs
- Module programming Parameter passing
- Nested modules Scopes
- Static dynamic allocation
96Programming Language Basics
- Static vs. Dynamic Issues or Policies
- Static: determined at compile time
- Dynamic: determined at run time
- Scopes of declaration
- Region in which the use of x refers to a declaration of x
- Static Scope (aka lexical scope)
- Possible to determine the scope of declaration by looking at the program
- C, Java (and most PL)
- Delimited by block structures
- Dynamic scope
- At run time, the same use of x could refer to any
of several declarations of x.
97Programming Language Basics
- Variable declaration
- Static variables
- Possible to determine the location in memory where the declared variable can be found
- Public static int x // C
- Only one copy of x, can be determined at compile time
- Global declarations and declared constants can also be made static
- Dynamic variables
- Local variables without the static keyword
- Each object of the class would have its own location where x would be held.
- At run time, the same use of x in different objects could refer to any of several different locations.
98Programming Language Basics
- Parameter Passing Mechanisms
- call by value
- make a copy of physical value
- call by reference
- make a copy of the address of a physical object
- call by name (Algol 60)
- callee executed as if the actual parameter were substituted literally for the formal parameter in the code of the callee
- macro expansion of formal parameter into actual parameter
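The three mechanisms can be imitated in Python as a rough illustration. Python itself passes object references, so the sketch simulates each policy: rebinding a local name stands in for call by value, a one-element list stands in for an address, and a zero-argument function (a thunk) stands in for call by name. All function names here are made up for the example.

```python
def by_value(x):
    x = x + 1            # only the callee's local copy changes
    return x


def by_reference(cell):
    cell[0] += 1         # the caller's storage is modified through its address


def by_name(thunk):
    # the actual parameter is re-evaluated at every use, as in macro expansion
    return thunk() + thunk()


n = 5
by_value(n)              # n is still 5 afterwards

cell = [5]
by_reference(cell)       # cell is now [6]

counter = {"calls": 0}


def next_value():
    counter["calls"] += 1
    return counter["calls"]


total = by_name(lambda: next_value())   # argument evaluated twice: 1 + 2
print(n, cell[0], total)                # 5 6 3
```

The call-by-name result differs from what call by value would give (1 + 1 = 2) because the argument expression has a side effect and is evaluated once per use.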
99Formal Languages
100Languages, Grammars and Recognition Machines
[Figure: a language (example sentence: "I saw a girl in the park") is defined or generated by a grammar (expression) and accepted by a parser (automaton); a parsing table is constructed from the grammar. Example rules: S → NP VP, NP → pron | det n]
101Languages
- Alphabet - any finite set of symbols, e.g., {0, 1}: the binary alphabet
- String - a finite sequence of symbols from an alphabet, e.g., 1011: a string of length 4; ε: the empty string
- Language - any set of strings on an alphabet, e.g., {00, 01, 10, 11}: the set of binary strings of length 2; ∅: the empty set
102Terms for Parts of a String
- string: banana
- (proper) prefix: ε, b, ba, ban, ..., banana
- (proper) suffix: ε, a, na, ana, ..., banana
- (proper) substring: ε, b, a, n, ba, an, na, ..., banana
- the proper forms exclude the whole string banana itself
- subsequence (including non-consecutive ones): ε, b, a, n, ba, bn, an, aa, na, nn, ..., banana
- sentence: a string in the language
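The terms above can be checked mechanically. The following is a minimal sketch with hypothetical helper names; the empty string ε is represented by `""`, and the functions return all prefixes, suffixes, substrings, and subsequences, including the non-proper ones.

```python
from itertools import combinations


def prefixes(s):
    return [s[:i] for i in range(len(s) + 1)]


def suffixes(s):
    return [s[i:] for i in range(len(s), -1, -1)]


def substrings(s):
    # Every contiguous slice s[i:j], including the empty string.
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def subsequences(s):
    # Keep any subset of positions, preserving their order.
    return {"".join(c) for r in range(len(s) + 1) for c in combinations(s, r)}


print(prefixes("banana"))
# ['', 'b', 'ba', 'ban', 'bana', 'banan', 'banana']
print(suffixes("banana"))
# ['', 'a', 'na', 'ana', 'nana', 'anana', 'banana']
print("bn" in substrings("banana"), "bn" in subsequences("banana"))
# False True
```

The last line shows the difference between the two notions: bn skips a symbol, so it is a subsequence of banana but not a substring.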
103Operations on Strings
- concatenation: x = dog, y = house, xy = doghouse
- exponentiation: s^0 = ε, s^1 = s, s^2 = ss
104Operations on Languages
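- Union of L and M, L ∪ M: L ∪ M = { s | s ∈ L or s ∈ M }
- Concatenation of L and M, LM: LM = { st | s ∈ L and t ∈ M }
- Kleene closure of L, L*: L* = L^0 ∪ L^1 ∪ L^2 ∪ ... (zero or more concatenations of L, with L^0 = {ε})
- Positive closure of L, L+: L+ = L^1 ∪ L^2 ∪ L^3 ∪ ... (one or more concatenations of L)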
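The language operations can be tried out on small finite sets of strings. This is a sketch with hypothetical function names; because the closures are infinite in general, `kleene_closure` and `positive_closure` here compute only a finite approximation up to an assumed bound `max_power`.

```python
def union(L, M):
    return L | M


def concat(L, M):
    return {s + t for s in L for t in M}


def power(L, n):
    result = {""}              # L^0 contains only the empty string
    for _ in range(n):
        result = concat(result, L)
    return result


def kleene_closure(L, max_power):
    """Finite approximation: L^0 ∪ L^1 ∪ ... ∪ L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(L, i)
    return result


def positive_closure(L, max_power):
    """Finite approximation: L^1 ∪ ... ∪ L^max_power."""
    result = set()
    for i in range(1, max_power + 1):
        result |= power(L, i)
    return result


L = {"0", "1"}
M = {"a"}
print(sorted(union(L, M)))           # ['0', '1', 'a']
print(sorted(concat(L, L)))          # ['00', '01', '10', '11']
print(sorted(kleene_closure(L, 2)))  # ['', '0', '00', '01', '1', '10', '11']
```

The positive closure differs from the Kleene closure only in the empty string, which it contains only when L itself does.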
105Grammars
- The sentences in a language may be defined by a set of rules called a grammar
- L = {00, 01, 10, 11} (the set of binary strings of length 2)
- G = (0|1)(0|1)
- Languages of different degrees of regularity can be specified with grammars of different expressive power
- Chomsky Hierarchy
- Regular Grammar < Context-Free Grammar < Context-Sensitive Grammar < Unrestricted Grammar
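That the expression (0|1)(0|1) defines exactly the binary strings of length 2 can be checked with Python's standard `re` module. A minimal sketch; the names `PATTERN`, `in_language`, and `candidates` are introduced here for illustration only.

```python
import re
from itertools import product

PATTERN = re.compile(r"(0|1)(0|1)")


def in_language(s):
    # fullmatch requires the whole string to match, not just a prefix.
    return PATTERN.fullmatch(s) is not None


# All binary strings of length 0 to 3, in increasing length.
candidates = ["".join(p) for n in range(4) for p in product("01", repeat=n)]
accepted = [s for s in candidates if in_language(s)]
print(accepted)  # ['00', '01', '10', '11']
```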
106Automata
- An acceptor/recognizer of a language is an automaton which determines if an input string is a sentence in the language
- A transducer of a language is an automaton which determines if an input string is a sentence in the language, and may produce strings as output if it is in the language
- Implementation: state transition functions (parsing table)
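A table-driven acceptor and transducer might look like the following sketch. The states `q0`, `q1`, `q2`, the transition table, and the output words are all assumptions chosen for illustration; the automaton recognizes the binary strings of length 2.

```python
# Transition table: (state, input symbol) -> next state
TRANSITIONS = {
    ("q0", "0"): "q1", ("q0", "1"): "q1",
    ("q1", "0"): "q2", ("q1", "1"): "q2",
}
START = "q0"
ACCEPTING = {"q2"}

# Output emitted by the transducer for each input symbol (hypothetical).
OUTPUT = {"0": "zero", "1": "one"}


def accepts(s):
    state = START
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:          # no transition defined: reject
            return False
    return state in ACCEPTING


def transduce(s):
    """Return the translation of s, or None if s is not in the language."""
    state, out = START, []
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return None
        out.append(OUTPUT[ch])
    return " ".join(out) if state in ACCEPTING else None


print(accepts("10"), accepts("101"))  # True False
print(transduce("10"))                # one zero
```

Both functions share the same transition table; the transducer only adds an output action to each move.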
107Transducer
[Figure: grammar G1 defines/generates language L1 and grammar G2 defines/generates language L2; an automaton constructed from the grammars accepts L1 and produces its translation in L2]
108Meta-languages
- Meta-language: a language used to define another language
- Different meta-languages will be used to define the various components of a programming language so that these components can be analyzed automatically
109Definition of Programming Languages
- Lexical tokens: regular expressions
- Syntax: context-free grammars
- Semantics: attribute grammars
- Intermediate code generation: attribute grammars
- Code generation: tree grammars
110Implementation of Programming Languages
- Regular expressions: finite automata, lexical analyzer
- Context-free grammars: pushdown automata, parser
- Attribute grammars: attribute evaluators, type checker and intermediate code generator
- Tree grammars: finite tree automata, code generator
111Appendix: Machine Translation
112Machine Translation (Transfer Approach)
[Figure: SL Text → Analysis → SL IR → Transfer → TL IR → Synthesis → TL Text. Analysis uses SL dictionaries and grammar; Transfer uses SL-TL dictionaries and transfer rules; Synthesis uses TL dictionaries and grammar.]
- IR: Intermediate Representation
- Analysis is target independent, and
- Generation (Synthesis) is source independent
113Example: Miss Smith put two books on this dining table.
- Analysis
- Morphological and Lexical Analysis
- Part-of-speech (POS) Tagging
- n. Miss | n. Smith | v. put (ed) | q. two | n. book (s) | p. on | d. this | n. dining table
114Example: Miss Smith put two books on this dining table.
[Figure: parse tree with nodes S, NP, VP, V, NP, PP over "Miss Smith put(ed) two book(s) on this dining table"]
115Example: Miss Smith put two books on this dining table.
- Transfer
- (1) Lexical Transfer: each source word is mapped to a target-language word: Miss, Smith, put (ed), two, book (s), on, this, dining table [target-language characters lost in extraction]
116Example: Miss Smith put two books on this dining table.
- Transfer
- (2) Phrasal/Structural Transfer [target-language example lost in extraction]
117Example: Miss Smith put two books on this dining table.
- Generation: Morphological and Structural [target-language examples lost in extraction]
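119position := initial + rate * 60
[Figure (Aho 86): the statement passing through the analysis phases]
- lexical analyzer: id1 := id2 + id3 * 60
- syntax analyzer: syntax tree for id1 := id2 + id3 * 60
- semantic analyzer: the same tree, with 60 converted by inttoreal
- SYMBOL TABLE: 1 position, 2 initial, 3 rate, 4 (no entry shown)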
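The lexical analysis step of this example can be sketched with a small regular-expression scanner. This is only an illustration: the token pattern, the `scan` function, and the list-based symbol table are assumptions, and only the operators `:=`, `+` and `*` are handled.

```python
import re

# One token per match: an identifier, an integer, or an operator.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>:=|[+*]))"
)


def scan(source):
    symbol_table = []     # position in this list + 1 is the identifier number
    tokens = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        pos = m.end()
        if m.group("id"):
            name = m.group("id")
            if name not in symbol_table:
                symbol_table.append(name)
            tokens.append(f"id{symbol_table.index(name) + 1}")
        elif m.group("num"):
            tokens.append(m.group("num"))
        else:
            tokens.append(m.group("op"))
    return tokens, symbol_table


tokens, table = scan("position := initial + rate * 60")
print(" ".join(tokens))  # id1 := id2 + id3 * 60
print(table)             # ['position', 'initial', 'rate']
```

Each identifier is entered into the symbol table once, and the token stream refers to it by its table index.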
120C
[Figure (Aho 86), continued: the synthesis phases]
- intermediate code generator:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
- code optimizer:
temp1 := id3 * 60.0
id1 := id2 + temp1
- code generator: Binary Code
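The intermediate code generation step can be sketched as a post-order walk over the syntax tree that allocates a fresh temporary for each interior node. This is a minimal sketch under stated assumptions: the tuple-based tree encoding and the functions `gen` and `translate` are hypothetical, and the optimizer and code generator phases are not modelled.

```python
from itertools import count


def gen(node, code, temps):
    """Emit three-address code for node; return the name holding its value."""
    if isinstance(node, str):          # leaf: identifier or constant
        return node
    op, *args = node
    if op == "inttoreal":
        arg = gen(args[0], code, temps)
        temp = f"temp{next(temps)}"
        code.append(f"{temp} := inttoreal({arg})")
        return temp
    left = gen(args[0], code, temps)
    right = gen(args[1], code, temps)
    temp = f"temp{next(temps)}"
    code.append(f"{temp} := {left} {op} {right}")
    return temp


def translate(target, expr):
    code = []
    result = gen(expr, code, count(1))
    code.append(f"{target} := {result}")
    return code


# Tree produced by the semantic analyzer for id1 := id2 + id3 * inttoreal(60)
tree = ("+", "id2", ("*", "id3", ("inttoreal", "60")))
code = translate("id1", tree)
print("\n".join(code))
# temp1 := inttoreal(60)
# temp2 := id3 * temp1
# temp3 := id2 + temp2
# id1 := temp3
```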
121Detailed Steps (1) Analysis
- Text Pre-processing (separating texts from tags)
- Clean up garbage patterns (usually introduced during file conversion)
- Recover sentences and words (e.g., <B>C</B>omputer)
- Separate Processing-Regions from Non-Processing-Regions (e.g., File-Header-Sections, Equations, etc.)
- Extract and mark strings that need special treatment (e.g., Topics, Keywords, etc.)
- Identify and convert markup tags into internal tags (de-markup; however, markup tags also provide information)
- Discourse and Sentence Segmentation
- Divide text into various primary processing units (e.g., sentences)
- Discourse: Cue Phrases
- Sentence: mainly classify the type of Period and Carriage Return in English (sentence stops vs. abbreviations/titles)
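Two of the steps above, recovering words split by markup and classifying periods, can be sketched as follows. The tag pattern, the abbreviation list, and the function names are assumptions made for illustration; a real system would keep the tag information and use a much larger abbreviation lexicon.

```python
import re

TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")

# Hypothetical, deliberately tiny list of abbreviations and titles.
ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Dr.", "Prof.", "e.g.", "i.e.", "etc."}


def strip_tags(text):
    # Removing the tags rejoins words such as <B>C</B>omputer.
    return TAG_RE.sub("", text)


def split_sentences(text):
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        # A final ".", "?" or "!" ends a sentence unless the word is a
        # known abbreviation or title.
        if word.endswith((".", "?", "!")) and word not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


raw = "<B>C</B>omputer science is fun. Dr. Smith put two books on the table."
clean = strip_tags(raw)
sentences = split_sentences(clean)
print(sentences)
# ['Computer science is fun.', 'Dr. Smith put two books on the table.']
```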
122Detailed Steps (2) Analysis (Cont.)
- Stemming
- English: perform morphological analysis (e.g., -ed, -ing, -s, -ly, re-, pre-, etc.) and identify root form (e.g., got <get>, lay <lie/lay>, etc.)
- Chinese: mainly detect suffix lexemes [Chinese examples lost in extraction]
- Text normalization: Capitalization, Hyphenation, ...
- Tokenization
- English: mainly identify split-idiom (e.g., turn NP on) and compound
- Chinese: Word Segmentation [Chinese example lost in extraction]
- Regular Expression: numerical strings/expressions (e.g., twenty millions), date, ... (each being associated with a specific type)
- Tagging
- Assign Part-of-Speech (e.g., n, v, adj, adv, etc.)
- Associated forms are basically independent of languages starting from this step
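The English stemming step can be sketched as an irregular-form lookup followed by naive suffix stripping. The tables and the `stem` function are hypothetical and deliberately small; such a stemmer over-strips some words (for example, it would cut the final s of "this"), which is why real systems rely on a lexicon.

```python
# Hypothetical lookup table for irregular forms.
IRREGULAR = {"got": "get"}

# Suffixes tried in this order; prefixes such as re- and pre- are not handled.
SUFFIXES = ["ing", "ed", "ly", "s"]


def stem(word):
    """Return (root, stripped_suffix); the suffix is None if nothing was cut."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w], None
    for suffix in SUFFIXES:
        # Require at least three letters to remain after stripping.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)], suffix
    return w, None


print(stem("books"))    # ('book', 's')
print(stem("walked"))   # ('walk', 'ed')
print(stem("quickly"))  # ('quick', 'ly')
print(stem("got"))      # ('get', None)
```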
123Detailed Steps (3) Analysis (Cont.)
- Parsing
- Decide suitable syntactic relationship (e.g., PP-Attachment)
- Decide Word-Sense
- Decide appropriate lexicon-sense (e.g., River-Bank, Money-Bank, etc.)
- Assign Case-Label
- Decide suitable semantic relationship (e.g., Patient, Agent, etc.)
- Anaphora and Antecedent Resolution
- Pronoun reference (e.g., "he" refers to "the president")
124Detailed Steps (4) Analysis (Cont.)
- Decide Discourse Structure
- Decide suitable discourse segment relationships (e.g., Evidence, Concession, Justification, etc. [Marcu 2000])
- Convert into Logical Form (Optional)
- Co-reference resolution (e.g., "president" refers to "Bill Clinton"), scope resolution (e.g., negation), Temporal Resolution (e.g., today, last Friday), Spatial Resolution (e.g., here, next), etc.
- Identify roles of Named-Entities (Person, Location, Organization), and determine IS-A (also Part-of) relationship, etc.
- Mainly used in inference related applications (e.g., QA, etc.)
125Detailed Steps (5) Transfer
- Decide suitable Target Discourse Structure
- For example: Evidence, Concession, Justification, etc. [Marcu 2000]
- Decide suitable Target Lexicon Senses
- Sense Mapping may not be one-to-one (sense resolution might be different in different languages, e.g., "snow" has more senses in Eskimo)
- Sense-Token Mapping may not be one-to-one (lexicon representation power might be different in different languages, e.g., DINK, [example lost in extraction], etc.). It could be 2-1, 1-2, etc.
- Decide suitable Target Sentence Structure
- For example: verb nominalization, constituent promotion and demotion (usually occurs when Sense-Token-Mapping is not 1-1)
- Decide appropriate Target Case
- Case Label might change after the structure has been modified
- (Example) verb nominalization: "that you (AGENT) invite me" → "your (POSS) invitation"
126Detailed Steps (6) Generation
- Adopt suitable Sentence Syntactic Pattern
- Depends on Style (which is the distribution of lexicon selection and syntactic patterns adopted)
- Adopt suitable Target Lexicon
- Select from Synonym Set (depends on style)
- Add "de" (Chinese), comma, tense, measure (Chinese), etc.
- Morphological generation is required for target-specific tokens
- Text Post-processing
- Final string substitution (replace those markers of special strings)
- Extract and export associated information (e.g., Glossary, Index, etc.)
- Restore the customer's markup tags (re-markup) to save typesetting work