Title: Comp 104: Operating Systems Concepts
1Comp 104 Operating Systems Concepts
- Introduction to Compilers
2Today
- Compilers
- Definition
- Structure
- Passes
- Lexical Analysis
- Symbol table
- Access methods
3Compilers
- Definition
- A compiler is a program which translates a high-level source program into a lower-level object program (target)
SOURCE PROG. --(analysis)--> ANALYSED PROG. --(synthesis)--> OBJECT PROG.
4History
- Late 1940s (post-von Neumann)
- Programs were written in machine code
- C7 06 0000 0002 (move the number 2 to location 0000 (hex))
- Highly complex, tedious and prone to error
- Assemblers appeared
- Machine instructions given as mnemonics
- MOV X,2 (assuming X has the value 0000 (hex))
- Greatly improved the speed and accuracy of writing code
- But still non-trivial, and non-portable to new processors
- Needed a mathematical notation
- Fortran appeared between 1954 and 1957
- X = 2
- Exploited context-free grammars (Chomsky) and finite state automata
5Compiler
- Responsible for converting source code into executable code.
- Analyses the code to determine the functionality
- Synthesises executable code for a given processor
- Optimises code to improve performance, or exploit specific processor instructions
- Assumes various data structures
- Tokens
- Variables, language keywords, syntactic constructs etc.
- Symbol Table
- Relates user defined entities (variables, methods, classes etc.) with their associated values or internal structures
- Literal Table
- Stores constants, strings, etc. Used to reduce the size of the resulting code
- Syntax/Parse Tree
- The resulting structure formed through the analysis of the code
- Intermediate Code
- Intermediate representation between different phases of the compilation
6Phases and other tools
- Interpreters
- Unlike compilers, code is executed immediately
- Slow execution, used more for scripting or functional languages
- Assemblers
- Constructs final machine code from processor specific Assembly code
- Often used as last phase of a compilation process to produce binary executable.
- Linkers
- Collates separately compiled objects into a single file, including shared library objects or system calls.
- Preprocessors
- Called prior to the compilation process to perform macro substitutions
- E.g. RATFOR preprocessor, or cpp for C code
- Profilers
- Collects statistics about the behaviour of a program and can be used to improve the performance of the code.
7Analysis and Synthesis
- Analysis
- checks that program constructs are legal and meaningful
- builds up information about objects declared
- Synthesis
- takes analysed program and generates code necessary for its execution
- Compilation based on language definition, which comprises
- syntax
- semantics
8Compiler Structure
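[Figure: compiler structure. The source program (character stream) passes through the scanner (producing tokens), the parser (producing an IR parse tree), the semantic routines, the optimiser (working on IR tuples) and the code generator (producing target code). The SYMBOL TABLE is shared between the phases. IR = Intermediate Representation.]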
9Compiler Organisation
- Each of the compiler tasks described previously (in Compiler Structure) is a phase
- Phases can be organised into a number of passes
- a pass consists of one or more phases acting on some representation of the complete program
- representations produced between source and target are Intermediate Representations (IRs)
10Single Pass Compilers
- One pass compilers very common because of their simplicity
- No IRs: all phases of compiler interleaved
- Compilation driven by parser
- Scanner acts as subroutine of parser, returning a token on each call
- As each phrase is recognised by the parser, it calls semantic routines to process declarations, check for semantic errors and generate code
- Code not as efficient as multi-pass
11Multi-Pass Compilers
- Number of passes depends on number of IRs and on any optimisations
- Multi-pass allows complete separation of phases
- more modular
- easier to develop
- more portable
- Main forms of IR
- Abstract Syntax Tree (AST)
- Intermediate Code (IC)
- Postfix
- Tuples
- Virtual Machine Code
12Compiler Implementation
- Compilers often written in HLLs for ease of maintenance, portability, etc.
- e.g. Pascal compiler written in C, runs on machine X
- Problem: always need both compilers available
- To alter compiler
- Make necessary changes
- Re-compile using C compiler
- To move to machine Y
- Re-write code generator to produce code for Y
- Compile compiler on machine Y (using Y's C compiler)
13Bootstrapping
- Suppose our compiler is written in the language it compiles
- e.g. C compiler written in C language
- We can then run compiler through itself!
- Bootstrapping
- To alter compiler
- Make necessary changes
- Run compiler through itself
- To move to machine Y
- Re-write code generator to produce code for Y
- Run compiler through itself to generate version of compiler that will run directly on Y
14The Scanner (Lexical Analyser)
- Converts groups of characters into tokens (lexemes)
- tokens usually represented as integers
- white space and comments are skipped
- Each token may be accompanied by a value
- could be a pointer to further information
- As identifiers are encountered, they are entered into a symbol table
- used to collect info. about declared objects
- Scanners often hand-coded for efficiency, but may be automatically generated (e.g. Lex)
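The scanner's job can be sketched in a few lines of Python. This is only an illustration under assumptions: the token names, the reserved-word set and the `scan` function are hypothetical, and tokens are shown as strings rather than the integers a real scanner would normally use.

```python
import re

# Hypothetical token classes; order matters (numbers before identifiers).
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("SYMBOL", r"[()+\-*/=;]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))
RESERVED = {"begin", "end", "int", "float", "print"}  # assumed reserved words


def scan(source, symbol_table):
    """Return (token, value) pairs; identifiers are entered into symbol_table."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"illegal character {source[pos]!r} at {pos}")
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue  # white space is skipped
        if kind == "IDENT" and text in RESERVED:
            tokens.append((text.upper(), None))
        elif kind == "IDENT":
            # The token value is the identifier's index in the symbol table.
            index = symbol_table.setdefault(text, len(symbol_table))
            tokens.append(("IDENT", index))
        elif kind == "NUMBER":
            tokens.append(("NUMBER", float(text) if "." in text else int(text)))
        else:
            tokens.append((text, None))
    return tokens


table = {}
for token in scan("begin int a; a = 1 end", table):
    print(token)
```

Each call to a real scanner would return one token at a time to the parser; returning the whole list here just keeps the sketch short.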
15Example
[Figure: the scanner converts the source text below into TOKEN / VALUE pairs, entering the identifiers a and b in the symbol table.]
begin int a float b a 1 b 1.2 a b 1 print (a 2) end
16Symbol Table Access
- The symbol table is used by most compiler phases
- Even used post-compilation (debugging)
- Structure of table and algorithms used can make difference between a slow and fast compiler
- Methods
- Sequential lookup
- Binary chop and binary tree
- Hash addressing
- Hash chaining
17Sequential Lookup
- Table is just a vector of names
- Search sequentially from beginning
- If name not found, add to end
- Advantages
- Very simple to implement
- Disadvantages
- Inefficient
- For table with N names, requires N/2 comparisons on average
- Can slow down a compiler by a factor of 10 or more
18Binary Chop
- Keep names in alphabetical order
- To find name
- Compare with middle element to determine which half
- Compare with middle element again to narrow down to quarter, etc.
- Advantage
- Much more efficient than sequential
- log2(N) - 1 comparisons on average
- Disadvantage
- Adding a new name means shifting up every name above it
19Question
- If the symbol table for a compiler is size 4096, how many comparisons on average need to be made when performing a lookup using the binary chop method?
- a) 2
- b) 11
- c) 12
- d) 16
- e) 31
Answer: b) 11, as there are log2(N) - 1 comparisons on average and log2(4096) = 12
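A minimal sketch of a binary-chop symbol table, assuming Python's standard `bisect` module does the halving. The class and function names are illustrative only. Note that `insert` uses `list.insert`, which shifts up every name above the new one, exactly the disadvantage noted for this method.

```python
import bisect
import math


class SortedSymbolTable:
    """Names kept in alphabetical order and found by binary chop."""

    def __init__(self):
        self.names = []

    def lookup(self, name):
        """Return the index of name, or -1 if it is not in the table."""
        i = bisect.bisect_left(self.names, name)
        if i < len(self.names) and self.names[i] == name:
            return i
        return -1

    def insert(self, name):
        """Add name if absent, keeping the table sorted."""
        i = bisect.bisect_left(self.names, name)
        if i == len(self.names) or self.names[i] != name:
            self.names.insert(i, name)  # shifts every later name up by one
        return i


def average_comparisons(n):
    """Average comparisons for binary chop, using the slide's log2(N) - 1."""
    return math.log2(n) - 1


table = SortedSymbolTable()
for name in ["jim", "fred", "sheila"]:
    table.insert(name)
print(table.names, table.lookup("jim"), average_comparisons(4096))
```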
20Binary Tree
- Each node contains pointer to 2 sub-trees
- Left sub-tree contains all names < current
- Right sub-tree has all names > current
- Advantages
- In best case, search time can be as good as binary chop
- Adding a new name is simple and efficient
- Disadvantages
- Efficiency depends on how balanced the tree is
- Tree can easily become unbalanced
- In worst case, method as bad as sequential lookup!
- May need to do costly re-balancing occasionally
21Hash Addressing
- To determine position in table, apply a hash function, returning a hash key
- Example function: sum of character codes modulo N, where N is table size (prime)
- Advantages
- Can be highly efficient
- Even similar names can generate totally different hash keys
- Disadvantages
- Requires hash function producing good distribution
- Possibility of collisions
- May require re-hashing mechanism, possibly multiple times
22Hash Chaining
- As before, but link together names having same hash key
[Figure: hash(fred) indexes an array of pointers; the selected entry points to a chain holding fred and jim.]
- Number of comparisons needed very small
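Hash chaining can be sketched as follows, using the example hash function from the Hash Addressing slide (sum of character codes modulo a prime table size). The class name, the default size of 211 and the dictionary-based entries are assumptions made for the illustration.

```python
class ChainedSymbolTable:
    """Array of chains; names with the same hash key share a chain."""

    def __init__(self, size=211):  # 211 is an arbitrary prime
        self.size = size
        self.buckets = [[] for _ in range(size)]

    def hash_key(self, name):
        """Sum of character codes modulo the table size."""
        return sum(ord(ch) for ch in name) % self.size

    def lookup(self, name):
        """Walk the chain for name's hash key; return the entry or None."""
        for entry in self.buckets[self.hash_key(name)]:
            if entry["name"] == name:
                return entry
        return None

    def insert(self, name, **attributes):
        """Return the existing entry for name, or append a new one."""
        entry = self.lookup(name)
        if entry is None:
            entry = {"name": name, **attributes}
            self.buckets[self.hash_key(name)].append(entry)
        return entry


table = ChainedSymbolTable(size=7)
table.insert("ab", type="int")
table.insert("ba", type="float")  # same character sum, so same chain
print(table.hash_key("ab"), table.hash_key("ba"), table.lookup("ba"))
```

Because collisions simply extend a chain, no re-hashing mechanism is needed; the cost is one comparison per entry in the chain.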
23Question
- Concerning compilation, which of the following is NOT a method for symbol table access?
- a) Sequential lookup
- b) Direct lookup
- c) Binary chop
- d) Hash addressing
- e) Hash chaining
Answer: b) Direct lookup
24Reserved Words
- Words like for, while, if, etc. are reserved words
- Could use binary chop on a table of reserved words first; if not there, search symbol table
- Simpler to pre-hash all reserved words into the symbol table and use one lookup mechanism
25Today
- Parsing
- Context-free grammar BNF
- Example The Micro language
- Parse Tree
- Abstract syntax tree
26Parser (Syntax Analyser)
- Reads tokens and groups them into units as specified by language grammar, i.e. it recognises syntactic phrases
- Parser must produce good errors and be able to recover from errors
27Scanning and Parsing
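[Figure: scanning and parsing. The source file (a statement involving sum, x1 and x2) is read as an input stream by the Scanner, whose tokens are defined by regular expressions; the Parser, whose grammar elements are defined by BNF rules, turns the tokens into a parse tree.]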
28Syntax
- Defines the structure of legal statements in the language
- Usually specified formally using a context-free grammar (CFG)
- Notation most widely used is Backus-Naur Form (BNF), or extended BNF
- A CFG is written as a set of rules (productions)
- In extended BNF
- { ... } means zero or many
- [ ... ] means zero or one
29Backus Naur Form
- Backus Naur Form (BNF) is a standard notation for expressing syntax as a set of grammar rules.
- BNF was developed by Noam Chomsky, John Backus, and Peter Naur.
- First used to describe Algol.
- BNF can describe any context-free grammar.
- Fortunately, computer languages are mostly context-free.
- Computer languages remove non-context-free meaning by either
- (a) defining more grammar rules or
- (b) pushing the problem off to the semantic analysis phase.
30A Context-Free Grammar
- A grammar is context-free if all the syntax rules apply regardless of the symbols before or after (the context).
- Example
(1) sentence => noun-phrase verb-phrase .
(2) noun-phrase => article noun
(3) article => a | the
(4) noun => boy | girl | cat | dog
(5) verb-phrase => verb noun-phrase
(6) verb => sees | pets | bites
Terminal symbols: 'a' 'the' 'boy' 'girl' 'sees' 'pets' 'bites'
31A Context-Free Grammar
A sentence that matches the productions (1) - (6) is valid.
a girl sees a boy
a girl sees a girl
a girl sees the dog
the dog pets the girl
a boy bites the dog
a dog pets the boy
...
To eliminate unwanted sentences without imposing a context-sensitive grammar, specify semantic rules, e.g. "a boy may not bite a dog"
32Backus Naur Form
- Grammar Rules or Productions define symbols.
  assignment_stmt ::= id = expression
- The nonterminal symbol being defined is on the left; the definition (production) is on the right.
- Nonterminal Symbols: anything that is defined on the left side of some production.
- Terminal Symbols: things that are not defined by productions. They can be literals, symbols, and other lexemes of the language defined by lexical rules.
- Identifiers: id ::= [A-Za-z_]\w*
- Delimiters
- Operators such as - and /
33Backus Naur Form (2)
- Different notations (same meaning)
- assignment_stmt ::= id = expression + term
- <assignment-stmt> => <id> = <expr> + <term>
- AssignmentStmt → id = expression + term
- ::=, =>, → mean "consists of" or "defined as"
- Alternatives ( "|" )
  expression => expression + term | expression - term | term
- Concatenation
  number => DIGIT number | DIGIT
34Alternative Example
- The following BNF syntax is an example of how an arithmetic expression might be constructed in a simple language
- Note the recursive nature of the rules
35Syntax for Arithmetic Expr.
<expression> ::= <term> | <addop> <term> | <expression> <addop> <term>
<term> ::= <primary> | <term> <multop> <primary>
<primary> ::= <digit> | <letter> | ( <expression> )
<digit> ::= 0 | 1 | 2 | ... | 9
<letter> ::= a | b | c | ... | y | z
<multop> ::= * | /
<addop> ::= + | -
- Are the following expressions legal, according to this syntax?
- i) -a
- ii) bc(3/d)
- iii) a(c-(4b))
- iv) 5(9-e)/d
36BNF rules can be recursive
- expr => expr + term | expr - term | term
- term => term * factor | term / factor | factor
- factor => ( expr ) | ID | NUMBER
- where the tokens are
- NUMBER : [0-9]+
- ID : [A-Za-z_][A-Za-z_0-9]*
37Uses of Recursion
- Repetition
- expr => expr + term
- => expr + term + term
- => expr + term + term + term
- => term + ... + term + term
- Parser can recursively expand expr each time one is found
- Could lead to arbitrary depth analysis
- Greatly simplifies implementation
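The recursive expr / term / factor rules map almost directly onto a recursive-descent parser. The sketch below is one possible reading, not the course's own code: it assumes the operators +, -, * and /, replaces the left-recursive rules by loops (a common way to avoid infinite recursion in a top-down parser), and evaluates the expression as it parses. The names `Parser` and `evaluate` are hypothetical.

```python
import re


def tokenize(text):
    """Split text into numbers, identifiers and single-character symbols."""
    return re.findall(r"\d+|[A-Za-z_][A-Za-z_0-9]*|\S", text)


class Parser:
    def __init__(self, text, env=None):
        self.tokens = tokenize(text)
        self.pos = 0
        self.env = env or {}  # values of identifiers, for evaluation

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # expr => term { (+ | -) term }, applied left to right
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.advance() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        # term => factor { (* | /) factor }
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.advance() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        # factor => ( expr ) | ID | NUMBER
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            value = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token.isdigit():
            return int(token)
        if token in self.env:
            return self.env[token]
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate(text, env=None):
    parser = Parser(text, env)
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value


print(evaluate("2 + 3 * 4"), evaluate("2 - 3 - 4"), evaluate("(2 + 3) * x", {"x": 4}))
```

Because term sits below expr, multiplication binds tighter than addition, and the loops make both levels left-associative.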
38Example The Micro Language
- To illustrate BNF parsing, consider an example imaginary language: the Micro language
- 1) A program is of the form begin sequence of statements end
- 2) Only statements allowed are
- assignment
- read (list of variables)
- write (list of expressions)
39Micro
- 3) Variables are declared implicitly
- their type is integer
- 4) Each statement ends in a semi-colon
- 5) Only operators are +, -
- parentheses may be used
40Micro CFG
- 1) A program is of the form begin statements end
- 2) Permissible statements
- assignment
- read (list of variables)
- write (list of expressions)
- 3) Variables are declared implicitly
- their type is integer
- 4) Statements end in a semi-colon
- 5) Valid operators are +, - but can use parentheses
 1) <program> ::= begin <stat-list> end
 2) <stat-list> ::= <statement> { <statement> }
 3) <statement> ::= id := <expr> ;
 4) <statement> ::= read ( <id-list> ) ;
 5) <statement> ::= write ( <expr-list> ) ;
 6) <id-list> ::= id { , id }
 7) <expr-list> ::= <expr> { , <expr> }
 8) <expr> ::= <primary> { <addop> <primary> }
 9) <primary> ::= ( <expr> )
10) <primary> ::= id
11) <primary> ::= intliteral
12) <addop> ::= +
13) <addop> ::= -
41BNF
- Items such as <program> are non-terminals
- require further expansion
- Items such as begin are terminals
- correspond to language tokens
- Usual to combine productions using | (or)
- e.g. <primary> ::= ( <expr> ) | id | intliteral
42Parsing
- Bottom-up
- Look for patterns in the input which correspond to phrases in the grammar
- Replace patterns of items by phrases, then combine these into higher-level phrases, and so on
- Stop when input converted to single <program>
- Top-down
- Assume input is a <program>
- Search for each of the sub-phrases forming a <program>, then for each of the sub-sub-phrases, and so on
- Stop when we reach terminals
- A program is syntactically correct iff it can be derived from the CFG
43Question
- Consider the following grammar, where S, A and B are non-terminals, and a and b are terminals
- S → AB
- A → a
- A → BaB
- B → bbA
- Which of the following is FALSE?
- a) The length of every string derived from S is even.
- b) No string derived from S has an odd number of consecutive bs.
- c) No string derived from S has three consecutive as.
- d) No string derived from S has four consecutive bs.
- e) Every string derived from S has at least as many bs as as.
Answer: d) No string derived from S has four consecutive bs. This is false: S → AB → aB → abbA → abbBaB → abbbbAabbA → abbbbaabba, which contains four consecutive bs. Statement c is true, because every A ends in a single a preceded by b (or is just a), so a run of as never exceeds two.
44Example
- Parse: begin A := B + (10 - C); end
<program>
begin <stat-list> end (apply rule 1)
begin <statement> end (2)
begin id := <expr>; end (3)
begin id := <primary> <addop> <primary>; end (8)
begin id := <primary> + <primary>; end (12)
...
45Exercise
- Complete the previous parse
- Clue: this is the final line of the parse
- begin id := id + (intliteral - id); end
46Answer
- Parse: begin A := B + (10 - C); end
<program>
begin <stat-list> end (apply rule 1)
begin <statement> end (2)
begin id := <expr>; end (3)
begin id := <primary> <addop> <primary>; end (8)
begin id := <primary> + <primary>; end (12)
begin id := id + <primary>; end (10)
begin id := id + (<expr>); end (9)
begin id := id + (<primary> <addop> <primary>); end (8)
begin id := id + (<primary> - <primary>); end (13)
begin id := id + (intliteral - <primary>); end (11)
begin id := id + (intliteral - id); end (10)
47Parse Tree
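[Figure: parse tree for the example. <program> expands to begin <stat-list> end; <stat-list> to <statement>; <statement> to an assignment of <expr> to id; <expr> to <primary> <addop> <primary>, where the first <primary> is id and the second is ( <expr> ); the inner <expr> expands to <primary> <addop> <primary>, giving intliteral - id.]
- The parser creates a data structure representing how the input is matched to grammar rules.
- Usually as a tree.
- Also called syntax tree or derivation tree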
48Expression Grammars
- For expressions, a CFG can indicate associativity and operator precedence, e.g.
<expr> ::= <factor> { <addop> <factor> }
<factor> ::= <primary> { <multop> <primary> }
<primary> ::= ( <expr> ) | id | literal
[Figure: parse tree for A+B*C. <expr> expands to <factor> <addop> <factor>; the first <factor> is a single <primary> (id A), and the second expands to <primary> <multop> <primary> (id B, id C), so the multiplication is grouped before the addition.]
49Ambiguity
- A grammar is ambiguous if there is more than one parse tree for a valid sentence.
- Example
- expr => expr + expr | expr * expr | id | number
- How would you parse x + y * z using this rule?
50Example of Ambiguity
- Grammar Rules
- expr => expr + expr | expr * expr | ( expr ) | NUMBER
- Expression: 2 + 3 * 4
- Two possible parse trees
51Another Example of Ambiguity
- Grammar rules
- expr => expr + expr | expr - expr | ( expr ) | NUMBER
- Expression: 2 - 3 - 4
- Parse trees
52Ambiguity
- Ambiguity can lead to inconsistent implementations of a language.
- Ambiguity can cause infinite loops in some parsers.
- Specification of a grammar should be unambiguous!
- How to resolve ambiguity
- rewrite grammar rules to remove ambiguity
- add some additional requirement for parser, such as "always use the left-most match first"
- EBNF (later) helps remove ambiguity
53Abstract Syntax Tree (AST)
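- More compact form of derivation tree
- contains just enough info. to drive later phases, e.g. for a statement assigning 3*X + I to Y
[Figure: AST whose nodes each hold a tag and an attribute; the leaves are id nodes for Y, X and I, which point to the symbol table, and a const node with value 3.]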
54Semantics
- Specify meaning of language constructs
- usually defined informally
- A statement may be syntactically legal but semantically meaningless
- colourless green ideas sleep furiously
- Semantic errors may be
- static (detected at compile time), e.g. a x true
- dynamic (detected at run time), e.g. array subscript out of bounds
55Question
- If the array x contains 20 ints, as defined by the following declaration
  int[] x = new int[20];
- What kind of message would be generated by the following lines of code?
  a = 22;
  val = x[a];
- a) A Syntax Error.
- b) A Static Semantic Error.
- c) A Dynamic Semantic Error.
- d) A Warning, rather than an error.
- e) None of the above.
Answer: c) A dynamic semantic error: the value of a would cause an array out of bounds error
56Semantics
- Also needed to generate appropriate code, e.g. a = b
- in Java and C, this means assign b to a
- in Pascal and Ada, this means compare equality of a and b
- hence, generate different code in each case
57Semantic Routines
- 1) Semantic analysis
- Completes analysis phase of compilation
- Object descriptors are associated with identifiers in symbol table
- Static semantic error checking performed
- 2) Semantic synthesis
- Code generation
58Object Descriptors
[Figure: a symbol table entry has the fields name, token, descriptor list and link (to next entry in chain).]
- Token tells us what name is
- e.g. while-token, if-token, identifier, etc.
- A descriptor contains things like type, address, array bounds, etc.
- Need a list of descriptors because of identifier re-use
59Identifier Re-use
- Can have code such as
  int x;         // level 1
  main() {
    float x;     // level 2
  }
[Figure: the symbol table entry for x holds a descriptor list: (2, float) followed by (1, integer).]
60Descriptor Lists
- For efficiency, the most local descriptors are kept at the front of the list
- At the end of a block, all descriptors declared in that block must be deleted
- To aid in this, all descriptors within same block may be linked together
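Descriptor lists can be sketched as below. The class and method names are hypothetical, descriptors are reduced to (level, type) pairs, and a per-block list of names stands in for "linking together" the descriptors of one block.

```python
class ScopedSymbolTable:
    """Each name maps to a descriptor list with the most local entry first."""

    def __init__(self):
        self.entries = {}   # name -> list of (level, type) descriptors
        self.blocks = [[]]  # names declared in each currently open block

    @property
    def level(self):
        return len(self.blocks)

    def open_block(self):
        self.blocks.append([])

    def declare(self, name, type_name):
        descriptors = self.entries.setdefault(name, [])
        if descriptors and descriptors[0][0] == self.level:
            raise NameError(f"{name} already declared at level {self.level}")
        descriptors.insert(0, (self.level, type_name))  # most local at front
        self.blocks[-1].append(name)

    def lookup(self, name):
        descriptors = self.entries.get(name)
        if not descriptors:
            raise NameError(f"{name} not declared")
        return descriptors[0]

    def close_block(self):
        # Delete every descriptor declared in the block being closed.
        for name in self.blocks.pop():
            self.entries[name].pop(0)


table = ScopedSymbolTable()
table.declare("x", "integer")   # level 1
table.open_block()
table.declare("x", "float")     # level 2 hides the outer x
print(table.lookup("x"))
table.close_block()
print(table.lookup("x"))
```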
61Attribute Propagation
- Before code can be generated, semantic attributes may need to be propagated through tree
- Top-down (inherited attributes)
- declarations processed to build symbol table
- identifiers looked up in table to attach attribute info to nodes
- Bottom-up (synthesised attributes)
- determine types of expressions based on operators and types of identifiers
- Propagation can be done at same time as static semantic error checking, and often forms next pass
- May also be combined with code generation
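62Example a = b*c + b*d
float a, d; int b, c;
[Figure: decorated tree for the assignment. Types inherited from the SYMBOL TABLE label the leaves: a (float), b (int), c (int), d (float). Synthesised types label the interior nodes: b*c is (int), b*d is (float), and their sum is (float).]
- Type attribute recorded in extra field of each node
- After propagation, tree is said to be decorated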
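Bottom-up (synthesised) type propagation can be sketched as a recursive walk. This assumes, as an illustration only, that trees are nested tuples of the form (operator, left, right), that leaves are identifier names, and that an int operand is widened to float when mixed with a float operand.

```python
def synthesise_type(node, symbols):
    """Return the type of an expression tree, looking leaves up in symbols."""
    if isinstance(node, str):
        if node not in symbols:
            raise NameError(f"identifier {node} not declared")
        return symbols[node]  # inherited from the declaration
    _operator, left, right = node
    left_type = synthesise_type(left, symbols)
    right_type = synthesise_type(right, symbols)
    # Assumed rule: the result is float if either operand is float.
    return "float" if "float" in (left_type, right_type) else "int"


symbols = {"a": "float", "d": "float", "b": "int", "c": "int"}
tree = ("+", ("*", "b", "c"), ("*", "b", "d"))
print(synthesise_type(("*", "b", "c"), symbols))  # int
print(synthesise_type(tree, symbols))             # float
```

A fuller version would store the result in an extra field of each node, producing the decorated tree, and would report a type mismatch where the language forbids mixing.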
63Static Semantic Error Checking
- With info from attribute propagation, static checking often trivial, e.g.
- type mismatch (compare type attributes)
- identifier not declared (null descriptor field in symbol table)
- identifier already declared (descriptor with current level number already present)
64Question
- A BNF grammar includes the following statement
- <statement> ::= <iden> ( <expr> )
- What kind of message would be produced by the following line of code?
- a (2 b
- a) A Syntax Error.
- b) A Static Semantic Error.
- c) A Dynamic Semantic Error.
- d) A Warning, rather than an error.
- e) None of the above.
Answer: a) A syntax error: all the tokens are valid, but the close parenthesis is missing, resulting in an error in the grammar
65Code Generation
- Often performed by tree-walking the AST
  GenAssign(node) {
    // Gen code for RHS, leaving result in R1
    GenExpr(node.rhs, R1)
    // Calculate addr for LHS
    GenAddr(node.lhs, Addr)
    Gen(STORE, R1, Addr)
  }
  GenExpr(node, reg) {
    if (node.type == op) {
      GenExpr(node.lhs, reg)
      GenExpr(node.rhs, reg+1)
      Gen(node.opcode, reg, reg+1)
    }
    ...
  }
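A runnable sketch of the tree walk, assuming a small hypothetical AST (`Leaf`, `Op`, `Assign`) and the LOAD / MULT / ADD / STORE mnemonics used on the Tree Walking slide. Address calculation for the left-hand side is reduced to using the variable's name.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    name: str  # identifier or constant, as text


@dataclass
class Op:
    opcode: str  # e.g. "ADD" or "MULT"
    lhs: object
    rhs: object


@dataclass
class Assign:
    lhs: Leaf
    rhs: object


def gen_expr(node, reg, code):
    """Generate code leaving the value of node in register reg."""
    if isinstance(node, Op):
        gen_expr(node.lhs, reg, code)
        gen_expr(node.rhs, reg + 1, code)
        code.append(f"{node.opcode} R{reg}, R{reg + 1}")
    else:
        code.append(f"LOAD R{reg}, {node.name}")


def gen_assign(node):
    """Generate code for the RHS into R1, then store it in the LHS."""
    code = []
    gen_expr(node.rhs, 1, code)
    code.append(f"STORE R1, {node.lhs.name}")
    return code


tree = Assign(Leaf("Y"), Op("ADD", Op("MULT", Leaf("3"), Leaf("X")), Leaf("I")))
print("\n".join(gen_assign(tree)))
```

For this tree the walk emits LOAD R1, 3; LOAD R2, X; MULT R1, R2; LOAD R2, I; ADD R1, R2; STORE R1, Y.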
66Abstract Syntax Tree (AST) Again
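- More compact form of derivation tree
- contains just enough info. to drive later phases, e.g. for a statement assigning 3*X + I to Y
[Figure: AST whose nodes each hold a tag and an attribute; the leaves are id nodes for Y, X and I, which point to the symbol table, and a const node with value 3.]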
67Tree Walking
- Code generated by walking the AST for the assignment of 3*X + I to Y (every node has type int)
  LOAD R1, 3
  LOAD R2, X
  MULT R1, R2
  LOAD R2, I
  ADD R1, R2
  STORE R1, Y
- Advantage of AST is that order of traversal can be chosen
- code generated in one-pass compiler corresponds to strictly fixed traversal of tree (hence, code not as good)
68Intermediate Code (IC)
- Instead of generating target machine code, semantic routines may generate IC.
- can form input to separate code generator (CG)
- advantage is that all target machine dependencies can be limited to CG
- Postfix
- e.g. a = b*c + b*d becomes a b c * b d * + =
- Concise and simple, but not very good for generating code unless stack-based architecture used
69Postfix
- In normal algebraic notation the arithmetic operator appears between the two operands to which it is being applied
- This is called infix notation
- example a / b c
- It may require parentheses to specify the desired order of operations
- example a / (b c)
- In postfix (or Reverse Polish) notation the operator is placed directly after the two operands to which it applies
- Therefore, in postfix notation the need for parentheses is eliminated
70Operator Precedence
- To do the conversion from infix to postfix, we need to prioritise operators as follows
- highest priority
- *, /
- +, -
- <, >, =, ...
- (and)
- (or) lowest priority
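One standard way to carry out the conversion is a stack-based (shunting-yard style) algorithm; the slides do not say which method the course uses, so treat this as an illustrative sketch. Only the four arithmetic operators are handled, all assumed left-associative, and the function name is hypothetical.

```python
import re

# Higher number = higher priority, following the table above.
PRECEDENCE = {"*": 2, "/": 2, "+": 1, "-": 1}


def to_postfix(expression):
    """Convert an infix expression to a space-separated postfix string."""
    output, stack = [], []
    for token in re.findall(r"[A-Za-z_]\w*|\d+|\S", expression):
        if token in PRECEDENCE:
            # Pop operators of equal or higher priority before pushing.
            while (stack and stack[-1] != "("
                   and PRECEDENCE[stack[-1]] >= PRECEDENCE[token]):
                output.append(stack.pop())
            stack.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise SyntaxError("unbalanced parentheses")
            stack.pop()  # discard the "("
        else:
            output.append(token)  # operand
    while stack:
        if stack[-1] == "(":
            raise SyntaxError("unbalanced parentheses")
        output.append(stack.pop())
    return " ".join(output)


print(to_postfix("a + b * c"))
print(to_postfix("a / (b + c)"))
```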
71Exercise
- Convert the following infix expressions into postfix
- ab/c
- ac(b-d)
- acb-d
72Postfix
- Example 1
- The infix expression a b c
- Becomes in postfix a b c
- Example 2
- The infix expression a (b c)
- Becomes in postfix a b c
- Example 3
- The infix expression b c 5 ( 3 6 / a )
- Becomes in postfix b c 5 3 6 a /
73Question
- Which of the following postfix expressions is equivalent to the following expression?
- ab c/d
- a) a b c d - /
- b) a b - c d /
- c) a b c d / -
- d) a b c d / -
- e) a b c - d /
Answer: d) a b c d / -
74Today
- Code generation
- Three address code
- Code optimisation
- Techniques
- Classification of optimisations
- Time of application
- Area of application
75Intermediate Code
- Code can be generated from syntax tree
- However, this doesn't represent target code very well
- Tree represents constructs such as conditionals (if...then...else) or loops (while...do)
- Target code includes jumps to memory addresses
- Intermediate code represents a linearisation of the syntax tree
- Postfix is an example of a stack-based linearisation
- Typically related in some way to target architecture
- Good for efficient code
- Can be exploited by code optimisation routines
76Three Address Code
- Reflects the notion of simple operations of the form
- x = y op z
- Many instructions are of this form
- Introduces the notion of temporary variables
- These represent interior nodes in the tree
- Usually assigned to registers
- Represents a left-to-right linearisation of the code
- Other variants exist, e.g. for unary operations
- x = -y
77Three Address Code
- Consider the arithmetic expression
- 2*a+(b-3)
- The corresponding three-address code is
- t1 = 2 * a
- t2 = b - 3
- t3 = t1 + t2
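Generating three-address code from an expression tree amounts to a post-order walk that allocates one temporary per interior node. The sketch below assumes trees written as nested tuples (operator, left, right) with operand strings as leaves; the function name is hypothetical.

```python
import itertools


def three_address(tree):
    """Return three-address instructions for a nested-tuple expression tree."""
    code = []
    temporaries = (f"t{i}" for i in itertools.count(1))

    def walk(node):
        if isinstance(node, str):
            return node  # a leaf is already a usable operand
        operator, left, right = node
        left_name = walk(left)
        right_name = walk(right)
        temp = next(temporaries)  # one temporary per interior node
        code.append(f"{temp} = {left_name} {operator} {right_name}")
        return temp

    walk(tree)
    return code


# 2*a + (b - 3)
for line in three_address(("+", ("*", "2", "a"), ("-", "b", "3"))):
    print(line)
```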
78Example factorial function
Source program:
  read x;
  if 0 < x then
    fact := 1;
    repeat
      fact := fact * x;
      x := x - 1
    until x = 0;
    write fact
  end
Three-address code:
  read x
  t1 = x > 0
  if_false t1 goto L1
  fact = 1
  label L2
  t2 = fact * x
  fact = t2
  t3 = x - 1
  x = t3
  t4 = x == 0
  if_false t4 goto L2
  write fact
  label L1
  halt
79P-Code
- Was initially a target assembly generated by Pascal compilers in the early 1970s
- Format is very similar to assembly
- designed to work on a hypothetical stack machine called a P-machine
- aim was to aid portability
- P-code instructions could then be mapped to assembly for target platform
- Simple, abstract version given on the next slide
80P-Code
- Consider the arithmetic expression
- 2*a+(b-3)
- The corresponding P-code is
  ldc 2    ; load constant 2
  lod a    ; load value of var a
  mpi      ; integer multiplication
  lod b    ; load value of var b
  ldc 3    ; load constant 3
  sbi      ; integer subtraction
  adi      ; integer addition
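The stack behaviour of the P-machine can be illustrated with a tiny interpreter for just the five instructions used above. This is a sketch of the idea, not an implementation of real P-code; `run_pcode` and the dictionary of variable values are assumptions.

```python
def run_pcode(program, variables):
    """Execute a list of P-code-like instructions and return the top of stack."""
    stack = []
    for instruction in program:
        opcode, *operand = instruction.split()
        if opcode == "ldc":      # load constant
            stack.append(int(operand[0]))
        elif opcode == "lod":    # load value of variable
            stack.append(variables[operand[0]])
        elif opcode in ("adi", "sbi", "mpi"):
            right = stack.pop()
            left = stack.pop()
            if opcode == "adi":
                stack.append(left + right)
            elif opcode == "sbi":
                stack.append(left - right)
            else:
                stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction {instruction!r}")
    return stack.pop()


program = ["ldc 2", "lod a", "mpi", "lod b", "ldc 3", "sbi", "adi"]
print(run_pcode(program, {"a": 5, "b": 10}))  # 2*5 + (10 - 3)
```

Note that the instruction sequence is the postfix form of the expression, which is why postfix suits stack-based architectures.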
81Question
- Which of the following is NOT a form of intermediate representation used by compilers?
- a) Postfix
- b) Tuples
- c) Context-free grammar
- d) Abstract syntax tree
- e) Virtual machine code
Answer: c) A context-free grammar defines the language used by the compiler; the rest are intermediate representations
82Code Optimisation
- Aim is to improve quality of target code
- Disadvantages
- compiler more difficult to write
- compilation time may double or triple
- target code often bears little resemblance to unoptimised code
- greater chance of translation errors
- more difficult to debug programs
83Optimisation Techniques
- Constant folding
- can evaluate expressions involving constants at compile-time
- aim is for the compiler to pre-compute (or remove) as many operations as possible
- e.g. a = 3*16 - 2 compiles to
  LOAD 1, 46
  STORE 1, a
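Constant folding over an expression tree can be sketched as a bottom-up rewrite. The nested-tuple tree format and the `fold` function are assumptions for the illustration; only +, - and * on integer constants are folded.

```python
import operator

FOLDABLE = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(node):
    """Replace sub-trees whose operands are both constants by their value."""
    if not isinstance(node, tuple):
        return node  # a constant (int) or a variable name (str)
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return FOLDABLE[op](left, right)  # computed at compile time
    return (op, left, right)


print(fold(("-", ("*", 3, 16), 2)))     # 46
print(fold(("+", "x", ("*", 2, 4))))    # ('+', 'x', 8)
```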
84Techniques
- Global register allocation
- analyse program to determine which variables are likely to be used most and allocate these to registers
- good use of registers is a very important feature of efficient code
- aided by architectures that provide an increased number of registers
85Techniques
- Code deletion
- identify and delete unreachable or dead code
  boolean debug = false;
  ...
  if (debug) { ... }    <- No need to generate code for this
86Techniques
- Common sub-expression elimination
- avoid generating code for unnecessary operations by identifying expressions that are repeated
- a (bc/5 x) - (bc/5 y)
- generate code for bc/5 only once
87Exercise
- Optimise the following
- a 100322
- b (a-30)5
- if (altb)
- screen.println(a)
88Techniques
- Code motion out of loops
- for (int i0 i lt n i) x a 5
//loop-invariant code Screen.println(xi) - x a 5for (int i0 i lt n i)
Screen.println(xi)
89Techniques
- Strength reduction
- replace operations by others which are equivalent but more efficient, e.g. a * 2
  LOAD 1, a        LOAD 1, a
  MULT 1, 2   =>   ADD 1, 1
90Question
- What optimisation technique could be applied in the following examples?
- a = b^2
- a = a / 2
- a) Constant Folding
- b) Code Deletion
- c) Common Sub-Expression Elimination
- d) Strength Reduction
- e) Global Register Allocation
Answer: d) Both expressions can be reduced by changing the operator: a = b^2 can be reduced to a = b * b; a = a / 2 is a right shift operation, a = a >> 1
91Classification of Optimisations
- Optimisations can be classified according to their different characteristics
- Two useful classifications
- the period of the compilation process during which an optimisation can be applied
- the area of the program to which the optimisation applies
92Time of Application
- Optimisations can be performed at virtually every stage of the compilation process
- e.g. constant folding can be performed during parsing
- other optimisations might be applied to target code
- The majority of optimisations are performed either during or just after intermediate code generation, or during target code generation
- source-level optimisations do not depend upon characteristics of the target machine and can be performed earlier
- target-level optimisations depend upon the target architecture
- sometimes an optimisation can consist of both
93Target Code Optimisations
- Optimisations performed on target code are known as peephole optimisations
- scan target code, searching for sequences of target code that can be replaced by more efficient ones, e.g.
  LOAD 1, a
  ADD 1, 1      =>   INC a
  STORE 1, a
- replacements may introduce further possibilities
- effective and simple
- sometimes tacked onto end of one-pass compiler
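A peephole pass can be sketched as a sliding window over the instruction list. The version below knows only the one replacement shown on the slide and assumes that in "ADD 1, 1" the second operand is the constant 1 being added to register 1; the function name and the textual instruction format are illustrative.

```python
def peephole(code):
    """Replace LOAD r, v / ADD r, 1 / STORE r, v by INC v."""
    optimised = []
    i = 0
    while i < len(code):
        window = code[i:i + 3]
        if len(window) == 3:
            load, add, store = (ins.replace(",", " ").split() for ins in window)
            if (load[0] == "LOAD"
                    and add == ["ADD", load[1], "1"]
                    and store == ["STORE", load[1], load[2]]):
                optimised.append(f"INC {load[2]}")
                i += 3  # skip the three instructions just replaced
                continue
        optimised.append(code[i])
        i += 1
    return optimised


print(peephole(["LOAD 1, a", "ADD 1, 1", "STORE 1, a", "LOAD 1, b"]))
```

Since replacements may expose further opportunities, a real pass would typically repeat until no more changes occur.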
94Area of Application
- Optimisations can be applied to different areas of a program
- Local optimisations: those that are applied to straight-line segments of code, i.e. with no jumps into or out of the sequence
- easiest optimisations to perform
- Global optimisations: those that extend beyond basic blocks but are confined to an individual procedure
- more difficult to perform
- Inter-procedural optimisations: those that extend beyond the boundaries of procedures to the entire program
- most difficult optimisations to perform
95Today
- Compiler-writing tools
- Regular expressions
- Lex
- Yacc
- Code generator generators
96Compiler-Writing Tools
- Various software tools exist which aid in the construction of compilers.
- Parser generators
- e.g. yacc
- Code generator generators
- Scanner generators
- e.g. lex
- The input to lex consists of a definition of each token as a regular expression
97Regular Expressions (REs)
- Used in many UNIX tools, e.g. awk, grep, sed, lex, vi
- REs specify patterns to be matched against input text
- An RE may be just a string: cat matches the string cat
- A full stop matches any single char: c.t matches cat, cut, cot, etc.
98REs
- The beginning of a line is specified by ^
- End of line is specified as $
- An asterisk * means zero or more occurrences of the immediately preceding item: xy*z matches xz, xyz, xyyz, xyyyz, etc.
- A plus sign + means one or more: xy+z matches xyz, xyyz, etc.
- A vertical bar | means or, e.g. x(a|b)y matches xay or xby
99Exercise
- What will be matched by the pattern a.d in the
following line of characters? - add a dog and aardvark
- Using the same line of characters what will match
a.d ? - What do we get if we search a file for all
occurrences of the following patterns? - hello
-
100Exercise
- Using the same line of characters as before, what
will be matched by the following? - and
- and
- What will be matched by
- 101
- .
101Character Classes
- Square brackets denote a character class: [abc] matches character a, b, or c
- Can also abbreviate: [1-6] is equivalent to [123456]
- Asterisk and plus may be applied to character classes, e.g. to define hex numbers in a Java or C program: 0x[0-9a-fA-F]+
- Can negate a character class: [^abc] matches any char except a, b, c
- Note different use of ^
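The patterns from the last few slides can be tried out with Python's `re` module, whose syntax agrees with lex on these particular constructs (the two tools differ elsewhere). The helper `matches` is hypothetical and tests whether the whole string matches.

```python
import re


def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


examples = {
    r"c.t": ["cat", "cut", "cot"],
    r"xy*z": ["xz", "xyz", "xyyz"],
    r"xy+z": ["xyz", "xyyz"],
    r"x(a|b)y": ["xay", "xby"],
    r"[1-6]": ["1", "6"],
    r"0x[0-9a-fA-F]+": ["0x1F", "0xdead"],
    r"[^abc]": ["d", "z"],
}
for pattern, texts in examples.items():
    for text in texts:
        print(pattern, text, matches(pattern, text))

# ^ anchors the start of a line outside brackets, but negates inside them.
print(re.search(r"^and", "and aardvark") is not None)
print(matches(r"[^abc]", "a"))
```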
102Exercise
- Which of the following will match
- Kkaitle
- Kate kite kale kit ?
- What matches the following? \t\n
103Lex
- Input to lex consists of pairs of REs and actions
- Each RE defines a particular language token
- Each action is a fragment of C code, to be executed if the token is encountered
- Lex transforms this input to a C function called yylex(), which returns a token each time it is called
- The string that matches an RE is placed in an array called yytext
- Extra info about a token can be passed back to calling program via a global variable called yylval
104Lex Example
  [ \t\n]+               ;
  while                  return(WHILE_SYMB);
  for                    return(FOR_SYMB);
  ...
  [0-9]+                 { /* convert to int */
                           yylval = atoi(yytext);
                           return(NUMBER); }
  [a-zA-Z][a-zA-Z0-9]*   { /* find in sym tab */
                           yylval = lookup(yytext);
                           return(IDENT); }
105Yacc
- Stands for Yet Another Compiler-Compiler
- It is a parser generator
- Parser generators are programs that take as input the grammar defining a language and produce as output a parser for that language
- A Yacc parser matches sequences of input tokens to the rules of the given grammar
106Yacc
- The specification file that Yacc takes as input consists of three sections
- Definitions: contains info about the tokens, data types and grammar rules required to build the parser
- Rules: contains the rules (in a form of BNF) of the grammar, along with actions in C code to be executed whenever a given rule is recognised
- Auxiliary routines: contains any auxiliary procedure and function declarations required to complete the parser
107Yacc Example
- Example format of rules
  assign : IDENT BECOMES expr SEMI    { /* action for assignment */ } ;
  while  : WHILE expr DO statement    { /* action for while stat */ } ;
- The parsing procedure produced by Yacc is called yyparse()
- returns an int value: 0 if the parse is successful, 1 otherwise
108Error Recovery in Yacc
- Errors need to be recognised and recovered from
- Yacc provides error productions as the principal way to achieve this
- Error productions have on their right hand side an error pseudotoken
- These productions identify a context in which erroneous tokens can be deleted until tokens are encountered that enable the parse to be re-started
- When errors are encountered appropriate syntax error messages will be generated
109Code Generator Generators
- CGGs remove the burden of deciding what code to generate for each construct
- Implementer produces a formal description of what each target machine instruction does
- CG automatically searches machine description to find the instruction(s) that produce desired computation
- Code almost as good as conventional compiler, but generation speed much slower
110Question
- Lex is a software tool that can be used to aid compiler construction. It is an example of which of the following?
- a) A scanner generator
- b) A parser generator
- c) A code generator generator
- d) A semantic analyser
- e) A code debugger
Answer: a) Lex is responsible for identifying tokens using regular expressions. It is therefore a scanner generator