Title: Levels of Programming Languages
1. Levels of Programming Languages
High-level program:
  class Triangle { ... float surface() { return b*h/2; } ... }
Low-level program:
  LOAD r1,b
  LOAD r2,h
  MUL r1,r2
  DIV r1,2
  RET
Executable machine code:
  0001001001000101001001001110110010101101001...
2. Compilers and other translators
- Examples
  - Chinese -> English
  - Java -> JVM byte codes
  - Scheme -> C
  - C -> Scheme
  - x86 Assembly Language -> x86 binary codes
- Other, non-traditional examples: disassembler,
  decompiler (e.g. JVM byte codes -> Java)
3. Tombstone Diagrams
- What are they?
  - diagrams consisting of a set of puzzle pieces we can
    use to reason about language processors and programs
  - different kinds of pieces
  - combination rules (not all diagrams are well formed)
4. Syntax Specification
- Syntax is specified using Context-Free Grammars (CFGs):
  - a finite set of terminal symbols
  - a finite set of non-terminal symbols
  - a start symbol
  - a finite set of production rules
- Usually CFGs are written in Backus-Naur Form (BNF)
  notation.
- A production rule in BNF notation is written as
  N ::= α, where N is a non-terminal and α is a sequence
  of terminals and non-terminals.
- N ::= α | β | ... is an abbreviation for several rules
  with N as left-hand side.
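As a small illustration of the notation (a fragment in the
style of Mini Triangle, not the full grammar from the slides):

  Command     ::=  V-name := Expression
               |   if Expression then Command else Command
               |   while Expression do Command
  Expression  ::=  primary-Expression
               |   Expression Operator primary-Expression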
5. Concrete and Abstract Syntax
- The previous grammar specified the concrete syntax of
  Mini Triangle.
The concrete syntax is important for the
programmer who needs to know exactly how to write
syntactically well-formed programs.
The abstract syntax omits irrelevant syntactic
details and only specifies the essential
structure of programs.
Example: different concrete syntaxes for an assignment:
  v := e      (set! v e)      e -> v      v = e
6. Abstract Syntax Trees
- Abstract Syntax Tree for d := d+10*n
  (Mini Triangle expressions are left-associative with no
  operator precedence, so the expression groups as (d+10)*n):

  AssignmentCmd( VName: SimpleVName( Ident d ),
                 BinaryExpression(
                     BinaryExpression( VNameExp( SimpleVName( Ident d ) ),
                                       Op +,
                                       IntegerExp( Int-Lit 10 ) ),
                     Op *,
                     VNameExp( SimpleVName( Ident n ) ) ) )
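One common way to realize such an AST in a compiler written
in Java is a small class hierarchy. The sketch below is
illustrative only; the class and field names are our own, not
those of the Triangle compiler:

  // Minimal AST sketch for Mini-Triangle-like assignments and expressions.
  abstract class Command { }
  abstract class Expression { }

  class AssignmentCmd extends Command {
      final String vname;          // target variable name
      final Expression expr;       // right-hand side
      AssignmentCmd(String vname, Expression expr) { this.vname = vname; this.expr = expr; }
  }

  class VNameExp extends Expression {
      final String name;
      VNameExp(String name) { this.name = name; }
  }

  class IntegerExp extends Expression {
      final int value;
      IntegerExp(int value) { this.value = value; }
  }

  class BinaryExpression extends Expression {
      final Expression left, right;
      final char op;
      BinaryExpression(Expression left, char op, Expression right) {
          this.left = left; this.op = op; this.right = right;
      }
  }

  class AstDemo {
      public static void main(String[] args) {
          // d := d+10*n, grouped as (d+10)*n (no operator precedence):
          Command example =
              new AssignmentCmd("d",
                  new BinaryExpression(
                      new BinaryExpression(new VNameExp("d"), '+', new IntegerExp(10)),
                      '*',
                      new VNameExp("n")));
      }
  }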
7. Contextual Constraints
Syntax rules alone are not enough to specify the format of
well-formed programs.
Example 1 (scope rules):
  let const m ~ 2 in m + x          (is x declared?)
Example 2 (type rules):
  let const m ~ 2; var n: Boolean
  in begin n := m < 4; n := n + 1 end   (is n + 1 well-typed?)
8. Semantics
Specification of semantics is concerned with specifying the
meaning of well-formed programs.
- Terminology
  - Expressions are evaluated and yield values (and may or
    may not perform side effects)
  - Commands are executed and perform side effects
  - Declarations are elaborated to produce bindings
- Side effects
  - change the values of variables
  - perform input/output
9. Phases of a Compiler
- A compiler's phases are the steps in transforming source
  code into object code.
- The different phases correspond roughly to the different
  parts of the language specification:
  - Syntax analysis      <-> Syntax
  - Contextual analysis  <-> Contextual constraints
  - Code generation      <-> Semantics
10. Compiler Passes
- A pass is a complete traversal of the source program, or
  a complete traversal of some internal representation of
  the source program.
- A pass can correspond to a phase, but it does not have to!
- Sometimes a single pass corresponds to several phases that
  are interleaved in time.
- What and how many passes a compiler does over the source
  program is an important design decision.
11. Syntax Analysis
Dataflow chart:
  Source Program (stream of characters)
    -> Scanner   (error reports)
    -> Stream of Tokens
    -> Parser    (error reports)
    -> Abstract Syntax Tree
12. Regular Expressions
- REs are a notation for expressing a set of strings of
  terminal symbols.
- Different kinds of RE:
  - ε       the empty string
  - t       generates only the string t
  - X Y     generates any string xy such that x is generated
            by X and y is generated by Y
  - X | Y   generates any string generated either by X or by Y
  - X*      the concatenation of zero or more strings
            generated by X
  - (X)     for grouping
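For instance (an illustrative example, not from the slides),
typical token classes can be written as
  Identifier = Letter (Letter | Digit)*
  Integer-Literal = Digit Digit*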
13. FA and the implementation of Scanners
- Regular expressions, NDFA-ε, NDFA and DFA are all
  equivalent formalisms in terms of what languages can be
  defined with them.
- Regular expressions are a convenient notation for
  describing the tokens of programming languages.
- Regular expressions can be converted into FAs (the
  algorithm for conversion into NDFA-ε is straightforward).
- DFAs can be easily implemented as computer programs.
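A minimal sketch of the last point, assuming a hypothetical
two-state DFA for the RE Digit Digit* (integer literals);
the class and method names are our own:

  // Hand-coded DFA for the RE  Digit Digit*  (integer literals).
  class IntLiteralScanner {
      /** Returns true if the whole input is accepted by the DFA. */
      static boolean scan(String input) {
          int state = 0;                      // 0 = start, 1 = accepting, -1 = dead
          for (char c : input.toCharArray()) {
              boolean digit = Character.isDigit(c);
              switch (state) {
                  case 0:  state = digit ? 1 : -1; break;  // need at least one digit
                  case 1:  state = digit ? 1 : -1; break;  // stay while digits keep coming
                  default: return false;                   // dead state: reject
              }
              if (state == -1) return false;
          }
          return state == 1;
      }

      public static void main(String[] args) {
          System.out.println(scan("2024"));   // true
          System.out.println(scan("20x4"));   // false
      }
  }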
14. JFlex: Lexical Analyzer Generator for Java
  Definition of tokens (regular expressions)
    -> JFlex
    -> Java file: Scanner class (recognizes tokens)
15. Parsing
- Parsing = recognition + determining the phrase structure
  (for example by generating an AST)
- Different types of parsing strategies
  - bottom-up
  - top-down
- Recursive descent parsing
  - what it is
  - how to implement one given an EBNF specification
- Bottom-up parsing algorithms
16. Top-down parsing
(Diagram: the parse tree for the sentence "The cat sees a
rat." is built top-down, starting from the start symbol
Sentence and growing downwards towards the input words.)
17. Bottom-up parsing
(Diagram: the parse tree for the same sentence "The cat sees
a rat." is built bottom-up, starting from the input words and
growing upwards towards the start symbol Sentence.)
18. Development of Recursive Descent Parser
- (1) Express the grammar in EBNF
- (2) Grammar transformations
  - left factorization and left recursion elimination
- (3) Create a parser class with
  - a private variable currentToken
  - methods to call the scanner: accept and acceptIt
- (4) Implement private parsing methods
  - add a private parseN method for each non-terminal N
  - a public parse method that
    - gets the first token from the scanner
    - calls parseS (S is the start symbol of the grammar)
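A minimal sketch of steps (3) and (4) for the toy grammar
S ::= 'a' (',' 'a')* ; the token handling is simplified to
single characters and all names are illustrative, not those
of any particular compiler:

  // Recursive descent parser sketch for  S ::= 'a' (',' 'a')*
  class Parser {
      private final String input;     // stands in for the scanner's token stream
      private int pos = 0;
      private char currentToken;      // one-token lookahead

      Parser(String input) { this.input = input; }

      private void acceptIt() {       // unconditionally move to the next token
          currentToken = pos < input.length() ? input.charAt(pos++) : '$';
      }

      private void accept(char expected) {   // check the current token, then move on
          if (currentToken != expected)
              throw new RuntimeException("expected '" + expected + "' but saw '" + currentToken + "'");
          acceptIt();
      }

      // one parsing method per non-terminal: parseS handles  S ::= 'a' (',' 'a')*
      private void parseS() {
          accept('a');
          while (currentToken == ',') {   // EBNF repetition becomes a while loop
              acceptIt();
              accept('a');
          }
      }

      public void parse() {
          acceptIt();                 // get the first token from the "scanner"
          parseS();                   // S is the start symbol
          if (currentToken != '$')
              throw new RuntimeException("unexpected trailing input");
      }

      public static void main(String[] args) {
          new Parser("a,a,a").parse();
          System.out.println("parsed OK");
      }
  }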
19. LL(1) Grammars
- The presented algorithm to convert EBNF into a parser does
  not work for all possible grammars.
- It only works for so-called LL(1) grammars.
- Basically, an LL(1) grammar is a grammar which can be
  parsed with a top-down parser with a lookahead (in the
  input stream of tokens) of one token.
- What grammars are LL(1)?
- How can we recognize that a grammar is (or is not) LL(1)?
  => We can deduce the necessary conditions from the parser
  generation algorithm.
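A standard illustration (not from the slides): the
left-recursive rule E ::= E + T | T is not LL(1), because
with one token of lookahead the parser cannot tell which
alternative to start; left recursion elimination rewrites it
into the LL(1) form E ::= T (+ T)*. Similarly, A ::= x y | x z
needs left factorization into A ::= x (y | z).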
20. LR parsing
- The algorithm makes use of a stack.
- The first item on the stack is the initial state of a DFA.
- A state of the automaton is a set of LR(0)/LR(1) items.
- The initial state is constructed from items of the form
  S ::= • α (where S is the start symbol of the CFG).
- The stack contains, in alternating order:
  - a DFA state
  - a terminal symbol or part (subtree) of the parse tree
    being constructed
- The items on the stack are related by transitions of the
  DFA.
- There are two basic actions in the algorithm:
  - shift: get the next input token
  - reduce: build a new node (remove children from the stack)
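As a tiny illustration of shift and reduce (our own example,
for the grammar S ::= S + n | n; the DFA states interleaved
on the stack are omitted for readability), parsing "n + n"
proceeds as:

  Stack        Input      Action
  (empty)      n + n $    shift
  n            + n $      reduce S ::= n
  S            + n $      shift
  S +          n $        shift
  S + n        $          reduce S ::= S + n
  S            $          accept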
21. Bottom Up Parsers: Overview of Algorithms
- LR(0): the simplest algorithm, theoretically important but
  rather weak (not practical)
- SLR: an improved version of LR(0), more practical but
  still rather weak
- LR(1): the LR(0) algorithm with an extra lookahead token
  - very powerful algorithm; not often used because of large
    memory requirements (very big parsing tables)
- LALR: a watered-down version of LR(1)
  - still very powerful, but has much smaller parsing tables
  - the most commonly used algorithm today
22. JavaCUP: A LALR generator for Java
  Grammar (BNF-like specification) -> JavaCUP
    -> Java file: Parser class (uses the Scanner to get
       tokens, parses the stream of tokens)
  Definition of tokens (regular expressions) -> JFlex
    -> Java file: Scanner class (recognizes tokens)
  Together, the generated Parser and Scanner classes form
  the syntactic analyzer.
23. Contextual Analysis -> Decorated AST
Annotations:
  - the result of identification: each applied occurrence of
    an identifier is linked to its declaration
  - :type, the result of type checking, attached to each
    expression node
(Diagram: decorated AST for a program of the form
  let var n: Integer; var c: Char
  in begin c := <character literal>; n := n + 1 end
The tree is Program -> LetCommand with a SequentialDeclaration
(VarDecl n: Integer, VarDecl c: Char) and a SequentialCommand
of two AssignCommands; the CharExpr, VNameExp, IntExpr and
BinaryExpr nodes are decorated with the types char and int.)
24. Nested Block Structure
A language exhibits nested block structure if blocks may be
nested one within another (typically with no upper bound on
the level of nesting that is allowed).
- There can be any number of scope levels (depending on the
  level of nesting of blocks)
- Typical scope rules:
  - no identifier may be declared more than once within the
    same block (at the same level)
  - for any applied occurrence there must be a corresponding
    declaration, either within the same block or in a block
    in which it is nested
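A common way to implement these scope rules is an
identification table organized as a stack of scopes. The
sketch below is illustrative; the class and method names are
our own, not those of the Triangle compiler:

  import java.util.ArrayDeque;
  import java.util.Deque;
  import java.util.HashMap;
  import java.util.Map;

  // Identification table sketch: a stack of scopes, one map per block level.
  class IdTable {
      private final Deque<Map<String, String>> scopes = new ArrayDeque<>();

      void openScope()  { scopes.push(new HashMap<>()); }   // entering a block
      void closeScope() { scopes.pop(); }                   // leaving a block

      /** Declare an identifier in the current block; false if already declared here. */
      boolean enter(String id, String attribute) {
          return scopes.peek().putIfAbsent(id, attribute) == null;
      }

      /** Look up an applied occurrence: innermost block first, then enclosing blocks. */
      String retrieve(String id) {
          for (Map<String, String> scope : scopes)          // iterates from the innermost scope
              if (scope.containsKey(id)) return scope.get(id);
          return null;                                      // no corresponding declaration
      }
  }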
25. Type Checking
- For most statically typed programming languages, a
  bottom-up algorithm over the AST
- Types of expression AST leaves are known immediately
  - literals -> obvious
  - variables -> from the ID table
  - named constants -> from the ID table
- Types of internal nodes are inferred from the types of the
  children and the type rule for that kind of expression
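Continuing the illustrative AST classes sketched earlier, a
bottom-up type check for expressions could look roughly like
this (again a sketch with our own names; a real checker would
consult the identification table for variables and constants):

  // Bottom-up type checking sketch over the expression AST sketched earlier.
  class Checker {
      enum Type { INT, BOOL, ERROR }

      /** Placeholder for an ID-table lookup. */
      Type typeOfVariable(String name) { return Type.INT; }

      Type check(Expression e) {
          if (e instanceof IntegerExp)  return Type.INT;                // literal: obvious
          if (e instanceof VNameExp v)  return typeOfVariable(v.name);  // variable: from the ID table
          if (e instanceof BinaryExpression b) {
              Type left  = check(b.left);                               // first type the children...
              Type right = check(b.right);
              switch (b.op) {                                           // ...then apply the operator's type rule
                  case '+': case '*':
                      return (left == Type.INT && right == Type.INT) ? Type.INT : Type.ERROR;
                  case '<':
                      return (left == Type.INT && right == Type.INT) ? Type.BOOL : Type.ERROR;
                  default:
                      return Type.ERROR;
              }
          }
          return Type.ERROR;
      }
  }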
26. Runtime organization
- Data representation: how to represent values of the source
  language on the target machine
  - primitives, arrays, structures, unions, pointers
- Expression evaluation: how to organize computing the
  values of expressions (taking care of intermediate results)
  - register vs. stack machine
- Storage allocation: how to organize storage for variables
  (considering the different lifetimes of global, local and
  heap variables)
  - activation records, static links
- Routines: how to implement procedures and functions (and
  how to pass their parameters and return values)
  - value vs. reference, closures, recursion
- Object orientation: runtime organization for OO languages
  - method tables
27. Tricky sort
(Diagram: snapshots of the runtime stack during the "tricky
sort" example, showing activation frames for calls of
identity and check made through the routine parameter p, with
i = 88 and n taking the values 23, 15, 7 and 88.)
28. JVM
(Diagram: .class files are the external representation and
are platform independent; they are loaded into the JVM's
internal representation, which is implementation dependent:
classes, primitive types, integers, objects, arrays, methods.)
The JVM is an abstract machine in the true sense of the word.
The JVM spec does not specify implementation details (these
can depend on the target OS/platform, performance
requirements, etc.). The JVM spec defines a machine
independent class file format that all JVM implementations
must support.
29Inspecting JVM code
javac Factorial.java javap -c -verbose
Factorial Compiled from Factorial.java public
class Factorial extends java.lang.Object
public Factorial() / Stack1, Locals1,
Args_size1 / public int fac(int)
/ Stack2, Locals4, Args_size2 / Method
Factorial() 0 aload_0 1 invokespecial 1
ltMethod java.lang.Object()gt 4 return
30. Inspecting JVM Code ...
// local variables at addresses 0, 1, 2, 3: this, n, result, i
Method int fac(int)    // stack: this n result i
 0 iconst_1            // stack: this n result i 1
 1 istore_2            // stack: this n result i
 2 iconst_2            // stack: this n result i 2
 3 istore_3            // stack: this n result i
 4 goto 14
 7 iload_2             // stack: this n result i result
 8 iload_3             // stack: this n result i result i
 9 imul                // stack: this n result i result*i
10 istore_2
11 iinc 3 1
14 iload_3             // stack: this n result i i
15 iload_1             // stack: this n result i i n
16 if_icmple 7         // stack: this n result i
19 iload_2             // stack: this n result i result
20 ireturn
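For reference, a Java method consistent with this bytecode
would look roughly like the following (our reconstruction;
the original Factorial.java source is not shown in the
slides):

  public class Factorial {
      // Local slots match the bytecode: 0 = this, 1 = n, 2 = result, 3 = i.
      public int fac(int n) {
          int result = 1;                  // iconst_1 / istore_2
          for (int i = 2; i <= n; i++) {   // istore_3, if_icmple, iinc 3 1
              result = result * i;         // iload_2, iload_3, imul, istore_2
          }
          return result;                   // iload_2, ireturn
      }
  }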
31. Code Generation
Source program:
  let var n: Integer;
      var c: Char
  in begin
    c := '&';
    n := n + 1
  end
Target program:
  PUSH 2
  LOADL 38
  STORE 1[SB]
  LOAD 0[SB]
  LOADL 1
  CALL add
  STORE 0[SB]
  POP 2
  HALT
Source and target program must be semantically equivalent.
The semantic specification of the source language is
structured in terms of phrases in the SL (expressions,
commands, etc.) => Code generation follows the same
inductive structure.
32. Specifying Code Generation with Code Templates
The code generation functions for Mini Triangle:

  Phrase class  Function      Effect of the generated code
  ------------  ------------  ----------------------------------------
  Program       run P         Run program P then halt, starting and
                              finishing with an empty stack.
  Command       execute C     Execute command C. May update variables
                              but does not shrink or grow the stack!
  Expression    evaluate E    Evaluate E; the net result is pushing
                              the value of E on the stack.
  V-name        fetch V       Push the value of the constant or
                              variable V on the stack.
  V-name        assign V      Pop a value from the stack and store it
                              in variable V.
  Declaration   elaborate D   Elaborate the declaration, making space
                              on the stack for the constants and
                              variables in the declaration.
33Code Generation with Code Templates
While command
execute while E do C JUMP h g execute
C h evaluateE JUMPIF(1) g
C
E
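A sketch of how a code generator might emit this template,
assuming hypothetical emit/patch helpers; this illustrates
the backpatching of the forward JUMP and is not code from the
Triangle compiler:

  // Code generation sketch for:  execute [while E do C]
  class WhileCodeGen {
      private final java.util.List<String> code = new java.util.ArrayList<>();

      private int  nextAddr()               { return code.size(); }
      private void emit(String instr)       { code.add(instr); }
      private void patch(int at, String s)  { code.set(at, s); }

      void executeWhile(Runnable executeC, Runnable evaluateE) {
          int jumpToTest = nextAddr();
          emit("JUMP ?");                  //    JUMP h  (forward jump, patched below)
          int g = nextAddr();
          executeC.run();                  // g: execute C
          int h = nextAddr();
          patch(jumpToTest, "JUMP " + h);  //    backpatch the forward jump to h
          evaluateE.run();                 // h: evaluate E
          emit("JUMPIF(1) " + g);          //    loop back to g while E is true
      }
  }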
34. Code improvement (optimization)
- The code generated by our compiler is not efficient
  - it computes values at runtime that could be known at
    compile time
  - it computes values more times than necessary
- We can do better!
  - constant folding
  - common sub-expression elimination
  - code motion
  - dead code elimination
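A small illustrative example of the first two techniques
(our own, not from the slides):

  class FoldExample {
      static int before(int x, int y) {
          int a = 2 * 3 + x;           // 2*3 is known at compile time
          int b = (x + y) * (x + y);   // x+y is computed twice
          return a + b;
      }
      static int after(int x, int y) {
          int a = 6 + x;               // constant folding
          int t = x + y;               // common sub-expression computed once
          int b = t * t;
          return a + b;
      }
  }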
35. Optimization implementation
- Is the optimization correct or safe?
- Is the optimization an improvement?
- What sort of analyses do we need to perform to get the
  required information?
  - local
  - global