Title: Foundations of Software Design
1Foundations of Software Design
Lecture 24: Compilers, Lexers, and Parsers; Intro to Graphs
Marti Hearst, Fall 2002
2How Do Computers Work (Revisited)?
Machine Instructions
Bits Bytes
Binary Numbers
3The Compiler
- What is a compiler?
- A recognizer (of some source language L).
- A translator (of programs written in L into programs written in some object or target language L').
- A compiler is itself a program, written in some host language.
- It operates in phases.
Programming Languages
Assembly Language
Machine Instructions
4Converting Java to Byte Code
- When you compile a Java program, javac produces byte codes (stored in the class file).
- javac does not convert the byte codes to machine code.
- Instead, they are interpreted in the VM when you run the program called java.
5[Diagram: how C code and Java code are translated]
- C code: translated by the C compiler (gcc or cc) into assembly language, then into machine code. This path creates the JVM once.
- Java code: translated by the Java compiler (javac or JIT) into byte code (class file). Each individual program is loaded and run in the Java Virtual Machine.
6Compiler Compilers
- Which came first, the compiler or the program?
- The very first one has to be written in assembly language!
- This is why most programming languages today start from C: the first compiler for a new language is typically written in C.
- After you have created the first compiler for a given language, say Java, you can use that compiler to compile itself!
7Compiling Your Compiler
- Write the first Java compiler using C and compile it using gcc. The result is javac in C.
- Write the second Java compiler using Java and compile it using javac (the C version). The result is javac in Java.
- Write other Java programs and compile them using javac.
8Compiler in more detail.
Lexical analyzer (scanner)
Syntax analyzer (parser)
Semantic analyzer
Intermediate Code Generator
Optimizer
Code Generator
9The Scanner
- Task: translate the sequence of characters into a corresponding sequence of tokens (by grouping characters into lexemes).
- How it is done: specify lexemes using regular expressions, then convert these regular expressions into finite automata.
10Lexemes and Tokens
- Here are some Java lexemes and the corresponding tokens:
Lexeme    Token
;         SEMI-COLON
=         ASSIGN
index     IDENT
tmp       IDENT
37        INT-LIT
102       INT-LIT
- Note that multiple lexemes can correspond to the same token (e.g., there are many identifiers).
- Given the source code
- position = initial + rate * 60;
- a Java scanner would return the following sequence of tokens:
- IDENT ASSIGN IDENT PLUS IDENT TIMES INT-LIT SEMI-COLON
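The lexeme-to-token mapping above can be sketched as a toy scanner that tries one regular expression per token at the current position. The token names come from the slide; the class name ToyScanner and the patterns themselves are assumptions made for illustration, and a real Java scanner handles far more (keywords, comments, longest-match rules).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy scanner (hypothetical): one regular expression per token.
class ToyScanner {
    // Token names follow the slide; the patterns are illustrative assumptions.
    private static final String[] NAMES = {
        "IDENT", "INT-LIT", "ASSIGN", "PLUS", "TIMES", "SEMI-COLON"
    };
    private static final Pattern[] PATTERNS = {
        Pattern.compile("[A-Za-z_][A-Za-z0-9_]*"),
        Pattern.compile("[0-9]+"),
        Pattern.compile("="),
        Pattern.compile("\\+"),
        Pattern.compile("\\*"),
        Pattern.compile(";"),
    };

    // Groups characters into lexemes and returns the token for each lexeme.
    static List<String> scan(String source) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < source.length()) {
            if (Character.isWhitespace(source.charAt(pos))) {
                pos++;                       // whitespace separates lexemes
                continue;
            }
            boolean matched = false;
            for (int i = 0; i < PATTERNS.length && !matched; i++) {
                Matcher m = PATTERNS[i].matcher(source);
                m.region(pos, source.length());
                if (m.lookingAt()) {         // match must start exactly at pos
                    tokens.add(NAMES[i]);
                    pos = m.end();
                    matched = true;
                }
            }
            if (!matched) {                  // a lexical error: no token fits here
                throw new IllegalArgumentException(
                    "Lexical error at offset " + pos + ": '" + source.charAt(pos) + "'");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("position = initial + rate * 60;"));
    }
}
```

Calling scan on a string containing a character that no pattern covers (for example '#') raises the lexical error instead of returning tokens.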
11The Scanner
- Also called the lexer.
- How it works:
- Reads characters from the source program.
- Groups the characters into lexemes (sequences of characters that "go together").
- Each lexeme corresponds to a token; the scanner returns the next token (plus maybe some additional information) to the parser.
- The scanner may also discover lexical errors (e.g., erroneous characters).
- The definitions of what is a lexeme, token, or bad character all depend on the source language.
12Two kinds of Automata
- Deterministic (DFA)
- No state has more than one outgoing edge with the same label.
- Non-deterministic (NFA)
- States may have more than one outgoing edge with the same label.
- Edges may be labeled with ε (epsilon), the empty string.
- The automaton can take an ε (epsilon) transition without looking at the current input character.
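One way to see the difference is to simulate a small NFA by tracking the set of states it could be in. The sketch below is illustrative only: the class TinyNfa and its edge table are made up for this example. The automaton accepts strings over a and b that end in abb, state 1 has two outgoing edges labeled a, and state 0 reaches state 1 through an epsilon edge.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical NFA for strings over {a, b} that end in "abb".
class TinyNfa {
    static final int EPSILON = -1;           // label for epsilon edges
    // Each edge is {from, label, to}.
    static final int[][] EDGES = {
        {0, EPSILON, 1},                     // taken without reading input
        {1, 'a', 1},
        {1, 'b', 1},
        {1, 'a', 2},                         // second 'a' edge out of state 1
        {2, 'b', 3},
        {3, 'b', 4},
    };
    static final int START = 0;
    static final int ACCEPT = 4;

    // Adds every state reachable through epsilon edges alone.
    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new HashSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] e : EDGES) {
                if (e[0] == s && e[1] == EPSILON && result.add(e[2])) {
                    work.push(e[2]);
                }
            }
        }
        return result;
    }

    // Follows all possible edges at once by keeping a set of current states.
    static boolean accepts(String input) {
        Set<Integer> start = new HashSet<>();
        start.add(START);
        Set<Integer> current = closure(start);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            Set<Integer> next = new HashSet<>();
            for (int[] e : EDGES) {
                if (current.contains(e[0]) && e[1] == c) {
                    next.add(e[2]);
                }
            }
            current = closure(next);
        }
        return current.contains(ACCEPT);
    }
}
```

Each set of states that this simulation can reach corresponds to one state of an equivalent DFA, which is the idea behind converting an NFA to a DFA.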
13Regular Expressions to Finite Automata
Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA
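As a rough sketch of what a table-driven DFA might look like, the class below recognizes simple identifiers (a letter or underscore followed by letters, digits, or underscores). The class name IdentifierDfa, the state numbering, and the character classes are assumptions for illustration; a scanner generator would produce a much larger table covering every token.

```java
// Hypothetical table-driven DFA for simple identifiers.
class IdentifierDfa {
    // Character classes are the columns of the transition table.
    private static final int LETTER = 0;
    private static final int DIGIT = 1;
    private static final int OTHER = 2;

    // NEXT[state][characterClass] gives the single next state.
    private static final int[][] NEXT = {
        /* state 0: start         */ {1, 2, 2},
        /* state 1: in identifier */ {1, 1, 2},
        /* state 2: dead          */ {2, 2, 2},
    };
    private static final boolean[] ACCEPTING = {false, true, false};

    private static int classify(char c) {
        if (Character.isLetter(c) || c == '_') return LETTER;
        if (Character.isDigit(c)) return DIGIT;
        return OTHER;
    }

    // Runs the automaton: one table lookup per input character.
    static boolean accepts(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            state = NEXT[state][classify(input.charAt(i))];
        }
        return ACCEPTING[state];
    }
}
```

Because the automaton is deterministic, each table cell holds exactly one next state, so the loop never needs to backtrack or track several states.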
14BNF
- Backus-Naur form, also called Backus normal form
- A set of rules (or productions), each of which expresses the ways symbols of the language can be grouped together
- Non-terminals are written upper-case
- Terminals are written lower-case
- The start symbol is the left-hand side of the first production
- The rules for a context-free grammar (CFG) are often referred to as its BNF
15Java Identifier Definition
- Described in the Java specification:
- http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#44591
- An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.
- An identifier cannot have the same spelling (Unicode character sequence) as a keyword (§3.9), Boolean literal (§3.10.3), or the null literal (§3.10.7).
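The definition above can be approximated with the standard methods Character.isJavaIdentifierStart and Character.isJavaIdentifierPart. This is a minimal sketch: the class name IdentifierCheck is made up, the RESERVED set is deliberately a small partial list rather than the full keyword table, and characters outside the basic multilingual plane are not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical checker for the identifier definition.
class IdentifierCheck {
    // Partial list only (an assumption for brevity): a few keywords
    // plus the Boolean literals and the null literal.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
        "class", "int", "for", "while", "if", "else", "return",
        "true", "false", "null"));

    static boolean isIdentifier(String s) {
        // The first character must be a Java letter.
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        // The rest must be Java letters or Java digits.
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        // The spelling must not collide with a keyword or literal.
        return !RESERVED.contains(s);
    }
}
```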
16Java Identifier Definition
17Java Integer Literals
- An integer literal may be expressed in decimal (base 10), hexadecimal (base 16), or octal (base 8).
- Examples:
- 0 2 0372 0xDadaCafe 1996 0x00FF00FF
(opt means optional)
18Defining Java Decimal Numerals
- A decimal numeral is either the single ASCII
character 0, representing the integer zero, or
consists of an ASCII digit from 1 to 9,
optionally followed by one or more ASCII digits
from 0 to 9, representing a positive integer
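The decimal definition translates almost word for word into the regular expression 0|[1-9][0-9]*. The sketch below adds hexadecimal and octal patterns in the same spirit; the class name IntegerLiterals is made up, and the optional type suffix (l or L) is left out as a simplifying assumption.

```java
import java.util.regex.Pattern;

// Hypothetical classifier for integer literals (type suffix omitted).
class IntegerLiterals {
    // Either the single digit 0, or a digit 1-9 followed by digits 0-9.
    static final Pattern DECIMAL = Pattern.compile("0|[1-9][0-9]*");
    // 0x or 0X followed by one or more hexadecimal digits.
    static final Pattern HEX = Pattern.compile("0[xX][0-9a-fA-F]+");
    // A leading 0 followed by one or more octal digits.
    static final Pattern OCTAL = Pattern.compile("0[0-7]+");

    static String classify(String lexeme) {
        if (DECIMAL.matcher(lexeme).matches()) return "decimal";
        if (HEX.matcher(lexeme).matches()) return "hexadecimal";
        if (OCTAL.matcher(lexeme).matches()) return "octal";
        return "not an integer literal";
    }
}
```

With these patterns, 0372 is classified as octal rather than decimal, because a decimal numeral longer than one digit may not start with 0.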
19Defining Floating-Point Literals
- A floating-point literal has the following
parts: a whole-number part, a decimal point
(represented by an ASCII period character), a
fractional part, an exponent, and a type suffix.
The exponent, if present, is indicated by the
ASCII letter e or E followed by an optionally
signed integer.
20From the Lucene HTML Scanner
21The Functionality of the Parser
- Input: sequence of tokens from lexical analysis
- Output: parse tree of the program
- A parse tree is generated if the input is a legal program
- If the input is an illegal program, syntax errors are issued
- Note: instead of a parse tree, some parsers directly produce
- an abstract syntax tree (AST) and symbol table, or
- intermediate code, or
- object code
22Parser vs. Scanner
Phase     Input                  Output
Scanner   String of characters   String of tokens
Parser    String of tokens       Parse tree
23The Parser
- Groups tokens into "grammatical phrases", discovering the underlying structure of the source program.
- Finds syntax errors.
- Example:
- position = * 5;
- corresponds to the sequence of tokens
- IDENT ASSIGN TIMES INT-LIT SEMI-COLON
- All are legal tokens, but that sequence of tokens is erroneous.
- Might find some "static semantic" errors, e.g., a use of an undeclared variable, or variables that are multiply declared.
- Might generate code, or build some intermediate representation of the program such as an abstract-syntax tree.
24What must the parser do?
- Recognizer: not all strings of tokens are programs
- must distinguish between valid and invalid strings of tokens
- Translator: must expose program structure
- e.g., associativity and precedence
- must return the parse tree
- We need
- A language for describing valid strings of tokens: context-free grammars (analogous to regular expressions in the scanner)
- A method for distinguishing valid from invalid strings of tokens (and for building the parse tree): the parser (analogous to the state machine in the scanner)
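Both jobs, recognizing and exposing structure, can be sketched with a small recursive-descent parser. The grammar in the comments is an assumption made up for this example (it covers only simple assignment statements), as is the class name ToyParser. Each token is passed as a {token, lexeme} pair, and the tree is returned as a parenthesized string instead of a real tree data structure.

```java
// Hypothetical recursive-descent parser for a tiny assumed grammar:
//   STMT   -> ident assign EXPR semi-colon
//   EXPR   -> TERM (plus TERM)*
//   TERM   -> FACTOR (times FACTOR)*
//   FACTOR -> ident | int-lit
class ToyParser {
    private final String[][] tokens;   // each entry is {token, lexeme}
    private int pos = 0;

    private ToyParser(String[][] tokens) { this.tokens = tokens; }

    // Returns the tree as text, or throws on a syntax error.
    static String parse(String[][] tokens) {
        ToyParser p = new ToyParser(tokens);
        String tree = p.statement();
        if (p.pos < tokens.length) {
            throw new IllegalStateException("Syntax error: extra tokens after statement");
        }
        return tree;
    }

    private String peek() {
        return pos < tokens.length ? tokens[pos][0] : "EOF";
    }

    // Consumes the expected token and returns its lexeme.
    private String expect(String token) {
        if (!peek().equals(token)) {
            throw new IllegalStateException("Syntax error at token " + pos
                + ": expected " + token + " but found " + peek());
        }
        return tokens[pos++][1];
    }

    private String statement() {
        String target = expect("IDENT");
        expect("ASSIGN");
        String value = expr();
        expect("SEMI-COLON");
        return "(= " + target + " " + value + ")";
    }

    // TIMES is handled one level below PLUS, so it binds more tightly.
    private String expr() {
        String left = term();
        while (peek().equals("PLUS")) {
            pos++;
            left = "(+ " + left + " " + term() + ")";
        }
        return left;
    }

    private String term() {
        String left = factor();
        while (peek().equals("TIMES")) {
            pos++;
            left = "(* " + left + " " + factor() + ")";
        }
        return left;
    }

    private String factor() {
        if (peek().equals("INT-LIT")) return expect("INT-LIT");
        return expect("IDENT");
    }
}
```

Precedence falls out of the grammar's shape: for the tokens of position = initial + rate * 60; the multiplication ends up nested inside the addition, while the token sequence IDENT ASSIGN TIMES INT-LIT SEMI-COLON is rejected with a syntax error.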
25Parser Example
[Figure: tree built by the parser for position = initial + rate * 60; with leaves position, initial, rate, and 60]
26The Semantic Analyzer
- The semantic analyzer checks for (more) "static semantic" errors, e.g., type errors.
- Annotates and/or changes the abstract syntax tree (e.g., it might annotate each node that represents an expression with its type).
- Example with before and after:
[Figure: the abstract syntax tree before and after semantic analysis. Afterwards, position, initial, rate, and the operator nodes are annotated (float); the literal 60 is annotated (int) and is wrapped in an int-to-float() node whose result is (float).]
27Intermediate Code Generator
- The intermediate code generator translates from abstract-syntax tree to intermediate code.
- One possibility is 3-address code.
- Here's an example of 3-address code for the abstract-syntax tree shown above:
- temp1 = int-to-float(60)
- temp2 = rate * temp1
- temp3 = initial + temp2
- position = temp3
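A post-order walk of the tree is one simple way to produce code like this: generate code for the children first, then emit one instruction that combines their results into a fresh temporary. The sketch below is illustrative; the Node class, the ThreeAddressGen name, and the temp naming scheme are assumptions, not the lecture's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical 3-address code generator over a minimal AST.
class ThreeAddressGen {
    // A node is a leaf (no children), a unary operation (left child only),
    // or a binary operation (both children).
    static final class Node {
        final String label;
        final Node left;
        final Node right;

        Node(String label, Node left, Node right) {
            this.label = label;
            this.left = left;
            this.right = right;
        }

        static Node leaf(String name) {
            return new Node(name, null, null);
        }
    }

    private final List<String> code = new ArrayList<>();
    private int tempCount = 0;

    // Emits code for the subtree and returns the name that holds its value.
    private String gen(Node n) {
        if (n.left == null) {
            return n.label;                      // variable or literal
        }
        String a = gen(n.left);
        if (n.right == null) {                   // unary, e.g. int-to-float
            String t = "temp" + (++tempCount);
            code.add(t + " = " + n.label + "(" + a + ")");
            return t;
        }
        String b = gen(n.right);
        String t = "temp" + (++tempCount);
        code.add(t + " = " + a + " " + n.label + " " + b);
        return t;
    }

    static List<String> translateAssignment(String target, Node value) {
        ThreeAddressGen g = new ThreeAddressGen();
        String result = g.gen(value);
        g.code.add(target + " = " + result);
        return g.code;
    }
}
```

Each emitted line names at most three addresses (one result and up to two operands), which is where the name 3-address code comes from.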
28The Optimizer
int count = 0;
for (int j = 0; j < 2*5; j++) {
    int temp = j + 1;
    count += 3;
}
- Examine the program and rewrite it in ways that preserve the meaning but are more efficient.
- Optimizers are incredibly complex programs and algorithms.
- Example:
- Move the declaration of temp outside the loop so it isn't re-declared every time the loop is executed.
- Change 2*5 to 10 since it is a constant (no need to do an expensive multiply at run time).
- If we removed the line with temp, the program might even skip the loop altogether.
- You can see in advance that count ends up 30.
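The rewrites listed above can be written out by hand to check that each step preserves the meaning. In this sketch the loop is read as counting j from 0 up to 2*5 and adding 3 to count each time; the operators are a reconstruction of the slide's code, and the class and method names are made up. A real optimizer works on intermediate code, not on Java source.

```java
// Hand-applied versions of the slide's optimizations (illustrative only).
class LoopOptimization {
    // The loop as read from the slide.
    static int original() {
        int count = 0;
        for (int j = 0; j < 2 * 5; j++) {
            int temp = j + 1;        // computed but never used
            count += 3;
        }
        return count;
    }

    // Step 1: replace the constant 2 * 5 with 10 and drop the unused temp.
    static int folded() {
        int count = 0;
        for (int j = 0; j < 10; j++) {
            count += 3;
        }
        return count;
    }

    // Step 2: the loop always adds 3 ten times, so the result is known
    // in advance and the loop can be skipped altogether.
    static int precomputed() {
        return 30;
    }

    public static void main(String[] args) {
        System.out.println(original() + " " + folded() + " " + precomputed());
    }
}
```

All three versions return the same value, which is the property an optimizer must preserve.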
29The Code Generator
- The code generator generates object code from (optimized) intermediate code.
- Example:
- LOADF rate,R1
- MULF 60.0,R1
- LOADF initial,R2
- ADDF R2,R1
- STOREF R1,position
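To check that these five instructions really compute position = initial + rate * 60, they can be run on a toy interpreter. The instruction semantics here are assumptions, since the slide does not define them: the first operand is read as the source and the second as the destination, MULF and ADDF combine the source into the destination register, and operands that are neither registers nor variables are treated as numeric constants. The class name ToyMachine is made up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interpreter for the slide's pseudo-assembly.
class ToyMachine {
    // Runs the program and returns the final contents of memory.
    static Map<String, Double> run(String[] program, Map<String, Double> initialMemory) {
        Map<String, Double> regs = new HashMap<>();
        Map<String, Double> mem = new HashMap<>(initialMemory);
        for (String line : program) {
            // "LOADF rate,R1" splits into op, source, destination.
            String[] parts = line.trim().split("[\\s,]+");
            String op = parts[0];
            String src = parts[1];
            String dst = parts[2];
            // Source is a register, a variable in memory, or a constant.
            double value = regs.containsKey(src) ? regs.get(src)
                         : mem.containsKey(src) ? mem.get(src)
                         : Double.parseDouble(src);
            switch (op) {
                case "LOADF":  regs.put(dst, value); break;
                case "MULF":   regs.put(dst, regs.get(dst) * value); break;
                case "ADDF":   regs.put(dst, regs.get(dst) + value); break;
                case "STOREF": mem.put(dst, value); break;
                default:
                    throw new IllegalArgumentException("Unknown instruction: " + op);
            }
        }
        return mem;
    }
}
```

With initial set to 10.0 and rate set to 2.0, running the five instructions leaves 130.0 in position, matching 10.0 + 2.0 * 60.0.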
30Tools
- Scanner generator
- Used to create a scanner automatically
- Input: a regular expression for each token to be recognized
- Output: a finite state machine
- Examples: lex or flex (produce C code), or JLex (produces Java)
- Compiler compilers
- yacc (produces C) or JavaCC (produces Java, also has a scanner generator).
31From the Lucene HTML Parser
32From the Lucene HTML Parser
33Graphs / Networks
34What is a Graph?
47Next Time
- Graph Traversal
- Directed Graphs (digraphs)
- DAGs (directed acyclic graphs)
- Weighted Graphs