Title: CSC 8310 Linguistics of Programming Languages
1CSC 8310 Linguistics of Programming Languages
- Fall 2009
- Instructor Vijay Gehlot
- Week 2
2Phases/Components of Compiler
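- Source
- 1. Scanner (Lexical Analyzer): produces Tokens
- 2. Parser (Syntax Analyzer): produces AST
- 3. Semantic Analyzer: produces Annotated AST
- 4-5. Intermediate Code Generator and Optimizer: produce Intermediate Code
- 6-7. Code Generator and Optimizer: produce Machine Code
- Target
- Other components: Symbol Table and Error Handler
- 1-3: Analysis Phase; 4-7: Synthesis Phase
- 1-5: Front End; 6-7: Back End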
31. Scanner (Lexical Analyzer)
- Remove white spaces and comments
- Tokenize (group into meaningful pieces)
- Handle lexical errors (this is the first stage of compilation that can report a compilation error)
- Tokens may have values
- Token: id (identifier)
- Token value: e.g. x
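The scanner's tasks listed on this slide can be sketched in a few lines of Java. This is only an illustrative sketch: the class name MiniScanner, the token kinds ID, NUM and OP, the operator set, and the choice of // as the comment syntax are all assumptions made for the example, not part of the course material.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hand-written scanner; names and token kinds are assumptions.
class MiniScanner {
    // A token pairs a kind (the "token id") with the matched text (the "token value").
    static final class Token {
        final String kind;
        final String value;
        Token(String kind, String value) { this.kind = kind; this.value = value; }
        @Override public String toString() { return kind + "(" + value + ")"; }
    }

    static List<Token> scan(String src) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // remove white space
            } else if (c == '/' && i + 1 < src.length() && src.charAt(i + 1) == '/') {
                while (i < src.length() && src.charAt(i) != '\n') i++;  // remove a // comment
            } else if (Character.isLetter(c)) {
                int start = i;                         // identifier: letter (letter | digit)*
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                tokens.add(new Token("ID", src.substring(start, i)));
            } else if (Character.isDigit(c)) {
                int start = i;                         // number: digit+
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add(new Token("NUM", src.substring(start, i)));
            } else if ("+-*/=()".indexOf(c) >= 0) {
                tokens.add(new Token("OP", String.valueOf(c)));
                i++;
            } else {
                // lexical error: the character cannot start any token
                throw new IllegalArgumentException("Lexical error at position " + i + ": '" + c + "'");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("x = y1 + 42 // assign"));
    }
}
```

Each token carries both its kind and its value, matching the id/x pairing above. A character that cannot start any token is reported as a lexical error.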
42. Parser (Syntax Analyzer)
- Checks whether the sequence of tokens conforms to the (syntactic) rules of the language.
- E.g. ab: each token is valid but the sequence is not
- Handle syntax errors
- Uses a Context Free Grammar (CFG), defined later
53. Semantic Analyzer (Type checker)
- Type checking
- Other semantic checking
- Unbound/undeclared variable
- Multiply defined names
- Uninitialized variable
64-5. Intermediate Code Generator and Optimizer
- Many forms
- Machine independent optimization
- Algebraic simplification
- Optional component
76-7. Code Generator and Optimizer
- Target specific architecture
- Machine dependent optimization
- Specialized instructions
- Optimization optional
- Examples code and optimization (Java/C)
8Languages
- Any language has
- Syntax
- Described using a formal notation: Context Free Grammar (CFG)
- A specialized notation called Regular Expressions can be used for lexical syntax, i.e., tokens
- Semantics
- Many different ways
- Operational, Axiomatic, Denotational
9Context Free Grammar (CFG)
- Derivations
- Parse trees
- Abstract syntax trees (AST)
- Ambiguity (Syntactic)
- BNF (Backus Naur Form), EBNF (Extended Backus Naur Form), Syntax diagrams
10Definition of Context Free Grammar
- Definition: A CFG consists of
- A finite set of non-terminal symbols (N)
- A finite set of terminal symbols (tokens) (T)
- A distinguished start symbol S from N
- A finite collection of rules (or productions) of the form
- X → A1 A2 ... An, where
- X is from N,
- each Ai is from N or T, and n ≥ 0;
- if n = 0 then write X → ε
11One Step Derivation
- Definition: Given a sequence of symbols
- A1 A2 ... Ai ... Ak
- If Ai → B1 B2 ... Bj is a production,
- then we can obtain
- A1 A2 ... Ai-1 B1 B2 ... Bj Ai+1 ... Ak
- in one step.
- We denote it as
- A1 A2 ... Ak => A1 A2 ... Ai-1 B1 B2 ... Bj Ai+1 ... Ak
12Derivation Sequence
- Definition
- Let S be the start symbol.
- Let t1 t2 ... ti be a sequence of terminal symbols (tokens, a program).
- A derivation sequence for t1 t2 ... ti is a sequence of one step derivations that starts with S and ends with t1 t2 ... ti.
13Definition of Parsing
- Definition: A sentence (sequence of terminal symbols) is syntactically valid if there is a derivation sequence for it.
- Typically there is a choice of nonterminals to be expanded
- Two canonical ways
- Leftmost: expand the leftmost nonterminal at each step
- Rightmost: expand the rightmost nonterminal at each step
- Correspond to Top-down and Bottom-up parsing
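A leftmost derivation can be traced mechanically. The sketch below assumes a small hypothetical grammar, E → E + T | T and T → id, which does not come from the slides, and applies a given list of productions, always rewriting the leftmost nonterminal. The class and method names are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that replays a leftmost derivation step by step.
class LeftmostDerivation {
    // Nonterminals of the assumed grammar  E -> E + T | T ;  T -> id
    static final List<String> NONTERMINALS = Arrays.asList("E", "T");

    // One step derivation: replace the leftmost nonterminal (which must be lhs) by rhs.
    static List<String> step(List<String> form, String lhs, List<String> rhs) {
        for (int i = 0; i < form.size(); i++) {
            if (NONTERMINALS.contains(form.get(i))) {
                if (!form.get(i).equals(lhs)) {
                    throw new IllegalArgumentException("Leftmost nonterminal is " + form.get(i) + ", not " + lhs);
                }
                List<String> next = new ArrayList<>(form.subList(0, i));
                next.addAll(rhs);
                next.addAll(form.subList(i + 1, form.size()));
                return next;
            }
        }
        throw new IllegalArgumentException("No nonterminal left to expand");
    }

    // Each production is given as {lhs, rhs symbol 1, rhs symbol 2, ...}.
    static String derive(String start, String[][] productions) {
        List<String> form = new ArrayList<>(Arrays.asList(start));
        StringBuilder out = new StringBuilder(String.join(" ", form));
        for (String[] p : productions) {
            List<String> rhs = Arrays.asList(p).subList(1, p.length);
            form = step(form, p[0], rhs);
            out.append(" => ").append(String.join(" ", form));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[][] productions = {
            {"E", "E", "+", "T"}, {"E", "T"}, {"T", "id"}, {"T", "id"}
        };
        System.out.println(derive("E", productions));
    }
}
```

With the productions listed in main, the trace is E => E + T => T + T => id + T => id + id. A rightmost derivation of the same sentence would expand T before E in the second step.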
14Types of Parsers
- Top down parser
- Mimics leftmost derivation
- Bottom up parser
- Mimics rightmost derivation in reverse
- Top down parsers cannot handle left recursion.
- These are deterministic parsers.
- General parsing is expensive and not practical
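To illustrate why top-down parsers avoid left recursion, here is a minimal recursive descent sketch in Java. It assumes a hypothetical expression grammar, shown in the comment, in which a left-recursive rule such as expr → expr + term has been replaced by EBNF repetition. A method for the left-recursive form would call itself before consuming any input and never terminate. The grammar and all names are assumptions for illustration.

```java
// Assumed grammar (EBNF), not taken from the slides:
//   expr   : term (('+' | '-') term)* ;
//   term   : factor ('*' factor)* ;
//   factor : NUMBER | '(' expr ')' ;
class RecursiveDescent {
    private final String src;
    private int pos = 0;

    RecursiveDescent(String src) {
        // For brevity, white space is stripped up front instead of using a separate scanner.
        this.src = src.replaceAll("\\s+", "");
    }

    // Parses and evaluates the whole input; rejects trailing garbage.
    static int evaluate(String src) {
        RecursiveDescent p = new RecursiveDescent(src);
        int value = p.expr();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("Syntax error at position " + p.pos);
        }
        return value;
    }

    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    // One method per grammar rule; the loop plays the role of ( ... )* in the rule.
    private int expr() {
        int value = term();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            int rhs = term();
            value = (op == '+') ? value + rhs : value - rhs;
        }
        return value;
    }

    private int term() {
        int value = factor();
        while (peek() == '*') {
            pos++;
            value *= factor();
        }
        return value;
    }

    private int factor() {
        if (peek() == '(') {
            pos++;
            int value = expr();
            if (peek() != ')') throw new IllegalArgumentException("Expected ')' at position " + pos);
            pos++;
            return value;
        }
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        if (start == pos) throw new IllegalArgumentException("Expected number at position " + pos);
        return Integer.parseInt(src.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(evaluate("8 - 3 - 2"));
    }
}
```

The loop in expr keeps subtraction left-associative, so 8 - 3 - 2 is grouped as (8 - 3) - 2 even though the rule is no longer left-recursive.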
15Parse Trees
- Definition: A Parse Tree is a tree such that
- all interior nodes are labeled from N (non-terminals)
- the root is labeled with S (start symbol)
- all leaves are labeled from T (terminals)
- if a node labeled X has children, from left to right,
- A1 A2 ... Ai ... Ak
- where X is a nonterminal and the Ai's are terminals or nonterminals, then
- X → A1 A2 ... Ai ... Ak
- must be a production.
16Parse Trees (cont.)
- Definition: t1 t2 ... tn is valid if there is a parse tree whose leaves spell t1 t2 ... tn when read left to right.
- Can be constructed from a derivation or directly.
17Ambiguity
- Definition: A CFG is ambiguous if there is at least one sentence for which there is more than one leftmost (or rightmost) derivation or parse tree.
- Should be avoided
- No general algorithm for detecting ambiguity
- Typically has to do with grouping
- Can rewrite or redefine syntax
- Pros/Cons
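As a concrete case of ambiguity caused by grouping, consider the hypothetical grammar E → E - E | num, which is an assumption for illustration and not from the slides. The sentence 8 - 3 - 2 has two parse trees, one grouping it as (8 - 3) - 2 and one as 8 - (3 - 2), and they give different values. The sketch below counts the parse trees for a sentence with a given number of operands by choosing where the top-level operator splits it.

```java
// Counts parse trees under the assumed ambiguous grammar  E -> E - E | num.
class AmbiguityDemo {
    // count[n] = number of parse trees for a sentence with n operands.
    static long countParseTrees(int operands) {
        if (operands < 1) throw new IllegalArgumentException("need at least one operand");
        long[] count = new long[operands + 1];
        count[1] = 1;                                   // a single num has exactly one tree
        for (int n = 2; n <= operands; n++) {
            for (int left = 1; left < n; left++) {      // operands to the left of the root operator
                count[n] += count[left] * count[n - left];
            }
        }
        return count[operands];
    }

    public static void main(String[] args) {
        System.out.println("parse trees for 8 - 3 - 2: " + countParseTrees(3));
        System.out.println("(8 - 3) - 2 = " + ((8 - 3) - 2));
        System.out.println("8 - (3 - 2) = " + (8 - (3 - 2)));
    }
}
```

Rewriting the grammar as E → E - T | T, T → num leaves only the left-grouped tree, which is one way to "rewrite or redefine syntax" as noted above.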
18AST
- Definition: An Abstract Syntax Tree (AST) is a tree in which interior nodes are operations and children are operands.
- Example: while (condition) body
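One common way to represent an AST in Java is a small class hierarchy with one class per node kind. The sketch below uses arithmetic expressions rather than the while statement to stay short. The names Expr, Num and BinOp are illustrative assumptions, not the classes any particular tool generates.

```java
// Hypothetical AST for arithmetic expressions: interior nodes are operations, children are operands.
class AstDemo {
    interface Expr {
        int eval();
    }

    // Leaf node: a numeric literal.
    static final class Num implements Expr {
        final int value;
        Num(int value) { this.value = value; }
        public int eval() { return value; }
    }

    // Interior node: a binary operation with two operand subtrees.
    static final class BinOp implements Expr {
        final char op;
        final Expr left;
        final Expr right;
        BinOp(char op, Expr left, Expr right) {
            this.op = op;
            this.left = left;
            this.right = right;
        }
        public int eval() {
            int l = left.eval();
            int r = right.eval();
            switch (op) {
                case '+': return l + r;
                case '-': return l - r;
                case '*': return l * r;
                default: throw new IllegalStateException("Unknown operator: " + op);
            }
        }
    }

    // AST for 1 + 2 * 3: the grouping is explicit in the tree shape, so no parentheses are stored.
    static Expr sample() {
        return new BinOp('+', new Num(1), new BinOp('*', new Num(2), new Num(3)));
    }

    public static void main(String[] args) {
        System.out.println(sample().eval());
    }
}
```

Unlike a parse tree, this tree has no nodes for nonterminals or punctuation. A while statement would similarly become a node with two children, the condition and the body.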
19Other Approaches for Describing Syntax
- BNF
- EBNF
- Syntax Diagrams
- All are equivalent to CFG
- Some actual examples
- http://java.sun.com/docs/books/jls/second_edition/html/grammars.doc.html#44271
- http://www.scheme.com/tspl2d/grammar.html#g2488
- http://www.schemers.org/Documents/Standards/R5RS/HTML/r5rs-Z-H-10.html#%_chap_7
20ANTLR Tool
- Parsers can be automatically generated from a CFG description
- ANTLR is one such tool that generates a (recursive descent) parser (in Java by default)
- Other tools: Yacc, Bison, SableCC, JavaCC, MLYacc, etc.
- Grammar file -> ANTLR Tool -> Java Code
21ANTLR Tool
- Allows EBNF.
- Is in the LL category and hence cannot handle left-recursive grammar rules
- Has a GUI-based grammar development environment called ANTLRWorks
- Includes automatic transformation of left-recursive rules
22ANTLR Grammar File Format
- Simplified version
- grammar name;
- /* Comment: Lexical Rules (Tokens).
- Token names must begin with an uppercase letter */
- RULE1 : ... ;
- RULE2 : ... ;
- ...
- /* Comment: Syntax Rules.
- Non-terminals must begin with a lowercase letter */
- rule1 : ... | ... | ... ;
- rule2 : ... | ... | ... ;
- ...
23ANTLR Output
- From a grammar named T in file T.g it generates
- TLexer.java
- TParser.java
- T.tokens
- ANTLR generates a method for each rule in a grammar.
- The methods are wrapped in a Java class definition (TParser.java).
- ANTLR provides named actions so you can insert fields and instance methods into the generated class definition. E.g.,
- grammar T;
- @header { import java.util.*; }
- @members {
- int n;
- public void foo() { ... }
- }
24ANTLR Output
- Compile the generated files. Make sure the ANTLR jar file is on the classpath
- To use:
- import org.antlr.runtime.*;
- public class Test {
- public static void main(String[] args) throws Exception {
- ANTLRInputStream input = new ANTLRInputStream(System.in);
- TLexer lexer = new TLexer(input);
- CommonTokenStream tokens = new CommonTokenStream(lexer);
- TParser parser = new TParser(tokens);
- parser.rule1(); // invoke method associated with the start symbol
- }
- }