Transcript and Presenter's Notes

Title: Language Descriptions and Compiler Architecture


1
Language Descriptions and Compiler Architecture
  • Part 1: Translators and Compilers (Slides 2-7)
  • Part 2: Syntax Diagrams - useful visualization
    tools for learning and translating languages.
    (Slides 8-18)
  • Part 3: Languages defined by grammar rules with
    various types of recursion. (Slides 19-34)

2
Part 1: Compiler Architecture
  • Modern compilers have four major parts:
  • Preprocessors to split the input sentence in the
    source language into tokens.
  • The parser that derives or reverse-engineers the
    derivation tree structure and a dictionary of
    names with their properties from these tokens.
  • A code generator which produces strings in the
    desired output or target language.
  • A main program, or controller, which invokes the
    other modules and manages I/O files.


3
Preprocessors
  • There are two kinds of preprocessors:
  • A lexical analyzer to find meaningful atomic
    parts, or tokens, for the parser.
  • A macro-processor to perform string substitution,
    by defining and/or expanding some tokens before
    parsing can proceed.
  • Lexical analysis is needed before the macro
    preprocessor and again before the parser. (A
    minimal tokenizer sketch follows this list.)
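As an illustration of the lexical-analysis step, here is a
minimal tokenizer sketch in Python. The token classes and
regular expressions below are assumptions chosen only for
this example, not the token set of any particular source
language.

    import re

    # Hypothetical token classes for a toy language; real lexers are
    # usually generated (e.g. by Lex) or written against the language spec.
    TOKEN_SPEC = [
        ("NUMBER", r"\d+"),
        ("ID",     r"[A-Za-z_]\w*"),
        ("OP",     r"[+\-*/=]"),
        ("LPAREN", r"\("),
        ("RPAREN", r"\)"),
        ("SKIP",   r"\s+"),        # whitespace is discarded
    ]
    MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

    def tokenize(text):
        """Split the input sentence into (kind, lexeme) tokens."""
        for m in MASTER.finditer(text):
            if m.lastgroup != "SKIP":
                yield (m.lastgroup, m.group())

    print(list(tokenize("x = (y + 42)")))
    # [('ID', 'x'), ('OP', '='), ('LPAREN', '('), ('ID', 'y'),
    #  ('OP', '+'), ('NUMBER', '42'), ('RPAREN', ')')]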


4
Parsers
  • Modern parsers act as front ends to code
    generators and optimizers, which require several
    intermediate products:
  • A memory-resident database that represents the
    program's parse, derivation, or abstract syntax
    tree.
  • A symbol table or dictionary of names used in the
    program, together with their semantic properties
    (datatype, etc.).
  • These are accessed during code generation. (Both
    are sketched after this list.)
  • Error messages for the programmer.
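A minimal Python sketch of the two intermediate products
named above; the node fields and the symbol-table layout
are illustrative assumptions, not a prescribed design.

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        """One node of a parse / abstract syntax tree."""
        kind: str                      # e.g. "assign", "add", "id", "number"
        value: str = ""                # lexeme for leaf nodes
        children: list = field(default_factory=list)

    # Symbol table: name -> semantic properties (datatype, etc.)
    symbols = {}

    def declare(name, datatype):
        symbols[name] = {"type": datatype, "declared": True}

    # x = y + 42   (assuming x and y were declared as int)
    declare("x", "int")
    declare("y", "int")
    tree = Node("assign", children=[
        Node("id", "x"),
        Node("add", children=[Node("id", "y"), Node("number", "42")]),
    ])
    print(tree.kind, symbols["x"])   # assign {'type': 'int', 'declared': True}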


5
Code Generators
  • Modern code generators have two or more stages:
  • Intermediate language (IL) code generation
  • Traverse the parse tree and generate an initial
    sequence of instructions that could execute (or
    interpret) each basic block of code correctly if
    the IL is executable (or interpretable).
  • Code Optimization
  • Rearrange the execution sequence to expedite
    necessary operations and avoid wasted ones.
  • Map IL instructions into the most efficient ones
    for the target language, OS and platform.
  • Different types of optimization may be done in
    separate IL code transformation steps. (A small
    IL-generation sketch follows this list.)
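As a sketch of the IL-generation stage, the Python fragment
below walks a toy expression tree and emits three-address
style IL; the instruction names (LOAD, ADD, STORE) and the
tuple-based tree encoding are assumptions made only for
this illustration.

    from itertools import count

    _temps = count()                       # t0, t1, ... temporaries

    def gen(node, code):
        """Emit three-address IL for one (kind, value, children) node."""
        kind, value, children = node
        if kind in ("id", "number"):
            t = f"t{next(_temps)}"
            code.append(("LOAD", value, t))
            return t
        if kind == "add":
            a, b = gen(children[0], code), gen(children[1], code)
            t = f"t{next(_temps)}"
            code.append(("ADD", a, b, t))
            return t
        if kind == "assign":               # children: [target, expression]
            src = gen(children[1], code)
            code.append(("STORE", src, children[0][1]))
            return src

    tree = ("assign", None,
            [("id", "x", []),
             ("add", None, [("id", "y", []), ("number", "42", [])])])
    code = []
    gen(tree, code)
    for instr in code:
        print(instr)
    # ('LOAD', 'y', 't0')  ('LOAD', '42', 't1')
    # ('ADD', 't0', 't1', 't2')  ('STORE', 't2', 'x')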

6
Target Languages
  • There can be many target language types:
  • Object code to be linked.
  • Assembly language to be assembled and linked.
  • IL output for a virtual machine (VM).
  • The VM may be emulated by software on a real
    hardware platform (e.g. byte code IL for the
    JVM).
  • A different source language for which there
    already exists a compiler (e.g. the cfront
    compiler translated C++ programs into C).
  • Translating a compiler program into a subset of
    its own source language is called boot-strapping
    the compiler.

7
What makes a good translator or compiler?
  • The secret of language translation is to know
    what to do for all legal derivations and explain
    what is wrong with all illegal ones.
  • Good compilers produce warnings for all syntax
    violations and as many semantic ones as possible.
  • Exercise: Try recompiling your last program with
    all warnings turned on.
  • The GNU (gcc or g++) option for this is -Wall.
  • In practice, some violations are justifiable, but
    reasons for them should be documented.


8
Part 2: Syntax Diagrams
  • Syntax diagrams are a tool to visualize language
    structure and implement translators.
  • Syntax diagrams are directed graphs (di-graphs)
    with two kinds of nodes:
  • An oval node, labeled by a terminal symbol in T,
    represents (scanning of) that terminal symbol
    (token).
  • A rectangular node, labeled by a nonterminal
    symbol or grammar variable in NT, represents
    (recognition or translation of) that nonterminal
    symbol.
  • The nodes are connected by directed edges.
  • The edges specify alternate sequences of symbols
    that are syntactically valid sentences in the
    language.
  • Paths through the syntax diagrams determine the
    flow of control through language parsers or
    translators.

9
BNF Definitions for Language Syntax
  • A BNF grammar for a language L is a finite set of
    syntax rules that describes how to compose or
    derive any string in L.
  • A BNF grammar cannot describe the meaning of
    sentences in a language. There are other more
    complex rules to specify meaning or semantics.
  • These BNF rules are used to analyze (parse) a
    string in L.
  • A program is a legal string or sentence in a
    programming language.
  • A parser analyzes this string by progressively
    decomposing it into a tree of constituent parts.
  • A successful parse is a constructive proof that
    the string is a member of L.


10
BNF Rules For a Grammar
  • BNF rules require additional variables, called
    nonterminal symbols, from another finite set
    called NT that depends on L.
  • A unique symbol in NT is called the Start Symbol.
  • Let V = NT ∪ T.
  • NT is disjoint from T: NT ∩ T is empty.
  • BNF rules are formed using 4 other meta-symbols
    not in V:
  • Brackets ( ) to represent nested tree
    structures.
  • Connective | ("or") to represent alternate
    choices.
  • Assignment ::= for deriving any RHS from the
    LHS.


11
Recursion vs. Iteration in Syntax Diagrams
  • A node with a cyclic or looping path around it
    implies iteration (lists) or recursion (nesting),
    depending on its context.
  • A diagram whose name also appears (directly or
    indirectly) inside one of its rectangles implies
    recursive BNF rules.
  • The two above properties may be combined.
  • Recursion can be of three types:
  • Embedded or essential: E ::= ( E )
  • Left-recursive: E ::= E + Id
  • Right-recursive: E ::= Id + E

12
BNF for A+ instead of A*
  • In Regular Expressions, the superscript + is a
    unary operator shorthand: A+ = A | AA | AAA | ...
  • If the variable Alist represents A* as in the
    previous example, then A+ is represented by
    either
  • APlus ::= A Alist    Alist ::= A Alist | <empty>
  • or
  • APlus ::= Alist A    Alist ::= Alist A | <empty>
  • The first pair of rules is right-recursive,
  • so they have a (deterministic) syntax diagram
    representation.

[Syntax diagram with start S, oval nodes B, A, D, and rectangle Alist]
13
BNF with right recursion Syntax Diagrams with
cycles
  • Regular Expressions use the star (*) operator to
    represent unconstrained iteration: S = B A* D
  • BNF grammars represent A* by a new variable Alist
    which is recursively defined either
  • S ::= B Alist D
  • Alist ::= A Alist | <empty>    <--
    (Right-recursive)
  • or
  • S ::= B Alist D
  • Alist ::= Alist A | <empty>    <--
    (Left-recursive)
  • Syntax diagrams show iteration as cyclic paths.
    (A recursive-descent sketch of the
    right-recursive form follows this slide.)

[Syntax diagram for S with oval nodes B, A, D and a cyclic path through A]
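The following Python sketch shows, under the assumption
that tokens are single characters, how the right-recursive
pair S ::= B Alist D, Alist ::= A Alist | <empty> can be
recognized; the recursive Alist procedure mirrors the
cyclic path in the diagram and could equally be written as
a loop. The helper names and error handling are
illustrative assumptions.

    def parse_S(tokens):
        """Recognize S ::= B Alist D over a list of one-character tokens."""
        pos = expect(tokens, 0, "B")
        pos = parse_Alist(tokens, pos)
        pos = expect(tokens, pos, "D")
        return pos == len(tokens)           # True iff the whole input matched

    def parse_Alist(tokens, pos):
        """Alist ::= A Alist | <empty>  -- right recursion acts like a loop."""
        if pos < len(tokens) and tokens[pos] == "A":
            return parse_Alist(tokens, pos + 1)
        return pos                           # <empty>: consume nothing

    def expect(tokens, pos, symbol):
        if pos < len(tokens) and tokens[pos] == symbol:
            return pos + 1
        raise SyntaxError(f"expected {symbol} at position {pos}")

    print(parse_S(list("BAAAD")))   # True
    print(parse_S(list("BD")))      # True: Alist may be empty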
14
Nested Grammar Rules
  • A+ is an A-based grammar pattern that is nested
    within the right-hand side of S's grammar rules:
  • S ::= B A Alist D    <-- at least one A
  • Alist ::= Alist A | <empty>   <-- Left-recursive
  • Syntax Diagram:
  • A Alist represents A+ = A A* because A is in
    the forward path and cannot be bypassed with
    <empty>.
  • A+ is defined by its own syntax diagram (next
    slide).

[Syntax diagram for S with oval nodes B, A, D, rectangle Alist, and a Pass Exit]
15
Alternate Diagrams for A+
  • Both of these syntax diagrams define A+:

[Two alternate syntax diagrams for A+ = A A*, with call-site states S1 and S2]
  • The second diagram above permits different
    processing to be done on the first occurrence of
    A, because it separates the first A from A*.
  • In diagram 2, each A-parser call-site (state S1
    or S2) can customize its own response.
  • In diagram 1, implementers must maintain a
    separate first-time flag to do this.

16
Nested Syntax Diagrams
  • The first syntax diagram below represents the
    grammar rule S ::= B A Alist D.
  • In diagram 2, the dashed rectangle encloses a
    sub-diagram for A* which is the in-line expansion
    of the variable Alist.

S
Pass
Alist
B
D
A
Alist or A
Pass
B
D
A
s
A
17
What about Fail Exits?
[Syntax diagram for S with oval nodes B, A, D, rectangle Alist, and only a Pass Exit]
  • This syntax diagram only shows a Pass Exit
  • It ignores the possibility of failure.
  • Syntax errors need specific diagnostics.
  • Syntax diagrams and their state transition tables
    also need Fail Exits that report errors.
  • Specifying these checks in additional state
    diagram nodes further enhances the power of
    syntax diagrams as a visual design tool.

18
What about Semantic Errors?
  • Semantic errors are errors which cannot be
    detected by the syntactic rules specified by the
    grammar. Some of these errors can be caught by
    context-sensitive checks in the compiler.
  • Syntax diagrams can also be expanded by adding
    non-grammar-based state nodes where semantic
    error checks are performed (and their Pass/Fail
    status remembered).
  • For example, lexical analyzers can check size
    limits and word lengths, and parsers can check
    that variable types are declared and appropriate
    for their use (a small declared-before-use check
    is sketched after this list).
  • Adding extra state nodes to specify these checks
    further enhances the visualization capability of
    syntax diagrams.
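A minimal sketch of one such semantic check,
declared-before-use, assuming the dictionary-style symbol
table sketched earlier; the function name and message text
are illustrative assumptions.

    # Symbol table built by the parser: name -> properties (datatype, etc.)
    symbols = {"y": {"type": "int"}}

    def check_use(name, line_no):
        """Context-sensitive check: every used name must be declared."""
        if name not in symbols:
            return f"line {line_no}: semantic error: '{name}' used before declaration"
        return None

    print(check_use("y", 3))   # None: ok
    print(check_use("x", 4))   # line 4: semantic error: 'x' used before declaration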

19
Languages as Infinite Subsets of Strings
  • A language L is a (usually infinite) set of
    strings (sequences) of constant symbols or
    characters in a finite set or alphabet T.
  • T is called L's vocabulary of terminal symbols.
  • The language T* is called the universal or
    all-inclusive language over vocabulary T.
  • T* is the set of all strings of symbols from T.
  • T* includes strings of any length, including 0.
  • Subsets of T* can be sorted by string-length:
  • T^k includes all strings of length exactly k.
  • T^0 is the set containing the one string of
    length 0.
  • Using + to denote set union, T* is the infinite
    polynomial T^0 + T^1 + T^2 + etc.


20
Languages as Semigroups
  • T* is an example of a (non-commutative)
    semigroup, because it is closed under string
    concatenation:
  • If s1 and s2 are members of T*, so is s1s2 or
    s2s1.
  • Languages (subsets of T*) are closed under direct
    product of sets:
  • If L1 and L2 are subsets of T*, so is L1L2 or
    L2L1.
  • The identity element for concatenation of strings
    is the null string; for direct products of
    languages it is the singleton set T^0.
  • If L is in T*, then T^0 L = L and L T^0 = L.
  • Caution: the identity for set union (L1 + L2) is
    the empty set {}, which is not the same as T^0!
    (A small sketch of these operations follows this
    list.)
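A small Python sketch of these set operations, using
T = {'a', 'b'} as an assumed two-symbol vocabulary; it
shows the direct product L1L2, the concatenation identity
T^0 = {""}, and the union identity {} behaving differently.

    from itertools import product

    T = {"a", "b"}                          # assumed vocabulary

    def direct_product(L1, L2):
        """L1L2 = every concatenation s1s2 with s1 in L1 and s2 in L2."""
        return {s1 + s2 for s1, s2 in product(L1, L2)}

    T0 = {""}                               # T^0: the one string of length 0
    L  = {"a", "ab", "bb"}                  # some language over T

    print(direct_product(T0, L) == L)       # True: T^0 is the identity for L1L2
    print(direct_product(L, T0) == L)       # True
    print((L | set()) == L)                 # True: {} is the identity for union
    print(direct_product(L, set()))         # set(): {} annihilates the product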


21
BNF Rules are also parsed
  • Grammar rules are just strings.
  • Their vocabulary includes V = T ∪ NT plus the
    meta-symbols (, ), |, and ::=.
  • T is a set of terminal symbols (chars or tokens).
  • NT is the set of variables or non-terminals.
  • BNF Grammar Rule Syntax: LHS ::= RHS
  • LHS means left-hand side of a rule.
  • RHS means right-hand side of a rule.
  • The substitution operator ::= says that the RHS
    of any rule may replace its LHS during any step
    of the derivation of a string in L from S in NT.
  • A well-formed RHS may include the choice operator
    | and balanced braces { } for sub-expression
    grouping.


22
Context-Free Grammar (CFG) Restrictions
  • LHS for context-free grammars (CFGs):
  • We will only be interested in context-free
    grammars (CFGs).
  • In rules of a CFG, the LHS is restricted to be a
    single nonterminal or variable symbol in NT.
  • Each symbol in NT can derive a language in V*.
  • Rules for combining symbols are called
    productions.
  • Sequences of productions produce exactly all of
    the strings of T* that belong to L and no others.


23
Deriving a Sentence
  • Derivation starts with the start symbol S in NT,
    which gradually evolves into a string d in V*.
  • Each derivation step replaces any variable in d
    by the RHS of any of its production rules.
  • This step is repeated for another variable in d
    until all variables have been eliminated.
  • It terminates when the derived string d is in T*
    (contains no more elements from NT).
  • The result d is in T* and is a legal sentence in
    the language L. (A small derivation sketch
    follows this list.)
  • Parsing is the reverse process: analyze the
    sentence to find its derivation history in
    reverse (productions in reverse are called
    reductions).
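A minimal derivation sketch in Python for the bracket
grammar used later in the presentation, A ::= bAe | be; the
dictionary encoding of the rules and the list of
alternative choices are assumptions made only for this
illustration.

    # CFG as a dict: LHS -> list of alternative RHSs (strings of symbols).
    # Lower-case letters are terminals, upper-case letters are nonterminals.
    grammar = {"A": ["bAe", "be"]}

    def derive(choices, start="A"):
        """Apply one production per step, always to the leftmost variable."""
        d = start
        for alt in choices:                  # which alternative RHS to use
            i = next(i for i, s in enumerate(d) if s.isupper())
            d = d[:i] + grammar[d[i]][alt] + d[i + 1:]
            print(d)
        return d

    # Derivation A => bAe => bbAee => bbbeee  (a sentence in T*)
    derive([0, 0, 1])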


24
Extended BNF Rules
  • Grammar vocabulary is still V = NT ∪ T.
  • Merge alternate rules for the same LHS by
    introducing OR-combinations of alternative RHSs,
    separated by |.
  • Create expression trees containing
    sub-expressions surrounded by parentheses.
  • Allow braces { } for optional sub-expressions,
    and an ellipsis ... for repeatable
    sub-expressions: { A } and { A } ...


25
Recursive BNF Grammars
  • Recursion may be direct (visible in one rule):
  • A ::= b A | c
  • This rule generates a regular language, the set
    of strings b*c.
  • Recursion may be indirect (visible after variable
    substitution):
  • A ::= ( C | ( )    and    C ::= A )
  • The rules above generate the same language as
    these
  • (directly recursive) rules: A ::= ( A ) | ( )
  • To see this, replace C in the A-rules by the RHS
    of all C-rules.
  • Both grammars generate this set of nested
    parenthesis pairs, where b denotes ( and e
    denotes ):
  • { b^k e^k : k >= 1 }

26
Recursive BNF Example
  • Example: the BNF grammar
  • A ::= bC | be,   C ::= Ae
  • represents the same language as A ::= bAe | be.
  • Eliminate C by substituting the RHS of its
    C-rules.
  • Both grammars generate the set of nested
    parentheses { b^k e^k : k >= 1 }.
  • This language is context-free but not regular:
  • indirect embedded recursion requires balanced
    b..e pairs.
  • Counting matched occurrences requires unbounded
    memory. (A recognizer sketch follows this list.)
  • Left-most or right-most recursion is equivalent
    to iteration, so such grammar rules are regular.
  • We can replace left- or right-recursive BNF
    rules by extended EBNF rules. Extended rules can
    denote regular expressions (lists of repeated
    items with separator symbols).
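A sketch of a recursive-descent recognizer for
A ::= bAe | be, assuming single-character tokens; the
recursion depth plays the role of the unbounded counting
memory that a finite automaton lacks. The helper names are
illustrative assumptions.

    def parse_A(s, pos=0):
        """A ::= b A e | b e.  Return the position after A, or raise."""
        if pos < len(s) and s[pos] == "b":
            if pos + 1 < len(s) and s[pos + 1] == "b":
                pos = parse_A(s, pos + 1)         # embedded recursion: b A e
            else:
                pos = pos + 1                      # base case: b e
            if pos < len(s) and s[pos] == "e":
                return pos + 1
        raise SyntaxError(f"not of the form b^k e^k at position {pos}")

    def accepts(s):
        try:
            return parse_A(s) == len(s)
        except SyntaxError:
            return False

    print(accepts("bbbeee"))   # True  (k = 3)
    print(accepts("bbee"))     # True  (k = 2)
    print(accepts("bbe"))      # False (unbalanced)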

27
Context-free Grammars
  • The grammar A ::= bAe | be, which generates the
    set of strings { b^k e^k : k >= 1 }, is the
    simplest example of a language which is
    context-free but not regular:
  • An unbounded k-value is assumed, to avoid
    impractical compromises.
  • Embedded recursion requires balanced b..e pairs.
  • Counting matched occurrences requires unbounded
    memory.
  • Left-most and right-most recursion are equivalent
    to iteration, so such grammar rules are regular.
  • We can replace left- or right-recursive BNF
    rules by extended EBNF rules. Extended rules can
    denote regular expressions (lists of repeated
    items with separator symbols).

28
Examples of Recursion
  • Embedded (essential) recursion:
  • E ::= ( E ) | Id | Number        (direct)
  • Stmt ::= simpleStmt | Block
  • Block ::= { StmtList }            (indirect)
  • StmtList ::= Stmt | StmtList Stmt
  • Left-recursion vs. right-recursion (mutex):
  • StmtList (above)                  (left-recursive)
  • Blist ::= B Blist | <empty>       (right-recursive)
  • Embedded recursion is not a problem for either
    stack-based parsing method.

29
Recursive-Descent Parsing
  • Recursive descent or top-down parsers make nested
    procedure calls for each occurrence of a variable
    on a syntax diagram, including those representing
    embedded recursion.
  • Alternate right-hand sides become OR-connected or
    branching paths on syntax diagrams.
  • Right-recursive rules can be replaced by
    iterative loops.
  • Syntax diagrams can guide the manual design of
    recursive-descent parsers, as in the sketch
    following this slide.
  • See Wirth Ch. 5 (99f204 handout).
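A hand-written recursive-descent sketch for the expression
grammar of slide 28, E ::= ( E ) | Id | Number, extended
here with an iterative loop for a plus-separated list
(replacing a right-recursive rule). Token handling is
simplified and the helper names are assumptions for this
illustration only.

    def parse_expr(tokens, pos=0):
        """Elist ::= E { + E } : the loop replaces right recursion."""
        pos = parse_E(tokens, pos)
        while pos < len(tokens) and tokens[pos] == "+":
            pos = parse_E(tokens, pos + 1)
        return pos

    def parse_E(tokens, pos):
        """E ::= ( E ) | Id | Number : one procedure per nonterminal."""
        tok = tokens[pos] if pos < len(tokens) else "<eof>"
        if tok == "(":                             # embedded recursion
            pos = parse_expr(tokens, pos + 1)
            if pos < len(tokens) and tokens[pos] == ")":
                return pos + 1
            raise SyntaxError("missing )")
        if tok.isidentifier() or tok.isdigit():    # Id or Number
            return pos + 1
        raise SyntaxError(f"unexpected token {tok!r}")

    tokens = ["(", "x", "+", "42", ")", "+", "y"]
    print(parse_expr(tokens) == len(tokens))       # True: whole input parsed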

30
Automating Recursive-Descent
  • Yacc-based LR(k) parsers are automatically
    generated from EBNF grammar rules.
  • Recursive descent parsers can also be generated
    automatically:
  • Syntax diagrams can be translated into state
    transition tables (manually or automatically).
  • State transition tables can drive a generic
    table-driven interpreter. (A table-driven sketch
    follows this list.)
  • Alternatively, these tables can be translated
    directly into parser or translator source code.
  • In either case, the translator actions must be
    hand-tailored by the programmer to the specified
    source language and target output.
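A sketch of a generic table-driven recognizer for the
diagram S ::= B A* D from Part 2; the state transition
table below is one hand-made encoding of that diagram
(each row maps the current state and next token to a
successor state), and the table format is an assumption
for this illustration.

    # State transition table derived by hand from the syntax diagram:
    # state -> {token: next_state}; "Pass" marks the pass exit.
    TABLE = {
        "S0": {"B": "S1"},              # expect B
        "S1": {"A": "S1", "D": "Pass"}, # loop on A, leave the cycle on D
    }

    def run(tokens):
        """Generic interpreter: follow the table or report a Fail exit."""
        state = "S0"
        for i, tok in enumerate(tokens):
            state = TABLE.get(state, {}).get(tok)
            if state is None:
                return f"Fail at token {i}: unexpected {tok!r}"
            if state == "Pass":
                return "Pass" if i == len(tokens) - 1 else "Fail: trailing input"
        return "Fail: unexpected end of input"

    print(run(list("BAAD")))   # Pass
    print(run(list("BAC")))    # Fail at token 2: unexpected 'C'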


31
What about left-recursion?
  • Left-recursion would cause stack overflow in a
    recursive descent parser.
  • Left-recursion would require look-ahead beyond
    the end of the string generated by the recursive
    variable in order to terminate the recursion.
  • Fortunately, left and right recursive grammars
    are interchangeable, at least for parsing (syntax
    error detection).
  • Left-recursive rules are prohibited for
    recursive-descent parsing, but they can be
    converted into right-recursive ones (e.g. by
    translating Alist ::= Alist A | <empty> into
    Alist ::= A Alist | <empty>, plus other
    conversions), as sketched below.
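A short Python sketch of why this conversion matters for
recursive descent, using the Alist rules above: alist_left
calls itself before consuming any token and can never
terminate (Python stops it with a RecursionError), while
the converted right-recursive form, written as the
equivalent loop, terminates. The function names are
illustrative assumptions.

    import sys
    sys.setrecursionlimit(100)           # keep the failure small for the demo

    def alist_left(tokens, pos=0):
        """Alist ::= Alist A | <empty> coded naively: recurses before
        consuming any token, so it never makes progress."""
        return alist_left(tokens, pos)    # loops until the stack overflows

    def alist_right(tokens, pos=0):
        """Alist ::= A Alist | <empty> written as the equivalent loop."""
        while pos < len(tokens) and tokens[pos] == "A":
            pos += 1
        return pos

    print(alist_right(list("AAAB")))      # 3: consumed the three A's
    try:
        alist_left(list("AAAB"))
    except RecursionError:
        print("left recursion: stack overflow in a recursive-descent parser")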

32
Parsers vs. Translators
  • Unfortunately, left and right recursion are not
    equivalent when implementing translators.
  • The situation is more involved because:
  • Translators do more than return syntax errors.
  • Different processing may be needed after the
    first or before the last occurrence of the
    repeated non-terminal.
  • Recursive-descent parsers for right-recursive
    grammar rules have limited look-into
    capability: they can recognize first instances,
    but do not know if the next occurrence is the
    last one.
  • Left-recursion requires look-beyond-current-instance
    capability to recognize the final occurrence
    (because something different follows it) and
    terminate the recursion (to avoid an infinite
    loop and stack overflow).

33
Left-recursive Grammars
  • Grammars with left-recursive rules are called
    LR(k) grammars, because LR(k) parsers are
    normally used. (The number k is the number of
    symbols of look-ahead needed beyond the end of
    the symbol being parsed.)
  • LR(k) parsing requires look-beyond-current-instance
    capability in order to recognize the final
    occurrence (assuming something different follows
    it) and to disambiguate the alternate RHSs of
    left-recursive rules:
  • When parsing Alist ::= Alist A | A, which
    alternate RHS should be tried first?
  • Both rules begin with A, so the correct one
    depends on what follows A (assumed not confusable
    with A).
  • We must parse A in order to discover what follows
    it.
  • The equivalent right-recursive rule,
    Alist ::= A Alist | <empty>,
  • can choose between A and something else before
    parsing A.


34
Left-recursive Parsers
  • Left-recursive grammars are handled easily by
    more sophisticated LR(k) parsers. These can be
    generated automatically from EBNF specifications
    by Yacc.
  • Yacc generates a table-driven stack-based parser.
  • The flip side for Yacc is that right-recursive
    grammar rules are expensive to process (they
    require heavy use of the stack).

Yacc stands for Yet Another
Compiler-Compiler. See our Lex & Yacc text,
p. 197.