Chapter 4 Lexical and Syntax Analysis

1
Chapter 4 Lexical and Syntax Analysis
  • CS 350 Programming Language Design
  • Indiana University Purdue University Fort Wayne
  • Mark Temte

2
Chapter 4 topics
  • Introduction
  • Lexical Analysis
  • Parsing
  • Recursive-Descent Parsing
  • Bottom-Up Parsing

3
Introduction
  • The syntax analysis portion of a compiler
    typically consists of two parts
  • A low-level part called a lexical analyzer
  • A deterministic finite automaton (DFA)
  • Based on a regular grammar
  • A high-level part called a parser (or syntax
    analyzer)
  • A push-down automaton
  • Based on a context-free grammar described with
    BNF

4
Introduction
  • Reasons to use BNF to describe syntax
  • Provides a clear and concise syntax description
  • The parser can be based directly on the BNF
  • Parsers based on BNF are easy to maintain
  • Reasons to separate lexical and syntax analysis
  • Simplicity
  • Less complex approaches can be used for lexical
    analysis
  • Separating them out simplifies the parser
  • Efficiency
  • Separation allows optimization of the lexical
    analyzer
  • Portability
  • Parts of the lexical analyzer may not be
    portable, but the parser is always portable

5
Lexical Analysis
  • A lexical analyzer is a pattern matcher for
    character strings
  • A lexical analyzer is a front-end for the
    parser
  • Identifies substrings of the source program that
    belong together (lexemes)
  • Lexemes match a character pattern, which is
    associated with a lexical category called a token
  • myCount is a lexeme
  • The token for myCount might be called IDENT

6
Lexical Analysis
  • A lexical analyzer also . . .
  • Skips comments
  • Skips blanks outside lexemes
  • Inserts lexemes for identifiers into a symbol
    table
  • Detects syntactic errors in lexemes
  • Ill-formed floating-point literals, for example

7
Lexical Analysis
  • The lexical analyzer is typically a function that
    is called by the parser when it needs the next
    token
  • Three common approaches to building a lexical
    analyzer
  • Write a formal description of the tokens and use
    a software tool that constructs table-driven
    lexical analyzers using the description
  • A UNIX tool that does this is lex
  • Draw a state transition diagram that describes
    the tokens and write a program that implements
    the state diagram
  • Draw a state transition diagram that describes
    the tokens and hand-construct a table-driven
    implementation of the state diagram
  • We examine the second approach

8
Lexical Analysis
  • State diagram design
  • A naïve state diagram would have a transition
    from every state resulting from every character
    in the source language
  • Such a diagram would be very large!
  • Transitions can usually be combined to simplify
    the state diagram
  • To recognize an identifier, all uppercase and
    lowercase letters are equivalent
  • Use a character class that includes all letters
  • To recognize an integer literal, all digits are
    equivalent
  • Use a digit class

9
Lexical Analysis
  • Reserved words and identifiers can be recognized
    together
  • Then use a table lookup to determine whether a
    possible identifier is in fact a reserved word
  • An alternative is to have a separate part of the
    diagram for each reserved word

10
Lexical Analysis
  • A lexical analyzer typically has several global
    variables
  • Character nextChar
  • charClass (letter, digit, etc.)
  • String lexeme
  • Some convenient utility subprograms (sketched after this slide) are . . .
  • getChar
  • Gets the next character of input, puts it in
    nextChar, determines its class, and puts the
    class in charClass
  • addChar
  • Adds the character from nextChar to the lexeme
    string
  • lookup
  • Determines whether the string in lexeme is a
    reserved word
  • Returns a code
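
A minimal sketch of these utility subprograms in C is given below. The slides only name the subprograms and globals; the character classes, token codes, reserved-word table contents, buffer size, and the input file variable are illustrative assumptions.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* globals named on the slide; the classes and codes below are assumed values */
static char nextChar;
static int  charClass;
static char lexeme[100];
static FILE *in;                                /* source input (assumed) */

enum { LETTER, DIGIT, UNKNOWN, EOF_CLASS };     /* character classes (assumed) */
enum { IDENT = 10, INT_LIT = 11,                /* token codes (assumed) */
       FOR_CODE = 20, IF_CODE = 21 };

/* getChar: get the next character of input, put it in nextChar,
   determine its class, and put the class in charClass */
static void getChar( ) {
  int c = getc( in );
  if ( c == EOF ) { charClass = EOF_CLASS; return; }
  nextChar = (char) c;
  if ( isalpha( c ) )      charClass = LETTER;
  else if ( isdigit( c ) ) charClass = DIGIT;
  else                     charClass = UNKNOWN;
}

/* addChar: add the character in nextChar to the lexeme string */
static void addChar( ) {
  size_t len = strlen( lexeme );
  if ( len + 1 < sizeof lexeme ) {
    lexeme[len] = nextChar;
    lexeme[len + 1] = '\0';
  }
}

/* lookup: determine whether the string in lexeme is a reserved word;
   return its code, or IDENT for an ordinary identifier */
static int lookup( const char lex[] ) {
  if ( strcmp( lex, "for" ) == 0 ) return FOR_CODE;
  if ( strcmp( lex, "if"  ) == 0 ) return IF_CODE;
  return IDENT;
}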

11
Example state transition diagram
12
A simple lexical analyzer
/* a simple lexical analyzer */
int lex( ) {
  getChar( );
  switch ( charClass ) {
  /* Parse identifiers and reserved words */
  case LETTER:
    addChar( );
    getChar( );
    while ( charClass == LETTER || charClass == DIGIT ) {
      addChar( );
      getChar( );
    }
    return lookup( lexeme );
    break;

13
A simple lexical analyzer
  /* Parse integer literals */
  case DIGIT:
    addChar( );
    getChar( );
    while ( charClass == DIGIT ) {
      addChar( );
      getChar( );
    }
    return INT_LIT;
    break;
  }   /* End of switch */
}   /* End of function lex */
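
For completeness, a hypothetical test driver is sketched below. It assumes the globals and codes from the utility sketch above, and it assumes lex( ) has been extended with one more case that returns a hypothetical EOF_TOKEN code when charClass is EOF_CLASS; in a real compiler the parser, not a driver, calls lex( ) whenever it needs the next token.

enum { EOF_TOKEN = -1 };                  /* assumed end-of-input code */

int main( ) {
  in = fopen( "source.in", "r" );         /* file name is illustrative */
  if ( in == NULL ) return 1;
  for ( ;; ) {
    lexeme[0] = '\0';                     /* start a fresh lexeme */
    int token = lex( );
    if ( token == EOF_TOKEN ) break;
    printf( "Next token is %d, lexeme is %s\n", token, lexeme );
  }
  fclose( in );
  return 0;
}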
14
Parsing
  • A parser is a recognizer for a context-free
    language
  • Given an input program, a parser . . .
  • Finds all syntax errors
  • For each syntax error, an appropriate diagnostic
    message is generated
  • Recovery is attempted to find additional syntax
    errors
  • Produces the parse tree for the program
  • Perhaps just a traversal of the nodes of the
    parse tree

15
Parsing
  • Two categories of parsers
  • Top down
  • Produce the parse tree, beginning at the root
  • Order is that of a leftmost derivation
  • Bottom up
  • Produce the parse tree, beginning at the leaves
  • Order is that of the reverse of a rightmost
    derivation
  • Parsers look only one token ahead in the input

16
Top-down parsers
  • A top-down parser traces the parse tree in
    preorder
  • It produces a leftmost derivation of the program
  • Partway through the leftmost derivation, suppose
    that the sentential form xAα has been derived
  • Nonterminal A must be replaced next
  • There may be several RHSs for A
  • Call these A-rules

17
Top-down parsers
  • The parser must choose the correct A-rule to get
    the next sentential form in the leftmost
    derivation
  • The parser is guided by the single lookahead
    token
  • The chosen A-rule must uniquely produce the
    lookahead token
  • The most common top-down parsing algorithms . . .
  • Recursive descent parser
  • A coded implementation
  • LL parsers
  • Table driven implementation
  • Left-to-right scan of tokens produces a Leftmost
    derivation

18
Bottom-up parsers
  • Start with the tokens of the program and work
    back to the start symbol
  • We end up with a rightmost derivation in reverse
    order
  • Try to match the RHS of some production rule with
    a substring of tokens and replace the substring
    with the LHS of the production rule
  • This is called a reduction
  • The goal is to find a series of reductions
  • Each reduction should produce the previous
    sentential form in a rightmost derivation

19
Bottom-up parsers
  • Problem
  • More than one RHS may match input
  • The correct RHS must be correctly selected based
    only on the lookahead token
  • The correct RHS is called the handle
  • The most common bottom-up parsing algorithms are
    in the LR family
  • Table driven implementation
  • Left-to-right scan of tokens produces a Rightmost
    derivation

20
The Complexity of Parsing
  • Parsers that work for any unambiguous grammar are
    complex and inefficient
  • The big-O is O(n³), where n is the length of the
    input
  • General parsers often reach dead ends and must
    back up and reparse
  • Practical parsers only work for a subset of all
    unambiguous grammars
  • The big-O of these is O(n), where n is the input
    length

21
Recursive-descent parsing
  • This involves a subprogram for each nonterminal
    in the grammar
  • This subprogram parses the sub-sentences that can
    be generated by that nonterminal
  • Recursive production rules lead to recursive
    subprograms
  • EBNF is ideally suited for being the basis for a
    recursive-descent parser
  • EBNF minimizes the number of nonterminals

22
Recursive-descent parsing
  • Consider a grammar for simple expressions
  • For a production rule LHS with only one RHS . . .
  • Work through the RHS, symbol-by-symbol
  • For any terminal symbol, compare it with the
    lookahead token
  • If they match, continue else there is an error
  • For any nonterminal symbol, call the symbol's
    associated parsing subprogram

<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
<factor> → id | ( <expr> )
23
Recursive-descent parsing
  • Assume we have a lexical analyzer named lex,
    which puts the next token code in nextToken
  • This particular routine does not detect errors

/* Function expr parses strings in the language
   generated by the rule
       <expr> → <term> {(+ | -) <term>}
*/
void expr( ) {
  /* Parse the first term */
  term( );
  /* As long as the next token is + or -, call lex to get
     the next token, and parse the next term */
  while ( nextToken == PLUS_CODE || nextToken == MINUS_CODE ) {
    lex( );
    term( );
  }
}

Convention: term( ) and every other parsing subprogram
leaves the next token in nextToken when it finishes
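
The slides show expr( ) but not its companion. A sketch of term( ) under the same convention is given below; the token codes TIMES_CODE and DIV_CODE are assumed names, chosen to parallel PLUS_CODE and MINUS_CODE.

/* Function term parses strings in the language
   generated by the rule
       <term> → <factor> {(* | /) <factor>}
   (factor is defined on a later slide)
*/
void term( ) {
  /* Parse the first factor */
  factor( );
  /* As long as the next token is * or /, call lex to get
     the next token, and parse the next factor */
  while ( nextToken == TIMES_CODE || nextToken == DIV_CODE ) {
    lex( );
    factor( );
  }
}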
24
Recursive-descent parsing
  • A production rule LHS that has more than one RHS
    requires an initial process to determine which
    RHS it is to parse
  • The correct RHS is chosen on the basis of the
    lookahead token
  • The lookahead is compared with the first token
    that can be generated by each RHS until a match
    is found
  • The possible tokens that can be generated must be
    determined by analysis when the compiler is
    constructed
  • If no match is found, it is a syntax error

25
Recursive-descent parsing
/* Function factor parses strings in the language
   generated by the rule
       <factor> → id | ( <expr> )
*/
void factor( ) {
  /* Determine which RHS */
  if ( nextToken == ID_CODE )
    /* For the RHS id, just call lex */
    lex( );
  else if ( nextToken == LEFT_PAREN_CODE ) {
    /* If the RHS is ( <expr> ), call lex to pass over the
       left parenthesis, call expr, and check for the
       right parenthesis */
    lex( );
    expr( );
    if ( nextToken == RIGHT_PAREN_CODE )
      lex( );
    else
      error( );
  }
  else
    error( );   /* Neither RHS matches */
}   /* end of factor */
26
Recursive-descent parsing
  • The LL grammar class has a problem with left
    recursion
  • If a grammar has left recursion, either direct or
    indirect, it cannot be the basis for a top-down
    parser
  • For example, no production rule may have the form
  • A → A + B
  • A recursive-descent parser subprogram for A would
    immediately call itself, resulting in an infinite
    chain of recursive calls
  • Fortunately, a grammar can be modified to remove
    left recursion (see the example below)
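
As an illustration (a standard transformation, not taken from the slides), the directly left-recursive pair

  A → A + B | B

can be rewritten with a new nonterminal A' as

  A → B A'
  A' → + B A' | ε

which generates the same strings but lets a recursive-descent subprogram for A begin by parsing B instead of calling itself.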

27
Recursive-descent parsing
  • The LL grammar class also has a problem with
    pairwise disjointness
  • Lack of pairwise disjointness is another
    characteristic of grammars that disallows
    top-down parsing
  • This is the inability to determine the correct
    RHS on the basis of one token of lookahead

28
Pairwise disjointness problem
  • Define the FIRST set of a symbol string α by
    FIRST(α) = { a | α =>* aβ }
  • If α =>* ε, then ε is in FIRST(α)
  • Here ε is the empty string
  • Pairwise disjointness test
  • Let A be any LHS nonterminal that has more than
    one RHS
  • Then for each pair of rules, A → αi and A → αk,
    it must be true that
  • FIRST(αi) ∩ FIRST(αk) = ∅

29
Pairwise disjointness problem
  • Examples
  • The following group of production rules passes the
    pairwise disjointness test
  • A → a | bB | cAb
  • The next group of production rules does not pass
  • A → a | aB
  • A grammar that fails the pairwise disjointness
    test can often be modified successfully using
    left factoring

30
Left factoring example
  • The production rule group
  • <id_list> → identifier | identifier , <id_list>
  • fails the pairwise disjointness test
  • Replace the group with
  • <id_list> → identifier <new>
  • <new> → , <id_list> | ε

31
Bottom-up parsing
  • Recall that a bottom-up parser produces a
    rightmost derivation in reverse order by reading
    input from left to right

Simple grammar
  E → E + T | T
  T → T * F | F
  F → ( E ) | id
  • Rightmost derivation (of the sentence a + b * c)
  • E
  • E + T
  • E + T * F
  • E + T * c
  • E + F * c
  • E + b * c
  • T + b * c
  • F + b * c
  • a + b * c

32
Bottom-up parsing
  • Given a right sentential form, the bottom-up
    parsing problem is to find the correct RHS (the
    handle) to reduce to a LHS to get the previous
    right sentential form in a rightmost derivation
  • Some handle definitions
  • Definition: β is the handle of the right
    sentential form
  • γ = αβw if and only if S =>*rm αAw =>rm αβw
    (where =>rm denotes a rightmost derivation step)
  • Definition: β is a phrase of the right sentential
    form γ
  • if and only if S =>* γ = α1Aα2 =>+ α1βα2
  • Definition: β is a simple phrase of the right
    sentential form γ
  • if and only if S =>* γ = α1Aα2 => α1βα2

33
Bottom-up parsing
  • Intuition about handles
  • The handle of a right sentential form is its
    leftmost simple phrase
  • Given a parse tree, it is now easy to find the
    handle
  • Of course, you are not given the parse tree in
    advance
  • Parsing can be thought of as handle pruning

34
Bottom-up parsing
  • Bottom-up parsers are often called shift-reduce
    parsers
  • The focus of parser activity is a parse stack
  • Shift and reduce activity
  • Reduce is the action of replacing the handle on
    the top of the parse stack with its corresponding
    LHS
  • Shift is the action of moving the next input
    token to the top of the parse stack (a short
    trace sketch follows this slide)
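
As an illustration, here is a hypothetical shift-reduce trace of the expression a + b * c using the simple grammar from the earlier slide (E → E + T | T, T → T * F | F, F → ( E ) | id, with a, b, and c standing for id tokens). Only grammar symbols are shown on the stack; an LR parser would also keep states there.

  Stack        Remaining input   Action
               a + b * c         shift
  a            + b * c           reduce by F → id
  F            + b * c           reduce by T → F
  T            + b * c           reduce by E → T
  E            + b * c           shift
  E +          b * c             shift
  E + b        * c               reduce by F → id
  E + F        * c               reduce by T → F
  E + T        * c               shift  (lookahead * postpones E → E + T)
  E + T *      c                 shift
  E + T * c                      reduce by F → id
  E + T * F                      reduce by T → T * F
  E + T                          reduce by E → E + T
  E                              accept

Each reduction replaces the handle on top of the stack, producing the right sentential forms of the derivation on the earlier slide in reverse order.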

35
Bottom-up parsing
  • Advantages of LR parsers
  • They will work for nearly all grammars that
    describe programming languages.
  • They work on a larger class of grammars than
    other bottom-up algorithms, but are as efficient
    as any other bottom-up parser.
  • They can detect syntax errors as soon as possible
  • LL parsers also have this property
  • The LR class of grammars is a superset of the
    class of grammars that can be parsed by LL parsers

36
Bottom-up parsing
  • LR parsers
  • Are table driven
  • It is usually not practical to construct a table
    by hand
  • The table must be constructed automatically from
    the grammar by a program
  • For example, the UNIX program yacc does this

37
Bottom-up parsing
  • LR parsing was discovered by Donald Knuth (1965)
  • Knuth's insight
  • A bottom-up parser can use the entire history of
    the parse, up to the current point, to make
    parsing decisions
  • There are only a finite and relatively small
    number of different parse situations that could
    have occurred, so the history can be stored as a
    sequence of states, S0 ... Sm, on the parse stack

38
Bottom-up parsing
  • An LR configuration is the entire state of an LR
    parser
  • (S0 X1 S1 X2 S2 ... Xm Sm, ai ai+1 ... an $)
  • The uppercase letters represent the parse stack
  • The lowercase letters represent the unread input
  • There is one state S for each grammar symbol X on
    the parse stack

39
Bottom-up parsing
  • LR parser operation

40
Bottom-up parsing
  • LR parser table has two components
  • ACTION table
  • The ACTION table specifies the action of the
    parser, given the parser state and the next token
  • Rows are state names
  • Columns are terminals
  • GOTO table
  • The GOTO table specifies which state to put on
    top of the parse stack after a reduction action
    has taken place
  • Rows are state names
  • Columns are nonterminals

41
Form of an LR parsing table
42
Form of an LR parsing table
  • This parsing table resulted from the grammar
  • Initial LR configuration (S0, a1 ... an $)

Grammar
  1. E → E + T
  2. E → T
  3. T → T * F
  4. T → F
  5. F → ( E )
  6. F → id
43
Parser actions
  • If ACTION[Sm, ai] = Shift S, the next
    configuration is
  • (S0 X1 S1 X2 S2 ... Xm Sm ai S, ai+1 ... an $)
  • If ACTION[Sm, ai] = Reduce A → β and
  • S = GOTO[Sm-r, A], where r = the length of β,
    the next configuration is
  • (S0 X1 S1 X2 S2 ... Xm-r Sm-r A S, ai ai+1 ... an $)
  • If ACTION[Sm, ai] = Accept, the parse is complete
    and no errors were found
  • If ACTION[Sm, ai] = Error, the parser calls an
    error-handling routine (a sketch of the driver
    loop follows)
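
A minimal sketch of the driver loop these actions describe is given below, in the style of the earlier C fragments. The helpers action( ), goto_state( ), and next_token( ), the rule tables rule_len[ ] and lhs[ ], and all other names are assumptions; the real tables would be generated from the grammar by a tool such as yacc. Only states are pushed, since each state identifies the grammar symbol that led to it.

#include <stdio.h>

enum ActKind { SHIFT, REDUCE, ACCEPT, ERROR };
struct Act { enum ActKind kind; int arg; };     /* arg = state or rule number */

extern struct Act action( int state, int token );   /* ACTION table lookup (assumed) */
extern int goto_state( int state, int lhsSym );     /* GOTO table lookup (assumed)   */
extern int next_token( );                           /* next input token (assumed)    */
extern int rule_len[];    /* number of RHS symbols in each rule (assumed) */
extern int lhs[];         /* LHS nonterminal code of each rule (assumed)  */

void lr_parse( ) {
  int stack[1000];                 /* parse stack of states S0, S1, ... */
  int top = 0;
  stack[top] = 0;                  /* initial configuration: state S0   */
  int tok = next_token( );

  for ( ;; ) {
    struct Act a = action( stack[top], tok );
    if ( a.kind == SHIFT ) {             /* push the new state, advance input  */
      stack[++top] = a.arg;
      tok = next_token( );
    } else if ( a.kind == REDUCE ) {     /* pop r states for A → β, r = |β|    */
      top -= rule_len[a.arg];
      stack[top + 1] = goto_state( stack[top], lhs[a.arg] );   /* push GOTO[Sm-r, A] */
      top++;
    } else if ( a.kind == ACCEPT ) {
      printf( "parse complete, no errors found\n" );
      return;
    } else {                             /* ERROR */
      printf( "syntax error\n" );        /* call an error-handling routine */
      return;
    }
  }
}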