Lecture 4 Concepts of Programming Languages

Title: Lecture Data Structures and Practise Author: arne Last modified by: arne Created Date: 2/27/2003 6:18:58 AM Document presentation format: On-screen Show – PowerPoint PPT presentation

1
Lecture 4: Concepts of Programming Languages
  • Arne Kutzner
  • Hanyang University / Seoul Korea

2
Topics
  • Lexical Analysis
  • The Parsing Problem
  • Recursive-Descent Parsing
  • Bottom-Up Parsing

3
Introduction
  • Language implementation systems must analyze
    source code, regardless of the specific
    implementation approach
  • Nearly all syntax analysis is based on a formal
    description of the syntax of the source language
    (BNF)

4
Syntax Analysis
  • The syntax analysis portion of a language
    processor nearly always consists of two parts
  • A low-level part called a lexical analyzer
    (mathematically, a finite automaton based on a
    regular grammar)
  • A high-level part called a syntax analyzer, or
    parser (mathematically, a push-down automaton
    based on a context-free grammar, or BNF)

5
Advantages of Using CFG/BNF to Describe Syntax
  • Provides a clear and concise syntax description
  • The parser can be constructed directly from the
    CFG/BNF description

6
Lexical Analysis
  • A lexical analyzer is a front end for the
    parser
  • It is a pattern matcher for character strings
  • It identifies substrings of the source program
    that belong together: the lexemes
  • Lexemes match a character pattern, which is
    associated with a lexical category called a token
  • Example: sum is a lexeme; its token may be IDENT

7
Reasons to Separate Lexical and Syntax Analysis
  • Simplicity - less complex approaches can be used
    for lexical analysis (no need for the use of
    grammars for token extraction)
  • Efficiency - separation allows significantly
    less complex parsers

8
First, we need some theory
9
Regular Expressions
  • Given a finite alphabet Σ, the following
    constants are defined as regular expressions
  • (empty set) ∅ denoting the set ∅.
  • (empty string) ε denoting the set containing only
    the "empty" string, which has no characters at
    all.
  • (literal character) a in Σ denoting the set
    containing only the character a.

10
Regular Expressions (cont.)
  • Given regular expressions R and S, the following
    operations over them are defined to produce
    regular expressions
  • (concatenation) RS denotes the set of strings
    that can be obtained by concatenating a string in
    R and a string in S. For example, {"ab", "c"}{"d",
    "ef"} = {"abd", "abef", "cd", "cef"}.
  • (alternation) R | S denotes the set union of the
    sets described by R and S. For example, if R
    describes {"ab", "c"} and S describes {"ab", "d",
    "ef"}, the expression R | S describes {"ab", "c",
    "d", "ef"}.
  • Alternation is sometimes denoted by + (R + S)

11
Regular Expressions (cont.)
  • (Kleene star) R* denotes the smallest superset
    of the set described by R that contains ε and is
    closed under string concatenation. This is the
    set of all strings that can be made by
    concatenating any finite number (including zero)
    of strings from the set described by R. For
    example, {"0", "1"}* is the set of all finite
    binary strings (including the empty string), and
    {"ab", "c"}* = {ε, "ab", "c", "abab", "abc",
    "cab", "cc", "ababab", "abcab", ...}.

12
Regular Languages
  • The collection of regular languages over an
    alphabet Σ is defined recursively as follows
  • The empty language ∅ is a regular language.
  • For each a ∈ Σ (a belongs to Σ), the singleton
    language {a} is a regular language.
  • If A and B are regular languages, then A ∪ B
    (union), A · B (concatenation), and A* (Kleene
    star) are regular languages.
  • No other languages over Σ are regular.

13
Regular Expressions and Regular Languages
  • The family of languages defined by regular
    expressions is exactly the family of regular
    languages.
  • Regular expressions can be used for lexeme/token
    description/specification. E.g. description of a
    token Identifier as Letter (Digit | Letter)*
  • Regular expressions are generators, like grammars
  • In fact, every regular expression can be
    described by means of a grammar

14
Examples of regular expressions
  • What are the words of the following expressions?
  • (0 | 1)(00 | 01 | 10 | 11)
  • (0 | 1)(0 | 1)(0 | 1)(0 | 1)(0 | 1)
  • Are the languages of the following 3 expressions
    equal?
  • 0* 1 (0 | 1)*
  • (0 | 1)* 1 (0 | 1)*
  • (0 | 1)* 1 0*

15
Recognizer for Regular Expressions
  • A deterministic finite automaton (DFA) M is a
    5-tuple, (Q, Σ, δ, q0, F), consisting of
  • a finite set of states (Q)
  • a finite set of input symbols called the alphabet
    (Σ)
  • a transition function (δ : Q × Σ → Q)
  • a start state (q0 ∈ Q)
  • a set of accept states (F ⊆ Q)

16
Language accepted by a DFA
  • Let w = a1a2 ... an be a string over the alphabet
    Σ. The automaton M accepts the string w if a
    sequence of states, r0, r1, ..., rn, exists in Q
    with the following conditions
  • r0 = q0
  • r_{i+1} = δ(r_i, a_{i+1}), for i = 0, ..., n-1
  • rn ∈ F.

17
DFA Example
  • M = (Q, Σ, δ, q0, F) where
  • Q = {S1, S2},
  • Σ = {0, 1},
  • q0 = S1,
  • F = {S1},
  • δ is the following state transition table

        State   0    1
        S1      S2   S1
        S2      S1   S2

[Figure: corresponding state diagram for M]
18
DFA Example (cont.)
  • The language recognized by M is the regular
    language given by the regular expression
    1* (0 1* 0 1*)*
  • The accepted language consists of all words that
    contain an even number of 0s.
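
The acceptance condition from slide 16 can be run mechanically on this example. The following is a minimal sketch, not part of the original slides: the names dfa_accepts and delta and the array encoding are assumptions, and the transitions are the ones implied by "even number of 0s" (reading 0 toggles between S1 and S2, reading 1 keeps the state).

```c
#include <stdbool.h>
#include <stddef.h>

/* States of the example DFA M; S1 is the start state and the only accept state. */
enum state { S1 = 0, S2 = 1 };

/* delta[state][symbol]: column 0 is input '0', column 1 is input '1'. */
static const enum state delta[2][2] = {
    /* S1 */ { S2, S1 },
    /* S2 */ { S1, S2 },
};

/* Returns true iff M accepts w, i.e. w is over {0,1} and has an even number of 0s. */
bool dfa_accepts(const char *w)
{
    enum state r = S1;                      /* r0 = q0 */
    for (size_t i = 0; w[i] != '\0'; i++) {
        if (w[i] != '0' && w[i] != '1')
            return false;                   /* symbol outside the alphabet */
        r = delta[r][w[i] - '0'];           /* r_{i+1} = delta(r_i, a_{i+1}) */
    }
    return r == S1;                         /* r_n must be in F */
}
```

Calling dfa_accepts("1001") follows the state sequence S1, S1, S2, S1, S1 and ends in the accept state, whereas dfa_accepts("10") ends in S2 and is rejected.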

19
Kleene's Theorem
  • Part 1: If R is a regular expression over the
    alphabet Σ, and L is the language in Σ*
    corresponding to R, then there is a
    (deterministic) finite automaton M recognizing L.
  • Part 2: If M = (Q, Σ, δ, q0, F) is a
    (deterministic) finite automaton recognizing the
    language L, then there is a regular expression
    over Σ corresponding to L.
  • So, DFAs recognize exactly the regular languages,
    i.e. the languages described by regular
    expressions.

20
Limitations of regular languages
  • There is no regular expression for the language
    {1^n 0^n | n ≥ 0} (n ones followed by n zeros)
  • But you can easily give a CFG for the above
    language: <A> → 1 <A> 0 | ε
  • Other example: the Dyck language of balanced
    strings of parentheses
  • Grammar? (→ Exercise)
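
Why the CFG succeeds where a finite automaton cannot is easiest to see in code. The sketch below is an illustration only (the function names parse_A and is_ones_then_zeros are assumptions): it follows the rule for A directly, and the depth of the recursion remembers how many 1s are still waiting for their matching 0, which is exactly the unbounded memory a DFA lacks.

```c
#include <stdbool.h>

/* Parses one <A> starting at *p and advances *p past it.
   Grammar: <A> -> 1 <A> 0 | epsilon */
static bool parse_A(const char **p)
{
    if (**p == '1') {          /* lookahead 1: choose the rule 1 <A> 0 */
        (*p)++;
        if (!parse_A(p))
            return false;
        if (**p != '0')
            return false;      /* the matching 0 is missing */
        (*p)++;
    }
    return true;               /* any other lookahead: choose epsilon */
}

/* Returns true iff w consists of n ones followed by n zeros, n >= 0. */
bool is_ones_then_zeros(const char *w)
{
    const char *p = w;
    return parse_A(&p) && *p == '\0';
}
```

For example, is_ones_then_zeros("1100") succeeds, while "110" fails because the outer call finds the end of the input where it expects its 0.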

21
Practical implementation of lexical analyzers
  • DFAs and regular expressions are the foundations
    of lexical analyzer construction
  • Possible approaches for implementing a lexical
    analyzer
  • Write a formal description of the tokens and use
    a software tool that constructs table-driven
    lexical analyzers given such a description
  • Design a state diagram that describes the tokens
    and write a program that implements the state
    diagram
  • Design a state diagram that describes the tokens
    and hand-construct a table-driven implementation
    of the state diagram

22
Lexical Analysis (cont.)
  • In many cases, symbols of transitions are
    combined/grouped in order to simplify the state
    diagram
  • When recognizing an identifier, all uppercase and
    lowercase letters are equivalent
  • Use a character class that includes all letters
  • When recognizing an integer literal, all digits
    are equivalent - use a digit class

23
Lexical Analysis (cont.)
  • Reserved words can be recognized in the context
    of identifier recognition
  • Use a table lookup to determine whether a
    possible identifier is in fact a reserved word

24
Lexical Analysis (cont.): Example Program
  • The proposed lexical analyzer is a function that
    is called by the parser whenever it requests a
    fresh token/lexeme
  • Utility subprograms
  • getChar - gets the next character of input, puts
    it in nextChar, determines its class and puts the
    class in charClass
  • addChar - puts the character from nextChar into
    the place the lexeme is being accumulated, lexeme
  • lookup - determines whether the string in lexeme
    is a reserved word (returns a code)

25
State diagram for recognizing identifiers and
integer numbers
26
Lexical Analysis / Example Prg.
    int lex() {
      lexLen = 0;
      static int first = 1;
      /* If it is the first call to lex, initialize
         by calling getChar */
      if (first) {
        getChar();
        first = 0;
      }
      getNonBlank();
      switch (charClass) {
        /* Parse identifiers and reserved words */
        case LETTER:
          addChar();
          getChar();
          while (charClass == LETTER || charClass == DIGIT) {
            addChar();
            getChar();
          }
          /* lookup returns the reserved-word code or the
             identifier code for the accumulated lexeme */
          return lookup(lexeme);

27
Lexical Analysis / Example Prg.
        /* Parse integer literals */
        case DIGIT:
          addChar();
          getChar();
          while (charClass == DIGIT) {
            addChar();
            getChar();
          }
          return INT_LIT;
          break;
      } /* End of switch */
    } /* End of function lex */
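
The slides describe getChar, addChar and lookup only in words. The following self-contained sketch shows one way they could fit together with lex. Everything beyond the names used on the slides is an assumption made for illustration: the token code values, the reserved words while and if, the END class, the UNKNOWN_TOKEN and EOF_TOKEN codes, and the helper lexInit, which replaces the "first call" flag by reading from a string.

```c
#include <ctype.h>
#include <string.h>

/* Character classes and token codes (the numeric values are arbitrary). */
enum { LETTER, DIGIT, UNKNOWN, END };
enum { INT_LIT = 10, IDENT = 11, WHILE_CODE = 30, IF_CODE = 31,
       UNKNOWN_TOKEN = 98, EOF_TOKEN = 99 };

const char *input;      /* unread source text */
int charClass;          /* class of nextChar */
char nextChar;          /* most recently read character */
char lexeme[100];       /* lexeme being accumulated */
int lexLen;

/* getChar: read the next character into nextChar and classify it. */
static void getChar(void)
{
    nextChar = *input;
    if (nextChar == '\0') { charClass = END; return; }
    input++;
    if (isalpha((unsigned char)nextChar))      charClass = LETTER;
    else if (isdigit((unsigned char)nextChar)) charClass = DIGIT;
    else                                       charClass = UNKNOWN;
}

/* addChar: append nextChar to lexeme (overlong lexemes are truncated). */
static void addChar(void)
{
    if (lexLen < (int)sizeof lexeme - 1) {
        lexeme[lexLen++] = nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* getNonBlank: skip white space in front of the next lexeme. */
static void getNonBlank(void)
{
    while (charClass != END && isspace((unsigned char)nextChar))
        getChar();
}

/* lookup: table lookup that separates reserved words from identifiers. */
static int lookup(const char *s)
{
    static const struct { const char *word; int code; } reserved[] = {
        { "while", WHILE_CODE }, { "if", IF_CODE },
    };
    for (size_t i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (strcmp(s, reserved[i].word) == 0)
            return reserved[i].code;
    return IDENT;
}

/* lexInit (hypothetical): start scanning the given string. */
void lexInit(const char *src)
{
    input = src;
    getChar();
}

/* lex: return the code of the next token; its text is left in lexeme. */
int lex(void)
{
    lexLen = 0;
    lexeme[0] = '\0';
    getNonBlank();
    switch (charClass) {
    case LETTER:                      /* identifiers and reserved words */
        addChar(); getChar();
        while (charClass == LETTER || charClass == DIGIT) { addChar(); getChar(); }
        return lookup(lexeme);
    case DIGIT:                       /* integer literals */
        addChar(); getChar();
        while (charClass == DIGIT) { addChar(); getChar(); }
        return INT_LIT;
    case END:
        return EOF_TOKEN;
    default:                          /* any other single character */
        addChar(); getChar();
        return UNKNOWN_TOKEN;
    }
}
```

After lexInit("sum 42 while"), three calls to lex return IDENT, INT_LIT and WHILE_CODE, with lexeme holding "sum", "42" and "while" in turn. Note that every call leaves the first character after the lexeme in nextChar, which is why getNonBlank can start from the current character.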

28
Parsing Problem
29
The Parsing Problem
  • Goals of the parser, given an input program
  • Produce a parse tree
  • Find all syntax errors; for each, produce an
    appropriate diagnostic message and recover
    quickly

30
The Parsing Problem (cont.)
  • Two categories of parsers
  • Top down - produce the parse tree, beginning at
    the root
  • Order is that of a leftmost derivation
  • Traces or builds the parse tree in preorder
  • Bottom up - produce the parse tree, beginning at
    the leaves
  • Order is that of the reverse of a rightmost
    derivation
  • Useful parsers look only one token ahead in the
    input

31
The Parsing Problem (cont.)
  • Top-down Parsers
  • Given a sentential form, xAα, the parser must
    choose the correct A-rule to get the next
    sentential form in the leftmost derivation, using
    only the first token produced by A
  • The most common top-down parsing algorithms
  • Recursive descent - a coded implementation
  • LL parsers - table driven implementation

32
The Parsing Problem (cont.)
  • Bottom-up parsers
  • Special form of push down automata
  • Given a right sentential form, α, determine what
    substring of α is the right-hand side of the rule
    in the grammar that must be reduced to produce
    the previous sentential form in the rightmost
    derivation
  • The most common bottom-up parsing algorithms are
    in the LR family

33
Recursive-Descent Parsing
  • Approach - Coded parser
  • Subprogram for each nonterminal in the grammar,
    which can parse sentences that can be generated
    by that nonterminal
  • EBNF is well suited as the basis of a
    recursive-descent parser, because EBNF minimizes
    the number of nonterminals

34
Recursive-Descent Parsing (cont.)
  • A grammar for simple expressions
  • <expr> → <term> {(+ | -) <term>}
  • <term> → <factor> {(* | /) <factor>}
  • <factor> → id | ( <expr> )

35
Recursive-Descent Parsing (cont.)
  • Assume we have a lexical analyzer named lex,
    which puts the next token code in nextToken
  • The coding process when there is only one RHS
  • For each terminal symbol in the RHS, compare it
    with the next input token; if they match,
    continue, else there is an error
  • For each nonterminal symbol in the RHS, call its
    associated parsing subprogram

36
Recursive-Descent Parsing (cont.)
    /* Function expr
       Parses strings in the language
       generated by the rule
       <expr> → <term> {(+ | -) <term>}
     */
    void expr() {
      /* Parse the first term */
      term();

37
Recursive-Descent Parsing
      /* As long as the next token is + or -, call
         lex to get the next token, and parse the
         next term */
      while (nextToken == PLUS_CODE ||
             nextToken == MINUS_CODE) {
        lex();
        term();
      }
    }
  • This particular routine does not detect errors
  • Convention: Every parsing routine leaves the next
    token in nextToken

38
Recursive-Descent Parsing (cont.)
  • A nonterminal that has more than one RHS requires
    an initial process to determine which RHS it is
    to parse
  • The correct RHS is chosen on the basis of the
    next token of input (the lookahead)
  • The next token is compared with the first token
    that can be generated by each RHS until a match
    is found
  • If no match is found, it is a syntax error

39
Recursive-Descent Parsing (cont.)
    /* Function factor
       Parses strings in the language
       generated by the rule
       <factor> → id | ( <expr> ) */
    void factor() {
      /* Determine which RHS */
      if (nextToken == ID_CODE)
        /* For the RHS id, just call lex */
        lex();

40
Recursive-Descent Parsing (cont.)
      /* If the RHS is ( <expr> ), call lex to pass
         over the left parenthesis, call expr, and
         check for the right parenthesis */
      else if (nextToken == LEFT_PAREN_CODE) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN_CODE)
          lex();
        else
          error();
      } /* End of else if (nextToken == ... */
      else
        error(); /* Neither RHS matches */
    }
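
Putting the three routines together gives a complete recognizer for the expression grammar of slide 34. The sketch below is an illustration under stated assumptions: the lexer is a simplified stand-in (identifiers are runs of letters, every operator is one character), error only records that a syntax error occurred, and the names src, failed, END_CODE, BAD_CODE and parses_as_expr are not from the slides.

```c
#include <ctype.h>
#include <stdbool.h>

enum { ID_CODE, PLUS_CODE, MINUS_CODE, MULT_CODE, DIV_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, END_CODE, BAD_CODE };

static const char *src;   /* unread input */
static int nextToken;     /* one token of lookahead */
static bool failed;       /* set by error() */

/* Simplified stand-in for lex(): puts the next token code in nextToken. */
static void lex(void)
{
    while (*src == ' ')
        src++;
    char c = *src;
    if (c == '\0') { nextToken = END_CODE; return; }
    if (isalpha((unsigned char)c)) {
        while (isalpha((unsigned char)*src))
            src++;
        nextToken = ID_CODE;
        return;
    }
    src++;
    switch (c) {
    case '+': nextToken = PLUS_CODE; break;
    case '-': nextToken = MINUS_CODE; break;
    case '*': nextToken = MULT_CODE; break;
    case '/': nextToken = DIV_CODE; break;
    case '(': nextToken = LEFT_PAREN_CODE; break;
    case ')': nextToken = RIGHT_PAREN_CODE; break;
    default:  nextToken = BAD_CODE; break;
    }
}

static void error(void) { failed = true; }

static void expr(void);

/* <factor> -> id | ( <expr> ) */
static void factor(void)
{
    if (nextToken == ID_CODE) {
        lex();
    } else if (nextToken == LEFT_PAREN_CODE) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN_CODE)
            lex();
        else
            error();
    } else {
        error();                       /* neither RHS matches */
    }
}

/* <term> -> <factor> {(* | /) <factor>} */
static void term(void)
{
    factor();
    while (nextToken == MULT_CODE || nextToken == DIV_CODE) {
        lex();
        factor();
    }
}

/* <expr> -> <term> {(+ | -) <term>} */
static void expr(void)
{
    term();
    while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
        lex();
        term();
    }
}

/* Returns true iff the whole text is a sentence generated by <expr>. */
bool parses_as_expr(const char *text)
{
    src = text;
    failed = false;
    lex();                             /* load the first lookahead token */
    expr();
    return !failed && nextToken == END_CODE;
}
```

The final check nextToken == END_CODE matters: expr happily stops after the first complete expression, so an input such as "a b" is only rejected because a token is left over.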

41
Recursive-Descent Parsing (cont.)
  • The Left Recursion Problem: If a grammar
    contains left recursion, either direct or
    indirect, it cannot be the basis of a top-down
    (recursive-descent) parser
  • A grammar can be modified so that it becomes
    free of left recursion
  • LL Grammar Class: grammars suitable for top-down
    parsing; they contain no left recursion

42
Elimination of left recursion
  • Direct recursion
  • For each nonterminal A,
  • 1. Group the A-rules as
    A → Aα1 | … | Aαm | β1 | β2 | … | βn
    where none of the β's begins with A
  • 2. Replace the original A-rules with
    A → β1A' | β2A' | … | βnA'
    A' → α1A' | α2A' | … | αmA' | ε
  • Indirect recursion
  • See the separate PDF document
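
As a worked instance of the two steps for direct recursion, take the familiar left-recursive expression rule (this example is an illustration and is not taken from the slides). Here m = 1 and n = 1:

```latex
\begin{align*}
\text{Original:}\quad  & E \rightarrow E + T \mid T
   && (\alpha_1 = {+}\,T,\ \beta_1 = T) \\
\text{Rewritten:}\quad & E \rightarrow T\,E' \\
                       & E' \rightarrow {+}\,T\,E' \mid \varepsilon
\end{align*}
```

Both grammars generate T, T + T, T + T + T, and so on, but in the rewritten form every rule consumes a token before it recurses, so a recursive-descent routine for E' cannot call itself forever without reading input.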

43
Recursive-Descent Parsing (cont.)
  • The other characteristic of grammars that
    disallows top-down parsing is the lack of
    pairwise disjointness
  • The inability to determine the correct RHS on the
    basis of one token of lookahead
  • Def: FIRST(α) = {a | α =>* aβ}
  • (If α =>* ε, then ε is in FIRST(α))

44
Recursive-Descent Parsing (cont.)
  • Pairwise Disjointness Test
  • For each nonterminal, A, in the grammar that has
    more than one RHS, for each pair of rules, A → αi
    and A → αj, it must be true that
  • FIRST(αi) ∩ FIRST(αj) = ∅
  • Examples
  • A → a | bB | cAb (passes: the FIRST sets are
    {a}, {b}, {c})
  • A → a | aB (fails: both RHSs begin with a)
45
Recursive-Descent Parsing (cont.)
  • Left factoring can be used to restore pairwise
    disjointness.
  • Example:
    <variable> → ident | ident '(' <expression> ')'
  • Left factor to
    <variable> → ident <new>
    <new> → ε | '(' <expression> ')'
  • or in EBNF:
    <variable> → ident [ '(' <expression> ')' ]
  • Problem with the first transformation:
    introduction of an ε rule. (Troublemaker in the
    context of the elimination of left recursion)

46
Bottom-up Parsing
  • The parsing problem is finding the correct RHS in
    a right-sentential form to reduce to get the
    previous right-sentential form in the derivation
  • Bottom-up parsers are an extended form of
    pushdown automata.

47
Definition Pushdown Automaton
  • A PDA is formally defined as a 7-tuple (Q, Σ,
    Γ, δ, q0, Z, F), where
  • Q is a finite set of states
  • Σ is a finite set which is called the input
    alphabet
  • Γ is a finite set which is called the stack
    alphabet
  • δ : Q × (Σ ∪ {ε}) × Γ → Q × Γ*, the transition
    function
  • q0 ∈ Q is the start state
  • Z ∈ Γ is the initial stack symbol
  • F ⊆ Q is the set of accepting states

48
PDA computation
  • Assume δ of M maps (p, a, A) to (q, α) and that M
    is
  • in state p ∈ Q,
  • with a ∈ (Σ ∪ {ε}) on input
  • and A ∈ Γ as topmost stack symbol,
  • Then M performs the following actions
  • may read a (move one position right on input)
  • change the state to q
  • pop A, replacing it by α
  • IMPORTANT: The (Σ ∪ {ε}) component of the
    transition relation is used to formalize that the
    PDA can either read a letter from the input, or
    proceed leaving the input untouched.

49
PDA computation graphically
50
Example PDA
  • M = (Q, Σ, Γ, δ, p, Z, F), where
  • states Q = {p, q, r}
  • input alphabet Σ = {0, 1}
  • stack alphabet Γ = {A, Z}
  • start state q0 = p
  • start stack symbol Z
  • accepting states F = {r}

Move number   State   Input   Stack symbol   Moves
1             p       0       Z              p, AZ
2             p       0       A              p, AA
3             p       ε       Z              q, Z
4             p       ε       A              q, A
5             q       1       A              q, ε
6             q       ε       Z              r, Z
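
Because the PDA may choose between reading a 0 and taking an ε-move, a simulation has to try both. The following sketch is one possible way to do that, by depth-first search over configurations; it is an illustration, not the slides' method. The names struct move, run and pda_accepts, the use of '\0' to encode ε, and the stack limit MAX_STACK are assumptions. Acceptance is by final state with all input consumed.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_STACK 256

/* One row of the move table: in state `from`, with `input` next on the
   input ('\0' stands for an epsilon move) and `top` on the stack, go to
   state `to` and replace `top` by `push` (its first character ends on top). */
struct move { char from, input, top, to; const char *push; };

static const struct move moves[] = {
    { 'p', '0',  'Z', 'p', "AZ" },   /* move 1 */
    { 'p', '0',  'A', 'p', "AA" },   /* move 2 */
    { 'p', '\0', 'Z', 'q', "Z"  },   /* move 3 */
    { 'p', '\0', 'A', 'q', "A"  },   /* move 4 */
    { 'q', '1',  'A', 'q', ""   },   /* move 5 */
    { 'q', '\0', 'Z', 'r', "Z"  },   /* move 6 */
};

/* Tries every applicable move from the configuration (state, in, stack).
   stack[0..depth-1] holds the stack contents, bottom first. */
static bool run(char state, const char *in, const char *stack, size_t depth)
{
    if (state == 'r' && *in == '\0')
        return true;                         /* accepting state, input consumed */
    if (depth == 0)
        return false;                        /* every move needs a stack symbol */
    for (size_t m = 0; m < sizeof moves / sizeof moves[0]; m++) {
        const struct move *mv = &moves[m];
        if (mv->from != state || mv->top != stack[depth - 1])
            continue;
        if (mv->input != '\0' && mv->input != *in)
            continue;
        size_t len = strlen(mv->push);
        if (depth - 1 + len > MAX_STACK)
            continue;                        /* sketch limit: give up on this branch */
        char next[MAX_STACK];
        memcpy(next, stack, depth - 1);      /* pop the old top ... */
        for (size_t k = 0; k < len; k++)     /* ... and push the replacement */
            next[depth - 1 + k] = mv->push[len - 1 - k];
        if (run(mv->to, mv->input != '\0' ? in + 1 : in, next, depth - 1 + len))
            return true;
    }
    return false;
}

/* Returns true iff the example PDA accepts w. */
bool pda_accepts(const char *w)
{
    return run('p', w, "Z", 1);
}
```

On the input 0011 the successful branch uses moves 1, 2, 4, 5, 5, 6 in that order: two As are pushed, the ε-move switches to state q, each 1 pops one A, and the final ε-move on Z enters the accepting state r.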
51
Language of example PDA
  • PDA for the language {0^n 1^n | n ≥ 0}
  • Corresponding grammar: <A> → 0 <A> 1 | ε

52
Important Lemmas
  • For every grammar G there is a pushdown automaton
    M, so that the language generated by G is
    recognized by the automaton M.
  • For every PDA M there is a grammar G, so that the
    language recognized by M is generated by the
    grammar G.
  • PDAs and context-free grammars are equivalent
    concepts with respect to the languages they
    recognize/generate.

53
Bottom-up Parsing / Handles
  • Definitions of Handle / Phrase / Simple Phrase
  • β is the handle of the right sentential form γ =
    αβw if and only if S =>*rm αAw =>rm αβw
  • β is a phrase of the right sentential form γ if
    and only if S =>* γ = α1Aα2 =>+ α1βα2
  • β is a simple phrase of the right sentential form
    γ if and only if S =>* γ = α1Aα2 => α1βα2

54
Bottom-up Parsing (cont.)
  • Shift-Reduce Algorithms
  • Reduce is the action of replacing the handle on
    the top of the parse stack with its corresponding
    LHS
  • Shift is the action of moving the next token to
    the top of the parse stack

55
Bottom-up Parsing (cont.)
  • Advantages of LR parsers
  • They will work for nearly all grammars that
    describe programming languages.
  • They can detect syntax errors as soon as it is
    possible.
  • The LR class of grammars is a superset of the
    class parsable by LL parsers.

56
Bottom-up Parsing (cont.)
  • LR parsers must be constructed with a tool
  • Knuth's insight: A bottom-up parser could use the
    entire history of the parse, up to the current
    point, to make parsing decisions
  • There were only a finite and relatively small
    number of different parse situations that could
    have occurred, so the history could be stored in
    a parser state, on the parse stack

57
Bottom-up Parsing (cont.)
  • An LR configuration stores the state of an LR
    parser
  • (S0 X1 S1 X2 S2 … Xm Sm, ai ai+1 … an)

58
Bottom-up Parsing (cont.)
  • LR parsers are table driven, where the table has
    two components, an ACTION table and a GOTO table
  • The ACTION table specifies the action of the
    parser, given the parser state and the next token
  • Rows are state names; columns are terminals
  • The GOTO table specifies which state to put on
    top of the parse stack after a reduction action
    is done
  • Rows are state names; columns are nonterminals

59
Structure of An LR Parser
60
Bottom-up Parsing (cont.)
  • Initial configuration: (S0, a1 … an)
  • Parser actions
  • If ACTION[Sm, ai] = Shift S, the next
    configuration is
  • (S0 X1 S1 X2 S2 … Xm Sm ai S, ai+1 … an)
  • If ACTION[Sm, ai] = Reduce A → β and S =
    GOTO[Sm-r, A], where r = the length of β, the
    next configuration is
    (S0 X1 S1 X2 S2 … Xm-r Sm-r A S, ai ai+1 … an)

61
Bottom-up Parsing (cont.)
  • Parser actions (continued)
  • If ACTION[Sm, ai] = Accept, the parse is complete
    and no errors were found.
  • If ACTION[Sm, ai] = Error, the parser calls an
    error-handling routine.
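
The four parser actions translate almost line by line into a driver loop. The sketch below is an illustration under several assumptions. It uses the SLR(1) table commonly printed in compiler textbooks for the grammar E → E + T | T, T → T * F | F, F → ( E ) | id, which is not necessarily the table shown on the slide. Each input character is one token, with i standing for id and the end of the string acting as end marker. The stack stores only states, since the grammar symbols Xi are implied by them. The encoding of table entries and all identifiers are made up for this example.

```c
#include <stdbool.h>

enum { T_ID, T_PLUS, T_STAR, T_LPAREN, T_RPAREN, T_END, T_BAD };
enum { N_E, N_T, N_F };
#define ACC 99

/* ACTION[state][terminal]: n > 0 means shift and go to state n,
   n < 0 means reduce by rule -n, ACC means accept, 0 means error. */
static const int ACTION[12][6] = {
/*          id   +   *   (   )  end */
/*  0 */ {   5,  0,  0,  4,  0,  0 },
/*  1 */ {   0,  6,  0,  0,  0, ACC },
/*  2 */ {   0, -2,  7,  0, -2, -2 },
/*  3 */ {   0, -4, -4,  0, -4, -4 },
/*  4 */ {   5,  0,  0,  4,  0,  0 },
/*  5 */ {   0, -6, -6,  0, -6, -6 },
/*  6 */ {   5,  0,  0,  4,  0,  0 },
/*  7 */ {   5,  0,  0,  4,  0,  0 },
/*  8 */ {   0,  6,  0,  0, 11,  0 },
/*  9 */ {   0, -1,  7,  0, -1, -1 },
/* 10 */ {   0, -3, -3,  0, -3, -3 },
/* 11 */ {   0, -5, -5,  0, -5, -5 },
};

/* GOTO[state][nonterminal]; entries that are never consulted are 0. */
static const int GOTO[12][3] = {
    { 1, 2, 3 }, { 0, 0, 0 }, { 0, 0, 0 }, { 0, 0, 0 },
    { 8, 2, 3 }, { 0, 0, 0 }, { 0, 9, 3 }, { 0, 0, 10 },
    { 0, 0, 0 }, { 0, 0, 0 }, { 0, 0, 0 }, { 0, 0, 0 },
};

/* Rules 1..6: E->E+T, E->T, T->T*F, T->F, F->(E), F->id (index 0 unused). */
static const int RULE_LHS[7] = { 0, N_E, N_E, N_T, N_T, N_F, N_F };
static const int RULE_LEN[7] = { 0, 3, 1, 3, 1, 3, 1 };

static int token_of(char c)
{
    switch (c) {
    case 'i':  return T_ID;
    case '+':  return T_PLUS;
    case '*':  return T_STAR;
    case '(':  return T_LPAREN;
    case ')':  return T_RPAREN;
    case '\0': return T_END;
    default:   return T_BAD;
    }
}

/* Returns true iff input is a sentence of the expression grammar. */
bool lr_parse(const char *input)
{
    int stack[256];                    /* state stack; stack[top] is Sm */
    int top = 0;
    stack[0] = 0;                      /* initial configuration (S0, a1 ... an) */
    for (;;) {
        int tok = token_of(*input);
        if (tok == T_BAD)
            return false;
        int act = ACTION[stack[top]][tok];
        if (act == ACC)
            return true;               /* Accept */
        if (act > 0) {                 /* Shift S */
            if (top == 255)
                return false;          /* sketch limit on nesting depth */
            stack[++top] = act;
            input++;
        } else if (act < 0) {          /* Reduce A -> beta */
            int rule = -act;
            top -= RULE_LEN[rule];     /* pop r = |beta| states */
            stack[top + 1] = GOTO[stack[top]][RULE_LHS[rule]];
            top++;
        } else {
            return false;              /* Error */
        }
    }
}
```

For the input i+i*i the driver shifts i, reduces it up to E, shifts + and the second i, and then shifts * before reducing i+i, which is how the table encodes the higher precedence of multiplication.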

62
LR Parsing Table
63
Bottom-up Parsing (cont.)
  • A parser table can be generated from a given
    grammar with a tool, e.g., yacc