Language Translation Principles - PowerPoint PPT Presentation

Slides: 41
Provided by: cshe4
Learn more at: https://www.kirkwood.edu

1
Language Translation Principles
  • Part 1 Language Specification

2
Attributes of a language
  • Syntax: rules describing use of language tokens
  • Semantics: logical meaning of combinations of
    tokens
  • In a programming language, tokens include
    identifiers, keywords, and punctuation

3
Linguistic correctness
  • A syntactically correct program is one in which
    the tokens are arranged so that the code can be
    successfully translated into a lower-level
    language
  • A semantically correct program is one that
    produces correct results

4
Language translation tools
  • Parser: scans source code, compares with
    established syntax rules
  • Code generator: replaces high-level source code
    with semantically equivalent low-level code

5
Techniques to describe syntax of a language
  • Grammars: specify how you combine atomic elements
    of language (characters) to form legal strings
    (words, sentences)
  • Finite State Machines: specify syntax of a
    language through a series of interconnected
    diagrams
  • Regular Expressions: symbolic representation of
    patterns describing strings; applications include
    forming search queries as well as language
    specification
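
As a small illustration of a regular expression used as a language specification, the sketch below checks whether a string is an optionally signed integer. The pattern and the name `is_signed_int` are assumptions chosen for this example, not something taken from the slides.

```python
import re

# Optional sign, then one or more decimal digits (illustrative pattern).
SIGNED_INT = re.compile(r"[+-]?[0-9]+")

def is_signed_int(text):
    """Return True when the whole string matches the pattern."""
    return SIGNED_INT.fullmatch(text) is not None

print(is_signed_int("-7"), is_signed_int("1-2"))
```

`fullmatch` is used so that the entire string must fit the pattern, which mirrors the idea that a specification accepts or rejects whole strings.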

6
Elements of a Language
  • Alphabet: finite, non-empty set of characters
  • not precisely the same thing we mean when we
    speak of a natural language alphabet
  • for example, the alphabet of C includes the
    upper- and lowercase letters of the English
    alphabet, the digits 0-9, and the following
    punctuation symbols:
  • (, ), -, /, and other punctuation symbols
  • Pep/8 alphabet is similar, but uses less
    punctuation
  • Language of real numbers has its own alphabet:
    the set of characters {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, .}

7
Language as ADT
  • A language is an example of an Abstract Data Type
    (ADT)
  • An ADT has these characteristics
  • Set of possible values (an alphabet)
  • Set of operations on those values
  • One of the operations on the set of values in a
    language is concatenation

8
Concatenation
  • Concatenation is the joining of two or more
    characters to form a string
  • Many programming language tokens are formed this
    way; for example:
  • and form
  • and form
  • 1, 2, 3 and 4 form 1234
  • Concatenation always involves two operands;
    either one can be a string or a single character

9
String characteristics
  • The number of characters in a string is the
    string's length
  • An empty string is a string with length 0; we
    denote the empty string with the symbol ε
  • ε is the identity element for concatenation:
    if x is a string, then
  • εx = xε = x

10
Closure of an alphabet
  • The set of all possible strings that can be formed
    by concatenating elements from an alphabet is the
    alphabet's closure, denoted T* for some alphabet
    T
  • The closure of an alphabet includes strings that
    are not valid tokens in the language; it is not a
    finite set
  • For example, if R is the real number alphabet,
    then R* includes
  • -0.092 and 563.18 but also
  • .0.0.- and 2-4-2.9..-5.
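
A minimal sketch of the idea: because the closure is infinite, the code below only enumerates the strings up to a chosen maximum length. The function name `closure_up_to` and the three-symbol sample alphabet are assumptions made for illustration.

```python
from itertools import product

def closure_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

# A three-symbol slice of the real-number alphabet keeps the output small.
strings = list(closure_up_to(["1", "-", "."], 2))
print(strings)
```

The first string produced is the empty string, and most of the others (such as `-.`) are not valid real numbers, which is the point the slide makes about the closure.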

11
Languages and Grammars
  • A language is a subset of the closure of an
    alphabet
  • A grammar specifies how to concatenate symbols
    from an alphabet to form legal strings in a
    language

12
Parts of a grammar
  • N: a nonterminal alphabet; each element of N
    represents a group of characters from T
  • T: a terminal alphabet
  • P: a set of rules for string production; uses
    nonterminals to describe language structure
  • S: the start symbol, an element of N

13
Terminal vs. non-terminal symbols
  • A non-terminal symbol is used to describe or
    represent a set of terminal symbols
  • For example, the following standard data types
    are terminal symbols in C and Java: int, double,
    float, char
  • A single non-terminal symbol could be
    used to represent any or all of these

14
Valid strings
  • S (the start symbol) is a single symbol, not a
    set
  • Given S and P (rules for production), you can
    decide whether a set of symbols is a valid string
    in the language
  • Conversely, starting from S, if you can generate
    a string of terminal symbols using P, you can
    create a valid string

15
Productions
A → w
  • A: a non-terminal
  • →: produces
  • w: a string of terminals and non-terminals

16
Derivations
  • A grammar specifies a language through the
    derivation process
  • begin with the start symbol
  • substitute for non-terminals using rules of
    production until you get a string of terminals

17
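Example: a grammar for identifiers (a toy example)
  • N = {<identifier>, <letter>, <digit>}
  • T = {a, b, c, 1, 2, 3}
  • P: the productions (→ means produces)
  • 1. <identifier> → <letter>
  • 2. <identifier> → <identifier><letter>
  • 3. <identifier> → <identifier><digit>
  • 4. <letter> → a
  • 5. <letter> → b
  • 6. <letter> → c
  • 7. <digit> → 1
  • 8. <digit> → 2
  • 9. <digit> → 3
  • S = <identifier>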

18
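Example: deriving a12bc
  • <identifier> ⇒ <identifier><letter> (rule 2)
  • ⇒ <identifier>c (rule 6)
  • ⇒ <identifier><letter>c (rule 2)
  • ⇒ <identifier>bc (rule 5)
  • ⇒ <identifier><digit>bc (rule 3)
  • ⇒ <identifier>2bc (rule 8)
  • ⇒ <identifier><digit>2bc (rule 3)
  • ⇒ <identifier>12bc (rule 7)
  • ⇒ <letter>12bc (rule 1)
  • ⇒ a12bc (rule 4)
  • ⇒ means "derives in one step"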
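
The derivation can be replayed mechanically. The sketch below stores the nine productions in slide order and applies them by rule number. The nonterminal names `<identifier>`, `<letter>` and `<digit>` and the helper names `apply_rule` and `derive` are assumptions made for this illustration.

```python
# Productions numbered in slide order; the symbol names are assumed.
RULES = {
    1: ("<identifier>", ["<letter>"]),
    2: ("<identifier>", ["<identifier>", "<letter>"]),
    3: ("<identifier>", ["<identifier>", "<digit>"]),
    4: ("<letter>", ["a"]),
    5: ("<letter>", ["b"]),
    6: ("<letter>", ["c"]),
    7: ("<digit>", ["1"]),
    8: ("<digit>", ["2"]),
    9: ("<digit>", ["3"]),
}

def apply_rule(form, rule_no):
    """Rewrite the first occurrence of the rule's left side in a list of symbols."""
    lhs, rhs = RULES[rule_no]
    i = form.index(lhs)  # raises ValueError if the rule does not apply
    return form[:i] + rhs + form[i + 1:]

def derive(start, rule_sequence):
    """Apply the rules in order, starting from the start symbol."""
    form = [start]
    for rule_no in rule_sequence:
        form = apply_rule(form, rule_no)
    return "".join(form)

result = derive("<identifier>", [2, 6, 2, 5, 3, 8, 3, 7, 1, 4])
print(result)
```

Each intermediate list of symbols corresponds to one line of the derivation; the final list contains only terminals.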

19
Closure of derivation
  • The symbol ⇒* means derives in 0 or more steps
  • A language specified by a grammar consists of all
    strings derivable from the start symbol using the
    rules of production
  • provides operational test for membership in the
    language
  • if a string can't be derived using production
    rules, it isn't in the language

20
Example: attempting to derive 2a
  • <identifier> ⇒ <identifier><letter>
  • ⇒ <identifier>a
  • Since there is no <identifier> → <digit>
    combination in the production rules, we can't
    proceed any further
  • This means that 2a isn't a valid string in our
    language
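
One way to make the membership test operational is an exhaustive search over intermediate strings. The sketch below is a breadth-first search, not a technique named in the slides. It uses single-letter stand-ins (`I`, `L`, `D`) for the identifier, letter and digit nonterminals, which is an assumption for brevity. Because no production in this grammar shortens a string, any intermediate string longer than the target can be discarded, so the search always terminates.

```python
from collections import deque

# I = identifier, L = letter, D = digit (stand-in names, assumed).
PRODUCTIONS = [
    ("I", "L"), ("I", "IL"), ("I", "ID"),
    ("L", "a"), ("L", "b"), ("L", "c"),
    ("D", "1"), ("D", "2"), ("D", "3"),
]

def derivable(target, start="I"):
    """Breadth-first search for a derivation of `target` from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for lhs, rhs in PRODUCTIONS:
            i = form.find(lhs)
            while i != -1:
                new = form[:i] + rhs + form[i + 1:]
                # No rule shrinks a string, so longer forms are dead ends.
                if len(new) <= len(target) and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return False

print(derivable("a12bc"), derivable("2a"))
```

The search confirms both results from the slides: a12bc has a derivation, while every attempt at 2a runs out of applicable rules.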

21
A grammar for signed integers
  • N = {I, F, M}
  • I means integer
  • F means first symbol: optional sign
  • M means magnitude
  • T = {+, -, d} (d means digit 0-9)
  • P: the productions
  • I → FM
  • F → +
  • F → -
  • F → ε (means +/- is optional)
  • M → dM
  • M → d
  • S = I

22
Examples
  • Deriving 14:
  • I ⇒ FM
  • ⇒ εM
  • ⇒ dM
  • ⇒ dd
  • ⇒ 14
  • Deriving -7:
  • I ⇒ FM
  • ⇒ -M
  • ⇒ -d
  • ⇒ -7
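
A minimal sketch of a recognizer that follows the signed-integer productions directly, with one function per nonterminal. It assumes the two sign terminals are + and -; the function names are hypothetical.

```python
DIGITS = "0123456789"

def parse_F(s, i):
    """F -> + | - | (empty): consume an optional sign, return the next index."""
    return i + 1 if i < len(s) and s[i] in "+-" else i

def parse_M(s, i):
    """M -> dM | d: consume one or more digits; return next index or None."""
    if i >= len(s) or s[i] not in DIGITS:
        return None
    rest = parse_M(s, i + 1)                    # try M -> dM first
    return rest if rest is not None else i + 1  # otherwise M -> d

def is_signed_integer(s):
    """I -> FM: valid when F followed by M consumes the whole string."""
    return parse_M(s, parse_F(s, 0)) == len(s)

print(is_signed_integer("14"), is_signed_integer("-7"), is_signed_integer("+"))
```

The recursive call in `parse_M` corresponds to the recursive production M → dM, and the fallback corresponds to M → d.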

23
Recursive rules
  • Both of the previous examples (identifiers,
    integers) have rules in which a nonterminal is
    defined in terms of itself
  • <identifier> → <identifier><letter> and
  • M ? dM
  • Such rules produce languages with infinite sets
    of legal sentences
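
To see why recursion gives an infinite language, the sketch below enumerates the identifier language shortest-first with a generator that never finishes on its own. It relies on the observation that the toy identifier grammar produces a letter followed by any mix of letters and digits. The name `identifiers` is hypothetical.

```python
from itertools import count, islice, product

LETTERS = "abc"
DIGITS = "123"

def identifiers():
    """Yield identifiers shortest first; the stream has no last element."""
    for n in count(1):
        for first in LETTERS:
            for tail in product(LETTERS + DIGITS, repeat=n - 1):
                yield first + "".join(tail)

# Only a finite prefix of the infinite stream can ever be inspected.
first_five = list(islice(identifiers(), 5))
print(first_five)
```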

24
Context-sensitive grammar
  • A grammar in which the production rules may
    contain more than a single non-terminal on the
    left side
  • By contrast, all of the examples we have seen
    thus far have production rules restricted to a
    single non-terminal on the left; these are known
    as context-free grammars

25
Example
  • N = {A, B, C}
  • T = {a, b, c}
  • P: the productions
  • A → aABC
  • A → abC
  • CB → BC
  • bB → bb
  • bC → bc (this rule is context-sensitive: C can be
    substituted with c only if C is immediately
    preceded by b)
  • cC → cc
  • S = A

26
Context-sensitive grammar
  • N = {A, B, C}
  • T = {a, b, c}
  • P: the productions
  • 1. A → aABC
  • 2. A → abC
  • 3. CB → BC
  • 4. bB → bb
  • 5. bC → bc
  • 6. cC → cc
  • S = A

Example: aaabbbccc is a valid string, by:
  • A ⇒ aABC (1)
  • ⇒ aaABCBC (1)
  • ⇒ aaabCBCBC (2)
  • ⇒ aaabBCCBC (3)
  • ⇒ aaabBCBCC (3)
  • ⇒ aaabBBCCC (3)
  • ⇒ aaabbBCCC (4)
  • ⇒ aaabbbCCC (4)
  • ⇒ aaabbbcCC (5); here, we substituted c for C;
    this is allowable only if C has b in front of it
  • ⇒ aaabbbccC (6)
  • ⇒ aaabbbccc (6)

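
The same derivation can be checked with plain string rewriting. In this sketch each numbered production replaces the leftmost occurrence of its left side, which happens to reproduce every step listed above. The helper names are hypothetical.

```python
# The six productions, numbered as in the derivation.
RULES = {
    1: ("A", "aABC"),
    2: ("A", "abC"),
    3: ("CB", "BC"),
    4: ("bB", "bb"),
    5: ("bC", "bc"),
    6: ("cC", "cc"),
}

def apply_rule(form, rule_no):
    """Rewrite the leftmost occurrence of the rule's left side."""
    lhs, rhs = RULES[rule_no]
    if lhs not in form:
        raise ValueError(f"rule {rule_no} does not apply to {form!r}")
    return form.replace(lhs, rhs, 1)

def derive(rule_sequence, start="A"):
    """Return every intermediate string, starting with the start symbol."""
    forms = [start]
    for rule_no in rule_sequence:
        forms.append(apply_rule(forms[-1], rule_no))
    return forms

steps = derive([1, 1, 2, 3, 3, 3, 4, 4, 5, 6, 6])
print(steps[-1])
```

Rules 4 to 6 only fire when the required neighbor is present, which is what makes the grammar context-sensitive: `apply_rule("aC", 5)` raises an error because the C is not preceded by b.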
27
Valid and invalid strings from previous example
  • Valid
  • abc
  • aabbcc
  • aaabbbccc
  • aaaabbbbcccc
  • Invalid
  • aabc
  • cba
  • bbbccc

The grammar describes a language consisting of
strings that start with a number of a's, followed
by an equal number of b's and c's; this language
can be defined mathematically as L = { a^n b^n c^n | n > 0 }.
Note: a^n means the concatenation of n a's
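
The mathematical definition translates into a direct membership check. This is a sketch of the definition itself, not of the grammar; the function name `in_language` is hypothetical.

```python
def in_language(s):
    """True when s is n a's, then n b's, then n c's, for some n > 0."""
    n = len(s) // 3
    return n > 0 and s == "a" * n + "b" * n + "c" * n

valid = ["abc", "aabbcc", "aaabbbccc", "aaaabbbbcccc"]
invalid = ["aabc", "cba", "bbbccc"]
print([in_language(s) for s in valid], [in_language(s) for s in invalid])
```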
28
A grammar for expressions
  • N = {E, T, F} where
  • E: expression
  • T: term
  • F: factor
  • T = {+, *, (, ), a}
  • P: the productions
  • 1. E → E + T
  • 2. E → T
  • 3. T → T * F
  • 4. T → F
  • 5. F → (E)
  • 6. F → a
  • S = E

29
Applying the grammar
  • You can't reach a valid conclusion if you don't
    have a valid string, but the opposite is not true
  • For example, suppose we want to parse the string
    (a * a) + a using the grammar we just saw
  • First attempt:
  • E ⇒ T (by rule 2)
  • ⇒ F (by rule 4)
  • and we're stuck, because F can only produce
    (E) or a; so we reach a dead end, even though the
    string is valid

30
Applying the grammar
  • Here's a parse that works for (a * a) + a:
  • E ⇒ E + T (rule 1)
  • ⇒ T + T (rule 2)
  • ⇒ F + T (rule 4)
  • ⇒ (E) + T (rule 5)
  • ⇒ (T) + T (rule 2)
  • ⇒ (T * F) + T (rule 3)
  • ⇒ (T * a) + T (rule 6)
  • ⇒ (F * a) + F (rule 4 applied twice)
  • ⇒ (a * a) + a (rule 6 applied twice)

31
Deriving a valid string from a grammar
  • Arbitrarily pick a nonterminal in the current
    intermediate string and select rules for
    substitution until you get a string of terminals
  • Automatic translators have a more difficult
    problem:
  • given a string of terminals, determine if the
    string is valid, then produce matching object code
  • the only way to determine string validity is to
    derive it from the start symbol of the grammar;
    this is called parsing

32
The parsing problem
  • Automatic translators aren't at liberty to pick
    rules randomly (as illustrated by the first
    attempt to translate the preceding expression)
  • Parsing algorithm must search for the right
    sequence of substitutions to derive a proposed
    string
  • Translator must also be able to prove that no
    derivation exists if proposed string is not valid
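
One common way to avoid guessing rules is recursive descent, sketched below for the expression grammar. This is a swapped-in technique, not something the slides prescribe. It assumes the grammar's two operators are + and *, and it rewrites the left-recursive rules as loops (E is a T followed by any number of "+ T", and likewise for T), because a recursive-descent procedure cannot call itself on the same input position without looping forever. All function names are hypothetical.

```python
def parse(s):
    """Raise SyntaxError unless s is a valid expression over +, *, (, ), a."""
    tokens = [ch for ch in s if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at token {pos}")
        pos += 1

    def expr():    # E -> T { + T }
        term()
        while peek() == "+":
            expect("+")
            term()

    def term():    # T -> F { * F }
        factor()
        while peek() == "*":
            expect("*")
            factor()

    def factor():  # F -> ( E ) | a
        if peek() == "(":
            expect("(")
            expr()
            expect(")")
        else:
            expect("a")

    expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token at {pos}")

def is_valid(s):
    try:
        parse(s)
        return True
    except SyntaxError:
        return False

print(is_valid("(a * a) + a"), is_valid("a + "))
```

A failed `parse` also serves as the proof that no derivation exists: each procedure has exactly one choice per lookahead token, so there is no alternative left to try.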

33
Syntax tree
  • A parse routine can be represented as a tree
  • start symbol is the root
  • interior nodes are nonterminal symbols
  • leaf nodes are terminal symbols
  • children of an interior node are symbols from
    right side of production rule substituted for
    parent node in derivation
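
A small sketch of such a tree as nested data, built by hand for the expression (a * a) + a under the assumption that the grammar's operators are + and *. Each node is a (symbol, children) pair; the helpers `node` and `leaves` are hypothetical names. Reading the leaves from left to right recovers the original string.

```python
def node(symbol, *children):
    """A tree node: a symbol plus its ordered children (none for a leaf)."""
    return (symbol, list(children))

def leaves(tree):
    """Collect terminal symbols from left to right."""
    symbol, children = tree
    if not children:
        return [symbol]
    found = []
    for child in children:
        found.extend(leaves(child))
    return found

# Inner expression a * a, derived as E -> T -> T * F.
inner = node("E",
             node("T",
                  node("T", node("F", node("a"))),
                  node("*"),
                  node("F", node("a"))))

# Whole expression: E -> E + T, with the left E producing ( inner ).
tree = node("E",
            node("E", node("T", node("F", node("("), inner, node(")")))),
            node("+"),
            node("T", node("F", node("a"))))

print("".join(leaves(tree)))
```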

34
Syntax tree for (a * a) + a
35
Grammar for a programming language
  • A grammar for a subset of the C language is
    laid out on pages 340-341 of the textbook
  • A sampling (suitable for either C or Java) is
    given on the next couple of slides

36
Rules for declarations
  • -
  • - char int double
  • (remember, this is subset of actual language)
  • -
  • ,
  • -
  • <letter> → a | b | c | ... | z | A | B | ... | Z
  • <digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

37
Rules for control structures
  • -
  • if ()
  • if ()
  • else
  • -
  • while ()
  • do while ()

38
Rules for expressions
  • -
  • -
  • -




  • etc.

39
Backus-Naur Form (BNF)
  • BNF is the standardized form for specification of
    a programming language by its rules of production
  • In BNF, the → operator is written ::=
  • ALGOL-60 first popularized the form

40
BNF described in terms of itself (from Wikipedia)