Language Translation Principles

Slides: 41

Provided by: cshe4

Learn more at: https://www.kirkwood.edu

1
Language Translation Principles

Part 1 Language Specification

2
Attributes of a language

Syntax: rules describing use of language tokens
Semantics: logical meaning of combinations of tokens
In a programming language, tokens include identifiers, keywords, and punctuation

3
Linguistic correctness

A syntactically correct program is one in which
the tokens are arranged so that the code can be
successfully translated into a lower-level
language
A semantically correct program is one that
produces correct results

4
Language translation tools

Parser: scans source code, compares with established syntax rules
Code generator: replaces high-level source code with semantically equivalent low-level code

5
Techniques to describe syntax of a language

Grammars: specify how you combine atomic elements of language (characters) to form legal strings (words, sentences)
Finite State Machines: specify syntax of a language through a series of interconnected diagrams
Regular Expressions: symbolic representation of patterns describing strings; applications include forming search queries as well as language specification

6
Elements of a Language

Alphabet: a finite, non-empty set of characters
not precisely the same thing we mean when we speak of a natural-language alphabet
for example, the alphabet of C includes the upper- and lowercase letters of the English alphabet, the digits 0-9, and the following punctuation symbols:
,,,,(,),,-,,/,,,,
The Pep/8 alphabet is similar, but uses less punctuation
The language of real numbers has its own alphabet: the set of characters {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, .}

7
Language as ADT

A language is an example of an Abstract Data Type
(ADT)
An ADT has these characteristics
Set of possible values (an alphabet)
Set of operations on those values
One of the operations on the set of values in a
language is concatenation

8
Concatenation

Concatenation is the joining of two or more
characters to form a string
Many programming language tokens are formed this way; for example, 1, 2, 3 and 4 form 1234
Concatenation always involves two operands; either one can be a string or a single character
9
String characteristics

The number of characters in a string is the string's length
An empty string is a string with length 0; we denote the empty string with the symbol ε
ε is the identity element for concatenation: if x is a string, then
εx = xε = x
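
The two slides above can be illustrated with a minimal Python sketch. The names `concat` and `EMPTY` are hypothetical helpers introduced here; Python's empty string `""` stands in for the empty-string symbol.

```python
# Strings under concatenation, with the empty string as the identity element.
EMPTY = ""  # stands in for the empty-string symbol (length 0)

def concat(x: str, y: str) -> str:
    """Join two operands; each may be a string or a single character."""
    return x + y

# Build the token 1234 by repeated two-operand concatenation.
number = EMPTY
for ch in ["1", "2", "3", "4"]:
    number = concat(number, ch)

print(number)                                            # 1234
print(len(EMPTY))                                        # 0
print(concat(EMPTY, "x") == "x" == concat("x", EMPTY))   # True
```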

10
Closure of an alphabet

The set of all possible strings that can be formed by concatenating elements from an alphabet is the alphabet's closure, denoted T* for some alphabet T
The closure of an alphabet includes strings that are not valid tokens in the language; it is not a finite set
For example, if R is the real number alphabet, then R* includes
-0.092 and 563.18 but also
.0.0.- and 2-4-2.9..-5.
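
Because the closure is infinite, it can only be listed up to some length. The following sketch enumerates it for short strings, assuming the real-number alphabet consists of the ten digits plus `+`, `-`, and `.`; the function name `closure_up_to` is hypothetical.

```python
from itertools import product

# Assumed real-number alphabet: ten digits plus '+', '-', and '.'.
R = list("0123456789+-.")

def closure_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

strings = list(closure_up_to(R, 2))
print(len(strings))      # 1 + 13 + 169 = 183
print("-." in strings)   # True: in the closure, yet not a valid real number
```

Each extra unit of length multiplies the count by 13, which is one way to see that the full closure has no upper bound on its size.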

11
Languages and Grammars

A language is a subset of the closure of an
alphabet
A grammar specifies how to concatenate symbols
from an alphabet to form legal strings in a
language

12
Parts of a grammar

N: a nonterminal alphabet; each element of N represents a group of characters from T
T: a terminal alphabet
P: a set of rules for string production; uses nonterminals to describe language structure
S: the start symbol, an element of N

13
Terminal vs. non-terminal symbols

A non-terminal symbol is used to describe or
represent a set of terminal symbols
For example, the following standard data types are terminal symbols in C and Java: int, double, float, char
A single non-terminal symbol could be used to represent any or all of these
14
Valid strings

S (the start symbol) is a single symbol, not a
set
Given S and P (rules for production), you can
decide whether a set of symbols is a valid string
in the language
Conversely, starting from S, if you can generate
a string of terminal symbols using P, you can
create a valid string

15
Productions
A → w
A is a non-terminal
→ means "produces"
w is a string of terminals and non-terminals
16
Derivations

A grammar specifies a language through the
derivation process
begin with the start symbol
substitute for non-terminals using rules of
production until you get a string of terminals

17
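Example: a grammar for identifiers (a toy example)

N = {<identifier>, <letter>, <digit>}
T = {a, b, c, 1, 2, 3}
P = the productions (→ means produces)
1. <identifier> → <letter>
2. <identifier> → <identifier><letter>
3. <identifier> → <identifier><digit>
4. <letter> → a
5. <letter> → b
6. <letter> → c
7. <digit> → 1
8. <digit> → 2
9. <digit> → 3
S = <identifier>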

18
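Example: deriving a12bc

<identifier> ⇒ <identifier><letter> (rule 2)
⇒ <identifier>c (rule 6)
⇒ <identifier><letter>c (rule 2)
⇒ <identifier>bc (rule 5)
⇒ <identifier><digit>bc (rule 3)
⇒ <identifier>2bc (rule 8)
⇒ <identifier><digit>2bc (rule 3)
⇒ <identifier>12bc (rule 7)
⇒ <letter>12bc (rule 1)
⇒ a12bc (rule 4)
⇒ means "derives in one step"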
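
The derivation can be replayed mechanically. The sketch below is one possible encoding, not the slides' own notation: the nonterminal names `identifier`, `letter`, and `digit` are assumptions (the transcript does not preserve the originals), and the rule numbers follow the order the derivation cites them in. Each step rewrites the rightmost nonterminal.

```python
# Toy identifier grammar; nonterminal names are assumed for illustration.
RULES = {
    1: ("identifier", ["letter"]),
    2: ("identifier", ["identifier", "letter"]),
    3: ("identifier", ["identifier", "digit"]),
    4: ("letter", ["a"]),
    5: ("letter", ["b"]),
    6: ("letter", ["c"]),
    7: ("digit", ["1"]),
    8: ("digit", ["2"]),
    9: ("digit", ["3"]),
}
NONTERMINALS = {"identifier", "letter", "digit"}

def apply_rule(form, rule_no):
    """Rewrite the rightmost nonterminal in `form` using the numbered rule."""
    lhs, rhs = RULES[rule_no]
    for i in range(len(form) - 1, -1, -1):
        if form[i] in NONTERMINALS:
            if form[i] != lhs:
                raise ValueError(f"rule {rule_no} does not apply to <{form[i]}>")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no nonterminal left to rewrite")

def show(form):
    """Render a sentential form, bracketing the nonterminals."""
    return "".join(f"<{s}>" if s in NONTERMINALS else s for s in form)

form = ["identifier"]  # begin with the start symbol
for rule_no in [2, 6, 2, 5, 3, 8, 3, 7, 1, 4]:
    form = apply_rule(form, rule_no)
    print(f"=> {show(form)}  (rule {rule_no})")
```

The last line printed is the string of terminals `a12bc`, at which point no nonterminal remains and the derivation stops.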

19
Closure of derivation

The symbol ⇒* means "derives in 0 or more steps"
A language specified by a grammar consists of all strings derivable from the start symbol using the rules of production
this provides an operational test for membership in the language
if a string can't be derived using the production rules, it isn't in the language

20
Example: attempting to derive 2a

<identifier> ⇒ <identifier><letter>
⇒ <identifier>a
Since there is no <identifier> → <digit> combination in the production rules, we can't proceed any further
This means that 2a isn't a valid string in our language
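
The operational membership test can be sketched as an exhaustive search over derivations. This is a brute-force illustration, not an efficient parsing algorithm; the single-letter nonterminals `I`, `L`, and `D` (identifier, letter, digit) and the function name `derivable` are assumptions made for the example.

```python
from collections import deque

# Toy identifier grammar with assumed one-letter nonterminal names:
# I = identifier, L = letter, D = digit.
PRODUCTIONS = [
    ("I", "L"), ("I", "IL"), ("I", "ID"),
    ("L", "a"), ("L", "b"), ("L", "c"),
    ("D", "1"), ("D", "2"), ("D", "3"),
]

def derivable(target, start="I"):
    """Breadth-first search over sentential forms.

    No rule shrinks a form, so any form longer than the target can be
    discarded, which keeps the search finite.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == target:
            return True
        for lhs, rhs in PRODUCTIONS:
            pos = form.find(lhs)
            while pos != -1:
                new = form[:pos] + rhs + form[pos + 1:]
                if len(new) <= len(target) and new not in seen:
                    seen.add(new)
                    queue.append(new)
                pos = form.find(lhs, pos + 1)
    return False

print(derivable("a12bc"))  # True
print(derivable("2a"))     # False
```

When the search runs out of forms without reaching the target, that exhaustion is the proof that no derivation exists.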

21
A grammar for signed integers

N = {I, F, M}
I means integer
F means first symbol (optional sign)
M means magnitude
T = {+, -, d} (d means digit 0-9)
P = the productions
I → FM
F → +
F → -
F → ε (means +/- is optional)
M → dM
M → d
S = I

22
Examples

Deriving 14:
I ⇒ FM
⇒ εM
⇒ dM
⇒ dd
⇒ 14

Deriving -7:
I ⇒ FM
⇒ -M
⇒ -d
⇒ -7
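
A recognizer for this grammar can follow the productions directly, with one function per nonterminal. The sketch below is a hypothetical illustration; the function names are not from the slides, and the terminal d is expanded to the ASCII digits 0-9.

```python
DIGITS = "0123456789"

def parse_F(s, i):
    """F -> + | - | empty: consume an optional sign, return the new index."""
    return i + 1 if i < len(s) and s[i] in "+-" else i

def parse_M(s, i):
    """M -> dM | d: consume one or more digits; return new index or None."""
    if i >= len(s) or s[i] not in DIGITS:
        return None
    rest = parse_M(s, i + 1)                    # try M -> dM first
    return rest if rest is not None else i + 1  # otherwise fall back to M -> d

def is_signed_integer(s):
    """I -> FM: valid only if F then M together consume the whole string."""
    return parse_M(s, parse_F(s, 0)) == len(s)

for text in ["14", "-7", "+305", "-", "1-2", ""]:
    print(repr(text), is_signed_integer(text))
```

The recursion in `parse_M` mirrors the recursive rule M → dM, which is what lets the magnitude be any number of digits long.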

23
Recursive rules

Both of the previous examples (identifiers, integers) have rules in which a nonterminal is defined in terms of itself:
<identifier> → <identifier><letter> and
M → dM
Such rules produce languages with infinite sets of legal sentences

24
Context-sensitive grammar

A grammar in which the production rules may contain more than one symbol (terminal or non-terminal) on the left side
The opposite (all of the examples we have seen thus far) has production rules restricted to a single non-terminal on the left; these are known as context-free grammars

25
Example

N = {A, B, C}
T = {a, b, c}
P = the productions
1. A → aABC
2. A → abC
3. CB → BC
4. bB → bb
5. bC → bc (This rule is context-sensitive: C can be substituted with c only if C is immediately preceded by b)
6. cC → cc
S = A

26
Context-sensitive grammar

N = {A, B, C}
T = {a, b, c}
P = the productions
1. A → aABC
2. A → abC
3. CB → BC
4. bB → bb
5. bC → bc
6. cC → cc
S = A

Example: aaabbbccc is a valid string by
A ⇒ aABC (1)
⇒ aaABCBC (1)
⇒ aaabCBCBC (2)
⇒ aaabBCCBC (3)
⇒ aaabBCBCC (3)
⇒ aaabBBCCC (3)
⇒ aaabbBCCC (4)
⇒ aaabbbCCC (4)
⇒ aaabbbcCC (5) Here, we substituted c for C; this is allowable only if C has b in front of it
⇒ aaabbbccC (6)
⇒ aaabbbccc (6)
27
Valid and invalid strings from previous example

Valid
abc
aabbcc
aaabbbccc
aaaabbbbcccc

Invalid
aabc
cba
bbbccc

The grammar describes a language consisting of strings that start with a number of a's, followed by an equal number of b's and c's; this language can be defined mathematically as
L = { a^n b^n c^n | n > 0 }
Note: a^n means the concatenation of n a's
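
The mathematical definition gives a direct membership check that does not use the grammar at all. A minimal sketch, with the hypothetical function name `in_language`:

```python
def in_language(s):
    """True if s is n a's, then n b's, then n c's, for some n > 0."""
    n, remainder = divmod(len(s), 3)
    return remainder == 0 and n > 0 and s == "a" * n + "b" * n + "c" * n

for s in ["abc", "aabbcc", "aaabbbccc", "aaaabbbbcccc"]:
    print(s, in_language(s))   # each of these is valid

for s in ["aabc", "cba", "bbbccc"]:
    print(s, in_language(s))   # each of these is invalid
```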
28
A grammar for expressions

N = {E, T, F} where
E = expression
T = term
F = factor
T = {+, *, (, ), a}
P = the productions
1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → (E)
6. F → a
S = E

29
Applying the grammar

You can't reach a valid conclusion if you don't have a valid string, but the opposite is not true: a valid string does not guarantee that every sequence of rule choices will reach it
For example, suppose we want to parse the string (a * a) + a using the grammar we just saw
First attempt:
E ⇒ T (by rule 2)
⇒ F (by rule 4)
and we're stuck, because F can only produce (E) or a; so we reach a dead end, even though the string is valid

30
Applying the grammar

Here's a parse that works for (a * a) + a
E ⇒ E + T (rule 1)
⇒ T + T (rule 2)
⇒ F + T (rule 4)
⇒ (E) + T (rule 5)
⇒ (T) + T (rule 2)
⇒ (T * F) + T (rule 3)
⇒ (T * a) + T (rule 6)
⇒ (F * a) + F (rule 4 applied twice)
⇒ (a * a) + a (rule 6 applied twice)

31
Deriving a valid string from a grammar

Arbitrarily pick a nonterminal on the right side of the current intermediate string; select rules for substitution until you get a string of terminals
Automatic translators have a more difficult problem:
given a string of terminals, determine if the string is valid, then produce matching object code
the only way to determine string validity is to derive it from the start symbol of the grammar; this is called parsing

32
The parsing problem

Automatic translators aren't at liberty to pick rules randomly (as illustrated by the first attempt to translate the preceding expression)
A parsing algorithm must search for the right sequence of substitutions to derive a proposed string
The translator must also be able to prove that no derivation exists if the proposed string is not valid

33
Syntax tree

A parse routine can be represented as a tree
start symbol is the root
interior nodes are nonterminal symbols
leaf nodes are terminal symbols
children of an interior node are symbols from
right side of production rule substituted for
parent node in derivation
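
One way to build such a tree is a small recursive-descent parser for the expression grammar. This is a sketch under stated assumptions: the grammar's operators are taken to be `+` and `*`, the left-recursive rules E → E + T and T → T * F are handled with loops (a naive recursive call would never terminate), and the tree is stored as nested tuples whose first item is the nonterminal. All function names are hypothetical.

```python
def parse(tokens):
    """Return a syntax tree for `tokens`, or raise SyntaxError."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at position {pos}")
        pos += 1

    def parse_E():
        node = ("E", parse_T())                  # E -> T
        while peek() == "+":
            expect("+")
            node = ("E", node, "+", parse_T())   # E -> E + T
        return node

    def parse_T():
        node = ("T", parse_F())                  # T -> F
        while peek() == "*":
            expect("*")
            node = ("T", node, "*", parse_F())   # T -> T * F
        return node

    def parse_F():
        if peek() == "(":
            expect("(")
            inner = parse_E()
            expect(")")
            return ("F", "(", inner, ")")        # F -> (E)
        expect("a")
        return ("F", "a")                        # F -> a

    tree = parse_E()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {tokens[pos]!r} at position {pos}")
    return tree

def leaves(node):
    """Read the terminal symbols off the tree from left to right."""
    if isinstance(node, str):
        return node
    return "".join(leaves(child) for child in node[1:])

tree = parse(list("(a*a)+a"))
print(tree)
print(leaves(tree))   # (a*a)+a
```

The root of the returned tuple is the start symbol E, every inner tuple is a nonterminal, and the bare strings are the terminal leaves, matching the description above. An invalid string such as `a+` raises `SyntaxError` instead of returning a tree.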

34
Syntax tree for (a * a) + a
35
Grammar for a programming language

A grammar for a subset of the C language is
laid out on pages 340-341 of the textbook
A sampling (suitable for either C or Java) is
given on the next couple of slides

36
Rules for declarations

→
→ char | int | double
(remember, this is a subset of the actual language)
→
,
→
→ a | b | c | ... | z | A | B | ... | Z
→ 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

37
Rules for control structures

→
if ()
if ()
else
→
while ()
do while ()
38
Rules for expressions

→
→
→
etc.

39
Backus-Naur Form (BNF)

BNF is the standardized form for specification of
a programming language by its rules of production
In BNF, the → operator is written ::=
ALGOL-60 first popularized the form