Title: Language Translation Principles
1Language Translation Principles
- Part 1 Language Specification
2Attributes of a language
- Syntax rules describing use of language tokens
- Semantics logical meaning of combinations of
tokens - In a programming language, tokens include
identifiers, keywords, and punctuation
3Linguistic correctness
- A syntactically correct program is one in which
the tokens are arranged so that the code can be
successfully translated into a lower-level
language - A semantically correct program is one that
produces correct results
4Language translation tools
- Parser scans source code, compares with
established syntax rules - Code generator replaces high level source code
with semantically equivalent low level code
5Techniques to describe syntax of a language
- Grammars specify how you combine atomic elements
of language (characters) to form legal strings
(words, sentences) - Finite State Machines specify syntax of a
language through a series of interconnected
diagrams - Regular Expressions symbolic representation of
patterns describing strings applications include
forming search queries as well as language
specification
6Elements of a Language
- Alphabet finite, non-empty set of characters
- not precisely the same thing we mean when we
speak of natural language alphabet - for example, the alphabet of C includes the
upper- and lowercase letters of the English
alphabet, the digits 0-9, and the following
punctuation symbols - ,,,,(,),,-,,/,,,,
- Pep/8 alphabet is similar, but uses less
punctuation - Language of real numbers has its own alphabet
the set of characters 0,1,2,3,4,5,6,7,8,9,,-,.
7Language as ADT
- A language is an example of an Abstract Data Type
(ADT) - An ADT has these characteristics
- Set of possible values (an alphabet)
- Set of operations on those values
- One of the operations on the set of values in a
language is concatenation
8Concatenation
- Concatenation is the joining of two or more
characters to form a string - Many programming language tokens are formed this
way for example - and form
- and form
- 1, 2, 3 and 4 form 1234
- Concatenation always involves two operands
either one can be a string or a single character
9String characteristics
- The number of characters in a string is the
strings length - An empty string is a string with length 0 we
denote the empty string with the symbol ? - The ? is the identity element for concatenation
if x is string, then - ?x x? x
10Closure of an alphabet
- The set of all possible strings that can formed
by concatenating elements from an alphabet is the
alphabets closure, denoted T for some alphabet
T - The closure of an alphabet includes strings that
are not valid tokens in the language it is not a
finite set - For example, if R is the real number alphabet,
then R includes - -0.092 and 563.18 but also
- .0.0.- and 2-4-2.9..-5.
11Languages Grammars
- A language is a subset of the closure of an
alphabet - A grammar specifies how to concatenate symbols
from an alphabet to form legal strings in a
language
12Parts of a grammar
- N a nonterminal alphabet each element of N
represents a group of characters from - T a terminal alphabet
- P a set of rules for string production uses
nonterminals to describe language structure - S the start symbol, an element of N
13Terminal vs. non-terminal symbols
- A non-terminal symbol is used to describe or
represent a set of terminal symbols - For example, the following standard data types
are terminal symols in C and Java int, double,
float, char - The non-terminal symbol could be
used to represent any or all of these
14Valid strings
- S (the start symbol) is a single symbol, not a
set - Given S and P (rules for production), you can
decide whether a set of symbols is a valid string
in the language - Conversely, starting from S, if you can generate
a string of terminal symbols using P, you can
create a valid string
15Productions
A ? w
a string of terminals non-terminals
a non-terminal
produces
16Derivations
- A grammar specifies a language through the
derivation process - begin with the start symbol
- substitute for non-terminals using rules of
production until you get a string of terminals
17Example a grammar for identifiers (a toy example)
- N , ,
- T a, b, c, 1, 2, 3
- P the productions (? means produces)
- ?
- ?
- ?
- ? a
- ? b
- ? c
- ? 1
- ? 2
- ? 3
- S
18Example deriving a12bc
- ? (rule 2)
- ? c (rule 6)
- ? means ? c (rule 2)
- derives in one ? bc (rule 5)
- step ? bc (rule 3)
- ? 2bc (rule 8)
- ? 2bc (rule 3)
- ? 12bc (rule 7)
- ? 12bc
- ? a12bc
19Closure of derivation
- The symbol ? means derives in 0 or more steps
- A language specified by a grammar consists of all
strings derivable from the start symbol using the
rules of production - provides operational test for membership in the
language - if a string cant be derived using production
rules, it isnt in the language
20Example attempting to derive 2a
- ?
- ? a
- Since there is no ?
combination in the production rules, we cant
proceed any further - This means that 2a isnt a valid string in our
language
21A grammar for signed integers
- N I, F, M
- I means integer
- F means first symbol optional sign
- M means magnitude
- T ,-,d (d means digit 0-9)
- P the productions
- I ? FM
- F?
- F? -
- F? ? (means /- is optional)
- M ? dM
- M ? d
- S I
22Examples
- Deriving 14 Deriving -7
- I ? FM I ? FM
- ? ?M ? -M
- ? dM ? -d
- ? dd ? -7
- ? 14
23Recursive rules
- Both of the previous examples (identifiers,
integers) have rules in which a nonterminal is
defined in terms of itself - ? and
- M ? dM
- Such rules produce languages with infinite sets
of legal sentences
24Context-sensitive grammar
- A grammar in which the production rules may
contain more than one non-terminal on the left
side - The opposite (all of the examples we have seen
thus far), have production rules restricted to a
single non-terminal on the left these are known
as context-free grammars
25Example
- N A,B,C
- T a,b,c
- P the productions
- A ? aABC
- A ? abC
- CB ? BC
- bB ? bb
- bC ? bc This rule is context-sensitive C can be
- substituted with c only if C is immediately
- preceded by b
- cC ? cc
- S A
26Context-sensitive grammar
- N A, B, C
- T a, b, c
- P the productions
- A -- aABC
- A -- abC
- CB -- BC
- bB -- bb
- bC -- bc
- cC -- cc
- S A
Example aaabbbcc is a valid string by A
aABC (1) aaABCBC (1) aaabCBCBC (2)
aaabBCCBC (3) aaabBCBCC (3) aaabBBCCC
(3) aaabbBCCC (4) aaabbbCCC (4)
aaabbbcCC (5) Here, we substituted c for
C this is allowable only if C has b in front
of it aaabbbccC (6) aaabbbccc (6)
27Valid invalid strings from previous example
- Valid
- abc
- aabbcc
- aaabbbccc
- aaaabbbbcccc
The grammar describes a language consisting of
strings that start with a number of as, followed
by an equal number of bs and cs this language
can be defined mathematically as L anbncn
n 0 Note an means the concatenation of n as
28A grammar for expressions
- N E, T, F where
- E expression
- T term T , , (, ), a
- F factor
- P the productions
- E - E T
- E - T
- T - T F
- T - F
- F - (E)
- F - a
- S E
29Applying the grammar
- You cant reach a valid conclusion if you dont
have a valid string, but the opposite is not true - For example, suppose we want to parse the string
(a a) a using the grammar we just saw - First attempt
- E T (by rule 2)
- F (by rule 4)
- and, were stuck, because F can only produce
(E) or a so we reach a dead end, even though the
string is valid
30Applying the grammar
- Heres a parse that works for (aa)a
- E ET (rule 1)
- TT (rule 2)
- FT (rule 4)
- (E)T (rule 5)
- (T)T (rule 2)
- (TF)T (rule 3)
- (Ta)T (rule 6)
- (Fa)F (rule 4 applied twice)
- (aa) a (rule 6 applied twice)
31Deriving a valid string from a grammar
- Arbitrarily pick a nonterminal on right side of
current intermediate string select rules for
substitution until you get a string of terminals - Automatic translators have more difficult
problem - given string of terminals, determine if string is
valid, then produce matching object code - only way to determine string validity is to
derive it from the start string of the grammar
this is called parsing
32The parsing problem
- Automatic translators arent at liberty to pick
rules randomly (as illustrated by the first
attempt to translate the preceding expression) - Parsing algorithm must search for the right
sequence of substitutions to derive a proposed
string - Translator must also be able to prove that no
derivation exists if proposed string is not valid
33Syntax tree
- A parse routine can be represented as a tree
- start symbol is the root
- interior nodes are nonterminal symbols
- leaf nodes are terminal symbols
- children of an interior node are symbols from
right side of production rule substituted for
parent node in derivation
34Syntax tree for (aa)a
35Grammar for a programming language
- A grammar for a subset of the C language is
laid out on pages 340-341 of the textbook - A sampling (suitable for either C or Java) is
given on the next couple of slides
36Rules for declarations
- -
- - char int double
- (remember, this is subset of actual language)
- -
- ,
- -
-
-
- - abc zABZ
- - 0123456789
37Rules for control structures
- -
- if ()
- if ()
- else
- -
- while ()
- do while ()
38Rules for expressions
39Backus-Naur Form (BNF)
- BNF is the standardized form for specification of
a programming language by its rules of production - In BNF, the - operator is written
- ALGOL-60 first popularized the form
40BNF described in terms of itself (from Wikipedia)
" ""
""
"
" ""
""
"
" '"' '"' "'"
"'"