Title: CS60057 Speech
1CS60057 Speech and Natural Language Processing
Lecture 3 27 July 2007
2Levels of (Formal) Description
- 6 basic levels (more or less explicitly present in most theories)
- and beyond (pragmatics/logic/...)
- meaning (semantics)
- (surface) syntax
- morphology
- phonology
- phonetics/orthography
- Each level has an input and output representation
- output from one level is the input to the next (upper) level
- sometimes levels might be skipped (merged) or split
3Phonetics/Orthography
- Input
- acoustic signal (phonetics) / text (orthography)
- Output
- phonetic alphabet (phonetics) / text (orthography)
- Deals with
- Phonetics
- consonant and vowel (and others) formation in the vocal tract
- classification of consonants, vowels, ... in relation to frequencies, shape and position of the tongue and various muscles
- intonation
- Orthography: normalization, punctuation, etc.
4Phonology
- Input
- sequence of phones/sounds (in a phonetic alphabet) or normalized text (sequence of (surface) letters in one language's alphabet). NB: phones vs. phonemes
- Output
- sequence of phonemes (~ (lexical) letters in an abstract alphabet)
- Deals with
- relation between sounds and phonemes (units which might have some function on the upper level)
- e.g. [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)
5Morphology
- Input
- sequence of phonemes (~ (lexical) letters)
- Output
- sequence of pairs (lemma, (morphological) tag)
- Deals with
- composition of phonemes into word forms and their underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding)
- e.g. quotations ~ quote/V + -ation (der. V->N) + NNS
6(Surface) Syntax
- Input
- sequence of pairs (lemma, (morphological) tag)
- Output
- sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms
- Deals with
- the relation between lemmas and morphological categories and the sentence structure
- uses syntactic categories such as Subject, Verb, Object, ...
- e.g. I/PP1 see/VB a/DT dog/NN ~ ((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S
7Meaning (semantics)
- Input
- sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions)
- Output
- sentence structure (tree) with annotated nodes (semantic lemmas, (morpho-syntactic) tags, deep functions)
- Deals with
- relation between categories such as Subject, Object and (deep) categories such as Agent, Effect; adds other categories
- e.g. ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~ (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)
8...and Beyond
- Input
- sentence structure (tree) with annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions)
- Output
- logical form, which can be evaluated (true/false)
- Deals with
- assignment of objects from the real world to the nodes of the sentence structure
- e.g. (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~ see(Mark-Twain[SSN: ...], Tom-Sawyer[SSN: ...]) [Time: bef 99/9/27/14:15] [Place: 39°19'40"N 76°37'10"W]
9Three Views
- Three equivalent formal ways to look at what we're up to (not including tables)
Regular Expressions
Finite State Automata
Regular Languages
10Transition
- Finite-state methods are particularly useful in dealing with a lexicon.
- Lots of devices, some with limited memory, need access to big lists of words.
- So we'll switch to talking about some facts about words and then come back to computational methods
11MORPHOLOGY
12Morphology
- Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes (morph = shape, logos = word)
- We can usefully divide morphemes into two classes
- Stems: the core meaning-bearing units
- Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
- Prefix: un-, anti-, etc.
- Suffix: -ity, -ation, etc.
- Infixes are inserted inside the stem
- Tagalog: um + hingi → humingi
- Circumfixes precede and follow the stem
- English doesn't stack many affixes.
- But Turkish can have words with a lot of suffixes.
- Languages such as Turkish, which tend to string affixes together, are called agglutinative languages.
13Surface and Lexical Forms
- The surface level of a word represents the actual spelling of that word.
- geliyorum, eats, cats, kitabim
- The lexical level of a word represents a simple concatenation of morphemes making up that word.
- gel +PROG +1SG
- eat +AOR
- cat +PLU
- kitap +P1SG
- Morphological processors try to find correspondences between lexical and surface forms of words.
- Morphological recognition/analysis: surface to lexical
- Morphological generation/synthesis: lexical to surface
14Morphology: Morphemes and Order
- Handles what is an isolated form in written text
- Grouping of phonemes into morphemes
- sequence deliverables ~ deliver, able and s (3 units)
- Morpheme Combination
- certain combinations/sequencing possible, other not
- deliver+able+s, but not able+derive+s; noun+s, but not noun+ing
- typically fixed (in any given language)
15Inflectional and Derivational Morphology
- We can also divide morphology up into two broad classes
- Inflectional
- Derivational
- Inflectional morphology concerns the combination of stems and affixes where the resulting word
- Has the same word class as the original
- Serves a grammatical/semantic purpose different from the original
- After a combination with an inflectional morpheme, the meaning and class of the actual stem usually do not change.
- eat / eats, pencil / pencils
- After a combination with a derivational morpheme, the meaning and the class of the actual stem usually change.
- compute / computer, do / undo, friend / friendly
- Uygar / uygarlas, kapi / kapici
- Irregular changes may happen with derivational affixes.
16Morphological Parsing
- Morphological parsing is to find the lexical form of a word from its surface form.
- cats -- cat +N +PLU
- cat -- cat +N +SG
- goose -- goose +N +SG or goose +V
- geese -- goose +N +PLU
- gooses -- goose +V +3SG
- catch -- catch +V
- caught -- catch +V +PAST or catch +V +PP
- There can be more than one lexical level representation for a given word. (ambiguity)
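As a rough illustration of parsing as a surface-to-lexical mapping, the examples on this slide can be stored in a lookup table. This is only a sketch: the names `ANALYSES`, `parse` and `generate` are invented here, and the `+` tag prefix is an assumed notation. A real analyser would not list every word form.

```python
# Toy lookup-table morphological parser (illustrative only).
# Each surface form maps to a list, because a form may be ambiguous.
ANALYSES = {
    "cats":   ["cat +N +PLU"],
    "cat":    ["cat +N +SG"],
    "goose":  ["goose +N +SG", "goose +V"],
    "geese":  ["goose +N +PLU"],
    "gooses": ["goose +V +3SG"],
    "catch":  ["catch +V"],
    "caught": ["catch +V +PAST", "catch +V +PP"],
}

def parse(surface):
    """Analysis: return every lexical form recorded for a surface form."""
    return ANALYSES.get(surface, [])

def generate(lexical):
    """Generation: return every surface form that has the given analysis."""
    return [s for s, forms in ANALYSES.items() if lexical in forms]

print(parse("caught"))
print(generate("goose +N +PLU"))
```

The same table serves both directions, which is the property the finite-state machinery later in the lecture provides without enumerating all forms.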
17Morphological Analysis
- Analyzing words into their linguistic components (morphemes).
- Morphemes are the smallest meaningful units of language.
- cars: car+PLU
- giving: give+PROG
- AsachhilAma: AsA+PROG+PAST+1st (I/We was/were coming)
- Ambiguity: more than one alternative
- flies: fly+VERB+3SG
- fly+NOUN+PLU
- mAtAla
- kare
18
- Fly + s → flys → flies (y → i rule)
- Duckling
- Go-getter → get + er
- Doer → do + er
- Beer → ?
- What knowledge do we need?
- How do we represent it?
- How do we compute with it?
19Knowledge needed
- Knowledge of stems or roots
- Duck is a possible root, not duckl
- We need a dictionary (lexicon)
- Only some endings go on some words
- Do + er: ok
- Be + er: not ok
- In addition, spelling change rules that adjust the surface form
- Get + er: double the t → getter
- Fox + s: insert e → foxes
- Fly + s → flys: insert e, y to i → flies
- Chase + ed: drop e → chased
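The spelling adjustments listed above can be sketched as a small function that joins a stem and a suffix. This is a rough heuristic, not a complete account of English spelling: the function name `attach` and the exact regular-expression contexts are assumptions chosen to cover the four examples on the slide, and the doubling rule in particular over-applies to many real words.

```python
import re

def attach(stem, suffix):
    """Join stem and suffix, applying a few English spelling adjustments."""
    if suffix == "s":
        if re.search(r"(x|s|z|ch|sh)$", stem):
            return stem + "es"               # fox + s -> foxes (insert e)
        if re.search(r"[^aeiou]y$", stem):
            return stem[:-1] + "ies"         # fly + s -> flies (y to i, insert e)
        return stem + "s"
    if suffix[0] in "aeiou":
        if stem.endswith("e"):
            return stem[:-1] + suffix        # chase + ed -> chased (drop e)
        if re.search(r"[^aeiou][aeiou][^aeiouwxy]$", stem):
            return stem + stem[-1] + suffix  # get + er -> getter (double consonant)
    return stem + suffix

for stem, suffix in [("get", "er"), ("fox", "s"), ("fly", "s"), ("chase", "ed"), ("do", "er")]:
    print(stem, "+", suffix, "->", attach(stem, suffix))
```

Note that the function says nothing about which combinations are allowed (it would happily build "be" + "er"); that knowledge belongs to the lexicon.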
20Put all this in a big dictionary (lexicon)
- Turkish: approx. 600 × 10^6 forms
- Finnish: 10^7
- Hindi, Bengali, Telugu, Tamil?
- Besides, novel forms can always be constructed
- Anti-missile
- Anti-anti-missile
- Anti-anti-anti-missile
- ..
- Compounding of words: Sanskrit, German
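The anti-missile series shows why a finite word list cannot be complete, while a finite pattern can still cover it. A minimal sketch, using a regular expression (one of the three equivalent views mentioned earlier); the names `ANTI_MISSILE` and `is_anti_form` are invented for this example.

```python
import re

# A finite pattern that describes an unbounded set of word forms:
# zero or more "anti-" prefixes followed by the stem "missile".
ANTI_MISSILE = re.compile(r"(anti-)*missile")

def is_anti_form(word):
    """True if word is 'missile' preceded by any number of 'anti-' prefixes."""
    return ANTI_MISSILE.fullmatch(word.lower()) is not None

print(is_anti_form("Anti-anti-anti-missile"))
print(is_anti_form("anti-missiles"))
```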
21Morphology: From Morphemes to Lemmas and Categories
- Lemma: lexical unit, pointer to lexicon
- typically is represented as the base form, or dictionary headword
- possibly indexed when ambiguous/polysemous
- state1 (verb), state2 (state-of-the-art), state3 (government)
- from one or more morphemes (root, stem, root+derivation, ...)
- Categories: non-lexical
- small number of possible values (< 100, often < 5-10)
22Morphology Level: The Mapping
- Formally: $A^{+} \rightarrow 2^{(L, C_1, C_2, \ldots, C_n)}$
- A is the alphabet of phonemes ($A^{+}$ denotes any non-empty sequence of phonemes)
- L is the set of possible lemmas, uniquely identified
- $C_i$ are morphological categories, such as
- grammatical number, gender, case
- person, tense, negation, degree of comparison, voice, aspect, ...
- tone, politeness, ...
- part of speech (not quite morphological category, but...)
- A, L and $C_i$ are obviously language-dependent
23Morphological Analysis (cont.)
- Relatively simple for English.
- But for many Indian languages, it may be more difficult.
- Examples
- Inflectional and Derivational Morphology.
- Common tools: Finite-state transducers
24Bengali Verb Paradigms
25Bengali Verb morphology for one of the paradigms
28Finite State Machines
- FSAs are equivalent to regular languages
- FSTs are equivalent to regular relations (over pairs of regular languages)
- FSTs are like FSAs but with complex labels.
- We can use FSTs to transduce between surface and lexical levels.
29Simple Rules
30Adding in the Words
31Derivational Rules
32Parsing/Generation vs. Recognition
- Recognition is usually not quite what we need.
- Usually if we find some string in the language we need to find the structure in it (parsing)
- Or we have some structure and we want to produce a surface form (production/generation)
- Example
- From cats to cat +N +PL and back
33Morphological Parsing
- Given the input cats, we'd like to output cat +N +Pl, telling us that cat is a plural noun.
- Given the Spanish input bebo, we'd like to output beber +V +PInd +1P +Sg, telling us that bebo is the present indicative first person singular form of the Spanish verb beber, "to drink".
34Morphological Analyser
- To build a morphological analyser we need
- lexicon: the list of stems and affixes, together with basic information about them
- morphotactics: the model of morpheme ordering (e.g. the English plural morpheme follows a noun rather than a verb)
- orthographic rules: these spelling rules are used to model the changes that occur in a word, usually when two morphemes combine (e.g., fly + s → flies)
35Lexicon and Morphotactics
- Typically the list of word parts (lexicon) and the models of ordering can be combined together into an FSA which will recognise all the valid word forms.
- For this to be possible the word parts must first be classified into sublexicons.
- The FSA defines the morphotactics (ordering constraints).
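A minimal sketch of how sublexicons and an ordering FSA fit together, assuming the input has already been split into morphemes. The word lists, class names (`reg-noun` and so on) and state names are hypothetical and chosen only for illustration; spelling changes are ignored at this stage.

```python
# Hypothetical sublexicons: each morpheme class lists its members.
SUBLEXICONS = {
    "reg-noun":      {"cat", "dog", "fox"},
    "irreg-sg-noun": {"goose", "mouse"},
    "irreg-pl-noun": {"geese", "mice"},
    "plural":        {"s"},
}

# FSA over morpheme classes: state -> {class: next state}.
TRANSITIONS = {
    "q0": {"reg-noun": "q1", "irreg-sg-noun": "q2", "irreg-pl-noun": "q2"},
    "q1": {"plural": "q2"},
    "q2": {},
}
ACCEPTING = {"q1", "q2"}

def accepts(morphemes):
    """True if the morpheme sequence obeys the ordering constraints."""
    state = "q0"
    for m in morphemes:
        for cls, nxt in TRANSITIONS[state].items():
            if m in SUBLEXICONS[cls]:
                state = nxt
                break
        else:
            return False  # no transition from this state accepts m
    return state in ACCEPTING

print(accepts(["cat", "s"]))    # regular noun followed by plural
print(accepts(["geese", "s"]))  # irregular plural cannot take another plural
```

The machine accepts fox + s as a morpheme sequence; turning that into the spelling foxes is the job of the orthographic rules.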
36Sublexicons to classify the list of word parts
37FSA Expresses Morphotactics (ordering model)
38Towards the Analyser
- We can use lexc or xfst to build such an FSA (see lex1.lexc)
- To augment this to produce an analysis we must create a transducer Tnum which maps between the lexical level and an "intermediate" level that is needed to handle the spelling rules of English.
39Three Levels of Analysis
40 1. Tnum: Noun Number Inflection
- multi-character symbols
- morpheme boundary
- word boundary
41Intermediate Form to Surface
- The reason we need to have an intermediate form is that funny things happen at morpheme boundaries, e.g.
- cat^s → cats
- fox^s → foxes
- fly^s → flies
- The rules which describe these changes are called orthographic rules or "spelling rules".
42More English Spelling Rules
- consonant doubling: beg / begging
- y replacement: try / tries
- k insertion: panic / panicked
- e deletion: make / making
- e insertion: watch / watches
- Each rule can be stated in more detail ...
43Spelling Rules
- Chomsky and Halle (1968) invented a special notation for spelling rules.
- A very similar notation is embodied in the "conditional replacement" rules of xfst.
- E -> F || L _ R
- which means replace E with F when it appears between left context L and right context R
44A Particular Spelling Rule
- This rule does e-insertion
- ε -> e || x ^ _ s
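The e-insertion rule can be approximated with an ordinary regular-expression substitution over the intermediate form. This is a sketch under assumptions: `^` is taken to be the morpheme-boundary symbol and `#` the word-boundary symbol, and the function names are invented here. It is not how xfst compiles the rule, only a way to see its effect.

```python
import re

def e_insertion(intermediate):
    """Insert e between 'x^' and 's'; anything else passes through unchanged."""
    return re.sub(r"x\^s", "x^es", intermediate)

def strip_boundaries(intermediate):
    """Remove the (assumed) boundary symbols ^ and # to obtain the surface spelling."""
    return intermediate.replace("^", "").replace("#", "")

def to_surface(intermediate):
    return strip_boundaries(e_insertion(intermediate))

for form in ["fox^s", "cat^s", "fox"]:
    print(form, "->", to_surface(form))
```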
45e insertion over 3 levels
The rule corresponds to the mapping
between surface and intermediate levels
46e insertion as an FST
47Incorporating Spelling Rules
- Spelling rules, each corresponding to an FST, can be run in parallel provided that they are "aligned".
- The set of spelling rules is positioned between the surface level and the intermediate level.
- Parallel execution of FSTs can be carried out
- by simulation: in this case FSTs must first be aligned.
- by first constructing a single FST corresponding to their intersection.
48Putting it all together
execution of FSTi takes place in parallel
49Kaplan and Kay: The Xerox View
FSTi are aligned but separate
FSTi intersected together
50Finite State Transducers
- The simple story
- Add another tape
- Add extra symbols to the transitions
- On one tape we read cats, on the other we write cat +N +PL, or the other way around.
51FSTs
52English Plural
53Transitions
- c:c means read a c on one tape and write a c on the other
- +N:ε means read a +N symbol on one tape and write nothing on the other
- +PL:s means read +PL and write an s
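These transition labels can be simulated directly. The sketch below is a hypothetical, deterministic toy machine for the single stem cat; the state numbers, the `+SG` arc and the names `ARCS`, `FINAL` and `transduce` are assumptions made for the example. It reads lexical symbols from one tape and writes surface symbols to the other.

```python
EPS = ""  # epsilon: write nothing

# (state, input symbol) -> (output symbol, next state)
ARCS = {
    (0, "c"):   ("c", 1),   # c:c
    (1, "a"):   ("a", 2),   # a:a
    (2, "t"):   ("t", 3),   # t:t
    (3, "+N"):  (EPS, 4),   # +N:epsilon
    (4, "+SG"): (EPS, 5),   # +SG:epsilon
    (4, "+PL"): ("s", 5),   # +PL:s
}
FINAL = {5}

def transduce(symbols):
    """Follow the arcs for the input symbols; return the output string or None."""
    state, out = 0, []
    for sym in symbols:
        if (state, sym) not in ARCS:
            return None
        written, state = ARCS[(state, sym)]
        out.append(written)
    return "".join(out) if state in FINAL else None

print(transduce(["c", "a", "t", "+N", "+PL"]))
print(transduce(["c", "a", "t", "+N", "+SG"]))
```

Swapping the two symbols on every arc would give the machine for the opposite direction, from cats back to cat +N +PL.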
54Typical Uses
- Typically, we'll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).
- And we'll write to the second tape using the other symbols on the transitions.
55Ambiguity
- Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.
- Didn't matter which path was actually traversed
- In FSTs the path to an accept state does matter since different paths represent different parses and different outputs will result
56Ambiguity
- What's the right parse for
- Unionizable
- Union-ize-able
- Un-ion-ize-able
- Each represents a valid path through the
derivational morphology machine.
57Ambiguity
- There are a number of ways to deal with this problem
- Simply take the first output found
- Find all the possible outputs (all paths) and return them all (without choosing)
- Bias the search so that only one or a few likely paths are explored
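The three strategies can be compared on the unionizable example with a small backtracking search. This is a sketch with a hypothetical toy lexicon: the entry "iz" stands for -ize with its final e deleted before -able, and "fewest morphemes" is just one possible bias, assumed here for illustration.

```python
# Hypothetical toy lexicon: surface spelling -> morpheme.
SURFACE_TO_MORPHEME = {
    "un": "un", "ion": "ion", "union": "union", "iz": "ize", "able": "able",
}

def parses(word, start=0):
    """Lazily yield every segmentation of word[start:] into known morphemes."""
    if start == len(word):
        yield []
        return
    for spelling, morpheme in SURFACE_TO_MORPHEME.items():
        if word.startswith(spelling, start):
            for rest in parses(word, start + len(spelling)):
                yield [morpheme] + rest

def first_parse(word):
    """Strategy 1: stop at the first output found."""
    return next(parses(word), None)

def all_parses(word):
    """Strategy 2: return every output without choosing."""
    return list(parses(word))

def shortest_parse(word):
    """Strategy 3 (assumed bias): prefer the parse with the fewest morphemes."""
    return min(parses(word), key=len, default=None)

print(first_parse("unionizable"))
print(all_parses("unionizable"))
print(shortest_parse("unionizable"))
```

Because `parses` is a generator, taking only the first parse does not force the search to explore the remaining paths.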
58The Gory Details
- Of course, it's not as easy as
- cat +N +PL <-> cats
- As we saw earlier there are geese, mice and oxen
- But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes
- Cats vs Dogs
- Fox and Foxes
59Multi-Tape Machines
- To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next
- So to handle irregular spelling changes we'll add intermediate tapes with intermediate symbols
60Generativity
- Nothing really privileged about the directions.
- We can write from one and read from the other or vice-versa.
- One way is generation, the other way is analysis
61Multi-Level Tape Machines
- We use one machine to transduce between the
lexical and the intermediate level, and another
to handle the spelling changes to the surface
tape
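The two-machine cascade can be sketched as two functions applied in sequence, here in the generation direction. Everything below is an illustrative assumption rather than the lecture's actual machines: the tag names, the boundary symbols `^` and `#`, the restriction to regular nouns, and the choice of x, s and z as e-insertion contexts.

```python
import re

def lexical_to_intermediate(lexical):
    """'fox +N +PL' -> 'fox^s#', 'fox +N +SG' -> 'fox#' (regular nouns only)."""
    stem, *tags = lexical.split()
    if tags == ["+N", "+PL"]:
        return stem + "^s#"
    if tags == ["+N", "+SG"]:
        return stem + "#"
    raise ValueError(f"unsupported analysis: {lexical}")

def intermediate_to_surface(intermediate):
    """Apply e-insertion, then erase the boundary symbols."""
    with_e = re.sub(r"([xsz])\^s#", r"\1es#", intermediate)
    return with_e.replace("^", "").replace("#", "")

def generate(lexical):
    """Feed the output of the first machine into the second."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))

for lexical in ["fox +N +PL", "cat +N +PL", "fox +N +SG"]:
    print(lexical, "->", lexical_to_intermediate(lexical), "->", generate(lexical))
```

Inputs that the spelling step does not match, such as cat^s#, pass through it unchanged apart from the removal of the boundary symbols.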
62Lexical to Intermediate Level
63Intermediate to Surface
- The "add an e" rule, as in fox^s <-> foxes
64Foxes
65Note
- A key feature of this machine is that it doesn't do anything to inputs to which it doesn't apply.
- Meaning that they are written out unchanged to the output tape.
- Turns out the multiple tapes aren't really needed; they can be compiled away.
66Overall Scheme
- We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity).
- Lexical level to intermediate forms
- We have a larger set of machines that capture orthographic/spelling rules.
- Intermediate forms to surface forms
67Overall Scheme