CS60057 Speech - PowerPoint PPT Presentation

About This Presentation
Title:

CS60057 Speech

Slides: 68
Provided by: IBMU306
Transcript and Presenter's Notes

Title: CS60057 Speech


1
CS60057 Speech & Natural Language Processing
  • Autumn 2007

Lecture 3, 27 July 2007
2
Levels of (Formal) Description
  • 6 basic levels (more or less explicitly present
    in most theories)
  • and beyond (pragmatics/logic/...)
  • meaning (semantics)
  • (surface) syntax
  • morphology
  • phonology
  • phonetics/orthography
  • Each level has an input and output representation
  • output from one level is the input to the next
    (upper) level
  • sometimes levels might be skipped (merged) or
    split

3
Phonetics/Orthography
  • Input
  • acoustic signal (phonetics) / text (orthography)
  • Output
  • phonetic alphabet (phonetics) / text
    (orthography)
  • Deals with
  • Phonetics
  • consonant & vowel (& others) formation in the
    vocal tract
  • classification of consonants, vowels, ... in
    relation to frequencies, shape & position of the
    tongue and various muscles
  • intonation
  • Orthography: normalization, punctuation, etc.

4
Phonology
  • Input
  • sequence of phones/sounds (in a phonetic
    alphabet) or normalized text (sequence of
    (surface) letters in one language's alphabet)
    NB: phones vs. phonemes
  • Output
  • sequence of phonemes (~ (lexical) letters in an
    abstract alphabet)
  • Deals with
  • relation between sounds and phonemes (units which
    might have some function on the upper level)
  • e.g. /u/ ~ oo (as in book), /æ/ ~ a (cat), i ~ y
    (flies)

5
Morphology
  • Input
  • sequence of phonemes (~ (lexical) letters)
  • Output
  • sequence of pairs (lemma, (morphological) tag)
  • Deals with
  • composition of phonemes into word forms and their
    underlying lemmas (lexical units) + morphological
    categories (inflection, derivation, compounding)
  • e.g. quotations ~ quote/V + -ation (der. V->N) +
    NNS.

6
(Surface) Syntax
  • Input
  • sequence of pairs (lemma, (morphological) tag)
  • Output
  • sentence structure (tree) with annotated nodes
    (all lemmas, (morphosyntactic) tags, functions),
    of various forms
  • Deals with
  • the relation between lemmas & morphological
    categories and the sentence structure
  • uses syntactic categories such as Subject, Verb,
    Object,...
  • e.g. I/PP1 see/VB a/DT dog/NN
  • ((I/sg)SB ((see/pres)V
    (a/ind dog/sg)OBJ)VP)S

7
Meaning (semantics)
  • Input
  • sentence structure (tree) with annotated nodes
    (lemmas, (morphosyntactic) tags, surface
    functions)
  • Output
  • sentence structure (tree) with annotated nodes
    (semantic lemmas, (morpho-syntactic) tags, deep
    functions)
  • Deals with
  • relation between categories such as Subject,
    Object and (deep) categories such as Agent,
    Effect; adds other categories
  • e.g. ((I)SB ((was seen)V (by Tom)OBJ)VP)S
  • (I/Sg/Pat/t
    (see/Perf/Pred/t) Tom/Sg/Ag/f)

8
...and Beyond
  • Input
  • sentence structure (tree) with annotated nodes
    (autosemantic lemmas, (morphosyntactic) tags,
    deep functions)
  • Output
  • logical form, which can be evaluated (true/false)
  • Deals with
  • assignment of objects from the real world to the
    nodes of the sentence structure
  • e.g. (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)
  • see(Mark-Twain[SSN:...], Tom-Sawyer[SSN:...])
    [Time: bef 99/9/27/14:15]
    [Place: 39°19'40"N 76°37'10"W]

9
Three Views
  • Three equivalent formal ways to look at what
    we're up to (not including tables)

Regular Expressions
Finite State Automata
Regular Languages
10
Transition
  • Finite-state methods are particularly useful in
    dealing with a lexicon.
  • Lots of devices, some with limited memory, need
    access to big lists of words.
  • So we'll switch to talking about some facts about
    words and then come back to computational methods

11
MORPHOLOGY
12
Morphology
  • Morphology is the study of the ways that words
    are built up from smaller meaningful units called
    morphemes (morph = shape, logos = word)
  • We can usefully divide morphemes into two classes
  • Stems: the core meaning-bearing units
  • Affixes: bits and pieces that adhere to stems to
    change their meanings and grammatical functions
  • Prefixes: un-, anti-, etc.
  • Suffixes: -ity, -ation, etc.
  • Infixes are inserted inside the stem
  • Tagalog: um + hingi -> humingi
  • Circumfixes precede and follow the stem
  • English doesn't stack many affixes.
  • But Turkish can have words with a lot of
    suffixes.
  • Languages such as Turkish that tend to string
    affixes together are called agglutinative
    languages.

13
Surface and Lexical Forms
  • The surface level of a word represents the actual
    spelling of that word.
  • geliyorum, eats, cats, kitabim
  • The lexical level of a word represents a simple
    concatenation of the morphemes making up that word.
  • gel +PROG +1SG
  • eat +AOR
  • cat +PLU
  • kitap +P1SG
  • Morphological processors try to find
    correspondences between lexical and surface forms
    of words.
  • Morphological recognition/analysis: surface
    to lexical
  • Morphological generation/synthesis: lexical
    to surface

14
Morphology: Morphemes & Order
  • Handles what is an isolated form in written text
  • Grouping of phonemes into morphemes
  • sequence deliverables ~ deliver, able and s (3
    units)
  • Morpheme Combination
  • certain combinations/sequencing possible, others
    not
  • deliver+able+s, but not able+derive+s; noun+s,
    but not noun+ing
  • typically fixed (in any given language)
  • typically fixed (in any given language)

15
Inflectional & Derivational Morphology
  • We can also divide morphology up into two broad
    classes
  • Inflectional
  • Derivational
  • Inflectional morphology concerns the combination
    of stems and affixes where the resulting word
  • Has the same word class as the original
  • Serves a grammatical/semantic purpose different
    from the original
  • After a combination with an inflectional
    morpheme, the meaning and class of the actual
    stem usually do not change.
  • eat / eats, pencil / pencils
  • After a combination with a derivational
    morpheme, the meaning and the class of the actual
    stem usually change.
  • compute / computer, do / undo, friend /
    friendly
  • Uygar / uygarlas, kapi / kapici
  • Irregular changes may occur with
    derivational affixes.

16
Morphological Parsing
  • Morphological parsing means finding the lexical
    form of a word from its surface form.
  • cats -- cat +N +PLU
  • cat -- cat +N +SG
  • goose -- goose +N +SG or goose +V
  • geese -- goose +N +PLU
  • gooses -- goose +V +3SG
  • catch -- catch +V
  • caught -- catch +V +PAST or catch +V +PP
  • There can be more than one lexical-level
    representation for a given word (ambiguity).
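The surface-to-lexical pairs listed on this slide can be sketched as a plain lookup table. This is only a minimal illustration, not a real morphological parser: the names ENTRIES, SURFACE_TO_LEXICAL and parse are hypothetical, and the "+" tag prefix is an assumed notation.

```python
from collections import defaultdict

# (surface form, lexical form) pairs copied from the slide's examples.
ENTRIES = [
    ("cats", "cat +N +PLU"),
    ("cat", "cat +N +SG"),
    ("goose", "goose +N +SG"),
    ("goose", "goose +V"),
    ("geese", "goose +N +PLU"),
    ("gooses", "goose +V +3SG"),
    ("catch", "catch +V"),
    ("caught", "catch +V +PAST"),
    ("caught", "catch +V +PP"),
]

# Index the pairs by surface form; an ambiguous form keeps several analyses.
SURFACE_TO_LEXICAL = defaultdict(list)
for surface, lexical in ENTRIES:
    SURFACE_TO_LEXICAL[surface].append(lexical)


def parse(surface):
    """Return every lexical form listed for a surface form (empty list if unknown)."""
    return list(SURFACE_TO_LEXICAL.get(surface, []))


print(parse("caught"))
print(parse("geese"))
```

An ambiguous form such as caught comes back with both analyses, which is the ambiguity the slide points out. A table like this cannot handle unseen words, which motivates the rule-based machinery on the following slides.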

17
Morphological Analysis
  • Analyzing words into their linguistic components
    (morphemes).
  • Morphemes are the smallest meaningful units of
    language.
  • cars: car+PLU
  • giving: give+PROG
  • AsachhilAma: AsA+PROG+PAST+1st (I/we was/were
    coming)
  • Ambiguity: more than one alternative
  • flies: fly+VERB+3SG
  • flies: fly+NOUN+PLU
  • mAtAla
  • kare

18
  • Fly + s -> flys -> flies (y -> i rule)
  • Duckling
  • Go-getter -> get + er
  • Doer -> do + er
  • Beer -> ?
  • What knowledge do we need?
  • How do we represent it?
  • How do we compute with it?

19
Knowledge needed
  • Knowledge of stems or roots
  • Duck is a possible root, not duckl
  • We need a dictionary (lexicon)
  • Only some endings go on some words
  • Do + er: ok
  • Be + er: not ok
  • In addition, spelling change rules that adjust
    the surface form
  • Get + er: double the t -> getter
  • Fox + s: insert e -> foxes
  • Fly + s -> flys: insert e, y to i -> flies
  • Chase + ed: drop e -> chased
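The spelling-change rules listed above can be sketched as a small function that joins a stem and a suffix. This is a rough illustration under simplifying assumptions: the function name attach, the list of sibilant endings, and the consonant-vowel-consonant test for doubling are all choices made for the example, and the function does not check which endings a stem allows (it would happily combine be + er).

```python
VOWELS = "aeiou"


def attach(stem, suffix):
    """Join a stem and a suffix, applying a few toy English spelling rules."""
    if suffix == "s":
        # e-insertion after sibilant-like endings: fox + s -> foxes
        if stem.endswith(("x", "s", "z", "ch", "sh")):
            return stem + "es"
        # y -> i plus e-insertion after a consonant: fly + s -> flies
        if stem.endswith("y") and len(stem) > 1 and stem[-2] not in VOWELS:
            return stem[:-1] + "ies"
        return stem + "s"
    if suffix and suffix[0] in VOWELS:
        # e-deletion before a vowel-initial suffix: chase + ed -> chased
        if stem.endswith("e"):
            return stem[:-1] + suffix
        # consonant doubling for three-letter consonant-vowel-consonant stems:
        # get + er -> getter
        if (len(stem) == 3 and stem[0] not in VOWELS and stem[1] in VOWELS
                and stem[2] not in VOWELS + "wxy"):
            return stem + stem[-1] + suffix
    return stem + suffix


for stem, suffix in [("get", "er"), ("fox", "s"), ("fly", "s"), ("chase", "ed"), ("do", "er")]:
    print(stem, "+", suffix, "->", attach(stem, suffix))
```

Hand-written conditions like these grow quickly, which is one reason the later slides express each rule as a finite-state transducer instead.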

20
Put all this in a big dictionary (lexicon)
  • Turkish: approx. 600 × 10^6 forms
  • Finnish: 10^7
  • Hindi, Bengali, Telugu, Tamil?
  • Besides, novel forms can always be constructed
  • Anti-missile
  • Anti-anti-missile
  • Anti-anti-anti-missile
  • ..
  • Compounding of words: Sanskrit, German

21
Morphology: From Morphemes to Lemmas & Categories
  • Lemma: lexical unit, pointer to lexicon
  • typically represented as the base form, or
    dictionary headword
  • possibly indexed when ambiguous/polysemous
  • state1 (verb), state2 (state-of-the-art), state3
    (government)
  • from one or more morphemes (root, stem,
    root+derivation, ...)
  • Categories: non-lexical
  • small number of possible values (< 100, often <
    5-10)

22
Morphology Level: The Mapping
  • Formally: $A^+ \rightarrow 2^{(L, C_1, C_2, \ldots, C_n)}$
  • $A$ is the alphabet of phonemes ($A^+$ denotes any
    non-empty sequence of phonemes)
  • $L$ is the set of possible lemmas, uniquely
    identified
  • $C_i$ are morphological categories, such as
  • grammatical number, gender, case
  • person, tense, negation, degree of comparison,
    voice, aspect, ...
  • tone, politeness, ...
  • part of speech (not quite morphological category,
    but...)
  • $A$, $L$ and $C_i$ are obviously language-dependent

23
Morphological Analysis (cont.)
  • Relatively simple for English.
  • But for many Indian languages, it may be more
    difficult.
  • Examples
  • Inflectional and Derivational Morphology.
  • Common tools: finite-state transducers

24
Bengali Verb Paradigms
25
Bengali Verb morphology for one of the paradigms
26
(No Transcript)
27
(No Transcript)
28
Finite State Machines
  • FSAs are equivalent to regular languages
  • FSTs are equivalent to regular relations (over
    pairs of regular languages)
  • FSTs are like FSAs but with complex labels.
  • We can use FSTs to transduce between surface and
    lexical levels.

29
Simple Rules
30
Adding in the Words
31
Derivational Rules
32
Parsing/Generation vs. Recognition
  • Recognition is usually not quite what we need.
  • Usually if we find some string in the language we
    need to find the structure in it (parsing)
  • Or we have some structure and we want to produce
    a surface form (production/generation)
  • Example
  • From cats to cat +N +PL and back

33
Morphological Parsing
  • Given the input cats, we'd like to output cat +N
    +Pl, telling us that cat is a plural noun.
  • Given the Spanish input bebo, we'd like to
    output beber +V +PInd +1P +Sg, telling us that
    bebo is the present indicative first person
    singular form of the Spanish verb beber, to
    drink.

34
Morphological Analyser
  • To build a morphological analyser we need
  • lexicon: the list of stems and affixes, together
    with basic information about them
  • morphotactics: the model of morpheme ordering (e.g.
    the English plural morpheme follows the noun rather
    than a verb)
  • orthographic rules: these spelling rules are used
    to model the changes that occur in a word,
    usually when two morphemes combine (e.g., flys ->
    flies)

35
Lexicon & Morphotactics
  • Typically the list of word parts (lexicon) and the
    models of ordering can be combined together into
    an FSA which will recognise all the valid
    word forms.
  • For this to be possible the word parts must first
    be classified into sublexicons.
  • The FSA defines the morphotactics (ordering
    constraints).
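A minimal sketch of how sublexicons and an FSA over morpheme classes might fit together for English noun number. The class names, the toy word lists, and the state names q0 to q2 are assumptions made for illustration; they are not taken from the lex1.lexc file mentioned later.

```python
# Sublexicons: each word part is assigned to a class (toy entries).
SUBLEXICONS = {
    "reg-noun": {"cat", "dog", "fox"},
    "irreg-sg-noun": {"goose", "mouse"},
    "irreg-pl-noun": {"geese", "mice"},
    "plural": {"s"},
}

# FSA over morpheme classes: state -> {class: next state}.
TRANSITIONS = {
    "q0": {"reg-noun": "q1", "irreg-sg-noun": "q2", "irreg-pl-noun": "q2"},
    "q1": {"plural": "q2"},
    "q2": {},
}
ACCEPTING = {"q1", "q2"}


def accepts(morphemes):
    """True if the morpheme sequence is licensed by the morphotactics."""
    states = {"q0"}
    for morpheme in morphemes:
        next_states = set()
        for state in states:
            for cls, target in TRANSITIONS[state].items():
                if morpheme in SUBLEXICONS[cls]:
                    next_states.add(target)
        states = next_states
    return bool(states & ACCEPTING)


print(accepts(["cat", "s"]))    # regular noun followed by plural
print(accepts(["geese", "s"]))  # irregular plural cannot take another plural
```

The FSA only checks ordering over already segmented morphemes; it says nothing yet about spelling changes such as fox + s becoming foxes.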

36
Sublexicons to classify the list of word parts
37
FSA Expresses Morphotactics (ordering model)
38
Towards the Analyser
  • We can use lexc or xfst to build such an FSA (see
    lex1.lexc)
  • To augment this to produce an analysis we must
    create a transducer Tnum which maps between the
    lexical level and an "intermediate" level that is
    needed to handle the spelling rules of English.

39
Three Levels of Analysis
40
1. Tnum: Noun Number Inflection
  • multi-character symbols
  • morpheme boundary ^
  • word boundary #

41
Intermediate Form to Surface
  • The reason we need to have an intermediate form
    is that funny things happen at morpheme
    boundaries, e.g.
  • cat^s -> cats
  • fox^s -> foxes
  • fly^s -> flies
  • The rules which describe these changes are called
    orthographic rules or "spelling rules".

42
More English Spelling Rules
  • consonant doubling: beg / begging
  • y replacement: try / tries
  • k insertion: panic / panicked
  • e deletion: make / making
  • e insertion: watch / watches
  • Each rule can be stated in more detail ...

43
Spelling Rules
  • Chomsky & Halle (1968) invented a special
    notation for spelling rules.
  • A very similar notation is embodied in the
    "conditional replacement" rules of xfst.
  • E -> F || L _ R
  • which means: replace E with F when it appears
    between left context L and right context R

44
A Particular Spelling Rule
  • This rule does e-insertion
  • ε -> e || x ^ _ s #
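A conditional replacement rule of this shape can be imitated with a regular-expression substitution using lookbehind and lookahead for the two contexts. This is a sketch only, not xfst: the helper names are hypothetical, and it assumes intermediate forms in which ^ marks a morpheme boundary and # a word boundary.

```python
import re


def conditional_replace(target, replacement, left, right, text):
    """Replace `target` with `replacement` wherever it sits between `left` and `right`.

    target, left and right are regular-expression fragments (left must be
    fixed-width because it goes into a lookbehind); an empty target means
    pure insertion.
    """
    pattern = f"(?<={left}){target}(?={right})"
    return re.sub(pattern, replacement, text)


def e_insertion(intermediate):
    # Insert e between "x^" on the left and "s#" on the right.
    return conditional_replace("", "e", r"x\^", "s#", intermediate)


def to_surface(intermediate):
    # Apply the rule, then erase the boundary symbols.
    return e_insertion(intermediate).replace("^", "").replace("#", "")


print(e_insertion("fox^s#"))
print(to_surface("fox^s#"))
print(to_surface("cat^s#"))
```

Forms that do not match the context, such as cat^s#, pass through the rule unchanged and only lose their boundary symbols.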

45
e insertion over 3 levels
The rule corresponds to the mapping
between surface and intermediate levels
46
e insertion as an FST
47
Incorporating Spelling Rules
  • Spelling rules, each corresponding to an FST, can
    be run in parallel provided that they are
    "aligned".
  • The set of spelling rules is positioned between
    the surface level and the intermediate level.
  • Parallel execution of FSTs can be carried out
  • by simulation in this case FSTs must first be
    aligned.
  • by first constructing a single FST
    corresponding to their intersection.

48
Putting it all together
execution of FSTi takes place in parallel
49
Kaplan and Kay: The Xerox View
FSTi are aligned but separate
FSTi intersected together
50
Finite State Transducers
  • The simple story
  • Add another tape
  • Add extra symbols to the transitions
  • On one tape we read cats, on the other we write
    cat N PL, or the other way around.

51
FSTs
52
English Plural
53
Transitions
  • c:c means read a c on one tape and write a c on
    the other
  • +N:ε means read a +N symbol on one tape and write
    nothing on the other
  • +PL:s means read +PL and write an s
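The transition labels above can be made concrete with a tiny transducer whose arcs carry a lexical symbol and a surface symbol. The sketch below is an illustrative assumption, not the machine from the slides: it covers only the word cat, uses the empty string for epsilon, and adds a +SG arc alongside +PL so both number values have a path.

```python
EPS = ""  # the empty string stands in for epsilon

# state -> list of (lexical symbol, surface symbol, next state)
FST = {
    0: [("c", "c", 1)],
    1: [("a", "a", 2)],
    2: [("t", "t", 3)],
    3: [("+N", EPS, 4)],
    4: [("+SG", EPS, 5), ("+PL", "s", 5)],
    5: [],
}
FINAL = {5}


def transduce(symbols, read=0, write=1, state=0):
    """Return every output the machine can write while reading `symbols`.

    read/write select the tape: 0 is the lexical side, 1 the surface side.
    """
    results = []
    if not symbols and state in FINAL:
        results.append([])
    for arc in FST[state]:
        inp, out, nxt = arc[read], arc[write], arc[2]
        if inp == EPS:
            rest = symbols  # epsilon on the input side consumes nothing
        elif symbols and symbols[0] == inp:
            rest = symbols[1:]
        else:
            continue
        for tail in transduce(rest, read, write, nxt):
            results.append(([out] if out != EPS else []) + tail)
    return results


# Generation: lexical tape in, surface tape out.
print(transduce(["c", "a", "t", "+N", "+PL"]))
# Analysis: swap the tapes and read the surface string.
print(transduce(list("cats"), read=1, write=0))
```

Swapping the read and write indices is all it takes to run the same machine in the other direction.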

54
Typical Uses
  • Typically, we'll read from one tape using the
    first symbol on the machine transitions (just as
    in a simple FSA).
  • And we'll write to the second tape using the
    other symbols on the transitions.

55
Ambiguity
  • Recall that in non-deterministic recognition
    multiple paths through a machine may lead to an
    accept state.
  • It didn't matter which path was actually traversed
  • In FSTs the path to an accept state does matter,
    since different paths represent different parses
    and different outputs will result

56
Ambiguity
  • What's the right parse for
  • Unionizable
  • Union-ize-able
  • Un-ion-ize-able
  • Each represents a valid path through the
    derivational morphology machine.

57
Ambiguity
  • There are a number of ways to deal with this
    problem
  • Simply take the first output found
  • Find all the possible outputs (all paths) and
    return them all (without choosing)
  • Bias the search so that only one or a few likely
    paths are explored
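The three strategies can be sketched on the unionizable example by enumerating every segmentation over a toy morpheme inventory. The inventory, the function names, and the "fewest morphemes" preference used as the bias are assumptions made for illustration; "iz" stands for the surface shape of -ize after e-deletion.

```python
# Toy morpheme inventory.
MORPHEMES = {"un", "ion", "union", "iz", "able"}


def segmentations(word):
    """Return every way to split `word` into morphemes from MORPHEMES."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in MORPHEMES:
            for tail in segmentations(word[i:]):
                results.append([head] + tail)
    return results


def first_parse(word):
    """Strategy 1: simply take the first output found."""
    found = segmentations(word)
    return found[0] if found else None


def preferred_parse(word):
    """Strategy 3 (one possible bias): prefer the parse with the fewest morphemes."""
    found = segmentations(word)
    return min(found, key=len) if found else None


print(segmentations("unionizable"))  # strategy 2: return all paths
print(first_parse("unionizable"))
print(preferred_parse("unionizable"))
```

With this inventory the first parse found and the preferred parse differ, which shows why taking the first output is not always a safe choice.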

58
The Gory Details
  • Of course, it's not as easy as
  • cat +N +PL <-> cats
  • As we saw earlier there are geese, mice and oxen
  • But there are also a whole host of
    spelling/pronunciation changes that go along with
    inflectional changes
  • Cats vs Dogs
  • Fox and Foxes

59
Multi-Tape Machines
  • To deal with this we can simply add more tapes
    and use the output of one tape machine as the
    input to the next
  • So to handle irregular spelling changes we'll add
    intermediate tapes with intermediate symbols

60
Generativity
  • Nothing really privileged about the directions.
  • We can write from one and read from the other or
    vice-versa.
  • One way is generation, the other way is analysis

61
Multi-Level Tape Machines
  • We use one machine to transduce between the
    lexical and the intermediate level, and another
    to handle the spelling changes to the surface
    tape

62
Lexical to Intermediate Level
63
Intermediate to Surface
  • The "add an e" rule, as in fox^s <-> foxes

64
Foxes
65
Note
  • A key feature of this machine is that it doesn't
    do anything to inputs to which it doesn't apply.
  • Meaning that they are written out unchanged to
    the output tape.
  • Turns out the multiple tapes aren't really
    needed; they can be compiled away.

66
Overall Scheme
  • We now have one FST that has explicit information
    about the lexicon (actual words, their spelling,
    facts about word classes and regularity).
  • Lexical level to intermediate forms
  • We have a larger set of machines that capture
    orthographic/spelling rules.
  • Intermediate forms to surface forms
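The two-stage scheme can be sketched as a composition of two functions, one standing in for the lexicon machine and one for the spelling-rule machines. Everything here is a simplified assumption: the function names, the IRREGULAR table, the ^ and # boundary symbols, and the single e-insertion rule after x are illustrative stand-ins for real transducers.

```python
import re

# Stand-in for the lexicon FST: irregular plurals are stored whole.
IRREGULAR = {("goose", "+PL"): "geese", ("mouse", "+PL"): "mice"}


def lexical_to_intermediate(stem, number):
    """Map e.g. ('fox', '+PL') to the intermediate form 'fox^s#'."""
    if (stem, number) in IRREGULAR:
        return IRREGULAR[(stem, number)] + "#"
    return stem + ("^s" if number == "+PL" else "") + "#"


def intermediate_to_surface(form):
    """Apply the spelling rule, then erase the boundary symbols."""
    # e-insertion: x ^ s # becomes x ^ e s #; other inputs pass through unchanged.
    form = re.sub(r"(?<=x)\^(?=s#)", "^e", form)
    return form.replace("^", "").replace("#", "")


def generate(stem, number):
    """Lexical level -> intermediate level -> surface level."""
    return intermediate_to_surface(lexical_to_intermediate(stem, number))


for stem, number in [("fox", "+PL"), ("cat", "+PL"), ("goose", "+PL"), ("fox", "+SG")]:
    print(stem, number, "->", generate(stem, number))
```

Wrapping the two stages in generate mirrors, very loosely, the idea that the separate tapes can be compiled into a single machine.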

67
Overall Scheme