Title: Morphology-2
1Morphology-2
- Sudeshna Sarkar
- Professor
- Computer Science Engineering Department
- Indian Institute of Technology Kharagpur
2Morphology in NLP
- Analysis vs synthesis
- "What does dogs mean?" vs "What is the plural of dog?"
- Analysis
- Need to identify the lexeme
- Tokenization
- To access lexical information
- Inflections (etc.) carry information that will be
needed by other processes (e.g. agreement is useful in
parsing; inflections can carry meaning, e.g. tense,
number)
- Morphology can be ambiguous
- May need other processes to disambiguate (e.g. German -en)
- Synthesis
- Need to generate appropriate inflections from the
underlying representation
3Morphological processing
- Stemming
- String-handling approaches
- Regular expressions
- Mapping onto finite-state automata
- 2-level morphology
- Mapping between surface form and lexical
representation
4Stemming
- Stemming is the particular case of tokenization
which reduces inflected forms to a single base
form or stem
- Stemming algorithms are basic string-handling
algorithms, which depend on rules which identify
affixes that can be stripped
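The affix-stripping idea can be sketched in a few lines. This is a minimal illustration, not a real stemmer: the suffix list and the `min_stem` threshold are assumptions chosen for the example, and the function name `naive_stem` is hypothetical.

```python
# Hypothetical suffix list for illustration only; real stemmers use ordered
# rule sets with conditions on what remains after stripping.
SUFFIXES = ["ing", "ed", "es", "s"]


def naive_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["cats", "walked", "walking", "foxes", "making", "is"]:
    print(w, "->", naive_stem(w))
```

Note that "making" comes out as "mak": a stem produced this way need not be a dictionary word, which is the usual trade-off of rule-only stemming.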
5Surface and Lexical Forms
- The surface level of a word represents the actual
spelling of that word.
- geliyorum, eats, cats, kitabim
- The lexical level of a word represents a simple
concatenation of morphemes making up that word.
- gel+PROG+1SG
- eat+AOR
- cat+PLU
- kitap+P1SG
- Morphological processors try to find
correspondences between lexical and surface forms
of words.
- Morphological recognition/analysis: surface to lexical
- Morphological generation/synthesis: lexical to surface
6Morphological Parsing
- Morphological parsing is to find the lexical form
of a word from its surface form.
- cats → cat+N+PLU
- cat → cat+N+SG
- goose → goose+N+SG or goose+V
- geese → goose+N+PLU
- gooses → goose+V+3SG
- catch → catch+V
- caught → catch+V+PAST or catch+V+PP
- AsachhilAma → AsA+PROG+PAST+1st (I/we was/were coming)
- There can be more than one lexical-level
representation for a given word (ambiguity).
- flies → fly+VERB+3SG or fly+NOUN+PLU
- mAtAla
- kare
7
- The history of morphological analysis dates back
to the ancient Indian linguist Pāṇini, who
formulated the 3,959 rules of Sanskrit morphology
in the text Aṣṭādhyāyī by using a Constituency
Grammar.
8Formal definition of the problem
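- Surface form: the word ($w_s$) as it occurs in the
text, e.g. sings
- $w_s \in L \subseteq \Sigma^{*}$
- Lexical form: the root word(s) ($r_1, r_2, \ldots$) and
other grammatical features ($F$), e.g. ⟨sing, v, sg, 3rd⟩
- $w_l$ combines roots over $\Sigma$ with features from $F$
- $w_l \in \Delta^{*}$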
9Analysis & Synthesis
- Morphological Analysis: maps a string from
surface form to the corresponding lexical form.
- $f_{MA}: \Sigma^{*} \to \Delta^{*}$
- Morphological Synthesis: maps a string from
lexical form to surface form.
- $f_{MS}: \Delta^{*} \to \Sigma^{*}$
10Relationship between MA & MS
- $f_{MS}(f_{MA}(w_s)) = w_s$
- $f_{MA}(f_{MS}(w_l)) = w_l$
- $f_{MS} = f_{MA}^{-1}$, $f_{MA} = f_{MS}^{-1}$
- But is that really the case?
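One way to probe the inverse relationship is to treat analysis and synthesis as two directions of a single relation between surface and lexical forms. The sketch below assumes a toy table of pairs (taken from the parsing examples in this deck, with an assumed `+` tag notation); the function names `f_ma`, `f_ms`, and `round_trip` are illustrative only.

```python
from collections import defaultdict

# Toy surface/lexical pairs; the '+' tag notation is an assumption.
PAIRS = [
    ("cats", "cat+N+PLU"),
    ("geese", "goose+N+PLU"),
    ("goose", "goose+N+SG"),
    ("goose", "goose+V"),
    ("caught", "catch+V+PAST"),
    ("caught", "catch+V+PP"),
]

analyses = defaultdict(set)   # surface form -> set of lexical forms
surfaces = defaultdict(set)   # lexical form -> set of surface forms
for surface, lexical in PAIRS:
    analyses[surface].add(lexical)
    surfaces[lexical].add(surface)


def f_ma(ws):
    """Analysis: all lexical forms related to the surface form ws."""
    return analyses.get(ws, set())


def f_ms(wl):
    """Synthesis: all surface forms related to the lexical form wl."""
    return surfaces.get(wl, set())


def round_trip(ws):
    """Synthesize every analysis of ws and collect the results."""
    return {s for wl in f_ma(ws) for s in f_ms(wl)}


print(f_ma("caught"))        # two analyses: the mapping is not a function
print(round_trip("caught"))  # but the round trip recovers the surface form
```

Because "caught" has two analyses, analysis is better described as a relation than as a function, so the two directions are inverses only as relations. That is one reading of the question on the slide.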
11
- Fly + s → flys → flies (y → i rule)
- Duckling
- Go-getter → get + er
- Doer → do + er
- Beer → ?
- What knowledge do we need?
- How do we represent it?
- How do we compute with it?
12Knowledge needed
- Knowledge of stems or roots
- Duck is a possible root, not duckl
- We need a dictionary (lexicon)
- Only some endings go on some words
- Do + er: ok
- Be + er: not ok
- In addition, spelling change rules that adjust
the surface form
- Get + er: double the t → getter
- Fox + s: insert e → foxes
- Fly + s: insert e → flyes, y to i → flies
- Chase + ed: drop e → chased
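The spelling-change rules above can be sketched as a small function that attaches a suffix to a stem. This is a rough approximation under stated assumptions: the regular expressions are simplified conditions invented for this example (for instance, consonant doubling is only attempted for stems with a single vowel), and `attach` is a hypothetical name. It does not check whether the ending is allowed on the word, which is the lexicon's job.

```python
import re

VOWELS = "aeiou"


def attach(stem: str, suffix: str) -> str:
    """Concatenate stem + suffix, applying a few English spelling rules."""
    # Chase + ed -> chased: drop a final e before a vowel-initial suffix
    if stem.endswith("e") and suffix[0] in VOWELS:
        return stem[:-1] + suffix
    # Fly + s -> flies: consonant + y becomes ie before s
    if suffix == "s" and re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"
    # Fox + s -> foxes: insert e after s, z, x, ch, sh
    if suffix == "s" and re.search(r"(s|z|x|ch|sh)$", stem):
        return stem + "es"
    # Get + er -> getter: double the final consonant of a one-vowel stem
    if suffix[0] in VOWELS and re.search(r"^[^aeiou]*[aeiou][^aeiouwxy]$", stem):
        return stem + stem[-1] + suffix
    return stem + suffix


for stem, suffix in [("get", "er"), ("fox", "s"), ("fly", "s"),
                     ("chase", "ed"), ("do", "er"), ("cat", "s")]:
    print(stem, "+", suffix, "->", attach(stem, suffix))
```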
13Put all this in a big dictionary (lexicon)
- Turkish: approx. 600 × 10^6 forms
- Finnish: 10^7
- Hindi, Bengali, Telugu, Tamil?
- Besides, novel forms can always be constructed
- Anti-missile
- Anti-anti-missile
- Anti-anti-anti-missile
- ..
- Compounding of words: Sanskrit, German
14Dictionary
- Lemma: lexical unit, pointer to lexicon
- typically represented as the base form, or
dictionary headword
- possibly indexed when ambiguous/polysemous
- state1 (verb), state2 (state-of-the-art), state3
(government)
- formed from one or more morphemes (root, stem,
root+derivation, ...)
- Categories: non-lexical
- small number of possible values (< 100, often <
5-10)
15Morphological Analyzer
- Relatively simple for English.
- But for many Indian languages, it may be more
difficult.
- Examples
- Inflectional and Derivational Morphology.
- Common tools: finite-state transducers
- A transducer maps a set/string of symbols to
another set/string of symbols
16A simpler problem
- Linear concatenation of morphemes with possible
spelling changes at the boundary and a few
irregular cases.
- Quite practical assumptions
- English, Hindi, Bengali, Telugu, Tamil, French,
Turkish
- Exceptions: Semitic languages, Sanskrit
17Computational Morphology
- Approaches
- Lexicon only
- Rules only
- Lexicon and Rules
- Finite-state Automata
- Finite-state Transducers
18Computational Morphology
- Systems
- WordNet's morphy
- PCKimmo
- Named after Kimmo Koskenniemi, much work done by
Lauri Karttunen, Ron Kaplan, and Martin Kay
- Accurate but complex
- http://www.sil.org/pckimmo/
- Two-level morphology
- Commercial version available from InXight Corp.
- Background
- Chapter 3 of Jurafsky and Martin
- A short history of Two-Level Morphology
- http://www.ling.helsinki.fi/koskenni/esslli-2001-karttunen/
19Morphological Analyser
- To build a morphological analyser we need
- lexicon: the list of stems and affixes, together
with basic information about them
- morphotactics: the model of morpheme ordering (e.g.
the English plural morpheme follows a noun rather
than a verb)
- orthographic rules: spelling rules that model the
changes that occur in a word, usually when two
morphemes combine (e.g. fly + s → flies)
20Finite State Machines
- FSAs recognise exactly the regular languages
- FSTs compute exactly the regular relations (over
pairs of regular languages)
- FSTs are like FSAs but with complex labels
(input:output symbol pairs).
- We can use FSTs to transduce between surface and
lexical levels.
21Can FSAs help?
[Figure: FSA over states Q0, Q1, Q2 with arcs labeled
reg-noun, plural (-s), irreg-pl-noun, and irreg-sg-noun]
22What's this for?
[Figure: FSA over states Q0, Q1, Q2, Q3 with arcs labeled
un-, ε, Adj-root, and -er, -est, -ly]
- (un)? ADJ-ROOT (er | est | ly)?
23Morphotactics
- The last two examples basically model some parts
of the English morphotactics
- But where is the information about regular and
irregular roots? The LEXICON
- Can we include the lexicon in the FSA?
24The English Pluralization FSA
25After adding a mini-lexicon
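[Figure: the pluralization FSA over states Q0, Q1, Q2 with
the sublexicon arcs expanded letter by letter for a
mini-lexicon; recoverable words include dog, bug, man,
and men, with the plural -s on the regular nouns]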
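Folding a mini-lexicon into the automaton can be sketched as follows. This is an illustrative construction, not the machine in the figure: it builds a letter-by-letter trie as a deterministic FSA and does not merge states into shared Q1/Q2 as the slide does. The word lists and the names `build_fsa` and `accepts` are assumptions for the example.

```python
def build_fsa(reg_nouns, irreg_forms):
    """Build a letter-level DFA (as a trie) for a small noun lexicon."""
    trans = {}           # (state, letter) -> state
    accepting = set()
    n_states = 1         # state 0 is the start state

    def walk(word):
        """Follow or create the path for word and return its end state."""
        nonlocal n_states
        state = 0
        for ch in word:
            if (state, ch) not in trans:
                trans[(state, ch)] = n_states
                n_states += 1
            state = trans[(state, ch)]
        return state

    for noun in reg_nouns:
        accepting.add(walk(noun))        # singular, e.g. dog
        accepting.add(walk(noun + "s"))  # regular plural, e.g. dogs
    for form in irreg_forms:
        accepting.add(walk(form))        # man, men: no -s arc is added
    return trans, accepting


def accepts(fsa, word):
    """Run the DFA on word; time is linear in len(word)."""
    trans, accepting = fsa
    state = 0
    for ch in word:
        if (state, ch) not in trans:
            return False
        state = trans[(state, ch)]
    return state in accepting


fsa = build_fsa(reg_nouns=["dog", "bug"], irreg_forms=["man", "men"])
for w in ["dog", "dogs", "men", "mans", "do"]:
    print(w, accepts(fsa, w))
```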
26Elegance & Power
- FSAs are elegant because
- NFA ⇔ DFA (every NFA has an equivalent DFA)
- Closed under union, intersection, concatenation,
complementation
- Traversal is always linear in the input size
- Well-known algorithms for minimization,
determinization, compilation etc.
- They are powerful because they can capture
- Linear morphology
- Irregularities
27But
- FSAs are language recognizers/generators.
- We need transducers to build
- Morphological Analyzers (fMA)
- Morphological Synthesizers (fMS)
28Finite State Transducers
- Surface form: s i n g s
- Finite State Machine
- Lexical form: s i n g +v +sg
29Formal Definition
- A 6-tuple $\langle \Sigma, \Delta, Q, \delta, q_0, F \rangle$
- $\Sigma$ is the (finite) set of input symbols
- $\Delta$ is the (finite) set of output symbols
- $Q$ is the (finite) set of states
- $\delta$ is the transition function $Q \times \Sigma \to Q \times \Delta$
- $q_0 \in Q$ is the start state
- $F \subseteq Q$ is the set of accepting states
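The 6-tuple can be written down almost literally in code. The sketch below is an assumption-laden illustration: the field names, the state names `q0`..`q4`, and the toy machine for dog/dogs are invented here, and the plural tag `+Pl` is treated as a single output symbol so that each input symbol yields exactly one output symbol, as the definition requires.

```python
from typing import NamedTuple


class FST(NamedTuple):
    sigma: frozenset      # input symbols
    out: frozenset        # output symbols
    states: frozenset     # Q
    delta: dict           # (state, input symbol) -> (state, output symbol)
    start: str            # q0
    finals: frozenset     # F


def transduce(t: FST, symbols):
    """Run t on a sequence of input symbols; return the outputs or None."""
    state, output = t.start, []
    for sym in symbols:
        if (state, sym) not in t.delta:
            return None               # no transition: input rejected
        state, out_sym = t.delta[(state, sym)]
        output.append(out_sym)
    return output if state in t.finals else None


# Toy machine: surface "dog"/"dogs" to lexical symbols d o g (+Pl).
dog = FST(
    sigma=frozenset("dogs"),
    out=frozenset(["d", "o", "g", "+Pl"]),
    states=frozenset(["q0", "q1", "q2", "q3", "q4"]),
    delta={
        ("q0", "d"): ("q1", "d"),
        ("q1", "o"): ("q2", "o"),
        ("q2", "g"): ("q3", "g"),
        ("q3", "s"): ("q4", "+Pl"),
    },
    start="q0",
    finals=frozenset(["q3", "q4"]),
)

print(transduce(dog, "dogs"))
print(transduce(dog, "dog"))
print(transduce(dog, "do"))
```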
30An example FST
[Figure: an example FST over states Q0, Q1, Q2 whose arcs
carry symbol pairs; extracted labels: aa, se, gg, bb, uu,
ss, dd, oo, mm, nn, ea]
31The Lexicon FST
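[Figure: the lexicon FST over states Q0 to Q4 for dog,
bug, man, and men; arcs carry symbol pairs (aa, gg, bb,
uu, ss, dd, oo, nn, mm, ea) together with the number
features Sg and Pl, with the plural s paired with Pl]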
32Ways to look at FSTs
- Recognizer of a pair of strings
- Generator of a pair of strings
- Translator from one regular language to another
- Computer of a relation (a regular relation)
33Invertibility
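- Given $T = \langle \Sigma, \Delta, Q, \delta, q_0, F \rangle$
- Construct $T^{-1} = \langle \Delta, \Sigma, Q, \delta^{-1}, q_0, F \rangle$
- such that if $\delta(q, x) = (q', y)$
- then $\delta^{-1}(q, y) = (q', x)$
- where $x \in \Sigma$ and $y \in \Delta$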
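Inversion amounts to swapping the input and output label on every arc. The sketch below assumes arcs stored as `(state, input, output, next_state)` tuples and a hypothetical toy machine for man/men in which the number tag is fused with the last letter (`n+Sg`, `n+Pl`) to keep one output symbol per input symbol. The runner tracks a set of configurations because the inverse of a deterministic machine need not be deterministic.

```python
# Arcs are (state, input, output, next_state). Toy analyser: surface letters
# in, lexical symbols out; the composite symbols "n+Sg"/"n+Pl" are assumed.
ARCS = {
    (0, "m", "m", 1),
    (1, "a", "a", 4),
    (4, "n", "n+Sg", 3),
    (1, "e", "a", 2),
    (2, "n", "n+Pl", 3),
}
START, FINALS = 0, {3}


def invert(arcs):
    """Swap input and output on every arc: T becomes T^-1."""
    return {(q, y, x, r) for (q, x, y, r) in arcs}


def run(arcs, start, finals, symbols):
    """Return every output sequence the machine can produce for symbols."""
    configs = {(start, ())}                 # (state, output so far)
    for sym in symbols:
        configs = {
            (r, out + (y,))
            for (state, out) in configs
            for (q, x, y, r) in arcs
            if q == state and x == sym
        }
    return {out for (state, out) in configs if state in finals}


print(run(ARCS, START, FINALS, "men"))                         # analysis
print(run(invert(ARCS), START, FINALS, ["m", "a", "n+Pl"]))    # generation
```

In the inverted machine, state 1 has two arcs reading "a" (one writing "a", one writing "e"); only the path consistent with the following `n+Pl` reaches the accepting state.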
34Compositionality
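- $T_1 = \langle \Sigma, X, Q_1, \delta_1, q_1, F_1 \rangle$,
$T_2 = \langle X, \Delta, Q_2, \delta_2, q_2, F_2 \rangle$
- Define $T_3 = \langle \Sigma, \Delta, Q_3, \delta_3, q_3, F_3 \rangle$
- such that $Q_3 = Q_1 \times Q_2$
- $q_3 = (q_1, q_2)$
- $\delta_3((q, s), i) = ((q', s'), o)$ if
- $\exists c$ s.t. $\delta_1(q, i) = (q', c)$ and $\delta_2(s, c) = (s', o)$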
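The product construction can be sketched directly from the definition. Assumptions in this example: arcs are `(state, input, output, next_state)` tuples, the accepting states of the composed machine are taken to be pairs of accepting states (the slide does not spell out F3), and the two toy machines are hypothetical: T1 marks the alternating vowel of "man" with an abstract symbol "A", and T2 realizes "A" as "e".

```python
def compose(arcs1, arcs2):
    """Pair up arcs whose middle symbol c agrees; c disappears from T3."""
    return {
        ((q, s), i, o, (q_next, s_next))
        for (q, i, c, q_next) in arcs1
        for (s, c2, o, s_next) in arcs2
        if c == c2
    }


def run(arcs, start, finals, symbols):
    """Return every output sequence the machine can produce for symbols."""
    configs = {(start, ())}
    for sym in symbols:
        configs = {
            (r, out + (y,))
            for (state, out) in configs
            for (q, x, y, r) in arcs
            if q == state and x == sym
        }
    return {out for (state, out) in configs if state in finals}


# T1: m a n -> m A n (hypothetical intermediate symbol "A")
T1 = {(0, "m", "m", 1), (1, "a", "A", 2), (2, "n", "n", 3)}
F1 = {3}
# T2: a single looping state that spells "A" as "e" and copies the rest
T2 = {(0, "m", "m", 0), (0, "n", "n", 0), (0, "a", "a", 0), (0, "A", "e", 0)}
F2 = {0}

T3 = compose(T1, T2)
F3 = {(f1, f2) for f1 in F1 for f2 in F2}   # assumed: pairs of final states

print(run(T3, (0, 0), F3, "man"))
```

Running T3 on "man" gives the same result as feeding T1's output into T2, without ever materializing the intermediate string.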
35Modeling Orthographic Rules
- Spelling changes in morpheme boundaries
- bus + s → buses, watch + s → watches
- fly + s → flies
- make + ing → making
- Rules
- E-insertion takes place if the stem ends in s, z,
ch, sh etc.
- y maps to ie when the pluralization marker s is added
36Incorporating Spelling Rules
- Spelling rules, each corresponding to an FST, can
be run in parallel provided that they are
"aligned".
- The set of spelling rules is positioned between
the surface level and the intermediate level.
- Parallel execution of FSTs can be carried out
- by simulation: in this case the FSTs must first be
aligned.
- by first constructing a single FST
corresponding to their intersection.
37Rewrite Rules
- Chomsky and Halle (1968)
- General form
- a → b / c __ d (rewrite a as b when it occurs between
left context c and right context d)
- E-insertion
- ε → e / {x, s, z, ch, sh} __ s
- Kaplan and Kay (1994) showed that FSTs can be
compiled from general rewrite rules
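The e-insertion rule can be approximated as a context-sensitive string substitution. This sketch applies the rule with a regular expression rather than compiling it into an FST, so it illustrates what the rule does, not the compilation method cited above. It assumes the intermediate form marks the morpheme boundary with `+` and that the suffix `s` is word-final; the function name `e_insertion` is hypothetical.

```python
import re


def e_insertion(intermediate: str) -> str:
    """Insert e between a sibilant-final stem and a final s, then drop '+'."""
    # Left context: x, s, z, ch, sh before the boundary; right context: final s
    with_e = re.sub(r"(x|s|z|ch|sh)\+(?=s$)", r"\1+e", intermediate)
    return with_e.replace("+", "")


for form in ["fox+s", "bus+s", "watch+s", "bush+s", "cat+s"]:
    print(form, "->", e_insertion(form))
```

Forms outside the rule's context, such as "cat+s", pass through with only the boundary removed, which matches the behaviour expected of a rule transducer on inputs it does not apply to.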
38Two-level Morphology (Koskenniemi, 1983)
- Lexical: b u s +N +Pl
- LEXICON FST
- Intermediate: b u s + s
- FST1 ... FSTn (orthographic rules)
- Surface: b u s e s
39A Single FST for MA and MS
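[Figure: the LEXICON FST and the orthographic-rule FSTs
(FST1 ... FSTn) are combined into a single Morphology FST
that relates the lexical form b u s +N +Pl to the surface
form directly]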
40Can we do without the lexicon?
- Not really!
- But for some applications we might need to know
the stem only
- Surface form → stem: stemming
- Porter Stemming algorithm (1980) is a very
popular technique that does not use a lexicon.
41Derivational Rules
42Lexicon & Morphotactics
- Typically the list of word parts (lexicon) and the
models of ordering can be combined together into
an FSA which will recognise all the valid
word forms.
- For this to be possible the word parts must first
be classified into sublexicons.
- The FSA defines the morphotactics (ordering
constraints).
43Sublexicons to classify the list of word parts
reg-noun   irreg-pl-noun   irreg-sg-noun   plural
cat        mice            mouse           -s
fox        sheep           sheep
           geese           goose
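The sublexicons and the morphotactic FSA can be combined in a short sketch. The state names and arc table follow the noun FSA shown earlier in the deck, with both Q1 and Q2 assumed to be accepting; the function works on already-segmented morphemes and ignores spelling rules, so it accepts fox + -s without producing "foxes". The names `SUBLEXICONS`, `ARCS`, and `accepts` are illustrative.

```python
SUBLEXICONS = {
    "reg-noun": {"cat", "fox"},
    "irreg-pl-noun": {"mice", "sheep", "geese"},
    "irreg-sg-noun": {"mouse", "sheep", "goose"},
    "plural": {"-s"},
}

# Morphotactics: (state, sublexicon) -> next state
ARCS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural"): "q2",
}
FINAL = {"q1", "q2"}   # assumed accepting states


def accepts(morphemes):
    """Check a sequence of morphemes against the morphotactics.

    A morpheme may belong to several sublexicons (sheep), so a set of
    current states is tracked.
    """
    states = {"q0"}
    for m in morphemes:
        states = {
            ARCS[(q, cls)]
            for q in states
            for cls, members in SUBLEXICONS.items()
            if m in members and (q, cls) in ARCS
        }
    return bool(states & FINAL)


print(accepts(["cat", "-s"]))     # regular noun + plural
print(accepts(["goose", "-s"]))   # irregular noun cannot take -s
```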
44Towards the Analyser
- We can use lexc or xfst to build such an FSA (see
lex1.lexc)
- To augment this to produce an analysis we must
create a transducer Tnum which maps between the
lexical level and an "intermediate" level that is
needed to handle the spelling rules of English.
45Ambiguity
- Recall that in non-deterministic recognition
multiple paths through a machine may lead to an
accept state.
- Didn't matter which path was actually traversed
- In FSTs the path to an accept state does matter
since different paths represent different parses and
different outputs will result
46Ambiguity
- What's the right parse for
- Unionizable
- Union-ize-able
- Un-ion-ize-able
- Each represents a valid path through the
derivational morphology machine.
47Ambiguity
- There are a number of ways to deal with this
problem
- Simply take the first output found
- Find all the possible outputs (all paths) and
return them all (without choosing)
- Bias the search so that only one or a few likely
paths are explored
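The first two strategies can be illustrated with a small segmenter that enumerates every way of splitting a word into known morphs. The morph inventory is a hypothetical assumption for this example: it lists surface morphs, with "iz" standing for -ize after its e has been dropped before -able. No morphotactic checks are applied, so this shows path enumeration only.

```python
# Hypothetical surface morphs; "iz" is -ize with its e dropped before -able.
MORPHS = {"un", "ion", "union", "iz", "able"}


def segmentations(word):
    """Return every way to split word into known morphs (all paths)."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in MORPHS:
            for rest in segmentations(word[i:]):
                results.append([head] + rest)
    return results


all_parses = segmentations("unionizable")
print(all_parses)        # return them all, without choosing
print(all_parses[0])     # or simply take the first one found
```

Which parse comes first depends only on the search order, which is why taking the first output is cheap but arbitrary; biasing the search would require scores or preferences that this sketch does not model.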
48Generativity
- Nothing really privileged about the directions.
- We can write from one and read from the other or
vice-versa.
- One way is generation, the other way is analysis
49Multi-Level Tape Machines
- We use one machine to transduce between the
lexical and the intermediate level, and another
to handle the spelling changes to the surface
tape
50Note
- A key feature of this machine is that it doesn't
do anything to inputs to which it doesn't apply.
- Meaning that they are written out unchanged to
the output tape.
- Turns out the multiple tapes aren't really
needed; they can be compiled away.
51Overall Scheme
- We now have one FST that has explicit information
about the lexicon (actual words, their spelling,
facts about word classes and regularity).
- Lexical level to intermediate forms
- We have a larger set of machines that capture
orthographic/spelling rules.
- Intermediate forms to surface forms
52Other Issues
- How to formulate the rewrite rules?
- How to ensure coverage?
- What to do for unknown roots?
- Is it possible to learn the morphology of a language
in a supervised/unsupervised manner?
- What about non-linear morphology?
53References
- Chapter 3, pp. 57-89, Speech and Language Processing
by D. Jurafsky and J. H. Martin, Pearson Education
Asia, 2002 (2000)
- Slides based on the chapter
- Chapter 2, pp. 70, Natural Language Understanding
by J. Allen, Pearson Education, 2003 (1995)
- Slide by Monojit Choudhury
54Spelling errors
55Non-word error detection
- Any word not in a dictionary
- Assume it's a spelling error
- Need a big dictionary!
- What to use?
- FST dictionary!!
56Isolated word error correction
- How do I fix graffe?
- Search through all words
- graf
- craft
- grail
- giraffe
- Pick the one that's closest to graffe
- What does closest mean?
- We need a distance metric.
- The simplest one: edit distance.
- (More sophisticated probabilistic ones: noisy
channel)
57Edit Distance
- The minimum edit distance between two strings
- Is the minimum number of editing operations
- Insertion
- Deletion
- Substitution
- Needed to transform one into the other
58Minimum Edit Distance
- If each operation has cost of 1
- Distance between these is 5
- If substitutions cost 2 (Levenshtein)
- Distance between these is 8
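Minimum edit distance is usually computed by dynamic programming over prefixes of the two strings. The sketch below is a standard textbook formulation with unit insertion and deletion costs and a configurable substitution cost. The slide does not name the string pair behind the figures 5 and 8; the pair intention/execution from the Jurafsky and Martin chapter is assumed here, and the candidate list for "graffe" is the one given earlier in the deck.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Edit distance with insert/delete cost 1 and the given substitution cost."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                      # delete all of source[:i]
    for j in range(1, m + 1):
        dist[0][j] = j                      # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                               # deletion
                dist[i][j - 1] + 1,                               # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),   # substitution
            )
    return dist[n][m]


# Assumed example pair (not named on the slide)
print(min_edit_distance("intention", "execution"))              # unit costs
print(min_edit_distance("intention", "execution", sub_cost=2))  # substitution = 2

# Isolated-word correction: pick the candidate closest to the misspelling
candidates = ["graf", "craft", "grail", "giraffe"]
closest = min(candidates, key=lambda w: min_edit_distance("graffe", w))
print(closest)
```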
59Part of Speech Tagging
- Task
- assign the right part-of-speech tag, e.g. noun,
verb, conjunction, to a word in context
- POS taggers
- need to be fast in order to process large corpora
- should take no more than time linear in the size
of the corpus
- full parsing is slow
- e.g. context-free grammar parsing: $O(n^3)$, n = length
of the sentence
- POS taggers try to assign the correct tag without
actually parsing the sentence
60Part-of-Speech (POS)
- Categories to which words are assigned according
to their function.
- Noun, verb, adjective, preposition, adverb,
article, pronoun, conjunction, etc.
61POS Tagging
- The process of assigning a part-of-speech to each
word in a sentence
- Keep the book on the top shelf .
- Candidate tags per word: Keep (N, V), the (DET),
book (N, V), on (ADV, ADJ, P), the (DET), top (ADJ, N),
shelf (N), . (.)
62Techniques for POS tagging
- Linguistic approaches
- Statistical approaches
- Hidden Markov Model
- Maximum Entropy
- CRF
63Named Entity Recognition
- Named Entity Recognition (NER): locate and
classify the names in text
- Example
- Jawaharlal Nehru was the first prime
minister of India.
- Tags: Jawaharlal (Per-beg), Nehru (Per-end),
prime (Title-beg), minister (Title-end), India (Loc)
- Importance
- Information Extraction, Question-Answering
- Can help Summarization, ASR and MT
- Intelligent document access
- etc
64Syntax
- Order and group words together in sentence
- The dog barked at the visitor
- Vs
- Barked dog the at visitor the
65Semantics
- Understand word meanings and combine meanings in
larger units
- Lexical semantics
- Compositional semantics
66Discourse & Pragmatics
- Interpret utterances in context
- Resolve references
- I'm afraid I can't do that
- that ?
- Speech act interpretation
- Open the pod bay doors
- Command
67Phonology
- The study of the sound patterns of languages.
68Computational phonology
- Automatic Speech Recognition (ASR)
- take an acoustic waveform as input and produce
as output a string of words.
- Text-To-Speech (TTS)
- take a sequence of text words and produce as
output an acoustic waveform.
- How words are pronounced in terms of individual
speech units called phones.
69Speech sounds and phonetic transcription
- A phone: a speech sound, represented by IPA or
ARPAbet.
- IPA: an evolving standard with the goal of
transcribing the sounds of all human languages.
- ARPAbet: a phonetic alphabet designed for
American English using only ASCII symbols.
70Why phonology?
- Text-to-speech (TTS) applications include a
component which converts spelled words to
sequences of phonemes (sound representations).
- G2P: grapheme-to-phoneme conversion
- E.g., sight → S AY1 T
- John → J AA1 N
- Phoneme-to-grapheme conversion for speech recognition
71Varieties of sounds in people's speech
- Most phonemes have several different
pronunciations (called their allophones),
determined by nearby sounds, most usually by the
following sound.
- A striking instance of such variation is in the
realization of the phoneme /t/ in American
English.
72Grapheme phoneme relationships
- LTS: letter-to-sound, or G2P, relationships.
- In some languages this is simple, e.g. Sanskrit
- But in English and in French, it's very messy.
- Why? Because the spelling system is based on how
the language used to be pronounced, and the
pronunciation has since changed.
- Schwa deletion in Hindi