Title: Tokenization and Morphology
1 Tokenization and Morphology
School of Computing FACULTY OF ENGINEERING
- Eric Atwell, Language Research Group
- (with thanks to Katja Markert, Marti Hearst, and
other contributors)
2 Reminder
- The main areas of linguistics
- Rationalism: language models based on expert introspection
- Empiricism: models via machine learning from a corpus
- Corpus: text selected by language, genre, domain, ...
- Brown, LOB, BNC, Penn Treebank, MapTask, CCA, ...
- Corpus annotation: text headers, PoS, parses, ...
- Corpus size is the number of words, which depends on tokenisation
- We can count word tokens, word types, type-token distribution
- Lexeme/lemma is the root form, vs. inflections (be vs. am/is/was)
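As a rough illustration of the token/type distinction, the sketch below counts word tokens and word types in a short string. It assumes a deliberately naive tokenizer (lowercase, then split on whitespace), and the function name `type_token_counts` is hypothetical.

```python
from collections import Counter

def type_token_counts(text):
    """Return (number of tokens, number of types, frequency table)."""
    tokens = text.lower().split()   # naive assumption: whitespace-only tokenization
    freq = Counter(tokens)          # maps each word type to its token count
    return len(tokens), len(freq), freq

n_tokens, n_types, freq = type_token_counts("the cat saw the other cat")
print(n_tokens, n_types)   # 6 4
print(freq["the"])         # 2
```

A different tokenizer would give different counts for the same text, which is why corpus size depends on tokenisation.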
3 What's a word?
- How many words do you find in the following short text?
- What is the biggest/smallest plausible answer to this question?
- What problems do you encounter?
- It's a shame that our data-base is not up-to-date. It is a shame that um, data base A costs 2300.50 and that database B costs 5000. All databases cost far too much.
- Time: 3 minutes
4 Counting words: tokenization
- Tokenisation is a processing step where the input text is automatically divided into units called tokens, where each is either a word or a number or a punctuation mark
- So, word count can ignore numbers, punctuation marks (?)
- Word: continuous alphanumeric characters delineated by whitespace.
- Whitespace: space, tab, newline.
- BUT dividing at spaces is too simple: It's, data base
- Another approach is to use regular expressions to specify which substrings are valid words.
5 Regular expressions for tokenization
- wordr = r'(\w+)'
- hyphen = r'(\w+\-\s?\w+)'
- E.g. data-base. Allows for a space after the hyphen
- apostrophe = r'(\w+\'\w+)'
- E.g. isn't
- numbers = r'((\$)?\d+(\.)?\d+%?)'
- Needs to handle large numbers with commas
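A minimal sketch of a regular-expression tokenizer along these lines. The combined pattern, its ordering, and the name `tokenize` are assumptions made for illustration, not the patterns from the slide; the number alternative also accepts thousands separators, which addresses the comma problem noted above.

```python
import re

# Assumed patterns, tried left to right; more specific alternatives come first.
TOKEN_RE = re.compile(r"""
    \d+(?:,\d{3})*(?:\.\d+)?   # numbers, with optional thousands commas and decimals
  | \w+(?:-\w+)+               # hyphenated words such as data-base, up-to-date
  | \w+'\w+                    # words with an internal apostrophe such as It's
  | \w+                        # plain words
  | [^\w\s]                    # any other single non-space character (punctuation)
""", re.VERBOSE)

def tokenize(text):
    """Return the list of non-overlapping matches, scanning left to right."""
    return TOKEN_RE.findall(text)

print(tokenize("It's a shame that our data-base is not up-to-date."))
# ["It's", 'a', 'shame', 'that', 'our', 'data-base', 'is', 'not', 'up-to-date', '.']
```

This still splits "data base" into two tokens. Deciding that it is a variant of "database" needs more than a character-level pattern.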
6 Some Tokenization Issues
- Sentence Boundaries
- Punctuation, e.g. quotation marks around sentences?
- Periods: end of line or not?
- Proper Names
- What to do about
- New York-New Jersey train?
- California Governor Arnold Schwarzenegger?
- Contractions
- That's Fred's jacket's pocket.
- I'm doing what you're saying Don't do!.
9 Jabberwocky Analysis
- This is nonsense ... or is it?
- This is not English, but it's much more like English than it is like French or German or Chinese or ...
- Why do we pretty much understand the words?
10 Jabberwocky Analysis
- Why do we pretty much understand the words?
- We recognize combinations of morphemes.
- Chortled: laugh in a breathy, gleeful way (definition from Oxford American Dictionary). A combination of "chuckle" and "snort."
- Galumphing: moving in a clumsy, ponderous, or noisy manner. Perhaps a blend of "gallop" and "triumph." (Definition from Oxford American Dictionary)
- Activity
- Make up a word whose meaning can be inferred from the morphemes that you used.
11 Jabberwocky Analysis
- Why do we pretty much understand the words?
- Surrounding English words strongly indicate the parts-of-speech of the nonsense words.
- toves probably can perform an action
- (because they did gyre and gimble)
- wabe is probably a place.
- (they did in the wabe)
http://assets.cambridge.org/052185/542X/excerpt/052185542X_excerpt.pdf
12 Jabberwocky Analysis
- Surrounding English words strongly indicate the parts-of-speech of the nonsense words.
- It's similar in the French translation
Example from http://www.departments.bucknell.edu/linguistics/lectures/05lect02.html
13 Morphology
- Morphology
- The study of the way words are built up from smaller meaning units.
- Morphemes
- The smallest meaningful unit in the grammar of a language.
- Contrasts
- Derivational vs. Inflectional
- Regular vs. Irregular
- Concatenative vs. Templatic (root-and-pattern)
- A useful resource
- Glossary of linguistic terms by Eugene Loos
- http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm
14 Examples (English)
- unladylike
- 3 morphemes, 4 syllables
- un-: not
- lady: (well behaved) female adult human
- -like: having the characteristics of
- Can't break any of these down further without distorting the meaning of the units
- technique
- 1 morpheme, 2 syllables
- dogs
- 2 morphemes, 1 syllable
- -s, a plural marker on nouns
15 Morpheme Definitions
- Root
- The portion of the word that
- is common to a set of derived or inflected forms, if any, when all affixes are removed
- is not further analyzable into meaningful elements
- carries the principal portion of meaning of the words
- Stem
- The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.
- Affix
- A bound morpheme that is joined before, after, or within a root or stem.
- Clitic
- A morpheme that functions syntactically like a word, but does not appear as an independent phonological word
- Arabic al in Al-Qaeda (definite particle)
- English 's in Hal's (genitive particle)
16 Inflectional vs. Derivational
- Word Classes
- Parts of speech: noun, verb, adjective, etc.
- Word class dictates how a word combines with morphemes to form new words
- Inflection
- Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
- Doesn't change the word class
- Usually produces a predictable, nonidiosyncratic change of meaning.
- run -> runs, running, ran
- Derivation
- The formation of a new word or inflectable stem from another word or stem.
- compute -> computer -> computerization
17 Inflectional Morphology
- Adds
- tense, number, person, mood, aspect
- Word class doesn't change
- Word serves new grammatical role
- Examples
- come is inflected for person and number
- The pizza guy comes at noon.
- las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by -s
- las manzanas rojas (the red apples)
18 Derivational Morphology
- Word class changes: verb -> noun, noun -> adjective, etc.
- Nominalization (formation of nouns from other parts of speech, primarily verbs in English)
- computerization
- appointee
- killer
- fuzziness
- Formation of adjectives (primarily from nouns)
- computational
- clueless
- embraceable
- Difficult cases
- building: from which word class and sense of build?
19 Concatenative Morphology
- Morpheme+Morpheme+Morpheme
- Stems: also called lemma, base form, root, lexeme
- hope+ing -> hoping; hop -> hopping
- Affixes
- Prefixes: Antidisestablishmentarianism (anti-, dis-)
- Suffixes: Antidisestablishmentarianism (-ment, -arian, -ism)
- Infixes: hingi (borrow), humingi (borrower) in Tagalog
- Circumfixes: sagen (say), gesagt (said) in German
- Agglutinative Languages
- uygarlastiramadiklarimizdanmissinizcasina
- uygar+las+tir+ama+dik+lar+imiz+dan+mis+siniz+casina
- Behaving as if you are among those whom we could not cause to become civilized
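Concatenation is rarely plain string joining: spelling rules apply at the morpheme boundary, as in hoping and hopping. The sketch below covers only two such rules for the English suffix -ing, e-deletion and consonant doubling. The function name `add_ing` and the exact conditions are simplifying assumptions, and many real words fall outside them.

```python
VOWELS = "aeiou"

def add_ing(stem):
    """Attach -ing to a stem, applying two rough English spelling rules."""
    # e-deletion: a final 'e' after a consonant is dropped (hope + ing -> hoping)
    if len(stem) > 2 and stem[-1] == "e" and stem[-2] not in VOWELS:
        return stem[:-1] + "ing"
    # consonant doubling: stems with a single vowel letter that end in
    # consonant-vowel-consonant double the last letter (hop + ing -> hopping)
    one_vowel = sum(ch in VOWELS for ch in stem) == 1
    if (one_vowel and len(stem) >= 3
            and stem[-3] not in VOWELS
            and stem[-2] in VOWELS
            and stem[-1] not in VOWELS + "wxy"):
        return stem + stem[-1] + "ing"
    # default: plain concatenation (walk + ing -> walking)
    return stem + "ing"

for stem in ["hope", "hop", "walk", "see"]:
    print(stem, "->", add_ing(stem))
# hope -> hoping
# hop -> hopping
# walk -> walking
# see -> seeing
```

A morphological analyser has to run such rules in reverse, which is harder: "hoping" and "hopping" must be mapped back to different stems.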
20 Templatic Morphology
- Roots and Patterns
- Example: Hebrew or Arabic or Amharic (spoken in Ethiopia)
- Root
- Consists of 3 consonants, CCC
- Carries basic meaning
- Template
- Gives the ordering of consonants and vowels
- Specifies semantic information about the verb
- Active, passive, middle voice
- Example (Hebrew)
- lmd (to learn or study)
- CaCaC -> lamad (he studied)
- CiCeC -> limed (he taught)
- CuCaC -> lumad (he was taught)
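The root-and-pattern idea can be sketched in a few lines: each C slot in the template is filled with the next consonant of the root, and the other characters are copied through. The function name `apply_template` is hypothetical, and the sketch ignores everything a real analyser for Hebrew or Arabic has to deal with (script, weak roots, gemination).

```python
def apply_template(root, template):
    """Interleave root consonants into the 'C' slots of a template."""
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    # Copy vowels as they are; replace each 'C' with the next root consonant.
    return "".join(next(consonants) if slot == "C" else slot
                   for slot in template)

for template in ["CaCaC", "CiCeC", "CuCaC"]:
    print(template, "->", apply_template("lmd", template))
# CaCaC -> lamad
# CiCeC -> limed
# CuCaC -> lumad
```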
21 Morphological Analysis Tools
- Porter stemmer
- A simple approach: just hack off the end of the word!
- Frequently used, especially for Information Retrieval, but results are pretty ugly!
22 porter.demo()
- Original
- Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 . Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group . Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a nonexecutive director of this British industrial conglomerate . A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago , researchers reported .
- Results
- Pierr Vinken , 61 year old , will join the board as a nonexecut director Nov. 29 . Mr. Vinken is chairman of Elsevi N.V. , the Dutch publish group . Rudolph Agnew , 55 year old and former chairman of Consolid Gold Field PLC , wa name a nonexecut director of thi British industri conglomer . A form of asbesto onc use to make Kent cigarett filter ha caus a high percentag of cancer death among a group of worker expos to it more than 30 year ago , research report .
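The "hack off the end" idea can be sketched as below. This is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; it is a much cruder stand-in. The suffix list, the minimum stem length, and the name `crude_stem` are all assumptions.

```python
SUFFIXES = ["ing", "ed", "es", "s"]   # assumed list, tried in this order

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word   # no suffix applies: leave the word unchanged

for word in ["years", "publishing", "reported", "this", "was"]:
    print(word, "->", crude_stem(word))
# years -> year
# publishing -> publish
# reported -> report
# this -> thi
# was -> was
```

Even this toy version reproduces the kind of ugliness seen in the results above: "this" loses its final s because nothing tells the stemmer that it is not a plural.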
23 Morphological Analysis Tools
- WordNet's morphy()
- A slightly more sophisticated approach
- Use an understanding of inflectional morphology
- Uses a set of Rules of Detachment
- Use an Exception List for irregulars
- Handle collocations in a special way
- Do the transformation, compare the result to the WordNet dictionary
- If the transformation produces a real word, then keep it, else use the original word.
- For more details, see
- http://wordnet.princeton.edu/man/morphy.7WN.html
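The procedure described here (exception list first, then detachment rules, each checked against a dictionary) can be sketched with toy data. The lexicon, the exception list, and the rules below are illustrative assumptions, not WordNet's own tables, and `toy_morphy` is a hypothetical name; collocations are not handled.

```python
# Toy data: all three tables are illustrative assumptions, not WordNet's own.
LEXICON = {"dog", "run", "corpus", "church", "bus"}
EXCEPTIONS = {"corpora": "corpus", "ran": "run", "running": "run"}
DETACHMENT_RULES = [("ches", "ch"), ("ses", "s"), ("s", ""),
                    ("ing", ""), ("ed", "")]

def toy_morphy(word):
    """Return a base form for word, or word itself if none is found."""
    # 1. Irregular forms are looked up directly.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # 2. Try each detachment rule; keep the result only if it is a real word.
    for suffix, replacement in DETACHMENT_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in LEXICON:
                return candidate
    # 3. Nothing produced a dictionary word: use the original.
    return word

for word in ["dogs", "churches", "corpora", "glorps"]:
    print(word, "->", toy_morphy(word))
# dogs -> dog
# churches -> church
# corpora -> corpus
# glorps -> glorps
```

The dictionary check is what separates this from a plain stemmer: "glorps" is left alone because "glorp" is not a known word.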
24 Some morphy() output
- >>> wntools.morphy('dogs')
- 'dog'
- >>> wntools.morphy('running', pos='verb')
- 'run'
- >>> wntools.morphy('corpora')
- 'corpus'
25 Morphological Analysis Tools
- Very sophisticated programs have been developed
- Use a technique called Two-Level Phonology
- Has been applied to numerous languages
- Best known: PC-Kimmo
- After Kimmo Koskenniemi, based in part on work by Lauri Karttunen in 1983
- Uses
- A rules file which specifies the alphabet and the phonological (or spelling) rules,
- A lexicon file which lists lexical items and encodes morphotactic constraints.
- http://www.sil.org/pckimmo/
- Commercial versions are available
- inXight's LinguistX version, based on technology developed by Kaplan and others from Xerox PARC (or at least used to be)
26 Morphological Analysis Tools
- "Cheat": store all variants in a dictionary database, e.g.
- CatVar
- Categorial Variation Database
- A database of clusters of uninflected words (lexemes) and their categorial (i.e. part-of-speech) variants.
- Example: the developing cluster (develop(V), developer(N), developed(AJ), developing(N), developing(AJ), development(N)).
- http://clipdemos.umiacs.umd.edu/catvar
- based on published dictionaries: LDOCE, CELEX, OALD, PROPOSEL ...
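Lookup-based analysis amounts to indexing every stored form by its cluster. The sketch below uses only the example cluster above. The data layout and the names `INDEX` and `variants` are assumptions for illustration and say nothing about how CatVar itself is stored or queried.

```python
# One cluster, copied from the example above; a real database holds many.
CLUSTERS = [
    [("develop", "V"), ("developer", "N"), ("developed", "AJ"),
     ("developing", "N"), ("developing", "AJ"), ("development", "N")],
]

# Index every word form to the cluster it belongs to.
INDEX = {}
for cluster in CLUSTERS:
    for word, _pos in cluster:
        INDEX[word] = cluster

def variants(word, pos=None):
    """Return the words in word's cluster, optionally filtered by part of speech."""
    cluster = INDEX.get(word, [])
    return [w for w, p in cluster if pos is None or p == pos]

print(variants("developer", pos="N"))
# ['developer', 'developing', 'development']
print(variants("developed", pos="V"))
# ['develop']
```

No rules are involved at run time, so the coverage is exactly the coverage of the dictionaries the clusters were built from.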
27 MorphoChallenge
- One problem with rule-based systems (PC-Kimmo) or dictionary-lookup systems: porting to new languages
- In principle, Unsupervised Machine Learning (UML) could learn from any language data-set, by finding recurring patterns which correspond to roots, prefixes, postfixes
- MorphoChallenge is a contest to find the best UML morphological analyser
- http://www.cis.hut.fi/morphochallenge2005/
- http://www.cis.hut.fi/morphochallenge2007/
- http://www.cis.hut.fi/morphochallenge2008/
- Atwell, Roberts: Combinatory Hybrid Elementary Analysis of Text, http://www.cis.hut.fi/morphochallenge2005/P07_Atwell.pdf
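To give a flavour of "finding recurring patterns" without any language-specific rules, the sketch below proposes suffix candidates from a bare word list: a word ending counts as a candidate each time the remainder of the word also occurs as a word in its own right. This heuristic, the toy word list, and the name `candidate_suffixes` are assumptions for illustration; it is not the method of any MorphoChallenge entry.

```python
from collections import Counter

def candidate_suffixes(words, min_stem=3):
    """Count endings whose remaining stem is itself an attested word."""
    vocab = set(words)
    counts = Counter()
    for word in vocab:
        # Try every split point that leaves a stem of at least min_stem letters.
        for i in range(min_stem, len(word)):
            stem, suffix = word[:i], word[i:]
            if stem in vocab:
                counts[suffix] += 1
    return counts

words = ["walk", "walks", "walked", "walking",
         "talk", "talks", "talked", "jump", "jumping"]
counts = candidate_suffixes(words)
print(sorted(counts.items()))
# [('ed', 2), ('ing', 2), ('s', 2)]
```

Nothing here is specific to English, so the same code could be run on a word list from any language, though a usable analyser would need far more data and a way to handle prefixes, spelling changes, and false matches.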
28 Arabic morphological analysis
- Arabic is particularly challenging: different script, infixes, vowels may be left out in written Arabic
- Leeds researcher Majdi Sawalha: online analysis tool http://www.comp.leeds.ac.uk/sawalha/
- Sawalha, Majdi; Atwell, Eric (2010). Fine-Grain Morphological Analyzer and Part-of-Speech Tagger for Arabic Text. In Proceedings of the Language Resource and Evaluation Conference LREC 2010, 17-23 May 2010, Valletta, Malta.
- http://www.comp.leeds.ac.uk/sawalha/sawalha10lrecB.pdf
29 Reminder
- Tokenization: by whitespace, regular expressions
- Problems: It's, data-base, New York
- Jabberwocky shows we can break words into morphemes
- Morpheme types: root/stem, affix, clitic
- Derivational vs. Inflectional
- Regular vs. Irregular
- Concatenative vs. Templatic (root-and-pattern)
- Morphological analysers: Porter stemmer, Morphy, PC-Kimmo
- Morphology by lookup: CatVar, CELEX, OALD
- Unsupervised Machine Learning: MorphoChallenge