Tokenization and Morphology

1
Tokenization and Morphology
School of Computing FACULTY OF ENGINEERING
  • Eric Atwell, Language Research Group
  • (with thanks to Katja Markert, Marti Hearst, and
    other contributors)

2
Reminder
  • The main areas of linguistics
  • Rationalism: language models based on expert
    introspection
  • Empiricism: models via machine-learning from a
    corpus
  • Corpus: text selected by language, genre, domain, ...
  • Brown, LOB, BNC, Penn Treebank, MapTask, CCA, ...
  • Corpus annotation: text headers, PoS, parses, ...
  • Corpus size is the number of words, which depends on
    tokenisation
  • We can count word tokens, word types, type-token
    distribution
  • Lexeme/lemma is the root form, vs. inflections (be vs.
    am/is/was)
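
A minimal sketch of counting word tokens and word types, assuming the crudest possible tokenisation (lowercase, split on whitespace). The function name type_token_stats and the sample sentence are illustrative and not from the slides.

```python
from collections import Counter

def type_token_stats(text):
    """Return (number of tokens, number of types, frequency table)."""
    # Assumption: a token is any whitespace-delimited string, lowercased.
    tokens = text.lower().split()
    counts = Counter(tokens)
    return len(tokens), len(counts), counts

n_tokens, n_types, counts = type_token_stats(
    "the cat sat on the mat because the mat was warm"
)
print(n_tokens, n_types, counts.most_common(2))
```

The frequency table is the type-token distribution; a different tokenisation would change all three numbers.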

3
What's a word?
  • How many words do you find in the following short
    text?
  • What is the biggest/smallest plausible answer to
    this question?
  • What problems do you encounter?
  • "It's a shame that our data-base is not
    up-to-date. It is a shame that um, data base A
    costs 2300.50 and that database B costs 5000.
    All databases cost far too much."
  • Time: 3 minutes

4
Counting words: tokenization
  • Tokenisation is a processing step where the input
    text is automatically divided into units called
    tokens, where each is either a word or a number or
    a punctuation mark
  • So, word count can ignore numbers, punctuation
    marks (?)
  • Word: continuous alphanumeric characters
    delineated by whitespace.
  • Whitespace: space, tab, newline.
  • BUT dividing at spaces is too simple: It's, data
    base
  • Another approach is to use regular expressions to
    specify which substrings are valid words.
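
A small sketch of why splitting at whitespace is too simple, using one sentence from the earlier exercise text. The variable names are illustrative only.

```python
text = ("It's a shame that um, data base A costs 2300.50 "
        "and that database B costs 5000.")

# str.split() with no arguments splits on any run of spaces, tabs or newlines.
tokens = text.split()

print(len(tokens))
print(tokens)

# Punctuation stays glued to the neighbouring word ...
glued = [t for t in tokens if not t[-1].isalnum()]
print(glued)
# ... and "data base" counts as two words while "database" counts as one.
```

Here the comma in "um," and the full stop in "5000." end up inside tokens, which is one reason to specify valid tokens with regular expressions instead.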

5
Regular expressions for tokenization
  • wordr = r'(\w+)'
  • hyphen = r'(\w+\-\s?\w+)'
  • E.g. data-base; allows for a space after the hyphen
  • apostrophe = r'(\w+\'\w+)'
  • E.g. isn't
  • numbers = r'((\$|#)?\d+(\.)?\d+%?)'
  • Needs to handle large numbers with commas
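
One way the patterns on this slide could be combined into a single tokenizer is sketched below. This is an assumption-laden illustration, not the lecture's code: the patterns are rewritten without capturing groups so that re.findall returns whole tokens, the number pattern is extended to accept comma-grouped thousands, and a punctuation pattern is added. The names TOKEN_RE and tokenize are hypothetical.

```python
import re

# Order matters: more specific patterns are tried before the plain word pattern.
number = r"\$?\d+(?:,\d{3})*(?:\.\d+)?%?"   # 5000, 2,300.50, $12, 40%
hyphen = r"\w+-\s?\w+"                      # data-base, data- base
apostrophe = r"\w+'\w+"                     # It's, isn't
word = r"\w+"
punct = r"[^\w\s]"                          # any single punctuation mark

TOKEN_RE = re.compile("|".join([number, hyphen, apostrophe, word, punct]))

def tokenize(text):
    """Return the list of tokens matched left to right."""
    return TOKEN_RE.findall(text)

print(tokenize("It's a data-base, isn't it? It costs 2,300.50."))
```

The hyphen pattern only joins two parts, so a form such as up-to-date would still be split; handling it would need a repeated group.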

6
Some Tokenization Issues
  • Sentence Boundaries
  • Punctuation, e.g. quotation marks around sentences?
  • Periods: end of line or not?
  • Proper Names
  • What to do about:
  • New York-New Jersey train?
  • California Governor Arnold Schwarzenegger?
  • Contractions
  • That's Fred's jacket's pocket.
  • I'm doing what you're saying "Don't do!".

9
Jabberwocky Analysis
  • This is nonsense ... or is it?
  • This is not English, but it's much more like
    English than it is like French or German or
    Chinese or ...
  • Why do we pretty much understand the words?

10
Jabberwocky Analysis
  • Why do we pretty much understand the words?
  • We recognize combinations of morphemes.
  • Chortled - Laugh in a breathy, gleeful way
    (Definition from Oxford American Dictionary) A
    combination of "chuckle" and "snort."
  • Galumphing - Moving in a clumsy, ponderous, or
    noisy manner. Perhaps a blend of "gallop" and
    "triumph." (Definition from Oxford American
    Dictionary)
  • Activity
  • Make up a word whose meaning can be inferred from
    the morphemes that you used.

11
Jabberwocky Analysis
  • Why do we pretty much understand the words?
  • Surrounding English words strongly indicate the
    parts-of-speech of the nonsense words.
  • toves probably can perform an action
  • (because they did gyre and gimble)
  • wabe is probably a place.
  • (they did in the wabe)

http://assets.cambridge.org/052185/542X/excerpt/052185542X_excerpt.pdf
12
Jabberwocky Analysis
  • Surrounding English words strongly indicate the
    parts-of-speech of the nonsense words.
  • It's similar in the French translation

Example from http://www.departments.bucknell.edu/linguistics/lectures/05lect02.html
13
Morphology
  • Morphology
  • The study of the way words are built up from
    smaller meaning units.
  • Morphemes
  • The smallest meaningful unit in the grammar of a
    language.
  • Contrasts
  • Derivational vs. Inflectional
  • Regular vs. Irregular
  • Concatenative vs. Templatic (root-and-pattern)
  • A useful resource
  • Glossary of linguistic terms by Eugene Loos
  • http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm

14
Examples (English)
  • unladylike
  • 3 morphemes, 4 syllables
  • un-: not
  • lady: (well behaved) female adult human
  • -like: having the characteristics of
  • Can't break any of these down further without
    distorting the meaning of the units
  • technique
  • 1 morpheme, 2 syllables
  • dogs
  • 2 morphemes, 1 syllable
  • -s, a plural marker on nouns

15
Morpheme Definitions
  • Root
  • The portion of the word that
  • is common to a set of derived or inflected forms,
    if any, when all affixes are removed
  • is not further analyzable into meaningful
    elements
  • carries the principal portion of meaning of the
    words
  • Stem
  • The root or roots of a word, together with any
    derivational affixes, to which inflectional
    affixes are added.
  • Affix
  • A bound morpheme that is joined before, after, or
    within a root or stem.
  • Clitic
  • a morpheme that functions syntactically like a
    word, but does not appear as an independent
    phonological word
  • Arabic al- in Al-Qaeda (definite particle)
  • English 's in Hal's (genitive particle)

16
Inflectional vs. Derivational
  • Word Classes
  • Parts of speech: noun, verb, adjective, etc.
  • Word class dictates how a word combines with
    morphemes to form new words
  • Inflection
  • Variation in the form of a word, typically by
    means of an affix, that expresses a grammatical
    contrast.
  • Doesn't change the word class
  • Usually produces a predictable, nonidiosyncratic
    change of meaning.
  • run -> runs, running, ran
  • Derivation
  • The formation of a new word or inflectable stem
    from another word or stem.
  • compute -> computer -> computerization

17
Inflectional Morphology
  • Adds
  • tense, number, person, mood, aspect
  • Word class doesn't change
  • Word serves new grammatical role
  • Examples
  • come is inflected for person and number
  • The pizza guy comes at noon.
  • las and rojas are inflected for agreement with
    manzanas in grammatical gender by -a and in
    number by -s
  • las manzanas rojas (the red apples)

18
Derivational Morphology
  • Word class changes: verb -> noun, noun -> adjective,
    etc.
  • Nominalization (formation of nouns from other
    parts of speech, primarily verbs in English)
  • computerization
  • appointee
  • killer
  • fuzziness
  • Formation of adjectives (primarily from nouns)
  • computational
  • clueless
  • embraceable
  • Difficult cases
  • building: from which word class and sense of
    build?

19
Concatenative Morphology
  • Morpheme+Morpheme+Morpheme
  • Stems: also called lemma, base form, root, lexeme
  • hope+ing -> hoping, hop+ing -> hopping
  • Affixes
  • Prefixes: Antidisestablishmentarianism
  • Suffixes: Antidisestablishmentarianism
  • Infixes: hingi (borrow), humingi (borrower) in
    Tagalog
  • Circumfixes: sagen (say), gesagt (said) in
    German
  • Agglutinative Languages
  • uygarlastiramadiklarimizdanmissinizcasina
  • uygar+las+tir+ama+dik+lar+imiz+dan+mis+siniz+casina
  • Behaving as if you are among those whom we could
    not cause to become civilized
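
Concatenation is not always plain string joining: hope+ing and hop+ing both need a spelling adjustment. A rough sketch of two such English spelling rules follows. The function add_ing and its conditions are simplifying assumptions and ignore stress, so a word like "open" would be handled wrongly.

```python
VOWELS = set("aeiou")

def add_ing(stem):
    """Attach -ing to an English verb stem using two crude spelling rules."""
    # Rule 1 (e-deletion): drop a single final "e", as in hope -> hoping.
    if len(stem) > 2 and stem.endswith("e") and not stem.endswith("ee"):
        return stem[:-1] + "ing"
    # Rule 2 (consonant doubling): consonant-vowel-consonant endings double
    # the last consonant, as in hop -> hopping (w, x and y never double).
    if (len(stem) >= 3
            and stem[-1] not in VOWELS and stem[-1] not in "wxy"
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS):
        return stem + stem[-1] + "ing"
    # Default: simple concatenation.
    return stem + "ing"

for verb in ["hope", "hop", "walk", "see"]:
    print(verb, "->", add_ing(verb))
```

A morphological analyser has to run rules like these in reverse, which is why hoping and hopping cannot both be analysed by just removing -ing.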

20
Templatic Morphology
  • Roots and Patterns
  • Example: Hebrew or Arabic or Amharic (spoken in
    Ethiopia)
  • Root
  • Consists of 3 consonants: CCC
  • Carries basic meaning
  • Template
  • Gives the ordering of consonants and vowels
  • Specifies semantic information about the verb
  • Active, passive, middle voice
  • Example (Hebrew)
  • lmd (to learn or study)
  • CaCaC -> lamad (he studied)
  • CiCeC -> limed (he taught)
  • CuCaC -> lumad (he was taught)

21
Morphological Analysis Tools
  • Porter stemmer
  • A simple approach: just hack off the end of the
    word!
  • Frequently used, especially for Information
    Retrieval, but results are pretty ugly!

22
porter.demo()
  • Original:
    Pierre Vinken , 61 years old , will join the board as a nonexecutive
    director Nov. 29 . Mr. Vinken is chairman of Elsevier N.V. , the Dutch
    publishing group . Rudolph Agnew , 55 years old and former chairman of
    Consolidated Gold Fields PLC , was named a nonexecutive director of
    this British industrial conglomerate . A form of asbestos once used to
    make Kent cigarette filters has caused a high percentage of cancer
    deaths among a group of workers exposed to it more than 30 years ago ,
    researchers reported .
  • Results:
    Pierr Vinken , 61 year old , will join the board as a nonexecut
    director Nov. 29 . Mr. Vinken is chairman of Elsevi N.V. , the Dutch
    publish group . Rudolph Agnew , 55 year old and former chairman of
    Consolid Gold Field PLC , wa name a nonexecut director of thi British
    industri conglomer . A form of asbesto onc use to make Kent cigarett
    filter ha caus a high percentag of cancer death among a group of
    worker expos to it more than 30 year ago , research report .
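
To make "hack off the end of the word" concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list, the min_stem threshold and the name crude_stem are assumptions for illustration.

```python
# Longest suffixes first, so "ing" is tried before "s".
SUFFIXES = ["ing", "ed", "es", "s"]

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["years", "reported", "publishing", "caused", "this"]:
    print(w, "->", crude_stem(w))
```

Even this toy version reproduces the kind of non-word output seen above, such as "caus" and "thi", because it has no dictionary to check its results against.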

23
Morphological Analysis Tools
  • WordNet's morphy()
  • A slightly more sophisticated approach
  • Use an understanding of inflectional morphology
  • Uses a set of Rules of Detachment
  • Use an Exception List for irregulars
  • Handle collocations in a special way
  • Do the transformation, compare the result to the
    WordNet dictionary
  • If the transformation produces a real word, then
    keep it, else use the original word.
  • For more details, see
  • http://wordnet.princeton.edu/man/morphy.7WN.html
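
The procedure described on this slide (exception list, detachment rules, dictionary check, fall back to the original word) can be sketched as follows. This is a toy under stated assumptions, not WordNet's implementation: LEXICON, EXCEPTIONS, DETACHMENT_RULES and toy_morphy are hypothetical names, the word lists are tiny stand-ins, and part of speech is ignored.

```python
# Stand-in for the WordNet dictionary.
LEXICON = {"dog", "run", "corpus", "church", "fly", "bus"}

# Stand-in exception list for irregular forms.
EXCEPTIONS = {"corpora": "corpus", "ran": "run", "running": "run"}

# (suffix to detach, replacement ending), tried in order.
DETACHMENT_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", ""), ("ed", "")]

def toy_morphy(word):
    """Return a base form if one can be verified, else the original word."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    if word in LEXICON:
        return word
    for suffix, replacement in DETACHMENT_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            # Keep the transformation only if it yields a known word.
            if candidate in LEXICON:
                return candidate
    return word

for w in ["dogs", "running", "corpora", "flies", "blorks"]:
    print(w, "->", toy_morphy(w))
```

The dictionary check is what separates this approach from a plain stemmer: an unknown form such as "blorks" is returned unchanged rather than truncated.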

24
Some morphy() output
  • >>> wntools.morphy('dogs')
  • 'dog'
  • >>> wntools.morphy('running', pos='verb')
  • 'run'
  • >>> wntools.morphy('corpora')
  • 'corpus'
  • >>>

25
Morphological Analysis Tools
  • Very sophisticated programs have been developed
  • Use a technique called Two-Level Phonology
  • Has been applied to numerous languages
  • Best known: PC-KIMMO
  • After Kimmo Koskenniemi, based in part on work by
    Lauri Karttunen in 1983
  • Uses:
  • A rules file which specifies the alphabet and the
    phonological (or spelling) rules,
  • A lexicon file which lists lexical items and
    encodes morphotactic constraints.
  • http://www.sil.org/pckimmo/
  • Commercial versions are available
  • inXight's LinguistX version based on technology
    developed by Kaplan and others from Xerox PARC
    (or at least used to be)

26
Morphological Analysis Tools
  • "Cheat": store all variants in a dictionary
    database, e.g.
  • CatVar
  • Categorial Variation Database
  • A database of clusters of uninflected words
    (lexemes) and their categorial (i.e.
    part-of-speech) variants.
  • Example: the "developing" cluster (develop(V),
    developer(N), developed(AJ), developing(N),
    developing(AJ), development(N)).
  • http://clipdemos.umiacs.umd.edu/catvar
  • Based on published dictionaries: LDOCE, CELEX,
    OALD, PROPOSEL ...
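
Morphology by lookup amounts to indexing clusters by their members. The sketch below uses only the "developing" cluster quoted on this slide; the in-memory layout and the names CLUSTERS, INDEX and variants are assumptions and do not reflect CatVar's actual file format or interface.

```python
from collections import defaultdict

# Each cluster is a list of (word, part-of-speech) pairs.
CLUSTERS = [
    [("develop", "V"), ("developer", "N"), ("developed", "AJ"),
     ("developing", "N"), ("developing", "AJ"), ("development", "N")],
]

# Map every word to the clusters it belongs to.
INDEX = defaultdict(list)
for cluster in CLUSTERS:
    for word, _pos in cluster:
        if cluster not in INDEX[word]:
            INDEX[word].append(cluster)

def variants(word, pos=None):
    """Return the other members of the word's clusters, optionally by POS."""
    found = set()
    for cluster in INDEX.get(word, []):
        for other, other_pos in cluster:
            if other != word and (pos is None or other_pos == pos):
                found.add(other)
    return sorted(found)

print(variants("developer", "V"))
print(variants("develop", "AJ"))
```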

27
MorphoChallenge
  • One problem with rule-based systems (PC-KIMMO) or
    dictionary-lookup systems: porting to new
    languages
  • In principle, Unsupervised Machine Learning could
    learn from any language data-set, by finding
    recurring patterns which correspond to roots,
    prefixes, suffixes
  • MorphoChallenge is a contest to find the best UML
    morphological analyser
  • http://www.cis.hut.fi/morphochallenge2005/
  • http://www.cis.hut.fi/morphochallenge2007/
  • http://www.cis.hut.fi/morphochallenge2008/
  • Atwell, Roberts: Combinatory Hybrid Elementary
    Analysis of Text
    http://www.cis.hut.fi/morphochallenge2005/P07_Atwell.pdf
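
A very small illustration of "finding recurring patterns" without any language-specific rules: count word endings whose removal leaves another word attested in the same data. This heuristic is an assumption chosen for brevity and is not the method of any MorphoChallenge entry; the function name and the word list are made up.

```python
from collections import Counter

def candidate_suffixes(words, max_len=4, min_stem=3):
    """Count endings whose removal yields another word in the vocabulary."""
    vocab = set(words)
    counts = Counter()
    for w in vocab:
        for k in range(1, max_len + 1):
            stem, suffix = w[:-k], w[-k:]
            # Evidence for a suffix: the remaining stem is itself a known word.
            if len(stem) >= min_stem and stem in vocab:
                counts[suffix] += 1
    return counts

words = ["walk", "walks", "walked", "walking",
         "talk", "talks", "talked", "talking",
         "jump", "jumps", "jumped"]
counts = candidate_suffixes(words)
print(counts.most_common())
```

Because nothing in the function is specific to English, the same code could be run on a word list from any language written with an alphabet, which is the portability argument made above.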

28
Arabic morphological analysis
  • Arabic is particularly challenging - different
    script, infixes, vowels may be left out in
    written Arabic
  • Leeds researcher Majdi Sawalha: online analysis
    tool http://www.comp.leeds.ac.uk/sawalha/
  • Sawalha, Majdi; Atwell, Eric (2010). Fine-Grain
    Morphological Analyzer and Part-of-Speech Tagger
    for Arabic Text. In Proceedings of the Language
    Resource and Evaluation Conference LREC 2010,
    17-23 May 2010, Valletta, Malta.
  • http://www.comp.leeds.ac.uk/sawalha/sawalha10lrecB.pdf

29
Reminder
  • Tokenization: by whitespace, regular expressions
  • Problems: It's, data-base, New York
  • Jabberwocky shows we can break words into
    morphemes
  • Morpheme types: root/stem, affix, clitic
  • Derivational vs. Inflectional
  • Regular vs. Irregular
  • Concatenative vs. Templatic (root-and-pattern)
  • Morphological analysers: Porter stemmer, Morphy,
    PC-KIMMO
  • Morphology by lookup: CatVar, CELEX, OALD
  • Unsupervised Machine Learning: MorphoChallenge