Acquisition of Morphology by Computer: Unsupervised learning

1
Acquisition of Morphology by Computer
Unsupervised learning
  • John Goldsmith
  • The University of Chicago

2
The goal
  • To produce a morphological analysis of a corpus
    from an unknown language automatically
  • that is, with no knowledge of the structure of
    that language built in
  • To produce both generalizations about the
    language, and a correct analysis of each word in
    the corpus.

3
raw data
Linguistica
Analyzed data
4
  • Implemented in Linguistica, a program that runs
    under Windows that you can download at
  • humanities.uchicago.edu/faculty/goldsmith

5
  • The goal is not to eliminate either linguists or
    linguistics
  • The goal is to understand what the goal of a
    linguistic analysis is so well that we can state
    it explicitly and algorithmically.

6
Other work in this area
  • Derrick Higgins on Thursday
  • Michael Brent 1993
  • Zellig Harris 1955 and 1967; follow-up by Hafer
    and Weiss 1974

7
Zellig Harris: Right-branching count
  • Right-branching count of jum = 2
    • p (jump, jumping, jumps, jumped, jumpy)
    • b (jumble)
  • Right-branching count of jump = 5
    • e (jumped)
    • i (jumping)
    • s (jumps)
    • y (jumpy)
    • end of word (jump)
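
The two counts above can be reproduced with a few lines of Python. This is a minimal sketch, not Harris's own procedure: the function name `successor_count` and the `#` end-of-word marker are illustrative assumptions, and the word list is just the six words named on the slide.

```python
def successor_count(prefix, words):
    """Count the distinct continuations of `prefix` in `words`.

    The end of a word counts as one continuation, written "#" here
    (an illustrative marker, not something the slide specifies).
    """
    continuations = set()
    for word in words:
        if word.startswith(prefix):
            if len(word) > len(prefix):
                continuations.add(word[len(prefix)])  # next letter
            else:
                continuations.add("#")  # the word ends at the prefix
    return len(continuations)


words = ["jump", "jumping", "jumps", "jumped", "jumpy", "jumble"]
print(successor_count("jum", words))   # 2  (p, b)
print(successor_count("jump", words))  # 5  (e, i, s, y, end of word)
```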

8
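Zellig Harris: Right-branching count
Counts after each prefix of accepting:
  • a 19, ac 9, acc 6, acce 3, accep 1, accept 3,
    accepti 1, acceptin 1
  • Predicted break: accept + ing, where the count
    rises from 1 to 3
  • After accept: able, ing
  • After acce: lerate (accelerate), nted (accented)
  • After acc: ident (accident), laim (acclaim),
    ommodate (accommodate), redited (accredited),
    used (accused)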
9
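Zellig Harris: Right-branching count
  • After dea (5 continuations): d (dead), f (deaf),
    l (deal), n (dean), t (death)
  • After de: b (debate, debuting), c (decade,
    december, decide), d (dedicate, deduce, deduct),
    e (deep), f
  • After def (3 continuations): e (defeat, defend,
    defer), i (deficit, deficiency), r (defraud)
  • (Diagram labels: Bad predictions, Good
    predictions. The remaining node labels a, e, i, o
    and the counts 18 and 9 cannot be placed from the
    transcript.)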
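Zellig Harris: Right-branching count
Counts after each prefix of conservatives:
  • c 9, co 18, con 11, cons 6, conse 4, conser 1,
    conserv 2, conserva 1, conservat 1, conservati 2,
    conservativ 1, conservative 1
  • Predicted breaks at the peaks: co + nserv (wrong),
    conserv + ati (right), conservati + ves (wrong)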
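
One way to turn a row of counts into predicted breaks is to cut after every prefix whose count is a strict local peak. The sketch below assumes that rule (the slide does not state the exact cutoff) and assumes the twelve counts belong, in order, to the prefixes c, co, con, ... conservative. The name `predicted_breaks` is illustrative.

```python
def predicted_breaks(word, counts):
    """Split `word` after each prefix whose count is a strict local peak.

    counts[i] is taken to be the count after the prefix word[:i + 1].
    """
    pieces, start = [], 0
    for i in range(1, len(counts) - 1):
        if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]:
            pieces.append(word[start:i + 1])
            start = i + 1
    pieces.append(word[start:])  # whatever follows the last break
    return pieces


counts = [9, 18, 11, 6, 4, 1, 2, 1, 1, 2, 1, 1]
print(predicted_breaks("conservatives", counts))
# ['co', 'nserv', 'ati', 'ves']
```

Under this rule the peaks at 18, 2 and 2 give three breaks, and only the one after conserv matches a real morpheme boundary.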
11
The problem with Harris's approach
  • it cannot distinguish between
  • phonological freedom due to phonological patterns
    (C after V, V after C)
  • phonological freedom due to morphological pattern
    (...any morpheme after a ...)
  • But that's the problem it's supposed to solve.

12
Global approach
  • Focus on devising a method for evaluating a
    hypothesis, given the data.
  • Finding explicit methods of discovery is
    important, but those methods play no role in
    evaluating the analysis for a given corpus.
  • (Very similar in conception to Chomsky's notion
    of an evaluation metric.)

13
Framework for evaluation
  • Jorma Rissanen's Minimum Description Length
    (MDL).
  • Quite intricate, but we can get a very good feel
    for the general idea with a naïve version of
    MDL...

14
Naive description length
Count the total number of letters in the list of
stems and affixes: the fewer, the better.
15
Intuition
A word which is morphologically complex reveals
that composite character by virtue of being
composed of (one or more) strings of letters
which have a relatively high frequency throughout
the corpus.
16
Naive description length 2
  • Lexicographers know what they are doing when they
    indicate the entry for the verb laugh as laugh,
    ~s, ~ed, ~ing.
  • They recognize that the tilde (~) allows them
    to utilize the regularities of the language in
    order to save space and specification, and
    implicitly to underscore the regularity of the
    pattern that the stem possesses.

17
  • Morphological analysis is not merely a matter of
    frequency.
  • Not every word that ends in ing is
    morphologically complex: string, sing, etc.

18
Frequencies are important but far from the whole
story
  • Every word that ends in ity also ends in ty.
  • Hence ty's frequency > ity's frequency.
  • Yet -ty is a suffix only in a few words (like
    six-ty)
  • ity is a suffix in far more words, despite its
    lower frequency (insan-ity, precoc-ity, etc.).
  • frequency(y) > frequency(ty) > frequency(ity)
  • y is a suffix in some words (dirt-y, runn-y,
    etc.), but not in insan-ity, precoc-ity, etc.
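
The nesting of these frequencies is easy to check on a word list. The sketch below uses only the five example words from this slide as a toy corpus; the function name `ending_frequency` is an illustrative assumption.

```python
def ending_frequency(ending, words):
    """Number of words in `words` that end in `ending`."""
    return sum(1 for word in words if word.endswith(ending))


# Toy corpus built from the slide's own examples.
words = ["insanity", "precocity", "sixty", "dirty", "runny"]

for ending in ("y", "ty", "ity"):
    print(ending, ending_frequency(ending, words))
# y 5
# ty 4
# ity 2
```

The shortest ending always wins on raw frequency, which is why frequency alone cannot pick out ity as the suffix in insan-ity.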

19
Naive Minimum Description Length
  • Analyze the words of a corpus into stem + suffix
    with the requirement that every stem and every
    suffix must be used in at least 2 distinct words.
  • Tally up the total number of letters in (a) each
    of the proposed stems, (b) each of the proposed
    suffixes, and (c) each of the unanalyzed words,
    and call that total the naive description
    length.

20
Naive Minimum Description Length
  • Corpus:
    • jump, jumps, jumping
    • laugh, laughed, laughing
    • sing, sang, singing
    • the, dog, dogs
    • total: 61 letters
  • Analysis:
    • Stems: jump laugh sing sang dog (20 letters)
    • Suffixes: s ing ed (6 letters)
    • Unanalyzed: the (3 letters)
    • total: 29 letters.

Notice that the description length goes UP if we
analyze sing into s + ing.
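
The tally on this slide can be checked mechanically. The sketch below is a minimal illustration with an assumed helper name, `naive_description_length`. A direct count of the twelve listed words gives 61 letters. The alternative analysis is one reading of the slide's remark: it adds a one-letter stem s so that sing can be split as s + ing, while sing stays in the stem list because singing still needs it.

```python
def naive_description_length(stems, suffixes, unanalyzed):
    """Total letters in the stems, the suffixes and the unanalyzed words."""
    return (sum(len(s) for s in stems)
            + sum(len(s) for s in suffixes)
            + sum(len(w) for w in unanalyzed))


corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]
print(sum(len(w) for w in corpus))  # 61 letters in the raw word list

stems = ["jump", "laugh", "sing", "sang", "dog"]
suffixes = ["s", "ing", "ed"]
unanalyzed = ["the"]
print(naive_description_length(stems, suffixes, unanalyzed))  # 29

# Assumed alternative: also treat sing as s + ing, which needs a new stem "s".
print(naive_description_length(stems + ["s"], suffixes, unanalyzed))  # 30
```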
21
  • Frequencies matter, but only in the overarching
    context of a total morphological analysis of all
    of the words of the language.

22
Let's look at how the work is done, step by
step...
23
Corpus
Pick a large corpus from a language: 5,000 to
1,000,000 words.
24
Feed it into the bootstrapping heuristic...
(Diagram: Corpus, Bootstrap heuristic)
25
Out of which comes a preliminary
morphology, which need not be superb.
(Diagram: Corpus, Bootstrap heuristic, Morphology)
26
Feed it to the incremental heuristics...
(Diagram: Corpus, Bootstrap heuristic, Morphology,
incremental heuristics)
27
Out comes a modified morphology.
(Diagram: Corpus, Bootstrap heuristic, Morphology,
incremental heuristics, modified morphology)
28
Is the modification an improvement? Ask MDL!
(Diagram: Corpus, Bootstrap heuristic, Morphology,
incremental heuristics, modified morphology)
29
If it is an improvement, replace the morphology...
(Diagram: Corpus, Bootstrap heuristic, modified
morphology, Morphology, Garbage)
30
Send it back to the incremental heuristics
again...
(Diagram: Corpus, Bootstrap heuristic, modified
morphology, incremental heuristics)
31
Continue until there are no improvements to try.
(Diagram: Morphology, modified morphology,
incremental heuristics)
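
The cycle walked through on slides 23 to 31 amounts to a greedy search. The sketch below shows only that control flow. The three callables `bootstrap`, `propose` and `description_length` are hypothetical placeholders rather than Linguistica's actual interfaces, and the toy usage replaces morphologies with integers just to show the loop terminating.

```python
def mdl_search(corpus, bootstrap, propose, description_length):
    """Greedy loop: keep a modified morphology only if MDL says it is better."""
    morphology = bootstrap(corpus)              # preliminary, need not be superb
    best = description_length(morphology, corpus)
    improved = True
    while improved:                             # stop when nothing improves
        improved = False
        for candidate in propose(morphology, corpus):   # incremental heuristics
            cost = description_length(candidate, corpus)
            if cost < best:                     # an improvement: replace
                morphology, best = candidate, cost
                improved = True
                break                           # send it back to the heuristics
    return morphology


# Toy stand-ins: a "morphology" is an integer, and the best one is 3.
result = mdl_search(
    corpus=None,
    bootstrap=lambda corpus: 0,
    propose=lambda m, corpus: [m - 1, m + 1],
    description_length=lambda m, corpus: (m - 3) ** 2,
)
print(result)  # 3
```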
32
Bootstrapping: initial hypothesis = initial
morphology of the corpus
33
First, a set of candidate suffixes for the
language
  • Using some interesting statistics.

34
1. Observed frequency of a string (e.g., ing)
2. Predicted frequency of the same string if
there were no morphemes in the language
3. The computed stickiness of that string
4. Weight the stickiness (3) by how often
the string shows up in the corpus
35
  • Rank all word-final sequences of letters (of
    length 1-4 letters)
  • This gives us an excellent first guess of the
    suffixes of the language.
  • See Handout for English, French, Spanish, and
    Latin.
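
The ranking step might be sketched as follows. The transcript does not give the formulas, so everything numeric here is an assumption: the predicted frequency is taken to be the product of the individual letter frequencies, stickiness is taken to be the base-2 log of observed over predicted, and the weight is the observed frequency. The function name `rank_suffix_candidates` and the toy word list are illustrative as well.

```python
import math
from collections import Counter


def rank_suffix_candidates(words, max_len=4):
    """Rank word-final strings of 1..max_len letters by weighted stickiness."""
    letter_counts = Counter(ch for word in words for ch in word)
    total_letters = sum(letter_counts.values())

    ending_counts = Counter()
    for word in words:
        # Leave at least one letter for the stem.
        for n in range(1, min(max_len, len(word) - 1) + 1):
            ending_counts[word[-n:]] += 1

    scored = []
    for ending, count in ending_counts.items():
        observed = count / len(words)                        # step 1
        predicted = math.prod(letter_counts[ch] / total_letters
                              for ch in ending)              # step 2 (assumed)
        stickiness = math.log2(observed / predicted)         # step 3 (assumed)
        scored.append((ending, observed * stickiness))       # step 4
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


words = ["jumping", "laughing", "singing", "walking",
         "jumps", "laughs", "walks", "the", "dog"]
for ending, score in rank_suffix_candidates(words)[:5]:
    print(ending, round(score, 3))
```

On a corpus this small the ranking is only indicative; the slide's claim concerns corpora of thousands of words.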

36
(No Transcript)
37
Given a candidate set of 100 suffixes...
  • It is not difficult to find the set of stems that
    gives us the largest number of analyses employing
    only those suffixes.
  • We use these to find the major signatures present
    in the corpus ...

38
Discovery of signatures
The first 8 stems in the largest signature in
a 500,000-word corpus of English.
Set of suffixes that appears with all of these
stems
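
Taking a signature to be the set of suffixes that appears with a group of stems, as the slide describes, a simplified grouping step might look like this. It is a sketch under assumptions: it strips every candidate suffix from every word instead of searching for the stem set that maximizes the number of analyses, it writes the null suffix as an empty string, and the name `find_signatures` is illustrative.

```python
from collections import defaultdict


def find_signatures(words, suffixes):
    """Group stems by the exact set of suffixes they occur with."""
    vocabulary = set(words)
    stem_suffixes = defaultdict(set)
    for word in vocabulary:
        stem_suffixes[word].add("")            # the bare word: null suffix
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                stem_suffixes[word[:-len(suffix)]].add(suffix)

    signatures = defaultdict(set)
    for stem, found in stem_suffixes.items():
        if len(found) >= 2:                    # stem seen in 2+ distinct words
            signatures[frozenset(found)].add(stem)
    return dict(signatures)


corpus = ["jump", "jumps", "jumping", "laugh", "laughed", "laughing",
          "sing", "sang", "singing", "the", "dog", "dogs"]
for signature, stems in find_signatures(corpus, ["s", "ing", "ed"]).items():
    print(sorted(signature), sorted(stems))
```

With the toy corpus from the naive MDL slide, jump ends up with the signature {null, s, ing}, laugh with {null, ed, ing}, sing with {null, ing}, and dog with {null, s}.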
39
Minimum Description Length
The real thing, this time: Rissanen 1989.
Evaluate a morphology by:
1. How well the morphology extracts generalizations
present in the data, that is, how well it describes
the data.
2. How concise the morphology is.
The naïve MDL we just looked at only covered the
second point, and only crudely.
40
Measure how well the morphology fits the data
  • 1. Compute the predicted inverse log frequency of
    each word in the corpus, and sum

This is a well-understood quantity in information
theory, called the optimal compressed length
of the corpus based on the probability
distribution defined by the morphology.
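
In code, the optimal compressed length is the sum of -log2 p(w) over every word token w in the corpus. The sketch below illustrates that sum. Since the transcript does not spell out the distribution the morphology defines, a plain unigram word model stands in for it here, which is an assumption, and both function names are illustrative.

```python
import math
from collections import Counter


def compressed_length_bits(corpus, probability):
    """Sum of -log2 p(word) over every word token in the corpus."""
    return sum(-math.log2(probability(word)) for word in corpus)


def unigram_model(corpus):
    """Stand-in model: p(word) is its relative frequency in the corpus."""
    counts = Counter(corpus)
    total = len(corpus)
    return lambda word: counts[word] / total


corpus = ["the", "dog", "the", "dogs"]
model = unigram_model(corpus)
print(compressed_length_bits(corpus, model))  # 6.0 bits: 1 + 2 + 1 + 2
```

A morphology that assigns higher probability to the words actually observed yields a smaller sum, which is the sense in which it fits the data better.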
41
Conciseness
  • Sum all the letters, plus all the structure
    inherent in the description, using information
    theory.

42
structure
Number of letters
Signatures, which we'll get to shortly
43
Information contained in the Signature component
list of pointers to signatures
<X> indicates the number of distinct elements in X
44
Results
45
(No Transcript)
46
French
47
Spanish
48
Latin
49
Future directions
  • Develop it to work with languages with greater
    complexity, and
  • Use it as an aid in the task of learning syntax
    in the same unsupervised fashion.