1
Machine Learning for (Psycho-)Linguistics

Walter Daelemans
daelem@uia.ua.ac.be
http://cnts.uia.ac.be
CNTS, University of Antwerp
ILK, Tilburg University
QITL-02

2
Outline

Machine Learning of Language
Induction of rules and classes
Learning by Analogy
Case Studies
Discovery of phonological categories and
morphological rules
A single-route model of morphological processing
Issues
Probabilities versus symbolic structure induction
Nativism versus empiricism
Exemplar analogy versus rules

3
[Diagram: Experience and BIAS feed the Learning Component, which searches among candidates Ri, Rj, Rk, Rl; the Performance Component maps Input to Output.]
4
Problems with Probabilities

Explanation
Also applies to neural networks
Event relevance
Especially in unsupervised learning (clustering)
Incorporation of linguistic knowledge
Smoothing zero-frequency events

5
(symbolic) machine learning

Rule induction (understandable induced theories)
Inductive Logic Programming (incorporating
linguistic knowledge)
Memory-based learning (similarity-based smoothing
of sparse data, feature weighting)

6
Common Fallacies

Rules = nativism
(and connections = empiricism)
Generalization = abstraction
(and memory = table-lookup)

7
Rule-Based ≠ Innate

Rules can be induced from primary linguistic data
as well
Applications in Linguistics
Evaluation and comparison of linguistic
hypotheses
Discovery of linguistic generalizations and
categories

8
Allomorphy in Dutch Diminutive

"one of the more spectacular phenomena of modern Dutch morphophonemics" (Trommelen, 1983)
Base form of noun + -tje (5 variants)
Linguistic theory (from Te Winkel 1862): rime of last syllable, stress, morphological structure, ...
Trommelen 1983: local phenomenon; stress and morphological structure do not play a role
CELEX data (3900 nouns)
- b i - z @ m A nt ? je

9
Allomorphs
10
Decision Tree Learning

Given a data set, construct a decision tree that
reflects the structure of the domain
A decision tree is a tree where
non-leaf nodes represent features (tests)
branches leading out of a test represent possible
values for the feature
leaf nodes represent outcomes (classes)
Decision Tree can be translated into a set of
IF-THEN rules (with further optimization)
Value grouping

11
Decision Tree Construction

Given a set of examples T
If T contains one or more cases all belonging to
the same class C, then the decision tree for T is
a leaf node with category C.
If T contains different classes then
Choose a feature, and partition T into subsets
that have the same value for the feature chosen.
The decision tree consists of a node containing
the feature name, and a branch for each value
leading to a subset.
Apply the procedure recursively to subsets
created this way.
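The recursive procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the system used for the experiments: the slide only says "choose a feature", so selecting the feature by information gain is an assumption here, and the toy data and feature names (`nucleus`, `coda`) are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(examples, feature):
    """Entropy reduction obtained by partitioning the examples on one feature."""
    labels = [label for _, label in examples]
    subsets = {}
    for case, label in examples:
        subsets.setdefault(case[feature], []).append(label)
    remainder = sum(len(s) / len(examples) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def build_tree(examples, features):
    """Return a class label (leaf) or a (feature, {value: subtree}) node."""
    labels = [label for _, label in examples]
    # Leaf: all cases share one class, or no feature is left to test.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Assumption: pick the feature with the highest information gain.
    best = max(features, key=lambda f: information_gain(examples, f))
    partitions = {}
    for case, label in examples:
        partitions.setdefault(case[best], []).append((case, label))
    remaining = [f for f in features if f != best]
    return (best, {v: build_tree(subset, remaining) for v, subset in partitions.items()})


def classify(tree, case, default=None):
    """Follow the branches matching the case until a leaf is reached."""
    while isinstance(tree, tuple):
        feature, branches = tree
        if case[feature] not in branches:
            return default
        tree = branches[case[feature]]
    return tree


# Hypothetical toy data, far coarser than the CELEX representation.
TOY = [
    ({"nucleus": "short", "coda": "liquid"}, "-etje"),
    ({"nucleus": "short", "coda": "nasal"}, "-etje"),
    ({"nucleus": "long", "coda": "liquid"}, "-tje"),
    ({"nucleus": "long", "coda": "nasal"}, "-tje"),
    ({"nucleus": "short", "coda": "obstruent"}, "-je"),
    ({"nucleus": "long", "coda": "obstruent"}, "-je"),
]
tree = build_tree(TOY, ["nucleus", "coda"])
```

On this toy set the coda is tested first, and the nucleus is only consulted for sonorant codas. Each root-to-leaf path reads as one IF-THEN rule.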

12
Induced rule set

Default class is -tje
IF coda last is /lm/ or /rm/ THEN -pje
IF nucleus last is bimoraic AND coda last is
/m/ THEN -pje
IF coda last is /N/ THEN
IF nucleus penultimate is empty or schwa THEN
-etje ELSE -kje
IF nucleus last is short and coda last is
nas or liq THEN -etje
IF coda last is obstruent THEN -je
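Read as an ordered decision list, the induced rule set can be written down directly. The sketch below assumes that the rules apply in the order listed and that the first match wins; the feature names and the hand-coded feature values for the example nouns are hypothetical and much coarser than the CELEX encoding.

```python
def diminutive_suffix(noun):
    """Apply the induced rules in the listed order (assumption: first match wins)."""
    coda = noun["coda_last"]          # e.g. "rm", "m", "N", "l", "s", or "" if empty
    coda_class = noun["coda_class"]   # "nasal", "liquid", "obstruent" or "none"
    nucleus = noun["nucleus_last"]    # "short" or "bimoraic"
    penult = noun["nucleus_penult"]   # "empty", "schwa" or "full"

    if coda in {"lm", "rm"}:
        return "-pje"
    if nucleus == "bimoraic" and coda == "m":
        return "-pje"
    if coda == "N":
        return "-etje" if penult in {"empty", "schwa"} else "-kje"
    if nucleus == "short" and coda_class in {"nasal", "liquid"}:
        return "-etje"
    if coda_class == "obstruent":
        return "-je"
    return "-tje"  # default class


# Hand-coded illustrations (assumed feature values, not taken from CELEX).
EXAMPLES = {
    "arm": {"coda_last": "rm", "coda_class": "nasal",
            "nucleus_last": "short", "nucleus_penult": "empty"},
    "boom": {"coda_last": "m", "coda_class": "nasal",
             "nucleus_last": "bimoraic", "nucleus_penult": "empty"},
    "ring": {"coda_last": "N", "coda_class": "nasal",
             "nucleus_last": "short", "nucleus_penult": "empty"},
    "koning": {"coda_last": "N", "coda_class": "nasal",
               "nucleus_last": "short", "nucleus_penult": "full"},
    "bal": {"coda_last": "l", "coda_class": "liquid",
            "nucleus_last": "short", "nucleus_penult": "empty"},
    "huis": {"coda_last": "s", "coda_class": "obstruent",
             "nucleus_last": "bimoraic", "nucleus_penult": "empty"},
    "auto": {"coda_last": "", "coda_class": "none",
             "nucleus_last": "bimoraic", "nucleus_penult": "full"},
}

predictions = {word: diminutive_suffix(f) for word, f in EXAMPLES.items()}
```

The order matters: /N/ is itself a nasal, so the /N/ rule has to fire before the general rule for short nuclei followed by nasals or liquids.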

13
Results

Problem is almost perfectly learnable (98.4%)
More than last syllable is needed for a full
solution
Only rime of last syllable (not stress or onset)
is relevant
Induced Categories
Nasals, liquids, obstruents, short vowels,
bimoraic vowels (consisting of vowels, diphthongs,
schwa)
Task-dependent categories? Category formation is
dependent on the task to be learned, not
absolute, not language-independent

14
Conclusions Rule Induction in Linguistics

Falsify existing linguistic theories
Evaluate role of linguistic information sources
(Re)discover interesting linguistic rules
(supervised learning)
(Re)discover interesting linguistic categories
(unsupervised learning)
Empiricist alternative for (mostly nativist)
rule-based systems

15
There is one small problem

Current methodology for comparative machine
learning experiments is not reliable (especially
with small data)
Different runs of the algorithm provide different
resulting rule sets
Algorithm can be tweaked to get high performance
with any information source combination
Algorithm is highly sensitive to training data,
feature selection, algorithm parameter settings, ...
Only to be used as a heuristic
As with your own rule induction module

16
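Word Sense Disambiguation (do)

Similar: experience, material, say, then, ...
[Table: accuracy for the feature sets (keywords, local context) under default and optimized parameter settings. Values in extraction order: 47.9, 49.0, 59.5, 60.8, 61.0, 60.8; the assignment of values to rows and columns is not recoverable from the transcript.]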
17
Generalisation ≠ Abstraction

+ abstraction, + generalisation: Rule Induction, Connectionism, Inductive Logic Programming, Statistics
+ abstraction, - generalisation: Handcrafting (Fill in your most hated linguist here)
- abstraction, + generalisation: Memory-Based Learning
- abstraction, - generalisation: Table Lookup
18
MBL: Use memory traces of experiences as a basis
for analogical reasoning, rather than using rules
or other abstractions extracted from experience
and replacing the experiences.
"This rule of nearest neighbor has considerable
elementary intuitive appeal and probably
corresponds to practice in many situations. For
example, it is possible that much medical
diagnosis is influenced by the doctor's
recollection of the subsequent history of an
earlier patient whose symptoms resemble in some
way those of the current patient." (Fix and
Hodges, 1952, p. 43)
19
[Figure: Rule Induction. -etje and -kje instances plotted by coda of last syllable and nucleus of last syllable.]
20
[Figure: MBL. The same plot with an unclassified instance marked "?".]
21
Memory-Based Learning

Basis: k nearest neighbor algorithm
store all examples in memory
to classify a new instance X, look up the k
examples in memory with the smallest distance
D(X,Y) to X
let each nearest neighbor vote with its class
classify instance X with the class that has the
most votes in the nearest neighbor set
Choices
similarity metric
number of nearest neighbors (k)
voting weights
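The basic algorithm fits in a few lines. The sketch below is a simplified illustration, assuming the overlap metric (number of mismatching feature values) as D(X,Y) and equal votes; the memory contents and feature values are hypothetical toy data.

```python
from collections import Counter


def overlap_distance(x, y):
    """D(X,Y): number of feature positions where the two instances differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def knn_classify(memory, instance, k=1, distance=overlap_distance):
    """Let the k stored examples closest to the instance vote with their class."""
    ranked = sorted(memory, key=lambda example: distance(instance, example[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy memory: (nucleus of last syllable, coda of last syllable) -> diminutive suffix.
MEMORY = [
    (("short", "N"), "-etje"),
    (("short", "l"), "-etje"),
    (("bimoraic", "m"), "-pje"),
    (("bimoraic", "l"), "-tje"),
    (("short", "s"), "-je"),
]

unseen = knn_classify(MEMORY, ("short", "n"), k=3)
seen = knn_classify(MEMORY, ("bimoraic", "m"), k=1)
```

Learning is nothing more than appending to `MEMORY`; all generalization happens at classification time. Ties between classes are broken arbitrarily in this sketch.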

22
Metrics
ib1
ib1-ig
ib1-mvdm
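The formulas on this slide did not survive in the transcript. The definitions below are the standard ones from the memory-based learning literature for the three named variants (overlap, information-gain weighting, modified value difference metric); that the slide showed them in this form is an assumption.

```latex
% Distance between instances X and Y with n features
\Delta(X, Y) = \sum_{i=1}^{n} w_i \, \delta(x_i, y_i)

% ib1: overlap metric, all features weighted equally
\delta(x_i, y_i) =
  \begin{cases} 0 & \text{if } x_i = y_i \\ 1 & \text{otherwise} \end{cases}
\qquad w_i = 1

% ib1-ig: overlap metric with information-gain feature weights
w_i = H(C) - \sum_{v \in V_i} P(v) \, H(C \mid v),
\qquad H(C) = -\sum_{c \in C} P(c) \log_2 P(c)

% ib1-mvdm: value distance derived from class distributions
\delta(v_1, v_2) = \sum_{c \in C} \bigl| P(c \mid v_1) - P(c \mid v_2) \bigr|
```

Here C is the set of classes and V_i the set of values of feature i.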
23
Metrics (2)

Voting options
Equal weight for each nearest neighbor
Distance weighted voting
Inverse distance 1/D(X,Y) (Wettschereck, 1994)
RBF-style gaussian voting function (Shepard,
1987)
Linear voting function (Dudani, 1976)

(NB: the weighted NN distribution can be used as a
conditional probability)
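The voting options can be compared on a single neighbor set. This is a sketch under assumptions: the gaussian form exp(-alpha * d^2), the `alpha` parameter and the small `eps` guarding against division by zero are illustrative choices, and the neighbor distances are made up. The linear function follows Dudani's scheme, in which the nearest neighbor gets weight 1 and the farthest weight 0.

```python
from collections import defaultdict
from math import exp


def inverse_distance_weight(d, eps=1e-9):
    """1/D(X,Y); eps is an assumed guard against exact matches (d = 0)."""
    return 1.0 / (d + eps)


def gaussian_weight(d, alpha=1.0):
    """RBF-style decay; the exact form and alpha are assumptions."""
    return exp(-alpha * d * d)


def linear_weight(d, d_nearest, d_farthest):
    """Dudani-style linear weight: 1 for the nearest neighbor, 0 for the farthest."""
    if d_farthest == d_nearest:
        return 1.0
    return (d_farthest - d) / (d_farthest - d_nearest)


def weighted_vote(neighbors, weight_fn):
    """neighbors: list of (distance, class). Returns a normalized class distribution."""
    totals = defaultdict(float)
    for d, label in neighbors:
        totals[label] += weight_fn(d)
    norm = sum(totals.values())
    return {label: total / norm for label, total in totals.items()}


# Hypothetical nearest-neighbor set: one close -en neighbor, two distant -s neighbors.
NEIGHBORS = [(1.0, "-en"), (2.0, "-s"), (3.0, "-s")]

equal = weighted_vote(NEIGHBORS, lambda d: 1.0)
inverse = weighted_vote(NEIGHBORS, inverse_distance_weight)
gaussian = weighted_vote(NEIGHBORS, gaussian_weight)
linear = weighted_vote(NEIGHBORS, lambda d: linear_weight(d, 1.0, 3.0))
```

With equal weights the two -s neighbors win; with any of the distance-weighted functions the single close -en neighbor wins. Because the votes are normalized, each result can be read as an estimate of P(class | instance).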
24
MBL Acquisition

Inflectional process is represented by a set of
exemplars in memory
Exemplars act as models
Learning is incremental storage of exemplars
Compression and Metrics
Exemplar consists of set of (mostly symbolic)
features

25
MBL Processing

New instances of a performance process are solved
through
Memory-lookup
Analogical (Similarity-Based) Reasoning
Similarity metric
Language (faculty)-independent
Adaptive (feature and exemplar weighting)

26
The properties of language processing tasks ...

Language processing tasks are mappings between
linguistic representation levels that are
context-sensitive (but mostly local!)
complex (sub/ir/regularity), pockets of
exceptions
Similar representations at one linguistic level
correspond to similar representations at the
other level
Several information sources interact in (often)
unpredictable ways at the same level
Data is sparse

27
... fit the bias of MBL

Inference is based on Similarity-Based /
Analogical Reasoning
Adaptive data fusion / relevance assignment is
available through feature weighting
It is a non-parametric approach
Similarity-based smoothing is implicit
Regularities and subregularities / exceptions can
be modeled uniformly
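Relevance assignment through feature weighting can be illustrated with information gain, one common weighting scheme in memory-based learning. The sketch below is a toy example with assumed data: two hypothetical features, of which only the first predicts the class.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain_weights(examples):
    """One weight per feature: how much knowing its value reduces class entropy."""
    labels = [label for _, label in examples]
    base = entropy(labels)
    weights = []
    for i in range(len(examples[0][0])):
        by_value = defaultdict(list)
        for features, label in examples:
            by_value[features[i]].append(label)
        remainder = sum(len(s) / len(examples) * entropy(s) for s in by_value.values())
        weights.append(base - remainder)
    return weights


def weighted_overlap(x, y, weights):
    """Overlap distance in which each mismatch costs the weight of its feature."""
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


# Hypothetical toy lexicon: (stress of last syllable, article) -> plural suffix.
TOY = [
    (("stressed", "de"), "-en"),
    (("stressed", "het"), "-en"),
    (("unstressed", "de"), "-s"),
    (("unstressed", "het"), "-s"),
]

weights = information_gain_weights(TOY)
d_relevant = weighted_overlap(("stressed", "de"), ("unstressed", "de"), weights)
d_irrelevant = weighted_overlap(("stressed", "de"), ("stressed", "het"), weights)
```

The uninformative feature receives weight 0, so a mismatch on it no longer moves an exemplar away from the query. Information sources are fused by the metric itself, without hand-tuning.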

28
German and Dutch plurals
29
Data Representation

Symbolic features
segmental information (syllable structure)
stress
gender
German Plural (~25,000 from CELEX)
Vorlesung (lecture): l e - z U N F en
Classes: e, (e)n, s, er, -, U-, Uer, Ue
Dutch Plural (~62,000 from CELEX)
ontruiming (evacuation): 0 - O nt 1 r L - 0 m I N en
Classes: (e)n, s (-eren, -i, -a, ...)

30
Cognitive Architectures of Inflectional Morphology
Dual Route

Dual Route (Pinker, Clahsen, Marcus, ...)
Rules for regular cases
(over)generalization
default behaviour
Associative memory for exceptions
irregularization / family effects
Single Route (RM, MacWhinney, Plunkett, Elman, ...)
Frequency-based regularity

[Diagram: dual-route architecture. Input Features go to a Pattern Associator (Memory); on memory Failure the Rule applies; the output is the Suffix-class.]
31
German Plural

Notoriously complex but routinely acquired (at
age 5)
Evidence for Dual Route?
-s suffix is default/regular (novel words,
surnames, acronyms, ...)
-s suffix is infrequent (least frequent of the
five most important suffixes)

32
(No Transcript)
33
The default status of -s

Similar item missing: Fnöhk-s
Surname, product name: Mann-s
Borrowings: Kiosk-s
Acronyms: BMW-s
Lexicalized phrases: Vergissmeinnicht-s
Onomatopoeia, truncated roots, derived nouns, ...

34
(No Transcript)
35
Discussion

Three classes of plurals: ((-en -)(-e -er))(-s)
the former 4 suffixes seem regular, can be
accurately learned using information from
phonology and gender
-s is learned reasonably well but information is
lacking
Hypothesis: more features are needed
(syntactic, semantic, meta-linguistic, ...) to
enrich the lexical similarity space
No difference in accuracy and speed of learning
with and without Umlaut
Overall generalization accuracy very high: 95%
Schema-based learning (Köpcke).

36
(No Transcript)
37
(No Transcript)
38
Acquisition Data: Summary of previous studies

Existing nouns
(Park 78; Veit 86; Mills 86; Schaner-Wolles 88;
Clahsen et al. 93; Sedlak et al. 98)
Children mainly overapply -e or -(e)n
-s plurals are learned late
Novel words
(Mugdan 77; MacWhinney 78; Phillips & Bouma 80;
Schöler & Kany 89)
Children inflect novel words with -e or -(e)n
More irregular plural forms produced than
defaults

39
MBL simulation

model overapplies mainly -en and -e
-s is learned late and imperfectly
Mainly but not completely parallel to input
frequency (more -s overgeneralization than -er
generalization)

40
Bartke, Marcus, Clahsen (1995)

37 children age 3.6 to 6.6
pictures of imaginary things, presented as
neologisms
names or roots
rhymes of existing words or not
choice -en or -s
results
children are aware that unusual sounding words
require the default
children are aware that names require the default

41
MBL simulation

sort CELEX data according to rhyme
compare overgeneralization
to -en versus to -s
percentage of total number of errors
results
when new words don't rhyme, more errors are made
overgeneralization to -en drops below the level
of overgeneralization to -s

42
Dutch Plural

Suffixes -en and -s are both defaults, and are in
complementary distribution
Selection of -en or -s governed by
phonological structure of the base noun (stressed
vs. unstressed last syllable)
morphological structure (suffix of the base noun)
loan word status
semantic feature person vs. thing
both are possible after /ə/
(Baayen et al. 2001)

43
Feature Relevance
44
Accuracy on CELEX

Methodology: leave-one-out
Results
MBL: 94.9% accuracy
Class    Prec   Rec    F
-(e)n    95.8   97.2   96.4
-s       93.8   91.4   92.6
-i       82.0   77.2   79.5
without stress: 94.9% accuracy
last syllable with stress: 92.6% accuracy
last syllable without stress: 92.4% accuracy
rhyme last syllable: 89.6% accuracy
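Leave-one-out evaluation and the per-class scores can be sketched as follows. The classifier here is a plain 1-nearest-neighbor with overlap distance and the toy lexicon is hypothetical, so the numbers it produces are unrelated to the CELEX results. The F column is assumed to be the balanced F-score (harmonic mean of precision and recall), which reproduces the -s and -i rows from their rounded precision and recall and agrees with the -(e)n row to within rounding.

```python
def overlap_distance(x, y):
    """Number of feature positions where the two instances differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def nearest_label(memory, instance):
    """Class of the single nearest stored example (first one found on ties)."""
    return min(memory, key=lambda example: overlap_distance(instance, example[0]))[1]


def leave_one_out(examples):
    """Classify each example using all the others as memory; return (gold, predicted) pairs."""
    pairs = []
    for i, (features, gold) in enumerate(examples):
        memory = examples[:i] + examples[i + 1:]
        pairs.append((gold, nearest_label(memory, features)))
    return pairs


def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(pairs, target):
    """Precision, recall and F-score for one class."""
    true_pos = sum(1 for gold, pred in pairs if gold == target and pred == target)
    predicted = sum(1 for _, pred in pairs if pred == target)
    actual = sum(1 for gold, _ in pairs if gold == target)
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall, f_score(precision, recall)


# Hypothetical toy lexicon: (stress of last syllable, final consonant) -> plural suffix.
TOY = [
    (("stressed", "k"), "-en"),
    (("stressed", "k"), "-en"),
    (("stressed", "t"), "-en"),
    (("unstressed", "l"), "-s"),
    (("unstressed", "l"), "-s"),
    (("unstressed", "r"), "-s"),
    (("stressed", "r"), "-s"),  # exception: misclassified when held out
]

pairs = leave_one_out(TOY)
accuracy = sum(1 for gold, pred in pairs if gold == pred) / len(pairs)
scores_en = precision_recall_f(pairs, "-en")
```

With a memory-based learner, leave-one-out costs nothing extra in training: holding out an item just means skipping it during lookup.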

45
Accuracy on pseudo-words

Methodology
Train: CELEX (all) and CELEX (1000 most frequent types)
Test: 8 × 10 pseudo-words (Baayen et al., 2001)
dreip - workel - bastus - bestroeting - kloertje - stape - stree - kadisme
Results: accuracy = number of decisions equal to subject majority for each item
Subjects: 87.5%
MBL (all): 83.8%
MBL (top 1000): 90.0%

46
muidus, muidi (nn: modus, modi)
Low-frequency and loan-word nearest neighbours
CELEX bias
47
Conclusions Memory-Based Single Route

MBLP picks up the main schemata of Dutch and
German plural formation and their exceptions
without recourse to explicit rules or a dual
route architecture
MBLP trained on (part of) CELEX matches subject
behavior on pseudo words and acquisition data
Segmental information suffices to reliably
predict plural in Dutch and most plurals in
German, additional information needed for German
-s
Heterogeneity and density in lexical exemplar
space as source of behavior predictions

48
Overall Conclusions

Advantages of symbolic machine learning methods
over pure statistics
As a methodology for inducing interpretable
linguistic generalizations and categories
As a way of introducing an operationalisation of
analogy-based methods into (psycho)linguistics
