Title: Unsupervised Natural Language Processing using Graph Models: The Structure Discovery Paradigm
1 Unsupervised Natural Language Processing using Graph Models: The Structure Discovery Paradigm
- Chris Biemann, University of Leipzig, Germany
- Doctoral Consortium at HLT-NAACL 2007, Rochester, NY, USA, April 22, 2007
2 Outline
- Review of traditional approaches
  - Knowledge-intensive vs. knowledge-free
  - Degrees of supervision
  - Computational Linguistics vs. statistical NLP
- A new approach
  - The Structure Discovery Paradigm
- Graph models for language processing
- Graph-based SD procedures
- Results in task-based evaluation
3 Knowledge-Intensive vs. Knowledge-Free
- In traditional automated language processing, knowledge is involved wherever humans manually tell machines
  - how to process language (explicit knowledge)
  - how a task should be solved (implicit knowledge)
- Knowledge can be provided by means of
  - dictionaries, e.g. thesauri, WordNet, ontologies
  - (grammar) rules
  - annotation
4 Degrees of Supervision
- Supervision means providing positive and negative training examples to Machine Learning algorithms, which use them as a basis for building a model that reproduces the classification on unseen data
- Degrees
  - Fully supervised (classification): learning is carried out only on a fully labeled training set
  - Semi-supervised: unlabeled examples are also used for building a data model
  - Weakly supervised (bootstrapping): a small set of labeled examples is grown, and classifications are used for re-training
  - Unsupervised (clustering): no labeled examples are provided
5 Computational Linguistics and Statistical NLP
- CL
  - Implementing linguistic theories with computers
  - Rule-based approaches
  - Rules found by introspection, not data-driven
  - Explicit knowledge
  - Goal: understanding language itself
- Statistical NLP
  - Building systems that perform language processing tasks
  - Machine Learning approaches
  - Models are built by training on annotated datasets
  - Implicit knowledge
  - Goal: build robust systems with high performance
- There is a continuum rather than a sharp dividing line
6 Structure Discovery Paradigm
- SD
  - Analyze raw data and identify regularities
  - Statistical methods, clustering
  - Knowledge-free, unsupervised
  - Structures: as many as can be discovered
  - Language-independent, domain-independent, encoding-independent
  - Goal: discover structure in language data and mark it in the data
7 Example: Discovered Structures
Increased interest rates lead to investments in banks .

  <sentence lang=12 subj=34.11>
    <chunk id=c25>
      <word POS=p3 m=0.0 ss=14>Increas-ed</word>
      <MWU POS=p1 ss=33>
        <word POS=p1 m=5.1 ss=44>interest</word>
        <word POS=p1 m=2.12 ss=106>rate-s</word>
      </MWU>
    </chunk>
    <chunk id=c13>
      <MWU POS=p2>
        <word POS=p2 m=17.3 s=74>lead</word>
        <word POS=p117 m=11.98>to</word>
      </MWU>
    </chunk>
    <chunk id=c31>
      <word POS=p1 m=1.3 s=33>investment-s</word>
      <word POS=p118 m=11.36>in</word>
      <word POS=p1 m=1.12 s=33>bank-s</word>
    </chunk>
    <word POS=298>.</word>
  </sentence>

- Annotation on various levels
- Similar labels denote similar properties as found by the SD algorithms
- Similar structures in the corpus are annotated in a similar way
8 Consequences of Working in SD
- The only input allowed is raw text data
- Machines are told how to algorithmically discover structure
- Self-annotation: the process marks regularities in the data
- The Structure Discovery process is iterated
- [Figure: a cycle in which SD algorithms find regularities in the text data by analysis and annotate the data with these regularities]
9 Pros and Cons of Structure Discovery
- Advantages
  - Cheap: only raw data is needed
  - Alleviates the acquisition bottleneck
  - Language- and domain-independent
  - No data-resource mismatch (all resources leak)
- Disadvantages
  - No control over self-annotation labels
  - Congruence with linguistic concepts is not guaranteed
  - Much computing time is needed
10 Building Blocks in SD
- Hierarchical levels of basic units in text data
  - Letters
  - Words
  - Sentences
  - Documents
- These are assumed to be recognizable in the remainder.
- SD allows for
  - arbitrary numbers of intermediate levels
  - grouping of basic units into complex units,
  - but these have to be found by SD procedures.
11 Similarity and Homogeneity
- To determine which units share structure, a similarity measure for units is needed. Two kinds of features are possible:
  - Internal features compare units based on the lower-level units they contain
  - Context features compare units based on other units of the same or another level that surround them
- Clustering based on unit similarity yields sets of units that are homogeneous w.r.t. structure
- This is an abstraction process: units are subsumed under the same label.
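A minimal sketch (not from the talk) of context-feature similarity: each word is described by the counts of words surrounding it, and two words are compared via cosine similarity of these sparse vectors. The toy corpus, window size, and function names are illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def context_vector(word, sentences, window=1):
    """Count the words appearing within `window` positions of `word`."""
    vec = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            if w == word:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vec[sent[j]] += 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat chased a dog".split(),
]
sim = cosine(context_vector("cat", sentences), context_vector("dog", sentences))
```

Words used in similar contexts ("cat" and "dog" here) get a high similarity; such pairwise similarities are exactly the edge weights of the graphs in the following slides.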
12 What Is It Good For? How Do I Know?
- Many structural regularities can be thought of; some are interesting, some are not.
- Structures discovered by SD algorithms will not necessarily match the concepts of linguists
- Working in the SD paradigm means over-generating structure acquisition methods and checking whether they are helpful
- Methods for telling helpful from useless SD procedures
  - "Look at my nice clusters" approach: examine the data by hand. While good in the initial phase of testing, this is inconclusive (choice of clusters, coverage)
  - Task-based evaluation: use the obtained labels as features in a Machine Learning scenario and measure the contribution of each label type. Involves supervision, is indirect
13 Graph Models for SD Procedures
- Motivation for the graph representation
  - Graphs are an intuitive and natural way to encode language units as nodes and their similarities as edges, but other representations are also possible
  - Graph clustering with Chinese Whispers can efficiently perform abstraction by grouping units into homogeneous sets
- Some graphs on basic units
  - Word co-occurrence (neighbor-based/sentence-based), significance, higher orders
  - Word context similarity based on local context vectors
  - Sentence/document similarity based on common words
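A toy sketch of building a sentence-based word co-occurrence graph. For simplicity, edge weights here are raw counts of sentences in which both words occur; the procedures in the talk weight edges by a statistical significance measure instead. Corpus and threshold are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(sentences, min_count=1):
    """Undirected word graph: edge weight = number of sentences
    containing both words."""
    edges = Counter()
    for sent in sentences:
        # one pair per sentence, regardless of token frequency
        for u, v in combinations(sorted(set(sent)), 2):
            edges[(u, v)] += 1
    return {e: w for e, w in edges.items() if w >= min_count}

sentences = [["interest", "rates", "rise"],
             ["interest", "rates", "fall"],
             ["banks", "fall"]]
g = cooccurrence_graph(sentences, min_count=2)
```

Only the pair ("interest", "rates") survives the threshold; in a real corpus, the surviving edges form the graph that the SD procedures on the next slide cluster.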
14 Some Graph-Based SD Procedures
- Language separation
  - Cluster the sentence-based significant word co-occurrence graph
  - Use the word lists for language identification
- Induced POS
  - Cluster the local stop-word context vector similarity graph
  - Cluster the second-order neighbor word co-occurrence graph
  - Train and apply a trigram tagger
- Word sense disambiguation
  - Cluster the neighborhood of the target word in the sentence-based significant co-occurrence graph into sense clusters
  - Compare sense clusters with the local context for disambiguation
- Semantic classes
  - Cluster the similarity graph of words and induced-POS contexts
  - Use the contexts for assigning semantic classes
15 Look at My Nice Languages! Cleaning CUCWeb
- Latin
  - In expeditionibus tessellata et sectilia pauimenta circumferebat.
  - Britanniam petiuit spe margaritarum earum amplitudinem conferebat et interdum sua manu exigebat ..
- Scripting
  - @echo @cd $(TLSFDIR) $(CC) $(RTLFLAGS) $(RTL_LWIPFLAGS) -c $(TLSFSRC)
  - @echo @cd $(TOOLSDIR) $(CC) $(RTLFLAGS) $(RTL_LWIPFLAGS) -c $(TOOLSSRC) ..
- Hungarian
  - A külügyminiszter a diplomáciai és konzuli képviseletek címjegyzékét és konzuli
  - Köztestületek, jogi személyiséggel és helyi jogalkotási jogkörrel.
- Esperanto
  - Por vidi ghin kun internacia kodigho kaj kun kelkaj bildoj kliku tie chi ) La Hispana..
  - Ne nur pro tio, ke ghi perdigis la vivon de kelk-centmil hispanoj, sed ankau pro ghia efiko..
- Human genome
  - 1 atgacgatga gtacaaacaa ctgcgagagc atgacctcgt acttcaccaa ctcgtacatg 61 ggggcggaca tgcatcatgg gcactacccg ggcaacgggg tcaccgacct ggacgcccag 121 cagatgcacc
16 Task-Based unsuPOS Evaluation
- unsuPOS tags are used as features; performance is compared against no POS and supervised POS. The tagger was induced in one CPU-day from the BNC
- Kernel-based WSD: better than noPOS, equal to suPOS
- POS tagging: better than noPOS
- Named entity recognition: no significant differences
- Chunking: better than noPOS, worse than suPOS
17 Summary
- The Structure Discovery Paradigm, contrasted with traditional approaches
  - no manual annotation, no resources (cheaper)
  - language- and domain-independent
  - iteratively enriching structural information by finding and annotating regularities
- Graph-based SD procedures
- Evaluation framework and results
18 Questions?
- THANKS FOR YOUR ATTENTION!
19 Structure Discovery Machine I
- From linguistics, we have the following intuitions, each of which can lead to an SD algorithm that captures the underlying structure
  - There are different languages
  - Words belong to word classes
  - Short sequences of words form multi-word units
  - Words can be semantically decomposable (compounds)
  - Words are subject to inflection
  - There is morphological congruence between words
  - There are grammatical dependencies between words and sequences of words
  - Words can have different semantic properties
  - There is semantic congruence between words
  - A word can have several meanings
20 Structure Discovery Machine II
- The following methods are SD algorithms
  - Language identification: as introduced
  - POS induction: as introduced
  - MWU detection: by collocation extraction
  - Unsupervised compound decomposition and paraphrasing (work in progress)
  - Unsupervised morphology (MorphoChallenge): letter successor varieties
  - Unsupervised parsing: grammar induction based on POS and neighbor-based co-occurrences
  - Semantic classes: similarity in context patterns of words and POS (work in progress)
  - WSI/WSD: clustering of co-occurrences, disambiguation (work in progress)
21 Chinese Whispers Graph Clustering
- Explanation
  - Nodes have a class and communicate it to their adjacent nodes
  - A node adopts the majority class in its neighborhood
  - Nodes are processed in random order for some iterations
- Properties
  - Time linear in the number of edges: very efficient
  - Randomized, non-deterministic
  - Parameter-free
  - The number of clusters is found by the algorithm
  - Small-world graphs converge fast
- Algorithm:

    initialize:
        forall vi in V: class(vi) = i
    while changes:
        forall v in V, randomized order:
            class(v) = highest ranked class in neighborhood of v
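The Chinese Whispers pseudocode on this slide can be sketched as a short runnable function. This is a minimal illustration, not the reference implementation: the graph format (dict of node to weighted neighbors), the fixed iteration count, and the fixed random seed are assumptions for the example.

```python
import random
from collections import defaultdict

def chinese_whispers(graph, iterations=20, seed=0):
    """graph: {node: {neighbor: edge_weight}}. Returns {node: class_label}."""
    rng = random.Random(seed)
    label = {v: i for i, v in enumerate(graph)}        # class(vi) = i
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)                             # randomized order
        for v in nodes:
            scores = defaultdict(float)
            for u, w in graph[v].items():
                scores[label[u]] += w                  # edge-weighted class votes
            if scores:
                label[v] = max(scores, key=scores.get) # highest ranked class
    return label

# Two triangles joined by one weak edge tend to separate into two clusters.
g = {
    "a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1, "d": 0.1},
    "d": {"c": 0.1, "e": 1, "f": 1}, "e": {"d": 1, "f": 1}, "f": {"d": 1, "e": 1},
}
labels = chinese_whispers(g)
```

Note the non-determinism mentioned above: without a fixed seed, different runs can assign different labels, though dense regions still end up internally uniform.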
22 Language Separation Evaluation
- Cluster the co-occurrence graph of a multilingual corpus
- Use words of the same class as the lexicon of a language identifier
- Almost perfect performance
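A sketch of the word-list step: once clustering has produced one lexicon per language, a sentence can be assigned to the language whose lexicon covers most of its tokens. The lexicons and function name here are illustrative assumptions, not the actual cluster output.

```python
def identify_language(sentence, lexicons):
    """lexicons: {language: set of words}. Returns the language whose
    lexicon covers the most tokens of the sentence."""
    tokens = sentence.lower().split()
    return max(lexicons,
               key=lambda lang: sum(t in lexicons[lang] for t in tokens))

# Hypothetical lexicons standing in for word clusters from the graph.
lexicons = {
    "english": {"the", "of", "and", "interest", "rates"},
    "latin":   {"et", "in", "sua", "manu", "spe"},
}
lang = identify_language("interest rates lead to investments", lexicons)
```

With realistic cluster sizes (thousands of words per language), coverage-based voting of this kind is what makes the near-perfect identification performance plausible.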
23 unsuPOS Steps
- Input:
    ... , sagte der Sprecher bei der Sitzung .
    ... , rief der Vorsitzende in der Sitzung .
    ... , warf in die Tasche aus der Ecke .
- Induced clusters:
    C1: sagte, warf, rief
    C2: Sprecher, Vorsitzende, Tasche
    C3: in
    C4: der, die
- Partial tagging with the clusters:
    ... , sagte|C1 der|C4 Sprecher|C2 bei der|C4 Sitzung .
    ... , rief|C1 der|C4 Vorsitzende|C2 in|C3 der|C4 Sitzung .
    ... , warf|C1 in|C3 die|C4 Tasche|C2 aus der|C4 Ecke .
- Full tagging after training the tagger:
    ... , sagte|C1 der|C4 Sprecher|C2 bei|C3 der|C4 Sitzung|C2 .
    ... , rief|C1 der|C4 Vorsitzende|C2 in|C3 der|C4 Sitzung|C2 .
    ... , warf|C1 in|C3 die|C4 Tasche|C2 aus|C3 der|C4 Ecke|C2 .
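The partial-tagging step on this slide can be sketched in a few lines: every word that belongs to an induced cluster gets its class appended, and unknown words stay untagged (in the actual pipeline, a trigram tagger trained on this output then tags them too). The word-to-cluster mapping below is copied from the slide; the `|` separator is an illustrative convention.

```python
# Induced clusters from the slide's German example.
clusters = {
    "sagte": "C1", "rief": "C1", "warf": "C1",
    "Sprecher": "C2", "Vorsitzende": "C2", "Tasche": "C2",
    "in": "C3",
    "der": "C4", "die": "C4",
}

def tag(sentence, clusters):
    """Append the induced class to every word that belongs to a cluster;
    leave out-of-cluster words untagged."""
    return [w + "|" + clusters[w] if w in clusters else w
            for w in sentence.split()]

tagged = tag("rief der Vorsitzende in der Sitzung", clusters)
```

Here "Sitzung" stays bare because no cluster contains it yet; closing exactly that gap is the job of the trained tagger in the final step.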
24 unsuPOS Ambiguity Example
25 unsuPOS Medline Tagset
- 1 (13721): recombinogenic, chemoprophylaxis, stereoscopic, MMP2, NIPPV, Lp, biosensor, bradykinin, issue, S-100beta, iopromide, expenditures, dwelling, emissions, implementation, detoxification, amperometric, appliance, rotation, diagonal,
- 2 (1687): self-reporting, hematology, age-adjusted, perioperative, gynaecology, antitrust, instructional, beta-thalassemia, interrater, postoperatively, verbal, up-to-date, multicultural, nonsurgical, vowel, narcissistic, offender, interrelated,
- 3 (1383): proven, supplied, engineered, distinguished, constrained, omitted, counted, declared, reanalysed, coexpressed, wait,
- 4 (957): mediates, relieves, longest, favor, address, complicate, substituting, ensures, advise, share, employ, separating, allowing,
- 5 (1207): peritubular, maxillary, lumbar, abductor, gray, rhabdoid, tympanic, malar, adrenal, low-pressure, mediastinal,
- 6 (653): trophoblasts, paws, perfusions, cerebrum, pons, somites, supernatant, Kingdom, extra-embryonic, Britain, endocardium,
- 7 (1282): acyl-CoAs, conformations, isoenzymes, STSs, autacoids, surfaces, crystallins, sweeteners, TREs, biocides, pyrethroids,
- 8 (1613): colds, apnea, aspergilloma, ACS, breathlessness, perforations, hemangiomas, lesions, psychoses, coinfection, terminals, headache, hepatolithiasis, hypercholesterolemia, leiomyosarcomas, hypercoagulability, xerostomia, granulomata, pericarditis,
- 9 (674): dysregulated, nearest, longest, satisfying, unplanned, unrealistic, fair, appreciable, separable, enigmatic, striking, i
- 10 (509): differentiative, ARV, pleiotropic, endothermic, tolerogenic, teratogenic, oxidizing, intraovarian, anaesthetic, laxative,
- 13 (177): ewe, nymphs, dams, fetuses, marmosets, bats, triplets, camels, SHR, husband, siblings, seedlings, ponies, foxes, neighbor, sisters, mosquitoes, hamsters, hypertensives, neonates, proband, anthers, brother, broilers, woman, eggs,
- 14 (103): considers, comprises, secretes, possesses, sees, undergoes, outlines, reviews, span, uncovered, defines, shares, s
- 15 (87): feline, chimpanzee, pigeon, quail, guinea-pig, chicken, grower, mammal, toad, simian, rat, human-derived, piglet, ovum,
- 16 (589): dually, rarely, spectrally, circumferentially, satisfactorily, dramatically, chronically, therapeutically, beneficially, already,
- 18 (124): 1-min, two-week, 4-min, 8-week, 6-hour, 2-day, 3-minute, 20-year, 15-minute, 5-h, 24-h, 8-h, ten-year, overnight, 120-
- 21 (12): July, January, May, February, December, October, April, September, June, August, March, November
- 23 (13): acetic, retinoic, uric, oleic, arachidonic, nucleic, sialic, linoleic, lactic, glutamic, fatty, ascorbic, folic
- 25 (28): route, angle, phase, rim, state, region, arm, site, branch, dimension, configuration, area, Clinic, zone, atom, isoform,
- 247 (6): P<0.001, P<0.01, p<0.001, p<0.01, P<.001, P<0.0001
26 unsuPOS POS-Sorted Neighbors
27 unsuPOS-Sorted Co-occurrences
28 WSI Example I: "hip"
29 WSI Example II: "hip"