Title: Unsupervised Natural Language Processing using Graph Models: The Structure Discovery Paradigm
1 Unsupervised Natural Language Processing using Graph Models: The Structure Discovery Paradigm
- Chris Biemann, University of Leipzig, Germany
- Doctoral Consortium at HLT-NAACL 2007, Rochester, NY, USA, April 22, 2007
2 Outline
- Review of traditional approaches
  - Knowledge-intensive vs. knowledge-free
  - Degrees of supervision
  - Computational Linguistics vs. statistical NLP
- A new approach
  - The Structure Discovery Paradigm
- Graph models for language processing
- Graph-based SD procedures
- Results in task-based evaluation
3 Knowledge-Intensive vs. Knowledge-Free
- In traditional automated language processing, knowledge is involved wherever humans manually tell machines
  - how to process language (explicit knowledge)
  - how a task should be solved (implicit knowledge)
- Knowledge can be provided by means of
  - dictionaries, e.g. thesauri, WordNet, ontologies
  - (grammar) rules
  - annotation
4 Degrees of Supervision
- Supervision means providing positive and negative training examples to Machine Learning algorithms, which use them as a basis for building a model that reproduces the classification on unseen data
- Degrees
  - Fully supervised (classification): learning is carried out only on a fully labeled training set
  - Semi-supervised: unlabeled examples are also used for building a data model
  - Weakly supervised (bootstrapping): a small set of labeled examples is grown, and classifications are used for re-training
  - Unsupervised (clustering): no labeled examples are provided
5 Computational Linguistics and Statistical NLP
- CL
  - Implementing linguistic theories with computers
  - Rule-based approaches
  - Rules found by introspection, not data-driven
  - Explicit knowledge
  - Goal: understanding language itself
- Statistical NLP
  - Building systems that perform language processing tasks
  - Machine Learning approaches
  - Models are built by training on annotated datasets
  - Implicit knowledge
  - Goal: build robust systems with high performance
- There is a continuum rather than a sharp dividing line
6 Structure Discovery Paradigm
- SD
  - Analyze raw data and identify regularities
  - Statistical methods, clustering
  - Knowledge-free, unsupervised
  - Structures: as many as can be discovered
  - Language-independent, domain-independent, encoding-independent
  - Goal: discover structure in language data and mark it in the data
7 Example: Discovered Structures
Increased interest rates lead to investments in banks .

  <sentence lang=12 subj=34.11>
    <chunk id=c25>
      <word POS=p3 m=0.0 ss=14>Increas-ed</word>
      <MWU POS=p1 ss=33>
        <word POS=p1 m=5.1 ss=44>interest</word>
        <word POS=p1 m=2.12 ss=106>rate-s</word>
      </MWU>
    </chunk>
    <chunk id=c13>
      <MWU POS=p2>
        <word POS=p2 m=17.3 s=74>lead</word>
        <word POS=p117 m=11.98>to</word>
      </MWU>
    </chunk>
    <chunk id=c31>
      <word POS=p1 m=1.3 s=33>investment-s</word>
      <word POS=p118 m=11.36>in</word>
      <word POS=p1 m=1.12 s=33>bank-s</word>
    </chunk>
    <word POS=298>.</word>
  </sentence>

- Annotation on various levels
- Similar labels denote similar properties as found by the SD algorithms
- Similar structures in the corpus are annotated in a similar way
8 Consequences of Working in SD
- The only input allowed is raw text data
- Machines are told how to algorithmically discover structure
- Self-annotation: the process marks regularities in the data
- The Structure Discovery process is iterated
- [Figure: a cycle in which SD algorithms find regularities in the text data by analysis and annotate the data with these regularities]
9 Pros and Cons of Structure Discovery
- Advantages
  - Cheap: only raw data is needed
  - Alleviates the acquisition bottleneck
  - Language- and domain-independent
  - No data-resource mismatch (all resources leak)
- Disadvantages
  - No control over self-annotation labels
  - Congruence with linguistic concepts is not guaranteed
  - Much computing time is needed
10 Building Blocks in SD
- Hierarchical levels of basic units in text data
  - Letters
  - Words
  - Sentences
  - Documents
- These are assumed to be recognizable in the remainder.
- SD allows for
  - arbitrary numbers of intermediate levels
  - grouping of basic units into complex units,
  - but these have to be found by SD procedures.
11 Similarity and Homogeneity
- To determine which units share structure, a similarity measure for units is needed. Two kinds of features are possible:
  - Internal features compare units based on the lower-level units they contain
  - Context features compare units based on other units of the same or another level that surround them
- Clustering based on unit similarity yields sets of units that are homogeneous w.r.t. structure
- This is an abstraction process: units are subsumed under the same label.
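A minimal sketch (not from the talk) of context-feature similarity: each word is described by the counts of words surrounding it, and two words are compared via cosine similarity of these sparse vectors. The toy corpus, window size, and function names are illustrative assumptions.

```python
from collections import Counter
from math import sqrt

def context_vector(word, sentences, window=1):
    """Count the words appearing within `window` positions of `word`."""
    vec = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            if w == word:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        vec[sent[j]] += 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat chased a dog".split(),
]
sim = cosine(context_vector("cat", sentences), context_vector("dog", sentences))
```

Words used in similar contexts ("cat" and "dog" here) get a high similarity; such pairwise similarities are exactly the edge weights of the graphs in the following slides.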
12 What Is It Good For? How Do I Know?
- Many structural regularities can be thought of; some are interesting, some are not.
- Structures discovered by SD algorithms will not necessarily match the concepts of linguists
- Working in the SD paradigm means over-generating structure acquisition methods and checking whether they are helpful
- Methods for telling helpful from useless SD procedures
  - "Look at my nice clusters" approach: examine the data by hand. While good in the initial phase of testing, this is inconclusive (choice of clusters, coverage)
  - Task-based evaluation: use the obtained labels as features in a Machine Learning scenario and measure the contribution of each label type. Involves supervision, is indirect
13 Graph Models for SD Procedures
- Motivation for the graph representation
  - Graphs are an intuitive and natural way to encode language units as nodes and their similarities as edges, but other representations are also possible
  - Graph clustering with Chinese Whispers can efficiently perform abstraction by grouping units into homogeneous sets
- Some graphs on basic units
  - Word co-occurrence (neighbor-based/sentence-based), significance, higher orders
  - Word context similarity based on local context vectors
  - Sentence/document similarity based on common words
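A toy sketch of building a sentence-based word co-occurrence graph. For simplicity, edge weights here are raw counts of sentences in which both words occur; the procedures in the talk weight edges by a statistical significance measure instead. Corpus and threshold are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(sentences, min_count=1):
    """Undirected word graph: edge weight = number of sentences
    containing both words."""
    edges = Counter()
    for sent in sentences:
        # one pair per sentence, regardless of token frequency
        for u, v in combinations(sorted(set(sent)), 2):
            edges[(u, v)] += 1
    return {e: w for e, w in edges.items() if w >= min_count}

sentences = [["interest", "rates", "rise"],
             ["interest", "rates", "fall"],
             ["banks", "fall"]]
g = cooccurrence_graph(sentences, min_count=2)
```

Only the pair ("interest", "rates") survives the threshold; in a real corpus, the surviving edges form the graph that the SD procedures on the next slide cluster.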
14 Some Graph-Based SD Procedures
- Language separation
  - Cluster the sentence-based significant word co-occurrence graph
  - Use the word lists for language identification
- Induced POS
  - Cluster the local stop-word context vector similarity graph
  - Cluster the second-order neighbor word co-occurrence graph
  - Train and apply a trigram tagger
- Word sense disambiguation
  - Cluster the neighborhood of the target word in the sentence-based significant co-occurrence graph into sense clusters
  - Compare sense clusters with the local context for disambiguation
- Semantic classes
  - Cluster the similarity graph of words and induced-POS contexts
  - Use the contexts for assigning semantic classes
15 Look at My Nice Languages! Cleaning CUCWeb
- Latin
  - In expeditionibus tessellata et sectilia pauimenta circumferebat.
  - Britanniam petiuit spe margaritarum earum amplitudinem conferebat et interdum sua manu exigebat ..
- Scripting
  - @echo @cd $(TLSFDIR) $(CC) $(RTLFLAGS) $(RTL_LWIPFLAGS) -c $(TLSFSRC)
  - @echo @cd $(TOOLSDIR) $(CC) $(RTLFLAGS) $(RTL_LWIPFLAGS) -c $(TOOLSSRC) ..
- Hungarian
  - A külügyminiszter a diplomáciai és konzuli képviseletek címjegyzékét és konzuli
  - Köztestületek, jogi személyiséggel és helyi jogalkotási jogkörrel.
- Esperanto
  - Por vidi ghin kun internacia kodigho kaj kun kelkaj bildoj kliku tie chi ) La Hispana..
  - Ne nur pro tio, ke ghi perdigis la vivon de kelk-centmil hispanoj, sed ankau pro ghia efiko..
- Human genome
  - 1 atgacgatga gtacaaacaa ctgcgagagc atgacctcgt acttcaccaa ctcgtacatg 61 ggggcggaca tgcatcatgg gcactacccg ggcaacgggg tcaccgacct ggacgcccag 121 cagatgcacc
16 Task-Based unsuPOS Evaluation
- unsuPOS tags are used as features; performance is compared against no POS and supervised POS. The tagger was induced in one CPU-day from the BNC
- Kernel-based WSD: better than noPOS, equal to suPOS
- POS tagging: better than noPOS
- Named entity recognition: no significant differences
- Chunking: better than noPOS, worse than suPOS
17 Summary
- The Structure Discovery Paradigm, contrasted with traditional approaches
  - no manual annotation, no resources (cheaper)
  - language- and domain-independent
  - iteratively enriching structural information by finding and annotating regularities
- Graph-based SD procedures
- Evaluation framework and results
18 Questions?
- THANKS FOR YOUR ATTENTION!
19 Structure Discovery Machine I
- From linguistics, we have the following intuitions, each of which can lead to an SD algorithm that captures the underlying structure
  - There are different languages
  - Words belong to word classes
  - Short sequences of words form multi-word units
  - Words can be semantically decomposable (compounds)
  - Words are subject to inflection
  - There is morphological congruence between words
  - There are grammatical dependencies between words and sequences of words
  - Words can have different semantic properties
  - There is semantic congruence between words
  - A word can have several meanings
20 Structure Discovery Machine II
- The following methods are SD algorithms
  - Language identification: as introduced
  - POS induction: as introduced
  - MWU detection: by collocation extraction
  - Unsupervised compound decomposition and paraphrasing (work in progress)
  - Unsupervised morphology (MorphoChallenge): letter successor varieties
  - Unsupervised parsing: grammar induction based on POS and neighbor-based co-occurrences
  - Semantic classes: similarity in context patterns of words and POS (work in progress)
  - WSI/WSD: clustering of co-occurrences, disambiguation (work in progress)
21 Chinese Whispers Graph Clustering
- Explanation
  - Nodes have a class and communicate it to their adjacent nodes
  - A node adopts the majority class in its neighborhood
  - Nodes are processed in random order for some iterations
- Properties
  - Time linear in the number of edges: very efficient
  - Randomized, non-deterministic
  - Parameter-free
  - The number of clusters is found by the algorithm
  - Small-world graphs converge fast
- Algorithm:

    initialize:
        forall vi in V: class(vi) = i
    while changes:
        forall v in V, randomized order:
            class(v) = highest ranked class in neighborhood of v
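The Chinese Whispers pseudocode on this slide can be sketched as a short runnable function. This is a minimal illustration, not the reference implementation: the graph format (dict of node to weighted neighbors), the fixed iteration count, and the fixed random seed are assumptions for the example.

```python
import random
from collections import defaultdict

def chinese_whispers(graph, iterations=20, seed=0):
    """graph: {node: {neighbor: edge_weight}}. Returns {node: class_label}."""
    rng = random.Random(seed)
    label = {v: i for i, v in enumerate(graph)}        # class(vi) = i
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)                             # randomized order
        for v in nodes:
            scores = defaultdict(float)
            for u, w in graph[v].items():
                scores[label[u]] += w                  # edge-weighted class votes
            if scores:
                label[v] = max(scores, key=scores.get) # highest ranked class
    return label

# Two triangles joined by one weak edge tend to separate into two clusters.
g = {
    "a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1}, "c": {"a": 1, "b": 1, "d": 0.1},
    "d": {"c": 0.1, "e": 1, "f": 1}, "e": {"d": 1, "f": 1}, "f": {"d": 1, "e": 1},
}
labels = chinese_whispers(g)
```

Note the non-determinism mentioned above: without a fixed seed, different runs can assign different labels, though dense regions still end up internally uniform.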
22 Language Separation Evaluation
- Cluster the co-occurrence graph of a multilingual corpus
- Use words of the same class as the lexicon of a language identifier
- Almost perfect performance
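A sketch of the word-list step: once clustering has produced one lexicon per language, a sentence can be assigned to the language whose lexicon covers most of its tokens. The lexicons and function name here are illustrative assumptions, not the actual cluster output.

```python
def identify_language(sentence, lexicons):
    """lexicons: {language: set of words}. Returns the language whose
    lexicon covers the most tokens of the sentence."""
    tokens = sentence.lower().split()
    return max(lexicons,
               key=lambda lang: sum(t in lexicons[lang] for t in tokens))

# Hypothetical lexicons standing in for word clusters from the graph.
lexicons = {
    "english": {"the", "of", "and", "interest", "rates"},
    "latin":   {"et", "in", "sua", "manu", "spe"},
}
lang = identify_language("interest rates lead to investments", lexicons)
```

With realistic cluster sizes (thousands of words per language), coverage-based voting of this kind is what makes the near-perfect identification performance plausible.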
23 unsuPOS Steps
- Input:
    ... , sagte der Sprecher bei der Sitzung .
    ... , rief der Vorsitzende in der Sitzung .
    ... , warf in die Tasche aus der Ecke .
- Induced clusters:
    C1: sagte, warf, rief
    C2: Sprecher, Vorsitzende, Tasche
    C3: in
    C4: der, die
- Partial tagging with the clusters:
    ... , sagte|C1 der|C4 Sprecher|C2 bei der|C4 Sitzung .
    ... , rief|C1 der|C4 Vorsitzende|C2 in|C3 der|C4 Sitzung .
    ... , warf|C1 in|C3 die|C4 Tasche|C2 aus der|C4 Ecke .
- Full tagging after training the tagger:
    ... , sagte|C1 der|C4 Sprecher|C2 bei|C3 der|C4 Sitzung|C2 .
    ... , rief|C1 der|C4 Vorsitzende|C2 in|C3 der|C4 Sitzung|C2 .
    ... , warf|C1 in|C3 die|C4 Tasche|C2 aus|C3 der|C4 Ecke|C2 .
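The partial-tagging step on this slide can be sketched in a few lines: every word that belongs to an induced cluster gets its class appended, and unknown words stay untagged (in the actual pipeline, a trigram tagger trained on this output then tags them too). The word-to-cluster mapping below is copied from the slide; the `|` separator is an illustrative convention.

```python
# Induced clusters from the slide's German example.
clusters = {
    "sagte": "C1", "rief": "C1", "warf": "C1",
    "Sprecher": "C2", "Vorsitzende": "C2", "Tasche": "C2",
    "in": "C3",
    "der": "C4", "die": "C4",
}

def tag(sentence, clusters):
    """Append the induced class to every word that belongs to a cluster;
    leave out-of-cluster words untagged."""
    return [w + "|" + clusters[w] if w in clusters else w
            for w in sentence.split()]

tagged = tag("rief der Vorsitzende in der Sitzung", clusters)
```

Here "Sitzung" stays bare because no cluster contains it yet; closing exactly that gap is the job of the trained tagger in the final step.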
24 unsuPOS Ambiguity Example
25 unsuPOS Medline Tagset
- 1 (13721): recombinogenic, chemoprophylaxis, stereoscopic, MMP2, NIPPV, Lp, biosensor, bradykinin, issue, S-100beta, iopromide, expenditures, dwelling, emissions, implementation, detoxification, amperometric, appliance, rotation, diagonal,
- 2 (1687): self-reporting, hematology, age-adjusted, perioperative, gynaecology, antitrust, instructional, beta-thalassemia, interrater, postoperatively, verbal, up-to-date, multicultural, nonsurgical, vowel, narcissistic, offender, interrelated,
- 3 (1383): proven, supplied, engineered, distinguished, constrained, omitted, counted, declared, reanalysed, coexpressed, wait,
- 4 (957): mediates, relieves, longest, favor, address, complicate, substituting, ensures, advise, share, employ, separating, allowing,
- 5 (1207): peritubular, maxillary, lumbar, abductor, gray, rhabdoid, tympanic, malar, adrenal, low-pressure, mediastinal,
- 6 (653): trophoblasts, paws, perfusions, cerebrum, pons, somites, supernatant, Kingdom, extra-embryonic, Britain, endocardium,
- 7 (1282): acyl-CoAs, conformations, isoenzymes, STSs, autacoids, surfaces, crystallins, sweeteners, TREs, biocides, pyrethroids,
- 8 (1613): colds, apnea, aspergilloma, ACS, breathlessness, perforations, hemangiomas, lesions, psychoses, coinfection, terminals, headache, hepatolithiasis, hypercholesterolemia, leiomyosarcomas, hypercoagulability, xerostomia, granulomata, pericarditis,
- 9 (674): dysregulated, nearest, longest, satisfying, unplanned, unrealistic, fair, appreciable, separable, enigmatic, striking, i
- 10 (509): differentiative, ARV, pleiotropic, endothermic, tolerogenic, teratogenic, oxidizing, intraovarian, anaesthetic, laxative,
- 13 (177): ewe, nymphs, dams, fetuses, marmosets, bats, triplets, camels, SHR, husband, siblings, seedlings, ponies, foxes, neighbor, sisters, mosquitoes, hamsters, hypertensives, neonates, proband, anthers, brother, broilers, woman, eggs,
- 14 (103): considers, comprises, secretes, possesses, sees, undergoes, outlines, reviews, span, uncovered, defines, shares, s
- 15 (87): feline, chimpanzee, pigeon, quail, guinea-pig, chicken, grower, mammal, toad, simian, rat, human-derived, piglet, ovum,
- 16 (589): dually, rarely, spectrally, circumferentially, satisfactorily, dramatically, chronically, therapeutically, beneficially, already,
- 18 (124): 1-min, two-week, 4-min, 8-week, 6-hour, 2-day, 3-minute, 20-year, 15-minute, 5-h, 24-h, 8-h, ten-year, overnight, 120-
- 21 (12): July, January, May, February, December, October, April, September, June, August, March, November
- 23 (13): acetic, retinoic, uric, oleic, arachidonic, nucleic, sialic, linoleic, lactic, glutamic, fatty, ascorbic, folic
- 25 (28): route, angle, phase, rim, state, region, arm, site, branch, dimension, configuration, area, Clinic, zone, atom, isoform,
- 247 (6): P<0.001, P<0.01, p<0.001, p<0.01, P<.001, P<0.0001
26 unsuPOS POS-Sorted Neighbors
27 unsuPOS-Sorted Co-occurrences
28 WSI Example I: "hip"
29 WSI Example II: "hip"