Unsupervised Natural Language Processing using Graph Models: The Structure Discovery Paradigm

1
Unsupervised Natural Language Processing using
Graph Models: The Structure Discovery Paradigm
  • Chris Biemann, University of Leipzig, Germany
  • Doctoral Consortium at HLT-NAACL 2007, Rochester,
    NY, USA
  • April 22, 2007

2
Outline
  • Review of traditional approaches
  • Knowledge-intensive vs. knowledge-free
  • Degrees of supervision
  • Computational Linguistics vs. statistical NLP
  • A new approach
  • The Structure Discovery Paradigm
  • Graph models for language processing
  • Graph-based SD procedures
  • Results in task-based evaluation

3
Knowledge-Intensive vs. Knowledge-Free
  • In traditional automated language processing,
    knowledge is involved wherever humans
    manually tell machines
  • how to process language (explicit knowledge)
  • how a task should be solved (implicit knowledge)
  • Knowledge can be provided by means of
  • dictionaries, e.g. thesauri, WordNet,
    ontologies, ...
  • (grammar) rules
  • annotation

4
Degrees of Supervision
  • Supervision is providing positive and negative
    training examples to Machine Learning algorithms,
    which use them as a basis for building a model
    that reproduces the classification on unseen data
  • Degrees:
  • Fully supervised (classification): learning is
    carried out only on a fully labeled training set
  • Semi-supervised: unlabeled examples are also used
    for building a data model
  • Weakly supervised (bootstrapping): a small set of
    labeled examples is grown, and classifications are
    used for re-training
  • Unsupervised (clustering): no labeled examples
    are provided

5
Computational Linguistics and Statistical NLP
  • CL:
  • Implementing linguistic theories with computers
  • Rule-based approaches
  • Rules found by introspection, not data-driven
  • Explicit knowledge
  • Goal: understanding language itself
  • Statistical NLP:
  • Building systems that perform language processing
    tasks
  • Machine Learning approaches
  • Models are built by training on annotated datasets
  • Implicit knowledge
  • Goal: build robust systems with high performance
  • There is a continuum rather than a sharp boundary

6
Structure Discovery Paradigm
  • SD:
  • Analyze raw data and identify regularities
  • Statistical methods, clustering
  • Knowledge-free, unsupervised
  • Structures: as many as can be discovered
  • Language-independent, domain-independent,
    encoding-independent
  • Goal: discover structure in language data and
    mark it in the data

7
Example Discovered Structures
Increased interest rates lead to investments in
banks .

<sentence lang="12" subj="34.11">
  <chunk id="c25">
    <word POS="p3" m="0.0" ss="14">Increas-ed</word>
    <MWU POS="p1" ss="33">
      <word POS="p1" m="5.1" ss="44">interest</word>
      <word POS="p1" m="2.12" ss="106">rate-s</word>
    </MWU>
  </chunk>
  <chunk id="c13">
    <MWU POS="p2">
      <word POS="p2" m="17.3" s="74">lead</word>
      <word POS="p117" m="11.98">to</word>
    </MWU>
  </chunk>
  <chunk id="c31">
    <word POS="p1" m="1.3" s="33">investment-s</word>
    <word POS="p118" m="11.36">in</word>
    <word POS="p1" m="1.12" s="33">bank-s</word>
  </chunk>
  <word POS="298">.</word>
</sentence>
  • Annotation on various levels
  • Similar labels denote similar properties as found
    by the SD algorithms
  • Similar structures in the corpus are annotated in
    a similar way

8
Consequences of Working in SD
  • The only input allowed is raw text data
  • Machines are told how to algorithmically discover
    structure
  • Self-annotation process: regularities are marked
    in the data
  • The Structure Discovery process is iterated

[Diagram: SD algorithms find regularities in the text data by analysis and annotate the data with them; the process is iterated]
9
Pros and Cons of Structure Discovery
  • Advantages:
  • Cheap: only raw data needed
  • Alleviates the acquisition bottleneck
  • Language- and domain-independent
  • No data-resource mismatch (all resources leak)
  • Disadvantages:
  • No control over self-annotation labels
  • Congruence to linguistic concepts not guaranteed
  • Much computing time needed

10
Building Blocks in SD
  • Hierarchical levels of basic units in text data
  • Letters
  • Words
  • Sentences
  • Documents
  • These are assumed to be recognizable in the
    remainder.
  • SD allows for
  • arbitrary numbers of intermediate levels
  • grouping of basic units into complex units,
  • but these have to be found by SD procedures.

11
Similarity and Homogeneity
  • For determining which units share structure, a
    similarity measure for units is needed. Two kinds
    of features are possible:
  • Internal features: compare units based on the
    lower-level units they contain
  • Context features: compare units based on other
    units, of the same or another level, that surround
    them
  • A clustering based on unit similarity yields sets
    of units that are homogeneous w.r.t. structure
  • This is an abstraction process: units are
    subsumed under the same label.
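A minimal sketch of the context-feature idea on the word level: each word is represented by the counts of its neighboring words, and two words are compared by cosine similarity of those sparse count vectors. The corpus and window size here are illustrative assumptions, not taken from the thesis.

```python
from collections import Counter
from math import sqrt

def context_features(tokens, window=1):
    """Collect neighboring tokens (context features) for each token."""
    feats = {}
    for i, tok in enumerate(tokens):
        ctx = feats.setdefault(tok, Counter())
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                ctx[tokens[j]] += 1
    return feats

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

tokens = "the cat sat on the mat the dog sat on the rug".split()
feats = context_features(tokens)
# "cat" and "dog" share the contexts "the _ sat", so they come out similar
print(cosine(feats["cat"], feats["dog"]))
```

Clustering such pairwise similarities (e.g. as edges of a graph) then yields the homogeneous sets described above.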

12
What is it good for? How do I know?
  • Many structural regularities can be thought of;
    some are interesting, some are not.
  • Structures discovered by SD algorithms will not
    necessarily match the concepts of linguists
  • Working in the SD paradigm means over-generating
    structure acquisition methods and checking
    whether they are helpful
  • Methods for telling helpful from useless SD
    procedures:
  • "Look at my nice clusters" approach: examine data
    by hand. While good in the initial phase of
    testing, this is inconclusive (choice of
    clusters, coverage)
  • Task-based evaluation: use the labels obtained as
    features in a Machine Learning scenario and
    measure the contribution of each label type.
    Involves supervision, and is indirect

13
Graph models for SD procedures
  • Motivation for graph representation:
  • Graphs are an intuitive and natural way to encode
    language units as nodes and their similarities as
    edges, but other representations are also
    possible
  • Graph clustering can efficiently perform
    abstraction by grouping units into homogeneous
    sets with Chinese Whispers
  • Some graphs on basic units:
  • Word co-occurrence (neighbor/sentence),
    significance, higher orders
  • Word context similarity based on local context
    vectors
  • Sentence/document similarity based on common words
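A sketch of the first graph type, sentence-based word co-occurrence. The SD procedures weight edges by a statistical significance measure (e.g. log-likelihood); a plain co-occurrence count threshold stands in here, and the toy sentences are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(sentences, min_count=2):
    """Build a word co-occurrence graph: nodes are words, edges connect
    words co-occurring in at least `min_count` sentences.
    (A raw count threshold stands in for a significance measure.)"""
    pair_counts = Counter()
    for sent in sentences:
        for a, b in combinations(sorted(set(sent)), 2):
            pair_counts[(a, b)] += 1
    graph = {}
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph

sentences = [
    "interest rates rise".split(),
    "interest rates fall".split(),
    "banks lower rates".split(),
]
g = cooccurrence_graph(sentences)
print(g)  # only "interest" and "rates" co-occur often enough
```

Such a graph is the input to the clustering steps on the following slide.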

14
Some graph-based SD procedures
  • Language Separation:
  • Cluster the sentence-based significant word
    co-occurrence graph
  • Use the word lists for language identification
  • Induced POS:
  • Cluster the local stop-word context vector
    similarity graph
  • Cluster the second-order neighbor word
    co-occurrence graph
  • Train and apply a trigram tagger
  • Word Sense Disambiguation:
  • Cluster the neighborhood of the target word in
    the sentence-based significant co-occurrence
    graph into sense clusters
  • Compare sense clusters with the local context for
    disambiguation
  • Semantic classes:
  • Cluster the similarity graph of words and
    induced-POS contexts
  • Use the contexts for assigning semantic classes
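The disambiguation step of the WSD procedure can be sketched as a simple overlap test: pick the sense whose cluster of co-occurring words overlaps most with the local context of the target word. The sense clusters for "hip" below are hypothetical, loosely mirroring the WSI example slides at the end of the deck.

```python
def disambiguate(context_words, sense_clusters):
    """Pick the sense whose co-occurrence cluster overlaps most
    with the local context of the target word."""
    best, best_overlap = None, -1
    for sense, cluster in sense_clusters.items():
        overlap = len(cluster & set(context_words))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

# Hypothetical sense clusters for "hip"
senses = {
    "anatomy": {"joint", "bone", "replacement", "fracture"},
    "trendy": {"fashion", "cool", "scene", "style"},
}
print(disambiguate("the hip joint after the fracture".split(), senses))
```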

15
Look at my nice languages! Cleaning CUCWeb
  • Latin
  • In expeditionibus tessellata et sectilia
    pauimenta circumferebat.
  • Britanniam petiuit spe margaritarum earum
    amplitudinem conferebat et interdum sua manu
    exigebat ..
  • Scripting
  • @echo @cd $(TLSFDIR)$(CC) $(RTLFLAGS)
    $(RTL_LWIPFLAGS) -c $(TLSFSRC)
  • @echo @cd $(TOOLSDIR)$(CC) $(RTLFLAGS)
    $(RTL_LWIPFLAGS) -c $(TOOLSSRC) ..
  • Hungarian
  • A külügyminiszter a diplomáciai és konzuli
    képviseletek címjegyzékét és konzuli
  • Köztestületek, jogi személyiséggel és helyi
    jogalkotási jogkörrel.
  • Esperanto
  • Por vidi ghin kun internacia kodigho kaj kun
    kelkaj bildoj kliku tie chi ) La Hispana..
  • Ne nur pro tio, ke ghi perdigis la vivon de
    kelk-centmil hispanoj, sed ankau pro ghia efiko..
  • Human Genome
  • 1 atgacgatga gtacaaacaa ctgcgagagc atgacctcgt
    acttcaccaa ctcgtacatg 61 ggggcggaca tgcatcatgg
    gcactacccg ggcaacgggg tcaccgacct ggacgcccag 121
    cagatgcacc

16
Task-based unsuPOS evaluation
  • UnsuPOS tags are used as features; performance is
    compared to no POS and supervised POS. The tagger
    was induced in one CPU-day from the BNC
  • Kernel-based WSD: better than noPOS, equal to
    suPOS
  • POS tagging: better than noPOS
  • Named Entity Recognition: no significant
    differences
  • Chunking: better than noPOS, worse than suPOS

17
Summary
  • Structure Discovery Paradigm contrasted to
    traditional approaches
  • no manual annotation, no resources (cheaper)
  • language- and domain-independent
  • iteratively enriching structural information by
    finding and annotating regularities
  • Graph-based SD procedures
  • Evaluation framework and results

18
Questions?
  • THANKS FOR YOUR ATTENTION!

19
Structure Discovery Machine I
  • From linguistics, we have the following
    intuitions that can lead to SD algorithms
    capturing the underlying structure:
  • There are different languages
  • Words belong to word classes
  • Short sequences of words form multi word units
  • Words can be semantically decomposable
    (compounds)
  • Words are subject to inflection
  • Morphological congruence between words
  • There are grammatical dependencies between words
    and sequences of words
  • Words can have different semantic properties
  • Semantic congruence between words
  • A word can have several meanings

20
Structure Discovery Machine II
  • The following methods are SD algorithms:
  • Language Identification: as introduced
  • POS Induction: as introduced
  • MWU detection by collocation extraction
  • Unsupervised Compound Decomposition and
    Paraphrasing (work in progress)
  • Unsupervised Morphology (MorphoChallenge): letter
    successor varieties
  • Unsupervised Parsing: grammar induction based on
    POS and neighbor-based co-occurrences
  • Semantic classes: similarity in context patterns
    of words and POS (work in progress)
  • WSI/WSD: clustering co-occurrences /
    disambiguation (work in progress)
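The letter-successor-variety idea mentioned for unsupervised morphology can be sketched in a few lines: for each prefix seen in a word list, count how many distinct letters can follow it; a peak in this count suggests a morpheme boundary. The word list below is an invented example.

```python
from collections import defaultdict

def successor_varieties(words):
    """For each prefix occurring in the word list, count the distinct
    letters that follow it; peaks suggest morpheme boundaries."""
    succ = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            succ[w[:i]].add(w[i])
    return {prefix: len(s) for prefix, s in succ.items()}

words = ["walks", "walked", "walking", "wall"]
sv = successor_varieties(words)
print(sv["walk"])  # letters seen after "walk": s, e, i -> 3
```

The jump from sv["wal"] to sv["walk"] marks "walk" as a likely stem boundary.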

21
Chinese Whispers Graph Clustering
  • Explanations:
  • Nodes have a class and communicate it to their
    adjacent nodes
  • A node adopts the majority class in its
    neighborhood
  • Nodes are processed in random order for some
    iterations
  • Properties:
  • Time-linear in the number of edges: very efficient
  • Randomized, non-deterministic
  • Parameter-free
  • Number of clusters is found by the algorithm
  • Small-world graphs converge fast

Algorithm:
  initialize: forall vi in V: class(vi) = i
  while changes:
    forall v in V, in randomized order:
      class(v) = highest-ranked class in the neighborhood of v
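The pseudocode above can be turned into a small runnable sketch; the weighted toy graph (two triangles joined by a weak edge) is an invented example, and the iteration cap and fixed seed are implementation choices, not part of the algorithm.

```python
import random

def chinese_whispers(graph, iterations=20, seed=0):
    """Chinese Whispers: each node starts in its own class; nodes repeatedly
    adopt the class with the highest total edge weight among their neighbors.
    Time-linear in the number of edges, randomized, parameter-free."""
    rng = random.Random(seed)
    labels = {v: v for v in graph}          # initialize: class(vi) = i
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)                  # randomized processing order
        changed = False
        for v in nodes:
            if not graph[v]:
                continue
            scores = {}
            for u, w in graph[v].items():   # weighted neighbor vote
                scores[labels[u]] = scores.get(labels[u], 0) + w
            best = max(scores, key=scores.get)
            if labels[v] != best:
                labels[v] = best
                changed = True
        if not changed:                     # converged
            break
    return labels

# Two triangles joined by a weak edge: each collapses to one class
g = {
    "a": {"b": 1, "c": 1}, "b": {"a": 1, "c": 1},
    "c": {"a": 1, "b": 1, "d": 0.1},
    "d": {"c": 0.1, "e": 1, "f": 1},
    "e": {"d": 1, "f": 1}, "f": {"d": 1, "e": 1},
}
labels = chinese_whispers(g)
print(labels)
```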
22
Language Separation Evaluation
  • Cluster the co-occurrence graph of a multilingual
    corpus
  • Use the words of each cluster as a lexicon in a
    language identifier
  • Almost perfect performance
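The second step can be sketched as lexicon-overlap scoring: count how many words of a sentence appear in each cluster-derived word list and pick the best-scoring language. The word lists below are invented (real ones would come from clustering, with cluster IDs rather than language names).

```python
def identify_language(sentence, lexicons):
    """Score each language by how many of the sentence's words
    appear in its cluster-derived word list; return the best."""
    words = sentence.lower().split()
    scores = {lang: sum(w in lex for w in words)
              for lang, lex in lexicons.items()}
    return max(scores, key=scores.get)

# Hypothetical word lists, as produced by clustering a
# multilingual co-occurrence graph
lexicons = {
    "english": {"the", "and", "of", "house"},
    "german": {"der", "und", "das", "haus"},
}
print(identify_language("das haus und der garten", lexicons))
```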

23
unsuPOS Steps
... , sagte der Sprecher bei der Sitzung . ...
, rief der Vorsitzende in der Sitzung . ... ,
warf in die Tasche aus der Ecke .
C1: sagte, warf, rief; C2: Sprecher, Vorsitzende,
Tasche; C3: in; C4: der, die
... , sagte/C1 der/C4 Sprecher/C2 bei der/C4
Sitzung . ... , rief/C1 der/C4 Vorsitzende/C2
in/C3 der/C4 Sitzung . ... , warf/C1 in/C3
die/C4 Tasche/C2 aus der/C4 Ecke .
... , sagte/C1 der/C4 Sprecher/C2 bei/C3 der/C4
Sitzung/C2 . ... , rief/C1 der/C4 Vorsitzende/C2
in/C3 der/C4 Sitzung/C2 . ... , warf/C1 in/C3
die/C4 Tasche/C2 aus/C3 der/C4 Ecke/C2 .
24
unsuPOS Ambiguity Example
25
unsuPOS Medline tagset
  • 1 (13721) recombinogenic, chemoprophylaxis,
    stereoscopic, MMP2, NIPPV, Lp, biosensor,
    bradykinin, issue, S-100beta, iopromide,
    expenditures, dwelling, emissions,
    implementation, detoxification, amperometric,
    appliance, rotation, diagonal,
  • 2(1687) self-reporting, hematology, age-adjusted,
    perioperative, gynaecology, antitrust,
    instructional, beta-thalassemia, interrater,
    postoperatively, verbal, up-to-date,
    multicultural, nonsurgical, vowel, narcissistic,
    offender, interrelated,
  • 3(1383) proven, supplied, engineered,
    distinguished, constrained, omitted, counted,
    declared, reanalysed, coexpressed, wait,
  • 4(957) mediates, relieves, longest, favor,
    address, complicate, substituting, ensures,
    advise, share, employ, separating, allowing,
  • 5(1207) peritubular, maxillary, lumbar, abductor,
    gray, rhabdoid, tympanic, malar, adrenal,
    low-pressure, mediastinal,
  • 6(653) trophoblasts, paws, perfusions, cerebrum,
    pons, somites, supernatant, Kingdom,
    extra-embryonic, Britain, endocardium,
  • 7(1282) acyl-CoAs, conformations, isoenzymes,
    STSs, autacoids, surfaces, crystallins,
    sweeteners, TREs, biocides, pyrethroids,
  • 8(1613) colds, apnea, aspergilloma, ACS,
    breathlessness, perforations, hemangiomas,
    lesions, psychoses, coinfection, terminals,
    headache, hepatolithiasis, hypercholesterolemia,
    leiomyosarcomas, hypercoagulability, xerostomia,
    granulomata, pericarditis,
  • 9(674) dysregulated, nearest, longest,
    satisfying, unplanned, unrealistic, fair,
    appreciable, separable, enigmatic, striking, i
  • 10(509) differentiative, ARV, pleiotropic,
    endothermic, tolerogenic, teratogenic, oxidizing,
    intraovarian, anaesthetic, laxative,
  • 13(177) ewe, nymphs, dams, fetuses, marmosets,
    bats, triplets, camels, SHR, husband, siblings,
    seedlings, ponies, foxes, neighbor, sisters,
    mosquitoes, hamsters, hypertensives, neonates,
    proband, anthers, brother, broilers, woman, eggs,
  • 14(103) considers, comprises, secretes,
    possesses, sees, undergoes, outlines, reviews,
    span, uncovered, defines, shares, s
  • 15(87) feline, chimpanzee, pigeon, quail,
    guinea-pig, chicken, grower, mammal, toad,
    simian, rat, human-derived, piglet, ovum,
  • 16(589) dually, rarely, spectrally,
    circumferentially, satisfactorily, dramatically,
    chronically, therapeutically, beneficially,
    already,
  • 18(124) 1-min, two-week, 4-min, 8-week, 6-hour,
    2-day, 3-minute, 20-year, 15-minute, 5-h, 24-h,
    8-h, ten-year, overnight, 120-
  • 21(12) July, January, May, February, December,
    October, April, September, June, August, March,
    November
  • 23(13) acetic, retinoic, uric, oleic,
    arachidonic, nucleic, sialic, linoleic, lactic,
    glutamic, fatty, ascorbic, folic
  • 25(28) route, angle, phase, rim, state, region,
    arm, site, branch, dimension, configuration,
    area, Clinic, zone, atom, isoform,
  • 247(6) P<0.001, P<0.01, p<0.001, p<0.01, P<.001,
    P<0.0001

26
unsuPOS POS-sorted neighbors
27
unsuPOS-sorted co-occurrences
28
WSI Hip Example I
hip
29
WSI Hip Example II
hip