Finding homogeneous word sets: Towards a dissertation in NLP


1
Finding homogeneous word sets: Towards a
dissertation in NLP
  • Chris Biemann
  • NLP Department, University of Leipzig
  • biem_at_informatik.uni-leipzig.de
  • Universitetet i Oslo, 12/10/2005

2
Outline
  • Preliminaries: Co-occurrences
  • Unsupervised methods
    - Language separation
    - POS tagging
  • Weakly supervised methods
    - Gazetteer building for NER
    - Semantic lexicon extension
    - Extension of lexical-semantic word nets

3
Statistical Co-occurrences
  • occurrence of two or more words within a
    well-defined unit of information (sentence,
    nearest neighbors, document, window ...)
  • Significant co-occurrences reflect relations
    between words
  • Significance measure (log-likelihood), where
    - k is the number of sentences containing a and b together
    - ab is (number of sentences with a) x (number of
      sentences with b)
    - n is the total number of sentences in the corpus
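
The formula itself did not survive in this transcript. As a hedged illustration only, the sketch below computes Dunning's log-likelihood ratio over a 2x2 sentence contingency table, which is one common log-likelihood significance measure and not necessarily the exact formula shown on the slide. The function name and argument layout are assumptions: `a` and `b` are the individual sentence counts of the two words (so the slide's "ab" is their product).

```python
import math

def log_likelihood(k, a, b, n):
    """Dunning-style G^2 score for two words A and B (illustrative sketch).

    k: number of sentences containing both words
    a: number of sentences containing A
    b: number of sentences containing B
    n: total number of sentences in the corpus
    """
    # Observed 2x2 table: both, only A, only B, neither.
    observed = [k, a - k, b - k, n - a - b + k]
    # Counts expected if A and B occurred independently.
    expected = [a * b / n, a * (n - b) / n, (n - a) * b / n, (n - a) * (n - b) / n]
    # Empty cells contribute nothing to the sum.
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Words that co-occur exactly as often as chance predicts score 0;
# the score grows as k moves away from a*b/n.
independent = log_likelihood(k=1, a=10, b=10, n=100)
associated = log_likelihood(k=8, a=10, b=10, n=100)
```

Note that this score is symmetric in direction: it is also large when two words co-occur much less often than expected, so a practical filter would additionally check k > a*b/n.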

4
Unsupervised methods
  • Unsupervised means no training data: there is
    no training set
  • Hence the discovery and use of any structure in
    language must be entirely algorithmic
  • Unsupervised also means knowledge-free: no prior
    knowledge is allowed
  • A well-known unsupervised method: clustering
  • Advantages
    - language-independent
    - no need to build manual resources (cheap)
    - robust
  • Disadvantages
    - labeling problem
    - unaware of errors
    - often not traceable
    - difficult to interpret / evaluate

5
Unsupervised Language Discrimination
  • Supervised language identification
    - needs training
    - operates on letter n-grams or common words as
      features
    - works almost error-free for texts of 500
      letters or more
  • Drawbacks
    - does not work for previously unknown languages
    - danger of misclassifying instead of reporting
      "unknown"
  • Example: http://odur.let.rug.nl/~vannoord/TextCat/Demo
    - "xx xxx x xxx" is classified as Nepali
    - "öö ö öö ööö" is classified as Persian
  • Unsupervised language discrimination
    - Task: given a mixed-language corpus, split it
      into the different languages

Biemann, C., Teresniak, S. (2005): Disentangling
from Babylonian Confusion - Unsupervised Language
Identification. Proceedings of CICLing-2005,
Computational Linguistics and Intelligent Text
Processing, Mexico City, Mexico; Springer LNCS
3406, pp. 762-773
6
Co-occurrence Graphs
  • The entirety of all significant co-occurrences is
    a co-occurrence graph G(V, E) with
    - V: vertices = words
    - E: edges (v1, v2, s), where v1, v2 are words and
      s is the significance value
  • The co-occurrence graph is
    - weighted
    - undirected
  • It has the small-world property

7
Chinese Whispers - Motivation
  • (small-world) graphs consist of regions with a
    high clustering coefficient and hubs that connect
    those regions
  • The nodes in cluster regions should be assigned
    the same label per region
  • Every node gets a label and whispers it to its
    neighbouring nodes. A node changes to a label if
    most of its neighbours whisper this label or it
    invents a new one
  • Assuming that strongly connected words are
    semantically close, well-motivated clusters
    should emerge

8
Chinese Whispers Algorithm
  • Assign different labels to every node in the
    graph
  • For iteration i from 1 to total_iterations
    - mutation_rate = 1/i^2
    - For each word w in the graph
        new_label of w = highest ranked label in the
          neighbourhood of w
        with probability mutation_rate:
          new_label of w = new class label
    - labels = new_labels
  • graph clustering algorithm
  • linear time in the number of nodes
  • random mutation can be omitted but showed better
    results for small graphs
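
A minimal Python sketch of the procedure above, under stated assumptions. The function name, the edge-list input format and the `seed` and `mutate` parameters are illustrative and not from the talk. One deliberate deviation: the slide collects new labels and swaps them in at the end of each iteration, whereas this sketch updates labels in place in random order, because the batch update can oscillate on very small graphs (two connected nodes would keep swapping labels).

```python
import random
from collections import defaultdict

def chinese_whispers(edges, total_iterations=20, seed=0, mutate=True):
    """Cluster an undirected weighted graph given as (u, v, weight) triples."""
    rng = random.Random(seed)
    neighbours = defaultdict(dict)
    for u, v, w in edges:
        neighbours[u][v] = w
        neighbours[v][u] = w

    # Every node starts with its own label.
    labels = {node: i for i, node in enumerate(neighbours)}
    next_fresh = len(labels)

    for i in range(1, total_iterations + 1):
        mutation_rate = 1.0 / (i * i)
        order = list(neighbours)
        rng.shuffle(order)
        for node in order:
            # Rank labels by the summed edge weight of neighbours carrying them.
            scores = defaultdict(float)
            for nb, w in neighbours[node].items():
                scores[labels[nb]] += w
            labels[node] = max(scores, key=scores.get)
            # Occasionally invent a brand-new label (optional).
            if mutate and rng.random() < mutation_rate:
                labels[node] = next_fresh
                next_fresh += 1
    return labels

# Two tight triangles joined by one weak edge end up with two labels.
toy_edges = [("a", "b", 5), ("a", "c", 4), ("b", "c", 3),
             ("d", "e", 5), ("d", "f", 4), ("e", "f", 3),
             ("c", "d", 1)]
toy_labels = chinese_whispers(toy_edges, mutate=False)
```

The number of clusters is not an input: it falls out of how many labels survive.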

9
Chinese Whispers on 7 Languages
10
Chinese Whispers on 7 languages
11
Assigning languages to sentences
  • Use a word-based language identification tool
  • The largest clusters form word lists for the
    different languages
  • A sentence is assigned a cluster label if
    - it contains at least 2 words from the cluster and
    - it does not contain more words from another
      cluster
  • Questions for evaluation
    - Up to what number of languages is that possible?
    - How much can the corpus be biased?
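
The assignment rule can be sketched as follows. This is an illustration, not the tool used in the talk: the function name, the lowercasing and the tiny word lists are assumptions, and so is the tie handling (a sentence with equally many hits in two clusters is left unclassified, which the slide does not specify).

```python
def assign_cluster(tokens, clusters, min_hits=2):
    """Return the cluster label for a tokenized sentence, or None.

    clusters maps a label to the set of words in that cluster.
    """
    lowered = [t.lower() for t in tokens]
    hits = {label: sum(1 for t in lowered if t in words)
            for label, words in clusters.items()}
    if not hits:
        return None
    best = max(hits, key=hits.get)
    # Rule 1: at least min_hits words from the winning cluster.
    if hits[best] < min_hits:
        return None
    # Rule 2: no other cluster may match as many words (tie -> unclassified).
    if sum(1 for h in hits.values() if h == hits[best]) > 1:
        return None
    return best

# Hypothetical miniature word lists standing in for the largest clusters.
toy_clusters = {
    "en": {"the", "is", "and", "you"},
    "de": {"der", "die", "und", "mit"},
}
english = assign_cluster("the cat is here".split(), toy_clusters)
mixed = assign_cluster("Die Beatles mit All you need is love".split(), toy_clusters)
```

With these toy lists, the mixed German/English sentence matches two words in each cluster and therefore stays unclassified.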

12
Evaluation
  • Mix of seven languages, equal number of
    sentences
    - Languages used: Dutch, Estonian, English, French,
      German, Icelandic and Italian
    - At least 100 sentences per language are necessary
      for consistent clusters
  • Two languages with strong bias
    - At least 500 sentences out of 100000 are needed
      to find the smaller language
    - Tested on English in Estonian, Dutch in German,
      French in Italian

13
Common mistakes
  • Unclassified
    - mostly enumerations of sports teams
    - very short sentences, e.g. headlines
    - legal act ciphers in the Estonian case, e.g.
      10.12.96 jõust.01.01.97 - RT I 1996 , 89 , 1590
  • Misclassified: mixed-language sentences, like
    - French: Frönsku orðin "cinéma vérité"
      þýða "kvikmyndasannleikur"
    - English: Die Beatles mit "All you need is love".
14
Induction of POS Information
  • Given: unstructured monolingual text corpus
  • Goal: induction of POS tags for many (all)
    words. The result is a list of words with the
    corresponding tag. Application to text (the
    actual POS tagging) is another task.
  • Motivation
    - POS information is a processing step in a variety
      of NLP applications such as parsing, IE, indexing
    - POS taggers need a considerable amount of
      hand-tagged training data, which is expensive and
      only available for major languages
    - Even for major languages, POS taggers are suited
      to well-formed texts and do not cope well with
      domain-dependent issues as found e.g. in
      e-mail or spoken corpora

15
Literature Overview
  • Schütze 93, Schütze 95, Clark 00 and Freitag 04
    share a similar high-level architecture but
    differ in details
  • Steps to obtain word classes
    - Calculation of global contexts using a window of
      1-2 words to the left and right, with the most
      frequent 150-250 words as features
    - Clustering of these contexts gives word classes

16
Method Description
  • Contexts: the most frequent N (100, 200, 300)
    words are used as features, giving context
    vectors of dimension 4 x N for the most frequent
    10000 words in the corpus
  • Cosine similarity between all pairs of the 10000
    top words is calculated
  • Transformation to a graph: draw an edge with
    weight 1/(1 - cos(x,y)) between x and y if
    cos(x,y) is above some threshold
  • Chinese Whispers (CW) on the graph results in word
    class clusters
  • Differences to previous methods
    - CW clustering does not need the number of classes
      as input
    - No dimensionality reduction techniques such as SVD
    - Explicit threshold for similarity
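
The graph construction step can be sketched like this. The function names, the default threshold of 0.5 and the `max_weight` cap are assumptions for illustration: the slide only says "some threshold", and the weight 1/(1 - cos) is unbounded for identical vectors, so the sketch caps it.

```python
import math
from itertools import combinations

def cosine(x, y):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

def similarity_graph(vectors, threshold=0.5, max_weight=1e6):
    """Turn {word: context_vector} into (word1, word2, weight) edges."""
    edges = []
    for (w1, v1), (w2, v2) in combinations(vectors.items(), 2):
        c = cosine(v1, v2)
        if c > threshold:
            # Weight 1/(1 - cos), capped so identical vectors stay finite.
            weight = max_weight if c >= 1.0 else min(max_weight, 1.0 / (1.0 - c))
            edges.append((w1, w2, weight))
    return edges

# Toy vectors: x and z are orthogonal, y lies between them.
toy_vectors = {"x": [1, 0], "y": [1, 1], "z": [0, 1]}
toy_graph = similarity_graph(toy_vectors)
```

Comparing all pairs is quadratic in the number of words; for the 10000 words mentioned above that is about 50 million cosine computations, which a real implementation would vectorize.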

17
Toy Example (1)
  • Corpus fragments
    - ... _KOM_ sagte der Sprecher bei der Sitzung
      _ESENT_
    - ... _KOM_ rief der Vorsitzende in der Sitzung
      _ESENT_
    - ... _KOM_ warf in die Tasche aus der Ecke
      _ESENT_
  • Features: der(1), die(2), bei(3), in(4),
    _ESENT_(5), _KOM_(6)
  • Context positions: -2, -1, +1, +2 (the context
    vector table is not preserved in the transcript)
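
As a rough illustration of how such context vectors can be counted, the sketch below records, for every word, which feature word appears at each of the four positions. The function name and the sparse (position, feature) representation are assumptions, and the leading "..." of the fragments is dropped.

```python
from collections import Counter, defaultdict

POSITIONS = (-2, -1, 1, 2)  # two words to the left, two to the right

def context_vectors(sentences, features, positions=POSITIONS):
    """Count feature words at fixed offsets around every word.

    Returns {word: Counter({(offset, feature): count})}, a sparse form of
    the 4 x N vector (4 positions times N feature words).
    """
    features = set(features)
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for offset in positions:
                j = i + offset
                if 0 <= j < len(tokens) and tokens[j] in features:
                    vectors[word][(offset, tokens[j])] += 1
    return vectors

fragments = [
    "_KOM_ sagte der Sprecher bei der Sitzung _ESENT_".split(),
    "_KOM_ rief der Vorsitzende in der Sitzung _ESENT_".split(),
    "_KOM_ warf in die Tasche aus der Ecke _ESENT_".split(),
]
feature_words = ["der", "die", "bei", "in", "_ESENT_", "_KOM_"]
toy_contexts = context_vectors(fragments, feature_words)
```

In this toy corpus "sagte" and "rief" receive identical vectors (preceded by _KOM_, followed by "der"), which is the kind of similarity the clustering step relies on.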
18
Toy Example (2)
  • Here, CW cuts the graph into 2 partitions: nouns
    and verbs.

19
Norwegian Labels
20
Corpus size and features: CP vs. coverage
21
Example: time words in Norwegian
22
Cluster sizes and clusters per word class
  • When optimizing CP, words of the same word class
    tend to end up in several clusters, especially
    for open word classes
  • Open word classes are the most interesting word
    classes for further processing steps like IE,
    relation learning, ...
  • Cluster sizes are Zipf-distributed: there are
    always many small clusters
  • Hierarchical CW could be used to lower the number
    of clusters while preserving POS distinctions

23
Outlook: Constructing a POS tagger
  • Using word clusters to initialize a POS tagger
  • Evaluation based on types instead of tokens
  • Open questions
    - Context window backoff model for unknown words
    - Leave out or include unclustered high-frequency
      words (as singletons)?
    - Can the many classes per POS be unified using
      tagger behaviour?

24
Weakly Supervised Methods
  • Weakly supervised means
    - Very little training data and prior knowledge
    - Learning from labeled and unlabeled data
    - Bootstrapping methods
  • Advantages
    - Very little input, still cheap
    - No labeling problem
    - Easier to evaluate
  • Disadvantages
    - Subject to error propagation
    - Stopping criterion difficult to define

25
Bootstrapping of lexical items
  • For learning by bootstrapping, two things are
    needed: a start set of some known items with
    classes, and a rule set that states how more
    information can be obtained using known items.
  • Generic bootstrapping algorithm
    - Knowledge = ∅
    - New = Start_set
    - While |New| > 0
        Knowledge += New
        New = ∅
        New = find new items using Knowledge and Rule_set

(Figure: items over iterations, showing known items
and new items, with a phase of growth followed by a
phase of exhaustion.)
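
The generic loop translates almost directly into code. In this sketch the rule set is abstracted as a callable, `find_new_items`, and a `max_iterations` guard is added; both are assumptions for illustration, as is the made-up "related items" table used in the usage example.

```python
def bootstrap(start_set, find_new_items, max_iterations=100):
    """Generic bootstrapping: grow Knowledge until no new items appear.

    find_new_items(knowledge) stands in for "find new items using
    Knowledge and Rule_set" and returns an iterable of items.
    """
    knowledge = set()
    new = set(start_set)
    iterations = 0
    while new and iterations < max_iterations:
        knowledge |= new
        # Only items not yet known count as new, so the loop can terminate.
        new = set(find_new_items(knowledge)) - knowledge
        iterations += 1
    return knowledge

# Hypothetical rule set: each known item points to items it lets us find.
related = {"Cuba": ["Trinidad"], "Trinidad": ["Tobago"], "Tobago": []}

def toy_rules(knowledge):
    return {found for item in knowledge for found in related.get(item, [])}

islands = bootstrap({"Cuba"}, toy_rules)
```

Without any verification, every item a rule proposes is accepted, which is exactly where error propagation enters.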
26
Benefits and Bothers of Bootstrapping
  • Pro
    - Only small start sets (seeds) are needed; those
      can be rapidly prepared
    - Process needs no further supervision (weakly
      supervised learning)
  • Cons
    - Danger of error propagation
    - When to stop is unclear

27
Patterns for word classes and their relations
  • Examples of word classes in text
    - Islands: On the island of Cuba ...,
      Caribbean island of Trinidad
    - Companies: the ACME Ltd. Incorporated
    - Verbs of utterance: she said <something>
    - Person names: John W. Smith, Ellen Meyer
  • Observation
    - Words belonging to the same class can be
      interchanged without hurting the relation
    - Sometimes there are no trigger words

28
Problem definition
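  • Let $R_i \subseteq A_1 \times \dots \times A_n$ be
    n-ary relations over word sets $A_1, \dots, A_n$.
  • Given
    - Some elements of the sets $A_1, \dots, A_n$
    - A large corpus
  • Needed
    - The sets $A_1, \dots, A_n$
    - The tuples $(a_1, \dots, a_n) \in R_i$
    - The necessary rules for classification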

29
Pattern Matching Rules
  • Annotate text with known items and flat features
    (tagging is nice, but tagsets of 4 tags will do
    for English):
      "... said Jonas Berger , who ..."
       ... LC   UC    LN     PM LC  ...
  • Use rules like
      UC LN -> FN
      FN UC -> LN
    to classify "Jonas" as a first name
  • Rules of this kind are weak hypotheses because
    they sometimes misclassify, e.g. in
    - "As Berger turned over, ..."
    - "... tickets at Agency Berger, Munich."
  • => Rules alone are not sufficient.
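
Applying such rules to a tag sequence might look like the sketch below. The rule encoding as (pattern, index to retag, new tag) and the function name are assumptions. The tag names are read here as LC = lowercase word, UC = uppercase word, LN = last name, FN = first name, PM = punctuation mark, which the transcript does not spell out.

```python
# Each rule: (tag pattern, index within the pattern to retag, new tag).
RULES = [
    (("UC", "LN"), 0, "FN"),  # UC LN -> FN: uppercase word before a last name
    (("FN", "UC"), 1, "LN"),  # FN UC -> LN: uppercase word after a first name
]

def apply_rules(tags, rules=RULES):
    """Repeatedly apply the pattern rules until no rule fires any more."""
    tags = list(tags)
    changed = True
    while changed:
        changed = False
        for pattern, pos, new_tag in rules:
            for i in range(len(tags) - len(pattern) + 1):
                if tuple(tags[i:i + len(pattern)]) == pattern:
                    # Retagging removes a UC tag, so the loop must terminate.
                    tags[i + pos] = new_tag
                    changed = True
    return tags

# "said Jonas Berger , who" with "Berger" already known as a last name.
tagged = apply_rules(["LC", "UC", "LN", "PM", "LC"])
```

The same mechanism would also tag "As" in "As Berger turned over" as a first name, which is why the candidates need a separate verification step.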

30
Pendulum Algorithm: Bootstrapping with
verification
  • Initialize Knowledge, Rules, New_items
  • While |New_items| > 0
    - Last_new_items = New_items; New_items = ∅
    - For all i in Last_new_items
        Search step:
          fetch text containing i from the corpus
          find candidates in the text by using
            Knowledge and Rules
        Verification step, for each candidate k:
          fetch text containing k
          rate k on the basis of the text
    - New_items = candidates with high ratings
    - Knowledge += New_items

Quasthoff, U., Biemann, Chr., Wolff, Chr.: Named
entity learning and verification: EM in large
corpora. In Proceedings of CoNLL-2002, The
Sixth Workshop on Computational Language
Learning, 31 August and 1 September 2002, in
association with Coling 2002 in Taipei, Taiwan

Biemann, Chr., Böhm, K., Quasthoff, U.,
Wolff, Chr.: Automatic Discovery and Aggregation
of Compound Names for the use in Knowledge
Representations. Proc. I-KNOW 03, International
Conference on Knowledge Management, Graz, and
Journal of Universal Computer Science (JUCS),
Volume 9, Number 6, pp. 530-541, June 2003
31
Explanations on the Pendulum
  • The same rules are used for both search and
    verification of candidates
  • Previously known and previously learned items are
    used for both search and verification of
    candidates
  • A word is only taken into knowledge if it
    occurs
    - multiple times and
    - at a high rate
    in the corpus with its classification.
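
One possible reading of this acceptance criterion, as a sketch: a candidate is kept only if the rules assign it the class often enough in absolute terms and in a large enough share of its occurrences. The function name and both threshold values are hypothetical; the talk does not state them.

```python
def accept(n_occurrences, n_classified, min_count=3, min_rate=0.8):
    """Decide whether a candidate word enters the knowledge base.

    n_occurrences: how often the word was found in the fetched text
    n_classified:  how often the rules assigned it the class in question
    min_count, min_rate: illustrative thresholds, not from the talk
    """
    if n_occurrences == 0:
        return False
    often_enough = n_classified >= min_count                 # "multiple times"
    rate_ok = n_classified / n_occurrences >= min_rate       # "at a high rate"
    return often_enough and rate_ok

frequent_and_consistent = accept(10, 9)
too_rare = accept(2, 2)
inconsistent = accept(100, 10)
```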

32
Example: island names and island specifiers
33
Results: German Person Names
  • Start set and prior knowledge: 9 first names,
    10 last names, 15 rules, 12 regular expressions
    for titles
  • Corpus: Projekt Deutscher Wortschatz, 36 million
    sentences
  • Found 42000 items, of which
    - 74% LN, precision > 99%
    - 15% FN, precision > 80%
    - 11% TIT, precision > 99%
34
Extending a semantic lexicon using
co-occurrences and HaGenLex
  • Size for nouns: about 13 000
  • 50 semantic classes for nouns are constructed
    from allowed combinations of
    - 16 semantic features (binary), e.g. HUMAN+,
      ARTIFICIAL-
    - 17 ontologic sorts, e.g. concrete,
      abstract-situation, ...

WORD              SEMANTIC CLASS
Aggressivität     nonment-dyn-abs-situation
Agonie            nonment-stat-abs-situation
Agrarprodukt      nat-discrete
Ägypter           human-object
Ahn               human-object
Ahndung           nonment-dyn-abs-situation
Ähnlichkeit       relation
Airbag            nonax-mov-art-discrete
Airbus            mov-nonanimate-con-potag
Airport           art-con-geogr
Ajatollah         human-object
Akademiker        human-object
Akademisierung    nonment-dyn-abs-situation
...               ...
35
Underlying Assumptions
  • Harris (1968), Distributional Hypothesis: semantic
    similarity is a function over global contexts of
    words. The more similar the contexts, the more
    similar the words
  • Projected onto nouns and adjectives: nouns of
    similar semantic classes are modified by
    similar adjectives

36
Neighbouring Co-occurrences and Profiles
  • Neighbouring co-occurrence: a pair of words that
    occur next to each other more often than
    expected under the assumption of statistical
    independence.
  • The neighbouring co-occurrence relation between
    adjectives as left neighbours and nouns as right
    neighbours approximates typical head-modifier
    structures
  • The set of adjectives that co-occur significantly
    often to the left of a noun is called its
    adjective profile (the noun profile of an
    adjective is defined analogously)
  • For experiments, I used the most recent German
    corpus of Projekt Deutscher Wortschatz, 500
    million tokens

37
Example neighbouring profiles
  • Amount: 160000 nouns, 23400 adjectives

38
Mechanism of Inheritance
Which class is assigned to N4 in the next step?
  • Algorithm
    - Initialize adjective and noun profiles
    - Initialize the start set
    - As long as new nouns get classified
        calculate class probabilities for each
          adjective
        for all yet unclassified nouns n
          multiply, per class, the class probabilities
            of the adjectives modifying n
          assign the class with the highest
            probability to n
  • Class probabilities per adjective
    - count the classes of the classified nouns it
      modifies
    - normalize by the total number of nouns per
      class
    - normalize to 1
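
A compact sketch of this inheritance loop, under stated assumptions. The function names and the toy German data are invented for illustration. "Normalize by the total number of nouns per class" is interpreted as dividing each count by the current size of that class, and adjectives without any classified noun are simply skipped. The minAdj and maxClass parameters mentioned in the experiments are not modelled.

```python
from collections import Counter, defaultdict

def adjective_class_probs(noun_adjs, noun_class):
    """Class probabilities per adjective from the currently classified nouns."""
    class_size = Counter(noun_class.values())
    counts = defaultdict(Counter)
    for noun, cls in noun_class.items():
        for adj in noun_adjs.get(noun, ()):
            counts[adj][cls] += 1
    probs = {}
    for adj, per_class in counts.items():
        # Correct for class size, then normalize to 1.
        scaled = {c: n / class_size[c] for c, n in per_class.items()}
        total = sum(scaled.values())
        probs[adj] = {c: v / total for c, v in scaled.items()}
    return probs

def inherit(noun_adjs, start_set):
    """Propagate classes from the start set to unclassified nouns."""
    noun_class = dict(start_set)
    while True:
        probs = adjective_class_probs(noun_adjs, noun_class)
        classes = set(noun_class.values())
        newly = {}
        for noun, adjs in noun_adjs.items():
            informative = [probs[a] for a in adjs if a in probs]
            if noun in noun_class or not informative:
                continue
            scores = {c: 1.0 for c in classes}
            for p in informative:
                for c in classes:
                    scores[c] *= p.get(c, 0.0)
            best = max(scores, key=scores.get)
            if scores[best] > 0:
                newly[noun] = best
        if not newly:
            return noun_class
        noun_class.update(newly)

# Invented adjective profiles and a two-noun start set.
profiles = {
    "Hund": {"bissig", "treu"}, "Katze": {"treu", "scheu"}, "Reh": {"scheu"},
    "Auto": {"schnell", "teuer"}, "Wagen": {"teuer", "rot"},
}
result = inherit(profiles, {"Hund": "animal", "Auto": "artifact"})
```

In the toy run, "Reh" can only be classified in the second round, after "Katze" has inherited a class and passed it on to the adjective "scheu".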

39
Experimental Data
  • 5133 nouns satisfy minAdj = 5, which means a
    maximal recall of 84.9%
  • In all experiments, 10-fold cross-validation
    was used

40
Results: Global Classification
  • Classification was carried out directly on the 50
    semantic classes
  • Different measuring points correspond to the
    parameters minAdj in {5, 10, 15, 20} and
    maxClass in {2, 5, 50}
  • Results are too poor for lexicon extension

41
Combining Single Classifiers
  • Architecture: binary classifiers for single
    features, then combining the outcome.
    Parameters: minAdj = 5, maxClass = 2
  • Binary classifiers for the 16 features:
    ANIMAL +/-, ANIMATE +/-, ARTIF +/-, AXIAL +/-, ...
  • Binary classifiers for the 17 sorts:
    ab +/-, abs +/-, ad +/-, as +/-, ...
  • Selection: compatible semantic classes that are
    minimal w.r.t. the hierarchy and unambiguous
  • Result: class or reject
42
Results: Single Semantic Features
  • for bias > 0.05, good to excellent precision
  • total precision 93.8% (86.8% for feature +)
  • total recall 70.7% (69.2% for feature +)

43
Results: Ontologic Sorts
  • for bias > 0.10, good to excellent precision
  • total precision 94.1% (89.5% for sort +)
  • total recall 73.6% (69.6% for sort +)

44
Results: Combined Semantic Classes
  • no visible connection between class size and
    results
  • total precision 82.3%
  • total recall 32.8%
  • number of newly classified nouns: 8500 (with
    minAdj = 2: 13000)

45
Typical mistakes
  • Pflanze (plant): animal-object instead of
    plant-object
    - zart, fleischfressend, fressend, verändert,
      genmanipuliert, transgen, exotisch, selten,
      giftig, stinkend, wachsend...
  • Nachwuchs (offspring): human-object instead of
    animal-object
    - wissenschaftlich, qualifiziert, akademisch,
      eigen, talentiert, weiblich, hoffnungsvoll,
      geeignet, begabt, journalistisch...
  • Café (café): art-con-geogr instead of
    nonmov-art-discrete (cf. Restaurant)
    - Wiener, klein, türkisch, kurdisch, romanisch,
      cyber, philosophisch, besucht, traditionsreich,
      schnieke, gutbesucht, ...
  • Neger (negro): animal-object instead of
    human-object
    - weiß, dreckig, gefangen, faul, alt, schwarz,
      nackt, lieb, gut, brav
  • but
  • Skinhead (skinhead): human-object (ok)
    - 16,17,18,19,20,21,22,23,30jährig, gleichaltrig,
      zusammengeprügelt, rechtsradikal, brutal
  • In most cases the wrong class is semantically
    close. The evaluation metrics did not account for
    that.

Biemann, C., Osswald, R. (2005): Automatic
Extension of Feature-based Semantic Lexicons via
Contextual Attributes. Proceedings of the 29th
Annual Meeting of the GfKl, Magdeburg 2005
46
Extending CoreNet: Korean WordNet
  • CoreNet characteristics
    - Rather large groups of words per concept, as
      opposed to the fine-grained WordNet structure
    - The same concept hierarchy is used for all word
      classes
  • Size of the KAIST Korean corpus
    - 38 million tokens
    - 2.3 million sentences
    - 3.8 million types

47
Pendulum-Algorithm on co-occurrences
  • LastLearned = StartSet
  • Knowledge = StartSet
  • NewLearned = ∅
  • while (|LastLearned| > 0)
    - for all i in LastLearned
        Candidates = getCooccurrences(i)       (search step)
        for all c in Candidates
          VerifySet = getCooccurrences(c)      (verification step)
          if |VerifySet ∩ Knowledge| > threshold
            NewLearned += c
            Knowledge += c
    - LastLearned = NewLearned
    - NewLearned = ∅
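
A minimal sketch of this loop in Python. The co-occurrence lookup is modelled as a plain dictionary instead of a corpus query, and the English toy words are invented; the actual experiments used Korean data.

```python
def pendulum(start_set, cooccurrences, threshold):
    """Pendulum bootstrapping on co-occurrences (illustrative sketch).

    cooccurrences maps a word to the set of its significant co-occurrences.
    A candidate is learned if more than `threshold` of its own
    co-occurrences are already known.
    """
    knowledge = set(start_set)
    last_learned = set(start_set)
    while last_learned:
        new_learned = set()
        for item in last_learned:
            # Search step: everything that co-occurs with a recent item.
            for candidate in cooccurrences.get(item, ()):
                if candidate in knowledge:
                    continue
                # Verification step: swing back from the candidate.
                verify_set = set(cooccurrences.get(candidate, ()))
                if len(verify_set & knowledge) > threshold:
                    new_learned.add(candidate)
                    knowledge.add(candidate)
        last_learned = new_learned
    return knowledge

# Hypothetical symmetric co-occurrence table.
toy_cooc = {
    "apple": {"pear", "plum", "car"},
    "pear": {"apple", "plum", "cherry"},
    "plum": {"apple", "pear", "cherry"},
    "cherry": {"pear", "plum"},
    "car": {"apple", "road"},
    "road": {"car"},
}
fruits = pendulum({"apple", "pear"}, toy_cooc, threshold=1)
```

Here "car" is found in the search step but fails verification, because only one of its co-occurrences is known; raising the threshold makes the process more conservative.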
48
Sample step
  • Seed
  • Search with
    yields (amongst others)
  • Verify

49
Evaluation
  • Selection of concepts performed by a non-Korean
    speaker
  • Evaluation performed manually, only new words
    counted
  • Heuristics for avoiding result set infection
    - iteratively lower the threshold for verification
      from 8 down to 3 until the result set is too large
    - take the lowest threshold for which the result
      set has a reasonable size (not exceeding the
      start set)
  • A typical run needed 3-7 iterations to converge

Biemann, C., Shin, S.-I., Choi, K.-S. (2004):
"Semiautomatic Extension of CoreNet using a
Bootstrapping Mechanism on Corpus-based
Co-occurrences", Proceedings of the 20th
International Conference on Computational
Linguistics (COLING04), Geneva, Switzerland
50
Results
  • Not enough for automatic extension, but a good
    source for candidates

51
Problems ... and possible solutions
  • Coverage is low
    - increase corpus size for relevant domains
    - make use of other features, e.g. patterns
  • Precision is not satisfactory
    - obtain multiple concepts simultaneously
    - meta-level bootstrapping
    - make use of other features, e.g. POS tags for
      word class information
  • This work gives a baseline of what is reachable
    without employing language-dependent features

52
From Text to Ontologies
(Figure: processing pipeline from text to ontologies.
Recoverable labels: Text; sort by language (lang. 1,
lang. 2, ..., lang. n); assign word classes; text
with POS labels; assign semantic properties for
words; determine patterns and extract word pairs;
typed relations and instances.)
53
Questions?
  • THANK YOU!

57
Abstract
  • Methods are introduced that use corpus analysis
    to find sets of words that have something in
    common. With the objective of largely automating
    the task and putting the knowledge into
    algorithms instead of training sets, two kinds of
    methods can be distinguished: completely
    unsupervised methods (clustering) and weakly
    supervised methods (bootstrapping).
  • Two unsupervised variants of standard
    preprocessing steps will be discussed, namely
    language identification and part-of-speech
    tagging. In both, a novel, efficient graph
    clustering algorithm is employed.
  • After a general introduction to bootstrapping,
    which needs only a minimal training set, three
    bootstrapping experiments will be described:
    gazetteer construction for Named Entity
    Recognition, extension of a semantic lexicon, and
    expansion of a lexical-semantic word net.
  • Follow-ups on the latter two can give rise to
    automatic ontology creation and extension.