Title: Finding homogeneous word sets - Towards a dissertation in NLP
1. Finding homogeneous word sets - Towards a dissertation in NLP
- Chris Biemann
- NLP Department, University of Leipzig
- biem_at_informatik.uni-leipzig.de
- Universitetet i Oslo, 12/10/2005
2. Outline
- Preliminaries: co-occurrences
- Unsupervised methods: language separation, POS tagging
- Weakly supervised methods: gazetteer building for NER, semantic lexicon extension, extension of lexical-semantic word nets
3. Statistical Co-occurrences
- Occurrence of two or more words within a well-defined unit of information (sentence, nearest neighbours, document, window, ...)
- Significant co-occurrences reflect relations between words
- Significance measure (log-likelihood): sig(a,b) = x - k*log(x) + log(k!) with x = ab/n, where
  - k is the number of sentences containing a and b together
  - ab is (number of sentences with a) * (number of sentences with b)
  - n is the total number of sentences in the corpus
  (a computation sketch follows below)
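A minimal Python sketch of this measure, assuming the Poisson-based log-likelihood form sketched above (function and argument names are mine; it assumes a, b > 0):

    import math

    def cooc_significance(k, a, b, n):
        # k: sentences containing both words; a, b: sentences containing
        # each word; n: total number of sentences in the corpus
        x = a * b / n          # expected co-occurrence count under independence
        return x - k * math.log(x) + math.lgamma(k + 1)   # log(k!) via lgamma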
4. Unsupervised methods
- Unsupervised means no training data: there is nothing like a training set
- This means the discovery and usage of any structure in language must be entirely algorithmic
- Unsupervised also means knowledge-free: no prior knowledge allowed
- The most famous unsupervised method: clustering
- Advantages:
  - language-independent
  - no need to build manual resources (cheap)
  - robust
- Disadvantages:
  - labeling problem
  - unaware of errors
  - often not traceable
  - difficult to interpret / evaluate
5. Unsupervised Language Discrimination
- Supervised language identification:
  - needs training
  - operates on letter n-grams or common words as features
  - works almost error-free for texts of 500 letters or more
- Drawbacks:
  - does not work for previously unknown languages
  - danger of misclassifying instead of reporting unknown
- Example: http://odur.let.rug.nl/~vannoord/TextCat/Demo
  - "xx xxx x xxx" classified as Nepali
  - "öö ö öö ööö" classified as Persian
- The unsupervised task, in contrast: given a mixed-language corpus, split it into the different languages.
Biemann, C., Teresniak, S. (2005): Disentangling from Babylonian Confusion - Unsupervised Language Identification. Proceedings of CICLing-2005, Computational Linguistics and Intelligent Text Processing, Mexico City, Mexico, Springer LNCS 3406, pp. 762-773.
6. Co-occurrence Graphs
- The entirety of all significant co-occurrences is a co-occurrence graph G(V,E) with V (vertices) = words and E (edges) = triples (v1, v2, s), where v1, v2 are words and s is the significance value
- The co-occurrence graph is weighted and undirected
- It has the small-world property
7. Chinese Whispers - Motivation
- (Small-world) graphs consist of regions with a high clustering coefficient and hubs that connect those regions
- The nodes in cluster regions should be assigned the same label per region
- Every node gets a label and whispers it to its neighbouring nodes. A node changes to a label if most of its neighbours whisper this label, or it invents a new one
- Under the assumption that strongly connected words are semantically close, well-motivated clusters should emerge
8. Chinese Whispers Algorithm
- Assign a different label to every node in the graph
- For iteration i from 1 to total_iterations:
  - mutation_rate = 1/(i²)
  - For each word w in the graph:
    - new_label of w = highest ranked label in the neighbourhood of w
    - with probability mutation_rate: new_label of w = new class label
  - labels = new_labels
- Graph clustering algorithm (a Python sketch follows below)
- Linear time in the number of nodes
- Random mutation can be omitted, but showed better results for small graphs
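A runnable sketch of the synchronous variant given in the pseudocode above; the graph is a plain adjacency dict, and all names are illustrative:

    import random
    from collections import defaultdict

    def chinese_whispers(nodes, edges, iterations=20):
        # edges: dict node -> list of (neighbour, weight) pairs
        label = {v: i for i, v in enumerate(nodes)}   # every node starts with its own label
        next_free = len(nodes)
        for i in range(1, iterations + 1):
            mutation_rate = 1.0 / i ** 2              # decays over iterations, as on the slide
            new_label = dict(label)
            for w in nodes:
                if random.random() < mutation_rate:
                    new_label[w] = next_free          # invent a new class label
                    next_free += 1
                    continue
                votes = defaultdict(float)
                for neighbour, weight in edges.get(w, []):
                    votes[label[neighbour]] += weight # weighted vote of the neighbourhood
                if votes:
                    new_label[w] = max(votes, key=votes.get)
            label = new_label                         # labels = new_labels
        return label

Chinese Whispers is often described with in-place updates in randomized order; the synchronous update shown here follows the slide's "labels = new_labels" step.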
9. Chinese Whispers on 7 Languages (figure)
10. Chinese Whispers on 7 Languages (figure)
11. Assigning languages to sentences
- Use a word-based language identification tool
- The largest clusters form word lists for the different languages
- A sentence is assigned a cluster label if
  - it contains at least 2 words from the cluster, and
  - no more words from another cluster
  (a sketch of this rule follows below)
- Questions for evaluation:
  - Up to what number of languages is that possible?
  - How much can the corpus be biased?
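A sketch of this assignment rule (names are mine; ties are rejected here, a conservative reading of "no more words from another cluster"):

    def assign_language(sentence_words, clusters):
        # clusters: dict cluster_label -> set of words from that cluster
        counts = {lab: sum(w in words for w in sentence_words)
                  for lab, words in clusters.items()}
        best = max(counts, key=counts.get)
        runner_up = max((c for lab, c in counts.items() if lab != best), default=0)
        if counts[best] >= 2 and counts[best] > runner_up:
            return best
        return None   # unclassified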
12. Evaluation
- Mix of seven languages, equal number of sentences
- Languages used: Dutch, Estonian, English, French, German, Icelandic and Italian
- At least 100 sentences per language are necessary for consistent clusters
- Two languages with strong bias:
  - At least 500 sentences out of 100,000 are needed to find the smaller language
  - Tested on English in Estonian, Dutch in German, French in Italian
13. Common mistakes
- Unclassified: mostly enumerations of sports teams, very short sentences (e.g. headlines), and legal act identifiers in the Estonian case, e.g. "10.12.96 jõust. 01.01.97 - RT I 1996, 89, 1590"
- Misclassified: mixed-language sentences, e.g.
  - classified as French: Frönsku orðin "cinéma vérité" þýða "kvikmyndasannleikur" (Icelandic)
  - classified as English: Die Beatles mit "All you need is love" (German)
14. Induction of POS Information
- Given: an unstructured monolingual text corpus
- Goal: induction of POS tags for many (all) words. The result is a list of words with the corresponding tag. Application to text (the actual POS tagging) is another task.
- Motivation:
  - POS information is a processing step in a variety of NLP applications such as parsing, IE, indexing
  - POS taggers need a considerable amount of hand-tagged training data, which is expensive and only available for major languages
  - Even for major languages, POS taggers are suited to well-formed texts and do not cope well with domain-dependent issues as found e.g. in e-mail or spoken corpora
15. Literature Overview
- Schütze 93, Schütze 95, Clark 00, Freitag 04 show a similar architecture at a high level, but differ in the details.
- Steps to achieve word classes:
  - Calculation of global contexts using a window of 1-2 words to the left and right and the most frequent 150-250 words as features
  - Clustering of these contexts gives word classes
16. Method Description
- Contexts: the most frequent N (100, 200, 300) words serve as features, yielding 4 x N-dimensional context vectors for the most frequent 10,000 words in the corpus
- Cosine similarity between all pairs of the 10,000 top words is calculated
- Transformation to a graph: draw an edge with weight 1/(1-cos(x,y)) between x and y if cos(x,y) is above some threshold (a sketch follows below)
- Chinese Whispers (CW) on the graph results in word class clusters
- Differences to previous methods:
  - CW clustering does not need the number of classes as input
  - No dimensionality reduction techniques such as SVD
  - Explicit threshold for similarity
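A sketch of the graph construction under the assumptions above (cosine over context vectors, edge weight 1/(1-cos)); the threshold value is illustrative, the slide leaves it open:

    import numpy as np

    def similarity_graph(context_vectors, threshold=0.4):
        # context_vectors: dict word -> 4*N-dimensional np.array
        words = list(context_vectors)
        unit = {w: v / np.linalg.norm(v) for w, v in context_vectors.items()}
        edges = []
        for i, x in enumerate(words):
            for y in words[i + 1:]:
                cos = float(unit[x] @ unit[y])
                if threshold < cos < 1.0:   # skip cos = 1.0: weight would be infinite
                    edges.append((x, y, 1.0 / (1.0 - cos)))
        return edges

The edge list can be fed to the Chinese Whispers sketch from slide 8 after converting it to an adjacency dict.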
17. Toy Example (1)
- Corpus fragments:
  - ... _KOM_ sagte der Sprecher bei der Sitzung _ESENT_
  - ... _KOM_ rief der Vorsitzende in der Sitzung _ESENT_
  - ... _KOM_ warf in die Tasche aus der Ecke _ESENT_
- Features: der(1), die(2), bei(3), in(4), _ESENT_(5), _KOM_(6)
- Context positions: -2, -1, +1, +2
18. Toy Example (2)
- Here, CW cuts the graph into 2 partitions: nouns and verbs.
(Figure: the toy word graph with edge weights between 12 and 1000; CW separates the two clusters.)
19. Norwegian Labels (figure)
20. Corpus size and features: CP vs. coverage (figure)
21. Example: time words in Norwegian (figure)
22. Cluster sizes and clusters per word class
- When optimizing CP, words of the same word class tend to end up in several clusters, especially for open word classes
- Open word classes are the most interesting word classes for further processing steps like IE, relation learning, ...
- Cluster sizes are Zipf-distributed; there are always many small clusters
- Hierarchical CW could be used to lower the number of clusters while staying within POS distinctions
23. Outlook: Constructing a POS tagger
- Using word clusters to initialize a POS tagger
- Evaluation based on types instead of tokens
- Open questions:
  - Context window backoff model for unknown words
  - Leave out or take in unclustered high-frequency words (as singletons)?
  - Can the many classes per POS be unified using tagger behaviour?
24. Weakly Supervised Methods
- Weakly supervised means:
  - very little training data and prior knowledge
  - learning from labeled and unlabeled data
  - bootstrapping methods
- Advantages:
  - very little input, still cheap
  - no labeling problem
  - easier to evaluate
- Disadvantages:
  - subject to error propagation
  - stopping criterion difficult to define
25. Bootstrapping of lexical items
- For learning by bootstrapping, two things are needed: a start set of some known items with classes, and a rule set that states how more information can be obtained using known items.
- Generic bootstrapping algorithm (sketch below):
  - Knowledge = {}
  - New = Start_set
  - While |New| > 0:
    - Knowledge += New
    - New = {}
    - New = items newly found using Knowledge and Rule_set, minus known items
(Figure: items and new items per iteration - a phase of growth is followed by a phase of exhaustion.)
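The generic loop as a minimal Python sketch; find_new stands for applying the rule set to the corpus and is a placeholder:

    def bootstrap(start_set, find_new):
        knowledge = set()
        new = set(start_set)
        while new:                                  # growth, then exhaustion
            knowledge |= new
            new = find_new(knowledge) - knowledge   # keep only genuinely new items
        return knowledge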
26. Benefits and Bothers of Bootstrapping
- Pro:
  - Only small start sets (seeds) are needed; these can be prepared rapidly
  - The process needs no further supervision (weakly supervised learning)
- Cons:
  - Danger of error propagation
  - When to stop is unclear
27. Patterns for word classes and their relations
- Examples of word classes in text:
  - Islands: on the island of Cuba ..., Caribbean island of Trinidad
  - Companies: the ACME Ltd. Incorporated
  - Verbs of utterance: she said <something>
  - Person names: John W. Smith, Ellen Meyer
- Observations:
  - Words belonging to the same class can be interchanged without breaking the relation
  - Sometimes there are no trigger words
28. Problem definition
- Let Ri ⊆ A1 × ... × An be n-ary relations over word sets A1, ..., An.
- Given:
  - some elements of the sets A1, ..., An
  - a large corpus
- Needed:
  - the sets A1, ..., An
  - whether (a1, ..., an) ∈ Ri
  - the rules necessary for classification
29. Pattern Matching Rules
- Annotate text with known items and flat features (tagging is nice, but tagsets of 4 tags will do for English):
  "... said Jonas Berger , who ..." becomes "... LC UC LN PM LC ..."
- Use rules like UC LN -> FN and FN UC -> LN to classify "Jonas" as a first name
- Rules of this kind are weak hypotheses because they sometimes misclassify, e.g. in
  - "As Berger turned over, ..."
  - "... tickets at Agency Berger, Munich."
- => Rules alone are not sufficient. (A sketch of rule application follows below.)
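A sketch of such weak bigram rules; the rule encoding (which position gets rewritten) is my own reading of the slide's notation:

    def apply_rules(tags, rules):
        # rules: (left_tag, right_tag) -> (position_to_rewrite, new_tag), e.g.
        #   ('UC', 'LN'): (0, 'FN')   uppercase word before a last name is a first name
        #   ('FN', 'UC'): (1, 'LN')   uppercase word after a first name is a last name
        out = list(tags)
        for i in range(len(tags) - 1):
            hit = rules.get((tags[i], tags[i + 1]))
            if hit:
                pos, new_tag = hit
                out[i + pos] = new_tag
        return out

    # apply_rules(['LC','UC','LN','PM','LC'], {('UC','LN'): (0,'FN')})
    # -> ['LC','FN','LN','PM','LC'], classifying "Jonas" as a first name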
30. Pendulum Algorithm: Bootstrapping with verification
- Initialize Knowledge, Rules, New_items
- While |New_items| > 0:
  - Last_new_items = New_items; New_items = {}
  - for all i in Last_new_items:
    - fetch text containing i from the corpus (search step)
    - find candidates in the text by using Knowledge and Rules
    - verify each candidate k (verification step):
      - fetch text containing k
      - rate k on the basis of that text
  - New_items = candidates with high ratings
  - Knowledge += New_items
Quasthoff, U., Biemann, C., Wolff, C.: Named entity learning and verification: EM in large corpora. In: Proceedings of CoNLL-2002, The Sixth Workshop on Computational Language Learning, 31 August - 1 September 2002, in association with COLING 2002, Taipei, Taiwan.

Biemann, C., Böhm, K., Quasthoff, U., Wolff, C.: Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations. Proc. I-KNOW '03, International Conference on Knowledge Management, Graz, and Journal of Universal Computer Science (JUCS), Volume 9, Number 6, pp. 530-541, June 2003.
31. Explanations on the Pendulum
- The same rules are used for both search and verification of candidates
- Previously known and previously learned items are used for both search and verification of candidates
- A word is only taken into the knowledge base if it occurs in the corpus with its classification
  - multiple times, and
  - at a high rate
32. Example: island names and island specifiers (figure)
33. Results: German Person Names
- Start set and prior knowledge: 9 first names, 10 last names, 15 rules, 12 regexps for titles
- Corpus: Projekt Deutscher Wortschatz, 36 million sentences
- Found 42,000 items, of which 74% LN (Prec > 99%), 15% FN (Prec > 80%), 11% TIT (Prec > 99%)
34. Extending a semantic lexicon using co-occurrences and HaGenLex
- Size for nouns: about 13,000
- 50 semantic classes for nouns are constructed from allowed combinations of
  - 16 binary semantic features, e.g. HUMAN, ARTIFICIAL, ...
  - 17 ontological sorts, e.g. concrete, abstract-situation, ...
WORD            SEMANTIC CLASS
Aggressivität   nonment-dyn-abs-situation
Agonie          nonment-stat-abs-situation
Agrarprodukt    nat-discrete
Ägypter         human-object
Ahn             human-object
Ahndung         nonment-dyn-abs-situation
Ähnlichkeit     relation
Airbag          nonax-mov-art-discrete
Airbus          mov-nonanimate-con-potag
Airport         art-con-geogr
Ajatollah       human-object
Akademiker      human-object
Akademisierung  nonment-dyn-abs-situation
...             ...
35. Underlying Assumptions
- Harris 1968, Distributional Hypothesis: semantic similarity is a function over the global contexts of words. The more similar the contexts, the more similar the words.
- Projected onto nouns and adjectives: nouns of similar semantic classes are modified by similar adjectives
36. Neighbouring Co-occurrences and Profiles
- Neighbouring co-occurrence: a pair of words that occur next to each other more often than expected under the assumption of statistical independence
- The neighbouring co-occurrence relation between adjectives as left neighbours and nouns as right neighbours approximates typical head-modifier structures
- The set of adjectives that co-occur significantly often to the left of a noun is called its adjective profile (the noun profile of an adjective is defined analogously); a sketch of the profile construction follows below
- For the experiments, I used the most recent German corpus of Projekt Deutscher Wortschatz, 500 million tokens
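A sketch of the profile construction, assuming a list of significant (adjective, noun) left-neighbour pairs as input:

    from collections import defaultdict

    def build_profiles(pairs):
        adjective_profile = defaultdict(set)   # noun -> adjectives occurring to its left
        noun_profile = defaultdict(set)        # adjective -> nouns occurring to its right
        for adj, noun in pairs:
            adjective_profile[noun].add(adj)
            noun_profile[adj].add(noun)
        return adjective_profile, noun_profile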
37. Example: neighbouring profiles (figure)
- Amounts: 160,000 nouns, 23,400 adjectives
38. Mechanism of Inheritance
Which class is assigned to N4 in the next step?
- Algorithm:
  - Initialize adjective and noun profiles
  - Initialize the start set
  - As long as new nouns get classified:
    - calculate class probabilities for each adjective
    - for all yet unclassified nouns n:
      - multiply the class probabilities of the modifying adjectives, per class
      - assign the class with the highest probability to n
- Class probabilities per adjective:
  - count the classes of the already classified nouns the adjective modifies
  - normalize by the total number of classified nouns it modifies
  - normalize to 1
  (a sketch follows below)
39. Experimental Data
- 5,133 nouns satisfy minAdj = 5, which means a maximal recall of 84.9%
- In all experiments, 10-fold cross-validation was used
40. Results: Global Classification
- Classification was carried out directly on the 50 semantic classes
- The different measuring points correspond to the parameters minAdj ∈ {5, 10, 15, 20} and maxClass ∈ {2, 5, 50}
- Results too poor for lexicon extension
41. Combining Single Classifiers
- Architecture: binary classifiers for single features, then combining the outcomes (selection sketch below)
- Parameters: minAdj = 5, maxClass = 2
(Figure: 16 binary feature classifiers (ANIMAL +/-, ANIMATE +/-, ARTIF +/-, AXIAL +/-, ...) and 17 sort classifiers (ab +/-, abs +/-, ad +/-, as +/-, ...) feed a selection step: compatible semantic classes that are minimal w.r.t. the hierarchy and unambiguous. The result is a class, or a reject.)
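A sketch of the selection step, assuming each class is described by its feature/sort values; the minimality check w.r.t. the hierarchy is omitted:

    def combine_classifiers(feature_votes, class_features):
        # feature_votes: feature -> '+'/'-' as decided by the binary classifiers
        # class_features: class -> dict feature -> '+'/'-'
        compatible = [c for c, feats in class_features.items()
                      if all(feats.get(f) == v for f, v in feature_votes.items())]
        return compatible[0] if len(compatible) == 1 else None   # class or reject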
42. Results: Single Semantic Features
- For bias > 0.05: good to excellent precision
- Total precision: 93.8% (86.8% for feature value +)
- Total recall: 70.7% (69.2% for feature value +)
43. Results: Ontological Sorts
- For bias > 0.10: good to excellent precision
- Total precision: 94.1% (89.5% for sort value +)
- Total recall: 73.6% (69.6% for sort value +)
44. Results: Combined Semantic Classes
- No visible connection between class size and results
- Total precision: 82.3%
- Total recall: 32.8%
- Number of newly classified nouns: 8,500 (13,000 with minAdj = 2)
45. Typical mistakes
- Pflanze (plant): animal-object instead of plant-object
  - zart, fleischfressend, fressend, verändert, genmanipuliert, transgen, exotisch, selten, giftig, stinkend, wachsend, ...
- Nachwuchs (offspring): human-object instead of animal-object
  - wissenschaftlich, qualifiziert, akademisch, eigen, talentiert, weiblich, hoffnungsvoll, geeignet, begabt, journalistisch, ...
- Café (café): art-con-geogr instead of nonmov-art-discrete (cf. Restaurant)
  - Wiener, klein, türkisch, kurdisch, romanisch, cyber, philosophisch, besucht, traditionsreich, schnieke, gutbesucht, ...
- Neger (negro): animal-object instead of human-object
  - weiß, dreckig, gefangen, faul, alt, schwarz, nackt, lieb, gut, brav
- but:
  - Skinhead (skinhead): human-object (ok)
  - 16-, 17-, 18-, 19-, 20-, 21-, 22-, 23-, 30-jährig, gleichaltrig, zusammengeprügelt, rechtsradikal, brutal
- In most cases the wrong class is semantically close. The evaluation metrics did not account for that.
Biemann, C., Osswald, R. (2005): Automatic Extension of Feature-based Semantic Lexicons via Contextual Attributes, Proceedings of the 29th Annual Meeting of the GfKl, Magdeburg, 2005.
46. Extending CoreNet, a Korean WordNet
- CoreNet characteristics:
  - Rather large groups of words per concept, as opposed to the fine-grained WordNet structure
  - The same concept hierarchy is used for all word classes
- Size of the KAIST Korean corpus:
  - 38 million tokens
  - 2.3 million sentences
  - 3.8 million types
47. Pendulum Algorithm on co-occurrences
- LastLearned = StartSet
- Knowledge = StartSet
- NewLearned = {}
- while |LastLearned| > 0:
  - for all i in LastLearned:
    - Candidates = getCooccurrences(i) (search step)
    - for all c in Candidates:
      - VerifySet = getCooccurrences(c) (verification step)
      - if |VerifySet ∩ Knowledge| > threshold:
        - NewLearned += c
        - Knowledge += c
  - LastLearned = NewLearned
  - NewLearned = {}
(A Python sketch follows below.)
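A runnable sketch of this pseudocode; getCooccurrences stands for a lookup of significant co-occurrences in the corpus, and the default threshold is illustrative:

    def pendulum(start_set, get_cooccurrences, threshold=3):
        knowledge = set(start_set)
        last_learned = set(start_set)
        while last_learned:
            new_learned = set()
            for i in last_learned:
                for c in get_cooccurrences(i):          # search step
                    if c in knowledge:
                        continue
                    verify_set = get_cooccurrences(c)   # verification step
                    if len(verify_set & knowledge) > threshold:
                        new_learned.add(c)
                        knowledge.add(c)
            last_learned = new_learned
        return knowledge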
48. Sample step (figure: Korean examples)
- Seed
- Search with the seed yields (amongst others) further candidates
- Verify the candidates
49. Evaluation
- Selection of concepts performed by a non-Korean speaker
- Evaluation performed manually; only new words counted
- Heuristics for avoiding result-set infection (sketch below):
  - iteratively lower the verification threshold from 8 down to 3 until the result set gets too large
  - take the lowest threshold whose result set has a reasonable size (not exceeding the start set)
- A typical run needed 3-7 iterations to converge
Biemann, C., Shin, S.-I., Choi, K.-S. (2004): "Semiautomatic Extension of CoreNet using a Bootstrapping Mechanism on Corpus-based Co-occurrences", Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004), Geneva, Switzerland.
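A sketch of the threshold heuristic, reusing the pendulum sketch from slide 47; "reasonable size" is taken from the slide as new items not exceeding the start set:

    def adaptive_threshold_run(start_set, get_cooccurrences):
        best = set(start_set)
        for threshold in range(8, 2, -1):        # 8 down to 3, as on the slide
            result = pendulum(start_set, get_cooccurrences, threshold)
            if len(result - set(start_set)) <= len(start_set):
                best = result                    # lowest acceptable threshold so far
            else:
                break                            # result set too large; stop lowering
        return best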
50. Results
- Not enough for automatic extension, but a good
source for candidates
51. Problems ... and possible solutions
- Coverage is low: increase corpus size for relevant domains; make use of other features, e.g. patterns
- Precision is not satisfactory: obtain multiple concepts simultaneously; meta-level bootstrapping; make use of other features, e.g. POS tags for word class information
- This work gives a baseline of what is reachable without employing language-dependent features
52. From Text to Ontologies
(Figure: pipeline from texts to ontologies - texts are sorted by language (lang. 1, lang. 2, ..., lang. n); word classes are assigned, yielding text with POS labels; semantic properties are assigned to words; patterns are determined and word pairs extracted, yielding typed relations and instances.)
53. Questions?
57. Abstract
- Methods are introduced that find sets of words that have something in common, by means of corpus analysis. With the objective of largely automating the task and putting the knowledge into algorithms instead of training sets, two kinds of methods can be distinguished: completely unsupervised methods (clustering) and weakly supervised methods (bootstrapping).
- Two unsupervised variants of standard preprocessing steps will be discussed, namely language identification and part-of-speech tagging. Both employ a novel, efficient graph clustering algorithm.
- After a general introduction to bootstrapping, which needs only a minimal training set, three bootstrapping experiments will be described: gazetteer construction for Named Entity Recognition, extension of a semantic lexicon, and expansion of a lexical-semantic word net.
- Follow-ups on the latter two can give rise to automatic ontology creation and extension.