SIMS 290-2: Applied Natural Language Processing


Slides: 46

Provided by: coursesIs8

Learn more at: https://courses.ischool.berkeley.edu


1
SIMS 290-2 Applied Natural Language Processing
Marti Hearst October 25, 2004
2
Next Few Classes

This week: lexicons and ontologies
Today: WordNet's structure, computing term similarity
Wed: Guest lecture by Prof. Charles Fillmore on FrameNet
Next week: Enron labeling in class
The entire assignment will be due on Nov 15
Following week: Question-Answering

3
Text Categorization Assignment

Great job, you learned a lot!
Comparing to a baseline
Selecting features
Comparing relative usefulness of features
Training, testing, cross-validation
I learned a lot too! (from your results)
(I'll send you your feedback today)

4
Text Categorization Assignment

Features
Boosting weights of terms in subject line is
helpful.
Stemming does help in some circumstances (often
works well with SVM, for example), but not
always.
Counter-intuitively, stemming can increase the
number of features in our implementation, because
it increases how many terms pass the
minimum-document-occurrence cutoff.
An example of the Porter stemmer preserving a
difference it might otherwise have hidden: it converts
"gaseous" to "gase", and so does not conflate "gas"
(fuel for motorcycles) with "gaseous" (for the
science group).
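
The counter-intuitive effect of stemming on the feature count can be sketched as follows. This is only an illustration: toy_stem is a hypothetical suffix stripper standing in for a real stemmer such as Porter's, and the three documents are invented.

```python
from collections import Counter

def toy_stem(token):
    """Hypothetical suffix stripper; a stand-in for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def surviving_features(docs, min_docs, stem=False):
    """Return the terms that occur in at least min_docs documents."""
    doc_freq = Counter()
    for doc in docs:
        # A set, so each term counts once per document.
        terms = {toy_stem(t) if stem else t for t in doc.lower().split()}
        doc_freq.update(terms)
    return {term for term, df in doc_freq.items() if df >= min_docs}

# Invented documents: each inflected form of "stall" appears in only one of them.
docs = ["engine stalls", "engine stalled", "stalling engine"]

unstemmed = surviving_features(docs, min_docs=2)
stemmed = surviving_features(docs, min_docs=2, stem=True)
print(len(unstemmed), len(stemmed))
```

Without stemming, each variant of "stall" falls below the cutoff on its own; after stemming, the merged form passes it, so the feature set grows.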

5
Text Categorization Assignment

Features
Tokens that contain more than just the default
alphabetic characters are helpful, perhaps in
part because they capture domain name information, but
also because they capture technical terms.
It's probably best to use the Weka feature
selector to tell you what kind of features are
performing well, but not to select those for use
exclusively.
I'm surprised that no one tried bigrams or
noun-noun compounds as features.

6
Text Categorization Assignment

Feature Weighting
Tf.idf Almost everyone who tried it found it was
raw term frequency (there were exceptions).
Binary feature weights with document count
minimum thresholds can be a good substitute.
An interesting variation on tf.idf is to do it in
a class-based manner:
weight terms higher if they occur in only one class
rather than in the others.
A couple of students tried this and got good
results on the diverse comparison, but less good
on the homogeneous one. This makes sense, since the
measure would not help as much in distinguishing
similar newsgroups that share many terms.
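
A minimal sketch of the two weighting schemes, under stated assumptions. The slides do not give the students' exact class-based formula, so class_based_weights below is one hypothetical reading (inverse class frequency in place of inverse document frequency), and the tiny labeled collection is invented.

```python
import math
from collections import Counter

# Invented training data: class label -> list of tokenized documents.
LABELED = {
    "autos": [["engine", "oil", "price"], ["engine", "tire"]],
    "space": [["orbit", "engine", "launch"], ["orbit", "price"]],
}

# Document frequency: in how many documents a term occurs.
doc_freq = Counter(t for docs in LABELED.values() for d in docs for t in set(d))
n_docs = sum(len(docs) for docs in LABELED.values())

# Class frequency: in how many classes a term occurs.
class_freq = Counter(
    t for docs in LABELED.values() for t in {tok for d in docs for tok in d}
)
n_classes = len(LABELED)

def tfidf(tokens):
    """Standard tf.idf: term frequency times log inverse document frequency."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}

def class_based_weights(tokens):
    """Hypothetical class-based variant: terms seen in fewer classes weigh more."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_classes / class_freq[t]) for t in tf}

doc = ["orbit", "engine", "orbit"]
w = tfidf(doc)
cw = class_based_weights(doc)
print(w, cw)
```

Under this reading, a term shared by every class (here "engine") gets weight zero, which is consistent with the observation that the measure helps less when newsgroups share many terms.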

7
Text Categorization Assignment

Classifiers
Naïve Bayes Multinomial (NBM) was a clear winner
SVM worked well most of the time, but not as well
as NBM
Naive Bayes seemed to be more robust to unseen
information; the kernel estimator seems to
improve on the default Naive Bayes settings.
VotedPerceptron worked very well, but it only does
binary classification, so people who found that it did
very well on diverse did not transfer it to
homogeneous.

8
Today

Lexicons, Semantic Nets and Ontologies
The Structure of WordNet
Computing Similarities
Automatic Acquisition of New Terms

9
Lexicons, Semantic Nets, and Ontologies

Lexicons are (typically) word lists augmented
with some subset of
Parts-of-speech
Different word senses
Synonyms
Semantic Nets
Include links to other terms
IS-A, Part-Of, etc.
Sometimes this term is used for what I call
ontologies
Ontologies
Represent concepts and relationships among
concepts
Language independent (in principle)
Sometimes include inference rules
Different from the definition in philosophy:
the science of what is, of the kinds and
structures of objects, properties, events,
processes and relations in every area of reality

10
One approach to linking ontologies and lexicons
Formal Domain Ontology
Cassandra Linguistic Ontology
MEDRA
11
Example Ontological Relation Types
HAS-PARTIAL-SPATIAL-OVERLAP
12
Example of applying an ontology: joint anatomy

joint HAS-HOLE joint space
joint capsule IS-OUTER-LAYER-OF joint
meniscus IS-INCOMPLETE-FILLER-OF joint space
meniscus IS-TOPO-INSIDE joint capsule
meniscus IS-NON-TANGENTIAL-MATERIAL-PART-OF joint
joint IS-CONNECTOR-OF bone X
joint IS-CONNECTOR-OF bone Y
synovia IS-INCOMPLETE-FILLER-OF joint space
synovial membrane IS-BONAFIDE-BOUNDARY-OF joint space

This doesn't include the linguistic side
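
One simple way to hold relations like these is as subject-relation-object triples. The sketch below is illustrative only: attaching each relation to the term listed just above it is an assumed reading of the slide, and the helper names are hypothetical.

```python
# (subject, relation, object) triples for the joint anatomy example.
TRIPLES = [
    ("joint", "HAS-HOLE", "joint space"),
    ("joint capsule", "IS-OUTER-LAYER-OF", "joint"),
    ("meniscus", "IS-INCOMPLETE-FILLER-OF", "joint space"),
    ("meniscus", "IS-TOPO-INSIDE", "joint capsule"),
    ("meniscus", "IS-NON-TANGENTIAL-MATERIAL-PART-OF", "joint"),
    ("joint", "IS-CONNECTOR-OF", "bone X"),
    ("joint", "IS-CONNECTOR-OF", "bone Y"),
    ("synovia", "IS-INCOMPLETE-FILLER-OF", "joint space"),
    ("synovial membrane", "IS-BONAFIDE-BOUNDARY-OF", "joint space"),
]

def subjects(relation, obj):
    """All subjects that stand in `relation` to `obj`."""
    return [s for s, r, o in TRIPLES if r == relation and o == obj]

def relations_of(subject):
    """All (relation, object) pairs stated for `subject`."""
    return [(r, o) for s, r, o in TRIPLES if s == subject]

fillers = subjects("IS-INCOMPLETE-FILLER-OF", "joint space")
print(fillers)
print(relations_of("joint"))
```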
13
Linking Lexicons and Ontologies
[Diagram: concept graph with nodes Generalised Possession, Healthcare phenomenon, Human, Having a healthcare phenomenon, Patient, Risk Factor, Patient at risk, Patient at risk for osteoporosis, Risk factor for osteoporosis, and Osteoporosis; link types IS-A, Has-possessor, Has-possessed, Is-possessor-of, Is-Risk-Factor-Of, and Has-Healthcare-phenomenon]
14
Linking different lexicons
[Diagram: lexicon terms Snomed-RT Convulsion, Snomed-RT Seizure, MESH-2001 Seizures, and MESH-2001 Convulsions; concepts LC Health crisis, LC Convulsion, LC Seizure, and LC Epileptic convulsion; link types ISA, IS-narrower-than, Has-CCC, and IS-A]
15
WordNet

A big lexicon with properties of a semantic net
Started as a language project by Dr. George Miller
and Dr. Christiane Fellbaum at Princeton
First became available in 1990
Now on version 2.0

16
WordNet

Huge amounts of research (and products) use it

17
WordNet Relations

Original core relations
Synonymy
Polysemy
Metonymy
Hyponymy/Hyperonymy
Meronymy
Antonymy
New, useful additions for NLP
Glosses
Links between derivationally and semantically
related noun/verb pairs.
Domain/topical terms
Groups of similar verbs
Others on the way
Disambiguation of terms in glosses
Topical clustering.

18
Synonymy

Different ways of expressing related concepts
Examples
cat, feline, Siamese cat
Synonyms are almost never truly substitutable
Used in different contexts
Have different implications
This is a point of contention.

19
Polysemy

Most words have more than one sense
Homonym: same word, different meaning
bank (river)
bank (financial)
Polysemy: different senses of the same word
That dog has floppy ears.
She has a good ear for jazz.
bank (financial) has several related senses:
the building, the institution, the notion of
where money is stored

20
Metonymy

Use one aspect of something to stand for the
whole
The building stands for the institution of the
bank.
Newscast: "The White House released new figures
today."
Waitperson: "The ham sandwich spilled his drink."

21
Hyponymy/Hyperonymy

ISA relation
Related to Superordinate and Subordinate level
categories
hyponym(robin,bird)
hyponym(bird,animal)
hyponym(emu,bird)
A is a hypernym of B if B is a type of A
A is a hyponym of B if A is a type of B
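
The hyponym facts above can be stored as a child-to-parent map and followed transitively. This is a toy sketch with hypothetical helper names, not WordNet's own data structure.

```python
# Direct is-a links from the examples: child -> parent.
HYPERNYM_OF = {"robin": "bird", "emu": "bird", "bird": "animal"}

def hypernym_chain(term):
    """Follow is-a links upward and return every hypernym of `term`, nearest first."""
    chain = []
    while term in HYPERNYM_OF:
        term = HYPERNYM_OF[term]
        chain.append(term)
    return chain

def is_hyponym(a, b):
    """True if a is a type of b, directly or through intermediate links."""
    return b in hypernym_chain(a)

print(hypernym_chain("robin"))
print(is_hyponym("emu", "animal"), is_hyponym("bird", "robin"))
```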

22
Meronymy

Parts-of relation
part of(beak, bird)
part of(bark, tree)
Transitive conceptually but not lexically
The knob is a part of the door.
The door is a part of the house.
? The knob is a part of the house ?

23
Antonymy

Lexical opposites
antonym(large, small)
antonym(big, small)
antonym(big, little)
but not large, little
Many antonymous relations can be reliably
detected by looking for statistical correlations
in large text collections. (Justeson & Katz 91)

24
Using WordNet in Python
from wordnet import *
from wntools import *
26
More Readable Output
27
Using WordNet to Determine Similarity

The meet function in the python wordnet tool
finds the closest common parent to two terms
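
The idea of a closest common parent can be sketched on a toy hierarchy. This does not call the Python WordNet tool itself: the hierarchy and the function closest_common_parent are hypothetical, and each term is assumed to have a single parent.

```python
# Hypothetical toy is-a hierarchy: child -> parent.
PARENT = {
    "dog": "canine", "wolf": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore", "carnivore": "animal",
}

def ancestors(term):
    """Return [term, parent, grandparent, ...] up to the root."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def closest_common_parent(a, b):
    """First node on b's path to the root that also lies on a's path."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    return None

print(closest_common_parent("dog", "wolf"))
print(closest_common_parent("dog", "cat"))
```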

28
Similarity by Path Length

Count the edges (is-a links) between two concepts
and scale
Leacock and Chodorow, 1998
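$\mathrm{lch}(c_1,c_2) = -\log\left(\frac{\mathrm{length}(c_1,c_2)}{2 \cdot \text{max-depth}}\right)$
Wu and Palmer, 1994
$\mathrm{wup}(c_1,c_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{lcs}(c_1,c_2))}{\mathrm{depth}(c_1) + \mathrm{depth}(c_2)}$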
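
A minimal sketch of both measures on a toy single-parent hierarchy. Two conventions here are assumptions rather than something the slide states: path length is counted in nodes (edges + 1) so that identical concepts do not produce log 0, and the root is given depth 1.

```python
import math

# Hypothetical toy is-a hierarchy: child -> parent.
PARENT = {
    "dog": "canine", "wolf": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore", "carnivore": "animal",
}

def path_to_root(c):
    path = [c]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(c):
    """Number of nodes from c up to the root (assumption: root has depth 1)."""
    return len(path_to_root(c))

def lcs(c1, c2):
    """Lowest common subsumer: deepest node shared by both paths to the root."""
    seen = set(path_to_root(c1))
    return next(n for n in path_to_root(c2) if n in seen)

def length(c1, c2):
    """Path length through the lcs, counted in nodes (assumption: edges + 1)."""
    common = depth(lcs(c1, c2))
    return (depth(c1) - common) + (depth(c2) - common) + 1

MAX_DEPTH = max(depth(c) for c in set(PARENT) | set(PARENT.values()))

def lch(c1, c2):
    return -math.log(length(c1, c2) / (2 * MAX_DEPTH))

def wup(c1, c2):
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))

print(lch("dog", "wolf"), lch("dog", "cat"))
print(wup("dog", "wolf"), wup("dog", "cat"))
```

Both measures rank dog closer to wolf than to cat in this hierarchy, because the dog/wolf path is shorter and their common parent is deeper.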

29
Problems with Path Length

The lengths of the paths are irregular across the
hierarchies
Words that should be in the same hierarchy
might not be
How to relate terms that are not in the same
hierarchies?
The tennis problem
Player
Racquet
Ball
Net
Are all in separate hierarchies
WordNet is working on developing such linkages

34
Similarity by Information Content

IC estimated from a corpus of text (Resnik, 1995)
$\mathrm{IC}(\text{concept}) = -\log P(\text{concept})$
Specific concept: high IC (pitchfork)
General concept: low IC (instrument)
To estimate it:
Count occurrences of each concept
Given a word, increment the count of all concepts
associated with that word
e.g., increment bank as financial institution and also
as river shore.
Assume that senses occur uniformly, lacking
evidence to the contrary (e.g., sense-tagged
text)
Counts propagate up the hierarchy
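
The estimation steps can be sketched as below. The hierarchy, sense inventory, and word counts are invented, and splitting each word's count equally among its senses is one assumed reading of "senses occur uniformly".

```python
import math
from collections import defaultdict

# Hypothetical concept hierarchy (child -> parent) and word-to-sense inventory.
PARENT = {
    "financial_institution": "institution",
    "river_shore": "slope",
    "institution": "entity",
    "slope": "entity",
}
SENSES = {
    "bank": ["financial_institution", "river_shore"],
    "shore": ["river_shore"],
    "institution": ["institution"],
}

def concept_counts(word_counts):
    """Credit each word's count to its senses and to every ancestor of each sense."""
    counts = defaultdict(float)
    for word, n in word_counts.items():
        senses = SENSES[word]
        share = n / len(senses)  # assumption: uniform split across senses
        for concept in senses:
            node = concept
            while node is not None:
                counts[node] += share  # counts propagate up the hierarchy
                node = PARENT.get(node)
    return counts

def information_content(counts, root="entity"):
    """IC(concept) = -log P(concept), with P estimated relative to the root count."""
    total = counts[root]
    return {c: -math.log(n / total) for c, n in counts.items()}

counts = concept_counts({"bank": 4, "shore": 2, "institution": 2})
ic = information_content(counts)
print(dict(counts))
print(ic)
```

The root receives every count and so has zero information content, while the more specific financial_institution concept ends up with higher IC than its parent institution.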

35
Information Content as Similarity

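Resnik, 1995
$\mathrm{res}(c_1,c_2) = \mathrm{IC}(\mathrm{lcs}(c_1,c_2))$
Jiang and Conrath, 1997
$\mathrm{jcn}(c_1,c_2) = \frac{1}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2) - 2 \cdot \mathrm{res}(c_1,c_2)}$
Lin, 1998
$\mathrm{lin}(c_1,c_2) = \frac{2 \cdot \mathrm{res}(c_1,c_2)}{\mathrm{IC}(c_1) + \mathrm{IC}(c_2)}$
All of these (and more!) are implemented in a
Perl package
called SenseRelate, Pedersen et al.
http://wn-similarity.sourceforge.net/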
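
A small sketch of the three measures, independent of the Perl package. The IC values and the three-level hierarchy are invented for illustration; in practice the IC values would come from corpus counts.

```python
# Hypothetical hierarchy (child -> parent) and made-up information content values.
PARENT = {"pitchfork": "instrument", "rake": "instrument", "instrument": "entity"}
IC = {"entity": 0.0, "instrument": 1.0, "rake": 3.0, "pitchfork": 4.0}

def path_to_root(c):
    path = [c]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def lcs(c1, c2):
    """Lowest common subsumer of c1 and c2 in the toy hierarchy."""
    seen = set(path_to_root(c1))
    return next(n for n in path_to_root(c2) if n in seen)

def res(c1, c2):
    return IC[lcs(c1, c2)]

def jcn(c1, c2):
    # Undefined (division by zero) when c1 and c2 are the same concept.
    return 1 / (IC[c1] + IC[c2] - 2 * res(c1, c2))

def lin(c1, c2):
    return 2 * res(c1, c2) / (IC[c1] + IC[c2])

print(res("pitchfork", "rake"), jcn("pitchfork", "rake"), lin("pitchfork", "rake"))
```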

36
Rearranging WordNet

Try to fix the top-level hierarchies
Parse the glosses for more information
eXtended WordNet project
http://xwn.hlt.utdallas.edu/

37
Augmenting WordNet

Lexico-syntactic Patterns (Hearst 92, 97)
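
As a rough illustration, the "such as" pattern from Hearst 92 (NP such as NP, NP, and NP) can be approximated with a regular expression. This sketch matches single words rather than parsed noun phrases, which is a simplifying assumption, and the example sentence is invented.

```python
import re

# "<hypernym> such as <item>, <item>, and <item>" over plain words.
SUCH_AS = re.compile(r"(\w+),? such as ([\w ,]+)")
# Separators inside the list: commas, optionally followed by "and"/"or".
LIST_SEP = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")

def extract_hyponyms(sentence):
    """Return (hyponym, hypernym) pairs suggested by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        for item in LIST_SEP.split(match.group(2)):
            item = item.strip()
            if item:
                pairs.append((item, hypernym))
    return pairs

pairs = extract_hyponyms(
    "The clinic treats injuries such as bruises, wounds, and fractures."
)
print(pairs)
```

A word-level regex like this over-generates when the list is followed by more text in the same clause; a noun-phrase chunker would be the more careful choice.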

42
Acquisition using the Web

Towards Terascale Knowledge Acquisition, Pantel
and Lin 04
Use co-occurrence model and a huge collection
(the Web) to find similar terms
Input: a cluster of related words
Feature vectors computed for each word
e.g., the context "catch ___"
Compute mutual information between the word and
the context
Average the features for each class to create a
grammatical template for each class
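
The feature and template steps might look roughly like this. It is a sketch under assumptions: pointwise mutual information is used as the word-context score, the (word, context) observations are invented, and the function names are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter

def pmi_vectors(pairs):
    """Map each word to {context: pointwise mutual information}."""
    pair_counts = Counter(pairs)
    word_counts = Counter(w for w, _ in pairs)
    ctx_counts = Counter(c for _, c in pairs)
    total = len(pairs)
    vectors = {}
    for (w, c), n in pair_counts.items():
        # log( P(w, c) / (P(w) * P(c)) ), estimated from raw counts
        pmi = math.log(n * total / (word_counts[w] * ctx_counts[c]))
        vectors.setdefault(w, {})[c] = pmi
    return vectors

def class_template(vectors, cluster):
    """Average the feature vectors of the words in a cluster."""
    template = Counter()
    for w in cluster:
        for c, v in vectors[w].items():
            template[c] += v / len(cluster)
    return dict(template)

# Invented (word, context) observations.
observations = [
    ("trout", "catch ___"), ("trout", "grill ___"),
    ("salmon", "catch ___"), ("salmon", "smoke ___"),
    ("car", "drive ___"), ("car", "wash ___"),
]
vectors = pmi_vectors(observations)
template = class_template(vectors, ["trout", "salmon"])
print(template)
```

Contexts shared across the cluster keep their full score in the averaged template, while contexts seen with only one member are halved.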

43
Acquisition using the Web
Use this template to find new examples of this
class of terms (but it makes many errors)
44
Next Time
