Title: SIMS 290-2: Applied Natural Language Processing
1SIMS 290-2 Applied Natural Language Processing
Marti Hearst October 25, 2004
2Next Few Classes
- This week: lexicons and ontologies
- Today
- WordNet's structure, computing term similarity
- Wed
- Guest lecture: Prof. Charles Fillmore on FrameNet
- Next week: Enron labeling in class
- The entire assignment will be due on Nov 15
- Following week: Question-Answering
3Text Categorization Assignment
- Great job, you learned a lot!
- Comparing to a baseline
- Selecting features
- Comparing relative usefulness of features
- Training, testing, cross-validation
- I learned a lot too! (from your results)
- (I'll send you your feedback today)
4Text Categorization Assignment
- Features
- Boosting weights of terms in the subject line is helpful.
- Stemming does help in some circumstances (often works well with SVM, for example), but not always.
- Counter-intuitively, stemming can increase the number of features in our implementation, because it increases how many terms pass the minimum-document-occurrence cutoff.
- An example of the Porter stemmer preserving a distinction it might otherwise hide: it converts "gaseous" to "gase", so "gas" (fuel for motorcycles) is not conflated with "gaseous" (from the science group).
5Text Categorization Assignment
- Features
- Terms containing more than just the default alphabetical characters are helpful, perhaps partly because they capture domain name information, but also because they capture technical terms.
- It's probably best to use the Weka feature selector to tell you what kind of features are performing well, but not to select those for use exclusively.
- I'm surprised that no one tried bigrams or noun-noun compounds as features.
6Text Categorization Assignment
- Feature Weighting
- Tf.idf: almost everyone who tried it found it was better than raw term frequency (there were exceptions).
- Binary feature weights with document count minimum thresholds can be a good substitute.
- An interesting variation on tf.idf is to do it in a class-based manner.
- Weight terms higher that only occur in one class vs. the others.
- A couple of students tried this and got good results on the diverse comparison, but less good on the homogeneous. This makes sense, since the measure would not help as much in distinguishing similar newsgroups that share many terms.
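The two weighting schemes discussed on this slide can be sketched in a few lines. This is only an illustration, assuming the common variant weight = tf * log(N / df); it is not the Weka implementation used in the assignment, and the function names and the three toy documents are made up.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes weight = tf * log(N / df), one common tf.idf variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

def binary_weights(docs, min_doc_count=2):
    """Binary features, keeping only terms seen in at least min_doc_count documents."""
    df = Counter(term for doc in docs for term in set(doc))
    kept = {t for t, count in df.items() if count >= min_doc_count}
    return [{t: 1 for t in set(doc) if t in kept} for doc in docs]

# Made-up toy documents, for illustration only.
docs = [
    "the gas tank on my motorcycle".split(),
    "gas prices and the motorcycle market".split(),
    "the orbit of the space shuttle".split(),
]
tfidf = tfidf_weights(docs)
binary = binary_weights(docs)
```

A term that occurs in every document (here "the") gets a tf.idf weight of zero, while the document-count threshold drops terms that occur in only one document.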
7Text Categorization Assignment
- Classifiers
- Naïve-Bayes Multinomial (NBM) was a clear winner
- SVM worked well most of the time, but not as well as NBM
- Naive Bayes seemed to be more robust to unseen information; the kernel estimator seems to improve on the default Naive Bayes settings.
- VotedPerceptron worked very well, but only does binary classification, so people who found it did very well on diverse did not transfer it to homogeneous.
8Today
- Lexicons, Semantic Nets and Ontologies
- The Structure of WordNet
- Computing Similarities
- Automatic Acquisition of New Terms
9Lexicons, Semantic Nets, and Ontologies
- Lexicons are (typically) word lists augmented with some subset of:
- Parts-of-speech
- Different word senses
- Synonyms
- Semantic Nets
- Include links to other terms
- IS-A, Part-Of, etc.
- Sometimes this term is used for what I call ontologies
- Ontologies
- Represent concepts and relationships among concepts
- Language independent (in principle)
- Sometimes include inference rules
- Different from definition in philosophy
- The science of what is, of the kinds and structures of objects, properties, events, processes and relations in every area of reality
10One approach to linking ontologies and lexicons
Formal Domain Ontology
Cassandra Linguistic Ontology
MEDRA
11Example Ontological Relation Types
HAS-PARTIAL-SPATIAL-OVERLAP
12Example of applying an ontology: joint anatomy
- joint HAS-HOLE joint space
- joint capsule IS-OUTER-LAYER-OF joint
- meniscus
- IS-INCOMPLETE-FILLER-OF joint space
- IS-TOPO-INSIDE joint capsule
- IS-NON-TANGENTIAL-MATERIAL-PART-OF joint
- joint
- IS-CONNECTOR-OF bone X
- IS-CONNECTOR-OF bone Y
- synovia
- IS-INCOMPLETE-FILLER-OF joint space
- synovial membrane IS-BONAFIDE-BOUNDARY-OF joint space
This doesn't include the linguistic side
13Linking Lexicons and Ontologies
[Diagram: ontology fragment. Concepts: Generalised Possession, Healthcare phenomenon, Human, Having a healthcare phenomenon, Patient, Patient at risk, Risk Factor, Patient at risk for osteoporosis, Risk factor for osteoporosis, Osteoporosis. Relations: IS-A, Has-possessor, Has-possessed, Is-possessor-of, Has-Healthcare-phenomenon, Is-Risk-Factor-Of.]
14Linking different lexicons
[Diagram: linking terms across lexicons. Terms: Snomed-RT Convulsion, Snomed-RT Seizure, MESH-2001 Seizures, MESH-2001 Convulsions. Concepts: LC Health crisis, LC Convulsion, LC Seizure, LC Epileptic convulsion. Relations: ISA, IS-narrower-than, Has-CCC, IS-A.]
15WordNet
- A big lexicon with properties of a semantic net
- Started as a language project by Dr. George Miller and Dr. Christiane Fellbaum at Princeton
- First became available in 1990
- Now on version 2.0
16WordNet
- Huge amounts of research (and products) use it
17WordNet Relations
- Original core relations
- Synonymy
- Polysemy
- Metonymy
- Hyponymy/Hyperonymy
- Meronymy
- Antonymy
- New, useful additions for NLP
- Glosses
- Links between derivationally and semantically related noun/verb pairs.
- Domain/topical terms
- Groups of similar verbs
- Others on the way
- Disambiguation of terms in glosses
- Topical clustering.
18Synonymy
- Different ways of expressing related concepts
- Examples
- cat, feline, Siamese cat
- Synonyms are almost never truly substitutable
- Used in different contexts
- Have different implications
- This is a point of contention.
19Polysemy
- Most words have more than one sense
- Homonym: same word, different meaning
- bank (river)
- bank (financial)
- Polysemy: different senses of same word
- That dog has floppy ears.
- She has a good ear for jazz.
- bank (financial) has several related senses
- the building, the institution, the notion of
where money is stored
20Metonymy
- Use one aspect of something to stand for the whole
- The building stands for the institution of the bank.
- Newscast: "The White House released new figures today."
- Waitperson: "The ham sandwich spilled his drink."
21Hyponymy/Hyperonymy
- ISA relation
- Related to Superordinate and Subordinate level categories
- hyponym(robin,bird)
- hyponym(bird,animal)
- hyponym(emu,bird)
- A is a hypernym of B if B is a type of A
- A is a hyponym of B if A is a type of B
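The two definitions above can be checked mechanically. A minimal sketch, using only the three hyponym facts from this slide and assuming each term has a single parent (WordNet itself allows more than one); the function names are made up for illustration.

```python
# Direct hyponym(child, parent) facts taken from the slide.
HYPONYM_OF = {"robin": "bird", "emu": "bird", "bird": "animal"}

def hypernyms(term):
    """Return all ancestors of term, nearest first."""
    chain = []
    while term in HYPONYM_OF:
        term = HYPONYM_OF[term]
        chain.append(term)
    return chain

def is_hyponym(a, b):
    """True if a is a (possibly indirect) type of b."""
    return b in hypernyms(a)

def is_hypernym(a, b):
    """True if b is a (possibly indirect) type of a."""
    return is_hyponym(b, a)
```

Following the chain upward makes the relation transitive: robin is a hyponym of animal even though only robin/bird and bird/animal are stated directly.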
22Meronymy
- Parts-of relation
- part of(beak, bird)
- part of(bark, tree)
- Transitive conceptually but not lexically
- The knob is a part of the door.
- The door is a part of the house.
- ? The knob is a part of the house ?
23Antonymy
- Lexical opposites
- antonym(large, small)
- antonym(big, small)
- antonym(big, little)
- but not large, little
- Many antonymous relations can be reliably detected by looking for statistical correlations in large text collections. (Justeson & Katz 91)
24Using WordNet in Python
from wordnet import *
from wntools import *
26More Readable Output
27Using WordNet to Determine Similarity
- The meet function in the python wordnet tool
finds the closest common parent to two terms
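To make "closest common parent" concrete, here is a sketch over a tiny hand-built is-a hierarchy. It does not call the wordnet tool's meet function; the PARENT table and the function names are hypothetical, and each concept is assumed to have a single parent.

```python
# Hypothetical toy is-a hierarchy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "bird": "animal",
    "robin": "bird",
    "emu": "bird",
    "dog": "animal",
    "artifact": "entity",
    "racquet": "artifact",
}

def path_to_root(concept):
    """Return the concept followed by all of its ancestors, up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path

def closest_common_parent(c1, c2):
    """Return the deepest concept that subsumes both c1 and c2."""
    ancestors_of_c1 = set(path_to_root(c1))
    # Walk upward from c2; the first shared node is the closest one.
    for node in path_to_root(c2):
        if node in ancestors_of_c1:
            return node
    return None
```

For example, closest_common_parent("robin", "emu") gives "bird", while robin and racquet only meet at the root.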
28Similarity by Path Length
- Count the edges (is-a links) between two concepts and scale
- Leacock and Chodorow, 1998
- lch(c1,c2) = -log( length(c1,c2) / (2 * max-depth) )
- Wu and Palmer, 1994
- wup(c1,c2) = 2 * depth(lcs(c1,c2)) / ( depth(c1) + depth(c2) )
- lcs(c1,c2) is the closest common parent of c1 and c2
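Both path-based measures can be tried out on a toy hierarchy. This is a sketch under stated assumptions: the PARENT table is hypothetical, depth counts nodes (the root has depth 1), and length counts the nodes on the path (edges + 1) so that a concept compared with itself does not give log(0).

```python
import math

# Hypothetical toy is-a hierarchy: child -> parent (None marks the root).
PARENT = {
    "entity": None, "animal": "entity", "bird": "animal", "robin": "bird",
    "emu": "bird", "dog": "animal", "artifact": "entity", "racquet": "artifact",
}

def path_to_root(c):
    """Return c followed by all of its ancestors, up to the root."""
    path = []
    while c is not None:
        path.append(c)
        c = PARENT[c]
    return path

def depth(c):
    """Depth in nodes: the root has depth 1 (an assumption of this sketch)."""
    return len(path_to_root(c))

def lcs(c1, c2):
    """Closest common parent of c1 and c2."""
    ancestors = set(path_to_root(c1))
    return next(n for n in path_to_root(c2) if n in ancestors)

def path_length(c1, c2):
    """Number of nodes on the shortest is-a path between c1 and c2 (edges + 1)."""
    return depth(c1) + depth(c2) - 2 * depth(lcs(c1, c2)) + 1

MAX_DEPTH = max(depth(c) for c in PARENT)

def lch(c1, c2):
    """Leacock-Chodorow: negative log of the scaled path length."""
    return -math.log(path_length(c1, c2) / (2 * MAX_DEPTH))

def wup(c1, c2):
    """Wu-Palmer: depth of the common parent relative to the two depths."""
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))
```

With this hierarchy, robin and emu score higher than robin and dog under both measures, because their common parent is deeper and the path between them is shorter.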
29Problems with Path Length
- The lengths of the paths are irregular across the hierarchies
- Words that should be in the same hierarchy might not be
- How to relate terms that are not in the same hierarchies?
- The tennis problem
- Player
- Racquet
- Ball
- Net
- Are all in separate hierarchies
- WordNet is working on developing such linkages
34Similarity by Information Content
- IC estimated from a corpus of text (Resnik, 1995)
- IC(concept) = -log(P(concept))
- Specific Concept
- High IC (pitchfork)
- General Concept
- Low IC (instrument)
- To estimate it
- Count occurrences of concept
- Given a word, increment count of all concepts associated with that word
- Increment bank as financial institution and also as river shore.
- Assume that senses occur uniformly lacking evidence to the contrary (e.g., sense tagged text)
- Counts propagate up the hierarchy
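The estimation steps above can be sketched as follows. The hierarchy, the sense inventory, and the three-token "corpus" are all hypothetical. Each sense of a word is incremented by one, as on the slide; some implementations instead split the count evenly across a word's senses.

```python
import math
from collections import Counter

# Hypothetical toy hierarchy (child -> parent) and sense inventory.
PARENT = {
    "entity": None,
    "institution": "entity",
    "bank#finance": "institution",
    "land": "entity",
    "bank#river": "land",
    "shore": "land",
}
SENSES = {"bank": ["bank#finance", "bank#river"], "shore": ["shore"]}

def concept_counts(tokens):
    """Count every sense of every token, propagating counts up to the root."""
    counts = Counter()
    for word in tokens:
        for sense in SENSES.get(word, []):
            node = sense
            while node is not None:
                counts[node] += 1
                node = PARENT[node]
    return counts

def information_content(counts, root="entity"):
    """IC(concept) = -log(P(concept)), with P estimated relative to the root count."""
    total = counts[root]
    return {c: -math.log(n / total) for c, n in counts.items()}

counts = concept_counts(["bank", "bank", "shore"])
ic = information_content(counts)
```

Because counts propagate upward, the root has probability 1 and IC 0, and specific concepts such as "shore" end up with higher IC than their parent "land".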
35Information Content as Similarity
- Resnik, 1995
- res(c1,c2) = IC(lcs(c1,c2))
- Jiang and Conrath, 1997
- jcn(c1,c2) = 1 / ( IC(c1) + IC(c2) - 2 * res(c1,c2) )
- Lin, 1998
- lin(c1,c2) = 2 * res(c1,c2) / ( IC(c1) + IC(c2) )
- All of these (and more!) are implemented in a Perl package
- Called SenseRelate, Pedersen et al.
- http://wn-similarity.sourceforge.net/
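A small sketch of the three information-content measures, written directly from the formulas above. It is not the Perl package's code: the hierarchy and the concept probabilities are hypothetical values chosen so that the arithmetic is easy to follow.

```python
import math

# Hypothetical toy hierarchy (child -> parent) and made-up concept probabilities,
# assumed to be already propagated up the hierarchy.
PARENT = {
    "entity": None, "instrument": "entity", "tool": "instrument",
    "pitchfork": "tool", "hammer": "tool", "animal": "entity",
}
PROB = {
    "entity": 1.0, "instrument": 0.5, "tool": 0.25,
    "pitchfork": 0.0625, "hammer": 0.125, "animal": 0.5,
}

def ic(c):
    """Information content: IC(c) = -log(P(c))."""
    return -math.log(PROB[c])

def path_to_root(c):
    path = []
    while c is not None:
        path.append(c)
        c = PARENT[c]
    return path

def lcs(c1, c2):
    """Closest common parent of c1 and c2."""
    ancestors = set(path_to_root(c1))
    return next(n for n in path_to_root(c2) if n in ancestors)

def res(c1, c2):
    """Resnik: IC of the closest common parent."""
    return ic(lcs(c1, c2))

def jcn(c1, c2):
    """Jiang-Conrath: inverse of the IC distance (infinite for identical concepts)."""
    distance = ic(c1) + ic(c2) - 2 * res(c1, c2)
    return math.inf if distance == 0 else 1 / distance

def lin(c1, c2):
    """Lin: shared IC relative to the total IC of the two concepts."""
    return 2 * res(c1, c2) / (ic(c1) + ic(c2))
```

Here pitchfork and hammer share the fairly specific parent "tool", so all three scores are well above those for pitchfork and animal, which only meet at the root (where IC is 0).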
36Rearranging WordNet
- Try to fix the top-level hierarchies
- Parse the glosses for more information
- eXtended WordNet project
- http://xwn.hlt.utdallas.edu/
37Augmenting WordNet
- Lexico-syntactic Patterns (Hearst 92, 97)
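As an illustration of the idea, here is a sketch of one such pattern, "X such as A, B and C", written as a regular expression. It is heavily simplified: noun phrases are approximated by single words, whereas a real system would match over parsed or chunked noun phrases. The function name and the example sentences in the test are made up.

```python
import re

# One lexico-syntactic pattern: "<hypernym> such as <hyponym>, <hyponym>, and <hyponym>".
# The lookahead keeps "and"/"or" from being read as a list item.
SUCH_AS = re.compile(
    r"(\w+),? such as (\w+(?:, (?!and |or )\w+)*(?:,? (?:and|or) \w+)?)"
)

def extract_hyponyms(text):
    """Return (hyponym, hypernym) pairs suggested by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        # Split the list part on ", and ", " or ", ", " and similar separators.
        for hyponym in re.split(r",? (?:and|or) |, ", match.group(2)):
            pairs.append((hyponym, hypernym))
    return pairs
```

Each extracted pair is a candidate IS-A link that could be proposed as an addition to the hierarchy, subject to checking.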
42Acquisition using the Web
- Towards Terascale Knowledge Acquisition, Pantel and Lin 04
- Use co-occurrence model and a huge collection (the Web) to find similar terms
- Input: a cluster of related words
- Feature vectors computed for each word
- Catch ___
- Compute mutual information between the word and the context
- Average the features for each class to create a grammatical template for each class
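The feature-vector and averaging steps can be sketched roughly as below. This is not the paper's implementation: it assumes pointwise mutual information over raw (word, context) counts, and the function names and the handful of observations are made up.

```python
import math
from collections import Counter, defaultdict

def pmi_vectors(pairs):
    """Build {word: {context: PMI}} from a list of (word, context) observations."""
    pair_counts = Counter(pairs)
    word_counts = Counter(word for word, _ in pairs)
    context_counts = Counter(context for _, context in pairs)
    total = len(pairs)
    vectors = defaultdict(dict)
    for (word, context), n in pair_counts.items():
        # PMI = log( P(word, context) / (P(word) * P(context)) )
        vectors[word][context] = math.log(
            (n * total) / (word_counts[word] * context_counts[context])
        )
    return vectors

def class_template(vectors, members):
    """Average the members' feature vectors; a missing feature counts as 0."""
    template = defaultdict(float)
    for word in members:
        for context, value in vectors[word].items():
            template[context] += value / len(members)
    return dict(template)

# Made-up observations of words in contexts such as "catch ___".
pairs = [
    ("trout", "catch ___"), ("trout", "catch ___"),
    ("salmon", "catch ___"), ("salmon", "grilled ___"),
    ("cold", "catch ___"), ("steak", "grilled ___"),
]
vectors = pmi_vectors(pairs)
fish_template = class_template(vectors, ["trout", "salmon"])
```

The averaged vector for the class {trout, salmon} can then be compared against the vectors of unseen words to propose new members of the class.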
43Acquisition using the Web
Use this template to find new examples of this
class of terms (but it makes many errors)
44Next Time
- FrameNet
- A background paper is on the class website
- (Not required to read it beforehand)