Title: Castanet: Using WordNet to Build Facet Hierarchies
1 Castanet: Using WordNet to Build Facet Hierarchies
- Emilia Stoica and Marti Hearst
- School of Information, Berkeley
2 Focus: Search and Navigation of Large Collections
Shopping Sites
Digital Libraries
E-Government Sites
Image Collections
3 Problems with Site Search
- Study by Vividence in 2001 on 69 Sites
- 70 eCommerce
- 31 Service
- 21 Content
- 2 Community
- Poorly organized search results
- Frustration and wasted time
- Poor information architecture
- Confusion
- Dead ends
- "back and forthing"
- Forced to search
4 The Problem With Hierarchy
- Most things can be classified in more than one way.
- Most organizational systems do not handle this well.
- Example: Animal Classification
otter penguin robin salmon wolf cobra bat
Skin Covering
Locomotion
Diet
5 The Problem With Hierarchy
[Figure: one fixed hierarchy. From "start", branch by locomotion (swim, fly, run, slither), then by skin covering (fur, scales, feathers), then by diet (fish, rodents, insects); animals such as salmon, bat, robin, and wolf sit at the leaves. The skin-covering and diet labels repeat under every branch.]
6 The Idea of Facets
- Facets are a way of labeling data
- A kind of Metadata (data about data)
- Can be thought of as properties of items
- Facets vs. Categories
- Items are placed INTO a category system
- Multiple facet labels are ASSIGNED TO items
7 The Idea of Facets
- Hot and Sweet Chicken: 1 pepper, 2 apricots, 1 pound chicken breast, 1 Tbsp gingerroot
Meat > Chicken
8 Using Facets
- Now there are multiple ways to get to each item
Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
Desserts: Cakes, Cookies, Dairy, Ice Cream, Sherbet, Flan
Fruits: Cherries, Berries, Blueberries, Strawberries, Bananas, Pineapple
Fruit > Pineapple; Dessert > Cake; Preparation > Bake
Dessert > Dairy > Sherbet; Fruit > Berries > Strawberries; Preparation > Freeze
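A minimal sketch of how assigned facet labels give several routes to the same item. The recipe names, the `RECIPES` mapping, and the `find` helper are illustrative assumptions; they are not taken from the slides or from Flamenco.

```python
# Hypothetical data: each recipe is ASSIGNED one label path per facet.
RECIPES = {
    "pineapple upside-down cake": {
        "Fruit": ("Pineapple",),
        "Dessert": ("Cake",),
        "Preparation": ("Bake",),
    },
    "strawberry sherbet": {
        "Dessert": ("Dairy", "Sherbet"),
        "Fruit": ("Berries", "Strawberries"),
        "Preparation": ("Freeze",),
    },
}


def find(recipes, **constraints):
    """Return names of recipes whose facet path contains every requested label."""
    hits = []
    for name, facets in recipes.items():
        if all(label in facets.get(facet, ()) for facet, label in constraints.items()):
            hits.append(name)
    return sorted(hits)


# The same recipe is reachable from different starting facets.
by_fruit = find(RECIPES, Fruit="Berries")
by_prep = find(RECIPES, Preparation="Freeze")
combined = find(RECIPES, Dessert="Dairy", Preparation="Freeze")
```

Because no facet is privileged, a user can start from Fruit, Dessert, or Preparation and narrow down in any order.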
9 Castanet
- Semi-automatic algorithm for creating hierarchical faceted metadata
- Carves out a structure from the hypernym (IS-A) relations within WordNet
- Produces surprisingly good results for a wide range of subjects
- e.g., arts, medicine, recipes, math, news, bibliographical records
10 WordNet Challenges
- A word may have more than one sense
- Fine granularity of word sense distinctions
- e.g., newspaper (1): daily publication on folded sheets
- newspaper (3): physical object
- Ambiguity for the same sense
11 WordNet Challenges (cont.)
- The hypernym path may be quite long (e.g., sense 3 of tuna has 14 nodes)
- Sparse coverage of proper names and noun phrases (not addressed)
12 Algorithm Goals
- Build a set of facet hierarchies
- Balance depth and breadth
- Avoid skinny paths
- Don't go too deep or too broad
- Choose understandable labels
- Disambiguate words
- Currently a word can take on only one sense
13 Our Approach
[Diagram: processing pipeline that starts from the Documents]
14 1. Select Terms
- Select well-distributed terms from the collection
- Eliminate stopwords
- Retain only those terms with a distribution higher than a threshold
- (default top 10)
[Diagram: Documents -> Select terms -> Build core tree -> Augment core tree -> Compress tree -> Remove top-level categories; WordNet supplies the hypernym paths]
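The term-selection step can be sketched as follows, assuming "distribution" means the number of documents a term occurs in. The stopword list, the tokenizer, the `threshold` value, and the sample documents are all assumptions made for illustration; the slide's own default is not reproduced here.

```python
from collections import Counter
import re

# Tiny illustrative stopword list (an assumption; a real list would be longer).
STOPWORDS = {"a", "an", "and", "the", "of", "with", "in", "to"}


def select_terms(documents, threshold=1):
    """Return {term: document count} for terms found in more than `threshold` docs."""
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document, so the tally is a document frequency.
        terms = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
        doc_freq.update(terms)
    return {term: n for term, n in doc_freq.items() if n > threshold}


docs = [
    "Bake the cake with pineapple",
    "Freeze the sherbet and the strawberries",
    "Bake a cake with strawberries",
]
selected = select_terms(docs)
```

Terms that occur in only one document ("pineapple", "freeze", "sherbet") fall below the threshold and are dropped.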
15 2. Build Core Tree
- Build a backbone
- Create paths from unambiguous terms only
- Bias the structure towards appropriate senses of words
- Get the hypernym path if the term
- has only one sense, or
- matches a pre-selected WordNet domain
- Adding a new term increases the count at each node on its path by the number of docs with the term.
16 2. Build Core Tree (cont.)
- Merge hypernym paths to build a tree
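Merging hypernym paths into one tree, with per-node counts, might look like the sketch below. The paths and document counts in `PATHS` are shortened, hypothetical stand-ins for real WordNet hypernym paths, and the node layout (`count`, `children`) is an assumption of this example.

```python
def new_node():
    return {"count": 0, "children": {}}


def add_path(root, path, doc_count):
    """Merge one root-to-term hypernym path into the tree.

    Every node on the path has its count increased by `doc_count`,
    the number of documents that contain the term.
    """
    node = root
    for label in path:
        node = node["children"].setdefault(label, new_node())
        node["count"] += doc_count


# Hypothetical (path, number of docs with the term) pairs.
PATHS = {
    "sherbet": (["food", "dessert", "frozen dessert", "sherbet"], 12),
    "parfait": (["food", "dessert", "frozen dessert", "parfait"], 5),
    "cake": (["food", "baked goods", "cake"], 30),
}

core = new_node()
for term, (path, doc_count) in PATHS.items():
    add_path(core, path, doc_count)

food = core["children"]["food"]
frozen = food["children"]["dessert"]["children"]["frozen dessert"]
```

Shared prefixes such as "food" are stored once, so their counts accumulate across every term that passes through them.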
17 3. Augment Core Tree
- Attach to the core tree the terms with more than one sense
- Favor the more common path over other alternatives
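One possible reading of "favor the more common path" is to pick, for an ambiguous term, the sense whose hypernym path overlaps most with the tree built so far. The sketch below implements only that reading; the scoring by shared leading nodes, the sample core tree, and the two paths for "date" are assumptions, not the authors' stated method.

```python
def shared_prefix_len(tree, path):
    """Count how many leading nodes of `path` already exist in `tree`."""
    node, depth = tree, 0
    for label in path:
        if label not in node:
            break
        node, depth = node[label], depth + 1
    return depth


def attach(tree, path):
    """Insert `path` into the nested-dict tree, reusing existing nodes."""
    node = tree
    for label in path:
        node = node.setdefault(label, {})


def augment(tree, candidate_paths):
    """Attach the candidate sense path that overlaps most with the core tree."""
    best = max(candidate_paths, key=lambda p: shared_prefix_len(tree, p))
    attach(tree, best)
    return best


# Hypothetical core tree and hypothetical sense paths for the ambiguous term "date".
core = {"food": {"produce": {"fruit": {"berry": {}}}}}
date_senses = [
    ["measure", "time period", "day", "date"],
    ["food", "produce", "fruit", "date"],
]
chosen = augment(core, date_senses)
```

Here the food-related path wins because three of its nodes are already in the core tree, while the calendar path shares none.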
18 Augment Core Tree (cont.)
19 Optional Step: Domains
- To disambiguate, use Domains
- WordNet has 212 Domains
- medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
- A better collection has been developed by Magnini 2000
- Assigns a domain to every noun synset
- Automatically scan the collection to see which domains apply
- The user selects which of the suggested domains to use or may add their own
- Paths for terms that match the selected domains are added to the core tree
20 Using Domains
dip glosses
Sense 1: A depression in an otherwise level surface
Sense 2: The angle that a magnet needle makes with horizon
Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms
Sense 1: solid > concave shape > depression
Sense 2: shape, form > space > angle
Sense 3: food > ingredient, fixings > flavorer
Given domain food, choose sense 3
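The domain-based choice for "dip" can be sketched with a small lookup. The glosses and the "food" tag for sense 3 follow the slide; the domain tags given to senses 1 and 2 are placeholders invented for this example, and `pick_sense` is a hypothetical helper rather than a WordNet API.

```python
# Glosses are from the slide. The domain tags for senses 1 and 2 are
# invented placeholders; only "food" for sense 3 comes from the slide.
DIP_SENSES = [
    {"sense": 1, "gloss": "a depression in an otherwise level surface", "domain": "geography"},
    {"sense": 2, "gloss": "the angle that a magnet needle makes with horizon", "domain": "physics"},
    {"sense": 3, "gloss": "tasty mixture into which bite-size foods are dipped", "domain": "food"},
]


def pick_sense(senses, selected_domains):
    """Return the first sense whose domain was selected, or None if none match."""
    for sense in senses:
        if sense["domain"] in selected_domains:
            return sense
    return None


chosen = pick_sense(DIP_SENSES, {"food"})
unmatched = pick_sense(DIP_SENSES, {"soccer"})
```

When no sense matches a selected domain, the term stays ambiguous and would be left for the later augmentation step.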
21 4. Compress Tree
- Rule 1
- Eliminate a parent with fewer than k children unless it is the root or its distribution is larger than 0.1 * maxdist
dessert
frozen dessert
ice cream sundae
parfait
sherbet,sorbet
sundae
sherbet
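Rule 1 can be sketched as a bottom-up pass over the tree. Several things here are assumptions: an eliminated parent's children are promoted to its own parent, leaves are never treated as parents, `maxdist` is passed in by the caller, and the tree layout and `dist` values are an invented reading of the dessert example.

```python
def compress(node, k, maxdist):
    """Apply Rule 1 below `node` and return the compressed copy.

    A node is {"label": str, "dist": int, "children": [nodes]}.
    The node passed in by the caller acts as the root and is never removed.
    """
    new_children = []
    for child in node["children"]:
        child = compress(child, k, maxdist)
        n = len(child["children"])
        too_small = 0 < n < k                      # leaves are not parents
        protected = child["dist"] > 0.1 * maxdist  # well-distributed nodes stay
        if too_small and not protected:
            new_children.extend(child["children"])  # promote its children
        else:
            new_children.append(child)
    return {**node, "children": new_children}


def leaf(label, dist):
    return {"label": label, "dist": dist, "children": []}


# Assumed layout and invented counts for the dessert example.
tree = {"label": "dessert", "dist": 100, "children": [
    {"label": "frozen dessert", "dist": 40, "children": [
        {"label": "ice cream sundae", "dist": 5, "children": [leaf("sundae", 5)]},
        leaf("parfait", 8),
        {"label": "sherbet, sorbet", "dist": 6, "children": [leaf("sherbet", 6)]},
    ]},
]}
result = compress(tree, k=2, maxdist=100)
frozen_children = [c["label"] for c in result["children"][0]["children"]]
```

With k = 2, the single-child nodes "ice cream sundae" and "sherbet, sorbet" are removed and their children move up under "frozen dessert".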
22 4. Compress Tree (cont.)
- Rule 2
- Eliminate a child whose name appears within the parent's name
dessert
frozen dessert
sundae
parfait
sherbet
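A minimal sketch of Rule 2, read literally: a child is dropped when its label occurs as a substring of its parent's label. Re-attaching the dropped child's own children to the parent is an assumption of this sketch, and the sample tree (including "lemon sherbet" and "granita") is hypothetical.

```python
def drop_redundant_children(node):
    """Return a copy of `node` without children named inside the parent's label.

    A node is {"label": str, "children": [nodes]}. Children of a dropped
    node are re-attached to the parent (an assumption of this sketch).
    """
    kept = []
    pending = list(node["children"])
    while pending:
        child = pending.pop(0)
        if child["label"] in node["label"]:
            # Drop the child and re-check its children against the same parent.
            pending = child["children"] + pending
        else:
            kept.append(drop_redundant_children(child))
    return {"label": node["label"], "children": kept}


# Hypothetical tree: the child "sherbet" repeats part of its parent's label.
tree = {"label": "sherbet, sorbet", "children": [
    {"label": "sherbet", "children": [{"label": "lemon sherbet", "children": []}]},
    {"label": "granita", "children": []},
]}
result = drop_redundant_children(tree)
labels = [c["label"] for c in result["children"]]
```

A plain substring test is the simplest reading of the rule; a stricter variant could compare whole comma-separated names instead.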
23 5. Divide into Facets
24 5. Divide into Facets (Remove top levels)
Rule 1: Manually eliminate the top t levels (t = 4 for recipe collection).
Rule 2: For each resulting tree, test if it has more than n children (n = 2). If yes, stop. If not, delete the root and test again.
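The two rules above can be sketched as two small functions over a nested tree. The tree, its labels, and the choice of t = 1 for this toy example are assumptions; under this literal reading, a subtree that bottoms out without ever having more than n children produces no facet.

```python
def strip_top_levels(root, t):
    """Rule 1: drop the top `t` levels and return the subtrees left behind."""
    level = [root]
    for _ in range(t):
        level = [child for parent in level for child in parent["children"]]
    return level


def split_facets(tree, n=2):
    """Rule 2: keep a tree whose root has more than `n` children; otherwise
    delete the root and apply the same test to each of its subtrees."""
    if len(tree["children"]) > n:
        return [tree]
    facets = []
    for child in tree["children"]:
        facets.extend(split_facets(child, n))
    return facets


def make(label, *children):
    return {"label": label, "children": list(children)}


# Hypothetical tree; the slide uses t = 4 for the recipe collection.
tree = make(
    "entity",
    make("food", make("dessert", make("cake"), make("sherbet"), make("parfait"))),
    make("activity", make("preparation", make("bake"), make("fry"), make("freeze"))),
)
facets = [f["label"] for sub in strip_top_levels(tree, 1) for f in split_facets(sub)]
```

"food" and "activity" each have a single child, so their roots are deleted and the facets start at "dessert" and "preparation".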
25 Example: Recipes (3500 docs)
26 Castanet Output (shown in Flamenco)
27 Castanet Output
28 Castanet Output
29 Castanet Output
30 Castanet Output
32 Castanet Evaluation
- This is a tool for information architects, so people of this type did the evaluation
- We compared output on
- Recipes
- Biomedical journal titles
- We compared to two state-of-the-art algorithms
- LDA (Blei et al. 04)
- Subsumption (Sanderson & Croft 99)
33 Subsumption Output
34 Subsumption Output
35 Subsumption Output
36 Subsumption Output
37 LDA Output
38 LDA Output
39 LDA Output
40 Evaluation Method
- Information architects assessed the category systems
- For each of 2 systems' output
- Examined and commented on top-level
- Examined and commented on two sub-levels
- Then commented on overall properties
- Meaningful?
- Systematic?
- Likely to use in your work?
41 Evaluation (cont.)
- Sample questions for top-level categories
- Would you add/remove/rename any category?
- Did this category match your expectations?
- Sample questions for a specific category
- Would you add/move/remove any sub-categories?
- Would you promote any sub-category to top level?
- General questions
- Would you use Castanet?
- Would you use LDA?
- Would you use Subsumption?
- Would you use list of most frequent terms?
42 Evaluation Results
- Results on recipes collection for "Would you use this system in your work?"
- Yes in some cases or yes, definitely
- Castanet 29/34
- LDA 0/18
- Subsumption 6/16
- Baseline 25/34
- Average response to questions about quality (4 = strongly agree)
43 Evaluation Results
- Average responses for top-level categories
- 4 = no changes, 1 = change many
- Average responses for 2 subcategories
44 Needed Improvements
- Take spelling variations and morphological variants into account
- Use verbs and adjectives, not just nouns
- Normalize noun phrases
- Allow terms to have more than one sense
- Improve algorithm for assigning documents to categories.
45 Opportunities for Tagging
- New opportunity: tagging, folksonomies
- (flickr, del.icio.us)
- People are creating facets in a decentralized manner
- They are assigning multiple facets to items
- This is done on a massive scale
- This leads naturally to meaningful associations
46 Conclusions
- Flexible application of hierarchical faceted metadata is a proven approach for navigating large information collections.
- Midway in complexity between simple hierarchies and deep knowledge representation.
- Currently in use on e-commerce sites; spreading to other domains
- Systems are needed to help create faceted metadata structures
- Our WordNet-based algorithm, while not perfect, seems like it will be a useful tool for Information Architects.
47 Conclusions
- Castanet builds a set of faceted hierarchies by finding IS-A relations between terms using WordNet.
- The method has been tested on various domains
- medicine, recipes, math, news, arts, bibliographical records
- Usability study shows
- Castanet is preferred to other state-of-the-art solutions.
- Information architects want to use the tool in their work.
48 Learn More
- Funding
- This work supported in part by NSF (IIS-9984741)
- For more information
- Stoica, E., Hearst, M., and Richardson, M., Automating Creation of Hierarchical Faceted Metadata Structures, NAACL/HLT 2007
- See http://flamenco.berkeley.edu
49 Motivation
- Want to assign labels from multiple hierarchies
50 The Problem with Hierarchy
- Inflexible
- Force the user to start with a particular category
- What if I don't know the animal's diet, but the interface makes me start with that category?
- Wasteful
- Have to repeat combinations of categories
- Makes for extra clicking and extra coding
- Difficult to modify
- To add a new category type, must duplicate it everywhere or change things everywhere