Title: Automating Creation of Hierarchical Faceted Metadata Structures
1 Automating Creation of Hierarchical Faceted Metadata Structures
- Emilia Stoica, Marti Hearst and Megan Richardson
- School of Information, Berkeley
- Dept. of Mathematical Sciences, NMSU
2 Focus: Browse Large Datasets
- Standard search interface (query box and retrieved results) is not suited for browsing and navigation
- User interfaces need to group and organize the results
10 How do we Create Faceted Hierarchies?
- Goals
- Help an information architect to create the hierarchy
- Currently they do it all by hand!
- Balance depth and breadth
- Avoid skinny paths
- Don't go too deep or too broad
- Choose understandable labels
- Disambiguate between word senses
11 Related Work
- Automated text categorization
- LOTS of work on this
- Assumes that a set of categories is already created
- Little if any work on building facet hierarchies
12 Castanet
- Carves out a structure from the hypernym (IS-A) relations within WordNet
- Semi-automatic algorithm for creating hierarchical faceted metadata
- Produces surprisingly good results for a wide range of subjects
- e.g., recipes, medicine, math, news, fine arts image description
13 WordNet Challenges
- A word may have more than one sense
- - Fine granularity of word sense distinctions
- e.g., newspaper (1): daily publication on folded sheets
- newspaper (3): physical object
- - Ambiguity for the same sense
14 WordNet Challenges (cont.)
- The hypernym path may be quite long (e.g., sense 3 of tuna has 14 nodes)
- Sparse coverage of proper names and noun phrases (not addressed)
15 Our Approach
Documents
16 1. Select Terms
- Select well-distributed terms from the collection
- Eliminate stopwords
- Retain only those terms with a distribution higher than a threshold
- (default top 10)
[Figure: pipeline diagram with boxes Documents, WordNet, Select terms, Build core tree, Augm. core tree, Comp. tree, Remove top level categ.]
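The term-selection step can be sketched as follows. This is a minimal illustration, not the Castanet implementation: it assumes "distribution" means the number of documents containing a term, it takes the threshold as an explicit argument rather than modeling the "top 10" default, and the stopword list, tokenizer and `select_terms` name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "with", "of", "in"}


def select_terms(docs, threshold, stopwords=STOPWORDS):
    """Return (term, doc_count) pairs for terms found in more than `threshold` docs."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document.
        terms = set(re.findall(r"[a-z]+", doc.lower())) - stopwords
        doc_freq.update(terms)
    kept = [(term, n) for term, n in doc_freq.items() if n > threshold]
    # Most widely distributed terms first, ties broken alphabetically.
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept


docs = ["Bake the apple pie", "Apple and date salad", "Date bread with apple"]
selected = select_terms(docs, threshold=1)
```

Counting documents rather than raw occurrences keeps a term that is repeated many times in a single document from looking well distributed.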
17 2. Build Core Tree
- Build a backbone
- Create paths from unambiguous terms only
- Bias the structure towards appropriate senses of words
- Get hypernym path if term
- - has only one sense, or
- - matches a pre-selected WordNet domain
- Adding a new term increases a count at each node on its path by the number of docs with the term.
18 2. Build Core Tree (cont.)
- Merge hypernym paths to build a tree
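Building and merging the paths can be sketched as below. This is an illustration under stated assumptions: the sense inventory `SENSES` is a hypothetical stand-in for WordNet lookups (paths are written root first), nodes are identified by name rather than by synset, and the domain-matching condition is left out.

```python
from collections import Counter, defaultdict

# Hypothetical sense inventory: term -> hypernym paths (root first), one per sense.
SENSES = {
    "sundae": [["food", "dessert", "frozen dessert", "sundae"]],
    "parfait": [["food", "dessert", "frozen dessert", "parfait"]],
    "date": [
        ["food", "edible fruit", "date"],
        ["time period", "calendar day", "date"],
    ],
}


def build_core_tree(doc_counts, senses):
    """Merge the hypernym paths of single-sense terms into one tree.

    Returns (children, counts): children maps a node to its child nodes,
    counts maps a node to the summed document counts of terms below it.
    """
    children = defaultdict(set)
    counts = Counter()
    for term, n_docs in doc_counts.items():
        paths = senses.get(term, [])
        if len(paths) != 1:
            continue  # ambiguous or unknown terms are left for a later step
        path = paths[0]
        # Shared prefixes merge because nodes are keyed by name.
        for parent, child in zip(path, path[1:]):
            children[parent].add(child)
        # Every node on the path gains the term's document count.
        for node in path:
            counts[node] += n_docs
    return children, counts


children, counts = build_core_tree({"sundae": 5, "parfait": 3, "date": 7}, SENSES)
```

Here "date" has two senses, so it contributes nothing to the core tree, while "frozen dessert" accumulates the counts of both of its terms.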
19 3. Augment Core Tree
- Attach to Core tree the terms with more than one sense
- Favor the more common path over other alternatives
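20 Augment Core Tree (cont.)
[Figure: two candidate hypernym paths, labeled Date (p1) and Date (p2), for attaching the ambiguous term "date" to the core tree; counts in parentheses]
- entity > substance, matter > food, nutrient > nutriment > food > edible fruit (78) > berries, date (?)
- abstraction > measure, quantity > fundamental quantity > time period > calendar day (18) > Sunday, date (?)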
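One way to read "favor the more common path" is sketched below. It is an assumption, not the documented Castanet rule: each candidate path is scored by the count of its deepest node that already exists in the core tree, and the highest score wins. The counts 78 and 18 come from the example above; the other counts, the shortened paths and the function names are hypothetical.

```python
def path_support(path, counts):
    """Count at the deepest node of `path` already in the core tree (0 if none)."""
    for node in reversed(path[:-1]):  # skip the term itself, the last element
        if node in counts:
            return counts[node]
    return 0


def pick_sense_path(paths, counts):
    """Choose the candidate path with the strongest support in the core tree."""
    return max(paths, key=lambda path: path_support(path, counts))


# Core-tree counts: 78 and 18 are from the example, 120 and 30 are made up.
core_counts = {"food": 120, "edible fruit": 78, "time period": 30, "calendar day": 18}

date_paths = [
    ["time period", "calendar day", "date"],
    ["food", "edible fruit", "date"],
]
chosen = pick_sense_path(date_paths, core_counts)
```

With these counts the sketch attaches "date" under "edible fruit", because 78 is larger than 18.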
21 Optional Step: Domains
- To disambiguate, use Domains
- WordNet has 212 Domains
- medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
- A better collection has been developed by Magnini (2000)
- Assigns a domain to every noun synset
- Automatically scan the collection to see which domains apply
- The user selects which of the suggested domains to use or may add own
- Paths for terms that match the selected domains are added to the core tree
22 Using Domains
dip glosses:
- Sense 1: A depression in an otherwise level surface
- Sense 2: The angle that a magnet needle makes with horizon
- Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms:
- Sense 1: solid > concave shape > depression
- Sense 2: shape, form > space > angle
- Sense 3: food > ingredient, fixings > flavorer
Given domain "food", choose sense 3
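The domain filter can be sketched with the "dip" example. The hypernym paths and the "food" label for sense 3 come from the slide; the domain labels for senses 1 and 2 ("geology", "physics") are hypothetical placeholders, as are the data layout and the function names.

```python
from collections import Counter

# Senses of "dip"; the domains of senses 1 and 2 are hypothetical placeholders.
DIP_SENSES = [
    {"sense": 1, "path": ["solid", "concave shape", "depression"], "domain": "geology"},
    {"sense": 2, "path": ["shape, form", "space", "angle"], "domain": "physics"},
    {"sense": 3, "path": ["food", "ingredient, fixings", "flavorer"], "domain": "food"},
]


def suggest_domains(term_senses):
    """Tally, per domain, how many terms have at least one sense in it."""
    tally = Counter()
    for senses in term_senses.values():
        tally.update({sense["domain"] for sense in senses})
    return tally


def senses_in_domains(senses, selected_domains):
    """Keep only the senses whose domain the user selected."""
    return [sense for sense in senses if sense["domain"] in selected_domains]


suggested = suggest_domains({"dip": DIP_SENSES})
matches = senses_in_domains(DIP_SENSES, {"food"})
```

When exactly one sense survives the filter, its hypernym path can be added to the core tree as if the term were unambiguous.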
23 4. Compress Tree
- Rule 1
- Eliminate a parent with fewer than k children unless it is the root or its distribution is larger than 0.1 × maxdist
Example tree:
- dessert > frozen dessert > ice cream sundae > sundae
- dessert > frozen dessert > parfait
- dessert > frozen dessert > sherbet, sorbet > sherbet
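Rule 1 can be sketched as below, applied to the example tree with k = 2. Several details are assumptions rather than statements from the slides: an eliminated parent's children are promoted to its own parent, the tree is processed bottom-up, `max_dist` is passed in by the caller, and the `dist` values are made up. `Node` and `compress_rule1` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    dist: int = 0  # number of documents under this node
    children: List["Node"] = field(default_factory=list)


def compress_rule1(node, k, max_dist):
    """Splice out parents with fewer than k children (the root is never removed)."""
    kept = []
    for child in node.children:
        compress_rule1(child, k, max_dist)  # compress the subtree first
        is_parent = bool(child.children)
        if is_parent and len(child.children) < k and child.dist <= 0.1 * max_dist:
            kept.extend(child.children)  # promote the grandchildren
        else:
            kept.append(child)
    node.children = kept
    return node


tree = Node("dessert", 9, [
    Node("frozen dessert", 9, [
        Node("ice cream sundae", 4, [Node("sundae", 4)]),
        Node("parfait", 2),
        Node("sherbet, sorbet", 3, [Node("sherbet", 3)]),
    ]),
])
compress_rule1(tree, k=2, max_dist=100)
frozen = tree.children[0]
```

After the call, "frozen dessert" has the children sundae, parfait and sherbet; "dessert" keeps its single child only because it is the root of this example.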
24 4. Compress Tree (cont.)
- Rule 2
- Eliminate a child whose name appears within the parent's name
Example tree:
- dessert > frozen dessert > sundae
- dessert > frozen dessert > parfait
- dessert > frozen dessert > sherbet
25 5. Divide into Facets
26 5. Divide into Facets (Remove top levels)
Rule 1: Eliminate the top t levels (t = 4 for recipe collection).
Rule 2: For each resulting tree, test if it has at least n children (n = 2).
If yes, stop. If not, delete the root and repeat.
Manual cleaning: remove facets that don't make sense
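The two facet rules can be sketched as follows. This is an illustration under assumptions: the root counts as level 1, a tree whose root is deleted is replaced by its children's subtrees, and a tree that runs out of children disappears. The toy tree, `Node`, `drop_top_levels` and `split_until_broad` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def drop_top_levels(root, t):
    """Rule 1: remove the top t levels and return the subtrees left over."""
    trees = [root]
    for _ in range(t):
        trees = [child for tree in trees for child in tree.children]
    return trees


def split_until_broad(tree, n):
    """Rule 2: keep a tree with at least n children, else delete its root and repeat."""
    if len(tree.children) >= n:
        return [tree]
    facets = []
    for child in tree.children:
        facets.extend(split_until_broad(child, n))
    return facets


def divide_into_facets(root, t, n):
    return [f for tree in drop_top_levels(root, t) for f in split_until_broad(tree, n)]


toy = Node("entity", [Node("substance", [Node("food", [
    Node("fruit", [Node("apple"), Node("date")]),
    Node("dish", [Node("soup")]),
])])])
facets_t2 = [f.name for f in divide_into_facets(toy, t=2, n=2)]
facets_t3 = [f.name for f in divide_into_facets(toy, t=3, n=2)]
```

With t = 2 the toy tree yields one facet, "food"; with t = 3 only "fruit" survives, since "dish" has a single child. Facets that make no sense would still be removed by hand afterwards.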
27 Example: Recipes (13,500 docs)
28 Castanet Output (shown in Flamenco)
29 Castanet Output
31 Castanet Evaluation
- This is a tool for information architects (IA), so people of this type did the evaluation
- Each IA compared Castanet to other state-of-the-art algorithms
- LDA (Blei et al. 04)
- Subsumption (Sanderson & Croft 99)
- Baseline: most frequent terms in the collection
- Datasets
- 13,000 recipes from Southwestcooking.com
32 Subsumption Output
33 Subsumption Output
34 LDA Output
35 LDA Output
36 Evaluation Method
[Figure: study design diagram with labels C, C, 16, 18, L, S]
- For each of 2 systems' output
- Examined and commented on top-level
- Examined and commented on two sub-levels
- Then comment on overall properties
- Meaningful?
- Systematic?
- Likely to use in your work?
37 Evaluation (cont.)
- Sample questions for top level categories
- - Would you add/remove/rename any category?
- - Did this category match your expectations?
- Sample questions for a specific category
- - Would you add/move/remove any sub-categories?
- - Would you promote any sub-category to top level?
- General questions
- - Would you use Castanet?
- - Would you use LDA?
- - Would you use Subsumption?
- - Would you use list of most frequent terms?
38 Evaluation Results
- Would you use this system in your work?
- ("yes, definitely" or "yes, in some cases")
- Castanet 85%
- LDA 0%
- Subsumption 37%
- Baseline 74%
- Average response to questions about quality
- (4 = strongly agree, 3 = agree somewhat, 2 = disagree somewhat, 1 = strongly disagree)
39 Evaluation Results
- Average responses for top-level categories
- (4 = no changes, 3 = one or two, 2 = a few, 1 = many)
- Average responses for 2 subcategories
40 Needed Improvements
- Take spelling variations and morphological variants into account
- Use verbs and adjectives, not just nouns
- Normalize noun phrases
- Allow terms to have more than one sense
- Improve algorithm for assigning documents to categories.
41 Conclusions
- Flexible application of hierarchical faceted metadata is a proven approach for navigating large information collections.
- Midway in complexity between simple hierarchies and deep knowledge representation.
- Currently in use on e-commerce sites; spreading to other domains
- Systems are needed to help create faceted metadata structures
- Our WordNet-based algorithm, while not perfect, seems like it will be a useful tool for Information Architects.
42 Conclusions
- Castanet builds a set of faceted hierarchies by finding IS-A relations between terms using WordNet.
- The method has been tested on various domains
- medicine, recipes, math, news, description of images
- Usability study shows
- Castanet is preferred to other state-of-the-art solutions.
- Information architects want to use the tool in their work.
- Future work
- Apply to tags (flickr, delicious)
43 Learn More
- Funding
- This work was supported in part by NSF (IIS-9984741)
- For more information
- Stoica, E., Hearst, M., and Richardson, M., "Automating Creation of Hierarchical Faceted Metadata Structures", NAACL/HLT 2007
- See http://flamenco.berkeley.edu