Title: Automatic Discovery of Useful Facet Terms
1. Automatic Discovery of Useful Facet Terms
- Wisam Dakka Columbia University
- Rishabh Dayal Columbia University
- Panagiotis G. Ipeirotis NYU
2. Searching the NYT Archive for Book Research
3. Motivation: News Archive
- Accessing and searching news archives is not an easy task
- Researchers and reporters spend a large amount of time going through their long query results
- News archives are huge and span tens of years
- Many relevant results
- Results on the first page are not more relevant than the results on the 5th or the 10th page (NYT archive)
- Search engines for news archives mainly follow the paradigm: search, skim through long results, modify, and search again
- Goal: Multifaceted Interfaces (MI) over the news archive of Newsblaster
- Newsblaster archive
- About 6 years of news from 24 news sources
- Stories are clustered daily into hierarchies of topics and events
- Events are threaded over time, summarized, and classified
4. Motivation: MI for Newsblaster Archive
- Our earlier multifaceted interfaces work (CIKM 2005) has some limitations
- Supervised learning: the only facets our algorithm could identify are those that appear in the training set
- WordNet hypernyms: WordNet has rather poor coverage of named entities
- Free text collections: the quality of the hierarchies built on top of news stories was low
5. Challenge: Automatic Extraction of the Useful Facets from News Archive
- Automatically discover, in an unsupervised manner, a set of candidate facet terms from free text
- Automatically group together facet terms that belong to the same facet
- Build the appropriate browsing structure for each facet
6. Intuition: Look for Facet Terms Elsewhere
- Pilot study: 100 stories from The NYTimes
- Common facets: Location, Institutes, History, People, Social Phenomenon, Markets, Nature, and Event
- Sub-facets: Leaders under People, Corporations under Markets
- Clear phenomenon: the terms for the useful facets do not usually appear in the news stories
- A journalist writing a story about Jacques Chirac will not necessarily use the terms Political Leader, Europe, or France. Such missing terms are tremendously useful for identifying the appropriate facets for the story
- We will look for these terms elsewhere
- Useful facet terms are infrequent in the original collection but frequent in the expanded documents
7. Context-Aware Expansion
Example passage:
Murkowski made the announcement three days after BP said it would shut down a Prudhoe Bay oil field after a small leak was found. Energy officials have said pipeline repairs are likely to take months, curtailing Alaskan production into next year.
[Figure: the example passage repeated once per external resource. Resource labels: Wiki, Yahoo Term Extractor, Named Entities]
8. Useful Facet Terms Are Elsewhere
[Figure labels: Original Collection, Context-aware Collection, Infrequent Terms, ti]
9. Term Frequency Analysis
- Frequency-based shifting
  - Due to the Zipfian nature of term frequencies, we favor terms that already have high frequencies (inverse problem)
- Rank-shifting
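
The slide names the two measures but does not give their formulas, so the following is only a minimal sketch under stated assumptions: the frequency-based shift is taken to be the difference in document frequency between the expanded and the original collection, and the rank shift is taken to be the number of rank positions a term gains. The function names, the tie-breaking rule, and the toy collections are all hypothetical.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))
    return df

def ranks(df):
    """Rank terms by document frequency (1 = most frequent).
    Ties are broken alphabetically; this is an arbitrary choice."""
    ordered = sorted(df, key=lambda t: (-df[t], t))
    return {t: i + 1 for i, t in enumerate(ordered)}

def frequency_shift(term, df_orig, df_exp):
    """Assumed definition: gain in document frequency after expansion."""
    return df_exp[term] - df_orig[term]

def rank_shift(term, rank_orig, rank_exp):
    """Assumed definition: rank positions gained after expansion.
    A term missing from a collection is ranked just below its last term."""
    r_orig = rank_orig.get(term, len(rank_orig) + 1)
    r_exp = rank_exp.get(term, len(rank_exp) + 1)
    return r_orig - r_exp

# Toy collections: each document is a list of terms.
original = [
    ["chirac", "election", "paris"],
    ["chirac", "summit"],
    ["bp", "pipeline", "alaska"],
]
expanded = [
    ["chirac", "election", "paris", "france", "political leader"],
    ["chirac", "summit", "france", "political leader", "europe"],
    ["bp", "pipeline", "alaska", "corporation"],
]

df_orig, df_exp = document_frequency(original), document_frequency(expanded)
rank_orig, rank_exp = ranks(df_orig), ranks(df_exp)

france_freq_shift = frequency_shift("france", df_orig, df_exp)
france_rank_shift = rank_shift("france", rank_orig, rank_exp)
chirac_freq_shift = frequency_shift("chirac", df_orig, df_exp)
chirac_rank_shift = rank_shift("chirac", rank_orig, rank_exp)
```

In this toy data, "france" never occurs in the original stories but gains on both measures after expansion, while "chirac" does not move. Comparing ranks rather than raw counts is one way to keep already frequent terms from dominating.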
10. Summary: Candidate Facet Terms
- For each document in the database, identify the important terms that are useful to characterize the contents of the document
- For each term in the original database, query the external resource and retrieve the terms that appear in the results. Add the retrieved terms to the original document, in order to create an expanded, context-aware document
- Analyze the frequency of the terms in both the original and the expanded database, and identify the candidate facet terms
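
The three steps above can be sketched end to end as follows. This is an illustrative toy, not the authors' implementation: the external resource is replaced by a hypothetical in-memory dictionary, important terms are found by naive substring matching against a fixed vocabulary, and the selection rule (a minimum gain in document frequency) is an assumption.

```python
from collections import Counter

# Hypothetical stand-in for an external resource such as an encyclopedia lookup.
EXTERNAL_RESOURCE = {
    "jacques chirac": ["france", "political leader", "europe"],
    "bp": ["corporation", "energy"],
    "prudhoe bay": ["alaska", "oil field"],
    "angela merkel": ["germany", "political leader", "europe"],
}

def important_terms(text, vocabulary):
    """Step 1 (toy version): keep vocabulary terms that occur in the text.
    Substring matching is naive; a real term extractor would replace it."""
    lowered = text.lower()
    return [t for t in vocabulary if t in lowered]

def expand(terms, resource):
    """Step 2: append the context terms retrieved for each important term."""
    expanded = list(terms)
    for t in terms:
        expanded.extend(resource.get(t, []))
    return expanded

def candidate_facet_terms(original_docs, expanded_docs, min_shift=2):
    """Step 3 (assumed rule): keep terms whose document frequency
    grows by at least min_shift after expansion."""
    df_orig = Counter(t for d in original_docs for t in set(d))
    df_exp = Counter(t for d in expanded_docs for t in set(d))
    return sorted(t for t in df_exp if df_exp[t] - df_orig[t] >= min_shift)

stories = [
    "Jacques Chirac met Angela Merkel in Paris.",
    "BP said it would shut down a Prudhoe Bay oil field.",
    "Angela Merkel addressed parliament.",
]
vocabulary = list(EXTERNAL_RESOURCE)
original = [important_terms(s, vocabulary) for s in stories]
expanded = [expand(d, EXTERNAL_RESOURCE) for d in original]
candidates = candidate_facet_terms(original, expanded)
print(candidates)
```

The candidates that come out of this toy run are terms that none of the stories mention, which matches the intuition that useful facet terms come from the expansion rather than from the story text.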
11. Indicative
12. Research in Progress
- Cleaning and filtering
- Grouping similar facet terms under one facet
- Evaluation
  - The resulting candidate terms
  - The resulting hierarchies