Title: Semantically enhanced automatic keyphrase indexing
1. Semantically enhanced automatic keyphrase indexing
- Olena Medelyan; supervisor: Ian H. Witten
Agenda
- What is this about?
- How did this all begin?
- What has happened so far?
- What's the outlook?
2-4. What is this about?
5. How did this all begin?
- Kea: born 1997, hit year 1999, stagnation 2002
- Master's thesis: improve Kea with linguistic methods → Kea with a controlled vocabulary
- PhD thesis:
- A better Kea that works with any vocabulary
- Controlled indexing without controlled vocabulary
- Prove that Kea is better than human indexers
- Started February 1st 2006
6. Intermezzo
- Controlled vocabularies
- Support organisation of data and its access
- But difficult to create and maintain manually
Epidermis
Of plants; for the epidermis of animals use SKIN
BT Plant tissues
NT Plant cuticle
NT Plant hairs
NT Root hairs
RT Peel
(non-descriptor: plant skin)
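A thesaurus entry like the one above can be held in a small record type. The following is a minimal sketch, not Kea's actual data model: the class name `ThesaurusEntry`, its field names, and the helper `build_lookup` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ThesaurusEntry:
    """One descriptor with its scope note and semantic relations (hypothetical model)."""
    descriptor: str
    scope_note: str = ""
    broader: List[str] = field(default_factory=list)          # BT
    narrower: List[str] = field(default_factory=list)         # NT
    related: List[str] = field(default_factory=list)          # RT
    non_descriptors: List[str] = field(default_factory=list)  # entry terms pointing here


def build_lookup(entries: List[ThesaurusEntry]) -> Dict[str, str]:
    """Map every descriptor and non-descriptor (lowercased) to its descriptor."""
    lookup: Dict[str, str] = {}
    for entry in entries:
        lookup[entry.descriptor.lower()] = entry.descriptor
        for term in entry.non_descriptors:
            lookup[term.lower()] = entry.descriptor
    return lookup


epidermis = ThesaurusEntry(
    descriptor="Epidermis",
    scope_note="Of plants; for the epidermis of animals use SKIN",
    broader=["Plant tissues"],
    narrower=["Plant cuticle", "Plant hairs", "Root hairs"],
    related=["Peel"],
    non_descriptors=["plant skin"],
)

lookup = build_lookup([epidermis])
print(lookup["plant skin"])  # Epidermis
```

The lookup shows why non-descriptors matter for indexing: a document that says "plant skin" is still mapped to the controlled term.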
7. What has happened so far?
8. What has happened so far? (seriously)
- Publishing: 1 full paper, 2 short papers, 1 poster
- Networking: 2 Google visits
- 2 joint projects with other researchers from Waikato and Google
- PhD goals
- A (better) Kea that works with any vocabulary
- Controlled indexing without controlled vocabulary
- Prove that Kea is better than human indexers
?
?
9. Indexing with any vocabulary
- SKOS format: RDF-based, developed by W3C
- Over 10 thesauri in different domains and languages
- Easily integrated into Kea with Jena (Java RDF API)
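To illustrate what a SKOS thesaurus looks like to an indexer, here is a minimal sketch that reads preferred labels, alternative labels and broader links from a small RDF/XML snippet. It is not the Jena-based integration mentioned above: it uses Python's standard `xml.etree.ElementTree` as a stand-in, handles only the simple `<skos:Concept>` element form, and the `example.org` URIs, the sample data and the function name `load_skos` are made up for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical two-concept thesaurus in RDF/XML.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/c/epidermis">
    <skos:prefLabel xml:lang="en">Epidermis</skos:prefLabel>
    <skos:altLabel xml:lang="en">plant skin</skos:altLabel>
    <skos:broader rdf:resource="http://example.org/c/plant-tissues"/>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/c/plant-tissues">
    <skos:prefLabel xml:lang="en">Plant tissues</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>
"""


def load_skos(xml_text):
    """Return {concept URI: {"pref_label", "alt_labels", "broader"}}."""
    root = ET.fromstring(xml_text)
    concepts = {}
    for node in root.findall(f"{{{SKOS}}}Concept"):
        uri = node.get(f"{{{RDF}}}about")
        pref = node.find(f"{{{SKOS}}}prefLabel")
        concepts[uri] = {
            "pref_label": pref.text if pref is not None else None,
            "alt_labels": [n.text for n in node.findall(f"{{{SKOS}}}altLabel")],
            "broader": [
                n.get(f"{{{RDF}}}resource")
                for n in node.findall(f"{{{SKOS}}}broader")
            ],
        }
    return concepts


concepts = load_skos(SAMPLE)
for uri, info in concepts.items():
    print(uri, info["pref_label"], info["alt_labels"], info["broader"])
```

A full RDF library is still the better choice for real thesauri, since RDF/XML allows several equivalent serialisations that this sketch does not cover.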
10. Controlled indexing with no vocabulary
- For any document collection, create a new controlled vocabulary using Wikipedia
- Problems: word ambiguity, detecting link types
- Once implemented, Kea can index any documents, e.g. blog articles
11. A better Kea?
- Currently
- Indexing consistency among humans: 38% or less
- Professional indexers vs. Kea: 27%
- Problems to work on
- Not enough semantic information
- Lexical Chains!
- Weak learning strategy
- Learn from several humans (weighted keyphrases)
- Multiple-instance learning
- New learning schemes
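The consistency figures and the idea of weighted keyphrases can be made concrete with a short sketch. It assumes Rolling's measure, 2C / (A + B), where C is the number of terms two indexers share and A and B are the sizes of their term sets; the slides do not name the measure used, so this is an assumption. The function names and the three term sets are invented for illustration and are not data from the study.

```python
from collections import Counter
from itertools import combinations


def rolling_consistency(terms_a, terms_b):
    """Rolling's inter-indexer consistency: 2C / (A + B)."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0  # two empty term sets agree trivially
    return 2 * len(a & b) / (len(a) + len(b))


def mean_pairwise_consistency(indexers):
    """Average consistency over all pairs of indexers."""
    pairs = list(combinations(indexers, 2))
    return sum(rolling_consistency(x, y) for x, y in pairs) / len(pairs)


def keyphrase_weights(indexers):
    """Weight each keyphrase by the fraction of indexers who assigned it."""
    counts = Counter(term for terms in indexers for term in set(terms))
    return {term: n / len(indexers) for term, n in counts.items()}


# Hypothetical term sets from three human indexers for one document.
indexers = [
    {"floods", "landslides", "food security"},
    {"floods", "natural disasters", "food security"},
    {"floods", "subsistence farming"},
]

print(round(mean_pairwise_consistency(indexers), 3))
print(keyphrase_weights(indexers)["floods"])
```

Weights of this kind would let a learner treat a term chosen by all indexers as a stronger positive example than a term chosen by only one.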
12. Lexical chains
1.1.4 Impact of Natural Disasters in the Highlands of Central Vietnam
The most damaging hazard experienced in the highlands is
flashflood, as it occurs with little warning.
People, property and livestock may be washed
away. Crops planted on the hillsides are better
protected than staple crops in river valleys,
such as cassava, on which poorer farmers rely
between paddy harvests. High dependence on
subsistence farming renders highland populations
vulnerable to hunger during the flood season.
Floods from swollen rivers can cut off villages
for days or weeks, which could result in
food shortages. Floods with strong currents cause
permanent damage to fields, washing away the
topsoil. Floodwaters also deposit rock and gravel
onto fields. Heavy rains trigger landslides
that cut off roads and communication networks.
natural disaster
hazard
flashflood
strong currents
wash away
swollen rivers
13. Lexical chains (contd)
14. Lexical chains (contd)
- Top 6 lexical chains (length member frequency)
- Each chain reflects one of the main topic areas in the document, judging by keyphrases assigned by professionals (numbers in circles)
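One simple way to form and rank lexical chains is sketched below. It rests on several assumptions that the slides do not confirm: chains are taken to be connected components of a term relatedness graph, the relatedness pairs are hypothetical (in practice they might come from thesaurus links), the term counts are illustrative, and "length member frequency" is read as chain length times the summed frequency of its members.

```python
from collections import defaultdict


def build_chains(term_freq, related_pairs):
    """Group terms into chains: connected components of the relatedness graph."""
    parent = {term: term for term in term_freq}

    def find(term):
        # Union-find lookup with path halving.
        while parent[term] != term:
            parent[term] = parent[parent[term]]
            term = parent[term]
        return term

    for a, b in related_pairs:
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    chains = defaultdict(list)
    for term in term_freq:
        chains[find(term)].append(term)
    return [sorted(members) for members in chains.values()]


def chain_score(chain, term_freq):
    """Assumed score: chain length times summed member frequency."""
    return len(chain) * sum(term_freq[term] for term in chain)


# Illustrative counts for terms in the flood passage (not measured data).
term_freq = {
    "natural disaster": 1, "hazard": 1, "flashflood": 1,
    "strong currents": 1, "wash away": 2, "swollen rivers": 1,
    "crops": 2, "cassava": 1, "subsistence farming": 1,
}
# Hypothetical relatedness links between terms.
related_pairs = [
    ("natural disaster", "hazard"), ("hazard", "flashflood"),
    ("flashflood", "swollen rivers"), ("flashflood", "strong currents"),
    ("strong currents", "wash away"),
    ("crops", "cassava"), ("crops", "subsistence farming"),
]

chains = build_chains(term_freq, related_pairs)
ranked = sorted(chains, key=lambda c: chain_score(c, term_freq), reverse=True)
for chain in ranked:
    print(chain_score(chain, term_freq), chain)
```

With these inputs the flood-related terms form the top-ranked chain, which matches the intuition that the strongest chain marks a main topic of the document.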
15. Outlook for the next 2.5 years
- PhD goals
- A better Kea that works with any vocabulary
- Controlled indexing without controlled vocabulary
- Prove that Kea is better than human indexers
Visit Kea's new homepage: www.nzdl.org/Kea