Title: Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying
1 Chinese Core Ontology Construction from a Bilingual Term Bank
- Chen Yirong, Lu Qin, Li Wenjie, Cui Gaoying
- Department of Computing
- The Hong Kong Polytechnic University
2 Outline
- Introduction
- Related Works
- Algorithm Design: COCA
- Performance Evaluation
- Conclusion
3 Introduction
- What is a core ontology?
- A mid-level ontology
- Bridges the gap between an upper ontology and a domain ontology
4 Concepts and Terminologies
- Upper Ontology
- A general ontology that ensures reusability across different domains (e.g. Computer Program in SUMO)
- Domain Ontology
- An ontology that conceptualizes a specific domain (e.g. Free Software in the IT domain)
- More application dependent, more extents of concepts
- Mid-level Ontology (Core Concept)
- Basic concepts of a domain
- More application independent, more intents of concepts
- Core ontology (e.g. Software)
- Frequently used, able to form other concepts
- Core Terms
- Lexical units of core concepts
5 Related Works
- Manually constructed ontologies
- SUMO
- Well-known upper-level ontology works based on lexicons
- CoreLex (Buitelaar, P., 1998)
- EuroWordNet (Rodríguez, 1998)
- Ontology harmonization: core ontology
- Towards a Core Ontology for Information Integration (M. Doerr, 2003)
- The most similar work
- Enriching Core Ontology with Domain Thesaurus through Concept and Relation Classification (Huang, 2007)
- Uses concept and relation classification to enrich a core ontology
6 Our Previous Works
- Chinese terminology extraction
- Chinese core term extraction (Ji et al., 2007)
- Preliminary work on automatic construction of a core ontology using an English-Chinese term bank (MRCOCA, OntoLex 2007; Chen, 2007)
- Bilingual lexicon
- Extended strings
- Frequency information in synsets
- Weights from extended strings are integrated into the final weight by simple addition
- Mapping to synsets and SUMO achieves an accuracy of only about 50%
7 Issues
- What kinds of concepts should be included?
- How to identify core concepts?
- If through core terms, disambiguation is needed
- What relations to identify, and how?
- Making use of available resources
- Chinese NLP resources are scarce
- English NLP resources are abundant
8 Requirements of Core Ontology
- The concepts must be widely accepted and commonly referenced
- The corresponding core terms must be highly used and productive
- The concepts/terms can be mapped to an upper ontology, so that the core ontology can inherit the attributes provided by the upper ontology
9 Core Ontology Construction Algorithm (COCA) for Chinese
- Extract Chinese core terms from a bilingual term bank
- Map each core term Tc to its English terms
- Map the English terms to WordNet synsets
- Map each synset to an upper ontology concept in SUMO
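The four steps above can be sketched as a chain of lookups. This is a minimal illustration, assuming toy dictionaries in place of CETBank, WordNet and the WordNet-SUMO mappings. All entries, synset identifiers and function names below are hypothetical, and no disambiguation is done at this stage.

```python
# Toy stand-ins for the real resources (all entries are hypothetical).
TERM_BANK = {"软件": ["software", "software package"]}  # Chinese core term -> English terms
WORDNET = {                                             # English term -> candidate synsets
    "software": ["software.n.01"],
    "software package": ["software.n.01"],
}
SUMO_MAP = {"software.n.01": "ComputerProgram"}         # synset -> SUMO concept


def candidate_concepts(core_term):
    """Follow every path core term -> English term -> synset -> SUMO concept."""
    paths = []
    for english in TERM_BANK.get(core_term, []):
        for synset in WORDNET.get(english, []):
            concept = SUMO_MAP.get(synset)
            if concept is not None:
                paths.append((english, synset, concept))
    return paths


paths = candidate_concepts("软件")
```

Each returned tuple is one candidate path. Choosing among competing paths is the job of the later modules.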
10 COCA - Resources Used
- ITCTerm
- A domain-specific core term list (Chen, 2007)
- CETBank
- Chinese-English bilingual term bank
- The 1,500 most productive core terms extracted can serve as suffixes to form more than 50% of the terms in CETBank
- WordNet
- SUMO
- Mappings between WordNet and SUMO
11 The Framework of COCA
12 COCA - Statistical Translation Module
- Translation ambiguity
- Each Chinese core term TC ∈ ITCTerm has a set of translations T_SetE, with TE ∈ T_SetE
- Objective
- To estimate the likelihood of every translation using the extended terms of TC
- P(TE | TC) for all TE ∈ T_SetE
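One simple way to obtain such a likelihood is a relative-frequency estimate over the extended terms. The sketch below assumes that each extended term of TC in the term bank comes with an English translation, and that a single-word candidate TE is supported whenever it occurs in that translation. The counting scheme, the function name and the sample data are illustrative assumptions, not the estimator used in COCA.

```python
from collections import Counter


def translation_likelihood(candidates, extended_translations):
    """Estimate P(TE | TC) from how often each candidate occurs in the
    English translations of the extended terms of TC (assumed scheme)."""
    counts = Counter()
    for english in extended_translations:
        words = english.lower().split()
        for te in candidates:
            if te in words:  # only single-word candidates are matched here
                counts[te] += 1
    total = sum(counts.values())
    if total == 0:
        # No evidence at all: fall back to a uniform distribution.
        return {te: 1.0 / len(candidates) for te in candidates}
    return {te: counts[te] / total for te in candidates}


# Hypothetical data: translations of extended terms whose head is TC.
probs = translation_likelihood(
    ["program", "procedure"],
    ["application program", "system program", "source program", "stored procedure"],
)
```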
13 COCA - Sense Disambiguation Module
- Mapping a given TC to a synset S through its translation set T_SetE(TC)
- Mapping probability of an English term TE taking a synset S, using the frequency information in WordNet
- Mapping probability of TC taking a particular synset S via an English translation TE
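The two probabilities above can be illustrated as follows. The sketch assumes that P(S | TE) is the relative frequency of synset S among the senses of TE, and that the probability of a path from TC to S via TE is the product P(TE | TC) * P(S | TE). The product form, the function names, the synset identifiers and the counts are assumptions made for illustration.

```python
def synset_given_term(sense_counts):
    """P(S | TE): relative frequency of each synset among the senses of TE."""
    total = sum(sense_counts.values())
    if total == 0:
        # No frequency information: treat all senses as equally likely.
        return {s: 1.0 / len(sense_counts) for s in sense_counts}
    return {s: c / total for s, c in sense_counts.items()}


def path_probability(p_te_given_tc, p_s_given_te):
    """P(S | TC via TE), assuming the two mapping steps are independent."""
    return {s: p_te_given_tc * p for s, p in p_s_given_te.items()}


# Hypothetical sense frequencies for one English translation.
p_s = synset_given_term({"sense_a": 6, "sense_b": 2})
path = path_probability(0.5, p_s)
```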
14 COCA - Concept Selection Module
- Combining three features
- multi-path feature
- hypernyms feature
- part-of-speech feature
- Using Union Probability of Independent Events
15 Feature 1: Multi-Paths to Synset
- Multiple paths are
- the paths between a Chinese core term and a synset
- via different English translations
- The feature merges the probabilities of the multiple paths
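A minimal sketch of merging paths, assuming each path contributes P(TE | TC) * P(S | TE) and that the contributions of different translations are summed per synset. The slide does not give the formula, so the summation and all names and numbers here are assumptions.

```python
from collections import defaultdict


def merge_paths(p_te_given_tc, p_s_given_te):
    """Sum, per synset, the probabilities of all paths TC -> TE -> S."""
    merged = defaultdict(float)
    for te, p_te in p_te_given_tc.items():
        for synset, p_s in p_s_given_te.get(te, {}).items():
            merged[synset] += p_te * p_s
    return dict(merged)


# Hypothetical inputs: two translations that share one synset.
scores = merge_paths(
    {"program": 0.75, "procedure": 0.25},
    {"program": {"syn_a": 0.5, "syn_b": 0.5}, "procedure": {"syn_a": 1.0}},
)
```

In this example syn_a is reached through both translations, so it collects more weight than syn_b, which is reached through only one.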
16 Feature 2: Hyponyms in Domain
- Incorporates information on all the extended strings
- An extended string uses the core term as its headword and is a hyponym of the core term
- Length ratio
- Union probability of independent events
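A sketch of how evidence from extended strings might be pooled. It assumes that each extended string headed by the core term supplies its own probability for synset S, that this probability is scaled by the length ratio len(core term) / len(extended string), and that the scaled values are combined as a union of independent events, 1 - prod(1 - p_i). Only the two ingredients (length ratio, union probability) come from the slide. The scaling direction, the per-string probabilities and the function names are assumptions.

```python
def union_probability(probs):
    """P(at least one of several independent events) = 1 - prod(1 - p_i)."""
    none_occur = 1.0
    for p in probs:
        none_occur *= 1.0 - p
    return 1.0 - none_occur


def hyponym_feature(core_term, evidence):
    """evidence: list of (extended_string, probability of synset S from it)."""
    weighted = []
    for extended, p in evidence:
        # Assumption: an extension close in length to the core term counts more.
        ratio = len(core_term) / len(extended)
        weighted.append(ratio * p)
    return union_probability(weighted)


# Hypothetical evidence from two extended strings of the core term.
score = hyponym_feature("软件", [("应用软件", 0.5), ("系统软件", 0.5)])
```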
17 Feature 3: Part of Speech
- Probability of the POS tag pos(S)
- owned by a synset S
- given a core term Tc
- POS tag estimation: heuristics for adjectives, verbs and nouns based on position
18 Integrate Features
- Using Union Probability of Independent Events
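For independent events, the probability that at least one occurs is 1 - prod(1 - p_i). A minimal sketch of applying this to the three feature scores, assuming each score is already a probability in [0, 1]. The function name and the sample values are illustrative.

```python
def integrate_features(multi_path, hyponym, pos):
    """Combine three feature probabilities as a union of independent events."""
    for p in (multi_path, hyponym, pos):
        if not 0.0 <= p <= 1.0:
            raise ValueError("feature scores must be probabilities in [0, 1]")
    return 1.0 - (1.0 - multi_path) * (1.0 - hyponym) * (1.0 - pos)


weight = integrate_features(0.5, 0.5, 0.0)
```

The combined weight never falls below the largest single feature, so one strong feature is enough to give a synset a high weight, and a feature with score 0 leaves the result unchanged.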
19 Evaluation
- Algorithm output
- A pair <Tc_i, Synset_i> for each Chinese core term, with the highest mapping weight
- Evaluation standard
- For each Tc_i, whether its mapping to a synset is the best match with respect to this domain
- Answer preparation
- The answers were prepared manually by two experts in the IT domain, each working separately on the same set of data
20
- Performance
- The evaluation was conducted on the top N most frequent core terms
- The algorithm COCA achieves 71% accuracy (N is 28 in this paper)
- Compared to the result of MRCOCA (Chen, 2007), which achieved only 50%
- Two examples of core term to synset mapping generated by the algorithm are given for ?? and ??.
22 Conclusion
- Evaluation of COCA, repeated on an English-Chinese bilingual term bank with more than 130K entries, shows that the algorithm is
- 42% improved in accuracy compared to MRCOCA (our previous work)
- The three features and the new probability-based algorithm account for the improvement
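The 42 figure is consistent with a relative improvement from the accuracy of 50 reported for MRCOCA to the 71 reported for COCA, both read as percentages. A quick check, with a hypothetical helper name:

```python
def relative_improvement(new, old):
    """Relative gain of `new` over `old`, as a fraction of `old`."""
    return (new - old) / old


gain = relative_improvement(0.71, 0.50)  # COCA vs. MRCOCA accuracy
print(f"{gain:.0%}")  # prints 42%
```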
23
- A term bank can help to quickly construct a domain core ontology by selecting the concept nodes and relations used in the domain
- A bilingual term bank can further introduce the second-language realization of the core ontology effectively and automatically
24 Future Works
- Evaluation of the three features
- How effective they are
- How much they contribute to the final performance
- Consideration of more features, such as abbreviations and the synset of the head word of a core term
- Use of other resources
26 Q&A