Title: Semantic Annotation and Search of Software Artefacts
1 Semantic Annotation and Search of Software Artefacts
- Valentin Tablan
- Kalina Bontcheva
- Danica Damljanovic
2 Some Terminology
- Ontology population: given an ontology, populate
it with instances derived automatically from a
text.
- Annotation (of text): associating labels to text
snippets from a larger document.
- Can be linguistic, semantic, etc.
- Semantic annotation: labels used in annotation
are associated with an ontology.
- Can also include ontology population, as a side
effect.
3 Annotation
4 Semantic Annotation
5 Case Study: Software Artefacts
- The GATE project
- Open source text mining infrastructure tools.
- Structured information
- XML configuration files
- Unstructured Information
- Software documentation (JavaDocs)
- User guide
- Project website
- Publications
- Mailing lists.
6 Case Study: Software Artefacts
- The problem?
- Information overload: thousands of pages.
- Parameters for ANNIE Tokeniser?
- Where to look for the answer?
- The solution?
- Use an ontology as a shared store.
- Populate ontology using semantic annotation.
- Provide search facilities backed by the ontology.
7 The GATE Ontology
8 Ontology Population: Structured Data
9 GATE Ontology - populated
10 Ontology Population: Unstructured Data
- Extract all lexicalisations of ontological
resources
- Names, labels, string property values.
- Normalise the lexicalisations
- Extract morphological root.
- Break text at underscores.
- Segment CamelCaseNames.
- Build a gazetteer with all the lexicalisations.
- Use the gazetteer to recognise mentions in text.
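The normalisation and gazetteer steps above can be sketched in a few lines of Python. This is only an illustration, not the GATE implementation: the resource identifiers in `RESOURCES` are hypothetical, and `naive_root` is a crude stand-in for a real morphological analyser (it only strips a trailing plural "s").

```python
import re

def naive_root(word):
    # Assumption: a crude substitute for morphological analysis that
    # strips one trailing plural "s" (but leaves "class", "has", etc.).
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def tokens(text):
    # Break text at underscores.
    text = text.replace("_", " ")
    # Segment CamelCaseNames at each lower-case/upper-case boundary.
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Lower-case, drop punctuation, reduce each word to its naive root.
    return [naive_root(w) for w in re.findall(r"[a-z0-9]+", text.lower())]

def normalise(text):
    return " ".join(tokens(text))

def build_gazetteer(resources):
    # Map every normalised lexicalisation to the resource it came from.
    gazetteer = {}
    for resource_id, names in resources.items():
        for name in names:
            gazetteer[normalise(name)] = resource_id
    return gazetteer

def find_mentions(text, gazetteer, max_len=4):
    # Longest-match-first lookup of token n-grams in the gazetteer.
    toks = tokens(text)
    mentions, i = [], 0
    while i < len(toks):
        for n in range(min(max_len, len(toks) - i), 0, -1):
            phrase = " ".join(toks[i:i + n])
            if phrase in gazetteer:
                mentions.append((phrase, gazetteer[phrase]))
                i += n
                break
        else:
            i += 1
    return mentions

# Hypothetical resources and lexicalisations, for illustration only.
RESOURCES = {
    "gate:DefaultTokeniser": ["DefaultTokeniser", "ANNIE Tokeniser"],
    "gate:ProcessingResource": ["processing_resource"],
}
GAZETTEER = build_gazetteer(RESOURCES)
print(find_mentions("Parameters for ANNIE tokenisers?", GAZETTEER))
```

Because the same normalisation is applied to the ontology names and to the text, "ANNIE tokenisers" in running text and "ANNIE Tokeniser" in the ontology meet at the same gazetteer key.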
11 Ontology Population: Unstructured Data
12 Information Access: Conceptual Retrieval
- Can make use of abstractions and generalisations
powered by ontology back-end.
- Provides retrieval options not available to
full-text search, e.g.
- Capitals of countries in Asia
- Query language very complex, somewhat similar to
SQL; not really suitable for end users.
13 Capitals of countries in Asia (simplified SeRQL)
- select c0, p1, c3, p4, i6
- from
- c0 rdf:type <pupp:Capital>,
- c3 p1 c0,
- c3 rdf:type <pupp:Country>,
- c3 p4 i6,
- i6 rdf:type <pupp:Continent>
- where
- p1=<pupp:hasCapital> and
- p4=<ptop:subRegionOf> and
- i6=<kim:Continent_T.2>
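To make the shape of the query above concrete, the same pattern can be evaluated over a toy in-memory set of triples. This is a sketch only: the triples, the un-prefixed names and the `capitals_in` helper are invented for illustration and are not the KIM knowledge base or a SeRQL engine. The variable names `c0` and `c3` follow the query.

```python
# Toy triples (subject, predicate, object); illustrative data only.
TRIPLES = {
    ("Tokyo", "rdf:type", "Capital"),
    ("Japan", "rdf:type", "Country"),
    ("Japan", "hasCapital", "Tokyo"),
    ("Japan", "subRegionOf", "Asia"),
    ("Asia", "rdf:type", "Continent"),
    ("Paris", "rdf:type", "Capital"),
    ("France", "rdf:type", "Country"),
    ("France", "hasCapital", "Paris"),
    ("France", "subRegionOf", "Europe"),
    ("Europe", "rdf:type", "Continent"),
}

def capitals_in(continent, triples):
    """Return (capital, country) pairs matching the query pattern."""
    results = []
    for c3, p1, c0 in triples:
        if p1 != "hasCapital":
            continue
        # Each check below mirrors one line of the "from" clause.
        if (c0, "rdf:type", "Capital") not in triples:
            continue
        if (c3, "rdf:type", "Country") not in triples:
            continue
        if (c3, "subRegionOf", continent) not in triples:
            continue
        if (continent, "rdf:type", "Continent") not in triples:
            continue
        results.append((c0, c3))
    return sorted(results)

print(capitals_in("Asia", TRIPLES))
```

A full-text search for "capitals of countries in Asia" has no way to join these facts; the ontology-backed query can, which is the point of conceptual retrieval.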
14 QuestIO: Question-based Interface to Ontologies
- Natural Language interface for querying knowledge
bases.
- Easy to use: requires no training.
- Domain independent.
- Works with short, ungrammatical queries
(Google-like).
15 QuestIO Initialisation
- Vocabulary built automatically from the KB (hence
domain independent).
- Extract all possible textual descriptions from
the ontology.
- Normalise for morphology, lack of tokenisation,
CamelCasing, etc.
- Represent all lexicalisations in a GATE
gazetteer (long init time, fast run time).
16 Query Construction
- Identify known concepts in the NL query
- Normalise the query for morphology, etc.
- Find concepts by matching lexicalisations from
the gazetteer.
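- Example query: "Capitals of countries located in Asia"
- [Figure: the query annotated with the matched ontology resources; labels shown: Capital City, Country, Continent, Continent_T4, Asia]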
17 Query Construction (II)
- Build a SeRQL query by finding appropriate
properties to link the concepts found.
- Build a list of candidate properties based on
ontology schema (using domain and range
constraints).
- Rank the properties.
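Candidate selection by domain and range can be sketched as follows. The small schema here (`SUBCLASS_OF`, `PROPERTIES`, including `hasMayor`) is hypothetical and much simpler than a real ontology; the sketch only shows the idea that a property is a candidate link between two concepts when its declared domain and range each cover one of them, in either direction.

```python
# Hypothetical schema: class -> direct superclass.
SUBCLASS_OF = {
    "Capital": "City",
    "City": "Location",
    "Country": "Location",
    "Continent": "Location",
}

# Hypothetical properties: name -> (domain, range).
PROPERTIES = {
    "hasCapital": ("Country", "Capital"),
    "subRegionOf": ("Location", "Location"),
    "hasMayor": ("City", "Person"),
}

def ancestors(cls):
    """Return cls followed by all of its superclasses."""
    chain = [cls]
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        chain.append(cls)
    return chain

def candidate_properties(cls_a, cls_b):
    """List (property, subject class, object class) candidates."""
    found = []
    for name, (domain, rng) in PROPERTIES.items():
        if domain in ancestors(cls_a) and rng in ancestors(cls_b):
            found.append((name, cls_a, cls_b))
        elif domain in ancestors(cls_b) and rng in ancestors(cls_a):
            found.append((name, cls_b, cls_a))
    return found

print(candidate_properties("Capital", "Country"))
```

With this schema both `hasCapital` and the more general `subRegionOf` qualify as links between Capital and Country, which is why a ranking step is needed afterwards.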
18 Ranking Properties
- We combine three types of scores:
- Similarity score: compare query fragments with
candidate property names using the Levenshtein string
similarity metric.
- Specificity score: based on the subproperty
relation in the ontology definition.
19 Ranking Properties (II)
- Distance score: inferring an implicit specificity
of a property based on the level of the classes
that are used as its domain and range.
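Of the three scores, the similarity score is the easiest to illustrate. The sketch below computes the Levenshtein edit distance and turns it into a similarity in [0, 1]; the normalisation by the longer string's length is an assumption, since the slides do not say how the distance is scaled or how the three scores are combined. The property names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    # Assumption: scale the distance by the longer string's length.
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def rank_by_similarity(fragment, property_names):
    """Order candidate property names, most similar first."""
    return sorted(property_names,
                  key=lambda name: similarity(fragment, name),
                  reverse=True)

# Hypothetical candidates for the query fragment "located in".
print(rank_by_similarity("located in",
                         ["hasCapital", "locatedIn", "subRegionOf"]))
```

The specificity and distance scores would need the property and class hierarchies as input, so they are left out of this sketch.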
20 Query Execution
- Execute queries in ranking order until results
are obtained.
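The execution strategy amounts to a short loop over the ranked candidate queries. In this sketch `execute` is a hypothetical callable standing in for the knowledge-base back-end, and the query strings are placeholders.

```python
def first_results(ranked_queries, execute):
    """Run queries best-first and stop at the first non-empty result."""
    for query in ranked_queries:
        results = execute(query)
        if results:
            return query, results
    # No candidate query produced any results.
    return None, []

# Hypothetical back-end: only the second candidate has answers.
fake_store = {"query-1": [], "query-2": ["Tokyo"], "query-3": ["Paris"]}
print(first_results(["query-1", "query-2", "query-3"], fake_store.get))
```

Stopping at the first non-empty result means lower-ranked queries are never run, so the quality of the ranking directly determines which answer the user sees.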
21 Evaluation: coverage and correctness
- 36 questions extracted from GATE list
- 22 out of 36 questions were answerable (the
answer was in the knowledge base)
- 12 correctly answered (54.5%)
- 6 with partially correct answers (27.3%)
- system failed to create a SeRQL query or created
a wrong one for 4 questions (18.2%)
- Total score
- 68% correctly answered
- 32% did not answer at all or did not answer
correctly
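The per-category percentages follow directly from the 22 answerable questions. The slides do not state how the 68% total is derived; one reading that reproduces it is to count each partially correct answer as half a correct one. That weighting is an assumption, not something the source confirms.

```python
answerable = 22
correct, partial, failed = 12, 6, 4
assert correct + partial + failed == answerable

def pct(count):
    """Share of the answerable questions, as a percentage."""
    return round(100 * count / answerable, 1)

shares = {
    "correct": pct(correct),
    "partial": pct(partial),
    "failed": pct(failed),
}

# Assumption: a partially correct answer earns half credit.
total_score = round(100 * (correct + 0.5 * partial) / answerable)

print(shares, total_score)
```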
22 Evaluation on scalability and portability
- Sizes of the knowledge bases created based on
- GATE ontology: http://gate.ac.uk/ns/gate-ontology
- Travel ontology: http://goodoldai.org.yu/ns/tgproton.owl
- Ontologies were not customised or changed
prior to use with QuestIO!
23 Evaluation on scalability and portability
24 DEMO
- http://www.gate.ac.uk/questio-client-app/search.jsp
- Geography ontology
- Continents, countries, cities (capitals only).
- Part of PROTON KIM KB
- Example questions
- Countries in Europe or North America
- Capitals in Asia
- Capitals of countries (located) in Africa
- ...