Multilingual support to a proposed Semantic Web architecture - PowerPoint PPT Presentation

Slides: 43

Provided by: Andrea629

Transcript and Presenter's Notes

1
Multilingual support to a proposed Semantic Web architecture

Andrea Ferrato
TOP-UIC MS Thesis, 2003/04
Advisor: Laura Farinetti

2
Purpose of this work

Design and (partially) implement multilingual support on a pre-existing Semantic Web platform
Provide an approach that is as generic as possible
Exploit features of the pre-existing architecture
Cope with the typically chaotic structure of currently available resources

3
Outline

Semantic Web
Multilinguality
The DOSE platform
Proposed solution
Given implementation
Experimental results
Conclusions

4
Semantic Web

The next evolutionary stage for WWW
Goal: make network data usable by intelligent agents
Deployable only on top of existing infrastructure
Two pressing tasks:
Transform existing contents to include semantics
Set up ad hoc user agents to work on them

5
Transform existing contents

Basic data units: resources
Every single information entity that can be semantically isolated
Features to be given:
Identification: URI
Structure: XML
Meaning: RDF
Knowledge: ontologies

6
Set up ad hoc user agents

Major players in Semantic Web deployment
Invoked by users, can proceed autonomously
Key facilities to be supported:
Logic
Proof
Trust

7
Semantic Web layer cake view (Berners-Lee)
[Figure: layer cake diagram. Recoverable labels: RDF + RDF Schema, Ontology vocabulary, Logic, Proof, Trust, Digital signatures; side annotations: Self-desc. doc., Data, Rules]
8
Multilinguality

The extension to multiple languages of tasks
already performed in a monolingual context
Typical issues from cross-language mapping:
Lexical gaps
Role of the context
Lack of pre-acquired knowledge

9
Multilinguality and Semantic Web

A problem of Text Retrieval in multiple languages (NLP)
Start from popular approaches (Controlled Vocabulary, Free Text, etc.)
Two main requirements:
Recognize the language ID of resources
Map contents independently of language

10
Language ID retrieval

Two possible scenarios:
Retrieve a declared ID by parsing the resource
Recreate the ID by analyzing the resource
When retrieving a declared language attribute, conform to existing language specification standards

11
Language ID specification
Content-language
CSS-level declarations
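
The "retrieve a declared ID via resource parsing" scenario can be sketched as follows. This is a minimal illustration using Python's standard html.parser, not the thesis implementation: the class and function names are hypothetical, and only the Content-Language meta declaration and lang attributes are handled (CSS-level declarations are left out).

```python
from html.parser import HTMLParser


class DeclaredLanguageParser(HTMLParser):
    """Collect explicit language declarations found in an HTML document.

    Hypothetical helper for illustration; not part of DOSE.
    """

    def __init__(self):
        super().__init__()
        self.meta_language = None      # from <meta http-equiv="Content-Language">
        self.element_languages = []    # (tag, lang) pairs, in document order

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names; values may be None.
        attributes = dict(attrs)
        if tag == "meta":
            http_equiv = (attributes.get("http-equiv") or "").lower()
            if http_equiv == "content-language":
                content = (attributes.get("content") or "").strip().lower()
                self.meta_language = content or None
        lang = attributes.get("lang")
        if lang:
            self.element_languages.append((tag, lang.strip().lower()))


def declared_languages(html_text):
    """Return (meta_language, [(tag, lang), ...]) for the given HTML text."""
    parser = DeclaredLanguageParser()
    parser.feed(html_text)
    parser.close()
    return parser.meta_language, parser.element_languages
```

A caller would still need to validate the returned values against the language tag standard in use before trusting them.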
12
Language-independent contents mapping

Investigate the form/meaning relationship
Ontology design is crucial
Three main requirements:
Consistency (based on linguistic evidence)
Flexibility (meaningful for all languages)
Extensibility (easy addition of new languages)

13
Ontology models

Conceptual
Founded upon general knowledge
Language-based
Built on a particular language
Interlingua
A combination of the above two
None is clearly superior for multilinguality

14
The DOSE platform

Distributed Open Semantic Elaboration platform
Key features:
Modularity
Scalability
Semantic integration
Main functionalities offered:
Annotation
Search

15
DOSE layered view
Service layer
Front-end layer
Back-end layer
16
DOSE distributed view
17
DOSE annotation
Substructure Extractor
Indexer
Fragment Retriever
Semantic Mapper
18
DOSE search
Semantic Mapper
Search Engine
Fragment Retriever
19
DOSE and multilinguality

Traditionally: a new ontology for each different language
DOSE: the ontology language is totally independent of the synset language
Use synsets to store lexical representations only
Let the ontology focus on knowledge modeling

20
Practical requirements for multilinguality

Indexing
Recognize the language of resources and set up the system accordingly
Store language IDs with annotations
Search
Interpret user queries expressed in natural language
Allow for cross-language search tasks

21
Extension to language

Proposed approach: one ontology, many synsets
A concept is expressed by a different synset for each supported language
Each synset contains multiple lexical representations of a related concept in a single language
Separate semantic and textual layers
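
The "one ontology, many synsets" idea can be sketched as a small data structure. This is only an illustration under stated assumptions: the Concept class, the build_lexical_index helper, the concept ID, and the example words are all hypothetical, and the actual DOSE synsets are stored in RDF rather than in Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A language-neutral ontology concept with one synset per language."""
    concept_id: str
    synsets: dict = field(default_factory=dict)  # language ID -> set of words

    def add_synset(self, language, words):
        # Supporting a new language only means adding words under a new key.
        self.synsets.setdefault(language, set()).update(w.lower() for w in words)


def build_lexical_index(concepts):
    """Map (language, word) -> set of concept IDs (the textual layer)."""
    index = {}
    for concept in concepts:
        for language, words in concept.synsets.items():
            for word in words:
                index.setdefault((language, word), set()).add(concept.concept_id)
    return index


# Hypothetical example: one concept, two synsets.
job = Concept("part_time_job")
job.add_synset("en", ["part-time job", "part-time work"])
job.add_synset("it", ["lavoro part-time", "lavoro a tempo parziale"])
index = build_lexical_index([job])
```

Because both the English and the Italian words resolve to the same concept ID, annotations and queries in either language meet at the semantic layer.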

22
Extension to language (contd)
(one concept, three synsets)
23
Advantages

Reduced implementation requirements
Ontology design
Resource occupation
Simplicity (in ontology management)
Flexibility
A new language just brings a new bag of synsets
Expansion of indexing word set

24
Language recognition

Proposed approach:
Retrieve language IDs whenever present
Otherwise, recognize language(s)
Design constraints:
To be activated in the annotation phase
Refined at the document substructure level
Must cope with the generally low authoring quality of Web documents

25
Language recognition (contd)

Validate explicit request
Retrieve lang value
Guess via heuristics
Retrieve from ancestor
Accept default
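
The steps listed above can be read as a fallback chain. The sketch below assumes they are tried in the order shown on the slide, which the transcript does not confirm; the function names, the supported-language tuple, and the tiny stopword heuristic are all hypothetical stand-ins for whatever the Language Detector actually does.

```python
# Hypothetical stopword lists; a real detector would use richer evidence.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to", "in"},
    "it": {"il", "la", "di", "e", "che", "per"},
}


def guess_language(text, min_hits=2):
    """Guess a language by counting stopword hits; None if inconclusive."""
    words = text.lower().split()
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_lang, best_score = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == best_score
    if best_score < min_hits or tied:
        return None
    return best_lang


def resolve_language(requested=None, lang_attribute=None, text="",
                     ancestor_language=None, default="en",
                     supported=("en", "it")):
    """Try each source in turn and return the first usable language ID."""
    # 1-2. Explicit request, then the element's own lang value, if valid.
    for candidate in (requested, lang_attribute):
        if candidate and candidate.lower() in supported:
            return candidate.lower()
    # 3. Heuristic guess from the text itself.
    guessed = guess_language(text)
    if guessed:
        return guessed
    # 4. Language inherited from an ancestor substructure.
    if ancestor_language in supported:
        return ancestor_language
    # 5. System default.
    return default
```

Short fragments such as a single word give the heuristic nothing to work with, which is why the ancestor and default steps matter for substructure-level recognition.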

26
Current implementation

A new English synset to pair with a disability ontology (500 concepts)
A set of 20 bilingual documents (Italian, English) on disability
A basic Language Detector XML-RPC module implemented in Java
Testing scenarios:
Parallel annotation
Language recognition

27
Implementation work

Language Detector module (Java, 1000 lines of code)
Additions to pre-existing modules (Java, 1000 lines of code)
English synset (RDF, 3500 lines of code)
24 Mb of annotations produced
Simulation results analysis (a 600x40 .XLS for <BODY>, a 925x250 .XLS for <Hx>)

28
Multilingual DOSE in action
29
Parallel annotation

Two parallel documents have:
The same structure elements with the same contents
Two different languages of expression
Goal: demonstrate that two sets of parallel documents are (almost) symmetrically mapped to the same concepts (parallel annotation)
Both sets indexed separately, with language explicitly specified

30
Parallel annotation (contd)

Test methodology: Vector Space Model
Document fragments described as vectors
Dimensions are ontology concepts
Components are weighted (tf/idf) occurrences of such concepts
The correlation between two fragments is quantified as the cosine of the angle between their vectors
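
Building the concept vectors could look roughly like this. The slide does not say which tf/idf variant was used, so the sketch assumes the common form (raw count times log(N/df)); the function name and the fragment and concept identifiers are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(fragments):
    """Turn concept occurrences into weighted vectors.

    fragments: dict mapping a fragment ID to the list of ontology concepts
    annotated in it (one entry per occurrence).
    Returns: dict mapping fragment ID -> {concept: weight}.
    Assumes weight = count * log(N / df), one common tf/idf variant.
    """
    total = len(fragments)
    document_frequency = Counter()
    for concepts in fragments.values():
        document_frequency.update(set(concepts))  # count each fragment once

    vectors = {}
    for fragment_id, concepts in fragments.items():
        counts = Counter(concepts)
        vectors[fragment_id] = {
            concept: count * math.log(total / document_frequency[concept])
            for concept, count in counts.items()
        }
    return vectors


# Hypothetical annotations for three fragments.
example = {
    "a": ["job", "job", "retirement"],
    "b": ["retirement"],
    "c": ["pension"],
}
weights = tfidf_vectors(example)
```

With this variant a concept that occurs in every fragment gets weight zero, so it contributes nothing to the correlation between fragments.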

31
Parallel annotation (contd)
IT /html/body/p[3]: X = Part-time job (2.5), Y = Retirement (0)
EN /html/body/p[3]: X = Part-time job (1.5), Y = Retirement (1.5)
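
The correlation for the two example fragments on this slide can be worked out with a few lines of code. This is a minimal sketch: the cosine function and the dictionary representation are illustrative choices, while the weights are the ones shown on the slide.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse concept vectors (dicts)."""
    concepts = set(u) | set(v)
    dot = sum(u.get(c, 0.0) * v.get(c, 0.0) for c in concepts)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a fragment with no annotated concepts correlates with nothing
    return dot / (norm_u * norm_v)


# Weights taken from the slide: X = Part-time job, Y = Retirement.
it_fragment = {"Part-time job": 2.5, "Retirement": 0.0}
en_fragment = {"Part-time job": 1.5, "Retirement": 1.5}

correlation = cosine(it_fragment, en_fragment)
print(round(correlation, 4))  # 0.7071, i.e. an angle of 45 degrees
```

The Italian vector lies on the X axis and the English one on the diagonal, so the angle between them is 45 degrees and the correlation is about 0.71.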
32
Parallel annotation results at <BODY> level
33
Correlation results at <BODY> level
34
Correlation results at <BODY> level (alt)
35
Parallel annotation results at <Hx> level
36
Parallel annotation notes

Parallel and nonparallel pairs can be grouped as two different distributions (i.e. Gaussian distributions)
Average values of the two distributions are clearly separated, at both the <BODY> and <Hx> levels
This shows that the indexing system is able to annotate relevant document fragments independently of language

37
Language recognition

Separate testing on the same document set
Italian and English documents are alternated in batch processing
Avoid reuse of default settings for contiguous documents of the same language
Two ways to retrieve ancestor language:
Via Annotation Repository (acceptable)
Via a Language Stack (still inefficient)

38
Annotation Repository vs. Language Stack
All cyan, underlined words are to be annotated (included in the synsets)
<BODY lang="en">
<H1 lang="it">Passatempi</H1>
<H2 lang="en">Board Games</H2>
<P>Gomuku</P>
<P>Dama</P>
Language Stack: Dama is ignored (language en inherited from <H2>)
Annotation Repository: Dama is annotated (language it inherited from <H1>, annotated)
39
Language recognition results (via Annotation Repository)
40
Conclusions