1
Multilingual support to a proposed Semantic Web
architecture
  • Andrea Ferrato
  • TOP-UIC MS Thesis, 2003/04
  • Advisor: Laura Farinetti

2
Purpose of this work
  • Design and (partially) implement multilingual
    support on a pre-existing Semantic Web platform
  • Provide an approach as generic as possible
  • Exploit features of the pre-existing architecture
  • Cope with the average chaotic structure of
    resources currently available

3
Outline
  • Semantic Web
  • Multilinguality
  • The DOSE platform
  • Proposed solution
  • Given implementation
  • Experimental results
  • Conclusions

4
Semantic Web
  • The next evolutionary stage for WWW
  • Goal: make network data usable by intelligent
    agents
  • Deployable only on top of existing infrastructure
  • Two pressing tasks:
  • Transform existing contents to include semantics
  • Set up ad hoc user agents to work on them

5
Transform existing contents
  • Basic data units: resources
  • Every single information entity that can be
    semantically isolated
  • Features to be given:
  • Identification: URI
  • Structure: XML
  • Meaning: RDF
  • Knowledge: ontologies

6
Set up ad hoc user agents
  • Major players in Semantic Web deployment
  • Invoked by users, can proceed autonomously
  • Key facilities to be supported
  • Logic
  • Proof
  • Trust

7
Semantic Web layer cake view (Berners-Lee)
[Figure: layer cake diagram. Labels recoverable from the slide: RDF + RDF Schema, Ontology vocabulary, Logic, Proof, Trust, Digital signatures; side labels: Data, Self-describing document, Rules.]
8
Multilinguality
  • The extension to multiple languages of tasks
    already performed in a monolingual context
  • Typical issues from cross-language mapping
  • Lexical gaps
  • Role of the context
  • Lack of pre-acquired knowledge

9
Multilinguality and Semantic Web
  • A problem of Text Retrieval in multiple languages
    (NLP)
  • Start from popular approaches (Controlled
    Vocabulary, Free text, etc.)
  • Two main requirements
  • Recognize language ID of resources
  • Map contents independently from language

10
Language ID retrieval
  • Two possible scenarios
  • Retrieve a given ID via resource parsing
  • Recreate the ID via resource analysis
  • When retrieving a declared language attribute,
    conform to existing language specification
    standards

11
Language ID specification
Content-language
CSS-level declarations
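
As an illustration of retrieving a declared language ID via resource parsing, the sketch below reads two common declarations from an HTML resource: a Content-Language meta element and lang attributes on elements. It is a minimal example built on Python's standard html.parser module, not the Language Detector described in this presentation. The class and function names are hypothetical, and CSS-level declarations are not covered.

```python
from html.parser import HTMLParser


class DeclaredLanguageParser(HTMLParser):
    """Collects language declarations found while parsing an HTML resource."""

    def __init__(self):
        super().__init__()
        self.meta_language = None      # value of <meta http-equiv="Content-Language">
        self.element_languages = []    # (tag, lang) pairs in document order

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        attributes = dict(attrs)
        http_equiv = (attributes.get("http-equiv") or "").lower()
        if tag == "meta" and http_equiv == "content-language":
            self.meta_language = attributes.get("content")
        if attributes.get("lang"):
            self.element_languages.append((tag, attributes["lang"]))


def declared_languages(html_text):
    """Return (meta language or None, list of (tag, lang) declarations)."""
    parser = DeclaredLanguageParser()
    parser.feed(html_text)
    return parser.meta_language, parser.element_languages
```

A resource with no declarations yields (None, []), which is the case where the language has to be recreated by analysis instead.
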
12
Language-independent contents mapping
  • Investigate the form/meaning relationship
  • Ontology design is crucial
  • Three main requirements
  • Consistency (based on linguistic evidence)
  • Flexibility (meaningful for all languages)
  • Extendibility (easy addition of new languages)

13
Ontology models
  • Conceptual
  • Founded upon general knowledge
  • Language-based
  • Built on a particular language
  • Interlingua
  • A combination of the above two
  • None is definitely superior for multilinguality

14
The DOSE platform
  • Distributed Open Semantic Elaboration platform
  • Key features
  • Modularity
  • Scalability
  • Semantic integration
  • Main functionalities offered
  • Annotation
  • Search

15
DOSE layered view
Service layer
Front-end layer
Back-end layer
16
DOSE distributed view
17
DOSE annotation
Substructure Extractor
Indexer
Fragment Retriever
Semantic Mapper
18
DOSE search
Semantic Mapper
Search Engine
Fragment Retriever
19
DOSE and multilinguality
  • Traditionally: a new ontology for each different
    language
  • DOSE: the ontology language is totally
    independent of the synset language
  • Use synsets to store lexical representations only
  • Let the ontology focus on knowledge modeling

20
Practical requirements for multilinguality
  • Indexing
  • Recognize the language of resources and set up
    the system accordingly
  • Store language IDs with annotations
  • Search
  • Interpret user queries expressed in natural
    language
  • Allow for cross-language search tasks

21
Extension to language
  • Proposed approach: one ontology, many synsets
  • A concept is expressed by a different synset for
    each supported language
  • Each synset contains multiple lexical
    representations of a related concept in a single
    language
  • Separate semantic and textual layers
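
The "one ontology, many synsets" idea can be sketched as a mapping from a language-neutral concept to one synset per language. This is only an illustration of the data structure, not the DOSE storage format (the presentation stores synsets in RDF). The concept identifier, the words in each synset, and the function names are all hypothetical.

```python
# Hypothetical concept identifier and words; not taken from the DOSE ontology.
SYNSETS = {
    "concept:part_time_job": {
        "en": {"part-time job", "part-time work"},
        "it": {"lavoro part-time", "lavoro a tempo parziale"},
    },
}


def concepts_for(term, language, synsets=SYNSETS):
    """Return the concepts whose synset for `language` contains `term`."""
    term = term.lower()
    return {
        concept
        for concept, by_language in synsets.items()
        if term in by_language.get(language, set())
    }


def add_language(synsets, language, new_synsets):
    """Attach one more synset per concept; the concepts themselves are untouched."""
    for concept, words in new_synsets.items():
        synsets.setdefault(concept, {})[language] = set(words)
```

Because a term in any supported language resolves to the same concept identifier, a query in one language can reach annotations produced from resources in another.
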

22
Extension to language (contd)
(one concept, three synsets)
23
Advantages
  • Reduced implementation requirements
  • Ontology design
  • Resource occupation
  • Simplicity (in ontology management)
  • Flexibility
  • A new language just brings a new bag of synsets
  • Expansion of indexing word set

24
Language recognition
  • Proposed approach
  • Retrieve language IDs whenever present
  • Otherwise, recognize language(s)
  • Design constraints
  • To be activated in the annotation phase
  • Refined at the document substructure level
  • Has to deal with the average low authoring
    quality of Web documents

25
Language recognition (contd)
  1. Validate explicit request
  2. Retrieve lang value
  3. Guess via heuristics
  4. Retrieve from ancestor
  5. Accept default
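
The five steps above can be read as a fallback chain in which each step is tried only if the previous ones gave no usable answer. The sketch below follows that order. The set of supported languages, the parameter names, and the stop-word heuristic are assumptions for illustration; the presentation does not say which heuristics the Language Detector uses.

```python
SUPPORTED = {"en", "it"}  # assumed set of languages that have synsets

# Toy heuristic data; a real detector would need far richer evidence.
STOPWORDS = {
    "en": {"the", "and", "of", "is"},
    "it": {"il", "la", "e", "di", "che"},
}


def guess_by_stopwords(text):
    """Guess a language by counting stop words; None when nothing matches."""
    words = text.lower().split()
    scores = {lang: sum(w in stop for w in words) for lang, stop in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def recognize_language(requested=None, lang_attribute=None, text="",
                       ancestor_language=None, default="en", guess=None):
    """Apply the five steps in order and return the first usable language."""

    def valid(code):
        # Keep only the primary subtag ("en-GB" -> "en") and check support.
        if not code:
            return None
        primary = code.strip().lower().split("-")[0]
        return primary if primary in SUPPORTED else None

    steps = (
        lambda: valid(requested),                       # 1. validate explicit request
        lambda: valid(lang_attribute),                  # 2. retrieve lang value
        lambda: valid(guess(text)) if guess else None,  # 3. guess via heuristics
        lambda: valid(ancestor_language),               # 4. retrieve from ancestor
    )
    for step in steps:
        language = step()
        if language:
            return language
    return default                                      # 5. accept default
```
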

26
Current implementation
  • A new English synset to couple with a disability
    ontology (500 concepts)
  • A set of 20 bilingual documents (Italian,
    English) on disability
  • A basic Language Detector XML-RPC module
    implemented in Java
  • Testing scenarios
  • Parallel annotation
  • Language recognition

27
Implementation work
  • Language Detector module (Java, 1000 lines of
    code)
  • Additions to pre-existing modules (Java, 1000
    lines of code)
  • English synset (RDF, 3500 lines of code)
  • 24 Mb of annotations produced
  • Simulation results analysis (a 600x40 .XLS for
    <BODY>, a 925x250 .XLS for <Hx>)

28
Multilingual DOSE in action
29
Parallel annotation
  • Two parallel documents have
  • The same structure elements with the same
    contents
  • Two different languages of expression
  • Goal: demonstrate that two sets of parallel
    documents are (almost) symmetrically mapped to
    the same concepts (parallel annotation)
  • Both sets indexed separately, with language
    explicitly specified

30
Parallel annotation (contd)
  • Test methodology: Vector Space Model
  • Document fragments described as vectors
  • Dimensions are ontology concepts
  • Components are weighted (tf/idf) occurrences of
    such concepts
  • The correlation between two fragments is
    quantified as the cosine of the angle between
    their vectors
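
A minimal sketch of how fragment vectors over ontology concepts might be built. The slide only says that components are tf/idf-weighted; the specific variant used here (raw term frequency times log(N/df)) is an assumption, as are the function name and the input layout.

```python
import math
from collections import Counter


def tfidf_vectors(fragments):
    """Build one sparse vector per fragment.

    `fragments` maps a fragment id to the list of concept ids annotated in it
    (one entry per occurrence). Each vector maps concept id -> weight.
    """
    total = len(fragments)
    document_frequency = Counter()
    for concepts in fragments.values():
        document_frequency.update(set(concepts))  # count each fragment once

    vectors = {}
    for fragment_id, concepts in fragments.items():
        counts = Counter(concepts)
        vectors[fragment_id] = {
            # A concept present in every fragment gets weight 0 (log 1).
            concept: tf * math.log(total / document_frequency[concept])
            for concept, tf in counts.items()
        }
    return vectors
```
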

31
Parallel annotation (contd)
IT/html/body/p3: X = Part-time job (2.5),
Y = Retirement (0)
EN/html/body/p3: X = Part-time job (1.5),
Y = Retirement (1.5)
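
The example on this slide can be worked through as a cosine computation. The sketch assumes the slide is read as two concept dimensions, Part-time job and Retirement, with weights (2.5, 0) for the Italian fragment and (1.5, 1.5) for the English one. The function and variable names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (concept -> weight)."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a fragment with no annotations correlates with nothing
    return dot / (norm_u * norm_v)


it_fragment = {"Part-time job": 2.5, "Retirement": 0.0}
en_fragment = {"Part-time job": 1.5, "Retirement": 1.5}

similarity = cosine(it_fragment, en_fragment)
print(round(similarity, 3))  # 0.707, i.e. an angle of 45 degrees
```
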
32
Parallel annotation results at <BODY> level
33
Correlation results at <BODY> level
34
Correlation results at <BODY> level (alt)
35
Parallel annotation results at <Hx> level
36
Parallel annotation notes
  • Parallel and nonparallel pairs can be grouped as
    two different distributions
  • i.e. Gaussian distributions
  • Average values of the two distributions are
    clearly separated, both for <BODY> and <Hx>
    levels
  • This shows that the indexing system is able to
    annotate relevant document fragments
    independently of language

37
Language recognition
  • Separate testing on the same document set
  • Italian and English documents are alternated in
    batch processing
  • Avoid reuse of default settings for contiguous
    documents of the same language
  • Two ways to retrieve ancestor language
  • Via Annotation Repository (acceptable)
  • Via a Language Stack (still inefficient)

38
Annotation Repository vs. Language Stack
All cyan, underlined words are to be annotated
(they are included in the synsets)
<BODY lang="en">
<H1 lang="it">Passatempi</H1>
<H2 lang="en">Board Games</H2>
<P>Gomuku</P>
<P>Dama</P>
Language Stack: Dama is ignored (language en
inherited from <H2>)
Annotation Repository: Dama is annotated
(language it inherited from <H1>, annotated)
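
One possible reading of this slide, sketched under explicit assumptions: the Language Stack hands an undeclared element the most recently declared language, while the Annotation Repository hands it the language of the most recent element that actually has stored annotations. The flags marking which elements were annotated are assumed for this sketch, since the slide's highlighting is lost in the transcript. The function names are hypothetical and this is not the DOSE implementation.

```python
# (tag, declared lang or None, text, has stored annotation) in document order.
# The annotation flags are assumptions made for this sketch.
ELEMENTS = [
    ("h1", "it", "Passatempi", True),
    ("h2", "en", "Board Games", False),
    ("p", None, "Gomuku", False),
    ("p", None, "Dama", False),
]


def language_via_stack(elements, index, default="en"):
    """Most recently declared language before the element at `index`."""
    for _tag, lang, _text, _annotated in reversed(elements[:index]):
        if lang:
            return lang
    return default  # here: the language declared on <BODY>


def language_via_repository(elements, index, default="en"):
    """Language of the most recent element that has stored annotations."""
    for _tag, lang, _text, annotated in reversed(elements[:index]):
        if lang and annotated:
            return lang
    return default


dama = 3  # position of <P>Dama</P> in ELEMENTS
print(language_via_stack(ELEMENTS, dama))       # en
print(language_via_repository(ELEMENTS, dama))  # it
```

Under this reading, Dama is looked up in the English synsets by the first strategy and in the Italian synsets by the second, which matches the outcome described on the slide.
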
39
Language recognition results (via Annotation
Repository)
40
Conclusions
  • Typical issues discussed
  • Overall validity of the approach shown
  • Further work and improvements
  • Synset composition
  • Annotation testing with more languages
  • Optimize proposed language recognition
    techniques, add new ones

41
Thank you
  • Questions?

42
Language recognition (2)