Title: Multilingual support to a proposed Semantic Web architecture
1Multilingual support to a proposed Semantic Web
architecture
- Andrea Ferrato
- TOP-UIC MS Thesis, 2003/04
- Advisor: Laura Farinetti
2Purpose of this work
- Design and (partially) implement multilingual
support on a pre-existing Semantic Web platform
- Provide an approach as generic as possible
- Exploit features of the pre-existing architecture
- Cope with the typically chaotic structure of
resources currently available
3Outline
- Semantic Web
- Multilinguality
- The DOSE platform
- Proposed solution
- Given implementation
- Experimental results
- Conclusions
4Semantic Web
- The next evolutionary stage for WWW
- Goal: make network data usable by intelligent
agents
- Deployable only on top of existing infrastructure
- Two pressing tasks
- Transform existing contents to include semantics
- Set up ad hoc user agents to work on them
5Transform existing contents
- Basic data units: resources
- Every single information entity that can be
semantically isolated
- Features to be given
- Identification: URI
- Structure: XML
- Meaning: RDF
- Knowledge: ontologies
6Set up ad hoc user agents
- Major players in Semantic Web deployment
- Invoked by users, can proceed autonomously
- Key facilities to be supported
- Logic
- Proof
- Trust
7Semantic Web layer cake view (Berners-Lee)
[Figure: layer cake diagram with labels Trust, Proof, Logic, Rules, Data, Self-describing documents, Digital signatures, Ontology vocabulary, RDF + RDF Schema]
8Multilinguality
- The extension to multiple languages of tasks
already performed in a monolingual context
- Typical issues from cross-language mapping
- Lexical gaps
- Role of the context
- Lack of pre-acquired knowledge
9Multilinguality and Semantic Web
- A problem of Text Retrieval in multiple languages
(NLP)
- Start from popular approaches (Controlled
Vocabulary, Free text, etc.)
- Two main requirements
- Recognize language ID of resources
- Map contents independently from language
10Language ID retrieval
- Two possible scenarios
- Retrieve a given ID via resource parsing
- Recreate the ID via resource analysis
- When retrieving a given language attribute,
conform to existing language specification
standards
11Language ID specification
Content-language
CSS-level declarations
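As an illustration of retrieving a declared language ID via resource parsing, here is a minimal Python sketch. It assumes the ID is declared either in the `lang` attribute of the `<html>` element or in a `Content-Language` meta tag; the class and function names are hypothetical and are not part of DOSE. CSS-level declarations are not covered.

```python
from html.parser import HTMLParser


class DeclaredLanguageParser(HTMLParser):
    """Collects language IDs declared in the markup (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.html_lang = None   # value of <html lang="...">
        self.meta_lang = None   # value of <meta http-equiv="Content-Language">

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        attrs = dict(attrs)
        if tag == "html" and attrs.get("lang"):
            self.html_lang = attrs["lang"].strip().lower()
        elif tag == "meta" and (attrs.get("http-equiv") or "").lower() == "content-language":
            content = (attrs.get("content") or "").strip().lower()
            if content:
                self.meta_lang = content


def declared_language(markup):
    """Return the declared language ID, or None if the resource declares none."""
    parser = DeclaredLanguageParser()
    parser.feed(markup)
    # Assumption: the element-level attribute takes precedence over the meta tag.
    return parser.html_lang or parser.meta_lang
```

When this returns `None`, the second scenario applies and the ID has to be recreated via resource analysis.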
12Language-independent contents mapping
- Investigate the form/meaning relationship
- Ontology design is crucial
- Three main requirements
- Consistency (based on linguistic evidence)
- Flexibility (meaningful for all languages)
- Extendibility (easy addition of new languages)
13Ontology models
- Conceptual
- Founded upon general knowledge
- Language-based
- Built on a particular language
- Interlingua
- A combination of the above two
- None is definitely superior for multilinguality
14The DOSE platform
- Distributed Open Semantic Elaboration platform
- Key features
- Modularity
- Scalability
- Semantic integration
- Main functionalities offered
- Annotation
- Search
15DOSE layered view
Service layer
Front-end layer
Back-end layer
16DOSE distributed view
17DOSE annotation
Substructure Extractor
Indexer
Fragment Retriever
Semantic Mapper
18DOSE search
Semantic Mapper
Search Engine
Fragment Retriever
19DOSE and multilinguality
- Traditionally: a new ontology for each different
language
- DOSE: the ontology language is totally
independent of the synset language
- Use synsets to store lexical representations only
- Let the ontology focus on knowledge modeling
20Practical requirements for multilinguality
- Indexing
- Recognize the language of resources and set up
the system accordingly
- Store language IDs with annotations
- Search
- Interpret user queries coming in natural
languages
- Allow for cross-language search tasks
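A minimal Python sketch of how these requirements could fit together: query words are mapped to concepts through the synsets of the query language, and annotations that carry a language ID are then matched by concept, regardless of the language of the fragment. All entries, names and data layouts below are hypothetical and only illustrate the idea; they do not describe the DOSE search engine.

```python
# Hypothetical synset entries: (language ID, lexical form) -> concept ID
SYNSETS = {
    ("en", "retirement"): "Retirement",
    ("it", "pensionamento"): "Retirement",
    ("en", "wheelchair"): "Wheelchair",
}

# Hypothetical annotations: fragment location, concept, and stored language ID
ANNOTATIONS = [
    {"fragment": "doc1.html#/html/body/p[3]", "concept": "Retirement", "lang": "it"},
    {"fragment": "doc2.html#/html/body/p[3]", "concept": "Retirement", "lang": "en"},
    {"fragment": "doc2.html#/html/body/p[5]", "concept": "Wheelchair", "lang": "en"},
]


def search(query, query_lang, target_langs=None):
    """Return fragments annotated with the concepts named in the query.

    target_langs=None means "any language" (cross-language search).
    """
    concepts = {
        SYNSETS[(query_lang, word)]
        for word in query.lower().split()
        if (query_lang, word) in SYNSETS
    }
    return [
        a["fragment"]
        for a in ANNOTATIONS
        if a["concept"] in concepts
        and (target_langs is None or a["lang"] in target_langs)
    ]
```

An Italian query such as `search("pensionamento", "it", {"en"})` would then return only the English fragment annotated with the same concept.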
21Extension to language
- Proposed approach: one ontology, many synsets
- A concept is expressed by a different synset for
each supported language
- Each synset contains multiple lexical
representations of a related concept in a single
language
- Separate semantic and textual layers
22Extension to language (contd)
(one concept, three synsets)
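The "one ontology, many synsets" idea can be sketched as a small data structure. This is a minimal Python illustration, not the DOSE representation (which stores synsets in RDF); the concept ID and the lexical forms are hypothetical examples.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """An ontology concept with one synset per supported language."""
    concept_id: str
    synsets: dict = field(default_factory=dict)  # language ID -> set of lexical forms

    def add_synset(self, lang, words):
        self.synsets[lang] = {w.lower() for w in words}


def build_index(concepts):
    """Map (language ID, lexical form) -> concept ID for use during annotation."""
    index = {}
    for concept in concepts:
        for lang, words in concept.synsets.items():
            for word in words:
                index[(lang, word)] = concept.concept_id
    return index


# One concept, three synsets (lexical forms are illustrative only).
part_time = Concept("PartTimeJob")
part_time.add_synset("en", ["part-time job", "part-time work"])
part_time.add_synset("it", ["lavoro part-time", "lavoro a tempo parziale"])
part_time.add_synset("fr", ["travail à temps partiel"])

index = build_index([part_time])
```

Adding a language only adds entries to `synsets`; the concept itself, and hence the ontology, is left untouched.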
23Advantages
- Reduced implementation requirements
- Ontology design
- Resource occupation
- Simplicity (in ontology management)
- Flexibility
- A new language just brings a new bag of synsets
- Expansion of indexing word set
24Language recognition
- Proposed approach
- Retrieve language IDs whenever present
- Otherwise, recognize language(s)
- Design constraints
- To be activated in the annotation phase
- Refined at the document substructure level
- Has to cope with the generally low authoring
quality of Web documents
25Language recognition (contd)
- Validate explicit request
- Retrieve lang value
- Guess via heuristics
- Retrieve from ancestor
- Accept default
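The steps listed above can be read as a fallback chain. The Python sketch below assumes that the steps are tried in the order listed and that the heuristic is a simple stop-word count; both are assumptions, as are all names, the supported language set and the stop-word lists.

```python
SUPPORTED = {"en", "it"}

# Tiny illustrative stop-word lists (assumption: the real heuristics may differ).
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to"},
    "it": {"il", "la", "di", "che", "è"},
}


def guess_by_stopwords(text, min_hits=2):
    """Guess the language from stop-word hits; return None when unsure."""
    tokens = text.lower().split()
    scores = {lang: sum(t in words for t in tokens) for lang, words in STOPWORDS.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    if top < min_hits or top == second:
        return None
    return best


def resolve_language(text, requested=None, lang_attr=None, ancestor_lang=None, default="en"):
    """Return (language ID, source of the decision) for one document fragment."""
    if requested in SUPPORTED:                      # 1. validate explicit request
        return requested, "explicit request"
    if lang_attr:                                   # 2. retrieve lang value
        primary = lang_attr.lower().split("-")[0]
        if primary in SUPPORTED:
            return primary, "lang attribute"
    guessed = guess_by_stopwords(text)              # 3. guess via heuristics
    if guessed:
        return guessed, "heuristic"
    if ancestor_lang in SUPPORTED:                  # 4. retrieve from ancestor
        return ancestor_lang, "ancestor"
    return default, "default"                       # 5. accept default
```

Short fragments such as a single-word paragraph give the heuristic nothing to work with, which is where the ancestor and default steps matter.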
26Current implementation
- A new English synset to couple with a disability
ontology (500 concepts)
- A set of 20 bilingual documents (Italian,
English) on disability
- A basic Language Detector XML-RPC module
implemented in Java
- Testing scenarios
- Parallel annotation
- Language recognition
27Implementation work
- Language Detector module (Java, 1000 lines of
code)
- Additions to pre-existing modules (Java, 1000
lines of code)
- English synset (RDF, 3500 lines of code)
- 24 MB of annotations produced
- Simulation results analysis (a 600x40 .XLS for
<BODY>, a 925x250 .XLS for <Hx>)
28Multilingual DOSE in action
29Parallel annotation
- Two parallel documents have
- The same structure elements with the same
contents
- Two different languages of expression
- Goal: demonstrate that two sets of parallel
documents are (almost) symmetrically mapped to
the same concepts (parallel annotation)
- Both sets indexed separately, with language
explicitly specified
30Parallel annotation (contd)
- Test methodology: Vector Space Model
- Document fragments described as vectors
- Dimensions are ontology concepts
- Components are weighted (tf/idf) occurrences of
such concepts
- The correlation between two fragments is
quantified as the cosine of the angle between
their vectors
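The methodology above can be sketched in a few lines of Python. The exact tf/idf variant used in the experiments is not stated, so the smoothed weight `tf * log(1 + N/df)` below is an assumption, as are the fragment names and concept IDs.

```python
import math
from collections import Counter


def tfidf_vectors(fragments):
    """fragments: dict of fragment name -> list of concept IDs found in it.

    Returns a dict of fragment name -> {concept ID: weight}.
    """
    n = len(fragments)
    df = Counter()                       # number of fragments containing each concept
    for concepts in fragments.values():
        df.update(set(concepts))
    vectors = {}
    for name, concepts in fragments.items():
        tf = Counter(concepts)           # occurrences of each concept in the fragment
        # Assumed weighting: smoothed idf, so shared concepts keep a non-zero weight.
        vectors[name] = {c: tf[c] * math.log(1 + n / df[c]) for c in tf}
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse concept vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical annotation output for three fragments.
vectors = tfidf_vectors({
    "it_p3": ["PartTimeJob", "PartTimeJob"],
    "en_p3": ["PartTimeJob", "Retirement"],
    "en_p7": ["Retirement", "Retirement", "Wheelchair"],
})
```

With these toy inputs the parallel pair (`it_p3`, `en_p3`) scores higher than the nonparallel pair (`it_p3`, `en_p7`), which shares no concept and therefore has cosine 0.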
31Parallel annotation (contd)
IT /html/body/p3: X = Part-time job (2.5), Y = Retirement (0)
EN /html/body/p3: X = Part-time job (1.5), Y = Retirement (1.5)
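As a worked example, assume X and Y are the two concept dimensions and the numbers in parentheses are the weighted components, so the Italian fragment is the vector (2.5, 0) and the English one is (1.5, 1.5). Under that reading, the correlation between the two fragments would be:

```latex
\[
\cos\theta
  = \frac{\mathbf{v}_{IT}\cdot\mathbf{v}_{EN}}{\lVert\mathbf{v}_{IT}\rVert\,\lVert\mathbf{v}_{EN}\rVert}
  = \frac{2.5\cdot 1.5 + 0\cdot 1.5}{\sqrt{2.5^{2}+0^{2}}\;\sqrt{1.5^{2}+1.5^{2}}}
  = \frac{3.75}{2.5\cdot 2.121}
  \approx 0.707
\]
```

This corresponds to an angle of 45 degrees between the two vectors.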
32Parallel annotation results at <BODY> level
33Correlation results at <BODY> level
34Correlation results at <BODY> level (alt)
35Parallel annotation results at <Hx> level
36Parallel annotation notes
- Parallel and nonparallel pairs can be grouped as
two different distributions
- i.e. Gaussian distributions
- Average values of the two distributions are
clearly separated, both for <BODY> and <Hx>
levels
- This shows that the indexing system is able to
annotate relevant document fragments
independently from language
37Language recognition
- Separate testing on the same document set
- Italian and English documents are alternated in
batch processing
- Avoid reuse of default settings for contiguous
documents of the same language
- Two ways to retrieve ancestor language
- Via Annotation Repository (acceptable)
- Via a Language Stack (still inefficient)
38Annotation Repository vs. Language Stack
All cyan, underlined words are to be annotated
(included in the synsets)
<BODY lang="en">
<H1 lang="it">Passatempi</H1>
<H2 lang="en">Board Games</H2>
<P>Gomuku</P>
<P>Dama</P>
Language Stack: Dama is ignored (language en
inherited from <H2>)
Annotation Repository: Dama is annotated
(language it inherited from <H1>, annotated)
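The difference between the two strategies can be illustrated with a small Python sketch. It encodes one possible reading of the slide: the Language Stack hands down the most recently declared language, while the Annotation Repository hands down the language of the most recently annotated fragment. That reading, the toy synsets (which assume "Board Games" matches no entry) and all names are assumptions, not the DOSE implementation.

```python
# Elements in document order: (tag, explicit lang attribute or None, text)
ELEMENTS = [
    ("h1", "it", "Passatempi"),
    ("h2", "en", "Board Games"),
    ("p", None, "Gomuku"),
    ("p", None, "Dama"),
]

# Toy synsets (assumption: only these Italian forms are known to the system).
SYNSETS = {"it": {"passatempi", "dama"}, "en": set()}


def annotate(elements, strategy, body_lang="en"):
    """Return the list of (text, language) pairs that end up annotated.

    strategy is "stack" or "repository" (hypothetical labels).
    """
    last_declared = body_lang    # what a language stack would hand down
    last_annotated = None        # what the annotation repository would hand down
    annotated = []
    for _tag, lang, text in elements:
        if lang is not None:
            effective = lang
            last_declared = lang
        elif strategy == "stack":
            effective = last_declared
        else:
            effective = last_annotated or body_lang
        # A fragment is annotated only if its text is in the synset of its language.
        if text.lower() in SYNSETS.get(effective, set()):
            annotated.append((text, effective))
            last_annotated = effective
    return annotated
```

Under these assumptions, `annotate(ELEMENTS, "stack")` leaves Dama out because it is looked up in the English synset, while `annotate(ELEMENTS, "repository")` annotates it as Italian.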
39Language recognition results (via Annotation Repository)
40Conclusions
- Typical issues discussed
- Overall validity of the approach shown
- Further work and improvements
- Synset composition
- Annotation testing with more languages
- Optimize proposed language recognition
techniques, add new ones
41Thank you
42Language recognition (2)