Information Extraction and Integration from Heterogeneous, Distributed, Autonomous Information Sourc - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Information Extraction and Integration from
Heterogeneous, Distributed, Autonomous
Information Sources: A Federated,
Ontology-Driven, Query-Centric Approach
By Jaime Castillo, Adrian Silvescu, Doina Caragea,
Jyotishman Pathak, Vasant Honavar
Artificial Intelligence Research Laboratory,
Department of Computer Science, Iowa State
University, USA

Presented By
Ronak Shah
Graduate Student
University of Southern California.

2
Presentation Overview

Introduction
Data Integration Systems
Data Integration in INDUS
Implementation of Indus
Summary and Discussion

3
Introduction

Development of high-throughput data acquisition
Advances in digital storage, computing and
communication technologies
Opportunity in data-driven knowledge acquisition
and decision making

Challenges in use of increasing amounts of Data
from disparate sources

Issue
Data Repositories are large in size, dynamic and
physically distributed
Neither desirable nor feasible to gather all data
in centralized location for analysis
Solution
Algorithm to efficiently extract the relevant
information from disparate sources on demand

Issue
Data sources are autonomously owned and operated
Range of operations and precise mode of allowed
interactions can be diverse
Solution
Strategy for obtaining required information
within operational constraints

Issue
Data sources are heterogeneous in structure and
content
Each Data source uses its own ontology
Solution
Effective Integration of data from different
sources bridging the semantic and syntactic
mismatches among the data sources

Issue
No Single Universal Ontology
Solution
Methods for context-dependent dynamic information
extraction and integration from distributed data
based on user-specified ontologies for knowledge
acquisition and decision making

9
Data Integration Systems

Provide users with seamless and flexible access
to information from multiple autonomous,
distributed and heterogeneous data sources
Unified Query Interface
Allow users to specify what information is needed
without mentioning how and from where to obtain
information

10
Data Integration Systems should provide
mechanisms for

Communication and interaction with each data
source
Specification of a query, expressed in terms of
a user-specified ontology, across disparate sources
Mapping between user-specified and
data-source-specific ontologies
Transformation of a query into a plan
Integration and presentation of the results in
the vocabulary known to the user

11
Two Classes of Approach to Data Integration

Data Warehousing
Data Federation

12
Data Warehousing

Data is collected from disparate sources,
mapped to a common structure and stored in a
central location
The warehouse must be updated periodically to
ensure that the data is accurate
Not possible to analyze the same data from
different perspectives
Each user queries the warehouse using a common
vocabulary and a common query interface

13
Data Federation

Data is gathered directly from the data sources
Results are up-to-date
Allows users to impose their own ontology on data
from disparate sources
Provides two sets of operations
get()
transform()
Operations capable of dealing with syntactic and
semantic mismatches between the global ontology
and source-specific ontologies
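
The get() and transform() operations named on this slide could look roughly like the sketch below. It is only an illustration: the record layout, the weather source, the attribute names and the Fahrenheit-to-Celsius conversion are all assumptions made for the example, not part of INDUS.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, object]


def get(source: Iterable[Record]) -> List[Record]:
    """Read records straight from a source at query time (no central copy)."""
    return [dict(record) for record in source]


def transform(
    records: List[Record],
    rename: Dict[str, str],
    convert: Dict[str, Callable[[object], object]],
) -> List[Record]:
    """Rename attributes (syntactic fix) and convert values (semantic fix)."""
    result = []
    for record in records:
        mapped = {}
        for local_name, value in record.items():
            converter = convert.get(local_name)
            mapped[rename.get(local_name, local_name)] = (
                converter(value) if converter else value
            )
        result.append(mapped)
    return result


# Hypothetical source that reports temperature in Fahrenheit.
weather_source = [{"station": "S1", "temp_f": 212.0}]

rows = transform(
    get(weather_source),
    rename={"station": "station_id", "temp_f": "temp_c"},
    convert={"temp_f": lambda f: (f - 32) * 5 / 9},
)
print(rows)
```

Keeping get() separate from transform() means the source is only ever asked for what it can natively provide, while all vocabulary and unit bridging happens on the integration side.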

14
Approaches to deal with semantic mismatches
between global and local ontologies

Source Centric Approach
Query Centric Approach

15
Source Centric Approach

Each individual data source determines how the
concepts in its local ontology are mapped to
concepts in the global ontology
User has little control over the true meaning of
concepts in the global ontology
User is not responsible for specifying the
transformation between global concepts and
local concepts

16
Query Centric Approach

Concepts in the global ontology are defined in
terms of concepts in the local ontologies
Suited for applications that require users to
impose their own ontologies to flexibly interpret
and analyze data from different sources
User or administrator needs to specify precisely
how global concepts can be composed from local
concepts
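
A minimal sketch of the query-centric idea, assuming two made-up sources and a made-up global concept "Senior": each user supplies a definition of the global concept as a composition of local concepts, so two users can interpret the same data differently. None of the names below come from INDUS.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
LocalData = Dict[str, List[Row]]

# Hypothetical local concepts exposed by two autonomous sources.
local_data: LocalData = {
    "clinic_a.patient": [{"id": 1, "age": 34}, {"id": 2, "age": 71}],
    "clinic_b.subject": [{"id": 3, "age": 66}],
}


def everyone(local: LocalData) -> List[Row]:
    """Combine the instances of both local concepts."""
    return local["clinic_a.patient"] + local["clinic_b.subject"]


# Two users define the "same" global concept differently over the same sources.
ontology_user_1: Dict[str, Callable[[LocalData], List[Row]]] = {
    "Senior": lambda local: [r for r in everyone(local) if r["age"] >= 65],
}
ontology_user_2: Dict[str, Callable[[LocalData], List[Row]]] = {
    "Senior": lambda local: [r for r in everyone(local) if r["age"] >= 70],
}

seniors_1 = ontology_user_1["Senior"](local_data)
seniors_2 = ontology_user_2["Senior"](local_data)
print([r["id"] for r in seniors_1], [r["id"] for r in seniors_2])
```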

17
INDUS

Intelligent Data Understanding System
Environment for data-driven information extraction
and integration from heterogeneous, distributed and
autonomous information sources
Users are able to flexibly interpret and analyze
the data from various perspectives
Motivated by the requirements of applications such
as scientific discovery
Provides a federated, query-centric approach to
data integration using user-specified ontologies

18
Data Integration In INDUS

3-Layer Architecture
Physical Layer
Ontological Layer
User Interface Layer

19
Physical Layer

Allows system to communicate with the information
sources
Based on federated database architecture

20
Ontological Layer

Contains global ontologies specified by the users
and their mapping to local ontologies
Transforms queries expressed in terms of global
ontologies into execution plans

21
User Interface Layer

Enables the user to interact with the system
Define ontologies
Post query and receive information
Hides the complexity from the user

22
A Few INDUS-Related Terms

Concepts
Ground Concepts
Compound concepts
Global Ontology
Queries

23
Concepts

Equivalent to the mathematical notion of a relation
A subset of the Cartesian product of a list of
domains, where each domain is finite
Instances are stored in relational databases
Two types of concepts
Ground concept
- Ground concepts are those whose
instances can be retrieved from one or more
data sources using a set of predefined operations
Compound concept
- The definition of a compound concept X
specifies the set of operations that must be
applied over a set of instances of other
previously defined concepts in order to determine
the set of instances of X
- INDUS uses four operations to define new
compound concepts, namely selection, projection,
vertical integration and horizontal integration
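
The four operations can be sketched over relations held as lists of rows. The slides do not define vertical and horizontal integration, so this sketch assumes vertical integration is join-like (it combines attributes of matching instances) and horizontal integration is union-like (it pools instances with the same attributes). The gene data and attribute names are invented for the example.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def _dedupe(rows: List[Row]) -> List[Row]:
    """Keep the first copy of each row, since a relation is a set."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def selection(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    return [row for row in rows if predicate(row)]


def projection(rows: List[Row], attrs: Sequence[str]) -> List[Row]:
    return _dedupe([{a: row[a] for a in attrs} for row in rows])


def horizontal_integration(*relations: List[Row]) -> List[Row]:
    """Assumed union-like: pool instances from several sources."""
    return _dedupe([row for relation in relations for row in relation])


def vertical_integration(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Assumed join-like: merge attributes of rows that share a key value."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Hypothetical ground-concept instances from three sources.
genes_a = [{"gene": "g1", "organism": "yeast"}, {"gene": "g2", "organism": "human"}]
genes_b = [{"gene": "g2", "organism": "human"}, {"gene": "g3", "organism": "human"}]
functions = [{"gene": "g2", "function": "kinase"},
             {"gene": "g3", "function": "transporter"}]

# A compound concept built from previously defined concepts.
all_genes = horizontal_integration(genes_a, genes_b)
annotated = vertical_integration(all_genes, functions, "gene")
human = selection(annotated, lambda row: row["organism"] == "human")
human_gene_functions = projection(human, ["gene", "function"])
print(human_gene_functions)
```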

24
Global Ontology

The global ontology consists of the set of concepts
that are used to describe entities and
relationships in the domain of discourse
It can be customized to suit the needs of a user
or a group of users
Queries are expressed in terms of concepts
Can be extended by defining new concepts
Hides the complexity of accessing and retrieving
information from the data sources
Mapping the semantics of concepts in global and
user-defined ontologies helps to resolve
semantic mismatches
It resides in the ontological layer

25
Queries

A query gives access to the instances of the
respective concept
Generally specified by selection and projection
The query is converted into an expression
tree before being answered
Internal nodes represent operations and leaf nodes
represent ground concepts
The query is executed after the plan (tree) is
created
INDUS provides instantiators that interact with
the data sources and retrieve
information from them
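
The expression tree described on this slide can be sketched as follows: leaves name ground concepts, internal nodes hold operations, and evaluation asks an instantiator for the instances at each leaf. The class names, the "Protein" concept and its data are hypothetical and only illustrate the structure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Row = Dict[str, object]


@dataclass
class Ground:
    """Leaf node: a ground concept whose instances come from a data source."""
    name: str


@dataclass
class Operation:
    """Internal node: an operation applied to the results of its children."""
    apply: Callable[..., List[Row]]
    children: List["Node"]


Node = Union[Ground, Operation]


def evaluate(node: Node, instantiators: Dict[str, Callable[[], List[Row]]]) -> List[Row]:
    """Execute the plan bottom-up: fetch leaves first, then apply operations."""
    if isinstance(node, Ground):
        return instantiators[node.name]()
    inputs = [evaluate(child, instantiators) for child in node.children]
    return node.apply(*inputs)


def select(predicate: Callable[[Row], bool]) -> Callable[[List[Row]], List[Row]]:
    return lambda rows: [row for row in rows if predicate(row)]


def project(attrs: List[str]) -> Callable[[List[Row]], List[Row]]:
    return lambda rows: [{a: row[a] for a in attrs} for row in rows]


# Hypothetical instantiator standing in for a real data source.
instantiators = {
    "Protein": lambda: [{"id": "P1", "length": 120}, {"id": "P2", "length": 45}],
}

# Query: the ids of proteins longer than 100, as projection over selection.
plan = Operation(
    project(["id"]),
    [Operation(select(lambda row: row["length"] > 100), [Ground("Protein")])],
)
answer = evaluate(plan, instantiators)
print(answer)
```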

26
Implementation of INDUS

Implementation of the data integration component
of INDUS
Five principal modules
Graphical User Interface
Common Global Ontological Area
Instantiator Library
Query Resolution
User Workspace

27
Implementation of INDUS

28
Graphical User Interface

Allows the user to interact with INDUS
Enables the user to describe ontologies, define
operational definitions of ground concepts,
compound concepts and queries, register
iterators and execute queries

29
Common Global Ontology Area

Manages the repository where definitions of
ontologies, ground concepts, compound
concepts, queries and iterator signatures are
stored

30
Instantiator Library

Contains sets of functions used to interact with
the individual data sources
Each instantiator is based on an iterator
The iterator interacts directly with the data sources
Each iterator is implemented as a Java class
The instantiator supplies parameters to control
the behavior of the iterator
Maps the instances returned by the iterator to
instances of the corresponding ground concept
Functionality corresponds to that of a wrapper
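
The division of labor between iterator and instantiator can be sketched as below. The slides say the real iterators are Java classes; this sketch uses Python, and the CSV-based source, class names and attribute map are assumptions chosen for illustration.

```python
import csv
import io
from typing import Dict, List


class CsvIterator:
    """Hypothetical iterator: talks directly to one kind of source (CSV text)."""

    def __init__(self, text: str, delimiter: str = ",") -> None:
        self._reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)

    def __iter__(self) -> "CsvIterator":
        return self

    def __next__(self) -> Dict[str, str]:
        # Raises StopIteration when the source is exhausted.
        return next(self._reader)


class Instantiator:
    """Supplies parameters to the iterator and maps raw records to the
    attributes of a ground concept, much like a wrapper."""

    def __init__(self, text: str, delimiter: str, attribute_map: Dict[str, str]) -> None:
        self._text = text
        self._delimiter = delimiter
        self._attribute_map = attribute_map  # source attribute -> concept attribute

    def instances(self) -> List[Dict[str, str]]:
        iterator = CsvIterator(self._text, delimiter=self._delimiter)
        return [
            {concept_attr: raw[source_attr]
             for source_attr, concept_attr in self._attribute_map.items()}
            for raw in iterator
        ]


# Hypothetical source content using ';' as its separator.
source_text = "acc;len\nP1;120\nP2;45\n"

protein = Instantiator(source_text, ";", {"acc": "protein_id", "len": "length"})
protein_instances = protein.instances()
print(protein_instances)
```

The same iterator class could back several instantiators, each passing different parameters, which is one way to read "Each instantiator is based on an iterator".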

31
Query Resolution Module

Accepts as input a query expressed in terms of
concepts in the global ontology
Returns the answer to the query, constructed from
the relevant data sources

32
User Workspace

Manages the private workspace where users store
answers to posted queries
Partial results are also stored
The set of instances associated with each ground
concept present in the expression tree is stored
as a populated relational table
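
Storing the instances of a ground concept as a populated relational table might look like the sketch below. The slides do not name the database used for the workspace, so an in-memory SQLite database stands in for it here; the table name, columns and data are hypothetical.

```python
import sqlite3
from typing import Dict, List


def store_instances(conn: sqlite3.Connection, table: str, rows: List[Dict[str, object]]) -> None:
    """Create one table per ground concept and fill it with its instances.

    Table and column names are interpolated directly, so this sketch assumes
    they come from trusted concept definitions rather than user input.
    """
    columns = list(rows[0].keys())
    column_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_sql})')
    conn.executemany(
        f'INSERT INTO "{table}" ({column_sql}) VALUES ({placeholders})',
        [tuple(row[c] for c in columns) for row in rows],
    )
    conn.commit()


# Hypothetical workspace and partial result for one leaf of an expression tree.
workspace = sqlite3.connect(":memory:")
store_instances(
    workspace,
    "ground_protein",
    [{"id": "P1", "length": 120}, {"id": "P2", "length": 45}],
)

stored = workspace.execute(
    "SELECT id, length FROM ground_protein ORDER BY id"
).fetchall()
print(stored)
```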

33
Advantages of INDUS Design

Modular design ensures that each module can be
updated and alternative implementations easily
explored
Enables INDUS to use different network
architectures

34
Technologies used to develop INDUS

JSP for developing the graphical user interface
Hosted in an Apache Tomcat 4.0 web server
Relational database for the common global ontology
area
ODBC and JDBC protocols to share the ontology with
other applications
Iterators and the resolution algorithm are
implemented in Java
All components of INDUS are platform independent

35
Different Roles

Domain Scientist
Ontology Engineer
Administrator
Developer
Each user may play multiple roles

36
Domain Scientist

Define ontologies, compound concepts and queries
Execute queries and manipulate the retrieved data
Should be familiar with the relevant domain, data
sources and their capabilities
No programming knowledge required

37
Ontology Engineer

Programming new iterators
Define ground concepts associated with new ground
concepts
Define new modes of interaction with existing
data sources

38
Administrator

Install INDUS software
Set up and manage databases
Add new users to the system

39
Developer

Add new compositional operations to INDUS
Modify the graphical user interface module and
query resolution module

40
Summary

Described the design and implementation of the
data integration component of INDUS
INDUS implements a federated, query-centric
approach to data integration
Information extraction operations to be executed
are dynamically determined on the basis of the
user-supplied ontology and query

41
Related Work

Early work on multi-database systems focused on
relational and object oriented databases
Recently, mediators and wrappers have been
developed to integrate information from disparate
sources

42
Information Manifold System

Developed at AT&T Bell Laboratories
Heterogeneous data integration system offering a
unified interface for retrieving information from
the WWW and internal sources
Source centric approach
Allows definition of only one capability record
per data source
Only equality operator is supported

43
TSIMMIS

The Stanford-IBM Manager of Multiple Information
Sources
Based on concepts of wrappers and mediators
Uses Object Exchange Model(OEM)
Query centric approach

44
TAMBIS

Transparent Access to Multiple Bioinformatics
Information Sources
Ontology-centered system for evaluating queries
Offers access to multiple heterogeneous
bioinformatics data sources
Three layer wrapper/mediator architecture
Uses description logic language GRAIL
Returns the answer for the query as an HTML file

45
Work in Progress

Development of the INDUS prototype into a platform
to support exploratory data analysis and
knowledge acquisition from heterogeneous,
distributed information sources
Extending the information extraction framework to
support extraction of sufficient statistics
Extending recently developed algorithms to work
with heterogeneous, distributed information
sources
Performance improvements using more sophisticated
query optimization methods
Exploration of the use of emerging frameworks for
data and metadata description, ontologies and
registry services, being developed as part of the
Semantic Web project
Exploration of methods for automatically learning
mappings between data sources
Extension of INDUS to support information
integration in peer-to-peer environments and
distributed sensor networks