Information Extraction and Integration from Heterogeneous, Distributed, Autonomous Information Sourc - PowerPoint PPT Presentation

1
Information Extraction and Integration from Heterogeneous, Distributed,
Autonomous Information Sources: A Federated Ontology-Driven
Query-Centric Approach
By Jaime Castillo, Adrian Silvescu, Doina Caragea, Jyotishman Pathak,
Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer
Science, Iowa State University, USA
  • Presented By
  • Ronak Shah
  • Graduate Student
  • University of Southern California.

2
Presentation Overview
  • Introduction
  • Data Integration Systems
  • Data Integration in INDUS
  • Implementation of INDUS
  • Summary and Discussion

3
Introduction
  • Development of high-throughput data acquisition
  • Advances in digital storage, computing and
    communication technologies
  • Opportunities for data-driven knowledge acquisition
    and decision making

4
  • Challenges in using increasing amounts of data
    from disparate sources

5
  • Issue
  • Data Repositories are large in size, dynamic and
    physically distributed
  • Neither desirable nor feasible to gather all data
    in centralized location for analysis
  • Solution
  • Algorithm to efficiently extract the relevant
    information from disparate sources on demand

6
  • Issue
  • Data sources are autonomously owned and operated
  • Range of operations and precise mode of allowed
    interactions can be diverse
  • Solution
  • Strategy for obtaining required information
    within operational constraints

7
  • Issue
  • Data sources are heterogeneous in structure and
    content
  • Each Data source uses its own ontology
  • Solution
  • Effective Integration of data from different
    sources bridging the semantic and syntactic
    mismatches among the data sources

8
  • Issue
  • No Single Universal Ontology
  • Solution
  • Methods for context-dependent dynamic information
    extraction and integration from distributed data
    based on user-specified ontologies for knowledge
    acquisition and decision making

9
Data Integration Systems
  • Provide users with seamless and flexible access
    to information from multiple autonomous,
    distributed and heterogeneous data sources
  • Unified Query Interface
  • Allow users to specify what information is needed
    without mentioning how and from where to obtain
    information

10
Data Integration Systems should provide
mechanisms for
  • Communication and interaction with each data
    source
  • Specification of a query, expressed in terms of a
    user-specified ontology, across disparate sources
  • Mapping between user and data-source-specific
    ontologies
  • Transformation of a query into an execution plan
  • Integration and presentation of the results in
    the vocabulary known to the user

11
Two Classes of Approach to Data Integration
  • Data Warehousing
  • Data Federation

12
Data Warehousing
  • Data is collected from disparate sources and
    mapped to the common structure and stored in the
    central location
  • Need to periodically update the warehouse in
    order to ensure that the data is accurate
  • Not possible to analyze the same data from different
    perspectives
  • Each user queries warehouse using common
    vocabulary and a common query interface

13
Data Federation
  • Data directly gathered from the data sources
  • Results are up-to-date
  • Allows users to impose their own ontologies on data
    from disparate sources
  • Provides two sets of operations
  • get()
  • transform()
  • Operations capable of dealing with syntactic and
    semantic mismatches between the global ontology and
    source-specific ontologies
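
As a rough illustration of the get() and transform() operations named above, the sketch below fetches records in a source's own vocabulary and then maps them into a global vocabulary. The slides do not give the signatures of these operations, so the parameters, the sample source SOURCE_B, and the attribute names are all assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, object]

# Hypothetical source: stores protein mass in kilodaltons under its own attribute names.
SOURCE_B: List[Record] = [
    {"prot_id": "P1", "mass_kda": 14},
    {"prot_id": "P2", "mass_kda": 52},
]


def get(source: Iterable[Record]) -> List[Record]:
    """Fetch records from a source, still in the source's own vocabulary."""
    return [dict(record) for record in source]


def transform(
    records: Iterable[Record],
    rename: Dict[str, str],
    convert: Optional[Dict[str, Callable[[object], object]]] = None,
) -> List[Record]:
    """Rename attributes (naming mismatch) and convert values (unit mismatch)."""
    convert = convert or {}
    out: List[Record] = []
    for record in records:
        row: Record = {}
        for name, value in record.items():
            # Converters are keyed by the source attribute name.
            if name in convert:
                value = convert[name](value)
            row[rename.get(name, name)] = value
        out.append(row)
    return out


# Assumed global ontology: attribute "protein", and "mass_da" measured in daltons.
unified = transform(
    get(SOURCE_B),
    rename={"prot_id": "protein", "mass_kda": "mass_da"},
    convert={"mass_kda": lambda kda: kda * 1000},
)
```

Keeping get() free of any mapping logic means the same source can be reused under different global ontologies by swapping only the arguments to transform().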

14
Approaches to deal with semantic mismatches
between global and local ontologies
  • Source Centric Approach
  • Query Centric Approach

15
Source Centric Approach
  • Each individual data source determines how the
    concepts in its local ontology are mapped to concepts
    in the global ontology
  • User has little control on the true meaning of
    concepts in the global ontology
  • User is not responsible for specifying the
    transformation between global concepts and the
    local concepts

16
Query Centric Approach
  • Concepts in global ontology are defined in terms
    of concepts in local ontology
  • Suited for applications that require users to
    impose their own ontologies to flexibly interpret
    and analyze data from different sources
  • User or administrator needs to specify precisely
    how global concepts can be composed from local
    concepts
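
To make the query-centric idea concrete, here is a minimal sketch in which two users define the same global concept differently over one shared local concept. The concept name HotDay, the local concept station.readings, and the thresholds are invented for illustration and do not come from the presentation.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
LocalData = Dict[str, List[Row]]

# Hypothetical local concept exposed by one data source.
LOCAL: LocalData = {
    "station.readings": [
        {"day": "Mon", "temp_c": 24},
        {"day": "Tue", "temp_c": 28},
        {"day": "Wed", "temp_c": 33},
    ]
}

# Each user-specified ontology defines global concepts in terms of local ones.
ONTOLOGY_USER_A: Dict[str, Callable[[LocalData], List[Row]]] = {
    "HotDay": lambda local: [r for r in local["station.readings"] if r["temp_c"] > 30],
}
ONTOLOGY_USER_B: Dict[str, Callable[[LocalData], List[Row]]] = {
    "HotDay": lambda local: [r for r in local["station.readings"] if r["temp_c"] > 25],
}


def instances(ontology, concept: str, local: LocalData) -> List[Row]:
    """Compute the instances of a global concept from local concept instances."""
    return ontology[concept](local)


hot_a = [r["day"] for r in instances(ONTOLOGY_USER_A, "HotDay", LOCAL)]
hot_b = [r["day"] for r in instances(ONTOLOGY_USER_B, "HotDay", LOCAL)]
```

The data source is untouched in both cases; only the user-side definition changes, which is the flexibility the query-centric approach is meant to provide.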

17
INDUS
  • Intelligent Data Understanding System
  • Environment for data-driven extraction and
    integration from heterogeneous, distributed and
    autonomous information sources
  • Users can flexibly interpret and analyze the
    data from various perspectives
  • Motivated by the requirements of applications such
    as scientific discovery
  • Provides a federated, query-centric approach to
    data integration using user-specified ontologies

18
Data Integration In INDUS
  • 3-Layer Architecture
  • Physical Layer
  • Ontological Layer
  • User Interface Layer

19
Physical Layer
  • Allows system to communicate with the information
    sources
  • Based on federated database architecture

20
Ontological Layer
  • Contains global ontologies specified by the users
    and their mapping to local ontologies
  • Transforms queries expressed in terms of global
    ontologies into execution plans

21
User Interface Layer
  • Enables users to interact with the system
  • Define ontologies
  • Post queries and receive information
  • Hides the complexity from the user

22
Some INDUS-Related Terminology
  • Concepts
  • Ground Concepts
  • Compound concepts
  • Global Ontology
  • Queries

23
Concepts
  • A concept corresponds to the mathematical notion of a relation
  • It is a subset of the Cartesian product of a list of
    domains, where each domain is finite
  • Its instances are stored in relational databases
  • Two types of concepts
  • Ground concept
  • - Ground concepts are those whose
    instances can be retrieved from one or more
    data sources using a set of predefined operations
  • Compound concept
  • - The definition of a compound concept X
    specifies the set of operations that must be
    applied over a set of instances of other
    previously defined concepts in order to determine
    the set of instances of X
  • - INDUS uses four operations to define new
    compound concepts, namely selection, projection,
    vertical integration and horizontal integration
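
The four compositional operations can be sketched over relations held as lists of dictionaries. The slides do not define vertical and horizontal integration, so this example assumes that horizontal integration behaves like a union of row sets with the same attributes and that vertical integration behaves like a join on a shared key. The function names and sample data are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def _dedupe(rows: Iterable[Row]) -> List[Row]:
    """Drop duplicate rows while keeping first-seen order (set semantics)."""
    seen = set()
    unique: List[Row] = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    return [r for r in rows if predicate(r)]


def project(rows: Iterable[Row], attrs: List[str]) -> List[Row]:
    return _dedupe({a: r[a] for a in attrs} for r in rows)


def horizontal_integration(*row_sets: Iterable[Row]) -> List[Row]:
    """Assumed meaning: union of rows that share the same attributes."""
    return _dedupe(r for rows in row_sets for r in rows)


def vertical_integration(left: Iterable[Row], right: Iterable[Row], key: str) -> List[Row]:
    """Assumed meaning: join rows from two relations on a shared key."""
    index: Dict[object, List[Row]] = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Instances of three hypothetical ground concepts.
seq_a = [{"id": "P1", "length": 120}, {"id": "P2", "length": 340}]
seq_b = [{"id": "P3", "length": 95}]
functions = [{"id": "P1", "function": "kinase"}, {"id": "P3", "function": "transport"}]

# Compound concepts built from previously defined concepts.
all_sequences = horizontal_integration(seq_a, seq_b)
annotated = vertical_integration(all_sequences, functions, "id")
long_annotated_ids = project(select(annotated, lambda r: r["length"] > 100), ["id"])
```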

24
Global Ontology
  • Global ontology consists of the set of concepts
    that are used to describe entities and
    relationships in the domain of discourse
  • It can be customized to suit the need of the user
    or the group of users
  • Queries are expressed in terms of concepts
  • Can be extended by defining new concepts
  • Hides the complexity of accessing and retrieving
    the information from the data sources
  • Mapping the semantics of concepts in global and
    user-defined ontologies helps to resolve
    semantic mismatches
  • It resides in the ontological layer

25
Queries
  • A query gives access to the instances of the
    respective concept
  • Generally specified by selection and projection
  • The query is represented as an expression
    tree before it is answered
  • Internal nodes represent operations and leaf nodes
    represent ground concepts
  • The query is executed after the plan (tree) is
    created
  • INDUS provides instantiators that are able to
    interact with the data sources and retrieve the
    information from them
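
The expression-tree plan described above can be sketched as a small recursive evaluator: leaves name ground concepts, internal nodes apply operations, and execution walks the tree bottom-up. The node classes, the execute function, and the way instantiators are passed in as plain callables are assumptions for this example, not the actual INDUS classes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple, Union

Row = Dict[str, object]


@dataclass
class Ground:
    """Leaf node: a ground concept whose instances come from a data source."""
    name: str


@dataclass
class Select:
    """Internal node: keep the child's rows that satisfy the predicate."""
    child: "Node"
    predicate: Callable[[Row], bool]


@dataclass
class Project:
    """Internal node: keep only the listed attributes of the child's rows."""
    child: "Node"
    attrs: Tuple[str, ...]


Node = Union[Ground, Select, Project]


def execute(node: Node, instantiators: Dict[str, Callable[[], Iterable[Row]]]) -> List[Row]:
    """Evaluate the plan bottom-up, calling an instantiator at each leaf."""
    if isinstance(node, Ground):
        return list(instantiators[node.name]())
    if isinstance(node, Select):
        return [r for r in execute(node.child, instantiators) if node.predicate(r)]
    if isinstance(node, Project):
        return [{a: r[a] for a in node.attrs} for r in execute(node.child, instantiators)]
    raise TypeError(f"unknown plan node: {node!r}")


# Hypothetical instantiator standing in for a real data source.
instantiators = {
    "Protein": lambda: [
        {"id": "P1", "length": 120},
        {"id": "P2", "length": 340},
        {"id": "P3", "length": 95},
    ]
}

# Query: ids of proteins longer than 100, built as a tree before execution.
plan = Project(Select(Ground("Protein"), lambda r: r["length"] > 100), ("id",))
answer = execute(plan, instantiators)
```

Building the full tree before running it mirrors the statement that the query is executed only after the plan is created.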

26
Implementation of INDUS
  • Implementation of the data integration component
    of INDUS
  • Five principal modules
  • Graphical User Interface
  • Common Global Ontology Area
  • Instantiator Library
  • Query Resolution
  • User Workspace

27
Implementation of INDUS

28
Graphical User Interface
  • Allows the user to interact with INDUS
  • Enables the user to describe ontologies, define
    operational definitions of ground concepts,
    compound concepts and queries, register
    iterators and execute queries

29
Common Global Ontology Area
  • Manages the repository where definitions of
    ontologies, ground concepts, compound
    concepts, queries and iterator signatures are
    stored

30
Instantiator Library
  • Contains sets of functions used to interact with
    the individual data sources
  • Each instantiator is based on an iterator
  • The iterator interacts directly with the data sources
  • Each iterator is implemented as a Java class
  • The instantiator supplies parameters to control the
    behavior of the iterator
  • Maps the instances returned by the iterator to
    instances of the corresponding ground concept
  • Functionality corresponds to that of a wrapper
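
The iterator and instantiator split can be sketched as follows. INDUS implements iterators as Java classes; this Python version only illustrates the division of labor, and the class names, the delimited-text source, and the attribute list are assumptions for the example.

```python
import csv
import io
from typing import Callable, Dict, Iterator, List, Tuple


class DelimitedTextIterator:
    """Iterator: talks directly to one kind of source (here, delimited text)."""

    def __init__(self, text: str, delimiter: str) -> None:
        self._reader = csv.reader(io.StringIO(text), delimiter=delimiter)

    def __iter__(self) -> "DelimitedTextIterator":
        return self

    def __next__(self) -> List[str]:
        # csv.reader raises StopIteration when the text is exhausted.
        return next(self._reader)


class Instantiator:
    """Instantiator: configures an iterator and maps raw rows to concept instances."""

    def __init__(
        self,
        iterator_factory: Callable[..., Iterator[List[str]]],
        params: Dict[str, object],
        attributes: List[Tuple[str, Callable[[str], object]]],
    ) -> None:
        self.iterator_factory = iterator_factory
        self.params = params          # parameters that control the iterator
        self.attributes = attributes  # (attribute name, type cast) per column

    def instances(self) -> Iterator[Dict[str, object]]:
        for raw in self.iterator_factory(**self.params):
            yield {name: cast(value) for (name, cast), value in zip(self.attributes, raw)}


# Hypothetical ground concept "Protein" backed by a pipe-delimited text source.
protein = Instantiator(
    DelimitedTextIterator,
    params={"text": "P1|120\nP2|340\n", "delimiter": "|"},
    attributes=[("id", str), ("length", int)],
)
rows = list(protein.instances())
```

The same iterator class could back several ground concepts, each with its own parameters and attribute mapping, which matches the wrapper-like role the slide describes.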

31
Query Resolution Module
  • Accepts query expressed in terms of concepts in
    global ontology as input
  • Returns the answer to the query constructed from
    the relevant data sources

32
User Workspace
  • Manages the private workspace where users store
    answers to posted queries
  • Partial Results are also stored
  • Set of Instances associated with each ground
    concept present in the expression tree are stored
    as populated relational tables
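
Storing the instances of a ground concept as a populated relational table might look like the sketch below. It uses an in-memory SQLite database purely for illustration; the slides do not say which database engine INDUS uses, and the helper store_instances and the table layout are assumptions.

```python
import sqlite3
from typing import Dict, List, Sequence


def store_instances(
    conn: sqlite3.Connection,
    table: str,
    columns: Sequence[str],
    rows: List[Dict[str, object]],
) -> None:
    """Create (if needed) and populate one table per ground concept.

    Table and column names are interpolated into the SQL text, so they must
    come from trusted definitions, never from raw user input.
    """
    cols = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({cols})")
    conn.executemany(
        f"INSERT INTO {table} ({cols}) VALUES ({marks})",
        [tuple(row[c] for c in columns) for row in rows],
    )
    conn.commit()


# A private, in-memory workspace for one user.
workspace = sqlite3.connect(":memory:")
store_instances(
    workspace,
    "protein",
    ["id", "length"],
    [
        {"id": "P1", "length": 120},
        {"id": "P2", "length": 340},
        {"id": "P3", "length": 95},
    ],
)

# Partial results stay available for later manipulation with ordinary SQL.
long_ids = workspace.execute(
    "SELECT id FROM protein WHERE length > 100 ORDER BY id"
).fetchall()
```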

33
Advantages of INDUS Design
  • Modular design ensures that each module can be
    updated and alternative implementations easily
    explored
  • Enables INDUS to use different network
    architectures

34
Technologies used to develop INDUS
  • JSP for developing the graphical user interface
  • Hosted in an Apache Tomcat 4.0 web server
  • Relational database for the common global ontology
    area
  • ODBC and JDBC interfaces to share the ontology with
    other applications
  • Iterators and the resolution algorithm are
    implemented in Java
  • All components of INDUS are platform independent

35
Different Roles
  • Domain Scientist
  • Ontology Engineer
  • Administrator
  • Developer
  • Each user may play multiple roles

36
Domain Scientist
  • Define ontologies, compound concepts and queries
  • Execute queries and manipulate the retrieved data
  • Should be familiar with the relevant domain, data
    sources and their capabilities
  • No programming knowledge required

37
Ontology Engineer
  • Program new iterators
  • Define ground concepts associated with new
    iterators
  • Define new modes of interaction with existing
    data sources

38
Administrator
  • Install INDUS software
  • Set up and manage databases
  • Add new users to the system

39
Developer
  • Add new compositional operations to INDUS
  • Modify the graphical user interface module and
    query resolution module

40
Summary
  • Described the design and implementation of the data
    integration component of INDUS
  • INDUS implements a federated, query-centric approach
    to data integration
  • The information extraction operations to be executed
    are dynamically determined on the basis of the
    user-supplied ontology and query

41
Related Work
  • Early work on multi-database systems focused on
    relational and object oriented databases
  • Recently, mediators and wrappers have been
    developed to integrate information from disparate
    sources

42
Information Manifold System
  • Developed at AT&T Bell Laboratories
  • Heterogeneous data integration system offering
    a unified interface for retrieving information from
    the WWW and internal sources
  • Source centric approach
  • Allows definition of only one capability record
    per data source
  • Only equality operator is supported

43
TSIMMIS
  • The Stanford-IBM Manager of Multiple Information
    Sources
  • Based on concepts of wrappers and mediators
  • Uses the Object Exchange Model (OEM)
  • Query centric approach

44
TAMBIS
  • Transparent Access to Multiple Bioinformatics
    Information Sources
  • Ontology-centered system for evaluating queries
  • Offers access to multiple heterogeneous
    bioinformatics data sources
  • Three layer wrapper/mediator architecture
  • Uses description logic language GRAIL
  • Returns the answer for the query as an HTML file

45
Work in Progress
  • Development of the INDUS prototype into a platform
    to support exploratory data analysis and
    knowledge acquisition from heterogeneous,
    distributed information sources
  • Extending the information extraction framework to
    support extraction of sufficient statistics
  • Extending recently developed algorithms to work
    with heterogeneous, distributed information
    sources
  • Performance improvements using more sophisticated
    query optimization methods
  • Exploration of the use of emerging frameworks for
    data and metadata description, ontologies and
    registry services, being developed as part of the
    semantic web project
  • Exploration of methods for automatically learning
    mappings between data sources
  • Extension of INDUS to support information
    integration in peer-to-peer environments and
    distributed sensor networks

46
  • Thank You