Naveen Ashish - PowerPoint PPT Presentation

About This Presentation

Title:

Naveen Ashish

Description:

Naveen Ashish, Amit P. Sheth. Department of Computer Science and Large Scale Distributed Information Systems Lab, University of Georgia, Athens. What is an Information ...

Slides: 33

Provided by: MacR9

Transcript and Presenter's Notes

Title: Naveen Ashish

1
Information Mediation: Integrating Information
from Multiple Information Sources

Naveen Ashish
Amit P. Sheth
Department of Computer Science and
Large Scale Distributed Information Systems Lab
University of Georgia, Athens

2
What is an Information Agent/Mediator?

A software system that provides integrated and
structured query access to multiple distributed
information sources
Sources may be databases of various kinds or Web
sources
Sources are autonomously created and
heterogeneous
Accessible via a network
Mediator provides the illusion of a single
information source

3
Information Agents aka Mediators
Example: Restaurant and Theatre Info on the Web
(Diagram: the Ariadne Mediator over Web sources: Zagat, Health Ratings, Movies, Map Servers, Geocoders)
4
Why the Interest in Building Such Systems?
(Diagram: a MEDIATOR over Oracle, Sybase, IBM DB2, a Legacy System, and an Object-Oriented DB)
5
Mediators on the Web
(Diagram labels: MEDIATOR, Wrapper, DB1, DB2)
6
Organization of Remainder of Talk

Introduction
Information Agents, System Architecture
Research Issues
Information Modeling
Query Planning
Semi-automatic Wrapper Generation
Performance Optimization by Materialization
Resolving Inconsistencies
Industry Products for Data Extraction and
Integration
Start-up Ventures

7
Representative Systems (Research Projects)

SIMS/Ariadne: University of Southern California/ISI
TSIMMIS: Stanford
Information Manifold: AT&T Research
Garlic: IBM Almaden
Tukwila: University of Washington
InfoSleuth: MCC
DISCO: University of Maryland/INRIA
HERMES: University of Maryland
InfoMaster: Stanford
InfoQuilt: University of Georgia

8
Information Modeling

Multiple, heterogeneous, autonomously created
information sources
User sees an integrated (global) view
Queries a mediated schema
A uniform model for all sources
Must be (at least) expressive enough to model the
most complex information source
Each source provides a set of relations or
classes
Translation into the uniform model is done by a
wrapper at each source
Integration
Global as view, Local as view

9
Global as View

For each relation (class) in mediated schema we
specify how to obtain its tuples from the sources

(Diagram: the mediated relation RESTAURANT defined over the sources ZAGAT, FODORS, GEOCODER, and DOH Ratings; attribute labels shown: Name, Phonenumber, Phone, Telephone, Address, Lat, Lon, Rating, Reviews)
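As a minimal sketch of the global-as-view idea, assume two in-memory lists standing in for the ZAGAT and FODORS sources. The rows, the field names, and the functions `restaurant` and `query_by_name` are hypothetical illustrations, not Ariadne's actual definitions. The mediated relation is written as a query over the sources:

```python
# Hypothetical source contents; real sources would be reached through wrappers.
ZAGAT = [
    {"name": "Chinois", "address": "2720 Main St", "telephone": "310-777-9876"},
]
FODORS = [
    {"name": "Chinois", "phone": "310-777-9876", "reviews": "(review text)"},
    {"name": "Peking Star", "phone": "213-999-7676", "reviews": "(review text)"},
]


def restaurant():
    """Global-as-view definition (assumed for illustration):
    RESTAURANT(name, phonenumber) is the union of projections of the two
    sources, with duplicate tuples removed."""
    tuples = {(r["name"], r["telephone"]) for r in ZAGAT}
    tuples |= {(r["name"], r["phone"]) for r in FODORS}
    return sorted(tuples)


def query_by_name(name):
    """A query on the mediated schema is answered by unfolding the view."""
    return [t for t in restaurant() if t[0] == name]
```

Because the view body already says where every tuple comes from, query answering reduces to unfolding the definition; adding a new source means editing the definition of `restaurant`.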
10
Heterogeneity Resolution

Sources may use different models
OO, Relational, Legacy, ...
May be Web sources
Wrapper exports contents in a uniform model
Structural and schematic differences
(name, address) vs. (name, street, city, state, zip)
Semantic
(name, phonenumber) vs. (name, telephone)
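The two kinds of difference can be illustrated with a small wrapper function. This is only a sketch: the uniform model (name, address, phonenumber), the function name `wrap_source_b`, and the sample row are assumptions made for the example.

```python
def wrap_source_b(row):
    """Export one row of a hypothetical source B in the uniform model.

    Structural difference: B splits the address into street/city/state/zip.
    Naming difference: B calls the phone attribute 'telephone'.
    """
    address = "{street}, {city}, {state} {zip}".format(**row)
    return {
        "name": row["name"],
        "address": address,
        "phonenumber": row["telephone"],
    }


# Hypothetical row as source B might return it.
ROW_B = {
    "name": "Peking Star",
    "street": "1 Broad St",
    "city": "Los Angeles",
    "state": "CA",
    "zip": "90001",
    "telephone": "213-999-7676",
}
```

Calling `wrap_source_b(ROW_B)` yields a record with the single `address` field and the `phonenumber` attribute that the mediated schema expects.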

11
Global as View Models

KR based models (SIMS, Ariadne, ...)
LOOM, CLASSIC
OO, based on ODMG (DISCO, Garlic)

interface Restaurant {
  attribute string name;
  attribute string address;
  attribute string cuisine;
  attribute string review;
}
extent restaurant0 of Restaurant wrapper w0 repository r0
  map ((zagts0=restaurant0) (name=n) (address=a) (cuisine=c));
12
Local as View

For every information source S describe it in
terms of relations in the mediated schema

v1(name, address, cuisine, rating) :-
  Restaurant(name, address, cuisine, rating), city = "Santa Monica"
v2(name, foodrating) :-
  Restaurant(name, address, cuisine, rating)
...
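A heavily simplified sketch of how a mediator might use such source descriptions: each view records which attributes it exports and which selections its body imposes, and the mediator keeps only the views that can contribute to a query. The dictionary layout and the function `relevant_sources` are assumptions for illustration; real local-as-view systems use query rewriting algorithms that this sketch does not attempt.

```python
# Each source is described as a view over the mediated relation Restaurant.
# 'exports' lists the attributes the view returns; 'constraints' records the
# selections in the view body (v1 only holds Santa Monica restaurants).
SOURCE_VIEWS = {
    "v1": {"exports": {"name", "address", "cuisine", "rating"},
           "constraints": {"city": "Santa Monica"}},
    "v2": {"exports": {"name", "foodrating"},
           "constraints": {}},
}


def relevant_sources(wanted, conditions):
    """Return the views that export every wanted attribute and whose
    constraints do not contradict the query's equality conditions."""
    chosen = []
    for view, desc in sorted(SOURCE_VIEWS.items()):
        if not wanted <= desc["exports"]:
            continue  # the view cannot supply all requested attributes
        clash = any(attr in conditions and conditions[attr] != value
                    for attr, value in desc["constraints"].items())
        if not clash:
            chosen.append(view)
    return chosen
```

For example, a query for names and addresses in Santa Monica keeps only `v1`, while a query for names in another city drops `v1` because its description rules it out. Adding a source only requires adding one entry to `SOURCE_VIEWS`.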
13
Query Planning and Optimization

Mediator must generate an information gathering
plan
Constraints on execution
Binding patterns ....
Optimization of query plans
Current areas of work
Optimization
Approximate answers (incomplete sources)
Query planning for other sources such as
simulations, computer programs etc.
Query execution engines
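Binding patterns constrain the order in which sources can be called: a source that needs an address as input cannot be queried until some earlier step has produced one. The sketch below orders source accesses greedily under that constraint. The source names, their inputs and outputs, and the function `order_sources` are hypothetical, and a real planner would also weigh cost.

```python
# Hypothetical sources: name -> (attributes that must be bound, attributes returned).
SOURCES = {
    "geocoder": ({"address"}, {"lat", "lon"}),
    "zagat": ({"cuisine"}, {"name", "address"}),
    "map_server": ({"lat", "lon"}, {"map"}),
}


def order_sources(sources, initially_bound):
    """Greedy ordering: repeatedly call any source whose required inputs are
    already bound, then treat its outputs as bound. Returns None if no
    executable ordering exists."""
    bound = set(initially_bound)
    remaining = dict(sources)
    plan = []
    while remaining:
        ready = sorted(name for name, (needs, _) in remaining.items()
                       if needs <= bound)
        if not ready:
            return None  # some source can never have its inputs bound
        name = ready[0]
        _, gives = remaining.pop(name)
        bound |= gives
        plan.append(name)
    return plan
```

With only `cuisine` bound by the user's query, the plan must visit `zagat` first, then `geocoder`, then `map_server`; with nothing bound, no plan exists.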

14
Query Plans and Plan Quality
Low-Quality Plan
High-Quality Plan
15
Accessing Sources via Wrappers
SELECT address, tel FROM Restaurant WHERE cuisine = 'chinese'
Chinois, 2720 Main St, 310-777-9876
Peking Star, 1 Broad St, 213-999-7676
...
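A wrapper of this kind can be sketched as a function that extracts tuples from a page and applies the selection. The page markup, the regular expression, the third restaurant row, and the function `select_by_cuisine` are all made up for the example; a real wrapper would fetch the page over the network and cope with messier HTML.

```python
import re

# Hypothetical page content; a real wrapper would fetch this over HTTP.
PAGE = """
<li><b>Chinois</b> (chinese) 2720 Main St, 310-777-9876</li>
<li><b>Peking Star</b> (chinese) 1 Broad St, 213-999-7676</li>
<li><b>Trattoria Roma</b> (italian) 5 Ocean Ave, 310-555-0100</li>
"""

# One listing per <li>: name, cuisine in parentheses, address, phone.
ROW = re.compile(r"<li><b>(?P<name>[^<]+)</b> \((?P<cuisine>\w+)\) "
                 r"(?P<address>[^,]+), (?P<tel>[\d-]+)</li>")


def select_by_cuisine(cuisine):
    """Answer a selection on cuisine by extracting tuples from the page."""
    return [(m["name"], m["address"], m["tel"])
            for m in ROW.finditer(PAGE) if m["cuisine"] == cuisine]
```

The mediator sees only the returned tuples, so the source looks like a relation even though it is a Web page.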
16
Semi-Automatic Wrapper Generation

Need wrappers for several sites
Building wrappers by hand is tedious and
time-consuming
Approaches to automating the process
Exploit format information (structure, HTML, etc.)
Template based approaches
Machine learning techniques
XML

<name> Peking Star </name> <address> 1 Broad
Street, Los Angeles </address> <phone>31-822-1511
</phone>
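When a source already delivers XML, extraction becomes a matter of parsing rather than pattern matching. A minimal sketch using Python's standard `xml.etree.ElementTree` follows; the enclosing `<restaurant>` element and the function `extract` are assumptions added so that the fragment is a well-formed document.

```python
import xml.etree.ElementTree as ET

# The record from the slide, wrapped in an assumed <restaurant> root element.
RECORD = """<restaurant>
  <name> Peking Star </name>
  <address> 1 Broad Street, Los Angeles </address>
  <phone>31-822-1511 </phone>
</restaurant>"""


def extract(xml_text):
    """Turn one XML record into an attribute -> value dictionary."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text.strip() for child in root}
```

The tag names serve directly as attribute names, which is why XML output makes wrappers much easier to generate.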
17
Wrappers .... Work in Progress

Database wrappers
Variety of techniques for Web wrappers
Upmarking
To XML
Building Web-bases
Other Artificial Intelligence techniques
Natural Language Processing
IR
Classifiers

18
Performance Issue

Query processing time is typically very high
Despite the mediator generating efficient query
plans
Cost of fetching data and pages from remote
sources dominates
Have to typically fetch a large number of Web
pages
The Web sources are not designed for
database-like query access
The Web sources can be slow
Further improve performance by materializing data
at the mediator side.

19
Store and Materialize Data Locally
(Diagram: the MEDIATOR with two access paths: Wrapped Web Source (SLOW) and Materialized Data (FAST))
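The two access paths can be sketched as a mediator that consults its local store before going to the remote source. The class `Mediator`, its `get` method, and the class names used as keys are hypothetical and stand in for whatever unit of data the real system materializes.

```python
class Mediator:
    """Answers requests from materialized data when possible (assumed design)."""

    def __init__(self, fetch_remote, materialized=None):
        self.fetch_remote = fetch_remote          # slow path: wrapped Web source
        self.materialized = dict(materialized or {})  # fast path: local copy
        self.remote_calls = 0

    def get(self, cls):
        if cls in self.materialized:
            return self.materialized[cls]
        self.remote_calls += 1
        return self.fetch_remote(cls)
```

Counting `remote_calls` makes the benefit visible: every request served from `materialized` is one fewer round trip to a slow source.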
20
Selective Materialization

Why not simply materialize all the data in all
the Web sources being integrated and have a
really fast mediator?
Will not scale; amount of space needed may be too
much
Web sources can get updated
Cost of keeping data consistent can get
prohibitive
We are building a mediator, not a data warehouse!
Approach then is to selectively materialize data
How do we automatically identify the portion of
data most useful to materialize ?

21
Selecting Data to Materialize
Classes of data to materialize are selected based on:
Distribution of User Queries (Identify frequently accessed classes)
Structure of Sources (Prefetch data to speed up expensive queries)
Updates (Have to consider maintenance cost)
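One way to combine these factors is a simple benefit score per class: time saved by answering frequent, expensive queries locally, minus the cost of keeping the copy up to date, subject to a space budget. The scoring formula, the statistics, and the function `choose_classes` are assumptions made for illustration and are not the selection algorithm of any particular system.

```python
def choose_classes(stats, space_budget):
    """Pick classes to materialize.

    stats maps class name -> (query_frequency, remote_cost, update_cost, size).
    Assumed benefit = frequency * remote_cost - update_cost; classes with
    non-positive benefit are never materialized.
    """
    scored = []
    for name, (freq, remote_cost, update_cost, size) in stats.items():
        benefit = freq * remote_cost - update_cost
        if benefit > 0:
            scored.append((benefit, name, size))

    chosen, used = [], 0
    for benefit, name, size in sorted(scored, reverse=True):
        if used + size <= space_budget:  # greedy fill, highest benefit first
            chosen.append(name)
            used += size
    return chosen


# Hypothetical statistics for three classes.
STATS = {
    "restaurant": (100, 2.0, 20.0, 50),
    "movie_showtimes": (80, 1.0, 200.0, 30),   # changes often: costly to maintain
    "health_rating": (10, 5.0, 5.0, 60),
}
```

In this made-up example `movie_showtimes` is queried often but is never chosen, because its maintenance cost outweighs the time saved.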
22
Inconsistency Resolution

Same object in different formats
United States and US
Red Lobster and The Red Lobster
John Smith; Smith, J.; J. Smith; Dr. John Smith; ...
Has appeared in other database and IR contexts
Solutions
Mapping tables
For finite domains (such as cities, countries,
companies)
Simply maintain an enumerated list of possible
formats for each object
(New York, N.Y., NYC, New York City, Big
Apple)
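A mapping table of this kind is just a lookup from every known format to one canonical name. A minimal sketch follows; the choice of canonical forms, the case-insensitive matching, and the function `canonical` are assumptions.

```python
# Canonical name -> enumerated list of known formats.
MAPPING_TABLE = {
    "New York": ["New York", "N.Y.", "NYC", "New York City", "Big Apple"],
    "United States": ["United States", "US"],
}

# Inverted index: lower-cased variant -> canonical name.
LOOKUP = {variant.lower(): canon
          for canon, variants in MAPPING_TABLE.items()
          for variant in variants}


def canonical(value):
    """Return the canonical form, or the value unchanged if it is unknown."""
    return LOOKUP.get(value.strip().lower(), value)
```

Two records refer to the same object when their values map to the same canonical name. Unknown values pass through unchanged, which is why this only works for domains small enough to enumerate.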

23
Mapping Functions

Mapping functions
When domain is not finite (person names)
Domain specific mapping transformations
Stemming common words (Inc., Corp., The, etc.)
Matching full word and abbreviation
Match 2 formats with a score
Current work
Learning mapping functions from example matches
IR based approaches
Building metabases
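A mapping function that drops common words and returns a match score can be sketched with Python's standard `difflib`. The word list, the normalization, and the use of `SequenceMatcher` as the similarity measure are assumptions chosen for the example; the slides do not say which measure the authors used.

```python
import difflib
import re

# Assumed list of words to ignore when comparing names.
COMMON_WORDS = {"the", "inc", "corp", "dr"}


def normalize(text):
    """Lower-case, split into alphanumeric words, and drop common words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in COMMON_WORDS]


def match_score(a, b):
    """Similarity in [0, 1]; 1.0 means the normalized forms are identical."""
    return difflib.SequenceMatcher(
        None, " ".join(normalize(a)), " ".join(normalize(b))).ratio()
```

With this normalization "Red Lobster" and "The Red Lobster" score 1.0. A threshold on the score then decides whether two formats are treated as the same object, and picking that threshold is the domain-specific part.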

24
Mediator Prototypes and Software

Software and tools from mediator research
projects
What may be available.
Mediator kernels (integration engines)
Data modeling tools, Description Logic systems
Wrapper and extractor toolkits and software
Plenty of papers!
Ariadne, USC/ISI, http://www.isi.edu/ariadne
TSIMMIS, Stanford, http://www-db.stanford.edu/tsimmis/
MIX, UCSD, http://feast.ucsd.edu/Projects/MIX/
InfoSleuth, MCC, http://www.mcc.com/projects/infosleuth/
DISCO, U Maryland, http://www.umiacs.umd.edu/labs/CLIP/im.html
Garlic, IBM Almaden, http://www.almaden.ibm.com/cs/garlic.html
Tukwila, U Washington, http://data.cs.washington.edu/integration/tukwila/

25
Applications of Mediators

Heterogeneous and Distributed Database
Integration
Legacy systems integration
Web Sources Integration
Data Integration for E-commerce
Integrating product catalogs, multiple vendors
Data Warehousing
For populating data warehouses
Bioinformatics
Information Management Environments
Digital Libraries
Healthcare Information Systems

26
Industry Products (IBM DB2 DataJoiner)

IBM DB2 DataJoiner
http://www-4.ibm.com/software/data/datajoiner/
Enterprise data integration middleware
DataJoiner functionality now incorporated in IBM
DB2 UDB
http://www-4.ibm.com/software/data/db2/udb/about.html
Native support for popular relational data
sources
DB2, Informix, SQL Server, Sybase, Teradata and
others
Supports non-relational data sources
Support for Web data
Available on a variety of platforms and OS

27
Start-up Ventures: Junglee Corp

Website: www.amazon.com (Acquired)
Researcher Founders: Rajaraman, Gupta,
Harinarayan, Mathur
Products and Services
Tools for data extraction and integration
Building warehouse from multiple Web sources
Integrating apartment listings from multiple
sources
Integrating job postings from multiple online job
sources
Market focus: Online shopping
Current Status: Acquired by Amazon
Similar ventures: Netbots Inc. (www.excite.com),
acquired by Excite

28
Cohera

Website: www.cohera.com
Researcher Founders: Stonebraker, Hellerstein
Products and Services
Cohera E-Catalog System
Integrates product data from multiple sellers and
product catalogs
Set of software servers and tools for building
and running live e-catalogs
Market(s) Targeted: E-Commerce
Customers: E-Commerce communities - ThomasNet,
Trapezo, LiveListings, FoodService.Com
Current Status: Founded October 1997, Privately
Held
Similar ventures: Enosys Markets Inc.
(www.enosysmarkets.com),
Mergent Inc. (www.mergent.com)

29
Nimble Technology

Website: www.nimble.com
Researcher Founders: Levy, Weld
Products and Services
Nimble Data Integration Suite
XML-based integration approach
Current focus on multiple information sources
integration
Tools for data extraction and Data Integration
Engine
Market focus: CRM, Business Intelligence, B2B,
Portals
Current Status: Founded June 1999, Privately Held

30
WhizBang! Labs

Website: www.whizbanglabs.com
Researcher Founders: Quass, Geddes, Mitchell
Products and Services
Technology for building Webbases - databases
created by extracting data from Web pages
Topic specific
Topic specific crawler for retrieving pages
Tools for extracting data from Web pages,
cleaning data and loading into database
Market focus: Content providing portals
Current Status: Founded March 1999, Privately
held
Similar ventures: Fetch Technologies
(www.fetch.com)

31
Bioinformatics: A Data Integration Grand Challenge

Mapping of Human Genetic Code complete
New, revolutionary, computational approach to
drug discovery
Huge amounts of genetic, chemical and biological
data being generated at an exponential rate in
biotech/pharma R&D
Complex structures, maps, sequence data etc.
Drug discovery scientists need integrated access
to this data
Look for patterns across data sources
Need to integrate data from multiple labs
Lab procedures (thus the data) keep changing
Good amount of genomic data is free text
DiscoveryLink: state-of-the-art Life Sciences
data integration middleware from IBM
http://www-4.ibm.com/software/webservers/lifesciences/discovery.html

32
Conclusion

Information mediation
Issues in building such systems
Research projects
Industry products
Start-up ventures
Applicable to wide areas such as E-commerce,
database and legacy systems integration, Web
source extraction, content management, portals,
digital libraries, bioinformatics.
