AnHai Doan

Slides: 47

Provided by: zam34

Learn more at: https://pages.cs.wisc.edu

1
Evolving and Self-Managing Data Integration
Systems

AnHai Doan
Dept. of Computer Science
Univ. of Illinois at Urbana-Champaign
Spring 2004

2
Data Integration at UIUC

Two main players
Kevin Chang and AnHai Doan
10 students, 30001 cups of coffee, 3 SIGMOD-04 papers
Four supporting players
Chengxiang Zhai: IR, bioinformatics, text/data integration
Dan Roth: AI, question answering, text/data integration
Jiawei Han: data mining
Marianne Winslett: security/privacy issues in data sharing
Many supporting departments and local organizations
NCSA, Information Science, Genome Institute, Fire Service Institute

3
Data Integration Challenge
Find houses with 3 bedrooms priced under 300K
New faculty member
homes.com
realestate.com
homeseekers.com
4
Architecture of Data Integration System
Query: Find houses with 3 bedrooms priced under 300K
Mediated schema: price, num-beds, location, agent-name
Source schemas 1-3, one per source (example shown: list-price, bdrms, address)
Wrappers: one per source
Sources: homes.com, realestate.com, homeseekers.com
Think comparison shopping systems on steroids ...
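The architecture above can be sketched in a few lines: each wrapper translates its source's rows into the mediated schema, and the query is posed once against the mediated schema. This is only an illustrative sketch. The `Source` class, the `answer` function, all rows, and the second source schema (cost, beds, city, state) are assumptions; the slide gives only the mediated schema and one source schema, and does not say which site uses it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Source:
    """A data source plus the semantic mapping its wrapper applies (hypothetical)."""
    name: str
    rows: List[dict]
    # mediated-schema attribute -> function that derives it from a source row
    mapping: Dict[str, Callable[[dict], object]]

    def wrapped_rows(self) -> List[dict]:
        """Translate every source row into the mediated schema."""
        return [
            {**{attr: fn(row) for attr, fn in self.mapping.items()},
             "source": self.name}
            for row in self.rows
        ]


def answer(sources: List[Source], predicate: Callable[[dict], bool]) -> List[dict]:
    """Pose one mediated-schema query over all wrapped sources."""
    return [r for s in sources for r in s.wrapped_rows() if predicate(r)]


# The schema (list-price, bdrms, address) appears on the slide; the rows and
# the choice of site are invented for illustration.
realestate = Source(
    "realestate.com",
    [{"list-price": 320_000, "bdrms": 3, "address": "Seattle, WA"},
     {"list-price": 240_000, "bdrms": 3, "address": "Miami, FL"}],
    {"price": lambda r: r["list-price"],
     "num-beds": lambda r: r["bdrms"],
     "location": lambda r: r["address"]},
)
# Entirely hypothetical second source schema.
homeseekers = Source(
    "homeseekers.com",
    [{"cost": 280_000, "beds": 2, "city": "Urbana", "state": "IL"}],
    {"price": lambda r: r["cost"],
     "num-beds": lambda r: r["beds"],
     "location": lambda r: f'{r["city"]}, {r["state"]}'},
)

# "Find houses with 3 bedrooms priced under 300K"
results = answer([realestate, homeseekers],
                 lambda r: r["num-beds"] == 3 and r["price"] < 300_000)
print(results)
```

The per-source `mapping` is exactly the part that is expensive to write by hand, which is what the rest of the talk tries to automate.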
5
The Need for Data Integration is Ubiquitous!

In virtually all domains
data are distributed and stored in heterogeneous formats
WWW
hundreds/thousands of sources in bioinformatics, real estate, books, etc.
Enterprises
avg. organization has 49 databases [Ives-01]
organizations frequently merge, exchange data
Government: e.g., digital government initiatives
Military, cultural international exchange, Semantic Web, information agents, etc.
Long-standing challenge in the database community
recent explosion of distributed data adds urgency

6
Current State of Affairs

Vibrant research and industrial landscape
Research
dates back to the 70s-80s, accelerated in the 90s
Stanford, UPenn, AT&T Labs, Maryland, UWashington, Wisconsin, IBM Almaden, ISI, Arizona State U, Ireland, CMU, etc.
many workshops in AI and DB communities: e.g., SIGMOD/VLDB-04
focused on
conceptual and algorithmic aspects
building systems in specific domains (bio, geo-spatial, rapid emergency response, virtual organization, etc.)
Industry
more than 50 startups in 2001, new startups in 2004

Despite much R&D activity, however ...
7
Current State of Affairs (cont.)

Most DI systems are still built and maintained manually
Manual deployment is extremely labor-intensive ...
construct mediated and source schemas,
find semantic mappings between schemas,
constantly monitor and adjust to changes at hundreds or thousands of data sources, ...
... and has become a key bottleneck
Emerging technologies
XML, Web services, Semantic Web, ...
will further fuel DI applications and exacerbate the problem

Slashing the astronomical cost of ownership is now crucial!
8
The AIDA Project

Recently started at Univ. of Illinois
AIDA: Automatic Integration of Data
Goal: evolving and self-managing data integration systems
Easy to start
takes hours instead of weeks or months
perhaps with just a few sources
Learn to continuously improve
expand to cover new sources
add novel query capabilities, better query performance
Adjust automatically to changes
detect and fix broken wrappers, semantic matches, etc.
Require minimal effort from system admin
some effort at the start
far less as the system learns more and more

9
The AIDA Project (cont.)

In line with trends in broader computing
landscape
autonomic systems (IBM initiative)
recovery-oriented computing (Berkeley)
cognitive computer systems (DARPA)
from cycles to RASS (Stanford)
self-tuning databases (MSR, IBM Almaden, Oracle)
Key differences
applied to distributed data management systems
must attack difficult semantics/meta-data issues
heavy involvement of humans
must handle large scale
Need techniques from multiple fields
databases, machine learning, AI, IR, data mining

10
Project Overview

Thrust 1: automate current labor-intensive tasks
schema matching
mediated schema construction
entity matching
Thrust 2: develop new capabilities
entity integration
Thrust 3: monitor and adjust to changes
Thrust 4: reduce cost of system admin
by leveraging the mass of users
Thrust 5: design sources for interoperability

11
Schema Matching
Mediated schema: price, agent-name, address
Source schema (homes.com): listed-price, contact-name, city, state
Sample homes.com tuples:
(320K, Jane Brown, Seattle, WA)
(240K, Mike Smith, Miami, FL)
Match types shown: 1-1 match, complex match
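To make the two match types concrete, here is a small sketch that applies a set of matches to the homes.com tuples from the slide. The specific pairs are an assumption: the slide names the match types but the extracted text does not say which attributes are linked. The sketch assumes price and agent-name are 1-1 matches and address is a complex match that concatenates city and state. The names `one_to_one`, `complex_matches`, and `to_mediated` are hypothetical.

```python
# Tuples from the homes.com table on the slide.
homes_rows = [
    {"listed-price": "320K", "contact-name": "Jane Brown",
     "city": "Seattle", "state": "WA"},
    {"listed-price": "240K", "contact-name": "Mike Smith",
     "city": "Miami", "state": "FL"},
]

# Assumed 1-1 matches: mediated attribute -> single source attribute.
one_to_one = {"price": "listed-price", "agent-name": "contact-name"}

# Assumed complex match: mediated attribute -> expression over several
# source attributes.
complex_matches = {"address": lambda row: f'{row["city"]}, {row["state"]}'}


def to_mediated(row: dict) -> dict:
    """Rewrite one source tuple into the mediated schema using the matches."""
    out = {m: row[s] for m, s in one_to_one.items()}
    out.update({m: fn(row) for m, fn in complex_matches.items()})
    return out


mediated_rows = [to_mediated(r) for r in homes_rows]
for r in mediated_rows:
    print(r)
```

A 1-1 match is just a rename, while a complex match needs an expression, which is why the two are learned by different systems later in the talk.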
12
Why Schema Matching is Difficult

Schema and data never fully capture semantics!
not adequately documented
Must rely on clues in schema and data
using names, structures, types, data values, etc.
Such clues can be unreliable
same names => different entities: area => location or square-feet
different names => same entity: area and address => location
Intended semantics can be subjective
house-style = house-description?
military apps require committees!
Cannot be fully automated, needs work from system admin!
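The "area" example can be illustrated with a toy matcher that combines two clues, attribute names and data values. This is a minimal sketch under stated assumptions, not the method of any system named in the talk: the equal weighting, the "fraction of numeric values" feature, and all sample values are invented for illustration.

```python
from difflib import SequenceMatcher


def name_score(a: str, b: str) -> float:
    """Name clue: string similarity of the two attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_fraction(values) -> float:
    """Fraction of sample values that look like plain integers."""
    cleaned = [v.replace(",", "").strip() for v in values]
    return sum(c.isdigit() for c in cleaned) / len(cleaned)


def value_score(values_a, values_b) -> float:
    """Data clue: how similar the two columns are in 'numeric-ness'."""
    return 1.0 - abs(numeric_fraction(values_a) - numeric_fraction(values_b))


def best_match(src_name, src_values, candidates):
    """Score every mediated attribute with both clues, equally weighted."""
    scores = {
        cand: 0.5 * name_score(src_name, cand) + 0.5 * value_score(src_values, vals)
        for cand, vals in candidates.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical sample values for two mediated-schema attributes.
mediated = {
    "location": ["Urbana, IL", "Austin, TX"],
    "square-feet": ["1200", "2,500"],
}

# A source attribute called "area" whose values are place names.
best, scores = best_match("area", ["Seattle, WA", "Miami, FL"], mediated)
print(best, scores)
```

The name "area" alone does not say whether the attribute means location or square-feet; here the data values tip the decision. With numeric values in the source column the same code would pick square-feet, and with messy data neither clue may be reliable, which is the slide's point.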

13
Current State of Affairs

Largely done by hand
labor intensive and error prone
data integration at GTE [Li & Clifton, 2000]
40 databases, 27000 elements, estimated time 12 years
Need semi-automatic approaches to scale up!
Numerous prior and current research projects
Databases: SemInt (Northwestern), DELTA (MITRE), IBM Almaden, Microsoft Research, Wisconsin, Toronto, UC-Irvine, BYU, George Mason, U of Leipzig, ...
AI: Stanford, Karlsruhe University, NEC Japan, ISI, ...
Many startups

14
Our Prior and Ongoing Work (2000-date)

Joint work with
Robin Dhamankar, Yoonkyong Lee, Wensheng Wu, Rob McCann, Warren Shen, Alex Kramnik, Olu Sobulo, Vanitha Varadarajan (Illinois), Pedro Domingos, Alon Halevy (U Washington)
Learning 1-1 matches for relational and XML schemas
LSD (Learning Source Description) system [WebDB-00, SIGMOD-01, Machine Learning Journal-03]
Learning 1-1 and complex matches for ontologies
GLUE [WWW-02, VLDB Journal-03, Ontology Handbook-03]
Learning 1-1 matches by mass collaboration
MOBS [WebDB-03, IJCAI-03 Workshop]
Learning complex matches for relational schemas
iMAP [SIGMOD-04]
Large-scale matching via clustering: ICeQ [SIGMOD-04]
Corpus-based schema matching (submitted)
Further resources
brief survey talk at http://anhai.cs.uiuc.edu/home/talks/isi-matching.ppt
"Learning to Match Structured Representations of Data", book by Springer-Verlag, to appear

15
Mediated Schema Construction

Joint work with
Wensheng Wu (UIUC), Clement Yu (UIC), Weiyi Meng (SUNY Binghamton)
ICeQ project
given a set of source query interfaces
construct a mediated schema
Step 1: find matches among source query interfaces
use clustering [SIGMOD-04]
Step 2: use the found matches to construct mediated schema (ongoing work)
Future work
given a lot of text in the domain, construct a mediated schema

16
Project Overview

Thrust 1: automate current labor-intensive tasks
schema matching
mediated schema construction
entity matching
Thrust 2: develop new capabilities
entity integration
Thrust 3: monitor and adjust to changes
Thrust 4: reduce cost of system admin
by leveraging the mass of users
Thrust 5: design sources for interoperability

17
Entity Matching
(400K, Queen Ann Seattle, 206-616-1842, Mike Brown) ...

PRICE | LOCATION | PHONE | NAME
(400K, Queen Ann Seattle, 206-616-1842, M. Brown)
(320K, S. W. Champaign, 217-727-1999, Jane Smith)
...

PRICE | LOCATION | PHONE | NAME
(250K, Decatur, 317-727-2459, P. Robertson)
(400K, Seattle, 616-1842, Mike Brown)
...
18
Prior Work

Very active area of research
databases: [Hernandez & Stolfo, SIGMOD-95], [Cohen, SIGMOD-98], [Elfeky, Verykios & Elmagarmid, ICDE-02], ...
AI: [Cohen & Richman, KDD-02], [Bilenko & Mooney, 02], Dan Roth group, [Tejada et al., 01], [Tejada et al., KDD-02], [Michalowski et al., 03], ...
Much progress
very effective techniques for many applications
covered a broad range of scenarios
Key commonality
assume entities from disparate sources have same
set of attributes
e.g., (price,location,phone,name) vs.
(price,location,phone,name)
match entities based on similarity of
corresponding attributes

19
Our PROM Approach

Key observation 1: Entities often have disjoint attributes
source V1 (age, name)
source V2 (name, salary)
source S1 (location, description, phone, name)
source S2 (description, phone, name, price, sq-feet)
Key observation 2: Correlations among disjoint attributes can be exploited to maximize matching accuracy!
e.g., (9, Mike Brown) vs. (M. Brown, 200K): a 9-year-old is unlikely to make 200K!

20
A Profile-Based Solution

Consider again matching persons
source V1 (age, name)
source V2 (name, salary)
(9, Mike Brown) vs. (M. Brown, 200K)
Step 1: build a person profile
what does a typical person look like?
build from data and user input
Step 2: match person names
Mike Brown vs. M. Brown => 0.7
discard if confidence score is low, otherwise ...
Step 3: feed both tuples into profile
(9, Mike Brown, M. Brown, 200K) => 0.3
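The three steps above can be sketched as follows. This is an illustration only, not the PROM implementation: the profile is a single hand-written rule, the name matcher is plain string similarity from Python's `difflib`, and both thresholds are arbitrary assumptions. The scores it produces will not equal the 0.7 and 0.3 shown on the slide.

```python
from difflib import SequenceMatcher


def name_score(a: str, b: str) -> float:
    """Step 2: similarity of two person names (shared attribute)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def person_profile(age: int, salary: int) -> float:
    """Step 1 (hypothetical profile): plausibility that one real person has
    this age and this salary. A real profile would be learned from data
    and user input rather than hard-coded."""
    if age < 16 and salary >= 50_000:
        return 0.05
    return 0.9


def match(t1, t2, name_threshold=0.5, profile_threshold=0.5) -> bool:
    """Match a V1 tuple (age, name) against a V2 tuple (name, salary)."""
    age, name1 = t1
    name2, salary = t2
    # Step 2: discard the pair early if the names are too different.
    if name_score(name1, name2) < name_threshold:
        return False
    # Step 3: feed the merged tuple (disjoint attributes) into the profile.
    return person_profile(age, salary) >= profile_threshold


print(match((9, "Mike Brown"), ("M. Brown", 200_000)))    # names agree, profile objects
print(match((45, "Mike Brown"), ("M. Brown", 200_000)))   # names agree, profile accepts
print(match((45, "Mike Brown"), ("Jane Smith", 200_000))) # names disagree
```

The name comparison alone would accept the first pair; only the profile, which looks at age and salary together, rejects it.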

21
Advantages of Profile-Based Solution

Can exploit disjoint attributes to improve
accuracy
Profiles capture task-independent knowledge
created from task data, domain experts, external
data
created once, used anywhere
an example of knowledge construction and reuse
Yields an extensible, modular architecture
plug and play with new profiles

[Figure: PROM architecture. Components: tuples t1 and t2; Similarity Estimator; hard profilers and Hard Profile Combiner; user-specified constraints; soft profilers and Soft Profile Combiner; output: matching pairs. Knowledge sources: training data, expert knowledge, domain data, previous matching tasks.]
22
Profilers
PROFILERS encode information about domain concepts and can be constructed in many ways

Manual Profiler
manually encoded rules
specified by a domain expert
Association Rule Profiler
encodes interesting association rules having high confidence
employs association rule mining techniques
e.g., debut-year ? b-year
Completeness Profiler
categorical rules based on complete data
learned from external data that is complete in some aspect
e.g., Color US movies are produced only after 1917
e.g., (birth-year < 1900) implies (ODI-matches = 0)
Instance Profiler
characteristics of a few frequent entities
e.g., profilers for the 10 most productive directors
Histogram Profiler
all possible value combinations for a set of attributes
e.g., (studio, movie-genre)
Classifier
learned from training data
encodes high-confidence rules relating disjoint attributes
e.g., decision tree
23
Entity Matching Empirical Evaluation
Improves accuracy significantly across six real-world domains
More profilers result in better performance
24
Entity Integration

Problem: find all tuples related to a real-world entity.
given a seed paper
Chris C. Zhai, A. Kramnik, Hui Fang, Query Processing, SIGMOD, 1998
find all papers by Chris C. Zhai from DBLP-Lite

[Figure: DBLP-Lite data source]

Desired result: papers (1)-(2)

25
Baseline Solutions: Pairwise Matching

Seed paper: Chris C. Zhai, A. Kramnik, Hui Fang, Query Processing, SIGMOD, 1998
If match papers based only on author names => retrieve (1)-(6)
If consider also co-authors and confs => retrieve (1)-(2), (4)-(6)
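The two baselines can be sketched on a toy paper list. The DBLP-Lite table from the slide did not survive extraction, so the candidate papers below are hypothetical and their ids do not correspond to papers (1)-(6). Only the seed paper comes from the slide. The name-compatibility rule (same last name, same first initial) is also an assumption.

```python
def name_compatible(a: str, b: str) -> bool:
    """Two names may denote the same person if the last names are equal
    and the first initials agree (assumed rule, for illustration)."""
    pa, pb = a.replace(".", "").split(), b.replace(".", "").split()
    return pa[-1].lower() == pb[-1].lower() and pa[0][0].lower() == pb[0][0].lower()


# Seed paper from the slide.
SEED = {"authors": ["Chris C. Zhai", "A. Kramnik", "Hui Fang"], "venue": "SIGMOD"}
TARGET = "Chris C. Zhai"

# Hypothetical candidate papers.
candidates = {
    1: {"authors": ["C. Zhai", "H. Fang"], "venue": "SIGIR"},
    2: {"authors": ["Chris Zhai"], "venue": "SIGMOD"},
    3: {"authors": ["C. Zhai", "J. Doe"], "venue": "ICML"},
    4: {"authors": ["Z. Chen"], "venue": "SIGMOD"},
}


def matches_by_name(paper: dict) -> bool:
    """Baseline 1: some author name is compatible with the target name."""
    return any(name_compatible(TARGET, a) for a in paper["authors"])


def matches_with_context(paper: dict) -> bool:
    """Baseline 2: name match plus a shared co-author or the same venue."""
    if not matches_by_name(paper):
        return False
    seed_coauthors = [a for a in SEED["authors"] if a != TARGET]
    shared_coauthor = any(
        name_compatible(s, a) for s in seed_coauthors for a in paper["authors"]
    )
    return shared_coauthor or paper["venue"] == SEED["venue"]


by_name = [pid for pid, p in candidates.items() if matches_by_name(p)]
with_context = [pid for pid, p in candidates.items() if matches_with_context(p)]
print(by_name, with_context)
```

Adding co-author and venue context prunes some false matches, but as the slide notes, pairwise comparison against the seed alone still lets wrong papers through.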

26
Better Solution: Apply Profilers to Pairwise Matching

Seed paper: Chris C. Zhai, A. Kramnik, Hui Fang, Query Processing, SIGMOD, 1998
If match papers based only on author names => retrieve (1)-(6)
If consider also co-authors and confs => retrieve (1)-(2), (4)-(6)

Aggregate property: very active in both DB and IR, with 3 SIGMOD/VLDB papers and 3 SIGIR papers in 3 years
Doesn't fit profile of a typical researcher!
27
Even Better Solution: Global Matching
Seed paper: Chris C. Zhai, A. Kramnik, Hui Fang, Query Processing, SIGMOD, 1998
C. Zhai, Search Optimization, SIGIR, 1999 (4)
28
Empirical Evaluation
Clustering improves performance over pair-wise
matching
Adding profilers improves performance over both
clustering and pair-wise matching.
29
More Information on Entity Matching and Integration

Context-based entity matching and integration
Tech. Report UIUC-03-2004
Profile-based object matching for information integration
A. Doan, Y. Lu, Y. Lee, and J. Han
IEEE Intelligent Systems, special issue on information integration, 2003
Object matching for data integration: a profile-based approach
A. Doan, Y. Lu, Y. Lee, and J. Han
Proc. of the IJCAI-03 workshop on information integration on the Web, 2003

30
Project Overview

Thrust 1: automate current labor-intensive tasks
schema matching
mediated schema construction
entity matching
Thrust 2: develop new capabilities
entity integration
Thrust 3: monitor and adjust to changes
Thrust 4: reduce cost of system admin
by leveraging the mass of users
Thrust 5: design sources for interoperability

31
The Problem

Numerous automatic tools have been developed for
schema matching, wrapper construction, source
discovery, etc.
No matter how good these tools are, system admin
still needs to
verify predictions of tools
correct wrong ones
These tasks are still extremely labor intensive
even worse when considering system maintenance
System complexity overwhelms capacity of human
admin
Reducing the labor cost of system admin is critical!
perhaps the most important issue, to enable practical systems!

32
Solution Shift Some Labor to Users

Take some task or part of some task
e.g., schema matching, wrapper construction, source discovery
Convert it into a series of very simple questions
such that knowing the answers => solving the task
Ask users to answer questions
such that each user has to do very little work
=> Spread the task labor thinly over a mass of users!
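As a rough illustration of "convert the task into simple questions", the sketch below turns 1-1 schema matching into yes/no questions, one per candidate attribute pair, and rebuilds the matches from the answers. The question wording and the functions `make_questions` and `matches_from_answers` are hypothetical; the attribute names are borrowed from the schema matching slide.

```python
from itertools import product


def make_questions(source_attrs, mediated_attrs):
    """One yes/no question per (source attribute, mediated attribute) pair."""
    return [
        {"source_attr": s, "mediated_attr": m,
         "text": f"Does '{s}' hold the same kind of information as '{m}'?"}
        for s, m in product(source_attrs, mediated_attrs)
    ]


def matches_from_answers(questions, answers):
    """Knowing the answers solves the task: keep the pairs answered 'yes'.
    `answers` is a list of booleans aligned with `questions`."""
    return {q["source_attr"]: q["mediated_attr"]
            for q, a in zip(questions, answers) if a}


questions = make_questions(["listed-price", "contact-name"],
                           ["price", "agent-name"])
for q in questions:
    print(q["text"])

# Hypothetical answers, in question order.
matches = matches_from_answers(questions, [True, False, False, True])
print(matches)
```

Each question can be answered in seconds by a different user, so no single person carries the cost of the whole matching task.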

33
Example: Mass Collaboration for Schema Matching

34
Mass Collaboration is not New

Successfully applied to
open source software construction
knowledge base construction
collaborative software bug detection
collaborative filtering
annotating online pictures (CMU)
Leverage both implicit and explicit feedback from
users
But has not been applied to data integration
settings
Can use both implicit and explicit feedback
focus here on explicit one

35
MOBS Project: Mass Collaboration to Build DI Systems

Joint work with
Rob McCann, Alex Kramnik, Warren Shen, Vanitha Varadarajan, Olu Sobulo
If it succeeds
can dramatically reduce cost and time
launch numerous DI systems on the Web and in enterprises
Key challenges
how to break a task into a series of questions
how to entice users to answer questions
how to combine user answers
(e.g., what to do with malicious users?)
Illustrate baseline solutions via schema matching

36
Example: Book Domain
37
Build Partially Correct System
38
Solicit User Answers
39
Detect and Remove Bad Users
40
Combine User Answers
41
Combine User Answers
42
MOBS Challenges Revisited

How to decompose a task into a series of questions?
task dependent, currently works for source discovery, 1-1 matching
if can't solve the whole task, ok for part of the task (e.g., wrapper)
How to entice users to answer questions?
incentive models: monopoly or better-service applications
use helper applications
use volunteers
How to evaluate users and combine their answers?
use machine learning
build a dynamic Bayesian network model
solicit user answers to questions with known answers
use these as training data to learn network parameters
More detail in [McCann et al., Tech Report 04, WebDB-03]
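
The last challenge, evaluating users and combining their answers, can be illustrated with a much simpler stand-in for the dynamic Bayesian network mentioned above: estimate each user's reliability from questions with known answers, drop unreliable users, and take a reliability-weighted vote. This is a deliberately simplified sketch, not the MOBS model. The smoothing, the 0.5 cutoff, the log-odds weights, and all users and answers are assumptions.

```python
import math


def estimate_reliability(user_answers, known_answers):
    """Laplace-smoothed fraction of known-answer questions each user got right."""
    reliability = {}
    for user, answers in user_answers.items():
        graded = [(q, a) for q, a in answers.items() if q in known_answers]
        correct = sum(a == known_answers[q] for q, a in graded)
        reliability[user] = (correct + 1) / (len(graded) + 2)
    return reliability


def combine(user_answers, reliability, question, min_reliability=0.5):
    """Weighted yes/no vote; users at or below min_reliability are ignored."""
    score = 0.0
    for user, answers in user_answers.items():
        r = reliability[user]
        if question not in answers or r <= min_reliability:
            continue
        weight = math.log(r / (1 - r))  # log-odds of being right, > 0 here
        score += weight if answers[question] else -weight
    return score > 0


# q1-q3 have known answers (used to grade users); q4 is the real question.
known = {"q1": True, "q2": False, "q3": True}
user_answers = {
    "alice":   {"q1": True,  "q2": False, "q3": True,  "q4": True},
    "bob":     {"q1": True,  "q2": False, "q3": False, "q4": True},
    "carol":   {"q1": True,  "q2": True,  "q3": False, "q4": False},
    "mallory": {"q1": False, "q2": True,  "q3": False, "q4": False},
}

reliability = estimate_reliability(user_answers, known)
decision = combine(user_answers, reliability, "q4")
print(reliability, decision)
```

A plain majority vote on q4 would be a 2-2 tie here; grading users on known-answer questions first is what lets the answers of the two reliable users decide the outcome.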