Title: AnHai Doan
1Evolving and Self-Managing Data Integration Systems
- AnHai Doan
- Dept. of Computer Science
- Univ. of Illinois at Urbana-Champaign
- Spring 2004
2Data Integration at UIUC
- Two main players
  - Kevin Chang and AnHai Doan
  - 10 students, 30001 cups of coffee, 3 SIGMOD-04 papers
- Four supporting players
  - Chengxiang Zhai: IR, bioinformatics, text/data integration
  - Dan Roth: AI, question answering, text/data integration
  - Jiawei Han: data mining
  - Marianne Winslett: security/privacy issues in data sharing
- Many supporting departments and local organizations
  - NCSA, Information Science, Genome Institute, Fire Service Institute
3Data Integration Challenge
- New faculty member: find houses with 3 bedrooms priced under 300K
- Sources to search: homes.com, realestate.com, homeseekers.com
4Architecture of Data Integration System
- User query: find houses with 3 bedrooms priced under 300K
- Mediated schema: price, num-beds, location, agent-name
- Source schemas 1-3, one per source (one shown: list-price, bdrms, address)
- One wrapper per source: homes.com, realestate.com, homeseekers.com
Think comparison shopping systems on steroids ...
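A minimal sketch of the architecture above, assuming the mediated attributes price, num-beds, and location correspond to list-price, bdrms, and address in one source. The source names, the second source's attribute names, and all rows are made up for illustration; a real wrapper would fetch and parse pages from the site instead of reading an in-memory list.

```python
# Mediated attribute -> source attribute. The first mapping is an assumed
# reading of the slide; the second source schema is hypothetical.
SOURCE_MAPPINGS = {
    "source1": {"price": "list-price", "num-beds": "bdrms", "location": "address"},
    "source2": {"price": "cost", "num-beds": "beds", "location": "city"},
}

# Toy rows standing in for what each wrapper would extract (made-up data).
SOURCE_ROWS = {
    "source1": [
        {"list-price": 250_000, "bdrms": 3, "address": "Urbana"},
        {"list-price": 320_000, "bdrms": 3, "address": "Champaign"},
    ],
    "source2": [
        {"cost": 280_000, "beds": 3, "city": "Savoy"},
        {"cost": 200_000, "beds": 2, "city": "Urbana"},
    ],
}


def wrapper(source):
    """Yield the rows of one source, renamed into the mediated schema."""
    mapping = SOURCE_MAPPINGS[source]
    for row in SOURCE_ROWS[source]:
        yield {mediated: row[local] for mediated, local in mapping.items()}


def query(predicate):
    """Evaluate a predicate over the mediated schema across all sources."""
    return [
        dict(house, source=source)
        for source in SOURCE_MAPPINGS
        for house in wrapper(source)
        if predicate(house)
    ]


# "Find houses with 3 bedrooms priced under 300K"
results = query(lambda h: h["num-beds"] == 3 and h["price"] < 300_000)
for house in results:
    print(house)
```

The user writes the predicate once against the mediated schema; the per-source mappings are exactly the part that schema matching tries to produce automatically.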
5The Need for Data Integration is Ubiquitous!
- In virtually all domains
  - data are distributed and stored in heterogeneous formats
- WWW
  - hundreds/thousands of sources in bioinformatics, real estate, books, etc.
- Enterprises
  - avg. organization has 49 databases [Ives-01]
  - organizations frequently merge, exchange data
- Government: e.g., digital government initiatives
- Military, cultural and international exchange, Semantic Web, information agents, etc.
- Long-standing challenge in the database community
  - recent explosion of distributed data adds urgency
6Current State of Affairs
- Vibrant research and industrial landscape
- Research
  - dates back to the 70s-80s, accelerated in the 90s
  - Stanford, UPenn, AT&T Labs, Maryland, UWashington, Wisconsin, IBM Almaden, ISI, Arizona State U, Ireland, CMU, etc.
  - many workshops in AI and DB communities, e.g., SIGMOD/VLDB-04
  - focused on
    - conceptual and algorithmic aspects
    - building systems in specific domains (bio, geo-spatial, rapid emergency response, virtual organization, etc.)
- Industry
  - more than 50 startups in 2001, new startups in 2004
Despite much R&D activity, however ...
7Current State of Affairs (cont.)
- Most DI systems are still built and maintained manually
- Manual deployment is extremely labor-intensive ...
  - construct mediated and source schemas,
  - find semantic mappings between schemas,
  - constantly monitor and adjust to changes at hundreds or thousands of data sources, ...
- ... and has become a key bottleneck
- Emerging technologies
  - XML, Web services, Semantic Web, ...
  - will further fuel DI applications and exacerbate the problem
Slashing the astronomical cost of ownership is now crucial!
8The AIDA Project
- Recently started at Univ. of Illinois
- AIDA: Automatic Integration of Data
- Goal: evolving and self-managing data integration systems
- Easy to start
  - takes hours instead of weeks or months
  - perhaps with just a few sources
- Learn to continuously improve
  - expand to cover new sources
  - add novel query capabilities, better query performance
- Adjust automatically to changes
  - detect and fix broken wrappers, semantic matches, etc.
- Require minimal effort from system admin
  - some effort at the start
  - far less as the system learns more and more
9The AIDA Project (cont.)
- In line with trends in the broader computing landscape
  - autonomic systems (IBM initiative)
  - recovery-oriented computing (Berkeley)
  - cognitive computer systems (DARPA)
  - from cycles to RASS (Stanford)
  - self-tuning databases (MSR, IBM Almaden, Oracle)
- Key differences
  - applied to distributed data management systems
  - must attack difficult semantics/meta-data issues
  - heavy involvement of humans
  - must handle large scale
- Need techniques from multiple fields
  - databases, machine learning, AI, IR, data mining
10Project Overview
- Thrust 1: automate current labor-intensive tasks
  - schema matching
  - mediated schema construction
  - entity matching
- Thrust 2: develop new capabilities
  - entity integration
- Thrust 3: monitor and adjust to changes
- Thrust 4: reduce cost of system admin
  - by leveraging the mass of users
- Thrust 5: design sources for interoperability
11Schema Matching
- Mediated schema: price, agent-name, address
- Source schema (homes.com): listed-price, contact-name, city, state
  - (320K, Jane Brown, Seattle, WA)
  - (240K, Mike Smith, Miami, FL)
- Match types shown: 1-1 match, complex match
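To make the two match types concrete, here is a small sketch that applies a set of matches to the homes.com rows above. The specific correspondences (price to listed-price, agent-name to contact-name, address built from city and state) are an assumed reading of the slide, and the dictionary layout is purely illustrative.

```python
# 1-1 matches: one mediated attribute corresponds to one source attribute.
# These pairings are assumptions made for illustration.
ONE_TO_ONE = {"price": "listed-price", "agent-name": "contact-name"}

# Complex matches: a mediated attribute is computed from several source
# attributes. Here address is assumed to be the concatenation of city and state.
COMPLEX = {"address": lambda row: f"{row['city']}, {row['state']}"}


def to_mediated(row):
    """Translate one source tuple into the mediated schema."""
    out = {mediated: row[source] for mediated, source in ONE_TO_ONE.items()}
    for mediated, combine in COMPLEX.items():
        out[mediated] = combine(row)
    return out


homes_com = [
    {"listed-price": "320K", "contact-name": "Jane Brown", "city": "Seattle", "state": "WA"},
    {"listed-price": "240K", "contact-name": "Mike Smith", "city": "Miami", "state": "FL"},
]

mediated = [to_mediated(row) for row in homes_com]
for row in mediated:
    print(row)
```

Schema matching is the task of discovering the contents of `ONE_TO_ONE` and `COMPLEX`; once they are known, translation is mechanical.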
12Why Schema Matching is Difficult
- Schema and data never fully capture semantics!
  - not adequately documented
- Must rely on clues in schema and data
  - using names, structures, types, data values, etc.
- Such clues can be unreliable
  - same names => different entities: area => location or square-feet
  - different names => same entity: area and address => location
- Intended semantics can be subjective
  - house-style = house-description?
  - military apps require committees!
- Cannot be fully automated, needs work from system admin!
13Current State of Affairs
- Largely done by hand
  - labor intensive and error prone
  - data integration at GTE [Li & Clifton, 2000]
    - 40 databases, 27000 elements, estimated time: 12 years
- Need semi-automatic approaches to scale up!
- Numerous prior and current research projects
  - Databases: SemInt (Northwestern), DELTA (MITRE), IBM Almaden, Microsoft Research, Wisconsin, Toronto, UC-Irvine, BYU, George Mason, U of Leipzig, ...
  - AI: Stanford, Karlsruhe University, NEC Japan, ISI, ...
- Many startups
14Our Prior and Ongoing Work (2000-date)
- Joint work with
  - Robin Dhamanka, Yoonkyong Lee, Wensheng Wu, Rob McCann, Warren Shen, Alex Kramnik, Olu Sobulo, Vanitha Varadarajan (Illinois), Pedro Domingos, Alon Halevy (U Washington)
- Learning 1-1 matches for relational and XML schemas
  - LSD (Learning Source Description) system [WebDB-00, SIGMOD-01, Machine Learning Journal-03]
- Learning 1-1 and complex matches for ontologies
  - GLUE [WWW-02, VLDB Journal-03, Ontology Handbook-03]
- Learning 1-1 matches by mass collaboration
  - MOBS [WebDB-03, IJCAI-03 Workshop]
- Learning complex matches for relational schemas: iMAP [SIGMOD-04]
- Large-scale matching via clustering: IceQ [SIGMOD-04]
- Corpus-based schema matching [submitted]
- Further resources
  - brief survey talk at http://anhai.cs.uiuc.edu/home/talks/isi-matching.ppt
  - "Learning to Match Structured Representations of Data", book by Springer-Verlag, to appear
15Mediated Schema Construction
- Joint work with
  - Wensheng Wu (UIUC), Clement Yu (UIC), Weiyi Meng (SUNY Binghamton)
- ICeQ project
  - given a set of source query interfaces
  - construct a mediated schema
- Step 1: find matches among source query interfaces
  - use clustering [SIGMOD-04]
- Step 2: use the found matches to construct mediated schema (ongoing work)
- Future work
  - given a lot of text in the domain, construct a mediated schema
17Entity Matching
(400K, Queen Ann Seattle, 206-616-1842, Mike Brown) ...
PRICE LOCATION PHONE NAME
(400K, Queen Ann Seattle, 206-616-1842, M. Brown)
(320K, S. W. Champaign, 217-727-1999, Jane Smith) ...
PRICE LOCATION PHONE NAME
(250K, Decatur, 317-727-2459, P. Robertson)
(400K, Seattle, 616-1842, Mike Brown) ...
18Prior Work
- Very active area of research
  - databases: [Hernandez & Stolfo, SIGMOD-95], [Cohen, SIGMOD-98], [Elfeky, Verykios & Elmagarmid, ICDE-02], ...
  - AI: [Cohen & Richman, KDD-02], [Bilenko & Mooney, 02], Dan Roth group, [Tejada et al., 01], [Tejada et al., KDD-02], [Michalowski et al., 03], ...
- Much progress
  - very effective techniques for many applications
  - covered a broad range of scenarios
- Key commonality
  - assume entities from disparate sources have the same set of attributes
    - e.g., (price,location,phone,name) vs. (price,location,phone,name)
  - match entities based on similarity of corresponding attributes
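The shared assumption in this prior work can be sketched as follows: both tuples carry the same attributes, and the match score is an aggregate of per-attribute similarities. The string measure (Python's `difflib.SequenceMatcher`) and the unweighted average are illustrative choices, not the measures used in the cited papers; the tuples are the ones from the Entity Matching slide.

```python
from difflib import SequenceMatcher

ATTRIBUTES = ("price", "location", "phone", "name")


def attribute_similarity(a: str, b: str) -> float:
    """Generic string similarity in [0, 1]; a stand-in for a tuned measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tuple_similarity(t1: dict, t2: dict) -> float:
    """Average similarity over corresponding attributes (assumes same schema)."""
    scores = [attribute_similarity(t1[a], t2[a]) for a in ATTRIBUTES]
    return sum(scores) / len(scores)


def make(price, location, phone, name):
    return dict(zip(ATTRIBUTES, (price, location, phone, name)))


m_brown = make("400K", "Queen Ann Seattle", "206-616-1842", "M. Brown")
mike_brown = make("400K", "Seattle", "616-1842", "Mike Brown")
robertson = make("250K", "Decatur", "317-727-2459", "P. Robertson")

same_person = tuple_similarity(m_brown, mike_brown)
different_person = tuple_similarity(m_brown, robertson)
print(f"M. Brown vs Mike Brown:   {same_person:.2f}")
print(f"M. Brown vs P. Robertson: {different_person:.2f}")
```

This only works because every attribute in one tuple has a counterpart in the other, which is the limitation the next slides address.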
19Our PROM Approach
- Key observation 1: Entities often have disjoint attributes
  - source V1: (age, name)
  - source V2: (name, salary)
  - source S1: (location, description, phone, name)
  - source S2: (description, phone, name, price, sq-feet)
- Key observation 2: Correlations among disjoint attributes can be exploited to maximize matching accuracy!
  - e.g., (9, Mike Brown) vs. (M. Brown, 200K): a 9-year-old is unlikely to make 200K!
20A Profile-Based Solution
- Consider again matching persons
  - source V1: (age, name)
  - source V2: (name, salary)
  - (9, Mike Brown) vs. (M. Brown, 200K)
- Step 1: build a person profile
  - what does a typical person look like?
  - build from data and user input
- Step 2: match person names
  - Mike Brown vs. M. Brown => 0.7
  - discard if confidence score is low, otherwise ...
- Step 3: feed both tuples into profile
  - (9, Mike Brown, M. Brown, 200K) => 0.3
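The three steps above can be sketched roughly as below. Everything specific here is an assumption made for illustration: the hand-written profile rule and its scores, the 0.5 name threshold, the use of `difflib` for name similarity, and combining the two scores by multiplication. The numbers it prints will not reproduce the 0.7 and 0.3 on the slide.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Step 2: similarity of the shared attribute (name), in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def person_profile(age: int, salary_k: float) -> float:
    """Step 1: how plausible is a person with this age and salary?

    A hypothetical hand-written rule; a real profile would be built
    from data and user input.
    """
    if age < 16 and salary_k > 50:
        return 0.1  # a child with a high salary is implausible
    return 0.9


def match_score(t1, t2, name_threshold: float = 0.5) -> float:
    """Score a V1 tuple (age, name) against a V2 tuple (name, salary in K)."""
    age, name1 = t1
    name2, salary_k = t2
    name_score = name_similarity(name1, name2)
    if name_score < name_threshold:
        return 0.0  # discard: the names do not match well enough
    # Step 3: feed the merged tuple into the profile (combination rule assumed).
    return name_score * person_profile(age, salary_k)


child = match_score((9, "Mike Brown"), ("M. Brown", 200))
adult = match_score((35, "Mike Brown"), ("M. Brown", 200))
print(f"age 9:  {child:.2f}")
print(f"age 35: {adult:.2f}")
```

The names alone score identically in both calls; only the profile, which looks at the disjoint attributes age and salary, separates the two cases.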
21Advantages of Profile-Based Solution
- Can exploit disjoint attributes to improve accuracy
- Profiles capture task-independent knowledge
  - created from task data, domain experts, external data
  - created once, used anywhere
  - an example of knowledge construction and reuse
- Yields an extensible, modular architecture
  - plug and play with new profiles
[Architecture figure. Components: Tuple t1, Tuple t2; Similarity Estimator; Hard profilers; Hard Profile Combiner; User-specified constraints; Soft profilers; Soft Profile Combiner. Profiler inputs: training data, expert knowledge, domain data, previous matching tasks. Output: matching pairs.]
22Profilers
PROFILERS encode information about domain concepts and can be constructed in many ways
- Manual Profiler
  - manually encoded rules
  - specified by domain experts
- Association Rule Profiler
  - encodes interesting association rules having high confidence
  - employs association rule mining techniques
- Completeness Profiler
  - categorical rules based on complete data
  - learned from external data that is complete in some aspect
  - e.g., color US movies are produced only after 1917
- Instance Profiler
  - characteristics of a few frequent entities
  - e.g., profilers for the 10 most productive directors
- Histogram Profiler
  - all possible value combinations for a set of attributes
  - e.g., (studio, movie-genre)
- Classifier
  - learned from training data
  - encodes high-confidence rules relating disjoint attributes
  - e.g., decision tree
- Further rule examples: debut-year ? b-year; (birth-year < 1900) implies (ODI-matches = 0)
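As a rough illustration of how two of these profilers might look in code, the sketch below pairs a completeness-style rule (the color-movie example) with a histogram over (studio, movie-genre). The slides do not define the profiler interfaces, so the veto-then-score combination, the add-one smoothing, the field names, and the toy studio data are all assumptions.

```python
from collections import Counter


def color_movie_rule(movie: dict) -> bool:
    """Completeness-style rule: color US movies are produced only after 1917."""
    if movie.get("color") and movie.get("country") == "US":
        return movie["year"] > 1917
    return True  # the rule does not apply, so it does not object


class HistogramProfiler:
    """Scores a (studio, genre) combination by its smoothed frequency."""

    def __init__(self, pairs):
        self.counts = Counter(pairs)
        self.total = sum(self.counts.values())

    def score(self, studio: str, genre: str) -> float:
        # Add-one smoothing so unseen combinations get a small nonzero score.
        return (self.counts[(studio, genre)] + 1) / (self.total + len(self.counts) + 1)


def plausibility(movie: dict, histogram: HistogramProfiler) -> float:
    """Assumed combination: the rule can veto, the histogram gives a score."""
    if not color_movie_rule(movie):
        return 0.0
    return histogram.score(movie["studio"], movie["genre"])


# Toy training pairs (made up): what each studio has produced before.
histogram = HistogramProfiler(
    [("StudioA", "animation")] * 3 + [("StudioB", "horror")] * 2
)

base = {"color": True, "country": "US", "year": 1995, "studio": "StudioA"}
usual = plausibility({**base, "genre": "animation"}, histogram)
unusual = plausibility({**base, "genre": "horror"}, histogram)
impossible = plausibility({**base, "genre": "animation", "year": 1910}, histogram)
print(usual, unusual, impossible)
```

A merged tuple that pairs a studio with a genre it rarely produces scores low, and one that violates the rule is rejected outright, regardless of how similar the shared attributes looked.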
23Entity Matching: Empirical Evaluation
Improves accuracy significantly across six real-world domains
More profilers result in better performance
24Entity Integration
- Problem: find all tuples related to a real-world entity.
  - given a seed paper
    - Chris C. Zhai, A. Kramnik, Hui Fang, Query Processing, SIGMOD, 1998
  - find all papers by Chris C. Zhai from DBLP-Lite
DBLP-Lite data source
- Desired result: papers (1)-(2)
25Baseline Solutions: Pairwise Matching
- Seed paper: Chris C. Zhai, A. Kramnik, Hui Fang, Query Processing, SIGMOD, 1998
- If match papers based only on author names => retrieve (1)-(6)
- If consider also co-authors and confs => retrieve (1)-(2), (4)-(6)
26Better Solution: Apply Profilers to Pairwise Matching
- Seed paper: Chris C. Zhai, A. Kramnik, Hui Fang, Query Processing, SIGMOD, 1998
- If match papers based only on author names => retrieve (1)-(6)
- If consider also co-authors and confs => retrieve (1)-(2), (4)-(6)
Aggregate property: very active in both DB and IR, with 3 SIGMOD/VLDB papers and 3 SIGIR papers in 3 years
Doesn't fit the profile of a typical researcher!
27Even Better Solution: Global Matching
seed paper: Chris C. Zhai, A. Kramnik, Hui Fang, Query Processing, SIGMOD, 1998
C. Zhai, Search Optimization, SIGIR, 1999 (4)
28Empirical Evaluation
Clustering improves performance over pairwise matching
Adding profilers improves performance over both clustering and pairwise matching.
29More Information on Entity Matching and Integration
- Context-based entity matching and integration
  - Tech. Report UIUC-03-2004
- Profile-based object matching for information integration
  - A. Doan, Y. Lu, Y. Lee, and J. Han
  - IEEE Intelligent Systems, special issue on information integration, 2003
- Object matching for data integration: a profile-based approach
  - A. Doan, Y. Lu, Y. Lee, and J. Han
  - Proc. of the IJCAI-03 workshop on information integration on the Web, 2003
31The Problem
- Numerous automatic tools have been developed for
  - schema matching, wrapper construction, source discovery, etc.
- No matter how good these tools are, system admin still needs to
  - verify predictions of tools
  - correct wrong ones
- These tasks are still extremely labor intensive
  - even worse when considering system maintenance
- System complexity overwhelms capacity of human admin
- Reducing the labor cost of system admin is critical!
  - perhaps the most important issue, to enable practical systems!
32Solution: Shift Some Labor to Users
- Take some task or part of some task
  - e.g., schema matching, wrapper construction, source discovery
- Convert it into a series of very simple questions
  - such that knowing the answers amounts to solving the task
- Ask users to answer questions
  - such that each user has to do very little work
=> Spread the task labor thinly over a mass of users!
33Example: Mass Collaboration for Schema Matching
34Mass Collaboration is not New
- Successfully applied to
  - open source software construction
  - knowledge base construction
  - collaborative software bug detection
  - collaborative filtering
  - annotating online pictures [CMU]
- Leverage both implicit and explicit feedback from users
- But has not been applied to data integration settings
- Can use both implicit and explicit feedback
  - focus here on explicit one
35MOBS Project: Mass Collaboration to Build DI Systems
- Joint work with
  - Rob McCann, Alex Kramnik, Warren Shen, Vanitha Varadarajan, Olu Sobulo
- If succeeds
  - can dramatically reduce cost and time
  - launch numerous DI systems on Web and enterprises
- Key challenges
  - how to break a task into a series of questions
  - how to entice users to answer questions
  - how to combine user answers (e.g., what to do with malicious users?)
- Illustrate baseline solutions via schema matching
36Example: Book Domain
37Build a Partially Correct System
38Solicit User Answers
39Detect and Remove Bad Users
40Combine User Answers
41Combine User Answers
42MOBS Challenges Revisited
- How to decompose a task into a series of questions?
  - task dependent, currently works for source discovery, 1-1 matching
  - if can't solve the whole task, ok for part of the task (e.g., wrapper)
- How to entice users to answer questions?
  - incentive models: monopoly or better-service applications, helper applications, volunteers
- How to evaluate users and combine their answers?
  - use machine learning
  - build a dynamic Bayesian network model
  - solicit user answers to questions with known answers
  - use these as training data to learn network parameters
- More detail in [McCann et al., Tech Report 04, WebDB-03]
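The idea of evaluating users on questions with known answers and then combining their answers can be sketched as below. Note that this deliberately swaps the dynamic Bayesian network mentioned above for a much simpler stand-in: a smoothed accuracy estimate per user and a log-odds weighted vote that ignores users no better than chance. User names, question ids, and answers are made up.

```python
import math
from collections import defaultdict


def estimate_reliability(answers, gold, smoothing=1.0):
    """Estimate each user's accuracy from questions with known answers.

    answers: list of (user, question, bool); gold: {question: bool}.
    """
    correct = defaultdict(float)
    seen = defaultdict(float)
    for user, question, answer in answers:
        if question in gold:
            seen[user] += 1
            correct[user] += answer == gold[question]
    # Smoothing keeps estimates strictly between 0 and 1.
    return {u: (correct[u] + smoothing) / (seen[u] + 2 * smoothing) for u in seen}


def combine(answers, reliability, question, min_reliability=0.5):
    """Weighted vote on one question; unreliable or unknown users are dropped."""
    score = 0.0
    for user, q, answer in answers:
        if q != question:
            continue
        p = reliability.get(user, 0.5)
        if p <= min_reliability:
            continue  # detect and remove bad users
        weight = math.log(p / (1 - p))
        score += weight if answer else -weight
    return score > 0


# Known-answer questions used to evaluate users (toy data).
gold = {"g1": True, "g2": False, "g3": True}
answers = [
    ("alice", "g1", True), ("alice", "g2", False), ("alice", "g3", True),
    ("mallory", "g1", False), ("mallory", "g2", True), ("mallory", "g3", False),
    ("trudy", "g1", False), ("trudy", "g2", True), ("trudy", "g3", False),
    # The real question, e.g. "does attribute X match attribute Y?"
    ("alice", "q1", True), ("mallory", "q1", False), ("trudy", "q1", False),
]

reliability = estimate_reliability(answers, gold)
decision = combine(answers, reliability, "q1")
print(reliability, decision)
```

A plain majority vote on `q1` would side with the two unreliable users; weighting by measured reliability lets the single trustworthy answer win.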
43MOBS Applicability
- Applied MOBS in many settings ...
  - scale: small-community intranet to high-traffic website
  - users: unpredictable novice users to cooperative experts
- ... and to several DI tasks
  - Deep Web: form recognition, query interface matching
  - Surface Web: hub discovery, data extraction, mini-Citeseer
44Simulation and Real-World Results
46Summary
- The need for data integration is pervasive
- Manual data integration is a key bottleneck
- Our solution: AIDA project on autonomic DI systems
- Discussed problems
  - schema matching [SIGMOD-04]
  - mediated schema construction [SIGMOD-04]
  - entity matching and integration [Tech report 04]
  - mass collaboration [Tech report 04]
- Machine learning is the underlying technique
- Many implications beyond data integration context
- More information: search for "anhai" on Google