AnHai Doan - PowerPoint PPT Presentation

1
Evolving and Self-Managing Data Integration
Systems
  • AnHai Doan
  • Dept. of Computer Science
  • Univ. of Illinois at Urbana-Champaign
  • Spring 2004

2
Data Integration at UIUC
  • Two main players
  • Kevin Chang and AnHai Doan
  • 10 students, 30001 cups of coffee, 3 SIGMOD-04
    papers
  • Four supporting players
  • Chengxiang Zhai: IR, bioinformatics, text/data
    integration
  • Dan Roth: AI, question answering, text/data
    integration
  • Jiawei Han: data mining
  • Marianne Winslett: security/privacy issues in
    data sharing
  • Many supporting departments and local
    organizations
  • NCSA, Information Science, Genome Institute, Fire
    Service Institute

3
Data Integration Challenge
New faculty member: "Find houses with 3 bedrooms priced under 300K"
Sources: homes.com, realestate.com, homeseekers.com
4
Architecture of Data Integration System
Query: Find houses with 3 bedrooms priced under 300K
Mediated schema: price, num-beds, location, agent-name
Source schemas 1-3 (one shown with attributes list-price, bdrms, address)
One wrapper per source: homes.com, realestate.com, homeseekers.com
Think comparison shopping systems on steroids ...
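
A minimal Python sketch of the architecture above, assuming each wrapper exposes its source as a list of rows. It shows a query over the mediated schema being renamed into each source's vocabulary and the answers being returned in mediated terms. The slide does not say which source uses list-price, bdrms, address, so assigning them to source1 is an assumption; source2's attribute names and all data rows are invented for illustration.

```python
import operator

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}

# Mediated attribute -> source attribute.
# source1 reuses the attribute names on the slide (assignment assumed);
# source2's names are hypothetical.
MAPPINGS = {
    "source1": {"price": "list-price", "num-beds": "bdrms", "location": "address"},
    "source2": {"price": "cost", "num-beds": "beds", "location": "city"},
}

def rewrite(query, mapping):
    """Rename each (attribute, op, value) condition into the source's vocabulary."""
    return [(mapping[attr], op, value) for attr, op, value in query]

def answer(query, sources):
    """Run the rewritten query on every source; return rows in mediated terms."""
    results = []
    for name, rows in sources.items():
        mapping = MAPPINGS[name]
        conditions = rewrite(query, mapping)
        for row in rows:
            if all(OPS[op](row[attr], value) for attr, op, value in conditions):
                results.append({m: row[s] for m, s in mapping.items()})
    return results

# "Find houses with 3 bedrooms priced under 300K"
query = [("num-beds", "=", 3), ("price", "<", 300_000)]
sources = {  # hypothetical data standing in for wrapper output
    "source1": [
        {"list-price": 250_000, "bdrms": 3, "address": "Urbana, IL"},
        {"list-price": 320_000, "bdrms": 3, "address": "Seattle, WA"},
    ],
    "source2": [{"cost": 280_000, "beds": 3, "city": "Champaign"}],
}
houses = answer(query, sources)
```

The mappings are the part that the rest of the talk tries to find automatically; here they are simply written by hand.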
5
The Need for Data Integration is Ubiquitous!
  • In virtually all domains
  • data are distributed and stored in heterogeneous
    formats
  • WWW
  • hundreds/thousands of sources in bioinformatics,
    real estate, books, etc.
  • Enterprises
  • avg. organization has 49 databases [Ives-01]
  • organizations frequently merge, exchange data
  • Government: e.g., digital government initiatives
  • Military, cultural international exchange,
    Semantic Web, information agents, etc.
  • Long-standing challenge in the database community
  • recent explosion of distributed data adds urgency

6
Current State of Affairs
  • Vibrant research and industrial landscape
  • Research
  • dates back to the 70s-80s, accelerated in the 90s
  • Stanford, UPenn, ATT Labs, Maryland,
    UWashington, Wisconsin, IBM Almaden, ISI,
    Arizona State U, Ireland, CMU, etc.
  • many workshops in AI and DB communities (e.g.,
    SIGMOD/VLDB-04)
  • focused on
  • conceptual and algorithmic aspects
  • building systems in specific domains (bio,
    geo-spatial, rapid emergency response, virtual
    organization, etc.)
  • Industry
  • more than 50 startups in 2001, new startups in
    2004

Despite much R&D activity, however ...
7
Current State of Affairs (cont.)
  • Most DI systems are still built and maintained
    manually
  • Manual deployment is extremely labor-intensive
    ...
  • construct mediated and source schemas,
  • find semantic mappings between schemas,
  • constantly monitor and adjust to changes at
    hundreds or thousands of data sources, ...
  • ... and has become a key bottleneck
  • Emerging technologies
  • XML, Web services, Semantic Web, ...
  • will further fuel DI applications and exacerbate
    the problem

Slashing the astronomical cost of ownership is now crucial!
8
The AIDA Project
  • Recently started at Univ of Illinois
  • AIDA: Automatic Integration of Data
  • Goal: evolving and self-managing data
    integration systems
  • Easy to start
  • takes hours instead of weeks or months
  • perhaps with just a few sources
  • Learn to continuously improve
  • expand to cover new sources
  • add novel query capabilities, better query
    performance
  • Adjust automatically to changes
  • detect and fix broken wrappers, semantic matches,
    etc.
  • Require minimal effort from system admin
  • some effort at the start
  • far less as the system learns more and more

9
The AIDA Project (cont.)
  • In line with trends in broader computing
    landscape
  • autonomic systems (IBM initiative)
  • recovery-oriented computing (Berkeley)
  • cognitive computer systems (DARPA)
  • from cycles to RASS (Stanford)
  • self-tuning databases (MSR, IBM Almaden, Oracle)
  • Key differences
  • applied to distributed data management systems
  • must attack difficult semantics/meta-data issues
  • heavy involvement of humans
  • must handle large scale
  • Need techniques from multiple fields
  • databases, machine learning, AI, IR, data mining

10
Project Overview
  • Thrust 1: automate current labor-intensive tasks
  • schema matching
  • mediated schema construction
  • entity matching
  • Thrust 2: develop new capabilities
  • entity integration
  • Thrust 3: monitor and adjust to changes
  • Thrust 4: reduce cost of system admin
  • by leveraging the mass of users
  • Thrust 5: design sources for interoperability

11
Schema Matching
Mediated schema: price, agent-name, address
Match types shown: 1-1 match, complex match
homes.com: listed-price, contact-name, city, state
  320K, Jane Brown, Seattle, WA
  240K, Mike Smith, Miami, FL
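
A small Python sketch of what 1-1 and complex matches look like once found, using the homes.com rows from this slide. The specific pairing is an assumption (the slide transcript only lists the attribute names): price and agent-name are treated as 1-1 matches, and address is treated as a complex match that concatenates city and state.

```python
# Source rows as listed on the slide (homes.com).
rows = [
    {"listed-price": "320K", "contact-name": "Jane Brown", "city": "Seattle", "state": "WA"},
    {"listed-price": "240K", "contact-name": "Mike Smith", "city": "Miami", "state": "FL"},
]

# Each mediated attribute maps to a function of one source row.
# Assumed pairing: two 1-1 matches and one complex match.
MATCHES = {
    "price": lambda r: r["listed-price"],               # 1-1 match
    "agent-name": lambda r: r["contact-name"],          # 1-1 match
    "address": lambda r: f'{r["city"]}, {r["state"]}',  # complex match (concatenation)
}

def to_mediated(row):
    """Translate one source row into the mediated schema."""
    return {attr: fn(row) for attr, fn in MATCHES.items()}

mediated = [to_mediated(r) for r in rows]
```

A 1-1 match only renames an attribute, while a complex match needs an expression over several source attributes, which is why it is harder to discover automatically.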
12
Why Schema Matching is Difficult
  • Schema and data never fully capture semantics!
  • not adequately documented
  • Must rely on clues in schema and data
  • using names, structures, types, data values, etc.
  • Such clues can be unreliable
  • same names => different entities: area =>
    location or square-feet
  • different names => same entity: area and
    address => location
  • Intended semantics can be subjective
  • house-style = house-description?
  • military apps require committees!
  • Cannot be fully automated, needs work from system
    admin!

13
Current State of Affairs
  • Largely done by hand
  • labor intensive and error prone
  • data integration at GTE [Li & Clifton, 2000]
  • 40 databases, 27000 elements, estimated time: 12
    years
  • Need semi-automatic approaches to scale up!
  • Numerous prior and current research projects
  • Databases: SemInt (Northwestern), DELTA (MITRE),
    IBM Almaden, Microsoft Research, Wisconsin,
    Toronto, UC-Irvine, BYU, George Mason,
    U of Leipzig, ...
  • AI: Stanford, Karlsruhe University, NEC Japan,
    ISI, ...
  • Many startups

14
Our Prior and Ongoing Work (2000-date)
  • Joint work with
  • Robin Dhamankar, Yoonkyong Lee, Wensheng Wu, Rob
    McCann, Warren Shen, Alex Kramnik, Olu Sobulo,
    Vanitha Varadarajan (Illinois), Pedro Domingos,
    Alon Halevy (U Washington)
  • Learning 1-1 matches for relational and XML schemas
  • LSD (Learning Source Description) system
    [WebDB-00, SIGMOD-01, Machine Learning
    Journal-03]
  • Learning 1-1 and complex matches for ontologies
  • GLUE [WWW-02, VLDB Journal-03, Ontology
    Handbook-03]
  • Learning 1-1 matches by mass collaboration
  • MOBS [WebDB-03, IJCAI-03 Workshop]
  • Learning complex matches for relational schemas:
    iMAP [SIGMOD-04]
  • Large-scale matching via clustering: ICeQ
    [SIGMOD-04]
  • Corpus-based schema matching (submitted)
  • Further resources
  • brief survey talk at
    http://anhai.cs.uiuc.edu/home/talks/isi-matching.ppt
  • "Learning to Match Structured Representations of
    Data", book by Springer-Verlag, to appear

15
Mediated Schema Construction
  • Joint work with
  • Wensheng Wu (UIUC), Clement Yu (UIC), Weiyi Meng
    (SUNY Binghamton)
  • ICeQ project
  • given a set of source query interfaces
  • construct a mediated schema
  • Step 1: find matches among source query
    interfaces
  • use clustering [SIGMOD-04]
  • Step 2: use the found matches to construct
    mediated schema (ongoing work)
  • Future work
  • given a lot of text in the domain, construct a
    mediated schema

16
Project Overview
  • Thrust 1: automate current labor-intensive tasks
  • schema matching
  • mediated schema construction
  • entity matching
  • Thrust 2: develop new capabilities
  • entity integration
  • Thrust 3: monitor and adjust to changes
  • Thrust 4: reduce cost of system admin
  • by leveraging the mass of users
  • Thrust 5: design sources for interoperability

17
Entity Matching
(400K, Queen Ann Seattle, 206-616-1842, Mike Brown) ...

PRICE, LOCATION, PHONE, NAME
  (400K, Queen Ann Seattle, 206-616-1842, M. Brown)
  (320K, S. W. Champaign, 217-727-1999, Jane Smith) ...

PRICE, LOCATION, PHONE, NAME
  (250K, Decatur, 317-727-2459, P. Robertson)
  (400K, Seattle, 616-1842, Mike Brown) ...
18
Prior Work
  • Very active area of research
  • databases: [Hernandez & Stolfo, SIGMOD-95],
    [Cohen, SIGMOD-98],
    [Elfeky, Verykios & Elmagarmid, ICDE-02], ...
  • AI: [Cohen & Richman, KDD-02], [Bilenko & Mooney, 02],
    [Dan Roth group], [Tejada et al., 01],
    [Tejada et al., KDD-02], [Michalowski et al., 03], ...
  • Much progress
  • very effective techniques for many applications
  • covered a broad range of scenarios
  • Key commonality
  • assume entities from disparate sources have same
    set of attributes
  • e.g., (price,location,phone,name) vs.
    (price,location,phone,name)
  • match entities based on similarity of
    corresponding attributes

19
Our PROM Approach
  • Key observation 1: entities often have disjoint
    attributes
  • source V1 (age, name)
  • source V2 (name, salary)
  • source S1 (location, description, phone, name)
  • source S2 (description, phone, name,
    price, sq-feet)
  • Key observation 2: correlations among disjoint
    attributes can be exploited to maximize matching
    accuracy!
  • e.g., (9, Mike Brown) vs. (M. Brown, 200K):
    a 9-year-old is unlikely to make 200K!

20
A Profile-Based Solution
  • Consider again matching persons
  • source V1 (age, name)
  • source V2 (name, salary)
  • (9, Mike Brown) vs. (M. Brown, 200K)
  • Step 1: build a person profile
  • what does a typical person look like?
  • build from data and user input
  • Step 2: match person names
  • Mike Brown vs. M. Brown => 0.7
  • discard if confidence score is low, otherwise ...
  • Step 3: feed both tuples into profile
  • (9, Mike Brown, M. Brown, 200K) => 0.3
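
The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the PROM implementation: the name similarity uses the standard library's difflib, the person profile is a single hand-written plausibility rule, the threshold is arbitrary, and multiplying the two scores is an invented way of combining them. The 0.7 and 0.3 scores on the slide are not reproduced.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Step 2: string similarity between two person names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def person_profile(person):
    """Step 1 (toy profile): how plausible is this combination of attributes?

    Hypothetical rule: a child earning a large salary is unlikely.
    """
    if person["age"] < 16 and person["salary"] > 50_000:
        return 0.1
    return 0.9

def match(t1, t2, name_threshold=0.5):
    """t1 comes from source V1 (age, name); t2 from source V2 (name, salary)."""
    sim = name_similarity(t1["name"], t2["name"])
    if sim < name_threshold:
        return 0.0  # discard: names are too dissimilar
    # Step 3: feed the disjoint attributes of both tuples into the profile.
    merged = {"age": t1["age"], "salary": t2["salary"]}
    return sim * person_profile(merged)  # assumed combination rule

child = {"age": 9, "name": "Mike Brown"}
adult = {"age": 35, "name": "Mike Brown"}
earner = {"name": "M. Brown", "salary": 200_000}

child_score = match(child, earner)
adult_score = match(adult, earner)
```

Both pairs have the same name similarity, so only the profile over the disjoint attributes (age, salary) separates them: the 9-year-old scores lower than the 35-year-old.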

21
Advantages of Profile-Based Solution
  • Can exploit disjoint attributes to improve
    accuracy
  • Profiles capture task-independent knowledge
  • created from task data, domain experts, external
    data
  • created once, used anywhere
  • an example of knowledge construction and reuse
  • Yields an extensible, modular architecture
  • plug and play with new profiles

[Figure: architecture. Components shown: Tuple t1, Tuple t2, Similarity Estimator, Hard profilers, Hard Profile Combiner, User specified constraints, Soft profilers, Soft Profile Combiner, Matching pairs. Profiler inputs: training data, expert knowledge, domain data, previous matching tasks.]
22
Profilers
Association Rule Profiler
Manual Profiler
Completeness Profiler
  • Manually encoded rules
  • Domain Expert Specified
  • Encodes interesting association rules having
    high confidence
  • Employs Association Rule Mining Techniques

E.g., debut-year ? b-year
  • Categorical rules based on complete data
  • Learn from external data that is complete in
    some aspect

E.g., color US movies are produced only after 1917
E.g., (birth-year < 1900) implies (ODI-matches = 0)
PROFILERS encode information about domain concepts and can
be constructed in many ways
Instance Profiler
Histogram Profiler
  • Characteristics of a few frequent entities
  • All possible value combinations for a set of
    attributes

E.g., profilers for the 10 most productive directors
Classifier
E.g., (studio, movie-genre)
  • Learn from training data
  • Encodes high confidence rules relating
    disjoint attributes
  • Learn from external data that is complete in
    some aspect
  • External data

E.g., decision tree
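
A hedged Python sketch of how profilers like these might be plugged together. The two rules come from the examples on this slide; everything else is an assumption made for illustration: hard profilers are modeled as predicates that veto a candidate match, soft profilers as scores in [0, 1] that are averaged, and the attribute names and histogram counts are invented.

```python
def color_movie_rule(t):
    """Rule from the slide: color US movies are produced only after 1917."""
    if t.get("color") and t.get("country") == "US":
        return t["year"] > 1917
    return True  # rule does not apply

def odi_rule(t):
    """Rule from the slide: (birth-year < 1900) implies (ODI-matches = 0)."""
    if t.get("birth-year", 9999) < 1900:
        return t.get("odi-matches", 0) == 0
    return True  # rule does not apply

def histogram_profiler(counts):
    """Soft profiler: relative frequency of a (studio, movie-genre) combination."""
    total = sum(counts.values())
    def score(t):
        return counts.get((t.get("studio"), t.get("movie-genre")), 0) / total
    return score

def combine(merged_tuple, hard, soft):
    """Veto on any violated hard rule; otherwise average the soft scores."""
    if not all(rule(merged_tuple) for rule in hard):
        return 0.0
    if not soft:
        return 1.0
    return sum(p(merged_tuple) for p in soft) / len(soft)

# Hypothetical counts learned from external data.
genre_hist = histogram_profiler({("StudioA", "comedy"): 3, ("StudioA", "drama"): 1})
hard = [color_movie_rule, odi_rule]

ok = combine({"studio": "StudioA", "movie-genre": "comedy"}, hard, [genre_hist])
vetoed = combine({"birth-year": 1885, "odi-matches": 12}, hard, [genre_hist])
```

Because each profiler is just a function over the merged tuple, new ones can be added to the `hard` or `soft` lists without changing the combiner.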
23
Entity Matching: Empirical Evaluation
Improve accuracy significantly across six
real-world domains
More profilers result in better performance
24
Entity Integration
  • Problem: find all tuples related to a real-world
    entity.
  • given a seed paper
  • Chris C. Zhai, A. Kramnik, Hui Fang, Query
    Processing, SIGMOD, 1998
  • find all papers by Chris C. Zhai from
    DBLP-Lite

DBLP-Lite data source
  • Desired result: papers (1)-(2)

25
Baseline Solution: Pairwise Matching
  • Seed paper: Chris C. Zhai, A. Kramnik,
    Hui Fang, Query Processing, SIGMOD, 1998
  • If we match papers based only on author names
    => retrieve (1)-(6)
  • If we also consider co-authors and confs => retrieve
    (1)-(2), (4)-(6)
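
The two baselines can be sketched in Python as below. The DBLP-Lite records are not part of this transcript, so the candidate papers here are hypothetical placeholders and the results do not correspond to papers (1)-(6). The name rule (same last name and same first initial) and the context rule (a shared co-author or the same conference) are also assumptions.

```python
def name_key(author):
    """Reduce a name to (first initial, last name), e.g. 'C. Zhai' -> ('c', 'zhai')."""
    parts = author.replace(".", "").split()
    return parts[0][0].lower(), parts[-1].lower()

def pairwise_match(seed, paper, target, use_context=False):
    """Does `paper` look like it was written by `target`, an author of `seed`?"""
    candidate = next(
        (a for a in paper["authors"] if name_key(a) == name_key(target)), None
    )
    if candidate is None:
        return False
    if not use_context:
        return True  # baseline 1: author name only
    # baseline 2: also require a shared co-author or the same conference
    seed_co = {name_key(a) for a in seed["authors"]} - {name_key(target)}
    paper_co = {name_key(a) for a in paper["authors"]} - {name_key(candidate)}
    return bool(seed_co & paper_co) or paper["conf"] == seed["conf"]

seed = {"authors": ["Chris C. Zhai", "A. Kramnik", "Hui Fang"], "conf": "SIGMOD"}
papers = [  # hypothetical candidates, not the DBLP-Lite data
    {"id": "A", "authors": ["C. Zhai", "Hui Fang"], "conf": "VLDB"},
    {"id": "B", "authors": ["C. Zhai"], "conf": "SIGIR"},
    {"id": "C", "authors": ["Jane Smith"], "conf": "SIGMOD"},
]

by_name = [p["id"] for p in papers if pairwise_match(seed, p, "Chris C. Zhai")]
by_context = [
    p["id"] for p in papers if pairwise_match(seed, p, "Chris C. Zhai", use_context=True)
]
```

Each candidate is compared against the seed in isolation, which is what makes this a pairwise baseline: evidence from other retrieved papers is never used.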

26
Better Solution: Apply Profilers to Pairwise
Matching
  • Seed paper: Chris C. Zhai, A. Kramnik,
    Hui Fang, Query Processing, SIGMOD, 1998
  • If we match papers based only on author names
    => retrieve (1)-(6)
  • If we also consider co-authors and confs => retrieve
    (1)-(2), (4)-(6)

Aggregate property: very active in both DB
and IR, with 3 SIGMOD/VLDB papers and 3 SIGIR
papers in 3 years
Doesn't fit profile of a typical researcher!
27
Even Better Solution: Global Matching
Seed paper: Chris C. Zhai, A. Kramnik, Hui Fang, Query
Processing, SIGMOD, 1998
C. Zhai, Search Optimization, SIGIR, 1999 (4)
28
Empirical Evaluation
Clustering improves performance over pair-wise
matching
Adding profilers improves performance over both
clustering and pair-wise matching.
29
More Information on Entity Matching and
Integration
  • Context-based entity matching and integration
  • Tech. Report UIUC-03-2004
  • Profile-based object matching for information
    integration
  • A. Doan, Y. Lu, Y. Lee, and J. Han
  • IEEE Intelligent Systems, special issue on
    information integration, 2003
  • Object matching for data integration: a
    profile-based approach
  • A. Doan, Y. Lu, Y. Lee, and J. Han
  • Proc. of the IJCAI-03 workshop on information
    integration on the Web, 2003

30
Project Overview
  • Thrust 1: automate current labor-intensive tasks
  • schema matching
  • mediated schema construction
  • entity matching
  • Thrust 2: develop new capabilities
  • entity integration
  • Thrust 3: monitor and adjust to changes
  • Thrust 4: reduce cost of system admin
  • by leveraging the mass of users
  • Thrust 5: design sources for interoperability

31
The Problem
  • Numerous automatic tools have been developed for
  • schema matching, wrapper construction, source
    discovery, etc.
  • No matter how good these tools are, system admin
    still needs to
  • verify predictions of tools
  • correct wrong ones
  • These tasks are still extremely labor intensive
  • even worse when considering system maintenance
  • System complexity overwhelms capacity of human
    admin
  • Reducing the labor cost of system admin is
    critical!
  • perhaps the most important issue, to enable
    practical systems!

32
Solution: Shift Some Labor to Users
  • Take some task or part of some task
  • e.g., schema matching, wrapper construction,
    source discovery
  • Convert it into a series of very simple questions
  • such that knowing the answers means solving the task
  • Ask users to answer questions
  • such that each user has to do very little work
  • => Spread the task labor thinly over a mass of
    users!

33
Example: Mass Collaboration for Schema Matching

34
Mass Collaboration is not New
  • Successfully applied to
  • open source software construction
  • knowledge base construction
  • collaborative software bug detection
  • collaborative filtering
  • annotating online pictures (CMU)
  • Leverage both implicit and explicit feedback from
    users
  • But has not been applied to data integration
    settings
  • Can use both implicit and explicit feedback
  • focus here on explicit one

35
MOBS Project: Mass Collaboration to Build DI
Systems
  • Joint work with
  • Rob McCann, Alex Kramnik, Warren Shen, Vanitha
    Varadarajan, Olu Sobulo
  • If it succeeds
  • can dramatically reduce cost and time
  • launch numerous DI systems on the Web and in
    enterprises
  • Key challenges
  • how to break a task into a series of questions
  • how to entice users to answer questions
  • how to combine user answers
    (e.g., what to do with malicious users?)
  • Illustrate baseline solutions via schema matching

36
Example: Book Domain
37
Build a Partially Correct System
38
Solicit User Answers
39
Detect and Remove Bad Users
40
Combine User Answers
41
Combine User Answers
42
MOBS Challenges Revisited
  • How to decompose a task into a series of
    questions?
  • task dependent, currently works for source
    discovery, 1-1 matching
  • if we can't solve the whole task, it is ok to
    solve part of the task (e.g., wrapper)
  • How to entice users to answer questions?
  • incentive models: monopoly or better-service
    applications, helper applications, volunteers
  • How to evaluate users and combine their answers?
  • use machine learning
  • build a dynamic Bayesian network model
  • solicit user answers to questions with known
    answers
  • use these as training data to learn network
    parameters
  • More detail in [McCann et al., Tech Report 04,
    WebDB-03]
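
A simplified Python sketch of evaluating users and combining their answers. MOBS itself uses a dynamic Bayesian network, which is not reproduced here; this sketch swaps in a much simpler technique, a reliability-weighted majority vote, purely to illustrate the idea of grading users on questions with known answers. The smoothing formula, the 0.5 cutoff, and all user and question names are assumptions.

```python
from collections import defaultdict

def estimate_reliability(answers, known):
    """Grade each user on questions whose answers are already known.

    answers: {user: {question: bool}}, known: {question: bool}.
    Uses Laplace smoothing so that ungraded users start at 0.5.
    """
    reliability = {}
    for user, given in answers.items():
        graded = [q for q in given if q in known]
        correct = sum(given[q] == known[q] for q in graded)
        reliability[user] = (correct + 1) / (len(graded) + 2)
    return reliability

def combine(answers, reliability, min_reliability=0.5):
    """Weighted yes/no vote per question; users below the cutoff are ignored."""
    votes = defaultdict(float)
    for user, given in answers.items():
        weight = reliability[user]
        if weight < min_reliability:
            continue  # treat as a bad user and drop their answers
        for question, answer in given.items():
            votes[question] += weight if answer else -weight
    return {q: total > 0 for q, total in votes.items()}

known = {"k1": True, "k2": False}  # questions with known answers
answers = {
    "alice": {"k1": True, "k2": False, "is listed-price the same as price?": True},
    "bob": {"k1": True, "k2": False, "is listed-price the same as price?": True},
    "mallory": {"k1": False, "k2": True, "is listed-price the same as price?": False},
}

reliability = estimate_reliability(answers, known)
decisions = combine(answers, reliability)
```

In this toy run the user who gets both known questions wrong falls below the cutoff, so the remaining two users decide the open question.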

43
MOBS Applicability
  • Applied MOBS in many settings ...
  • scale: small-community intranet to high-traffic
    website
  • users: unpredictable novice users to cooperative
    experts
  • ... and to several DI tasks
  • Deep Web: form recognition, query interface
    matching
  • Surface Web: hub discovery, data extraction,
    mini-Citeseer

44
Simulation and Real-World Results
45
Project Overview
  • Thrust 1: automate current labor-intensive tasks
  • schema matching
  • mediated schema construction
  • entity matching
  • Thrust 2: develop new capabilities
  • entity integration
  • Thrust 3: monitor and adjust to changes
  • Thrust 4: reduce cost of system admin
  • by leveraging the mass of users
  • Thrust 5: design sources for interoperability

46
Summary
  • The need for data integration is pervasive
  • Manual data integration is a key bottleneck
  • Our solution: AIDA project on autonomic DI
    systems
  • Discussed problems
  • schema matching [SIGMOD-04]
  • mediated schema construction [SIGMOD-04]
  • entity matching and integration [Tech report 04]
  • mass collaboration [Tech report 04]
  • Machine learning is the underlying technique
  • Many implications beyond data integration context
  • More information: search for "anhai" on Google