Mayssam Sayyadian - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
HeteroClass: A Framework for Effective
Classification from Heterogeneous Databases
  • Mayssam Sayyadian
  • University of Illinois at Urbana-Champaign
  • Final Project Presentation for CS512 Principles
    and Algorithms for Data Mining
  • May 2006

2
Roadmap
  • Motivation
  • From multi-relational classification → multi-DB
    classification
  • Motivating examples
  • Challenges
  • Heterogeneous data sources
  • Attribute instability
  • HeteroClass framework
  • Data integration techniques help data mining
    techniques
  • Diverse ensembles to address attribute
    instability
  • Empirical evaluation
  • Discussion, broader impact, and conclusion

3
Multi-relational Classification
  • Classification: an old but important problem!
  • Most real-world data are stored in relational
    databases.
  • To classify objects in one relation, other
    relations provide crucial information.
  • Cannot convert relational data into a single
    table without expert knowledge or losing
    essential information.
  • Multi-relational classification: automatically
    classifying objects using multiple relations

4
Motivation
  • Necessity drives applications
  • From multi-relational scenarios to multi-database
    scenarios
  • Cross-database links play an important role but
    are difficult to capture automatically

5
Examples
  • In many real-world settings, data sources are
  • Diverse
  • Autonomous and heterogeneous
  • Scattered over the Web, organizations,
    enterprises, digital libraries, smart homes,
    personal information management (PIM) systems, etc.
  • but inter-related
  • Examples
  • Which UIUC graduates will donate money to the
    school in the future? (multiple databases in an
    organization)
  • Which products are likely to be sold in the next
    30 days? (multiple databases in an enterprise)
  • Which customers will use both the traveling and
    dining services of our company?
  • Which movie scenes are captured outdoors?
    (multimedia/spatial databases)
  • Etc.

6
Challenge 1: Heterogeneity of Data Sources
  • Heterogeneity of data sources
  • Data-level heterogeneity, e.g. Phil vs. Philippe,
    Jiawei Han vs. Han, J.
  • Heterogeneity of schemas, e.g.
  • Personnel(Pno, Pname, Dept, Born) vs.
  • Empl(id, name, DeptNo, Salary, Birth),
    Dept(id, name)
  • Structure/format heterogeneity, e.g.

<Schema name="HR-Schema">
  <ElementType name="Personnel">
    <element type="EmpNo"/>
    <element type="EmpName"/>
    <ElementType name="Dept">
      <element type="id"/>
      <element type="name"/>
    </ElementType>
  </ElementType>
</Schema>

vs.

CREATE TABLE Personnel(
  EmpNo int PRIMARY KEY,
  EmpName varchar(50),
  DeptNo int REFERENCES Dept,
  Salary dec(15,2) )

CREATE TABLE Dept(
  DeptNo int PRIMARY KEY,
  DeptName varchar(70) )
7
Challenge 2: Attribute Instability
  • In distributed databases, objects/data values are
  • Horizontally distributed
  • E.g. the UIUC organization: HR, OAR, Facilities,
    Business, Academics, etc.
  • Vertically distributed
  • E.g. comparison shopping: Yahoo Shopping, Amazon,
    eBay, etc.
  • Distributed with various distributions
  • E.g. spatial/multimedia databases: DBs of
    different training sizes
  • Distributed with various characteristics
    (features)
  • Important when we need to run feature selection
    before learning the model

8
HeteroClass Framework: Observations
  • Data integration before data mining is very
    costly and difficult (if not impossible)
  • Data warehousing before data mining is difficult
    and not efficient
  • Security concerns
  • Providers do not give full access to their data
  • Efficiency concerns (large databases)
  • The data co-existence paradigm is the most natural
  • Data sources expose standalone classifiers
  • Data Support Systems Platforms (DSSPs)
  • Data mining embedded in DBMSs
  • It is possible (and useful) to build specialized
    classifiers
  • No need for full integration/warehousing; partial
    integration is enough

9
HeteroClass Framework: Architecture
  • Join Discovery Module
  • Alleviate heterogeneity
  • Output useful links between databases (and their
    relations)
  • Ensemble builder
  • Build diverse ensemble of classifiers
  • Meta-learner
  • Voting
  • Decision tree meta-learner

10
Approximate Join Discovery: Step 1
  • Two-step process
  • Find approximate join paths based on set
    resemblance
  • Field similarity by exact match/substring
    similarity
  • The resemblance of two sets A, B is
    ρ(A, B) = |A ∩ B| / |A ∪ B|
  • After some algebra, we find that
    |A ∩ B| = ρ(A, B)(|A| + |B|) / (1 + ρ(A, B))
  • There are fast algorithms to compute ρ
  • h(a) is a hash function
  • m(A) = min{h(a) : a ∈ A}
  • Observation: Pr[m(A) = m(B)] = ρ(A, B)
  • The sets are the unique values in the fields
  • The resemblance of two fields is a good measure
    of their similarity
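
The min-hash observation above, Pr[m(A) = m(B)] = ρ(A, B), can be sketched in Python. The field values and the number of hash functions below are illustrative choices, not the presentation's actual implementation:

```python
import hashlib

def minhash_signature(values, num_hashes=64):
    """Min-hash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value over the set, i.e. m(A) = min{h(a) : a in A}."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{v}".encode()).hexdigest(), 16)
            for v in values
        ))
    return sig

def estimate_resemblance(sig_a, sig_b):
    """Since Pr[m(A) = m(B)] = rho(A, B), the fraction of matching
    signature slots estimates the resemblance."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

# Hypothetical unique values of two joinable fields
field1 = {"Urbana", "Chicago", "Springfield", "Peoria"}
field2 = {"Urbana", "Chicago", "Springfield", "Rockford"}

sig1 = minhash_signature(field1)
sig2 = minhash_signature(field2)
est = estimate_resemblance(sig1, sig2)
# True resemblance is |A ∩ B| / |A ∪ B| = 3/5 = 0.6; with only 64 hash
# functions the estimate is noisy but should land in that neighborhood.
```

This avoids scanning full field contents: each database only ships a small signature per field, which is what makes join discovery across autonomous sources feasible.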

11
Approximate Join Discovery: Step 2
  • For each set of joinable attributes in the two
    schemas
  • Estimate their semantic distance using a schema
    matching tool
  • Filter join paths that are not semantically
    meaningful, i.e. whose semantic distance is high
  • E.g.
  • (DB1.last-name, DB2.street-name)
  • Might be joinable because of common substrings
    (Clinton St. vs. Bill Clinton, George West vs.
    West Main St., etc.)
  • But this join is not semantically meaningful
  • This happens frequently with categorical
    attributes (e.g. with yes/no values)
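
A minimal sketch of the two-step filter, assuming hypothetical thresholds and precomputed resemblance and semantic-distance scores for each candidate field pair:

```python
def discover_joins(candidates, min_resemblance=0.5, max_semantic_distance=0.4):
    """Two-step approximate join discovery (thresholds are illustrative):
    step 1 keeps field pairs whose value sets overlap (high resemblance);
    step 2 drops pairs the schema matcher deems semantically unrelated."""
    step1 = [c for c in candidates if c["resemblance"] >= min_resemblance]
    return [c for c in step1 if c["semantic_distance"] <= max_semantic_distance]

# Hypothetical candidates: high resemblance does not imply a meaningful join
candidates = [
    {"pair": ("DB1.author", "DB2.writer"),
     "resemblance": 0.8, "semantic_distance": 0.1},
    # Joinable via shared substrings, but semantically meaningless:
    {"pair": ("DB1.last-name", "DB2.street-name"),
     "resemblance": 0.6, "semantic_distance": 0.9},
    # Categorical yes/no fields overlap perfectly yet mean different things:
    {"pair": ("DB1.in-stock", "DB2.approved"),
     "resemblance": 1.0, "semantic_distance": 0.8},
]

joins = discover_joins(candidates)
# Only the (DB1.author, DB2.writer) pair survives both filters.
```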

12
Schema Matching
Given a schema
Build its graph model
CREATE TABLE Personnel(
  Pno int,
  Pname string,
  Dept string )

[Figure: graph model of the schema, with numbered nodes for the table
(Personnel), its columns, the column names (Pno, Pname, Dept), and the
SQL types (int, string), connected by column, name, and type edges]

Given two models
Run similarity flooding algorithm
For each pair of schema elements (a, b), output a
semantic similarity distance based on an
iterative graph matching algorithm
13
DI Approximate Join Discovery
  • Why is it useful?
  • When learning a classifier in a multi-relational
    setting
  • The number of possible predicates (the predicate
    space) grows exponentially with the number of
    links
  • Improves efficiency
  • We eliminate spurious links
  • Improves accuracy
  • Best-effort DI
  • We need only a semantic distance
  • No need for exact matches (no need for human
    confirmation)
  • No need for schema mapping

Contribution: Using schema matching and DI
techniques helps in mining database structure and
join discovery
14
Ensemble of Classifiers Approach
  • Ensemble methods
  • Use a combination of models to increase global
    accuracy
  • Ensemble should be
  • Accurate
  • Build specialized (site-specific) classifiers
  • Build classifiers on homogeneous regions
  • → Improve prediction capability
  • Diverse
  • Each learned hypothesis should be as different as
    possible
  • Each learned hypothesis should be consistent with
    training data
  • Ensure all classifiers do not make the same
    errors
  • Exploit various attribute collections
    (joins/schemas/DBs)

15
HeteroClass Intuition
  • Main observation (theoretically proved)
  • To improve the accuracy of an ensemble of
    classifiers
  • They should disagree on some inputs
  • Diversity/ambiguity: a measure of disagreement
  • Generalization error = mean error - diversity
    (variance)
  • Increase ensemble diversity
  • Maintain the average error of ensemble members
  • → Decrease generalization error
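
For an averaging (regression-style) ensemble, this decomposition is an exact identity: the ensemble's squared error equals the mean member error minus the diversity (variance of members around the ensemble). A small numeric check, with hypothetical member predictions:

```python
# Ambiguity decomposition for an averaging ensemble:
# ensemble error = mean member error - diversity.
predictions = [2.0, 3.5, 5.0]   # hypothetical member outputs for one input
target = 3.0

ensemble = sum(predictions) / len(predictions)
ensemble_error = (ensemble - target) ** 2
mean_error = sum((p - target) ** 2 for p in predictions) / len(predictions)
diversity = sum((p - ensemble) ** 2 for p in predictions) / len(predictions)

# The identity holds exactly, so raising diversity while holding the
# average member error fixed must lower the ensemble's error.
assert abs(ensemble_error - (mean_error - diversity)) < 1e-9
```

Here the ensemble predicts 3.5, giving an error of 0.25, while members average an error of 1.75; the 1.5 of diversity accounts for the difference.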

16
HeteroClass Algorithm
  • 1- learn C0 = BaseLearn(training data)
  • 2- initialize ensemble = {C0}
  • 3- set ε = ensemble error
  • 4- repeat until (error converges, no more
    training data is left, or enough classifiers
    are created):
  • 4.1- pick two (or more) site-specific
    classifiers and add a number of joins (J1 ... Jn)
    that connect them
  • 4.2- training data = training data ∪ data
    accessible through J1 ... Jn
  • 4.3- C = BaseLearn(training data)
  • 4.4- ensemble = ensemble ∪ {C}
  • 4.5- if the new ensemble's error increased,
    ensemble = ensemble \ {C}
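
The loop above can be sketched as follows; `base_learn` and `ensemble_error` are stand-in stubs, since the slides do not specify the base learner or the error estimator:

```python
def base_learn(training_data):
    """Stub for BaseLearn: a real implementation would train a
    multi-relational classifier on the (joined) training data."""
    return {"trained_on": len(training_data)}

def ensemble_error(ensemble, validation):
    """Stub error estimate; a real one would evaluate on held-out data."""
    return 1.0 / (1 + sum(c["trained_on"] for c in ensemble))

def heteroclass(training_data, site_joins, validation, max_members=5):
    c0 = base_learn(training_data)              # 1. learn base classifier
    ensemble = [c0]                              # 2. initialize ensemble
    err = ensemble_error(ensemble, validation)   # 3. current ensemble error
    for joins in site_joins:                     # 4. grow the ensemble
        if len(ensemble) >= max_members:         #    (enough classifiers)
            break
        training_data = training_data + joins    # 4.2 add data reachable via joins
        c = base_learn(training_data)            # 4.3 learn a new member
        ensemble.append(c)                       # 4.4 tentatively add it
        new_err = ensemble_error(ensemble, validation)
        if new_err > err:                        # 4.5 revert if error grew
            ensemble.pop()
        else:
            err = new_err
    return ensemble
```

The accept/revert step in 4.5 is what keeps the ensemble's validation error monotonically non-increasing as members are added.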

17
Evaluation Settings
  • Dataset 1: Inventory
  • Products, Stores, Inventory records (5 databases)
  • Classification task
  • Product availability in a store
  • Dataset 2: DBLife data sources
  • DBLP database (2 XML databases from the Clio
    project @Toronto)
  • Authors, Papers, Publications, Conferences
  • DBLife-Researchers database
  • CS researchers, their associations and meta-data
    (Department, Title, etc.)
  • DBWorld-Events
  • People, Events (service, talk, etc.)
  • Classification tasks
  • Research area for researchers

18
HeteroClass Accuracy
19
HeteroClass Sensitivity Analysis
20
Conclusion
  • Take-home messages
  • Data mining across multiple data sources is
  • Important and necessary
  • Heterogeneity is challenging
  • Best-effort DI helps data mining
  • Improve structure mining using schema matching
  • The ensemble approach helps in heterogeneous
    settings with the data co-existence paradigm
  • A committee of local specialized classifiers can
    help in building a global classifier
  • The HeteroClass framework is general
  • Applicable to other data mining tasks
  • Clustering: build global clusters from local
    clusters
  • Etc.

21
Thank You!
22
Backup: Similarity Flooding Algorithm
  • e(s,p,o) ∈ A: an edge e labeled p from s to o
    in A
  • Pairwise Connectivity Graph PCG(A, B)
  • Induced Propagation Graph add edges in opposite
    direction
  • Forward edge weights: initialized with a q-gram
    string similarity matcher
  • Edge weights: propagation coefficients (how the
    similarity propagates to neighbors)
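
A q-gram matcher of the kind used to initialize the forward edge weights can be sketched as the Jaccard overlap of character q-gram sets; the boundary padding and q = 3 are illustrative choices:

```python
def qgrams(s, q=3):
    """Set of overlapping q-grams of a string, padded with '#' so that
    boundary characters and short strings still produce grams."""
    s = f"#{s.lower()}#"
    return {s[i:i + q] for i in range(max(1, len(s) - q + 1))}

def qgram_similarity(a, b, q=3):
    """Jaccard overlap of q-gram sets: a simple initial similarity
    between node labels, before similarity flooding propagates it."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

# Hypothetical column names from the two example schemas
sim = qgram_similarity("EmpName", "Pname")   # shares the 'name' grams
identical = qgram_similarity("Dept", "Dept") # identical labels -> 1.0
```

The matcher only needs to be roughly right: similarity flooding then refines these seed values using the graph structure around each pair.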

23
Backup: Similarity Flooding Algorithm
  • Input → Graph → Mapping → Filtering
  • Mapping: similarity flooding (SFJoin)
  • Initial similarity values taken from an initial
    mapping, e.g. a name matcher or a fully
    functional schema matching tool
  • In each iteration, the similarity of two elements
    affects the similarity of their respective
    neighbors
  • Iterate until similarity values are stable
    (fixpoint computation)
  • Filtering
  • Needed to choose the best candidate match
  • Many heuristics
  • Difficult to implement fully automatically (needs
    human intervention)
  • In HeteroClass we only need a semantic distance →
    no need for filtering
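
The fixpoint computation can be sketched as below. This is a simplified variant (plain neighbor propagation with max-normalization), not the exact SFJoin update rule, and the map pairs, weights, and seed similarities are hypothetical:

```python
def similarity_flooding(pairs, neighbors, init, iters=50, tol=1e-6):
    """Simplified fixpoint iteration: each pair's similarity is increased
    by its neighbors' similarities (scaled by propagation coefficients),
    then all values are normalized; stop when values are stable.
    neighbors[p] lists (other_pair, weight) edges of the propagation graph."""
    sigma = dict(init)
    for _ in range(iters):
        new = {}
        for p in pairs:
            new[p] = sigma[p] + sum(w * sigma[q]
                                    for q, w in neighbors.get(p, []))
        norm = max(new.values()) or 1.0          # normalize to [0, 1]
        new = {p: v / norm for p, v in new.items()}
        converged = max(abs(new[p] - sigma[p]) for p in pairs) < tol
        sigma = new
        if converged:
            break
    return sigma

# Hypothetical map pairs between two tiny schemas; the first and third
# pairs reinforce each other through the propagation graph.
pairs = [("Pname", "EmpName"), ("Dept", "DeptNo"), ("Pno", "EmpNo")]
neighbors = {
    ("Pname", "EmpName"): [(("Pno", "EmpNo"), 0.5)],
    ("Pno", "EmpNo"): [(("Pname", "EmpName"), 0.5)],
}
init = {("Pname", "EmpName"): 0.5, ("Dept", "DeptNo"): 0.1,
        ("Pno", "EmpNo"): 0.4}
sigma = similarity_flooding(pairs, neighbors, init)
# Mutually supported pairs drift toward 1; the isolated pair decays.
```

Structurally supported pairs end up with high similarity even from mediocre seeds, which is exactly why HeteroClass can stop here: the stable values already serve as semantic distances, and no filtering step is needed.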