Mayssam Sayyadian - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
HeteroClass: A Framework for Effective
Classification from Heterogeneous Databases
  • Mayssam Sayyadian
  • University of Illinois at Urbana-Champaign
  • Final Project Presentation for CS512 Principles
    and Algorithms for Data Mining
  • May 2006

2
Roadmap
  • Motivation
  • From multi-relational classification → multi-DB
    classification
  • Motivating examples
  • Challenges
  • Heterogeneous data sources
  • Attribute instability
  • HeteroClass framework
  • Data integration techniques help data mining
    techniques
  • Diverse ensembles to address attribute
    instability
  • Empirical evaluation
  • Discussion, broader impact, and conclusion

3
Multi-relational Classification
  • Classification: an old but important problem!
  • Most real-world data are stored in relational
    databases.
  • To classify objects in one relation, other
    relations provide crucial information.
  • Cannot convert relational data into a single
    table without expert knowledge or losing
    essential information.
  • Multi-relational classification: automatically
    classifying objects using multiple relations

4
Motivation
  • Necessity drives applications
  • From multi-relational scenarios to multi-database
    scenarios
  • Cross-database links play an important role but
    are difficult to capture automatically

5
Examples
  • In many real-world settings, data sources are
  • Diverse
  • Autonomous and heterogeneous
  • Scattered over the Web, organizations,
    enterprises, digital libraries, smart homes,
    personal information management (PIM) systems, etc.
  • but inter-related
  • Examples
  • Which UIUC graduates will donate money to the
    school in the future? (multiple databases in an
    organization)
  • Which products are likely to be sold in the next
    30 days? (multiple databases in an enterprise)
  • Which customers will use both the traveling and
    dining services of our company?
  • Which movie scenes are captured outdoors?
    (multimedia/spatial databases)
  • Etc.

6
Challenge 1: Heterogeneity of Data Sources
  • Heterogeneity of data sources
  • Data-level heterogeneity, e.g. Phil vs. Philippe,
    Jiawei Han vs. Han, J.
  • Heterogeneity of schemas, e.g.
  • Personnel(Pno, Pname, Dept, Born) vs.
  • Empl(id, name, DeptNo, Salary, Birth),
    Dept(id, name)
  • Structure/format heterogeneity, e.g.

<Schema name="HR-Schema">
  <ElementType name="Personnel">
    <element type="EmpNo"/>
    <element type="EmpName"/>
    <ElementType name="Dept">
      <element type="id"/>
      <element type="name"/>
    </ElementType>
  </ElementType>
</Schema>

vs.

CREATE TABLE Personnel(
  EmpNo int PRIMARY KEY,
  EmpName varchar(50),
  DeptNo int REFERENCES Dept,
  Salary dec(15,2) )

CREATE TABLE Dept(
  DeptNo int PRIMARY KEY,
  DeptName varchar(70) )
7
Challenge 2: Attribute Instability
  • In distributed databases, objects/data values are
  • Horizontally distributed
  • E.g. the UIUC organization: HR, OAR, Facilities,
    Business, Academics, etc.
  • Vertically distributed
  • E.g. comparison shopping: Yahoo Shopping, Amazon,
    eBay, etc.
  • Distributed with various distributions
  • E.g. spatial/multimedia databases: DBs of
    different training sizes
  • Distributed with various characteristics
    (features)
  • Important when we need to run feature selection
    before learning the model

8
HeteroClass Framework: Observations
  • Data integration before data mining is very
    costly and difficult (if not impossible)
  • Data warehousing before data mining is difficult
    and not efficient
  • Security concerns
  • Providers do not give full access to their data
  • Efficiency concerns (large databases)
  • The data co-existence paradigm is the most natural
  • Data sources expose standalone classifiers
  • Data Support Systems Platforms (DSSPs)
  • Data mining embedded in DBMSs
  • It is possible (and useful) to build specialized
    classifiers
  • No need for full integration/warehousing; partial
    integration is enough

9
HeteroClass Framework: Architecture
  • Join Discovery Module
  • Alleviate heterogeneity
  • Output useful links between databases (and their
    relations)
  • Ensemble builder
  • Build diverse ensemble of classifiers
  • Meta-learner
  • Voting
  • Decision tree meta-learner

10
Approximate Join Discovery: Step 1
  • Two-step process
  • Find approximate join paths based on set
    resemblance
  • Field similarity by exact match/substring
    similarity
  • The resemblance of two sets A, B is
    ρ(A, B) = |A ∩ B| / |A ∪ B|
  • After some algebra, we find that
    |A ∩ B| = ρ(A, B)(|A| + |B|) / (1 + ρ(A, B))
  • There are fast algorithms to compute ρ
  • h(a) is a hash function
  • m(A) = min{h(a) : a ∈ A}
  • Observation: Pr[m(A) = m(B)] = ρ(A, B)
  • The sets are the unique values in the fields
  • The resemblance of two fields is a good measure
    of their similarity
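
The min-hash observation above, Pr[m(A) = m(B)] = ρ(A, B), can be sketched in Python. The field values and the number of hash functions below are illustrative choices, not the presentation's actual implementation:

```python
import hashlib

def minhash_signature(values, num_hashes=64):
    """Min-hash signature: for each of num_hashes seeded hash functions,
    keep the minimum hash value over the set, i.e. m(A) = min{h(a) : a in A}."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{v}".encode()).hexdigest(), 16)
            for v in values
        ))
    return sig

def estimate_resemblance(sig_a, sig_b):
    """Since Pr[m(A) = m(B)] = rho(A, B), the fraction of matching
    signature slots estimates the resemblance."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)

# Hypothetical unique values of two joinable fields
field1 = {"Urbana", "Chicago", "Springfield", "Peoria"}
field2 = {"Urbana", "Chicago", "Springfield", "Rockford"}

sig1 = minhash_signature(field1)
sig2 = minhash_signature(field2)
est = estimate_resemblance(sig1, sig2)
# True resemblance is |A ∩ B| / |A ∪ B| = 3/5 = 0.6; with only 64 hash
# functions the estimate is noisy but should land in that neighborhood.
```

This avoids scanning full field contents: each database only ships a small signature per field, which is what makes join discovery across autonomous sources feasible.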

11
Approximate Join Discovery: Step 2
  • For each set of joinable attributes in the two
    schemas
  • Estimate their semantic distance using a schema
    matching tool
  • Filter join paths that are not semantically
    meaningful, i.e. whose semantic distance is high
  • E.g.
  • (DB1.last-name, DB2.street-name)
  • Might be joinable because of common substrings
    (Clinton St. vs. Bill Clinton, George West vs.
    West Main St., etc.)
  • But this join is not semantically meaningful
  • This happens frequently with categorical
    attributes (e.g. with yes/no values)
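
A minimal sketch of the two-step filter, assuming hypothetical thresholds and precomputed resemblance and semantic-distance scores for each candidate field pair:

```python
def discover_joins(candidates, min_resemblance=0.5, max_semantic_distance=0.4):
    """Two-step approximate join discovery (thresholds are illustrative):
    step 1 keeps field pairs whose value sets overlap (high resemblance);
    step 2 drops pairs the schema matcher deems semantically unrelated."""
    step1 = [c for c in candidates if c["resemblance"] >= min_resemblance]
    return [c for c in step1 if c["semantic_distance"] <= max_semantic_distance]

# Hypothetical candidates: high resemblance does not imply a meaningful join
candidates = [
    {"pair": ("DB1.author", "DB2.writer"),
     "resemblance": 0.8, "semantic_distance": 0.1},
    # Joinable via shared substrings, but semantically meaningless:
    {"pair": ("DB1.last-name", "DB2.street-name"),
     "resemblance": 0.6, "semantic_distance": 0.9},
    # Categorical yes/no fields overlap perfectly yet mean different things:
    {"pair": ("DB1.in-stock", "DB2.approved"),
     "resemblance": 1.0, "semantic_distance": 0.8},
]

joins = discover_joins(candidates)
# Only the (DB1.author, DB2.writer) pair survives both filters.
```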

12
Schema Matching
Given a schema
Build its graph model
CREATE TABLE Personnel(
  Pno int,
  Pname string,
  Dept string )

[Figure: graph model of the schema, with numbered nodes for the table
(Personnel), its columns, the column names (Pno, Pname, Dept), and the
SQL types (int, string), connected by column, name, and type edges]

Given two models
Run similarity flooding algorithm
For each pair of schema elements (a, b), output a
semantic similarity distance based on an
iterative graph matching algorithm
13
DI Approximate Join Discovery
  • Why is it useful?
  • When learning a classifier in a multi-relational
    setting
  • The number of possible predicates (the predicate
    space) grows exponentially with the number of
    links
  • Improves efficiency
  • We eliminate spurious links
  • Improves accuracy
  • Best-effort DI
  • We need only a semantic distance
  • No need for exact matches (no need for human
    confirmation)
  • No need for schema mapping

Contribution: Using schema matching and DI
techniques helps in mining database structure and
join discovery
14
Ensemble of Classifiers Approach
  • Ensemble methods
  • Use a combination of models to increase global
    accuracy
  • Ensemble should be
  • Accurate
  • Build specialized (site-specific) classifiers
  • Build classifiers on homogeneous regions
  • → Improve prediction capability
  • Diverse
  • Each learned hypothesis should be as different as
    possible
  • Each learned hypothesis should be consistent with
    training data
  • Ensure all classifiers do not make the same
    errors
  • Exploit various attribute collections
    (joins/schemas/DBs)

15
HeteroClass Intuition
  • Main observation (theoretically proved)
  • To improve the accuracy of an ensemble of
    classifiers
  • They should disagree on some inputs
  • Diversity/ambiguity: a measure of disagreement
  • Generalization error = mean error - diversity
    (variance)
  • Increase ensemble diversity
  • Maintain the average error of ensemble members
  • → Decrease generalization error
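
For an averaging (regression-style) ensemble, this decomposition is an exact identity: the ensemble's squared error equals the mean member error minus the diversity (variance of members around the ensemble). A small numeric check, with hypothetical member predictions:

```python
# Ambiguity decomposition for an averaging ensemble:
# ensemble error = mean member error - diversity.
predictions = [2.0, 3.5, 5.0]   # hypothetical member outputs for one input
target = 3.0

ensemble = sum(predictions) / len(predictions)
ensemble_error = (ensemble - target) ** 2
mean_error = sum((p - target) ** 2 for p in predictions) / len(predictions)
diversity = sum((p - ensemble) ** 2 for p in predictions) / len(predictions)

# The identity holds exactly, so raising diversity while holding the
# average member error fixed must lower the ensemble's error.
assert abs(ensemble_error - (mean_error - diversity)) < 1e-9
```

Here the ensemble predicts 3.5, giving an error of 0.25, while members average an error of 1.75; the 1.5 of diversity accounts for the difference.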

16
HeteroClass Algorithm
  • 1- learn C0 = BaseLearn(training data)
  • 2- initialize ensemble = {C0}
  • 3- set ε = ensemble error
  • 4- repeat until (error converges, no more
    training data is left, or enough classifiers
    are created):
  • 4.1- pick two (or more) site-specific
    classifiers and add a number of joins (J1 ... Jn)
    that connect them
  • 4.2- training data = training data ∪ data
    accessible through J1 ... Jn
  • 4.3- C = BaseLearn(training data)
  • 4.4- ensemble = ensemble ∪ {C}
  • 4.5- if the new ensemble's error increased,
    ensemble = ensemble \ {C}
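
The loop above can be sketched as follows; `base_learn` and `ensemble_error` are stand-in stubs, since the slides do not specify the base learner or the error estimator:

```python
def base_learn(training_data):
    """Stub for BaseLearn: a real implementation would train a
    multi-relational classifier on the (joined) training data."""
    return {"trained_on": len(training_data)}

def ensemble_error(ensemble, validation):
    """Stub error estimate; a real one would evaluate on held-out data."""
    return 1.0 / (1 + sum(c["trained_on"] for c in ensemble))

def heteroclass(training_data, site_joins, validation, max_members=5):
    c0 = base_learn(training_data)              # 1. learn base classifier
    ensemble = [c0]                              # 2. initialize ensemble
    err = ensemble_error(ensemble, validation)   # 3. current ensemble error
    for joins in site_joins:                     # 4. grow the ensemble
        if len(ensemble) >= max_members:         #    (enough classifiers)
            break
        training_data = training_data + joins    # 4.2 add data reachable via joins
        c = base_learn(training_data)            # 4.3 learn a new member
        ensemble.append(c)                       # 4.4 tentatively add it
        new_err = ensemble_error(ensemble, validation)
        if new_err > err:                        # 4.5 revert if error grew
            ensemble.pop()
        else:
            err = new_err
    return ensemble
```

The accept/revert step in 4.5 is what keeps the ensemble's validation error monotonically non-increasing as members are added.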

17
Evaluation Settings
  • Dataset 1: Inventory
  • Products, Stores, Inventory records (5 databases)
  • Classification task
  • Product availability in a store
  • Dataset 2: DBLife data sources
  • DBLP database (2 XML databases from the Clio
    project @Toronto)
  • Authors, Papers, Publications, Conferences
  • DBLife-Researchers database
  • CS researchers, their associations and meta-data
    (Department, Title, etc.)
  • DBWorld-Events
  • People, Events (service, talk, etc.)
  • Classification tasks
  • Research area for researchers

18
HeteroClass Accuracy
19
HeteroClass Sensitivity Analysis
20
Conclusion
  • Take-home messages
  • Data mining across multiple data sources is
  • Important and necessary
  • Heterogeneity is challenging
  • Best-effort DI helps data mining
  • Improve structure mining using schema matching
  • The ensemble approach helps in heterogeneous
    settings with the data co-existence paradigm
  • A committee of local specialized classifiers can
    help in building a global classifier
  • The HeteroClass framework is general
  • Applicable to other data mining tasks
  • Clustering: build global clusters from local
    clusters
  • Etc.

21
Thank You!
22
Backup: Similarity Flooding Algorithm
  • e(s,p,o) ∈ A: an edge e labeled p from s to o
    in A
  • Pairwise Connectivity Graph PCG(A, B)
  • Induced Propagation Graph add edges in opposite
    direction
  • Forward edge weights: initialized with a q-gram
    string similarity matcher
  • Edge weights: propagation coefficients (how the
    similarity propagates to neighbors)
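
A q-gram matcher of the kind used to initialize the forward edge weights can be sketched as the Jaccard overlap of character q-gram sets; the boundary padding and q = 3 are illustrative choices:

```python
def qgrams(s, q=3):
    """Set of overlapping q-grams of a string, padded with '#' so that
    boundary characters and short strings still produce grams."""
    s = f"#{s.lower()}#"
    return {s[i:i + q] for i in range(max(1, len(s) - q + 1))}

def qgram_similarity(a, b, q=3):
    """Jaccard overlap of q-gram sets: a simple initial similarity
    between node labels, before similarity flooding propagates it."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

# Hypothetical column names from the two example schemas
sim = qgram_similarity("EmpName", "Pname")   # shares the 'name' grams
identical = qgram_similarity("Dept", "Dept") # identical labels -> 1.0
```

The matcher only needs to be roughly right: similarity flooding then refines these seed values using the graph structure around each pair.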

23
Backup: Similarity Flooding Algorithm
  • Input → Graph → Mapping → Filtering
  • Mapping: similarity flooding (SFJoin)
  • Initial similarity values taken from an initial
    mapping, e.g. a name matcher or a fully
    functional schema matching tool
  • In each iteration, the similarity of two elements
    affects the similarity of their respective
    neighbors
  • Iterate until similarity values are stable
    (fixpoint computation)
  • Filtering
  • Needed to choose the best candidate match
  • Many heuristics
  • Difficult to implement fully automatically (needs
    human intervention)
  • In HeteroClass we only need a semantic distance →
    no need for filtering
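
The fixpoint computation can be sketched as below. This is a simplified variant (plain neighbor propagation with max-normalization), not the exact SFJoin update rule, and the map pairs, weights, and seed similarities are hypothetical:

```python
def similarity_flooding(pairs, neighbors, init, iters=50, tol=1e-6):
    """Simplified fixpoint iteration: each pair's similarity is increased
    by its neighbors' similarities (scaled by propagation coefficients),
    then all values are normalized; stop when values are stable.
    neighbors[p] lists (other_pair, weight) edges of the propagation graph."""
    sigma = dict(init)
    for _ in range(iters):
        new = {}
        for p in pairs:
            new[p] = sigma[p] + sum(w * sigma[q]
                                    for q, w in neighbors.get(p, []))
        norm = max(new.values()) or 1.0          # normalize to [0, 1]
        new = {p: v / norm for p, v in new.items()}
        converged = max(abs(new[p] - sigma[p]) for p in pairs) < tol
        sigma = new
        if converged:
            break
    return sigma

# Hypothetical map pairs between two tiny schemas; the first and third
# pairs reinforce each other through the propagation graph.
pairs = [("Pname", "EmpName"), ("Dept", "DeptNo"), ("Pno", "EmpNo")]
neighbors = {
    ("Pname", "EmpName"): [(("Pno", "EmpNo"), 0.5)],
    ("Pno", "EmpNo"): [(("Pname", "EmpName"), 0.5)],
}
init = {("Pname", "EmpName"): 0.5, ("Dept", "DeptNo"): 0.1,
        ("Pno", "EmpNo"): 0.4}
sigma = similarity_flooding(pairs, neighbors, init)
# Mutually supported pairs drift toward 1; the isolated pair decays.
```

Structurally supported pairs end up with high similarity even from mediocre seeds, which is exactly why HeteroClass can stop here: the stable values already serve as semantic distances, and no filtering step is needed.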