1. HeteroClass: A Framework for Effective Classification from Heterogeneous Databases
- Mayssam Sayyadian
- University of Illinois at Urbana-Champaign
- Final Project Presentation for CS512: Principles and Algorithms for Data Mining, May 2006
2. Roadmap
- Motivation
  - From multi-relational classification → multi-database classification
  - Motivating examples
- Challenges
  - Heterogeneous data sources
  - Attribute instability
- HeteroClass framework
  - Data integration techniques help data mining techniques
  - Diverse ensembles to address attribute instability
- Empirical evaluation
- Discussion, broader impact, and conclusion
3. Multi-relational Classification
- Classification: an old, but important, problem!
- Most real-world data are stored in relational databases.
- To classify objects in one relation, other relations provide crucial information.
- Relational data cannot be converted into a single table without expert knowledge or losing essential information.
- Multi-relational classification: automatically classifying objects using multiple relations
4. Motivation
- Necessity drives applications
- From multi-relational scenarios to multi-database scenarios
- Cross-database links play an important role but are difficult to capture automatically
5. Examples
- In many real-world settings, data sources are
  - Diverse
  - Autonomous and heterogeneous
  - Scattered over the Web, organizations, enterprises, digital libraries, smart homes, personal information management (PIM) systems, etc.
  - ... but inter-related
- Examples
  - Which UIUC graduates will donate money to the school in the future? (multiple databases in an organization)
  - Which products are likely to sell over the next 30 days? (multiple databases in an enterprise)
  - Which customers will use both the travel and dining services of our company?
  - Which movie scenes are shot outdoors? (multimedia/spatial databases)
  - Etc.
6. Challenge 1: Heterogeneity of Data Sources
- Heterogeneity of data sources
- Data-level heterogeneity, e.g. Phil vs. Philippe, Jiawei Han vs. Han, J.
- Heterogeneity of schemas, e.g.
  - Personnel(Pno, Pname, Dept, Born) vs.
  - Empl(id, name, DeptNo, Salary, Birth), Dept(id, name)
- Structure/format heterogeneity, e.g.

    <Schema name="HR-Schema">
      <ElementType name="Personnel">
        <element type="EmpNo"/>
        <element type="EmpName"/>
        <ElementType name="Dept">
          <element type="id"/>
          <element type="name"/>
        </ElementType>
      </ElementType>
    </Schema>

  vs.

    CREATE TABLE Personnel(
      EmpNo int PRIMARY KEY,
      EmpName varchar(50),
      DeptNo int REFERENCES Dept,
      Salary dec(15,2) )
    CREATE TABLE Dept(
      DeptNo int PRIMARY KEY,
      DeptName varchar(70) )
7. Challenge 2: Attribute Instability
- In distributed databases, objects/data values are
  - Horizontally distributed
    - E.g. UIUC organization: HR, OAR, Facilities, Business, Academics, ...
  - Vertically distributed
    - E.g. comparison shopping: Yahoo! Shopping, Amazon, eBay, ...
  - Distributed with various distributions
    - E.g. spatial/multimedia databases: DBs of different training-set sizes
  - Distributed with various characteristics (features)
- This matters when we need to run feature selection before learning the model
8. HeteroClass Framework: Observations
- Data integration before data mining is very costly and difficult (if not impossible)
- Data warehousing before data mining is difficult and inefficient
  - Security concerns: providers do not give full access to their data
  - Efficiency concerns (large databases)
- The data co-existence paradigm is most natural
  - Data sources expose standalone classifiers
  - DataSpace Support Platforms (DSSPs)
  - Data mining embedded in DBMSs
- It is possible (and useful) to build specialized classifiers
- No need for full integration/warehousing: partial integration is enough
9. HeteroClass Framework: Architecture
- Join discovery module
  - Alleviates heterogeneity
  - Outputs useful links between databases (and their relations)
- Ensemble builder
  - Builds a diverse ensemble of classifiers
- Meta-learner
  - Voting
  - Decision-tree meta-learner
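The voting meta-learner can be as simple as a plurality vote over the members' predicted labels. A generic sketch (the class labels below are made up for illustration):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class labels from ensemble members by plurality vote.
    Ties are broken in favor of the first-seen label (Counter preserves
    insertion order and most_common sorts stably)."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical site-specific classifier outputs for one object:
label = majority_vote(["databases", "data-mining", "databases"])
```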
10. Approximate Join Discovery: Step 1
- Two-step process
- Find approximate join paths based on set resemblance
  - Field similarity by exact-match/substring similarity
  - The resemblance of two sets A, B is ρ(A, B) = |A ∩ B| / |A ∪ B|
  - After some algebra, |A ∩ B| = ρ(A, B) · (|A| + |B|) / (1 + ρ(A, B))
  - There are fast algorithms to compute ρ:
    - h(a) is a hash function
    - m(A) = min{ h(a) : a ∈ A }
    - Observation: Pr[ m(A) = m(B) ] = ρ(A, B)
- The sets are the unique values in the fields
- The resemblance of two fields is a good measure of their similarity
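The min-hash observation above gives a fast resemblance estimator: with many independent hash functions, the fraction of functions on which the two fields' minima agree estimates ρ. A minimal sketch (the salted-CRC32 hash family and the toy columns are my own choices, not the paper's implementation):

```python
import random
import zlib

def minhash_signature(values, num_hashes=128, seed=42):
    """Min-hash signature of a set of field values: for each salted hash
    function, record the minimum hash over the set. Pr[minima agree] is
    the resemblance rho = |A ∩ B| / |A ∪ B|."""
    rng = random.Random(seed)
    # Each "hash function" is CRC32 XOR-ed with a random 32-bit salt
    # (a simple, not perfectly uniform, family -- fine for illustration).
    masks = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(zlib.crc32(str(v).encode()) ^ m for v in values) for m in masks]

def estimate_resemblance(sig_a, sig_b):
    """Fraction of signature slots where the minima agree."""
    return sum(1 for a, b in zip(sig_a, sig_b) if a == b) / len(sig_a)

# Two overlapping columns of unique field values (hypothetical data):
col_a = {f"id{i}" for i in range(0, 80)}
col_b = {f"id{i}" for i in range(40, 120)}
true_rho = len(col_a & col_b) / len(col_a | col_b)   # 40 / 120
est = estimate_resemblance(minhash_signature(col_a), minhash_signature(col_b))
```

With 128 hash functions the estimate is typically within a few percentage points of the true resemblance, and signatures are far cheaper to ship between databases than full value sets.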
11. Approximate Join Discovery: Step 2
- For each set of joinable attributes in the two schemas
  - Estimate their semantic distance using a schema-matching tool
  - Filter out join paths that are not semantically meaningful, i.e. whose semantic distance is high
- E.g. (DB1.last-name, DB2.street-name)
  - Might be joinable because of common substrings (Clinton St. vs. Bill Clinton, George West vs. West Main St., etc.)
  - But this join is not semantically meaningful
- This happens frequently with categorical attributes (e.g. with yes/no values)
12. Schema Matching
- Given a schema, build its graph model:

    CREATE TABLE Personnel(
      Pno int,
      Pname string,
      Dept string)

  [Figure: graph model of the Personnel schema: numbered nodes (1 to 6) for the table, its columns (Pno, Pname, Dept), and the SQL types (int, string), connected by table, column, columnType, type, and name edges]

- Given two models, run the similarity flooding algorithm
- For each pair of schema elements (a, b), output a semantic similarity distance based on an iterative graph-matching algorithm
13. DI: Approximate Join Discovery
- Why is it useful?
- When learning a classifier in a multi-relational setting, the number of possible predicates (the predicate space) grows exponentially with the number of links
- Improves efficiency
- We eliminate spurious links
- Improves accuracy
- Best-effort DI
  - We need only a semantic distance
  - No need for exact matches (no need for human confirmation)
  - No need for schema mapping
- Contribution: using schema matching and DI techniques helps in mining database structure and discovering joins
14. Ensemble-of-Classifiers Approach
- Ensemble methods use a combination of models to increase global accuracy
- The ensemble should be
  - Accurate
    - Build specialized (site-specific) classifiers
    - Build classifiers on homogeneous regions
    - → Improve prediction capability
  - Diverse
    - Each learned hypothesis should be as different as possible
    - Each learned hypothesis should be consistent with the training data
    - Ensure the classifiers do not all make the same errors
    - Exploit various attribute collections (joins/schemas/DBs)
15. HeteroClass: Intuition
- Main observation (theoretically proven): to improve the accuracy of an ensemble of classifiers, its members should disagree on some inputs
- Diversity/ambiguity: a measure of disagreement
- Generalization error = mean member error − diversity (variance): E = Ē − Ā
- Increase ensemble diversity while maintaining the average error of the ensemble members
- → Decrease generalization error
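The decomposition E = Ē − Ā can be checked numerically for an averaging ensemble. This is the regression form of the ambiguity decomposition, used here only to illustrate the intuition; the member predictions and target are made-up numbers:

```python
# Numerical check of the ambiguity decomposition: for an averaging
# ensemble, the squared error of the ensemble equals the average member
# error minus the average disagreement with the ensemble (diversity).
member_preds = [2.0, 3.5, 4.0, 1.5]   # hypothetical member outputs for one input
target = 3.0

ensemble_pred = sum(member_preds) / len(member_preds)
ensemble_err = (ensemble_pred - target) ** 2

mean_member_err = sum((p - target) ** 2 for p in member_preds) / len(member_preds)
diversity = sum((p - ensemble_pred) ** 2 for p in member_preds) / len(member_preds)

# E = mean error - diversity: at fixed member error, more diversity
# means a lower ensemble error.
assert abs(ensemble_err - (mean_member_err - diversity)) < 1e-12
```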
16. HeteroClass Algorithm
- 1. C0 ← BaseLearn(training data)
- 2. ensemble ← {C0}
- 3. err ← ensemble error
- 4. Until the error converges, no more training data are left, or enough ensemble members have been created, do:
  - 4.1. Pick two (or more) site-specific classifiers and add a number of joins (J1 ... Jn) that connect them
  - 4.2. training data ← training data ∪ data accessible through J1 ... Jn
  - 4.3. C ← BaseLearn(training data)
  - 4.4. ensemble ← ensemble ∪ {C}
  - 4.5. If the new ensemble's error has increased: ensemble ← ensemble \ {C}
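The control flow of the loop can be sketched in Python. Every helper here (base_learn, pick_joins, expand_via_joins, ensemble_error) is a hypothetical stand-in for the corresponding HeteroClass component, so this illustrates only the shape of the algorithm, not the framework's implementation:

```python
def heteroclass(training_data, base_learn, pick_joins, expand_via_joins,
                ensemble_error, max_members=10, tol=1e-3):
    """Greedy ensemble construction over cross-database joins (sketch)."""
    ensemble = [base_learn(training_data)]            # steps 1-2: C0, ensemble = {C0}
    err = ensemble_error(ensemble, training_data)     # step 3: current ensemble error
    for _ in range(max_members):                      # step 4: bounded loop
        joins = pick_joins(ensemble)                  # 4.1: joins linking site-specific classifiers
        if not joins:                                 # no more training data reachable
            break
        training_data = expand_via_joins(training_data, joins)   # 4.2
        candidate = base_learn(training_data)         # 4.3: learn C on the expanded data
        new_err = ensemble_error(ensemble + [candidate], training_data)
        if new_err > err:                             # 4.5: drop C if the error increased
            continue
        ensemble.append(candidate)                    # 4.4: keep C
        if err - new_err < tol:                       # error converged
            break
        err = new_err
    return ensemble
```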
17. Evaluation Settings
- Dataset 1: Inventory
  - Products, Stores, Inventory records: 5 databases
  - Classification task: product availability in a store
- Dataset 2: DBLife data sources
  - DBLP database (2 XML databases from the Clio project at Toronto)
    - Authors, Papers, Publications, Conferences
  - DBLife-Researchers database
    - CS researchers, their associations, and metadata (Department, Title, etc.)
  - DBWorld-Events
    - People, Events (service, talk, etc.)
  - Classification task: research area for researchers
18. HeteroClass: Accuracy
19. HeteroClass: Sensitivity Analysis
20. Conclusion
- Take-home messages
  - Data mining across multiple data sources is important and necessary
  - Heterogeneity is challenging
  - Best-effort DI helps data mining
    - Improves structure mining using schema matching
  - The ensemble approach helps in heterogeneous settings with the data co-existence paradigm
    - A committee of local, specialized classifiers can help in building a global classifier
  - The HeteroClass framework is general
    - Applicable to other data mining tasks
    - Clustering: build global clusters from local clusters
    - Etc.
21. Thank You!
22. Backup: Similarity Flooding Algorithm
- Notation: ∀ e(s, p, o) ∈ A: an edge e labeled p from s to o in graph A
- Pairwise connectivity graph: PCG(A, B)
- Induced propagation graph: add edges in the opposite direction
- Forward edge weights: initialize with a q-gram string-similarity matcher
- Edge weights: propagation coefficients (how similarity propagates to neighbors)
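A q-gram string matcher of the kind used to seed the edge weights compares the sets of character q-grams of two strings. A minimal sketch (padding with `#` and Jaccard overlap are common conventions, not necessarily the exact matcher used here):

```python
def qgrams(s, q=3):
    """Set of character q-grams of s, padded with '#' so that the
    string's edges also contribute q-grams."""
    s = f"{'#' * (q - 1)}{s.lower()}{'#' * (q - 1)}"
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_similarity(a, b, q=3):
    """Jaccard overlap of the two q-gram sets, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)

# Column names from the two example schemas:
sim = qgram_similarity("EmpName", "Pname")
```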
23. Backup: Similarity Flooding Algorithm
- Input → Graph → Mapping → Filtering
- Mapping: similarity flooding (SFJoin)
  - Initial similarity values are taken from an initial mapping, e.g. a name matcher or a fully functional schema-matching tool
  - In each iteration, the similarity of two elements affects the similarity of their respective neighbors
  - Iterate until the similarity values are stable (fixpoint computation)
- Filtering
  - Needed to choose the best candidate match
  - Many heuristics
  - Difficult to implement fully automatically (needs human intervention)
  - In HeteroClass we only need a semantic distance → no need for filtering
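The fixpoint computation can be sketched as a toy iteration in the spirit of similarity flooding. This is heavily simplified: the pair graph, unit propagation weights, and the update "new similarity = initial similarity + sum over neighbor pairs, then normalize" are illustrative stand-ins, whereas the real algorithm derives the propagation graph and per-edge coefficients from the two schema graphs:

```python
def flood(pairs, neighbors, init, iters=50, tol=1e-9):
    """Iterate sim(p) <- init(p) + sum of neighbor-pair similarities,
    normalize by the maximum, and stop when values are stable."""
    sim = dict(init)
    for _ in range(iters):
        new = {}
        for p in pairs:
            new[p] = init.get(p, 0.0) + sum(sim.get(n, 0.0)
                                            for n in neighbors.get(p, []))
        top = max(new.values()) or 1.0          # normalize to [0, 1]
        new = {p: v / top for p, v in new.items()}
        if max(abs(new[p] - sim.get(p, 0.0)) for p in pairs) < tol:
            sim = new
            break                                # fixpoint reached
        sim = new
    return sim

# Toy pairwise connectivity graph: two element pairs that reinforce
# each other, seeded with initial (e.g. q-gram) similarities.
pairs = [("a", "x"), ("b", "y")]
neighbors = {("a", "x"): [("b", "y")], ("b", "y"): [("a", "x")]}
init = {("a", "x"): 1.0, ("b", "y"): 0.5}
sim = flood(pairs, neighbors, init)
```

Because HeteroClass consumes only the resulting similarity values as semantic distances, the filtering stage that follows flooding in the original algorithm can be skipped.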