Scalable Mining For Classification Rules in Relational Databases - Presentation Transcript
1
Scalable Mining For Classification Rules in
Relational Databases
Min Wang, Bala Iyer, Jeffrey Scott Vitter
2
Abstract
  • Problem: increase in the size of the training set
  • MIND (MINing in Database) classifier
  • Can be implemented easily on top of SQL
  • Other classifiers need O(N) space in memory
  • MIND scales well with respect to
  • I/O
  • number of processors

3
Overview
  • Introduction
  • Algorithm
  • Database Implementation
  • Performance
  • Experimental Results
  • Conclusions

4
Introduction - Classification Problem
DETAIL TABLE  ->  CLASSIFIER
Example decision tree built from the training data:
  Age < 30?
    yes: Salary < 62K?
      yes: risky
      no:  safe
    no:  safe
5
Introduction - Scalability In Classification
  • Importance of scalability
  • Use a very large training set: the data is not memory resident
  • Number of CPUs: better use of resources

6
Introduction - Scalability In Classification
  • Properties of MIND
  • Scalable in memory
  • Scalable in the number of CPUs
  • Uses SQL
  • Easy to implement
  • Assumptions
  • Attribute values are discrete
  • We focus on the growth phase (no pruning)

7
The Algorithm - Data Structure

Data is stored in the DETAIL table:
DETAIL(attr1, attr2, ..., class, leaf_num)
  attri    - the i-th attribute
  class    - the class label
  leaf_num - the number of the leaf the example currently belongs to
             (this value can be computed by applying the tree built so far)
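As an illustration only, the DETAIL table for the two-attribute example above might be declared as follows; the column types and the age/salary interpretation are assumptions, not taken from the slides.

-- Illustrative sketch of a possible DETAIL schema (types are assumptions)
CREATE TABLE DETAIL (
    attr1    INTEGER,   -- e.g. age
    attr2    INTEGER,   -- e.g. salary
    class    INTEGER,   -- class label, e.g. 0 = safe, 1 = risky
    leaf_num INTEGER    -- leaf of the current tree the record belongs to
);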
8
The Algorithm - gini index
S  - data set
C  - number of classes
pi - relative frequency of class i in S

gini index: gini(S) = 1 - (p1^2 + p2^2 + ... + pC^2)
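For example, a leaf with 60 safe and 40 risky records has gini = 1 - (0.6^2 + 0.4^2) = 0.48, while a pure leaf has gini = 0. When a candidate split partitions S into S1 and S2, the split is scored by the weighted sum gini_split(S) = (|S1|/|S|) gini(S1) + (|S2|/|S|) gini(S2), and the split with the smallest value is preferred.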

9
The Algorithm
  • GrowTree(DETAIL table)
  • Initialize tree T and put all records of DETAIL in the root
  • while (some leaf in T is not a STOP node)
  •   for each attribute i do
  •     evaluate the gini index of each non-STOP leaf at each split value
        with respect to attribute i
  •   for each non-STOP leaf do
  •     get the overall best split for it
  •   partition the records and grow the tree one more level according to
      the best splits
  •   mark all small or pure leaves as STOP nodes
  • return T

10
Database Implementation - Dimension table
  • For each attribute i and each level of the tree:
  • INSERT INTO DIMi
  • SELECT leaf_num, class, attri, COUNT(*)
  • FROM DETAIL
  • WHERE leaf_num <> STOP
  • GROUP BY leaf_num, class, attri
  • Size of DIMi = (number of leaves) x (number of distinct values of attri)
    x (number of classes)

11
Database Implementation - Dimension table SQL
  • One scan of DETAIL fills all dimension tables at once:
  • SELECT * FROM DETAIL
  • INSERT INTO DIM1 leaf_num, class, attr1, COUNT(*)
  • WHERE leaf_num <> STOP
  • GROUP BY leaf_num, class, attr1
  • INSERT INTO DIM2 leaf_num, class, attr2, COUNT(*)
  • WHERE leaf_num <> STOP
  • GROUP BY leaf_num, class, attr2

12
Database Implementation - UP/DOWN - split
  • For each attribute we evaluate all possible split points; UP accumulates,
    for each candidate value of attri, the per-class counts of records with
    attri at or below that value
  • INSERT INTO UP
  • SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
  • FROM (FULL OUTER JOIN DIMi d1, DIMi d2 ON
    d1.leaf_num = d2.leaf_num AND
  • d2.attri <= d1.attri AND
  • d1.class = d2.class)
  • GROUP BY d1.leaf_num, d1.attri, d1.class
  • (DOWN is built analogously; see the sketch below)
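The DOWN table used by the later views is not spelled out on the slides; a minimal sketch, assuming it mirrors UP with the complementary join condition on attri:

-- Sketch: class counts on the other side of each candidate split point;
-- identical to the UP query except that d2.attri > d1.attri.
INSERT INTO DOWN
SELECT d1.leaf_num, d1.attri, d1.class, SUM(d2.count)
FROM (FULL OUTER JOIN DIMi d1, DIMi d2 ON
      d1.leaf_num = d2.leaf_num AND
      d2.attri > d1.attri AND
      d1.class = d2.class)
GROUP BY d1.leaf_num, d1.attri, d1.class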

13
Database Implementation - Class View
  • Create a view for each class k and attribute i (the Ck_DOWN views over
    DOWN are defined the same way; see the sketch below):
  • CREATE VIEW Ck_UP (leaf_num, attri, count) AS
  • SELECT leaf_num, attri, count
  • FROM UP
  • WHERE class = k
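A sketch of the matching per-class view over DOWN, which the GINI_VALUE view on the next slide refers to as C1_DOWN .. Cc_DOWN:

-- Sketch: per-class counts taken from DOWN, mirroring Ck_UP.
CREATE VIEW Ck_DOWN (leaf_num, attri, count) AS
SELECT leaf_num, attri, count
FROM DOWN
WHERE class = k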

14
Database Implementation - GINI VALUE
  • Create a view holding all gini values:
  • CREATE VIEW GINI_VALUE (leaf_num, attri, gini) AS
  • SELECT u1.leaf_num, u1.attri, f_gini
  • FROM C1_UP u1, .., Cc_UP uc, C1_DOWN d1, .., Cc_DOWN dc
  • WHERE u1.attri = .. = uc.attri = .. = dc.attri
  • AND u1.leaf_num = .. = uc.leaf_num = .. = dc.leaf_num
  • (f_gini is an arithmetic expression that computes the gini index of the
    split from the counts u1.count, .., uc.count, d1.count, .., dc.count)
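For concreteness, a minimal sketch of GINI_VALUE for the two-class case (C = 2); this spelling of f_gini is mine, and it assumes all four counts at a split point are non-zero so the divisions are well defined:

-- Sketch for C = 2: weighted gini of the two sides of the split,
-- (n_up * gini_up + n_down * gini_down) / n, rewritten over the raw counts.
CREATE VIEW GINI_VALUE (leaf_num, attri, gini) AS
SELECT u1.leaf_num, u1.attri,
       ( (u1.count + u2.count)
           - (u1.count * u1.count + u2.count * u2.count) * 1.0
             / (u1.count + u2.count)
       + (d1.count + d2.count)
           - (d1.count * d1.count + d2.count * d2.count) * 1.0
             / (d1.count + d2.count)
       ) / (u1.count + u2.count + d1.count + d2.count)
FROM C1_UP u1, C2_UP u2, C1_DOWN d1, C2_DOWN d2
WHERE u1.attri = u2.attri AND u2.attri = d1.attri AND d1.attri = d2.attri
  AND u1.leaf_num = u2.leaf_num AND u2.leaf_num = d1.leaf_num
  AND d1.leaf_num = d2.leaf_num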

15
Database Implementation - MIN GINI VALUE
  • Create a table with the minimum gini value for each attribute i:
  • INSERT INTO MIN_GINI
  • SELECT leaf_num, i, attri, gini
  • FROM GINI_VALUE a
  • WHERE a.gini =
  • (SELECT MIN(gini)
  • FROM GINI_VALUE b
  • WHERE a.leaf_num = b.leaf_num)

16
Database Implementation - BEST SPLIT
  • Create a view over MIN_GINI giving the best split for each leaf:
  • CREATE VIEW BEST_SPLIT
  • (leaf_num, attr_name, attr_value) AS
  • SELECT leaf_num, attr_name, attr_value
  • FROM MIN_GINI a
  • WHERE a.gini =
  • (SELECT MIN(gini)
  • FROM MIN_GINI b
  • WHERE a.leaf_num = b.leaf_num)

17
Database Implementation - Partitioning
  • Build new nodes by splitting old nodes according to the BEST_SPLIT values
  • Assign each record to the correct new node
  • Updating leaf_num is done by a function applied when DETAIL is next
    scanned (see the sketch below)
  • No UPDATE of the data or the database is needed
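A minimal sketch of how the next-level scan over DETAIL might look, assuming a hypothetical user-defined function tree_leaf(...) that replays the tree built so far on a record's attribute values (two attributes assumed, as in the earlier example):

-- Sketch only: leaf_num is recomputed on the fly rather than updated in place;
-- tree_leaf is a hypothetical user-defined function encoding the current tree.
INSERT INTO DIMi
SELECT tree_leaf(attr1, attr2) AS leaf_num, class, attri, COUNT(*)
FROM DETAIL
WHERE tree_leaf(attr1, attr2) <> STOP
GROUP BY tree_leaf(attr1, attr2), class, attri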

18
Performance
Analytical comparison: I/O cost of MIND vs. I/O cost of SPRINT
19
Experimental Results
Normalized time to finish building the tree
Normalized time to build the tree per example
20
Experimental Results
Normalized time to build the tree per number of processors
Time to build the tree by training set size
21
Conclusions
  • MIND works on top of a relational DBMS
  • MIND works well because
  • MIND rephrases classification as a database problem
  • MIND avoids UPDATEs to the DETAIL table
  • Parallelism and scaling are achieved through the use of the RDBMS
  • MIND uses a user-defined function to get the performance gain in the
    DIMi creation