SLIQ and SPRINT for disk resident data - PowerPoint PPT Presentation


Slides: 26

Provided by: AlexT160

1
SLIQ and SPRINT for disk resident data
2
SLIQ

SLIQ is a decision tree classifier that can handle both numerical and categorical attributes
Builds compact and accurate trees
Uses a pre-sorting technique in the tree-growing phase
Suitable for classification of large disk-resident datasets

3
Issues

There are two major performance-critical issues in the tree-growth phase:
  How to find split points
  How to partition the data
The well-known decision tree classifiers:
  Grow trees depth-first
  Repeatedly sort the data at every node
SLIQ:
  Replaces this repeated sorting with a one-time sort
  Uses a new data structure called the class-list
  The class-list must remain memory resident at all times

4
Some Data
rid age salary marital car
1 30 60 single sports
2 25 20 single mini
3 40 80 married van
4 45 100 single luxury
5 60 150 married luxury
6 35 120 single sports
7 50 70 married van
8 55 90 single sports
9 65 30 married mini
10 70 200 single luxury
5
SLIQ - Attribute Lists
rid marital
1 single
2 single
3 married
4 single
5 married
6 single
7 married
8 single
9 married
10 single
rid age
1 30
2 25
3 40
4 45
5 60
6 35
7 50
8 55
9 65
10 70
rid salary
1 60
2 20
3 80
4 100
5 150
6 120
7 70
8 90
9 30
10 200
These are projections on (rid, attribute).
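A minimal Python sketch of this projection step, assuming the ten example records are held in memory as tuples. The names `RECORDS`, `COLUMNS` and `attribute_list` are illustrative and do not come from the slides.

```python
# The ten example records: (rid, age, salary, marital, car).
RECORDS = [
    (1, 30, 60, "single", "sports"),
    (2, 25, 20, "single", "mini"),
    (3, 40, 80, "married", "van"),
    (4, 45, 100, "single", "luxury"),
    (5, 60, 150, "married", "luxury"),
    (6, 35, 120, "single", "sports"),
    (7, 50, 70, "married", "van"),
    (8, 55, 90, "single", "sports"),
    (9, 65, 30, "married", "mini"),
    (10, 70, 200, "single", "luxury"),
]
COLUMNS = ("rid", "age", "salary", "marital", "car")


def attribute_list(records, attribute):
    """Project the table onto (rid, attribute), keeping the original row order."""
    col = COLUMNS.index(attribute)
    return [(row[0], row[col]) for row in records]


age_list = attribute_list(RECORDS, "age")
salary_list = attribute_list(RECORDS, "salary")
marital_list = attribute_list(RECORDS, "marital")
print(age_list[:3])  # [(1, 30), (2, 25), (3, 40)]
```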
6
SLIQ - Sort Numeric, Group Categorical
rid marital
3 married
5 married
7 married
9 married
1 single
2 single
4 single
6 single
8 single
10 single
rid age
2 25
1 30
6 35
3 40
4 45
7 50
8 55
5 60
9 65
10 70
rid salary
2 20
9 30
1 60
7 70
3 80
8 90
4 100
6 120
5 150
10 200
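The one-time pre-sort can be sketched in a few lines of Python. This assumes the lists fit in memory, whereas a disk-resident implementation would use an external sort. The function name `presort` is hypothetical.

```python
# Unsorted attribute lists of (rid, value) pairs from the example data.
age_list = [(1, 30), (2, 25), (3, 40), (4, 45), (5, 60),
            (6, 35), (7, 50), (8, 55), (9, 65), (10, 70)]
marital_list = [(1, "single"), (2, "single"), (3, "married"), (4, "single"),
                (5, "married"), (6, "single"), (7, "married"), (8, "single"),
                (9, "married"), (10, "single")]


def presort(attr_list):
    """Order an attribute list by value; sorted() is stable, so ties keep rid order."""
    return sorted(attr_list, key=lambda entry: entry[1])


sorted_age = presort(age_list)            # numeric: ascending by age
grouped_marital = presort(marital_list)   # categorical: equal values become adjacent
print([rid for rid, _ in sorted_age])       # [2, 1, 6, 3, 4, 7, 8, 5, 9, 10]
print([rid for rid, _ in grouped_marital])  # [3, 5, 7, 9, 1, 2, 4, 6, 8, 10]
```

Using the same stable sort for both kinds of attribute is a convenience of this sketch. Grouping a categorical attribute only requires that equal values end up adjacent.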
7
SLIQ - Class List
N1
rid car LEAF
1 sports N1
2 mini N1
3 van N1
4 luxury N1
5 luxury N1
6 sports N1
7 van N1
8 sports N1
9 mini N1
10 luxury N1
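One possible in-memory representation of the class list is a dictionary keyed by rid. This is a sketch under that assumption. The slides only require that the class label and the current leaf can be looked up by rid, and `build_class_list` is an illustrative name.

```python
# (rid, car) pairs: car is the class label in the example data.
LABELS = [(1, "sports"), (2, "mini"), (3, "van"), (4, "luxury"), (5, "luxury"),
          (6, "sports"), (7, "van"), (8, "sports"), (9, "mini"), (10, "luxury")]


def build_class_list(labels, root="N1"):
    """Map each rid to a mutable [class, leaf] entry; every record starts at the root."""
    return {rid: [cls, root] for rid, cls in labels}


class_list = build_class_list(LABELS)
print(class_list[4])  # ['luxury', 'N1']
```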
8
SLIQ - Histograms
rid age
2 25
1 30
6 35
3 40
4 45
7 50
8 55
5 60
9 65
10 70
N1
rid car LEAF
1 sports N1
2 mini N1
3 van N1
4 luxury N1
5 luxury N1
6 sports N1
7 van N1
8 sports N1
9 mini N1
10 luxury N1
sports mini van luxury
L 0 0 0 0
R 3 2 2 3
sports mini van luxury
L
R
age ≤ 25 ?
sports mini van luxury
L
R
age ≤ 30 ?
Evaluate each split, using GINI or Entropy.
...
9
SLIQ - Histograms
rid age
2 25
1 30
6 35
3 40
4 45
7 50
8 55
5 60
9 65
10 70
N1
rid car LEAF
1 sports N1
2 mini N1
3 van N1
4 luxury N1
5 luxury N1
6 sports N1
7 van N1
8 sports N1
9 mini N1
10 luxury N1
sports mini van luxury
L 0 0 0 0
R 3 2 2 3
sports mini van luxury
L 0 1 0 0
R 3 1 2 3
age ≤ 25
sports mini van luxury
L 1 1 0 0
R 2 1 2 3
age ≤ 30
Evaluate each split, using GINI or Entropy.
...
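The scan over the sorted age list can be sketched as follows. The weighted Gini index is used as the score because it is one of the two measures the slide names. The function names `gini` and `scan_numeric` are assumptions of this sketch, and the example data has no repeated ages, so every entry is treated as a separate candidate split.

```python
CLASSES = ("sports", "mini", "van", "luxury")

# Pre-sorted age attribute list and the class labels (rid -> class), all at leaf N1.
sorted_age = [(2, 25), (1, 30), (6, 35), (3, 40), (4, 45),
              (7, 50), (8, 55), (5, 60), (9, 65), (10, 70)]
car = {1: "sports", 2: "mini", 3: "van", 4: "luxury", 5: "luxury",
       6: "sports", 7: "van", 8: "sports", 9: "mini", 10: "luxury"}


def gini(hist):
    """Gini impurity of a class histogram (0.0 for an empty histogram)."""
    n = sum(hist.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in hist.values())


def scan_numeric(attr_list, labels):
    """Yield (value, L, R, weighted gini) for each candidate test `attribute <= value`."""
    left = {c: 0 for c in CLASSES}
    right = {c: 0 for c in CLASSES}
    for cls in labels.values():
        right[cls] += 1                  # initially every record is on the right
    total = len(attr_list)
    for rid, value in attr_list:
        cls = labels[rid]                # class-list lookup by rid
        left[cls] += 1                   # move this record from R to L
        right[cls] -= 1
        n_left = sum(left.values())
        score = (n_left * gini(left) + (total - n_left) * gini(right)) / total
        yield value, dict(left), dict(right), score


for value, left, right, score in scan_numeric(sorted_age, car):
    print(f"age <= {value}: L={list(left.values())} "
          f"R={list(right.values())} gini={score:.3f}")
```

Because the list is sorted, each candidate split costs only one histogram update, which is why the pre-sort pays off. The final candidate (age ≤ 70) sends every record left and would normally be discarded.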
10
SLIQ - Histograms
rid car LEAF
1 sports N1
2 mini N1
3 van N1
4 luxury N1
5 luxury N1
6 sports N1
7 van N1
8 sports N1
9 mini N1
10 luxury N1
rid salary
2 20
9 30
1 60
7 70
3 80
8 90
4 100
6 120
5 150
10 200
N1
sports mini van luxury
L 0 0 0 0
R 3 2 2 3
sports mini van luxury
L 0 1 0 0
R 3 1 2 3
salary ≤ 20
sports mini van luxury
L 0 2 0 0
R 3 0 2 3
salary ≤ 30
Evaluate each split, using GINI or Entropy.
...
11
SLIQ - Histograms
rid car LEAF
1 sports N1
2 mini N1
3 van N1
4 luxury N1
5 luxury N1
6 sports N1
7 van N1
8 sports N1
9 mini N1
10 luxury N1
rid marital
3 married
5 married
7 married
9 married
1 single
2 single
4 single
6 single
8 single
10 single
N1
Married
sports mini van luxury
Yes 0 1 2 1
No 3 1 0 2
Single
sports mini van luxury
Yes 3 1 0 2
No 0 1 2 1
Evaluate each split, using GINI or Entropy.
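For a categorical attribute, the per-value class counts can be gathered in a single pass, and the Yes/No histograms for each value follow from them. The sketch below assumes single-value tests such as `marital == married`. The helper names `value_histograms` and `yes_no` are hypothetical.

```python
CLASSES = ("sports", "mini", "van", "luxury")

marital_list = [(3, "married"), (5, "married"), (7, "married"), (9, "married"),
                (1, "single"), (2, "single"), (4, "single"),
                (6, "single"), (8, "single"), (10, "single")]
car = {1: "sports", 2: "mini", 3: "van", 4: "luxury", 5: "luxury",
       6: "sports", 7: "van", 8: "sports", 9: "mini", 10: "luxury"}


def value_histograms(attr_list, labels):
    """Count classes per attribute value in one pass over the attribute list."""
    counts = {}
    for rid, value in attr_list:
        hist = counts.setdefault(value, {c: 0 for c in CLASSES})
        hist[labels[rid]] += 1
    return counts


def yes_no(counts, value):
    """Histograms for the test `attribute == value`: the Yes row and the No row."""
    yes = counts[value]
    no = {c: sum(h[c] for v, h in counts.items() if v != value) for c in CLASSES}
    return yes, no


counts = value_histograms(marital_list, car)
yes, no = yes_no(counts, "married")
print(list(yes.values()), list(no.values()))  # [0, 1, 2, 1] [3, 1, 0, 2]
```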
12
SLIQ - Perform best split and Update Class List
N1
salary ≤ 60
rid car LEAF
1 sports N1
2 mini N1
3 van N1
4 luxury N1
5 luxury N1
6 sports N1
7 van N1
8 sports N1
9 mini N1
10 luxury N1
rid salary
2 20
9 30
1 60
7 70
3 80
8 90
4 100
6 120
5 150
10 200
N2
N3
13
SLIQ - Perform best split and Update Class List
rid car LEAF
1 sports N2
2 mini N2
3 van N3
4 luxury N3
5 luxury N3
6 sports N3
7 van N3
8 sports N3
9 mini N2
10 luxury N3
rid salary
2 20
9 30
1 60
7 70
3 80
8 90
4 100
6 120
5 150
10 200
14
SLIQ - Histograms
rid age
2 25
1 30
6 35
3 40
4 45
7 50
8 55
5 60
9 65
10 70
rid car LEAF
1 sports N2
2 mini N2
3 van N3
4 luxury N3
5 luxury N3
6 sports N3
7 van N3
8 sports N3
9 mini N2
10 luxury N3
sports mini van luxury
L 0 0 0 0
R 1 2 0 0
N2
sports mini van luxury
L 0 0 0 0
R 2 0 2 3
N3
sports mini van luxury
L
R
N2
age ≤ 25 ?
sports mini van luxury
L
R
N3
Evaluate each split, using GINI or Entropy.
...
15
SLIQ - Histograms
rid age
2 25
1 30
6 35
3 40
4 45
7 50
8 55
5 60
9 65
10 70
rid car LEAF
1 sports N2
2 mini N2
3 van N3
4 luxury N3
5 luxury N3
6 sports N3
7 van N3
8 sports N3
9 mini N2
10 luxury N3
sports mini van luxury
L 0 0 0 0
R 1 2 0 0
N2
sports mini van luxury
L 0 0 0 0
R 2 0 2 3
N3
sports mini van luxury
L 0 1 0 0
R 1 1 0 0
N2
age ≤ 25
sports mini van luxury
L 0 0 0 0
R 2 0 2 3
N3
Evaluate each split, using GINI or Entropy.
...
16
SLIQ - Pseudocode

Split evaluation
EvaluateSplits()
  for each numeric attribute A do
    for each value v in the attribute list do
      find the corresponding entry in the class list, and hence
        the corresponding class and the leaf node Ni
      update the class histogram in leaf Ni
      compute splitting score for test (A ≤ v) for Ni
  for each categorical attribute A do
    for each leaf of the tree do
      find subset of A with best split
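The numeric half of this pseudocode might look as follows in Python. It is a sketch, not the authors' implementation. It assumes the class list after the salary ≤ 60 split (leaves N2 and N3), scores candidates with the weighted Gini index, and skips the degenerate test that leaves one side empty. The categorical subset search is omitted, and `evaluate_numeric` is an illustrative name.

```python
CLASSES = ("sports", "mini", "van", "luxury")

sorted_age = [(2, 25), (1, 30), (6, 35), (3, 40), (4, 45),
              (7, 50), (8, 55), (5, 60), (9, 65), (10, 70)]
# Class list after the split on salary: rid -> (class, current leaf).
class_list = {1: ("sports", "N2"), 2: ("mini", "N2"), 3: ("van", "N3"),
              4: ("luxury", "N3"), 5: ("luxury", "N3"), 6: ("sports", "N3"),
              7: ("van", "N3"), 8: ("sports", "N3"), 9: ("mini", "N2"),
              10: ("luxury", "N3")}


def gini(hist):
    """Gini impurity of a class histogram (0.0 for an empty histogram)."""
    n = sum(hist.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in hist.values())


def evaluate_numeric(attr_list, class_list):
    """Scan a sorted attribute list once; return the best (score, value) per leaf."""
    left, right = {}, {}
    for cls, leaf in class_list.values():
        left.setdefault(leaf, dict.fromkeys(CLASSES, 0))
        right.setdefault(leaf, dict.fromkeys(CLASSES, 0))[cls] += 1
    best = {}
    for rid, value in attr_list:
        cls, leaf = class_list[rid]          # class and leaf via the class list
        left[leaf][cls] += 1                 # update only this leaf's histograms
        right[leaf][cls] -= 1
        n_l, n_r = sum(left[leaf].values()), sum(right[leaf].values())
        if n_r == 0:
            continue                         # every record went left: not a split
        score = (n_l * gini(left[leaf]) + n_r * gini(right[leaf])) / (n_l + n_r)
        if leaf not in best or score < best[leaf][0]:
            best[leaf] = (score, value)
    return best


best = evaluate_numeric(sorted_age, class_list)
for leaf in sorted(best):
    score, value = best[leaf]
    print(f"{leaf}: age <= {value} (gini {score:.3f})")
```

The point of the design is that a single pass over one attribute list serves every leaf at the current tree level, because the class list says which leaf each record belongs to.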

17
SLIQ - Pseudocode

Updating the class list
UpdateLabels()
  for each split leaf Ni do
    let A be the split attribute for Ni
    for each (rid, v) in the attribute list for A do
      find the corresponding entry e in the class list (using the rid)
      if the leaf referenced by e is Ni then
        find the new leaf Nj to which (rid, v) belongs
          (by applying the splitting test)
        update the leaf pointer for e to Nj
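A minimal Python rendering of UpdateLabels, applied to the salary ≤ 60 split from the example. The `splits` dictionary, which maps a leaf to (attribute, threshold, left child, right child), is a hypothetical encoding chosen for this sketch. The slides do not prescribe one.

```python
# Class list before the split: rid -> [class, leaf]; every record is still at N1.
class_list = {rid: [cls, "N1"] for rid, cls in [
    (1, "sports"), (2, "mini"), (3, "van"), (4, "luxury"), (5, "luxury"),
    (6, "sports"), (7, "van"), (8, "sports"), (9, "mini"), (10, "luxury")]}
salary_list = [(2, 20), (9, 30), (1, 60), (7, 70), (3, 80),
               (8, 90), (4, 100), (6, 120), (5, 150), (10, 200)]

# Hypothetical encoding: leaf -> (attribute, threshold, left child, right child).
splits = {"N1": ("salary", 60, "N2", "N3")}
attribute_lists = {"salary": salary_list}


def update_labels(class_list, splits, attribute_lists):
    """Re-point the class-list entries of each split leaf to the matching child."""
    for leaf, (attr, threshold, left_child, right_child) in splits.items():
        for rid, value in attribute_lists[attr]:
            entry = class_list[rid]          # look up the entry by rid
            if entry[1] == leaf:             # only records currently in this leaf
                entry[1] = left_child if value <= threshold else right_child


update_labels(class_list, splits, attribute_lists)
print(sorted(rid for rid, (_, leaf) in class_list.items() if leaf == "N2"))  # [1, 2, 9]
```

Only the class list is rewritten. The attribute lists themselves stay untouched, which is what lets SLIQ avoid re-sorting or re-partitioning them.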

18
SLIQ - bottleneck

The class-list must remain memory resident at all times!
With today's memory sizes this is rarely a problem, but since the class-list has one entry per training record, it can still become a bottleneck for very large datasets.
So, what can we do when the class-list doesn't fit in main memory?
SPRINT is a solution...

19
SPRINT
rid age car
2 25 mini
1 30 sports
6 35 sports
3 40 van
4 45 luxury
7 50 van
8 55 sports
5 60 luxury
9 65 mini
10 70 luxury
rid salary car
2 20 mini
9 30 mini
1 60 sports
7 70 van
3 80 van
8 90 sports
4 100 luxury
6 120 sports
5 150 luxury
10 200 luxury
rid marital car
3 married van
5 married luxury
7 married van
9 married mini
1 single sports
2 single mini
4 single luxury
6 single sports
8 single sports
10 single luxury
The main data structures used in SPRINT are attribute lists and class histograms.
Each attribute list entry carries the class label, so no separate class list is needed.
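A sketch of how such lists could be built in Python, assuming the example table is in memory. Each entry is (rid, value, class), matching the column order shown above. `sprint_attribute_list` is an illustrative name.

```python
from collections import Counter

# The ten example records: (rid, age, salary, marital, car).
RECORDS = [
    (1, 30, 60, "single", "sports"), (2, 25, 20, "single", "mini"),
    (3, 40, 80, "married", "van"), (4, 45, 100, "single", "luxury"),
    (5, 60, 150, "married", "luxury"), (6, 35, 120, "single", "sports"),
    (7, 50, 70, "married", "van"), (8, 55, 90, "single", "sports"),
    (9, 65, 30, "married", "mini"), (10, 70, 200, "single", "luxury"),
]
COLUMNS = ("rid", "age", "salary", "marital", "car")


def sprint_attribute_list(records, attribute, class_attribute="car"):
    """Build a list of (rid, value, class) entries ordered by attribute value."""
    col = COLUMNS.index(attribute)
    cls_col = COLUMNS.index(class_attribute)
    entries = [(row[0], row[col], row[cls_col]) for row in records]
    return sorted(entries, key=lambda entry: entry[1])


age_list = sprint_attribute_list(RECORDS, "age")
print(age_list[0])  # (2, 25, 'mini')

# The class label travels with each entry, so a histogram needs no class-list lookup.
left = Counter(cls for _, value, cls in age_list if value <= 30)
print(left["mini"], left["sports"])  # 1 1
```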
20
SPRINT - Histograms
rid age car
2 25 mini
1 30 sports
6 35 sports
3 40 van
4 45 luxury
7 50 van
8 55 sports
5 60 luxury
9 65 mini
10 70 luxury
sports mini van luxury
L 0 0 0 0
R 3 2 2 3
sports mini van luxury
L 0 1 0 0
R 3 1 2 3
age ≤ 25
sports mini van luxury
L 1 1 0 0
R 2 1 2 3
age ≤ 30
Evaluate each split, using GINI or Entropy.
...
21
SPRINT - Histograms
rid salary car
2 20 mini
9 30 mini
1 60 sports
7 70 van
3 80 van
8 90 sports
4 100 luxury
6 120 sports
5 150 luxury
10 200 luxury
sports mini van luxury
L 0 0 0 0
R 3 2 2 3
sports mini van luxury
L 0 1 0 0
R 3 1 2 3
salary ≤ 20
sports mini van luxury
L 0 2 0 0
R 3 0 2 3
salary ≤ 30
Evaluate each split, using GINI or Entropy.
...
22
SPRINT - Histograms
rid marital car
3 married van
5 married luxury
7 married van
9 married mini
1 single sports
2 single mini
4 single luxury
6 single sports
8 single sports
10 single luxury
Married
sports mini van luxury
Yes 0 1 2 1
No 3 1 0 2
Single
sports mini van luxury
Yes 3 1 0 2
No 0 1 2 1
Evaluate each split, using GINI or Entropy.
23
SPRINT - Performing Best Split

Once the best split point has been found for a node, we execute the split by creating child nodes.
This requires splitting each of the node's attribute lists into two.
Partitioning the attribute list of the winning attribute (salary) is easy.
We scan the list, apply the split test, and move the records to two new attribute lists, one for each new child.

24
SPRINT - Performing Best Split

Unfortunately, for the remaining attribute lists of the node (age and marital), we have no test that we can apply to the attribute values to decide how to divide the records.
Solution: use the rids.
As we partition the list of the splitting attribute (i.e. salary), we insert the rid of each record into a probe structure (a hash table), noting to which child the record was moved.
Once we have collected all the rids, we scan the lists of the remaining attributes and probe the hash table with the rid of each record.
The retrieved information tells us in which child to place the record.
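The two steps can be sketched in Python, with a `dict` standing in for the hash table. The threshold of 60 is borrowed from the earlier SLIQ example and is an assumption here. The function names are hypothetical.

```python
# SPRINT attribute lists: (rid, value, class), already sorted by value.
salary_list = [(2, 20, "mini"), (9, 30, "mini"), (1, 60, "sports"), (7, 70, "van"),
               (3, 80, "van"), (8, 90, "sports"), (4, 100, "luxury"),
               (6, 120, "sports"), (5, 150, "luxury"), (10, 200, "luxury")]
age_list = [(2, 25, "mini"), (1, 30, "sports"), (6, 35, "sports"), (3, 40, "van"),
            (4, 45, "luxury"), (7, 50, "van"), (8, 55, "sports"),
            (5, 60, "luxury"), (9, 65, "mini"), (10, 70, "luxury")]


def split_winning_list(attr_list, threshold):
    """Partition the splitting attribute's list and record each rid's child."""
    left, right, probe = [], [], {}
    for entry in attr_list:
        rid, value, _ = entry
        goes_left = value <= threshold       # apply the split test
        (left if goes_left else right).append(entry)
        probe[rid] = goes_left               # remember where this record went
    return left, right, probe


def split_other_list(attr_list, probe):
    """Partition a non-splitting attribute's list by probing with each rid."""
    left, right = [], []
    for entry in attr_list:
        (left if probe[entry[0]] else right).append(entry)
    return left, right


sal_left, sal_right, probe = split_winning_list(salary_list, 60)
age_left, age_right = split_other_list(age_list, probe)
print([rid for rid, _, _ in age_left])   # [2, 1, 9]
print([rid for rid, _, _ in age_right])  # [6, 3, 4, 7, 8, 5, 10]
```

Because each list is scanned in order and entries are appended as they are met, both children stay sorted by value and never need re-sorting.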

25
SPRINT - Performing Best Split

If the hash table is too large for memory, splitting is done in more than one step.
The attribute list for the splitting attribute is partitioned only up to the record at which the hash table still fits in memory.
The corresponding portions of the attribute lists of the non-splitting attributes are partitioned, and the process is repeated for the remainder of the attribute list of the splitting attribute.
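A rough Python sketch of this multi-step variant, where `max_rids` is a hypothetical stand-in for the memory limit on the hash table. The slides do not say how value order is kept across steps, so the final re-sort of the child lists is a simplification made by this sketch. The threshold of 60 is again borrowed from the SLIQ example.

```python
salary_list = [(2, 20, "mini"), (9, 30, "mini"), (1, 60, "sports"), (7, 70, "van"),
               (3, 80, "van"), (8, 90, "sports"), (4, 100, "luxury"),
               (6, 120, "sports"), (5, 150, "luxury"), (10, 200, "luxury")]
age_list = [(2, 25, "mini"), (1, 30, "sports"), (6, 35, "sports"), (3, 40, "van"),
            (4, 45, "luxury"), (7, 50, "van"), (8, 55, "sports"),
            (5, 60, "luxury"), (9, 65, "mini"), (10, 70, "luxury")]


def by_value(entry):
    """Sort key: the attribute value of a (rid, value, class) entry."""
    return entry[1]


def split_in_chunks(winning_list, other_lists, threshold, max_rids):
    """Split all lists while holding at most `max_rids` rids in the probe table."""
    win_left, win_right = [], []
    others = {name: ([], []) for name in other_lists}
    for start in range(0, len(winning_list), max_rids):
        probe = {}                           # fresh, bounded hash table per step
        for entry in winning_list[start:start + max_rids]:
            rid, value, _ = entry
            goes_left = value <= threshold
            (win_left if goes_left else win_right).append(entry)
            probe[rid] = goes_left
        # Rescan each other list, moving only records whose rid is in this table.
        for name, attr_list in other_lists.items():
            left, right = others[name]
            for entry in attr_list:
                if entry[0] in probe:
                    (left if probe[entry[0]] else right).append(entry)
    # Appending step by step can break value order, so restore it here
    # (this sketch's simplification; not described in the slides).
    others = {name: (sorted(l, key=by_value), sorted(r, key=by_value))
              for name, (l, r) in others.items()}
    return win_left, win_right, others


win_left, win_right, others = split_in_chunks(
    salary_list, {"age": age_list}, threshold=60, max_rids=4)
print([rid for rid, _, _ in others["age"][0]])  # [2, 1, 9]
print([rid for rid, _, _ in others["age"][1]])  # [6, 3, 4, 7, 8, 5, 10]
```

The trade-off is extra passes over the non-splitting lists: one scan per step instead of one scan in total.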
