SLIQ and SPRINT for disk-resident data

Slides: 26
Provided by: AlexT160

1
SLIQ and SPRINT for disk-resident data
2
SLIQ
  • SLIQ is a decision tree classifier that can
    handle both numerical and categorical attributes
  • Builds compact and accurate trees
  • Uses a pre-sorting technique in the tree growing
    phase
  • Suitable for classification of large
    disk-resident datasets.

3
Issues
  • There are two major performance-critical issues
    in the tree-growth phase:
      • How to find split points
      • How to partition the data
  • The well-known decision tree classifiers:
      • Grow trees depth-first
      • Repeatedly sort the data at every node
  • SLIQ:
      • Replaces this repeated sorting with a one-time sort
      • Uses a new data structure called the class-list
      • The class-list must remain memory resident at all
        times

4
Some Data
rid age salary marital car
1 30 60 single sports
2 25 20 single mini
3 40 80 married van
4 45 100 single luxury
5 60 150 married luxury
6 35 120 single sports
7 50 70 married van
8 55 90 single sports
9 65 30 married mini
10 70 200 single luxury
5
SLIQ - Attribute Lists
rid marital
1 single
2 single
3 married
4 single
5 married
6 single
7 married
8 single
9 married
10 single
rid age
1 30
2 25
3 40
4 45
5 60
6 35
7 50
8 55
9 65
10 70
rid salary
1 60
2 20
3 80
4 100
5 150
6 120
7 70
8 90
9 30
10 200
These are projections on (rid, attribute).
6
SLIQ - Sort Numeric, Group Categorical
rid marital
3 married
5 married
7 married
9 married
1 single
2 single
4 single
6 single
8 single
10 single
rid age
2 25
1 30
6 35
3 40
4 45
7 50
8 55
5 60
9 65
10 70
rid salary
2 20
9 30
1 60
7 70
3 80
8 90
4 100
6 120
5 150
10 200
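The projection and one-time sort shown on the last two slides can be sketched in Python as follows. This is a minimal in-memory illustration, not the disk-based implementation; the names `ROWS`, `COLUMN`, and `build_attribute_list` are hypothetical and chosen only for this example.

```python
# Toy training set from the "Some Data" slide: (rid, age, salary, marital, car).
ROWS = [
    (1, 30, 60, "single", "sports"), (2, 25, 20, "single", "mini"),
    (3, 40, 80, "married", "van"), (4, 45, 100, "single", "luxury"),
    (5, 60, 150, "married", "luxury"), (6, 35, 120, "single", "sports"),
    (7, 50, 70, "married", "van"), (8, 55, 90, "single", "sports"),
    (9, 65, 30, "married", "mini"), (10, 70, 200, "single", "luxury"),
]
COLUMN = {"age": 1, "salary": 2, "marital": 3}  # hypothetical name -> column index


def build_attribute_list(rows, column):
    """Project rows onto (rid, value), then sort once by value (ties by rid)."""
    projected = [(row[0], row[column]) for row in rows]
    return sorted(projected, key=lambda pair: (pair[1], pair[0]))


# One presorted list per predictor attribute; sorting strings groups the categories.
attribute_lists = {name: build_attribute_list(ROWS, col) for name, col in COLUMN.items()}
```

Because the rid travels with every value, each list can be sorted independently and still be tied back to its record.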
7
SLIQ - Class List
N1
rid car LEAF
1 sports N1
2 mini N1
3 van N1
4 luxury N1
5 luxury N1
6 sports N1
7 van N1
8 sports N1
9 mini N1
10 luxury N1
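One possible in-memory representation of the class list is a mapping from rid to the class label and the current leaf. The sketch below assumes a plain dictionary; `CAR` and `make_class_list` are illustrative names, not part of SLIQ itself.

```python
# Class labels (car) by rid, copied from the "Some Data" slide.
CAR = {1: "sports", 2: "mini", 3: "van", 4: "luxury", 5: "luxury",
       6: "sports", 7: "van", 8: "sports", 9: "mini", 10: "luxury"}


def make_class_list(car_by_rid, root="N1"):
    """Return rid -> {"car": class label, "leaf": current leaf}; all start at the root."""
    return {rid: {"car": car, "leaf": root} for rid, car in car_by_rid.items()}


class_list = make_class_list(CAR)
```

Looking up an entry by rid is what lets a scan over any attribute list find the class and leaf of each record.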
8
SLIQ - Histograms
rid age
2 25
1 30
6 35
3 40
4 45
7 50
8 55
5 60
9 65
10 70
N1
rid car LEAF
1 sports N1
2 mini N1
3 van N1
4 luxury N1
5 luxury N1
6 sports N1
7 van N1
8 sports N1
9 mini N1
10 luxury N1
Initial histogram:
sports mini van luxury
L 0 0 0 0
R 3 2 2 3
Test age ≤ 25?
sports mini van luxury
L
R
Test age ≤ 30?
sports mini van luxury
L
R
Evaluate each split, using GINI or Entropy.
...
9
SLIQ - Histograms
rid age
2 25
1 30
6 35
3 40
4 45
7 50
8 55
5 60
9 65
10 70
N1
rid car LEAF
1 sports N1
2 mini N1
3 van N1
4 luxury N1
5 luxury N1
6 sports N1
7 van N1
8 sports N1
9 mini N1
10 luxury N1
Initial histogram:
sports mini van luxury
L 0 0 0 0
R 3 2 2 3
Test age ≤ 25:
sports mini van luxury
L 0 1 0 0
R 3 1 2 3
Test age ≤ 30:
sports mini van luxury
L 1 1 0 0
R 2 1 2 3
Evaluate each split, using GINI or Entropy.
...
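The scan over the age list can be sketched as below, using the Gini index as the splitting score. This is a sketch under assumptions: the weighted Gini of the two sides is one common choice of score, and the names `CLASS_ORDER`, `gini`, and `scan_numeric` are hypothetical.

```python
CLASS_ORDER = ("sports", "mini", "van", "luxury")
CAR = {1: "sports", 2: "mini", 3: "van", 4: "luxury", 5: "luxury",
       6: "sports", 7: "van", 8: "sports", 9: "mini", 10: "luxury"}
AGE_LIST = [(2, 25), (1, 30), (6, 35), (3, 40), (4, 45),
            (7, 50), (8, 55), (5, 60), (9, 65), (10, 70)]  # presorted (rid, age)


def gini(counts):
    """Gini index 1 - sum(p_k^2) of a class histogram; 0.0 for an empty one."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def scan_numeric(attr_list, car_by_rid):
    """Return [(v, L, R, score)] for each test (A <= v), in list order."""
    left = [0] * len(CLASS_ORDER)
    right = [0] * len(CLASS_ORDER)
    for rid, _ in attr_list:  # initially every record is on the R side
        right[CLASS_ORDER.index(car_by_rid[rid])] += 1
    n = len(attr_list)
    results = []
    for i, (rid, value) in enumerate(attr_list):
        k = CLASS_ORDER.index(car_by_rid[rid])
        left[k] += 1   # move this record from R to L
        right[k] -= 1
        if i == n - 1 or attr_list[i + 1][1] == value:
            continue  # skip the trivial final split and ties on equal values
        n_left = i + 1
        score = (n_left * gini(left) + (n - n_left) * gini(right)) / n
        results.append((value, tuple(left), tuple(right), score))
    return results


splits = scan_numeric(AGE_LIST, CAR)
```

The first two entries of `splits` carry the same L and R histograms as the tests age ≤ 25 and age ≤ 30 above; the test with the lowest score would be the best split on this attribute.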
10
SLIQ - Histograms
rid car LEAF
1 sports N1
2 mini N1
3 van N1
4 luxury N1
5 luxury N1
6 sports N1
7 van N1
8 sports N1
9 mini N1
10 luxury N1
rid salary
2 20
9 30
1 60
7 70
3 80
8 90
4 100
6 120
5 150
10 200
N1
Initial histogram:
sports mini van luxury
L 0 0 0 0
R 3 2 2 3
Test salary ≤ 20:
sports mini van luxury
L 0 1 0 0
R 3 1 2 3
Test salary ≤ 30:
sports mini van luxury
L 0 2 0 0
R 3 0 2 3
Evaluate each split, using GINI or Entropy.
...
11
SLIQ - Histograms
rid car LEAF
1 sports N1
2 mini N1
3 van N1
4 luxury N1
5 luxury N1
6 sports N1
7 van N1
8 sports N1
9 mini N1
10 luxury N1
rid marital
3 married
5 married
7 married
9 married
1 single
2 single
4 single
6 single
8 single
10 single
N1
Married
sports mini van luxury
Yes 0 1 2 1
No 3 1 0 2
Single
sports mini van luxury
Yes 3 1 0 2
No 0 1 2 1
Evaluate each split, using GINI or Entropy.
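For a categorical attribute, the histogram is a count matrix with one row per attribute value, and a split is a subset of those values. The sketch below enumerates every proper, non-empty subset, which is an assumption that only suits small domains; `count_matrix` and `best_subset` are hypothetical names.

```python
from itertools import combinations

CLASS_ORDER = ("sports", "mini", "van", "luxury")
CAR = {1: "sports", 2: "mini", 3: "van", 4: "luxury", 5: "luxury",
       6: "sports", 7: "van", 8: "sports", 9: "mini", 10: "luxury"}
MARITAL_LIST = [(3, "married"), (5, "married"), (7, "married"), (9, "married"),
                (1, "single"), (2, "single"), (4, "single"), (6, "single"),
                (8, "single"), (10, "single")]


def gini(counts):
    """Gini index of a class histogram; 0.0 for an empty one."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def count_matrix(attr_list, car_by_rid):
    """Return value -> class counts (ordered as CLASS_ORDER)."""
    matrix = {}
    for rid, value in attr_list:
        row = matrix.setdefault(value, [0] * len(CLASS_ORDER))
        row[CLASS_ORDER.index(car_by_rid[rid])] += 1
    return matrix


def best_subset(matrix):
    """Try each subset S for the test (A in S); return (score, S) with lowest score."""
    values = sorted(matrix)
    total = [sum(col) for col in zip(*matrix.values())]
    n = sum(total)
    best = None
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            left = [sum(matrix[v][k] for v in subset) for k in range(len(total))]
            right = [t - l for t, l in zip(total, left)]
            score = (sum(left) * gini(left) + sum(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, set(subset))
    return best


matrix = count_matrix(MARITAL_LIST, CAR)
score, subset = best_subset(matrix)
```

With only two values, the subsets {married} and {single} describe the same partition, so they receive the same score and the first one found is kept.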
12
SLIQ - Perform best split and Update Class List
N1: salary ≤ 60
rid car LEAF
1 sports N1
2 mini N1
3 van N1
4 luxury N1
5 luxury N1
6 sports N1
7 van N1
8 sports N1
9 mini N1
10 luxury N1
rid salary
2 20
9 30
1 60
7 70
3 80
8 90
4 100
6 120
5 150
10 200
Children of N1: N2 (salary ≤ 60) and N3 (salary > 60)
13
SLIQ - Perform best split and Update Class List
rid car LEAF
1 sports N2
2 mini N2
3 van N3
4 luxury N3
5 luxury N3
6 sports N3
7 van N3
8 sports N3
9 mini N2
10 luxury N3
rid salary
2 20
9 30
1 60
7 70
3 80
8 90
4 100
6 120
5 150
10 200
14
SLIQ - Histograms
rid age
2 25
1 30
6 35
3 40
4 45
7 50
8 55
5 60
9 65
10 70
rid car LEAF
1 sports N2
2 mini N2
3 van N3
4 luxury N3
5 luxury N3
6 sports N3
7 van N3
8 sports N3
9 mini N2
10 luxury N3
Initial histograms:
Leaf N2
sports mini van luxury
L 0 0 0 0
R 1 2 0 0
Leaf N3
sports mini van luxury
L 0 0 0 0
R 2 0 2 3
Test age ≤ 25?
Leaf N2
sports mini van luxury
L
R
Leaf N3
sports mini van luxury
L
R
Evaluate each split, using GINI or Entropy.
...
15
SLIQ - Histograms
rid age
2 25
1 30
6 35
3 40
4 45
7 50
8 55
5 60
9 65
10 70
rid car LEAF
1 sports N2
2 mini N2
3 van N3
4 luxury N3
5 luxury N3
6 sports N3
7 van N3
8 sports N3
9 mini N2
10 luxury N3
Initial histograms:
Leaf N2
sports mini van luxury
L 0 0 0 0
R 1 2 0 0
Leaf N3
sports mini van luxury
L 0 0 0 0
R 2 0 2 3
Test age ≤ 25:
Leaf N2
sports mini van luxury
L 0 1 0 0
R 1 1 0 0
Leaf N3
sports mini van luxury
L 0 0 0 0
R 2 0 2 3
Evaluate each split, using GINI or Entropy.
...
16
SLIQ - Pseudocode
  • Split evaluation
  • EvaluateSplits()
      • for each numeric attribute A do
          • for each value v in the attribute list do
              • find the corresponding entry in the class list,
                and hence the corresponding class and the
                leaf node Ni
              • update the class histogram in leaf Ni
              • compute the splitting score for test (A ≤ v) for Ni
      • for each categorical attribute A do
          • for each leaf of the tree do
              • find the subset of A with the best split
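The numeric branch of this pseudocode can be sketched as one pass over a presorted attribute list that updates a separate histogram for every current leaf. The sketch assumes the state after the salary ≤ 60 split (leaves N2 and N3), uses weighted Gini as the score, and assumes values are distinct within a leaf; `evaluate_numeric_splits` is a hypothetical name.

```python
CLASS_ORDER = ("sports", "mini", "van", "luxury")
CAR = {1: "sports", 2: "mini", 3: "van", 4: "luxury", 5: "luxury",
       6: "sports", 7: "van", 8: "sports", 9: "mini", 10: "luxury"}
LEAF = {1: "N2", 2: "N2", 9: "N2", 3: "N3", 4: "N3",
        5: "N3", 6: "N3", 7: "N3", 8: "N3", 10: "N3"}
CLASS_LIST = {rid: {"car": CAR[rid], "leaf": LEAF[rid]} for rid in CAR}
AGE_LIST = [(2, 25), (1, 30), (6, 35), (3, 40), (4, 45),
            (7, 50), (8, 55), (5, 60), (9, 65), (10, 70)]  # presorted (rid, age)


def gini(counts):
    """Gini index of a class histogram; 0.0 for an empty one."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def evaluate_numeric_splits(attr_list, class_list):
    """One scan of a presorted list; return leaf -> (best score, best v)."""
    n_classes = len(CLASS_ORDER)
    left, right = {}, {}
    for entry in class_list.values():  # every record starts on its leaf's R side
        left.setdefault(entry["leaf"], [0] * n_classes)
        counts = right.setdefault(entry["leaf"], [0] * n_classes)
        counts[CLASS_ORDER.index(entry["car"])] += 1
    best = {}
    for rid, value in attr_list:
        entry = class_list[rid]  # look up class and leaf by rid
        leaf, k = entry["leaf"], CLASS_ORDER.index(entry["car"])
        left[leaf][k] += 1       # update that leaf's histogram only
        right[leaf][k] -= 1
        n_left, n_right = sum(left[leaf]), sum(right[leaf])
        if n_right == 0:
            continue             # nothing left on the R side: not a real split
        score = (n_left * gini(left[leaf]) + n_right * gini(right[leaf])) / (n_left + n_right)
        if leaf not in best or score < best[leaf][0]:
            best[leaf] = (score, value)
    return best


best = evaluate_numeric_splits(AGE_LIST, CLASS_LIST)
```

A single scan therefore scores candidate splits for all leaves at once, which is what allows the tree to grow breadth-first without re-sorting.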

17
SLIQ - Pseudocode
  • Updating the class list
  • UpdateLabels()
      • for each split leaf Ni do
          • Let A be the split attribute for Ni.
          • for each (rid, v) in the attribute list for A do
              • find the corresponding entry e in the class
                list (using the rid)
              • if the leaf referenced by e is Ni then
                  • find the new leaf Nj to which (rid, v)
                    belongs (by applying the splitting test)
                  • update the leaf pointer for e to Nj
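A minimal sketch of the label update for one split leaf with a numeric test might look like this. The function name and its parameters are hypothetical, and class labels are left out of the class list because the update only touches leaf pointers.

```python
SALARY_LIST = [(2, 20), (9, 30), (1, 60), (7, 70), (3, 80),
               (8, 90), (4, 100), (6, 120), (5, 150), (10, 200)]  # (rid, salary)
class_list = {rid: {"leaf": "N1"} for rid, _ in SALARY_LIST}  # class labels omitted


def update_labels(attr_list, class_list, leaf, threshold, left_child, right_child):
    """Repoint every record of `leaf` to the child chosen by the test (A <= threshold)."""
    for rid, value in attr_list:
        entry = class_list[rid]  # find the class-list entry via the rid
        if entry["leaf"] == leaf:
            entry["leaf"] = left_child if value <= threshold else right_child


update_labels(SALARY_LIST, class_list, "N1", 60, "N2", "N3")
```

After the call, rids 1, 2, and 9 point to N2 and the remaining rids point to N3; the attribute lists themselves are never rewritten.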

18
SLIQ - bottleneck
  • The class-list must remain memory resident at all
    times!
  • Although this is rarely a problem with today's memory
    sizes, there may still be cases where it is a
    bottleneck.
  • So, what can we do when the class-list doesn't
    fit in main memory?
  • SPRINT is a solution...

19
SPRINT
rid age car
2 25 mini
1 30 sports
6 35 sports
3 40 van
4 45 luxury
7 50 van
8 55 sports
5 60 luxury
9 65 mini
10 70 luxury
rid salary car
2 20 mini
9 30 mini
1 60 sports
7 70 van
3 80 van
8 90 sports
4 100 luxury
6 120 sports
5 150 luxury
10 200 luxury
rid marital car
3 married van
5 married luxury
7 married van
9 married mini
1 single sports
2 single mini
4 single luxury
6 single sports
8 single sports
10 single luxury
The main data structures used in SPRINT
are attribute lists and class histograms.
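The difference from SLIQ is that every attribute-list record carries its class label, so no separate class list is needed. A small sketch, assuming in-memory tuples and the hypothetical helper `build_sprint_list`:

```python
# Toy training set from the "Some Data" slide: (rid, age, salary, marital, car).
ROWS = [
    (1, 30, 60, "single", "sports"), (2, 25, 20, "single", "mini"),
    (3, 40, 80, "married", "van"), (4, 45, 100, "single", "luxury"),
    (5, 60, 150, "married", "luxury"), (6, 35, 120, "single", "sports"),
    (7, 50, 70, "married", "van"), (8, 55, 90, "single", "sports"),
    (9, 65, 30, "married", "mini"), (10, 70, 200, "single", "luxury"),
]


def build_sprint_list(rows, column):
    """Return [(rid, value, car)] sorted by value; the class travels with each record."""
    records = [(row[0], row[column], row[4]) for row in rows]
    return sorted(records, key=lambda rec: (rec[1], rec[0]))


sprint_lists = {
    "age": build_sprint_list(ROWS, 1),
    "salary": build_sprint_list(ROWS, 2),
    "marital": build_sprint_list(ROWS, 3),
}
```

The class histograms can then be filled by scanning a single list, at the cost of storing the class label once per attribute.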
20
SPRINT - Histograms
rid age car
2 25 mini
1 30 sports
6 35 sports
3 40 van
4 45 luxury
7 50 van
8 55 sports
5 60 luxury
9 65 mini
10 70 luxury
Initial histogram:
sports mini van luxury
L 0 0 0 0
R 3 2 2 3
Test age ≤ 25:
sports mini van luxury
L 0 1 0 0
R 3 1 2 3
Test age ≤ 30:
sports mini van luxury
L 1 1 0 0
R 2 1 2 3
Evaluate each split, using GINI or Entropy.
...
21
SPRINT - Histograms
rid salary car
2 20 mini
9 30 mini
1 60 sports
7 70 van
3 80 van
8 90 sports
4 100 luxury
6 120 sports
5 150 luxury
10 200 luxury
Initial histogram:
sports mini van luxury
L 0 0 0 0
R 3 2 2 3
Test salary ≤ 20:
sports mini van luxury
L 0 1 0 0
R 3 1 2 3
Test salary ≤ 30:
sports mini van luxury
L 0 2 0 0
R 3 0 2 3
Evaluate each split, using GINI or Entropy.
...
22
SPRINT - Histograms
rid marital car
3 married van
5 married luxury
7 married van
9 married mini
1 single sports
2 single mini
4 single luxury
6 single sports
8 single sports
10 single luxury
Married
sports mini van luxury
Yes 0 1 2 1
No 3 1 0 2
Single
sports mini van luxury
Yes 3 1 0 2
No 0 1 2 1
Evaluate each split, using GINI or Entropy.
23
SPRINT - Performing Best Split
  • Once the best split point has been found for a
    node, we execute the split by creating child
    nodes.
  • This requires splitting the node's list for every
    attribute into two.
  • Partitioning the attribute list of the winning
    attribute (salary) is easy.
  • We scan the list, apply the split test, and move
    the records to two new attribute lists - one for
    each new child.

24
SPRINT - Performing Best Split
  • Unfortunately, for the remaining attribute lists
    of the node (age and marital), we have no test
    that we can apply to the attribute values to
    decide how to divide the records.
  • Solution: use the rids.
  • As we partition the list of the splitting
    attribute (i.e., salary), we insert the rid of
    each record into a probe structure (hash table),
    noting to which child the record was moved.
  • Once we have collected all the rids, we scan the
    lists of the remaining attributes and probe the
    hash table with the rid of each record.
  • The retrieved information tells us with which
    child to place the record.
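The rid-based partitioning described above can be sketched with a Python dictionary standing in for the hash table. The sketch assumes a numeric test of the form value ≤ threshold; `LISTS` and `split_node` are hypothetical names.

```python
# Toy training set from the "Some Data" slide: (rid, age, salary, marital, car).
ROWS = [
    (1, 30, 60, "single", "sports"), (2, 25, 20, "single", "mini"),
    (3, 40, 80, "married", "van"), (4, 45, 100, "single", "luxury"),
    (5, 60, 150, "married", "luxury"), (6, 35, 120, "single", "sports"),
    (7, 50, 70, "married", "van"), (8, 55, 90, "single", "sports"),
    (9, 65, 30, "married", "mini"), (10, 70, 200, "single", "luxury"),
]
COLUMN = {"age": 1, "salary": 2, "marital": 3}
# SPRINT-style lists of (rid, value, car), each sorted by value.
LISTS = {name: sorted(((r[0], r[c], r[4]) for r in ROWS), key=lambda rec: (rec[1], rec[0]))
         for name, c in COLUMN.items()}


def split_node(lists, split_attr, threshold):
    """Split every attribute list of a node into lists for the two children."""
    probe = {}  # rid -> True if the record went to the left child
    left = {name: [] for name in lists}
    right = {name: [] for name in lists}
    for rec in lists[split_attr]:  # apply the test to the winning attribute
        goes_left = rec[1] <= threshold
        probe[rec[0]] = goes_left
        (left if goes_left else right)[split_attr].append(rec)
    for name, records in lists.items():  # probe with the rid for all other lists
        if name == split_attr:
            continue
        for rec in records:
            (left if probe[rec[0]] else right)[name].append(rec)
    return left, right


left, right = split_node(LISTS, "salary", 60)
```

Because each list is scanned in order and records are appended to the child lists, the children inherit the sorted order and never need to be re-sorted.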

25
SPRINT - Performing Best Split
  • If the hash table is too large for memory,
    splitting is done in more than one step.
  • The attribute list for the splitting attribute is
    partitioned up to the attribute record for which
    the hash table still fits in memory.
  • The corresponding portions of the attribute lists
    of the non-splitting attributes are then
    partitioned, and the process is repeated for the
    remainder of the attribute list of the splitting
    attribute.
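A multi-step split could be sketched as below, with `max_probe` as a hypothetical limit on how many rids the hash table may hold. Each pass produces sorted runs for the non-splitting attributes. How those runs are recombined is not stated on the slide, so merging them with `heapq.merge` is an assumption of this sketch.

```python
import heapq

# Toy training set from the "Some Data" slide: (rid, age, salary, marital, car).
ROWS = [
    (1, 30, 60, "single", "sports"), (2, 25, 20, "single", "mini"),
    (3, 40, 80, "married", "van"), (4, 45, 100, "single", "luxury"),
    (5, 60, 150, "married", "luxury"), (6, 35, 120, "single", "sports"),
    (7, 50, 70, "married", "van"), (8, 55, 90, "single", "sports"),
    (9, 65, 30, "married", "mini"), (10, 70, 200, "single", "luxury"),
]
COLUMN = {"age": 1, "salary": 2, "marital": 3}
LISTS = {name: sorted(((r[0], r[c], r[4]) for r in ROWS), key=lambda rec: (rec[1], rec[0]))
         for name, c in COLUMN.items()}


def split_in_passes(lists, split_attr, threshold, max_probe):
    """Split all lists while holding at most max_probe rids in the probe table."""
    others = [name for name in lists if name != split_attr]
    child = {side: {split_attr: []} for side in ("L", "R")}
    runs = {side: {name: [] for name in others} for side in ("L", "R")}
    records = lists[split_attr]
    for start in range(0, len(records), max_probe):
        probe = {}  # rid -> side, for this portion of the splitting list only
        for rec in records[start:start + max_probe]:
            side = "L" if rec[1] <= threshold else "R"
            probe[rec[0]] = side
            child[side][split_attr].append(rec)
        for name in others:  # one scan of each other list per pass
            run = {"L": [], "R": []}
            for rec in lists[name]:
                side = probe.get(rec[0])
                if side is not None:
                    run[side].append(rec)
            for side in ("L", "R"):
                runs[side][name].append(run[side])
    for side in ("L", "R"):  # assumption: merge the sorted runs to restore value order
        for name in others:
            child[side][name] = list(heapq.merge(*runs[side][name], key=lambda rec: rec[1]))
    return child["L"], child["R"]


left, right = split_in_passes(LISTS, "salary", 60, max_probe=4)
```

With `max_probe=4` the ten salary records are handled in three passes. The trade-off is that every non-splitting list is scanned once per pass instead of once in total.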