SLIQ and SPRINT for disk resident data

Slides: 26

Provided by: alext8

Transcript and Presenter's Notes

1
SLIQ and SPRINT for disk-resident data
2
SLIQ

SLIQ is a decision tree classifier that can
handle both numerical and categorical attributes
Builds compact and accurate trees
Uses a pre-sorting technique in the tree growing
phase
Suitable for classification of large
disk-resident datasets.

3
Issues

There are two major performance-critical issues
in the tree-growth phase:
How to find split points
How to partition the data
The well-known decision tree classifiers:
Grow trees depth-first
Repeatedly sort the data at every node
SLIQ:
Replaces this repeated sorting with a one-time sort
Uses a new data structure called the class-list
The class-list must remain memory-resident at all
times

4
Some Data
5
SLIQ - Attribute Lists
These are projections on (rid, attribute).
6
SLIQ - Sort Numeric, Group Categorical
7
SLIQ - Class List
N1
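
The attribute lists and the class list described on the last three slides can be sketched roughly as follows. The records are hypothetical (the deck's own data table did not survive in this transcript), and the function and column names are illustrative assumptions rather than anything taken from SLIQ itself.

```python
# Hypothetical training records (NOT the deck's data):
# (age, salary, marital_status, class_label); the rid is the list index.
records = [
    (30, 65, "Single", "G"),
    (23, 15, "Married", "B"),
    (40, 75, "Married", "G"),
    (55, 40, "Single", "B"),
    (55, 100, "Married", "G"),
    (45, 60, "Single", "G"),
]


def build_sliq_structures(records, columns):
    """Return (attribute_lists, class_list) for the named columns."""
    # Class list: one [class_label, leaf] entry per rid.
    # Every record starts out at the root leaf N1.
    class_list = [[rec[-1], "N1"] for rec in records]
    attribute_lists = {}
    for pos, name in enumerate(columns):
        # Projection onto (value, rid), sorted once up front.
        # Sorting a categorical column simply groups equal values.
        attribute_lists[name] = sorted(
            (rec[pos], rid) for rid, rec in enumerate(records)
        )
    return attribute_lists, class_list


attribute_lists, class_list = build_sliq_structures(
    records, ("age", "salary", "marital")
)
```

In this sketch only class_list would need to stay in memory; each attribute list could be written to disk and read back sequentially.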
8
SLIQ - Histograms
N1
age ≤ 25 ?
age ≤ 30 ?
Evaluate each split, using GINI or Entropy.
...
9
SLIQ - Histograms
N1
age ≤ 25
age ≤ 30
Evaluate each split, using GINI or Entropy.
...
10
SLIQ - Histograms
N1
salary ≤ 20
salary ≤ 30
Evaluate each split, using GINI or Entropy.
...
11
SLIQ - Histograms
N1
Married
Single
Evaluate each split, using GINI or Entropy.
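
As a rough illustration of "evaluate each split", the functions below score a candidate split from the two class histograms it produces (records that satisfy the test and records that do not). The function names and the example counts are assumptions made for this sketch; a lower score means a purer split.

```python
import math


def gini(hist):
    """Gini index of a class histogram {class_label: count}."""
    n = sum(hist.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in hist.values())


def entropy(hist):
    """Entropy (in bits) of a class histogram {class_label: count}."""
    n = sum(hist.values())
    return -sum((c / n) * math.log2(c / n) for c in hist.values() if c > 0)


def split_score(below, above, impurity=gini):
    """Weighted impurity of a binary split given its two histograms."""
    nb, na = sum(below.values()), sum(above.values())
    return (nb * impurity(below) + na * impurity(above)) / (nb + na)


# Hypothetical histograms for one candidate test at a node:
below = {"G": 0, "B": 2}   # records that satisfy the test
above = {"G": 4, "B": 0}   # records that do not
score = split_score(below, above)
```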
12
SLIQ - Perform best split and Update Class List
N1
salary ≤ 60
N2
N3
13
SLIQ - Perform best split and Update Class List
14
SLIQ - Histograms
N1
N2
N1
age ≤ 25 ?
N2
Evaluate each split, using GINI or Entropy.
...
15
SLIQ - Histograms
N1
N2
N1
age ≤ 25
N2
Evaluate each split, using GINI or Entropy.
...
16
SLIQ - Pseudocode

Split evaluation
EvaluateSplits()
  for each numeric attribute A do
    for each value v in the attribute list do
      find the corresponding entry in the class list,
        and hence the corresponding class and the
        leaf node Ni
      update the class histogram in leaf Ni
      compute splitting score for test (A ≤ v) for Ni
  for each categorical attribute A do
    for each leaf of the tree do
      find subset of A with best split
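
The numeric half of EvaluateSplits() might look like the sketch below: a single scan of one pre-sorted attribute list updates the histograms of every current leaf. The data, the function names and the choice of the Gini index are assumptions for illustration. One detail the pseudocode leaves implicit is duplicate values: here a threshold is scored only once all of a leaf's records with that value have been moved into its "below" histogram.

```python
from collections import Counter, defaultdict


def gini(hist):
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0


def weighted_gini(below, above):
    nb, na = sum(below.values()), sum(above.values())
    return (nb * gini(below) + na * gini(above)) / (nb + na)


def evaluate_numeric_splits(attr_list, class_list):
    """attr_list: sorted [(value, rid)]; class_list[rid] = (label, leaf).

    Returns {leaf: (best_score, threshold)} for tests "A <= threshold".
    """
    above = defaultdict(Counter)   # per leaf: records not yet scanned
    below = defaultdict(Counter)   # per leaf: records with value <= v
    for label, leaf in class_list:
        above[leaf][label] += 1
    best, last = {}, {}
    for value, rid in attr_list:
        label, leaf = class_list[rid]
        if leaf in last and last[leaf] != value:
            # Every record of this leaf with value <= last[leaf] has been
            # seen, so the test "A <= last[leaf]" can be scored now.
            score = weighted_gini(below[leaf], above[leaf])
            if leaf not in best or score < best[leaf][0]:
                best[leaf] = (score, last[leaf])
        below[leaf][label] += 1
        above[leaf][label] -= 1
        last[leaf] = value
    return best


# Hypothetical data: six records, all still in the root leaf N1.
class_list = [("G", "N1"), ("B", "N1"), ("G", "N1"),
              ("B", "N1"), ("G", "N1"), ("G", "N1")]
salary_list = [(15, 1), (40, 3), (60, 5), (65, 0), (75, 2), (100, 4)]
best = evaluate_numeric_splits(salary_list, class_list)
```

The largest value in each leaf is never scored, since that test would send every record to the same child.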

17
SLIQ - Pseudocode

Updating the class list
UpdateLabels()
  for each split leaf Ni do
    Let A be the split attribute for Ni.
    for each (rid,v) in the attribute list for A do
      find the corresponding entry in the class list
        e (using the rid)
      if the leaf referenced by e is Ni then
        find the new leaf Nj to which (rid,v)
          belongs
          (by applying the splitting test)
        update the leaf pointer for e to Nj
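
A minimal sketch of UpdateLabels() for a single numeric split, assuming the class list is a list of [label, leaf] entries indexed by rid. The data is hypothetical; the test salary ≤ 60 and the leaf names N1, N2 and N3 follow the earlier slide.

```python
def update_labels(attr_list, class_list, leaf, threshold, left, right):
    """Re-point class-list entries of `leaf` to its two new children.

    attr_list: [(value, rid)] for the split attribute of `leaf`.
    class_list[rid] = [label, current_leaf]; updated in place.
    """
    for value, rid in attr_list:
        entry = class_list[rid]          # look up the entry via the rid
        if entry[1] == leaf:             # record currently sits in this leaf
            entry[1] = left if value <= threshold else right


# Hypothetical data: six records, all in the root leaf N1.
class_list = [["G", "N1"], ["B", "N1"], ["G", "N1"],
              ["B", "N1"], ["G", "N1"], ["G", "N1"]]
salary_list = [(15, 1), (40, 3), (60, 5), (65, 0), (75, 2), (100, 4)]

# Apply the split "salary <= 60" at N1, creating children N2 and N3.
update_labels(salary_list, class_list, "N1", 60, "N2", "N3")
```

Only the class list changes; the attribute lists themselves are never rewritten in this sketch.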

18
SLIQ - bottleneck

The class-list must remain memory-resident at all
times!
Although not a big problem with today's memories,
still there might be cases where this is a
bottleneck.
So, what can we do when the class-list doesn't
fit in main memory?
SPRINT is a solution...

19
SPRINT
The main data structures used in SPRINT
are Attribute lists and Class histograms
20
SPRINT - Histograms
age ≤ 25
age ≤ 30
Evaluate each split, using GINI or Entropy.
...
21
SPRINT - Histograms
salary ≤ 20
salary ≤ 30
Evaluate each split, using GINI or Entropy.
...
22
SPRINT - Histograms
Married
Single
Evaluate each split, using GINI or Entropy.
23
SPRINT - Performing Best Split

Once the best split point has been found for a
node, we execute the split by creating child
nodes.
This requires splitting the node's lists for every
attribute into two.
Partitioning the attribute list of the winning
attribute (salary) is easy.
We scan the list, apply the split test, and move
the records to two new attribute lists - one for
each new child.

24
SPRINT - Performing Best Split

Unfortunately, for the remaining attribute lists
of the node (age and marital), we have no test
that we can apply to the attribute values to
decide how to divide the records.
Solution: use the rids.
As we partition the list of the splitting
attribute (i.e. salary), we insert the rids of
each record into a probe structure (hash table),
noting to which child the record was moved.
Once we have collected all the rids, we scan the
lists of the remaining attributes and probe the
hash table with the rid of each record.
The retrieved information tells us with which
child to place the record.
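
The rid-based split described above could be sketched like this. It assumes each SPRINT attribute-list entry is a (value, class_label, rid) tuple and uses a plain dict as the probe structure. The data and names are hypothetical.

```python
def split_attribute_lists(attr_lists, split_attr, threshold):
    """Split every attribute list of a node into left/right child lists."""
    left = {name: [] for name in attr_lists}
    right = {name: [] for name in attr_lists}
    probe = {}  # hash table: rid -> True if the record went to the left child

    # 1. Partition the winning attribute's list with the split test,
    #    recording each rid's destination in the probe structure.
    for value, label, rid in attr_lists[split_attr]:
        goes_left = value <= threshold
        probe[rid] = goes_left
        (left if goes_left else right)[split_attr].append((value, label, rid))

    # 2. Scan the remaining lists and probe the hash table with each rid.
    for name, entries in attr_lists.items():
        if name == split_attr:
            continue
        for entry in entries:
            (left if probe[entry[2]] else right)[name].append(entry)
    return left, right


# Hypothetical node with six records; each list is pre-sorted by value.
attr_lists = {
    "age": [(23, "B", 1), (30, "G", 0), (40, "G", 2),
            (45, "G", 5), (55, "B", 3), (55, "G", 4)],
    "salary": [(15, "B", 1), (40, "B", 3), (60, "G", 5),
               (65, "G", 0), (75, "G", 2), (100, "G", 4)],
    "marital": [("Married", "B", 1), ("Married", "G", 2), ("Married", "G", 4),
                ("Single", "G", 0), ("Single", "B", 3), ("Single", "G", 5)],
}
left, right = split_attribute_lists(attr_lists, "salary", 60)
```

Because each list is scanned in order, the child lists stay sorted and need no re-sorting.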

25
SPRINT - Performing Best Split

If the hash-table is too large for the memory,
splitting is done in more than one step.
The attribute list for the splitting attribute is
partitioned only up to the record at which
the hash table still fits in memory.
The corresponding portions of the attribute lists
of the non-splitting attributes are then partitioned,
and the process is repeated for the remainder of the
attribute list of the splitting attribute.
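
One way to picture the multi-step split is the sketch below, in which max_rids stands in for "as many rids as fit in memory". Each pass produces a sorted run per child list. Merging those runs at the end to restore the sort order is an assumption of this sketch, not something the slides specify. Data and names are hypothetical.

```python
import heapq


def split_in_passes(attr_lists, split_attr, threshold, max_rids):
    """Split a node's attribute lists with a probe table of bounded size."""
    names = list(attr_lists)
    sides = ("left", "right")
    runs = {s: {n: [] for n in names} for s in sides}  # sorted runs per pass
    split_list = attr_lists[split_attr]

    for start in range(0, len(split_list), max_rids):
        probe = {}                                   # at most max_rids entries
        run = {s: {n: [] for n in names} for s in sides}
        # Partition the next portion of the splitting attribute's list.
        for value, label, rid in split_list[start:start + max_rids]:
            side = "left" if value <= threshold else "right"
            probe[rid] = side
            run[side][split_attr].append((value, label, rid))
        # Partition the matching portions of the non-splitting lists;
        # entries whose rid is not in the probe table wait for a later pass.
        for name in names:
            if name == split_attr:
                continue
            for entry in attr_lists[name]:
                side = probe.get(entry[2])
                if side is not None:
                    run[side][name].append(entry)
        for s in sides:
            for n in names:
                runs[s][n].append(run[s][n])

    # Assumption of this sketch: merge the per-pass runs back into one
    # sorted list per child and attribute.
    return tuple(
        {n: list(heapq.merge(*runs[s][n])) for n in names} for s in sides
    )


# Hypothetical node with six records; each list is pre-sorted by value.
attr_lists = {
    "age": [(23, "B", 1), (30, "G", 0), (40, "G", 2),
            (45, "G", 5), (55, "B", 3), (55, "G", 4)],
    "salary": [(15, "B", 1), (40, "B", 3), (60, "G", 5),
               (65, "G", 0), (75, "G", 2), (100, "G", 4)],
}
left, right = split_in_passes(attr_lists, "salary", 60, max_rids=2)
```

This version rescans each non-splitting list in full on every pass, which keeps the code short but costs extra I/O.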
