SLIQ and SPRINT for disk resident data

Provided by: alext8
1
SLIQ and SPRINT for disk resident data
2
SLIQ
  • SLIQ is a decision tree classifier that can
    handle both numerical and categorical attributes
  • Builds compact and accurate trees
  • Uses a pre-sorting technique in the tree growing
    phase
  • Suitable for classification of large
    disk-resident datasets.

3
Issues
  • There are two major performance-critical issues
    in the tree-growth phase:
      • How to find split points
      • How to partition the data
  • The well-known decision tree classifiers:
      • Grow trees depth-first
      • Repeatedly sort the data at every node
  • SLIQ:
      • Replaces this repeated sorting with a one-time sort
      • Uses a new data structure called the class list
      • The class list must remain memory resident at all
        times

4
Some Data
5
SLIQ - Attribute Lists
These are projections on (rid, attribute).
6
SLIQ - Sort Numeric, Group Categorical
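As a rough illustration of the attribute lists and the one-time sort, here is a minimal Python sketch. The records, the column names, and the helper names `build_attribute_lists` and `build_class_list` are assumptions made for this example; the data shown on the original slide did not survive in the transcript. The class list is modelled as a list indexed by rid that holds the class label and the current leaf.

```python
# Hypothetical training set; the values are invented for illustration.
# Columns: age, salary, marital status, class label. The list index is the rid.
RECORDS = [
    (30, 65, "Single", "G"),
    (23, 15, "Married", "B"),
    (40, 75, "Single", "G"),
    (55, 40, "Married", "B"),
    (55, 100, "Single", "G"),
    (45, 60, "Married", "B"),
]
COLUMNS = {"age": 0, "salary": 1, "marital": 2}
LABEL_COLUMN = 3


def build_attribute_lists(records):
    """Project each attribute to (rid, value) pairs and sort once by value.

    Sorting a numeric list orders it for the split scan; sorting a
    categorical list simply groups equal values together.
    """
    lists = {}
    for name, col in COLUMNS.items():
        entries = [(rid, row[col]) for rid, row in enumerate(records)]
        entries.sort(key=lambda e: e[1])  # stable: ties keep rid order
        lists[name] = entries
    return lists


def build_class_list(records, root="N1"):
    """One [class label, current leaf] entry per rid; all start at the root."""
    return [[row[LABEL_COLUMN], root] for row in records]


attribute_lists = build_attribute_lists(RECORDS)
class_list = build_class_list(RECORDS)
```

Only the class list needs random access by rid; the attribute lists are read sequentially, which is what makes them suitable for disk.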
7
SLIQ - Class List
N1
8
SLIQ - Histograms
N1
age ≤ 25 ?
age ≤ 30 ?
Evaluate each split, using GINI or Entropy.
...
9
SLIQ - Histograms
N1
age ≤ 25
age ≤ 30
Evaluate each split, using GINI or Entropy.
...
10
SLIQ - Histograms
N1
salary ≤ 20
salary ≤ 30
Evaluate each split, using GINI or Entropy.
...
11
SLIQ - Histograms
N1
Married
Single
Evaluate each split, using GINI or Entropy.
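The slides name GINI and entropy as scoring functions without spelling them out. The sketch below uses the standard definitions and scores a binary split as the size-weighted impurity of its two sides, with lower being better. The function names and the example histograms are assumptions for illustration.

```python
from math import log2


def gini(hist):
    """Gini index of a class histogram {label: count}: 1 - sum(p_i^2)."""
    n = sum(hist.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in hist.values())


def entropy(hist):
    """Entropy in bits of a class histogram: -sum(p_i * log2(p_i))."""
    n = sum(hist.values())
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in hist.values() if c > 0)


def split_score(left, right, impurity=gini):
    """Size-weighted impurity of the two sides of a split (lower is better)."""
    n_left, n_right = sum(left.values()), sum(right.values())
    n = n_left + n_right
    return (n_left * impurity(left) + n_right * impurity(right)) / n


# Invented histograms for a test such as "salary <= 60":
# three B records fall on the left, three G records on the right.
perfect = split_score({"G": 0, "B": 3}, {"G": 3, "B": 0})
mixed = split_score({"G": 1, "B": 1}, {"G": 2, "B": 2})
```

A split that separates the classes completely scores 0 under either measure, while a split that leaves both sides evenly mixed scores 0.5 with Gini and 1 bit with entropy.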
12
SLIQ - Perform best split and Update Class List
N1
salary ≤ 60
N2
N3
13
SLIQ - Perform best split and Update Class List
14
SLIQ - Histograms
N1
N2
N1
age ≤ 25 ?
N2
Evaluate each split, using GINI or Entropy.
...
15
SLIQ - Histograms
N1
N2
N1
age ≤ 25
N2
Evaluate each split, using GINI or Entropy.
...
16
SLIQ - Pseudocode
  • Split evaluation
  • EvaluateSplits()
      • for each numeric attribute A do
          • for each value v in the attribute list do
              • find the corresponding entry in the class list,
                and hence the corresponding class and the
                leaf node Ni
              • update the class histogram in leaf Ni
              • compute the splitting score for the test (A ≤ v) for Ni
      • for each categorical attribute A do
          • for each leaf of the tree do
              • find the subset of A with the best split
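The numeric branch of EvaluateSplits() can be sketched in Python as follows. This is one possible reading of the pseudocode, not a reference implementation: the data is invented, the name `evaluate_numeric_splits` is hypothetical, Gini is used as the score, and equal values are processed as a group so that a test A ≤ v is scored only after every record with value v has been counted.

```python
from collections import Counter, defaultdict
from itertools import groupby


def gini(hist):
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0


def evaluate_numeric_splits(attr_list, class_list):
    """Scan one pre-sorted attribute list and score tests (A <= v) per leaf.

    attr_list:  [(rid, value), ...] sorted by value.
    class_list: class_list[rid] == [class label, current leaf].
    Returns {leaf: (best score, threshold v)}.
    """
    total = defaultdict(Counter)  # class counts of all records in each leaf
    for label, leaf in class_list:
        total[leaf][label] += 1
    below = defaultdict(Counter)  # class counts of records with A <= v so far
    best = {}
    for v, group in groupby(attr_list, key=lambda e: e[1]):
        touched = set()
        for rid, _ in group:
            label, leaf = class_list[rid]  # class-list lookup by rid
            below[leaf][label] += 1        # update the leaf's histogram
            touched.add(leaf)
        for leaf in touched:
            above = total[leaf] - below[leaf]
            n_below, n_above = sum(below[leaf].values()), sum(above.values())
            if n_above == 0:
                continue  # every record of the leaf is on one side
            score = (n_below * gini(below[leaf]) + n_above * gini(above)) / (
                n_below + n_above
            )
            if leaf not in best or score < best[leaf][0]:
                best[leaf] = (score, v)
    return best


# Invented example: salary attribute list (sorted) and a class list at root N1.
salary_list = [(1, 15), (3, 40), (5, 60), (0, 65), (2, 75), (4, 100)]
class_list = [["G", "N1"], ["B", "N1"], ["G", "N1"],
              ["B", "N1"], ["G", "N1"], ["B", "N1"]]
best = evaluate_numeric_splits(salary_list, class_list)
```

Because the histograms are keyed by leaf, a single scan of the attribute list evaluates candidate splits for all current leaves at once, which is what allows the tree to be grown level by level.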

17
SLIQ - Pseudocode
  • Updating the class list
  • UpdateLabels()
      • for each split leaf Ni do
          • let A be the split attribute for Ni
          • for each (rid, v) in the attribute list for A do
              • find the corresponding entry e in the class list
                (using the rid)
              • if the leaf referenced by e is Ni then
                  • find the new leaf Nj to which (rid, v)
                    belongs (by applying the splitting test)
                  • update the leaf pointer for e to Nj
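A compact Python sketch of UpdateLabels() follows. It assumes numeric tests of the form A ≤ v only, and a hypothetical `splits` mapping from each leaf being split to its attribute, threshold, and two child names; none of these names come from the slides, and the data is invented.

```python
def update_labels(class_list, attribute_lists, splits):
    """Redirect class-list leaf pointers after the chosen splits.

    class_list:      class_list[rid] == [class label, current leaf].
    attribute_lists: {attribute: [(rid, value), ...]}.
    splits:          {leaf: (attribute, threshold, left child, right child)}.
    """
    for leaf, (attr, threshold, left, right) in splits.items():
        for rid, v in attribute_lists[attr]:
            entry = class_list[rid]  # look up the class-list entry by rid
            if entry[1] == leaf:     # only records currently in this leaf move
                entry[1] = left if v <= threshold else right


# Invented example: split the root N1 on salary <= 60 into N2 and N3.
attribute_lists = {
    "salary": [(1, 15), (3, 40), (5, 60), (0, 65), (2, 75), (4, 100)],
}
class_list = [["G", "N1"], ["B", "N1"], ["G", "N1"],
              ["B", "N1"], ["G", "N1"], ["B", "N1"]]
update_labels(class_list, attribute_lists, {"N1": ("salary", 60, "N2", "N3")})
```

Note that the attribute lists themselves are never rewritten; only the leaf pointers in the class list change.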

18
SLIQ - bottleneck
  • The class list must remain memory resident at all
    times!
  • Although this is rarely a problem with today's memory
    sizes, there may still be cases where it is a
    bottleneck.
  • So, what can we do when the class-list doesn't
    fit in main memory?
  • SPRINT is a solution...

19
SPRINT
The main data structures used in SPRINT
are attribute lists and class histograms.
20
SPRINT - Histograms
age ≤ 25
age ≤ 30
Evaluate each split, using GINI or Entropy.
...
21
SPRINT - Histograms
salary ≤ 20
salary ≤ 30
Evaluate each split, using GINI or Entropy.
...
22
SPRINT - Histograms
Married
Single
Evaluate each split, using GINI or Entropy.
23
SPRINT - Performing Best Split
  • Once the best split point has been found for a
    node, we execute the split by creating child
    nodes.
  • This requires splitting the node's attribute lists,
    one per attribute, into two.
  • Partitioning the attribute list of the winning
    attribute (salary) is easy.
  • We scan the list, apply the split test, and move
    the records to two new attribute lists - one for
    each new child.

24
SPRINT - Performing Best Split
  • Unfortunately, for the remaining attribute lists
    of the node (age and marital), we have no test
    that we can apply to the attribute values to
    decide how to divide the records.
  • Solution: use the rids.
  • As we partition the list of the splitting
    attribute (i.e. salary), we insert the rids of
    each record into a probe structure (hash table),
    noting to which child the record was moved.
  • Once we have collected all the rids, we scan the
    lists of the remaining attributes and probe the
    hash table with the rid of each record.
  • The retrieved information tells us in which
    child to place the record.
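The rid-based split described above can be sketched as follows. The sketch assumes each attribute-list entry is a (value, class label, rid) tuple and uses a plain Python dict as the probe structure; the function name `split_attribute_lists` and the data are invented for illustration.

```python
def split_attribute_lists(attr_lists, split_attr, threshold):
    """Split every attribute list of a node between two children.

    attr_lists: {attribute: [(value, class label, rid), ...]}, each sorted
    by value. The winning attribute is split by applying the test
    value <= threshold; the other lists are split by probing a hash table
    keyed on rid.
    """
    left = {name: [] for name in attr_lists}
    right = {name: [] for name in attr_lists}
    goes_left = {}  # probe structure: rid -> True if the record went left

    # Pass 1: scan the winning attribute's list, apply the test, record rids.
    for entry in attr_lists[split_attr]:
        value, _, rid = entry
        goes_left[rid] = value <= threshold
        (left if goes_left[rid] else right)[split_attr].append(entry)

    # Pass 2: scan the remaining lists and probe the table with each rid.
    for name, entries in attr_lists.items():
        if name == split_attr:
            continue
        for entry in entries:
            (left if goes_left[entry[2]] else right)[name].append(entry)
    return left, right


# Invented example lists, already sorted by value.
node_lists = {
    "salary": [(15, "B", 1), (40, "B", 3), (60, "B", 5),
               (65, "G", 0), (75, "G", 2), (100, "G", 4)],
    "age": [(23, "B", 1), (30, "G", 0), (40, "G", 2),
            (45, "B", 5), (55, "B", 3), (55, "G", 4)],
}
left, right = split_attribute_lists(node_lists, "salary", 60)
```

Because each list is scanned in order and entries are appended as they are met, the children's lists remain sorted and no re-sorting is needed after the split.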

25
SPRINT - Performing Best Split
  • If the hash table is too large for memory,
    splitting is done in more than one step.
  • The attribute list for the splitting attribute is
    partitioned up to the attribute record for which
    the hash table will still fit in memory.
  • The corresponding portions of the attribute lists of
    the non-splitting attributes are partitioned, and the
    process is repeated for the remainder of the attribute
    list of the splitting attribute.
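A multi-step split of this kind might look like the sketch below. It assumes (value, class label, rid) entries and a hypothetical `max_table_size` limit on the number of rids held in the hash table at once. The slides do not say how the sort order of the non-splitting lists is kept across steps; merging the per-step runs with `heapq.merge` is a choice made for this sketch only.

```python
from heapq import merge


def split_in_steps(attr_lists, split_attr, threshold, max_table_size):
    """Split a node's attribute lists with a bounded hash table.

    Returns (left, right), each a dict {attribute: sorted entry list}.
    """
    others = [name for name in attr_lists if name != split_attr]
    children = ({split_attr: []}, {split_attr: []})  # index 0 left, 1 right
    runs = ({n: [] for n in others}, {n: [] for n in others})
    pending = {n: list(attr_lists[n]) for n in others}
    split_list = attr_lists[split_attr]

    for start in range(0, len(split_list), max_table_size):
        # Partition only as much of the splitting list as the table can hold.
        side_of = {}
        for entry in split_list[start:start + max_table_size]:
            side = 0 if entry[0] <= threshold else 1
            side_of[entry[2]] = side
            children[side][split_attr].append(entry)
        # Move the matching portion of each non-splitting list.
        for name in others:
            run, remaining = ([], []), []
            for entry in pending[name]:
                side = side_of.get(entry[2])
                if side is None:
                    remaining.append(entry)  # rid not seen yet: later step
                else:
                    run[side].append(entry)
            pending[name] = remaining
            runs[0][name].append(run[0])
            runs[1][name].append(run[1])

    # Each run is sorted by value; merge the runs to restore one sorted list.
    for side in (0, 1):
        for name in others:
            children[side][name] = list(
                merge(*runs[side][name], key=lambda e: e[0])
            )
    return children


# Invented example lists, already sorted by value.
node_lists = {
    "salary": [(15, "B", 1), (40, "B", 3), (60, "B", 5),
               (65, "G", 0), (75, "G", 2), (100, "G", 4)],
    "age": [(23, "B", 1), (30, "G", 0), (40, "G", 2),
            (45, "B", 5), (55, "B", 3), (55, "G", 4)],
}
left, right = split_in_steps(node_lists, "salary", 60, max_table_size=2)
```

With `max_table_size=2` the six records are handled in three steps, and each step rescans what is left of the non-splitting lists, so a smaller table trades memory for extra I/O.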