nonordfp: An FP-growth variation without rebuilding the FP-tree

1
nonordfp: An FP-growth variation without
rebuilding the FP-tree
  • Balázs Rácz
  • Computer and Automation Research Institute of the
    Hungarian Academy of Sciences

2
Outline
  • Introduction: FP-growth
  • New core data structure
  • Traversing strategies
  • Implementation aspects
  • Future work and thoughts

3
About the title
  • nonordfp: An FP-growth variation without
    rebuilding the FP-tree
  • What varies:
    • Core data structure
    • Memory layout
    • Simpler recursion: the same trie is used for
      more/all levels of recursion
  • nonord: item ordering does not change

4
FP-growth
  • Han, Pei, Yin, 2000
  • Pattern growth approach
  • Keep a representation of the (projected/conditional)
    database in memory
  • Data structure: FP-tree (trie + extra data)
  • Core recursion:
    • Determine support of items
    • Calculate projections onto single items
    • Recurse on them

5
Projection
  • Conditional database for itemset X: all
    transactions that contain X, with the elements of
    X removed.
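
The core recursion and the projection defined above can be sketched on plain transaction lists, without any FP-tree. This is only an illustration under stated assumptions: the names `project` and `pattern_growth` are hypothetical, items are assumed to be comparable, and restricting each conditional database to items ordered after the current one is an assumed way of generating every itemset exactly once.

```python
from collections import Counter

def project(db, item):
    """Conditional database for a single item: every transaction that
    contains the item, with the item itself removed."""
    return [[i for i in t if i != item] for t in db if item in t]

def pattern_growth(db, min_support, prefix=()):
    """Yield (itemset, support) for every frequent itemset extending prefix."""
    # Step 1: determine the support of each item.
    support = Counter(i for t in db for i in set(t))
    frequent = sorted(i for i, s in support.items() if s >= min_support)
    for item in frequent:
        yield prefix + (item,), support[item]
        # Step 2: project onto the single item. Keeping only items ordered
        # after `item` is an assumption made here to avoid duplicates.
        cond = [[i for i in t if i > item] for t in project(db, item)]
        # Step 3: recurse on the projection.
        yield from pattern_growth(cond, min_support, prefix + (item,))
```

For example, `dict(pattern_growth([["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"]], 2))` lists the five frequent itemsets of that toy database with their supports.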

6
New core data structure
  • Trie
  • Original FP-tree node:
    • item ID
    • children map, or
      first child + next sibling pointers
    • counter
    • parent pointer
  • New node: counter, parent pointer,
    stored separately

7
New core data structure (2)
  • New node: counter, parent pointer
    • Stored in separate arrays
    • projection: replace counters, structure remains
  • Nodes are placed in an array contiguously
    • no dynamic allocation
    • no pointers (just indices)
    • traversal: sequential memory read

8
New core data structure (3)
  • Hey, we dropped item identifiers!
  • The node array is filled so that nodes of the
    same item are adjacent
  • Only the starts of the item intervals are stored
  • A sequential scan of this array is a
    bottom-to-top traversal of the trie
    • sequential memory read
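
A minimal sketch of such a layout, assuming items are ranked by descending support and nodes are stored rank by rank, so that every parent has a smaller index than its children (scanning from the last index down is then a bottom-to-top traversal). The function names and the use of `-1` for "child of the root" are assumptions of this sketch, not details taken from the slides.

```python
from collections import Counter

def build_array_trie(db, min_support):
    """Return (items, starts, parent, count) for an array-layout trie.

    items[r]  : item with rank r (rank 0 = most frequent)
    starts[r] : index of the first node of item r; starts[-1] = node count
    parent[n] : index of the parent of node n, or -1 below the root
    count[n]  : counter of node n
    """
    support = Counter(i for t in db for i in set(t))
    items = sorted((i for i in support if support[i] >= min_support),
                   key=lambda i: (-support[i], i))
    rank = {item: r for r, item in enumerate(items)}

    # Temporary trie: each node is identified by its path of ranks.
    paths = Counter()
    for t in db:
        ranks = sorted({rank[i] for i in t if i in rank})
        for k in range(1, len(ranks) + 1):
            paths[tuple(ranks[:k])] += 1

    # Lay nodes out grouped by item; no item ID is stored per node.
    order = sorted(paths, key=lambda p: (p[-1], p))
    index = {p: n for n, p in enumerate(order)}
    parent = [index[p[:-1]] if len(p) > 1 else -1 for p in order]
    count = [paths[p] for p in order]

    starts = [0] * (len(items) + 1)
    for p in order:
        starts[p[-1] + 1] += 1
    for r in range(len(items)):
        starts[r + 1] += starts[r]
    return items, starts, parent, count

def item_supports(starts, count):
    """Support of each item = sum of the counters in its interval."""
    return [sum(count[starts[r]:starts[r + 1]])
            for r in range(len(starts) - 1)]
```

The item of a node is recovered from the interval its index falls into, which is why only the interval starts need to be kept.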

9
Aggregation on the new data structure
  • Aggregation: calculation of new counters
  • By default, recursion proceeds with the same
    structural information and a new set of counters
  • Dense
    • Default aggregation mode, visits each node once
    • Problem: deleted nodes of the previous iterations
  • Sparse
    • For first (few) recursions or skewed datasets
    • Ignores nodes with zero counter (deleted nodes)
    • Dynamically builds header chain
    • Typically projection follows
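
The dense mode could look roughly like the following sketch. It assumes an array layout in which the nodes of the item with rank `r` occupy the indices `starts[r]` to `starts[r + 1] - 1`, `parent[n]` is the index of the parent of node `n` (`-1` directly below the root), and every parent has a smaller index than its children. The function name and these conventions are illustrative assumptions.

```python
def aggregate_dense(starts, parent, count, r):
    """New counters for the projection onto the item of rank r.

    Only nodes of items ranked before r can occur in the conditional
    structure, so the result has starts[r] entries. The structural
    arrays (starts, parent) are reused unchanged.
    """
    new = [0] * starts[r]
    # Each node of item r hands its counter to its parent.
    for n in range(starts[r], starts[r + 1]):
        if parent[n] >= 0:
            new[parent[n]] += count[n]
    # Push the counters upward. Scanning from high to low index visits
    # every node once, and always after all of its children.
    for n in range(starts[r] - 1, -1, -1):
        if parent[n] >= 0:
            new[parent[n]] += new[n]
    return new

# Toy trie for the transactions abc, ab, ac, b (ranks: a=0, b=1, c=2).
starts = [0, 1, 3, 5]
parent = [-1, 0, -1, 1, 0]
count = [3, 2, 1, 1, 1]
conditional_on_c = aggregate_dense(starts, parent, count, 2)
```

Nodes that do not occur in the projection keep a zero counter. These are the "deleted" nodes that the dense scan still has to step over at deeper levels.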

10
Aggregation on the new data structure
  • Single path optimization
    • If the FP-tree is only a single chain (path) of
      nodes, then easy, fast recursion
    • Extension: if at most one node of each item
  • Skewed datasets
    • Real-life market basket data
    • Very large trie (up to millions of nodes)
    • First level of projection dominates the time
      (>80%)
    • Specialized, simultaneous projection
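
The single path case can be illustrated as follows: when the (conditional) trie is one chain, every combination of its frequent nodes is a frequent itemset, and its support is the counter of the deepest chosen node. The function name and the `(item, count)` path representation are assumptions of this sketch.

```python
from itertools import combinations

def single_path_itemsets(path, min_support, suffix=()):
    """Enumerate frequent itemsets of a single-path trie.

    path   : list of (item, count) from root to leaf; counters are
             non-increasing along the path
    suffix : items already fixed by the enclosing recursion
    """
    kept = [(i, c) for i, c in path if c >= min_support]
    for k in range(1, len(kept) + 1):
        for combo in combinations(kept, k):
            itemset = tuple(i for i, _ in combo) + suffix
            # combinations() preserves path order, so the last element
            # is the deepest node and carries the smallest counter.
            yield itemset, combo[-1][1]
```

No further projection or counter aggregation is needed once this case is detected, which is what makes the recursion cheap.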

11
Projection/compacting
  • Compact the counters and parents arrays to remove
    the space used by deleted nodes
    • both for sparse and dense aggregation
    • results in new structural data
    • spares memory and (traversal) time
    • not a real projection: no new trie is built and
      item order remains the same, but it can be done
      using only the parent pointers
  • Experimental results
    • on dense datasets, projection costs and benefits
      are surprisingly well balanced
    • on sparse datasets, projection is important on the
      first few levels (see simultaneous projection)
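
Compaction might be sketched like this, assuming the same kind of layout as in the slides: nodes grouped by item with interval starts in `starts`, parent indices in `parent` (`-1` directly below the root), parents always at smaller indices than their children, and a zero counter marking a deleted node. The function name is hypothetical.

```python
def compact(starts, parent, count):
    """Drop zero-counter nodes and return new (starts, parent, count).

    Item order is untouched; only indices shift, so the remapping needs
    nothing beyond the parent indices.
    """
    remap = [-1] * len(count)
    new_starts, new_parent, new_count = [0], [], []
    for r in range(len(starts) - 1):
        for n in range(starts[r], starts[r + 1]):
            if count[n] > 0:
                remap[n] = len(new_count)
                # The parent has a smaller index and a counter at least
                # as large, so it has already been kept and remapped.
                new_parent.append(remap[parent[n]] if parent[n] >= 0 else -1)
                new_count.append(count[n])
        new_starts.append(len(new_count))
    return new_starts, new_parent, new_count
```

The result is new structural data (`new_starts`, `new_parent`) that later recursion levels can scan without stepping over deleted nodes.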

12
Implementation
  • Auxiliary and library routines take 90% of the
    time
  • Especially core recursion
  • input/output routines
  • memory management

13
Impl.: Input/output
  • 100 million fprintf calls are ssslllooowww
  • Improvements:
    • buffer the output file
    • use specialized number conversion
    • utilize recursion
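
One way to read these three improvements together, shown as a sketch rather than the presenter's actual code: render each item ID to text once, keep the text of the current itemset prefix in a buffer that grows and shrinks with the recursion, and write to a buffered stream. The class name, its methods, and the `items (support)` line format are assumptions.

```python
import io

class ItemsetWriter:
    def __init__(self, stream, n_items):
        self.out = stream                  # any buffered binary stream
        # Specialized number conversion: each item ID is rendered once.
        self.text = [b"%d " % i for i in range(n_items)]
        self.prefix = bytearray()          # text of the current itemset
        self.marks = []                    # prefix length before each push

    def push(self, item):
        """Called when the recursion descends onto `item`."""
        self.marks.append(len(self.prefix))
        self.prefix += self.text[item]

    def pop(self):
        """Called when the recursion returns from the last pushed item."""
        del self.prefix[self.marks.pop():]

    def emit(self, support):
        """Write the current itemset followed by its support."""
        self.out.write(self.prefix)
        self.out.write(b"(%d)\n" % support)

buf = io.BytesIO()
w = ItemsetWriter(buf, 10)
w.push(3); w.emit(5)
w.push(7); w.emit(2)
w.pop(); w.pop()
w.push(4); w.emit(1)
```

Because the prefix text is shared between a recursion level and everything below it, each output line costs one append and two writes instead of formatting every item again.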

15
Impl.: Memory management
  • Allocating and freeing an array in each recursion
    step is too ssslllooowww
  • The C library's memory management functions are too
    general for the purpose
  • Solution: reuse arrays
    • Note: fill the array with zeros before/after use
      only when necessary
  • Block-oriented memory allocators
    • sometimes allocated memory is unused
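
The array-reuse idea can be illustrated with a small pool. This is a sketch only: the class and method names are hypothetical, and in a C or C++ implementation the pooled objects would be raw buffers rather than Python lists.

```python
class ArrayPool:
    """Hands out previously released arrays instead of allocating new ones."""

    def __init__(self):
        self.free = []

    def acquire(self, size):
        """Return an array with at least `size` slots.

        The contents are NOT cleared: the caller zeroes the array only
        when the algorithm actually needs zeros.
        """
        for k, arr in enumerate(self.free):
            if len(arr) >= size:
                return self.free.pop(k)
        return [0] * size

    def release(self, arr):
        """Give an array back to the pool for later reuse."""
        self.free.append(arr)

pool = ArrayPool()
a = pool.acquire(4)
a[0] = 7
pool.release(a)
b = pool.acquire(3)   # reuses the same array; b[0] is still 7
```

A reused array may be larger than requested, which matches the remark that some allocated memory stays unused.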

17
Experimental results
  • On dense datasets the fast traversal routines
    give an advantage
  • On sparse datasets the performance is still
    competitive
  • Test runs against some of the best competitors of
    FIMI'03

20
Future work, questions
  • Code is not yet mature
  • Use top-down approach
    • reuses counter arrays in-place
  • Simultaneous projection is not very efficient
    • how hard is it to do the first level of
      projections?
    • especially on sparse datasets

21
New direction of research
  • SIMD
    • stands for Single Instruction, Multiple Data
    • vector processing ability of modern processors
  • Supported by all mainstream CPUs
    • In x86 since Pentium MMX (1997)
  • MMX: 7 years old
  • APRIORI: 10 years old
  • FP-growth: 4 years old

22
New direction of research (cont'd)
  • Give an FIM algorithm/implementation that can be
    SIMD parallelized!
  • Use parallel CPU resources
  • Most important:
    • eliminate conditional branches
    • fits superscalar processors
  • Revise FIMI rules?

23
Thank you!