Online Mining of Frequent Query Trees over XML Data Streams


Title: Online Mining of Frequent Query Trees over XML Data Streams


1
Online Mining of Frequent Query Trees over XML
Data Streams
  • Hua-Fu Li, Man-Kwan Shan and Suh-Yin Lee
  • Department of Computer Science
  • National Chiao-Tung University
  • Hsinchu, Taiwan 300, R.O.C.
  • http://www.csie.nctu.edu.tw/hfli/
  • corresponding author

2
Outline
  • Introduction
  • Mining of Data Streams, Tree Mining
  • Problem Definition
  • Online Mining of Frequent Query Trees over XML
    Data Streams
  • The Proposed Algorithm
  • FQT-Stream (Frequent Query Trees of Streams)
  • Conclusions and Future Work

3
Mining of Data Streams: Motivations
  • Many applications generate data streams
  • Day-to-day business (credit card, ATM
    transactions, etc.)
  • Hot Web services (XML data, record and click
    streams)
  • Telecommunication (call records)
  • Financial market (stock exchange)
  • Surveillance (sensor network, audio/video)
  • System management (network events)
  • Application characteristics
  • Massive volumes of data (several terabytes)
  • Records arrive at a rapid rate
  • Data distribution changes on the fly
  • What do we want to get from data streams?
  • Real time query answering, Statistics, and
    Pattern discovery

4
Mining of Data Streams: Computation Model
  • Requirements of Mining Data Streams
  • Single pass: each record is examined at most once
  • Bounded storage: limited memory for storing the
    synopsis
  • Real-time: per-record processing time (to
    maintain the synopsis) must be low

5
Problem Definition of Frequent Query Tree Mining
(1/2)
  • XML Query Tree Stream (XQTS)
  • A sequence of query trees (QTs)
  • QT1, QT2, ..., QTN
  • N is the tree id of the latest incoming query tree
  • Support of a Query Tree QTi
  • sup(QTi): the number of QTs in XQTS containing
    QTi as a subtree

6
Problem Definition of Frequent Query Tree Mining
(2/2)
  • A QTi is a Frequent Query Tree (FQT)
  • if and only if sup(QTi) ≥ sN
  • s is a user-defined minimum support threshold in
    the range of [0, 1]
  • Our Task
  • To mine the set of all frequent query trees
    (FQTs) by one scan of the XQTS
  • Using as little memory as possible
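
The frequency condition on this slide can be illustrated with a small sketch. It assumes the supports have already been counted elsewhere; the function name `is_frequent` and the sample counts are hypothetical and do not come from the slides.

```python
def is_frequent(support: int, s: float, n: int) -> bool:
    """Return True when support >= s * N, the FQT condition on this slide."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must lie in [0, 1]")
    return support >= s * n


# Hypothetical supports observed after N = 10 query trees have arrived.
supports = {"QT_a": 6, "QT_b": 2, "QT_c": 5}
frequent = sorted(name for name, sup in supports.items()
                  if is_frequent(sup, s=0.5, n=10))
print(frequent)  # ['QT_a', 'QT_c']
```

Because N grows with every incoming query tree, the absolute threshold sN rises over time even though s stays fixed.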

7
Proposed Algorithm: FQT-Stream (Frequent Query
Trees of Streams)
  • FQT-Stream consists of 5 phases
  • 1. read a QT (Query Tree) from the buffer in the
    main memory
  • 2. transform the QT into a new NQTS (Normalized
    Query Tree Sequence) representation
  • 3. construct an in-memory summary data structure
    called FQT-forest (a forest of Frequent Query
    Trees) by projecting the NQTSs
  • 4. prune the infrequent query trees from
    FQT-forest
  • 5. find the set of all FQTs (Frequent Query
    Trees) from current FQT-forest
  • Since phase 1 is straightforward, we focus on
    phases 2-5

8
Phase 2 of FQT-Stream: NQTS Transformation
  • NQTS Transformation of QT
  • Using DFS on the QT
  • A sequence of triples (node-id, level, order)
  • level: the level of the node in the QT
  • order: the sequence order of the node in the NQTS
  • For example (5-NQTS in Figure 1)
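
The transformation can be sketched as a pre-order DFS that emits one triple per node. This is a minimal sketch, not the authors' implementation: the `Node` class and `to_nqts` function are hypothetical names, the example tree shape is inferred from the levels of the 5-NQTS used on the later slides (Figure 1 is not reproduced in this transcript), and any normalization of sibling order is not described here and is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    children: list["Node"] = field(default_factory=list)


def to_nqts(root: Node) -> list[tuple[str, int, int]]:
    """Pre-order DFS emitting (node-id, level, order) triples."""
    nqts = []
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        # order is the 1-based position of the node in the DFS visit
        nqts.append((node.node_id, level, len(nqts) + 1))
        # push children reversed so the leftmost child is visited first
        for child in reversed(node.children):
            stack.append((child, level + 1))
    return nqts


# Assumed tree: A has children B and C; B has children D and E.
tree = Node("A", [Node("B", [Node("D"), Node("E")]), Node("C")])
print(to_nqts(tree))
# [('A', 0, 1), ('B', 1, 2), ('D', 2, 3), ('E', 2, 4), ('C', 1, 5)]
```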

9
Phase 3 of FQT-Stream: FQT-forest Construction
(1/4)
  • For each NQTS, 2 steps are performed to construct
    the FQT-forest
  • Step 1: enumerate each NQTS into a set of
    sub-sequences using the Order-Break (OB) technique
  • OB is a level-wise method

10
Phase 3 of FQT-Stream: Step 1 of FQT-forest
Construction (2/4)
  • For example, a 5-NQTS <(A, 0, 1), (B, 1, 2),
    (D, 2, 3), (E, 2, 4), (C, 1, 5)>
  • First, the 5-NQTS is broken into three 4-NQTSs
  • <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
  • <(A, 0, 1), (B, 1, 2), (E, 2, 4), (C, 1, 5)>
  • <(A, 0, 1), (B, 1, 2), (D, 2, 3), (C, 1, 5)>
  • These sequences are 1-OB (One Order Break)
  • 1-OB sequences have one order break in the
    sequence order
  • The original 5-NQTS is called 0-OB
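
The example on this slide can be reproduced with a short sketch. It assumes an order break is a gap between the order values of two adjacent triples, and that the 4-NQTSs are obtained by deleting one interior node at a time; both readings are inferred from the example rather than stated in the slides, and the function names are hypothetical.

```python
Triple = tuple[str, int, int]  # (node-id, level, order)


def order_breaks(seq: list[Triple]) -> int:
    """Count gaps between the order values of adjacent triples."""
    return sum(1 for a, b in zip(seq, seq[1:]) if b[2] - a[2] > 1)


def drop_one_interior(seq: list[Triple]) -> list[list[Triple]]:
    """Delete each interior node in turn, keeping the first and last nodes."""
    return [seq[:i] + seq[i + 1:] for i in range(1, len(seq) - 1)]


nqts = [("A", 0, 1), ("B", 1, 2), ("D", 2, 3), ("E", 2, 4), ("C", 1, 5)]
for sub in drop_one_interior(nqts):
    print(order_breaks(sub), [node_id for node_id, _, _ in sub])
# 1 ['A', 'D', 'E', 'C']
# 1 ['A', 'B', 'E', 'C']
# 1 ['A', 'B', 'D', 'C']
```

Under the same assumption the original sequence has zero breaks, which matches its 0-OB label.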

11
Phase 3 of FQT-Stream: Step 1 of FQT-forest
Construction (3/4)
  • After deleting the duplicates
  • Three 4-NQTSs → Two 3-NQTSs with One Order Break
  • Two 3-NQTSs → One 2-NQTS
  • <(A, 0, 1), (E, 2, 4), (C, 1, 5)>, <(A, 0, 1),
    (B, 1, 2), (C, 1, 5)> → <(A, 0, 1), (C, 1, 5)>
  • Finally, the set of 1-OB contains 8 NQTSs

12
Phase 3 of FQT-Stream: Step 1 of FQT-forest
Construction (4/4)
  • Set of 2-OB is generated from the set of 1-OB
  • For example
  • 2-OB <(A, 0, 1), (D, 2, 3), (C, 1, 5)> is
    generated from 1-OB <(A, 0, 1), (D, 2, 3), (E, 2,
    4), (C, 1, 5)>
  • Repeat this process until no candidate k-OB
    remains
  • Property 1
  • The maximum size of order break is k-3, i.e.,
    (k-3)-OB, if the query tree has k nodes

13
Phase 3 of FQT-Stream: Step 2 of FQT-forest
Construction (1/3)
  • The OBs (0-OB, 1-OB, 2-OB) are projected and
    inserted into an FQT-forest using the Incremental
    Projection (IP) technique
  • An NQTS, <X1 X2 ... Xi>, with i nodes is projected
    into i sub-NQTSs (also called node-suffix NQTSs)
  • <Xi>, <Xi-1 Xi>, ..., <X2 ... Xi>, <X1 X2 ... Xi>
  • We use one field node-id to represent the fields
    (node-id, level, order) for simplicity

14
Phase 3 of FQT-Stream: Step 2 of FQT-forest
Construction (2/3)
  • Example of IP
  • 1-OB <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1,
    5)> is projected into 4 node-suffix NQTSs as
    follows
  • <(C, 1, 5)>
  • <(E, 2, 4), (C, 1, 5)>
  • <(D, 2, 3), (E, 2, 4), (C, 1, 5)>
  • <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
  • After projection, a tree structure check is
    performed
  • If the level of the first node in a node-suffix
    NQTS is not the smallest level,
  • the node-suffix NQTS is deleted
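
The projection and the tree structure check on this slide can be sketched as follows. This is a minimal sketch with hypothetical function names. It reads the check literally: a suffix is kept when its first node's level equals the minimum level in the suffix. How ties are handled (a later node at the same level as the first) is not stated in the slides, so this sketch lets them pass.

```python
Triple = tuple[str, int, int]  # (node-id, level, order)


def node_suffixes(seq: list[Triple]) -> list[list[Triple]]:
    """Return every suffix of the sequence, shortest first."""
    return [seq[i:] for i in range(len(seq) - 1, -1, -1)]


def passes_tree_check(suffix: list[Triple]) -> bool:
    """Keep a suffix only if its first node has the smallest level."""
    return suffix[0][1] == min(level for _, level, _ in suffix)


one_ob = [("A", 0, 1), ("D", 2, 3), ("E", 2, 4), ("C", 1, 5)]
suffixes = node_suffixes(one_ob)
kept = [s for s in suffixes if passes_tree_check(s)]
for s in kept:
    print([node_id for node_id, _, _ in s])
# ['C']
# ['A', 'D', 'E', 'C']
```

The two suffixes starting at E and D are dropped because C sits at a smaller level than their first node, so they cannot form a tree rooted at that node.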

15
Phase 3 of FQT-Stream: Step 2 of FQT-forest
Construction (3/3)
  • After tree structure checking
  • The node-suffix NQTSs are inserted into
    FQT-forest
  • Update the corresponding nodes' supports
  • FQT-forest consists of 2 parts
  • FN-list
  • A list of Frequent Nodes
  • Each node Xi in FN-list has an NQTS-tree
    (Xi.NQTS-tree)
  • NQTS-trees (trees of Normalized Query Tree
    Sequences)
  • A sequence (NQTS) is represented by a path
  • And its appearance frequency is maintained in the
    last node of the path
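
One way to picture this structure is a dictionary of prefix trees. The sketch below is an assumption-laden illustration rather than the authors' data structure: the class names `PathNode` and `FQTForest` are hypothetical, only node-ids are stored (as the earlier slide does for simplicity), and the support of a frequent node Xi is assumed to be the count kept at the root of Xi.NQTS-tree. Avoiding double counting of a suffix that arises from several OB sequences of the same query tree is left to the caller.

```python
class PathNode:
    """One node on a path of an NQTS-tree."""

    def __init__(self) -> None:
        self.count = 0  # frequency of the sequence that ends at this node
        self.children: dict[str, "PathNode"] = {}


class FQTForest:
    def __init__(self) -> None:
        # FN-list: node-id -> root of that node's NQTS-tree
        self.fn_list: dict[str, PathNode] = {}

    def insert(self, suffix: list[str]) -> None:
        """Insert one node-suffix NQTS as a path and bump its last node."""
        node = self.fn_list.setdefault(suffix[0], PathNode())
        for node_id in suffix[1:]:
            node = node.children.setdefault(node_id, PathNode())
        node.count += 1


forest = FQTForest()
for suffix in (["C"], ["A", "D", "E", "C"], ["A", "B", "C"], ["A", "B", "C"]):
    forest.insert(suffix)
print(forest.fn_list["A"].children["B"].children["C"].count)  # 2
```

Sequences that share a prefix share the corresponding path, which is what keeps the summary compact.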

16
Phase 4 of FQT-Stream: Infrequent Information
Pruning
  • In order to guarantee the limited space
    requirement
  • Pruning Infrequent Information
  • Pruning steps
  • Check each node Xi in the FN-list of FQT-forest
  • If sup(Xi) < sN → delete Xi and its NQTS-tree
  • Check other NQTS-trees to prune these infrequent
    nodes
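
The pruning steps can be sketched on a nested-dictionary stand-in for the FQT-forest. The representation (`count` and `children` keys) and the `prune` function are hypothetical. The sketch also assumes that removing an infrequent node from another NQTS-tree removes the whole subtree below it, which the slides do not spell out.

```python
def prune(forest: dict, s: float, n: int) -> None:
    """Drop infrequent FN-list entries, then remove those node-ids elsewhere."""
    infrequent = {x for x, root in forest.items() if root["count"] < s * n}
    for x in infrequent:
        del forest[x]  # removes Xi together with its NQTS-tree

    def strip(node: dict) -> None:
        for child_id in list(node["children"]):
            if child_id in infrequent:
                del node["children"][child_id]  # assumed: drop the subtree too
            else:
                strip(node["children"][child_id])

    for root in forest.values():
        strip(root)


leaf = lambda count: {"count": count, "children": {}}
forest = {
    "A": {"count": 4, "children": {
        "B": {"count": 3, "children": {"Z": leaf(1)}},
        "Z": leaf(1),
    }},
    "B": leaf(3),
    "Z": leaf(1),
}
prune(forest, s=0.5, n=4)  # threshold sN = 2, so Z is infrequent
print(sorted(forest), list(forest["A"]["children"]))  # ['A', 'B'] ['B']
```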

17
Phase 5 of FQT-Stream: Frequent Query Tree Mining
  • Assume that there are k frequent nodes, <X1, X2,
    ..., Xk>, in the FN-list
  • FQT-Stream traverses each Xi.NQTS-tree (∀i, i = 1,
    2, ..., k) in a DFS manner to find the sequences
    with prefix Xi whose estimated support is greater
    than or equal to sN
  • These frequent query trees are stored in a
    temporary list, called FQT-List
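
The traversal can be sketched as a DFS over a nested-dictionary stand-in for the NQTS-trees. The representation and the `mine` function are hypothetical. The sketch visits every path and does not cut the search below an infrequent node, because the slides do not say whether counts along a path are monotone.

```python
def mine(forest: dict, s: float, n: int) -> list[tuple[list[str], int]]:
    """Collect every path whose count is at least s * N (the FQT-List)."""
    fqt_list = []

    def dfs(node: dict, prefix: list[str]) -> None:
        if node["count"] >= s * n:
            fqt_list.append((prefix, node["count"]))
        for child_id, child in node["children"].items():
            dfs(child, prefix + [child_id])

    for node_id, root in forest.items():
        dfs(root, [node_id])  # every reported sequence has prefix Xi
    return fqt_list


forest = {
    "A": {"count": 4, "children": {
        "B": {"count": 3, "children": {
            "C": {"count": 2, "children": {}},
        }},
        "C": {"count": 1, "children": {}},
    }},
}
print(mine(forest, s=0.5, n=4))
# [(['A'], 4), (['A', 'B'], 3), (['A', 'B', 'C'], 2)]
```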

18
Conclusions and Future Work
  • We propose an efficient one-pass algorithm
    FQT-Stream (Frequent Query Trees of Streams)
  • To find the set of all frequent query trees over
    the entire history of online XML data streams
  • Future Work
  • Online Mining of Frequent Query Trees over
    Sliding Windows