Title: Online Mining of Frequent Query Trees over XML Data Streams
1 Online Mining of Frequent Query Trees over XML Data Streams
- Hua-Fu Li, Man-Kwan Shan and Suh-Yin Lee
- Department of Computer Science
- National Chiao-Tung University
- Hsinchu, Taiwan 300, R.O.C.
- http://www.csie.nctu.edu.tw/hfli/
- corresponding author
2 Outline
- Introduction
- Mining of Data Streams, Tree Mining
- Problem Definition
- Online Mining of Frequent Query Trees over XML Data Streams
- The Proposed Algorithm
- FQT-Stream (Frequent Query Trees of Streams)
- Conclusions and Future Work
3 Mining of Data Streams: Motivations
- Many applications generate data streams
- Day-to-day business (credit card, ATM transactions, etc.)
- Hot Web services (XML data, record and click streams)
- Telecommunication (call records)
- Financial market (stock exchange)
- Surveillance (sensor network, audio/video)
- System management (network events)
- Application characteristics
- Massive volumes of data (several terabytes)
- Records arrive at a rapid rate
- Data distribution changes on the fly
- What do we want to get from data streams?
- Real-time query answering, statistics, and pattern discovery
4 Mining of Data Streams: Computation Model
- Requirements of Mining Data Streams
- Single pass: each record is examined at most once
- Bounded storage: limited memory for storing the synopsis
- Real-time: per-record processing time (to maintain the synopsis) must be low
5 Problem Definition of Frequent Query Tree Mining (1/2)
- XML Query Tree Stream (XQTS)
- A sequence of query trees (QTs)
- QT1, QT2, ..., QTN
- N is the tree id of the latest incoming query tree
- Support of a Query Tree QTi
- sup(QTi): the number of QTs in the XQTS containing QTi as a subtree
6 Problem Definition of Frequent Query Tree Mining (2/2)
- A QTi is a Frequent Query Tree (FQT)
- if and only if sup(QTi) ≥ sN
- s is a user-defined minimum support threshold in the range [0, 1]
- Our Task
- To mine the set of all frequent query trees (FQTs) in one scan of the XQTS
- Using as little memory as possible
7 Proposed Algorithm: FQT-Stream (Frequent Query Trees of Streams)
- FQT-Stream consists of 5 phases
- 1. Read a QT (Query Tree) from the buffer in main memory
- 2. Transform the QT into a new NQTS (Normalized Query Tree Sequence) representation
- 3. Construct an in-memory summary data structure called FQT-forest (a forest of Frequent Query Trees) by projecting the NQTSs
- 4. Prune the infrequent query trees from the FQT-forest
- 5. Find the set of all FQTs (Frequent Query Trees) from the current FQT-forest
- Since phase 1 is straightforward, we focus on phases 2-5
8 Phase 2 of FQT-Stream: NQTS Transformation
- NQTS Transformation of QT
- Using DFS on the QT
- A sequence of triples (node-id, level, order)
- level: the level of the node in the QT
- order: the sequence order of the node in the NQTS
- For example (5-NQTS in Figure 1)
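Figure 1 is not reproduced in this transcript, so the sketch below assumes a tree shape consistent with the 5-NQTS used on the later slides (A has children B and C; B has children D and E). The function name to_nqts and the dict-of-children representation are illustrative assumptions, and node ids are assumed to be unique within one query tree.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, int, int]  # (node-id, level, order)


def to_nqts(children: Dict[str, List[str]], root: str) -> List[Triple]:
    """Pre-order DFS that records (node-id, level, order) for every node.

    `children` maps a node-id to its ordered list of child node-ids
    (a hypothetical input format, not taken from the slides).
    """
    nqts: List[Triple] = []
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        # 'order' is the 1-based position of the node in the DFS sequence.
        nqts.append((node, level, len(nqts) + 1))
        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(children.get(node, [])):
            stack.append((child, level + 1))
    return nqts


# Assumed shape of the example tree: A -> (B, C), B -> (D, E).
query_tree = {"A": ["B", "C"], "B": ["D", "E"]}
five_nqts = to_nqts(query_tree, "A")
print(five_nqts)
```

Under that assumed tree shape, the result is the 5-NQTS that the Phase 3 example starts from.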
9 Phase 3 of FQT-Stream: FQT-forest Construction (1/4)
- For each NQTS, 2 steps are performed to construct the FQT-forest
- Step 1: enumerate each NQTS into a set of sub-sequences using the Order-Break (OB) technique
- OB is a level-wise method
10 Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (2/4)
- For example, a 5-NQTS <(A, 0, 1), (B, 1, 2), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
- First, the 5-NQTS is broken into three 4-NQTSs
- <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
- <(A, 0, 1), (B, 1, 2), (E, 2, 4), (C, 1, 5)>
- <(A, 0, 1), (B, 1, 2), (D, 2, 3), (C, 1, 5)>
- These sequences are 1-OB (One Order Break)
- 1-OB sequences have one order break in the sequence order
- The original 5-NQTS is called 0-OB
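The slides do not define an order break formally. One reading that agrees with the examples shown is that an order break is a gap between two consecutive order values in a sequence. The sketch below counts such gaps and derives the three 4-NQTSs by deleting one non-root node at a time. The helper names and the decision to always keep the root are assumptions made for illustration, not the authors' definition.

```python
from typing import List, Tuple

Triple = Tuple[str, int, int]  # (node-id, level, order)


def count_order_breaks(nqts: List[Triple]) -> int:
    """Count gaps between consecutive 'order' values (assumed meaning of a break)."""
    orders = [order for _, _, order in nqts]
    return sum(1 for a, b in zip(orders, orders[1:]) if b - a > 1)


def drop_one_node(nqts: List[Triple]) -> List[List[Triple]]:
    """All sub-sequences obtained by deleting exactly one non-root node."""
    return [nqts[:i] + nqts[i + 1:] for i in range(1, len(nqts))]


five = [("A", 0, 1), ("B", 1, 2), ("D", 2, 3), ("E", 2, 4), ("C", 1, 5)]

# Keep only the candidates with exactly one gap in their order values.
one_ob = [s for s in drop_one_node(five) if count_order_breaks(s) == 1]
for seq in one_ob:
    print(seq)
```

Deleting C leaves the orders 1, 2, 3, 4 with no gap, so that candidate is not a 1-OB under this reading; the remaining three candidates are the 4-NQTSs listed on the slide.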
11 Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (3/4)
- After deleting the duplicates
- Three 4-NQTSs → Two 3-NQTSs with one order break
- Two 3-NQTSs → One 2-NQTS
- <(A, 0, 1), (E, 2, 4), (C, 1, 5)>, <(A, 0, 1), (B, 1, 2), (C, 1, 5)> → <(A, 0, 1), (C, 1, 5)>
- Finally, the set of 1-OB contains 8 NQTSs
12 Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (4/4)
- The set of 2-OB is generated from the set of 1-OB
- For example
- 2-OB <(A, 0, 1), (D, 2, 3), (C, 1, 5)> is generated from 1-OB <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
- Repeat this process until no candidate k-OB can be generated
- Property 1
- The maximum size of order break is k-3, i.e., (k-3)-OB, if the query tree has k nodes
13 Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (1/3)
- The OBs (0-OB, 1-OB, 2-OB) are projected and inserted into an FQT-forest using the Incremental Projection (IP) technique
- An NQTS, <X1 X2 ... Xi>, with i nodes is projected into i sub-NQTSs (also called node-suffix NQTSs)
- <Xi>, <Xi-1 Xi>, ..., <X2 ... Xi>, <X1 X2 ... Xi>
- We use one field, node-id, to represent the fields (node-id, level, order) for simplicity
14 Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (2/3)
- Example of IP
- 1-OB <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> is projected into 4 node-suffix NQTSs as follows
- <(C, 1, 5)>
- <(E, 2, 4), (C, 1, 5)>
- <(D, 2, 3), (E, 2, 4), (C, 1, 5)>
- <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
- After projection, a tree structure check is performed
- If the level of the first node in a node-suffix NQTS is not the smallest level
- the node-suffix NQTS is deleted
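The projection and the tree structure check can be sketched as follows, using the 1-OB sequence from the example. The function names are illustrative assumptions. The check follows the slide wording literally, so a suffix whose first node only ties for the smallest level is kept; the slides do not say how ties should be handled.

```python
from typing import List, Tuple

Triple = Tuple[str, int, int]  # (node-id, level, order)


def node_suffixes(nqts: List[Triple]) -> List[List[Triple]]:
    """Project an NQTS with i nodes into its i node-suffix NQTSs, shortest first."""
    return [nqts[i:] for i in range(len(nqts) - 1, -1, -1)]


def passes_tree_check(suffix: List[Triple]) -> bool:
    """Keep a suffix only if its first node has the smallest level in the suffix."""
    first_level = suffix[0][1]
    return first_level == min(level for _, level, _ in suffix)


one_ob = [("A", 0, 1), ("D", 2, 3), ("E", 2, 4), ("C", 1, 5)]
suffixes = node_suffixes(one_ob)
kept = [s for s in suffixes if passes_tree_check(s)]

for s in kept:
    print(s)
```

For this input, the suffixes starting at E and D are dropped because C sits at a smaller level than their first node, which leaves the suffix <(C, 1, 5)> and the full sequence.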
15 Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (3/3)
- After tree structure checking
- The node-suffix NQTSs are inserted into the FQT-forest
- Update the corresponding nodes' supports
- FQT-forest consists of 2 parts
- FN-list
- A list of Frequent Nodes
- Each node Xi in the FN-list has an NQTS-tree (Xi.NQTS-tree)
- NQTS-trees (trees of Normalized Query Tree Sequences)
- A sequence (NQTS) is represented by a path
- Its appearance frequency is maintained in the last node of the path
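A minimal sketch of the two-part structure, assuming the FN-list can be modelled as a dictionary from a node-id to the root of its NQTS-tree and each NQTS-tree as a prefix tree. TrieNode, FQTForest and insert are hypothetical names. Sequences are written with node-ids only, as the slides do for simplicity. How sup(Xi) itself is maintained is not spelled out in the slides, so only the end-of-path counts are modelled here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrieNode:
    count: int = 0  # appearance frequency of the sequence ending at this node
    children: Dict[str, "TrieNode"] = field(default_factory=dict)


# FN-list modelled as: first node-id -> root of that node's NQTS-tree.
FQTForest = Dict[str, TrieNode]


def insert(forest: FQTForest, seq: List[str]) -> None:
    """Insert one node-suffix NQTS as a path and bump the count at its last node."""
    node = forest.setdefault(seq[0], TrieNode())
    for node_id in seq[1:]:
        node = node.children.setdefault(node_id, TrieNode())
    node.count += 1


forest: FQTForest = {}
insert(forest, ["C"])
insert(forest, ["A", "D", "E", "C"])
insert(forest, ["A", "D", "E", "C"])
insert(forest, ["A", "C"])

print(forest["A"].children["D"].children["E"].children["C"].count)
```

Sequences sharing a prefix share a path, so repeated insertions only touch the counter at the end of an existing path instead of allocating new nodes.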
16 Phase 4 of FQT-Stream: Infrequent Information Pruning
- In order to guarantee the limited space requirement
- Pruning Infrequent Information
- Pruning steps
- Check each node Xi in the FN-list of the FQT-forest
- If sup(Xi) < sN → delete Xi and its NQTS-tree
- Check other NQTS-trees to prune these infrequent nodes
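The pruning steps can be sketched as below, under several assumptions: each FN-list entry is a plain dictionary with a hypothetical "sup" field and a "tree" field, each tree node is a dictionary with "count" and "children", and pruning an infrequent node from another NQTS-tree removes the whole branch below it. The last choice is an interpretation: every path through an infrequent node contains that node, so none of those paths can be frequent.

```python
from typing import Dict, Set


def _strip(tree: dict, infrequent: Set[str]) -> None:
    """Remove every branch that starts at an infrequent node-id."""
    for child_id in list(tree["children"]):
        if child_id in infrequent:
            del tree["children"][child_id]
        else:
            _strip(tree["children"][child_id], infrequent)


def prune(fn_list: Dict[str, dict], s: float, n: int) -> Set[str]:
    """Delete entries with sup < s*n, then clean the remaining NQTS-trees."""
    threshold = s * n
    infrequent = {x for x, entry in fn_list.items() if entry["sup"] < threshold}
    for x in infrequent:
        del fn_list[x]  # drops Xi together with Xi.NQTS-tree
    for entry in fn_list.values():
        _strip(entry["tree"], infrequent)
    return infrequent


def leaf(count: int) -> dict:
    return {"count": count, "children": {}}


fn_list = {
    "A": {"sup": 5, "tree": {"count": 1, "children": {
        "D": {"count": 1, "children": {"C": leaf(1)}},
        "C": leaf(4),
    }}},
    "C": {"sup": 5, "tree": leaf(5)},
    "D": {"sup": 1, "tree": leaf(1)},
}

removed = prune(fn_list, s=0.5, n=6)  # threshold s*N = 3
print(removed, sorted(fn_list))
```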
17 Phase 5 of FQT-Stream: Frequent Query Tree Mining
- Assume that there are k frequent nodes, <X1, X2, ..., Xk>, in the FN-list
- FQT-Stream traverses each Xi.NQTS-tree (∀i, i = 1, 2, ..., k) in a DFS manner to find the sequences with prefix Xi whose estimated support is greater than or equal to sN
- These frequent query trees are stored in a temporary list, called FQT-List
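The traversal can be sketched as a depth-first walk over each NQTS-tree that collects every path whose stored count reaches sN. The slides do not say how the estimated support is computed, so this sketch assumes it is simply the count kept at the last node of the path. The dictionary layout and the name mine are illustrative assumptions.

```python
from typing import Dict, List, Tuple

Path = Tuple[str, ...]


def mine(fn_list: Dict[str, dict], s: float, n: int) -> List[Tuple[Path, int]]:
    """DFS over every Xi.NQTS-tree, collecting paths with count >= s*n."""
    threshold = s * n
    fqt_list: List[Tuple[Path, int]] = []
    for x, root in fn_list.items():
        stack = [((x,), root)]
        while stack:
            path, node = stack.pop()
            if node["count"] >= threshold:
                fqt_list.append((path, node["count"]))
            # Keep descending even below a low count: in this layout the
            # counts belong to path ends and need not shrink with depth.
            for child_id, child in node["children"].items():
                stack.append((path + (child_id,), child))
    return fqt_list


fn_list = {
    "A": {"count": 4, "children": {
        "C": {"count": 3, "children": {}},
        "D": {"count": 1, "children": {"C": {"count": 1, "children": {}}}},
    }},
    "C": {"count": 4, "children": {}},
}

fqt_list = mine(fn_list, s=0.5, n=6)  # threshold s*N = 3
print(fqt_list)
```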
18 Conclusions and Future Work
- We propose an efficient one-pass algorithm, FQT-Stream (Frequent Query Trees of Streams)
- To find the set of all frequent query trees over the entire history of online XML data streams
- Future Work
- Online Mining of Frequent Query Trees over Sliding Windows