Title: Analytical Data Mining for Stream Data Analysis
1Department of Informatics, University of Minho
Braga, 22 February 2006
Analytical Data Mining for Stream Data Analysis
Ronnie Alves, Orlando Belo
ronnie,obelo_at_di.uminho.pt
http://alfa.di.uminho.pt/ronnie
Department of Informatics, University of Minho, PORTUGAL
2outlines
- motivation
- analytical data mining
- cube -> lattice of cuboids
- main issues
- first efforts (in 2005)
- current work
- final discussion
- research agenda
3motivation
- emerging applications such as sensor networks, telecommunications, the web, power supply and stock exchanges have to handle various data streams
- data stream characteristics: continuous, ordered, changing, fast, huge in volume
4motivation
- most stream data are at a rather low level or multi-dimensional in nature, and need multi-level and multi-dimensional processing
- analysts want to see changes, trends and unusual patterns at reasonable levels of detail
5motivation
- stream data analysis
- query approximations
- data mining
- on-line analytical processing (cube-based)
- keywords
- multi-dimensional, trends, unusual patterns
6analytical data mining
- analytical data mining combines ideas from cube-based algorithms with data mining functions
- we want to provide a set of analytical data mining methods to reveal exceptional and trend patterns over data streams
- cubing while mining, or mining while cubing
7cube -> lattice of cuboids
- 0-D (apex) cuboid: all
- 1-D cuboids: time | item | location | supplier
- 2-D cuboids: time,item | time,location | time,supplier | item,location | item,supplier | location,supplier
- 3-D cuboids: time,item,location | time,item,supplier | time,location,supplier | item,location,supplier
- 4-D (base) cuboid: time,item,location,supplier
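The lattice on this slide can be enumerated mechanically: each cuboid is one subset of the dimensions, so n dimensions give 2^n cuboids. The following is a minimal sketch of that idea; the function name `cuboid_lattice` is an assumption introduced for illustration and does not come from the slides.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """Group every subset of `dimensions` by size (the cuboid's dimensionality)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }

# The four dimensions named on the slide.
dims = ["time", "item", "location", "supplier"]
lattice = cuboid_lattice(dims)

for k, cuboids in lattice.items():
    # The empty tuple at level 0 stands for the apex cuboid "all".
    labels = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D: {labels}")
```

The level sizes follow the binomial coefficients (1, 4, 6, 4, 1 for four dimensions), which is why the number of cuboids doubles with every added dimension.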
8main issues
- (issue 1) given these characteristics of stream data, is it feasible to compute such a data cube, since its size is usually much bigger than the original data set and its construction may take multiple database scans? (curse of dimensionality)
- "Online Analytical Processing Stream Data: Is It Feasible?" (DMKD02)
- (issue 2) how to detect abnormal changes in cuboid cells, since on-line mining of changes is one of the core issues in stream data analysis?
9main issues
- compared to the history
- (issue 3) what are the distinct features of the current status?
- (issue 4) what are the relatively stable factors over time?
- on-line mining of changes (SIGMOD03)
10first efforts (in 2005)
- itemset mining is a core problem in many data mining tasks
- it can be used as a building block for more complex data mining processes and also for computing data cubes
11first efforts (in 2005)
- pattern-growth SQL extensions (one dimension) (EPIA05)
- inter-transactional rules (two dimensions, distance measures) (JISBD05)
- cube-based mining method for multi-dimensional association rules (n dimensions, incremental and multi-level) (miuda project)
- industrial projects in telecommunications and retail (real testbed)
12current work
- iceberg data cubing: computing only the cuboid cells above a minimum support threshold (the curse of dimensionality remains)
- closed data cubing: computing cuboids consisting of only closed cells (on dense databases, cuboids will be too large)
- maximal data cubing: computing cuboids consisting of only maximal cells (purely maximal, loses aggregate information)
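To make the iceberg condition concrete, here is a deliberately naive sketch: it counts every cell of every cuboid and then keeps only the cells that reach the threshold. It shows what an iceberg cube contains, not how an efficient algorithm computes it (practical methods prune during the computation). The two-dimensional table, the `STAR` marker and the name `iceberg_cube` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

STAR = "*"  # marks an aggregated ("all") dimension in a cell

def iceberg_cube(rows, min_count):
    """Return {cell: count} for every cell whose count is >= min_count."""
    n_dims = len(rows[0])
    counts = Counter()
    for row in rows:
        # Each base tuple contributes to exactly one cell in each of the 2^n cuboids.
        for k in range(n_dims + 1):
            for kept in combinations(range(n_dims), k):
                cell = tuple(row[i] if i in kept else STAR for i in range(n_dims))
                counts[cell] += 1
    return {cell: c for cell, c in counts.items() if c >= min_count}

# Hypothetical (time, item) tuples, not taken from the slides.
rows = [("t1", "milk"), ("t1", "milk"), ("t1", "bread"), ("t2", "milk")]
cube = iceberg_cube(rows, min_count=2)
for cell, c in sorted(cube.items()):
    print(cell, c)
```

Even with the threshold applied, the loop still visits all 2^n cuboids per tuple, which is one way to see why the curse of dimensionality remains.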
13current work
- real data applications have dense and correlated databases
- can we develop an algorithm which captures maximal correlated cuboid cells on dense databases?
- we propose m3c-cubing
14current work
- let the measure be count, the iceberg condition be count >= 2, and the correlated value threshold be 3CV >= 0.85; then:
- c1 = (a1,b1,c1,* : 3) and c2 = (a1,*,*,* : 4) are closed cells; c1 is a maximal cell
- c3 = (a1,b1,*,* : 3) and c4 = (*,b1,c1,* : 3) are covered by c1, but c4 has a correlated exception (3CV = 1)
- c5 = (a2,b2,c2,d4 : 1) does not satisfy the iceberg constraint
- therefore, c1 and c4 are maximal correlated cuboid cells: c1 = (a1,b1,c1,* : 3) and c4 = (*,b1,c1,* : 3)
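The counts and the closed/covered distinctions in this example can be checked on a small table. The sketch below assumes a hypothetical base table, chosen only so that its counts match the ones quoted on the slide, and assumes the usual definition of a closed cell for the count measure (no aggregated dimension can be fixed to a value without lowering the count). The 3CV measure itself is not computed, since the slides do not define it.

```python
STAR = "*"  # aggregated dimension

# Hypothetical base table: an assumption that reproduces the slide's counts.
ROWS = [
    ("a1", "b1", "c1", "d1"),
    ("a1", "b1", "c1", "d2"),
    ("a1", "b1", "c1", "d3"),
    ("a1", "b2", "c2", "d2"),
    ("a2", "b2", "c2", "d4"),
]

def matching(cell, rows=ROWS):
    """Rows that agree with the cell on every non-aggregated dimension."""
    return [r for r in rows if all(c == STAR or c == v for c, v in zip(cell, r))]

def count(cell, rows=ROWS):
    return len(matching(cell, rows))

def is_closed(cell, rows=ROWS):
    """True if no '*' can be replaced by a single value while keeping the same count."""
    covered = matching(cell, rows)
    if not covered:
        return False
    for i, c in enumerate(cell):
        # If all matching rows share one value on an aggregated dimension,
        # a more specific cell has the same count, so this cell is not closed.
        if c == STAR and len({r[i] for r in covered}) == 1:
            return False
    return True

MIN_COUNT = 2  # iceberg condition from the slide
for cell in [("a1", "b1", "c1", STAR), ("a1", STAR, STAR, STAR),
             ("a1", "b1", STAR, STAR), (STAR, "b1", "c1", STAR),
             ("a2", "b2", "c2", "d4")]:
    print(cell, count(cell), "closed" if is_closed(cell) else "not closed",
          "iceberg ok" if count(cell) >= MIN_COUNT else "below threshold")
```

On this table, (a1,b1,*,*) and (*,b1,c1,*) are not closed because (a1,b1,c1,*) has the same count, which is what "covered by c1" means here.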
15current work
- we provide an interestingness measure which discloses the true correlation (also dependence) relationship among cuboid cells (inspired by all_confidence)
- the computation of cuboids is guided by an m3cTree (based on a SetEnumeration tree)
- the m3cTree is traversed in pure depth-first order
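Two ingredients named on this slide can be sketched generically. The code below shows a depth-first walk over a plain set-enumeration tree and the standard all_confidence measure for itemsets (support of the set divided by the largest support of any single item in it). It is not the m3cTree or the authors' own measure, whose details the slides do not give; the function names and the transactions are assumptions for illustration.

```python
def dfs_set_enumeration(items, prefix=()):
    """Yield every non-empty subset of `items` in depth-first order.

    A node extends its prefix only with items that come later in the
    ordering, so each subset is visited exactly once.
    """
    for i, item in enumerate(items):
        node = prefix + (item,)
        yield node
        yield from dfs_set_enumeration(items[i + 1:], node)

def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def all_confidence(itemset, transactions):
    """support(X) divided by the maximum support of a single item in X."""
    denom = max(support((i,), transactions) for i in itemset)
    return support(itemset, transactions) / denom if denom else 0.0

# Hypothetical transactions, not taken from the slides.
transactions = [{"a", "b"}, {"a", "b"}, {"a"}, {"a", "c"}]
order = list(dfs_set_enumeration(("a", "b", "c")))
print(order)
print(all_confidence(("a", "b"), transactions))
```

Depth-first order reaches long itemsets early, which is convenient when the goal is to find maximal patterns and prune their subsets.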
16current work
- several pruning strategies are proposed for reducing the search space
- data cube computation is optimized by sharing partitions and caching intermediate aggregations
17final discussion
- cubing: a tradeoff between size complexity and efficient computation
- high-performance data cube computation is critical to analytical data mining
18final discussion
- the challenge could be how to share computation and explore optimizations
- further studies must deal with statistical aspects, proper constraints, data mining and data cubing functions, and the tilted time window frame
19research agenda
[Figure: timeline chart of activities 1-6 by quarter (1st to 4th), 2005-2008, split into past and future]
- activities
- area background
- cube-based mining
- exceptional patterns
- on-line changes
- analytical data mining
- thesis writing