Title: Multi-agent based High-Dimensional Cluster Analysis, SciDAC SDM-ISIC Kickoff Meeting, July 10-11, 2001
1. Multi-agent based High-Dimensional Cluster Analysis
SciDAC SDM-ISIC Kickoff Meeting, July 10-11, 2001
- Nagiza Samatova, George Ostrouchov
- Computer Science and Mathematics Division
- Oak Ridge National Laboratory
2. Science-driven Bottlenecks
- Data management and data mining algorithms are not scalable to petabytes of scientific data
- Retrieving data subsets from storage systems is too slow, especially for tertiary storage
- Transferring large datasets between sites is inefficient
- Navigating between heterogeneous, distributed data sources is very user-intensive
- I/O techniques offer too low an access rate
To improve the transfer of large datasets:
Major Focus
- To implement effective high-bandwidth transfers (Randy Burris)
Approaches
- To minimize the amount of data transferred
3. Minimizing the Amount of Scientific Simulation Data Transferred: State of the Art
- Data compression utilities (zip, compress, etc.)
  - large overheads
  - modest compression rates
- Post-processing data analysis tools (like PCMDI)
  - scientists must wait for the simulation to complete
  - can use many CPU cycles on long-running simulations
  - can use up to 50% more storage and require unnecessary data transfer for data-intensive simulations
- Simulation monitoring tools
  - interference with simulations
  - lack of flexibility
4. Improvements through Multi-level Data Minimization Mechanisms
- Simulation level
  - Data-stream (not simulation) monitoring tools for
    - Any-time feedback to decide whether to terminate a simulation, restart with new parameters, or continue
    - Filtering runs to decide whether to transfer to a central archive, keep locally, or delete
- Comparative analysis level
  - Application-specific search engines for
    - Simulation data comparison, esp. against archived databases
    - Distributed simulation data query, search, and retrieval
- In-depth analysis level
  - Application-specific inference engines for
    - Inferring rules relating fragments in two or more simulation outputs
    - New scientific discoveries
5. How We Will Address These Needs
- Our Approach: develop ASPECT (Adaptable Simulation Product Exploration via Clustering Toolkit), which includes
  - Dynamic first-look multivariate time series miner (Level I)
  - Distributed time-series query, search, and retrieval engine (Level II)
  - Time-series-based rules inference engine (Level III)
- Our Strategy
  - Leverage existing work
  - Expand our prior work
  - Integrate with other SDM tasks
  - Work closely with application scientists
  - Develop ASPECT in an iterative fashion
6. Our Work Will Leverage
- Distributed Scientific Data Mining Research (Probe/MICS) [SOA01a, SOA01b]
- Analysis of Large Scientific Datasets (LDRD/ORNL) [DFL96, DFL00]
- Statistical Downscaling for Climate (LDRD/ORNL) [PDO00]
7. Distributed Scientific Data Mining Research (funded under Probe/MICS)
- Motivation
- Big picture
- SDM-ETC related effort
- Relevance to our task: Levels II and III
- Limitations w.r.t. our task
  - Enabling-technology research, not application-specific
8. Motivation for Scientific Data Mining Research under Probe
- Existing data mining tools have limited applicability to the emerging scientific data sets, which are:
  - Massive (terabytes to petabytes)
    - Existing methods do not scale in time, storage, or number of dimensions.
    - Need scalable data analysis algorithms.
  - Distributed (e.g., across computational grids, multiple files, disks, tapes)
    - Existing methods work on a single, centralized dataset; data transfer is prohibitive (high bandwidth, security/privacy concerns).
    - Need distributed data analysis algorithms.
  - Dynamic
    - Existing methods work with static datasets; any change requires complete re-computation.
    - Need dynamic (updating/downdating) techniques.
  - High-dimensional
    - The usual assumptions about homogeneity or ergodicity cannot be made.
    - Need segmented dimension reduction methods.
9. Our Approach: Distributed Agents and Peer-to-Peer Negotiation
- Strategy
  - Perform data mining in a distributed and recursive fashion
  - with reasonable data transfer overheads
- Key idea
  - Generate local components using distributed agents
  - Merge these components into a global system via peer-to-peer agent collaboration and negotiation
- Requirements for the resulting system
  - Qualitative comparability
  - Computational complexity reduction
  - Scalability
  - Communication acceptability
  - Flexibility (in the choice of a local algorithm)
  - Visual representation sufficiency
10. Background: Hierarchical Clustering
11. SDM-ETC Tie-in: Distributed Hierarchical Clustering
Problem Description
- Given
  - A data set with N d-dimensional data items distributed across multiple data sites
- Task
  - Determine a hierarchical decomposition of this dataset
- Applications of Clustering
  - Database management
  - Multi-dimensional indexing
  - Data mining
  - and ...
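The hierarchical decomposition task above can be sketched as a minimal single-site agglomerative loop; the distributed setting (RACHET, next slide) instead merges per-site dendrograms, which this toy version does not attempt. The data and naming here are illustrative only.

```python
import math

def agglomerative(points):
    """Naive bottom-up hierarchical clustering of d-dimensional points.

    Returns the merge steps (cluster_a, cluster_b, distance), i.e. a
    flat encoding of the dendrogram.
    """
    # Each point starts as its own singleton cluster.
    clusters = {i: [p] for i, p in enumerate(points)}
    merges = []
    next_id = len(points)

    def centroid(c):
        d = len(c[0])
        return [sum(p[k] for p in c) / len(c) for k in range(d)]

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    while len(clusters) > 1:
        # Merge the closest pair of clusters (by centroid distance).
        ids = list(clusters)
        i, j = min(
            ((a, b) for a in ids for b in ids if a < b),
            key=lambda ab: dist(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
        )
        d_ij = dist(centroid(clusters[i]), centroid(clusters[j]))
        merges.append((i, j, d_ij))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges
```

With N items this naive loop is O(N^3); the point of the summarized representations on the following slides is to avoid ever shipping the raw points between sites.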
12. RACHET: A Distributed Clustering Algorithm
(Figure: control flow of RACHET)
13. Centroid Descriptive Statistics: a Summarized Cluster Representation
Question: how many statistical parameters are sufficient to make clustering decisions (merging or splitting clusters)?
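One plausible answer, sketched here under the assumption of a count / linear-sum / sum-of-squares representation (the slide does not spell out RACHET's actual statistics), is that d+2 numbers per cluster suffice to recover both a centroid and an RMS radius:

```python
import math

class ClusterSummary:
    """Sufficient statistics for one cluster: count, linear sum, sum of squares."""

    def __init__(self, points):
        d = len(points[0])
        self.n = len(points)
        # Per-dimension linear sum and the total sum of squared coordinates.
        self.ls = [sum(p[k] for p in points) for k in range(d)]
        self.ss = sum(x * x for p in points for x in p)

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # RMS distance of members from the centroid, from the summary alone.
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))
```

The raw points never need to leave their site; only these few parameters travel.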
14. Updating Descriptive Statistics
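A count / linear-sum / sum-of-squares triple (an assumption here; the slide itself gives no formulas) is additive, so a summary can be updated with new points, or two cluster summaries merged, without revisiting any raw data:

```python
def summarize(points):
    """Build a (n, linear_sum, sum_of_squares) summary for a batch of points."""
    d = len(points[0])
    return (len(points),
            [sum(p[k] for p in points) for k in range(d)],
            sum(x * x for p in points for x in p))

def merge_summaries(a, b):
    """Update one summary with another; additivity means no raw data is needed."""
    na, lsa, ssa = a
    nb, lsb, ssb = b
    return (na + nb, [x + y for x, y in zip(lsa, lsb)], ssa + ssb)
```

Merging the summaries of two batches yields exactly the summary of their union, which is what makes incremental (dynamic) updating possible.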
15. Euclidean Distance Approximation
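Without the raw points, pairwise Euclidean distances can only be approximated from the summaries. A minimal sketch, assuming clusters are represented by counts and linear sums (the actual RACHET approximation is not reproduced on the slide), is the centroid-to-centroid distance:

```python
import math

def centroid_distance(sum_a, sum_b):
    """Approximate inter-cluster Euclidean distance from (n, linear_sum) summaries."""
    na, lsa = sum_a
    nb, lsb = sum_b
    ca = [s / na for s in lsa]  # centroid of cluster a
    cb = [s / nb for s in lsb]  # centroid of cluster b
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))
```

Adding a radius term per cluster (slide 13) would tighten this into upper and lower bounds on the true inter-point distances.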
16. Performance Analysis: Linear in Time, Space, and Transmission
S << N and k << N
O(N)
17. Analysis of Large Scientific Datasets
- Focus: univariate time series data
- Applications: ARM, EEG
- Relevance to our task: Level III
- Limitations w.r.t. our task
  - No support for dynamic distributed time series
  - No support for multivariate time series
18. Local Models for Global Analysis and Comparison of Data Series
- Strategy
  - Segment the series
  - Model the usual to find the unusual
- Key ideas
  - Fit simple local models to segments
  - Use the model parameters for global analysis and monitoring
- Resulting system
  - Detects specific events (targeted or unusual)
  - Provides a global description of one or several data series
  - Provides data reduction to the parameters of the local models
19. From Local Models to Annotated Time Series
- Segment the series (100 obs per segment)
- Fit a simple local model (c0, c1, c2, e1, e2)
- Select the extreme segments (10%)
- Cluster the extremes (4 clusters)
- Map back to the series
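The per-segment model above appears to be a quadratic plus residual statistics; a sketch of recovering c0, c1, c2 for one segment by least squares (via the normal equations, no external libraries; segment contents are illustrative):

```python
def fit_quadratic(segment):
    """Least-squares fit y ~ c0 + c1*t + c2*t^2 over t = 0..len(segment)-1.

    Returns (c0, c1, c2); these few parameters stand in for the whole segment.
    """
    ts = range(len(segment))
    basis = [[1.0, t, t * t] for t in ts]  # quadratic basis [1, t, t^2]
    # Normal equations A c = b.
    A = [[sum(r[i] * r[j] for r in basis) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(basis, segment)) for i in range(3)]
    # Gaussian elimination with partial pivoting on the 3x3 system.
    M = [row + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    c = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        c[i] = (M[i][3] - sum(M[i][j] * c[j] for j in range(i + 1, 3))) / M[i][i]
    return tuple(c)
```

A 100-observation segment thus reduces to a handful of coefficients; "extreme" segments are then those whose coefficients (and residuals) lie far from the bulk.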
20. Statistical Downscaling for Climate
- Focus: image time series
- Application: climate
- Relevance to our task: Levels I and II
- Limitation w.r.t. our task
  - Works as a post-processing tool
21. Climate Downscaling Contains Several Post-Processing Tools
22. Trend and Periodic Components Provide a Concise Description of a Model Run
- Filter periodic and trend components
- Compute EOFs
- Monitor the model run
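Computing EOFs amounts to finding the leading eigenvectors of the spatial covariance matrix of the (already detrended) snapshots. A minimal sketch via power iteration, on toy 1-D "fields" rather than real climate grids:

```python
import math

def leading_eof(fields, iters=200):
    """Leading EOF (first principal spatial pattern) of a list of snapshots.

    `fields` is a list of equally sized 1-D spatial snapshots; the EOF is
    the top eigenvector of the time-averaged spatial covariance matrix.
    """
    d = len(fields[0])
    mean = [sum(f[k] for f in fields) / len(fields) for k in range(d)]
    anoms = [[f[k] - mean[k] for k in range(d)] for f in fields]
    # Covariance C[i][j] = time average of anomaly_i * anomaly_j.
    C = [[sum(a[i] * a[j] for a in anoms) / len(anoms) for j in range(d)]
         for i in range(d)]
    # Power iteration: repeatedly apply C and renormalize.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

Projecting each new snapshot onto a few such patterns gives the concise run description used for monitoring.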
23. Summary of Where Efforts Are Needed
- Research
  - Multivariate time series datasets
  - Dynamic versions of time-series processing and analysis tools
  - Application-specific distributed dynamic clustering
  - Application-specific rules inference algorithms
- Implementation
  - ASPECT framework
  - Simulation data monitoring engine
    - with pluggable user-driven data analysis modules
    - with any-time, real-time (not post-processing) analysis
    - with no or very little interference with the simulation
  - Simulation data query, search, and retrieval engine
  - Simulation data rules inference engine
  - A lot of integration work
24. Integration with Other SDM-ETC Tasks