Title: Multi-agent based High-Dimensional Cluster Analysis, SciDAC SDM-ISIC Kickoff Meeting, July 10-11, 2001
1. Multi-agent based High-Dimensional Cluster Analysis
SciDAC SDM-ISIC Kickoff Meeting, July 10-11, 2001
- Nagiza Samatova, George Ostrouchov
- Computer Science and Mathematics Division
- Oak Ridge National Laboratory
2. Science-driven Bottlenecks
- Data management and data mining algorithms are not scalable to petabytes of scientific data
- Retrieving data subsets from storage systems is too slow, especially for tertiary storage
- Transferring large datasets between sites is inefficient
- Navigating between heterogeneous, distributed data sources is very user-intensive
- I/O techniques offer too low an access rate
To improve the transfer of large datasets:
Major Focus
- To implement effective high-bandwidth transfers (Randy Burris)
Approaches
- To minimize the amount of data transferred
3. Minimizing the Amount of Scientific Simulation Data Transferred: State of the Art
- Data compression utilities (zip, compress, etc.)
  - large overheads
  - modest compression rates
- Post-processing data analysis tools (like PCMDI)
  - scientists must wait for the simulation to complete
  - can use many CPU cycles on long-running simulations
  - can use up to 50% more storage and require unnecessary data transfer for data-intensive simulations
- Simulation monitoring tools
  - interference with simulations
  - lack of flexibility
4. Improvements through Multi-level Data Minimization Mechanisms
- Simulation level
  - Data-stream (not simulation) monitoring tools for
    - Any-time feedback to decide whether to terminate a simulation, restart with new parameters, or continue
    - Filtering runs to decide whether to transfer to a central archive, keep locally, or delete
- Comparative analysis level
  - Application-specific search engines for
    - Simulation data comparison, esp. against archived databases
    - Distributed simulation data query, search, and retrieval
- In-depth analysis level
  - Application-specific inference engines for
    - Inferring rules relating fragments in two or more simulation outputs
    - New scientific discoveries
5. How We Will Address These Needs
- Our Approach: develop ASPECT (Adaptable Simulation Product Exploration via Clustering Toolkit), which includes
  - Dynamic first-look multivariate time series miner (Level I)
  - Distributed time-series query, search, and retrieval engine (Level II)
  - Time-series-based rules inference engine (Level III)
- Our Strategy
  - Leverage existing work
  - Expand our prior work
  - Integrate with other SDM tasks
  - Work closely with application scientists
  - Develop ASPECT in an iterative fashion
6. Our Work Will Leverage
- Distributed Scientific Data Mining Research (Probe/MICS) [SOA01a, SOA01b]
- Analysis of Large Scientific Datasets (LDRD/ORNL) [DFL96, DFL00]
- Statistical Downscaling for Climate (LDRD/ORNL) [PDO00]
7. Distributed Scientific Data Mining Research (funded under Probe/MICS)
- Motivation
- Big picture
- SDM-ETC related effort
- Relevance to our task: Levels II and III
- Limitations w.r.t. our task
  - Enabling-technology research, not application-specific
8. Motivation for Scientific Data Mining Research under Probe
- Existing data mining tools have limited applicability to the emerging scientific data sets, which are:
  - Massive (terabytes to petabytes)
    - Existing methods do not scale in time, storage, or number of dimensions.
    - Need scalable data analysis algorithms.
  - Distributed (e.g., across computational grids, multiple files, disks, tapes)
    - Existing methods work on a single, centralized dataset; data transfer is prohibitive (high bandwidth, security/privacy concerns).
    - Need distributed data analysis algorithms.
  - Dynamic
    - Existing methods work with static datasets; any change requires complete re-computation.
    - Need dynamic (updating/downdating) techniques.
  - High-dimensional
    - The usual assumptions about homogeneity or ergodicity cannot be made.
    - Need segmented dimension reduction methods.
9. Our Approach: Distributed Agents and Peer-to-Peer Negotiation
- Strategy
  - Perform data mining in a distributed and recursive fashion
  - with reasonable data transfer overheads
- Key idea
  - Generate local components using distributed agents
  - Merge these components into a global system via peer-to-peer agent collaboration and negotiation
- Requirements for the resulting system
  - Qualitative comparability
  - Computational complexity reduction
  - Scalability
  - Communication acceptability
  - Flexibility (in the choice of a local algorithm)
  - Visual representation sufficiency
10. Background: Hierarchical Clustering
11. SDM-ETC Tie-in: Distributed Hierarchical Clustering
Problem Description
- Given
  - A data set with N d-dimensional data items distributed across multiple data sites
- Task
  - Determine a hierarchical decomposition of this dataset
- Applications of Clustering
  - Database management
  - Multi-dimensional indexing
  - Data mining
  - and ...
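The hierarchical decomposition task above can be sketched as a minimal single-site agglomerative loop; the distributed setting (RACHET, next slide) instead merges per-site dendrograms, which this toy version does not attempt. The data and naming here are illustrative only.

```python
import math

def agglomerative(points):
    """Naive bottom-up hierarchical clustering of d-dimensional points.

    Returns the merge steps (cluster_a, cluster_b, distance), i.e. a
    flat encoding of the dendrogram.
    """
    # Each point starts as its own singleton cluster.
    clusters = {i: [p] for i, p in enumerate(points)}
    merges = []
    next_id = len(points)

    def centroid(c):
        d = len(c[0])
        return [sum(p[k] for p in c) / len(c) for k in range(d)]

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    while len(clusters) > 1:
        # Merge the closest pair of clusters (by centroid distance).
        ids = list(clusters)
        i, j = min(
            ((a, b) for a in ids for b in ids if a < b),
            key=lambda ab: dist(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
        )
        d_ij = dist(centroid(clusters[i]), centroid(clusters[j]))
        merges.append((i, j, d_ij))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges
```

With N items this naive loop is O(N^3); the point of the summarized representations on the following slides is to avoid ever shipping the raw points between sites.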
12. RACHET: A Distributed Clustering Algorithm
(Figure: control flow of RACHET)
13. Centroid Descriptive Statistics: a Summarized Cluster Representation
Question: how many statistical parameters are sufficient to make clustering decisions (merging or splitting clusters)?
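One plausible answer, sketched here under the assumption of a count / linear-sum / sum-of-squares representation (the slide does not spell out RACHET's actual statistics), is that d+2 numbers per cluster suffice to recover both a centroid and an RMS radius:

```python
import math

class ClusterSummary:
    """Sufficient statistics for one cluster: count, linear sum, sum of squares."""

    def __init__(self, points):
        d = len(points[0])
        self.n = len(points)
        # Per-dimension linear sum and the total sum of squared coordinates.
        self.ls = [sum(p[k] for p in points) for k in range(d)]
        self.ss = sum(x * x for p in points for x in p)

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # RMS distance of members from the centroid, from the summary alone.
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))
```

The raw points never need to leave their site; only these few parameters travel.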
14. Updating Descriptive Statistics
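A count / linear-sum / sum-of-squares triple (an assumption here; the slide itself gives no formulas) is additive, so a summary can be updated with new points, or two cluster summaries merged, without revisiting any raw data:

```python
def summarize(points):
    """Build a (n, linear_sum, sum_of_squares) summary for a batch of points."""
    d = len(points[0])
    return (len(points),
            [sum(p[k] for p in points) for k in range(d)],
            sum(x * x for p in points for x in p))

def merge_summaries(a, b):
    """Update one summary with another; additivity means no raw data is needed."""
    na, lsa, ssa = a
    nb, lsb, ssb = b
    return (na + nb, [x + y for x, y in zip(lsa, lsb)], ssa + ssb)
```

Merging the summaries of two batches yields exactly the summary of their union, which is what makes incremental (dynamic) updating possible.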
15. Euclidean Distance Approximation
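Without the raw points, pairwise Euclidean distances can only be approximated from the summaries. A minimal sketch, assuming clusters are represented by counts and linear sums (the actual RACHET approximation is not reproduced on the slide), is the centroid-to-centroid distance:

```python
import math

def centroid_distance(sum_a, sum_b):
    """Approximate inter-cluster Euclidean distance from (n, linear_sum) summaries."""
    na, lsa = sum_a
    nb, lsb = sum_b
    ca = [s / na for s in lsa]  # centroid of cluster a
    cb = [s / nb for s in lsb]  # centroid of cluster b
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))
```

Adding a radius term per cluster (slide 13) would tighten this into upper and lower bounds on the true inter-point distances.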
16. Performance Analysis: Linear in Time, Space, and Transmission
S << N and k << N
O(N)
17. Analysis of Large Scientific Datasets
- Focus: univariate time series data
- Applications: ARM, EEG
- Relevance to our task: Level III
- Limitations w.r.t. our task
  - No support for dynamic distributed time series
  - No support for multivariate time series
18. Local Models for Global Analysis and Comparison of Data Series
- Strategy
  - Segment the series
  - Model the usual to find the unusual
- Key ideas
  - Fit simple local models to segments
  - Use the model parameters for global analysis and monitoring
- Resulting system
  - Detects specific events (targeted or unusual)
  - Provides a global description of one or several data series
  - Provides data reduction to the parameters of the local models
19. From Local Models to Annotated Time Series
- Segment the series (100 obs per segment)
- Fit a simple local model (c0, c1, c2, e1, e2)
- Select the extreme segments (10%)
- Cluster the extremes (4 clusters)
- Map back to the series
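The per-segment model above appears to be a quadratic plus residual statistics; a sketch of recovering c0, c1, c2 for one segment by least squares (via the normal equations, no external libraries; segment contents are illustrative):

```python
def fit_quadratic(segment):
    """Least-squares fit y ~ c0 + c1*t + c2*t^2 over t = 0..len(segment)-1.

    Returns (c0, c1, c2); these few parameters stand in for the whole segment.
    """
    ts = range(len(segment))
    basis = [[1.0, t, t * t] for t in ts]  # quadratic basis [1, t, t^2]
    # Normal equations A c = b.
    A = [[sum(r[i] * r[j] for r in basis) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(basis, segment)) for i in range(3)]
    # Gaussian elimination with partial pivoting on the 3x3 system.
    M = [row + [rhs] for row, rhs in zip(A, b)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    c = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        c[i] = (M[i][3] - sum(M[i][j] * c[j] for j in range(i + 1, 3))) / M[i][i]
    return tuple(c)
```

A 100-observation segment thus reduces to a handful of coefficients; "extreme" segments are then those whose coefficients (and residuals) lie far from the bulk.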
20. Statistical Downscaling for Climate
- Focus: image time series
- Application: climate
- Relevance to our task: Levels I and II
- Limitation w.r.t. our task
  - Works as a post-processing tool
21. Climate Downscaling Contains Several Post-Processing Tools
22. Trend and Periodic Components Provide a Concise Description of a Model Run
- Filter periodic and trend components
- Compute EOFs
- Monitor the model run
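Computing EOFs amounts to finding the leading eigenvectors of the spatial covariance matrix of the (already detrended) snapshots. A minimal sketch via power iteration, on toy 1-D "fields" rather than real climate grids:

```python
import math

def leading_eof(fields, iters=200):
    """Leading EOF (first principal spatial pattern) of a list of snapshots.

    `fields` is a list of equally sized 1-D spatial snapshots; the EOF is
    the top eigenvector of the time-averaged spatial covariance matrix.
    """
    d = len(fields[0])
    mean = [sum(f[k] for f in fields) / len(fields) for k in range(d)]
    anoms = [[f[k] - mean[k] for k in range(d)] for f in fields]
    # Covariance C[i][j] = time average of anomaly_i * anomaly_j.
    C = [[sum(a[i] * a[j] for a in anoms) / len(anoms) for j in range(d)]
         for i in range(d)]
    # Power iteration: repeatedly apply C and renormalize.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

Projecting each new snapshot onto a few such patterns gives the concise run description used for monitoring.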
23. Summary of Where Efforts Are Needed
- Research
  - Multivariate time series datasets
  - Dynamic versions of time-series processing and analysis tools
  - Application-specific distributed dynamic clustering
  - Application-specific rules inference algorithms
- Implementation
  - ASPECT framework
  - Simulation data monitoring engine
    - with pluggable user-driven data analysis modules
    - with any-time, real-time (not post-processing) analysis
    - with no or very little interference with the simulation
  - Simulation data query, search, and retrieval engine
  - Simulation data rules inference engine
  - A lot of integration work
24. Integration with Other SDM-ETC Tasks