Multi-agent based High-Dimensional Cluster Analysis: SciDAC SDM-ISIC Kickoff Meeting, July 10-11, 2001

Provided by: buddy3
Learn more at: https://sdm.lbl.gov

1
Multi-agent based High-Dimensional Cluster Analysis
SciDAC SDM-ISIC Kickoff Meeting, July 10-11, 2001
  • Nagiza Samatova, George Ostrouchov
  • Computer Science and Mathematics Division
  • Oak Ridge National Laboratory

2
Science-driven Bottlenecks
  • Data management and data mining algorithms not
    scalable to petabytes of scientific data
  • Retrieving data subsets from storage systems too
    slow, especially for tertiary storage
  • Transferring large datasets between sites is
    inefficient
  • Navigating between heterogeneous, distributed
    data sources very user intensive
  • I/O techniques: access rate too low

To improve the transfer of large datasets
Major Focus
  • To implement effective high-bandwidth transfers
    (Randy Burris)

Approaches
  • To minimize the amount of data transferred

3
Minimizing the amount of scientific simulation
data transfer: State of the Art
  • Data compression utilities (zip, compress, etc.)
    • large overheads
    • modest compression rates
  • Post-processing data analysis tools (like
    PCMDI)
    • Scientists must wait for the simulation
      completion
    • can use lots of CPU cycles on long-running
      simulations
    • can use up to 50% more storage and require
      unnecessary data transfer for data-intensive
      simulations
  • Simulation monitoring tools
    • interference with simulations
    • lack of flexibility

4
Improvements through Multi-level data
minimization mechanisms
  • Simulation level
    • Data stream (not simulation) monitoring tools
      for
      • Any-time feedback to decide whether to
        terminate a simulation, restart with new
        parameters, or continue
      • Filtering runs to decide whether to transfer to a
        central archive, keep locally, or delete
  • Comparative analysis level
    • Application-specific search engines for
      • Simulation data comparison, esp. against archived
        databases
      • Distributed simulation data query, search, and
        retrieval
  • In-depth analysis level
    • Application-specific inference engines for
      • Inferring rules relating fragments in two or more
        simulation outputs
      • New scientific discoveries

5
How we will address these needs
  • Our Approach: Develop ASPECT (Adaptable
    Simulation Product Exploration via Clustering
    Toolkit) that includes
    • Dynamic first-look multivariate time series miner
      (Level I)
    • Distributed time-series query, search, and
      retrieval engine (Level II)
    • Time-series-based rules inference engine (Level
      III)
  • Our Strategy
    • Leverage existing work
    • Expand our prior work
    • Integrate with other SDM tasks
    • Work closely with application scientists
    • Develop ASPECT in an iterative fashion

6
Our work will be leveraging
  • Distributed Scientific Data Mining Research
    (Probe/MICS) SOA01a, SOA01b
  • Analysis of Large Scientific Datasets (LDRD/ORNL)
    DFL96, DFL00, DFL00
  • Statistical Downscaling for Climate (LDRD/ORNL)
    PDO00

7
Distributed Scientific Data Mining
Research (funded under Probe/MICS)
  • Motivation
  • Big picture
  • SDM-ETC related effort
  • Relevance to our task: Levels II and III
  • Limitations w.r.t. our task
    • Enabling technology research, not
      application-specific

8
Motivation for Scientific Data Mining Research
under Probe
  • Existing data mining tools have limited
    applicability to the emerging scientific data
    sets that are:
    • Massive (terabytes to petabytes)
      • Existing methods do not scale in terms of time,
        storage, number of dimensions.
      • Need scalable data analysis algorithms.
    • Distributed (e.g., across computational grids,
      multiple files, disks, tapes)
      • Existing methods work on a single, centralized
        dataset. Data transfer is prohibitive (high
        bandwidth, security/privacy concerns).
      • Need distributed data analysis algorithms.
    • Dynamic
      • Existing methods work with static datasets. Any
        changes require complete re-computation.
      • Need dynamic (updating and downdating) techniques.
    • High-dimensional
      • Usual assumptions about homogeneity or ergodicity
        cannot be made.
      • Need segmented dimension reduction methods.

9
Our Approach: Distributed agents and
peer-to-peer negotiation
  • Strategy
    • to perform data mining in a distributed and
      recursive fashion
    • with reasonable data transfer overheads
  • Key idea
    • Generate local components using distributed
      agents
    • Merge these components into a global system via
      peer-to-peer agent collaboration and
      negotiation
  • Requirements for Resulting System
    • Qualitative comparability
    • Computational complexity reduction
    • Scalability
    • Communication acceptability
    • Flexibility (in the choice of a local algorithm)
    • Visual representation sufficiency

10
Background: Hierarchical Clustering

11
SDM-ETC Tie-in: Distributed Hierarchical
Clustering
Problem Description
  • Given
    • A data set with N d-dimensional data items
      distributed across multiple data sites
  • Task
    • Determine a hierarchical decomposition of this
      dataset
  • Application of Clustering
    • Database Management
    • Multi-dimensional indexing
    • Data Mining
    • and ...
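
The problem statement above can be illustrated with a small sketch. It is not the RACHET algorithm from the next slide, only a toy instance of the general idea under stated assumptions: each site clusters its own items, transmits only (count, centroid) summaries, and a coordinator merges those summaries. The function names `agglomerate` and `local_agent`, the centroid-linkage rule, the cut at k clusters, and the two-site data are all hypothetical choices made for illustration.

```python
from itertools import combinations
from math import dist


def agglomerate(summaries, k):
    """Merge the two clusters with the closest centroids until k remain.

    Each summary is a (count, centroid) pair, so no raw points are needed.
    """
    active = list(summaries)
    while len(active) > k:
        i, j = min(
            combinations(range(len(active)), 2),
            key=lambda pair: dist(active[pair[0]][1], active[pair[1]][1]),
        )
        (n1, c1), (n2, c2) = active[i], active[j]
        n = n1 + n2
        # The merged centroid is the count-weighted mean of the two centroids.
        merged = (n, tuple((n1 * a + n2 * b) / n for a, b in zip(c1, c2)))
        active = [s for idx, s in enumerate(active) if idx not in (i, j)]
        active.append(merged)
    return active


def local_agent(points, k):
    """Runs at one data site: cluster the local points into k summaries."""
    return agglomerate([(1, tuple(p)) for p in points], k)


# Hypothetical toy data: N = 6 two-dimensional items spread over two sites.
site_a = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]
site_b = [(0.1, 0.3), (5.2, 4.9), (4.9, 5.3)]

# Only the summaries leave each site, never the raw items.
transmitted = local_agent(site_a, k=2) + local_agent(site_b, k=2)
global_clusters = agglomerate(transmitted, k=2)

for count, centre in sorted(global_clusters, key=lambda s: s[1]):
    print(count, [round(c, 3) for c in centre])
```

In this toy case the coordinator sees four summaries instead of six items; recording each merge instead of stopping at k clusters would give a full hierarchy.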

12
RACHET: Distributed Clustering Algorithm
Control flow of RACHET

13
Centroid Descriptive Statistics: summarized
cluster representation
Question: How many statistical parameters are
sufficient to make clustering decisions (merging
or splitting clusters)?

14
Updating Descriptive Statistics
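
The slide content itself is not in the transcript, so the following is only a sketch of how a summarized cluster representation can be updated without revisiting raw data. It assumes the summary is a count, a centroid, and a per-dimension sum of squared deviations; that choice of statistics and the class name `ClusterStats` are assumptions, not the statistics RACHET actually uses. The recurrences are the standard one-pass (Welford-style) update and the pairwise merge formula for means and sums of squares.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterStats:
    """Count, centroid, and per-dimension sum of squared deviations."""
    n: int = 0
    mean: list = field(default_factory=list)
    m2: list = field(default_factory=list)

    @classmethod
    def from_points(cls, points):
        stats = cls()
        for p in points:
            stats.add(p)
        return stats

    def add(self, x):
        """Updating: absorb one new data item."""
        if self.n == 0:
            self.mean = [0.0] * len(x)
            self.m2 = [0.0] * len(x)
        self.n += 1
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += delta / self.n
            self.m2[i] += delta * (xi - self.mean[i])

    def remove(self, x):
        """Downdating: forget one item that was added earlier."""
        if self.n <= 1:
            self.n, self.mean, self.m2 = 0, [], []
            return
        for i, xi in enumerate(x):
            old_mean = self.mean[i]
            self.mean[i] = (self.n * old_mean - xi) / (self.n - 1)
            self.m2[i] -= (xi - old_mean) * (xi - self.mean[i])
        self.n -= 1

    def merged(self, other):
        """Summary of the union of two disjoint clusters, from summaries only."""
        if self.n == 0 or other.n == 0:
            src = other if self.n == 0 else self
            return ClusterStats(src.n, list(src.mean), list(src.m2))
        n = self.n + other.n
        mean, m2 = [], []
        for m_a, m_b, s_a, s_b in zip(self.mean, other.mean, self.m2, other.m2):
            delta = m_b - m_a
            mean.append(m_a + delta * other.n / n)
            m2.append(s_a + s_b + delta * delta * self.n * other.n / n)
        return ClusterStats(n, mean, m2)


# Hypothetical example: two clusters summarized at different sites.
a = ClusterStats.from_points([(1, 2), (3, 4)])
b = ClusterStats.from_points([(5, 6), (7, 8), (9, 10)])
both = a.merged(b)
print(both.n, both.mean, both.m2)
```

Merging costs time proportional to the number of dimensions regardless of cluster size, which is why summaries of this kind keep transmission small.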

15
Euclidean Distance Approximation

16
Performance Analysis: linear in time, space and
transmission
S << N and k << N
O(N)

17
Analysis of Large Scientific Datasets
  • Focus: Univariate time series data
  • Applications: ARM, EEG
  • Relevance to our task: Level III
  • Limitations w.r.t. our task
    • No support for dynamic distributed time series
    • No support for multivariate time series

18
Local Models For Global Analysis and Comparison
of Data Series
  • Strategy
    • Segment series
    • Model the usual to find the unusual
  • Key ideas
    • Fit simple local models to segments
    • Use parameters for global analysis and monitoring
  • Resulting system
    • Detects specific events (targeted or unusual)
    • Provides a global description of one or several
      data series
    • Provides data reduction to parameters of local
      model
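
The strategy above can be sketched in a few lines. This is a minimal illustration, assuming a straight-line local model fitted by least squares and residual variance as the measure of "unusual"; the presentation's actual local model and selection rule may differ. The names `fit_line`, `segment_models`, and `unusual_segments`, the segment width, and the synthetic series are hypothetical.

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (intercept, slope, residual_variance).
    """
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in enumerate(ys))
    return intercept, slope, rss / n


def segment_models(series, width):
    """Cut the series into non-overlapping segments and fit each one."""
    starts = range(0, len(series) - width + 1, width)
    return [fit_line(series[s:s + width]) for s in starts]


def unusual_segments(models, n_extreme):
    """Indices of the segments whose local model fits worst."""
    ranked = sorted(range(len(models)), key=lambda i: models[i][2], reverse=True)
    return sorted(ranked[:n_extreme])


# Hypothetical series: a steady trend with one injected spike at t = 27.
series = [0.1 * t for t in range(50)]
series[27] += 5.0

models = segment_models(series, width=10)
print(unusual_segments(models, n_extreme=1))
```

Here 50 observations are reduced to five parameter triples, and the segment containing the spike stands out because the simple model describes it poorly.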

19
From Local Models to Annotated Time Series
Segment series (100 obs)
Fit simple local model (c0, c1, c2, e?, e2)
Select extreme (10)
Cluster extreme (4)
Map back to series

20
Statistical Downscaling for Climate
  • Focus: Image time series
  • Application: Climate
  • Relevance to our task: Levels I and II
  • Limitation w.r.t. our task
    • Works as a post-processing tool

21
Climate Downscaling Contains Several
Post-Processing Tools

22
Trend and Periodic Components Provide a Concise
Description of Model Run
Filter periodic and trend components
Compute EOFs
Monitor model run
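
As a rough illustration of the "Compute EOFs" step, the sketch below finds the leading empirical orthogonal function of a small space-time field by power iteration on the anomaly covariance. It is a simplification under stated assumptions: only the time mean is removed (no separate trend or periodic filtering), only the first EOF is computed, and the function names and the synthetic three-point field are hypothetical. A production tool would normally use a library SVD instead.

```python
from math import pi, sin, sqrt


def anomalies(field):
    """Subtract each grid point's time mean; field[t][j] is time t, point j."""
    n_t = len(field)
    means = [sum(col) / n_t for col in zip(*field)]
    return [[v - m for v, m in zip(row, means)] for row in field]


def leading_eof(field, iterations=100):
    """Return (eof, pc, explained) for the dominant spatial pattern."""
    anom = anomalies(field)
    n_p = len(anom[0])
    # Arbitrary non-uniform start; it must not be orthogonal to the answer.
    v = [float(j + 1) for j in range(n_p)]
    scale = sqrt(sum(x * x for x in v))
    v = [x / scale for x in v]
    for _ in range(iterations):
        pc = [sum(a * b for a, b in zip(row, v)) for row in anom]
        w = [sum(p * row[j] for p, row in zip(pc, anom)) for j in range(n_p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:  # constant field: no variance to explain
            break
        v = [x / norm for x in w]
    pc = [sum(a * b for a, b in zip(row, v)) for row in anom]
    total = sum(a * a for row in anom for a in row)
    explained = sum(p * p for p in pc) / total if total else 0.0
    return v, pc, explained


# Hypothetical field: one spatial pattern oscillating with a 12-step period.
pattern = [1.0, 2.0, -1.0]
offsets = [10.0, 20.0, 30.0]
field = [
    [offsets[j] + sin(2 * pi * t / 12) * pattern[j] for j in range(3)]
    for t in range(24)
]

eof, pc, explained = leading_eof(field)
print([round(x, 3) for x in eof], round(explained, 3))
```

The principal-component series `pc` is a one-number-per-time-step description of the run, which is the kind of concise signal a monitoring tool could track.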

23
Summary of where efforts are needed
  • Research
    • Multivariate time series datasets
    • Dynamic versions of time series processing and
      analysis tools
    • Application-specific distributed dynamic
      clustering
    • Application-specific rules inference algorithms
  • Implementations
    • ASPECT's framework
    • Simulation data monitoring engine
      • with pluggable user-driven data analysis modules
      • with any-time, real-time (not post-processing)
        analysis
      • with no or very little interference with
        simulation
    • Simulation data query, search, retrieval engine
    • Simulation data rules inference engine
  • A lot of integration work

24
Integration with other SDM-ETC tasks