NVOSS08 - PowerPoint PPT Presentation

About This Presentation
Title:

NVOSS08

Description:

P2P Data Mining. Kirk D. Borne. George Mason University. kborne_at_gmu.edu , http://classweb.gmu.edu/kborne ... Type 1 requires sophisticated algorithms that ... – PowerPoint PPT presentation

1
THE US NATIONAL VIRTUAL OBSERVATORY
P2P Data Mining
Kirk D. Borne, George Mason University
kborne_at_gmu.edu, http://classweb.gmu.edu/kborne/
with H. Kargupta, S. Arora, K. Bhaduri, K. Das, Tushar,
W. Griffin (UMBC), and C. Giannella (Loyola)
2
Topics
  • Distributed vs. P2P Data Mining
  • Science Use Cases
  • P2P Data Mining Project Plans
  • Current Design Status
  • IVOA GWS Standards

3
Distributed Data Mining (DDM)
  • DDM comes in 2 types:
  • Type 1: Distributed Mining of Data
  • Type 2: Mining of Distributed Data
  • Type 1 requires sophisticated algorithms that
    operate with data in situ
  • Type 2 takes many forms, with data being
    centralized (in whole or in partitions) or data
    remaining in place at distributed sites
  • References: http://www.cs.umbc.edu/hillol/DDMBIB/
  • C. Giannella, H. Dutta, K. Borne, R. Wolff, H.
    Kargupta. (2006). Distributed Data Mining for
    Astronomy Catalogs. Proceedings of 9th Workshop
    on Mining Scientific and Engineering Datasets, as
    part of the SIAM International Conference on Data
    Mining (SDM), 2006.
    http://www.cs.umbc.edu/hillol/PUBS/Papers/Astro.pdf
  • H. Dutta, C. Giannella, K. Borne and H. Kargupta.
    (2007). Distributed Top-K Outlier Detection from
    Astronomy Catalogs using the DEMAC System.
    Proceedings of the SIAM International Conference
    on Data Mining, Minneapolis, USA, April 2007.
    http://www.cs.umbc.edu/hillol/PUBS/Papers/sdm07.pdf
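
One way to mine data "in situ" is for each site to send only a small summary instead of its rows. The sketch below is an illustration, not the method of the papers cited above. It assumes every site holds different rows of the same attributes, and the names LocalStats, summarize, merge, and covariance are hypothetical. Each site computes counts, sums, and cross-product sums locally; merging those summaries gives the same covariance matrix as centralizing the data would.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LocalStats:
    """Sufficient statistics one site computes over its own rows (hypothetical type)."""
    n: int
    sums: List[float]          # per-attribute sums
    cross: List[List[float]]   # sums of pairwise products x_i * x_j


def summarize(rows: Sequence[Sequence[float]]) -> LocalStats:
    """Runs at the data site; only this small summary leaves the site."""
    d = len(rows[0])
    sums = [0.0] * d
    cross = [[0.0] * d for _ in range(d)]
    for row in rows:
        for i in range(d):
            sums[i] += row[i]
            for j in range(d):
                cross[i][j] += row[i] * row[j]
    return LocalStats(len(rows), sums, cross)


def merge(a: LocalStats, b: LocalStats) -> LocalStats:
    """Combine two site summaries by adding their components."""
    d = len(a.sums)
    return LocalStats(
        a.n + b.n,
        [a.sums[i] + b.sums[i] for i in range(d)],
        [[a.cross[i][j] + b.cross[i][j] for j in range(d)] for i in range(d)],
    )


def covariance(s: LocalStats) -> List[List[float]]:
    """Population covariance: E[x_i x_j] - E[x_i] E[x_j]."""
    d = len(s.sums)
    mean = [v / s.n for v in s.sums]
    return [[s.cross[i][j] / s.n - mean[i] * mean[j] for j in range(d)]
            for i in range(d)]


# Two sites, two attributes each (made-up numbers for illustration).
site_a = [[1.0, 2.0], [2.0, 4.0]]
site_b = [[3.0, 6.0], [4.0, 8.0]]

global_cov = covariance(merge(summarize(site_a), summarize(site_b)))
central_cov = covariance(summarize(site_a + site_b))
```

Because the summaries are additive, the merge can be applied pairwise in any order, which is what makes this kind of computation decomposable across sites.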

4
P2P Data Mining
  • P2P Data Mining represents one possible
    implementation of DDM
  • P2P has two types:
  • Task-parallel: the compute processes are
    distributed across the nodes
  • Data-parallel: the data are distributed across
    the nodes
  • References: http://www.cs.umbc.edu/hillol/DDMBIB/ddmbib_html/DistSys.html
  • S. Bandyopadhyay, C. Giannella, U. Maulik,
    H. Kargupta, S. Datta, and K. Liu. Clustering
    distributed data streams in peer-to-peer
    environments. Information Sciences,
    176(14):1952-1985, 2006.
    http://www.cs.umbc.edu/hillol/PUBS/p2pDM.pdf
  • K. Bhaduri, R. Wolff, C. Giannella, H. Kargupta.
    (2008). Distributed Decision Tree Induction in
    Peer-to-Peer Systems. Statistical Analysis and
    Data Mining. Volume 1, Issue 2, pp. 85-103.
    http://www.cs.umbc.edu/hillol/PUBS/Papers/sam08_dtree_bhaduri.pdf
  • S. Datta, K. Bhaduri, C. Giannella, R. Wolff, H.
    Kargupta. (2006). Distributed Data Mining in
    Peer-to-Peer Networks. (Invited submission to the
    IEEE Internet Computing special issue on
    Distributed Data Mining), Volume 10, Number 4,
    pp. 18-26.
    http://www.cs.umbc.edu/hillol/PUBS/P2PDM.pdf
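
The data-parallel case can be illustrated with gossip averaging, a generic P2P technique in which peers talk only to their neighbors and still converge on a global statistic. This is a minimal sketch under stated assumptions, not the algorithm of any paper cited above: the ring topology, the function name gossip_average, and the input values are all hypothetical.

```python
import random
from typing import Dict, List


def gossip_average(values: List[float], neighbors: Dict[int, List[int]],
                   rounds: int, seed: int = 0) -> List[float]:
    """Pairwise gossip: a random peer and one of its neighbors replace
    their values with the pair's average. The total is preserved, so every
    peer's value drifts toward the global mean without a central server."""
    rng = random.Random(seed)
    state = list(values)
    for _ in range(rounds):
        i = rng.randrange(len(state))
        j = rng.choice(neighbors[i])
        avg = (state[i] + state[j]) / 2.0
        state[i] = state[j] = avg
    return state


# Eight peers on a ring; each peer starts with a local value (assumed data).
n_peers = 8
ring = {i: [(i - 1) % n_peers, (i + 1) % n_peers] for i in range(n_peers)}
local_values = [float(i) for i in range(n_peers)]  # true mean is 3.5
estimates = gossip_average(local_values, ring, rounds=2000)
```

No peer ever sees more than one neighbor's value at a time, yet after enough exchanges every entry of estimates is close to the global mean.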

5
Why distributed data mining?
Because many great astronomical discoveries have come
from inter-comparisons of various wavelengths:
  • Quasars
  • Gamma-ray bursts
  • Ultraluminous IR galaxies
  • X-ray black-hole binaries
  • Radio galaxies
  • . . .

Just Checking
6
Some Fundamental Astronomy Problems: most of
these require VO-accessible distributed data
  • Some key astronomy problems that can be addressed
    with distributed data:
  • Cross-Match objects from different catalogues
  • The distance problem (e.g., Photometric Redshift
    estimators)
  • Star-Galaxy Separation
  • Cosmic-Ray Detection in images
  • Supernova Detection and Classification
  • Morphological Classification (galaxies, AGN,
    gravitational lenses, ...)
  • Class and Subclass Discovery (brown dwarfs,
    methane dwarfs, ...)
  • Dimension Reduction and Correlation Discovery
  • Learning Rules for improved classifiers
  • Classification of massive data streams
  • Real-time Classification of Astronomical Events
  • Clustering of massive data collections
  • Novelty, Anomaly, Outlier Detection in massive
    databases
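
The first problem in the list, cross-matching objects from different catalogues, can be sketched as a nearest-neighbor search within a matching radius. This is a brute-force illustration only. The catalogue entries, identifiers, and the 2-arcsecond radius are made-up assumptions, and a production cross-match would use a spatial index rather than comparing every pair.

```python
import math
from typing import List, Optional, Tuple

Source = Tuple[str, float, float]  # (identifier, RA in degrees, Dec in degrees)


def angular_sep_deg(ra1: float, dec1: float, ra2: float, dec2: float) -> float:
    """Great-circle separation in degrees, via the haversine formula."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(cat_a: List[Source], cat_b: List[Source],
                radius_arcsec: float) -> List[Tuple[str, Optional[str]]]:
    """For each source in cat_a, return the closest cat_b source within
    the radius, or None when nothing is close enough."""
    radius_deg = radius_arcsec / 3600.0
    matches = []
    for id_a, ra_a, dec_a in cat_a:
        best_id, best_sep = None, radius_deg
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_sep_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        matches.append((id_a, best_id))
    return matches


# Hypothetical optical and infrared catalogue fragments.
optical = [("opt-1", 10.0, 20.0), ("opt-2", 150.0, -5.0)]
infrared = [("ir-1", 10.0002, 20.0001), ("ir-2", 200.0, 45.0)]
matches = cross_match(optical, infrared, radius_arcsec=2.0)
```

Here opt-1 and ir-1 lie less than one arcsecond apart and are paired, while opt-2 has no counterpart within the radius.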

7
Sample Astronomy Data Mining Applications: most
of these require VO-accessible distributed data
  • Neural Network for Pixel Classification: Event
    Detection and Prediction (e.g., Supernova or
    Cosmic-ray hit?)
  • Bayesian Network for Object Classification (star
    or galaxy?)
  • PCA for finding Fundamental Planes of Galaxy
    Parameters
  • PCA (weakest component) for Outlier Detection:
    anomalies, novel discoveries, new objects
  • Link Analysis (Association Mining) for Causal
    Event Detection (e.g., linking optical transients
    with gamma-ray events)
  • Clustering analysis: spatial, temporal, or any
    scientific database parameters
  • Markov models: temporal mining, classification,
    and prediction from time series data
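
The "PCA (weakest component)" idea can be sketched as follows: when most objects obey a tight correlation, the principal component with the smallest variance measures how far an object sits off that correlation. The example below is a two-attribute illustration with made-up data and hypothetical function names. It uses the closed-form eigen-decomposition of a 2x2 covariance matrix, so it does not generalize as written to more attributes.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def weakest_component(points: Sequence[Point]) -> Tuple[Point, Point]:
    """Return (mean, unit eigenvector of the smallest covariance eigenvalue)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n          # var(x)
    c = sum((p[1] - my) ** 2 for p in points) / n          # var(y)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n  # cov(x, y)
    lam_min = (a + c) / 2 - math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    if b != 0.0:
        vx, vy = b, lam_min - a        # solves (A - lam_min * I) v = 0
    else:
        vx, vy = (1.0, 0.0) if a <= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (mx, my), (vx / norm, vy / norm)


def outlier_scores(points: Sequence[Point]) -> List[float]:
    """Score = absolute projection of each centered point on the weakest component."""
    (mx, my), (vx, vy) = weakest_component(points)
    return [abs((x - mx) * vx + (y - my) * vy) for x, y in points]


# Five objects close to the relation y = 2x, plus one that breaks it (assumed data).
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0), (4.0, 8.1), (5.0, 9.9), (3.0, 9.0)]
scores = outlier_scores(data)
most_anomalous = max(range(len(data)), key=lambda i: scores[i])
```

The object at (3.0, 9.0) receives the largest score because it lies farthest from the correlation the other objects follow.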

8
Class Discovery: feature separation and
discrimination of classes across multiple
databases
  • Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
  • The separation of classes improves when
    attributes from disparate databases are chosen to
    be projected, as in the following star-galaxy
    discrimination test

[Figure: star-galaxy discrimination test, with panels labeled "Good" and "Not good"]
9
Novelty Discovery (Outlier Detection): improved
discovery of rare objects across multiple
databases
10
Correlation Discovery: Fundamental Plane for
156,000 cross-matched Sloan+2MASS Elliptical
Galaxies. Plot shows variance captured by first
2 Principal Components as a function of local
galaxy density.
Reference: Borne, Dutta, Giannella, Kargupta,
Griffin 2008
[Plot axes: variance captured by PC1+PC2 versus Local Galaxy Density (low to high)]
11
Our Project Plans
  • NASA-funded (AISR) project to implement a P2P
    distributed data mining system
  • Provide a small number of useful data mining
    algorithms (one-to-one mapping with science use
    cases)
  • Clustering: Class Discovery and Characterization
  • Outlier detection: Novelty Discovery
  • PCA: Correlation Discovery
  • Select problems and algorithms that are
    decomposable: task-parallel and/or data-parallel
  • Implement system within VO framework

12
(No Transcript)
13
IVOA GWS Standards
  • GWS standards enable access to distributed data
    and distributed compute resources
  • Nodes in P2P system individually request
    distributed data partitions
  • Workflow is distributed across the P2P compute
    nodes
  • P2P activities are stateful and asynchronous
  • Relevant GWS activities: Security, VOSpace,
    Asynchronous services, Single Sign-on, Universal
    Worker Service (UWS), Logging
14
GWS functions required by P2P Data Mining
Environment
  • Acquiring and managing nodes and workspaces
    (VOSpace)
  • Single sign-on to nodes (SSO)
  • Distributing work and metadata to nodes (GRID)
  • Cone-search and other data requests submitted
    from compute nodes to data repositories
  • RESTful services?
  • Secure, stateful, asynchronous computations (UWS)
  • Communicate results between nodes, as required by
    some DDM algorithms
  • Recording and sharing results, and demonstrating
    interoperable multi-database VO science (Logging)
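
A cone-search request from a compute node to a data repository can be sketched as an HTTP GET whose query carries a sky position and a radius. The example assumes the IVOA Simple Cone Search parameter names RA, DEC, and SR (all in decimal degrees). The base URL is a placeholder and the function name cone_search_url is hypothetical. The code only builds the URL and does not contact any service.

```python
from urllib.parse import urlencode


def cone_search_url(base_url: str, ra_deg: float, dec_deg: float,
                    radius_deg: float) -> str:
    """Build a cone-search query URL for a position and search radius."""
    if not 0.0 <= ra_deg <= 360.0:
        raise ValueError("RA must be within [0, 360] degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("Dec must be within [-90, 90] degrees")
    if radius_deg < 0.0:
        raise ValueError("search radius must be non-negative")
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    if base_url.endswith(("?", "&")):
        separator = ""
    else:
        separator = "&" if "?" in base_url else "?"
    return base_url + separator + query


# Placeholder repository address, not a real service.
url = cone_search_url("http://example.org/scs", 180.0, -1.5, 0.1)
```

Each node would issue such a request for its own sky partition, which matches the slide's point that nodes individually request distributed data.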