NVOSS08 - PowerPoint PPT Presentation


About This Presentation

Title:

NVOSS08

Description:

P2P Data Mining. Kirk D. Borne. George Mason University. kborne_at_gmu.edu , http://classweb.gmu.edu/kborne ... Type 1 requires sophisticated algorithms that ... – PowerPoint PPT presentation


Slides: 15

Provided by: kbo83


Transcript and Presenter's Notes

Title: NVOSS08

1
THE US NATIONAL VIRTUAL OBSERVATORY
P2P Data Mining
Kirk D. Borne, George Mason University
kborne_at_gmu.edu, http://classweb.gmu.edu/kborne/
with H. Kargupta, S. Arora, K. Bhaduri, K. Das, Tushar,
W. Griffin (UMBC), and C. Giannella (Loyola)
2
Topics

Distributed vs. P2P Data Mining
Science Use Cases
P2P Data Mining Project Plans
Current Design Status
IVOA GWS Standards

3
Distributed Data Mining (DDM)

DDM comes in 2 types:
Type 1: Distributed Mining of Data
Type 2: Mining of Distributed Data
Type 1 requires sophisticated algorithms that
operate with data in situ
Type 2 takes many forms, with data being
centralized (in whole or in partitions) or data
remaining in place at distributed sites
References: http://www.cs.umbc.edu/hillol/DDMBIB/
C. Giannella, H. Dutta, K. Borne, R. Wolff, H.
Kargupta. (2006). Distributed Data Mining for
Astronomy Catalogs. Proceedings of 9th Workshop
on Mining Scientific and Engineering Datasets, as
part of the SIAM International Conference on Data
Mining (SDM), 2006.
http://www.cs.umbc.edu/hillol/PUBS/Papers/Astro.pdf
H. Dutta, C. Giannella, K. Borne and H. Kargupta.
(2007). Distributed Top-K Outlier Detection from
Astronomy Catalogs using the DEMAC System.
Proceedings of the SIAM International Conference
on Data Mining, Minneapolis, USA, April 2007.
http://www.cs.umbc.edu/hillol/PUBS/Papers/sdm07.pdf

4
P2P Data Mining

P2P Data Mining represents one possible
implementation of DDM
P2P has two types:
Task-parallel: the compute processes are
distributed across the nodes
Data-parallel: the data are distributed across
the nodes
References: http://www.cs.umbc.edu/hillol/DDMBIB/ddmbib_html/DistSys.html
S. Bandyopadhyay, C. Giannella, U. Maulik,
H. Kargupta, S. Datta, and K. Liu. Clustering
distributed data streams in peer-to-peer
environments. Information Sciences,
176(14):1952-1985, 2006.
http://www.cs.umbc.edu/hillol/PUBS/p2pDM.pdf
K. Bhaduri, R. Wolff, C. Giannella, H. Kargupta.
(2008). Distributed Decision Tree Induction in
Peer-to-Peer Systems. Statistical Analysis and
Data Mining. Volume 1, Issue 2, pp. 85-103.
http://www.cs.umbc.edu/hillol/PUBS/Papers/sam08_dtree_bhaduri.pdf
S. Datta, K. Bhaduri, C. Giannella, R. Wolff, H.
Kargupta. (2006). Distributed Data Mining in
Peer-to-Peer Networks. (Invited submission to the
IEEE Internet Computing special issue on
Distributed Data Mining), Volume 10, Number 4,
pp. 18-26.
http://www.cs.umbc.edu/hillol/PUBS/P2PDM.pdf
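
The data-parallel case on this slide can be illustrated with a minimal sketch. It assumes each node holds one partition of a single numeric attribute and only exchanges small summaries (count, sum, sum of squares) rather than raw data. The function names and the sample partitions are hypothetical and are not taken from the project's software.

```python
def local_summary(values):
    """Run independently at each node on its own data partition."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return n, total, total_sq


def combine(summaries):
    """Merge per-node summaries into the global mean and population variance."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    variance = total_sq / n - mean * mean
    return mean, variance


# Hypothetical partitions, one per node (illustrative values only).
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean, variance = combine([local_summary(p) for p in partitions])
print(mean, variance)
```

Only three numbers per node cross the network here, which is the point of keeping the data in place. A task-parallel variant would instead send different computations to different nodes.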

5
Why distributed data mining?
Because

Many great astronomical
discoveries have come
from inter-comparisons
of various wavelengths
Quasars
Gamma-ray bursts
Ultraluminous IR galaxies
X-ray black-hole binaries
Radio galaxies
. . .

Just Checking
6
Some Fundamental Astronomy Problems: most of
these require VO-accessible distributed data

Some key astronomy problems that can be addressed
with distributed data
Cross-Match objects from different catalogues
The distance problem (e.g., Photometric Redshift
estimators)
Star-Galaxy Separation
Cosmic-Ray Detection in images
Supernova Detection and Classification
Morphological Classification (galaxies, AGN,
gravitational lenses, ...)
Class and Subclass Discovery (brown dwarfs,
methane dwarfs, ...)
Dimension Reduction (Correlation Discovery)
Learning Rules for improved classifiers
Classification of massive data streams
Real-time Classification of Astronomical Events
Clustering of massive data collections
Novelty, Anomaly, Outlier Detection in massive
databases
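
As an illustration of the first item, cross-matching objects from different catalogues, here is a minimal brute-force sketch. It assumes each catalogue is a list of (id, RA, Dec) tuples in degrees and matches each source to its nearest counterpart within a fixed radius. The catalogue entries, identifiers, and the 1-arcsecond radius are made up for the example. Production cross-matching would use spatial indexing rather than this O(N×M) loop.

```python
from math import radians, degrees, sin, cos, asin, sqrt


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation via the haversine formula; all angles in degrees."""
    ra1, dec1, ra2, dec2 = map(radians, (ra1, dec1, ra2, dec2))
    h = (sin((dec2 - dec1) / 2) ** 2
         + cos(dec1) * cos(dec2) * sin((ra2 - ra1) / 2) ** 2)
    return degrees(2 * asin(min(1.0, sqrt(h))))


def cross_match(cat_a, cat_b, radius_arcsec=1.0):
    """Map each id in cat_a to the nearest cat_b id within radius_arcsec."""
    matches = {}
    for id_a, ra_a, dec_a in cat_a:
        best = None  # (id_b, separation in arcsec) of the closest match so far
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b) * 3600.0
            if sep <= radius_arcsec and (best is None or sep < best[1]):
                best = (id_b, sep)
        if best is not None:
            matches[id_a] = best[0]
    return matches


# Hypothetical catalogue entries (id, RA, Dec), illustrative only.
optical = [("opt1", 150.0000, 2.2000), ("opt2", 150.1000, 2.3000)]
infrared = [("ir1", 150.0001, 2.2001), ("ir2", 151.0000, 2.0000)]
matched = cross_match(optical, infrared, radius_arcsec=1.0)
print(matched)
```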

7
Sample Astronomy Data Mining Applications: most
of these require VO-accessible distributed data

Neural Network for Pixel Classification: Event
Detection and Prediction (e.g., Supernova or
Cosmic-ray hit?)
Bayesian Network for Object Classification (star
or galaxy?)
PCA for finding Fundamental Planes of Galaxy
Parameters
PCA (weakest component) for Outlier Detection:
anomalies, novel discoveries, new objects
Link Analysis (Association Mining) for Causal
Event Detection (e.g., linking optical transients
with gamma-ray events)
Clustering analysis: Spatial, Temporal, or any
scientific database parameters
Markov models: Temporal mining, classification,
and prediction from time series data
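
The "PCA (weakest component) for Outlier Detection" item can be sketched for the simplest possible case, two parameters, where the principal axes of the covariance matrix have a closed form. This is an illustrative toy under stated assumptions, not the method used in the cited work. The function name and data are hypothetical, and scoring points by their scaled projection onto the lowest-variance direction is one common choice among several.

```python
from math import atan2, sin, cos, sqrt


def weakest_component_scores(points):
    """Score 2-D points by |projection onto the lowest-variance principal axis|,
    scaled by that axis's standard deviation."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n

    # Orientation of the major (highest-variance) axis of a 2x2 covariance matrix.
    theta = 0.5 * atan2(2.0 * sxy, sxx - syy)
    # The weakest component is perpendicular to the major axis.
    wx, wy = -sin(theta), cos(theta)
    # Smallest eigenvalue of the covariance matrix (variance along the weak axis).
    lam_min = (sxx + syy) / 2.0 - sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    scale = sqrt(lam_min) if lam_min > 0 else 1.0

    return [abs((x - mx) * wx + (y - my) * wy) / scale for x, y in points]


# Ten points near the line y = x, plus one point far off it (made-up data).
data = [(float(i), i + 0.1 * (-1) ** i) for i in range(10)] + [(5.0, 9.0)]
scores = weakest_component_scores(data)
top = max(range(len(scores)), key=scores.__getitem__)
print(top, data[top])
```

Points that follow the dominant correlation have small projections onto the weakest component, so a large score flags an object that breaks the correlation.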

8
Class Discovery: feature separation and
discrimination of classes across multiple
databases

Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf

The separation of classes improves when
attributes from disparate databases are chosen to
be projected, as in the following star-galaxy
discrimination test

[Figure: star-galaxy discrimination panels, labeled "Good" and "Not good"]
9
Novelty Discovery (Outlier Detection): improved
discovery of rare objects across multiple
databases
10
Correlation Discovery: Fundamental Plane for
156,000 cross-matched Sloan+2MASS Elliptical
Galaxies. Plot shows variance captured by first
2 Principal Components as a function of local
galaxy density.
Reference: Borne, Dutta, Giannella, Kargupta,
Griffin 2008
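
The plotted quantity can be written out as follows, assuming it is the standard explained-variance fraction from PCA (the slide does not give the formula). Here the \lambda_i are the eigenvalues of the covariance matrix of the d galaxy parameters, sorted in decreasing order; the symbol f_{12} is introduced only for this note.

```latex
f_{12} \;=\; \frac{\lambda_1 + \lambda_2}{\sum_{i=1}^{d} \lambda_i},
\qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0
```

A value of f_{12} close to 1 means the galaxies lie close to a two-dimensional plane in parameter space. Computing it separately in bins of local galaxy density gives the trend the plot describes.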


[Plot axes: % of variance captured by PC1+PC2 vs. Local Galaxy Density (low to high)]
11
Our Project Plans

NASA-funded (AISR) project to implement a P2P
distributed data mining system
Provide a small number of useful data mining
algorithms (one-to-one mapping with science use
cases)
Clustering: Class Discovery and Characterization
Outlier detection: Novelty Discovery
PCA: Correlation Discovery
Select problems and algorithms that are
decomposable: task-parallel and/or data-parallel
Implement system within VO framework

12
(No Transcript)
13
IVOA GWS Standards

GWS standards enable access to distributed data
and distributed compute resources
Nodes in P2P system individually request
distributed data partitions
Workflow is distributed across the P2P compute
nodes
P2P activities are stateful and asynchronous
Relevant GWS activities: Security, VOSpace,
Asynchronous services, Single Sign-on, Universal
Worker Service (UWS), Logging

14
GWS functions required by P2P Data Mining
Environment

Acquiring and managing nodes and workspaces
(VOSpace)
Single sign-on to nodes (SSO)
Distributing work and metadata to nodes (GRID)
Cone-search and other data requests submitted
from compute nodes to data repositories
RESTful services?
Secure stateful asynchronous computations (UWS)
Communicate results between nodes, as required by
some DDM algorithms
Recording and sharing results, and demonstrating
interoperable multi-database VO science (Logging)
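
The "stateful asynchronous computations" item can be pictured as a job object whose phase is tracked separately from the call that created it. The sketch below is a simplified in-memory model, not a UWS client or server. The phase names follow the UWS phase vocabulary, but the `Job` class, its methods, and the transition table are hypothetical simplifications introduced for illustration.

```python
class Job:
    """Toy model of a stateful job: the phase persists between calls."""

    # Simplified transition table (an assumption; real services define more phases).
    ALLOWED = {
        "PENDING": {"QUEUED", "ABORTED"},
        "QUEUED": {"EXECUTING", "ABORTED"},
        "EXECUTING": {"COMPLETED", "ERROR", "ABORTED"},
    }

    def __init__(self, task, *args):
        self.task = task
        self.args = args
        self.phase = "PENDING"
        self.result = None

    def _move(self, new_phase):
        # Terminal phases have no entry in ALLOWED, so any move from them fails.
        if new_phase not in self.ALLOWED.get(self.phase, set()):
            raise ValueError(f"cannot go from {self.phase} to {new_phase}")
        self.phase = new_phase

    def submit(self):
        """Client side: hand the job to the service and return immediately."""
        self._move("QUEUED")

    def run(self):
        """Worker side: called later by whichever node picks the job up."""
        self._move("EXECUTING")
        try:
            self.result = self.task(*self.args)
        except Exception as exc:
            self.result = exc
            self._move("ERROR")
        else:
            self._move("COMPLETED")

    def abort(self):
        self._move("ABORTED")


job = Job(sum, [1, 2, 3])
job.submit()
print(job.phase)   # the client can poll the phase at any time
job.run()
print(job.phase, job.result)
```

In a real deployment `submit` and `run` would execute on different machines, with the client polling the phase over HTTP instead of reading an attribute.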