Title: NVOSS08
1THE US NATIONAL VIRTUAL OBSERVATORY
P2P Data Mining
Kirk D. Borne, George Mason University, kborne_at_gmu.edu, http://classweb.gmu.edu/kborne/
with H. Kargupta, S. Arora, K. Bhaduri, K. Das, Tushar, W. Griffin (UMBC), and C. Giannella (Loyola)
2Topics
- Distributed vs. P2P Data Mining
- Science Use Cases
- P2P Data Mining Project Plans
- Current Design Status
- IVOA GWS Standards
3Distributed Data Mining (DDM)
- DDM comes in 2 types:
  - Distributed Mining of Data
  - Mining of Distributed Data
- Type 1 requires sophisticated algorithms that operate with data in situ
- Type 2 takes many forms, with data being centralized (in whole or in partitions) or data remaining in place at distributed sites
- References: http://www.cs.umbc.edu/hillol/DDMBIB/
  - C. Giannella, H. Dutta, K. Borne, R. Wolff, H. Kargupta. (2006). Distributed Data Mining for Astronomy Catalogs. Proceedings of 9th Workshop on Mining Scientific and Engineering Datasets, as part of the SIAM International Conference on Data Mining (SDM), 2006. http://www.cs.umbc.edu/hillol/PUBS/Papers/Astro.pdf
  - H. Dutta, C. Giannella, K. Borne and H. Kargupta. (2007). Distributed Top-K Outlier Detection from Astronomy Catalogs using the DEMAC System. Proceedings of the SIAM International Conference on Data Mining, Minneapolis, USA, April 2007. http://www.cs.umbc.edu/hillol/PUBS/Papers/sdm07.pdf
4P2P Data Mining
- P2P Data Mining represents one possible implementation of DDM
- P2P has two types:
  - Task-parallel: the compute processes are distributed across the nodes
  - Data-parallel: the data are distributed across the nodes
- References: http://www.cs.umbc.edu/hillol/DDMBIB/ddmbib_html/DistSys.html
  - S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, S. Datta, and K. Liu. (2006). Clustering distributed data streams in peer-to-peer environments. Information Sciences, 176(14):1952-1985. http://www.cs.umbc.edu/hillol/PUBS/p2pDM.pdf
  - K. Bhaduri, R. Wolff, C. Giannella, H. Kargupta. (2008). Distributed Decision Tree Induction in Peer-to-Peer Systems. Statistical Analysis and Data Mining, Volume 1, Issue 2, pp. 85-103. http://www.cs.umbc.edu/hillol/PUBS/Papers/sam08_dtree_bhaduri.pdf
  - S. Datta, K. Bhaduri, C. Giannella, R. Wolff, H. Kargupta. (2006). Distributed Data Mining in Peer-to-Peer Networks. (Invited submission to the IEEE Internet Computing special issue on Distributed Data Mining), Volume 10, Number 4, pp. 18-26. http://www.cs.umbc.edu/hillol/PUBS/P2PDM.pdf
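As a rough illustration of the data-parallel case, the sketch below shows randomized pairwise gossip, one common way for peers to agree on a global average without any node seeing all the data. It is not taken from the cited papers. The function name `gossip_average`, the peer values, and the assumption that any two peers can contact each other directly are all hypothetical.

```python
import random

def gossip_average(values, rounds, seed=0):
    """Randomized pairwise gossip: in each round two peers replace
    their current values with the mean of the pair."""
    rng = random.Random(seed)
    state = list(values)  # one entry per peer; the original list is not modified
    for _ in range(rounds):
        i, j = rng.sample(range(len(state)), 2)  # pick two distinct peers
        mean_ij = (state[i] + state[j]) / 2.0
        state[i] = state[j] = mean_ij  # the pair's sum is unchanged
    return state

# Hypothetical local statistics held by 8 peers (e.g. a per-node measurement).
local_values = [12.0, 7.0, 30.0, 5.0, 18.0, 9.0, 22.0, 1.0]
estimates = gossip_average(local_values, rounds=2000)
```

Each exchange preserves the sum over all peers, so every peer's estimate drifts toward the global mean while only pairwise messages are sent.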
5Why distributed data mining?
Because many great astronomical discoveries have come from inter-comparisons of various wavelengths:
- Quasars
- Gamma-ray bursts
- Ultraluminous IR galaxies
- X-ray black-hole binaries
- Radio galaxies
- . . .
Just Checking
6Some Fundamental Astronomy Problems: most of these require VO-accessible distributed data
- Some key astronomy problems that can be addressed with distributed data:
  - Cross-Match objects from different catalogues
  - The distance problem (e.g., Photometric Redshift estimators)
  - Star-Galaxy Separation
  - Cosmic-Ray Detection in images
  - Supernova Detection and Classification
  - Morphological Classification (galaxies, AGN, gravitational lenses, ...)
  - Class and Subclass Discovery (brown dwarfs, methane dwarfs, ...)
  - Dimension Reduction and Correlation Discovery
  - Learning Rules for improved classifiers
  - Classification of massive data streams
  - Real-time Classification of Astronomical Events
  - Clustering of massive data collections
  - Novelty, Anomaly, Outlier Detection in massive databases
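The first problem in the list, cross-matching, can be sketched as a nearest-neighbour search on the sky. The example below is a minimal brute-force version; the catalogue layout (id, RA, Dec in degrees), the object ids, and the 2 arcsecond radius are assumptions for illustration only, and a production cross-match would normally use a spatial index rather than comparing every pair.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cross_match(cat_a, cat_b, radius_deg):
    """For each object in cat_a, return the id of the nearest cat_b object
    within radius_deg, or None if there is no such object."""
    matches = {}
    for id_a, ra_a, dec_a in cat_a:
        best_id, best_sep = None, radius_deg
        for id_b, ra_b, dec_b in cat_b:
            sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= best_sep:
                best_id, best_sep = id_b, sep
        matches[id_a] = best_id
    return matches

# Hypothetical catalogues: (id, RA in degrees, Dec in degrees).
optical = [("opt1", 150.0000, 2.0000), ("opt2", 150.1000, 2.1000)]
infrared = [("ir1", 150.0002, 2.0001), ("ir2", 151.0000, 3.0000)]
matches = cross_match(optical, infrared, radius_deg=2.0 / 3600.0)  # 2 arcsec
```

In a distributed setting the two catalogues live at different sites, so the inner loop is the part that would be partitioned (for example by sky region) across nodes.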
7Sample Astronomy Data Mining Applications: most of these require VO-accessible distributed data
- Neural Network for Pixel Classification: Event Detection and Prediction (e.g., Supernova or Cosmic-ray hit?)
- Bayesian Network for Object Classification (star or galaxy?)
- PCA for finding Fundamental Planes of Galaxy Parameters
- PCA (weakest component) for Outlier Detection: anomalies, novel discoveries, new objects
- Link Analysis (Association Mining) for Causal Event Detection (e.g., linking optical transients with gamma-ray events)
- Clustering analysis: Spatial, Temporal, or any scientific database parameters
- Markov models: Temporal mining, classification, and prediction from time series data
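The "PCA (weakest component)" idea can be illustrated in two dimensions, where the principal axes have a closed form. The sketch below scores each point by the size of its projection onto the lowest-variance axis, so points that break the dominant correlation score highest. It is a toy under stated assumptions: the data are synthetic, the function names are hypothetical, and a real catalogue would have many more attributes and need a general eigen-solver.

```python
import math

def weakest_component_scores(points):
    """Score each 2-D point by |projection| onto the lowest-variance principal axis."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix [[a, b], [b, c]].
    a = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - my) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Smallest eigenvalue of a symmetric 2x2 matrix (closed form).
    lam_min = (a + c) / 2 - math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # Matching eigenvector: (b, lam_min - a) when b != 0, else a coordinate axis.
    if b != 0:
        vx, vy = b, lam_min - a
    else:
        vx, vy = (1.0, 0.0) if a <= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    vx, vy = vx / norm, vy / norm
    return [abs((x - mx) * vx + (y - my) * vy) for x, y in points]

def top_outliers(points, k):
    """Indices of the k points with the largest weakest-component score."""
    scores = weakest_component_scores(points)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:k]

# Synthetic data: ten points close to the line y = 2x, plus one point off the line.
points = [(float(x), 2.0 * x + (0.1 if x % 2 else -0.1)) for x in range(10)]
points.append((5.0, 14.0))  # index 10, deliberately inconsistent with the trend
candidates = top_outliers(points, 1)
```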
8Class Discovery: feature separation and discrimination of classes across multiple databases
- Reference: http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/BrunnerDPS.pdf
- The separation of classes improves when
attributes from disparate databases are chosen to
be projected, as in the following star-galaxy
discrimination test
[Figure: star-galaxy discrimination test, with panels labeled "Good" and "Not good"]
9Novelty Discovery (Outlier Detection): improved discovery of rare objects across multiple databases
10Correlation Discovery: Fundamental Plane for 156,000 cross-matched Sloan+2MASS Elliptical Galaxies; plot shows variance captured by first 2 Principal Components as a function of local galaxy density.
Reference: Borne, Dutta, Giannella, Kargupta, Griffin 2008
[Plot axes: variance captured by PC1+PC2 versus local galaxy density (low to high)]
11Our Project Plans
- NASA-funded (AISR) project to implement a P2P distributed data mining system
- Provide a small number of useful data mining algorithms (one-to-one mapping with science use cases):
  - Clustering: Class Discovery and Characterization
  - Outlier detection: Novelty Discovery
  - PCA: Correlation Discovery
- Select problems and algorithms that are decomposable (task-parallel and/or data-parallel)
- Implement system within VO framework
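To make "decomposable" concrete, the sketch below shows why PCA is a natural data-parallel candidate: the covariance matrix it needs can be assembled exactly from per-node counts, sums, and sums of products, so no raw rows leave a node. This is a generic illustration, not the project's implementation. The function names and the two small partitions are hypothetical.

```python
def local_stats(rows):
    """Sufficient statistics a node computes on its own partition:
    row count, per-column sums, and sums of pairwise products."""
    d = len(rows[0])
    n = len(rows)
    s = [sum(r[i] for r in rows) for i in range(d)]
    q = [[sum(r[i] * r[j] for r in rows) for j in range(d)] for i in range(d)]
    return n, s, q

def merge_stats(a, b):
    """Combine the statistics of two partitions by element-wise addition."""
    n = a[0] + b[0]
    s = [x + y for x, y in zip(a[1], b[1])]
    q = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a[2], b[2])]
    return n, s, q

def covariance(stats):
    """Sample covariance matrix recovered from merged statistics."""
    n, s, q = stats
    d = len(s)
    return [[(q[i][j] - s[i] * s[j] / n) / (n - 1) for j in range(d)]
            for i in range(d)]

# Hypothetical partitions of one table held at two different sites.
node_a = [(1.0, 2.0, 0.5), (2.0, 4.1, 0.4), (3.0, 6.2, 0.7)]
node_b = [(4.0, 7.9, 0.3), (5.0, 10.1, 0.6)]
global_cov = covariance(merge_stats(local_stats(node_a), local_stats(node_b)))
```

Only the small statistics tuples travel between nodes. The eigen-decomposition of `global_cov` (the PCA step itself) can then run on any single node.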
13IVOA GWS Standards
- GWS standards enable access to distributed data and distributed compute resources
- Nodes in P2P system individually request distributed data partitions
- Workflow is distributed across the P2P compute nodes
- P2P activities are stateful and asynchronous
- Relevant GWS activities: Security, VOSpace, Asynchronous services, Single Sign-on, Universal Worker Service (UWS), Logging
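The "stateful and asynchronous" pattern can be sketched as a client that submits a job and then polls its phase until it finishes. The phase names below follow the job phases described in the UWS specification as commonly summarized (PENDING, QUEUED, EXECUTING, COMPLETED, ERROR, ABORTED). Everything else is an assumption for illustration: `wait_for_job` and `get_phase` are hypothetical names, and the remote service is replaced by a scripted stand-in so the example runs offline.

```python
import time

# Phases after which a job will not change state again.
TERMINAL_PHASES = {"COMPLETED", "ERROR", "ABORTED"}

def wait_for_job(get_phase, poll_interval=0.0, max_polls=100):
    """Poll a job until it reaches a terminal phase.
    Returns the distinct consecutive phases observed, in order."""
    seen = []
    for _ in range(max_polls):
        phase = get_phase()  # hypothetical callable that asks the service
        if not seen or seen[-1] != phase:
            seen.append(phase)
        if phase in TERMINAL_PHASES:
            return seen
        time.sleep(poll_interval)
    raise TimeoutError("job did not finish within max_polls")

# Scripted stand-in for a remote job, so no network access is needed.
_script = iter(["PENDING", "QUEUED", "QUEUED", "EXECUTING", "EXECUTING", "COMPLETED"])
history = wait_for_job(lambda: next(_script))
```

Because the job state lives on the service side, a P2P node can drop the connection between polls and pick the job up again later.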
14GWS functions required by P2P Data Mining
Environment
- Acquiring and managing nodes and workspaces (VOSpace)
- Single sign-on to nodes (SSO)
- Distributing work and metadata to nodes (GRID)
- Cone-search and other data requests submitted from compute nodes to data repositories
- RESTful services?
- Secure, stateful, asynchronous computations (UWS)
- Communicate results between nodes, as required by some DDM algorithms
- Recording and sharing results, and demonstrating interoperable multi-database VO science (Logging)
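A cone-search request issued from a compute node can be as simple as an HTTP GET with three query parameters. The sketch below only builds the request URL. The parameter names RA, DEC, and SR (all in decimal degrees) follow the IVOA Simple Cone Search convention, while the base URL and the function name `cone_search_url` are placeholders, not a real service or library API.

```python
from urllib.parse import urlencode

def cone_search_url(base_url, ra_deg, dec_deg, radius_deg):
    """Build a cone-search GET URL: position (RA, DEC) and search radius (SR),
    all in decimal degrees."""
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    separator = "&" if "?" in base_url else "?"  # keep any existing query string
    return f"{base_url}{separator}{query}"

# Placeholder endpoint; a real node would use the repository's registered URL.
url = cone_search_url("https://example.org/scs", 180.0, -30.5, 0.1)
```

Each P2P node would issue such a request for its own sky partition, which keeps the data requests as distributed as the computation.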