Title: GridMiner: a Framework for Data Integration
1GridMiner: a Framework for Data Integration and Knowledge Discovery on Computational Grids
- Peter Brezany
- Institute of Scientific Computing
- University of Vienna, Austria
- brezany_at_par.univie.ac.at
April 8, 2005, Vienna
2Outline
- Motivation
- Scientific and Application Drivers
- GridMiner Project in Vienna
- Architecture
- Workflow Management
- Data Access and Integration
- On-Line Analytical Processing and Data Mining
- Current Prototype
- Demo
- Future Work
- Conclusions
3Motivation
Business
Medicine
Scientific experiments
Data and data exploration
cloud
Simulations
Earth observations
4Stages of a Data Exploration Project
Based on Data Preparation for Data Mining, by Dorian Pyle, Morgan Kaufmann

Stage                                Time to complete      Importance to success
                                     (percent of total)    (percent of total)
- Exploring the problem                     10                    15
- Exploring the solution                     9                    14
- Implementation specification               1                    51
  Subtotal (first three stages)             20                    80
- Knowledge discovery
  - a. Data preparation                     60                    15
  - b. Data surveying                       15                     3
  - c. Data modeling                         5                     2
  Subtotal (knowledge discovery)            80                    20
5The Knowledge Discovery Process
Knowledge
OLAP Queries
OLAP
Online Analytical Mining
Evaluation and Presentation
Data Mining
Selection and Transformation
Data Warehouse
Cleaning and Integration
6Application and Scientific Drivers
7Data Mining Accuracy vs. Data Size
(Plot labels: accuracy, up to 100; sampled data size; available data size)
8Project EcoGRID (Sketch)
Distributed Data
Distributed Applications
Distributed Data Mining
Reporting
Biodiversity
Waste
Popular Presentation
Statistic
Air
Soil
Flow Analysis
Prediction Models
Emissions
Water
Geo-Statistic
Forests
Common Ontology
Author: Kathi Schleidt
9Management of TBI patients
- Traumatic brain injuries (TBIs) typically result from accidents in which the head strikes an object.
- The treatment of TBI patients is very resource intensive.
- The trajectory of TBI patient management:
- Trauma event
- First aid
- Transportation to hospital
- Acute hospital care
- Home care
- All the above phases are associated with data collection into databases now managed by individual hospitals.
Usage of mobile communication devices
10The GridMiner Project in Vienna
- GridMiner: a knowledge discovery Grid infrastructure (http://www.gridminer.org/)
- OGSA-based architecture
- Workflow management
- Grid-aware data preprocessing and data mining services
- Data mediation service
- OLAP service
- GUI
- Current implementation on top of Globus Toolkit 3.2
11GridMiner (Goal) Architecture
SMD: Support for Mobile Devices
GridMiner Mobility
GridMiner Workflow
GM DSCE: Dynamic Service Control
GridMiner Core
GMPPS: Preprocessing
GMDMS: Data Mining
GMPRS: Presentation
GMDT: Transformation
GMOMS: OLAM
GridMiner Base
GMMS: Mediation
GMIS: Information
GMRB: Resource Broker
GMCMS: OLAP / Cubes
Grid Core
Grid Core Services
Security
File and Database Access Service
Replica Management
Basic Grid Services
Fabric
Grid Resources
Data Source
12Collaboration of GM-Services
Simple Scenario
GMPPS: Preprocessing
GMDMS: Data Mining
GMDIS: Integration
GMPRS: Presentation
Intermediate Result 1
Intermediate Result 2 (e.g. flat table)
Intermediate Result 3 (e.g. PMML)
Final Result
Data Sources
13Collaboration (2)
Complex Scenarios
(Figure labels: workflow graphs combining the GMDIS, GMPPS, GMDMS, GMCMS, GMOMS, and GMPRS services)
14Workflow Models
Static Workflows
Dynamic Workflows
15Dynamic Workflows
- Dynamic Service Control Language (DSCL)
- based on XML
- easy to use
- Dynamic Service Control Engine (DSCE)
- processes the workflow according to DSCL
DSCL
DSCE
Service A
Service B
Service D
Service C
16DSCL Control Flow
Automatic conversion
User's view
dscl
variables
composition
sequence
createService activityID="act1"
parallel
invoke activityID="act2.1"
invoke activityID="act2.2"
sequence
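The control-flow elements named on this slide (sequence, parallel, createService, invoke) can be illustrated with a minimal sketch of how an engine might walk such a document. This is not the DSCE implementation: the XML layout below is an assumption reconstructed from the element names on the slide, and the service calls are replaced by a stand-in that only records the activity ID.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor

# Hypothetical DSCL-like document. Element and attribute names follow the
# slide; the nesting and the exact schema are assumptions.
WORKFLOW = """
<dscl>
  <variables/>
  <composition>
    <sequence>
      <createService activityID="act1"/>
      <parallel>
        <invoke activityID="act2.1"/>
        <invoke activityID="act2.2"/>
      </parallel>
    </sequence>
  </composition>
</dscl>
"""


def run(node, log):
    """Execute one workflow node, appending finished activity IDs to log."""
    if node.tag == "sequence":
        # Children run one after another, in document order.
        for child in node:
            run(child, log)
    elif node.tag == "parallel":
        # Children run concurrently; list() waits for all of them and
        # re-raises any exception from a worker thread.
        with ThreadPoolExecutor() as pool:
            list(pool.map(lambda child: run(child, log), node))
    elif node.tag in ("createService", "invoke"):
        # Stand-in for a real Grid service call.
        log.append(node.get("activityID"))
    else:
        raise ValueError(f"unsupported element: {node.tag}")


def execute(document):
    """Parse the document and run everything under <composition>."""
    root = ET.fromstring(document)
    log = []
    for top in root.find("composition"):
        run(top, log)
    return log


if __name__ == "__main__":
    print(execute(WORKFLOW))
```

The activity act1 always finishes first because it sits in a sequence ahead of the parallel block; the relative order of act2.1 and act2.2 is not fixed, which is the point of the parallel element.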
17Graphical User Interface: End-User Level
22Grid Data Mediation Service: Example Scenario
- Heterogeneities
- Name in A is Alexander Wöhrer
- Name in C has to be combined
- Distribution
- 3 data sources
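The scenario above, where one source stores a full name and another stores it in parts, can be sketched as a set of per-source mappings into one mediated schema. This is only an illustration under stated assumptions: the field names, the contents of source B, and the function names are hypothetical, and the sketch does not use the OGSA-DAI interfaces on which the actual service is built.

```python
# Hypothetical rows from three distributed sources. Only the name
# "Alexander Wöhrer" comes from the slide; field names are assumptions.
SOURCE_A = [{"name": "Alexander Wöhrer"}]
SOURCE_B = [{"full_name": "Alexander Wöhrer"}]
SOURCE_C = [{"first_name": "Alexander", "last_name": "Wöhrer"}]


def from_a(row):
    # Source A already stores the name in the mediated form.
    return {"name": row["name"]}


def from_b(row):
    # Source B uses a different column name (assumed heterogeneity).
    return {"name": row["full_name"]}


def from_c(row):
    # Source C stores the name in two parts that have to be combined.
    return {"name": f'{row["first_name"]} {row["last_name"]}'}


# One (source id, rows, mapping) entry per data source.
WRAPPERS = [
    ("A", SOURCE_A, from_a),
    ("B", SOURCE_B, from_b),
    ("C", SOURCE_C, from_c),
]


def mediated_view():
    """Yield every row of every source, translated to the mediated schema."""
    for source_id, rows, to_mediated in WRAPPERS:
        for row in rows:
            record = to_mediated(row)
            record["source"] = source_id
            yield record


if __name__ == "__main__":
    for record in mediated_view():
        print(record)
```

A client querying the mediated view sees a single name column and does not need to know which source splits the name.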
23Grid Data Mediation Service - Architecture
24OLAP (On-Line Analytical Processing)
Research Objectives: High-Performance Grid OLAP Services
25Requirements
- Operation on large data sets
- Centralized OLAP Service (parallel computing power can be included)
- Distributed OLAP service
- Federation of autonomous distributed OLAP services
26Development Strategy
(Figure labels: OLAP Engine (OE) instances and Network)
27Development Strategy (2)
- Precondition: no open-source OLAP system available
- Decision: development (in Java) from scratch
- Advantage: motivation for research activities addressing all facets
- Disadvantage: a possibly long implementation curve
- First step: centralized sequential Grid OLAP service
28Towards Centralized Service
OLAP
Workflow Engine
DSCL, OMML
OMML
XML
GUI
Mediator
PMML
PMML
RD
XMLD
CSV
Data Mining Engine
29Distributed OLAP: Aggregation of Compute and Storage Resources
Tuple Stream
30OLAP Caching
31Federated OLAP: Motivating Example
- Effective management of a network requires collecting, correlating, and analyzing a variety of network trace data.
- Analysis of flow data collected at each router and stored in a local data warehouse adjacent to the router is a challenging application.
- All flow information is conceptually part of a single relation with the following schema:
- Flow (RouterId, SourceIP, SourcePort, SourceMask, SourceAS, DestIP, DestPort, DestMask, DestAS, StartTime, EndTime, NumPackets, NumBytes)
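One way to read the motivating example is that each router-side warehouse aggregates its own flows and only the partial results travel to a coordinator. The sketch below illustrates that idea for a sum grouped by one attribute. It is an assumption about how such a federation could work, not the GridMiner design; the function names and all sample rows are hypothetical, and only the Flow schema is taken from the slide.

```python
from collections import Counter, namedtuple

# The Flow relation, with the attribute names given on the slide.
Flow = namedtuple("Flow", [
    "RouterId", "SourceIP", "SourcePort", "SourceMask", "SourceAS",
    "DestIP", "DestPort", "DestMask", "DestAS",
    "StartTime", "EndTime", "NumPackets", "NumBytes",
])

# Hypothetical local warehouses, one per router. All values are made up.
SITES = {
    "r1": [
        Flow("r1", "10.0.0.1", 1234, 24, 100, "10.1.0.1", 80, 24, 200, 0, 10, 5, 500),
        Flow("r1", "10.0.0.2", 1235, 24, 100, "10.2.0.1", 80, 24, 300, 5, 15, 2, 200),
    ],
    "r2": [
        Flow("r2", "10.0.1.1", 4321, 24, 101, "10.1.0.9", 443, 24, 200, 3, 20, 3, 300),
    ],
}


def local_aggregate(flows, group_by, measure):
    """Sum one measure per group inside a single local warehouse."""
    totals = Counter()
    for flow in flows:
        totals[getattr(flow, group_by)] += getattr(flow, measure)
    return totals


def federated_aggregate(sites, group_by, measure):
    """Merge the partial sums of every site; raw flows never leave a site."""
    merged = Counter()
    for flows in sites.values():
        # Counter.update adds the partial sums to the running totals.
        merged.update(local_aggregate(flows, group_by, measure))
    return merged


if __name__ == "__main__":
    print(dict(federated_aggregate(SITES, "DestAS", "NumBytes")))
```

Merging partial results this way is valid for distributive aggregates such as sums and counts; an average would need the sites to return both a sum and a count.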
32OLAP Federation
33GridMiner Current Architecture
User environment
Web
Grid
34Towards an Open Service System
35Implementation/Technology
- Globus 3.2
- OGSA/DAI version 5
- GUI: workflow construction / results visualization (JGraph, Java Web Start, JavaServer Pages)
- Service configurators (JavaServer Pages)
- Workflow management: DSCE Client (OGSA)
- Knowledge base: configurations (XML, OWL)
- Data mediation service (OGSA/DAI)
36GridMiner People and Areas
Data Preprocessing: Michaela Pfeifer
Former Members: Jürgen Hofer (until 07/2003; early GT3-based prototype, DIGIDT case study)
37DIALOGUE Project: Data Integration Applications Linking Organizations to Gain Understanding and Experience
- University of Edinburgh (Project Leader)
- Malcolm Atkinson
- Cutter (Ohio State University, Columbus)
- Joel Saltz
- Bioinformatics (Indiana University, Bloomington)
- Beth Plale
- San Diego Supercomputing Center
- Chaitan Baru
- GridMiner
- Peter Brezany
- Kick-off Workshop: August 2005, Univ. of Ohio
38Within Austrian Grid: Adaptive Semantic Data Integration
39Project Schedules
- GridMiner (2003-2005)
- a follow-up project proposal in preparation phase
- Austrian Grid (Workpackage 4b) (2005-2006)
- DIALOGUE (2005-2006)