D2K Tutorial - PowerPoint PPT Presentation

Slides: 33

Provided by: lorett5

Transcript and Presenter's Notes

Title: D2K Tutorial

1
D2K Tutorial

D2K Overview

2
Outline

Motivation for D2K
Overview of D2K Functionality
Example: Discovery (Rule Association)
References
http://alg.ncsa.uiuc.edu/do/tools/d2k/documentation
http://alg.ncsa.uiuc.edu/do/tools/d2k/tutorials
http://alg.ncsa.uiuc.edu/tools/docs/d2k/manual/index.html
http://alg.ncsa.uiuc.edu/tools/docs/d2k/principles/index.html
http://alg.ncsa.uiuc.edu/tools/docs/d2k/faq/faq.html

3
Motivation for D2K

Create a framework that could be used and
extended for our needs in knowledge discovery
(data mining).
D2K is a flexible data mining system that
integrates effective analytical data mining
methods for prediction, discovery, and anomaly
detection with data management and information
visualization.
Knowledge Discovery in Databases is the
non-trivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data.
The understandable patterns are used to:
Make predictions about or classifications of new
data
Explain existing data
Summarize the contents of a large database to
support decision making
Create graphical data visualization to aid humans
in discovering complex patterns

4
Knowledge Discovery Process

5
Required Effort for each KDD Step

Arrows indicate the direction we want the effort
to go.

6
Three Primary Paradigms

Predictive Modeling: a supervised learning
approach where classification or prediction of
one of the attributes is desired.
Classification is the prediction of predefined
classes,
e.g. Naive Bayesian, Decision Trees, and Neural
Networks.
Regression is the prediction of continuous data,
e.g. Neural Networks and Decision (Regression)
Trees.
Discovery: an unsupervised learning approach for
exploratory data analysis,
e.g. Association Rules, Link Analysis,
Clustering, and Self Organizing Maps.
Deviation Detection: identifying outliers in the
data,
e.g. Visualization.

7
Advantages of a Framework for Analytics

Provides scalable environment from the Desktop to
Web Services
Employs a visual programming system for data/work
flow paradigm
Provides capability to build custom applications
Provides capability to access data management
tools
Contains data mining algorithms for prediction
and discovery
Provides data transformations for standard
operations
Integrated environment for models and
visualization
Supports an extensible interface for creating
one's own algorithms
Provides access to distributed computing
capabilities
Employs multi-layered learning strategies

8
D2K and Its Many Components

D2K Infrastructure
D2K API, data flow environment, distributed
computing framework and runtime system
D2K Modules
Computational units written in Java that follow
the D2K API
D2K Itineraries
Modules that are connected to form an application
D2K Toolkit
User interface for specification of itineraries
and execution that provides the rapid application
development environment
D2K-Driven Applications
Applications that use D2K modules with a custom
user interface
D2K Streamline (SL)
Task driven system that uses D2K modules
D2K Web Services
Enables web deployment

9
D2K Streamline (D2K SL)

Provides step by step interface to guide user in
data analysis
Supports return to earlier steps to run different
parameters
Uses the D2K infrastructure transparently
Uses same D2K modules
Provides way to capture different experiments
Define templates that can be reused in different
experiments

10
D2K Web Service Architecture

Any web enabled client can connect to and use the
D2K Web Service by sending SOAP messages over
HTTP.
Itineraries and modules are stored on the web
service machine and loaded over the network by
the D2K Servers.
Job results are also stored in the web service
tier.
Results are returned to clients upon request.
A relational database is used by the web service
to lookup accounts, itineraries, servers, and
jobs.
Remote D2K Servers handle itinerary processing.
If possible, modules should load any data from
remote locations.

11
D2K Basic

Set of D2K Modules to perform data mining
techniques
Prediction
Decision Trees
C4.5 Decision Tree, Continuous Decision Tree, SQL
Rain Forest Decision Tree
Naïve Bayesian Classification and SQL Naïve
Bayesian Classification
Neural Networks
Discovery
Rule Association
Apriori, FP Growth, Htree
Clustering
Hierarchical Agglomerative, Kmeans, Coverage,
etc.
Includes visualizations for many of the modeling
approaches
Includes a set of data transformations
Attribute selection, binning, filtering,
attribute construction
Includes optimization strategy for searching
parameter space

12
D2K Features

Extension of existing API and Batch Interface for
Execution
Provides the capability to programmatically
connect modules and set properties.
Enables D2K-driven applications to be developed.
Provides ability to pause and restart an
itinerary.
Enhanced Distributed Computing
Allows modules that are re-entrant to be executed
remotely.
Includes interface for specifying the runtime
layout of a distributed itinerary.
Processor Status Overlay
Shows utilization of distributed computing
resources.
Distributed Checkpointing
Resource Manager
Provides a mechanism for treating selected data
structures as if they were stored in global
memory.
Provides memory space that is accessible from
multiple modules running locally as well as
remotely.
Easily publish itineraries in a D2K Web Service

13
D2K ToolKit

Workspace
Resource Panel
Modules
Models
Itineraries
Visualizations
Generated Visualizations
Generated Models
Component Information
Toolbar
Console

14
D2K Modules

Input Module: Loads data from the outside world.
Flat files, databases, etc.
Data Prep Module: Performs functions to select,
clean, or transform the data.
Binning, Normalizing, Feature Selection, etc.
Compute Module: Performs main algorithmic
computations.
Naïve Bayesian, Decision Tree, Apriori, FP
Growth, etc.
User Input Module: Requires interaction with the
user.
Data Selection, Input and Output selection, etc.
Output Module: Saves data to the outside world.
Flat files, databases, etc.
Visualization Module: Provides visual feedback to
the user.
Naïve Bayesian, Rule Association, Decision Tree,
Parallel Coordinates, 2D Scatterplot, 3D Surface
Plot
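
As an illustration of the kind of work a Data Prep Module does, the following is a minimal sketch of normalizing and binning in plain Python. It is not D2K code: the function names `min_max_normalize` and `equal_width_bins` and the sample values are assumptions made for this example, and D2K's own binning modules may use different strategies.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Made-up sample attribute values, for illustration only.
sample = [4.0, 5.0, 6.0, 8.0]
normalized = min_max_normalize(sample)
bins = equal_width_bins(sample, 2)
```

Equal-width binning is only one option; equal-frequency or supervised binning would replace the body of `equal_width_bins` without changing how the result is consumed downstream.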

15
D2K Module Icon Description

Module Progress Bar
Appears during execution to show the percentage
of time that this module executed over the entire
execution time. It is green when the module is
executing and red when not.
Input Port
Rectangular shapes on the left side of the module
represent the inputs for the module. They are
colored according to the data type that they
represent.
Properties Symbol
If a P is shown in the lower left corner of the
module, then the module has properties that can
be set before execution.

Output Port
Rectangular shapes on the right side
of the module represent the outputs for the
module. They are colored according to the data
type that they represent.

16
D2K Itineraries

Itineraries are partial or complete applications
composed of connected modules.
D2K Core Itineraries include
Prediction
Discovery
Data Selection
Transformation
Visualization
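
The idea of an itinerary as connected modules can be sketched in a few lines of plain Python. This is a simplified illustration, not the D2K API (D2K modules are written in Java and have multiple typed ports): the `Module` class, its `connect` method, and the `run` function are hypothetical names, and each module here has a single input and a single output.

```python
class Module:
    """A computational unit wrapping one function (hypothetical, not the D2K API)."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.downstream = []  # modules fed by this module's output

    def connect(self, other):
        """Connect this module's output to another module's input."""
        self.downstream.append(other)
        return other  # returning the target allows chained connect() calls


def run(source, value=None):
    """Fire the source module and push data along every connection.

    Returns the outputs of terminal modules (those with nothing downstream),
    keyed by module name.
    """
    results = {}

    def fire(module, data):
        out = module.func(data)
        if not module.downstream:
            results[module.name] = out
        for nxt in module.downstream:
            fire(nxt, out)

    fire(source, value)
    return results


# A tiny itinerary: one input module feeding two compute modules.
load = Module("load", lambda _: [3, 1, 2])
sort = Module("sort", sorted)
total = Module("total", sum)
load.connect(sort)
load.connect(total)
results = run(load)
```

Here `results` holds the output of each terminal module, which loosely mirrors how an itinerary ends in output or visualization modules.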

17
Workspace

The Workspace is the area where applications are
formed.
Modules are placed, connected, and properties
set.
Itineraries are saved and executed.

18
Resource Panel

The area to the left of the Workspace that
contains the components necessary for data
analysis.
Modules
Models
Itineraries
Visualizations

19
Session Panes

Component Information
Shows detailed information about components of
D2K
Shows module information, inputs, outputs, and
property descriptions
Shows itinerary annotation.
Generated Visualization
Shows visualizations generated during this
session
Provides ability to save these visualizations for
later use.
Generated Models
Shows models generated during this session
Provides ability to save these models for
later use.

20
D2K Setup

Preferences
Written to a file called .d2kV4.props
Set up automatically the first time D2K is
installed
Changed via the Edit menu > Preferences
Some changes require a restart of D2K.
Check the User Manual for more details (available
online).

21
Using the Toolkit

Build an itinerary for loading data and viewing
it in a TableViewer
Drag and Drop Modules from Modules Pane of
Resource Panel to the Workspace as shown
Expand directory ncsa/io/file/input
Drag and Drop Input1Filename to Workspace
Drag and Drop CreateDelimitedParser to Workspace
Drag and Drop ParseFileToTable to Workspace
Expand directory ncsa/vis
Drag and Drop TableViewer to Workspace

22
Using the Toolkit (contd)

Connect the modules as shown
Drag from the output port of one module to the
input port of the next module.
Check the properties of modules by double
clicking on the module.
Input File Name
Choose data/UCI/iris.csv
Create Delimited File Parser
Defaults work
Parse File To Table
Defaults work
Click Run to execute.
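
For readers without the Toolkit at hand, the same four-stage flow (file name, delimited parser, table, viewer) can be approximated in plain Python. This is a sketch under stated assumptions, not D2K code: the function names only echo the module names above, and the inline `SAMPLE` text is made-up data standing in for the contents of a delimited file such as data/UCI/iris.csv.

```python
import csv
import io

# Made-up delimited text standing in for a file on disk (illustrative values only).
SAMPLE = (
    "sepal_length,sepal_width,species\n"
    "5.1,3.5,setosa\n"
    "7.0,3.2,versicolor\n"
)


def create_delimited_parser(text, delimiter=","):
    """Return a row iterator over delimited text."""
    return csv.reader(io.StringIO(text), delimiter=delimiter)


def parse_to_table(parser):
    """Read all rows; treat the first row as column labels."""
    rows = list(parser)
    return {"columns": rows[0], "rows": rows[1:]}


def view_table(table):
    """Render the table as tab-separated text, one record per line."""
    lines = ["\t".join(table["columns"])]
    lines += ["\t".join(row) for row in table["rows"]]
    return "\n".join(lines)


table = parse_to_table(create_delimited_parser(SAMPLE))
print(view_table(table))
```

Each function consumes exactly what the previous one produces, which is the same contract the connected ports express in the Workspace.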

23
Variation Using a Nested Itinerary

An itinerary can be used as a module (a nested
itinerary).
Properties can be set by holding Control and
double-clicking on the nested itinerary.
Then connect the input and output ports of
the nested itinerary as one would for any other
module.
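
Nesting can be pictured as function composition: a chain of stages is wrapped so that it behaves like a single stage. The sketch below is plain Python, not D2K code; the helper `chain` and the stage functions are hypothetical names chosen for this example.

```python
def chain(*stages):
    """Compose stages into one callable that runs them left to right."""

    def composite(data):
        for stage in stages:
            data = stage(data)
        return data

    return composite


# Inner "itinerary": strip whitespace, then drop empty strings.
clean = chain(
    lambda items: [s.strip() for s in items],
    lambda items: [s for s in items if s],
)

# Outer "itinerary": the nested chain is used like any other single stage.
pipeline = chain(clean, sorted)

result = pipeline([" b ", "", "a"])
```

Because `clean` exposes the same one-input, one-output shape as any individual stage, the outer chain does not need to know that it contains several steps.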

DISCOVERY
RULE ASSOCIATION
Using FP Growth

25
Overview
Discovery Rule Association

Unsupervised learning problem.
Find all rules that correlate the presence of one
set of items X with another item Y.
Example: When a customer buys bread and butter,
they buy milk 85% of the time.
Support is the percentage of the records that
contain both X and Y.
A rule must have some minimum user-specified
support to show its impact.
Confidence is the percentage of records that
contain X and Y out of the number of records that
contain X.
A rule must have some minimum user-specified
confidence to show its value.
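
The two definitions above translate directly into code. The following is a minimal Python sketch with a made-up list of five transactions; the function names `support` and `confidence` are chosen for this example and are not part of D2K.

```python
def support(transactions, x, y):
    """Fraction of all records that contain every item in X and also item Y."""
    both = sum(1 for t in transactions if x <= t and y in t)
    return both / len(transactions)


def confidence(transactions, x, y):
    """Among records that contain X, the fraction that also contain Y."""
    has_x = [t for t in transactions if x <= t]
    if not has_x:
        return 0.0  # X never occurs, so the rule has no evidence
    return sum(1 for t in has_x if y in t) / len(has_x)


# Made-up market-basket data, for illustration only.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"eggs"},
]

x = {"bread", "butter"}
s = support(transactions, x, "milk")      # 2 of 5 records contain X and Y
c = confidence(transactions, x, "milk")   # 2 of the 3 records containing X
```

With this data the rule "bread and butter implies milk" has support 2/5 and confidence 2/3; a rule is kept only if both values meet the user-specified minimums.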

26
Strengths and Weaknesses

Strengths
It produces easy to understand results.
It supports undirected data mining.
It works on variable length data.
Rules are relatively easy to compute.
Weaknesses
It produces many rules.
For large numbers of attribute-value
combinations, considerable CPU and memory
resources are consumed.
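
A quick count shows why the number of rules grows so fast. Assuming each attribute-value combination is treated as one item and a rule has the form X implies Y with Y a single item, there are n choices for Y and 2^(n-1) - 1 non-empty choices for X among the remaining items. The Python sketch below (the function name is an assumption for this example) counts these candidates before any support or confidence pruning.

```python
def candidate_rule_count(n_items):
    """Count rules X -> Y where Y is one item and X is a non-empty set of other items."""
    return n_items * (2 ** (n_items - 1) - 1)


counts = {n: candidate_rule_count(n) for n in (3, 10, 20)}
```

Even 20 items yield over ten million candidate rules, which is why minimum support thresholds and algorithms such as Apriori or FP Growth are used to avoid enumerating them all.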

27
Opening the Itinerary