Title: D2K Tutorial
1. D2K Tutorial
2. Outline
- Motivation for D2K
- Overview of D2K Functionality
- Example Discovery Rule Association
- References
3. Motivation for D2K
- Create a framework that could be used and extended for our needs in knowledge discovery (data mining).
- D2K is a flexible data mining system that integrates effective analytical data mining methods for prediction, discovery, and anomaly detection with data management and information visualization.
- Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
- The understandable patterns are used to:
  - Make predictions about or classifications of new data
  - Explain existing data
  - Summarize the contents of a large database to support decision making
  - Create graphical data visualization to aid humans in discovering complex patterns
4. Knowledge Discovery Process
5. Required Effort for each KDD Step
- Arrows indicate the direction we want the effort to go.
6. Three Primary Paradigms
- Predictive Modeling: supervised learning approach where classification or prediction of one of the attributes is desired.
  - Classification is the prediction of predefined classes
    - e.g. Naïve Bayesian, Decision Trees, and Neural Networks
  - Regression is the prediction of continuous data
    - e.g. Neural Networks and Decision (Regression) Trees
- Discovery: unsupervised learning approach for exploratory data analysis.
  - e.g. Association Rules, Link Analysis, Clustering, and Self Organizing Maps
- Deviation Detection: identifying outliers in the data.
  - e.g. Visualization
7. Advantages of a Framework for Analytics
- Provides a scalable environment from the Desktop to Web Services
- Employs a visual programming system for the data/work flow paradigm
- Provides capability to build custom applications
- Provides capability to access data management tools
- Contains data mining algorithms for prediction and discovery
- Provides data transformations for standard operations
- Integrated environment for models and visualization
- Supports an extensible interface for creating one's own algorithms
- Provides access to distributed computing capabilities
- Employs multi-layered learning strategies
8. D2K and Its Many Components
- D2K Infrastructure
  - D2K API, data flow environment, distributed computing framework and runtime system
- D2K Modules
  - Computational units written in Java that follow the D2K API
- D2K Itineraries
  - Modules that are connected to form an application
- D2K Toolkit
  - User interface for specification of itineraries and execution that provides the rapid application development environment
- D2K-Driven Applications
  - Applications that use D2K modules with a custom user interface
- D2K Streamline (SL)
  - Task driven system that uses D2K modules
- D2K Web Services
  - Enables web deployment
9. D2K Streamline (D2K SL)
- Provides a step-by-step interface to guide the user in data analysis
- Supports returning to earlier steps to run with different parameters
- Uses the D2K infrastructure transparently
- Uses the same D2K modules
- Provides a way to capture different experiments
- Defines templates that can be reused in different experiments
10. D2K Web Service Architecture
- Any web enabled client can connect to and use the D2K Web Service by sending SOAP messages over HTTP.
- Itineraries and modules are stored on the web service machine and loaded over the network by the D2K Servers.
- Job results are also stored in the web service tier.
- Results are returned to clients upon request.
- A relational database is used by the web service to look up accounts, itineraries, servers, and jobs.
- Remote D2K Servers handle itinerary processing. If possible, modules should load any data from remote locations.
11. D2K Basic
D2K Overview
- Set of D2K Modules to perform data mining techniques
  - Prediction
    - Decision Trees
      - C4.5 Decision Tree, Continuous Decision Tree, SQL Rain Forest Decision Tree
    - Naïve Bayesian Classification and SQL Naïve Bayesian Classification
    - Neural Networks
  - Discovery
    - Rule Association
      - Apriori, FP Growth, Htree
    - Clustering
      - Hierarchical Agglomerative, Kmeans, Coverage, etc.
- Includes visualizations for many of the modeling approaches
- Includes a set of data transformations
  - Attribute selection, binning, filtering, attribute construction
- Includes optimization strategy for searching parameter space
12. D2K Features
- Extension of existing API and Batch Interface for Execution
  - Provides the capability to programmatically connect modules and set properties.
  - Enables D2K-driven applications to be developed.
  - Provides ability to pause and restart an itinerary.
- Enhanced Distributed Computing
  - Allows modules that are re-entrant to be executed remotely.
  - Includes interface for specifying the runtime layout of a distributed itinerary.
- Processor Status Overlay
  - Shows utilization of distributed computing resources.
- Distributed Checkpointing
- Resource Manager
  - Provides a mechanism for treating selected data structures as if they were stored in global memory.
  - Provides memory space that is accessible from multiple modules running locally as well as remotely.
- Easily publish itineraries in a D2K Web Service
13. D2K Toolkit
- Workspace
- Resource Panel
  - Modules
  - Models
  - Itineraries
  - Visualizations
- Generated Visualizations
- Generated Models
- Component Information
- Toolbar
- Console
14. D2K Modules
- Input Module: Loads data from the outside world.
  - Flat files, databases, etc.
- Data Prep Module: Performs functions to select, clean, or transform the data.
  - Binning, Normalizing, Feature Selection, etc.
- Compute Module: Performs main algorithmic computations.
  - Naïve Bayesian, Decision Tree, Apriori, FP Growth, etc.
- User Input Module: Requires interaction with the user.
  - Data Selection, Input and Output selection, etc.
- Output Module: Saves data to the outside world.
  - Flat files, databases, etc.
- Visualization Module: Provides visual feedback to the user.
  - Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot
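As a rough illustration of the binning step listed under the Data Prep Module, the sketch below assigns numeric values to equal-width bins. It is not D2K code: the function name equal_width_bins, the equal-width strategy, and the sample values are assumptions made for illustration, and the D2K binning modules may work differently.

```python
def equal_width_bins(values, n_bins):
    """Return, for each value, the index of the equal-width bin it falls in."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything lands in the first bin.
        return [0 for _ in values]
    # The maximum value would compute to index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical measurements, invented for this example.
petal_lengths = [1.0, 1.5, 3.0, 4.5, 5.0, 7.0]
bins = equal_width_bins(petal_lengths, 3)
print(bins)
```

Turning continuous attributes into a small number of labeled bins is what lets rule association algorithms such as Apriori or FP Growth treat each attribute-value pair as a discrete item.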
15. D2K Module Icon Description
- Module Progress Bar
  - Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when not.
- Input Port
  - Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent.
- Properties Symbol
  - If a P is shown in the lower left corner of the module, then the module has properties that can be set before execution.
- Output Port
  - Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent.
16. D2K Itineraries
- Itineraries are partial or complete applications composed of connected modules.
- D2K Core Itineraries include
  - Prediction
  - Discovery
  - Data Selection
  - Transformation
  - Visualization
17. Workspace
- The Workspace is the area where applications are formed.
  - Modules are placed, connected, and properties set.
  - Itineraries are saved and executed.
18. Resource Panel
- The area to the left of the Workspace that contains the components necessary for data analysis.
  - Modules
  - Models
  - Itineraries
  - Visualizations
19. Session Panes
- Component Information
  - Shows detailed information about components of D2K
  - Shows module information, inputs, outputs, and property descriptions
  - Shows itinerary annotation.
- Generated Visualizations
  - Shows visualizations generated during this session
  - Provides ability to save these visualizations for later use.
- Generated Models
  - Shows models generated during this session
  - Provides ability to save these models for later use.
20. D2K Setup
- Preferences
  - Written to a file called .d2kV4.props
  - Set up automatically the first time D2K is installed
  - Changed via the Edit menu (Preferences)
  - Some changes require a restart of D2K.
- Check the User Manual for more details (available online).
21. Using the Toolkit
- Build an itinerary for loading data and viewing it in a TableViewer
- Drag and drop modules from the Modules pane of the Resource Panel to the Workspace as shown
  - Expand directory ncsa/io/file/input
    - Drag and drop Input1Filename to Workspace
    - Drag and drop CreateDelimitedParser to Workspace
    - Drag and drop ParseFileToTable to Workspace
  - Expand directory ncsa/vis
    - Drag and drop TableViewer to Workspace
22. Using the Toolkit (cont'd)
- Connect the modules as shown.
  - Drag from the output port of one module to the input port of the next module.
- Check the properties of modules by double clicking on the module.
  - Input File Name
    - Choose data/UCI/iris.csv
  - Create Delimited File Parser
    - Defaults work
  - Parse File To Table
    - Defaults work
- Click Run to execute.
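For readers who think in code, the data flow of this itinerary (file name, delimited parser, table, viewer) can be approximated with plain Python functions chained output to input. This is only an analogy: the function names loosely mirror the module names, none of it is the D2K API, and the inline sample stands in for data/UCI/iris.csv, whose exact contents are not reproduced here.

```python
import csv
import io


def create_delimited_parser(handle, delimiter=","):
    """Stand-in for the parser-creating step: wrap a file handle in a CSV reader."""
    return csv.reader(handle, delimiter=delimiter)


def parse_file_to_table(parser):
    """Stand-in for the table-building step: first row is the header, the rest are data."""
    rows = [row for row in parser if row]  # skip blank lines
    return {"columns": rows[0], "rows": rows[1:]}


def table_viewer(table):
    """Stand-in for the viewer: render the table as aligned text columns."""
    all_rows = [table["columns"], *table["rows"]]
    widths = [max(len(cell) for cell in column) for column in zip(*all_rows)]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths))
        for row in all_rows
    )


# Hypothetical sample in the spirit of the iris data; the values are invented.
SAMPLE = (
    "sepal-length,petal-length,flower-type\n"
    "5.1,1.4,Iris-setosa\n"
    "6.3,4.9,Iris-versicolor\n"
)

# Each function's output feeds the next one's input, like connected ports.
table = parse_file_to_table(create_delimited_parser(io.StringIO(SAMPLE)))
print(table_viewer(table))
```

To read a real file instead, the io.StringIO(SAMPLE) argument could be replaced by an open file handle, for example open(path, newline="").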
23. Variation Using a Nested Itinerary
- An itinerary can be used as a module (a nested itinerary).
- Properties can be set by holding Control and double clicking on the nested itinerary.
- Then connect the input and output ports of the nested itinerary as one would for any other module.
24. DISCOVERY
- RULE ASSOCIATION
- Using FP Growth
25. Overview
Discovery Rule Association
- Unsupervised learning problem.
- Find all rules that correlate the presence of one set of items X with another item Y.
  - Example: When a customer buys bread and butter, they buy milk 85% of the time.
- Support is the percentage of the records that contain both X and Y.
  - A rule must have some minimum user-specified support to show its impact.
- Confidence is the percentage of records that contain X and Y out of the number of records that contain X.
  - A rule must have some minimum user-specified confidence to show its value.
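The two measures defined above can be computed directly from a list of records. The following is a minimal sketch, not the D2K implementation: the function name support_and_confidence and the five sample baskets are assumptions made for illustration.

```python
def support_and_confidence(records, antecedent, consequent):
    """Return (support, confidence) in percent for the rule antecedent -> consequent.

    support    = records containing X and Y / all records
    confidence = records containing X and Y / records containing X
    """
    x = set(antecedent)
    xy = x | {consequent}
    n_x = sum(1 for record in records if x <= set(record))
    n_xy = sum(1 for record in records if xy <= set(record))
    support = 100.0 * n_xy / len(records)
    confidence = 100.0 * n_xy / n_x if n_x else 0.0
    return support, confidence


# Hypothetical market-basket records, invented for this example.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"eggs"},
]

support, confidence = support_and_confidence(baskets, {"bread", "butter"}, "milk")
print(f"support = {support:.1f}%, confidence = {confidence:.1f}%")
```

In this sample, two of the five baskets contain bread, butter, and milk (support 40%), and two of the three baskets with bread and butter also contain milk (confidence about 66.7%). A rule would be kept only if both values meet the user-specified minimums.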
26. Strengths and Weaknesses
- Strengths
  - It produces easy to understand results.
  - It supports undirected data mining.
  - It works on variable length data.
  - Rules are relatively easy to compute.
- Weaknesses
  - It produces many rules.
  - For large numbers of attribute-value combinations, considerable CPU and memory resources are consumed.
27. Opening the Itinerary
- Click on the Itinerary Pane in the Resource Panel.
- Expand the Discovery directory with a single click.
- Expand the RuleAssociation directory with a single click.
- Double click on fp-growth-binning to load the itinerary into your Workspace.
28. Accessing the Data
- Identify the data to use
- Form rules
- View rules in report or visualization
29. Rule Association Visualization
- Read rules down the column.
- Example: the first rule is
  - If petal-length = Binned2. and petal-width = Binned0.7 then flower-type = Iris-setosa
  - Support: 25%
  - Confidence: 100%
- Use brushing to find out support and confidence.
- Click on the Confidence label to sort by confidence.
- Click on the Support label to sort by support.
- Additional functionality for searching/sorting is planned.
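The sorting described above amounts to ordering the rule list by one of its two measures. The sketch below shows that idea outside the visualization. The Rule class, the sort_rules helper, the sample rules, and the choice of descending order are all assumptions made for illustration and do not come from D2K.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    antecedent: tuple   # items on the "if" side
    consequent: str     # item on the "then" side
    support: float      # percent of all records
    confidence: float   # percent of records matching the antecedent


def sort_rules(rules, key):
    """Return rules ordered by 'support' or 'confidence', highest first (assumed order)."""
    if key not in ("support", "confidence"):
        raise ValueError("key must be 'support' or 'confidence'")
    return sorted(rules, key=lambda rule: getattr(rule, key), reverse=True)


# Hypothetical rules with invented measures.
rules = [
    Rule(("bread", "butter"), "milk", support=40.0, confidence=66.7),
    Rule(("bread",), "milk", support=60.0, confidence=75.0),
    Rule(("eggs",), "flour", support=10.0, confidence=100.0),
]

by_confidence = sort_rules(rules, "confidence")
by_support = sort_rules(rules, "support")
print([rule.consequent for rule in by_confidence])
```

Sorting by confidence surfaces the most reliable rules first, while sorting by support surfaces the rules that apply to the most records, so the two orderings often differ.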
30. D2K Release
- Review Modules
- Review of Itineraries
31. The ALG Team
- Staff
- Bernie Acs
- Loretta Auvil
- David Clutter
- Vered Goren
- Eugene Grois
- Luigi Marini
- Robert McGrath
- Chris Navarro
- Greg Pape
- Barry Sanders
- Andrew Shirk
- David Tcheng
- Michael Welge
- Students
- Chen Chen
- Hong Cheng
- Yaniv Eytani
- Fang Guo
- Govind Kabra
- Chao Liu
- Haitao Mo
- Xuanhui Wang
- Qian Yang
- Feida Zhu
32. References
- http://alg.ncsa.uiuc.edu/do/tools/d2k/documentation
- http://alg.ncsa.uiuc.edu/do/tools/d2k/tutorials
- http://alg.ncsa.uiuc.edu/tools/docs/d2k/manual/index.html
- http://alg.ncsa.uiuc.edu/tools/docs/d2k/principles/index.html
- http://alg.ncsa.uiuc.edu/tools/docs/d2k/faq/faq.html