D2K Tutorial - PowerPoint PPT Presentation

Provided by: lorett5
1
D2K Tutorial
  • D2K Overview

2
Outline
  • Motivation for D2K
  • Overview of D2K Functionality
  • Example: Discovery (Rule Association)
  • References
  • http://alg.ncsa.uiuc.edu/do/tools/d2k/documentation
  • http://alg.ncsa.uiuc.edu/do/tools/d2k/tutorials
  • http://alg.ncsa.uiuc.edu/tools/docs/d2k/manual/index.html
  • http://alg.ncsa.uiuc.edu/tools/docs/d2k/principles/index.html
  • http://alg.ncsa.uiuc.edu/tools/docs/d2k/faq/faq.html

3
Motivation for D2K
  • Create a framework that could be used and
    extended for our needs in knowledge discovery
    (data mining).
  • D2K is a flexible data mining system that
    integrates effective analytical data mining
    methods for prediction, discovery, and anomaly
    detection with data management and information
    visualization.
  • Knowledge Discovery in Databases is the
    non-trivial process of identifying valid, novel,
    potentially useful, and ultimately understandable
    patterns in data.
  • The understandable patterns are used to:
  • Make predictions about or classifications of new
    data
  • Explain existing data
  • Summarize the contents of a large database to
    support decision making
  • Create graphical data visualization to aid humans
    in discovering complex patterns

4
Knowledge Discovery Process
5
Required Effort for each KDD Step
  • Arrows indicate the direction we want the effort
    to go.

6
Three Primary Paradigms
  • Predictive Modeling: a supervised learning
    approach where classification or prediction of
    one of the attributes is desired.
  • Classification is the prediction of predefined
    classes (e.g. Naive Bayesian, Decision Trees, and
    Neural Networks).
  • Regression is the prediction of continuous data
    (e.g. Neural Networks and Decision (Regression)
    Trees).
  • Discovery: an unsupervised learning approach for
    exploratory data analysis (e.g. Association
    Rules, Link Analysis, Clustering, and Self
    Organizing Maps).
  • Deviation Detection: identifying outliers in the
    data (e.g. Visualization).

7
Advantages of a Framework for Analytics
  • Provides scalable environment from the Desktop to
    Web Services
  • Employs a visual programming system for data/work
    flow paradigm
  • Provides capability to build custom applications
  • Provides capability to access data management
    tools
  • Contains data mining algorithms for prediction
    and discovery
  • Provides data transformations for standard
    operations
  • Integrated environment for models and
    visualization
  • Supports an extensible interface for creating
    one's own algorithms
  • Provides access to distributed computing
    capabilities
  • Employs multi-layered learning strategies

8
D2K and Its Many Components
  • D2K Infrastructure: D2K API, data flow
    environment, distributed computing framework,
    and runtime system
  • D2K Modules: computational units written in Java
    that follow the D2K API
  • D2K Itineraries: modules that are connected to
    form an application
  • D2K Toolkit: user interface for specifying and
    executing itineraries that provides the rapid
    application development environment
  • D2K-Driven Applications: applications that use
    D2K modules with a custom user interface
  • D2K Streamline (SL): task-driven system that uses
    D2K modules
  • D2K Web Services: enables web deployment

9
D2K Streamline (D2K SL)
  • Provides a step-by-step interface to guide the
    user in data analysis
  • Supports return to earlier steps to run different
    parameters
  • Uses the D2K infrastructure transparently
  • Uses the same D2K modules
  • Provides a way to capture different experiments
  • Defines templates that can be reused in different
    experiments

10
D2K Web Service Architecture
  • Any web enabled client can connect to and use the
    D2K Web Service by sending SOAP messages over
    HTTP.
  • Itineraries and modules are stored on the web
    service machine and loaded over the network by
    the D2K Servers.
  • Job results are also stored in the web service
    tier.
  • Results are returned to clients upon request.
  • A relational database is used by the web service
    to look up accounts, itineraries, servers, and
    jobs.
  • Remote D2K Servers handle itinerary processing.
    If possible, modules should load any data from
    remote locations.
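
To make "sending SOAP messages over HTTP" concrete, the following is a minimal Python sketch that builds a SOAP 1.1 envelope. The slides do not document the D2K Web Service operations, so the operation name `runItinerary`, the parameter `itineraryName`, and the namespace `urn:example:d2k` are all hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Standard SOAP 1.1 envelope namespace.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace; not taken from the D2K documentation.
SERVICE_NS = "urn:example:d2k"


def build_soap_request(operation, params):
    """Build a SOAP 1.1 envelope with one operation element in the Body."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    op = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        child = ET.SubElement(op, f"{{{SERVICE_NS}}}{name}")
        child.text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical call: ask the service to run a named itinerary.
request_xml = build_soap_request(
    "runItinerary", {"itineraryName": "fp-growth-binning"}
)
print(request_xml)
```

A client would send this string as the body of an HTTP POST to the service endpoint, which is why any web-enabled client can participate.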

11
D2K Basic
D2K Overview
  • Set of D2K Modules to perform data mining
    techniques
  • Prediction
  • Decision Trees
  • C4.5 Decision Tree, Continuous Decision Tree, SQL
    Rain Forest Decision Tree
  • Naïve Bayesian Classification and SQL Naïve
    Bayesian Classification
  • Neural Networks
  • Discovery
  • Rule Association
  • Apriori, FP Growth, Htree
  • Clustering
  • Hierarchical Agglomerative, Kmeans, Coverage,
    etc.
  • Includes visualizations for many of the modeling
    approaches
  • Includes a set of data transformations
  • Attribute selection, binning, filtering,
    attribute construction
  • Includes optimization strategy for searching
    parameter space
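
As an illustration of the binning transformation listed above, here is a minimal Python sketch of equal-width binning. The slides do not say which binning methods D2K offers, so equal-width binning is an assumption, and the function name and sample values are made up for the example.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index in [0, n_bins - 1] using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    labels = []
    for v in values:
        idx = int((v - lo) / width)
        # The maximum value would land one past the end; clamp it into the last bin.
        labels.append(min(idx, n_bins - 1))
    return labels


# Made-up measurements, split into three bins of width 2.0.
petal_lengths = [1.0, 1.5, 3.0, 4.5, 5.0, 7.0]
print(equal_width_bins(petal_lengths, 3))  # [0, 0, 1, 1, 2, 2]
```

Binning turns a continuous attribute into a small set of discrete values, which is what rule association methods need as input.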

12
D2K Features
D2K Overview
  • Extension of existing API and Batch Interface for
    Execution
  • Provides the capability to programmatically
    connect modules and set properties.
  • Enables D2K-driven applications to be developed.
  • Provides ability to pause and restart an
    itinerary.
  • Enhanced Distributed Computing
  • Allows modules that are re-entrant to be executed
    remotely.
  • Includes interface for specifying the runtime
    layout of a distributed itinerary.
  • Processor Status Overlay
  • Shows utilization of distributed computing
    resources.
  • Distributed Checkpointing
  • Resource Manager
  • Provides a mechanism for treating selected data
    structures as if they were stored in global
    memory.
  • Provides memory space that is accessible from
    multiple modules running locally as well as
    remotely.
  • Easily publish itineraries in a D2K Web Service

13
D2K ToolKit
D2K Overview
  • Workspace
  • Resource Panel
  • Modules
  • Models
  • Itineraries
  • Visualizations
  • Generated Visualizations
  • Generated Models
  • Component Information
  • Toolbar
  • Console

14
D2K Modules
D2K Overview
  • Input Module: loads data from the outside world
    (flat files, databases, etc.).
  • Data Prep Module: performs functions to select,
    clean, or transform the data (Binning,
    Normalizing, Feature Selection, etc.).
  • Compute Module: performs main algorithmic
    computations (Naïve Bayesian, Decision Tree,
    Apriori, FP Growth, etc.).
  • User Input Module: requires interaction with the
    user (Data Selection, Input and Output selection,
    etc.).
  • Output Module: saves data to the outside world
    (flat files, databases, etc.).
  • Visualization Module: provides visual feedback to
    the user (Naïve Bayesian, Rule Association,
    Decision Tree, Parallel Coordinates, 2D
    Scatterplot, 3D Surface Plot).

15
D2K Module Icon Description
D2K Overview
  • Module Progress Bar: appears during execution to
    show the percentage of time that this module
    executed over the entire execution time. It is
    green when the module is executing and red when
    not.
  • Input Port: rectangular shapes on the left side
    of the module represent the inputs for the
    module. They are colored according to the data
    type that they represent.
  • Output Port: rectangular shapes on the right side
    of the module represent the outputs for the
    module. They are colored according to the data
    type that they represent.
  • Properties Symbol: if a P is shown in the lower
    left corner of the module, then the module has
    properties that can be set before execution.

16
D2K Itineraries
D2K Overview
  • Itineraries are partial or complete applications
    composed of connected modules.
  • D2K Core Itineraries include
  • Prediction
  • Discovery
  • Data Selection
  • Transformation
  • Visualization
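
The idea of an itinerary as connected modules can be sketched in a few lines of Python. This is not the D2K API (D2K modules are written in Java and have typed input and output ports); the `Module` and `Itinerary` classes below are hypothetical and simplify an itinerary to a linear chain with one input and one output per module.

```python
class Module:
    """A hypothetical computational unit: takes one input, returns one output."""

    def __init__(self, name, func):
        self.name = name
        self.func = func

    def run(self, data):
        return self.func(data)


class Itinerary:
    """A linear chain of modules; each module's output feeds the next one's input."""

    def __init__(self, modules):
        self.modules = list(modules)

    def run(self, data=None):
        for module in self.modules:
            data = module.run(data)
        return data


# Three toy modules named after the module categories in the slides.
load = Module("Input", lambda _: [3, 1, 2])
prep = Module("Data Prep", lambda rows: sorted(rows))
compute = Module("Compute", lambda rows: sum(rows) / len(rows))

itinerary = Itinerary([load, prep, compute])
result = itinerary.run()
print(result)  # 2.0
```

Because `Itinerary` exposes the same `run` method as `Module`, an itinerary in this sketch can itself be placed inside another itinerary, which mirrors the nested itineraries that D2K supports.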

17
Workspace
D2K Overview
  • The Workspace is the area where applications are
    formed.
  • Modules are placed, connected, and properties
    set.
  • Itineraries are saved and executed.

18
Resource Panel
D2K Overview
  • The area to the left of the Workspace that
    contains the components necessary for data
    analysis.
  • Modules
  • Models
  • Itineraries
  • Visualizations

19
Session Panes
D2K Overview
  • Component Information
  • Shows detailed information about components of
    D2K
  • Shows module information, inputs, outputs, and
    property descriptions
  • Shows itinerary annotation.
  • Generated Visualization
  • Shows visualizations generated during this
    session
  • Provides ability to save these visualizations for
    later use.
  • Generated Models
  • Shows models generated during this session
  • Provides ability to save these models for later
    use.

20
D2K Setup
D2K Overview
  • Preferences
  • Written to a file called .d2kV4.props
  • Set up automatically the first time D2K is
    installed
  • Changed via the Preferences item of the Edit menu
  • Some changes require a restart of D2K.
  • Check the User Manual for more details (available
    online).

21
Using the Toolkit
D2K Overview
  • Build an itinerary for loading data and viewing
    it in a TableViewer
  • Drag and Drop Modules from Modules Pane of
    Resource Panel to the Workspace as shown
  • Expand directory ncsa/io/file/input
  • Drag and Drop Input1Filename to Workspace
  • Drag and Drop CreateDelimitedParser to Workspace
  • Drag and Drop ParseFileToTable to Workspace
  • Expand directory ncsa/vis
  • Drag and Drop TableViewer to Workspace

22
Using the Toolkit (contd)
D2K Overview
  • Connect the modules as shown
  • Drag from the output port of one module to the
    input port of the next module.
  • Check the properties of modules by
    double-clicking on the module.
  • Input File Name: choose data/UCI/iris.csv
  • Create Delimited File Parser: defaults work
  • Parse File To Table: defaults work
  • Click Run to execute.
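
For readers without D2K at hand, the itinerary above (file name, delimited parser, parse to table, table viewer) corresponds roughly to the following plain Python sketch. It does not use any D2K API; `parse_file_to_table` is a hypothetical stand-in, and the inline sample, including its column names and layout, is an assumption about what data/UCI/iris.csv looks like.

```python
import csv
import io

# A few sample rows in the style of the iris data set; the real file's
# exact header and layout are assumptions.
SAMPLE = """sepal-length,sepal-width,petal-length,petal-width,flower-type
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""


def parse_file_to_table(handle, delimiter=","):
    """Read delimited text into a (header, rows) pair, skipping blank lines."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    rows = [row for row in reader if row]
    return header, rows


header, rows = parse_file_to_table(io.StringIO(SAMPLE))

# Stand-in for the TableViewer: print the table to the console.
print(header)
for row in rows:
    print(row)
```

To read a real file instead of the inline sample, pass `open(path, newline="")` as the handle.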

23
Variation Using a Nested Itinerary
D2K Overview
  • An itinerary can be used as a module (a nested
    itinerary).
  • Properties can be set by holding Control and
    double-clicking on the nested itinerary.
  • Then connect the input and output ports of the
    nested itinerary as one would for any other
    module.

24
  • DISCOVERY
  • RULE ASSOCIATION
  • Using FP Growth

25
Overview
Discovery Rule Association
  • Unsupervised learning problem.
  • Find all rules that correlate the presence of one
    set of items X with another item Y.
  • Example: when a customer buys bread and butter,
    they buy milk 85% of the time.
  • Support is the percentage of the records that
    contain both X and Y.
  • A rule must have some minimum user-specified
    support to show its impact.
  • Confidence is the percentage of records that
    contain X and Y out of the number of records that
    contain X.
  • A rule must have some minimum user-specified
    confidence to show its value.
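
The two definitions above translate directly into code. The following Python sketch computes support and confidence for the rule "bread and butter, then milk" over a made-up set of five shopping baskets; the function names and the data are illustrative only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of X and Y together, divided by the support of X alone."""
    both = support(transactions, set(antecedent) | set(consequent))
    x_only = support(transactions, antecedent)
    return both / x_only if x_only else 0.0


# Made-up baskets for illustration.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"eggs"},
]

# 2 of 5 baskets contain bread, butter, and milk together.
s = support(baskets, {"bread", "butter", "milk"})
# 3 baskets contain bread and butter; 2 of those also contain milk.
c = confidence(baskets, {"bread", "butter"}, {"milk"})
print(f"support={s:.0%} confidence={c:.0%}")
```

A rule is kept only if both values meet the user-specified minimums.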

26
Strengths and Weaknesses
Discovery Rule Association
  • Strengths
  • It produces easy to understand results.
  • It supports undirected data mining.
  • It works on variable length data.
  • Rules are relatively easy to compute.
  • Weaknesses
  • It produces many rules.
  • For large numbers of attribute-value
    combinations, considerable CPU and memory
    resources are consumed.
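
The resource cost follows from simple counting: n distinct items (attribute-value pairs) yield 2^n - 1 non-empty itemsets that a naive search might have to consider. The short Python sketch below enumerates them for a small case and prints the count for larger n; the helper name is made up for the example.

```python
from itertools import combinations


def all_itemsets(items):
    """Return every non-empty subset of the given items."""
    items = sorted(items)
    return [
        frozenset(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


# Three items give 2**3 - 1 = 7 non-empty itemsets.
print(len(all_itemsets(["a", "b", "c"])))

# The count doubles with each added item.
for n in (3, 10, 20):
    print(n, "items ->", 2**n - 1, "candidate itemsets")
```

Methods such as Apriori and FP Growth avoid enumerating most of these candidates by discarding itemsets that cannot reach the minimum support.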

27
Opening the Itinerary
Discovery Rule Association
  • Click on the Itinerary Pane in the Resource
    Panel.
  • Expand the Discovery directory with a single
    click.
  • Expand the RuleAssociation directory with a
    single click.
  • Double click on fp-growth-binning to load the
    itinerary into your Workspace.

28
Accessing the Data
Discovery Rule Association
  • Identify the data to use
  • Form rules
  • View rules in report or visualization

29
Rule Association Visualization
Discovery Rule Association
  • Read rules down the column.
  • Example: the first rule is
  • If petal-length = Binned2. and petal-width =
    Binned0.7, then flower-type = Iris-setosa
  • Support: 25%
  • Confidence: 100%
  • Use brushing to find out support and confidence.
  • Click on the Confidence label to sort by
    confidence.
  • Click on the Support label to sort by support.
  • Additional functionality for searching/sorting is
    planned.

30
D2K Release
  • Review Modules
  • Review of Itineraries

31
The ALG Team
  • Staff
  • Bernie Acs
  • Loretta Auvil
  • David Clutter
  • Vered Goren
  • Eugene Grois
  • Luigi Marini
  • Robert McGrath
  • Chris Navarro
  • Greg Pape
  • Barry Sanders
  • Andrew Shirk
  • David Tcheng
  • Michael Welge
  • Students
  • Chen Chen
  • Hong Cheng
  • Yaniv Eytani
  • Fang Guo
  • Govind Kabra
  • Chao Liu
  • Haitao Mo
  • Xuanhui Wang
  • Qian Yang
  • Feida Zhu

32
References
  • http://alg.ncsa.uiuc.edu/do/tools/d2k/documentation
  • http://alg.ncsa.uiuc.edu/do/tools/d2k/tutorials
  • http://alg.ncsa.uiuc.edu/tools/docs/d2k/manual/index.html
  • http://alg.ncsa.uiuc.edu/tools/docs/d2k/principles/index.html
  • http://alg.ncsa.uiuc.edu/tools/docs/d2k/faq/faq.html