D2K and Distributed Computing - PowerPoint PPT Presentation

1
D2K and Distributed Computing
  • May 24, 2006

2
Review/Questions
Distributed Computing
3
Overview
  • Distributed computing in D2K is referred to as
    Proximities.
  • By default, distributed computing is disabled for
    all itineraries.
  • D2K communicates over sockets with no security
    restrictions.
  • Setting this up requires a few steps:
  • Download and run the D2K Server on each machine
    that will act as a remote machine.
  • Set the Environment preferences.
  • Set Jini URL to test.
  • Copy the policy.all file from the D2K install
    directory to your home directory.
  • On Windows, the home directory is a path such as
    C:/Documents and Settings/lauvil.
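
The file-copy step above could be scripted roughly as follows. This is a minimal sketch and not part of D2K: the function name copy_policy_file is hypothetical, and the install directory is whatever location D2K was installed to. Only the file name policy.all and the destination (the user's home directory) come from the slide.

```python
import shutil
from pathlib import Path


def copy_policy_file(install_dir, home_dir=None):
    """Copy policy.all from the D2K install directory to the home directory.

    install_dir: wherever D2K was installed (supplied by the caller).
    home_dir: defaults to the current user's home directory.
    Returns the path of the copied file.
    """
    source = Path(install_dir) / "policy.all"
    if not source.is_file():
        raise FileNotFoundError(f"policy.all not found in {install_dir}")
    target_dir = Path(home_dir) if home_dir is not None else Path.home()
    target = target_dir / "policy.all"
    shutil.copyfile(source, target)
    return target
```

Calling copy_policy_file with the install directory as its only argument copies the file into the current user's home directory.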

4
Proximity Editor
  • Under Tools -> Edit Proximities
  • Assign modules of an itinerary to D2K Servers for
    processing.
  • When an itinerary is run, D2K will automatically
    handle the distribution of module execution
    across the specified machines.
  • The Proximity Editor is displayed using a table
    of machines and modules.
  • Column labels across the top are the names of
    machines and in parentheses is the number of
    processors available on that machine.
  • Row labels along the left hand side are module
    names of the currently loaded itinerary.
  • Checkboxes in the table cells allow the user to
    associate a module with a machine for processing.

5
Module assignment
  • Modules indicated in red are user interface or
    visualization modules. They cannot be assigned
    to remote machines.
  • Modules indicated in black can only be assigned
    to one machine.
  • Modules indicated in green are reentrant and can
    be assigned to multiple machines. (more later)
  • The default port number for D2K Servers is 7021.
  • The D2K Toolkit will expect to find all remote
    D2K servers running on the port specified in the
    Proximity Editor.
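
The color rules above amount to a small validity check on each row of the Proximity Editor table. The sketch below is illustrative only and does not use any D2K API: ModuleKind and assignment_is_valid are hypothetical names, and "machines" is assumed to be the list of remote servers checked for a module.

```python
from enum import Enum


class ModuleKind(Enum):
    UI = "red"           # user interface / visualization: no remote machines
    STANDARD = "black"   # at most one machine
    REENTRANT = "green"  # any number of machines


def assignment_is_valid(kind, machines):
    """Return True if a module of this kind may be assigned to `machines`."""
    if kind is ModuleKind.UI:
        return len(machines) == 0
    if kind is ModuleKind.STANDARD:
        return len(machines) <= 1
    return True  # reentrant modules may run on several machines
```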

6
Setting up machines
  • Use the Proximity Editor GUI to create a
    machines.txt file in your <D2K install directory>
    that specifies the server systems.
  • Each line of this file specifies the system,
    the number of processors, and a name, separated
    by commas.
  • E.g. pancake.ncsa.uiuc.edu,8,pancake
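
To make the line format concrete, here is a minimal sketch of reading such a file in Python. It is not D2K code: the Machine type and parse_machines function are hypothetical, and skipping blank lines is an assumption, since the slide only shows the three comma-separated fields.

```python
from typing import NamedTuple


class Machine(NamedTuple):
    host: str        # system name, e.g. a fully qualified host name
    processors: int  # number of processors on that system
    name: str        # short name for the machine


def parse_machines(text):
    """Parse machines.txt content: one 'host,processors,name' entry per line."""
    machines = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        host, processors, name = (field.strip() for field in line.split(","))
        machines.append(Machine(host, int(processors), name))
    return machines
```

A line with the wrong number of fields raises ValueError, which is usually preferable to silently accepting a malformed entry.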

7
Advantage
  • Execute on a remote machine that may have more
    memory.
  • Execute on a remote machine that may have a more
    powerful CPU.

8
Example of Executing on Remote Machine
  • Load Discovery/Rule Association/FP-Growth
    Itinerary
  • Check properties of Input1Filename
  • Set file to data/UCI/mushroom.csv
  • Check properties of FPGrowth
  • Set Support to 20
  • Check properties of Confidence
  • Set Confidence to 80
  • In Proximity Editor
  • Add machines
  • Select FPGrowth to run on one of the distributed
    machines
  • Change the port to 7011
  • Click Run
  • In Choose Attributes, select subset or all of the
    data as input, click Done.

9
Parallel Computing
  • Modules can be designed so that they can be
    launched many times and thus run in parallel.
  • We call these modules reentrant modules.
  • Reentrant modules extend ReentrantComputeModule
    or OrderedReentrantModule.
  • They clone themselves and run in tandem on
    different servers, operating simultaneously on
    different pieces of data.
  • They contain no state variables (no class
    variables).
  • In the proximity editor, select multiple
    locations for reentrant modules to execute.
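
The idea behind reentrant modules can be illustrated outside D2K with a stateless function applied to separate pieces of data at the same time. In this sketch, Python threads stand in for D2K servers. The names square_sum and run_in_parallel are hypothetical, and nothing here uses ReentrantComputeModule or any other D2K class.

```python
from concurrent.futures import ThreadPoolExecutor


def square_sum(chunk):
    """Stateless work: the result depends only on the input chunk."""
    return sum(x * x for x in chunk)


def run_in_parallel(func, chunks, workers=4):
    """Apply func to every chunk concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, chunks))
```

Because square_sum keeps no state between calls, any number of copies can run at once without interfering with each other, which is the property the "no state variables" rule protects.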

10
Example Reentrant module usage
  • Load ClusterOptimize Itinerary
  • Each model that gets built can be created in
    parallel
  • Hier.Agglom.Clusterer is the only reentrant
    module in this itinerary.

11
Processor Status Overlay
  • Indicates the number of processors on each
    machine.
  • Represents utilization of each machine.

12
Parameter Optimization
  • Models often have control parameters that affect
    the model building.
  • It is difficult to know what values to use for
    these control parameters.
  • D2K provides a framework to allow for exploration
    of the parameter space of the algorithms.
  • The framework can also be used for parameter
    studies of transformations.

13
Framework for Parameter Optimization
  • D2K provides a framework for performing parameter
    optimization in model building.
  • This framework allows a user to specify a range
    of values for each parameter of interest.
  • The optimizing itinerary then "searches" the
    space of possible parameter combinations,
    producing a new model with each iteration.
  • These models are evaluated until either a fixed
    number of iterations or the discovery of a very
    good model causes the optimization process to
    halt.
  • This approach takes significant computing
    resources, but it can find models that are
    empirically better by various measures.
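
The loop described above can be sketched in a few lines. The slides do not say which search strategy the D2K optimizers use, so this example substitutes plain random search. The function optimize, its arguments, and the good_enough threshold are all assumptions made for illustration.

```python
import random


def optimize(ranges, build_and_score, max_iterations=50, good_enough=0.95, seed=0):
    """Random search over ranges = {name: (low, high)}.

    build_and_score(point) builds a model from the point and returns its score
    (higher is better). The search halts after max_iterations, or as soon as
    a score reaches good_enough.
    Returns (best_point, best_score, history).
    """
    rng = random.Random(seed)
    history = []
    best_point, best_score = None, float("-inf")
    for _ in range(max_iterations):
        point = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        score = build_and_score(point)
        history.append((point, score))
        if score > best_score:
            best_point, best_score = point, score
        if score >= good_enough:
            break  # a very good model was found; stop early
    return best_point, best_score, history
```

Each call to build_and_score is independent of the others, which is why this kind of search benefits from running model building on several machines.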

14
Parameter Space Exploration in D2K
15
Parameter Optimization: Data Types
  • ParameterPoint is a data type that is a vector of
    control parameters to be passed to modules.
  • ParameterSpace is a data type that is a table of
    values that define the space of possible
    parameter combinations. Each ParameterSpace
    contains:
  • Types for each parameter (numeric only for now).
    Nominal types are treated as integers, so read
    the property descriptions carefully to understand
    the proper values and their meanings.
  • Min/max values for each parameter.
  • A default value for each parameter.
  • The resolution for each parameter. This is the
    minimum distance between values of a parameter.
    Since parameters are all numeric, this is the
    smallest allowable absolute difference between
    two parameter values.
  • Labels or names for each parameter.
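
As an illustration of how min/max, default, and resolution fit together, the sketch below models a single parameter and snaps arbitrary values onto its allowed grid. It is a hypothetical Python analogue, not the D2K ParameterSpace class. The Parameter type, its field names, and the choice to anchor the grid at the minimum value are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Parameter:
    name: str          # label for the parameter
    low: float         # minimum value
    high: float        # maximum value
    default: float     # default value
    resolution: float  # minimum distance between two allowed values

    def snap(self, value):
        """Round value to the resolution grid starting at low, then clamp
        the result to [low, high]."""
        if self.resolution > 0:
            steps = round((value - self.low) / self.resolution)
            value = self.low + steps * self.resolution
        return min(max(value, self.low), self.high)


def default_point(parameters):
    """Build the point made of every parameter's default value."""
    return {p.name: p.default for p in parameters}
```

Setting low equal to high fixes a parameter: snap then returns that single value for any input.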

16
Parameter Optimization: Parameter Space Generator
  • Parameter space generator is a module where the
    user sets the values for the parameter space.
  • Preset values already exist but they can be
    overridden prior to execution.
  • If a parameter should remain fixed, its min and
    max values can be set equal.
  • Each optimizable module (or set of modules; see
    KMeans for an example) should have a parameter
    space generator module that subclasses
    AbstractParamSpaceGen and implements properties
    for the parameters specific to that module.
  • Modules whose names end in Opt are
    transformation or model building modules that
    are "optimizable" and therefore take a
    ParameterPoint as one input.
  • Expect to see pairs of module classes like
    DecisionTreeInducer and DecisionTreeInducerOpt.

17
Optimizers
  • Optimizer modules perform the following:
  • Take a ParameterSpace as input once, and with
    each iteration take as input the "score" for the
    last ParameterPoint that was output.
  • Output a newly chosen point in parameter space
    with each iteration. At the end (after a fixed
    number of iterations or some other halting
    criterion), the optimizer outputs the history of
    points chosen and their scores, along with the
    optimal points found.

18
Evaluation Modules
  • An evaluation module typically takes a model and
    possibly a test table as input, and outputs a
    ParameterPoint that is a vector of scores for
    that model.
  • There can be multiple score values, one for each
    objective being optimized, but in the simplest
    case there is a single score, commonly accuracy.
  • When crossfold validation is being used, the
    evaluator will often average scores over the
    number of folds.
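
Averaging over folds is an element-wise mean of the per-fold score vectors. The helper below is a minimal sketch with a hypothetical name, average_fold_scores. It assumes every fold reports the same number of scores, one per objective.

```python
def average_fold_scores(fold_scores):
    """Average per-fold score vectors element-wise.

    fold_scores is a list with one score vector per fold; all vectors are
    assumed to have the same length (one entry per objective).
    """
    if not fold_scores:
        raise ValueError("at least one fold is required")
    n_folds = len(fold_scores)
    return [sum(column) / n_folds for column in zip(*fold_scores)]
```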

19
Collection of Results
  • There is a module that takes the point in
    parameter space and the score for the model built
    using those parameters and concatenates the two
    together to form a single vector that is returned
    to the optimizer.

20
Example Parameter Optimization
  • Load Discovery -> Clustering -> ClusterOptimize

21
The ALG Team
  • Staff
  • Bernie Acs
  • Loretta Auvil
  • David Clutter
  • Vered Goren
  • Eugene Grois
  • Luigi Marini
  • Robert McGrath
  • Chris Navarro
  • Greg Pape
  • Barry Sanders
  • Andrew Shirk
  • David Tcheng
  • Michael Welge
  • Students
  • Chen Chen
  • Hong Cheng
  • Yaniv Eytani
  • Fang Guo
  • Govind Kabra
  • Chao Liu
  • Haitao Mo
  • Xuanhui Wang
  • Qian Yang
  • Feida Zhu