Data Mining and Cross-Validation over distributed / grid enabled networks: current state of the art

1
Data Mining and Cross-Validation over
distributed / grid enabled networks: current
state of the art
  • Presented by Juan Bernal
  • COT4930 - Introduction to Data Mining
  • Instructor: Dr. Khoshgoftaar
  • Florida Atlantic University
  • Spring, 2008

2
Topics
  • Introduction
  • Cross-Validation definition and importance
  • Why is Cross-Validation a computational-intensive
    task?
  • Distributing Data-Mining processes over a
    computer network.
  • WEKA and distributed Data Mining: how it is done
  • Other Projects implementing grids/distributed
    networks
  • Weka Parallel
  • Grid Weka
  • Weka4WS
  • Inhambu
  • Conclusion

3
Introduction
  • Data Mining today is performed on vast amounts
    of ever-growing data. The need to analyze and
    extract information from databases in different
    domains demands more computational resources,
    and results are expected in the minimum amount
    of time possible.
  • Many different projects address Data Mining
    processes over distributed or grid-enabled
    networks. All of them attempt to make use of the
    available computer resources in a grid or
    networked environment to reduce the time it
    takes to obtain results, and even to increase
    the accuracy of the results obtained.
  • One of the most computationally intensive Data
    Mining tasks is Cross-Validation, which is the
    focus of many grid/distributed-network Data
    Mining tools.

4
Cross-Validation
  • Cross-Validation (CV) is the standard Data Mining
    method for evaluating the performance of
    classification algorithms, mainly to estimate
    the error rate of a learning technique.
  • In CV a dataset is partitioned into n folds;
    each fold is used once for testing while the
    remaining folds are used for training. The
    procedure of training and testing is repeated n
    times so that each partition, or fold, is used
    exactly once for testing.
  • The standard way of predicting the error rate of
    a learning technique given a single, fixed sample
    of data is to use stratified 10-fold
    cross-validation.
  • Stratification means making sure that, when
    sampling is done, each class is properly
    represented in both the training and test
    datasets. This is achieved by randomly sampling
    the dataset when creating the n fold partitions.

5
10-Fold Cross-Validation
  • In a stratified 10-fold Cross-Validation the data
    is divided randomly into 10 parts, in each of
    which the class is represented in approximately
    the same proportions as in the full dataset.
    Each part is held out in turn and the learning
    scheme is trained on the remaining nine-tenths;
    its error rate is then calculated on the holdout
    set. The learning procedure is executed a total
    of 10 times on different training sets, and
    finally the 10 error rates are averaged to yield
    an overall error estimate.
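The stratified partitioning described above can be sketched in plain Java. This is a minimal illustration, not Weka's implementation; the class and method names (`StratifiedFolds`, `assign`) are hypothetical. Instances of each class are shuffled and then dealt round-robin into the k folds, so each fold preserves the class proportions of the full dataset:

```java
import java.util.*;

// Sketch of stratified k-fold partitioning: instances of each class
// are dealt round-robin into the folds so class proportions are
// approximately preserved in every fold.
public class StratifiedFolds {
    // labels[i] is the class of instance i; returns the fold index
    // assigned to each instance.
    public static int[] assign(int[] labels, int k, long seed) {
        Map<Integer, List<Integer>> byClass = new HashMap<>();
        for (int i = 0; i < labels.length; i++)
            byClass.computeIfAbsent(labels[i], c -> new ArrayList<>()).add(i);
        int[] fold = new int[labels.length];
        Random rnd = new Random(seed);
        int next = 0;
        for (List<Integer> idx : byClass.values()) {
            Collections.shuffle(idx, rnd);          // random sampling within class
            for (int i : idx) fold[i] = next++ % k; // deal round-robin into folds
        }
        return fold;
    }
}
```

With 4 instances of each of two classes and k = 2, every fold receives exactly 2 instances of each class, which is the stratification property the slide describes.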

3-fold cross-validation graphical example
6
Why is Cross-Validation a computational-intensive
task?
  • When seeking an accurate error estimate, it is
    standard procedure to repeat the CV process 10
    times. This means invoking the learning algorithm
    100 times (10 repetitions of 10 folds), which is
    a computation- and time-intensive task.
  • Given the nature of Cross-Validation many
    researchers have worked on executing this process
    more efficiently over a grid or networked
    computer environments.

7
Distributing Data-Mining processes over a
computer network
  • Different projects including WEKA have
    implemented a way to distribute Data Mining
    processes and in particular Cross-Validation over
    networked computers. In almost all projects a
    client-server approach is used and methods like
    Java RMI (Remote Method Invocation) and WSRF (Web
    Services Resource Framework) are implemented to
    allow network communications between clients and
    servers.
  • Also, WEKA is the main tool on which the
    different projects are based to achieve Data
    Mining over computer networks, due to its easily
    accessible Java source code and adaptability.

8
WEKA distribution of Data Mining Processes over
several computers
  • The WEKA tool contains a feature to split an
    experiment and distribute it across several
    processors.
  • Distributing an experiment involves splitting it
    into subexperiments that RMI sends to the hosts
    for execution. The experiment can be partitioned
    by dataset, where each subexperiment is
    self-contained and applies all schemes to a
    single dataset. On the other hand, with few
    datasets, the partitions can be set by run. For
    example, a 10 times 10-fold CV would be split
    into 10 subexperiments, one per run.
  • This feature is available from the experimenter
    section of the WEKA tool which is the main
    section under which research is done.
  • Under the Experimenter the ability to distribute
    processes is found under the advanced version of
    the Setup panel.
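The partition-by-run scheme above can be sketched in a few lines of Java. The class and record names are hypothetical (Weka's Experimenter performs this splitting internally); the point is simply that a 10 times 10-fold CV yields 10 self-contained sub-experiments, one per run, each shippable to a host:

```java
import java.util.*;

// Sketch: split an "n times k-fold CV" experiment into one
// self-contained sub-experiment per run.
public class ExperimentSplitter {
    public record SubExperiment(int run, int folds) {}

    public static List<SubExperiment> splitByRun(int runs, int foldsPerRun) {
        List<SubExperiment> subs = new ArrayList<>();
        for (int run = 1; run <= runs; run++)
            subs.add(new SubExperiment(run, foldsPerRun)); // one sub-experiment per run
        return subs;
    }
}
```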

10
WEKA requirements for distributing experiments
  • Each host
  • Needs Java installed
  • Needs access to the databases to be used
  • Needs to be running the
    weka.experiment.RemoteEngine experiment server
  • Distributing an experiment works best if the
    results are sent to a central database by
    selecting JDBC as the result destination. If that
    is not preferred, each host can save its results
    to a different ARFF file, and the files can be
    merged afterwards.

11
WEKA difficulties for distributed implementation
  • File and directory permissions can be difficult
    to set up.
  • Manually installing and configuring each host
    with the Weka experimenter server and the
    remote.policy file which grants remote engine
    permissions for network operations.
  • Manually initializing or starting each host.
  • Setting up a centralized database server and
    access.
  • On the positive side, once all these
    configurations and preparations are done, the
    experiment can be executed and time can be saved
    by distributing the workload among the hosts.
  • WEKA Experimenter Tutorial:
    http://sourceforge.net/project/downloading.php?group_name=weka&filename=ExperimenterTutorial-3.4.12.pdf&use_mirror=internap

12
Other Projects implementing grids / distributed
networks for Data Mining and Cross-Validation
  • Based on Weka, there are some projects that try
    to improve the process of performing data mining
    and cross-validation over numerous computers:
  • Weka Parallel
  • Grid Weka
  • Inhambu
  • Weka4WS

13
Weka-Parallel: Machine Learning in Parallel
  • Weka-Parallel was created with the intention of
    being able to run the cross-validation portion of
    any given classifier very quickly. This speed
    increase is accomplished by simultaneously
    calculating the necessary information using many
    different machines.
  • To achieve communication from the computer
    running Weka (the client) to the other computers
    (the servers), Weka-Parallel uses a simple
    connection established with the Socket class in
    the java.net package. Each server starts a
    daemon that listens on a port; the socket then
    opens a data stream and an object stream to send
    and receive information.
  • RMI was not used to manage the client's calls to
    the servers for the methods that calculate
    specific folds of CV. Instead, the client sends
    integer codes to the servers telling them which
    methods to run.
  • Each server receives a copy of the dataset and
    information on which fold it has to perform. The
    client computer maintains an index to assign
    which fold each server performs, using a Round
    Robin algorithm.
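The round-robin fold assignment described above might look like the following sketch. The names are hypothetical and Weka-Parallel's actual wire protocol (integer codes over sockets) is not reproduced; this only shows how a client index distributes folds evenly across servers:

```java
import java.util.*;

// Sketch: the client walks an index over the servers and hands each
// CV fold to the next server in round-robin order.
public class RoundRobinDispatch {
    // Returns, for each server id, the list of fold numbers it computes.
    public static Map<Integer, List<Integer>> schedule(int folds, int servers) {
        Map<Integer, List<Integer>> plan = new HashMap<>();
        for (int fold = 0; fold < folds; fold++)
            plan.computeIfAbsent(fold % servers, s -> new ArrayList<>()).add(fold);
        return plan;
    }
}
```

For example, a 500-fold cross-validation over 14 servers (the scale of the experiment on the next slide) gives the first ten servers 36 folds each and the remaining four 35 each.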

14
Weka-Parallel: Speedup performance analysis
  • An experiment was done running the J48 decision
    tree classifier with default parameters on the
    Waveform-5000 dataset from the UCI repository.
    The Waveform-5000 dataset contains 5000 points
    in 21 dimensions, and the goal is to find the
    classifier that correctly distinguishes between
    3 classes of waves. A 500-fold cross-validation
    was run using up to 14 computers with similar
    hardware.

Weka-Parallel link: http://weka-parallel.sourceforge.net/
15
Grid-Weka
  • In the Grid-enabled Weka, execution of the
    following tasks can be distributed across several
    computers in an ad-hoc Grid:
  • Building a classifier on a remote machine.
  • Testing a previously built classifier on several
    machines in parallel.
  • Labeling a dataset using a previously built
    classifier on several machines in parallel.
  • Using several machines to perform parallel
    cross-validation.
  • Labeling involves applying a previously learned
    classifier to an unlabelled data set to predict
    instance labels.
  • Testing takes a labeled data set, temporarily
    removes class labels, applies the classifier, and
    then analyses the quality of the classification
    algorithm by comparing the actual and the
    predicted labels.
  • Finally, for n-fold cross-validation a labeled
    data set is partitioned into n folds, and n
    training and testing iterations are performed. On
    each iteration, one fold is used as a test set,
    and the rest of the data is used as a training
    set. A classifier is learned on the training set
    and then validated on the test data.
  • Grid-Weka is similar to the Weka-Parallel
    project, but allows more functions to be
    performed in parallel on remote machines (and
    also includes better load balancing, fault
    monitoring, and dataset management).

16
Grid-Weka
  • The labeling function is distributed by
    partitioning the data set, labeling several
    partitions in parallel on different available
    machines, and merging the results into a single
    labeled data set.
  • The testing function is distributed in a similar
    way, with test statistics being computed in
    parallel on several machines for different
    subsets of the test data.
  • Distributing cross-validation is also
    straightforward: individual iterations for
    different folds are executed on different
    machines.

17
Grid-Weka Setup details
  • It uses a custom interface for communication
    between clients and servers utilizing native Java
    object serialization for data exchange.
  • It is mainly done on a Java command line
    execution style.
  • It uses a .weka-parallel configuration file on
    the client computer to set up the list of
    servers, in the following format:
    PORT <Port number>
    <Machine IP address or DNS name> <Number of Weka
    servers running on this machine> <Max. amount of
    memory on this machine in Mbytes>
    <Machine IP address or DNS name> ...
  • For each Weka server, a copy of the Weka
    software (the .jar file) is made on the selected
    machine and the Weka server class is run as
    follows:
    java weka.core.DistributedServer <Port number>
  • If a machine is going to run more than one Weka
    server, each server should have its own
    directory so that the results generated are not
    mixed.

18
Performance analysis between Weka-Parallel and
Grid-Weka
  • Grid-Weka sacrifices some performance in
    exchange for more features, compared to
    Weka-Parallel. These features are load
    balancing, data recovery/fault monitoring, and
    more data mining functions than just
    cross-validation.

Grid-Weka Development: http://userweb.port.ac.uk/~khusainr/weka/Xin_thesis.pdf
Grid-Weka HowTo: http://userweb.port.ac.uk/~khusainr/weka/gweka-howto.html
19
Inhambu
  • Inhambu is a distributed object-oriented system
    that supports the execution of data mining
    applications on clusters of PCs and workstations.
  • Inhambu is a system that uses the idle resources
    in a cluster composed of commodity PCs,
    supporting the execution of DM applications based
    on the Weka tool.
  • Its goal is to improve issues with Scheduling and
    load sharing, Overloading and contention
    avoidance, Heterogeneity, and Fault tolerance
    when performing Data Mining processes in a grid
    or clusters of computers.

20
Inhambu architecture
  • The architecture of Inhambu implements:
  • An application layer consisting of a modified
    implementation of Weka, with specific components
    implemented and deployed at the client and
    server sides. The client component executes the
    user interface and generates DM tasks, while the
    server contains the core Weka classes which
    execute the DM tasks.
  • A resource management layer which provides for
    the execution of Weka in a distributed
    environment.
  • The trader, which provides publishing and
    discovery mechanisms for clients and servers.

21
Inhambu Improvements
  • Scheduling and load sharing: implementation of
    static and dynamic performance indices. Static
    performance indices are usually implemented as
    static values that express or quantify amounts
    of resources and capacities. After an index is
    created, dynamic performance monitoring updates
    it.
  • Overloading and contention avoidance:
    implementation of a "best effort" policy, where,
    to avoid overloading a computer, it can only be
    chosen to receive load entities if its load
    index is below a given threshold. The default
    threshold is 0.7, based on the relationship
    between the utilization index and the response
    time of a computer system.
  • Heterogeneity: based on the Capacity State Index
    that is maintained, distribution of the work can
    be improved in heterogeneous environments.
  • Fault tolerance: checkpointing and recovery were
    implemented on the client side.
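The overloading-avoidance rule above reduces to a simple threshold check. A minimal sketch, with a hypothetical class name and the slide's default threshold of 0.7 as the example value:

```java
// Sketch of the best-effort policy: a host may receive new load
// entities only while its load index stays below the threshold.
public class BestEffortPolicy {
    private final double threshold;

    public BestEffortPolicy(double threshold) {
        this.threshold = threshold;
    }

    // True if the host's current load index admits more work.
    public boolean canAccept(double loadIndex) {
        return loadIndex < threshold;
    }
}
```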

22
Inhambu performance against Weka-Parallel
  • Performance was evaluated by running experiments
    on 2 real-world databases:
  • the Adult Census Income dataset, and a dataset
    for diffuse large B-cell lymphoma (DLBCL).
  • The first performance test was done to determine
    scalability, as shown in the tables, when using
    the J48 and PART classifiers.
  • Inhambu and Weka-Parallel perform roughly
    similarly for fine-granularity tasks, and
    Inhambu performs better than Weka-Parallel when
    running tasks whose granularity is coarser.

23
Inhambu Performance on non-dedicated and
heterogeneous clusters
  • Notice that Weka-Parallel can lead to better
    performance in the presence of shorter tasks,
    such as J4.8, due to its low communication
    overhead (it uses sockets). Despite the higher
    overhead due to its use of RMI, Inhambu performs
    better in the presence of longer tasks.

Inhambu link: http://inhambu.incubadora.fapesp.br/portal
24
Weka4WS
  • The goal of Weka4WS is to extend Weka to support
    remote execution of the data mining algorithms
    through the Web Services Resource Framework
    (WSRF) Web Services.
  • To enable remote invocation, all the data mining
    algorithms provided by the Weka library are
    exposed as a Web Service.
  • Weka4WS has been developed using the WSRF Java
    library provided by Globus Toolkit 4 (GT4),
    which is an OGSA (Open Grid Services
    Architecture) implementation.

25
Weka4WS structure
  • In the Weka4WS framework all nodes use the GT4
    services for standard Grid functionalities, such
    as security and data management. Those nodes can
    be divided into two categories:
  • 1. user nodes, which are the local machines of
    the users, providing the Weka4WS client
    software;
  • 2. computing nodes, which provide the Weka4WS
    Web Services allowing the execution of remote
    data mining tasks.
  • A storage node can be used when a centralized
    database is employed.

26
Weka4WS Setup details
  • Weka4WS requires Globus Toolkit 4 on the
    computing nodes and only the Java WS Core (a
    subset of Globus Toolkit) on the user nodes. But
    since GT4 runs only on Unix platforms, the
    computing nodes need to be Unix or Linux
    machines.
  • The Weka4WS client can be installed in either a
    Unix or a Windows environment.
  • Due to the web-service-oriented approach there
    are security requirements: Weka4WS runs in a
    security context and uses grid-map authorization
    (only users listed in the service grid-map can
    execute it). Authentication is performed using
    certificates.
  • On the client computer a machines file is needed
    that lists all the computing nodes. This is the
    only setup/configuration Weka4WS needs. The
    format of this file (one computing node per
    line) is:
    hostname  container port  gridFTP port
    pluto.deis.unical.it  8443  2811
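A parser for such a machines file could be as simple as the sketch below. The `MachinesFile` class and `Node` record are hypothetical names; only the field layout follows the format shown above:

```java
import java.util.*;

// Sketch: read "hostname containerPort gridFTPPort" lines into
// records, skipping blank lines and comment lines.
public class MachinesFile {
    public record Node(String host, int containerPort, int gridFtpPort) {}

    public static List<Node> parse(List<String> lines) {
        List<Node> nodes = new ArrayList<>();
        for (String line : lines) {
            String s = line.trim();
            if (s.isEmpty() || s.startsWith("#")) continue; // skip blanks/comments
            String[] t = s.split("\\s+");
            nodes.add(new Node(t[0], Integer.parseInt(t[1]), Integer.parseInt(t[2])));
        }
        return nodes;
    }
}
```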

27
Weka4WS performance
  • A performance analysis of Weka4WS was done for
    the execution of a typical data mining task in
    different network scenarios. In particular, the
    execution times of the different steps needed to
    perform the overall data mining task were
    evaluated to determine the overhead on LAN vs.
    WAN networks.
  • No performance comparisons were done against
    other Grid enabled data mining tools.

Weka4WS paper: http://grid.deis.unical.it/papers/pdf/PKDD2005.pdf
28
Conclusion
  • The area of Data Mining and Cross-Validation over
    Grid enabled environments is in constant
    development.
  • Latest efforts try to develop and implement
    standard frameworks such as the OGSA (Open Grid
    Service Architecture) for data mining tools.
  • From the analysis of each of the presented
    tools, Weka4WS appears the most interesting
    overall. Still, the other projects have positive
    features that may eventually be consolidated
    into a single Grid Data Mining tool based on
    Weka.
  • Further research will focus on enhancing the
    performance of the current tools that use RMI
    and WSRF, to avoid the overhead imposed by
    communications. Also, a further research topic
    could include using available peer-to-peer or
    Internet networks to facilitate performing data
    mining tasks over an Internet cluster available
    to everyone.