Title: Data Mining and Cross-Validation over distributed / grid enabled networks: current state of the art
1 Data Mining and Cross-Validation over distributed / grid-enabled networks: current state of the art
- Presented by Juan Bernal
- COT4930 - Introduction to Data Mining
- Instructor: Dr. Khoshgoftaar
- Florida Atlantic University
- Spring 2008
2 Topics
- Introduction
- Cross-Validation: definition and importance
- Why is Cross-Validation a computationally intensive task?
- Distributing Data Mining processes over a computer network
- WEKA and distributed Data Mining: how it is done
- Other projects implementing grids / distributed networks
- Weka-Parallel
- Grid Weka
- Weka4WS
- Inhambu
- Conclusion
3 Introduction
- Data Mining today is performed on vast amounts of ever-growing data. The need to analyze and extract information from databases in different domains demands more computational resources and results delivered in the minimum amount of time possible.
- Many different projects address Data Mining processes over distributed or grid-enabled networks. All of them attempt to make use of the computer resources available in a grid or networked environment to reduce the time it takes to obtain results, and even to increase the accuracy of the results obtained.
- One of the most computationally intensive Data Mining tasks is Cross-Validation, which is the focus of many grid/distributed-network Data Mining tools.
4 Cross-Validation
- Cross-Validation (CV) is the standard Data Mining method for evaluating the performance of classification algorithms, mainly to estimate the error rate of a learning technique.
- In CV a dataset is partitioned into n folds; each fold is used once for testing while the remaining folds are used for training. The testing and training procedure is repeated n times so that each partition, or fold, is used exactly once for testing.
- The standard way of predicting the error rate of a learning technique given a single, fixed sample of data is to use stratified 10-fold cross-validation.
- Stratification means making sure that, when sampling is done, each class is properly represented in both the training and test datasets. This is achieved by randomly sampling the dataset when creating the n fold partitions.
5 10-Fold Cross-Validation
- In a stratified 10-fold Cross-Validation the data is divided randomly into 10 parts, in each of which the class is represented in approximately the same proportions as in the full dataset. Each part is held out in turn and the learning scheme is trained on the remaining nine-tenths; its error rate is then calculated on the holdout set. The learning procedure is thus executed a total of 10 times on different training sets, and finally the 10 error rates are averaged to yield an overall error estimate. (A minimal Weka sketch of this procedure follows the figure example below.)
3-fold cross-validation graphical example
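Since the deck is built around Weka, here is a minimal Java sketch of running a stratified 10-fold cross-validation with Weka's Evaluation class. The dataset path and the choice of the J48 classifier are illustrative assumptions only.

  import java.util.Random;
  import weka.classifiers.Evaluation;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class TenFoldCV {
      public static void main(String[] args) throws Exception {
          // Load an ARFF dataset (path is a placeholder).
          Instances data = new DataSource("waveform-5000.arff").getDataSet();
          data.setClassIndex(data.numAttributes() - 1); // last attribute is the class

          // Stratified 10-fold cross-validation of a J48 decision tree;
          // crossValidateModel stratifies the folds internally.
          Evaluation eval = new Evaluation(data);
          eval.crossValidateModel(new J48(), data, 10, new Random(1));

          // The 10 per-fold error rates are averaged into an overall estimate.
          System.out.println(eval.toSummaryString("=== 10-fold CV results ===", false));
          System.out.println("Error rate: " + eval.errorRate());
      }
  }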
6 Why is Cross-Validation a computationally intensive task?
- When seeking an accurate error estimate, it is standard procedure to repeat the 10-fold CV process 10 times. This means invoking the learning algorithm 100 times, which is a computation- and time-intensive task (a sketch of such a repeated run is given below).
- Given the nature of Cross-Validation, many researchers have worked on executing this process more efficiently over grid or networked computer environments.
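To make the "100 invocations" point concrete, here is a minimal sketch, reusing the Weka API from the previous example, of a 10-times repeated 10-fold cross-validation; the dataset path and classifier are again just placeholders.

  import java.util.Random;
  import weka.classifiers.Evaluation;
  import weka.classifiers.trees.J48;
  import weka.core.Instances;
  import weka.core.converters.ConverterUtils.DataSource;

  public class RepeatedCV {
      public static void main(String[] args) throws Exception {
          Instances data = new DataSource("waveform-5000.arff").getDataSet();
          data.setClassIndex(data.numAttributes() - 1);

          double sum = 0;
          // 10 repetitions x 10 folds = 100 invocations of the learning algorithm.
          for (int run = 0; run < 10; run++) {
              Evaluation eval = new Evaluation(data);
              eval.crossValidateModel(new J48(), data, 10, new Random(run));
              sum += eval.errorRate();
          }
          System.out.println("Mean error rate over 10 runs: " + sum / 10);
      }
  }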
7 Distributing Data Mining processes over a computer network
- Different projects, including WEKA itself, have implemented ways to distribute Data Mining processes, and in particular Cross-Validation, over networked computers. In almost all of these projects a client-server approach is used, and mechanisms such as Java RMI (Remote Method Invocation) and WSRF (Web Services Resource Framework) are implemented to allow network communication between clients and servers (a sketch of an RMI-style interface is given below).
- WEKA is also the main tool on which the different projects are based to achieve Data Mining over computer networks, due to its easily accessible Java source code and its adaptability.
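As an illustration of the client-server / RMI style most of these projects use, here is a minimal, hypothetical RMI interface for shipping one cross-validation fold to a remote host. The interface name and method are invented for this sketch and are not part of WEKA's actual weka.experiment.RemoteEngine API.

  import java.rmi.Remote;
  import java.rmi.RemoteException;

  // Hypothetical remote interface: a client would look this service up in an
  // RMI registry on each host and invoke it once per fold or per run.
  public interface RemoteFoldEvaluator extends Remote {

      // Evaluate one fold of an n-fold cross-validation on the remote host and
      // return the error rate observed on that fold. Arguments are illustrative:
      // the dataset is identified by name/URL and the fold by its index.
      double evaluateFold(String datasetUrl, String classifierClass,
                          int fold, int numFolds) throws RemoteException;
  }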
8 WEKA distribution of Data Mining processes over several computers
- The WEKA tool contains a feature to split an experiment and distribute it across several processors.
- Distributing an experiment involves splitting it into subexperiments that are sent via RMI to the remote hosts for execution. The experiment can be partitioned by dataset, where each subexperiment is self-contained and applies all schemes to a single dataset. On the other hand, with few datasets the partitioning can be set by run; for example, a 10-times 10-fold CV would be split into 10 subexperiments, one per run.
- This feature is available from the Experimenter section of the WEKA tool, which is the main section under which research is done.
- Within the Experimenter, the ability to distribute processes is found under the advanced version of the Setup panel.
9 (Screenshot slide, no transcript)
10 WEKA requirements for distributing experiments
- Each host
- needs Java installed
- needs access to the databases to be used
- needs to be running the weka.experiment.RemoteEngine experiment server
- Distributing an experiment works best if the results are sent to a central database by selecting JDBC as the result destination. If that is not preferred, each host can save its results to a separate ARFF file, and the files can be merged afterwards (a minimal merge sketch is given below).
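As a minimal sketch of the "merge the ARFF result files afterwards" step, assuming each host wrote its results to its own ARFF file with an identical attribute header (file names are placeholders):

  import java.io.File;
  import weka.core.Instances;
  import weka.core.converters.ArffSaver;
  import weka.core.converters.ConverterUtils.DataSource;

  public class MergeResults {
      public static void main(String[] args) throws Exception {
          // Load the per-host result files (placeholder names).
          Instances merged = new DataSource("results-host1.arff").getDataSet();
          Instances other  = new DataSource("results-host2.arff").getDataSet();

          // Append the second host's rows; assumes both files share the same header.
          for (int i = 0; i < other.numInstances(); i++) {
              merged.add(other.instance(i));
          }

          // Write the combined results to a single ARFF file.
          ArffSaver saver = new ArffSaver();
          saver.setInstances(merged);
          saver.setFile(new File("results-merged.arff"));
          saver.writeBatch();
      }
  }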
11 WEKA difficulties for distributed implementation
- File and directory permissions can be difficult to set up.
- Each host must be manually installed and configured with the Weka RemoteEngine experiment server and the remote.policy file, which grants the remote engine permissions for network operations.
- Each host must be manually initialized or started.
- A centralized database server and access to it must be set up.
- On the positive side, once all these configurations and preparations are done, the experiment can be executed and time can be saved by distributing the workload among the hosts.
- WEKA Experimenter Tutorial: http://sourceforge.net/project/downloading.php?groupname=weka&filename=ExperimenterTutorial-3.4.12.pdf&use_mirror=internap
12 Other projects implementing grids / distributed networks for Data Mining and Cross-Validation
- Based on Weka, there are some projects that try to improve the process of performing data mining and cross-validation over numerous computers:
- Weka-Parallel
- Grid Weka
- Inhambu
- Weka4WS
13 Weka-Parallel: Machine Learning in Parallel
- Weka-Parallel was created with the intention of being able to run the cross-validation portion of any given classifier very quickly. This speed increase is accomplished by simultaneously calculating the necessary information on many different machines.
- To achieve communication between the computer running Weka (the client) and the other computers (the servers), Weka-Parallel uses a simple connection established with the Socket class of the java.net package. Each server starts a daemon that listens on a port; the socket then opens a data stream and an object stream to send and receive information (a minimal socket sketch is given below).
- RMI was not used to manage the client's calls to the servers for the methods that calculate specific folds of CV. Instead, the client sends integer codes to the servers telling them which methods to run.
- Each server receives a copy of the dataset and information on which fold it has to process. The client computer maintains an index of which fold each server is assigned and distributes folds with a round-robin algorithm.
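The following is a minimal, hypothetical sketch of the socket-based scheme described above (it is not Weka-Parallel's actual code): a server daemon listens on a port and the client sends an integer command code followed by a fold number. The port and command constant are made up for illustration.

  import java.io.DataInputStream;
  import java.io.DataOutputStream;
  import java.net.ServerSocket;
  import java.net.Socket;

  // Hypothetical fold server: listens on a port and reads integer command codes.
  public class FoldServer {
      static final int CMD_EVALUATE_FOLD = 1; // illustrative command code

      public static void main(String[] args) throws Exception {
          try (ServerSocket daemon = new ServerSocket(7777)) { // port is a placeholder
              while (true) {
                  try (Socket client = daemon.accept();
                       DataInputStream in = new DataInputStream(client.getInputStream());
                       DataOutputStream out = new DataOutputStream(client.getOutputStream())) {
                      int command = in.readInt();  // which method to run
                      int fold = in.readInt();     // which fold to evaluate
                      if (command == CMD_EVALUATE_FOLD) {
                          double errorRate = evaluateFold(fold); // placeholder for the real CV work
                          out.writeDouble(errorRate);            // send the result back
                      }
                  }
              }
          }
      }

      static double evaluateFold(int fold) {
          return 0.0; // stub: the real server would train and test on this fold
      }
  }

A client would connect with new Socket(host, 7777), issue out.writeInt(CMD_EVALUATE_FOLD) and out.writeInt(fold), and then read back the returned error rate.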
14 Weka-Parallel: speedup performance analysis
- An experiment was done running the J48 decision tree classifier with default parameters on the Waveform-5000 dataset from the UCI repository. The dataset contains 5300 points in 21 dimensions, and the goal is to find the classifier that correctly distinguishes between 3 classes of waves. A 500-fold cross-validation was run using up to 14 computers with similar hardware.
Weka-Parallel link: http://weka-parallel.sourceforge.net/
15 Grid-Weka
- In the Grid-enabled Weka, execution of the following tasks can be distributed across several computers in an ad-hoc Grid:
- building a classifier on a remote machine;
- testing a previously built classifier on several machines in parallel;
- labeling a dataset using a previously built classifier on several machines in parallel;
- using several machines to perform parallel cross-validation.
- Labeling involves applying a previously learned classifier to an unlabelled data set to predict instance labels.
- Testing takes a labeled data set, temporarily removes the class labels, applies the classifier, and then analyses the quality of the classification algorithm by comparing the actual and the predicted labels.
- Finally, for n-fold cross-validation a labeled data set is partitioned into n folds, and n training and testing iterations are performed. On each iteration one fold is used as the test set and the rest of the data is used as the training set; a classifier is learned on the training set and then validated on the test data.
- Grid-Weka is similar to the Weka-Parallel project, but allows more functions to be performed in parallel on remote machines (and also includes better load balancing, fault monitoring, and dataset management).
16 Grid-Weka
- The labeling function is distributed by partitioning the data set, labeling the partitions in parallel on different available machines, and merging the results into a single labeled data set (a minimal sketch is given below).
- The testing function is distributed in a similar way, with test statistics being computed in parallel on several machines for different subsets of the test data.
- Distributing cross-validation is also straightforward: individual iterations for different folds are executed on different machines.
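Here is a minimal sketch of the partition-label-merge pattern described above, under assumed conditions: a classifier already trained, and local threads standing in for the remote machines. It uses the standard Weka Instances API but is not Grid-Weka's actual code.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.Future;
  import weka.classifiers.Classifier;
  import weka.core.Instances;

  public class ParallelLabeling {

      // Label an unlabeled dataset by splitting it into partitions, labeling them
      // in parallel (here with local threads; Grid-Weka ships them to remote
      // machines), and merging the labeled partitions back together.
      public static Instances labelInParallel(Classifier trained, Instances unlabeled,
                                              int partitions) throws Exception {
          ExecutorService pool = Executors.newFixedThreadPool(partitions);
          List<Future<Instances>> parts = new ArrayList<>();

          int size = unlabeled.numInstances() / partitions;
          for (int p = 0; p < partitions; p++) {
              int from = p * size;
              int count = (p == partitions - 1) ? unlabeled.numInstances() - from : size;
              Instances chunk = new Instances(unlabeled, from, count); // copy of one partition
              parts.add(pool.submit(() -> {
                  for (int i = 0; i < chunk.numInstances(); i++) {
                      // Predict and assign a class label for each instance.
                      chunk.instance(i).setClassValue(trained.classifyInstance(chunk.instance(i)));
                  }
                  return chunk;
              }));
          }

          // Merge the labeled partitions into a single dataset.
          Instances merged = new Instances(unlabeled, 0);
          for (Future<Instances> f : parts) {
              Instances labeled = f.get();
              for (int i = 0; i < labeled.numInstances(); i++) {
                  merged.add(labeled.instance(i));
              }
          }
          pool.shutdown();
          return merged;
      }
  }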
17 Grid-Weka setup details
- It uses a custom interface for communication between clients and servers, utilizing native Java object serialization for data exchange.
- It is mainly driven from the Java command line.
- It uses a .weka-parallel configuration file on the client computer to set up the list of servers, in the following format (an illustrative example follows below):
PORT <Port number>
<Machine IP address or DNS name> <Number of Weka servers running on this machine> <Max. amount of memory on this machine in Mbytes>
<Machine IP address or DNS name> ...
- For each Weka server, a copy of the Weka software (the .jar file) is made on the selected machines and the Weka server class is run as follows:
java weka.core.DistributedServer <Port number>
- If a machine is going to run more than one Weka server, each server should have its own directory so that the generated results are not combined.
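As an illustration only, a .weka-parallel file following the format above might look like the sketch below; the host names, port, server counts, and memory figures are made up, and the exact layout should be checked against the Grid-Weka HowTo linked on the next slide.

  PORT 2000
  node1.example.edu 1 512
  node2.example.edu 2 1024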
18 Performance analysis between Weka-Parallel and Grid-Weka
- Grid-Weka sacrifices some performance in exchange for more features compared to Weka-Parallel. These features are load balancing, data recovery/fault monitoring, and more data mining functions than just cross-validation.
Grid-Weka development: http://userweb.port.ac.uk/khusainr/weka/Xin_thesis.pdf
Grid-Weka HowTo: http://userweb.port.ac.uk/khusainr/weka/gweka-howto.html
19 Inhambu
- Inhambu is a distributed object-oriented system that supports the execution of data mining applications on clusters of PCs and workstations.
- Inhambu uses the idle resources in a cluster composed of commodity PCs, supporting the execution of DM applications based on the Weka tool.
- Its goal is to improve scheduling and load sharing, overloading and contention avoidance, heterogeneity, and fault tolerance when performing Data Mining processes on grids or clusters of computers.
20 Inhambu architecture
- The architecture of Inhambu implements:
- An application layer, consisting of a modified implementation of Weka, with specific components implemented and deployed at the client and server sides. The client component runs the user interface and generates DM tasks, while the server contains the core Weka classes that execute the DM tasks.
- A resource management layer, which provides for the execution of Weka in a distributed environment.
- A trader, which provides publishing and discovery mechanisms for clients and servers.
21 Inhambu improvements
- Scheduling and load sharing: implementation of static and dynamic performance indices. Static performance indices are usually implemented as static values that express or quantify amounts of resources and capacities. After an index is created, dynamic performance monitoring keeps it updated.
- Overloading and contention avoidance: implementation of a "best effort" policy in which, to avoid overloading a computer, it can only be chosen to receive load entities if its load index is below a given threshold. The default value chosen for the threshold is 0.7, based on the relationship between the utilization index and the response time of a computer system (a small sketch of this selection rule is given below).
- Heterogeneity: based on the maintained Capacity State Index, distribution of the work can be enhanced in heterogeneous environments.
- Fault tolerance: checkpointing and recovery implemented on the client side.
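A minimal sketch of the threshold rule described above, assuming a load index is already available per host; the Host type and the filtering helper are made up for illustration and are not Inhambu's code.

  import java.util.ArrayList;
  import java.util.List;

  public class BestEffortScheduler {
      static final double LOAD_THRESHOLD = 0.7; // default threshold from the slide

      // Illustrative host descriptor holding its current (dynamically updated) load index.
      static class Host {
          final String name;
          final double loadIndex;
          Host(String name, double loadIndex) { this.name = name; this.loadIndex = loadIndex; }
      }

      // A host may receive new load entities only while its load index is below
      // the threshold; overloaded hosts are skipped.
      static List<Host> eligibleHosts(List<Host> hosts) {
          List<Host> eligible = new ArrayList<>();
          for (Host h : hosts) {
              if (h.loadIndex < LOAD_THRESHOLD) {
                  eligible.add(h);
              }
          }
          return eligible;
      }
  }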
22 Inhambu performance against Weka-Parallel
- Performance testing was done by running experiments on 2 real-world databases:
- the Adult Census Income dataset, and a dataset for diffuse large B-cell lymphoma (DLBCL).
- The first performance test was done to determine scalability, as shown in the tables, when using the J48 and PART classifiers.
- Inhambu and Weka-Parallel perform roughly the same for fine-granularity tasks, and Inhambu performs better than Weka-Parallel when running tasks whose granularity is coarser.
23 Inhambu performance on non-dedicated and heterogeneous clusters
- Notice that Weka-Parallel can achieve better performance in the presence of shorter tasks, such as J4.8, due to its low communication overhead (it uses sockets). Despite the higher overhead of RMI, Inhambu has better performance in the presence of longer tasks.
Inhambu link: http://inhambu.incubadora.fapesp.br/portal
24 Weka4WS
- The goal of Weka4WS is to extend Weka to support remote execution of data mining algorithms through Web Services Resource Framework (WSRF) Web Services.
- To enable remote invocation, all the data mining algorithms provided by the Weka library are exposed as a Web Service.
- Weka4WS has been developed using the WSRF Java library provided by Globus Toolkit 4 (GT4), which is an implementation of the OGSA (Open Grid Services Architecture).
25 Weka4WS structure
- In the Weka4WS framework all nodes use the GT4 services for standard Grid functionalities, such as security and data management. The nodes fall into two categories:
- 1. user nodes, the local machines of the users, which provide the Weka4WS client software;
- 2. computing nodes, which provide the Weka4WS Web Services allowing the execution of remote data mining tasks.
- A storage node can be used when a centralized database is employed.
26 Weka4WS setup details
- Weka4WS requires Globus Toolkit 4 on the computing nodes and only the Java WS Core (a subset of Globus Toolkit) on the user nodes. Since GT4 only runs on Unix platforms, the computing nodes need to be Unix or Linux machines.
- The Weka4WS client can be installed in either a Unix or a Windows environment.
- Due to the web-service-oriented approach there are security requirements: Weka4WS runs in a security context and uses grid-map authorization (only users listed in the service grid-map can execute it), and authentication is done using certificates.
- On the client computer a machines file is needed listing all the computing nodes. This is the only setup/configuration Weka4WS needs. Each line of the file describes a computing node as "hostname, container port, gridFTP port", for example:
pluto.deis.unical.it 8443 2811
27 Weka4WS performance
- A performance analysis of Weka4WS was done by executing a typical data mining task in different network scenarios. In particular, the execution times of the different steps needed to perform the overall data mining task were measured to determine the overhead on LAN vs. WAN networks.
- No performance comparisons were made against other Grid-enabled data mining tools.
Weka4WS paper: http://grid.deis.unical.it/papers/pdf/PKDD2005.pdf
28 Conclusion
- The area of Data Mining and Cross-Validation over Grid-enabled environments is in constant development.
- The latest efforts try to develop and implement standard frameworks, such as the OGSA (Open Grid Services Architecture), for data mining tools.
- From the analysis of each of the presented tools, Weka4WS is the most interesting overall. Still, the other projects have positive features that may eventually be consolidated into a single Grid Data Mining tool based on Weka.
- Further research will focus on enhancing the performance of the current tools that use RMI and WSRF, to avoid the overhead imposed by communications. A further research topic could also include using available peer-to-peer or Internet networks to facilitate performing data mining tasks over an Internet cluster available to everyone.