An Inferno Grid for NLP using Map Reduce
Noah Evans
Introduction
Resource Demands of Natural Language Processing
NLP is a field that deals with huge
amounts of data. Corpora can be millions or even
billions of words. The web is even larger. Not
only do NLP applications deal with massive
amounts of data, the operations that manipulate
NLP data are computationally expensive. The
massive amount of computing power and data
throughput necessary for everyday NLP tasks makes
it important to be able to use all of the
computing power available to solve NLP problems.
Typically NLP researchers have used powerful
individual computers to solve NLP problems.
Using single computers is easy but inefficient:
users must choose and schedule their operations
manually, and they are limited to the power of one
computer. This poster proposes using
technologies created by Bell Labs and Google,
Inferno and Map Reduce, to allow the efficient
and transparent use of multiple computers
together to tackle NLP problems.
Purpose
Inferno gives the grid system a means to communicate: a simple method of communication between computers and programs that allows NLP tasks and grid control to be conducted quickly, easily, and transparently.
Purpose
Map Reduce provides a model of computation for the grid system. It allows the system to distribute computation in a uniform and easily understandable way. Moreover, this method of distribution does not require any knowledge of parallel computing, which makes the system easier to use and less prone to mistakes.
Grid
NAIST has a 64-computer cluster. Each computer in the cluster is a dual 2.4 GHz Opteron with 16 gigabytes of memory. This cluster will form the basis of the grid.
Data
We are using a corpus based on 700 gigabytes of web data.
Implementation
Until a complete implementation of Map Reduce is available, we will use Caerwyn Jones's Geryon, a partial implementation of Map Reduce. Mapping and reducing will initially be implemented as Inferno shell commands. These commands will either be implemented in Limbo or use Inferno's ability to interact with the host operating system to run external commands, passing the results of those commands to other nodes in the grid using Styx.
Applications
Initially the grid will be used to perform n-gram searches on the web data.
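The core operation can be sketched concretely. The following is a minimal Python illustration of n-gram extraction from tokenized text; the function name and details are ours, not part of the grid implementation:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Example: bigrams of a tokenized sentence.
sentence = "the cat sat on the mat".split()
print(ngrams(sentence, 2))
# -> [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```

Run over 700 gigabytes of web data, this trivial loop is exactly the kind of embarrassingly parallel work the grid is meant to distribute.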
Grid Computing
Hosted Operating System
Developed by Bell Labs, Inferno can run either as a standalone operating system, like Windows or Unix, or as a hosted operating system, running as a normal program, like Internet Explorer or Microsoft Word. Inferno runs on a virtual machine, Dis, and is programmed in Limbo, a byte-compiled, garbage-collected language similar to C. Inferno also provides a suite of commands similar to Unix's (sort, wc, grep) as well as an advanced text editor and debugger.
- Distributed Computing
- Distributed computing is based on the observation that computing power concentrated in one computer costs more than the equivalent power distributed among multiple, cheaper computers. In other words, the price of computing power does not grow linearly with the power gained. In addition, there are upper limits on the power of even the most expensive individual computers: they are bounded by the maximum power of the technology available.
- However, the combined power of multiple computers also has limitations. Coordinating and executing distributed computation is not as efficient as computation on a single computer, and it is difficult to implement because the user must consider communication between different computers.
- Grid Systems
- Grid computing is a type of distributed computing that connects computers that aren't necessarily closely connected or centrally administered in order to perform distributed computations. Grid systems typically break a problem into unconnected subproblems and distribute those subproblems over a network to the computers in the grid. Members of a grid can reside on the same network or be loosely connected all over the world.
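The split step above can be sketched in a few lines. This toy partitioner is a hypothetical helper, not part of any grid framework; it produces independent shards that could be handed to separate machines:

```python
def split_into_shards(items, n_workers):
    """Partition a list into n_workers roughly equal, independent shards."""
    return [items[i::n_workers] for i in range(n_workers)]

data = list(range(10))
print(split_into_shards(data, 3))
# -> [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Because the shards share no data, each one can be processed without any communication between workers.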
Method of Distributed Computation
Map Reduce was developed by Google as a way of doing large computations in parallel. Map Reduce uses two techniques taken from functional programming. Map takes a group of key/value pairs and maps the keys onto a new set of values. For example, parsing a set of sentences can be thought of as mapping each sentence onto its corresponding parse tree. Reduce takes those values and reduces them to a meaningful result; counting the occurrences of certain tree patterns and counting n-grams are examples of reduce operations. Mapping and reducing are implemented on individual computers (worker nodes). The values to be processed are split and sent to worker nodes, which map the values; the mapped values are then sent to other computers, which reduce and output the mapped data.
NLP Applications
Map Reduce is naturally
applicable to the annotation-based model of an NLP workflow. Adding annotation corresponds to the Map function: segmenting a Japanese sentence using ChaSen or parsing an English sentence using the Charniak parser can be seen as mapping a sentence onto its words or onto its parse tree, respectively. Drawing conclusions based on the mapped values corresponds to the Reduce function: for example, counting the n-grams generated from sentences chunked by ChaSen, or returning specific subtrees from parses produced by the Charniak parser.
Advantages
The advantage of computing values with Map Reduce is that the computation is inherently parallelizable. Each set of mapped values does not depend on the values computed by any other worker node. Since one mapped value does not depend on any other, individual computers don't need to communicate with each other, and every computer in the grid can compute values as fast as it can. Applied to the annotation-based NLP workflow, Map Reduce provides a natural way of expressing NLP computations in a parallelizable form.
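The two phases described above can be sketched in a few lines of Python. This is a toy, sequential simulation of the model (the function names are invented, and no real worker nodes or the Geryon implementation are involved): map turns each worker's shard of sentences into partial n-gram counts, and reduce merges the partial counts.

```python
from collections import Counter

def map_shard(sentences, n=2):
    """Map: one worker turns its shard of sentences into partial n-gram counts."""
    counts = Counter()
    for s in sentences:
        tokens = s.split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def reduce_counts(partials):
    """Reduce: merge the partial counts from every worker into one total."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

corpus = ["the cat sat", "the cat ran", "a dog sat"]
shards = [corpus[0:2], corpus[2:3]]           # split the data among workers
partials = [map_shard(s) for s in shards]     # map phase: independent, parallelizable
print(reduce_counts(partials)[("the", "cat")])  # -> 2
```

Each call to map_shard touches only its own shard, which is precisely why the map phase needs no communication between workers.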
Conclusions
Grids have the potential to increase researcher and machine productivity in solving Natural Language Processing problems. Grids allow researchers to utilize more computing power to deal with the massive amounts of data used in NLP problems and to perform computationally complex tasks. This particular implementation of a grid using Inferno and Map Reduce further enhances researcher productivity by providing simple abstractions for grid interaction and computation. Communication between users and the grid is transparent, and computations are performed using functional principles to minimize the interconnectedness of computations. This means that the researcher needs no knowledge of the communication medium or of parallel computing, and can concentrate on solving NLP problems.
Grids and NLP
- Styx
- The Styx protocol is a method of communication that is the key to Inferno's ability to act as a distributed system.
- Styx implements communication by creating in-memory file systems. These filesystems are provided by namespaces unique to each program, which allows individual programs to use different sets of resources. Interactions with the filesystem are performed using traditional file reads and writes, then interpreted by the kernel.
- By allowing access to Styx resources remotely, Styx can be used for transparent interprocess communication (IPC) and remote procedure calls (RPC).
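The idea that control reduces to ordinary reads and writes can be sketched locally. The file names here ('ctl', 'status') are invented for illustration; under Inferno the same open/read/write calls would reach a remote worker through a Styx-served namespace rather than the local disk.

```python
import os
import tempfile

# Sketch: a worker exposes a 'ctl' file; clients drive it with plain
# reads and writes, exactly as they would a local file.
workdir = tempfile.mkdtemp()
ctl = os.path.join(workdir, "ctl")        # hypothetical control file
status = os.path.join(workdir, "status")  # hypothetical status file

with open(ctl, "w") as f:
    f.write("map shard0\n")               # issue a command by writing

# The 'worker' (here, the same process) reads the command and reports.
with open(ctl) as f:
    command = f.read().strip()
with open(status, "w") as f:
    f.write("running: " + command + "\n")

with open(status) as f:
    print(f.read().strip())               # -> running: map shard0
```

Because the program only ever opens, reads, and writes files, it cannot tell whether the worker behind 'ctl' is local or on the other side of the network.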
- NLP Workflows
- Typical NLP tasks can be thought of as adding annotations to language data and making conclusions based on those annotations. An NLP researcher starts with raw language data (a corpus) and applies a series of transformations to that data (ChaSen, CaboCha, the Charniak Parser). Each transformation adds progressively more information to the previous data. Finally, the NLP researcher draws some conclusion from that data (the number of n-grams, extracting a certain type of subtree).
- Grids
- NLP research conducted over a grid would automate and distribute this workflow over multiple machines. The raw data would be divided among the computers, and annotation could be assigned either serially or in parallel (output from one annotation sent directly as input to another). The assignment of grid workflows is traditionally done in an XML job-control language.
- Previous Work
- Hughes, Bird et al. '03 discuss using grids with existing grid frameworks and finding and using resources efficiently.
- Tamburini '04 discusses making a combined corpus from a grid of smaller corpora distributed globally.
- Sonntag '04 describes how to automatically discover grid workflows for NLP applications.
References
- MapReduce: Simplified Data Processing on Large Clusters, OSDI '04
- The Inferno Operating System, Bell Labs Technical Journal, Vol. 2
- The Styx Architecture for Distributed Systems, Bell Labs Technical Journal, Vol. 4, No. 2
- Distributed NLP and Machine Learning for Question Answering Grid, IWSIMWG '04
- A Grid Based Architecture for High-Performance NLP, arXiv:cs/0308008
- Building Distributed Language Resources by Grid Computing, Computer '02
- Advantages
- By using the Styx protocol as its primary RPC method, a grid based on Inferno avoids the overhead and complexity that come with XML-based systems. Reads and writes are the only operations necessary to interact with remote machines on an Inferno grid. This provides an abstraction at the OS level, invisible to the user: host programs are unaware of whether they are dealing with a local or a remote resource. This greatly simplifies the implementation of programs like the grid controller or worker nodes, because they can be implemented separately from the grid's structure.