Noah Evans - PowerPoint PPT Presentation



An Inferno Grid for NLP using Map Reduce
Noah Evans
Introduction
Inferno
Map Reduce
System
Resource Demands of Natural Language Processing
NLP is a field that deals with huge amounts of data. Corpora can
contain millions or even billions of words, and the web is larger
still. Not only do NLP applications deal with massive amounts of
data, but the operations that manipulate that data are also
computationally expensive. The computing power and data throughput
needed for everyday NLP tasks make it important to be able to use all
of the computing power available.
Typically, NLP researchers have used powerful individual computers to
solve NLP problems. Using a single computer is easy but inefficient:
users must choose and schedule their operations manually, and they
are limited to the power of one machine. This poster proposes using
two technologies, Inferno (created by Bell Labs) and Map Reduce
(created by Google), to allow the efficient and transparent use of
multiple computers together to tackle NLP problems.
Purpose of Inferno
Inferno provides a means for the grid system to communicate. It
offers a simple method of communication between computers and
programs that allows NLP tasks and grid control to be conducted
quickly, easily, and transparently.
Purpose of Map Reduce
Map Reduce provides a model of computation for the grid system. It
allows the system to distribute computation in a uniform and easily
understandable way. Moreover, this method of distribution does not
require any knowledge of parallel computing, which makes the system
easier to use and less prone to mistakes.
Grid
NAIST has a 64-computer cluster. Each computer in the cluster is a
dual Opteron 2.4 GHz machine with 16 gigabytes of memory. This
cluster will form the basis of the grid.
Data
We are using a corpus based on 700 gigabytes of web data.
Implementation
Until a complete implementation of Map Reduce is available, we will
use Geryon, Caerwyn Jones's partial implementation of Map Reduce.
Mapping and reducing will initially be implemented as Inferno shell
commands. These commands will either be written in Limbo or use
Inferno's ability to interact with the host operating system to run
external commands, and their results will be passed to other nodes in
the grid using Styx.
Applications
Initially the grid will be used to perform n-gram searches on the web
data.
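The implementation option described above, in which a map step hands
its input to an external command on the host and passes the output
on, can be sketched in Python. This is only an illustration under
stated assumptions: `run_external` and `LOWERCASE` are hypothetical
names, not part of Inferno or Geryon, and the external command here is
a child Python process so that the sketch does not depend on any
particular host tool. On the grid, the command would be started from
Inferno and its result passed to other nodes over Styx.

```python
import subprocess
import sys


def run_external(command, text):
    """Feed text to an external command on stdin and return its stdout.

    Hypothetical helper: it stands in for a map step that delegates
    its work to a program on the host operating system.
    """
    result = subprocess.run(
        command,
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise an exception if the external command fails
    )
    return result.stdout


# Stand-in for a host tool: a child Python process that lowercases stdin.
LOWERCASE = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().lower())",
]

if __name__ == "__main__":
    print(run_external(LOWERCASE, "The Web Is Even Larger"))
```

Keeping the external program behind a single text-in, text-out function
means the same map step could later be replaced by a native
implementation without changing the code that calls it.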
Grid Computing
Hosted Operating System
Developed by Bell Labs, Inferno can run either as a standalone
operating system, like Windows or Unix, or as a hosted operating
system, running as a normal program, like Internet Explorer or
Microsoft Word. Inferno programs run on a virtual machine, Dis, and
are written in Limbo, a byte-compiled, garbage-collected language
similar to C. Inferno also provides a suite of commands similar to
those of Unix (sort, wc, grep) as well as an advanced text editor and
debugger.
  • Distributed Computing
  • Distributed computing is based on the understanding that
    computing power concentrated in one computer costs more than the
    equivalent power distributed among multiple, cheaper computers.
    In other words, the price of computing power does not scale
    linearly with the power gained. In addition, there are upper
    limits on the power of even the most expensive individual
    computers, which are bounded by the technology available.
  • However, the combined power of multiple computers also has
    limitations. Coordinating and executing a distributed computation
    is not as efficient as computing on a single computer, and it is
    difficult to implement because the user must consider
    communication between different computers.
  • Grid Systems
  • Grid computing is a type of distributed computing that connects
    computers that aren't necessarily closely connected or centrally
    administered in order to perform distributed computations. Grid
    systems typically break a problem down into independent
    subproblems and distribute them over a network among the
    computers in the grid. Members of a grid can reside on the same
    network or be loosely connected all over the world.
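The idea of breaking a problem into independent subproblems and
distributing them can be sketched on a single machine. In the Python
sketch below, threads stand in for the computers of a grid; that
substitution, and the names `split`, `count_words`, and `run_on_grid`,
are assumptions made for illustration and do not come from the poster.

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Divide items into at most `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            chunks.append(items[start:end])
        start = end
    return chunks


def count_words(sentences):
    """An independent subproblem: count the words in one chunk."""
    return sum(len(s.split()) for s in sentences)


def run_on_grid(sentences, workers):
    """Solve each chunk separately, then combine the partial results."""
    chunks = split(sentences, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(count_words, chunks))
    return sum(partial)


if __name__ == "__main__":
    print(run_on_grid(["a b c", "d e", "f"], workers=2))
```

Because no chunk needs anything from another chunk, the only
coordination required is handing out the chunks and adding up the
partial counts at the end.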

Map Reduce
Method of Distributed Computation
Map Reduce was developed by Google as a way of doing large
computations in parallel. Map Reduce uses two techniques taken from
functional programming.
Map takes a group of key/value pairs and maps each of them onto a new
set of values. For example, parsing a set of sentences can be thought
of as mapping each sentence onto its corresponding parse tree.
Reduce takes those values and reduces them to a meaningful result.
For example, counting the occurrences of certain tree patterns and
counting n-grams are both reduce operations.
Mapping and reducing are carried out on individual computers (worker
nodes). The values to be processed are split and sent to worker
nodes, which map them; the mapped values are then sent to other
computers, which reduce them and output the result.
NLP Applications
Map Reduce is naturally applicable to the annotation-based model of
an NLP workflow. Adding annotation corresponds to the Map function:
segmenting a Japanese sentence using ChaSen or parsing an English
sentence using the Charniak parser can be seen as mapping a sentence
onto its words or onto its parse tree, respectively. Drawing
conclusions from the mapped values corresponds to the Reduce
function: for example, counting the n-grams generated from sentences
segmented by ChaSen, or returning specific subtrees from the parses
produced by the Charniak parser.
Advantages
The advantage of computing values with Map Reduce is that it is
inherently parallelizable. Each set of mapped values does not depend
on the values computed by any other worker node. Since one mapped
value does not depend on any other, individual computers don't need
to communicate with each other while mapping. This means that every
computer in the grid can compute values as fast as it can. Applied to
the annotation-based NLP workflow, Map Reduce provides a natural way
of expressing NLP computations in a parallelizable form.
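The correspondence between annotation and Map, and between counting
and Reduce, can be made concrete with a small single-machine sketch in
Python. It is an illustration under stated assumptions, not the
Geryon implementation: whitespace splitting stands in for a real
segmenter such as ChaSen, the function names are hypothetical, and the
explicit grouping step (`shuffle`) is added here to show how mapped
values for the same key reach one reducer.

```python
from collections import defaultdict


def map_sentence(sentence, n=2):
    """Map: annotate one sentence by emitting (n-gram, 1) pairs.

    Whitespace splitting stands in for a real segmenter.
    """
    words = sentence.lower().split()
    return [(tuple(words[i:i + n]), 1) for i in range(len(words) - n + 1)]


def shuffle(pairs):
    """Group mapped values by key so each key reaches a single reducer."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce: collapse all values for one n-gram into a total count."""
    return key, sum(values)


def ngram_counts(sentences, n=2):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = [pair for s in sentences for pair in map_sentence(s, n)]
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    corpus = ["the cat sat", "the cat ran"]
    for gram, count in sorted(ngram_counts(corpus).items()):
        print(" ".join(gram), count)
```

Each call to `map_sentence` looks at one sentence only, so the calls
could run on different worker nodes in any order; only the grouping
and reducing steps need to see values from more than one sentence.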
Conclusions
Grids have the potential to increase both researcher and machine
productivity in solving Natural Language Processing problems. Grids
allow researchers to use more computing power to deal with the
massive amounts of data used in NLP problems and to perform
computationally complex tasks. This particular implementation of a
grid, using Inferno and Map Reduce, further enhances researcher
productivity by providing simple abstractions for grid interaction
and computation. Communication between users and the grid is
transparent, and computations are performed using functional
principles to minimize their interdependence. This means that the
researcher does not need any knowledge of the communication medium or
of parallel computing, and can concentrate on solving NLP problems.
Grids and NLP
References
  • Styx
  • The Styx protocol is a method of communication that is the key
    to Inferno's ability to act as a distributed system.
  • Styx implements communication by creating in-memory file
    systems. These file systems are presented through namespaces
    unique to each program, which allows individual programs to use
    different sets of resources. Interactions with a file system are
    performed using traditional file reads and writes, which are then
    interpreted by the kernel.
  • By allowing access to Styx resources remotely, Styx can be used
    for transparent interprocess communication (IPC) and remote
    procedure calls (RPC).
  • NLP Workflows
  • Typical NLP tasks can be thought of as adding annotations to
    language data and drawing conclusions based on those
    annotations. An NLP researcher starts with raw language data (a
    corpus) and applies a series of transformations to that data
    (ChaSen, CaboCha, the Charniak parser). Each transformation adds
    progressively more information to the previous data. Finally,
    the NLP researcher draws some conclusion from that data (the
    number of n-grams, or the extraction of a certain type of
    subtree).
  • Grids
  • NLP research conducted over a grid would automate and distribute
    this workflow over multiple machines. The raw data would be
    divided among the computers, and annotations could be assigned
    either serially or in parallel (with the output from one
    annotation sent directly as input to another). Assignment in
    grid workflows is traditionally specified in an XML job control
    language.
  • Previous Work
  • Hughes, Bird et al. 03 discuss using grids with existing grid
    frameworks, and finding and using resources efficiently.
  • Tamburini 04 discusses making a combined corpus from a grid of
    smaller corpora distributed globally.
  • Sonntag 04 describes how to automatically discover grid
    workflows for NLP applications.
  • MapReduce: Simplified Data Processing on Large Clusters, OSDI 04
  • The Inferno Operating System, Bell Labs Technical Journal,
    Vol. 2
  • The Styx Architecture for Distributed Systems, Bell Labs
    Technical Journal, Vol. 4, No. 2
  • Distributed NLP and Machine Learning for Question Answering
    Grid, IWSIMWG 04
  • A Grid Based Architecture for High-Performance NLP,
    arXiv:cs/0308008
  • Building Distributed Language Resources by Grid Computing,
    Computer 02
  • Advantages
  • By using the Styx protocol as its primary RPC method, a grid
    based on Inferno avoids the overhead and complexity that come
    with XML-based systems. Reads and writes are the only operations
    necessary to interact with remote machines on an Inferno grid.
    This provides an abstraction at the OS level that is invisible
    to the user. Host programs are unaware of whether they are
    dealing with a local or a remote resource. This greatly
    simplifies the implementation of programs like the grid
    controller or the worker nodes, because they can be implemented
    separately from the grid's structure.
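The point made above, that reads and writes are the only operations a
program needs and that the program cannot tell a local resource from a
remote one, can be illustrated with a small Python sketch. It does not
implement the Styx wire protocol: `LocalResource`, `RemoteResource`,
and `submit_job` are hypothetical names, and the "remote" side is
simulated by another in-process object so that the example is
self-contained.

```python
class LocalResource:
    """A resource held in this process's memory."""

    def __init__(self):
        self._data = b""

    def write(self, data):
        self._data = data
        return len(data)

    def read(self):
        return self._data


class RemoteResource:
    """Stand-in for a resource on another machine.

    A real system would forward each call over the network; here the
    remote side is just another object, an assumption made to keep the
    sketch self-contained.
    """

    def __init__(self, backend):
        self._backend = backend
        self.messages = 0  # number of simulated network round trips

    def write(self, data):
        self.messages += 1
        return self._backend.write(data)

    def read(self):
        self.messages += 1
        return self._backend.read()


def submit_job(resource, job):
    """Controller logic written only in terms of write and read."""
    resource.write(job.encode("utf-8"))
    return resource.read().decode("utf-8")


if __name__ == "__main__":
    for target in (LocalResource(), RemoteResource(LocalResource())):
        print(type(target).__name__, submit_job(target, "count bigrams"))
```

`submit_job` contains no networking code and works unchanged with
either class, which is the property that lets a grid controller be
written independently of where its workers actually run.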