Title: Distributed Iterative Training
1. Distributed Iterative Training
Kevin Gimpel, Shay Cohen, Severin Hacker, Noah A. Smith
2. Outline
- The Problem
- Distributed Architecture
- Experiments and Hadoop Issues
3. Iterative Training
- Many problems in NLP and machine learning require iterating over large training sets many times:
  - Training log-linear models (logistic regression, conditional random fields)
  - Unsupervised or semi-supervised learning with EM (word alignment in MT, grammar induction)
  - Minimum Error-Rate Training in MT
  - Online learning (MIRA, perceptron, stochastic gradient descent)
- All of the above except online learning can be easily parallelized (see the sketch after this list):
  - Compute statistics on sections of the data independently
  - Aggregate them
  - Update parameters using statistics of the full set of data
  - Repeat until a stopping criterion is met
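As a deliberately toy illustration of this batch pattern, the C sketch below fits a single parameter by least squares: each "shard" of data contributes a partial gradient, the partial gradients are summed, and the parameter is updated from the full-data statistic until the gradient is numerically zero. The data, learning rate, and structure are invented for illustration and are not the grammar-induction system.

/* Toy illustration of the batch pattern above: each "shard" of the data
 * contributes a partial statistic (here, a partial gradient), the partial
 * statistics are aggregated, and the parameter is updated from the
 * full-data statistic; repeat until a stopping criterion is met.
 * Purely illustrative (1-D least squares), not the grammar-induction code. */
#include <stdio.h>
#include <math.h>

#define NUM_SHARDS 4
#define SHARD_SIZE 3

int main(void) {
    /* in the distributed setting, each shard would live on a different worker */
    double data[NUM_SHARDS][SHARD_SIZE] = {
        {1.0, 2.0, 3.0}, {2.5, 3.5, 1.5}, {0.5, 2.0, 4.0}, {3.0, 1.0, 2.0}};
    double theta = 0.0, lr = 0.05;

    for (int iter = 0; iter < 100; ++iter) {
        double grad = 0.0;
        for (int s = 0; s < NUM_SHARDS; ++s) {       /* "map": per-shard statistic */
            double partial = 0.0;
            for (int i = 0; i < SHARD_SIZE; ++i)
                partial += theta - data[s][i];        /* d/dtheta of 0.5*(theta - x)^2 */
            grad += partial;                           /* "reduce": aggregate */
        }
        theta -= lr * grad;                            /* update parameters */
        if (fabs(grad) < 1e-8) break;                  /* stopping criterion */
    }
    printf("learned theta = %f (the sample mean)\n", theta);
    return 0;
}

Running this prints the sample mean, which is the least-squares solution; the point is only the shape of the loop: per-shard statistics, aggregation, and a single parameter update per sweep.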
4. Dependency Grammar Induction
- Given sentences of natural language text, infer (dependency) parse trees
- State-of-the-art results obtained using only a few thousand sentences of length ≤ 10 tokens (Smith and Eisner, 2006)
- This talk: scaling up to more and longer sentences using Hadoop!
5. Dependency Grammar Induction
- Training
  - Input is a set of sentences (actually, POS tag sequences) and a grammar with initial parameter values
  - Run an iterative optimization algorithm (EM, LBFGS, etc.) that changes the parameter values on each iteration
  - Output is a learned set of parameter values
- Testing
  - Use the grammar with learned parameters to parse a small set of test sentences
  - Evaluate by computing the percentage of predicted edges that match a human annotator (a small sketch of this metric follows)
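To make the evaluation concrete, here is a minimal sketch (with invented head arrays) of directed attachment accuracy: the percentage of tokens whose predicted head matches the head chosen by the human annotator.

/* Minimal sketch of the evaluation: percentage of predicted dependency
 * edges (head attachments) that match the human annotation.
 * The toy head arrays below are hypothetical. */
#include <stdio.h>

int main(void) {
    /* head index for each of 6 tokens (0 = root attachment) */
    int predicted[] = {2, 0, 2, 5, 3, 0};   /* parses from the induced grammar */
    int gold[]      = {2, 0, 4, 5, 3, 0};   /* human-annotated heads */
    int n = 6, correct = 0;
    for (int i = 0; i < n; ++i)
        if (predicted[i] == gold[i]) correct++;
    printf("attachment accuracy: %.1f%%\n", 100.0 * correct / n);
    return 0;
}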
6. Outline
- The Problem
- Distributed Architecture
- Experiments and Hadoop Issues
7. MapReduce for Grammar Induction
- MapReduce was designed for:
  - Large amounts of data distributed across many disks
  - Simple data processing
- We have:
  - (Relatively) small amounts of data
  - Expensive processing and high memory requirements
8. MapReduce for Grammar Induction
- Algorithms require 50-100 iterations for convergence
- Each iteration requires a full sweep over all training data
- Computational bottleneck is computing expected counts for EM on each iteration (the gradient for LBFGS)
- Our approach: run one MapReduce job for each iteration
  - Map: compute expected counts (gradient)
  - Reduce: aggregate
  - Offline: renormalize (EM) or modify parameter values (LBFGS); a sketch of the renormalization step follows this list
- Note: renormalization could be done in reduce tasks for EM with correct partition functions, but using LBFGS in multiple reduce tasks is trickier
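Below is a minimal sketch of the offline renormalization step (the EM M-step performed between MapReduce jobs), under the assumption that each aggregated expected count belongs to some normalization group and the new parameter value is that count divided by its group's total. The group and event names are invented; the real model has roughly 2600 parameters (see slide 15).

/* Illustrative sketch of the offline renormalization (EM M-step): each
 * expected count is divided by the total expected count of its
 * normalization group.  The toy "grammar" entries below are hypothetical. */
#include <stdio.h>
#include <string.h>

#define NUM_PARAMS 4
#define MAX_GROUPS 4

typedef struct { const char *group, *event; double expected; } Count;

int main(void) {
    Count counts[NUM_PARAMS] = {
        {"NN-right", "VB", 12.3}, {"NN-right", "JJ", 3.1},
        {"VB-left",  "NN", 20.0}, {"VB-left",  "PRP", 5.5}};
    const char *groups[MAX_GROUPS]; double totals[MAX_GROUPS] = {0};
    int num_groups = 0;

    /* accumulate a total for each normalization group */
    for (int i = 0; i < NUM_PARAMS; ++i) {
        int g;
        for (g = 0; g < num_groups; ++g)
            if (strcmp(groups[g], counts[i].group) == 0) break;
        if (g == num_groups) groups[num_groups++] = counts[i].group;
        totals[g] += counts[i].expected;
    }
    /* new parameter value = expected count / group total */
    for (int i = 0; i < NUM_PARAMS; ++i) {
        int g;
        for (g = 0; g < num_groups; ++g)
            if (strcmp(groups[g], counts[i].group) == 0) break;
        printf("P(%s | %s) = %.4f\n", counts[i].event, counts[i].group,
               counts[i].expected / totals[g]);
    }
    return 0;
}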
9. MapReduce Implementation
[Architecture diagram: Server, Distributed Cache, Map, Reduce]
- Server: normalize expected counts to get new parameter values; start a new MapReduce job, placing the new parameter values on the distributed cache
- Map: compute expected counts
- Reduce: aggregate expected counts
(A rough sketch of the server's loop follows.)
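The following is a very rough, hypothetical sketch of what a control loop like dep_induction_server's could look like, assuming an old-style "hadoop jar hadoop-streaming.jar" invocation: write out the current parameter values, ship them to the tasks (the real system places them on the distributed cache), launch the streaming job, then read back and renormalize the aggregated counts. All helper bodies, file names, and flags below are placeholders, not the authors' actual interface.

/* Hypothetical sketch of the per-iteration control loop in the server.
 * Helper bodies, file names, and the exact streaming flags are placeholders. */
#include <stdio.h>
#include <stdlib.h>

static void write_params(const char *path) {            /* dump current parameter values */
    FILE *f = fopen(path, "w");
    if (f) { fprintf(f, "# parameter values go here\n"); fclose(f); }
}
static void read_and_renormalize(const char *path) {     /* M-step on aggregated counts */
    (void)path;                                           /* omitted in this sketch */
}
static int converged(int iter) { return iter >= 2; }      /* placeholder stopping criterion */

int main(void) {
    char cmd[1024];
    for (int iter = 0; !converged(iter); ++iter) {
        write_params("params.txt");
        /* the real system puts the parameter file on the distributed cache */
        snprintf(cmd, sizeof cmd,
                 "hadoop jar hadoop-streaming.jar"
                 " -input train -output counts.%d"
                 " -mapper dep_induction_map -reducer summer"
                 " -file params.txt -numReduceTasks 5", iter);
        if (system(cmd) != 0) { fprintf(stderr, "MapReduce job failed\n"); return 1; }
        read_and_renormalize("counts");                   /* new parameters for next iteration */
    }
    return 0;
}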
10. Running Experiments
- We use streaming for all experiments, with 2 C programs: the server and the map (the reduce is a simple summer; a sketch appears after this slide)
- > cd /home/kgimpel/grammar_induction
- > hod allocate -d /home/kgimpel/grammar_induction -n 25
- > ./dep_induction_server \
      input_file=/user/kgimpel/data/train20-20parts \
      aux_file=aux.train20 output_file=model.train20 \
      hod_config=/home/kgimpel/grammar_induction \
      num_reduce_tasks=5 1> stdout 2> stderr
- dep_induction_server runs a MapReduce job on each iteration
- Input is split into pieces for the map tasks (the dataset is too small for the default Hadoop splitter)
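Since the reduce step is described as a simple summer, here is a hedged sketch of what such a streaming reducer could look like in C: it reads tab-separated key/value lines from stdin (grouped by key, as streaming guarantees) and emits one summed value per key. This is an illustration, not the authors' actual reducer.

/* Hedged sketch of the "simple summer" streaming reducer: read
 * tab-separated "key<TAB>value" lines from stdin (grouped by key)
 * and print one summed value per key.  Not the authors' actual code. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char line[4096], cur[4096] = "";
    double sum = 0.0;
    int have_key = 0;
    while (fgets(line, sizeof line, stdin)) {
        char *tab = strchr(line, '\t');
        if (!tab) continue;                  /* skip malformed lines */
        *tab = '\0';                         /* line now holds just the key */
        double value = atof(tab + 1);
        if (have_key && strcmp(cur, line) != 0) {
            printf("%s\t%g\n", cur, sum);    /* emit the previous key's total */
            sum = 0.0;
        }
        strncpy(cur, line, sizeof cur - 1);
        cur[sizeof cur - 1] = '\0';
        have_key = 1;
        sum += value;
    }
    if (have_key) printf("%s\t%g\n", cur, sum);
    return 0;
}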
11. Outline
- The Problem
- Distributed Architecture
- Experiments and Hadoop Issues
12. Speed-up with Hadoop
- 38,576 sentences
- 40 words / sent.
- 40 nodes
- 5 reduce tasks
- Average iteration time reduced from 2039 s to 115 s
- Total time reduced from 3400 minutes to 200 minutes (roughly a 17x speed-up)
13. Hadoop Issues
- Overhead of running a single MapReduce job
- Stragglers in the map phase
14. Typical Iteration (40 nodes, 38,576 sentences)
  23:17:05  map 0%    reduce 0%
  23:17:12  map 3%    reduce 0%
  23:17:13  map 26%   reduce 0%
  23:17:14  map 49%   reduce 0%
  23:17:15  map 66%   reduce 0%
  23:17:16  map 72%   reduce 0%
  23:17:17  map 97%   reduce 0%
  23:17:18  map 100%  reduce 0%
  23:18:00  map 100%  reduce 1%
  23:18:15  map 100%  reduce 2%
  23:18:18  map 100%  reduce 4%
  23:18:20  map 100%  reduce 15%
  23:18:27  map 100%  reduce 17%
  23:18:28  map 100%  reduce 18%
  23:18:30  map 100%  reduce 23%
  23:18:32  map 100%  reduce 100%
- Consistent 40-second delay between the map and reduce phases
- 115 s per iteration total, with 40 s per iteration of overhead
- When we're running 100 iterations per experiment, 40 seconds per iteration really adds up!
- Roughly 35% of execution time is overhead!
15. Typical Iteration (40 nodes, 38,576 sentences)
(same iteration log as the previous slide)
- Why does reduce take so long?
  - 5 reduce tasks used
  - The reduce phase is simply aggregation of values for 2600 parameters
16. Histogram of Iteration Times
[Histogram of per-iteration times; mean = 115 s]
17. Histogram of Iteration Times
[Same histogram; mean = 115 s, but a few iterations take much longer]
- What's going on here?
18. Typical Iteration
(typical iteration log repeated from slide 14, for comparison with the slow iteration on the next slide)
19. Typical Iteration vs. Slow Iteration
Typical iteration: same log as slide 14 (finishes at 23:18:32).
Slow iteration:
  23:20:27  map 0%    reduce 0%
  23:20:34  map 5%    reduce 0%
  23:20:35  map 20%   reduce 0%
  23:20:36  map 41%   reduce 0%
  23:20:37  map 56%   reduce 0%
  23:20:38  map 74%   reduce 0%
  23:20:39  map 95%   reduce 0%
  23:20:40  map 97%   reduce 0%
  23:21:32  map 97%   reduce 1%
  23:21:37  map 97%   reduce 2%
  23:21:42  map 97%   reduce 12%
  23:21:43  map 97%   reduce 15%
  23:21:47  map 97%   reduce 19%
  23:21:50  map 97%   reduce 21%
  23:21:52  map 97%   reduce 26%
  23:21:57  map 97%   reduce 31%
  23:21:58  map 97%   reduce 32%
  23:23:46  map 100%  reduce 32%
  23:24:54  map 100%  reduce 46%
  23:24:55  map 100%  reduce 86%
  23:24:56  map 100%  reduce 100%
- 3 minutes waiting for the last map tasks to complete (map sits at 97% from 23:20:40 until 23:23:46)
20. Typical Iteration vs. Slow Iteration
(same comparison as the previous slide)
- 3 minutes waiting for the last map tasks to complete
- Suggestions? (Doesn't Hadoop replicate map tasks to avoid this?)
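One suggestion from our side (not from the talk, and dependent on the Hadoop version and site configuration): the feature that re-runs slow map tasks on other nodes is speculative execution, and in older Hadoop releases it can be toggled per streaming job with a jobconf property, e.g.

  -jobconf mapred.map.tasks.speculative.execution=true

If speculative execution is disabled for the cluster, straggling map tasks do not get backup copies.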
21. Questions?