Title: Distributed Iterative Training
1. Distributed Iterative Training
Kevin Gimpel, Shay Cohen, Severin Hacker, Noah A. Smith
2. Outline
- The Problem
- Distributed Architecture
- Experiments and Hadoop Issues
3. Iterative Training
- Many problems in NLP and machine learning require iterating over large training sets many times:
  - Training log-linear models (logistic regression, conditional random fields)
  - Unsupervised or semi-supervised learning with EM (word alignment in MT, grammar induction)
  - Minimum Error-Rate Training in MT
  - Online learning (MIRA, perceptron, stochastic gradient descent)
- All of the above except online learning can be easily parallelized (see the sketch after this list):
  - Compute statistics on sections of the data independently
  - Aggregate them
  - Update parameters using statistics of the full set of data
  - Repeat until a stopping criterion is met
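As a deliberately toy illustration of this batch pattern, the C sketch below fits a single parameter by least squares: each "shard" of data contributes a partial gradient, the partial gradients are summed, and the parameter is updated from the full-data statistic until the gradient is numerically zero. The data, learning rate, and structure are invented for illustration and are not the grammar-induction system.

/* Toy illustration of the batch pattern above: each "shard" of the data
 * contributes a partial statistic (here, a partial gradient), the partial
 * statistics are aggregated, and the parameter is updated from the
 * full-data statistic; repeat until a stopping criterion is met.
 * Purely illustrative (1-D least squares), not the grammar-induction code. */
#include <stdio.h>
#include <math.h>

#define NUM_SHARDS 4
#define SHARD_SIZE 3

int main(void) {
    /* in the distributed setting, each shard would live on a different worker */
    double data[NUM_SHARDS][SHARD_SIZE] = {
        {1.0, 2.0, 3.0}, {2.5, 3.5, 1.5}, {0.5, 2.0, 4.0}, {3.0, 1.0, 2.0}};
    double theta = 0.0, lr = 0.05;

    for (int iter = 0; iter < 100; ++iter) {
        double grad = 0.0;
        for (int s = 0; s < NUM_SHARDS; ++s) {       /* "map": per-shard statistic */
            double partial = 0.0;
            for (int i = 0; i < SHARD_SIZE; ++i)
                partial += theta - data[s][i];        /* d/dtheta of 0.5*(theta - x)^2 */
            grad += partial;                           /* "reduce": aggregate */
        }
        theta -= lr * grad;                            /* update parameters */
        if (fabs(grad) < 1e-8) break;                  /* stopping criterion */
    }
    printf("learned theta = %f (the sample mean)\n", theta);
    return 0;
}

Running this prints the sample mean, which is the least-squares solution; the point is only the shape of the loop: per-shard statistics, aggregation, and a single parameter update per sweep.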
4. Dependency Grammar Induction
- Given sentences of natural language text, infer (dependency) parse trees
- State-of-the-art results obtained using only a few thousand sentences of length ≤ 10 tokens (Smith and Eisner, 2006)
- This talk: scaling up to more and longer sentences using Hadoop!
5. Dependency Grammar Induction
- Training
  - Input is a set of sentences (actually, POS tag sequences) and a grammar with initial parameter values
  - Run an iterative optimization algorithm (EM, LBFGS, etc.) that changes the parameter values on each iteration
  - Output is a learned set of parameter values
- Testing
  - Use the grammar with learned parameters to parse a small set of test sentences
  - Evaluate by computing the percentage of predicted edges that match a human annotator (a small sketch of this metric follows)
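To make the evaluation concrete, here is a minimal sketch (with invented head arrays) of directed attachment accuracy: the percentage of tokens whose predicted head matches the head chosen by the human annotator.

/* Minimal sketch of the evaluation: percentage of predicted dependency
 * edges (head attachments) that match the human annotation.
 * The toy head arrays below are hypothetical. */
#include <stdio.h>

int main(void) {
    /* head index for each of 6 tokens (0 = root attachment) */
    int predicted[] = {2, 0, 2, 5, 3, 0};   /* parses from the induced grammar */
    int gold[]      = {2, 0, 4, 5, 3, 0};   /* human-annotated heads */
    int n = 6, correct = 0;
    for (int i = 0; i < n; ++i)
        if (predicted[i] == gold[i]) correct++;
    printf("attachment accuracy: %.1f%%\n", 100.0 * correct / n);
    return 0;
}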
6. Outline
- The Problem
- Distributed Architecture
- Experiments and Hadoop Issues
7. MapReduce for Grammar Induction
- MapReduce was designed for:
  - Large amounts of data distributed across many disks
  - Simple data processing
- We have:
  - (Relatively) small amounts of data
  - Expensive processing and high memory requirements
8. MapReduce for Grammar Induction
- Algorithms require 50-100 iterations for convergence
- Each iteration requires a full sweep over all training data
- Computational bottleneck is computing expected counts for EM on each iteration (the gradient for LBFGS)
- Our approach: run one MapReduce job for each iteration
  - Map: compute expected counts (gradient)
  - Reduce: aggregate
  - Offline: renormalize (EM) or modify parameter values (LBFGS); a sketch of the renormalization step follows this list
- Note: renormalization could be done in reduce tasks for EM with correct partition functions, but using LBFGS in multiple reduce tasks is trickier
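Below is a minimal sketch of the offline renormalization step (the EM M-step performed between MapReduce jobs), under the assumption that each aggregated expected count belongs to some normalization group and the new parameter value is that count divided by its group's total. The group and event names are invented; the real model has roughly 2600 parameters (see slide 15).

/* Illustrative sketch of the offline renormalization (EM M-step): each
 * expected count is divided by the total expected count of its
 * normalization group.  The toy "grammar" entries below are hypothetical. */
#include <stdio.h>
#include <string.h>

#define NUM_PARAMS 4
#define MAX_GROUPS 4

typedef struct { const char *group, *event; double expected; } Count;

int main(void) {
    Count counts[NUM_PARAMS] = {
        {"NN-right", "VB", 12.3}, {"NN-right", "JJ", 3.1},
        {"VB-left",  "NN", 20.0}, {"VB-left",  "PRP", 5.5}};
    const char *groups[MAX_GROUPS]; double totals[MAX_GROUPS] = {0};
    int num_groups = 0;

    /* accumulate a total for each normalization group */
    for (int i = 0; i < NUM_PARAMS; ++i) {
        int g;
        for (g = 0; g < num_groups; ++g)
            if (strcmp(groups[g], counts[i].group) == 0) break;
        if (g == num_groups) groups[num_groups++] = counts[i].group;
        totals[g] += counts[i].expected;
    }
    /* new parameter value = expected count / group total */
    for (int i = 0; i < NUM_PARAMS; ++i) {
        int g;
        for (g = 0; g < num_groups; ++g)
            if (strcmp(groups[g], counts[i].group) == 0) break;
        printf("P(%s | %s) = %.4f\n", counts[i].event, counts[i].group,
               counts[i].expected / totals[g]);
    }
    return 0;
}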
9. MapReduce Implementation
[Architecture diagram: Server, Distributed Cache, Map, Reduce]
- Server: normalize expected counts to get new parameter values; start a new MapReduce job, placing the new parameter values on the distributed cache
- Map: compute expected counts
- Reduce: aggregate expected counts
(A rough sketch of the server's loop follows.)
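The following is a very rough, hypothetical sketch of what a control loop like dep_induction_server's could look like, assuming an old-style "hadoop jar hadoop-streaming.jar" invocation: write out the current parameter values, ship them to the tasks (the real system places them on the distributed cache), launch the streaming job, then read back and renormalize the aggregated counts. All helper bodies, file names, and flags below are placeholders, not the authors' actual interface.

/* Hypothetical sketch of the per-iteration control loop in the server.
 * Helper bodies, file names, and the exact streaming flags are placeholders. */
#include <stdio.h>
#include <stdlib.h>

static void write_params(const char *path) {            /* dump current parameter values */
    FILE *f = fopen(path, "w");
    if (f) { fprintf(f, "# parameter values go here\n"); fclose(f); }
}
static void read_and_renormalize(const char *path) {     /* M-step on aggregated counts */
    (void)path;                                           /* omitted in this sketch */
}
static int converged(int iter) { return iter >= 2; }      /* placeholder stopping criterion */

int main(void) {
    char cmd[1024];
    for (int iter = 0; !converged(iter); ++iter) {
        write_params("params.txt");
        /* the real system puts the parameter file on the distributed cache */
        snprintf(cmd, sizeof cmd,
                 "hadoop jar hadoop-streaming.jar"
                 " -input train -output counts.%d"
                 " -mapper dep_induction_map -reducer summer"
                 " -file params.txt -numReduceTasks 5", iter);
        if (system(cmd) != 0) { fprintf(stderr, "MapReduce job failed\n"); return 1; }
        read_and_renormalize("counts");                   /* new parameters for next iteration */
    }
    return 0;
}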
10. Running Experiments
- We use streaming for all experiments, with 2 C programs: the server and the map (the reduce is a simple summer; a sketch appears after this slide)
- > cd /home/kgimpel/grammar_induction
- > hod allocate -d /home/kgimpel/grammar_induction -n 25
- > ./dep_induction_server \
      input_file=/user/kgimpel/data/train20-20parts \
      aux_file=aux.train20 output_file=model.train20 \
      hod_config=/home/kgimpel/grammar_induction \
      num_reduce_tasks=5 1> stdout 2> stderr
- dep_induction_server runs a MapReduce job on each iteration
- Input is split into pieces for the map tasks (the dataset is too small for the default Hadoop splitter)
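Since the reduce step is described as a simple summer, here is a hedged sketch of what such a streaming reducer could look like in C: it reads tab-separated key/value lines from stdin (grouped by key, as streaming guarantees) and emits one summed value per key. This is an illustration, not the authors' actual reducer.

/* Hedged sketch of the "simple summer" streaming reducer: read
 * tab-separated "key<TAB>value" lines from stdin (grouped by key)
 * and print one summed value per key.  Not the authors' actual code. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char line[4096], cur[4096] = "";
    double sum = 0.0;
    int have_key = 0;
    while (fgets(line, sizeof line, stdin)) {
        char *tab = strchr(line, '\t');
        if (!tab) continue;                  /* skip malformed lines */
        *tab = '\0';                         /* line now holds just the key */
        double value = atof(tab + 1);
        if (have_key && strcmp(cur, line) != 0) {
            printf("%s\t%g\n", cur, sum);    /* emit the previous key's total */
            sum = 0.0;
        }
        strncpy(cur, line, sizeof cur - 1);
        cur[sizeof cur - 1] = '\0';
        have_key = 1;
        sum += value;
    }
    if (have_key) printf("%s\t%g\n", cur, sum);
    return 0;
}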
11. Outline
- The Problem
- Distributed Architecture
- Experiments and Hadoop Issues
12. Speed-up with Hadoop
- 38,576 sentences
- 40 words / sent.
- 40 nodes
- 5 reduce tasks
- Average iteration time reduced from 2039 s to 115 s
- Total time reduced from 3400 minutes to 200 minutes (roughly a 17x speed-up)
13. Hadoop Issues
- Overhead of running a single MapReduce job
- Stragglers in the map phase
14. Typical Iteration (40 nodes, 38,576 sentences)
  23:17:05  map 0%    reduce 0%
  23:17:12  map 3%    reduce 0%
  23:17:13  map 26%   reduce 0%
  23:17:14  map 49%   reduce 0%
  23:17:15  map 66%   reduce 0%
  23:17:16  map 72%   reduce 0%
  23:17:17  map 97%   reduce 0%
  23:17:18  map 100%  reduce 0%
  23:18:00  map 100%  reduce 1%
  23:18:15  map 100%  reduce 2%
  23:18:18  map 100%  reduce 4%
  23:18:20  map 100%  reduce 15%
  23:18:27  map 100%  reduce 17%
  23:18:28  map 100%  reduce 18%
  23:18:30  map 100%  reduce 23%
  23:18:32  map 100%  reduce 100%
- Consistent 40-second delay between the map and reduce phases
- 115 s per iteration total, with 40 s per iteration of overhead
- When we're running 100 iterations per experiment, 40 seconds per iteration really adds up!
- Roughly 35% of execution time is overhead!
15. Typical Iteration (40 nodes, 38,576 sentences)
(same iteration log as the previous slide)
- Why does reduce take so long?
  - 5 reduce tasks used
  - The reduce phase is simply aggregation of values for 2600 parameters
16. Histogram of Iteration Times
[Histogram of per-iteration times; mean = 115 s]
17. Histogram of Iteration Times
[Same histogram; mean = 115 s, but a few iterations take much longer]
- What's going on here?
18. Typical Iteration
(typical iteration log repeated from slide 14, for comparison with the slow iteration on the next slide)
19. Typical Iteration vs. Slow Iteration
Typical iteration: same log as slide 14 (finishes at 23:18:32).
Slow iteration:
  23:20:27  map 0%    reduce 0%
  23:20:34  map 5%    reduce 0%
  23:20:35  map 20%   reduce 0%
  23:20:36  map 41%   reduce 0%
  23:20:37  map 56%   reduce 0%
  23:20:38  map 74%   reduce 0%
  23:20:39  map 95%   reduce 0%
  23:20:40  map 97%   reduce 0%
  23:21:32  map 97%   reduce 1%
  23:21:37  map 97%   reduce 2%
  23:21:42  map 97%   reduce 12%
  23:21:43  map 97%   reduce 15%
  23:21:47  map 97%   reduce 19%
  23:21:50  map 97%   reduce 21%
  23:21:52  map 97%   reduce 26%
  23:21:57  map 97%   reduce 31%
  23:21:58  map 97%   reduce 32%
  23:23:46  map 100%  reduce 32%
  23:24:54  map 100%  reduce 46%
  23:24:55  map 100%  reduce 86%
  23:24:56  map 100%  reduce 100%
- 3 minutes waiting for the last map tasks to complete (map sits at 97% from 23:20:40 until 23:23:46)
20. Typical Iteration vs. Slow Iteration
(same comparison as the previous slide)
- 3 minutes waiting for the last map tasks to complete
- Suggestions? (Doesn't Hadoop replicate map tasks to avoid this?)
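One suggestion from our side (not from the talk, and dependent on the Hadoop version and site configuration): the feature that re-runs slow map tasks on other nodes is speculative execution, and in older Hadoop releases it can be toggled per streaming job with a jobconf property, e.g.

  -jobconf mapred.map.tasks.speculative.execution=true

If speculative execution is disabled for the cluster, straggling map tasks do not get backup copies.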
21. Questions?