1. High Level Abstractions for Data-Intensive Computing
Christopher Moretti, Hoang Bui, Brandon Rich, and Douglas Thain
University of Notre Dame
2. "Computing's central challenge, how not to make a mess of it, has not yet been met." - Edsger Dijkstra
3. Overview

- Many systems today give end users access to hundreds or thousands of CPUs.
- But it is far too easy for the naive user to create a big mess in the process.
- Our solution: deploy high-level abstractions that describe both data and computation needs.
- Some examples of current work:
  - All-Pairs: an abstraction for biometric workloads.
  - Distributed Ensemble Classification.
  - DataLab: a system and language for data-parallel computation.
4. Three Examples of Work at Notre Dame

(Figure: an All-Pairs result matrix, with elements S1[1..7] labeling the rows and S2[1..7] labeling the columns.)
5. Distributed Computing is Hard!

- What is Condor?
- Which resources?
- How many?
- What happens when things fail?
- How do I fit my workload into jobs?
- How long will it take?
- What about job input data?
- What do I do with the results?
- How can I measure job stats?
6. Distributed Computing is Hard!

(The same questions as the previous slide, now crowding in all at once: ARGH!)
7. Domain Experts are not Distributed Experts

(Figure: a domain scientist facing Clouds, Clusters, the OSG, and the TeraGrid.)
8. Abstractions: Compiler

    #include <iostream.h>

    int main() {
        int i, j;
        for (i = 0; i < 100; i++)
            for (j = 0; j < 100; j++)
                cout << i * j << endl;
    }

(Figure: the compiler turns this source into MyProg.exe, linked against glibc.)
9. Abstractions: Map-Reduce

Sample application: identify all unique nouns and verbs in 1M documents.

(Figure: map tasks read documents and turn (file, word) inputs into (word, count) intermediates; reduce tasks merge the intermediates into (word, count) outputs, yielding the unique nouns and unique verbs.)
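To make that dataflow concrete, here is a minimal single-process sketch of the map, group-by-key, and reduce phases from the figure. It counts plain words rather than tagged nouns and verbs, and the tiny in-memory corpus is an illustrative stand-in, not the actual 1M-document workload or a real distributed runtime.

    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>
    #include <utility>
    #include <vector>

    // map: one document in, a list of (word, 1) pairs out.
    std::vector<std::pair<std::string,int>> map_doc(const std::string &doc) {
        std::vector<std::pair<std::string,int>> out;
        std::istringstream words(doc);
        std::string w;
        while (words >> w)
            out.push_back({w, 1});               // emit (word, 1)
        return out;
    }

    // reduce: all counts for one word in, a single total out.
    int reduce_word(const std::vector<int> &counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    int main() {
        std::vector<std::string> docs = {"run dog run", "dog bites man"};
        std::map<std::string, std::vector<int>> groups;  // the shuffle: group by key
        for (const auto &d : docs)
            for (const auto &kv : map_doc(d))
                groups[kv.first].push_back(kv.second);
        for (const auto &g : groups)                     // one reduce per unique word
            std::cout << g.first << " " << reduce_word(g.second) << "\n";
    }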
10. Abstractions: Map-Reduce

- Map-Reduce is a distributed abstraction that encapsulates the data and computation needs of a workload.
- So can Map-Reduce solve an All-Pairs problem? Not efficiently.
- AllPairs(A,B,F) → Map(F,S) with S = ((A1,B1), (A1,B2), ...)
- The result is a large workload with one job per comparison, with no attempt to run computations where the data lies or to prestage data to the location at which it will be used.
- This is our motivating (bad!) example; the sketch below puts rough numbers on it.
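As a rough, illustrative cost model: expressing All-Pairs as a plain Map over the materialized cross product S means one job per pair, each fetching both of its inputs over the network. The set size and item size below come from the next slide; everything else is an assumption for the sake of the estimate.

    #include <iostream>

    int main() {
        unsigned long long n = 60000;               // items per set (next slide)
        unsigned long long item_bytes = 20 * 1024;  // 20 KB per image
        unsigned long long pairs = n * n;           // one map job per pair: 3.6e9
        // With no locality or prestaging, each job pulls both of its
        // inputs, so total data moved is pairs * 2 * 20 KB, versus the
        // ~1.2 GB footprint of the set itself.
        double moved_tb = double(pairs) * 2 * double(item_bytes) / 1e12;
        std::cout << pairs << " jobs, ~" << moved_tb << " TB moved\n";
    }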
11. The All-Pairs Problem

All-Pairs(Set S1, Set S2, Function F) yields a matrix M where M[i][j] = F(S1[i], S2[j]).

- 60K images of 20KB each: over 1GB of input
- 3.6B comparisons
- at 50 comparisons/s: 2.3 CPU-years
- at 8B per result: 29GB of output
(Figure: the S1 × S2 result matrix from slide 4 again; a sequential sketch of the computation follows.)
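The definition above as a direct sequential loop, with a trivial stand-in for the user's comparison function F. At 50 comparisons per second, the 3.6e9 cells of a 60,000 × 60,000 matrix take about 7.2e7 seconds, which is where the 2.3 CPU-years above comes from.

    #include <iostream>
    #include <vector>

    // Stand-in for the user's comparison function (e.g. face similarity).
    double F(const std::vector<unsigned char> &a,
             const std::vector<unsigned char> &b) {
        return a == b ? 1.0 : 0.0;
    }

    // M[i][j] = F(S1[i], S2[j]) for every pair of elements.
    std::vector<std::vector<double>>
    all_pairs(const std::vector<std::vector<unsigned char>> &S1,
              const std::vector<std::vector<unsigned char>> &S2) {
        std::vector<std::vector<double>> M(S1.size(),
                                           std::vector<double>(S2.size()));
        for (size_t i = 0; i < S1.size(); i++)
            for (size_t j = 0; j < S2.size(); j++)
                M[i][j] = F(S1[i], S2[j]);      // one comparison per cell
        return M;
    }

    int main() {
        std::vector<std::vector<unsigned char>> S = {{1,2}, {1,2}, {3,4}};
        std::vector<std::vector<double>> M = all_pairs(S, S);
        std::cout << M[0][1] << "\n";           // 1: identical items
    }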
12. Biometric All-Pairs Comparison

(Figure: an upper-triangular matrix of pairwise similarity scores between face images, with 1 on the diagonal and off-diagonal values such as 0.8, 0.7, and 0.1.)
13. Naïve Mistakes

Computing problem: even expert users don't know how to tune jobs optimally, and can make 100 CPUs even slower than one by overloading the file server, the network, or the resource manager.
14. Consequences of Naïve Mistakes
15. All-Pairs Abstraction

(Figure: the user supplies a binary function F and a set S of files; a single invocation M = AllPairs(F,S) produces the whole result matrix.)
16. Avoiding Consequences of Naïveté

Approach: create data-intensive abstractions that allow the system at runtime to distribute data, partition jobs, exploit locality, and hide errors. All-Pairs(F,S) = F(Si,Sj) for all elements in S.

(Figure: the user hands "Here is F(x,y). Here is set S." to the All-Pairs portal, which distributes files from the file server to the CPUs by spanning tree and supplies additional fault tolerance.)
17. All-Pairs Production System at Notre Dame

300 active storage units: 500 CPUs and 40TB of disk.

1. Upload F and S into the web portal.
2. Invoke AllPairs(F,S).
3. O(log n) distribution by spanning tree (sketched below).
4. Choose an optimal partitioning and submit batch jobs.
5. Collect and assemble results.
6. Return the result matrix to the user.

(Figure: the web portal and All-Pairs engine fan functions F, G, H and sets S, T out across the active storage cluster.)
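A minimal sketch of the spanning-tree distribution in step 3: every host that already holds the file pushes it to one new host per round, so the number of copies doubles each round and n hosts are covered in about log2(n) rounds. The transfer() helper is a hypothetical stand-in for an actual file push between hosts.

    #include <iostream>
    #include <string>
    #include <vector>

    void transfer(const std::string &src, const std::string &dst) {
        std::cout << src << " -> " << dst << "\n";   // stand-in for a real copy
    }

    void spanning_tree_distribute(const std::vector<std::string> &hosts) {
        std::vector<std::string> have = {hosts[0]};  // the portal's file server
        size_t next = 1;
        while (next < hosts.size()) {                // one parallel round per pass
            size_t senders = have.size();
            for (size_t s = 0; s < senders && next < hosts.size(); s++, next++) {
                transfer(have[s], hosts[next]);
                have.push_back(hosts[next]);         // receiver sends next round
            }
        }                                            // rounds taken: ceil(log2 n)
    }

    int main() {
        spanning_tree_distribute({"server", "n1", "n2", "n3",
                                  "n4", "n5", "n6", "n7"});
    }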
(Slides 18-20: images only, no transcript.)
21. Returning the Result Matrix

(Figure: sample result cells such as 4.37, 6.01, 2.22, and a completed row 4.37 7.13 8.94 6.72 1.34 0.98.)

- Too many files: hard to do prefetching.
- Too large files: must scan the entire file.
- Row/column ordered: how can we build it?
22. Result Storage by Abstraction

- chirp_array lets users create, manage, and modify large arrays without having to deal with the underlying representation.
- Operations on a chirp_array (illustrated in the sketch after this list):
  - create a chirp_array
  - open a chirp_array
  - set value A[i][j]
  - get value A[i][j]
  - get row A[i]
  - get column A[j]
  - set row A[i]
  - set column A[j]
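A hedged, in-memory sketch of the operation set above. The real chirp_array stripes the array across remote chirp servers and their caches; this stand-in class only illustrates the interface against a local row-major buffer (create is the constructor, and open is omitted since it needs a real server).

    #include <algorithm>
    #include <iostream>
    #include <vector>

    class ChirpArray {
        int rows_, cols_;
        std::vector<double> data_;               // row-major cells
    public:
        ChirpArray(int rows, int cols)           // create a chirp_array
            : rows_(rows), cols_(cols), data_(size_t(rows) * cols) {}
        void set(int i, int j, double v) { data_[size_t(i) * cols_ + j] = v; }
        double get(int i, int j) const { return data_[size_t(i) * cols_ + j]; }
        std::vector<double> get_row(int i) const {
            return std::vector<double>(data_.begin() + size_t(i) * cols_,
                                       data_.begin() + size_t(i + 1) * cols_);
        }
        void set_row(int i, const std::vector<double> &r) {
            std::copy(r.begin(), r.end(), data_.begin() + size_t(i) * cols_);
        }
        std::vector<double> get_col(int j) const {
            std::vector<double> c(rows_);
            for (int i = 0; i < rows_; i++) c[i] = get(i, j);
            return c;
        }
        void set_col(int j, const std::vector<double> &c) {
            for (int i = 0; i < rows_; i++) set(i, j, c[i]);
        }
    };

    int main() {
        ChirpArray A(2, 3);
        A.set(1, 2, 4.37);                       // set value A[1][2]
        std::cout << A.get(1, 2) << "\n";        // get value A[1][2]
    }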
23. Result Storage with chirp_array

(Figure, repeated across slides 23-25 as an animation: CPUs write their results through a cache onto the disks of the storage nodes.)
26. Data Mining on Large Data Sets

Problem: supercomputers are expensive, so not all scientists have access to them for very large memory problems. Classification on large data sets without sufficient memory can degrade throughput, degrade accuracy, or fail outright.
27. Data Mining Using Ensembles

(Figure from Steinhaeuser and Chawla, 2007.)

28. Data Mining Using Ensembles

(Second figure from Steinhaeuser and Chawla, 2007.)
29. Abstraction for Ensembles Using Natural Parallelism

(Figure: the user presents "Here are my algorithms. Here is my data set. Here is my test set." The abstraction engine chooses an optimal partitioning and submits batch jobs across the CPUs; each partition returns local votes for tabulation and a final prediction, as sketched below.)
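A small sketch of that tabulation step: each partition's classifier contributes one vote per test example, and the engine returns the majority label. Integer class labels and the tabulate() helper are illustrative choices, not the actual system's interface.

    #include <iostream>
    #include <map>
    #include <vector>

    // local_votes[k][i] = classifier k's predicted label for test example i.
    std::vector<int> tabulate(const std::vector<std::vector<int>> &local_votes) {
        size_t n = local_votes.at(0).size();     // test-set size
        std::vector<int> prediction(n);
        for (size_t i = 0; i < n; i++) {
            std::map<int,int> tally;             // label -> vote count
            for (const auto &votes : local_votes)
                tally[votes[i]]++;
            int best = 0, best_count = -1;
            for (const auto &t : tally)
                if (t.second > best_count) { best = t.first; best_count = t.second; }
            prediction[i] = best;                // majority label wins
        }
        return prediction;
    }

    int main() {
        // Three classifiers' local votes on four test examples.
        std::vector<std::vector<int>> votes = {{0,1,1,0}, {0,1,0,0}, {1,1,1,0}};
        for (int p : tabulate(votes)) std::cout << p << " ";
        std::cout << "\n";                       // expected: 0 1 1 0
    }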
30. DataLab Abstractions

(Figure: ordinary tools such as tcsh, emacs, and perl reach the system through parrot as a file system; DataLab provides distributed data structures — a set S holding files A, B, C, and a file F — and function evaluation Y = F(X) on top of a pool of chirp servers, each exporting its local Unix filesystem, coordinated through job_start, job_commit, job_wait, and job_remove.)
31. DataLab Language Syntax

    apply F on S into T
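A hedged sketch of what "apply F on S into T" could mean operationally, combining this statement with the previous slide: for each file X in set S, a job evaluating F runs on the chirp server holding X, and the output Y lands in set T. The Job struct and the signatures below are hypothetical stand-ins for the job_start / job_commit / job_wait / job_remove operations named earlier; the real DataLab protocol is not shown in the slides.

    #include <iostream>
    #include <string>
    #include <utility>
    #include <vector>

    struct Job { int id; std::string server; };

    // Hypothetical stand-ins for the job operations on the previous slide.
    Job job_start(const std::string &server, const std::string &func,
                  const std::string &input) {
        static int next_id = 0;
        std::cout << "start " << func << "(" << input << ") on " << server << "\n";
        return Job{++next_id, server};
    }
    void job_commit(const Job &) {}              // let the job begin running
    int  job_wait(const Job &) { return 0; }     // block until it completes
    void job_remove(const Job &) {}              // clean up job state

    // apply F on S into T: run F where each element of S lives.
    void apply(const std::string &F,
               const std::vector<std::pair<std::string,std::string>> &S,
               const std::string &T) {
        std::vector<Job> jobs;
        for (const auto &x : S)                  // x = (server, file)
            jobs.push_back(job_start(x.first, F, x.second));
        for (const auto &j : jobs) job_commit(j);
        for (const auto &j : jobs) { job_wait(j); job_remove(j); }
        (void)T;  // each server writes its Y = F(X) into set T (not shown)
    }

    int main() {
        apply("F", {{"server1", "A"}, {"server2", "B"}}, "T");
    }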
32. For More Information

- Christopher Moretti: cmoretti@cse.nd.edu
- Douglas Thain: dthain@cse.nd.edu
- Cooperative Computing Lab: http://cse.nd.edu/~ccl