Title: DryadLINQ: Computer Vision (among other things) on a cluster
1. DryadLINQ: Computer Vision (among other things) on a cluster
- ECCV AC workshop 14th June, 2008
Michael Isard Microsoft Research, Silicon Valley
2. Parallel programming, yada yada
- Intel claims we will all have many-core, etc.
- "This algorithm is easily parallelizable"
- Not "we implemented a parallel version"
- Historically, low-latency fine-grain parallelism
- Shared-memory SMP (threads, locks, etc.)
- MPI (finite-element analysis, etc.)
- But also data-parallel!
- We have lots of data now (video, the web)
- But most people still use their laptops/toy data
- Even big systems use tens of computers
3. Why do people use Matlab?
- Parallel programming tedious and complex
- Distributed programming even worse
- Perl scripts, manual management of data, ...
- Matlab is easy (or at least popular)
- Relatively few high-level constructs
- System does the right thing
- Programmers willing to put up with a lot
- We want similarly low barrier to entry
- Familiar languages, legacy codebase, etc.
4. What are we doing?
- When single-computer processing runs out of steam
- Web-scale processing of terabytes of data
- Infeasible without a big cluster
- Network log-mining, machine learning
- Multi-week job → 4 hours on 250 computers
- 1-hour iteration → 3.5 minutes on 4 computers
5. A typical data-intensive query
    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);
    var user =
        from access in logentries
        where access.user.EndsWith(@"\ulfar")
        select access;
    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("ulfar", pages.Key, pages.Count());
    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;
6. Steps in the query
- Go through logs and keep only lines that are not comments. Parse each line into a LogEntry object.
- Go through logentries and keep only entries that are accesses by ulfar.
- Group ulfar's accesses according to what page they correspond to. For each page, count the occurrences.
- Sort the pages ulfar has accessed according to access frequency.
7. Serial execution
- For each line in logs, do...
- For each entry in logentries, do...
- Sort entries in user by page. Then iterate over the sorted list, counting the occurrences of each page as you go.
- Re-sort entries in accesses by page frequency.
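The sort-then-count step of the serial plan can be sketched with ordinary LINQ to Objects on one machine. This is only an illustration, not DryadLINQ code: the class name `SerialPlan`, the method `CountByPage`, and the sample page names are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SerialPlan
{
    // Sort the pages, then count runs of equal pages in a single pass,
    // and finally re-sort the (page, count) pairs by frequency.
    public static List<KeyValuePair<string, int>> CountByPage(IEnumerable<string> pages)
    {
        var sorted = pages.OrderBy(p => p, StringComparer.Ordinal).ToList();
        var counts = new List<KeyValuePair<string, int>>();
        int i = 0;
        while (i < sorted.Count)
        {
            // Advance j past the run of entries equal to sorted[i].
            int j = i;
            while (j < sorted.Count && sorted[j] == sorted[i]) j++;
            counts.Add(new KeyValuePair<string, int>(sorted[i], j - i));
            i = j;
        }
        // Most frequently accessed page first.
        return counts.OrderByDescending(c => c.Value).ToList();
    }

    public static void Main()
    {
        // Hypothetical pages accessed by one user.
        var pages = new[] { "b.htm", "a.htm", "b.htm", "c.htm", "b.htm", "a.htm" };
        foreach (var c in CountByPage(pages))
            Console.WriteLine(c.Key + " " + c.Value);
    }
}
```

Every step here touches the whole dataset on one machine, which is the cost the parallel plan is meant to avoid.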
8. Parallel execution
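One plausible shape for a parallel plan of the same query is: count pages inside each data partition independently, merge the partial counts, then sort. The sketch below uses PLINQ's `AsParallel` as a stand-in for separate cluster machines. It is an assumption about the general technique, not the plan DryadLINQ actually generates, and `ParallelPlan` and its sample data are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ParallelPlan
{
    // Each inner array is one partition of page names.
    public static List<KeyValuePair<string, int>> CountByPage(IEnumerable<string[]> partitions)
    {
        // Stage 1: count pages within each partition, independently of the others.
        var partials = partitions
            .AsParallel()
            .Select(part => part.GroupBy(p => p)
                                .ToDictionary(g => g.Key, g => g.Count()))
            .ToList();

        // Stage 2: merge the partial counts that refer to the same page.
        var merged = partials
            .SelectMany(d => d)
            .GroupBy(kv => kv.Key)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Sum(kv => kv.Value)));

        // Stage 3: sort by frequency, most frequent first.
        return merged.OrderByDescending(kv => kv.Value).ToList();
    }

    public static void Main()
    {
        // Two hypothetical partitions of one user's page accesses.
        var partitions = new[]
        {
            new[] { "a.htm", "b.htm" },
            new[] { "b.htm", "c.htm", "b.htm" },
        };
        foreach (var c in CountByPage(partitions))
            Console.WriteLine(c.Key + " " + c.Value);
    }
}
```

Only the small per-partition dictionaries cross partition boundaries in stage 2, which is what makes this shape attractive when the raw logs are large.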
9. Linear Regression
    Vectors x = input(0), y = input(1);
    Matrices xx = x.PairwiseOuterProduct(x);
    OneMatrix xxs = xx.Sum();
    Matrices yx = y.PairwiseOuterProduct(x);
    OneMatrix yxs = yx.Sum();
    OneMatrix xxinv = xxs.Map(a => a.Inverse());
    OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));
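Read as math, the code above appears to compute the ordinary least-squares solution. Assuming `x` and `y` hold column vectors $x_i$ and $y_i$ paired by index, and that `Sum` adds over all pairs (both are readings of the slide, not stated on it), the result is:

```latex
A \;=\; \Bigl(\sum_i y_i\, x_i^{\top}\Bigr)\Bigl(\sum_i x_i\, x_i^{\top}\Bigr)^{-1},
\qquad\text{which minimizes}\qquad
\sum_i \bigl\lVert y_i - A\,x_i \bigr\rVert^{2}.
```

The two sums correspond to `yxs` and `xxs`. Each is a sum of per-record outer products, so it can be accumulated partition by partition before the single inverse and multiply.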
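10. Execution Graph
[Figure: execution graph for the linear regression. Node labels: inputs X0, X1, X2 and Y0, Y1, Y2; three X·Xᵀ vertices and three Y·Xᵀ vertices; two sum (Σ) vertices; one inverse vertex; output A.]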
11. DryadLINQ
- Programmer writes sequential C# code
- Rich type system, libraries, modules, loops
- System can figure out data-parallelism
- Sees declarative expression plans
- Full control of high-level optimizations
- Traditional parallel-database tricks
12. Dryad execution engine
Andrew Birrell, Mihai Budiu, Dennis Fetterly, Michael Isard, Yuan Yu
- General-purpose execution environment for distributed, data-parallel applications
- Concentrates on throughput not latency
- Assumes private data center
- Automatic management of scheduling, distribution, fault tolerance, etc.
- Well tested over two years on clusters of thousands of computers
13. Job = Directed Acyclic Graph
[Figure: a job drawn as a directed acyclic graph, labeled with inputs, processing vertices, channels (file, pipe, shared memory), and outputs.]
14. Scheduler state machine
- Scheduling a DAG
- Vertex can run anywhere once all its inputs are ready
- Constraints/hints place it near its inputs
- Fault tolerance
- If A fails, run it again
- If A's inputs are gone, run upstream vertices again (recursively)
- If A is slow, run another copy elsewhere and use output from whichever finishes first
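The first two rules above (run a vertex once all its inputs are ready, and run it again if it fails) can be sketched as a small single-process loop. This is a toy model under stated assumptions, not Dryad's scheduler: `DagScheduler`, `Schedule`, and the `maxAttempts` limit are hypothetical, and the sketch leaves out placement hints, re-running upstream vertices, and duplicate execution of slow vertices.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DagScheduler
{
    // deps maps each vertex to the vertices whose outputs it reads.
    // run executes one vertex and returns false if that attempt failed.
    // Returns the order in which vertices completed.
    public static List<string> Schedule(
        Dictionary<string, string[]> deps, Func<string, bool> run, int maxAttempts = 3)
    {
        var done = new List<string>();
        var pending = new HashSet<string>(deps.Keys);
        while (pending.Count > 0)
        {
            // A vertex is ready once all of its inputs have been produced.
            var ready = pending.Where(v => deps[v].All(d => done.Contains(d))).ToList();
            if (ready.Count == 0)
                throw new InvalidOperationException("cycle or missing input");
            foreach (var v in ready)
            {
                int failures = 0;
                // "If A fails, run it again", giving up after maxAttempts failures.
                while (!run(v))
                    if (++failures >= maxAttempts)
                        throw new InvalidOperationException(v + " keeps failing");
                done.Add(v);
                pending.Remove(v);
            }
        }
        return done;
    }

    public static void Main()
    {
        // Hypothetical three-stage job: read -> count -> sort.
        var deps = new Dictionary<string, string[]>
        {
            { "read", new string[0] },
            { "count", new[] { "read" } },
            { "sort", new[] { "count" } },
        };
        bool failedOnce = false;
        // Simulate one transient failure of the "count" vertex.
        Func<string, bool> run = v =>
        {
            if (v == "count" && !failedOnce) { failedOnce = true; return false; }
            return true;
        };
        Console.WriteLine(string.Join(" -> ", Schedule(deps, run)));
    }
}
```

Re-execution is only safe here because a vertex is modeled as a function of its inputs. The same property is what would let a scheduler re-run upstream vertices when intermediate data is lost.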
15. Static/dynamic optimizations
- Static optimizer builds execution graph
- Dynamic optimizer mutates running graph
- Picks number of partitions when size is known
- Builds aggregation trees based on locality
16. LINQ
- Constructs/type system in .NET v3.5
- Operators to manipulate datasets
- Data elements are arbitrary .NET types
- Traditional relational operators
- Select, Join, Aggregate, etc.
- Extensible
- Add new operators
- Add new implementations
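In plain LINQ to Objects, adding a new operator amounts to writing an extension method on `IEnumerable<T>`. The example below is a minimal sketch: `EveryNth` is a made-up operator used only for illustration, and nothing here shows how DryadLINQ itself registers distributed implementations.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MyOperators
{
    // A hypothetical new operator: keep the elements at positions 0, n, 2n, ...
    public static IEnumerable<T> EveryNth<T>(this IEnumerable<T> source, int n)
    {
        if (n <= 0) throw new ArgumentOutOfRangeException(nameof(n));
        return source.Where((item, index) => index % n == 0);
    }
}

public static class OperatorDemo
{
    public static void Main()
    {
        // The new operator composes with the built-in ones.
        var sampled = Enumerable.Range(0, 10).EveryNth(3).Select(v => v * v);
        Console.WriteLine(string.Join(",", sampled)); // prints 0,9,36,81
    }
}
```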
17. DryadLINQ
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey
- Automatically distribute a LINQ program
- Few Dryad-specific extensions
- Same source program runs on single-core through multi-core up to cluster
18. A complete DryadLINQ program
    public class LogEntry {
        public string user;
        public string ip;
        public string page;
        public LogEntry(string line) {
            string[] fields = line.Split(' ');
            this.user = fields[8];
            this.ip = fields[9];
            this.page = fields[5];
        }
    }
    public class UserPageCount {
        public string user;
        public string page;
        public int count;
        public UserPageCount(string user, string page, int count) {
            this.user = user;
            this.page = page;
            this.count = count;
        }
    }
    DryadDataContext ddc = new DryadDataContext("fs://logfile");
    DryadTable<string> logs = ddc.GetTable<string>();
    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);
    var user =
        from access in logentries
        where access.user.EndsWith(@"\ulfar")
        select access;
    var accesses =
        from access in user
        group access by access.page into pages
        select new UserPageCount("ulfar", pages.Key, pages.Count());
    var htmAccesses =
        from access in accesses
        where access.page.EndsWith(".htm")
        orderby access.count descending
        select access;
    htmAccesses.ToDryadTable("fs://results");
19. DryadLINQ: From LINQ to Dryad
- Automatic query plan generation
- Distributed query execution by Dryad
[Figure: a LINQ query is turned into a query plan (logs, where, select), which Dryad executes.]
LINQ query shown in the figure:
    var logentries =
        from line in logs
        where !line.StartsWith("#")
        select new LogEntry(line);
20. How does it work?
- Sequential code operates on datasets
- But really just builds an expression graph
- Lazy evaluation
- When a result is retrieved
- Entire graph is handed to DryadLINQ
- Optimizer builds efficient DAG
- Program is executed on cluster
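Standard LINQ shows the same behaviour in miniature: operators on an `IQueryable` only build an expression tree, and nothing runs until a result is requested. The sketch below uses the in-memory `AsQueryable` provider as a stand-in. `LazyDemo`, `Touch`, and the counter are hypothetical helpers that make the evaluation visible, and no DryadLINQ API is involved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyDemo
{
    // Counts how many elements the query has actually inspected.
    public static int Evaluations = 0;

    public static int Touch(int x) { Evaluations++; return x; }

    public static IQueryable<int> BuildQuery(IEnumerable<int> data)
    {
        // Nothing executes here: Where and Select only extend the expression tree.
        return data.AsQueryable()
                   .Where(x => Touch(x) > 2)
                   .Select(x => x * 10);
    }

    public static void Main()
    {
        var q = BuildQuery(new[] { 1, 2, 3, 4 });
        Console.WriteLine(q.Expression);   // the query represented as data
        Console.WriteLine(Evaluations);    // still 0: nothing has run yet
        var result = q.ToList();           // retrieving the result runs the query
        Console.WriteLine(string.Join(",", result) + " after " + Evaluations + " evaluations");
    }
}
```

Because the whole query is available as data before anything runs, a provider is free to inspect and rewrite it, which is the hook the slide describes.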
21. Terasort
- 10 billion 100-byte records (10^12 bytes)
- 240 computers, 960 disks
- 349 secs
- Comparable with record
    public struct TeraRecord : IComparable<TeraRecord> {
        public const int RecordSize = 100;
        public const int KeySize = 10;
        public byte[] content;
        // Compare records by their first KeySize bytes.
        public int CompareTo(TeraRecord rec) {
            for (int i = 0; i < KeySize; i++) {
                int cmp = this.content[i] - rec.content[i];
                if (cmp != 0) return cmp;
            }
            return 0;
        }
        public static TeraRecord Read(DryadBinaryReader rd) {
            TeraRecord rec;
            rec.content = rd.ReadBytes(RecordSize);
            return rec;
        }
        public static int Write(DryadBinaryWriter wr, TeraRecord rec) {
            return wr.WriteBytes(rec.content);
        }
    }
    class Terasort {
        public static void Main(string[] args) {
            DryadDataContext ddc = new DryadDataContext(@"file://\\svc-yuanbyu-00\dryad\terasort");
            DryadTable<TeraRecord> records = ddc.GetPartitionedTable<TeraRecord>("sherwood-sort2.pt");
            var q = records.OrderBy(x => x);
            q.ToDryadPartitionedTable("sherwood-sort2.pt");
        }
    }
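A common way to spread a sort such as `OrderBy` over many machines is range partitioning: send each key to the partition that owns its key range, sort every partition independently, and concatenate the partitions in order. The slides do not say how the sort is implemented, so the sketch below is an assumption about the general technique. `RangeSort`, the integer keys, and the hand-picked boundaries are hypothetical, and PLINQ stands in for the cluster.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangeSort
{
    // boundaries must be ascending. Partition i holds keys below boundaries[i]
    // (and not below boundaries[i-1]); the last partition holds the rest.
    public static List<int> Sort(IEnumerable<int> keys, int[] boundaries)
    {
        var partitions = Enumerable.Range(0, boundaries.Length + 1)
                                   .Select(_ => new List<int>())
                                   .ToArray();
        foreach (int k in keys)
        {
            int p = 0;
            while (p < boundaries.Length && k >= boundaries[p]) p++;
            partitions[p].Add(k);
        }
        // Each partition is sorted on its own; AsOrdered keeps partition order.
        var sortedParts = partitions.AsParallel().AsOrdered()
                                    .Select(part => part.OrderBy(k => k).ToList())
                                    .ToList();
        // Partitions cover increasing key ranges, so concatenation is sorted.
        return sortedParts.SelectMany(part => part).ToList();
    }

    public static void Main()
    {
        var sorted = Sort(new[] { 5, 1, 9, 3, 7, 2 }, new[] { 3, 6 });
        Console.WriteLine(string.Join(",", sorted)); // prints 1,2,3,5,7,9
    }
}
```

In practice the boundaries would have to be chosen so that partitions come out roughly equal in size, for example by sampling the keys first.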
22. Machine Learning in DryadLINQ
Kannan Achan, Mihai Budiu
[Figure: software stack with layers labeled Data analysis, Machine learning, Large Vector, DryadLINQ, and Dryad.]
23. Linear Regression Code
    Vectors x = input(0), y = input(1);
    Matrices xx = x.PairwiseOuterProduct(x);
    OneMatrix xxs = xx.Sum();
    Matrices yx = y.PairwiseOuterProduct(x);
    OneMatrix yxs = yx.Sum();
    OneMatrix xxinv = xxs.Map(a => a.Inverse());
    OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));
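The same four steps (outer products, sums, inverse, multiply) can be checked on a single machine for the small case of 2-D inputs and scalar outputs. This sketch does not use the Large Vector library from the talk: `Regression.Fit` and the sample points are hypothetical, and the 2x2 inverse is written out by hand.

```csharp
using System;

public static class Regression
{
    // Fits y ≈ A·x for 2-D inputs x and scalar outputs y.
    // Returns A as a 1x2 row vector.
    public static double[] Fit(double[][] xs, double[] ys)
    {
        // xxs = sum of x·xᵀ = [[a, b], [c, d]];  yxs = sum of y·xᵀ = [p, q]
        double a = 0, b = 0, c = 0, d = 0, p = 0, q = 0;
        for (int i = 0; i < xs.Length; i++)
        {
            double x0 = xs[i][0], x1 = xs[i][1], y = ys[i];
            a += x0 * x0; b += x0 * x1;
            c += x1 * x0; d += x1 * x1;
            p += y * x0;  q += y * x1;
        }
        double det = a * d - b * c;
        if (Math.Abs(det) < 1e-12)
            throw new InvalidOperationException("sum of outer products is singular");
        // xxinv = (1/det)·[[d, -b], [-c, a]];  A = yxs·xxinv
        return new[] { (p * d - q * c) / det, (q * a - p * b) / det };
    }

    public static void Main()
    {
        // Points generated from y = 2·x0 + 3·x1, so the fit should recover (2, 3).
        var xs = new[] { new[] { 1.0, 0.0 }, new[] { 0.0, 1.0 }, new[] { 1.0, 1.0 } };
        var ys = new[] { 2.0, 3.0, 5.0 };
        var A = Fit(xs, ys);
        Console.WriteLine(A[0] + " " + A[1]);
    }
}
```

The loop body only accumulates sums, so it could run separately on each partition of the data with the six partial sums added afterwards. That is the structure the execution graph slide exploits.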
24. Expectation Maximization
- 160 lines
- 3 iterations shown
25. Computer vision
- Ongoing
- Epitomes, features for image search, ...
- Anecdotal evidence
- Nebojsa Jojic, Anitha Kannan
- Tutorial from Mihai
- Anitha implemented Probabilistic Image Map algorithm in an afternoon
26. Continuing research
- Application-level research
- What can we write with DryadLINQ?
- System-level research
- Performance, usability, etc.
- Lots of interest from learning/vision researchers