Title: Introduction to CS739: Distributed Systems
1. Introduction to CS739: Distributed Systems
UNIVERSITY of WISCONSIN-MADISON, Computer Sciences Department
CS 739: Distributed Systems
Andrea C. Arpaci-Dusseau
What are distributed systems? What are the benefits and challenges?
How will CS739 be structured? Readings, Write-ups, Presentations, Projects
2. Goals of Course
- Learn about challenges and existing techniques for building distributed systems and services
- Read and discuss influential papers from SOSP, OSDI, NSDI
- Gain some experience programming in a distributed environment
  - Warm-up project
  - Final project
3. What is a Distributed System?
- Leslie Lamport says: You know you have one when the crash of a computer you never heard of stops you from doing any work
- More technical definition: Collection of independent computers that appears to its users as a single coherent system
- How are parallel, distributed, networked systems different?
  - All contain nodes (processing, memory, disk) connected with network
  - [Figure: spectrum from more unified to less unified: parallel, distributed, networked]
- Consider distributed services as well
4. Benefits of Distributed Systems
- Great price/performance
  - Leverage commodity components (nodes and networks)
  - Use many, many of them
- Incremental scalability
  - Can add x new nodes (or disks or memory) to improve performance by x
- Improved availability
  - Continue operating when some nodes stop working
- Improved reliability
  - Deliver correct results when some nodes misbehave or corrupt data
- Allow geographically-distributed individuals to share data or cooperate
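The availability benefit can be made concrete with a small calculation. This is a minimal sketch under strong assumptions that are not from the slides: each node is up independently with the same probability, and the service needs only one live replica to keep operating. The function name `availability` and the 0.99 figure are hypothetical, chosen for illustration; real failures are often correlated, so treat the numbers as an upper bound.

```python
def availability(p_node_up: float, replicas: int) -> float:
    """Probability that at least one of `replicas` nodes is up.

    Assumes (for illustration only) that nodes fail independently and
    that a single live replica is enough to serve requests.
    """
    if not 0.0 <= p_node_up <= 1.0:
        raise ValueError("p_node_up must be a probability in [0, 1]")
    if replicas < 1:
        raise ValueError("need at least one replica")
    # The service is down only if every replica is down at the same time.
    return 1.0 - (1.0 - p_node_up) ** replicas


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} replica(s): {availability(0.99, n):.6f}")
```

Under these assumptions, each added replica multiplies the probability of a total outage by the per-node failure probability, which is why many cheap nodes can be more available than one expensive node.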
5. Distributed System Challenges
- Lack of global state information
  - Different nodes have different views of the system
    - What are the contents of file A?
    - How many jobs are running on node X?
    - Which nodes are currently part of the system?
  - See delays, different ordering of messages, lost messages, network partitions
  - Tension with goal of single coherent system
- Handling slow, failed, and misbehaving nodes
  - How do you avoid slow nodes?
  - How do you get back data or work from a failed node?
  - When nodes disagree, how do you know who is wrong?
  - Tension with goals of availability and reliability
- When is it okay to have some centralized components?
  - Simplifies state management, but single point-of-failure and performance bottleneck
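The "what are the contents of file A?" question can be illustrated with a tiny single-process simulation. This is only a sketch: the `Replica` class, the last-delivered-write-wins rule, and the two sample writes are all hypothetical, not taken from any system covered in the course. It shows two nodes that receive the same two updates in different orders and end up disagreeing.

```python
class Replica:
    """Hypothetical node that applies updates in the order they arrive."""

    def __init__(self, name):
        self.name = name
        self.files = {}

    def deliver(self, update):
        path, contents = update
        # Assumed policy: the most recently delivered write overwrites the file.
        self.files[path] = contents


def deliver_in_different_orders():
    """Send the same two writes to two replicas, reordered by the network."""
    write1 = ("A", "written by client 1")
    write2 = ("A", "written by client 2")

    node1, node2 = Replica("node1"), Replica("node2")
    for update in (write1, write2):   # node1 sees write1 first
        node1.deliver(update)
    for update in (write2, write1):   # node2 sees write2 first
        node2.deliver(update)
    return node1.files["A"], node2.files["A"]


if __name__ == "__main__":
    view1, view2 = deliver_in_different_orders()
    print("node1 thinks file A is:", view1)
    print("node2 thinks file A is:", view2)
```

Neither node has made an error, yet there is no single answer to what file A contains; agreeing on an order for updates is one of the problems the theory lectures address.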
6. Content of 739
- Distributed system courses can be very different
  - Theoretical: distributed algorithms (e.g., to allow nodes to come to consensus or agreement)
    - 4 lectures
  - Practical: distributed programming (e.g., using RPC, Java RMI, CORBA, DCOM, MPI, PVM)
    - Warm-up project
  - Research systems: new ideas for making distributed systems better
    - Focus of course
    - Implemented systems with new conceptual ideas
    - Recent papers in top systems conferences (SOSP, OSDI, NSDI)
7. Learning by Reading
- Intense reading list: assume sophisticated reader (736)
- Usually cover 1 fascinating paper per class
- No exams
- Three types of classes
  - Formal lecture: Only for 4 theory topics
  - Discussions: Most papers
    - I ask questions, expect everyone to enthusiastically participate; fairly casual
    - Task 1: Read paper 2-3 times before class
    - Task 2: Email write-up to me BEFORE class
    - Task 3: Take turns being scribe (about 2 times in semester)
      - Write up notes from discussion in LaTeX
      - Post to web page within 72 hours
8. Learning by Reading (cont.)
- Types of classes (cont.)
  - Group-led lectures: 4 topics
    - Small group gives overview of about 3-4 related papers
    - Topics
      - Distributed system analysis
      - Process migration
      - Programming environments
      - Specialized distributed services
    - Advantages
      - Good practice for giving presentations
      - Learn about topic in slightly more depth
    - Tasks
      - Group
        - Finalize related papers (1 week before)
        - Present to me (2 days before)
        - Use slides
      - Everyone else: Skim papers
    - Handout: State preferences by next week
9. Course Topics and Reading List
- Distributed Operating Systems (Survey, Amoeba vs. Sprite)
- Network File Systems (NFS, Coda, LBFS)
- Theory: Time, Ordering, and Distributed Snapshots (2 Lamport papers)
- Analysis of Distributed Systems (1 Group Presentation)
- Programming Environments (DSM, MapReduce, Group)
- Process Migration (1 Group)
- Specialized Distributed Services (Porcupine, Group)
- SPRING BREAK
- Theory: Consensus (Byzantine failures and fail-stop processors)
- Cluster-based File Systems (Petal/Frangipani and GoogleFS)
- Communication Primitives (RPC vs. U-Net)
- P2P Systems (Measurement, CFS, Amazon, Pangaea, LOCKSS)
- Miscellaneous: Trust, Recovery, Mistakes, Speculation, Sensor Networks
10. Learning by Doing
- Warm-up Project
  - Goal: Become familiar with existing distributed programming environments
  - Examples: Hadoop (open-source MapReduce), MPI, PVM
  - Task 0: Get environment running
  - Task 1: Implement simple application (e.g., sorting)
  - Task 2: Report sufficient numbers to indicate you did something
- Final Project
  - Goal 1: Experience with research process in general
    - Work on open-ended project, unknown result
    - New idea where you don't know if it will work
  - Goal 2: Learn about specific topic in depth
  - Topic from my list or your own choice; work with project partner
  - Deliverables: 20-minute talk, short research paper
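As a rough picture of what the warm-up sorting task might look like, the sketch below imitates a MapReduce-style sort inside a single Python process. It is not the Hadoop API and not an assigned solution: the function names (`map_phase`, `shuffle`, `reduce_phase`, `distributed_sort`) and the hand-picked partition boundaries are assumptions made for illustration. The idea shown is range partitioning: the map step routes each record to a partition by key range, each reducer sorts its own partition, and concatenating the partitions in order yields a globally sorted result.

```python
from bisect import bisect_right
from collections import defaultdict


def map_phase(records, boundaries):
    """Emit (partition_id, record) pairs using range partitioning.

    `boundaries` must be sorted; partition i holds records that fall
    between boundaries[i-1] (inclusive) and boundaries[i] (exclusive).
    """
    for record in records:
        yield bisect_right(boundaries, record), record


def shuffle(pairs):
    """Group records by partition id, standing in for the network shuffle."""
    partitions = defaultdict(list)
    for partition_id, record in pairs:
        partitions[partition_id].append(record)
    return partitions


def reduce_phase(partitions, num_partitions):
    """Sort each partition locally, then concatenate in partition order."""
    output = []
    for partition_id in range(num_partitions):
        output.extend(sorted(partitions.get(partition_id, [])))
    return output


def distributed_sort(records, boundaries):
    """Run the three phases in one process (hypothetical driver)."""
    pairs = map_phase(records, boundaries)
    partitions = shuffle(pairs)
    return reduce_phase(partitions, len(boundaries) + 1)


if __name__ == "__main__":
    data = [42, 7, 19, 3, 25, 10, 10, 88]
    print(distributed_sort(data, boundaries=[10, 20]))
```

In a real environment each partition would live on a different node, and picking boundaries that balance the load (for example by sampling the input) becomes part of the problem.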
11. Agenda for Next Class
- See website: www.cs.wisc.edu/cs739-1
- Read
  - Survey: Distributed Operating Systems, Andrew S. Tanenbaum and Robbert Van Renesse, ACM Computing Surveys, Volume 17, Issue 4 (December 1985), pp. 419-470
  - Long paper: Focus on Sections 1 and 2
- Answer question
  - What were the goals of distributed systems at this time? Which design issue (i.e., communication primitives, naming and protection, resource management, fault tolerance, services) seems most challenging (or interesting)? Why?
  - Email answer to me with Subject: cs739 Survey
- Think about group presentation papers