1
CS252 Graduate Computer Architecture
Lecture 19: Queuing Theory (Cont)
Intro to Multiprocessing
  • John Kubiatowicz
  • Electrical Engineering and Computer Sciences
  • University of California, Berkeley
  • http://www.eecs.berkeley.edu/~kubitron/cs252

2
Recall: Magnetic Disk Characteristic
(Figure: disk platter showing tracks, sectors, and cylinders)
  • Cylinder: all the tracks under the head at a
    given point on all surfaces
  • Read/write data is a three-stage process:
  • Seek time: position the head/arm over the proper
    track (into proper cylinder)
  • Rotational latency: wait for the desired sector
    to rotate under the read/write head
  • Transfer time: transfer a block of bits (sector)
    under the read/write head
  • Disk Latency = Queueing Time + Controller Time +
    Seek Time + Rotation Time + Xfer Time
    (a rough worked sketch of this sum follows below)
  • Highest Bandwidth:
  • transfer large group of blocks sequentially from
    one track
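  • A minimal C sketch of the latency sum above. The component
    times are hypothetical placeholders chosen to add up to the
    20 ms service time used later in this lecture; they are not
    values from the slide:

      #include <stdio.h>

      int main(void) {
          /* Hypothetical component times, in ms (placeholders only). */
          double queueing   = 0.0;   /* assume the queue is empty      */
          double controller = 1.0;
          double seek       = 11.0;
          double rotation   = 4.0;   /* roughly half a revolution      */
          double transfer   = 4.0;
          double latency = queueing + controller + seek + rotation + transfer;
          printf("Disk latency = %.1f ms\n", latency);   /* prints 20.0 ms */
          return 0;
      }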
3
Recall: Introduction to Queuing Theory
  • What about queuing time??
  • Let's apply some queuing theory
  • Queuing Theory applies to long term, steady state
    behavior ⇒ Arrival rate = Departure rate
  • Little's Law: Mean number of tasks in system =
    arrival rate x mean response time
  • Observed by many, Little was first to prove
  • Simple interpretation: you should see the same
    number of tasks in queue when entering as when
    leaving.
  • Applies to any system in equilibrium, as long as
    nothing in the black box is creating or destroying
    tasks
  • Typical queuing theory doesn't deal with
    transient behavior, only steady-state behavior
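  • A quick illustrative check (numbers chosen for this
    transcript, not taken from the slide): if tasks arrive at
    10 per second and the mean response time is 0.1 s, Little's
    Law predicts 10/s x 0.1 s = 1 task in the system on average.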

4
A Little Queuing Theory: Mean Wait Time
  • Parameters that describe our system:
  • λ: mean number of arriving customers/second
  • Tser: mean time to service a customer (m1)
  • C: squared coefficient of variance = σ²/m1²
  • µ: service rate = 1/Tser
  • u: server utilization (0 ≤ u ≤ 1): u = λ/µ = λ × Tser
  • Parameters we wish to compute:
  • Tq: Time spent in queue
  • Lq: Length of queue = λ × Tq (by Little's law)
  • Basic Approach:
  • Customers before us must finish: mean time = Lq ×
    Tser
  • If something at server, takes m1(z) to complete
    on avg
  • Chance server busy = u ⇒ mean time is u × m1(z)
  • Computation of wait time in queue (Tq):
  • Tq = Lq × Tser + u × m1(z)

5
Mean Residual Wait Time: m1(z)
  • Imagine n samples
  • There are n × P(T = x) samples with T = x
  • Total space of those samples: x × n × P(T = x)
  • Total time for n services: n × m1
  • Chance arrive in service of length x:
    x × P(T = x) / m1
  • Avg remaining time if land in service of length x: ½ x
  • Finally, Average Residual Time:
    m1(z) = Σ ½ x · x P(T = x) / m1 = m2 / (2 m1) = ½ m1 (1 + C)

6
A Little Queuing Theory: M/G/1 and M/M/1
  • Computation of wait time in queue (Tq):
    Tq = Lq × Tser + u × m1(z)
    Tq = λ × Tq × Tser + u × m1(z)
    Tq = u × Tq + u × m1(z)
    Tq × (1 - u) = m1(z) × u ⇒ Tq = m1(z) × u/(1 - u)
    Tq = Tser × ½(1 + C) × u/(1 - u)
  • Notice that as u → 1, Tq → ∞!
  • Assumptions so far:
  • System in equilibrium; No limit to the queue;
    works First-In-First-Out
  • Time between two successive arrivals in line is
    random and memoryless (M for memoryless; C = 1:
    exponentially random)
  • Server can start on next customer immediately
    after prior finishes
  • General service distribution (no restrictions), 1
    server
  • Called an M/G/1 queue: Tq = Tser × ½(1 + C) × u/(1 - u)
  • Memoryless service distribution (C = 1)
  • Called an M/M/1 queue: Tq = Tser × u/(1 - u)

7
A Little Queuing Theory: An Example
  • Example Usage Statistics:
  • User requests 10 x 8KB disk I/Os per second
  • Requests & service exponentially distributed
    (C = 1.0)
  • Avg. service = 20 ms (from controller + seek +
    rot + trans)
  • Questions:
  • How utilized is the disk?
  • Ans: server utilization, u = λ × Tser
  • What is the average time spent in the queue?
  • Ans: Tq
  • What is the number of requests in the queue?
  • Ans: Lq
  • What is the avg response time for a disk request?
  • Ans: Tsys = Tq + Tser
  • Computation (reproduced in the C sketch below):
  • λ (avg arriving customers/s) = 10/s
  • Tser (avg time to service customer) = 20 ms
    (0.02 s)
  • u (server utilization) = λ × Tser = 10/s × 0.02 s
    = 0.2
  • Tq (avg time/customer in queue) = Tser × u/(1 - u)
    = 20 × 0.2/(1 - 0.2) = 20 × 0.25 = 5 ms (0.005 s)
  • Lq (avg length of queue) = λ × Tq = 10/s × 0.005 s
    = 0.05
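  • A minimal C sketch that reproduces the computation above
    (the variable names are mine; the inputs and formulas are
    the slide's):

      #include <stdio.h>

      int main(void) {
          double lambda = 10.0;    /* arrival rate: 10 requests/s */
          double t_ser  = 0.020;   /* mean service time: 20 ms    */

          double u     = lambda * t_ser;         /* utilization = 0.2       */
          double t_q   = t_ser * u / (1.0 - u);  /* M/M/1 queue time = 5 ms */
          double l_q   = lambda * t_q;           /* queue length = 0.05     */
          double t_sys = t_q + t_ser;            /* response time = 25 ms   */

          printf("u = %.2f, Tq = %.1f ms, Lq = %.2f, Tsys = %.1f ms\n",
                 u, t_q * 1e3, l_q, t_sys * 1e3);
          return 0;
      }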

8
Use Arrays of Small Disks?
  • Katz and Patterson asked in 1987
  • Can smaller disks be used to close gap in
    performance between disks and CPUs?

(Figure: Conventional designs span 4 disk form factors, 14", 10",
5.25", and 3.5", from high end to low end; a disk array design uses a
single form factor, 3.5")
9
Array Reliability
  • Reliability of N disks = Reliability of 1 Disk ÷ N
  • 50,000 Hours ÷ 70 disks = 700 hours
  • Disk system MTTF: Drops from 6 years to 1
    month!
  • Arrays (without redundancy) too unreliable to
    be useful!

Hot spares support reconstruction in parallel
with access: very high media availability can be
achieved
10
Redundant Arrays of Disks: RAID 1 (Disk
Mirroring/Shadowing)
(Figure: each data disk paired with its shadow in a recovery group)
Each disk is fully duplicated onto its
"shadow": very high availability can be
achieved. Bandwidth sacrifice on write:
logical write = two physical writes. Reads may
be optimized. Most expensive solution: 100%
capacity overhead
Targeted for high I/O rate, high availability
environments
11
Redundant Arrays of Disks: RAID 5 (High I/O Rate
Parity)
(Figure: data blocks D0, D1, D2, ... striped across the disk columns
with the parity block P rotated across disks; logical disk addresses
increase along each stripe, and a stripe unit is the amount of data
placed on one disk per stripe)
A logical write becomes four physical
I/Os; Independent writes possible because
of interleaved parity; Reed-Solomon Codes ("Q")
for protection during reconstruction
Targeted for mixed applications
12
Problems of Disk Arrays: Small Writes
RAID-5 Small Write Algorithm:
1 Logical Write = 2 Physical Reads + 2 Physical
Writes
(Figure: to update D0 to D0' in the stripe D0 D1 D2 D3 P, the
controller (1.) reads the old data D0 and (2.) reads the old parity P,
XORs the old data with the new data and then with the old parity to
produce the new parity P', then (3.) writes D0' and (4.) writes P'.
A code sketch of this update follows.)
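A minimal C sketch of the small-write parity update above. The
in-memory "disks", block size, and data values are placeholders; the
two-reads, XOR, two-writes structure is the slide's algorithm. With a
real array the memcpy calls would be physical disk I/Os:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define NDISKS 5   /* 4 data disks + 1 parity disk (placeholder layout) */
    #define BLOCK  8   /* tiny block size so the demo stays readable        */

    static uint8_t disk[NDISKS][BLOCK];   /* in-memory stand-in for disks   */

    /* RAID-5 small write: 1 logical write = 2 reads + 2 writes. */
    static void raid5_small_write(int data_disk, int parity_disk,
                                  const uint8_t *new_data) {
        uint8_t old_data[BLOCK], parity[BLOCK];
        memcpy(old_data, disk[data_disk],   BLOCK);   /* 1. read old data   */
        memcpy(parity,   disk[parity_disk], BLOCK);   /* 2. read old parity */
        for (int i = 0; i < BLOCK; i++)   /* new P = old P ^ old D ^ new D  */
            parity[i] ^= old_data[i] ^ new_data[i];
        memcpy(disk[data_disk],   new_data, BLOCK);   /* 3. write new data  */
        memcpy(disk[parity_disk], parity,   BLOCK);   /* 4. write new parity*/
    }

    int main(void) {
        /* Initialize D0..D3 with arbitrary data and P = D0 ^ D1 ^ D2 ^ D3. */
        for (int d = 0; d < 4; d++)
            for (int i = 0; i < BLOCK; i++) {
                disk[d][i] = (uint8_t)(16 * d + i);
                disk[4][i] ^= disk[d][i];
            }
        uint8_t d0_new[BLOCK] = {0xAA, 0xBB, 0xCC, 0xDD, 1, 2, 3, 4};
        raid5_small_write(0, 4, d0_new);              /* update D0 -> D0'   */

        int ok = 1;                          /* verify parity still covers  */
        for (int i = 0; i < BLOCK; i++)      /* all of the data blocks      */
            if ((disk[0][i] ^ disk[1][i] ^ disk[2][i] ^ disk[3][i]) != disk[4][i])
                ok = 0;
        printf("parity %s\n", ok ? "consistent" : "BROKEN");
        return 0;
    }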
13
System Availability: Orthogonal RAIDs
(Figure: an Array Controller connected to several String Controllers,
each driving its own string of disks)
Data Recovery Group: unit of data redundancy
Redundant Support Components: fans, power
supplies, controller, cables
End to End Data Integrity: internal parity
protected data paths
14
Administrivia
  • Still grading Exams!
  • Sorry my TA was preparing for Quals
  • Will get them done in next week (promise!)
  • Projects
  • Should be getting fully up to speed on project
  • Set up meeting with me this week

15
What is Parallel Architecture?
  • A parallel computer is a collection of processing
    elements that cooperate to solve large problems
  • Most important new element It is all about
    communication!
  • What does the programmer (or OS or Compiler
    writer) think about?
  • Models of computation
  • PRAM? BSP? Sequential Consistency?
  • Resource Allocation
  • how powerful are the elements?
  • how much memory?
  • What mechanisms must be in hardware vs software?
  • What does a single processor look like?
  • High performance general purpose processor
  • SIMD processor/Vector Processor
  • Data access, Communication and Synchronization
  • how do the elements cooperate and communicate?
  • how are data transmitted between processors?
  • what are the abstractions and primitives for
    cooperation?

16
Flynn's Classification (1966)
  • Broad classification of parallel computing
    systems
  • SISD: Single Instruction, Single Data
  • conventional uniprocessor
  • SIMD: Single Instruction, Multiple Data
  • one instruction stream, multiple data paths
  • distributed memory SIMD (MPP, DAP, CM-1/2,
    Maspar)
  • shared memory SIMD (STARAN, vector computers)
  • MIMD: Multiple Instruction, Multiple Data
  • message passing machines (Transputers, nCube,
    CM-5)
  • non-cache-coherent shared memory machines (BBN
    Butterfly, T3D)
  • cache-coherent shared memory machines (Sequent,
    Sun Starfire, SGI Origin)
  • MISD: Multiple Instruction, Single Data
  • Not a practical configuration

17
Examples of MIMD Machines
  • Symmetric Multiprocessor
  • Multiple processors in box with shared memory
    communication
  • Current MultiCore chips like this
  • Every processor runs copy of OS
  • Non-uniform shared-memory with separate I/O
    through host
  • Multiple processors
  • Each with local memory
  • general scalable network
  • Extremely light OS on node provides simple
    services
  • Scheduling/synchronization
  • Network-accessible host for I/O
  • Cluster
  • Many independent machines connected with a
    general network
  • Communication through messages

18
Categories of Thread Execution
(Figure: execution slots over time (processor cycles) for Superscalar,
Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous
Multithreading; colors distinguish Threads 1-5 and idle slots)
19
Parallel Programming Models
  • Programming model is made up of the languages and
    libraries that create an abstract view of the
    machine
  • Control
  • How is parallelism created?
  • What orderings exist between operations?
  • How do different threads of control synchronize?
  • Data
  • What data is private vs. shared?
  • How is logically shared data accessed or
    communicated?
  • Synchronization
  • What operations can be used to coordinate
    parallelism?
  • What are the atomic (indivisible) operations?
  • Cost
  • How do we account for the cost of each of the
    above?

20
Simple Programming Example
  • Consider applying a function f to the elements of
    an array A and then computing its sum
  • Questions
  • Where does A live? All in single memory?
    Partitioned?
  • What work will be done by each processor?
  • They need to coordinate to get a single result,
    how?

(Figure: the values f(A[i]) are combined into a single sum s)
21
Programming Model 1 Shared Memory
  • Program is a collection of threads of control.
  • Can be created dynamically, mid-execution, in
    some languages
  • Each thread has a set of private variables, e.g.,
    local stack variables
  • Also a set of shared variables, e.g., static
    variables, shared common blocks, or global heap.
  • Threads communicate implicitly by writing and
    reading shared variables.
  • Threads coordinate by synchronizing on shared
    variables

22
Simple Programming Example: SM
  • Shared memory strategy:
  • small number p << n = size(A) processors
  • attached to single memory
  • Parallel Decomposition
  • Each evaluation and each partial sum is a task.
  • Assign n/p numbers to each of p procs
  • Each computes independent private results and
    partial sum.
  • Collect the p partial sums and compute a global
    sum.
  • Two Classes of Data
  • Logically Shared
  • The original n numbers, the global sum.
  • Logically Private
  • The individual function evaluations.
  • What about the individual partial sums?

23
Shared Memory Code for sum
static int s = 0;

Thread 1:
    for i = 0, n/2-1
        s = s + f(A[i]);

Thread 2:
    for i = n/2, n-1
        s = s + f(A[i]);
  • Problem is a race condition on variable s in the
    program
  • A race condition or data race occurs when
  • two processors (or two threads) access the same
    variable, and at least one does a write.
  • The accesses are concurrent (not synchronized) so
    they could happen simultaneously

24
A Closer Look
(Figure: A = [3, 5], f = square; the per-thread register values 9 and
25 and the possible final values of s are shown)

static int s = 0;

Thread 1:
    ...
    compute f(A[i]) and put in reg0
    reg1 = s
    reg1 = reg1 + reg0
    s = reg1
    ...

Thread 2:
    ...
    compute f(A[i]) and put in reg0
    reg1 = s
    reg1 = reg1 + reg0
    s = reg1
    ...

  • Assume A = [3, 5], f is the square function, and
    s = 0 initially
  • For this program to work, s should be 34 at the
    end
  • but it may be 34, 9, or 25
  • The atomic operations are reads and writes
  • Never see ½ of one number, but the + operation is
    not atomic
  • All computations happen in (private) registers

25
Improved Code for Sum
static int s = 0;

Thread 1:
    local_s1 = 0;
    for i = 0, n/2-1
        local_s1 = local_s1 + f(A[i]);
    s = s + local_s1;

Thread 2:
    local_s2 = 0;
    for i = n/2, n-1
        local_s2 = local_s2 + f(A[i]);
    s = s + local_s2;

  • Since addition is associative, it's OK to
    rearrange order
  • Most computation is on private variables
  • Sharing frequency is also reduced, which might
    improve speed
  • But there is still a race condition on the update
    of shared s
  • The race condition can be fixed by adding locks
    (only one thread can hold a lock at a time;
    others wait for it); a pthreads sketch follows
    below
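  • A minimal pthreads sketch of this pattern. The array
    contents, f, and the two-thread split are placeholders I
    chose; the structure, private partial sums plus a
    lock-protected update of shared s, follows the slide.
    Compile with -pthread:

      #include <pthread.h>
      #include <stdio.h>

      #define N 8
      static int A[N] = {3, 5, 1, 2, 4, 6, 7, 8};   /* placeholder data */
      static int s = 0;                             /* shared sum       */
      static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;

      static int f(int x) { return x * x; }         /* placeholder f    */

      static void *worker(void *arg) {
          int id = *(int *)arg;                     /* 0 or 1           */
          int lo = id * N / 2, hi = (id + 1) * N / 2;
          int local_s = 0;
          for (int i = lo; i < hi; i++)             /* private partial sum */
              local_s += f(A[i]);
          pthread_mutex_lock(&s_lock);     /* only the final update is shared */
          s += local_s;
          pthread_mutex_unlock(&s_lock);
          return NULL;
      }

      int main(void) {
          pthread_t t[2];
          int ids[2] = {0, 1};
          for (int i = 0; i < 2; i++)
              pthread_create(&t[i], NULL, worker, &ids[i]);
          for (int i = 0; i < 2; i++)
              pthread_join(t[i], NULL);
          printf("s = %d\n", s);
          return 0;
      }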

26
What about Synchronization?
  • All shared-memory programs need synchronization
  • Barrier: global (/coordinated) synchronization
  • simple use of barriers -- all threads hit the
    same one (a runnable sketch follows at the end of
    this slide)
      work_on_my_subgrid();
      barrier;
      read_neighboring_values();
      barrier;
  • Mutexes: mutual exclusion locks
  • threads are mostly independent and must access
    common data
      lock *l = alloc_and_init();   /* shared */
      lock(l);
      access data
      unlock(l);
  • Need atomic operations bigger than loads/stores
  • Actually, Dijkstra's algorithm can get by with
    only loads/stores, but this is quite complex (and
    doesn't work under all circumstances)
  • Examples: atomic swap, test-and-test-and-set
  • Another Option: Transactional memory
  • Hardware equivalent of optimistic concurrency
  • Some think that this is the answer to all
    parallel programming
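  • A minimal sketch of the barrier idiom above, assuming POSIX
    barriers (pthread_barrier_*) are available; the subgrid work
    and neighbor reads are reduced to placeholder prints:

      #include <pthread.h>
      #include <stdio.h>

      #define NTHREADS 4
      static pthread_barrier_t barrier;

      static void *worker(void *arg) {
          int id = *(int *)arg;
          printf("thread %d: work_on_my_subgrid()\n", id);  /* placeholder */
          pthread_barrier_wait(&barrier);   /* all finish the grid update  */
          printf("thread %d: read_neighboring_values()\n", id);
          pthread_barrier_wait(&barrier);   /* all finish reading          */
          return NULL;
      }

      int main(void) {
          pthread_t t[NTHREADS];
          int ids[NTHREADS];
          pthread_barrier_init(&barrier, NULL, NTHREADS);
          for (int i = 0; i < NTHREADS; i++) {
              ids[i] = i;
              pthread_create(&t[i], NULL, worker, &ids[i]);
          }
          for (int i = 0; i < NTHREADS; i++)
              pthread_join(t[i], NULL);
          pthread_barrier_destroy(&barrier);
          return 0;
      }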

27
Programming Model 2 Message Passing
  • Program consists of a collection of named
    processes.
  • Usually fixed at program startup time
  • Thread of control plus local address space -- NO
    shared data.
  • Logically shared data is partitioned over local
    processes.
  • Processes communicate by explicit send/receive
    pairs
  • Coordination is implicit in every communication
    event.
  • MPI (Message Passing Interface) is the most
    commonly used SW

28
Compute A[1]+A[2] on each processor
  • First possible solution: what could go wrong?

Processor 1:
    xlocal = A[1];
    send xlocal, proc2;
    receive xremote, proc2;
    s = xlocal + xremote;

Processor 2:
    xlocal = A[2];
    send xlocal, proc1;
    receive xremote, proc1;
    s = xlocal + xremote;
  • If send/receive acts like the telephone system?
    The post office?
  • What if there are more than 2 processors?

29
MPI: the de facto standard
  • MPI has become the de facto standard for parallel
    computing using message passing
  • Example (a self-contained version with MPI setup
    follows after this slide's bullets):
      for (i = 1; i < numprocs; i++) {
          sprintf(buff, "Hello %d! ", i);
          MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG,
                   MPI_COMM_WORLD);
      }
      for (i = 1; i < numprocs; i++) {
          MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG,
                   MPI_COMM_WORLD, &stat);
          printf("%d: %s\n", myid, buff);
      }
  • Pros and Cons of standards
  • MPI finally created a standard for applications
    development in the HPC community ⇒ portability
  • The MPI standard is a least common denominator
    building on mid-80s technology, so may discourage
    innovation
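  • A self-contained sketch of the same pattern, adding the setup
    the slide omits. BUFSIZE, TAG, and the reply text are my
    placeholders; the MPI calls are standard. Compile with mpicc
    and run with, e.g., mpirun -np 4:

      #include <mpi.h>
      #include <stdio.h>
      #include <string.h>

      #define BUFSIZE 128   /* placeholder buffer size */
      #define TAG     0     /* placeholder message tag */

      int main(int argc, char *argv[]) {
          char buff[BUFSIZE];
          int numprocs, myid, i;
          MPI_Status stat;

          MPI_Init(&argc, &argv);
          MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
          MPI_Comm_rank(MPI_COMM_WORLD, &myid);

          if (myid == 0) {
              /* Rank 0: greet every other rank, then collect replies. */
              for (i = 1; i < numprocs; i++) {
                  sprintf(buff, "Hello %d! ", i);
                  MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
              }
              for (i = 1; i < numprocs; i++) {
                  MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
                  printf("%d: %s\n", myid, buff);
              }
          } else {
              /* Workers: receive the greeting, append a reply, send back. */
              MPI_Recv(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD, &stat);
              char idstr[32];
              sprintf(idstr, "Processor %d reporting!", myid);
              strcat(buff, idstr);
              MPI_Send(buff, BUFSIZE, MPI_CHAR, 0, TAG, MPI_COMM_WORLD);
          }

          MPI_Finalize();
          return 0;
      }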

30
Which is better? SM or MP?
  • Which is better, Shared Memory or Message
    Passing?
  • Depends on the program!
  • Both are communication Turing complete
  • i.e. can build Shared Memory with Message Passing
    and vice-versa
  • Advantages of Shared Memory
  • Implicit communication (loads/stores)
  • Low overhead when cached
  • Disadvantages of Shared Memory
  • Complex to build in way that scales well
  • Requires synchronization operations
  • Hard to control data placement within caching
    system
  • Advantages of Message Passing
  • Explicit Communication (sending/receiving of
    messages)
  • Easier to control data placement (no automatic
    caching)
  • Disadvantages of Message Passing
  • Message passing overhead can be quite high
  • More complex to program
  • Introduces question of reception technique
    (interrupts/polling)

31
Basic Definitions
  • Network interface
  • Processor (or programmer's) interface to the
    network
  • Mechanism for injecting packets/removing packets
  • Links
  • Bundle of wires or fibers that carries a signal
  • May have separate wires for clocking
  • Switches
  • connects fixed number of input channels to fixed
    number of output channels
  • Can have a serious impact on latency, saturation,
    deadlock

32
Links and Channels
  • transmitter converts stream of digital symbols
    into signal that is driven down the link
  • receiver converts it back
  • tran/rcv share physical protocol
  • trans + link + rcv form a Channel for digital info
    flow between switches
  • link-level protocol segments stream of symbols
    into larger units: packets or messages (framing)
  • node-level protocol embeds commands for dest
    communication assist within packet

33
Clock Synchronization?
  • Receiver must be synchronized to transmitter
  • To know when to latch data
  • Fully Synchronous:
  • Same clock and phase: Isochronous
  • Same clock, different phase: Mesochronous
  • High-speed serial links work this way
  • Use of encoding (8B/10B) to ensure sufficient
    high-frequency component for clock recovery
  • Fully Asynchronous:
  • No clock: Request/Ack signals
  • Different clock: Need some sort of clock
    recovery?

34
Conclusion
  • Disk Time = queue + controller + seek + rotate +
    transfer
  • Queuing Latency:
  • M/M/1 and M/G/1 queues simplest to analyze
  • Assume memoryless input stream of requests
  • As utilization approaches 100%, latency → ∞
  • M/M/1: Tq = Tser x u/(1 - u)
  • M/G/1: Tq = Tser x ½(1 + C) x u/(1 - u)
  • Multiprocessing
  • Multiple processors connect together
  • It is all about communication!
  • Programming Models
  • Shared Memory
  • Message Passing