1
Server Resources
INF5071 Performance in Distributed Systems
  • 8., 15., 22. September 2006

2
Motivation
  • In a distributed system, the performance of every
    single machine is important
  • poor performance of one single node might be
    sufficient to kill the system (it is no better
    than the weakest link)
  • Machines on the server side are the most
    challenging
  • a large number of concurrent clients
  • shared, limited amount of resources
  • We will see examples that show simple means to
    improve performance
  • decrease the required number of machines
  • increase the number of concurrent clients
  • improve resource utilization
  • enable timely delivery of data

3
Overview
  • Resources, real-time, continuous media streams, ...
  • Server examples
  • (CPU) Scheduling
  • Memory management
  • Storage management

4
Resources and Real-Time
5
Resources
  • ResourceA resource is a system entity required
    by a task for manipulating data Steimetz
    Narhstedt 95
  • Characteristics
  • active provides a service, e.g., CPU, disk or
    network adapter
  • passive system capabilities required by active
    resources, e.g., memory
  • exclusive only one process at a time can use it,
    e.g., CPU
  • shared can be used by several concurrent
    processed, e.g., memory

6
Real-Time
  • Real-time process: a process which delivers the
    results of the processing in a given time-span
  • Real-time system: a system in which the
    correctness of a computation depends not only on
    obtaining the result, but also upon providing the
    result on time
  • Deadline: the latest acceptable time for the
    presentation of the processing result
  • Hard deadlines
  • must never be violated → system failure
  • Soft deadlines
  • in some cases, the deadline might be missed
  • not too frequently
  • not by much time
  • result may still have some (but decreasing) value

7
Admission and Reservation
  • To prevent overload, admission may be performed
  • schedulability test
  • are there enough resources available for a new
    stream?
  • can we find a schedule for the new task without
    disturbing the existing workload?
  • a task is allowed if the utilization remains lt 1
  • yes allow new task, allocate/reserve resources
  • no reject
  • Resource reservation is analogous to booking
    (asking for resources)
  • pessimistic
  • avoid resource conflicts making worst-case
    reservations
  • potentially under-utilized resources
  • guaranteed QoS
  • optimistic
  • reserve according to average load
  • high utilization
  • overload may occur
  • perfect
  • must have detailed knowledge about resource
    requirements of all processes
  • too expensive to make/takes much time
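
A minimal sketch in C of the utilization-based
admission test above; the task record, its wcet and
period fields, and the strict < 1 bound are
illustrative assumptions, not part of the original
slides:

    #include <stdbool.h>
    #include <stddef.h>

    /* Illustrative task record (hypothetical fields). */
    struct task {
        double wcet;    /* worst-case execution time per period */
        double period;  /* request period, same time unit as wcet */
    };

    /* Admit the new task only if total utilization stays below 1. */
    bool admit(const struct task *admitted, size_t n,
               const struct task *new_task)
    {
        double u = new_task->wcet / new_task->period;
        for (size_t i = 0; i < n; i++)
            u += admitted[i].wcet / admitted[i].period;
        return u < 1.0;  /* yes -> allocate/reserve, no -> reject */
    }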

8
Real-Time and Operating Systems
  • The operating system manages local resources
    (CPU, memory, disk, network card, busses, ...)
  • In a real-time scenario, support is needed for
  • real-time processing
  • efficient memory management
  • high-rate, timely I/O
  • This also means support for proper
  • scheduling: high priorities for
    time-restrictive tasks
  • timer support: clock with fine granularity and
    event scheduling with high accuracy
  • kernel preemption: avoid long periods where low
    priority processes cannot be interrupted
  • memory replacement: prevent code for real-time
    programs from being paged out
  • fast switching: both interrupts and context
    switches should be fast
  • ...

9
Timeliness
10
Timeliness
  • Start presenting data (e.g., video playout) at
    time t1
  • (figure: consumed bytes (offset) over time, at
    variable or constant rate)
  • Must start retrieving data earlier
  • data must arrive before consumption time
  • data must be sent before arrival time
  • data must be read from disk before sending time

11
Timeliness
  • Need buffers to hold data between the functions,
    e.g., at the client: B(t) = A(t) - C(t), i.e.,
    ∀t: A(t) ≥ C(t) (arrived data must at all times
    cover the consumed data)
  • Latest start of data arrival is given by
    min B(t, t0, t1): ∀t: B(t, t0, t1) ≥ 0, i.e.,
    the buffer must at all times t have data left to
    consume

12
Timeliness: Streaming Data
  • Continuous Media and continuous streams are
    ILLUSIONS
  • retrieve data in blocks from disk
  • transfer blocks from file system to application
  • send packets to communication system
  • split packets into appropriate MTUs
  • ... (intermediate nodes)
  • ... (client)
  • different optimal sizes
  • pseudo-parallel processes (run in time slices)
  • need for scheduling (to achieve timing and
    appropriate resource allocation)

13
(Video) Server Structure
14
Server Components: Switches
[Tetzlaff & Flynn 94]
(figure: server architecture with switches; network
access via IP or RPC in the application; file access
via NFS, AFS, CODA, or a distributed OS; storage on
disk arrays (RAID))
15
Server Topology I
  • Single server
  • easy to implement
  • scales poorly
  • Partitioned server
  • users divided into groups
  • content: assumes equal groups
  • location: store all data on all servers
  • load imbalance

16
Server Topology II
  • Externally switched servers
  • use network to make server pool
  • manages load imbalance(control server directs
    requests)
  • still data replication problems
  • (control server doesnt need to be a physical box
    - distributed process)
  • Fully switched server
  • server pool
  • storage device pool
  • additional hardware costs
  • e.g., Oracle, Intel, IBM

17
Data Retrieval
  • Pull model
  • client sends several requests
  • each request delivers only a small part of the data
  • fine-grained client control
  • favors high interactivity
  • suited for editing, searching, etc.
  • Push model
  • client sends one request
  • streaming delivery
  • favors capacity planning
  • suited for retrieval, download, playback, etc.

18
Typical Trends in the Internet Today
  • Push systems (pull in video editing/database
    systems)
  • Traditional (specialized) file systems, not
    databases, for data storage
  • No in-band control (control and data information
    in separate streams)
  • External directory services for data location
    (control server + data pump)
  • Request redirection for access control
  • Single stand-alone servers → (fully) switched
    servers

19
Server Examples
20
(Video) Server Product Status
1) RealServer, VXtreme, Starlight, VDO, Netscape
Media Server, MS Media Server, Apple Darwin
2) IBM Mediastreamer, Oracle Video
Cartridge, nCUBE
3) SGI/Kasenna MediaBase, SUN Media Center,
IBM VideoCharger
21
Real Server
  • User space implementation
  • one control server
  • several protocols
  • several versions of data in the same file
  • adapts to resources
  • Several formats, e.g.,
  • Real's own
  • MPEG-2 version with stream thinning (dropped
    with Real?)
  • Does not support
  • Quality-of-Service
  • load leveling

(figure: protocol stack - RTP/RTCP and Real's own
protocol on top of UDP and TCP over IP)
22
IBM Video Charger
  • May consist of one machine only, or
  • several of IBM's Advanced Interactive eXecutive
    (AIX) machines
  • Servers
  • control
  • data
  • Lightly modified existing components
  • OS: AIX 4/5L
  • virtual shared disks (VSD) (guaranteed disk I/Os)
  • Special components
  • TigerShark MMFS (buffers, data rate, prefetching,
    codec, ...)
  • stream filters, control server, APIs, ...

(figure: a specific control server speaking RTSP;
the data path uses a video stream API and mlib API,
RTP with encrypt filters over UDP/IP, on top of the
TigerShark MMFS and VSD)
23
nCUBE
  • Original research from Cal Tech/Intel ('83)
  • Bought by C-COR in Jan. '05 ($90M)
  • One server scales from 1 to 256 machines, 2^n,
    n = 0, ..., 8, using a hypercube architecture
  • Why a hypercube?
  • video streaming is a switching problem
  • a hypercube is a high performance scalable switch
  • no content replication and true linear
    scalability
  • integrated adaptive routing provides resilience
  • Highlights
  • one copy of a data element
  • scales from 5,000 to 500,000 clients
  • exceeds 60,000 simultaneous streams
  • 6,600 simultaneous streams at 2 - 4 Mbps each
    (26 streams per machine if n = 8)
  • Special components
  • boards with integrated components
  • TRANSIT operating system
  • n4: HAVOC (1999)
  • Hypercube And Vector Operations Controller
  • ASIC-based hypercube technology

(figure: board with 8 hypercube connectors, a
configurable interface, memory, PCI bus, vector
processor, and SCSI ports serving requests)
24
nCUBE Naturally Load-balanced
  • Disks connected to all MediaHubs
  • Each title striped across all MediaHubs
  • streaming hub reads content from all disks in the
    video server
  • Automatic load balancing
  • immune to content usage pattern
  • same load for same or different titles
  • each stream's load spread over all nodes
  • RAID sets distributed across MediaHubs
  • immune to a MediaHub failure
  • increasing reliability
  • Only 1 copy of each title ever needed
  • lots of room for expanded content, network-based
    PVR, or HDTV content

(figure: content striped across all disks in the
n4x server; each video stream reads from all of
them)
25
Small Comparison
26
(CPU) Scheduling
27
Scheduling
  • A task is a schedulable entity (a process/thread
    executing a job, e.g., a packet through the
    communication system or a disk request through
    the file system)
  • In a multi-tasking system, several tasks may
    wish to use a resource simultaneously
  • A scheduler decides which task may use the
    resource, i.e., determines the order in which
    requests are serviced, using a scheduling
    algorithm
  • Each active resource (CPU, disk, NIC) needs a
    scheduler (passive resources are also scheduled,
    but in a slightly different way)

(figure: requests → scheduler → resource)
28
Scheduling
  • Scheduling algorithm classification
  • dynamic
  • makes scheduling decisions at run-time
  • flexible to adapt
  • considers only actual task requests and execution
    time parameters
  • large run-time overhead finding a schedule
  • static
  • makes scheduling decisions off-line (also
    called pre-run-time)
  • generates a dispatching table for the run-time
    dispatcher at compile time
  • needs complete knowledge of the tasks before
    compiling
  • small run-time overhead
  • preemptive
  • currently executing task may be interrupted
    (preempted) by higher priority processes
  • preempted process continues later at the same
    state
  • potentially frequent context switching
  • (almost!?) useless for disks and network cards
  • non-preemptive
  • a running task is allowed to finish its
    time slot (higher priority processes must wait)
  • reasonable for short tasks like sending a packet
    (used by disks and network cards)

29
Scheduling
  • Preemption
  • tasks wait for processing
  • scheduler assigns priorities
  • task with highest priority will be scheduled
    first
  • preempt current execution if a higher priority
    (more urgent) task arrives
  • real-time and best effort priorities (real-time
    processes have higher priority - if any exist,
    they will run)
  • two kinds of preemption:
  • preemption points
  • predictable overhead
  • simplified scheduler accounting
  • immediate preemption
  • needed for hard real-time systems
  • needs special timers and fast interrupt and
    context switch handling
30
Scheduling
  • Scheduling is difficult and takes time

(figure: an RT process is delayed behind processes
1..N under round-robin; it waits only for process 1
to finish under non-preemptive priority scheduling;
and it runs almost immediately under preemptive
priority scheduling, where process 1 is interrupted)
31
Scheduling in Linux
  • Preemptive kernel
  • Threads and processes used to be equal, but
    Linux uses (in 2.6) thread scheduling
  • SCHED_FIFO (see the sketch below)
  • may run forever, no timeslices
  • may use its own scheduling algorithm
  • SCHED_RR
  • each priority in RR
  • timeslices of 10 ms (quantums)
  • SCHED_OTHER
  • ordinary user processes
  • uses nice values: 1 ≤ priority ≤ 40
  • timeslices of 10 ms (quantums)
  • Threads with highest goodness are selected first
  • realtime (FIFO and RR): goodness = 1000 +
    priority
  • timesharing (OTHER): goodness = (quantum > 0 ?
    quantum + priority : 0)
  • Quantums are reset when no ready process has
    quantums left (end of epoch): quantum =
    (quantum/2) + priority
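
As a small illustration of these policies, the
sketch below puts the calling process into
SCHED_FIFO on Linux; the priority value 50 is an
arbitrary choice, and the call normally requires
root (or CAP_SYS_NICE):

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        /* 1..99 for the real-time policies */
        struct sched_param p = { .sched_priority = 50 };

        /* SCHED_FIFO: runs until it blocks or yields - no timeslices */
        if (sched_setscheduler(0, SCHED_FIFO, &p) == -1) {
            perror("sched_setscheduler");
            return 1;
        }
        printf("now running with SCHED_FIFO, priority %d\n",
               p.sched_priority);
        return 0;
    }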
32
Real-Time Scheduling
  • Resource reservation
  • QoS can be guaranteed
  • relies on knowledge of tasks
  • no fairness
  • origin: time-sharing operating systems
  • e.g., earliest deadline first (EDF) and rate
    monotonic (RM) (AQUA, HeiTS, RT Upcalls, ...)
  • Proportional share resource allocation
  • no guarantees
  • requirements are specified by a relative share
  • allocation in proportion to competing shares
  • size of a share depends on system state and time
  • origin: packet-switched networks
  • e.g., Scheduler for Multimedia And Real-Time
    applications (SMART) (Lottery, Stride,
    Move-to-Rear List, ...)

33
Earliest Deadline First (EDF)
  • Preemptive scheduling based on dynamic task
    priorities
  • Task with closest deadline has highest priority
    → stream priorities vary with time
  • Dispatcher selects the highest priority task
    (see the sketch after this list)
  • Assumptions:
  • requests for all tasks with deadlines are
    periodic
  • the deadline of a task is equal to the end of
    its period (start of the next)
  • independent tasks (no precedence)
  • run-time for each task is known and constant
  • context switches can be ignored
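
A minimal sketch of the EDF dispatching rule in C -
pick the ready task with the earliest absolute
deadline; the task record is an illustrative
assumption:

    #include <stddef.h>

    struct task {
        long deadline;  /* absolute deadline, e.g., in ms */
        int  ready;
    };

    /* Returns the ready task with the earliest deadline, or NULL. */
    struct task *edf_pick(struct task *tasks, size_t n)
    {
        struct task *best = NULL;
        for (size_t i = 0; i < n; i++)
            if (tasks[i].ready &&
                (best == NULL || tasks[i].deadline < best->deadline))
                best = &tasks[i];
        return best;
    }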

34
Earliest Deadline First (EDF)
  • Example

(figure: EDF dispatching of tasks A and B against
their deadlines over time)
35
Rate Monotonic (RM) Scheduling
  • Classic algorithm for hard real-time systems with
    one CPU [Liu & Layland 73]
  • Preemptive scheduling based on static task
    priorities
  • Optimal: no other algorithm with static task
    priorities can schedule task sets that cannot be
    scheduled by RM
  • Assumptions:
  • requests for all tasks with deadlines are
    periodic
  • the deadline of a task is equal to the end of
    its period (start of the next)
  • independent tasks (no precedence)
  • run-time for each task is known and constant
  • context switches can be ignored
  • any non-periodic task has no deadline

36
Rate Monotonic (RM) Scheduling
  • Process priority is based on task periods
  • task with shortest period gets highest static
    priority
  • task with longest period gets lowest static
    priority
  • dispatcher always selects the task request with
    highest priority
  • Example (see also the sketch below):

(figure: priority decreases with period length
(Pi = period for task i); P1 < P2 → task 1 gets
highest priority; dispatching of tasks 1 and 2)
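
A sketch of the RM priority assignment in C -
shortest period gets the highest static priority;
the task record and priority encoding are
illustrative assumptions:

    #include <stdlib.h>

    struct task {
        double period;
        int    priority;  /* higher value = higher priority */
    };

    static int by_period(const void *a, const void *b)
    {
        double pa = ((const struct task *)a)->period;
        double pb = ((const struct task *)b)->period;
        return (pa > pb) - (pa < pb);
    }

    /* Rate monotonic: sort by period, shortest first, and hand
     * out static priorities in decreasing order. */
    void rm_assign(struct task *tasks, size_t n)
    {
        qsort(tasks, n, sizeof *tasks, by_period);
        for (size_t i = 0; i < n; i++)
            tasks[i].priority = (int)(n - i);
    }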
37
EDF Versus RM
  • It might be impossible to prevent deadline misses
    in a strict, fixed priority system

(figure: tasks A and B under fixed priorities with
no dropping - whether A or B has priority, time is
wasted; RM may give some deadline violations, which
are avoided by EDF)
38
SMART (Scheduler for Multimedia And Real-Time
applications)
  • Designed for multimedia and real-time
    applications
  • Principles
  • priority: high priority tasks should not suffer
    degradation due to presence of low priority tasks
  • proportional sharing: allocate resources
    proportionally and distribute unused resources
    (work conserving)
  • tradeoff immediate fairness: real-time and less
    competitive processes (short-lived, interactive,
    I/O-bound, ...) get instantaneously higher shares
  • graceful transitions: adapt smoothly to resource
    demand changes
  • notification: notify applications of resource
    changes

39
SMART (Scheduler for Multimedia And Real-Time
applications)
  • Tasks have importance and urgency
  • urgency: an immediate real-time constraint,
    short deadline (determines when a task will get
    resources)
  • importance: a priority measure
  • expressed by a tuple: [priority p, biased
    virtual finishing time bvft]
  • p is static: supplied by the user or assigned a
    default value
  • bvft is dynamic:
  • virtual finishing time: virtual application
    time for finishing if given the requested
    resources
  • bias: bonus for interactive tasks
  • Best effort schedule based on urgency and
    importance
  • find the most important tasks by comparing
    tuples: T1 > T2 iff (p1 > p2) or (p1 = p2 and
    bvft1 > bvft2) (see the sketch below)
  • sort by urgency (EDF-based sorting)
  • iteratively select tasks from the candidate set
    as long as the schedule is feasible
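
The importance comparison above can be written
directly as a comparator; the struct below is an
illustrative assumption, the relation itself is the
tuple rule from this slide:

    #include <stdbool.h>

    struct importance {
        int    p;     /* static priority */
        double bvft;  /* biased virtual finishing time */
    };

    /* T1 > T2  iff  (p1 > p2) or (p1 = p2 and bvft1 > bvft2) */
    bool more_important(struct importance t1, struct importance t2)
    {
        return t1.p > t2.p || (t1.p == t2.p && t1.bvft > t2.bvft);
    }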

40
Evaluation of Real-Time Scheduling
  • Tests performed
  • by IBM (1993)
  • executing tasks with and without EDF
  • on a 57 MHz, 32 MB RAM, AIX Power 1
  • Video playback program
  • one real-time process
  • read compressed data
  • decompress data
  • present video frames via X server to user
  • the process requires 15 timeslots of 28 ms each
    per second → 42% of the CPU time

41
Evaluation of Real-Time Scheduling
(figure: laxity (remaining time to deadline) per
task number with 3 load processes competing with
the video playback; the real-time scheduler reaches
all its deadlines, while the non-real-time
scheduler causes several deadline violations)
42
Evaluation of Real-Time Scheduling
(figure: laxity (remaining time to deadline) per
task number when varying the number of load
processes competing with the video playback - only
the video process, 4 other processes, and 16 other
processes; NB! the EDF scheduler kept its deadlines)
43
Evaluation of Real-Time Scheduling
  • Tests again performed
  • by IBM (1993)
  • on a 57 MHz, 32 MB RAM, AIX Power 1
  • "Stupid" end system program
  • 3 real-time processes only requesting CPU cycles
  • each process requires 15 timeslots of 21 ms each
    per second → 31.5% of the CPU time each → 94.5%
    of the CPU time required for real-time tasks

44
Evaluation of Real-Time Scheduling
(figure: laxity (remaining time to deadline) per
task number with 1 load process competing with the
real-time processes; the real-time scheduler
reaches all its deadlines)
45
Evaluation of Real-Time Scheduling
(figure: laxity per task number for processes 1-3
with 16 load processes competing; regardless of
other load, the EDF scheduler reaches its deadlines
(laxity almost equal to the 1-load-process
scenario); NOTE: the processes are scheduled in the
same order)
46
Memory Management
47
Why look at a passive resource?
  • "Dying philosophers problem"
  • Lack of space (or bandwidth) can delay
    applications → e.g., the dining philosophers
    would die because the spaghetti-chef could not
    find a parking lot (parking analogy)
48
Delivery Systems
(figure: servers and clients connected through a
network)
49
Delivery Systems
  • several in-memory data movements and context
    switches
  • several disk-to-memory transfers

(figure: data crosses the bus(es) several times)
50
Memory Caching
51
Memory Caching
  • How do we manage a cache?
  • how much memory to use?
  • how much data to prefetch?
  • which data item to replace?

(figure: an application-level cache sits between
the application, the file system, and the
communication system; going all the way to the disk
or the network card is expensive)
52
Is Caching Useful in a High-Rate Scenario?
  • High-rate data may need lots of memory for
    caching
  • Tradeoff: amount of memory, algorithm
    complexity, gain, ...
  • Cache only frequently used data - how? (e.g.,
    first (small) parts in a broadcast partitioning
    scheme, allow top-ten only, ...)

(figure: the largest Dell server in 2004 - and all
of its memory is NOT used for caching)
53
Need For Application-Specific Algorithms?
  • Most existing systems use an LRU-variant
  • keep a sorted list
  • replace first in list
  • insert new data elements at the end
  • if a data element is re-accessed (e.g., new
    client or rewind), move back to the end of the
    list
  • Extreme example: video frame playout (figure and
    sketch below)

(figure: an LRU buffer ordered from longest to
shortest time since access while playing a video of
7 frames; after playing frames 1-7, the client
rewinds and restarts playout at frame 1; from then
on, each playout step evicts exactly the frame that
is needed next - here LRU replaces the next needed
frame, so the answer is in many cases YES)
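
For reference, a minimal array-based LRU sketch in
C matching the bullets above (evict from the front,
insert and re-access at the end); the capacity and
item type are illustrative:

    #include <string.h>

    #define CAP 7

    /* index 0 = longest since access (evicted first),
     * index used-1 = most recently accessed */
    struct lru { int items[CAP]; int used; };

    void lru_access(struct lru *l, int item)
    {
        /* if present, remove it so it can move to the end */
        for (int i = 0; i < l->used; i++)
            if (l->items[i] == item) {
                memmove(&l->items[i], &l->items[i + 1],
                        (l->used - i - 1) * sizeof(int));
                l->used--;
                break;
            }
        /* full: evict the least recently used element */
        if (l->used == CAP) {
            memmove(&l->items[0], &l->items[1],
                    (CAP - 1) * sizeof(int));
            l->used--;
        }
        l->items[l->used++] = item;  /* newest at the end */
    }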
54
Classification of Mechanisms
  • Block-level caching: consider (possibly
    unrelated) sets of blocks
  • each data element is viewed as an independent
    item
  • usually used in traditional systems
  • e.g., FIFO, LRU, LFU, CLOCK, ...
  • multimedia (video) approaches
  • Least/Most Relevant for Presentation (L/MRP)
  • Stream-dependent caching: consider a stream
    object as a whole
  • related data elements are treated in the same way
  • research prototypes in multimedia systems
  • e.g.,
  • BASIC
  • DISTANCE
  • Interval Caching (IC)
  • Generalized Interval Caching (GIC)
  • Split and Merge (SAM)
  • SHR
55
Least/Most Relevant for Presentation (L/MRP)
[Moser et al. 95]
  • L/MRP is a buffer management mechanism for a
    single interactive, continuous data stream
  • adaptable to individual multimedia applications
  • preloads units most relevant for presentation
    from disk
  • replaces units least relevant for presentation
  • client-pull based architecture

(figure: server and client exchanging a homogeneous
stream, e.g., MJPEG video, as Continuous Object
Presentation Units (COPUs), e.g., MJPEG video
frames)
56
Least/Most Relevant for Presentation (L/MRP)
[Moser et al. 95]
  • Relevance values are calculated with respect to
    the current playout of the multimedia stream
  • presentation point (current position in file)
  • mode / speed (forward, backward, FF, FB, jump)
  • relevance functions are configurable

(figure: COPUs 10-26 around the presentation point;
relevance decreases with distance from the
presentation point in the playback direction)
57
Least/Most Relevant for Presentation (L/MRP)
[Moser et al. 95]
  • Global relevance value
  • each COPU can have more than one relevance value
  • bookmark sets (known interaction points)
  • several viewers (clients) of the same stream
  • maximum relevance for each COPU

(figure: relevance values between 0 and 1 for COPUs
89-106, combining the referenced set ahead of the
current presentation point and the history set
behind it)
58
Least/Most Relevant for Presentation (L/MRP)
  • L/MRP ...
  • gives few disk accesses (compared to other
    schemes)
  • supports interactivity
  • supports prefetching
  • is targeted for single streams (users)
  • is expensive (!) to execute (calculates
    relevance values for all COPUs each round)
  • Variations
  • Q-L/MRP: extends L/MRP with multiple streams and
    changes the prefetching mechanism (reduces
    overhead) [Halvorsen et al. 98]
  • MPEG-L/MRP: gives different relevance values to
    different MPEG frames [Boll et al. 00]

59
Interval Caching (IC)
  • Interval caching (IC) is a caching strategy for
    streaming servers
  • caches data between requests for the same video
    stream, based on playout intervals between
    requests
  • following requests are thus served from the
    cache filled by the preceding stream
  • sort intervals by length; the buffer requirement
    is the data size of the interval
  • to maximize the cache hit ratio (minimize disk
    accesses) the shortest intervals are cached
    first (see the sketch after the figure)

(figure: three videos with concurrent streams; the
resulting intervals I11, I12, I21, I31, I32, I33
are sorted by length)
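
A sketch of the interval-selection step in C; the
interval record and the fixed memory budget are
illustrative assumptions:

    #include <stdlib.h>

    /* Interval between two consecutive streams of the same video. */
    struct interval {
        long length;  /* playout distance between the streams */
        long bytes;   /* cache requirement = data size of interval */
        int  cached;
    };

    static int by_length(const void *a, const void *b)
    {
        long la = ((const struct interval *)a)->length;
        long lb = ((const struct interval *)b)->length;
        return (la > lb) - (la < lb);
    }

    /* Cache the shortest intervals first until the budget is spent. */
    void ic_select(struct interval *iv, size_t n, long budget)
    {
        qsort(iv, n, sizeof *iv, by_length);
        for (size_t i = 0; i < n; i++) {
            iv[i].cached = iv[i].bytes <= budget;
            if (iv[i].cached)
                budget -= iv[i].bytes;
        }
    }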
60
Generalized Interval Caching (GIC)
  • Interval caching (IC) does not work for short
    clips
  • a frequently accessed short clip will not be
    cached
  • GIC generalizes the IC strategy
  • manages intervals for long video objects as IC
  • for short intervals, it extends the interval
    definition
  • keep track of a finished stream for a while
    after its termination
  • define the interval for a short stream as the
    distance between the new stream and the position
    the old stream would have had if the clip had
    been a longer video object
  • the cache requirement is, however, only the real
    requirement
  • cache the shortest intervals as in IC

(figure: stream S11 over video clip 1, with
interval I11 and cache requirement C11)
61
Generalized Interval Caching (GIC)
  • Open function:
      form, if possible, a new interval with the
        previous stream
      if (no interval formed)
          exit                      /* don't cache */
      compute interval size and cache requirement
      reorder interval list         /* smallest first */
      if (not already in a cached interval)
          if (space available)
              cache interval
          else if (larger cached intervals exist and
                   sufficient memory can be released)
              release memory from larger intervals
              cache new interval
  • Close function:
      if (not following another stream)
          exit             /* not served from cache */
      delete interval with preceding stream
      free memory
      if (next interval can be cached in released
          memory)
          cache next interval

62
LRU vs. L/MRP vs. IC Caching
  • What kind of caching strategy is best (VoD
    streaming)?
  • caching effect

(figure: memory contents under LRU, L/MRP, and IC
for four intervals I1-I4)
63
LRU vs. L/MRP vs. IC Caching
  • What kind of caching strategy is best (VoD
    streaming)?
  • caching effect (IC best)
  • CPU requirement

64
In-Memory Copy Operations
65
In-Memory Copy Operations
(figure: copying between the application, the file
system, and the communication system is expensive,
as are transfers to the disk and the network card)
66
Basic Idea of Zero-Copy Data Paths
(figure: only data pointers are passed between the
subsystems; the data itself crosses the bus(es)
only once)
67
Existing Linux Data Paths
A lot of research has been performed in this
area!!!! BUT, what is the status today of
commodity operating systems?
68
Content Download
(figure: the content download data path across the
bus(es))
69
Content Download: read / send
(figure: data is DMA-transferred from disk into the
kernel page cache, copied into the application
buffer, copied back into the socket buffer, and
DMA-transferred to the network card)
  • 2n copy operations
  • 2n system calls (see the sketch below)
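
A minimal sketch of this read/send loop in C; the
descriptors and block size are illustrative:

    #include <unistd.h>
    #include <sys/socket.h>

    #define BLK 65536

    ssize_t download(int file_fd, int sock_fd)
    {
        char buf[BLK];
        ssize_t n, total = 0;

        while ((n = read(file_fd, buf, sizeof buf)) > 0) {  /* copy 1 */
            if (send(sock_fd, buf, n, 0) != n)              /* copy 2 */
                return -1;
            total += n;
        }
        return n < 0 ? -1 : total;
    }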

70
Content Download: mmap / send
(figure: the file is mapped into the application's
address space; data is DMA-transferred into the
page cache, copied once into the socket buffer, and
DMA-transferred to the network card)
  • n copy operations
  • 1 + n system calls
71
Content Download: sendfile
(figure: data is DMA-transferred into the page
cache; a descriptor is appended to the socket
buffer, and a gather DMA transfer moves the data
directly to the network card)
  • 0 copy operations
  • 1 system call (see the sketch below)
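
The same download expressed with Linux sendfile -
one system call and no user-space copies; the
descriptors and count are illustrative:

    #include <sys/types.h>
    #include <sys/sendfile.h>

    ssize_t download_zero_copy(int file_fd, int sock_fd, size_t count)
    {
        off_t off = 0;
        return sendfile(sock_fd, file_fd, &off, count);
    }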

72
Content Download: Results
  • Tested transfer of a 1 GB file on Linux 2.6
  • Both UDP (with enhancements) and TCP

(figure: results for UDP and TCP)
73
Streaming
(figure: the streaming data path across the bus(es))
74
Streaming: read / send
(figure: as for content download - each block is
copied from the page cache into the application
buffer and back into the socket buffer, with DMA
transfers at both ends)
  • 2n copy operations
  • 2n system calls

75
Streaming: read / writev
(figure: the block is read into the application
buffer; writev copies an RTP header and the payload
into the socket buffer)
  • 3n copy operations
  • 2n system calls

→ previous solution: one less copy per packet (see
the sketch below)
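
A sketch of the per-packet writev call in C; the
RTP header size and the packet layout are
illustrative assumptions:

    #include <sys/types.h>
    #include <sys/uio.h>

    #define HDR 12  /* minimal RTP header */

    ssize_t send_packet(int sock_fd, const char hdr[HDR],
                        const char *payload, size_t len)
    {
        struct iovec iov[2] = {
            { .iov_base = (void *)hdr,     .iov_len = HDR },
            { .iov_base = (void *)payload, .iov_len = len },
        };
        return writev(sock_fd, iov, 2);  /* header + payload, one call */
    }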
76
Streaming: mmap / send
(figure: the file is mapped into the application
buffer; the RTP header is copied in front of each
packet, and send copies header and payload into the
socket buffer)
  • 2n copy operations
  • 1 + 4n system calls

77
Streaming: mmap / writev
(figure: the file is mapped into the application
buffer; writev gathers the RTP header and the
mapped payload into the socket buffer)
  • 2n copy operations
  • 1 + n system calls

→ previous solution: three more calls per packet
78
Streaming: sendfile
(figure: the RTP header is copied from the
application buffer into the socket buffer; the
payload is appended as a descriptor and sent by
gather DMA from the page cache)
  • n copy operations
  • 4n system calls

79
Streaming Results
  • Tested streaming of a 1 GB file on Linux 2.6
  • RTP over UDP

(figure: compared to not sending an RTP header over
UDP, we get an increase of 29% (the additional send
call); more copy operations and system calls are
required → potential for improvements; TCP sendfile
(content download) shown for comparison)
80
Enhanced Streaming Data Paths
81
Enhanced Streaming: mmap / msend
(figure: msend allows sending data from an mmap'ed
file without copying - the RTP header is still
copied into the socket buffer, while the payload is
appended as a descriptor and sent by gather DMA
from the page cache)
  • n copy operations
  • 1 + 4n system calls

→ previous solution: one more copy per packet
82
Enhanced Streaming: mmap / rtpmsend
(figure: the RTP header copy is integrated into the
msend system call; the header is copied in the
kernel, the payload is appended as a descriptor and
sent by gather DMA from the page cache)
  • n copy operations
  • 1 + n system calls

→ previous solution: three more calls per packet
83
Enhanced Streaming: mmap / krtpmsend
(figure: an RTP engine in the kernel adds the RTP
headers; the payload is appended as a descriptor
and sent by gather DMA - no copies remain)
  • 0 copy operations
  • 1 system call

→ previous solution: one more copy per packet
→ previous solution: one more call per packet
84
Enhanced Streaming: rtpsendfile
(figure: the RTP header copy is integrated into the
sendfile system call; the header is copied into the
socket buffer, the payload appended as a descriptor
and sent by gather DMA from the page cache)
  • n copy operations
  • n system calls

→ existing solution: three more calls per packet
85
Enhanced Streaming: krtpsendfile
(figure: an RTP engine in the kernel adds the RTP
headers; the payload is appended as a descriptor
and sent by gather DMA from the page cache)
  • 0 copy operations
  • 1 system call

→ previous solution: one more copy per packet
→ previous solution: one more call per packet
86
Enhanced Streaming Results
  • Tested streaming of a 1 GB file on Linux 2.6
  • RTP over UDP

(figure: the mmap-based mechanisms improve on the
existing streaming mechanism by 27%, the
sendfile-based mechanisms by 25%; TCP sendfile
(content download) shown for comparison)
87
Storage: Disks
88
Disks
  • Two resources of importance
  • storage space
  • I/O bandwidth
  • Several approaches to manage data on disks
  • specific disk scheduling and appropriate buffers
  • optimize data placement
  • replication / striping
  • prefetching
  • combinations of the above

89
Mechanics of Disks
  • Platters: circular plates covered with magnetic
    material to provide nonvolatile storage of bits
  • Spindle: the axle the platters rotate around
  • Tracks: concentric circles on a single platter
  • Sectors: segments of the track circle, usually
    512 bytes each, separated by non-magnetic gaps;
    the gaps are often used to identify the
    beginning of a sector
  • Cylinders: corresponding tracks on the different
    platters are said to form a cylinder
  • Disk heads: read or alter the magnetism (bits)
    passing under them; the heads are attached to an
    arm enabling them to move across the platter
    surface
90
Disk Specifications
  • Some existing (Seagate) disks today

Note 1: there is a difference between internal and
formatted transfer rate. Internal is only between
platter and read head. Formatted is after the
signals interfere with the electronics (cabling
loss, interference, retransmissions, checksums,
etc.)
Note 2: there is usually a trade-off between speed
and capacity
91
Disk Access Time
  • How do we retrieve data from disk?
  • position the head over the cylinder (track) on
    which the block (consisting of one or more
    sectors) is located
  • read or write the data block as the sectors move
    under the head while the platters rotate
  • The time between issuing a disk request and the
    time the block is resident in memory is called
    disk latency or disk access time

92
Disk Access Time
(figure: disk access time = seek time + rotational
delay + transfer time + other delays)
93
Disk Access Time: Seek Time
  • Seek time is the time to position the head
  • the disk requires a minimum amount of time to
    start and stop moving the head
  • some time is used for actually moving the head -
    roughly proportional to the number of cylinders
    traveled
  • Time to move the head:
    seek time ≈ (number of tracks × seek time
    constant) + fixed overhead
  • Typical averages: 10 ms - 40 ms (older disks),
    7.4 ms (Barracuda 180), 5.7 ms (Cheetah 36),
    3.6 ms (Cheetah X15)
94
Disk Access Time: Rotational Delay
  • Time for the disk platters to rotate so the
    first of the required sectors is under the disk
    head
  • Average delay is 1/2 revolution. Typical
    averages: 8.33 ms (3,600 RPM), 5.56 ms
    (5,400 RPM), 4.17 ms (7,200 RPM), 3.00 ms
    (10,000 RPM), 2.00 ms (15,000 RPM)
95
Disk Access Time: Transfer Time
  • Time for data to be read by the disk head, i.e.,
    the time it takes the sectors of the requested
    block to rotate under the head
  • Transfer rate ≈ data per track × rotations per
    second
  • Transfer time = amount of data to read /
    transfer rate
  • Example, Barracuda 180: 406 KB per track x 7,200
    RPM → 47.58 MB/s
  • Example, Cheetah X15: 316 KB per track x 15,000
    RPM → 77.15 MB/s
  • Transfer time is dependent on data density and
    rotation speed
  • If we have to change track, time must also be
    added for moving the head

Note: one might achieve these transfer rates only
when reading continuously from disk; time must be
added for seeks, etc.
96
Disk Access Time: Other Delays
  • There are several other factors which might
    introduce additional delays
  • CPU time to issue and process I/O
  • contention for controller
  • contention for bus
  • contention for memory
  • verifying block correctness with checksums
    (retransmissions)
  • waiting in the scheduling queue
  • ...
  • Typical values: ≈ 0 (except maybe waiting in the
    queue)

97
Disk Throughput
  • How much data can we retrieve per second?
  • Throughput = block size / (average seek time +
    average rotational delay + transfer time)
  • Example: for each operation we have
    - average seek
    - average rotational delay
    - transfer time
    - no gaps, etc.
  • Cheetah X15 (max 77.15 MB/s): 4 KB blocks →
    0.71 MB/s, 64 KB blocks → 11.42 MB/s
  • Barracuda 180 (max 47.58 MB/s): 4 KB blocks →
    0.35 MB/s, 64 KB blocks → 5.53 MB/s
    (a worked sketch follows below)
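
A worked version of the throughput formula in C;
the Cheetah X15 parameters are taken from the
preceding slides, and the results come out close to
(though, depending on the exact parameters assumed,
not necessarily identical to) the figures above:

    #include <stdio.h>

    /* throughput = block / (avg seek + avg rot. delay + transfer) */
    double throughput_MBps(double block_KB, double seek_ms,
                           double rot_ms, double rate_MBps)
    {
        double transfer_ms = block_KB / (rate_MBps * 1024.0) * 1000.0;
        return (block_KB / 1024.0) /
               ((seek_ms + rot_ms + transfer_ms) / 1000.0);
    }

    int main(void)
    {
        /* Cheetah X15: 3.6 ms seek, 2.0 ms rotational delay */
        printf("4 KB:  %.2f MB/s\n", throughput_MBps(4, 3.6, 2.0, 77.15));
        printf("64 KB: %.2f MB/s\n", throughput_MBps(64, 3.6, 2.0, 77.15));
        return 0;
    }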

98
Block Size
  • The block size may have large effects on
    performance
  • Example: assume random block placement on disk
    and sequential file access
  • doubling the block size will halve the number of
    disk accesses
  • each access takes some more time to transfer the
    data, but the total transfer time is the same
    (i.e., more data per request)
  • half of the seek times and rotational delays are
    omitted
  • e.g., when increasing the block size from 2 KB
    to 4 KB (no gaps, ...) for the Cheetah X15,
    typically an average of
  • 3.6 ms is saved in seek time
  • 2 ms is saved in rotational delay
  • 0.026 ms is added in transfer time
  • saving a total of 5.6 ms when reading 4 KB
    (49.8%)
  • increasing from 2 KB to 64 KB saves 96.4% when
    reading 64 KB
99
Block Size
  • Thus, increasing the block size can increase
    performance by reducing seek times and
    rotational delays (figure shows calculations for
    some older device)
  • But blocks spanning several tracks still
    introduce latencies
  • and a large block size is not always best
  • small data elements may occupy only a fraction
    of the block (fragmentation)
  • Which block size to use therefore depends on
    data size and data reference patterns
  • The trend, however, is to use large block sizes
    as new technologies appear with increased
    performance - at least in high data rate systems

100
Writing and Modifying Blocks
  • A write operation is analogous to a read
    operation
  • must add time for block allocation
  • a write operation may have to be verified - must
    wait another rotation and then read the block
    back to check that it is what we wanted to write
  • Total write time ≈ read time + (time for one
    rotation)
  • Cannot modify a block directly
  • read block into main memory
  • modify the block
  • write new content back to disk
  • (verify the write operation)
  • Total modify time ≈ read time + time to modify +
    write time

101
Disk Controllers
  • To manage the different parts of the disk, we
    use a disk controller, which is a small
    processor capable of
  • controlling the actuator moving the head to the
    desired track
  • selecting which platter and surface to use
  • knowing when the right sector is under the head
  • transferring data between main memory and disk
  • New controllers act like small computers
    themselves
  • both disk and controller now have their own
    buffers, reducing disk access time
  • data on damaged disk blocks/sectors is just
    moved to spare room on the disk - the system
    above (OS) does not know this, i.e., a block may
    lie elsewhere than the OS thinks

102
Efficient Secondary Storage Usage
  • Must take into account the use of secondary
    storage
  • there are large access time gaps, i.e., a disk
    access will probably dominate the total
    execution time
  • there may be huge performance improvements if we
    reduce the number of disk accesses
  • a slow algorithm with few disk accesses will
    probably outperform a fast algorithm with many
    disk accesses
  • Several ways to optimize .....
  • block size - 4 KB
  • file management / data placement - various
  • disk scheduling - SCAN derivative
  • multiple disks - a specific RAID level
  • prefetching - read-ahead prefetching
  • memory caching / replacement algorithms - LRU
    variant

103
Disk Scheduling
104
Disk Scheduling I
  • Seek time is the dominant factor of total disk
    I/O time
  • Let the operating system or disk controller
    choose which request to serve next, depending on
    the head's current position and the requested
    block's position on disk (disk scheduling)
  • Note that disk scheduling ≠ CPU scheduling
  • a mechanical device - hard to determine
    (accurate) access times
  • disk accesses cannot be preempted - a request
    runs until it finishes
  • disk I/O is often the main performance
    bottleneck
  • General goals
  • short response time
  • high overall throughput
  • fairness (equal probability for all blocks to be
    accessed in the same time)
  • Tradeoff: seek and rotational delay vs. maximum
    response time

105
Disk Scheduling II
  • Several traditional (performance oriented)
    algorithms
  • First-Come-First-Serve (FCFS)
  • Shortest Seek Time First (SSTF)
  • SCAN (and variations)
  • LOOK (and variations)

106
First-Come-First-Serve (FCFS)
  • FCFS serves the first arriving request first
  • Long seeks
  • Short average response time

(figure: incoming requests (in order of arrival):
12, 14, 2, 7, 21, 8, 24; the head serves them in
arrival order, moving back and forth across the
cylinders. Note: the lines only indicate some time,
not an exact amount)
107
Shortest Seek Time First (SSTF)
  • SSTF serves the closest request first
  • short seek times
  • longer maximum seek times - may even lead to
    starvation

(figure: same incoming requests 12, 14, 2, 7, 21,
8, 24; the head always serves the nearest requested
cylinder first)
108
SCAN
  • SCAN (elevator) moves the head from edge to edge
    and serves requests on the way
  • bi-directional
  • compromise between response time and seek time
    optimizations (see the sketch below)

(figure: incoming requests 12, 14, 2, 7, 21, 8, 24
from the scheduling queue are served as the head
sweeps across the cylinders)
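
A one-pass elevator sketch in C: serve everything
at or above the head position on the way up, then
the rest on the way down; the request queue is the
example above, while the start position 10 is an
assumption:

    #include <stdio.h>
    #include <stdlib.h>

    static int asc(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    void scan(int head, int *req, size_t n)
    {
        qsort(req, n, sizeof *req, asc);
        for (size_t i = 0; i < n; i++)   /* upward sweep */
            if (req[i] >= head)
                printf("serve %d\n", req[i]);
        for (size_t i = n; i-- > 0; )    /* downward sweep */
            if (req[i] < head)
                printf("serve %d\n", req[i]);
    }

    int main(void)
    {
        int req[] = { 12, 14, 2, 7, 21, 8, 24 };
        scan(10, req, sizeof req / sizeof *req);
        return 0;
    }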
109
SCAN vs. FCFS
  • Disk scheduling makes a difference!
  • In this case, we see that SCAN requires much
    less head movement compared to FCFS (37 vs. 75
    tracks/cylinders)

(figure: head movement over cylinders 1-25 for FCFS
and SCAN on the request sequence 12, 14, 2, 7, 21,
8, 24)
110
C-SCAN
  • Circular-SCAN moves the head from edge to edge
  • serves requests in one direction only -
    uni-directional
  • improves response time (fairness)

(figure: same request queue 12, 14, 2, 7, 21, 8,
24; requests are served on the upward sweep only,
then the head returns to the start)
111
SCAN vs. C-SCAN
  • Why is C-SCAN on average better in reality than
    SCAN when both service the same number of
    requests in two passes?
  • modern disks must accelerate (speed up and down)
    when seeking
  • head movement formula:
    seek time ≈ (number of tracks × seek time
    constant) + fixed overhead, i.e., roughly linear
    in the number of cylinders traveled if n is
    large

(figure: seek time as a function of cylinders
traveled)
112
LOOK and C-LOOK
  • LOOK (C-LOOK) is a variation of SCAN (C-SCAN)
  • same schedule as SCAN
  • does not run to the edges
  • stops and returns at the outer- and innermost
    requests
  • increased efficiency
  • SCAN vs. LOOK example:

(figure: for the queue 12, 14, 2, 7, 21, 8, 24,
LOOK turns at cylinders 2 and 24 instead of running
to the disk edges)
113
V-SCAN(R)
  • V-SCAN(R) combines SCAN (or LOOK) and SSTF
  • define an R-sized unidirectional SCAN window,
    i.e., C-SCAN, and use SSTF outside the window
  • Example: V-SCAN(0.6)
  • makes a C-SCAN window over 60% of the cylinders
  • uses SSTF for requests outside the window
  • V-SCAN(0.0) is equivalent to SSTF
  • V-SCAN(1.0) is equivalent to C-SCAN
  • V-SCAN(0.2) is supposed to be an appropriate
    configuration

(figure: the SCAN window moving over cylinders
1-25)
114
LAST WEEK!!
  • DISKS - SCHEDULING OF TRADITIONAL LOAD

115
What About Time-Dependent Media?
  • Suitability of classical algorithms
  • minimal disk arm movement (short seek times)
  • but no provision for time or deadlines
  • generally not suitable
  • For example, a continuous media server must
  • serve both periodic and aperiodic requests
  • never miss a deadline due to aperiodic requests
  • aperiodic requests must not starve
  • support multiple streams
  • buffer space and efficiency tradeoff?

116
Real-Time Disk Scheduling
  • Traditional algorithms have no provision for
    time or deadlines
  • Real-time algorithms are targeted for real-time
    applications with deadlines
  • Several proposed algorithms
  • earliest deadline first (EDF)
  • SCAN-EDF
  • shortest seek and earliest deadline by
    ordering/value (SSEDO / SSEDV)
  • priority SCAN (PSCAN)
  • ...

117
Earliest Deadline First (EDF)
  • EDF serves the request with the nearest deadline
    first
  • non-preemptive (i.e., an arriving request with a
    shorter deadline must wait)
  • excessive seeks → poor throughput

(figure: incoming requests <block, deadline> in
order of arrival: <12,5> <14,6> <2,4> <7,7> <21,1>
<8,2> <24,3>; the scheduling queue is served in
deadline order)
118
SCAN-EDF
  • SCAN-EDF combines SCAN and EDF
  • the real-time aspects of EDF
  • the seek optimizations of SCAN
  • especially useful if the end of the period is
    the deadline (many equal deadlines)
  • algorithm (see the sketch below)
  • serve requests with earlier deadlines first
    (EDF)
  • sort requests with the same deadline by track
    location (SCAN)

(figure: incoming requests <block, deadline> in
order of arrival: <2,3> <14,1> <9,3> <7,2> <21,1>
<8,2> <24,2> <16,1>; requests with equal deadlines
are served in SCAN order)

Note: similarly, we can combine EDF with C-SCAN,
LOOK or C-LOOK
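
The SCAN-EDF ordering fits in a single comparator;
the request record is an illustrative assumption,
and ties are broken here by a simple ascending
track sweep:

    #include <stdlib.h>

    struct request {
        int track;
        int deadline;
    };

    static int scan_edf_cmp(const void *a, const void *b)
    {
        const struct request *ra = a, *rb = b;
        if (ra->deadline != rb->deadline)
            return ra->deadline - rb->deadline;  /* EDF first */
        return ra->track - rb->track;            /* SCAN for ties */
    }

    void scan_edf_sort(struct request *q, size_t n)
    {
        qsort(q, n, sizeof *q, scan_edf_cmp);
    }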
119
Stream-Oriented Disk Scheduling
  • Streams often have soft deadlines and tolerate
    some slack due to buffering, i.e., pure
    real-time scheduling is inefficient and
    unnecessary
  • Stream-oriented algorithms are targeted for
    streaming continuous media data requiring
    periodic access
  • Several algorithms proposed
  • group sweep scheduling (GSS)
  • mixed disk scheduling strategy
  • continuous media file system (CMFS)
  • lottery scheduling
  • stride scheduling
  • batched SCAN (BSCAN)
  • greedy-but-safe EDF (GS_EDF)
  • bubble up

120
Group Sweep Scheduling (GSS)
  • GSS combines Round-Robin (RR) and SCAN
  • requests are serviced in rounds (cycles)
  • principle
  • divide the S active streams into G groups
  • service the G groups in RR order
  • service each stream in a group in C-SCAN order
  • playout can start at the end of the group
  • special cases
  • G = S: RR scheduling
  • G = 1: SCAN scheduling
  • tradeoff between buffer space and disk arm
    movement
  • try different values for G giving minimum buffer
    requirements - select the minimum
  • a large G → smaller groups, more arm movements
  • a small G → larger groups, fewer arm movements

121
Group Sweep Scheduling (GSS)
  • GSS example: streams A, B, C and D → g1 = {A,C}
    and g2 = {B,D}
  • RR group schedule
  • C-SCAN block schedule within a group

(figure: blocks A1-A3, B1-B3, C1-C3, D1-D3 laid out
over 25 cylinders; rounds alternate g1 = {A,C} and
g2 = {B,D}, serving each group's blocks in C-SCAN
order: A1 C1 | B1 D1 | C2 A2 | B2 D2 | A3 C3 |
B3 D3)
122
Mixed-Media Oriented Disk Scheduling
  • Applications may require both RT and NRT data -
    it is desirable to have all of it on the same
    disk
  • Several algorithms proposed
  • Fellini's disk scheduler
  • Delta L
  • Fair mixed-media scheduling (FAMISH)
  • MARS scheduler
  • Cello
  • Adaptive disk scheduler for mixed media
    workloads (APEX)

123
MARS Disk Scheduler
  • The Massively-parallel And Real-time Storage
    (MARS) scheduler supports mixed media on a
    single system
  • a two-level scheduling
  • round-based
  • top-level: 1 NRT queue and n (≥1) RT queues
    (SCAN, but in the future GSS, SCAN-EDF, or ...)
  • use deficit RR fair queuing to assign quantums
    to each queue per round - divides total
    bandwidth among queues
  • bottom-level: select requests from the queues
    according to quantums, use SCAN order
  • work-conserving (variable round times, a new
    round starts immediately)

(figure: NRT and RT queues feeding a deficit round
robin fair queuing job selector)
124
Cello and APEX
  • Cello and APEX are similar to MARS, but slightly
    different in bandwidth allocation and work
    conservation
  • Cello has
  • three queues: deadline (EDF),
    throughput-intensive best effort (FCFS),
    interactive best effort (FCFS)
  • static proportional allocation scheme for
    bandwidth
  • FCFS ordering of queue requests in the
    lower-level queue
  • partially work-conserving: extra requests might
    be added at the end of the class-independent
    scheduler, but constant rounds
  • APEX has
  • n queues
  • uses token buckets for traffic shaping
    (bandwidth allocation)
  • work-conserving: adds extra requests if possible
    to a batch, starts an extra batch between
    ordinary batches

125
Cello
  • Cello is part of the Symphony FS supporting
    mixed media
  • two-level scheduler