Title: Performance Analysis and Optimization
 1Performance Analysis and Optimization
- Kayhan Atesci 
 - www.auburn.edu/atescka
 
  2Overview
- Theoretical Preliminaries 
 - NP-Completeness 
 - Analyzing RT Systems 
 - The Halting Problem 
 - Amdahl's Law 
 - Gustafson's Law 
 
  3Overview (contd)
- Performance Analysis 
 - Execution Time Estimation 
 - Instruction Counting 
 - Instruction Execution-Time Simulators 
 - Using the System Clock 
 - Analysis of Polled Loops 
 - Analysis of Coroutines 
 - Analysis of Round-Robin Systems 
 - Analysis of Fixed-Period Systems
 
  4Overview (cont.)
- Performance Analysis (cont.) 
 - Analysis of Sporadic and Aperiodic Interrupt Systems
 - Interrupt Latency 
 - Instruction Completion Times 
 - Deterministic Performance
 
  5Intro to Theoretical Preliminaries
- Performance analysis is one of the fields where theory and practice do not coincide
 - The formulas
  - ignore resource contention
  - assume theoretically artificial hardware
  - assume zero context-switch time
 - Not totally useless, but the results are less realistic
 
  6NP-Completeness
- NP: the class of problems for which a candidate solution can be verified in polynomial time, even if no polynomial-time algorithm for finding one is known
 - NP-Complete: a problem in NP to which every other problem in NP is polynomially transformable
 - NP-Hard: a problem to which every problem in NP is polynomially transformable, but which is not necessarily in NP itself
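The gap between checking a candidate solution and finding one can be illustrated with subset sum, a well-known NP-complete problem. This is only a sketch: the function names `verify` and `brute_force` and the sample numbers are assumptions made for illustration, not part of the slides.

```python
from itertools import combinations


def verify(numbers, target, candidate_indices):
    """Check a candidate solution in time polynomial in the input size."""
    return (
        len(set(candidate_indices)) == len(candidate_indices)
        and all(0 <= i < len(numbers) for i in candidate_indices)
        and sum(numbers[i] for i in candidate_indices) == target
    )


def brute_force(numbers, target):
    """Exhaustive search: may try all 2**len(numbers) subsets."""
    for size in range(len(numbers) + 1):
        for indices in combinations(range(len(numbers)), size):
            if verify(numbers, target, indices):
                return indices
    return None


if __name__ == "__main__":
    sample = [3, 34, 4, 12, 5, 2]  # hypothetical instance
    print(brute_force(sample, 9))
```

Verification touches each chosen index once, while the search loop grows exponentially with the number of items.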
  7Challenges in Analyzing RTS
- Despite 30 years of research: strict practical constraints, NP-complete problems, etc.
 - Mutual exclusion constraints make it impossible to find an optimal scheduler
 - Earliest-deadline scheduling is not optimal in multiprocessing
 - Prior knowledge of deadlines, computation times, and task start times is needed
  8Challenges  contd
- Examples of NP-complete and NP-hard problems
 - Is it possible to schedule a set of periodic processes that use semaphores only to enforce mutual exclusion?
 - Multiprocessor scheduling problems with either two or three processors, with either one resource or none, and with an arbitrary or specified partial order
  9The Halting Problem
- Is there a computer program that takes an arbitrary program, Pi, and an arbitrary set of inputs, Ik, and determines whether or not Pi will halt on Ik?
 - No, there is not. 
 
  10The Halting Problem  contd
[Figure: an oracle takes the source code of an arbitrary program, Pi, and a set of inputs to the program, Ik, and outputs a halt or no-halt decision.]
 11The Halting Problem  contd 
 12The Halting Problem  contd
- Why is it relevant to RT systems?
 - Consider a schedulability analyzer
  - It takes an arbitrary program and the set of all possible inputs to that program and determines the best-, worst-, and average-case execution times
 - The running time can be determined given a specific language, a fixed set of inputs, and the instruction execution times
 - But the result is NOT GENERALIZABLE to arbitrary programs!
 
  13Amdahl's Law
- A statement about the level of parallelization (speedup) that can be achieved by a parallel computer
 - n: the number of processors available for parallel processing
 - s: the fraction of the code that cannot be parallelized
 - 1 - s: the fraction of the code that can be parallelized
  14Amdahl's Law (cont.)
- Speedup
 \[
 \frac{s + (1 - s)}{s + \frac{1 - s}{n}} = \frac{1}{s + \frac{1 - s}{n}} = \frac{n}{ns + 1 - s} = \frac{n}{1 + s(n - 1)}
 \]
 
  15Amdahl's Law  contd
- s = 0: the speedup is n, linear speedup!
 - s = 0.1:
  - The processors working on the remaining 90% (the parallelizable code) will end up waiting for the single processor to finish the last 10%.
 - Revealed an insoluble problem in the field of parallel computers: limited efficiency and limited application of parallelism
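A minimal sketch of Amdahl's formula, speedup = n / (1 + s(n - 1)), for readers who want to try other values of s and n. The function name `amdahl_speedup` and the processor counts in the loop are assumptions chosen for illustration.

```python
def amdahl_speedup(s: float, n: int) -> float:
    """Speedup n / (1 + s(n - 1)) for serial fraction s on n processors."""
    if not 0.0 <= s <= 1.0 or n < 1:
        raise ValueError("need 0 <= s <= 1 and n >= 1")
    return n / (1 + s * (n - 1))


if __name__ == "__main__":
    # Compare a fully parallel program (s = 0) with one that is 10% serial.
    for n in (2, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.0, n), 2), round(amdahl_speedup(0.1, n), 2))
```

With s = 0.1 the speedup approaches but never reaches 1/s = 10, no matter how many processors are added.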
  16Gustafson's Law
- Demonstrated with a 1024-processor system that the basic presumptions in Amdahl's Law are inappropriate for parallelism
 - Found that in practice the problem size scales with the number of processors: with a more powerful processor, the problem expands to make use of the increased facilities, so assuming a fixed problem size is inappropriate.
  17Gustafson's Law  contd
- Demonstrated that only the parallel or vector part of a program scales with the problem size.
 - Times for vector startup, program loading, serial bottlenecks, and I/O, which make up the serial component of the run, do not grow with the problem size
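  18Gustafson's Law  contd
- Formulation
 - s: serial time
 - p: parallel time (1 - s)
 - n: number of processors
 - time required on a single processor: $s + p\,n$, so the scaled speedup is $(s + p\,n)/(s + p) = n + (1 - n)\,s$
 - Much more optimistic than Amdahl's law: much easier to achieve parallelism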
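The scaled-speedup expression s + p n can be evaluated directly. The sketch below assumes the normalization s + p = 1 from the formulation above; the function name `gustafson_speedup` is hypothetical.

```python
def gustafson_speedup(s: float, n: int) -> float:
    """Scaled speedup s + p*n with p = 1 - s, which equals n + (1 - n)*s."""
    if not 0.0 <= s <= 1.0 or n < 1:
        raise ValueError("need 0 <= s <= 1 and n >= 1")
    p = 1.0 - s  # parallel fraction of the run on the parallel machine
    return s + p * n


if __name__ == "__main__":
    for n in (2, 8, 64, 1024):
        print(n, round(gustafson_speedup(0.1, n), 1))
```

For s = 0.1 and n = 1024 this gives about 921.7, whereas a fixed-size problem with the same serial fraction stays below a speedup of 10.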
  19So far
- Theoretical Preliminaries 
 - NP-Completeness 
 - Challenges in Analyzing RTS 
 - The Halting Problem 
 - Amdahl's Law 
 - Gustafson's Law 
 
  20Performance Analysis
- There is a natural desire to know in advance whether a system will meet its deadlines
 - An exact answer is rarely possible due to NP-completeness and physical constraints
 - However, it is possible to get a handle on the system's performance
 - Important because CPU utilization requirements are stated as design goals, and knowing them up front is important in selecting the appropriate hardware and system design approach
  21Code Execution Time Estimation
- Best method: logic analyzer (Ch. 8)
 - Advantage: H/W latencies and other delays are taken into account
 - Drawback: the system must be completely coded and the target H/W available
 - Usually employed in the late stages of coding, testing, and during system integration.
 
  22Instruction Counting
- When it is too early for a logic analyzer, or one is not available, instruction counting is the best method of determining CPU utilization due to code execution time
 - Involves tracing the longest path through the code, adding up the instruction execution times along the way
 - Requirements
  - The code must already be written, or an approximation of the final code must exist
  - Actual instruction times must be known
 
  23Instruction Counting  contd 
 24Instruction Counting  contd
- Path 1
  - 7 instructions @ 6 µsec = 42 µsec
 - Paths 2 & 3
  - 9 instructions @ 6 µsec = 54 µsec
 - Utilization
  - 0.054 ms / 5 ms = 1.08%
 - Can be automated with a parser for the target assembly language
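The arithmetic on this slide can be sketched as follows. It assumes, as the slide's numbers suggest, that every instruction takes 6 µsec and that the "5" in the utilization figure is a 5 ms cycle time; the helper names are hypothetical.

```python
def path_time_us(instruction_count: int, time_per_instruction_us: float) -> float:
    """Execution time of one path, assuming a uniform instruction time."""
    return instruction_count * time_per_instruction_us


def utilization(worst_path_us: float, cycle_us: float) -> float:
    """Fraction of each cycle consumed by the worst-case path."""
    return worst_path_us / cycle_us


# Instruction counts per path, taken from the slide.
paths = {"path 1": 7, "paths 2 and 3": 9}
times = {name: path_time_us(count, 6.0) for name, count in paths.items()}

worst = max(times.values())                   # longest path through the code
cpu_utilization = utilization(worst, 5000.0)  # assumed 5 ms cycle, in µsec

if __name__ == "__main__":
    print(times, worst, f"{cpu_utilization:.2%}")
```

A real instruction counter would look up each opcode's time in a table rather than assume one uniform value.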
 25Instruction Execution-Time Simulators
- Requires more than just the information supplied in the CPU manufacturer's data books.
 - Execution time also depends on memory access times and wait states.
 - Simulation programs
  - take as input the CPU type, memory speeds, and an instruction mix
  - output total instruction time and throughput
 
  26Using the System Clock
- Code can be timed by reading the system clock before and after its execution
 - If the code takes only a few microseconds, it is better to execute it a few thousand times and divide the elapsed time by the number of iterations.
 - This helps to remove any inaccuracy introduced by the granularity of the clock.
  27Using the System Clock  contd
- More iterations: better precision
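A minimal sketch of the technique, assuming Python's `time.perf_counter` stands in for the system clock. The function name `time_per_call` and its `clock` parameter are hypothetical; the parameter exists only so that the clock can be swapped out. The empty loop is timed separately so that loop overhead can be subtracted.

```python
import time


def time_per_call(func, iterations=1000, clock=time.perf_counter):
    """Estimate the execution time of func() by repeated execution."""
    # Time an empty loop to measure the loop overhead alone.
    start = clock()
    for _ in range(iterations):
        pass
    loop_time = clock() - start

    # Time the same loop with the code under test inside it.
    start = clock()
    for _ in range(iterations):
        func()
    total_time = clock() - start

    # Remove the loop overhead and average over the iterations.
    return (total_time - loop_time) / iterations


if __name__ == "__main__":
    print(time_per_call(lambda: sum(range(100)), iterations=10_000))
```

On a loaded machine the result can vary from run to run, so it is an estimate rather than a worst-case bound.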
 28So far
- Theoretical Preliminaries 
 - NP-Completeness 
 - Analyzing RT Systems 
 - The Halting Problem 
 - Amdahl's Law 
 - Gustafson's Law 
 - Performance Analysis 
 - Execution Time Estimation 
 - Instruction Counting 
 - Instruction Execution-Time Simulators 
 - Using the System Clock 
 
  29Analysis of Polled Loops
- 3 components
 - The H/W delays involved in the setting of the S/W flag by some external device
 - The time for the polled loop to test the flag
 - The time needed to process the event associated with the flag
  30Polled Loops - contd
- First delay: on the order of nanoseconds, can be ignored
 - Second delay: on the order of several microseconds
 - Third delay: depends on the process involved
 - If events overlap, the response time is worse
 
  31Polled Loops  contd
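- f: the time needed to check the flag
 - P: the time to process the event
 - n: the number of overlapping events
 - Response time for the nth overlapping event is bounded by
  - $n\,(f + P)$, since each of the n events needs one flag check and one processing pass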
 
  32Analysis of Coroutines
- The absence of interrupts makes this easy
 - Response time is found by tracing the worst-case path through each of the tasks
  33Analysis of Round-Robin Systems
- Assumptions 
 - n processes in the ready queue 
 - No new ones arrive after the system starts 
 - None terminate prematurely 
 - The release time is arbitrary 
 - All processes have maximum end-to-end execution time c
 - The time slice (quantum) is q
 
  34Round-Robin Systems  contd
- In practice, if a process completes before the end of a time quantum, that slack time would be assigned to the next ready process.
 - However, assume it will not.
 - This does not hurt the analysis because an upper bound is desired.
  35Analysis of Round-Robin Systems (cont.)
- Each process receives 1/n of the CPU time in chunks of q
 - Each process waits no longer than $(n - 1)\,q$ time units for its next quantum
 - Each process requires at most $\lceil c/q \rceil$ time quanta
 - Each context switch takes o time units
 - Waiting time $= [(n - 1)\,q + n\,o]\,\lceil c/q \rceil$
 
  36Round-Robin Systems  contd
- Worst-case time from readiness to completion is the waiting time plus the undisturbed time to complete, c
 - $T = [(n - 1)\,q + n\,o]\,\lceil c/q \rceil + c$
 
  37Analysis of Round-Robin Systems (cont.)
- Ex: Consider six processes with a maximum execution time of 600 ms. The time quantum, q, is 40 ms, and each context switch costs 2 ms.
 - n = 6, c = 600, q = 40, o = 2
 - $T = [(6 - 1) \times 40 + (6 \times 2)] \times \lceil 600/40 \rceil + 600 = 212 \times 15 + 600$
 - $= 3780$ ms
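The bound can be evaluated mechanically. The sketch below assumes the formula as stated on the preceding slides, with waiting time [(n - 1)q + n o] per quantum needed and c/q rounded up to a whole number of quanta; the function name `rr_worst_case` is hypothetical.

```python
import math


def rr_worst_case(n, c, q, o=0):
    """Worst-case readiness-to-completion time under round-robin.

    n: processes in the ready queue, c: maximum execution time,
    q: time quantum, o: context-switch cost (same time unit throughout).
    """
    quanta = math.ceil(c / q)                  # quanta each process needs
    waiting = ((n - 1) * q + n * o) * quanta   # time spent waiting for turns
    return waiting + c                         # plus undisturbed execution


if __name__ == "__main__":
    # Values from the example: six processes, c = 600 ms, q = 40 ms, o = 2 ms.
    print(rr_worst_case(6, 600, 40, 2))
```

Setting o = 0 recovers the bound without context-switch overhead.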
 
  38Round-Robin Systems  contd
- To achieve fair behavior, q must be less than c
 - Otherwise, the round-robin algorithm becomes a first-come, first-served algorithm in which each process executes to completion in order of arrival, which favors longer processes.
  39Response Time Analysis for Fixed-Period Systems
- For the highest-priority task, the worst-case response time equals its own execution time.
 - Other tasks in the system are subject to interference caused by the execution of higher-priority tasks.
  40Analysis for Fixed-Period Systems (cont.)
- For a general task, Ti, the response time, Ri, is given as $R_i = e_i + I_i$
 - Where
  - $I_i$ is the maximum amount of delay in execution caused by higher-priority tasks
  - $e_i$ is the execution time of the current task
 
  41Analysis for Fixed-Period Systems (cont.)
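- The maximum interference Ti will face from a higher-priority task Tj, with period $p_j$ and execution time $e_j$, is $\lceil R_i / p_j \rceil\, e_j$
 - Each task of higher priority interferes with task Ti, so $I_i = \sum_{j} \lceil R_i / p_j \rceil\, e_j$, summed over all tasks Tj of higher priority than Ti
 - Which yields $R_i = e_i + \sum_{j} \lceil R_i / p_j \rceil\, e_j$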
  42Analysis for Fixed-Period Systems (cont.)
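- But this can be very difficult to solve for Ri, since Ri appears on both sides
 - Recursive solution: $R_{n+1,i} = e_i + \sum_{j} \lceil R_{n,i} / p_j \rceil\, e_j$, summed over all tasks Tj of higher priority than Ti
 - Where $R_{n,i}$ is the response for the nth iteration.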
 
  43Analysis for Fixed-Period Systems (cont.)
- To use the recurrence relation to find response times, compute $R_{n+1,i}$ iteratively until the first value m is found such that $R_{m+1,i} = R_{m,i}$; $R_{m,i}$ is then the response time.
 - If the equation does not have a solution, the value of Ri will continue to rise, as is the case when a task set has a utilization greater than 100%.
  44Response-Time Analysis RMA Example
- The highest-priority task, T1, will have a response time equal to its execution time
 - $R_1 = 3$
 
  45Response-Time Analysis RMA Example
- T2 response time
 - $R_{1,2} = 4 + \lceil 4/9 \rceil \times 3 = 7$
 - $R_{2,2} = 4 + \lceil 7/9 \rceil \times 3 = 7$
 - Since $R_{1,2} = R_{2,2}$, the response time of T2 is 7
 
  46Response-Time Analysis RMA Example
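- T3 response time
 - $R_{1,3} = 2 + \lceil 2/9 \rceil \times 3 + \lceil 2/12 \rceil \times 4 = 9$
 - $R_{2,3} = 2 + \lceil 9/9 \rceil \times 3 + \lceil 9/12 \rceil \times 4 = 9$
 - Since $R_{1,3} = R_{2,3}$, the response time of the lowest-priority task is 9.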
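The iteration used in this example can be sketched in a few lines. The task parameters below are read off the worked calculations (T1: e = 3, p = 9; T2: e = 4, p = 12; T3: e = 2) and are therefore an inference, since the task table itself is not reproduced here. The function name `response_time` and the `limit` cutoff are hypothetical.

```python
import math


def response_time(e_i, higher, limit=10_000):
    """Iterate R = e_i + sum(ceil(R / p_j) * e_j) to a fixed point.

    higher: list of (e_j, p_j) pairs for all higher-priority tasks.
    Returns None if the value exceeds limit, which is treated here
    as "no solution" (for example, utilization above 100%).
    """
    r = e_i  # start the iteration from the task's own execution time
    while r <= limit:
        r_next = e_i + sum(math.ceil(r / p_j) * e_j for e_j, p_j in higher)
        if r_next == r:
            return r  # two successive iterations agree
        r = r_next
    return None


if __name__ == "__main__":
    print(response_time(3, []))                  # T1
    print(response_time(4, [(3, 9)]))            # T2
    print(response_time(2, [(3, 9), (4, 12)]))   # T3
```

Starting from the task's own execution time keeps the sequence non-decreasing, so it either settles on a fixed point or grows past the cutoff.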
  47Analysis of Sporadic and Aperiodic Interrupt Systems
- Ideally modeled as a rate-monotonic system, with the non-periodic tasks modeled as having a period equal to their worst-case expected interarrival time.
 - This may lead to unacceptably high utilizations
 - The response-time calculation depends on interrupt latency, dispatching times, and context-switch times
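To see why this modeling choice can inflate utilization, the sketch below sums e/p over a task set in which a sporadic task is assigned a period equal to its worst-case interarrival time. All task values are hypothetical and chosen only for illustration, as is the function name `utilization`.

```python
def utilization(tasks):
    """Total CPU utilization: sum of e/p over (execution_time, period) pairs."""
    return sum(e / p for e, p in tasks)


# Hypothetical periodic tasks as (execution time, period) in ms.
periodic = [(3, 9), (4, 12)]

# Hypothetical sporadic task: 2 ms of work, typically arriving every 40 ms
# but with a worst-case interarrival time of 4 ms.
sporadic_typical = (2, 40)
sporadic_worst_case = (2, 4)

u_typical = utilization(periodic + [sporadic_typical])
u_worst = utilization(periodic + [sporadic_worst_case])

if __name__ == "__main__":
    print(round(u_typical, 3), round(u_worst, 3))
```

In this made-up case the worst-case model pushes utilization above 100% even though the typical load is comfortable.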
  48Interrupt Latency
- The period between when a device requests an interrupt and when the first instruction of the associated H/W interrupt service routine executes.
 - Worst-case INT latency must be considered!
 - It typically occurs when all possible INTs in the system are requested simultaneously.
  49Interrupt Latency  contd
- Worst-case latency is also affected by the number of threads or processes running.
 - Typically an RTOS needs to disable INTs while it is processing a large number of threads or processes.
 - If the design of the system requires a large number of threads or processes, it is necessary to perform latency measurements to check that the scheduler is not disabling INTs for an unreasonably long period of time.
  50Instruction Completion Time
- A contributor to interrupt latency
 - It is necessary to find the execution time of every macroinstruction by calculation, measurement, or from the manufacturer's data sheets.
  51Instruction Completion Time  contd
- The instruction with the longest execution time in the code maximizes the contribution to interrupt latency if it has just begun executing when the INT signal is received.
 - Ex: A system has instructions that take 10 microseconds, 50 microseconds, and 250 microseconds. The largest contribution of instruction completion to INT latency is 250 microseconds.
  52Deterministic Performance
- Cache, pipelines, DMA
 - All designed to improve average RT performance
 - But they destroy determinism, making RTS performance difficult to analyze and predict
 - Worst-Case Scenarios
  - Cache: it must be assumed that every instruction is fetched not from the cache but from memory.
  53Deterministic Performance  contd
- Worst-Case Scenarios (cont.)
 - Pipelines: one must always assume that the pipeline is flushed at every possible opportunity
 - DMA: it must be assumed that cycle stealing is occurring at every opportunity
 - By making some reasonable assumptions about the impact of these effects, a rational approximation of performance is possible
  54In Review
- Theoretical Preliminaries 
 - NP-Completeness 
 - Analyzing RT Systems 
 - The Halting Problem 
 - Amdahl's Law 
 - Gustafson's Law 
 
  55In Review
- Performance Analysis 
 - Execution Time Estimation 
 - Instruction Counting 
 - Instruction Execution-Time Simulators 
 - Using the System Clock 
 - Analysis of Polled Loops 
 - Analysis of Coroutines 
 - Analysis of Round-Robin Systems 
 - Analysis of Fixed-Period Systems
 
  56In Review
- Performance Analysis (cont.) 
 - Analysis of Sporadic and Aperiodic Interrupt Systems
 - Interrupt Latency 
 - Instruction Completion Time 
 - Deterministic Performance
 
  57Questions
- What is an NP-hard problem? How does it differ from an NP-complete problem?
 - Using Amdahl's law, calculate the speedup for a code that is 70 percent parallelizable on 2 processors.
 - Why is performance analysis for RTS important? Why can one not do 100% realistic performance analysis?
  58Questions  Contd
- The execution time of the function to be timed on slide 27 was estimated using the system clock technique as follows:
  - Total time: 2 microseconds
  - Time1: 10000 microseconds
  - Time2: 6000 microseconds
  - No. of iterations: 1000
  - What is the loop time?
 
  59Questions  Contd
- Why must q be less than c in round-robin systems?
 - What is interrupt latency? Under what condition does the worst-case interrupt latency occur?
 - How should one act to maintain determinism in RTS to some extent? (Hint: 3 components)
 - Why should one ALWAYS assume the worst case when doing RTS performance analysis?