Title: CENG 446
1. CENG 446 Advanced Computer Architectures
- Dr. Brian T. Hemmelman
- Chapter 1 Slides
2. The Many Faces of Computers
- Desktops: Generally for a single user running a wide variety of software.
- Servers: Handle many tasks for many users (scientific, database, Web).
- Supercomputers: Typically designed for intensive scientific and engineering modeling and computation.
- Datacenters: Massive collections of processors or servers to handle search engines or e-commerce.
- Embedded Computers: Contained within or are part of other systems that perform a particular function (traffic light, automobile, microwave, smartphone, fly-by-light avionics system, etc.).
3. The Pervasiveness of Computers
4. What's Going On Behind Your Program
- There are multiple layers of activity taking place for a computer to perform a particular task.
- Application Software: A particular program designed to accomplish a specific task.
- Systems Software: Software that provides services that are commonly useful.
- Operating System: Supervising program that manages the resources of a computer.
- Compilers: Translate a program written in a high-level language such as C or Java into instructions that the hardware can execute.
- Hardware: The physical electronics and circuitry that perform the actual calculations or operations.
5. Hardware/Software Hierarchy
6. Bridging the Hardware/Software Interface
- The actual hardware uses voltage signals to represent information.
- Boolean Logic: Variable True/False.
- Data: Collections of 1s and 0s (bits).
- Computers must be told how to manipulate the information through instructions.
- Instructions are collections of 1s and 0s that force the hardware to do something with other collections of 1s and 0s.
7. Bridging the Hardware/Software Interface
- The binary instructions that are actually executed by the circuitry are called machine language or opcodes.
- Creating all the binary instructions directly is cumbersome and tedious, so helper programs were created to translate a symbolic notation of the desired operation into the necessary bit pattern for the machine.
- These programs are called assemblers, and the symbolic notation is called assembly language.
- Eventually, assemblers were used to write more powerful translators that allowed a user to focus on the task algorithm instead of circuitry-dependent operations.
- These programs are the compilers, and the algorithmic codes are written in a high-level programming language.
8. The Journey of a Programmer's Wish Into Physical Reality
9. Advantages of High-Level Programming Languages
- The programmer focuses on the algorithm and describes it in a language more natural to humans.
- The programmer increases productivity, as fewer lines of code are needed to describe how to execute the task. The process of converting the algorithm into machine language is automated by experts in translation. The programmer can focus on becoming an algorithm expert instead of an everything expert.
- Programs and algorithms can be designed that are largely independent of the specific processor or circuitry on which they will execute.
- These three advantages are so strong that today little programming is done in assembly language.
10. The Five Classic Computer Components
- Input
- Output
- Memory
- Datapath
- Control
11. Input/Output Examples
- Keyboard
- Mouse
- Analog-to-Digital Converter
- Monitor
- Network Connection
- Pulse-Width-Modulation (PWM) Signal
12. Memory
- External memory is almost always made of DRAM
chips.
13. Memory
- Internal memory is usually made of SRAM memory
cells.
14. More Memory
- Memory can also be categorized as volatile or nonvolatile.
- Volatile memory: Only stores data when power is on (e.g., DRAM or cache).
- Nonvolatile memory: Data integrity is maintained even with no power (e.g., hard drive, FLASH, DVD).
- We could also distinguish main memory from secondary memory.
- Main memory: Volatile; where the program runs and its data is updated.
- Secondary memory: Nonvolatile; where program and data are stored between runs.
15. Datapath and Control (The Processor or CPU)
- Datapath: Performs the arithmetic operations.
- Control: Tells the datapath, memory, and I/O devices what to do.
16. Changing Technology
- The increases in memory capacity and processor clock speed have been incredible.
17. Changing Technology
- Computer performance has likewise been continuously increasing.
18. Power - The Limiting Factor
Power consumption is directly proportional to the clock rate. Trying to increase computer performance by only increasing the clock rate is no longer feasible, as heat dissipation becomes a limiting factor. Hence the push toward multicore processor chips running at somewhat slower clock speeds.
19. Computer Performance
- So how do we know if one computer is better than a different computer?
- The answer, unfortunately, is not a simple one.
- Computer performance is largely application specific. Different applications have different needs and objectives, so no one performance measure is automatically "The Best."
- Clock frequency used to be an easy way to measure performance, but this is far too simple an approach for most systems today.
20. Defining Performance
- Book's example: Different airplanes with different ranges and speeds.
- Best performance could be defined in terms of greatest range, fastest cruising speed, moving the most passengers the quickest, etc.
21. Defining Performance
- For desktops, laptops, supercomputers, and embedded computers you are primarily interested in response time (execution time).
- Response time: The total time required for the computer to complete a task, including disk accesses, memory accesses, I/O activities, operating system overhead, CPU execution time, and so on.
22. Comparing Performance
- Therefore, if comparing two computers X and Y, we could conclude:

Performance_X > Performance_Y if Execution time_Y > Execution time_X

- We can then also compute ratios of performance as:

(Performance_X)/(Performance_Y) = (Execution time_Y)/(Execution time_X) = n

meaning X is n times faster than Y.
23. CPU Performance
- CPU Execution Time: The time the CPU itself spends computing a particular task (does not include time spent waiting for I/O or running other programs).
- We could look at this from the simple perspective of the number of clock cycles it takes the CPU to complete the task. Thus,

CPU execution time for TaskA = (CPU clock cycles for TaskA) × (Clock cycle time)
CPU execution time for TaskA = (CPU clock cycles for TaskA)/(Clock rate)

- Improving performance can then be achieved by reducing the clock cycles required for the task (perhaps by more powerful instructions and hence more complicated circuitry) or by decreasing the clock cycle time (increasing clock frequency).
- Decreasing clock cycle time tends to increase power consumption, and decreasing the clock cycles needed tends to increase clock cycle time.
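The two equivalent forms of the execution time equation can be sketched as a quick calculation (a minimal example; the 10-billion-cycle task and 4 GHz clock are made-up numbers, not from the slides):

```python
# Two equivalent forms of CPU execution time:
#   execution time = clock cycles x clock cycle time
#   execution time = clock cycles / clock rate
def execution_time_from_cycle_time(clock_cycles, cycle_time_s):
    return clock_cycles * cycle_time_s

def execution_time_from_clock_rate(clock_cycles, clock_rate_hz):
    return clock_cycles / clock_rate_hz

# Hypothetical task: 10 billion clock cycles on a 4 GHz processor
# (cycle time = 1 / 4 GHz = 0.25 ns).
cycles = 10e9
t1 = execution_time_from_cycle_time(cycles, 0.25e-9)
t2 = execution_time_from_clock_rate(cycles, 4e9)
print(t1, t2)  # both give 2.5 seconds
```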
24. Instruction Performance
- Addressing clock frequency, the power wall, and heat dissipation is a complete problem unto itself that will not specifically be covered in this class.
- We can, however, look into decreasing the CPU clock cycles needed to complete a task.

CPU clock cycles for TaskA = (Instructions for TaskA) × (Average clock cycles per instruction)

- CPI (clock cycles per instruction): The average number of clock cycles each instruction takes to execute.
- Substituting this expression into the equation on the previous slide, we obtain

CPU execution time = (Instruction count) × CPI × (Clock cycle time)
25. The Classic CPU Performance Equation
- Our challenge as hardware designers is to optimize the balance between the instruction count and the CPI.
- How efficiently do we design the digital circuitry? Can we create different instructions that accomplish more without changing the clock cycle time?
- Overall performance will also be affected by how well compilers utilize the instructions available in the instruction set implemented in hardware.
26. Example, Page 34
Suppose we have two implementations of the same instruction set architecture. Computer A has a clock cycle time of 250 ps and a CPI of 2.0 for some program, and computer B has a clock cycle time of 500 ps and a CPI of 1.2 for the same program. Which computer is faster for this program, and by how much?

Both computers need to execute the same number of instructions, I, as they are running the same program.

CPU_A time = I × 2.0 × 250 ps = 500 × I ps
CPU_B time = I × 1.2 × 500 ps = 600 × I ps

(CPU_A performance)/(CPU_B performance) = (CPU_B time)/(CPU_A time) = 600I/500I = 1.2

CPU_A will run 1.2 times faster.
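The comparison can be checked with a short script. Since both machines execute the same instruction count I, it cancels out of the ratio, so we can work with time per instruction:

```python
# Time per instruction = CPI x clock cycle time; the shared instruction
# count I cancels when forming the performance ratio.
def time_per_instruction_ps(cpi, cycle_time_ps):
    return cpi * cycle_time_ps

time_a = time_per_instruction_ps(2.0, 250)  # Computer A: 500 ps per instruction
time_b = time_per_instruction_ps(1.2, 500)  # Computer B: 600 ps per instruction

# Performance ratio A/B equals the time ratio B/A.
speedup = time_b / time_a
print(speedup)  # 1.2
```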
27. Example, Page 35
A compiler designer is trying to decide between two code sequences for a particular computer. The hardware designers have supplied the following facts:

            Instruction Class A   Instruction Class B   Instruction Class C
CPI                  1                     2                     3

For a particular high-level language statement, the compiler writer is considering two code sequences that require the following instruction counts:

            # of Class A   # of Class B   # of Class C
Sequence 1       2              1              2
Sequence 2       4              1              1

Which code sequence executes the most instructions? Which will be faster? What is the CPI for each sequence?
28. Example, Page 35
The instruction count is trivial: Sequence 1 uses five instructions and Sequence 2 uses six instructions. Which sequence is faster is determined by calculating total CPU clock cycles:

Sequence 1 clock cycles = (2 × 1) + (1 × 2) + (2 × 3) = 10 clock cycles
Sequence 2 clock cycles = (4 × 1) + (1 × 2) + (1 × 3) = 9 clock cycles

The CPI for each sequence is easily computed as:

CPI = (CPU clock cycles)/(Instruction count)
CPI for Sequence 1 = (10 clock cycles)/(5 instructions) = 2.0
CPI for Sequence 2 = (9 clock cycles)/(6 instructions) = 1.5

Conclusion: Make the common case fast!
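The arithmetic for this example can be reproduced with a few lines, using the per-class CPIs and instruction counts given on the previous slide:

```python
# Clock cycles = sum over classes of (instruction count x CPI for that class).
cpi = {"A": 1, "B": 2, "C": 3}
sequences = {
    "Sequence 1": {"A": 2, "B": 1, "C": 2},
    "Sequence 2": {"A": 4, "B": 1, "C": 1},
}

for name, counts in sequences.items():
    instr = sum(counts.values())
    cycles = sum(n * cpi[cls] for cls, n in counts.items())
    print(f"{name}: {instr} instructions, {cycles} cycles, CPI = {cycles / instr}")
# Sequence 1: 5 instructions, 10 cycles, CPI = 2.0
# Sequence 2: 6 instructions, 9 cycles, CPI = 1.5
```

Sequence 2 executes more instructions yet finishes in fewer cycles, which is why instruction count alone is a poor performance metric.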
29. Case Study: SPEC Benchmark for AMD Opteron X4 Model 2356
The integer portion of the benchmark, CINT2006, is summarized in this table. Those tasks with a CPI above 1.09 have a higher CPI due to high cache miss rates, i.e., memory access is slowing them down!