CENG 446 - PowerPoint PPT Presentation


Slides: 30
Provided by: bhem9



1
CENG 446 Advanced Computer Architectures
  • Dr. Brian T. Hemmelman
  • Chapter 1 Slides

2
The Many Faces of Computers
  • Desktops: Generally for a single user running a
    wide variety of software.
  • Servers: Handle many tasks for many users
    (scientific, database, Web).
  • Supercomputers: Typically designed for intensive
    scientific and engineering modeling and
    computation.
  • Datacenters: Massive collections of processors
    or servers to handle search engines or
    e-commerce.
  • Embedded Computers: Contained within, or part of,
    other systems that perform a particular function
    (traffic light, automobile, microwave, smartphone,
    fly-by-light avionics system, etc.).

3
The Pervasiveness of Computers

4
What's Going On Behind Your Program
  • There are multiple layers of activity taking
    place for a computer to perform a particular
    task.
  • Application Software: A particular program
    designed to accomplish a specific task.
  • Systems Software: Software that provides
    services that are commonly useful.
  • Operating System: Supervising program that
    manages the resources of a computer.
  • Compilers: Translate a program written in a
    high-level language such as C or Java into
    instructions that the hardware can execute.
  • Hardware: The physical electronics and circuitry
    that perform the actual calculations or
    operations.

5
Hardware/Software Hierarchy

6
Bridging the Hardware/Software Interface
  • The actual hardware uses voltage signals to
    represent information.
  • Boolean Logic Variable: True/False
  • Data: Collections of 1s and 0s (bits)
  • Computers must be told how to manipulate the
    information through instructions.
  • Instructions are collections of 1s and 0s that
    force the hardware to do something with other
    collections of 1s and 0s.
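The idea that data are just collections of bits can be made concrete with a minimal sketch. It assumes an 8-bit unsigned encoding for small integers and ASCII codes for characters; those widths and encodings are chosen for illustration and are not taken from the slides.

```python
def to_bits(value: int, width: int = 8) -> str:
    """Return the unsigned binary representation of value, padded to width bits."""
    if value < 0 or value >= 2 ** width:
        raise ValueError(f"{value} does not fit in {width} unsigned bits")
    return format(value, f"0{width}b")


def from_bits(bits: str) -> int:
    """Interpret a string of 1s and 0s as an unsigned integer."""
    return int(bits, 2)


number_bits = to_bits(13)          # the integer 13       -> "00001101"
letter_bits = to_bits(ord("A"))    # the character 'A'    -> "01000001"
true_bit = to_bits(int(True), 1)   # a Boolean True value -> "1"

print(number_bits, letter_bits, true_bit)
```

The pattern 01000001 is the number 65 or the letter A depending on how it is interpreted, which is why the hardware also needs instructions that say what to do with each collection of bits.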

7
Bridging the Hardware/Software Interface
  • The binary instructions that are actually
    executed by the circuitry are called machine
    language or op codes.
  • Creating all the binary instructions directly is
    cumbersome and tedious, so helper programs were
    created to translate a symbolic notation of the
    desired operation into the necessary bit pattern
    for the machine.
  • These programs are called assemblers and the
    symbolic notation is called assembly language.
  • Eventually, assemblers were used to write more
    powerful translators that allowed a user to focus
    on the task algorithm instead of the
    circuitry-dependent operations.
  • These programs are the compilers, and the
    algorithmic codes are written in a high-level
    programming language.
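The translation step performed by an assembler can be illustrated with a minimal sketch. The 16-bit instruction format, the opcode numbers, and the register names below are hypothetical, invented for this example; they do not correspond to any real machine or to the instruction set used in the course.

```python
# Hypothetical 16-bit format: [opcode:4][dest:4][src1:4][src2:4]
OPCODES = {"add": 0b0001, "sub": 0b0010, "and": 0b0011, "or": 0b0100}


def parse_register(token: str) -> int:
    """Convert a register name such as 'r3' into its 4-bit number."""
    if not token.startswith("r") or not token[1:].isdigit():
        raise ValueError(f"bad register name: {token!r}")
    number = int(token[1:])
    if not 0 <= number <= 15:
        raise ValueError(f"register out of range: {token!r}")
    return number


def assemble(line: str) -> str:
    """Translate one symbolic instruction into a 16-bit pattern of 1s and 0s."""
    mnemonic, _, operands = line.strip().partition(" ")
    if mnemonic not in OPCODES:
        raise ValueError(f"unknown mnemonic: {mnemonic!r}")
    registers = [parse_register(tok.strip()) for tok in operands.split(",")]
    if len(registers) != 3:
        raise ValueError("expected three register operands")
    dest, src1, src2 = registers
    word = (OPCODES[mnemonic] << 12) | (dest << 8) | (src1 << 4) | src2
    return format(word, "016b")


machine_code = assemble("add r1, r2, r3")
print(machine_code)  # 0001000100100011
```

A compiler sits one level above this: it would turn a high-level statement into one or more such symbolic lines before the assembler produces the bit patterns.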

8
The Journey of a Programmer's Wish Into Physical Reality

9
Advantages of High-Level Programming Languages
  • The programmer focuses on the algorithm and
    describes it in a language more natural to
    humans.
  • The programmer increases productivity as fewer
    lines of code are needed to describe how to
    execute the task. The process of converting the
    algorithm into machine language is automated by
    experts in translation. The programmer can focus
    on becoming an algorithm expert instead of an
    everything expert.
  • Programs and algorithms can be designed that are
    largely independent of the specific processor or
    circuitry on which they will execute.
  • These three advantages are so strong that today
    little programming is done in assembly language.

10
The Five Classic Computer Components
  • Input
  • Output
  • Memory
  • Datapath
  • Control

11
Input/Output Examples
  • Keyboard
  • Mouse
  • Analog-to-Digital Converter
  • Monitor
  • Network Connection
  • Pulse-Width-Modulation (PWM) Signal

12
Memory
  • External memory is almost always made of DRAM
    chips.

13
Memory
  • Internal memory is usually made of SRAM memory
    cells.

14
More Memory
  • Memory can also be categorized as volatile or
    nonvolatile.
  • Volatile memory: Only stores data when power is
    on (e.g., DRAM or cache).
  • Nonvolatile memory: Data integrity is maintained
    even without power (e.g., hard drive, FLASH, DVD).
  • We could also distinguish main memory from
    secondary memory.
  • Main memory: Volatile, where the program runs
    and its data is updated.
  • Secondary memory: Nonvolatile, where program and
    data are stored between runs.

15
Datapath and Control (The Processor or CPU)
  • Datapath: Performs the arithmetic operations.
  • Control: Tells the datapath, memory, and I/O
    devices what to do.

16
Changing Technology
  • The increases in memory capacities and processor
    clock speeds have been incredible.

17
Changing Technology
  • Computer performance has likewise been
    continuously increasing.

18
Power - The Limiting Factor
Power consumption is directly proportional to the
clock rate. Trying to increase computer
performance by only increasing the clock rate is
no longer feasible as heat dissipation becomes a
limiting factor. Hence the push towards
multicore processor chips running at somewhat
slower clock speeds.
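A minimal sketch of the proportionality stated on this slide. It assumes that everything other than the clock rate stays fixed, so power is simply a constant times the clock rate; the constant and the clock rates are made-up values, not measurements.

```python
def power_watts(clock_rate_hz: float, watts_per_hz: float) -> float:
    """Power under the assumption that it scales linearly with clock rate."""
    return watts_per_hz * clock_rate_hz


WATTS_PER_HZ = 25e-9  # hypothetical constant: 25 W per GHz

slow = power_watts(2.0e9, WATTS_PER_HZ)  # a 2 GHz clock
fast = power_watts(4.0e9, WATTS_PER_HZ)  # a 4 GHz clock

print(f"{slow:.0f} W at 2 GHz, {fast:.0f} W at 4 GHz")
```

Under this assumption, doubling the clock rate doubles the power that has to be removed as heat, which is the limit the slide describes.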
19
Computer Performance
  • So how do we know if one computer is better than
    a different computer?
  • The answer, unfortunately, is not a simple one.
  • Computer performance is largely application
    specific. Different applications have different
    needs and objectives, so no one performance
    measure is automatically The Best.
  • Clock frequency used to be a way to get an easy
    measure of performance, but this is far too simple
    an approach for most systems today.

20
Defining Performance
  • Book's example: Different airplanes with different
    ranges and speeds.
  • Best performance could be defined in terms of
    greatest range, fastest cruising speed, moving
    the most passengers the quickest, etc.
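The point that "best" depends on the chosen measure can be shown with a minimal sketch. The three airplanes and all of their figures are invented for illustration and are not the values from the book's example. Reading "moving the most passengers the quickest" as passengers times cruising speed is also an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Airplane:
    name: str
    passengers: int
    range_miles: int
    speed_mph: int

    @property
    def passenger_throughput(self) -> int:
        """Passengers times speed: how many people are moved how quickly."""
        return self.passengers * self.speed_mph


# Hypothetical fleet.
fleet = [
    Airplane("Plane A", passengers=400, range_miles=4000, speed_mph=600),
    Airplane("Plane B", passengers=120, range_miles=3500, speed_mph=1300),
    Airplane("Plane C", passengers=150, range_miles=8500, speed_mph=550),
]

best_range = max(fleet, key=lambda p: p.range_miles)
best_speed = max(fleet, key=lambda p: p.speed_mph)
best_throughput = max(fleet, key=lambda p: p.passenger_throughput)

print(best_range.name, best_speed.name, best_throughput.name)
```

Each measure picks a different airplane here, so asking which one has the best performance has no answer until the measure is fixed.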

21
Defining Performance
  • For desktops, laptops, supercomputers, and
    embedded computers, you are primarily interested
    in response time (execution time).
  • Response time: The total time required for the
    computer to complete a task, including disk
    accesses, memory accesses, I/O activities,
    operating system overhead, CPU execution time,
    and so on.
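The difference between response time and the CPU's own share of it can be observed with a minimal sketch that uses Python's standard time module. It assumes that time.sleep is an acceptable stand-in for waiting on a disk or I/O device; the sleep length and the amount of computation are arbitrary.

```python
import time


def run_task() -> int:
    """A made-up task: some computation plus a simulated I/O wait."""
    total = sum(i * i for i in range(200_000))  # CPU work
    time.sleep(0.2)                             # stand-in for an I/O wait
    return total


wall_start = time.perf_counter()  # elapsed time, includes time spent sleeping
cpu_start = time.process_time()   # CPU time of this process, excludes sleeping

run_task()

response_time = time.perf_counter() - wall_start
cpu_time = time.process_time() - cpu_start

print(f"response time: {response_time:.3f} s, CPU time: {cpu_time:.3f} s")
```

The response time includes the simulated wait, while the CPU time covers only the computation, so the first number comes out noticeably larger than the second.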

22
Comparing Performance
  • Therefore, if comparing two computers X and Y, we
    could conclude
    \[ \text{Performance}_X > \text{Performance}_Y \quad \text{if} \quad \text{Execution time}_Y > \text{Execution time}_X \]
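  • We can then also compute ratios of performance as
    \[ \frac{\text{Performance}_X}{\text{Performance}_Y} = \frac{\text{Execution time}_Y}{\text{Execution time}_X} \]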
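A minimal sketch of comparing two computers by execution time. It assumes performance is defined as the reciprocal of execution time, which is consistent with the comparison rule on this slide; the two execution times are made-up values.

```python
def performance(execution_time_s: float) -> float:
    """Performance taken as the reciprocal of execution time (an assumption)."""
    if execution_time_s <= 0:
        raise ValueError("execution time must be positive")
    return 1.0 / execution_time_s


def performance_ratio(time_x_s: float, time_y_s: float) -> float:
    """Performance of X divided by performance of Y."""
    return performance(time_x_s) / performance(time_y_s)


# Hypothetical measurements of the same program on two computers.
time_x = 10.0  # seconds on computer X
time_y = 15.0  # seconds on computer Y

ratio = performance_ratio(time_x, time_y)
print(f"X is {ratio:.2f} times faster than Y")
```

Because performance is the reciprocal of execution time, the ratio of performances equals the inverse ratio of the execution times, 15/10 = 1.5 for these values.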

23
CPU Performance
  • CPU Execution Time: The time the CPU itself spends
    computing a particular task (does not include
    time spent waiting for I/O or running other
    programs).
  • We could look at this from the simple perspective
    of the number of clock cycles it takes the CPU to
    complete the task. Thus,

\[ \text{CPU execution time for Task A} = \text{CPU clock cycles for Task A} \times \text{Clock cycle time} \]
\[ \text{CPU execution time for Task A} = \frac{\text{CPU clock cycles for Task A}}{\text{Clock rate}} \]
  • Improving performance can then be achieved by
    reducing clock cycles required for the task
    (perhaps by more powerful instructions and hence
    more complicated circuitry) or decreasing the
    clock cycle time (increasing clock frequency).
  • Decreasing clock cycle time tends to increase
    power consumption, and decreasing the clock
    cycles needed tends to increase clock cycle time.

24
Instruction Performance
  • Addressing clock frequency, the power wall, and
    heat dissipation is a complete problem unto
    itself that will not specifically be covered in
    this class.
  • We can, however, look into decreasing the CPU clock
    cycles needed to complete a task.

\[ \text{CPU clock cycles for Task A} = (\text{Instructions for Task A}) \times (\text{Average clock cycles/instruction}) \]
  • CPI (clock cycles per instruction): The average
    number of clock cycles each instruction takes to
    execute.
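  • Substituting this expression into the equation on
    the previous slide, we obtain
    \[ \text{CPU execution time for Task A} = (\text{Instructions for Task A}) \times \text{CPI} \times \text{Clock cycle time} \]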

25
The Classic CPU Performance Equation
  • Our challenge as hardware designers is to
    optimize the balance between the instruction
    count and the CPI.
  • How efficiently do we design the digital
    circuitry? Can we create different instructions
    that accomplish more without changing clock cycle
    time?
  • Overall performance will also be affected by how
    well compilers utilize the instructions available
    in the instruction set implemented in hardware.
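The balance between instruction count and CPI can be explored with a minimal sketch of the classic CPU performance equation, CPU time = instruction count x CPI x clock cycle time. The two candidate designs and all of their numbers are hypothetical.

```python
def cpu_time_s(instruction_count: float, cpi: float, cycle_time_s: float) -> float:
    """Classic CPU performance equation."""
    return instruction_count * cpi * cycle_time_s


CYCLE_TIME = 0.5e-9  # 0.5 ns, the same clock for both hypothetical designs

# Design 1: simple instructions, so more of them but a low CPI.
simple = cpu_time_s(instruction_count=10e9, cpi=1.2, cycle_time_s=CYCLE_TIME)

# Design 2: more powerful instructions, so fewer of them but a higher CPI.
powerful = cpu_time_s(instruction_count=7e9, cpi=1.6, cycle_time_s=CYCLE_TIME)

print(f"simple: {simple:.2f} s, powerful: {powerful:.2f} s")
```

With these made-up numbers the design with more powerful instructions wins (5.6 s against 6.0 s) because the drop in instruction count outweighs the rise in CPI; change either number and the answer can flip.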

26
Example, Page 34
Suppose we have two implementations of the same
instruction set architecture. Computer A has a
clock cycle time of 250 ps and a CPI of 2.0 for
some program, and computer B has a clock cycle
time of 500 ps and a CPI of 1.2 for the same
program. Which computer is faster for this
program and by how much? Both computers need to
execute the same number of instructions, I, as
they are running the same program.
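\[ \text{CPU}_A \text{ time} = I \times 2.0 \times 250 \text{ ps} = 500 \times I \text{ ps} \]
\[ \text{CPU}_B \text{ time} = I \times 1.2 \times 500 \text{ ps} = 600 \times I \text{ ps} \]
\[ \frac{\text{CPU}_A \text{ performance}}{\text{CPU}_B \text{ performance}} = \frac{\text{CPU}_B \text{ time}}{\text{CPU}_A \text{ time}} = \frac{600 I}{500 I} = 1.2 \]
CPU A will run 1.2 times faster.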
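The arithmetic of this example can be checked with a minimal sketch. The clock cycle times and CPIs are the ones given in the example. The instruction count I is not given, so the values tried below are arbitrary placeholders, and they cancel out of the ratio.

```python
def time_per_instruction_ps(cpi: float, cycle_time_ps: float) -> float:
    """Average time one instruction takes: CPI x clock cycle time."""
    return cpi * cycle_time_ps


def cpu_time_ps(instruction_count: int, per_instruction_ps: float) -> float:
    """Total CPU time for a program of instruction_count instructions."""
    return instruction_count * per_instruction_ps


per_instr_a = time_per_instruction_ps(cpi=2.0, cycle_time_ps=250)  # 500 ps
per_instr_b = time_per_instruction_ps(cpi=1.2, cycle_time_ps=500)  # 600 ps

# I is not given in the example; these values are arbitrary placeholders.
for instruction_count in (1_000, 1_000_000, 1_000_000_000):
    time_a = cpu_time_ps(instruction_count, per_instr_a)
    time_b = cpu_time_ps(instruction_count, per_instr_b)
    print(instruction_count, round(time_b / time_a, 6))

speedup = per_instr_b / per_instr_a
```

Whatever value is chosen for I, the ratio stays at 1.2, which is why the example can leave the instruction count symbolic.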
27
Example, Page 35
A compiler designer is trying to decide between
two code sequences for a particular computer.
The hardware designers have supplied the
following facts:

      Instruction Class A   Instruction Class B   Instruction Class C
CPI            1                     2                     3

For a particular high-level language statement,
the compiler writer is considering two code
sequences that require the following instruction
counts:

             # of Class A   # of Class B   # of Class C
Sequence 1        2              1              2
Sequence 2        4              1              1

Which code sequence executes the most
instructions? Which will be faster? What is the
CPI for each sequence?

28
Example, Page 35
The instruction count is trivial. Sequence 1
uses five instructions and Sequence 2 uses six
instructions. Which sequence is faster is
determined by calculating total CPU clock
cycles:
\[ \text{Sequence 1 clock cycles} = (2 \times 1) + (1 \times 2) + (2 \times 3) = 10 \text{ clock cycles} \]
\[ \text{Sequence 2 clock cycles} = (4 \times 1) + (1 \times 2) + (1 \times 3) = 9 \text{ clock cycles} \]
The CPI for each sequence is easily computed as
\[ \text{CPI} = \frac{\text{CPU clock cycles}}{\text{Instruction count}} \]
\[ \text{CPI Sequence 1} = \frac{10 \text{ clock cycles}}{5 \text{ instructions}} = 2.0 \]
\[ \text{CPI Sequence 2} = \frac{9 \text{ clock cycles}}{6 \text{ instructions}} = 1.5 \]
Conclusion: Make the common case fast!
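The three questions of this example can be answered with a minimal sketch that uses the CPIs and instruction counts given in the example. The dictionary layout and the function names are choices made for this sketch.

```python
CPI_BY_CLASS = {"A": 1, "B": 2, "C": 3}

SEQUENCES = {
    "Sequence 1": {"A": 2, "B": 1, "C": 2},
    "Sequence 2": {"A": 4, "B": 1, "C": 1},
}


def instruction_count(counts: dict) -> int:
    """Total number of instructions in a code sequence."""
    return sum(counts.values())


def clock_cycles(counts: dict) -> int:
    """Sum over classes of (instructions in class x CPI of class)."""
    return sum(n * CPI_BY_CLASS[cls] for cls, n in counts.items())


def average_cpi(counts: dict) -> float:
    """CPI = CPU clock cycles / instruction count."""
    return clock_cycles(counts) / instruction_count(counts)


for name, counts in SEQUENCES.items():
    print(name, instruction_count(counts), clock_cycles(counts), average_cpi(counts))

most_instructions = max(SEQUENCES, key=lambda s: instruction_count(SEQUENCES[s]))
fastest = min(SEQUENCES, key=lambda s: clock_cycles(SEQUENCES[s]))
```

Sequence 2 executes more instructions yet needs fewer clock cycles, so it is the faster one, given that both sequences run on the same computer and therefore at the same clock rate.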
29
Case Study: SPEC Benchmark for AMD Opteron X4 Model 2356
The integer portion of the benchmark, CINT2006,
is summarized in this table.
Those tasks with a CPI above 1.09 owe their higher
CPI to high cache miss rates, i.e., memory access
is slowing them down!
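A minimal sketch of singling out the tasks this slide refers to. The CINT2006 table itself is not reproduced in this transcript, so the task names and CPI values below are placeholders invented for illustration; only the 1.09 threshold comes from the slide.

```python
CPI_THRESHOLD = 1.09  # from the slide

# Placeholder data, not the real benchmark results.
measured_cpi = {
    "task_1": 0.8,
    "task_2": 1.09,
    "task_3": 1.5,
    "task_4": 3.0,
}

# Tasks strictly above the threshold, highest CPI first.
high_cpi_tasks = sorted(
    (name for name, cpi in measured_cpi.items() if cpi > CPI_THRESHOLD),
    key=measured_cpi.get,
    reverse=True,
)
print(high_cpi_tasks)  # ['task_4', 'task_3']
```

On the real table, the tasks selected this way would be the ones the slide attributes to high cache miss rates.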