Title: Alternative Architectures
1 Alternative Architectures
- Christopher Trinh
- CIS Fall 2009
- Chapter 9 (pp. 461-486)
2 What are they?
- Architectures that transcend the classic von Neumann approach:
- Instruction level parallelism
- Multiprocessing architectures
- Parallel processing
- Dataflow computing
- Neural networks
- Systolic array
- Quantum computing
- Optical computing
- Biological computing
3 Trade-offs
- Defined as a situation that involves losing one quality or aspect of something in return for gaining another quality or aspect.
- An important concept in the computer field.
- Speed vs. Money
- Speed vs. Power consumption/Heat
4 It's All about the Benjamins
- In trade-offs, money in most cases takes precedence.
- Moore's Law vs. Rock's Law
- Rock's Law is the economic flip side of Moore's Law: the cost of a semiconductor chip fabrication plant doubles every four years. As of 2003, the price had already reached about 3 billion US dollars.
- The consumer market is now dominated by parallel computing, in the form of multiprocessor systems.
- Exceptions: research.
5 Back in the day
- CISC vs. RISC (complex vs. reduced instruction sets).
- CISC was largely motivated by the high cost of memory (i.e., registers).
- Analogous to text messages (SMS):
- lol, u, brb, omg, and gr8
- Same motivation, and the same benefit: more information per unit of memory (or per SMS).
6 What's the Difference?
- RISC
- Minimizes the number of cycles per instruction; most instructions execute in one clock cycle.
- Uses hardwired control, which makes instruction pipelining easier.
- Complexity is pushed up into the domain of the compiler.
- More instructions per program.
- CISC
- Increases performance by reducing the number of instructions per program.
7 Between RISC and CISC
- Cheaper and more plentiful memory became available; money became less of a trade-off factor.
- The Case for the Reduced Instruction Set Computer, David Patterson and David Ditzel:
- 45% data movement instructions
- 25% ALU instructions
- 30% flow control instructions
- Overall, complex instructions were used only about 20% of the time.
8 Performance Formula
time/program = time/cycle x cycles/instruction x instructions/program
For the same program, time per cycle is essentially constant for a given technology. CISC attacks the instructions/program term (fewer, more complex instructions), while RISC attacks the cycles/instruction term (more, simpler instructions).
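A worked instance of the formula. The numbers below are illustrative assumptions, not figures from the chapter: a 1 ns clock, a CISC version of a program with 10^6 instructions averaging 4 cycles each, and a RISC version of the same program with 1.6 x 10^6 instructions averaging 1.2 cycles each.

% Worked example of the performance formula. All numeric values are
% illustrative assumptions, not figures from the chapter or the slides.
\[
  \frac{\text{time}}{\text{program}}
  = \frac{\text{time}}{\text{cycle}}
  \times \frac{\text{cycles}}{\text{instruction}}
  \times \frac{\text{instructions}}{\text{program}}
\]
\[
  \text{CISC: } 1\,\text{ns} \times 4.0 \times 10^{6} = 4.0\,\text{ms},
  \qquad
  \text{RISC: } 1\,\text{ns} \times 1.2 \times 1.6\times 10^{6} \approx 1.9\,\text{ms}
\]

Even though the RISC version executes 60% more instructions, its lower cycles-per-instruction count gives it the shorter total run time in this made-up example.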
9 Microcode
- CISC relies on microcode to handle instruction complexity.
- Efficiency is limited by variable-length instructions, which slow down the decoding process.
- This leads to a varying number of clock cycles per instruction, making pipelines difficult to implement.
- Microcode interprets each instruction as it is fetched from memory: an additional translation step.
- The more complex the instruction set, the more time it takes to look up an instruction and execute it.
- Back to text messages: IYKWIM and (_8() take longer to decode.
10 Comparison chart: RISC vs. CISC, on page 468
RISC is a misnomer: presently there are more instructions in RISC machines than in CISC machines. Most architectures today are based on RISC.
11 Register window sets
- Registers offer the greatest potential for performance improvement.
- Recall that, on average, 45% of the instructions in programs involve the movement of data.
- Saving registers, passing parameters, and restoring registers involve considerable effort and resources.
- High-level languages depend on modularization for efficiency, so procedure calls and parameter passing are natural side effects.
12
- Imagine all registers divided into sets (or windows). Each set has a specific number of registers.
- Only one set (or window) is visible to the processor at a time.
- Similar in concept to variable scope.
- Global registers: common to all windows.
- Local registers: local to the current window.
- Input registers: overlap with the preceding window's output registers.
- Output registers: overlap with the next window's input registers.
- The current window pointer (CWP) points to the register window set to be used at any given time.
13 Register windows have a circular nature. When a procedure ends, its window is marked as reusable. Recursion and deeply nested functions spill to main memory when the registers are full. (A small simulation of the overlap idea follows below.)
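A minimal C sketch of the overlapping-window idea. The layout used here (8 windows, 8 registers per in/local/out group) is an assumption loosely modeled on SPARC-style windows; the slides do not specify sizes.

/* Illustrative register-window model (window sizes are assumptions, not
 * from the text). Each window sees "in", "local", and "out" groups; the
 * "out" registers of window w are the same physical registers as the "in"
 * registers of window w+1, so parameters are passed without copying.
 * Global registers (shared by all windows) are omitted for brevity.
 */
#include <stdio.h>

#define NWINDOWS 8
#define GROUP 8                       /* registers per in/local/out group */
#define NPHYS (NWINDOWS * 2 * GROUP)  /* circular physical register file  */

static int phys[NPHYS];
static int cwp = 0;                   /* current window pointer */

static int *in_reg(int i)  { return &phys[((2 * cwp)     * GROUP + i) % NPHYS]; }
static int *out_reg(int i) { return &phys[((2 * cwp + 2) * GROUP + i) % NPHYS]; }

int main(void) {
    *out_reg(0) = 42;                      /* caller puts an argument in out[0] */
    cwp = (cwp + 1) % NWINDOWS;            /* procedure call: advance the CWP   */
    printf("callee's in[0] = %d\n", *in_reg(0));    /* prints 42: same register */
    cwp = (cwp + NWINDOWS - 1) % NWINDOWS;          /* return: back up the CWP  */
    return 0;
}

Because the index arithmetic wraps modulo the physical register file, the last window's outputs overlap the first window's inputs, which is the circular behavior described on slide 13.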
14 Flynn's Taxonomy
Considers two factors: the number of instruction streams and the number of data streams that flow into the processor. (Pages 469-471.)
PU = processing unit
15 Single Instruction, Single Data stream (SISD)
Single Instruction, Multiple Data streams (SIMD)
Multiple Instruction, Single Data stream (MISD)
Multiple Instruction, Multiple Data streams (MIMD)
Single Program, Multiple Data streams (SPMD)
16 SPMD
- Single Program, Multiple Data streams.
- Consists of multiprocessors, each with its own data set and program memory.
- The same program is executed on each processor.
- Each node can do different things at the same time: if myNode == 1, do this; else do that (see the sketch after this list).
- Synchronization happens at various global control points.
- Often used in supercomputers.
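A minimal SPMD sketch in C. MPI is not mentioned on the slide; it is assumed here only as a convenient way to run the same program on every node.

/* SPMD sketch: every node runs this identical program, branches on its own
 * node id (rank), and joins the others at a global synchronization point.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int myNode, nNodes;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myNode);  /* which node am I?      */
    MPI_Comm_size(MPI_COMM_WORLD, &nNodes);  /* how many nodes total? */

    if (myNode == 1)
        printf("node 1 of %d: doing this\n", nNodes);
    else
        printf("node %d of %d: doing that\n", myNode, nNodes);

    MPI_Barrier(MPI_COMM_WORLD);  /* global control point: wait for everyone */
    MPI_Finalize();
    return 0;
}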
17 Vector processors (SIMD)
- Often referred to as supercomputers.
- The most famous are the Cray series, with little change to their basic architecture in the past 25 years.
- Vector processors are specialized, heavily pipelined processors that perform efficient operations on entire vectors and matrices at once.
- Suited to applications that benefit from a high degree of parallelism (e.g., weather forecasting, medical diagnosis, and image processing).
- Efficient for two reasons: the machine fetches significantly fewer instructions, leading to less decoding, less control unit overhead, and less memory bandwidth usage; and the processor knows it will have a continuous source of data, so it can begin pre-fetching corresponding pairs of values.
18
- Vector registers: specialized registers that can hold several vector elements at one time.
- Two types of vector processors: register-register vector processors and memory-memory vector processors.
- Register-register vector processors
- Require that all operations use registers as source and destination operands.
- Disadvantage: long vectors must be broken into fixed-length segments that are small enough to fit into registers.
- Memory-memory vector processors
- Allow operands from memory to be routed directly to the ALU; results are streamed back to memory.
- Disadvantage: they have a large startup time due to memory latency; once the pipeline is full, the disadvantage disappears.
(A small register-register example follows below.)
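A small register-register sketch in C. The slides describe vector supercomputers; the x86 SSE intrinsics below are assumed only as a commodity-CPU stand-in for the same idea: load operands into vector registers, operate on the whole vector with one instruction, store the results back.

/* Register-register vector addition sketch using x86 SSE intrinsics.
 * This is commodity-CPU SIMD, used here only to illustrate the idea the
 * slides describe for vector supercomputers: one instruction operates on
 * a whole (short) vector held in a vector register.
 */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE: 128-bit registers holding 4 floats */

int main(void) {
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float c[4];

    __m128 va = _mm_loadu_ps(a);      /* load 4 floats into a vector register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);   /* ONE instruction adds all 4 pairs     */
    _mm_storeu_ps(c, vc);             /* store the 4 results back to memory   */

    for (int i = 0; i < 4; i++)
        printf("%.1f ", c[i]);        /* 11.0 22.0 33.0 44.0 */
    printf("\n");
    return 0;
}

The fixed 4-element register width also shows the disadvantage noted above: a long vector has to be processed in register-sized segments.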
19 Parallel and multiprocessor architectures
- Two major parallel architectural paradigms. Both fall under MIMD architectures, but they differ in how they use memory:
- Symmetric multiprocessors (SMPs)
- Massively parallel processors (MPPs)
MPP: many processors, distributed memory, communication via a network.
SMP: few processors, shared memory, communication via memory.
MPP
- Harder to program: pieces of the program on separate CPUs must communicate with each other.
- Used if the program is easily partitioned.
- Large companies (data warehousing) frequently use this approach.
SMP
- Easier to program.
- Suffers from a bottleneck when all processors attempt to access the same memory at the same time.
20
- A multiprocessing parallel architecture is analogous to adding horses to help out with the work (horsepower).
- We improve processor performance by distributing the computational load among several processors.
- Parallelism results in higher throughput (data/sec), better fault tolerance, and a more attractive price/performance ratio.
- Amdahl's Law states that if two processing components run at two different speeds, the slower speed will dominate; perfect speedup is not possible. (See the formula and worked example below.)
- You are only as fast as your slowest part.
- Every algorithm eventually has a sequential part to it. Additional processors have to wait until the serial processing is complete.
- Parallelism is not a magic solution for improving speed. Some algorithms/programs have more sequential processing, and for them it is less cost effective to employ a multiprocessing parallel architecture (e.g., processing an individual bank transaction, although processing the transactions of all bank customers together may still benefit).
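The standard form of Amdahl's Law, with a worked example. The formula is the usual textbook statement; the 90% parallel / 10% serial split and the processor counts are illustrative assumptions, not figures from the slides.

% Amdahl's Law: S = overall speedup, f = fraction of the work that can be
% parallelized, N = number of processors. The 90%/10% split below is an
% illustrative assumption.
\[
  S \;=\; \frac{1}{(1 - f) + \dfrac{f}{N}}
\]
\[
  f = 0.9,\; N = 10:\quad
  S = \frac{1}{0.1 + 0.09} \approx 5.3
  \qquad
  f = 0.9,\; N \to \infty:\quad
  S \to \frac{1}{0.1} = 10
\]

Even with unlimited processors, the 10% serial fraction caps the speedup at 10x, which is the "you are only as fast as your slowest part" point above.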
21 Instruction level parallelism (ILP)
- Superscalar vs. very long instruction word (VLIW).
- Superscalar: a design methodology that allows multiple instructions to be executed simultaneously in each cycle.
- Achieves speedup in a way similar to adding another lane to a busy single-lane highway.
- Exhibits parallelism through pipelining and replication.
- The added highway lanes are called execution units.
- Execution units consist of floating-point adders, multipliers, and other specialized components.
- It is not uncommon to have these units duplicated.
- The units are pipelined.
- Pipelining divides the fetch-decode-execute cycle into stages, so that a set of instructions are in different stages at the same time (see the timing sketch below).
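A minimal timing sketch of pipelining, assuming the classic four-stage fetch/decode/execute/write-back split (the slide names only fetch-decode-execute):

cycle:    1   2   3   4   5   6
instr 1:  F   D   E   W
instr 2:      F   D   E   W
instr 3:          F   D   E   W

Once the pipeline is full, one instruction completes every cycle, even though each individual instruction still takes four cycles from fetch to write-back.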
22
- Superpipelining occurs when a pipeline has stages that require less than half a clock cycle to execute.
- It is accomplished by adding an internal clock that runs at double the speed of the external clock, allowing two tasks to complete per external clock cycle.
- An instruction fetch component can retrieve multiple instructions simultaneously from memory.
- The decoding unit determines whether the instructions are independent (and thus can be executed simultaneously); see the toy example after this list.
- Superscalar processors rely on both the hardware and the compiler, which generates approximate schedules to make the best use of the machine's resources.
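A toy C example of the independence check the decoding unit performs (the variable names are hypothetical, chosen only for illustration):

/* Toy illustration of instruction independence. A superscalar decoder could
 * issue the first two statements in the same cycle because neither one reads
 * a result the other writes; the third must wait, because it reads x and y
 * (true data dependences on the first two statements).
 */
int demo(int b, int c, int e, int f) {
    int x = b + c;   /* independent of the next line: can issue together */
    int y = e + f;   /* independent of the previous line                 */
    int z = x + y;   /* depends on both x and y: must wait               */
    return z;
}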
23 VLIW
- Relies entirely on the compiler for the scheduling of operations.
- Packs independent instructions into one long instruction.
- Because the instruction schedule is fixed at compile time, changes such as memory latency require you to recompile the code.
- Can also lead to significant increases in the amount of code generated.
- Intel's Itanium IA-64 is an example of a VLIW processor.
- It uses an EPIC style of VLIW.
- The difference: it bundles its instructions in various lengths and uses a special delimiter to indicate where one bundle ends and another begins.
- Instruction words are prefetched by hardware; instructions within a bundle are executed in parallel with no concern for ordering.
24 Interconnection Networks
- Each processor has its own memory, but processors are allowed to access each other's memories via the network.
- Network topology is a factor in the cost and overhead of message passing. Factors in message-passing efficiency:
- Bandwidth
- Message latency
- Transport latency
- Overhead
- Static networks vs. dynamic networks
- Dynamic networks allow the path between two entities to change from one communication to the next; static networks do not.
25 (No transcript)
26 Dynamic networks allow for dynamic configuration, as either a bus or a switch. Bus-based networks are the simplest and most cost-efficient when the number of entities is moderate; the main disadvantage is that bottlenecks can occur. Parallel buses can remove this issue, but their cost is considerable.
[Figures: crossbar switch; 2 x 2 switch]
27 Omega Network
An example of a multistage network, built using 2 x 2 switches. (A routing sketch follows below.)
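A sketch of destination-tag routing through an omega network, in C. The 8 x 8 size and the routing rule (at stage s, examine bit s of the destination address, most significant bit first: 0 means take the upper switch output, 1 the lower) are standard textbook conventions assumed for illustration; they are not spelled out on the slide.

/* Destination-tag routing sketch for an 8x8 omega network: log2(8) = 3
 * stages of 2x2 switches. The message is self-routing; each switch only
 * needs one bit of the destination address.
 */
#include <stdio.h>

#define N 8        /* number of inputs/outputs        */
#define STAGES 3   /* log2(N) stages of 2x2 switches  */

static void route(int src, int dst) {
    printf("routing %d -> %d: ", src, dst);
    for (int s = 0; s < STAGES; s++) {
        int bit = (dst >> (STAGES - 1 - s)) & 1;   /* examine MSB first */
        printf("stage %d: %s  ", s, bit ? "lower" : "upper");
    }
    printf("\n");
}

int main(void) {
    route(2, 5);   /* 5 = 101b: lower, upper, lower */
    route(6, 1);   /* 1 = 001b: upper, upper, lower */
    return 0;
}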
[Figure: trade-off chart of various networks]
28 Shared memory processors
Shared memory doesn't mean that all processors must share one large memory; each processor can have a local memory, but it must be shared with the other processors.
29
- Shared-memory MIMD machines fall into two categories based on how they synchronize their memory operations:
- Uniform Memory Access (UMA): all memory accesses take the same amount of time. A pool of shared memory is connected to a group of processors through a bus or switch network.
- Non-Uniform Memory Access (NUMA): memory access time is inconsistent across the machine's address space.
- This leads to cache coherence problems (race conditions).
- Snoopy cache controllers that monitor the caches on all systems can be used; the result is called cache-coherent NUMA (CC-NUMA).
- Various cache update protocols can be used (a small write-through vs. write-back sketch follows below):
- write-through
- write-through with update
- write-through with invalidation
- write-back
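A toy C sketch contrasting the write-through and write-back policies for a single cache line. This is illustrative only; real controllers also exchange update/invalidation traffic with the other caches, which is omitted here.

/* Toy single-line cache illustrating two update policies.
 * write-through: every write goes to both the cache and main memory.
 * write-back:    writes go only to the cache; memory is updated when the
 *                dirty line is evicted.
 */
#include <stdio.h>
#include <stdbool.h>

static int memory_word = 0;   /* one word of "main memory" */
static int cache_word  = 0;   /* its cached copy           */
static bool dirty      = false;

static void write_through(int value) {
    cache_word  = value;
    memory_word = value;               /* memory updated immediately */
}

static void write_back(int value) {
    cache_word = value;
    dirty = true;                      /* memory is now stale        */
}

static void evict(void) {
    if (dirty) { memory_word = cache_word; dirty = false; }  /* flush on eviction */
}

int main(void) {
    write_through(1);
    printf("write-through: cache=%d memory=%d\n", cache_word, memory_word);             /* 1 1 */
    write_back(2);
    printf("write-back (before evict): cache=%d memory=%d\n", cache_word, memory_word); /* 2 1 */
    evict();
    printf("write-back (after evict):  cache=%d memory=%d\n", cache_word, memory_word); /* 2 2 */
    return 0;
}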
30 Distributed Systems
- Loosely coupled distributed computers depend on a network for communication among processors in order to solve a problem.
- Cluster computing: NOWs, COWs, DCPCs, and PoPCs; all resources are within the same administrative domain, working on group tasks.
- You can make your own cluster by downloading the open-source Beowulf project.
- Public-resource computing (or global computing): grid computing where computing power is supplied by volunteers through the Internet. A very cheap source of computing power.
31 SETI@home project: analyzes radio data to determine whether there is intelligent life out there (think the movie Contact).
Folding@home project: designed to perform computationally intensive simulations of protein folding and other molecular dynamics (MD), and to improve on the methods available to do so. It has reached 7.87 PFLOPS and was the first computing project of any kind to cross the four-petaFLOPS milestone. This level of performance is primarily enabled by the cumulative effort of a vast array of PlayStation 3 consoles and powerful GPUs.