Title: Power-Efficient Microarchitectures
1. Power-Efficient Microarchitectures
- Krste Asanovic
- krste@mit.edu
- MIT Computer Science and Artificial Intelligence Laboratory
- http://cag.csail.mit.edu/scale
- IBM ACEED Conference, Austin, TX
- 1 March 2005
2. Academic Computer Architectures
- Build one flimsy (but expensive) prototype that is never really used
- Eventually, some ideas are adopted, enter mass production, millions sold
3. SpecInt 2000
[Chart: Horowitz, ISSCC 2004]
4. Power
[Chart: Horowitz, ISSCC 2004]
5. Where does the power go?
[Source: IBM, HPCA 2005]
- Parallel instruction fetch and decode
- Register renaming, issue window, reorder buffer
- Multiported register files and bypass networks
- Load and store queues
- Multiported primary caches and TLBs
- Energy-oblivious instruction sets (e.g., 360, x86, RISC) require most of this microarchitectural machinery to achieve high performance
6. Energy-Oblivious Instruction Sets
- Current RISC/VLIW ISAs only expose hardware features that affect the critical path through the computation
- Most energy is consumed in microarchitectural operations that are hidden from software!
7. Energy-Exposed Instruction Sets
- Rethinking the hardware-software interface for lower power
- Use compile-time knowledge to reduce run-time energy dissipation
  - Without reducing performance
  - Without using excessive energy to transmit compile-time knowledge to hardware at run time
8. IBM's Instruction Sets
- Pre-1964: IBM 701, 650, 702, 1401, ...
  - Prehistoric times
- 1964: IBM System/360
  - Invention of the instruction set architecture (ISA)
- 1978: IBM System/38, AS/400
  - Object-based capability systems
- 1990: IBM POWER
  - Superscalar RISC
- Maybe time to start working on the next energy-aware ISA?
9. Talk Outline
- Variable-Length Instruction Formats
- Vectors
- Exception Management
- The Vector-Thread Architecture
10. Problems with Fixed-Length Instructions
- Waste memory bandwidth/power at all levels of the instruction hierarchy
- Reduce effective cache capacity
- Introduce unnecessary serial dependencies to work around length limits:

    lui r1, 0x8765        # MIPS code to load the 32-bit
    ori r1, r1, 0x4321    # constant 0x87654321 into r1

- Advantages?
  - Easier pipelined or parallel fetch and decode
11-13. Heads and Tails Format
- Each instruction is split into two portions: a fixed-length head and a variable-length tail
- Multiple instructions are packed into a fixed-length bundle
- A cache line can hold multiple bundles
14-25. Heads and Tails Format
[Diagram, built up across these slides: three fixed-length bundles. Heads (H0, H1, ...) are packed left-to-right from the start of each bundle; tails (..., T1, T0) are packed right-to-left from the end, with unused space in between. Some heads (e.g., H2 in the second bundle) have no tail. Each bundle begins with a "last instr" field (4, 6, 5 in the example) giving the index of its last instruction.]
- Not all heads need tails
- Tails are placed at a fixed granularity
- The granularity of tails is independent of the size of heads
- The PC divides into {bundle, instruction} fields (sketched in C below):
  - Sequential execution: instruction index is incremented
  - End of bundle: bundle is incremented, instruction index reset to 0
  - Branch: the target's instruction index is checked (against the last-instr field)
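To make the {bundle, instruction} PC sequencing concrete, here is a minimal C sketch of the update rules above. The struct layout, field widths, and the last_instr helper are illustrative assumptions, not the actual HAT encoding.

    #include <stdint.h>

    /* Hypothetical HAT program counter: bundle number plus the index
       of a head within that bundle. */
    typedef struct {
        uint32_t bundle;
        uint32_t instr;
    } hat_pc;

    /* Stub for reading a bundle's "last instr" header field. */
    static uint32_t last_instr(uint32_t bundle) { (void)bundle; return 4; }

    /* Sequential execution: bump the instruction index; at the end of
       a bundle, advance to the next bundle and reset the index. */
    static hat_pc hat_next(hat_pc pc) {
        if (pc.instr == last_instr(pc.bundle)) {
            pc.bundle += 1;
            pc.instr = 0;
        } else {
            pc.instr += 1;
        }
        return pc;
    }

    /* Branch: the target's instruction index must exist in the
       target bundle. */
    static int hat_branch_ok(hat_pc target) {
        return target.instr <= last_instr(target.bundle);
    }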
26-29. Conventional VL Length-Decoding
[Diagram: Instr 1, Instr 2, Instr 3 packed end-to-end, each followed by its length decoder.]
- The 2nd length decoder needs to know Length 1 first
- The 3rd length decoder needs to know Length 1 + Length 2
- Need to know all 3 lengths to fetch and align more instructions
30-33. HAT Length-Decoding
[Diagram: Head1, Head2, Head3 at fixed positions at the front of the bundle; Tail3, Tail2, Tail1 packed from the end.]
- Length decoding is done in parallel (see the sketch below)
- Only the tail-length adders depend on previous length information (carry-save adders, delay O(log W))
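A short C sketch of why this parallelizes, assuming 16-bit heads and a 4-wide decode group (both invented parameters): every head offset is a constant, and only the tail offsets need a running sum of tail lengths, which hardware can compute with a log-depth adder tree.

    #define W 4            /* instructions decoded per cycle (assumed) */
    #define HEAD_BITS 16   /* fixed head size (assumed) */

    /* head_off[i] depends only on i, so all W head decoders work in
       parallel; tail_off[i] needs the prefix sum of tail lengths, the
       only serial dependence (O(log W) adder levels in hardware). */
    void hat_locate(const unsigned tail_len[W], unsigned bundle_bits,
                    unsigned head_off[W], unsigned tail_off[W]) {
        unsigned sum = 0;
        for (unsigned i = 0; i < W; i++) {
            head_off[i] = i * HEAD_BITS;     /* fixed position */
            sum += tail_len[i];              /* prefix sum */
            tail_off[i] = bundle_bits - sum; /* tails packed from the end */
        }
    }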
34. Heads and Tails Summary
- Density of variable-length instructions while retaining pipelined or superscalar instruction fetch
- For a recoded MIPS ISA, saves 25% of static and dynamic instruction bits using 256-bit bundles
- Can design an ISA to exploit HAT (e.g., avoid spurious serializations)
35. Vectors
36. Parallelism is Good
[Chart: Horowitz, ISSCC 2004]
37. Forms of Parallelism and Energy per Op
[Chart: energy/operation vs. performance, starting from a scalar pipelined machine.]
38. Vectors
- Omission of vectors is the single biggest mistake in commercial computer architectures
  - Simple
  - High performance
  - Low power
  - Works great with caches
  - Mature compiler technology
  - Easily understood performance-programming model
  - Good for everything, not just scientific computing
- Possibly the only valid reasons for omission:
  - A little harder to make work with virtual memory and rapid context swaps (see restart markers)
  - Large vector register files (see vector-thread architecture)
39. Automatic Code Vectorization

    for (i = 0; i < N; i++) C[i] = A[i] + B[i];

- Vectorization is a massive compile-time reordering of operation sequencing, which avoids many run-time overheads (sketch below)
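A C sketch of what the compiler's reordering amounts to: the loop is stripmined into chunks so that each vector instruction covers a whole chunk of elements. VLMAX is a hypothetical maximum hardware vector length.

    #define VLMAX 64   /* assumed maximum hardware vector length */

    void vadd(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i += VLMAX) {
            int vl = (n - i < VLMAX) ? n - i : VLMAX; /* set vector length */
            /* Conceptually two vector-loads, one vector-add, and one
               vector-store: instruction fetch/decode and loop
               bookkeeping are amortized over vl element operations. */
            for (int j = 0; j < vl; j++)
                C[i + j] = A[i + j] + B[i + j];
        }
    }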
40. Vector Energy Advantages
- Instruction fetch amortized over a vector of operations
- Loop bookkeeping factored out into a separate control processor
- Efficient vector memory operations move multiple memory operands with one cache tag/TLB lookup
- All arithmetic operations access only the local lane; no cross-lane wiring
- The length of the vector register effectively provides register renaming and loop unrolling without additional hardware
41. Vector Instruction Parallelism
- Can overlap execution of multiple vector instructions
- Example machine has 32 elements per vector register and 8 lanes
[Diagram: load, multiply, and add units each completing 8 elements per cycle, overlapped in time as instructions issue.]
- Completes 24 operations/cycle (3 units x 8 lanes) while issuing only 1 short instruction/cycle; each 32-element instruction occupies its unit for 32/8 = 4 cycles
42. Why SIMD extensions fall short of Vectors
- Only executes one cycle's worth of operands per instruction fetch
  - Requires superscalar dispatch to keep multiple functional units busy
- Scalar unit cannot run ahead to find the next vector loop
  - Tied up issuing SIMD instructions for the current loop
- No long vector memory operations
  - Memory system can't get ahead in fetching data without speculation
- Doesn't scale to wider datapaths without software rewrite
- Doesn't scale to large register files without bigger instructions
- Awkward interface for compilers
  - Extensive microarchitecture-specific loop unrolling and software pipelining required to keep pipelines busy
  - Load/store alignment constraints
  - No vector length register
  - No scatter/gather
- Causes larger loop startup delays than vectors
43. Vectors vs. Superscalar on General-Purpose Applications
[Chart: single-scalar runtime, vectorizing SPECint95.]
- Accelerating 28% of the code by a factor of 8 gives the same speedup as accelerating all code by 1.3x (Amdahl's Law: 1 / (0.72 + 0.28/8) is about 1.32)
44. Vectorizable Workloads
- Vectors are known to work well for scientific and media applications, but can also help many other codes, e.g.:
  - Databases
    - Hash-joins vectorizable
    - String operations
  - Operating systems
    - bzero/bcopy
- Many other important commercial algorithms can be vectorized
- All vendors will soon be telling customers to multithread their code to get better performance
  - Vectorization can be simpler, and give much better power-performance than multithreading
45. Exceptions
46. Exception Management Overhead
- A large part of the power cost in modern microarchitectures comes from the need to provide precise exceptions:
  - Reorder buffer to track original program order
  - Register renaming or bypass networks to allow undo of speculative register writes
  - Store queues to allow undo of speculative memory writes
  - (Even in-order architectures speculate on exceptions)
- But there is also a large opportunity cost, because some things are too difficult to make precise:
  - Deeply exposed machine state
  - Overlapped execution of multiple highly parallel instructions
  - Special-purpose execution units with embedded state
47. What's Important in Exceptions?
- For an operating system with multiprogramming and virtual memory:
  - Must allow fast (and simple) process state save, to allow process restart later
- These "swappable" exceptions are much easier to provide than precise exceptions, especially in highly parallel machines with large quantities of architectural state
48. Software Restart Markers
- Software explicitly marks restart points, e.g., by setting a barrier bit on each instruction
- Hardware saves the next PC into a machine register as each barrier instruction completes
  - Branches store the target PC
  - Must also wait for any earlier potentially exception-causing instructions to clear exception checks (trap barrier)
- After any trap, the OS resumes execution at the saved PC
49. Idempotent Regions
- Hardware does not buffer state updates and cannot undo state changes if a trap occurs in the middle of a region
- Can only restart cleanly if regions are idempotent, i.e., they can be re-executed from the beginning multiple times with the same effect

    add r3, r1, r2
    st.bar r3, 0(r5)    # restart point
    ld r2, 4(r7)
    ld r3, 8(r7)
    add.bar r4, r2, r3  # restart point
    st r4, 4(r7)
    st.bar r7, 8(r7)    # restart point
50. Rules for Idempotent Regions
- A sufficient rule is that the external read set is disjoint from the internal write set (illustrated below)
  - OK to overwrite a value if it was produced within the region
- The rule is not necessary, because of idempotent update operations, e.g.:
  - X <- X AND Y
  - Y <- Y OR Z
- Require that any prefix of a region is also idempotent
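A minimal C illustration of the sufficient rule (the function and the no-aliasing assumption are mine, not from the slides): the region reads only in[] (external read set) and writes only out[] and values it produced itself (internal write set), so running it again from the top has the same effect.

    /* Idempotent restart region, assuming in and out do not alias:
       the external read set (in[0..n-1]) is disjoint from the
       internal write set (out[0..n-1], t). */
    void region(double *out, const double *in, int n) {
        for (int i = 0; i < n; i++) {
            double t = in[i] * 2.0;  /* produced within the region */
            out[i] = t;              /* overwritten, never read, here */
        }
    }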
51. Some Idempotent Functions

    matmul(int m, int k, int n,
           const double *a, const double *b, double *c)
    int sprintf(char *s, const char *format, ...)
    int sscanf(const char *s, const char *format, ...)
    char *strcpy(char *, const char *)          /* also strcmp, strlen, ... */
    void *memcpy(void *, const void *, size_t)  /* also memset, ... */
    double sin(double)                          /* also sqrt, exp, etc. */
    double atof(const char *)                   /* also atoi, atol, strtod, ... */

- Can be protected with a single restart marker on the calling instruction, saving only the entry PC
  - Assuming arguments are untouched in stack memory
- For a vector machine, almost no (<1%) overhead to add restart markers to common loops
52. Temporary State
- Temporary state is only visible inside a restart region
  - Thrown away on any exception
  - Will be rebuilt when the restart region is restarted
- For SCALE, all vector-thread unit state is temporary
  - OS is unaware of the vector-thread unit
- Provides the advantages of exposing more machine state, without the headaches
53. Vector-Thread Architecture
54. Vector and Multithreaded Architectures
[Diagram: a control processor issuing vector control to processing elements PE0..PEN over a shared memory, versus independently threaded PE0..PEN with per-thread control.]
- Vector processors provide efficient DLP execution
  - Amortize instruction control
  - Amortize loop bookkeeping overhead
  - Exploit structured memory accesses
  - Unable to execute loops with loop-carried dependencies or complex internal control flow
- Multithreaded processors can flexibly exploit TLP
  - Unable to amortize common control overhead across threads
  - Unable to exploit structured memory accesses across threads
  - Costly memory-based synchronization and communication between threads
55. Vector-Thread Architecture
- VT unifies the vector and multithreaded compute models
- A control processor interacts with a vector of virtual processors (VPs)
  - Vector-fetch: the control processor fetches instructions for all VPs in parallel
  - Thread-fetch: a VP fetches its own instructions
- VT allows a seamless intermixing of vector and thread control
[Diagram: control processor issuing vector-fetches to VP0..VPN over memory; individual VPs issuing their own thread-fetches.]
56. Virtual Processor Abstraction
- VPs contain a set of registers
- VPs execute RISC-like instructions grouped into atomic instruction blocks (AIBs)
- VPs have no automatic program counter; AIBs must be explicitly fetched
  - VPs contain pending vector-fetch and thread-fetch addresses
- A fetch instruction allows a VP to fetch its own AIB
  - May be predicated for conditional branch
- If an AIB does not execute a fetch, the VP thread stops
[Diagram: a VP thread executing a chain of AIBs, each a block of instructions, linked by fetch instructions and thread-fetches.]
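A minimal C model of that fetch rule (the types and function-pointer encoding are illustrative, not SCALE's): executing an AIB either fetches a successor AIB or returns NULL, and the VP thread stops on NULL.

    typedef struct aib aib;
    struct aib {
        /* Execute the block; return the fetched AIB, or NULL if the
           block executed no fetch instruction. */
        aib *(*run)(const aib *self);
    };

    /* VP thread execution: there is no automatic PC; control advances
       only by explicit fetches (the initial AIB comes from a
       vector-fetch or thread-fetch). */
    void vp_thread(const aib *start) {
        for (const aib *cur = start; cur != NULL; cur = cur->run(cur))
            ;
    }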
57. Virtual Processor Vector
- A VT architecture includes a control processor and a virtual processor vector
  - Two interacting instruction sets
- A vector-fetch command allows the control processor to fetch an AIB for all the VPs in parallel
- Vector-load and vector-store commands transfer blocks of data between memory and the VP registers
[Diagram: control processor issuing vector-fetches to VP0..VPN; a vector memory unit handling vector-loads and vector-stores between memory and the VPs.]
58. Cross-VP Data Transfers
- Cross-VP connections provide fine-grain data operand communication and synchronization
  - VP instructions may target nextVP as a destination or use prevVP as a source
  - The crossVP queue holds wrap-around data; the control processor can push and pop
- The restricted ring communication pattern is cheap to implement, scalable, and matches the software usage model for VPs
[Diagram: VP0..VPN connected in a ring via nextVP/prevVP links, with a crossVP queue the control processor pushes to and pops from.]
59. Mapping Loops to VT
- A broad class of loops map naturally to VT:
  - Vectorizable loops
  - Loops with loop-carried dependencies
  - Loops with internal control flow
- Each VP executes one loop iteration (see the sketch below)
  - The control processor manages the execution
  - Stripmining enables implementation-dependent vector lengths
- Programmer or compiler only schedules one loop iteration on one VP
  - No cross-iteration scheduling
60. Vectorizable Loops
- Data-parallel loops with no internal control flow are mapped using vector commands
  - Predication for small conditionals
[Diagram: a loop-iteration DAG (two loads, a shift, a multiply, a store) replicated across VP0..VPN; the control processor issues two vector-loads, a vector-fetch for the compute, and a vector-store, each applying to all VPs.]
61. Loop-Carried Dependencies
- Loops with cross-iteration dependencies are mapped using vector commands with cross-VP data transfers (modeled in C below)
  - The vector-fetch introduces a chain of prevVP receives and nextVP sends
  - Vector-memory commands can still be used even when the compute is non-vectorizable
[Diagram: the same loop-iteration DAG, but with the compute chained from VP to VP through prevVP/nextVP transfers; the control processor issues vector-loads, a vector-fetch, and a vector-store.]
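A sequential C model of the cross-VP chain for a simple running-sum recurrence (the recurrence is an invented example): each VP receives the carried value from prevVP, updates it, and sends it to nextVP, while the control processor pushes the initial value and pops the wrap-around result.

    #include <stdio.h>

    int main(void) {
        int x[8] = {3, 1, 4, 1, 5, 9, 2, 6}, y[8];
        int chain = 0;                 /* control processor pushes initial value */
        for (int v = 0; v < 8; v++) {  /* VP v: receive from prevVP ... */
            y[v] = chain + x[v];       /* prevVP value used as a source */
            chain = y[v];              /* ... and send result to nextVP */
        }
        printf("final: %d\n", chain);  /* control processor pops the result */
        return 0;
    }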
62. Loops with Internal Control Flow
- Data-parallel loops with large conditionals or inner loops are mapped using thread-fetches
  - Vector commands and thread-fetches are freely intermixed
  - Once launched, the VP threads execute to completion before the next control processor command
[Diagram: the control processor issues a vector-load and a vector-fetch; each VP then executes its own data-dependent sequence of loads and branches (some VPs iterate more than others) before the final vector-store.]
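A toy C model of why thread-fetches are needed here (the data and loop body are invented for illustration): each VP's inner loop runs a data-dependent number of times, which a single fixed-shape vector command cannot express.

    #include <stdio.h>

    int main(void) {
        int trips[5] = {1, 3, 2, 0, 4};  /* per-VP inner-loop counts (example) */
        for (int vp = 0; vp < 5; vp++) { /* VPs conceptually run in parallel */
            int i = 0;
            while (i < trips[vp])        /* predicated fetch: loop back or stop */
                i++;                     /* stand-in for the inner-loop AIB */
            printf("VP%d: %d inner iterations\n", vp, i);
        }
        return 0;
    }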
63. VT Physical Model
- A vector-thread unit contains an array of lanes with physical register files and execution units
- VPs map to lanes and share physical resources; VP execution is time-multiplexed on the lanes
- Independent parallel lanes exploit parallelism across VPs and data-operand locality within VPs
64. VP Execution Interleaving
- Hardware provides the benefits of loop unrolling by interleaving VPs
- Time-multiplexing can hide thread-fetch, memory, and functional unit latencies
[Diagram: four lanes with VPs time-multiplexed across them: Lane 0 runs VP0, VP4, VP8, VP12; Lane 1 runs VP1, VP5, VP9, VP13; Lane 2 runs VP2, VP6, VP10, VP14; Lane 3 runs VP3, VP7, VP11, VP15.]
65. VP Execution Interleaving
- Dynamic scheduling of cross-VP data transfers automatically adapts to the software critical path (in contrast to static software pipelining)
  - No static cross-iteration scheduling
  - Tolerant to variable dynamic latencies
[Diagram: the same four-lane time-multiplexing, with a vector-fetch shown propagating across the lanes over time.]
66. SCALE Registers and VP Configuration
- Atomic instruction blocks allow VPs to share temporary state that is only valid within the AIB
  - VP general registers are divided into private and shared
  - Chain registers (cr0, cr1) at the ALU inputs avoid reading and writing the general register file, to save energy
- The number of VP registers in each cluster is configurable
  - The hardware can support more VPs when each has fewer private registers (see the arithmetic check below):
    - 4 VPs with 0 shared regs + 8 private regs
    - 7 VPs with 4 shared regs + 4 private regs
    - 25 VPs with 7 shared regs + 1 private reg
  - Low overhead: a control processor instruction configures the VPs before entering a stripmine loop; VP state is undefined across reconfigurations
[Diagram: the cluster c0 register file partitioned between shared registers and per-VP private registers for each of the three configurations.]
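The three configurations follow from simple arithmetic over the 32 registers in each cluster (the register count appears in the prototype parameters on the next slide); a check in C:

    #include <stdio.h>

    /* VPs supported = (registers per cluster - shared) / private per VP */
    static int max_vps(int regs, int shared, int priv) {
        return (regs - shared) / priv;
    }

    int main(void) {
        printf("%d VPs\n", max_vps(32, 0, 8)); /* 4 VPs  */
        printf("%d VPs\n", max_vps(32, 4, 4)); /* 7 VPs  */
        printf("%d VPs\n", max_vps(32, 7, 1)); /* 25 VPs */
        return 0;
    }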
67. SCALE Prototype and Simulator
- Prototype SCALE processor in development
  - Control processor: MIPS, 1 instr/cycle
  - VTU: 4 lanes, 4 clusters/lane, 32 registers/cluster, 128 VPs max
  - Primary I/D cache: 32 KB, 4x128b per cycle, non-blocking
  - DRAM: 64b, 200 MHz DDR2 (64b at 400 Mb/s = 3.2 GB/s)
- Estimated 10 mm2 in TSMC 180 nm, 400 MHz (25 FO4)
68. Summary
- Energy/operation for a given performance is the key parameter
- Increasing parallelism and locality are the standard tricks for improving performance
  - But standard microarchitectural techniques to achieve better parallelism and locality increase energy/operation
- An energy-exposed instruction set allows software to increase parallelism, increase locality, and reduce microarchitectural waste, for lower energy/op
69. SCALE Group
http://cag.csail.mit.edu/scale
- Seongmoo Heo
- Ronny Krashinsky
- Jae Lee
- Rose Liu
- Albert Ma
- Heidi Pan
- Brian Pharris
- Jessica Tseng
- Michael Zhang
- Krste Asanovic
- Gautham Arumilli
- Ken Barr
- Elizabeth Basha
- Chris Batten
- Vimal Bhalodia
- Jared Casper
- Steve Gerding
- Mark Hampton
Funding provided by DARPA, NSF, CMI, IBM,
Infineon, Intel, SGI, Xilinx, MIT Project Oxygen
71. Backup
72. Lane Execution
- Lanes execute decoupled from each other
- A command management unit handles vector-fetch and thread-fetch commands
- The execution cluster executes instructions in-order from a small AIB cache (e.g., 32 instructions)
  - AIB caches exploit locality to reduce instruction fetch energy (on par with a register read)
- Execute directives point to AIBs and indicate which VP(s) the AIB should be executed for
  - For a thread-fetch command, the lane executes the AIB for the requesting VP
  - For a vector-fetch command, the lane executes the AIB for every VP
- AIBs and vector-fetch commands reduce control overhead
  - 10s to 100s of instructions executed per fetch address tag-check, even for non-vectorizable loops
[Diagram: Lane 0 internals: a command management unit receiving vector-fetch and thread-fetch commands, AIB tags and per-VP execute directives, an AIB cache feeding the ALU, and an AIB fill unit handling misses.]
73. SCALE Vector-Thread Processor
- SCALE is designed to be a complexity-effective all-purpose embedded processor
  - Exploit all available forms of parallelism and locality to achieve high performance and low energy
- Constrained to small area (estimated 10 mm2 in 0.18 µm)
  - Reduce wire delay and complexity
  - Support tiling of multiple SCALE processors for increased throughput
- Careful balance between software and hardware for code mapping and scheduling
  - Optimize runtime energy, area efficiency, and performance while maintaining a clean, scalable programming model
74. SCALE Clusters
- VPs are partitioned into four clusters to exploit ILP and allow lane implementations to optimize area, energy, and circuit delay
- Clusters are heterogeneous: c0 can execute loads and stores, c1 can execute fetches, c3 has an integer mult/div
- Clusters execute decoupled from each other
[Diagram: four lanes, each with clusters c0-c3, fed by the control processor and AIB fill unit and sharing an L1 cache; a SCALE VP spans the four clusters.]
75. SCALE Micro-Ops
- The assembler translates the portable software ISA into hardware micro-ops
- Per-cluster micro-op bundles access local registers only
- Inter-cluster data transfers are broken into transports and writebacks
[Diagram: software VP code translated into per-cluster micro-op bundles (cluster 3 not shown).]
76. SCALE Cluster Decoupling
- Cluster execution is decoupled
  - Cluster AIB caches hold micro-op bundles
  - Each cluster has its own execute-directive queue and local control
  - Inter-cluster data transfers synchronize with handshake signals
- Memory access decoupling (see paper)
  - The load-data queue enables continued execution after a cache miss
  - The decoupled-store queue enables loads to slip ahead of stores
[Diagram: clusters 0-3, each with an AIB cache, registers, and ALU, connected by writeback and transport paths.]
77. Why it might be time for a new ISA
- Power-performance crisis
- Single-thread performance plateau
  - For real this time
- Memory wall
- Reliability scaling
- Hope for everyday large-scale multithreading
- Software quality crisis
78. SpecInt/MHz
79. Clock Frequency Scaling
80. Clock Cycle in FO4
[Chart: clock cycle in FO4, Alpha processors]
81. Forms of Parallelism and Energy per Op
[Chart: energy/operation vs. performance for a scalar pipelined machine (repeat of slide 37).]
82-85. Idempotency is Non-Monotonic with Region Size
[Built up across four slides: nested restart regions A, B, C over the same code, where B extends A by one instruction and C extends B.]

    C:  st  r1, (r4)
    B:  ld  r1, (r2)
    A:  st  r1, (r3)    # r3 != r2
        add r1, 1

- Region A (store + add) is not idempotent: the store reads r1, which the add then overwrites, so re-execution stores a different value
- Region B is idempotent: the load overwrites r1 before any use, and r3 != r2 keeps the external read set (mem[r2]) disjoint from the internal write set (r1, mem[r3])
- Region C is not idempotent again: its first store reads r1 before the load overwrites it
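A C check of the reading above (memory modeled as a tiny array, with r2=0, r3=1, r4=2 so that r3 != r2): each region is run once, then rerun twice from its start, and the final states are compared. Expected output: A and C are not idempotent, B is.

    #include <stdio.h>
    #include <string.h>

    static int mem[3], r1;

    static void region(int level) {   /* level: 0 = A, 1 = B, 2 = C */
        if (level >= 2) mem[2] = r1;  /* C: st r1, (r4) */
        if (level >= 1) r1 = mem[0];  /* B: ld r1, (r2) */
        mem[1] = r1;                  /* A: st r1, (r3) */
        r1 = r1 + 1;                  /* A: add r1, 1   */
    }

    static int idempotent(int level) {
        int m1[3], v1;
        mem[0] = 7; mem[1] = 0; mem[2] = 0; r1 = 5;
        region(level);                        /* run once */
        memcpy(m1, mem, sizeof mem); v1 = r1;
        mem[0] = 7; mem[1] = 0; mem[2] = 0; r1 = 5;
        region(level); region(level);         /* run, then restart from the top */
        return v1 == r1 && memcmp(m1, mem, sizeof mem) == 0;
    }

    int main(void) {
        printf("A: %d  B: %d  C: %d\n",
               idempotent(0), idempotent(1), idempotent(2)); /* 0 1 0 */
        return 0;
    }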