Title: Reducing Control Power in CGRAs with Token Flow
1. Reducing Control Power in CGRAs with Token Flow
- Hyunchul Park, Yongjun Park, and Scott Mahlke
- University of Michigan
2. Coarse-Grained Reconfigurable Architecture (CGRA)
- Array of PEs connected in a mesh-like interconnect
- High throughput with a large number of resources
- Distributed hardware offers low cost/power consumption
- High flexibility with dynamic reconfiguration
3. CGRA: an Attractive Alternative to ASICs
- Suitable for running multimedia applications for future embedded systems
- High throughput, low power consumption, high flexibility
- Morphosys: 8x8 array with a RISC processor
- SiliconHive: hierarchical systolic array
- ADRES: 4x4 array tightly coupled with a VLIW
[Figures: Morphosys, SiliconHive, and ADRES; viterbi at 80 Mbps, H.264 at 30 fps, 50-60 MOps/mW]
4. Control Power Explosion
[Figure: a single PE and its fully decoded instruction word]
- Large number of configuration signals
- Distributed interconnect: many resources to control
- Nearly 1000 bits each cycle
- No code compression technique developed for CGRAs
- Fully decoded instructions are stored in memory
- 45% of total power
5. Code Compression
- Huffman encoding: high efficiency, but a sequential process
- Dictionary-based: recurring patterns stored in a dictionary
- Not many recurring patterns are found in CGRAs
- Instruction-level code compression
- No-op compression (Itanium, DSPs)
- Only 17% are no-ops in CGRAs
6. Fine-grain Code Compression
- Compress unused fields rather than the whole instruction
- Opcode, MUX selection, register address
- 35% of fields contain valid information
- The instruction format needs to be stored in memory
- Memory must record which fields exist for each instruction
- Significant overhead: 172 bits (20%) for a 4x4 CGRA
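A minimal sketch of the idea, assuming a hypothetical 4-field layout (the real 4x4 CGRA has far more fields): each stored instruction becomes a presence bitmask plus only the valid fields, and the bitmask itself is the instruction-format overhead that the token network later eliminates.

    FIELDS = [("opcode", 4), ("mux_sel", 3), ("reg_addr", 4), ("const", 8)]

    def compress(instr):
        """instr: dict mapping field name -> value, omitting unused fields."""
        mask, payload = 0, []
        for i, (name, _w) in enumerate(FIELDS):
            if instr.get(name) is not None:
                mask |= 1 << i
                payload.append(instr[name])
        return mask, payload          # the mask is the format overhead

    def decompress(mask, payload):
        values = iter(payload)
        return {name: (next(values) if mask >> i & 1 else None)
                for i, (name, _w) in enumerate(FIELDS)}

    # A routing-only instruction stores 3 bits of mux_sel plus the mask,
    # instead of a fully decoded 19-bit word.
    mask, payload = compress({"mux_sel": 5})
    assert decompress(mask, payload) == {"opcode": None, "mux_sel": 5,
                                         "reg_addr": None, "const": None}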
7. Dynamic Instruction Format Discovery
[Figure: FU computes dest <- src0, src1; RF performs a register write]
- Resources need configuration only when data flows through them
- The instruction format can be discovered by looking at the data flow
- The token network from dataflow machines can be utilized
- A token is 1 bit of information indicating incoming data in the next cycle
- Each PE observes incoming tokens and determines the instruction format
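A sketch of the discovery step, with a hypothetical mapping from input ports to instruction fields: the 1-bit tokens alone tell the PE which fields it must fetch.

    def discover_format(tokens):
        """tokens: dict port -> bool, True if data arrives next cycle."""
        fields = []
        if tokens.get("src0") or tokens.get("src1"):
            fields.append("opcode")              # FU fires, needs an opcode
        for port, arriving in tokens.items():
            if arriving:
                fields.append("mux_sel_" + port) # select the live input
        return fields

    # Data arriving on src0 only: fetch the opcode and one MUX selector,
    # skip every other field.
    print(discover_format({"src0": True, "src1": False}))
    # -> ['opcode', 'mux_sel_src0']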
8. Dynamic Configuration of PEs
[Figure: dataflow graph, its mapping, and the per-cycle configuration]
- Each cycle, tokens are sent to the consuming PEs
- Consuming resources collect incoming tokens, discover instruction formats, and fetch only the necessary instruction fields
- In the next cycle, resources execute the scheduled operations
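A sketch of this cycle-by-cycle protocol as a two-phase loop (an assumed software model, not the authors' RTL): tokens sent in cycle t configure the consumers that execute in cycle t+1.

    def run(schedule, n_cycles):
        """schedule: cycle -> list of (producer_pe, consumer_pe, port)."""
        pending = {}                       # consumer PE -> live input ports
        for t in range(n_cycles):
            # Phase 1: consumers act on last cycle's tokens.
            for pe, ports in pending.items():
                fields = ["opcode"] + ["mux_sel_" + p for p in sorted(ports)]
                print("cycle %d: PE %d fetches %s, executes" % (t, pe, fields))
            # Phase 2: producers send 1-bit tokens over the token network.
            pending = {}
            for producer, consumer, port in schedule.get(t, []):
                pending.setdefault(consumer, set()).add(port)

    run({0: [(0, 1, "src0"), (2, 1, "src1")]}, n_cycles=2)
    # cycle 1: PE 1 fetches ['opcode', 'mux_sel_src0', 'mux_sel_src1'], executes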
9. Token Generation
- Tokens are generated at the beginning of dataflow: live-in nodes in RFs
- Each RF read port needs token generation info: 26 read ports in a 4x4 CGRA
- 26 bits for token generation vs. 172 bits for the instruction format
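A sketch of the token-generation read, assuming a one-bit-per-read-port encoding: the 26-bit word fetched each cycle replaces the 172-bit instruction-format word of the static scheme.

    N_READ_PORTS = 26                      # RF read ports in a 4x4 CGRA

    def generate_tokens(token_gen_word):
        """token_gen_word: 26-bit value from the token-generation memory.
        Returns the read ports that inject a token this cycle."""
        return [p for p in range(N_READ_PORTS) if token_gen_word >> p & 1]

    # Ports 0 and 3 start dataflow this cycle.
    assert generate_tokens(0b1001) == [0, 3]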
10. Token Network
- Token network sits between the datapath and the decoder
- No instruction format in memory, only token generation info
- Adds 1 cycle between the IF and EX stages
- Created by cloning the datapath
- 1-bit interconnect with the same topology
- Each resource is translated to a token processing module
- Encode dest fields, not src fields
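A sketch of the cloning step with a hypothetical two-FU datapath: the token network reuses the datapath topology at 1-bit width, and a producer forwards tokens only to the consumers named in its stored dest fields.

    # Hypothetical producer -> consumer topology of the datapath.
    datapath = {
        "fu0.out": ["fu1.src0", "rf0.wr0"],
        "rf0.rd0": ["fu0.src1"],
    }
    # The token network is the same graph, one bit wide.
    token_net = {src: list(dsts) for src, dsts in datapath.items()}

    def send_token(producer, dest_fields):
        """dest_fields: destinations read from memory (src fields are
        never stored; receivers infer them from token positions)."""
        return [d for d in token_net[producer] if d in dest_fields]

    assert send_token("fu0.out", {"fu1.src0"}) == ["fu1.src0"]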
11. Register File Token Module
[Figure: RF token module with token_gen memory and token senders]
- Write port MUXes are converted to token receivers
- Determine selection bits
- Read ports are converted to token senders
- Tokens are initially generated here
- Token generation information is stored in a separate memory
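A sketch of both halves of the RF token module (assumed interfaces): the position of the live token on a write-port MUX directly yields the selection bits, and a read port re-starts dataflow when its token-generation bit is set.

    def write_port_receiver(incoming):
        """incoming: token bit per write-port MUX input. The position of
        the single live token is the MUX selection."""
        live = [i for i, t in enumerate(incoming) if t]
        assert len(live) <= 1, "one producer per write port per cycle"
        return live[0] if live else None        # selection, or port idle

    def read_port_sender(token_gen_bit, dest_fields):
        """If this port's token-generation bit is set, send a token to
        every destination listed in its dest fields."""
        return list(dest_fields) if token_gen_bit else []

    assert write_port_receiver([0, 0, 1, 0]) == 2
    assert read_port_sender(1, ["fu3.src0"]) == ["fu3.src0"]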
12. FU Token Module
- Input MUXes are converted to token receivers
- Opcode processor
- Fetch the opcode field only if necessary
- Determine token type (data/pred) and latency
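A sketch of the opcode processor with an assumed opcode table: the opcode field is fetched only when input tokens arrive, and the operation then fixes the outgoing token's type and timing.

    LATENCY = {"add": 1, "mul": 2, "cmp": 1}    # assumed latencies
    PRED_OPS = {"cmp"}                          # ops producing predicates

    def opcode_processor(input_tokens, fetch_opcode):
        if not any(input_tokens):
            return None                 # FU idle: no opcode fetch at all
        op = fetch_opcode()             # read just the opcode field
        token_type = "pred" if op in PRED_OPS else "data"
        return token_type, LATENCY[op]  # emit token LATENCY cycles later

    assert opcode_processor([1, 1], lambda: "cmp") == ("pred", 1)
    assert opcode_processor([0, 0], lambda: "add") is None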
13. System Overview
[Figure: system overview of the datapath with the token network]
14. Experimental Setup
- Target multimedia applications for embedded systems
- Modulo scheduling for compute-intensive loops in 3D graphics, AAC decoder, AVC decoder (214 loops)
- Three different control path designs:
- baseline: fully decoded instructions
- static: fine-grain code compression with the instruction format stored in memory
- token: fine-grain code compression with the token network
15. Code Size / Performance
- Fine-grain code compression increases code efficiency
- The token network further improves code efficiency
- Performance degradation comes from sharing of fields and allowing only 2 dests
16. Power / Area
- SRAM read power is greatly reduced with the token network
- Introducing the token network slightly increases power and area
- Area overhead can be mitigated by the reduced SRAM area
- Hardware overhead for migrating staging predicates into the token network is minimal
17. Staging Predicates Optimization
- Modulo scheduled loops:
- Prolog (filling the pipeline)
- Kernel code (steady state)
- Epilog (draining the pipeline)
- Only the kernel code is stored in memory
- Staging predicates control the prolog/epilog phases
[Figure: overlapped execution of iterations i0, i1, i2 with initiation interval II]
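A worked sketch with illustrative numbers (3 stages, II = 1, 4 iterations): the staging predicate is exactly the per-stage activity mask, and only the all-stages-on kernel pattern needs to live in memory.

    STAGES, II, ITERS = 3, 1, 4            # illustrative numbers

    for cycle in range(STAGES + ITERS - 1):
        # Stage s runs iteration (cycle - s*II) if that iteration exists;
        # this activity mask is the staging predicate.
        active = [s for s in range(STAGES) if 0 <= cycle - s * II < ITERS]
        phase = ("prolog" if cycle < STAGES - 1
                 else "epilog" if cycle >= ITERS else "kernel")
        print("cycle %d: stages %s (%s)" % (cycle, active, phase))
    # Only the kernel pattern (all stages active) is stored in memory;
    # the staging predicate masks stages off during prolog and epilog.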
18. Migrating Staging Predicates
- Staging predicate: control information, not data dependent
- 10% of configurations are used for routing staging predicates
- Move staging predicates into the control path
- Increase each token by 1 bit: the staging predicate
- Only top nodes are guarded
- The staging predicate flows along with the tokens
- Benefits: code size reduction, performance increase
[Figure: stages 0-3, with data and the staging predicate flowing together]
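A sketch of the migrated encoding (assumed token layout): each token grows by one staging-predicate bit, only the top node of an iteration tests it, and downstream nodes simply inherit it as the token flows.

    def make_token(stage_enabled):
        # 1 valid bit + 1 staging-predicate bit per token (assumed layout)
        return {"valid": True, "stage_pred": stage_enabled}

    def top_node_fires(token):
        # Only top nodes are guarded: a cleared staging predicate squashes
        # the whole prolog/epilog iteration with no routing configurations.
        return token["valid"] and token["stage_pred"]

    def forward(token):
        return dict(token)              # predicate travels with the token

    t = make_token(stage_enabled=False) # prolog: this stage is still off
    assert not top_node_fires(forward(t))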
19. Code Size / Performance
- Code size reduction of 9%
- Migrating staging predicates improves performance by 7%
- 5% increase over baseline
20. Power / Area
- Power/area of the token network increase due to the valid bit
- Reduced code size decreases SRAM power/area
- Overall overhead for migrating staging predicates is minimal
21. Overall Power
[Chart: overall power, 226.4 mW (baseline) vs. 170.0 mW (token)]
- System power measured for a kernel loop in AVC
- Introducing the token network reduces overall system power by 25%, while achieving a 5% performance gain
22. Conclusion
- Fine-grain code compression is a good fit for CGRAs
- The token network can eliminate the instruction format overhead
- Dynamic discovery of the instruction format
- Small overhead (< 3%)
- Migrating staging predicates to the token network improves performance
- Applicable to other highly distributed architectures
23. Questions?
24. Token Sender
- Each output port of a resource is converted into a token sender
- FU outputs, routing MUX outputs, register file read ports
- Sends out tokens only to the consumers specified in the dest fields
- Allows only two destinations for each output, which potentially limits performance
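A sketch of the dest-field limit (assumed encoding with two dest slots per output): fanout above two must be split by the compiler, which is the source of the performance loss noted above.

    MAX_DESTS = 2                       # dest slots stored per output

    def send(dest_fields):
        if len(dest_fields) > MAX_DESTS:
            # The scheduler must split the fanout, e.g. by routing the
            # value through another PE, costing cycles and resources.
            raise ValueError("fanout > 2 must be split by the compiler")
        return list(dest_fields)

    assert send(["fu1.src0", "rf0.wr1"]) == ["fu1.src0", "rf0.wr1"]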
25. Token Receiver
- Input MUXes are converted to token receivers
- Dest fields are stored in memory, not src fields
- MUX selection bits are determined from the incoming token position
27. Who Generates Tokens?
- Tokens are generated at the start of dataflow: live-ins
- Tokens terminate when they get into a register file
- Tokens terminated in register files can be re-generated
- Read ports of register files generate tokens
- Token generation information for RF read ports is stored separately
- 26 read ports in a 4x4 CGRA
28. Reducing Decoder Complexity
[Figure: configuration memory partitioned into multiple MEM blocks, each with its own decoder, feeding the token network]
- Partitioning the configuration memory and decoder
- Trade-off between the number of memories and decoder complexity
- Design space exploration for memory partitioning
- Which fields are stored in the same memory?
- Sharing of field entries in the memory for under-utilized fields
29. Memory Partitioning
- Bundle fields with the same type (field width uniformity)
- Design space exploration result for a 4x4 CGRA
- sharing degree = total entries / total fields
- Reduces decoder complexity by 33% over naive partitioning
- Sharing incurs less than 1% performance degradation

  type      fields  memories  entries  total entries  sharing degree
  opcode      16       2         8          16             1.0
  dest        96       8         8          64             0.75
  const       16       2         6          12             0.75
  reg addr    48       4         6          24             0.5
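A quick check of the sharing-degree definition on three rows of the table (total entries = memories x entries per memory):

    bundles = {         # type: (fields, memories, entries per memory)
        "opcode":   (16, 2, 8),
        "const":    (16, 2, 6),
        "reg addr": (48, 4, 6),
    }
    for name, (fields, mems, entries) in bundles.items():
        total = mems * entries
        print("%s: %d / %d = %.2f" % (name, total, fields, total / fields))
    # opcode: 16 / 16 = 1.00, const: 12 / 16 = 0.75, reg addr: 24 / 48 = 0.50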