Title: Architectures and Design Tools for Embedded SOCs
1. Application-Adaptive Architectures
- Architectures and Design Tools for Embedded SOCs
- Rajesh K. Gupta
- Center for Embedded Computer Systems
- Department of Information and Computer Science
- University of California, Irvine
- http://www.cecs.uci.edu/rgupta
2. Semiconductor System Chips
- Two trends:
  - increasing use of embedded intelligence
  - networking of embedded intelligence
- In ten years:
  - the big: e.g., terabit optical core, gigabit wireless, ...
  - the small: e.g., pervasive self-powered sensor motes
  - the cheap: e.g., one-cent radios
    - short-range (10-100 m), low power (10 nJ/bit), low bit rate (1-100 kbps)
- The consequence:
  - smart spaces, intelligent interfaces, ad hoc networks
3. The Consequence: Pervasive Embedded Intelligence
4. Vision
- Explore technologies to build efficient embedded computer systems (ECS)
  - including design technology (i.e., CAD tools and methodologies)
[Figure: an SOC assembled from application-specific gates, I/O, code, processor cores, and memory]
- Exploit ECS technologies for real-life applications
  - from consumer electronics, automotive, and information appliances
5. Research Organization
- Research group divided into two focus sub-groups
- Design Automation (tools, methods)
  - Focus on tools and methodology for system design
  - Projects:
    - SPARK
    - BALBOA
    - COPPER
    - RADHA-RATAN
  - Sumit Gupta, Frederic Doucet, Ravindra Jejurikar, SunWoo Kim, Nick Saviou, Maarika Saarela, Mehrdad Reshadi, Sandeep Shukla
- System Design (problems, techniques)
  - Focus on design problems and techniques
  - Projects:
    - AMRM: Adaptive Memory Reconfiguration Management
    - PDS: Power-aware Distributed Systems
  - Dan Nicolaescu, Weiyu Tang, Cristiano Pereira, Paolo Alberto, Yuvraj Agrawal, Anjum Gupta, Manjari Chhawcharia, Mukesh Rajan
  - jointly with Prof. N. Dutt, A. Nicolau, A. Veidenbaum
- Website: http://www.cecs.uci.edu/<project>
6. Ongoing Projects
- Design Automation
  - POWER/PERFORMANCE OPTIMIZATION
    - COPPER: compiler-controlled power/performance optimization (DARPA)
    - Formal methods in power modeling and optimization (SRC, NSF)
  - HIGH-LEVEL SYNTHESIS: SPARK
    - Parallelizing compiler optimizations for HLS (Intel)
    - Integrated fine-grain and coarse-grain optimizations for HLS (SRC)
  - HIGH-LEVEL MODELING: BALBOA, RADHA-RATAN
    - System modeling and presynthesis optimizations (NSF)
    - Networked systems modeling (Synopsys, NSF)
    - Timing-driven codesign of NES (CoRe, Synopsys)
7. Ongoing Projects
- System Design
  - ADAPTIVE MEMORY RECONFIGURATION MANAGEMENT (AMRM)
    - Design of an adaptive memory hierarchy that optimizes placement and movement of data through the caches according to application needs.
  - POWER-AWARE DISTRIBUTED SYSTEMS (PDS)
    - Power-aware design in a sensor network environment: intra-nodal and network-level power modeling and management.
8. Driving Forces
- Advances in microelectronics technology continue to dominate ECS innovations
  - continued digitization of information
  - processing moves on-chip and closer to sensors, antennae
  - high-speed, high-capacity processing in VLSI chips
[Figure: wire delay, ns/cm, across technology generations]
- Evolutionary growth, but its effects are subtle and powerful!
9. Consider Interconnect
[Figure: interconnect delay versus gate delay scaling]
- Average interconnect delay is greater than the gate delays! The reduced marginal cost of logic, together with signal-regeneration needs, makes it possible to include logic in inter-block interconnect.
10. Rethinking VLSI Circuit Design When Interconnect Dominates
- DEVICE: choose better interconnect
  - copper, low-temperature interconnect
- CAD: choose better interconnect topology and sizes (see the delay-model sketch after this list)
  - minimize the path from the driver gate to each receiver gate
    - e.g., the A-tree algorithm yields about 12% reduction in delay
  - select wire sizes to minimize net delays
    - e.g., up to 35% reduction in delay by optimal sizing algorithms
- CIRCUIT: more repeaters, better clocking
  - frequent use of signal repeaters in block-level designs
    - longest interconnect ~2000 μm for a 0.3 μm process
  - a storage element no longer (always) defines a clock boundary
    - storage delay (~1.5x switching delay)
  - circuit block designs that work independently of data latencies
    - e.g., heterogeneous clocking interfaces [Yun, ICCD'96]
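To make the CAD bullet concrete, here is a minimal sketch in C of the first-order (Elmore) delay model that topology- and wire-sizing optimizations such as the A-tree algorithm minimize. All device and wire values are illustrative assumptions, not numbers from the talk.

    /* Elmore delay of a driver plus a uniform wire split into n_seg
     * RC segments, ending in a receiver gate load. */
    #include <stdio.h>

    double elmore_delay(double r_drv, double c_load,
                        double r_wire, double c_wire, int n_seg)
    {
        double rs = r_wire / n_seg, cs = c_wire / n_seg;
        double delay = 0.0, r_up = r_drv;  /* resistance seen upstream */
        for (int i = 0; i < n_seg; i++) {
            r_up += rs;
            delay += r_up * cs;  /* each segment's C charges through all upstream R */
        }
        delay += r_up * c_load;  /* receiver gate load */
        return delay;
    }

    int main(void)
    {
        /* ~1 cm of wire: 200 ohm, 2 pF (illustrative values) */
        printf("delay = %.3g s\n",
               elmore_delay(500.0, 10e-15, 200.0, 2e-12, 100));
        return 0;
    }

Wire sizing and repeater insertion both work by reducing the r_up * c terms this model accumulates along the longest source-to-sink path.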
11. Implications: Architectures
- Architectures to exploit interconnect delays
  - pipeline interconnect delays (recall the Cray-2)
  - cycle time >= max delay - min delay (restated below)
  - use interconnect delay as the minimum delay
  - need P&R estimates early in the design
- Algorithms that use interconnect latencies
  - interconnect as functional units
  - functional unit schedules are based on a measure of spatial distances
- Increase local decision making
  - multiple state transitions in a clock period
  - storage-controlled routing
  - re-programmable blocks in custom layouts
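Stated as a timing constraint (a standard wave-pipelining formulation; the setup and skew terms are added here for completeness and are not on the slide):

    $$ T_{\mathrm{cycle}} \;\ge\; \left(D_{\max} - D_{\min}\right) + t_{\mathrm{setup}} + t_{\mathrm{skew}} $$

Successive data waves traveling on the same pipelined interconnect must not collide, so the period is bounded by the spread between the longest and shortest path delays rather than by the maximum delay alone.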
12. Opportunity: Application-Adaptive Architectures
- Exploit architectural low-hanging fruit
  - performance variation across applications (10-100X)
  - performance variation across data sets (10X)
- Use interconnect and data-path reconfiguration to
  - increase performance,
  - combat performance fragility, and
  - improve fault tolerance
- Configurable hardware is used to improve utilization of performance-critical resources
  - instead of using configurable hardware to build additional resources
  - the design goal is to achieve peak performance across applications
  - configurable hardware is leveraged for efficient utilization of performance-critical resources
13. Architectural Macro-View
- Configurability enables a range of logical views
- Programmability-enabled, shared, cache-coherent address spaces can be augmented with
  - additional communication mechanisms, special protocols
- Adaptation to
  - computational, compositional, and hardware packaging structure
14. Architectural Adaptation
- Each of the following elements can benefit from increased adaptability (above and beyond CPU programming):
  - CPU
  - Memory hierarchy: eliminate false sharing
  - Memory system: virtual memory layout based on cache-miss data
  - I/O: disk layout based on access pattern
  - Network interface: scheduling to reduce end-to-end latency
- Adaptability used to build
  - programmable engines in I/O, memory controllers, cache controllers, network devices
  - configurable data-paths and logic in any part of the system
  - configurable queueing in scheduling for interconnect, devices, memory
  - smart interfaces for information flow from applications to hardware
  - performance monitoring and coordinated resource management...
- Intelligent interfaces, information formats, mechanisms, and policies.
15. Adaptation Challenges
- Is application-driven adaptation viable from a technology and cost point of view?
- How to structure adaptability to
  - maximize the performance benefits,
  - provide protection, multitasking, and a reasonable programming environment,
  - enable easy exploitation of adaptability through automatic or semi-automatic means.
- We focus on the memory hierarchy as the first candidate to explore the extent and utility of adaptation
  - Adaptive Memory Reconfiguration Management (AMRM)
16. Develop an adaptive cache memory architecture that achieves peak system efficiency by enabling applications to optimally manage the movement of data through the memory hierarchy.
17. 1. Architectural Perspective to Adaptivity
- What to adapt?
  - Cache organization, cache assists, hierarchy
  - Data layout, address mapping, virtual memory
- What drives adaptivity?
  - Performance improvement versus cost
- When to perform adaptive action(s)?
  - Compile time: insert code, set up hardware
  - Run time: use feedback from hardware
- Where to implement adaptivity?
  - Software, hardware
  - Software and hardware combined
18. Current Investigation
- Cache assists
  - choice of mechanisms; mechanism parameters (size, lookahead, etc.)
- Prefetching
  - Stream buffers
  - Stride-directed: based on address alone (see the sketch after this list)
  - Miss stride: prefetch the same address, using the number of intervening misses as lookahead
  - Pointer stride
- Victim cache, write buffer
- Adaptive cache organizations
  - line size, policies, adaptivity algorithms
- Hierarchy organization
  - what are the datapaths like?
  - how much parallelism is possible in the hierarchy?
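A minimal C sketch of a stride-directed predictor of the kind listed above, working from the miss-address stream alone; the confirmation policy is an illustrative assumption (a PC-indexed table would give the PC-based variant mentioned on a later slide).

    #include <stdint.h>

    /* Returns the address to prefetch, or 0 if no confident prediction. */
    uint32_t stride_predict(uint32_t miss_addr)
    {
        static uint32_t last_addr;   /* previous miss address */
        static int32_t  last_delta;  /* previous observed stride */

        int32_t delta = (int32_t)(miss_addr - last_addr);
        int confirmed = (delta != 0 && delta == last_delta);

        last_delta = delta;
        last_addr  = miss_addr;

        /* prefetch one stride ahead only when the stride repeats */
        return confirmed ? miss_addr + (uint32_t)delta : 0;
    }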
19. Architectural Adaptation for Memory Management
- Combat latency deterioration
  - optimal prefetching
  - memory-side pointer chasing
  - blocking mechanisms
  - fast barrier, broadcast support
  - synchronization support
- Bandwidth management
  - memory (re)organization to suit application characteristics
  - translate-and-gather hardware
  - prefetching with compaction
- Memory controller design
20. Adaptation for Latency Tolerance
21. Prefetching Experiments
- Operation
  - 1. Application sets prefetch parameters (compiler controlled)
    - set lower/upper bounds on memory regions (for memory protection, etc.)
    - download the pointer-extraction function
    - element size
  - 2. Prefetch event generation (runtime controlled)
    - when a new cache block is filled
- Application view (sketched in code below)
  - generate_event(pointer to pass matrix-element structure)
  - ...
  - generate_event(signal to enable prefetch)
  - <code on which prefetching is applied>
  - generate_event(signal to disable prefetch)
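A minimal C sketch of the application view above. generate_event(), the event codes, and the element-descriptor layout are hypothetical names standing in for the compiler-inserted calls the slide describes.

    #include <stddef.h>

    struct node { struct node *next; double val; };

    void process(struct node *p);                            /* application work */
    extern void generate_event(int event, const void *arg);  /* hypothetical AMRM call */

    enum { EV_SET_ELEMENT_DESC, EV_PREFETCH_ON, EV_PREFETCH_OFF };

    struct elem_desc {
        size_t size;      /* element size */
        size_t next_off;  /* offset of the pointer to follow */
        void  *lo, *hi;   /* lower/upper bounds of the legal memory region */
    };

    void traverse(struct node *head, struct elem_desc *d)
    {
        generate_event(EV_SET_ELEMENT_DESC, d);  /* download pointer-extraction info */
        generate_event(EV_PREFETCH_ON, NULL);    /* signal to enable prefetch */
        for (struct node *p = head; p != NULL; p = p->next)
            process(p);                          /* code on which prefetching is applied */
        generate_event(EV_PREFETCH_OFF, NULL);   /* signal to disable prefetch */
    }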
22. Adaptation for Bandwidth Reduction
- Prefetching entire row/column
- Pack cache with used data only (see the gather sketch below)
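A software analogue of the packing idea, as a sketch: gather only the used elements of a sparse row into a dense buffer so that fetched cache lines carry useful data only (the function and its arguments are illustrative, not from the talk).

    /* a[] is the dense source array; col_idx[] lists the nnz used positions */
    void gather_row(const double *a, const int *col_idx, int nnz, double *packed)
    {
        for (int i = 0; i < nnz; i++)
            packed[i] = a[col_idx[i]];  /* copy only useful words */
    }

The talk proposes translate-and-gather hardware that does this on the memory side, so the bandwidth saving is realized before data crosses the memory bus.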
23. Applications
- Computation kernels
  - SAXPY, large-stride vector fetch and store, irregular scatter/gather, 3D Jacobi kernel, 3D Jacobi kernel with large local computation, tree-matching
  - Sparse-matrix library for matrix-matrix and matrix-vector operations
- Application codes
  - MP3D, Adaptive Mesh Refinement (HAMR), ray casting, hierarchical radiosity computation
24. Simulation Results: Latency
- Sparse MM: blocking, prefetching, packing (all based on application data structures)
[Figure: cache miss rate comparison]
- 10X reduction in latency using application data-structure optimization
25. Simulation Results: Bandwidth
- Optimizations designed to significantly reduce the volume of data traffic
  - efficient fast-storage management, packing, prefetch
[Figure: data traffic (MB) comparison]
- 100X reduction in bandwidth using application-specific packing and fetching
26. Hardware Cost
- Area measured in terms of LSI Logic LCA 10K standard cells or Xilinx CLBs
- Approximately 24 KB of programming
- Programming cost is much less than the cost of programming any single application
- Maintains architectural usability by keeping the same programming model
27. L1 Cache Assists
28. Configurations
- 32KB L1 data cache, 32B lines, direct-mapped
- 0.5MB L2 cache, 64B lines, direct-mapped
- 8-line write buffer
- Latencies
  - 1-cycle L1, 8-cycle L2, 60-cycle memory
  - 1-cycle prefetch buffer, write buffer, victim cache
- All three mechanisms at once
29. A. Effect of Adaptive Mechanisms
30. Mechanism Effectiveness Over Time
31. (figure-only slide; no transcript)
32. Observed Behavior
- Wide variability in effectiveness is seen
  - between individual programs
  - within a program as a function of time
- Both facts indicate a likely improvement from adaptivity
  - select the better among the mechanisms
- Even more can be expected from adaptively re-allocating from the combined buffer pool
  - to reduce stall time
  - to reduce the number of misses
- Proposal: a hardware mechanism to select between assist types and allocate buffer space
- Give the compiler an opportunity to set parameters.
33. Implementing Adaptive Mechanisms
- Hardware
  - a common pool of (small) n-word buffers
  - a set of possible policies, a subset of:
    - prefetchers: stride-directed, PC-based, history-based
    - victim cache, write buffer
  - performance monitors for each type/buffer
    - misses, stall time on hit, thresholds
- Dynamic buffer allocation among mechanisms (a sketch follows this list)
  - predict future behavior from the observed past
  - observe in time interval ΔT, set for the next ΔT
  - save performance trends in next-level tags (< 8 bits)
- Adapt the following:
  - number of buffers per mechanism
    - may also include control, e.g., prediction tables
  - prefetch lookahead (buffer depth)
    - increase when buffers fill up and stalls persist
  - adaptivity interval
34. B. Optimum Configurations
[Figures: effect of line size over time; miss rate vs. line size]
35. B. Adaptive L1 Data Cache Organization
- Adaptive line size
  - use a baseline line size, but fetch and replace a variable number of words (a cache with a small physical line size)
- Requires three capabilities:
  - ability to modify hardware parameters dynamically
  - ability to profile and monitor performance over time
  - a scheduling algorithm to direct the adaptation process
    - balance adaptivity advantages against overheads and memory bandwidth increase
- Differences from a fixed-line-size cache
  - multiple line sizes coexist in the cache
  - one or multiple lines can be replaced on one miss fetch
- Design questions
  - when to change the line size?
  - how to change it?
  - what information to keep, and where?
  - how to tie this decision making to the compiler and runtime?
36. Cache Line Size Reconfiguration
- Change on line replacements
  - history-based spatial-locality prediction for line-size change
  - tag lookup delay is not affected
- On a cache miss: test whether adjacent lines of the incoming cache line are in the cache
- On a cache hit: mark the hit for the smallest component line
- Information to keep (see the data-structure sketch below)
  - each physical line needs (in addition to the standard tag):
    - current virtual line size: 3 bits
    - adjacent bit: 1 bit
    - virtual-line usage counter (saturating): 2 bits
    - initial line size (set using compiler help)
- Optimization for traffic versus miss rate
  - optimize for traffic: line-size decrease has priority (decrease fast)
  - optimize for miss rate: line-size increase has priority (increase fast)
  - optimize for both: use a state machine to control increase or decrease
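A sketch of the per-line state listed above and assumed by the pseudocode on the next two slides, packed as a C bitfield for illustration (real hardware would keep these bits alongside the tag; the field encodings are assumptions).

    struct line_meta {
        unsigned vsize    : 3;  /* current virtual line size (encoded) */
        unsigned adjacent : 1;  /* adjacent line of same virtual size present */
        unsigned usage    : 2;  /* saturating virtual-line usage counter */
    };

    /* Saturating increment of the usage counter on a hit. */
    static inline void usage_hit(struct line_meta *m)
    {
        if (m->usage < 3)
            m->usage++;
    }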
37. Line Size Reconfiguration

miss_fetch:
  Input: cache miss address addr
  Tag lookup for addr returns l2, the line size for the cache entry e
    currently in the cache
  A miss fetch is initiated, bringing in the line at addr with its line size l1
  If l1 > l2 then
    For each line to be replaced by the line at addr do
      Get the entry ei for this line and its length li
      line_size_analysis(ei, li)
      If the size changed, or the line is modified, then write it to memory
    Endfor
  Else
    Decide on the next size for the replaced line:
      decrease_line_size_request(e)
  Endif
  If an adjacent cache line of the same size is now in the cache then
    set the adjacent bit for both adjacent lines
  Endif
38Line Size Analysis
line_size_analysis Input a cache entry e and
its size l A cache line consists of words
e1,,el If all of the words e1 el/2 or all
of el/2,,el are not used then decrease_line_si
ze_request(e) If the adjacentcache line is
in the cache then reset its adjacent
bit Endif Elseif adjacent bit is set AND the
adjacent cache line is not in the cache
AND most of the entries have low
usage then increase_line_size_request(e) Endif
39. Observations
[Figure: miss rates for fixed and adaptive line sizes]
- Adaptivity works
  - miss rates for different starting line sizes are almost identical
  - average line sizes are almost identical
- Adaptive reconfiguration improves performance
  - near-minimum miss rate over the execution; up to 50% reduction in memory traffic
- However, the rate at which reconfiguration takes place is an important parameter
  - line-size adjustment is frequent: 25-50% of all lines fetched either increase or decrease on replacement.
40. Adaptivity in Scheduling Algorithm
Notes: 1. Data are for six benchmarks. 2. The miss-rate ratio is computed by dividing by the miss rate of a fixed 32-byte-line cache. 3. The traffic ratio is computed by dividing by the traffic of a fixed 32-byte-line cache. 4. Three points coincide at (1,1).
41. Experimental Framework
- Compiler
  - MIPSpro compilers
  - MIPS-III ISA, 32-bit executables
  - -O2 optimization flag
  - Target processor: R4000
- Simulator
  - Processor: MINT-3, single issue, statically scheduled
  - Cache driven by MINT-3-generated addresses
- Basic hardware configuration
  - R4000 instruction set
  - 1-cycle computational instructions
  - 4B and 8B loads and stores
  - Direct-mapped, write-back, 16KB cache
  - 8B to 256B cache line sizes
42. The AMRM Prototype
- Divided into two phases
- Phase 1
  - A board-level, processor-independent memory system
    - memory hierarchy implemented using generic FPGAs
  - Provides a pre-VLSI prototype
  - Ability to change system timing parameters
    - uses a virtual clock and virtual system timing
    - allows modifying latencies and bus delays without a redesign
    - the real clock is fixed at 66MHz in the implementation
- Phase 2
  - A processor-specific ASIC implementation, the AMRM chip
  - Ability to explore validation and protection mechanisms
43. (figure-only slide; no transcript)
44. AMRM Phase I Board
45. AMRM Phase I Board Implementation
- A 33MHz, 32-bit PCI interface; 256MB DRAM
- Performance counters readable as memory
- A virtual clock unit that computes target execution time
- A daughter-card interface to the memory bus
- Reconfigurable control unit
- Software-configurable L1 cache
  - 1MByte of SRAM
  - An application can change, at any time:
    - cache size: 8KB to 512KB
    - line size: 8B to 512B
    - write policy: write-back or write-through
- Software specifies the virtual access time
46. Reconfiguration Capabilities
- Virtual time management, cache control, cache configuration, etc., are implemented in an FPGA
  - algorithms can be changed
- A versatile daughter card allows addition of new modules to the memory bus (after L1)
47. AMRM Phase I Prototype
48. Software API
- Relocate user data to AMRM memory
  - run an application out of the AMRM memory hierarchy
- Commands for L1 cache reconfiguration (illustrated in the sketch below)
  - memory-mapped
  - select one of the cache sizes
  - select one of the line sizes
  - software cache invalidate, barrier
  - the card can buffer requests
- Compiler instruments a program
  - replaces LD/ST with accesses to AMRM memory
  - inserts reconfiguration commands
  - supplies computation time
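An illustrative C fragment using memory-mapped reconfiguration commands of the kind the API describes. AMRM_BASE and the register offsets are hypothetical stand-ins for the board's real register map.

    #include <stdint.h>

    #define AMRM_BASE        0xF0000000u  /* assumed PCI BAR mapping */
    #define AMRM_CACHE_SIZE  (*(volatile uint32_t *)(AMRM_BASE + 0x00))
    #define AMRM_LINE_SIZE   (*(volatile uint32_t *)(AMRM_BASE + 0x04))
    #define AMRM_INVALIDATE  (*(volatile uint32_t *)(AMRM_BASE + 0x08))

    void reconfigure_for_streaming(void)
    {
        AMRM_INVALIDATE = 1;         /* flush before changing geometry   */
        AMRM_CACHE_SIZE = 64 * 1024; /* 64 KB (8 KB..512 KB supported)   */
        AMRM_LINE_SIZE  = 256;       /* 256 B lines (8 B..512 B)         */
    }

The compiler would insert calls like this at phase boundaries it identifies, alongside the LD/ST rewriting the slide mentions.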
49. (figure-only slide; no transcript)
50. AMRM Phase II Prototype
- From emulation board to a complete adaptive single-board system
  - both data- and instruction-cache adaptations
- Applications are downloaded; profiling information is fed back to the processor for compiler-assisted insertion of reconfiguration commands
- Online reconfigurability, hardware assisted
- Realistic delay characteristics
51. Summary
- Semiconductor advances are bringing powerful changes to how systems are architected and built
  - they challenge underlying assumptions of synchronous digital hardware design
  - interconnect (local and global) dominates architectural choices; local decision making is essentially free
  - in particular, hardware can be made adaptable using CAD tools
- New architectural capabilities
  - e.g., achieve peak performance by adapting machine capabilities to application and data characteristics
  - the current investigation has demonstrated the feasibility of cache assists and configurations that can be adapted on a per-application basis
  - compiler control of the adaptation is key to the usability of the architecture
- AMRM is a proxy for customized architectures
  - enabled by advances in CAD tools and design flows
  - maturing CAD tools are beginning to provide reasonable QoR on automated flows from high-level models (C/C++ based) to synthesized and mapped circuits, particularly for reconfigurable technologies
  - concurrent system design and application development