Title: Architectures and Design Tools for Embedded SOCs
1. Application-Adaptive Architectures
- Architectures and Design Tools for Embedded SOCs
- Rajesh K. Gupta
- Center for Embedded Computer Systems
- Department of Information and Computer Science
- University of California, Irvine
- http://www.cecs.uci.edu/rgupta
2. Semiconductor System Chips
- Two trends:
  - increasing use of embedded intelligence
  - networking of embedded intelligence
- In ten years:
  - the big: e.g., terabit optical core, gigabit wireless, ...
  - the small: e.g., pervasive self-powered sensor motes
  - the cheap: e.g., one-cent radios
    - short-range (10-100 m), low power (10 nJ/bit), low bit rate (1-100 kbps)
- The consequence:
  - smart spaces, intelligent interfaces, ad hoc networks
3. The Consequence: Pervasive Embedded Intelligence
4. Vision
- Explore technologies to build efficient embedded computer systems (ECS)
  - including design technology (i.e., CAD tools and methodologies)
[Figure: an SOC assembled from application-specific gates, I/O, code, processor cores, and memory]
- Exploit ECS technologies for real-life applications
  - from consumer electronics, automotive, and information appliances
5. Research Organization
- Research group divided into two focus sub-groups
- Design Automation (tools, methods)
  - Focus on tools and methodology for system design
  - Projects:
    - SPARK
    - BALBOA
    - COPPER
    - RADHA-RATAN
  - Sumit Gupta, Frederic Doucet, Ravindra Jejurikar, SunWoo Kim, Nick Saviou, Maarika Saarela, Mehrdad Reshadi, Sandeep Shukla
- System Design (problems, techniques)
  - Focus on design problems and techniques
  - Projects:
    - AMRM: Adaptive Memory Reconfiguration Management
    - PDS: Power-aware Distributed Systems
  - Dan Nicolaescu, Weiyu Tang, Cristiano Pereira, Paolo Alberto, Yuvraj Agrawal, Anjum Gupta, Manjari Chhawcharia, Mukesh Rajan
  - jointly with Prof. N. Dutt, A. Nicolau, A. Veidenbaum
- Website: http://www.cecs.uci.edu/<project>
6. Ongoing Projects
- Design Automation
  - POWER/PERFORMANCE OPTIMIZATION
    - COPPER: compiler-controlled power/performance optimization (DARPA)
    - Formal methods in power modeling and optimization (SRC, NSF)
  - HIGH-LEVEL SYNTHESIS: SPARK
    - Parallelizing compiler optimizations for HLS (Intel)
    - Integrated fine-grain and coarse-grain optimizations for HLS (SRC)
  - HIGH-LEVEL MODELING: BALBOA, RADHA-RATAN
    - System modeling and presynthesis optimizations (NSF)
    - Networked systems modeling (Synopsys, NSF)
    - Timing-driven codesign of NES (CoRe, Synopsys)
7. Ongoing Projects
- System Design
  - ADAPTIVE MEMORY RECONFIGURATION MANAGEMENT (AMRM)
    - Design of an adaptive memory hierarchy that optimizes placement and movement of data through the caches according to application needs.
  - POWER-AWARE DISTRIBUTED SYSTEMS (PDS)
    - Power-aware design in a sensor network environment: intra-nodal and network-level power modeling and management.
8. Driving Forces
- Advances in microelectronics technology continue to dominate ECS innovations
  - continued digitization of information
  - processing moves on-chip and closer to sensors, antennae
  - high-speed, high-capacity processing in VLSI chips
[Figure: wire delay, ns/cm, across technology generations]
- Evolutionary growth, but its effects are subtle and powerful!
9. Consider Interconnect
[Figure: interconnect delay versus gate delay scaling]
- Average interconnect delay is greater than the gate delays! The reduced marginal cost of logic, together with signal-regeneration needs, makes it possible to include logic in inter-block interconnect.
10. Rethinking VLSI Circuit Design When Interconnect Dominates
- DEVICE: choose better interconnect
  - copper, low-temperature interconnect
- CAD: choose better interconnect topology and sizes (see the delay-model sketch after this list)
  - minimize the path from the driver gate to each receiver gate
    - e.g., the A-tree algorithm yields about 12% reduction in delay
  - select wire sizes to minimize net delays
    - e.g., up to 35% reduction in delay by optimal sizing algorithms
- CIRCUIT: more repeaters, better clocking
  - frequent use of signal repeaters in block-level designs
    - longest interconnect ~2000 μm for a 0.3 μm process
  - a storage element no longer (always) defines a clock boundary
    - storage delay (~1.5x switching delay)
  - circuit block designs that work independently of data latencies
    - e.g., heterogeneous clocking interfaces [Yun, ICCD'96]
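To make the CAD bullet concrete, here is a minimal sketch in C of the first-order (Elmore) delay model that topology- and wire-sizing optimizations such as the A-tree algorithm minimize. All device and wire values are illustrative assumptions, not numbers from the talk.

    /* Elmore delay of a driver plus a uniform wire split into n_seg
     * RC segments, ending in a receiver gate load. */
    #include <stdio.h>

    double elmore_delay(double r_drv, double c_load,
                        double r_wire, double c_wire, int n_seg)
    {
        double rs = r_wire / n_seg, cs = c_wire / n_seg;
        double delay = 0.0, r_up = r_drv;  /* resistance seen upstream */
        for (int i = 0; i < n_seg; i++) {
            r_up += rs;
            delay += r_up * cs;  /* each segment's C charges through all upstream R */
        }
        delay += r_up * c_load;  /* receiver gate load */
        return delay;
    }

    int main(void)
    {
        /* ~1 cm of wire: 200 ohm, 2 pF (illustrative values) */
        printf("delay = %.3g s\n",
               elmore_delay(500.0, 10e-15, 200.0, 2e-12, 100));
        return 0;
    }

Wire sizing and repeater insertion both work by reducing the r_up * c terms this model accumulates along the longest source-to-sink path.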
11. Implications: Architectures
- Architectures to exploit interconnect delays
  - pipeline interconnect delays (recall the Cray-2)
  - cycle time >= max delay - min delay (restated below)
  - use interconnect delay as the minimum delay
  - need P&R estimates early in the design
- Algorithms that use interconnect latencies
  - interconnect as functional units
  - functional unit schedules are based on a measure of spatial distances
- Increase local decision making
  - multiple state transitions in a clock period
  - storage-controlled routing
  - re-programmable blocks in custom layouts
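Stated as a timing constraint (a standard wave-pipelining formulation; the setup and skew terms are added here for completeness and are not on the slide):

    $$ T_{\mathrm{cycle}} \;\ge\; \left(D_{\max} - D_{\min}\right) + t_{\mathrm{setup}} + t_{\mathrm{skew}} $$

Successive data waves traveling on the same pipelined interconnect must not collide, so the period is bounded by the spread between the longest and shortest path delays rather than by the maximum delay alone.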
12. Opportunity: Application-Adaptive Architectures
- Exploit architectural low-hanging fruit
  - performance variation across applications (10-100X)
  - performance variation across data sets (10X)
- Use interconnect and data-path reconfiguration to
  - increase performance,
  - combat performance fragility, and
  - improve fault tolerance
- Configurable hardware is used to improve utilization of performance-critical resources
  - instead of using configurable hardware to build additional resources
  - the design goal is to achieve peak performance across applications
  - configurable hardware is leveraged for efficient utilization of performance-critical resources
13. Architectural Macro-View
- Configurability enables a range of logical views
- Programmability-enabled, shared, cache-coherent address spaces can be augmented with
  - additional communication mechanisms, special protocols
- Adaptation to
  - computational, compositional, and hardware packaging structure
14. Architectural Adaptation
- Each of the following elements can benefit from increased adaptability (above and beyond CPU programming):
  - CPU
  - Memory hierarchy: eliminate false sharing
  - Memory system: virtual memory layout based on cache-miss data
  - I/O: disk layout based on access pattern
  - Network interface: scheduling to reduce end-to-end latency
- Adaptability used to build
  - programmable engines in I/O, memory controllers, cache controllers, network devices
  - configurable data-paths and logic in any part of the system
  - configurable queueing in scheduling for interconnect, devices, memory
  - smart interfaces for information flow from applications to hardware
  - performance monitoring and coordinated resource management...
- Intelligent interfaces, information formats, mechanisms, and policies.
15. Adaptation Challenges
- Is application-driven adaptation viable from a technology and cost point of view?
- How to structure adaptability to
  - maximize the performance benefits,
  - provide protection, multitasking, and a reasonable programming environment,
  - enable easy exploitation of adaptability through automatic or semi-automatic means.
- We focus on the memory hierarchy as the first candidate to explore the extent and utility of adaptation
  - Adaptive Memory Reconfiguration Management (AMRM)
16. Develop an adaptive cache memory architecture that achieves peak system efficiency by enabling applications to optimally manage the movement of data through the memory hierarchy.
17. 1. Architectural Perspective to Adaptivity
- What to adapt?
  - Cache organization, cache assists, hierarchy
  - Data layout, address mapping, virtual memory
- What drives adaptivity?
  - Performance improvement versus cost
- When to perform adaptive action(s)?
  - Compile time: insert code, set up hardware
  - Run time: use feedback from hardware
- Where to implement adaptivity?
  - Software, hardware
  - Software and hardware combined
18. Current Investigation
- Cache assists
  - choice of mechanisms; mechanism parameters (size, lookahead, etc.)
- Prefetching
  - Stream buffers
  - Stride-directed: based on address alone (see the sketch after this list)
  - Miss stride: prefetch the same address, using the number of intervening misses as lookahead
  - Pointer stride
- Victim cache, write buffer
- Adaptive cache organizations
  - line size, policies, adaptivity algorithms
- Hierarchy organization
  - what are the datapaths like?
  - how much parallelism is possible in the hierarchy?
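A minimal C sketch of a stride-directed predictor of the kind listed above, working from the miss-address stream alone; the confirmation policy is an illustrative assumption (a PC-indexed table would give the PC-based variant mentioned on a later slide).

    #include <stdint.h>

    /* Returns the address to prefetch, or 0 if no confident prediction. */
    uint32_t stride_predict(uint32_t miss_addr)
    {
        static uint32_t last_addr;   /* previous miss address */
        static int32_t  last_delta;  /* previous observed stride */

        int32_t delta = (int32_t)(miss_addr - last_addr);
        int confirmed = (delta != 0 && delta == last_delta);

        last_delta = delta;
        last_addr  = miss_addr;

        /* prefetch one stride ahead only when the stride repeats */
        return confirmed ? miss_addr + (uint32_t)delta : 0;
    }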
19. Architectural Adaptation for Memory Management
- Combat latency deterioration
  - optimal prefetching
  - memory-side pointer chasing
  - blocking mechanisms
  - fast barrier, broadcast support
  - synchronization support
- Bandwidth management
  - memory (re)organization to suit application characteristics
  - translate-and-gather hardware
  - prefetching with compaction
- Memory controller design
20. Adaptation for Latency Tolerance
21. Prefetching Experiments
- Operation
  - 1. Application sets prefetch parameters (compiler controlled)
    - set lower/upper bounds on memory regions (for memory protection, etc.)
    - download the pointer-extraction function
    - element size
  - 2. Prefetch event generation (runtime controlled)
    - when a new cache block is filled
- Application view (sketched in code below)
  - generate_event(pointer to pass matrix-element structure)
  - ...
  - generate_event(signal to enable prefetch)
  - <code on which prefetching is applied>
  - generate_event(signal to disable prefetch)
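A minimal C sketch of the application view above. generate_event(), the event codes, and the element-descriptor layout are hypothetical names standing in for the compiler-inserted calls the slide describes.

    #include <stddef.h>

    struct node { struct node *next; double val; };

    void process(struct node *p);                            /* application work */
    extern void generate_event(int event, const void *arg);  /* hypothetical AMRM call */

    enum { EV_SET_ELEMENT_DESC, EV_PREFETCH_ON, EV_PREFETCH_OFF };

    struct elem_desc {
        size_t size;      /* element size */
        size_t next_off;  /* offset of the pointer to follow */
        void  *lo, *hi;   /* lower/upper bounds of the legal memory region */
    };

    void traverse(struct node *head, struct elem_desc *d)
    {
        generate_event(EV_SET_ELEMENT_DESC, d);  /* download pointer-extraction info */
        generate_event(EV_PREFETCH_ON, NULL);    /* signal to enable prefetch */
        for (struct node *p = head; p != NULL; p = p->next)
            process(p);                          /* code on which prefetching is applied */
        generate_event(EV_PREFETCH_OFF, NULL);   /* signal to disable prefetch */
    }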
22. Adaptation for Bandwidth Reduction
- Prefetching entire row/column
- Pack cache with used data only (see the gather sketch below)
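A software analogue of the packing idea, as a sketch: gather only the used elements of a sparse row into a dense buffer so that fetched cache lines carry useful data only (the function and its arguments are illustrative, not from the talk).

    /* a[] is the dense source array; col_idx[] lists the nnz used positions */
    void gather_row(const double *a, const int *col_idx, int nnz, double *packed)
    {
        for (int i = 0; i < nnz; i++)
            packed[i] = a[col_idx[i]];  /* copy only useful words */
    }

The talk proposes translate-and-gather hardware that does this on the memory side, so the bandwidth saving is realized before data crosses the memory bus.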
23. Applications
- Computation kernels
  - SAXPY, large-stride vector fetch and store, irregular scatter/gather, 3D Jacobi kernel, 3D Jacobi kernel with large local computation, tree-matching
  - Sparse-matrix library for matrix-matrix and matrix-vector operations
- Application codes
  - MP3D, Adaptive Mesh Refinement (HAMR), ray casting, hierarchical radiosity computation
24. Simulation Results: Latency
- Sparse MM: blocking, prefetching, packing (all based on application data structures)
[Figure: cache miss rate comparison]
- 10X reduction in latency using application data-structure optimization
25. Simulation Results: Bandwidth
- Optimizations designed to significantly reduce the volume of data traffic
  - efficient fast-storage management, packing, prefetch
[Figure: data traffic (MB) comparison]
- 100X reduction in bandwidth using application-specific packing and fetching
26. Hardware Cost
- Area measured in terms of LSI Logic LCA 10K standard cells or Xilinx CLBs
- Approximately 24 KB of programming
- Programming cost is much less than the cost of programming any single application
- Maintains architectural usability by keeping the same programming model
27. L1 Cache Assists
28. Configurations
- 32KB L1 data cache, 32B lines, direct-mapped
- 0.5MB L2 cache, 64B lines, direct-mapped
- 8-line write buffer
- Latencies
  - 1-cycle L1, 8-cycle L2, 60-cycle memory
  - 1-cycle prefetch buffer, write buffer, victim cache
- All three mechanisms at once
29. A. Effect of Adaptive Mechanisms
30. Mechanism Effectiveness Over Time
31. (figure-only slide; no transcript)
32. Observed Behavior
- Wide variability in effectiveness is seen
  - between individual programs
  - within a program as a function of time
- Both facts indicate a likely improvement from adaptivity
  - select the better among the mechanisms
- Even more can be expected from adaptively re-allocating from the combined buffer pool
  - to reduce stall time
  - to reduce the number of misses
- Proposal: a hardware mechanism to select between assist types and allocate buffer space
- Give the compiler an opportunity to set parameters.
33. Implementing Adaptive Mechanisms
- Hardware
  - a common pool of (small) n-word buffers
  - a set of possible policies, a subset of:
    - prefetchers: stride-directed, PC-based, history-based
    - victim cache, write buffer
  - performance monitors for each type/buffer
    - misses, stall time on hit, thresholds
- Dynamic buffer allocation among mechanisms (a sketch follows this list)
  - predict future behavior from the observed past
  - observe in time interval ΔT, set for the next ΔT
  - save performance trends in next-level tags (< 8 bits)
- Adapt the following:
  - number of buffers per mechanism
    - may also include control, e.g., prediction tables
  - prefetch lookahead (buffer depth)
    - increase when buffers fill up and stalls persist
  - adaptivity interval
34. B. Optimum Configurations
[Figures: effect of line size over time; miss rate vs. line size]
35. B. Adaptive L1 Data Cache Organization
- Adaptive line size
  - use a baseline line size, but fetch and replace a variable number of words (a cache with a small physical line size)
- Requires three capabilities:
  - ability to modify hardware parameters dynamically
  - ability to profile and monitor performance over time
  - a scheduling algorithm to direct the adaptation process
    - balance adaptivity advantages against overheads and memory bandwidth increase
- Differences from a fixed-line-size cache
  - multiple line sizes coexist in the cache
  - one or multiple lines can be replaced on one miss fetch
- Design questions
  - when to change the line size?
  - how to change it?
  - what information to keep, and where?
  - how to tie this decision making to the compiler and runtime?
36. Cache Line Size Reconfiguration
- Change on line replacements
  - history-based spatial-locality prediction for line-size change
  - tag lookup delay is not affected
- On a cache miss: test whether adjacent lines of the incoming cache line are in the cache
- On a cache hit: mark the hit for the smallest component line
- Information to keep (see the data-structure sketch below)
  - each physical line needs (in addition to the standard tag):
    - current virtual line size: 3 bits
    - adjacent bit: 1 bit
    - virtual-line usage counter (saturating): 2 bits
    - initial line size (set using compiler help)
- Optimization for traffic versus miss rate
  - optimize for traffic: line-size decrease has priority (decrease fast)
  - optimize for miss rate: line-size increase has priority (increase fast)
  - optimize for both: use a state machine to control increase or decrease
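A sketch of the per-line state listed above and assumed by the pseudocode on the next two slides, packed as a C bitfield for illustration (real hardware would keep these bits alongside the tag; the field encodings are assumptions).

    struct line_meta {
        unsigned vsize    : 3;  /* current virtual line size (encoded) */
        unsigned adjacent : 1;  /* adjacent line of same virtual size present */
        unsigned usage    : 2;  /* saturating virtual-line usage counter */
    };

    /* Saturating increment of the usage counter on a hit. */
    static inline void usage_hit(struct line_meta *m)
    {
        if (m->usage < 3)
            m->usage++;
    }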
37. Line Size Reconfiguration

miss_fetch:
  Input: cache miss address addr
  Tag lookup for addr returns l2, the line size for the cache entry e
    currently in the cache
  A miss fetch is initiated, bringing in the line at addr with its line size l1
  If l1 > l2 then
    For each line to be replaced by the line at addr do
      Get the entry ei for this line and its length li
      line_size_analysis(ei, li)
      If the size changed, or the line is modified, then write it to memory
    Endfor
  Else
    Decide on the next size for the replaced line:
      decrease_line_size_request(e)
  Endif
  If an adjacent cache line of the same size is now in the cache then
    set the adjacent bit for both adjacent lines
  Endif
38Line Size Analysis
line_size_analysis Input a cache entry e and
its size l A cache line consists of words
e1,,el If all of the words e1 el/2 or all
of el/2,,el are not used then decrease_line_si
ze_request(e) If the adjacentcache line is
in the cache then reset its adjacent
bit Endif Elseif adjacent bit is set AND the
adjacent cache line is not in the cache
AND most of the entries have low
usage then increase_line_size_request(e) Endif
39. Observations
[Figure: miss rates for fixed and adaptive line sizes]
- Adaptivity works
  - miss rates for different starting line sizes are almost identical
  - average line sizes are almost identical
- Adaptive reconfiguration improves performance
  - near-minimum miss rate over the execution; up to 50% reduction in memory traffic
- However, the rate at which reconfiguration takes place is an important parameter
  - line-size adjustment is frequent: 25-50% of all lines fetched either increase or decrease on replacement.
40. Adaptivity in Scheduling Algorithm
Notes: 1. Data are for six benchmarks. 2. The miss-rate ratio is computed by dividing by the miss rate of a fixed 32-byte-line cache. 3. The traffic ratio is computed by dividing by the traffic of a fixed 32-byte-line cache. 4. Three points coincide at (1,1).
41. Experimental Framework
- Compiler
  - MIPSpro compilers
  - MIPS-III ISA, 32-bit executables
  - -O2 optimization flag
  - Target processor: R4000
- Simulator
  - Processor: MINT-3, single issue, statically scheduled
  - Cache driven by MINT-3-generated addresses
- Basic hardware configuration
  - R4000 instruction set
  - 1-cycle computational instructions
  - 4B and 8B loads and stores
  - Direct-mapped, write-back, 16KB cache
  - 8B to 256B cache line sizes
42. The AMRM Prototype
- Divided into two phases
- Phase 1
  - A board-level, processor-independent memory system
    - memory hierarchy implemented using generic FPGAs
  - Provides a pre-VLSI prototype
  - Ability to change system timing parameters
    - uses a virtual clock and virtual system timing
    - allows modifying latencies and bus delays without a redesign
    - the real clock is fixed at 66MHz in the implementation
- Phase 2
  - A processor-specific ASIC implementation, the AMRM chip
  - Ability to explore validation and protection mechanisms
43. (figure-only slide; no transcript)
44. AMRM Phase I Board
45. AMRM Phase I Board Implementation
- A 33MHz, 32-bit PCI interface; 256MB DRAM
- Performance counters readable as memory
- A virtual clock unit that computes target execution time
- A daughter-card interface to the memory bus
- Reconfigurable control unit
- Software-configurable L1 cache
  - 1MByte of SRAM
  - An application can change, at any time:
    - cache size: 8KB to 512KB
    - line size: 8B to 512B
    - write policy: write-back or write-through
- Software specifies the virtual access time
46. Reconfiguration Capabilities
- Virtual time management, cache control, cache configuration, etc., are implemented in an FPGA
  - algorithms can be changed
- A versatile daughter card allows addition of new modules to the memory bus (after L1)
47. AMRM Phase I Prototype
48. Software API
- Relocate user data to AMRM memory
  - run an application out of the AMRM memory hierarchy
- Commands for L1 cache reconfiguration (illustrated in the sketch below)
  - memory-mapped
  - select one of the cache sizes
  - select one of the line sizes
  - software cache invalidate, barrier
  - the card can buffer requests
- Compiler instruments a program
  - replaces LD/ST with accesses to AMRM memory
  - inserts reconfiguration commands
  - supplies computation time
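An illustrative C fragment using memory-mapped reconfiguration commands of the kind the API describes. AMRM_BASE and the register offsets are hypothetical stand-ins for the board's real register map.

    #include <stdint.h>

    #define AMRM_BASE        0xF0000000u  /* assumed PCI BAR mapping */
    #define AMRM_CACHE_SIZE  (*(volatile uint32_t *)(AMRM_BASE + 0x00))
    #define AMRM_LINE_SIZE   (*(volatile uint32_t *)(AMRM_BASE + 0x04))
    #define AMRM_INVALIDATE  (*(volatile uint32_t *)(AMRM_BASE + 0x08))

    void reconfigure_for_streaming(void)
    {
        AMRM_INVALIDATE = 1;         /* flush before changing geometry   */
        AMRM_CACHE_SIZE = 64 * 1024; /* 64 KB (8 KB..512 KB supported)   */
        AMRM_LINE_SIZE  = 256;       /* 256 B lines (8 B..512 B)         */
    }

The compiler would insert calls like this at phase boundaries it identifies, alongside the LD/ST rewriting the slide mentions.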
49. (figure-only slide; no transcript)
50. AMRM Phase II Prototype
- From emulation board to a complete adaptive single-board system
  - both data- and instruction-cache adaptations
- Applications are downloaded; profiling information is fed back to the processor for compiler-assisted insertion of reconfiguration commands
- Online reconfigurability, hardware assisted
- Realistic delay characteristics
51. Summary
- Semiconductor advances are bringing powerful changes to how systems are architected and built
  - they challenge underlying assumptions of synchronous digital hardware design
  - interconnect (local and global) dominates architectural choices; local decision making is essentially free
  - in particular, hardware can be made adaptable using CAD tools
- New architectural capabilities
  - e.g., achieve peak performance by adapting machine capabilities to application and data characteristics
  - the current investigation has demonstrated the feasibility of cache assists and configurations that can be adapted on a per-application basis
  - compiler control of the adaptation is key to the usability of the architecture
- AMRM is a proxy for customized architectures
  - enabled by advances in CAD tools and design flows
  - maturing CAD tools are beginning to provide reasonable QoR on automated flows from high-level models (C/C++ based) to synthesized and mapped circuits, particularly for reconfigurable technologies
  - concurrent system design and application development