Title: AMRM: Project Technical Approach
1. AMRM: Project Technical Approach
A Technology and Architectural View of Adaptation
- Rajesh Gupta
- Project Kickoff Meeting
- November 5, 1998
- Washington DC
2. Outline
- Technology trends driving this project
- Changing ground-rules in high-performance system design
- Rethinking circuits and microelectronic system design
- Rethinking architectures
- The opportunity of application-adaptive architectures
- Adaptation challenges
- Adaptation for memory hierarchy
- Why memory hierarchy?
- Adaptation space and possible gains
- Summary
3. Technology Evolution
Industry continues to outpace NTRS projections on
technology scaling and IC density.
4. Consider Interconnect
- Average interconnect delay is greater than the gate delay!
- Reduced marginal cost of logic and signal regeneration needs make it possible to include logic in inter-block interconnect.
5. Rethinking Circuits When Interconnect Dominates
- DEVICE: choose better interconnect
- copper, low-temperature interconnect
- CAD: choose better interconnect topology, sizes
- minimize path from driver gate to each receiver gate
- e.g., the A-tree algorithm yields about 12% reduction in delay
- select wire sizes to minimize net delays
- e.g., up to 35% reduction in delay by optimal sizing algorithms
- CKT: use more signal repeaters in block-level designs
- longest interconnect < 2000 µm for a 350 nm process
- u-ARCH: a storage element no longer defines a clock boundary
- multiple storage elements in a single clock period
- multiple state transitions in a clock period
- storage-controlled routing
- reduced marginal cost of logic
6. Implications: Circuit Blocks
- Frequent use of signal repeaters in block-level designs
- longest interconnect < 2000 µm for a 0.35 µm process
- A storage element no longer (always) defines a clock boundary
- storage delay (1.5x switching delay)
- multiple storage elements in a single clock period
- multiple state transitions in a clock period
- storage-controlled routing
- Circuit block designs that work independently of data latencies
- asynchronous blocks
- Heterogeneous clocking interfaces
- pausible clocking [Yun, ICCD '96]
- mixed synchronous/asynchronous circuit blocks
7. Implications: Architectures
- Architectures that exploit interconnect delays
- pipeline interconnect delays (recall the Cray-2)
- cycle time ≥ max delay - min delay (see the note after this list)
- use interconnect delay as the minimum delay
- need P&R estimates early in the design
- Algorithms that use interconnect latencies
- interconnect as functional units
- functional-unit schedules based on a measure of spatial distances
- Increased local decision making
- multiple state transitions in a clock period
- storage-controlled routing
- re-programmable blocks in custom layouts
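Note: if the cycle-time relation above refers to wave pipelining, as the Cray-2 reference suggests, the standard bound (an assumption here; the slide does not spell it out) is

    T_{clk} \ge (d_{\max} - d_{\min}) + t_{setup} + t_{skew}

so a large, well-characterized minimum path delay, such as a known interconnect delay used as the floor, directly lowers the achievable cycle time.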
8. Opportunity: Application-Adaptive Architectures
- Exploit architectural low-hanging fruit
- performance variation across applications (10-100X)
- performance variation across data-sets (10X)
- Use interconnect and data-path reconfiguration to
- increase performance,
- combat performance fragility, and
- improve fault tolerance
- Configurable hardware is used to improve utilization of performance-critical resources
- instead of using configurable hardware to build additional resources
- design goal is to achieve peak performance across applications
- configurable hardware is leveraged for efficient utilization of performance-critical resources
9. Architectural Adaptation
- Each of the following elements can benefit from increased adaptability (above and beyond CPU programming):
- CPU
- memory hierarchy: eliminate false sharing
- memory system: virtual memory layout based on cache miss data
- IO: disk layout based on access pattern
- network interface: scheduling to reduce end-to-end latency
- Adaptability is used to build
- programmable engines in IO, memory controllers, cache controllers, network devices
- configurable data-paths and logic in any part of the system
- configurable queueing in scheduling for interconnect, devices, memory
- smart interfaces for information flow from applications to hardware
- performance monitoring and coordinated resource management...
Intelligent interfaces, information formats, mechanisms and policies.
10. Adaptation Challenges
- Is application-driven adaptation viable from a technology and cost point of view?
- How to structure adaptability to
- maximize the performance benefits,
- provide protection, multitasking and a reasonable programming environment,
- enable easy exploitation of adaptability through automatic or semi-automatic means.
- We focus on the memory hierarchy as the first candidate to explore the extent and utility of adaptation.
11. Why Cache Memory?
12. 4-Year Technological Scaling
- CPU performance increases by 47% per year
- DRAM performance increases by 7% per year
- Assume the Alpha is scaled at these rates and
- the organization remains 8KB/96KB/4MB/mem
- benchmark requirements stay the same
- Expect something similar if both L2/L3 cache size and benchmark size increase (see the note below)
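To make the divergence concrete (arithmetic implied by the two growth rates above, not a figure from the slide): over four years the CPU-DRAM performance gap widens by

    \left(\frac{1.47}{1.07}\right)^{4} \approx 3.6\times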
13. (figure slide; no transcript)
14. Impact of Memory Stalls
- A statically scheduled processor with a blocking cache stalls, on average, for
- 15% of the time in integer benchmarks
- 43% of the time in f.p. benchmarks
- 70% of the time in the transaction benchmark
- Possible performance improvements due to an improved memory hierarchy, without technology scaling:
- 1.17x,
- 1.89x, and
- 3.33x
- Possible improvements with technology scaling:
- 2.4x, 7.5x, and 20x
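One consistent reading of the no-scaling bounds (an inference, not stated on the slide) is that removing a stall fraction s yields

    \text{speedup} = \frac{1}{1-s}, \qquad s = 0.70 \Rightarrow \frac{1}{0.30} \approx 3.33\times

which matches the transaction-benchmark figure.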
15. (figure slide; no transcript)
16. Opportunities for Adaptivity in Caches
- Cache organization
- Cache performance assist mechanisms
- Hierarchy organization
- Memory organization (DRAM, etc)
- Data layout and address mapping
- Virtual Memory
- Compiler assist
17. Opportunities - Contd
- Cache organization: adapt what?
- Size: NO
- Associativity: NO
- Line size: MAYBE
- Write policy: YES (fetch, allocate, write-back/through)
- Mapping function: MAYBE
- Organization and clock rate are optimized together
18. Opportunities - Contd
- Cache assist: prefetch, write buffer, victim cache, etc., between different levels
- due to delay/size constraints, all of the above cannot be implemented together
- improvement as f(size) may not peak at max_size
- Adapt what?
- which mechanism(s) to use, and their algorithms
- mechanism parameters: size, lookahead, etc.
19. Opportunities - Contd
- Hierarchy organization
- Where are cache assist mechanisms applied?
- between L1 and L2
- between L1 and memory
- between L2 and memory...
- What are the datapaths like?
- Is prefetch, victim cache, or write buffer data written into a next-level cache?
- How much parallelism is possible in the hierarchy?
20. Opportunities - Contd
- Memory organization
- Cached DRAM?
- yes, but very limited configurations
- Interleave change?
- hard to accomplish dynamically
- Tagged memory
- keep state for adaptivity
21. Opportunities - Contd
- Data layout and address mapping
- In theory something can be done, but
- it would require time-consuming data re-arrangement
- the MP case is even worse
- Adaptive address mapping or hashing
- based on what?
22. Opportunities - Contd
- Compiler assist can
- select the initial hardware configuration
- pass hints on to hardware
- generate code to collect run-time info and adapt during execution
- adapt the configuration when called at certain intervals during execution
- re-optimize code at run-time
23. Opportunities - Contd
- Virtual Memory can adapt
- Page size?
- Mapping?
- Page prefetching/read ahead
- Write buffer (file cache)
- The above under multiprogramming?
24. Applying Adaptivity
- What drives adaptivity?
- performance impact, overall and/or relative
- effectiveness, e.g. miss rate
- processor stall introduced
- program characteristics
- When to perform an adaptive action?
- run time: use feedback from hardware
- compile time: insert code, set up hardware
25. Where to Implement Adaptivity?
- In software: compiler and/or OS
- (Static) Knowledge of program behavior
- Factored into optimization and scheduling
- Extra code, overhead
- Lack of dynamic run-time information
- Rate of adaptivity
- Requires recompilation, OS changes
26. Where to Implement? - Contd
- Hardware
- dynamic information available
- fast decision mechanism possible
- transparent to software (thus safe)
- delay, clock rate limit algorithm complexity
- difficult to maintain long-term trends
- little knowledge of program behavior
27. Where to Implement - Contd
- Hardware/software
- Software can set coarse hardware parameters
- Hardware can supply software dynamic info
- Perhaps more complex algorithms can be used
- Software modification required
- Communication mechanism required
28. Current Investigation
- L1 cache assist
- Wide variability observed in assist mechanism effectiveness
- across individual programs
- within a program, as a function of time
- Propose a hardware mechanism to select between assist types and allocate buffer space
- Give the compiler an opportunity to set parameters
29. Mechanisms Used (L1 to L2)
- Prefetching
- stream buffers
- stride-directed, based on address alone
- miss stride: prefetch the same address using the number of intervening misses as lookahead
- pointer stride
- Victim cache
- Write buffer
30. Mechanisms Used - Contd
- A mechanism can be used by itself
- Which is most effective?
- All can be used at once
- Buffer space size and organization fixed
- No adaptivity involved in current results
- Observe time-domain behavior
31. Configurations
- 32KB L1 data cache, 32B lines, direct-mapped
- 0.5MB L2 cache, 64B lines, direct-mapped
- 8-line write buffer
- Latencies:
- 1-cycle L1, 8-cycle L2, 60-cycle memory
- 1-cycle prefetch buffer, write buffer, victim cache
- All 3 mechanisms at once
32-35. (figure slides; no transcript)
36. Observed Behavior
- Programs exhibit a different effect from each mechanism
- none is a consistent winner
- Within a program, the same holds between mechanisms in the time domain
- Both facts indicate a likely improvement from adaptivity
- select the better mechanism among those available
- Even more can be expected from adaptively re-allocating the combined buffer pool
- to reduce stall time
- to reduce the number of misses
37. Possible Adaptive Mechanisms
- Hardware
- a common pool of (small) n-word buffers
- a set of possible policies, a subset of:
- Stride-directed prefetch
- PC-based prefetch
- History-based prefetch
- Victim cache
- Write buffer
38. Adaptive Hardware - Contd
- Performance monitors for each type/buffer
- misses, stall time on hit, thresholds
- Dynamic buffer allocator among mechanisms
- Allocation and monitoring policy (see the sketch after this list)
- predict future behavior from the observed past
- observe in time interval ΔT, set for the next ΔT
- save performance trends in next-level tags (<8 bits)
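A minimal C sketch of what such an interval-driven monitor and allocator could look like; the structure, the stall-cycles scoring rule, and all names are illustrative assumptions, not the AMRM hardware:

    #include <stdint.h>

    enum { N_MECH = 3 };             /* e.g., prefetch, victim cache, write buffer */

    typedef struct {
        uint32_t misses_covered;     /* misses this mechanism turned into hits */
        uint32_t stall_cycles_saved; /* monitored benefit over the last interval */
        uint8_t  buffers;            /* buffers currently held from the common pool */
    } mech_stats_t;

    static mech_stats_t mech[N_MECH];

    /* Called once per interval dT: move one buffer from the least to the
       most effective mechanism, then reset the monitors for the next dT. */
    void reallocate_buffers(void)
    {
        int best = 0, worst = 0;
        for (int i = 1; i < N_MECH; i++) {
            if (mech[i].stall_cycles_saved > mech[best].stall_cycles_saved)  best = i;
            if (mech[i].stall_cycles_saved < mech[worst].stall_cycles_saved) worst = i;
        }
        if (best != worst && mech[worst].buffers > 0) {
            mech[worst].buffers--;
            mech[best].buffers++;
        }
        for (int i = 0; i < N_MECH; i++) {   /* reset monitors for next interval */
            mech[i].misses_covered = 0;
            mech[i].stall_cycles_saved = 0;
        }
    }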
39. Adaptive Hardware - Contd
- Adapt the following:
- number of buffers per mechanism
- may also include control state, e.g. prediction tables
- prefetch lookahead (buffer depth)
- increase when buffers fill up and stalls persist
- adaptivity interval
- increase when every ...
40. Adaptivity via Compiler
- Give software control over configuration settings
- Provide feedback via the same parameters as used by hardware: stall time, miss rate, etc.
- Have the compiler (see the sketch below)
- select program points at which to change configuration
- set parameters based on hardware feedback
- use compile-time knowledge as well
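A hypothetical sketch of the software side in C; the amrm_* interface, the threshold, and the mechanism IDs are assumptions made up for illustration, not a real interface:

    #include <stdint.h>

    #define STALL_THRESHOLD 100000u              /* cycles; assumed tuning knob */
    enum { AMRM_PREFETCH, AMRM_VICTIM, AMRM_WRITEBUF };

    extern uint32_t amrm_read_stall_cycles(void);        /* hardware feedback */
    extern void     amrm_set_assist(int mech, int bufs); /* configure a mechanism */

    /* The compiler would emit a call like this at a selected program point,
       e.g. at a boundary between program phases. */
    void amrm_phase_boundary(void)
    {
        if (amrm_read_stall_cycles() > STALL_THRESHOLD)
            amrm_set_assist(AMRM_PREFETCH, 8);   /* favor prefetch when stalling */
        else
            amrm_set_assist(AMRM_VICTIM, 4);     /* otherwise keep a victim cache */
    }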
41. Further Opportunities to Adapt
- L2 cache organization
- variable-size lines
- L2 non-sequential prefetch
- L3 organization and use (for deep sub-micron)
- In-memory adaptivity assist (DRAM tags)
- Multiple-processor scenarios
- even longer latency
- coherence, hardware or software
- synchronization
- Prefetch under and beyond the above
- avoid coherence if possible
- prefetch past synchronization
- assist adaptive scheduling
42. The AMRM Project: Compiler, Architecture and VLSI Research for AA Architectures
(block-diagram slide; recoverable block labels and contents:)
- Compiler control: application analysis, identification of AA mechanisms, semantic retention strategies, compiler instrumentation for runtime
- Machine definition: memory hierarchy analysis, reference structure identification
- Fault detection and containment, interface to mapping and synthesis hardware, continuous validation strategies, protection tests
- Partitioning, synthesis, mapping: algorithms for efficient runtime adaptation, efficient reprogrammable circuit structures for rapid reconfiguration, prototype hardware platform
43. Summary
- Semiconductor advances are bringing powerful changes to how systems are architected and built
- they challenge the underlying assumptions of synchronous digital hardware design
- interconnect (local and global) dominates architectural choices; local decision making is free
- in particular, hardware can be made adaptable using CAD tools
- The AMRM Project:
- achieve peak performance by adapting machine capabilities to application and data characteristics
- the initial focus on the memory hierarchy promises high performance gains, given the worsening effects of memory (vs. CPU) speeds and increasing data sets
44. Appendix: Assists Being Explored
45. Victim Caching
(datapath diagram: a small (1-5 line) fully-associative cache, configured as victim/stream cache or stream buffer, sits alongside the direct-mapped L1/L2; labeled paths include "victim line", "new line", "MRU", and "to stream buffer")
- VC is useful in case of conflict misses and long sequential reference streams; it prevents a sharp fall-off in performance when the WSS is slightly larger than L1.
- Estimate the WSS from the structure of the RM, such as the size of the strongly connected components (SCCs)
- The MORPH data-path structure supports addition of a parameterized victim/stream cache; the control logic is synthesized using CAD tools.
- Victim caches provide 50X the marginal improvement in hit rate over the primary cache.
46. Victim Cache
- Mainly used to eliminate conflict misses
- Prediction: the memory address of a cache line that is replaced is likely to be accessed again in the near future
- Scenario for the prediction to be effective: false sharing, ugly address mapping
- Architecture implementation: use an on-chip buffer to store the contents of recently replaced cache lines (see the sketch below)
- Drawbacks:
- ugly mapping can be rectified by a cache-aware compiler
- given the victim cache's small size, the probability of memory-address reuse within a short period is very low
- Experiments show the victim cache is not effective across the board for DI apps.
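A minimal C sketch of the lookup/insert behavior described above; the size, the FIFO replacement, and all names are illustrative assumptions:

    #include <stdint.h>

    #define VC_LINES 4                            /* small, fully associative */

    typedef struct { uint32_t tag; int valid; } vc_entry_t;
    static vc_entry_t vc[VC_LINES];
    static int vc_next;                           /* FIFO replacement pointer */

    /* On L1 eviction: remember the replaced (victim) line. */
    void vc_insert(uint32_t line_addr)
    {
        vc[vc_next].tag   = line_addr;
        vc[vc_next].valid = 1;
        vc_next = (vc_next + 1) % VC_LINES;
    }

    /* On L1 miss: a victim-cache hit avoids the L2 access; the line
       moves back into L1. Returns 1 on hit, 0 on miss. */
    int vc_lookup(uint32_t line_addr)
    {
        for (int i = 0; i < VC_LINES; i++)
            if (vc[i].valid && vc[i].tag == line_addr) {
                vc[i].valid = 0;                  /* line returns to L1 */
                return 1;
            }
        return 0;
    }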
47. Stream Buffer
- Mainly used to eliminate compulsory/capacity misses
- Prediction: if a memory address misses, the consecutive address is likely to miss in the near future
- Scenario for the prediction to be useful: stream access
- Architecture implementation: on an address miss, prefetch the consecutive address into an on-chip buffer; on a hit in the stream buffer, prefetch the consecutive address of the hit address (a sketch follows)
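A minimal C sketch of this fill/consume policy; the depth and all names are assumptions (the 32B line size follows the configurations slide):

    #include <stdint.h>

    #define LINE  32u        /* L1 line size in bytes */
    #define DEPTH 4          /* stream-buffer depth (assumed) */

    typedef struct { uint32_t addr[DEPTH]; int head; } stream_buf_t;

    extern void prefetch_line(uint32_t addr);   /* assumed fetch-to-buffer hook */

    /* On an L1 miss: (re)start the stream at the next sequential lines. */
    void sb_on_miss(stream_buf_t *sb, uint32_t miss_addr)
    {
        for (int i = 0; i < DEPTH; i++) {
            sb->addr[i] = miss_addr + (uint32_t)(i + 1) * LINE;
            prefetch_line(sb->addr[i]);
        }
        sb->head = 0;
    }

    /* On a stream-buffer hit: consume the head entry and prefetch one more
       consecutive line so the buffer stays full. */
    void sb_on_hit(stream_buf_t *sb)
    {
        uint32_t tail = sb->addr[(sb->head + DEPTH - 1) % DEPTH];
        sb->addr[sb->head] = tail + LINE;
        prefetch_line(tail + LINE);
        sb->head = (sb->head + 1) % DEPTH;
    }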
48. Stream Cache
- Modification of the stream buffer
- Uses a separate cache to store stream data, preventing pollution of the L1 cache
- On a hit in the stream buffer, the hit address is sent to the stream cache instead of the L1 cache
49. Stride Prefetch
- Mainly used to eliminate compulsory/capacity misses
- Prediction: if a memory address misses, an address offset by some distance from the missed address is likely to miss in the near future
- Scenario for the prediction to be useful: strided access
- Architecture implementation: on an address miss, prefetch the address offset by the stride from the missed address; on a hit in the buffer, also prefetch the address offset by the stride from the hit address (sketch below)
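The same skeleton as the stream buffer, but with a stride parameter; how `stride` is set (by software or learned by hardware) is left open by the slide, so the constant below is purely an example:

    #include <stdint.h>

    extern void prefetch_line(uint32_t addr);  /* assumed fetch-to-buffer hook */

    static uint32_t stride = 128;              /* prefetch distance in bytes (example) */

    /* On a miss, fetch the line one stride ahead; on a buffer hit, keep
       running one stride ahead of the access stream. */
    void stride_on_miss(uint32_t miss_addr) { prefetch_line(miss_addr + stride); }
    void stride_on_hit(uint32_t hit_addr)   { prefetch_line(hit_addr + stride); }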
50. Miss Stride Buffer
- Mainly used to eliminate conflict misses
- Prediction: if a memory address misses again after N other misses, it is likely to miss again after another N misses
- Scenario for the prediction to be useful:
- multiple loop nests
- some variables or array elements are reused across iterations
51. Advantage over Victim Cache
- Eliminates conflict misses that even a cache-aware compiler cannot eliminate
- ugly mappings are fewer and can be rectified
- far more conflicts are random: probabilistically, a given memory address will conflict with some other address after enough time, but we cannot know at compile time with which address it will conflict
- There can be a much longer period before the conflicting address is reused
- beyond the reach of the victim cache's small size
52. Architecture Implementation
- Memory history buffer
- FIFO buffer recording recently missed memory addresses
- predict only when there is a hit in the buffer
- the miss stride can be calculated from the relative positions of consecutive misses to the same address
- the buffer's size determines the number of predictions
- Prefetch buffer (on-chip)
- stores the contents of prefetched memory addresses
- its size determines how much variation in the miss stride can be tolerated
- Prefetch scheduler
- selects the right time to prefetch
- avoids collisions
- Prefetcher
- prefetches the contents of the miss address into the on-chip prefetch buffer (a sketch of the history-buffer logic follows)
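An illustrative C sketch of the memory-history-buffer logic above; the depth, types, and names are assumptions, not the AMRM design:

    #include <stdint.h>

    #define HIST 64                        /* miss-history FIFO depth (assumed) */

    static uint32_t hist_addr[HIST];       /* recently missed line addresses */
    static uint32_t hist_seq[HIST];        /* global miss number at that miss */
    static int      hist_valid[HIST];
    static uint32_t miss_count;            /* global miss sequence number */

    /* On each L1 miss: a hit in the history yields the number of intervening
       misses (the miss stride); the prefetch scheduler can then re-fetch this
       address that many misses from now. Returns the stride, or 0 if none. */
    uint32_t miss_history_update(uint32_t line_addr)
    {
        uint32_t stride = 0;
        for (int i = 0; i < HIST; i++)
            if (hist_valid[i] && hist_addr[i] == line_addr) {
                stride = miss_count - hist_seq[i];
                break;
            }
        int slot = (int)(miss_count % HIST);   /* FIFO insert of this miss */
        hist_addr[slot]  = line_addr;
        hist_seq[slot]   = miss_count;
        hist_valid[slot] = 1;
        miss_count++;
        return stride;
    }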
53. Global Picture
(diagram slide; recoverable label: "Miss Address")
54. (untitled diagram slide: flowchart with FIFO head/tail pointers and yes/no branches; details not recoverable)
55. Prefetch Scheduler
56. Pointer Stream Buffer
57. Appendix: Prefetching Adaptation Results
58. Prefetching for Latency & BW Management
- Combat latency deterioration
- optimal prefetching
- memory-side pointer chasing
- blocking mechanisms
- fast barrier, broadcast support
- synchronization support
- Bandwidth management
- memory (re)organization to suit application characteristics
- translate-and-gather hardware
- prefetching with compaction
59. Adaptation for Latency Tolerance
- Operation
- 1. Application sets prefetch parameters (compiler controlled)
- 2. Prefetching event generation (runtime controlled)
- when a new cache block is filled:

    if (start < vAddr < end) {
        if (pAddr & 0x20) addr = pAddr - 0x20;
        else              addr = pAddr + 0x20;
        <initiate fetch of cache line at addr to L1>
    }
60. Prefetching Experiments
- Operation
- 1. Application sets prefetch parameters (compiler controlled)
- set lower/upper bounds on memory regions (for memory protection etc.)
- download the pointer-extraction function
- element size
- 2. Prefetching event generation (runtime controlled)
- when a new cache block is filled
- Application view:

    generate_event(pointer to pass matrix element structure)
    ...
    generate_event(signal to enable prefetch)
    <code on which prefetching is applied>
    generate_event(signal to disable prefetch)
61. Adaptation for Bandwidth Reduction
- Prefetching Entire Row/Column
- Pack Cache with Used Data Only
62. Simulation Results: Latency
- Sparse MM: blocking, prefetching, packing (all based on application data structures)
(chart: cache miss rate)
- 10X reduction in latency using application data structure optimization
63. Simulation Results: Bandwidth
- Optimization designed to significantly reduce the volume of data traffic
- efficient fast storage management, packing, prefetch
(chart: data traffic, MB)
- 100X reduction in BW using application-specific packing and fetching.