Title: AMRM: Project Technical Approach
1. AMRM: Project Technical Approach
A Technology and Architectural View of Adaptation
- Rajesh Gupta
- Project Kickoff Meeting
- November 5, 1998
- Washington DC
2. Outline
- Technology trends driving this project
- Changing ground-rules in high-performance system design
- Rethinking circuits and microelectronic system design
- Rethinking architectures
- The opportunity of application-adaptive architectures
- Adaptation challenges
- Adaptation for memory hierarchy
- Why memory hierarchy?
- Adaptation space and possible gains
- Summary
3. Technology Evolution
Industry continues to outpace NTRS projections on
technology scaling and IC density.
4. Consider Interconnect
- Average interconnect delay is greater than the gate delay!
- Reduced marginal cost of logic and signal regeneration needs make it possible to include logic in inter-block interconnect.
5. Rethinking Circuits When Interconnect Dominates
- DEVICE: choose better interconnect
- copper, low-temperature interconnect
- CAD: choose better interconnect topology, sizes
- minimize path from driver gate to each receiver gate
- e.g., the A-tree algorithm yields about 12% reduction in delay
- select wire sizes to minimize net delays
- e.g., up to 35% reduction in delay by optimal sizing algorithms
- CKT: use more signal repeaters in block-level designs
- longest interconnect < 2000 µm for a 350 nm process
- u-ARCH: a storage element no longer defines a clock boundary
- multiple storage elements in a single clock period
- multiple state transitions in a clock period
- storage-controlled routing
- reduced marginal cost of logic
6. Implications: Circuit Blocks
- Frequent use of signal repeaters in block-level designs
- longest interconnect < 2000 µm for a 0.35 µm process
- A storage element no longer (always) defines a clock boundary
- storage delay (1.5x switching delay)
- multiple storage elements in a single clock period
- multiple state transitions in a clock period
- storage-controlled routing
- Circuit block designs that work independently of data latencies
- asynchronous blocks
- Heterogeneous clocking interfaces
- pausible clocking [Yun, ICCD '96]
- mixed synchronous/asynchronous circuit blocks
7. Implications: Architectures
- Architectures that exploit interconnect delays
- pipeline interconnect delays (recall the Cray-2)
- cycle time ≥ max delay - min delay (see the note after this list)
- use interconnect delay as the minimum delay
- need P&R estimates early in the design
- Algorithms that use interconnect latencies
- interconnect as functional units
- functional-unit schedules based on a measure of spatial distances
- Increased local decision making
- multiple state transitions in a clock period
- storage-controlled routing
- re-programmable blocks in custom layouts
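Note: if the cycle-time relation above refers to wave pipelining, as the Cray-2 reference suggests, the standard bound (an assumption here; the slide does not spell it out) is

    T_{clk} \ge (d_{\max} - d_{\min}) + t_{setup} + t_{skew}

so a large, well-characterized minimum path delay, such as a known interconnect delay used as the floor, directly lowers the achievable cycle time.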
8. Opportunity: Application-Adaptive Architectures
- Exploit architectural low-hanging fruit
- performance variation across applications (10-100X)
- performance variation across data-sets (10X)
- Use interconnect and data-path reconfiguration to
- increase performance,
- combat performance fragility, and
- improve fault tolerance
- Configurable hardware is used to improve utilization of performance-critical resources
- instead of using configurable hardware to build additional resources
- design goal is to achieve peak performance across applications
- configurable hardware is leveraged for efficient utilization of performance-critical resources
9. Architectural Adaptation
- Each of the following elements can benefit from increased adaptability (above and beyond CPU programming):
- CPU
- memory hierarchy: eliminate false sharing
- memory system: virtual memory layout based on cache miss data
- IO: disk layout based on access pattern
- network interface: scheduling to reduce end-to-end latency
- Adaptability is used to build
- programmable engines in IO, memory controllers, cache controllers, network devices
- configurable data-paths and logic in any part of the system
- configurable queueing in scheduling for interconnect, devices, memory
- smart interfaces for information flow from applications to hardware
- performance monitoring and coordinated resource management...
Intelligent interfaces, information formats, mechanisms and policies.
10. Adaptation Challenges
- Is application-driven adaptation viable from a technology and cost point of view?
- How to structure adaptability to
- maximize the performance benefits,
- provide protection, multitasking and a reasonable programming environment,
- enable easy exploitation of adaptability through automatic or semi-automatic means.
- We focus on the memory hierarchy as the first candidate to explore the extent and utility of adaptation.
11. Why Cache Memory?
12. 4-Year Technological Scaling
- CPU performance increases by 47% per year
- DRAM performance increases by 7% per year
- Assume the Alpha is scaled at these rates and
- the organization remains 8KB/96KB/4MB/mem
- benchmark requirements stay the same
- Expect something similar if both L2/L3 cache size and benchmark size increase (see the note below)
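To make the divergence concrete (arithmetic implied by the two growth rates above, not a figure from the slide): over four years the CPU-DRAM performance gap widens by

    \left(\frac{1.47}{1.07}\right)^{4} \approx 3.6\times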
13. (figure slide; no transcript)
14. Impact of Memory Stalls
- A statically scheduled processor with a blocking cache stalls, on average, for
- 15% of the time in integer benchmarks
- 43% of the time in f.p. benchmarks
- 70% of the time in the transaction benchmark
- Possible performance improvements due to an improved memory hierarchy, without technology scaling:
- 1.17x,
- 1.89x, and
- 3.33x
- Possible improvements with technology scaling:
- 2.4x, 7.5x, and 20x
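One consistent reading of the no-scaling bounds (an inference, not stated on the slide) is that removing a stall fraction s yields

    \text{speedup} = \frac{1}{1-s}, \qquad s = 0.70 \Rightarrow \frac{1}{0.30} \approx 3.33\times

which matches the transaction-benchmark figure.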
15. (figure slide; no transcript)
16. Opportunities for Adaptivity in Caches
- Cache organization
- Cache performance assist mechanisms
- Hierarchy organization
- Memory organization (DRAM, etc)
- Data layout and address mapping
- Virtual Memory
- Compiler assist
17. Opportunities - Contd
- Cache organization: adapt what?
- Size: NO
- Associativity: NO
- Line size: MAYBE
- Write policy: YES (fetch, allocate, write-back/through)
- Mapping function: MAYBE
- Organization and clock rate are optimized together
18. Opportunities - Contd
- Cache assist: prefetch, write buffer, victim cache, etc., between different levels
- due to delay/size constraints, all of the above cannot be implemented together
- improvement as f(size) may not peak at max_size
- Adapt what?
- which mechanism(s) to use, and their algorithms
- mechanism parameters: size, lookahead, etc.
19. Opportunities - Contd
- Hierarchy organization
- Where are cache assist mechanisms applied?
- between L1 and L2
- between L1 and memory
- between L2 and memory...
- What are the datapaths like?
- Is prefetch, victim cache, or write buffer data written into a next-level cache?
- How much parallelism is possible in the hierarchy?
20. Opportunities - Contd
- Memory organization
- Cached DRAM?
- yes, but very limited configurations
- Interleave change?
- hard to accomplish dynamically
- Tagged memory
- keep state for adaptivity
21. Opportunities - Contd
- Data layout and address mapping
- In theory something can be done, but
- it would require time-consuming data re-arrangement
- the MP case is even worse
- Adaptive address mapping or hashing
- based on what?
22. Opportunities - Contd
- Compiler assist can
- select the initial hardware configuration
- pass hints on to hardware
- generate code to collect run-time info and adapt during execution
- adapt the configuration when called at certain intervals during execution
- re-optimize code at run-time
23. Opportunities - Contd
- Virtual Memory can adapt
- Page size?
- Mapping?
- Page prefetching/read ahead
- Write buffer (file cache)
- The above under multiprogramming?
24. Applying Adaptivity
- What drives adaptivity?
- performance impact, overall and/or relative
- effectiveness, e.g. miss rate
- processor stall introduced
- program characteristics
- When to perform an adaptive action?
- run time: use feedback from hardware
- compile time: insert code, set up hardware
25. Where to Implement Adaptivity?
- In software: compiler and/or OS
- (Static) Knowledge of program behavior
- Factored into optimization and scheduling
- Extra code, overhead
- Lack of dynamic run-time information
- Rate of adaptivity
- Requires recompilation, OS changes
26. Where to Implement? - Contd
- Hardware
- dynamic information available
- fast decision mechanism possible
- transparent to software (thus safe)
- delay, clock rate limit algorithm complexity
- difficult to maintain long-term trends
- little knowledge of program behavior
27. Where to Implement - Contd
- Hardware/software
- Software can set coarse hardware parameters
- Hardware can supply software dynamic info
- Perhaps more complex algorithms can be used
- Software modification required
- Communication mechanism required
28. Current Investigation
- L1 cache assist
- Wide variability observed in assist mechanism effectiveness
- across individual programs
- within a program, as a function of time
- Propose a hardware mechanism to select between assist types and allocate buffer space
- Give the compiler an opportunity to set parameters
29. Mechanisms Used (L1 to L2)
- Prefetching
- stream buffers
- stride-directed, based on address alone
- miss stride: prefetch the same address using the number of intervening misses as lookahead
- pointer stride
- Victim cache
- Write buffer
30. Mechanisms Used - Contd
- A mechanism can be used by itself
- Which is most effective?
- All can be used at once
- Buffer space size and organization fixed
- No adaptivity involved in current results
- Observe time-domain behavior
31. Configurations
- 32KB L1 data cache, 32B lines, direct-mapped
- 0.5MB L2 cache, 64B lines, direct-mapped
- 8-line write buffer
- Latencies:
- 1-cycle L1, 8-cycle L2, 60-cycle memory
- 1-cycle prefetch buffer, write buffer, victim cache
- All 3 mechanisms at once
32-35. (figure slides; no transcript)
36. Observed Behavior
- Programs exhibit a different effect from each mechanism
- none is a consistent winner
- Within a program, the same holds between mechanisms in the time domain
- Both facts indicate a likely improvement from adaptivity
- select the better mechanism among those available
- Even more can be expected from adaptively re-allocating the combined buffer pool
- to reduce stall time
- to reduce the number of misses
37. Possible Adaptive Mechanisms
- Hardware
- a common pool of (small) n-word buffers
- a set of possible policies, a subset of:
- Stride-directed prefetch
- PC-based prefetch
- History-based prefetch
- Victim cache
- Write buffer
38. Adaptive Hardware - Contd
- Performance monitors for each type/buffer
- misses, stall time on hit, thresholds
- Dynamic buffer allocator among mechanisms
- Allocation and monitoring policy (see the sketch after this list)
- predict future behavior from the observed past
- observe in time interval ΔT, set for the next ΔT
- save performance trends in next-level tags (<8 bits)
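A minimal C sketch of what such an interval-driven monitor and allocator could look like; the structure, the stall-cycles scoring rule, and all names are illustrative assumptions, not the AMRM hardware:

    #include <stdint.h>

    enum { N_MECH = 3 };             /* e.g., prefetch, victim cache, write buffer */

    typedef struct {
        uint32_t misses_covered;     /* misses this mechanism turned into hits */
        uint32_t stall_cycles_saved; /* monitored benefit over the last interval */
        uint8_t  buffers;            /* buffers currently held from the common pool */
    } mech_stats_t;

    static mech_stats_t mech[N_MECH];

    /* Called once per interval dT: move one buffer from the least to the
       most effective mechanism, then reset the monitors for the next dT. */
    void reallocate_buffers(void)
    {
        int best = 0, worst = 0;
        for (int i = 1; i < N_MECH; i++) {
            if (mech[i].stall_cycles_saved > mech[best].stall_cycles_saved)  best = i;
            if (mech[i].stall_cycles_saved < mech[worst].stall_cycles_saved) worst = i;
        }
        if (best != worst && mech[worst].buffers > 0) {
            mech[worst].buffers--;
            mech[best].buffers++;
        }
        for (int i = 0; i < N_MECH; i++) {   /* reset monitors for next interval */
            mech[i].misses_covered = 0;
            mech[i].stall_cycles_saved = 0;
        }
    }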
39. Adaptive Hardware - Contd
- Adapt the following:
- number of buffers per mechanism
- may also include control state, e.g. prediction tables
- prefetch lookahead (buffer depth)
- increase when buffers fill up and stalls persist
- adaptivity interval
- increase when every ...
40. Adaptivity via Compiler
- Give software control over configuration settings
- Provide feedback via the same parameters as used by hardware: stall time, miss rate, etc.
- Have the compiler (see the sketch below)
- select program points at which to change configuration
- set parameters based on hardware feedback
- use compile-time knowledge as well
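A hypothetical sketch of the software side in C; the amrm_* interface, the threshold, and the mechanism IDs are assumptions made up for illustration, not a real interface:

    #include <stdint.h>

    #define STALL_THRESHOLD 100000u              /* cycles; assumed tuning knob */
    enum { AMRM_PREFETCH, AMRM_VICTIM, AMRM_WRITEBUF };

    extern uint32_t amrm_read_stall_cycles(void);        /* hardware feedback */
    extern void     amrm_set_assist(int mech, int bufs); /* configure a mechanism */

    /* The compiler would emit a call like this at a selected program point,
       e.g. at a boundary between program phases. */
    void amrm_phase_boundary(void)
    {
        if (amrm_read_stall_cycles() > STALL_THRESHOLD)
            amrm_set_assist(AMRM_PREFETCH, 8);   /* favor prefetch when stalling */
        else
            amrm_set_assist(AMRM_VICTIM, 4);     /* otherwise keep a victim cache */
    }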
41. Further Opportunities to Adapt
- L2 cache organization
- variable-size lines
- L2 non-sequential prefetch
- L3 organization and use (for deep sub-micron)
- In-memory adaptivity assist (DRAM tags)
- Multiple-processor scenarios
- even longer latency
- coherence, hardware or software
- synchronization
- Prefetch under and beyond the above
- avoid coherence if possible
- prefetch past synchronization
- assist adaptive scheduling
42. The AMRM Project: Compiler, Architecture and VLSI Research for AA Architectures
(block-diagram slide; recoverable block labels and contents:)
- Compiler control: application analysis, identification of AA mechanisms, semantic retention strategies, compiler instrumentation for runtime
- Machine definition: memory hierarchy analysis, reference structure identification
- Fault detection and containment, interface to mapping and synthesis hardware, continuous validation strategies, protection tests
- Partitioning, synthesis, mapping: algorithms for efficient runtime adaptation, efficient reprogrammable circuit structures for rapid reconfiguration, prototype hardware platform
43. Summary
- Semiconductor advances are bringing powerful changes to how systems are architected and built
- they challenge the underlying assumptions of synchronous digital hardware design
- interconnect (local and global) dominates architectural choices; local decision making is free
- in particular, hardware can be made adaptable using CAD tools
- The AMRM Project:
- achieve peak performance by adapting machine capabilities to application and data characteristics
- the initial focus on the memory hierarchy promises high performance gains, given the worsening effects of memory (vs. CPU) speeds and increasing data sets
44. Appendix: Assists Being Explored
45. Victim Caching
(datapath diagram: a small (1-5 line) fully-associative cache, configured as victim/stream cache or stream buffer, sits alongside the direct-mapped L1/L2; labeled paths include "victim line", "new line", "MRU", and "to stream buffer")
- VC is useful in case of conflict misses and long sequential reference streams; it prevents a sharp fall-off in performance when the WSS is slightly larger than L1.
- Estimate the WSS from the structure of the RM, such as the size of the strongly connected components (SCCs)
- The MORPH data-path structure supports addition of a parameterized victim/stream cache; the control logic is synthesized using CAD tools.
- Victim caches provide 50X the marginal improvement in hit rate over the primary cache.
46. Victim Cache
- Mainly used to eliminate conflict misses
- Prediction: the memory address of a cache line that is replaced is likely to be accessed again in the near future
- Scenario for the prediction to be effective: false sharing, ugly address mapping
- Architecture implementation: use an on-chip buffer to store the contents of recently replaced cache lines (see the sketch below)
- Drawbacks:
- ugly mapping can be rectified by a cache-aware compiler
- given the victim cache's small size, the probability of memory-address reuse within a short period is very low
- Experiments show the victim cache is not effective across the board for DI apps.
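A minimal C sketch of the lookup/insert behavior described above; the size, the FIFO replacement, and all names are illustrative assumptions:

    #include <stdint.h>

    #define VC_LINES 4                            /* small, fully associative */

    typedef struct { uint32_t tag; int valid; } vc_entry_t;
    static vc_entry_t vc[VC_LINES];
    static int vc_next;                           /* FIFO replacement pointer */

    /* On L1 eviction: remember the replaced (victim) line. */
    void vc_insert(uint32_t line_addr)
    {
        vc[vc_next].tag   = line_addr;
        vc[vc_next].valid = 1;
        vc_next = (vc_next + 1) % VC_LINES;
    }

    /* On L1 miss: a victim-cache hit avoids the L2 access; the line
       moves back into L1. Returns 1 on hit, 0 on miss. */
    int vc_lookup(uint32_t line_addr)
    {
        for (int i = 0; i < VC_LINES; i++)
            if (vc[i].valid && vc[i].tag == line_addr) {
                vc[i].valid = 0;                  /* line returns to L1 */
                return 1;
            }
        return 0;
    }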
47. Stream Buffer
- Mainly used to eliminate compulsory/capacity misses
- Prediction: if a memory address misses, the consecutive address is likely to miss in the near future
- Scenario for the prediction to be useful: stream access
- Architecture implementation: on an address miss, prefetch the consecutive address into an on-chip buffer; on a hit in the stream buffer, prefetch the consecutive address of the hit address (a sketch follows)
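A minimal C sketch of this fill/consume policy; the depth and all names are assumptions (the 32B line size follows the configurations slide):

    #include <stdint.h>

    #define LINE  32u        /* L1 line size in bytes */
    #define DEPTH 4          /* stream-buffer depth (assumed) */

    typedef struct { uint32_t addr[DEPTH]; int head; } stream_buf_t;

    extern void prefetch_line(uint32_t addr);   /* assumed fetch-to-buffer hook */

    /* On an L1 miss: (re)start the stream at the next sequential lines. */
    void sb_on_miss(stream_buf_t *sb, uint32_t miss_addr)
    {
        for (int i = 0; i < DEPTH; i++) {
            sb->addr[i] = miss_addr + (uint32_t)(i + 1) * LINE;
            prefetch_line(sb->addr[i]);
        }
        sb->head = 0;
    }

    /* On a stream-buffer hit: consume the head entry and prefetch one more
       consecutive line so the buffer stays full. */
    void sb_on_hit(stream_buf_t *sb)
    {
        uint32_t tail = sb->addr[(sb->head + DEPTH - 1) % DEPTH];
        sb->addr[sb->head] = tail + LINE;
        prefetch_line(tail + LINE);
        sb->head = (sb->head + 1) % DEPTH;
    }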
48. Stream Cache
- Modification of the stream buffer
- Uses a separate cache to store stream data, preventing pollution of the L1 cache
- On a hit in the stream buffer, the hit address is sent to the stream cache instead of the L1 cache
49. Stride Prefetch
- Mainly used to eliminate compulsory/capacity misses
- Prediction: if a memory address misses, an address offset by some distance from the missed address is likely to miss in the near future
- Scenario for the prediction to be useful: strided access
- Architecture implementation: on an address miss, prefetch the address offset by the stride from the missed address; on a hit in the buffer, also prefetch the address offset by the stride from the hit address (sketch below)
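The same skeleton as the stream buffer, but with a stride parameter; how `stride` is set (by software or learned by hardware) is left open by the slide, so the constant below is purely an example:

    #include <stdint.h>

    extern void prefetch_line(uint32_t addr);  /* assumed fetch-to-buffer hook */

    static uint32_t stride = 128;              /* prefetch distance in bytes (example) */

    /* On a miss, fetch the line one stride ahead; on a buffer hit, keep
       running one stride ahead of the access stream. */
    void stride_on_miss(uint32_t miss_addr) { prefetch_line(miss_addr + stride); }
    void stride_on_hit(uint32_t hit_addr)   { prefetch_line(hit_addr + stride); }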
50. Miss Stride Buffer
- Mainly used to eliminate conflict misses
- Prediction: if a memory address misses again after N other misses, it is likely to miss again after another N misses
- Scenario for the prediction to be useful:
- multiple loop nests
- some variables or array elements are reused across iterations
51. Advantage over Victim Cache
- Eliminates conflict misses that even a cache-aware compiler cannot eliminate
- ugly mappings are fewer and can be rectified
- far more conflicts are random: probabilistically, a given memory address will conflict with some other address after enough time, but we cannot know at compile time with which address it will conflict
- There can be a much longer period before the conflicting address is reused
- beyond the reach of the victim cache's small size
52. Architecture Implementation
- Memory history buffer
- FIFO buffer recording recently missed memory addresses
- predict only when there is a hit in the buffer
- the miss stride can be calculated from the relative positions of consecutive misses to the same address
- the buffer's size determines the number of predictions
- Prefetch buffer (on-chip)
- stores the contents of prefetched memory addresses
- its size determines how much variation in the miss stride can be tolerated
- Prefetch scheduler
- selects the right time to prefetch
- avoids collisions
- Prefetcher
- prefetches the contents of the miss address into the on-chip prefetch buffer (a sketch of the history-buffer logic follows)
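An illustrative C sketch of the memory-history-buffer logic above; the depth, types, and names are assumptions, not the AMRM design:

    #include <stdint.h>

    #define HIST 64                        /* miss-history FIFO depth (assumed) */

    static uint32_t hist_addr[HIST];       /* recently missed line addresses */
    static uint32_t hist_seq[HIST];        /* global miss number at that miss */
    static int      hist_valid[HIST];
    static uint32_t miss_count;            /* global miss sequence number */

    /* On each L1 miss: a hit in the history yields the number of intervening
       misses (the miss stride); the prefetch scheduler can then re-fetch this
       address that many misses from now. Returns the stride, or 0 if none. */
    uint32_t miss_history_update(uint32_t line_addr)
    {
        uint32_t stride = 0;
        for (int i = 0; i < HIST; i++)
            if (hist_valid[i] && hist_addr[i] == line_addr) {
                stride = miss_count - hist_seq[i];
                break;
            }
        int slot = (int)(miss_count % HIST);   /* FIFO insert of this miss */
        hist_addr[slot]  = line_addr;
        hist_seq[slot]   = miss_count;
        hist_valid[slot] = 1;
        miss_count++;
        return stride;
    }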
53. Global Picture
(diagram slide; recoverable label: "Miss Address")
54. (untitled diagram slide: flowchart with FIFO head/tail pointers and yes/no branches; details not recoverable)
55. Prefetch Scheduler
56. Pointer Stream Buffer
57. Appendix: Prefetching Adaptation Results
58. Prefetching for Latency & BW Management
- Combat latency deterioration
- optimal prefetching
- memory-side pointer chasing
- blocking mechanisms
- fast barrier, broadcast support
- synchronization support
- Bandwidth management
- memory (re)organization to suit application characteristics
- translate-and-gather hardware
- prefetching with compaction
59. Adaptation for Latency Tolerance
- Operation
- 1. Application sets prefetch parameters (compiler controlled)
- 2. Prefetching event generation (runtime controlled)
- when a new cache block is filled:

    if (start < vAddr < end) {
        if (pAddr & 0x20) addr = pAddr - 0x20;
        else              addr = pAddr + 0x20;
        <initiate fetch of cache line at addr to L1>
    }
60. Prefetching Experiments
- Operation
- 1. Application sets prefetch parameters (compiler controlled)
- set lower/upper bounds on memory regions (for memory protection etc.)
- download the pointer-extraction function
- element size
- 2. Prefetching event generation (runtime controlled)
- when a new cache block is filled
- Application view:

    generate_event(pointer to pass matrix element structure)
    ...
    generate_event(signal to enable prefetch)
    <code on which prefetching is applied>
    generate_event(signal to disable prefetch)
61. Adaptation for Bandwidth Reduction
- Prefetching Entire Row/Column
- Pack Cache with Used Data Only
62. Simulation Results: Latency
- Sparse MM: blocking, prefetching, packing (all based on application data structures)
(chart: cache miss rate)
- 10X reduction in latency using application data structure optimization
63. Simulation Results: Bandwidth
- Optimization designed to significantly reduce the volume of data traffic
- efficient fast storage management, packing, prefetch
(chart: data traffic, MB)
- 100X reduction in BW using application-specific packing and fetching.