1
AMRM: Project Technical Approach
A Technology and Architectural View of Adaptation
  • Rajesh Gupta
  • Project Kickoff Meeting
  • November 5, 1998
  • Washington DC

2
Outline
  • Technology trends driving this project
  • Changing ground-rules in high-performance system
    design
  • Rethinking circuits and microelectronic system
    design
  • Rethinking architectures
  • The opportunity of application-adaptive
    architectures
  • Adaptation Challenges
  • Adaptation for memory hierarchy
  • Why memory hierarchy?
  • Adaptation space and possible gains
  • Summary

3
Technology Evolution
Industry continues to outpace NTRS projections on
technology scaling and IC density.
4
Consider Interconnect
  • Average interconnect delay is greater than the
    gate delays!
  • The reduced marginal cost of logic, together with signal
    regeneration needs, makes it possible to include logic in
    inter-block interconnect.

5
Rethinking Circuits When Interconnect Dominates
  • DEVICE: choose better interconnect
  • Copper, low-temperature interconnect
  • CAD: choose better interconnect topology and sizes
  • Minimize the path from the driver gate to each receiver
    gate
  • e.g., the A-tree algorithm yields about a 12% reduction
    in delay
  • Select wire sizes to minimize net delays
  • e.g., up to a 35% reduction in delay with optimal sizing
    algorithms
  • CKT: use more signal repeaters in block-level designs
  • longest interconnect ≈ 2000 µm for a 350 nm process
  • µ-ARCH: a storage element no longer defines a clock
    boundary
  • Multiple storage elements in a single clock
  • Multiple state transitions in a clock period
  • Storage-controlled routing
  • Reduced marginal cost of logic

6
Implications: Circuit Blocks
  • Frequent use of signal repeaters in block-level
    designs
  • longest interconnect ≈ 2000 µm for a 0.3 µm process
  • A storage element no longer (always) defines a
    clock boundary
  • storage delay (≈1.5x switching delay)
  • multiple storage elements in a single clock
  • multiple state transitions in a clock period
  • storage-controlled routing
  • Circuit block designs that work independently of
    data latencies
  • asynchronous blocks
  • Heterogeneous clocking interfaces
  • pausible clocking [Yun, ICCD '96]
  • mixed synchronous, asynchronous circuit blocks.

7
Implications: Architectures
  • Architectures to exploit interconnect delays
  • pipeline interconnect delays (recall the Cray-2)
  • cycle time = max delay - min delay
  • use interconnect delay as the minimum delay
  • need P&R estimates early in the design
  • Algorithms that use interconnect latencies
  • interconnect as functional units
  • functional unit schedules are based on a measure
    of spatial distances
  • Increase local decision making
  • multiple state transitions in a clock period
  • storage-controlled routing
  • re-programmable blocks in custom layouts

8
Opportunity: Application-Adaptive Architectures
  • Exploit architectural low-hanging fruit
  • performance variation across applications
    (10-100X)
  • performance variation across data-sets (10X)
  • Use interconnect and data-path reconfiguration to
  • increase performance
  • combat performance fragility and
  • improve fault tolerance
  • Configurable hardware is used to improve
    utilization of performance critical resources
  • instead of using configurable hardware to build
    additional resources
  • design goal is to achieve peak performance across
    applications
  • configurable hardware leveraged in efficient
    utilization of performance critical resources

9
Architectural Adaptation
  • Each of the following elements can benefit from
    increased adaptability (above and beyond CPU
    programming)
  • CPU
  • Memory hierarchy: eliminate false sharing
  • Memory system: virtual-memory layout based on cache-miss
    data
  • I/O: disk layout based on access patterns
  • Network interface: scheduling to reduce end-to-end
    latency
  • Adaptability used to build
  • programmable engines in IO, memory controllers,
    cache controllers, network devices
  • configurable data-paths and logic in any part of
    the system
  • configurable queueing in scheduling for
    interconnect, devices, memory
  • smart interfaces for information flow from
    applications to hardware
  • performance monitoring and coordinated resource
    management...

Intelligent interfaces, information formats,
mechanisms and policies.
10
Adaptation Challenges
  • Is application-driven adaptation viable from
    technology and cost point of view?
  • How to structure adaptability
  • to maximize the performance benefits
  • provide protection, multitasking and a reasonable
    programming environment
  • enable easy exploitation of adaptability through
    automatic or semi-automatic means.
  • We focus on memory hierarchy as the first
    candidate to explore the extent and utility of
    adaptation.

11
Why Cache Memory?
12
4-year technological scaling
  • CPU performance increases by 47% per year
  • DRAM performance increases by 7% per year
  • Assume the Alpha is scaled accordingly, and
  • Organization remains 8KB/96KB/4MB/mem
  • Benchmark requirements stay the same
  • Expect something similar if both L2/L3 cache size and
    benchmark size increase
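Assuming compounding at the stated rates, four years of scaling gives roughly 1.47^4 ≈ 4.7x for the CPU versus 1.07^4 ≈ 1.3x for DRAM, so the CPU-memory gap widens by about 3.6x over the scaling period.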

13
(No Transcript)
14
Impact of Memory Stalls
  • A statically scheduled processor with a blocking
    cache stalls, on average, for
  • 15% of the time in integer benchmarks
  • 43% of the time in f.p. benchmarks
  • 70% of the time in the transaction benchmark
  • Possible performance improvements due to improved
    memory hierarchy without technology scaling
  • 1.17x,
  • 1.89x, and
  • 3.33x
  • Possible improvements with technology scaling
  • 2.4x, 7.5x, and 20x
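These factors follow from treating stall time as removable overhead: the ideal speedup from removing a stall fraction s is 1/(1 - s), e.g. 1/(1 - 0.70) ≈ 3.3x for the transaction benchmark.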

15
(No Transcript)
16
Opportunities for Adaptivity in Caches
  • Cache organization
  • Cache performance assist mechanisms
  • Hierarchy organization
  • Memory organization (DRAM, etc)
  • Data layout and address mapping
  • Virtual Memory
  • Compiler assist

17
Opportunities - Contd
  • Cache organization: adapt what?
  • Size: NO
  • Associativity: NO
  • Line size: MAYBE
  • Write policy: YES (fetch, allocate, write-back/through)
  • Mapping function: MAYBE
  • Organization and clock rate are optimized together

18
Opportunities - Contd
  • Cache assist: prefetch, write buffer, victim cache,
    etc., between different levels
  • due to delay/size constraints, all of the above cannot
    be implemented together
  • the improvement as a function of size may not be maximal
    at the largest size
  • Adapt what?
  • which mechanism(s) to use, and their algorithms
  • mechanism parameters: size, lookahead, etc.

19
Opportunities - Contd
  • Hierarchy Organization
  • Where are cache assist mechanisms applied?
  • Between L1 and L2
  • Between L1 and Memory
  • Between L2 and Memory...
  • What are the datapaths like?
  • Is prefetch, victim-cache, or write-buffer data written
    into the next-level cache?
  • How much parallelism is possible in the hierarchy?

20
Opportunities - Contd
  • Memory Organization
  • Cached DRAM?
  • yes, but very limited configurations
  • Interleave change?
  • Hard to accomplish dynamically
  • Tagged memory
  • Keep state for adaptivity

21
Opportunities - Contd
  • Data layout and address mapping
  • In theory, something can be done, but
  • it would require time-consuming data rearrangement
  • the MP (multiprocessor) case is even worse
  • Adaptive address mapping or hashing
  • based on what?

22
Opportunities - Contd
  • Compiler assist can
  • Select initial hardware configuration
  • Pass hints on to hardware
  • Generate code to collect run-time info and adapt
    during execution
  • Adapt configuration after being called at
    certain intervals during execution
  • Re-optimize code at run-time

23
Opportunities - Contd
  • Virtual Memory can adapt
  • Page size?
  • Mapping?
  • Page prefetching/read ahead
  • Write buffer (file cache)
  • The above under multiprogramming?

24
Applying Adaptivity
  • What Drives Adaptivity?
  • Performance impact, overall and/or relative
  • Effectiveness, e.g. miss rate
  • Processor stall introduced
  • Program characteristics
  • When to perform adaptive action?
  • Run time: use feedback from hardware
  • Compile time: insert code, set up hardware

25
Where to Implement Adaptivity?
  • In software: compiler and/or OS
  • (Static) Knowledge of program behavior
  • Factored into optimization and scheduling
  • Extra code, overhead
  • Lack of dynamic run-time information
  • Rate of adaptivity
  • Requires recompilation, OS changes

26
Where to Implement?- Contd
  • Hardware
  • dynamic information available
  • fast decision mechanism possible
  • transparent to software (thus safe)
  • delay, clock rate limit algorithm complexity
  • difficult to maintain long-term trends
  • little knowledge of program behavior

27
Where to Implement - Contd
  • Hardware/software
  • Software can set coarse hardware parameters
  • Hardware can supply software dynamic info
  • Perhaps more complex algorithms can be used
  • Software modification required
  • Communication mechanism required

28
Current Investigation
  • L1 cache assist
  • We see wide variability in assist-mechanism
    effectiveness
  • across individual programs
  • within a program, as a function of time
  • Propose a hardware mechanism to select between
    assist types and allocate buffer space
  • Give compiler an opportunity to set parameters

29
Mechanisms Used (L1 to L2)
  • Prefetching
  • Stream Buffers
  • Stride-directed, based on address alone
  • Miss stride: prefetch the same address, using the number
    of intervening misses as lookahead
  • Pointer Stride
  • Victim Cache
  • Write Buffer

30
Mechanisms Used - Contd
  • A mechanism can be used by itself
  • Which is most effective?
  • All can be used at once
  • Buffer space size and organization fixed
  • No adaptivity involved in current results
  • Observe time-domain behavior

31
Configurations
  • 32KB L1 data cache, 32B lines, direct-map
  • 0.5MB L2 cache, 64B line, direct-map
  • 8-line write buffer
  • Latencies
  • 1-cycle L1, 8-cycle L2, 60-cycle memory
  • 1-cycle prefetch, Write Buffer, Victim Cache
  • All 3 mechanisms at once

32
(No Transcript)
33
(No Transcript)
34
(No Transcript)
35
(No Transcript)
36
Observed Behavior
  • Programs benefit differently from each mechanism
  • none is a consistent winner
  • Within a program, the same holds in the time
    domain between mechanisms.
  • Both of the above facts indicate a likely
    improvement from adaptivity
  • Select a better one among mechanisms
  • Even more can be expected from adaptively
    re-allocating from the combined buffer pool
  • To reduce stall time
  • To reduce the number of misses

37
Possible Adaptive Mechanisms
  • Hardware
  • a common pool of (small) n-word buffers
  • a set of possible policies, a subset of
  • Stride-directed prefetch
  • PC-based prefetch
  • History-based prefetch
  • Victim cache
  • Write buffer

38
Adaptive Hardware - Contd
  • Performance monitors for each type/buffer
  • misses, stall time on hit, thresholds
  • Dynamic buffer allocator among mechanisms
  • Allocation and monitoring policy
  • Predict future behavior from observed past
  • Observe in a time interval ΔT, set for the next ΔT
    (sketched below)
  • Save performance trends in next-level tags (<8 bits)
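As a rough sketch of such an interval-based policy (everything here is illustrative: the counter names, pool size, and the proportional-share rule are assumptions, not from the slides), the allocator might look like:

#include <stddef.h>

#define NUM_MECHANISMS 3   /* e.g., stride prefetch, victim cache, write buffer */
#define POOL_SIZE      16  /* total small n-word buffers in the shared pool */

struct mech_stats {
    unsigned long stall_cycles;  /* stall attributed to this mechanism's misses in the last dT */
    unsigned      buffers;       /* buffers currently allocated to it */
};

/* At the end of each interval dT, give each mechanism a share of the buffer
 * pool proportional to the stall time it failed to hide, then clear the
 * counters for the next observation interval. */
static void reallocate_buffers(struct mech_stats m[NUM_MECHANISMS])
{
    unsigned long total = 0;
    unsigned given = 0;

    for (size_t i = 0; i < NUM_MECHANISMS; i++)
        total += m[i].stall_cycles;

    for (size_t i = 0; i < NUM_MECHANISMS; i++) {
        m[i].buffers = total ? (unsigned)(POOL_SIZE * m[i].stall_cycles / total)
                             : POOL_SIZE / NUM_MECHANISMS;
        given += m[i].buffers;
        m[i].stall_cycles = 0;   /* start a fresh observation interval */
    }

    /* hand any rounding leftovers to mechanism 0 */
    m[0].buffers += POOL_SIZE - given;
}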

39
Adaptive Hardware - Contd
  • Adapt the following
  • Number of buffers per mechanism
  • May also include control, e.g. prediction tables
  • Prefetch lookahead (buffer depth)
  • Increase when buffers fill up and are still
    stalling
  • Adaptivity interval
  • Increase when every

40
Adaptivity via compiler
  • Give software control over configuration setting
  • Provide feedback via the same parameters as used by
    hardware: stall time, miss rate, etc.
  • Have the compiler
  • select program points to change configuration
  • set parameters based on hardware feedback
  • use compile-time knowledge as well
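As a rough illustration of this hardware/software interaction (all names, structures, and the tuning rule below are hypothetical, not part of the project's defined interface), compiler-inserted calls might look like:

/* Hypothetical configuration/feedback interface; stubbed here so the sketch
 * compiles. On real hardware these calls would program and query the
 * configurable cache-assist logic. */
struct assist_config { int prefetch_lookahead; int victim_lines; int wb_entries; };
struct assist_stats  { unsigned long misses; unsigned long stall_cycles; };

static struct assist_config current_cfg;
static void amrm_set_config(const struct assist_config *cfg) { current_cfg = *cfg; }
static void amrm_read_stats(struct assist_stats *out) { out->misses = 0; out->stall_cycles = 0; }

void loop_nest(double *a, const double *b, int n)
{
    struct assist_config cfg = { .prefetch_lookahead = 2, .victim_lines = 4, .wb_entries = 8 };
    struct assist_stats st;

    amrm_set_config(&cfg);               /* compiler-selected starting configuration */
    for (int i = 0; i < n; i++)
        a[i] += b[i];                    /* code region the configuration is tuned for */
    amrm_read_stats(&st);                /* hardware feedback: misses, stall time */
    if (st.stall_cycles > (unsigned long)n) {
        cfg.prefetch_lookahead *= 2;     /* still stalling: deepen prefetch next time */
        amrm_set_config(&cfg);
    }
}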

41
Further opportunities to adapt
  • L2 cache organization
  • variable-size line
  • L2 non-sequential prefetch
  • L3 organization and use (for deep sub-micron)
  • In-memory adaptivity assist (DRAM tags)
  • Multiple processor scenarios
  • Even longer latency
  • Coherence, hardware or software
  • Synchronization
  • Prefetch under and beyond the above
  • Avoid coherence if possible
  • Prefetch past synchronization
  • Assist Adaptive Scheduling

42
The AMRM Project: Compiler, Architecture and VLSI Research
for AA Architectures
  • Compiler control
  • Application analysis, identification of AA mechanisms,
    semantic retention strategies, compiler instrumentation
    for runtime
  • Machine definition
  • Memory hierarchy analysis, reference-structure
    identification
  • Fault detection and containment, interface to mapping
    and synthesis hardware, continuous validation strategies
  • Protection tests
  • Partitioning, synthesis, mapping; algorithms for
    efficient runtime adaptation; efficient reprogrammable
    circuit structures for rapid reconfiguration; prototype
    hardware platform
43
Summary
  • Semiconductor advances are bringing powerful
    changes to how systems are architected and built
  • they challenge underlying assumptions of synchronous
    digital hardware design
  • interconnect (local and global) dominates
    architectural choices, local decision making is
    free
  • in particular, it can be made adaptable using CAD
    tools.
  • The AMRM Project
  • achieve peak performance by adapting machine
    capabilities to application and data
    characteristics.
  • Initial focus on memory hierarchy promises to
    yield high performance gains due to worsening
    effects of memory (vs. cpu speeds) and increased
    data sets.

44
Appendix Assists Being Explored
45
Victim Caching
[Diagram: a small (1-5 line) fully-associative cache,
configured as a victim/stream cache or stream buffer, sits
beside the direct-mapped L1/L2; victim lines evicted from L1
enter the buffer, a mux selects between L1 tags/data and the
fully-associative buffer, and new lines can be routed to the
stream buffer.]
  • A VC is useful for conflict misses and long sequential
    reference streams; it prevents a sharp fall-off in
    performance when the WSS is slightly larger than L1.
  • Estimate WSS from the structure of the RM such as
    the size of the strongly connected components
    (SCCs)
  • MORPH data-path structure supports addition of a
    parameterized victim/stream cache. The control
    logic is synthesized using CAD tools.
  • Victim caches provide 50X the marginal
    improvement in hit rate over the primary cache.

46
Victim Cache
  • Mainly used to eliminate conflict misses
  • Prediction: the memory address of a replaced cache line
    is likely to be accessed again in the near future
  • Scenario where the prediction is effective: false
    sharing, ugly address mapping
  • Architecture implementation: use an on-chip buffer to
    store the contents of recently replaced cache lines
    (sketched below)
  • Drawbacks
  • ugly mappings can be rectified by a cache-aware compiler
  • given the victim cache's small size, the probability of
    an address being reused within such a short period is
    very low
  • experiments show the victim cache is not effective
    across the board for DI apps
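A minimal sketch of that lookup path, assuming a tiny fully-associative buffer of line addresses with FIFO replacement (the sizes, names, and omission of actual data movement are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define VC_LINES 4   /* small fully-associative victim buffer */

struct vc_entry { bool valid; uint32_t line_addr; /* line data omitted */ };
static struct vc_entry vc[VC_LINES];
static unsigned vc_next;  /* simple FIFO replacement pointer */

/* On an L1 miss: a hit in the victim cache avoids the L2 access; the line
 * evicted from L1 (the "victim") is placed into the buffer either way. */
bool victim_cache_lookup(uint32_t miss_line, uint32_t evicted_line)
{
    for (unsigned i = 0; i < VC_LINES; i++) {
        if (vc[i].valid && vc[i].line_addr == miss_line) {
            vc[i].line_addr = evicted_line;   /* swap: victim takes the freed slot */
            return true;
        }
    }
    vc[vc_next].valid = true;                 /* miss in VC too: just record the victim */
    vc[vc_next].line_addr = evicted_line;
    vc_next = (vc_next + 1) % VC_LINES;
    return false;
}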

47
Stream Buffer
  • Mainly used to eliminate compulsory/capacity
    misses
  • Prediction: if a memory address misses, the consecutive
    address is likely to miss in the near future
  • Scenario where the prediction is useful: stream
    (sequential) access
  • Architecture implementation: on an address miss,
    prefetch the consecutive address into an on-chip buffer;
    on a hit in the stream buffer, prefetch the address
    consecutive to the hit address (sketched below)
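A minimal sketch of that policy, assuming 32-byte L1 lines and a small FIFO of prefetched line numbers (the depth, names, and restart rule are illustrative):

#include <stdint.h>

#define LINE_BYTES 32u
#define SB_DEPTH    4u

static uint32_t stream_buf[SB_DEPTH];  /* line numbers prefetched ahead of the stream */
static unsigned sb_head, sb_count;

static void sb_prefetch(uint32_t line)
{
    /* <initiate fetch of cache line `line` into the stream buffer> */
    stream_buf[(sb_head + sb_count) % SB_DEPTH] = line;
    if (sb_count < SB_DEPTH) sb_count++;
    else sb_head = (sb_head + 1) % SB_DEPTH;
}

/* On an ordinary L1 miss: restart the stream at the next sequential line.
 * On a stream-buffer hit: consume the head entry and prefetch one more line. */
void stream_buffer_access(uint32_t miss_addr)
{
    uint32_t line = miss_addr / LINE_BYTES;
    if (sb_count > 0 && stream_buf[sb_head] == line) {
        sb_head = (sb_head + 1) % SB_DEPTH;
        sb_count--;
        sb_prefetch(line + 1);     /* keep running ahead of the consumer */
    } else {
        sb_head = sb_count = 0;    /* stream broken: start over */
        sb_prefetch(line + 1);
    }
}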

48
Stream Cache
  • Modification of stream buffer
  • Use a separate cache to store stream data to
    prevent cache pollution
  • When there is a hit in the stream buffer, the hit line
    is placed into the stream cache instead of the L1 cache

49
Stride Prefetch
  • Mainly used to eliminate compulsory/capacity misses
  • Prediction: if a memory address misses, the address
    offset from it by a fixed distance is likely to miss in
    the near future
  • Scenario where the prediction is useful: strided access
  • Architecture implementation: on an address miss,
    prefetch the address offset by the distance from the
    missed address; on a hit in the buffer, also prefetch
    the address offset by the distance from the hit address
    (sketched below)
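In sketch form, this differs from the sequential stream buffer only in the offset used (the distance value and the buffer-fetch stub are illustrative):

#include <stdint.h>

#define STRIDE_BYTES 128u   /* illustrative fixed "distance" */

/* Stub: a real implementation would initiate a fetch into the on-chip
 * stride-prefetch buffer rather than directly into L1. */
static void fetch_into_buffer(uint32_t addr) { (void)addr; }

void stride_on_miss(uint32_t miss_addr) { fetch_into_buffer(miss_addr + STRIDE_BYTES); }
void stride_on_hit (uint32_t hit_addr)  { fetch_into_buffer(hit_addr  + STRIDE_BYTES); }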

50
Miss Stride Buffer
  • Mainly used to eliminate conflict misses
  • Prediction: if a memory address misses again after N
    other misses, it is likely to miss again after another N
    misses
  • Scenario for the prediction to be useful:
  • multiple loop nests
  • some variables or array elements are reused
    across iterations

51
Advantage over Victim Cache
  • Eliminates conflict misses that even a cache-aware
    compiler cannot eliminate
  • ugly mappings are fewer and can be rectified
  • many more conflicts are random: from a probability
    perspective, a given memory address will conflict with
    some other address after enough time, but we cannot know
    at compile time with which address it will conflict
  • There can be a much longer period before the conflicting
    address is reused
  • longer than the victim cache's small size can cover

52
Architecture Implementation
  • Memory history buffer (see the sketch below)
  • a FIFO buffer that records recently missed memory
    addresses
  • predict only when there is a hit in the buffer
  • the miss stride can be calculated from the relative
    positions of consecutive misses to the same address
  • the size of the buffer determines the number of
    predictions
  • Prefetch buffer (on-chip)
  • stores the contents of prefetched memory addresses
  • the size of the buffer determines how much variation in
    the miss stride we can tolerate
  • Prefetch scheduler
  • selects the right time to prefetch
  • avoids collisions
  • Prefetcher
  • prefetches the contents of the predicted miss address
    into the on-chip prefetch buffer
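A minimal sketch of the history-buffer prediction step, assuming a small FIFO of recently missed line addresses (the depth, names, and the hand-off to the scheduler are illustrative):

#include <stdbool.h>
#include <stdint.h>

#define HIST_DEPTH 16u   /* FIFO of recently missed line addresses */

static uint32_t hist[HIST_DEPTH];
static unsigned hist_next;    /* next slot to overwrite (oldest entry) */
static unsigned hist_valid;   /* number of filled entries */

/* Record a miss. If the same line already appears in the history, the number
 * of intervening misses is its miss stride; the prefetch scheduler would use
 * it to fetch the line again that many misses in the future. */
bool record_miss(uint32_t line, unsigned *stride_out)
{
    bool repeat = false;
    for (unsigned i = 0; i < hist_valid && !repeat; i++) {
        unsigned pos = (hist_next + HIST_DEPTH - 1 - i) % HIST_DEPTH;  /* newest first */
        if (hist[pos] == line) {
            *stride_out = i + 1;   /* misses seen since the previous miss to this line */
            repeat = true;
        }
    }
    hist[hist_next] = line;                    /* always record the current miss */
    hist_next = (hist_next + 1) % HIST_DEPTH;
    if (hist_valid < HIST_DEPTH) hist_valid++;
    return repeat;
}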

53
Global Picture
[Diagram: overall structure of the miss-stride mechanism, driven by the miss address.]
54
[Diagram: miss-history FIFO with head/tail pointers, <12>-bit address fields, and hit (Yes/No) decision logic.]
55
Prefetch Scheduler
56
Pointer Stream Buffer
57
Appendix Prefetching Adaptation Results
58
Prefetching for Latency & BW Management
  • Combat latency deterioration
  • optimal prefetching
  • memory side pointer chasing
  • blocking mechanisms
  • fast barrier, broadcast support
  • synchronization support
  • Bandwidth management
  • memory (re)organization to suit application
    characteristics
  • translate and gather hardware
  • prefetching with compaction

59
Adaptation for Latency Tolerance
  • Operation
  • 1. Application sets prefetch parameters (compiler
    controlled)
  • 2. Prefetching event generation (runtime controlled)
  • when a new cache block is filled

if (start < vAddr && vAddr < end) {     /* only within the enabled region */
    if (pAddr & 0x20)                   /* prefetch the adjacent 32-byte line */
        addr = pAddr - 0x20;
    else
        addr = pAddr + 0x20;
    <initiate fetch of cache line at addr to L1>
}
60
Prefetching Experiments
  • Operation
  • 1. Application sets prefetch parameters (compiler
    controlled)
  • set lower/upper bounds on memory regions (for memory
    protection, etc.)
  • download pointer extraction function
  • element size
  • 2. Prefetching event generation (runtime
    controlled)
  • when a new cache block is filled
  • Application view
  • generate_event (pointer to pass matrix element
    structure)
  • ...
  • generate_event(signal to enable prefetch)
  • lt code on which prefetching is applied gt
  • generate_event(signal to disable prefetch)

61
Adaptation for Bandwidth Reduction
  • Prefetching Entire Row/Column
  • Pack Cache with Used Data Only

62
Simulation Results Latency
  • Sparse MM: blocking, prefetching, packing (all based on
    application data structures)

Cache Miss Rate
10X reduction in latency using application data
structure optimization
63
Simulation Results Bandwidth
  • Optimization designed to significantly reduce the
    volume of data traffic
  • efficient fast storage management, packing,
    prefetch

Data Traffic (MB)
100X reduction in BW using application-specific
packing and fetching.