Polar Opposites: Next Generation Languages

Title: Calvin & Kathryn's Wonderful Group. Author: CSCF. Last modified by: Kathryn McKinley. Created Date: 9/27/2001 8:35:48 PM. Document presentation format: PowerPoint PPT presentation.

Slides: 42
Provided by: CSCF


1
Polar Opposites: Next Generation Languages and Architectures
  • Kathryn S McKinley
  • The University of Texas at Austin

2
Collaborators
  • Faculty
  • Steve Blackburn, Doug Burger, Perry Cheng, Steve
    Keckler, Eliot Moss
  • Graduate Students
  • Xianglong Huang, Sundeep Kushwaha, Aaron Smith,
    Zhenlin Wang (MTU)
  • Research Staff
  • Jim Burrill, Sam Guyer, Bill Yoder

3
Computing in the Twenty-First Century
  • New and changing architectures
  • Hitting the microprocessor wall
  • TRIPS - an architecture for future technology
  • Object-oriented languages
  • Java and C# becoming mainstream
  • Key challenges and approaches
  • Memory gap, parallelism
  • Language runtime implementation efficiency
  • Orchestrating a new software/hardware dance
  • Break down artificial system boundaries

4
Technology Scaling Hitting the Wall
[Figure: technology scaling shown analytically and qualitatively at 130 nm, 100 nm, 70 nm, and 35 nm, for a 20 mm chip edge]
Either way: partitioning for on-chip communication is key
5
End of the Road for Out-of-Order SuperScalars
  • Clock ride is over
  • Wire and pipeline limits
  • Quadratic out-of-order issue logic
  • Power, a first order constraint
  • Major vendors ending processor lines
  • Problems for any architectural solution
  • ILP - instruction level parallelism
  • Memory latency

6
Where are Programming Languages?
  • High Productivity Languages
  • Java, C#, Matlab, S, Python, Perl
  • High Performance Languages
  • C/C++, Fortran
  • Why not both in one?
  • Interpretation/JIT vs compilation
  • Language representation
  • Pointers, arrays, frequent method calls, etc.
  • Automatic memory management costs
  • Obscure ILP and memory behavior

7
Outline
  • TRIPS
  • Next generation tiled EDGE architecture
  • ILP compilation model
  • Memory system performance
  • Garbage collection influence
  • The GC advantage
  • Locality, locality, locality
  • Online adaptive copying
  • Cooperative software/hardware caching

8
TRIPS
  • Project Goals
  • Fast clock and high ILP in future technologies
  • Architecture sustains 1 TRIPS in 35 nm technology
  • Cost-performance scalability
  • Find the right hardware/software balance
  • New balance reduces hardware complexity and power
  • New compiler responsibilities and challenges
  • Hardware/Software Prototype
  • Proof-of-concept of scalability and
    configurability
  • Technology transfer

9
TRIPS Prototype Architecture
10
Execution Substrate
[Figure: execution substrate with global control and branch predictor, register banks (0 to 3), I-cache H, I-cache 0 to 3, D-cache/LSQ 0 to 3, and an execution array of execution nodes]
  • Interconnect topology and latency exposed to compiler scheduler

11
Large Instruction Window
[Figure: execution node with control, router, ALU, and instruction buffers whose entries hold opcode, src1, and src2 fields]
Out-of-order instruction buffers form a logical z-dimension in each node
4 logical frames of 4 x 4 instructions
  • Instruction buffers add depth to execution array
  • 2D array of ALUs, 3D volume of instructions
  • Entire 3D volume exposed to compiler
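As a rough illustration of the 3D instruction volume described above, the sketch below maps a flat instruction slot number to a (frame, row, column) coordinate. The sizes come from the slide (4 logical frames of 4 x 4 instructions); the numbering scheme and the name `slot_coordinates` are assumptions made for illustration, not the TRIPS encoding.

```python
# Sizes taken from the slide: 4 logical frames of 4 x 4 instructions.
FRAMES, ROWS, COLS = 4, 4, 4

def slot_coordinates(slot):
    """Map a flat slot number to (frame, row, col).

    The flat numbering is hypothetical: frame-major, then row, then column.
    """
    if not 0 <= slot < FRAMES * ROWS * COLS:
        raise ValueError("slot outside the instruction volume")
    frame, rest = divmod(slot, ROWS * COLS)
    row, col = divmod(rest, COLS)
    return frame, row, col

# Total number of instruction slots the compiler can fill in this sketch.
VOLUME = FRAMES * ROWS * COLS
```

Under these assumptions the compiler sees 64 slots, and two slots that share (row, col) but differ in frame sit on the same ALU at different buffer depths.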

12
Execution Model
  • SPDI - static placement, dynamic issue
  • Dataflow within a block
  • Sequential between blocks
  • TRIPS compiler challenges
  • Create large blocks of instructions
  • Single entry, multiple exit, predication
  • Schedule blocks of instructions on a tile
  • Resource limitations
  • Registers, Memory operations

13
Block Execution Model
  • Program execution
  • Fetch and map block to TRIPS grid
  • Execute block, produce result(s)
  • Commit results
  • Repeat
  • Block dataflow execution
  • Each cycle, execute a ready instruction at every
    node
  • Single read of registers and memory locations
  • Single write of registers and memory locations
  • Update the PC to successor block
  • TRIPS core may speculatively execute multiple
    blocks (as well as instructions)
  • TRIPS uses branch prediction and register
    renaming between blocks, but not within a block

[Figure: control-flow graph from start to end through blocks A, B, C, D, and E]
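The block dataflow steps on this slide can be sketched as a toy simulator. It is a simplified model under stated assumptions: each instruction is statically placed on a node, each node issues at most one ready instruction per cycle, and routing latency between nodes is ignored. The function `run_block` and the instruction encoding are hypothetical, not the TRIPS ISA.

```python
import operator

def run_block(instrs, inputs):
    """Execute one block in dataflow order.

    instrs: name -> (node, op, [source names]), statically placed by the compiler.
    inputs: register values read once at block entry.
    Returns (all values, cycles taken).
    """
    values = dict(inputs)
    pending = dict(instrs)
    cycles = 0
    while pending:
        busy_nodes = set()
        ready = []
        # Dynamic issue: each node picks one instruction whose operands have arrived.
        for name, (node, _op, srcs) in pending.items():
            if node not in busy_nodes and all(s in values for s in srcs):
                busy_nodes.add(node)
                ready.append(name)
        if not ready:
            raise ValueError("no instruction can issue: missing operand")
        # Results computed this cycle become visible to consumers next cycle.
        results = {}
        for name in ready:
            _node, op, srcs = pending.pop(name)
            results[name] = op(*(values[s] for s in srcs))
        values.update(results)
        cycles += 1
    return values, cycles

# (a + b) * (c - d) with the two independent operations on different nodes.
block = {
    "t1": ((0, 0), operator.add, ["a", "b"]),
    "t2": ((0, 1), operator.sub, ["c", "d"]),
    "t3": ((1, 0), operator.mul, ["t1", "t2"]),
}
regs = {"a": 1, "b": 2, "c": 7, "d": 3}
values, cycles = run_block(block, regs)
```

Placing `t1` and `t2` on the same node would serialize them and cost an extra cycle, which is why placement is left to the compiler while issue order is left to the hardware.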
14
Just Right Division of Labor
  • TRIPS architecture
  • Eliminates short-term temporaries
  • Out-of-order execution at every node in grid
  • Exploits ILP, hides unpredictable latencies
  • without superscalar quadratic hardware
  • without VLIW guarantees of completion time
  • Scale compiler - generate ILP
  • Large hyperblocks - predicate, unroll, inline,
    etc.
  • Schedule hyperblocks
  • Map independent instructions to different nodes
  • Map communicating instructions to same or close
    nodes
  • Let hardware deal with unpredictable latencies
    (loads)
  • Exploits Hardware and Compiler Strengths
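The two mapping rules above (independent instructions to different nodes, communicating instructions to the same or close nodes) can be illustrated with a greedy placement sketch. This is not the Scale compiler's scheduler; the cost function (Manhattan distance to producers plus the number of instructions already on a node) and the name `place` are assumptions chosen to keep the example short.

```python
from itertools import product

def place(instrs, rows=4, cols=4):
    """Greedily assign instructions to grid nodes.

    instrs: list of (name, [producer names]) in dependence order.
    Returns name -> (row, col).
    """
    placement = {}
    load = {node: 0 for node in product(range(rows), range(cols))}
    for name, producers in instrs:
        def cost(node):
            # Distance to already placed producers keeps communication short;
            # the load term spreads independent instructions across nodes.
            dist = sum(abs(node[0] - placement[p][0]) + abs(node[1] - placement[p][1])
                       for p in producers if p in placement)
            return (dist + load[node], node)
        best = min(load, key=cost)
        placement[name] = best
        load[best] += 1
    return placement

def distance(a, b):
    """Manhattan distance between two grid nodes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

layout = place([("t1", []), ("t2", []), ("t3", ["t1", "t2"])])
```

In this small example the two independent instructions land on different nodes and their consumer lands within one hop of both.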

15
High Productivity Programming Languages
  • Interpretation/JIT vs compilation
  • Language representation
  • Pointers, arrays, frequent method calls, etc.
  • Automatic memory management costs
  • MMTk in IBM Jikes RVM
  • ICSE04, SIGMETRICS04
  • Memory Management Toolkit for Java
  • High Performance, Extensible, Portable
  • Mark-Sweep, Copying SemiSpace, Reference Counting
  • Generational collection, Beltway, etc.

16
Allocation Choices
  • Bump-Pointer
  • Fast (increment and bounds check)
  • Can't incrementally free and reuse; must free en
    masse
  • Free-List
  • Relatively slow (consult list for fit)
  • Can incrementally free and reuse cells
17
Allocation Choices
  • Bump pointer
  • 70 bytes of IA32 instructions, 726 MB/s
  • Free list
  • 140 bytes of IA32 instructions, 654 MB/s
  • Bump pointer 11% faster in tight loop (726 vs. 654 MB/s)
  • < 1% in practical setting
  • No significant difference (?)
  • Second order effects?
  • Locality??
  • Collection mechanism??

18
Implications for Locality
  • Compare SemiSpace (SS) and Mark-Sweep (MS) mutator
  • Mutator time
  • Mutator memory performance: L1, L2, and TLB

19
javac
20
pseudojbb
21
db
22
Locality and Architecture
23
MS/SS Crossover: 1.6GHz PPC
24
MS/SS Crossover: 1.9GHz AMD
25
MS/SS Crossover: 2.6GHz P4
26
MS/SS Crossover: 3.2GHz P4
27
MS/SS Crossover
[Figure: MS/SS crossover between locality and space for the 1.6GHz, 1.9GHz, 2.6GHz, and 3.2GHz machines]
28
Locality in Memory Management
  • Explicit memory management on its way out
  • Key GC vs. explicit MM insights are 20 years old
  • Technology has changed and is still changing
  • Generational and Beltway Collectors
  • Significant collection time benefits over
    full heap collectors
  • Collect young objects
  • Infrequently collect old space
  • Copying nursery attains similar locality effects
    as full heap

29
Where are the Misses?
Generational Copying Collector
30
Copy Order
  • Static copy orders
  • Breadth first - Cheney scan
  • Depth first, hierarchical
  • Problem: one size does not fit all
  • Static profiling per class
  • Inconsistent with JIT
  • Object sampling
  • Too expensive in our experience
  • OOR - Online Object Reordering
  • OOPSLA04
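To make the static copy orders concrete, the sketch below computes the order in which a copying collector would visit objects under breadth-first (Cheney-style) and depth-first traversal. The object graph and the function names are hypothetical; a real collector copies objects between spaces rather than building a list.

```python
from collections import deque

def breadth_first_order(roots, children):
    """Visit order of a Cheney-style scan: a FIFO queue over references."""
    order, seen, queue = [], set(roots), deque(roots)
    while queue:
        obj = queue.popleft()
        order.append(obj)
        for child in children.get(obj, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order

def depth_first_order(roots, children):
    """Visit order of a depth-first copy: a LIFO stack over references."""
    order, seen = [], set()
    stack = list(reversed(roots))
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        order.append(obj)
        # Reverse so the first child is visited first.
        stack.extend(reversed(children.get(obj, [])))
    return order

# Hypothetical object graph: object -> referenced objects.
graph = {1: [2, 3, 4], 2: [6], 4: [5, 7]}
bfs = breadth_first_order([1], graph)
dfs = depth_first_order([1], graph)
```

Breadth first places siblings together and separates parents from grandchildren; depth first keeps one parent-to-child chain contiguous. Neither order knows which references the program actually follows, which is the "one size does not fit all" problem.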

31
OOR Overview
  • Records object accesses in each method (excludes
    cold basic blocks)
  • Finds hot methods by dynamic sampling
  • Reorders objects with hot fields in higher
    generation during GC
  • Copies hot objects into separate region

32
Static Analysis Example
  • Method Foo
  • Class A a
  • try
  • a.b
  • catch(Exception e)
  • a.c

Hot BB: compiler collects access info
Cold BB: compiler ignores
Access List: 1. A.b 2. . .
33
Adaptive Sampling
  • Method Foo
  • Class A a
  • try
  • a.b
  • catch(Exception e)
  • a.c

Foo Accesses: 1. A.b 2. . .
Foo is hot, so A.b is hot
[Figure: object A with fields b, c, .. and object B]
34
Advice Directed Reordering
  • Example
  • Assume (1,4), (4,7) and (2,6) are hot field
    accesses
  • Order: 1, 4, 7, 2, 6, then 3, 5

[Figure: object graph with nodes 1 to 7]
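One way to read the example is that the collector follows hot references immediately and defers cold ones. The sketch below implements that reading. The object graph is a guess (the slide's figure is not recoverable from the transcript), and `advice_directed_order` is a hypothetical name, not the OOR implementation.

```python
from collections import deque

def advice_directed_order(roots, children, hot_edges):
    """Copy order that chases hot field references before cold ones."""
    copied, seen = [], set()
    cold = deque(roots)       # objects waiting to be reached via cold references
    while cold:
        stack = [cold.popleft()]
        while stack:
            obj = stack.pop()
            if obj in seen:
                continue
            seen.add(obj)
            copied.append(obj)
            kids = children.get(obj, [])
            # Defer cold referents; copy hot referents right after their parent.
            cold.extend(k for k in kids if (obj, k) not in hot_edges)
            stack.extend(reversed([k for k in kids if (obj, k) in hot_edges]))
    return copied

# Assumed graph; only the hot edges (1,4), (4,7), (2,6) come from the slide.
graph = {1: [2, 3, 4], 2: [6], 4: [5, 7]}
hot = {(1, 4), (4, 7), (2, 6)}
order = advice_directed_order([1], graph, hot)
```

On this assumed graph the result matches the order given on the slide: each hot parent and child pair is adjacent, and objects 3 and 5 come last.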
35
OOR System Overview
[Figure: OOR system overview. Labels: source code, baseline compiler, optimizing compiler (adds entries), access info database (look up), adaptive sampling (hot methods), register hot field accesses, advice, GC copies objects, affects locality, executing code. Legend: OOR addition, Jikes RVM, input/output.]
36
Cost of OOR
Benchmark    Default    OOR       Difference (%)
jess           4.39       4.43     0.84
jack           5.79       5.82     0.57
raytrace       4.63       4.61    -0.59
mtrt           4.95       4.99     0.70
javac         12.83      12.70    -1.05
compress       8.56       8.54    -0.20
pseudojbb     13.39      13.43     0.36
db            18.88      18.88    -0.03
antlr          0.94       0.91    -2.90
gcold          1.21       1.23     1.49
hsqldb       160.56     158.46    -1.30
ipsixql       41.62      42.43     1.93
jython        37.71      37.16    -1.44
ps-fun       129.24     128.04    -1.03
Mean                              -0.19
37
Performance db
38
Performance jython
39
Performance javac
40
Software is not enough; hardware is not enough
  • Problem: inefficient use of cache
  • Hardware limitations: set associativity, cannot
    predict the future
  • Cooperative Software/Hardware Caching
  • Combines high level compiler analysis with
    dynamic miss behavior
  • Lightweight ISA support conveys compiler's global
    view to hardware
  • Compiler-guided cache replacement (evict-me)
  • Compiler-guided region prefetching
  • ISCA03, PACT02
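Compiler-guided replacement can be illustrated with a single cache set that uses LRU but prefers victims the compiler has tagged evict-me. This is a sketch under assumptions: the hint is modeled as a boolean per access, and `EvictMeSet` and `count_hits` are hypothetical names rather than the ISA mechanism from the cited papers.

```python
from collections import OrderedDict

class EvictMeSet:
    """One cache set: LRU replacement, but lines tagged evict-me go first."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()   # tag -> evict-me flag, least recent first

    def access(self, tag, evict_me=False):
        """Return True on a hit. evict_me stands in for the compiler's hint."""
        hit = tag in self.lines
        if hit:
            del self.lines[tag]      # re-inserted below as most recent
        elif len(self.lines) >= self.ways:
            lru = next(iter(self.lines))
            # Prefer the oldest line carrying the hint; fall back to plain LRU.
            victim = next((t for t, flag in self.lines.items() if flag), lru)
            del self.lines[victim]
        self.lines[tag] = evict_me
        return hit

def count_hits(trace, ways, use_hints):
    """Replay (tag, hint) pairs and count hits, with or without the hints."""
    cache_set = EvictMeSet(ways)
    return sum(cache_set.access(tag, hint and use_hints) for tag, hint in trace)

# "a" is reused; "x", "y", "z" are streamed once and tagged evict-me.
trace = [("a", False), ("x", True), ("y", True),
         ("a", False), ("z", True), ("a", False)]
with_hints = count_hits(trace, ways=2, use_hints=True)
without_hints = count_hits(trace, ways=2, use_hints=False)
```

In this trace plain LRU evicts the reused line to make room for streaming data, while the hinted policy keeps it resident, which is the kind of future knowledge the hardware cannot derive on its own.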

41
Exciting Times
  • Dramatic architectural changes
  • Execution tiles
  • Cache and memory tiles
  • Next generation system solutions
  • Moving hardware/software boundaries
  • Online optimizations
  • Key compiler challenges (same old)
  • ILP and cache/memory hierarchy