Kari Tiensyrjä - PowerPoint PPT Presentation

About This Presentation

Title:

Kari Tiensyrjä

Description:

... Evaluation Hearing. 1. Kari Tiensyrj . Senior Research Scientist ... Senior Principal Researcher. NEC Europe. Ben Juurlink. Professor. Delft ... Games, ... – PowerPoint PPT presentation

Slides: 13

Provided by: irjak

Learn more at: https://pages.cs.wisc.edu

Transcript and Presenter's Notes

Title: Kari Tiensyrjä

1
Kari Tiensyrjä, Senior Research Scientist, VTT

FP6-2004-IST-4 FET Proactive Initiative ACA
SUPERcomputing on a CHIP SUPERCHIP
Proposal Number 26888

Jesper Larsson Träff, Senior Principal Researcher, NEC Europe
Ian Phillips, Prof., Principal Staff Engineer, ARM
Ben Juurlink, Professor, Delft University of Technology
2
1. Paths to exploitation

FET project with potential for application breakthroughs on a 10-year horizon
Industrial partners (NEC, ARM, Intel) cover a wide spectrum of application domains and provide
Steering of scientific and technological research
Transfer of knowledge and results to, and interplay with, company design groups
Proposals to standardization bodies, where relevant (B.3.6)
Active promotion of results (T6.1 and T6.2)
High-profile scientific and applied conferences
and journals
Organization of workshops
PhD courses and summer schools, incorporation
into advanced curricula
Links to NoEs
WP6 (led by Intel): dissemination and exploitation (also B.3.3, B.4.1.7, and B.8.2.6)
T6.3 for technology transfer
T6.4 for exploitation

3
2. Target applications

A wide range of applications with high computational requirements will be considered
WP4 will analyse and identify applications, and selected sample applications will be implemented as proof-of-concept
An initial set of applications considered:

Mobile devices (energy-efficiency)
PDA, HDTV
Games, virtual reality

Desktops and servers (versatility from
high-performance/single-application to
high-throughput application suites)
Streaming and DSP applications, e.g. video in
bandwidth constrained active networks and
embedded 3D graphics
Real-time speech recognition and
videoconferencing
Database applications, string processing,
geographical information processing

Supercomputer (high-performance)
Vectorised CFD Boltzmann automata
MPI-parallelised finite element methods
Quantum Chromodynamics

4
3. Leading contenders within the proposal

Objectives: boost performance by 2-3 orders of magnitude (compared to the same transistor count), exploit parallelism at all levels, realise an easy-to-use strong model of computing, and provide scalability, a wide application area and power saving techniques

Eclipse: Scalable NOC with EREW PRAM model; simultaneous ILP-TLP exploitation; cacheless memory; regular structure
XMT: CMP with PRAM-like but more asynchronous model; SMT synchronization mechanism; on-chip caches
CMP: Shared memory using caches; advanced cache coherency protocols
TTA/PISMA: Tiled architecture with virtual shared memory communication; very simple and strongly decentralized organization
TRIPS: Single chip reconfigurable processor/memory architecture; grids of ALUs connected via operand networks; static spatial scheduling
5
3. Leading contenders within the proposal (cont)

Initial choice of architectures is partially guided by application requirements
Eclipse and XMT: general purpose computing, embedded computing
Advanced CMP: high-throughput desktop and server machines
TTA/PISMA: streaming/DSP
TRIPS: HPC, streaming/DSP, threaded servers
Procedure to choose the initial SUPERCHIP architecture
1. Develop an architecture evaluation framework (T1.1)
2. Develop semi-analytical power/performance/cost models (T5.1)
3. Develop/modify existing simulators for the architectures (T5.2)
4. Design benchmark programs for the architectures (T4.1)
5. Perform evaluation, identify strong/weak points, select (T1.1)
Preliminary criteria
Power, performance, cost (silicon area)
Estimated scalability, PRAM-like model support,
ease of programming
Estimated coverage for aimed application area,
TLP-ILP co-exploitation
Potential for solving the rest of the problems
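
The evaluate-and-select step above could be approached in many ways; the slides do not say how the criteria are combined. As a minimal sketch, assuming a simple weighted sum over normalized criterion scores (higher is better), the ranking could look like this. The function name, the candidate labels arch_a and arch_b, and every number are hypothetical placeholders, not results from the proposal.

```python
from typing import Dict, List, Tuple


def rank_candidates(
    scores: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Rank candidates by the weighted mean of their criterion scores.

    scores[name][criterion] is assumed to be normalized to 0..1,
    with higher meaning better (so power and cost must be inverted
    before they are passed in).
    """
    total_weight = sum(weights.values())
    ranked = []
    for name, per_criterion in scores.items():
        value = sum(w * per_criterion[c] for c, w in weights.items())
        ranked.append((name, value / total_weight))
    # Best candidate first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


if __name__ == "__main__":
    # Placeholder inputs for illustration only.
    example_scores = {
        "arch_a": {"power": 0.5, "performance": 1.0},
        "arch_b": {"power": 1.0, "performance": 0.5},
    }
    example_weights = {"power": 1.0, "performance": 3.0}
    for name, value in rank_candidates(example_scores, example_weights):
        print(name, value)
```

A weighted sum is only one option; it hides trade-offs that a per-criterion comparison of strong and weak points would keep visible.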

6
4. Ensuring HW implementation technologies impact
on choice of scalable architecture

Scalability issues are observed in the initial selection of candidate architectures
Mesh-like topologies (providing constant wire length links): Eclipse, CMP, TTA, TRIPS
Regular structures: Eclipse, CMP, TTA, TRIPS
No forwarding networks (Eclipse) or multistage forwarding networks (TRIPS)
No cache coherency mechanisms: Eclipse
Multithreading: Eclipse, XMT
Decentralized structure: Eclipse, CMP, TTA, TRIPS
Semi-analytical modeling of the architectures and candidate techniques (T5.1)
Analytical parametric power/performance/cost estimation models
Hardware implementation parameters are extracted from
Technology roadmaps, e.g. ITRS
Pragmatic experience and knowledge of industrial partners
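
The slides do not specify the form of the parametric power/performance/cost models. As a rough sketch of what such a model might look like, the following combines three textbook relations: Amdahl's law for speedup, the CMOS dynamic power estimate C * V^2 * f per core, and silicon area as the cost proxy. The class, field names and any parameter values are assumptions for illustration; real inputs would come from the roadmaps and partner experience mentioned above.

```python
from dataclasses import dataclass


@dataclass
class ChipParams:
    """Hypothetical parameter set for a multi-core chip."""
    cores: int
    freq_hz: float            # clock frequency
    vdd: float                # supply voltage in volts
    cap_per_core_f: float     # effective switched capacitance per core (farads)
    area_per_core_mm2: float  # silicon area per core
    parallel_fraction: float  # fraction of the workload that parallelizes


def speedup(p: ChipParams) -> float:
    """Amdahl's law: speedup over one core for the given parallel fraction."""
    serial = 1.0 - p.parallel_fraction
    return 1.0 / (serial + p.parallel_fraction / p.cores)


def dynamic_power_w(p: ChipParams) -> float:
    """Dynamic power estimate: cores * C * V^2 * f (leakage ignored)."""
    return p.cores * p.cap_per_core_f * p.vdd ** 2 * p.freq_hz


def area_mm2(p: ChipParams) -> float:
    """Cost proxy: total core area (interconnect and memory ignored)."""
    return p.cores * p.area_per_core_mm2
```

Sweeping `cores` or `freq_hz` over a range and tabulating the three outputs gives the kind of parametric trade-off curve such models are typically used for.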

7
4. Ensuring HW implementation technology impact
on our choice of scalable architecture (cont)

Architectural simulation (T5.2)
Develop/modify existing simulators
Benchmarks
Sample applications
Information on execution time, resource
utilization and power consumption is extracted
Modeling of the critical parts of architectures
Feasibility analysis of candidate architectures
Studies on fault tolerance, clocking schemes,
on-chip/off-chip communication, power saving and
other implementation related issues for the
SUPERCHIP architecture (T5.3)
Detailed modeling and feasibility assessment of
critical parts of the SUPERCHIP architecture
(T5.4)

8
5. Evolvement of the PRAM model for the candidate
architectures

For ease of programming, the SUPERCHIP programming model will be based on a PRAM-like model, considering:
Relaxed synchronization (BSP-like)
Strong memory semantics (CRCW-like, built-in
operators)
Potential for locality exploitation (memory,
Hierarchical-PRAM)

SUPERCHIP will develop the necessary
architectural support for this model

Architectural requirements
Synchronization: implicit after each instruction
Bandwidth: high bisection bandwidth to handle random communication
Latency: communication/memory access latency should be hidden
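
To make the model concrete, here is a small software simulation of one synchronous PRAM step. It is a sketch under stated assumptions, not the SUPERCHIP design: every processor reads the same pre-step memory state (which is what implicit synchronization after each instruction implies), and concurrent writes to one address are merged by a combining operator, which is one possible reading of "CRCW-like, built-in operators". The function name and signature are hypothetical.

```python
import operator
from typing import Callable, Dict, List, Optional, Tuple

# A processor either writes (address, value) or does nothing in a step.
Write = Optional[Tuple[int, int]]


def pram_step(
    memory: List[int],
    num_procs: int,
    program: Callable[[int, List[int]], Write],
    combine: Callable[[int, int], int] = operator.add,
) -> List[int]:
    """Run one lockstep instruction on all processors; return the new memory."""
    snapshot = list(memory)  # every processor sees the pre-step state
    pending: Dict[int, int] = {}
    for pid in range(num_procs):
        write = program(pid, snapshot)
        if write is None:
            continue
        addr, value = write
        # Concurrent writes to the same address are combined, not raced.
        pending[addr] = combine(pending[addr], value) if addr in pending else value
    result = list(memory)
    for addr, value in pending.items():
        result[addr] = value  # the combined value replaces the old contents
    return result


if __name__ == "__main__":
    # Four processors each write their own element to cell 4:
    # the array sum arrives in a single step.
    mem = [3, 1, 4, 1, 0]
    print(pram_step(mem, 4, lambda pid, m: (4, m[pid])))  # [3, 1, 4, 1, 9]
```

Passing `combine=max` instead turns the same step into a one-step maximum, which shows why built-in combining operators make the model attractive to program against.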

SUPERCHIP will not investigate PRAM-implementation
on distributed memory architectures in general

Long-term research issue: evolution of programming model and architecture to SUPERCHIP constellations

9
5. Evolvement of the PRAM model for the candidate
architectures (cont)
Candidate | Synchronization | Bisection bandwidth | Latency hiding | Initial model
Eclipse | synchronization wave, fast barrier mechanism | P/2 | Super-pipelined multithreading | EREW PRAM
XMT | hardware synchronization | ? | caches | PRAM-like
CMP | software synchronization | square root P | caches | NUMA
TTA/PISMA | software synchronization | square root P | caches | NUMA
TRIPS | software synchronization | square root P | caches | NUMA
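
The gap between the two bisection-bandwidth entries in the table grows with the processor count P. The short calculation below takes the P/2 and square-root-of-P figures at face value and counts bandwidth in units of one link; it is an illustrative sketch, and the function names are hypothetical.

```python
import math


def bisection_half(p: int) -> int:
    """Bisection width of P/2 links, as listed for Eclipse."""
    return p // 2


def bisection_sqrt(p: int) -> int:
    """Bisection width of sqrt(P) links, as listed for CMP, TTA/PISMA and TRIPS.

    Restricted to perfect squares so the result is an exact link count.
    """
    side = math.isqrt(p)
    if side * side != p:
        raise ValueError("P must be a perfect square")
    return side


if __name__ == "__main__":
    for p in (16, 64, 256, 1024):
        half, root = bisection_half(p), bisection_sqrt(p)
        print(f"P={p}: P/2={half}, sqrt(P)={root}, ratio={half / root}")
```

The ratio between the two is sqrt(P)/2, so at P = 64 the P/2 network offers four times the bisection bandwidth, and the advantage keeps widening as P grows. This matters for the random communication patterns that a PRAM-like model generates.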
10
6. Validation and assessment of the performance
scalability of the final choice of HW/SW
architecture

Analytically through parametric
power/performance/cost models
Empirically through simulations
Benchmark kernels and sample applications
Scalable benchmark suite for fine-grained shared
memory architecture
Standard benchmark suites
Sample applications
Parametric architecture simulations
By comparing to future alternative approaches
(e.g. advanced CMPs) and theoretical machines
(e.g. ideal PRAM) using the applications and
benchmarks

11
7. Plan for identifying the requirements for the
OS within the resources of the work plan

The goal is to identify requirements and implement core OS services to demonstrate the validity of the architectural approach, but not to develop a full-fledged OS (as stated in B.4.1.5)
Requirements from underlying architecture and
applications
Resource management (process, thread and memory)
Runtime functions and services for applications
Input for identifying requirements will come from
several other tasks including T1.2, T1.3, T2.2
and T3.3
OS is not in charge of supporting distributed
shared memory
Certain OS functionality will be covered by the compiler's run-time system
Task leader of OS task (T4.3, ULM) has developed
a distributed operating system (Plurix) which
provides an excellent basis

12
7. Plan for identifying the requirements for the
OS within the resources of the work plan (cont)

Preliminary anticipated OS requirements
Dynamic process/thread scheduling
Memory management (physical and virtual)
Synchronization including inter-process
communication
Support for power management and IO
Definition
A coarse-grain functional model of OS will be
developed and validated through simulation
Definition of API in SUPERCHIP language (or
pseudo-language in the early phase)
Implementation
Using the SUPERCHIP language and compiler (from
T2.2 and T3.3)
Testing with architecture simulation tools (from
T5.2)