Title: Kari Tiensyrj
Kari Tiensyrjä, Senior Research Scientist, VTT
- FP6-2004-IST-4 FET Proactive Initiative ACA
- SUPERcomputing on a CHIP (SUPERCHIP)
- Proposal Number 26888
Jesper Larsson Träff, Senior Principal Researcher, NEC Europe
Ian Phillips, Prof., Principal Staff Engineer, ARM
Ben Juurlink, Professor, Delft University of Technology
1. Paths to exploitation
- FET project with potential for application breakthroughs in a 10-year horizon
- Industrial Partners (NEC, ARM, Intel) cover a wide spectrum of application domains and provide
  - Steering of scientific and technological research
  - Transfer of knowledge and results to and interplay with company design groups
  - Proposition to standardization bodies, where relevant (B.3.6)
- Active promotion of results (T6.1 and T6.2)
  - High-profile scientific and applied conferences and journals
  - Organization of workshops
  - PhD courses and summer schools, incorporation into advanced curricula
  - Links to NoEs
- WP6 (led by Intel): dissemination and exploitation (also B.3.3, B.4.1.7, and B.8.2.6)
  - T6.3 for technology transfer
  - T6.4 for exploitation
2. Target applications
- Wide range of applications with high computational requirements will be considered
- WP4 will analyse and identify applications, and selected sample applications will be implemented as proof-of-concept
- An initial set of applications considered
  - Mobile devices (energy-efficiency)
    - PDA, HDTV
    - Games, virtual reality
  - Desktops and servers (versatility from high-performance/single-application to high-throughput application suites)
    - Streaming and DSP applications, e.g. video in bandwidth constrained active networks and embedded 3D graphics
    - Real-time speech recognition and videoconferencing
    - Database applications, string processing, geographical information processing
  - Supercomputer (high-performance)
    - Vectorised CFD Boltzmann automata
    - MPI-parallelised finite element methods
    - Quantum Chromodynamics
3. Leading contenders within the proposal
- Objectives: to boost performance by 2-3 orders of magnitude (compared to same transistor count), exploit parallelism at all levels, realise easy-to-use strong model of computing, provide scalability/wide application area/power saving techniques
- Eclipse
  - Scalable NOC with EREW PRAM model
  - Simultaneous ILP-TLP exploitation
  - Cacheless memory
  - Regular structure
- XMT
  - CMP with PRAM-like but more asynchronous model
  - SMT synchronization mechanism
  - On-chip caches
- CMP
  - Shared memory using caches
  - Advanced cache coherency protocols
- TTA/PISMA
  - Tiled architecture with virtual shared memory communication
  - Very simple and strongly decentralized organization
- TRIPS
  - Single chip reconfigurable processor/memory architecture
  - Grids of ALUs connected via operand networks
  - Static spatial scheduling
3. Leading contenders within the proposal (cont)
- Initial choice of architectures is partially guided by application requirements
  - Eclipse and XMT: general purpose computing, embedded computing
  - Advanced CMP: high-throughput desktop and server machines
  - TTA/PISMA: streaming/DSP
  - TRIPS: HPC, streaming/DSP, threaded servers
- Procedure to choose the initial SUPERCHIP architecture
  - 1. Develop an architecture evaluation framework (T1.1)
  - 2. Develop semi-analytical power/performance/cost models (T5.1)
  - 3. Develop/modify existing simulators for the architectures (T5.2)
  - 4. Design benchmark programs for the architectures (T4.1)
  - 5. Perform evaluation, identify strong/weak points, select (T1.1)
- Preliminary criteria
  - Power, performance, cost (silicon area)
  - Estimated scalability, PRAM-like model support, ease of programming
  - Estimated coverage for aimed application area, TLP-ILP co-exploitation
  - Potential for solving the rest of the problems
4. Ensuring HW implementation technologies impact on choice of scalable architecture
- Scalability issues are observed in initial selection of candidate architectures
  - Mesh-like topologies (providing constant wire length links): Eclipse, CMP, TTA, TRIPS
  - Regular structures: Eclipse, CMP, TTA, TRIPS
  - No forwarding networks (Eclipse) or multistage forwarding networks (TRIPS)
  - No cache coherency mechanisms: Eclipse
  - Multithreading: Eclipse, XMT
  - Decentralized structure: Eclipse, CMP, TTA, TRIPS
- Semi-analytical modeling of the architectures and candidate techniques (T5.1)
  - Analytical parametric power/performance/cost estimation models
  - Hardware implementation parameters are extracted from
    - Technology roadmaps, e.g. ITRS
    - Pragmatic experience and knowledge of industrial partners
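As an illustration of what a parametric power/performance/cost estimate could look like, here is a minimal sketch. It is not the project's T5.1 model: the class `ChipParams`, its fields and every numeric value are hypothetical placeholders. The only formula taken as given is the standard CMOS dynamic power relation (activity × capacitance × Vdd² × frequency); silicon area stands in for cost, as in the preliminary criteria.

```python
from dataclasses import dataclass


@dataclass
class ChipParams:
    """Hypothetical inputs; none of these values come from the proposal."""
    cores: int                 # number of processing elements
    freq_hz: float             # clock frequency in hertz
    ipc: float                 # sustained instructions per cycle per core
    activity: float            # switching activity factor (0..1)
    cap_per_core_f: float      # switched capacitance per core, in farads
    vdd: float                 # supply voltage, in volts
    area_per_core_mm2: float   # silicon area per core, in mm^2


def performance_ips(p: ChipParams) -> float:
    # Throughput estimate: cores * IPC * frequency (instructions per second).
    return p.cores * p.ipc * p.freq_hz


def dynamic_power_w(p: ChipParams) -> float:
    # CMOS dynamic power: activity * C * Vdd^2 * f, summed over all cores.
    return p.cores * p.activity * p.cap_per_core_f * p.vdd ** 2 * p.freq_hz


def area_mm2(p: ChipParams) -> float:
    # Silicon area used as the cost proxy.
    return p.cores * p.area_per_core_mm2


# Placeholder configuration, chosen only to exercise the functions.
example = ChipParams(cores=64, freq_hz=1e9, ipc=1.0, activity=0.1,
                     cap_per_core_f=1e-9, vdd=1.0, area_per_core_mm2=2.0)
```

In practice the parameters would be filled from technology roadmaps and partner experience, and the three outputs compared across candidate architectures at equal area or equal power.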
4. Ensuring HW implementation technology impact on our choice of scalable architecture (cont)
- Architectural simulation (T5.2)
  - Develop/modify existing simulators
  - Benchmarks
  - Sample applications
  - Information on execution time, resource utilization and power consumption is extracted
- Modeling of the critical parts of architectures
  - Feasibility analysis of candidate architectures
  - Studies on fault tolerance, clocking schemes, on-chip/off-chip communication, power saving and other implementation related issues for the SUPERCHIP architecture (T5.3)
  - Detailed modeling and feasibility assessment of critical parts of the SUPERCHIP architecture (T5.4)
5. Evolvement of the PRAM model for the candidate architectures
- For ease-of-programming the SUPERCHIP programming model will be based on a PRAM-like model, considering
  - Relaxed synchronization (BSP-like)
  - Strong memory semantics (CRCW-like, built-in operators)
  - Potential for locality exploitation (memory, Hierarchical-PRAM)
- SUPERCHIP will develop the necessary architectural support for this model
- Architectural requirements
  - Synchronization: implicit after each instruction
  - Bandwidth: high bisection to handle random communication
  - Latency: communication/memory access latency should be hidden
- SUPERCHIP will not investigate PRAM-implementation on distributed memory architectures in general
- Long-term research issue: evolution of programming model and architecture to SUPERCHIP constellations
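To make "synchronization implicit after each instruction" concrete, the following minimal sketch simulates a PRAM-style lockstep computation in ordinary sequential Python. It is an illustration only, not SUPERCHIP code: the function name `pram_prefix_sum` is hypothetical, and the algorithm is the well-known Hillis-Steele parallel scan, chosen here as an example. In every step all virtual processors read the old shared state before any of them writes, which is the guarantee the programmer gets for free in a synchronous PRAM model.

```python
def pram_prefix_sum(values):
    """Inclusive prefix sum computed in lockstep steps (Hillis-Steele scan).

    Returns the result list and the number of synchronous steps taken.
    """
    shared = list(values)
    n = len(shared)
    steps = 0
    d = 1
    while d < n:
        # Read phase: virtual processor i reads cell i - d from the OLD state.
        reads = [shared[i - d] if i >= d else 0 for i in range(n)]
        # Implicit synchronization: writes start only after all reads are done.
        shared = [shared[i] + reads[i] for i in range(n)]
        d *= 2
        steps += 1
    return shared, steps
```

For eight inputs the loop runs three times (d = 1, 2, 4), so the number of steps grows logarithmically with the input size, provided one processor is available per element.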
5. Evolvement of the PRAM model for the candidate architectures (cont)
| Candidate | Synchronization | Bisection bandwidth | Latency hiding | Initial model |
| Eclipse | synchronization wave, fast barrier mechanism | P/2 | super-pipelined multithreading | EREW PRAM |
| XMT | hardware synchronization | ? | caches | PRAM-like |
| CMP | software synchronization | square root P | caches | NUMA |
| TTA/PISMA | software synchronization | square root P | caches | NUMA |
| TRIPS | software synchronization | square root P | caches | NUMA |
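The bisection bandwidth column matters because of the "random communication" requirement. A rough back-of-the-envelope sketch, under assumptions that are not in the slides: every one of the P processors sends one message to a uniformly random destination, about half of those messages must cross the bisection, and each bisection link carries one message per cycle. The function names `crossing_time` and `compare` are hypothetical.

```python
import math


def crossing_time(p, bisection_links):
    """Lower bound, in cycles, to move the crossing messages over the bisection."""
    # Assumption: about half of the P random messages cross the bisection.
    crossing = p / 2
    return crossing / bisection_links


def compare(p):
    # Contrast the two bisection widths listed in the table for P processors.
    return {
        "P/2 links": crossing_time(p, p / 2),
        "sqrt(P) links": crossing_time(p, math.sqrt(p)),
    }
```

Under these assumptions a bisection of P/2 links keeps the bound constant as P grows, while a bisection of square root P links makes it grow as sqrt(P)/2; for P = 1024 the two bounds are 1 cycle and 16 cycles respectively.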
6. Validation and assessment of the performance scalability of the final choice of HW/SW architecture
- Analytically through parametric power/performance/cost models
- Empirically through simulations
  - Benchmark kernels and sample applications
    - Scalable benchmark suite for fine-grained shared memory architecture
    - Standard benchmark suites
    - Sample applications
  - Parametric architecture simulations
- By comparing to future alternative approaches (e.g. advanced CMPs) and theoretical machines (e.g. ideal PRAM) using the applications and benchmarks
7. Plan for identifying the requirements for the OS within the resources of the work plan
- Goal is to identify requirements and implement core OS services to demonstrate validity of the architectural approach, but not to develop a full-fledged OS (as stated in B.4.1.5)
- Requirements from underlying architecture and applications
  - Resource management (process, thread and memory)
  - Runtime functions and services for applications
- Input for identifying requirements will come from several other tasks including T1.2, T1.3, T2.2 and T3.3
- OS is not in charge of supporting distributed shared memory
- Certain OS functionality will be covered by the compiler's run-time system
- Task leader of OS task (T4.3, ULM) has developed a distributed operating system (Plurix) which provides an excellent basis
7. Plan for identifying the requirements for the OS within the resources of the work plan (cont)
- Preliminary anticipated OS requirements
  - Dynamic process/thread scheduling
  - Memory management (physical and virtual)
  - Synchronization including inter-process communication
  - Support for power management and IO
- Definition
  - A coarse-grain functional model of OS will be developed and validated through simulation
  - Definition of API in SUPERCHIP language (or pseudo-language in the early phase)
- Implementation
  - Using the SUPERCHIP language and compiler (from T2.2 and T3.3)
  - Testing with architecture simulation tools (from T5.2)
Feasible with the allocated resources and partners