CRAY T90 vs. Tera MTA: The Old Champ Faces a New Challenger

1
CRAY T90 vs. Tera MTA: The Old Champ Faces a New Challenger
  • Allan Snavely
  • San Diego Supercomputer Center
  • June 19, 1998

2
Background
  • CRAY vector computers have been the workhorses of
    scientific computing for over 2 decades.
  • CRAY PVPs have been effort/performance leaders
    due to vector processors, flat shared memory, and
    great tools.
  • Vector machines are still very popular in terms
    of number of users and available scientific
    applications software.
  • NPACI currently offers T916/14, J98/5, J916/16.
  • There is lots of legacy vector code, much of
    which will never see an MPI_Send call.
  • T90s are the last in the line of CRAY PVP
    computers.

3
More Background
  • Tera has developed a revolutionary new
    architecture, the MTA, for parallel computing
    with a programming model as simple as the PVP
    model.
  • MTA can exploit more levels of parallelism than
    T90.
  • First Tera machine (MTA, for MultiThreaded
    Architecture) was delivered to SDSC in November
    1997 with a single 145 MHz processor (< 1/2 final
    speed).
  • Tera delivered a two processor system to SDSC in
    early 1998 with two 255 MHz (still not final)
    processors and a network board (not final,
    either), but no UNIX.

4
Caveats, Disclaimers, and Excuses
  • MTA software is still being debugged.
  • Processors are not running at full speed
      • theoretical peak is 765 Mflops/CPU (255 MHz), but
        will rise to 0.9-1.0 Gflops
  • Interconnect is not up to specification
      • memory-intensive codes cannot speed up by more
        than a factor of 1.75 until new network boards
        are installed
  • All of the above are improving daily and are
    production issues, not research issues.
  • We have had 2 processors running and a stable OS
    (but not UNIX yet) for only a few weeks. Time is
    shared w/Tera.
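
The clock and speedup figures on this slide can be cross-checked with a little arithmetic. The sketch below is illustrative only: the rate of 3 flops/cycle is taken from the hardware comparison slide, the 2-processor count from the background slide, and all function and variable names are made up for this example.

```python
# Figures quoted on the slides.
CLOCK_NOW_MHZ = 255      # current MTA clock
FLOPS_PER_CYCLE = 3      # MTA peak rate, from the hardware comparison slide
SPEEDUP_CAP = 1.75       # limit for memory-intensive codes on the interim network
PROCESSORS = 2           # size of the system delivered in early 1998


def peak_mflops(clock_mhz, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in Mflops: clock in MHz times flops per cycle."""
    return clock_mhz * flops_per_cycle


def clock_for_peak(target_mflops, flops_per_cycle=FLOPS_PER_CYCLE):
    """Clock in MHz needed to reach a target theoretical peak."""
    return target_mflops / flops_per_cycle


current_peak = peak_mflops(CLOCK_NOW_MHZ)      # 765 Mflops, as on the slide
clock_low = clock_for_peak(900)                # 300 MHz gives 0.9 Gflops
clock_high = clock_for_peak(1000)              # about 333 MHz gives 1.0 Gflops
efficiency_cap = SPEEDUP_CAP / PROCESSORS      # best-case parallel efficiency

print(f"peak at {CLOCK_NOW_MHZ} MHz: {current_peak} Mflops")
print(f"clock for 0.9-1.0 Gflops: {clock_low:.0f}-{clock_high:.0f} MHz")
print(f"2-processor efficiency cap: {efficiency_cap:.1%}")
```

Under these figures, the quoted 765 Mflops is exactly 255 MHz times 3 flops/cycle, and a speedup cap of 1.75 on 2 processors corresponds to at most 87.5% parallel efficiency for memory-intensive codes.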

5

6
T90/MTA Hardware Comparison
  • CRAY T90
      • 440 MHz frequency
      • 8 128-element vector registers/CPU
      • Dual vector pipes into FUs
      • Pipelines ADD and MULT units
      • Can execute 4 flops/cycle (commonly 2)
      • Flat shared memory: DRAM, high bandwidth, low
        latency
      • Can issue 2 loads, 1 store / cycle
      • Peak 1.76 Gflops / CPU
      • Practical peak of 1 Gflops
      • Currently observe 400-800 Mflops in 'good' user
        codes
  • Tera MTA-1
      • 300 MHz clock (255 MHz now)
      • 128 Streams (HW for threads)/CPU
      • Effective depth of pipeline is 21
      • Additional FMA unit
      • Can execute 3 flops/cycle (commonly 2)
      • Flat shared memory: SRAM, moderate latency,
        moderate bandwidth
      • Can issue 1 memory ref / cycle
      • Peak 0.9 Gflops / CPU
      • Practical peak of 600 Mflops
      • Tera expects sustained 30-60% of peak in 'good'
        user codes
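
The peak rates above follow from clock rate times flops per cycle, and the sustained figures can be put on a common footing the same way. The following is a rough sketch using only numbers from this slide; the `Machine` class and all names are hypothetical. The final part rests on an assumption that the slide does not state: that each stream can have only one instruction in the 21-deep pipeline at a time, so roughly 21 ready streams would be needed to issue on every cycle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    name: str
    clock_mhz: int               # design clock from the slide
    flops_per_cycle: int         # peak flops per cycle
    practical_peak_mflops: int   # "practical peak" quoted on the slide

    @property
    def peak_mflops(self) -> int:
        # Peak rate in Mflops = clock (MHz) x flops per cycle.
        return self.clock_mhz * self.flops_per_cycle


T90 = Machine("CRAY T90", 440, 4, 1000)
MTA = Machine("Tera MTA-1", 300, 3, 600)

# T90: observed 400-800 Mflops in 'good' codes, as a fraction of peak.
t90_observed_fraction = tuple(m / T90.peak_mflops for m in (400, 800))

# MTA: Tera's expected 30-60% of peak, converted to Mflops.
mta_expected_mflops = tuple(MTA.peak_mflops * pct // 100 for pct in (30, 60))

# ASSUMPTION (not on the slide): one in-flight instruction per stream,
# so the pipeline depth sets the number of ready streams needed to keep
# the processor issuing every cycle.
PIPELINE_DEPTH = 21
STREAMS_PER_CPU = 128
min_streams_to_saturate = PIPELINE_DEPTH
spare_streams = STREAMS_PER_CPU - min_streams_to_saturate

for m in (T90, MTA):
    share = m.practical_peak_mflops / m.peak_mflops
    print(f"{m.name}: peak {m.peak_mflops} Mflops, practical peak {share:.0%} of peak")
print(f"T90 observed: {t90_observed_fraction[0]:.0%}-{t90_observed_fraction[1]:.0%} of peak")
print(f"MTA expected: {mta_expected_mflops[0]}-{mta_expected_mflops[1]} Mflops")
print(f"streams needed (assumed): {min_streams_to_saturate} of {STREAMS_PER_CPU}")
```

On these numbers, Tera's expected 30-60% of peak works out to 270-540 Mflops per processor at the final 300 MHz clock, against the 400-800 Mflops currently observed on the T90.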

7
NAS 2.3-Serial Benchmarks
  • NAS Parallel Benchmarks version 2.3
      • Level 2 benchmarks are not pencil-and-paper; they
        must be executed as is or with minimal tuning
      • Written using MPI for distributed memory,
        RISC-based machines
  • NAS 2.3-Serial
      • Reverse-engineered from NPB 2.3: MPI versions
        were serialized
      • Not necessarily optimal for vector or
        multithreaded platforms as is

8
(No Transcript)
9
NAS 2.3-Serial Benchmarks Results
10
Applications Performance Disclaimer
  • MTA wasn't available long enough to port and tune
    many applications
  • 2 processors weren't available long enough to
    obtain many multiprocessor results
  • Most tuning effort performed by Tera staff
  • Applications selected were not chosen for
    superior T90 performance
      • LCPFCT performs very well on T90
      • AMBER performs fairly well on T90
      • LS-DYNA3D performs less well on T90 for many
        interesting cases

11
LCPFCT Performance Comparison
12
AMBER Performance Comparison
13
LS-DYNA3D Comparison
14
Conclusions
  • T90 multitasking doesn't allow the user fine
    control over load balancing.
  • Porting T90 codes to the MTA is easy.
  • Tuning on both platforms is facilitated by
    excellent compilers and simple programming
    models.
  • MTA can exploit the same parallelism in a problem
    that the T90 can, and can also exploit levels that
    the T90 doesn't.
  • MTA is likely to give good performance
    scalability on most T90 codes.
  • The T90 is still the world's fastest vector
    machine, but the MTA may outperform it across a
    wider spectrum of problems: those that use vectors
    but also have more potential outer-loop and
    higher-level parallelism.
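
To make the distinction between inner-loop (vector) and outer-loop parallelism concrete, here is a minimal sketch of a loop nest with both levels. It is not taken from any of the benchmarked codes: the functions are hypothetical, and Python threads merely stand in for hardware streams to show the structure, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def scale_row(row, factor):
    # Inner loop: independent elementwise work on one row, the level of
    # parallelism that a vector unit exploits.
    return [factor * x for x in row]


def scale_grid_serial(grid, factor):
    # Outer loop run one row at a time; only the inner level is exposed.
    return [scale_row(row, factor) for row in grid]


def scale_grid_outer_parallel(grid, factor, workers=4):
    # Outer loop: rows are independent of each other, so each can be
    # handed to its own thread. Rows may differ in length, which makes
    # this level awkward to express as a single vector operation.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: scale_row(row, factor), grid))


grid = [[1, 2], [3, 4, 5], [6]]
print(scale_grid_serial(grid, 2))
print(scale_grid_outer_parallel(grid, 2))
```

Both calls produce the same result; the difference lies only in which loop level is treated as the unit of parallel work.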

15
Future MTA Hardware Plans
  • 4-processor network to be delivered soon (July?)
  • 2 more processors delivered shortly thereafter
    (August?). With each processor come one or two
    1 GB memory modules (not associated directly with
    the processor; that is just how the network is built)
  • UNIX will be completed by end of summer
    (Aug-Sept?)
  • Pending results of evaluations, increase size to
    8 (end of year?), then 16 (next year)
  • Fortran 90, OpenMP, other tools on the way...

16
Future Work
  • SC98
      • updated NAS benchmarks (final processors,
        network)
      • multiprocessor benchmarks
      • applications as well as kernels
  • Applications Porting and Tuning
      • More work on AMBER, LS-DYNA3D
      • Port GAMESS, MPIRE, OVERFLOW
      • Port other vendor and research codes
  • Suggestions? (allans_at_sdsc.edu)