SCI SOCKET: The fastest socket on earth? - Atle Vesterkjær - PowerPoint PPT transcript
1
SCI SOCKET: The fastest socket on earth?
Atle Vesterkjær, atleve@dolphinics.com, http://www.dolphinics.com
  • Olaf Helsets vei 6, NO-0619 Oslo, Norway. Phone:
    +47 23 16 70 00, Fax: +47 23 16 71 80

2
SCI Standard - Dolphin SCI Technology
  • WHAT IS SCI?
  • The SCI standard ANSI/IEEE 1596-1992 defines a
    point-to-point interface and a set of packet
    protocols.
  • The protocols use small packet sizes with a
    16-byte header and data sizes of 16, 64, and 128
    bytes.
  • Packets are protected by a 16-bit CRC code.
  • An SCI interface has two unidirectional links
    that operate concurrently.
  • Ring configuration or direct connection over
    separate in and out cables.
  • The SCI protocols support shared memory and bus
    bridging by encapsulating bus requests and
    responses into SCI request and response packets.
  • Bus imitating protocol with packet-based
    handshake protocols and guaranteed data delivery.
  • Brings the bus out on cables.
  • For shared memory applications, a cache coherency
    protocol is defined
  • Dolphin has an implementation of the cache
    coherency protocol that has been used by customers
  • PCI does not feature cache coherency
  • Processor-Bus-Memory technology, as opposed to
    Networking technology

3
SCI SOCKET - Outline
  • The fastest socket on earth and the impact on
    storage and applications
  • SCI technology
  • SCI SOCKET for storage and applications.
  • SCI SOCKET benchmarks

4
Highlights of the Dolphin SCI Technology
  • Ultra Low Latency
  • CPU has direct access to remote memory
  • No protocol overhead
  • 1.4 µs for a 4-byte write
  • < 3 µs for a 512-byte write
  • 0.2 µs per pipelined write
  • Fast failover for HA systems
  • Highly efficient bus bridging
  • Bus Requests and Responses (CPU load/store
    operations) are translated directly in Hardware
    to Request and Response Packets
  • Point-to-point links give bus performance and
    latency over distance
  • High data throughput: 346 MByte/s
  • 0.2 µs per pipelined write

5
Highlights of the Dolphin SCI Technology
  • Wide Application Area - Common Mode
    Multiprocessing
  • Storage, Clustering, Multiprocessing, Embedded
    Systems, Telecommunication, Defense, Medical
    Imaging
  • Choice of Topologies, Ring, Torus, Switched
  • Shipping in Critical Applications for more than
    10 years
  • Based on ANSI/IEEE 1596-1992 Scalable Coherent
    Interface (SCI) Standard

6
Linköping University - NSC - SCI Clusters
Also in Sweden, Umeå University 120 Athlon nodes
  • Monolith 200 node, 2xXeon, 2.2 GHz, 3D SCI
  • INGVAR 32 node, AMD 900 MHz, 2D SCI
  • Otto 48 node, P4 2.26 GHz, 2D SCI
  • Maxwell 40 node 2xXeon, 2D SCI
  • Bris 162, 2x Xeon
  • Total 336 SCI Nodes

7
University of California Santa Cruz
  • 132 Nodes
  • 264 Processors
  • WulfKit
  • Linux
  • RackSaver Integration and Installation
  • AMD Athlon 1800 processors
  • RackSaver 1U Chassis
  • No. 99 on HPC500 February 2002

8
The online signal processor at Manchester
University's Jodrell Bank Observatory is a WulfKit
cluster
9
Applications, Database Clustering
Ultra Enterprise Cluster
  • Sun's high-end servers are clustered with Dolphin
    Cards
  • Money Transaction and Data Base Applications
  • High Availability and Performance
  • Dolphin Ships Cards and Switches
  • 7th year of shipments
  • Oracle 9i Performance and Scalability
  • SCI runs natively on Sun's RSM (Remote Shared
    Memory) API

10
MySQL Cluster
  • Cost-effective 5-nines availability using a parallel
    server architecture with no single point of
    failure.
  • Performance and high throughput required to meet
    the most demanding enterprise applications.
  • Incremental scaling of applications in a linear
    fashion as needs grow, without having to invest in
    expensive hardware.
  • MySQL Cluster has a flexible distributed
    architecture which gives you complete control
    over the level of performance, reliability and
    scalability you need to match your application
    requirements.
  • Dolphin Ships Cards and Switches

http://www.mysql.com/products/cluster/
11
XPrime
  • X1 Database Performance Cluster for Microsoft
    SQL Server
  • Ever wonder what 512 Processors and 1 Terabyte
    of RAM can do for your database application?
  • Uses the Dolphin D336 card

  • Works with all existing SQL Server applications
  • Incremental capacity and performance - up to 512
    Processors and 1 Terabyte of RAM
  • Integrated SCI clustering hardware
  • Small rack footprint
  • Plug and Play installation
  • Claims 10x performance at 1/10 the cost

http://www.xprime.com/
12
SMPs
  • Convex Exemplar Supercomputer
  • Supercomputer SMP, Dolphin chips
  • Shipped in 1994
  • Data General uses Dolphin Chips as main
    interconnect for ccNUMA Systems
  • Shared memory Intel x86 Multiprocessors
  • 32 Processors SCI Interconnect
  • Dolphin Ships Chips and Switches
  • 3 generations shipped

AV20000 NUMA Server
13
Mirage 2000 Upgrade, First Test Flight January
2001
Thales uses Dolphin's technology as the main
interconnect in the on-board multiprocessor.
Offered with systems like Mirage 2000-9, Mirage
2000-5, Rafale, and more
14
Space Mission Application
Dolphin's technology is chosen for evaluation
http://sim.jpl.nasa.gov/
Dolphins in Space!
15
SCI Removes I/O Bottleneck
  • Siemens uses Dolphin's SCI in the RM600 E UNIX
    Server's I/O subsystem
  • SCI Allows Expansion to 150 PCI Slots and 5.8 TB
    of Storage.
  • High Performance SCI Links Enable Remote
    Attachment
  • 3rd generation shipping

[Block diagram: the RM600 system bus connects via a
bridge chip and B-Link to Dolphin chips (PSB and LC);
SCI links from the RM600 E server's LCs attach remote
PCI subsystems, each with its own LC and PSB.]
16
High Performance File Server
  • Auspex uses Dolphin PCI SCI Cards for the main rack
    to rack interconnect in its 4Front NS2000
  • Connects up to 4 storage systems to the control
    system
  • Gives Performance, Expandability and Reliability
  • Dolphin Ships Cards

17
LinuxLabs
  • Linux expert company located in Atlanta GA
  • Delivers clusters and clustering SW
  • Is involved in several open source projects
  • Showed extensive SW support for Dolphin SCI at
    SC2004
  • Is upgrading the Ames Lab Linux/SCI cluster with
    their SW suite now

Global shared memory can be accessed using
either Linux Labs' modified MPICH or Linux Labs'
Cray-compatible shmem "get" and "put" call
library for low latency communication over
Dolphin SCI. Linux Labs' cluster software is
published under the GNU General Public License.
Linux Labs can provide you with a custom
shared-memory architecture for your next
supercomputing initiative at a fraction of the
cost of its functional equivalent.
http://www.linuxlabs.com/
18
LinuxLabs Nimbus 2.0
  • Nimbus 2.0 is based on LANL's Clustermatic and
    RedHat 9.0 with significant enhancements and
    additional software produced by LinuxLabs.
    Improvements from the 1.0 version include
  • Full installable system on a boot CD using
    anaconda.
  • Integration of the Maui scheduler with the
    Clustermatic bproc system.
  • Specialized version of MPICH for use with the
    Dolphin Interconnect.
  • Clusgres system allowing the PostgreSQL database
    to run on the cluster using Dolphin hardware.
  • System checkpointing capability.
  • Integration with the Ganglia system monitor.
  • Enhancements to Supermon so that filesystem data
    is made available to Maui and Ganglia
  • Enhanced network console capability to replace
    serial console concentrators in most
    applications. This has the dual benefit of
    simplifying the cluster and reducing cost.

http://www.linuxlabs.com/
19
LinuxLabs Clusgres
  • Clusgres is a clustered version of the open
    source PostgreSQL database server, specifically
    designed for the Dolphin interconnect.
  • Feature set
  • Database reads perform faster. The average query
    service time decreases linearly with the increase
    in cluster nodes.
  • Pure database writes are as fast as on a single
    system, but they are faster when mixed with
    reads. Work is underway to bring pure database
    write performance to linear scaling as well.
  • The database is not modified and the support is
    transparent to the application.
  • It scales very well to as many nodes as supported
    by the underlying interconnect system.
  • It works with any POSIX compliant Unix-like
    operating system.

http://www.linuxlabs.com/
20
MandrakeSoft
  • Mandrake offers MandrakeClustering Linux
    distribution
  • The distribution includes all Dolphin SCI SW
  • Other packages included
  • Ganglia, Lam, Mpich, Pvm
  • Pconsole, Authd, Pcp, Pmake
  • Maui, PXE, PXELinux, SmartMonTools
  • Gexec, Ka-Run, ClusterScripts and more
  • MandrakeClustering is a MandrakeSoft clustering
    solution based on the research project CLIC. It
    is designed to offer research laboratories an
    affordable, fast-to-deploy, easy-to-use
    heavy-calculation system based on clustering.

21
ExaFusion a virtual SAN
  • ExaFusion virtualizes disks or partitions on the
    cluster's nodes and emulates a high performance
    SAN

[Diagram: several clusters contribute virtual arrays
to a shared virtual SAN.]
22
ExaFusion Features
23
ExaFusion Software architecture
[Software architecture diagram: simple applications
perform I/O on files (lseek, write, mkdir) through a
local FS, while clustered applications use a clustered
file system (GFS, GPFS). Both issue I/O on blocks
(read, write, ioctl) to the ExaFusion Clustered Volume
Manager, which presents private and shared volumes on
top of the ExaFusion Networked Block Device. The block
device pools unembedded disks and disks embedded in
nodes across the Dolphin SCI Interconnection Network.]
24
ExaFusion benchmarks

25
SCI Adapter Cards - 64 bit 66 MHz
  • PCI-, PMC(VME)- and CompactPCI- SCI Adapter
    Card
  • Industry-best latency
  • 1.4 microseconds for a 4-byte write
  • < 3 microseconds for a 512-byte write
  • 0.2 microseconds per pipelined write
  • High data throughput 346 MBytes/s
  • Supports both
  • Direct Memory Access (DMA)
  • Remote Memory Access (RMA)
  • Remote Interrupt
  • Hot-pluggable cabling
  • Redundant SCI adapters can be used for
    Fault-tolerance

[Card block diagram: PCI bus, PSB bridge chip, and LC
link controller driving the SCI link. Usage modes:
Cluster Adapter, PCI to PCI Bridge, PCI Extension,
Reflected Memory.]
26
Dolphin Products Switches, Chips and Cards
27
Torus Topology
1D Topology (Ring): up to 10 nodes. 2D Torus
Topology: up to 100 nodes. 3D Torus Topology: to
1000s of nodes.
[Diagram: each node couples a PCI bus through a PSB
bridge to one LC link controller per SCI dimension;
the LCs form SCI rings along each dimension.]
28
WulfKit
  • Kit with all parts necessary to upgrade a cluster
    node
  • Dolphin 1D, 2D or 3D HW
  • Topology is Ring, 2D Torus or 3D Torus
  • SW Options
  • Scali MPI Connect
  • Scali Manage
  • Dolphin Open Source (GPL) SW
  • SCI-MPICH
  • SCI-Sockets
  • SISCI/IRM
  • Applications
  • HPC, ISP, ASP, Database Servers

29
Scali MPI Connect v 4.2 for WulfKit
Scali MPI Connect highlights:
  • High bandwidth, low latency performance
  • Interconnect independent architecture
  • Dynamic interconnect selection at runtime
  • Interconnect failover functionality
  • Support for heterogeneous configurations
  • Runtime selection of collective algorithms
  • Multi thread-safe implementation
  • Supports 64 bit on RedHat and SuSE for Intel and
    AMD

Scali Manage features:
  • System Installation and Configuration
  • System Administration
  • System Monitoring
  • Alarms and Event Automation
  • Work Load Management
  • Hardware Management
  • Heterogeneous Cluster Support
30
Dolphin SW
  • All Dolphin SW is free open source (GPL or LGPL)
  • SISCI shared memory interface
  • SCI-Sockets
  • Low Latency Socket Library
  • TCP and UDP Replacement
  • User and Kernel level support
  • Release 2.3 available
  • SCI-MPICH (RWTH Aachen)
  • MPICH 1.2 and some MPICH 2 features. MPICH 2 in
    development.
  • New release is being prepared, beta available
  • SCI Interconnect Manager
  • Automatic failover recovery.
  • No single point of failure in 2D and 3D
    networks.
  • Other
  • SCI Reflective Memory, Scali MPI, Linux Labs' SCI
    Cluster Cray-compatible shmem and Clusgres
    PostgreSQL, MandrakeSoft Clustering HPC solution,
    Xprime's X1 Database Performance Cluster for
    Microsoft SQL Server, ClusterFrame from Qlusters,
    SunCluster 3.1 (Oracle 9i), and MySQL Cluster

31
Latency vs SW
SW latency (1/2 ping-pong roundtrip):
  • SISCI (Direct HW): 1.4 µs
  • SCI-Sockets: 2.3 µs
  • Scali MPI Connect: 3.5 µs
  • SCI-MPICH: 3.8 µs
32
SCI SOCKET
33
Motivation
  • Link level speeds of interconnects are increasing
  • The communication bottleneck has moved to protocol
    software
  • High speed networks provide their own efficient
    interfaces
  • On the other hand
  • A large number of applications are built around
    legacy protocols such as the TCP/IP suite
  • De-facto standard Berkeley Sockets API
  • Porting to hardware specific APIs is unprofitable
    in many cases
  • SCI SOCKET aims to bring these together

34
Berkeley Sockets over SCI
  • High Speed, Low Latency Replacement for Gigabit
    Ethernet for Critical Applications
  • Bypassing traditional network stacks like
    TCP/UDP/IP
  • Eliminating protocol overhead and reducing
    latency
  • Transparent to applications, no modifications or
    recompilation required
  • Ultra low latency
  • 2.27 us socket send/receive latency

35
Berkeley Sockets over SCI
  • Data transfer through remote shared memory
  • Offers a new socket transport family, AF_SCI
  • Flexible using configuration files
  • Specifying Cluster nodes
  • Specifying ports

36
SCI SOCKET status
  • Available for Linux 2.4 x86 / x86_64
  • Itanium port and test in progress
  • Port to Windows using Windows Socket Direct (WSD)
    technology started. Will be available Q3 2004
  • Port to Solaris is being reviewed.

37
LD_PRELOAD
  • Standard mechanism to preload C library functions
  • User defined library functions are called instead
    of the C library
  • AF_INET selects the traditional TCP/IP path
  • AF_SCI selects SCI SOCKET
  • int socket(int family, int type, int protocol)
  • if ((family == AF_INET) && (type == TCP || type == UDP))
  •   socket_lib(AF_SCI, type)
  • else
  •   socket_lib(family, type)

38
SCI SOCKET
  • Easy installation of the SCI socket library

39
Configuration File /etc/sci/scisock.conf
  • Selects which machines can be reached using
    SCI
  • Optionally, /etc/sci/scisock_opt.conf selects
    which ports can be reached using SCI

# This is a SCI socket config file
# Should be placed in /etc/sci
# hostname        SCI NodeId
nodeA             4
193.71.152.89     8
Mailhost          16
File-serv         20

# This is a SCI socket_opt config file
# Should be placed in the /etc/sci directory
# -key                  -Type      -value
EnablePortsByDefault    yes/no
EnablePort              tcp/udp    portnumber
DisablePort             tcp/udp    portnumber
EnablePortRange         tcp/udp    start_port end_port
DisablePortRange        tcp/udp    start_port end_port
40
Linux Kernel Socket Switch
[Diagram: in user space, user applications sit on the
socket library. In kernel space, the Linux Kernel
Socket Switch dispatches traffic from clients such as
a cluster file system and iSCSI either to SCI SOCKET
or to the native socket path (IP over the Ethernet
driver).]
41
Small Message Latency
42
TCP STREAM
43
TCP-RR SCI SOCKET vs Gigabit Ethernet
44
Scali MPI over SCI SOCKET
  • SCI SOCKET is 1.6 - 6.0 times faster than TCP/GigE

45
Why is SCI SOCKET so fast?
  • Small messages are sent using basic CPU
    instructions
  • Data is normally located in the CPU cache
  • Low cost: a posted write to a local memory address
  • A single store CPU instruction sends 8 bytes
  • Raw send latency for 8 bytes is approximately 210
    nanoseconds
  • No need to lock down or register memory
  • Large messages are sent using DMA
  • Streamlined, lock-free messaging protocol on top
    of shared memory
  • Combination of polling and interrupts
  • Receiving a message causes the received data to be
    cached
  • No additional memory access

46
Cluster File Systems
  • With SCI SOCKET, a typical cluster file system
    will run out of the box
  • PVFS
  • Open Source / GPL software
  • http://www.parl.clemson.edu/pvfs/desc.html
  • Lustre
  • Open Source / GPL software
  • http://www.lustre.org/
  • GFS
  • Global File System
  • Commercial file system available from Sistina
  • www.sistina.com/products_gfs.htm

47
iSCSI
  • SCSI over IP
  • Protocol for encapsulating SCSI commands into IP
    packets
  • I/O block data transport over IP networks
  • iSCSI and SCI SOCKET can be used to build
    scalable SAN / NAS solutions

48
iSCSI over SCI SOCKET
  • Latency is approximately 10x better than Gigabit
    Ethernet
  • Latency is reported by Intel's ktest

SCSI operation   Gigabit Ethernet   SCI SOCKET
op 0x28          250 us             29 us
op 0x2A          250 us             31 us
op 0x25          250 us             27 us
49
iSCSI over SCI SOCKET
  • Throughput is 2-4 times Gigabit Ethernet

50
SCI SOCKET comparison
Technology      Latency   Throughput   Reference
SCI             2.26 us   2016 Mbps    www.dolphinics.com
Myrinet         12 us     1818 Mbps    www.myrinet.com
Gbit Ethernet   23 us     936 Mbps     www.dolphinics.com
Infiniband      28 us     3768 Mbps    IEEE Symposium IPASS 2004
51
SCI vs other interconnects
  • As reported by Ames Lab (Iowa State University,
    USA)
  • Netpipe benchmark

52
Applications running SCI SOCKET
  • Intel iSCSI
  • PVFS
  • LUSTRE
  • MySQL Cluster
  • LAM-MPI
  • MPICH2
  • PVM
  • Oracle (Client/Server sqlplus)
  • TerraGrid (tm) by Terrascale
  • Scali MPI Connect
  • Latency_bench
  • Netpipe TCP/PVM
  • Netperf

53
Current Development
  • Available on x86 and x86_64, Linux 2.4 and 2.6.
  • Itanium beta release is ready
  • Porting to Windows in progress
  • Support for multiple adapters in progress
  • Data striping multiplies throughput with no
    latency penalty or extra CPU load
  • Redundancy and transparent failover to another SCI
    adapter or to Ethernet

54
SCI SOCKET: The fastest socket on earth?
Atle Vesterkjær, atleve@dolphinics.com, http://www.dolphinics.com
  • Olaf Helsets vei 6, NO-0619 Oslo, Norway. Phone:
    +47 23 16 70 00, Fax: +47 23 16 71 80

55
(No Transcript)
56
http://www.gria.org/
  • Would you like your computers to earn you extra
    money?
  • Would you like to have cheap access to tons of
    computing power?
  • The GRIA project will take Grid technology into
    the real world, enabling industrial users to
    trade computational resources on a commercial
    basis to meet their needs more cost effectively.
  • GRIA enables organizations to
  • Outsource computation.
  • If you need short-term computation, and cannot
    justify the expense of the hardware purchase,
    GRIA provides a mechanism to discover, negotiate
    and utilize other organizations' spare computing
    resources.
  • Rent out spare CPU cycles.
  • GRIA provides a mechanism allowing you to
    commercially offer your spare computing resources
    on the Grid.

57
Acknowledgement
  • The SCI SOCKET kernel module has been developed in
    the IST-33240 project GRIA (http://www.gria.org)
  • The SCI SOCKET user space software library has
    been developed in the ITEA project HYADES
    (http://www.hyades-itea.org)
  • The SCI SOCKET software is open source and
    available under GPL/LGPL. Dolphin strongly
    appreciates the contributions to the code and
    testing made by volunteer programmers and
    partners.
  • More information about SCI SOCKET can be found at
    http://www.dolphinics.com/products/software/sci_so
    ckets.html