Title: SCI SOCKET: The fastest socket on earth? Atle Vesterkj
1 SCI SOCKET: The fastest socket on earth?
Atle Vesterkjær
atleve_at_dolphinics.com
http://www.dolphinics.com
- Olaf Helsets vei 6, NO-0619 Oslo, Norway. Phone +47 23 16 70 00, Fax +47 23 16 71 80
2 SCI Standard - Dolphin SCI Technology
- WHAT IS SCI?
- The SCI standard ANSI/IEEE 1596-1992 defines a point-to-point interface and a set of packet protocols.
- The protocols use small packets with a 16-byte header and data sizes of 16, 64, and 128 bytes.
- Packets are protected by a 16-bit CRC code.
- An SCI interface has two unidirectional links that operate concurrently.
- Ring configuration or direct connection over separate in and out cables.
- The SCI protocols support shared memory and bus bridging by encapsulating bus requests and responses into SCI request and response packets.
- Bus-imitating protocol with packet-based handshake protocols and guaranteed data delivery.
- Brings the bus out on cables.
- For shared memory applications a cache coherency protocol is defined.
- Dolphin has an implementation of the cache coherency that has been used by customers.
- PCI does not feature cache coherency.
- Processor-Bus-Memory technology as opposed to Networking technology
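The packet sizes listed on this slide imply a fixed per-packet overhead. The following is a rough sketch of the resulting payload efficiency. It assumes the 16-bit CRC is carried in addition to the 16-byte header and ignores any other link-level traffic; both are assumptions, not statements from the slide.

```python
HEADER_BYTES = 16  # header size stated on the slide
CRC_BYTES = 2      # 16-bit CRC; assumed here to be counted on top of the header


def payload_efficiency(data_bytes: int) -> float:
    """Fraction of the bytes in one packet that carry payload."""
    return data_bytes / (data_bytes + HEADER_BYTES + CRC_BYTES)


# The three data sizes named on the slide.
for size in (16, 64, 128):
    print(f"{size:3d}-byte payload: {payload_efficiency(size):.1%} of the packet")
```

Under these assumptions the largest packet size carries proportionally the least overhead, which is why bulk transfers favour 128-byte packets.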
3 SCI SOCKET - Outline
- The fastest socket on earth and the impact on storage and applications
- SCI technology
- SCI SOCKET for storage and applications.
- SCI SOCKET benchmarks
4 Highlights of the Dolphin SCI Technology
- Ultra Low Latency
- CPU has direct access to remote memory
- No protocol overhead
- 1.4 µs for a 4-byte write
- < 3 µs for a 512-byte write
- 0.2 µs pipelined write
- Fast failover for HA systems
- Highly efficient bus bridging
- Bus Requests and Responses (CPU load/store operations) are translated directly in Hardware to Request and Response Packets
- Point-to-Point Links give Bus Performance and Latency over Distance
- High data throughput: 346 MByte/s
- 0.2 µs pipelined write
5 Highlights of the Dolphin SCI Technology
- Wide Application Area
- Common Mode Multiprocessing
- Storage, Clustering, Multiprocessing, Embedded Systems, Telecommunication, Defense, Medical Imaging
- Choice of Topologies: Ring, Torus, Switched
- Shipping in Critical Applications for more than 10 years
- Based on the ANSI/IEEE 1596-1992 Scalable Coherent Interface (SCI) Standard
6 Linköping University - NSC - SCI Clusters
Also in Sweden: Umeå University, 120 Athlon nodes
- Monolith: 200 nodes, 2x Xeon 2.2 GHz, 3D SCI
- INGVAR: 32 nodes, AMD 900 MHz, 2D SCI
- Otto: 48 nodes, P4 2.26 GHz, 2D SCI
- Maxwell: 40 nodes, 2x Xeon, 2D SCI
- Bris 162, 2x Xeon
- Total: 336 SCI nodes
7 University of California Santa Cruz
- 132 Nodes
- 264 Processors
- WulfKit
- Linux
- RackSaver Integration and Installation
- AMD Athlon 1800 processors
- RackSaver 1U Chassis
- No. 99 on HPC500, February 2002
8 The online signal processor at Manchester University's Jodrell Bank Observatory is a WulfKit cluster
9 Applications, Database Clustering
Ultra Enterprise Cluster
- Sun's High End servers are clustered with Dolphin Cards
- Money Transaction and Database Applications
- High Availability and Performance
- Dolphin Ships Cards and Switches
- 7th year of shipments
- Oracle 9i Performance and Scalability
- SCI runs natively on Sun's RSM (Remote Shared Memory) API
10 MySQL Cluster
- Cost-effective five-nines availability using parallel server architecture with no single point of failure.
- Performance and high throughput required to meet the most demanding enterprise applications.
- Incremental scaling of applications in a linear fashion as needs grow without having to invest in expensive hardware
- MySQL Cluster has a flexible distributed architecture which gives you complete control over the level of performance, reliability and scalability you need to match your application requirements.
- Dolphin Ships Cards and Switches
http://www.mysql.com/products/cluster/
11 XPrime
- X1 Database Performance Cluster for Microsoft SQL Server
- Ever wonder what 512 Processors and 1 Terabyte of RAM can do for your database application?
- Uses the Dolphin D336 card
- Works with all existing SQL Server applications
- Incremental capacity and performance
- Up to 512 Processors and 1 Terabyte of RAM
- Integrated SCI clustering hardware
- Small rack footprint
- Plug and Play Installation
- Claims 10x performance at 1/10 the cost
http://www.xprime.com/
12 SMPs
- Convex Exemplar Supercomputer
- Supercomputer SMP, Dolphin chips
- Shipped in 1994
- Data General uses Dolphin Chips as main interconnect for ccNUMA Systems
- Shared memory Intel x86 Multiprocessors
- 32 Processors, SCI Interconnect
- Dolphin Ships Chips and Switches
- 3 generations shipped
AV20000 NUMA Server
13 Mirage 2000 Upgrade, First Test Flight January 2001
Thales uses Dolphin's technology as the main interconnect in the on-board multiprocessor.
Offered with systems like Mirage 2000-9, Mirage 2000-5, Rafale and more.
14 Space Mission Application
Dolphin's technology is chosen for evaluation
http://sim.jpl.nasa.gov/
Dolphins in Space!
15 SCI Removes I/O Bottleneck
- Siemens uses Dolphin's SCI in the I/O subsystem of its RM600 E UNIX servers
- SCI allows expansion to 150 PCI slots and 5.8 TB of storage.
- High-performance SCI links enable remote attachment
- 3rd generation shipping
[Figure: RM600 E Server block diagram; labels: RM600 System Bus, Bridge Chip, B-Link, Dolphin Chips (PSB, LC), SCI, PCI]
16 High Performance File Server
- Auspex uses Dolphin PCI SCI Cards for the main rack-to-rack interconnect in its 4Front NS2000
- Connects up to 4 storage systems to the control system
- Gives Performance, Expandability and Reliability
- Dolphin Ships Cards
17 LinuxLabs
- Linux expert company located in Atlanta, GA
- Delivers clusters and clustering SW
- Involved in several open source projects
- Showed extensive SW support for Dolphin SCI at SC2004
- Is upgrading the Ames Labs Linux/SCI cluster with their SW suite now
Global shared memory can be accessed using either Linux Labs' modified MPICH or Linux Labs' Cray-compatible shmem "get" and "put" call library for low latency communication over Dolphin SCI. Linux Labs' cluster software is published under the GNU General Public License. Linux Labs can provide you with a custom shared-memory architecture for your next supercomputing initiative at a fraction of the cost of its functional equivalent.
http://www.linuxlabs.com/
18 LinuxLabs Nimbus 2.0
- Nimbus 2.0 is based on LANL's Clustermatic and RedHat 9.0 with significant enhancements and additional software produced by LinuxLabs. Improvements from the 1.0 version include:
- Full installable system on a boot CD using anaconda.
- Integration of the Maui scheduler with the Clustermatic bproc system.
- Specialized version of MPICH for use with the Dolphin Interconnect.
- Clusgres system allowing the PostgreSQL database to run on the cluster using Dolphin hardware.
- System checkpointing capability.
- Integration with the Ganglia system monitor.
- Enhancements to Supermon so that filesystem data is made available to Maui and Ganglia
- Enhanced network console capability to replace serial console concentrators in most applications. This has the dual benefit of simplifying the cluster and reducing cost.
http://www.linuxlabs.com/
19 LinuxLabs Clusgres
- Clusgres is a clustered version of the open source PostgreSQL database server specifically designed for the Dolphin interconnect.
- Feature set
- Database reads perform faster. The average query service time decreases linearly with the increase in cluster nodes.
- Pure database writes are as fast as on a single system, but they are faster if they are mixed with reads. Work is underway to bring pure database writes to linear performance as well.
- The database is not modified and the support is transparent to the application.
- It scales very well to as many nodes as supported by the underlying interconnect system.
- It works with any POSIX-compliant Unix-like operating system.
http://www.linuxlabs.com/
20 MandrakeSoft
- Mandrake offers the MandrakeClustering Linux distribution
- The distribution includes all Dolphin SCI SW
- Other packages included
- Ganglia, Lam, Mpich, Pvm
- Pconsole, Authd, Pcp, Pmake
- Maui, PXE, PXELinux, SmartMonTools
- Gexec, Ka-Run, ClusterScripts and more
- MandrakeClustering is a MandrakeSoft clustering solution based on the research product CLIC. It has been designed to offer research laboratories affordable, fast-to-deploy, easy-to-use heavy-calculation systems based on clustering.
21 ExaFusion: a virtual SAN
- ExaFusion virtualizes disks or partitions on a cluster's nodes and emulates a high performance SAN
[Figure labels: Cluster (x3), Virtual SAN, Virtual arrays]
22 ExaFusion Features
23 ExaFusion Software architecture
[Figure: layered architecture, top to bottom]
- Applications: Simple App., Simple App., Clustered Application
- I/O on files (lseek, write, mkdir)
- File systems: Local FS, Local FS, Clustered File System (GFS, GPFS)
- I/O on blocks (read, write, ioctl)
- ExaFusion Clustered Volume Manager: Private Volume, Private Volume, Shared Volume
- I/O on blocks
- ExaFusion Networked Block Device
- Disks: unembedded disks, disks embedded in nodes
- Dolphin SCI Interconnection Network
24 ExaFusion benchmarks
25 SCI Adapter Cards - 64 bit 66 MHz
- PCI-, PMC(VME)- and CompactPCI-SCI Adapter Card
- Industry-best latency
- 1.4 microseconds for a 4-byte write
- < 3 microseconds for a 512-byte write
- 0.2 microseconds pipelined write
- High data throughput: 346 MBytes/s
- Supports both
- Direct Memory Access (DMA)
- Remote Memory Access (RMA)
- Remote Interrupt
- Hot-pluggable cabling
- Redundant SCI adapters can be used for fault tolerance
[Figure labels: SCI, LC, PSB, PCI]
Cluster Adapter, PCI to PCI Bridge, PCI Extension, Reflected Memory
26 Dolphin Products: Switches, Chips and Cards
27 Torus Topology
- 1D Topology (Ring): to 10 Nodes
- 2D Torus Topology: to 100 Nodes
- 3D Torus Topology: to 1000s of Nodes
[Figure labels: SCI, LC, PSB, PCI]
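The node counts on this slide match rings of about 10 nodes per dimension. A minimal sketch of that arithmetic, together with the ring neighbours of a node in a torus, is shown here. The ring size of 10 and the coordinate scheme are assumptions chosen to reproduce the slide's figures, not Dolphin's addressing scheme.

```python
def node_count(dims):
    """Total nodes in a torus whose rings have the given sizes."""
    total = 1
    for size in dims:
        total *= size
    return total


def torus_neighbors(node, dims):
    """Nodes one hop away from `node` along each ring, with wrap-around."""
    neighbors = set()
    for axis, size in enumerate(dims):
        for step in (-1, 1):
            coord = list(node)
            coord[axis] = (coord[axis] + step) % size
            neighbors.add(tuple(coord))
    neighbors.discard(tuple(node))  # a ring of size 1 would point back at itself
    return sorted(neighbors)


RING = 10  # assumed nodes per ring
for d in (1, 2, 3):
    print(f"{d}D: {node_count((RING,) * d)} nodes")
print(torus_neighbors((0, 0), (RING, RING)))
```

Each added dimension multiplies the node count by the ring size while adding only two more neighbours per node.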
28 WulfKit
- Kit with all parts necessary to upgrade a cluster node
- Dolphin 1D, 2D or 3D HW
- Topology is Ring, 2D Torus or 3D Torus
- SW Options
- Scali MPI Connect
- Scali Manage
- Dolphin Open Source (GPL) SW
- SCI-MPICH
- SCI-Sockets
- SISCI/IRM
- Applications
- HPC, ISP, ASP, Database Servers
29 Scali MPI Connect v 4.2 for WulfKit
Scali MPI Connect highlights:
- High bandwidth, low latency performance
- Interconnect independent architecture
- Dynamic interconnect selection at runtime
- Interconnect failover functionality
- Support for heterogeneous configurations
- Runtime selection of collective algorithms
- Multi thread-safe implementation
- Supports 64 bit on RedHat and SuSE for Intel and AMD
Scali Manage features:
- System Installation and Configuration
- System Administration
- System Monitoring
- Alarms and Event Automation
- Work Load Management
- Hardware Management
- Heterogeneous Cluster Support
30 Dolphin SW
- All Dolphin SW is free open source (GPL or LGPL)
- SISCI shared memory interface
- SCI-Sockets
- Low Latency Socket Library
- TCP and UDP Replacement
- User and Kernel level support
- Release 2.3 available
- SCI-MPICH (RWTH Aachen)
- MPICH 1.2 and some MPICH 2 features. MPICH 2 in development.
- New release is being prepared, beta available
- SCI Interconnect Manager
- Automatic failover recovery.
- No single point of failure in 2D and 3D networks.
- Other
- SCI Reflective Memory, Scali MPI, Linux Labs SCI Cluster Cray-compatible shmem and Clusgres PostgreSQL, MandrakeSoft Clustering HPC solution, XPrime's X1 Database Performance Cluster for Microsoft SQL Server, ClusterFrame from Qlusters, SunCluster 3.1 (Oracle 9i), and MySQL Cluster
31 Latency vs SW
SW                    Latency (1/2 ping-pong round trip)
SISCI (Direct HW)     1.4 µs
SCI-Sockets           2.3 µs
Scali MPI Connect     3.5 µs
SCI-MPICH             3.8 µs
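One way to read this table is as the software overhead each layer adds on top of direct hardware access. The short sketch below does only that subtraction on the slide's figures; treating SISCI as the hardware baseline follows the slide's own "Direct HW" label.

```python
# Half round-trip latencies in microseconds, copied from the slide.
latency_us = {
    "SISCI (Direct HW)": 1.4,
    "SCI-Sockets": 2.3,
    "Scali MPI Connect": 3.5,
    "SCI-MPICH": 3.8,
}

baseline = latency_us["SISCI (Direct HW)"]

# Overhead each software layer adds relative to direct hardware access.
overhead_us = {name: round(value - baseline, 1) for name, value in latency_us.items()}

for name, extra in overhead_us.items():
    print(f"{name:20s} +{extra:.1f} µs over direct HW")
```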
32 SCI SOCKET
33 Motivation
- Link level speeds of interconnects are increasing
- Communication bottleneck moved to protocol software
- High speed networks provide their own efficient interfaces
- On the other hand
- A large number of applications are built around legacy protocols such as the TCP/IP suite
- De-facto standard Berkeley Sockets API
- Porting to hardware specific APIs unprofitable in many cases
- SCI SOCKET aims to bring the two together
34 Berkeley Sockets over SCI
- High Speed, Low Latency Replacement for Gigabit Ethernet for Critical Applications
- Bypassing traditional network stacks like TCP/UDP/IP
- Eliminating protocol overhead and reducing latency
- Transparent to applications, no modifications or recompilation required
- Ultra low latency
- 2.27 us socket send/receive latency
35 Berkeley Sockets over SCI
- Data transfer through remote shared memory
- Offers new socket transport family AF_SCI
- Flexible using configuration files
- Specifying Cluster nodes
- Specifying ports
36 SCI SOCKET status
- Available for Linux 2.4 x86 / x86_64
- Itanium port and test in progress
- Port to Windows using Windows Socket Direct (WSD) technology started. Will be available Q3 2004
- Port to Solaris is being reviewed.
37 LD_PRELOAD
- Standard mechanism to preload C library functions
- User-defined library functions are called instead of the C library's
- AF_INET selects traditional TCP/IP path
- AF_SCI selects SCI_SOCKET
- int socket(int family, int type, int protocol)
- if ((family == AF_INET) && (type == TCP || type == UDP))
- socket_lib(AF_SCI, type)
- else
- socket_lib(family, type)
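The selection rule in the pseudocode above can be modelled in a few lines. This is only a sketch of the decision logic: a real interposer would be a C shared library loaded through LD_PRELOAD, and the numeric value of AF_SCI used here is a hypothetical placeholder, since the slide does not give it.

```python
import socket

AF_SCI = 0x7FFF  # hypothetical placeholder; the real AF_SCI value is not on the slide


def select_family(family: int, sock_type: int) -> int:
    """Return the address family the wrapped socket() call would use.

    TCP (SOCK_STREAM) and UDP (SOCK_DGRAM) sockets on AF_INET are
    redirected to AF_SCI; every other request passes through unchanged.
    """
    if family == socket.AF_INET and sock_type in (socket.SOCK_STREAM, socket.SOCK_DGRAM):
        return AF_SCI
    return family


print(select_family(socket.AF_INET, socket.SOCK_STREAM) == AF_SCI)  # redirected
print(select_family(socket.AF_INET6, socket.SOCK_STREAM) == AF_SCI)  # passed through
```

Because the rewrite happens inside the preloaded socket() wrapper, the application keeps asking for AF_INET and needs no source changes.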
38 SCI SOCKET
- Easy installation of the SCI socket library
39 Configuration File /etc/sci/scisock.conf
- Selects which machines can be reached using SCI
- Optionally, /etc/sci/scisock_opt.conf selects which ports can be reached using SCI
This is a SCI socket config file. Should be placed in /etc/sci.
hostname         SCI NodeId
nodeA            4
193.71.152.89    8
Mailhost         16
File-serv        20
This is a SCI socket_opt config file. Should be placed in the /etc/sci directory.
-key                   -Type     -value
EnablePortsByDefault             -yes/no
EnablePort             tcp|udp   portnumber
DisablePort            tcp|udp   portnumber
EnablePortRange        tcp|udp   start_port end_port
DisablePortRange       tcp|udp   start_port end_port
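The host table in scisock.conf maps each host name or IP address to an SCI node ID. A minimal sketch of reading such a table follows. The whitespace-separated two-column layout is taken from the example on the slide; treating `#` as a comment character is an assumption, and the function names are hypothetical, not part of the SCI SOCKET software.

```python
def parse_scisock_conf(text: str) -> dict:
    """Map each host name or address to its SCI node ID."""
    nodes = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumed comment syntax
        if not line:
            continue
        host, node_id = line.split()
        nodes[host] = int(node_id)
    return nodes


def reachable_via_sci(host: str, nodes: dict) -> bool:
    """A host is reached over SCI only if it is listed in the table."""
    return host in nodes


# Entries copied from the slide's example.
SAMPLE = """\
# hostname       SCI NodeId
nodeA            4
193.71.152.89    8
Mailhost         16
File-serv        20
"""

nodes = parse_scisock_conf(SAMPLE)
print(nodes)
print(reachable_via_sci("Mailhost", nodes), reachable_via_sci("elsewhere", nodes))
```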
40 Linux Kernel Socket Switch
[Figure labels: User App, Socket lib, Cluster File System, iSCSI; User space / Kernel space; Linux Kernel Socket Switch; SCI SOCKET; Native SOCKET, IP, Ethernet driver]
41 Small Message Latency
42 TCP STREAM
43 TCP-RR: SCI SOCKET vs Gigabit Ethernet
44 Scali MPI over SCI SOCKET
- SCI SOCKET is 1.6 - 6.0 times faster than TCP/GigE
45 Why is SCI SOCKET so fast?
- Small messages are sent using basic CPU instructions
- Data are normally located in CPU cache
- Low cost write post to local memory address
- Single store CPU instruction to send 8 bytes
- Raw send latency for 8 bytes is approximately 210 nanoseconds
- No need to lock down or register memory
- Large messages are sent using DMA
- Stream-lined and lock-free messaging protocol on top of shared memory
- Combination of polling and interrupts
- Receiving a message causes the received message to be cached
- No additional memory access
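The idea of a lock-free messaging protocol on top of shared memory with a polling receiver can be illustrated with a single-producer, single-consumer ring. This is a generic textbook sketch, not Dolphin's protocol: the class and method names are hypothetical, and a real implementation over remote memory would also need memory-ordering barriers that plain Python does not model.

```python
class SpscRing:
    """Single-producer/single-consumer message ring that needs no locks.

    The producer writes only `tail`, the consumer writes only `head`,
    so neither side ever has to take a lock on the shared state.
    """

    def __init__(self, slots: int):
        self.slots = [None] * slots
        self.head = 0  # next slot to read; written only by the consumer
        self.tail = 0  # next slot to write; written only by the producer

    def try_send(self, msg: bytes) -> bool:
        nxt = (self.tail + 1) % len(self.slots)
        if nxt == self.head:         # ring is full, caller retries later
            return False
        self.slots[self.tail] = msg  # store the payload first
        self.tail = nxt              # then publish it by advancing tail
        return True

    def try_recv(self):
        if self.head == self.tail:   # ring is empty, caller keeps polling
            return None
        msg = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        return msg


ring = SpscRing(4)  # one slot stays empty, so 3 messages fit
sent = [ring.try_send(m) for m in (b"a", b"b", b"c", b"d")]
received = [ring.try_recv() for _ in range(4)]
print(sent, received)
```

A receiver that finds the ring empty for long enough could fall back to waiting for an interrupt, which is one plausible reading of "combination of polling and interrupts".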
46 Cluster File Systems
- SCI SOCKET: a typical cluster file system will run out of the box
- PVFS
- Open Source / GPL software
- http://www.parl.clemson.edu/pvfs/desc.html
- Lustre
- Open Source / GPL software
- http://www.lustre.org/
- GFS
- Global File System
- Commercial file system available from Sistina
- www.sistina.com/products_gfs.htm
47 iSCSI
- SCSI over IP
- Protocol for encapsulating SCSI commands into IP packets
- I/O block data transport over IP networks
- iSCSI and SCI SOCKET can be used to build scalable SAN / NAS solutions
48 iSCSI over SCI SOCKET
- Latency is approximately 10x better than Gigabit Ethernet
- Latency is reported by Intel's ktest
               Gigabit Ethernet   SCI SOCKET
SCSI op 0x28   250 us             29 us
SCSI op 0x2A   250 us             31 us
SCSI op 0x25   250 us             27 us
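The "approximately 10x" claim can be checked against the table directly. The sketch below divides the Gigabit Ethernet latency by the SCI SOCKET latency for each operation, using only the figures shown; the opcode names in the comments are the standard SCSI meanings of those codes, added here for readability.

```python
GIGE_US = 250  # Gigabit Ethernet latency reported for all three operations

# SCI SOCKET latency in microseconds per SCSI opcode, from the table.
sci_us = {
    0x28: 29,  # READ(10)
    0x2A: 31,  # WRITE(10)
    0x25: 27,  # READ CAPACITY(10)
}

speedup = {op: GIGE_US / t for op, t in sci_us.items()}

for op, factor in speedup.items():
    print(f"SCSI op 0x{op:02X}: {factor:.1f}x lower latency")
```

The per-operation ratios fall between 8x and 10x, consistent with the slide's rounded figure.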
49 iSCSI over SCI SOCKET
- Throughput is 2-4 times Gigabit Ethernet
50 SCI SOCKET comparison
Technology      Latency    Throughput   Reference
SCI             2.26 us    2016 Mbps    www.dolphinics.com
Myrinet         12 us      1818 Mbps    www.myrinet.com
Gbit Ethernet   23 us      936 Mbps     www.dolphinics.com
Infiniband      28 us      3768 Mbps    IEEE Symposium IPASS 2004
51 SCI vs other interconnects
- As reported by Ameslab (Iowa State University, USA)
- Netpipe benchmark
52 Applications running SCI SOCKET
- Intel iSCSI
- PVFS
- LUSTRE
- MySQL Cluster
- LAM-MPI
- MPICH2
- PVM
- Oracle (Client/Server sqlplus)
- TerraGrid (tm) by Terrascale
- Scali MPI Connect
- Latency_bench
- Netpipe TCP/PVM
- Netperf
53 Current Development
- Available on x86 and x86_64, Linux 2.4 and 2.6.
- Itanium beta release is ready
- Porting to Windows in progress
- Support for multiple adapters in progress
- Data striping multiplies throughput with no latency penalty or extra CPU load
- Redundancy and transparent failover to another SCI adapter and Ethernet
56 http://www.gria.org/
- Would you like your computers to earn you extra money?
- Would you like to have cheap access to tons of computing power?
- The GRIA project will take Grid technology into the real world, enabling industrial users to trade computational resources on a commercial basis to meet their needs more cost effectively.
- GRIA enables organizations to
- Outsource computation.
- If you need short-term computation, and cannot justify the expense of the hardware purchase, GRIA provides a mechanism to discover, negotiate and utilize other organizations' spare computing resources.
- Rent out spare CPU cycles.
- GRIA provides a mechanism allowing you to commercially offer your spare computing resources on the Grid.
57 Acknowledgement
- SCI SOCKET kernel module has been developed in the IST-33240 project GRIA (http://www.gria.org)
- SCI SOCKET user space software library has been developed in the ITEA project HYADES (http://www.hyades-itea.org)
- The SCI SOCKET software is open source and available under GPL/LGPL. Dolphin strongly appreciates the contribution to the code and testing done by volunteer programmers and partners.
- More information about SCI SOCKET can be found at http://www.dolphinics.com/products/software/sci_sockets.html