Title: Sun
1. Sun's weak points in UE10000
2. Sun's Weak Points in UE10000
- DSD/DR is not used by customers
  - Sun will not provide DSD reference sites. (Giga)
  - A regular system administrator cannot make DSD/DR
    changes; it takes a very skilled system administrator
    to handle them. (Giga)
  - Very few customers use DSD/DR in database-related
    production environments. DSD/DR is used more often in
    testing environments. (Giga)
  - Few customers use DSDs. Those who do say it works fine
    most of the time. (Gartner)
- Quality problems
  - Terrible problems with the US II last year, with Sun
    unable to do root cause analysis. Some customers won't
    return to Sun, but will stay in the Sun fold with
    Fujitsu. (Giga)
  - An E-cache problem does not only bring down the
    affected domain, it brings the whole UE10K down.
  - Sun has had great difficulty designing reliable
    enterprise-level servers. Due to their background as a
    workstation vendor, they are behind in
    design-for-reliability technology.
  - The UltraSPARC II based systems did not have ECC in
    cache memory, with all the reliability problems as a
    result. The US III now supports ECC in level-2 cache,
    but they are still behind as they have no chip-kill
    technology or DMR.
- No Virtual Partitions
- No goal-based and multi-system workload management
3. SINGLE POINTS OF FAILURE (SPOF)
- HP has the lowest SPOF failure rate. The SPOF failure
  rate between partitions in Superdome (called the
  'infrastructure failure rate') is lower than the
  infrastructure failure rate of S/390 LPARs and certainly
  much lower than that of Sun UE10K domains.
- How can this be, when Sun quotes that the UE10K has
  "Complete Hardware Redundancy"?
- Sun's definition of SPOF: looking carefully at the
  literature, "Complete Hardware Redundancy" means "A fully
  redundant system will always recover from a system crash,
  by using (booting from) standby hardware." Therefore,
  this complete hardware redundancy is really a collection
  of single points of failure by HP's definition (the one
  the customer cares about).
Source: Ken Pomaranski, Hardware HA Architect
4. Does Sun really understand reliability?
- From the UE10K RAS manual:
  - Sun has made the time required for a module
    replacement much shorter over time. These enhancements,
    coupled with improved diagnostic capabilities, have
    reduced the cycle time on systems, simultaneously
    increasing reliability and availability.
  - There is currently no industry-adopted means to measure
    MTBF. Therefore, comparisons between vendors are of
    questionable use.
  - Each UE10K can be configured to have 100% HW
    redundancy.
Isn't reliability about keeping systems running?
How then does Sun track server reliability?
Shouldn't the UE10K then never fail?
5. Sun's Customers Understand!
- Topping their list of complaints are the frequency of
  server crashes caused by the problem memory, fixes that
  don't work and Sun's tendency to initially blame the
  problem on other factors before acknowledging it - often
  only under a nondisclosure agreement.
  (Computerworld, 9/04/2000)
- "They treated the whole thing like a cover-up," said one
  user at a large utility in the Western U.S. who asked
  not to be named. (Computerworld, 9/04/2000)
- The long-standing nature of the problem and Sun's
  handling of the issue raise troubling questions about the
  quality of Sun's hardware and support. (Gartner Group)
- Engineers have long known that memory chips can be
  disrupted by radiation and other environmental factors.
  That is why Hewlett-Packard and IBM use error-correcting
  code, or ECC, which detects cache errors and restores
  bits that were changed by mistake. (Forbes, 11/13/2000)
- Sun servers lack ECC protection. "Frankly, we just missed
  it. It's something we regret at this point," Shoemaker,
  Sun executive VP, says. (Forbes, 11/13/2000)
What else have they missed??
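The Forbes description of ECC (detect an error, then restore the bit that was changed) can be made concrete with a toy Hamming(7,4) code. This is only a minimal sketch of the general technique, assuming a single flipped bit per codeword; it is not the ECC scheme of any HP, IBM or Sun cache, and the function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (data bits, error position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome = 1-based index of bad bit
    if position:
        c[position - 1] ^= 1  # restore the bit that was changed
    return [c[2], c[4], c[5], c[6]], position


if __name__ == "__main__":
    word = [1, 0, 1, 1]
    stored = hamming74_encode(word)
    stored[4] ^= 1  # simulate one bit disturbed in memory
    recovered, where = hamming74_correct(stored)
    print("recovered:", recovered, "flipped position:", where)
```

Without the three parity bits, the same single-bit flip would go unnoticed and the corrupted word would be handed to the processor.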
6. Sun's UE10K Dynamic Reconfiguration Weaknesses
- Sun's UE10K implementation of DR is not quite as dynamic
  as Sun would have you believe. It's a marketing tale!!!
- Hot swapping I/O requires that CPU and memory also be
  brought down.
- Any DR activity requires that the database be shut down,
  therefore making applications unavailable during the
  process.
- DR cannot be used in combination with memory interleaving
  across system boards, which reduces maximum performance.
  Sun customers have to choose between good system
  performance or DR functionality, but cannot get both at
  the same time!
- DR is not supported in combination with SunCluster
  fail-over. Since the system halts during a DR operation,
  SunCluster considers this system to be failing and starts
  a fail-over procedure to another system. Sun customers
  have to choose between a true multi-system, high
  availability solution and the use of DR, but cannot get
  both at the same time!
- DR conflicts with Intimate Shared Memory (ISM) used by
  demanding applications. To improve performance, most
  memory-intensive applications, like databases, make use
  of the ISM capability in the E10000. Most applications
  using ISM do not allow dynamic addition or removal of
  their shared memory allocation. Using memory-intensive
  applications with ISM (like large databases) and making
  the most efficient use of partitions prevents the use of
  DR.
- Deactivating/moving a system board with full memory can
  take 15 minutes (backup and rearrange memory contents).
  All activities in the affected partition(s) have to be
  paused during that time! (To compensate, Sun introduced
  TurboDR boards with just CPUs, no memory...)
Source: John Wiltschut, BSTO Marketing
7. Why Sun is being defensive: Superdome vs. E10000
8. Sun blames HP and IBM for copying the E10000
- The truth is:
  - Superdome is more original than the E10000 has ever
    been; the E10K is an exact copy of the Cray CS6400.
  - Sun is just playing catch-up with the E10000's inferior
    performance, reliability and functionality.
  - The E10000 is an end-of-line product based on old
    technology and without future expansion capabilities.
  - Superdome is built as an advanced architecture based on
    the latest technology and with a very strong growth
    potential.
  - Sun has never developed a high-end server by
    themselves.
Heard of Superdome?
9. The E10000 is COPIED by Sun (from Cray)
- The CS6400 was developed by Cray and announced in 1993.
- It supported up to 64 SuperSPARC processors (60 MHz) and
  ran CRS-OS, based on Solaris, but modified by Cray.
- Most CS6400 systems used fewer than 30 CPUs, as the
  system did not scale very well.
- In 1996 Sun purchased this technology from Cray/SGI and
  introduced a copy in 1997 under the name E10000.
- All basic technology was already present in the CS6400,
  and Sun has never added any breakthrough improvements.
10. 64 SMP CPUs in Single Cabinet
- HP Superdome supports 64 CPUs in a single system with SMP
  functionality.
- Superdome is built as an advanced architecture based on
  the latest technology and with a very strong growth
  potential. The modular packaging allows you to use only
  half the size up to 32 processors.
- SD has 3 base cabinet configurations. The E10K comes in
  full size, even with only a few CPUs.
- A 48-CPU Superdome delivers 71% more performance in a
  system that is only 20% wider than a 64-CPU E10000.
- Sun claims:
  - Supported with Solaris since 1993
- The reality:
  - The Cray CS6400 (announced in 1993) was not developed
    by Sun, ran CRS-OS and had very limited scalability.
  - The E10K is a copy of the CS6400 without significant
    breakthrough technology added by Sun.
Based on TPC benchmark with Oracle.
11. Full Dynamic Partitioning
- HP is the first vendor to provide the full spectrum of
  partitioning: Hyperplex, nPartitions, virtual partitions
  and automatic resource partitioning. The different levels
  of partitioning can be combined as desired.
- nPartitions can be added and removed within an active
  Superdome.
- Virtual Partitions are dynamic at the CPU level, not just
  the cell level.
- Sun claims:
  - Supported with Solaris since 1997
- The reality:
  - Sun still does not support full dynamic partitioning
    (it does not support dynamic control by applications).
    Dynamic System Domains (DSD) require operator
    intervention and usually a reboot.
  - The use of DSD has many limitations: it cannot be
    combined with memory interleaving, SunCluster fail-over
    or Intimate Shared Memory. Domains always have to be
    multiples of 4 CPUs.
See whitepaper "DSD and DR -- the true story".
12. Only HP offers the full spectrum of partitioning
...Sun can't match
(Diagram: partitioning options arranged from isolation to
flexibility, each HP option set against its Sun counterpart)
HP:
- hyperplex: hard partitions with multiple nodes
  - complete hardware and software isolation
  - multiple OS images
- nPartitions: hard partitions within a node
  - hardware isolation per cell
  - complete software isolation
  - multiple OS images
- virtual partitions: virtual partitions within hard
  partitions
  - software isolation
  - multiple OS images
- resource partitions: prm (Process Resource Mgr), hp-ux
  wlm (Workload Manager)
  - dynamic resource allocation
  - automatic goal-based resource allocation via set SLOs
  - 1 OS image
Sun:
- suncluster
  - no high-speed interconnect
  - 8 node max.
  - doesn't work with Sun's DR
- dynamic system domains (dsd)
  - require reboot in most situations
  - difficult to modify configuration (Sun experts are
    usually needed)
- virtual partitions: No
- solaris resource manager (srm)
  - expensive
  - doesn't manage i/o
  - not goal-based like hp-ux wlm
13. Automated DR / Hot-swap CPU and Memory
- HP-UX can dynamically deallocate processors and memory
  with DPR and DMR (dynamic processor and memory
  resilience) in case of failures. This is a fully
  automatic process.
- Cell boards can be added and removed in an active
  Superdome.
- HP has been using error checking and correcting in cache
  memory to prevent most processor and system failures.
  Sun hasn't in the US II.
- Sun claims:
  - Supported with Solaris since 2000/1997
- The reality:
  - Automated DR is nothing more than scripting of an
    otherwise manual cell board replacement process.
    Dynamic Reconfiguration (DR) has many limitations
    (similar to DSDs).
  - If a processor fails, then the domain crashes and a
    reboot is required. This is neither automatic nor
    dynamic.
DR = Dynamic Reconfiguration. See whitepaper "DSD and DR --
the true story".
14. Interdomain Networking
- HP supports other high-speed communication links like
  Hyperfabric, Fibre Channel etc., and recommends not using
  IDN because of the lack of isolation between partitions.
- Sun claims:
  - Supported with Solaris since 1999
- The reality:
  - Interdomain networking (IDN) uses shared memory, and
    the connected domains are not isolated from failures in
    the other domains. As IDN violates hardware isolation
    (the main reason for partitioning), it increases the
    risk of downtime.
  - Sun does not support a high-speed interconnect like
    Hyperfabric for high-bandwidth data transfer between
    nodes and partitions.
15. Clustered File Systems
- HP supports multiple file system options depending on
  customer needs. CIFS/9000 is a global file system
  supporting multi-platform, multi-OS file systems.
- MC/ServiceGuard provides a superior, mature solution with
  support for up to 16 nodes and hundreds of applications,
  and has more than 45,000 installations. Hyperplex
  supports hundreds of clustered nodes.
- Sun claims:
  - Supported with Solaris since 2000 (December)
- The reality:
  - This was promised for SunCluster 3.0 but was never
    delivered (confirmed during the press conference). Sun
    tries to get around it by using marketing terms like
    "cluster-aware file system" and "cluster file service".
  - Sun's clustering solutions have always been behind, and
    customers have always preferred other solutions. Even
    now SunCluster 3.0 only supports 8 nodes and is focused
    on Solaris only.
16. Global Network Services
- HP's MC/ServiceGuard already provides flexible IP
  addresses so that applications can fail over to other
  nodes in a cluster without any problem.
- HP is focused on supporting multi-platform, multi-OS
  environments based on customer demand.
- Sun claims:
  - Supported with Solaris since 2000 (December)
- The reality:
  - This is mainly about abstracting an IP service from a
    network interface, such that applications can be moved
    in a cluster (HA fail-over). To speak in Sun terms:
    nothing new...
  - Sun is focused on Solaris-only solutions with no
    support for multi-OS.
17. What Sun does not say...
Sun's systems are lagging in all these areas:
- Reliability
  - Sun's current systems do not have Error Checking and
    Correcting, Dynamic Processor and Memory Resilience or
    Chip-Kill technology.
  - Analysts and press have reported serious problems with
    Sun E10000 systems at customer sites. See the Forbes
    and Gartner articles.
- Performance
  - The US II processor lacks performance compared to HP's
    current offerings, resulting in much lower system
    performance. Even the US III will barely meet the
    current PA-RISC performance levels.
- I/O bandwidth
  - Today's applications, like broadband and data
    warehousing, require high I/O bandwidth, which Sun does
    not deliver.
- Investment protection
  - Current Sun products are basically end-of-life. The
    US III requires new boxes and runs only the Solaris 8
    OS.
- Multi-platform support
  - Sun's vision is limited to Solaris/SPARC only, not
    directed towards multi-platform environments.
18. Who is really playing Catch-Up?
19. Leadership: performance, flexibility, availability
(Chart comparing hp superdome and sun e10000; each row was
rated leadership / limited / weakness)
performance/scalability   hp superdome   sun e10000
  CPU                     64             64
  memory                  256            64/128
  I/O                     192            64
  tpm                     200K           115K/156K
flexibility: hyperplex, nPartitions, virtual partitions,
  resource partitions, utility pricing, iCOD, IA-64,
  Multi-OS
availability: multi-system, single system
investment protection
20. Sun's Dark Secret
Sun Screen
Sun Microsystems' servers have been crashing for more than
a year. Sun has kept the flaw secret--and hasn't yet fixed
it. 11/13/2000
21. Sun and HP
- Reliability
- Comparisons
22. Why HP can fulfill the customer needs better than Sun
- HP understands what available systems really mean.
  Availability is the BASE upon which all other features
  are built.
23. Reliability Comparison
                                      HP    UE10K  SUNFIRE
Internal cache error correction       YES   NO     NO
Dynamic processor resilience          YES   SOME   SOME
Chip kill protection                  YES   YES    NO
HW scrubbing                          YES   NO     NO
Dynamic memory resilience             YES   NO     NO
PCI bus error isolation               YES   NO     NO
Full PCI OLAR                         YES   NO     NO
Address bus ECC                       YES   NO     NO
Redundant DC / DC converters          YES   NO     NO
Full stuck-at bit correction          YES   NO     NO
Interconnect reliability experience   YES   NO     NO
(Row categories: CPU, MEMORY, IO, BACKPLANE)
24. Reliability Comparison (2)
                                    HP        UE10K  SUNFIRE
5 nines solution availability       YES       NO     NO
Data center wide HA solutions       YES       NO     NO
Customer care for quality issues    YES (*)   NO     NO
Proven domain isolation             YES       NO     NO
Solution level verification         YES       ?      ?
Cosmic ray tolerance                YES       NO     NO
(Row category: SOLUTION LEVEL)
HP projects that the above reliability oversights result in
Sun systems with 2-4x greater failure rates than HP
systems. This has been proven by field experience.
(*) Rather than blame customers for quality problems, HP
closely tracks field data and works PROACTIVELY to fix
potential field quality problems.
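For readers unfamiliar with the "5 nines" row above: it refers to 99.999% availability. A minimal sketch of the arithmetic follows, assuming a 365-day year and treating availability as a simple fraction of uptime; the helper name is hypothetical and the numbers are plain arithmetic, not HP or Sun measurements.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def downtime_minutes_per_year(nines):
    """Allowed downtime per year for an availability of `nines` nines.

    Example: nines=5 means 99.999% availability, i.e. an unavailable
    fraction of 10**-5.
    """
    unavailable_fraction = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, "nines:", round(downtime_minutes_per_year(n), 2),
              "minutes of downtime per year")
```

Each additional nine cuts the permitted downtime by a factor of ten, which is why a single unplanned reboot can consume a whole year's budget at the five-nines level.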