Sun - PowerPoint PPT Presentation

About This Presentation
Title:

Sun


Slides: 25
Provided by: Martin727

Transcript and Presenter's Notes



1
  • Sun's weak points in the UE10000

2
Sun's Weak Points in the UE10000
  • DSD/DR Is Not Used by Customers
  • Sun will not provide DSD reference sites (Giga).
  • A regular system administrator cannot make
    DSD/DR changes; it takes a very skilled system
    administrator to handle them (Giga).
  • Very few customers use DSD/DR in database-related
    production environments; DSD/DR are used more
    often in test environments (Giga).
  • Few customers use DSDs. Those who do say they
    work fine most of the time (Gartner).
  • Quality Problems
  • Terrible problems with the US II last year;
    root-cause analysis could not be done. Some
    customers won't return to Sun, but will stay in
    the Sun fold with Fujitsu (Giga).
  • An E-cache problem brings down not only the
    affected domain but the whole UE10K.
  • Sun has had great difficulty designing reliable
    enterprise-level servers. Because of its
    background as a workstation vendor, it is behind
    in design-for-reliability technology.
  • UltraSPARC II-based systems did not have ECC in
    cache memory, with all the resulting reliability
    problems. The US III now supports ECC in the
    level-2 cache, but Sun is still behind, as it has
    no chip-kill technology or DMR (dynamic memory
    resilience).
  • No Virtual Partitions
  • No Goal-Based and Multi-System Workload
    Management

3
SINGLE POINTS OF FAILURE (SPOF)
  • HP has the lowest SPOF failure rate: the SPOF
    failure rate between partitions in Superdome
    (called the 'infrastructure failure rate') is
    lower than that of S/390 LPARs and much lower
    than that of Sun UE10K domains.
  • How can this be, when Sun claims that the UE10K
    has "Complete Hardware Redundancy"?
  • Sun's definition of SPOF: looking carefully at
    the literature, "Complete Hardware Redundancy"
    means "A fully redundant system will always
    recover from a system crash, by using (booting
    from) standby hardware." Recovery still requires
    a crash and a reboot, so this complete hardware
    redundancy is really a collection of single
    points of failure by HP's definition (the one
    the customer cares about).

Source: Ken Pomaranski, Hardware HA Architect
4
Does Sun really understand reliability?
  • From the UE10K RAS manual:
  • Sun has made the time required for a module
    replacement much shorter over time. These
    enhancements, coupled with improved diagnostic
    capabilities, have reduced the cycle time on
    systems, simultaneously increasing reliability
    and availability.
  • There is currently no industry-adopted means of
    measuring MTBF. Therefore, comparisons between
    vendors are of questionable use.
  • Each UE10K can be configured to have 100%
    hardware redundancy.

Isn't reliability about keeping systems running?
How then does Sun track server reliability?
Shouldn't the UE10K then never fail?
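The questions above turn on the difference between reliability (how often a system fails) and availability (the fraction of time it is up). A minimal sketch, assuming the common textbook model of steady-state availability A = MTBF / (MTBF + MTTR), which is not taken from Sun's RAS manual: faster module replacement lowers MTTR and raises availability, but leaves MTBF, and so the number of failures per year, unchanged. All figures are hypothetical.

```python
# Steady-state availability from MTBF (mean time between failures) and
# MTTR (mean time to repair): A = MTBF / (MTBF + MTTR).
# The MTBF and MTTR values below are hypothetical, for illustration only.

HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up in steady state."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def failures_per_year(mtbf_hours: float) -> float:
    """Approximate expected failures per year (repair time ignored).

    Depends on MTBF only, so shorter repairs do not change it.
    """
    return HOURS_PER_YEAR / mtbf_hours


if __name__ == "__main__":
    mtbf = 5000.0  # hypothetical MTBF in hours
    # Hypothetical repair times before and after faster module replacement.
    for mttr in (4.0, 1.0):
        print(f"MTTR={mttr:>4} h  "
              f"availability={availability(mtbf, mttr):.5f}  "
              f"failures/year={failures_per_year(mtbf):.2f}")
```

Going from a 4-hour to a 1-hour repair raises availability from about 0.9992 to about 0.9998, while the expected failure count stays at about 1.75 per year.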
5
Sun's Customers Understand!
  • Topping their list of complaints are the
    frequency of server crashes caused by the problem
    memory, fixes that don't work and Sun's
    tendency to initially blame the problem on other
    factors before acknowledging it, often only
    under a nondisclosure agreement (Computerworld,
    9/04/2000).
  • "They treated the whole thing like a cover-up,"
    said one user at a large utility in the Western
    U.S. who asked not to be named (Computerworld,
    9/04/2000).
  • The long-standing nature of the problem and
    Sun's handling of the issue raise troubling
    questions about the quality of Sun's hardware and
    support (Gartner Group).
  • Engineers have long known that memory chips can
    be disrupted by radiation and other environmental
    factors. That is why Hewlett-Packard and IBM use
    error-correcting code, or ECC, which detects
    cache errors and restores bits that were changed
    by mistake (Forbes, 11/13/2000).
  • Sun servers lack ECC protection. "Frankly, we
    just missed it. It's something we regret at this
    point," says Shoemaker, Sun executive VP (Forbes,
    11/13/2000).

What else have they missed?
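The ECC described in the Forbes quote detects a flipped bit and restores it. Real cache ECC typically uses wider SEC-DED codes over 64-bit words; this sketch instead uses the textbook Hamming(7,4) code as a small stand-in, and the function names `encode` and `decode` are hypothetical, not HP's or IBM's implementation.

```python
# Minimal Hamming(7,4) single-error-correcting code, a textbook stand-in
# for the wider ECC used in real cache memories.
# Codeword positions are 1..7; parity bits sit at positions 1, 2 and 4.


def encode(data: list[int]) -> list[int]:
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(code: list[int]) -> tuple[list[int], int]:
    """Correct at most one flipped bit.

    Returns (data bits, error position); position 0 means no error found.
    """
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], syndrome


# Usage: simulate a single upset bit (e.g. from radiation) and repair it.
word = encode([1, 0, 1, 1])
word[4] ^= 1  # flip the bit at position 5
data, pos = decode(word)
print(data, pos)  # prints [1, 0, 1, 1] 5
```

Without the parity bits a single upset bit silently corrupts the stored value; with them the syndrome points directly at the bad position, which is what lets the hardware correct the error instead of crashing.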
6
Sun's UE10K Dynamic Reconfiguration Weaknesses
  • Sun's UE10K implementation of DR is not quite as
    dynamic as Sun would have you believe. It's a
    marketing tale!
  • Hot-swapping I/O requires that CPU and memory
    also be brought down.
  • Any DR activity requires that the database be
    shut down, making applications unavailable
    during the process.
  • DR cannot be used in combination with memory
    interleaving across system boards, and giving up
    interleaving reduces maximum performance. Sun
    customers have to choose between good system
    performance and DR functionality, but cannot get
    both at the same time!
  • DR is not supported in combination with
    SunCluster fail-over. Because the system halts
    during a DR operation, SunCluster considers it
    to be failing and starts a fail-over procedure
    to another system. Sun customers have to choose
    between a true multi-system, high-availability
    solution and the use of DR, but cannot get both
    at the same time!
  • DR conflicts with Intimate Shared Memory (ISM),
    which demanding applications use. To improve
    performance, most memory-intensive applications,
    like databases, make use of the ISM capability
    in the E10000. Most applications using ISM do not
    allow dynamic addition or removal of their shared
    memory allocation. Running memory-intensive
    applications with ISM (like large databases)
    while making the most efficient use of partitions
    therefore prevents the use of DR.
  • Deactivating or moving a system board with full
    memory can take 15 minutes (to back up and
    rearrange memory contents). All activities in the
    affected partition(s) have to be paused during
    that time! (To compensate, Sun introduced TurboDR
    boards with just CPUs and no memory.)

Source: John Wiltschut, BSTO Marketing
7
  • Why Sun is being defensive: Superdome vs.
    E10000

8
Sun blames HP and IBM for copying the E10000
  • The truth is:
  • Superdome is more original than the E10000 has
    ever been; the E10K is an exact copy of the Cray
    CS6400.
  • Sun is just playing catch-up with the E10000's
    inferior performance, reliability and
    functionality.
  • The E10000 is an end-of-line product based on
    old technology and without future expansion
    capabilities.
  • Superdome is built as an advanced architecture
    based on the latest technology and with very
    strong growth potential.
  • Sun has never developed a high-end server by
    itself.

Heard of Superdome?
9
The E10000 is COPIED by Sun (from Cray)
  • The CS6400 was developed by Cray and announced in
    1993.
  • It supported up to 64 SuperSPARC processors (60
    MHz) and ran CRS-OS, based on Solaris, but
    modified by Cray.
  • Most CS6400 systems used fewer than 30 CPUs, as
    the system did not scale very well.
  • In 1996 Sun purchased this technology from
    Cray/SGI and introduced a copy in 1997 under the
    name E10000.
  • All basic technology was already present in the
    CS6400, and Sun has never added any breakthrough
    improvements.

10
  • HP Superdome supports 64 CPUs in a single system
    with SMP functionality.
  • Superdome is built as an advanced architecture
    based on the latest technology and with very
    strong growth potential. Its modular packaging
    allows a half-size configuration for up to 32
    processors.
  • Superdome has 3 base cabinet configurations. The
    E10K comes only in full size, even with just a
    few CPUs.
  • A 48-CPU Superdome delivers 71% more performance
    in a system that is only 20% wider than a 64-CPU
    E10000.

64 SMP CPUs in Single Cabinet
  • Sun claims
  • Supported with Solaris since 1993
  • The reality
  • The Cray CS6400 (announced in 1993) was not
    developed by Sun, ran CRS-OS and had very limited
    scalability.
  • The E10K is a copy of the CS6400 without
    significant breakthrough technology added by Sun.

based on TPC benchmark with Oracle
11
Full Dynamic Partitioning
  • HP is the first vendor to provide the full
    spectrum of partitioning: Hyperplex, nPartitions,
    virtual partitions and automatic resource
    partitioning. The different levels of
    partitioning can be combined as desired.
  • nPartitions can be added and removed within an
    active Superdome.
  • Virtual Partitions are dynamic at the CPU level,
    not just the cell level.
  • Sun claims
  • Supported with Solaris since 1997
  • Sun still does not support full dynamic
    partitioning (it does not support dynamic control
    by applications). Dynamic System Domains (DSD)
    require operator intervention and usually a
    reboot.
  • The use of DSD has many limitations: it cannot
    be combined with memory interleaving, SunCluster
    fail-over or Intimate Shared Memory. Domains
    always have to be multiples of 4 CPUs.

The reality
See the whitepaper "DSD and DR: the true story".
12
Only HP offers the full spectrum of partitioning
(Diagram axes: isolation vs. flexibility)
Hyperplex: hard partitions with multiple nodes
  • complete hardware and software isolation
  • multiple OS images
nPartitions: hard partitions within a node
  • hardware isolation per cell
  • complete software isolation
  • multiple OS images
Virtual partitions: virtual partitions within hard partitions
  • software isolation
  • multiple OS images
Resource partitions: PRM (Process Resource Manager) and HP-UX WLM (Workload Manager)
  • dynamic resource allocation
  • automatic goal-based resource allocation via set SLOs
  • 1 OS image
Sun's offerings:
SunCluster
  • no high-speed interconnect
  • 8 nodes max.
  • doesn't work with Sun's DR
Dynamic System Domains (DSD)
  • require a reboot in most situations
  • difficult to modify the configuration (Sun experts are usually needed)
Solaris Resource Manager (SRM)
  • expensive
  • doesn't manage I/O
  • not goal-based like HP-UX WLM
No... Sun can't match.
13
  • HP-UX can dynamically deallocate processors and
    memory with DPR and DMR (dynamic processor and
    memory resilience) in case of failures. This is a
    fully automatic process.
  • Cell boards can be added and removed in an active
    Superdome.
  • HP has been using error checking and correcting
    in cache memory to prevent most processor and
    system failures. Sun has not in the US II.

Automated DR / Hot-Swap CPU and Memory
  • Sun claims
  • Supported with Solaris since 2000/1997

The reality
  • Automated DR is nothing more than scripting of an
    otherwise manual cell board replacement process.
    Dynamic Reconfiguration (DR) has many limitations
    (similar to those of DSDs).
  • If a processor fails then the domain crashes and
    a reboot is required. This is neither automatic
    nor dynamic.

DR: Dynamic Reconfiguration. See the whitepaper
"DSD and DR: the true story".
14
Interdomain Networking
  • HP supports other high-speed communication links,
    such as Hyperfabric and Fibre Channel, and
    recommends against using IDN because of the lack
    of isolation between partitions.
  • Sun claims
  • Supported with Solaris since 1999

The reality
  • Interdomain networking (IDN) uses shared memory,
    and the connected domains are not isolated from
    failures in the other domains. Because IDN
    violates hardware isolation (the main reason for
    partitioning), it increases the risk of downtime.
  • Sun does not support high-speed interconnect like
    Hyperfabric for high-bandwidth data transfer
    between nodes and partitions.

15
Clustered File Systems
  • HP supports multiple file system options
    depending on customer needs. CIFS/9000 is a
    global file system supporting multi-platform,
    multi-OS file systems.
  • MC/ServiceGuard provides a superior, mature
    solution, with support for up to 16 nodes and
    hundreds of applications, and has more than
    45,000 installations. Hyperplex supports hundreds
    of clustered nodes.
  • Sun claims
  • Supported with Solaris since 2000 (December)

The reality
  • This was promised for SunCluster 3.0 but was
    never delivered (confirmed during the press
    conference). Sun tries to get around it by using
    marketing terms like "cluster-aware file system"
    and "cluster file service".
  • Sun's clustering solutions have always been
    behind, and customers have always preferred other
    solutions. Even now SunCluster 3.0 supports only
    8 nodes and is focused on Solaris only.

16
Global Network Services
  • HP's MC/ServiceGuard already provides flexible
    IP addresses so that applications can fail over
    to other nodes in a cluster without any problem.
  • HP is focused on supporting multi-platform,
    multi-OS environments based on customer demand.
  • Sun claims
  • Supported with Solaris since 2000 (December)

The reality
  • This is mainly about abstracting an IP service
    from a network interface, so that applications
    can be moved within a cluster (HA fail-over). To
    put it in Sun's terms: nothing new...
  • Sun is focused on Solaris-only solutions with no
    support for multi-OS.

17
What Sun does not say...
  • Sun's current systems do not have Error Checking
    and Correcting, Dynamic Processor and Memory
    Resilience or Chip-Kill technology.
  • Analysts and press have reported serious problems
    with Sun E10000 systems at customer sites. See
    the Forbes and Gartner articles.

Reliability
  • The US II processor lacks performance compared
    with HP's current offerings, resulting in much
    lower system performance. Even the US III will
    barely match current PA-RISC performance levels.

Performance
Sun's systems are lagging in all these areas
I/O bandwidth
  • Today's applications, such as broadband and data
    warehousing, require high I/O bandwidth, which
    Sun does not deliver.
  • Current Sun products are basically end-of-life.
    The US III requires new boxes and runs only the
    Solaris 8 OS.

Investment protection
  • Sun's vision is limited to Solaris/SPARC only
    and does not extend to multi-platform environments.

Multi-platform support
18
Who is really playing Catch-Up?
19
Leadership in performance, flexibility and availability: HP Superdome vs. Sun E10000
                        HP Superdome   Sun E10000
Scalability
  CPUs                  64             64
  Memory                256            64/128
  I/O                   192            64
Performance (tpm)       200K           115K/156K
Also rated (leadership, limited or weakness): flexibility (Hyperplex,
nPartitions, virtual partitions, resource partitions), utility pricing
(iCOD), IA-64, multi-OS, availability (multi-system, single system) and
investment protection.
20
Sun's Dark Secret
"Sun Screen: Sun Microsystems servers have been
crashing for more than a year. Sun has kept the
flaw secret, and hasn't yet fixed it." (11/13/2000)
21
  • Sun and HP
  • Reliability
  • Comparisons

22
Why HP can fulfill customer needs better than
Sun
  • HP understands what availability really means.
    Availability is the BASE upon which all other
    features are built.

23
Reliability Comparison
                                       HP    UE10K  SUNFIRE
CPU
  Internal cache error correction      YES   NO     NO
  Dynamic processor resilience         YES   SOME   SOME
MEMORY
  Chip kill protection                 YES   YES    NO
  HW scrubbing                         YES   NO     NO
  Dynamic memory resilience            YES   NO     NO
IO
  PCI bus error isolation              YES   NO     NO
  Full PCI OLAR                        YES   NO     NO
BACKPLANE
  Address bus ECC                      YES   NO     NO
  Redundant DC/DC converters           YES   NO     NO
  Full stuck-at bit correction         YES   NO     NO
  Interconnect reliability experience  YES   NO     NO
24
Reliability Comparison (2)
                                       HP       UE10K  SUNFIRE
SOLUTION LEVEL
  5-nines solution availability        YES      NO     NO
  Data-center-wide HA solutions        YES      NO     NO
  Customer care for quality issues     YES (*)  NO     NO
  Proven domain isolation              YES      NO     NO
  Solution level verification          YES      ?      ?
  Cosmic ray tolerance                 YES      NO     NO
HP projects that the above reliability
oversights result in Sun systems with 2-4x
greater failure rates than HP systems. This has
been proven by field experience.
(*) Rather than blame customers for quality
problems, HP closely tracks field data and works
PROACTIVELY to fix potential field quality
problems.
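"5 nines" is shorthand for 99.999% availability. A minimal sketch of the arithmetic behind it, assuming a 365-day year (leap years ignored), converts an availability level into the downtime it allows per year; the function names are illustrative only.

```python
# Convert "N nines" of availability into allowed downtime per year.
# Assumes a 365-day year (leap years ignored).

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def nines_to_availability(nines: int) -> float:
    """E.g. 3 -> 0.999, 5 -> 0.99999."""
    return 1.0 - 10.0 ** (-nines)


def downtime_minutes_per_year(availability: float) -> float:
    """Minutes per year the system may be down at this availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for n in range(3, 6):
    a = nines_to_availability(n)
    print(f"{n} nines ({a:.5%}): "
          f"{downtime_minutes_per_year(a):.1f} min/year")
```

For five nines this works out to about 5.3 minutes of downtime per year, versus about 526 minutes (nearly 9 hours) for three nines.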