The AMD Opteron

Slides: 120

Provided by: ITCLabsand2

Learn more at: https://www.cs.virginia.edu

1
The AMD Opteron

Henry Cook
Kum Sackey
Andrew Weatherton

2
Presentation Outline

History and Goals
Improvements
Pipeline Structure
Performance Comparisons

3
K8 Architecture Development

The Nx586, March 1994
Superscalar
Designed by NexGen
Manufactured by IBM
70-111MHz
32KB L1 cache
3.5 million transistors
.5 micron process

4
K8 Architecture Development

AMD SSA/5 (K5)
March 1996
Built by AMD from the ground up
Superscalar architecture
out-of-order speculative execution
branch prediction
integrated FPU
power-management
75-117MHz
Ran hot
24KB L1 cache
4.5 million transistors
.35 micron process

5
K8 Architecture Development

AMD K6 (1997)
Based on NexGen's RISC86 core (as used in the Nx586)
166-300MHz
64KB L1 cache
8.8 million transistors
.25 micron process

6
K8 Architecture Development

AMD K6 (1997) continued
Advantages of K6 over K5
RISC86 core translates complex x86 instructions
into shorter ones, allowing the K6 to reach
higher frequencies than the K5 core.
Larger L1 cache.
New MMX instructions.
AMD produced both desktop and mobile K6
processors, the only difference being a lower
processor core voltage for the mobile part

7
K8 Architecture Development

First AMD Athlons, K7 (June 23, 1999)
Based on the K6 core
Improved the K6's FPU
128 KB (2x64 KB) L1 cache
Initially 500-700MHz
22 million transistors
.25 micron process

8
K8 Architecture Development

AMD Athlons, K7 continued
1999-2002: held the fastest x86 title off and on
First to 1GHz clock speed
Intel suffered a series of major production,
design, and quality control issues at this time.
Changed from slot to socket format
Athlon XP desktop
Athlon XP-M laptop
Athlon MP server

9
K8 Architecture Development

AMD Athlons, K7 continued
Final (5th) revision, the Barton
400 MHz FSB (up from 200 MHz)
Up to 2.2 GHz clock
512 KB L2 cache, on-die
54.3 million transistors
.13 micron process
In 2004 AMD began using 90nm process on XP-M

10
The AMD Opteron

Built on the K8 Core
Released April 22, 2003
AMD's AMD64 (x86-64) ISA
Direct Connect Architecture
Integrated memory controllers
HyperTransport interface
Native execution of x86 64-bit apps
Native execution of x86 32-bit apps with no speed
penalty!

11
Opteron vs. Intel Offerings

Targeted at the server market
64-bit computing
Registered memory
Initial direct competitor was the Itanium
Itanium was the only other 64-bit processor
architecture with 32-bit x86 compatibility
But, 32-bit software support was not native
Emulated 32-bit performance took a significant hit

12
Opteron vs. ???

Opteron had no real competition
Near 1:1 multi-processor scaling
Intel CPUs share a single common bus
Integrated memory controller: each CPU can access
local RAM without using the HyperTransport bus for
processor-memory communication.
Contention for a shared bus leads to decreased
efficiency; this is not an issue for the Opteron
Still did not dominate the market

13
Opteron Layout
14
Other New Opteron Features

48-bit virtual address space and a 40-bit
physical address space
ECC (error correcting code) protection for L1
cache data, L2 cache data and tags, and DRAM,
with hardware scrubbing of all ECC-protected
arrays

15
Other New Opteron Features

Lower thermal output, improved frequency scaling
via .13 micron SOI (silicon-on-insulator) process
technology
Two additional pipeline stages (compared to K7)
for increased performance and frequency
scalability
Higher IPC (instructions-per-clock) with larger
TLBs, flush filters, and enhanced branch
prediction algorithms

16
64-bit Computing

Move beyond the 4GB virtual-address space ceiling
32-bit systems impose
Servers and apps like databases, content
creation, MCAD, and design-automation tools push
that boundary.
AMD's implementation allows:
Up to 256TB of virtual-address space
Up to 1TB of physical memory
No performance penalty
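
The address-space figures on this slide follow directly from the 48-bit virtual and 40-bit physical address widths listed under "Other New Opteron Features". A minimal sketch of the arithmetic, assuming binary units (1 TB = 2^40 bytes); the constant and function names are illustrative only:

```python
# Address-space sizes implied by the address widths quoted in the slides.
VIRTUAL_BITS = 48   # Opteron virtual address width (from the slides)
PHYSICAL_BITS = 40  # Opteron physical address width (from the slides)
LEGACY_BITS = 32    # 32-bit x86 virtual address width

GIB = 2 ** 30  # binary gigabyte (assumed unit)
TIB = 2 ** 40  # binary terabyte (assumed unit)


def address_space_bytes(bits):
    """Number of distinct byte addresses for a given address width."""
    return 2 ** bits


virtual_tb = address_space_bytes(VIRTUAL_BITS) // TIB
physical_tb = address_space_bytes(PHYSICAL_BITS) // TIB
legacy_gb = address_space_bytes(LEGACY_BITS) // GIB

print(f"32-bit virtual space: {legacy_gb} GB")
print(f"AMD64 virtual space:  {virtual_tb} TB")
print(f"AMD64 physical space: {physical_tb} TB")
```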

17
64-bit Computing Contd

AMD believes the following desktop apps stand to
benefit the most from its architecture, once
64-bit becomes more widespread
3D gaming
Codecs
Compression algorithms
Encryption
Internet content serving
Rendering

18
AMD and 64-bit Computing

Goal is not immediate transition to 64-bit
operation
Like Intel's transition to 32-bit with the 386
AMD's Brunner: "The transition will occur at the
pace of demand for its benefits."
Sets foundation and encourages development of
64-bit applications while fully supporting
current 32-bit standard

19
AMD64

AMD's 64-bit ISA
64-bit software support with zero-penalty 32-bit
backward compatibility
x86 based, with extensions
Cleans up x86-32 idiosyncrasies
Updated since release, e.g. with SSE3

20
AMD64 - Features

All benefits of 64-bit processing (e.g.
virtual-address space)
Added registers
Like Pentium 4 in 32-bit mode, but 8 more 64-bit
GPRs available for 64-bit
8 more XMM registers
Native 32-bit compatibility
Low translation overhead (unlike Intel)
Both 32- and 64-bit apps can be run under a 64-bit
OS

21
Register Map for AMD64
22
AMD64 More Features

RIP-relative data access: instructions can
reference data relative to the PC, which makes code
in shared libraries more efficient and able to be
mapped anywhere in the virtual address space.
NX bit: not required for 64-bit computing, but
provides for a more tightly controlled software
environment. Hardware-set permission levels make
it much more difficult for malicious code to take
control of the system.

23
AMD64 Operating Modes

Legacy mode supports 16- and 32-bit OSes and
apps, while long mode enables 64-bit OSes to
accommodate both 32- and 64-bit apps.
Legacy: OS, device drivers, and apps will run
exactly as they did prior to upgrading.
Long: drivers and apps have to be recompiled, so
software selection will be limited, at least
initially.
Most likely scenario is a 64-bit OS with 64-bit
drivers, running a mixture of 64-bit apps and
32-bit apps in compatibility mode.

24
(No Transcript)
25
(No Transcript)
26
Direct Connect Architecture

I/O Architecture for Opteron and Athlon64
Microprocessors are connected to:
Memory, through an integrated memory controller
A high performance I/O subsystem, via the
HyperTransport bus
Other CPUs, via the HyperTransport bus

27
Onboard Memory Control

Processors do not have to go through a
northbridge to access memory
128-bit memory bus
Latency is reduced and bandwidth is doubled
Each processor has its own memory interface
and its own memory
Available memory scales with the number of
processors

28
More Onboard Memory Control

DDR-SDRAM only
Up to 8 registered DDR DIMMs per processor
Memory bandwidth of up to 5.3 Gbytes/s (with
PC2700) per processor.
20% improvement over the Athlon just due to the
integrated memory controller
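
The 5.3 Gbytes/s figure can be reproduced from the 128-bit memory bus and the PC2700 data rate. A rough sketch, assuming PC2700 means DDR333 (about 333 million transfers per second) and decimal gigabytes; the function name is illustrative:

```python
# Peak memory bandwidth = bus width in bytes x transfers per second.
OPTERON_BUS_BITS = 128            # memory bus width from the slides
PC2700_TRANSFERS_PER_S = 1e9 / 3  # assumption: DDR333, ~333.3 MT/s


def peak_bandwidth_gb_s(bus_bits, transfers_per_s):
    """Peak bandwidth in decimal GB/s for a bus of the given width."""
    bytes_per_transfer = bus_bits / 8
    return bytes_per_transfer * transfers_per_s / 1e9


dual_channel = peak_bandwidth_gb_s(OPTERON_BUS_BITS, PC2700_TRANSFERS_PER_S)
single_channel = peak_bandwidth_gb_s(64, PC2700_TRANSFERS_PER_S)

print(f"128-bit PC2700: {dual_channel:.1f} GB/s")
print(f" 64-bit PC2700: {single_channel:.1f} GB/s")
```

Under these assumptions the 64-bit case comes to about 2.7 GB/s, which is where the PC2700 name comes from, and the 128-bit bus doubles it.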

29
HyperTransport

Bidirectional, serial/parallel, scalable,
high-bandwidth low-latency bus
Packet based
32-bit words regardless of physical width
Facilitates power management and low latencies

30
HyperTransport in the Opteron

16-bit CAD HyperTransport (16 bits wide; CAD = Command,
Address, Data)
Processor-to-processor and processor-to-chipset
bandwidth of up to 6.4 GB/s (per HT port)
50% more than what the latest Pentium 4 or Xeon
processors offer
8-bit wide HyperTransport for components such as
normal I/O hubs

31
More Opteron HyperTransport

Number of HyperTransport channels
(up to 3) determined by number of CPUs
19.2 Gbytes/s of peak bandwidth per processor
All are bi-directional, quad-pumped
Low power consumption (1.2 W) reduces system
thermal budget

32
(No Transcript)
33
(No Transcript)
34
More HyperTransport

Auto-negotiated bus widths
Devices negotiate sizes during initialization
2-bit lines to 32-bit lines.
Busses of various widths can be mixed together in
a single application
Allows for high speed busses between main memory
and the CPU and lower speed busses to peripherals
as appropriate
PCI compatible but 80x faster

35
DCA InterCPU Connections

Multiple CPUs connected through a proprietary
extension running on additional HyperTransport
interfaces
Allows support of a cache-coherent, Non-Uniform
Memory Access, multi-CPU memory access protocol

36
DCA InterCPU Connections

Non-Uniform Memory Access
Separate cache memory for each processor
Memory access time depends on memory location.
(i.e. local faster than non-local)
Cache coherence
Integrity of data stored in local caches of a
shared resource
Each CPU can access the main memory of another
processor, transparent to the programmer

37
DCA Enables Multiple CPUs

Integrated memory controller allows local memory
access without using HyperTransport
For non-local memory access and interprocessor
communication, only the initiator and target are
involved, keeping bus utilization to a minimum.
All CPUs in multiprocessor Intel Xeon systems
share a single common bus for both
Contention for the shared bus reduces efficiency

38
Multicore vs Multi-Processor

In multi-processor systems (more than one Opteron
on a single motherboard), the CPUs communicate
using the Direct Connect Architecture
Most retail motherboards offer one or two CPU
sockets
The Opteron CPU directly supports up to an 8-way
configuration (found in mid-level servers)

39
Multicore vs Multi-Processor

With multicore, each physical Opteron chip
contains two separate processor cores (more
someday soon?)
Doubles the compute power available to each
motherboard socket. One socket can deliver the
performance of two processors, two deliver a
four-processor equivalent, etc.

40
Future Improvements

Dual-Core vs. Double-Core
Dual core: two processors on a single die
Double core: two single-core processors in one
package
Better for manufacturing
Intel Pentium D 900 "Presler"
Combined L2 cache
Quad-core, etc.

41
K7 vs. K8 Changes
42
Summary of Changes From K7 to K8

Deeper, wider pipeline
Better Branch Predictor
Large workload TLB
HyperTransport capabilities eliminate Northbridge
and allow low latency communication between
processors as well as I/O
Larger L2 cache with higher bandwidth and lower
latency
AMD64 ISA allowing for 64-bit operation

43
The K7 Basics

3 x86 decoding units
3 integer units (ALU)
3 floating point units (FPU)
A 128KB L1 cache
Designed for efficiency
High IPC (instructions per cycle)
The K7's units can handle up to 9 instructions per
clock cycle

44
The K8 Basics

3 x86 decoding units
3 integer units (ALU)
3 floating point units (FPU)
A 1MB L2 cache

45
The K7 Core
46
The K8 Core
47
Things To Note About the K8

Schedules a large number of instructions
simultaneously
3 8-entry schedulers for integer instructions
A 36-entry scheduler for floating point
instructions
Compared to the K7, the K8 allows for more
integer instructions to be active in the
pipeline. How is this possible?

48
Processor Constraints

A 'bigger' processor has more execution units
(width) and more stages in the pipeline (depth)
Processor 'size' is limited by the accuracy of
the branch predictor
It determines how many instructions can be active in
the pipeline before an incorrect branch
prediction occurs
In theory, a CPU should only accommodate the number
of instructions that can be sent down a pipe before
a misprediction

49
The K8 Branch Predictor
50
The K8 Branch Predictor Details

Compared to the K7, the K8 has improved branch
prediction
Global history counter (GHC) is 4x the previous size
The GHC is a massive array of 2-bit (0-3) counters,
indexed by part of an instruction's address
If the value is 2 or greater, the branch is
predicted as "taken"
Taken branches increment the counter
Untaken branches decrement it
The larger global history counter means more
instruction addresses can be saved, thus
increasing branch predictor accuracy
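
The counter scheme described on this slide can be sketched in a few lines. This is a generic table of 2-bit saturating counters, not AMD's actual predictor: the table size, the initial counter value, and the assumption that counter values 2 and 3 mean "taken" are all illustrative choices.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters indexed by low address bits."""

    def __init__(self, entries=16):
        # entries must be a power of two so a bit mask can select the index
        self.mask = entries - 1
        self.counters = [1] * entries  # assumption: start at "weakly not taken"

    def _index(self, address):
        return address & self.mask

    def predict(self, address):
        # Counter values 2 and 3 predict "taken", 0 and 1 "not taken"
        return self.counters[self._index(address)] >= 2

    def update(self, address, taken):
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_correct(predictor, address, outcomes):
    """Feed a sequence of branch outcomes and count correct predictions."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(address) == taken:
            correct += 1
        predictor.update(address, taken)
    return correct


# A loop branch: taken 9 times, then falls through, executed twice.
loop_outcomes = ([True] * 9 + [False]) * 2
correct = count_correct(TwoBitPredictor(), 0x400, loop_outcomes)
print(f"{correct} of {len(loop_outcomes)} predictions correct")
```

With more table entries, fewer distinct branch addresses collide on the same counter, which is the benefit the slide attributes to the larger table.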

51
Translation Look-aside Buffer

The number of TLB entries has been increased
Helps performance in servers with large memory
requirements
Desktop performance impact will be limited to a
small boost when running 3D rendering software

52
HyperTransport: Typical CPU to Memory Set-Up

The CPU sends a 200MHz clock to the north bridge; this
is the FSB.
The bus between the north bridge and the CPU is 64
bits wide at 200MHz (quad-pumped, for 4 transfers
per cycle), giving an effective rate of 800MHz
The memory bus is also 200MHz and 64 or 128 bits
wide (single or dual channel). As it is DDR
memory, two 64/128-bit packets are sent every
clock cycle.

53
HyperTransport: Opteron Memory Set-Up

The integrated memory controller does not improve the
memory bandwidth, but drastically reduces memory
request time
HyperTransport uses a 16-bit wide bus at 800MHz
and a double data rate system that enables a
3.2GB/s peak bandwidth one-way
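
The HyperTransport numbers quoted across these slides (3.2 GB/s one-way, 6.4 GB/s per port, 19.2 GB/s per processor) are consistent with each other. A small sketch of the arithmetic, assuming decimal gigabytes, two transfers per clock, and that the per-port figure counts both directions; the function name is illustrative:

```python
# Link parameters taken from the slides.
HT_WIDTH_BITS = 16     # link width in each direction
HT_CLOCK_HZ = 800e6    # link clock
TRANSFERS_PER_CLOCK = 2  # double data rate
HT_PORTS_PER_CPU = 3   # HyperTransport ports on an Opteron


def link_bandwidth_gb_s(width_bits, clock_hz, transfers_per_clock):
    """One-way peak bandwidth of a link, in decimal GB/s."""
    return width_bits / 8 * clock_hz * transfers_per_clock / 1e9


one_way = link_bandwidth_gb_s(HT_WIDTH_BITS, HT_CLOCK_HZ, TRANSFERS_PER_CLOCK)
per_port = 2 * one_way                 # both directions of one port
per_cpu = HT_PORTS_PER_CPU * per_port  # all three ports combined

print(f"one-way:  {one_way:.1f} GB/s")
print(f"per port: {per_port:.1f} GB/s")
print(f"per CPU:  {per_cpu:.1f} GB/s")
```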

54
(No Transcript)
55
Pros and Cons

Pros
The performance of the K8's integrated controller
increases with CPU speed, and so does the
request speed.
The addressable memory size and the total
bandwidth increase with the number of CPUs
Cons
The memory controller is tailored to a specific
memory type, and is not very flexible about upgrading

56
Caches
57
L1 Cache Comparison
CPU                              K8              Pentium 4 Prescott
Size                             code: 64KB      TC: 12Kµops
                                 data: 64KB      data: 16KB
Associativity                    code: 2 way     TC: 8 way
                                 data: 2 way     data: 8 way
Cache line size                  code: 64 bytes  TC: n.a.
                                 data: 64 bytes  data: 64 bytes
Write policy                     Write Back      Write Through
Latency (given by manufacturer)  3 cycles        4 cycles
58
K8 L1 Cache

Compared to the Intel machine, the large size of
the L1 cache allows for a bigger block size
Pros: a big range of data or code in the same
memory area
Cons: low associativity tends to create conflicts
during the caching phase.
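
The trade-off between block size and associativity can be made concrete from the sizes in the L1 comparison table. A minimal sketch using the standard set-associative indexing scheme; the helper names are illustrative and the real K8 indexing details are not covered here:

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of sets, bytes per way) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    way_bytes = size_bytes // ways
    return sets, way_bytes


def set_index(address, sets, line_bytes):
    """Set that an address maps to under plain modulo indexing."""
    return (address // line_bytes) % sets


# Figures from the L1 comparison table (data caches).
k8_sets, k8_way = cache_geometry(64 * 1024, ways=2, line_bytes=64)
p4_sets, p4_way = cache_geometry(16 * 1024, ways=8, line_bytes=64)

print(f"K8 L1 data: {k8_sets} sets, {k8_way // 1024}KB per way")
print(f"P4 L1 data: {p4_sets} sets, {p4_way // 1024}KB per way")

# Addresses one way-size apart land in the same set. With only 2 ways,
# a third such address forces an eviction (a conflict).
base = 0x10000
conflicting = [base + n * k8_way for n in range(3)]
same_set = {set_index(a, k8_sets, 64) for a in conflicting}
print(f"3 addresses 32KB apart map to {len(same_set)} set(s)")
```

Each K8 way covers a 32KB block of contiguous memory, but only two lines that share a set index can be resident at once.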

59
(No Transcript)
60
L2 Cache Comparison
CPU                              K8                  Pentium 4 Prescott
Size                             512KB (NewCastle)   1024KB
                                 1024KB (Hammer)
Associativity                    16 way              8 way
Cache line size                  64 bytes            64 bytes
Latency (given by manufacturer)  11 cycles           11 cycles
Bus width                        128 bits            256 bits
L1 relationship                  exclusive           inclusive
61
K8 L2 cache

The L2 cache of the K8 shares a lot of common features
with that of the K7.
The K8's L2 cache uses 16-way set associativity
to partially compensate for the low
associativity of the L1.
Although the bus width in the K8 is double what
the K7 offered, it is still smaller than in the
Intel model
The K8 also includes hardware prefetch logic
that fetches data from memory into the L2
cache while the memory bus is idle.

62
(No Transcript)
63
Inclusive vs. Exclusive Caching

Inclusive caching: used by the Intel P4
L1 cache contains a subset of the L2 cache
On an L1 miss/L2 hit, data is copied into
the L1 cache and forwarded to the CPU
On an L1/L2 miss, data is copied from memory
into both L1 and L2 caches

64
Inclusive vs. Exclusive Caching

Exclusive caching: used by the Opteron
L1 and L2 caches cannot contain the same data
On an L1 miss/L2 hit:
One line is evicted from the L1 cache into the L2
The L2 cache copies the requested data into the L1 cache
On an L1/L2 miss, data is copied into the L1
cache alone
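
The exclusive policy on this slide can be modelled with a toy simulation. The sketch below assumes fully associative caches with LRU replacement and tracks line addresses only, which is far simpler than the real K8 hierarchy; the class and method names are hypothetical.

```python
from collections import OrderedDict


class ExclusiveCache:
    """Toy two-level exclusive cache: a line lives in L1 or L2, never both."""

    def __init__(self, l1_lines, l2_lines):
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines
        self.l1 = OrderedDict()  # oldest entry first (LRU order)
        self.l2 = OrderedDict()

    def access(self, line):
        if line in self.l1:
            self.l1.move_to_end(line)
            return "L1 hit"
        if line in self.l2:
            del self.l2[line]    # the line moves up, it is not duplicated
            self._fill_l1(line)
            return "L2 hit"
        self._fill_l1(line)      # on a full miss only L1 is filled
        return "miss"

    def _fill_l1(self, line):
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)  # evict LRU line from L1
            if len(self.l2) >= self.l2_lines:
                self.l2.popitem(last=False)          # make room in L2
            self.l2[victim] = True                   # victim goes to L2
        self.l1[line] = True


cache = ExclusiveCache(l1_lines=2, l2_lines=2)
results = [cache.access(line) for line in (1, 2, 3, 1)]
print(results)
print("L1:", list(cache.l1), "L2:", list(cache.l2))
```

Because no line is ever held twice, the usable capacity of this model is the sum of the two levels (4 lines here), which matches the "total cache size is the sum of the sub-level sizes" advantage listed for exclusive caching.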

65
Drawback of Exclusive Caching and its solution

Problem: a line from the L1 must be copied to the
L2 before getting back the data from the L2.
This takes many clock cycles, adding to the time
needed to get data from the L2
Solution:
A victim buffer (VB), a very small and
fast memory between L1 and L2.
The line evicted from L1 is copied into the
VB rather than into the L2.
At the same time, the L2 read request is started,
so the L1-to-VB write operation is hidden
by the L2 latency
Then, if by chance the next requested data is in
the VB, getting it back from there is much
quicker than getting it from the L2.
The VB is a good improvement, but it is
limited by its small size (generally between 8
and 16 cache lines). Moreover, when the VB is
full, it must be flushed into the L2, which is an
additional step and needs some extra cycles.

66
Drawback of Inclusive

The constraint on the L1/L2 size ratio needs the
L1 to be small,
but a small size will result in reducing its
success rate, and consequently its performance.
On the other hand, if it is too big, the ratio
will be too large for good performance of the L2.
Reduces flexibility when deciding size of L1 and
L2 caches
It is very hard to build a CPU line with such
constraints. Intel released the Celeron P4 as a
budget CPU, but its 128KB L2 cache
crippled its performance.
Total useful cache size is reduced since data is
duplicated over the caches

67
Inclusive vs. Exclusive Caching
Exclusive
Pros: No constraint on the L2 size. Total cache size is the sum of the sub-level sizes.
Cons: L2 performance decreases
Inclusive
Pros: L2 performance
Cons: Constraint on the L1/L2 size ratio. Total cache size is effectively reduced.
68
The Pipeline
69
K7 vs. K8 Pipeline Comparison
70
The Fetch Stage

Two Cycles Long
Feeds 3 decoders with 16 instruction bytes each
cycle
Uses the L1 code cache and the branch prediction
logic.

71
The Decode Stage

The decoders convert x86 instructions into
fixed-length micro-operations (µOPs).
Can generate 3 µOPs per cycle
FastPath: "simple" instructions, those that
decode into 1-2 µOPs, are decoded by hardware, then
packed and dispatched
Microcoded path: complex instructions are decoded
using the internal ROM
Compared to the K7, more instructions in the K8
use the fast path, especially SSE instructions.
AMD claims that the number of microcoded
instructions decreased by 8% for integer and 28% for
floating point instructions.

72
Instruction Dispatch

There are:
Three address generation units (AGU)
Three integer units (ALU). Most operations
complete within a cycle, in both 32 and 64 bits:
addition, rotation, shift, logical operations
(and, or).
Integer multiplication has a 3-cycle latency in
32 bits, and a 5-cycle latency in 64 bits.
Three floating point units (FPU), which handle
x87, MMX, 3DNow!, SSE and SSE2.

73
Load/Store Stage

Last stage of the pipeline process
Uses the L1 data cache.
The L1 is dual-ported to handle two 32/64-bit
reads or writes each clock cycle.

74
Cache Summary

Compared to the K7, the K8 cache provides higher
bandwidth and lower latencies
In contrast to the Intel P4, the K8 caches are
write-back and exclusive

75
AMD 64 GPR encoding

IA32 instruction encoding uses a
special byte called the ModRM (Mode / Register /
Memory), in which the source and
destination registers of the instruction are encoded.
3 bits encode the source register, 3 bits encode
the destination

There's no way to change the ModRM byte since
that would break IA32 compatibility. So to allow
instructions to use the 8 new GPRs, an additional
prefix named REX is added outside the ModRM; it
supplies the extra register-select bits.
The REX is used only in long (64-bit) mode, and
only if the specified instruction is a 64-bit one
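
How the extra bit widens the two 3-bit register fields can be shown by decoding a ModRM byte by hand. The sketch below handles only the register-to-register form (mod = 3) and ignores SIB bytes and memory operands; the function name is illustrative. It relies on the documented REX layout 0100WRXB, where R extends the `reg` field and B extends the `rm` field.

```python
GPR64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
         "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]


def decode_modrm(modrm, rex=0):
    """Split a ModRM byte into (mod, reg, rm), widened by a REX prefix."""
    mod = (modrm >> 6) & 0b11
    reg = (modrm >> 3) & 0b111  # 3-bit register field
    rm = modrm & 0b111          # 3-bit register/memory field
    if rex:
        if (rex & 0xF0) != 0x40:
            raise ValueError("REX prefixes have the form 0100WRXB")
        reg |= ((rex >> 2) & 1) << 3  # REX.R supplies bit 3 of reg
        rm |= (rex & 1) << 3          # REX.B supplies bit 3 of rm
    return mod, reg, rm


# Bytes 89 C8 encode "mov eax, ecx" (opcode 89: reg is source, rm is dest).
legacy = decode_modrm(0xC8)
# Bytes 4C 89 C8 encode "mov rax, r9": REX.W = 1 and REX.R = 1.
extended = decode_modrm(0xC8, rex=0x4C)

print("no REX:  ", legacy, "source =", GPR64[legacy[1]])
print("REX 0x4C:", extended, "source =", GPR64[extended[1]])
```

The same ModRM byte selects a different source register once the prefix is present, so existing IA32 encodings keep their meaning when no prefix is used.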

76
AMD 64 SSE

Abandoned the original MMX and 3DNow! instruction
sets because they operated on the same physical
registers
Supports SSE/SSE2 using eight SSE-dedicated
80-bit registers
If a 128-bit instruction is processed, it will
take two steps to complete
Intel's P4 allows for the use of 128-bit
registers, so 128-bit instructions only take a
single step
However, C/C++ compilers still usually output
scalar SSE instructions that only use 32/64 bits,
so the Opteron can process most SSE
instructions in one step and thus remain
competitive with the P4

77
AMD 64 One Last Trick

Suppose we want to write 1 in a register, which is
written in pseudo-code as
mov register, 1
In the case of a 32-bit register, the immediate
value 1 will be encoded on 32 bits
mov eax, 00000001h
In the case where the register is 64 bits
mov rax, 0000000000000001h
Problems? The 64-bit instruction takes 5 more
bytes to encode the same number, thus wasting space.

78
AMD 64 One Last Trick

Under AMD64, the default size for operand bits is
32.
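
The space cost of a full 64-bit immediate, and why a 32-bit default helps, can be checked by assembling the three relevant `mov` forms by hand. This sketch builds the byte sequences from the documented x86-64 encodings (B8+rd, REX.W B8+rd, and REX.W C7 /0); the helper names are illustrative and only the rax/eax register is covered.

```python
import struct


def mov_eax_imm32(value):
    """mov eax, imm32 -> B8 id (in 64-bit mode this zero-extends into rax)."""
    return b"\xB8" + struct.pack("<I", value)


def mov_rax_imm64(value):
    """mov rax, imm64 -> REX.W B8 io (full 64-bit immediate)."""
    return b"\x48\xB8" + struct.pack("<Q", value)


def mov_rax_simm32(value):
    """mov rax, imm32 -> REX.W C7 /0 id (immediate sign-extended to 64 bits)."""
    return b"\x48\xC7\xC0" + struct.pack("<i", value)


short_form = mov_eax_imm32(1)
full_form = mov_rax_imm64(1)
sign_extended = mov_rax_simm32(1)

for name, code in [("mov eax, 1", short_form),
                   ("mov rax, 1 (imm64)", full_form),
                   ("mov rax, 1 (imm32)", sign_extended)]:
    print(f"{name:20} {len(code):2} bytes  {code.hex(' ')}")
```

Since small constants are by far the most common, keeping 32-bit operands as the default lets most instructions keep their short IA32 encoding.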

79
AMD 64 One Last Trick

For memory addressing a more complicated table is
used.

80
AMD 64 Code Size

Cpuid.org estimated that 64-bit code will be
20-25% bigger than the same code built from IA32
instructions.
However, the use of sixteen GPRs will tend to
reduce the number of instructions, and perhaps
make 64-bit code shorter than 32-bit code.
The K8 is able to handle the code size increase,
thanks to its 3 decoding units and its big L1
code cache. The use of big 32KB blocks in the L1
organization now seems very useful

81
AMD 64 32-bit code vs. 64-bit Code
82

HardOCP:
"AthlonXP 3200+ got outpaced by the Athlon64
3200+... the P4 and the P4EE came in at a dead tie,
which suggests that the extra CPU cache is not a
factor in this benchmark... pipeline enhancements
made to the new K8 core certainly did impact
instructions per clock."

Note: Athlon64 3200+ runs at 2.0GHz; AthlonXP
3200+ runs at 2.2 GHz
83
AMD 64 Conclusions

Allows for a larger addressable memory size
Allows for wider GPRs and 8 more of them
Allows the use of all x86 instructions that were
available on the AMD64 by default
Can lead to smaller code that is faster as a result
of less memory shuffling

84
Opteron vs. Xeon
85
Opteron vs Xeon in a nutshell

Opteron offers better computing and per-Watt
performance at a roughly equivalent per-device
price
Opteron scales much better when moving from one
to two or even more CPUs
Fundamental limitation
Xeon processors must share one front side bus and
one memory array

86
FSB Bottleneck
Intel's Xeon
AMD's Opteron
87
Xeon and the FSB Bottleneck

External north bridge makes implementing multiple
FSB interfaces expensive and hard
Intel just has all the processors share one FSB
Relies on large on-die L3 caches to hide the issue
Problem grows with number of CPUs

88
The AMD Solution

Recall: each processor has its own integrated memory
controller and three HyperTransport ports
No NB required for memory interaction
6.4 GB/s bandwidth between all CPUs
No scaling issue!

89
Further Xeon Notes

Even 64-bit extensions would not solve the
fundamental performance bottleneck imposed by the
current architecture
Xeon can make use of Hyper-Threading
Found to improve performance by 3-5%

90
AnandTech Database Benchmarks

SQL workload based on the site's forum usage;
the database was the forums themselves
i.e. trying to be real-world
Two categories: 2-way and 4-way setups
Labels
Xeon: Clock Speed / FSB Speed / L3 Cache Size
Opteron: Clock Speed / L2 Cache Size

91
AnandTech Average Load 2-way

Longer line is better
Opterons at 2.2 GHz maintain a 5% lead
over Xeons at 3.2 GHz

92
AnandTech Average Load 4-way

With two more processors, the best Opteron system
increases its performance lead to 11%
Opterons at 1.8 GHz nearly equal Xeons at 3.0
GHz

93
AnandTech Enterprise benchmarks
Stored Procedures / Second

2-way: Xeon at 3GHz with large L3 cache does
better
4-way: Opteron jumps ahead (8.5% lead)

94
AnandTech Test Conclusions

Opteron is the clear winner for systems with more
than 2 processors
Even for dual-processor systems, Xeon essentially only
ties
Clearly illustrates the scaling bottleneck
Xeons are using most of their huge (4MB) L3 cache
to keep traffic off the FSB
Also, the Opteron systems used in the tests cost ½ as much

95
Tom's Hardware Benchmarks

AMD's Opteron 250 vs. Intel's Xeon 3.6GHz
Xeon Nocona (i.e. 64-bit processing)
Results enhanced by the chipset used (875P), which has
an improved memory controller
Still suffers from a lack of memory performance
Workstation applications rather than server-based
tests

96
Tom's Hardware
97
Tom's Hardware
98
Tom's Hardware Conclusions

AMD has memory benefits, as before
Opteron better in video, Intel better with 3D but
only when the 875P chipset is used
Otherwise Opteron wins in spite of inferior
graphics hardware
Still undecided regarding 64-bit; no good applications
to benchmark on

99
K8 in Different Packages
100
K8 in Different Packages

Opteron
Server Market
Registered memory
940 pin count
Three HyperTransport links
Multi-CPU configurations (1, 2, 4, or 8 CPUs)
Multiple multi-core CPUs supported as well
Up to 8 1GB DIMMs

101
K8 in Different Packages

Athlon 64
Desktop market
Unregistered memory
754 or 939 pin count
Up to 4 1GB DIMMs
Single HyperTransport link
Single slot configurations
X2 has multiple cores in one slot
Athlon 64 FX
Same feature set as Athlon 64
Unlocked multiplier for overclocking
Offered at higher clock speeds (2.8GHz vs.
2.6GHz)

102
(No Transcript)
103
K8 in Different Packages

Turion 64
Named to evoke the "touring" concept
90nm "Lancaster" Athlon 64 core
64-bit computing
SSE3 support
High quality core yields; can run at high clock
speeds with low voltage
Similar process for low-wattage Opterons
On-chip memory controller
Saves power by running in single channel mode
Compares well to the Pentium M's extra controller on
the motherboard

104
(No Transcript)
105
Thermal Design Points

Pentium 4's TDP: 130W
Athlon 64's TDP: 89-104W
Opteron HE: 50W, EE: 30W
Athlon 64 mobiles: 50W
DTR market sector
Pentium M: 27W
Turion 64: 25W

106
K8 in Different Packages

Turion 64 continued
Uses PowerNow! Technology
Similar to Intel's SpeedStep
Identical to desktop Cool'n'Quiet
Dynamic voltage and clock frequency modulation
Operates on demand
Run Cooler and Quieter even when plugged in

107
(No Transcript)
108
(No Transcript)
109
K8 in Different Packages

AMD uses the "Mobile Technology" name
Intel has a monopoly on Centrino
Supplies wireless, chipset and CPU
Invested $300 million in Centrino advertising
Some consumers think Centrino is the only way to
get wireless connectivity in a notebook
AMD supplies only the CPU
Chipset and wireless are left up to the
motherboard manufacturer/OEM

110
Marketing

111
Intel's Marketing

Men who are Blue
Moore's Law
Megahertz
Most importantly: Money
Beginning with In order to correctly
communicate the benefits of new processors to PC
buyers it became important that Intel transfer
any brand equity from the ambiguous and
unprotected processor numbers to the company
itself

112
Industry on AMD vs. Intel

Intel spends more on R&D in one quarter than AMD
makes in a year
Intel still has a tremendous amount of arrogance
Has been shamed technologically by a flea-sized
(relatively speaking) firm
Humbling? Intel is still grudgingly turning to
the high IPC, low clock rate, dual-core, x86-64,
on-die memory controller design pioneered by its
diminutive rival.
- Geek.com

113
AMD's Marketing

Mascot: the AMD Arrow
"AMD makes superior CPUs, but the marketing
department is acting like they are still selling
the K6" - theinquirer.net
Guilty, along with Intel, of relying on poor metrics
"AMD made all the marketing hay it could on the
historically significant clock-speed number. By
trying to turn attention away from that number
now, it runs the risk of appearing to want to
change the subject when it no longer has the
perceived advantage. In marketing, appearance is
everything. And no one wants to look like a sore
loser, even when they aren't." - Forbes

114
(No Transcript)
115
AnandTech on AMD's Marketing

AMD argued that they didn't have to talk about a
new architecture, as Intel is just playing
catch-up to their current architecture.
However, we look at it like this - AMD has the
clear advantage today, and for a variety of
reasons, their stance in the marketplace has not
changed all that much.

116
Conclusion

Improvements over K7
64-bit
Integrated memory controller
HyperTransport
Pipeline
Multiprocessor scaling better than Xeon
K8 is dominant in every market performance-wise
K8 is trounced in every market in sales

117
Reason for 64-bit in Consumer Market

"If there aren't widespread, consumer-priced
64-bit machines available in three years, we're
going to have a hard time developing games that
are more compelling than last year's games." - Tim
Sweeney, Founder and President, Epic Games