Title: The x86 Server Platform
1. The x86 Server Platform
- … Resistance is futile.
- Dec 6, 2004
2. Server Shipments: Total vs. x86
3. Market Share: Servers, United States, 2Q04
Michael McLaughlin, "Market Share: Servers, United States, 2Q04," 7 October 2004, Gartner.
4. x86 Platform CPUs
- Intel
- Xeon MP: Gallatin (future: Potomac)
- Xeon SP/DP with EM64T: Nocona
- Itanium II MP: Madison (future: Montecito)
- AMD
- Opteron
5. Gallatin (MP)
- 130 nm
- 3 GHz
- 4 MB L3 cache
- FSB: 400 MHz
6. ES7000: 32 Gallatins
7. Nocona: Single Processor with EM64T
- 90 nm
- Clock speed: 3.2-3.6 GHz
- L3: 4 MB
- FSB: 800 MHz
8. Itanium II: Madison
- 130 nm
- 9 MB L3 cache
- 1.6 GHz
- FSB: 400 MHz
11. STOP
- Why Multi-Core?
- … And while we're at it, why Multi-Threading?
- It's all about the balance of
- Silicon real estate
- Compiler technology
- Cost
- Power
- … to meet the constant pressure to double performance every 18 months
12. Memory Latency vs. CPU Speed
[Chart: microprocessor on-chip clock frequency (GHz) vs. commodity DRAM access frequency ((10⁻⁹ sec)⁻¹), both on log scales from 0.01 to 10.0, by production year 1990-2010. The CPU curve climbs steadily while DRAM stays nearly flat.]
13. Processor Architecture
- When latency → 0 and bandwidth → ∞ we will have the perfect CPU
- A great deal of innovation has centered around approximating this perfect world
- CISC
- CPU Cache
- RISC
- EPIC
- Multi-Threading
- Multiple Cores
14. Complex Instruction Set Computer
- Hardware implements assembler instructions
- MULT A, B
- Hardware loads the registers, multiplies, and stores the result (see the sketch below)
- Multiple clocks are needed for an instruction
- RAM requirements are relatively small
- Compilers translate high-level languages down to assembler instructions on Von Neumann hardware
http://www.hardwarecentral.com/hardwarecentral/tutorials/2427
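
A minimal C sketch of the MULT A, B idea. The commented instruction sequence is illustrative, not actual compiler output:

```c
/* Illustrative sketch of the slide's MULT A, B example. On a CISC
   machine (e.g. x86) the compiler can emit a single instruction that
   reads an operand from memory and multiplies:
       mov  eax, [a]
       imul eax, [b]    ; one instruction: load + multiply
   The hardware sequences the micro-steps internally over several
   clocks, which is why few instructions and little RAM are needed. */
int mult(int a, int b) {
    return a * b;
}
```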
15. CPU Cache
- When CPU speeds started to increase, memory latency emerged as a bottleneck
- CPU caches were used to keep local references close to the CPU
- For SMP systems, memory banks were more than a clock away
- It is not uncommon today to find 3 orders of magnitude between the fastest and slowest memory latency (see the sketch below)
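
The cache-vs-DRAM gap is easy to observe with a pointer-chasing probe. This is a minimal sketch; the array sizes, iteration count, and use of clock() are illustrative assumptions:

```c
/* Minimal latency probe: walk a random pointer chain that either
   fits in cache or spills to DRAM. Sattolo's shuffle builds a single
   cycle so the prefetcher cannot guess the next address. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double ns_per_load(size_t n, size_t steps) {
    size_t *next = malloc(n * sizeof *next);
    if (!next) return 0.0;
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {    /* Sattolo: one big cycle */
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    volatile size_t p = 0;
    clock_t t0 = clock();
    for (size_t s = 0; s < steps; s++) p = next[p];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    free(next);
    return secs * 1e9 / steps;
}

int main(void) {
    printf("16 KB chain: %.1f ns/load (cache)\n",
           ns_per_load(16384 / sizeof(size_t), 20000000));
    printf("64 MB chain: %.1f ns/load (DRAM)\n",
           ns_per_load((64u << 20) / sizeof(size_t), 20000000));
    return 0;
}
```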
16. Reduced Instruction Set Computer
- Hardware is simplified; fewer transistors are needed for the full instruction set
- RAM requirements are higher, to store intermediate results and more code
- Compilers are more complex
- Clock speeds increase because instructions are simpler
- Deterministic, simple instructions allow pipelining (see the sketch below)
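
For contrast with the CISC sketch above, here is the same multiply decomposed RISC-style; the MIPS-like sequence in the comment is illustrative:

```c
/* On a RISC machine the multiply becomes several simple instructions:
       lw   t0, 0(a0)    # load a
       lw   t1, 0(a1)    # load b
       mul  t2, t0, t1   # register-to-register multiply
       sw   t2, 0(a2)    # store the result
   More instructions and more RAM traffic, but each instruction is
   uniform and deterministic -- exactly what deep pipelines need. */
void mult_risc(const int *a, const int *b, int *result) {
    *result = *a * *b;
}
```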
17. Pipelining
[Diagram: pipeline stages filling up, with utilization rising from 25% busy through 40%, 60%, and 80% to 100% busy. Higher clock speeds! See the arithmetic sketch below.]
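
A back-of-the-envelope sketch of why pipelining pays; the stage and instruction counts are illustrative values:

```c
/* Ideal pipeline arithmetic: S stages, N instructions.
   Unpipelined: S * N cycles (each instruction uses every stage before
   the next one starts). Pipelined: S + (N - 1) cycles (one instruction
   completes per cycle once the pipe is full). */
#include <stdio.h>

int main(void) {
    double S = 5, N = 1000;                  /* illustrative values */
    double speedup = (S * N) / (S + N - 1);
    printf("speedup = %.2fx (approaches S as N grows)\n", speedup);
    return 0;
}
```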
18. Branch Prediction
- While processing in parallel, branches occur
- Branch prediction is used to increase the probability that a specific branch will be followed
- If the prediction is incorrect, the pipeline is dead and the CPU stalls
- Statistics
- 10-20% of instructions are branches
- Predictions are incorrect about 10% of the time
- As the pipeline deepens, the probability of a miss increases and more cycles are discarded
- 80-deep pipeline / 20% branches / 10% miss ≈ 80% chance of a miss and a penalty of up to 80 cycles (see the arithmetic below)
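
A quick check of the slide's arithmetic; treating branch outcomes as independent is a simplifying assumption:

```c
/* 80-deep pipeline, ~20% branches, ~10% misprediction rate:
   about 16 branches are in flight at once, and the chance that at
   least one mispredicts is 1 - 0.9^16, roughly 80%. A miss flushes
   the pipe, discarding up to 80 cycles of work. */
#include <stdio.h>
#include <math.h>

int main(void) {
    double depth = 80.0, branch_frac = 0.20, miss_rate = 0.10;
    double in_flight = depth * branch_frac;             /* 16 branches */
    double p_any_miss = 1.0 - pow(1.0 - miss_rate, in_flight);
    printf("P(miss in flight) = %.0f%%, penalty up to %.0f cycles\n",
           p_any_miss * 100.0, depth);
    return 0;
}
```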
19. Itanium II EPIC Instruction Set: Explicitly Parallel Instruction Computing
- The compiler can indicate code that can be executed in parallel
- Both branches are pipelined (see the sketch below)
- No lost cycles due to misprediction
- The pipeline can be deeper
- Complexity continues to move into the compiler
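
A minimal C illustration of what "both branches are pipelined" means; the predicated sequence in the comment is a sketch of the Itanium-style idea, not actual compiler output:

```c
/* A compiler can replace this branch with predicated instructions
   (Itanium) or a conditional move (x86 cmov). Both "sides" flow
   through the pipeline and the predicate selects the result, so
   there is nothing to mispredict and nothing to flush:
       cmp.lt p1, p2 = x, 0    // set predicates
       (p1)   mov    x = 0     // executes only if x < 0  */
int clamp_to_zero(int x) {
    return x < 0 ? 0 : x;      /* branch-free candidate */
}
```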
20. Multi-Threading
22. Multiple Cores
- Fabrication sizes continue to diminish
- The additional real estate has been used to put more and more memory on the die
- Multi-core technology provides a new way to exploit the additional space
- Clock rates cannot continue to climb due to the excessive heat
- P = C · V² · f, where C = switch capacitance, V = supply voltage, f = clock frequency (worked example below)
- Multiple cores are the next step to providing faster execution times for applications
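
A worked example of P = C · V² · f; the 15% voltage reduction at half frequency is an illustrative assumption:

```c
/* Dynamic power P = C * V^2 * f, with values normalized to 1.0.
   Two cores at half the clock can match one fast core's aggregate
   instruction rate while drawing less power, because supply voltage
   can drop along with frequency. */
#include <stdio.h>

int main(void) {
    double C = 1.0, V = 1.0, f = 1.0;
    double one_core  = C * V * V * f;            /* = 1.00 */
    double V2 = 0.85 * V, f2 = 0.5 * f;          /* assumed scaling */
    double two_cores = 2.0 * C * V2 * V2 * f2;   /* ~= 0.72 */
    printf("one fast core:        P = %.2f\n", one_core);
    printf("two half-speed cores: P = %.2f\n", two_cores);
    return 0;
}
```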
23. (End of 2005?)
30. AMD Opteron 800 Series
- 130 nm
- Clock speed: 1.4-2.4 GHz
- L2: 1 MB
- 6.4 GB/s HyperTransport
31. Architectural Comparison
[Diagram: 4-way Opteron vs. 4-way Xeon topologies. Opteron: four CPUs linked point-to-point by 6.4 GB/s HyperTransport, each with its own 144-bit DDR memory interface and PCI-X bridges. Xeon: four CPUs sharing a 6.4 GB/s front-side bus into an SNC, with memory address buffers, I/O hubs, PCI-X bridges, and another bridge on the chipset.]
32. Mapping Workloads onto Architecture
- Consider a dichotomy of workloads
- Large Memory Model: needs a large, single system image and a large amount of coherent memory
- Database apps: SQL Server / Oracle
- Business Intelligence: Data Warehousing, Analytics
- Memory-resident databases
- 64-bit architectures allow memory addressability above 1 TB
- Small/Medium Memory Model: can be cost-effective for workloads that do not require extensive shared memory/state
- Stateless applications and Web services
- Web servers
- Clusters of systems for parallelized applications and grids
33. Large Server Vendors
- Intel announcement (Nov 19)
- Otellini said product development, marketing, and software efforts (for Itanium) will all now be aimed at "greater than four-way systems". He also said, "The mainframe isn't dead. That's where I'd like to push Itanium over time."
- The size of the SMP is affected by Intel's chipset support for coherent memory
- OEM vendors (Unisys, HP, SGI, Fujitsu, IBM)
- Each has a unique chipset to build basic four-ways into large SMP systems
- IBM has POWER5, which is a direct competitor
- Intel 32-bit and EM64T
- This could emerge as the flagship product
34. Where Are We Going?
- Since the early CISC computers, we have moved more and more of the complexity out to the compiler to achieve parallelism and fully exploit the silicon real estate
- The power requirements, along with the smaller fabrication sizes, have pushed the CPU vendors to exploit multiple cores
- The key to performance for these future machines will be the application's ability to exploit parallelism