Title: Windows 2000 Multiprocessor Scalability
1Windows 2000 Multiprocessor Scalability
- Demand Technology Software
- 1020 Eighth Avenue South, Suite 6, Naples, FL 34102
- phone (941) 261-8945 fax (941) 261-5456
- e-mail markf_at_demandtech.com
- http://www.demandtech.com
2Outline
- Processor Performance
- Windows NT Thread Scheduler
- Win32 Thread Scheduling API
- NT Scheduler tuning options
- Processor performance monitoring
- Application optimization and tuning
- Intel hardware performance
- Multiprocessing
3Windows NT Scheduler
- Multiprogramming
- Priority Queuing
- Preemptive Scheduling
- Foreground/Background
- Multiprocessing
4Multiprocessing
- Symmetric Multiprocessing (SMP)
- Multiple processors
- dedicated L1 and L2 cache
- Shared memory and shared memory bus
- All processors can perform all functions
- e.g., process I/O interrupts
- Single Dispatcher queue
- Processor affinity when a specific process must execute on a specific processor
- Performance advantage: fewer cache cold starts
5Symmetric Multiprocessing (SMP)
6Intel Multiprocessing
- Wintel and SMPs - a chronology
- 486, Pentium support for multiprocessors (LOCK prefix)
- NT 3.5 supported 2-way asymmetric multiprocessing
- Only CPU 0 could process interrupts
- Pentium Pro designed for 4-way SMPs
- NT 4.0 re-engineered for multiprocessing
- kernel locking and synchronization functions
- NT 4.0 scalability approximately 85%
- Axil and others built 8-way technology
- Pentium II Xeon chip supports native 8X
7Intel Multiprocessing
- Wintel and SMPs - a chronology
- Win2K further re-engineered for multiprocessing
- Queued spin locks reduce bus contention (PAUSE instruction)
- NT 5.0 scalability may approximate 90% or better
- Pentium III features 133 MHz memory bus, integrated L2 cache
- 16x and 32x multiprocessors planned
8Intel Multiprocessing
- SMP scalability
- Pipeline stalls
- on LOCK instructions
- on interprocessor signals
- Cache coherence
- Intel processors use a snooping protocol: listen on the memory bus for transactions which change the state of a cache line
- MESI cache coherence protocol
- Shared memory (bus) tends to be the bottleneck in SMPs, generally.
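The MESI states and the snooping behavior named above can be sketched as a toy state machine for a single cache line in one processor's cache. This is the simplified textbook model, not Intel's implementation; the event names and the others_have_copy flag are invented for this illustration.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"   # only copy, differs from memory
    EXCLUSIVE = "E"  # only copy, matches memory
    SHARED = "S"     # other caches may hold a copy
    INVALID = "I"    # line holds no valid data


def next_state(state, event, others_have_copy=False):
    """Return the next MESI state of one cache line (textbook model).

    event is a hypothetical label:
      local_read / local_write  : access by this processor
      snoop_read / snoop_write  : bus transaction seen from another processor
    """
    if event == "local_read":
        if state is State.INVALID:
            # A read miss fills the line; the state depends on other sharers.
            return State.SHARED if others_have_copy else State.EXCLUSIVE
        return state
    if event == "local_write":
        # Writing always ends in MODIFIED (other copies get invalidated).
        return State.MODIFIED
    if event == "snoop_read":
        # Another processor reads: a valid copy is downgraded to SHARED.
        return State.INVALID if state is State.INVALID else State.SHARED
    if event == "snoop_write":
        # Another processor writes: this copy becomes stale.
        return State.INVALID
    raise ValueError(f"unknown event: {event}")
```

For example, a line that one processor has modified drops to INVALID as soon as another processor writes the same line, which is the kind of cache line ping-pong that stalls pipelines on an SMP.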
9MESI cache coherence protocol
10MESI cache coherence protocol
(Figure: cache line mem1, state modified)
11Multiprocessing scalability
- SMP scalability
- Microsoft reports NT 4.0 has an MP scalability factor of 0.85
- CMG 1996
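The slide does not define how the 0.85 factor is applied. One possible reading, assumed here purely for illustration, is geometric: each additional processor contributes 0.85 times the capacity of the one before it. The function name and the model itself are assumptions, not something taken from Microsoft's report.

```python
def effective_processors(n, factor=0.85):
    """Equivalent number of uniprocessors delivered by an n-way SMP.

    Assumed model (not from the source): processor i, counting from 0,
    contributes factor**i of one uniprocessor's capacity.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return sum(factor ** i for i in range(n))


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, round(effective_processors(n), 2))
```

Under this assumption the return on each added engine shrinks steadily, which is consistent with the deck's point that large n-way machines need partitioning or clustering to scale.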
12Multiprocessing scalability
- SMP scalability
- As a function of the number of engines and the size of L2 cache
- Intel, 1998
13Multiprocessing scalability
- Root causes of SMP scalability problems
- Interprocessor signaling
- pipeline stalls due to cache coherence conflicts
- cycles wasted by code executing spin locks
- Shared memory (and access to it via the bus)
14Multiprocessing scalability
- pipeline stalls due to cache coherence conflicts
15Multiprocessing
- cycles wasted by code executing spin locks
- synchronizing instructions do not actually lock
the bus in IA-32
16Multiprocessing scalability
- cycles wasted by code executing spin locks
- Multiprocessor-safe Device Driver calls into
the HAL to access generic spin lock functions
17Multiprocessing scalability
- Monitoring bus utilization
- primarily memory transactions
18Multiprocessing scalability
- Monitor shared memory bus contention
19Multiprocessing Thread scheduling
- NT Scheduler support
- internal support for up to 32 processors
- Applications can use Win32 to set hard processor affinity
- Task Manager support
- specify a 32-bit processor mask to SetThreadAffinityMask
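A processor mask has one bit per processor, with bit n standing for CPU n. The helpers below are a minimal sketch of building and decoding such a mask; their names are invented for this example, and the code only does the bit arithmetic rather than calling Win32.

```python
def affinity_mask(cpus, width=32):
    """Build a processor mask with one bit set per CPU number in cpus."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < width:
            raise ValueError(f"CPU {cpu} does not fit in a {width}-bit mask")
        mask |= 1 << cpu
    return mask


def cpus_in_mask(mask):
    """Return the sorted list of CPU numbers whose bits are set in mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


if __name__ == "__main__":
    mask = affinity_mask([0, 2])
    print(hex(mask), cpus_in_mask(mask))
```

On Windows, the resulting integer is the kind of value passed as the second argument of SetThreadAffinityMask, along with a thread handle.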
20Multiprocessing Thread scheduling
- Soft affinity
- Scheduler attempts to assign a Thread to the processor it last executed on
- only within the last 20 milliseconds
- mitigate cache loading effects
- a higher Priority Ready Thread will preempt the current Thread running on its ideal processor
- based on either soft or hard affinity
- This can lead to Thread shuffling
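The soft affinity rule can be illustrated with a toy processor-selection function. It covers only the choice among idle processors, not preemption, and it is not the NT dispatcher's actual algorithm; the function name, its parameters, and the lowest-numbered fallback are assumptions made for this sketch.

```python
SOFT_AFFINITY_WINDOW_MS = 20  # window quoted on the slide


def pick_processor(idle_cpus, last_cpu, last_run_ms, now_ms, hard_mask=None):
    """Choose an idle CPU for a ready thread, or None if none is eligible.

    hard_mask, if given, is a bit mask of CPUs the thread may run on.
    """
    candidates = sorted(idle_cpus)
    if hard_mask is not None:
        # Hard affinity: drop CPUs whose bit is not set in the mask.
        candidates = [c for c in candidates if (hard_mask >> c) & 1]
    if not candidates:
        return None
    ran_recently = (
        last_cpu is not None
        and now_ms - last_run_ms <= SOFT_AFFINITY_WINDOW_MS
    )
    if ran_recently and last_cpu in candidates:
        # Soft affinity: the cache on last_cpu is probably still warm.
        return last_cpu
    # Otherwise take the lowest-numbered eligible CPU (arbitrary choice).
    return candidates[0]
```

A thread that last ran on CPU 1 ten milliseconds ago is sent back to CPU 1 if it is idle; after 100 milliseconds the preference is dropped and any eligible idle CPU will do.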
22Processor busy ≠ Processor thruput!
2-way Pentium Pro 200 MHz
23Multiprocessing
- Measurement support
- separate Processor instances
- If the configuration is truly symmetric,
examining the processors separately adds little
additional insight
24Multiprocessing
- Use the hardware measurement support to answer questions about MP effects, etc.
- Pentium Pro Counters
- Methodology
- Look at IER then drill down
- Internal bus traffic: traditionally, this is where most MPs bottleneck
- P5ctrs.dll only instruments one processor, but you cannot tell which one!
- But try the CPUMon freeware utility to verify that the MP configuration is really symmetric
25CPUMon
27Multiprocessing
- Third party support
- MCSB AutoPilot
- augments the NT Scheduler
- Designed for n-way SMPs
- Uses measurements to make dispatching decisions
- e.g., processor affinity, bus utilization, CPU
cache performance, file cache, etc.
28Multiprocessing
- Third party support
- MCSB AutoPilot
- based on scheduling technology developed for massively parallel supercomputers
- scheduling algorithm can cause starvation
- provides no feedback
- probably most effective when there are more than
two processors
29Multiprocessing
- Third party support
- MCSB AutoPilot
- e.g., when bus utilization > 75%, AP's scheduling decision is influenced by the thread's bus utilization history
- Calculate Bus latency = Bus requests / Transactions
30Calculate Bus latency = Bus requests / Transactions
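Reading the slide's formula as a simple ratio of two counter values, the calculation can be sketched as follows. The parameter names and the sample numbers are hypothetical; the slide does not say which hardware counters supply the two values.

```python
def bus_latency(bus_requests, transactions):
    """Return bus requests divided by transactions, as on the slide.

    Returns 0.0 when no transactions were counted in the interval.
    """
    if transactions == 0:
        return 0.0
    return bus_requests / transactions


if __name__ == "__main__":
    # Hypothetical counter deltas for one measurement interval.
    print(bus_latency(bus_requests=1200, transactions=100))
```

Tracking this ratio over successive intervals shows whether contention for the shared memory bus is getting worse as load rises.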
33Multiprocessing tuning
- SQL Server support: use with caution!
- enable Show advanced options
- specify a processor affinity mask
- priority boost
- Priority 15 on a UP
- Priority 24 on an SMP
- SMP concurrency controls release of threads
- -1 is automatic mode: releases n-1 threads
- or, 1-64
- 0 is default: behavior depends on number of processors
36Multiprocessor partitioning
- NDIS driver support
- Isolate Network Interface Card interrupt processing (DPCs only) to one or more processors
- designed for Servers handling lots of network traffic
- less context switching, fewer cache cold starts
- Specify a 32-bit ProcessorAffinityMask that controls which processor(s) are eligible to dispatch the NDIS DPC code
- default: one NIC is assigned per processor, starting at the highest-numbered engine and working down
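The default assignment described above (one NIC per processor, from the highest-numbered engine downward) can be sketched as a small calculation. The function name is invented, and wrapping around when there are more NICs than processors is an assumption the slide does not address.

```python
def default_nic_masks(num_nics, num_cpus):
    """Return one single-bit processor mask per NIC, highest CPU first."""
    if num_cpus < 1:
        raise ValueError("need at least one processor")
    masks = []
    for nic in range(num_nics):
        # Wrap-around for num_nics > num_cpus is assumed, not documented.
        cpu = (num_cpus - 1 - nic) % num_cpus
        masks.append(1 << cpu)
    return masks


if __name__ == "__main__":
    # Two NICs on a 4-way server: CPU 3, then CPU 2.
    print([bin(m) for m in default_nic_masks(2, 4)])
```

Each returned value uses the same one-bit-per-processor layout as the ProcessorAffinityMask setting.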
37Multiprocessor partitioning
- NCR SMP Utilization Manager
- Can be used to assign processor affinity to the NDIS Interrupt Service Routines
- Applies installation defaults for Thread priority and processor affinity to specific processes when they start up
- Two components
- run-time service
- configuration GUI
38NCR SMP Utilization Manager
39NCR SMP Utilization Manager
40Multiprocessor partitioning
- May be the only way to optimize large, n-way multiprocessors
- But it requires commitment!
- Understanding your current workload CPU processing requirements
- Continuous monitoring of the workload on a per processor basis
- Periodic review of the partitioning scheme
41Multiprocessor partitioning
- The workload must be concentrated enough on its dedicated CPUs to benefit from cache warm starts, but not so concentrated that it causes excessive processor queuing.
43Clustering
- In general, clustering is used to describe multicomputer technology designed to overcome the scalability limitations of SMPs
- High Availability (Wolfpack, aka MS Cluster Server)
- Scalable performance on workloads that lend themselves to parallelism (Valence Cluster Convoy, now MS Load Balancing Server)
- Transaction processing workloads
- Web Server workloads
- Parallel query processing
44Clustering
- Shared Disk
- mainframe OS/390 approach (Coupling Facility, shared DASD)
- not practical with extensive memory-resident disk caching
- Shared Nothing clusters
- no shared memory
- require some form of communication/synchronization across shared SCSI, Fibre Channel or Network interconnections
- Non-uniform memory access (NUMA) clusters
- shared memory access requires longer latency
- Although NUMA is transparent to applications, it
certainly helps if programmers understand the
performance trade-offs.
45Clustering
- Wolfpack: shared nothing, high availability clustering
- Sends synchronization and heartbeat across a shared SCSI link
- Initial release supports clustering just two systems for high availability
- Standardization effort is highly regarded in the industry because existing Unix clustering products are so proprietary
46Clustered systems (Wolfpack)
47Clustering
- Convoy: shared nothing, with load balancing across multiple servers
- Acquired from Valence Research in 8/98
- Clustered systems share a virtual IP address
- Sends session management data and heartbeat across a shared network link
- A dedicated high speed NIC attached to a dedicated switched hub is very desirable.
- Used internally at www.microsoft.com to scale IIS across multiple servers
48Convoy
- For scaling ftp, http, and asp applications
- not for database applications like SQL Server, Notes, Oracle or Exchange
- MTS version 3 to offer transaction processing load balancing, too.
49Where to get more information
- Windows 2000 Server Resource Kit
- Inside Windows 2000, Solomon and Russinovich
- Microsoft Developer Network CD
- Intel vTune documentation (also available from Intel's Web site)
- Computer Architecture: A Quantitative Approach, Hennessy and Patterson
- Pentium Pro and Pentium II System Architecture, Mindshare, Inc.
- Inner Loops, Booth
- The Indispensable Pentium Book, Messmer
- Mark Russinovich, http://www.sysinternals.com/
- The Practical Performance Analyst, Gunther