Windows 2000 Multiprocessor Scalability

Slides: 50

Provided by: markf167

1
Windows 2000 Multiprocessor Scalability

Demand Technology Software
1020 Eighth Avenue South, Suite 6, Naples, FL 34102
phone (941) 261-8945 fax (941) 261-5456
e-mail markf_at_demandtech.com
http://www.demandtech.com

2
Outline

Processor Performance
Windows NT Thread Scheduler
Win32 Thread Scheduling API
NT Scheduler tuning options
Processor performance monitoring
Application optimization and tuning
Intel hardware performance
Multiprocessing

3
Windows NT Scheduler

Multiprogramming
Priority Queuing
Preemptive Scheduling
Foreground/Background
Multiprocessing

4
Multiprocessing

Symmetric Multiprocessing (SMP)
Multiple processors
dedicated L1 and L2 cache
Shared memory and shared memory bus
All processors can perform all functions
e.g., process I/O interrupts
Single Dispatcher queue
Processor affinity: when a specific process must
execute on a specific processor
Performance advantage: fewer cache cold starts

5
Symmetric Multiprocessing (SMP)
6
Intel Multiprocessing

Wintel and SMPs - a chronology
486, Pentium: support for multiprocessors (LOCK
prefix)
NT 3.5 supported 2-way asymmetric multiprocessing
Only CPU 0 could process interrupts
Pentium Pro designed for 4-way SMPs
NT 4.0 re-engineered for multiprocessing
kernel locking and synchronization functions
NT 4.0 scalability approximately 85%
Axil and others built 8-way technology
Pentium II Xeon chip supports native 8X

7
Intel Multiprocessing

Wintel and SMPs - a chronology
Win2K further re-engineered for multiprocessing
Queued spin locks reduce bus contention (PAUSE
instruction)
NT 5.0 scalability may approximate 90% or better
Pentium III features 133 MHz memory bus,
integrated L2 cache
16x and 32x multiprocessors planned

8
Intel Multiprocessing

SMP scalability
Pipeline stalls
on LOCK instructions
on interprocessor signals
Cache coherence
Intel processors use a snooping protocol: each
processor listens on the memory bus for
transactions that change the state of a cache line
MESI cache coherence protocol
Shared memory (bus) tends to be the bottleneck in
SMPs, generally.
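
The snooping behaviour described above can be sketched as a toy model. This is a minimal illustration of the textbook MESI transitions for a single cache line, not Intel's implementation; the class and method names (SnoopingBus, read, write) are hypothetical, and write-back of modified data is only noted in a comment.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class SnoopingBus:
    """Toy model: one cache line, one MESI state per processor."""

    def __init__(self, n_cpus):
        self.states = [State.INVALID] * n_cpus
        self.bus_transactions = 0  # traffic placed on the shared bus

    def read(self, cpu):
        if self.states[cpu] is not State.INVALID:
            return  # cache hit: no bus traffic
        self.bus_transactions += 1  # read miss goes out on the bus
        holders = [i for i, s in enumerate(self.states)
                   if i != cpu and s is not State.INVALID]
        for i in holders:
            # Snoopers holding the line drop to Shared; a Modified
            # copy would be written back to memory first.
            self.states[i] = State.SHARED
        self.states[cpu] = State.SHARED if holders else State.EXCLUSIVE

    def write(self, cpu):
        if self.states[cpu] in (State.MODIFIED, State.EXCLUSIVE):
            self.states[cpu] = State.MODIFIED  # silent upgrade, no bus traffic
            return
        self.bus_transactions += 1  # invalidation must be broadcast
        for i in range(len(self.states)):
            if i != cpu:
                self.states[i] = State.INVALID
        self.states[cpu] = State.MODIFIED


bus = SnoopingBus(2)
bus.read(0)    # CPU 0 is the only holder
bus.read(1)    # both CPUs now share the line
bus.write(0)   # CPU 1's copy is invalidated
print([s.value for s in bus.states], bus.bus_transactions)
```

Every write to a line that another processor holds costs a bus transaction in this model, which is the mechanism behind the shared-bus bottleneck noted above.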

9
MESI cache coherence protocol
10
MESI cache coherence protocol
(Figure labels: mem1, modified)
11
Multiprocessing scalability

SMP scalability
Microsoft reports NT 4.0 has an MP scalability
factor of 0.85 (CMG 1996)
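
As a rough worked example of what a 0.85 factor could imply, the sketch below assumes (this interpretation is not stated on the slide) that each added processor delivers 0.85 times the throughput of the one added before it. The function name effective_processors is hypothetical.

```python
def effective_processors(n, factor=0.85):
    """Throughput of an n-way SMP, in units of one uniprocessor.

    ASSUMPTION (not from the slides): each added processor delivers
    `factor` times the throughput of the processor added before it.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return sum(factor ** i for i in range(n))


for n in (1, 2, 4, 8):
    print(f"{n}-way: {effective_processors(n):.2f} effective processors")
```

Under a different reading of the factor (for example, every added processor contributing a flat 0.85), the numbers would differ, so treat the output as illustrative only.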

12
Multiprocessing scalability

SMP scalability
As a function of the number of engines and the
size of L2 cache (Intel, 1998)

13
Multiprocessing scalability

Root causes of SMP scalability problems
Interprocessor signaling
pipeline stalls due to cache coherence conflicts
cycles wasted by code executing spin locks
Shared memory (and access to it via the bus)

14
Multiprocessing scalability

pipeline stalls due to cache coherence conflicts

15
Multiprocessing

cycles wasted by code executing spin locks
synchronizing instructions do not actually lock
the bus in IA-32

16
Multiprocessing scalability

cycles wasted by code executing spin locks
Multiprocessor-safe Device Driver calls into
the HAL to access generic spin lock functions

17
Multiprocessing scalability

Monitoring bus utilization
primarily memory transactions

18
Multiprocessing scalability

Monitor shared memory bus contention

19
Multiprocessing: Thread scheduling

NT Scheduler support
internal support for up to 32 processors
Applications can use Win32 to set hard processor
affinity
Task Manager support
specify a 32-bit processor mask to
SetThreadAffinityMask
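
A processor mask of this kind sets bit n for processor n. The helpers below are a minimal sketch of building and decoding such a mask; the function names are hypothetical, and the code only computes the integer, it does not call the Win32 API.

```python
def affinity_mask(cpus, width=32):
    """Build a processor mask with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < width:
            raise ValueError(f"CPU {cpu} does not fit in a {width}-bit mask")
        mask |= 1 << cpu
    return mask


def cpus_in_mask(mask):
    """Return the sorted CPU numbers whose bits are set in the mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


mask = affinity_mask([0, 2])   # restrict to processors 0 and 2
print(hex(mask), cpus_in_mask(mask))
```

The resulting integer is the kind of value an application would pass as the mask argument when setting hard affinity.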

20
Multiprocessing: Thread scheduling

Soft affinity
Scheduler attempts to assign a Thread to the
processor it last executed on
only within the last 20 milliseconds
mitigate cache loading effects
a higher Priority Ready Thread will preempt the
current Thread running on its ideal processor
based on either soft or hard affinity
This can lead to Thread shuffling
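
The soft affinity rule can be illustrated with a toy placement function. This is a simplified model under stated assumptions, not the NT dispatcher: it only considers idle processors, ignores preemption, and falls back to the first eligible processor. All names are hypothetical; the 20 ms window is the figure quoted above.

```python
SOFT_AFFINITY_WINDOW_MS = 20  # window quoted on the slide


def pick_processor(last_cpu, ms_since_last_run, idle_cpus, hard_mask=None):
    """Choose a processor for a Ready thread in a toy scheduler model."""
    # Hard affinity, if set, filters the eligible processors first.
    candidates = [c for c in idle_cpus
                  if hard_mask is None or (hard_mask >> c) & 1]
    if not candidates:
        return None  # nothing eligible: thread stays on the Ready queue
    # Soft affinity: prefer the last processor while its cache is
    # likely still warm.
    if last_cpu in candidates and ms_since_last_run <= SOFT_AFFINITY_WINDOW_MS:
        return last_cpu
    return candidates[0]  # ASSUMPTION: otherwise take the first idle CPU


print(pick_processor(last_cpu=1, ms_since_last_run=5, idle_cpus=[0, 1]))
print(pick_processor(last_cpu=1, ms_since_last_run=50, idle_cpus=[0, 1]))
```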

21
(No Transcript)
22
Processor busy ≠ Processor throughput!
2-way Pentium Pro 200 MHz
23
Multiprocessing

Measurement support
separate Processor instances
If the configuration is truly symmetric,
examining the processors separately adds little
additional insight

24
Multiprocessing

Use the hardware measurement support to answer
questions about MP effects, etc.
Pentium Pro Counters
Methodology
Look at IER then drill down
Internal bus traffic: traditionally, this is
where most MPs bottleneck
P5ctrs.dll only instruments one processor, but
you cannot tell which one!
But try the CPUMon freeware utility to verify
that the MP configuration is really symmetric

25
CPUMon
26
(No Transcript)
27
Multiprocessing

Third party support
MCSB AutoPilot
augments the NT Scheduler
Designed for n-way SMPs
Uses measurements to make dispatching decisions
e.g., processor affinity, bus utilization, CPU
cache performance, file cache, etc.

28
Multiprocessing

Third party support
MCSB AutoPilot
based on scheduling technology developed for
massively parallel supercomputers
scheduling algorithm can cause starvation
provides no feedback
probably most effective when there are more than
two processors

29
Multiprocessing

Third party support
MCSB AutoPilot
e.g., when bus utilization exceeds 75%, AP's
scheduling decision is influenced by the thread's
bus utilization history

Calculate Bus latency = Bus requests /
Transactions
30
Calculate Bus latency = Bus requests /
Transactions
31
(No Transcript)
32
(No Transcript)
33
Multiprocessing tuning

SQL Server support: use with caution!
enable Show advanced options
specify a processor affinity mask
priority boost
Priority 15 on a UP
Priority 24 on an SMP
SMP concurrency: controls release of threads
-1 is automatic mode: releases n-1 threads
or, 1-64
0 is default: behavior depends on # of processors

34
(No Transcript)
35
(No Transcript)
36
Multiprocessor partitioning

NDIS driver support
Isolate Network Interface Card interrupt
processing (DPCs only) to one or more processors
designed for Servers handling lots of network
traffic
less context switching, fewer cache cold starts
Specify a 32-bit ProcessorAffinityMask that
controls which processor(s) are eligible to
dispatch the NDIS DPC code
default: one NIC is assigned per processor,
starting at the highest-numbered engine and
working down
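
The default assignment order described above can be sketched as follows. This is an illustration only: the function name and NIC labels are hypothetical, and the wrap-around when NICs outnumber processors is an assumption, not something the slide states.

```python
def default_nic_assignment(nics, n_processors):
    """Map each NIC to a processor, starting at the highest-numbered one."""
    if n_processors < 1:
        raise ValueError("need at least one processor")
    assignment = {}
    for i, nic in enumerate(nics):
        # ASSUMPTION: wrap around when NICs outnumber processors.
        assignment[nic] = (n_processors - 1) - (i % n_processors)
    return assignment


# Two NICs on a 4-way system: the first lands on CPU 3, the second on CPU 2.
print(default_nic_assignment(["nic0", "nic1"], 4))
```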

37
Multiprocessor partitioning

NCR SMP Utilization Manager
Can be used to assign processor affinity to the
NDIS Interrupt Service Routines
Applies installation defaults for Thread priority
and processor affinity to specific processes when
they start-up
Two components
run-time service
configuration GUI

38
NCR SMP Utilization Manager
39
NCR SMP Utilization Manager
40
Multiprocessor partitioning

May be the only way to optimize large, n-way
multiprocessors
But it requires commitment!
Understanding your current workload's CPU
processing requirements
Continuous monitoring of the workload on a per
processor basis
Periodic review of the partitioning scheme

41
Multiprocessor partitioning

The workload must be concentrated enough on its
dedicated CPUs that it benefits from cache warm
starts, but not so concentrated that it causes
excessive processor queuing.

42
(No Transcript)
43
Clustering

In general, clustering is used to describe
multicomputer technology designed to overcome the
scalability limitations of SMPs
High Availability (Wolfpack, aka MS Cluster
Server)
Scalable performance on workloads that lend
themselves to parallelism (Valence Cluster
Convoy, now MS Load Balancing Server)
Transaction processing workloads
Web Server workloads
Parallel query processing

44
Clustering

Shared Disk
mainframe OS/390 approach (Coupling Facility,
shared DASD)
not practical with extensive memory-resident disk
caching
Shared Nothing clusters
no shared memory
require some form of communication/synchronization
across shared SCSI, Fibre Channel or Network
interconnections
Non-uniform memory access (NUMA) clusters
shared memory access requires longer latency
Although NUMA is transparent to applications, it
certainly helps if programmers understand the
performance trade-offs.

45
Clustering

Wolfpack: shared nothing, high availability
clustering
Sends synchronization and heartbeat across a
shared SCSI link
Initial release supports clustering just two
systems for high availability
Standardization effort is highly regarded in the
industry because existing Unix clustering
products are so proprietary

46
Clustered systems (Wolfpack)
47
Clustering

Convoy: shared nothing, with load balancing
across multiple servers
Acquired from Valence Research in 8/98
Clustered systems share a virtual IP address
Sends session management data and heartbeat
across a shared network link
A dedicated high speed NIC attached to a
dedicated switched hub is very desirable.
Used internally at www.microsoft.com to scale IIS
across multiple servers

48
Convoy