Windows 2000 Multiprocessor Scalability

1
Windows 2000 Multiprocessor Scalability
  • Demand Technology Software
  • 1020 Eighth Avenue South, Suite 6, Naples, FL
    34102
  • phone (941) 261-8945 fax (941) 261-5456
  • e-mail markf_at_demandtech.com
  • http://www.demandtech.com

2
Outline
  • Processor Performance
  • Windows NT Thread Scheduler
  • Win32 Thread Scheduling API
  • NT Scheduler tuning options
  • Processor performance monitoring
  • Application optimization and tuning
  • Intel hardware performance
  • Multiprocessing

3
Windows NT Scheduler
  • Multiprogramming
  • Priority Queuing
  • Preemptive Scheduling
  • Foreground/Background
  • Multiprocessing

4
Multiprocessing
  • Symmetric Multiprocessing (SMP)
  • Multiple processors
  • dedicated L1 and L2 cache
  • Shared memory and shared memory bus
  • All processors can perform all functions
  • e.g., process I/O interrupts
  • Single Dispatcher queue
  • Processor affinity: when a specific process must
    execute on a specific processor
  • Performance advantage: fewer cache cold starts

5
Symmetric Multiprocessing (SMP)
6
Intel Multiprocessing
  • Wintel and SMPs - a chronology
  • 486, Pentium: support for multiprocessors (LOCK
    prefix)
  • NT 3.5 supported 2-way asymmetric multiprocessing
  • Only CPU 0 could process interrupts
  • Pentium Pro designed for 4-way SMPs
  • NT 4.0 re-engineered for multiprocessing
  • kernel locking and synchronization functions
  • NT 4.0 scalability: approximately 85%
  • Axil and others built 8-way technology
  • Pentium II Xeon chip supports native 8X

7
Intel Multiprocessing
  • Wintel and SMPs - a chronology
  • Win2K further re-engineered for multiprocessing
  • Queued spin locks reduce bus contention (PAUSE
    instruction)
  • NT 5.0 scalability may approximate 90% or better
  • Pentium III features: 133 MHz memory bus,
    integrated L2 cache
  • 16x and 32x multiprocessors planned

8
Intel Multiprocessing
  • SMP scalability
  • Pipeline stalls
  • on LOCK instructions
  • on interprocessor signals
  • Cache coherence
  • Intel processors use a snooping protocol: listen
    on the memory bus for transactions that change
    the state of a cache line
  • MESI cache coherence protocol
  • Shared memory (bus) tends to be the bottleneck in
    SMPs, generally.
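The snooping idea above can be sketched as a toy model of one cache line shared by several processor caches. This is a minimal illustration, assuming the textbook MESI transitions; the class and method names (`SnoopingBus`, `read`, `write`) are hypothetical and do not model any specific Intel implementation.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class SnoopingBus:
    """Tracks the MESI state of a single cache line in each CPU's cache."""

    def __init__(self, n_cpus):
        self.states = [State.INVALID] * n_cpus
        self.bus_transactions = 0  # every miss or invalidate goes over the bus

    def read(self, cpu):
        if self.states[cpu] is not State.INVALID:
            return  # cache hit: no bus traffic
        self.bus_transactions += 1
        others_hold = False
        for other, state in enumerate(self.states):
            if other != cpu and state is not State.INVALID:
                others_hold = True
                # A snooping cache in M or E downgrades to S
                # (an M copy is written back first).
                self.states[other] = State.SHARED
        self.states[cpu] = State.SHARED if others_hold else State.EXCLUSIVE

    def write(self, cpu):
        if self.states[cpu] in (State.MODIFIED, State.EXCLUSIVE):
            self.states[cpu] = State.MODIFIED  # silent upgrade, no bus traffic
            return
        self.bus_transactions += 1  # read-for-ownership / invalidate
        for other in range(len(self.states)):
            if other != cpu:
                self.states[other] = State.INVALID
        self.states[cpu] = State.MODIFIED


bus = SnoopingBus(n_cpus=2)
bus.read(0)   # CPU 0 holds the only copy
bus.write(0)  # upgrade in place, no bus transaction
bus.read(1)   # CPU 1 misses; CPU 0 snoops and downgrades
bus.write(1)  # CPU 1 invalidates CPU 0's copy
```

In this model, two processors that alternately write the same line generate a bus transaction on every write, which is the kind of coherence traffic the slide identifies as a scalability limit.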

9
MESI cache coherence protocol
10
MESI cache coherence protocol
[Diagram: cache line mem1, state modified]
11
Multiprocessing scalability
  • SMP scalability
  • Microsoft reports NT 4.0 has an MP scalability
    factor of 0.85
  • CMG 1996
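The slide does not say how the 0.85 factor is applied, so the sketch below shows two common readings side by side. Both are assumptions for illustration only: a linear model where every added processor contributes 0.85 of a processor, and a geometric model where each added processor contributes 0.85 times the previous one. The function names are hypothetical.

```python
def effective_processors_linear(n_cpus, factor=0.85):
    """First CPU counts as 1.0; each additional CPU adds `factor`."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return 1.0 + factor * (n_cpus - 1)


def effective_processors_geometric(n_cpus, factor=0.85):
    """The k-th CPU (k = 0, 1, 2, ...) adds factor**k."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return sum(factor ** k for k in range(n_cpus))


for n in (1, 2, 4, 8):
    print(n,
          round(effective_processors_linear(n), 2),
          round(effective_processors_geometric(n), 2))
```

The two readings agree for a 2-way system and diverge as processors are added, so it matters which one a published scalability figure actually uses.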

12
Multiprocessing scalability
  • SMP scalability
  • As a function of the number of engines and the
    size of L2 cache
  • Intel, 1998

13
Multiprocessing scalability
  • Root causes of SMP scalability problems
  • Interprocessor signaling
  • pipeline stalls due to cache coherence conflicts
  • cycles wasted by code executing spin locks
  • Shared memory (and access to it via the bus)

14
Multiprocessing scalability
  • pipeline stalls due to cache coherence conflicts

15
Multiprocessing
  • cycles wasted by code executing spin locks
  • synchronizing instructions do not actually lock
    the bus in IA-32

16
Multiprocessing scalability
  • cycles wasted by code executing spin locks
  • Multiprocessor-safe Device Driver calls into
    the HAL to access generic spin lock functions

17
Multiprocessing scalability
  • Monitoring bus utilization
  • primarily memory transactions

18
Multiprocessing scalability
  • Monitor shared memory bus contention

19
Multiprocessing Thread scheduling
  • NT Scheduler support
  • internal support for up to 32 processors
  • Applications can use Win32 to set hard processor
    affinity
  • Task Manager support
  • specify a 32-bit processor mask to
    SetThreadAffinityMask
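A processor affinity mask is a bit field in which bit n set means "eligible to run on processor n". The sketch below only builds and decodes the mask value; it does not call the Win32 API, and the helper names `affinity_mask` and `cpus_in_mask` are hypothetical.

```python
def affinity_mask(cpus, width=32):
    """Build a processor mask with one bit set per listed CPU number."""
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < width:
            raise ValueError(f"CPU {cpu} does not fit in a {width}-bit mask")
        mask |= 1 << cpu
    return mask


def cpus_in_mask(mask):
    """Return the CPU numbers whose bits are set in the mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


mask = affinity_mask([0, 2])   # CPUs 0 and 2
print(hex(mask), cpus_in_mask(mask))
```

The resulting integer is the kind of value that would be passed as the mask argument to SetThreadAffinityMask.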

20
Multiprocessing Thread scheduling
  • Soft affinity
  • Scheduler attempts to assign a Thread to the
    processor it last executed on
  • only within the last 20 milliseconds
  • mitigate cache loading effects
  • a higher Priority Ready Thread will preempt the
    current Thread running on its ideal processor
  • based on either soft or hard affinity
  • This can lead to Thread shuffling
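The soft affinity rule above can be illustrated with a simplified processor-selection function. This is a sketch of the idea on the slide, not the NT Scheduler's actual algorithm: the `ReadyThread` type, the "lowest-numbered idle CPU" fallback, and the treatment of the 20 ms window are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

SOFT_AFFINITY_WINDOW_MS = 20  # window quoted on the slide


@dataclass
class ReadyThread:
    last_cpu: Optional[int]        # processor the thread last ran on, if any
    last_ran_ms: float             # timestamp of its last execution
    hard_mask: int = 0xFFFFFFFF    # hard affinity mask (all CPUs by default)


def pick_processor(thread, idle_cpus, now_ms):
    """Choose an idle CPU, preferring the one whose cache may still be warm."""
    eligible = [c for c in sorted(idle_cpus) if (thread.hard_mask >> c) & 1]
    if not eligible:
        return None  # nothing idle that hard affinity allows
    ran_recently = (
        thread.last_cpu is not None
        and now_ms - thread.last_ran_ms <= SOFT_AFFINITY_WINDOW_MS
    )
    if ran_recently and thread.last_cpu in eligible:
        return thread.last_cpu      # soft affinity: reuse the warm cache
    return eligible[0]              # assumption: otherwise take any idle CPU


t = ReadyThread(last_cpu=3, last_ran_ms=100.0)
print(pick_processor(t, idle_cpus={1, 3}, now_ms=110.0))  # within the window
print(pick_processor(t, idle_cpus={1, 3}, now_ms=200.0))  # window expired
```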

22
Processor busy ≠ processor throughput!
2-way Pentium Pro 200 MHz
23
Multiprocessing
  • Measurement support
  • separate Processor instances
  • If the configuration is truly symmetric,
    examining the processors separately adds little
    additional insight

24
Multiprocessing
  • Use the hardware measurement support to answer
    questions about MP effects, etc.
  • Pentium Pro Counters
  • Methodology
  • Look at IER then drill down
  • Internal bus traffic: traditionally, this is
    where most MPs bottleneck
  • P5ctrs.dll only instruments one processor, but
    you cannot tell which one!
  • But try the CPUMon freeware utility to verify
    that the MP configuration is really symmetric

25
CPUMon
27
Multiprocessing
  • Third party support
  • MCSB AutoPilot
  • augments the NT Scheduler
  • Designed for n-way SMPs
  • Uses measurements to make dispatching decisions
  • e.g., processor affinity, bus utilization, CPU
    cache performance, file cache, etc.

28
Multiprocessing
  • Third party support
  • MCSB AutoPilot
  • based on scheduling technology developed for
    massively parallel supercomputers
  • scheduling algorithm can cause starvation
  • provides no feedback
  • probably most effective when there are more than
    two processors

29
Multiprocessing
  • Third party support
  • MCSB AutoPilot
  • e.g., when bus utilization exceeds 75%, AP's
    scheduling decision is influenced by the thread's
    bus utilization history

Calculate bus latency = Bus requests / Transactions
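As a worked example of this ratio, the sketch below divides one counter value by another over the same measurement interval. It assumes "Bus requests" counts cycles during which a bus request was outstanding, so that the quotient is average cycles per transaction; that reading, the function name, and the sample numbers are assumptions, not values from the presentation.

```python
def bus_latency(bus_request_cycles, bus_transactions):
    """Average bus cycles per transaction over one measurement interval."""
    if bus_transactions == 0:
        return 0.0  # idle bus: no transactions to average over
    return bus_request_cycles / bus_transactions


# Hypothetical counter deltas collected over the same interval.
print(bus_latency(bus_request_cycles=1_200_000, bus_transactions=100_000))
```

A rising value at a steady transaction rate would suggest that requests are queuing for the shared bus.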
33
Multiprocessing tuning
  • SQL Server support: use with caution!
  • enable "Show advanced options"
  • specify a processor affinity mask
  • priority boost
  • Priority 15 on a UP
  • Priority 24 on an SMP
  • SMP concurrency: controls release of threads
  • -1 is automatic mode: releases n-1 threads
  • or, 1-64
  • 0 is the default: behavior depends on the number
    of processors

36
Multiprocessor partitioning
  • NDIS driver support
  • Isolate Network Interface Card interrupt
    processing (DPCs only) to one or more processors
  • designed for Servers handling lots of network
    traffic
  • less context switching, fewer cache cold starts
  • Specify a 32-bit ProcessorAffinityMask that
    controls which processor(s) are eligible to
    dispatch the NDIS DPC code
  • Default: one NIC is assigned per processor,
    starting at the highest-numbered engine and working
    down
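The default assignment described above can be sketched as a mapping from NIC index to a single-bit affinity mask. The function name is hypothetical, and the wrap-around when there are more NICs than processors is an assumption; the slide does not say what happens in that case.

```python
def default_ndis_masks(n_nics, n_cpus):
    """Map each NIC index to a one-bit mask, highest-numbered CPU first."""
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    masks = {}
    for nic in range(n_nics):
        cpu = n_cpus - 1 - (nic % n_cpus)  # assumption: wrap around if NICs > CPUs
        masks[nic] = 1 << cpu
    return masks


# Two NICs on a 4-way server: NIC 0 on CPU 3, NIC 1 on CPU 2.
for nic, mask in default_ndis_masks(n_nics=2, n_cpus=4).items():
    print(nic, format(mask, "04b"))
```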

37
Multiprocessor partitioning
  • NCR SMP Utilization Manager
  • Can be used to assign processor affinity to the
    NDIS Interrupt Service Routines
  • Applies installation defaults for Thread priority
    and processor affinity to specific processes when
    they start-up
  • Two components
  • run-time service
  • configuration GUI

38
NCR SMP Utilization Manager
39
NCR SMP Utilization Manager
40
Multiprocessor partitioning
  • May be the only way to optimize large, n-way
    multiprocessors
  • But it requires commitment!
  • Understanding your current workload's CPU
    processing requirements
  • Continuous monitoring of the workload on a per
    processor basis
  • Periodic review of the partitioning scheme

41
Multiprocessor partitioning
  • The workload must be concentrated enough on its
    dedicated CPUs to benefit from cache warm starts,
    but not so concentrated that it causes excessive
    processor queuing.

43
Clustering
  • In general, clustering is used to describe
    multicomputer technology designed to overcome the
    scalability limitations of SMPs
  • High Availability (Wolfpack, aka MS Cluster
    Server)
  • Scalable performance on workloads that lend
    themselves to parallelism (Valence Cluster
    Convoy, now MS Load Balancing Server)
  • Transaction processing workloads
  • Web Server workloads
  • Parallel query processing

44
Clustering
  • Shared Disk
  • mainframe OS/390 approach (Coupling Facility,
    shared DASD)
  • not practical with extensive memory-resident disk
    caching
  • Shared Nothing clusters
  • no shared memory
  • require some form of communication/synchronization
    across shared SCSI, Fibre Channel or Network
    interconnections
  • Non-uniform memory access (NUMA) clusters
  • shared memory access requires longer latency
  • Although NUMA is transparent to applications, it
    certainly helps if programmers understand the
    performance trade-offs.

45
Clustering
  • Wolfpack: shared-nothing, high-availability
    clustering
  • Sends synchronization and heartbeat across a
    shared SCSI link
  • Initial release supports clustering just two
    systems for high availability
  • Standardization effort is highly regarded in the
    industry because existing Unix clustering
    products are so proprietary

46
Clustered systems (Wolfpack)
47
Clustering
  • Convoy: shared-nothing, with load balancing
    across multiple servers
  • Acquired from Valence Research in 8/98
  • Clustered systems share a virtual IP address
  • Sends session management data and heartbeat
    across a shared network link
  • A dedicated high speed NIC attached to a
    dedicated switched hub is very desirable.
  • Used internally at www.microsoft.com to scale IIS
    across multiple servers

48
Convoy
  • For scaling ftp, http, and asp applications
  • not for database applications like SQL Server,
    Notes, Oracle or Exchange
  • MTS version 3 to offer transaction processing
    load balancing, too.

49
Where to get more information
  • Windows 2000 Server Resource Kit
  • Inside Windows 2000, Solomon and Russinovich
  • Microsoft Developer Network CD
  • Intel VTune documentation (also available from
    Intel's Web site)
  • Computer Architecture: A Quantitative Approach,
    Hennessy and Patterson
  • Pentium Pro and Pentium II System Architecture,
    MindShare, Inc.
  • Inner Loops, Booth
  • The Indispensable Pentium Book, Messmer
  • Mark Russinovich, http://www.sysinternals.com/
  • The Practical Performance Analyst, Gunther