Multiprocessor Initialization - PowerPoint PPT Presentation

Provided by: professora5

Transcript and Presenter's Notes



1
Multiprocessor Initialization
  • An introduction to the use of Interprocessor
    Interrupts

2
A traditional MP system
[Diagram: CPU 0 and CPU 1 share main memory over the system bus]
3
Dual-Core Technology
[Diagram: a Core 2 Duo processor containing CPU 0 and CPU 1 with a shared level-2 cache, connected to main memory over the system bus]
4
Multi-Core Technology
[Diagram: a Core 2 Quad processor containing CPU 0, CPU 1, CPU 2 and CPU 3 with two shared level-2 caches, connected to main memory over the system bus]
5
Each CPU has its own Local-APIC
A CPU contains:
  • the processor's application registers (EAX, EBX, ..., EIP, EFLAGS)
  • the processor's system registers (CR0, CR2, CR3, ..., IDTR, GDTR, TR)
  • the processor's Execution Engine
  • the processor's Local-APIC registers (Local-ID, IRR, ISR, EOI, LVT0, LVT1, ..., ICR, TCFG)
6
The Local-APIC ID register
Bits 31-24: APIC ID; bits 23-0: reserved
The APIC ID field (8 bits) holds a unique processor identification number, established during system startup (Intel's documentation describes it as assigned by hardware at power-up; some processor models also let software such as the BIOS reprogram it). This number is subsequently used when specifying the processor as a recipient of inter-processor interrupts.
Memory-Mapped Register-Address: 0xFEE00020
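Because the ID sits in the top byte, software that reads the 32-bit register at 0xFEE00020 must shift it down to get the processor's number. A minimal C sketch that works on a plain 32-bit value rather than the memory-mapped register itself (the helper names `apic_id_field` and `apic_id_with` are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Extract the 8-bit APIC ID from bits 31..24 of the Local-APIC ID register. */
uint8_t apic_id_field(uint32_t id_reg)
{
    return (uint8_t)(id_reg >> 24);
}

/* Build a register value holding 'id' in bits 31..24, keeping bits 23..0. */
uint32_t apic_id_with(uint32_t id_reg, uint8_t id)
{
    return (id_reg & 0x00FFFFFFu) | ((uint32_t)id << 24);
}
```

On real hardware the input would come from a 32-bit load at 0xFEE00020 (through FS in this demo), e.g. `apic_id_field(0x03000000)` gives processor 3.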
7
The Local-APIC EOI register
Bits 31-0: write-only register
This write-only register is used by Interrupt Service Routines to issue an End-Of-Interrupt command to the Local-APIC. Any value written to this register is interpreted by the Local-APIC as an EOI command. The value stored in this register is initially zero (and it remains unchanged).
Memory-Mapped Register-Address: 0xFEE000B0
8
The Spurious Interrupt register
Bits 31-9: reserved; bit 8: EN (Local-APIC is Enabled: 1 = yes, 0 = no); bits 7-0: spurious vector
This register is used to enable or disable the functioning of the Local-APIC and, when it is enabled, to specify the interrupt-vector number to be delivered to the processor in case the Local-APIC generates a spurious interrupt. (In some processor models, the vector's lowest 4 bits are hardwired to 1s.)
Memory-Mapped Register-Address: 0xFEE000F0
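Enabling the Local-APIC amounts to setting bit 8 and placing the spurious vector in bits 7-0 while leaving the other bits alone. A minimal C sketch on plain values (the function and macro names are hypothetical); choosing a vector whose low 4 bits are already 1s, such as 0xFF or 0x4F, sidesteps the hardwired-bits quirk noted above:

```c
#include <assert.h>
#include <stdint.h>

#define SVR_APIC_ENABLE  (1u << 8)   /* bit 8: 1 = Local-APIC enabled */
#define SVR_VECTOR_MASK  0xFFu       /* bits 7..0: spurious vector */

/* SVR value with the Local-APIC enabled and the given spurious vector;
   bits 31..9 are carried over unchanged. */
uint32_t svr_enable(uint32_t svr, uint8_t vector)
{
    return (svr & ~(SVR_APIC_ENABLE | SVR_VECTOR_MASK)) | SVR_APIC_ENABLE | vector;
}

/* SVR value with the Local-APIC disabled (bit 8 cleared). */
uint32_t svr_disable(uint32_t svr)
{
    return svr & ~SVR_APIC_ENABLE;
}
```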
9
Interrupt Command Register
  • Each processor's Local-APIC unit has a 64-bit
    Interrupt Command Register
  • It can be programmed by system software to
    transmit messages to one, or to several, of the
    other processors in the system
  • Each processor has a unique identification number
    in its APIC Local-ID Register that can be used
    for directing messages to it

10
ICR (upper 32-bits)
Bits 31-24: Destination field; bits 23-0: reserved
The Destination Field (8 bits) can be used to specify which processor (or group of processors) will receive the message.
Memory-Mapped Register-Address: 0xFEE00310
11
ICR (lower 32-bits)
  • Bits 7-0: Vector field
  • Bits 10-8: Delivery Mode: 000 = Fixed, 001 = Lowest Priority, 010 = SMI, 011 = (reserved), 100 = NMI, 101 = INIT, 110 = Start Up, 111 = (reserved)
  • Bit 11: Destination Mode: 0 = Physical, 1 = Logical
  • Bit 12 (R/O): Delivery Status: 0 = Idle, 1 = Pending
  • Bit 14: Level: 0 = De-assert, 1 = Assert
  • Bit 15: Trigger Mode: 0 = Edge, 1 = Level
  • Bits 19-18: Destination Shorthand: 00 = no shorthand, 01 = only to self, 10 = all including self, 11 = all excluding self
Memory-Mapped Register-Address: 0xFEE00300
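The constants used later in this deck (0x000C4500 for INIT, 0x000C4611 for a Startup IPI) can be derived from this field layout. A minimal C sketch that packs the fields into a 32-bit value; the enum names and function signatures are hypothetical, and the code only computes values without touching the memory-mapped ICR:

```c
#include <assert.h>
#include <stdint.h>

/* Delivery-mode and destination-shorthand encodings from the table above. */
enum { DM_FIXED = 0, DM_LOWEST = 1, DM_SMI = 2, DM_NMI = 4, DM_INIT = 5, DM_STARTUP = 6 };
enum { SH_NONE = 0, SH_SELF = 1, SH_ALL_INCL_SELF = 2, SH_ALL_EXCL_SELF = 3 };

/* Compose the lower 32 bits of the ICR (the value written to 0xFEE00300). */
uint32_t icr_low(uint8_t vector, unsigned mode, unsigned logical,
                 unsigned level_assert, unsigned level_trigger, unsigned shorthand)
{
    assert(mode <= 7 && shorthand <= 3);
    return (uint32_t)vector                        /* bits 7..0   vector */
         | (uint32_t)mode << 8                     /* bits 10..8  delivery mode */
         | (uint32_t)(logical & 1u) << 11          /* bit 11      destination mode */
         | (uint32_t)(level_assert & 1u) << 14     /* bit 14      level */
         | (uint32_t)(level_trigger & 1u) << 15    /* bit 15      trigger mode */
         | (uint32_t)shorthand << 18;              /* bits 19..18 shorthand */
}

/* Compose the upper 32 bits of the ICR (the value written to 0xFEE00310). */
uint32_t icr_high(uint8_t destination)
{
    return (uint32_t)destination << 24;            /* bits 31..24 destination */
}

/* Bit 12 (read-only) is 1 while the previously issued IPI is still pending. */
int icr_send_pending(uint32_t icr_low_value)
{
    return (int)((icr_low_value >> 12) & 1u);
}
```

For example, `icr_low(0, DM_INIT, 0, 1, 0, SH_ALL_EXCL_SELF)` reproduces 0x000C4500: delivery mode 101 (0x500), level assert (0x4000) and shorthand 11 (0xC0000).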
12
MP initialization protocol
  • Set a shared processor-counter equal to 1
  • Step 1: issue an INIT IPI to all-except-self
  • Delay for 10 milliseconds
  • Step 2: issue a Startup IPI to all-except-self
  • Delay for 200 microseconds
  • Step 3: issue a Startup IPI to all-except-self
  • Delay for 200 microseconds
  • Check the value of the processor-counter
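The ordering of the protocol steps can be sketched in C with the hardware access abstracted behind function pointers. This is a hypothetical skeleton, not the demo's code: `struct mp_hooks`, `write_icr_low` and `delay_us` stand in for a store to 0xFEE00300 (followed by polling bit 12) and an 8254-based delay, and the small recorder lets the sequence be checked on an ordinary host:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks: on real hardware these would write the ICR at
   0xFEE00300 (then poll bit 12) and busy-wait using the 8254 timer. */
struct mp_hooks {
    void (*write_icr_low)(void *ctx, uint32_t value);
    void (*delay_us)(void *ctx, uint32_t microseconds);
    void *ctx;
};

/* INIT, wait 10 ms, then two Startup IPIs, each followed by 200 us. */
void mp_init_sequence(const struct mp_hooks *h, uint8_t startup_vector)
{
    assert(h && h->write_icr_low && h->delay_us);
    h->write_icr_low(h->ctx, 0x000C4500u);                    /* Step 1: INIT */
    h->delay_us(h->ctx, 10000);                               /* 10 milliseconds */
    for (int step = 2; step <= 3; step++) {
        h->write_icr_low(h->ctx, 0x000C4600u | startup_vector); /* Startup IPI */
        h->delay_us(h->ctx, 200);                             /* 200 microseconds */
    }
}

/* A recorder that captures the calls instead of touching hardware. */
struct trace { uint32_t icr[4]; uint32_t delay[4]; size_t n_icr, n_delay; };
void rec_icr(void *ctx, uint32_t v)    { struct trace *t = ctx; t->icr[t->n_icr++] = v; }
void rec_delay(void *ctx, uint32_t us) { struct trace *t = ctx; t->delay[t->n_delay++] = us; }
```

The final step (checking the processor-counter) is left out here, since it depends on the APs' own startup code.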

13
Issue an INIT IPI
        # address Local-APIC via register FS
        mov     $sel_fs, %ax
        mov     %ax, %fs

        # broadcast INIT IPI to all-except-self
        mov     $0x000C4500, %eax
        mov     %eax, %fs:0xFEE00300
.B0:    btl     $12, %fs:0xFEE00300   # Delivery Status (bit 12) still pending?
        jc      .B0                   # yes, keep polling until idle

14
Issue a Startup IPI
        # broadcast Startup IPI to all-except-self
        # using vector 0x11 to specify entry-point
        # at real memory-address 0x00011000
        mov     $0x000C4611, %eax
        mov     %eax, %fs:0xFEE00300
.B1:    btl     $12, %fs:0xFEE00300   # Delivery Status (bit 12) still pending?
        jc      .B1                   # yes, keep polling until idle
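The Startup IPI's vector names a 4K page below 1 MB: vector 0x11 sends each AP to physical address 0x11000 (real-mode CS = 0x1100, IP = 0). A small C sketch of that mapping in both directions; the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Physical address where an AP starts for a Startup IPI with this vector. */
uint32_t sipi_entry_address(uint8_t vector)
{
    return (uint32_t)vector << 12;          /* vector selects a 4K page */
}

/* Vector needed for an entry point, which must be 4K-aligned and below 1 MB. */
uint8_t sipi_vector_for(uint32_t entry)
{
    assert((entry & 0xFFFu) == 0 && entry < 0x100000u);
    return (uint8_t)(entry >> 12);
}
```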

15
Timing delays
  • Intel's MP Initialization Protocol specifies the
    use of some timing delays
  • 10 milliseconds (= 10,000 microseconds)
  • 200 microseconds
  • We can use the 8254 Timer's Channel 2 for
    implementing these timed delays, by programming
    it for one-shot countdown mode, then polling
    bit 5 at i/o port 0x61

16
Mathematical examples
EXAMPLE 1: Delaying for 10 milliseconds means delaying for 1/100-th of a second (because $100 \times 10\,\text{ms} = 1000\,\text{ms}$).
EXAMPLE 2: Delaying for 200 microseconds means delaying for 1/5000-th of a second (because $5000 \times 200\,\mu\text{s} = 1{,}000{,}000\,\mu\text{s}$).
GENERAL PRINCIPLE: Delaying for $x$ microseconds means delaying for $1/(1000000/x)$-th of a second, that is, $x/1000000$ seconds (because $(1000000/x) \times x\,\mu\text{s} = 1000000\,\mu\text{s}$).

17
Mathematical theory
PROBLEM: Given the desired delay-time in microseconds, express the desired delay-time in clock-frequency pulses and program that number into the PIT's Latch-Register.
RECALL: Clock-Frequency = 1193182 Hertz (pulses per second)
ALSO: One second equals one million microseconds
APPLYING DIMENSIONAL ANALYSIS:
$$\text{Pulses-Per-Microsecond} = \frac{\text{Pulses-Per-Second}}{\text{Microseconds-Per-Second}}$$
$$\text{Delay-in-Clock-Pulses} = \text{Delay-in-Microseconds} \times \text{Pulses-Per-Microsecond}$$
CONCLUSION: For a desired time-delay of $x$ microseconds, the number of clock pulses may be computed as
$$x \times \frac{1193182}{1000000} = \frac{1193182 \times x}{1000000}$$
Equivalently, this is $x$ divided by the pulse period of $1000000/1193182$ microseconds, since dividing by a fraction amounts to multiplying by that fraction's reciprocal.
18
Delaying for EAX microseconds
# We compute the value for the 8254 Timer's Channel-2 Latch-register.
# Delaying for EAX microseconds means that the Latch-register's value is
# a certain fraction of one full second's worth of input pulses:
#     fraction = (EAX microseconds) / (one-million microseconds-per-second)
# Thus the latch-value should be fraction * (1193182 pulses-per-second),
# which we can compute by doing a multiplication followed by a division.

        mov     %eax, %ecx        # copy the delay to ECX
        mov     $1193182, %eax    # setup input-frequency in EAX
        mul     %ecx              # multiplied by microseconds (EDX:EAX)
        mov     $1000000, %ecx    # setup one-million as a divisor
        div     %ecx              # so quotient will be Latch-value

# Quotient in register AX should be written to the timer's Latch Register.
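The same multiply-then-divide can be written in C with a 64-bit intermediate, mirroring how MUL leaves a 64-bit product in EDX:EAX before DIV. A minimal sketch (the function name is hypothetical); the assertion reflects that the 8254 latch is only 16 bits wide, which limits a single countdown to about 54.9 ms:

```c
#include <assert.h>
#include <stdint.h>

#define PIT_INPUT_HZ 1193182u   /* 8254 input clock, pulses per second */

/* Latch value for a delay of 'us' microseconds: (1193182 * us) / 1000000. */
uint16_t pit_latch_for_us(uint32_t us)
{
    uint64_t pulses = (uint64_t)PIT_INPUT_HZ * us / 1000000u;
    assert(pulses <= 0xFFFFu);  /* must fit the 16-bit latch register */
    return (uint16_t)pulses;
}
```

For the two delays in the protocol this gives 11931 pulses for 10,000 microseconds and 238 pulses for 200 microseconds (the fractional part is truncated, as DIV does).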
19
Intel's MP terminology
  • When an MP system starts up, one of the CPUs will
    be selected to handle the boot procedures,
    while the other CPUs sleep
  • The BSP is this BootStrap Processor, and every
    other processor is known as an AP (i.e., a
    so-called Application Processor)

[Diagram: one BSP and three APs]
20
parallel computing principles
  • When it's awakened, each processor will need its
    own private stack area, so it can handle any
    interrupts or procedure calls without modifying
    an area in memory which another processor is
    also using
  • And whenever two or more processors do share
    write-access to any memory area, then those
    accesses must be serialized

21
atomic memory-access
  • Shared variables must not be modified by more
    than one processor at a time (atomic access)
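  • The x86 CPU's lock prefix helps enforce this
  • Example: every processor adds 1 to a counter
        lock
        incl    counter
  • Some instructions read and write memory in one step; xchg with a
    memory operand is locked automatically, but xadd still needs the
    lock prefix to be atomic
  • Example: all processors need private stacks
        mov     $0x1000, %ax      # stack size: 0x1000 paragraphs (64 KB)
        lock
        xadd    %ax, new_SS       # AX = old new_SS, and new_SS += 0x1000
        mov     %ax, %ss          # this CPU's private stack segment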
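In C11, the `lock xadd` idiom corresponds to `atomic_fetch_add`, which returns the old value and advances the shared one in a single indivisible step, so every caller receives a different stack segment. A minimal sketch; the function name and the starting value 0x2000 used in the check are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hand out distinct stack segments: returns the old value of *next_ss
   and advances it by 'step' atomically, like "lock xadd %ax, new_SS". */
uint16_t alloc_stack_segment(_Atomic uint16_t *next_ss, uint16_t step)
{
    return atomic_fetch_add(next_ss, step);
}
```

Starting from 0x2000 with a step of 0x1000, successive callers get 0x2000, 0x3000, and so on, no matter how their calls interleave.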

22
ROM-BIOS isn't reentrant
  • The video service-functions in ROM-BIOS (often
    used to display a message-string at the current
    cursor-location and afterward advance the
    cursor) modify global storage locations (as well
    as i/o ports), and hence must be called by only one
    processor at a time
  • A shared memory-variable (called mutex) is used
    to enforce this mutual exclusion

23
Implementing a spinlock
# Here is a global variable, which all of the processors can modify
mutex:  .word   1               # initial value for variable is 1

# Here is a prologue and epilog for using this variable to enforce
# mutually exclusive access to a section of non-reentrant code
spin:   btw     $0, mutex       # test bit 0 to see if mutex is free
        jnc     spin            # spin if the mutex is not available
        lock                    # else request exclusive bus-access
        btrw    $0, mutex       # and try to grab mutex ownership
        jnc     spin            # unsuccessful? then try again

        <CRITICAL SECTION OF NON-REENTRANT CODE>

        btsw    $0, mutex       # release the mutex when finished
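The same test-and-test-and-set pattern can be sketched with C11 atomics, keeping the slide's convention that 1 means free and 0 means held. Here `atomic_exchange` plays the role of `lock btrw`: it stores 0 and reports whether the old value was 1. The function names are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* mutex: 1 = free, 0 = held (like "mutex: .word 1"). */
void spin_acquire(atomic_int *mutex)
{
    for (;;) {
        /* like "btw $0, mutex / jnc spin": wait without writing while held */
        while (atomic_load_explicit(mutex, memory_order_relaxed) == 0)
            ;
        /* like "lock btrw $0, mutex": clear it and look at the old value */
        if (atomic_exchange_explicit(mutex, 0, memory_order_acquire) == 1)
            return;            /* it was free, so we own it now */
        /* another CPU won the race: try again */
    }
}

/* Single attempt without spinning; true if the mutex was obtained. */
bool spin_try_acquire(atomic_int *mutex)
{
    return atomic_exchange_explicit(mutex, 0, memory_order_acquire) == 1;
}

/* like "btsw $0, mutex": mark the mutex free again */
void spin_release(atomic_int *mutex)
{
    atomic_store_explicit(mutex, 1, memory_order_release);
}
```

The read-only inner loop mirrors the slide's `btw` test: CPUs only attempt the locked operation once the mutex looks free.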
24
Demo: mphello.s
  • Each CPU needs to access its Local-APIC
  • The BSP (Boot-Strap Processor) wakes up other
    processors by broadcasting the INIT-SIPI-SIPI
    message-sequence
  • Each AP (Application Processor) starts
    executing at a 4K page-boundary, and needs its
    own private stack-area
  • Shared variables require atomic access

25
Demo's organization
  • MAIN: the BSP will execute these calls
        call    allow_4GB_access
        call    display_APIC_LocalID
        call    broadcast_AP_starup
        call    delay_until_APs_halt
  • initAP: each AP will execute these calls
        call    allow_4GB_access
        call    display_APIC_LocalID

26
In-class exercise 1
  • Add a call to this procedure by each of the
    processors, but do it without using a lock
    prefix (and outside mutex-protected code)
  • Then let the BSP print the value of total

total:  .word   0               # include this shared global-variable

add_one_thousand:               # let each processor call this subroutine
        mov     $1000, %cx      # repeat 1000 times
nxadd:  addw    $1, total       # unprotected read-modify-write of total
        loop    nxadd
        ret
27
Binary-to-Decimal
  • Recall algorithm for converting numbers to
    decimal digit-strings (for console display)

num2dec:  # converts value in register AX to a decimal string at DS:DI
        mov     $10, %bx        # setup the number-base in BX
        xor     %cx, %cx        # setup remainder-count in CX
nxdiv:  xor     %dx, %dx        # extend AX to a doubleword
        div     %bx             # divide the doubleword by ten
        push    %dx             # save remainder on the stack
        inc     %cx             # and count this remainder
        or      %ax, %ax        # was the quotient zero yet?
        jnz     nxdiv           # no, generate another digit
nxdgt:  pop     %dx             # recover saved remainder
        add     $'0', %dl       # convert remainder to ASCII
        mov     %dl, (%di)      # store numeral in output-buffer
        inc     %di             # and advance buffer-pointer
        loop    nxdgt           # again for other remainders
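The same divide-by-ten algorithm can be written in C, with a small array playing the role of the CPU stack so the remainders come back out in reverse (most significant digit first). A minimal sketch; the function name and the choice to return the digit count instead of a terminated string are assumptions:

```c
#include <assert.h>
#include <stddef.h>

/* Write the decimal digits of 'value' into buf (no terminator) and
   return how many were written, like num2dec with CX as the count. */
size_t num2dec(unsigned value, char *buf)
{
    char stack[16];                         /* stands in for the CPU stack */
    size_t count = 0;                       /* like CX: remainders pushed */
    do {
        stack[count++] = (char)(value % 10);   /* push remainder */
        value /= 10;                           /* quotient is next dividend */
    } while (value != 0);                      /* "or %ax, %ax / jnz nxdiv" */
    size_t n = count;
    while (count > 0)                          /* pop, convert to ASCII, store */
        *buf++ = (char)('0' + stack[--count]);
    return n;
}
```

The do-while loop matches the assembly's structure: at least one digit is always produced, so zero prints as "0".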
28
In-class exercise 2
  • Using a Core-2 Quad processor, we might expect the
    value of total to be 4000
  • But see if that's what actually happens!
  • Without the lock prefix, the four CPUs may all
    try to increment total at once, resulting in a
    logically incorrect total
  • So fix this problem (by using a lock prefix
    ahead of the addw $1, total instruction)
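The effect of the two in-class exercises can be reproduced on an ordinary multicore host with threads standing in for CPUs. This is a hedged sketch, not the demo: it assumes POSIX threads, and the unlocked case is modeled as a separate atomic load and store (defined behavior in C, but not one indivisible step), which approximates an unlocked `addw`:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NCPUS 4                 /* stand-in for the Core 2 Quad's four CPUs */

static atomic_int total;        /* like "total: .word 0" */
static bool use_lock;           /* set before the threads are created */

/* Like add_one_thousand: add 1 to the shared total, 1000 times. */
static void *add_one_thousand(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        if (use_lock) {
            /* one indivisible read-modify-write, like "lock addw $1, total" */
            atomic_fetch_add(&total, 1);
        } else {
            /* separate read and write: another CPU's update can be lost */
            int v = atomic_load_explicit(&total, memory_order_relaxed);
            atomic_store_explicit(&total, v + 1, memory_order_relaxed);
        }
    }
    return NULL;
}

/* Run the exercise on NCPUS threads and return the final total. */
int run_exercise(bool locked)
{
    pthread_t t[NCPUS];
    atomic_store(&total, 0);
    use_lock = locked;
    for (int i = 0; i < NCPUS; i++) {
        int rc = pthread_create(&t[i], NULL, add_one_thousand, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NCPUS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&total);
}
```

`run_exercise(true)` always returns 4000, while `run_exercise(false)` may return less on a multicore machine (it can also happen to reach 4000 when the threads do not overlap).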

29
Do you need a barrier?
  • You can use a software construct, known as a
    barrier, to stop CPUs from entering a block of
    code until a prescribed number of them are all
    ready to enter it together (i.e., simultaneously)
  • This may be helpful with the in-class exercises

arrived: .word  0               # allocate a shared global variable

barrier:
        lock                    # acquire exclusive bus-access
        incw    arrived         # each cpu adds 1 to the variable
await:  cmpw    $4, arrived     # are four cpus ready to proceed?
        jb      await           # no, wait for others to arrive here
        call    add_one_thousand  # then proceed together
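The counting barrier above translates directly to C11 atomics: each participant atomically increments a shared counter and then spins until the counter reaches the expected number. A hedged sketch assuming POSIX threads as stand-ins for CPUs; the names `barrier_wait`, `cpu_body`, `run_barrier_demo` and the `early` counter are hypothetical:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NCPUS 4

static atomic_int arrived;      /* like "arrived: .word 0" */
static atomic_int early;        /* threads that got past the barrier too soon */

/* Like "lock incw arrived" followed by "cmpw $4, arrived / jb await". */
static void barrier_wait(atomic_int *count, int n)
{
    atomic_fetch_add(count, 1);
    while (atomic_load(count) < n)
        ;                       /* spin until all n participants have arrived */
}

static void *cpu_body(void *arg)
{
    (void)arg;
    barrier_wait(&arrived, NCPUS);
    if (atomic_load(&arrived) < NCPUS)   /* cannot happen once past the barrier */
        atomic_fetch_add(&early, 1);
    /* ... code to be entered together, e.g. add_one_thousand ... */
    return NULL;
}

/* Run NCPUS threads through the barrier; returns how many passed early. */
int run_barrier_demo(void)
{
    pthread_t t[NCPUS];
    atomic_store(&arrived, 0);
    atomic_store(&early, 0);
    for (int i = 0; i < NCPUS; i++) {
        int rc = pthread_create(&t[i], NULL, cpu_body, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NCPUS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&early);
}
```

Like the assembly version, this barrier is single-use: the counter is never reset, so reusing it would need a reset step or a generation counter.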