Title:
1. Improving IPC by Kernel Design
- J. Liedtke, German National Research Center
for Computer Science
- Presented by Karthik Chandrasekar
2. Introduction
- The µ-kernel performs
- Memory Management - Virtual Address Space
- Thread Management
- IPC
- IPC lends the following
- Modularity
- Security
- Scalability
3. Related Work
- Mach
- RPC oriented IPC
- SRC - RPC
- Special path for Context Switching
- Shared memory buffers
- LRPC
- Simple stubs
- Direct Context Switching
- Shared memory buffers
4. Approach Needed
- Synergetic approach in design and implementation,
guided by IPC requirements
- Architectural Level
- Algorithm Level
- Interface Level
- Coding Level
5. L3 Operating System
- Data type: Task
- Threads communicate via messages using task and
thread ids (even for device drivers and hardware
interrupts - delivered by the µ-kernel)
- Data space → address space
- Persistence of Data and Threads
- Clans and Chiefs Model for Message Integrity
6. Principles and Methods for IPC Improvement
- Reconstruction of the following in L3
- Process Control
- Communication
- Newer Implementation of the following in L3
- IPC
- Thread Management
- Architecture
- Principles
- IPC performance is the key
- Design decisions require a performance discussion
- Poor performers are punished
- Synergetic effects
- Synergy at all levels
7. Performance Metric Defined Using Null IPC
- Achieved 250 cycles (5 µs)
8. Architectural Level
- System Calls - Kernel Mode
- Messages
- Direct Transfer by Temporary Mapping
- Strict Process Orientation
- Control Blocks as Virtual Objects
9. Architectural Level - System Calls
- I. System Calls - Kernel Mode (40)
- Call
- Reply & Receive Next
- Instead of
- Call
- Reply
- Send
- Receive
- No need for scheduling to handle replies
differently from requests!
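The gain from the combined call can be sketched as a count of user/kernel crossings. This is a toy model, not L3 code; `reply_and_receive_next` and the crossing counter are illustrative names.

```c
#include <assert.h>

/* Hypothetical model: each system call costs one user/kernel
   crossing. Counting crossings shows why combining reply and
   receive-next into one call helps a server loop. */
static int crossings = 0;

static void kernel_entry(void) { crossings++; }

/* Classic server loop step: reply, then receive - two entries,
   plus a scheduler decision in between. */
static void reply_then_receive(void) {
    kernel_entry();   /* reply  */
    kernel_entry();   /* receive */
}

/* L3-style combined call: one entry, no scheduling in between. */
static void reply_and_receive_next(void) { kernel_entry(); }

/* Handle n requests; returns total kernel crossings. */
int handle_requests(int n, int combined) {
    crossings = 0;
    for (int i = 0; i < n; i++) {
        if (combined) reply_and_receive_next();
        else          reply_then_receive();
    }
    return crossings;
}
```

The combined call halves the crossings for a request/reply server, which is where the "no need for scheduling" point comes from: the kernel switches directly to the replied-to thread.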
10. Architectural Level - Messages
- II. Messages (20)
- A sequence of send operations can be combined if
no intermediate reply is required!
- E.g. text to screen driver:
- Operation Code
- Co-ordinates
- Text String specified by address and length
11. Architectural Level - Messages
- User-level sender and receiver buffer structures
are the same
- Sending a complex message maintaining program
variables
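The complex-message idea can be sketched in C, loosely following the text-to-screen-driver example above. The struct layout and the `gather` helper are illustrative assumptions, not the real L3 message format.

```c
#include <assert.h>
#include <string.h>

/* Sketch: a complex message mixes a direct part (values stored in
   the message) with indirect "string items" (address + length),
   so program variables are sent in place without pre-copying. */
struct string_item { const void *addr; unsigned len; };

struct msg {
    unsigned opcode;       /* direct part: operation code  */
    unsigned x, y;         /* direct part: coordinates     */
    struct string_item s;  /* indirect part: text in place */
};

/* The kernel would gather all parts in one copy to the receiver;
   here we model the gather into a flat buffer. Returns bytes. */
unsigned gather(const struct msg *m, unsigned char *out) {
    unsigned n = 0;
    memcpy(out + n, &m->opcode, sizeof m->opcode); n += sizeof m->opcode;
    memcpy(out + n, &m->x, sizeof m->x);           n += sizeof m->x;
    memcpy(out + n, &m->y, sizeof m->y);           n += sizeof m->y;
    memcpy(out + n, m->s.addr, m->s.len);          n += m->s.len;
    return n;
}
```

Because sender and receiver use the same buffer structure, no reformatting step is needed on either side.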
12. Architectural Level - Transfer
- III. Direct Transfer by Temporary Mapping
- Map the target region of the destination address
space into the sender's communication window and
then copy the message into it! (One window per
address space, only kernel accessible)
- Issues
- Mapping must be fast
- Parallel Threads must co-exist in the same
address space
13. Architectural Level - Transfer
- 1024 entries in the page directory, each
corresponding to a 1024-entry 2nd-level table → 4 GB
- Copying 1 word of B's page directory into A's
makes 4 MB of B's space (one 1024-entry 2nd-level
table) visible in A
- Data page size: 4 KB
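The temporary mapping reduces to two word copies, matching the page-directory arithmetic above. A minimal sketch, assuming 386/486-style two-level paging; `map_window`, the window slot index, and the array representation are illustrative, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: each page-directory entry (PDE) covers 4 MB (1024 pages
   of 4 KB). A message buffer of up to 4 MB spans at most two 4 MB
   regions, so mapping it into the kernel-only communication
   window is at most two PDE copies. */
#define PDE_SHIFT 22   /* bits 31..22 of an address select the PDE */

typedef uint32_t pde_t;

/* Copy the two directory entries covering the destination buffer
   into the window slots of the current page directory. */
void map_window(pde_t *my_pdir, unsigned window,
                const pde_t *dest_pdir, uint32_t buf) {
    my_pdir[window]     = dest_pdir[buf >> PDE_SHIFT];
    my_pdir[window + 1] = dest_pdir[(buf >> PDE_SHIFT) + 1];
}
```

This is why "mapping must be fast" is achievable: no page tables are walked or allocated, only two 32-bit words are copied.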
14. Architectural Level - Transfer
- TLB flush is mandatory for proper mapping - data
integrity!
- Approaches to maintain a clean TLB
- One thread per address-space process → clean TLB
- Thread access → TLB clean
- Transfer (copy) → TLB clean
- Address space switch after copying in an IPC →
use TLB; window clean
- Multiple threads per address space → using one
window!!
- TLB is flushed when thread switching does not
affect the address space
- Invalidate communication window values (will lead
to a page fault)
- Multiprocessor: one window per address space per
processor
- Different processors: special TLB flush for
multiple address space support
15. Architectural Level - Process
- IV. Strict Process Orientation
- Allocate one kernel stack per thread!
- No stack switching or copying as in continuations
- Cheap, since the stacks live in virtual memory
- Also supports thread control blocks (TCBs)
- TCB information includes
- A pointer so it can be chained into a linked list
- Value of its stack pointer
- A stack area that includes local variables
- Thread number, type, priority and name
- Age and resources granted
16. Architectural Level - TCB
- V. Control Blocks as Virtual Objects
- Virtual array to hold TCBs
- Faster access: tcb address = array base + tcb
number × tcb size
- Saves 3 TLB misses
- Direct access to destination TCB
- Kernel stack is accessed using the stack pointer
with a bit mask!
- Sender kernel stack access
- Receiver kernel stack access
- Applicable to page directories:
- mypdir[window] = pdir_dest[buffer >> 22]
- mypdir[window + 1] = pdir_dest[(buffer >> 22) + 1]
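The two address computations on this slide can be sketched directly. The TCB size and array base below are assumed example values (any power-of-two size works); only the masking/indexing pattern is the point.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: with one kernel stack per thread placed inside its TCB,
   and TCBs held in a virtual array, both lookups are a couple of
   ALU operations - no table walk, no extra memory reference. */
#define TCB_SIZE 1024u         /* power of two; assumed value   */
#define TCB_BASE 0xC0000000u   /* virtual array base; assumed   */

/* Current TCB: mask the stack-offset bits off the stack pointer. */
static inline uint32_t current_tcb(uint32_t esp) {
    return esp & ~(TCB_SIZE - 1);
}

/* Any TCB: array base + tcb number * tcb size. */
static inline uint32_t tcb_of(uint32_t thread_no) {
    return TCB_BASE + thread_no * TCB_SIZE;
}
```

Because the array lives in virtual memory, a TCB not yet resident simply page-faults on first touch rather than requiring an explicit existence check.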
17. Algorithmic Level
- Thread Identifier
- Virtual Queues
- Timeouts and Wakeups
- Lazy Scheduling
- Direct Process Switch
- Short Messages Via Registers
18. Algorithmic Level - Thread Identifier
- Thread addressed by 64-bit UID in user mode
- Thread number
- Generation
- Station number
- Chief id
- Thread number in lower 32 bits of UID
- AND with bit mask, ADD to TCB array base
- Check validity: compare the TCB's stored UID with
the requested UID - 4 cycles
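The UID-to-TCB lookup and validity check can be sketched as follows. The field widths and mask are assumed example values; the pattern (mask, index, single compare) is what the slide describes.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: the low 32 bits of the 64-bit UID hold the thread
   number; mask them to index the TCB array, then one compare of
   the stored UID against the requested UID validates generation,
   station and chief in a single step. */
struct tcb { uint64_t uid; };

static struct tcb tcbs[1 << 12];   /* stands in for the virtual
                                      TCB array; size assumed    */

struct tcb *lookup(uint64_t uid) {
    /* AND with bit mask, ADD to array base (here: array index) */
    struct tcb *t = &tcbs[(uint32_t)uid & 0xFFF];
    /* validity check: one 64-bit compare */
    return (t->uid == uid) ? t : 0;
}
```

A stale UID (e.g. an old generation for a reused thread number) lands on the right TCB slot but fails the compare, so dangling identifiers are rejected without any extra table.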
19. Algorithmic Level - Virtual Queues
- Busy queue, present queue and a polling-me queue
per thread
- Doubly linked lists, where the links are held in
the TCBs
- TCBs are chained in virtual address space, but
parsing the chains and inserting or deleting TCBs
never leads to page faults - TCBs are unlinked from
the queues when they are unmapped!
20. Algorithmic Level - Timeouts and Wakeups
- The frequently used values t = ∞ and t = 0
- Wakeups
- Array indexed by thread number - but sequential
- A set of n unordered wakeup lists implemented by
doubly linked lists. If a thread is entered with
wakeup time t, its TCB is linked into list (t mod
n).
- For a total of k threads, the scheduler will have
to inspect k/n entries per clock interrupt, on
average.
- Wakeup point far in the future → long-time wakeup
list
- n = 8, wakeup time 4 ms → 400 threads → 12,500
IPC/sec
- ≤ 1% of total (6% is IPC → 16%), but 25% of IPC
use wakeups → 50,000 IPC/sec! (1.5% IPC → 4%)
- Base + offset representation of time → offset
updated every 2^24
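The (t mod n) bucketing above can be sketched in a few lines. Singly linked lists are used here for brevity (the slide says doubly linked); names are illustrative.

```c
#include <assert.h>

/* Sketch: n unordered wakeup lists. A thread with wakeup time t
   goes into list (t mod n), so on a tick for time t the scheduler
   scans only that one list - about k/n of k pending wakeups. */
#define N_LISTS 8

struct wakeup { unsigned t; struct wakeup *next; };

static struct wakeup *lists[N_LISTS];

void enqueue_wakeup(struct wakeup *w) {
    unsigned i = w->t % N_LISTS;   /* pick bucket by t mod n */
    w->next = lists[i];
    lists[i] = w;                  /* unordered: push at head */
}
```

The lists stay unordered on purpose: insertion is O(1), and the per-tick scan only has to test each entry's wakeup time against the current time.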
21. Algorithmic Level - Lazy Scheduling
- IPC operation call or reply & receive next:
- Delete sending thread from ready queue
- Insert into waiting queue
- Delete receiving thread from waiting queue
- Insert into ready queue
- L3 queue invariants
- Ready queue contains all ready threads
- Waiting queue contains at least all waiting
threads
- TCB contains the thread's state (ready/waiting) -
kept updated!
- Scheduler removes all threads no longer belonging
to a queue during queue parsing - the delete
operation can be omitted!
- Call and reply & receive next - the thread need
not be inserted!
- Performs better with increasing IPC rate!
22. Algorithmic Level - Direct Process Switch
- When B sends a reply to A and another thread C is
waiting to send a message to B (polling B), C's IPC
to B is immediately initiated before continuing A.
- When multiple threads try to send messages to one
receiver, it gets the messages in the sequence in
which the IPC operations were invoked.
23. Algorithmic Level - Short Messages Via Registers
- A high proportion of messages are short
- E.g. driver ack/error, hardware interrupts
- On the 486:
- 7 general registers
- 3 needed (sender id, result code)
- 4 available
- 8-byte messages using a coding scheme
24. Interface Level
- Simple and short RPC stubs
- Load registers, issue system call, check success
- Compiler generates stubs inline
- Avoiding unnecessary copies
- Complex messages not arbitrarily mixed
- Similar structuring helps in parsing and tracing
- Sharing common variables
- Parameter passing
- Use registers when possible
- More efficient than stacks
- Supports better code optimization
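A minimal inline stub following the "load registers, issue system call, check success" pattern can be sketched like this. The system call is replaced by a plain function so the example runs anywhere; `fake_ipc` and `rpc_increment` are hypothetical names, not L3's API.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel's short-message IPC path: takes the
   "register" arguments, returns a result word and a status. */
static int fake_ipc(uint32_t dest, uint32_t w0, uint32_t *result) {
    (void)dest;          /* destination unused in this stand-in */
    *result = w0 + 1;    /* pretend the server increments w0    */
    return 0;            /* 0 = success                         */
}

/* The whole stub: load registers, trap, check success. Short
   enough that the compiler inlines it at every call site. */
static inline uint32_t rpc_increment(uint32_t dest, uint32_t x) {
    uint32_t r;
    int rc = fake_ipc(dest, x, &r);  /* load regs + system call */
    assert(rc == 0);                 /* check success           */
    return r;
}
```

Keeping the stub this small is what makes register-based parameter passing pay off: there is no marshaling layer between the caller's variables and the IPC registers.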
25. Coding Level
- Reduce cache and TLB misses
- Short kernel code
- Use short jumps, registers, short address
displacements
- Frequently accessed data: 1-byte displacement
- Data frequently used together: same cache line
and access sequence
- IPC-related kernel code in 1 page - else extra
TLB misses
- Internal tables should be with the data - same
page (at least the heavily used entries) - 4 TLB
misses avoided!
- Handle save/restore of the coprocessor lazily
- Delayed until a different thread needs to use it
26. Coding Level
- Segment registers and general registers
- Segment registers take 9 cycles to load - instead
use one flat segment covering the entire address
space!
- Loading a flat descriptor for every segment
register: 66 cycles
- Checking on entry and loading before returning:
10 cycles
- Fit message in one 32-bit register
- Counter accessed through an 8-bit register
27. Coding Level
- Avoiding Jumps
- Reduce Jump statements
- Process Switch
- Stack pointer change
- Address space change
- Co-processor
- Lazy handling save/restore
28. Summary of Techniques
- Add new system calls (5.2.1)
- Rich message structure, symmetry of send and
receive buffers (5.2.2)
- Single copy through temporary mapping (5.2.3)
- Kernel stack per thread (5.2.4)
- Control blocks held in virtual memory (5.2.5)
- Thread uid structure (5.3.1)
- Unlink tcbs from queues when unmapping (5.3.2)
- Optimized timeout bookkeeping (5.3.3)
- Lazy scheduling (5.3.4)
- Direct process switch (5.3.5)
- Pass short messages in register (5.3.6)
- Reduce cache misses (5.5.1) and TLB misses
(careful placement) (5.5.2)
- Optimize use of segment registers (5.5.3)
- Make best use of general registers (5.5.4)
- Avoid jumps and checks (5.5.5)
- Minimize process switch activities (5.5.6)
29-32. Results (charts only; no transcript)
33. Remarks
- Introducing ports
- 1 port link table per address space → global port
table
- Port access: port link table + port index →
global port table
- DASH-like message passing
- Same virtual address in both address spaces
- Cache
- Cache thrashing
- Processor dependencies
- Virtual address space and hierarchical mapping
- Kernel access expensive on the 486!
34. Conclusion
- IPC improved by applying
- Performance-based reasoning
- Synergetic effects
- Architecture → coding
35. Discussion
- Are the techniques used to improve the speed of
IPC minor tweaks or significantly novel ideas?
- Security impact?
- Virtual machine monitors!!
36. Thank you!