1
  • Improving IPC by Kernel Design
    - J. Liedtke
  • German National Research Center for Computer
    Science
  • Presented by Karthik Chandrasekar

2
Introduction
  • The µ-kernel performs
  • Memory Management - Virtual Address Spaces
  • Thread Management
  • IPC
  • IPC provides
  • Modularity
  • Security
  • Scalability

3
Related Work
  • Mach
  • RPC-oriented IPC
  • SRC RPC
  • Special path for context switching
  • Shared memory buffers
  • LRPC
  • Simple stubs
  • Direct context switching
  • Shared memory buffers

4
Approach Needed
  • Synergetic approach in Design and Implementation
    guided by IPC requirements
  • Architectural Level
  • Algorithm Level
  • Interface Level
  • Coding Level

5
L3 Operating System
  • Data type - Task
  • Threads communicate via messages using task and
    thread ids (even device drivers and interrupts -
    messages delivered by the µ-kernel)
  • Data space → address space
  • Persistence of data and threads
  • Clans and Chiefs model for message integrity

6
Principles and Methods for IPC Improvement
  • Reconstruction of the following in L3
  • Process Control
  • Communication
  • Newer Implementation of the following in L3
  • IPC
  • Thread Management
  • Architecture
  • Principles
  • IPC performance is the key
  • Design decisions require a performance discussion
  • Poorly performing techniques are penalized
  • Synergetic effects
  • Synergy at all levels

7
Performance Metrics defined using Null IPC
  • Achieved 250 cycles (5 µs on the 50-MHz 486)

8
Architectural Level
  1. System Calls - Kernel Mode
  2. Messages
  3. Direct Transfer by Temporary Mapping
  4. Strict Process Orientation
  5. Control Blocks as Virtual Objects

9
Architectural Level - System Calls
  • I. System Calls - Kernel Mode (40)
  • Call
  • Reply & Receive Next
  • Instead of
  • Call
  • Reply
  • Send
  • Receive
  • No need for scheduling to handle replies
    differently from requests!
10
Architectural Level - Messages
  • II. Messages (20)
  • A sequence of send operations can be combined if
    no intermediate reply is required!
  • E.g. text to screen driver
  • Operation code
  • Coordinates
  • Text string specified by address and length

11
Architectural Level - Messages
  • User-level sender and receiver buffer structures
    are the same
  • A complex message can be sent while program
    variables stay in place
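A complex message of this kind might look as follows; the field names and layout are illustrative, not taken from the paper:

```c
/* Hypothetical layout of a complex message: direct data plus strings
 * passed by address and length, so program variables need not be
 * marshalled into one contiguous buffer before sending. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct string_item {            /* string passed by reference */
    const char *addr;
    size_t      len;
};

struct screen_msg {             /* e.g. "write text" to a screen driver */
    unsigned opcode;            /* operation code */
    unsigned x, y;              /* coordinates */
    struct string_item text;    /* text specified by address and length */
};

/* Sender and receiver use the same structure, so the kernel can copy
 * the direct part and each referenced string in a single pass. */
static struct screen_msg make_write_msg(unsigned x, unsigned y,
                                        const char *s)
{
    struct screen_msg m = { 1u, x, y, { s, strlen(s) } };
    return m;
}
```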

12
Architectural Level - Transfer
  • III. Direct Transfer by Temporary Mapping
  • Map the target region of the destination address
    space into the sender's communication window, then
    copy the message into it! (One window per address
    space, kernel-accessible only)
  • Issues
  • Mapping must be fast
  • Parallel threads must co-exist in the same
    address space

13
Architectural Level - Transfer
  • 1024 entries in the page directory, each pointing
    to a 1024-entry 2nd-level table → 4 GB
  • Copying 1 word of B's page directory into A's
    maps 4 MB of B's space via one 1024-entry
    2nd-level table
  • Data page size 4 KB
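The arithmetic behind these numbers can be checked directly. A small sketch of the 486 two-level paging layout (constants follow the slide, not kernel code):

```c
/* 486 two-level paging arithmetic: the top 10 bits of a 32-bit
 * address index the page directory, so each of the 1024 directory
 * entries covers 4 MB, and 1024 x 4 MB = 4 GB total. */
#include <assert.h>

#define PDIR_SHIFT 22u                  /* top 10 bits select the entry */
#define PDIR_SPAN  (1u << PDIR_SHIFT)   /* 4 MB covered per entry */

/* which page-directory entry maps a given virtual address */
static unsigned pdir_index(unsigned vaddr)
{
    return vaddr >> PDIR_SHIFT;
}
```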

14
Architectural Level - Transfer
  • TLB flush is mandatory for a proper mapping - data
    integrity!
  • Approaches to maintain a clean TLB
  • One thread per address space (process)
  • Clean TLB
  • Thread access - TLB clean
  • Transfer (copy) - TLB clean
  • Address space switch after copying in an IPC -
    use TLB window clean
  • Multiple threads per address space - using one
    window!!
  • TLB is flushed when thread switching does not
    affect the address space
  • Invalidate communication window values (will lead
    to a page fault)
  • Multiprocessor: one window per address space per
    processor
  • Different processors: special TLB flush for
    multiple address space support

15
Architectural Level - Process
  • IV. Strict Process Orientation
  • Allocate one kernel stack per thread! - No stack
    switching or copying, as with continuations
  • Cheap, since kernel stacks live in virtual memory
  • Also supports Thread Control Blocks (TCBs)
  • Thread Control Block (TCB) information includes
  • A pointer so it can be chained into a linked
    list
  • The value of its stack pointer
  • A stack area that includes local variables
  • Thread number, type, priority and name
  • Age and resources granted

16
Architectural Level - TCB
  • V. Control Blocks as Virtual Objects
  • Virtual array to hold TCBs
  • Faster access: array base + tcb no. × tcb size
  • Saves 3 TLB misses
  • Direct access to destination TCB
  • Kernel stack is accessed using the stack pointer
    with a bit mask!
  • Sender kernel stack access
  • Receiver kernel stack access
  • Applicable to page directories:
  • mypdir[window] = pdir_dest[buffer >> 22]
  • mypdir[window + 1] = pdir_dest[(buffer >> 22) + 1]
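Both tricks on this slide fit in a few lines of C. This is an illustrative sketch (window slot, TCB size, and all names are assumptions, not L3's actual values):

```c
/* Sketch of (1) establishing the communication window by copying two
 * page-directory words from the destination's directory, and
 * (2) finding the current TCB by masking the kernel stack pointer,
 * which works because each per-thread kernel stack lives inside its
 * TCB at an aligned address. Constants are hypothetical. */
#include <assert.h>
#include <stdint.h>

#define WINDOW    8u        /* hypothetical window slot in the pdir */
#define TCB_ALIGN 0x1000u   /* assumed TCB size/alignment: 4 KB */

static void map_window(uint32_t *mypdir, const uint32_t *pdir_dest,
                       uint32_t buffer)
{
    /* one directory word maps 4 MB; the second word covers a buffer
     * that straddles a 4 MB boundary */
    mypdir[WINDOW]     = pdir_dest[buffer >> 22];
    mypdir[WINDOW + 1] = pdir_dest[(buffer >> 22) + 1];
}

static uint32_t tcb_of_stack(uint32_t ksp)
{
    return ksp & ~(TCB_ALIGN - 1u);   /* stack pointer AND bit mask */
}
```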

17
Algorithmic Level
  1. Thread Identifier
  2. Virtual Queues
  3. Timeouts and Wakeups
  4. Lazy Scheduling
  5. Direct Process Switch
  6. Short Messages Via Registers

18
Algorithmic Level - Thread Identifier
  • A thread is addressed by a 64-bit UID in user mode
  • Thread number
  • Generation
  • Station number
  • Chief id
  • Thread number in lower 32 bits of UID
  • AND with bit mask, ADD to TCB array base
  • Check for validity: compare the TCB's thread UID
    with the requested UID - 4 cycles
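The translation is a mask, a shift, and an add. A hedged sketch (base address, mask, and TCB size are illustrative assumptions):

```c
/* UID -> TCB translation sketch: the thread number sits in the low
 * 32 bits of the 64-bit UID; AND with a bit mask, scale by the TCB
 * size (a shift), add the TCB array base, then validate by comparing
 * the stored UID with the requested one. Constants are hypothetical. */
#include <assert.h>
#include <stdint.h>

#define TCB_BASE    0x00100000u   /* assumed base of the virtual TCB array */
#define TCB_SHIFT   12u           /* assumed TCB size: 4 KB */
#define THREAD_MASK 0x0001FFFFu   /* assumed valid thread-number bits */

static uint32_t tcb_addr(uint64_t uid)
{
    uint32_t tno = (uint32_t)uid & THREAD_MASK;   /* AND with bit mask */
    return TCB_BASE + (tno << TCB_SHIFT);         /* ADD to array base */
}

/* validity check: stored thread UID versus requested UID */
static int uid_valid(uint64_t stored_uid, uint64_t requested_uid)
{
    return stored_uid == requested_uid;
}
```

Note that an invalid UID still lands on some TCB slot; the comparison of the stored UID against the requested one is what catches stale or forged ids.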

19
Algorithmic Level - Virtual Queues
  • Busy queue, present queue and a polling-me queue
    per thread
  • Doubly linked lists, where the links are held in
    the TCBs
  • TCBs are chained in virtual address space, but
    parsing the chains and inserting or deleting TCBs
    never leads to page faults - TCBs are unlinked
    from the queues when they are unmapped!

20
Algorithmic Level - Timeouts and Wakeups
  • The frequently used timeout values are t = ∞ and
    t = 0
  • Wakeups
  • Array indexed by thread number - but scanned
    sequentially
  • A set of n unordered wakeup lists implemented by
    doubly linked lists. If a thread is entered with
    wakeup time t, its TCB is linked into list t
    mod n.
  • For a total of k threads, the scheduler has to
    inspect k/n entries per clock interrupt, on
    average.
  • Wakeup point far in the future - long-time
    wakeup list
  • n = 8, wakeup time 4 ms → 400 threads → 12500
    IPC/sec
  • → 1% of total (6% is IPC → 16%), but 25% of IPC
    use wakeups → 50,000 IPC/sec! → (1.5% IPC → 4%)
  • Base + offset to represent time → the base is
    updated every 2^24 offsets
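The unordered wakeup lists are simple to sketch. An illustrative version (n, the node layout, and all names are assumptions; in L3 the links live in the TCB):

```c
/* Sketch of the n unordered wakeup lists: a thread with wakeup time
 * t is linked into list t mod n, so each clock interrupt scans only
 * one short list (about k/n of k sleeping threads on average). */
#include <assert.h>

#define NLISTS 8   /* n = 8, as on the slide */

struct wakeup_node {
    struct wakeup_node *next, *prev;   /* links held in the TCB */
    unsigned long wakeup_time;
};

static struct wakeup_node *lists[NLISTS];

static unsigned list_of(unsigned long t)
{
    return (unsigned)(t % NLISTS);
}

/* link a TCB into the list for its wakeup time (push at head;
 * the lists are unordered, so position does not matter) */
static void enqueue_wakeup(struct wakeup_node *n, unsigned long t)
{
    unsigned i = list_of(t);
    n->wakeup_time = t;
    n->prev = 0;
    n->next = lists[i];
    if (lists[i])
        lists[i]->prev = n;
    lists[i] = n;
}
```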

21
Algorithmic Level - Lazy Scheduling
  • IPC operation: call or reply & receive next
  • Delete sending thread from ready queue
  • Insert it into waiting queue
  • Delete receiving thread from waiting queue
  • Insert it into ready queue
  • L3 queue invariants
  • Ready queue contains all ready threads
  • Waiting queue contains at least all waiting
    threads
  • TCB contains the thread's state (ready/waiting) -
    always updated!
  • Scheduler removes all threads not belonging to a
    queue during queue parsing - so the delete
    operation can be omitted!
  • Call and reply & receive next - the thread need
    not be inserted either!
  • Performs better with increasing IPC rate!
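The deferred delete can be sketched in a few lines. This is an illustrative reduction (the TCB and queue are stripped down to the two relevant fields):

```c
/* Lazy scheduling sketch: the IPC path only flips the state field in
 * the TCB and leaves the TCB chained in the ready queue; the scheduler
 * unlinks stale entries while it parses the queue anyway. */
#include <assert.h>
#include <stddef.h>

enum state { READY, WAITING };

struct tcb {
    struct tcb *next;
    enum state  state;
};

/* IPC path: flip the state, defer the queue delete (lazy). */
static void block_lazily(struct tcb *t)
{
    t->state = WAITING;
}

/* Scheduler: while parsing, drop every TCB that is no longer ready. */
static struct tcb *parse_ready_queue(struct tcb *head)
{
    struct tcb **pp = &head;
    while (*pp) {
        if ((*pp)->state != READY)
            *pp = (*pp)->next;      /* the deferred delete happens here */
        else
            pp = &(*pp)->next;
    }
    return head;
}
```

The work of deleting moves out of the hot IPC path and into the (rarer) scheduler parse, which is why the scheme pays off more as the IPC rate rises.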

22
Algorithmic Level Direct Process Switch
  • When B sends a reply to A and another thread C is
    waiting to send a message to B (polling B), C's
    IPC to B is initiated immediately, before A
    continues.
  • When multiple threads try to send messages to one
    receiver, it gets the messages in the sequence in
    which the IPC operations were invoked

23
Algorithmic Level - Short Messages Via Registers
  • A high proportion of messages are short
  • E.g. driver ack/error, hardware interrupts
  • 486
  • 7 general registers
  • 3 needed (sender id, result code)
  • 4 available
  • 8-byte messages using a coding scheme
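One way such a coding scheme can look (this packing is an illustration, not the paper's actual encoding): two of the free 32-bit registers carry an 8-byte message.

```c
/* Illustrative coding scheme for short messages: pack 8 message bytes
 * into two 32-bit words that would travel in general registers,
 * avoiding any memory copy for the common short-message case. */
#include <assert.h>
#include <stdint.h>

static void pack8(const unsigned char m[8], uint32_t *r0, uint32_t *r1)
{
    *r0 = (uint32_t)m[0]       | (uint32_t)m[1] << 8 |
          (uint32_t)m[2] << 16 | (uint32_t)m[3] << 24;
    *r1 = (uint32_t)m[4]       | (uint32_t)m[5] << 8 |
          (uint32_t)m[6] << 16 | (uint32_t)m[7] << 24;
}

static void unpack8(uint32_t r0, uint32_t r1, unsigned char m[8])
{
    for (int i = 0; i < 4; i++) {
        m[i]     = (unsigned char)(r0 >> (8 * i));
        m[4 + i] = (unsigned char)(r1 >> (8 * i));
    }
}
```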

24
Interface Level
  • Simple and short RPC stubs
  • Load registers, issue the system call, check
    success
  • Compiler generates stubs inline
  • Avoiding Unnecessary Copies
  • Complex Messages not arbitrarily mixed
  • Similar structuring helps in parsing and tracing
  • Sharing common variables
  • Parameter Passing
  • Use registers when possible
  • More efficient than stacks
  • Supports better code optimization

25
Coding Level
  • Reduce cache and TLB misses
  • Short kernel code
  • Use short jumps, registers, short address
    displacements
  • Frequently accessed data - 1-byte displacement
  • Data frequently used together - same cache line
    and access sequence
  • IPC-related kernel code in 1 page - else TLB
    flushes
  • Internal tables should sit with the data - same
    page (at least the heavily used entries) - 4 TLB
    misses avoided!
  • Handle save/restore of the coprocessor lazily
  • Delayed until a different thread needs to use it

26
Coding Level
  • Segment registers and general registers
  • Segment register loads take 9 cycles - instead use
    one flat segment covering the entire address
    space!
  • Loading a flat descriptor for every segment
    register - 66 cycles
  • Checking on entry and loading before returning -
    10 cycles
  • Fit the message in one 32-bit register
  • Counter accessed through an 8-bit register

27
Coding Level
  • Avoiding Jumps
  • Reduce Jump statements
  • Process Switch
  • Stack pointer change
  • Address space change
  • Co-processor
  • Lazy handling of save/restore

28
Summary of Techniques
  • Add new system calls (5.2.1)
  • Rich message structure, symmetry of send and
    receive buffers (5.2.2)
  • Single copy through temporary mapping (5.2.3)
  • Kernel stack per thread (5.2.4)
  • Control blocks held in virtual memory (5.2.5)
  • Thread UID structure (5.3.1)
  • Unlink TCBs from queues when unmapping (5.3.2)
  • Optimized timeout bookkeeping (5.3.3)
  • Lazy scheduling (5.3.4)
  • Direct process switch (5.3.5)
  • Pass short messages in registers (5.3.6)
  • Reduce cache misses (5.5.1) and TLB misses
    (careful placement) (5.5.2)
  • Optimize use of segment registers (5.5.3)
  • Make best use of general registers (5.5.4)
  • Avoid jumps and checks (5.5.5)
  • Minimize process switch activities (5.5.6)

29
Results
30
Results
31
32
Results
33
Remarks
  • Introducing ports
  • 1 port link table per address space → global port
    table
  • Port access: the port link table indexes into the
    global port table
  • Dash-like message passing
  • Same virtual address in both address spaces
  • Cache
  • Cache thrashing
  • Processor dependencies
  • Virtual address space and hierarchical mapping
  • Kernel access is expensive on the 486!

34
Conclusion
  • IPC improved by applying
  • Performance based reasoning
  • Synergetic effects
  • Architecture → coding

35
Discussion
  • Are the techniques used to improve the speed of
    IPC minor tweaks or significantly novel ideas?
  • Security Impact?
  • Virtual Machine Monitors!!

36
Thank you!