Effect of Context Aware Scheduler on TLB

1
Effect of Context Aware Scheduler on TLB
  • Satoshi Yamada and Shigeru Kusakabe
  • Kyushu University

2
Contents
  • Introduction
  • Effect of Sibling Threads on TLB
  • Context Aware Scheduler (CAS)
  • Benchmark Applications and Measurement
    Environment
  • Result
  • Related Work
  • Conclusion

3
Contents
  • Introduction
  • What is Context?
  • Motivation
  • Task Switch and Cache
  • Approach of our Scheduler
  • Effect of Sibling Threads on TLB
  • Context Aware Scheduler (CAS)
  • Benchmark Applications and Measurement
    Environment
  • Result
  • Related Work
  • Conclusion

4
What is Context?
  • Definition in this presentation
  • Context = memory address space
  • Task switch: a switch from one task (thread) to another
  • Context switch: a task switch that also switches the
    memory address space

5
Motivation
  • More chances of using native threads in today's OSes
  • Java, Perl, Python, Erlang, and Ruby
  • OpenMP, MPI
  • As the number of threads grows, the overhead
    due to task switches tends to get heavier
  • Agarwal, et al. Cache performance of operating
    system and multiprogramming workloads (1988)

6
Task Switch and Cache
  • Overhead due to a task switch
  • includes the cost of loading the working set of the
    next process
  • is deeply related to the utilization of caches
  • Mogul, et al. The effect of context switches
    on cache performance (1991)

[Figure: on a task switch between processes A and B, the two
working sets overflow the cache]
7
Approach of our Scheduler
  • Three solutions to reduce the overhead due to
    task switches
  • Agarwal, et al. Cache performance of operating
    system and multiprogramming workloads (1988)
  • 1. Increase the size of caches
  • 2. Reuse the data shared among threads
  • 3. Utilize tagged caches and/or restrain cache
    flushes

We utilize sibling threads to achieve 2. and 3.
We mainly discuss 3.
8
Contents
  • Introduction
  • Effect of Sibling Threads on TLB
  • Working Set and Task Switch
  • TLB tag and Task Switch
  • Advantage of Sibling Threads
  • Effect of Sibling Threads on Task Switches
  • Context Aware Scheduler (CAS)
  • Benchmark Applications and Measurement
    Environment
  • Result
  • Related Work
  • Conclusion

9
Working Set and Task Switch
  • Task Switch with small overhead
  • Task Switch with large overhead

[Figure: a task switch between threads sharing a working set has
small overhead; a switch between processes A and B with separate
working sets has large overhead]
10
TLB and Task Switch
[Figure: a tagged TLB stores an address-space ID (e.g. 2056, 496)
with each entry; a non-tagged TLB stores only virtual addresses
valid for the current address space]
  • Tagged TLB: TLB flush is not necessary (ARM,
    MIPS, etc.)
  • Non-tagged TLB: TLB flush is necessary (x86, etc.)
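The difference can be sketched as a toy model (illustrative only, not actual hardware behavior; the class and method names are made up for this sketch): a tagged TLB matches on an address-space ID and keeps its entries across a switch, while a non-tagged TLB must flush them.

```python
# Toy model of TLB behavior on an address-space switch (illustrative only).

class TaggedTLB:
    """Entries carry an address-space ID (ASID); no flush on a switch."""
    def __init__(self):
        self.entries = set()          # (asid, virtual_page) pairs
    def insert(self, asid, vpage):
        self.entries.add((asid, vpage))
    def switch_address_space(self, new_asid):
        pass                          # nothing to do: lookups match on ASID
    def lookup(self, asid, vpage):
        return (asid, vpage) in self.entries

class NonTaggedTLB:
    """Entries are valid only for the current address space; flush on a switch."""
    def __init__(self):
        self.entries = set()          # virtual pages of the current space
    def insert(self, vpage):
        self.entries.add(vpage)
    def switch_address_space(self):
        self.entries.clear()          # TLB flush: all cached translations lost
    def lookup(self, vpage):
        return vpage in self.entries

tagged, plain = TaggedTLB(), NonTaggedTLB()
tagged.insert(2056, 0x0123); plain.insert(0x0123)
tagged.switch_address_space(496); plain.switch_address_space()
print(tagged.lookup(2056, 0x0123))   # True: the entry survives the switch
print(plain.lookup(0x0123))          # False: the entry was flushed
```

This is why, on a non-tagged TLB, avoiding address-space switches (as CAS does) directly avoids TLB flushes.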

11
Advantage of Sibling Threads
[Figure: fork() copies the parent's task_struct, mm_struct, and
signal_struct to create a PROCESS; creating a THREAD shares the
parent's mm, signal, and file structures]
  • Advantage on task switches
  • Higher possibility of sharing data among sibling
    threads
  • Context switch does not happen
  • Restrain TLB flushes in non-tagged TLB

12
Effect of Sibling Threads on Task Switches:
Measurement
We use the idea of the lat_ctx program in LMbench
13
Effect of Sibling Threads on Task Switches: Results
(sibling threads / process)
14
Contents
  • Introduction
  • Effect of Sibling Threads on TLB
  • Context Aware Scheduler (CAS)
  • O(1) Scheduler in Linux
  • Context Aware Scheduler (CAS)
  • Benchmark Applications and Measurement
    Environment
  • Result
  • Related Work
  • Conclusion

15
O(1) Scheduler in Linux
  • Structure
  • active queue and expired queue
  • priority bitmap and an array of linked lists of
    threads
  • Behavior
  • search the priority bitmap and choose a thread with
    the highest priority
  • Scheduling overhead
  • independent of the number of threads

[Figure: active and expired runqueues, each a priority bitmap over
an array of thread lists (threads A, B, C, D); the processor runs
a thread from the highest set priority in the active queue]
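The pick step above can be sketched as a toy model (a sketch, not kernel code; `pick_next` and the data layout are illustrative, assuming lower indices mean higher priority as in the Linux convention):

```python
# Minimal sketch of the O(1) pick: find the first set bit in the priority
# bitmap, then take the head of that priority's thread list. Both steps
# cost O(1) in the number of threads, because the bitmap has a fixed
# number of priority levels.

NUM_PRIOS = 140                       # priority levels in the O(1) scheduler

def pick_next(bitmap, queues):
    """bitmap[p] is True iff queues[p] is non-empty; lower p = higher prio."""
    for p in range(NUM_PRIOS):        # fixed-size scan, independent of threads
        if bitmap[p]:
            return queues[p][0]       # head of the list at priority p
    return None                       # no runnable thread

bitmap = [False] * NUM_PRIOS
queues = [[] for _ in range(NUM_PRIOS)]
queues[3] = ["A"]; bitmap[3] = True
queues[7] = ["B", "C"]; bitmap[7] = True
print(pick_next(bitmap, queues))      # A: the highest-priority runnable thread
```

Real hardware finds the first set bit with a single instruction, which is why the scan does not grow with the number of threads.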
16
Context Aware Scheduler (CAS) (1/2)
regular O(1) scheduler runqueue
[Figure: threads A-E in the regular O(1) scheduler runqueue, with
an auxiliary runqueue per context]
  • CAS creates an auxiliary runqueue per context
  • CAS compares Preg and Paux
  • Preg: the highest priority in the regular O(1)
    scheduler runqueue
  • Paux: the highest priority in the auxiliary
    runqueue
  • if Preg - Paux ≤ threshold, then we choose Paux
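The comparison can be sketched as below (a hypothetical model of the rule on this slide; higher value means higher priority here, and reading the comparison as "≤" is an assumption made in this sketch):

```python
# Sketch of the CAS pick rule: prefer the auxiliary (same-address-space)
# runqueue whenever its best priority is within `threshold` of the regular
# runqueue's best, trading a little priority order for fewer TLB flushes.

def cas_pick(p_reg, p_aux, threshold):
    """p_reg / p_aux: best priorities in the regular and auxiliary
    runqueues (higher value = higher priority in this sketch)."""
    if p_reg - p_aux <= threshold:
        return "aux"                  # stay in the current context: no flush
    return "reg"                      # priority gap too large: follow O(1)

print(cas_pick(p_reg=10, p_aux=9, threshold=2))   # aux
print(cas_pick(p_reg=10, p_aux=5, threshold=2))   # reg
```

A larger threshold makes the condition true more often, which is why it aggregates more sibling threads (as the results slides show).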

17
Context Aware Scheduler (CAS) (2/2)
regular O(1) scheduler runqueue
auxiliary runqueues per context
[Figure: threads A-E distributed across the regular runqueue and
the per-context auxiliary runqueues]
CAS with threshold 2 schedules A, C, E, B, D: context switch 1 time
O(1) scheduler schedules A, B, C, D, E: context switch 4 times
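The 1-versus-4 comparison above can be reproduced by counting address-space changes along each schedule (a sketch; grouping A, C, E and B, D into two address spaces is an assumption read off the schedules shown):

```python
# Count context switches (address-space changes) along a schedule order.

def context_switches(order, space_of):
    """Number of adjacent pairs in `order` whose threads belong to
    different address spaces, i.e. task switches that are also
    context switches."""
    return sum(1 for a, b in zip(order, order[1:])
               if space_of[a] != space_of[b])

# Two processes: threads A, C, E share one address space; B, D the other.
space_of = {"A": 1, "C": 1, "E": 1, "B": 2, "D": 2}

print(context_switches(["A", "B", "C", "D", "E"], space_of))  # 4 (O(1))
print(context_switches(["A", "C", "E", "B", "D"], space_of))  # 1 (CAS)
```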
18
Contents
  • Introduction
  • Effect of Sibling Threads on TLB
  • Context Aware Scheduler (CAS)
  • Benchmark Applications and Measurement
    Environment
  • Measurement Environment
  • Benchmarks
  • Measurements
  • Scheduler
  • Result
  • Related Work
  • Conclusion

19
Measurement Environment
  • Intel Core 2 Duo 1.86 GHz

Spec of each memory hierarchy
20
Benchmarks
21
Measurements
  • Benchmarks: Chat, SysBench, Volano, DaCapo
  • DTLB and ITLB misses (user/kernel spaces)
  • Elapsed time of executing the 4 applications
22
Scheduler
  • O(1) scheduler in Linux 2.6.21
  • CAS
  • threshold 1
  • threshold 10

23
Contents
  • Introduction
  • Effect of Sibling Threads on TLB
  • Context Aware Scheduler (CAS)
  • Benchmark Applications and Measurement
    Environment
  • Result
  • TLB misses
  • Process Time
  • Elapsed Time
  • Comparison with the Completely Fair Scheduler
  • Related Work
  • Conclusion

24
TLB misses
(million times)
25
Why is a larger threshold better?
  • A larger threshold can aggregate more sibling threads
  • Dynamic priority works against a small threshold
26
Process Time
(seconds)
27
Elapsed Time
(seconds)
28
Comparison with the Completely Fair Scheduler (CFS)
  • What is CFS?
  • Introduced in Linux 2.6.23
  • Cuts off the heuristic calculation of dynamic
    priority
  • Does not consider the address space in scheduling
  • Why compare?
  • To investigate whether applying CAS to CFS is valuable
  • Can the CAS idea reduce TLB misses and process time
    in CFS?

29
TLB misses
30
Process Time and Total Elapsed Time
(seconds)
31
Contents
  • Introduction
  • Effect of Sibling Threads on TLB
  • Context Aware Scheduler (CAS)
  • Benchmark Applications and Measurement
    Environment
  • Result
  • Related Work
  • Conclusion

32
Sujay Parekh, et al., Thread-Sensitive
Scheduling for SMT Processors (2000)
  • Parekh's scheduler
  • runs groups of threads in parallel and samples
    information about
  • IPC
  • TLB misses
  • L2 cache misses, etc.
  • schedules based on the sampled information

[Figure: alternating sampling and scheduling phases]
33
Contents
  • Introduction
  • Effect of Sibling Threads on TLB
  • Context Aware Scheduler (CAS)
  • Benchmark Applications and Measurement
    Environment
  • Result
  • Related Work
  • Conclusion

34
Conclusion
  • Conclusion
  • CAS is effective in reducing TLB misses
  • CAS enhances the throughput of every application
  • Future Work
  • Evaluation on other architectures
  • Applying CAS to the CFS scheduler
  • Extension to SMP platforms

35
Additional Slides
36
Effect of sibling threads on context switches
(counts)
37
Result of Cache Misses
(thousand times)
38
Result of Cache Misses
(thousand times)
39
Memory Consumption of CAS
  • Additional memory consumption of CAS
  • About 40 bytes per thread
  • About 150 KB per thread group
  • 6 × 150 KB + 1,700 × 40 B ≈ 970 KB
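The total works out as follows (a check of the slide's arithmetic, assuming 6 thread groups, about 1,700 threads, and decimal kilobytes; the function name is illustrative):

```python
# Estimate CAS's extra memory use from the per-thread and per-group costs.
PER_THREAD_BYTES = 40
PER_GROUP_BYTES = 150 * 1000          # about 150 KB per thread group

def cas_overhead(groups, threads):
    """Total additional bytes CAS needs for the given workload."""
    return groups * PER_GROUP_BYTES + threads * PER_THREAD_BYTES

total = cas_overhead(groups=6, threads=1700)
print(total // 1000, "KB")            # 968 KB, which the slide rounds to 970 KB
```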

40
Effective and Ineffective Cases of CAS
  • Effective case
  • Consecutive threads share a certain amount of data
  • Ineffective case
  • Consecutive threads do not share data

[Figure: in the effective case the working sets of A and B overlap
in the cache; in the ineffective case the disjoint working sets
compete for it]
41
Pranay Koka, et al., Opportunities for Cache
Friendly Process (2005)
  • Koka's scheduler
  • traces the execution of each thread
  • puts the focus on the memory space shared between
    threads
[Figure: alternating tracing and scheduling phases]
42
Extension to SMP
  • Aggregation into limited processors

[Figure: sibling threads aggregated onto a limited number of
processors (CPU 0, CPU 1)]
43
Extension to SMP
  • Execute threads with the same address space in
    parallel

[Figure: threads with the same address space executed in parallel
on CPU 0 and CPU 1]
44
TLB misses and Total Elapsed Time
45
46
Widely Spread Multithreading
  • Multithreading hides the latency of disk I/O and
    network access
  • Threads in many languages, e.g. Java, Perl, and
    Python, correspond to OS threads

[Figure: Thread A runs while Thread B waits on disk I/O]
More context switches happen today, so the process
scheduler in the OS is more responsible for
the system performance
47
Context Aware (CA) Scheduler
Our CA scheduler aggregates sibling threads
[Figure: scheduling threads A-E, the Linux O(1) scheduler causes
3 context switches between processes; the CA scheduler causes 1]
48
Results of Context Switch
(micro seconds)
[Figure: processes A (2 MB), B (1 MB), and C competing for a 2 MB
L2 cache]
49
Overhead due to a context switch, measured by lat_ctx in
LMbench
50
Fairness
  • The O(1) scheduler keeps fairness by epochs
  • cycles of the active queue and expired queue
  • The CA scheduler also follows epochs
  • it guarantees the same level of fairness as the O(1)
    scheduler

[Figure: active and expired runqueues with priority bitmaps
(threads A, B, C, D) on Processor 0]
51
Influence of sibling threads on the overhead of
context switch
Ratio of each event (process / sibling threads)
52
Results of TLB misses (million times)
  • The CA scheduler significantly reduces TLB misses
  • A bigger threshold is more effective
  • frequent changes of priority happened,
    especially in DaCapo and Volano

53
Effect on Process Time (seconds)
  • The CA scheduler benefits the process time of
    every application
  • CA is especially effective in the Chat application

54
Effect on Elapsed Time (seconds)
The CA scheduler reduces the total elapsed time by 48%
55
Measuring Tools
  • Perfctr to count the TLB misses and total elapsed
    time
  • GNU's time command to measure the process time
  • A counter implemented in each application (elapsed
    time)

56
TLB Flush in Context Switch
  • Example of x86 processors
  • A switch of memory address spaces triggers a TLB
    flush, except for a small number of entries with the
    G (global) flag

When switching between sibling threads, TLB
entries are not flushed