Title: Effect of Context Aware Scheduler on TLB
1Effect of Context Aware Scheduler on TLB
- Satoshi Yamada and Shigeru Kusakabe
- Kyushu University
2Contents
- Introduction
- Effect of Sibling Threads on TLB
- Context Aware Scheduler (CAS)
- Benchmark Applications and Measurement Environment
- Result
- Related Work
- Conclusion
3Contents
- Introduction
- What is Context?
- Motivation
- Task Switch and Cache
- Approach of our Scheduler
- Effect of Sibling Threads on TLB
- Context Aware Scheduler (CAS)
- Benchmark Applications and Measurement Environment
- Result
- Related Work
- Conclusion
4What is context?
- Definition in this presentation
- Context: memory address space
- Task switch
- Context switch
5Motivation
- Native OS threads are used more often today
- Java, Perl, Python, Erlang, and Ruby
- OpenMP, MPI
- As the number of threads increases, the overhead due to task switches tends to get heavier
- Agarwal, et al. Cache performance of operating system and multiprogramming workloads (1988)
6Task Switch and Cache
- Overhead due to a task switch
- includes that of loading the working set of the next process
- is deeply related to the utilization of caches
- Mogul, et al. The effect of context switches on cache performance (1991)
[Figure: the working sets of Process A and Process B are loaded into the cache; together the working sets overflow the cache]
7Approach of our Scheduler
- Three solutions to reduce the overhead due to task switches
- Agarwal, et al. Cache performance of operating system and multiprogramming workloads (1988)
1. Increase the size of caches
2. Reuse the shared data among threads
3. Utilize tagged caches and/or restrain cache flushes
We utilize sibling threads to achieve 2 and 3.
We mainly discuss 3.
8Contents
- Introduction
- Effect of Sibling Threads on TLB
- Working Set and Task Switch
- TLB tag and Task Switch
- Advantage of Sibling Threads
- Effect of Sibling Threads on Task Switches
- Context Aware Scheduler (CAS)
- Benchmark Applications and Measurement Environment
- Result
- Related Work
- Conclusion
9Working Set and Task Switch
- Task switch with small overhead
- Task switch with large overhead
[Figure: working sets of Process A and Process B in each of the two cases]
10TLB and Task Switch
[Figure: example address entries held in a tagged TLB and in a non-tagged TLB]
- Tagged TLB: TLB flush is not necessary (ARM, MIPS, etc.)
- Non-tagged TLB: TLB flush is necessary (x86, etc.)
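The difference between the two TLB types can be illustrated with a toy model. The sketch below is a minimal Python simulation, not a model of any real processor: it assumes a TLB of unlimited capacity, and the names `count_tlb_misses`, `schedule`, and `pages_of` are hypothetical. A non-tagged TLB is cleared whenever the address space changes, while a tagged TLB keeps its entries, keyed by address space.

```python
def count_tlb_misses(schedule, pages_of, tagged):
    """Count TLB misses for a sequence of task switches.

    schedule: list of (task_name, address_space_id) in execution order
    pages_of: dict mapping task_name to the pages it touches when it runs
    tagged:   True for a tagged TLB, False for a non-tagged TLB
    Assumption: the TLB never runs out of entries.
    """
    tlb = set()            # cached translations, stored as (space, page)
    misses = 0
    current_space = None
    for task, space in schedule:
        # A non-tagged TLB must be flushed when the address space changes.
        if not tagged and current_space is not None and space != current_space:
            tlb.clear()
        current_space = space
        for page in pages_of[task]:
            if (space, page) not in tlb:
                misses += 1
                tlb.add((space, page))
    return misses


# Two processes with separate address spaces, run alternately.
processes = [("A", 1), ("B", 2), ("A", 1), ("B", 2)]
process_pages = {"A": [1, 2, 3], "B": [4, 5]}

# Two sibling threads sharing address space 1, run alternately.
siblings = [("T1", 1), ("T2", 1), ("T1", 1), ("T2", 1)]
sibling_pages = {"T1": [1, 2, 3], "T2": [3, 4]}

tagged_misses = count_tlb_misses(processes, process_pages, tagged=True)
untagged_misses = count_tlb_misses(processes, process_pages, tagged=False)
sibling_misses = count_tlb_misses(siblings, sibling_pages, tagged=False)
print(tagged_misses, untagged_misses, sibling_misses)
```

In this toy setting the non-tagged TLB reloads every entry after each switch between processes, while switches between sibling threads leave the entries in place.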
11Advantage of Sibling Threads
[Figure: relationships among task_struct, mm_struct, and signal_struct (fields: mm, signal, file, ...) when the parent calls fork() to create a THREAD and to create a PROCESS]
- Advantage on task switches
- Higher possibility of sharing data among sibling threads
- Context switch does not happen
- Restrains TLB flushes in a non-tagged TLB
12Effect of Sibling Threads on Task Switches: Measurement
We use the idea of the lat_ctx program in LMbench
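lat_ctx connects tasks in a ring of pipes and passes a token around the ring, so that each hop forces a task switch. The following is a rough Python sketch of that idea only, using sibling threads. It is not the LMbench code and not the authors' measurement program; the function name `ring_pass_time` and its parameters are hypothetical, and interpreter overhead dominates the numbers, so the result only indicates the shape of the measurement.

```python
import os
import threading
import time


def ring_pass_time(num_tasks=4, rounds=200):
    """Pass a one-byte token around a ring of threads joined by pipes.

    Returns the average time per hop in seconds. num_tasks must be >= 2.
    The calling thread acts as task 0; the others are worker threads.
    """
    # pipes[i] is the pipe that task i reads its token from.
    pipes = [os.pipe() for _ in range(num_tasks)]

    def worker(i):
        read_fd = pipes[i][0]
        write_fd = pipes[(i + 1) % num_tasks][1]
        for _ in range(rounds):
            token = os.read(read_fd, 1)   # block until the token arrives
            os.write(write_fd, token)     # hand it to the next task

    workers = [threading.Thread(target=worker, args=(i,))
               for i in range(1, num_tasks)]
    for w in workers:
        w.start()

    start = time.perf_counter()
    for _ in range(rounds):
        os.write(pipes[1][1], b"x")       # task 0 sends the token to task 1
        os.read(pipes[0][0], 1)           # and waits for it to come back
    elapsed = time.perf_counter() - start

    for w in workers:
        w.join()
    for read_fd, write_fd in pipes:
        os.close(read_fd)
        os.close(write_fd)
    return elapsed / (rounds * num_tasks)


if __name__ == "__main__":
    print("seconds per hop:", ring_pass_time())
```

Replacing the threads with separate processes in the same ring would give the process side of the comparison.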
13Effect of Sibling Threads on Task Switches: Results
(sibling threads / process)
14Contents
- Introduction
- Effect of Sibling Threads on TLB
- Context Aware Scheduler (CAS)
- O(1) Scheduler in Linux
- Context Aware Scheduler (CAS)
- Benchmark Applications and Measurement Environment
- Result
- Related Work
- Conclusion
15O(1) Scheduler in Linux
- Structure
- active queue and expired queue
- priority bitmap and array of linked lists of threads
- Behavior
- search the priority bitmap and choose a thread with the highest priority
- Scheduling overhead
- independent of the number of threads
[Figure: active and expired queues of a processor, each with a priority bitmap (high to low) and linked lists holding threads A, B, C, and D]
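The structure described on this slide can be sketched in a few lines. This is a simplified Python illustration, not the Linux implementation: the class names `PrioArray` and `RunQueue`, the number of priority levels, and the rule that a thread moves to the expired queue as soon as it has run once are all assumptions made for the example. Index 0 is treated as the highest priority.

```python
from collections import deque

NUM_PRIO = 8  # hypothetical number of priority levels for this sketch


class PrioArray:
    """A priority bitmap plus one FIFO list of threads per priority."""

    def __init__(self):
        self.bitmap = 0
        self.queues = [deque() for _ in range(NUM_PRIO)]

    def enqueue(self, name, prio):
        self.queues[prio].append(name)
        self.bitmap |= 1 << prio          # mark this priority as non-empty

    def dequeue_highest(self):
        if self.bitmap == 0:
            return None
        # Isolate the lowest set bit: a constant-cost "find first set".
        prio = (self.bitmap & -self.bitmap).bit_length() - 1
        name = self.queues[prio].popleft()
        if not self.queues[prio]:
            self.bitmap &= ~(1 << prio)   # priority level is now empty
        return name, prio


class RunQueue:
    """Active and expired arrays that swap when the active one drains."""

    def __init__(self):
        self.active = PrioArray()
        self.expired = PrioArray()

    def pick_next(self):
        if self.active.bitmap == 0:
            self.active, self.expired = self.expired, self.active
        picked = self.active.dequeue_highest()
        if picked is None:
            return None
        name, prio = picked
        # Assumption: the thread uses its whole time slice and expires.
        self.expired.enqueue(name, prio)
        return name


rq = RunQueue()
for name, prio in [("A", 0), ("B", 1), ("C", 1), ("D", 3)]:
    rq.active.enqueue(name, prio)
picks = [rq.pick_next() for _ in range(5)]
print(picks)
```

Picking the next thread costs one bitmap search and one list removal, which is why the overhead does not depend on the number of threads.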
16Context Aware Scheduler (CAS) (1/2)
[Figure: regular O(1) scheduler runqueue holding threads A, B, C, D, and E with its priority bitmap]
- CAS creates auxiliary runqueues per context
- CAS compares Preg and Paux
- Preg: the highest priority in the regular O(1) scheduler runqueue
- Paux: the highest priority in the auxiliary runqueue
- if Preg - Paux ≤ threshold, then we choose Paux
17Context Aware Scheduler (CAS) (2/2)
[Figure: regular O(1) scheduler runqueue and auxiliary runqueues per context, holding threads A, B, C, D, and E with their priority bitmaps]
- CAS with threshold 2: A, C, E, B, D (context switch: 1 time)
- O(1) scheduler: A, B, C, D, E (context switch: 4 times)
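The selection rule of CAS can be traced with a small simulation. The sketch below is a simplified Python illustration under stated assumptions, not the authors' kernel code: priorities are plain numbers where a larger number means a higher priority, each thread runs exactly once, the comparison is taken to be `<=`, and the grouping of A, C, E into one context and B, D into another is chosen for the example. The names `o1_order`, `cas_order`, and `context_switches` are hypothetical.

```python
def o1_order(threads):
    """Plain priority order: highest priority first, FIFO within a level."""
    return [t["name"] for t in sorted(threads, key=lambda t: -t["prio"])]


def cas_order(threads, threshold):
    """Prefer a thread of the current context when its priority is close enough."""
    pending = sorted(threads, key=lambda t: -t["prio"])  # regular runqueue
    order = []
    current_ctx = None
    while pending:
        reg = pending[0]  # highest priority overall (Preg)
        # Highest-priority pending thread in the current context (Paux).
        aux = next((t for t in pending if t["ctx"] == current_ctx), None)
        if aux is not None and reg["prio"] - aux["prio"] <= threshold:
            chosen = aux
        else:
            chosen = reg
        pending.remove(chosen)
        order.append(chosen["name"])
        current_ctx = chosen["ctx"]
    return order


def context_switches(order, ctx_of):
    """Count adjacent pairs that belong to different contexts."""
    return sum(1 for a, b in zip(order, order[1:]) if ctx_of[a] != ctx_of[b])


# Assumed example: A and B at one level, C, D, and E two levels lower.
threads = [
    {"name": "A", "prio": 2, "ctx": "X"},
    {"name": "B", "prio": 2, "ctx": "Y"},
    {"name": "C", "prio": 0, "ctx": "X"},
    {"name": "D", "prio": 0, "ctx": "Y"},
    {"name": "E", "prio": 0, "ctx": "X"},
]
ctx_of = {t["name"]: t["ctx"] for t in threads}

cas = cas_order(threads, threshold=2)
o1 = o1_order(threads)
print(cas, context_switches(cas, ctx_of))
print(o1, context_switches(o1, ctx_of))
```

With these assumed priorities the sketch gives the order A, C, E, B, D with one context switch for CAS, and A, B, C, D, E with four context switches for the plain priority order.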
18Contents
- Introduction
- Effect of Sibling Threads on TLB
- Context Aware Scheduler (CAS)
- Benchmark Applications and Measurement Environment
- Measurement Environment
- Benchmarks
- Measurements
- Scheduler
- Result
- Related Work
- Conclusion
19Measurement Environment
- Intel Core 2 Duo 1.86 GHz
Spec of each memory hierarchy
20Benchmarks
21Measurements
- Benchmarks: Chat, SysBench, Volano, DaCapo
- DTLB and ITLB misses (user/kernel spaces)
- Elapsed time of executing the 4 applications
22Scheduler
- O(1) scheduler in Linux 2.6.21
- CAS
- threshold 1
- threshold 10
23Contents
- Introduction
- Effect of Sibling Threads on TLB
- Context Aware Scheduler (CAS)
- Benchmark Applications and Measurement Environment
- Result
- TLB misses
- Process Time
- Elapsed Time
- Comparison with Completely Fair Scheduler (CFS)
- Related Work
- Conclusion
24TLB misses
(million times)
25Why is a larger threshold better?
- A larger threshold can aggregate more sibling threads
- Dynamic priority works against a small threshold
[Figure: priority bitmaps illustrating the two points]
26Process Time
(seconds)
27Elapsed Time
(seconds)
28Comparison with Completely Fair Scheduler (CFS)
- What is CFS?
- Introduced in Linux 2.6.23
- Removes the heuristic calculation of dynamic priority
- Does not consider the address space in scheduling
- Why compare?
- Investigate whether applying CAS to CFS is valuable
- Can the CAS idea reduce TLB misses and process time in CFS?
29TLB misses
30Process Time and Total Elapsed Time
(seconds)
31Contents
- Introduction
- Effect of Sibling Threads on TLB
- Context Aware Scheduler (CAS)
- Benchmark Applications and Measurement Environment
- Result
- Related Work
- Conclusion
32Sujay Parekh, et al., Thread Sensitive Scheduling for SMT Processors (2000)
- Parekh's scheduler
- tries groups of threads to execute in parallel and samples information about
- IPC
- TLB misses
- L2 cache misses, etc.
- schedules based on the sampled information
[Figure: timeline alternating between Sampling Phase and Scheduling Phase]
33Contents
- Introduction
- Effect of Sibling Threads on TLB
- Context Aware Scheduler (CAS)
- Benchmark Applications and Measurement Environment
- Result
- Related Work
- Conclusion
34Conclusion
- Conclusion
- CAS is effective in reducing TLB misses
- CAS enhances the throughput of every application
- Future Work
- Evaluation on other architectures
- Applying CAS to the CFS scheduler
- Extension to SMP platforms
35additional slides
36Effect of sibling threads on context switches
(counts)
37Result of Cache Misses
(thousand times)
38Result of Cache Misses
(thousand times)
39Memory Consumption of CAS
- Additional memory consumption of CAS
- About 40 bytes per thread
- About 150 Kbytes per thread group
- 6 × 150 Kbytes + 1700 × 40 bytes ≈ 970 Kbytes
40Effective and Ineffective Case of CAS
- Effective case
- Consecutive threads share a certain amount of data
- Ineffective case
- Consecutive threads do not share data
[Figure: working sets of A and B in the cache for each case]
41Pranay Koka, et al., Opportunities for Cache Friendly Process (2005)
- Koka's scheduler
- traces the execution of each thread
- focuses on the memory space shared between threads
[Figure: timeline alternating between Tracing Phase and Scheduling Phase]
42Extension to SMP
- Aggregation into limited processors
[Figure: threads assigned to CPU 0 and CPU 1]
43Extension to SMP
- Execute threads with the same address space in parallel
[Figure: threads assigned to CPU 0 and CPU 1]
44TLB misses and Total Elapsed Time
46Widely spread multithreading
- Multithreading hides the latency of disk I/O and network access
- Threads in many languages, such as Java, Perl, and Python, correspond to OS threads
- More context switches happen today
- The process scheduler in the OS is more responsible for the system performance
[Figure: ThreadA and ThreadB; ThreadB waits for the disk]
47Context Aware (CA) scheduler
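Our CA scheduler aggregates sibling threads
[Figure: execution order of threads A, B, C, D, and E under the Linux O(1) scheduler and under the CA scheduler]
- Linux O(1) scheduler: context switches between processes happen 3 times
- CA scheduler: context switches between processes happen 1 time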
48Results of Context Switch
(micro seconds)
[Figure: working sets of Process A, Process B, and Process C relative to the cache, with axis marks at 0, 1MB, and 2MB; L2 cache size: 2MB]
49Overhead due to a context switch by lat_ctx in LMbench
50Fairness
- O(1) scheduler keeps fairness by epoch
- cycles of active queue and expired queue
- CA scheduler also follows epochs
- guarantees the same level of fairness as the O(1) scheduler
[Figure: active and expired queues of Processor 0, each with a priority bitmap, holding threads A, B, C, and D]
51Influence of sibling threads on the overhead of context switch
Ratio of each event (process / sibling threads)
52Results of TLB misses (million times)
- CA scheduler significantly reduces TLB misses
- A bigger threshold is more effective
- frequent changes of priority happened, especially in DaCapo and Volano
53Effect on Process Time (seconds)
- CA scheduler benefits the process time of every application
- CA is especially effective in the Chat application
54Effect on Elapsed Time (seconds)
CA scheduler reduces the total elapsed time by 48
55Measuring Tools
- Perfctr to count the TLB misses and Total Elapsed Time
- GNU's time command to measure the process time
- Counter implemented in each application (elapsed time)
56TLB flush in Context Switch
- Example of x86 processors
- A switch of memory address spaces triggers a TLB flush, except for a small number of entries with the G flag
In the case of switching between sibling threads, TLB entries are not flushed