Many-Core Operating Systems
Transcript and Presenter's Notes

1
Many-Core Operating Systems
  • Burton Smith, Technical Fellow, Advanced
    Strategies and Policy

2
The von Neumann Premise
  • Simply put: there is exactly one program
    counter
  • It has led to some artifacts
  • Synchronous coprocessor coroutining (e.g. 8087)
  • Interrupts for asynchronous concurrency
  • Demand paging
  • to make memory allocation incremental
  • to let virtual > physical
  • And some serious problems
  • The memory wall (insufficient memory concurrency)
  • The ILP wall (diminished improvement in ILP)
  • The power wall (the cost of run-time ILP
    exploitation)
  • Given multiple program counters, what should we
    change?
  • Scheduling?
  • Synchronization?

3
Computing is at a Crossroads
  • Continual performance improvement is our
    lifeblood
  • It encourages people to buy new hardware
  • It opens up new software possibilities
  • Single-thread performance is nearing the end of
    the line
  • But Moore's Law will continue for some time to
    come
  • What can we do with all those transistors?
  • Computation needs to become as parallel as
    possible
  • Henceforth, serial means slow
  • Systems must support general purpose parallel
    computing
  • The alternative to all this is commoditization
  • New many-core chips will need new system software
  • And vice versa!
  • This talk is about interplay between OS and
    hardware

4
Many-Core OS Challenges
  • Architecture of the parallel virtual machine
  • Processor management
  • Multiple processors
  • A mix of in-order and out-of-order CPUs
  • GPUs and other performance accelerators
  • I/O processors and devices
  • Memory management
  • Performance problems due to paging
  • TLB pressure from larger working sets
  • Bandwidth resources
  • Quality of service (time management)
  • For media applications, games, real-time apps,
    etc.
  • For deadlines

5
The Parallel Virtual Machine
  • What should the interface that the OS presents to
    parallel application software look like?
  • Stable, negotiated resource allocation
  • Isolation among protection domains
  • Freedom from bottlenecks in OS services
  • The key objective is fine-grain application
    parallelism
  • We need the whole tree, not just the low-hanging
    fruit

6
Fine-grain Parallelism
  • Exploitable parallelism grows as task granularity
    shrinks
  • But dependences among tasks become more numerous
  • Inter-task dependence enforcement demands
    scheduling
  • A task needing a value from elsewhere must wait
    for it
  • User-level work scheduling is called for (see the
    sketch below)
  • No privilege change is needed to stop or restart
    a task
  • Locality (e.g. cache content) can be better
    preserved
  • Today's OS and hardware don't encourage waiting
  • OS thread scheduling makes blocking dangerous
  • Instruction sets encourage non-blocking
    approaches
  • Busy-waiting wastes instruction issue
    opportunities
  • Impact
  • Better instruction set support for blocking
    synchronization
  • Changes to OS processor and memory resource
    management
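
The user-level scheduling point can be made concrete with a toy example. The sketch below is not from the talk; the names (task_t, sched_push, sched_pop) are hypothetical. It is a minimal run-to-completion task queue in plain C: a task whose input is not ready re-enqueues itself instead of blocking the OS thread, so stopping and restarting it involves no privilege change.

```c
/* A minimal sketch (hypothetical names, not from the talk) of user-level work
 * scheduling: tasks live in a queue in user space, and a task whose input is
 * not ready yet re-enqueues itself instead of blocking the OS thread. */
#include <stdio.h>
#include <stdbool.h>

#define MAX_TASKS 64

typedef struct task { void (*fn)(struct task *); } task_t;

static task_t *queue[MAX_TASKS];
static int head, tail;

static void    sched_push(task_t *t) { queue[tail++ % MAX_TASKS] = t; }
static task_t *sched_pop(void)       { return head == tail ? NULL : queue[head++ % MAX_TASKS]; }

/* A value produced by one task and awaited by another. */
static bool value_ready = false;
static int  value;

static void producer(task_t *self) {
    (void)self;
    value = 42;                 /* compute the value ...        */
    value_ready = true;         /* ... then publish it          */
}

static void consumer(task_t *self) {
    if (!value_ready) {         /* dependence not satisfied yet:          */
        sched_push(self);       /* re-enqueue this task and move on;      */
        return;                 /* no kernel call, no blocked OS thread   */
    }
    printf("consumer saw %d\n", value);
}

int main(void) {
    task_t c = { consumer }, p = { producer };
    sched_push(&c);             /* consumer is deliberately scheduled first */
    sched_push(&p);
    for (task_t *t; (t = sched_pop()) != NULL; )
        t->fn(t);               /* run ready tasks to completion */
    return 0;
}
```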

7
Multithreading and Synchronization
  • Fine-grain multithreading can use TLP to tolerate
    latency
  • Memory latency
  • Other operation latency, e.g. branch latency
  • Synchronization latency
  • In the latter case, some architectural support is
    helpful
  • To stop issuing from a context while it is
    waiting
  • To resume issuing when the wait is over
  • To free up the context if and when a wait becomes
    long
  • The benefits
  • Waiting does not consume issue slots
  • Overhead is automatically amortized
  • I talked about this stuff in my 1996 FCRC keynote

8
Resource Scheduling Consequences
  • Since the user runtime is scheduling work on
    processors, the OS should not attempt to do the
    same
  • An asynchronous OS API is a necessary corollary
  • Scheduling memory via demand paging is also
    problematic
  • Instead, the two schedulers should negotiate (see
    the sketch below)
  • The application tells the OS its resource
    needs/desires
  • The OS makes decisions based on the big picture
  • Availability of resources
  • Appropriateness of power consumption level
  • Requirements for quality of service
  • The OS can preempt resources to reclaim them
  • But with notification, so the application can
    rearrange things
  • Resources should be time- and space-shared in
    chunks
  • Scheduling turns into a bin-packing problem
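
As a rough illustration of the negotiation described above, the sketch below invents a hypothetical C interface; none of these names are a real API. The application states its wanted and minimum core counts, the "OS" grants what its global view allows, and later preemption arrives as a notification callback so the user-level runtime can repack its work before the cores disappear.

```c
/* Hypothetical negotiation sketch; not a real OS interface. */
#include <stdio.h>

typedef struct { int cores_wanted; int cores_minimum; int deadline_ms; } resource_request;
typedef struct { int cores_granted; } resource_grant;
typedef void (*revoke_cb)(int cores_taken_back);

static int os_free_cores = 24;          /* toy stand-in for the OS's big picture */

static resource_grant os_negotiate(resource_request req, revoke_cb cb) {
    (void)cb;                           /* a real OS would remember this callback */
    resource_grant g;
    g.cores_granted = req.cores_wanted  <= os_free_cores ? req.cores_wanted
                    : req.cores_minimum <= os_free_cores ? os_free_cores
                    : 0;
    os_free_cores -= g.cores_granted;
    return g;
}

static void on_revoke(int n) {
    /* The user-level runtime learns how many cores it is losing and can
     * shrink or migrate its work before they are actually taken away. */
    printf("runtime: repacking work onto %d fewer cores\n", n);
}

int main(void) {
    resource_request req = { .cores_wanted = 16, .cores_minimum = 4, .deadline_ms = 16 };
    resource_grant   g   = os_negotiate(req, on_revoke);
    printf("granted %d of %d requested cores\n", g.cores_granted, req.cores_wanted);
    on_revoke(8);                       /* simulate the OS reclaiming half of them */
    return 0;
}
```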

9
Bin Packing
  • The more resources allocated, the more swapping
    overhead
  • It would be nice to amortize it
  • The more resources you get, the longer you may
    keep them
  • Roughly, this means the scheduler packs squarish
    blocks, as in the sketch below
  • QoS applications might need long rectangles
    instead
  • When the blocks don't fit, the OS can morph them
    a little
  • Or cut corners when absolutely necessary

[Figure: bin-packing diagram; axes are quantity of resource and time]
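
A toy first-fit packer makes the bin-packing view concrete. In the sketch below (the machine size, horizon, and job sizes are assumptions, not from the talk), each job is a block of cores x quanta that is slid forward in time until it fits under the machine's core count.

```c
/* Toy first-fit sketch of scheduling as bin packing; purely illustrative. */
#include <stdio.h>

#define TOTAL_CORES 8
#define HORIZON     16          /* scheduling horizon, in quanta */

static int used[HORIZON];       /* cores already committed in each quantum */

/* Returns the first start time at which the block fits, and commits it. */
static int place(int cores, int quanta) {
    for (int start = 0; start + quanta <= HORIZON; start++) {
        int ok = 1;
        for (int t = start; t < start + quanta; t++)
            if (used[t] + cores > TOTAL_CORES) { ok = 0; break; }
        if (ok) {
            for (int t = start; t < start + quanta; t++) used[t] += cores;
            return start;
        }
    }
    return -1;                  /* does not fit inside the horizon */
}

int main(void) {
    struct { int cores, quanta; } jobs[] = { {4, 4}, {4, 4}, {6, 2}, {2, 8} };
    for (int i = 0; i < 4; i++) {
        int at = place(jobs[i].cores, jobs[i].quanta);
        printf("job %d (%d cores x %d quanta) -> start %d\n",
               i, jobs[i].cores, jobs[i].quanta, at);
    }
    return 0;
}
```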
10
What About Priority Scheduling?
  • Priorities are appropriate for some kinds of
    scheduling
  • Especially when some things to be scheduled are
    optional
  • If it all has to be done, how do the priorities
    get set?
  • The answer is usually ad-hoc, and often!
  • Fairness is seldom maintained in the process
  • Quality of service needs a different approach
  • How much work must be done before the next
    deadline?
  • Even highly interactive tasks can benefit
  • Deadlines are harder to implement than priorities
  • Then again, so is bin packing compared to fixed
    quanta
  • Fairness can also be based on quality-of-service
    concepts
  • Relative work rates rather than absolute
  • "In the next 16 milliseconds, give level i
    activities r times as many processor-seconds as
    level i-1 activities" (see the worked example below)
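
A small worked example of that rate-based rule, under assumed values (8 cores, ratio r = 2, three levels): within the 16 ms frame the shares are proportional to r^i.

```c
/* Worked example of rate-based (proportional-share) fairness across levels.
 * The core count, ratio r, and level count are assumptions. */
#include <stdio.h>

#define LEVELS 3

int main(void) {
    const double frame_ms = 16.0;   /* scheduling frame from the slide        */
    const double cores    = 8.0;    /* assumed machine size                   */
    const double r        = 2.0;    /* assumed ratio between adjacent levels  */

    double weight[LEVELS], total = 0.0, w = 1.0;
    for (int i = 0; i < LEVELS; i++) { weight[i] = w; total += w; w *= r; }

    double budget = frame_ms * cores;    /* processor-milliseconds to hand out */
    for (int i = 0; i < LEVELS; i++)
        printf("level %d gets %.1f of %.0f core-ms in the frame\n",
               i, budget * weight[i] / total, budget);
    return 0;
}
```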

11
Heterogeneous Processors
  • There are two kinds of heterogeneity
  • In architecture, i.e. different instruction sets
  • In implementation, i.e. different performance
    characteristics
  • Both are likely to be important
  • A single application might ask for a
    heterogeneous mix
  • Failure in the HA (architecture) case might need
    multiple code versions or JIT compilation
  • In the HI (implementation) case, scheduling might
    be based on instrumentation
  • A key question is whether a processor is
    time-sharable
  • If not, the OS has to dedicate it to one
    application at a time
  • With user-level scheduling and some support for
    preemption, application state save and restore
    can be done at user level

12
Virtual Memory Design Alternatives
  • Swapping instead of demand paging
  • Address-space names/identifiers
  • TLB shootdown becomes a rarer event
  • Hardware TLB coherence
  • Two-dimensional addressing (segmentation w/o
    registers)
  • To assist with variable granularity memory
    allocation
  • To help mitigate upward pressure on TLB size
  • To leverage persistent memory via segment sharing
  • A variation of mmap() might suffice for this
    purpose (see the sketch below)
  • To accommodate variations in memory bank
    architecture
  • Local versus global, for example
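
Today's POSIX mmap() of a shared, file-backed region is the nearest existing analogue to the segment-sharing variation suggested above. The sketch below uses only standard calls (open, ftruncate, mmap, msync, munmap); the file name and segment size are assumptions.

```c
/* Sharing a persistent segment via today's POSIX mmap(), as the nearest
 * existing analogue of the suggested mmap() variation. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const size_t seg_size = 4096;
    int fd = open("segment.bin", O_RDWR | O_CREAT, 0600);   /* persistent backing */
    if (fd < 0 || ftruncate(fd, seg_size) != 0) { perror("segment"); return 1; }

    char *seg = mmap(NULL, seg_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return 1; }

    /* Any process mapping the same file sees (and keeps) these bytes. */
    strcpy(seg, "hello from a shared, persistent segment");
    msync(seg, seg_size, MS_SYNC);          /* push the update to stable storage */
    printf("segment contains: %s\n", seg);

    munmap(seg, seg_size);
    close(fd);
    return 0;
}
```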

13
Physical Memory Bank Architecture
  • Consider this example
  • An application is using 31 cores, about half of
    them
  • 50% of its cache misses are stack references
  • The stacks are all allocated in a compact virtual
    region
  • How many of the 128 memory banks are available?
    (see the sketch below)
  • Interleaving addresses across the banks is a
    solution
  • Page granularity is the standard choice
  • If memory access is non-uniform, this is not the
    best idea
  • Stacks should be allocated near their processors
  • So should compiler-allocated temporary arrays on
    the heap
  • Does one bank-architecture scheme fit all cases,
    or not?
  • If not, how do we manage the virtual address
    space?
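
The sketch below illustrates the slide's example under assumed parameters (4 KiB pages, 128 banks, 1 MiB stacks packed back to back in virtual memory). With page-granularity round-robin interleaving, every stack base maps to the same bank, so a small fraction of the 128 banks ends up serving the stack-reference misses.

```c
/* Page-granularity interleaving across memory banks, and why compactly
 * allocated per-core stacks can collide on a few banks. Parameters assumed. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE   4096u
#define NUM_BANKS   128u
#define STACK_SIZE  (1u << 20)      /* 1 MiB per stack, allocated back to back */

static unsigned bank_of(uint64_t addr) {
    return (unsigned)((addr / PAGE_SIZE) % NUM_BANKS);  /* round-robin by page */
}

int main(void) {
    /* 31 stacks in a compact region; the hot references tend to hit the
     * top-of-stack page, so look at the bank of each stack's base page.  */
    for (int core = 0; core < 31; core++) {
        uint64_t stack_base = 0x100000000ull + (uint64_t)core * STACK_SIZE;
        printf("core %2d stack base -> bank %u\n", core, bank_of(stack_base));
    }
    return 0;   /* with these parameters, every stack base lands in bank 0 */
}
```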

14
Hot Spots
  • When processors share memory, they can interfere
  • Not only data races, but also bandwidth
    oversubscription
  • Within an application, this creates performance
    problems
  • Hardware help is needed to discover where these
    are
  • Between applications, interference is even more
    serious
  • Performance unpredictability
  • Denial of service
  • Covert-channel signaling
  • Bandwidth is a resource like any other
  • We need to be able to partition and isolate it

15
I/O Architecture
  • Direct memory access is usually a good way to do
    I/O
  • Today's DMA mostly demands wired-down pages
  • This leads to lots of data copying and other OS
    warts
  • But I/O devices are getting smarter all the time
  • Transistors are cheaper than almost anything else
  • Why not treat I/O devices like heterogeneous
    processors?
  • Teach them to do virtual address translation
  • Allocate them to real-time or sensor-intensive
    applications
  • Allocate them to a not-very-trusted driver
    application
  • Address space sharing can be partial, as it is
    now
  • There is a problem, though: inter-domain
    signaling (IPC)
  • This is what interrupts do
  • I have some issues with interrupts

16
Interrupts
  • Interrupts are OK when there is only one
    processor
  • Some people avoid them to make systems more
    predictable
  • If there are many processors, which one do you
    interrupt?
  • The usual solution: just pick one and leave it
    to software
  • A better idea is to signal via an address space
    you already share (perhaps only in part) with the
    intended recipient (see the sketch below)
  • "The DMA at address <a> is ready (done)"
  • This is kinda like doing programmed I/O via
    device CSRs
  • It's also the way the CDC 6600 and 7600 did
    things
  • You may not want to have the signal recipient
    busy-wait
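
A minimal sketch of signaling through shared memory rather than an interrupt, with a thread standing in for the DMA engine and a C11 atomic word as the completion flag. The names and the polling cadence are assumptions; as the last bullet notes, the recipient should check the flag at natural scheduling points rather than spin on it.

```c
/* Completion signaled through a shared word instead of an interrupt. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static _Atomic int dma_done = 0;    /* the shared word at "address <a>" */
static char buffer[64];

static void *fake_dma_engine(void *arg) {
    (void)arg;
    usleep(1000);                              /* pretend the transfer takes a while */
    snprintf(buffer, sizeof buffer, "payload");
    atomic_store_explicit(&dma_done, 1, memory_order_release);
    return NULL;
}

int main(void) {
    pthread_t dev;
    pthread_create(&dev, NULL, fake_dma_engine, NULL);

    /* The recipient does not busy-wait: it does other work and looks at the
     * flag when it next goes looking for something to run. */
    while (!atomic_load_explicit(&dma_done, memory_order_acquire)) {
        /* ... run other ready tasks here ... */
        usleep(100);
    }
    printf("DMA complete: %s\n", buffer);

    pthread_join(dev, NULL);
    return 0;
}
```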

17
Conclusions
  • It is time to rethink some of the basics of
    computing
  • There is lots of work for everyone to do
  • e.g. I've left out compilers, debuggers, and
    applications
  • We need basic research as well as industrial
    development
  • Research in computer systems is deprecated these
    days
  • In the USA, NSF and DOD need to take the
    initiative