The Transition to Multi-Core: Is Your Software Ready? (AN170)


Slides: 48

Provided by: QNX


Transcript and Presenter's Notes


1
The Transition to Multi-Core: Is Your Software
Ready? (AN170)

Toby Foster
Product Marketing, Freescale

Sebastien Marineau-Mes
Director, OS Group, QNX
2
Agenda

Overview
MPC8641D Overview
Asymmetric Multi-Processing
Symmetric Multi-Processing
The QNX Solution
The Role of Tools
Q&A

3
The Transition to Multi-Core

Overview

4
Dual Core Example Applications

Dual core and high integration are ideal for:
High-end line card
Extensive processing power for extreme control
plane activities
Mid-range line card
Capability to support both control and data
Services card
Upgrade platform with advanced features

[Figure: high-end line card, mid-range line card, and services card, with data plane ASIC/NPU and management port]
5
Multiprocessing Configurations

Asymmetric Multiprocessing
Two separate OS or two copies of one non-SMP OS
Collapse two processors into one
Task offload or division of labor
Operating systems, data reside in different
address spaces
Resource sharing handled by user
Static load balancing
Bound and Symmetric Multiprocessing
Homogeneous OS support
High-performance option
Software transparency
Cores share address space for OS and data
Resource sharing handled by OS
Dynamic load balancing by OS (SMP)
Static task partitioning (BMP)

Memory Map Overlap
10
Usage of a Dual Core Device
[Figure: dual-core usage scenarios A through F across Core1 and Core2, some paired with a data plane ASIC, grouped as High End and Mid Range]
One core handles data plane, one control plane
Task offload
Each core handles a separate aspect of control
plane
Network and disk partitioning
11
MPC8641D Packed with Processing Power

Dual e600 PowerPC cores
AltiVec™
36-bit addressing
1MB L2 Cache w/ECC per core
Dual Memory Controller
Dual DDR2/3 SDRAM
64-bit data bus w/ECC
Support for up to 32GB memory
High Speed Interconnect
One x8/x4/x2/x1 PCIe, and
one x8/x4/x2/x1 PCIe or one x4/x1 Serial RapidIO
Ethernet
4x 10/100/1000 Ethernet Controllers w/
Classification/Policing, 8 Rx/Tx Queues,
Checksum Offload, QoS, Lossless Flow Control,
and FIFO mode
90nm SOI Process, 1023 Pin package
Availability
Alpha samples Q2 2006
Production Mid 2007

[Figure: MPC8641D block diagram showing the MPX bus and the peripheral logic bus]
12
Asymmetric Integration
[Figure: 8641D with two e600 cores and system logic, each core running its own non-SMP OS]

Two OS kernel images in physical memory
Each core executes a separate OS kernel image
Non-SMP OSes must cooperate in sharing resources
VxWorks, OSE, Integrity, Jaluna-1, many others

13
Asymmetric MP Memory Organization
[Figure: asymmetric MP memory organization. e600 core0 runs OS and Apps "A", e600 core1 runs OS and Apps "B"; each core's MMU maps into physical memory holding OS "A", OS "B", Apps "A", Apps "B", and shared memory]

Each OS kernel expects to control physical memory
beginning at address 0
Each wants its own interrupt vectors
The MMU can relocate applications and shared
memory appropriately
The 8641D includes a hardware translator to
relocate physical address 0 for core1
14
Resources Shared or Multiple Instances
[Figure: MPC8641D block diagram with e600 core0, e600 core1, MPIC, Local Bus, and SRIO, each block marked with one of the following]
Multiple resource instances
Shared resource
Partially shared or multiple instances in some
circumstances
15
QNX and Multi-core

QNX has done the heavy lifting to enable
migration to multi-core
Let developers focus on product differentiation
Reliable, proven support for multi-core
applications
1997: Industry's first to bring SMP to embedded
1984: High-performance, transparent distributed
messaging
Full support for asymmetric and symmetric
multiprocessing
Linux and VxWorks interoperability
Migrate existing software base and enable new
multi-core optimized applications
Multi-core capable tool suite
World class professional services and expert
training
Active role in developing standards through
Multi-core Exchange consortium
Enable portability of applications across various
platforms
Derive common set of APIs that multi-core
development tools can utilize to support
interoperability

16
Asymmetric Processing

Asymmetric Model Pros
Only possible mode when different OSes are running
CPU core can be dedicated to specific
applications
One possible mode for applications that cannot
operate with parallel processing
Asymmetric Model Cons
Resource sharing / arbitration needs to be
designed into system by developers
Neither OS owns the whole system
Memory, I/O, interrupts are shared
Evolution - complexity will increase as more
cores are added
Static configuration, difficult to add dynamic
resourcing
Time to market?
Contention possible during system initialization,
during normal operation, on interrupts, on system
error conditions. All must be dealt with by the
designer.
Synchronization between cores done through
application level messages
Sub-optimal performance
Complexity of the problem is not linear
Addition of more cores may require
re-architecting application to take full
advantage of additional CPUs

21
Homogeneous AMP: Neutrino Transparent Distributed
Processing

Extends message passing bus over a transport
layer
Applications / services can be built in a fully
distributed manner without special code
Message queues
File systems
Hardware ports
Seamless sharing of I/O resources between cores
(e.g. use a serial port owned by another core)

[Figure: two microkernel nodes, Core 0 and Core 1, each with its own message-passing bus, application, and services (flash file system, message queues, networking stack, database), joined by a message bridge over Ethernet, RapidIO, or shared memory; the networking stack connects to the Internet]

Local access:
fd = open("/dev/ffs1", ...); write(fd, ...);
Access to the same device on core0 from another node:
fd = open("/net/core0/dev/ffs1", ...); write(fd, ...);
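
The two open() calls on this slide differ only in their path, which is the point of transparent distributed processing. A minimal sketch of that idea in C follows, assuming the /net/<node> prefix shown on the slide; device_path() and write_to_device() are hypothetical helper names introduced here for illustration, not QNX APIs.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Build the path for a device. node == NULL means the local node;
 * otherwise the path is prefixed with /net/<node>, following the
 * slide's /net/core0/dev/ffs1 example (the prefix is an assumption
 * about how the target system is configured).
 * Returns 0 on success, -1 if the result does not fit in buf. */
int device_path(char *buf, size_t len, const char *node, const char *dev)
{
    assert(buf != NULL && dev != NULL && len > 0);
    int n = node ? snprintf(buf, len, "/net/%s%s", node, dev)
                 : snprintf(buf, len, "%s", dev);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* The same open()/write() sequence serves local and remote devices;
 * only the path differs. Returns bytes written, or -1 on error. */
ssize_t write_to_device(const char *node, const char *dev,
                        const void *data, size_t size)
{
    char path[256];
    if (device_path(path, sizeof path, node, dev) != 0)
        return -1;
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t written = write(fd, data, size);
    close(fd);
    return written;
}
```

Calling write_to_device(NULL, "/dev/ffs1", buf, n) would target the local flash file system, and write_to_device("core0", "/dev/ffs1", buf, n) the one owned by core0, with no other change to the application code.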
22
Heterogeneous AMP

Asymmetric Processing with Neutrino and Linux
Run Carrier Grade Linux on one core with QNX RTOS
on the other
Inter-process communication between OSes
TIPC is an emerging standard for communication
between applications
http://tipc.sourceforge.net/
Location Transparency
Higher performance than TCP/IP
Quality of Service
Linux benefits
Wide availability of open source and commercial
software
No run time licensing
QNX benefits
Real time performance
High availability framework
Memory protection
Market leading distributed processing capability
No GPL contamination issues
Combined benefit: best of both worlds

23
Symmetric Multiprocessing

One OS kernel image in physical memory
Both cores execute the same OS kernel image
SMP OS owns all of the resources
Linux, QNX, and BSD are the only embedded SMP OSes

24
SMP Memory Organization
[Figure: SMP memory organization. e600 core0 and e600 core1 each reach physical memory through an MMU; physical memory holds a single OS image, Apps "A", Apps "B", and shared memory]

The OS kernel resides at physical memory address
0, addressable by both cores
The MMU relocates applications and shared memory
appropriately
25
What is Coherency?

Consistent view of memory across multiple agents
Buffer descriptors and data buffers updated by
processor(s) as well as external agent(s)
Software-managed coherency
Processor overhead to keep track of who owns
what when
Hardware-managed coherency
Each processor's hardware ensures consistency of
shared data by snooping other agents' broadcasts
on the system bus
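
The "who owns what when" bookkeeping of software-managed coherency can be sketched as an ownership flag on each buffer descriptor. The sketch below is an illustration under stated assumptions: the structure layout and the names bd_submit() and bd_consume() are hypothetical, C11 atomics stand in for the ordering guarantees, and the platform-specific cache flush and invalidate steps that real software-managed coherency also needs are deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

enum owner { OWNER_CPU = 0, OWNER_DEVICE = 1 };

/* Hypothetical descriptor: exactly one agent may touch data at a time. */
struct buffer_descriptor {
    _Atomic int owner;
    size_t length;
    unsigned char data[64];
};

/* Producer: fill the buffer, then hand ownership to the other agent.
 * The release store publishes data and length before the flag flips. */
int bd_submit(struct buffer_descriptor *bd, const void *src, size_t len)
{
    assert(bd != NULL && src != NULL);
    if (len > sizeof bd->data ||
        atomic_load_explicit(&bd->owner, memory_order_acquire) != OWNER_CPU)
        return -1;                      /* too large, or not ours yet */
    memcpy(bd->data, src, len);
    bd->length = len;
    atomic_store_explicit(&bd->owner, OWNER_DEVICE, memory_order_release);
    return 0;
}

/* Consumer: read only when the flag says the buffer has been handed
 * over, then give it back. Returns bytes copied, or -1. */
int bd_consume(struct buffer_descriptor *bd, void *dst, size_t cap)
{
    assert(bd != NULL && dst != NULL);
    if (atomic_load_explicit(&bd->owner, memory_order_acquire) != OWNER_DEVICE)
        return -1;                      /* nothing handed over yet */
    if (bd->length > cap)
        return -1;
    size_t n = bd->length;
    memcpy(dst, bd->data, n);
    atomic_store_explicit(&bd->owner, OWNER_CPU, memory_order_release);
    return (int)n;
}
```

Every access pays for the flag check and the ordering, which is the processor overhead the slide refers to; hardware-managed coherency removes that bookkeeping from the software.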

26
Performance Features of HW Coherency

Coherency protocol
MEI
MESI
Update mechanism
Push
Intervention
Cache Tags
Single-ported
Dual-ported

[Figure: Processor A and Processor B on the MPX bus together with memory and an I/O device]
27
Symmetric Processing

Symmetric Model Pros
Highly scalable. Supports multiple processing
cores seamlessly without code modification
One OS sees all and handles all resource
sharing / arbitration issues
Dynamic load balancing can handle processing
bursts with OS controlled thread scheduling
Dynamic memory allocation means that all cores
can draw on full pool of available memory without
penalty.
High performance inter-core messaging and thread
synchronization
Core-to-core application synchronization using
POSIX OS primitives
System wide statistics / information gathering
capability for performance optimizations,
debugging, etc.
Symmetric Model Cons
Load balancing is dynamic and application may
require dedicated CPU
Applications with poor synchronization among
threads may not work properly in a true parallel
processing environment
Difficult to change software
3rd party software
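
The synchronization concern listed above can be made concrete with a shared counter. On a single core, an unprotected value++ often appears to work because threads rarely interleave mid-update; on two cores the read-modify-write can overlap and updates are lost. The sketch below shows the protected version using a POSIX mutex; the names counter, job, add_many, and count_in_parallel are hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <pthread.h>

enum { MAX_THREADS = 16 };

struct counter {
    pthread_mutex_t lock;   /* guards value */
    long value;
};

struct job {
    struct counter *counter;
    long iterations;
};

/* Thread body: each increment happens with the mutex held, so two
 * cores can never interleave the read-modify-write of value. */
void *add_many(void *arg)
{
    struct job *job = arg;
    for (long i = 0; i < job->iterations; i++) {
        pthread_mutex_lock(&job->counter->lock);
        job->counter->value++;
        pthread_mutex_unlock(&job->counter->lock);
    }
    return NULL;
}

/* Run nthreads threads that each add `iterations` to one counter and
 * return the final value. */
long count_in_parallel(int nthreads, long iterations)
{
    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    struct counter counter = { .value = 0 };
    pthread_mutex_init(&counter.lock, NULL);
    struct job job = { &counter, iterations };
    pthread_t tid[MAX_THREADS];
    int started = 0;
    while (started < nthreads &&
           pthread_create(&tid[started], NULL, add_many, &job) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    pthread_mutex_destroy(&counter.lock);
    return counter.value;
}
```

With the lock in place the result is the number of started threads times the iteration count, regardless of how many cores run the threads. Removing the lock and unlock calls is the kind of latent bug that true parallel execution tends to expose.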

[Figure: applications on a single OS over two CPUs, each with its own cache, joined by a system interconnect to the memory controller, memory, and I/O]
28
Multi-core Scaling Software

QNX conforms to POSIX (Portable Operating System
Interface) Application Programming Interface
Allows straightforward porting of code from one
OS to another that is also conformant
POSIX provides lightweight primitives for MP
programming (threads, mutexes)
Application broken down into memory protected
units called processes
Processes further divided into internal,
schedulable units called threads
Threads share all of the same resources (memory
space included)
PROCESSES run on individual cores concurrently in
asymmetric mode (all threads for a process are
tied to one core)
THREADS run on individual cores concurrently in
symmetric operation

29
Scaling Applications Asymmetrically
Core-to-core IPC

Process per core required for full performance
State information maintained in shared memory or
through IPC
Clustering protocols (e.g. TIPC)
Heavy-weight synchronization required
Potentially complex interaction required between
processes to share work
Difficult to scale to more processors

30
Scaling Applications Symmetrically

Pool of POSIX worker threads
Dispatch work to worker threads
Scales very well / easily with SMP
Simply adjust number of worker threads to number
of CPUs
No code change required
Very lightweight OS primitives to synchronize
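
The worker-pool approach above can be sketched with POSIX threads. This is a minimal illustration, not the presenters' code: it splits an array sum statically across workers rather than using a dispatch queue, the names worker_count() and parallel_sum() are hypothetical, and it assumes sysconf(_SC_NPROCESSORS_ONLN) is available, which is a common extension rather than a strict POSIX requirement.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

enum { MAX_WORKERS = 64 };

struct slice {
    const int *data;
    size_t count;
    long sum;               /* written by the worker, read after join */
};

/* Worker body: sum one private slice, so no locking is needed. */
void *sum_slice(void *arg)
{
    struct slice *s = arg;
    long sum = 0;
    for (size_t i = 0; i < s->count; i++)
        sum += s->data[i];
    s->sum = sum;
    return NULL;
}

/* Number of workers to use: one per online CPU, clamped to the pool size. */
int worker_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n < 1)
        n = 1;
    if (n > MAX_WORKERS)
        n = MAX_WORKERS;
    return (int)n;
}

/* Split data into `workers` slices, sum them concurrently, and combine. */
long parallel_sum(const int *data, size_t count, int workers)
{
    assert(workers >= 1 && workers <= MAX_WORKERS);
    pthread_t tid[MAX_WORKERS];
    struct slice part[MAX_WORKERS];
    int started[MAX_WORKERS];
    size_t chunk = count / (size_t)workers, offset = 0;
    long total = 0;

    for (int i = 0; i < workers; i++) {
        /* The last worker also takes the remainder. */
        size_t n = (i == workers - 1) ? count - offset : chunk;
        part[i] = (struct slice){ data + offset, n, 0 };
        offset += n;
        started[i] = pthread_create(&tid[i], NULL, sum_slice, &part[i]) == 0;
        if (!started[i])
            sum_slice(&part[i]);        /* fall back to doing it inline */
    }
    for (int i = 0; i < workers; i++) {
        if (started[i])
            pthread_join(tid[i], NULL);
        total += part[i].sum;
    }
    return total;
}
```

Calling parallel_sum(data, count, worker_count()) adapts to the number of cores at run time, which is the "no code change required" property the slide describes: the same binary uses one worker on a uniprocessor and two on a dual-core part.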

[Figure: a process whose main thread dispatches to worker threads running on CPU 0, CPU 1, through CPU N]
31
The Transition to Multi-Core

The QNX Solution

32
AMP or SMP?

Sometimes this can be a clear cut decision
Two operating systems: AMP
Application requires all available CPUs to
maximize performance: SMP
Pre-selecting the operating system can force the
decision (usually AMP support only)
What if the versatility of SMP is desired but the
control of AMP is needed?

33
QNX Bound Multiprocessing

The Best of Both Worlds
Bound Multiprocessing offers an approach that
provides benefits of both asymmetric and
symmetric modes
Support existing code base and multi-core
optimized applications
Supports bound and symmetric operation,
selectable by process / thread
Designer has full control over applications
Applications and/or threads can be bound to a
specific core
Load balancing
OS dynamic or designer controlled
Tools to optimize load balancing
Resource sharing handled by OS
High Performance
Kernel support for message passing and thread
synchronization
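
One way to picture "bound or symmetric, selectable per thread" is a per-thread run mask with one bit per core: a single set bit binds the thread, and all bits set leaves placement to the scheduler. The sketch below models only that idea; runmask_for_cores() and pick_core() are hypothetical names and do not represent the QNX scheduler. QNX Neutrino documents a ThreadCtl() command, _NTO_TCTL_RUNMASK, that takes a mask of this kind, but that call is left out so the sketch stays portable.

```c
#include <assert.h>
#include <stddef.h>

/* Build a run mask with one bit set for every permitted core.
 * Core numbers outside 0..ncores-1 are ignored. */
unsigned runmask_for_cores(const int *cores, size_t n, int ncores)
{
    assert(ncores >= 1 && ncores <= 32);
    unsigned mask = 0;
    for (size_t i = 0; i < n; i++)
        if (cores[i] >= 0 && cores[i] < ncores)
            mask |= 1u << cores[i];
    return mask;
}

/* Toy placement decision: return the lowest-numbered idle core the
 * thread is allowed to run on, or -1 if it has to wait.
 * idle has one bit set per currently idle core. */
int pick_core(unsigned runmask, unsigned idle, int ncores)
{
    assert(ncores >= 1 && ncores <= 32);
    unsigned candidates = runmask & idle;
    for (int core = 0; core < ncores; core++)
        if ((candidates >> core) & 1u)
            return core;
    return -1;
}
```

On a dual-core part, a thread with mask 0x2 behaves as in an asymmetric design (it waits for core 1 even when core 0 is idle), while a thread with mask 0x3 is load-balanced as under SMP. Both kinds of thread can coexist under one OS image.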

34
Multiprocessing Summary
35
The Transition to Multi-Core

The Role of Tools

36
The Role of Tools

The right toolset eases the transition to
multi-core processors
Assess current software when moving to multi-core
Should processes be separated between cores?
Determine how closely coupled the current
processes are
Where can concurrent processing help?
Show the current processing bottlenecks
Debugging in a multi-core environment
Characterize and debug interaction between
threads on multiple CPUs
Tuning and Optimization in a multi-core
environment
Move processes and threads between cores
Examine processing bottlenecks
Examine inter-process communications

37
Instrumented Kernel

The instrumented kernel logs events which are
filtered and stored into buffers which are
captured and analyzed.

[Figure: the microkernel emits events (system calls, interrupts, process/thread creation, state changes) that pass through on/off filters, static event filters, and user-defined filters into event buffers (E1 to E6); a capture step delivers them over the network or as a file to the System Profiler]
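
The log, filter, and buffer pipeline on this slide can be sketched as a ring buffer guarded by an event-class mask. This is a simplified, single-threaded illustration and not the QNX instrumented kernel: the event classes mirror the slide's labels, while the structure layout, the capacity, and the names trace_log() and trace_count_for_cpu() are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Event classes, one bit each, so a filter is a simple bitmask. */
enum event_class {
    EV_SYSCALL   = 1 << 0,
    EV_INTERRUPT = 1 << 1,
    EV_THREAD    = 1 << 2,   /* process/thread creation */
    EV_STATE     = 1 << 3    /* state changes */
};

struct event {
    unsigned cls;            /* one of enum event_class */
    unsigned cpu;            /* core the event occurred on */
    unsigned long timestamp;
};

enum { TRACE_CAPACITY = 8 };

struct trace_buffer {
    unsigned enabled;        /* on/off filter: classes to record */
    struct event slots[TRACE_CAPACITY];
    size_t head;             /* next slot to write */
    size_t count;            /* number of valid slots */
};

/* Record ev if its class is enabled; the oldest entry is overwritten
 * once the buffer is full. Returns 1 if stored, 0 if filtered out. */
int trace_log(struct trace_buffer *tb, struct event ev)
{
    assert(tb != NULL);
    if (!(tb->enabled & ev.cls))
        return 0;
    tb->slots[tb->head] = ev;
    tb->head = (tb->head + 1) % TRACE_CAPACITY;
    if (tb->count < TRACE_CAPACITY)
        tb->count++;
    return 1;
}

/* Analysis step: how many buffered events occurred on a given core. */
size_t trace_count_for_cpu(const struct trace_buffer *tb, unsigned cpu)
{
    assert(tb != NULL);
    size_t n = 0;
    for (size_t i = 0; i < tb->count; i++)
        if (tb->slots[i].cpu == cpu)
            n++;
    return n;
}
```

Filtering before the write keeps the cost of disabled classes to a single mask test, and recording the core number with each event is what lets a profiler later show per-core activity.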
38
Thread / Process Coupling: QNX Momentics System
Profiler
Determine amount of messaging between processes.
39
Finding Processing Bottlenecks: QNX Momentics
Application Profiler
Determine which threads are busiest
Pinpoint which source lines consume the most CPU.
Use call pairing to identify your program's
execution structure, then use the information to
make your code more efficient.
40
Load Balancing: QNX Momentics System Profiler
Measure CPU activity for all cores to
determine optimal load balancing
41
The Transition to Multi-core

Software Architecture and Optimization

42
Architecting Multi-core Applications

Design a concurrency model (task is either a
thread or a process)
Assign each external event or each peripheral a
separate task
Use one task to service events that occur at
approximately the same rate
Assign separate tasks to operations of widely
differing durations
Perform related computations (such as
safety-critical or multi-stage, sequential)
within a single task
Isolate unrelated operations into separate tasks
Assign proper priorities to tasks within a CPU
E.g. rate monotonic analysis (RMA)
For asymmetric operation, partition application
appropriately
AMP or BMP
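
For the rate monotonic analysis mentioned above, the classic sufficient test (Liu and Layland) compares total utilization on one CPU against a bound that depends only on the number of tasks n, where C_i is the worst-case execution time and T_i the period of task i. The two tasks in the worked line are hypothetical values chosen for illustration.

```latex
U = \sum_{i=1}^{n} \frac{C_i}{T_i} \le n\left(2^{1/n} - 1\right)
% Worked example with two hypothetical tasks on one CPU:
%   task 1: C_1 = 1 ms, T_1 = 4 ms;  task 2: C_2 = 2 ms, T_2 = 8 ms
U = \frac{1}{4} + \frac{2}{8} = 0.5 \le 2\left(\sqrt{2} - 1\right) \approx 0.828
```

The test is sufficient but not necessary: a task set above the bound may still be schedulable, and the check applies per CPU, so it fits tasks that have been bound or partitioned to a core.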

43
Partitioning System Applications

Partition by functionality
Processes related to a particular functionality
are grouped on a CPU
Data path on CPU 0, control plane on CPU 1
Receive path on CPU 0, transmit path on CPU 1
Partition by CPU load
Process with high (or highly variable) CPU load
runs on its own CPU
Routing application: route calculation on CPU 1,
remainder of the application on CPU 0
High priority, high CPU usage threads can starve
other threads
Partition by information-sharing requirements
Applications requiring access to same data
grouped on a CPU (reduces contention and
resulting serialization between cores)
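
Partitioning by CPU load can be sketched as a greedy static assignment: take processes from heaviest to lightest and place each on the CPU with the least load so far. This is one simple heuristic, not a method the presenters prescribe; partition_by_load() is a hypothetical name and the load figures are assumed to come from profiling.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_CPUS = 8 };

/* Assign each process to the currently least-loaded CPU.
 * load[i]     estimated CPU load of process i, in percent; the caller
 *             is expected to pass the processes sorted heaviest first
 * cpu_of[i]   receives the CPU chosen for process i
 * cpu_load[c] receives the total load placed on CPU c */
void partition_by_load(const int *load, size_t nproc, int ncpus,
                       int *cpu_of, int *cpu_load)
{
    assert(ncpus >= 1 && ncpus <= MAX_CPUS);
    for (int c = 0; c < ncpus; c++)
        cpu_load[c] = 0;
    for (size_t i = 0; i < nproc; i++) {
        int best = 0;
        for (int c = 1; c < ncpus; c++)     /* ties go to the lower CPU */
            if (cpu_load[c] < cpu_load[best])
                best = c;
        cpu_of[i] = best;
        cpu_load[best] += load[i];
    }
}
```

With estimated loads of 60, 20, 15, and 10 percent on two CPUs, the heavy process ends up alone on CPU 0 and the other three share CPU 1, matching the slide's advice that a process with high CPU load runs on its own CPU.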

44
Optimizing Multi-core Applications

Reduce contention
Minimize or remove core-core interactions to
ensure most parallelism
Scale to number of available processors
Use system analysis tools to tune performance
Asymmetric operation
Properly partition to produce desired CPU loading
for each core
Symmetric operation
Asymmetric application operation
Thread affinity
Bound Multiprocessing for dedicated CPU
allocation
Select proper thread / process priorities to
optimize real-time performance / CPU allocation

45
QNX Enables Multi-core Migration

QNX provides a complete solution
Proven OS support for any multi-core processing
model
Full suite of development tools to characterize
and optimize multi-core applications
Expert professional services and support
Market leading multi-core board support packages
Professional Training

Asymmetric Multiprocessing
Support existing software base, non-optimized
uni-processor approach
Mixed OS environment

Design Needs

Bound Multiprocessing
Migrate existing software base
Mix existing applications with multi-core
optimized applications
Transparent scaling beyond dual core

Symmetric Multiprocessing
Multi-core optimized applications
Transparent scaling beyond dual core

46
QNX, Freescale and Multi-core Processors

Freescale and QNX have collaborated on PPC for
many years
QNX has extensive support of Freescale Processors
QNX and Freescale have existing customers
shipping products using both multi-processing and
distributed processing based on MPC744x
processors
QNX and Freescale committed to enabling customer
success on multi-core processors starting with
the MPC8641D
See QNX Multi-core Edition running on the
MPC8641D in the technology lab today

47
Thank You!