Title: The Transition to Multi-Core: Is Your Software Ready? (AN170)
1. The Transition to Multi-Core: Is Your Software Ready? (AN170)
- Toby Foster, Product Marketing, Freescale
- Sebastien Marineau-Mes, Director, OS Group, QNX
2. Agenda
- Overview
- MPC8641D Overview
- Asymmetric Multi-Processing
- Symmetric Multi-Processing
- The QNX Solution
- The Role of Tools
- Q&A
3. The Transition to Multi-Core
4. Dual Core Example Applications
- Dual core and high integration ideal for:
  - High-end line card
    - Extensive processing power for extreme control plane activities
  - Mid-range line card
    - Capability to support both control and data
  - Services card
    - Upgrade platform with advanced features
[Figure labels: data plane ASIC/NPU, high-end line card, management port, mid-range line card, services card]
5. Multiprocessing Configurations
- Asymmetric Multiprocessing
  - Two separate OSes or two copies of one non-SMP OS
  - Collapse two processors into one
  - Task offload or division of labor
  - Operating systems and data reside in different address spaces
  - Resource sharing handled by user
  - Static load balancing
- Bound and Symmetric Multiprocessing
  - Homogeneous OS support
  - High-performance option
  - Software transparency
  - Cores share address space for OS and data
  - Resource sharing handled by OS
  - Dynamic load balancing by OS (SMP)
  - Static task partitioning (BMP)
[Figure label: Memory Map Overlap]
10. Usage of a Dual Core Device
[Figure: dual-core (Core1/Core2) usage scenarios, panels labeled A to F, ranging from mid range to high end. Some panels include a data plane ASIC.]
- One core handles data plane, one control plane
- Task offload
- Each core handles a separate aspect of control plane
- Network and disk partitioning
11. MPC8641D: Packed with Processing Power
- Dual e600 PowerPC cores
  - AltiVec™
  - 36-bit addressing
  - 1 MB L2 cache w/ECC per core
- Dual memory controller
  - Dual DDR2/3 SDRAM
  - 64-bit data bus w/ECC
  - Support for up to 32 GB memory
- High-speed interconnect
  - One x8/x4/x2/x1 PCIe, and
  - One x8/x4/x2/x1 PCIe or one x4/x1 sRapidIO
- Ethernet
  - 4x 10/100/1000 Ethernet controllers w/ classification/policing, 8 Rx/Tx queues, checksum offload, QoS, lossless flow control, and FIFO mode
- 90nm SOI process, 1023-pin package
- Availability
  - Alpha samples Q2'06
  - Production mid 2007
[Figure: MPC8641D block diagram; labels: MPX Bus, Peripheral Logic Bus]
12. Asymmetric Integration
[Figure: 8641D with two e600 cores and system logic; each core is labeled "non-SMP OS"]
- Two OS kernel images in physical memory
- Each core executes a separate OS kernel image
- Non-SMP OSes must cooperate in sharing resources
- Examples: VxWorks, OSE, Integrity, Jaluna-1, many others
13. Asymmetric MP Memory Organization
- Each OS kernel expects to control physical memory beginning at address 0
- Each wants its own interrupt vectors
- The MMU can relocate applications and shared memory appropriately
- The 8641D includes a hardware translator to relocate physical address 0 for core1
[Figure: e600 core0 (OS, Apps "A") and e600 core1 (OS, Apps "B"), each with an MMU, mapped onto physical memory containing OS "A", OS "B", Apps "A", Apps "B", and shared memory]
14. Resources: Shared or Multiple Instances
[Figure: 8641D resources around e600 core0 and e600 core1; labels: MPIC, Local Bus, SRIO. Legend: multiple resource instances; shared resource; partially shared or multiple instances in some circumstances]
15. QNX and Multi-core
- QNX has done the heavy lifting to enable migration to multi-core
  - Let developers focus on product differentiation
- Reliable, proven support for multi-core applications
  - 1997: Industry's first to bring SMP to embedded
  - 1984: High performance, transparent distributed messaging
- Full support for asymmetric and symmetric multiprocessing
  - Linux and VxWorks interoperability
  - Migrate existing software base and enable new multi-core optimized applications
- Multi-core capable tool suite
- World class professional services and expert training
- Active role in developing standards through Multi-core Exchange consortium
  - Enable portability of applications across various platforms
  - Derive common set of APIs that multi-core development tools can utilize to support interoperability
16. Asymmetric Processing
- Asymmetric Model Pros
  - Only possible mode when different OSes are running
  - CPU core can be dedicated to specific applications
  - One possible mode for applications that cannot operate with parallel processing
- Asymmetric Model Cons
  - Resource sharing / arbitration needs to be designed into system by developers
  - Neither OS owns the whole system
  - Memory, I/O, interrupts are shared
  - Evolution: complexity will increase as more cores are added
  - Static configuration, difficult to add dynamic resourcing
  - Time to market?
  - Contention possible during system initialization, during normal operation, on interrupts, on system error conditions. All must be dealt with by the designer.
  - Synchronization between cores done through application level messages
  - Sub-optimal performance
  - Complexity of the problem is not linear
  - Addition of more cores may require re-architecting application to take full advantage of additional CPUs
21. Homogeneous AMP: Neutrino Transparent Distributed Processing
- Extends message passing bus over a transport layer
- Applications / services can be built in a fully distributed manner without special code
  - Message queues
  - File systems
  - Hardware ports
- Seamless sharing of I/O resources between cores (e.g. use a serial port owned by another core)
[Figure: Microkernel Core 0 (application, flash file system, message queues, networking stack connected to the Internet) and Microkernel Core 1 (application, flash file system, database), each with a message-passing bus, joined by a message bridge (Ethernet, RapidIO, shared memory)]
Local access: fd = open("/dev/ffs1", ...); write(fd, ...);
Access via core 0: fd = open("/net/core0/dev/ffs1", ...); write(fd, ...);
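The point of the two calls above is that application code does not change when the resource lives on another core: only the path does. A minimal sketch of that idea with plain POSIX calls follows. The helper name `write_record`, the open flags, and the file mode are assumptions made for illustration, not something the slides specify.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: append a text record to whatever resource `path`
 * names. The code is identical for a local path and a path that resolves
 * to another node; only the string differs.
 * Returns the number of bytes written, or -1 on error. */
ssize_t write_record(const char *path, const char *record)
{
    assert(path != NULL && record != NULL);

    /* Flags and mode are illustrative assumptions. */
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1)
        return -1;

    size_t len = strlen(record);
    size_t done = 0;
    while (done < len) {            /* write() may accept fewer bytes */
        ssize_t n = write(fd, record + done, len - done);
        if (n == -1) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }

    if (close(fd) == -1)
        return -1;
    return (ssize_t)done;
}
```

Under the naming shown on the slide, code on core 1 would call `write_record("/net/core0/dev/ffs1", ...)` to reach core 0's flash file system, and `write_record("/dev/ffs1", ...)` for its own.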
22. Heterogeneous AMP
- Asymmetric Processing with Neutrino and Linux
  - Run Carrier Grade Linux on one core with QNX RTOS on the other
- Inter-process communication between OSes
  - TIPC is emerging standard between applications
    - http://tipc.sourceforge.net/
    - Location transparency
    - Higher performance than TCP/IP
    - Quality of Service
- Linux benefits
  - Wide availability of open source and commercial software
  - No run time licensing
- QNX benefits
  - Real time performance
  - High availability framework
  - Memory protection
  - Market leading distributed processing capability
  - No GPL contamination issues
- Combined benefit: best of both worlds
23. Symmetric Multiprocessing
[Figure: both cores labeled "SMP OS"]
- One OS kernel image in physical memory
- Both cores execute the same OS kernel image
- SMP OS owns all of the resources
- Linux, QNX, BSD: the only embedded SMP OSes
24. SMP Memory Organization
- The OS kernel resides at physical memory address 0, addressable by both cores
- The MMU relocates applications and shared memory appropriately
[Figure: e600 core0 (OS, Apps "A") and e600 core1 (OS, Apps "B"), each with an MMU, mapped onto physical memory containing the OS, Apps "A", Apps "B", and shared memory]
25. What is Coherency?
- Consistent view of memory across multiple agents
  - Buffer descriptors and data buffers updated by processor(s) as well as external agent(s)
- Software-managed coherency
  - Processor overhead to keep track of who owns what when
- Hardware-managed coherency
  - Each processor's hardware ensures consistency of shared data by snooping other agents' broadcasts on the system bus
26. Performance Features of HW Coherency
- Coherency protocol
  - MEI
  - MESI
- Update mechanism
  - Push
  - Intervention
- Cache tags
  - Single-ported
  - Dual-ported
[Figure: Processor A and Processor B on the MPX Bus with memory and an I/O device]
27. Symmetric Processing
- Symmetric Model Pros
  - Highly scalable. Supports multiple processing cores seamlessly without code modification
  - One OS sees all and handles all resource sharing / arbitration issues
  - Dynamic load balancing can handle processing bursts with OS controlled thread scheduling
  - Dynamic memory allocation means that all cores can draw on full pool of available memory without penalty
  - High performance inter-core messaging and thread synchronization
  - Core-to-core application synchronization using POSIX OS primitives
  - System wide statistics / information gathering capability for performance optimizations, debugging, etc.
- Symmetric Model Cons
  - Load balancing is dynamic and application may require dedicated CPU
  - Applications with poor synchronization among threads may not work properly in a true parallel processing environment
    - Difficult to change software
    - 3rd party software
[Figure: applications on one OS on two CPUs, each with a cache, connected through a system interconnect to I/O and to a memory controller with memory]
28. Multi-core Scaling Software
- QNX conforms to the POSIX (Portable Operating System Interface) Application Programming Interface
  - Allows straightforward porting of code from one OS to another that is also conformant
  - POSIX provides lightweight primitives for MP programming (threads, mutexes)
- Application broken down into memory protected units called processes
- Processes further divided into internal, schedulable units called threads
  - Threads of a process share all of the same resources (memory space included)
- PROCESSES run on individual cores concurrently in asymmetric mode (all threads for a process are tied to one core)
- THREADS run on individual cores concurrently in symmetric operation
29. Scaling Applications Asymmetrically
- Process per core required for full performance
- State information maintained in shared memory or through IPC
  - Clustering protocols (e.g. TIPC)
- Heavy-weight synchronization required
- Potentially complex interaction required between processes to share work
- Difficult to scale to more processors
[Figure label: Core-to-core IPC]
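As a rough illustration of "state information maintained in shared memory", the sketch below has two processes update one counter that lives in a shared mapping. It is a single-OS stand-in for the pattern, not an AMP implementation: `fork()`, the anonymous shared mapping, the struct layout, and the function name are all assumptions made so the example is self-contained. Real per-core processes would more likely attach to a named shared memory object and need heavier synchronization than one atomic counter.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical state shared by the two processes. */
struct shared_state {
    atomic_long packets_handled;
};

/* Fork one child; parent and child each add `per_process` to a counter in
 * shared memory. Returns the final count seen by the parent, or -1 on error. */
long run_two_processes(long per_process)
{
    assert(per_process >= 0);

    struct shared_state *st = mmap(NULL, sizeof *st, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (st == MAP_FAILED)
        return -1;
    atomic_init(&st->packets_handled, 0);

    pid_t pid = fork();
    if (pid == -1) {
        munmap(st, sizeof *st);
        return -1;
    }

    /* Both processes run this loop against the same physical page. */
    for (long i = 0; i < per_process; i++)
        atomic_fetch_add(&st->packets_handled, 1);

    if (pid == 0)
        _exit(0);                /* child: work done */

    int status;
    if (waitpid(pid, &status, 0) == -1 || !WIFEXITED(status)) {
        munmap(st, sizeof *st);
        return -1;
    }

    long total = atomic_load(&st->packets_handled);
    munmap(st, sizeof *st);
    return total;
}
```

Even this small case needs explicit setup, an agreed memory layout, and cross-process synchronization, which is the overhead the bullets above point to.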
30. Scaling Applications Symmetrically
- Pool of POSIX worker threads
  - Dispatch work to worker threads
- Scales very well / easily with SMP
  - Simply adjust number of worker threads to number of CPUs
  - No code change required
- Very lightweight OS primitives to synchronize
[Figure: one process whose main thread dispatches to worker threads, which run across CPU 0, CPU 1, ... CPU N]
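A minimal sketch of the worker-pool pattern with POSIX threads follows. The names (`struct pool`, `run_pool`, `process_item`) and the squaring workload are placeholders chosen for illustration. Work is dispatched by having each worker claim the next item index under a mutex, which is one simple way to do it and not necessarily the one the presenters had in mind.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical pool state shared by all workers. */
struct pool {
    pthread_mutex_t lock;
    const int *items;
    size_t count;
    size_t next;        /* index of the next undispatched item */
    long total;         /* combined result of all workers */
};

/* Stand-in for real per-item work. */
static long process_item(int x) { return (long)x * x; }

static void *worker(void *arg)
{
    struct pool *p = arg;
    long local = 0;

    for (;;) {
        pthread_mutex_lock(&p->lock);
        if (p->next == p->count) {      /* no work left: publish and exit */
            p->total += local;
            pthread_mutex_unlock(&p->lock);
            return NULL;
        }
        size_t i = p->next++;           /* claim one item */
        pthread_mutex_unlock(&p->lock);

        local += process_item(p->items[i]);   /* work outside the lock */
    }
}

/* Process all items with `nworkers` threads.
 * Returns the combined result, or -1 if no worker could be started. */
long run_pool(const int *items, size_t count, unsigned nworkers)
{
    assert(items != NULL || count == 0);
    if (nworkers == 0)
        return -1;

    pthread_t *tids = malloc(nworkers * sizeof *tids);
    if (tids == NULL)
        return -1;

    struct pool p = { .items = items, .count = count };
    pthread_mutex_init(&p.lock, NULL);

    unsigned started = 0;
    for (; started < nworkers; started++)
        if (pthread_create(&tids[started], NULL, worker, &p) != 0)
            break;

    for (unsigned i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    pthread_mutex_destroy(&p.lock);
    free(tids);
    return started > 0 ? p.total : -1;
}
```

Scaling to more CPUs only changes the `nworkers` argument. On systems that provide it, `sysconf(_SC_NPROCESSORS_ONLN)` is one common way to obtain that number at run time.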
31. The Transition to Multi-Core
32. AMP or SMP?
- Sometimes this can be a clear cut decision
  - Two operating systems: AMP
  - Application requires all available CPUs to maximize performance: SMP
  - Pre-selecting the operating system can force the decision (usually AMP support only)
- What if the versatility of SMP is desired but the control of AMP is needed?
33. QNX Bound Multiprocessing
- The Best of Both Worlds
  - Bound Multiprocessing offers an approach that provides benefits of both asymmetric and symmetric modes
  - Support existing code base and multi-core optimized applications
  - Supports bound and symmetric operation, selectable by process / thread
- Designer has full control over applications
  - Applications and/or threads can be bound to a specific core
- Load balancing
  - OS dynamic or designer controlled
  - Tools to optimize load balancing
- Resource sharing handled by OS
- High Performance
  - Kernel support for message passing and thread synchronization
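To make "bound to a specific core" concrete, the sketch below pins the calling thread to a single CPU. It deliberately uses the Linux/glibc `sched_setaffinity` interface as a stand-in, because that is widely available for experimentation. QNX Neutrino has its own runmask mechanism, which is not shown here. The function name is an assumption.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* sched_setaffinity and CPU_* macros (glibc) */
#endif
#include <assert.h>
#include <sched.h>

/* Bind the calling thread to the lowest-numbered CPU it is currently allowed
 * to run on. Returns that CPU's index, or -1 on error. */
int bind_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof allowed, &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (!CPU_ISSET(cpu, &allowed))
            continue;

        cpu_set_t one;
        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        assert(CPU_COUNT(&one) == 1);   /* mask names exactly one CPU */

        /* pid 0 means "the calling thread". */
        if (sched_setaffinity(0, sizeof one, &one) != 0)
            return -1;
        return cpu;
    }
    return -1;                          /* no allowed CPU found */
}
```

Threads that never call such a function stay under the scheduler's dynamic load balancing, so bound and symmetric behavior can coexist within one system, which is the mix the slide describes.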
34. Multiprocessing Summary
35. The Transition to Multi-Core
36. The Role of Tools
- The right toolset eases the transition to multi-core processors
- Assess current software when moving to multi-core
  - Should processes be separated between cores?
    - Determine how closely coupled the current processes are
  - Where can concurrent processing help?
    - Show the current processing bottlenecks
- Debugging in a multi-core environment
  - Characterize and debug interaction between threads on multiple CPUs
- Tuning and optimization in a multi-core environment
  - Move processes and threads between cores
  - Examine processing bottlenecks
  - Examine inter-process communications
37. Instrumented Kernel
- The instrumented kernel logs events, which are filtered and stored in buffers, then captured and analyzed.
[Figure: microkernel events (system calls, interrupts, process/thread creation, state changes) pass through on/off filters, static event filters, and user defined filters into event buffers (E1 to E6); the buffers are captured to a file or over the network for the System Profiler]
38. Thread / Process Coupling: QNX Momentics System Profiler
Determine the amount of messaging between processes.
39. Finding Processing Bottlenecks: QNX Momentics Application Profiler
Determine which threads are busiest.
Pinpoint which source lines consume the most CPU.
Use call pairing to identify your program's execution structure, then use the information to make your code more efficient.
40. Load Balancing: QNX Momentics System Profiler
Measure CPU activity for all cores to determine optimal load balancing.
41. The Transition to Multi-core
- Software Architecture and Optimization
42. Architecting Multi-core Applications
- Design a concurrency model (task is either a thread or a process)
  - Assign each external event or each peripheral a separate task
  - Use one task to service events that occur at approximately the same rate
  - Assign separate tasks to operations of widely differing durations
  - Perform related computations (such as safety-critical or multi-stage, sequential) within a single task
  - Isolate unrelated operations into separate tasks
- Assign proper priorities to tasks within a CPU
  - E.g. rate monotonic analysis (RMA)
- For asymmetric operation, partition application appropriately
  - AMP or BMP
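Rate monotonic analysis gives shorter-period tasks higher priority and then checks whether the task set on one CPU can meet its deadlines. The sketch below applies one well-known sufficient check, the hyperbolic bound (the product of (U_i + 1) over all tasks must not exceed 2, where U_i is execution time divided by period). The choice of this particular test, the struct, and the function name are assumptions for illustration. It presumes independent periodic tasks with deadlines equal to their periods on a single CPU under preemptive fixed-priority scheduling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task description for one CPU. */
struct task {
    double wcet;      /* worst-case execution time per period */
    double period;    /* period, in the same time unit as wcet */
};

/* Returns 1 if the task set passes the hyperbolic bound for rate monotonic
 * scheduling, 0 otherwise. Passing is sufficient for schedulability under
 * the stated assumptions; failing does not prove the set is unschedulable. */
int rm_passes_hyperbolic_bound(const struct task *tasks, size_t n)
{
    assert(tasks != NULL || n == 0);

    double product = 1.0;
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].period <= 0.0 || tasks[i].wcet < 0.0)
            return 0;                               /* invalid parameters */
        product *= tasks[i].wcet / tasks[i].period + 1.0;
    }
    return product <= 2.0;
}
```

For example, tasks with (execution time, period) of (1, 4), (2, 8), and (1, 10) give 1.25 x 1.25 x 1.1 = 1.71875 and pass, while (2, 4) and (3, 6) give 1.5 x 1.5 = 2.25 and do not. In a partitioned (AMP or BMP) design, the check would be run once per core on the tasks assigned to that core.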
43. Partitioning System Applications
- Partition by functionality
  - Processes related to a particular functionality are grouped on a CPU
    - Data path on CPU 0, control plane on CPU 1
    - Receive path on CPU 0, transmit path on CPU 1
- Partition by CPU load
  - Process with high (or highly variable) CPU load runs on its own CPU
    - Routing application: route calculation on CPU 1, remainder of the application on CPU 0
  - High priority, high CPU usage threads can starve other threads
- Partition by information-sharing requirements
  - Applications requiring access to same data grouped on a CPU (reduces contention and resulting serialization between cores)
44. Optimizing Multi-core Applications
- Reduce contention
  - Minimize or remove core-to-core interactions to maximize parallelism
- Scale to number of available processors
- Use system analysis tools to tune performance
- Asymmetric operation
  - Properly partition to produce desired CPU loading for each core
- Symmetric operation
  - Asymmetric application operation
    - Thread affinity
    - Bound Multiprocessing for dedicated CPU allocation
  - Select proper thread / process priorities to optimize real-time performance / CPU allocation
45. QNX Enables Multi-core Migration
- QNX provides a complete solution
  - Proven OS support for any multi-core processing model
  - Full suite of development tools to characterize and optimize multi-core applications
  - Expert professional services and support
  - Market leading multi-core board support packages
  - Professional training
Design Needs
- Asymmetric Multiprocessing
  - Support existing software base, non-optimized uni-processor approach
  - Mixed OS environment
- Bound Multiprocessing
  - Migrate existing software base
  - Mix existing applications with multi-core optimized applications
  - Transparent scaling beyond dual core
- Symmetric Multiprocessing
  - Multi-core optimized applications
  - Transparent scaling beyond dual core
46. QNX, Freescale and Multi-core Processors
- Freescale and QNX have collaborated on PPC for many years
- QNX has extensive support of Freescale processors
- QNX and Freescale have existing customers shipping products using both multi-processing and distributed processing based on MPC744x processors
- QNX and Freescale committed to enabling customer success on multi-core processors starting with the MPC8641D
- See QNX Multi-core Edition running on the MPC8641D in the technology lab today
47. Thank You!