Title: An Embedded Coherent-Multithreading Multimedia Processor and Its Programming Model
1 An Embedded Coherent-Multithreading Multimedia Processor and Its Programming Model
- J.C. Chu, W.C. Ku, Shu-Hsuan Chou
- T.F. Chen, J.I. Guo
DAC 2007, San Diego, June 7, 2007
SOC Research Center, National Chung Cheng University, Chia-Yi, Taiwan, R.O.C.
2 Motivation
- Current embedded multimedia systems
- Multi-mode, high complexity
- Low-power, configurable for low design cost
- Design challenges
- Parallelism
- Communication/Sync
- Memory bandwidth
- HW cost
What is interesting for embedded multimedia?
One of the solutions: coherent-multithreading
3 Related work
- Niagara: A 32-Way Multithreaded SPARC Processor
- IEEE Micro, 2005
- 8 cores, 32 threads on a single CPU
- 4-way multithreaded pipeline per core
- High-bandwidth crossbar to a shared L2 cache
- Sharing of resources at all levels leads to an area- and power-efficient design
- The Vector-Thread Architecture
- ISCA 2004
- Vector and multithreaded architectures
- Unified control processor and vector-thread unit (4 lanes)
- High-bandwidth crossbar to a shared L1 cache
- Allows intermixing of multiple levels of parallelism
4 Objectives
Designing an embedded coherent-multithreading architecture
Constructing an efficient embedded multithreading memory system
Building a multimedia data-parallel programming model
Reducing communication and synchronization cost
5 Outline
- Introduction
- Exploration: parallelism/bandwidth/cost
- Architecture and programming model
- Methodology I: Coherent-multithreading
- Methodology II: One-stop streaming processing
- Case study: H.264 program
- Conclusion
6 Overview of VisoMT processor
- Coherent-multithreading
- Unified RISC/multithreading DSP
- Correlative SIMD streams parallel execution
- Fast data communication on thread level
- One-stop streaming processing
- Efficient streaming buffer
- Background bulk data movement/pre-transposition (BB_L/S)
[Figure: block diagram with data paths. Virtual VLIW, SMT with open configurable architecture.]
7 Methodology I: Coherent-multithreading
- Unified RISC/multithreading DSP
- Separate control (scheduler) and data-intensive computation
- Simple/directive connection
- Share address space/L1 cache
- Correlative SIMD streams parallel execution
- Collect independent tasks into a P-group for parallel execution
- Heterogeneous SMT architecture
- SIMD stream with flow-control ability
- Fast data communication on thread level
- Pick tasks from the P-group into a working set with fine-grained data locality
- Banked register files
- Configurable parallel access switch (CPAS)
8 Heterogeneous SMT architecture
- SIMD multithreading: no FU-resource conflicts
- Multithreading DSP: RR dynamic scheduling (e.g., select T0, T2)
[Figure: T-Core, LS engine, DP engine, and Media Cores 0 to 3.]
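The "RR dynamic scheduling" label can be illustrated with a small sketch. This assumes RR means round-robin and that up to two ready threads are issued per cycle, which is only suggested by the "Select T0, T2" label; the function name, the `width` parameter, and the stalled-thread state are hypothetical and not taken from the slides.

```python
def round_robin_select(ready, start, width):
    """Pick up to `width` ready threads, scanning circularly from `start`.

    Returns (selected thread ids, position where the next scan should begin).
    """
    n = len(ready)
    selected = []
    next_start = start
    for offset in range(n):
        tid = (start + offset) % n
        if ready[tid]:
            selected.append(tid)
            next_start = (tid + 1) % n   # resume after the last issued thread
            if len(selected) == width:
                break
    return selected, next_start


# Threads T0..T3; T1 is assumed to be stalled this cycle (hypothetical state).
ready = [True, False, True, True]
picked, pointer = round_robin_select(ready, start=0, width=2)
print(picked, pointer)
```

With T1 stalled, the scan skips it and issues T0 and T2, and the next cycle starts scanning at T3 so that no thread is starved.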
9 Construct fast data communication on thread level
[Figure: T-Core, M-Core0 to M-Core3, and BB_L/S with the configurable parallel access switch (CPAS). How can they be made to cooperate smoothly?]
- Streaming register files
- Total 2KB: eight banks, 32 entries per bank, 64 bits per entry
- CPAS switch
- 7 masters, 8 slaves
- Up to 160B/cycle parallel access
- Fast data communication by reconfiguring the switch
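The register-file figures above can be checked with a few lines, and the idea of communicating by reconfiguring the switch can be sketched as a change of bank ownership rather than a data copy. This is a toy model under stated assumptions: the `AccessSwitch` class, its methods, and the one-owner-per-bank rule are hypothetical illustrations, not the actual CPAS design.

```python
BANKS = 8               # from the slide
ENTRIES_PER_BANK = 32   # from the slide
ENTRY_BYTES = 8         # 64-bit entries


def capacity_bytes():
    """Total streaming register file capacity in bytes."""
    return BANKS * ENTRIES_PER_BANK * ENTRY_BYTES


class AccessSwitch:
    """Toy model (assumption): each bank is owned by at most one master."""

    def __init__(self):
        self.owner = {}   # bank index -> master name

    def configure(self, mapping):
        """Install a new master -> banks mapping, rejecting conflicts."""
        owner = {}
        for master, banks in mapping.items():
            for bank in banks:
                if not 0 <= bank < BANKS:
                    raise ValueError(f"no such bank: {bank}")
                if bank in owner:
                    raise ValueError(f"bank {bank} assigned twice")
                owner[bank] = master
        self.owner = owner

    def can_access(self, master, bank):
        return self.owner.get(bank) == master


switch = AccessSwitch()
switch.configure({"M-Core0": [0, 1], "M-Core1": [2, 3]})
# Hand M-Core0's output banks to M-Core1 by swapping ownership, not copying.
switch.configure({"M-Core0": [2, 3], "M-Core1": [0, 1]})
print(capacity_bytes())
```

The capacity works out to 8 x 32 x 8 = 2048 bytes, matching the 2KB stated on the slide.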
10 Thread-level cooperation by fast data communication models
A working set (Wx) is a group of tasks that can execute in parallel.
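One way to picture working sets is to level a task dependency graph so that every task in a level depends only on earlier levels. The sketch below is an assumption about how such sets could be derived; it ignores the data-locality criterion the slides mention, and the function and task names are hypothetical.

```python
def working_sets(deps):
    """Group tasks into levels W0, W1, ... whose members are mutually independent.

    `deps` maps each task to the set of tasks it must wait for.
    """
    remaining = {task: set(d) for task, d in deps.items()}
    done = set()
    levels = []
    while remaining:
        # A task is ready once everything it waits for has completed.
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("cyclic dependency")
        levels.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return levels


# Hypothetical task graph: one load feeds two searches, then a decision.
deps = {
    "load": set(),
    "search0": {"load"},
    "search1": {"load"},
    "decide": {"search0", "search1"},
}
sets = working_sets(deps)
print(sets)
```

Here the two searches land in the same working set because neither waits on the other.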
11 Mapping into application: Motion estimation
[Figure: motion search between current MBs in the current frame and reference MBs in the reference frame. The streaming register files hold Cur MB and Ref MB1 to Ref MB4.]
- Load the necessary data for prediction.
- Share current MB data.
- Calculate the best mode.
- Speculative, data-independent multithreading programming.
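The sharing pattern above, one current MB compared against several reference MBs, can be sketched as follows. The cost metric is assumed to be the sum of absolute differences (SAD), which the slides do not specify; the function names and the tiny 2x2 blocks are illustrative only.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_candidate(cur_mb, ref_mbs):
    """Compare one shared current block against every reference candidate."""
    costs = [sad(cur_mb, ref) for ref in ref_mbs]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs


# 2x2 blocks stand in for 16x16 macroblocks to keep the example short.
cur = [[10, 12], [14, 16]]
refs = [
    [[10, 12], [14, 20]],
    [[11, 12], [14, 16]],
    [[0, 0], [0, 0]],
    [[10, 12], [14, 16]],
]
best, costs = best_candidate(cur, refs)
print(best, costs)
```

Each cost evaluation reads the same current block and a different reference block, so the four evaluations are independent and could run on separate threads.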
12 Mapping into application: Texture coding
- Macroblocks finish all texture coding processes in the streaming RFs.
- Reduces execution time by 27.6% and load/store operations by 74.4%.
- SW-pipelined multithreading programming.
[Figure: texture coding data flow. Data in: load from memory into streaming register file banks 0 to 7. Stages: block sub (M-core 0), T, Q (M-core 1), IQ, IT (M-core 2), block add (M-core 3). Data out: store to memory.]
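The software-pipelined mapping of the four texture coding stages onto M-core 0 to 3 can be sketched as a schedule table. This assumes every stage takes one equal time step per block, which the slides do not state; the function name and the block count are hypothetical.

```python
STAGES = ["block sub", "T, Q", "IQ, IT", "block add"]   # M-core 0..3


def pipeline_schedule(num_blocks, num_stages=len(STAGES)):
    """For each time step, list the active (stage index, block index) pairs."""
    steps = []
    for t in range(num_blocks + num_stages - 1):
        # Stage s works on the block that entered the pipeline s steps ago.
        active = [
            (s, t - s) for s in range(num_stages) if 0 <= t - s < num_blocks
        ]
        steps.append(active)
    return steps


schedule = pipeline_schedule(8)
for t, active in enumerate(schedule):
    print(t, [(STAGES[s], block) for s, block in active])
```

Under this assumption, eight blocks finish in 8 + 4 - 1 = 11 steps instead of the 32 steps a strictly sequential run would take, and all four cores are busy once the pipeline has filled.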
14 Methodology II: One-stop streaming processing
- Coarse-grained and fine-grained data locality
- Memory hierarchy with suitable size and access time
- Rich bandwidth and a shared address space for coherent SIMD threads
- Pre-transposition
- Hide memory latency
- Reduce data-packing operations
- Separate program control and data to minimize cache size
- 8KB I/D caches achieve 90% for the H.264 program
[Figure: memory hierarchy. Program control uses the I/D cache; coherent-multithreading data uses the streaming RFs, on-chip SRAM, off-chip SRAM, and external memory, with BB_L/S and pre-transposition. Capacity labels: 2KB, 16KB, 16MB, 256MB. Bandwidth labels: 160B/cycle, 8B/cycle, 4B/40 cycles, 4B/cycle.]
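The benefit of pre-transposition can be pictured with a small block. This sketch assumes pre-transposition means delivering a block already transposed while it is being moved, so that a column-wise pass reads contiguous rows and needs no separate packing step; the slides do not give the mechanism, and the function name is hypothetical.

```python
def transpose(block):
    """Return the block with rows and columns swapped."""
    return [list(column) for column in zip(*block)]


block = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
pre_transposed = transpose(block)

# Column 1 of the original block is now a contiguous row.
print(pre_transposed[1])
```

A kernel that processes rows and then columns can run the same row-wise code on both the original and the pre-transposed copy.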
15 Explore data locality on VisoMT processor
[Figure: the search range in the reference frame, zoomed in to show the fine-grained locality of motion search.]
16 Reduce memory bandwidth by one-stop
Background bulk data movement and transposition between streaming buffers
Store data into external memory
17 Programming model
[Figure: three-step flow (Step I, Step II, Step III) from sequential computation through parallelization, decomposition, and optimization, moving from coarse-grained to fine-grained data locality. The resulting task tree, with working sets W0 to W5, is mapped into the architecture.]
- Extract program flow tasks
- Define micro-tasks for physical threads, loop unrolling, etc.
- Optimized by data-independent, speculative, and SW-pipeline techniques, etc.
18 Overhead of communication and synchronization
20 Experimental results: H.264 program
H.264 encoding @ CIF, simulated at 180MHz
[Figure: external memory requirements (MB) plotted against processing time (s). RISC baseline: 1.76fps, 30MB/s.]
21 VisoMT silicon results
- Die photo
- 4.71 x 4.70 mm2 (chip)
- 4.02 x 4.01 mm2 (core)
- 180MHz, 245mW (simulation result)
- TSMC 0.13um, 1P8M
In progress
- FPGA prototype
- ARM Integrator development board
- Xilinx XC2V8000 (FPGA)
- Partial components of VisoMT on FPGA
- Layout
- 4.0 x 3.4 mm2 (chip)
- 180MHz, 125mW (simulation result)
- UMC 90nm, 1P9M
22 Conclusion
- Minimizes integration costs for embedded multithreading/multi-core design by using independent coherent threads
- Reduces memory bandwidth requirements through the one-stop streaming processing mechanism
- Our simulation achieves H.264 encoding of CIF video at 16.6fps with a 15.1MB/s bandwidth requirement at 180MHz
- Facilitates low-power H.264 video encoding for portable multimedia applications
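The reported figures can be put in perspective with some back-of-the-envelope arithmetic. The sketch below uses only numbers from the slides plus the standard CIF resolution (352x288) and 16x16 macroblocks; comparing against the RISC point (1.76fps, 30MB/s) assumes that baseline was measured under the same conditions, which the slides do not confirm. Variable names are illustrative.

```python
CLOCK_HZ = 180e6              # simulated clock frequency
FPS = 16.6                    # reported frame rate
BANDWIDTH_MB_S = 15.1         # reported bandwidth requirement
RISC_FPS = 1.76               # RISC point from the results slide
RISC_BANDWIDTH_MB_S = 30.0

CIF_WIDTH, CIF_HEIGHT = 352, 288
MACROBLOCK = 16               # macroblock edge in pixels

mbs_per_frame = (CIF_WIDTH // MACROBLOCK) * (CIF_HEIGHT // MACROBLOCK)
cycles_per_frame = CLOCK_HZ / FPS
cycles_per_mb = cycles_per_frame / mbs_per_frame
speedup = FPS / RISC_FPS
bandwidth_ratio = BANDWIDTH_MB_S / RISC_BANDWIDTH_MB_S

print(f"macroblocks per frame: {mbs_per_frame}")
print(f"cycle budget per macroblock: {cycles_per_mb:.0f}")
print(f"frame-rate ratio vs RISC point: {speedup:.2f}x")
print(f"bandwidth ratio vs RISC point: {bandwidth_ratio:.2f}")
```

Under these assumptions the encoder has roughly 27 thousand cycles per macroblock, runs about 9.4 times faster than the RISC point, and needs about half of its memory bandwidth.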
23 The End. Thank you!
Keep going