Title: An Embedded Coherent-Multithreading Multimedia Processor and Its Programming Model

1
An Embedded Coherent-Multithreading Multimedia
Processor and Its Programming Model

J.C. Chu, W.C. Ku, Shu-Hsuan Chou,
T.F. Chen, J.I. Guo

DAC 2007, San Diego, June 7, 2007
SoC Research Center, National Chung Cheng University
Chia-Yi, Taiwan, R.O.C.
2
Motivation

Current embedded multimedia system
Multi-mode, high complexity
Low-power, configurable for low design cost

Design challenges
Parallelism
Communication/Sync
Memory bandwidth
HW cost

What is interesting for embedded multimedia?
One of the solutions: coherent-multithreading
3
Related work

Niagara: A 32-Way Multithreaded SPARC Processor
IEEE Micro, 2005
8 cores, 32 threads on a single CPU
4-way multithreaded pipeline per core
High-bandwidth crossbar, shared L2 cache
Sharing of resources at all levels leads to an area- and power-efficient design
The Vector-Thread Architecture
IEEE ISCA 2004
Vector and multithreaded architectures
Unified control processor and vector-thread unit (4 lanes)
High-bandwidth crossbar, shared L1 cache
Allows intermixing of multiple levels of parallelism

4
Objectives
Designing: embedded coherent-multithreading architecture
Constructing: efficient embedded multithreading memory system
Building: multimedia data-parallel programming model
Reducing: communication and synchronization cost
5
Outline

Introduction
Exploration: parallelism / bandwidth / cost
Architecture and programming model
Methodology I: Coherent-multithreading
Methodology II: One-stop streaming processing
Case study: H.264 program
Conclusion

6
Overview of VisoMT processor

Coherent-Multithreading
Unified RISC/ multithreading DSP
Correlative SIMD streams parallel execution
Fast data communication on thread-level
One-stop streaming processing
Efficient streaming buffer
Background bulk data movement / pre-transposition (BB_L/S)

[Figure: block diagram; label: Data]
Virtual VLIW, SMT with open configurable architecture
7
Methodology I: Coherent-multithreading

Unified RISC/multithreading DSP
Separate control (scheduler) and data-intensive computation
Simple/directive connection
Shared address space / L1 cache
Correlative SIMD streams parallel execution
Collect independent tasks into a P-group for parallel execution
Heterogeneous SMT architecture
SIMD stream with flow-control ability
Fast data communication on thread level
Pick tasks from the P-group into the working set with fine-grained data locality
Banked register files
Configurable parallel access switch (CPAS)

8
Heterogeneous SMT architecture
SIMD multithreading
No FU-resource conflicts
Multithreading DSP
RR dynamic scheduling
Select T0, T2
[Figure: T-Core, LS engine, DP engine, Media Core 0 to Media Core 3]
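
The slide only names "RR dynamic scheduling" and shows threads T0 and T2 being selected. As a rough sketch, assuming RR stands for round-robin and that up to two ready threads are issued per cycle (the issue width, the ready flags, and the function name are illustrative assumptions, not details from the slides), the selection could look like this:

```python
def round_robin_select(ready, start, width):
    """Pick up to `width` ready thread IDs, scanning circularly from `start`.

    `ready`, `start`, and `width` are hypothetical inputs for illustration;
    the slides do not describe the real scheduler interface.
    """
    n = len(ready)
    picked = []
    for offset in range(n):
        tid = (start + offset) % n
        if ready[tid]:
            picked.append(tid)
            if len(picked) == width:
                break
    # The next scan starts just after the last thread that was picked.
    next_start = (picked[-1] + 1) % n if picked else start
    return picked, next_start


ready = [True, False, True, True]  # assume T1 is stalled this cycle
picked, next_start = round_robin_select(ready, start=0, width=2)
print(picked, next_start)
```

With T1 stalled, the scan starting at T0 picks T0 and T2, which matches the selection shown on the slide; the following scan would begin at T3.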
9
Construct fast data communication on thread level
How to make them cooperate smoothly?
[Figure: T-Core, M-Core0 to M-Core3, configurable parallel access switch (CPAS), BB_L/S]

Streaming register files
Total 2KB / eight banks / 32 entries / 64b
CPAS switch
7 masters / 8 slaves
Up to 160B/cycle parallel access
Fast data communication by reconfiguring switch
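
The streaming register file figures can be checked with a little arithmetic. The sketch below only multiplies out the numbers on the slide (eight banks, 32 entries, 64 bits per entry); the constant and function names are made up for illustration.

```python
BANKS = 8
ENTRIES_PER_BANK = 32
ENTRY_BITS = 64


def streaming_rf_bytes(banks=BANKS, entries=ENTRIES_PER_BANK, entry_bits=ENTRY_BITS):
    """Total capacity in bytes of a banked register file."""
    return banks * entries * entry_bits // 8


total_bytes = streaming_rf_bytes()
bank_bytes = total_bytes // BANKS

# Size of one 16x16 block of 8-bit samples, for comparison with a bank.
macroblock_bytes = 16 * 16

print(total_bytes, bank_bytes, macroblock_bytes)
```

The total comes to 2048 bytes, which agrees with the 2KB on the slide. Each bank holds 256 bytes, the same size as a 16x16 block of 8-bit samples; whether the design maps one macroblock to one bank is not stated in the slides.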

10
Thread-level cooperation by fast data communication models
A working set (Wx) is a set of tasks that can execute in parallel.
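
One way to picture working sets is as levels of a task dependency graph: every task in a set depends only on tasks from earlier sets, so the tasks inside a set can run in parallel. The sketch below is an assumption about how such sets might be derived, not the authors' tool; the task names and the `working_sets` helper are hypothetical.

```python
def working_sets(deps):
    """Group tasks into working sets W0, W1, ...

    `deps` maps each task to the tasks it depends on. A task joins the
    first set in which all of its prerequisites are already finished.
    """
    remaining = {task: set(pre) for task, pre in deps.items()}
    done = set()
    sets = []
    while remaining:
        ready = sorted(t for t, pre in remaining.items() if pre <= done)
        if not ready:
            raise ValueError("cyclic dependency among tasks")
        sets.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return sets


# Hypothetical task graph, loosely modeled on a motion-estimation step.
deps = {
    "load": [],
    "search_a": ["load"],
    "search_b": ["load"],
    "pick_mode": ["search_a", "search_b"],
    "store": ["pick_mode"],
}
for index, tasks in enumerate(working_sets(deps)):
    print(f"W{index}: {tasks}")
```

Here the two search tasks land in the same working set because neither depends on the other.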
11
Mapping into application: Motion estimation

Load the necessary data for prediction.
Share current MB data.
Calculate the best mode.
Speculative data-independent multithreading programming.

[Figure: motion search between current MBs (current frame) and reference MBs (reference frame); the streaming register files hold Cur MB and Ref MB1 to Ref MB4]
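
The data-independent style of this mapping can be sketched as follows: every candidate only reads the shared current MB and its own reference block, so the candidates could be evaluated by separate threads without synchronization until the final comparison. The sum of absolute differences (SAD) cost and the helper names are assumptions for illustration; the slides do not say which cost function is used.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_candidate(cur_mb, ref_mbs):
    """Return (index of the lowest-cost reference block, all costs).

    Each cost depends only on `cur_mb` and one reference block, so the
    costs are independent of each other and could be computed in parallel.
    """
    costs = [sad(cur_mb, ref) for ref in ref_mbs]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs


# Tiny 2x2 blocks stand in for 16x16 macroblocks to keep the example short.
cur_mb = [[10, 12], [14, 16]]
ref_mbs = [
    [[0, 0], [0, 0]],
    [[10, 12], [14, 15]],
    [[11, 13], [15, 17]],
    [[20, 20], [20, 20]],
]
best, costs = best_candidate(cur_mb, ref_mbs)
print(best, costs)
```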
12
Mapping into application: Texture coding

Macroblocks finish all texture coding processes in the streaming RFs.
Reduces execution time by 27.6% and load/store times by 74.4%.
SW-pipeline multithreading programming.

[Figure: texture coding over the streaming register files (banks 0 to 7): load from memory (data in), block sub (M-core 0), T, Q (M-core 1), IQ, IT (M-core 2), block add (M-core 3), store to memory (data out)]
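
The software-pipeline arrangement can be illustrated with a small schedule generator. It assumes one stage per M-core, in the order shown in the figure, and one block advancing per step; the step granularity and the `pipeline_schedule` helper are illustrative assumptions rather than details from the slides.

```python
# Stage names follow the figure labels; index i is assumed to run on M-core i.
STAGES = ["block sub", "T, Q", "IQ, IT", "block add"]


def pipeline_schedule(num_blocks, stages=STAGES):
    """Return, for each step, a dict {core: (stage name, block index)}.

    At step s, core c works on block s - c, so a new block enters the
    pipeline every step once the previous one has moved on.
    """
    steps = []
    for step in range(num_blocks + len(stages) - 1):
        active = {}
        for core, stage in enumerate(stages):
            block = step - core
            if 0 <= block < num_blocks:
                active[core] = (stage, block)
        steps.append(active)
    return steps


schedule = pipeline_schedule(3)
for step, active in enumerate(schedule):
    print(step, active)
```

In steady state all four cores are busy on different blocks, and intermediate results only need to move between neighboring stages, which is consistent with the slide's point that the blocks stay in the streaming RFs until texture coding is finished.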
13
Outline

Introduction
Exploration: parallelism / bandwidth / cost
Architecture and programming model
Methodology I: Coherent-multithreading
Methodology II: One-stop streaming processing
Case study: H.264 program
Conclusion

14
Methodology II: One-stop streaming processing

Coarse-grained and fine-grained data locality
Memory hierarchy with suitable size and access time
Rich bandwidth and shared address space for coherent SIMD threads
Pre-transposition
Hide memory latency
Reduce data-packing operations
Separate program control and data to minimize cache size
8KB I/D caches reach 90% (presumably hit rate) for the H.264 program

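[Figure: memory hierarchy. Program control goes through the I/D cache; coherent-multithreading data goes through the streaming RFs, with BB_L/S and pre-transposition. Levels shown: streaming RFs, on-chip SRAM, off-chip SRAM, external memory. Capacities shown: 2KB, 16KB, 16MB, 256MB. Bandwidths shown: 160B/cycle, 8B/cycle, 4B/40 cycles, 4B/cycle.]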
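
The effect of pre-transposition can be sketched in a few lines: if a block is transposed while it is being moved into the buffer, a later column-wise pass can read contiguous rows and needs no separate packing step. The slides do not describe how BB_L/S is implemented, so the function below is only a functional model with hypothetical names and parameters.

```python
def bulk_load_transposed(memory, base, rows, cols, stride):
    """Copy a rows x cols block out of row-major `memory`, transposed.

    Row c of the returned buffer holds column c of the source block.
    `base` is the offset of the block's top-left element and `stride`
    is the row pitch of the source; both are illustrative parameters.
    """
    return [
        [memory[base + r * stride + c] for r in range(rows)]
        for c in range(cols)
    ]


frame = list(range(16))  # a 4x4 frame stored row-major, stride 4
buf = bulk_load_transposed(frame, base=0, rows=4, cols=4, stride=4)
sub = bulk_load_transposed(frame, base=5, rows=2, cols=2, stride=4)
print(buf[1], sub)
```

After the move, `buf[1]` is the second column of the frame stored as one contiguous row.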
15
Explore data locality on VisoMT processor
[Figure: zoom-in on the search range in the reference frame during motion search, showing fine-grained locality]
16
Reduce memory bandwidth by one-stop
Background bulk data movement and transposition between streaming buffers
Store data into external memory
17
Programming model
Steps I, II, and III lead from sequential computation to the mapping into the architecture.
Stages shown: parallelization, decomposition, optimization.
Extract program-flow tasks.
Define micro-tasks for physical threads, loop unrolling, etc.
Optimized by data-independent, speculative, SW-pipeline techniques, etc.
Coarse-grained data locality
Fine-grained data locality
[Figure: task tree with working sets W0 to W5]
18
Overhead of communication and synchronization
19
Outline

Introduction
Exploration: parallelism / bandwidth / cost
Architecture and programming model
Methodology I: Coherent-multithreading
Methodology II: One-stop streaming processing
Case study: H.264 program
Conclusion

20
Experimental results: H.264 program
H.264 encoding @ CIF
Simulated at a frequency of 180MHz
[Figure: external memory requirements (MB) versus processing time (s); a RISC data point appears with the label (1.76fps, 30MB/s)]
21
VisoMT silicon results

Die photo
4.71 x 4.70 mm2 (chip)
4.02 x 4.01 mm2 (core)
180MHz, 245mW (simulation result)
TSMC 0.13um, 1P8M

In progress

FPGA prototype
ARM Integrator development board
Xilinx XC2V8000 (FPGA)
Partial components of VisoMT on FPGA

Layout
4.0 x 3.4 mm2 (chip)
180MHz, 125mW (simulation result)
UMC 90nm, 1P9M

22
Conclusion

Minimizes integration costs for embedded multithreading/multi-core design by independent coherent threads
Reduces the memory bandwidth requirements by the one-stop streaming processing mechanism
Our simulation experiment achieves H.264 encoding on CIF video @ 16.6fps with a 15.1MB/sec bandwidth requirement at 180MHz
Facilitates low-power realization of H.264 video encoding for portable multimedia applications
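
To put the reported figures in perspective, the short calculation below derives a per-frame and per-macroblock cycle budget from the 180MHz clock and 16.6fps, using the standard CIF resolution of 352x288. The comparison against (1.76fps, 30MB/s) assumes that label on the results chart belongs to the RISC baseline; the variable names are illustrative.

```python
CLOCK_HZ = 180e6
FPS = 16.6
BANDWIDTH_MB_S = 15.1

CIF_WIDTH, CIF_HEIGHT = 352, 288
MACROBLOCK = 16

# Assumed to be the RISC baseline point from the results chart.
RISC_FPS = 1.76
RISC_BANDWIDTH_MB_S = 30.0

mbs_per_frame = (CIF_WIDTH // MACROBLOCK) * (CIF_HEIGHT // MACROBLOCK)
cycles_per_frame = CLOCK_HZ / FPS
cycles_per_mb = cycles_per_frame / mbs_per_frame
traffic_per_frame_mb = BANDWIDTH_MB_S / FPS

speedup = FPS / RISC_FPS
bandwidth_ratio = RISC_BANDWIDTH_MB_S / BANDWIDTH_MB_S

print(mbs_per_frame)
print(round(cycles_per_frame), round(cycles_per_mb))
print(round(traffic_per_frame_mb, 2), round(speedup, 1), round(bandwidth_ratio, 1))
```

Under these assumptions a CIF frame has 396 macroblocks, each frame gets roughly 10.8 million cycles, and each frame moves a little under 1MB of external memory traffic.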

23
The End
Thank you!
Keep going
