
Title: An Embedded Coherent-Multithreading Multimedia Processor and Its Programming Model


1
An Embedded Coherent-Multithreading Multimedia
Processor and Its Programming Model
  • J.C. Chu, W.C. Ku, Shu-Hsuan Chou, T.F. Chen, J.I. Guo

DAC 2007, San Diego, June 7, 2007
SoC Research Center, National Chung Cheng University, Chia-Yi, Taiwan, R.O.C.
2
Motivation
  • Current embedded multimedia systems
  • Multi-mode, high complexity
  • Low-power, configurable for low design cost
  • Design challenges
  • Parallelism
  • Communication/Sync
  • Memory bandwidth
  • HW cost

What is interesting for embedded multimedia?
One of the solutions: coherent multithreading
3
Related work
  • Niagara: A 32-Way Multithreaded SPARC Processor
  • IEEE MICRO 2005
  • 8 cores, 32 threads on a single CPU
  • 4-way multithreaded pipelines per core
  • High-bandwidth crossbar to a shared L2 cache
  • Sharing of resources at all levels leads to an
    area- and power-efficient design
  • The Vector-Thread Architecture
  • IEEE ISCA 2004
  • Vector and Multithreaded Architectures
  • Unified Control-processor and Vector-Thread unit
    (4 lanes)
  • High-bandwidth crossbar to a shared L1 cache
  • Allows intermixing of multiple levels of
    parallelism

4
Objectives
Designing: an embedded coherent-multithreading architecture
Constructing: an efficient embedded multithreading memory system
Building: a multimedia data-parallel programming model
Reducing: communication & synchronization cost
5
Outline
  • Introduction
  • Exploration: parallelism / bandwidth / cost
  • Architecture & programming model
  • Methodology I: Coherent-multithreading
  • Methodology II: One-stop streaming processing
  • Case study: H.264 program
  • Conclusion

6
Overview of VisoMT processor
  • Coherent-Multithreading
  • Unified RISC/multithreading DSP
  • Correlative SIMD streams parallel execution
  • Fast data communication on thread-level
  • One-stop streaming processing
  • Efficient streaming buffer
  • Background bulk data movement / pre-transposition
    (BB_L/S)

[Block diagram: virtual VLIW and SMT with an open, configurable architecture.]
7
Methodology I: Coherent-multithreading
  • Unified RISC/multithreading DSP
  • Separate control (scheduler) and data-intensive
    computation
  • Simple/directive connection
  • Shared address space/L1 cache
  • Correlative SIMD streams parallel execution
  • Collect independent tasks into P-group for
    parallel execution
  • Heterogeneous SMT architecture
  • SIMD stream with flow-control ability
  • Fast data communication on thread-level
  • Pick tasks from P-group into working set with
    fine-grained data locality
  • Banked register files
  • Configurable parallel access switch (CPAS)

8
Heterogeneous SMT architecture
[Figure: heterogeneous SMT architecture. Labels: SIMD multithreading (no FU-resource conflicts); multithreading DSP with RR dynamic scheduling (select T0, T2); T-Core; LS engine; DP engine; Media Core 0 to Media Core 3.]
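
The "RR dynamic scheduling" and "Select T0, T2" labels suggest a round-robin pick among ready threads. The slides do not give the selection logic, so the following is only a minimal sketch under stated assumptions: four hardware threads, two issue slots per cycle, and a per-thread ready flag. The function name `rr_select` and all of its parameters are hypothetical.

```python
def rr_select(ready, start, slots=2):
    """Pick up to `slots` ready threads, scanning round-robin from `start`.

    ready: list of booleans, one per hardware thread (hypothetical model).
    Returns (selected thread ids, position to start scanning next cycle).
    """
    n = len(ready)
    selected = []
    for offset in range(n):
        tid = (start + offset) % n
        if ready[tid]:
            selected.append(tid)
            if len(selected) == slots:
                break
    # Resume after the last selected thread so every thread gets a turn.
    next_start = (selected[-1] + 1) % n if selected else start
    return selected, next_start
```

With threads T0, T2 and T3 ready and the scan starting at T0, this sketch selects T0 and T2, which matches the label on the slide; the next cycle then starts scanning at T3.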
9
Construct fast data communication on thread-level
[Figure labels: T-Core; M-Core0 to M-Core3; configurable parallel access switch (CPAS); BB_L/S.]
How to make them cooperate smoothly?
  • Streaming register files
  • Total 2KB: eight banks, 32 entries per bank, 64 bits per entry
  • CPAS switch
  • 7 masters / 8 slaves
  • Up to 160B/cycle parallel access
  • Fast data communication by reconfiguring the switch
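
The streaming register file figures above are self-consistent if "32 entries" is read as per bank: 8 banks x 32 entries x 64 bits = 2KB. The sketch below checks that arithmetic and adds a simple bank-conflict test. The low-order interleaving in `bank_of` is an assumption made for illustration; the slides do not describe how entries map to banks.

```python
BANKS = 8
ENTRIES_PER_BANK = 32  # assumption: "32 entries" is per bank
ENTRY_BITS = 64


def rf_capacity_bytes(banks=BANKS, entries=ENTRIES_PER_BANK, bits=ENTRY_BITS):
    """Total streaming register file capacity in bytes."""
    return banks * entries * bits // 8


def bank_of(entry_index, banks=BANKS):
    """Hypothetical low-order interleaving: consecutive entries use different banks."""
    return entry_index % banks


def conflict_free(entry_indices, banks=BANKS):
    """True if all accesses in one cycle fall into distinct banks."""
    used = [bank_of(i, banks) for i in entry_indices]
    return len(used) == len(set(used))
```

Under these assumptions `rf_capacity_bytes()` returns 2048, that is, the 2KB quoted on the slide.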

10
Thread-level cooperation by fast data communication models
A working set (Wx) is a set of tasks that can execute in parallel.
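
One plausible way to picture working sets is as dependency levels of a task graph: every task whose inputs are already produced goes into the next set. The slides do not say how VisoMT forms its working sets, so this is only a sketch under that assumption; `working_sets` and the task names in the usage example are hypothetical.

```python
def working_sets(deps):
    """Group tasks into working sets W0, W1, ... by dependency level.

    deps: dict mapping each task to the set of tasks it depends on.
    Returns a list of sorted task lists; tasks in one list can run in parallel.
    """
    remaining = {task: set(d) for task, d in deps.items()}
    done = set()
    sets = []
    while remaining:
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependency")
        sets.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return sets


if __name__ == "__main__":
    example = {
        "load": set(),
        "search0": {"load"},
        "search1": {"load"},
        "best_mode": {"search0", "search1"},
    }
    for i, ws in enumerate(working_sets(example)):
        print(f"W{i}: {ws}")
```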
11
Mapping into application: Motion estimation
[Figure: motion search. Labels: current MBs (current frame); ref MBs (reference frame); streaming RFs.]
  • Load the necessary data for prediction.
  • Share current MB data.
  • Calculate the best mode.
  • Speculative, data-independent multithreading
    programming.

[Figure: streaming register files holding Cur MB and Ref MB1 to Ref MB4.]
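
The pattern above (one shared current MB, several reference MBs, pick the best) can be sketched as follows. The cost metric is assumed to be the sum of absolute differences (SAD), a common choice for motion search that the slides do not name, and the loop over candidates stands in for threads that would each evaluate one reference MB. `sad` and `best_reference` are hypothetical names.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_reference(cur_mb, ref_mbs):
    """Return (index, cost) of the reference block closest to the current block.

    Each cost is independent of the others, so on hardware every candidate
    could be evaluated by a separate thread sharing the same current MB.
    """
    costs = [sad(cur_mb, ref) for ref in ref_mbs]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]
```

Because the candidates only read the shared current MB and never write to each other's data, no synchronization is needed until the final minimum is taken.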
12
Mapping into application: Texture coding
  • Macroblocks finish all texture coding processes
    in the streaming RFs.
  • Reduces execution time by 27.6% and load/store count by 74.4%.
  • SW-pipelined multithreading programming.

[Figure: texture coding in the streaming register files (entries labelled 0 to 7). Data in: load from memory. Stages: Block sub (M-core 0); T, Q (M-core 1); IQ, IT (M-core 2); Block add (M-core 3). Data out: store to memory.]
13
Outline
  • Introduction
  • Exploration: parallelism / bandwidth / cost
  • Architecture & programming model
  • Methodology I: Coherent-multithreading
  • Methodology II: One-stop streaming processing
  • Case study: H.264 program
  • Conclusion

14
Methodology II: One-stop streaming processing
  • Coarse-grained & fine-grained data locality
  • Memory hierarchy with suitable size & access time
  • Rich bandwidth & shared address space for
    coherent SIMD threads
  • Pre-transposition
  • Hide memory latency
  • Reduce data-packing operations
  • Separate program control and data to minimize
    cache size
  • 8KB I/D cache achieves 90% for the H.264 program
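
To illustrate why pre-transposition can reduce data-packing work: if a block is transposed while it is being moved, a later column-wise pass can read contiguous rows instead of gathering strided elements. The slides do not detail what BB_L/S does in hardware, so the following is only a software sketch; `load_transposed`, the flat `memory` list and the `stride` parameter are assumptions.

```python
def load_transposed(memory, base, rows, cols, stride):
    """Copy a rows x cols block out of flat `memory` and return it transposed.

    memory: flat list modelling external memory (hypothetical).
    base:   index of the block's top-left element.
    stride: distance in elements between the starts of consecutive rows.
    """
    block = [
        [memory[base + r * stride + c] for c in range(cols)]
        for r in range(rows)
    ]
    # zip(*block) yields the columns of the block, i.e. the rows of its transpose.
    return [list(column) for column in zip(*block)]
```

After such a load, column k of the original block is available as row k of the buffer, so a vertical filter or transform pass needs no extra shuffle instructions.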

[Figure: memory hierarchy. Labels: I/D cache (program control); coherent-multithreading data; streaming RFs; on-chip SRAM; off-chip SRAM; external memory; BB_L/S with pre-transposition. Sizes shown: 2KB, 16KB, 16MB, 256MB. Bandwidths shown: 160B/cycle, 8B/cycle, 4B/40 cycles, 4B/cycle.]
15
Explore data locality on VisoMT processor
[Figure: zoom in from the search range in the reference frame to the fine-grained locality of the motion search.]
16
Reduce memory bandwidth by one-stop streaming processing
[Figure: background bulk data movement & transposition between streaming buffers; store data into external memory.]
17
Programming model
[Figure: three-step programming model (Step I, Step II, Step III) from sequential computation to mapping into the architecture. Labels: Decomposition; Parallelization; Optimization; task-tree; working sets W0 to W5; coarse-grained data locality; fine-grained data locality.]
  • Extract program flow tasks.
  • Define micro-tasks for physical threads, loop
    unrolling, etc.
  • Optimized by data-independent, speculative,
    SW-pipeline techniques, etc.
18
Overhead of communication & synchronization
19
Outline
  • Introduction
  • Exploration: parallelism / bandwidth / cost
  • Architecture & programming model
  • Methodology I: Coherent-multithreading
  • Methodology II: One-stop streaming processing
  • Case study: H.264 program
  • Conclusion

20
Experimental results: H.264 program
H.264 encoding @ CIF
Simulated at a frequency of 180MHz
[Figure: external memory requirements (MB) and processing time (s). Labels include RISC and (1.76fps, 30MB/s).]
21
VisoMT silicon results
  • Die photo
  • 4.71 x 4.70 mm2 (chip)
  • 4.02 x 4.01 mm2 (core)
  • 180MHz, 245mW (simulation result)
  • TSMC 0.13um, 1P8M

In progress
  • FPGA prototype
  • ARM Integrator development board
  • Xilinx XC2V8000 (FPGA)
  • Partial components of VisoMT on FPGA
  • Layout
  • 4.0 x 3.4 mm2 (chip)
  • 180MHz, 125mW (simulation result)
  • UMC 90nm, 1P9M

22
Conclusion
  • Minimizes integration costs for embedded
    multithreading/multi-core design by independent
    coherent threads
  • Reduces the memory bandwidth requirements by
    one-stop streaming processing mechanism
  • Our simulation experiment achieves H.264 encoding
    of CIF video at 16.6fps with a 15.1MB/s bandwidth
    requirement at 180MHz
  • Facilitates low-power realization of H.264 video
    encoding for portable multimedia applications
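
The headline numbers can be put in perspective with a little arithmetic. The sketch below assumes the standard CIF resolution (352 x 288) with 16 x 16 macroblocks, and it assumes that the "(1.76fps, 30MB/s)" label on the results slide is the RISC baseline; the slides do not state the latter explicitly. All function names are hypothetical.

```python
CIF_WIDTH, CIF_HEIGHT = 352, 288  # standard CIF resolution
MB_SIZE = 16                      # H.264 macroblock edge in pixels


def macroblocks_per_frame(width=CIF_WIDTH, height=CIF_HEIGHT, mb=MB_SIZE):
    """Number of macroblocks in one frame (396 for CIF)."""
    return (width // mb) * (height // mb)


def cycles_per_macroblock(clock_hz, fps):
    """Average cycle budget per macroblock at a given clock and frame rate."""
    return clock_hz / (fps * macroblocks_per_frame())


def ratio(new, base):
    """Plain ratio, used for both speedup and bandwidth comparison."""
    return new / base


if __name__ == "__main__":
    print("MBs per frame:", macroblocks_per_frame())
    print("cycles per MB at 180MHz, 16.6fps:", round(cycles_per_macroblock(180e6, 16.6)))
    # Assumed baseline: the (1.76fps, 30MB/s) point labelled RISC.
    print("frame-rate ratio vs. assumed baseline:", round(ratio(16.6, 1.76), 2))
    print("bandwidth ratio vs. assumed baseline:", round(ratio(15.1, 30.0), 2))
```

Read this way, the reported result is roughly a nine-fold frame-rate gain at about half the memory bandwidth, but that comparison holds only if the baseline assumption is correct.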

23
The End
Thank you!