Title: Multi-core SOC for Future Media Processing
1 Multi-core SOC for Future Media Processing
- Qin Xing, Yan Xiaolang
- The Institute of VLSI Design, Zhejiang University
2 Outline
- Opportunities and challenges from media processing
- Multimedia algorithm characteristics and mapping
- Multi-core SOC architecture and technologies
- Benchmarking results
- Project status
- Future work
3 Opportunities
- Video conference
- IP-phone
- Smart terminal
- PDA
- Video camera
- HDTV
- Set-top box
4 Challenges: multiple standards
[Figure: bit rate (Mbit/s, axis 0 to 6) versus year (1994 to 2005), from the 1st-generation MPEG-2 encoder through the 2nd, 3rd, 4th and 5th generation encoders. Standards labeled on the chart: MPEG-2, MPEG-4, H.263, H.26L, H.264 / MPEG-4 part 10, WMV, VP3, AVS.]
5 Challenges: excellent hardware
- Very high computation complexity
- H.264 encoding of 720 x 576 pixels at 30 frames/s needs up to 30 GOPS
- Multiple standards co-exist
- Demands flexibility and programmability
- Low power
- Low cost
Best choice: Application Specific Instruction Processor
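To put the 30 GOPS figure in perspective, the short C sketch below converts it into an operation budget per pixel and per 16 x 16 macroblock. The function names are hypothetical and chosen only for this illustration; the inputs (720 x 576, 30 frames/s, 30 GOPS) are the ones quoted on the slide.

```c
/* Rough operation budget implied by a sustained throughput figure.
 * Function names are illustrative only, not from the presentation. */

/* Operations available per pixel at a given resolution and frame rate. */
double ops_per_pixel(double ops_per_second, int width, int height, int fps)
{
    double pixels_per_second = (double)width * (double)height * (double)fps;
    return ops_per_second / pixels_per_second;
}

/* Operations available per 16x16 macroblock (width and height are
 * assumed to be multiples of 16, as they are for 720 x 576). */
double ops_per_macroblock(double ops_per_second, int width, int height, int fps)
{
    double mbs_per_second = (double)(width / 16) * (double)(height / 16) * (double)fps;
    return ops_per_second / mbs_per_second;
}
```

With the slide's numbers, ops_per_pixel(30e9, 720, 576, 30) is roughly 2,400 operations per pixel, and ops_per_macroblock gives roughly 617,000 operations per macroblock (1,620 macroblocks per frame).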
6 Multimedia algorithm characteristics
- Outer loop and inner loop
- Outer loop
- Interface (GUI)
- OS (Linux)
- Bit-stream parsing (pack/unpack, VLC, CABAC)
- Data transferring
- Inner loop
- Regular algorithms (prediction, FIR, DCT, motion estimation)
7 Multimedia algorithm mapping
- Programmable and heterogeneous processors are the preferred choice for the implementation
- General MCU (RISC core): outer loop
- Enhanced DSP (EDSP, bit-wise operation): outer loop
- Vector processor (VP, VLIW+SIMD): inner loop
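The mapping on slides 6 and 7 can be sketched as a small lookup in C. This is only an illustration: the enum and function names are hypothetical, and the split of outer-loop tasks between the MCU and the EDSP is an assumption (GUI, OS and data transfer on the MCU; bit-stream parsing on the EDSP because of its bit-wise operations). The slides only state that both cores serve the outer loop and that the vector processor serves the inner loop.

```c
/* Hypothetical names; the slides do not define a software interface. */
typedef enum { CORE_MCU, CORE_EDSP, CORE_VP } core_t;

typedef enum {
    TASK_GUI,               /* outer loop */
    TASK_OS,                /* outer loop */
    TASK_BITSTREAM_PARSING, /* outer loop: pack/unpack, VLC, CABAC */
    TASK_DATA_TRANSFER,     /* outer loop */
    TASK_PREDICTION,        /* inner loop */
    TASK_FIR,               /* inner loop */
    TASK_DCT,               /* inner loop */
    TASK_MOTION_ESTIMATION  /* inner loop */
} task_t;

/* Return the core a task would run on under the assumptions above. */
core_t map_task(task_t task)
{
    switch (task) {
    case TASK_BITSTREAM_PARSING:
        return CORE_EDSP;   /* assumption: bit-wise work suits the EDSP */
    case TASK_PREDICTION:
    case TASK_FIR:
    case TASK_DCT:
    case TASK_MOTION_ESTIMATION:
        return CORE_VP;     /* regular inner-loop kernels */
    case TASK_GUI:
    case TASK_OS:
    case TASK_DATA_TRANSFER:
    default:
        return CORE_MCU;    /* assumption: control-oriented outer loop */
    }
}
```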
8 Multi-core SOC architecture
Media processing kernel
9 Inside the media processing kernel
[Figure: block diagram of the kernel. Labeled blocks: GAG1 to GAG4, GDM, V-DM1 to V-DM4, GTM, EDSP control path, vector control path, E-DP, V-DP1 to V-DP4, a 2D crossbar connection network, and DMA and off-chip memories.]
10 Technologies: specified instruction set
Integer IDCT in H.264
C code:
    for (j = 0; j < BLOCK_SIZE; j++) {
        for (i = 0; i < BLOCK_SIZE; i++)
            m5[i] = img->cof[i0][j0][i][j];
        m6[0] = (m5[0] + m5[2]);
        m6[1] = (m5[0] - m5[2]);
        m6[2] = (m5[1] >> 1) - m5[3];
        m6[3] = m5[1] + (m5[3] >> 1);
    }
Intel MMX (excerpt): 13 cycles
    __asm mov edx, mptr
          movdqu xmm1, [edx]
          packssdw xmm1, xmm1      // read m5[0] from memory to xmm1
    __asm movdqu xmm4, [edx+48]
          packssdw xmm4, xmm4      // read m5[3] from memory
    __asm movq xmm5, xmm1
          psubw xmm1, xmm3         // m6[1] = (m5[0] - m5[2])
          paddw xmm3, xmm5         // m6[0] = (m5[0] + m5[2])
          movq xmm5, xmm2
          psraw xmm2, 1
          psubw xmm2, xmm4         // m6[2] = (m5[1] >> 1) - m5[3]
          psraw xmm4, 1
          paddw xmm4, xmm5         // m6[3] = m5[1] + (m5[3] >> 1)
Our IS: 6 cycles
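For reference, a complete scalar version of the 4 x 4 inverse integer transform can be sketched in plain C as below. The m5/m6 butterfly is the one shown on the slide. The recombination step (m6[0] + m6[3] and so on) and the final (x + 32) >> 6 rounding follow the H.264 standard rather than the slide. The function names and the array layout are assumptions made for this sketch and are not the project's instruction set.

```c
#define BLOCK_SIZE 4

/* One 4-point butterfly: the m5 -> m6 step from the slide, followed by
 * the standard H.264 recombination into four outputs. */
static void butterfly4(const int *m5, int *out)
{
    int m6[4];
    m6[0] = m5[0] + m5[2];
    m6[1] = m5[0] - m5[2];
    m6[2] = (m5[1] >> 1) - m5[3];   /* >> on negative ints is assumed arithmetic */
    m6[3] = m5[1] + (m5[3] >> 1);
    out[0] = m6[0] + m6[3];
    out[1] = m6[1] + m6[2];
    out[2] = m6[1] - m6[2];
    out[3] = m6[0] - m6[3];
}

/* Inverse transform of one block of dequantized coefficients.
 * Rows are processed first, then columns, then the result is rounded. */
void idct4x4(int coef[BLOCK_SIZE][BLOCK_SIZE], int res[BLOCK_SIZE][BLOCK_SIZE])
{
    int tmp[BLOCK_SIZE][BLOCK_SIZE];
    int m5[BLOCK_SIZE], out[BLOCK_SIZE];
    int i, j;

    for (j = 0; j < BLOCK_SIZE; j++)            /* horizontal pass */
        butterfly4(coef[j], tmp[j]);

    for (i = 0; i < BLOCK_SIZE; i++) {          /* vertical pass */
        for (j = 0; j < BLOCK_SIZE; j++)
            m5[j] = tmp[j][i];
        butterfly4(m5, out);
        for (j = 0; j < BLOCK_SIZE; j++)
            res[j][i] = (out[j] + 32) >> 6;     /* final rounding */
    }
}
```

A block that contains only a DC coefficient of 64 comes out as a flat block of ones, which is a quick sanity check for the two passes.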
11 Technologies: instruction merging
6-tap sub-pixel interpolation
    result = 0;
    pres_y = dy == 1 ? y_pos : y_pos + 1;
    pres_y = max(0, min(maxold_y, pres_y));            // load
    for (x = -2; x < 4; x++) {                         // control
        pres_x = max(0, min(maxold_x, x_pos + x));     // load
        result += imY[pres_y][pres_x] * COEF[x + 2];   // computation, permutation and load
    }
    result1 = max(0, min(255, (result + 16) / 32));    // computation
Instruction mix before merging: Load/Store 30%, Permutation 25%, Computation 35%, Control 10%
After merging: Ld/St and Perm. merged, Computation, Control
Result: execution time is cut in half
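The interpolation loop above can be written as a self-contained C function, as in the sketch below. The filter taps (1, -5, 20, 20, -5, 1) are the H.264 half-sample luma filter. The function name, the flat row-major image layout and the width/height parameters are assumptions made for this illustration; the vertical position is passed in directly instead of being derived from dy.

```c
#include <stdint.h>

/* H.264 6-tap half-sample luma filter coefficients. */
static const int COEF[6] = { 1, -5, 20, 20, -5, 1 };

/* Clamp v into [lo, hi]; plays the role of max(lo, min(hi, v)). */
static int clampi(int lo, int hi, int v)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Horizontal half-sample value between columns x_pos and x_pos + 1 of
 * row y_pos. imY is a row-major 8-bit luma plane (layout is assumed). */
int interp6_horizontal(const uint8_t *imY, int width, int height,
                       int x_pos, int y_pos)
{
    int maxold_x = width - 1;
    int maxold_y = height - 1;
    int pres_y = clampi(0, maxold_y, y_pos);           /* load address */
    int result = 0;
    int x;

    for (x = -2; x < 4; x++) {                         /* control */
        int pres_x = clampi(0, maxold_x, x_pos + x);   /* load address */
        result += imY[pres_y * width + pres_x] * COEF[x + 2];
    }
    return clampi(0, 255, (result + 16) / 32);         /* round and clip */
}
```

The taps sum to 32, so a flat region is reproduced unchanged after the (result + 16) / 32 rounding, and a sharp 0-to-255 edge interpolates to 128 at the half-sample position.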
12 Benchmarking results for CPU core
13 Simulation results for DSP performance
- Enhanced DSP
- CAVLC (context-adaptive variable-length coding)
Sequence (CIF)    MIPS/frame (max)    MIPS/frame (average)
Foreman           0.147,832           0.029,898
Mobile            0.541,943           0.134,240
- OGG (new audio standard)
Function          MIPS/frame
MDCT              6
De_VQ             2.5
Floor/Coupling    3.5
14 Simulation results for DSP performance
- Vector processor
- H.264 baseline decoder
Sequence (298 frames)    MIPS at 30 frames/s (max)    MIPS at 30 frames/s (average)
QCIF Foreman             28.1                         12.7
QCIF Aikyo               19.8                         5.3
CIF Foreman              116.3                        52.3
CIF Aikyo                92.9                         22.8
15 Project status
- Finished 2 versions of CPU Core
- Released DSP instruction set
- Writing and verifying RTL of the enhanced DSP
- Benchmarking vector processor
- Developing software tools
16 Future work
- Scheduling for task-level parallelism (TLP) between heterogeneous processors
- Simulation/debugging tools for heterogeneous processors
- Methodologies for design space exploration
17 Thank you!