Title: A Multiprocessor System-on-Chip for Real-Time Biomedical Monitoring and Analysis: Architectural Design
1A Multiprocessor System-on-Chip for Real-Time
Biomedical Monitoring and Analysis
Architectural Design Space Exploration
Iyad Al Khatib, IMIT, ICT, KTH Royal Institute of Technology, Stockholm, Sweden
Davide Bertozzi, ENDIF, University of Ferrara, Ferrara, Italy
Luca Benini, DEIS, University of Bologna, Bologna, Italy
Francesco Poletti, DEIS, University of Bologna, Bologna, Italy
Rustam Nabiev, Biomedical Engineering Dept., Karolinska University Hospital Huddinge, Stockholm, Sweden
Mohamed Bechara, ECE, FEA, American University of Beirut, Beirut, Lebanon
Axel Jantsch, IMIT, ICT, KTH Royal Institute of Technology, Stockholm, Sweden
Hasan Khalifeh, ECE, FEA, American University of Beirut, Beirut, Lebanon
43rd Design Automation Conference (DAC 06)
2Outline
- Motivation
- MPSoC for ECG analysis
- ECG analysis algorithm
- Architectural bottleneck analysis
- Architecture exploration
- Architecture tuning and optimization
- Scalability analysis
- Comparison with state-of-the-art solutions
- Conclusions
3Motivation
[Figure: deaths by cause, United States, 2003, for All Ages, <85, and ≥85 (y-axis: deaths, 0 to 1,000,000). Causes: Heart Disease, Stroke, Other CVD, Cancer, COPD, Alzheimer. Source: Heart Disease and Stroke Statistics, 2006 Update, American Heart Association.]
50% of these deaths could be avoided with a reliable combination of cost-effective monitoring and analysis
World market for biomedical devices for ECG monitoring: > 1B [Novosense05]
4State of the art
Limited processing power and tight power budgets of Holter devices have traditionally limited their functionality to data acquisition
- Remote real-time monitoring through a communication link involves
- Transmission of a huge amount of life-critical data
- A 100% functional always-ON connection
5Real-time ECG analysis
- Real-time in-situ ECG MONITORING & ANALYSIS aims to
- Promptly react to life-threatening heart malfunctions
- Relax requirements on telemedicine links
- Challenges
- Physiological variability of QRS complexes
- Base-line wander
- Muscle noise
- Artifacts due to electrode motion
- Power-line interference
- Preserve patient mobility
- Moving from 3-to-12-lead analysis
- Larger sampling frequencies
- Tight power budgets
Algorithm development
Scalable energy-efficient HW-SW platforms
6Contribution of this work
- 1. Remove HW bottlenecks
- Scalable computational horsepower
- Scalable communication architecture
- 2. Remove SW bottleneck
- Scalable algorithm for RT ECG analysis
- Parallelization strategy
- 4. Create a functionally and timing-accurate virtual platform
- 0.13 um industrial technology-homogeneous power models
- Integrates industrial IP cores, interconnect fabric, I/Os
- 5. Explore the design space
- Demonstrates 12-lead analysis at > 1 kHz
- Performance and power analysis and tuning
7Medical background
ECG is an electrical recording of the heart
activity
- 1-lead ECG signal: P, Q, R, S, T, U peaks
- Each peak and inter-peak distance is related to a different heart activity
- Sampling frequencies: 250 Hz, 1 kHz
- Higher sampling frequencies might enhance analysis accuracy (e.g., resolve two peaks very close to each other)
- A common analysis algorithm is Pan-Tompkins
- QRS detection
- Cascade of 4 filters: band pass, differentiator, squaring operation, and finally a moving window integrator
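The four-stage cascade listed above can be sketched in a few lines. This is a simplified illustration rather than the original Pan-Tompkins filter design: the band-pass stage is approximated by the difference of two moving averages, and the function names and window lengths (`short_ms`, `long_ms`, `integ_ms`) are assumptions chosen for readability.

```python
def moving_average(x, width):
    """Causal moving average over the last `width` samples."""
    out, acc = [], 0.0
    for i, v in enumerate(x):
        acc += v
        if i >= width:
            acc -= x[i - width]
        out.append(acc / min(i + 1, width))
    return out


def qrs_energy(ecg, fs, short_ms=25, long_ms=200, integ_ms=150):
    """Band pass -> differentiator -> squaring -> moving-window integrator."""
    short_w = max(1, fs * short_ms // 1000)
    long_w = max(2, fs * long_ms // 1000)
    # 1. Crude band pass (assumed design): a short average smooths noise,
    #    subtracting a long average removes slow baseline drift.
    smooth = moving_average(ecg, short_w)
    drift = moving_average(ecg, long_w)
    band = [s - d for s, d in zip(smooth, drift)]
    # 2. Differentiator: first difference emphasises steep QRS slopes.
    diff = [0.0] + [band[i] - band[i - 1] for i in range(1, len(band))]
    # 3. Squaring: makes samples non-negative and boosts large slopes.
    squared = [d * d for d in diff]
    # 4. Moving-window integrator: gathers slope energy over the QRS width.
    return moving_average(squared, max(1, fs * integ_ms // 1000))


# Synthetic 4 s record at 250 Hz with one unit impulse per second.
fs = 250
ecg = [0.0] * (4 * fs)
for beat in range(4):
    ecg[beat * fs + 100] = 1.0
energy = qrs_energy(ecg, fs)
```

The output is an energy envelope that rises shortly after each synthetic beat; a real detector would still need a thresholding stage on top of it.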
8Proposed ECG system
Up to 12 chan
Interconnection of up to 9 sensors
- Commercial off-the-shelf sensors
- Ambu Inc. silver/silver chloride Blue Sensor R (www.ambuusa.com)
- A/D conversion up to 10 kHz
- IIR filters to eliminate sensor noise and effect of patient movements
- 64 Mbyte SDRAM off-chip memory
- ECG MPSoC based on STMicroelectronics components
- Computation performed on chunks of 4 sec. of recorded data (4-beat cycles)
9ECG analysis algorithm
- ECG analysis starts from a reference point in the heart cycle
- The R-peak is commonly used
- Accurate detection of the R-peak of the QRS complex is a prerequisite for the reliable functionality of ECG analyzers [Bobbie2004]
- ECG signal variability is high
- R-peak detection might be inaccurate (e.g., R-T peak detection instead of R-R peak detection)
- As a consequence, other QRS parameters will be inaccurate
- Traditional techniques may fail in detecting some serious heart disorders
- R-on-T complex (premature ventricular complexes)
- Risk of ventricular fibrillation
10Novel approach to ECG analysis
- By autocorrelation, derive the period without looking for peaks
- Accurately find peaks in a time window equal to the period
11Autocorrelation analysis
For the heartbeat period, we need at least 4 secs of ECG data in order for the ACF to give accurate results (100% on the MIT-BIH database)
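A minimal sketch of the two steps, assuming a direct time-domain autocorrelation and a search range of 40 to 200 beats per minute. The function names, the mean removal, and the search bounds are illustrative assumptions; the slides do not specify how the ACF is evaluated.

```python
def heartbeat_period(ecg, fs, min_bpm=40, max_bpm=200):
    """Estimate the heartbeat period (in samples) from the ACF maximum."""
    n = len(ecg)
    mean = sum(ecg) / n
    x = [v - mean for v in ecg]              # remove the DC component
    lo = int(fs * 60 / max_bpm)              # shortest plausible period
    hi = min(int(fs * 60 / min_bpm), n - 1)  # longest plausible period
    best_lag, best_val = lo, float("-inf")
    for lag in range(lo, hi + 1):
        # Unnormalised autocorrelation of the record at this lag.
        acf = sum(x[i] * x[i + lag] for i in range(n - lag))
        if acf > best_val:
            best_lag, best_val = lag, acf
    return best_lag


def r_peaks(ecg, period):
    """Pick the largest sample in each consecutive window of one period."""
    peaks = []
    for start in range(0, len(ecg) - period + 1, period):
        window = ecg[start:start + period]
        peaks.append(start + max(range(period), key=window.__getitem__))
    return peaks


# Synthetic 4 s record at 250 Hz, one impulse every 200 samples (75 bpm).
fs = 250
ecg = [0.0] * (4 * fs)
for k in range(5):
    ecg[50 + 200 * k] = 1.0
period = heartbeat_period(ecg, fs)
peaks = r_peaks(ecg, period)
```

The period comes out of the correlation alone, so no peak has to be classified before the search windows are placed.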
12MPSoC architecture
[Figure: MPSoC block diagram. Blocks: PE1, PE2, ..., PEn; interrupt controller; system global interconnect; 8 kB shared memory; 512 kB private memories PRI MEM 1 ... PRI MEM N; hardware semaphores; memory controller; off-chip SDRAM memory.]
- We exploit industrial IP cores (200 MHz system)
- ST220 4-issue VLIW DSPs with 32 kB instruction and data caches
- STBus interconnect from STMicroelectronics
- In-house optimized memory controller with DMA capability
- Whole system modeled with the MPSIM virtual platform
- Cycle accurate and bus-signal accurate
- Up to 200 kcycles/sec (Pentium 4, 3.5 GHz clock)
- 0.13 um technology-homogeneous industrial power models
13The memory bottleneck
[Figure: off-chip memory interface unit. Labels: core, interconnect, memory controller, SDRAM, controller, transfer engine, RAM; arrows marked programming and data transfer.]
- Push memory channel
- Control Block keeps a table of objects to be moved
- Table entries can be programmed by different cores
- Transfer engine moves data
- Triggers bus SDRAM transactions
- Memory Controller handles SDRAM accesses
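The table-driven scheme above can be pictured with a toy software model. Everything here is hypothetical: the class and field names (`TransferEntry`, `PushMemoryChannel`, `program`, `run`) are invented for illustration, and byte arrays stand in for the SDRAM and the private memories; it is not the actual controller interface.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class TransferEntry:
    core_id: int   # core that programmed the entry
    src: int       # byte offset in off-chip SDRAM
    dst: int       # byte offset in that core's private memory
    length: int    # number of bytes to move


class PushMemoryChannel:
    """Toy model: a control-block table drained by a transfer engine."""

    def __init__(self, sdram, private_mems):
        self.sdram = sdram                # bytearray standing in for SDRAM
        self.private_mems = private_mems  # core_id -> bytearray
        self.table = deque()              # objects waiting to be moved

    def program(self, core_id, src, dst, length):
        """Called by any core to add an object to the table."""
        self.table.append(TransferEntry(core_id, src, dst, length))

    def run(self):
        """Transfer engine: move every pending object, oldest first."""
        moved = 0
        while self.table:
            e = self.table.popleft()
            chunk = self.sdram[e.src:e.src + e.length]
            self.private_mems[e.core_id][e.dst:e.dst + e.length] = chunk
            moved += len(chunk)
        return moved


sdram = bytearray(range(16))
mems = {0: bytearray(8), 1: bytearray(8)}
channel = PushMemoryChannel(sdram, mems)
channel.program(0, src=0, dst=0, length=4)   # programmed by core 0
channel.program(1, src=4, dst=2, length=4)   # programmed by core 1
moved = channel.run()
```

The point of the structure is that cores only write table entries; the data movement itself happens without involving them.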
14STBus interconnect
- Advanced features with respect to widely used
AMBA AHB
[Figure: AMBA AHB and STBus signalling, with forward and backward channels.]
STBus:
- Split request and response channels
- Wait states can be masked, depending on the depth of slave FIFO buffers
- Multiple outstanding transactions
- Out-of-order completion
- Low-latency arbitration
AMBA AHB:
- Straightforward shared bus topology
- 2 data links, but only 1 active at a time
- In-order completion
- Transaction pipelining
15Flexible bus topology
- STBus can be instantiated either as a shared bus or as a partial or full crossbar
[Figure: full crossbar and partial crossbar topologies.]
16Crossbar-based interconnect
[Figure: full and partial crossbar instances. Labels: DSP 1 ... DSP N, PRI MEM 1 ... PRI MEM N, SHM MEM, SEM, IRUPT, MEM CTRL.]
Each private memory on a crossbar branch, accessible by its DSP or by the MemCtrl master port
MemCtrl slave port for DMA programming
Partial grouping of initiators and targets may result in marginal performance penalties while reducing interconnect area (partial vs. full crossbars)
17Data management - I
- Each DSP programs the DMA engine to periodically transfer input data chunks (4 secs of ECG signal) to its private on-chip memory
- With 1 kHz sampling frequency and 12 processors, required bandwidth is 6 Mbyte/sec (DMA programming plus actual data transfers)
- Negligible with respect to STBus bandwidth (with 1 wait state memory, it exceeds 400 Mbyte/sec)
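The orders of magnitude above can be checked with a short calculation. This sketch assumes 2 bytes per sample, based on the 16-bit data format mentioned elsewhere in the slides; the helper names are hypothetical, and the 6 and 400 Mbyte/sec figures are taken from the bullets as given.

```python
def chunk_bytes(fs_hz, window_s=4, bytes_per_sample=2):
    """Size of one lead's input chunk (assumed 16-bit samples, 4 s window)."""
    return fs_hz * window_s * bytes_per_sample


def bus_utilisation(required_mb_s, available_mb_s):
    """Fraction of the bus bandwidth consumed by a traffic stream."""
    return required_mb_s / available_mb_s


one_lead = chunk_bytes(1000)        # bytes per DSP per 4 s chunk at 1 kHz
all_leads = 12 * one_lead           # bytes per 4 s window for 12 leads
fits = one_lead <= 512 * 1024       # chunk versus a 512 kB private memory
share = bus_utilisation(6, 400)     # quoted requirement over quoted capacity
```

Under these assumptions a chunk is 8000 bytes per lead, far below the private memory size, and the quoted requirement is 1.5% of the quoted STBus bandwidth.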
18Data management - II
Cache line refills
- Independent computation of each DSP in its private memory
- High communication bandwidth requirement on the interconnect
- More leads can be processed by the same DSP
- The RTEMS OS supports multiple tasks
19Data management - III
64 bytes output data to shared memory
- Negligible bus bandwidth
- When the shared memory gets filled beyond a certain level, stored output data can be swapped to the off-chip SDRAM
- 8 hours of history can be recorded
- Data can also be transmitted via a telemedicine link
20PE efficiency
- We compared the performance of ST220 VLIW DSPs with that of ARM7TDMI cores
- ST220: 2.5 times more energy-efficient
- ST220: 9 times faster
- Test conditions: same cache size (32 kB), processing of 1 ECG lead on 1 core, 250 Hz sampling frequency
- High-quality VLIW code generation
- ARM7 (no Thumb) executable is 1.7 times larger
- Static IPC for the 4-issue ST220 VLIW DSP: 2.9
21Architectural tuning
- Let us configure the system to satisfy application requirements at the minimum hardware cost
Processing of 4 secs of input data (250 Hz sampling frequency), 12-lead ECG
- Execution time scales linearly
- Communication architecture (shared STBus) is well tuned
- Peak memory controller bandwidth satisfies the performance requirement
22Architectural tuning
Processing of 4 secs of input data (1 kHz
sampling frequency). 12-lead ECG
- Load increases quadratically with sampling frequency
- About 3 secs for 1 DSP to process 12 leads
- Employing more processors is more effective here
- Smoother energy degradation
- Larger margin for heart disorders diagnosis
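One plausible reading of the quadratic growth, assuming the cost is dominated by a direct time-domain autocorrelation over all lags of the 4 s window (an assumption; the slides do not state how the ACF is computed). The helper name `acf_mac_count` is hypothetical.

```python
def acf_mac_count(fs_hz, window_s=4):
    """Multiply-accumulates for a direct ACF over all lags 0..N-1."""
    n = fs_hz * window_s
    # Lag k needs n - k products, so the total is n + (n - 1) + ... + 1.
    return n * (n + 1) // 2


base = acf_mac_count(250)     # per lead at 250 Hz
fast = acf_mac_count(1000)    # per lead at 1 kHz
ratio = fast / base           # close to 16 for a 4x higher sampling rate
```

Under this model, quadrupling the sampling frequency multiplies the arithmetic work by roughly sixteen, which is consistent with a single DSP no longer being sufficient at 1 kHz.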
23Looking forward
- What is the maximum achievable sampling frequency while meeting real-time requirements?
12 processors running, 12-lead ECG, 3.5 sec real-time requirement
[Figure: sampling frequency (Hz) against the 3.5 s real-time limit; marked values: 2200 and 4000.]
- Typical state-of-the-art frequency range is 250 Hz-1 kHz
- 2.2 kHz achievable with a shared bus
- About 4 kHz with an optimized partial crossbar
24Interconnect optimization
- System interconnect saturation limits performance scalability
- 100% busy at 2.2 kHz
Now the architecture is computation-limited
25Comparison with research/commercial ECG SoCs
Let us compare two of our MPSoC platform
instances with similar designs in research and
on the market
Solution | Data bits | Memory | Freq. (Hz) | Real-time analysis window | Leads per SoC | Pre-filter | Application results
Partial crossbar | 16 | 512 kB pri. mems., 32 kB I- and 32 kB D-cache | 4000 | < 3.5 s | 12 | IIR | Heart period; P, Q, R, S, T, U peaks; potential disease detection
Shared bus | 16 | Same as above | 2200 | < 4 s | 12 | IIR | Same as above
[1] | 10 | 8 kB cache | 250 | No info | 1 | Notch | Only QRS; only decides if healthy or unhealthy
[2] | 12 | No info | 800 | No info | 8 | IIR | Only QRS; only decides if healthy or unhealthy
[1] Chang, M. et al., Design of a System-on-Chip for ECG Signal Processing, The 2004 IEEE Asia-Pacific Conference on Circuits and Systems, December 2004.
[2] Freescale Semiconductor, Personal Electrocardiogram (ECG) Monitor, http://www.freescale.com/
26Conclusions
- Real-time nomadic ECG analysis challenges
- 12-lead, multi-kHz frequency
- Algorithmic robustness
- Software parallelization
- Hardware bottlenecks (computation and communication arch.)
- Real-time diagnosis
- Autocorrelation-based algorithm is a promising alternative to traditional techniques
- MPSoC required to handle increased computational requirements
- HW-SW platform exploration
- VLIW DSP more energy efficient than RISC core
- Bus-based interconnect limits rate to 2 kHz
- VLIW core becomes the bottleneck at 4 kHz
- Future work: explore DVFS power management
27Filtering stage
- Filters out DC offsets and signal interferences
- Hardware-implemented order-3 IIR filter
- Output results in 16-bit binary format
- Facilitates peak resolution and makes heartbeat period computation more precise
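A software sketch of what such a pre-filter might look like, assuming the order-3 response is built from three cascaded first-order DC-blocking sections. The structure, the coefficient `a`, and the function names are assumptions for illustration only; the slides describe a hardware filter and give no coefficients.

```python
def dc_blocker(x, a=0.995):
    """First-order IIR high-pass: y[n] = x[n] - x[n-1] + a * y[n-1]."""
    y, prev_x, prev_y = [], 0.0, 0.0
    for v in x:
        prev_y = v - prev_x + a * prev_y
        prev_x = v
        y.append(prev_y)
    return y


def prefilter(x, a=0.995):
    """Order-3 IIR (three cascaded first-order sections), 16-bit output."""
    for _ in range(3):
        x = dc_blocker(x, a)
    # Round and saturate to the signed 16-bit range.
    return [max(-32768, min(32767, round(v))) for v in x]


# A pure DC offset of 1000 decays to zero at the filter output.
filtered = prefilter([1000.0] * 5000)
```

The constant input passes through on the first sample and is then driven to zero, which is the DC-offset removal the slide refers to.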
28The bus bottleneck
- Bus bandwidth saturation limits scalability of state-of-the-art SoCs
- Trends
- Evolution of communication protocols (AMBA AHB, STBus, CoreConnect, AMBA AXI)
- Evolution of bus topology (shared bus, partial/full crossbar, multi-layer architecture)