A High-End Reconfigurable Computation Platform for Particle Physics Experiments


1
A High-End Reconfigurable Computation Platform
for Particle Physics Experiments
  • Lic. Thesis Presentation at ICT/ECS, KTH
  • Under the collaboration between KTH & JLU
  • by Ming Liu

Supervisors: Prof. Axel Jantsch (KTH), Dr. Zhonghai Lu (KTH),
Prof. Wolfgang Kuehn (JLU, Germany)
2
Contributions
  • The thesis is mainly based on the following
    contributions
  • Ming Liu, Johannes Lang, Shuo Yang, Tiago Perez,
    Wolfgang Kuehn, Hao Xu, Dapeng Jin, Qiang Wang,
    Lu Li, Zhenan Liu, Zhonghai Lu, and Axel Jantsch,
    ATCA-based Computation Platform for Data
    Acquisition and Triggering in Particle Physics
    Experiments, In Proc. of the International
    Conference on Field Programmable Logic and
    Applications 2008 (FPL08), Sep. 2008. (System
    architecture)
  • Ming Liu, Wolfgang Kuehn, Zhonghai Lu and Axel
    Jantsch, System-on-an-FPGA Design for Real-time
    Particle Track Recognition and Reconstruction in
    Physics Experiments, In Proc. of the 11th
    EUROMICRO Conference on Digital System Design
    (DSD08), Sep. 2008. (Algorithm implementation
    and evaluation)
  • Ming Liu, Wolfgang Kuehn, Zhonghai Lu, Axel
    Jantsch, Shuo Yang, Tiago Perez and Zhenan Liu,
    Hardware/Software Co-design of a General-Purpose
    Computation Platform in Particle Physics, In
    Proc. of the 2007 IEEE International Conference
    on Field Programmable Technology (ICFPT07), Dec.
    2007. (HW/SW co-design)

3
Overview
  • Background in Physics Experiments
  • Computation Platform for DAQ and Triggering
    • Network architecture
    • Compute Node (CN) architecture
  • HW/SW Co-design of the System-on-an-FPGA
    • Partitioning strategy
    • HW design
    • SW design
  • Algorithm Implementation and Evaluation
  • Conclusion and Future Work

4
Background
  • Nuclear & particle physics: a branch of physics
    that studies the constituents and interactions of
    atomic nuclei and particles.
  • Some elementary particles do not occur under
    normal circumstances in nature.
  • Many can be created and detected during energetic
    collisions of others.
  • Beam on target, or beam on beam. Produced
    particles are studied with huge/complex detector
    systems.
  • Examples
    • HADES & PANDA @ GSI, Germany
    • ATLAS, CMS, LHCb, ALICE at the LHC @ CERN,
      Switzerland/France
    • BES III @ IHEP, China
    • WASA @ FZ-Juelich, Germany

5
Detector Systems
  • HADES
  • RICH (Ring Imaging CHerenkov)
  • MDC (Mini Drift Chamber)
  • TOF (Time-Of-Flight)
  • TOFino (small TOF)
  • Shower (Electromagnetic Shower)
  • RPC (Resistive Plate Chamber, to be added to
    replace TOFino)

(Detector photos: HADES, BES III, WASA, PANDA)
6
HADES Detector System
7
Challenge & Motivation
  • Challenge: high reaction rates and high data
    rates (PANDA: 10-20 MHz, data rate up to
    200 GB/s!)
  • It is not possible to store all the data, due to
    storage capacity limitations.
  • Only a small fraction (e.g. 1/10^6) is of
    interest for extensive offline analysis. The
    background can be discarded on the fly.
  • Pattern recognition algorithms are used to
    identify interesting events.
  • Motivation: a reconfigurable and scalable
    computation platform for high-data-rate
    processing.

8
Data Flow
  • Pattern recognition algorithms
  • Data correlation
  • Greatly reduced data rate for storage

9
Related Work
  • Previously commercial bus systems, such as
    VMEbus, FASTbus, CAMAC, etc., were used for DAQ
    and triggering.
  • Time-multiplexing of the system bus degrades the
    data exchange efficiency and cannot meet
    high-performance requirements.
  • Existing reconfigurable computers sound
    appealing, but are not suitable for physics
    experiment applications:
  • Some are augmented computer clusters with FPGAs
    attached to the system bus as accelerators.
    (Bandwidth bottleneck between the microprocessor
    and the accelerator)
  • Some are standalone boards. (Not straightforward
    to scale the system to a large size, due to the
    lack of efficient inter-board connectivity)
  • Flexible and massive communication channels are
    required to interface with detectors and the PC
    farm.
  • An all-board-switched or tree-like topology may
    result in communication penalties between
    algorithm steps. (P2P direct links are
    preferred.)

10
Overview
  • Background in Physics Experiments
  • Computation Platform for DAQ and Triggering
    • Network architecture
    • Compute Node (CN) architecture
  • HW/SW Co-design of the System-on-an-FPGA
    • Partitioning strategy
    • HW design
    • SW design
  • Algorithm Implementation and Evaluation
  • Conclusion and Future Work

11
DAQ and Trigger Systems
  • Detectors detect particles and generate signals
  • Signals digitized by ADCs
  • Data buffered in concentrators/buffers
  • Pattern recognition algorithms extract features
    from events.
  • Interesting events stored in the mass storage.
    Background discarded on the fly.

12
Network Topology
  • Compute Nodes (CN) interconnected for
    parallel/pipelined processing
  • Hierarchical network topology
  • External channels
    • Optical links
    • Gigabit Ethernet
  • Internal interconnections
    • On-board IO connections
    • Inter-board backplane
    • Inter-chassis switching

13
ATCA Backplane
  • Advanced Telecommunications Computing
    Architecture (ATCA)
  • Full-mesh direct Point-to-Point (P2P) backplane
  • High flexibility to correlate results from
    different algorithms
  • High performance compared to shared buses

14
Compute Node
  • Prototype board with 5 Xilinx Virtex-4 FX60 FPGAs
  • 4 FPGAs as algo. processors
  • 1 FPGA as a switch
  • 2GB DDR2 per FPGA, IPMC, Flash, CPLD...
  • Full-mesh on-board communication via GPIOs &
    RocketIOs
  • RocketIO-based backplane channels
  • External channels: optical links & Gigabit
    Ethernet

15
Compute Node PCB
  • 14-layer PCB design
  • Standard ATCA (8U) size of 280 x 322 mm

16
Performance Summary
  • 1 ATCA chassis = 14 CNs (see the scaling note
    after this list)
  • 1890 Gbps on-board connections
  • 1456 Gbps inter-board backplane connections
  • 728 Gbps full-duplex optical bandwidth
  • 70 Gbps Ethernet
  • 140 GB DDR2 SDRAM
  • All computing resources of 70 Virtex-4 FX60 FPGAs
    (140 PowerPC 405 microprocessors + programmable
    resources)
  • Power consumption evaluation: max. 170 W/CN (each
    ATCA slot allows 200 W)
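
As a consistency note, derived only from the figures
above: the chassis totals scale linearly from the 14
CNs per shelf, i.e. 14 x 5 FPGAs = 70 Virtex-4 FX60
devices, 70 x 2 GB = 140 GB of DDR2, 70 x 2 embedded
PowerPC 405 cores = 140 processors, and a worst-case
power of 14 x 170 W = 2380 W per shelf.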

17
Overview
  • Background in Physics Experiments
  • Computation Platform for DAQ and Triggering
    • Network architecture
    • Compute Node (CN) architecture
  • HW/SW Co-design of the System-on-an-FPGA
    • Partitioning strategy
    • HW design
    • SW design
  • Algorithm Implementation and Evaluation
  • Conclusion and Future Work

18
Partitioning Strategy
  • Multiple tasks during experiment operations (data
    processing, control tasks, ...)
  • Partitioned between FPGA HW fabric & embedded
    PowerPC CPUs

19
Partitioning Strategy
  • Concrete strategy:
  • All pattern recognition algorithms customized in
    the FPGA fabric, as HW parallel/pipelined
    processing modules
  • Slow control tasks (e.g. monitoring the system
    status, modifying experimental parameters, ...)
    implemented in SW (applications + OS)
  • Soft TCP/IP stack in Linux OS

20
HW Design
  • Old bus-based arch. (PLB & OPB)
  • CPU & fast peripherals on PLB
  • Slow peripherals on OPB
  • Customized processing modules (e.g. TPU) on PLB
  • Improved MPMC/LocalLink-based arch.
  • Multi-Port Memory Controller (8 ports)
  • Direct access to the memory from the device
  • Customized processing unit interfaced to MPMC
    directly

21
SW Design
  • Open-source embedded Linux on the embedded
    PowerPC CPUs
  • Device drivers
  • Standard devices (Ethernet, RS232, Flash memory,
    etc.)
  • Customized modules
  • Applications for slow controls (a minimal sketch
    follows this list)
  • High level scripts
  • C/C++ programs
  • Apache webserver
  • Java programs on the VM
  • Software cost: low budget!
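
For illustration only, a minimal C sketch of what a
slow-control application on the embedded Linux could
look like: it maps a memory-mapped status register of a
customized processing module and prints it. The base
address, the register offset and the raw /dev/mem
access are hypothetical placeholders; a real setup
would go through the project's own device drivers.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define TPU_BASE   0x80000000UL  /* hypothetical PLB address of the module */
#define STATUS_REG 0x0           /* hypothetical status register offset    */

int main(void)
{
    int fd = open("/dev/mem", O_RDONLY | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    /* map one page of the module's register space read-only */
    volatile uint32_t *regs = mmap(NULL, getpagesize(), PROT_READ,
                                   MAP_SHARED, fd, (off_t)TPU_BASE);
    if (regs == MAP_FAILED) { perror("mmap"); return 1; }

    printf("module status: 0x%08x\n", (unsigned)regs[STATUS_REG / 4]);

    munmap((void *)regs, getpagesize());
    close(fd);
    return 0;
}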

22
Remote Reconfigurability
  • Remote reconfigurability (HW & SW) is desired due
    to spatial constraints in experiments.
  • Both OS and FPGA bitstream are stored in the NOR
    flash memories.
  • With the support of the MTD driver, the bitstream
    and the OS kernel can be overwritten/upgraded from
    within Linux (see the sketch after this list).
  • Reboot the system and the updated design takes
    effect.
  • A backup mechanism guarantees that the system
    stays alive.
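
A minimal sketch of such an in-system upgrade, assuming
a hypothetical MTD partition /dev/mtd1 that holds the
image to be replaced (in practice the flashcp tool from
mtd-utils, or the project's own scripts, performs the
same erase-then-write sequence):

#include <fcntl.h>
#include <mtd/mtd-user.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(void)
{
    /* /dev/mtd1 and the image path are placeholders, not the CN layout */
    int mtd = open("/dev/mtd1", O_RDWR);
    int img = open("/path/to/new_image.bin", O_RDONLY);
    if (mtd < 0 || img < 0) { perror("open"); return 1; }

    mtd_info_t info;                      /* partition size and erase block */
    if (ioctl(mtd, MEMGETINFO, &info) < 0) { perror("MEMGETINFO"); return 1; }

    /* NOR flash must be erased block by block before it can be rewritten */
    erase_info_t e = { .length = info.erasesize };
    for (e.start = 0; e.start < info.size; e.start += info.erasesize)
        if (ioctl(mtd, MEMERASE, &e) < 0) { perror("MEMERASE"); return 1; }

    /* copy the new bitstream or kernel image into the erased partition */
    char buf[4096];
    ssize_t n;
    while ((n = read(img, buf, sizeof buf)) > 0)
        if (write(mtd, buf, n) != n) { perror("write"); return 1; }

    close(img);
    close(mtd);
    return 0;   /* reboot afterwards so the new image takes effect */
}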

23
Overview
  • Background in Physics Experiments
  • Computation Platform for DAQ and Triggering
    • Network architecture
    • Compute Node (CN) architecture
  • HW/SW Co-design of the System-on-an-FPGA
    • Partitioning strategy
    • HW design
    • SW design
  • Algorithm Implementation and Evaluation
  • Conclusion and Future Work

24
Pattern Recognition in HADES
  • New DAQ & Trigger system for the HADES upgrade
    (10 GB/s)
  • Pattern recognition algorithms
  • Cherenkov ring recognition (RICH)
  • MDC particle track reconstruction (MDCs)
  • Time-Of-Flight processing (TOF & RPC)
  • Electromagnetic shower recognition (Shower)
  • Partitioned and distributed on FPGA nodes
  • Algorithms correlated by hierarchical connections

25
Pattern Recognition in HADES
  • Pattern recognition
  • Correlation
  • Event building & storage

26
Particle Track Reconstruction
  • Particle tracks bent in the magnetic field
    between the coils
  • Approximately straight lines before & after the
    coils
  • Inner and outer track segments point to the RICH
    and TOF detectors respectively, helping them to
    find patterns (correlation)
  • Similar principle for inner and outer segments;
    only the inner part is discussed.
  • The particle track reconstruction algorithm for
    HADES was previously implemented in SW, due to
    its complexity.
  • Now implemented and investigated in HW as a case
    study

27
Basic Principle
  • Wires fired by flying particles
  • Project fired wires to a plane
  • Recognize the overlap area and reconstruct tracks
    from the target (a sketch of this
    projection-and-peak-finding flow follows this
    list)
  • 6 sectors
  • 2110 wires per sector (inner)
  • 6 orientations
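
For illustration, a minimal C sketch of the
projection-and-peak-finding idea in software terms. The
LUT contents, the number of projection-plane bins and
the peak threshold are placeholder assumptions, not
values from the thesis; the real TPU performs these
steps as HW sub-modules (wire FIFO, projection LUT
fetch, accumulate unit, peak finder).

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NWIRES 2110   /* inner wires per sector (from the slides) */
#define NBINS  4096   /* projection-plane bins: assumed value     */
#define MAXHIT 64     /* max LUT entries per wire: assumed value  */

/* proj_lut[w] lists the projection-plane bins covered by wire w; in the
 * real design this table lives in external memory and is fetched by the
 * bus-master sub-module. */
static uint16_t proj_lut[NWIRES][MAXHIT];
static uint8_t  proj_len[NWIRES];

/* Accumulate the fired wires on the projection plane, then report bins
 * whose count exceeds a threshold as track candidates. */
static void find_tracks(const uint16_t *fired, int nfired, int threshold)
{
    static uint8_t accu[NBINS];
    memset(accu, 0, sizeof accu);

    for (int i = 0; i < nfired; ++i) {          /* accumulate unit */
        uint16_t w = fired[i];
        for (int k = 0; k < proj_len[w]; ++k)
            accu[proj_lut[w][k]]++;
    }
    for (int b = 0; b < NBINS; ++b)             /* peak finder */
        if (accu[b] >= threshold)
            printf("track candidate at bin %d (count %d)\n", b, accu[b]);
}

int main(void)
{
    /* toy example: three fired wires whose projections overlap in bin 7 */
    for (int w = 0; w < 3; ++w) { proj_lut[w][0] = 7; proj_len[w] = 1; }
    uint16_t fired[] = { 0, 1, 2 };
    find_tracks(fired, 3, 3);
    return 0;
}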

28
Basic Principle
29
Hardware Implementation
  • PLB interface (Slave)
  • MPMC interface (Master)
  • Algorithm processor: Tracking Processing Unit
    (TPU)

30
Modular Design
  • TPU for track reconstruction computation
  • Input: fired wire numbers
  • Output: positions of track candidates on the
    projection plane
  • Sub-modules:
    • Wire No. write FIFO
    • Proj. LUT & Addr. LUT
    • Bus master
    • Accumulate unit
    • Peak finder

31
Implementation Results
  • Resource utilization on the Virtex-4 FX60:
    acceptable!
  • Timing limit: 125 MHz without optimization
    effort
  • Clock frequency fixed at 100 MHz, to match the
    PLB speed

32
Performance Evaluation
  • Experimental setup:
    • MPMC-based structure used for measurements
    • Different measurement points at different wire
      multiplicities (10, 30, 50, 200, 400 fired
      wires out of 2110)
    • Projection LUT: 5.7 Kbits per wire on average
      (1.5 MB / 2110 wires)
    • A C program running on a 2.4 GHz Xeon computer
      as the reference
  • Results:
    • Speedups of 10.8 to 24.3 times per module have
      been observed compared to the software
      solution.
    • Considering the FPGA resource consumption,
      multiple TPU modules may be integrated in the
      system for parallel processing and higher
      performance.

33
Performance Analysis
  • Non-TPU factors introduce overhead and restrict
    the performance:
  • Complex DDR2 addressing mechanism (large latency)
  • Data transfer burst mode of only 8 beats (clock
    cycles wasted)
  • MPMC arbitrating the memory access among multiple
    ports (clock cycles wasted)
  • The TPU module is powerful, but the memory
    accesses (LUT fetches) are slow.
  • Solution: add SRAM memory to enhance the memory
    bandwidth and reduce the access latency
  • Speedups of 20 to 50 times per module compared to
    software are expected

34
Overview
  • Background in Physics Experiments
  • Computation Platform for DAQ and Triggering
    • Network architecture
    • Compute Node (CN) architecture
  • HW/SW Co-design of the System-on-an-FPGA
    • Partitioning strategy
    • HW design
    • SW design
  • Algorithm Implementation and Evaluation
  • Conclusion and Future Work

35
Conclusion
  • An FPGA- and ATCA-based computation platform is
    being constructed for the DAQ and trigger system
    in modern nuclear and particle physics
    experiments.
  • The platform features high performance,
    scalability, reconfigurability, and the potential
    to be used for different application projects
    (physics & non-physics).
  • A co-design approach is proposed to develop
    applications on the platform.
  • HW: system design & customized processing modules
  • SW: Linux OS, device drivers & application
    programs
  • A case study, the particle track reconstruction
    algorithm, has been implemented and evaluated on
    the system. Speedup of one order of magnitude per
    module has been observed when compared to the
    software solution (multiple modules can be
    integrated for parallel processing).

36
Future Work
  • The network communication will be investigated
    with multiple CN PCBs.
  • All pattern recognition algorithms are to be
    implemented and parallelized.
  • Study of more efficient memory addressing
    mechanisms for multiple modules
  • More advanced features, e.g. dynamic partial
    reconfiguration for adaptive computing

37
Thank you!
38
Measurements
  • Computing throughput study of the event selector
    core
  • Study of PLB data transfer performance
  • DMA transfers
  • 148.1 MB/s at a 25% interesting-event rate,
    97.3 MB/s at 100%

39
Measurements
  • Optical link communication test with the
    front-end Trigger and Readout Board (TRBv2)
  • 2 Gbps with 8B/10B encoding
  • 150-hour test with no bit error (see the bound
    below)
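
As a rough bound (assuming the 2 Gbps figure is the raw
line rate): 150 hours at 2 Gbps corresponds to about
2 x 10^9 bit/s x 540,000 s, i.e. roughly 1.1 x 10^15
transferred bits, so the error-free run implies a bit
error rate below about 10^-15.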

40
Measurements
  • P2P Ethernet performance measurements
  • Benchmark Netperf
  • Features enabled to improve performance: S/G DMA,
    checksum offloading, jumbo frames of 8982 bytes,
    interrupt coalescing
  • Bottleneck: the 300 MHz PowerPC CPU for protocol
    stack processing