Title: Memory Support Design for LU Decomposition on the Starbridge Hypercomputer
1. Memory Support Design for LU Decomposition on the
Starbridge Hypercomputer
- Seth Young, Arvind Sudarsanam, Aravind Dasu, and
Thomas Hauser - Utah State University
- Presented by Yi-Gang Tai
2.
- Data transfer hardware that supports an
implementation of a block-based LU algorithm on a
multi-FPGA system
3. Introduction
- LU decomposition splits a matrix into the product
of a lower triangular matrix and an upper
triangular matrix
- Block-based LU
- An FPGA does not have enough local memory
- The matrix is broken down into smaller matrices
- Inter-node communication eliminated
4. Block-based LU Decomposition
- Block partitioning shown above
- Four steps of the block-based LU
- Normal LU factorization of A11
- Create L21 using L11, U11, A21
- Create U12 using L11, U11, A12
- Create A' using L21, U12, A22
- Repeat iteratively with A' as new A
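The four steps above can be sketched in plain Python. This is a minimal illustration, not the authors' FPGA implementation: it assumes no pivoting, stores L and U packed in one matrix, and uses the textbook formulation in which L21 is obtained from U11 and A21, U12 from L11 and A12, and the updated trailing submatrix (A' in the comments) as A22 - L21*U12. The function names `lu_nopivot` and `block_lu` are hypothetical.

```python
def lu_nopivot(a):
    """Doolittle LU of a small square block, no pivoting. Returns (L, U)."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 1.0
        for j in range(i, n):
            U[i][j] = a[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        for j in range(i + 1, n):
            L[j][i] = (a[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U


def block_lu(a, b):
    """Block LU with block size b. Returns a copy of `a` holding L (strictly
    below the diagonal, unit diagonal implied) and U (on and above it)."""
    n = len(a)
    a = [row[:] for row in a]
    for k in range(0, n, b):
        e = min(k + b, n)
        # Step 1: normal LU factorization of the diagonal block A11.
        L11, U11 = lu_nopivot([row[k:e] for row in a[k:e]])
        for i in range(k, e):
            for j in range(k, e):
                a[i][j] = U11[i - k][j - k] if j >= i else L11[i - k][j - k]
        # Step 2: L21 from X * U11 = A21, solved row by row.
        for i in range(e, n):
            for j in range(k, e):
                s = a[i][j] - sum(a[i][m] * a[m][j] for m in range(k, j))
                a[i][j] = s / a[j][j]
        # Step 3: U12 from L11 * Y = A12 (unit diagonal, so no division).
        for j in range(e, n):
            for i in range(k, e):
                a[i][j] -= sum(a[i][m] * a[m][j] for m in range(k, i))
        # Step 4: A' = A22 - L21 * U12; the next iteration treats A' as A.
        for i in range(e, n):
            for j in range(e, n):
                a[i][j] -= sum(a[i][m] * a[m][j] for m in range(k, e))
    return a
```

The same code accepts Python `complex` entries, which correspond to the complex double-precision elements mentioned later in the slides.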
5. Platform Overview
- A hypercomputer by Starbridge Systems
- A PC as the main controller
- Xilinx Virtex-II 6000 FPGA
- 2 GB DRAM for each PE (FPGA + DRAM)
- FPGA hardware design using the Viva toolset by
Starbridge
6. Platform Overview (cont.)
7. Hardware Platform Limitations
- PC to FPGA bandwidth
- 64-bit 66MHz PCI bus
- 128-bit complex double precision floating-point
- DRAM size
- 2GB DRAM of a PE holds 512K 16x16 blocks
- FPGA BRAM size
- Determines the size of blocks
- 324 KB of BRAM fits 79 blocks
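The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes binary units (1 GB = 2^30 bytes) and a theoretical PCI peak of one 64-bit transfer per clock; both are assumptions, and real PCI throughput is lower. All variable names are illustrative.

```python
# One matrix element: 128-bit complex double precision = 16 bytes.
BYTES_PER_ELEMENT = 128 // 8
BLOCK_DIM = 16

# A 16x16 block of 16-byte elements.
block_bytes = BLOCK_DIM * BLOCK_DIM * BYTES_PER_ELEMENT

# DRAM capacity in blocks (2 GB per PE, binary units assumed).
dram_bytes = 2 * 1024**3
blocks_in_dram = dram_bytes // block_bytes

# BRAM footprint of the 79 blocks quoted on the slide.
bram_bytes_for_79_blocks = 79 * block_bytes

# Theoretical peak of a 64-bit, 66 MHz PCI bus (assumes one transfer per clock).
pci_peak_bytes_per_s = (64 // 8) * 66_000_000
seconds_per_block_at_peak = block_bytes / pci_peak_bytes_per_s

print(block_bytes, blocks_in_dram, bram_bytes_for_79_blocks, pci_peak_bytes_per_s)
```

A block comes out at 4096 bytes, so 2 GB holds 512K blocks, matching the slide. The 79 blocks occupy 316 KiB, which fits the quoted BRAM size only if that size is read in kilobytes rather than kilobits.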
8. Memory Transfer H/W Design
- Top level PE block diagram
9. Data Flow Steps Between PC and FPGA
- Four steps of data flow process
- Raw data from PC to PE and stored in DRAM
- Data moved from DRAM to FPGA in blocks
- Processed data transferred back to DRAM (steps 2
and 3 alternate until all data is processed)
- Processed data moved back to PC
- Different LU decomposition steps have same data
flow steps but different data organization
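The four data flow steps can be modeled in software as a simple loop. This is only a behavioral sketch under stated assumptions: DRAM is modeled as a flat Python list, `block_len` is the number of words per block, and `process_block` is a hypothetical stand-in for whatever the FPGA computes in a given LU step. None of these names come from the slides.

```python
def run_data_flow(pc_data, block_len, process_block):
    """Behavioral model of the PC -> DRAM -> FPGA -> DRAM -> PC flow."""
    # Step 1: raw data goes from the PC to the PE and is stored in DRAM.
    dram = list(pc_data)

    # Steps 2 and 3 alternate until every block has been processed.
    for start in range(0, len(dram), block_len):
        # Step 2: one block is moved from DRAM into the FPGA.
        fpga_buffer = dram[start:start + block_len]
        result = process_block(fpga_buffer)
        # Step 3: the processed block is written back to DRAM.
        dram[start:start + len(fpga_buffer)] = result

    # Step 4: processed data is moved back to the PC.
    return list(dram)


# Usage: a placeholder "computation" that doubles every word.
print(run_data_flow([1, 2, 3, 4, 5, 6], 2, lambda blk: [2 * x for x in blk]))
```

In this model, changing the LU step only changes `process_block` and the order in which data is laid out in `pc_data`; the loop itself stays the same, which mirrors the point made on the slide.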
10. Data Ordering for LU Steps 2 and 3
- Figure: memory layout from Addr 0 to Addr N
11. Data Ordering for LU Step 4
12. PC to DRAM Data Transfer H/W
- Governed by a 3-state FSM
13. Block Diagram of State 1
14. Structure of Sequence Detector
15. H/W Interaction of State 3
16. Block Diagram of Data to PC Controller
17. DRAM to FPGA H/W Implementation for LU Steps 2 and 3
- Implemented in Viva
- Memory control module
- Interface with Viva DRAM controller
- 2-state state machine: wait and load
- Block control module
- Controls memory control module
- 2-state state machine: read and write
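The two state machines named above can be illustrated with a small software model. The slides give only the state names, so the transition conditions here are assumptions: the memory control FSM is assumed to leave `WAIT` when a request is pending and the DRAM controller is ready, and to return to `WAIT` after one `LOAD`; the block control FSM is assumed to toggle between `READ` and `WRITE` once a full block of words has been handled. All identifiers are hypothetical and do not correspond to Viva objects.

```python
from enum import Enum


class MemState(Enum):
    WAIT = "wait"
    LOAD = "load"


class BlockState(Enum):
    READ = "read"
    WRITE = "write"


def mem_step(state, request, dram_ready):
    """Next state of the (assumed) memory control FSM."""
    if state is MemState.WAIT:
        # Assumption: start a transfer only when asked and DRAM is ready.
        return MemState.LOAD if (request and dram_ready) else MemState.WAIT
    # Assumption: one transfer per LOAD, then back to WAIT.
    return MemState.WAIT


def block_step(state, words_done, block_words):
    """Next state of the (assumed) block control FSM."""
    if words_done < block_words:
        return state  # block not finished yet, stay in the current phase
    # Assumption: a finished read is followed by a write-back, and vice versa.
    return BlockState.WRITE if state is BlockState.READ else BlockState.READ
```

In this reading, the block control module decides when a block is being read or written, and drives the `request` input of the memory control module for each word.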
18. DRAM to FPGA H/W Implementation for LU Step 4
19. State Diagram of State MC Module
20. Resource Utilization Results
21. Data Transfer Times
22. Block Size vs. Times
- 64 is the most efficient block size (?)