Title: Early Experience with Out-of-Core Applications on the Cray XMT

1. Early Experience with Out-of-Core Applications on the Cray XMT
- Daniel Chavarría-Miranda, Andrés Márquez, Jarek Nieplocha, Kristyn Maschhoff, and Chad Scherrer
- Pacific Northwest National Laboratory (PNNL)
- Cray, Inc.
2. Introduction
- Increasing gap between memory and processor speed
- Causing many applications to become memory-bound
- Mainstream processors utilize a cache hierarchy
- Caches are not effective for highly irregular, data-intensive applications
- Multithreaded architectures provide an alternative
- Switch computation context to hide memory latency
- Cray MTA-2 processors and the newer ThreadStorm processors on the Cray XMT utilize this strategy
3. Cray XMT
- 3rd-generation multithreaded system from Cray
- Infrastructure is based on the XT3/4, scalable up to 8192 processors
- SeaStar network, torus topology, service and I/O nodes
- Compute nodes contain 4 ThreadStorm multithreaded processors instead of 4 AMD Opteron processors
- Hybrid execution capabilities: code can run on ThreadStorm processors in collaboration with code running on Opteron processors
4. Cray XMT (cont.)
- ThreadStorm processors run at 500 MHz
- 128 hardware thread contexts, each with its own set of 32 registers
- No data cache
- 128 KB, 4-way associative data buffer on the memory side
- Extra full/empty bits in each 64-bit memory word for synchronization
- Hashed memory at a 64-byte granularity, i.e., contiguous logical addresses at a 64-byte boundary may be mapped to non-contiguous physical locations
- Global shared memory
5. Cray XMT (cont.)
- Lightweight User Communication (LUC) library to coordinate data transfers and hybrid execution between ThreadStorm and Opteron processors
- Portals-based on the Opterons
- Fast I/O API-based on the ThreadStorms
- RPC-style semantics
- Service and I/O (SIO) nodes provide Lustre, a high-performance parallel file system
- ThreadStorm processors cannot directly access Lustre
- LUC-based execution and transfers combined with Lustre access on the SIO nodes
- An attractive, high-performance alternative for processing very large datasets on the XMT system
6. Outline
- Introduction
- Cray XMT
- PDTree
- Multithreaded implementation
- Static and dynamic versions
- Experimental setup and results
- Conclusions
7. PDTree (or Anomaly Detection for Categorical Data)
- Originates from cyber-security analysis
- Detect anomalies in packet headers
- Locate and characterize network attacks
- The analysis method is more widely applicable
- Uses ideas from conditional probability
- Multivariate categorical data analysis
- For a combination of variables and instances of values for those variables, find out how many times the pattern has occurred
- The resulting count table, or contingency table, specifies a joint distribution
- Efficient implementations of algorithms using such tables are very important in statistical analysis
- The ADTree data structure (Moore & Lee, 1998) can be used to store data counts
- Stores all combinations of values for the variables
8. PDTree (cont.)
- We use an enhancement of the ADTree data structure, called a PDTree, where we don't need to store all possible combinations of values
- Only a priori specified combinations are stored
9. Multithreaded Implementation
- PDTree implemented using a multiple-type, recursive tree structure
- Root node is an array of ValueNodes (counts for different value instances of the root variables)
- Interior and leaf nodes are linked lists of ValueNodes
- Inserting a record at the top level involves just incrementing the counter of the corresponding ValueNode
- The XMT's int_fetch_add() atomic operation is used to increment counters
- Inserting a record at other levels requires traversing a linked list to find the right ValueNode
- If the ValueNode does not exist, it must be appended to the end of the list
- Inserting at other levels when the node does not exist is tricky
- To ensure safety, the end pointer of the list must be locked
- Use the readfe() and writeef() MTA operations to create critical sections
- These take advantage of the full/empty bits on each memory word
- As the data analysis progresses, the probability of conflicts between threads decreases
10. Multithreaded Implementation (cont.)
[Figure: two threads, T1 and T2, both try to grab the locked end pointer of a linked list of ValueNodes (vi = j, vi = k, each with a count). T1 succeeds and appends a new node (vi = m); T2 now holds a lock on a pointer that is no longer the end of the list.]
11. Static and Dynamic Versions
12. Outline
- Introduction
- Cray XMT
- PDTree
- Multithreaded implementation
- Static and dynamic versions
- Experimental setup and results
- Conclusions
13. Experimental Setup and Results
- Large dataset to be analyzed by PDTree
- 4 GB resident on disk (64M records, 9-column guide tree)
- Options:
  - Direct file I/O from ThreadStorm processors via NFS
    - Not very efficient
  - Indirect I/O via a LUC server running on Opteron processors on the SIO nodes
    - Large input file can reside on the high-performance Lustre file system
- Simulates the use of PDTree for online network traffic analysis
- Need to use the dynamic PDTree
- 128K-element hash table
14. Experimental Setup and Results (cont.)
Note: these results were obtained on a preproduction XMT with only half of the DIMM slots populated.
15. Experimental Setup and Results (cont.)

# of procs.   XMT Insertion   XMT Speedup   MTA Insertion   MTA Speedup
     1            239.26          1.00          200.17          1.00
     2            116.36          2.06           98.25          2.04
     4             56.48          4.24           48.07          4.16
     8             27.53          8.69           23.29          8.59
    16             13.97         17.13           11.61         17.24
    32              7.13         33.56            5.81         34.45
    64              3.68         65.02             N/A           N/A
    96              2.60         92.02             N/A           N/A

In-core, 1M-record execution, static PDTree version
16. Experimental Setup and Results (cont.)
[Figure: results chart]

17. Experimental Setup and Results (cont.)
[Figure: results chart]
18. Conclusions
- Results indicate the value of the XMT hybrid architecture and its improved I/O capabilities
- Indirect access to Lustre through the LUC interface
- Need to improve the I/O operation implementation to take full advantage of Lustre
- Multiple LUC transfers in parallel should improve performance
- Scalability of the system is very good for the complex, data-dependent irregular accesses in the PDTree application
- Future work includes comparisons against parallel cache-based systems