Title: Early Experience with Out-of-Core Applications on the Cray XMT

1. Early Experience with Out-of-Core Applications on the Cray XMT
- Daniel Chavarría-Miranda, Andrés Márquez, Jarek Nieplocha, Kristyn Maschhoff, and Chad Scherrer
- Pacific Northwest National Laboratory (PNNL)
- Cray, Inc.
2. Introduction
- Increasing gap between memory and processor speed
- Causing many applications to become memory-bound
- Mainstream processors utilize a cache hierarchy
- Caches are not effective for highly irregular, data-intensive applications
- Multithreaded architectures provide an alternative
- Switch computation context to hide memory latency
- Cray MTA-2 processors and the newer ThreadStorm processors on the Cray XMT utilize this strategy
3. Cray XMT
- 3rd-generation multithreaded system from Cray
- Infrastructure is based on the XT3/4, scalable up to 8192 processors
- SeaStar network, torus topology, service and I/O nodes
- Compute nodes contain 4 ThreadStorm multithreaded processors instead of 4 AMD Opteron processors
- Hybrid execution capabilities: code can run on ThreadStorm processors in collaboration with code running on Opteron processors
4. Cray XMT (cont.)
- ThreadStorm processors run at 500 MHz
- 128 hardware thread contexts, each with its own set of 32 registers
- No data cache
- 128 KB, 4-way associative data buffer on the memory side
- Extra full/empty bits in each 64-bit memory word for synchronization
- Hashed memory at a 64-byte granularity, i.e., contiguous logical addresses at a 64-byte boundary may be mapped to non-contiguous physical locations
- Global shared memory
5. Cray XMT (cont.)
- Lightweight User Communication (LUC) library to coordinate data transfers and hybrid execution between ThreadStorm and Opteron processors
- Portals-based on the Opterons
- Fast I/O API-based on the ThreadStorms
- RPC-style semantics
- Service and I/O (SIO) nodes provide Lustre, a high-performance parallel file system
- ThreadStorm processors cannot directly access Lustre
- LUC-based execution and transfers combined with Lustre access on the SIO nodes
- An attractive, high-performance alternative for processing very large datasets on the XMT system
6. Outline
- Introduction
- Cray XMT
- PDTree
- Multithreaded implementation
- Static and dynamic versions
- Experimental setup and results
- Conclusions
7. PDTree (or Anomaly Detection for Categorical Data)
- Originates from cyber-security analysis
- Detect anomalies in packet headers
- Locate and characterize network attacks
- The analysis method is more widely applicable
- Uses ideas from conditional probability
- Multivariate categorical data analysis
- For a combination of variables and instances of values for those variables, find out how many times the pattern has occurred
- The resulting count table, or contingency table, specifies a joint distribution
- Efficient implementations of algorithms using such tables are very important in statistical analysis
- The ADTree data structure (Moore & Lee, 1998) can be used to store data counts
- Stores all combinations of values for the variables
8. PDTree (cont.)
- We use an enhancement of the ADTree data structure, called a PDTree, where we don't need to store all possible combinations of values
- Only a priori specified combinations are stored
9. Multithreaded Implementation
- PDTree implemented using a multiple-type, recursive tree structure
- Root node is an array of ValueNodes (counts for different value instances of the root variables)
- Interior and leaf nodes are linked lists of ValueNodes
- Inserting a record at the top level involves just incrementing the counter of the corresponding ValueNode
- The XMT's int_fetch_add() atomic operation is used to increment counters
- Inserting a record at other levels requires traversing a linked list to find the right ValueNode
- If the ValueNode does not exist, it must be appended to the end of the list
- Inserting at other levels when the node does not exist is tricky
- To ensure safety, the end pointer of the list must be locked
- Use the readfe() and writeef() MTA operations to create critical sections
- These take advantage of the full/empty bits on each memory word
- As the data analysis progresses, the probability of conflicts between threads decreases
10. Multithreaded Implementation (cont.)
[Figure: two threads, T1 and T2, both try to grab the locked end pointer of a linked list of ValueNodes (vi = j, vi = k, each with a count). T1 succeeds and appends a new node (vi = m); T2 now holds a lock on a pointer that is no longer the end of the list.]
11. Static and Dynamic Versions
12. Outline
- Introduction
- Cray XMT
- PDTree
- Multithreaded implementation
- Static and dynamic versions
- Experimental setup and results
- Conclusions
13. Experimental Setup and Results
- Large dataset to be analyzed by PDTree
- 4 GB resident on disk (64M records, 9-column guide tree)
- Options:
  - Direct file I/O from ThreadStorm processors via NFS
    - Not very efficient
  - Indirect I/O via a LUC server running on Opteron processors on the SIO nodes
    - Large input file can reside on the high-performance Lustre file system
- Simulates the use of PDTree for online network traffic analysis
- Need to use the dynamic PDTree
- 128K-element hash table
14. Experimental Setup and Results (cont.)
Note: these results were obtained on a preproduction XMT with only half of the DIMM slots populated.
15. Experimental Setup and Results (cont.)

# of procs.   XMT Insertion   XMT Speedup   MTA Insertion   MTA Speedup
     1            239.26          1.00          200.17          1.00
     2            116.36          2.06           98.25          2.04
     4             56.48          4.24           48.07          4.16
     8             27.53          8.69           23.29          8.59
    16             13.97         17.13           11.61         17.24
    32              7.13         33.56            5.81         34.45
    64              3.68         65.02             N/A           N/A
    96              2.60         92.02             N/A           N/A

In-core, 1M-record execution, static PDTree version
16. Experimental Setup and Results (cont.)
[Figure: results chart]

17. Experimental Setup and Results (cont.)
[Figure: results chart]
18. Conclusions
- Results indicate the value of the XMT hybrid architecture and its improved I/O capabilities
- Indirect access to Lustre through the LUC interface
- Need to improve the I/O operation implementation to take full advantage of Lustre
- Multiple LUC transfers in parallel should improve performance
- Scalability of the system is very good for the complex, data-dependent irregular accesses in the PDTree application
- Future work includes comparisons against parallel cache-based systems