1
Early Experience with Out-of-Core Applications on
the Cray XMT
  • Daniel Chavarría-Miranda, Andrés Márquez, Jarek
    Nieplocha, Kristyn Maschhoff and Chad Scherrer
  • Pacific Northwest National Laboratory (PNNL)
  • Cray, Inc.

2
Introduction
  • Increasing gap between memory and processor speed
  • This gap causes many applications to become
    memory-bound
  • Mainstream processors rely on a cache hierarchy
  • Caches are not effective for highly irregular,
    data-intensive applications
  • Multithreaded architectures provide an
    alternative
  • Switch computation context to hide memory latency
  • Cray MTA-2 processors and newer ThreadStorm
    processors on the Cray XMT utilize this strategy

3
Cray XMT
  • 3rd generation multithreaded system from Cray
  • Infrastructure is based on XT3/4, scalable up to
    8192 processors
  • SeaStar network, torus topology, service and I/O
    nodes
  • Compute nodes contain 4 ThreadStorm multithreaded
    processors instead of 4 AMD Opteron processors
  • Hybrid execution capabilities: code can run on
    ThreadStorm processors in collaboration with code
    running on Opteron processors

4
Cray XMT (cont.)
  • ThreadStorm processors run at 500 MHz
  • 128 hardware thread contexts, each with its own
    set of 32 registers
  • No data cache
  • 128KB, 4-way associative data buffer on the
    memory side
  • Extra full/empty bits on each 64-bit memory word
    for synchronization (see the sketch after this
    list)
  • Memory is hashed at a 64-byte granularity, i.e.
    logical addresses that are contiguous across
    64-byte boundaries may be mapped to non-contiguous
    physical locations
  • Global shared memory
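
A minimal sketch of how the full/empty bits are typically
used, assuming the XMT/MTA compiler generics readfe() and
writeef(): readfe() waits until a word is full, reads it and
leaves it empty; writeef() waits until the word is empty,
writes it and leaves it full. The shared_word name is
illustrative.

    /* Sketch: a full/empty bit turns any 64-bit word into a lock.
     * readfe() and writeef() are XMT/MTA compiler generics; casts between
     * word and pointer types are omitted for clarity. */
    long shared_word = 0;    /* assumed to start in the "full" state */

    void guarded_update(long delta)
    {
        long v = readfe(&shared_word);      /* acquire: wait for full, leave empty */
        writeef(&shared_word, v + delta);   /* release: write value, leave full    */
    }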

5
Cray XMT (cont.)
  • Lightweight User Communication library (LUC) to
    coordinate data transfers and hybrid execution
    between ThreadStorm and Opteron processors
  • Portals-based implementation on the Opteron side
  • Fast I/O API-based implementation on the
    ThreadStorm side
  • RPC-style semantics (a hypothetical usage sketch
    follows this list)
  • Service and I/O (SIO) nodes provide Lustre, a
    high-performance parallel file system
  • ThreadStorm processors cannot directly access
    Lustre
  • LUC-based execution and transfers combined with
    Lustre access on the SIO nodes
  • Attractive and high-performance alternative for
    processing very large datasets on the XMT system
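
The slides describe LUC as an RPC-style mechanism:
ThreadStorm code invokes a service function that runs on an
Opteron SIO node, where Lustre is directly accessible, and
the result is shipped back. The sketch below only
illustrates that call pattern; luc_remote_call(),
read_chunk_service() and READ_CHUNK are hypothetical
placeholders, not the actual LUC API.

    /* Hypothetical illustration of the RPC-style pattern the slides attribute
     * to LUC; every name below is a placeholder, not the real LUC interface. */
    #include <stddef.h>

    enum { READ_CHUNK = 1 };    /* hypothetical service identifier */

    struct chunk_req { size_t offset; size_t len; };

    /* Placeholder prototype for an RPC-style call that ships the request to an
     * Opteron SIO node, runs the registered service there, and copies the
     * reply back into out (out_len is updated with the reply size). */
    int luc_remote_call(int service, const void *in, size_t in_len,
                        void *out, size_t *out_len);

    /* Opteron/SIO side: the service body can read the Lustre-resident file. */
    int read_chunk_service(const struct chunk_req *req, void *out, size_t *out_len);

    /* ThreadStorm side: fetch one chunk of the large input file through LUC. */
    size_t fetch_chunk(size_t offset, size_t len, char *buf)
    {
        struct chunk_req req = { offset, len };
        size_t got = len;
        luc_remote_call(READ_CHUNK, &req, sizeof req, buf, &got);
        return got;
    }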

6
Outline
  • Introduction
  • Cray XMT
  • PDTree
  • Multithreaded implementation
  • Static and dynamic versions
  • Experimental setup and Results
  • Conclusions

7
PDTree (or Anomaly Detection for Categorical
Data)
  • Originates from cyber security analysis
  • Detect anomalies in packet headers
  • Locate and characterize network attacks
  • Analysis method is more widely applicable
  • Uses ideas from conditional probability
  • Multivariate categorical data analysis
  • For a combination of variables and instances of
    values for these variables, find out how many
    times the pattern has occurred
  • The resulting count table (contingency table)
    specifies a joint distribution (illustrated after
    this list)
  • Efficient implementations of algorithms using such
    tables are very important in statistical analysis
  • The ADTree data structure (Moore &amp; Lee 1998) can
    be used to store data counts
  • Stores all combinations of values for variables
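
As a concrete illustration of such a count (contingency)
table: for a chosen pair of categorical variables, the table
keeps one counter per combination of observed values. A
minimal sketch with two hypothetical packet-header fields of
known cardinality:

    #include <stdio.h>

    /* Two hypothetical categorical variables from a packet header:
     * protocol (3 values) and flag (4 values). */
    enum { N_PROTO = 3, N_FLAG = 4, N_RECORDS = 6 };

    int main(void)
    {
        /* Contingency table for the pair (protocol, flag): one counter per
         * combination of values, i.e. an estimate of their joint distribution. */
        long count[N_PROTO][N_FLAG] = {{0}};

        /* Toy records; a real dataset would have millions of rows. */
        int proto[N_RECORDS] = {0, 1, 0, 2, 1, 0};
        int flag [N_RECORDS] = {3, 1, 3, 0, 1, 3};

        for (int r = 0; r < N_RECORDS; r++)
            count[proto[r]][flag[r]]++;    /* how often has this pattern occurred? */

        printf("count[proto=0][flag=3] = %ld\n", count[0][3]);   /* prints 3 */
        return 0;
    }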

8
PDTree (cont.)
  • We use an enhancement of the ADTree data
    structure, called a PDTree, where we don't need to
    store all possible combinations of values
  • Only combinations specified a priori are stored
    (an illustrative specification follows)
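
A sketch of what such an a priori specification might look
like: only the listed combinations of dataset columns get
count tables. The layout and names are illustrative, not the
authors' input format.

    /* Illustrative a priori specification: only these combinations of
     * dataset columns are tracked by the PDTree (names are hypothetical). */
    #define MAX_VARS 3
    static const int guide[][MAX_VARS] = {
        { 0, -1, -1 },    /* counts over variable 0 alone          */
        { 0,  2, -1 },    /* joint counts over variables (0, 2)    */
        { 0,  2,  5 },    /* joint counts over variables (0, 2, 5) */
    };
    static const int n_combinations = sizeof guide / sizeof guide[0];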

9
Multithreaded Implementation
  • PDTree is implemented using a multiple-type,
    recursive tree structure
  • Root node is an array of ValueNodes (counts for
    different value instances of the root variables)
  • Interior and leaf nodes are linked lists of
    ValueNodes
  • Inserting a record at the top level involves just
    incrementing the counter of the corresponding
    ValueNode
  • The XMT's int_fetch_add() atomic operation is used
    to increment counters
  • Inserting a record at other levels requires the
    traversal of a linked list to find the right
    ValueNode
  • If the ValueNode does not exist, it must be
    appended to the end of the list
  • Inserting at other levels when the node does not
    exist is trickier
  • To ensure safety, the end pointer of the list must
    be locked
  • The readfe() and writeef() MTA operations are used
    to create critical sections (sketched after this
    list)
  • Take advantage of full/empty bits on each memory
    word
  • As the data analysis progresses, the probability
    of conflicts between threads decreases
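
A condensed sketch of the insert path described above,
assuming the XMT compiler generics int_fetch_add(), readfe()
and writeef() from the earlier slides (casts between pointer
and word types omitted); the ValueNode layout and the
allocation helper are simplified stand-ins, not the authors'
exact code.

    /* Sketch of the linked-list insert for interior/leaf levels (simplified). */
    typedef struct ValueNode {
        long              value;   /* value instance for this variable       */
        int               count;   /* occurrence count                       */
        struct ValueNode *next;    /* its full/empty bit doubles as the lock */
    } ValueNode;

    ValueNode *new_value_node(long value);   /* allocation helper, not shown */

    /* head: first node of a non-empty list at this tree level */
    void insert_value(ValueNode *head, long value)
    {
        ValueNode *n = head;
        for (;;) {
            if (n->value == value) {
                int_fetch_add(&n->count, 1);   /* counter bump: atomic, no lock */
                return;
            }
            if (n->next != NULL) { n = n->next; continue; }

            /* Possible end of list: lock the end pointer (readfe leaves it empty). */
            ValueNode *observed = readfe(&n->next);
            if (observed != NULL) {
                /* Another thread appended first (the race on the next slide):
                 * release the lock unchanged and keep traversing. */
                writeef(&n->next, observed);
                n = observed;
                continue;
            }
            /* Still the end: append a new node; writeef also releases the lock. */
            writeef(&n->next, new_value_node(value));
            return;
        }
    }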

10
Multithreaded Implementation (cont.)
(Diagram: two threads race to append to the same ValueNode
list. Both T1 and T2 try to grab the locked end pointer of a
list holding nodes vi,j and vi,k; T1 succeeds and inserts a
new node vi,m; T2 then holds a lock on a pointer that is no
longer the end of the list and must resume its traversal
from the new node.)
11
Static and Dynamic Versions
12
Outline
  • Introduction
  • Cray XMT
  • PDTree
  • Multithreaded Implementation
  • Static and dynamic versions
  • Experimental setup and Results
  • Conclusions

13
Experimental setup and Results
  • Large dataset to be analyzed by PDTree
  • 4 GB resident on disk (64M records, 9-column guide
    tree)
  • Options:
  • Direct file I/O from ThreadStorm processors via
    NFS
  • Not very efficient
  • Indirect I/O via LUC server running on Opteron
    processors on the SIO nodes
  • The large input file can reside on the
    high-performance Lustre file system
  • Simulates the use of PDTree for online network
    traffic analysis
  • Requires the dynamic PDTree version
  • 128K-element hash table

14
Experimental setup and Results (cont.)
Note: results obtained on a pre-production XMT with
only half of the DIMM slots populated
15
Experimental setup and Results (cont.)
# of procs.   XMT insertion time   XMT speedup   MTA insertion time   MTA speedup
          1               239.26          1.00               200.17          1.00
          2               116.36          2.06                98.25          2.04
          4                56.48          4.24                48.07          4.16
          8                27.53          8.69                23.29          8.59
         16                13.97         17.13                11.61         17.24
         32                 7.13         33.56                 5.81         34.45
         64                 3.68         65.02                  N/A           N/A
         96                 2.60         92.02                  N/A           N/A

In-core, 1M record execution, static PDTree version
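
The speedup columns are consistent with the usual definition
relative to each machine's own single-processor run:

    speedup(p) = T(1) / T(p)    e.g. XMT, 16 procs: 239.26 / 13.97 ≈ 17.1
                                     MTA, 16 procs: 200.17 / 11.61 ≈ 17.2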
16
Experimental setup and Results (cont.)
17
Experimental setup and Results (cont.)
18
Conclusions
  • Results indicate the value of the XMT hybrid
    architecture and its improved I/O capabilities
  • Indirect access to Lustre through the LUC
    interface
  • The I/O implementation needs further work to take
    full advantage of Lustre
  • Multiple LUC transfers in parallel should improve
    performance
  • Scalability of the system is very good for
    complex, data-dependent irregular accesses in the
    PDTree application
  • Future work includes comparisons against parallel
    cache-based systems
