Title: Designing High Performance Data Access Systems

1. Designing High Performance Data Access Systems
- Andrew Hanushevsky
- Bill Weeks
- Stanford Linear Accelerator Center
- Stanford University
- 13-July-05
- http://xrootd.slac.stanford.edu

Fifth International Workshop on Software and Performance (WOSP 2005), July 11-14, 2005, Palma de Mallorca, Illes Balears, Spain
2. Outline
- Motivation, Problem Statement, Environment
- Design consequences
- Goals, Design, Attainment
- Penultimate conclusion
- Going beyond high performance
- Impact
- Conclusion
3. Motivation
- BaBar (B B̄ interactions)
- High Energy Physics (HEP) experiment
- 800 physicists, 87 locations, 9 countries
- Measures interactions of B-meson particles
- Produced by colliding electrons and positrons
- Produces relatively rare events
- Need an extremely large number for statistical significance
- Determine where all the anti-matter went
- Occasionally a new particle, like the Y(4260), pops up!
4. The Linear Accelerator
5. The Problem
- The experiment relies on rare events
- Huge amount of data needed to get a significant number of events
- Intensive data analysis to find the B B̄ events
- Need scalable, high performance data access
- Analyze large amounts of experimental physics data
- 316TB and growing every day
- All file oriented, in root object format
- Objects represent particle collisions, or events
- Over 230,000,000 events so far
- File-based access
- Over 600,000 files
- Average size 650MB
6. The Processing Environment
- Distributed computing
- FZK (De), IN2P3 (Fr), CNAF/INFN (It), RAL (UK), SLAC (US)
- Currently, subsets of the data are replicated across sites
- Data is mostly read-only
- About 20% of I/O is devoted to new file creation
- More data than disk space
- 316TB of data vs 160TB of disk (at SLAC)
- Thousands of expensive compute nodes
- Jobs run from a few hours to several days
7. The Applications
- Complex, embarrassingly parallel analysis
- Determine particle decay products
- 1000s of parallel clients hitting the same data
- Small-block sparse random access
- Median read size < 3K
- Uniform seek across the whole file (mean 650MB)
- Only about 22% of each file is read (mean 140MB)
8. Design Consequences
- Write once, read many times processing mode
- Can capitalize on simplified semantics
- Large scale small-block sparse random access
- Needs very low latency per request
- Large compute investment
- Needs a high degree of fault-tolerance
- More data than disk space
- Must accommodate offline storage (Mass Storage System)
- Highly distributed environment
- Component-based system (replaceable objects)
- Simple setup with few 3rd-party requirements
9. Performance Consequences
- Performance is relative to requirements
- Large scale small-block sparse random access
- A high performance bulk transfer system would be terrible here
- Thousands of parallel clients
- The system must scale with the number of clients
- In this context, latency defines performance
- Need the lowest latency possible
- Serve as many clients as possible
- As always, there are budgetary constraints
- Restricted to commodity parts
- Success can now be measured
10. Brief History
- 1997: Objectivity, Inc. collaboration
- Design and development to scale Objectivity/DB
- First attempt to use a commercial DB for physics data
- Successful but very problematical
- 2001: BaBar decides to use the root framework
- Collaboration between INFN Padova and SLAC
- Design and develop high performance data access
- Work based on what we learned with Objectivity
- 2003: First deployment of the xrootd system at SLAC
- 2005: Collaboration extended
- Root collaboration and the Alice LHC experiment, CERN
- CNAF (Bologna, It), FZK (De), IN2P3 (Fr), INFN (Padova, It), RAL (UK), and SLAC are current production deployment sites
11. Design: The 10,000 Foot View
(architecture diagram; key metric: latency)
12. Eliminating The Obvious
- Client latency is immaterial . . .
- If CPU time / (bytes read) >> external latency
- Then, as the number of parallel clients increases
- Overall system throughput increases
- Without impacting individual client latency
- Assuming a random distribution of requests
- Usually up to the server's performance limit
- This is the case with HEP applications
- The ingest rate is relatively low
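The condition on this slide can be sketched numerically. A minimal sketch with hypothetical numbers (the function and its parameters are illustrative, not measured BaBar figures): when per-request compute time dwarfs access latency, each added client contributes nearly its full rate to aggregate throughput.

```python
# Illustrative sketch (hypothetical numbers): when CPU/(bytes read) >>
# external latency, adding clients raises aggregate throughput without
# hurting any one client's response time, up to server saturation.
def client_throughput(cpu_per_byte_s, bytes_per_req, latency_s):
    """Bytes/s one client sees: compute time plus access latency per request."""
    per_req = cpu_per_byte_s * bytes_per_req + latency_s
    return bytes_per_req / per_req

# Hypothetical HEP-like case: 3KB sparse reads, heavy per-event analysis.
one = client_throughput(cpu_per_byte_s=5e-6, bytes_per_req=3072, latency_s=0.5e-3)

# With compute >> latency, N clients deliver roughly N times one client's
# rate until the server's own limit is reached.
for n in (1, 10, 100):
    print(n, round(n * one, 1))
```

The ratio of throughput with and without the 0.5ms access latency stays above 95% here, which is the sense in which client-side latency is "immaterial".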
13. Network Latency ∝ Cost
Performance measurement using NetPIPE (Ames Lab, http://www.scl.ameslab.gov/netpipe/)
14. Device Latency ∝ Cost
(chart; not reproduced)
15. This Leaves the Server
- The best software bet for impacting overall performance
- We can design a data access system specific to:
- Client access patterns
- A globally distributed processing environment
- Write once, read mostly data
- Thousands of parallel batch clients
- The average run-time of a job
- The result is xrootd
- A low latency, self-clustering data access system
16. xrootd Server Architecture
(layered diagram; peer-to-peer clustering at its heart)
- Protocol thread manager (xrd)
- Protocol layer (xrootd), speaking the xroot protocol, with authentication
- Filesystem logical layer (ofs, odc), with optional authorization
- Filesystem physical layer (oss)
- Filesystem implementation (mss, _fs)
All layers are included in the distribution as shared libraries.
17. Making the Server Perform I
- The protocol is a key component of performance
- Compact, efficient protocol
- Minimal request/response overhead (24/8 bytes)
- Minimal encoding/decoding (network-ordered binary)
- Parallel requests on a single client stream
- High degree of server-side flexibility
- Request/response reordering
- Dynamic transfer size selection
- Rich set of operations
- Allows hints for improved performance
- Pre-read, prepare, and client access processing hints
- Especially important for accessing offline storage
- Integrated peer-to-peer clustering
- Inherent scaling and fault tolerance
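The flavor of such a compact binary protocol can be sketched with fixed-size, network-ordered headers matching the slide's 24/8-byte figures. The field layout and names below are hypothetical, not the actual xroot wire format; the stream id is what lets multiple outstanding requests share one client connection.

```python
import struct

# Sketch of a compact binary protocol: fixed-size headers, network byte
# order ('!'), so there is no textual encoding/decoding on either side.
# Field layout is illustrative, not the real xroot specification.
REQ_HDR = struct.Struct("!HH16sI")   # stream id, request code, args, data length
RSP_HDR = struct.Struct("!HHI")      # stream id, status, data length

def pack_request(streamid, reqcode, args=b"", datalen=0):
    """Build a 24-byte request header; args are padded to 16 bytes."""
    return REQ_HDR.pack(streamid, reqcode, args.ljust(16, b"\0"), datalen)

def unpack_response(buf):
    """Decode an 8-byte response header."""
    return RSP_HDR.unpack_from(buf)

req = pack_request(streamid=7, reqcode=3011, args=b"open")
assert len(req) == 24        # minimal per-request overhead
assert RSP_HDR.size == 8     # minimal per-response overhead
```

Because the stream id travels in every header, the server can reorder responses freely and the client can keep many requests in flight on one TCP stream.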
18. Making the Server Perform II
- Short code paths are critical
- Massively threaded design
- Avoids synchronization bottlenecks
- Adapts well to next generation multi-core chips
- Internal wormhole mechanisms
- Minimize code paths in a multi-layered design
- Without flattening the overall architecture
- Use the most efficient OS-specific system interfaces
- Dynamic and compile-time selection
- Dynamic: aio_read() vs read()
- Compile-time: /dev/poll or kqueue() vs poll() or select()
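The interface-selection idea can be sketched in Python, whose stdlib exposes each readiness mechanism only on platforms that support it, so a feature test at startup plays the role of the slide's compile-time choice. The function name is illustrative.

```python
import select

# Sketch: pick the most efficient readiness-notification interface the
# platform offers, analogous to choosing /dev/poll or kqueue() over
# poll() or select() at build time.
def best_poller():
    if hasattr(select, "epoll"):    # Linux: O(1) event notification
        return "epoll"
    if hasattr(select, "kqueue"):   # BSD / macOS equivalent
        return "kqueue"
    if hasattr(select, "devpoll"):  # Solaris /dev/poll
        return "devpoll"
    if hasattr(select, "poll"):     # POSIX poll(): no FD_SETSIZE limit
        return "poll"
    return "select"                 # last resort: select()

print(best_poller())
```

The ordering encodes the cost model: the first three scale with the number of ready descriptors rather than the number watched, which matters with thousands of parallel clients.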
19. Making the Server Perform III
- Intelligent memory management
- Minimize cross-thread shared objects
- Avoids thrashing the processor cache
- Maximize object re-use
- Less fragmentation of the free space heap
- Avoids a major serialization bottleneck (malloc)
- Load-adaptive I/O buffer management
- Minimize server growth to avoid paging
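The object re-use point can be illustrated with a small buffer pool: recycled buffers bypass the allocator (and the lock inside it), and a cap on the free list bounds server growth. Class and parameter names are illustrative, not xrootd's.

```python
import threading

# Minimal sketch of object re-use: recycle I/O buffers through a free
# list so the (serialized) allocator is hit only on the slow path, and
# cap the list so the server's footprint cannot grow without bound.
class BufferPool:
    def __init__(self, bufsize=64 * 1024, limit=128):
        self._free = []
        self._lock = threading.Lock()
        self._bufsize, self._limit = bufsize, limit

    def acquire(self):
        with self._lock:
            if self._free:
                return self._free.pop()    # fast path: re-use, no allocation
        return bytearray(self._bufsize)    # slow path: allocate a new buffer

    def release(self, buf):
        with self._lock:
            if len(self._free) < self._limit:  # bound growth to avoid paging
                self._free.append(buf)
            # else: drop the buffer and let it be freed

pool = BufferPool()
buf = pool.acquire()
pool.release(buf)
assert pool.acquire() is buf   # the same object is handed back
```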
20. Making the Server Perform IV
- Solve only the problem at hand
- Avoids high-overhead but unused features
- xrootd is only a data access system
- It may look like a file system, but it is not one
- Avoids high-overhead consistency semantics
- Not needed in write once, read many applications
This is common sense that is hard to follow.
21. Performance Goals Achieved?
- Goals
- Very low latency
- Handle many parallel clients
- Test setup
- Sun V20z: 1.86GHz dual Opteron, 2GB RAM
- 1Gb on-board Broadcom NIC (same subnet)
- Solaris 10 x86
- Linux RHEL3 2.4.21-2.7.8.ELsmp
- Client running BetaMiniApp with the analysis removed
22. Latency Per Request (xrootd)
23. Capacity vs Load (xrootd)
24. xrootd Server Scaling
- Linear scaling relative to load
- Allows deterministic sizing of a server:
- Disk
- NIC
- CPU
- Memory
- Performance is tied directly to hardware cost
- How does that compare to competitive boxes?
25. Event Rate Comparison
- NetApp FAS270 1250: dual 650MHz CPU, 1Gb NIC, 1GB cache, RAID 5 FC, 140GB 10k rpm
- Apple Xserve: 1Gb NIC, RAID 5 FC, 180GB 7.2k rpm
- Sun 280r: dual 900MHz UltraSparc 3 CPU, Solaris 8, Seagate ST118167FC
- Cost factor: 1.45
26. Can It Do Better?
- Measurement now becomes a key factor
- Must understand:
- OS effects
- Disk and filesystem effects
- Network fabric effects
- NIC driver effects
- Overhead distribution
27. OS Effects
28. Device and Filesystem Effects
(chart: I/O limited vs CPU limited regions)
- UFS is good on small reads; VXFS is good on big reads
- 1 event ≈ 2K
29. Network Fabric Effects
Cisco Catalyst 6509
(diagram of 100Mb and 1Gb host links across 32Gb, 256Gb, and 720Gb switch fabrics; detail not reproduced)
30. NIC Driver Effects on Latency
31. NIC Driver Effects on Request Rate
(chart; CPU limited)
32. NIC Driver Optimization Impact
33. Overhead Distribution
34. Network Overhead Dominates
35First Conclusion
With sufficient attention to detail, it is
possible to create a Data Access Server with
sufficiently low overhead and scaling
capacity that it no longer becomes a significant
performance factor.
36Beyond High Performance
- xrootd servers can be clustered
- Increase access points and available data
- Allow for automatic failover
- The trick is to do so in a way that
- Cluster overhead (human non-human) scales
linearly - Allows deterministic sizing of cluster
- Cluster size is not artificially limited
- I/O performance is not affected
- Achieves scaling and fault-tolerance
37Basic Cluster Architecture
- Software cross bar switch
- Allows point-to-point connections
- Client and data server
- I/O performance not compromised
- Assuming switch overhead can be amortized
- Scale interconnections by stacking switches
- Virtually unlimited connection points
- Switch overhead must be very low
38Single Level Switch
A
open file X
Redirectors Cache file location
go to C
Who has file X?
2nd open X
B
go to C
I have
open file X
C
Redirector (Head Node)
Client
Data Servers
Cluster
Client sees all servers as xrootd data servers
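The redirector's role in this diagram can be sketched as a tiny location service: broadcast a query on the first open, cache the answer, and redirect every client point-to-point. Server names and the class below are illustrative, not xrootd code.

```python
# Sketch of the single-level switch: the redirector locates a file by
# querying data servers, caches the location, and redirects the client
# so all subsequent I/O is a direct client-server connection.
class Redirector:
    def __init__(self, servers):
        self.servers = servers   # server name -> set of files it holds
        self.cache = {}          # file path -> server name (location cache)

    def locate(self, path):
        if path in self.cache:                    # 2nd open: answered from cache
            return self.cache[path]
        for name, files in self.servers.items():  # "Who has file X?"
            if path in files:                     # "I have"
                self.cache[path] = name
                return name                       # client is told: "go to <name>"
        return None                               # nobody has it

r = Redirector({"A": set(), "B": set(), "C": {"/store/fileX"}})
assert r.locate("/store/fileX") == "C"
assert "/store/fileX" in r.cache     # later opens skip the broadcast entirely
```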
39Two Level Switch
Client
A
Who has file X?
Data Servers
open file X
B
D
go to C
Who has file X?
I have
open file X
I have
C
E
I have
go to F
Supervisor (sub-redirector)
Redirector (Head Node)
F
open file X
Cluster
Client sees all servers as xrootd data servers
40Example SLAC Configuration
kan01
kan02
kan03
kan04
kanxx
kanolb-a
bbr-olb03
bbr-olb04
client machines
Hidden Details
41Making Clusters Efficient
- Cell size, structure, search protocol are
critical - Cell Size is 64
- Limits direct inter-chatter to 64 entities
- Compresses incoming information by up to a factor
of 64 - Can use very efficient 64-bit logical operations
- Hierarchical structures usually most efficient
- Cells arranged in a B-Tree (i.e., B64-Tree)
- Scales 64h (where h is the tree height)
- Client needs h-1 hops to find one of 64h servers
(2 hops for 262,144 servers) - Number of responses is bounded at each level of
the tree - Search is a directed broadcast query/rarely
respond protocol - Provably best scheme if less than 50 of servers
have the wanted file - Generally true if number of files gtgt cluster
capacity - Cluster protocol becomes more efficient as it
grows
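The B64-tree arithmetic above is easy to check directly, and the "64-bit logical operations" point follows from the cell size: a cell's 64 members fit one bit each in a single machine word.

```python
# Sketch of the B64-tree arithmetic: a tree of height h spans 64**h
# servers, and a client is redirected h-1 times to reach one of them.
def servers(h):
    """Capacity of a height-h B64-tree."""
    return 64 ** h

def hops(h):
    """Client redirections needed to reach a data server."""
    return h - 1

assert servers(2) == 4096
assert servers(3) == 262_144 and hops(3) == 2   # the slide's figures

# With 64 children per cell, "which of my children report having file X?"
# is a single 64-bit word, so summarizing a whole cell is one logical op.
have = (1 << 5) | (1 << 41)            # children 5 and 41 said "I have"
assert bin(have).count("1") == 2       # bounded, compressed response
```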
42Cluster Scale Management
- Massive clusters must be self-managing
- Scales 64n where n is height of tree
- Scales very quickly (642 4096, 643 262,144)
- Well beyond direct human management capabilities
- Therefore clusters self-organize
- Uses a minimal spanning tree algorithm
- 280 nodes self-cluster in about 7 seconds
- 890 nodes self-cluster in about 56 seconds
- Most overhead is in wait time to prevent
thrashing
43Redirection Overhead
Server cache search
Linux Solaris
(only xrootd protocol overhead measured)
44Clustering Impact
- Redirection overhead must be amortized
- This is deterministic process for xrootd
- All I/O is via point-to-point connections
- Can trivially use single-server performance data
- Clustering overhead is non-trivial
- Not good for very small files or short open
times - However, compatible with the HEP access patterns
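The amortization argument reduces to one ratio. A sketch with hypothetical numbers (the 10ms redirect cost is illustrative, not a measured figure):

```python
# Illustrative amortization arithmetic: a one-time redirection cost only
# matters relative to how long the resulting connection is used.
def redirect_overhead_fraction(redirect_ms, open_duration_s):
    """Fraction of the file-open lifetime spent on the initial redirect."""
    return (redirect_ms / 1000.0) / open_duration_s

# A hypothetical 10ms redirect is negligible for an hours-long HEP job...
long_job = redirect_overhead_fraction(10, 4 * 3600)
assert long_job < 1e-5

# ...but dominates when a file is open for only a tenth of a second,
# which is why very small files or short open times fare poorly.
tiny_open = redirect_overhead_fraction(10, 0.1)
assert abs(tiny_open - 0.1) < 1e-9
```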
45Other Necessary Items
- Items that peripherally affect performance
- Fault Tolerance
- Proxy Service
- Integrated Security
- Application Server Monitoring
- Mass Storage System Support
- Grid Support
Hidden Details
46Future Direction
High Performance Data Access Servers plus Efficien
t large scale clustering Allows Novel
cost-effective super-fast massive
storage Optimized for sparse random
access Imagine 30TB of DRAM At commodity prices
47. Device Speed Delivery
48Memory Access Characteristics
Server zsuntwo CPU Sparc NIC 100Mb OS
Solaris 10 UFS Sandard
49The Peta-Cache
- Cost-effect memory access impacts science
- Nature of all random access analysis
- Not restricted to just High Energy Physics
- Enables faster and more detailed analysis
- Opens new analytical frontiers
50Conclusion
- High performance data access systems achievable
- The devil is in the details
- Must understand processing domain and deployment
infrastructure - Comprehensive repeatable measurement strategy
- High performance and clustering are synergetic
- Allows unique performance, usability,
scalability, and recoverability characteristics - Such systems produce novel software architectures
- Challenges
- Creating application algorithms that can make use
of such systems - Opportunities
- Fast low cost access to huge amounts of data to
speed discovery
51Acknowledgements
- Fabrizio Furano, INFN Padova
- Client-side design development
- Bill Weeks
- Performance measurement guru
- 100s of measurements repeated 100s of times
- US Department of Energy
- Contract DE-AC02-76SF00515 with Stanford
University