Title: Designing High Performance Data Access Systems

1. Designing High Performance Data Access Systems
- Andrew Hanushevsky
- Bill Weeks
- Stanford Linear Accelerator Center
- Stanford University
- 13-July-05
- http://xrootd.slac.stanford.edu

Fifth International Workshop on Software and Performance (WOSP 2005), July 11-14, 2005, Palma de Mallorca, Illes Balears, Spain
2. Outline
- Motivation, Problem Statement, Environment
- Design consequences
- Goals, Design, Attainment
- Penultimate conclusion
- Going beyond high performance
- Impact
- Conclusion
3. Motivation
- BaBar (B B̄ interactions)
- High Energy Physics (HEP) experiment
- 800 physicists, 87 locations, 9 countries
- Measures interactions of B-meson particles
- Produced by colliding electrons and positrons
- Produces relatively rare events
- Need an extremely large number for statistical significance
- Determine where all the anti-matter went
- Occasionally a new particle, like the Y(4260), pops up!
4. The Linear Accelerator
5. The Problem
- The experiment relies on rare events
- Huge amount of data needed to get a significant number of events
- Intensive data analysis to find the B B̄ events
- Need scalable, high performance data access
- Analyze large amounts of experimental physics data
- 316TB and growing every day
- All file oriented, in root object format
- Objects represent particle collisions, or events
- Over 230,000,000 events so far
- File-based access
- Over 600,000 files
- Average size 650MB
6. The Processing Environment
- Distributed computing
- FZK (De), IN2P3 (Fr), CNAF/INFN (It), RAL (UK), SLAC (US)
- Currently, subsets of the data are replicated across sites
- Data is mostly read-only
- About 20% of I/O is devoted to new file creation
- More data than disk space
- 316TB of data vs 160TB of disk (at SLAC)
- Thousands of expensive compute nodes
- Jobs run from a few hours to several days
7. The Applications
- Complex, embarrassingly parallel analysis
- Determine particle decay products
- 1000s of parallel clients hitting the same data
- Small-block sparse random access
- Median read size < 3K
- Uniform seek across the whole file (mean 650MB)
- Only about 22% of each file is read (mean 140MB)
8. Design Consequences
- Write once, read many times processing mode
- Can capitalize on simplified semantics
- Large scale small-block sparse random access
- Needs very low latency per request
- Large compute investment
- Needs a high degree of fault-tolerance
- More data than disk space
- Must accommodate offline storage (Mass Storage System)
- Highly distributed environment
- Component-based system (replaceable objects)
- Simple setup with few 3rd-party requirements
9. Performance Consequences
- Performance is relative to requirements
- Large scale small-block sparse random access
- A high performance bulk transfer system would be terrible here
- Thousands of parallel clients
- The system must scale with the number of clients
- In this context, latency defines performance
- Need the lowest latency possible
- Serve as many clients as possible
- As always, there are budgetary constraints
- Restricted to commodity parts
- Success can now be measured
10. Brief History
- 1997: Objectivity, Inc. collaboration
- Design and development to scale Objectivity/DB
- First attempt to use a commercial DB for physics data
- Successful but very problematical
- 2001: BaBar decides to use the root framework
- Collaboration between INFN Padova and SLAC
- Design and develop high performance data access
- Work based on what we learned with Objectivity
- 2003: First deployment of the xrootd system at SLAC
- 2005: Collaboration extended
- Root collaboration and the Alice LHC experiment, CERN
- CNAF (Bologna, It), FZK (De), IN2P3 (Fr), INFN (Padova, It), RAL (UK), and SLAC are current production deployment sites
11. Design: The 10,000 Foot View
(architecture diagram; key metric: latency)
12. Eliminating The Obvious
- Client latency is immaterial . . .
- If CPU time / (bytes read) >> external latency
- Then, as the number of parallel clients increases
- Overall system throughput increases
- Without impacting individual client latency
- Assuming a random distribution of requests
- Usually up to the server's performance limit
- This is the case with HEP applications
- The ingest rate is relatively low
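The condition on this slide can be sketched numerically. A minimal sketch with hypothetical numbers (the function and its parameters are illustrative, not measured BaBar figures): when per-request compute time dwarfs access latency, each added client contributes nearly its full rate to aggregate throughput.

```python
# Illustrative sketch (hypothetical numbers): when CPU/(bytes read) >>
# external latency, adding clients raises aggregate throughput without
# hurting any one client's response time, up to server saturation.
def client_throughput(cpu_per_byte_s, bytes_per_req, latency_s):
    """Bytes/s one client sees: compute time plus access latency per request."""
    per_req = cpu_per_byte_s * bytes_per_req + latency_s
    return bytes_per_req / per_req

# Hypothetical HEP-like case: 3KB sparse reads, heavy per-event analysis.
one = client_throughput(cpu_per_byte_s=5e-6, bytes_per_req=3072, latency_s=0.5e-3)

# With compute >> latency, N clients deliver roughly N times one client's
# rate until the server's own limit is reached.
for n in (1, 10, 100):
    print(n, round(n * one, 1))
```

The ratio of throughput with and without the 0.5ms access latency stays above 95% here, which is the sense in which client-side latency is "immaterial".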
13. Network Latency ∝ Cost
Performance measurement using NetPIPE (Ames Lab, http://www.scl.ameslab.gov/netpipe/)
14. Device Latency ∝ Cost
(chart; not reproduced)
15. This Leaves the Server
- The best software bet for impacting overall performance
- We can design a data access system specific to:
- Client access patterns
- A globally distributed processing environment
- Write once, read mostly data
- Thousands of parallel batch clients
- The average run-time of a job
- The result is xrootd
- A low latency, self-clustering data access system
16. xrootd Server Architecture
(layered diagram; peer-to-peer clustering at its heart)
- Protocol thread manager (xrd)
- Protocol layer (xrootd), speaking the xroot protocol, with authentication
- Filesystem logical layer (ofs, odc), with optional authorization
- Filesystem physical layer (oss)
- Filesystem implementation (mss, _fs)
All layers are included in the distribution as shared libraries.
17. Making the Server Perform I
- The protocol is a key component of performance
- Compact, efficient protocol
- Minimal request/response overhead (24/8 bytes)
- Minimal encoding/decoding (network-ordered binary)
- Parallel requests on a single client stream
- High degree of server-side flexibility
- Request/response reordering
- Dynamic transfer size selection
- Rich set of operations
- Allows hints for improved performance
- Pre-read, prepare, and client access processing hints
- Especially important for accessing offline storage
- Integrated peer-to-peer clustering
- Inherent scaling and fault tolerance
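The flavor of such a compact binary protocol can be sketched with fixed-size, network-ordered headers matching the slide's 24/8-byte figures. The field layout and names below are hypothetical, not the actual xroot wire format; the stream id is what lets multiple outstanding requests share one client connection.

```python
import struct

# Sketch of a compact binary protocol: fixed-size headers, network byte
# order ('!'), so there is no textual encoding/decoding on either side.
# Field layout is illustrative, not the real xroot specification.
REQ_HDR = struct.Struct("!HH16sI")   # stream id, request code, args, data length
RSP_HDR = struct.Struct("!HHI")      # stream id, status, data length

def pack_request(streamid, reqcode, args=b"", datalen=0):
    """Build a 24-byte request header; args are padded to 16 bytes."""
    return REQ_HDR.pack(streamid, reqcode, args.ljust(16, b"\0"), datalen)

def unpack_response(buf):
    """Decode an 8-byte response header."""
    return RSP_HDR.unpack_from(buf)

req = pack_request(streamid=7, reqcode=3011, args=b"open")
assert len(req) == 24        # minimal per-request overhead
assert RSP_HDR.size == 8     # minimal per-response overhead
```

Because the stream id travels in every header, the server can reorder responses freely and the client can keep many requests in flight on one TCP stream.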
18. Making the Server Perform II
- Short code paths are critical
- Massively threaded design
- Avoids synchronization bottlenecks
- Adapts well to next generation multi-core chips
- Internal wormhole mechanisms
- Minimize code paths in a multi-layered design
- Without flattening the overall architecture
- Use the most efficient OS-specific system interfaces
- Dynamic and compile-time selection
- Dynamic: aio_read() vs read()
- Compile-time: /dev/poll or kqueue() vs poll() or select()
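The interface-selection idea can be sketched in Python, whose stdlib exposes each readiness mechanism only on platforms that support it, so a feature test at startup plays the role of the slide's compile-time choice. The function name is illustrative.

```python
import select

# Sketch: pick the most efficient readiness-notification interface the
# platform offers, analogous to choosing /dev/poll or kqueue() over
# poll() or select() at build time.
def best_poller():
    if hasattr(select, "epoll"):    # Linux: O(1) event notification
        return "epoll"
    if hasattr(select, "kqueue"):   # BSD / macOS equivalent
        return "kqueue"
    if hasattr(select, "devpoll"):  # Solaris /dev/poll
        return "devpoll"
    if hasattr(select, "poll"):     # POSIX poll(): no FD_SETSIZE limit
        return "poll"
    return "select"                 # last resort: select()

print(best_poller())
```

The ordering encodes the cost model: the first three scale with the number of ready descriptors rather than the number watched, which matters with thousands of parallel clients.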
19. Making the Server Perform III
- Intelligent memory management
- Minimize cross-thread shared objects
- Avoids thrashing the processor cache
- Maximize object re-use
- Less fragmentation of the free space heap
- Avoids a major serialization bottleneck (malloc)
- Load-adaptive I/O buffer management
- Minimize server growth to avoid paging
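The object re-use point can be illustrated with a small buffer pool: recycled buffers bypass the allocator (and the lock inside it), and a cap on the free list bounds server growth. Class and parameter names are illustrative, not xrootd's.

```python
import threading

# Minimal sketch of object re-use: recycle I/O buffers through a free
# list so the (serialized) allocator is hit only on the slow path, and
# cap the list so the server's footprint cannot grow without bound.
class BufferPool:
    def __init__(self, bufsize=64 * 1024, limit=128):
        self._free = []
        self._lock = threading.Lock()
        self._bufsize, self._limit = bufsize, limit

    def acquire(self):
        with self._lock:
            if self._free:
                return self._free.pop()    # fast path: re-use, no allocation
        return bytearray(self._bufsize)    # slow path: allocate a new buffer

    def release(self, buf):
        with self._lock:
            if len(self._free) < self._limit:  # bound growth to avoid paging
                self._free.append(buf)
            # else: drop the buffer and let it be freed

pool = BufferPool()
buf = pool.acquire()
pool.release(buf)
assert pool.acquire() is buf   # the same object is handed back
```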
20. Making the Server Perform IV
- Solve only the problem at hand
- Avoids high-overhead but unused features
- xrootd is only a data access system
- It may look like a file system, but it is not one
- Avoids high-overhead consistency semantics
- Not needed in write once, read many applications
This is common sense that is hard to follow.
21. Performance Goals Achieved?
- Goals
- Very low latency
- Handle many parallel clients
- Test setup
- Sun V20z: 1.86GHz dual Opteron, 2GB RAM
- 1Gb on-board Broadcom NIC (same subnet)
- Solaris 10 x86
- Linux RHEL3 2.4.21-2.7.8.ELsmp
- Client running BetaMiniApp with the analysis removed
22. Latency Per Request (xrootd)
23. Capacity vs Load (xrootd)
24. xrootd Server Scaling
- Linear scaling relative to load
- Allows deterministic sizing of a server:
- Disk
- NIC
- CPU
- Memory
- Performance is tied directly to hardware cost
- How does that compare to competitive boxes?
25. Event Rate Comparison
- NetApp FAS270 1250: dual 650MHz CPU, 1Gb NIC, 1GB cache, RAID 5 FC, 140GB 10k rpm
- Apple Xserve: 1Gb NIC, RAID 5 FC, 180GB 7.2k rpm
- Sun 280r: dual 900MHz UltraSparc 3 CPU, Solaris 8, Seagate ST118167FC
- Cost factor: 1.45
26. Can It Do Better?
- Measurement now becomes a key factor
- Must understand:
- OS effects
- Disk and filesystem effects
- Network fabric effects
- NIC driver effects
- Overhead distribution
27. OS Effects
28. Device and Filesystem Effects
(chart: I/O limited vs CPU limited regions)
- UFS is good on small reads; VXFS is good on big reads
- 1 event ≈ 2K
29. Network Fabric Effects
Cisco Catalyst 6509
(diagram of 100Mb and 1Gb host links across 32Gb, 256Gb, and 720Gb switch fabrics; detail not reproduced)
30. NIC Driver Effects on Latency
31. NIC Driver Effects on Request Rate
(chart; CPU limited)
32. NIC Driver Optimization Impact
33. Overhead Distribution
34. Network Overhead Dominates
35First Conclusion
With sufficient attention to detail, it is
possible to create a Data Access Server with
sufficiently low overhead and scaling
capacity that it no longer becomes a significant
performance factor.
36Beyond High Performance
- xrootd servers can be clustered
- Increase access points and available data
- Allow for automatic failover
- The trick is to do so in a way that
- Cluster overhead (human non-human) scales
linearly - Allows deterministic sizing of cluster
- Cluster size is not artificially limited
- I/O performance is not affected
- Achieves scaling and fault-tolerance
37Basic Cluster Architecture
- Software cross bar switch
- Allows point-to-point connections
- Client and data server
- I/O performance not compromised
- Assuming switch overhead can be amortized
- Scale interconnections by stacking switches
- Virtually unlimited connection points
- Switch overhead must be very low
38Single Level Switch
A
open file X
Redirectors Cache file location
go to C
Who has file X?
2nd open X
B
go to C
I have
open file X
C
Redirector (Head Node)
Client
Data Servers
Cluster
Client sees all servers as xrootd data servers
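The redirector's role in this diagram can be sketched as a tiny location service: broadcast a query on the first open, cache the answer, and redirect every client point-to-point. Server names and the class below are illustrative, not xrootd code.

```python
# Sketch of the single-level switch: the redirector locates a file by
# querying data servers, caches the location, and redirects the client
# so all subsequent I/O is a direct client-server connection.
class Redirector:
    def __init__(self, servers):
        self.servers = servers   # server name -> set of files it holds
        self.cache = {}          # file path -> server name (location cache)

    def locate(self, path):
        if path in self.cache:                    # 2nd open: answered from cache
            return self.cache[path]
        for name, files in self.servers.items():  # "Who has file X?"
            if path in files:                     # "I have"
                self.cache[path] = name
                return name                       # client is told: "go to <name>"
        return None                               # nobody has it

r = Redirector({"A": set(), "B": set(), "C": {"/store/fileX"}})
assert r.locate("/store/fileX") == "C"
assert "/store/fileX" in r.cache     # later opens skip the broadcast entirely
```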
39Two Level Switch
Client
A
Who has file X?
Data Servers
open file X
B
D
go to C
Who has file X?
I have
open file X
I have
C
E
I have
go to F
Supervisor (sub-redirector)
Redirector (Head Node)
F
open file X
Cluster
Client sees all servers as xrootd data servers
40Example SLAC Configuration
kan01
kan02
kan03
kan04
kanxx
kanolb-a
bbr-olb03
bbr-olb04
client machines
Hidden Details
41Making Clusters Efficient
- Cell size, structure, search protocol are
critical - Cell Size is 64
- Limits direct inter-chatter to 64 entities
- Compresses incoming information by up to a factor
of 64 - Can use very efficient 64-bit logical operations
- Hierarchical structures usually most efficient
- Cells arranged in a B-Tree (i.e., B64-Tree)
- Scales 64h (where h is the tree height)
- Client needs h-1 hops to find one of 64h servers
(2 hops for 262,144 servers) - Number of responses is bounded at each level of
the tree - Search is a directed broadcast query/rarely
respond protocol - Provably best scheme if less than 50 of servers
have the wanted file - Generally true if number of files gtgt cluster
capacity - Cluster protocol becomes more efficient as it
grows
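The B64-tree arithmetic above is easy to check directly, and the "64-bit logical operations" point follows from the cell size: a cell's 64 members fit one bit each in a single machine word.

```python
# Sketch of the B64-tree arithmetic: a tree of height h spans 64**h
# servers, and a client is redirected h-1 times to reach one of them.
def servers(h):
    """Capacity of a height-h B64-tree."""
    return 64 ** h

def hops(h):
    """Client redirections needed to reach a data server."""
    return h - 1

assert servers(2) == 4096
assert servers(3) == 262_144 and hops(3) == 2   # the slide's figures

# With 64 children per cell, "which of my children report having file X?"
# is a single 64-bit word, so summarizing a whole cell is one logical op.
have = (1 << 5) | (1 << 41)            # children 5 and 41 said "I have"
assert bin(have).count("1") == 2       # bounded, compressed response
```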
42Cluster Scale Management
- Massive clusters must be self-managing
- Scales 64n where n is height of tree
- Scales very quickly (642 4096, 643 262,144)
- Well beyond direct human management capabilities
- Therefore clusters self-organize
- Uses a minimal spanning tree algorithm
- 280 nodes self-cluster in about 7 seconds
- 890 nodes self-cluster in about 56 seconds
- Most overhead is in wait time to prevent
thrashing
43Redirection Overhead
Server cache search
Linux Solaris
(only xrootd protocol overhead measured)
44Clustering Impact
- Redirection overhead must be amortized
- This is deterministic process for xrootd
- All I/O is via point-to-point connections
- Can trivially use single-server performance data
- Clustering overhead is non-trivial
- Not good for very small files or short open
times - However, compatible with the HEP access patterns
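The amortization argument reduces to one ratio. A sketch with hypothetical numbers (the 10ms redirect cost is illustrative, not a measured figure):

```python
# Illustrative amortization arithmetic: a one-time redirection cost only
# matters relative to how long the resulting connection is used.
def redirect_overhead_fraction(redirect_ms, open_duration_s):
    """Fraction of the file-open lifetime spent on the initial redirect."""
    return (redirect_ms / 1000.0) / open_duration_s

# A hypothetical 10ms redirect is negligible for an hours-long HEP job...
long_job = redirect_overhead_fraction(10, 4 * 3600)
assert long_job < 1e-5

# ...but dominates when a file is open for only a tenth of a second,
# which is why very small files or short open times fare poorly.
tiny_open = redirect_overhead_fraction(10, 0.1)
assert abs(tiny_open - 0.1) < 1e-9
```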
45Other Necessary Items
- Items that peripherally affect performance
- Fault Tolerance
- Proxy Service
- Integrated Security
- Application Server Monitoring
- Mass Storage System Support
- Grid Support
Hidden Details
46Future Direction
High Performance Data Access Servers plus Efficien
t large scale clustering Allows Novel
cost-effective super-fast massive
storage Optimized for sparse random
access Imagine 30TB of DRAM At commodity prices
47. Device Speed Delivery
48Memory Access Characteristics
Server zsuntwo CPU Sparc NIC 100Mb OS
Solaris 10 UFS Sandard
49The Peta-Cache
- Cost-effect memory access impacts science
- Nature of all random access analysis
- Not restricted to just High Energy Physics
- Enables faster and more detailed analysis
- Opens new analytical frontiers
50Conclusion
- High performance data access systems achievable
- The devil is in the details
- Must understand processing domain and deployment
infrastructure - Comprehensive repeatable measurement strategy
- High performance and clustering are synergetic
- Allows unique performance, usability,
scalability, and recoverability characteristics - Such systems produce novel software architectures
- Challenges
- Creating application algorithms that can make use
of such systems - Opportunities
- Fast low cost access to huge amounts of data to
speed discovery
51Acknowledgements
- Fabrizio Furano, INFN Padova
- Client-side design development
- Bill Weeks
- Performance measurement guru
- 100s of measurements repeated 100s of times
- US Department of Energy
- Contract DE-AC02-76SF00515 with Stanford
University