1
Designing High Performance Data Access Systems
  • Andrew Hanushevsky
  • Bill Weeks
  • Stanford Linear Accelerator Center
  • Stanford University
  • 13-July-05
  • http://xrootd.slac.stanford.edu

Fifth International Workshop on Software and
Performance (WOSP 2005), July 11-14, 2005, Palma de
Mallorca, Illes Balears, Spain
2
Outline
  • Motivation, Problem Statement, Environment
  • Design consequences
  • Goals, Design, Attainment
  • Penultimate conclusion
  • Going beyond high performance
  • Impact
  • Conclusion

3
Motivation
  • BaBar (B B-bar interactions)
  • High Energy Physics (HEP) Experiment
  • 800 physicists, 87 locations, 9 countries
  • Measure interactions of B-Meson particles
  • Produced by colliding electrons and positrons
  • Produces relatively rare events
  • Need an extremely large number for statistical
    significance
  • Determine where all the anti-matter went
  • Occasionally a new particle, Y(4260), pops up!

4
The Linear Accelerator
5
The Problem
  • Experiment relies on a rare event
  • Huge amount of data to get a significant number
    of events
  • Intensive data analysis to find the B B-bar events
  • Need scalable high performance data access
  • Analyze large amounts of experimental physics
    data
  • 316TB and growing every day
  • All file-oriented, in ROOT object format
  • Objects represent particle collisions or events
  • Over 230,000,000 events so far
  • File Based Access
  • Over 600,000 files
  • Average size 650MB

6
The Processing Environment
  • Distributed Computing
  • FZK (De), IN2P3 (Fr), CNAF INFN (It), RAL (UK),
    SLAC (US)
  • Currently, subsets of the data are replicated
    across sites
  • Data is mostly read only
  • About 20% of I/O devoted to new file creation
  • More data than disk space
  • 316TB data vs 160TB of disk (at SLAC)
  • Thousands of expensive compute nodes
  • Jobs run from a few hours to several days

7
The Applications
  • Complex embarrassingly parallel analysis
  • Determine particle decay products
  • 1000s of parallel clients hitting the same data
  • Small block sparse random access
  • Median size < 3K
  • Uniform seek across whole file (mean 650MB)
  • Only about 22% of the file read (mean 140MB)

8
Design Consequences
  • Write once read many times processing mode
  • Can capitalize on simplified semantics
  • Large scale small block sparse random access
  • Needs very low latency per request
  • Large compute investment
  • Needs high degree of fault-tolerance
  • More data than disk space
  • Must accommodate offline storage (Mass Storage
    System)
  • Highly distributed environment
  • Component based system (replaceable objects)
  • Simple setup with few 3rd party requirements

9
Performance Consequences
  • Performance is relative to requirements
  • Large scale small block sparse random access
  • High performance bulk transfer system would be
    terrible
  • Thousands of parallel clients
  • System must scale to number of clients
  • In this context latency defines performance
  • Need the lowest latency possible
  • Serve as many clients as possible
  • As always, there are budgetary constraints
  • Restricted to commodity parts
  • Success can now be measured

10
Brief History
  • 1997 Objectivity, Inc. collaboration
  • Design & development to scale Objectivity/DB
  • First attempt to use a commercial DB for Physics
    data
  • Successful but very problematic
  • 2001 BaBar decides to use the ROOT framework
  • Collaboration with INFN Padova & SLAC
  • Design & develop high performance data access
  • Work based on what we learned with Objectivity
  • 2003 First deployment of xrootd system at SLAC
  • 2005 Collaboration extended
  • ROOT collaboration, Alice LHC experiment (CERN)
  • CNAF (Bologna, It), FZK (De), IN2P3 (Fr), INFN
    Padova (It), RAL (UK), and SLAC are the current
    production deployment sites

11
Design The 10,000 Foot View
12
Eliminating The Obvious
  • Client Latency Immaterial . . .
  • If CPU/(Bytes Read) >> External Latency (see the
    sketch after this list)
  • Then as the number of parallel clients increases
  • Overall system throughput increases
  • Without impacting individual client latency
  • Assuming random distribution of requests
  • Usually up to the server's performance limit
  • This is the case with HEP applications
  • The ingest rate is relatively low
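
A rough way to state this argument (the symbols are
illustrative, not from the slides): let t_cpu be a
client's compute time per request and t_ext the
external latency per request.

    % Hedged sketch: one client's request cycle and the offered load it creates.
    % t_cpu = client compute time per request, t_ext = external latency per
    % request, N = number of parallel clients.
    T_{req} = t_{cpu} + t_{ext}, \qquad
    \text{offered load} = \frac{N}{T_{req}} \approx \frac{N}{t_{cpu}}
    \quad \text{when } t_{cpu} \gg t_{ext}
    % Until the offered load approaches the server's service rate, adding
    % clients raises aggregate throughput roughly linearly while each client's
    % latency stays dominated by its own compute time.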

13
Network Latency ∝ Cost
Performance measurement using NetPIPE (Ames Lab),
http://www.scl.ameslab.gov/netpipe/
14
Device Latency ∝ Cost
15
This Leaves the Server
  • Best software bet to impact overall performance
  • We can design a data access system specific to
  • Client access patterns
  • Globally distributed processing environment
  • Write once and read mostly data
  • Thousands of parallel batch clients
  • Average run-time of job
  • The result is xrootd
  • A low latency self-clustering data access system

16
xrootd Server Architecture
[Flattened architecture diagram. Layers, top to
bottom, with component names in parentheses:]
  • application
  • Protocol Thread Manager (xrd)
  • Protocol Layer (xrootd / xroot), plus
    Authentication
  • Filesystem Logical Layer (ofs), plus odc (the p2p
    clustering heart) and optional Authorization
  • Filesystem Physical Layer (oss)
  • Filesystem Implementation (mss, _fs)
  • Components are included in the distribution as
    shared libraries
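
Because the layers above ship as replaceable shared
libraries, wiring a component in can be as simple as
loading it at start-up. A minimal sketch, assuming a
hypothetical FileSystem interface and GetFileSystem
factory symbol (not the actual xrootd plug-in API):

    // Hedged sketch of shared-library component loading; the names are
    // illustrative, not the real xrootd plug-in interface.
    #include <dlfcn.h>
    #include <cstdio>

    struct FileSystem {                          // hypothetical layer interface
        virtual int Open(const char *path) = 0;
        virtual ~FileSystem() {}
    };

    typedef FileSystem *(*FsFactory)();          // factory the library exports

    FileSystem *LoadLayer(const char *libpath) { // e.g. "libmyoss.so"
        void *h = dlopen(libpath, RTLD_NOW);
        if (!h) { std::fprintf(stderr, "%s\n", dlerror()); return 0; }
        FsFactory make = (FsFactory)dlsym(h, "GetFileSystem");
        return make ? make() : 0;                // caller owns the layer object
    }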
17
Making the Server Perform I
  • Protocol is a key component in performance
  • Compact, efficient protocol (see the header sketch
    after this list)
  • Minimal request/response overhead (24/8 bytes)
  • Minimal encoding/decoding (network ordered
    binary)
  • Parallel requests on a single client stream
  • High degree of server-side flexibility
  • Request response reordering
  • Dynamic transfer size selection
  • Rich set of operations
  • Allows hints for improved performance
  • Pre-read, prepare, client access processing
    hints
  • Especially important for accessing offline
    storage
  • Integrated peer-to-peer clustering
  • Inherent scaling and fault tolerance
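
To make the 24/8-byte framing concrete, here is a
minimal sketch of a fixed-size, network-ordered
header pair; the field layout is illustrative only,
not the published xroot protocol definition:

    // Hedged sketch of a compact, network-ordered request/response header;
    // the exact fields are illustrative, only the 24/8-byte sizes match the
    // slide.
    #include <cstdint>
    #include <arpa/inet.h>

    #pragma pack(push, 1)
    struct Request {              // 24 bytes on the wire
        uint16_t streamid;        // lets responses be reordered per stream
        uint16_t requestid;       // operation code
        uint8_t  params[16];      // operation-specific arguments
        uint32_t dlen;            // length of any data that follows
    };
    struct Response {             // 8 bytes on the wire
        uint16_t streamid;
        uint16_t status;
        uint32_t dlen;
    };
    #pragma pack(pop)

    void Encode(Request &r) {     // byte-swap only; no text encoding/decoding
        r.streamid  = htons(r.streamid);
        r.requestid = htons(r.requestid);
        r.dlen      = htonl(r.dlen);
    }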

18
Making the Server Perform II
  • Short code paths critical
  • Massively threaded design
  • Avoids synchronization bottlenecks
  • Adapts well to next generation multi-core chips
  • Internal wormhole mechanisms
  • Minimizes code paths in a multi-layered design
  • Does not flatten the overall architecture
  • Use the most efficient OS-specific system
    interfaces
  • Dynamic and compile-time selection
  • Dynamic aio_read() vs read() (see the sketch after
    this list)
  • Compile-time /dev/poll or kqueue() vs poll() or
    select()
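
As one example of the dynamic selection above, a
read path can choose between POSIX aio_read() and a
plain pread() at run time. A sketch under that
assumption (not the actual xrootd code):

    // Hedged sketch: run-time choice between aio_read() and pread(); the
    // useAio decision itself would come from configuration or load heuristics.
    #include <aio.h>
    #include <unistd.h>
    #include <cstring>

    ssize_t ReadBlock(int fd, void *buf, size_t len, off_t off, bool useAio) {
        if (!useAio)
            return pread(fd, buf, len, off);        // short synchronous path
        struct aiocb cb;                            // asynchronous path
        std::memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = len;
        cb.aio_offset = off;
        if (aio_read(&cb) != 0) return -1;          // queue the request
        const struct aiocb *list[1] = { &cb };
        aio_suspend(list, 1, 0);                    // wait for completion
        return aio_return(&cb);                     // bytes read or -1
    }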

19
Making the Server Perform III
  • Intelligent memory management
  • Minimize cross-thread shared objects
  • Avoids thrashing the processor cache
  • Maximize object re-use (see the pool sketch after
    this list)
  • Less fragmentation of the free-space heap
  • Avoids a major serialization bottleneck (malloc)
  • Load adaptive I/O buffer management
  • Minimize server growth to avoid paging
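
One common way to get the object re-use described
above is a simple free-list pool, so buffers are
recycled instead of repeatedly going through
malloc/free; a minimal sketch with illustrative
class names:

    // Hedged sketch of object re-use via an intrusive free list; illustrative,
    // not the actual xrootd buffer manager.
    #include <mutex>

    class Buffer {
    public:
        char    data[4096];
        Buffer *next = nullptr;       // intrusive free-list link
    };

    class BufferPool {
        Buffer     *free_ = nullptr;
        std::mutex  mtx_;
    public:
        Buffer *Get() {
            std::lock_guard<std::mutex> lk(mtx_);
            if (Buffer *b = free_) { free_ = b->next; return b; }
            return new Buffer();      // allocate only when the pool is empty
        }
        void Put(Buffer *b) {         // return instead of delete: no heap churn
            std::lock_guard<std::mutex> lk(mtx_);
            b->next = free_;
            free_   = b;
        }
    };

The pool lock is still a brief serialization point,
but it is far cheaper than contended malloc calls
and avoids fragmenting the free-space heap.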

20
Making the Server Perform IV
  • Solve only the problem at hand
  • Avoids high overhead but unused features
  • xrootd is only a Data Access System
  • It may look like a file system but it is not one
  • Avoids high overhead consistency semantics
  • Not needed in write once read many applications

This is common sense that is hard to follow
21
Performance Goals Achieved?
  • Goals
  • Very low latency
  • Handle many parallel clients
  • Test setup
  • Sun V20z 1.86GHz dual Opteron, 2GB RAM
  • 1Gb on board Broadcom NIC (same subnet)
  • Solaris 10 x86
  • Linux RHEL3 2.4.21-2.7.8.ELsmp
  • Client running BetaMiniApp with analysis removed

22
Latency Per Request (xrootd)
23
Capacity vs Load (xrootd)
24
xrootd Server Scaling
  • Linear scaling relative to load
  • Allows deterministic sizing of server
  • Disk
  • NIC
  • CPU
  • Memory
  • Performance tied directly to hardware cost
  • How does that compare to competitive boxes?

25
Event Rate Comparison
NetApp FAS270 (1250): dual 650 MHz CPU, 1Gb NIC,
1GB cache, RAID 5 FC, 140 GB 10k rpm disks
Apple Xserve: 1Gb NIC, RAID 5 FC, 180 GB 7.2k rpm
disks
Sun 280r: dual 900 MHz UltraSPARC 3 CPU, Solaris 8,
Seagate ST118167FC disks
Cost factor: 1.45
26
Can It Do Better?
  • Measurement now becomes a key factor
  • Must understand
  • OS effects
  • Disk and filesystem effects
  • Network fabric effects
  • NIC driver effects
  • Overhead distribution

27
OS Effects
28
Device & Filesystem Effects
[Chart annotations: I/O limited vs. CPU limited
regions; UFS good on small reads, VXFS good on big
reads; 1 Event ≈ 2K]
29
Network Fabric Effects
[Diagram: Cisco Catalyst 6509 switch fabric, with
100Mb and 1Gb host links feeding 32Gb to 720Gb
backplane links]
30
NIC Driver Effects on Latency
31
NIC Driver Effects on Request Rate
CPU limited
32
NIC Driver Optimization Impact
33
Overhead Distribution
34
Network Overhead Dominates
35
First Conclusion
With sufficient attention to detail, it is possible
to create a data access server with low enough
overhead and enough scaling capacity that it is no
longer a significant performance factor.
36
Beyond High Performance
  • xrootd servers can be clustered
  • Increase access points and available data
  • Allow for automatic failover
  • The trick is to do so in a way that
  • Cluster overhead (human & non-human) scales
    linearly
  • Allows deterministic sizing of cluster
  • Cluster size is not artificially limited
  • I/O performance is not affected
  • Achieves scaling and fault-tolerance

37
Basic Cluster Architecture
  • Software crossbar switch
  • Allows point-to-point connections
  • Client and data server
  • I/O performance not compromised
  • Assuming switch overhead can be amortized
  • Scale interconnections by stacking switches
  • Virtually unlimited connection points
  • Switch overhead must be very low

38
Single Level Switch
[Diagram: a redirector (head node) in front of data
servers A, B, and C]
  • Client sends "open file X" to the redirector
  • Redirector asks the data servers "Who has file X?"
  • Server C replies "I have"
  • Redirector replies "go to C" and caches the file
    location
  • Client opens file X directly on server C
  • A second open of X is answered "go to C" straight
    from the redirector's cache
Client sees all servers as xrootd data servers
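
The client side of this exchange is essentially a
redirect-following loop. A minimal sketch, where
Reply, kRedirect, and the injected send function are
illustrative rather than the real client library:

    // Hedged sketch of client-side redirection handling; types and the send
    // callback are illustrative only.
    #include <functional>
    #include <string>

    enum Status { kOpenOk, kRedirect, kError };
    struct Reply { Status status; std::string host; int port; };

    using SendOpen = std::function<Reply(const std::string &host, int port,
                                         const std::string &path)>;

    bool OpenFile(std::string host, int port, const std::string &path,
                  const SendOpen &send) {
        for (int hop = 0; hop < 16; ++hop) {           // bound the redirect chain
            Reply r = send(host, port, path);          // ask redirector or server
            if (r.status == kOpenOk)   return true;    // reached a data server
            if (r.status != kRedirect) return false;
            host = r.host;                             // "go to C": follow the hop
            port = r.port;
        }
        return false;
    }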
39
Two Level Switch
[Diagram: a redirector (head node) above supervisors
(sub-redirectors), each fronting its own data
servers]
  • Client sends "open file X" to the redirector
  • Redirector asks its supervisors "Who has file X?"
    and each supervisor asks its data servers
  • Data servers holding the file reply "I have", and
    their supervisor replies "I have" upward
  • The client is redirected down the tree ("go to F",
    then "go to C") until it opens file X directly on
    a data server
Client sees all servers as xrootd data servers
40
Example SLAC Configuration
[Diagram: data servers kan01, kan02, kan03, kan04
... kanxx, redirectors kanolb-a, bbr-olb03, and
bbr-olb04, and the client machines; further details
hidden]
41
Making Clusters Efficient
  • Cell size, structure, and search protocol are
    critical
  • Cell size is 64
  • Limits direct inter-chatter to 64 entities
  • Compresses incoming information by up to a factor
    of 64
  • Can use very efficient 64-bit logical operations
    (see the sketch after this list)
  • Hierarchical structures usually most efficient
  • Cells arranged in a B-Tree (i.e., B64-Tree)
  • Scales as 64^h (where h is the tree height)
  • Client needs h-1 hops to find one of 64^h servers
    (2 hops for 262,144 servers)
  • Number of responses is bounded at each level of
    the tree
  • Search is a directed broadcast query/rarely
    respond protocol
  • Provably best scheme if less than 50% of servers
    have the wanted file
  • Generally true if number of files >> cluster
    capacity
  • Cluster protocol becomes more efficient as it
    grows
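
A sketch of the 64-wide bookkeeping this relies on
(illustrative structure, not the actual cluster
manager code): each cell keeps one bit per member,
so a whole cell's knowledge about a file fits in one
machine word.

    // Hedged sketch: one bit per cell member, so set/clear/query are single
    // 64-bit logical operations.
    #include <cstdint>

    struct CellInfo {
        uint64_t haveFile;   // bit i set => member i reported having the file
        uint64_t online;     // bit i set => member i is currently up
    };

    inline void SetHas(CellInfo &c, int m)   { c.haveFile |=  (1ULL << m); }
    inline void ClearHas(CellInfo &c, int m) { c.haveFile &= ~(1ULL << m); }

    // Members that both have the file and are online; one AND replaces a scan
    // of up to 64 per-server records.
    inline uint64_t Candidates(const CellInfo &c) { return c.haveFile & c.online; }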

42
Cluster Scale Management
  • Massive clusters must be self-managing
  • Scales as 64^n, where n is the height of the tree
  • Scales very quickly (64^2 = 4,096; 64^3 = 262,144)
  • Well beyond direct human management capabilities
  • Therefore clusters self-organize
  • Uses a minimal spanning tree algorithm
  • 280 nodes self-cluster in about 7 seconds
  • 890 nodes self-cluster in about 56 seconds
  • Most overhead is in wait time to prevent
    thrashing

43
Redirection Overhead
[Chart: server cache-search time, Linux vs. Solaris
(only xrootd protocol overhead measured)]
44
Clustering Impact
  • Redirection overhead must be amortized (see the
    sketch after this list)
  • This is a deterministic process for xrootd
  • All I/O is via point-to-point connections
  • Can trivially use single-server performance data
  • Clustering overhead is non-trivial
  • Not good for very small files or short open
    times
  • However, compatible with the HEP access patterns
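
A rough way to quantify the amortization (symbols
illustrative): with a one-time redirection cost t_r
and a point-to-point I/O phase of length t_io,

    % Hedged sketch: fraction of a job's data-access time spent on redirection.
    \text{redirection overhead fraction} = \frac{t_r}{t_r + t_{io}}
    % HEP jobs hold files open for hours, so t_io >> t_r and the fraction is
    % negligible; for very small files or short open times t_io shrinks and the
    % redirection cost shows through.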

45
Other Necessary Items
  • Items that peripherally affect performance
  • Fault Tolerance
  • Proxy Service
  • Integrated Security
  • Application & Server Monitoring
  • Mass Storage System Support
  • Grid Support

Hidden Details
46
Future Direction
High performance data access servers plus efficient
large scale clustering allows novel, cost-effective,
super-fast massive storage optimized for sparse
random access. Imagine 30TB of DRAM at commodity
prices.
47
Device Speed Delivery
48
Memory Access Characteristics
Server: zsuntwo; CPU: SPARC; NIC: 100Mb; OS: Solaris
10; Filesystem: standard UFS
49
The Peta-Cache
  • Cost-effective memory access impacts science
  • Nature of all random access analysis
  • Not restricted to just High Energy Physics
  • Enables faster and more detailed analysis
  • Opens new analytical frontiers

50
Conclusion
  • High performance data access systems achievable
  • The devil is in the details
  • Must understand processing domain and deployment
    infrastructure
  • Comprehensive repeatable measurement strategy
  • High performance and clustering are synergistic
  • Allows unique performance, usability,
    scalability, and recoverability characteristics
  • Such systems produce novel software architectures
  • Challenges
  • Creating application algorithms that can make use
    of such systems
  • Opportunities
  • Fast low cost access to huge amounts of data to
    speed discovery

51
Acknowledgements
  • Fabrizio Furano, INFN Padova
  • Client-side design & development
  • Bill Weeks
  • Performance measurement guru
  • 100s of measurements repeated 100s of times
  • US Department of Energy
  • Contract DE-AC02-76SF00515 with Stanford
    University