1
The Lustre Storage Architecture
Linux Clusters for Super Computing, Linköping 2003
  • Peter J. Braam, Tim Reddin
  • braam@clusterfs.com, tim.reddin@hp.com
  • http://www.clusterfs.com

2
Topics
  • History of project
  • High level picture
  • Networking
  • Devices and fundamental APIs
  • File I/O
  • Metadata recovery
  • Project status
  • Cluster File Systems, Inc

3
Lustre's History
4
Project history
  • 1999: CMU & Seagate
  • Worked with Seagate for one year
  • Storage management, clustering
  • Built prototypes, much design
  • Much survives today

5
2000-2002: File system challenge
  • First put forward Sep 1999, Santa Fe
  • New architecture for National Labs
  • Characteristics
  • 100s of GB/sec of I/O throughput
  • trillions of files
  • 10,000s of nodes
  • Petabytes
  • From the start, Garth & Peter in the running

6
2002-2003: fast lane
  • 3-year ASCI Path Forward contract
  • with HP and Intel
  • MCR & ALC, 2x 1000-node Linux clusters
  • PNNL: HP IA64, 1000-node Linux cluster
  • Red Storm, Sandia (8000 nodes, Cray)
  • Lustre Lite 1.0
  • Many partnerships (HP, Dell, DDN, ...)

7
2003: Production, performance
  • Spring and summer
  • LLNL MCR: from no, to partial, to full-time use
  • PNNL similar
  • Stability much improved
  • Performance
  • Summer 2003: I/O problems tackled
  • Metadata much faster
  • Dec/Jan
  • Lustre 1.0

8
High level picture
9
Lustre Systems: Major Components
  • Clients
  • Have access to file system
  • Typical role: compute server
  • OST
  • Object storage targets
  • Handle (stripes of, references to) file data
  • MDS
  • Metadata request transaction engine.
  • Also LDAP, Kerberos, routers etc.

10
[Diagram: Lustre clients (1,000 for Lustre Lite, up to 10,000s) reach Linux OST servers with disk arrays and 3rd-party OST appliances over QSW Net and GigE; the Lustre Object Storage Targets (OSTs) attach to their disk arrays over a SAN.]
11
[Diagram: component interactions. Clients obtain configuration information, network connection details and security management from the configuration/LDAP servers; directory operations, meta-data and concurrency are handled with the MDS; file I/O and file locking go to the OSTs; recovery, file status and file creation run between MDS and OSTs.]
12
Networking
13
Lustre Networking
  • Currently runs over
  • TCP
  • Quadrics Elan 3 & 4
  • Lustre can route & can use heterogeneous nets
  • Beta
  • Myrinet, SCI
  • Under development
  • SAN (FC/iSCSI), I/B
  • Planned
  • SCTP, some special NUMA and other nets

14
Lustre Network Stack - Portals
[Diagram: the network stack, top to bottom]
  • Request processing: 0-copy marshalling libraries, service framework, client request dispatch, connection & address naming, generic recovery infrastructure
  • Portals library (Sandia's API, CFS-improved implementation): moves small & large buffers, remote DMA handling, generates events
  • NAL (Network Abstraction Layer) for TCP, QSW, etc.: small & hard; includes routing API
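To make the NAL idea concrete, here is a minimal C sketch of a network abstraction layer as a small callback table that the Portals layer drives per network type. All names and signatures are hypothetical, for illustration only, and are not the actual Portals or Lustre NAL interfaces.

    /* Hypothetical NAL-style callback table (illustrative names only,
     * not the real Portals/Lustre interfaces). Each network type (TCP,
     * Quadrics/QSW, ...) supplies one of these. */
    #include <stddef.h>
    #include <stdint.h>

    struct nal_peer { uint64_t nid; };      /* network-wide node id */

    struct nal_ops {
        int (*send)(struct nal_peer *peer, const void *buf, size_t len);
        int (*recv)(struct nal_peer *peer, void *buf, size_t len);
        /* remote DMA of a large buffer directly into the peer's memory */
        int (*rdma_put)(struct nal_peer *peer, const void *buf, size_t len,
                        uint64_t remote_offset);
        /* forward a message towards a peer on another network (routing) */
        int (*route)(struct nal_peer *next_hop, const void *msg, size_t len);
    };

The upper layers select the table for the interconnect in use and never touch sockets or Elan primitives directly.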
15
Devices and APIs
16
Lustre Devices & APIs
  • Lustre has numerous driver modules
  • One API - very different implementations
  • Driver binds to named device
  • Stacking devices is key
  • Generalized object devices
  • Drivers currently export several APIs
  • Infrastructure - a mandatory API
  • Object Storage
  • Metadata Handling
  • Locking
  • Recovery
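As a rough illustration of "one API, stacked implementations", the C sketch below shows a logical driver forwarding calls through a common ops table to the devices stacked beneath it. The structures and the striping rule are hypothetical, not actual Lustre code.

    /* Hypothetical sketch of stacked object devices (illustrative names,
     * not Lustre's real structures). Every driver exports the same ops
     * table; logical drivers forward calls to the devices below them. */
    #include <stddef.h>
    #include <stdint.h>

    struct obd_device;

    struct obd_ops {
        int (*create)(struct obd_device *dev, uint64_t *objid);
        int (*read)(struct obd_device *dev, uint64_t objid,
                    void *buf, size_t len, uint64_t off);
    };

    struct obd_device {
        const struct obd_ops *ops;   /* one API for every driver             */
        struct obd_device **below;   /* devices this driver stacks on top of */
        int nbelow;
    };

    /* A striping (LOV-like) logical driver: pick a lower device, forward. */
    static int stripe_read(struct obd_device *dev, uint64_t objid,
                           void *buf, size_t len, uint64_t off)
    {
        struct obd_device *target = dev->below[objid % dev->nbelow];
        return target->ops->read(target, objid, buf, len, off);
    }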

17
Lustre Client's APIs
[Diagram: client API stacks. Data path: Object & Lock APIs; metadata path: Metadata & Lock APIs.]
18
Object Storage API
  • Objects are (usually) unnamed files
  • Improves on the block device API
  • create, destroy, setattr, getattr, read, write
  • OBD driver does block/extent allocation
  • Implementation
  • Linux drivers, using a file system backend
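A minimal sketch of the verb set in C, contrasting it with a block device: the target allocates blocks and extents itself, and callers deal only in object ids, offsets and attributes. Names and signatures are hypothetical, not the real OBD API.

    /* Hypothetical object storage API sketch (illustrative names only). */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    typedef uint64_t objid_t;

    struct obj_attr { uint64_t size; uint32_t uid, gid, mode; };

    struct ost_api {
        int     (*create)(objid_t *new_id);     /* the target picks the id */
        int     (*destroy)(objid_t id);
        int     (*setattr)(objid_t id, const struct obj_attr *attr);
        int     (*getattr)(objid_t id, struct obj_attr *attr);
        ssize_t (*read)(objid_t id, void *buf, size_t len, uint64_t off);
        ssize_t (*write)(objid_t id, const void *buf, size_t len, uint64_t off);
    };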

19
Bringing it all together
[Architecture diagram]
  • Client node: Lustre Client File System, metadata WB cache, OSCs, MDC, lock client, request processing, recovery, NIO API, Portal library, Portal NALs, networking device (Elan, TCP, ...)
  • Client ↔ OST: system & parallel file I/O, file locking
  • Client ↔ MDS: directory metadata & concurrency
  • MDS ↔ OST: recovery, file status, file creation
  • OST and MDS servers: networking, recovery, object-based disk server (OBD server), lock server, backend file system (ext3, Reiser, XFS, ...), Fibre Channel storage
20
File I/O
21
File I/O Write Operation
  • Open file on meta-data server
  • Get information on all objects that are part of
    file
  • Object ids
  • What storage controllers (OST)
  • What part of the file (offset)
  • Striping pattern
  • Create LOV, OSC drivers
  • Use connection to OST
  • Object writes to OST
  • No MDS involvement at all
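The write path above can be summarized in a short C sketch. The mds_open and ost_write helpers and the layout structure are assumed placeholders for the MDC/LOV/OSC drivers, not real Lustre calls.

    /* Hypothetical sketch of the client write path (illustrative names). */
    #include <stddef.h>
    #include <stdint.h>

    struct stripe { int ost_index; uint64_t objid; };   /* one object per OST */
    struct layout { int stripe_count; uint64_t stripe_size; struct stripe *stripes; };

    struct layout *mds_open(const char *path);          /* assumed helper: MDS */
    int ost_write(int ost_index, uint64_t objid, const void *buf,
                  size_t len, uint64_t obj_off);        /* assumed helper: OST */

    /* Assumes the write fits inside a single stripe unit. */
    static int lustre_like_write(const char *path, const void *buf,
                                 size_t len, uint64_t file_off)
    {
        struct layout *lo = mds_open(path);  /* MDS involvement ends here */
        uint64_t unit = file_off / lo->stripe_size;
        struct stripe *s = &lo->stripes[unit % lo->stripe_count];
        uint64_t obj_off = (unit / lo->stripe_count) * lo->stripe_size
                           + file_off % lo->stripe_size;
        return ost_write(s->ost_index, s->objid, buf, len, obj_off);
    }

Everything after the open goes straight from the client to the OSTs that hold the file's objects.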

22
[Diagram: write flow. The Lustre client file system sends a file open request through the MDC to the meta-data server (MDS) and receives the file meta-data, e.g. inode A = (O1, obj1), (O3, obj2); the LOV and OSC 1 / OSC 2 then write obj1 to OST 1 and obj2 to OST 3.]
23
I/O bandwidth
  • 100s of GB/sec → saturate many 100s of OSTs
  • OSTs
  • Do ext3 extent allocation, non-caching direct I/O
  • Lock management spread over cluster
  • Achieve 90-95% of network throughput
  • Single client, single thread, Elan3 write: 269 MB/sec
  • OSTs handle up to 260 MB/sec
  • W/O extent code, on 2-way 2.4 GHz Xeon
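As a rough check on the figures above: at roughly 260 MB/sec per OST, each 100 GB/sec of aggregate bandwidth needs about 100,000 MB/s ÷ 260 MB/s ≈ 385 OSTs, which is why hundreds of GB/sec imply saturating many 100s of OSTs.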

24
Metadata
25
Intent locks & Write Back caching
  • Client-MDS protocol adaptation
  • Low concurrency - write back caching
  • Client in memory updates
  • delayed replay to MDS
  • High concurrency (mostly merged in 2.6)
  • Single network request per transaction
  • No lock revocations to clients
  • Intent based lock includes complete request
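A sketch of the intent idea in C: the lock request carries the complete operation, so the MDS can execute it and grant a lock on the result in a single round trip. The structures and field names are hypothetical, not Lustre's wire format.

    /* Hypothetical intent-carrying lock request (illustrative only). */
    #include <stdint.h>

    enum intent_op { INTENT_LOOKUP, INTENT_OPEN, INTENT_CREATE, INTENT_GETATTR };

    struct lock_desc   { uint64_t resource[2]; uint32_t mode; };
    struct create_args { uint32_t mode, flags; char name[256]; };

    struct intent_lock_request {
        struct lock_desc   lock;     /* what to lock (e.g. the parent dir entry) */
        enum intent_op     op;       /* which operation the lock is for          */
        struct create_args create;   /* complete arguments for that operation    */
    };

One network request per metadata transaction, and no lock revocations back to clients.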

26
(No Transcript)
27
Lustre 1.0
  • Only has high concurrency model
  • Aggregate throughput (1,000 clients)
  • Achieve 5000 file creations (open/close) /sec
  • Achieve 7800 stats in 10 x 1M file directories
  • Single client
  • Around 1500 creations or stats /sec
  • Handling 10M file directories is effortless
  • Many changes to ext3 (all merged in 2.6)

28
Metadata Future
  • Lustre 2.0 (2004)
  • Metadata clustering
  • Common operations will parallelize
  • 100% WB caching in memory or on disk
  • Like AFS

29
Metadata Odds and Ends
  • Logical drivers
  • Local persistent metadata cache, like
    AFS/Coda/InterMezzo
  • Replicated metadata server driver
  • Remotely mirrored MDS
  • Small scale clusters
  • CFS focused on big systems
  • Our drivers & ordinary FS can export all protocols
  • Get shared ext3/Reiser/.. file systems

30
Recovery
31
Recovery approach
  • Keep it simple!
  • Based on failover circles
  • Use existing failover software
  • Left working neighbor is failover node for you
  • At HP we use failover pairs
  • Simplify storage connectivity
  • I/O failure triggers
  • Peer node serves failed OST
  • Retry from client routed to new OST node
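The "left working neighbor" rule can be stated directly in code; this is an illustrative sketch of the rule itself, not CFS's failover tooling.

    /* Failover circle: the failed node's nearest left neighbor that is
     * still alive takes over its OSTs (illustrative sketch only). */
    static int failover_node(int failed, const int alive[], int n_nodes)
    {
        for (int step = 1; step < n_nodes; step++) {
            int candidate = (failed - step + n_nodes) % n_nodes;  /* step left */
            if (alive[candidate])
                return candidate;
        }
        return -1;   /* no working neighbor left in the circle */
    }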

32
OST Server redundant pair
[Diagram: a redundant OST server pair (OST 1, OST 2), each connected through two FC switches to both storage controllers C1 and C2.]
33
Configuration
34
Lustre 1.0
  • Good tools to build configuration
  • Configuration is recorded on MDS
  • Or on dedicated management server
  • Configuration can be changed,
  • 1.0 requires downtime
  • Clients auto configure
  • mount -t lustre -o mds://fileset/sub/dir /mnt/pt
  • SNMP support

35
Futures
36
Advanced Management
  • Snapshots
  • All features you might expect
  • Global namespace
  • Combine best of AFS & autofs4
  • HSM, hot migration
  • Driven by customer demand (we plan XDSM)
  • Online 0-downtime re-configuration
  • Part of Lustre 2.0

37
Security
38
Security
  • Authentication
  • POSIX style authorization
  • NASD style OST authorization
  • Refinement: use OST ACLs and cookies
  • File encryption with group key service
  • STK secure file system

39
Step 1: Authenticate user, get session key
Step 2: Authenticated open RPCs
Step 4: Get OST ACL
Step 5: Send ACL capability cookie
Step 6: Read encrypted file data
Step 7: Get SFS file key
40
CFS Cluster Tools for 2.6
  • Remote serial GDB debugging over UDP
  • Conman UDP consoles for
  • syslog
  • sysrq
  • Core dumps over net or to local disk
  • Many dump format enhancements
  • Analyze dumps with gdb extension (not lcrash)
  • Llanalyze
  • Analyzes distributed Lustre logs

41
Metadata transaction protocol
  • No synchronous I/O unless requested
  • Reply and commit confirmation
  • Lustre covers single component failure
  • Replay of requests is central
  • Preserve transaction sequence
  • Acknowledge replies to remove barriers
  • Avoid cascading aborts
  • In DB parlance: strict execution
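The bookkeeping behind reply-and-commit confirmation can be sketched as follows; the field names are hypothetical and the real Lustre structures differ. Every reply carries the request's transaction number plus the highest transaction the server has committed to disk, and the client keeps anything newer queued, in sequence, for replay.

    /* Hypothetical reply/replay bookkeeping (illustrative names only). */
    #include <stdint.h>
    #include <stdbool.h>

    struct md_reply {
        uint64_t transno;          /* transaction number of this request     */
        uint64_t last_committed;   /* highest transno the server has on disk */
    };

    struct pending_request {
        uint64_t transno;
        bool     acked;                   /* reply acknowledged by the client */
        struct pending_request *next;     /* list kept sorted by transno      */
    };

    /* Drop requests the server has committed; anything newer stays queued so
     * it can be replayed in transno order after failover (freeing omitted). */
    static struct pending_request *
    process_reply(struct pending_request *head, const struct md_reply *rep)
    {
        while (head && head->transno <= rep->last_committed)
            head = head->next;
        return head;
    }

Acknowledging replies is what lets the server retire its own replay state without risking cascading aborts.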

42
Distributed persistent data
  • Happens in many places
  • Inode object creation/removal (MDS/OST)
  • Replicating OSTs
  • Metadata clustering
  • Recovery with replay logs
  • Cancellation of log records
  • Logs ubiquitous in Lustre
  • Recovery, WB caching logs, replication etc.
  • Configuration

43
Project status
44
Lustre Feature Roadmap
Lustre (Lite) 1.0 (Linux 2.4 & 2.6), 2003 | Lustre 2.0 (2.6), 2004   | Lustre 3.0, 2005
Failover MDS                              | Metadata cluster         | Metadata cluster
Basic Unix security                       | Basic Unix security      | Advanced security
File I/O very fast (100s of OSTs)         | Collaborative read cache | Storage management
Intent based scalable metadata            | Write back metadata      | Load balanced MD
POSIX compliant                           | Parallel I/O             | Global namespace
45
Cluster File Systems, Inc.
46
Cluster File Systems
  • Small service company, 20-30 people
  • Software development service (95% Lustre)
  • contract work for Government labs
  • OSS but defense contracts
  • Extremely specialized and extreme expertise
  • we only do file systems and storage
  • Investments - not needed. Profitable.
  • Partners: HP, Dell, DDN, Cray

47
Lustre conclusions
  • Great vehicle for advanced storage software
  • Things are done differently
  • Protocol designs from Coda & InterMezzo
  • Stacking & DB recovery theory applied
  • Leverage existing components
  • Initial signs promising

48
HP & Lustre
  • Two projects
  • ASCI PathForward Hendrix
  • Lustre Storage product
  • Field trial in Q1 of 04

49
Questions?