1
The Lustre Storage Architecture
Linux Clusters for Super Computing, Linköping 2003
  • Peter J. Braam, Tim Reddin
  • braam@clusterfs.com, tim.reddin@hp.com
  • http://www.clusterfs.com

2
Topics
  • History of project
  • High level picture
  • Networking
  • Devices and fundamental APIs
  • File I/O
  • Metadata recovery
  • Project status
  • Cluster File Systems, Inc

3
Lustre's History
4
Project history
  • 1999: CMU & Seagate
  • Worked with Seagate for one year
  • Storage management, clustering
  • Built prototypes, much design
  • Much survives today

5
2000-2002: File system challenge
  • First put forward Sep 1999, Santa Fe
  • New architecture for National Labs
  • Characteristics
  • 100s of GB/sec of I/O throughput
  • trillions of files
  • 10,000s of nodes
  • Petabytes
  • From the start, Garth & Peter in the running

6
2002-2003: fast lane
  • 3-year ASCI Path Forward contract
  • with HP and Intel
  • MCR & ALC, 2x 1000-node Linux clusters
  • PNNL: HP IA64, 1000-node Linux cluster
  • Red Storm, Sandia (8000 nodes, Cray)
  • Lustre Lite 1.0
  • Many partnerships (HP, Dell, DDN, ...)

7
2003: Production, performance
  • Spring and summer
  • LLNL MCR: from no, to partial, to full-time use
  • PNNL similar
  • Stability much improved
  • Performance
  • Summer 2003: I/O problems tackled
  • Metadata much faster
  • Dec/Jan
  • Lustre 1.0

8
High level picture
9
Lustre Systems: Major Components
  • Clients
  • Have access to file system
  • Typical role: compute server
  • OST
  • Object storage targets
  • Handle (stripes of, references to) file data
  • MDS
  • Metadata request transaction engine.
  • Also LDAP, Kerberos, routers etc.

10
[Diagram: Lustre clients (1,000 for Lustre Lite, up to 10,000s) reach Linux OST servers with disk arrays and 3rd-party OST appliances over QSW Net and GigE; the Lustre Object Storage Targets (OSTs) attach to their disk arrays over a SAN.]
11
[Diagram: component interactions. Clients obtain configuration information, network connection details and security management from the configuration/LDAP servers; directory operations, meta-data and concurrency are handled with the MDS; file I/O and file locking go to the OSTs; recovery, file status and file creation run between MDS and OSTs.]
12
Networking
13
Lustre Networking
  • Currently runs over
  • TCP
  • Quadrics Elan 3 & 4
  • Lustre can route & can use heterogeneous nets
  • Beta
  • Myrinet, SCI
  • Under development
  • SAN (FC/iSCSI), I/B
  • Planned
  • SCTP, some special NUMA and other nets

14
Lustre Network Stack - Portals
[Diagram: the network stack, top to bottom]
  • Request processing: 0-copy marshalling libraries, service framework, client request dispatch, connection & address naming, generic recovery infrastructure
  • Portals library (Sandia's API, CFS-improved implementation): moves small & large buffers, remote DMA handling, generates events
  • NAL (Network Abstraction Layer) for TCP, QSW, etc.: small & hard; includes routing API
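To make the NAL idea concrete, here is a minimal C sketch of a network abstraction layer as a small callback table that the Portals layer drives per network type. All names and signatures are hypothetical, for illustration only, and are not the actual Portals or Lustre NAL interfaces.

    /* Hypothetical NAL-style callback table (illustrative names only,
     * not the real Portals/Lustre interfaces). Each network type (TCP,
     * Quadrics/QSW, ...) supplies one of these. */
    #include <stddef.h>
    #include <stdint.h>

    struct nal_peer { uint64_t nid; };      /* network-wide node id */

    struct nal_ops {
        int (*send)(struct nal_peer *peer, const void *buf, size_t len);
        int (*recv)(struct nal_peer *peer, void *buf, size_t len);
        /* remote DMA of a large buffer directly into the peer's memory */
        int (*rdma_put)(struct nal_peer *peer, const void *buf, size_t len,
                        uint64_t remote_offset);
        /* forward a message towards a peer on another network (routing) */
        int (*route)(struct nal_peer *next_hop, const void *msg, size_t len);
    };

The upper layers select the table for the interconnect in use and never touch sockets or Elan primitives directly.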
15
Devices and APIs
16
Lustre Devices & APIs
  • Lustre has numerous driver modules
  • One API - very different implementations
  • Driver binds to named device
  • Stacking devices is key
  • Generalized object devices
  • Drivers currently export several APIs
  • Infrastructure - a mandatory API
  • Object Storage
  • Metadata Handling
  • Locking
  • Recovery
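As a rough illustration of "one API, stacked implementations", the C sketch below shows a logical driver forwarding calls through a common ops table to the devices stacked beneath it. The structures and the striping rule are hypothetical, not actual Lustre code.

    /* Hypothetical sketch of stacked object devices (illustrative names,
     * not Lustre's real structures). Every driver exports the same ops
     * table; logical drivers forward calls to the devices below them. */
    #include <stddef.h>
    #include <stdint.h>

    struct obd_device;

    struct obd_ops {
        int (*create)(struct obd_device *dev, uint64_t *objid);
        int (*read)(struct obd_device *dev, uint64_t objid,
                    void *buf, size_t len, uint64_t off);
    };

    struct obd_device {
        const struct obd_ops *ops;   /* one API for every driver             */
        struct obd_device **below;   /* devices this driver stacks on top of */
        int nbelow;
    };

    /* A striping (LOV-like) logical driver: pick a lower device, forward. */
    static int stripe_read(struct obd_device *dev, uint64_t objid,
                           void *buf, size_t len, uint64_t off)
    {
        struct obd_device *target = dev->below[objid % dev->nbelow];
        return target->ops->read(target, objid, buf, len, off);
    }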

17
Lustre Client's APIs
[Diagram: client API stacks. Data path: Object & Lock APIs; metadata path: Metadata & Lock APIs.]
18
Object Storage API
  • Objects are (usually) unnamed files
  • Improves on the block device API
  • create, destroy, setattr, getattr, read, write
  • OBD driver does block/extent allocation
  • Implementation
  • Linux drivers, using a file system backend
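A minimal sketch of the verb set in C, contrasting it with a block device: the target allocates blocks and extents itself, and callers deal only in object ids, offsets and attributes. Names and signatures are hypothetical, not the real OBD API.

    /* Hypothetical object storage API sketch (illustrative names only). */
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>

    typedef uint64_t objid_t;

    struct obj_attr { uint64_t size; uint32_t uid, gid, mode; };

    struct ost_api {
        int     (*create)(objid_t *new_id);     /* the target picks the id */
        int     (*destroy)(objid_t id);
        int     (*setattr)(objid_t id, const struct obj_attr *attr);
        int     (*getattr)(objid_t id, struct obj_attr *attr);
        ssize_t (*read)(objid_t id, void *buf, size_t len, uint64_t off);
        ssize_t (*write)(objid_t id, const void *buf, size_t len, uint64_t off);
    };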

19
Bringing it all together
[Architecture diagram]
  • Client node: Lustre Client File System, metadata WB cache, OSCs, MDC, lock client, request processing, recovery, NIO API, Portal library, Portal NALs, networking device (Elan, TCP, ...)
  • Client ↔ OST: system & parallel file I/O, file locking
  • Client ↔ MDS: directory metadata & concurrency
  • MDS ↔ OST: recovery, file status, file creation
  • OST and MDS servers: networking, recovery, object-based disk server (OBD server), lock server, backend file system (ext3, Reiser, XFS, ...), Fibre Channel storage
20
File I/O
21
File I/O Write Operation
  • Open file on meta-data server
  • Get information on all objects that are part of
    file
  • Object ids
  • What storage controllers (OST)
  • What part of the file (offset)
  • Striping pattern
  • Create LOV, OSC drivers
  • Use connection to OST
  • Object writes to OST
  • No MDS involvement at all
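The write path above can be summarized in a short C sketch. The mds_open and ost_write helpers and the layout structure are assumed placeholders for the MDC/LOV/OSC drivers, not real Lustre calls.

    /* Hypothetical sketch of the client write path (illustrative names). */
    #include <stddef.h>
    #include <stdint.h>

    struct stripe { int ost_index; uint64_t objid; };   /* one object per OST */
    struct layout { int stripe_count; uint64_t stripe_size; struct stripe *stripes; };

    struct layout *mds_open(const char *path);          /* assumed helper: MDS */
    int ost_write(int ost_index, uint64_t objid, const void *buf,
                  size_t len, uint64_t obj_off);        /* assumed helper: OST */

    /* Assumes the write fits inside a single stripe unit. */
    static int lustre_like_write(const char *path, const void *buf,
                                 size_t len, uint64_t file_off)
    {
        struct layout *lo = mds_open(path);  /* MDS involvement ends here */
        uint64_t unit = file_off / lo->stripe_size;
        struct stripe *s = &lo->stripes[unit % lo->stripe_count];
        uint64_t obj_off = (unit / lo->stripe_count) * lo->stripe_size
                           + file_off % lo->stripe_size;
        return ost_write(s->ost_index, s->objid, buf, len, obj_off);
    }

Everything after the open goes straight from the client to the OSTs that hold the file's objects.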

22
[Diagram: write flow. The Lustre client file system sends a file open request through the MDC to the meta-data server (MDS) and receives the file meta-data, e.g. inode A = (O1, obj1), (O3, obj2); the LOV and OSC 1 / OSC 2 then write obj1 to OST 1 and obj2 to OST 3.]
23
I/O bandwidth
  • 100s of GB/sec → saturate many 100s of OSTs
  • OSTs
  • Do ext3 extent allocation, non-caching direct I/O
  • Lock management spread over cluster
  • Achieve 90-95% of network throughput
  • Single client, single thread, Elan3 write: 269 MB/sec
  • OSTs handle up to 260 MB/sec
  • W/O extent code, on 2-way 2.4 GHz Xeon
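As a rough check on the figures above: at roughly 260 MB/sec per OST, each 100 GB/sec of aggregate bandwidth needs about 100,000 MB/s ÷ 260 MB/s ≈ 385 OSTs, which is why hundreds of GB/sec imply saturating many 100s of OSTs.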

24
Metadata
25
Intent locks & Write Back caching
  • Client-MDS protocol adaptation
  • Low concurrency - write back caching
  • Client in memory updates
  • delayed replay to MDS
  • High concurrency (mostly merged in 2.6)
  • Single network request per transaction
  • No lock revocations to clients
  • Intent based lock includes complete request
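A sketch of the intent idea in C: the lock request carries the complete operation, so the MDS can execute it and grant a lock on the result in a single round trip. The structures and field names are hypothetical, not Lustre's wire format.

    /* Hypothetical intent-carrying lock request (illustrative only). */
    #include <stdint.h>

    enum intent_op { INTENT_LOOKUP, INTENT_OPEN, INTENT_CREATE, INTENT_GETATTR };

    struct lock_desc   { uint64_t resource[2]; uint32_t mode; };
    struct create_args { uint32_t mode, flags; char name[256]; };

    struct intent_lock_request {
        struct lock_desc   lock;     /* what to lock (e.g. the parent dir entry) */
        enum intent_op     op;       /* which operation the lock is for          */
        struct create_args create;   /* complete arguments for that operation    */
    };

One network request per metadata transaction, and no lock revocations back to clients.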

26
(No Transcript)
27
Lustre 1.0
  • Only has high concurrency model
  • Aggregate throughput (1,000 clients)
  • Achieve 5000 file creations (open/close) /sec
  • Achieve 7800 stats in 10 x 1M file directories
  • Single client
  • Around 1500 creations or stats /sec
  • Handling 10M file directories is effortless
  • Many changes to ext3 (all merged in 2.6)

28
Metadata Future
  • Lustre 2.0 (2004)
  • Metadata clustering
  • Common operations will parallelize
  • 100% WB caching in memory or on disk
  • Like AFS

29
Metadata Odds and Ends
  • Logical drivers
  • Local persistent metadata cache, like
    AFS/Coda/InterMezzo
  • Replicated metadata server driver
  • Remotely mirrored MDS
  • Small scale clusters
  • CFS focused on big systems
  • Our drivers & ordinary FS can export all protocols
  • Get shared ext3/Reiser/.. file systems

30
Recovery
31
Recovery approach
  • Keep it simple!
  • Based on failover circles
  • Use existing failover software
  • Left working neighbor is failover node for you
  • At HP we use failover pairs
  • Simplify storage connectivity
  • I/O failure triggers
  • Peer node serves failed OST
  • Retry from client routed to new OST node
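The "left working neighbor" rule can be stated directly in code; this is an illustrative sketch of the rule itself, not CFS's failover tooling.

    /* Failover circle: the failed node's nearest left neighbor that is
     * still alive takes over its OSTs (illustrative sketch only). */
    static int failover_node(int failed, const int alive[], int n_nodes)
    {
        for (int step = 1; step < n_nodes; step++) {
            int candidate = (failed - step + n_nodes) % n_nodes;  /* step left */
            if (alive[candidate])
                return candidate;
        }
        return -1;   /* no working neighbor left in the circle */
    }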

32
OST Server redundant pair
[Diagram: a redundant OST server pair (OST 1, OST 2), each connected through two FC switches to both storage controllers C1 and C2.]
33
Configuration
34
Lustre 1.0
  • Good tools to build configuration
  • Configuration is recorded on MDS
  • Or on dedicated management server
  • Configuration can be changed,
  • 1.0 requires downtime
  • Clients auto configure
  • mount -t lustre -o mds://fileset/sub/dir /mnt/pt
  • SNMP support

35
Futures
36
Advanced Management
  • Snapshots
  • All features you might expect
  • Global namespace
  • Combine best of AFS & autofs4
  • HSM, hot migration
  • Driven by customer demand (we plan XDSM)
  • Online 0-downtime re-configuration
  • Part of Lustre 2.0

37
Security
38
Security
  • Authentication
  • POSIX style authorization
  • NASD style OST authorization
  • Refinement: use OST ACLs and cookies
  • File encryption with group key service
  • STK secure file system

39
Step 1: Authenticate user, get session key
Step 2: Authenticated open RPCs
Step 4: Get OST ACL
Step 5: Send ACL capability cookie
Step 6: Read encrypted file data
Step 7: Get SFS file key
40
CFS Cluster Tools for 2.6
  • Remote serial GDB debugging over UDP
  • Conman UDP consoles for
  • syslog
  • sysrq
  • Core dumps over net or to local disk
  • Many dump format enhancements
  • Analyze dumps with gdb extension (not lcrash)
  • Llanalyze
  • Analyzes distributed Lustre logs

41
Metadata transaction protocol
  • No synchronous I/O unless requested
  • Reply and commit confirmation
  • Lustre covers single component failure
  • Replay of requests is central
  • Preserve transaction sequence
  • Acknowledge replies to remove barriers
  • Avoid cascading aborts
  • In DB parlance: strict execution
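The bookkeeping behind reply-and-commit confirmation can be sketched as follows; the field names are hypothetical and the real Lustre structures differ. Every reply carries the request's transaction number plus the highest transaction the server has committed to disk, and the client keeps anything newer queued, in sequence, for replay.

    /* Hypothetical reply/replay bookkeeping (illustrative names only). */
    #include <stdint.h>
    #include <stdbool.h>

    struct md_reply {
        uint64_t transno;          /* transaction number of this request     */
        uint64_t last_committed;   /* highest transno the server has on disk */
    };

    struct pending_request {
        uint64_t transno;
        bool     acked;                   /* reply acknowledged by the client */
        struct pending_request *next;     /* list kept sorted by transno      */
    };

    /* Drop requests the server has committed; anything newer stays queued so
     * it can be replayed in transno order after failover (freeing omitted). */
    static struct pending_request *
    process_reply(struct pending_request *head, const struct md_reply *rep)
    {
        while (head && head->transno <= rep->last_committed)
            head = head->next;
        return head;
    }

Acknowledging replies is what lets the server retire its own replay state without risking cascading aborts.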

42
Distributed persistent data
  • Happens in many places
  • Inode object creation/removal (MDS/OST)
  • Replicating OSTs
  • Metadata clustering
  • Recovery with replay logs
  • Cancellation of log records
  • Logs ubiquitous in Lustre
  • Recovery, WB caching logs, replication etc.
  • Configuration

43
Project status
44
Lustre Feature Roadmap
Lustre (Lite) 1.0 (Linux 2.4 & 2.6), 2003 | Lustre 2.0 (2.6), 2004   | Lustre 3.0, 2005
Failover MDS                              | Metadata cluster         | Metadata cluster
Basic Unix security                       | Basic Unix security      | Advanced security
File I/O very fast (100s of OSTs)         | Collaborative read cache | Storage management
Intent based scalable metadata            | Write back metadata      | Load balanced MD
POSIX compliant                           | Parallel I/O             | Global namespace
45
Cluster File Systems, Inc.
46
Cluster File Systems
  • Small service company, 20-30 people
  • Software development service (95% Lustre)
  • contract work for Government labs
  • OSS but defense contracts
  • Extremely specialized and extreme expertise
  • we only do file systems and storage
  • Investments - not needed. Profitable.
  • Partners: HP, Dell, DDN, Cray

47
Lustre conclusions
  • Great vehicle for advanced storage software
  • Things are done differently
  • Protocol designs from Coda & InterMezzo
  • Stacking & DB recovery theory applied
  • Leverage existing components
  • Initial signs promising

48
HP & Lustre
  • Two projects
  • ASCI PathForward Hendrix
  • Lustre Storage product
  • Field trial in Q1 of 04

49
Questions?