Title: Conquest: Preparing for Life After Disks
1. Conquest: Preparing for Life After Disks
- CS239 Seminar
- October 24, 2002
- An-I Andy Wang
- University of California, Los Angeles
2. Conquest Overview
- File systems are optimized for disks
- Performance problem
- Complexity
- Now we have tons of inexpensive RAM
- What can we do with that RAM?
3. Conquest Approach
- Combine disk and persistent RAM (e.g., battery-backed RAM) in a novel way
- Simplification
- > 20% fewer semicolons than ext2, reiserfs, and SGI XFS
- Performance (under popular benchmarks)
- 24% to 1900% faster than LRU disk caching
4. Outline of the Talk
- Motivation
- Conquest design (high level)
- Conquest components
- Performance evaluation
- Conclusion
Motivation Conquest Design Conquest
Components Performance Evaluation Conclusion
5. Motivation
- Most file systems are built for disks
- Problems with the disk assumption
- Performance
- Complexity
6. Hardware Evolution
[Figure: accesses per second (log scale, 1 KHz to 1 GHz) versus year (1990 to 2000) for CPU (50%/yr), memory (50%/yr), and disk (15%/yr); gap annotations: (1 sec : 6 days) and (1 sec : 3 months)]
7. Inside Pandora's Box
- Access time = seek time (disk arm) + rotational delay (disk platter) + transfer time
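To make the three components of disk access time concrete, the following minimal sketch adds them up for a single request. The drive parameters (9 ms average seek, 7200 RPM, 40 MB/s transfer rate, 4 KB request) and the function name are illustrative assumptions, not figures from the talk.

```python
def disk_access_time_ms(seek_ms, rpm, transfer_mb_per_s, request_bytes):
    """Access time = seek time + rotational delay + transfer time (in ms)."""
    # On average the platter turns half a revolution before the target
    # sector passes under the head.
    rotational_delay_ms = 0.5 * 60_000.0 / rpm
    # Time to move the requested bytes once the head is in position.
    transfer_ms = request_bytes / (transfer_mb_per_s * 1_000_000.0) * 1000.0
    return seek_ms + rotational_delay_ms + transfer_ms


# Hypothetical drive (assumed numbers, not from the slides).
total = disk_access_time_ms(seek_ms=9.0, rpm=7200,
                            transfer_mb_per_s=40.0, request_bytes=4096)
print(f"{total:.2f} ms per 4 KB access")
```

With these assumed numbers the mechanical parts (seek and rotation) account for over 99% of the roughly 13.3 ms total, which is the overhead that RAM-resident files avoid.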
8. Disk Optimization Methods
- Disk arm scheduling
- Group information on disk
- Disk readahead
- Buffered writes
- Disk caching
- Data mirroring
- Hardware parallelism
9. Complexity Bytes
10. Storage Media Alternatives
[Figure: storage media plotted by cost in $/MB (log scale) against accesses/sec (log scale); data points not recoverable]
11. Price Trend of Persistent RAM
[Figure: price of persistent RAM in $/MB (log scale, 10^2 down to 10^-2) by year, 1995 to 2005]
12. Old Order New World
- Disk will stay around
- Cost, capacity, power, heat
- RAM as a viable storage alternative
- PDAs, digital cameras, MP3 players
- More architectural changes due to RAM
- A big assumption change from disk
- Rethink data structures, interfaces, applications
13. Getting a Fresh Start
- What does it take to design and build a system
that assumes ample persistent RAM as the primary
storage medium?
14. Conquest Design
- Design and build a disk/persistent-RAM hybrid file system
- Deliver all file system services from memory, with the exception of high-capacity storage
- Two separate data paths to memory and disk
- Benefits
- Simplicity
- Performance
15. Simplicity
- Remove disk-related complexities for most files
- Make things simpler for disk as well
- Less complexity
- Fewer bugs
- Easier maintenance
- Shorter data paths
16. Performance
- Overall
- All management performed in memory
- Memory data path
- No disk-related overhead
- Disk data path
- Faster speed due to simpler access models
17. Conquest Components
- Media management
- Metadata representation
- Directory service
- Allocation service
- Persistence support
- Resiliency support
18. User Access Patterns
- Small files
- Take little space (10%)
- Represent most accesses (90%)
- Large files
- Take most space
- Mostly sequential accesses
- Not characteristic of database applications
19. Files Stored in Persistent RAM
- Small files (< 1 MB)
- No seek time or rotational delays
- Fast byte-level accesses
- Contiguous allocation
- Metadata
- Fast synchronous update
- No dual representations
- Executables and shared libraries
- In-place execution
20. Memory Data Path of Conquest
[Diagram: Conventional File Systems data path, layers as labeled: storage requests; IO buffer management; IO buffer; persistence support; disk management; disk]
21. Large-File-Only Disk Storage
- Allocate in big chunks
- Lower access overhead
- Reduced management overhead
- No fragmentation management
- No tricks for small files
- Storing data in metadata
- No elaborate data structures
- Wrapping a balanced tree onto disk cylinders
22. Sequential-Access Large Files
- Sequential disk accesses
- Near-raw bandwidth
- Well-defined readahead semantics
- Read-mostly
- Little synchronization overhead (between memory
and disk)
23. Disk Data Path of Conquest
[Diagram: two data paths side by side, layers as labeled]
Conventional File Systems: storage requests; IO buffer management; IO buffer; persistence support; disk management; disk
Conquest Disk Data Path: storage requests; IO buffer management; battery-backed RAM; IO buffer; small file and metadata storage; disk management; disk; large-file-only file system
24. Random-Access Large Files
- Random access?
- Common definition: nonsequential access
- A typical movie has 150 scene changes
- MP3 stores the title at the end of the files
- Near sequential access?
- Simplifies large-file metadata representation
significantly
25. Logical File Representation
File
26. Physical File Representation
Name(s)
File
27. Ext2 Data Representation
i-node
28. Disadvantages with Ext2 Design
- Designed for disk storage
- Optimization for small files makes things complex
- Random-access data structure for large files that are accessed mostly sequentially
- Data access time dependent on the byte position in a file
- Maximum file size is limited
29. Conquest Representation
- Persistent RAM
- Hash(file name) = location of data
- Offset(location of data)
- Disk storage
- Per-file, doubly linked list of disk block
segments (stored in persistent RAM)
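The two representations on this slide can be sketched roughly as follows. This is an illustration under stated assumptions, not Conquest's kernel code: the names `Segment`, `LargeFile`, `in_core`, and `read_in_core` are hypothetical, a Python dict stands in for the hash on file names, and block numbers are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Segment:
    """One contiguous run of disk blocks belonging to a large file."""
    first_block: int                      # disk block where the run starts
    num_blocks: int                       # length of the run, in blocks
    prev: Optional["Segment"] = field(default=None, repr=False, compare=False)
    next: Optional["Segment"] = field(default=None, repr=False, compare=False)


class LargeFile:
    """Per-file doubly linked list of segments; the list itself lives in RAM."""

    def __init__(self) -> None:
        self.head: Optional[Segment] = None
        self.tail: Optional[Segment] = None

    def append_segment(self, first_block: int, num_blocks: int) -> None:
        seg = Segment(first_block, num_blocks, prev=self.tail)
        if self.tail is None:
            self.head = seg
        else:
            self.tail.next = seg
        self.tail = seg

    def disk_block_for(self, file_block: int) -> int:
        """Map a block index within the file to a disk block number.

        Worst case this walks every segment: a sequential search through
        memory, with no extra disk accesses.
        """
        if file_block < 0:
            raise IndexError("negative block index")
        seg = self.head
        while seg is not None:
            if file_block < seg.num_blocks:
                return seg.first_block + file_block
            file_block -= seg.num_blocks
            seg = seg.next
        raise IndexError("block index is past the end of the file")


# In-core files: hashing the name yields the data location, and a byte
# offset indexes straight into that data.
in_core: Dict[str, bytearray] = {"notes.txt": bytearray(b"persistent RAM")}


def read_in_core(name: str, offset: int, length: int) -> bytes:
    data = in_core[name]                  # hash(file name) -> location of data
    return bytes(data[offset:offset + length])


movie = LargeFile()
movie.append_segment(first_block=1000, num_blocks=256)
movie.append_segment(first_block=5000, num_blocks=256)
print(read_in_core("notes.txt", 11, 3))
print(movie.disk_block_for(300))
```

Note that nothing in the segment list bounds the file size other than the storage available for segments, and that locating a random position costs at most one pass over an in-memory list.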
30. Advantages of Conquest Design
- Direct data access for in-core files
- Worst case: sequential memory search for random disk locations
- Maximum file size limited by physical storage
31. Directory Service
- Requirements
- Fast sequential traversal (e.g., ls)
- Fast random lookup (e.g., locate file x)
- Hard links (apply multiple names to data)
32. First Design
- A doubly hashed table for each directory
- Conserves space
- Problems
- Dynamic resizing of directories
- Need to handle the current file position
- Important for rm -fr
33. Second Design
- A variant of extensible hash table for each directory
- An old data structure fits nicely
34. Additional Engineering Details
- Popular hash functions randomize lower bits
- Dynamic file positioning
- Need to handle collisions
- Memory overhead and complexity tradeoffs
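As a rough illustration of the per-directory hash table discussed on the last two slides, here is a minimal sketch of textbook extendible hashing indexed by the low bits of the hash. It is not Conquest's variant: the class names, the bucket capacity, the use of `zlib.crc32`, and the depth cap are all assumptions made for the example, and dynamic file positioning is not modeled.

```python
import zlib

MAX_DEPTH = 16  # assumed cap; past this, buckets simply overflow


class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth
        self.capacity = capacity
        self.items = {}                   # file name -> metadata ID


class ExtendibleDirectory:
    def __init__(self, bucket_capacity=4):
        self.global_depth = 0
        self.table = [Bucket(0, bucket_capacity)]

    @staticmethod
    def _hash(name):
        return zlib.crc32(name.encode("utf-8"))

    def _slot(self, name):
        # Index the table with the low global_depth bits of the hash.
        return self._hash(name) & ((1 << self.global_depth) - 1)

    def lookup(self, name):
        """Fast random lookup: one hash, one table index, one bucket probe."""
        return self.table[self._slot(name)].items.get(name)

    def insert(self, name, metadata_id):
        while True:
            bucket = self.table[self._slot(name)]
            if (name in bucket.items or len(bucket.items) < bucket.capacity
                    or bucket.local_depth >= MAX_DEPTH):
                bucket.items[name] = metadata_id
                return
            self._split(bucket)           # make room, then retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the table; slot i and slot i + old_size share a bucket.
            self.table = self.table + self.table
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, bucket.capacity)
        new_bit = 1 << (bucket.local_depth - 1)
        old_items, bucket.items = bucket.items, {}
        for name, value in old_items.items():
            target = sibling if self._hash(name) & new_bit else bucket
            target.items[name] = value
        for i, b in enumerate(self.table):
            if b is bucket and i & new_bit:
                self.table[i] = sibling

    def names(self):
        """Sequential traversal (as for ls): visit each bucket once."""
        seen = set()
        for bucket in self.table:
            if id(bucket) not in seen:
                seen.add(id(bucket))
                yield from bucket.items


d = ExtendibleDirectory()
for i in range(200):
    d.insert(f"file{i}", i)
print(d.lookup("file42"), len(list(d.names())))
```

Growth splits only the bucket that overflowed, so a directory resizes incrementally instead of rehashing every entry at once.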
35. Metadata Allocation
- Requirements
- Keep track of usage status of metadata entries
- Avoid duplicate allocation with unique IDs
- Fast retrieval of metadata with a given ID
36. Existing Memory Allocation
- Services
- Keep track of unallocated memory
- No duplicate allocation of physical addresses
- Hmm
37. Conquest Metadata Management
- Metadata memory allocated by memory manager
- Metadata ID = physical address of metadata
ADDR 0xe000000 free
ADDR 0xe000038 in use
ADDR 0xe000070 free
ADDR 0xe0000A8 free
ADDR 0xe0000E0 free
ADDR 0xe000118 in use
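The idea of using an entry's address as its ID can be sketched with a toy allocator. This is a simulation under assumptions: `MetadataRegion` is a hypothetical name, a Python list stands in for the kernel memory manager, and the base address and 0x38-byte entry size are read off the address listing above.

```python
class MetadataRegion:
    """Fixed-size metadata entries carved out of one contiguous region."""

    def __init__(self, base_addr, entry_size, num_entries):
        self.base_addr = base_addr
        self.entry_size = entry_size
        self.in_use = [False] * num_entries
        self.entries = [None] * num_entries

    def _index(self, metadata_id):
        # Retrieval by ID is pure arithmetic on the address.
        offset = metadata_id - self.base_addr
        index, remainder = divmod(offset, self.entry_size)
        if offset < 0 or remainder or index >= len(self.in_use):
            raise KeyError(hex(metadata_id))
        return index

    def allocate(self, metadata):
        """Claim a free entry; its address doubles as a unique ID."""
        index = self.in_use.index(False)  # raises ValueError when full
        self.in_use[index] = True
        self.entries[index] = metadata
        return self.base_addr + index * self.entry_size

    def lookup(self, metadata_id):
        index = self._index(metadata_id)
        if not self.in_use[index]:
            raise KeyError(hex(metadata_id))
        return self.entries[index]

    def free(self, metadata_id):
        index = self._index(metadata_id)
        self.in_use[index] = False
        self.entries[index] = None


region = MetadataRegion(base_addr=0xE000000, entry_size=0x38, num_entries=6)
first = region.allocate({"name": "a"})
second = region.allocate({"name": "b"})
region.free(first)                        # 0xe000000 free, 0xe000038 in use
print(hex(second), region.lookup(second))
```

Because the allocator already guarantees that no two live allocations share an address, the three requirements from the Metadata Allocation slide (usage tracking, unique IDs, fast retrieval) come without a separate ID table.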
38. Persistence Support
- Restore file system states after a reboot
- Data
- Metadata
- Memory manager
- Keep track of metadata allocation
39. Linux Memory Manager (1)
- Page allocator maintains individual pages
Page allocator
40. Linux Memory Manager (2)
- Zone allocator allocates memory in power-of-two
sizes
Page allocator
41. Linux Memory Manager (3)
- Slab allocator groups allocations by sizes to
reduce internal memory fragmentation
Zone allocator
Page allocator
42. Linux Memory Manager (4)
- Difficult to restore the persistent states
- Three layers of pointer-rich mappings
- Mixing of persistent and temporary allocations
Slab allocator
Zone allocator
Page allocator
43. Conquest Persistence
- Create memory zones with own instantiations of
memory managers
Slab allocator
Zone allocator
Page allocator
44. Conquest Persistence
- Encapsulate all pointers within each zone
- Pointers can survive reboots
- No serialization and deserialization
- Swapping and paging
- Disabled for Conquest memory zones
- Enabled for non-Conquest zones
45. Resiliency Support
- Instantaneous metadata commit
- No fsck (ad hoc metadata consistency check)
- Built-in checkpointing
- Pointer-switch commit semantics
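One common reading of pointer-switch commit semantics is that an update is prepared off to the side and becomes visible by redirecting a single reference. The sketch below illustrates that general idea only; the slides do not show how Conquest implements it, and `MetadataStore` and its methods are hypothetical names.

```python
class MetadataStore:
    """Holds one committed version of a metadata record."""

    def __init__(self, initial):
        self._current = dict(initial)     # the committed version

    def read(self):
        return self._current

    def update(self, mutate):
        shadow = dict(self._current)      # private copy of committed state
        mutate(shadow)                    # may fail; committed state untouched
        self._current = shadow            # single pointer switch = commit


store = MetadataStore({"size": 10, "links": 1})


def interrupted_update(meta):
    meta["size"] = 99
    raise RuntimeError("simulated failure before commit")


try:
    store.update(interrupted_update)
except RuntimeError:
    pass
print(store.read())                       # still the old, consistent version

store.update(lambda meta: meta.update(size=20))
print(store.read())
```

A failure before the switch leaves the old version intact and a failure after it leaves the new one, so in this model there is no intermediate state for a consistency checker to repair.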
46. Implementation Status
- Kernel module under Linux 2.4.2
- Fully functional and POSIX compliant
- Modified memory manager to support Conquest persistence
- Need to overcome BIOS limitations for distribution
- Looking for licensing opportunities
47. Performance Evaluation
- Architectural simplification
- Feature count
- Performance improvement
- Memory-only workload
- Memory and disk workload
48. Conventional Data Path
- Buffer allocation management
- Buffer garbage collection
- Data caching
- Metadata caching
- Predictive readahead
- Write behind
- Cache replacement
- Metadata allocation
- Metadata placement
- Metadata translation
- Disk layout
- Fragmentation management
[Diagram: Conventional File Systems data path, layers as labeled: storage requests; IO buffer management; IO buffer; persistence support; disk management; disk]
49. Memory Path of Conquest
- Buffer allocation management
- Buffer garbage collection
- Data caching
- Metadata caching
- Predictive readahead
- Write behind
- Cache replacement
- Metadata allocation
- Metadata placement
- Metadata translation
- Disk layout
- Fragmentation management
[Diagram: Conquest Memory Data Path, layers as labeled: storage requests; persistence support; battery-backed RAM; small file and metadata storage]
- Memory manager encapsulation
50. Disk Path of Conquest
- Buffer allocation management
- Buffer garbage collection
- Data caching
- Metadata caching
- Predictive readahead
- Write behind
- Cache replacement
- Metadata allocation
- Metadata placement
- Metadata translation
- Disk layout
- Fragmentation management
[Diagram: Conquest Disk Data Path, layers as labeled: storage requests; IO buffer management; battery-backed RAM; IO buffer; small file and metadata storage; disk management; disk; large-file-only file system]
51. PostMark Benchmark (1)
- Conquest is comparable to ramfs
- At least 24% faster than the LRU disk cache
- ISP workload (emails, web-based transactions)
40 to 250 MB working set with 2 GB physical RAM
52. PostMark Benchmark (2)
- When both memory and disk components are
exercised, Conquest can be several times faster
than ext2fs, reiserfs, and SGI XFS
10,000 files, 80 MB to 3.5 GB working set with 2
GB physical RAM
53. PostMark Benchmark (3)
- When working set > RAM, Conquest is 1.4 to 2 times faster than ext2fs, reiserfs, and SGI XFS
10,000 files, 80 MB to 3.5 GB working set with 2
GB physical RAM
54. Sprite LFS Microbenchmarks (1)
- Small-file benchmark
- Operates on 10,000 1-KB files in three phases
55. Sprite LFS Microbenchmarks (2)
- Modified large-file microbenchmark: ten 1-MB files (Conquest in-core files)
56. Sprite LFS Microbenchmarks (3)
- Modified large-file microbenchmark: ten 1.01-MB files (Conquest on-disk files)
57. Sprite LFS Microbenchmarks (4)
- Large-file microbenchmark: forty 100-MB files (Conquest on-disk files)
58. History's Mystery
- Puzzling Microbenchmark Numbers
Geoffrey Kuenning: "If Conquest is slower than ext2, I will toss you off of the balcony"
59. With me hanging off a balcony
- Original large-file microbenchmark: 1-MB file (Conquest in-core file)
60. Odd Microbenchmark Numbers
- Why are random reads slower than sequential
reads?
61. Odd Microbenchmark Numbers
- Why are RAM-based file systems slower than
disk-based file systems?
62. A Series of Hypotheses
- Warm-up effect?
- Maybe
- Why do RAM-based systems warm up slower?
- Bad initial states?
- No
- Pentium III streaming IO option?
- No
63. Effects of Cache Footprint Sizes
[Figure: two cache footprints compared; details not recoverable]
64. LFS Sprite Microbenchmarks
- Modified large-file microbenchmark: ten 1-MB files (Conquest in-core files)
- Random accesses are faster than sequential accesses due to cache reuse
65. LFS Sprite Microbenchmarks (2)
- Modified large-file microbenchmark: ten 128-KB files (Conquest in-core files)
- Random accesses are slower than sequential accesses due to the extra lseek
66. Lessons Learned
- Faster than LRU caching, unexpected
- Heavyweight disk handling
- Severe penalty for accessing memory content
- Matching user access patterns to storage media offers considerable simplification and better performance
- Not an automatic result
- Need careful design
67. More Lessons Learned
- Effects of L2 caching become highly visible in memory workloads (modern workloads)
- Cannot blindly apply existing disk-based microbenchmarks to measure memory performance of file systems
- Need to consider states of L2 cache and memory behaviors at each stage of microbenchmarking
68. Additional Lessons Learned
- Don't discuss your performance numbers next to a balcony... unless...
69. Related Work (1)
- Disk caching
- Assumption of scarce memory
- Complex mechanisms to maintain consistency
- Especially with the presence of metadata
- RAM drives and RAM file systems
- Not meant to be persistent
- Use disk-related mechanisms
- Limitations on storage capacity
70. Related Work (2)
- Disk emulators
- RAM storage accessed through SCSI interface
- Ad hoc approaches
- Manual transferring of files to and from ramfs
- Capacity limitation
- Background daemon to stage RAM files to a disk
- Semantic and name space problems
71. Going Beyond Conquest (1)
- Matching usage patterns with heterogeneous machines in the distributed domain
- Specialized tasks for machines within a cluster
- Preferably self-organizing and self-evolving
- State-rich computing
- Caching of runtime data structures
- Similar to /tmp
72. Going Beyond Conquest (2)
- Separate storage of metadata from data
- Association of metadata with data of different fidelity
- Opportunity for hierarchical replication across devices with different calibers
- Benchmarking memory performance of file systems
- Developing new memory benchmarks
73. Contributions
- Demonstrated the feasibility of disk-memory hybrid file systems
- Showed performance does not preclude simplicity
- Pinpointed cache-related problems with modern benchmarks
- Opened doors to many exciting areas of research
74. Conclusion
- Conquest demonstrates how rethinking changes in underlying assumptions can lead to significant architectural and performance improvements
- Radical changes in hardware, applications, and user expectations in the past decade should lead us to rethink other aspects of OS as well.
75. Questions . . .
Conquest: http://lasr.cs.ucla.edu/conquest
Andy Wang: awang_at_cs.ucla.edu