Title: CS597A: Managing and Exploring Large Datasets
1. CS597A: Managing and Exploring Large Datasets
2. About This Seminar
- Goal
- Identify research directions and issues in managing and exploring large datasets
- Plan
- Overview of a few state-of-the-art storage systems
- Reading papers on a few research systems in storage systems, data management, and data exploration
- Discussions of wild ideas
- Define, work on, and present course projects
3. Why Is This Area Interesting? (Where Are the Bottlenecks?)
[Figure labels: Network, Create, Transform, Transmit, Store and Retrieve]
4. Computer Food Chains
Computer systems in the 1980s:
- Supercomputer (Cray, etc.)
- Mini-super (Convex, etc.)
- Mainframe (IBM 370)
- Minicomputer (VAX)
- Workstation (SUN)
- PC
Computer systems in the 1990s and 2000s:
- Supercomputer (Cray, etc.)
- Servers (IBM, SUN)
- PC
- Laptop
- PDA
5. Storage Arrays of Food Chains?
Direct Attached Storage (DAS):
- USB, Microdrive, Flash
- ATA disks
- Super SCSI RAID
- ATA RAID
Storage Area Network (SAN):
- Super SAN storage (EMC, Hitachi, IBM)
- MiniSuper SAN storage (HPQ, startups)
- iSCSI (startups)
Network Attached Storage (NAS):
- PC storage (Dell, Snap!, MSFT SAK boxes)
- Super NAS (NetApp, SUN)
- MiniSuper NAS (startups)
6. Typical General Infrastructures
File servers w/o disks
Storage Area Network
Network
Backup tape library
BCV or 3rd copy (e.g., EMC)
Mirrored storage (e.g., EMC)
Clients
File servers w/ disks
Storage Area Network
Network
Backup tape library
Clients
7. Exponential Growth (Courtesy Jim Gray, Turing Lecture 99)
- Performance/price doubles every 18 months
- 100x per decade
- Progress in next 18 months = ALL previous progress
- New storage = sum of all old storage (ever)
- New processing = sum of all old processing
[Figure label: 15 years ago]
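The doubling arithmetic on this slide can be checked with a short sketch. It assumes idealized, perfectly regular doubling from a starting level of 1 unit; the function names `growth_factor` and `next_gain_and_prior_total` are illustrative and not part of the lecture.

```python
def growth_factor(years, doubling_months=18.0):
    """Improvement factor after `years` if a metric doubles every `doubling_months`."""
    return 2.0 ** (years * 12.0 / doubling_months)


def next_gain_and_prior_total(periods_elapsed):
    """Compare the gain in the next doubling period with the level reached so far.

    Assumes the level starts at 1 unit and doubles once per period.
    Returns (gain during the next period, level reached over all previous periods).
    """
    level_now = 2 ** periods_elapsed
    level_next = 2 ** (periods_elapsed + 1)
    return level_next - level_now, level_now


if __name__ == "__main__":
    # A decade holds 120 / 18 (about 6.67) doublings, which is roughly 100x.
    print(f"Growth over 10 years: about {growth_factor(10):.0f}x")

    # After any number of periods, the next period adds as much as all history so far.
    gain, prior = next_gain_and_prior_total(5)
    print(f"Next-period gain: {gain}, everything before it: {prior}")
```

Because each doubling adds exactly the current level, the capacity added in the next 18 months equals everything accumulated before it, which is the sense in which new storage equals the sum of all old storage.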
8. Disk Density vs. Moore's Law
9. Storage Capacity Grows Fast
10. Raw Storage Is Cheap
- Disk drives beat tapes in 2002 in $/TB (IDC)
- Disk $/TB declines 50%/year
- Tape $/TB declines 29%/year
- But ATA arrays ($/TB) beat tape libraries in 2006 (Gartner)
- Disk system $/TB declines 40%/year
- Tape library $/TB declines 29%/year
[Chart: $/TB over time, with 2002 and 2006 marked]
(Source: Gartner and IDC)
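How a faster decline rate produces a crossover can be sketched as follows. Only the annual decline rates (40% for disk systems, 29% for tape libraries) come from the slide; the starting price ratio of 2.0 is a hypothetical value chosen for illustration, and `years_to_crossover` is an illustrative name.

```python
import math


def years_to_crossover(price_ratio, disk_decline, tape_decline):
    """Years until disk $/TB falls to tape $/TB.

    price_ratio: current disk $/TB divided by tape $/TB (hypothetical input).
    disk_decline, tape_decline: annual fractional price declines, e.g. 0.40 and 0.29.
    Assumes both decline rates stay constant and disk declines faster than tape.
    """
    if disk_decline <= tape_decline:
        raise ValueError("disk must decline faster than tape for a crossover")
    # Each year the disk/tape price ratio shrinks by (1 - disk) / (1 - tape).
    yearly_shrink = (1.0 - disk_decline) / (1.0 - tape_decline)
    return math.log(1.0 / price_ratio) / math.log(yearly_shrink)


if __name__ == "__main__":
    # Hypothetical: disk systems cost twice as much per TB as tape libraries today.
    t = years_to_crossover(2.0, 0.40, 0.29)
    print(f"Crossover after about {t:.1f} years")
```

Under these assumed inputs the gap closes in a few years; a larger starting ratio or a smaller difference in decline rates pushes the crossover further out.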
11. Summary of Storage Trends
- Disk density beats Moore's Law
- Data growth rate follows Moore's Law
- Raw disks are cheap while storage systems are very expensive
- Crossover from tapes to disks
12. How Much Information Is There? (Courtesy Jim Gray, Turing Lecture 99)
- Soon everything can be recorded and indexed
- Most data will never be seen by humans
- Precious resource: human attention
- Auto-summarization and auto-search are key technologies
- www.lesk.com/mlesk/ksg97/ksg.html
[Figure: scale of prefixes Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta, with markers for A Book, A Photo, A Movie, All LoC books (words), All Books MultiMedia, and Everything Recorded]
Small prefixes: 10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli
13. How Much Information Is There? (Hal Varian, Peter Lyman et al. 2001)
- Web has a lot of documents
- Surface web had 2.5B docs, adding 7.5M pages/day
- Deep web had 550B docs, 95% publicly accessible
- Most websites are in English
- 78% of all websites and 96% of e-commerce sites
- E-mail generates a large amount of information
- A white-collar worker receives 40 messages/day
- E-mail information is 500x that of the web every year
14. How Much Information Is There? (Hal Varian, Peter Lyman et al. 2001)
Storage media   TB/year (upper est.)   TB/year (lower est.)   Growth rate (%)
Paper           240                    23                     2
Film            427,216                58,216                 4
Optical         83                     31                     70
Magnetic        1,693,000              577,210                55
15. Challenges in Managing and Exploring Datasets
- Disks behave like a big tape
- Storage is effectively infinitely large
- Ability to get information is slow
- Reliability is far from what we need
- Disks do fail
- Software and humans corrupt data
- Managing storage is difficult
- Storage and data are both growing
- Retrieving data is difficult
- Get what you want
- See what you get
16. Properties of a Research Goal (Jim Gray, 1999)
- Simple to state
- Not obvious how to do it
- Clear benefit
- Progress and solution are testable
- Can be broken into smaller steps
- So that you can see intermediate progress
17. Systems Challenges (Lampson, SOSP Keynote 99)
- Systems that work
- Meeting their specs
- Always available
- Adapting to changing environment
- Evolving while they run
- Made from unreliable components
- Growing without practical limit
- Credible simulations or analysis
- Writing good specs
- Testing
- Performance
- Understanding when it doesn't matter
18. What Should the New World Focus Be? (Hennessy, FCRC Keynote 99)
- Availability
- Both appliance and service
- Maintainability
- Two functions
- Enhancing availability by preventing failure
- Ease of SW and HW upgrades
- Scalability
- Especially of service
- Cost
- Per device and per service transaction
- Performance
- Remains important, but it's not SPECint
19. Tentative Syllabus
- Today: About the course
- Week 2: Read several vision papers
- Week 3: Guest lecture on archival storage
- Week 4: Commercial storage systems (EMC, Veritas, NetApp)
- Week 5: Global-scale storage (OceanStore and the like)
- Week 6: Managing personal data (Coda, Bayou, Personal RAID)
- Week 7: Managing geographical data (TerraServer)
- Week 8: Guest lecture on managing astrophysical data (SkyServer)
- Week 9: Managing and exploring large scientific data
- Week 10: Managing medical data
- Week 11: Managing genomic data
- Week 12: Project reports and presentations
- A detailed, tentative reading list will be available this weekend