Transcript and Presenter's Notes

Title: Building Peta Byte Data Stores


1
Building Peta Byte Data Stores
  • Jim Gray
  • Microsoft Research
  • Research.Microsoft.com/Gray

2
The Asilomar Report on Database Research
Phil Bernstein, Michael Brodie, Stefano Ceri, David DeWitt, Mike Franklin,
Hector Garcia-Molina, Jim Gray, Jerry Held, Joe Hellerstein, H. V. Jagadish,
Michael Lesk, Dave Maier, Jeff Naughton, Hamid Pirahesh, Mike Stonebraker,
and Jeff Ullman (September 1998)
  • "the field needs to radically broaden its
    research focus to attack the issues of capturing,
    storing, analyzing, and presenting the vast array
    of online data"
  • "-- broadening the definition of database
    management to embrace all the content of the Web
    and other online data stores, and rethinking our
    fundamental assumptions in light of technology
    shifts"
  • "encouraging more speculative and long-range
    work, moving conferences to a poster format, and
    publishing all research literature on the Web"
  • http://research.microsoft.com/gray/Asilomar_DB_98.html

3
So, how are we doing?
  • Capture, store, analyze, present terabytes?
  • Making web data accessible?
  • Publishing on the web (CoRR?)
  • Posters-Workshops vs Conferences-Journals?

4
Outline
  • Technology
  • $1M/PB: store everything online (twice!)
  • End-to-end high-speed networks
  • Gigabit to the desktop
  • So You can store everything,
  • Anywhere in the world
  • Online everywhere
  • Research driven by apps
  • TerraServer
  • National Virtual Astronomy Observatory.

5
Reality Check
  • Good news:
  • In the limit, processing, storage, and networking are free
  • Processing and networks are infinitely fast
  • Bad news:
  • Most of us live in the present.
  • People are getting more expensive. Management/programming cost exceeds
    hardware cost.
  • Speed of light is not improving.
  • WAN prices have not changed much in the last 8 years.

6
How Much Information Is There?
  • Soon everything can be recorded and indexed
  • Most data will never be seen by humans
  • Precious resource: human attention. Auto-summarization and auto-search
    are key technologies. www.lesk.com/mlesk/ksg97/ksg.html

(Chart: how much information there is, on a scale running kilo, mega, giga,
tera, peta, exa, zetta, yotta: a book, a photo, a movie, all LoC books
(words), all books plus multimedia, everything ever recorded. The small
prefixes run 10^-3 milli, 10^-6 micro, 10^-9 nano, 10^-12 pico, 10^-15 femto,
10^-18 atto, 10^-21 zepto, 10^-24 yocto.)
7
Trends: ops/s/$ Had Three Growth Phases
  • 1890-1945
  • Mechanical
  • Relay
  • 7-year doubling
  • 1945-1985
  • Tube, transistor,..
  • 2.3 year doubling
  • 1985-2000
  • Microprocessor
  • 1.0 year doubling

8
Storage capacity beating Moore's law
  • $4k/TB today (raw disk)

9
Cheap Storage and/or Balanced System
  • Low-cost storage (2 x $3k servers): ~$6K/TB,
    2x (800 MHz, 256 MB, 8x80 GB disks, 100 MbE)
  • Balanced server ($5k / 0.64 TB):
  • 2 x 800 MHz ($2k)
  • 512 MB
  • 8 x 80 GB drives ($2.4K)
  • Gbps Ethernet switch ($500/port)
  • ~$10k/TB, ~$20K per RAIDed TB (see the arithmetic below)
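A minimal sketch of the price arithmetic behind the balanced-server bullets;
component prices are taken from the slide, and the RAM price is an assumed
placeholder:

    # Back-of-the-envelope cost per terabyte for the "balanced server" above.
    cpu      = 2000        # 2 x 800 MHz processors ($)
    disks    = 2400        # 8 x 80 GB drives ($)
    net_port = 500         # Gbps Ethernet switch, per port ($)
    ram      = 100         # 512 MB, assumed rough price ($)
    server   = cpu + disks + net_port + ram      # ~$5k
    capacity_tb = 8 * 80 / 1000                  # 0.64 TB raw
    print(server / capacity_tb)       # ~$7.8k/TB raw      ("~$10k/TB")
    print(2 * server / capacity_tb)   # ~$15.6k/TB mirrored ("~$20K/RAIDed TB")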

10
Hot Swap Drives for Archive or Data Interchange
  • 35 MBps write (so can write N x 80 GB in 40 minutes)
  • 80 GB overnight
  • N x 3 MB/second
  • @ $19.95/night
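A rough check of those numbers; the 8-hour overnight shipping window is an
assumption, not a figure from the slide:

    # "Sneakernet" bandwidth of a hot-swap drive: fill time and effective rate.
    drive_gb, write_mbps = 80, 35
    print(drive_gb * 1000 / write_mbps / 60)   # ~38 minutes to fill one 80 GB drive
    overnight_s = 8 * 3600                     # assumed courier window
    print(drive_gb * 1000 / overnight_s)       # ~2.8 MB/s per drive; N drives -> ~N x 3 MB/s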

11
The Absurd Disk
  • 2.5 hr scan time (poor sequential access)
  • 1 access per second / 5 GB (VERY cold data)
  • It's a tape!

(Figure: a hypothetical 1 TB drive with 100 MB/s bandwidth and 200 KAPS)
12
Disk vs Tape
  • Disk
  • 80 GB
  • 35 MBps
  • 5 ms seek time
  • 3 ms rotate latency
  • $4/GB for drive, $3/GB for controllers/cabinet
  • 4 TB/rack
  • 1 hour scan
  • Tape
  • 40 GB
  • 10 MBps
  • 10 sec pick time
  • 30-120 second seek time
  • $2/GB for media, $8/GB for drive + library
  • 10 TB/rack
  • 1 week scan

Guesstimates: CERN 200 TB, 3480 tapes, 2 col = 50 GB; rack = 1 TB, 12 drives.
The price advantage of tape is narrowing, and the performance advantage of
disk is growing. At $10K/TB, disk is competitive with nearline tape.
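A sketch of where the "1 hour scan" vs "1 week scan" rack figures come from;
the per-rack disk count and the number of concurrent tape readers are
assumptions, not numbers from the slide:

    # Rack scan time = rack capacity / (per-drive bandwidth x concurrent drives).
    disk_rack_tb, disk_mbps, disks_per_rack = 4, 35, 50    # 50 x 80 GB ~= 4 TB
    tape_rack_tb, tape_mbps, tape_readers   = 10, 10, 2    # assumed 2 concurrent readers
    print(disk_rack_tb * 1e6 / (disk_mbps * disks_per_rack) / 3600)   # ~0.6 h ("1 hour scan")
    print(tape_rack_tb * 1e6 / (tape_mbps * tape_readers) / 86400)    # ~5.8 d ("1 week scan")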
13
It's Hard to Archive a Petabyte: It takes a LONG
time to restore it.
  • At 1 GBps it takes 12 days! (see the arithmetic below)
  • Store it in two (or more) places online (on
    disk?): a geo-plex
  • Scrub it continuously (look for errors)
  • On failure,
  • use the other copy until the failure is repaired,
  • refresh the lost copy from the safe copy.
  • Can organize the two copies differently
    (e.g. one by time, one by space)
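The restore-time and scrub arithmetic, as a minimal sketch; the two-week scrub
interval and the 1000-disk count are illustrative assumptions:

    # Moving a petabyte at 1 GB/s, and the per-disk cost of continuous scrubbing.
    pb, gbps = 1e15, 1e9
    print(pb / gbps / 86400)                     # ~11.6 days to restore 1 PB at 1 GB/s
    disks, scrub_interval_s = 1000, 14 * 86400
    print(pb / disks / scrub_interval_s / 1e6)   # ~0.8 MB/s per disk to re-read it all every 2 weeks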

14
Next step in the Evolution
  • Disks become supercomputers
  • Controller will have 1 bips, 1 GB RAM, 1 GBps net
  • And a disk arm.
  • Disks will run full-blown app/web/db/os stack
  • Distributed computing
  • Processors migrate to transducers.

15
Terabyte (Petabyte) Processing Requires Parallelism
  • parallelism: use many little devices in parallel

16
Parallelism Must Be Automatic
  • There are thousands of MPI programmers.
  • There are hundreds-of-millions of people using
    parallel database search.
  • Parallel programming is HARD!
  • Find design patterns and automate them.
  • Data search/mining has parallel design patterns.

17
Gilder's Law: 3x bandwidth/year for 25 more years
  • Today:
  • 10 Gbps per channel
  • 4 channels per fiber: 40 Gbps
  • 32 fibers/bundle: 1.2 Tbps/bundle
  • In lab: 3 Tbps/fiber (400 x WDM)
  • In theory: 25 Tbps per fiber
  • 1 Tbps: USA 1996 WAN bisection bandwidth
  • Aggregate bandwidth doubles every 8 months!
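Two quick consistency checks on those figures:

    # "3x per year" and "doubles every 8 months" are the same growth rate,
    # and 10 Gbps x 4 channels x 32 fibers gives the per-bundle number.
    import math
    print(12 * math.log(2) / math.log(3))   # ~7.6 months per doubling at 3x/year
    print(10e9 * 4 * 32 / 1e12)             # 1.28 Tbps per 32-fiber bundle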

18
Sense of scale
  • How fat is your pipe?
  • Fattest pipe on MS campus is the WAN!

(Chart: 300 MBps = OC48 / G2 / or memcpy(); 94 MBps coast to coast;
90 MBps PCI; 20 MBps disk / ATM / OC3)
19
(Map: coast-to-coast path, 5626 km, 10 hops, linking Redmond/Seattle WA,
San Francisco CA, Arlington VA, and New York. Partners: Information Sciences
Institute, Microsoft, Qwest, University of Washington, Pacific Northwest
Gigapop, HSCC (high speed connectivity consortium), DARPA.)
20
Outline
  • Technology
  • $1M/PB: store everything online (twice!)
  • End-to-end high-speed networks
  • Gigabit to the desktop
  • So You can store everything,
  • Anywhere in the world
  • Online everywhere
  • Research driven by apps
  • TerraServer
  • National Virtual Astronomy Observatory.

21
Interesting Apps
  • EOS/DIS
  • TerraServer
  • Sloan Digital Sky Survey

Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12 (today, we are here),
Peta 10^15, Exa 10^18
22
The Challenge -- EOS/DIS
  • Antarctica is melting: 77% of fresh water liberated
  • sea level rises 70 meters
  • Chico and Memphis are beach-front property
  • New York, Washington, SF, LA, London, Paris
  • Let's study it! Mission to Planet Earth
  • EOS: Earth Observing System ($17B → $10B)
  • 50 instruments on 10 satellites 1999-2003
  • Landsat (added later)
  • EOS DIS: Data Information System
  • 3-5 MB/s raw, 30-50 MB/s processed
  • 4 TB/day,
  • 15 PB by year 2007 (see the volume check below)
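A quick check of those volumes, assuming the upper 50 MB/s processed rate runs
continuously from 1999 to 2007 (the 8-year horizon is taken from the slide's
dates):

    # EOS/DIS data volume: MB/s -> TB/day -> PB over the mission.
    processed_mb_s = 50
    per_day_tb = processed_mb_s * 86400 / 1e6
    print(per_day_tb)                    # ~4.3 TB/day
    print(per_day_tb * 365 * 8 / 1000)   # ~12.6 PB over 8 years, order of the 15 PB quoted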

23
The Process Flow
  • Data arrives and is pre-processed.
  • instrument data is calibrated, gridded, and averaged
  • Geophysical data is derived
  • Users ask for stored data OR to analyze and
    combine data.
  • Can make the pull-push split dynamically

(Diagram: push processing of incoming data, pull processing of user requests,
plus other data sources)
24
Key Architecture Features
  • 2+N data center design
  • Scalable OR-DBMS
  • Emphasize Pull vs Push processing
  • Storage hierarchy
  • Data Pump
  • Just in time acquisition

25
2+N data center design
  • duplex the archive (for fault tolerance)
  • let anyone build an extract (the N)
  • Partition data by time and by space (store 2 or 4 ways).
  • Each partition is a free-standing OR-DBMS
    (similar to Tandem, Teradata designs).
  • Clients and partitions interact via standard protocols
  • HTTP, XML

26
Data Pump
  • Some queries require reading ALL the data (for reprocessing)
  • Each Data Center scans ALL the data every 2 days.
  • Data rate: 10 PB/day = 10 TB/node/day = 120 MB/s per node (checked below)
  • Compute on demand for small jobs:
  • less than 100 M disk accesses
  • less than 100 TeraOps
  • (less than 30 minute response time)
  • For BIG JOBS, scan the entire 15 PB database
  • Queries (and extracts) snoop this data pump.
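A minimal check of the per-node rate; the ~1000-node count is an assumption
implied by the 10 PB/day vs 10 TB/node/day figures:

    # Per-node data-pump rate from archive size, scan interval, and node count.
    archive_pb, scan_days, nodes = 15, 2, 1000
    per_node_tb_day = archive_pb * 1000 / scan_days / nodes
    print(per_node_tb_day)                 # ~7.5 TB/node/day (slide rounds to 10)
    print(per_node_tb_day * 1e6 / 86400)   # ~87 MB/s per node (slide rounds to 120 MB/s)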

27
Just-in-time acquisition (30%)
  • Hardware prices decline 20%-40%/year
  • So buy at the last moment
  • Buy the best product that day: commodity
  • Depreciate over 3 years so that the facility is fresh.
  • (after 3 years, cost is 23% of original; see the sketch below).
    At 60% price decline, cost peaks at $10M.
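The price-decline arithmetic behind "buy at the last moment", as a small
sketch; a steady 40% annual decline lands close to the slide's 23% figure:

    # With a steady 40% annual price decline, the same capacity bought later
    # costs a fraction of the original price.
    decline = 0.40
    for years in (1, 2, 3):
        print(years, round((1 - decline) ** years, 3))
    # 1: 0.6, 2: 0.36, 3: 0.216 -> after 3 years, ~22% of the original price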

(Chart: EOS DIS disk storage size and cost, 1994-2008, assuming a 40% price
decline per year: data need (TB) and storage cost ($M), reaching 2 PB at
about $100M)
28
Problems
  • Management (and HSM)
  • Design and Meta-data
  • Ingest
  • Data discovery, search, and analysis
  • Auto Parallelism
  • reorg-reprocess

29
What this system taught me
  • Traditional storage metrics
  • KAPS: KB objects accessed per second
  • $/GB: storage cost
  • New metrics:
  • MAPS: megabyte objects accessed per second
  • SCANS: time to scan the archive
  • Admin cost dominates (!!)
  • Auto parallelism is essential.

30
Outline
  • Technology
  • $1M/PB: store everything online (twice!)
  • End-to-end high-speed networks
  • Gigabit to the desktop
  • So You can store everything,
  • Anywhere in the world
  • Online everywhere
  • Research driven by apps
  • TerraServer
  • National Virtual Astronomy Observatory.

31
Microsoft TerraServer: http://TerraServer.Microsoft.com/
  • Build a multi-TB SQL Server database
  • Data must be
  • 1 TB
  • Unencumbered
  • Interesting to everyone everywhere
  • And not offensive to anyone anywhere
  • Loaded
  • 1.5 M place names from Encarta World Atlas
  • 7 M Sq Km USGS DOQ (1 meter resolution)
  • 10 M sq Km USGS topos (2m)
  • 1 M Sq Km from Russian Space agency (2 m)
  • On the web (world's largest atlas)
  • Sell images with commerce server.

32
Background
  • Earth is 500 tera-square-meters
  • USA is 10 tsm
  • 100 tsm of land lies between 70°N and 70°S
  • We have pictures of 9% of it:
  • 7 tsm from USGS
  • 1 tsm from Russian Space Agency
  • Compress 5:1 (JPEG) to 1.5 TB.
  • Slice into 10 KB chunks (200x200 pixels)
  • Store chunks in DB (tile arithmetic sketched below)
  • Navigate with:
  • Encarta Atlas
  • globe
  • gazetteer
  • Someday:
  • multi-spectral images
  • of everywhere
  • once a day / hour
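A hedged sketch of the tile arithmetic; it assumes 1 m resolution over the
whole ~8 tsm of imagery, though part of the coverage is actually 2 m:

    # TerraServer tiling: square meters -> pixels -> 200x200-pixel tiles -> bytes.
    covered_m2 = 8e12                # ~8 tera-square-meters of imagery
    pixels     = covered_m2          # 1 pixel per square meter (assumed)
    tiles      = pixels / (200 * 200)
    print(tiles / 1e6)               # ~200 million tiles
    print(tiles * 10e3 / 1e12)       # ~2 TB at ~10 KB per JPEG tile, order of the 1.5 TB quoted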

33
TerraServer 4.0 Configuration
3 Active Database Servers:
  SQL\Inst1: topo and relief data
  SQL\Inst2: aerial imagery
  SQL\Inst3: aerial imagery
Logical Volume Structure:
  One rack per database. All volumes triple mirrored (3x).
  Metadata on 15k rpm 18.2 GB drives; image data on 10k rpm 72.8 GB drives.
  2 spare volumes allocated per cluster. 6 additional 339 GB volumes to be
  added by year end (2 per DB server).
34
TerraServer 4.0 Schema
35
File System Config
  • Use StorageWorks to form 28 RAID5 sets; each RAID set has 11 disks
    (plus 16 spare drives)
  • Use NTFS to form 4 x 595 GB NT volumes, each striped over 7 RAID sets
    on 7 controllers
  • DB is a file group of 80 x 20,000 MB files (1.5 TB)

36
BAD OLD Load
37
Load Process
(Diagram: load process. Image files are read by the TerraCutter at the
Executive Briefing Center, Redmond WA, and loaded by TerraScale over the
corporate network ("read 4 images, write 1") into three 2 TB databases at
the Internet Data Center, Tukwila WA.)
38
After a Year
TerraServer Daily Traffic, Jun 22, 1998 thru June 22, 1999
  • 15 TB of data (raw), 3B records
  • 2.3 billion hits
  • 2.0 billion DB queries
  • 1.7 billion images sent (2 TB of download)
  • 368 million page views
  • 99.93% DB availability
  • 4th design now online
  • Built and operated by a team of 4 people

(Chart: daily sessions, hit count, page views, DB queries, and images served,
6/22/98 through 6/22/99, ranging from 0 to roughly 30M per day)
39
TerraServer Activity
40
TerraServer.Microsoft.NET: A Web Service
(Diagrams: the architecture before .NET and with .NET)
41
TerraServer Recent/Current Effort
  • Added USGS Topographic maps (4 TB)
  • High availability (4 node cluster with failover)
  • Integrated with Encarta Online
  • The other 25% of the US DOQs (photos)
  • Adding digital elevation maps
  • Open architecture: publish SOAP interfaces
  • Adding multi-layer maps (with UC Berkeley)
  • Geo-Spatial extension to SQL Server

42
Thank You!
43
Outline
  • Technology
  • $1M/PB: store everything online (twice!)
  • End-to-end high-speed networks
  • Gigabit to the desktop
  • So You can store everything,
  • Anywhere in the world
  • Online everywhere
  • Research driven by apps
  • TerraServer
  • National Virtual Astronomy Observatory.

44
Astronomy is Changing (and so are other sciences)
  • Astronomers have a few PB
  • Doubles every 2 years.
  • Data is public after 2 years.
  • So everyone has ½ the data
  • Some people have 5% more private data
  • So, it's a nearly level playing field
  • Most accessible data is public.

45
(inter) National Virtual Observatory
  • Almost all astronomy datasets will be online
  • Some are big (>> 10 TB)
  • Total is a few Petabytes
  • Bigger datasets coming
  • Data is public
  • Scientists can mine these datasets
  • Computer Science challenge: organize these
    datasets and provide easy access to them.

46
The Sloan Digital Sky Survey (slides by Alex Szalay)
A project run by the Astrophysical Research Consortium (ARC):
The University of Chicago, Princeton University, The Johns Hopkins
University, The University of Washington, Fermi National Accelerator
Laboratory, US Naval Observatory, The Japanese Participation Group, The
Institute for Advanced Study; Sloan Foundation, NSF, DOE, NASA.
Goal: to create a detailed multicolor map of the Northern Sky over 5 years,
with a budget of approximately $80M.
Data size: 40 TB raw, 1 TB processed.
47
Features of the SDSS
Special 2.5m telescope, located at Apache Point, NM: 3 degree field of view,
zero distortion focal plane.
Two surveys in one: photometric survey in 5 bands; spectroscopic redshift
survey.
Huge CCD mosaic: 30 CCDs 2K x 2K (imaging), 22 CCDs 2K x 400 (astrometry).
Two high resolution spectrographs: 2 x 320 fibers with 3 arcsec diameter,
R = 2000 resolution with 4096 pixels, spectral coverage from 3900Å to 9200Å.
Automated data reduction: over 70 man-years of development effort (Fermilab
and collaboration scientists).
Very high data volume: expect over 40 TB of raw data and about 3 TB of
processed catalogs. Data made available to the public.
48
Apache Point Observatory
Located in New Mexico, near White Sands National
Monument
Special 2.5m telescope: 3 degree field of view, zero distortion focal plane,
wind screen moved separately.
49
Scientific Motivation
Create the ultimate map of the Universe → the Cosmic Genome Project!
Study the distribution of galaxies → What is the origin of fluctuations?
What is the topology of the distribution?
Measure the global properties of the Universe → How much dark matter is
there?
Local census of the galaxy population → How did galaxies form?
Find the most distant objects in the Universe → What are the highest quasar
redshifts?
50
Cosmology Primer
The Universe is expanding: the galaxies move away from us and their spectral
lines are redshifted.
v = H0 · r   (Hubble's law)
The fate of the universe depends on the balance between gravity and the
expansion velocity: Ω = density / critical density. If Ω < 1, the Universe
expands forever.
Most of the mass in the Universe is dark matter, and it may be cold (CDM).
The spatial distribution of galaxies is correlated, due to small ripples in
the early Universe: P(k), the power spectrum.
51
The Naught Problem
What are the global parameters of the Universe?
  H0, the Hubble constant: 55-75 km/s/Mpc
  Ω0, the density parameter: 0.25-1
  Λ0, the cosmological constant: 0-0.7
Their values are still quite uncertain today...
Goal: measure these parameters with an accuracy of a few percent →
high precision cosmology!
52
The Cosmic Genome Project
The SDSS will create the ultimate map of the Universe, with much more detail
than any other measurement before.
53
Area and Size of Redshift Surveys
54
The Topology of the Local Universe
Measure the Topology of the Universe
Does it consist of walls and voids
or is it randomly distributed?
55
Finding the Most Distant Objects
Intermediate and high redshift QSOs
Multicolor selection function.
Luminosity functions and spatial clustering.
High redshift QSOs (z > 5).
56
The Photometric Survey
Northern Galactic Cap: 5 broad-band filters (u', g', r', i', z'); limiting
magnitudes (22.3, 23.3, 23.1, 22.3, 20.8); drift scan of 10,000 square
degrees; 55 sec exposure time; 40 TB raw imaging data → pipeline →
100,000,000 galaxies and 50,000,000 stars; calibration to 2% at r' = 19.8;
only done in the best seeing (20 nights/yr); pixel size is 0.4 arcsec,
astrometric precision is 60 milliarcsec.
Southern Galactic Cap: multiple scans (> 30 times) of the same stripe.
Continuous data rate of 8 Mbytes/sec.
57
Survey Strategy
Overlapping 2.5 degree wide stripes Avoiding the
Galactic Plane (dust) Multiple exposures on the
three Southern stripes
58
The Spectroscopic Survey
Measure redshifts of objects → distance.
SDSS Redshift Survey: 1 million galaxies, 100,000 quasars, 100,000 stars.
Two high throughput spectrographs: spectral range 3900-9200 Å, 640 spectra
simultaneously, R = 2000 resolution.
Automated reduction of spectra.
Very high sampling density and completeness; objects in other catalogs also
targeted.
59
First Light Images
Telescope first light: May 9th, 1998
Equatorial scans
60
The First Stripes
Camera: 5-color imaging of > 100 square degrees; multiple scans across the
same fields; photometric limits as expected.
61
NGC 6070
62
The First Quasars
Three of the four highest redshift quasars have been found in the first SDSS
test data!
63
SDSS Data Flow
64
Data Processing Pipelines
65
SDSS Data Products
Object catalog: 400 GB (parameters of > 10^8 objects)
Redshift catalog: 2 GB (parameters of 10^6 objects)
Atlas images: 1.5 TB (5-color cutouts of > 10^9 objects)
Spectra: 60 GB (10^6 spectra in one-dimensional form)
Derived catalogs: 60 GB (clusters, QSO absorption lines)
4x4 pixel all-sky map: 1 TB (heavily compressed, 5 x 10^5)
All raw data saved in a tape vault at Fermilab
66
Concept of the SDSS Archive
Science Archive (products accessible to users)
Operational Archive (raw and processed data)
67
Parallel Query Implementation
  • Getting 200MBps/node thru SQL today
  • 4 GB/s on 20 node cluster.

(Diagram: the user interface and analysis engine talk to a master SX engine,
which federates a set of slave DBMS nodes, each backed by its own RAID
storage.)
68
Who will be using the archive?
Power users: sophisticated, with lots of resources; research is centered
around the archive data; a moderate number of very intensive queries, mostly
statistical, with large output sizes.
General astronomy public: frequent but casual lookup of objects/regions; the
archives help their research but are not central to it; a large number of
small queries and a lot of cross-identification requests.
Wide public: browsing a "Virtual Telescope" can have large public appeal;
needs special packaging; could be a very large number of requests.
69
How will the data be analyzed?
The data are inherently multidimensional → positions, colors, size, redshift.
Improved classifications result in complex N-dimensional volumes → complex
constraints, not ranges.
Spatial relations will be investigated → nearest neighbors, other objects
within a radius.
Data mining, finding the needle in the haystack → separate the typical from
the rare, recognize patterns in the data.
Output size can be prohibitively large for intermediate files → import
output directly into analysis tools.
70
Summary
SDSS combines astronomy, physics, and computer science. It promises to
fundamentally change our view of the universe: high precision cosmology. It
will serve as the standard astronomy reference for several decades. The
virtual universe can be explored by both scientists and the public. A new
paradigm in astronomy.
71
Designing and Mining Multi-Terabyte Astronomy
Archives: The Sloan Digital Sky Survey (SDSS)
http://www.sdss.org/
  • Scan 10,000 sq. degrees (50%) of the northern sky.
  • 200,000,000 objects.
  • 100 dimensions.
  • 40 TB of raw data.
  • 1 TB of catalog data.

Alex S. Szalay, Peter J. Kunszt, Ani Thakar (The Johns Hopkins University);
Jim Gray, Don Slutz (Microsoft Research); Robert J. Brunner (Calif.
Institute of Technology)
72
Astronomical Growth of Collected Data
  • Data gathering rate doubles every 20 months (Moore's Law here too)
  • Several orders of magnitude more data now!
  • SDSS telescope camera has 120 million CCD pixels
  • 55 second photometric exposure.
  • 8 MB/sec data rate.
  • 0.4 arc-sec pixel size.
  • Also Spectroscopic Survey of 1 million objects.

73
Major Changes in Astronomy
  • Visual observation → photographic plates →
    massive scans of the sky collecting terabytes.
  • A Practice Scan of the SDSS Telescope Discovered
    3 of the 4 most Distant Quasars!
  • SDSS plus other Surveys will yield a Digital Sky
  • Telescope Quality Data available Online.
  • Spatial Data Mining will find new objects.
  • New research areas - Study Density Fluctuations.

74
Different Kind of Spatial Data
  • All objects lie on the celestial sphere's surface
  • Position a point by 2 spherical angles (RA, DEC).
  • Position by Cartesian x,y,z: easier to search within 1 arc-minute
    (sketched below).
  • Hierarchy of spherical triangles for indexing.
  • SDSS tree is 5 levels deep: 8192 triangles
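A minimal sketch of why the Cartesian representation helps: a 1 arc-minute
cone around a direction becomes a simple dot-product test. This only
illustrates the idea; it is not the HTM code the SDSS actually used:

    import math

    def radec_to_xyz(ra_deg, dec_deg):
        # Convert (RA, DEC) in degrees to a unit vector on the celestial sphere.
        ra, dec = math.radians(ra_deg), math.radians(dec_deg)
        return (math.cos(dec) * math.cos(ra),
                math.cos(dec) * math.sin(ra),
                math.sin(dec))

    def within_arcmin(p, q, arcmin=1.0):
        # Two directions lie within the radius iff the dot product of their
        # unit vectors exceeds cos(radius).
        cos_sep = sum(a * b for a, b in zip(radec_to_xyz(*p), radec_to_xyz(*q)))
        return cos_sep >= math.cos(math.radians(arcmin / 60.0))

    print(within_arcmin((180.0, 0.0), (180.01, 0.005)))   # True: ~0.67 arc-minutes apart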

75
Experiment with Relational DBMS
  • See if SQL's good indexing and scanning
    compensates for poor object support.
  • Leverage Fast/Big/Cheap Commodity Hardware.
  • Ported 40 GB Sample Database (from SDSS Sample
    Scan) to SQL Server 2000
  • Building public web site and data server

76
20 Astronomy Queries
  • Implemented spatial access extension to SQL (HTM)
  • Implement 20 Astronomy Queries in SQL (see paper
    for details).
  • 15M rows x 378 cols, 30 GB. Can scan it in 8
    minutes (disk I/O limited).
  • Many queries run in seconds
  • Create Covering Indexes on queried columns.
  • Create Neighbors Table listing objects within 1
    arc-minute (5 neighbors on the average) for
    spatial joins.
  • Install some more disks!

77
Query to Find Gravitational Lenses
Find all objects within 1 arc-minute of each other that have very similar
colors (the color differences u-g, g-r, r-i match to within 0.05 mag).
78
SQL Query to Find Gravitational Lenses
  select count(*)
  from sxTag T, sxTag U, neighbors N
  where T.UObj_id = N.UObj_id
    and U.UObj_id = N.neighbor_UObj_id
    and N.UObj_id < N.neighbor_UObj_id        -- no dups
    and T.u>0 and T.g>0 and T.r>0 and T.i>0
    and U.u>0 and U.g>0 and U.r>0 and U.i>0
    and ABS((T.u-T.g)-(U.u-U.g)) < 0.05       -- similar color
    and ABS((T.g-T.r)-(U.g-U.r)) < 0.05
    and ABS((T.r-T.i)-(U.r-U.i)) < 0.05

  • Finds 5223 objects, executes in 6 minutes.

79
SQL Results so far.
  • Have run 17 of 20 Queries so far.
  • Most queries are I/O bound, scanning at 80 MB/sec
    on 4 disks in 6 minutes (at the PCI bus limit)
  • Covering indexes reduce execution to < 30 secs.
  • Common to get grid distributions:

      select convert(int, ra*30)/30.0,    -- ra bucket
             convert(int, dec*30)/30.0,   -- dec bucket
             count(*)                     -- bucket count
      from Galaxies
      where (u-g) > 1 and r < 21.5
      group by (1), (2)

80
Distribution of Galaxies
81
Outline
  • Technology
  • $1M/PB: store everything online (twice!)
  • End-to-end high-speed networks
  • Gigabit to the desktop
  • So You can store everything,
  • Anywhere in the world
  • Online everywhere
  • Research driven by apps
  • TerraServer
  • National Virtual Astronomy Observatory.