Scaleable Systems Research at Microsoft really: what we do at BARC - PowerPoint PPT Presentation

1 / 65

About This Presentation

Title:

Scaleable Systems Research at Microsoft really: what we do at BARC

Description:

RAGS: SQL Testing. TerraServer (a big DB) Sloan Sky Survey (CyberBricks) Billion Transactions per day. WolfPack Failover. NTFS IO measurements. NT-Cluster-Sort ... – PowerPoint PPT presentation

Number of Views:81

Avg rating:3.0/5.0

Slides: 66

Provided by: jimg178

Category:

more less

Transcript and Presenter's Notes

Title: Scaleable Systems Research at Microsoft really: what we do at BARC

1
Scaleable Systems Research at Microsoft(really
what we do at BARC)

Jim GrayMicrosoft Research Gray_at_Microsoft.comht
tp//research.Microsoft.com/GrayPresented to
DARPA WindowsNT workshop 5 Aug 1998, Seattle WA.

2
Outline

PowerCast, FileCast Reliable Multicast
RAGS SQL Testing
TerraServer (a big DB)
Sloan Sky Survey (CyberBricks)
Billion Transactions per day
WolfPack Failover
NTFS IO measurements
NT-Cluster-Sort
AlwaysUp

3
Telepresence

The next killer app
Space shifting
Reduce travel
Time shifting
Retrospective
Offer condensations
Just in time meetings.
Example ACM 97
NetShow and Web site.
More web visitors than attendees
People-to-People communication

4
Telepresence Prototypes

PowerCast multicast PowerPoint
Streaming - pre-sends next anticipated slide
Send slides and voice rather than talking head
and voice
Uses ECSRM for reliable multicast
1000s of receivers can join and leave any time.
No server needed no pre-load of slides.
Cooperating with NetShow
FileCast multicast file transfer.
Erasure encodes all packets
Receivers only need to receive as many bytes as
the length of the file
Multicast IE to solve Midnight-Madness problem
NT SRM reliable IP multicast library for NT
Spatialized Teleconference Station
Texture map faces onto spheres
Space map voices

5
RAGS RAndom SQL test Generator

Microsoft spends a LOT of money on testing.
(60 of development according to one source).
Idea test SQL by
generating random correct queries
executing queries against database
compare results with SQL 6.5, DB2, Oracle, Sybase
Being used in SQL 7.0 testing.
375 unique bugs found (since 2/97)
Very productive test tool

6
Sample Rags Generated Statement
SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996
1023AM" , T0.notes FROM titles T0, roysched
T1 WHERE EXISTS ( SELECT DISTINCT TOP 9 3.11 ,
"Apr 15 1996 1023AM" , T0.advance , (
"ltv3VF" (( UPPER(((T2.ord_num "22\0G3"
)T2.ord_num ))("1FL6t15m" RTRIM(
UPPER((T1.title_id ((("MlVCf1kA" "GS?"
)T2.payterms )T2.payterms ))))))(T2.ord_num
RTRIM((LTRIM((T2.title_id T2.stor_id ))"2"
))))), T0.advance , (((-(T2.qty ))/(1.0
))(((-(-(-1 )))( DEGREES(T2.qty )))-(-(( -4
)-(-(T2.qty ))))))(-(-1 )) FROM sales T2 WHERE
EXISTS ( SELECT "fQDs" , T2.ord_date , AVG
((-(7 ))/(1 )), MAX (DISTINCT -1 ),
LTRIM("0IL601H" ), ("jQ\" ((( MAX(T3.phone )
MAX((RTRIM( UPPER( T5.stor_name ))((("lt"
"9n0yN" ) UPPER("c" ))T3.zip ))))T2.payterms
) MAX("\?" ))) FROM authors T3, roysched
T4, stores T5 WHERE EXISTS ( SELECT DISTINCT
TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE
( (-(-(5 )))gt T4.royalty ) AND (( ( (
LOWER( UPPER((("9W8WgtkOa" T6.stor_address
)"P" ))))! ANY ( SELECT TOP 2 LOWER((
UPPER("B9WIX" )"J" )) FROM roysched T7
WHERE ( EXISTS ( SELECT (T8.city
(T9.pub_id (("gt" T10.country ) UPPER(
LOWER(T10.city))))), T7.lorange ,
((T7.lorange )((T7.lorange )(-2 )))/((-5
)-(-2.0 )) FROM publishers T8, pub_info T9,
publishers T10 WHERE ( (-10 )lt
POWER((T7.royalty )/(T7.lorange ),1)) AND
(-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) )
) --EOQ ) AND (NOT (EXISTS (
SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9,
stores T10 WHERE ( (T10.city
LOWER(T10.stor_id )) BETWEEN (("QNu_at_WI"
T10.stor_id )) AND ("DT" ) ) AND ("RJ"
BETWEEN ( LOWER(T10.zip )) AND (LTRIM(
UPPER(LTRIM( LOWER(("_\tkd" T8.title_id ))))))
) GROUP BY T9.i3, T8.royalty, T9.i3
HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty
))))) AND (COUNT()) ) --EOQ ) )
) --EOQ ) AND (((("iUv" T6.stor_id
)T6.state )T6.city ) BETWEEN ((((T6.zip (
UPPER(("ec4LrPlt" ((LTRIM(T6.stor_name )"faxlt"
)("5adWhS" T6.zip )))) T6.city ))""
)"?gt_0Wi" )) AND (T6.zip ) ) ) AND (T4.lorange
BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ
GROUP BY ( LOWER(((T3.address T5.stor_address
)REVERSE((T5.stor_id LTRIM( T5.stor_address
))))) LOWER(((("ztO5I" "" )("X3FN"
(REVERSE((RTRIM( LTRIM((("kwU" "wyn_S_at_y"
)(REVERSE(( UPPER(LTRIM("u2C" ))T4.title_id
))( RTRIM(("s" "1X" )) UPPER((REVERSE(T3.addr
ess )T5.stor_name ))))))) "6CRtdD" ))"j?k"
)))T3.phone ))), T5.city, T5.stor_address )
--EOQ ORDER BY 1, 6, 5 )
This Statement yields an error SQLState37000,
Error8623 Internal Query Processor ErrorQuery
processor could not produce a query plan.
7
Automation

Simpler Statement with same error
SELECT roysched.royalty
FROM titles, roysched
WHERE EXISTS (
SELECT DISTINCT TOP 1 titles.advance
FROM sales
ORDER BY 1)
Control statement attributes
complexity, kind, depth, ...
Multi-user stress tests
tests concurrency, allocation, recovery

8
One 4-Vendor Rags Test3 of them vs Us

60 k Selects on MSS, DB2, Oracle, Sybase.
17 SQL Server Beta 2 suspects 1 suspect per
3350 statements.
Examine 10 suspects, filed 4 Bugs!One
duplicate. Assume 3/10 are new
Note This is the SS Beta 2 ProductQuality
rising fast (and RAGS sees that)

9
Outline

FileCast Reliable Multicast
RAGS SQL Testing
TerraServer (a big DB)
Sloan Sky Survey (CyberBricks)
Billion Transactions per day
Wolfpack Failover
NTFS IO measurements
NT-Cluster-Sort

10
Billions Of Clients

Every device will be intelligent
Doors, rooms, cars
Computing will be ubiquitous

11
Billions Of ClientsNeed Millions Of Servers

All clients networked to servers
May be nomadicor on-demand
Fast clients wantfaster servers
Servers provide
Shared Data
Control
Coordination
Communication

Clients
Mobileclients
Fixedclients
Servers
Server
Super server
12
ThesisMany little beat few big
1 million
100 K
10 K
Pico Processor
Micro
Nano
10 pico-second ram
1 MB
Mini
Mainframe
10
0

MB
1
0 GB
1
TB
1
00 TB
1.8"
2.5"
3.5"
5.25"
1 M SPECmarks, 1TFLOP 106 clocks to bulk
ram Event-horizon on chip VM reincarnated Multi
program cache, On-Chip SMP
9"
14"

Smoking, hairy golf ball
How to connect the many little parts?
How to program the many little parts?
Fault tolerance?

13
Performance Storage Accesses not Instructions
Executed

In the old days we counted instructions and
IOs
Now we count memory references
Processors wait most of the time

Where the time goes
clock ticks for AlphaSort Components
70 MIPS real apps have worse Icache misses so
run at 60 MIPS if well tuned, 20 MIPS if not
Sort
Disc Wait
Sort
OS
Disc Wait
Memory Wait
I-Cache

Miss
B-Cache

D-Cache
Data Miss
Miss
14
Scale Up and Scale Out
Grow Up with SMP 4xP6 is now standard Grow Out
with Cluster Cluster has inexpensive parts
Cluster of PCs
15
Microsoft TerraServer Scaleup to Big Databases

Build a 1 TB SQL Server database
Data must be
1 TB
Unencumbered
Interesting to everyone everywhere
And not offensive to anyone anywhere
Loaded
1.5 M place names from Encarta World Atlas
3 M Sq Km from USGS (1 meter resolution)
1 M Sq Km from Russian Space agency (2 m)
On the web (worlds largest atlas)
Sell images with commerce server.

16
Microsoft TerraServer Background

Earth is 500 Tera-meters square
USA is 10 tm2
100 TM2 land in 70ºN to 70ºS
We have pictures of 6 of it
3 tsm from USGS
2 tsm from Russian Space Agency
Compress 51 (JPEG) to 1.5 TB.
Slice into 10 KB chunks
Store chunks in DB
Navigate with
Encarta Atlas
globe
gazetteer
StreetsPlus in the USA

Someday
multi-spectral image
of everywhere
once a day / hour

17
USGS Digital Ortho Quads (DOQ)

US Geologic Survey
4 Tera Bytes
Most data not yet published
Based on a CRADA
Microsoft TerraServer makes data available.

18
Russian Space Agency(SovInfomSputnik) SPIN-2
(Aerial Images is Worldwide Distributor)

1.5 Meter Geo Rectified imagery of (almost)
anywhere
Almost equal-area projection
De-classified satellite photos (from 200 KM),
More data coming (1 m)
Selling imagery on Internet.
Putting 2 tm2 onto Microsoft TerraServer.

19
Demo

navigate by coverage map to White House
Download image
buy imagery from USGS
navigate by name to Venice
buy SPIN2 image Kodak photo
Pop out to Expedia street map of Venice
Mention that DB will double in next 18 months (2x
USGS, 2X SPIN2)

20
Hardware
Map
Site
Server
Internet
Servers
100 Mbps
Ethernet Switch
Web Servers
Alpha
Enterprise Storage Array
STK
Server
9710
8400
DLT
Tape
8 x 440MHz
Library
Alpha
cpus
10 GB DRAM
1TB Database Server AlphaServer 8400 4x400. 10
GB RAM 324 StorageWorks disks 10 drive tape
library (STC Timber Wolf DLT7000 )
21
The Microsoft TerraServer Hardware

Compaq AlphaServer 8400
8x400Mhz Alpha cpus
10 GB DRAM
324 9.2 GB StorageWorks Disks
3 TB raw, 2.4 TB of RAID5
STK 9710 tape robot (4 TB)
WindowsNT 4 EE, SQL Server 7.0

22
Software
Web Client
Internet InformationServer 4.0
ImageServer Active Server Pages
HTML
JavaViewer
The Internet
browser

MTS
Terra-ServerStored Procedures
Internet InfoServer 4.0
Internet InformationServer 4.0
SQL Server 7
MicrosoftSite Server EE
Microsoft AutomapActiveX Server
Image DeliveryApplication
SQL Server7
Automap Server
TerraServer DB
Image Provider Site(s)
23
System Management Maintenance

Backup and Recovery
STK 9710 Tape robot
Legato NetWorker
SQL Server 7 Backup Restore
Clocked at 80 MBps (peak)( 200 GB/hr)
SQL Server Enterprise Mgr
DBA Maintenance
SQL Performance Monitor

24
Microsoft TerraServer File Group Layout

Convert 324 disks to 28 RAID5 sets plus 28 spare
drives
Make 4 WinNT volumes (RAID 50) 595 GB per
volume
Build 30 20GB files on each volume
DB is File Group of 120 files

25
Image Delivery and LoadIncremental load of 4
more TB in next 18 months
DLTTape
tar
\DropN
LoadMgrDB
DoJob
Wait 4 Load
DLTTape
NTBackup
...
Cutting Machines
LoadMgr
10 ImgCutter 20 Partition 30 ThumbImg40
BrowseImg 45 JumpImg 50 TileImg 55 Meta
Data 60 Tile Meta 70 Img Meta 80 Update Place
ImgCutter
100mbitEtherSwitch
\DropN \Images
TerraServer
Enterprise Storage Array
STKDLTTape Library
AlphaServer8400
108 9.1 GB Drives
108 9.1 GB Drives
108 9.1 GB Drives
26
Technical ChallengeKey idea

Problem Geo-Spatial Search without geo-spatial
access methods.(just standard SQL Server)
Solution
Geo-spatial search key
Divide earth into rectangles of 1/48th degree
longitude (X) by 1/96th degree latitude (Y)
Z-transform X Y into single Z value, build
B-tree on Z
Adjacent images stored next to each other
Search Method
Latitude and Longitude gt X, Y, then Z
Select on matching Z value

27
Sloan Digital Sky Survey

Digital Sky
30 TB raw
3TB cooked (1 billion 3KB objects)
Want to scan it frequently
Using cyberbricks
Current status
175 MBps per node
24 nodes gt 4 GBps
5 minutes to scan whole archive

28
Some Tera-Byte Databases
Kilo Mega Giga Tera Peta Exa Zetta Yotta

The Web 1 TB of HTML
TerraServer 1 TB of images
Several other 1 TB (file) servers
Hotmail 7 TB of email
Sloan Digital Sky Survey 40 TB raw, 2 TB
cooked
EOS/DIS (picture of planet each week)
15 PB by 2007
Federal Clearing house images of checks
15 PB by 2006 (7 year history)
Nuclear Stockpile Stewardship Program
10 Exabytes (???!!)

29
Info Capture

You can record everything you see or hear or
read.
What would you do with it?
How would you organize analyze it?

Video 8 PB per lifetime (10GBph) Audio 30 TB
(10KBps) Read or write 8 GB (words) See
http//www.lesk.com/mlesk/ksg97/ksg.html
30
Michael Lesks Points www.lesk.com/mlesk/ksg97/ks
g.html

Soon everything can be recorded and kept
Most data will never be seen by humans
Precious Resource Human attention
Auto-Summarization Auto-Searchwill be a key
enabling technology.

31
Kilo Mega Giga Tera Peta Exa Zetta Yotta
A letter
A novel
A Movie
Library of Congress (text)
LoC (image)
LoC (sound cinima)
All Photos
All Disks
All Tapes
All Information!
32
Outline

FileCast Reliable Multicast
RAGS SQL Testing
TerraServer (a big DB)
Sloan Sky Survey (CyberBricks)
Billion Transactions per day
Wolfpack Failover
NTFS IO measurements
NT-Cluster-Sort

33
Scalability
100 millionweb hits
1 billion transactions

Scale up to large SMP nodes
Scale out to clusters of SMP nodes

1.8 million mail messages
4 terabytes of data
34
Billion Transactions per Day Project

Built a 45-node Windows NT Cluster (with help
from Intel Compaq) gt 900 disks
All off-the-shelf parts
Using SQL Server DTC distributed transactions
DebitCredit Transaction
Each node has 1/20 th of the DB
Each node does 1/20 th of the work
15 of the transactions are distributed

35
Billion Transactions Per Day Hardware

45 nodes (Compaq Proliant)
Clustered with 100 Mbps Switched Ethernet
140 cpu, 13 GB, 3 TB.

36
1.2 B tpd

1 B tpd ran for 24 hrs.
Out-of-the-box software
Off-the-shelf hardware
AMAZING!

Sized for 30 days
Linear growth
5 micro-dollars per transaction

37
How Much Is 1 Billion Tpd?

1 billion tpd 11,574 tps 700,000 tpm
(transactions/minute)
ATT
185 million calls per peak day (worldwide)
Visa 20 million tpd
400 million customers
250K ATMs worldwide
7 billion transactions (cardcheque) in 1994
New York Stock Exchange
600,000 tpd
Bank of America
20 million tpd checks cleared (more than any
other bank)
1.4 million tpd ATM transactions
Worldwide Airlines Reservations 250 Mtpd

38
NCSA Super Cluster
http//access.ncsa.uiuc.edu/CoverStories/SuperClus
ter/super.html

National Center for Supercomputing
ApplicationsUniversity of Illinois _at_ Urbana
512 Pentium II cpus, 2,096 disks, SAN
Compaq HP Myricom WindowsNT
A Super Computer for 3M
Classic Fortran/MPI programming
DCOM programming model

39
Outline

FileCast Reliable Multicast
RAGS SQL Testing
TerraServer (a big DB)
Sloan Sky Survey (CyberBricks)
Billion Transactions per day
Wolfpack Failover
NTFS IO measurements
NT-Cluster-Sort

40
NT Clusters (Wolfpack)

Scale DOWN to PDA WindowsCE
Scale UP an SMP TerraServer
Scale OUT with a cluster of machines
Single-system image
Naming
Protection/security
Management/load balance
Fault tolerance
Wolfpack
Hot pluggable hardware software

41
Symmetric Virtual Server Failover Example
Server 1
Server 2
Web site
Web site
Database
Database
Web site files
Web site files
Database files
Database files
42
Clusters BackOffice

Research Instant Transparent failover
Making BackOffice PlugNPlay on Wolfpack
Automatic install configure
Virtual Server concept makes it easy
simpler management concept
simpler context/state migration
transparent to applications
SQL 6.5E 7.0 Failover
MSMQ (queues), MTS (transactions).

43
Next Steps in Availability

Study the causes of outages
Build AlwaysUp system
Two geographically remote sites
Users have instant and transparent failover to
2nd site.
Working with WindowsNT and SQL Server groups on
this.

44
Outline

FileCast Reliable Multicast
RAGS SQL Testing
TerraServer (a big DB)
Sloan Sky Survey (CyberBricks)
Billion Transactions per day
Wolfpack Failover
NTFS IO measurements
NT-Cluster-Sort

45
Storage Latency How Far Away is the Data?
9
Tape /Optical
10
Robot
6
Disk
10
Memory
100
10
On Board Cache
On Chip Cache
2
Registers
1
46
The Memory Hierarchy

Measuring Modeling Sequential IO
Where is the bottleneck?
How does it scale with
SMP, RAID, new interconnects

Goals balanced bottlenecks Low overhead Scale
many processors (10s) Scale many disks (100s)
Memory
App address space
Mem bus
File cache
Controller
Adapter
SCSI
PCI
47
PAP (peak advertised Performance) vs RAP (real
application performance)

Goal RAP PAP / 2 (the half-power point)

System Bus
422 MBps
40 MBps
7.2 MB/s
7.2 MB/s
Application
10-15 MBps
Data
7.2 MB/s
File System
SCSI
Buffers
Disk
133 MBps
PCI
7.2 MB/s
48
The Best Case Temp File, NO IO

Temp file Read / Write File System Cache
Program uses small (in cpu cache) buffer.
So, write/read time is bus move time (3x better
than copy)
Paradox fastest way to move data is to write
then read it.
This hardware islimited to 150 MBpsper
processor

49
Bottleneck Analysis

Drawn to linear scale

Theoretical Bus Bandwidth 422MBps 66 Mhz x 64
bits
MemoryRead/Write 150 MBps
MemCopy 50 MBps
Disk R/W 9MBps
50
3 Stripes and Your Out!

CPU time goes down with request size
Ftdisk (striping is cheap)

3 disks can saturate adapter
Similar story with UltraWide

51
Parallel SCSI Busses Help

Second SCSI bus nearly doubles read and wce
throughput
Write needs deeper buffers
Experiment is unbuffered(3-deep WCE)

?
2 x
52
File System Buffering Stripes(UltraWide Drives)

FS buffering helps small reads
FS buffered writes peak at 12MBps
3-deep async helps

Write peaks at 20 MBps
Read peaks at 30 MBps

53
PAP vs RAP

Reads are easy, writes are hard
Async write can match WCE.

422 MBps
142
MBps
SCSI
Disks
Application
Data
40 MBps
10-15 MBps
31 MBps
File System
9 MBps

133 MBps
72 MBps
SCSI
PCI
54
Bottleneck Analysis

NTFS Read/Write 9 disk, 2 SCSI bus, 1 PCI 65
MBps Unbuffered read
43 MBps Unbuffered write
40 MBps Buffered read
35 MBps Buffered write

Adapter 30 MBps
Memory Read/Write 150 MBps
PCI 70 MBps
70 MBps
Adapter
55
Hypothetical Bottleneck Analysis

NTFS Read/Write 12 disk, 4 SCSI, 2 PCI (not
measured, we had only one PCI bus available, 2nd
one was internal)
120 MBps Unbuffered read
80 MBps Unbuffered write
40 MBps Buffered read
35 MBps Buffered write

120 MBps
56
Year 2002 Disks