Title: Scaleable Systems Research at Microsoft (really: what we do at BARC)
1Scaleable Systems Research at Microsoft (really: what we do at BARC)
- Jim Gray, Microsoft Research, Gray@Microsoft.com
- http://research.Microsoft.com/~Gray
- Presented to the DARPA WindowsNT workshop, 5 Aug 1998, Seattle WA.
2Outline
- PowerCast, FileCast Reliable Multicast
- RAGS SQL Testing
- TerraServer (a big DB)
- Sloan Sky Survey (CyberBricks)
- Billion Transactions per day
- WolfPack Failover
- NTFS IO measurements
- NT-Cluster-Sort
- AlwaysUp
3Telepresence
- The next killer app
- Space shifting
- Reduce travel
- Time shifting
- Retrospective
- Offer condensations
- Just in time meetings.
- Example: ACM 97
- NetShow and Web site.
- More web visitors than attendees
- People-to-People communication
4Telepresence Prototypes
- PowerCast: multicast PowerPoint
- Streaming: pre-sends the next anticipated slide
- Sends slides and voice rather than talking head and voice
- Uses ECSRM for reliable multicast
- 1000s of receivers can join and leave at any time
- No server needed; no pre-load of slides
- Cooperating with NetShow
- FileCast: multicast file transfer
- Erasure-encodes all packets
- Receivers only need to receive as many bytes as the length of the file
- Multicast IE to solve the Midnight-Madness problem
- NT SRM: reliable IP multicast library for NT
- Spatialized Teleconference Station
- Texture-map faces onto spheres
- Space-map voices
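The "receivers only need as many bytes as the file" property comes from erasure coding. A minimal sketch with a single XOR parity block (FileCast-style systems use stronger codes such as Reed-Solomon or Tornado codes; the function names here are mine):

```python
from functools import reduce

def encode(data: bytes, k: int) -> list:
    """Split data into k equal blocks plus one XOR parity block.
    Any k of the k+1 blocks suffice to rebuild the file."""
    size = -(-len(data) // k)                      # ceiling division
    blocks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)
    return blocks + [parity]

def decode(blocks: list, length: int) -> bytes:
    """Rebuild the file from k+1 slots with at most one missing (None)."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if missing:
        present = [b for b in blocks if b is not None]
        # XOR of the surviving blocks reproduces the single lost block.
        blocks[missing[0]] = reduce(
            lambda a, b: bytes(x ^ y for x, y in zip(a, b)), present)
    return b"".join(blocks[:-1])[:length]          # drop parity, strip padding

msg = b"multicast file transfer"
packets = encode(msg, k=4)
packets[2] = None                                  # one packet lost in transit
assert decode(packets, len(msg)) == msg
```

With a real (k, n) erasure code the same idea tolerates n - k losses, which is what lets thousands of receivers join and leave at any time without retransmission.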
5RAGS: RAndom SQL test Generator
- Microsoft spends a LOT of money on testing (60% of development, according to one source).
- Idea: test SQL by
- generating random correct queries
- executing the queries against the database
- comparing the results across SQL 6.5, DB2, Oracle, Sybase
- Being used in SQL 7.0 testing.
- 375 unique bugs found (since 2/97)
- Very productive test tool
6Sample RAGS-Generated Statement
SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996
1023AM" , T0.notes FROM titles T0, roysched
T1 WHERE EXISTS ( SELECT DISTINCT TOP 9 3.11 ,
"Apr 15 1996 1023AM" , T0.advance , (
"ltv3VF" (( UPPER(((T2.ord_num "22\0G3"
)T2.ord_num ))("1FL6t15m" RTRIM(
UPPER((T1.title_id ((("MlVCf1kA" "GS?"
)T2.payterms )T2.payterms ))))))(T2.ord_num
RTRIM((LTRIM((T2.title_id T2.stor_id ))"2"
))))), T0.advance , (((-(T2.qty ))/(1.0
))(((-(-(-1 )))( DEGREES(T2.qty )))-(-(( -4
)-(-(T2.qty ))))))(-(-1 )) FROM sales T2 WHERE
EXISTS ( SELECT "fQDs" , T2.ord_date , AVG
((-(7 ))/(1 )), MAX (DISTINCT -1 ),
LTRIM("0IL601H" ), ("jQ\" ((( MAX(T3.phone )
MAX((RTRIM( UPPER( T5.stor_name ))((("lt"
"9n0yN" ) UPPER("c" ))T3.zip ))))T2.payterms
) MAX("\?" ))) FROM authors T3, roysched
T4, stores T5 WHERE EXISTS ( SELECT DISTINCT
TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE
( (-(-(5 )))gt T4.royalty ) AND (( ( (
LOWER( UPPER((("9W8WgtkOa" T6.stor_address
)"P" ))))! ANY ( SELECT TOP 2 LOWER((
UPPER("B9WIX" )"J" )) FROM roysched T7
WHERE ( EXISTS ( SELECT (T8.city
(T9.pub_id (("gt" T10.country ) UPPER(
LOWER(T10.city))))), T7.lorange ,
((T7.lorange )((T7.lorange )(-2 )))/((-5
)-(-2.0 )) FROM publishers T8, pub_info T9,
publishers T10 WHERE ( (-10 )lt
POWER((T7.royalty )/(T7.lorange ),1)) AND
(-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) )
) --EOQ ) AND (NOT (EXISTS (
SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9,
stores T10 WHERE ( (T10.city
LOWER(T10.stor_id )) BETWEEN (("QNu_at_WI"
T10.stor_id )) AND ("DT" ) ) AND ("RJ"
BETWEEN ( LOWER(T10.zip )) AND (LTRIM(
UPPER(LTRIM( LOWER(("_\tkd" T8.title_id ))))))
) GROUP BY T9.i3, T8.royalty, T9.i3
HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty
))))) AND (COUNT()) ) --EOQ ) )
) --EOQ ) AND (((("iUv" T6.stor_id
)T6.state )T6.city ) BETWEEN ((((T6.zip (
UPPER(("ec4LrPlt" ((LTRIM(T6.stor_name )"faxlt"
)("5adWhS" T6.zip )))) T6.city ))""
)"?gt_0Wi" )) AND (T6.zip ) ) ) AND (T4.lorange
BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ
GROUP BY ( LOWER(((T3.address T5.stor_address
)REVERSE((T5.stor_id LTRIM( T5.stor_address
))))) LOWER(((("ztO5I" "" )("X3FN"
(REVERSE((RTRIM( LTRIM((("kwU" "wyn_S_at_y"
)(REVERSE(( UPPER(LTRIM("u2C" ))T4.title_id
))( RTRIM(("s" "1X" )) UPPER((REVERSE(T3.addr
ess )T5.stor_name ))))))) "6CRtdD" ))"j?k"
)))T3.phone ))), T5.city, T5.stor_address )
--EOQ ORDER BY 1, 6, 5 )
This statement yields an error: SQLState 37000, Error 8623: Internal Query Processor Error: Query processor could not produce a query plan.
7Automation
- Simpler statement with the same error
- SELECT roysched.royalty
- FROM titles, roysched
- WHERE EXISTS (
- SELECT DISTINCT TOP 1 titles.advance
- FROM sales
- ORDER BY 1)
- Control statement attributes
- complexity, kind, depth, ...
- Multi-user stress tests
- tests concurrency, allocation, recovery
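The generate-execute-compare loop above can be sketched as a recursive random walk over a tiny SQL grammar (the real RAGS grammar and its complexity controls are far richer; the schema, probabilities, and helper names below are invented for illustration):

```python
import random

# Toy schema: table name -> column names (loosely pubs-like, invented here).
TABLES = {"titles": ["title_id", "price", "advance"],
          "roysched": ["title_id", "royalty", "lorange"]}

def rand_expr(table: str, depth: int, rng: random.Random) -> str:
    """A random scalar expression: a column, a constant, or a nested function."""
    if depth == 0 or rng.random() < 0.5:
        return rng.choice(TABLES[table] + ["-1", "3.11", "'x'"])
    return f"UPPER({rand_expr(table, depth - 1, rng)})"

def rand_select(depth: int, rng: random.Random) -> str:
    """A random SELECT; depth bounds how far EXISTS subqueries can nest."""
    table = rng.choice(list(TABLES))
    cols = ", ".join(rand_expr(table, depth, rng)
                     for _ in range(rng.randint(1, 3)))
    query = f"SELECT {cols} FROM {table}"
    if depth > 0 and rng.random() < 0.5:
        query += f" WHERE EXISTS ({rand_select(depth - 1, rng)})"
    return query

# Each generated statement is syntactically valid, so it can be run against
# several engines and the results compared, as the slide describes.
print(rand_select(depth=3, rng=random.Random(42)))
```

Seeding the generator makes every failing statement reproducible, which is what makes automatic simplification (as in the slide above) possible.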
8One 4-Vendor RAGS Test: 3 of Them vs. Us
- 60 K SELECTs on MSS, DB2, Oracle, Sybase.
- 17 SQL Server Beta 2 suspects: 1 suspect per 3,350 statements.
- Examined 10 suspects, filed 4 bugs! One duplicate. Assume 3/10 are new.
- Note: this is the SQL Server Beta 2 product. Quality is rising fast (and RAGS sees that).
9Outline
- FileCast Reliable Multicast
- RAGS SQL Testing
- TerraServer (a big DB)
- Sloan Sky Survey (CyberBricks)
- Billion Transactions per day
- Wolfpack Failover
- NTFS IO measurements
- NT-Cluster-Sort
10Billions Of Clients
- Every device will be intelligent
- Doors, rooms, cars
- Computing will be ubiquitous
11Billions of Clients Need Millions of Servers
- All clients networked to servers
- May be nomadic or on-demand
- Fast clients want faster servers
- Servers provide
- Shared Data
- Control
- Coordination
- Communication
[Diagram: mobile and fixed clients connect to servers and super-servers]
12Thesis: Many little beat few big
[Chart: the computing spectrum from pico processor (10 pico-second RAM, 1 MB) through micro, nano, mini, and mainframe (100 TB), disk form factors from 1.8" to 14", price tiers from 10 K to 1 million]
- 1 M SPECmarks, 1 TFLOP
- 10^6 clocks to bulk ram
- Event-horizon on chip
- VM reincarnated
- Multi-program cache, on-chip SMP
- Smoking, hairy golf ball
- How to connect the many little parts?
- How to program the many little parts?
- Fault tolerance?
13Performance = Storage Accesses, not Instructions Executed
- In the old days we counted instructions and IOs
- Now we count memory references
- Processors wait most of the time
[Chart: where the time goes: clock ticks for AlphaSort components: sort, OS, disc wait, memory wait, I-cache miss, B-cache data miss, D-cache miss]
- 70 MIPS; real apps have worse I-cache misses, so they run at 60 MIPS if well tuned, 20 MIPS if not
14Scale Up and Scale Out
- Grow UP with SMP: a 4xP6 is now standard
- Grow OUT with a cluster: a cluster has inexpensive parts
- Cluster of PCs
15Microsoft TerraServer: Scaleup to Big Databases
- Build a 1 TB SQL Server database
- Data must be
- 1 TB
- Unencumbered
- Interesting to everyone everywhere
- And not offensive to anyone anywhere
- Loaded
- 1.5 M place names from Encarta World Atlas
- 3 M Sq Km from USGS (1 meter resolution)
- 1 M Sq Km from Russian Space agency (2 m)
- On the web (the world's largest atlas)
- Sell images with commerce server.
16Microsoft TerraServer Background
- Earth is 500 tera-square-meters (tm²)
- USA is 10 tm²
- 100 tm² of land lie between 70ºN and 70ºS
- We have pictures of 6% of it
- 3 tm² from USGS
- 2 tm² from the Russian Space Agency
- Compress 5:1 (JPEG) to 1.5 TB
- Slice into 10 KB chunks
- Store the chunks in the DB
- Navigate with
- Encarta Atlas
- globe
- gazetteer
- StreetsPlus in the USA
- Someday
- multi-spectral image
- of everywhere
- once a day / hour
17USGS Digital Ortho Quads (DOQ)
- US Geologic Survey
- 4 terabytes
- Most data not yet published
- Based on a CRADA (Cooperative Research and Development Agreement)
- Microsoft TerraServer makes data available.
18Russian Space Agency (SovInformSputnik) SPIN-2 (Aerial Images is the worldwide distributor)
- 1.5-meter geo-rectified imagery of (almost) anywhere
- Almost equal-area projection
- De-classified satellite photos (from 200 km)
- More data coming (1 m)
- Selling imagery on the Internet
- Putting 2 tm² onto Microsoft TerraServer
19Demo
- navigate by coverage map to White House
- Download image
- buy imagery from USGS
- navigate by name to Venice
- buy SPIN-2 image and Kodak photo
- Pop out to the Expedia street map of Venice
- Mention that the DB will double in the next 18 months (2x USGS, 2x SPIN-2)
20Hardware
[Diagram: the map site server and Internet servers connect through a 100 Mbps Ethernet switch to the web servers and the database server; an STK 9710 DLT tape library backs up the AlphaServer 8400]
1 TB database server: AlphaServer 8400, 8 x 440 MHz Alpha cpus, 10 GB DRAM, 324 StorageWorks disks, 10-drive tape library (STK TimberWolf DLT7000)
21The Microsoft TerraServer Hardware
- Compaq AlphaServer 8400
- 8 x 400 MHz Alpha cpus
- 10 GB DRAM
- 324 9.2 GB StorageWorks Disks
- 3 TB raw, 2.4 TB of RAID5
- STK 9710 tape robot (4 TB)
- WindowsNT 4 EE, SQL Server 7.0
22Software
[Diagram: web clients (HTML browser and Java viewer) reach the site over the Internet; Internet Information Server 4.0 hosts the Image Server Active Server Pages, Microsoft Site Server EE, and the Microsoft Automap ActiveX Server; MTS and the TerraServer stored procedures front SQL Server 7, which holds the TerraServer DB and the Automap server; a separate image-delivery application loads data from the image provider site(s)]
23System Management and Maintenance
- Backup and Recovery
- STK 9710 Tape robot
- Legato NetWorker
- SQL Server 7 Backup Restore
- Clocked at 80 MBps peak (≈ 200 GB/hr)
- SQL Server Enterprise Mgr
- DBA Maintenance
- SQL Performance Monitor
24Microsoft TerraServer File Group Layout
- Convert 324 disks into 28 RAID5 sets plus 28 spare drives
- Make 4 WinNT volumes (RAID 50), 595 GB per volume
- Build 30 x 20 GB files on each volume
- The DB is a file group of 120 files
25Image Delivery and Load: incremental load of 4 more TB in the next 18 months
[Diagram: DLT tapes arrive as tar files in \DropN; the LoadMgr database schedules DoJob / Wait-4-Load work for the cutting machines; the ImgCutter writes \DropN \Images over a 100 Mbit Ethernet switch to the TerraServer AlphaServer 8400, its enterprise storage array (3 x 108 9.1 GB drives), and the STK DLT tape library]
Load steps: 10 ImgCutter, 20 Partition, 30 ThumbImg, 40 BrowseImg, 45 JumpImg, 50 TileImg, 55 Meta Data, 60 Tile Meta, 70 Img Meta, 80 Update Place
26Technical Challenge: Key Idea
- Problem: geo-spatial search without geo-spatial access methods (just standard SQL Server)
- Solution: a geo-spatial search key
- Divide the earth into rectangles of 1/48th degree longitude (X) by 1/96th degree latitude (Y)
- Z-transform X and Y into a single Z value, build a B-tree on Z
- Adjacent images are stored next to each other
- Search method
- Latitude and Longitude → X, Y, then Z
- Select on matching Z value
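The Z-transform step is commonly done by bit-interleaving the X and Y cell numbers (a Morton code), so cells that are close on the ground tend to be close in the B-tree. A sketch using the slide's cell sizes; the helper names are my own:

```python
def cell(lon_deg: float, lat_deg: float) -> tuple:
    """Map a point to its grid cell: 1/48th degree in X (longitude),
    1/96th degree in Y (latitude), as on the slide."""
    return int((lon_deg + 180) * 48), int((lat_deg + 90) * 96)

def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Morton (Z-order) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)          # x bits land in even positions
        z |= ((y >> i) & 1) << (2 * i + 1)      # y bits land in odd positions
    return z

# Neighboring cells get nearby Z values, so a single B-tree range scan
# fetches a neighborhood of image tiles.
x0, y0 = cell(-77.0365, 38.8977)                # near the White House
assert z_value(x0 + 1, y0) == z_value(x0, y0) + 1
```

A plain B-tree over this one integer then gives approximate spatial locality without any special-purpose access method, which is the point of the slide.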
27Sloan Digital Sky Survey
- Digital Sky
- 30 TB raw
- 3 TB cooked (1 billion 3 KB objects)
- Want to scan it frequently
- Using CyberBricks
- Current status
- 175 MBps per node
- 24 nodes ⇒ 4 GBps
- 5 minutes to scan the whole archive
28Some Tera-Byte Databases
Kilo Mega Giga Tera Peta Exa Zetta Yotta
- The Web 1 TB of HTML
- TerraServer 1 TB of images
- Several other 1 TB (file) servers
- Hotmail 7 TB of email
- Sloan Digital Sky Survey 40 TB raw, 2 TB
cooked - EOS/DIS (picture of planet each week)
- 15 PB by 2007
- Federal clearinghouse: images of checks
- 15 PB by 2006 (7-year history)
- Nuclear Stockpile Stewardship Program
- 10 Exabytes (???!!)
29Info Capture
- You can record everything you see, hear, or read.
- What would you do with it?
- How would you organize and analyze it?
Video: 8 PB per lifetime (10 GB/hour). Audio: 30 TB (10 KBps). Read or written text: 8 GB (words). See http://www.lesk.com/mlesk/ksg97/ksg.html
30Michael Lesk's Points (www.lesk.com/mlesk/ksg97/ksg.html)
- Soon everything can be recorded and kept
- Most data will never be seen by humans
- Precious resource: human attention
- Auto-summarization and auto-search will be a key enabling technology
31Kilo Mega Giga Tera Peta Exa Zetta Yotta
[Chart: one example at each scale: a letter, a novel, a movie, the Library of Congress (text), LoC (image), LoC (sound + cinema), all photos, all disks, all tapes, all information!]
32Outline
- FileCast Reliable Multicast
- RAGS SQL Testing
- TerraServer (a big DB)
- Sloan Sky Survey (CyberBricks)
- Billion Transactions per day
- Wolfpack Failover
- NTFS IO measurements
- NT-Cluster-Sort
33Scalability
- Scale up to large SMP nodes
- Scale out to clusters of SMP nodes
[Scale examples from the slide: 100 million web hits, 1 billion transactions, 1.8 million mail messages, 4 terabytes of data]
34Billion Transactions per Day Project
- Built a 45-node Windows NT cluster (with help from Intel and Compaq): > 900 disks
- All off-the-shelf parts
- Using SQL Server and DTC distributed transactions
- DebitCredit transaction
- Each node has 1/20th of the DB
- Each node does 1/20th of the work
- 15% of the transactions are distributed
35Billion Transactions Per Day Hardware
- 45 nodes (Compaq Proliant)
- Clustered with 100 Mbps Switched Ethernet
- 140 cpus, 13 GB of RAM, 3 TB of disk.
36 1.2 B tpd
- 1 B tpd ran for 24 hrs.
- Out-of-the-box software
- Off-the-shelf hardware
- AMAZING!
- Sized for 30 days
- Linear growth
- 5 micro-dollars per transaction
37How Much Is 1 Billion Tpd?
- 1 billion tpd ≈ 11,574 tps ≈ 700,000 tpm (transactions/minute)
- AT&T
- 185 million calls per peak day (worldwide)
- Visa 20 million tpd
- 400 million customers
- 250K ATMs worldwide
- 7 billion transactions (card + cheque) in 1994
- New York Stock Exchange
- 600,000 tpd
- Bank of America
- 20 million tpd checks cleared (more than any other bank)
- 1.4 million tpd ATM transactions
- Worldwide Airlines Reservations 250 Mtpd
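The tps/tpm conversion at the top of this slide is straight division (the slide rounds 694,444 tpm up to 700,000):

```python
tpd = 1_000_000_000                  # one billion transactions per day
tps = tpd / (24 * 60 * 60)           # 86,400 seconds per day
tpm = tpd / (24 * 60)                # 1,440 minutes per day
print(f"{tps:,.0f} tps, {tpm:,.0f} tpm")   # 11,574 tps, 694,444 tpm
```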
38NCSA Super Cluster (http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html)
- National Center for Supercomputing Applications, University of Illinois @ Urbana
- 512 Pentium II cpus, 2,096 disks, SAN
- Compaq, HP, Myricom, WindowsNT
- A super computer for $3M
- Classic Fortran/MPI programming
- DCOM programming model
39Outline
- FileCast Reliable Multicast
- RAGS SQL Testing
- TerraServer (a big DB)
- Sloan Sky Survey (CyberBricks)
- Billion Transactions per day
- Wolfpack Failover
- NTFS IO measurements
- NT-Cluster-Sort
40NT Clusters (Wolfpack)
- Scale DOWN to a PDA: WindowsCE
- Scale UP an SMP: TerraServer
- Scale OUT with a cluster of machines
- Single-system image
- Naming
- Protection/security
- Management/load balance
- Fault tolerance
- Wolfpack
- Hot-pluggable hardware and software
41Symmetric Virtual Server Failover Example
[Diagram: Server 1 and Server 2 each host a web site and a database; on failover, each server can take over the other's web-site files and database files]
42Clusters and BackOffice
- Research: instant, transparent failover
- Making BackOffice PlugNPlay on Wolfpack
- Automatic install and configure
- The Virtual Server concept makes it easy
- simpler management concept
- simpler context/state migration
- transparent to applications
- SQL 6.5E and 7.0 failover
- MSMQ (queues), MTS (transactions)
43Next Steps in Availability
- Study the causes of outages
- Build AlwaysUp system
- Two geographically remote sites
- Users get instant and transparent failover to the 2nd site.
- Working with the WindowsNT and SQL Server groups on this.
44Outline
- FileCast Reliable Multicast
- RAGS SQL Testing
- TerraServer (a big DB)
- Sloan Sky Survey (CyberBricks)
- Billion Transactions per day
- Wolfpack Failover
- NTFS IO measurements
- NT-Cluster-Sort
45Storage Latency How Far Away is the Data?
[Chart: storage latency in clock ticks: registers 1, on-chip cache 2, on-board cache 10, memory 100, disk 10^6, tape/optical robot 10^9]
46The Memory Hierarchy
- Measuring and modeling sequential IO
- Where is the bottleneck?
- How does it scale with SMP, RAID, new interconnects?
- Goals: balanced bottlenecks, low overhead, scale to many processors (10s), scale to many disks (100s)
[Diagram: the IO path from the application address space and file cache across the memory bus, PCI, adapter, and SCSI to the controller and disks]
47PAP (Peak Advertised Performance) vs RAP (Real Application Performance)
- Goal: RAP = PAP / 2 (the half-power point)
[Diagram: system bus 422 MBps, PCI 133 MBps, SCSI 40 MBps, each disk 7.2 MB/s; the application sees 10-15 MBps through the file-system buffers]
48The Best Case: Temp File, NO IO
- Temp file read/write hits the file-system cache
- The program uses a small (in-cpu-cache) buffer
- So write/read time is the bus move time (3x better than copy)
- Paradox: the fastest way to move data is to write it, then read it
- This hardware is limited to 150 MBps per processor
49Bottleneck Analysis
- Theoretical bus bandwidth: 422 MBps = 66 MHz x 64 bits
- Memory read/write: 150 MBps
- MemCopy: 50 MBps
- Disk R/W: 9 MBps
50 3 Stripes and You're Out!
- CPU time goes down with request size
- Ftdisk (striping is cheap)
- 3 disks can saturate adapter
- Similar story with UltraWide
51Parallel SCSI Busses Help
- A second SCSI bus nearly doubles read and WCE-write throughput
- Writes need deeper buffers
- The experiment is unbuffered (3-deep WCE)
[Chart: roughly 2x throughput with the second bus]
52File System Buffering and Stripes (UltraWide Drives)
- FS buffering helps small reads
- FS buffered writes peak at 12 MBps
- 3-deep async helps
- Writes peak at 20 MBps
- Reads peak at 30 MBps
53PAP vs RAP
- Reads are easy, writes are hard
- Async writes can match WCE
[Diagram: rates along the IO path: system bus 422 MBps, memory 142 MBps, PCI 133 MBps, SCSI 72 MBps, adapter 40 MBps, file system 31 MBps, disks 9 MBps each, application data 10-15 MBps]
54Bottleneck Analysis
- NTFS read/write with 9 disks, 2 SCSI busses, 1 PCI:
- 65 MBps unbuffered read
- 43 MBps unbuffered write
- 40 MBps buffered read
- 35 MBps buffered write
[Diagram: memory read/write 150 MBps; PCI 70 MBps; each adapter 30 MBps]
55Hypothetical Bottleneck Analysis
- NTFS read/write with 12 disks, 4 SCSI busses, 2 PCI busses (not measured; we had only one PCI bus available, the 2nd one was internal):
- 120 MBps unbuffered read
- 80 MBps unbuffered write
- 40 MBps buffered read
- 35 MBps buffered write
56Year 2002 Disks
- Big disk (10 $/GB)
- 3
- 100 GB
- 150 kaps (k accesses per second)
- 20 MBps sequential
- Small disk (20 $/GB)
- 3
- 4 GB
- 100 kaps
- 10 MBps sequential
- Both running Windows NT 7.0? (see below for why)
57How Do They Talk to Each Other?
- Each node has an OS
- Each node has local resources: a federation.
- Each node does not completely trust the others.
- Nodes use RPC to talk to each other
- CORBA? DCOM? IIOP? RMI?
- One or all of the above.
- Huge leverage in high-level interfaces.
- Same old distributed system story.
[Diagram: applications on two nodes talk via datagrams, streams, and RPC, layered over VIAL/VIPL and the wire(s)]
58Outline
- FileCast Reliable Multicast
- RAGS SQL Testing
- TerraServer (a big DB)
- Sloan Sky Survey (CyberBricks)
- Billion Transactions per day
- Wolfpack Failover
- NTFS IO measurements
- NT-Cluster-Sort
59Penny Sort Ground Rules (http://research.microsoft.com/barc/SortBenchmark)
- How much can you sort for a penny?
- Hardware and software cost
- Depreciated over 3 years
- a $1M system gets about 1 second
- a $1K system gets about 1,000 seconds
- Time (seconds) × SystemPrice ($) = 946,080
- Input and output are disk resident
- Input is
- 100-byte records (random data)
- key is the first 10 bytes
- Must create the output file and fill it with a sorted version of the input file
- Daytona (product) and Indy (special) categories
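The 946,080 constant falls out of the depreciation arithmetic: a penny buys the fraction $0.01/price of the machine's 3-year (94,608,000-second) life:

```python
THREE_YEARS_S = 3 * 365 * 24 * 3600            # 94,608,000 seconds

def penny_seconds(system_price_dollars: float) -> float:
    """Seconds of machine time one penny buys at 3-year depreciation."""
    return 0.01 / system_price_dollars * THREE_YEARS_S   # = 946,080 / price

assert round(penny_seconds(1_000_000)) == 1    # $1M system: about 1 second
assert round(penny_seconds(1_000)) == 946      # $1K system: about 1,000 seconds
```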
60PennySort
- Hardware
- 266 MHz Intel PPro
- 64 MB SDRAM (10 ns)
- Dual Fujitsu DMA 3.2 GB EIDE disks
- Software
- NT Workstation 4.3
- NT 5 sort
- Performance
- sorts 15 M 100-byte records (1.5 GB)
- disk to disk
- elapsed time: 820 sec
- cpu time: 404 sec
61Cluster Sort Conceptual Model
- Multiple Data Sources
- Multiple Data Destinations
- Multiple nodes
- Disks → Sockets → Disk → Disk
[Diagram: records A, B, C scattered across multiple source nodes are exchanged so each destination node ends up holding one sorted key range (AAA, BBB, CCC)]
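The scatter/gather picture above amounts to a one-pass range-partition sort: every node sends each record to the node that owns its key range, then each node sorts locally. A toy sketch (node count, partition rule, and data are illustrative; real exchanges go over sockets):

```python
import random

def cluster_sort(sources: list, nodes: int) -> list:
    """One-pass range-partition sort: scatter each record to the node that
    owns its key range, then sort locally. Concatenating the node outputs
    in node order yields the globally sorted file."""
    buckets = [[] for _ in range(nodes)]
    for source in sources:                         # multiple data sources
        for rec in source:
            key = rec[:10]                         # key is the first 10 bytes
            buckets[key[0] * nodes // 256].append(rec)  # partition on byte 0
    return [sorted(b) for b in buckets]            # local sort on every node

# Three source nodes, four destination nodes, 100-byte records.
rng = random.Random(0)
records = [[rng.randbytes(100) for _ in range(50)] for _ in range(3)]
out = cluster_sort(records, nodes=4)
merged = [r for node in out for r in node]
assert merged == sorted(r for src in records for r in src)
```

Because every record in bucket i compares below every record in bucket i+1, no merge step is needed: reading the node outputs in order is the sorted result.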
62Cluster Install and Execute
- If this is to be used by others, it must be
- Easy to install
- Easy to execute
- Installations of distributed systems take time and can be tedious (AM2, GluGuard)
- Parallel remote execution is non-trivial (GLUnix, LSF)
- How do we keep this simple and built into NTClusterSort?
63Remote Install
- Add a Registry entry to each remote node:
- RegConnectRegistry(), RegCreateKeyEx()
64Cluster Execution
- Setup
- MULTI_QI struct
- COSERVERINFO struct
- Retrieve remote object handle
- from MULTI_QI struct
65SAN: Standard Interconnect
- LAN faster than memory bus?
- 1 GBps links in the lab
- $300 port cost soon
- Port is the computer
[Bandwidths: Gbps Ethernet 110 MBps; PCI-32 70 MBps; UltraWide SCSI 40 MBps; FW SCSI 20 MBps; SCSI 5 MBps]