Title: Scaleable WindowsNT? (Jim Gray, Microsoft Research)
Slide 1: Scaleable WindowsNT?
- Jim Gray, Microsoft Research
- Gray@Microsoft.com
- http://research.Microsoft.com/Gray
Slide 2: Outline
- What is Scalability?
- Why does Microsoft care about ScaleUp?
- Current ScaleUp status
- NT5, SQL7, Exchange
Slide 3: Scale Up and Scale Out
- Grow up with SMP: 4xP6 is now standard
- Grow out with a cluster: the cluster has inexpensive parts
- Cluster of PCs
Slide 4: Billions Of Clients
- Every device will be intelligent
- Doors, rooms, cars...
- Computing will be ubiquitous
Slide 5: Billions Of Clients Need Millions Of Servers
- All clients networked to servers
- May be nomadic or on-demand
- Fast clients want faster servers
- Servers provide
  - Shared data
  - Control
  - Coordination
  - Communication
[Diagram: mobile and fixed clients connected to servers and a super server]
Slide 6: Thesis: Many Little Beat Few Big
[Diagram: the computing spectrum from pico processor (1 MB, 10 pico-second RAM) through micro, nano, and mini to mainframe (100 MB to 100 TB), with disk form factors from 1.8" to 14" and price points from 10 K to 1 million]
- 1 M SPECmarks, 1 TFLOP
- 10^6 clocks to bulk RAM
- Event-horizon on chip, VM reincarnated
- Multiprogram cache, on-chip SMP
- "Smoking, hairy golf ball"
- How to connect the many little parts?
- How to program the many little parts?
- Fault tolerance?
Slide 7: Outline
- What is Scalability?
- Why does Microsoft care about ScaleUp?
- Current ScaleUp status
- NT5, SQL7, Exchange
Slide 8: Scalability
- Scale up to large SMP nodes
- Scale out to clusters of SMP nodes
[Diagram callouts: 100 million web hits, 1 billion transactions, 1.8 million mail messages, 4 terabytes of data]
Slide 9: Commercial NT Clusters
- 16-node Tandem Cluster
  - 64 cpus
  - 2 TB of disk
  - Decision support
- 45-node Compaq Cluster
  - 140 cpus
  - 14 GB DRAM
  - 4 TB RAID disk
  - OLTP (Debit-Credit)
  - 1 B tpd (14 k tps)
Slide 10: Tandem Oracle/NT
- 27,383 tpmC
- $71.50/tpmC
- 4 x 6 cpus
- 384 disks, 2.7 TB

Slide 11: 24 cpus, 384 disks (2.7 TB)
Slide 12: Billion Transactions per Day Project
- Built a 45-node Windows NT Cluster (with help from Intel and Compaq), > 900 disks
- All off-the-shelf parts
- Using SQL Server and DTC distributed transactions
- Debit-Credit transaction
- Each node has 1/20th of the DB
- Each node does 1/20th of the work
- 15% of the transactions are distributed
Slide 13: Billion Transactions Per Day Hardware
- 45 nodes (Compaq Proliant)
- Clustered with 100 Mbps switched Ethernet
- 140 cpus, 13 GB DRAM, 3 TB disk
Slide 14: How Much Is 1 Billion Tpd?
- 1 billion tpd = 11,574 tps, about 700,000 tpm (transactions/minute)
- AT&T
  - 185 million calls per peak day (worldwide)
- Visa: 20 million tpd
  - 400 million customers
  - 250 K ATMs worldwide
  - 7 billion transactions (card + cheque) in 1994
- New York Stock Exchange
  - 600,000 tpd
- Bank of America
  - 20 million tpd checks cleared (more than any other bank)
  - 1.4 million tpd ATM transactions
- Worldwide airlines reservations: 250 M tpd
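The tpd-to-tps conversion above is plain calendar arithmetic; a quick sketch confirming the 11,574 tps figure (numbers from the slide, nothing else assumed):

```python
# Sanity-check the slide's rate conversions for 1 billion transactions/day.
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
MINUTES_PER_DAY = 24 * 60        # 1,440

tpd = 1_000_000_000
tps = tpd / SECONDS_PER_DAY      # transactions per second
tpm = tpd / MINUTES_PER_DAY      # transactions per minute

print(round(tps))  # 11574
print(round(tpm))  # 694444 (the slide rounds this to ~700,000)
```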
Slide 15: Infinite, Ubiquitous Scaling: Redefining the Rules

            Per Sec    Per Min        Per Day
  10K TPC       166     10,000     14,400,000
  1 BTPD     11,574    694,444  1,000,000,000
  1.4 BTPD   16,204    972,222  1,400,000,000

[Diagram: IIS, MTS, COM/ActiveX: all shipping products!]
Slide 16: Microsoft.com: 150x4 nodes
Slide 17: NCSA Super Cluster
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
- National Center for Supercomputing Applications, University of Illinois at Urbana
- 512 Pentium II cpus, 2,096 disks, SAN
- Compaq + HP + Myricom + WindowsNT
- A super computer for $3 M
- Classic Fortran/MPI programming
- DCOM programming model
Slide 18: TPC-C Improved Fast (250%/year!)
- 40% hardware, 100% software, 100% PC technology
Slide 19: Windows NT Versus UNIX
Slide 20: Economy Of Scale
Slide 21: Microsoft TerraServer: Scaleup to Big Databases
- Build a 1 TB SQL Server database
- Data must be
  - 1 TB
  - Unencumbered
  - Interesting to everyone everywhere
  - And not offensive to anyone anywhere
- Loaded
  - 1.5 M place names from Encarta World Atlas
  - 3 M sq km from USGS (1-meter resolution)
  - 1 M sq km from Russian Space Agency (2 m)
- On the web (world's largest atlas)
- Sell images with commerce server
Slide 22: Microsoft TerraServer Background
- Earth is 500 tera-square-meters
  - USA is 10 Tm²
  - 100 Tm² of land between 70°N and 70°S
- We have pictures of 6% of it
  - 3 Tm² from USGS
  - 2 Tm² from Russian Space Agency
- Compress 5:1 (JPEG) to 1.5 TB
- Slice into 10 KB chunks
- Store chunks in DB
- Navigate with
  - Encarta Atlas
    - globe
    - gazetteer
  - StreetsPlus in the USA
- Someday
  - multi-spectral image
  - of everywhere
  - once a day / hour
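The slicing step implies a large but tractable row count; a back-of-envelope sketch using the slide's figures (decimal TB and KB assumed):

```python
# 1.5 TB of JPEG-compressed imagery sliced into 10 KB chunks,
# each chunk stored as one database row.
TB = 10**12
KB = 10**3

compressed_bytes = 1.5 * TB
chunk_bytes = 10 * KB

chunks = compressed_bytes / chunk_bytes
print(f"{chunks:,.0f}")  # 150,000,000 rows of ~10 KB each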
Slide 23: Demo
- Navigate by coverage map to the White House
- Download image
- Buy imagery from USGS
- Navigate by name to Venice
- Buy SPIN-2 image / Kodak photo
- Pop out to Expedia street map of Venice
- Mention that the DB will double in the next 18 months (2x USGS, 2x SPIN-2)
Slide 24: The Microsoft TerraServer Hardware
- Compaq AlphaServer 8400
- 8 x 400 MHz Alpha cpus
- 10 GB DRAM
- 324 x 9.2 GB StorageWorks disks
  - 3 TB raw, 2.4 TB of RAID5
- STK 9710 tape robot (4 TB)
- WindowsNT 4 EE, SQL Server 7.0
Slide 25: Software
[Architecture diagram: web clients (browser with HTML or a Java viewer) reach Internet Information Server 4.0 Active Server Pages over the Internet; MTS invokes TerraServer stored procedures against SQL Server 7 and the TerraServer DB; Microsoft Site Server EE and the Microsoft Automap ActiveX Server (Automap server) support commerce and maps; an Image Delivery Application at the image provider site(s) feeds the database]
Slide 26: Image Delivery and Load
- Incremental load of 4 more TB in the next 18 months
[Load-pipeline diagram: DLT tapes (tar) arrive at \DropN on the cutting machines, which run ImgCutter; LoadMgr and its LoadMgrDB schedule jobs (DoJob, Wait 4 Load) through steps 10 ImgCutter, 20 Partition, 30 ThumbImg, 40 BrowseImg, 45 JumpImg, 50 TileImg, 55 Meta Data, 60 Tile Meta, 70 Img Meta, 80 Update Place; data flows over a 100 Mbit Ethernet switch (\DropN, \Images) to the TerraServer AlphaServer 8400 with an Enterprise Storage Array (3 shelves of 108 x 9.1 GB drives) and an STK DLT tape library; NTBackup writes to DLT tape]
Slide 27: TerraServer: A Real World Example
- Largest DB on the web
- 1.3 TB
- 99.95% uptime since July 1
- No downtime, period, in August
- 70% of downtime was for SQL software upgrades
Slide 28: NT Clusters (Wolfpack)
- Scale DOWN to PDA: WindowsCE
- Scale UP an SMP: TerraServer
- Scale OUT with a cluster of machines
- Single-system image
  - Naming
  - Protection/security
  - Management/load balance
- Fault tolerance
  - Wolfpack
- Hot-pluggable hardware and software
Slide 29: Symmetric Virtual Server Failover Example
[Diagram: Server 1 and Server 2 each host a web site and a database; the web site files and database files sit on shared storage, so on failure either server can take over the other's virtual servers]
Slide 30: Windows NT 5 (Scalability Features)
- Better SMP support
- Clusters
  - 16x packs (fault-tolerant clusters)
  - 100x mobs (arrays for manageability)
- SAN/VIA support
- 64-bit addressing for data
  - Apps like SQL and Oracle will use it for data
  - 64-bit API to NT comes later (in lab now)
- Remote management (scripting and DCOM)
- Active Directory
- Veritas volume manager
- Many 3rd-party HSMs
- Batch support
Slide 31: Microsoft SQL Server 7.0
- Fixes the famous performance bugs
  - dynamic record locking
  - online backup, quick recovery
- 64-bit addressing for the buffer pool
- SMP parallelism and better SMP support
- Built-in OLAP (cubes and MOLAP)
- Scales down to Win9x
- Improved management interfaces
- Data Transformation Services (for warehouses)
Slide 32: Outline
- What is Scalability?
- Why does Microsoft care about ScaleUp?
- Current ScaleUp status
- NT5, SQL7
Slide 33: End
- Other slides would be interesting, but...
Slide 34: Interesting Other Slides (No Time for Them, But...)
- How much information is there?
- IO bandwidth in the Intel world
- Intelligent disks
- SAN/VIA
- NT Cluster Sort
Slide 35: Some Tera-Byte Databases
[Scale ruler: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta]
- The Web: 1 TB of HTML
- TerraServer: 1 TB of images
- Several other 1 TB (file) servers
- Hotmail: 7 TB of email
- Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
- EOS/DIS (picture of the planet each week)
  - 15 PB by 2007
- Federal clearing house: images of checks
  - 15 PB by 2006 (7-year history)
- Nuclear Stockpile Stewardship Program
  - 10 exabytes (???!!)
Slide 36: Info Capture
- You can record everything you see or hear or read.
- What would you do with it?
- How would you organize and analyze it?
- Video: 8 PB per lifetime (10 GB/h)
- Audio: 30 TB (10 KB/s)
- Read or write: 8 GB (words)
- See http://www.lesk.com/mlesk/ksg97/ksg.html
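The lifetime-capture figures follow from the per-hour and per-second rates; a rough check, assuming an illustrative 100-year lifetime recorded around the clock (the lifetime length is my assumption, not the slide's):

```python
# Check the slide's lifetime-capture arithmetic.
# Assumption: ~100-year lifetime, recorded 24 hours a day.
HOURS = 100 * 365 * 24          # ~876,000 hours
SECONDS = HOURS * 3600

video_pb = HOURS * 10e9 / 1e15    # 10 GB/h, expressed in petabytes
audio_tb = SECONDS * 10e3 / 1e12  # 10 KB/s, expressed in terabytes

print(round(video_pb, 1))  # 8.8  (the slide says ~8 PB)
print(round(audio_tb, 1))  # 31.5 (the slide says ~30 TB)
```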
Slide 37: Michael Lesk's Points
www.lesk.com/mlesk/ksg97/ksg.html
- Soon everything can be recorded and kept
- Most data will never be seen by humans
- Precious resource: human attention
- Auto-summarization and auto-search will be a key enabling technology
Slide 38: PAP (Peak Advertised Performance) vs RAP (Real Application Performance)
- Goal: RAP = PAP / 2 (the half-power point)
[Diagram: the application sees only 7.2 MB/s of data through the file system buffers, even though the system bus is rated at 422 MBps, PCI at 133 MBps, SCSI at 40 MBps, and the disk delivers 10-15 MBps]
Slide 39: PAP vs RAP
- Reads are easy, writes are hard
- Async write can match WCE (write cache enable)
[Diagram: rated link speeds (system bus 422 MBps, SCSI 142 MBps and 40 MBps, PCI 133 MBps, disks 10-15 MBps) versus achieved rates of 72, 31, and 9 MBps through the file system]
Slide 40: Bottleneck Analysis
- NTFS read/write: 12 disks, 4 SCSI, 2 PCI (not measured; we had only one PCI bus available, the 2nd one was internal)
  - 120 MBps unbuffered read
  - 80 MBps unbuffered write
  - 40 MBps buffered read
  - 35 MBps buffered write
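The half-power rule reads naturally as a min-over-stages computation: end-to-end throughput cannot exceed the slowest stage, and real throughput should reach at least half of that. A sketch using the rated figures from the preceding slides (stage names and the choice of 15 MBps for the disk are illustrative):

```python
# Pipeline throughput is bounded by the slowest stage.
# Rated (PAP) figures taken from the slides; illustrative only.
stages = {
    "system bus": 422.0,  # MBps
    "PCI": 133.0,
    "SCSI": 40.0,
    "disk": 15.0,         # upper end of the 10-15 MBps range
}

pap = min(stages.values())  # peak the pipeline could advertise
rap_goal = pap / 2          # the half-power point

print(pap)       # 15.0
print(rap_goal)  # 7.5 (close to the 7.2 MB/s the application actually sees)
```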
Slide 41: Year 2002 Disks
- Big disk (10 $/GB)
  - 3.5"
  - 100 GB
  - 150 kaps (k accesses per second)
  - 20 MBps sequential
- Small disk (20 $/GB)
  - 3"
  - 4 GB
  - 100 kaps
  - 10 MBps sequential
- Both running Windows NT 7.0? (see below for why)
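One consequence of these predictions, supporting the "many little beat few big" thesis: capacity grows faster than sequential bandwidth, so a full scan of the big disk takes far longer than a scan of the small one. Simple arithmetic on the slide's numbers (decimal GB/MB assumed):

```python
# Sequential scan time = capacity / sequential rate.
GB = 10**9
MB = 10**6

big_scan_s = 100 * GB / (20 * MB)   # 100 GB disk at 20 MBps
small_scan_s = 4 * GB / (10 * MB)   # 4 GB disk at 10 MBps

print(big_scan_s / 3600)  # ~1.4 hours to read the big disk once
print(small_scan_s)       # 400.0 seconds for the small disk
```

Many small disks also offer more aggregate bandwidth and more independent arms per byte stored, which is exactly the cluster argument of slide 6.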
Slide 42: How Do They Talk to Each Other?
- Each node has an OS
- Each node has local resources: a federation
- Each node does not completely trust the others
- Nodes use RPC to talk to each other
  - CORBA? DCOM? IIOP? RMI?
  - One or all of the above
- Huge leverage in high-level interfaces
- Same old distributed-system story
[Diagram: two application stacks exchanging datagrams, streams, and RPC over VIAL/VIPL and the wire(s)]
Slide 43: SAN: Standard Interconnect
- Gbps Ethernet: 110 MBps
- PCI-32: 70 MBps
- UW SCSI: 40 MBps
- FW SCSI: 20 MBps
- SCSI: 5 MBps
- LAN faster than memory bus?
- 1 GBps links in lab
- $300 port cost soon
- Port is computer
Slide 44: PennySort
- Hardware
  - 266 MHz Intel PPro
  - 64 MB SDRAM (10 ns)
  - Dual Fujitsu DMA 3.2 GB EIDE
- Software
  - NT Workstation 4.3
  - NT 5 sort
- Performance
  - sort 15 M 100-byte records (1.5 GB)
  - disk to disk
  - elapsed time: 820 sec
  - cpu time: 404 sec
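The performance figures above imply a modest sustained rate; a quick derivation from the slide's own numbers:

```python
# Implied throughput of the PennySort run.
records = 15_000_000
record_bytes = 100
elapsed_s = 820

gb_sorted = records * record_bytes / 1e9                  # total data
throughput_mbps = records * record_bytes / 1e6 / elapsed_s

print(gb_sorted)                  # 1.5
print(round(throughput_mbps, 1))  # 1.8 MB/s disk-to-disk
```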
Slide 45: Cluster Sort Conceptual Model
- Multiple data sources
- Multiple data destinations
- Multiple nodes
- Disks -> Sockets -> Disk -> Disk
[Diagram: three nodes each read mixed records (AAA BBB CCC) and scatter them over sockets so each destination disk gathers a single key range]
Slide 46: Cluster Install and Execute
- If this is to be used by others, it must be
  - easy to install
  - easy to execute
- Installations of distributed systems take time and can be tedious (AM2, GluGuard)
- Parallel remote execution is non-trivial (GLUnix, LSF)
- How do we keep this simple and built in to NTClusterSort?
Slide 47: Remote Install
- Add a registry entry to each remote node:
  - RegConnectRegistry()
  - RegCreateKeyEx()
Slide 48: Cluster Execution
- Setup
  - MULTI_QI struct
  - COSERVERINFO struct
- Retrieve remote object handle from the MULTI_QI struct