Scaleable WindowsNT?
Jim Gray, Microsoft Research
1
Scaleable WindowsNT?
  • Jim Gray, Microsoft Research
    Gray@Microsoft.com
    http://research.Microsoft.com/~Gray

2
Outline
  • What is Scalability?
  • Why does Microsoft care about ScaleUp?
  • Current ScaleUp Status?
  • NT5, SQL7, Exchange

3
Scale Up and Scale Out
  • Grow Up with SMP: 4xP6 is now standard
  • Grow Out with Cluster: a cluster has inexpensive parts (a cluster of PCs)
4
Billions Of Clients
  • Every device will be intelligent
  • Doors, rooms, cars
  • Computing will be ubiquitous

5
Billions Of Clients Need Millions Of Servers
  • All clients networked to servers
  • May be nomadic or on-demand
  • Fast clients want faster servers
  • Servers provide
    • Shared Data
    • Control
    • Coordination
    • Communication

(Figure: mobile and fixed clients connected to servers and a super server)
6
Thesis: Many little beat few big

(Figure: the computer spectrum from mainframe, mini, and micro down to nano and pico processors; 1 million / 100 K / 10 K SPECmarks; 10 pico-second RAM; storage from 1 MB to 100 TB; disk form factors from 1.8" to 14")

  • 1 M SPECmarks, 1 TFLOP
  • 10^6 clocks to bulk RAM
  • Event-horizon on chip
  • VM reincarnated
  • Multi-program cache, on-chip SMP
  • "Smoking, hairy golf ball"
  • How to connect the many little parts?
  • How to program the many little parts?
  • Fault tolerance?

7
Outline
  • What is Scalability?
  • Why does Microsoft care about ScaleUp?
  • Current ScaleUp Status?
  • NT5, SQL7, Exchange

8
Scalability
  • Scale up to large SMP nodes
  • Scale out to clusters of SMP nodes

(Callouts: 100 million web hits, 1 billion transactions, 1.8 million mail messages, 4 terabytes of data)
9
Commercial NT Clusters
  • 16-node Tandem Cluster
    • 64 cpus
    • 2 TB of disk
    • Decision support
  • 45-node Compaq Cluster
    • 140 cpus
    • 14 GB DRAM
    • 4 TB RAID disk
    • OLTP (DebitCredit)
    • 1 B tpd (14 k tps)

10
Tandem Oracle/NT
  • 27,383 tpmC
  • $71.50/tpmC
  • 4 x 6 cpus
  • 384 disks (2.7 TB)

11
24 cpu, 384 disks (2.7TB)
12
Billion Transactions per Day Project
  • Built a 45-node Windows NT Cluster (with help
    from Intel and Compaq), > 900 disks
  • All off-the-shelf parts
  • Using SQL Server and DTC distributed transactions
  • DebitCredit Transaction
    • Each node has 1/20th of the DB
    • Each node does 1/20th of the work
    • 15% of the transactions are distributed

13
Billion Transactions Per Day Hardware
  • 45 nodes (Compaq Proliant)
  • Clustered with 100 Mbps Switched Ethernet
  • 140 cpu, 13 GB, 3 TB.

14
How Much Is 1 Billion Tpd?
  • 1 billion tpd = 11,574 tps = ~700,000 tpm
    (transactions/minute)
  • AT&T: 185 million calls per peak day (worldwide)
  • Visa: ~20 million tpd
    • 400 million customers
    • 250K ATMs worldwide
    • 7 billion transactions (card + cheque) in 1994
  • New York Stock Exchange: 600,000 tpd
  • Bank of America
    • 20 million tpd checks cleared (more than any
      other bank)
    • 1.4 million tpd ATM transactions
  • Worldwide Airlines Reservations: 250 Mtpd
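The rate conversions on this slide are simple division; as a back-of-the-envelope check (a sketch, with constants chosen by me):

```python
# Sanity-check the slide's conversions: 1 billion transactions per day,
# spread uniformly over the day.
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
MINUTES_PER_DAY = 24 * 60        # 1,440

tpd = 1_000_000_000
tps = tpd / SECONDS_PER_DAY      # transactions per second
tpm = tpd / MINUTES_PER_DAY      # transactions per minute

print(f"{tps:,.0f} tps, {tpm:,.0f} tpm")  # 11,574 tps, 694,444 tpm
```

This matches the 11,574 tps figure and rounds to the ~700,000 tpm on the slide.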

15
Infinite, Ubiquitous Scaling: Redefining the rules

              Per Sec    Per Min        Per Day
  10K TPC         166     10,000     14,400,000
  1 BTPD       11,574    694,444  1,000,000,000
  1.4 BTPD     16,204    972,222  1,400,000,000

All shipping products: IIS, MTS, COM / ActiveX
16
Microsoft.com: 150 x 4 nodes
17
NCSA Super Cluster
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
  • National Center for Supercomputing Applications,
    University of Illinois at Urbana
  • 512 Pentium II cpus, 2,096 disks, SAN
  • Compaq + HP + Myricom + WindowsNT
  • A Super Computer for $3M
  • Classic Fortran/MPI programming
  • DCOM programming model

18
TPC-C Improved Fast (250%/year!)
40% hardware, 100% software, 100% PC Technology
19
Windows NT Versus UNIX
20
Economy Of Scale
21
Microsoft TerraServer: Scaleup to Big Databases
  • Build a 1 TB SQL Server database
  • Data must be
    • 1 TB
    • Unencumbered
    • Interesting to everyone everywhere
    • And not offensive to anyone anywhere
  • Loaded
    • 1.5 M place names from Encarta World Atlas
    • 3 M sq km from USGS (1-meter resolution)
    • 1 M sq km from the Russian Space Agency (2 m)
  • On the web (world's largest atlas)
  • Sell images with commerce server.

22
Microsoft TerraServer Background
  • Earth is 500 Tera-meters square
    • USA is 10 tm²
    • 100 tm² of land in 70°N to 70°S
    • We have pictures of 6% of it
      • 3 tm² from USGS
      • 2 tm² from the Russian Space Agency
  • Compress 5:1 (JPEG) to 1.5 TB
  • Slice into 10 KB chunks
  • Store chunks in DB
  • Navigate with
    • Encarta Atlas (globe, gazetteer)
    • StreetsPlus in the USA
  • Someday
    • multi-spectral image of everywhere,
      once a day / hour
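The chunk count implied by these numbers is easy to verify; a sketch assuming decimal (not binary) units, since the slide does not say which it uses:

```python
TB = 10**12   # decimal units assumed
KB = 10**3

compressed_bytes = 1.5 * TB      # after 5:1 JPEG compression
chunk_bytes = 10 * KB            # one image tile stored as a DB row

n_chunks = compressed_bytes / chunk_bytes
print(f"{n_chunks:,.0f} chunks")  # 150,000,000 rows of imagery
```

So the 1.5 TB image set becomes roughly 150 million 10 KB rows in the database.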

23
Demo
  • Navigate by coverage map to the White House
  • Download image
  • Buy imagery from USGS
  • Navigate by name to Venice
  • Buy SPIN2 image (Kodak photo)
  • Pop out to Expedia street map of Venice
  • Mention that the DB will double in the next 18 months
    (2x USGS, 2x SPIN2)

24
The Microsoft TerraServer Hardware
  • Compaq AlphaServer 8400
  • 8 x 400 MHz Alpha cpus
  • 10 GB DRAM
  • 324 x 9.2 GB StorageWorks Disks
  • 3 TB raw, 2.4 TB of RAID5
  • STK 9710 tape robot (4 TB)
  • WindowsNT 4 EE, SQL Server 7.0

25
Software

(Figure: three tiers. Web clients (browser with HTML or a Java viewer) cross the Internet to Internet Information Server 4.0 running Active Server Pages; MTS and the TerraServer stored procedures sit in front of SQL Server 7 and the TerraServer DB; Microsoft Site Server EE, the Microsoft Automap ActiveX Server, and the Image Delivery Application serve imagery, fed by image provider site(s).)
26
Image Delivery and Load: Incremental load of 4 more TB in next 18 months

(Figure: the load pipeline. Image providers ship DLT tape (tar / NTBackup) into \DropN; cutting machines run the ImgCutter; the LoadMgr and LoadMgrDB schedule jobs (DoJob, Wait 4 Load) across a 100 Mbit Ether switch into the TerraServer AlphaServer 8400 with its Enterprise Storage Array (3 shelves of 108 x 9.1 GB drives) and STK DLT tape library. Load steps: 10 ImgCutter, 20 Partition, 30 ThumbImg, 40 BrowseImg, 45 JumpImg, 50 TileImg, 55 Meta Data, 60 Tile Meta, 70 Img Meta, 80 Update Place.)
27
TerraServer: A Real World Example
  • Largest DB on the Web
  • 1.3 TB
  • 99.95% uptime since July 1
  • No downtime, period, in August
  • 70% of downtime for SQL software upgrades
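To put the 99.95% figure in perspective, availability percentages translate directly into minutes of downtime; a quick sketch (the annualization is mine, not the slide's):

```python
uptime = 0.9995                      # 99.95% as a fraction
MIN_PER_YEAR = 365 * 24 * 60         # 525,600 minutes

downtime_min = (1 - uptime) * MIN_PER_YEAR
print(f"{downtime_min:.0f} minutes/year")  # ~263 minutes (about 4.4 hours)
```

So 99.95% allows roughly 4.4 hours of downtime per year, or about 5 minutes per week.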

28
NT Clusters (Wolfpack)
  • Scale DOWN to PDA: WindowsCE
  • Scale UP an SMP: TerraServer
  • Scale OUT with a cluster of machines
  • Single-system image
    • Naming
    • Protection/security
    • Management/load balance
  • Fault tolerance
    • Wolfpack
  • Hot pluggable hardware and software

29
Symmetric Virtual Server Failover Example
(Figure: Server 1 and Server 2 each host a Web site and a Database; the Web site files and database files are shared so either server can take over the other's workload on failure.)
30
Windows NT 5 (scalability features)
  • Better SMP support
  • Clusters
    • 16x packs (fault-tolerant clusters)
    • 100x mobs (arrays for manageability)
  • SAN/VIA support
  • 64-bit addressing for data
    • Apps like SQL and Oracle will use it for data
    • 64-bit API to NT comes later (in lab now)
  • Remote management (scripting and DCOM)
  • Active Directory
  • Veritas volume manager
  • Many 3rd-party HSMs
  • Batch support

31
Microsoft SQL Server 7.0
  • Fixes the famous performance bugs
  • dynamic record locking
  • online backup, quick recovery.
  • 64-bit addressing for the buffer pool
  • SMP parallelism and better SMP support
  • Built in OLAP (cubes and MOLAP)
  • Scale down to Win9x
  • Improved management interfaces
  • Data transform services (for warehouses)

32
Outline
  • What is Scalability?
  • Why does Microsoft care about ScaleUp?
  • Current ScaleUp Status?
  • NT5, SQL7

33
end
  • Other slides would be interesting, but...

34
Interesting other slides (no time for them, but...)
  • How much information is there?
  • IO bandwidth in the Intel world
  • Intelligent disks
  • SAN/VIA
  • NT Cluster Sort

35
Some Tera-Byte Databases
(Scale bar: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta)
  • The Web: 1 TB of HTML
  • TerraServer: 1 TB of images
  • Several other 1 TB (file) servers
  • Hotmail: 7 TB of email
  • Sloan Digital Sky Survey: 40 TB raw, 2 TB cooked
  • EOS/DIS (picture of the planet each week):
    15 PB by 2007
  • Federal Clearing house (images of checks):
    15 PB by 2006 (7-year history)
  • Nuclear Stockpile Stewardship Program:
    10 Exabytes (???!!)

36
Info Capture
  • You can record everything you see or hear or read.
  • What would you do with it?
  • How would you organize and analyze it?

Video: 8 PB per lifetime (10 GBph)
Audio: 30 TB (10 KBps)
Read or write: 8 GB (words)
See http://www.lesk.com/mlesk/ksg97/ksg.html
37
Michael Lesk's Points (www.lesk.com/mlesk/ksg97/ksg.html)
  • Soon everything can be recorded and kept
  • Most data will never be seen by humans
  • Precious resource: human attention.
    Auto-Summarization and Auto-Search will be a key
    enabling technology.

38
PAP (peak advertised performance) vs RAP (real application performance)
  • Goal: RAP = PAP / 2 (the half-power point)

(Figure: the I/O path from application through file-system buffers, SCSI, PCI, and the system bus; advertised rates are 10-15 MBps at the application, 40 MBps SCSI, 133 MBps PCI, and 422 MBps system bus, yet the disk delivers 7.2 MB/s at every stage.)
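The half-power goal on this slide can be stated as a one-line predicate and checked against the figure's numbers; a sketch (the function name is mine):

```python
def meets_half_power(pap_mbps, rap_mbps):
    """The slide's goal: real performance should reach at least half of peak."""
    return rap_mbps >= pap_mbps / 2

# From the figure: SCSI advertises 40 MBps but the application path
# delivers only 7.2 MB/s -- far below the half-power point.
print(meets_half_power(40.0, 7.2))    # False

# The PCI measurement on the next slide (72 of 133 MBps) does meet the goal.
print(meets_half_power(133.0, 72.0))  # True
```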
39
PAP vs RAP
  • Reads are easy, writes are hard
  • Async write can match WCE (write-cache enable).

(Figure: the same I/O path with measured rates: 422 MBps system bus, 133 MBps PCI with 72 MBps achieved, 40 MBps SCSI with 31 MBps achieved, 142 MBps from the SCSI disks, 10-15 MBps at the application, 9 MBps through the file system.)
40
Bottleneck Analysis
  • NTFS Read/Write: 12 disks, 4 SCSI, 2 PCI (not
    measured; we had only one PCI bus available, the 2nd
    one was internal)
  • 120 MBps unbuffered read
  • 80 MBps unbuffered write
  • 40 MBps buffered read
  • 35 MBps buffered write
41
Year 2002 Disks
  • Big disk ($10/GB)
    • 3"
    • 100 GB
    • 150 kaps (k accesses per second)
    • 20 MBps sequential
  • Small disk ($20/GB)
    • 3"
    • 4 GB
    • 100 kaps
    • 10 MBps sequential
  • Both running Windows NT 7.0? (see below for why)
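These specs quantify the "many little beat few big" thesis: the small disk has far more accesses per gigabyte and can be scanned much faster. A sketch of the arithmetic (decimal GB assumed; the function names are mine):

```python
# (capacity GB, kaps, sequential MBps) from the slide
big   = (100, 150, 20)
small = (4, 100, 10)

def scan_seconds(cap_gb, _kaps, seq_mbps):
    """Time to read the whole disk sequentially."""
    return cap_gb * 1000 / seq_mbps       # GB -> MB, then divide by MB/s

def kaps_per_gb(cap_gb, kaps, _seq_mbps):
    """Random-access capability per unit of capacity."""
    return kaps / cap_gb

print(scan_seconds(*big), scan_seconds(*small))   # 5000.0 s vs 400.0 s
print(kaps_per_gb(*big), kaps_per_gb(*small))     # 1.5 vs 25.0 kaps/GB
```

A farm of 25 small disks matches the big disk's capacity while offering ~16x the random-access rate per gigabyte.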

42
How Do They Talk to Each Other?
  • Each node has an OS
  • Each node has local resources: a federation.
  • Each node does not completely trust the others.
  • Nodes use RPC to talk to each other
    • CORBA? DCOM? IIOP? RMI?
    • One or all of the above.
  • Huge leverage in high-level interfaces.
  • Same old distributed system story.

(Figure: two application stacks, each layering datagrams, streams, and RPC over VIAL/VIPL and the wire(s).)
43
SAN: Standard Interconnect
  • LAN faster than memory bus?
  • 1 GBps links in lab.
  • $300 port cost soon
  • Port is computer

(Figure: interconnect bandwidths: Gbps Ethernet 110 MBps, PCI-32 70 MBps, UW SCSI 40 MBps, FW SCSI 20 MBps, SCSI 5 MBps.)
44
PennySort
  • Hardware
    • 266 MHz Intel PPro
    • 64 MB SDRAM (10 ns)
    • Dual Fujitsu DMA 3.2 GB EIDE disks
  • Software
    • NT Workstation 4.3
    • NT 5 sort
  • Performance
    • Sort 15 M 100-byte records (1.5 GB)
    • Disk to disk
    • Elapsed time: 820 sec
    • CPU time: 404 sec
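The performance numbers above imply the sort's throughput and CPU utilization; a sketch of the derived figures (the derived quantities are mine, not on the slide):

```python
records = 15_000_000
record_bytes = 100
elapsed_s, cpu_s = 820, 404

recs_per_sec = records / elapsed_s                          # ~18,300 records/s
throughput_mbps = records * record_bytes / elapsed_s / 1e6  # ~1.8 MB/s disk-to-disk
cpu_utilization = cpu_s / elapsed_s                         # ~49%

print(f"{recs_per_sec:,.0f} rec/s, {throughput_mbps:.1f} MB/s, "
      f"{cpu_utilization:.0%} cpu")
```

The run is roughly half CPU-bound and half I/O-bound, which is why the cheap EIDE disks were a reasonable match for the processor.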

45
Cluster Sort Conceptual Model
  • Multiple Data Sources
  • Multiple Data Destinations
  • Multiple nodes
  • Disks -> Sockets -> Disk -> Disk

(Figure: records AAA BBB CCC on source nodes A, B, and C are redistributed so that each destination disk holds a single key range.)
46
Cluster Install & Execute
  • If this is to be used by others, it must be
    • Easy to install
    • Easy to execute
  • Installations of distributed systems take time
    and can be tedious. (AM2, GluGuard)
  • Parallel remote execution is non-trivial.
    (GLUnix, LSF)
  • How do we keep this simple and built in to
    NTClusterSort?

47
Remote Install
  • Add a Registry entry to each remote node.

RegConnectRegistry(), RegCreateKeyEx()
48
Cluster Execution
  • Setup
    • MULTI_QI struct
    • COSERVERINFO struct
  • CoCreateInstanceEx()
  • Retrieve the remote object handle from the
    MULTI_QI struct
  • Invoke methods as usual