1
Scaleable Systems Research at Microsoft
(really: what we do at BARC)
  • Jim Gray, Microsoft Research, Gray@Microsoft.com,
    http://research.Microsoft.com/Gray
  • Presented to the DARPA WindowsNT workshop, 5 Aug 1998, Seattle WA.

2
Outline
  • PowerCast, FileCast: Reliable Multicast
  • RAGS: SQL Testing
  • TerraServer (a big DB)
  • Sloan Sky Survey (CyberBricks)
  • Billion Transactions per day
  • WolfPack Failover
  • NTFS IO measurements
  • NT-Cluster-Sort
  • AlwaysUp

3
Telepresence
  • The next killer app
  • Space shifting
  • Reduce travel
  • Time shifting
  • Retrospective
  • Offer condensations
  • Just-in-time meetings
  • Example: ACM 97
  • NetShow and Web site.
  • More web visitors than attendees
  • People-to-People communication

4
Telepresence Prototypes
  • PowerCast: multicast PowerPoint
  • Streaming: pre-sends the next anticipated slide
  • Sends slides and voice rather than a talking head
    and voice
  • Uses ECSRM for reliable multicast
  • 1000s of receivers can join and leave at any time
  • No server needed, no pre-load of slides
  • Cooperating with NetShow
  • FileCast: multicast file transfer
  • Erasure-encodes all packets (a toy parity-code sketch follows this list)
  • Receivers only need to receive as many bytes as
    the length of the file
  • Multicast IE to solve the Midnight-Madness problem
  • NT SRM: reliable IP multicast library for NT
  • Spatialized Teleconference Station
  • Texture-map faces onto spheres
  • Space-map voices
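A minimal sketch of the erasure-coding idea behind FileCast, using a toy single-parity (k+1, k) code: any k of the k+1 multicast blocks reconstruct the file, so a receiver that misses one block needs no retransmission. The real FileCast/ECSRM scheme is stronger and is not shown; block count and sizes here are illustrative.

  // Toy (k+1, k) single-parity erasure code: XOR of the k data blocks forms
  // one parity block; any k of the k+1 blocks recover the whole file.
  #include <cstddef>
  #include <iostream>
  #include <vector>

  using Block = std::vector<unsigned char>;

  Block xor_blocks(const std::vector<Block>& blocks) {
      Block out(blocks.front().size(), 0);
      for (const Block& b : blocks)
          for (std::size_t i = 0; i < out.size(); ++i) out[i] ^= b[i];
      return out;
  }

  int main() {
      const std::size_t k = 4, block_size = 8;
      std::vector<Block> data(k, Block(block_size));
      for (std::size_t b = 0; b < k; ++b)              // fake file content
          for (std::size_t i = 0; i < block_size; ++i)
              data[b][i] = (unsigned char)(b * 16 + i);

      Block parity = xor_blocks(data);                 // sender multicasts data + parity

      // Receiver missed block 2 but got the other k blocks (3 data + parity).
      std::vector<Block> received = {data[0], data[1], data[3], parity};
      Block recovered = xor_blocks(received);          // XOR of survivors = lost block

      std::cout << (recovered == data[2] ? "recovered lost block\n" : "failed\n");
  }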

5
RAGS: RAndom SQL test Generator
  • Microsoft spends a LOT of money on testing.
    (60% of development according to one source.)
  • Idea: test SQL by
  • generating random correct queries
  • executing the queries against the database
  • comparing results with SQL 6.5, DB2, Oracle, Sybase
  • Being used in SQL 7.0 testing.
  • 375 unique bugs found (since 2/97)
  • Very productive test tool (a toy generator is sketched below)
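A toy illustration of the RAGS idea: walk a tiny SQL grammar with a seeded random generator and emit statements that a harness would run against several engines, diffing the results. The table and columns come from the pubs sample schema used on the next slide; the grammar, the SQL Server-flavored syntax, and the depth knob are drastically simplified stand-ins for the real tool.

  // Toy RAGS-style generator: random but syntactically valid SELECTs over one
  // pubs table. A real harness would run each statement on SQL Server, DB2,
  // Oracle and Sybase and compare the result sets.
  #include <cstdlib>
  #include <iostream>
  #include <string>
  #include <vector>

  const std::vector<std::string> kColumns = {"price", "advance", "royalty"};
  const std::vector<std::string> kFuncs   = {"ABS", "FLOOR", "CEILING"};

  std::string pick(const std::vector<std::string>& v) { return v[std::rand() % v.size()]; }

  // Random scalar expression with bounded depth: a very small version of
  // RAGS's "complexity, kind, depth" controls.
  std::string expr(int depth) {
      if (depth == 0 || std::rand() % 2) return "T0." + pick(kColumns);
      return pick(kFuncs) + "(" + expr(depth - 1) + ")";
  }

  std::string statement(int depth) {
      return "SELECT TOP 3 " + expr(depth) + ", " + expr(depth) +
             " FROM titles T0 WHERE " + expr(depth) + " >= 0 ORDER BY 1";
  }

  int main(int argc, char** argv) {
      std::srand(argc > 1 ? std::atoi(argv[1]) : 1);   // fixed seed => reproducible failures
      for (int i = 0; i < 5; ++i)
          std::cout << statement(3) << ";\n";          // pipe to each DBMS, diff the outputs
  }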

6
Sample Rags Generated Statement
SELECT TOP 3 T1.royalty , T0.price , "Apr 15 1996
10:23AM" , T0.notes FROM titles T0, roysched
T1 WHERE EXISTS ( SELECT DISTINCT TOP 9 3.11 ,
"Apr 15 1996 10:23AM" , T0.advance , (
"<v3VF" (( UPPER(((T2.ord_num "22\0G3"
)T2.ord_num ))("1FL6t15m" RTRIM(
UPPER((T1.title_id ((("MlVCf1kA" "GS?"
)T2.payterms )T2.payterms ))))))(T2.ord_num
RTRIM((LTRIM((T2.title_id T2.stor_id ))"2"
))))), T0.advance , (((-(T2.qty ))/(1.0
))(((-(-(-1 )))( DEGREES(T2.qty )))-(-(( -4
)-(-(T2.qty ))))))(-(-1 )) FROM sales T2 WHERE
EXISTS ( SELECT "fQDs" , T2.ord_date , AVG
((-(7 ))/(1 )), MAX (DISTINCT -1 ),
LTRIM("0IL601H" ), ("jQ\" ((( MAX(T3.phone )
MAX((RTRIM( UPPER( T5.stor_name ))((("<"
"9n0yN" ) UPPER("c" ))T3.zip ))))T2.payterms
) MAX("\?" ))) FROM authors T3, roysched
T4, stores T5 WHERE EXISTS ( SELECT DISTINCT
TOP 5 LTRIM(T6.state ) FROM stores T6 WHERE
( (-(-(5 )))> T4.royalty ) AND (( ( (
LOWER( UPPER((("9W8W>kOa" T6.stor_address
)"P" ))))!= ANY ( SELECT TOP 2 LOWER((
UPPER("B9WIX" )"J" )) FROM roysched T7
WHERE ( EXISTS ( SELECT (T8.city
(T9.pub_id ((">" T10.country ) UPPER(
LOWER(T10.city))))), T7.lorange ,
((T7.lorange )((T7.lorange )(-2 )))/((-5
)-(-2.0 )) FROM publishers T8, pub_info T9,
publishers T10 WHERE ( (-10 )<
POWER((T7.royalty )/(T7.lorange ),1)) AND
(-1.0 BETWEEN (-9.0 ) AND (POWER(-9.0 ,0.0)) )
) --EOQ ) AND (NOT (EXISTS (
SELECT MIN (T9.i3 ) FROM roysched T8, d2 T9,
stores T10 WHERE ( (T10.city
LOWER(T10.stor_id )) BETWEEN (("QNu@WI"
T10.stor_id )) AND ("DT" ) ) AND ("RJ"
BETWEEN ( LOWER(T10.zip )) AND (LTRIM(
UPPER(LTRIM( LOWER(("_\tkd" T8.title_id ))))))
) GROUP BY T9.i3, T8.royalty, T9.i3
HAVING -1.0 BETWEEN (SUM (-( SIGN(-(T8.royalty
))))) AND (COUNT()) ) --EOQ ) )
) --EOQ ) AND (((("iUv" T6.stor_id
)T6.state )T6.city ) BETWEEN ((((T6.zip (
UPPER(("ec4LrP<" ((LTRIM(T6.stor_name )"fax<"
)("5adWhS" T6.zip )))) T6.city ))""
)"?>_0Wi" )) AND (T6.zip ) ) ) AND (T4.lorange
BETWEEN ( 3 ) AND (-(8 )) ) ) ) --EOQ
GROUP BY ( LOWER(((T3.address T5.stor_address
)REVERSE((T5.stor_id LTRIM( T5.stor_address
))))) LOWER(((("ztO5I" "" )("X3FN"
(REVERSE((RTRIM( LTRIM((("kwU" "wyn_S@y"
)(REVERSE(( UPPER(LTRIM("u2C" ))T4.title_id
))( RTRIM(("s" "1X" )) UPPER((REVERSE(T3.address
)T5.stor_name ))))))) "6CRtdD" ))"j?k"
)))T3.phone ))), T5.city, T5.stor_address )
--EOQ ORDER BY 1, 6, 5 )
This statement yields an error: SQLState 37000,
Error 8623: Internal Query Processor Error: Query
processor could not produce a query plan.
7
Automation
  • Simpler statement with the same error:
      SELECT roysched.royalty
      FROM titles, roysched
      WHERE EXISTS (
        SELECT DISTINCT TOP 1 titles.advance
        FROM sales
        ORDER BY 1)
  • Control statement attributes:
    complexity, kind, depth, ...
  • Multi-user stress tests
  • tests concurrency, allocation, recovery

8
One 4-Vendor Rags Test: 3 of them vs. Us
  • 60 k Selects on MSS, DB2, Oracle, Sybase.
  • 17 SQL Server Beta 2 suspects: 1 suspect per
    3,350 statements.
  • Examined 10 suspects, filed 4 bugs! One was a
    duplicate. Assume 3/10 are new.
  • Note: this is the SQL Server Beta 2 product.
    Quality is rising fast (and RAGS sees that).

9
Outline
  • FileCast: Reliable Multicast
  • RAGS: SQL Testing
  • TerraServer (a big DB)
  • Sloan Sky Survey (CyberBricks)
  • Billion Transactions per day
  • Wolfpack Failover
  • NTFS IO measurements
  • NT-Cluster-Sort

10
Billions Of Clients
  • Every device will be intelligent
  • Doors, rooms, cars
  • Computing will be ubiquitous

11
Billions Of Clients Need Millions Of Servers
  • All clients networked to servers
  • May be nomadic or on-demand
  • Fast clients want faster servers
  • Servers provide
  • Shared Data
  • Control
  • Coordination
  • Communication

(Diagram: mobile and fixed clients connected to servers and a super-server)
12
Thesis: Many little beat few big
(Chart: mainframe, mini, micro, nano, and pico processors, spanning
1 million to 100 K to 10 K; 10 pico-second RAM; storage sizes 1 MB, 100 MB,
10 GB, 1 TB, 100 TB; disk form factors 14", 9", 5.25", 3.5", 2.5", 1.8")
1 M SPECmarks, 1 TFLOP, 10^6 clocks to bulk RAM, event-horizon on chip,
VM reincarnated, multiprogram cache, on-chip SMP
  • Smoking, hairy golf ball
  • How to connect the many little parts?
  • How to program the many little parts?
  • Fault tolerance?

13
Performance = Storage Accesses, not Instructions
Executed
  • In the old days we counted instructions and
    IOs
  • Now we count memory references
  • Processors wait most of the time

Where the time goes:
(Chart: clock ticks for AlphaSort components: sort code, OS, disc wait,
memory wait, I-cache miss, B-cache data miss, D-cache miss)
70 MIPS; real apps have worse I-cache misses, so they
run at 60 MIPS if well tuned, 20 MIPS if not
14
Scale Up and Scale Out
Grow Up with SMP: 4xP6 is now standard.
Grow Out with Cluster: a cluster has inexpensive parts.
Cluster of PCs.
15
Microsoft TerraServer: Scale-up to Big Databases
  • Build a 1 TB SQL Server database
  • Data must be
  • 1 TB
  • Unencumbered
  • Interesting to everyone everywhere
  • And not offensive to anyone anywhere
  • Loaded
  • 1.5 M place names from Encarta World Atlas
  • 3 M Sq Km from USGS (1 meter resolution)
  • 1 M Sq Km from Russian Space agency (2 m)
  • On the web (world's largest atlas)
  • Sell images with commerce server.

16
Microsoft TerraServer Background
  • Earth is 500 Tera-meters square
  • USA is 10 tm2
  • 100 tm2 of land in 70ºN to 70ºS
  • We have pictures of 6% of it
  • 3 tsm from USGS
  • 2 tsm from Russian Space Agency
  • Compress 5:1 (JPEG) to 1.5 TB.
  • Slice into 10 KB chunks
  • Store chunks in DB
  • Navigate with
  • Encarta Atlas
  • globe
  • gazetteer
  • StreetsPlus in the USA
  • Someday
  • multi-spectral image
  • of everywhere
  • once a day / hour

17
USGS Digital Ortho Quads (DOQ)
  • US Geologic Survey
  • 4 Tera Bytes
  • Most data not yet published
  • Based on a CRADA
  • Microsoft TerraServer makes data available.

18
Russian Space Agency (SovInformSputnik) SPIN-2
(Aerial Images is Worldwide Distributor)
  • 1.5 Meter Geo Rectified imagery of (almost)
    anywhere
  • Almost equal-area projection
  • De-classified satellite photos (from 200 KM),
  • More data coming (1 m)
  • Selling imagery on Internet.
  • Putting 2 tm2 onto Microsoft TerraServer.

19
Demo
  • navigate by coverage map to White House
  • Download image
  • buy imagery from USGS
  • navigate by name to Venice
  • buy SPIN2 image and Kodak photo
  • Pop out to Expedia street map of Venice
  • Mention that DB will double in next 18 months (2x
    USGS, 2X SPIN2)

20
Hardware
(Diagram: Map Site Server and Internet Servers connected by a 100 Mbps
Ethernet switch to the Web Servers and to the database server: an
AlphaServer 8400 with 8 x 440 MHz Alpha CPUs, 10 GB DRAM, an Enterprise
Storage Array, and an STK 9710 DLT tape library)
1 TB Database Server: AlphaServer 8400 4x400, 10 GB RAM, 324 StorageWorks
disks, 10-drive tape library (STK TimberWolf DLT7000)
21
The Microsoft TerraServer Hardware
  • Compaq AlphaServer 8400
  • 8 x 400 MHz Alpha CPUs
  • 10 GB DRAM
  • 324 9.2 GB StorageWorks Disks
  • 3 TB raw, 2.4 TB of RAID5
  • STK 9710 tape robot (4 TB)
  • WindowsNT 4 EE, SQL Server 7.0

22
Software
(Diagram: web client, an HTML browser or Java viewer, over the Internet;
Internet Information Server 4.0 with Image Server Active Server Pages;
Microsoft Site Server EE; MTS; TerraServer stored procedures in SQL Server 7
over the TerraServer DB; Microsoft Automap ActiveX Server over a second
SQL Server 7; Image Delivery Application fed by the Image Provider Site(s))
23
System Management & Maintenance
  • Backup and Recovery
  • STK 9710 tape robot
  • Legato NetWorker
  • SQL Server 7 Backup & Restore
  • Clocked at 80 MBps (peak) (about 200 GB/hr)
  • SQL Server Enterprise Mgr
  • DBA Maintenance
  • SQL Performance Monitor

24
Microsoft TerraServer File Group Layout
  • Convert 324 disks to 28 RAID5 sets plus 28 spare
    drives
  • Make 4 WinNT volumes (RAID 50), 595 GB per
    volume
  • Build 30 x 20 GB files on each volume
  • DB is a File Group of 120 files

25
Image Delivery and Load: Incremental load of 4
more TB in the next 18 months
(Diagram: DLT tape arrives as tar files in \DropN; the LoadMgr DB issues
DoJob and Wait-4-Load; cutting machines run the ImgCutter; a 100 Mbit
Ethernet switch feeds \DropN and \Images on the TerraServer: AlphaServer
8400, Enterprise Storage Array of 3 x 108 9.1 GB drives, STK DLT tape
library and NTBackup)
Load steps: 10 ImgCutter, 20 Partition, 30 ThumbImg, 40 BrowseImg,
45 JumpImg, 50 TileImg, 55 Meta Data, 60 Tile Meta, 70 Img Meta,
80 Update Place
26
Technical Challenge: Key idea
  • Problem: Geo-spatial search without geo-spatial
    access methods (just standard SQL Server).
  • Solution:
  • Geo-spatial search key:
  • Divide earth into rectangles of 1/48th degree
    longitude (X) by 1/96th degree latitude (Y)
  • Z-transform X and Y into a single Z value, build a
    B-tree on Z
  • Adjacent images stored next to each other
  • Search method (a minimal sketch follows this list):
  • Latitude and Longitude -> X, Y, then Z
  • Select on matching Z value
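A minimal sketch of the search-key idea, assuming 16-bit cell indices: quantize longitude into 1/48-degree cells and latitude into 1/96-degree cells, then interleave the bits of X and Y so that nearby cells get nearby Z values and a plain B-tree on Z clusters adjacent imagery. The exact TerraServer key layout is not shown; the sample coordinates are illustrative.

  // Z-transform sketch: quantize (lon, lat) to grid cells, then bit-interleave
  // X and Y into one Z value that a standard B-tree can index.
  #include <cstdint>
  #include <cstdio>

  // Grid: 1/48 degree of longitude per X cell, 1/96 degree of latitude per Y cell.
  uint32_t cell_x(double lon_deg) { return (uint32_t)((lon_deg + 180.0) * 48.0); }
  uint32_t cell_y(double lat_deg) { return (uint32_t)((lat_deg +  90.0) * 96.0); }

  // Interleave the low 16 bits of x and y: Z = y15 x15 y14 x14 ... y0 x0.
  uint32_t z_value(uint32_t x, uint32_t y) {
      uint32_t z = 0;
      for (int i = 0; i < 16; ++i) {
          z |= ((x >> i) & 1u) << (2 * i);
          z |= ((y >> i) & 1u) << (2 * i + 1);
      }
      return z;
  }

  int main() {
      double lat = 38.8977, lon = -77.0365;            // near the White House
      uint32_t x = cell_x(lon), y = cell_y(lat);
      std::printf("X=%u Y=%u Z=%u\n", (unsigned)x, (unsigned)y, (unsigned)z_value(x, y));
      // To search: convert the requested lat/lon (or the corners of the
      // requested area) to Z and select rows with matching Z values.
  }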

27
Sloan Digital Sky Survey
  • Digital Sky
  • 30 TB raw
  • 3 TB cooked (1 billion 3 KB objects)
  • Want to scan it frequently
  • Using CyberBricks
  • Current status:
  • 175 MBps per node
  • 24 nodes -> 4 GBps
  • 5 minutes to scan the whole archive

28
Some Tera-Byte Databases
(Scale ribbon: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta)
  • The Web: 1 TB of HTML
  • TerraServer: 1 TB of images
  • Several other 1 TB (file) servers
  • Hotmail: 7 TB of email
  • Sloan Digital Sky Survey: 40 TB raw, 2 TB
    cooked
  • EOS/DIS (picture of the planet each week)
  • 15 PB by 2007
  • Federal Clearing house: images of checks
  • 15 PB by 2006 (7-year history)
  • Nuclear Stockpile Stewardship Program
  • 10 Exabytes (???!!)

29
Info Capture
  • You can record everything you see or hear or
    read.
  • What would you do with it?
  • How would you organize and analyze it?

Video: 8 PB per lifetime (10 GB/h). Audio: 30 TB
(10 KB/s). Read or write: 8 GB (words). See
http://www.lesk.com/mlesk/ksg97/ksg.html
(a rough check of these figures follows below)
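A quick back-of-the-envelope check of the video and audio figures above, assuming round-the-clock capture over roughly an 80-year lifetime; the exact assumptions behind Lesk's numbers may differ.

  // Rough check of the lifetime-capture figures (assumed: 24 h/day, 80 years).
  #include <cstdio>

  int main() {
      const double hours   = 80.0 * 365.0 * 24.0;        // ~700,800 hours
      const double seconds = hours * 3600.0;

      double video_gb = 10.0 * hours;                     // 10 GB per hour of video
      double audio_tb = 10e3 * seconds / 1e12;            // 10 KB/s of audio, in TB

      std::printf("video: %.1f PB\n", video_gb / 1e6);    // ~7 PB, i.e. order of 8 PB
      std::printf("audio: %.1f TB\n", audio_tb);          // ~25 TB, i.e. order of 30 TB
  }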
30
Michael Lesk's Points
www.lesk.com/mlesk/ksg97/ksg.html
  • Soon everything can be recorded and kept
  • Most data will never be seen by humans
  • Precious resource: human attention.
    Auto-summarization and auto-search will be a key
    enabling technology.

31
Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta
(Chart placing items on this scale: a letter, a novel, a movie, the Library
of Congress (text), LoC (image), LoC (sound and cinema), all photos, all
disks, all tapes, all information!)
32
Outline
  • FileCast: Reliable Multicast
  • RAGS: SQL Testing
  • TerraServer (a big DB)
  • Sloan Sky Survey (CyberBricks)
  • Billion Transactions per day
  • Wolfpack Failover
  • NTFS IO measurements
  • NT-Cluster-Sort

33
Scalability
100 million web hits
1 billion transactions
  • Scale up to large SMP nodes
  • Scale out to clusters of SMP nodes

1.8 million mail messages
4 terabytes of data
34
Billion Transactions per Day Project
  • Built a 45-node Windows NT Cluster (with help
    from Intel and Compaq), > 900 disks
  • All off-the-shelf parts
  • Using SQL Server and DTC distributed transactions
  • DebitCredit Transaction
  • Each node has 1/20th of the DB
  • Each node does 1/20th of the work
  • 15% of the transactions are distributed
    (a partition-routing sketch follows this list)
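A sketch of how such a workload can be spread across the database nodes, under the assumption that rows are hash-partitioned and that a transaction becomes distributed (needing DTC-style two-phase commit) when the account it debits and the branch it credits live on different nodes. The node count matches the 1/20th partitioning above, but the routing rule itself is illustrative, not the project's actual layout.

  // Partitioned DebitCredit routing sketch: each of kNodes SQL Server nodes
  // owns 1/kNodes of the rows; a transaction touching rows on two different
  // nodes must run as a DTC distributed transaction.
  #include <cstdint>
  #include <cstdio>

  const int kNodes = 20;                       // DB partitions in the cluster

  int node_for_account(uint64_t account_id) { return (int)(account_id % kNodes); }
  int node_for_branch (uint64_t branch_id)  { return (int)(branch_id  % kNodes); }

  bool is_distributed(uint64_t account_id, uint64_t branch_id) {
      return node_for_account(account_id) != node_for_branch(branch_id);
  }

  int main() {
      uint64_t account = 123456789, branch = 42;
      std::printf("account node %d, branch node %d, %s transaction\n",
                  node_for_account(account), node_for_branch(branch),
                  is_distributed(account, branch)
                      ? "distributed (2-phase commit via DTC)" : "local");
  }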

35
Billion Transactions Per Day Hardware
  • 45 nodes (Compaq Proliant)
  • Clustered with 100 Mbps Switched Ethernet
  • 140 cpu, 13 GB, 3 TB.

36
1.2 B tpd
  • 1 B tpd ran for 24 hrs.
  • Out-of-the-box software
  • Off-the-shelf hardware
  • AMAZING!
  • Sized for 30 days
  • Linear growth
  • 5 micro-dollars per transaction

37
How Much Is 1 Billion Tpd?
  • 1 billion tpd = 11,574 tps, about 700,000 tpm
    (transactions/minute)
  • AT&T
  • 185 million calls per peak day (worldwide)
  • Visa: 20 million tpd
  • 400 million customers
  • 250 K ATMs worldwide
  • 7 billion transactions (card and cheque) in 1994
  • New York Stock Exchange
  • 600,000 tpd
  • Bank of America
  • 20 million tpd checks cleared (more than any
    other bank)
  • 1.4 million tpd ATM transactions
  • Worldwide Airlines Reservations: 250 Mtpd

38
NCSA Super Cluster
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
  • National Center for Supercomputing
    Applications, University of Illinois @ Urbana
  • 512 Pentium II CPUs, 2,096 disks, SAN
  • Compaq + HP + Myricom + WindowsNT
  • A Super Computer for $3M
  • Classic Fortran/MPI programming
  • DCOM programming model

39
Outline
  • FileCast: Reliable Multicast
  • RAGS: SQL Testing
  • TerraServer (a big DB)
  • Sloan Sky Survey (CyberBricks)
  • Billion Transactions per day
  • Wolfpack Failover
  • NTFS IO measurements
  • NT-Cluster-Sort

40
NT Clusters (Wolfpack)
  • Scale DOWN to PDA: WindowsCE
  • Scale UP an SMP: TerraServer
  • Scale OUT with a cluster of machines
  • Single-system image
  • Naming
  • Protection/security
  • Management/load balance
  • Fault tolerance
  • Wolfpack
  • Hot-pluggable hardware and software

41
Symmetric Virtual Server Failover Example
(Diagram: Server 1 and Server 2 each host a web site and a database; the
web site files and database files are shared so that either server can take
over the other's virtual servers)
42
Clusters & BackOffice
  • Research: Instant and Transparent failover
  • Making BackOffice PlugNPlay on Wolfpack
  • Automatic install and configure
  • Virtual Server concept makes it easy
  • simpler management concept
  • simpler context/state migration
  • transparent to applications
  • SQL 6.5E and 7.0 Failover
  • MSMQ (queues), MTS (transactions).

43
Next Steps in Availability
  • Study the causes of outages
  • Build AlwaysUp system
  • Two geographically remote sites
  • Users have instant and transparent failover to
    2nd site.
  • Working with WindowsNT and SQL Server groups on
    this.

44
Outline
  • FileCast: Reliable Multicast
  • RAGS: SQL Testing
  • TerraServer (a big DB)
  • Sloan Sky Survey (CyberBricks)
  • Billion Transactions per day
  • Wolfpack Failover
  • NTFS IO measurements
  • NT-Cluster-Sort

45
Storage Latency: How Far Away is the Data?
(Chart, in clock ticks: registers 1, on-chip cache 2, on-board cache 10,
memory 100, disk 10^6, tape/optical robot 10^9)
46
The Memory Hierarchy
  • Measuring and Modeling Sequential IO
  • Where is the bottleneck?
  • How does it scale with
  • SMP, RAID, new interconnects

Goals: balanced bottlenecks, low overhead, scale to
many processors (10s), scale to many disks (100s)
(Diagram: app address space and file cache in memory, connected over the
memory bus, PCI, adapter, controller, and SCSI to the disks)
47
PAP (peak advertised Performance) vs RAP (real
application performance)
  • Goal: RAP = PAP / 2 (the half-power point)

(Diagram of the IO path and its bandwidths: system bus 422 MBps, PCI
133 MBps, SCSI 40 MBps, each disk 7.2 MB/s; the application sees
10-15 MBps through the file system buffers)
48
The Best Case: Temp File, NO IO
  • Temp file Read/Write = File System Cache
  • Program uses a small (in-CPU-cache) buffer.
  • So, write/read time is bus move time (3x better
    than copy)
  • Paradox: the fastest way to move data is to write
    it and then read it.
  • This hardware is limited to 150 MBps per
    processor

49
Bottleneck Analysis
  • Drawn to linear scale

Theoretical Bus Bandwidth: 422 MBps = 66 MHz x 64
bits
Memory Read/Write: 150 MBps
MemCopy: 50 MBps
Disk R/W: 9 MBps
50
3 Stripes and You're Out!
  • CPU time goes down with request size
  • Ftdisk (striping is cheap)
  • 3 disks can saturate adapter
  • Similar story with UltraWide

51
Parallel SCSI Busses Help
  • A second SCSI bus nearly doubles read and WCE
    throughput
  • Write needs deeper buffers
  • Experiment is unbuffered (3-deep WCE)

(Chart: roughly 2x throughput with the second bus)
52
File System Buffering & Stripes (UltraWide Drives)
  • FS buffering helps small reads
  • FS buffered writes peak at 12 MBps
  • 3-deep async helps
  • Write peaks at 20 MBps
  • Read peaks at 30 MBps

53
PAP vs RAP
  • Reads are easy, writes are hard
  • Async write can match WCE.

(Diagram of measured rates along the path: 422 MBps and 142 MBps at the
system bus, 133 MBps and 72 MBps at PCI, 40 MBps and 31 MBps at SCSI,
9 MBps at the disks, 10-15 MBps seen by the application through the file
system)
54
Bottleneck Analysis
  • NTFS Read/Write with 9 disks, 2 SCSI busses, 1 PCI:
  • 65 MBps unbuffered read
  • 43 MBps unbuffered write
  • 40 MBps buffered read
  • 35 MBps buffered write

(Diagram: adapter 30 MBps, memory read/write 150 MBps, PCI 70 MBps,
adapter 70 MBps)
55
Hypothetical Bottleneck Analysis
  • NTFS Read/Write with 12 disks, 4 SCSI, 2 PCI (not
    measured; we had only one PCI bus available, the 2nd
    one was internal)
  • 120 MBps unbuffered read
  • 80 MBps unbuffered write
  • 40 MBps buffered read
  • 35 MBps buffered write
56
Year 2002 Disks
  • Big disk (10 $/GB)
  • 3
  • 100 GB
  • 150 kaps (k accesses per second)
  • 20 MBps sequential
  • Small disk (20 $/GB)
  • 3
  • 4 GB
  • 100 kaps
  • 10 MBps sequential
  • Both running Windows NT 7.0? (see below for why)

57
How Do They Talk to Each Other?
  • Each node has an OS
  • Each node has local resources A federation.
  • Each node does not completely trust the others.
  • Nodes use RPC to talk to each other
  • CORBA? DCOM? IIOP? RMI?
  • One or all of the above.
  • Huge leverage in high-level interfaces.
  • Same old distributed system story.

(Diagram: two application stacks talking over the wire(s) via VIAL/VIPL,
using datagrams, streams, RPC, ?)
58
Outline
  • FileCast: Reliable Multicast
  • RAGS: SQL Testing
  • TerraServer (a big DB)
  • Sloan Sky Survey (CyberBricks)
  • Billion Transactions per day
  • Wolfpack Failover
  • NTFS IO measurements
  • NT-Cluster-Sort

59
Penny Sort Ground Rules
http://research.microsoft.com/barc/SortBenchmark
  • How much can you sort for a penny?
  • Hardware and software cost
  • Depreciated over 3 years
  • $1M system gets about 1 second,
  • $1K system gets about 1,000 seconds.
  • Time (seconds) = 946,080 / SystemPrice ($)
    (a worked example follows this list)
  • Input and output are disk resident
  • Input is
  • 100-byte records (random data)
  • key is the first 10 bytes.
  • Must create the output file and fill it with a sorted
    version of the input file.
  • Daytona (product) and Indy (special) categories
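A small worked version of the time-budget rule: three years is 94,608,000 seconds, so one penny buys 0.01 x 94,608,000 / price = 946,080 / price seconds of a system's life. The price values below are just examples.

  // PennySort time budget: a system depreciated over 3 years gives one penny's
  // worth of time = 0.01 * (3 years in seconds) / price = 946,080 / price seconds.
  #include <cstdio>

  double penny_seconds(double system_price_dollars) {
      const double three_years_sec = 3.0 * 365.0 * 24.0 * 3600.0;   // 94,608,000 s
      return 0.01 * three_years_sec / system_price_dollars;
  }

  int main() {
      std::printf("$1,000,000 system: %.2f s\n", penny_seconds(1e6));   // ~0.95 s
      std::printf("$1,000 system:     %.0f s\n", penny_seconds(1e3));   // ~946 s
  }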

60
PennySort
  • Hardware
  • 266 MHz Intel PPro
  • 64 MB SDRAM (10ns)
  • Dual Fujitsu DMA 3.2GB EIDE
  • Software
  • NT workstation 4.3
  • NT 5 sort
  • Performance
  • sort 15 M 100-byte records (1.5 GB)
  • Disk to disk
  • elapsed time 820 sec
  • cpu time 404 sec

61
Cluster Sort Conceptual Model
  • Multiple data sources
  • Multiple data destinations
  • Multiple nodes
  • Disks -> Sockets -> Disk -> Disk
    (a minimal partition-and-sort sketch follows this slide)

(Diagram: three source nodes each holding a mix of A, B, and C records
partition them over the network so that the destination nodes receive all
the As, all the Bs, and all the Cs respectively)
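A minimal in-memory sketch of the data flow above: every source partitions its records by key range into one bucket per destination, the buckets are "sent" to their destinations (plain vectors stand in for the disks and sockets), and each destination sorts what it received; concatenating the destinations in order yields a globally sorted output. The key ranges and record format are illustrative.

  // Conceptual cluster sort: sources partition by key range, destinations
  // sort their partition; vectors stand in for disks and sockets.
  #include <algorithm>
  #include <cstdio>
  #include <string>
  #include <vector>

  int destination(const std::string& key) {
      // Toy range partition: keys before "b" -> node 0, "b..." -> node 1, rest -> node 2.
      if (key < "b") return 0;
      if (key < "c") return 1;
      return 2;
  }

  int main() {
      const int kNodes = 3;
      std::vector<std::vector<std::string>> source = {   // records on each source node
          {"bbb1", "aaa2", "ccc3"}, {"abc4", "cab5", "bca6"}, {"cba7", "bac8", "acb9"}};

      std::vector<std::vector<std::string>> dest(kNodes); // per-destination buckets
      for (const auto& node : source)                     // "send" each record to its node
          for (const auto& rec : node)
              dest[destination(rec)].push_back(rec);

      for (auto& bucket : dest)                           // each destination sorts locally
          std::sort(bucket.begin(), bucket.end());

      for (const auto& bucket : dest)                     // buckets in order = sorted output
          for (const auto& rec : bucket) std::printf("%s\n", rec.c_str());
  }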
62
Cluster Install & Execute
  • If this is to be used by others, it must be
  • Easy to install
  • Easy to execute
  • Installations of distributed systems take
    time and can be tedious. (AM2, GluGuard)
  • Parallel remote execution is
    non-trivial. (GLUnix, LSF)
  • How do we keep this simple and built into
    NTClusterSort?

63
Remote Install
  • Add a Registry entry to each remote node
    (a sketch using these calls follows below).

RegConnectRegistry(), RegCreateKeyEx()
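A sketch of the remote-install step using the two Win32 calls named above; the machine names, key path, and value are placeholders, and error handling is minimal.

  // Sketch: add a registry entry on each remote node (machine names, key path
  // and value are placeholders). Build with: cl remote_install.cpp advapi32.lib
  #include <windows.h>
  #include <stdio.h>
  #include <string.h>

  bool InstallOnNode(const char* machine) {            // e.g. "\\\\NODE01"
      HKEY remoteHKLM = NULL, appKey = NULL;
      if (RegConnectRegistryA(machine, HKEY_LOCAL_MACHINE, &remoteHKLM) != ERROR_SUCCESS)
          return false;                                // connect to the remote registry

      DWORD disposition = 0;
      LONG rc = RegCreateKeyExA(remoteHKLM, "SOFTWARE\\NTClusterSort", 0, NULL,
                                REG_OPTION_NON_VOLATILE, KEY_WRITE, NULL,
                                &appKey, &disposition);
      if (rc == ERROR_SUCCESS) {                        // record where the binaries live
          const char* path = "C:\\NTClusterSort";
          RegSetValueExA(appKey, "InstallPath", 0, REG_SZ,
                         (const BYTE*)path, (DWORD)strlen(path) + 1);
          RegCloseKey(appKey);
      }
      RegCloseKey(remoteHKLM);
      return rc == ERROR_SUCCESS;
  }

  int main() {
      const char* nodes[] = {"\\\\NODE01", "\\\\NODE02"};
      for (const char* n : nodes)
          printf("%s: %s\n", n, InstallOnNode(n) ? "installed" : "failed");
  }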
64
Cluster Execution
  • Setup:
  • MULTI_QI struct
  • COSERVERINFO struct
  • CoCreateInstanceEx()
  • Retrieve the remote object handle
    from the MULTI_QI struct
  • Invoke methods as usual
    (a sketch follows below)
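A sketch of remote (DCOM) activation with the structures and call listed above. The CLSID and server name are placeholders standing in for the NTClusterSort component; a real client would QueryInterface for the component's own interface instead of IUnknown. Build against ole32.lib and uuid.lib.

  // Remote activation sketch: COSERVERINFO names the node, MULTI_QI asks for
  // the interface(s), CoCreateInstanceEx returns the remote object handle.
  #include <windows.h>
  #include <stdio.h>

  // Placeholder CLSID -- a real component would publish its own.
  static const CLSID CLSID_NTClusterSort =
      {0x12345678, 0x1234, 0x1234, {0x12,0x34,0x12,0x34,0x56,0x78,0x9a,0xbc}};

  int main() {
      CoInitializeEx(NULL, COINIT_MULTITHREADED);

      COSERVERINFO server = {};                 // which remote node to activate on
      server.pwszName = (LPWSTR)L"NODE01";

      MULTI_QI qi = {};                         // which interface(s) we want back
      qi.pIID = &IID_IUnknown;

      HRESULT hr = CoCreateInstanceEx(CLSID_NTClusterSort, NULL, CLSCTX_REMOTE_SERVER,
                                      &server, 1, &qi);
      if (SUCCEEDED(hr) && SUCCEEDED(qi.hr)) {
          IUnknown* obj = qi.pItf;              // remote object handle from MULTI_QI
          // ... QueryInterface for the sort interface and invoke methods as usual ...
          obj->Release();
      } else {
          printf("activation failed: 0x%08lx\n", (unsigned long)(FAILED(hr) ? hr : qi.hr));
      }
      CoUninitialize();
  }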

65
SAN: Standard Interconnect
Gbps Ethernet: 110 MBps
  • LAN faster than memory bus?
  • 1 GBps links in the lab.
  • $300 port cost soon
  • Port is computer

PCI-32: 70 MBps
UW SCSI: 40 MBps
FW SCSI: 20 MBps
SCSI: 5 MBps