Microsoft.com Design for Resilience: The Infrastructure of www.microsoft.com, Microsoft Update, and the Download Center

1
Microsoft.com Design for Resilience
The Infrastructure of www.microsoft.com, Microsoft Update, and the Download Center
  • Paul Wright, Technology Architect Manager, Microsoft.com Operations, pwright@microsoft.com
  • Sunjeev Pandey, Senior Director, Microsoft.com Operations, sunjeevp@microsoft.com
2
Agenda
  • Microsoft.com Introduction
  • Size and Scale
  • Network and System Architecture
  • How Do We Do It?
  • Questions

3
A Brief History Of Microsoft.com
  • Microsoft launches www.microsoft.com: information, support, publishing, hosting
  • Microsoft combines Web platform, ops, and content teams; standardization effort begins, consolidation of hosted systems
  • Focus on MSCOM Network Programming and campaign-to-Web integration; single MSCOM group formed: brand, content, site standards, privacy, brand compliance
  • Enable an innovative customer experience online and in-product: product info, support, Dev/ITPro experience, customer intelligence, profile mgmt, enterprise downloads
4
History Of Microsoft.com for Geeks
5
Resiliency vs. Disaster Recovery
  • Disaster Recovery - Type of Failover: Reactive, Static, Manual, Backup/Restore
  • Resiliency - Type of Failover: Proactive, Dynamic, Automatic, Data Mirroring
  • Resiliency Pros: Increased Availability, Improved Performance
  • Resiliency Cons: Higher Initial Costs, More Complexity
6
Microsoft.com Operations Team
7
Microsoft.com Corporate Reach
  • Reach Overview, June '06
  • #6 overall site in U.S.: 55.7M UU for 36% reach
  • #4 site worldwide, reaching 248.5M UU
  • Avg 280M UU/month, July '05 to June '06
  • Reach Surpasses All Corporate Sites
  • Apple ranked #22: 17.8M UU, 11.5% reach
  • Netscape ranked #67: 9.6M UU, 6.2% reach
  • Sony ranked #217: 3.9M UU, 2.6% reach
  • SUN ranked #307: 3.1M UU, 2.0% reach
  • IBM ranked #485: 2.1M UU, 1.4% reach
  • (U.S. data provided for relative comparison)

U.S. data: Nielsen/NetRatings, June 2006 (unique users in millions). Worldwide data: comScore Media Metrix, June 2006 (unique users in millions).
8
Microsoft.com Quick Facts
  • Infrastructure and Application Footprint
  • 6 Internet Data Centers, 3 CDN Partnerships
  • 120 Web Sites, 1000s of Apps, and 2,138 Databases
  • 120 Gigabit/sec Bandwidth
  • Solutions at High Scale
  • www.Microsoft.com
  • 17.1M Unique Users/Day, 70M Page Views/Day
  • 10K Req/Sec, 300K Concurrent Connections on 80 Servers
  • 350 Vroots, 190 IIS Web Apps, 12 App Pools
  • Microsoft Update
  • 250M Unique Scans/Day, 18K ASP.NET Req/Sec, 1.1M Concurrent
  • 28.2 Billion Downloads for CY 2005
  • Egress: MS, Akamai, Savvis (30-100 Gbit/sec)

9
Web Site Availability
  • Externally Measured by Keynote Systems, Inc.
  • Benchmark Against Other Large Sites
  • Driving Cross-Team Maturity - Positive Trend in
    Availability
  • 2003: 99.70%
  • 2004: 99.78%
  • 2005: 99.83%
  • 2006: 99.87% YTD
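To make that trend concrete, the yearly availability percentages can be converted into implied downtime. A minimal sketch (the flat 365-day conversion is an assumption for illustration, not Keynote's measurement methodology):

```python
# Convert a yearly availability percentage into implied downtime minutes,
# to make the 2003-2006 availability trend above concrete.

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime implied by an availability percentage over 365 days."""
    return (1 - availability_pct / 100) * 365 * 24 * 60

for year, pct in [(2003, 99.70), (2004, 99.78), (2005, 99.83), (2006, 99.87)]:
    print(f"{year}: {pct}% -> ~{downtime_minutes_per_year(pct):.0f} min/year")
```

Even at 99.87%, that still allows roughly 11 hours of downtime across a year, which is why each hundredth of a percent matters at this scale.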

10
Web Site Availability
11
Web Site Availability
12
Web Site Availability
  • Total Errors and Daily Availability of www.microsoft.com - '06 YTD
  • Constantly monitored and analyzed
  • Corrective actions taken as needed
  • Total Errors '06 YTD grouped per error type
  • Content errors: 1% hit on availability
  • Only 1.3% of the total errors due to server issues (Service Unavailable, Server Error, Connection Reset)

13
Resilient Against What?
  • Infrastructure: Power / Cooling, ISP / Telco, Data Center
  • Security: Virus, Unauthorized Access, DDoS Attack
  • Application: HW / SW Failure, System/Data Corruption
14
Infrastructure Architecture
Technologies
  • GSLB DNS, Caching, WALB
  • DDoS Protection
  • BGP Broad Peering
  • HSRP, OSPF, Spanning Tree
  • Clustering, WLB
15
High Availability Architecture - Global Solutions Networking
16
High Availability Architecture - Global Solutions Networking
  • Global Solutions
  • Content Caching Partners: Akamai, Savvis
  • Global Load Balancing via DNS: Web Cluster Level Mgmt
  • Health Checking and Automatic Fail-over
  • Security Infrastructure
  • Cisco Guards: Anomaly Detection, DoS Filtering
  • Router ACLs Allow HTTP/S Only; Exceptions Require Review
  • Router Architecture: Cookie Cutter
  • Redundant Router and Switch Pairs with VLAN Segregation
  • Simple, Scalable, Manageable, Repeatable
  • Agility: Quickly Repurpose VLANs as Required
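The health-checking and automatic fail-over flow above can be sketched as follows. Cluster names, VIPs, and the probe shape are hypothetical stand-ins, not the actual GSLB tooling:

```python
# Hypothetical sketch of GSLB-style health checking: each web cluster's VIP
# is handed out in DNS answers only while its health probe passes, so a
# failed cluster is pulled from rotation automatically.

clusters = {
    "us-east": "10.0.1.10",
    "us-west": "10.0.2.10",
    "europe":  "10.0.3.10",
}

def dns_answer(health: dict) -> list:
    """Return the VIPs of clusters whose last health probe succeeded."""
    return [vip for name, vip in clusters.items() if health.get(name, False)]

# If the probe against "europe" fails, its VIP drops out of DNS answers.
print(dns_answer({"us-east": True, "us-west": True, "europe": False}))
```

A failed cluster therefore stops receiving new clients as soon as cached DNS answers expire, which is why low TTLs matter for this style of fail-over.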

17
Enhanced DDoS Protection
18
High Availability Architecture - Web & Database Hosting
19
High Availability Architecture - Web & Database Hosting
  • Standard Hosting Models
  • Agility - Quickly Reallocate from System to System
  • Efficiency - Less Staffing & Equipment Required
  • Consistent Configurations
  • Repeatable Infrastructure Architecture

20
High Availability Architecture - Web & Database Hosting
  • Server Configurations
  • Standard Server Hardware for Flexibility
  • Identical Baseline OS, IIS, ASP.NET Configurations
  • Build Scripts for Consistent Site Builds
  • Application Code & Content Unique per Site
  • File, Registry, Service, and Local Security Attributes Collected for Configuration Auditing and Reporting
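The attribute collection for configuration auditing might look like this minimal sketch. It covers only file attributes (the slide also lists registry, service, and security attributes), and the paths and helper names are illustrative, not the production tooling:

```python
# Snapshot file attributes (size + SHA-1) for a set of config paths, then
# report drift against a stored baseline -- the core of config auditing.
import hashlib

def snapshot(paths: list) -> dict:
    """Map each file path to (size, sha1-digest) for later comparison."""
    result = {}
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        result[path] = (len(data), hashlib.sha1(data).hexdigest())
    return result

def drift(baseline: dict, current: dict) -> list:
    """Paths whose attributes changed (or newly appeared) since the baseline."""
    return [p for p, attrs in current.items() if baseline.get(p) != attrs]
```

A scheduled job could snapshot every server in a cluster and diff against a golden baseline, which is how identical configurations stay identical over time.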

21
High Availability Architecture - Web & Database Hosting
  • Network Load Balancing (NLB) Clusters
  • Main Load Balancing Solution Today
  • Server Cluster Sizes: 3-8 Servers/Cluster
  • Positives
  • Easy Mgmt; Knowledge within Team
  • Free with Windows SKUs
  • Challenges
  • Switch Overhead
  • Connection Affinity
  • Application Layer Switching
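The connection-affinity challenge comes from how clients map to hosts. A minimal sketch of the idea behind source-IP ("single") affinity, with invented server names; real NLB uses its own hashing and host-priority rules, so this only shows the concept:

```python
# Deterministically map a client's source IP to one cluster host, so repeat
# connections from that client land on the same server (session affinity).
import hashlib

servers = ["web01", "web02", "web03", "web04"]

def pick_server(client_ip: str) -> str:
    """Hash the source IP to a host; stable while cluster membership is stable."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

The sketch also shows the challenge itself: when a host joins or leaves, the modulus changes and many clients lose their affinity at once.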

22
High Availability Architecture - Web & Database Hosting
  • Hardware Load Balancing
  • Limited Use for App Layer Load Balancing
  • Future: Greater Adoption for Non-NLB Features
  • Positives
  • App Layer Load Balancing
  • Connection Affinity
  • Challenges
  • Added Complexity/Risks
  • Costs: Hardware & People

23
High Availability Architecture - Collecting, Monitoring, Reporting
  • Tools & Services Layer: SMTP, MOM, IIS Log Monitor, IMQ, GAL, Cluster Sentinel, Core Perf, SE Annotations, Keynote, IAdmin, AD, Cisco Guard
24
High Availability Architecture - Remote Server
Management
  • Integrated Lights-Out (iLO) from HP
  • Cold Reboot
  • Power On/Off
  • Debugging Over iLO - No More Crash Cart
  • Imaging for Dogfood OS Builds
  • RDP Over iLO
  • Movement to a Lights-Out Datacenter

25
Global Load Balancing & Caching
  • Health Checking and Fail-over
  • Automated pulling of clusters to watermark
  • Removal on demand for maintenance
  • Load Shaping & Distribution
  • Control load percentages to specific clusters
  • Region-specific traffic distribution
  • Distributing Patches/Files to 300M Clients
  • Partnership with 3 Providers: Akamai, Savvis, MSN
  • Load Distributed via Load Balancing
  • Functions via DNS Resolution and Custom Logic from CDNs
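Controlling load percentages to specific clusters can be sketched as weighted random selection. The target names and weight values below are invented for illustration, not the deck's actual configuration:

```python
# Distribute requests across egress targets in proportion to configured
# weight percentages -- the essence of load shaping across clusters/CDNs.
import random

weights = {"MS": 40, "Akamai": 35, "Savvis": 25}  # illustrative percentages

def choose_target(rng: random.Random) -> str:
    """Pick one target, proportionally to its configured weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
picks = [choose_target(rng) for _ in range(10_000)]
print({n: picks.count(n) / len(picks) for n in weights})  # roughly 0.40 / 0.35 / 0.25
```

Dialing a weight to zero drains a cluster for maintenance, which mirrors the "removal on demand" bullet above.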

26
Global Load Balancing & Caching - Intelligent Load Balancing
27
Global Load Balancing & Caching - Geo Targeting
  • Load Shaping Based on Client Resolver Location
  • Direct Traffic to Particular Clusters or Caching Provider as Appropriate
  • Customer Experience Enhanced due to Improved Local Proximity
  • Load Shaping Based on Client Location
  • CDN Provider Proxies Requests & Responds with File Based on Location of Client

28
SQL Server 2005 Peer-To-Peer Replication
  • Redundancy
  • Each server hosts a copy of the database
  • Availability
  • Individual servers can be patched/upgraded
    without causing database availability issues
  • Performance
  • Application calls are load balanced between nodes
    of the cluster for improved scale-out
  • Zero perceived App Downtime
  • Eliminate single point of failure for R/W
    Databases
  • Considerations
  • Object names, object schema, and publication
    names should be identical
  • Publications must allow schema changes to be
    replicated
  • Updates for a given row should be made only at
    one database until it has synchronized with its
    peers
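The last consideration (updating a given row at only one peer) is typically enforced by partitioning write ownership in the application tier. A minimal sketch, where the peer names and the key-modulo partitioning scheme are assumptions for illustration:

```python
# Route all writes for a given row to a single owning peer (here, by primary
# key modulo peer count), so peers never issue conflicting updates before
# peer-to-peer replication has converged.
peers = ["SQLPEER1", "SQLPEER2", "SQLPEER3"]

def write_owner(primary_key: int) -> str:
    """The one peer allowed to update this row; all peers can serve reads."""
    return peers[primary_key % len(peers)]
```

Because every peer holds a full copy, reads can go anywhere; only the write path needs this routing to avoid conflicts.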

29
Scaling Out: Real-World Implementation
  • Data Center and Geo redundancy
  • Scalable Units
  • Content Publishing
  • WAN Replication
  • End-to-end monitoring

30
CPU Utilization Per Platform
Comparative Study: x86 vs. x64
  • x86: 222 HTTP Req/Sec at 65% CPU
  • x64: 216 HTTP Req/Sec at 35% CPU
  • Key Takeaways
  • Huge Gains due to 64-bit H/W & Windows Platforms
  • Seamless migration provided with WoW64
  • Enabled www.Microsoft.com to leverage saved infrastructure to enable Data Center Redundancy
  • App Pool Recycles Eliminated - Enjoying the new 4GB VM address space running under WoW64!
  • Enabled more App Pools, driving further Isolation of Code & Content in shared hosting models

31
Windows 32-bit vs. 64-bit Comparison
Comparative Study Results: Windows Update Download System Perf

Test Case: 64-bit Hardware running 32-bit vs. 64-bit Windows

                                 Windows Server 2003       Windows Server 2003
                                 Enterprise Edition SP1    Enterprise x64 Edition
Mbits/Sec Avg                    784                       976
Concurrent Connections Avg       15,746                    13,600
Get Req/Sec Avg                  2,000                     3,400
Get Req/Sec Max                  2,200                     6,800
CPU Avg                          32%                       60%
Application Process (VM Usage)   2GB                       3.2GB
HTTP 500 Errors                  2                         0

Scenario: Stress generated by live HTTP traffic from Windows Update downloads. 32-bit application processes were bottlenecked by the 2GB virtual memory limit vs. 4GB capabilities on the 64-bit operating system, enabling higher Max Mbits/Sec. Improved compute times on 64-bit increased Req/Sec while lowering Concurrent Connections (i.e., improved HTTP request processing times).
32
Windows 64-bit Analysis
Comparative Study Results: www.Microsoft.com Perf
  • Objective
  • Stress a live production server to identify max ability to serve HTTP traffic from www.Microsoft.com client requests

Test Case: 64-bit Hardware running 64-bit Windows
Windows Server 2003 Enterprise x64 Edition - Results

Concurrent Connections Avg       11,697
Connection Attempts/Sec Avg      430
Connection Attempts/Sec Max      577
Get Req/Sec Avg                  778
Get Req/Sec Max                  956
CPU Avg                          96%
33
Questions?
34
Resources
  • http://blogs.technet.com/mscom
  • http://blogs.msdn.com/mscomts

35
Appendix
36
R/O NLB SQL Cluster
  • Redundancy - Each server hosts a copy of the database
  • SQL1: Read/Write
  • SQL2 & SQL3: Read-Only
  • Availability
  • Individual servers can be patched/upgraded
    without causing database availability issues
  • Performance
  • Application calls are load balanced between nodes
    of the cluster for improved scale-out
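The load balancing of application calls in this topology amounts to read/write splitting. A simplified sketch using the node names from the slide; the statement-prefix heuristic is illustrative, not the production data access layer:

```python
# Send writes to the single R/W node and round-robin reads across the
# read-only copies, matching the R/O NLB SQL cluster layout above.
import itertools

WRITE_NODE = "SQL1"
_read_nodes = itertools.cycle(["SQL2", "SQL3"])

def route(statement: str) -> str:
    """Crude statement-type routing: writes -> SQL1, reads -> SQL2/SQL3."""
    is_write = statement.lstrip().upper().startswith(("INSERT", "UPDATE", "DELETE"))
    return WRITE_NODE if is_write else next(_read_nodes)
```

Spreading reads over SQL2 and SQL3 is what delivers the scale-out, while the single write node keeps the data model simple.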

37
R/W NLB SQL Cluster
  • Redundancy - Each server hosts a copy of the database
  • SQL1: Read/Write - Consolidator
  • SQL2: Primary Read/Write (active)
  • SQL3: Log-shipping Secondary (standby)
  • Availability
  • Single point of failure
  • Manual failover takes minutes to complete
  • Performance
  • Application calls to a database are not load
    balanced between the nodes of the cluster

38
Mirroring (SQL 2005 SP1)
  • Mirroring
  • Highest Availability for Writes
  • Log Shipping for DC Redundancy
  • Reduced failover downtime from 10 min avg to <1 min (planned)
  • Considerations
  • It works on a per-database basis for DBs in the full recovery model
  • Only one database is available for clients at any time
  • Supports two partners and an optional witness server for automated failover

39
TCP Window Size - How It Works
40
TCP Improvements Client Testing
  • What Exactly Changed?
  • Compound TCP (CTCP) - controls TCP sending window size; interesting when Longhorn is the server
  • Receive Window Auto-Tuning - controls TCP receive window size; interesting when Vista is the client
  • Test Scenario
  • Clients: Dual-boot client (XP SP2 & Vista 5308)
  • Test Download (EN W2K SP4, 135MB) from 4 locations (Tukwila, Bay, Florida & Frankfurt)
  • Results
  • Corporate network environment - direct Internet connectivity (high speed, low packet loss)
  • 57% relative speed gain in low-latency scenarios (2-20 msec RTT)
  • >150% relative speed gain in mid- to high-latency scenarios (80-180 msec RTT)
  • Home network environment (Comcast cable modem)
  • 40% relative speed gain (16-330 msec RTT)

41
TCP/IP Throughput Improvements
  • Server-to-server transfer over a 20ms RTT link
  • W2K3 -> W2K3: 10-12 Mbps
  • Longhorn -> Longhorn: >300 Mbps
  • Vista client Internet download speeds (160ms RTT): >2x
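These numbers follow from the bandwidth-delay product: a TCP sender can have at most one window of unacknowledged data in flight per round trip, so single-connection throughput is capped at window size divided by RTT. A quick check, using illustrative window sizes rather than measured MSCOM values:

```python
# Upper bound on single-connection TCP throughput: window_bytes / RTT.

def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Throughput ceiling in Mbit/s for a given window size and round-trip time."""
    return window_bytes * 8 / (rtt_ms / 1000) / 1e6

print(max_throughput_mbps(64 * 1024, 20))    # ~26 Mbit/s ceiling for a fixed 64 KB window
print(max_throughput_mbps(1024 * 1024, 20))  # ~419 Mbit/s ceiling once auto-tuned to 1 MB
```

Working backwards, the observed 10-12 Mbit/s on W2K3 implies an effective window of only about 25-30 KB on a 20 ms link, while Receive Window Auto-Tuning growing the window toward 1 MB lifts the ceiling past 400 Mbit/s, consistent with the >300 Mbps Longhorn result.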