Title: Microsoft.com Design for Resilience The Infrastructure of www.microsoft.com, Microsoft Update, and the Download Center
1Microsoft.com: Design for Resilience
The Infrastructure of www.microsoft.com, Microsoft Update, and the Download Center
- Paul Wright, Technology Architect Manager, Microsoft.com Operations, pwright_at_microsoft.com
- Sunjeev Pandey, Senior Director, Microsoft.com Operations, sunjeevp_at_microsoft.com
2Agenda
- Microsoft.com Introduction
- Size and Scale
- Network and System Architecture
- How Do We Do It?
- Questions
3A Brief History Of Microsoft.com
- Microsoft launches www.microsoft.com: information, support, publishing, and hosting
- Microsoft combines Web platform, ops, and content teams; standardization effort begins, with consolidation of hosted systems
- Focus on MSCOM Network Programming and campaign-to-Web integration; single MSCOM group formed (brand, content, site standards, privacy, brand compliance)
- Enable an innovative customer experience online and in-product: Product Info, Support, Dev / ITPro Experience, Customer Intelligence, Profile Mgmt, Enterprise Downloads
4History Of Microsoft.com for Geeks
5Resiliency vs. Disaster Recovery
- Type of Failover and Characteristics
- Disaster Recovery: Reactive, Static, Manual, Backup/Restore
- Resiliency: Proactive, Dynamic, Automatic, Data Mirroring
- Pros: Increased Availability, Improved Performance
- Cons: Higher Initial Costs, More Complexity
6Microsoft.com Operations Team
7Microsoft.com Corporate Reach
- Reach Overview, June 2006
- #6 overall site in the U.S.: 55.7M UU, 36% reach
- #4 site worldwide, reaching 248.5M UU
- Avg 280M UU/month, July 2005 to June 2006
- Reach Surpasses All Corporate Sites
- Apple: ranked #22, 17.8M UU, 11.5% reach
- Netscape: ranked #67, 9.6M UU, 6.2% reach
- Sony: ranked #217, 3.9M UU, 2.6% reach
- Sun: ranked #307, 3.1M UU, 2.0% reach
- IBM: ranked #485, 2.1M UU, 1.4% reach
- (U.S. data provided for relative comparison)
Sources: Nielsen/NetRatings, June 2006 (unique users in millions); worldwide data from comScore Media Metrix, June 2006 (unique users in millions)
8Microsoft.com Quick Facts
- Infrastructure and Application Footprint
- 6 Internet Data Centers and 3 CDN Partnerships
- 120 Web Sites, 1000s of Apps, and 2138 Databases
- 120 Gigabit/sec Bandwidth
- Solutions at High Scale
- www.Microsoft.com
- 17.1M Unique Users/Day and 70M Page Views/Day
- 10K Req/Sec, 300K Concurrent Connections on 80 Servers
- 350 Vroots, 190 IIS Web Apps, and 12 App Pools
- Microsoft Update
- 250M Unique Scans/Day, 18K ASP.NET Req/Sec, 1.1M Concurrent
- 28.2 Billion Downloads for CY 2005
- Egress: MS, Akamai, and Savvis (30-100 Gbit/Sec)
9Web Site Availability
- Externally Measured by Keynote Systems, Inc.
- Benchmark Against Other Large Sites
- Driving Cross-Team Maturity
- Positive Trend in Availability
- 2003: 99.70%
- 2004: 99.78%
- 2005: 99.83%
- 2006: 99.87% YTD
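To put the availability trend in more tangible terms, the sketch below converts each yearly percentage into hours of unavailability per year. It assumes availability can be read as a fraction of wall-clock time, which is a simplification, since an external monitor samples the site at intervals. The function name is illustrative and not part of the original deck.

```python
# Convert an availability percentage into expected downtime per year.
# Assumption: availability is treated as a fraction of wall-clock time.
HOURS_PER_YEAR = 365 * 24  # 8760; leap years ignored


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours per year a site would be unavailable at the given availability."""
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR


# Figures from the slide; 2006 is year-to-date, so its value is a projection.
availability = {2003: 99.70, 2004: 99.78, 2005: 99.83, 2006: 99.87}

for year, pct in availability.items():
    hours = downtime_hours_per_year(pct)
    print(f"{year}: {pct:.2f}% -> about {hours:.1f} hours/year")
```

Read this way, the improvement from 99.70% to 99.87% corresponds to going from roughly 26 hours to roughly 11 hours of unavailability per year.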
10Web Site Availability
11Web Site Availability
12Web Site Availability
- Total Errors and Daily Availability of www.microsoft.com, 2006 YTD
- Constantly monitored and analyzed
- Corrective actions taken as needed
- Total Errors 2006 YTD, grouped per error type
- Content errors: 1% hit on availability
- Only 1.3% of the total errors due to server issues (Service Unavailable, Server Error, Connection Reset)
13Resilient Against What?
Power / Cooling
Security
ISP / Telco
Infrastructure
Virus
Data Center
Unauthorized Access
HW / SW Failure
DDoS Attack
System/Data Corruption
Application
14Infrastructure Architecture
Technologies
GLBS DNS
Caching WALB
DDoS
BGP Broad Peering
HSRP, OSPF
Spanning Tree
Clustering WLB
15High Availability Architecture - Global Solutions
Networking
16High Availability Architecture - Global Solutions
Networking
- Global Solutions
- Content Caching Partners: Akamai and Savvis
- Global Load Balancing via DNS: Web Cluster Level Mgmt
- Health Checking and Automatic Fail-over
- Security Infrastructure
- Cisco Guards: Anomaly Detection and DoS Filtering
- Router ACLs: Allow HTTP/S Only, Exceptions Require Review
- Router Architecture: Cookie Cutter
- Redundant Router and Switch Pairs with VLAN Segregation
- Simple, Scalable, Manageable, Repeatable
- Agility: Quickly Repurpose VLANs as Required
17Enhanced DDoS Protection
18High Availability Architecture - Web Database
Hosting
19High Availability Architecture - Web Database
Hosting
- Standard Hosting Models
- Agility - Quickly Reallocate from System to System
- Efficiency - Less Staffing and Equipment Required
- Consistent Configurations
- Repeatable Infrastructure Architecture
20High Availability Architecture - Web Database
Hosting
- Server Configurations
- Standard Server Hardware: Flexibility
- Identical Baseline O/S, IIS, ASP.NET Configurations
- Build Scripts for consistent site builds
- Application Code and Content Unique per Site
- File, Registry, Service, and Local Security
Attributes Collected for Configuration Auditing
and Reporting
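The configuration auditing described above amounts to comparing what was collected from each server against an agreed baseline. The following is a minimal sketch of that comparison only. The attribute keys and values are hypothetical placeholders, and the deck does not describe how the actual collection or reporting tools work.

```python
from typing import Dict, List


def audit(baseline: Dict[str, str], collected: Dict[str, str]) -> List[str]:
    """Return readable differences between a baseline and one server's snapshot."""
    findings = []
    for key in sorted(baseline.keys() | collected.keys()):
        expected, actual = baseline.get(key), collected.get(key)
        if expected == actual:
            continue  # attribute matches the baseline
        if actual is None:
            findings.append(f"MISSING {key} (expected {expected!r})")
        elif expected is None:
            findings.append(f"EXTRA {key} = {actual!r}")
        else:
            findings.append(f"DRIFT {key}: expected {expected!r}, found {actual!r}")
    return findings


# Hypothetical attribute names, for illustration only.
baseline = {"service:W3SVC": "running", "file:web.config": "hash-aaa"}
server_a = {"service:W3SVC": "stopped", "file:web.config": "hash-aaa",
            "service:extra": "running"}

for line in audit(baseline, server_a):
    print(line)
```

Because every server shares the same baseline, the same check can be run unchanged across a whole cluster, which is one practical benefit of the identical configurations listed on this slide.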
21High Availability Architecture - Web Database
Hosting
- Network Load Balancing (NLB) Clusters
- Main Load Balancing Solution Today
- Server Cluster Sizes: 3-8 Servers/Cluster
- Positives
- Easy Mgmt, Knowledge within Team
- Free with Windows SKUs
- Challenges
- Switch Overhead
- Connection Affinity
- Application Layer Switching
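The connection affinity challenge can be illustrated with a simplified model in which every host applies the same hash to an incoming connection and only the matching host accepts it. This is a sketch of the general idea, not the actual NLB filtering algorithm. The hash choice, function name, and host names are assumptions made for illustration.

```python
import hashlib
from typing import List


def owner(client_ip: str, client_port: int, hosts: List[str],
          affinity: str = "single") -> str:
    """Pick the host that handles a connection.

    affinity="single": hash the client IP only, so a client sticks to one host.
    affinity="none":   hash IP and port, spreading one client's connections.
    """
    if not hosts:
        raise ValueError("cluster has no hosts")
    key = client_ip if affinity == "single" else f"{client_ip}:{client_port}"
    digest = hashlib.sha256(key.encode("ascii")).digest()
    return hosts[int.from_bytes(digest[:8], "big") % len(hosts)]


hosts = ["web1", "web2", "web3"]  # hypothetical host names
print(owner("192.0.2.10", 50001, hosts))          # same host for any port
print(owner("192.0.2.10", 50002, hosts))
print(owner("192.0.2.10", 50002, hosts, "none"))  # may differ per port
```

Note that with a plain modulo, adding or removing a host remaps many clients at once, which is one reason affinity is harder to guarantee here than with an application-layer load balancer.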
22High Availability Architecture - Web Database
Hosting
- Hardware Load Balancing
- Limited Use for App Layer Load Balancing
- Future: Greater Adoption for Non-NLB Features
- Positives
- App Layer Load Balancing
- Connection Affinity
- Challenges
- Added Complexity/Risks
- Costs: Hardware and People
23High Availability Architecture - Collecting,
Monitoring, Reporting
SMTP
MOM
Tools Services Layer
IIS Log Monitor
IMQ
GAL
Cluster Sentinel
Core
Perf
SE Annotations
Keynote
IAdmin
AD
Cisco Guard
24High Availability Architecture - Remote Server
Management
- Integrated Lights Out (iLO) from HP
- Cold Reboot
- Power On/Off
- Debugging Over iLO: No More Crash Cart
- Imaging for Dog Food OS Builds
- RDP Over iLO
- Movement to Lights Out Datacenter
25Global Load Balancing Caching
- Health Checking and Fail-over
- Automated pulling of clusters to watermark
- Removal on demand for maintenance
- Load Shaping and Distribution
- Control load percentages to specific clusters
- Region specific traffic distribution
- Distributing Patches/Files to 300M Clients
- Partnership with 3 Providers
- Akamai, Savvis, MSN
- Load Distributed via Load Balancing
- Functions via DNS Resolution and Custom Logic
from CDNs
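The combination of health checking and load percentages listed above can be sketched as a weighted choice over healthy clusters, made each time a DNS answer is produced. This is an illustrative model under stated assumptions, not the production logic. The cluster names, weights, and the `pick_cluster` function are hypothetical.

```python
import random
from typing import Dict, Optional


def pick_cluster(weights: Dict[str, float], healthy: Dict[str, bool],
                 rng: Optional[random.Random] = None) -> str:
    """Choose a cluster for one DNS answer.

    Unhealthy (or zero-weight) clusters are skipped; the rest are chosen
    in proportion to their configured load share.
    """
    rng = rng or random.Random()
    candidates = {name: w for name, w in weights.items()
                  if healthy.get(name, False) and w > 0}
    if not candidates:
        raise RuntimeError("no healthy cluster available")
    names = list(candidates)
    return rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]


# Hypothetical load shares (percent) and health-check results.
weights = {"dc-west": 50, "dc-east": 30, "cdn": 20}
healthy = {"dc-west": True, "dc-east": False, "cdn": True}

print(pick_cluster(weights, healthy))  # "dc-west" or "cdn", never "dc-east"
```

Setting a cluster's weight to zero models removal on demand for maintenance, while a failed health check models automatic fail-over. In both cases the remaining clusters absorb the traffic in proportion to their shares.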
26Global Load Balancing Caching Intelligent
Load Balancing
27Global Load Balancing Caching - Geo Targeting
- Load Shaping Based on Client Resolver Location
- Direct Traffic to Particular Clusters or Caching Provider as Appropriate
- Customer Experience Enhanced due to Improved Local Proximity
- Load Shaping Based on Client Location
- CDN Provider Proxies Requests and Responds with File Based on Location of Client
28SQL Server 2005 Peer-To-Peer Replication
- Redundancy
- Each server hosts a copy of the database
- Availability
- Individual servers can be patched/upgraded without causing database availability issues
- Performance
- Application calls are load balanced between nodes of the cluster for improved scale-out
- Zero perceived App Downtime
- Eliminate single point of failure for R/W Databases
- Considerations
- Object names, object schema, and publication names should be identical
- Publications must allow schema changes to be replicated
- Updates for a given row should be made only at one database until it has synchronized with its peers
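The last consideration, that a row should be updated at only one database until it has synchronized, is often handled by giving each row a single owning peer for writes. The sketch below shows one possible routing scheme. It is an assumption for illustration and is not described in the deck; the `PeerRouter` class and the peer names are hypothetical.

```python
import zlib
from itertools import cycle
from typing import Iterable


class PeerRouter:
    """Route writes for a row to one fixed peer; spread reads over all peers."""

    def __init__(self, peers: Iterable[str]):
        self._peers = list(peers)
        if not self._peers:
            raise ValueError("at least one peer is required")
        self._reads = cycle(self._peers)

    def peer_for_write(self, primary_key: str) -> str:
        """The same key always maps to the same peer, avoiding conflicting updates."""
        index = zlib.crc32(primary_key.encode("utf-8")) % len(self._peers)
        return self._peers[index]

    def peer_for_read(self) -> str:
        """Any peer holds a full copy, so reads simply rotate."""
        return next(self._reads)


router = PeerRouter(["SQL1", "SQL2", "SQL3"])  # hypothetical peer names
print(router.peer_for_write("customer:42"))    # stable for this key
print(router.peer_for_read(), router.peer_for_read())
```

A real deployment would also need to handle a peer being taken out for patching, for example by reassigning its keys only after its pending changes have replicated.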
29Scaling Out Real World Implementation
- Data Center and Geo redundancy
- Scalable Units
- Content Publishing
- WAN Replication
- End-to-end monitoring
30CPU Utilization Per Platform
Comparative Study: x86 vs. x64
Platform    HTTP Req/Sec    CPU (%)
x86         222             65
x64         216             35
- Key Takeaways
- Huge Gains due to 64-bit H/W and Windows Platforms
- Seamless migration provided with WoW64
- Enabled www.Microsoft.com to leverage saved infrastructure to enable Data Center Redundancy
- App Pool Recycles Eliminated: Enjoying the new 4GB VM address space running under WoW64!!
- Enabled more App Pools, driving further Isolation of Code and Content in shared hosting models
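The size of the gain in the x86 vs. x64 comparison is easier to see when request rate is normalized by CPU. The short calculation below does that, assuming the CPU column is a utilization percentage and that load scales roughly linearly with CPU. Neither assumption is stated on the slide.

```python
def req_per_cpu_point(req_per_sec: float, cpu_pct: float) -> float:
    """Requests per second served for each percentage point of CPU used."""
    return req_per_sec / cpu_pct


x86 = req_per_cpu_point(222, 65)  # slide figures for x86
x64 = req_per_cpu_point(216, 35)  # slide figures for x64

print(f"x86: {x86:.2f} req/sec per CPU %")
print(f"x64: {x64:.2f} req/sec per CPU %")
print(f"x64 / x86 efficiency ratio: {x64 / x86:.2f}x")
print(f"CPU reduction at similar load: {(65 - 35) / 65:.0%}")
```

Under these assumptions the x64 servers do about 1.8 times as much work per unit of CPU, which is consistent with the slide's point that the saved capacity could be redirected to data center redundancy.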
31Windows 32bit vs. 64bit Comparison Comparative
Study Results Windows Update Download System
Perf
Test Case: 64-bit Hardware running 32bit vs 64bit Windows
Metric                            Windows Server 2003 Enterprise Edition SP1    Windows Server 2003 Enterprise x64 Edition
Mbits/Sec Avg                     784                                           976
Concurrent Connections Avg        15,746                                        13,600
Get Req/Sec Avg                   2,000                                         3,400
Get Req/Sec Max                   2,200                                         6,800
CPU Avg (%)                       32                                            60
Application Process (VM Usage)    2GB                                           3.2GB
HTTP 500 Errors                   2                                             0
- Scenario: Stress generated by live HTTP traffic from Windows Update Downloads
- 32bit Application Processes bottlenecked by 2GB Virtual Memory limit vs 4GB capabilities on 64bit operating system, enabling Max Mbits/Sec
- Improved compute times on 64bit increased Req/Sec while lowering Concurrent Connections (i.e., Improved HTTP Request Processing Times)
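The claim that higher Req/Sec with fewer concurrent connections implies faster request processing follows from Little's law (L = lambda x W). The sketch below applies it to the averages in the table, assuming steady state and that each open connection corresponds to one in-flight request. Both are simplifications, so the results are indicative only.

```python
def avg_seconds_per_request(concurrent_connections: float,
                            req_per_sec: float) -> float:
    """Little's law rearranged: W = L / lambda."""
    return concurrent_connections / req_per_sec


win32 = avg_seconds_per_request(15_746, 2_000)  # 32-bit Windows averages
win64 = avg_seconds_per_request(13_600, 3_400)  # 64-bit Windows averages

print(f"32-bit: about {win32:.1f} s per request")
print(f"64-bit: about {win64:.1f} s per request")
```

Under these assumptions the average time a request occupies a connection drops from roughly 7.9 seconds to 4.0 seconds on the 64-bit system.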
32Windows 64bit Analysis Comparative Study
Results www.Microsoft.com Perf
- Objective
- Stress a live production server to identify Max
ability to serve HTTP traffic from
www.Microsoft.com client requests
Test Case: 64-bit Hardware running 64bit Windows (Windows Server 2003 Enterprise x64 Edition)
Metric                         Result
Concurrent Connections Avg     11,697
Connection Attempts/Sec Avg    430
Connection Attempts/Sec Max    577
Get Req/Sec Avg                778
Get Req/Sec Max                956
CPU Avg (%)                    96
33Questions?
34Resources
- http://blogs.technet.com/mscom
- http://blogs.msdn.com/mscomts
35Appendix
36R/O NLB SQL Cluster
- Redundancy - Each server hosts a copy of the database
- SQL1: Read/Write
- SQL2 and SQL3: Read/Only
- Availability
- Individual servers can be patched/upgraded without causing database availability issues
- Performance
- Application calls are load balanced between nodes of the cluster for improved scale-out
37R/W NLB SQL Cluster
- Redundancy - Each server hosts a copy of the database
- SQL1-Read/Write - Consolidator
- SQL2-Primary Read/Write (active)
- SQL3-Logshipping Secondary (stand by)
- Availability
- Single point of failure
- Manual failover takes minutes to complete
- Performance
- Application calls to a database are not load
balanced between the nodes of the cluster
38Mirroring (SQL 2005 SP1)
- Mirroring
- Highest Availability Writes
- Log Shipping for DC Redundancy
- Reduced failover downtime from 10min avg to <1min (planned)
- Considerations
- It works on a per database basis for DBs in full recovery model
- Only one database is available for clients at any time
- Supports two partners and an optional witness server for automated failover
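The role of the optional witness can be sketched as a quorum decision: the mirror promotes itself only when it has lost the principal, still reaches the witness, and the witness has lost the principal as well. The function below is a simplified model of that rule, not SQL Server's implementation. Its name and parameters are illustrative assumptions.

```python
def mirror_should_take_over(mirror_sees_principal: bool,
                            mirror_sees_witness: bool,
                            witness_sees_principal: bool,
                            synchronized: bool) -> bool:
    """Return True if the mirror may promote itself automatically."""
    if mirror_sees_principal:
        return False   # principal still reachable: nothing to do
    if not mirror_sees_witness:
        return False   # no quorum: the mirror cannot decide alone
    if witness_sees_principal:
        return False   # only the partner link failed; principal keeps serving
    return synchronized  # promote only if no committed data would be lost


# Principal lost, witness agrees, databases in sync: automatic failover.
print(mirror_should_take_over(False, True, False, True))   # True
# Mirror isolated from both principal and witness: stay passive.
print(mirror_should_take_over(False, False, False, True))  # False
```

Requiring agreement from the witness is what prevents both partners from serving the database at once after a network partition, in line with the consideration that only one database is available to clients at any time.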
39TCP Window Size How it Works
40TCP Improvements Client Testing
- What Exactly Changed?
- Compound TCP (CTCP) - controls TCP sending window size; interesting when Longhorn (LH) is the server
- Receive Window Auto-Tuning - controls TCP receive window size; interesting when Vista is the client
- Test Scenario
- Clients: Dual boot client (XPSP2 and Vista 5308)
- Test: Download (EN W2KSP4 135MB) from 4 locations (Tukwila, Bay, Florida, and Frankfurt)
- Results
- Corporate network environment - direct Internet connectivity (high speed, low packet loss)
- 57% relative speed gain in low latency scenarios (2-20msec RTT)
- >150% relative speed gain in mid to high latency scenarios (80-180msec RTT)
- Home network environment (Comcast cable modem)
- 40% relative speed gain (16-330msec RTT)
41TCP/IP Throughput Improvements
- Server to server transfer over 20ms RTT Link
- W2K3 to W2K3: 10-12 Mbps
- Longhorn to Longhorn: > 300Mbps
- Vista client Internet download speeds at 160ms RTT: > 2x
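Why window size matters so much at higher latency can be shown with the bandwidth-delay product: a single TCP connection can send at most one window per round trip. The sketch below computes that ceiling and the window needed for a target rate. It ignores packet loss and congestion control, and the 65,535-byte window is used as an example because it is the largest window possible without TCP window scaling.

```python
def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound for one TCP connection: one full window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1e6


def window_needed_bytes(target_mbps: float, rtt_ms: float) -> float:
    """Bandwidth-delay product: bytes in flight needed to sustain target_mbps."""
    return target_mbps * 1e6 / 8 * (rtt_ms / 1000.0)


# Example window: 65,535 bytes (maximum without window scaling).
print(f"{max_throughput_mbps(65_535, 20):.1f} Mbps ceiling at 20 ms RTT")
print(f"{max_throughput_mbps(65_535, 160):.1f} Mbps ceiling at 160 ms RTT")
print(f"{window_needed_bytes(300, 20):,.0f} bytes in flight for 300 Mbps at 20 ms")
```

With a fixed 64 KB window the ceiling is about 26 Mbps at 20 ms and about 3 Mbps at 160 ms, while sustaining 300 Mbps over a 20 ms link needs roughly 750 KB in flight. That gap is consistent with the slide's observation that auto-tuned windows help most on higher-latency paths.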