Title: John C. Calvin
1Design to Debug, Build to Last
- John C. Calvin
- Senior Systems Architect
- University of Toronto
2Toronto, Ontario CANADA (YYZ)
- population 5.6 million
- about the population of Chicago
- area 2,751 mi² (7,125 km²)
- 4 universities, 7 colleges
3University of Toronto
- 3 Campuses in the Greater Toronto Area
- Downtown: 52,296 students, 175 acres
- Eastern: 10,465 students, 300 acres
- Western: 10,924 students, 200 acres
- 1.4 Billion annual operating budget
- Degree Programs
- 840 undergraduate
- 520 graduate
- 75 doctoral
- 14,500 annual undergraduate intake
- 10,000 students in residence
- 247 buildings
- 200 acres net assignable space
- 7,591 parking spaces
4University of Toronto
- Students
- 55,352 (46,940 FTEs) undergraduate degree-seeking students
- 13,702 (12,499 FTEs) graduate degree-seeking students
- 2,038 (902 FTEs) certificate, diploma and special students
- 2,593 residents and post-graduate medical students
- International students
- 5,182 undergraduate degree-seeking students
- 1,579 graduate degree-seeking students
- 326 certificate, diploma and special students
- 779 residents and post-graduate medical students
- Faculty
- 2,260 (2,185 FTEs) Professorial
- 432 (378 FTEs) Teaching Stream
- 1,079 (216 FTEs) Term-limited Sessional and Stipendiary
- 3,913 (2,351 FTEs) Clinical
- 2,707 (1,071 FTEs) Other
5Agenda
- System Design
- Implementation
- Observations
6Blackboard System Overview
7Raw System Statistics
- System Activity
- 225,703 average pages per day
- 414,415 peak pages per day
- User Accounts
- 224,192 defined
- 59,813 active
8Raw System Statistics
- 600,000 hits/hour (HTTPS GET/POST)
- 1,200-1,900 concurrent users (10-minute interval)
- 1 terabyte/hour (JVM memory usage)
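To put the hourly figures above on a per-second footing, here is a minimal sketch. It assumes a decimal terabyte (10^12 bytes) and a uniform rate across the hour, neither of which the slide states.

```python
# Figures quoted on the slide above.
HITS_PER_HOUR = 600_000
JVM_BYTES_PER_HOUR = 10**12  # assumption: 1 terabyte = 10^12 bytes

SECONDS_PER_HOUR = 3600


def per_second(per_hour: float) -> float:
    """Convert an hourly rate into a per-second rate."""
    return per_hour / SECONDS_PER_HOUR


hits_per_second = per_second(HITS_PER_HOUR)
megabytes_per_second = per_second(JVM_BYTES_PER_HOUR) / 10**6

print(f"{hits_per_second:.0f} hits/s")
print(f"{megabytes_per_second:.0f} MB/s of JVM memory")
```

Real traffic is bursty, so these averages understate the peaks the system has to absorb.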
9Raw System Statistics
10Production Hardware Components
- F5 BIG-IP LTM load-balancers (redundant)
- Filbert firewalls (redundant)
- 3 x Sun T2000 application servers
- 1 x Sun T2000 NFS/collaboration services
- 1 x Sun v890 database server
- 9.4TB Sun 9985v storage array
- 2 x Brocade 200E 16-port FC switches
- 3 x 3Com 5500G 48-port gigabit switches
11Support Systems
- Warm Standby/Backup System
- Sun T5220 database server
- Sun T2000 application/content server
- QA Pre-production Test System (8.0)
- Sun T2000 application server
- Sun T2000 content server
- Sun T5220 database server
- Sun StorageTek 6140
- STAGING Upgrade and Migration Test (9.x)
- Sun T1000 content server
- Sun T5220 database server
- Sun StorageTek 6140
12Blackboard Software Components
- Solaris 10 Update 5 (SPARC)
- Sun JDK 1.6.0_14 (Java)
- Apache/Tomcat/ModPerl/Xythos/Pubcookie
- Blackboard v8.0.422.0
- Blackboard Learning System
- Blackboard Community System
- Blackboard Content System
- Oracle 10GR2
13Application Server Basics
- User requests come to Apache on port 443.
- Java requests go to Tomcat.
- Perl requests go to Apache with ModPerl.
- Apache delivers results to users.
- Oracle provides everything but file content.
- Content and course permissions
- Grade-book and session history
- User-interface layout customizations
- NFS file system, mounted on the application servers, provides Apache with access to shared file content.
14Blackboard System Internals
15T2000 Application Servers
- 1.2GHz UltraSPARC T1 (Niagara) processor
- 8 cores, 32 virtual processors
- 64GB DDR2 ECC 400MHz memory
- 2 x HW RAID-1 pairs of 73GB SAS disks
- One pair dedicated to /usr/local/blackboard
- One pair for all other file systems
- 4 x 10/100/1000 network interfaces
- 2 x Redundant, hot-swap power supplies
- Advanced Lights Out Manager (ALOM)
16Blackboard System Old Design
17Blackboard System New Design
182 Front-side Networks
- After the load-balancer
- Inside and outside the firewall
- 1514-byte frames
- SSH and HTTPS traffic share single NIC
- Production Collaboration Services
- QA and management systems
196 Back-end Networks
- 3 NFS backend networks
- 3 SQL backend networks
- Jumbo frames (8192-byte frames)
- /etc/hosts overrides DNS
- Each application server talks to a dedicated interface
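The difference between the 1514-byte front-side frames and the 8192-byte back-end frames can be illustrated with a little arithmetic. This is a sketch under stated assumptions: the quoted frame sizes are taken to include the 14-byte Ethernet header, and the traffic is taken to be TCP over IPv4 with no header options. The slides do not confirm either point.

```python
ETHERNET_HEADER = 14  # bytes; assumed to be included in the quoted frame size
IPV4_HEADER = 20      # bytes; assumes no IP options
TCP_HEADER = 20       # bytes; assumes no TCP options


def tcp_payload(frame_bytes: int) -> int:
    """Bytes of application data carried by one full-sized frame."""
    return frame_bytes - ETHERNET_HEADER - IPV4_HEADER - TCP_HEADER


def frames_needed(total_bytes: int, frame_bytes: int) -> int:
    """Full-sized frames required to move total_bytes (ceiling division)."""
    payload = tcp_payload(frame_bytes)
    return -(-total_bytes // payload)


for frame in (1514, 8192):
    payload = tcp_payload(frame)
    print(
        f"{frame}-byte frames: {payload} payload bytes, "
        f"{payload / frame:.1%} efficient, "
        f"{frames_needed(10**9, frame)} frames per gigabyte"
    )
```

Under these assumptions the larger frames move the same NFS or SQL data with roughly one fifth as many frames, which means fewer packets for each server to process.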
20Switch VLAN Configuration
- VLAN ID | Description | IP Address Range
- 1 | LMS Front-side | 128.100.87.0/24
- 666 | Test Network | 192.168.1.0/24
- 1103 | CNS2 Network | 128.100.103.0/24
- 4011 | NFS Channel 1 | 172.16.11.0/24
- 4012 | NFS Channel 2 | 172.16.12.0/24
- 4013 | NFS Channel 3 | 172.16.13.0/24
- 4021 | SQL Channel 1 | 172.16.21.0/24
- 4022 | SQL Channel 2 | 172.16.22.0/24
- 4023 | SQL Channel 3 | 172.16.23.0/24
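A VLAN plan like the one above can be checked mechanically. The sketch below uses Python's standard `ipaddress` module to confirm that no two address ranges overlap and to look up which VLAN an address belongs to. The helper names and the sample address are hypothetical and are not taken from the slides.

```python
import ipaddress
from itertools import combinations

# VLAN ID -> (description, address range), copied from the slide.
VLANS = {
    1: ("LMS Front-side", "128.100.87.0/24"),
    666: ("Test Network", "192.168.1.0/24"),
    1103: ("CNS2 Network", "128.100.103.0/24"),
    4011: ("NFS Channel 1", "172.16.11.0/24"),
    4012: ("NFS Channel 2", "172.16.12.0/24"),
    4013: ("NFS Channel 3", "172.16.13.0/24"),
    4021: ("SQL Channel 1", "172.16.21.0/24"),
    4022: ("SQL Channel 2", "172.16.22.0/24"),
    4023: ("SQL Channel 3", "172.16.23.0/24"),
}


def overlapping_vlans(vlans):
    """Return pairs of VLAN IDs whose address ranges overlap."""
    nets = {vid: ipaddress.ip_network(cidr) for vid, (_, cidr) in vlans.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


def vlan_for(address, vlans=VLANS):
    """Return the VLAN ID whose range contains address, or None."""
    ip = ipaddress.ip_address(address)
    for vid, (_, cidr) in vlans.items():
        if ip in ipaddress.ip_network(cidr):
            return vid
    return None


print(overlapping_vlans(VLANS))  # an empty list means the plan is clean
print(vlan_for("172.16.22.5"))   # hypothetical host on SQL Channel 2
```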
21Redundant Load-Balancers and Firewalls
23Physical Storage Components
- Sun StorageTek 9985v (15k RPM disks)
- 34 x 300GB disk drives
- 4 x 1.6TB RAID6 (6D2P) arrays
- 2 x 3.2TB pools, each composed of 2 x 1.6TB arrays
- 2 hot-spare drives
- Sun StorageTek 6140 (15k RPM disks)
- 16 x 300GB disk drives
- 3.5 TB RAID6 (13D2P)
- 1 x hot-spare
24Physical Storage Components
- Sun StorageTek 6140 (10k RPM disks)
- 16 x 300GB disk drives
- 3.5 TB RAID6 (13D2P)
- 1 x hot-spare
- Sun StorageTek 6130 (15K RPM disks)
- 14 x 68GB disk drives
- 1 x 814GB RAID5 (12D1P)
- 1 x hot-spare
25RAID-5 (7D1P), 300GB Disks: 1.9TB
26RAID-6 (6D2P), 300GB Disks: 1.6TB
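The two layouts above trade capacity for protection. Here is a minimal sketch of that arithmetic, assuming drive sizes are decimal gigabytes and the quoted capacities are binary terabytes. That reading is an assumption. It happens to reproduce the 1.9TB and 1.6TB figures for these two groups, but the slides do not state their units.

```python
TIB = 2**40  # bytes in one binary terabyte


def raid_group(data_disks: int, parity_disks: int, disk_gb: int) -> dict:
    """Summarise an nD+mP parity group built from disk_gb-sized drives."""
    total_disks = data_disks + parity_disks
    data_bytes = data_disks * disk_gb * 10**9  # assumes decimal drive gigabytes
    return {
        "disks": total_disks,
        "parity_share": parity_disks / total_disks,
        "data_tib": data_bytes / TIB,
    }


raid5 = raid_group(7, 1, 300)  # RAID-5 (7D1P)
raid6 = raid_group(6, 2, 300)  # RAID-6 (6D2P)

for name, group in (("RAID-5 7D1P", raid5), ("RAID-6 6D2P", raid6)):
    print(
        f"{name}: {group['disks']} disks, "
        f"{group['parity_share']:.1%} parity, "
        f"{group['data_tib']:.1f} TB of data space"
    )
```

Both groups use eight drives. The second parity drive costs one drive's worth of data space in exchange for surviving a second drive failure.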
2714.2 TB of Usable Storage
- 15,000 RPM Drives
- 4 x 1.6TB - 300GB RAID6 (6D2P)
- 1 x 3.5TB - 300GB RAID6 (13D2P)
- 1 x 814GB - 68GB RAID5 (12D1P)
- 10,000 RPM Drives
- 1 x 3.5TB - 300GB RAID6 (13D2P)
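The 14.2 TB headline can be cross-checked by summing the parity groups listed above. This is a sketch only: the 814GB group is entered as 0.814 TB, which assumes decimal scaling between GB and TB.

```python
# (drive speed, number of groups, usable TB per group), from the slide.
PARITY_GROUPS = [
    ("15k", 4, 1.6),    # 300GB RAID6 (6D2P)
    ("15k", 1, 3.5),    # 300GB RAID6 (13D2P)
    ("15k", 1, 0.814),  # 68GB RAID5 (12D1P); assumes 814GB = 0.814 TB
    ("10k", 1, 3.5),    # 300GB RAID6 (13D2P)
]

by_speed = {}
for speed, count, size_tb in PARITY_GROUPS:
    by_speed[speed] = by_speed.get(speed, 0.0) + count * size_tb

total_tb = sum(by_speed.values())

for speed, size_tb in by_speed.items():
    print(f"{speed} RPM: {size_tb:.3f} TB")
print(f"total: {total_tb:.1f} TB")
```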
282 Parity Groups per Pool: 3.2TB
292 Pools: 6.4TB Internal Total
30Parity Groups and Pools
31Now, forget about the disk arrays. Think about pools of storage.
32Each pool serves a different purpose.
33Each pool is carved into 50 x 61GB LDEVs.
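The pool and LDEV figures from the preceding slides fit together as follows. This sketch assumes 1TB = 1000GB throughout. The slides do not state their units, so the size of the leftover space is indicative only.

```python
PARITY_GROUP_GB = 1600  # one 1.6TB RAID6 (6D2P) parity group
GROUPS_PER_POOL = 2
POOLS = 2
LDEVS_PER_POOL = 50
LDEV_GB = 61

pool_gb = PARITY_GROUP_GB * GROUPS_PER_POOL  # capacity of one pool
carved_gb = LDEVS_PER_POOL * LDEV_GB         # space handed out as LDEVs
unallocated_gb = pool_gb - carved_gb         # space left in each pool
total_ldevs = POOLS * LDEVS_PER_POOL

print(f"pool: {pool_gb} GB, carved: {carved_gb} GB, left over: {unallocated_gb} GB")
print(f"LDEVs across both pools: {total_ldevs}")
```

Carving each pool into many equal LDEVs is what lets storage be assigned by purpose instead of by physical array.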
34The Storage Plan
35ShadowImage LDEV Duplication
36ShadowImage LDEV Split
37ShadowImage to Multiple LDEVs
38ShadowImage to Cascading LDEVs
39ShadowImage Consistent Split
40Normal Paired Operation
41Split for Online Backups
42Resync Standby from Production
43Resume Paired Operation
44Split for Standby Server Takeover
45Recover Production from Standby
46Resume Paired Operation
47Scope of the Upgrade
- Software updates
- Raidctrl support
- Upgrade Solaris
- Upgrade Blackboard
- Upgrade Oracle
- Sun 9985v upgrade
- Install and configure
- Benchmark SAN
- Migrate content and DB
- BIOS upgrades
- 8 x T2000
- 2 x T1000
- 3 x T5220
- 1 x v890
- Firmware upgrades
- 3 x 3Com switches
- 2 x Brocade 200E
48Physical System Size
- 4 x 19-inch 42U racks
- 22 pairs of mounting rails
- 18 custom cable harnesses
- 35 serial console ports
- 41 DB9 to RJ45 serial adapters
- 175 Ethernet cables
- 54 x 220V AC power cords
49System Availability Restrictions
- June 27 start date
- July long weekend: 1 x 56-hour window
- 1700 Friday to 0001 Monday
- August 8, 15, 22: 3 x 2-hour windows
- 1800 to 2000 Fridays
- August long weekend: 1 x 56-hour window
- 1700 Friday to 0001 Monday
- August 8 Sun StorageTek 9985v Installation
- August 12 Blackboard Consulting Engagement
- August 15 PRODUCTION LOCKDOWN!!
50Cable Colour Scheme
- NET0 (1st network interface)
- NET1 (2nd network interface)
- NET2 (3rd network interface)
- NET3 (4th network interface)
- MGT (management interface)
- SER (serial console port)
51That looks simple enough.
54Default JVM Generational Model
- Young Generation: Eden Space, Survivor Space
- Old Generation: Tenured Space
55Generational Memory Model
- Young Generation
- Old Generation
5664-Bit JVM
- Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_14-b03)
- Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_14-b03, mixed mode)
- Java(TM) Platform, Standard Edition for Business (build 1.6.0_14-b08)
- Java HotSpot(TM) 64-Bit Server VM (build 14.0-b16, mixed mode)
57Java on Sun Supported Platforms
- Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0-2008-11-17-065212.va203678.j2se-jprtadm_16_Nov_2008_23_48)
- Java HotSpot(TM) 64-Bit Server VM (build 1.5.0_17rev_TEST_150_17revcr6786503cr6787254cr5070073_AlwaysPreTouch_03_chrisphi_2009.02.04_1357, mixed mode)
58JVM Memory Management
- Default JVM Settings
- MaxHeapSize 1GB
- NewRatio 2
- SurvivorRatio 6
- TenuringThreshold 16
- MaxPermSize 84MB
- Memory Spaces
- New Generation
- Eden Space
- Survivor Spaces
- Old Generation
59Sun Default 64-bit JVM Heap
60UofT Blackboard JVM Heap
61UofT Blackboard JVM Heap
62JVM Options Memory Sizing
- Reserved, Perm., Stack
- -Xss256k
- -XX:InitialCodeCacheSize=128m
- -XX:ReservedCodeCacheSize=128m
- -XX:PermSize=256m
- -XX:MaxPermSize=256m
- Misc
- -XX:+UseTLAB
- -XX:+AlwaysPreTouch
- -XX:+UseNiagaraIntrs
- Young and Old Gen.
- -Xms16g
- -Xmx16g
- -XX:NewSize=4g
- -XX:MaxNewSize=4g
- -XX:OldSize=12g
- -XX:NewRatio=4
- -XX:SurvivorRatio=4096
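The sizing options above are meant to pin every region of the heap. The sketch below checks that the quoted values are self-consistent and estimates the survivor space implied by the survivor ratio. It uses the documented HotSpot relationship in which the young generation is eden plus two survivor spaces and the survivor ratio is eden divided by one survivor space. The JVM's own alignment rules may round the real figure differently, and the helper names are hypothetical.

```python
import re

UNITS = {"k": 2**10, "m": 2**20, "g": 2**30}


def parse_size(text: str) -> int:
    """Parse a JVM-style size such as '16g' or '256k' into bytes."""
    match = re.fullmatch(r"(\d+)([kmg])", text.lower())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


# Values as quoted on the slide.
SIZING = {"Xms": "16g", "Xmx": "16g", "NewSize": "4g",
          "MaxNewSize": "4g", "OldSize": "12g"}
SURVIVOR_RATIO = 4096


def check_heap(sizing: dict) -> list:
    """Return a list of inconsistencies; an empty list means none found."""
    size = {name: parse_size(value) for name, value in sizing.items()}
    problems = []
    if size["Xms"] != size["Xmx"]:
        problems.append("heap may be resized at run time")
    if size["NewSize"] != size["MaxNewSize"]:
        problems.append("young generation may be resized at run time")
    if size["NewSize"] + size["OldSize"] != size["Xmx"]:
        problems.append("young + old does not equal the maximum heap")
    return problems


young_bytes = parse_size(SIZING["NewSize"])
survivor_bytes = young_bytes // (SURVIVOR_RATIO + 2)
old_to_young = parse_size(SIZING["OldSize"]) / young_bytes

print(check_heap(SIZING))
print(f"each survivor space: about {survivor_bytes / 2**20:.2f} MB")
print(f"old:young ratio from explicit sizes: {old_to_young:.0f}:1")
```

With a survivor ratio this large the survivor spaces all but disappear, so nearly all of the 4GB young generation is eden. The explicit sizes give an old-to-young ratio of 3:1, not the 4:1 that NewRatio 4 would suggest. The slide does not say which setting the JVM honoured.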
63JVM Options Garbage Collection
- -XX:+DisableExplicitGC
- -XX:+UseParNewGC
- -XX:+ParallelRefProcEnabled
- -XX:+UseConcMarkSweepGC
- -XX:+CMSClassUnloadingEnabled
- -XX:+CMSParallelRemarkEnabled
- -XX:+CMSPermGenSweepingEnabled
- -XX:+CMSScavengeBeforeRemark
- -XX:CMSMarkStackSize=8M
- -XX:CMSMarkStackSizeMax=8M
- -XX:CMSInitiatingOccupancyFraction=60
- -XX:MaxTenuringThreshold=0
- -XX:ParallelGCThreads=16
- -XX:ParallelCMSThreads=16
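Two of the numbers above are easier to judge in absolute terms. The sketch below converts the CMS initiating occupancy of 60 percent into gigabytes of old generation, and expresses the 16 GC threads as a share of the T2000's 32 virtual processors. The 12GB old generation comes from the memory sizing slide. Treating the occupancy fraction as a fixed trigger is a simplification, since the JVM may also start collections on its own heuristics.

```python
OLD_GEN_GB = 12                # from the OldSize setting on the sizing slide
CMS_INITIATING_OCCUPANCY = 60  # percent
PARALLEL_GC_THREADS = 16
VIRTUAL_PROCESSORS = 32        # per T2000 application server


def cms_trigger_gb(old_gen_gb: float, occupancy_percent: int) -> float:
    """Old generation occupancy at which a concurrent cycle would start."""
    return old_gen_gb * occupancy_percent / 100


trigger_gb = cms_trigger_gb(OLD_GEN_GB, CMS_INITIATING_OCCUPANCY)
headroom_gb = OLD_GEN_GB - trigger_gb
gc_cpu_share = PARALLEL_GC_THREADS / VIRTUAL_PROCESSORS

print(f"CMS cycle starts near {trigger_gb:.1f} GB of old generation in use")
print(f"headroom while the cycle runs: {headroom_gb:.1f} GB")
print(f"GC threads use up to {gc_cpu_share:.0%} of the virtual processors")
```

The headroom matters because a tenuring threshold of 0 promotes every object that survives a young collection straight into the old generation, which keeps filling while the concurrent collector works.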
64JVM Options - Logging
- -XX:+PrintVMOptions
- -XX:+PrintCommandLineFlags
- -XX:+PrintGCDetails
- -XX:+PrintGCTimeStamps
- -XX:+PrintGCTaskTimeStamps
- -XX:+PrintGCApplicationStoppedTime
- -XX:+PrintGCApplicationConcurrentTime
- -XX:+PrintHeapAtGC
- -XX:+PrintTenuringDistribution
- -XX:PrintCMSStatistics=2
- -Xloggc:/var/log/gc.<pid>.log
65JVM Options Monitoring
- -Dcom.sun.management.snmp.interface=0.0.0.0
- -Dcom.sun.management.snmp.port=8161
- -Dcom.sun.management.snmp.acl=false
- -Dcom.sun.management.jmxremote.authenticate=false
- -Dcom.sun.management.jmxremote.port=9161
- -Dcom.sun.management.jmxremote.ssl=false
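The monitoring properties above are ordinary Java system properties. The sketch below renders them as command-line arguments and builds the standard JMX service URL that a remote client would use to reach the port. The host name is a placeholder and does not come from the slides.

```python
# Property names and values as listed on the slide.
MONITORING_PROPERTIES = {
    "com.sun.management.snmp.interface": "0.0.0.0",
    "com.sun.management.snmp.port": "8161",
    "com.sun.management.snmp.acl": "false",
    "com.sun.management.jmxremote.authenticate": "false",
    "com.sun.management.jmxremote.port": "9161",
    "com.sun.management.jmxremote.ssl": "false",
}


def jvm_arguments(properties: dict) -> list:
    """Render system properties as -Dname=value command-line arguments."""
    return [f"-D{name}={value}" for name, value in properties.items()]


def jmx_service_url(host: str, port: int) -> str:
    """Build the RMI-based JMX service URL for a JVM's remote agent."""
    return f"service:jmx:rmi:///jndi/rmi://{host}:{port}/jmxrmi"


jmx_port = int(MONITORING_PROPERTIES["com.sun.management.jmxremote.port"])

print(" ".join(jvm_arguments(MONITORING_PROPERTIES)))
print(jmx_service_url("app1.example.edu", jmx_port))  # placeholder host
```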
66Instrumentation
- 1500 monitored parameters
- 7 production JVMs
- 52 server network interfaces
- 13 server CPUs
- 70 disk sub-systems
- 144 gigabit switch ports
- 32 FC switch ports
- Load-balancers and firewalls
67Best Indicators of System Stability
- Operating System
- CPU Usage
- Swap Space Usage
- TCP Connections
- JVM
- Thread Counts
- Memory Usage Pattern
- Garbage Collection Pattern
68Tenured Generation Usage
69Eden Space Usage
70The fork()/exec() Problem
71The fork()/exec() Problem (contd)
72Problems Become Obvious
73Peak Week Portal Connections
74JVM Memory Allocation Rate
75Peak Database Disk
76Single Application Server 450k hits/hour
77Single Server Thread Counts
78Single Server Stop-the-World Times
79Single Server Memory Usage
80Midterm Marks Peak
81Useful URLs
- http://www.cacti.net/
- http://research.sun.com/techrep/2000/smli_tr-2000-88.pdf
- http://research.sun.com/jtech/pubs/04-g1-paper-ismm.pdf
- http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html
82Summary
- Design to debug and build to last.
- Know what to cut and where.
- It's all about the cabling.
- SANs fall harder than LANs.
- Big, fast, or stable. Pick any two?
- Graph everything. Always!
83Question and Answer
- John C. Calvin
- Senior Systems Architect
- University of Toronto
- john.calvin_at_utoronto.ca