Title: Reconstructing the Future
1 Reconstructing the Future
- Capacity Planning with data that's gone Troppo
- Steve Jenkin - Info Tech.
- Neil Gunther - Performance Dynamics
2 Overview
- Background
- Process: acquire data, investigate, analyse/model
- The detail
- Sifting through the lumps
- Data Analysis and Modeling
- Summary
3 Aims
- With relic data, wanted to write up the analysis in performance modeling terms.
- Believed the techniques that made the project succeed would be useful to designers and practitioners.
4 Background
- ATO project - one of many hosted sites
- 12-18 month project
- Replacements I & II followed
- Complex Environment
- System Diagram
- Software Contractor
- Our bad days
- Runaway Failure late February
- Anzac Day - security pinhole
5 System Diagram
[Diagram; component labels in order: Net, F, Web, F, DB]
System Design required by DSD
6 System Diagram - II
[Diagram; component labels in order: Net, F, F, Bal, Admin, DB, Web, Web, Web]
System as Run
7 System Diagram - III
[Diagram; component labels in order: Net, F, F, Bal, Admin, DB, Web, Web, Web]
The rate limiting factor
8 Good Design Aspects
- Secure Hosting Facility, Dual Firewalls.
- No cookies, some javascript.
- Thorough testing on many browsers and OSes.
- 128-bit SSL, Unique ID, Random Password
- Dual paths for transferring registrations
- E-mail - MD5 acknowledgements
- Plan B - CDs of files
- Load Balancer - best operational decision.
9 Software Contractor
- 0NF database design.
- Don't worry about faults.
- Message flood(s).
- "Everything in the Database" strategy.
- Monitoring, Capacity Planning, Traffic
Forecasting, Stress Testing, Pre-Production
10 Zeroth Normal Form (0NF)
- Single 3-column table.
- Key ID, tag, 255 varchar data
- 600,000 full registrations with 300-500 fields
- Random check number, not Unique (sequential) ID, used in Key.
- Performance, scalability, capacity impacts
- Great for knocking up a quick PHP prototype.
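As a rough illustration of the single-table design described above, the sketch below stores every field of a registration as its own row. The table name `reg`, the column names `id`, `tag` and `data`, and the sample values are all hypothetical; the slide only says there were three columns.

```python
import sqlite3

# Hypothetical schema: one row per field of a registration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reg (id TEXT, tag TEXT, data VARCHAR(255))")


def store(conn, reg_id, fields):
    """Write one registration as one row per field."""
    conn.executemany(
        "INSERT INTO reg (id, tag, data) VALUES (?, ?, ?)",
        [(reg_id, tag, value) for tag, value in fields.items()],
    )


def load(conn, reg_id):
    """Rebuild a registration by collecting every row that shares its key."""
    rows = conn.execute("SELECT tag, data FROM reg WHERE id = ?", (reg_id,))
    return dict(rows)


# Made-up example values, not taken from the real system.
store(conn, "R-83a1", {"name": "Example Pty Ltd", "postcode": "2600"})
registration = load(conn, "R-83a1")
row_count = conn.execute("SELECT COUNT(*) FROM reg").fetchone()[0]
```

With 300-500 fields per registration, 600,000 registrations stored this way come to roughly 180-300 million rows, and every read of a full registration has to gather hundreds of them.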
11 The Detail or Getting the Numbers
- 'struct acct' ain't 'struct acct'
- Solaris 40 bytes, Linux 64 bytes
- Available process accounting tools didn't work.
- missing bytes - undocumented feature
- Word alignment on 4 byte boundaries - padding required
- SPARC is big-endian, Intel is little-endian.
- perl can be better than C - pack/unpack
- Raw Data
- 4 machines, 165 days, 1 compressed file/day
- 3GB uncompressed binary data
12 struct acct - Solaris
struct acct {
    char    ac_flag;     /* Accounting flag */
    char    ac_stat;     /* Exit status */
    char    ac_pad[2];   /* PADDING */
    uid_t   ac_uid;      /* Accounting user ID */
    gid_t   ac_gid;      /* Accounting group ID */
    dev_t   ac_tty;      /* control tty */
    time_t  ac_btime;    /* Beginning time */
    comp_t  ac_utime;    /* accounting user time in clock ticks */
    comp_t  ac_stime;    /* accounting system time in clock ticks */
    comp_t  ac_etime;    /* accounting total elapsed time in clock ticks */
    comp_t  ac_mem;      /* memory usage in clicks (pages) */
    comp_t  ac_io;       /* chars transferred by read/write */
    comp_t  ac_rw;       /* number of block reads/writes */
    char    ac_comm[8];  /* command name */
};
Sizeof is 40 bytes, But
1+1+2+(2*4)+2+4+(6*2)+8 = 38 Huh?
How long is a clock tick? What's a comp_t, uid_t, ...?
13 struct acct - Perl
# Format string for unpack
my $SPACCT_T = 'C C n N N N N n n n n n n A8';
Pack/unpack
- C - 1 char
- n - 2 byte int (network order)
- N - 4 byte int (network order)
- A8 - 8 byte ascii string
comp_t, dev_t 16 bits, rest 32.
comp_t is a 16-bit floating point number with a 3-bit (base 8) exponent and a 13-bit fraction.
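For readers without Perl to hand, here is a minimal Python sketch of the same unpacking, mirroring the format string above field for field. The field names in `FIELDS` are an assumed mapping onto the struct members in declaration order, the sample record is synthetic, and the comp_t decoder simply follows the description of a 3-bit base-8 exponent with a 13-bit fraction.

```python
import struct

# Python counterpart of the Perl format 'C C n N N N N n n n n n n A8':
# '>' = big-endian with no implicit padding, B = 1 byte, H = 2 bytes, I = 4 bytes.
ACCT_FORMAT = ">BBHIIIIHHHHHH8s"
ACCT_SIZE = struct.calcsize(ACCT_FORMAT)

# Assumed mapping of format positions to struct members (declaration order).
FIELDS = ("flag", "stat", "pad", "uid", "gid", "tty", "btime",
          "utime", "stime", "etime", "mem", "io", "rw", "comm")
COMP_FIELDS = ("utime", "stime", "etime", "mem", "io", "rw")


def decode_comp_t(raw):
    """Expand a 16-bit comp_t: top 3 bits are a base-8 exponent, low 13 bits the fraction."""
    exponent = raw >> 13
    fraction = raw & 0x1FFF
    return fraction << (3 * exponent)


def parse_record(buf):
    """Unpack one fixed-size accounting record into a dict."""
    rec = dict(zip(FIELDS, struct.unpack(ACCT_FORMAT, buf)))
    for name in COMP_FIELDS:
        rec[name] = decode_comp_t(rec[name])
    # Like Perl's A8, drop trailing NULs and spaces from the command name.
    rec["comm"] = rec["comm"].rstrip(b"\0 ").decode("ascii", "replace")
    return rec


# Synthetic record for illustration only (arbitrary ids, times and counters).
sample = struct.pack(ACCT_FORMAT, 0, 0, 0, 1001, 100, 0, 959644800,
                     0x2001, 5, 0x4003, 0, 0, 0, b"oracle\0\0")
record = parse_record(sample)
```

Reading a whole file is then a loop over `ACCT_SIZE`-byte chunks, calling `parse_record` on each.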
14 Concluding struct
- Converted Solaris binary acct records to tab-separated ascii. The Unix standard.
- Ended up adding the unix time number in there, not just YYYY-MM-DD HHMMSS - better for sorting and joins.
- Concatenated all those pesky daily files.
- 42M records compressed into 121MB ascii for DB server.
- awk, grep, sort, join, etc. to manipulate data.
- Created summary tools. E.g. total counts in an interval.
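A summary tool of the "total counts in an interval" kind might look like the sketch below. It is not the original tool: the column layout (unix time in column 0, command name in column 1) and the sample lines are assumptions made for illustration.

```python
from collections import Counter


def counts_per_interval(lines, command, interval=3600, time_col=0, comm_col=1):
    """Count records for one command in fixed-width time buckets."""
    buckets = Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if cols[comm_col] != command:
            continue
        # Integer division snaps the unix time to the start of its bucket.
        start = int(cols[time_col]) // interval * interval
        buckets[start] += 1
    return dict(sorted(buckets.items()))


# Made-up records: unix time, then command name, tab-separated.
sample = [
    "959644800\toracle\n",
    "959644900\ttnslsnr\n",
    "959647000\toracle\n",
    "959648500\toracle\n",
]
hourly = counts_per_interval(sample, "oracle")
daily = counts_per_interval(sample, "oracle", interval=86400)
```

Keeping the unix time as a plain integer is what makes the bucketing a one-line calculation; a wider `interval` gives fewer, coarser buckets.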
15 The Last Detail
- Why only the DB Server?
- Too hard lining up times on servers and identifying web traffic.
- Why ORACLE connections?
- Summary count of commands run by the DB server.
- 19975897 oracle
- 19965977 tnslsnr
- 888528 beprms
- 480040 bepsend
- 453807 beprms-r
- 447294 beptms
- 54489 bepqueue
- 14479 beptest.
- 5984 imotof_a
- 399 beprms_r
- 382 bepimoto
- 6 bepcd_ac
- 5 bepmanag
16 Sifting through the lumps
- With 20M records and 1M on your busiest day, simple tools aren't going to cut it.
- Selecting and aggregating data is critical.
- Visualising the effects is really important.
- But asking gnuplot (or anything) to plot 5 minute samples for 165 days (50,000 points) isn't a good idea.
17 What to count?
- The DB server command counts
- 19975897 oracle
- 19965977 tnslsnr
- 888528 beprms
- 480040 bepsend
- bepsend should be a perfect measure - one email sent per completed registration.
- oracle is the DB process. Looks like one per DB connection. Similar to the listener.
- A good fallback?
18 bepsend. Why not - I
bepsend completed registrations - whole period
19 bepsend. Why not - II
bepsend completed registrations - 3 weeks of spike
20 bepsend. Why not - III
- E-mails not always sent.
- There were floods after stoppages.
- The rate was capped at 600 or 1000/hr.
- Turned off bepsend for the Busiest day.
- spent next 3 days clearing the backlog.
- Close, but no cigar.
21 Why Oracle Connections?
Oracle Connections vs bepsend
22 Analysing the Data
- Exponential Growth was initially used to model the site traffic.
- But was that right?
- Never went back and checked the model.
- From inspection, the traffic (DB connections) has at least three distinct regions.
- Is the obvious correct?
- Is traffic growth exponential or something else?
- Is there anything else useful in there?
23 Mathematica Marvels
- Reconstructing the original exponential model
- Doubling period 6 weeks
- But significantly low around the spike
- Enter the Power Law
- highly correlated behaviour as a cause
- The system is primarily doing just one thing for a longer than average time [Gun06]
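The two candidate models can be compared with ordinary least squares on transformed data: an exponential is a straight line in (t, ln n), a power law is a straight line in (ln t, ln n). The sketch below is a plain-Python stand-in, not the Mathematica analysis. The series is synthetic, built to double every 6 weeks, and the choice of time origin for the power law is an assumption.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    b = sxy / sxx
    a = my - b * mx
    return a, b, (sxy * sxy) / (sxx * syy)


def fit_exponential(t, n):
    """Fit n = n0 * exp(k*t), a straight line in (t, ln n)."""
    a, k, r2 = linear_fit(t, [math.log(v) for v in n])
    return math.exp(a), k, r2


def fit_power_law(t, n):
    """Fit n = c * t**p, a straight line in (ln t, ln n)."""
    a, p, r2 = linear_fit([math.log(x) for x in t], [math.log(v) for v in n])
    return math.exp(a), p, r2


def doubling_period(k):
    """Time for an exponential with rate k to double."""
    return math.log(2) / k


# Synthetic series that doubles every 6 weeks (not the project's data).
weeks = list(range(1, 21))
traffic = [1000 * 2 ** (w / 6) for w in weeks]
n0, k, r2_exp = fit_exponential(weeks, traffic)
c, p, r2_pow = fit_power_law(weeks, traffic)
```

On real traffic, comparing the two r-squared values, and the residuals around the spike in particular, is one simple way to see which form tracks the data better.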
24 Power Law
Log-Log plot demonstrating Power Law behaviour
25 Exponential vs Power Law Fit
26 The Spike
- Problems
- The spike is too high for the simple model
- Ummm, just how many registrations were there?
- Major Insight - something else was going on.
- Busy Tone impact wasn't modeled.
- PDQ model of two linked queues possible
- Busy Tone queue (5 seconds Q service time)
- Registration queue (30 minutes Q service time)
27 Other time-series models
- Simpoint
- Irregular Sampling
- Holt-Winters - for seasonal (repeating) data
- From the paper
- Neil Gunther's expert area
- The other techniques are appropriate when many chunks are missing.
- That wasn't the problem here. But we know what to do in those cases.
28 Getting Expert help
- With my trusty Excel and modest arsenal of techniques, no surprising results were likely.
- Having someone to talk the problems through was invaluable.
- Especially when they have much better Maths, a bunch of powerful tools and lots of experience using them.
29 Runaway Failures
- As web server response slows, users click again and again. CGIs have to run to completion, but more are started.
- positive feedback vs negative/self-limiting.
- System Demand increases precisely because of slow response, due to high system load.
- Correlation of Load and Demand, as noted earlier, leads to super-exponential growth.
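The feedback loop can be seen in a deliberately crude toy model, in which every request still waiting spawns extra retries in the next time step. This is only a sketch: the model, the parameter names and all the numbers are invented, and it shows a tipping point with geometric growth rather than reproducing the super-exponential behaviour reported for the real system.

```python
def simulate(arrivals, capacity, retry_factor, burst, steps=20):
    """Track the backlog when each waiting request spawns extra retries."""
    backlog = float(burst)
    history = [backlog]
    for _ in range(steps):
        # Retries add to demand in proportion to how much is already waiting.
        offered = arrivals + retry_factor * backlog
        # Whatever exceeds capacity in this step stays queued.
        backlog = max(0.0, backlog + offered - capacity)
        history.append(backlog)
    return history


# All numbers are invented for illustration.
no_retries = simulate(arrivals=80, capacity=100, retry_factor=0.0, burst=150)
with_retries = simulate(arrivals=80, capacity=100, retry_factor=0.2, burst=150)
small_burst = simulate(arrivals=80, capacity=100, retry_factor=0.2, burst=50)
```

In this toy the backlog drains whenever the spare capacity (`capacity - arrivals`) exceeds `retry_factor * backlog`, and runs away once a burst pushes it past that point, which is the self-limiting versus positive-feedback distinction in miniature.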
30 Busy Tone
- A response to a system meltdown due to a sudden increase in demand when the site was first advertised.
- Load Average increased from 1 to 150 in meltdown.
- Response Time not measured. Off the chart.
- BT implementation weakness - DB access.
- Consumed large fraction of system on Busy Day.
- System averaged 1250-1500 registrations/hr.
31 The Fudge Factor
- On the busiest day, sending completed registrations was halted to lighten load.
- Email sent by DB server.
- Because of previous mail floods, email rate limited - first to 600/hr, then 1000/hr.
- From 3am Tuesday to 2am Sunday, 55,000 completed registrations.
- How much did Busy Tone increase workload?
- A Fudge Factor to convert work-load units (connections) to registrations (internal to external)
32 The Fudge Factor - II
Completed Registrations sent/hour over Busy 4 Days
33 The Fudge Factor - III
Adjusted Completed Registrations May-June
34 The Fudge Factor - IV
- For whole period
- 19,965,976 Connections, 480,076 registrations, and ratio of 41.5
- For the busiest 4 days, 3am Tue 30/5 - 2am Sun 04/06
- 2,517,219 Oracle Connections
- 54,245 Registrations completed (12.5% of total)
- 46.4 Connections/Registration ratio
- Rough effect of Busy Day
- 55,000 registrations at the average rate = 2,256,006 connections, an excess of 261,213 connections.
- 10% for busy 4 days.
- 20% if all on the day of the spike.
- Averaged 12-13 retries for all users.
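The arithmetic behind the excess-connections estimate can be rechecked in a few lines. The sketch below uses only the four counts quoted on this slide; the variable names are ours.

```python
# Counts quoted on the slide.
total_connections = 19_965_976
total_registrations = 480_076
busy_connections = 2_517_219
busy_registrations = 54_245

# Connections per registration: whole period versus the busy 4 days.
average_ratio = total_connections / total_registrations
busy_ratio = busy_connections / busy_registrations

# Connections the busy-period registrations would have needed at the average rate.
expected_connections = busy_registrations * average_ratio

# Anything above that is attributed to the busy period itself.
excess = busy_connections - expected_connections
excess_share = excess / busy_connections

print(f"average ratio:        {average_ratio:.2f}")
print(f"busy ratio:           {busy_ratio:.2f}")
print(f"expected connections: {expected_connections:,.0f}")
print(f"excess connections:   {excess:,.0f} ({excess_share:.1%} of busy-period connections)")
```

The recomputed expected and excess connection counts agree with the 2,256,006 and 261,213 quoted above, and the excess is a little over 10% of the busy-period connections.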
35 Traffic Characteristics - Weekly
- Morning and Afternoon peaks
- distinct lunch and tea times.
- High evening peak.
- But not on Friday nights.
- People are surprisingly busy on weekends.
- And many work late Sunday night.
36 Traffic Characteristics - I
37 Traffic Characteristics - Busy
- System flat-out about 18 hours on Busy Day.
- High traffic loads in normally off-peak times.
- The Day After reasonably quiet.
- Subsequent days very quiet.
- How close to meltdown on the Busy Day because of the Busy Tone?
- Perhaps 25%, but it's Sudden death
- Busy Tone load (Retry Rate * Queue Len)
- Need PDQ model to know.
38 Traffic Characteristics - II
Monday, Tuesday and Wednesday Load
39 Traffic Characteristics - IV
Load in equivalent hours at max throughput
40 Our new insights
- Metrics captured/displayed during the life of the system did not fully describe the load on the system.
- registrations, response time and vmstats.
- No DB connects.
- Busy Tone created significant load.
- The spike on Busy Day was super-exponential, not the exponential of the original model.
41 Coulda, Woulda, Shoulda - Further work
- Develop simple PDQ model to incorporate busy tone effect and show power law criticality.
- Categorise connection data.
- Investigate webserver data to reconstruct user response time, and correlate with load.
- Show problems tuning Busy Tone parameters.
- Reconstruct system activity, CPU and IO activity, by aggregating process accounting data.
42 Further work
- General Recommendations
- Design in instrumentation, Performance measurement, reporting and analysis, Capacity Planning and Load Projections
- Expect and prepare to report on Lessons Learned.
- E.g. Canadian report - half or more of the traffic in last 6 weeks.
- Design Operational Support: Fault Procedures, Fault Prioritisation, escalation, a war-room, security failure or disaster response plan.
- For e-mail transfer, use digests as well as backup CDs.
- Describe the many different traffic periods
- Pre-Xmas, Xmas, January Holidays, Christmas rush, The Next Day, False Deadline and well afterwards
43 Summary
- We banged on some different system data and were able to verify some important system effects.
- And learnt a few new things along the way.
- About the system.
- About Busy Tone and more runaway failures.
- About handling and processing large datasets.
- About performance analysis and modeling.