1
Reconstructing the Future
  • Capacity Planning with data that's gone Troppo
  • Steve Jenkin - Info Tech.
  • Neil Gunther - Performance Dynamics

2
Overview
  • Background
  • Process: acquire data, investigate, analyse/model
  • The detail
  • Sifting through the lumps
  • Data Analysis and Modeling
  • Summary

3
Aims
  • With relic data, we wanted to write up the analysis in
    performance-modeling terms.
  • We believed the techniques that made the project succeed
    would be useful to designers and practitioners.

4
Background
  • ATO project - one of many hosted sites
  • 12-18 month project
  • Replacements I and II followed
  • Complex Environment
  • System Diagram
  • Software Contractor
  • Our bad days
  • Runaway Failure late February
  • Anzac Day - security pinhole

5
System Diagram
(Diagram: Net - Firewall - Web server - Firewall - DB server)
System Design required by DSD
6
System Diagram - II
(Diagram: Net - Firewall - Firewall - Load Balancer - 3 Web servers - Admin - DB server)
System as Run
7
System Diagram - III
(Diagram: the system as run, with the rate-limiting component highlighted)
The rate limiting factor
8
Good Design Aspects
  • Secure Hosting Facility, Dual Firewalls.
  • No cookies, some javascript.
  • Through testing on many browsers and OS.
  • 128-bit SSL Unique ID Random Password
  • Dual paths for transferring registrations
  • E-mail - MD5 acknowledgements
  • Plan B - CDs of files
  • Load Balancer best operational decision.
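
A minimal sketch of the MD5 acknowledgement idea, assuming the
acknowledgement carries the digest of each transferred registration
file (the filename and ACK format are illustrative):

    use Digest::MD5;

    # Digest a transferred file so the receiver can acknowledge,
    # and the sender verify, an intact transfer.
    open my $fh, '<', 'registrations.dat' or die "open: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    print "ACK $digest\n";    # mailed back as the acknowledgement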

9
Software Contractor
  • 0NF database design.
  • "Don't worry about faults."
  • Message flood(s).
  • An "Everything in the Database" strategy.
  • Monitoring, Capacity Planning, Traffic
    Forecasting, Stress Testing, Pre-Production

10
Zeroth Normal Form (0NF)
  • A single 3-column table: Key, ID tag, varchar(255) data.
  • 600,000 full registrations with 300-500 fields each.
  • A random check number, not a Unique (sequential) ID,
    used in the Key.
  • Performance, scalability and capacity impacts.
  • Great for knocking up a quick PHP prototype
    (illustration below).
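
An illustrative sketch (hypothetical values) of what 0NF means here:
each registration of 300-500 fields becomes 300-500 rows of the one
table.

    Key        ID tag       data (varchar 255)
    --------   ----------   ------------------
    4711-93    surname      Smith
    4711-93    postcode     2600
    4711-93    phone_home   02 6123 4567

Every field access is a row fetch, so reading one registration costs
hundreds of them.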

11
The Detail or Getting the Numbers
  • 'struct acct' ain't 'struct acct'.
  • Solaris: 40 bytes; Linux: 64 bytes.
  • Available process accounting tools didn't work.
  • Missing bytes: an undocumented feature.
  • Word alignment on 4-byte boundaries - padding
    required.
  • SPARC is big-endian, Intel is little-endian.
  • perl can be better than C - pack/unpack.
  • Raw Data:
  • 4 machines, 165 days, 1 compressed file/day.
  • 3 GB of uncompressed binary data.

12
struct acct - Solaris
    struct acct {
        char   ac_flag;     /* Accounting flag */
        char   ac_stat;     /* Exit status */
        char   ac_pad[2];   /* PADDING */
        uid_t  ac_uid;      /* Accounting user ID */
        gid_t  ac_gid;      /* Accounting group ID */
        dev_t  ac_tty;      /* control tty */
        time_t ac_btime;    /* Beginning time */
        comp_t ac_utime;    /* accounting user time in clock ticks */
        comp_t ac_stime;    /* accounting system time in clock ticks */
        comp_t ac_etime;    /* accounting total elapsed time in clock ticks */
        comp_t ac_mem;      /* memory usage in clicks (pages) */
        comp_t ac_io;       /* chars transferred by read/write */
        comp_t ac_rw;       /* number of block reads/writes */
        char   ac_comm[8];  /* command name */
    };

Sizeof is 40 bytes. But 1+1+2+(2x4)+2+4+(6x2)+8 = 38. Huh?
How long is a clock tick? What's a comp_t, a uid_t, ...?
13
struct acct - Perl
Format string for unpack:

    my $SPACCT_T = 'C C n N N N N n n n n n n A8';

Pack/unpack codes:

    C  - 1-byte char
    n  - 2-byte int (network order)
    N  - 4-byte int (network order)
    A8 - 8-byte ascii string

comp_t and dev_t are 16 bits, the rest 32 bits. comp_t is a 16-bit
floating point number with a 3-bit (base 8) exponent and a 13-bit
fraction.
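
A minimal sketch of decoding a comp_t back into a plain number (the
function name is illustrative):

    # comp_t: 3-bit base-8 exponent in the top bits,
    # 13-bit fraction (mantissa) in the low bits.
    sub decode_comp_t {
        my ($c) = @_;
        my $exp  = ($c >> 13) & 0x7;    # exponent, base 8
        my $frac = $c & 0x1FFF;         # 13-bit mantissa
        return $frac << (3 * $exp);     # frac * 8**exp
    }
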
14
Concluding struct
  • Converted Solaris binary acct records to
    tab-separated ascii - the Unix standard.
  • Ended up adding the unix time number in there,
    not just YYYY-MM-DD HHMMSS - better for sorting
    and joins.
  • Concatenated all those pesky daily files.
  • 42M records compressed into 121 MB of ascii for
    the DB server.
  • awk, grep, sort, join, etc. to manipulate the data.
  • Created summary tools, e.g. total counts in an
    interval. (A conversion sketch follows.)
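
A minimal sketch of the conversion, assuming 40-byte Solaris records
on STDIN; the field order follows the struct above, and decode_comp_t
is the helper sketched earlier:

    my $RECLEN = 40;
    my $buf;
    while (read(STDIN, $buf, $RECLEN) == $RECLEN) {
        my ($flag, $stat, $pad, $uid, $gid, $tty, $btime,
            $utime, $stime, $etime, $mem, $io, $rw, $comm)
            = unpack('C C n N N N N n n n n n n A8', $buf);
        # Emit the raw unix time as well as a readable timestamp -
        # better for sorting and joins.
        my @t = localtime($btime);
        my $stamp = sprintf('%04d-%02d-%02d %02d%02d%02d',
            $t[5] + 1900, $t[4] + 1, @t[3, 2, 1, 0]);
        print join("\t", $btime, $stamp, $uid, $comm,
            decode_comp_t($utime), decode_comp_t($stime)), "\n";
    }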

15
The Last Detail
  • Why only the DB Server?
  • Too hard lining up times across servers and
    identifying web traffic.
  • Why ORACLE connections?
  • Summary count of commands run by the DB server.
  • 19975897 oracle
  • 19965977 tnslsnr
  • 888528 beprms
  • 480040 bepsend
  • 453807 beprms-r
  • 447294 beptms
  • 54489 bepqueue
  • 14479 beptest
  • 5984 imotof_a
  • 399 beprms_r
  • 382 bepimoto
  • 6 bepcd_ac
  • 5 bepmanag

16
Sifting through the lumps
  • With 20M records and 1M on your busiest day,
    simple tools aren't going to cut it.
  • Selecting and aggregating data is critical.
  • Visualising the effects is really important.
  • But asking gnuplot (or anything) to plot 5-minute
    samples for 165 days (50,000 points) isn't a good
    idea - aggregate first (sketch below).
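
A minimal sketch of the kind of aggregation that keeps plots sane,
assuming the tab-separated records above with the unix time in column
1 - roll the 5-minute detail up to daily totals:

    my %per_day;
    while (<>) {
        my ($t) = split /\t/;
        my $day = int($t / 86400);    # 86,400 seconds per day
        $per_day{$day}++;
    }
    # One point per day: 165 points instead of ~50,000.
    for my $day (sort { $a <=> $b } keys %per_day) {
        print $day * 86400, "\t$per_day{$day}\n";
    }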

17
What to count?
  • The DB server command counts
  • 19975897 oracle
  • 19965977 tnslsnr
  • 888528 beprms
  • 480040 bepsend
  • bepsend should be a perfect measure - one email
    sent per completed registration.
  • oracle is the DB process. Looks like one per DB
    connection. Similar to the listener.
  • A good fallback?

18
bepsend. Why not - I
(Chart: bepsend completed registrations - whole period)
19
bepsend. Why not - II
(Chart: bepsend completed registrations - 3 weeks around the spike)
20
bepsend. Why not - III
  • E-mails were not always sent.
  • There were floods after stoppages.
  • The rate was capped at 600, later 1000/hr.
  • bepsend was turned off for the Busiest day,
  • and the next 3 days were spent clearing the backlog.
  • Close, but no cigar.

21
Why Oracle Connections?
(Chart: Oracle Connections vs bepsend)
22
Analysing the Data
  • Exponential Growth was initially used to model
    the site traffic.
  • But was that right?
  • Nobody ever went back and checked the model.
  • From inspection, the traffic (DB connections)
    has at least three distinct regions.
  • Is the obvious correct?
  • Is traffic growth exponential or something else?
  • Is there anything else useful in there?

23
Mathematica Marvels
  • Reconstructing the original exponential model:
  • doubling period of about 6 weeks,
  • but significantly low around the spike.
  • Enter the Power Law:
  • highly correlated behaviour as a cause;
  • "The system is primarily doing just one thing
    for a longer than average time" [Gun06].
  • (Both models are sketched below.)
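
The two candidate models in plain terms (T is the slide's figure;
a and b are fit parameters):

    Exponential:  N(t) = N0 * 2^(t/T),  doubling period T ~ 6 weeks
                  - a straight line on a lin-log plot.
    Power law:    N(t) = a * t^b
                  - a straight line of slope b on a log-log plot.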

24
Power Law
(Chart: log-log plot demonstrating Power Law behaviour)
25
Exponential vs Power Law Fit
26
The Spike
  • Problems:
  • The spike is too high for the simple model.
  • Ummm, just how many registrations were there?
  • Major Insight - something else was going on:
  • the Busy Tone impact wasn't modeled.
  • A PDQ model of two linked queues is possible
    (see the sketch below):
  • Busy Tone queue (about 5 seconds service time),
  • Registration queue (about 30 minutes service time).
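
A minimal Perl::PDQ sketch of the two linked queues. The arrival rate
and the delay-centre treatment of the registration stage are
assumptions for illustration, not fitted values:

    use pdq;

    my $lambda = 0.1;    # assumed arrivals/sec; must stay < 1/5
                         # or the busy-tone queue saturates
    pdq::Init("Busy Tone + Registration");
    pdq::CreateOpen("Users", $lambda);
    # Busy Tone check: a single fast queueing centre.
    pdq::CreateNode("BusyTone", $pdq::CEN, $pdq::FCFS);
    pdq::SetDemand("BusyTone", "Users", 5.0);         # ~5 s
    # Registration: users typing in parallel, so a delay centre.
    pdq::CreateNode("Registration", $pdq::DLY, $pdq::ISRV);
    pdq::SetDemand("Registration", "Users", 1800.0);  # ~30 min
    pdq::Solve($pdq::CANON);
    pdq::Report();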

27
Other time-series models
  • Simpoint.
  • Irregular Sampling.
  • Holt-Winters - for seasonal (repeating) data.
  • From the paper:
  • Neil Gunther's expert area.
  • The other techniques are appropriate when many
    chunks are missing.
  • That wasn't the problem here, but we know what
    to do in those cases.

28
Getting Expert help
  • With my trusty Excel and a modest arsenal of
    techniques, no surprising results were likely.
  • Having someone to talk the problems through with
    was invaluable.
  • Especially when they have much better maths, a
    bunch of powerful tools, and lots of experience
    using them.

29
Runaway Failures
  • As web server response slows, users click again
    and again. CGIs have to run to completion, but
    more are started.
  • Positive feedback vs negative/self-limiting feedback.
  • System demand increases precisely because of slow
    response, due to high system load.
  • Correlation of Load and Demand, as noted earlier,
    leads to super-exponential growth (a toy version
    of the maths follows).
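
A toy illustration of why positive feedback is super-exponential (the
quadratic form is assumed for illustration, not fitted): if extra
demand arrives in proportion to the load already present, say

    dL/dt = k * L^2,   then   L(t) = L0 / (1 - k*L0*t)

which blows up in finite time t = 1/(k*L0) - faster than any
exponential, which merely doubles at fixed intervals.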

30
Busy Tone
  • A response to a system meltdown caused by the
    sudden increase in demand when the site was first
    advertised.
  • Load Average increased from 1 to 150 in the meltdown.
  • Response Time was not measured - off the chart.
  • Busy Tone implementation weakness: DB access.
  • It consumed a large fraction of the system on the
    Busy Day.
  • The system averaged 1250-1500 registrations/hr.

31
The Fudge Factor
  • On the busiest day, sending completed
    registrations was halted to lighten load.
  • Email was sent by the DB server.
  • Because of previous mail floods, email was rate
    limited - first to 600/hr, then 1000/hr.
  • From 3am Tuesday to 2am Sunday: 55,000 completed
    registrations.
  • How much did the Busy Tone increase workload?
  • A Fudge Factor to convert work-load units:
    connections to registrations (internal to
    external).
32
The Fudge Factor - II
(Chart: Completed Registrations sent/hour over the busy 4 days)
33
The Fudge Factor - III
(Chart: Adjusted Completed Registrations, May-June)
34
The Fudge Factor - IV
  • For the whole period:
  • 19,965,976 Connections, 480,076 registrations,
    a ratio of 41.5 connections/registration.
  • For the busiest 4 days, 3am Tue 30/5 - 2am Sun 04/06:
  • 2,517,219 Oracle Connections,
  • 54,245 Registrations completed (12.5% of total),
  • a 46.4 Connections/Registration ratio.
  • Rough effect of the Busy Day:
  • 55,000 registrations at the whole-period average
    rate would be 2,256,006 connections - an excess
    of 261,213 connections.
  • About 10% spread over the busy 4 days,
  • or 20% if all on the day of the spike.
  • Averaged 12-13 retries across all users.

35
Traffic Characteristics - Weekly
  • Morning and Afternoon peaks
  • distinct lunch and tea times.
  • High evening peak.
  • But not on Friday nights.
  • People are surprisingly busy on weekends.
  • And many work late Sunday night.

36
Traffic Characteristics - I
37
Traffic Characteristics - Busy
  • System flat-out for about 18 hours on the Busy Day.
  • High traffic loads in normally off-peak times.
  • The Day After was reasonably quiet.
  • Subsequent days very quiet.
  • How close to meltdown on the Busy Day because
    of the Busy Tone?
  • Perhaps 25%, but it's sudden death:
  • Busy Tone load ~ (Retry Rate x Queue Length).
  • Need the PDQ model to know.

38
Traffic Characteristics - II
(Chart: Monday, Tuesday and Wednesday Load)
39
Traffic Characteristics - IV
(Chart: Load in equivalent hours @ max throughput)
40
Our new insights
  • Metrics captured/displayed during the life of the
    system did not fully describe the load on the
    system:
  • registrations, response time and vmstats,
  • but no DB connects.
  • The Busy Tone created significant load.
  • The spike on the Busy Day was super-exponential,
    not the original model.

41
Coulda, Woulda, Shoulda - Further Work
  • Develop a simple PDQ model to incorporate the busy
    tone effect and show power-law criticality.
  • Categorise the connection data.
  • Investigate webserver data to reconstruct user
    response time, and correlate it with load.
  • Show the problems of tuning the Busy Tone parameters.
  • Reconstruct system activity - CPU and IO - by
    aggregating the process accounting data.

42
Further work
  • General Recommendations:
  • Design in instrumentation, performance
    measurement, reporting and analysis, Capacity
    Planning and Load Projections.
  • Expect, and prepare, to report on "Lessons
    Learned".
  • E.g. a Canadian report: half or more of the
    traffic came in the last 6 weeks.
  • Design Operational Support and Fault Procedures:
    fault prioritisation, escalation, a war-room,
    and a security-failure or disaster response plan.
  • For e-mail transfer, use digests as well as
    backup CDs.
  • Describe the many different traffic periods:
  • Pre-Xmas, Xmas, January Holidays, the "Christmas
    rush", The Next Day, the False Deadline and well
    afterwards.

43
Summary
  • We banged on some different system data and were
    able to verify some important system effects,
  • and learnt a few new things along the way:
  • about the system,
  • about Busy Tone and runaway failures,
  • about handling and processing large datasets,
  • about performance analysis and modeling.