Unlocking the Mysteries Behind Update STatistics - PowerPoint PPT Presentation

1 / 37

About This Presentation

Title:

Unlocking the Mysteries Behind Update STatistics

Description:

Title: Unlocking the Mysteries Behind Update STatistics Author: John F. Miller III Last modified by: IBM_User Created Date: 3/8/1996 11:24:36 AM Document presentation ... – PowerPoint PPT presentation

Number of Views:186

Avg rating:3.0/5.0

Slides: 38

Provided by: John1980

Category:

more less

Transcript and Presenter's Notes

Title: Unlocking the Mysteries Behind Update STatistics

1
Unlocking the Mysteries Behind Update Statistics
John F. Miller III
STSM
Informix Chat with the Labs
2
The Dice Problem

Throw dice, how many will be 1?

3
Questions about the Dice

How many dice are you throwing?
How many sides does each dice have?
Are all the dice the same?

The better the information, the more accurate the
estimate.
4
What does Update Statistics do?

Collects information for the optimizer
Statistics LOW
Distributions MEDIUM
HIGH
Drop Distributions
Compile stored procedures

5
Statistics Collected

systables
systables
syscolumns
syscolumns
sysindexes
sysindexes

Number of Rows
Number of pages to store the data
Second largest value for a column
Second smallest value for a column
of unique values for the lead key
How highly clustered the values for the lead key

6
Update Statistics LowBasic Algorithm

Walk the leaf pages in each index
Submit btree cleaner requests when deleted items
are found causing re-balancing of indexes
Collects the following information
Number of unique items
Number of leave pages
How clustered the data is
Second highest and lowest value

7
How to Read Distributions
of rows represented in this bin
--- DISTRIBUTION --- (
-1 1 ( 868317, 70,
75) 2 ( 868317, 24,
100) 3 ( 868317, 12, 116) 4
( 868317, 30, 147) 5 (
868317, 39, 194) 6 ( 868317,
28, 222) --- OVERFLOW
--- 1 ( 779848, 43) 2
( 462364, 45)
of unique values
Highest Value in this bin

To get the range of values look at the highest
value in the previous bin.

The value
of rows for this value
8
Example - Approximating a Value

There are 868317 rows containing a value between
-1 and 75
There are 70 unique values in this range
The optimizer will deduce 868317 / 70 12,404
records for each value between -1 and 75

--- DISTRIBUTION --- (
-1 1 ( 868317, 70,
75) 2 ( 868317, 24,
100) 3 ( 868317, 12, 116) 4
( 868317, 30, 147) 5 (
868317, 39, 194) 6 ( 868317,
28, 222) --- OVERFLOW
--- 1 ( 779848, 43) 2
( 462364, 45)
9
Example - Dealing with Data Skew
--- DISTRIBUTION --- (
-1 1 ( 868317, 70,
75) 2 ( 868317, 24,
100) 3 ( 868317, 12, 116) 4
( 868317, 30, 147) 5 (
868317, 39, 194) 6 ( 868317,
28, 222) --- OVERFLOW
--- 1 ( 779848, 43) 2
( 462364, 45)

Data skew
For the value 43 how many records will the
optimizer estimate will exist?
Answer 779848 values
Any value that exceeds 25 of the bin size will
be placed in an overflow bin

10
Basic Algorithm for Distributions

Develop scan plan based on available resources
Scan table
High All rows
Medium Sample of rows
Sort each column

Build distributions
Begin transaction
Delete old columns distributions
Insert new columns distributions
Commit transaction

11
Sample Size

HIGH
All rows in the table
Medium
Misconception about the number of rows sampled is
based on the number of rows in the table, this is
incorrect.
The number of samples depends on the Confidence
and Resolution.
If the sample size is greater than the number of
row in the table Medium turns into High mode

12
Update Statistics Medium Sample Size
13
How Much Information is Enough??
The better the information, the more accurate the
estimate.
14
Examining the Running QueryNo Statistics VS
Medium Statistics
No Statistics QUERY ------ select
from t1 where c1 gt 20200 Estimated Cost
20888 Estimated of Rows Returned 6760 1)
miller3.t1 SEQUENTIAL SCAN Filters
miller3.t1.c1 gt 20200
Medium Statistics QUERY ------ select
from t1 where c1 gt 20200 Estimated Cost
21 Estimated of Rows Returned 19 1)
miller3.t1 INDEX PATH (1) Index Keys c1
(Serial, fragments ALL) Lower Index
Filter t1.c1 gt 20250
Overall performance improved The estimates were
more accurate The query plan changed
15
Examining the Running QueryMedium Statistics VS
High Statistics
High Statistics QUERY ------ select
from t1 where c1 gt 20200 Estimated Cost
33 Estimated of Rows Returned 30 1)
miller3.t1 INDEX PATH (1) Index Keys c1
Lower Index Filter t1.c1 gt 20250
Medium Statistics QUERY ------ select
from t1 where c1 gt 20200 Estimated Cost
21 Estimated of Rows Returned 19 1)
miller3.t1 INDEX PATH (1) Index Keys c1
Lower Index Filter t1.c1 gt 20250
Overall performance did not change The estimates
were slightly more accurate The query plan did
not change
16
Version of Update Statistics Improvements

All version of 9.40 and 10.00
9.30.UC3
9.21 Not fixed
7.31.UD2

17
Improvements in Update Statistics

Update statistics can not allocated memory
between 4MB and 100MB of sort memory
The default has been raised from 4MB to 15MB
User can now configure the amount of memory
Use DBUPSPACE has been augmented to include
memory
Format of DBUPSPACE
max disk spacedefault memory
To increase the memory to 35 MB, set
DBUPSPACE035.
Allow update statistics to use light scans when
scanning a a table
Implemented light scans
Set oriented reads

18
Improvements in update statistics

Information about building data distributions is
not viewable by the DBA
Set explain will now print the scan path and
resource usage when building data distributions
Update statistics low on fragmented tables does
not run in parallel
With PDQ turned on each index fragment will be
scanned in parallel
PDQ at 1 means 10 of the index fragments scanned
in parallel, while PDQ at 10 means all the index
fragments will be scanned in parallel

19
Improvements in Update Statistics

Various errors (126, 312, 100,) when executing
update statistics
Errors when trying to insert the distributions
because set lock mode to wait was not handled
properly inside update statistics
Range scanning a fragmented index is slow Replace
the next loop merge with a binary search merge
when ordering items from index fragments
Most noticeable when the number of fragments in
an index is large

20
Update Statistics Medium Memory Requirements
21
Update Statistics High Memory Requirements

In memory sort
Approximate Memory number of rows sum(column
widths 2 sizeof(pointer) )

22
Memory Rules

Estimated Update Stats memory is below 100MB
Hard coded limit of 4MB
Attempts to minimize the scans by fitting as many
columns into 4MB
Estimated Update Stats memory is above 100MB
Memory is requested from MGM
Attempt to minimize the scans by fitting as many
columns in the MGM memory

23
Examples

Customer Table
Cust_id integer
Fname char(50)
Lname char(50)
Address1 char(200)
Address2 char(200)
State char(2)
zipcode integer
Number of Rows 500,000

24
ExamplesMemory for Incore Sort
25
ExamplesNumber of Table Scans
26
Confidence

A factor in the number of samples used by update
statistics medium

27
Resolution

Percentage of data that is represented in a
distribution bin
Example
100,000 rows in the table
Resolution of 2
Each bin will represent 2,000 rows

28
Example

Following Example
Table size 215,000 rows
Row size 445 bytes
Uniprocessor

29
Example of the current update statistics

Table jmiller.t9
Mode HIGH
Number of Bins 267 Bin size 1082
Sort data 101.4 MB
Sort memory granted 4.0 MB
Estimated number of table scans 10
PASS 1 c9
PASS 2 c5
PASS 3 c7
PASS 4 c6
..
PASS 10 c4

Completed pass 1 in 0 minutes 24 seconds
Completed pass 2 in 0 minutes 20 seconds
Completed pass 3 in 0 minutes 17 seconds
Completed pass 4 in 0 minutes 17 seconds
Completed pass 5 in 0 minutes 17 seconds
Completed pass 6 in 0 minutes 15 seconds
Completed pass 7 in 0 minutes 14 seconds
Completed pass 8 in 0 minutes 15 seconds
Completed pass 9 in 0 minutes 16 seconds
Completed pass 10 in 0 minutes 14 seconds

Total Time 146 seconds
30
The New Defaults
Table jmiller.t9 Mode
HIGH Number of Bins 267 Bin size 1082 Sort
data 101.4 MB Sort memory granted
15.0 MB Estimated number of table scans
7 PASS 1 c9,c8,c10,c5,c7 PASS 2 c6,c1 PASS 3
c3 PASS 4 c2 PASS 5 c4

Completed pass 1 in 0 minutes 34 seconds
Completed pass 2 in 0 minutes 19 seconds
Completed pass 3 in 0 minutes 16 seconds
Completed pass 4 in 0 minutes 14 seconds
Completed pass 5 in 0 minutes 15 seconds

Total Time 98 seconds
New Memory Default
31
Enabling PDQ with Update Statistics

Table jmiller.t9
Mode HIGH
Number of Bins 267 Bin size 1082
Sort data 101.4 MB
PDQ memory granted 106.5 MB
Estimated number of table scans 1
PASS 1 c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
Index scans disabled
Light scans enabled
Completed pass 1 in 0 minutes 29 seconds

PDQ Memory
Features Enabled
Total Time 29 seconds
32
Tuning with the New Statistics

Turn on PDQ when running update statistics, but
only for tables
Avoid PDQ when updating statistics for procedures
When running high or medium increase the memory
update statistics has to work with
Enable parallel sorting (i.e. PSORT_NPROCS)

33
Considerations

Change the RESOLUTION to 1.5
Increasing the number of bins for the
distributions
Increasing the sample size for update statistics
medium

34
Old Recommendations

Start one update statistics for each column of a
table

Fname
Lname
Address
Three sequential scans of the table
35
New Recommendations

Start one update statistics for ALL columns
giving it more resources (memory)
Requires only one scan of the table to produce
distributions on several columns.

Fname
Lname
Address
One scans of the table
36
Other Information

An Overview of the IBM Informix Dynamic Server
Optimizer
www.ibm.com/developerworks/db2/zones/informix/libr
ary/techarticle/0211desai/0211desai.html
Understanding and Tuning Update Statistics
www.ibm.com/developerworks/db2/zones/informix/libr
ary/techarticle/miller/0203miller.html
Predicate Inference in Informix Dynamic Server
www.ibm.com/developerworks/db2/zones/informix/libr
ary/techarticle/0206goswami/0206goswami.html
IBM Informix Performance Manual
IBM Informix SQL Reference Manual

37
Questions

Write a Comment

User Comments (0)