Title: Unlocking the Mysteries Behind Update STatistics
1Unlocking the Mysteries Behind Update Statistics
John F. Miller III
STSM
Informix Chat with the Labs
2The Dice Problem
- Throw dice, how many will be 1?
3Questions about the Dice
- How many dice are you throwing?
- How many sides does each dice have?
- Are all the dice the same?
The better the information, the more accurate the
estimate.
4What does Update Statistics do?
- Collects information for the optimizer
- Statistics LOW
- Distributions MEDIUM
HIGH - Drop Distributions
- Compile stored procedures
5Statistics Collected
- systables
- systables
- syscolumns
- syscolumns
- sysindexes
- sysindexes
- Number of Rows
- Number of pages to store the data
- Second largest value for a column
- Second smallest value for a column
- of unique values for the lead key
- How highly clustered the values for the lead key
6Update Statistics LowBasic Algorithm
- Walk the leaf pages in each index
- Submit btree cleaner requests when deleted items
are found causing re-balancing of indexes - Collects the following information
- Number of unique items
- Number of leave pages
- How clustered the data is
- Second highest and lowest value
7How to Read Distributions
of rows represented in this bin
--- DISTRIBUTION --- (
-1 1 ( 868317, 70,
75) 2 ( 868317, 24,
100) 3 ( 868317, 12, 116) 4
( 868317, 30, 147) 5 (
868317, 39, 194) 6 ( 868317,
28, 222) --- OVERFLOW
--- 1 ( 779848, 43) 2
( 462364, 45)
of unique values
Highest Value in this bin
- To get the range of values look at the highest
value in the previous bin.
The value
of rows for this value
8Example - Approximating a Value
- There are 868317 rows containing a value between
-1 and 75 - There are 70 unique values in this range
- The optimizer will deduce 868317 / 70 12,404
records for each value between -1 and 75
--- DISTRIBUTION --- (
-1 1 ( 868317, 70,
75) 2 ( 868317, 24,
100) 3 ( 868317, 12, 116) 4
( 868317, 30, 147) 5 (
868317, 39, 194) 6 ( 868317,
28, 222) --- OVERFLOW
--- 1 ( 779848, 43) 2
( 462364, 45)
9Example - Dealing with Data Skew
--- DISTRIBUTION --- (
-1 1 ( 868317, 70,
75) 2 ( 868317, 24,
100) 3 ( 868317, 12, 116) 4
( 868317, 30, 147) 5 (
868317, 39, 194) 6 ( 868317,
28, 222) --- OVERFLOW
--- 1 ( 779848, 43) 2
( 462364, 45)
- Data skew
- For the value 43 how many records will the
optimizer estimate will exist? - Answer 779848 values
- Any value that exceeds 25 of the bin size will
be placed in an overflow bin
10Basic Algorithm for Distributions
- Develop scan plan based on available resources
- Scan table
- High All rows
- Medium Sample of rows
- Sort each column
- Build distributions
- Begin transaction
- Delete old columns distributions
- Insert new columns distributions
- Commit transaction
11Sample Size
- HIGH
- All rows in the table
- Medium
- Misconception about the number of rows sampled is
based on the number of rows in the table, this is
incorrect. - The number of samples depends on the Confidence
and Resolution. - If the sample size is greater than the number of
row in the table Medium turns into High mode
12Update Statistics Medium Sample Size
13How Much Information is Enough??
The better the information, the more accurate the
estimate.
14Examining the Running QueryNo Statistics VS
Medium Statistics
No Statistics QUERY ------ select
from t1 where c1 gt 20200 Estimated Cost
20888 Estimated of Rows Returned 6760 1)
miller3.t1 SEQUENTIAL SCAN Filters
miller3.t1.c1 gt 20200
Medium Statistics QUERY ------ select
from t1 where c1 gt 20200 Estimated Cost
21 Estimated of Rows Returned 19 1)
miller3.t1 INDEX PATH (1) Index Keys c1
(Serial, fragments ALL) Lower Index
Filter t1.c1 gt 20250
Overall performance improved The estimates were
more accurate The query plan changed
15Examining the Running QueryMedium Statistics VS
High Statistics
High Statistics QUERY ------ select
from t1 where c1 gt 20200 Estimated Cost
33 Estimated of Rows Returned 30 1)
miller3.t1 INDEX PATH (1) Index Keys c1
Lower Index Filter t1.c1 gt 20250
Medium Statistics QUERY ------ select
from t1 where c1 gt 20200 Estimated Cost
21 Estimated of Rows Returned 19 1)
miller3.t1 INDEX PATH (1) Index Keys c1
Lower Index Filter t1.c1 gt 20250
Overall performance did not change The estimates
were slightly more accurate The query plan did
not change
16Version of Update Statistics Improvements
- All version of 9.40 and 10.00
- 9.30.UC3
- 9.21 Not fixed
- 7.31.UD2
17Improvements in Update Statistics
- Update statistics can not allocated memory
between 4MB and 100MB of sort memory - The default has been raised from 4MB to 15MB
- User can now configure the amount of memory
- Use DBUPSPACE has been augmented to include
memory - Format of DBUPSPACE
- max disk spacedefault memory
- To increase the memory to 35 MB, set
DBUPSPACE035. - Allow update statistics to use light scans when
scanning a a table - Implemented light scans
- Set oriented reads
18Improvements in update statistics
- Information about building data distributions is
not viewable by the DBA - Set explain will now print the scan path and
resource usage when building data distributions - Update statistics low on fragmented tables does
not run in parallel - With PDQ turned on each index fragment will be
scanned in parallel - PDQ at 1 means 10 of the index fragments scanned
in parallel, while PDQ at 10 means all the index
fragments will be scanned in parallel
19Improvements in Update Statistics
- Various errors (126, 312, 100,) when executing
update statistics - Errors when trying to insert the distributions
because set lock mode to wait was not handled
properly inside update statistics - Range scanning a fragmented index is slow Replace
the next loop merge with a binary search merge
when ordering items from index fragments - Most noticeable when the number of fragments in
an index is large
20Update Statistics Medium Memory Requirements
21Update Statistics High Memory Requirements
- In memory sort
- Approximate Memory number of rows sum(column
widths 2 sizeof(pointer) )
22Memory Rules
- Estimated Update Stats memory is below 100MB
- Hard coded limit of 4MB
- Attempts to minimize the scans by fitting as many
columns into 4MB - Estimated Update Stats memory is above 100MB
- Memory is requested from MGM
- Attempt to minimize the scans by fitting as many
columns in the MGM memory
23Examples
- Customer Table
- Cust_id integer
- Fname char(50)
- Lname char(50)
- Address1 char(200)
- Address2 char(200)
- State char(2)
- zipcode integer
- Number of Rows 500,000
24ExamplesMemory for Incore Sort
25ExamplesNumber of Table Scans
26Confidence
- A factor in the number of samples used by update
statistics medium
27Resolution
- Percentage of data that is represented in a
distribution bin - Example
- 100,000 rows in the table
- Resolution of 2
- Each bin will represent 2,000 rows
28Example
- Following Example
- Table size 215,000 rows
- Row size 445 bytes
- Uniprocessor
29Example of the current update statistics
- Table jmiller.t9
- Mode HIGH
- Number of Bins 267 Bin size 1082
- Sort data 101.4 MB
- Sort memory granted 4.0 MB
- Estimated number of table scans 10
- PASS 1 c9
- PASS 2 c5
- PASS 3 c7
- PASS 4 c6
- ..
- PASS 10 c4
- Completed pass 1 in 0 minutes 24 seconds
- Completed pass 2 in 0 minutes 20 seconds
- Completed pass 3 in 0 minutes 17 seconds
- Completed pass 4 in 0 minutes 17 seconds
- Completed pass 5 in 0 minutes 17 seconds
- Completed pass 6 in 0 minutes 15 seconds
- Completed pass 7 in 0 minutes 14 seconds
- Completed pass 8 in 0 minutes 15 seconds
- Completed pass 9 in 0 minutes 16 seconds
- Completed pass 10 in 0 minutes 14 seconds
Total Time 146 seconds
30The New Defaults
Table jmiller.t9 Mode
HIGH Number of Bins 267 Bin size 1082 Sort
data 101.4 MB Sort memory granted
15.0 MB Estimated number of table scans
7 PASS 1 c9,c8,c10,c5,c7 PASS 2 c6,c1 PASS 3
c3 PASS 4 c2 PASS 5 c4
- Completed pass 1 in 0 minutes 34 seconds
- Completed pass 2 in 0 minutes 19 seconds
- Completed pass 3 in 0 minutes 16 seconds
- Completed pass 4 in 0 minutes 14 seconds
- Completed pass 5 in 0 minutes 15 seconds
Total Time 98 seconds
New Memory Default
31Enabling PDQ with Update Statistics
- Table jmiller.t9
- Mode HIGH
- Number of Bins 267 Bin size 1082
- Sort data 101.4 MB
- PDQ memory granted 106.5 MB
- Estimated number of table scans 1
- PASS 1 c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
- Index scans disabled
- Light scans enabled
- Completed pass 1 in 0 minutes 29 seconds
PDQ Memory
Features Enabled
Total Time 29 seconds
32Tuning with the New Statistics
- Turn on PDQ when running update statistics, but
only for tables - Avoid PDQ when updating statistics for procedures
- When running high or medium increase the memory
update statistics has to work with - Enable parallel sorting (i.e. PSORT_NPROCS)
33Considerations
- Change the RESOLUTION to 1.5
- Increasing the number of bins for the
distributions - Increasing the sample size for update statistics
medium
34Old Recommendations
- Start one update statistics for each column of a
table
Fname
Lname
Address
Three sequential scans of the table
35New Recommendations
- Start one update statistics for ALL columns
giving it more resources (memory) - Requires only one scan of the table to produce
distributions on several columns.
Fname
Lname
Address
One scans of the table
36Other Information
- An Overview of the IBM Informix Dynamic Server
Optimizer - www.ibm.com/developerworks/db2/zones/informix/libr
ary/techarticle/0211desai/0211desai.html - Understanding and Tuning Update Statistics
- www.ibm.com/developerworks/db2/zones/informix/libr
ary/techarticle/miller/0203miller.html - Predicate Inference in Informix Dynamic Server
- www.ibm.com/developerworks/db2/zones/informix/libr
ary/techarticle/0206goswami/0206goswami.html - IBM Informix Performance Manual
- IBM Informix SQL Reference Manual
37Questions