Unlocking the Mysteries Behind Update STatistics - PowerPoint PPT Presentation

1 / 37
About This Presentation
Title:

Unlocking the Mysteries Behind Update STatistics

Description:

Title: Unlocking the Mysteries Behind Update STatistics Author: John F. Miller III Last modified by: IBM_User Created Date: 3/8/1996 11:24:36 AM Document presentation ... – PowerPoint PPT presentation

Number of Views:183
Avg rating:3.0/5.0
Slides: 38
Provided by: John1980
Category:

less

Transcript and Presenter's Notes

Title: Unlocking the Mysteries Behind Update STatistics


1
Unlocking the Mysteries Behind Update Statistics
John F. Miller III
STSM
Informix Chat with the Labs
2
The Dice Problem
  • Throw dice, how many will be 1?

3
Questions about the Dice
  • How many dice are you throwing?
  • How many sides does each dice have?
  • Are all the dice the same?

The better the information, the more accurate the
estimate.
4
What does Update Statistics do?
  • Collects information for the optimizer
  • Statistics LOW
  • Distributions MEDIUM
    HIGH
  • Drop Distributions
  • Compile stored procedures

5
Statistics Collected
  • systables
  • systables
  • syscolumns
  • syscolumns
  • sysindexes
  • sysindexes
  • Number of Rows
  • Number of pages to store the data
  • Second largest value for a column
  • Second smallest value for a column
  • of unique values for the lead key
  • How highly clustered the values for the lead key

6
Update Statistics LowBasic Algorithm
  • Walk the leaf pages in each index
  • Submit btree cleaner requests when deleted items
    are found causing re-balancing of indexes
  • Collects the following information
  • Number of unique items
  • Number of leave pages
  • How clustered the data is
  • Second highest and lowest value

7
How to Read Distributions
of rows represented in this bin
--- DISTRIBUTION --- (
-1 1 ( 868317, 70,
75) 2 ( 868317, 24,
100) 3 ( 868317, 12, 116) 4
( 868317, 30, 147) 5 (
868317, 39, 194) 6 ( 868317,
28, 222) --- OVERFLOW
--- 1 ( 779848, 43) 2
( 462364, 45)
of unique values
Highest Value in this bin
  • To get the range of values look at the highest
    value in the previous bin.

The value
of rows for this value
8
Example - Approximating a Value
  • There are 868317 rows containing a value between
    -1 and 75
  • There are 70 unique values in this range
  • The optimizer will deduce 868317 / 70 12,404
    records for each value between -1 and 75

--- DISTRIBUTION --- (
-1 1 ( 868317, 70,
75) 2 ( 868317, 24,
100) 3 ( 868317, 12, 116) 4
( 868317, 30, 147) 5 (
868317, 39, 194) 6 ( 868317,
28, 222) --- OVERFLOW
--- 1 ( 779848, 43) 2
( 462364, 45)
9
Example - Dealing with Data Skew
--- DISTRIBUTION --- (
-1 1 ( 868317, 70,
75) 2 ( 868317, 24,
100) 3 ( 868317, 12, 116) 4
( 868317, 30, 147) 5 (
868317, 39, 194) 6 ( 868317,
28, 222) --- OVERFLOW
--- 1 ( 779848, 43) 2
( 462364, 45)
  • Data skew
  • For the value 43 how many records will the
    optimizer estimate will exist?
  • Answer 779848 values
  • Any value that exceeds 25 of the bin size will
    be placed in an overflow bin

10
Basic Algorithm for Distributions
  • Develop scan plan based on available resources
  • Scan table
  • High All rows
  • Medium Sample of rows
  • Sort each column
  • Build distributions
  • Begin transaction
  • Delete old columns distributions
  • Insert new columns distributions
  • Commit transaction

11
Sample Size
  • HIGH
  • All rows in the table
  • Medium
  • Misconception about the number of rows sampled is
    based on the number of rows in the table, this is
    incorrect.
  • The number of samples depends on the Confidence
    and Resolution.
  • If the sample size is greater than the number of
    row in the table Medium turns into High mode

12
Update Statistics Medium Sample Size
13
How Much Information is Enough??
The better the information, the more accurate the
estimate.
14
Examining the Running QueryNo Statistics VS
Medium Statistics
No Statistics QUERY ------ select
from t1 where c1 gt 20200 Estimated Cost
20888 Estimated of Rows Returned 6760 1)
miller3.t1 SEQUENTIAL SCAN Filters
miller3.t1.c1 gt 20200
Medium Statistics QUERY ------ select
from t1 where c1 gt 20200 Estimated Cost
21 Estimated of Rows Returned 19 1)
miller3.t1 INDEX PATH (1) Index Keys c1
(Serial, fragments ALL) Lower Index
Filter t1.c1 gt 20250
Overall performance improved The estimates were
more accurate The query plan changed
15
Examining the Running QueryMedium Statistics VS
High Statistics
High Statistics QUERY ------ select
from t1 where c1 gt 20200 Estimated Cost
33 Estimated of Rows Returned 30 1)
miller3.t1 INDEX PATH (1) Index Keys c1
Lower Index Filter t1.c1 gt 20250
Medium Statistics QUERY ------ select
from t1 where c1 gt 20200 Estimated Cost
21 Estimated of Rows Returned 19 1)
miller3.t1 INDEX PATH (1) Index Keys c1
Lower Index Filter t1.c1 gt 20250
Overall performance did not change The estimates
were slightly more accurate The query plan did
not change
16
Version of Update Statistics Improvements
  • All version of 9.40 and 10.00
  • 9.30.UC3
  • 9.21 Not fixed
  • 7.31.UD2

17
Improvements in Update Statistics
  • Update statistics can not allocated memory
    between 4MB and 100MB of sort memory
  • The default has been raised from 4MB to 15MB
  • User can now configure the amount of memory
  • Use DBUPSPACE has been augmented to include
    memory
  • Format of DBUPSPACE
  • max disk spacedefault memory
  • To increase the memory to 35 MB, set
    DBUPSPACE035.
  • Allow update statistics to use light scans when
    scanning a a table
  • Implemented light scans
  • Set oriented reads

18
Improvements in update statistics
  • Information about building data distributions is
    not viewable by the DBA
  • Set explain will now print the scan path and
    resource usage when building data distributions
  • Update statistics low on fragmented tables does
    not run in parallel
  • With PDQ turned on each index fragment will be
    scanned in parallel
  • PDQ at 1 means 10 of the index fragments scanned
    in parallel, while PDQ at 10 means all the index
    fragments will be scanned in parallel

19
Improvements in Update Statistics
  • Various errors (126, 312, 100,) when executing
    update statistics
  • Errors when trying to insert the distributions
    because set lock mode to wait was not handled
    properly inside update statistics
  • Range scanning a fragmented index is slow Replace
    the next loop merge with a binary search merge
    when ordering items from index fragments
  • Most noticeable when the number of fragments in
    an index is large

20
Update Statistics Medium Memory Requirements
21
Update Statistics High Memory Requirements
  • In memory sort
  • Approximate Memory number of rows sum(column
    widths 2 sizeof(pointer) )

22
Memory Rules
  • Estimated Update Stats memory is below 100MB
  • Hard coded limit of 4MB
  • Attempts to minimize the scans by fitting as many
    columns into 4MB
  • Estimated Update Stats memory is above 100MB
  • Memory is requested from MGM
  • Attempt to minimize the scans by fitting as many
    columns in the MGM memory

23
Examples
  • Customer Table
  • Cust_id integer
  • Fname char(50)
  • Lname char(50)
  • Address1 char(200)
  • Address2 char(200)
  • State char(2)
  • zipcode integer
  • Number of Rows 500,000

24
ExamplesMemory for Incore Sort
25
ExamplesNumber of Table Scans
26
Confidence
  • A factor in the number of samples used by update
    statistics medium

27
Resolution
  • Percentage of data that is represented in a
    distribution bin
  • Example
  • 100,000 rows in the table
  • Resolution of 2
  • Each bin will represent 2,000 rows

28
Example
  • Following Example
  • Table size 215,000 rows
  • Row size 445 bytes
  • Uniprocessor

29
Example of the current update statistics
  • Table jmiller.t9
  • Mode HIGH
  • Number of Bins 267 Bin size 1082
  • Sort data 101.4 MB
  • Sort memory granted 4.0 MB
  • Estimated number of table scans 10
  • PASS 1 c9
  • PASS 2 c5
  • PASS 3 c7
  • PASS 4 c6
  • ..
  • PASS 10 c4
  • Completed pass 1 in 0 minutes 24 seconds
  • Completed pass 2 in 0 minutes 20 seconds
  • Completed pass 3 in 0 minutes 17 seconds
  • Completed pass 4 in 0 minutes 17 seconds
  • Completed pass 5 in 0 minutes 17 seconds
  • Completed pass 6 in 0 minutes 15 seconds
  • Completed pass 7 in 0 minutes 14 seconds
  • Completed pass 8 in 0 minutes 15 seconds
  • Completed pass 9 in 0 minutes 16 seconds
  • Completed pass 10 in 0 minutes 14 seconds

Total Time 146 seconds
30
The New Defaults
Table jmiller.t9 Mode
HIGH Number of Bins 267 Bin size 1082 Sort
data 101.4 MB Sort memory granted
15.0 MB Estimated number of table scans
7 PASS 1 c9,c8,c10,c5,c7 PASS 2 c6,c1 PASS 3
c3 PASS 4 c2 PASS 5 c4
  • Completed pass 1 in 0 minutes 34 seconds
  • Completed pass 2 in 0 minutes 19 seconds
  • Completed pass 3 in 0 minutes 16 seconds
  • Completed pass 4 in 0 minutes 14 seconds
  • Completed pass 5 in 0 minutes 15 seconds

Total Time 98 seconds
New Memory Default
31
Enabling PDQ with Update Statistics
  • Table jmiller.t9
  • Mode HIGH
  • Number of Bins 267 Bin size 1082
  • Sort data 101.4 MB
  • PDQ memory granted 106.5 MB
  • Estimated number of table scans 1
  • PASS 1 c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
  • Index scans disabled
  • Light scans enabled
  • Completed pass 1 in 0 minutes 29 seconds

PDQ Memory
Features Enabled
Total Time 29 seconds
32
Tuning with the New Statistics
  • Turn on PDQ when running update statistics, but
    only for tables
  • Avoid PDQ when updating statistics for procedures
  • When running high or medium increase the memory
    update statistics has to work with
  • Enable parallel sorting (i.e. PSORT_NPROCS)

33
Considerations
  • Change the RESOLUTION to 1.5
  • Increasing the number of bins for the
    distributions
  • Increasing the sample size for update statistics
    medium

34
Old Recommendations
  • Start one update statistics for each column of a
    table

Fname
Lname
Address
Three sequential scans of the table
35
New Recommendations
  • Start one update statistics for ALL columns
    giving it more resources (memory)
  • Requires only one scan of the table to produce
    distributions on several columns.

Fname
Lname
Address
One scans of the table
36
Other Information
  • An Overview of the IBM Informix Dynamic Server
    Optimizer
  • www.ibm.com/developerworks/db2/zones/informix/libr
    ary/techarticle/0211desai/0211desai.html
  • Understanding and Tuning Update Statistics
  • www.ibm.com/developerworks/db2/zones/informix/libr
    ary/techarticle/miller/0203miller.html
  • Predicate Inference in Informix Dynamic Server
  • www.ibm.com/developerworks/db2/zones/informix/libr
    ary/techarticle/0206goswami/0206goswami.html
  • IBM Informix Performance Manual
  • IBM Informix SQL Reference Manual

37
Questions
Write a Comment
User Comments (0)
About PowerShow.com