Saturday, July 30, 2011

Selectivity Issues

1. Different Data Types
We have been using the basic formula for the selectivity of a range-based predicate, and know that it varies slightly depending on whether you have zero, one, or two closed ends to your range (closed meaning the equality part of <= or >=). Using N as the number of closed ends, the version of the formula that uses user_tab_columns.num_distinct can be written as

   (required range) / (column high value - column low value) + N / num_distinct

For example, in a column with 1,000 different (integer) values ranging from 1 to 1000, and the pair of predicates colX > 10 and colX <= 20, you have one closed end (<=), and the formula would give you
   (20 - 10) / (1000 - 1) + 1/1000 = 10/999 + 1/1000

1.1 Date Values
Consider a date-only column holding values from 1 January 2000 to 31 December 2004; a predicate date_col between 30th Dec 2002 and 5th Jan 2003 would then give you a selectivity of
   (5th Jan 2003 - 30th Dec 2002 ) / ( 31st Dec 2004 - 1st Jan 2000 ) +
   2/(number of different dates) = 6 / 1826 + 2 / 1827
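Putting numbers to that expression (my own arithmetic, not part of the original text):

   6 / 1826 + 2 / 1827 = 0.003286 + 0.001095 = 0.004381

so against a table with 1,827 rows (one per date) the predicted cardinality is 0.004381 * 1827, which rounds to 8. We will meet that figure again in test case 1 below.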

1.2 Character Values
In a test table, t1, with two columns, one of type varchar2(10) and one of type char(10), I have inserted the strings 'Aardvark', 'Apple', 'Blueberry', and 'Zymurgy' in both columns, and then generated statistics—including histograms.

create table t1 (
    v10    varchar2(10),
    c10    char(10)
)
;

insert into t1 values('Apple','Apple');
insert into t1 values('Blueberry','Blueberry');
insert into t1 values('Aardvark','Aardvark');
insert into t1 values('Zymurgy','Zymurgy');

commit;

begin
    dbms_stats.gather_table_stats(
        ownname            => user,
        tabname            => 't1',
        cascade            => true,
        estimate_percent    => null,
        method_opt        =>'for all columns size 10'
    );
end;
/

select
        column_name,
        endpoint_number,
        endpoint_value
from
        user_tab_histograms
where
        table_name = 'T1'
order by
        column_name,
        endpoint_number
;

Col   End no                                        End Value
----- ------ ------------------------------------------------
C10        1  339,475,752,638,459,000,000,000,000,000,000,000     -- 'Aardvark  '
           2  339,779,832,781,209,000,000,000,000,000,000,000     -- 'Apple     '
           3  344,891,393,972,447,000,000,000,000,000,000,000     -- 'Blueberry '
           4  469,769,561,047,943,000,000,000,000,000,000,000     -- 'Zymurgy   '
V10        1  339,475,752,638,459,000,000,000,000,000,000,000     -- 'Aardvark'
           2  339,779,832,781,057,000,000,000,000,000,000,000     -- 'Apple'
           3  344,891,393,972,447,000,000,000,000,000,000,000     -- 'Blueberry'
           4  469,769,561,047,943,000,000,000,000,000,000,000     -- 'Zymurgy'

Oracle appears to behave as follows:
• Extract a maximum of 32 bytes from the column; this representation of the column value is how the low_value, high_value, and endpoint_actual_value statistics are stored.
• Extract the first 15 bytes from the 32, padding with zeros on the right if necessary.
• Convert the 15-byte hex number to decimal and round to 15 significant figures.

Let’s take a worked example—the string 'Aardvark'—to see if we can end up with the value stored in the histogram:
• 'Aardvark', when dumped in hex from the char(10) column, contains the following list of byte values (note the 20,20 space padding that is imposed by the char(10) definition):
'41,61,72,64,76,61,72,6B,20,20'.
• Since this is less than 15 bytes, we append five zero bytes to produce the number
0x416172647661726B20200000000000. (If the column had been declared as char(40), the value would already have been padded with spaces (0x20) up to 40 characters, so we would stop at the fifteenth byte, and the value we would use would look like 0x416172647661726B20202020202020.)
• Converting this rather large hex number to decimal we get 339,475,752,638,459,043,065,991,628,037,554,176.
• And if we throw away everything after the first 15 digits then, as required, we get
339,475,752,638,459,000,000,000,000,000,000,000.
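The same arithmetic can be reproduced in SQL (a sketch using the to_number hex format mask; the round(..., -21) step, which trims a 36-digit number to 15 significant figures, is my own way of mimicking the rounding, not anything Oracle exposes):

select
    dump('Aardvark  ', 16)                      byte_dump,   -- hex bytes of the char(10) value
    to_number(
        '416172647661726B20200000000000',                    -- the 15 zero-padded bytes
        'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
    )                                           full_value,
    round(
        to_number(
            '416172647661726B20200000000000',
            'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
        ),
        -21                                                  -- 36 digits down to 15 significant figures
    )                                           hist_value   -- matches the stored histogram value
from dual
;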

Looking at the rounding, you might notice that after about the first six or seven characters of a string, the rest of the letters don’t have any impact on the numeric representation used by the optimizer—which is why the numeric values for 'Apple' are the only ones that vary when comparing the char(10) with the varchar2(10) versions. We don’t really need to go into any more detail—but you could imagine that the optimizer might have trouble coping with data consisting (for example) of URLs, when lots of them start with http://.
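For instance (my own check, using two illustrative URLs that share the prefix 'http://www.'): the hex strings below are the first 15 bytes of 'http://www.oracle.com' and 'http://www.apress.com', and both produce the same rounded value:

select
    round(to_number('687474703A2F2F7777772E6F726163',
                    'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'), -21)    oracle_url,
    round(to_number('687474703A2F2F7777772E61707265',
                    'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'), -21)    apress_url
from dual
;
-- the two URLs differ only from the twelfth byte on, far beyond the six or so
-- bytes of precision that survive the rounding, so the results are identical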

The problem can be much worse, in fact, because of the increasing popularity of national language support. If you pick a multibyte character set for your database character set, then Oracle will be using the first 15 bytes of the string, not the first 15 characters. So the precision gets even worse.

1.3 Daft Data Types
One of the first examples in this chapter examined the predicate date_col between 30th Dec 2002 and 5th Jan 2003 on a table holding data from 1 January 2000 to 31 December 2004, and we saw how the optimizer could work out the selectivity of such a predicate. But let's see how things can go wrong with some popular ways of storing dates. We create and populate a table with one row per date over the same five-year range, storing each date three ways: as a real date, as a yyyymmdd number, and as a yyyymmdd string.

create table t1 (
    d1        date,
    n1        number(8),
    v1        varchar2(8)
)
;

insert into t1
select
    d1,
    to_number(to_char(d1,'yyyymmdd')),
    to_char(d1,'yyyymmdd')
from    (
    select
        to_date('31-Dec-1999','dd-mon-yyyy') + rownum    d1
    from all_objects
    where
        rownum <= 1827
    )
;

begin
    dbms_stats.gather_table_stats(
        ownname            => user,
        tabname            => 't1',
        cascade            => true,
        estimate_percent    => null,
        method_opt        =>'for all columns size 1'
    );
end;
/

Test case 1
select    *
from    t1
where     d1 between to_date('30-Dec-2002','dd-mon-yyyy')
       and     to_date('05-Jan-2003','dd-mon-yyyy')
;

Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |     8 |   184 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |     8 |   184 |     4 |
----------------------------------------------------------
Selectivity = 6/1826 + 2/1827 = 0.004381
Cardinality = selectivity * num_rows = 0.004381 * 1827 = 8 (rounded)

Test case 2

select    *
from    t1
where     n1 between 20021230 and 20030105
;
Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   396 |  9108 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |   396 |  9108 |     4 |
----------------------------------------------------------
Selectivity = (required range) / (high value - low value) + 2/num_distinct
                 = (20030105 - 20021230) / (20041231 - 20000101) + 2/1827
                 =         8875          /         41130         + 2/1827
                 = 0.215779 + 0.001095
                 = 0.216874           -- more than 20% of the data apparently!
Cardinality = Selectivity * row_nums = 396
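The inflated figure comes from the gaps the yyyymmdd encoding leaves between consecutive dates (my own illustration):

   20030101 - 20021231 = 8870        -- one real day apart
   20030105 - 20021230 = 8875        -- six real days apart

The numerator spans 8,875 numeric units, even though only 6 days of data fall inside the range.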

Test case 3
select    *
from    t1
where     v1 between '20021230' and '20030105'
;
Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      |   396 |  9108 |     4 |
|*  1 |  TABLE ACCESS FULL| T1   |   396 |  9108 |     4 |
----------------------------------------------------------
Same as test case 2.
Oracle seems to have treated the varchar2 column and its predicate values as if they were numeric. It is possible that a programmer in the CBO group has slipped in a “cunning guess” tweak to the optimizer in this very special case. The tweak disappears as soon as you format the date-like string to include nonnumeric characters (e.g., '2002-01-02').

2. Leading Zeros
create table t1
nologging
pctfree 0
as
with generator as (
    select    --+ materialize
        rownum     id
    from    all_objects
    where    rownum <= 2000
)
select
    /*+ ordered use_nl(v2) */
    trunc((rownum-1)/10000)        grp_id,
    lpad(rownum,18,'0')        id
--    'A' || lpad(rownum, 17, '0')    id
from
    generator    v1,
    generator    v2
where
    rownum <= 2000000
;

begin
    dbms_stats.gather_table_stats(
        ownname            => user,
        tabname            => 't1',
        cascade            => true,
        estimate_percent    => null,
        method_opt        =>'for all columns size 1'
    );
end;
/

select
    *
from t1
where
    id between '000000000000060000'
       and       '000000000000070000'
;

Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013

----------------------------------------------------------
| Id  | Operation         | Name | Rows  | Bytes | Cost  |
----------------------------------------------------------
|   0 | SELECT STATEMENT  |      | 10002 |   214K|  1059 |
|*  1 |  TABLE ACCESS FULL| T1   | 10002 |   214K|  1059 |
----------------------------------------------------------
In 11g the cardinality estimate has been corrected (10,002 against the 10,001 rows actually in the range); in 9i the calculation went wrong, because every id starts with the same long run of zeros, so under the 15-significant-figure rounding of section 1.2 the low value, high value, and predicate values all collapse to nearly the same number.
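A quick sanity check of that failure mode, reusing the hex trick from section 1.2 (the two hex strings are the first 15 bytes of the predicate values '000000000000060000' and '000000000000070000'; this is my own verification, not optimizer output):

select
    round(to_number('303030303030303030303030303630',
                    'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'), -21)    lo_rep,
    round(to_number('303030303030303030303030303730',
                    'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'), -21)    hi_rep
from dual
;
-- lo_rep and hi_rep come back identical: the rounded representations
-- cannot tell the two endpoints apart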

3. Deadly Defaults
What might the average database-independent developer choose as a good value to represent a null date? How about something far in the future, like 31 December 4000?
Suppose the five-year table from the previous example picks up one row carrying that dummy date, and you then query for the whole of 2003 (d1 between 1 Jan 2003 and 31 Dec 2003). Doing a quick calculation for the range-scan selectivity, you believe the selectivity is
  ( 31 Dec 2003 - 01 Jan 2003 ) / ( 31 Dec 2004 - 01 Jan 2000 ) + 2/1827 = 0.20044
But, given the extra, far-out value, which stretches the high value and adds one to num_distinct, the optimizer thinks the selectivity is
  ( 31 Dec 2003 - 01 Jan 2003 ) / ( 31 Dec 4000 - 01 Jan 2000 ) + 2/1828 = 0.00159
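A sketch of the experiment against the t1 from section 1.3 (the inserted row and the expected estimates are my own assumptions about how the test would run, not figures from the original text):

insert into t1 (d1, n1, v1)
values (
    to_date('31-Dec-4000','dd-mon-yyyy'),    -- the "null means far future" dummy
    40001231,
    '40001231'
);
commit;

begin
    dbms_stats.gather_table_stats(
        ownname          => user,
        tabname          => 't1',
        cascade          => true,
        estimate_percent => null,
        method_opt       => 'for all columns size 1'
    );
end;
/

select  *
from    t1
where   d1 between to_date('01-Jan-2003','dd-mon-yyyy')
          and      to_date('31-Dec-2003','dd-mon-yyyy')
;
-- estimate before the dummy row: 0.20044 * 1827, about 366 rows
-- estimate after:                0.00159 * 1828, about 3 rows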

4. Function Figures
When a predicate applies a function to a column (so the optimizer cannot use the column's statistics), it falls back on fixed guesses:

function(colx) = 'SMITH'                                 Fixed 1% selectivity.
not function(colx) = 'SMITH'                             Fixed 5% selectivity.
function(colx) > 'SMITH'                                 Fixed 5% selectivity.
not function(colx) > 'SMITH'                             Fixed 5% selectivity.
function(colx) >= 'SMITH' and function(colx) < 'SMITI'   Derived 0.25% selectivity (5% * 5%).
function(colx) between 'SMITHA' and 'SMITHZ'             Derived 0.25% selectivity (5% * 5%).
not function(colx) between 'SMITHA' and 'SMITHZ'         Derived 9.75% selectivity (5% + 5% - (5% * 5%)).
function(colx) like 'SMITH%'                             Fixed 5% selectivity.
not function(colx) like 'SMITH%'                         Fixed 5% selectivity.
function(colx) in ('SMITH','JONES')                      Derived 1.99% selectivity (1% + 1% - (1% * 1%)).
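To watch one of these guesses appear, something like the following would do (a sketch: I am borrowing the 10,000-row t1 built in the next section, and the 100-row figure is the expected estimate, not a measured result):

explain plan for
select  *
from    t1
where   upper(small_vc) = 'SMITH'
;

select * from table(dbms_xplan.display);
-- with no statistics on the expression, expect Rows = 10000 * 1% = 100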

5. Correlated Columns
create table t1
nologging
as
select
    trunc(dbms_random.value(0,25))    n1,
    rpad('x',40)            ind_pad,
    trunc(dbms_random.value(0,20))    n2,
    lpad(rownum,10,'0')        small_vc,
    rpad('x',200)            padding
from
    all_objects
where
    rownum  <= 10000
;

update t1 set n2 = n1;

create index t1_i1 on t1(n1, ind_pad, n2)
nologging
pctfree 91
;

begin
    dbms_stats.gather_table_stats(
        ownname => user,
        tabname    => 'T1',
        cascade    => true,
        estimate_percent => null,
        method_opt => 'for all columns size 1'
    );
end;
/

select   
    table_name,
    blocks,
    num_rows
from    user_tables
where    table_name = 'T1'
;
TABLE_NAME               BLOCKS   NUM_ROWS
-------------------- ---------- ----------
T1                          387      10000

select
    num_rows, distinct_keys,
    blevel, leaf_blocks, clustering_factor,
    avg_leaf_blocks_per_key, avg_data_blocks_per_key
from
    user_indexes
where    table_name = 'T1'
and    index_name = 'T1_I1'
;


  NUM_ROWS DISTINCT_KEYS     BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR AVG_LEAF_BLOCKS_PER_KEY AVG_DATA_BLOCKS_PER_KEY
---------- ------------- ---------- ----------- ----------------- ----------------------- -----------------------
     10000            25          2        1107              6153                      44                     246

select
    column_name,
    num_nulls, num_distinct, density,
    low_value, high_value
from
    user_tab_columns
where    table_name = 'T1'
and    column_name in ('N1','N2','IND_PAD')
order by
    column_name
;
Col                   NUM_NULLS NUM_DISTINCT    DENSITY
-------------------- ---------- ------------ ----------
IND_PAD                       0            1          1
N1                            0           25        .04
N2                            0           25        .04

Test case 1
select
    /*+ index(t1) */
    small_vc
from
    t1
where
    n1    = 2
and    ind_pad    = rpad('x',40)
and    n2    = 2
;
Execution Plan
----------------------------------------------------------
Plan hash value: 1429545322

---------------------------------------------------------------------
| Id  | Operation                   | Name  | Rows  | Bytes | Cost  |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT            |       |   400 | 23200 |   294 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T1    |   400 | 23200 |   294 |
|*  2 |   INDEX RANGE SCAN          | T1_I1 |   400 |       |    47 |
---------------------------------------------------------------------
11g can recognize that n1 and n2 are correlated: because the index t1_i1 covers all three predicate columns with equality predicates, the optimizer can use the index's distinct_keys as a sanity check when calculating the cardinality:
Cardinality = num_rows / distinct_keys = 10000 / 25 = 400
Without this sanity check, multiplying the individual column selectivities would give 10000 * 1/25 * 1 * 1/25 = 16 rows, badly underestimating the roughly 400 rows that actually match.
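If no such index existed, the 11g remedy would be a column group created with extended statistics (a sketch, assuming the standard dbms_stats.create_extended_stats API; the group name Oracle generates is system-defined):

declare
    l_name    varchar2(32);
begin
    l_name := dbms_stats.create_extended_stats(
        ownname   => user,
        tabname   => 'T1',
        extension => '(N1,N2)'
    );
    dbms_stats.gather_table_stats(
        ownname          => user,
        tabname          => 'T1',
        estimate_percent => null,
        method_opt       => 'for all columns size 1'
    );
end;
/

With the column group in place the optimizer has a num_distinct figure for the (n1, n2) pair, so it can see the correlation even when the query cannot use the index.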
