Java Freeman: Selectivity

一、测试场景
There are 12 possible months in the year. -- known reference
Dates of birth are (probably) evenly scattered through the year. -- assumption
One-twelfth of the audience will be born in any one month. -- month’s selectivity
The request was for one specific month. -- predicate
The requested month does actually exist in the calendar. -- boundary check
There are 1,200 people in the audience. -- base cardinality
The answer is one twelfth of 1,200, which is 100. -- computed cardinality

1.1数据准备
execute dbms_random.seed(0);
create table audience as
select
    trunc(dbms_random.value(1,13))    month_no
from
    all_objects
where
    rownum <= 1200
;

统计信息收集
begin
    dbms_stats.gather_table_stats(
        user,
        'audience',
        cascade => true,
        estimate_percent => null,
        method_opt => 'for all columns size 1'
    );
end;
/

显示统计信息
select
    column_name,
    num_distinct,
    num_nulls,
    density,
    value_to_number(low_value)    low,
    value_to_number(high_value)    high
from
    user_tab_columns
where    table_name = 'AUDIENCE'
and    column_name = 'MONTH_NO'
;

COLUMN_NAME          NUM_DISTINCT NUM_NULLS    DENSITY        LOW       HIGH
-------------------- ------------ ---------- ---------- ---------- ----------
MONTH_NO                       12          0 .083333333          1         12

select
    column_name, endpoint_number, endpoint_value
from
    user_tab_histograms
where
    table_name = 'AUDIENCE'
order by
    column_name, endpoint_number
;

COLUMN_NAME          ENDPOINT_NUMBER ENDPOINT_VALUE
-------------------- --------------- --------------
MONTH_NO                           0              1
MONTH_NO                           1             12

1.2执行查询
1.2.1单一查询
select count(*) from audience where month_no = 12;

We examine figures from the view user_tab_col_statistics (or user_tab_columns) and check view user_tab_histograms to find the following details:
• user_tab_col_statistics.num_distinct is equal to 12.
• user_tab_histograms shows just the low (1) and high (12) values, so assume the values are evenly spread.
• user_tab_col_statistics.density is equal to 1/12; one month gives one twelfth of the data.
• month_no is equal to 12,single column,equality,so user_tab_col_statistics.density is usable.
• 12 is between the low_value and high_value of user_tab_col_statistics.
• user_tab_col_statistics.num_nulls is equal to 0 (everyone was born some time—
the computer has to consider it, even though it was intuitively obvious to you).
• user_tables.num_rows is equal to 1,200.
• The answer is one-twelfth of 1,200, which is 100.

the query for December birthdays produces the following execution plan:
Execution Plan (9.2.0.6)
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=2 Card=1 Bytes=3)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'AUDIENCE' (Cost=2 Card=100 Bytes=300)

单一查询的Cardinality计算方式如下：

cardinality = num_rows * density = 1200 * 0 .083333333 = 100

或者：

cardinality = num_rows / num_distinct = 1200 / 12 = 100

1.2.2数据中包含NULL值的查询
用update audience set month_no = null where mod(id,10) = 0;对上面数据做如下改动
user_tab_col_statistics.num_nulls is equal to 120 -- new data item

统计信息如下
COLUMN_NAME          NUM_DISTINCT NUM_NULLS    DENSITY        LOW       HIGH
-------------------- ------------ ---------- ---------- ---------- ----------
ID                           1200          0 .000833333          1       1200
MONTH_NO                       12        120 .083333333          1         12

COLUMN_NAME          ENDPOINT_NUMBER ENDPOINT_VALUE
-------------------- --------------- --------------
ID                                 0              1
ID                                 1           1200
MONTH_NO                           0              1
MONTH_NO                           1             12
；：
因为NULL值的变更，The equivalent “argument” from the optimizer is ：

• Base selectivity = 1/12 (from density or from 1/num_distinct)
• num_nulls = 120
• num_rows = 1200
• Adjusted selectivity = Base selectivity * (num_rows - num_nulls) / num_rows
• Adjusted selectivity = (1/12) * ((1200 - 120)/1200) = 0.075
• Adjusted cardinality = Adjusted selectivity * num_rows
• Adjusted cardinality = 0.075 * 1200 = 90

the query for December birthdays execution plan变成如下所示:

Execution Plan (9.2.0.6)
----------------------------------------------------------
   0      SELECT STATEMENT Optimizer=ALL_ROWS (Cost=2 Card=1 Bytes=3)
   1    0   SORT (AGGREGATE)
   2    1     TABLE ACCESS (FULL) OF 'AUDIENCE' (Cost=2 Card=90 Bytes=270)

1.2.3用in查询

在没有null值的数据上进行如下查询

select count(*) from audience where month_no in (1,2);--Card=200
select count(*) from audience where month_no in (1,2,3);--Card=300
select count(*) from audience where month_no in (1,2,3,4);--Card=400
select count(*) from audience where month_no in (1,2,3,4,5);--Card=500
select count(*) from audience where month_no in (1,2,3,4,5,6);--Card=600
select count(*) from audience where month_no in (1,2,3,4,5,6,7);--Card=700
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8);--Card=800
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9);--Card=900
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10);--Card=1000
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11);--Card=1100
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11,12);--Card=1200
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11,12,13);--Card=1200

实际计算结果应该是1291（如何计算下面分析），但是因为这个数超过了num_rows = 1200，所以最后得出Card=1200
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11,12,13,14);--Card=1200

同上。

select count(*) from audience where month_no in (
    1, 2, 3, 4, 5, 6, 7, 8, 9,10,
    11,12,13,14,15,16,17,18,19,20,
    21,22,23,24,25,26,27,28,29,30
);--Card=1200

同上。

1.2.4Card的计算方法

The calculation that Oracle uses can be represented graphically, as shown in Figure 3-1 the further away you get from the known low/high range, the less likely you are to find data. Oracle uses a straight-line decay to predict the variation, decaying to zero when you exceed the range by the difference between low and high.
We have a range of 11—running from 1 to 12. The right-hand edge hits zero at month_no = 23 (high value + 11); the left-hand edge hits zero at month_no = -10 (low value – 11).

根据上面的图，1～12之间是Card=100的常数，-10～1之间是斜率为100/11的直线，12～23之间是斜率为-100/11的直线。Card值都是根据这个图和统计信息来计算的。

select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11,12,13);

的Card值计算方法如下：

100+100+100+100+100+100+100+100+100+100+100+100+（100-100/11)=1291(四舍五入）

1.2.5范围查询

1. month_no > 8 --Card=436

month_no > 8: This is an unbounded (no limit at one end), open (8 is excluded) range.
• Selectivity = (high_value – limit) / (high_value – low_value) = (12 – 8) / (12 – 1) = 4/11
• Cardinality = 1,200 * 4 / 11 = 436.363636 ... hence 437 or 436 depending on whether your version of Oracle rounds, or rounds up.

2. month_no >= 8 --Card=536

month_no >= 8: This is an unbounded, closed (includes 8) range, so adjust for closure. The adjustment is made by including the rows for the closing value—in other words, adding 1/num_distinct. (Again, 8i seems to use the density rather than 1/num_distinct, but this isn’t something you would notice unless you’ve been hacking the statistics or have a histogram in place.)
• Selectivity = (high_value – limit) / (high_value – low_value) + 1/num_distinct = 4/11 + 1/12
• Cardinality = 1,200 * (4/11 + 1/12) = 536.363636 ... 四舍五入

3. month_no < 8 --Card=764

4. month_no <= 8 --Card=864

5. month_no between 6 and 9 --Card=527

month_no between 6 and 9: Both are bounded (limited at both ends), closed ranges—the between clause of case 5 is just a convenient shorthand for the two separate predicates of case 6. This gives us two closing values, so two occurrences of the adjustment.
Selectivity = (9 – 6) / (12 – 1) + 1/12 + 1/12 ( >= , <= )

6. month_no >= 6 and month_no <= 9 --Card=527

7. month_no >= 6 and month_no < 9 --Card=427

8. month_no > 6 and month_no <= 9 --Card=427
9. month_no > 6 and month_no < 9 --Card=327
10. month_no > :b1 --Card=101

绑定变量在11g以前都是按照5%来计算的，但是11g好像做了特殊处理了，具体方法不知。

11. month_no >= :b1 --Card=101

12. month_no < :b1 --Card=101
13. month_no <= :b1 --Card=101
14. month_no between :b1 and :b2 --Card=101
15. month_no >= :b1 and month_no <= :b2 --Card=101
16。 month_no >= :b1 and month_no < :b2 --Card=101
17. month_no > :b1 and month_no < :b2 --Card=101
18. month_no > :b1 and month_no <= :b2 --Card=101
19. month_no > 12 --Card=100

如果按照No1的计算方法，Card=（12-12）/11*1200=0

根据实验，个人得出如下结论：

Oracle11g对边界值做了特殊处理，边界值的时候的计算公式如下

Card=1/distincts*num_rows=1/12*1200=100

20. month_no >= 12 --Card=100

边界值的时候>和>=没有区别

21. month_no >12 or month_no <12 --Card=1108

(1/12+(12-1)/12 - 1/12*(12-1)/12)*1200= 1108

22. month_no between 25 and 30 --Card=100

The optimizer detects that the query falls outside the known range, and seems to return a selectivity, hence cardinality that would be correct for column = constant.

23. month_no > 8 or month_no <= 8 --Card=1200

11g似乎对集合的处理更加巧妙了，11g以前都是按照上面No5的处理方式来处理的。

11g能够辨识出条件的组合是不是整个集合。

24. month_no > 8 or month_no <= 9 --Card=1200

1.2.6总结
Those of you who are familiar with probability theory will have recognized the three formulae as the standard formulae for calculating the combined probabilities of independent events:
• Probability(A AND B occur) = Probability(A occurs) * Probability(B occurs)
• Probability(A OR B occurs) = Probability(A occurs) + Probability(B occurs) – Probability(A AND B occur)
• Probability(NOT(A occurs)) = 1 – Probability(A occurs)

Java Freeman

2011年7月18日星期一

Selectivity

没有评论:

发表评论