There are 12 possible months in the year. -- known reference
Dates of birth are (probably) evenly scattered through the year. -- assumption
One-twelfth of the audience will be born in any one month. -- month’s selectivity
The request was for one specific month. -- predicate
The requested month does actually exist in the calendar. -- boundary check
There are 1,200 people in the audience. -- base cardinality
The answer is one twelfth of 1,200, which is 100. -- computed cardinality
1.1数据准备
execute dbms_random.seed(0);
create table audience as
select
trunc(dbms_random.value(1,13)) month_no
from
all_objects
where
rownum <= 1200
;
统计信息收集
begin
dbms_stats.gather_table_stats(
user,
'audience',
cascade => true,
estimate_percent => null,
method_opt => 'for all columns size 1'
);
end;
/
显示统计信息
select
column_name,
num_distinct,
num_nulls,
density,
value_to_number(low_value) low,
value_to_number(high_value) high
from
user_tab_columns
where table_name = 'AUDIENCE'
and column_name = 'MONTH_NO'
;
COLUMN_NAME NUM_DISTINCT NUM_NULLS DENSITY LOW HIGH
-------------------- ------------ ---------- ---------- ---------- ----------
MONTH_NO 12 0 .083333333 1 12
select
column_name, endpoint_number, endpoint_value
from
user_tab_histograms
where
table_name = 'AUDIENCE'
order by
column_name, endpoint_number
;
COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
-------------------- --------------- --------------
MONTH_NO 0 1
MONTH_NO 1 12
1.2执行查询
1.2.1单一查询
select count(*) from audience where month_no = 12;
We examine figures from the view user_tab_col_statistics (or user_tab_columns) and check view user_tab_histograms to find the following details:
• user_tab_col_statistics.num_distinct is equal to 12.
• user_tab_histograms shows just the low (1) and high (12) values, so assume the values are evenly spread.
• user_tab_col_statistics.density is equal to 1/12; one month gives one twelfth of the data.
• month_no is equal to 12,single column,equality,so user_tab_col_statistics.density is usable.
• 12 is between the low_value and high_value of user_tab_col_statistics.
• user_tab_col_statistics.num_nulls is equal to 0 (everyone was born some time—
the computer has to consider it, even though it was intuitively obvious to you).
• user_tables.num_rows is equal to 1,200.
• The answer is one-twelfth of 1,200, which is 100.
the query for December birthdays produces the following execution plan:
Execution Plan (9.2.0.6)
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=2 Card=1 Bytes=3)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (FULL) OF 'AUDIENCE' (Cost=2 Card=100 Bytes=300)
单一查询的Cardinality计算方式如下:
cardinality = num_rows * density = 1200 * 0 .083333333 = 100或者:
cardinality = num_rows / num_distinct = 1200 / 12 = 1001.2.2数据中包含NULL值的查询
用update audience set month_no = null where mod(id,10) = 0;对上面数据做如下改动
user_tab_col_statistics.num_nulls is equal to 120 -- new data item
统计信息如下
COLUMN_NAME NUM_DISTINCT NUM_NULLS DENSITY LOW HIGH
-------------------- ------------ ---------- ---------- ---------- ----------
ID 1200 0 .000833333 1 1200
MONTH_NO 12 120 .083333333 1 12
COLUMN_NAME ENDPOINT_NUMBER ENDPOINT_VALUE
-------------------- --------------- --------------
ID 0 1
ID 1 1200
MONTH_NO 0 1
MONTH_NO 1 12
;:
因为NULL值的变更,The equivalent “argument” from the optimizer is :
• Base selectivity = 1/12 (from density or from 1/num_distinct)
• num_nulls = 120
• num_rows = 1200
• Adjusted selectivity = Base selectivity * (num_rows - num_nulls) / num_rows
• Adjusted selectivity = (1/12) * ((1200 - 120)/1200) = 0.075
• Adjusted cardinality = Adjusted selectivity * num_rows
• Adjusted cardinality = 0.075 * 1200 = 90
the query for December birthdays execution plan变成如下所示:
Execution Plan (9.2.0.6)
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=2 Card=1 Bytes=3)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (FULL) OF 'AUDIENCE' (Cost=2 Card=90 Bytes=270)
----------------------------------------------------------
0 SELECT STATEMENT Optimizer=ALL_ROWS (Cost=2 Card=1 Bytes=3)
1 0 SORT (AGGREGATE)
2 1 TABLE ACCESS (FULL) OF 'AUDIENCE' (Cost=2 Card=90 Bytes=270)
1.2.3用in查询
在没有null值的数据上进行如下查询
select count(*) from audience where month_no in (1,2);--Card=200
select count(*) from audience where month_no in (1,2,3);--Card=300
select count(*) from audience where month_no in (1,2,3,4);--Card=400
select count(*) from audience where month_no in (1,2,3,4,5);--Card=500
select count(*) from audience where month_no in (1,2,3,4,5,6);--Card=600
select count(*) from audience where month_no in (1,2,3,4,5,6,7);--Card=700
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8);--Card=800
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9);--Card=900
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10);--Card=1000
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11);--Card=1100
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11,12);--Card=1200
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11,12,13);--Card=1200
select count(*) from audience where month_no in (1,2,3);--Card=300
select count(*) from audience where month_no in (1,2,3,4);--Card=400
select count(*) from audience where month_no in (1,2,3,4,5);--Card=500
select count(*) from audience where month_no in (1,2,3,4,5,6);--Card=600
select count(*) from audience where month_no in (1,2,3,4,5,6,7);--Card=700
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8);--Card=800
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9);--Card=900
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10);--Card=1000
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11);--Card=1100
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11,12);--Card=1200
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11,12,13);--Card=1200
实际计算结果应该是1291(如何计算下面分析),但是因为这个数超过了num_rows = 1200,所以最后得出Card=1200
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11,12,13,14);--Card=1200
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11,12,13,14);--Card=1200
同上。
select count(*) from audience where month_no in (
1, 2, 3, 4, 5, 6, 7, 8, 9,10,
11,12,13,14,15,16,17,18,19,20,
21,22,23,24,25,26,27,28,29,30
);--Card=1200
select count(*) from audience where month_no in (
1, 2, 3, 4, 5, 6, 7, 8, 9,10,
11,12,13,14,15,16,17,18,19,20,
21,22,23,24,25,26,27,28,29,30
);--Card=1200
同上。
1.2.4Card的计算方法
The calculation that Oracle uses can be represented graphically, as shown in Figure 3-1 the further away you get from the known low/high range, the less likely you are to find data. Oracle uses a straight-line decay to predict the variation, decaying to zero when you exceed the range by the difference between low and high.
We have a range of 11—running from 1 to 12. The right-hand edge hits zero at month_no = 23 (high value + 11); the left-hand edge hits zero at month_no = -10 (low value – 11).
We have a range of 11—running from 1 to 12. The right-hand edge hits zero at month_no = 23 (high value + 11); the left-hand edge hits zero at month_no = -10 (low value – 11).
根据上面的图,1~12之间是Card=100的常数,-10~1之间是斜率为100/11的直线,12~23之间是斜率为-100/11的直线。Card值都是根据这个图和统计信息来计算的。
select count(*) from audience where month_no in (1,2,3,4,5,6,7,8,9,10,11,12,13);
的Card值计算方法如下:
100+100+100+100+100+100+100+100+100+100+100+100+(100-100/11)=1291(四舍五入)1.2.5范围查询
1. month_no > 8 --Card=436
month_no > 8: This is an unbounded (no limit at one end), open (8 is excluded) range.
• Selectivity = (high_value – limit) / (high_value – low_value) = (12 – 8) / (12 – 1) = 4/11
• Cardinality = 1,200 * 4 / 11 = 436.363636 ... hence 437 or 436 depending on whether your version of Oracle rounds, or rounds up.
• Selectivity = (high_value – limit) / (high_value – low_value) = (12 – 8) / (12 – 1) = 4/11
• Cardinality = 1,200 * 4 / 11 = 436.363636 ... hence 437 or 436 depending on whether your version of Oracle rounds, or rounds up.
2. month_no >= 8 --Card=536
month_no >= 8: This is an unbounded, closed (includes 8) range, so adjust for closure. The adjustment is made by including the rows for the closing value—in other words, adding 1/num_distinct. (Again, 8i seems to use the density rather than 1/num_distinct, but this isn’t something you would notice unless you’ve been hacking the statistics or have a histogram in place.)
• Selectivity = (high_value – limit) / (high_value – low_value) + 1/num_distinct = 4/11 + 1/12
• Cardinality = 1,200 * (4/11 + 1/12) = 536.363636 ... 四舍五入
• Selectivity = (high_value – limit) / (high_value – low_value) + 1/num_distinct = 4/11 + 1/12
• Cardinality = 1,200 * (4/11 + 1/12) = 536.363636 ... 四舍五入
3. month_no < 8 --Card=764
4. month_no <= 8 --Card=864
5. month_no between 6 and 9 --Card=527
month_no between 6 and 9: Both are bounded (limited at both ends), closed ranges—the between clause of case 5 is just a convenient shorthand for the two separate predicates of case 6. This gives us two closing values, so two occurrences of the adjustment.
Selectivity = (9 – 6) / (12 – 1) + 1/12 + 1/12 ( >= , <= )
Selectivity = (9 – 6) / (12 – 1) + 1/12 + 1/12 ( >= , <= )
6. month_no >= 6 and month_no <= 9 --Card=527
7. month_no >= 6 and month_no < 9 --Card=427
8. month_no > 6 and month_no <= 9 --Card=427
9. month_no > 6 and month_no < 9 --Card=327
10. month_no > :b1 --Card=101
绑定变量在11g以前都是按照5%来计算的,但是11g好像做了特殊处理了,具体方法不知。
11. month_no >= :b1 --Card=101
12. month_no < :b1 --Card=101
13. month_no <= :b1 --Card=101
14. month_no between :b1 and :b2 --Card=101
15. month_no >= :b1 and month_no <= :b2 --Card=101
16。 month_no >= :b1 and month_no < :b2 --Card=101
17. month_no > :b1 and month_no < :b2 --Card=101
18. month_no > :b1 and month_no <= :b2 --Card=101
19. month_no > 12 --Card=100
13. month_no <= :b1 --Card=101
14. month_no between :b1 and :b2 --Card=101
15. month_no >= :b1 and month_no <= :b2 --Card=101
16。 month_no >= :b1 and month_no < :b2 --Card=101
17. month_no > :b1 and month_no < :b2 --Card=101
18. month_no > :b1 and month_no <= :b2 --Card=101
19. month_no > 12 --Card=100
如果按照No1的计算方法,Card=(12-12)/11*1200=0
根据实验,个人得出如下结论:
Oracle11g对边界值做了特殊处理,边界值的时候的计算公式如下
Card=1/distincts*num_rows=1/12*1200=100
20. month_no >= 12 --Card=100
边界值的时候>和>=没有区别
21. month_no >12 or month_no <12 --Card=1108
(1/12+(12-1)/12 - 1/12*(12-1)/12)*1200= 1108
22. month_no between 25 and 30 --Card=100
The optimizer detects that the query falls outside the known range, and seems to return a selectivity, hence cardinality that would be correct for column = constant.
23. month_no > 8 or month_no <= 8 --Card=1200
11g似乎对集合的处理更加巧妙了,11g以前都是按照上面No5的处理方式来处理的。
11g能够辨识出条件的组合是不是整个集合。
24. month_no > 8 or month_no <= 9 --Card=1200
1.2.6总结
Those of you who are familiar with probability theory will have recognized the three formulae as the standard formulae for calculating the combined probabilities of independent events:
• Probability(A AND B occur) = Probability(A occurs) * Probability(B occurs)
• Probability(A OR B occurs) = Probability(A occurs) + Probability(B occurs) – Probability(A AND B occur)
• Probability(NOT(A occurs)) = 1 – Probability(A occurs)
没有评论:
发表评论