Everyone knows that histograms are supposed to be used for data where a column has a few special values that appear far more frequently than the rest; and everyone knows that you should only build histograms on indexed columns. These beliefs are so misleading that I’m going to ignore them for the moment and start my explanation of histograms with an example that has no extreme values and no indexes.
1.1 Test Data
create table t1
as
with kilo_row as (
select /*+ materialize */
rownum
from all_objects
where rownum <= 1000
)
select
trunc(7000 * dbms_random.normal) normal
from
kilo_row k1,
kilo_row k2
where
rownum <= 1000000
;
select
tenth tenth,
min(normal) low_val,
max(normal) high_val,
max(normal) - min(normal) width,
round(100000 / (max(normal) - min(normal)),2) height
from (
select
normal,
ntile(10) over (order by normal) tenth
from t1
)
group by tenth
order by tenth
;
TENTH LOW_VAL HIGH_VAL WIDTH HEIGHT
---------- ---------- ---------- ---------- ----------
1 -32003 -8966 23037 4.34
2 -8966 -5883 3083 32.44
3 -5883 -3659 2224 44.96
4 -3659 -1761 1898 52.69
5 -1761 17 1778 56.24
6 17 1792 1775 56.34
7 1792 3678 1886 53.02
8 3678 5897 2219 45.07
9 5897 8974 3077 32.5
10 8974 34660 25686 3.89
begin
dbms_stats.gather_table_stats(
user,
't1',
cascade => true,
estimate_percent => null,
method_opt => 'for columns normal size 10'
);
end;
/
select
rownum tenth,
prev low_val,
curr high_val,
curr - prev width,
round(100000 / (curr - prev) , 2) height
from
(
select
endpoint_value curr,
lag(endpoint_value,1) over (
order by endpoint_number
) prev
from
user_tab_histograms
where
table_name = 'T1'
and column_name = 'NORMAL'
)
where
prev is not null
order by
curr
;
The results are the same as the ntile output above.
From these results we can see that the data in T1 is not uniformly distributed: the density varies from band to band. This is exactly the situation where a histogram helps. Without one, the optimizer assumes a uniform distribution, so its cardinality and cost estimates for SQL plans can be badly wrong, leading to performance problems.
2. Generic Histograms
You can actually use two different strategies: one for data sets with only a few (fewer than 255) distinct values, and one for data sets with lots of distinct values. Oracle calls the former a frequency histogram and the latter a height balanced histogram.
After generating a histogram for a column, you will usually find that the density value reported in user_tab_columns (et al.) is no longer equal to 1/num_distinct.
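A quick way to see this (a minimal sketch of my own, run after the gather shown in the previous section):
select
    num_distinct,
    density,
    round(1 / num_distinct, 8) one_over_ndv    -- no longer equal to density
from
    user_tab_columns
where
    table_name = 'T1'
and column_name = 'NORMAL'
;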
3. When Oracle Ignores Histograms
It is quite expensive to generate histograms, so you should think very carefully about doing it. Oracle doesn’t use them very much, and often won’t get any benefit from them. In fact, there are several cases where Oracle ignores histograms when you might think they would be used.
3.1 Histograms and Joins
Oracle only uses histograms fully when there are highly visible input values—in other words, in cases where you have predicates of the form column operator constant (although the constant might come from peeking at a bind variable). Think about this carefully, and you realize that Oracle may not be able to make much use of histogram information to help it optimize joins.
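For example (a sketch only; t2 is a hypothetical second table), the optimizer has a visible input value to look up in the first query, but nothing to look up for the join predicate in the second:
select count(*)
from   t1
where  skew = 5                 -- column = constant: histogram usable
;
select count(*)
from   t1, t2
where  t1.skew = t2.skew        -- no single input value: histogram of little use
;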
3.2 Histograms and Distributed Queries
Oracle keeps on getting better at handling distributed queries—but even in Oracle 10g, the optimizer doesn’t try to pull histograms from the remote site(s) to improve its execution plan, and it’s not just a question of ignoring histograms on join columns. Even when you have remote columns compared to literal constants, Oracle does not attempt to use the histogram. Again, it falls back to num_distinct (or density, depending on version).
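For example (a sketch only; remote_tab and the database link remote_db are hypothetical), even this literal predicate would be costed from the remote num_distinct/density rather than from a histogram:
select count(*)
from   remote_tab@remote_db
where  col1 = 100               -- literal constant, remote histogram still ignored
;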
4. Frequency Histograms
The value 1 appears once, 2 appears twice, and so on up to the value 80, which appears 80 times, for a total of 3,240 rows (the substitution variable &m_demo_size in the script below is set to 80).
create table generator as
select
rownum id
from all_objects
where rownum <= 2000
;
drop table t1;

create table t1 (
skew not null,
padding
)
as
select
/*+ ordered use_nl(v2) */
v1.id,
rpad('x',400)
from
generator v1,
generator v2
where
v1.id <= &m_demo_size
and v2.id <= &m_demo_size
and v2.id <= v1.id
order by
v2.id,v1.id
;
create index t1_i1 on t1(skew);
Test Case 1
begin
dbms_stats.gather_table_stats(
user,
't1',
cascade => true,
estimate_percent => null,
method_opt => 'for all columns size 1'
);
end;
/
select
num_distinct, density, num_Buckets
from
user_tab_columns
where
table_name = 'T1'
and column_name = 'SKEW'
;
NUM_DISTINCT DENSITY NUM_BUCKETS
------------ ---------- -----------
80 .0125 1
select skew, padding
from t1
where skew = 5
;
Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013
----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
----------------------------------------------------------
| 0 | SELECT STATEMENT | | 41 | 16564 | 32 |
|* 1 | TABLE ACCESS FULL| T1 | 41 | 16564 | 32 |
----------------------------------------------------------
Cardinality = 0.0125 * 3240 = 41
Test Case 2
begin
dbms_stats.gather_table_stats(
user,
't1',
cascade => true,
estimate_percent => null,
method_opt => 'for all columns size 80'
);
end;
/
select
num_distinct, density, num_Buckets
from
user_tab_columns
where
table_name = 'T1'
and column_name = 'SKEW'
;
NUM_DISTINCT DENSITY NUM_BUCKETS
------------ ---------- -----------
80 .000154321 80
density = 1 / (2 * num_rows) = 1 / (2 * 3,240) = 0.000154321
select
endpoint_number, endpoint_value
from
user_tab_histograms
where
column_name = 'SKEW'
and table_name = 'T1'
order by
endpoint_number
;
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
1 1
3 2
6 3
10 4
...
3160 79
3240 80
select
endpoint_value row_value,
curr_num - nvl(prev_num, 0) row_count
from (
select
endpoint_value,
endpoint_number curr_num,
lag(endpoint_number,1) over (
order by endpoint_number
) prev_num
from
user_tab_histograms
where
column_name = 'SKEW'
and table_name = 'T1'
)
order by
endpoint_value
;
ROW_VALUE ROW_COUNT
---------- ----------
1 1
2 2
3 3
4 4
5 5
...
80 80
select
skew, padding
from
t1
where
skew = 5
;
Execution Plan
----------------------------------------------------------
Plan hash value: 1429545322
---------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
---------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5 | 2020 | 6 |
| 1 | TABLE ACCESS BY INDEX ROWID| T1 | 5 | 2020 | 6 |
|* 2 | INDEX RANGE SCAN | T1_I1 | 5 | | 1 |
---------------------------------------------------------------------
Based on the histogram statistics above, the cardinality for skew = 5 is 5.
select
skew, padding
from
t1
where
skew = 70
;
Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013
----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
----------------------------------------------------------
| 0 | SELECT STATEMENT | | 70 | 28280 | 32 |
|* 1 | TABLE ACCESS FULL| T1 | 70 | 28280 | 32 |
----------------------------------------------------------
Based on the histogram statistics above, the cardinality for skew = 70 is 70.
Predicate                     Description                         CBO    Human
----------------------------  ----------------------------------  -----  -----
skew = 40                     Column = constant                      40     40
skew = 40.5                   Column = nonexistent, but in range      1      0
skew between 21 and 24        Between range with mapped values       90     90
skew between 20.5 and 24.5    Between range with mapped values       90     90
skew between 1 and 2          Between range at extremes               3      3
skew between 79 and 80        Between range at extremes             159    159
skew > 4 and skew < 8         Greater than/less than range           18     18
skew = -10                    Below low value                         1      0
skew = 100                    Above high value                        1      0
skew between -5 and -3        Range below low value                   1      0
skew between 92 and 94        Range above high value                  1      0
skew between 79 and 82        Range crossing boundary               159    159
skew = :b1                    Column = :bind                         41    ???
skew between :b1 and :b2      Column between :bind1 and :bind2        8    ???
In the two examples with bind variables, we see that the CBO has used the standard
0.25% for the range, and when we check column = :bind, we find that 41 comes from
num_rows / num_distinct.
Finally, looking carefully at the density, we discover (and confirm with a few more test data sets) that density = 1 / (2 * num_rows)—which is probably why we end up seeing a cardinality of 1 whenever we go outside the low/high range, or ask for a value that is not in the histogram.
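We can cross-check that formula against the dictionary (my own sketch, using the objects created above):
select
    c.density,
    1 / (2 * t.num_rows) one_over_2n    -- 1 / (2 * 3,240) = 0.000154321
from
    user_tables      t,
    user_tab_columns c
where
    t.table_name  = 'T1'
and c.table_name  = t.table_name
and c.column_name = 'SKEW'
;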
- Frequency histograms have no impact on predicates with bind variables (unless the binds are peeked). The selectivity of column = :bind is still 1 / num_distinct, and the selectivity of a range predicate is still 5% (unbounded) or 0.25% (bounded), as the sketch below illustrates.
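A sketch of the bind-variable case (SQL*Plus syntax; it assumes bind peeking does not alter the picture):
variable b1 number
exec :b1 := 5

select  skew, padding
from    t1
where   skew = :b1              -- estimated as 3,240 / 80 = 41, not the true 5
;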
5. “Height Balanced” Histograms
5.1 A Histogram with a Bucket for Every Value
begin
dbms_stats.gather_table_stats(
user,
't1',
cascade => true,
estimate_percent => null,
method_opt => 'for all columns size 80'
);
end;
/
select
num_distinct, density, num_Buckets
from
user_tab_columns
where
table_name = 'T1'
and column_name = 'SKEW'
;
NUM_DISTINCT DENSITY NUM_BUCKETS
------------ ---------- -----------
80 .000154321 80
select
endpoint_number, endpoint_value
from
user_tab_histograms
where
column_name = 'SKEW'
and table_name = 'T1'
order by
endpoint_number
;
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
1 1
3 2
6 3
10 4
...
3160 79
3240 80
select
count(*)
from t1
where skew = 77
;
Execution Plan
----------------------------------------------------------
Plan hash value: 2432955788
-----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
-----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 1 |
| 1 | SORT AGGREGATE | | 1 | 3 | |
|* 2 | INDEX RANGE SCAN| T1_I1 | 77 | 231 | 1 |
-----------------------------------------------------------
Based on the histogram statistics above, the cardinality for skew = 77 is 77.
select
count(*)
from t1
where skew = 75
;
Execution Plan
----------------------------------------------------------
Plan hash value: 2432955788
-----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
-----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 1 |
| 1 | SORT AGGREGATE | | 1 | 3 | |
|* 2 | INDEX RANGE SCAN| T1_I1 | 75 | 225 | 1 |
-----------------------------------------------------------
Based on the histogram statistics above, the cardinality for skew = 75 is 75.
5.2 A Histogram with Fewer Buckets than num_distinct
begin
dbms_stats.gather_table_stats(
user,
't1',
cascade => true,
estimate_percent => null,
method_opt => 'for all columns size 75'
);
end;
/
select
num_distinct, density, num_Buckets
from
user_tab_columns
where
table_name = 'T1'
and column_name = 'SKEW'
;
NUM_DISTINCT DENSITY NUM_BUCKETS
------------ ---------- -----------
80 .013973812 75
select
endpoint_number, endpoint_value
from
user_tab_histograms
where
column_name = 'SKEW'
and table_name = 'T1'
order by
endpoint_number
;
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
...
46 63
48 64
49 65
51 66
52 67
54 68
56 69
57 70
59 71
60 72
62 73
64 74
66 75
67 76
69 77
71 78
6. The Arithmetic
When the optimizer takes advantage of a height balanced histogram to work out cardinality, it adopts one of three major strategies.
- Strategy 1 for calculating selectivity/cardinality: For column = constant, the constant may be a popular value. A popular value is one that fills at least one bucket, and therefore shows up as a gap in the endpoint_number sequence in the user_tab_histograms view. If Oracle spots such a value, it does its arithmetic by buckets. For example, in the preceding output, the value 77 appears at endpoint 69 and the previous value (76) appears at endpoint 67; consequently Oracle considers 77 to be a popular value that spans 2 buckets out of 75. The selectivity of 77 is therefore 2/75, and the cardinality is 3,240 * 2/75 = 86.4.
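You can list the popular values directly by looking for those gaps (a sketch of my own against the dictionary view):
select  popular_value, buckets_spanned
from    (
        select
            endpoint_value popular_value,
            endpoint_number
            - lag(endpoint_number) over (order by endpoint_number) buckets_spanned
        from  user_tab_histograms
        where table_name  = 'T1'
        and   column_name = 'SKEW'
        )
where   buckets_spanned > 1
order by
        popular_value
;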
select
count(*)
from t1
where skew = 77
;
Execution Plan
----------------------------------------------------------
Plan hash value: 2432955788
-----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
-----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 1 |
| 1 | SORT AGGREGATE | | 1 | 3 | |
|* 2 | INDEX RANGE SCAN| T1_I1 | 86 | 258 | 1 |
-----------------------------------------------------------
- Strategy 2 for calculating selectivity/cardinality: We have already seen that 70 is not a popular value (its endpoint_number, 57, is just one more than the 56 recorded for the value 69), so what happens when we execute a query to select all rows for that value? Oracle uses the density (not 1/num_distinct, note). We have the density recorded as 0.013973812, so we expect a cardinality of 3,240 * 0.013973812 = 45.27.
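The query itself (not shown above) follows the same pattern as the earlier tests:
select
    count(*)
from t1
where skew = 70
;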
Execution Plan
----------------------------------------------------------
Plan hash value: 2432955788
-----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
-----------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 3 | 1 |
| 1 | SORT AGGREGATE | | 1 | 3 | |
|* 2 | INDEX RANGE SCAN| T1_I1 | 29 | 87 | 1 |
-----------------------------------------------------------
The actual cardinality shown is 29, which may be a bug (or a change in the arithmetic) in 11g; this remains to be tested on 12c.
In purely descriptive terms, the density is as follows:
density = (sum of the squares of the frequencies of the nonpopular values)
          / (number of nonnull rows * number of nonpopular nonnull rows)
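That description can be checked with a query (a sketch of my own; it assumes "nonpopular" means any value that does not span more than one bucket, and the result should come close to the 0.013973812 recorded above):
with
hist as (
    select
        endpoint_value,
        endpoint_number
        - lag(endpoint_number, 1, 0) over (order by endpoint_number) buckets
    from  user_tab_histograms
    where table_name  = 'T1'
    and   column_name = 'SKEW'
),
freq as (
    select skew, count(*) cnt
    from   t1
    group by skew
)
select
    sum(case when nvl(h.buckets, 1) <= 1 then f.cnt * f.cnt end)
    / ( sum(f.cnt)
      * sum(case when nvl(h.buckets, 1) <= 1 then f.cnt end) ) density
from
    freq f, hist h
where
    f.skew = h.endpoint_value (+)
;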
- Strategy 3 for calculating selectivity/cardinality: After that brief digression, we can investigate the issue of range scans.
The test cases for Strategy 3 follow.
Test Case 1
The script generates 10,000 rows using a scaled normal distribution centered on the value 3,000, ranging from -5,000 to 11,000. Superimposed on this distribution, I have 500 rows each of the values 500, 1,000, 1,500, and so on up to 10,000—a list of 20 popular values. After analyzing this table with 250 buckets (80 rows per bucket, 20,000 rows in total), I get a histogram that shows the spikes from my 20 popular values. The number of distinct values is 5,626 and the column density is 0.000119361 (a convincing drop from 1 / 5,626 = 0.000177746). My first test case is a range scan across buckets with no popular values: n1 between 100 and 200. If we check the histogram, we find that the required values fall across three buckets with endpoints 17, 117, and 251.
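The generation script is not reproduced in this excerpt; a sketch reconstructed from the description (the table and column names t1 and n1 are taken from the plans below, so the earlier t1 would be dropped first) might look like this:
create table t1
as
with generator as (
    select rownum id
    from   all_objects
    where  rownum <= 10000
)
select
    trunc(3000 + 2000 * dbms_random.normal) n1    -- scaled normal centred on 3,000
from
    generator
union all
select
    500 * (1 + mod(rownum - 1, 20))               -- 500 rows each of 500, 1,000, ... 10,000
from
    generator
;

begin
    dbms_stats.gather_table_stats(
        user, 't1',
        estimate_percent => null,
        method_opt       => 'for columns n1 size 250'
    );
end;
/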
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
8 -120
9 17
10 117
11 251
12 357
Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013
----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
----------------------------------------------------------
| 0 | SELECT STATEMENT | | 63 | 945 | 60 |
|* 1 | TABLE ACCESS FULL| T1 | 63 | 945 | 60 |
----------------------------------------------------------
Selectivity = (required range) / (high value - low value) + 2 * density =
(200 - 117) / (251 - 117) + (117 - 100) / (117 - 17) + 2 * 0.000177746 =
0.619403 + 0.17 + 0.000355486 =
0.789758
Cardinality = selectivity * number of rows IN A BUCKET =
0.789758 * 80 = 63.18, which rounds to the 63 reported in the plan above.
Test Case 2
The second test case is, of course, a range scan across buckets with a popular value: n1 between 400 and 600. (One of my spikes was at n1 = 500.) We need to check up on a little section from the histogram that will be relevant:
ENDPOINT_NUMBER ENDPOINT_VALUE
--------------- --------------
12 357
13 450
19 500 ***
20 520
21 598
22 670
Note that 500 is a (very) popular value—with six buckets allocated from the histogram. So we expect to account for at least six buckets plus the rest of the range. Checking carefully, we see that the range 400 to 600 will extend from buckets 12 to 22—broken down as follows:
• A selection from 357 to 450
• All buckets from 450 to 598
• A selection from 598 to 670
So we have eight whole buckets (endpoint_numbers 13 to 21) plus
(450 - 400) / (450 - 357) + (600 - 598) / (670 - 598) + 2 * 0.000177746 =
50 / 93 + 2 / 72 + 0.000355486 =
0.537634 + 0.0277778 + 0.000355486 =
0.565768
Remembering to add the 8 for the complete buckets, we get a cardinality of
Cardinality = selectivity * number of rows IN A BUCKET = 8.565768 * 80 = 685.3
Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013
----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
----------------------------------------------------------
| 0 | SELECT STATEMENT | | 685 | 10275 | 60 |
|* 1 | TABLE ACCESS FULL| T1 | 685 | 10275 | 60 |
----------------------------------------------------------
7. Data Problems Revisited
7.1 Daft Datatypes
drop table t1;

create table t1 (
d1 date,
n1 number(8),
v1 varchar2(8)
)
;
insert into t1
select
d1,
to_number(to_char(d1,'yyyymmdd')),
to_char(d1,'yyyymmdd')
from (
select
to_date('31-Dec-1999','dd-Mon-yyyy') + rownum d1
from all_objects
where
rownum <= 1827
)
;
As we saw in the earlier discussion of selectivity issues, a query with a predicate of
where n1 between 20021230 and 20030105
produced a cardinality of 396 in Oracle without a histogram. (The standard range arithmetic treats the numeric range as continuous: 1,827 * ((20030105 - 20021230) / (20041231 - 20000101) + 2 / 1,827) comes to roughly 396, even though the predicate covers only seven days of data.)
The following repeats the test with a histogram in place.
begin
dbms_stats.gather_table_stats(
ownname => user,
tabname => 't1',
cascade => true,
estimate_percent => null,
method_opt =>'for all columns size 120'
);
end;
/
select
rownum bucket,
prev low_val,
curr high_val,
curr - prev width,
round( (1827/120) / (curr - prev) , 4) height
from
(
select
endpoint_value curr,
lag(endpoint_value,1) over (
order by endpoint_number
) prev
from
user_tab_histograms
where
table_name = 'T1'
and column_name = 'N1'
)
where
prev is not null
order by
curr
;
BUCKET LOW_VAL HIGH_VAL WIDTH HEIGHT
---------- ---------- ---------- ---------- ----------
68 20021028 20021112 84 .1813
69 20021112 20021127 15 1.015
70 20021127 20021212 85 .1791
71 20021212 20021227 15 1.015
72 20021227 20030111 8884 .0017
73 20030111 20030126 15 1.015
74 20030126 20030210 84 .1813
75 20030210 20030225 15 1.015
76 20030225 20030312 87 .175
77 20030312 20030327 15 1.015
78 20030327 20030411 84 .1813
79 20030411 20030426 15 1.015
80 20030426 20030511 85 .1791
81 20030511 20030526 15 1.015
82 20030526 20030610 84 .1813
83 20030610 20030625 15 1.015
84 20030625 20030710 85 .1791
select
    num_distinct, density, num_Buckets
from
    user_tab_columns
where
    table_name = 'T1'
and column_name = 'N1'
;
NUM_DISTINCT DENSITY NUM_BUCKETS
------------ ---------- -----------
1827 .000547345 120
Execution Plan
----------------------------------------------------------
Plan hash value: 3617692013
----------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost |
----------------------------------------------------------
| 0 | SELECT STATEMENT | | 15 | 345 | 4 |
|* 1 | TABLE ACCESS FULL| T1 | 15 | 345 | 4 |
----------------------------------------------------------
selectivity =
('required range' / 'total range of bucket' ) + 2 * density =
(20030105 - 20021230) / (20030111 - 20021227) + 2 * 0.000547345 =
8875 / 8884 + 0.00109469 =
1.00008
But this is the selectivity that has to be applied to the number of rows in the buckets involved—not to the entire table—and there are 1,827 / 120 ≈ 15 rows per bucket: round(15 * 1.00008) = 15, as required.
7.2 Dangerous Defaults
Same as above: both cases exhibit the histogram problem shown in Figure 7-2.