Java Freeman: The Clustering Factor

一、BaseLine Example
1.1测试数据准备
create table t1(
    date_ord    date        constraint t1_dto_nn    not null,
    seq_ord        number(6)    constraint t1_sqo_nn    not null,
    small_vc    varchar2(10)
)
pctfree 90
pctused 10
;

drop sequence t1_seq;
create sequence t1_seq;

create or replace procedure t1_load(i_tag varchar2) as

m_date    date;

begin
    dbms_output.put_line(
        dbms_lock.request(
            1,dbms_lock.s_mode, release_on_commit => true
        )
    );
    commit;
    for i in 0..25 loop
        m_date := trunc(sysdate) + i;

        for j in 1..200 loop
            insert into t1 values(
                m_date,
                t1_seq.nextval,
                i_tag || j
            );
            commit;
            dbms_lock.sleep(0.01);
        end loop;
    end loop;
end;
/

begin
    dbms_output.put_line(
        dbms_lock.request(
            1,dbms_lock.x_mode,
            release_on_commit=>true
        )
    );
end;
/

prompt
prompt    From five different sessions, execute a call to procedure
prompt    t1_load with a different tag for each session, e.g.
prompt        execute t1_LOAD('a')
prompt
prompt    These will all suspend, waiting to acquire a share lock that
prompt    this progam is holding in exclusive mode. When all five sessions
prompt    have called the procedure, press return in this screen to allow them
prompt    to continue

accept x

commit;

create index t1_i1 on t1(date_ord, seq_ord);

1.2统计信息收集
rem
rem    Report how many times five processes have collided on a block
rem    The first column is the number of different processes that
rem    have insertde into a block, the second is the number of blocks
rem    with that many hits.
rem
select    ct, count(*)
from
    (
    select block, count(*) ct
    from
        (
        select
            distinct dbms_rowid.rowid_block_number(rowid) block,
            substr(small_vc,1,1)
        from t1
        )
    group by block
    )
group by ct
;
        CT   COUNT(*)
---------- ----------
         5        738   有738个数据块是5个session一起插入数据的
         4          2   有2个数据块是4个session一起插入数据的
         3          1   有2个数据块是3个session一起插入数据的

begin
    dbms_stats.gather_table_stats(
        user,
        't1',
        cascade => true,
        estimate_percent => null,
        method_opt => 'for all columns size 1'
    );
end;
/

select
    blocks,
    num_rows
from
    user_tables
where
    table_name = 'T1';

    BLOCKS   NUM_ROWS
---------- ----------
       744      26000

前面的查询结果来看，真正用于数据存储的使用了738+2+1=741，剩下的3个数据块用于甚么去了呢？待查？

select
    index_name, blevel, leaf_blocks, clustering_factor
from
    user_indexes
where
    table_name = 'T1'
;

INDEX_NAME               BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR
-------------------- ---------- ----------- -----------------
T1_I1                         1          86               840

1.3测试
实例1
select
    count(small_vc)
from
    t1
where
    date_ord = trunc(sysdate) + 7
;

Execution Plan
----------------------------------------------------------
Plan hash value: 269862921

----------------------------------------------------------------------
| Id | Operation                    | Name | Rows | Bytes | Cost |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |     1 |    13 |    38 |
|   1 | SORT AGGREGATE              |       |     1 |    13 |       |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1    | 1000 | 13000 |    38 |
|* 3 |    INDEX RANGE SCAN          | T1_I1 | 1000 |       |     5 |
----------------------------------------------------------------------
cost =
        blevel +
        ceil(effective index selectivity * leaf_blocks) +
        ceil(effective table selectivity * clustering_factor)
In this case, we are after 1 day out of 26—a selectivity of 3.846% or 0.03846—and the two selectivities are identical. Putting these figures into the formula:
cost =
        1 +
        ceil(0.03846 * 86) +
        ceil(0.03846 * 840)
        = 1 + 4 + 33 = 38

select
        /*+ full(t1) */
        rowid, date_ord, seq_ord, small_vc
from
        t1
where
        rownum <= 10
;

ROWID              DATE_ORD     SEQ_ORD SMALL_VC
------------------ --------- ---------- ----------
AAAMJHAAJAAAAAKAAA 26-JUL-11          1 A1
AAAMJHAAJAAAAAKAAB 26-JUL-11          2 B1
AAAMJHAAJAAAAAKAAC 26-JUL-11          3 C1
AAAMJHAAJAAAAAKAAD 26-JUL-11          4 A2
AAAMJHAAJAAAAAKAAE 26-JUL-11          5 D1
AAAMJHAAJAAAAAKAAF 26-JUL-11          6 E1
AAAMJHAAJAAAAAKAAG 26-JUL-11          7 B2
AAAMJHAAJAAAAAKAAH 26-JUL-11          8 D2
AAAMJHAAJAAAAAKAAI 26-JUL-11          9 B3
AAAMJHAAJAAAAAKAAJ 26-JUL-11         10 E2

Remember that the extended rowid is made up of the following:
• object_id First six letters (AAAMJH)
• Relative file_id Next three letters (AAJ)
• Block within file Next six letters (AAAAAK)
• Row within block Last three letters (AAA, AAB, AAC ...)
All these rows are in the same block (AAAAAK). In my test run, I populated the column small_vc with a tag that could be used to identify the process that inserted the row. All five of our processes were busy hitting the same table block at the same time. In a very busy system (in particular, one with a high degree of concurrency), we might have seen lots of buffer busy waits for blocks of class data block for whichever one block was the current focus of all the inserts.
How do we address this issue? Simple: we read the advice in the Oracle Performance Tuning Guide and Reference and (for older versions of Oracle particularly) realize that we should have created the table with multiple freelists. In this case, because we expect the typical degree of concurrency to be 5, we might go to exactly that limit, and create the table with the extra clause:
storage (freelists 5)

二、Reducing Table Contention (Multiple Freelists)
把1.1测试数据准备部分的表结构定义改成如下定义。
create table t1(
    date_ord    date        constraint t1_dto_nn    not null,
    seq_ord        number(6)    constraint t1_sqo_nn    not null,
    small_vc    varchar2(10)
)
pctfree 90
pctused 10
storage (freelists 5)
;

        CT   COUNT(*)
---------- ----------
         1        298
         3        445

    BLOCKS   NUM_ROWS
---------- ----------
       754      26000

INDEX_NAME               BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR
-------------------- ---------- ----------- -----------------
T1_I1                         1          86             16330

Execution Plan
----------------------------------------------------------
Plan hash value: 3724264953

-----------------------------------------------------------
| Id | Operation          | Name | Rows | Bytes | Cost |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |    13 |   116 |
|   1 | SORT AGGREGATE    |      |     1 |    13 |       |
|* 2 |   TABLE ACCESS FULL| T1   | 1000 | 13000 |   116 |
-----------------------------------------------------------
全表查询cost=754/6.41=118
如果走index，按照上面的公式
cost =
        1 +
        ceil(0.03846 * 86) +
        ceil(0.03846 * 16330)
        = 1 + 4 + 629 = 634
所以Oracle选择使用全表扫描。

三、Reducing Leaf Block Contention (Reverse Key Indexes)
利用1.1测试数据准备部分创建数据，之后使用下面的index定义创建reverce index。
alter index t1_i1 rebuild reverse;

A reverse key index operates by reversing the byte order of each column of the index before inserting the resulting value into the index structure. The effect of this is to turn sequential values into index entries that are randomly scattered.

四、Column Order
This is just one important consideration when deciding the column order of an index. Another is the possibility for improving the compressibility of an index by putting the least selective (most repetitive) columns first. Another is the option for arranging the columns so that some very popular queries can perform an order by without doing a sort (an execution mechanism that typically appears as sort (order by) nosort in the execution plan).
Whatever your reason for deciding to change the order of columns in an index, remember that it might change the clustering_factor. The knock-on effect of this might be that the calculated cost of the index for a range scan becomes so high that Oracle ignores the index.

4.1测试数据准备
create table t1
pctfree 90
pctused 10
as
select
    trunc((rownum-1)/ 100)    clustered,
    mod(rownum - 1, 100)    scattered,
    lpad(rownum,10)        small_vc
from
    all_objects
where
    rownum <= 10000
;

create index t1_i1_good on t1(clustered, scattered);
create index t1_i2_bad on t1(scattered, clustered);

4.2统计信息收集
begin
    dbms_stats.gather_table_stats(
        user,
        't1',
        cascade => true,
        estimate_percent => null,
        method_opt => 'for all columns size 1'
    );
end;
/

select
    blocks,
    num_rows
from
    user_tables
where
    table_name = 'T1'
;
    BLOCKS   NUM_ROWS
---------- ----------
       292      10000

select
    index_name, blevel, leaf_blocks, clustering_factor
from
    user_indexes
where
    table_name = 'T1'
;
INDEX_NAME               BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR
-------------------- ---------- ----------- -----------------
T1_I1_GOOD                    1          24               278
T1_I2_BAD                     1          24             10000

4.3测试
select
    /*+ index(t1 t1_i1_good) */
    count(small_vc)
from
    t1
where
    scattered = 50
and    clustered between 1 and 5
;

Execution Plan
----------------------------------------------------------
Plan hash value: 2833503776

---------------------------------------------------------------------------
| Id | Operation                    | Name       | Rows | Bytes | Cost |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |            |     1 |    17 |     4 |
|   1 | SORT AGGREGATE              |            |     1 |    17 |       |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1         |     6 |   102 |     4 |
|* 3 |    INDEX RANGE SCAN          | T1_I1_GOOD |     6 |       |     3 |
---------------------------------------------------------------------------

select
    /*+ index(t1 t1_i2_bad) */
    count(small_vc)
from
    t1
where
    scattered =50
and    clustered between 1 and 5
;
Execution Plan
----------------------------------------------------------
Plan hash value: 2223336776

--------------------------------------------------------------------------
| Id | Operation                    | Name      | Rows | Bytes | Cost |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |           |     1 |    17 |     9 |
|   1 | SORT AGGREGATE              |           |     1 |    17 |       |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1        |     6 |   102 |     9 |
|* 3 |    INDEX RANGE SCAN          | T1_I2_BAD |     6 |       |     2 |
--------------------------------------------------------------------------
Irrespective of which index we use in this example, we will visit exactly the same number of table blocks—but the order in which we visit them will be different, and this has been enough to make a big difference to the optimizer’s calculations.
For completeness, let’s just run our statistics through the formula.
Selectivity of scattered = 50: 1 / 100 = 0.01
Selectivity of clustered between 1 and 5: (5 – 1) / ( 99 – 0 ) + 2/100 = 0.060404
Combined selectivity: 0.01 * 0.060404 = 0.00060404
cost (t1_i1_good) =
     1 +
     ceil(0.060404 * 24) +     -- range on first column, invalidates second column
     ceil(0.00060404 * 278)    -- 2nd column can be used before visiting the table
                  = 1 + 2 + 1 = 4
cost (t1_i2_bad) =
     1 +
     ceil(0.00060404 * 24) +   -- can use both columns for start/stop keys
     ceil(0.00060404 * 10000) -- but the clustering_factor is overwhelming
                  = 1 + 1 + 7 = 9

五、Extra Columns
It’s not just a change in column order that could introduce a problem. It’s a fairly common (and often effective) practice to add a column or two to an existing index. By now I’m sure you won’t be surprised to discover that this, too, can make a dramatic difference to the clustering_factor, hence to the desirability of the index.Imagine a system that includes a table for tracking product movements. It has a fairly obvious index on the movement_date, but after some time, it might become apparent to the DBA that a number of commonly used queries would benefit from the addition of the product_id to this index (see script extra_col.sql in the online code suite).

5.1测试数据准备
create table t1
as
select
    sysdate + trunc((rownum-1) / 500)    movement_date,
    trunc(dbms_random.value(1,60.999))    product_id,
    trunc(dbms_random.value(1,10.000))    qty,
    lpad(rownum,10)                small_vc,
    rpad('x',100)                padding
from
    all_objects
where
    rownum <= 10000
;

5.2测试
5.2.1在movement_date上创建index进行测试
create index t1_i1 on t1(movement_date);
begin
    dbms_stats.gather_table_stats(
        user,
        't1',
        cascade => true,
        estimate_percent => null,
        method_opt => 'for all columns size 1'
    );
end;
/

select
    blocks,
    num_rows
from
    user_tables
where
    table_name = 'T1';

    BLOCKS   NUM_ROWS
---------- ----------
       194      10000

select
    index_name, blevel, leaf_blocks, clustering_factor
from
    user_indexes
where
    table_name = 'T1'
;

INDEX_NAME               BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR
-------------------- ---------- ----------- -----------------
T1_I1                         1          27               184

测试
select
    sum(qty)
from
    t1
where
    movement_date = trunc(sysdate) + 7
and    product_id = 44
;
Execution Plan
----------------------------------------------------------
Plan hash value: 269862921

----------------------------------------------------------------------
| Id | Operation                    | Name | Rows | Bytes | Cost |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |     1 |    14 |    12 |
|   1 | SORT AGGREGATE              |       |     1 |    14 |       |
|* 2 |   TABLE ACCESS BY INDEX ROWID| T1    |     8 |   112 |    12 |
|* 3 |    INDEX RANGE SCAN          | T1_I1 |   500 |       |     2 |
----------------------------------------------------------------------

select
    product_id, max(small_vc)
from
    t1
where
    movement_date = trunc(sysdate) + 7
group by
    product_id
;

Execution Plan
----------------------------------------------------------
Plan hash value: 2629508439

----------------------------------------------------------------------
| Id | Operation                    | Name | Rows | Bytes | Cost |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |    60 | 1320 |    40 |
|   1 | HASH GROUP BY               |       |    60 | 1320 |    40 |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1    |   500 | 11000 |    12 |
|* 3 |    INDEX RANGE SCAN          | T1_I1 |   500 |       |     2 |
----------------------------------------------------------------------
The first query is an example of the type of query that encouraged us to add the extra column to the index. The second query is an example of a query that will suffer as a conse-quence of the change. In both cases, Oracle will be visiting the same little clump of about ten blocks in the table—but the extra column changes the order in which the rows are visited (which is what the clustering_factor is about), so the cost changes, and in the second case the execution plan changes for the worse.

5.2.2在movement_date上创建index进行测试
drop index t1_i1;
create index t1_i1 on t1(movement_date, product_id);
begin
    dbms_stats.gather_table_stats(
        user,
        't1',
        cascade => true,
        estimate_percent => null,
        method_opt => 'for all columns size 1'
    );
end;
/

select
    index_name, blevel, leaf_blocks, clustering_factor
from
    user_indexes
where
    table_name = 'T1'
;
INDEX_NAME               BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR
-------------------- ---------- ----------- -----------------
T1_I1                         1          31              6645

select
    table_name,
    blocks,
    num_rows
from    user_tables
where    table_name = 'T1'
;

TABLE_NAME               BLOCKS   NUM_ROWS
-------------------- ---------- ----------
T1                          194      10000

select
    sum(qty)
from
    t1
where
    movement_date = trunc(sysdate) + 7
and    product_id = 44
;
Execution Plan
----------------------------------------------------------
Plan hash value: 269862921

----------------------------------------------------------------------
| Id | Operation                    | Name | Rows | Bytes | Cost |
----------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |     1 |    14 |     7 |
|   1 | SORT AGGREGATE              |       |     1 |    14 |       |
|   2 |   TABLE ACCESS BY INDEX ROWID| T1    |     8 |   112 |     7 |
|* 3 |    INDEX RANGE SCAN          | T1_I1 |     8 |       |     1 | ----------------------------------------------------------------------

select
    product_id, max(small_vc)
from
    t1
where
    movement_date = trunc(sysdate) + 7
group by
    product_id
;
Execution Plan
----------------------------------------------------------
Plan hash value: 136660032

-----------------------------------------------------------
| Id | Operation          | Name | Rows | Bytes | Cost |
-----------------------------------------------------------
|   0 | SELECT STATEMENT   |      |    60 | 1320 |    59 |
|   1 | HASH GROUP BY     |      |    60 | 1320 |    59 |
|* 2 |   TABLE ACCESS FULL| T1   |   500 | 11000 |    31 |
-----------------------------------------------------------

全表扫描cost = 194/6.41=31

index扫描cost = 1+0.05*31+0.05*6645=336

所以选择全表扫描。

Java Freeman

2011年7月27日星期三

The Clustering Factor

没有评论:

发表评论