Jonathan Lewis

Subscribe to Jonathan Lewis feed Jonathan Lewis
Just another Oracle weblog
Updated: 10 hours 15 min ago

Exchange Partition

Wed, 2023-02-01 09:04

A question appeared recently in a comment on a blog note I wrote about the new (in 12c) deferred global index maintenance that allowed the execution of a drop partition to return very quickly. The question was in two parts:

  • Why is an exchange partition so slow by comparison?
  • Is there any way to make it quick?

It occurred to me that it might be possible to do something for some cases by taking advantage of the partial indexing feature that also appeared in 12c. It wouldn’r eliminate the need for index maintenance, of course, and you clearly couldn’t avoid the insert maintenance for the incoming partition completely … but you might be able to ensure that it happens only at a time you find convenient.

There’s a little more detail in my reply to the comment, but this note is a place holder for the article I’ll write if I ever find time to test the idea.

Lob Space redux

Wed, 2023-02-01 07:56

At present this is just a place holder to remind me to finish commenting on (and correcting) a mistake I made when I wrote a note about the number of bytes of data you could get into an “enable storage in row” LOB before it had to be stored out of row.

Lost Or-Expand

Mon, 2023-01-30 07:39

I’ve commented previously on the “new” cost-based Or-Expansion introduced in 12c to replace the “legacy” Concatenation transformation, and I’ve been re-running some of my concatenation scripts to see whether the most recent versions of the optimizer will use Or-expansion unhinted in places where I’ve previously had to use hints to force concatenation to appear.

The latest test has produced a surprising result – I’ve got an example where 19c and 21c will use concatenation when hinted with use_concat(), but will object to or_expand() hint on the grounds that there’s “No valid predicate for OR expansion”

It’s worth knowing this could happen if you’re upgrading from 11g to 19c (as many people seem to be doing at present) as you may find that you have some statements that used to use concatenation unhinted, but now need to be hinted to do so as they can’t switch to or-expansion and won’t use concatenation unless hinted to do so.

tl;dr (the rest of the note is just a demonstration.) When you upgrade from 11g to 19c (or later) you may find that some queries perform badly because they stop using the legacy “concatenation” operator but can use the new “cost-based Or Expand” operator, and need to be hinted with a use_concat() hint.

Here’s a statement I can use to demonstrate the effect – I’ll post the code to create the tables at the end of the note:

select  /*+ gather_plan_statistics */
        n1, n2, small_vc
from
        t1
where
        (n1 = 1 and n2 = 10000)
or      (n1 = 10000 and n2 = 1)
;

I’ve rigged the data so that there are 9,999 distinct values of n1 each with one row, and 10,001 rows with the value 10,000; and I’ve done the same with n2 – 9,999 distinct values with one row each and 10,001 rows with the value 10,000.

I’ve gathered stats that include histograms on n1 and n2 (separately) and I’ve created indexes on n1 and n2 (separately). As a result the ideal path for this query is to use the index on n1 to find rows for the first of the two compound predicates and use the index on n2 to find rows for the second of the predicates, which should be possible if the optimizer first transforms the query using OR-expansion.

You’ll notice I’ve included the hint to capture rowsource execution statistics, so I’ll be executing this query with various hints and reporting the actual execution plans and workload. Using 19.11.0.0 and 21.3.0.0 with no special parameter settings the execution plan that appeared used B-tree/bitmap conversion:

| Id  | Operation                           | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |      1 |        |    45 (100)|      2 |00:00:00.01 |      50 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |      1 |    45   (3)|      2 |00:00:00.01 |      50 |
|   2 |   BITMAP CONVERSION TO ROWIDS       |       |      1 |        |            |      2 |00:00:00.01 |      48 |
|   3 |    BITMAP OR                        |       |      1 |        |            |      1 |00:00:00.01 |      48 |
|   4 |     BITMAP AND                      |       |      1 |        |            |      1 |00:00:00.01 |      24 |
|   5 |      BITMAP CONVERSION FROM ROWIDS  |       |      1 |        |            |      1 |00:00:00.01 |       2 |
|*  6 |       INDEX RANGE SCAN              | T1_N1 |      1 |        |     1   (0)|      1 |00:00:00.01 |       2 |
|   7 |      BITMAP CONVERSION FROM ROWIDS  |       |      1 |        |            |      1 |00:00:00.01 |      22 |
|*  8 |       INDEX RANGE SCAN              | T1_N2 |      1 |        |    21   (0)|  10001 |00:00:00.01 |      22 |
|   9 |     BITMAP AND                      |       |      1 |        |            |      1 |00:00:00.01 |      24 |
|  10 |      BITMAP CONVERSION FROM ROWIDS  |       |      1 |        |            |      1 |00:00:00.01 |       2 |
|* 11 |       INDEX RANGE SCAN              | T1_N2 |      1 |        |     1   (0)|      1 |00:00:00.01 |       2 |
|  12 |      BITMAP CONVERSION FROM ROWIDS  |       |      1 |        |            |      1 |00:00:00.01 |      22 |
|* 13 |       INDEX RANGE SCAN              | T1_N1 |      1 |        |    21   (0)|  10001 |00:00:00.01 |      22 |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   6 - access("N1"=1)
   8 - access("N2"=10000)
  11 - access("N2"=1)
  13 - access("N1"=10000)

This is a fairly clever plan but not what I wanted to test so I set the hidden parameter ‘_b_tree_bitmap_plans’ to false for all subsequent tests. With this block in place the plan changed to a full tablescan:

-------------------------------------------------------------------------------------------------
| Id  | Operation         | Name | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |      |      1 |        |    99 (100)|      2 |00:00:00.01 |     349 |
|*  1 |  TABLE ACCESS FULL| T1   |      1 |      1 |    99   (2)|      2 |00:00:00.01 |     349 |
-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter((("N1"=1 AND "N2"=10000) OR ("N1"=10000 AND "N2"=1)))


Definitely not what I wanted – so I added a hint telling the optimizer I wanted to see OR-expansion. The optimizer produced the same full tablescan! Since I had included the format option ‘hint_report’ in my call to dbms_xplan.display_cursor() I can show you the extra lines of output that explained why the optimizer “ignored” my hint:

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 1 (U - Unused (1))
---------------------------------------------------------------------------
   1 -  SEL$1
         U -  or_expand(@sel$1 (1) (2)) / No valid predicate for OR expansion

As you can see the hint was not “N – unresolved” or “E – Syntax error”. It was recognised, syntactically correct, notionally applicable but unused because the optmizer couldn’t see a way to use it (even though we can see an obvious way to use it).

Idle curiosity then prompted me to try the use_concat() hint, in the form: “use_concat(@sel$1 1)” – here’s the resulting execution plan:

---------------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |       |      1 |        |     4 (100)|      2 |00:00:00.01 |       7 |
|   1 |  CONCATENATION                       |       |      1 |        |            |      2 |00:00:00.01 |       7 |
|*  2 |   TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       4 |
|*  3 |    INDEX RANGE SCAN                  | T1_N2 |      1 |      1 |     1   (0)|      1 |00:00:00.01 |       3 |
|*  4 |   TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       3 |
|*  5 |    INDEX RANGE SCAN                  | T1_N1 |      1 |      1 |     1   (0)|      1 |00:00:00.01 |       2 |
---------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=10000)
   3 - access("N2"=1)
   4 - filter(("N2"=10000 AND (LNNVL("N2"=1) OR LNNVL("N1"=10000))))
   5 - access("N1"=1)

Exactly the plan I wanted to see from or_expand(), although the two subqueries are in the reverse order to the order I would expect from or_expand(). So the new cost-based or-expansion says there’s no valid predicate available for expansion, but the old, deprecated, heuristic, concatenation transformation manages to find a disjunct (OR) that can be expanded.

Of course the next thing to do is look at the predicted cost and actual work (mostly buffer gets) that Oracle reported for each plan:

  • bitmap conversion: (cost 45, buffers 50)
  • full tablescan: (cost 99, buffers 349)
  • concatenation: (cost 4, buffers 7)

The predicted costs are actually fairly consistent with buffer gets (which, if I flushed the cache, would also be mostly disk reads). I had been fairly impressed that the optimizer picked bitmap conversion, but it would have been so much better if the optimizer could see that this (slightly complex) set of predicates included an opportunity for or-expansion.

Footnote 1

This query shows an example of disjunctive normal form (DNF), i.e the where clause is a disjunct (OR) of conjuncts (ANDs). I understand that optimizers (in general) quite like this form, but there is another “nice” form which is CNF (conjunctive normal form) i.e. where the where clause is a conjuct (AND) of disjuncts (ORs). So, for entertainment, I rewrote the where clause in conjunctive normal form. You have to be a little careful when you play the “normal form” game, it’s quite easy to get it wrong, so here are the steps I took (using A, B, C, D instead of my 4 atomic predicates):

(A and B) or (C and D) ==
        (A or (C and D)) and (B or (C and D)) ==               -- distributing the (A and B)
        (A or C) and (A or D) and (B or C) and (B or D)        -- distributing the two occurrences of (C and D)

Here’s the restulting query and unhinted execution plan after substituting “n = 1” etc. back into the symbolic presentation (and it probably gives you some idea why I played safe by starting with A, B, C, D):

select  /*+ gather_plan_statistics */
        n1, n2, small_vc
from
        t1
where
        (n1 = 1 or n2 = 1) 
and     (n1 = 1 or n1 = 10000) 
and     (n2 = 10000 or n2 = 1)
and     (n2 = 10000 or n1 = 10000)
;

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                             | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                      |                 |      1 |        |     4 (100)|      2 |00:00:00.01 |       7 |
|   1 |  VIEW                                 | VW_ORE_BA8ECEFB |      1 |      2 |     4   (0)|      2 |00:00:00.01 |       7 |
|   2 |   UNION-ALL                           |                 |      1 |        |            |      2 |00:00:00.01 |       7 |
|*  3 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1              |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       4 |
|*  4 |     INDEX RANGE SCAN                  | T1_N1           |      1 |      1 |     1   (0)|      1 |00:00:00.01 |       3 |
|*  5 |    TABLE ACCESS BY INDEX ROWID BATCHED| T1              |      1 |      1 |     2   (0)|      1 |00:00:00.01 |       3 |
|*  6 |     INDEX RANGE SCAN                  | T1_N2           |      1 |      1 |     1   (0)|      1 |00:00:00.01 |       2 |
--------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("N2"=10000)
   4 - access("N1"=1)
   5 - filter(("N1"=10000 AND LNNVL("N1"=1)))
   6 - access("N2"=1)

It’s the OR-expansion I wanted to see.

If I can do an algorithmic rewrite that produces the desired plan the optimizer can be coded to do the rewrite – so I think you can expect to see this limitation removed at some future point. This plan, however, did still depend on my disabling B-tree/bitmap conversion; when I enabled B-tree/bimap conversion the optimizer used it to produce the following plan:

--------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |       |      1 |        |     2 (100)|      2 |00:00:00.01 |       6 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T1    |      1 |      1 |     2   (0)|      2 |00:00:00.01 |       6 |
|   2 |   BITMAP CONVERSION TO ROWIDS       |       |      1 |        |            |      2 |00:00:00.01 |       4 |
|   3 |    BITMAP OR                        |       |      1 |        |            |      1 |00:00:00.01 |       4 |
|   4 |     BITMAP CONVERSION FROM ROWIDS   |       |      1 |        |            |      1 |00:00:00.01 |       2 |
|*  5 |      INDEX RANGE SCAN               | T1_N1 |      1 |        |     1   (0)|      1 |00:00:00.01 |       2 |
|   6 |     BITMAP CONVERSION FROM ROWIDS   |       |      1 |        |            |      1 |00:00:00.01 |       2 |
|*  7 |      INDEX RANGE SCAN               | T1_N2 |      1 |        |     1   (0)|      1 |00:00:00.01 |       2 |
--------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter((INTERNAL_FUNCTION("N1") AND INTERNAL_FUNCTION("N2") AND ("N2"=10000 OR "N1"=10000)))
   5 - access("N1"=1)
   7 - access("N2"=1)

The thing to note in this case, though, is that the B-tree/bitmap conversion is logically the correct thing to choose when you compare the estimated cost and actual workload:

  • or-expansion: (cost 4, buffers 7)
  • bitmap conversion: (cost 2, buffers 6)
Footnote 2

Mohamed Houri wrote an article on Or-expansion a year ago explaining the possible settings for the hidden parameter “_optimizer_cbqt_or_expansion”, which can off, on, linear, greedy or two_pass. I tried all the options to see if that would make any difference (apart from the obvious impact of “off”); but it didn’t.

Source code

If you want to do further experiments, here’s the script I used to generate the data:

rem
rem     Script:         concat_3b.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Sep 2008 / Jan 2003
rem     Purpose:        
rem
rem     Last tested 
rem             19.11.0.0
rem             21.3.0.0
rem

create table t1
as
with generator as (
        select
                rownum  id
        from    dual
        connect by level <= 10000
)
select
        rownum                  n1,
        10000                   n2,
        lpad(rownum,10,'0')     small_vc,
        rpad('x',100)           padding
from
        generator       v1
;

insert /*+ append */ into t1
select
        n2, n1, small_vc, padding
from
        t1
;

commit;

create index t1_n1 on t1(n1);
create index t1_n2 on t1(n2);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for columns size 100 n1, n2'
        );
end;
/


Case Study

Sat, 2023-01-28 06:46

This example appeared quite recently on one of the public Oracle Forums, and the person who posted it had already worked out (the important bit of) what was going on before anyone asked for more information or supplied any suggestions for action (although there was one response pointing to a MOS Note that might have had some relevance and did contain some interesting, though unrelated, background information).

The title of the question was: “KTSJ process running select dbms_rowid.rowid_type(rowid) causing huge load”, and the opening post contain a number of bits of information about what KTSJ was and how it relates to SMCO. The critical details, though, were the SQL, and where it was running.

The system was a 4 node RAC, running 19.12; the problem query was running multiple times between 4:00 am and 9:00 am, and the table it referenced was about 4 GB and “not fragmented”. The table is “basic compressed” and partitioned on a date column, there are no LOB columns, and there are about 200,000 updates on the table “each morning” (but no comment about whether this is before, during, or after the performance probelm). The query was simple:

select dbms_rowid.rowid_type(rowid) from <owner>.<table_name> where rownum=1;

In a follow-up post we got the following information about each run of the query:

Global Stats
============================================================================
| Elapsed |    Cpu  |      IO  |  Cluster | Fetch | Buffer |  Read |  Read |
| Time(s) | Time(s) | Waits(s) | Waits(s) | Calls |   Gets |  Reqs | Bytes |
============================================================================
|     123 |      16 |       94 |       13 |     1 |     1M | 81215 |  10GB |
============================================================================

If you’re familiar with the the SQL Monitor you’ll recognise this as the summary information from an SQL Monitor report – which means, of course, that the OP also has an execution plan for the query.

It’s easy of course to do the Bart Simpson “I knew that” claim after the OP has given the answer, but even with this tiny amount of information there are several clues to tell you what’s probably going on and why.

Quiz

I’m busy clearing an attic (loft) at the moment, but needed a break from the labour – but now it’s time to get back to work so I’ll pause here for anyone who wants to think about things you can infer from the information supplied so far. I’ll give you my thoughts some time later on this weekend.

Indexing foundations

Thu, 2023-01-26 12:08

Here are the introductory comments I made at a recent “Ask me Anything” session about indexing arranged by the All India Oracle User Group:

There are 4 fundamental thoughts that you need to bear in mind whenever you’re thinking about what indexes your application needs:

  1. The intent of indexing is to find the data you need with the minimum use of critical resources (which may be disk I/Os for relatively small systems, CPU for large systems).
  2. High precision is important – the ideal use of indexes is to avoid visiting table blocks that don’t hold useful data. With ideal indexing, query performance is dictated by the amount of data you want to manipulate not by the total amount of data in the database.
  3. Creating precise indexes for every query requirement leads to a lot of real-time maintenance when you modify data so you need to balance the resources needed for DML against the resources for queries.
  4. Oracle offers many ways to minimise the work you need to do to use and to maintain indexes. (So you need to think about other mechanisms every time you think about adding indexes)

If you can keep these points in mind then everything else you need to think about for your specific system follows as a logical consequence.

I did consider adding one more comment about the most significant difference between B-tree and Bitmap indexes which is that individual B-tree indexes tend to be very precise while precision is only achieved with Bitmap indexes by combining them; but that’s just adding a specific example to a list of general principles and could even have the effect of deflecting people from the thoughts of combining (necesary but annoying) low-precision B-tree indexes.

Quiz Night

Sun, 2023-01-15 12:25

Here’s a little extract from one of my “snap instance activity stats” packages picking out the figures where the upper case of the statistic name is like ‘%PARALLEL%’. I was very careful that nothing except one of my SQL*Plus sessions had done anything in the few seconds between the start and end snapshots so there’s no “(un)lucky timing” to explain the anomaly in these figures.

The quesion is: how can Oracle manage to claim hundreds of “checkpoint buffers written for parallel query” when there had been no parallel statements executing around the time the snapshots were taken?

Name                                                        Value
----                                                   ----------
DBWR parallel query checkpoint buffers written              1,430
queries parallelized                                            0
DML statements parallelized                                     0
DDL statements parallelized                                     0
DFO trees parallelized                                          0
Parallel operations not downgraded                              0
Parallel operations downgraded to serial                        0
Parallel operations downgraded 75 to 99 pct                     0
Parallel operations downgraded 50 to 75 pct                     0
Parallel operations downgraded 25 to 50 pct                     0
Parallel operations downgraded 1 to 25 pct                      0

Here’s a little background information if you don’t know why this is a puzzle.

When you start executiing a parallel tablescan (or index fast full scan) the first step that your session takes is to tell DBWR to write to disc any dirty blocks for the object you’re about to scan and wait for DBWR to confirm that the blocks have been written. The session does this because a parallel scan will bypass the buffer cache and read directly from disc to the PGA, so if there were any dirty blocks for the object in the buffer cache the query would not find them and get them to a correctly read-consistent state. The blocks written by DBWR in this case would (should) be reported under “DBWR parallel query checkpoint buffers written”.

Answer

To be published on 16th Jan 2023, some time after 1800 UTC (GMT)

Quiz Night

Wed, 2023-01-11 11:27

I may have given the answers to this little puzzle elsewhere on the blog already, but if so I can’t find where, and it’s worth a linttle note of its own.

I have a non-partitioned heap table that is a subset of all_objects, and I’m going to execute a series of calls to dbms_stats to see what object stats appear. When I first wrote the script Oracle didn’t gather stats automatically when you created a table, so I’ve have to add in a call to delete_table_stats immediately after creating the table. I’ve run a little script to display various object stats at each stage, that part of the process is left to any readers who want to check my answers.


rem     Script:         gather_col_stats.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2015
rem     Purpose:        
rem
rem     Last tested 
rem             19.11.0.0
rem             11.2.0.4
rem

drop table t1;

@@setup

create table t1 
as
select  * 
from    all_objects
where   rownum <= 50000
;


execute dbms_stats.delete_table_stats(user,'t1');
start show_simple_stats t1

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for columns created size 1')
start show_simple_stats t1

delete from t1 where rownum <= 2000;

create index t1_i1 on t1(object_id);
start show_simple_stats t1

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all indexed columns size 1')
start show_simple_stats t1

After creating the table I call delete_table_stats() – does this definitely mean there are no stats on the table?

I’ve then gathered stats on just one column of the table – what has Oracle really done?

I’ve then deleted a couple of thousand rows, and created an index on the table. Since this is 11g (at least) what extra stats have come into existence?

I’ve then gathered stats for just the indexed columns – what has Oracle really done?



Answers (to be posted tomorrow)

Case Study

Fri, 2023-01-06 10:34

Here’s a query that appeared on the MOS “SQL Performance” forum (needs a support account) with the request: “Can someone please assist me to optimize this query?”

It looks like a nice simple query (though looks can be deceiving) so I thought I’d use it as another example on how to think about tuning SQL.

SELECT 
        MSI.SEGMENT1, OL.SHIP_FROM_ORG_ID, 
        OL.ORDERED_QUANTITY SOLD_QTY, 
        OL.UNIT_SELLING_PRICE SELLING_PRICE
FROM 
        OE_ORDER_HEADERS_ALL   OH, 
        OE_ORDER_LINES_ALL     OL, 
        MTL_SYSTEM_ITEMS_B     MSI
WHERE 
        OH.HEADER_ID         = OL.HEADER_ID
AND     OL.SHIP_FROM_ORG_ID  = MSI.ORGANIZATION_ID
AND     OL.INVENTORY_ITEM_ID = MSI.INVENTORY_ITEM_ID
AND     NVL(oh.source_document_type_id,0) <> 10
AND     SUBSTR(SEGMENT1,4,3) = 'FIF'
AND     UPPER(OL.FLOW_STATUS_CODE) NOT IN ('ENTERED','CANCELLED')

I really dislike (and actually have trouble) reading blocks of text in upper case, so before I do anything else here’s the same code converted to (mostly) lower case, with one minor adjustment:

select
        msi.segment1, 
        ol.ship_from_org_id, 
        ol.ordered_quantity     sold_qty,
        ol.unit_selling_price   selling_price
from 
        oe_order_headers_all   oh, 
        oe_order_lines_all     ol, 
        mtl_system_items_b     msi
where 
        oh.header_id         = ol.header_id
and     ol.ship_from_org_id  = msi.organization_id
and     ol.inventory_item_id = msi.inventory_item_id
and     nvl(oh.source_document_type_id,0) <> 10
and     substr(msi.segment1,4,3) = 'FIF'
and     upper(ol.flow_status_code) not in ('ENTERED','CANCELLED')

I’ve highlighted line 15 to pick out the minor adjustment: I’ve added a table alias to the expression substr(segment1,4,3). Every column reference should include its table alias. The query was pretty good in following the guideline, and it was fairly easy to work out the appropriate alias here because we can also see msi.segment1 and if there had been a “segment1” column in one of the other tables Oracle would have reported error “ORA-00918: column ambiguously defined“.

We’ve been told that the query is a “sales query” against Oracle EBS R12, and we’ve been given an execution plan – though the plan, unfortunmetly, is a image from one of the non-Oracle GUIs:

Conveniently the plan tells us that the three tables in the query really are tables and not views that hide complex subqueries; it also gives us some idea of the size of two of the tables (very big). What it doesn’t give us is any information about how and where the original predicates have been used.

Things we don’t know
  • What the query means (in business terms)
  • How long it takes the query to run
  • How long it probably should take the query to run
  • How fast the OP would like the query to run
  • How many rows from each table have to participate in deriving the result
  • How many rows and blocks there are in each table
  • How many distinct values there are for each of the columns in the where clause.
  • Whether there are any virtual columns (or extended stats) on the table
  • Whether there are any (numerically) skewed data distribution patterns
  • What physical patterns there might be in the rows identified by the predicates.
  • What the definitions of any available indexes are (including “function-based”)
  • Whether or not we’re allowed to change the code, or whether we have to “hint” it somehow.
  • Whether this is the ONLY set of literal values that would appear or whether it’s generated code that allows many variations to appear.
  • Whether this is a case where a production query using bind variables is being tested with one known set of values.
  • Which version of Oracle and optimizer_features_enable / hacked optimizer parameters
Basic threats

Every single predicate that compares a column with a literal hides the column inside a function call – which means the optimizer may be “losing” important statistical information

Two of the three literal-based predicates are “negative”, i.e. “not equal” and “not in”, which is another detail that can mess up the optimizer’s arithmetic (though the specific effects may vary with version and the column statistics).

Observations and Guesswork

The optimizer cardinality estimate for upper(ol.flow_status_code) not in (‘ENTERED’,’CANCELLED’) is 76,240 with a tablescan cost of 989,658. That looks like a very small fraction of a very large table. But it’s possible that this is the standard estimate for “unknown value not in (list of 2 items)” which would be 5% of 5% in 19c. Certainly 76,240 * 20 * 20 = 30M sounds like a suitable number of rows for a table with a tablscan cost close to a million. Possibly a better cardinality estimate would change the plan.

If we had extended stats, or a virtual column on upper(flow_status_code), and created a histogram because flow_status_code looks like the sort of column that would have a small number of distinct values with an enormous data skew, then the optimizer might pick a completely different path. If the estimate became very small it might be able to choose nested loops and an indexed access path all the way through the query; if the estimate became very large it might decide to use a different join order.

The optimizer cardinality estimate for nvl(oh.source_document_type_id,0) <> 10 is 6.65M with a tablescan cost of 95,823. The selectivity for this predicate is derived as 1 – selectivity(nvl(oh.source_document_type_id,0)= 10), which is derived as 1 – selectivity( oh.source_document_type_id = 10). At this point we could launch into all sorts of speculation about the column: the number of distinct values, the number of nulls, the existence (or not) of a histogram, a comparison between the blocks and cardinality of this “order headers” table and the estimates made above for the “order lines” table – but trying to cover all the options would be a long and tangled document, so I’ll just show you one example that might be a valid model this predicate:

  • t1 is a table with 100,000 rows
  • data_type_id is a numeric column with 80 distinct values
  • there are 1,000 rows in t1 where data_type_id is null
  • I’ve gathered stats just before executing a couple of queries

Here are the two queries, each followed by the critical line from its execution plan:

SQL> select count(*) from t1 where nvl(data_type_id,0) <> 50;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|*  2 |   TABLE ACCESS FULL| T1   | 98763 |   289K|   553   (2)| 00:00:01 |
---------------------------------------------------------------------------


SQL> select count(*) from t1 where data_type_id = 50;

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|*  2 |   TABLE ACCESS FULL| T1   |  1238 |  3714 |   551   (1)| 00:00:01 |
---------------------------------------------------------------------------

A quick arithmetic check: add the two row estimates: 98,763 + 1,238 = 99,001. Allowing for rounding errors in the arithmetic that’s the 99,000 rows where data_type_id is not null.

Unfortunately the result of the first query is 2,793 – the optimizer’s estimate is out by a factor of more than 35; and the result from the second query is 97,207 – the optimizer’s estimate is out (in the opposite direction) by a factor of more than 78.

The statistics are “up to date” – except 50 is an extremely common value for data_type_id and I haven’t created a histogram on the column.

So perhaps the original query is asking for “rare” order headers when there’s no histogram to give the optimizer any clue that 10 (the unwanted value) is an extremely common value for the document_type_id. Would a suitable histogram allow the optimizer to produce a much lower cardinality estimate and a significant change in the execution plan?

Finally we come to the predicate substr(msi.segment1,4,3) = ‘FIF’. I don’t know much about EBS but I can guess what the tables oe_order_headers_all and oe_order_lines_all represent; on the other hand I can’t think what is really behind the name mtl_system_items_b, especially when the column names that join it to the order lines table make it look as if it should be called something more like oe_products_all; and there’s no way to guess what the significance of a little chunk of a column called segment1 might be.

The only thing we can infer from the execution plan is that it must be a fairly huge table since the optimizer has chosen to probe it through a nested looped 76,240 times with a total incremental cost of 152,551 (= 1,244,546 – 1,091,995) rather than doing a tablescan and hash join.

The cardinality and cost estimates show a couple of interesting details. There’s a clear error in the cardinality estimate of the nested loop join since but the number of rows produced by the join is (very slightly) larger than the number of rows supplied to it by the hash join, even though it’s accessing the table by a unique scan of a unique index.

It’s worth commenting on the cost of the indexed access, which might seem low at only 2 when I’ve claimed it’s a “fairly huge” table which would presumably require traversing an index with a blevel of 2 (root, branch, leaf) before visiting the required table block – so a cost of 4 for the table visit would seem reasonable. In fact that is what it would be for a non-unique index (on this data) and a query for “id = constant}”; the optimizer has substracted one for the uniqueness, and one because it’s a nested loop join.

If we know that there are only a very few rows where substr(segment_1,4,3) = ‘FIF’, and if we think that this is a good starting table for the optimizer then we need to create a virtual column (or extended stats) and might need to support that with a histogram. Even then, of course, it might not be a good first table to have first in the join order.

Joins

So far we’ve only been considering ways to improve the optimizer’s cardinality estimates in the hope that better information would give it a better execution plan. We believe that all three tables are very large, and suspect that if the actual volume of relevant data is small we can get a better path that uses nested loops from beginning to end – but if we that path to be efficient we’re going to need suitable indexes, including a precision index into the first table in the join.

What we need to do now is consider the resources that might be needed to ensure we have the “perfect” indexes for the optimum, compared with the excess resources that would be used if we could find a “good enough” path.

Looking at the query, my feeling is that there are two possible paths that might work reasonably well (assuming the required volume of data is sufficiently small):

  • oe_order_headers_all -> or_order_lines_all -> mtl_system_items_b
  • oe_order_lines_all -> or_order_headers_all -> mtl_system_items_b

I suspect (in the absence of any supplied information) that a path starting with mtl_system_items_b will do too much random I/O into the order lines tables. (It’s a standard example of the problem displayed by pairings like: products -> order_lines and customers -> orders the order_lines for any given product are likely to be be scattered widely across the table, as are the orders for any particular customer.

Looking at the oe_order_lines_all table it strikes me that only a small number of rows will be newly entered or cancelled, and most of them will be in states like “completed”, “picked” invoiced”, etc. So that predicate is probably not one that will eliminate a lot of date, so I’m just going to take the example of getting into the oe_order_headers_all table. Assuming the predicate “nvl(oh.source_document_type_id,0) <> 10” does identify a “small enough” number of rows then we probably have the a good enough index (the foreign key index) into oe_order_lines_all, and we know we have a unique index from there into mtl_system_items_b.

So how do we access that small number of rows as efficiently as possible with a minimum of overhead. We’ve set up a histogram on source_document_type_id so that the optimizer gets a better idea of the number of rows – but that still leaves us with a tablescan of a huge table unless we create a tailored index. Here’s an example of what we can do – based on a table t2 that has 50,000 rows of which almost all have a document_type of ‘XX’, which we want to ignore, and a few rows where the document_type is null which have to be included in the ones we keep:

SQL> execute dbms_stats.gather_table_stats(user,'t2',method_opt =>'for all columns size 1 for columns size 254 document_type')

SQL> select count(*) from t2 where nvl(document_type,'YY') <> 'XX';

  COUNT(*)
----------
       300

Execution Plan
----------------------------------------------------------
Plan hash value: 3321871023

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     4 |   278   (2)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     4 |            |          |
|*  2 |   TABLE ACCESS FULL| T2   |   300 |  1200 |   278   (2)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(NVL("DOCUMENT_TYPE",'YY')<>'XX')

In keeping with the original supplied code I have a predicate which translates NULLs into a value that isn’t the ‘XX’ that we want to exclude. With the histogram in place the optimizer expresion has actually got exactly the right estimate.

So here’s a possible index definition that will allow us to create a very small index that identifies exactly those rows as efficiently as possible:

SQL> create index t2_i1 on t2 (case when nvl(document_type,'YY') <> 'XX' then 1 end);

SQL> execute dbms_stats.gather_table_stats(user,'t2',method_opt =>'for all hidden columns')

SQL> select count(*) from t2 where case when nvl(document_type,'YY') <> 'XX' then 1 end = 1;

  COUNT(*)
----------
       300

Execution Plan
----------------------------------------------------------
Plan hash value: 3503408237

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |     1 |     2 |     1   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE   |       |     1 |     2 |            |          |
|*  2 |   INDEX RANGE SCAN| T2_I1 |   300 |   600 |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access(CASE  WHEN NVL("DOCUMENT_TYPE",'YY')<>'XX' THEN 1 END =1)

Two things to note; first, after I’ve created the “function-based” index I’ve gathered stats on “all hidden columns”. This is one way of gathering stats on the system-generated, invisible, virtual column that supports the index, and is a step that is often missed until the stats are gathered overnight by the automatic stats collection job at some random point in the future. Secondly I’ve had to change the query so that the predicate I use is an exact match for the index definition; this is why it’s often nicer to create a virtual column for the expression and index the virtual column – and to keep 3rd party or legacy code safe it’s often a good idea to declare such columns invisible.

Finally, just to show the efficiency of this strategy, here are a couple of the interesting stats about the index:

SQL> select table_name, index_name, num_rows from user_indexes where table_name = 'T2';

TABLE_NAME                INDEX_NAME             NUM_ROWS
------------------------- -------------------- ----------
T2                        T2_I1                       300

The table has 50,000 rows, but the index has only 300 entries as the expression is null for the other 49,700. As an important bonus it’s fairly safe to assume that there won’t be any code in the system that decided to use this strangely defined index when it should be using some other index.

Conclusion

We started with a query that was “too slow”. It contained some predicates that would hide any useful statistical information from the optimizer. In two of the three cases we could give the optimizer some useful statistics by creating virtual columns or extended stats on the expressions; the structure of the third expression was a special “nvl()” case which could simply have been missing a histogram on the underlying column.

Some of the information in the execution plan gave us clues about the scale of the tables – including the table which was accessed by a unique index – but on-site DBAs wouldn’t need to produced guesses about some of the numbers I came up with, they could simply query the data dictionary or, for some details, query the data directly/

Reviewing table and column names, and making some assumptions about some of the data distributions (again details that could be extracted from the actual data), I picked a path that would probably be suitable if the required volume of data was relatively small and demonstrated an example of how we could add in an efficient index that would make this query as efficient as needed without requiring much overhead in index maintenance and without introducing the risk of other queries changing execution plans to use this new index.

Row Migration

Mon, 2022-12-05 06:11

This is nothing more than a basic update of a note that I wrote 8 years ago. The update was triggered by a brief comment made by Martin Widlake at the recent UKOUG annual conference “Breakthrough 2022” in Birmingham. In his presentation on “wide tables”, he mentioned row migration and the possible effects of a row having to migrate many times as it grew and the possibility (of which he was not certain as he had only a vague memory of hearing the claim at some unspecified time in the past) that it might leave a “long chain of pointers” from the header to the final location of the migrated row.

It occurred to me that the vague memory might have been due to my blog note from 2014 explaining that this doesn’t happen. If a row migrates (i.e. the whole row gets moved to a new location leaving only a header behind pointing to the new location) then at a future point in time it might migrate to a 3rd (or 4th or more) location and update the header pointer, or it might actually migrate back to the original location if space has since become available.

The following script (originally created on 10gR2, but updated for 19.11 and modified for ease of retesting) builds a table, performans a series of updates on a row, and dumps the header block after each update.

rem
rem     Script:         row_migration.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Feb 2014 / Dec 2022
rem
rem     Last tested 
rem             10.2.0.5
rem             19.11.0.0
rem

create table t1 (
        id      number(6,0),
        v1      varchar2(1200)
)
pctfree 0
;


insert  into t1 
select  rownum - 1, rpad('x',100) i
from    all_objects i
where   rownum <= 75
;

commit;

prompt  =========================================
prompt  Get the relative file and block number of 
prompt  the block that hold 74 of the 75 rows, 
prompt  then dump the block to the trace file.
prompt  =========================================

column rel_file_no new_value m_file_no
column block_no    new_value m_block_no

select 
        dbms_rowid.rowid_relative_fno(rowid)    rel_file_no, 
        dbms_rowid.rowid_block_number(rowid)    block_no,
        count(*)                                rows_starting_in_block
from 
        t1
group by 
        dbms_rowid.rowid_relative_fno(rowid), 
        dbms_rowid.rowid_block_number(rowid) 
having
        count(*) = 74
;

alter system flush buffer_cache;
alter session set tracefile_identifier= 'D1_start';
alter system dump datafile &m_file_no block &m_block_no;

prompt  ======================================
prompt  Make the first row migrate and show it
prompt  ======================================

update t1 set v1 = rpad('x',400) where id = 0;
commit;

alter system flush buffer_cache;
alter session set tracefile_identifier= 'D2_migrate';
alter system dump datafile &m_file_no block &m_block_no;


prompt  ======================================
prompt  Fill the block the long row is now in,
promtp  the make it migrate again
prompt  ======================================

insert  into t1 
select  rownum + 75, rpad('x',100) 
from    all_objects 
where   rownum <= 75
;
commit;

update t1 set v1 = rpad('x',800) where id = 0;
commit;

alter system flush buffer_cache;
alter session set tracefile_identifier= 'D3_migrate_more';
alter system dump datafile &m_file_no block &m_block_no;

prompt  =======================================================
prompt  Fill the block the long row is in and shrink the row 
prompt  to see if it returns to its original block. (No).
prompt  =======================================================

insert  into t1 
select  rownum + 150, rpad('x',100) 
from    all_objects 
where   rownum <= 75
;
commit;

-- delete from t1 where id between 1 and 20;
-- commit;

update t1 set v1 = rpad('x',50) where id = 0;
commit;

alter system flush buffer_cache;
alter session set tracefile_identifier= 'D4_shrink_row';
alter system dump datafile &m_file_no block &m_block_no;


prompt  ==============================================
prompt  Make a lot of space in the original block then
prompt  GROW the row again to see if it migrates back.
prompt  ==============================================

delete from t1 where id between 1 and 20;
commit;

update t1 set v1 = rpad('x',1200) where id = 0;
commit;

alter system flush buffer_cache;
alter session set tracefile_identifier= 'D5_forcemigrate';
alter system dump datafile &m_file_no block &m_block_no;

When the script has run there will be 5 trace files, and single “grep” command to find the row entry in the dump for the first row of the block (row 0) will give you results like the following:

[oracle@linux19c trace]$ grep -A+3 "row 0" *19012*.trc

or19_ora_19012_D1_start.trc:tab 0, row 0, @0xab
or19_ora_19012_D1_start.trc-tl: 106 fb: --H-FL-- lb: 0x1  cc: 2
or19_ora_19012_D1_start.trc-col  0: [ 1]  80
or19_ora_19012_D1_start.trc-col  1: [100]
--
or19_ora_19012_D2_migrate.trc:tab 0, row 0, @0xab
or19_ora_19012_D2_migrate.trc-tl: 9 fb: --H----- lb: 0x2  cc: 0
or19_ora_19012_D2_migrate.trc-nrid:  0x090000ac.1
or19_ora_19012_D2_migrate.trc-tab 0, row 1, @0x115
--
or19_ora_19012_D3_migrate_more.trc:tab 0, row 0, @0xab
or19_ora_19012_D3_migrate_more.trc-tl: 9 fb: --H----- lb: 0x1  cc: 0
or19_ora_19012_D3_migrate_more.trc-nrid:  0x090000b0.7
or19_ora_19012_D3_migrate_more.trc-tab 0, row 1, @0x115
--
or19_ora_19012_D4_shrink_row.trc:tab 0, row 0, @0xab
or19_ora_19012_D4_shrink_row.trc-tl: 9 fb: --H----- lb: 0x2  cc: 0
or19_ora_19012_D4_shrink_row.trc-nrid:  0x090000b0.7
or19_ora_19012_D4_shrink_row.trc-tab 0, row 1, @0x115
--
or19_ora_19012_D5_forcemigrate.trc:tab 0, row 0, @0x4b9
or19_ora_19012_D5_forcemigrate.trc-tl: 1208 fb: --H-FL-- lb: 0x2  cc: 2
or19_ora_19012_D5_forcemigrate.trc-col  0: [ 1]  80
or19_ora_19012_D5_forcemigrate.trc-col  1: [1200]

  • The D1 trace shows you the row with a column count (cc) of 2, and the two column lengths.
  • The D2 trace shows you a column count of zero, and a nrid (next rowid) pointing to row 1 (2nd row) of block 0x090000ac.
  • The D3 trace shows you a column count of zero, and a nrid pointing to row 7 (eighth row) of block 0x090000b0, the row has moved to a new location and the header is pointing directly to the new location.
  • The D4 trace shows exactly the same output – after shrinking (even to a length that is less than it started at) the row has not moved back to the original location.
  • The D5 trace shows that the row has now moved back to its original location, even though it is now several hundred bytes longer than it used to be.

If you’re wondering why the row didn’t move back after shrinking at D4 (and even when I made a lot of space available in the original block the shrink didn’t cause a move), remember that Oracle tries to be “lazy” – the update can take place in situ, so Oracle doesn’t waste time and effort checking the original block.

Footnote

This note makes no claims about what might happen in a more complex case where a row is so long that it splits into multiple row pieces and the pieces end up scattering across multiple blocks. There are a couple of variations on that problem that might be worth investigating if you suspect that there is some behaviour of very wide tables or very long rows that is the source of a performance problem relating to excessive buffer gets or db file sequential reads.

Hakan Factor

Mon, 2022-11-28 09:14

There’s a question on the MOSC forum (needs an account) at present that started with the performance of the datapump API over a database link but moved on to the topic of how to handle a scenario that I’ve described in the past involving a table where rows are intially short and eventually become much longer and a requirement comes up to rebuild the table.

In this case the OP has to use datapump (selecting truncate as the “action on existence”) to copy the table data from one place to another rather then doing the more common ‘alter table move’ variant of rebuilding the table.

The underlying problem in this case is that:

  • the table has 84 columns made up of (pk_col1, pk_col2, flag, change_date) plus 20 groups of 4 “value” columns.
  • rows are inserted with just the four main columns and the first group of four values.
  • over time each subsequent group of 4 values in a row is updated in a separate statement

We haven’t been given numbers but a row probably ends up taking about 10 times the space it started with – and if that’s the case you would normally need to set the table’s pctfree to something like 90 to avoid getting a lot of migrated rows in the table. But that’s not the whole of the story.

Things to go wrong

If you don’t set pctfree to 90 you get lots of migrated rows. If you then do an export (expdp) in direct_path mode expdp will do a large number of single block reads following the migrated rows, and Oracle won’t cache the follow-on blocks, so you may re-read them several times in the course of reading one block in the direct path tablescan. (For cached reads the codepath for a tablescan will simply ignore the “head pointer” to a migrated row because it “knows” that it will find the whole row in some other block eventually.)

If you do set pctfree to 90 then when you rebuild the table (or recreate it with pctfree set to 90) than you end up with a much larger table with lots of blocks that are only 10% used because most of the rows are now big and aren’t going to grow any more.

Best strategy – the Hakan factor.

Work out how many rows in their final state will fit into a block and recreate the table telling Oracle that that’s the maximum number of rows it’s allowed to put in a block. (You could also set pctfree to zero at the same time to minimise the chance of Oracle inserting fewer rows than your target.)

The devil, of course, is in the detail. Part of the devilry comes from a bug that was fixed some time as far back as 10.2.0.1. Part comes from the fact that Oracle doesn’t give us a documented API to set the magic number – we have to find a way to teach Oracle about the number or hack the data dictionary. Part, inevitably, comes from the fact that when dealing with undocumented (or barely documented) mechanisms you ought to set up some test cases to check that the latest version of Oracle behaves the same way as your previous versions of Oracle when you’re playing dirty tricks.

Part 1 – Teaching Oracle.

You may know your data so well that you can immediately say how many “full-length” rows should should fit a block. If you can’t do this you could simply create a copy of the original table structure with a pctfree of zero then copy into it a few hundred rows from the original table using a predicate to limit the selected rows to ones that would not be updated any further. For example (using the table definition supplied by the OP) you might say:

create table test_tab_clone 
pctfree 0 
as 
select  * 
from    test_tab 
where   rownum = 0
/

insert into test_tab_clone 
select  * 
from    t1 
where   rownum <= 400 
and     fourthvalue19 is not null
/

commit
/

I’m assuming in this case column “fourthvalue19” will only be non-null only if the whole of the 19th set of values is populated and all the other sets of values are populated. From the OP’s perspective there may be a more sensible way of identifying fully populated rows. You do need to ensure that the table has at least one full block otherwise some odd things can happen when you try to set the Hakan factor.

Once you’ve got a small table of full size rows a simple analysis of rows per block is the next step:

select
        rows_starting_in_block,
        count(*)        blocks
from
        (
        select
                dbms_rowid.rowid_relative_fno(rowid),
                dbms_rowid.rowid_block_number(rowid),
                count(*)                                rows_starting_in_block
        from
                test_tab_clone
        group by
                dbms_rowid.rowid_relative_fno(rowid),
                dbms_rowid.rowid_block_number(rowid)
        )
group by
        rows_starting_in_block
order by
        rows_starting_in_block
/

ROWS_STARTING_IN_BLOCK     BLOCKS
---------------------- ----------
                     3          1
                    18         22
                    19          1
                       ----------
sum                            24

Looking at these results I can see that there’s a slight variation in the number of rows that could be crammed into a block – and one block which holds the last few rows of my insert statement which I can ignore. In a more realistic case you might need to tweak the selection predicate to make sure that you’ve picked only full-size rows; or you might simply need to decide that you’ve got a choice of two or three possible values for the Hakan factor and see what the results are from using them.

With the same figures above I’d be strongly inclined to set a Hakan factor of 18. That does mean I might be “wasting” roughly 1/19th of every block (for the relatively cases where a 19th row would have fitted) but it’s better than setting the Hakan factor to 19 and finding I get roughly 1 row in every 19 migrating for 22 blocks out of 23 where I should have restricted the number of rows per block to 18; the choice is not always that straightforward.

So here’s how we now “train” Oracle, then test that it learned the lesson:

truncate table test_tab_clone;
insert into test_tab_clone select * from test_tab where rownum <= 18;
alter table test_tab_clone minimize records_per_block;

truncate table test_tab_clone;
insert into test_tab_clone select * from all_objects where rownum <= 10000;

start rowid_count test_tab_clone

ROWS_STARTING_IN_BLOCK     BLOCKS
---------------------- ----------
                    10          1
                    18        555
                       ----------
sum                           556

In the first three statments I’ve emptied the table, inserted 18 rows (I ought to check they all went into the same block, really) and set the Hakan factor.

Once the Hakan factor is set I’ve emptied the table again then populated it with the “full” data set. In fact for demo purposes I’ve copied exactly 10,000 rows so that we can see that every block (except, we can safely assume, the last one written to) has acquired exactly 18 rows.

Part 2 – applying the strategy

It’s often easy to sketch out something that looks like as if it’s exactly what you need, but there are always peripheral considerations that might cause problems and an important part of examining a problem is to consider the overheads and penalties. How, for example, is our OP going to apply the method in production.

There are two problems

  • It’s a large table, and we’re cloning it because we can’t hack directly into the data dictionary to modify the table directly. What are the side effects?
  • We want the imported export to acquire the same Hakan factor. Do we have to take any special action?

The import is the simpler problem to consider since it’s not open-ended. As far as impdp is concerned we could import “data_only” or “include_metadata”, and the “table_exists_action” could be either replace or truncate, so there are only 4 combinations to investigate.

The bad news is that none of the options behaves nicely – impdp (tested on 19.11.0.0) seems to import the data then execute the “minimize records_per_block” command when really it should transfer the Hakan factor before importing the data. So it seems to be necessary to go through the same convoluted steps at least once to precreate a target table with the desired Hakan factor and thereafter use only the truncate option for the import if you want to make the target behave in every way like the source. (Even then you will need to watch out for extreme cases if the export holds fewer rows than the value you’ve set for the Hakan factor – with the special case that if the exported table is empty the attempt by the import to set the Hakan factor raises error “ORA-28603: statement not permitted on empty tables”.)

Let’s get back to the side effects of our cloning exercise on the source table. We’ve created a copy of the original data with a suitable Hakan factor so that blocks holding “completed” rows are full and 1blocks holding “in-gransit” rows have enough space to grow to their “completed” size and there are no migrated rows – and we don’t expect to see migrated rows in the future. But it’s not the right table, and to ensure we had a complete copy we would have stopped all processing of the source table.

Could we have avoided the stoppage? Maybe we could use the dbms_redefinition package – the OP is running Standard Edition so can’t do online redefinition any other way – and use the Hakan hack mechanism on the “interim” table immediately after creating it.

If we find that the online redefinition mechanism generates too much undo and redo we’ll have to use the blocking method – but then we have to do some table renaming and worry about PL/SQL packages becoming invalid, and foreign key constraints, synonyms, views etc. being associated with the wrong table.

So even though we can sketch out with an outline strategy there are still plenty of details to worry about around the edges. To a large degree this is because Oracle has not yet made the Hakan factor a “proper” property of a table that you can explicitly set in a “move” or “create table” operation . There is a function embedded in the executable (kkdxFixTableHAKAN) that looks as if it should set the Hakan factor, and there is presumably some piece of code that sets the Hakan factor when you exectute a call to “create table for exchange”, it would be nice if there was an API that was visible to DBAs.

Summary

If you have a table where rows grows significantly over their lifetime, you ought to ensure that you’ve set a suitable pctfree for the table. But if you anticipate copying, or moving the table at any time then there’s no way to pick a pctfree that is good for all stages of the data’s lifetime.

There is a feature that you can impose on the data to avoid the problems of extreme change in row-lengths and it’s fairly straightforward to impose on a single table but there is no API available to manipulate the feature directly and if you don’t anticipate the need during the initial design stage then applying the feature after the event can be an irritating and resource-intensive operation.

Footnote

For those not familiar with it, the Hakan Factor was introduced by Oracle to allow a little extra efficiency in the compression and use of bitmap indexes. If Oracle has information about the largest number of rows that can appear in any block in a table it can minimise the number of bits needed per block (space saving) and avoid having to expand and compare unnecessarily long sequences of zero bits when comparing entries across bitmap indexes. Given their intended use it should come as no surprise that you can’t call “minimize records_per_block” for a table that has an existing bitmap index.

Row_number() sorts

Mon, 2022-11-21 11:47

An email on the Oracle-L list server a few days ago described a performance problem that had appeared after an upgrade from 11.2.0.4 to 19c (19.15). A long running statement (insert as select, running parallel 16) that had run to completion in 11g using about 20GB of temporary space (with 50GM read and written) had failed after running for a couple of hours in 19c and consuming 2.5 TB of temporary space, even when the 11g execution plan was recreated through an SQL Profile.

When I took a look at the SQL Monitor report for 19c it turned out that a large fraction of the work done was in an operation called WINDOW CHILD PUSHED RANK which was there to deal with a predicate:

row_number() over(partition by t.ds_no, t.c_nbr order by c.cpcl_nbr desc) = 1

Checking the succesful 11g execution, this operation had taken an input rowsource of 7 billion rows and produced an output rowsource of 70 million rows.

Checking the SQL Monitor report for the failed executions in 19c the “pure” 19c plan had reported 7 billion input rows, 6GB memory and 1TB temp space at the same point, the plan with the 11g profile had reported 10 billion rows, but the operation had not yet reported any output rows despite reporting 9GB as the maximum memory allocation and 1TB as the maximum temp space usage. (Differences in row counts were probably due to the report being run for different dates.)

So, the question to the list server was: “is this a bug in 19c?”

Modelling

It’s a little unfortunate that I couldn’t model the problem in 19c at the time because my 19c VM kept crashing; but I built a very simple model to allow me to emulate the window sort and rank() predicate in an 11g instance, then re-played the model in an instance of 21c.

For the model data I took 50 copies of the first 50,000 rows from view all_objects to produce a table of 2,500,000 rows covering 35,700 blocks and 279 MB, (55,000 / 430 in 21c); then I ran the query below and reported its execution plan with a basic call to dbms_xplan.display_cursor():

select
        /*+ dynamic_sampling(0) */
        owner, max(object_name)
from    (
        select 
                /*+ no_merge */
                owner, object_name 
        from    (
                select 
                        owner, object_name,
                        row_number() over (partition by object_name order by object_type desc) orank 
                from 
                        t1
                )  where orank= 1
        )
group by 
        owner
order by
        owner
/

-------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |       |       |       | 29491 (100)|          |
|   1 |  SORT GROUP BY             |      |     8 |   184 |       | 29491   (9)| 00:02:28 |
|   2 |   VIEW                     |      |  2500K|    54M|       | 28532   (6)| 00:02:23 |
|*  3 |    VIEW                    |      |  2500K|   112M|       | 28532   (6)| 00:02:23 |
|*  4 |     WINDOW SORT PUSHED RANK|      |  2500K|    95M|   124M| 28532   (6)| 00:02:23 |
|   5 |      TABLE ACCESS FULL     | T1   |  2500K|    95M|       |  4821   (8)| 00:00:25 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ORANK"=1)
   4 - filter(ROW_NUMBER() OVER ( PARTITION BY "OBJECT_NAME" ORDER BY
              INTERNAL_FUNCTION("OBJECT_TYPE") DESC )<=1)

Oracle 21c produced the same execution plan – though the row estimate for the VIEW operations (numbers 2 and 3) was a more realistic 46,236 (num_distinct recorded for object_name) compared to the unchanged 2,500,000 from 11g. (Of course it should have been operation 4 that showed the first drop in cardinality.)

With my first build, the timings weren’t what I expected: under 21c the query completed in 3.3 seconds, under 11g it took 11.7 seconds. Most of the difference was due to a large (55MB) spill to temp space that appeared in 11g but not in 21c. This would have been because 11g wasn’t allowed a large enough PGA, so I set the workarea_size_policy to manual and the sort_area_size to 100M, which looks as if it should have been enough to cover the 11g requirement – it wasn’t and I had to grow the sort_area_size to 190 MB before the 11g operation completed in memory, allocating roughly 155MB. By comparison 21c reported an increase of only 19MB of PGA to run the query, claiming that it needed only 4.7MB to handle the critical operation.

For comparison purposes here are the two run-time execution plans, with rowsource execution stats (which messed the timing up a little) and the column projection information; 11g first:

-----------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
-----------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 29491 (100)|      8 |00:00:03.96 |   35513 |       |       |          |
|   1 |  SORT GROUP BY             |      |      1 |      8 |   184 |       | 29491   (9)|      8 |00:00:03.96 |   35513 |  3072 |  3072 | 2048  (0)|
|   2 |   VIEW                     |      |      1 |   2500K|    54M|       | 28532   (6)|  28575 |00:00:04.07 |   35513 |       |       |          |
|*  3 |    VIEW                    |      |      1 |   2500K|   112M|       | 28532   (6)|  28575 |00:00:03.93 |   35513 |       |       |          |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|    95M|   124M| 28532   (6)|   1454K|00:00:08.82 |   35513 |   189M|  4615K|  168M (0)|
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|    95M|       |  4821   (8)|   2500K|00:00:10.85 |   35513 |       |       |          |
-----------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ORANK"=1)
   4 - filter(ROW_NUMBER() OVER ( PARTITION BY "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )<=1)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=1) "OWNER"[VARCHAR2,30], MAX("OBJECT_NAME")[30]
   2 - "OWNER"[VARCHAR2,30], "OBJECT_NAME"[VARCHAR2,30]
   3 - "OWNER"[VARCHAR2,30], "OBJECT_NAME"[VARCHAR2,30], "ORANK"[NUMBER,22]
   4 - (#keys=2) "OBJECT_NAME"[VARCHAR2,30], INTERNAL_FUNCTION("OBJECT_TYPE")[19], "OWNER"[VARCHAR2,30], ROW_NUMBER() OVER ( PARTITION BY
       "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )[22]
   5 - "OWNER"[VARCHAR2,30], "OBJECT_NAME"[VARCHAR2,30], "OBJECT_TYPE"[VARCHAR2,19]

It’s an interesting oddity, and possibly a clue about the excess memory and temp space, that the A-Rows column for the Window Sort operation reports 1,454K rows output when it surely ought to be the final 45,982 at that point. It’s possible to imagine a couple of strategies that Oracle might be following to do the window sort that would reasult in the excess volume appearing, and I’ll leave it to the readers to use their imagination on that one.

And now 21c

--------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 48864 (100)|     12 |00:00:02.98 |   54755 |  54750 |       |       |          |
|   1 |  SORT GROUP BY             |      |      1 |     12 |   852 |       | 48864   (1)|     12 |00:00:02.98 |   54755 |  54750 |  5120 |  5120 | 4096  (0)|
|   2 |   VIEW                     |      |      1 |  46236 |  3205K|       | 48859   (1)|  45982 |00:00:02.97 |   54755 |  54750 |       |       |          |
|*  3 |    VIEW                    |      |      1 |  46236 |  6547K|       | 48859   (1)|  45982 |00:00:02.97 |   54755 |  54750 |       |       |          |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|   131M|   162M| 48859   (1)|  45982 |00:00:02.97 |   54755 |  54750 |  5297K|   950K| 4708K (0)|
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|   131M|       | 15028   (1)|   2500K|00:00:00.28 |   54755 |  54750 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - filter("ORANK"=1)
   4 - filter(ROW_NUMBER() OVER ( PARTITION BY "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )<=1)

Column Projection Information (identified by operation id):
-----------------------------------------------------------

   1 - (#keys=1; rowset=256) "OWNER"[VARCHAR2,128], MAX("OBJECT_NAME")[128]
   2 - (rowset=256) "OWNER"[VARCHAR2,128], "OBJECT_NAME"[VARCHAR2,128]
   3 - (rowset=256) "OWNER"[VARCHAR2,128], "OBJECT_NAME"[VARCHAR2,128], "ORANK"[NUMBER,22]
   4 - (#keys=2; rowset=256) "OBJECT_NAME"[VARCHAR2,128], "OBJECT_TYPE"[VARCHAR2,23], "OWNER"[VARCHAR2,128], ROW_NUMBER() OVER ( PARTITION BY
       "OBJECT_NAME" ORDER BY INTERNAL_FUNCTION("OBJECT_TYPE") DESC )[22]
   5 - (rowset=256) "OWNER"[VARCHAR2,128], "OBJECT_NAME"[VARCHAR2,128], "OBJECT_TYPE"[VARCHAR2,23]

In this case we see the A-rows from the Window Sort meeting our expectations – but that may be a beneficial side effect of the operation completing in memory.

Optimisation (?)

Given the dramatically different demands for memory for a query that ought to do the same thing in both versions it looks as if 21c may be doing something clever that 11g doesn’t do, or maybe doesn’t do very well, or maybe tries to do but has a bug that isn’t dramatic enough to be obvious unless you’re looking closely.

Here’s a script that I used to build the test data, with scope for a few variations in testing. You’ll notice that the “create table” includes an “order by” clause that is close to the sorting requirement of the over() clause that appears in the query. The results I’ve show so far were for data that didn’t have this clause in place.

rem
rem     Script:         analytic_sort_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2022
rem
rem     Last tested
rem             21.3.0.0
rem             11.2.0.4
rem

create table t1 nologging 
as
select 
        ao.*
from
        (select * from all_objects where rownum <= 50000) ao,
        (select rownum from dual connect by rownum <= 50)
order by
        object_name, object_type -- desc
/

--
--      Stats collection to get histograms
--

begin
        dbms_stats.gather_table_stats(
                ownname     => null,
                tabname     => 'T1',
                method_opt  => 'for all columns size 254'
        );
end;
/

--
-- reconnect here to maximise visibility of PGA allocation
--

connect xxxxxxxx/xxxxxxxx

set linesize 180
set trimspool on
set tab off

-- alter session set workarea_size_policy = manual;
-- alter session set sort_area_size = 199229440;

alter session set events '10046 trace name context forever, level 8';
-- alter session set statistics_level = all;
-- alter session set "_rowsource_execution_statistics"= true;

spool analytic_sort_2

select
        /*  monitoring */
        owner, max(object_name)
from    (
        select 
                /*+ no_merge */
                owner, object_name 
        from    (
                select 
                        owner, object_name,
                        row_number() over (partition by object_name order by object_type desc) orank 
                from 
                        t1
                )  where orank= 1
        )
group by 
        owner
order by
        owner
/

select * from table(dbms_xplan.display_cursor(format=>'cost bytes allstats last projection'));

alter session set events '10046 trace name context off';
alter session set "_rowsource_execution_statistics"= false;
alter session set statistics_level = typical;
alter session set workarea_size_policy = auto;

spool off

The results I’m going to comment on now are the ones I got after running the script as above, then reconnecting and flushing the shared pool before repeat the second half of the script (i.e. without recreating the table).

In 11g, going back to the automatic workarea sizing the session used 37MB of memory and then spilled (only) 3MB to temp. The run time was approximately 3 seconds – which is a good match for the “unsorted” 21c run time. As with the original tests, the value reported in A-rows is larger than we would expect (in this case suspiciously close to twice the correct values – but that’s more likely to be a coincidence than a clue). Interestingly, when I switched to the manual workarea_size_policy and set the sort_area_size to 190MB Oracle said “that’s the optimum memory” and used nearly all of it to complete in memory – for any value less than that (even down to 5MB) Oracle spilled just 3 MB to disk in a one-pass operation. So it looks as if Oracle “knows” it doesn’t need to sort the whole data set, but still uses as much memory as is available to do something before it starts to get clever.

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  | Writes |  OMem |  1Mem | Used-Mem | Used-Tmp|
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 29491 (100)|      8 |00:00:01.76 |   35523 |   2145 |    331 |       |       |          |         |
|   1 |  SORT GROUP BY             |      |      1 |      8 |   184 |       | 29491   (9)|      8 |00:00:01.76 |   35523 |   2145 |    331 |  2048 |  2048 | 2048  (0)|         |
|   2 |   VIEW                     |      |      1 |   2500K|    54M|       | 28532   (6)|  28575 |00:00:02.00 |   35523 |   2145 |    331 |       |       |          |         |
|*  3 |    VIEW                    |      |      1 |   2500K|   112M|       | 28532   (6)|  28575 |00:00:01.83 |   35523 |   2145 |    331 |       |       |          |         |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|    95M|   124M| 28532   (6)|  57171 |00:00:02.10 |   35523 |   2145 |    331 |  2979K|   768K|   37M (1)|    3072 |
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|    95M|       |  4821   (8)|   2500K|00:00:11.84 |   35513 |   1814 |      0 |       |       |          |         |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In 21c there’s essentially no difference between the sorted and unsorted tests, which suggests that with my data the session had started finding been able to apply its optimisation strategy at the earliest possible moment rather than waiting until it had no alternative but to spill to disc.

--------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name | Starts | E-Rows |E-Bytes|E-Temp | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
--------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |      |      1 |        |       |       | 48864 (100)|     12 |00:00:00.98 |   54753 |  54748 |       |       |          |
|   1 |  SORT GROUP BY             |      |      1 |     12 |   852 |       | 48864   (1)|     12 |00:00:00.98 |   54753 |  54748 |  4096 |  4096 | 4096  (0)|
|   2 |   VIEW                     |      |      1 |  46236 |  3205K|       | 48859   (1)|  45982 |00:00:00.97 |   54753 |  54748 |       |       |          |
|*  3 |    VIEW                    |      |      1 |  46236 |  6547K|       | 48859   (1)|  45982 |00:00:00.97 |   54753 |  54748 |       |       |          |
|*  4 |     WINDOW SORT PUSHED RANK|      |      1 |   2500K|   131M|   162M| 48859   (1)|  45982 |00:00:00.97 |   54753 |  54748 |  5155K|   940K| 4582K (0)|
|   5 |      TABLE ACCESS FULL     | T1   |      1 |   2500K|   131M|       | 15028   (1)|   2500K|00:00:00.42 |   54753 |  54748 |       |       |          |
--------------------------------------------------------------------------------------------------------------------------------------------------------------

Assumption

Given the way that 11g reports a very small spill to disc, which stays fairly constant in size no matter how large or small the available PGA allocation is, when the input data is sorted to help the over() clause, and given how large the spill to disc can become when the data is not sorted, I feel that Oracle has an optimisation that discards input rows early in the analytic window sort. But we also have some evidence of a flaw in the code in versions prior to 21c that means Oracle fails to re-use memory that becomes available from rows that have been discarded.

Strategy

I’ve said in the past that if you’re using analytic functions you ought to minimise the size of the data you’re processing before you apply the analytic part. Another step that can help is to make sure you’ve got the data into a (fairly well) sorted order before you reach the analytic part.

In the case of versions of Oracle prior to 21c, it also seems to make sense (if you can arrange it) to minimise the reduce the amount of memory the session is allowed to use for a sort operation, as this will reduce the CPU used by the session and avoid grabbing excess redundant memory that could be used more effectively by other sessions.

Addendum

Just before publishing I found a way of keeping my 19.11.0.0 instance alive long enough to run the tests, then also ran them on an instance of 12.2.0.1. Both versions showed the same pattern of doing a large allocation of memory and large spill to disc when the data was not sorted, and a large allocation of memory but a small spill to disc when the data was sorted.

As a little sanity check I also exported the 19c data and imported it to 21c in case it was a simple variation in the data that allwoed made 21c to operate more efficiently than19c. The change in data made no difference to the way in which 21c handled it, in both cases it called for a small allocation of memory with no spill to disc.

Lost Space

Wed, 2022-11-16 06:56

I’ve just discovered that the space management bitmaps for the tablespace I normally use in my 21c tests are broken. In a tablespace that’s supposed to be completely empty a query of dba_free_space shows 4 gaps totalling several thousand blocks:

SQL> select * from dba_free_space where tablespace_name = 'TEST_8K_ASSM';

TABLESPACE_NAME                   FILE_ID   BLOCK_ID      BYTES     BLOCKS RELATIVE_FNO
------------------------------ ---------- ---------- ---------- ---------- ------------
TEST_8K_ASSM                           13        128     327680         40           13
TEST_8K_ASSM                           13        216    1376256        168           13
TEST_8K_ASSM                           13       1792     720896         88           13
TEST_8K_ASSM                           13       1896     196608         24           13
TEST_8K_ASSM                           13       9600  969932800     118400           13

Of course I ran my script to drop all segments and purge the recyclebin when I first saw this, but that didn’t help, and a query against dba_segments showed no segments, and a query against seg$ showed nothing in the file. So somehow the bits are bust.

Fortunately there’s a dbms_space_admin package with a procedure tablespace_verify() that I’ve wanted to test for some time – the documentation is a little sparse about how it works. So here’s a cut-and-paste of my first (and second) call to the procedure, executing from the SYS schema and passing in the tablespace name:

SQL> execute dbms_space_admin.tablespace_verify ('TEST_8K_ASSM')
BEGIN dbms_space_admin.tablespace_verify ('TEST_8K_ASSM'); END;

*
ERROR at line 1:
ORA-20000: BitMap entry partially used with no Extent Map entry
TSN 6: Range RelFno 13: ExtNo: 32702 BeginBlock: 0 EndBlock: 4194303
ORA-06512: at "SYS.DBMS_SPACE_ADMIN", line 83
ORA-06512: at line 1

SQL> execute dbms_space_admin.tablespace_verify ('TEST_8K_ASSM')
BEGIN dbms_space_admin.tablespace_verify ('TEST_8K_ASSM'); END;

*
ERROR at line 1:
ORA-20000: BitMap entry partially used with no Extent Map entry
TSN 6: Range RelFno 13: ExtNo: 32766 BeginBlock: 0 EndBlock: 4194303
ORA-06512: at "SYS.DBMS_SPACE_ADMIN", line 83
ORA-06512: at line 1


The output isn’t promising – but we can, at least, see that it’s the right RelFno, and the Extno: seems to have moved on by 64 (which is a nice number in an abstract, computational way), but what might the Extno: be? And I know that I’ve only got 128,000 blocks in the file and it’s not set to auto-extend so that EndBlock: value is a little worrying.

Just to add a little more confusion – the next few calls reported the ExtNo: as 0, then stuck at 32,766. So it probably wasn’t walking the files bitmap blocks as I first guessed.

What to do next? In my case I could throw the tablespace away – there was nothing in it, and even if there were I could have recreated it very easily – so I was happy to try the next dbms_space_admin feature: tablespace_fix_bitmaps(). Here’s the declaration:

  procedure tablespace_fix_bitmaps(
        tablespace_name         in    varchar2 ,
        dbarange_relative_file  in    positive ,
        dbarange_begin_block    in    positive ,
        dbarange_end_block      in    positive ,
        fix_option              in    positive
  );
  --
  --  Marks the appropriate dba range (extent) as free/used in bitmap
  --  Input arguments:
  --   tablespace_name         - name of tablespace
  --   dbarange_relative_file  - relative fileno of dba range (extent)
  --   dbarange_begin_block    - block number of beginning of extent
  --   dbarange_end_block      - block number (inclusive) of end of extent
  --   fix_option              - TABLESPACE_EXTENT_MAKE_FREE or
  --                             TABLESPACE_EXTENT_MAKE_USED

Again the documentation is a little sparse, so I’m just going to cross my fingers and hope for the best – proceeding a little cautiously. Looking at the report of free space I can infer from the first two lines that the bits for blocks 168 (128 + 40) to 215 (216 – 1) are marked as used. So I’ll try to pass that information into the procedure call:

set serveroutput on
set linesize 132
set trimspool on
set tab off


begin
        dbms_space_admin.tablespace_fix_bitmaps(
                tablespace_name         => 'TEST_8K_ASSM',
                dbarange_relative_file  => 13,
                dbarange_begin_block    => 168,
                dbarange_end_block      => 215,
                fix_option              => dbms_space_admin.TABLESPACE_EXTENT_MAKE_FREE
        );
end;
/

PL/SQL procedure successfully completed.

SQL> select * from dba_free_space where tablespace_name = 'TEST_8K_ASSM';

TABLESPACE_NAME                   FILE_ID   BLOCK_ID      BYTES     BLOCKS RELATIVE_FNO
------------------------------ ---------- ---------- ---------- ---------- ------------
TEST_8K_ASSM                           13        128    2097152        256           13
TEST_8K_ASSM                           13       1792     720896         88           13
TEST_8K_ASSM                           13       1896     196608         24           13
TEST_8K_ASSM                           13       9600  969932800     118400           13

Comparing the new results from dba_free_space we can see that we’ve eliminated the “used” chunk that was between the first two free chunks and now have a single free chunk stretching from block 128 to block 383. So now we rinse and repeat – and we could use dba_free_space to help by generating a list of begin and end blocks – we might even consider writing a query to drive a cursor loop (being very careful to allow for multi-file tablespaces, which I haven’t done):

select
        relative_fno, block_id, block_id + blocks begin_block,
        lead(block_id) over (order by relative_fno, block_id) - 1  end_block
from
        dba_free_space
where
        tablespace_name = 'TEST_8K_ASSM'
order by
        relative_fno, block_id
/

 RELATIVE_NO   BLOCK_ID BEGIN_BLOCK  END_BLOCK
------------ ---------- ----------- ----------
          13        128         384       1791
          13       1792        1880       1895
          13       1896        1920       9599
          13       9600      128000

After three more calls to tablespace_fix_bitmaps() this is the result I got from my query against dba_free_space – followed by a call to tablespace_verify():

TABLESPACE_NAME                   FILE_ID   BLOCK_ID      BYTES     BLOCKS RELATIVE_FNO
------------------------------ ---------- ---------- ---------- ---------- ------------
TEST_8K_ASSM                           13        128 1047527424     127872           13

SQL>  execute dbms_space_admin.tablespace_verify ('TEST_8K_ASSM')

PL/SQL procedure successfully completed.

Summary

After finding a tablespace that should have shown nothing but free space along its whole length (and checking the recyclebin, and the underlying seg$ table) I called dbms_space_admin.tablespace_verify() to see what it thought was going on and it reported an inconsistency between the tablespace (file) bitmap and segment bitmaps (in this case because there were no segment bitmaps when the file bitmap said there ought to be).

Starting from a query against dba_free_space I worked out the ranges of blocks that were marked in the file bitmap as used when they shouldn’t have been, and called dbms_space_admin.tablespace_fix_bitmaps() for each range.

After fixing all the bad ranges I called tablespace_verify() again to see if it had any more complaints,, and got an empty report.

Footnotes

The documentation is not user-friendly, and it would be nice to have some comments in the manaul (or dbmsspc.sql script) describing possible outputs. On the other hand I managed to avoid reading the documentation carefully enough anyway, because it wasn’t until I started searching MOS for better documentation that I realised I should have used the ASSM version of verify

execute dbms_space_admin.assm_tablespace_verify ('TEST_8K_ASSM', dbms_space_admin.ts_verify_bitmaps)

This procedure might have reported sensible information for the Extno, BeginBlock and EndBlock. But it was too late to find out – I’ll just wait for the next corruption to happen.

There is one circumstance where you might see multiple chunks in dba_free_space when there are no segments allocated, but with no gaps between chunks – if Oracle has to “grow” the bitmap for a file then the separate chunks of the bitmap report their freespace separately.

Another possibility for multiple free space chunks when there are no (ordinary) segments is if you’ve moved the tablespace bitmap or converted a dictionary managed tablespace to a locally managed tablespace – again a rare occurrence – in which case the tablespace bitmap will be in a “nearly-hidden” segment.

opt_estimate 4a

Tue, 2022-11-15 05:21

I wrote a batch of notes about the opt_estimate() hint a couple of years ago, including one where I explained the option for using the hint to specify the number of rows in a query block. I’ve just come across a particular special case for that strategy that others might find a use for. It’s something to do whant using the “select from dual … connect by” trick for multiplying rows.

Here’s a little data to model the idea – I’ve used the all_tables view to generate some “well-known” data since I want to add a tiny bit of complexity to the query while still leaving it easy to understand the index. The results from this demonstration come from Oracle 21.3.0.0, and I’ve included the hint /*+ no_adaptive_plan */ to stop Oracle from getting too clever during optimisation.

rem
rem     Script:         opt_estimate_dual.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Nov 2022
rem
rem     Last tested 
rem             21.3.0.0
rem

create table tables_table as select * from all_objects where object_type = 'TABLE';
create table objects_table as select * from all_objects;

alter table objects_table add constraint ot_pk primary key(object_id);

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'tables_table',
                method_opt => 'for columns size 60 owner'
        );
end;
/

set serveroutput off

with driver as (
        select  /*+ materialize */
                tt.owner, tt.object_id, v1.rn
        from    tables_table tt,
                (
                select
                        /*+  opt_estimate(query_block scale_rows=10) */
                        rownum rn
                from    dual
                connect by
                        level <= 10
                ) v1
        where
                tt.owner = 'OUTLN'
)
select  /*+ no_adaptive_plan */
        dr.rn, dr.owner, dr.object_id,
        ot.object_id, ot.owner, ot.object_type, ot.object_name
from
        driver dr,
        objects_table   ot
where
        ot.object_id = dr.object_id
/

select * from table(dbms_xplan.display_cursor(format => 'hint_report'));


In my system tables_table holds 727 rows and objects_table holds 58383 rows. Three rows in tables_table correspond to tables owned by user ‘OUTLN’ which means I expect the driver CTE (common table expression / “with” subquery) to generate 30 rows and, given the join on unique id, the query to return 30 rows.

I’ve used the /*+ materialize */ hint to force Oracle to create an in-memory temporary table for the driver CTE, the /*+ no_adaptive_plan */ hint to stop Oracle from getting too clever during optimisation, and the critical /*+ opt_estimate() */ hint to help the optimizer understand the effect of my “connect by” on dual. Here’s the execution plan I get if I omit that last hint:

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |       |       |    14 (100)|          |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |       |       |            |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6632_31D19D4 |       |       |            |          |
|   3 |    MERGE JOIN CARTESIAN                  |                            |     3 |    78 |     9   (0)| 00:00:01 |
|   4 |     VIEW                                 |                            |     1 |    13 |     2   (0)| 00:00:01 |
|   5 |      COUNT                               |                            |       |       |            |          |
|   6 |       CONNECT BY WITHOUT FILTERING       |                            |       |       |            |          |
|   7 |        FAST DUAL                         |                            |     1 |       |     2   (0)| 00:00:01 |
|   8 |     BUFFER SORT                          |                            |     3 |    39 |     9   (0)| 00:00:01 |
|*  9 |      TABLE ACCESS FULL                   | TABLES_TABLE               |     3 |    39 |     7   (0)| 00:00:01 |
|  10 |   NESTED LOOPS                           |                            |     3 |   453 |     5   (0)| 00:00:01 |
|  11 |    NESTED LOOPS                          |                            |     3 |   453 |     5   (0)| 00:00:01 |
|  12 |     VIEW                                 |                            |     3 |   276 |     2   (0)| 00:00:01 |
|  13 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6632_31D19D4 |     3 |    78 |     2   (0)| 00:00:01 |
|* 14 |     INDEX UNIQUE SCAN                    | OT_PK                      |     1 |       |     0   (0)|          |
|  15 |    TABLE ACCESS BY INDEX ROWID           | OBJECTS_TABLE              |     1 |    59 |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   9 - filter("TT"."OWNER"='OUTLN')
  14 - access("OT"."OBJECT_ID"="DR"."OBJECT_ID")

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 2
---------------------------------------------------------------------------
   0 -  STATEMENT
           -  no_adaptive_plan

   2 -  SEL$1
           -  materialize


I’ve highlighted operations 4 and 8 in the plan: operation 4 is the view of dual that has generated 10 rows – unfortunately the optimizer has only considered the stats of the dual table, and hasn’t factored in the effects of the “connect by with rownum”. Operation 8 shows us that the optimizer has (correctly, thanks to the histogram I requested) estimated 3 rows for the tablescan of tables_table. The result of these two estimates is that operation 3 reports an estimate of 3 ( = 3 * 1 ) rows to be used in probing objects_table.

This is the plan after enabling the /*+ opt_estimate() */ hint:

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |       |       |    45 (100)|          |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |       |       |            |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D6633_31D19D4 |       |       |            |          |
|   3 |    MERGE JOIN CARTESIAN                  |                            |    30 |   780 |    13   (0)| 00:00:01 |
|*  4 |     TABLE ACCESS FULL                    | TABLES_TABLE               |     3 |    39 |     7   (0)| 00:00:01 |
|   5 |     BUFFER SORT                          |                            |    10 |   130 |     6   (0)| 00:00:01 |
|   6 |      VIEW                                |                            |    10 |   130 |     2   (0)| 00:00:01 |
|   7 |       COUNT                              |                            |       |       |            |          |
|   8 |        CONNECT BY WITHOUT FILTERING      |                            |       |       |            |          |
|   9 |         FAST DUAL                        |                            |     1 |       |     2   (0)| 00:00:01 |
|  10 |   NESTED LOOPS                           |                            |    30 |  4530 |    32   (0)| 00:00:01 |
|  11 |    NESTED LOOPS                          |                            |    30 |  4530 |    32   (0)| 00:00:01 |
|  12 |     VIEW                                 |                            |    30 |  2760 |     2   (0)| 00:00:01 |
|  13 |      TABLE ACCESS FULL                   | SYS_TEMP_0FD9D6633_31D19D4 |    30 |   780 |     2   (0)| 00:00:01 |
|* 14 |     INDEX UNIQUE SCAN                    | OT_PK                      |     1 |       |     0   (0)|          |
|  15 |    TABLE ACCESS BY INDEX ROWID           | OBJECTS_TABLE              |     1 |    59 |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("TT"."OWNER"='OUTLN')
  14 - access("OT"."OBJECT_ID"="DR"."OBJECT_ID")

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 2
---------------------------------------------------------------------------
   0 -  STATEMENT
           -  no_adaptive_plan

   2 -  SEL$1
           -  materialize


There are three things that stand out in this report.

  • I’ve highlighted operations 4 and 6: operation 4 is the tablescan of tables_table that correctly estimates 3 rows; operation 6 is the view operation that now correctly estimates 10 rows.
  • With the correct estimate for the view the estimate for the join to objects_table is now correct and the join order for the merge join cartesian at operation 3 has been reversed.
  • The Hint Report tells us that the opt_estimate() hint is not (always) an optimizer hint! This is a real pain because when the opt_estimate() hints you’ve tried to use don’t appear to work it’s not easy to work out what you’ve done wrong.

To make a point, I can take the demo a little further by changing the /*+ opt_estimate() */ hint to scale_rows=120. Here’s the body of the resulting plan:

-----------------------------------------------------------------------------------------------------------------------
| Id  | Operation                                | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |                            |       |       |   369 (100)|          |
|   1 |  TEMP TABLE TRANSFORMATION               |                            |       |       |            |          |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)| SYS_TEMP_0FD9D663A_31D19D4 |       |       |            |          |
|   3 |    MERGE JOIN CARTESIAN                  |                            |   360 |  9360 |    13   (0)| 00:00:01 |
|   4 |     TABLE ACCESS FULL                    | TABLES_TABLE               |     3 |    39 |     7   (0)| 00:00:01 |
|   5 |     BUFFER SORT                          |                            |   120 |  1560 |     6   (0)| 00:00:01 |
|   6 |      VIEW                                |                            |   120 |  1560 |     2   (0)| 00:00:01 |
|   7 |       COUNT                              |                            |       |       |            |          |
|   8 |        CONNECT BY WITHOUT FILTERING      |                            |       |       |            |          |
|   9 |         FAST DUAL                        |                            |     1 |       |     2   (0)| 00:00:01 |
|  10 |   HASH JOIN                              |                            |   360 | 54360 |   356   (1)| 00:00:01 |
|  11 |    VIEW                                  |                            |   360 | 33120 |     2   (0)| 00:00:01 |
|  12 |     TABLE ACCESS FULL                    | SYS_TEMP_0FD9D663A_31D19D4 |   360 |  9360 |     2   (0)| 00:00:01 |
|  13 |    TABLE ACCESS FULL                     | OBJECTS_TABLE              | 58383 |  3363K|   354   (1)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------

The earlier plans used a nested loop join into objects_table. In this plan we can see at operation 10 that the optimizer has selected a hash join because the larger row estimate for the CTE has increased the cost of the query beyond the inflection point between nested loop and hash joins.

Summary

If you need to use the “connect by” in an inline view then you may find that the optimizer gets a very bad estimate of the number of rows the view definition will generate and that an /*+ opt_estimate() */ hint in the view using the “scale_rows=nnn” option will produce better estimates of cardinality, hence a better plan.

Footnote

In this particular case where I’ve used the dual table by itself in an inline view I could have used the rows=NNN” option to get the same effect.

In any case I could have added a /*+ qb_name() */ hint to the inline view, and includes a qualifying “@qb” in the /*+ opt_estimate() */ hint.

Using hints is hard, especially when they’re not documented. There is a lot more to learn about this hint; for example, telling the optimizer about the size of a rowsource doesn’t help if it’s going to use its estimate of distinct values in the next steps of the plan – a correction you’ve managed to introduce at one point may disappear in the very next optimizer calculation.

This catalogue lists more articles on the opt_estimate() hint and its relatives.

PL/SQL Labels

Mon, 2022-10-24 03:06

A tweet from Franck Pachot [July 2021] about fully qualified names in Postgres prompted me to highlight a note I wrote a few years ago about using the label mechanism in Oracle’s PL/SQL to avoid collisions between variable names and table names. This led to a brief twitter exchange about labels and goto, an associated feature that I had completely forgotten about until I read a very sketchy comment in the scripts I’d used to demonstrate the use of labels to fully qualify variable names.

The comment – which I hadn’t included in the published note – was as follows:

rem     Using labels as targets
rem             goto label_name
rem                     unconditional jump
rem             exit [label_name] when {condition}
rem                     exit to line after labelled loop
rem             continue [label_name] when {condition}
rem                     start next iteration of labelled loop

Having just rediscovered the original blog note and associated script I found that I’d written up a demo of using the three code control mechanisms at the time of the discussion but not got around to publishing it, so here it is now:

rem
rem     Script:         plsql_block_names_2.sql
rem     Author:         Jonathan Lewis
rem     Dated:          Aug 2021
rem     Purpose:        
rem
rem     Last tested 
rem             19.11.0.0
rem

create or replace procedure demo_labels(p_control number)
as
begin
        if demo_labels.p_control = 0 then
                goto early_exit;
        end if;

        <<outer_loop>>
        for i in 1..10 loop
                dbms_output.put_line('Starting Inner');
                <<inner_loop>>
                for i in 1..4 loop
                        exit inner_loop when outer_loop.i > 3;
                        dbms_output.put_line(inner_loop.i);
                        continue inner_loop when demo_labels.p_control = 1;
                        dbms_output.put_line('Control != 1');
                        continue outer_loop when demo_labels.p_control = 2;
                end loop inner_loop;
                dbms_output.put_line('Ended inner');
        end loop outer_loop;

        <<normal_exit>>
        dbms_output.put_line('===========');
        dbms_output.put_line('Normal Exit');
        dbms_output.put_line('===========');
        goto terminate;

        <<early_exit>>
        dbms_output.put_line('==========');
        dbms_output.put_line('Early Exit');
        dbms_output.put_line('==========');

        <<Terminate>>
        null;

end;
/

I’ve created a procedure called demo_labels, and you can see at line 14 an example of qualifying a variable name with the name of the procedure. In fact I’ve used this example to show that you can (and should) qualify the names of the formal parameters to the procedure.

Inside this procedure I’ve created labels for the two loops, <<outer_loop>> and <<inner_loop>> and, again, you can see cases (lines 23 and 24) where I’ve used the loop names to qualify the names of variables declared for the loop. You’ll notice that I (deliberately) used the same index variable name for both loops – this type of thing is usually an error waiting to happen but, by qualifying the variables at every use, I’ve pre-empted the possible “capture” problem of one use of the variable name hiding another use of the same name.

I’ve also shown the use of an exit inner_loop, and a continue with both inner_loop and outer_loop; and I’ve also used the loop names to identify clearly which loop is ending on an end loop.

Finally I’ve created three further labels as potential targets for transferring execution, of which I’ve only used early_exit and terminate.

Here’s a little script I’ve then run to show the effects – you might like to work out what’s going to happen before scrolling down to the comments and output:

set feedback off
set serveroutput on

prompt  ==================== 0 ==================== 
execute demo_labels(0)

prompt  ==================== 1 ==================== 
execute demo_labels(1)

prompt  ==================== 2 ==================== 
execute demo_labels(2)

With zero as the input the procedure immediately jumps to the early_exit label, runs through the next 4 commands and returns.

==================== 0 ====================
==========
Early Exit
==========

With 1 as the input we start the outer loop, then start the inner loop and print the inner loop counter for the first time, but because we meet the requirements of the continue inner_loop at line 25 we drop to the end of the inner loop and go round again (you may prefer to think of this as going back to the top of the loop, I just happen to find it more natural to think of ending the cycle and starting a new one) for a total of 4 times, then print “Ended Inner” and go round the outer loop a second and third time doing exactly the same thing.

When we go round the outer loop for the 4th and subsequent cycles we will immediately exit inner_loop (line 23), which means we print “Ended Inner” and go round the outer again. Finally we’ll complete 10 cycles of the outer loop, work through the normal_exit and goto terminate.

==================== 1 ====================
Starting Inner
1
2
3
4
Ended inner
Starting Inner
1
2
3
4
Ended inner
Starting Inner
1
2
3
4
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
===========
Normal Exit
===========

With 2 as the input we start the outer loop, then start the inner loop and print “Starting Inner”, then print “1”, but at line 25 we don’t jump to the end of the inner loop, we fall through to the line 26, print “Control != 1”, then continue to the end of the outer loop, and cycle round again. Repeating the three lines of output 3 times, then (as with input 1) line 23 makes us jump to the end of the inner loop for the next 7 cycles of the outer loop before we pass through the normal exit.

==================== 2 ====================
Starting Inner
1
Control != 1
Starting Inner
1
Control != 1
Starting Inner
1
Control != 1
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
Starting Inner
Ended inner
===========
Normal Exit
===========

Playing around with other input values is left as an exercise.

v$session_longops

Thu, 2022-10-13 04:30

There’s a question on the Oracle developer forum at the moment asking how a tablescan could be reported as taking 94,000 seconds so far when a count(*) shows that it holds only a couple of hundred thousand rows (and it’s not storing megabytes of LOB per row if that’s your first guess).

I can think of a few reasons so I don’t know if I’ve supplied the correct explanation for the OP’s example, but it prompted me to point out that Oracle can provide several different perspectives on performance that can seem to be counter-intuitive or contradictory if you don’t realise what they’re trying to tell you.

The data supplied by the OP was initially not very readable so I’ll just point out that v$session_longops can report the sql_id, plan_hash_value and line_id for each operation, so when it reports a message like:

Table Scan: XXXXXXXXX: 158204 out of 204924 Blocks done

you can pull the query and execution plan from memory very easily.

This is the execution plan – with some cosmetic cleaning – supplied by the OP, with the comment that the tablescan reported by v$session_longops was operation 22 (highlighted):

SORT AGGREGATE
  VIEW 
    UNION-ALL 
      FILTER 
        HASH JOIN OUTER
          FILTER 
            HASH JOIN RIGHT OUTER
              INLIST ITERATOR 
                TABLE ACCESS BY INDEX ROWID BATCHED VAC_000087FD
                  INDEX RANGE SCAN I_220101_VAC_000087FD_1
              TABLE ACCESS FULL VLD_00008866
          TABLE ACCESS FULL CCH_00008868
      HASH JOIN ANTI SNA
        FILTER 
          NESTED LOOPS OUTER
            FILTER 
              HASH JOIN RIGHT OUTER
                VIEW 
                  HASH JOIN RIGHT ANTI SNA
                    INDEX FULL SCAN SYS_C00364276
                    TABLE ACCESS FULL VAC_00006D1B
                TABLE ACCESS FULL VLD_00006D22
            VIEW PUSHED PREDICATE 
              FILTER 
                MERGE JOIN ANTI NA
                  SORT JOIN
                    TABLE ACCESS BY INDEX ROWID BATCHED CCH_00007BB0
                      INDEX FULL SCAN SYS_C00353046
                  SORT UNIQUE
                    TABLE ACCESS FULL LNK_00008614
        INDEX FAST FULL SCAN SYS_C00364277
      HASH JOIN ANTI SNA
        FILTER 
          NESTED LOOPS OUTER
            FILTER 
              HASH JOIN RIGHT OUTER
                VIEW 
                  HASH JOIN ANTI SNA
                    TABLE ACCESS FULL VAC_00007C1D
                    INDEX FULL SCAN SYS_C00365638
                TABLE ACCESS FULL VLD_00007C24
            VIEW PUSHED PREDICATE 
              FILTER 
                MERGE JOIN ANTI NA
                  SORT JOIN
                    TABLE ACCESS BY INDEX ROWID BATCHED CCH_00007C26
                      INDEX FULL SCAN SYS_C00353233
                  SORT UNIQUE
                    TABLE ACCESS FULL LNK_00008787
        INDEX FAST FULL SCAN SYS_C00365640
      HASH JOIN ANTI SNA
        FILTER 
          HASH JOIN OUTER
            FILTER 
              HASH JOIN RIGHT OUTER
                VIEW 
                  HASH JOIN ANTI SNA
                    INLIST ITERATOR 
                      TABLE ACCESS BY INDEX ROWID BATCHED VAC_00006CC2
                        INDEX RANGE SCAN I_200101_VAC_00006CC2_1
                    INDEX FULL SCAN SYS_C00364113
                TABLE ACCESS FULL VLD_00006CC9
            VIEW 
              HASH JOIN RIGHT ANTI NA
                TABLE ACCESS FULL LNK_000084B2
                TABLE ACCESS FULL CCH_00007BAB
        INDEX FAST FULL SCAN SYS_C00364114
      HASH JOIN ANTI SNA
        FILTER 
          HASH JOIN OUTER
            FILTER 
              HASH JOIN RIGHT OUTER
                VIEW 
                  HASH JOIN RIGHT ANTI SNA
                    INDEX FULL SCAN SYS_C00364266
                    TABLE ACCESS FULL VAC_00006CE2
                TABLE ACCESS FULL VLD_00006CE9
            VIEW 
              HASH JOIN RIGHT ANTI NA
                TABLE ACCESS FULL LNK_000085F0
                TABLE ACCESS FULL CCH_00007BAC
        INDEX FAST FULL SCAN SYS_C00364267
      FILTER 
        HASH JOIN OUTER
          FILTER 
            HASH JOIN RIGHT OUTER
              TABLE ACCESS FULL VAC_00008613
              TABLE ACCESS FULL VLD_00008612
          TABLE ACCESS FULL CCH_00008617
      FILTER 
        HASH JOIN OUTER
          FILTER 
            HASH JOIN OUTER
              TABLE ACCESS FULL VLD_00008785
              TABLE ACCESS FULL VAC_00008786
          TABLE ACCESS FULL CCH_0000878A
      FILTER 
        HASH JOIN OUTER
          FILTER 
            HASH JOIN RIGHT OUTER
              TABLE ACCESS FULL VAC_000084B1
              TABLE ACCESS FULL VLD_000084B0
          TABLE ACCESS FULL CCH_000084B5
      FILTER 
        HASH JOIN OUTER
          FILTER 
            HASH JOIN OUTER
              TABLE ACCESS FULL VLD_000085EE
              TABLE ACCESS FULL VAC_000085EF
          TABLE ACCESS FULL CCH_000085F3

The quick answer to the OP’s question is that operation 22 is the second child to a hash join (operation 17) that passes its rowsource through a FILTER operation (16) to become the first child of a nested loop (operation 15).

This means operation 22 is passing its rows up the tree one row at a time (it’s the probe table, not the build table) and the time it takes to process each row is dictated by the second child of the nested loop join. In other words – it might take a tiny amount of work to do the tablescan, but the elapsed time for the tablescan to complete is dictated by the time it takes to call the view (operation 23) for every single row that survives the journey up to operation 15.

Demo

To demonstrate the principle that the “working time” for an operation and the elapsed time to completion can be dramatically different I’ll set up a two-table join and show that a “small tablescan” can (apparently) take a long time and get into v$session_longops because of “the other” table. As a quick and dirty trick I’ll create a function that calls dbms_session.sleep() – the function that should be used to replace calls to dbms_lock.sleep()– to sleep for 1/100 second.

create table t1
as
with generator as (
        select 
                rownum id
        from dual 
        connect by 
                level <= 1e4    -- > comment to avoid WordPress format issue
)
select
        rownum                          id,
        rownum                          n1,
        lpad(rownum,10,'0')             v1,
        lpad('x',100,'x')               padding
from
        generator       v1,
        generator       v2
where
        rownum <= 1e6   -- > comment to avoid WordPress format issue
;

create table t2 as select * from t1;
alter table t2 add constraint t2_pk primary key(id);


create or replace function waste_time(i_in number) return number
as
begin
        dbms_session.sleep(0.01);
        return i_in;
end;
/

With the data and function in place I’ll code (and hint) a nested loop join that starts with a full tablescan of t1 and probes t2 by primary key 1,000 times.

set timing on

select
        /*+ leading(t1 t2) full(t1) use_nl_with_index(t2) */
        sum(t1.id)
from
        t1, t2
where
        mod(t1.id,1000) = 0
and     t2.id  = t1.id
and     t2.n1 != 0
/

select
        /*+ leading(t1 t2) full(t1) use_nl_with_index(t2) */
        sum(t1.id)
from
        t1, t2
where
        mod(t1.id,1000) = 0
and     t2.id  = t1.id
and     waste_time(t2.n1) != 0
/

set timing off

Of course, thanks to the call to waste_time() passing in t2.n1 I expect the second version of the query to take at least 10 seconds longer than the first (given 1,000 waits of 0.01 seconds spent in the call).

SUM(T1.ID)
----------
 500500000

1 row selected.

Elapsed: 00:00:00.26

SUM(T1.ID)
----------
 500500000

1 row selected.

Elapsed: 00:00:14.39

So the question is – what does v$session_longops say about any “long operations” for my session? Query and result:

select 
        sql_id, 
        sql_plan_line_id,
        to_char(vsl.start_time,'dd hh24:mi:ss') start_time, 
        to_char(vsl.last_update_time,'dd hh24:mi:ss') last_time, 
        vsl.elapsed_seconds,
        vsl.message 
from 
        V$session_Longops vsl
where 
        vsl.sid = (select ms.sid from v$mystat ms where rownum = 1)
/

SQL_ID        SQL_PLAN_LINE_ID START_TIME                LAST_TIME   ELAPSED_SECONDS
------------- ---------------- ------------------------- ----------- ---------------
MESSAGE
------------------------------------------------------------------------------------------------------------------------------------
cqv88nkkvrwpv                4 13 09:56:21               13 09:56:35              14
Table Scan:  TEST_USER.T1: 18020 out of 18020 Blocks done

So that looks like 14 seconds to do a tablescan of just 18,020 blocks. The number is very similar to the elapsed time reported for the second of my two queries – but just to make sure let’s use the reported SQL ID to pull the query and plan from memory and check operation 4 for a tablescan of t1.

select * from table(dbms_xplan.display_cursor('cqv88nkkvrwpv', format=>'hint_report'));


PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------
SQL_ID  cqv88nkkvrwpv, child number 0
-------------------------------------
select  /*+ leading(t1 t2) full(t1) use_nl_with_index(t2) */
sum(t1.id) from  t1, t2 where  mod(t1.id,1000) = 0 and t2.id  = t1.id
and waste_time(t2.n1) != 0

Plan hash value: 1846150233

---------------------------------------------------------------------------------------
| Id  | Operation                     | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |       |       |       | 24956 (100)|          |
|   1 |  SORT AGGREGATE               |       |     1 |    15 |            |          |
|   2 |   NESTED LOOPS                |       | 10000 |   146K| 24956   (1)| 00:00:01 |
|   3 |    NESTED LOOPS               |       | 10000 |   146K| 24956   (1)| 00:00:01 |
|*  4 |     TABLE ACCESS FULL         | T1    | 10000 | 50000 |  4936   (2)| 00:00:01 |
|*  5 |     INDEX UNIQUE SCAN         | T2_PK |     1 |       |     1   (0)| 00:00:01 |
|*  6 |    TABLE ACCESS BY INDEX ROWID| T2    |     1 |    10 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - filter(MOD("T1"."ID",1000)=0)
   5 - access("T2"."ID"="T1"."ID")
   6 - filter("WASTE_TIME"("T2"."N1")<>0)

Hint Report (identified by operation id / Query Block Name / Object Alias):
Total hints for statement: 3
---------------------------------------------------------------------------

   1 -  SEL$1
           -  leading(t1 t2)

   4 -  SEL$1 / T1@SEL$1
           -  full(t1)

   5 -  SEL$1 / T2@SEL$1
           -  use_nl_with_index(t2)

Summary

When you see an entry in v$session_longops it is an indicator to an operation that took a “long” time to complete; but “completion” of the operation and “work done” by the operation are not the same thing. The operation may be the victim of a problem, not the cause. If the problem query is still in memory then v$session_long_ops gives you enough information to find the query (and check you’re looking at the right plan) so that you have a better chance of identifying the real offender.

Tracing Tip #JoelKallmanDay

Tue, 2022-10-11 04:27

I don’t really do tips because I often see simple tip that have a specific purpose being mis-used abused; but it’s a special day in the Oracle community, so here’s a quick tip in tribute to a great sharer.

I’ve been doing some work with datapump recently, and needed to get a better picture of how various processes were hanging together. But when you call expdp or impdp you don’t have direct control over the all the processes that might start running – and that covers a DMxx (datapump master) process, multiple DWxx (datapump worker) processes and Pnnn (parallel execution) processes.

As a quick, dirty, and brutal starting point (on a dedicated test system) I just enabled tracing for all the processes I wanted to follow, using a variant of the new(er) syntax, specifying process naming patterns:

Here’s what I did for 11g:

alter system set events 'sql_trace {process:pname = dw | dm | p00} level=8';

This enabled tracing whenever a process with a name (v$process.pname) starting with ‘DM’,’DW’, or ‘P00’ started running. (I restricted myself to parallel processes p000 – p009 because I had set a very small value for parallel_max_servers.)

I was a little surprised to discover that this didn’t work when I tried to use it in a 19c PDB:

SQL> alter system set events 'sql_trace {process:pname = dw | dm | p0} level=8';
alter system set events 'sql_trace {process:pname = dw | dm | p0} level=8'
*
ERROR at line 1:
ORA-49100: Failed to process event statement [sql_trace {process:pname = dw | dm | p0} level=8]
ORA-49601: syntax error: found "|": expecting one of: ":" etc..

So I read the detail in the error message and changed the statement (after two wrong guesses) to:

alter system set events 'sql_trace {process:pname = dw | process:pname=dm | process:pname=p0} level=8';

This worked.

Case Study

Thu, 2022-09-29 12:27

A recent question on the Oracle Developer Community forum asked for help with a statement that was taking a long time to run. The thread included the results from a trace file that had been passed through tkprof so we have the query and the actual execution plan with some rowsource execution stats.

Here’s the query – extracted from the tkprof output:

SELECT DISTINCT
       pll.po_line_id,
       ploc.line_location_id,
       (SELECT ptl.line_type
          FROM apps.po_line_types_tl ptl
         WHERE ptl.line_type_id = pll.line_type_id AND ptl.LANGUAGE = 'US')
           "Line_Type",
       ploc.quantity_accepted,
       NULL
           release_approved_date,
       NULL
           release_date,
       NULL
           release_hold_flag,
       NULL
           release_type,
       DECODE (ploc.po_release_id, NULL, NULL, ploc.quantity)
           released_quantity,
       (SELECT items.preprocessing_lead_time
          FROM apps.mtl_system_items_b items
         WHERE     items.inventory_item_id = pll.item_id
               AND items.organization_id = ploc.SHIP_TO_ORGANIZATION_ID)
           "PreProcessing_LT",
       (SELECT items.full_lead_time
          FROM apps.mtl_system_items_b items
         WHERE     items.inventory_item_id = pll.item_id
               AND items.organization_id = ploc.SHIP_TO_ORGANIZATION_ID)
           "Processing_LT",
       (SELECT items.postprocessing_lead_time
          FROM apps.mtl_system_items_b items
         WHERE     items.inventory_item_id = pll.item_id
               AND items.organization_id = ploc.SHIP_TO_ORGANIZATION_ID)
           "PostProcessing_LT",
       ploc.firm_status_lookup_code,
       NVL (
           (SELECT pla.promised_date
              FROM apps.po_line_locations_archive_all pla
             WHERE     pla.po_header_id = pha.po_header_id
                   AND pla.po_line_id = pll.po_line_id
                   AND pla.line_location_id = ploc.line_location_id
                   AND pla.revision_num =
                       (SELECT MIN (revision_num)
                          FROM apps.po_line_locations_archive_all plla2
                         WHERE     plla2.promised_date IS NOT NULL
                               AND plla2.line_location_id =
                                   ploc.line_location_id)),
           ploc.promised_date)
           "Original_Promise_Date",
       (SELECT items.long_description
          FROM apps.mtl_system_items_tl items
         WHERE     items.inventory_item_id = pll.item_id
               AND items.organization_id IN
                       (SELECT fin.inventory_organization_id
                          FROM apps.financials_system_params_all fin
                         WHERE fin.org_id = pha.org_id)
               AND items.LANGUAGE = 'US')
           "Item_Long_Description",
       NVL (ploc.approved_flag, 'N')
           approved_code,
       pvs.country
           "Supplier_Site_Country",
       pll.note_to_vendor,
         NVL (ploc.quantity, 0)
       - NVL (ploc.quantity_cancelled, 0)
       - NVL (ploc.quantity_received, 0) * ploc.price_override
           "Shipment_Amount",
       ploc.attribute4
           "PO_Ship_Date",
       (SELECT meaning
          FROM apps.fnd_lookup_values
         WHERE     lookup_type = 'SHIP_METHOD'
               AND lookup_code = ploc.attribute9
               AND language = 'US')
           "Ship_Method",
       (SELECT prla.note_to_receiver
          FROM apps.po_req_distributions_all  prda
               INNER JOIN apps.po_requisition_lines_all prla
                   ON prda.requisition_line_id = prla.requisition_line_id
         WHERE prda.distribution_id = pdi.req_distribution_id)
           "Note_To_Receiver",
       DECODE (pha.USER_HOLD_FLAG, 'Y', 'Y', pll.USER_HOLD_FLAG)
           "Hold_Flag",
       (SELECT ABC_CLASS_NAME
          FROM APPS.MTL_ABC_ASSIGNMENT_GROUPS  ASG
               INNER JOIN APPS.MTL_ABC_ASSIGNMENTS ASSI
                   ON ASG.ASSIGNMENT_GROUP_ID = ASSI.ASSIGNMENT_GROUP_ID
               INNER JOIN APPS.MTL_ABC_CLASSES classes
                   ON ASSI.ABC_CLASS_ID = classes.ABC_CLASS_ID
         WHERE     ASG.organization_id = ploc.SHIP_TO_ORGANIZATION_ID
               AND ASG.ASSIGNMENT_GROUP_NAME = 'MIN ABC Assignment'
               AND ASSI.inventory_item_id = pll.item_id)
           ABCClass,
       (SELECT CONCATENATED_SEGMENTS AS charge_accountsfrom
          FROM apps.gl_code_combinations_kfv gcc
         WHERE gcc.code_combination_id = pdi.code_combination_id)
           AS charge_accounts
  FROM apps.po_headers_all         pha,
       apps.po_lines_all           pll,
       apps.po_line_locations_all  ploc,
       apps.po_distributions_all   pdi,
       apps.per_all_people_f       papf,
       apps.AP_SUPPLIERS           pv,
       apps.AP_SUPPLIER_SITES_ALL  pvs,
       apps.AP_SUPPLIER_CONTACTS   pvc,
       apps.ap_terms               apt,
       apps.po_lookup_codes        plc1,
       apps.po_lookup_codes        plc2,
       apps.hr_locations           hlv_line_ship_to,
       apps.hr_locations           hlv_ship_to,
       apps.hr_locations           hlv_bill_to,
       apps.hr_organization_units  hou,
       apps.hr_locations_no_join   loc,
       apps.hr_locations_all_tl    hrl1,
       apps.hr_locations_all_tl    hrl2
 WHERE     1 = 1
       AND pll.po_header_id(+) = pha.po_header_id
       AND ploc.po_line_id(+) = pll.po_line_id
       AND pdi.line_location_id(+) = ploc.line_location_id
       AND ploc.shipment_type IN ('STANDARD', 'PLANNED')
       AND papf.person_id(+) = pha.agent_id
       AND TRUNC (SYSDATE) BETWEEN papf.effective_start_date
                               AND papf.effective_end_date
       AND papf.employee_number IS NOT NULL
       AND pv.vendor_id(+) = pha.vendor_id
       AND pvs.vendor_site_id(+) = pha.vendor_site_id
       AND pvc.vendor_contact_id(+) = pha.vendor_contact_id
       AND apt.term_id(+) = pha.terms_id
       AND plc1.lookup_code(+) = pha.fob_lookup_code
       AND plc1.lookup_type(+) = 'FOB'
       AND plc2.lookup_code(+) = pha.freight_terms_lookup_code
       AND plc2.lookup_type(+) = 'FREIGHT TERMS'
       AND hlv_line_ship_to.location_id(+) = ploc.ship_to_location_id
       AND hlv_ship_to.location_id(+) = pha.ship_to_location_id
       AND hlv_bill_to.location_id(+) = pha.bill_to_location_id
       AND hou.organization_id = pha.org_id
       AND hou.location_id = loc.location_id(+)
       AND hrl1.location_id(+) = pha.ship_to_location_id
       AND hrl1.LANGUAGE(+) = 'US'
       AND hrl2.location_id(+) = pha.bill_to_location_id
       AND hrl2.LANGUAGE(+) = 'US'
       AND hou.organization_id IN (2763)
       AND NVL (pha.closed_code, 'OPEN') IN ('OPEN', 'CLOSED')
       AND NVL (pll.closed_code, 'OPEN') IN ('OPEN', 'CLOSED')
       AND NVL (ploc.cancel_flag, 'N') = 'N'
       AND pha.authorization_status IN
               ('APPROVED', 'REQUIRES REAPPROVAL', 'IN PROCESS')

As you can see there are 10 inline scalar subqueries (highlighted) in the query with a select distinct to finish off the processing of an 18 table join. That’s a lot of scalar subqueries so it’s worth asking whether the code should be rewritten to use joins (though in newer vesions of Oracle some of the subqueries might be transformed to outer joins anyway). We also know that a distinct may be a hint that there’s a possible logic error that has been “fixed” by eliminating duplicates.

Ignoring those points, let’s consider the execution plan from the tkprof output which (with a tiny bit of extra formatting) is as follows:

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.46       1.75          0          3          0           0
Execute      1      0.00       0.00          0          0          0           0
Fetch    50346    279.02    1059.39     179103   30146895          0      755164
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    50348    279.49    1061.14     179103   30146898          0      755164

Misses in library cache during parse: 1
Optimizer mode: ALL_ROWS
Parsing user id: 678  
Number of plan statistics captured: 1

Rows (1st) Rows (avg) Rows (max)  Row Source Operation
---------- ---------- ----------  ---------------------------------------------------
         9          9          9  TABLE ACCESS BY INDEX ROWID PO_LINE_TYPES_TL (cr=20 pr=0 pw=0 time=680 us cost=2 size=32 card=1)
         9          9          9   INDEX UNIQUE SCAN PO_LINE_TYPES_TL_U1 (cr=11 pr=0 pw=0 time=323 us cost=1 size=0 card=1)(object id 63682480)

    576365     576365     576365  TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_B (cr=2267756 pr=28 pw=0 time=22598079 us cost=4 size=13 card=1)
    576365     576365     576365   INDEX UNIQUE SCAN MTL_SYSTEM_ITEMS_B_U1 (cr=1720936 pr=0 pw=0 time=4644552 us cost=3 size=0 card=1)(object id 42812859)

    576365     576365     576365  TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_B (cr=2267747 pr=0 pw=0 time=2442479 us cost=4 size=13 card=1)
    576365     576365     576365   INDEX UNIQUE SCAN MTL_SYSTEM_ITEMS_B_U1 (cr=1720936 pr=0 pw=0 time=1238342 us cost=3 size=0 card=1)(object id 42812859)

    576365     576365     576365  TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_B (cr=2267743 pr=0 pw=0 time=2029190 us cost=4 size=14 card=1)
    576365     576365     576365   INDEX UNIQUE SCAN MTL_SYSTEM_ITEMS_B_U1 (cr=1720932 pr=0 pw=0 time=967729 us cost=3 size=0 card=1)(object id 42812859)

    672743     672743     672743  TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ARCHIVE_ALL (cr=5507736 pr=163043 pw=0 time=535914552 us cost=3 size=27 card=1)
    672743     672743     672743   INDEX UNIQUE SCAN PO_LINE_LOCATIONS_ARCHIVE_U1 (cr=4560824 pr=163043 pw=0 time=533161038 us cost=2 size=0 card=1)(object id 42811947)
    755121     755121     755121    SORT AGGREGATE (cr=3540960 pr=163043 pw=0 time=530079821 us)
   1040963    1040963    1040963     TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ARCHIVE_ALL (cr=3540960 pr=163043 pw=0 time=534243973 us cost=5 size=15 card=1)
   1776649    1776649    1776649      INDEX RANGE SCAN PO_LINE_LOCATIONS_ARCHIVE_U1 (cr=1123074 pr=6392 pw=0 time=37128373 us cost=3 size=0 card=2)(object id 42811947)

    587486     587486     587486  TABLE ACCESS BY INDEX ROWID MTL_SYSTEM_ITEMS_TL (cr=3436629 pr=3564 pw=0 time=64125044 us cost=5 size=34 card=1)
    587486     587486     587486   INDEX RANGE SCAN MTL_SYSTEM_ITEMS_TL_U1 (cr=2852930 pr=869 pw=0 time=45628505 us cost=4 size=0 card=1)(object id 136492495)
         1          1          1    TABLE ACCESS BY INDEX ROWID FINANCIALS_SYSTEM_PARAMS_ALL (cr=645351 pr=0 pw=0 time=5743158 us cost=2 size=10 card=1)
    322268     322268     322268     INDEX SKIP SCAN FINANCIALS_SYSTEM_PARAMS_U1 (cr=323083 pr=0 pw=0 time=5104895 us cost=1 size=0 card=1)(object id 42770563)

        10         10         10  TABLE ACCESS BY INDEX ROWID FND_LOOKUP_VALUES (cr=51 pr=1 pw=0 time=3620 us cost=5 size=60 card=1)
        20         20         20   INDEX RANGE SCAN FND_LOOKUP_VALUES_X99 (cr=31 pr=1 pw=0 time=2133 us cost=4 size=0 card=1)(object id 42759866)

    634276     634276     634276  NESTED LOOPS  (cr=3540930 pr=5535 pw=0 time=181518759 us cost=5 size=28 card=1)
    634276     634276     634276   TABLE ACCESS BY INDEX ROWID PO_REQ_DISTRIBUTIONS_ALL (cr=1631471 pr=5253 pw=0 time=65405333 us cost=3 size=12 card=1)
    634276     634276     634276    INDEX UNIQUE SCAN PO_REQ_DISTRIBUTIONS_U1 (cr=994522 pr=5252 pw=0 time=31023194 us cost=2 size=0 card=1)(object id 42788583)
    634276     634276     634276   TABLE ACCESS BY INDEX ROWID PO_REQUISITION_LINES_ALL (cr=1909459 pr=282 pw=0 time=115275921 us cost=2 size=16 card=1)
    634276     634276     634276    INDEX UNIQUE SCAN PO_REQUISITION_LINES_U1 (cr=944449 pr=268 pw=0 time=12285440 us cost=1 size=0 card=1)(object id 42789681)

    511989     511989     511989  NESTED LOOPS  (cr=3533763 pr=6 pw=0 time=8999321 us cost=5 size=55 card=1)
    511989     511989     511989   NESTED LOOPS  (cr=2850293 pr=6 pw=0 time=7086027 us cost=4 size=45 card=1)
    576055     576055     576055    TABLE ACCESS BY INDEX ROWID MTL_ABC_ASSIGNMENT_GROUPS (cr=612378 pr=0 pw=0 time=2002832 us cost=2 size=29 card=1)
    576055     576055     576055     INDEX UNIQUE SCAN MTL_ABC_ASSIGNMENT_GROUPS_U2 (cr=36323 pr=0 pw=0 time=951307 us cost=1 size=0 card=1)(object id 42783622)
    511989     511989     511989    TABLE ACCESS BY INDEX ROWID MTL_ABC_ASSIGNMENTS (cr=2237915 pr=6 pw=0 time=4672006 us cost=3 size=16 card=1)
    511989     511989     511989     INDEX UNIQUE SCAN MTL_ABC_ASSIGNMENTS_U1 (cr=1551490 pr=4 pw=0 time=2533524 us cost=2 size=0 card=1)(object id 42757737)
    511989     511989     511989   TABLE ACCESS BY INDEX ROWID MTL_ABC_CLASSES (cr=683470 pr=0 pw=0 time=1488045 us cost=1 size=10 card=1)
    511989     511989     511989    INDEX UNIQUE SCAN MTL_ABC_CLASSES_U1 (cr=171481 pr=0 pw=0 time=693745 us cost=0 size=0 card=1)(object id 42789694)

     13320      13320      13320  TABLE ACCESS BY INDEX ROWID GL_CODE_COMBINATIONS (cr=34801 pr=0 pw=0 time=802675 us cost=3 size=49 card=1)
     13320      13320      13320   INDEX UNIQUE SCAN GL_CODE_COMBINATIONS_U1 (cr=21481 pr=0 pw=0 time=397344 us cost=2 size=0 card=1)(object id 42775044)


    755164     755164     755164  HASH UNIQUE (cr=30147018 pr=179103 pw=0 time=1058922684 us cost=749257 size=197349453 card=482517)
    768890     768890     768890   HASH JOIN  (cr=7289842 pr=6926 pw=0 time=244582512 us cost=696202 size=197349453 card=482517)
    140451     140451     140451    TABLE ACCESS FULL PER_ALL_PEOPLE_F (cr=38207 pr=0 pw=0 time=313692 us cost=18484 size=13278261 card=428331)
    768890     768890     768890    NESTED LOOPS OUTER (cr=7251635 pr=6926 pw=0 time=242897348 us cost=672652 size=30016980 card=79410)
    755121     755121     755121     NESTED LOOPS OUTER (cr=5538283 pr=6031 pw=0 time=154841427 us cost=443987 size=28382903 card=78623)
    755121     755121     755121      NESTED LOOPS OUTER (cr=5508916 pr=6031 pw=0 time=153523676 us cost=443982 size=18184959 card=51809)
    755121     755121     755121       NESTED LOOPS OUTER (cr=5386279 pr=6031 pw=0 time=151985656 us cost=443978 size=11642422 card=34142)
    755121     755121     755121        NESTED LOOPS  (cr=5090949 pr=6031 pw=0 time=139220421 us cost=375644 size=11574138 card=34142)
    792959     792959     792959         NESTED LOOPS  (cr=1747964 pr=134 pw=0 time=64597738 us cost=109035 size=19934760 card=73560)
    254919     254919     254919          HASH JOIN OUTER (cr=315780 pr=6 pw=0 time=14811187 us cost=29121 size=5413350 card=22650)
    254919     254919     254919           NESTED LOOPS OUTER (cr=286919 pr=0 pw=0 time=12395919 us cost=13792 size=5209500 card=22650)
    254919     254919     254919            HASH JOIN RIGHT OUTER (cr=107134 pr=0 pw=0 time=12153146 us cost=13790 size=3868572 card=17426)
      3834       3834       3834             VIEW  HR_LOCATIONS (cr=3913 pr=0 pw=0 time=15826 us cost=125 size=360 card=60)
      3834       3834       3834              NESTED LOOPS  (cr=3913 pr=0 pw=0 time=15055 us cost=125 size=1080 card=60)
      3834       3834       3834               TABLE ACCESS FULL HR_LOCATIONS_ALL (cr=262 pr=0 pw=0 time=11211 us cost=125 size=304 card=38)
      3834       3834       3834               INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=3651 pr=0 pw=0 time=6183 us cost=0 size=20 card=2)(object id 42783719)
    254919     254919     254919             HASH JOIN RIGHT OUTER (cr=103221 pr=0 pw=0 time=11917174 us cost=13666 size=3764016 card=17426)
      3834       3834       3834              VIEW  HR_LOCATIONS (cr=3898 pr=0 pw=0 time=14651 us cost=125 size=360 card=60)
      3834       3834       3834               NESTED LOOPS  (cr=3898 pr=0 pw=0 time=14267 us cost=125 size=1080 card=60)
      3834       3834       3834                TABLE ACCESS FULL HR_LOCATIONS_ALL (cr=247 pr=0 pw=0 time=9532 us cost=125 size=304 card=38)
      3834       3834       3834                INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=3651 pr=0 pw=0 time=9539 us cost=0 size=20 card=2)(object id 42783719)
    254919     254919     254919              HASH JOIN RIGHT OUTER (cr=99323 pr=0 pw=0 time=11817243 us cost=13541 size=3659460 card=17426)
        45         45         45               INDEX RANGE SCAN FND_LOOKUP_VALUES_U1 (cr=21 pr=0 pw=0 time=614 us cost=4 size=49 card=1)(object id 63685210)
    254919     254919     254919               HASH JOIN RIGHT OUTER (cr=99302 pr=0 pw=0 time=11729251 us cost=13537 size=2805586 card=17426)
        59         59         59                INDEX RANGE SCAN FND_LOOKUP_VALUES_U1 (cr=20 pr=0 pw=0 time=445 us cost=4 size=49 card=1)(object id 63685210)
    254919     254919     254919                NESTED LOOPS  (cr=99282 pr=0 pw=0 time=11653162 us cost=13533 size=1951712 card=17426)
         1          1          1                 NESTED LOOPS OUTER (cr=116 pr=0 pw=0 time=113273 us cost=3 size=40 card=1)
         1          1          1                  NESTED LOOPS  (cr=113 pr=0 pw=0 time=113227 us cost=2 size=32 card=1)
         1          1          1                   INDEX UNIQUE SCAN HR_ALL_ORGANIZATION_UNTS_TL_PK (cr=110 pr=0 pw=0 time=113164 us cost=1 size=17 card=1)(object id 63680720)
         1          1          1                   TABLE ACCESS BY INDEX ROWID HR_ALL_ORGANIZATION_UNITS (cr=3 pr=0 pw=0 time=59 us cost=1 size=15 card=1)
         1          1          1                    INDEX UNIQUE SCAN HR_ORGANIZATION_UNITS_PK (cr=2 pr=0 pw=0 time=7 us cost=0 size=0 card=1)(object id 42789144)
         1          1          1                  TABLE ACCESS BY INDEX ROWID HR_LOCATIONS_ALL (cr=3 pr=0 pw=0 time=42 us cost=1 size=8 card=1)
         1          1          1                   INDEX UNIQUE SCAN HR_LOCATIONS_PK (cr=2 pr=0 pw=0 time=7 us cost=0 size=0 card=1)(object id 42797079)
    254919     254919     254919                 TABLE ACCESS BY INDEX ROWID PO_HEADERS_ALL (cr=99166 pr=0 pw=0 time=11505632 us cost=13530 size=1254672 card=17426)
    255397     255397     255397                  INDEX SKIP SCAN PO_HEADERS_ALL_X3 (cr=1753 pr=0 pw=0 time=725236 us cost=352 size=0 card=37674)(object id 42773719)
    254883     254883     254883            INDEX UNIQUE SCAN AP_TERMS_TL_U1 (cr=179785 pr=0 pw=0 time=183291 us cost=0 size=8 card=1)(object id 42798416)
    482528     482528     482528           TABLE ACCESS FULL AP_SUPPLIER_SITES_ALL (cr=28861 pr=6 pw=0 time=227983 us cost=13727 size=4323123 card=480347)
    792959     792959     792959          TABLE ACCESS BY INDEX ROWID PO_LINES_ALL (cr=1432184 pr=128 pw=0 time=53002963 us cost=5 size=96 card=3)
    793375     793375     793375           INDEX RANGE SCAN PO_LINES_U2 (cr=504726 pr=20 pw=0 time=17603112 us cost=2 size=0 card=5)(object id 42755253)
    755121     755121     755121         TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ALL (cr=3342985 pr=5897 pw=0 time=71357938 us cost=4 size=68 card=1)
   1138558    1138558    1138558          INDEX RANGE SCAN PO_LINE_LOCATIONS_N15 (cr=1707311 pr=5830 pw=0 time=37903421 us cost=3 size=0 card=2)(object id 63697005)
    723002     723002     723002        VIEW PUSHED PREDICATE  HR_LOCATIONS (cr=295330 pr=0 pw=0 time=11391536 us cost=2 size=2 card=1)
    723002     723002     723002         NESTED LOOPS  (cr=295330 pr=0 pw=0 time=11004720 us cost=2 size=18 card=1)
    723002     723002     723002          INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=146911 pr=0 pw=0 time=1391389 us cost=1 size=10 card=1)(object id 42783719)
    723002     723002     723002          TABLE ACCESS BY INDEX ROWID HR_LOCATIONS_ALL (cr=148419 pr=0 pw=0 time=9233363 us cost=1 size=8 card=1)
    723002     723002     723002           INDEX UNIQUE SCAN HR_LOCATIONS_PK (cr=117800 pr=0 pw=0 time=836734 us cost=0 size=0 card=1)(object id 42797079)
    755119     755119     755119       INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=122637 pr=0 pw=0 time=829404 us cost=0 size=20 card=2)(object id 42783719)
    755121     755121     755121      INDEX UNIQUE SCAN HR_LOCATIONS_ALL_TL_PK (cr=29367 pr=0 pw=0 time=716408 us cost=0 size=20 card=2)(object id 42783719)
    768883     768883     768883     TABLE ACCESS BY INDEX ROWID PO_DISTRIBUTIONS_ALL (cr=1713352 pr=895 pw=0 time=75314769 us cost=3 size=17 card=1)
    768883     768883     768883      INDEX RANGE SCAN PO_DISTRIBUTIONS_N1 (cr=1096671 pr=874 pw=0 time=24392643 us cost=2 size=0 card=1)(object id 42782429)

The plan is a bit long, but you may recall that a query with scalar subqueries in the select list reports the plans for each of the separate scalar subqueries before reporting the main query block – and I’ve inserted blank lines in the output above to improve the visibility of the individual blocks / scalar subqueries.

An odd little detail of this tkprof output was that there was no report of the wait information recorded against the query, though the following information appeared as the summary for the trace file, giving us a very good idea of the wait events for the individual query:

OVERALL TOTALS FOR ALL NON-RECURSIVE STATEMENTS

call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        6      0.85       2.14          0          6          0           0
Execute      6      0.00       0.00          0          7        104          85
Fetch    50358    279.03    1059.39     179103   30146895          0      755329
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    50370    279.88    1061.54     179103   30146908        104      755414

Misses in library cache during parse: 3

Elapsed times include waiting on following events:
  Event waited on                             Times   Max. Wait  Total Waited
  ----------------------------------------   Waited  ----------  ------------
  SQL*Net message to client                   50363        0.00          0.00
  SQL*Net message from client                 50362      157.17        227.70
  row cache lock                                141        0.03          0.67
  library cache lock                             77        0.01          0.21
  library cache pin                              75        0.01          0.27
  Disk file operations I/O                      791        0.00          0.01
  gc current block 3-way                     835881        0.15        305.35
  gc current block 2-way                     471360        0.24        144.04
  KJC: Wait for msg sends to complete            40        0.00          0.00
  gc cr multi block request                       8        0.00          0.00
  gc current block congested                  10014        0.03          4.23
  gc cr block 3-way                           20215        0.06          4.69
  gc current grant busy                          20        0.00          0.00
  gc cr grant 2-way                          165010        0.07         25.13
  db file sequential read                    179103        0.05        196.31
  gc cr grant congested                         729        0.19          0.36
  gc current block busy                       71431        0.05        118.15
  gc cr block 2-way                            1800        0.01          0.31
  latch free                                      3        0.00          0.00
  gc cr block congested                         197        0.01          0.06
  latch: cache buffers chains                    45        0.00          0.00
  latch: gc element                              15        0.00          0.00
  gc cr block busy                               15        0.02          0.07
  latch: object queue header operation            1        0.00          0.00
  KSV master wait                                 2        0.00          0.00
  ASM file metadata operation                     1        0.00          0.00
  SQL*Net more data to client                     1        0.00          0.00
  gc current grant 2-way                          6        0.00          0.00

An important initial observation is that the query returned 750,000 rows in 50,000 fetches (all figures rounded for convenience) and that’s consistent with the SQL*Plus default arraysize of 15. So there might be a little time saved by setting the arraysize to a larger value (but only a few 10s of seconds – based on the 227 seconds total minus the 157 second maximum wait for the “SQL*Net message from client” figures and there may be some benefit of increasing the SQL*net SDU_SIZE at the same time). Critically, though, we should ask “why do you want a query to return 750,000 rows?”, and “how fast do you think is ‘reasonable’?” You’ll also note from the “gc” waits that the system is based on RAC with at least 3 nodes – and RAC is always a suspect when you see unexpected time spent in a query.

Where in the driving query block does most of the time go between the last hash join (line 62) and the hash unique (line 61) – it’s in the query block whose plan starts at line 28 where we see 163,000 physical blocks read (pr=) and 535 seconds (time= microseconds) of which 6,400 blocks come from the index range scan operation at line 32 but most comes from line 31 fetching 1 million rows (by index rowid) from table po_lines_locations_archive_all.

    672743     672743     672743  TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ARCHIVE_ALL (cr=5507736 pr=163043 pw=0 time=535914552 us cost=3 size=27 card=1)
    672743     672743     672743   INDEX UNIQUE SCAN PO_LINE_LOCATIONS_ARCHIVE_U1 (cr=4560824 pr=163043 pw=0 time=533161038 us cost=2 size=0 card=1)(object id 42811947)
    755121     755121     755121    SORT AGGREGATE (cr=3540960 pr=163043 pw=0 time=530079821 us)
   1040963    1040963    1040963     TABLE ACCESS BY INDEX ROWID PO_LINE_LOCATIONS_ARCHIVE_ALL (cr=3540960 pr=163043 pw=0 time=534243973 us cost=5 size=15 card=1)
   1776649    1776649    1776649      INDEX RANGE SCAN PO_LINE_LOCATIONS_ARCHIVE_U1 (cr=1123074 pr=6392 pw=0 time=37128373 us cost=3 size=0 card=2)(object id 42811947)

This part of the workload comes from 672,743 executions of the subquery starting at line 36 of the original query text:

           (SELECT pla.promised_date
              FROM apps.po_line_locations_archive_all pla
             WHERE     pla.po_header_id = pha.po_header_id
                   AND pla.po_line_id = pll.po_line_id
                   AND pla.line_location_id = ploc.line_location_id
                   AND pla.revision_num =
                       (SELECT MIN (revision_num)
                          FROM apps.po_line_locations_archive_all plla2
                         WHERE     plla2.promised_date IS NOT NULL
                               AND plla2.line_location_id =
                                   ploc.line_location_id))

If we want to improve the performance of this query with a minimum of re-engineering, recoding and risk then a good point to start would be to examine this query block in isolation and see if there is a simple, low-cost way of improving its efficiency. (Note: this may not be a route to optimising the whole query “properly”, but it may give a quick win that is “good enough”.)

We could go a little further down this route of optimising the scalar subqueries by looking at the time spent in each of them in turn. Taking out the top line of each of the separate sections of the plan and extracting just the pr, pw and time values (which I’ll scale back from microseconds to seconds) we get the following

pr=      0      pw=0    time=   0
pr=     28      pw=0    time=  23
pr=      0      pw=0    time=   2
pr=      0      pw=0    time=   2
pr= 163043      pw=0    time= 536
pr=   3564      pw=0    time=  64
pr=      1      pw=0    time=   0
pr=   5535      pw=0    time= 182
pr=      6      pw=0    time=   9
pr=      0      pw=0    time=   1

The 8th scalar subquery (line 42 in the plan, line 75 in the query) gives us an opportunity to reduce the run time by 182 seconds, so might be worth a little investment in programmer time.

The 6th subquery (line 34 in the plan, line 49 in the query) adds only 64 seconds to the run time, so we might be less inclined to do anything about it.

You might note that the 2nd, 3rd and 4th subqueries are against the same table with the same predicate to get three different columns – this group is the “obvious” choice for recoding as a single join rather than three separate subqueries, but if you look at the total times of the three subqueries the “extra” two executions add only two seconds each to the total time – so although the this scalar subquery coding pattern is undesirable, it’s not necessarily going to be worth expending the effort to rewrite it in this case.

If you’re wondering, by the way, why different subqueries are reporting different numbers of rows returned (and each one should return at most one row on each execution), there are two reasons for any subquery to be reporting fewer than the 768,890 rows reported by the basic driving hash join:

  • first – an execution may simply return no rows,
  • secondly – there may be some benefits from scalar subquery caching.

One of the nice details about newer versions of Oracle is that the “starts” statistic is also reported in the trace/tkprof output so you would be able to see how much your query had benefited from scalar subquery caching.

If we add together the time reported by each of the scalar subquery sections of the plan the total time reported is approximately 819 seconds. Cross-checking with the difference in the times reported for operations 61 and 62 (hash unique of hash join) we see: 1,059 seconds – 245 seconds = 814 seconds. This is a good match (allowing for the accumulation of a large number of small errors) for the 819 seconds reported in the subqueries – so the hash unique isn’t a significant part of the query even though it has virtually no effect on the volume of data. You’ll note that it didn’t spill to disc (pw = 0) but completed in memory.

Summary

I’ve written a quick note on this query because the coding style was undesirable and the execution plan quite lengthy. I’ve reviewed how the style of the SQL is echoed in the shape of the plan. I’ve then pursued the idea of optimising the code piece-wise to see if there were any opportunities for improving the performance “enough” without going through the effort of a complete redesign of the query.

Given the information in the Rowsource Execution from the trkprof output it proved easy to identify where the largest amounts of times appeared that might be reduced by very localised optimsation.

In passing I pointed out the option for reducing the time spent on network traffic by increasing the array fetch size, and increasing the SDU_SIZE for the SQL*Net messages to client.

Dumping redo

Mon, 2022-09-12 04:05

In the past I’ve sometimes had to dump the contents of the redo log to a trace file when I needed to find out what work Oracle was doing behing the scenes. To minimise the volume dumped by the “alter system dump logfile” command and make it easier to find the bit I wanted to see I used to “switch logfile” just before (and sometimes just after) the statement I was investigating.

With the advent of pluggable databases the “switch logfile” command now raises Oracle error: “ORA-65040: operation not allowed from within a pluggable database”, so I had to change the strategy. This is just a brief note (echoing a footnote to an older note) of the approach I now use:

column current_scn new_value start_scn
select to_char(current_scn,'9999999999999999') current_scn from v$database;

-- do something interesting here

column current_scn new_value end_scn
select to_char(current_scn,'9999999999999999') current_scn from v$database;

alter session set tracefile_identifier='sometextyoulike';

alter system dump redo scn min &start_scn scn max &end_scn ;
alter session set tracefile_identifier='';

The list of options for the dump has been extended since I published the note on dumping the log file, and now (19.11.0.0) allows the following options (using c notation for the type of the variables you supply to each parameter):

 rdba min  %d rdba max  %d tablespace_no  %d
 dba min  %u  %u dba max  %u  %u
 securefile_dba  %u  %u
 length  %d
 time min  %d
 time max  %d
 layer  %d
 opcode  %d
 scn min  %llu
 scn max  %llu
 xid  %d  %d  %d
 objno  %u
 con_id  %d
 skip corruption


If you try to restrict the dump on objno (object id) or xid (transaction id) then the trace file will skip any redo records generated by private threads / in-memory undo and report the text: “Skipping IMU Redo Record: cannot be filtered by XID/OBJNO”

The tablespace_no option can only be used when both rdba min and rdba max (rolback data block address range) have been specified.

The con_id option may only be legal when used to specify a PDB from the CDB

Remember – when you dump redo you get just the redo for your session; there is some scope for being selective, but the starting point would be all the redo for the PDB you’re working from.

Parallel Default

Fri, 2022-09-09 04:25

“Why did my query go parallel?”

It’s a question that crops up from time to time, usually followed by a list of reasons why it shouldn’t have gone parallel – no hints in the query, table is not declared parallel, parallel_degree_policy is set to manual etc.

When the question appeared recently on the Oracle developer forum it turned out that the table in question was declared as “parallel (degree default)”, which prompted the OP to ask the question: “is parallel = default not equivalent to parallel = 1”.

The answer to the question is that the two options are not equivalent – but that’s not the point of this note. Here’s a little script to test the claim:

drop table t1 purge;

create table t1 pctfree 90 as select * from all_objects where rownum <= 50000;

select degree, instances from user_tables where table_name = 'T1';

explain plan for select sum(object_id) from t1;
select * from table(dbms_xplan.display);

alter table t1 parallel (degree default);
select degree, instances from user_tables where table_name = 'T1';

explain plan for select sum(object_id) from t1;
select * from table(dbms_xplan.display);


I’ve created a table in the simplest possible way, but picked a fixed number of rows (to help reproducibility) and – because parallel is usually about “big” objects – I’ve left a lot of empty space (90%) in each block.

Then I’ve checked the execution plan for a very simple query that can only do a full tablescan, with the two declarations of parallelism set.

Here are the outputs of the 4 queries I’ve run:

DEGREE                                   INSTANCES
---------------------------------------- ----------------------------------------
         1                                        1

1 row selected.


PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3724264953

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |     5 |  1275   (2)| 00:00:01 |
|   1 |  SORT AGGREGATE    |      |     1 |     5 |            |          |
|   2 |   TABLE ACCESS FULL| T1   | 50000 |   244K|  1275   (2)| 00:00:01 |
---------------------------------------------------------------------------

9 rows selected.


DEGREE                                   INSTANCES
---------------------------------------- ----------------------------------------
   DEFAULT                                        1

1 row selected.



PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------
Plan hash value: 3110199320

----------------------------------------------------------------------------------------------------------------
| Id  | Operation              | Name     | Rows  | Bytes | Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT       |          |     1 |     5 |   350   (0)| 00:00:01 |        |      |            |
|   1 |  SORT AGGREGATE        |          |     1 |     5 |            |          |        |      |            |
|   2 |   PX COORDINATOR       |          |       |       |            |          |        |      |            |
|   3 |    PX SEND QC (RANDOM) | :TQ10000 |     1 |     5 |            |          |  Q1,00 | P->S | QC (RAND)  |
|   4 |     SORT AGGREGATE     |          |     1 |     5 |            |          |  Q1,00 | PCWP |            |
|   5 |      PX BLOCK ITERATOR |          | 50000 |   244K|   350   (0)| 00:00:01 |  Q1,00 | PCWC |            |
|   6 |       TABLE ACCESS FULL| T1       | 50000 |   244K|   350   (0)| 00:00:01 |  Q1,00 | PCWP |            |
----------------------------------------------------------------------------------------------------------------

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 4 because of degree limit

17 rows selected.


Clearly “parallel default” does not have the same effect as “parallel 1”. Any time you’ve got a query unexpectedly running parallel it’s possible that some table (or index on the table) has been created with a parallel degree of default. (More commonly, someone may have rebuilt an index “parallel N” to get the job done more quickly then forgotten to alter the index back to parallel 1 – or noparallel – afterwards.)

The point of this note, though, is that there are some questions you should not ask until you’ve spent a few minutes thinking about how you might create a model that gives you the answer. There are several reasons for this

  • The more you do it, the better and faster you get at modelling and understanding – and sometimes you really need to model a complex problem because you’re not allowed to show anything that looks like production in public.
  • If the simple model seems to disagree with the behaviour you see in production it may give you some clues about where to look in the production system for the source of the difference.
  • If the answer isn’t what you thought it would be you can change the question you put publicly to: “I thought Oracle would do X but it did Y; here’s how I tested, is there a flaw in the test?”

It took about 5 minutes for me to run up this demo – that might seem a bit quick but I’ve had a lot of practice (and it took a lot longer to write the note) – and it was, in this case, a waste of my time because I knew the answer; but I often run up little models before responding to questions on the forums or listservers because while I often think I know what the answer “ought” to be I do like to check before I say something that might be incorrect.

Shrinking indexes

Fri, 2022-09-02 13:21

If you want to do something about “wasted” space in an index what are the differences that you need to consider between the following three options (for the purposes of the article I’m ignoring “rebuild” and “rebuild online”):

alter index xxx coalesce;

alter index xxx shrink space compact;

alter index xxx shrink space;

Looking at the notes in a script I wrote a “few” years ago it seems that I haven’t looked at a comparison between the coalesce option and the shrink space options since 10.2.0.3 and I suspect things may have changed since then, so I’ve discarded the results that I had recorded (in 2012) and started again with 19.11.0.0

Background

I’ve been looking at the “deferred global index maintenance” in the last couple of weeks which is why I was toying with the idea of writing something about the shrinking indexes and how this differs from coalescing them when an Oracle Forum question (needs MOS account) produced the (slightly surprising) suggestion to use coalesce – so I decided it was time to (re-)test, write and publish.

RTFM

From the 19c SQL reference manual under “alter index”, or following the links from there to the “shrink clause”, or the database administration reference

  • Specify COALESCE to instruct Oracle Database to merge the contents of index blocks where possible to free blocks for reuse.
  • Use this [shrink] clause to compact the index segments. Specifying ALTERINDEXSHRINKSPACECOMPACT is equivalent to specifying ALTERINDEXCOALESCE.
    • If you specify COMPACT, then Oracle Database only defragments the segment space … The database does not readjust the high water mark and does not release the space immediately.
  • Can’t shrink space for bitmap join index or function-based index.
  • Segment shrink is an online, in-place operation. DML operations and queries can be issued during the data movement phase of segment shrink. Concurrent DML operations are blocked for a short time at the end of the shrink operation, when the space is deallocated.
  • Shrink operations can be performed only on segments in locally managed tablespaces with automatic segment space management (ASSM).
  • As with other DDL operations, segment shrink causes subsequent SQL statements to be reparsed because of invalidation of cursors unless you specify the COMPACT clause.

As with many little features of Oracle it’s quite hard to pick up a complete and cohesive statement of what something does and what impact it might have. Some of the bullet points above are generic about shrinking segments, and may not be totally accurate for shrinking only an index – will it invalidate cursors, or does that happen only when you shrink a table.

If you do read through the links you also notice that I’ve omitted several points from the generic shrink details that are not relevant for indexes (for example the requirement to enable row movement), and have only mentioned the restrictions which are explicitly referenced in the “shrink clause” for indexes.

What do we need to know?

Some of the fairly typical bits of information we might need to know about a “house-keeping” task like coalesce/shrink are:

  • How much work does if do, and of what type?
  • What exactly is the benefit we might get for the work done
  • What side-effects do we have to consider (locking, cursor invalidation etc.)
  • What side effects might show up if the process fails in mid-stream.

In the case of coalesce/shrink for indexes, a few specific questions would be:

  • Is “shrink space compact” really equivalent to “coalesce”
  • Are the operations “online” or only “nearly online”.
  • If shrink/coalesce is moving index entries around and moving index blocks around what happens if a session wants to insert an index entry into a leaf block that’s currently being “transferred” into another leaf block.
  • If it’s a big index that needs several minutes (or more) to shrink/coalesce, could ongoing transactions cause index leaf block splits that produce unexpected effects when Oracle tried to drop the highwater mark.
  • How big an index, and how long would the test have to take, and what degree of concurrency, and how (un)lucky would you have to be to hit a moment when something “strange” happened.

Finally – what tools would be helpful. Initially we might just look at:

  • session stats – to see what work we do
  • the dbms_space package – to check segment usage before and after
  • the treedump event – to get a detailed picture of the index

Based on what we see we might feel the need to dig a little deeper with:

  • v$enqueue_stats
  • v$rollstat (rollback (undo) segment usage)
  • SQL tracing with wait states
  • Enqueue (lock) tracing
  • redo dumps
The basic model

Here’s a little script to create a model that we can use for testing. Because of the stated requirement of the shrink space command I’ll just point out that the default tablespace should be using automatic segment space management (ASSM):

rem
rem     Script:         shrink_coalesce.sql
rem     Author:         Jonathan Lewis
rem     Dated:          May 2012
rem
rem     Last tested:
rem             19.11.0.0
rem 

execute dbms_random.seed(0)

create table t1 (
        v1      varchar2(7)
);

create index t1_i1 on t1(v1);

begin
        for i in 1..1e6 loop
                insert into t1(v1) values(
                        to_char(1e6 + trunc(dbms_random.value(0,100000)))
                );
        end loop;
end;
/

commit;

column ind_id new_value m_ind_id

select  object_id ind_id
from    user_objects
where   object_name = 'T1_I1'
;

alter session set tracefile_identifier = 'creation';
alter session set events 'immediate trace name treedump level &m_ind_id';
alter system flush buffer_cache;

pause Check the tree dump and pick a leaf block to dump

-- alter system dump datafile &&datafile block &&block_id;
alter system dump datafile 36 block 5429;


prompt  ========================
prompt  Deleting 4 rows out of 5
prompt  ========================

delete  from t1 
where   mod(v1,5) != 0
;

commit;

alter session set tracefile_identifier = 'deletion';
alter session set events 'immediate trace name treedump level &m_ind_id';
alter system flush buffer_cache;

-- pause Check the tree dump and pick a leaf block to dump
-- alter system dump datafile &&datafile block &&block_id;
alter system dump datafile 36 block 5429;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          =>'T1',
                method_opt       => 'for all columns size 1',
                cascade          => true
        );
end;
/

select
        rows_per_block,
        count(*)        block_count
from    (
        select
                /*+
                        dynamic_sampling(0)
                        index_ffs(t1,t1_i1)
                        noparallel_index(t,t1_i1)
                */
                sys_op_lbid( &m_ind_id ,'L',t1.rowid)   as block_id,
                count(*)                                as rows_per_block
        from
                t1
        group by
                sys_op_lbid( &m_ind_id ,'L',t1.rowid)
        )
group by
        rows_per_block
order by
        rows_per_block
;

@@dbms_space_use_assm_embedded test_user t1_i1 index

Unusually (for me) I’ve created the data by inserting rows one at a time after creating the index. This is to avoid starting from a “perfect” index i.e. one where the physical ordering of the leaf blocks is closely correlated with the logical ordering with leaf blocks that are very well packed. With a single session inserting rows there will be a visible pattern to the choice that Oracle makes for “the next leaf block” when it needs a leaf block split, but with the random insertions there won’t be a pattern in “which block just split” so when you walk the index in order the steps from one leaf block to the next will jump fairly randomly around the segment.

The table starts at 1,000,000 rows, but ends up with about 200,000 and an index where roughly 75% of the rows in each leaf block have been deleted. So that we know what the tests start from I’ve done a treedump of the index before and after the delete (and included a pause in the script to allow you to find a dump to block from the treedump if you want to) with the following results:

Before:
----- begin tree dump
branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000438 150996024 (-1: nrow: 401, level: 1)
      leaf: 0x900016d 150995309 (-1: row:222.222 avs:3778)
      leaf: 0x900154e 151000398 (0: row:218.218 avs:3854)
      leaf: 0x9000abd 150997693 (1: row:219.219 avs:3835)
      leaf: 0x900153e 151000382 (2: row:209.209 avs:4025)
      leaf: 0x900058d 150996365 (3: row:230.230 avs:3626)
      leaf: 0x90013a8 150999976 (4: row:229.229 avs:3645)
      leaf: 0x9000ae1 150997729 (5: row:411.411 avs:187)
      leaf: 0x900031c 150995740 (6: row:227.227 avs:3683)
      leaf: 0x90014d3 151000275 (7: row:229.229 avs:3645)
      leaf: 0x9000aec 150997740 (8: row:226.226 avs:3702)
      leaf: 0x90014f3 151000307 (9: row:226.226 avs:3702)
      leaf: 0x9000593 150996371 (10: row:219.219 avs:3835)
      leaf: 0x9001559 151000409 (11: row:223.223 avs:3759)
      leaf: 0x9000a9d 150997661 (12: row:210.210 avs:4006)
      leaf: 0x900152e 151000366 (13: row:215.215 avs:3911)
      leaf: 0x900018a 150995338 (14: row:258.258 avs:3094)
...


After:
----- begin tree dump
branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000438 150996024 (-1: nrow: 401, level: 1)
      leaf: 0x900016d 150995309 (-1: row:222.47 avs:3778)
      leaf: 0x900154e 151000398 (0: row:218.52 avs:3854)
      leaf: 0x9000abd 150997693 (1: row:219.44 avs:3835)
      leaf: 0x900153e 151000382 (2: row:209.43 avs:4025)
      leaf: 0x900058d 150996365 (3: row:230.44 avs:3626)
      leaf: 0x90013a8 150999976 (4: row:229.45 avs:3645)
      leaf: 0x9000ae1 150997729 (5: row:411.88 avs:187)
      leaf: 0x900031c 150995740 (6: row:227.50 avs:3683)
      leaf: 0x90014d3 151000275 (7: row:229.42 avs:3645)
      leaf: 0x9000aec 150997740 (8: row:226.46 avs:3702)
      leaf: 0x90014f3 151000307 (9: row:226.57 avs:3702)
      leaf: 0x9000593 150996371 (10: row:219.46 avs:3835)
      leaf: 0x9001559 151000409 (11: row:223.54 avs:3759)
      leaf: 0x9000a9d 150997661 (12: row:210.33 avs:4006)
      leaf: 0x900152e 151000366 (13: row:215.30 avs:3911)
      leaf: 0x900018a 150995338 (14: row:258.52 avs:3094)
...
      leaf: 0x900077f 150996863 (398: row:356.64 avs:1232)
      leaf: 0x9000d67 150998375 (399: row:327.62 avs:1783)
   branch: 0x9000e45 150998597 (0: nrow: 378, level: 1)
      leaf: 0x900047a 150996090 (-1: row:342.86 avs:1498)
      leaf: 0x9000d46 150998342 (0: row:357.60 avs:1213)
...
...
      leaf: 0x9000607 150996487 (492: row:369.80 avs:985)
      leaf: 0x9000c60 150998112 (493: row:395.70 avs:491)
   branch: 0x9000c68 150998120 (6: nrow: 503, level: 1)
      leaf: 0x90001b2 150995378 (-1: row:235.60 avs:3531)
      leaf: 0x9001323 150999843 (0: row:230.54 avs:3626)

The “before” section is just the first few lines of 3,538 and shows us that we have a root block with 8 branch blocks (numbered from -1 to +6), and the first branch block holds 401 leaf blocks(numbered from -1 to 399), and the first leaf block starts with 222 index entries (in its row directory) of which, we learn form the “after” section, 47 (i.e. roughly 20%) are still “in use” after the delete. The “after” section adds in a few extra lines from the treedump, around branch block 0 and branch block 6.

In passing, if I execute a new transaction that inserts a new index entry into the first leaf block Oracle will tidy its directory and the start of the tree dump will now look like the following:

 branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000438 150996024 (-1: nrow: 401, level: 1)
      leaf: 0x900016d 150995309 (-1: row:48.48 avs:7084)
      leaf: 0x900154e 151000398 (0: row:218.52 avs:3854)

After the initial big insert many of the leaf blocks hold around 220 rows, but we can see one leaf block holding 411 rows in the initial 16 shown. Allowing for the 9 non-leaf blocks we can calculate that we should see an average of approximately 1,000,000/3,529 = 283 rows per leaf block; the variation is the result of leaf block “50/50” splits. When a leaf block is full the next attempted insert causes Oracle to attach a new leaf block to the structure and share the existing entries fairly evenly between the two blocks (although there is one special case, the so-called “90/10” split that can happen at the highest value leaf block). The shares are not exactly equal because Oracle has to insert a new branch pointer at the same time and may be able to reduce the size of this branch pointer by moving the split point some way from the “fair share” 50/50 point.

Of course, there’s also some variation in the content of the leaf blocks because they tend to start refilling shortly after they’ve split, so it can be quite instructive (when your system reaches “steady state” to produce a “histogram” of leaf contents – which is what the last SQL statement in my setup script is about, with the following results:

ROWS_PER_BLOCK BLOCK_COUNT
-------------- -----------
            24           1
            26           1
            27           1
            28           5
            29           5
            30           7
            31          11
            32          11
            33          26
            34          23
            35          28
            36          28
            37          49
            38          47
            39          43
            40          49
            41          62
            42          73
            43          81
            44          92
            45          98
            46          91
            47         117
            48         104
            49         124
            50         124
            51         117
            52         114
            53         106
            54         123
            55         109
            56         104
            57          96
            58          84
            59          70
            60          95
            61          57
            62          73
            63          77
            64          74
            65          66
            66          56
            67          52
            68          54
            69          59
            70          44
            71          56
            72          49
            73          47
            74          51
            75          27
            76          34
            77          29
            78          27
            79          25
            80          27
            81          28
            82          26
            83          16
            84          23
            85          16
            86          18
            87          11
            88          19
            89          16
            90          10
            91          11
            92           5
            93           5
            94           2
            95           3
            96           4
            97           3
            99           3
           100           4
           103           1
           107           1
           119           1

78 rows selected.

The result (because it’s randomly arriving values) is fairly close to the bell curve of the Normal distribution centred at around 50 rows. There’s a fairly long tail up to 119 rows, but that’s likely to be because the index state hadn’t quite reached steady state before the big delete.

Having dumped a leaf block I know that a completely packed leaf block could hold 420 rows, at pctfree 10 that would mean 378 rows, and at 70% utilisation (which is what I expect with random arrival) an average of 294 rows generating an index of 3,400 leaf blocks rather than the 3,529 I got. (Again, I think the divergence from expectation are probably related to needing more time to get to steady state.)

The final call in the script is to a stripped down version of some code I published a few years back; the relevance of the numbers when applied to indexes is desribed in this blog note and shows the following:

Unformatted                   :           62 /          507,904
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :           45 /          368,640
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :            0 /                0
Full                          :        3,545 /       29,040,640

PL/SQL procedure successfully completed.

Segment Total blocks:        3,712
Object Unused blocks:            0

PL/SQL procedure successfully completed.

Freespace 2 is the label given to the set of blocks that are available for use (empty) whether or not they are in the index structure. Of course, given that the pattern of work so far it’s fairly safe to assume that in this case they are “formatted but not yet attached to the index structure”.

A quick arithmetic check highlights an apparent discrepancy: 62 + 45 + 3,545 = 3,652, which is 60 blocks short of the number in the segment; but that’s okay because I have 29 uniform extents of 1MB in the segment, which means 2 space management level 1 bitmap blocks per extent plus one level 2 bitmap block, plus the segment header block – for a total of 60 space management blocks.

The thing I’m not keen on is that the space management blocks are reporting 3,545 Full blocks, when the treedump showed 3,538 blocks – where did the extra 7 come from. But I’m not going to worry about that for this blog note.

Tests and results

The following block of code shows the full set of logging and tracing that I did – though not necessarily all in a single run – for each of the three options. The code in this case is wrapped around a call to coalesce:

alter session set tracefile_identifier = 'coalesce';
alter session set events 'immediate trace name treedump level &m_ind_id';

execute snap_enqueues.start_snap
execute snap_rollstats.start_snap
execute snap_my_stats.start_snap
execute snap_redo.start_snap

alter session set events 'trace[ksq] disk=medium';

column current_scn new_value start_scn
select to_char(current_scn,'9999999999999999') current_scn from v$database;

alter index t1_i1 coalesce;

column current_scn new_value end_scn
select to_char(current_scn,'9999999999999999') current_scn from v$database;

alter session set events 'trace[ksq] off';

execute snap_redo.end_snap
execute snap_my_stats.end_snap
execute snap_rollstats.end_snap
execute snap_enqueues.end_snap

alter session set events 'immediate trace name treedump level &m_ind_id';

alter session set tracefile_identifier='coalesce_redo';
alter system dump redo scn min &start_scn scn max &end_scn ;
alter session set tracefile_identifier='';

@@dbms_space_use_assm_embedded test_user t1_i1 index
@@index_histogram_embedded t1 t1_i1 &m_ind_id

Working from the top down:

  • Set an identifier to include in the trace file name.
  • take a starting treedump (which will go to that trace file)
  • take starting snapshots of
    • system level enqueue stats
    • system leve rollback stats
    • my session activity stats
    • a subset of session stats relating to redo
  • enable tracing of Enqueues (locks)
  • capture the current SCN in a define variable
  • coalesce the index
  • capture the final SCN in a define variable
  • report the change in the 4 sets of stats listed above
  • save the ending treedump to the trace file
  • set a new identifier for the tracefile name
  • dump all the redo generated while the coalesce was going on
  • Call a script to report the space usage for the index segment
  • Call a script to report the histogram of leaf block usage again

The starting treedump will match the “post-delete” treedump above, of course, but it’s just a convenience for each test to have its before and after treedumps in the same trace file, and the redo dump (which includes redo from every active session) is so large – about 275MB – that it’s a good idea to keep it separate from the treedumps and enqueue trace.

The histogram script is just a wrapper for the two sys_op_lbid() queries shown earlier on. The space usage script is one we’ve already met.

A test run takes only a couple of minutes – and most of the time is spent inserting 1M rows into an indexed table one at a time. (The time it took to analyze the logs, traces and dumps is much longer, and the time to summarize and write up the results is longer still!)

Here, then, are the most interesting details from the three tests. Some of the comments I make are not immediately “proved” by the results I’m showing, but the volume of data required to supply corroborative would become excessive and very boring.

Coalesce

The first “big picture” item to look at after the coalesce is the space usage:

Unformatted                   :           62 /          507,904
Freespace 1 (  0 -  25% free) :            0 /                0
Freespace 2 ( 25 -  50% free) :        3,037 /       24,879,104
Freespace 3 ( 50 -  75% free) :            0 /                0
Freespace 4 ( 75 - 100% free) :            0 /                0
Full                          :          553 /        4,530,176

PL/SQL procedure successfully completed.

Segment Total blocks:        3,712
Object Unused blocks:            0

The index segment is 3,712 blocks, of which 553 are “Full”, and 3,037 are in the “Freespace 2” state which, for indexes, means they are empty and available for reuse. The coalesce hasn’t released space back to the tablespace but we can’t tell from these figures whether the 553 blocks full blocks are packed into the “bottom end” of the segment or scattered across the entire length of the segment. Or, to view it another way, the figues don’t tell us whether Oracle has been shuffling rows without completely re-arranging the block linkages or whether it’s also been moving rows so that it can reconnect leaf blocks in a way that leaves all the empty blocks above a notional highwater mark.

We can dig a little deeper by by looking at the treedump:

branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000438 150996024 (-1: nrow: 64, level: 1)
      leaf: 0x900016d 150995309 (-1: row:377.377 avs:833)
      leaf: 0x90014d3 151000275 (0: row:377.377 avs:833)
      leaf: 0x900118c 150999436 (1: row:377.377 avs:833)
      leaf: 0x9000370 150995824 (2: row:377.377 avs:833)
...
      leaf: 0x9000d2f 150998319 (61: row:377.377 avs:833)
      leaf: 0x9000d67 150998375 (62: row:114.114 avs:5830)
   branch: 0x9000e45 150998597 (0: nrow: 59, level: 1)
      leaf: 0x900047a 150996090 (-1: row:377.377 avs:833)
      leaf: 0x9000725 150996773 (0: row:377.377 avs:833)

...
...
      leaf: 0x9000a05 150997509 (67: row:377.377 avs:833)
      leaf: 0x900030d 150995725 (68: row:376.376 avs:852)
   branch: 0x9000c68 150998120 (6: nrow: 76, level: 1)
      leaf: 0x90001b2 150995378 (-1: row:60.60 avs:6856)
      leaf: 0x9001323 150999843 (0: row:377.377 avs:833)

The root block is still reporting the same number of level 1 branch blocks, but the branch blocks report far fewer leaf blocks each. Most of the leaf blocks report 377 index entries, but the first and last leaf blocks of each branch tend to show less than that.

I pointed out earlier on that with pctfree 10 we’d get 378 rows per leaf block if we recreated the index, but it looks like there’s a little overhead I didn’t allow for and we’ve actually got 377 from the coalesce. You’ll notice that a coalesce will actually reduce the number of index entries in a leaf block if it exceeds the limit set by pctfree (remember how the original treedump extracts showed one leaf block with 411 entries).

Coalesce does not act “across” branch blocks, which is why (a) the number of branch blocks is unchanged, and (b) why the number of rows in the last leaf block of a branch block may have fewer rows than the typical leaf blocks – coalesce will not move rows from the first leaf block of the next branch.

I’ve included a few lines from around the branches numbered 0 and 6 in this extract. If you compare them with the treedump taken just after the delete you’ll see that the coalesce has copied rows back from the second (0th) leaf of branch 0 into the first (-1th) leaf , but not from the second (0th) leaf into the first (-1th) leaf of branch 6. I don’t know why this is but perhaps it’s something to do with the relative number of rows in the first and second (-1th and 0th) leaf blocks – the same behaviour showed up at the start of branch 3 where the two leaf blocks had 58 and 63 rows respectively.

Getting back to the question of whether the “Freespace 2” blocks reported by the space usage procedure are still in the structure or whether they have been unlinked – the number of leaf blocks reported per branch block is fairly convincing – the empty leaf blocks have been detached from the structure and are simply flagged as free blocks in the space management level 1 bitmap. We could do a quick check of all the branch blocks (grep ” branch” from the trace file):

branch: 0x9000104 150995204 (0: nrow: 8, level: 2)
   branch: 0x9000438 150996024 (-1: nrow: 64, level: 1)
   branch: 0x9000e45 150998597 (0: nrow: 59, level: 1)
   branch: 0x90007d1 150996945 (1: nrow: 61, level: 1)
   branch: 0x9000e8a 150998666 (2: nrow: 66, level: 1)
   branch: 0x900043c 150996028 (3: nrow: 70, level: 1)
   branch: 0x9000e18 150998552 (4: nrow: 70, level: 1)
   branch: 0x900073d 150996797 (5: nrow: 70, level: 1)
   branch: 0x9000c68 150998120 (6: nrow: 76, level: 1)

Add up the nrow for the level 1 branches and you get 536; add 9 for the branch blocks themselves and you get 545 – and the space usage report says 553 (an unexplained error of 8 which I’ll get round to worrying about one day). I wonder if there’s any significance in how close it is to the error of 7 that we had before the coalesce.

We can learn more from the tree dump by walking the leaf blocks in order and checking their block addresses.

  • The first leaf block of the first level 1 branch block is 0x900016d before and after the coalesce.
  • The second leaf block of the this branch block is 0x90014d3 after the coalesce, but that was the address of leaf block number 7 before the coalesce.
  • The third leaf block is 0x900118c after the coalesce but was leaf block number 15 before the coalesce.

The coalesce has been walking the index in order, copying rows back to earlier leaf blocks and unlinking the block it’s working if it becomes empty. The ultimate effect of this is that the final set of index leaf blocks isn’t compacted into the smallest contiguous space possible, it’s scattered just as widely and randomly across the whole segment as it was before the coalesce.

We could go one step further to demonstrate this. Extract all the lines for leaf blocks from the treedump and sort them into order. Since I’m using 1MB exents I’d expect to see (nearly) 128 consecutive block addresses in order before a possible jump to a block in the next extent. Here are first few addresses when I do the experiment:

      leaf: 0x9000105 150995205 (59: row:377.377 avs:833)
      leaf: 0x9000108 150995208 (3: row:377.377 avs:833)
      leaf: 0x900010a 150995210 (52: row:377.377 avs:833)
      leaf: 0x9000115 150995221 (12: row:377.377 avs:833)
      leaf: 0x9000117 150995223 (63: row:377.377 avs:833)
      leaf: 0x900011e 150995230 (3: row:377.377 avs:833)
      leaf: 0x900011f 150995231 (53: row:377.377 avs:833)
      leaf: 0x900012b 150995243 (34: row:377.377 avs:833)
      leaf: 0x9000137 150995255 (34: row:377.377 avs:833)
      leaf: 0x900013a 150995258 (63: row:377.377 avs:833)
      leaf: 0x900013d 150995261 (43: row:377.377 avs:833)

You don’t have to be skilled at reading hex numbers to see all the gaps between the used block addresses.

Coalesce – Transactions

We now know where the index has got to, so the next question is how did it get there. The snapshot showing the change in rollback statistics (v$rollstat) is revealing.

USN   Ex Size K  HWM K  Opt K      Writes     Gets  Waits Shr Grow Shr K  Act K
----  -- ------  -----  -----      ------     ----  ----- --- ---- ----- ------
   0   0      0      0      0           0        1      0   0    0     0      0
   1   5   5120      0      0     5101714     1199      0   0    5     0 -28446
   2   6   6144      0      0     5278032     1245      0   0    6     0    275
   3   7   7168      0      0     5834744     1365      0   0    7     0  -1492
   4   1   8192      0      0     5944580     1378      0   0    1     0 -17281
   5   6   6144      0      0     5126248     1203      0   0    6     0    303
   6   4   4096      0      0     5076808     1189      0   0    4     0    -72
   7   6   6144      0      0     5244984     1239      0   0    6     0    127
   8   7   7168      0      0     5818394     1363      0   0    7     0    263
   9   7   7168      0      0     6017230     1401      0   0    7     0    213
  10   1   8192   8192      0     5060154     1178      0   0    1     0 -54488

My session was the only one active on the system, and it’s only a small play system so the only undo segments it has are the basic 10 that appear when you create the database (plus the one in the rollback segment in the system tablespace).

The critical numbers are the writes (bytes) and gets (blocks), which tell us that our single operation has behaved as a number of individual transactions which have been starting in different undo segments. Given the fairly even spread of bytes written it’s a good bet that we’re seeing a fairly large number of fairly small transactions.

We can following this up by looking at the snapshot of enqueues (locks):

Type    Requests       Waits     Success      Failed    Wait m/s Reason
----    --------       -----     -------      ------    -------- ------
CF             2           0           2           0           0 contention
XR             1           0           1           0           0 database force logging
TM             1           0           1           0           0 contention
TX         3,051           0       3,051           0           0 contention
HW            50           0          50           0           0 contention
TT            50           0          50           0           0 contention
CU            24           0          24           0           0 contention
OD             1           0           1           0           0 Serializing DDLs
JG           126           0         126           0           0 queue lock
JG            12           0          12           0           0 q mem clnup lck
JG           126           0         126           0           0 contention

The enqueue we’re interested in is the TX enqueue – and we saw more than 3,000 of them in the interval. (That’s interestingly close to the number of blocks in the index or, to be even fussier, the number of leaf blocks that have been emptied – but that might be a coincidence.)

You’ll notice, though, that there’s only 1 TM (table) lock request. Whatever else we’re doing we’re not locking and unlocking the table on every single transaction – so we need to find out what that lock is and whether it might be a threat to our application (a TM lock in mode 4, 5, or 6, held for the duration would be a disaster). And that’s why I enabled the ksq trace – here’s the extract from the trace file showing the acquisition of the TM lock.

2022-09-02 11:36:57.645*:ksq.c@9175:ksqgtlctx(): *** TM-000230A3-00000000-0039DED3-00000000 mode=2 flags=0x401 why=173 timeout=0 ***
2022-09-02 11:36:57.645*:ksq.c@9183:ksqgtlctx(): xcb=0x9bbeec68, ktcdix=2147483647 topxcb=0x9bbeec68 ktcipt(topxcb)=0x0
2022-09-02 11:36:57.645*:ksq.c@9203:ksqgtlctx(): ksqgtlctx: Initializing lock structure
2022-09-02 11:36:57.645*:ksq.c@9324:ksqgtlctx(): DID DUMP START
2022-09-02 11:36:57.645*:ksq.c@9328:ksqgtlctx():        ksqlkdid: 0001-0029-0000013C
2022-09-02 11:36:57.645*:ksq.c@9333:ksqgtlctx():        tktcmydid: 0001-0029-0000013C
2022-09-02 11:36:57.645*:ksq.c@9337:ksqgtlctx():        tksusesdi: 0000-0000-00000000
2022-09-02 11:36:57.645*:ksq.c@9341:ksqgtlctx():        tksusetxn: 0001-0029-0000013C
2022-09-02 11:36:57.645*:ksq.c@9343:ksqgtlctx(): DID DUMP END
2022-09-02 11:36:57.645*:ksq.c@9517:ksqgtlctx(): ksqgtlctx: did not find link
2022-09-02 11:36:57.645*:ksq.c@9687:ksqgtlctx(): ksqgtlctx: updated ksqlrar1, ksqlrar:0x9e7f7cb8, ksqlral:(nil)
2022-09-02 11:36:57.645*:ksq.c@9841:ksqgtlctx(): ksqgtlctx: updated ksqlral, ksqlral:0x9bac7bc0, res:0x9e7f7cb8
2022-09-02 11:36:57.645*:ksq.c@9851:ksqgtlctx(): ksqgtlctx: updated lock mode, mode:2 req:0
2022-09-02 11:36:57.645*:ksq.c@9960:ksqgtlctx(): SUCCESS

I’ve highlighted the line where the TM lock appears, reporting an “id1” of 000230A3, which is the object_id of the table t1. Take note of the other highlighted line which gives the address of the resource element used (res: 0x9e7f7cb8) because we can use this to find where the lock is released:

2022-09-02 11:36:58.735*:ksq.c@10367:ksqrcli_int(): ksqrcli_int: updated ksqlral, ksqlral:0x9bac7bc0, res:0x9e7f7cb8
2022-09-02 11:36:58.735*:ksq.c@10501:ksqrcli_int(): returns 0

This appears in the last few lines of the ksq trace, after the appearance of several thousand (brief) TX locks that have been acquired and released. So there is a low-impact table lock held for the duration of the coalesce that is not going to stop other sessions from updating the table (and its indexes).

There was one other lock released after the TM lock:

2022-09-02 11:36:58.769*:ksq.c@10367:ksqrcli_int(): ksqrcli_int: updated ksqlral, ksqlral:0x9e6b2370, res:0x9e7f0788
2022-09-02 11:36:58.769*:ksq.c@10501:ksqrcli_int(): returns 0

Working backwards using the resource address we find that this was an OD lock, taken immediately after the TM lock:

2022-09-02 11:36:57.645*:ksq.c@9175:ksqgtlctx(): *** OD-000230A4-00000000-0039DED3-00000000 mode=4 flags=0x10001 why=277 timeout=0 ***
2022-09-02 11:36:57.645*:ksq.c@9183:ksqgtlctx(): xcb=0x9bbeec68, ktcdix=2147483647 topxcb=0x9bbeec68 ktcipt(topxcb)=0x0
2022-09-02 11:36:57.645*:ksq.c@9203:ksqgtlctx(): ksqgtlctx: Initializing lock structure
2022-09-02 11:36:57.645*:ksq.c@9324:ksqgtlctx(): DID DUMP START
2022-09-02 11:36:57.645*:ksq.c@9328:ksqgtlctx():        ksqlkdid: 0001-0029-0000013C
2022-09-02 11:36:57.645*:ksq.c@9333:ksqgtlctx():        tktcmydid: 0001-0029-0000013C
2022-09-02 11:36:57.645*:ksq.c@9337:ksqgtlctx():        tksusesdi: 0000-0000-00000000
2022-09-02 11:36:57.645*:ksq.c@9341:ksqgtlctx():        tksusetxn: 0001-0029-0000013C
2022-09-02 11:36:57.645*:ksq.c@9343:ksqgtlctx(): DID DUMP END
2022-09-02 11:36:57.645*:ksq.c@9517:ksqgtlctx(): ksqgtlctx: did not find link
2022-09-02 11:36:57.645*:ksq.c@9687:ksqgtlctx(): ksqgtlctx: updated ksqlrar1, ksqlrar:0x9e7f0788, ksqlral:(nil)
2022-09-02 11:36:57.645*:ksq.c@9841:ksqgtlctx(): ksqgtlctx: updated ksqlral, ksqlral:0x9e6b2370, res:0x9e7f0788
2022-09-02 11:36:57.645*:ksq.c@9851:ksqgtlctx(): ksqgtlctx: updated lock mode, mode:4 req:0
2022-09-02 11:36:57.645*:ksq.c@9960:ksqgtlctx(): SUCCESS

Checking v$lock_type we see that the OD lock is the “Online DDLs” lock, with the description “Lock to prevent concurrent online DDLs” and its first parameter is the object_id of the object that is the targete of the DDL. The value in the trace file (000230A4) identifies the index that we are coalescing; at mode 4 the lock mode is fairly aggressive, but I’m surprised that it isn’t 6 – if we were to interpret the value the way we would for TM locks it would suggest that two sessions could coalesce the index at the same time!

Apart from 50 pairs of TT/HW locks (tablespace DDL / Segment Highwater mark) due to undo segments growing and shrinking, the rest of the ksq trace was taken up by 3,051 TX locks, typically reporting their acquisition and release on adjacent lines of the trace, e.g.:

2022-09-02 11:36:57.650*:ksq.c@9100:ksqgtlctx(): ksqtgtlctx: PDB mode
2022-09-02 11:36:57.650*:ksq.c@9175:ksqgtlctx(): *** TX-0004001C-00002F83-0039DED3-00000000 mode=6 flags=0x401 why=176 timeout=0 ***
2022-09-02 11:36:57.650*:ksq.c@9183:ksqgtlctx(): xcb=0x9bd37110, ktcdix=2147483647 topxcb=0x9bbeec68 ktcipt(topxcb)=0x0
2022-09-02 11:36:57.650*:ksq.c@9203:ksqgtlctx(): ksqgtlctx: Initializing lock structure
2022-09-02 11:36:57.650*:ksq.c@9324:ksqgtlctx(): DID DUMP START
2022-09-02 11:36:57.650*:ksq.c@9328:ksqgtlctx():        ksqlkdid: 0001-0029-0000013C
2022-09-02 11:36:57.650*:ksq.c@9333:ksqgtlctx():        tktcmydid: 0001-0029-0000013C
2022-09-02 11:36:57.650*:ksq.c@9337:ksqgtlctx():        tksusesdi: 0000-0000-00000000
2022-09-02 11:36:57.650*:ksq.c@9341:ksqgtlctx():        tksusetxn: 0001-0029-0000013C
2022-09-02 11:36:57.650*:ksq.c@9343:ksqgtlctx(): DID DUMP END
2022-09-02 11:36:57.650*:ksq.c@9517:ksqgtlctx(): ksqgtlctx: did not find link
2022-09-02 11:36:57.650*:ksq.c@9687:ksqgtlctx(): ksqgtlctx: updated ksqlrar1, ksqlrar:0x9e80b098, ksqlral:(nil)
2022-09-02 11:36:57.650*:ksq.c@9841:ksqgtlctx(): ksqgtlctx: updated ksqlral, ksqlral:0x9bd37148, res:0x9e80b098
2022-09-02 11:36:57.650*:ksq.c@9851:ksqgtlctx(): ksqgtlctx: updated lock mode, mode:6 req:0
2022-09-02 11:36:57.650*:ksq.c@9960:ksqgtlctx(): SUCCESS
2022-09-02 11:36:57.650*:ksq.c@10367:ksqrcli_int(): ksqrcli_int: updated ksqlral, ksqlral:0x9bd37148, res:0x9e80b098
2022-09-02 11:36:57.650*:ksq.c@10501:ksqrcli_int(): returns 0
Coalesce – Workload

We’ve examined the end result of a coalesce, and seen something of the mechanism that Oracle adopts to get to that result, but what does it cost (in terms of work done)? In many cases it’s sufficient to limit the analysis to:

  • how much I/O
  • how much CPU
  • how much undo and redo generated

The obvious I/O comes from the requirement to walk the index in leaf block order, and the dbwr will eventyally have to write back every block (including the empty ones). But that I/O, and the inevitable CPU usage is not particularly interesting, what’s more interesting (and more of a threat) is the impact of the undo and redo. This is where the snapsthos of session stats and redo stats give us the information we need to know, and all I’m going to look at are the redo-related stats for the test:

Name                                                                     Value
----                                                                     -----
messages sent                                                               92
calls to kcmgcs                                                             81
calls to kcmgas                                                          6,118
calls to get snapshot scn: kcmgss                                        3,075
redo entries                                                            34,810
redo size                                                           76,475,936
redo buffer allocation retries                                              39
redo subscn max counts                                                   1,049
redo synch time                                                              3
redo synch time (usec)                                                  33,207
redo synch time overhead (usec)                                            159
redo synch time overhead count (  2ms)                                       1
redo synch writes                                                            1
redo write info find                                                         1
undo change vector size                                             55,062,672
rollback changes - undo records applied                                    287

Bearing in mind that this index started at roughly 3,600 blocks / 28MB and coalesced to roughly 560 blocks / 4.5MB I want to draw your attention to just three of the figures (highlighted): the number and total size of redo records generated, and the volume of undo generated. 23,810 redo records, 75MB or redo, of which 55MB was due to undo.

It’s nice to see that the undo figure is consistent with the sum of the writes we saw in the snapshot of v$rollstat. But the numbers warn us that there’s a lot of work going into a coalesce – and it could have a big impact on other users. My session is generating a lot of undo, and it’s cycling through every undo segment as it does so – that means other sessions that need to create read-consistent images of recently changed data blocks that are completely unrelated to my index may have to work backwards through a large number of undo blocks trying to find upper bound SCNs (check for statistics like: ‘transaction tables consistent read%’)

You’ll notice that I’ve also reported “rollback changes – undo records applied”; these are appearing because of a “DML restarts” issue that makes a statement rollback and try again the first time it grows an undo segment. Luckily all my transactions are very small so each individual transaction won’t suffer much if it has to restart, but if you have a long running DML statement and I keep filling and extending undo segments (possibly shrinking other undo segments to do so) that’s going to increase your chances of a finding your undo segment is full and doing a huge rollback and restart of the statement. Be very careful about timing your coalesce commands.

Since I’ve dumped all the redo for the duration of the test run, I’ll finish by showing a little analysis of the results. The trace file for this 28MB index was over 250MB, so it’s not something you’d generate on a production size coalesce.

All I’m going to do is use grep to pull out the redo OP codes of every change vector in the file and show you a couple of extracts. First a commonly occurring pattern:

CHANGE #1 CON_ID:3 TYP:2 CLS:1 AFN:36 DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686d SEQ:1 OP:4.1 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:2 CLS:1 AFN:36 DBA:0x0900152e OBJ:143528 SCN:0x00000000024f686d SEQ:1 OP:4.1 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:31 AFN:17 DBA:0x04402750 OBJ:4294967295 SCN:0x00000000024f685d SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:1 CLS:32 AFN:17 DBA:0x0440ec96 OBJ:4294967295 SCN:0x00000000024f686e SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x0900152e OBJ:143528 SCN:0x00000000024f686e SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:32 AFN:17 DBA:0x0440ec96 OBJ:4294967295 SCN:0x00000000024f686e SEQ:2 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686e SEQ:1 OP:10.6 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:32 AFN:17 DBA:0x0440ec96 OBJ:4294967295 SCN:0x00000000024f686e SEQ:3 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:2 CLS:1 AFN:36 DBA:0x0900018a OBJ:143528 SCN:0x00000000024f67c1 SEQ:1 OP:10.11 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:2 CLS:1 AFN:36 DBA:0x09000438 OBJ:143528 SCN:0x00000000024f686d SEQ:1 OP:4.1 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:32 AFN:17 DBA:0x0440ec96 OBJ:4294967295 SCN:0x00000000024f686e SEQ:4 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x09000438 OBJ:143528 
SCN:0x00000000024f686e SEQ:1 OP:10.39 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:31 AFN:17 DBA:0x04402750 OBJ:4294967295 SCN:0x00000000024f686e SEQ:1 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:1 CLS:32 AFN:17 DBA:0x0440ec97 OBJ:4294967295 SCN:0x00000000024f686e SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686e SEQ:2 OP:10.8 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:31 AFN:17 DBA:0x04402750 OBJ:4294967295 SCN:0x00000000024f686e SEQ:2 OP:5.2 ENC:0 RBL:0 FLG:0x0000
CHANGE #2 CON_ID:3 TYP:1 CLS:32 AFN:17 DBA:0x0440ec98 OBJ:4294967295 SCN:0x00000000024f686e SEQ:1 OP:5.1 ENC:0 RBL:0 FLG:0x0000
CHANGE #3 CON_ID:3 TYP:0 CLS:1 AFN:36 DBA:0x0900152e OBJ:143528 SCN:0x00000000024f686e SEQ:2 OP:10.34 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:8 AFN:36 DBA:0x09001500 OBJ:143528 SCN:0x00000000024f685a SEQ:1 OP:13.22 ENC:0 RBL:0 FLG:0x0000

CHANGE #1 CON_ID:3 TYP:0 CLS:31 AFN:17 DBA:0x04402750 OBJ:4294967295 SCN:0x00000000024f686e SEQ:3 OP:5.4 ENC:0 RBL:0 FLG:0x0000

The last line is Op Code 5.4, a commit (or rollback), and I picked a batch of rows between one commit and the next, so this entire set of 20 change vectors is a single transaction taking place in the coalesce. I’ve placed gaps before every “Change #1” to show the boundaries between redo records. So you can see that a typical small transaction is 11 redo records – that’s another sanity check – we saw roughly 3,000 TX enqueues, and 34,800 redo entries: 11 * 3,000 = 33,000, which is a good enough match.

Op Code 5.2 is “get next undo block”, Op Code 5.1 is “create undo record”, so I’m going to simplify the list by removing those codes; and removing some of the irrelevant material from the start and end of each line the example reduces to:

DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686d SEQ:1 OP:4.1 
DBA:0x0900152e OBJ:143528 SCN:0x00000000024f686d SEQ:1 OP:4.1 
DBA:0x0900152e OBJ:143528 SCN:0x00000000024f686e SEQ:1 OP:10.6 
DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686e SEQ:1 OP:10.6 
DBA:0x0900018a OBJ:143528 SCN:0x00000000024f67c1 SEQ:1 OP:10.11 
DBA:0x09000438 OBJ:143528 SCN:0x00000000024f686d SEQ:1 OP:4.1 
DBA:0x09000438 OBJ:143528 SCN:0x00000000024f686e SEQ:1 OP:10.39
DBA:0x090014d3 OBJ:143528 SCN:0x00000000024f686e SEQ:2 OP:10.8 
DBA:0x0900152e OBJ:143528 SCN:0x00000000024f686e SEQ:2 OP:10.34 
DBA:0x09001500 OBJ:143528 SCN:0x00000000024f685a SEQ:1 OP:13.22 

Translating the OP Codes (and adding in a little information I have about which blocks the block addresses (DBA) correspond to, this is what the transcation does

  • block cleanout of leaf block 0x090014d3 (4.1)
  • block cleanout of leaf block 0x0900152e (4.1)
  • lock leaf block 0x0900152e (10.6)
  • lock leaf block (0x090014d3 (10.6)
  • change the “pointer to previous” of leaf block 0x0900018a (10.11)
  • block cleanout of branch block 0x09000438 (4.1)
  • update branch block 0x09000438, delete one leaf entry (10.39)
  • create new version of leaf block 0x090014d3 (10.8)
  • create empty version of leaf block 0x0900152e (10.34)
  • update space management level 1 bitmap block (13.22)

So where does the huge amount of redo appear. If we looked at the 11 Redo Record headers for the extract we could use the LEN information to point us to the cirtical bits:

REDO RECORD - Thread:1 RBA: 0x000391.000174dc.01c0 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174dd.0028 LEN: 0x0060 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174dd.0088 LEN: 0x0164 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174de.0010 LEN: 0x00e4 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174de.00f4 LEN: 0x00ec VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174de.01e0 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174df.0048 LEN: 0x0104 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174df.014c LEN: 0x3a88 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.000174fd.01b4 LEN: 0x20bc VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.0001750e.0180 LEN: 0x0064 VLD: 0x01 CON_UID: 3792595
REDO RECORD - Thread:1 RBA: 0x000391.0001750e.01e4 LEN: 0x0058 VLD: 0x01 CON_UID: 3792595

I’ve highlighted the two big ones – records 8 and 9, which are the ones holding the 10.8 (create new leaf) and 10.34 (make block empty). Why are they so big at 14,984 bytes and 8,380 bytes respectively?

Record 8 includes a change vector (5.1) for the undo of the replaced block which is a block image at 8,032 bytes, and a change vector for the new version of the block in a format similar to an array insert which happened to have 344 rows at this point for a size of roughly 6,500 bytes.

Record 9 includes a change vector (5.1) for the undo of the wiped block, again a block image of 8,032 bytes. But the 10.34 (make empty) is only a few tens of bytes.

My test shows a particularly nasty effect of coalesce and its “pairwise” clean-up. Checking the “post-delete” tree dump I can see I’ve emptied leaf block 0x0900152e by copying 30 rows back into leaf block 0x090014d3, and I can see that this is the fifth block that I’ve emptied into 0x090014d3, and I can see that I’ll be doing one more pass to get that block full; and each time I do this I dump two block images, and an “array-update” redo change vector that gets bigger and bigger until it’s nearly the full 8KB. The operation generates a lot of undo and a lot of redo.

Coalesce – concurrency

As a quick test of what happens when other work is going on on the table I ran a little script to insert an extra 100 rows (without committing) into the table just after the big delete, but just before the coalesce, generating random values from the same range as the original values.

The coalesce didn’t seem to take any extra time, and I didn’t see any enqueue waits of buffer busy waits (though a different test of 3,000 rapid single row inserts with commits while the coalesce was running manage to get one buffer busy wait on a branch block).

The final result, though was not very good. With 100 uncommitted inserts getting in the way the index report 687 “full” blocks rather than the 553 that we saw originally. That’s an increase of more than one block per row inserted.

Basically it looks as if Oracle hits block with an uncommitted change and says – I can’t copy those rows backwards so I’ll have to leave the current block wherever I’ve got to, skip the modified block and restart the coalesce in the next block along; so every block with an uncommitted change could result in two extra blocks ultimately not being packed as well as they could be.

shrink space compact

Coming soon

shrink space

Coming soon

Summary (so far)

alter index xxx coalesce is an online operation which reads through the index in order, copying index entries backwards to fill leaf blocks (according to the pctfree setting). This is not a row-by-row process, the session constructs full leaf blocks in private memory and “initialises” them into the database, re-initialising any blocks that have been made empty at the same time. The leaf block that was the last one to contribute index entries to the full block will be used as the target for the next fill – unless the transfer process left it completely empty in which case Oracle starts with the next leaf block in the index.

When the process empties a leaf block, Oracle unlinks it from the index structure. This means deleting an entry from the level 1 branch block and modifying the two leaf blocks either side of the empty block so that the “next pointer” of the previous block points to the next leaf block, and the “previous pointer” of the next block points to the previous block. The bitmap entry for the the empty block can then be set to report it as “Freespace 2” – ready for re-use anywhere else in the index.

The copy-back process doesn’t cross the boundaries of level 1 branch blocks, so the last leaf block for a branch block may not reach the limit of rows it’s allowed to hold. Also the first leaf block of a branch block is never substituted, and it may (for reasons I don’t know) end up holding less than the expected maximum row count.

It is possible to make a reasonably estimate of the undo and redo generated by a call to coalesce an index.

The coalesce operates as a large number of small transactions (working through the index on pairs of adjacent blocks) and will cycle through all the undo segments in your undo tablespace.

If a coalesce reaches a block that holds an active transaction it will skip the block and move on to the next available leaf block, so a little light activity on the index could significantly reduce the effectiveness of the coalesce and different indexes on the same table could be affected very differently because of the pattern that exist in the data and its indexes.

Pages