Quantcast
Viewing all articles
Browse latest Browse all 34

Options for Tracing Oracle dbms_stats

A Baffling Case of ORA-00942

Recently we had a good reason to find out what was going on behind the scenes with dbms_stats.The automatic stats collection job in a client’s database was frequently failing due to ORA-00942 table not found errors.

There were a number of red herrings involved in finding a solution to this issue.

Red Herrings

One of the red herrings involved the number of tables that were being created and dropped on a daily basis.  At the time this was occurring, up to 200,000 tables may have been created and dropped daily.

Running a few tests though caused us to discard this theory.

Here is what happens if you delete a table while stats are being gathered on the table:

  • T1: Session 1: create table xx as select * from sometable with 7m rows
  • T2: Session 2: exec dbms_stats.gather_table_stats(‘JKSTILL’,’XX’,estimate_percent => 100)
  • T3: Session 1: drop table xx;
  • T4: Session 2: after 20 seconds or so:  ORA-08103: object no longer exists

A similar error occurs if the table is dropped while one of the indexes is being analyzed:  ORA-01418: specified index does not exist

Enable Tracing

The next obvious step seemed to be setting an event in the instance that would cause a trace file to be generated when an ORA-00942 was encountered.  Following is the command used:

alter system set events '942 trace name errorstack level 3’;

We set the parameter, and that evening the ORA-00942 did appear in the alert log. We checked the generated trace file and found nothing of value.  The only information in the trace file was the Oracle banner and the session information. A search was made of all trace files, and nothing was found.  As this was a three node RAC system, all nodes were searched.

This was a little perplexing, as testing this event on another 11.2 RAC system provided a trace file.  There were a couple of differences: the production RAC was Solaris, while the testing was performed on a Linux system.  What may be more likely, however, is that jobs run via the autotask process may have been ignoring the events directive – more on that later.

(At this point I need to mention that we may have neglected to run the ‘alter system set events’ on all instances.  However, that may have been a case of serendipity as you shall see.)

DBMS_STATS Tracing

After some diligent searching, I found that dbms_stats has its own built in tracing routines.

Tracing is enabled by calling dbms_stats.set_global_prefs(‘trace’,<trace flags>)

Following are the possible values for the trace flags:

1 = use dbms_output.put_line instead of writing into trace file

2 = enable dbms_stat trace only at session level

4 = trace table stats

8 = trace index stats

16 = trace column stats

32 = trace auto stats – logs to sys.stats_target$_log

64 = trace scaling

128 = dump backtrace on error

256 = dubious stats detection

512 = auto stats job

1024 = parallel execution tracing

2048 = print query before execution

4096 = partition prune tracing

8192 = trace stat differences

16384 = trace extended column stats gathering

32768 = trace approximate NDV (number distinct values) gathering

Use a combination of flags to combine different tracing levels.

In our case, we wanted to trace the auto stats job, and dump a trace on error, so the following options were used:

exec dbms_stats.set_global_prefs('trace',to_char(512+128))

The next day, however, the story was a familiar one: ORA-00942 appeared in the alert log as the reason the stats job failed, but no useful information appeared in the trace files.

Even More Tracing

Time to get more aggressive. I would replace the default stats gathering job with a wrapper that would enable both the dbms_stats specific tracing, and 10046 tracing, all in the same session that was running the stats job.

The following steps were taken to finally get a useful trace of the stats job, and find out just what statement was causing the ORA-00942 error.

First, a wrapper was created:

create table last_sql_tab (
   sql_date timestamp with local time zone not null,
   sql_error number not null,
   sql_id varchar2(13) not null,
   sql_fulltext clob not null
)
tablespace sysaux
/

-- db_stats_wrapper.sql
-- use a wrapper for dbms_stats.gather_database_stats_job_proc
-- so that tracing can be set if needed
create or replace procedure sys.db_stats_wrapper
as
   error_code number;
   v_sql_fulltext clob;
   v_curr_sqlid varchar2(13);
   v_prev_sqlid varchar2(13);
begin
   --    2 = enable dbms_stat trace only at session level
   --  128 = dump backtrace on error
   --  512 = auto stats job
   -- 2048 = print query before execution

   dbms_application_info.set_module(
      module_name => 'sys.db_stats_wrapper',
      action_name => 'auto optimizer stats collection'
   );

   dbms_stats.set_param('trace',to_char(2 + 2128 + 512 + 2048));
   execute immediate q'[alter session set events '10046 trace name context forever, level 4']';
   dbms_stats.gather_database_stats_job_proc;

exception
when others then
   error_code := sqlcode;
   select sql_id, prev_sql_id into v_curr_sqlid, v_prev_sqlid
   from v$session
   where sid = sys_context('userenv','sid');

   -- hopefully the sql_id and prev_sql_id are still available in v$session
   -- and v$sqlstats
   insert into last_sql_tab(sql_date, sql_error, sql_id, sql_fulltext)
   select systimestamp , error_code, sql_id, sql_fulltext
   from v$sqlstats
   where sql_id in (v_curr_sqlid,v_prev_sqlid);

   commit;

   -- reset to the default gather stats program once we have a failure with trace file
   DBMS_SCHEDULER.SET_ATTRIBUTE (
      name           => 'GATHER_STATS_PROG',
      attribute      => 'program_action',
      value          => 'dbms_stats.gather_database_stats_job_proc'
   );

end;
/

The next step was to set the new wrapper as the procedure for the auto stats job to run:

BEGIN
   DBMS_SCHEDULER.SET_ATTRIBUTE (
      name           => 'GATHER_STATS_PROG',
      attribute      => 'program_action',
      value          => 'sys.db_stats_wrapper'
   );
END;
/

The next day, success!  While the stats job failed as expected with ORA-00942, this time we had useful trace files.

While a lot of interesting information could be found in the trace files created by the dbms_stats tracing mechanism, ultimately the issue was found in the 10046 output, near the end of the trace file:

 ERROR #18446740225354926104:err=942 tim=5273897072222

Looking for the first occurrence of the cursor address, the failing statement was located.  I won’t reproduce all of it here as it is quite large (16k).  There is no doubt that the query is used to gather stats:

PARSING IN CURSOR #18446740225354926104 len=16707 dep=2 uid=0 oct=189 lid=0 tim=5270675200777 hv=3012852732 ad='1991c6b148' sqlid='63n9pwutt8yzw'
MERGE /*+ dynamic_sampling(ST 4) dynamic_sampling_est_cdn(ST) */ INTO STATS_TARGET$ ST USING (SELECT STALENESS, OSIZE, OBJ#,…

So the culprit has been found, now what?  There was nothing obviously amiss with the SQL statement, so a new SR was opened with Oracle Support.  After a couple days of discussions with Support, we were no closer to a solution. It seems this was something they had not seen until now.

The Value of Networking

While discussing this topic with one of my fellow DBAs another topic came up. During the discussion, my co-worker mentioned that he had experienced issues caused by one of Oracle’s hidden parameters being set to a non-default value.

The parameter in question:  _optimizer_push_pred_cost_based

Excusing myself, I immediately logged on to the database and checked the value for that parameter.

1  select
2     a.KSPPINM NAME
3     , a.KSPPDESC DESCRIPTION
4     , b.KSPFTCTXVL VALUE
5     , decode(b.KSPFTCTXDF,'TRUE','Y','N') ISDEFAULT
6  from X$KSPPI a, x$ksppcv2 b
7  where b.kspftctxpn = a.indx+1
8  and  a.KSPPINM like '_optimizer_push_pred_cost_based'
9* order by name
14:43:31 oravm01.jks.com - sys@oravm1 SQL> /

DEF
PARAMETER NAME                      DESCRIPTION                      VALUE    VAL?
----------------------------------- -------------------------------- -------- ----
_optimizer_push_pred_cost_based     use cost-based query transformat FALSE    N
ion for push pred optimization

The parameter was set to a non-default value. Might this be the cause of the problem?

The Culprit is Found

A search of Oracle Support revealed exactly what I was looking for:

Bug 14341002 : WRONG RESULT ON “UNION-ALL PARTITION” “_OPTIMIZER_PUSH_PRED_COST_BASED=FALSE”

This note contains test cases to show what can happen when _optimizer_push_pred_cost_based = FALSE and there are UNION ALL set operators present in the SQL.

Going back to the trace file, I checked to see if SQL in question contained any UNION ALL operators. The answer is yes, it did, as 11 of them were found in the SQL.  It looked like the solution had been found.

The Final Test

Once again a wrapper was employed to test the effects of changes on the auto stats job.

This time the purpose was not to trace, but to reset the value of _optimizer_push_pred_cost_based back to the default value of TRUE for the session that would gather database statistics.

-- db_stats_wrapper_2.sql
-- use a wrapper for dbms_stats.gather_database_stats_job_proc
-- setting parameters that will allow stats to (usually) complete without ORA-00942
create or replace procedure sys.db_stats_wrapper_2
as
   error_code number;
   v_sql_fulltext clob;
   v_curr_sqlid varchar2(13);
   v_prev_sqlid varchar2(13);
   n_table_not_found number := -942;
   table_not_found exception;
   pragma exception_init(table_not_found,-942);
begin
   dbms_application_info.set_module(
      module_name => 'sys.db_stats_wrapper_2',
      action_name => 'auto optimizer stats collection'
);
   -- default value for _optimizer_push_pred_cost_based is true - database is set to false
   -- setting to false is known to cause incorrect results with UNION ALL in 11.2.0.3
   -- the SQL in dbms_stats that gets ORA-00942 has several UNION ALL
   execute immediate q'[alter session set   "_optimizer_push_pred_cost_based" = true]';
   dbms_stats.gather_database_stats_job_proc;

exception
when others then

   error_code := sqlcode;
   select sql_id, prev_sql_id into v_curr_sqlid, v_prev_sqlid
   from v$session
   where sid = sys_context('userenv','sid');

   -- hopefully the sql_id and prev_sql_id are still available in v$session
   -- and v$sqlstats

   insert into last_sql_tab(sql_date, sql_error, sql_id, sql_fulltext)
   select systimestamp , error_code, sql_id, sql_fulltext
   from v$sqlstats
   where sql_id in (v_curr_sqlid,v_prev_sqlid);

   commit;

   -- make another attempt to gather stats
   if error_code = n_table_not_found then
      dbms_stats.gather_database_stats_job_proc;
   end if;

end;
/

-- set the new procedure as the stats gather job
BEGIN
   DBMS_SCHEDULER.SET_ATTRIBUTE (
      name           => 'GATHER_STATS_PROG',
      attribute      => 'program_action',
      value          => 'sys.db_stats_wrapper_2'
   );
END;
/

The next day was a happy one as the statistics ran without error and continued to do so each day.

Soon it was determined that the _optimizer_push_pred_cost_based parameter could be permanently set back to its default value of FALSE.

So ends the story. But, what about that DMBS_STATS tracing?

DBMS_STATS Tracing – Take 2

Ultimately this tracing did not help us in troubleshooting this issue.

Nonetheless this is quite an interesting undocumented feature of DBMS_STATS.

I have not yet spent too much time working with it, but a cursory examination of  trace file recently created to look at how Oracle goes about calculating the number of distinct values (NDV) for columns shows that it is quite verbose.

Following is a partial example

DBMS_STATS: Start gather table stats — tabname: MYTABLE

DBMS_STATS: Started table MYUSER.MYTABLE.SYS_P100050 at 16-OCT-13 06.36.51.805782000 PM -07:00. Execution phase: 1 stattype: DATA

DBMS_STATS: Specified granularity = ALL, Fixed granularity = ALL

DBMS_STATS: parameters ** pfix:  granularity: Z gIdxGranularity:  executionPhase: 1 pname: SYS_P100050 stime: 10-16-2013 18:36:51 method_opt: FOR ALL COLUMNS SIZE AUTO

DBMS_STATS: Start construct analyze using SQL .. Execution Phase: 1 granularity: ALL global_requested:  pfix:

DBMS_STATS: gather stats on subpartition SYS_SUBP202384:stats are stale

DBMS_STATS: Start gather_stats.. pfix:  ownname: MYUSERtabname: MYTABLE pname: SYS_P202408 spname: SYS_SUBP202384 execution phase: 1

DBMS_STATS: Using approximate NDV pct=0

DBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  COLNAME

DBMS_STATS:   Y    Y    Y    Y    Y    Y         Y    Y                   CLIENT_ID

DBMS_STATS:        Y    Y    Y    Y    Y         Y                        GROUP_ID

DBMS_STATS:        Y    Y    Y    Y    Y    Y    Y                        CONTACT_ID

DBMS_STATS:   Y    Y         Y    Y    Y    Y    Y                        MYDATE

DBMS_STATS: Approximate NDV Options

DBMS_STATS: NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL,NDV,NIL,NIL

DBMS_STATS: Building Histogram for CLIENT_ID

DBMS_STATS:  bktnum=-1, nnv=13643, snnv=13643, sndv=1, est_ndv=1, mnb=254

DBMS_STATS:  Trying frequency histogram

DBMS_STATS: Building Histogram for GROUP_ID

DBMS_STATS:  bktnum=-1, nnv=13643, snnv=13643, sndv=1, est_ndv=1, mnb=254

DBMS_STATS:  Trying frequency histogram

DBMS_STATS: Building Histogram for CONTACT_ID

DBMS_STATS:  bktnum=-1, nnv=13643, snnv=13643, sndv=13643, est_ndv=13643, mnb=254

DBMS_STATS:  Trying height balanced histogram

DBMS_STATS: Building Histogram for SEGMENT_DATE_JOINED

DBMS_STATS:  bktnum=-1, nnv=13643, snnv=13643, sndv=10749, est_ndv=10749, mnb=254

DBMS_STATS:  Trying height balanced histogram

DBMS_STATS: ====================================================================================================

DBMS_STATS: Statistics from clist:

DBMS_STATS: ====================================================================================================

DBMS_STATS: Number of rows in the table = 13643

DBMS_STATS: COL_NAME                       ACL       NV        SNNVDV    NDV       SNDV      DENS      CCNT

DBMS_STATS: —————————————————————————————————-

DBMS_STATS: CLIENT_ID                      4         0         13643     1         1         .00003664 1

DBMS_STATS:          min: Typ=2 Len=3: c2,18,63

DBMS_STATS:          max: Typ=2 Len=3: c2,18,63

DBMS_STATS:          Histograms:

DBMS_STATS:          ——————————————————————————————-

DBMS_STATS:          BVAL EAVAL                            EDVAL

DBMS_STATS:          ——————————————————————————————-

DBMS_STATS:          #                                     Typ=2 Len=3: c2,18,63

Anyone investigating how DBMS_STATS works under the covers would probably find this quite useful.

Here are a couple of other bits that will be of use should you wish to experiment with tracing DBMS_STATS

This bit of SQL will create the table needed should you choose to use trace level 32.

create table sys.stats_target$_log
as
select t.*, rpad('X',30,'X') session_id, 1 state
from sys.stats_target$ t
where 1=0;

The following SQL will provide some history of the options used. The SPARE1 column indicates if the option has ever been enabled.

set linesize 200
COLUMN   sval1 ON FORMAT   999999
COLUMN   sval2 ON FORMAT   a36
COLUMN   spare1 ON FORMAT   999999999999999999 head 'TRACE|ENABLED?'
COLUMN   spare2 ON FORMAT   a6
COLUMN   spare3 ON FORMAT   a6
COLUMN   spare4 ON FORMAT   a40
COLUMN   spare5 ON FORMAT   a6
COLUMN   spare6 ON FORMAT   a6

select sname, sval1, sval2
   ,spare1
   --, spare2
   --, spare3
   , spare4
   --, spare5
   --, spare6
from SYS.OPTSTAT_HIST_CONTROL$
order by sname
/

Summary

There are a few lessons to be learned here:

  • Setting an instance-wide parameter to deal with a problem can cause other issues that may be quite difficult to troubleshoot.
  • Never underestimate the value of networking with your peers.
  • Don’t think that by opening an SR you no longer have to work the issue – keep digging. It could be that no one else has seen this issue.
  • Never give up. If one method of troubleshooting does not work, look for others. Not only will this help solve the problem, but you may just learn something new in the process.

System Info

  • Solaris 11
  • Oracle 11.2.03 in a three node RAC configuration

This is a busy database that can be quite adversely affected when the database statistics are wrong.


Viewing all articles
Browse latest Browse all 34

Trending Articles