
What’s Eating My Database? Where to Focus Performance Efforts Using Oracle AWR

No doubt you have at times been on the DBA side of this conversation, or one much like it.

Help Desk: We have reports that the database is slow.

DBA: Really? OK, tell me when it was slow and I will investigate.

Help Desk: Well, we don’t really know just when that was. Several people have called to report that at times it has been nearly unresponsive, but the problem is intermittent and no one noted the time.

DBA: Umm… OK, I will see what I can do.

Or perhaps you take care of databases for third parties, and you receive similar reports, again with no exact times, nor any indication of just how ‘slow’ the database was.

At times like this you just need to find the periods of highest activity and start looking there.

But, how to determine the periods of highest activity?

Traditional methods would include looking at CPU, load and memory statistics from sar or other sources.

Perhaps a better method however is just to let the Oracle database tell you where to concentrate your efforts.

The data is all available in AWR; all that is necessary is to go find it.

The gold standard for determining database activity is Average Active Sessions (AAS), a metric that appeared when Oracle introduced ASH and AWR as performance monitoring tools.

AAS measures just how active the database is, and is defined as the “number of sessions that are either working or waiting” at a point in time.

Armed with that information, you can now dig into AWR and find the most active times from an AAS perspective.
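As a quick sanity check before going to AWR, the current value of the metric can also be queried from the in-memory v$sysmetric view. A minimal sketch; the group_id filter assumes the standard 60-second metric group:

```sql
-- current Average Active Sessions from the in-memory metric view
select begin_time
   , end_time
   , round(value,2) aas
from v$sysmetric
where metric_name = 'Average Active Sessions'
and group_id = 2  -- assumed: 2 = the 60-second interval metric group
;
```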

Initial efforts to find the top 10 most active periods in a database looked like this, where the following SQL is used to rank AWR snapshots by the maximum AAS value, returning the 10 most active snapshots.

When doing performance analysis via AWR it is a good idea to preserve the areas of interest via AWR baselines. Doing so allows querying AWR snapshots that might otherwise be deleted due to AWR retention times.
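How long the snapshots survive depends on the configured AWR retention; the current snapshot interval and retention window can be checked with a simple dictionary query (a sketch):

```sql
-- current AWR snapshot interval and retention window
select snap_interval, retention
from dba_hist_wr_control;
```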

--aas-1.sql
with aas as (
   select snap_id
      , value
      , dbid
      , instance_number
      , rownum my_rownum
   from (
      select distinct snap_id, dbid, instance_number, max(value) over (partition by snap_id, dbid, instance_number) value
      from dba_hist_sysmetric_history
      where metric_name = 'Average Active Sessions'
      order by value desc
   )
),
top10aas as (
   select
      sn.begin_interval_time begin_time
      , aas.instance_number
      , aas.snap_id
      , aas.value
      , aas.dbid
   from aas
   join dba_hist_snapshot sn on sn.snap_id = aas.snap_id
      and sn.dbid = aas.dbid
      and sn.instance_number = aas.instance_number
   where my_rownum <= 10
   order by value desc
)
select * from top10aas;

That seemed to work well until some apparently spurious values showed up. A server having 32 cores was seen to have an AAS of 3000+ during one AWR snapshot. As there was no other evidence (CPU, memory, IO) to support this, the value was likely in error.
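One way to investigate such an outlier is to look at the individual metric samples behind the snapshot, as dba_hist_sysmetric_history stores multiple samples per snapshot period. A sketch, using the outlier snap_id from this example (the snap_id shown is hypothetical):

```sql
-- raw AAS samples behind a suspect snapshot
select begin_time
   , end_time
   , value
from dba_hist_sysmetric_history
where metric_name = 'Average Active Sessions'
and snap_id = 18087  -- hypothetical suspect snap_id
order by begin_time
;
```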

So rather than rely on AAS, a different metric was chosen.

Instead, the sessions in each snapshot that are on CPU or waiting on some non-idle event are counted as db_time. The ratio of db_time to elapsed_time is then used to determine the most active periods.
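The db_time estimate leans on the ASH sampling interval: dba_hist_active_sess_history retains roughly one sample per active session per 10 seconds, so each row represents about 10 seconds of DB time, and db_time/elapsed_time is effectively AAS computed from ASH. A sketch of the estimate for a single snapshot (the snap_id is hypothetical):

```sql
-- estimate DB time in seconds for one snapshot:
-- each ASH row in AWR represents ~10 seconds during which a session
-- was on CPU or in a non-idle wait
select count(*) * 10 db_time_seconds
from dba_hist_active_sess_history
where snap_id = 18751  -- hypothetical snap_id
and (
   wait_class is null      -- on CPU
   or wait_class != 'Idle' -- non-idle wait
)
;
```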

--aas-2.sql
with aas as (
   select snap_id, value, dbid, instance_number, rownum my_rownum
   from (
      select snap_id
         , round(db_time / elapsed_time,1)  value
         , dbid
         , instance_number
      from (
         select distinct h.snap_id
            , h.instance_number
            , h.dbid
            , count(*) over (partition by h.snap_id, h.instance_number) * 10 db_time -- each ASH row in AWR represents ~10 seconds
            , (extract( day from (s.end_interval_time - s.begin_interval_time) )*24*60*60)+
               (extract( hour from (s.end_interval_time - s.begin_interval_time) )*60*60)+
               (extract( minute from (s.end_interval_time - s.begin_interval_time) )*60)+
               (extract( second from (s.end_interval_time - s.begin_interval_time)))
            elapsed_time
         from dba_hist_active_sess_history h
         join dba_hist_snapshot s on s.snap_id = h.snap_id
            and s.instance_number = h.instance_number
         where (
            wait_class is null  -- on CPU
            or wait_class != 'Idle' -- ignore idle events
         )
      )
      order by value desc
   )
),
top10aas as (
   select
      sn.begin_interval_time begin_time
      , aas.instance_number
      , aas.snap_id
      , aas.value
      , aas.dbid
   from aas
   join dba_hist_snapshot sn on sn.snap_id = aas.snap_id
      and sn.dbid = aas.dbid
      and sn.instance_number = aas.instance_number
   where my_rownum <= 10
   order by value desc
)
select * from top10aas;

In hindsight it may be a good idea to rename the factored subqueries from top10aas and similar to something method-neutral, such as topactivity.

On the other hand, if no renaming takes place, the two methods can be switched back and forth without affecting any dependent scripts.

Here are examples from an active test database. Some of the same periods appear in each report, albeit in different order.

The first example is using the query on dba_hist_sysmetric_history:

SQL> @aas-1
BEGIN_TIME                    INSTANCE_NUMBER     SNAP_ID VALUE      DBID
------------------------------ --------------- ---------- ---------- ----------
15-JUL-16 09.00.43.624 AM                    1      18087 294.076413 4107786006
21-JUL-16 11.00.59.752 AM                    1      18233 126.210386 4107786006
09-AUG-16 05.00.40.556 AM                    1      18683 77.0418398 4107786006
06-JUL-16 05.00.33.517 AM                    1      17867 75.36815   4107786006
24-JUL-16 09.00.35.396 AM                    1      18303 75.131809  4107786006
06-AUG-16 04.00.42.262 AM                    1      18610 56.5065604 4107786006
17-AUG-16 05.00.20.844 AM                    1      18875 54.1554518 4107786006
26-AUG-16 04.00.23.145 AM                    1      19090 53.2953573 4107786006
09-JUL-16 05.00.37.146 AM                    1      17939 51.2031095 4107786006
15-JUL-16 04.00.28.880 AM                    1      18082 50.1482053 4107786006

10 rows selected.

Next are the results of the query that is considering db_time:

SQL> @aas-2
BEGIN_TIME                    INSTANCE_NUMBER     SNAP_ID VALUE      DBID
------------------------------ --------------- ---------- ---------- ----------
12-AUG-16 01.00.52.943 AM                    1      18751 27.7       4107786006
28-JUL-16 01.00.28.734 AM                    1      18391 26.4       4107786006
05-AUG-16 01.00.43.076 AM                    1      18583 25.4       4107786006
19-AUG-16 01.00.51.324 AM                    1      18919 23.9       4107786006
09-AUG-16 01.00.25.700 AM                    1      18679 23         4107786006
15-JUL-16 01.00.15.981 AM                    1      18079 22.5       4107786006
08-JUL-16 01.00.52.888 AM                    1      17911 21.4       4107786006
26-AUG-16 01.00.09.871 AM                    1      19087 21.4       4107786006
29-JUL-16 01.00.31.166 AM                    1      18415 21.2       4107786006
24-JUL-16 10.00.44.598 AM                    1      18304 20.2       4107786006

10 rows selected.

You probably noticed there is not a 1:1 match between the periods chosen by the two methods. Due to the difference in the metrics there never will be; however, both methods are valid and will find periods of high activity.

Now that you can identify the top 10 activity periods, what is the next step? Perhaps you would like to run AWR/ADDM reports for these times, or drill down with some custom queries. However you proceed, it may be a good idea to create an AWR Baseline for each of the snapshots.

Don’t worry about cleaning up those baselines later, as you will see they can be set to automatically expire at a later date.

The following SQL fragment is based on aas-2.sql as seen previously. This SQL fragment will be called by create-baselines.sql. Putting the top10 logic in the SQL fragment allows modifying how the top10 list is generated without making any changes to create-baselines.sql.

with aas as (
   select begin_snap_id, end_snap_id, value, dbid, instance_number, rownum my_rownum
   from (
      select begin_snap_id
         , lead(begin_snap_id,1) over (partition by dbid, instance_number order by begin_snap_id) end_snap_id
         , round(db_time / elapsed_time,1)  value
         , dbid
         , instance_number
      from (
         select distinct h.snap_id begin_snap_id
            , h.instance_number
            , h.dbid
            , count(*) over (partition by h.snap_id, h.instance_number) * 10 db_time
            , (extract( day from (s.end_interval_time - s.begin_interval_time) )*24*60*60)+
               (extract( hour from (s.end_interval_time - s.begin_interval_time) )*60*60)+
               (extract( minute from (s.end_interval_time - s.begin_interval_time) )*60)+
               (extract( second from (s.end_interval_time - s.begin_interval_time)))
            elapsed_time
         from dba_hist_active_sess_history h
         join dba_hist_snapshot s on s.snap_id = h.snap_id
            and s.instance_number = h.instance_number
            and  (
               h.wait_class is null  -- on CPU
               or h.wait_class != 'Idle' -- wait events - ignore idle events
            )
            -- these predicates are useful if you have some idea of the date range
            --
            -- most recent 5 days
            -- and s.end_interval_time > systimestamp - 5
            --
            -- or maybe a range
            -- and s.end_interval_time between timestamp '2016-08-01 00:00:01' and timestamp '2016-08-02 23:59:59'
      )
      order by value desc
   )
),
top10 as (
   select
      sn.begin_interval_time begin_time
      , aas.instance_number
      , aas.begin_snap_id
      , aas.end_snap_id
      , aas.value
      , aas.dbid
   from aas
   join dba_hist_snapshot sn on sn.snap_id = aas.begin_snap_id
      and sn.dbid = aas.dbid
      and sn.instance_number = aas.instance_number
   where my_rownum <= 10
   order by value desc
)

Now for the script to create the baselines. First we will see if there are already any baselines:

SQL> @show-awr-baselines

select
   baseline_name
   , baseline_type
   , start_snap_id
   , to_char(start_snap_time,'yyyy-mm-dd hh24:mi:ss') start_snap_time
   , to_char(creation_time,'yyyy-mm-dd hh24:mi:ss') creation_time
   , to_char(creation_time+expiration,'yyyy-mm-dd hh24:mi:ss') expiration
from dba_hist_baseline
order by creation_time;
                                                                       START
BASELINE_NAME                                      BASELINE TYPE     SNAP ID START_SNAP_TIME      CREATION_TIME       EXPIRATION
-------------------------------------------------- --------------- --------- -------------------- ------------------- --------------------
SYSTEM_MOVING_WINDOW                               MOVING_WINDOW       32588 2016-08-22 13:00:15  2011-12-19 16:45:11

1 row selected.

No user-created baselines yet; only the default SYSTEM_MOVING_WINDOW baseline is present. Now run create-baselines.sql. First the code, and then the results:

set serveroutput on size unlimited

var n_expire_days number

-- change to something sensible for real use
exec :n_expire_days := 1

declare

   v_baseline_pfx varchar2(30) := 'dw'; -- tag as Data Warehouse
   v_baseline_name varchar2(128);

   i_expire_days integer := :n_expire_days;

   e_baseline_exists exception;
   pragma exception_init(e_baseline_exists, -13528);

   procedure p ( p_in varchar2)
   is
   begin
      dbms_output.put(p_in);
   end;

   procedure pl ( p_in varchar2)
   is
   begin
      p(p_in);
      dbms_output.put_line(null);
   end;

begin

dbms_output.put_line(lpad('=',30,'='));

for aasrec in (
   @@top10
   select  begin_time, instance_number, begin_snap_id, end_snap_id, value, dbid
   from top10
)
loop
   pl('   begin_time: ' || aasrec.begin_time);
   pl('begin snap_id: ' || aasrec.begin_snap_id);
   pl('  end snap_id: ' || aasrec.end_snap_id);
   pl(' Metric Value: ' || aasrec.value);


   -- create the baselines
   -- catch errors if already exists

   begin
      v_baseline_name := v_baseline_pfx || '_'
         || to_char(aasrec.begin_snap_id) || '_'
         || to_char(aasrec.begin_time,'yyyymmdd-hh24:mi:ss');

      pl('Baseline Name: ' || v_baseline_name);
      --/* change '--/*' to '/*' to disable the create_baseline call for a dry run
      dbms_workload_repository.create_baseline(
         start_snap_id => aasrec.begin_snap_id,
         end_snap_id => aasrec.end_snap_id,
         baseline_name => v_baseline_name,
         dbid => aasrec.dbid,
         expiration => i_expire_days
      );
      --*/

   exception
   when e_baseline_exists then
      pl('!!Baseline ' || v_baseline_name || ' already exists');
   when others then
      raise;
   end;

   pl(lpad('=',30,'='));

end loop;

end;
/

And now the output of the PL/SQL procedure:

Elapsed: 00:00:00.01
==============================
begin_time: 27-JUL-16 08.00.56.132 PM
begin snap_id: 31972
end snap_id: 31973
Metric Value: .5
Baseline Name: dw_31972_20160727-20:00:56
==============================
begin_time: 23-JUL-16 08.00.10.334 AM
begin snap_id: 31888
end snap_id: 31889
Metric Value: .4
Baseline Name: dw_31888_20160723-08:00:10
==============================
begin_time: 15-AUG-16 08.00.18.055 PM
begin snap_id: 32428
end snap_id: 32429
Metric Value: .4
Baseline Name: dw_32428_20160815-20:00:18
==============================
begin_time: 16-AUG-16 02.00.33.705 PM
begin snap_id: 32446
end snap_id: 32447
Metric Value: .3
Baseline Name: dw_32446_20160816-14:00:33
==============================
begin_time: 18-AUG-16 08.00.17.095 PM
begin snap_id: 32500
end snap_id: 32501
Metric Value: .3
Baseline Name: dw_32500_20160818-20:00:17
==============================
begin_time: 19-AUG-16 12.00.13.883 AM
begin snap_id: 32504
end snap_id: 32505
Metric Value: .3
Baseline Name: dw_32504_20160819-00:00:13
==============================
begin_time: 09-AUG-16 08.00.15.699 PM
begin snap_id: 32284
end snap_id: 32285
Metric Value: .3
Baseline Name: dw_32284_20160809-20:00:15
==============================
begin_time: 21-AUG-16 04.00.30.241 AM
begin snap_id: 32556
end snap_id: 32557
Metric Value: .3
Baseline Name: dw_32556_20160821-04:00:30
==============================
begin_time: 24-JUL-16 01.00.11.481 PM
begin snap_id: 31917
end snap_id: 31918
Metric Value: .3
Baseline Name: dw_31917_20160724-13:00:11
==============================
begin_time: 25-JUL-16 06.00.12.365 PM
begin snap_id: 31946
end snap_id: 31947
Metric Value: .3
Baseline Name: dw_31946_20160725-18:00:12
==============================

PL/SQL procedure successfully completed.

Another run of show-awr-baselines.sql to confirm:

SQL> @show-awr-baselines
                                                START
BASELINE_NAME               BASELINE TYPE     SNAP ID START_SNAP_TIME      CREATION_TIME       EXPIRATION
--------------------------- --------------- --------- -------------------- ------------------- --------------------
SYSTEM_MOVING_WINDOW        MOVING_WINDOW       32588 2016-08-22 13:00:15  2011-12-19 16:45:11
dw_31888_20160723-08:00:10  STATIC              31888 2016-07-23 09:00:24  2016-08-30 12:45:03 2016-08-31 12:45:03
dw_31946_20160725-18:00:12  STATIC              31946 2016-07-25 19:00:26  2016-08-30 12:45:03 2016-08-31 12:45:03
dw_32446_20160816-14:00:33  STATIC              32446 2016-08-16 15:00:47  2016-08-30 12:45:03 2016-08-31 12:45:03
dw_32504_20160819-00:00:13  STATIC              32504 2016-08-19 01:00:28  2016-08-30 12:45:03 2016-08-31 12:45:03
dw_31917_20160724-13:00:11  STATIC              31917 2016-07-24 14:00:26  2016-08-30 12:45:03 2016-08-31 12:45:03
dw_31972_20160727-20:00:56  STATIC              31972 2016-07-27 21:00:10  2016-08-30 12:45:03 2016-08-31 12:45:03
dw_32428_20160815-20:00:18  STATIC              32428 2016-08-15 21:00:32  2016-08-30 12:45:03 2016-08-31 12:45:03
dw_32500_20160818-20:00:17  STATIC              32500 2016-08-18 21:00:31  2016-08-30 12:45:03 2016-08-31 12:45:03
dw_32284_20160809-20:00:15  STATIC              32284 2016-08-09 21:00:29  2016-08-30 12:45:03 2016-08-31 12:45:03
dw_32556_20160821-04:00:30  STATIC              32556 2016-08-21 05:00:44  2016-08-30 12:45:03 2016-08-31 12:45:03

11 rows selected.

Please note the default expiration time in the script is 1 day; for real use you will most likely change that to something more reasonable, such as 30 days.
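Should a baseline need to go away before its expiration date, it can be dropped explicitly with dbms_workload_repository.drop_baseline; a sketch using one of the baseline names created above:

```sql
begin
   dbms_workload_repository.drop_baseline(
      baseline_name => 'dw_31972_20160727-20:00:56',
      cascade => false  -- false leaves the underlying AWR snapshots in place
   );
end;
/
```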

The AWR snapshots preserved by the new baselines will not be aged out by the standard AWR retention policy, allowing ample time to run AWR/ADDM reports on the periods of interest, or to query AWR directly and drill down to the root cause of performance issues.
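The standard AWR report script can then be pointed at any of the preserved periods; awrrpt.sql prompts for the begin and end snap_id, which map directly to a baseline's snapshot pair:

```sql
-- generate an AWR report for one of the preserved periods;
-- when prompted, supply the baseline's snapshot pair, e.g. 31972 and 31973
@?/rdbms/admin/awrrpt.sql
```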

Code from this article may also be found at https://github.com/jkstill/oracle-baselines

