Quantcast
Channel: Jared Still – Official Pythian Blog
Viewing all articles
Browse latest Browse all 34

Alter Session Kill on Steroids

$
0
0

Perhaps you have encountered something like this: A session that is consuming too many resources needs to be killed. You locate the session and use ALTER SYSTEM KILL SESSION ‘SID,SERIAL#’ to kill the session. As you continue to monitor the database you find that the status of the session in v$session is ‘KILLED’, but the session does not go away. You also notice that the SERIAL# is continually changing.

Now you find there is no OS process associated with the session, but the session continues as PMON is unable to finish cleanup for the session. Usually when this happens, the session will be holding a lock. When that happens, the only method to release the lock is to bounce the database. There are some bugs that may be responsible for this problem, such as this one described by Oracle Support:

Pmon Spins While Cleaning Dead Process (Doc ID 1130713.1)

This particular bug affects Oracle 10.2.0.1 – 11.1.0.7. I have personally seen this same behavior happen on many versions of the database from 7.0 on. To avoid these hanging sessions many DBA’s have adopted the habit of first killing the OS process with an OS utility, and if the session is still visible in v$session, issue the ALTER SYSTEM KILL command.

The OS command used on linux/unix is usually ‘kill -9’. On windows it is OraKill. This method usually avoids the problems encountered when killing a session that is holding a lock and processing DML.

I don’t know just what circumstances trigger this behavior, as I have never been able to reproduce it at will. When it does happen though, it is more than annoying as the only way to clear locks held by the recalcitrant session is to bounce the database.

Quite some time ago (at least as far back as Oracle 8i) Oracle introduced the new IMMEDIATE keyword to use with ALTER SYSTEM KILL SESSION. Using this keyword removes the need to use an OS command to kill a session – Oracle will do it for you! To test this I am using Oracle 10.2.0.4 on Oracle Linux 5.5. I have previously run these same tests in 11.2.0.3 with the same results. Had I access to an 8i or 9i database I would have run the tests there. To start with let’s see what happens when a session is killed without the immediate keyword.

Login to the session to be killed:

$ sqlplus scott/tiger@10gr2

Login as SYSDBA from another terminal and check for scott’s session:

SQL> l
  1  select
  2     s.username,
  3     s.sid,
  4     s.serial#,
  5     p.spid spid
  6  from v$session s, v$process p
  7  where s.username = 'SCOTT'
  8*    and p.addr = s.paddr
SQL> /

USERNAME                              SID    SERIAL# SPID
------------------------------ ---------- ---------- ------------
SCOTT                                 133         35 22870

1 row selected.

All that has happened at this point is that Oracle has made an internal call that has disconnected Scott’s session. (tracing that operation is a different topic.) The process on the server has not been terminated. This can be seen by the following experiment:

Logon again as Scott.

In a SYSDBA session check for Scott’s:

 SQL> @scott

USERNAME                              SID    SERIAL# SPID
------------------------------ ---------- ---------- ------------
SCOTT                                 146         81 23678

Now check for the shadow process associated with scott’s session on the server:


[root@ora10gR2 tmp]# ps -fp 23678
UID PID PPID C STIME TTY TIME CMD
oracle 23678 1 0 16:56 ? 00:00:00 oraclejs01 (LOCAL=NO)

Kill the session and check the status:

SQL> alter system kill session '146,81';

SQL> l
  1  select
  2     s.username,
  3     s.sid,
  4     s.serial#,
  5     p.spid spid
  6  from v$session s, v$process p
  7  where s.username = 'SCOTT'
  8*    and p.addr = s.paddr
SQL>/

no rows selected

Check again on the server for the process:

[root@ora10gR2 tmp]# ps -fp 23678
UID PID PPID C STIME TTY TIME CMD
oracle 23678 1 0 16:56 ? 00:00:00 oraclejs01 (LOCAL=NO)

Interesting, isn’t it? We know the process is still alive on the server, but the session information is no longer associated with the process. This happens because Oracle has disconnected the session, which allows the process to continue until the sqlplus session is terminated. The session information is still available in v$session, but is no longer associated with a server process:

select
  2     s.username,
  3     s.status,
  4     s.sid,
  5     s.serial#
  6  from v$session s
  7* where s.username = 'SCOTT'
SQL>/

USERNAME                       STATUS          SID    SERIAL#
------------------------------ -------- ---------- ----------
SCOTT                          KILLED          146         81

1 row selected.

 1* select pid,spid from v$process where pid = 146
SQL>/

no rows selected

When exiting the Scott session, I can see that the session was killed:

SQL> exit
ERROR:
ORA-00028: your session has been killed

Let’s perform the experiment again, but this time use the IMMEDIATE keyword.

Logon as scott:

> sqlplus scott/tiger@10gr2

SQL*Plus: Release 11.2.0.3.0 Production on Tue Aug 5 17:18:53 2014

Logon as SYSDBA and check for the scott session;

SQL> @scott

USERNAME                              SID    SERIAL# SPID
------------------------------ ---------- ---------- ------------
SCOTT                                 146         83 23939

1 row selected.

Before killing scott’s session:

  • get my OS PID
  • enable 10046 trace

The OS PID will be used for strace on the SYSDBA session shadow process on the server.
The 10046 trace is so we can see what is happening in the strace output.

SQL> l
1 select
2 s.username,
3 s.sid,
4 s.serial#,
5 p.spid spid
6 from v$session s, v$process p
7 where s.username is not null
8 and p.addr = s.paddr
9 and userenv('SESSIONID') = s.audsid
10* order by username, sid
SQL>/

USERNAME                              SID    SERIAL# SPID
------------------------------ ---------- ---------- ------------
SYS                                   145         65 23947

1 row selected.

SQL> alter session set events '10046 trace name context forever, level 12';

Session altered.

Now ssh to the db server , check for Scott session shadow process and start strace:

[root@ora10gR2 tmp]# strace -o 23947.strace -p 23947
^Z
[1]+ Stopped strace -o 23947.strace -p 23947
[root@ora10gR2 tmp]# bg
[1]+ strace -o 23947.strace -p 23947 &

[root@ora10gR2 tmp]# ps -p 23939
PID TTY TIME CMD
23939 ? 00:00:00 oracle

Now kill Scott’s session and exit the SYSDBA session:

SQL> alter system kill session '146,83' immediate;

System altered.

The strace command will now have exited on the server.

First check again for Scott’s session:

[root@ora10gR2 tmp]# ps -p 23939
PID TTY TIME CMD
[root@ora10gR2 tmp]#

So the Scott shadow process has terminated.

As the 10046 trace was enabled, the output to the oracle trace file will appear in the strace file, which allows searching for ‘alter system kill’ in the strace file.

From the strace file:

write(5, "alter system kill session '146,8"..., 44) = 44

Now searching for the PID of scott’s session 23939:

read(10, "23939 (oracle) S 1 23939 23939 0"..., 999) = 228
close(10) = 0
open("/proc/23939/stat", O_RDONLY) = 10
read(10, "23939 (oracle) S 1 23939 23939 0"..., 999) = 228
close(10) = 0
kill(23939, SIGKILL) = 0
kill(23939, SIGCONT) = 0
open("/proc/23939/stat", O_RDONLY) = 10
read(10, "23939 (oracle) Z 1 23939 23939 0"..., 999) = 178
close(10) = 0

From the previous text I can see that Oracle opened the status file for PID 23939.
Why it did so twice I am not sure.

What happens after that is the interesting part.

kill(23939, SIGKILL) = 0

That line means that the SIGKILL signal was successfully sent to Scott’s shadow process.

What does that mean? Run kill -l to get a list of signals:

kill -l
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL       5) SIGTRAP
 6) SIGABRT      7) SIGBUS       8) SIGFPE       9) SIGKILL     10) SIGUSR1
11) SIGSEGV     12) SIGUSR2     13) SIGPIPE     14) SIGALRM     15) SIGTERM
16) SIGSTKFLT   17) SIGCHLD     18) SIGCONT     19) SIGSTOP     20) SIGTSTP
21) SIGTTIN     22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO       30) SIGPWR
31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1  36) SIGRTMIN+2  37) SIGRTMIN+3
38) SIGRTMIN+4  39) SIGRTMIN+5  40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8
43) SIGRTMIN+9  44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13 52) SIGRTMAX-12
53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9  56) SIGRTMAX-8  57) SIGRTMAX-7
58) SIGRTMAX-6  59) SIGRTMAX-5  60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2
63) SIGRTMAX-1  64) SIGRTMAX

Notice that signal 9 (kill -9) is SIGKILL.

So when killing a session with ALTER SYSTEM KILL SESSION ‘PID,SERIAL#’ IMMEDIATE Oracle is actually doing the kill -9 for you, and has been for many years now.

Though not shown here, this same test was run when the session was killed without using the IMMEDIATE keyword, and there were no attempts to kill the session. This was inferred as well by the fact the the process was still running on the server up until the time the Scott sqlplus session was exited.


Viewing all articles
Browse latest Browse all 34

Trending Articles