Peeking into Linux kernel-land using /proc filesystem for quick'n'dirty troubleshooting
Tanel Poder's Performance & Troubleshooting Blog, February 21, 2013



This blog entry is about modern Linuxes. In other words, RHEL6 equivalents with 2.6.3x kernels, and not the ancient RHEL5 with its 2.6.18 kernel (wtf?!), which unfortunately is still the most common in enterprises. And no, I'm not going to use kernel debuggers or SystemTap scripts here, just plain old cat /proc/PID/xyz commands against some useful /proc filesystem entries.

Here's one systematic troubleshooting example I reproduced on my laptop. A DBA was wondering why their find command had been running much slower lately, without returning any results for a while. Knowing the environment, we had a hunch, but I got asked what the systematic approach would be for troubleshooting this already-ongoing problem, right now.

Luckily the system was running OEL6, so it had a pretty new kernel: the 2.6.39 UEK2, in fact.

So, let's do some troubleshooting. First let's see whether that find process is still alive:

    [root@oel6 ~]# ps -ef | grep find

    root 27288 27245 4 11:57 pts/0 00:00:01 find . -type f

    root 27334 27315 0 11:57 pts/1 00:00:00 grep find

Yep, it's there: PID 27288 (I'll use that PID throughout this troubleshooting example).

Let's start from the basics and take a quick look at what the bottleneck for this process is. If it's not blocked by anything (for example, if it reads everything it needs from cache), it should be 100% on CPU. If it's bottlenecked by some IO or contention issue, the CPU usage should be lower or even 0%.

    [root@oel6 ~]# top -cbp 27288

    top - 11:58:15 up 7 days, 3:38, 2 users, load average: 1.21, 0.65, 0.47

    Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie

    Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

    Mem: 2026460k total, 1935780k used, 90680k free, 64416k buffers

    Swap: 4128764k total, 251004k used, 3877760k free, 662280k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

    27288 root 20 0 109m 1160 844 D 0.0 0.1 0:01.11 find . -type f

Top tells me this process is at 0% CPU, or very close to zero percent (so it gets rounded to 0% in the output). There's an important difference, though, between a process that is completely stuck, without a chance of getting onto CPU at all, and a process which gets out of its wait state every now and then (for example, some polling operation times out periodically and the process chooses to go back to sleep). Top on Linux is not a good enough tool to show that difference for sure, but at least we know that this process is not burning serious amounts of CPU time.
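One cheap way to tell the two cases apart, before reaching for any tracing tools, is to sample the process's cumulative CPU time a few times and see whether it moves at all. This is not from the original post, just a minimal sketch using /proc/PID/stat (fields 14 and 15 are utime and stime in clock ticks; a command name containing spaces would shift the field numbers):

# print the user and system CPU ticks accumulated so far by PID 27288, once a second
for i in {1..5}; do
    awk '{ print "utime:", $14, "stime:", $15 }' /proc/27288/stat
    sleep 1
done

If the numbers do not move between samples, the process is not getting any CPU at all.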

Let's use something else then. Normally when a process seems to be stuck like that (0% CPU usually means the process is stuck in some blocking system call, which causes the kernel to put it to sleep), I run strace on the process to see which system call it is currently stuck in. Also, if the process is not actually completely stuck but returns from a system call and wakes up briefly every now and then, that would show up in strace (as the blocking system call would complete and be entered again a little later):

[root@oel6 ~]# strace -cp 27288

    Process 27288 attached - interrupt to quit

    ^C

    ^Z

    [1]+ Stopped strace -cp 27288

    [root@oel6 ~]# kill -9 %%

    [1]+ Stopped strace -cp 27288

    [root@oel6 ~]#

    [1]+ Killed strace -cp 27288

Oops, the strace command itself got hung too! It didn't print any output for a long time and didn't respond to CTRL+C, so I had to put it into the background with CTRL+Z and kill it from there. So much for easy diagnosis.

Let's try pstack then (on Linux, pstack is just a shell script wrapper around the GDB debugger). While pstack does not see into kernel-land, it will still give us a clue about which system call was requested (as there's usually a corresponding libc library call at the top of the displayed userland stack):

    [root@oel6 ~]# pstack 27288

    ^C

    ^Z

    [1]+ Stopped pstack 27288

    [root@oel6 ~]# kill %%

    [1]+ Stopped pstack 27288

    [root@oel6 ~]#

    [1]+ Terminated pstack 27288

Pstack also got stuck without returning anything!

So we still don't know whether our process is 100% (hopelessly) stuck or just 99.99% stuck (i.e. it keeps going back to sleep), nor where it is stuck.

OK, where else to look? There's one more commonly available thing to check: the process state and WCHAN fields, which can be reported via good old ps (perhaps I should have run this command earlier, to make sure this process isn't a zombie by now):

    [root@oel6 ~]# ps -flp 27288

    F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD

    0 D root 27288 27245 0 80 0 - 28070 rpc_wa 11:57 pts/0 00:00:01 find . -type f

You should run ps multiple times in a row to make sure the process is still in the same state (you don't want to be misled by a single sample taken at an unlucky time), but I'm displaying it here only once for brevity.

The process is in state D (uninterruptible sleep), which is usually disk IO related (as the ps man page also says). And the WCHAN field (the function which caused this sleep/wait) is a bit truncated. I could use a ps option (read the man page) to print it a bit wider, but as this info comes from the proc filesystem anyway, let's query it directly from the source (again, it would be a good idea to sample this multiple times, as we are not yet sure whether our process is completely stuck or just mostly sleeping):

    [root@oel6 ~]# cat /proc/27288/wchan

    rpc_wait_bit_killable

Hmm, this process is waiting for some RPC call. RPC usually means that the process is talking to other processes (either on the local server or even on remote servers). But we still don't know why.
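To guard against the single-unlucky-sample problem mentioned above, the wait channel can be read in a quick loop. This is just a sketch of my own, not part of the original post: a process that merely wakes up occasionally would show its wchan changing (or going empty while briefly on CPU), while a completely stuck one shows the same function every time.

# sample the wait channel of PID 27288 once a second for ten seconds
for i in {1..10}; do
    cat /proc/27288/wchan; echo    # wchan has no trailing newline, hence the echo
    sleep 1
done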

Before we go on to the meat of this article, let's figure out whether this process is completely stuck or not. On modern kernels /proc/PID/status can tell us that. The interesting bits for our purpose are the State field and the context switch counters at the bottom:

[root@oel6 ~]# cat /proc/27288/status

    Name: find

    State: D (disk sleep)

    Tgid: 27288

    Pid: 27288

    PPid: 27245

    TracerPid: 0

    Uid: 0 0 0 0

    Gid: 0 0 0 0

    FDSize: 256

    Groups: 0 1 2 3 4 6 10

    VmPeak: 112628 kB

    VmSize: 112280 kB

    VmLck: 0 kB

    VmHWM: 1508 kB

    VmRSS: 1160 kB

    VmData: 260 kB

    VmStk: 136 kB

    VmExe: 224 kB

    VmLib: 2468 kB

    VmPTE: 88 kB

    VmSwap: 0 kB

    Threads: 1

    SigQ: 4/15831

    SigPnd: 0000000000040000

    ShdPnd: 0000000000000000

    SigBlk: 0000000000000000

    SigIgn: 0000000000000000

    SigCgt: 0000000180000000

    CapInh: 0000000000000000

    CapPrm: ffffffffffffffff

    CapEff: ffffffffffffffff

    CapBnd: ffffffffffffffff

    Cpus_allowed: ffffffff,ffffffff

    Cpus_allowed_list: 0-63

Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,

    Mems_allowed_list: 0

    voluntary_ctxt_switches: 9950

    nonvoluntary_ctxt_switches: 17104

The process is in state D, disk sleep (uninterruptible sleep). Also look at the voluntary_ctxt_switches and nonvoluntary_ctxt_switches numbers: they tell you how many times the process has been taken off the CPU (or put back on it). Then run the same command again a few seconds later and see whether those numbers increase. In my case they did not, so I could conclude that the process was completely stuck (well, at least during the few seconds between the two commands). So I can be more confident now that this process is completely stuck (and not just flying under the radar by burning 0.04% of CPU all the time).
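The same check can be scripted so you do not have to eyeball two separate invocations; a minimal sketch (my addition, not from the original post):

# watch the context switch counters of PID 27288 for a few seconds;
# if neither counter grows, the process is not being scheduled at all
for i in {1..5}; do
    grep ctxt_switches /proc/27288/status
    echo ---
    sleep 1
done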

By the way, there are two more locations where you can get the context switch counts (and the second one works on ancient kernels as well):

[root@oel6 ~]# cat /proc/27288/sched

    find (27288, #threads: 1)

    ---------------------------------------------------------

    se.exec_start : 617547410.689282

    se.vruntime : 2471987.542895

    se.sum_exec_runtime : 1119.480311

    se.statistics.wait_start : 0.000000

    se.statistics.sleep_start : 0.000000

    se.statistics.block_start : 617547410.689282

    se.statistics.sleep_max : 0.089192

    se.statistics.block_max : 60082.951331

    se.statistics.exec_max : 1.110465

    se.statistics.slice_max : 0.334211

    se.statistics.wait_max : 0.812834

    se.statistics.wait_sum : 724.745506

    se.statistics.wait_count : 27211

    se.statistics.iowait_sum : 0.000000

    se.statistics.iowait_count : 0

    se.nr_migrations : 312

    se.statistics.nr_migrations_cold : 0

    se.statistics.nr_failed_migrations_affine: 0

    se.statistics.nr_failed_migrations_running: 96

    se.statistics.nr_failed_migrations_hot: 1794

    se.statistics.nr_forced_migrations : 150

    se.statistics.nr_wakeups : 18507

    se.statistics.nr_wakeups_sync : 1

    se.statistics.nr_wakeups_migrate : 155

    se.statistics.nr_wakeups_local : 18504

    se.statistics.nr_wakeups_remote : 3

    se.statistics.nr_wakeups_affine : 155

    se.statistics.nr_wakeups_affine_attempts: 158

    se.statistics.nr_wakeups_passive : 0

    se.statistics.nr_wakeups_idle : 0

    avg_atom : 0.041379

    avg_per_cpu : 3.588077

nr_switches : 27054

    nr_voluntary_switches : 9950

    nr_involuntary_switches : 17104

    se.load.weight : 1024

    policy : 0

    prio : 120

    clock-delta : 72

Here you need to look at the nr_switches number (which equals nr_voluntary_switches + nr_involuntary_switches).

The total nr_switches is 27054 in the above output, and this is also what the third field in /proc/PID/schedstat shows:

    [root@oel6 ~]# cat /proc/27288/schedstat

    1119480311 724745506 27054

And it doesn't increase :-/
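For reference, the three fields in /proc/PID/schedstat are the time spent on CPU, the time spent waiting on a runqueue (both in nanoseconds), and the number of timeslices run, which is why they line up with se.sum_exec_runtime, se.statistics.wait_sum and nr_switches in the output above. A convenient way to re-check it, assuming the watch utility is available, is:

# highlight any digits that change between 1-second refreshes
watch -d -n1 cat /proc/27288/schedstat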

So, it looks like our process is pretty stuck :) Strace and pstack were useless. They use the ptrace() system call for attaching to the process and peeking into its memory, but as the process was hopelessly stuck, very likely in some system call, I guess the ptrace() attach got hung itself. (By the way, I once tried stracing the strace process itself while it attached to a target process, and it crashed the target. You have been warned :)

So, how do we see which system call we are stuck in, without strace or pstack? Luckily I was running on a modern kernel: say hi to /proc/PID/syscall!

    [root@oel6 ~]# cat /proc/27288/syscall

    262 0xffffffffffffff9c 0x20cf6c8 0x7fff97c52710 0x100 0x100 0x676e776f645f616d 0x7fff97c52658 0x390e2da8ea


Ok, WTF am I supposed to do with that?

Well, these numbers usually stand for something. If it's 0xAVeryBigNumber, then it's usually a memory address (and pmap-like tools could be used to see where it points); if it's a small number, then it's perhaps an index into some array, such as the open file descriptor array (which you can read from /proc/PID/fd), or, in this case, since we are dealing with system calls, the number of the system call your process is currently in. So, this process is stuck in system call #262!

Note that the system call numbers may differ between OS type, OS version and platform, so you should read them from the proper .h file on your flavor of OS. Usually searching for syscall* under /usr/include is a good starting point. On my version and platform (64-bit) of Linux, the system calls are defined in /usr/include/asm/unistd_64.h:

    [root@oel6 ~]# grep 262 /usr/include/asm/unistd_64.h

    #define __NR_newfstatat 262

Here you go! Syscall 262 is something called newfstatat. Just run man to find out what it is. Here's a hint regarding system call names: if the man page doesn't find a call, then try it without possible suffixes and prefixes (for example, man pread instead of man pread64); in this case, run man without the "new" prefix: man fstatat. Or just google it.
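Two small refinements to the lookup, neither of which is from the original post: a plain grep for 262 would also match numbers like 2620, so an exact word match is safer, and on systems that have the audit userspace package installed, the ausyscall helper can do the translation for you (I am assuming its availability here):

# exact-number lookup in both the 64-bit and 32-bit syscall tables,
# since the numbers differ per ABI
grep -w 262 /usr/include/asm/unistd_64.h /usr/include/asm/unistd_32.h

# alternative, if the audit package is installed
ausyscall 262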

Anyway, this system call, new-fstat-at, allows you to read file properties much like the usual stat system call, and we are stuck in this file metadata reading operation. So we've gotten one step further. But we still don't know why we are stuck there!

Well, say hello to my little friend, /proc/PID/stack, which allows you to read the kernel stack backtrace of a process by just cat-ing a proc file!

[root@oel6 ~]# cat /proc/27288/stack

    [] rpc_wait_bit_killable+0x24/0x40 [sunrpc]

    [] __rpc_execute+0xf5/0x1d0 [sunrpc]

    [] rpc_execute+0x43/0x50 [sunrpc]

    [] rpc_run_task+0x75/0x90 [sunrpc]

    [] rpc_call_sync+0x42/0x70 [sunrpc]

    [] nfs3_rpc_wrapper.clone.0+0x35/0x80 [nfs]

    [] nfs3_proc_getattr+0x47/0x90 [nfs]

    [] __nfs_revalidate_inode+0xcc/0x1f0 [nfs]

    [] nfs_revalidate_inode+0x36/0x60 [nfs]

    [] nfs_getattr+0x5f/0x110 [nfs]

    [] vfs_getattr+0x4e/0x80

    [] vfs_fstatat+0x70/0x90

    [] sys_newfstatat+0x24/0x50

    [] system_call_fastpath+0x16/0x1b

    [] 0xffffffffffffffff

The top function is where in kernel code we are stuck right now, which is exactly what the WCHAN output showed us (note that there are actually some more functions on top of it, such as the kernel schedule() function which puts processes to sleep or back on CPU as needed, but those are not shown here, probably because they are a result of this wait condition itself, not the cause).

Thanks to the full kernel-land stack of this task, we can read it from the bottom up to understand how we ended up calling this rpc_wait_bit_killable function at the top, which in turn called the scheduler and put us into sleep mode.

The system_call_fastpath at the bottom is a generic kernel system call handler function, which called the kernel code for the newfstatat syscall we issued (sys_newfstatat). Then, continuing upwards through the child functions, we see a bunch of NFS functions in the stack! This is 100% undeniable proof that we are somewhere under the NFS codepath. I'm not saying in the NFS codepath, because when you continue reading upwards past the NFS functions, you see that the topmost NFS function has in turn called an RPC function (rpc_call_sync) to talk to some other process, in this case probably the [kworker/N:N], [nfsiod], [lockd] or [rpciod] kernel IO threads. And we are never getting a reply from those threads for some reason (the usual suspects would be a dropped network connection, dropped packets or just network connectivity issues).

To see whether any of those helper threads are stuck in network-related code, you could sample their kernel stacks too, although the kworkers do much more than just NFS RPC communication. During a separate experiment (just copying large files over NFS), I caught one of the kworkers waiting in networking code:

[root@oel6 proc]# for i in `pgrep worker` ; do ps -fp $i ; cat /proc/$i/stack ; done

    UID PID PPID C STIME TTY TIME CMD

    root 53 2 0 Feb14 ? 00:04:34 [kworker/1:1]

    [] __cond_resched+0x2a/0x40

    [] lock_sock_nested+0x35/0x70

    [] tcp_sendmsg+0x29/0xbe0

    [] inet_sendmsg+0x48/0xb0

    [] sock_sendmsg+0xef/0x120

    [] kernel_sendmsg+0x41/0x60

    [] xs_send_kvec+0x8e/0xa0 [sunrpc]

    [] xs_sendpages+0x173/0x220 [sunrpc]

    [] xs_tcp_send_request+0x5d/0x160 [sunrpc]

    [] xprt_transmit+0x83/0x2e0 [sunrpc]

    [] call_transmit+0xa8/0x130 [sunrpc]

    [] __rpc_execute+0x66/0x1d0 [sunrpc]

    [] rpc_async_schedule+0x15/0x20 [sunrpc]

    [] process_one_work+0x13e/0x460

    [] worker_thread+0x17c/0x3b0

    [] kthread+0x96/0xa0

    [] kernel_thread_helper+0x4/0x10

It should be possible to enable kernel tracing to find out exactly which kernel thread communicates with which, but I won't go there in this post, as this is supposed to be a practical (and simple :) troubleshooting exercise!

Anyway, thanks to the very easy kernel stack sampling in modern Linux kernels (I don't know the exact version in which it was introduced), we managed to systematically figure out where the find command was stuck: in NFS code in the Linux kernel. And when you have NFS-related hangs, the most usual suspects are network issues. If you are wondering how I reproduced this problem: I mounted an NFS volume from a VM, started the find command and then just suspended the VM. This produces symptoms similar to a network (configuration, firewall) issue where a connection is silently dropped without notifying the TCP endpoints, or where packets just don't go through for some reason.
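When the stack points at NFS/RPC waits like this, a few quick client-side checks help confirm the network suspicion before you start capturing packets. This is my addition rather than part of the original walkthrough, and the server hostname below is a placeholder:

# which NFS mounts exist and with what options (hard vs. soft, proto, timeo)
grep nfs /proc/mounts

# per-server NFS client state, available on newer kernels
cat /proc/fs/nfsfs/servers 2>/dev/null

# does the NFS server still answer RPC requests over TCP? (replace the hostname)
rpcinfo -t nfs-server.example.com nfs 3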

As the top function in the stack was one of the "killable" (safe to interrupt) wait functions, rpc_wait_bit_killable, we can kill the process with kill -9:

    [root@oel6 ~]# ps -fp 27288

    UID PID PPID C STIME TTY TIME CMD

    root 27288 27245 0 11:57 pts/0 00:00:01 find . -type f

    [root@oel6 ~]# kill -9 27288

    [root@oel6 ~]# ls -l /proc/27288/stack

    ls: cannot access /proc/27288/stack: No such file or directory

    [root@oel6 ~]# ps -fp 27288

    UID PID PPID C STIME TTY TIME CMD

    [root@oel6 ~]#

The process is gone.

Note that since /proc/PID/stack looks like just a plain-text proc file, you can also do poor man's stack profiling on kernel threads! Here is an example of sampling the current system call and the kernel stack (if the process is in a system call) and aggregating them into a semi-hierarchical profile, poor man's style:

[root@oel6 ~]# export LC_ALL=C ; for i in {1..100} ; do cat /proc/29797/syscall | awk '{ print $1 }' ; cat

    69 running

    1 ffffff81534c83

    2 ffffff81534820

    6 247

    25 180

    100 0xffffffffffffffff

    1 thread_group_cputime

    27 sysenter_dispatch

    3 ia32_sysret

    1 task_sched_runtime

    27 sys32_pread

    1 compat_sys_io_submit

    2 compat_sys_io_getevents

    27 sys_pread64

    2 sys_io_getevents

    1 do_io_submit

    27 vfs_read

    2 read_events

    1 io_submit_one

    27 do_sync_read

    1 aio_run_iocb

    27 generic_file_aio_read

    1 aio_rw_vect_retry

    27 generic_file_read_iter

    1 generic_file_aio_read

    27 mapping_direct_IO

    1 generic_file_read_iter

    27 blkdev_direct_IO

    27 __blockdev_direct_IO

    27 do_blockdev_direct_IO

    27 dio_post_submission

    27 dio_await_completion

    6 blk_flush_plug_list
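The command line above got cut off in this copy of the post. Under the assumption that it follows the pattern described in the surrounding text (the original used the os_explain helper script, which is mentioned in the comments below, to translate and indent the stack lines), an equivalent sampling loop would look roughly like this sketch; the PID, sample count and sleep interval are arbitrary, and this raw version prints function+offset names instead of cleaned-up ones:

# sample the current syscall number and kernel stack of PID 29797 a hundred times,
# then aggregate the samples into a frequency-sorted, poor man's profile
export LC_ALL=C
for i in {1..100}; do
    awk '{ print $1 }' /proc/29797/syscall    # syscall number, or "running" if in userland
    awk '{ print $2 }' /proc/29797/stack      # kernel stack function names
    sleep 0.1
done | sort | uniq -c | sort -nr | head -40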

This gives you a very rough sample of where in the kernel this process spent its time, if anywhere. The system call numbers are listed separately from the stack functions; "running" above means that the process was in userland (not in a system call) during the sample. So, 69% of the time during the sampling this process was running userland code, 25% of the time it was in syscall #180 (nfsservctl on my system) and 6% in syscall #247 (waitid). Two more entries in this output have for some reason not been translated to a function name. Well, these addresses must stand for something, so let's try our luck manually:

    [root@oel6 ~]# cat /proc/kallsyms | grep -i ffffff81534c83

    ffffffff81534c83 t ia32_sysret

So it looks like these samples are about a 32-bit architecture-compatible system call return function, but as this function isn't a system call itself (just an internal helper function), perhaps that's why the /proc stack routine didn't translate it. Perhaps these addresses show up because there's no read consistency in the /proc views: while the owning threads modify these memory structures and entries, the reading threads may sometimes read data in flux.

Let's check the other address too:

    [root@oel6 ~]# cat /proc/kallsyms | grep -i ffffff81534820

    [root@oel6 ~]#

Nothing? Well, the troubleshooting doesn't have to stop here; let's see whether there's anything interesting around that address. I just removed a couple of trailing characters from the address:

[root@oel6 ~]# cat /proc/kallsyms | grep -i ffffff815348

    ffffffff8153480d t sysenter_do_call

    ffffffff81534819 t sysenter_dispatch

    ffffffff81534847 t sysexit_from_sys_call

    ffffffff8153487a t sysenter_auditsys

    ffffffff815348b9 t sysexit_audit

Looks like the sysenter_dispatch function starts just 7 bytes before the original address displayed by /proc/PID/stack (remember these are hex numbers: 0x820 - 0x819 = 7). So we had possibly already executed a few bytes of it (possibly NOP operations left in there for dynamic tracing probe traps). But nevertheless, it looks like these stack samples were in the sysenter_dispatch function, which isn't a system call itself, but a system call helper function.
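Finding the nearest symbol below an arbitrary address can also be done without manually trimming hex digits. Since the kallsyms addresses are fixed-width hex, a plain text sort keeps them in numeric order, so you can sort your address into the list and look at its neighbors; a small sketch of that idea (my addition):

# show the kernel symbols immediately below and above a given address
# (run as root, otherwise the kallsyms addresses may show up as all zeros)
ADDR=ffffffff81534820
( cat /proc/kallsyms ; echo "$ADDR --- our address ---" ) | sort | grep -C1 "^$ADDR"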

Note that there are different types of stack samplers. Linux perf, the OProfile tools and Solaris DTrace's profile provider sample the instruction pointer (EIP on 32-bit Intel, RIP on x64) and stack pointer (ESP on 32-bit, RSP on 64-bit) registers of the thread currently running on CPU and walk the stack backwards from there. Therefore, these tools only show threads which happen to be on CPU at the moment of sampling! This is of course perfect when troubleshooting high CPU utilization issues, but not useful at all when troubleshooting hung processes or overly long process sleeps/waits.

Tools like pstack (on Linux, Solaris, HP-UX), procstack (on AIX), ORADEBUG SHORT_STACK and just reading the /proc/PID/stack pseudofile provide a good addition (not a replacement) to the CPU profiling tools, as they access the probed process's memory regardless of its CPU scheduling state and read the stack from there. If the process is sleeping and not on CPU, the starting point of the stack can be read from the stored context, which is saved into kernel memory by the OS scheduler during the context switch.
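For comparison, an on-CPU profile with perf (my addition, assuming the perf package is installed) would be collected like this; against our hung find process it would have captured essentially nothing, precisely because the process never got onto a CPU:

# sample on-CPU stacks of PID 27288 for 10 seconds, then show the report
perf record -g -p 27288 -- sleep 10
perf report --stdio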

Of course, the CPU event profiling tools can often do much more than just a pstack: OProfile, perf and even DTrace's cpc provider (in Solaris 11) can set up and sample CPU-internal performance counters to estimate things like the number of CPU cycles stalled waiting for main memory, the number of L1/L2 cache misses, and so on. Better read what Kevin Closson has to say on these topics (Perf, Oprofile).

Enjoy :-)


Stoyan says:
FEBRUARY 21, 2013 AT 1:48 PM
Just awesome, thank you!

Panagiotis Papadomitsos says:
MARCH 7, 2013 AT 4:50 AM
This must be one of the best hardcore troubleshooting articles I've ever read. Thanks!

Renato Araujo says:
DECEMBER 18, 2013 AT 7:10 AM
+1 Many thanks!

Mani says:
MARCH 7, 2013 AT 1:53 PM
Thank you Tanel for sharing, it was an awesome ride through Linux kernel-land!

Jared says:
APRIL 2, 2013 AT 8:09 PM
Thanks Tanel, this article has been on my todo list for some time, and I finally read it today. Excellent stuff, I need to save it somewhere I can easily find it.
My favorite type of work is troubleshooting down to the level where there is no further visibility, learning new OS bits and tools, and writing tools as necessary as I go. Either find the issue, or be able to point to the vendor and prove it is their problem. :)
That threshold of visibility is at a somewhat lower level with *nix than with the Oracle database. Although, the OS tools in Linuxland make it possible to delve deeper into Oracle issues than would be possible otherwise.

LAKS says:
APRIL 9, 2013 AT 6:19 AM
Late to post a query here, but still curious to know the environment details where this hang was seen. I mean, what virtualization product was used (KVM or Xen) to create the Linux guest VM (32 or 64 bit)? If possible, please reply. Thanks in advance.

Tanel Poder says:
APRIL 9, 2013 AT 8:56 AM
I was using VMware Fusion 5 on my Mac server.

Tanel Poder says:
APRIL 9, 2013 AT 8:57 AM
But you can use any VM for this. VirtualBox is easy to use and free.

Jeyanthan says:
APRIL 13, 2013 AT 12:10 PM
That was a heck of a troubleshooting session! Thanks for your detailed analysis walkthrough, Tanel. If I were you, I'd have done kill -9 27288 right away ;)

Ravikumar says:
DECEMBER 30, 2013 AT 3:31 AM
Is it possible to get the number of kernel memory pages?

Tanel Poder says:
DECEMBER 30, 2013 AT 1:03 PM
There are different types of (and reasons why) kernel memory pages get allocated. On Linux, just cat /proc/meminfo (some homework is needed to understand which types are for kernel purposes).

Aleksey says:
JANUARY 17, 2014 AT 7:30 AM
You refer to /home/oracle/os_explain in your poor-man script. What is the content of it?
Also, ffffffff81534819 t sysenter_dispatch doesn't start just 1 byte before ffffff81534820, since these are hex numbers (so it is 7 bytes, actually).

Tanel Poder says:
JANUARY 17, 2014 AT 11:00 AM
Check this: http://blog.tanelpoder.com/files/scripts/tools/unix/os_explain
Ha, thanks for catching my hex to dec conversion error :-)

Adi To says:
JULY 6, 2014 AT 6:27 PM
Dear Tanel, I am late to the party, but chanced upon your blog post: very much appreciate you sharing your tricks. Very useful for me, delving into Linux from Unix-land. Regards

Frank Scholten says:
JULY 25, 2014 AT 10:45 AM
Hi Tanel, great article!
How come you could kill the process at the end of your article? I thought it was in a disk sleep state?
Cheers, Frank

Tanel Poder says:
JULY 31, 2014 AT 8:55 AM
Hi Frank, the D means uninterruptible sleep (not only disk sleep). The WCHAN column shows in which kernel function it's stuck, and in this case it's rpc_wait, which has something to do with interprocess communication; the later diagnosis showed it was the NFS codepath. So it wasn't stuck somewhere in the disk block IO layer, it was waiting for a reply from another process via RPC.