Peeking into Linux kernel-land using /proc filesystem for quick'n'dirty troubleshooting
Tanel Poder's Performance & Troubleshooting Blog, February 21, 2013



This blog entry is about modern Linuxes. In other words, RHEL6 equivalents with 2.6.3x kernels, and not the ancient RHEL5 with its 2.6.18 kernel (wtf?!), which unfortunately is still the most common in enterprises. And no, I'm not going to use kernel debuggers or SystemTap scripts here, just plain old cat /proc/PID/xyz commands against some useful /proc filesystem entries.

Here's one systematic troubleshooting example I reproduced on my laptop. A DBA was wondering why their find command had been running much slower lately, without returning any results for a while. Knowing the environment, we had a hunch, but I got asked what the systematic approach would be for troubleshooting this already-ongoing problem, right now.

Luckily the system was running OEL6, so it had a pretty new kernel: the 2.6.39 UEK2, in fact.

So, let's do some troubleshooting. First let's see whether that find process is still alive:

    [root@oel6 ~]# ps -ef | grep find

    root 27288 27245 4 11:57 pts/0 00:00:01 find . -type f

    root 27334 27315 0 11:57 pts/1 00:00:00 grep find

Yep, it's there: PID 27288 (I'll use that PID throughout this troubleshooting example).

Let's start from the basics and take a quick look at what the bottleneck for this process is. If it's not blocked by anything (for example, if it reads everything it needs from cache), it should be 100% on CPU. If it's bottlenecked by some IO or contention issue, the CPU usage should be lower or even 0%.

    [root@oel6 ~]# top -cbp 27288

    top - 11:58:15 up 7 days, 3:38, 2 users, load average: 1.21, 0.65, 0.47

    Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie

    Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

    Mem: 2026460k total, 1935780k used, 90680k free, 64416k buffers

    Swap: 4128764k total, 251004k used, 3877760k free, 662280k cached

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

    27288 root 20 0 109m 1160 844 D 0.0 0.1 0:01.11 find . -type f

Top tells me this process is at 0% CPU, or very close to zero percent (so it gets rounded to 0% in the output). There's an important difference, though, between a process that is completely stuck, without a chance of getting onto CPU at all, and a process which gets out of its wait state every now and then (for example, some polling operation times out periodically and the process chooses to go back to sleep). Top on Linux is not a good enough tool to show that difference for sure, but at least we know that this process is not burning serious amounts of CPU time.
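One cheap way to tell the two cases apart, before reaching for any tracing tools, is to sample the process's cumulative CPU time a few times and see whether it moves at all. This is not from the original post, just a minimal sketch using /proc/PID/stat (fields 14 and 15 are utime and stime in clock ticks; a command name containing spaces would shift the field numbers):

# print the user and system CPU ticks accumulated so far by PID 27288, once a second
for i in {1..5}; do
    awk '{ print "utime:", $14, "stime:", $15 }' /proc/27288/stat
    sleep 1
done

If the numbers do not move between samples, the process is not getting any CPU at all.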

Let's use something else then. Normally when a process seems to be stuck like that (0% CPU usually means the process is stuck in some blocking system call, which causes the kernel to put it to sleep), I run strace on the process to see which system call it is currently stuck in. Also, if the process is not actually completely stuck but returns from a system call and wakes up briefly every now and then, that would show up in strace (as the blocking system call would complete and be entered again a little later):

[root@oel6 ~]# strace -cp 27288

    Process 27288 attached - interrupt to quit

    ^C

    ^Z

    [1]+ Stopped strace -cp 27288

    [root@oel6 ~]# kill -9 %%

    [1]+ Stopped strace -cp 27288

    [root@oel6 ~]#

    [1]+ Killed strace -cp 27288

Oops, the strace command itself got hung too! It didn't print any output for a long time and didn't respond to CTRL+C, so I had to put it into the background with CTRL+Z and kill it from there. So much for easy diagnosis.

Let's try pstack then (on Linux, pstack is just a shell script wrapper around the GDB debugger). While pstack does not see into kernel-land, it will still give us a clue about which system call was requested (as there's usually a corresponding libc library call at the top of the displayed userland stack):

    [root@oel6 ~]# pstack 27288

    ^C

    ^Z

    [1]+ Stopped pstack 27288

    [root@oel6 ~]# kill %%

    [1]+ Stopped pstack 27288

    [root@oel6 ~]#

    [1]+ Terminated pstack 27288

Pstack also got stuck without returning anything!

So we still don't know whether our process is 100% (hopelessly) stuck or just 99.99% stuck (i.e. it keeps going back to sleep), nor where it is stuck.

OK, where else to look? There's one more commonly available thing to check: the process state and WCHAN fields, which can be reported via good old ps (perhaps I should have run this command earlier, to make sure this process isn't a zombie by now):

    [root@oel6 ~]# ps -flp 27288

    F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD

    0 D root 27288 27245 0 80 0 - 28070 rpc_wa 11:57 pts/0 00:00:01 find . -type f

You should run ps multiple times in a row to make sure the process is still in the same state (you don't want to be misled by a single sample taken at an unlucky time), but I'm displaying it here only once for brevity.

The process is in state D (uninterruptible sleep), which is usually disk IO related (as the ps man page also says). And the WCHAN field (the function which caused this sleep/wait) is a bit truncated. I could use a ps option (read the man page) to print it a bit wider, but as this info comes from the proc filesystem anyway, let's query it directly from the source (again, it would be a good idea to sample this multiple times, as we are not yet sure whether our process is completely stuck or just mostly sleeping):

    [root@oel6 ~]# cat /proc/27288/wchan

    rpc_wait_bit_killable

Hmm, this process is waiting for some RPC call. RPC usually means that the process is talking to other processes (either on the local server or even on remote servers). But we still don't know why.
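To guard against the single-unlucky-sample problem mentioned above, the wait channel can be read in a quick loop. This is just a sketch of my own, not part of the original post: a process that merely wakes up occasionally would show its wchan changing (or going empty while briefly on CPU), while a completely stuck one shows the same function every time.

# sample the wait channel of PID 27288 once a second for ten seconds
for i in {1..10}; do
    cat /proc/27288/wchan; echo    # wchan has no trailing newline, hence the echo
    sleep 1
done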

Before we go on to the meat of this article, let's figure out whether this process is completely stuck or not. On modern kernels /proc/PID/status can tell us that. The interesting bits for our purpose are the State field and the context switch counters at the bottom:

[root@oel6 ~]# cat /proc/27288/status

    Name: find

    State: D (disk sleep)

    Tgid: 27288

    Pid: 27288

    PPid: 27245

    TracerPid: 0

    Uid: 0 0 0 0

    Gid: 0 0 0 0

    FDSize: 256

    Groups: 0 1 2 3 4 6 10

    VmPeak: 112628 kB

    VmSize: 112280 kB

    VmLck: 0 kB

    VmHWM: 1508 kB

    VmRSS: 1160 kB

    VmData: 260 kB

    VmStk: 136 kB

    VmExe: 224 kB

    VmLib: 2468 kB

    VmPTE: 88 kB

    VmSwap: 0 kB

    Threads: 1

    SigQ: 4/15831

    SigPnd: 0000000000040000

    ShdPnd: 0000000000000000

    SigBlk: 0000000000000000

    SigIgn: 0000000000000000

    SigCgt: 0000000180000000

    CapInh: 0000000000000000

    CapPrm: ffffffffffffffff

    CapEff: ffffffffffffffff

    CapBnd: ffffffffffffffff

    Cpus_allowed: ffffffff,ffffffff

    Cpus_allowed_list: 0-63

Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,

    Mems_allowed_list: 0

    voluntary_ctxt_switches: 9950

    nonvoluntary_ctxt_switches: 17104

The process is in state D, disk sleep (uninterruptible sleep). Also look at the voluntary_ctxt_switches and nonvoluntary_ctxt_switches numbers: they tell you how many times the process has been taken off the CPU (or put back on it). Then run the same command again a few seconds later and see whether those numbers increase. In my case they did not, so I could conclude that the process was completely stuck (well, at least during the few seconds between the two commands). So I can be more confident now that this process is completely stuck (and not just flying under the radar by burning 0.04% of CPU all the time).
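The same check can be scripted so you do not have to eyeball two separate invocations; a minimal sketch (my addition, not from the original post):

# watch the context switch counters of PID 27288 for a few seconds;
# if neither counter grows, the process is not being scheduled at all
for i in {1..5}; do
    grep ctxt_switches /proc/27288/status
    echo ---
    sleep 1
done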

By the way, there are two more locations where you can get the context switch counts (and the second one works on ancient kernels as well):

[root@oel6 ~]# cat /proc/27288/sched

    find (27288, #threads: 1)

    ---------------------------------------------------------

    se.exec_start : 617547410.689282

    se.vruntime : 2471987.542895

    se.sum_exec_runtime : 1119.480311

    se.statistics.wait_start : 0.000000

    se.statistics.sleep_start : 0.000000

    se.statistics.block_start : 617547410.689282

    se.statistics.sleep_max : 0.089192

    se.statistics.block_max : 60082.951331

    se.statistics.exec_max : 1.110465

    se.statistics.slice_max : 0.334211

    se.statistics.wait_max : 0.812834

    se.statistics.wait_sum : 724.745506

    se.statistics.wait_count : 27211

    se.statistics.iowait_sum : 0.000000

    se.statistics.iowait_count : 0

    se.nr_migrations : 312

    se.statistics.nr_migrations_cold : 0

    se.statistics.nr_failed_migrations_affine: 0

    se.statistics.nr_failed_migrations_running: 96

    se.statistics.nr_failed_migrations_hot: 1794

    se.statistics.nr_forced_migrations : 150

    se.statistics.nr_wakeups : 18507

    se.statistics.nr_wakeups_sync : 1

    se.statistics.nr_wakeups_migrate : 155

    se.statistics.nr_wakeups_local : 18504

    se.statistics.nr_wakeups_remote : 3

    se.statistics.nr_wakeups_affine : 155

    se.statistics.nr_wakeups_affine_attempts: 158

    se.statistics.nr_wakeups_passive : 0

    se.statistics.nr_wakeups_idle : 0

    avg_atom : 0.041379

    avg_per_cpu : 3.588077

nr_switches : 27054

    nr_voluntary_switches : 9950

    nr_involuntary_switches : 17104

    se.load.weight : 1024

    policy : 0

    prio : 120

    clock-delta : 72

Here you need to look at the nr_switches number (which equals nr_voluntary_switches + nr_involuntary_switches).

The total nr_switches is 27054 in the above output, and this is also what the third field in /proc/PID/schedstat shows:

    [root@oel6 ~]# cat /proc/27288/schedstat

    1119480311 724745506 27054

And it doesn't increase :-/
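For reference, the three fields in /proc/PID/schedstat are the time spent on CPU, the time spent waiting on a runqueue (both in nanoseconds), and the number of timeslices run, which is why they line up with se.sum_exec_runtime, se.statistics.wait_sum and nr_switches in the output above. A convenient way to re-check it, assuming the watch utility is available, is:

# highlight any digits that change between 1-second refreshes
watch -d -n1 cat /proc/27288/schedstat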

So, it looks like our process is pretty stuck :) Strace and pstack were useless. They use the ptrace() system call for attaching to the process and peeking into its memory, but as the process was hopelessly stuck, very likely in some system call, I guess the ptrace() attach got hung itself. (By the way, I once tried stracing the strace process itself while it attached to a target process, and it crashed the target. You have been warned :)

So, how do we see which system call we are stuck in, without strace or pstack? Luckily I was running on a modern kernel: say hi to /proc/PID/syscall!

    [root@oel6 ~]# cat /proc/27288/syscall

    262 0xffffffffffffff9c 0x20cf6c8 0x7fff97c52710 0x100 0x100 0x676e776f645f616d 0x7fff97c52658 0x390e2da8ea


Ok, WTF am I supposed to do with that?

Well, these numbers usually stand for something. If it's 0xAVeryBigNumber, then it's usually a memory address (and pmap-like tools could be used to see where it points); if it's a small number, then it's perhaps an index into some array, such as the open file descriptor array (which you can read from /proc/PID/fd), or, in this case, since we are dealing with system calls, the number of the system call your process is currently in. So, this process is stuck in system call #262!

Note that the system call numbers may differ between OS type, OS version and platform, so you should read them from the proper .h file on your flavor of OS. Usually searching for syscall* under /usr/include is a good starting point. On my version and platform (64-bit) of Linux, the system calls are defined in /usr/include/asm/unistd_64.h:

    [root@oel6 ~]# grep 262 /usr/include/asm/unistd_64.h

    #define __NR_newfstatat 262

Here you go! Syscall 262 is something called newfstatat. Just run man to find out what it is. Here's a hint regarding system call names: if the man page doesn't find a call, then try it without possible suffixes and prefixes (for example, man pread instead of man pread64); in this case, run man without the "new" prefix: man fstatat. Or just google it.
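Two small refinements to the lookup, neither of which is from the original post: a plain grep for 262 would also match numbers like 2620, so an exact word match is safer, and on systems that have the audit userspace package installed, the ausyscall helper can do the translation for you (I am assuming its availability here):

# exact-number lookup in both the 64-bit and 32-bit syscall tables,
# since the numbers differ per ABI
grep -w 262 /usr/include/asm/unistd_64.h /usr/include/asm/unistd_32.h

# alternative, if the audit package is installed
ausyscall 262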

Anyway, this system call, new-fstat-at, allows you to read file properties much like the usual stat system call, and we are stuck in this file metadata reading operation. So we've gotten one step further. But we still don't know why we are stuck there!

Well, say hello to my little friend, /proc/PID/stack, which allows you to read the kernel stack backtrace of a process by just cat-ing a proc file!

[root@oel6 ~]# cat /proc/27288/stack

    [] rpc_wait_bit_killable+0x24/0x40 [sunrpc]

    [] __rpc_execute+0xf5/0x1d0 [sunrpc]

    [] rpc_execute+0x43/0x50 [sunrpc]

    [] rpc_run_task+0x75/0x90 [sunrpc]

    [] rpc_call_sync+0x42/0x70 [sunrpc]

    [] nfs3_rpc_wrapper.clone.0+0x35/0x80 [nfs]

    [] nfs3_proc_getattr+0x47/0x90 [nfs]

    [] __nfs_revalidate_inode+0xcc/0x1f0 [nfs]

    [] nfs_revalidate_inode+0x36/0x60 [nfs]

    [] nfs_getattr+0x5f/0x110 [nfs]

    [] vfs_getattr+0x4e/0x80

    [] vfs_fstatat+0x70/0x90

    [] sys_newfstatat+0x24/0x50

    [] system_call_fastpath+0x16/0x1b

    [] 0xffffffffffffffff

The top function is where in kernel code we are stuck right now, which is exactly what the WCHAN output showed us (note that there are actually some more functions on top of it, such as the kernel schedule() function which puts processes to sleep or back on CPU as needed, but those are not shown here, probably because they are a result of this wait condition itself, not the cause).

Thanks to the full kernel-land stack of this task, we can read it from the bottom up to understand how we ended up calling this rpc_wait_bit_killable function at the top, which in turn called the scheduler and put us into sleep mode.

The system_call_fastpath at the bottom is a generic kernel system call handler function, which called the kernel code for the newfstatat syscall we issued (sys_newfstatat). Then, continuing upwards through the child functions, we see a bunch of NFS functions in the stack! This is 100% undeniable proof that we are somewhere under the NFS codepath. I'm not saying in the NFS codepath, because when you continue reading upwards past the NFS functions, you see that the topmost NFS function has in turn called an RPC function (rpc_call_sync) to talk to some other process, in this case probably the [kworker/N:N], [nfsiod], [lockd] or [rpciod] kernel IO threads. And we are never getting a reply from those threads for some reason (the usual suspects would be a dropped network connection, dropped packets or just network connectivity issues).

To see whether any of those helper threads are stuck in network-related code, you could sample their kernel stacks too, although the kworkers do much more than just NFS RPC communication. During a separate experiment (just copying large files over NFS), I caught one of the kworkers waiting in networking code:

[root@oel6 proc]# for i in `pgrep worker` ; do ps -fp $i ; cat /proc/$i/stack ; done

    UID PID PPID C STIME TTY TIME CMD

    root 53 2 0 Feb14 ? 00:04:34 [kworker/1:1]

    [] __cond_resched+0x2a/0x40

    [] lock_sock_nested+0x35/0x70

    [] tcp_sendmsg+0x29/0xbe0

    [] inet_sendmsg+0x48/0xb0

    [] sock_sendmsg+0xef/0x120

    [] kernel_sendmsg+0x41/0x60

    [] xs_send_kvec+0x8e/0xa0 [sunrpc]

    [] xs_sendpages+0x173/0x220 [sunrpc]

    [] xs_tcp_send_request+0x5d/0x160 [sunrpc]

    [] xprt_transmit+0x83/0x2e0 [sunrpc]

    [] call_transmit+0xa8/0x130 [sunrpc]

    [] __rpc_execute+0x66/0x1d0 [sunrpc]

    [] rpc_async_schedule+0x15/0x20 [sunrpc]

    [] process_one_work+0x13e/0x460

    [] worker_thread+0x17c/0x3b0

    [] kthread+0x96/0xa0

    [] kernel_thread_helper+0x4/0x10

It should be possible to enable kernel tracing to find out exactly which kernel thread communicates with which, but I won't go there in this post, as this is supposed to be a practical (and simple :) troubleshooting exercise!

Anyway, thanks to the very easy kernel stack sampling in modern Linux kernels (I don't know the exact version in which it was introduced), we managed to systematically figure out where the find command was stuck: in NFS code in the Linux kernel. And when you have NFS-related hangs, the most usual suspects are network issues. If you are wondering how I reproduced this problem: I mounted an NFS volume from a VM, started the find command and then just suspended the VM. This produces symptoms similar to a network (configuration, firewall) issue where a connection is silently dropped without notifying the TCP endpoints, or where packets just don't go through for some reason.
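When the stack points at NFS/RPC waits like this, a few quick client-side checks help confirm the network suspicion before you start capturing packets. This is my addition rather than part of the original walkthrough, and the server hostname below is a placeholder:

# which NFS mounts exist and with what options (hard vs. soft, proto, timeo)
grep nfs /proc/mounts

# per-server NFS client state, available on newer kernels
cat /proc/fs/nfsfs/servers 2>/dev/null

# does the NFS server still answer RPC requests over TCP? (replace the hostname)
rpcinfo -t nfs-server.example.com nfs 3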

As the top function in the stack was one of the "killable" (safe to interrupt) wait functions, rpc_wait_bit_killable, we can kill the process with kill -9:

    [root@oel6 ~]# ps -fp 27288

    UID PID PPID C STIME TTY TIME CMD

    root 27288 27245 0 11:57 pts/0 00:00:01 find . -type f

    [root@oel6 ~]# kill -9 27288

    [root@oel6 ~]# ls -l /proc/27288/stack

    ls: cannot access /proc/27288/stack: No such file or directory

    [root@oel6 ~]# ps -fp 27288

    UID PID PPID C STIME TTY TIME CMD

    [root@oel6 ~]#

The process is gone.

Note that since /proc/PID/stack looks like just a plain-text proc file, you can also do poor man's stack profiling on kernel threads! Here is an example of sampling the current system call and the kernel stack (if the process is in a system call) and aggregating them into a semi-hierarchical profile, poor man's style:

[root@oel6 ~]# export LC_ALL=C ; for i in {1..100} ; do cat /proc/29797/syscall | awk '{ print $1 }' ; cat

    69 running

    1 ffffff81534c83

    2 ffffff81534820

    6 247

    25 180

    100 0xffffffffffffffff

    1 thread_group_cputime

    27 sysenter_dispatch

    3 ia32_sysret

    1 task_sched_runtime

    27 sys32_pread

    1 compat_sys_io_submit

    2 compat_sys_io_getevents

    27 sys_pread64

    2 sys_io_getevents

    1 do_io_submit

    27 vfs_read

    2 read_events

    1 io_submit_one

    27 do_sync_read

    1 aio_run_iocb

    27 generic_file_aio_read

    1 aio_rw_vect_retry

    27 generic_file_read_iter

    1 generic_file_aio_read

    27 mapping_direct_IO

    1 generic_file_read_iter

    27 blkdev_direct_IO

    27 __blockdev_direct_IO

    27 do_blockdev_direct_IO

    27 dio_post_submission

    27 dio_await_completion

    6 blk_flush_plug_list
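The command line above got cut off in this copy of the post. Under the assumption that it follows the pattern described in the surrounding text (the original used the os_explain helper script, which is mentioned in the comments below, to translate and indent the stack lines), an equivalent sampling loop would look roughly like this sketch; the PID, sample count and sleep interval are arbitrary, and this raw version prints function+offset names instead of cleaned-up ones:

# sample the current syscall number and kernel stack of PID 29797 a hundred times,
# then aggregate the samples into a frequency-sorted, poor man's profile
export LC_ALL=C
for i in {1..100}; do
    awk '{ print $1 }' /proc/29797/syscall    # syscall number, or "running" if in userland
    awk '{ print $2 }' /proc/29797/stack      # kernel stack function names
    sleep 0.1
done | sort | uniq -c | sort -nr | head -40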

This gives you a very rough sample of where in the kernel this process spent its time, if anywhere. The system call numbers are listed separately from the stack functions; "running" above means that the process was in userland (not in a system call) during the sample. So, 69% of the time during the sampling this process was running userland code, 25% of the time it was in syscall #180 (nfsservctl on my system) and 6% in syscall #247 (waitid). Two more entries in this output have for some reason not been translated to a function name. Well, these addresses must stand for something, so let's try our luck manually:

    [root@oel6 ~]# cat /proc/kallsyms | grep -i ffffff81534c83

    ffffffff81534c83 t ia32_sysret

So it looks like these samples are about a 32-bit architecture-compatible system call return function, but as this function isn't a system call itself (just an internal helper function), perhaps that's why the /proc stack routine didn't translate it. Perhaps these addresses show up because there's no read consistency in the /proc views: while the owning threads modify these memory structures and entries, the reading threads may sometimes read data in flux.

Let's check the other address too:

    [root@oel6 ~]# cat /proc/kallsyms | grep -i ffffff81534820

    [root@oel6 ~]#

Nothing? Well, the troubleshooting doesn't have to stop here; let's see whether there's anything interesting around that address. I just removed a couple of trailing characters from the address:

[root@oel6 ~]# cat /proc/kallsyms | grep -i ffffff815348

    ffffffff8153480d t sysenter_do_call

    ffffffff81534819 t sysenter_dispatch

    ffffffff81534847 t sysexit_from_sys_call

    ffffffff8153487a t sysenter_auditsys

    ffffffff815348b9 t sysexit_audit

Looks like the sysenter_dispatch function starts just 7 bytes before the original address displayed by /proc/PID/stack (remember these are hex numbers: 0x820 - 0x819 = 7). So we had possibly already executed a few bytes of it (possibly NOP operations left in there for dynamic tracing probe traps). But nevertheless, it looks like these stack samples were in the sysenter_dispatch function, which isn't a system call itself, but a system call helper function.
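Finding the nearest symbol below an arbitrary address can also be done without manually trimming hex digits. Since the kallsyms addresses are fixed-width hex, a plain text sort keeps them in numeric order, so you can sort your address into the list and look at its neighbors; a small sketch of that idea (my addition):

# show the kernel symbols immediately below and above a given address
# (run as root, otherwise the kallsyms addresses may show up as all zeros)
ADDR=ffffffff81534820
( cat /proc/kallsyms ; echo "$ADDR --- our address ---" ) | sort | grep -C1 "^$ADDR"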

Note that there are different types of stack samplers. Linux perf, the OProfile tools and Solaris DTrace's profile provider sample the instruction pointer (EIP on 32-bit Intel, RIP on x64) and stack pointer (ESP on 32-bit, RSP on 64-bit) registers of the thread currently running on CPU and walk the stack backwards from there. Therefore, these tools only show threads which happen to be on CPU at the moment of sampling! This is of course perfect when troubleshooting high CPU utilization issues, but not useful at all when troubleshooting hung processes or overly long process sleeps/waits.

Tools like pstack (on Linux, Solaris, HP-UX), procstack (on AIX), ORADEBUG SHORT_STACK and just reading the /proc/PID/stack pseudofile provide a good addition (not a replacement) to the CPU profiling tools, as they access the probed process's memory regardless of its CPU scheduling state and read the stack from there. If the process is sleeping and not on CPU, the starting point of the stack can be read from the stored context, which is saved into kernel memory by the OS scheduler during the context switch.
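For comparison, an on-CPU profile with perf (my addition, assuming the perf package is installed) would be collected like this; against our hung find process it would have captured essentially nothing, precisely because the process never got onto a CPU:

# sample on-CPU stacks of PID 27288 for 10 seconds, then show the report
perf record -g -p 27288 -- sleep 10
perf report --stdio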

Of course, the CPU event profiling tools can often do much more than just a pstack: OProfile, perf and even DTrace's cpc provider (in Solaris 11) can set up and sample CPU-internal performance counters to estimate things like the number of CPU cycles stalled waiting for main memory, the number of L1/L2 cache misses, and so on. Better read what Kevin Closson has to say on these topics (Perf, Oprofile).

Enjoy :-)


Stoyan says:
FEBRUARY 21, 2013 AT 1:48 PM
Just awesome, thank you!

Panagiotis Papadomitsos says:
MARCH 7, 2013 AT 4:50 AM
This must be one of the best hardcore troubleshooting articles I've ever read. Thanks!

Renato Araujo says:
DECEMBER 18, 2013 AT 7:10 AM
+1 Many thanks!

Mani says:
MARCH 7, 2013 AT 1:53 PM
Thank you Tanel for sharing, it was an awesome ride through Linux kernel-land!

Jared says:
APRIL 2, 2013 AT 8:09 PM
Thanks Tanel, this article has been on my todo list for some time, and I finally read it today. Excellent stuff, I need to save it somewhere I can easily find it.
My favorite type of work is troubleshooting down to the level where there is no further visibility, learning new OS bits and tools, and writing tools as necessary as I go. Either find the issue, or be able to point to the vendor and prove it is their problem. :)
That threshold of visibility is at a somewhat lower level with *nix than with the Oracle database. Although, the OS tools in Linuxland make it possible to delve deeper into Oracle issues than would be possible otherwise.

LAKS says:
APRIL 9, 2013 AT 6:19 AM
Late to post a query here, but still curious to know the environment details where this hang was seen. I mean, what virtualization product was used (KVM or Xen) to create the Linux guest VM (32 or 64 bit)? If possible, please reply. Thanks in advance.

Tanel Poder says:
APRIL 9, 2013 AT 8:56 AM
I was using VMware Fusion 5 on my Mac server.

Tanel Poder says:
APRIL 9, 2013 AT 8:57 AM
But you can use any VM for this. VirtualBox is easy to use and free.

Jeyanthan says:
APRIL 13, 2013 AT 12:10 PM
That was a heck of a troubleshooting session! Thanks for your detailed analysis walkthrough, Tanel. If I were you, I'd have done kill -9 27288 right away ;)

Ravikumar says:
DECEMBER 30, 2013 AT 3:31 AM
Is it possible to get the number of kernel memory pages?

Tanel Poder says:
DECEMBER 30, 2013 AT 1:03 PM
There are different types of (and reasons why) kernel memory pages get allocated. On Linux, just cat /proc/meminfo (some homework is needed to understand which types are for kernel purposes).

Aleksey says:
JANUARY 17, 2014 AT 7:30 AM
You refer to /home/oracle/os_explain in your poor-man script. What is the content of it?
Also, ffffffff81534819 t sysenter_dispatch doesn't start just 1 byte before ffffff81534820, since these are hex numbers (so it is 7 bytes, actually).

Tanel Poder says:
JANUARY 17, 2014 AT 11:00 AM
Check this: http://blog.tanelpoder.com/files/scripts/tools/unix/os_explain
Ha, thanks for catching my hex to dec conversion error :-)

Adi To says:
JULY 6, 2014 AT 6:27 PM
Dear Tanel, I am late to the party, but chanced upon your blog post: very much appreciate you sharing your tricks. Very useful for me, delving into Linux from Unix-land. Regards

Frank Scholten says:
JULY 25, 2014 AT 10:45 AM
Hi Tanel, great article!
How come you could kill the process at the end of your article? I thought it was in a disk sleep state?
Cheers, Frank

Tanel Poder says:
JULY 31, 2014 AT 8:55 AM
Hi Frank, the D means uninterruptible sleep (not only disk sleep). The WCHAN column shows in which kernel function it's stuck, and in this case it's rpc_wait, which has something to do with interprocess communication; the later diagnosis showed it was the NFS codepath. So it wasn't stuck somewhere in the disk block IO layer, it was waiting for a reply from another process via RPC.