Andrew Mobbs (mobbsy) wrote,

Argh. To record this one for posterity.

Take one NFS mount on a RHEL 3 box.

It used to Just Work.

One day, "ls -l" consistently hangs, as do "mv" and "cp". Many other things still work, including "ls", "echo hello > foo", and "lsof".

It turns out that the call that is hanging is "getxattr".

Another RHEL 3 box, installed from the same image, doesn't have this problem. There, "ls -l" on the same mount doesn't bother calling getxattr.

The processes on the problem machine can be recovered with the following procedure:
kill -9 <PID>
(process is still alive, still hung on disk wait)
umount -f <MOUNTPOINT>
(errors claiming fs is busy, but hung processes die)
umount -f <MOUNTPOINT>
(yes, again, but no errors this time)
(Ta-da - filesystem reappears, hung processes are dead. However, all commands that call getxattr still exhibit the same problem.)

Attempting a forced unmount without the kill doesn't do anything useful.

Rinse - repeat - get same result time and again - fiddle - write one-line test program for bug report - everything mysteriously starts working again. Even the getxattr test program just returns EOPNOTSUPP rather than hanging.
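
For anyone wanting to poke at the same thing, here is a rough sketch of that sort of getxattr probe. It is not the original test program. It is written in Python, it assumes a Linux box where os.getxattr is available, and the attribute name is my guess at what "ls -l" asks for (the POSIX ACL attribute). The helper name probe_getxattr is made up for illustration.

```python
import errno
import os

# Assumption: "ls -l" asks for the POSIX ACL attribute; the original
# post does not say which attribute name was involved.
ACL_ATTR = "system.posix_acl_access"


def probe_getxattr(path, name=ACL_ATTR):
    """Call getxattr once on path and return (errno, description).

    Returns (0, "<n> bytes") if the attribute was read successfully.
    On a hung mount this call simply never returns.
    """
    try:
        value = os.getxattr(path, name)
    except OSError as exc:
        # Hand back the raw errno so the caller can compare it with
        # errno.EOPNOTSUPP, errno.ENODATA and so on.
        return exc.errno, os.strerror(exc.errno)
    return 0, "%d bytes" % len(value)
```

Calling probe_getxattr("<MOUNTPOINT>/somefile") on a well-behaved mount with no xattr support should give back errno.EOPNOTSUPP as the first element, which is the outcome described above once things started working again. On the broken machine the call would be expected to sit in disk wait instead.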


[Oh, and for u.c.o.l readers: no, this is a different NFS problem to the one I was talking about there]

