Gentoo Forums
nfsv4 has been flakey this spring... [SOLVED]
depontius
Advocate


Joined: 05 May 2004
Posts: 3509

Posted: Sat May 11, 2013 12:57 am    Post subject: nfsv4 has been flakey this spring... [SOLVED]

I've been serving /home over nfsv4 for several years now, and the only problem has been the firefox/thunderbird sqlite databases, which from a little searching appears to be a known problem. I've now got those directories symlinked to local disk, and the 30-second hangs disappeared.

This spring things changed. I've been getting "stale nfs file handle" messages occasionally, refusing to allow firefox and/or thunderbird to start. In addition they occasionally can't start from the xfce buttons, and it's time to have my wife (poor WAF here) start typing into an xterm. Sometimes even that quits working. Once it's bad that particular mechanism stays bad. Sometimes a new xterm will work, sometimes logging out and back in will work, sometimes it just takes a reboot. It seems almost like a bit-rot type of thing, except if I had bad memory I would expect it to manifest itself in crashes too, and it hasn't. I'm also running SMART, and it's not complaining about bad sectors. Nothing is appearing in my logfiles, either.

Then today I was doing weekly service, and I remembered the idmapd discussions I'd participated in earlier. To be honest, I still haven't done anything about this yet. Tonight I saw this topic again, noticed my kernel config and USE flags, and figured I'd post here.

So in general:
Code:
# CONFIG_NFS_USE_LEGACY_DNS is not set
and
[ebuild   R    ] net-fs/nfs-utils-1.2.6  USE="caps nfsv4 tcpd -ipv6 -kerberos -nfsdcld -nfsidmap -nfsv41 (-selinux)" 0 kB

So this looks as if I've "automagically" dropped some sort of legacy support, by not tweaking to include it.
It also looks as if I'm not including the new nfsidmap support, by leaving that USE flag unset. But I'm not sure if that's what's going on with the kernel config option I've left un-checked. The name looks as if it has nothing to do with nfsidmap, but the description is fuzzier.
In addition I'm not sure what this "nfsdcld" (client tracking daemon) is all about.
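
For anyone wanting to check the same option on their own box, here is a minimal sketch of reading the state of one option out of kernel config text (for example the contents of /usr/src/linux/.config). The function name `kernel_option_state` is made up for illustration; it only assumes the usual `CONFIG_FOO=value` and `# CONFIG_FOO is not set` line formats.

```python
import re


def kernel_option_state(config_text, option):
    """Return the assigned value ('y', 'm', ...), 'not set', or None if absent."""
    set_re = re.compile(rf"^{re.escape(option)}=(.*)$")
    unset_line = f"# {option} is not set"
    for line in config_text.splitlines():
        line = line.strip()
        if line == unset_line:
            # Explicitly disabled, as in the config line quoted above.
            return "not set"
        match = set_re.match(line)
        if match:
            return match.group(1)
    # The option does not appear in this config text at all.
    return None
```

A result of None means the option is not mentioned in the file at all, which is different from it being listed as "not set".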

Do I want to activate the nfsidmap USE flags? (I see that it's defaulted to on for nfs-utils-1.2.7)
Do I have to "simultaneously" do this on both server and clients?
What about nfsdcld - is that a server feature, and what does it do for me? I've done a little searching, and it appears to keep getting scaled back, but it's still present in nfs-utils-1.2.7.

EDIT - Adding that in https://forums.gentoo.org/viewtopic-t-896828-highlight-idmap.html there is discussion about modifying a file "/etc/request-key.conf", but I don't even have that file. /EDIT

By the way, I run "harmonized uids" on my network, so theoretically I don't need idmapping at all, though I know that if idmapd isn't running, nothing gets properly mapped even if it all does match.
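
Whether the uids really are harmonized can be checked by comparing passwd entries from the server and a client. The following is only a rough sketch under the assumption that each host's `getent passwd` output has been saved as text; `parse_passwd` and `uid_mismatches` are hypothetical helper names, not part of any NFS tooling.

```python
def parse_passwd(text):
    """Map user name -> (uid, gid) from passwd(5)-format text."""
    users = {}
    for line in text.splitlines():
        fields = line.split(":")
        # Skip comments and anything without name:password:uid:gid fields.
        if line.startswith("#") or len(fields) < 4:
            continue
        users[fields[0]] = (int(fields[2]), int(fields[3]))
    return users


def uid_mismatches(server_text, client_text):
    """Return {name: (server_ids, client_ids)} for users whose ids differ."""
    server = parse_passwd(server_text)
    client = parse_passwd(client_text)
    shared = sorted(server.keys() & client.keys())
    return {
        name: (server[name], client[name])
        for name in shared
        if server[name] != client[name]
    }
```

An empty result means every user present on both hosts has the same uid and gid on each.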

Thanks for any hints.
_________________
.sigs waste space and bandwidth


Last edited by depontius on Mon May 27, 2013 10:00 pm; edited 1 time in total
mike155
Advocate


Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

Posted: Sat May 11, 2013 1:23 am

I had exactly the same problem. It took me weeks to find a solution.

In my case, the solution was to edit /etc/exports on the NFS server and to give each export its own unique fsid.

/etc/exports now looks like this:

Code:
/export               -rw,fsid=0,no_subtree_check,sync            192.168.1.3 192.168.1.4 192.168.1.5
/export/home          -rw,fsid=1,nohide,no_subtree_check,async    192.168.1.3 192.168.1.4 192.168.1.5
/export/dir           -rw,fsid=2,nohide,no_subtree_check,sync     192.168.1.3 192.168.1.4 192.168.1.5
/export/dir/subdir    -rw,fsid=3,nohide,no_subtree_check,sync     192.168.1.3 192.168.1.4 192.168.1.5
...


After I restarted the NFS server and all NFS clients, the problems were gone. I haven't had a single 'stale NFS handle' or any hangs or problems since then.
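
To check an exports file for entries that have no fsid, or that share one, something like the following could be used. It is only a minimal sketch and not part of nfs-utils: `check_fsids` is a hypothetical helper name, and it assumes one export per line with no backslash continuations or quoted paths. It accepts both the `-options hosts` and the `host(options)` styles described in exports(5).

```python
import re
from collections import defaultdict

# Matches the value of an fsid= option in either exports(5) style.
FSID_RE = re.compile(r"fsid=([^,\s)]+)")


def check_fsids(exports_text):
    """Return (missing, duplicates) for the export lines in exports_text.

    missing    -- export paths with no fsid= option anywhere on the line
    duplicates -- dict mapping an fsid value to the paths that share it
    """
    missing = []
    paths_by_fsid = defaultdict(list)
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        path = line.split()[0]
        fsids = set(FSID_RE.findall(line))
        if not fsids:
            missing.append(path)
        for fsid in fsids:
            paths_by_fsid[fsid].append(path)
    duplicates = {f: p for f, p in paths_by_fsid.items() if len(p) > 1}
    return missing, duplicates
```

Passing it the contents of /etc/exports, for example `check_fsids(open("/etc/exports").read())`, returns the paths lacking an fsid and any fsid values used by more than one path.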

Good luck

Michael
depontius
Advocate


Joined: 05 May 2004
Posts: 3509

Posted: Sat May 11, 2013 1:37 am

I currently have:
Code:
# /etc/exports: NFS file systems being exported.  See exports(5).
/exports       192.168.???.0/255.255.255.0(fsid=0,rw,sync,no_subtree_check,secure,anonuid=1000,anongid=100)
/exports/home  192.168.???.0/255.255.255.0(nohide,rw,sync,no_subtree_check,secure,anonuid=1000,anongid=100)

So I'll try adding a unique fsid to the second line, as you suggest. Revisiting "man 5 exports" it looks as if I should also remove the "nohide" option.

Unfortunately I suspect this means shutting down nfs on all of the clients, then restarting the server, then restarting the clients?

Have you fiddled with cachefilesd at all? I was running it for quite a while, then quit this spring, when things started getting bad, thinking it might be part of the problem. Obviously that wasn't it, but I haven't wanted to re-complicate things while it all isn't running right. I've also found that cachefilesd/fscache causes a kernel panic on 3.8, but haven't tested it on 3.9 yet. Maybe if I get past this other problem.

Thanks for the suggestions, something to try tomorrow morning.
_________________
.sigs waste space and bandwidth
_______0
Guru


Joined: 15 Oct 2012
Posts: 521

Posted: Sat May 11, 2013 10:36 am

w00t!! 2013 and nfs still suffers from "stale nfs file handle"?? That does it, nfs devs should be barred from celebrating 2014 new years without a solution to that error.

Recently I tried nfs with gentoo, and since it had been quite some time since I last used it, I had forgotten many config settings that I assumed were absolutely needed, like the matching host thingy. To my surprise NFS4 worked out of the box!! Perhaps you're over-configuring.

For me the way it worked was to disable ALL nfs3 options in the kernel.
mike155
Advocate


Joined: 17 Sep 2010
Posts: 4438
Location: Frankfurt, Germany

Posted: Sat May 11, 2013 12:07 pm

Well, part of the problem is the sentence below in the man page of 'exports':
Code:
As not all filesystems are stored on devices, and not all filesystems have UUIDs, it is *sometimes* necessary to explicitly tell NFS how to identify a filesystem.  This is done with the fsid= option.

Unfortunately, they don't explain 'sometimes'. This sentence should read:
Code:
If you want to avoid problems with 'stale NFS handles', always explicitly specify the fsid= option. Your system may work without specifying fsid, but this is not guaranteed.

NFS seems to work for many people without specifying the fsid= option. In my case it doesn't work - and I experienced the problems described in the first posting. Why do I need to explicitly specify the fsid= option? I guess the reason is that I export encrypted filesystems (LUKS / dm-crypt). This seems to qualify as 'sometimes'...
depontius
Advocate


Joined: 05 May 2004
Posts: 3509

Posted: Sat May 11, 2013 1:04 pm

bug_report wrote:

NFS seems to work for many people without specifying the fsid= option. In my case it doesn't work - and I experienced the problems described in the first posting. Why do I need to explicitly specify the fsid= option? I guess the reason is that I export encrypted filesystems (LUKS / dm-crypt). This seems to qualify as 'sometimes'...

In my case it worked for many years - until this spring, and I wasn't able to figure out exactly which change broke it. I'd thought it was moving my server to a 3.8 kernel, and moving it back to the last pre-3.8 stable seemed to fix it, but only for several days.

It takes a planetary alignment to do this kind of client/server fix, and that alignment hasn't happened yet. Hopefully later today...

EDIT...

The planets aligned an hour or two ago, and I made the recommended changes. The jury is out for a few days, since that's how long it took for problems to show up after my last "fix". (The last "fix" was regressing to 3.7.5-r1 on the server, along with a clean client/server restart.) I'll wait to put up [SOLVED] until next week.

Thanks for the help.

In the meantime, does anyone have any advice on the "nfsidmap" and/or "nfsdcld" USE flags?

EDIT...

No problems since implementing the fix, but we were away some of this time. I think we've finally got enough run-time on to call this one [SOLVED]. When I first set up nfsv4, the documentation seemed to indicate that "fsid=0" was necessary for the first entry, for the export root. I don't remember it indicating it was necessary for any of the actual data exports, and I did run that way for quite a few years without problems.

Still wondering about "nfsidmap" and/or "nfsdcld" USE flags.
_________________
.sigs waste space and bandwidth
All times are GMT