One of our machines is set up as a NIS server that propagates passwords and related account information to several other machines, so that they can be used together as a single cluster.
I'm having trouble submitting jobs to the Torque installation on this cluster: the jobs sit in the queue indefinitely. According to Torque's logs, the compute nodes seem to have trouble establishing connections with one another, even though I can ssh just fine between all the machines.
Here is a snippet of the logs from when I submit a job that should run on two nodes. On one node of the cluster I have:

11/30/2010 20:31:31;0008; pbs_mom;Job;90858.mycluster;no group entry for group me, user=me, errno=0 (Success)
11/30/2010 20:31:31;0008; pbs_mom;Job;90858.mycluster;ERROR: received request 'ABORT_JOB' from 10.0.0.105:1023 for job '90858.mycluster' (job does not exist locally)
[repeated many times, until I cancel the job]

while on another one I get:

11/30/2010 20:10:09;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters
11/30/2010 20:10:09;0008; pbs_mom;Job;90857.unicron.cl.uottawa.ca;Job Modified at request of PBS_Server@mycluster
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;cannot locate job that triggered req
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
11/30/2010 20:10:09;0008; pbs_mom;Job;90857.mycluster;checking job post-processing routine
11/30/2010 20:10:09;0080; pbs_mom;Job;90857.mycluster;obit sent to server
11/30/2010 20:10:10;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution (15023) in 90857.mycluster, job_start_error from node 10.0.0.104:15003 in job_start_error
11/30/2010 20:10:10;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution (15023) in 90857.mycluster, abort attempted 16 times in job_start_error. ignoring abort request from node 10.0.0.104:15003
[repeated many times, until I cancel the job]

It seems it might have something to do with IDs, so I checked them on the headnode and on the compute nodes:

me@headnode $ id me
uid=1001(me) gid=1009(me) groups=1009(me)

me@node104 $ id me
uid=1001(me) gid=1009 groups=1009

It looks like the nodes don't know the group's name: when I run "ls -l" on the nodes, the group of the files and folders in my home directory shows up as "1009", while on the headnode it shows my username.
I initially thought the problem was with Torque, but could it be with NIS? I don't know anything about NIS; is there a way I can test it?
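If I understand correctly, the lookup that fails on the nodes is the GID-to-name resolution, which goes through NSS (and so consults NIS, per /etc/nsswitch.conf). A minimal sketch of exercising that path with getent, assuming a glibc system; GID 0 is used only as a known-good example, and on the nodes the interesting query would be the GID from the output above, e.g. `getent group 1009`:

```shell
# Ask the NSS stack to resolve a GID back to a group name.
# On a machine where the group map is served correctly this prints
# a line like "name:x:gid:members"; if the lookup fails, getent
# prints nothing and exits non-zero.
getent group 0

# On the cluster nodes, the NIS maps could also be queried directly
# (bypassing nsswitch) with the NIS client tools, e.g.:
#   ypwhich          # which NIS server this client is bound to
#   ypcat group      # dump the group map as served by NIS
#   ypmatch me group # look up the 'me' group in that map
```

Comparing the `getent group` output on the headnode against a compute node should show whether the nodes' NSS stack can see the group map at all.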
Thanks a lot for any insights, suggestions or help!

