One of our machines is set up as a NIS server that propagates passwords and related account information to several other machines, so that they can be used together as a single cluster.
I'm having trouble submitting jobs to the Torque installation on this cluster: the jobs sit in the queue indefinitely. According to Torque's logs, the compute nodes seem to have trouble establishing connections with one another, even though I can ssh just fine between all the machines.
Here is a snippet of the logs from when I submit a job that should run on two nodes. On one node of the cluster I have:

11/30/2010 20:31:31;0008; pbs_mom;Job;90858.mycluster;no group entry for group me, user=me, errno=0 (Success)
11/30/2010 20:31:31;0008; pbs_mom;Job;90858.mycluster;ERROR: received request 'ABORT_JOB' from 10.0.0.105:1023 for job '90858.mycluster' (job does not exist locally)
[repeated many times, until I cancel the job]

while on another one I get:

11/30/2010 20:10:09;0008; pbs_mom;Req;send_sisters;sending ABORT to sisters
11/30/2010 20:10:09;0008; pbs_mom;Job;90857.unicron.cl.uottawa.ca;Job Modified at request of PBS_Server@mycluster
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;cannot locate job that triggered req
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;top of preobit_reply
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
11/30/2010 20:10:09;0080; pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
11/30/2010 20:10:09;0008; pbs_mom;Job;90857.mycluster;checking job post-processing routine
11/30/2010 20:10:09;0080; pbs_mom;Job;90857.mycluster;obit sent to server
11/30/2010 20:10:10;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution (15023) in 90857.mycluster, job_start_error from node 10.0.0.104:15003 in job_start_error
11/30/2010 20:10:10;0001; pbs_mom;Svr;pbs_mom;Bad UID for job execution (15023) in 90857.mycluster, abort attempted 16 times in job_start_error. ignoring abort request from node 10.0.0.104:15003
[repeated many times, until I cancel the job]

It seems it might have something to do with IDs, so I checked them on the headnode and on the compute nodes:

me@headnode $ id me
uid=1001(me) gid=1009(me) groups=1009(me)

me@node104 $ id me
uid=1001(me) gid=1009 groups=1009

It looks like the nodes don't know the group's name: when I run "ls -l" on the nodes, the group of the files and folders in my home directory shows up as "1009", while on the headnode it shows my username.
I initially thought the problem was with Torque, but could it be with NIS? I don't know anything about NIS; is there a way I can test it?
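If I understand correctly, the lookup that fails on the nodes is the GID-to-name resolution, which goes through NSS (and so consults NIS, per /etc/nsswitch.conf). A minimal sketch of exercising that path with getent, assuming a glibc system; GID 0 is used only as a known-good example, and on the nodes the interesting query would be the GID from the output above, e.g. `getent group 1009`:

```shell
# Ask the NSS stack to resolve a GID back to a group name.
# On a machine where the group map is served correctly this prints
# a line like "name:x:gid:members"; if the lookup fails, getent
# prints nothing and exits non-zero.
getent group 0

# On the cluster nodes, the NIS maps could also be queried directly
# (bypassing nsswitch) with the NIS client tools, e.g.:
#   ypwhich          # which NIS server this client is bound to
#   ypcat group      # dump the group map as served by NIS
#   ypmatch me group # look up the 'me' group in that map
```

Comparing the `getent group` output on the headnode against a compute node should show whether the nodes' NSS stack can see the group map at all.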
Thanks a lot for any insights, suggestions or help!

