
Re: [Moabusers] job disappeared from qstat and showq, still running on node Pim Schravendijk Mon Mar 21 20:00:31 2011

I had set NODEAVAILABILITYPOLICY to COMBINED:PROCS.
A bit worrying that this still allows the load to go above the number of cores!

I'm not sure whether this is a Torque or a Moab issue. Clearly, Torque knows
which jobs are running, but either it doesn't tell Moab or Moab doesn't
read it from Torque. The fact that Moab still doesn't sync with Torque after
being restarted makes me think it's a Moab issue. Torque is 2.5.2, as I
wrote earlier.

I was planning to do the check via a crontab, but the node health check seems
to have fancy integration in Moab, so I'll try that.

On 21.03.2011 23:58, <[EMAIL PROTECTED]> wrote:

Hi Pim,

At least that can be done with NODEAVAILABILITYPOLICY – but I see Chris got
in first!

But I’d prefer DEDICATED, and to fix the problem of orphaned jobs.  Perhaps
the node health check could be used to detect spurious load and avoid
allocating extra jobs to badly behaving nodes – or even to identify and kill
orphan processes.
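
A minimal sketch of such a load check, in Python. It assumes the health check reports a problem by printing a line that starts with ERROR (my understanding of how the Torque pbs_mom node check script works; please verify for your version). The function names and the 0.5 tolerance are hypothetical, and "spurious load" is simplified here to "load above the core count".

```python
import os


def load_exceeds_cores(load1, cores, tolerance=0.5):
    """Return True when the 1-minute load is above cores + tolerance."""
    return load1 > cores + tolerance


def health_message(load1, cores, tolerance=0.5):
    """Return an ERROR line for an overloaded node, or an empty string."""
    if load_exceeds_cores(load1, cores, tolerance):
        return f"ERROR load {load1:.2f} exceeds {cores} cores"
    return ""


def check_node():
    """Read the local load and core count and print an ERROR line if needed."""
    load1 = os.getloadavg()[0]      # 1-minute load average
    cores = os.cpu_count() or 1     # cpu_count() may return None
    message = health_message(load1, cores)
    if message:
        print(message)
```

Calling check_node() at the end of the script would make it usable as a periodic check. A stricter version would compare the load with the number of cores actually allocated to jobs on the node rather than with the total.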


Others have mentioned epilogue in this thread and there are sporadic
discussions on torqueusers about killing off processes that should not be
running – which is relatively easy if your nodes are not shared but more
difficult in the general case – and clearly you have to be very careful not
to kill the wrong processes!
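
To illustrate the "be careful" part, here is a hedged Python sketch that only lists candidate orphans and kills nothing. It assumes the process list comes from `ps -eo pid=,uid=,user=`, that ordinary users have a uid of 1000 or higher, and that you can obtain the set of users with a job on the node from the batch system. The names parse_ps, find_orphans, job_users and min_uid are hypothetical.

```python
def parse_ps(text):
    """Parse `ps -eo pid=,uid=,user=` output into (pid, uid, user) tuples."""
    procs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, uid, user = fields
        procs.append((int(pid), int(uid), user))
    return procs


def find_orphans(procs, job_users, min_uid=1000):
    """Return (pid, user) for ordinary users who have no job on this node."""
    return [
        (pid, user)
        for pid, uid, user in procs
        if uid >= min_uid and user not in job_users
    ]
```

For example, with processes owned by root, alice and bob and only bob holding a job on the node, find_orphans reports alice's process and leaves root's alone because of the uid threshold. On shared nodes this per-user test is too coarse, since a user may have one legitimate job alongside leftovers from an earlier one.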

- Gareth


*From:* Pim Schravendijk [mailto:[EMAIL PROTECTED]
*Sent:* Tuesday, 22 March 2011 9:45 AM
*To:* Williams, Gareth (CSIRO IM&T, Docklands)
*Subject:* Re: RE: [Moabusers] job disappeared from qstat and showq, still
running on node

Hi Gareth,

Thank you for your thoughts on the matter!

I checked top before and after purging ...