
[torqueusers] problem with jobs sharing cores Fotis Georgatos Sat Feb 11 12:00:43 2012

Hi Mike,

Last week I had to debug a problem that appears somewhat related;
in short, the MPI stack (Open MPI) was interfering with CPU affinity.
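
To illustrate how affinity handling alone can produce the symptom you describe, here is a toy model in Python. It assumes that every MPI launch pins rank i to core i, counting from core 0, without knowing about other jobs on the node. That mechanism is an assumption for illustration, not a diagnosis of your mvapich2 setup, and `core_load` and `offset_per_job` are made-up names.

```python
from collections import Counter

def core_load(num_jobs, ranks_per_job, cores_per_node, offset_per_job=0):
    """Count processes per core when each job pins rank i to core (start + i)."""
    load = Counter({core: 0 for core in range(cores_per_node)})
    for job in range(num_jobs):
        start = job * offset_per_job  # 0 means every job starts at core 0
        for rank in range(ranks_per_job):
            load[(start + rank) % cores_per_node] += 1
    return dict(load)

# The reported case: 3 jobs x 4 ranks on a 12-core node, no coordination.
colliding = core_load(3, 4, 12)
# Same jobs, each shifted by 4 cores so the bindings do not overlap.
spread = core_load(3, 4, 12, offset_per_job=4)

print(sorted(colliding.items()))
print(sorted(spread.items()))
```

In the uncoordinated case, cores 0 to 3 each carry three processes and cores 4 to 11 stay idle, which matches the 4 overloaded and 8 idle cores in your report; with a per-job offset every core carries exactly one process.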

In my case I was able to solve it with the following line:
"mpiexec --report-bindings --cpus-per-rank 4 -np ..."
In your case I recommend checking the equivalent FAQ for your MPI stack, like:

From time to time it is also worth checking that your scheduler is actually
placing jobs on nodes as you imagine it does; this tool helps with that:
(the tarball works fine in userspace; rpm and repo are available for sysadmins).
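
Independent of any particular tool, a quick check along the same lines can be sketched in Python. It assumes a Linux node whose kernel exposes the `Cpus_allowed_list` field in /proc/<pid>/status; the helper names and the sample PIDs below are made up for illustration.

```python
from itertools import combinations

def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,8' into a set of core numbers."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cores.update(range(int(lo), int(hi or lo) + 1))
    return cores

def read_allowed_list(pid):
    """Return the Cpus_allowed_list field of /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("Cpus_allowed_list:"):
                return line.split(":", 1)[1].strip()
    raise LookupError(f"no Cpus_allowed_list for pid {pid}")

def shared_cores(affinity_by_pid):
    """Map each pair of PIDs to the cores both are allowed to run on."""
    clashes = {}
    pairs = combinations(sorted(affinity_by_pid.items()), 2)
    for (a, cores_a), (b, cores_b) in pairs:
        common = cores_a & cores_b
        if common:
            clashes[(a, b)] = common
    return clashes

# Hypothetical PIDs and lists, standing in for read_allowed_list(pid) output.
sample = {
    101: parse_cpu_list("0-3"),
    202: parse_cpu_list("0-3"),
    303: parse_cpu_list("4-7"),
}
print(shared_cores(sample))
```

On a real node you would feed `read_allowed_list(pid)` for the PIDs of your MPI ranks into `parse_cpu_list`. An unbound process reports every core, so an overlap only points to trouble when the lists are narrow, for example two ranks from different jobs both pinned to the same core.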


On 10/02/2012 01:20, [EMAIL PROTECTED] wrote:
> From:[EMAIL PROTECTED]  [mailto:[EMAIL PROTECTED] On Behalf Of Zulauf, Michael
> Sent: Thursday, February 09, 2012 12:30 PM
> Subject: [torqueusers] problem with jobs sharing cores
> Hi all. . .
> I apologize if this message appears more than once - there was an issue with 
> my email address and list registration (which I hope is now fixed), and so 
> I'm having to resend this. . .
> Anyway, where I work, we've had a problem for a while that we haven't been 
> able to resolve.  I'm not certain of the cause - if it's related to Torque, 
> or Maui, or something else.  But here goes. . .
> We've got a small cluster of 16 nodes, each with dual hex-core processors.  
> 12 cores per node, 192 cores total.  The problem is that if I launch small 
> jobs, where multiple jobs should be able to share a node without sharing 
> cores, I instead get cores that are running more than one process, while 
> other cores are idle.  The primary executable is WRF (weather prediction 
> model), but the problem occurs for other parallel codes.  The codes have been 
> built to utilize MPI (not OpenMP, or MPI/OpenMP).
> As an example, if I launch a series of jobs which request 4 cores each, I get 
> 3 jobs assigned to each node.  That should be fine, as each node has 12 
> cores, and there should be no need to share cores.  Instead, I get 4 
> "overloaded" cores (each running 3 processes) and 8 idle cores.  Obviously 
> not an ideal situation.  If I submit only a single small job, in which case 
> it's alone on a node, then it runs great.  Similarly, if I launch a large job 
> which spans more than one node, it also works well - as long as it's not 
> sharing nodes with other jobs.  The problem only occurs (and always occurs) 
> when parallel jobs share a node.  BTW, the qsub command does not explicitly 
> request specific cores, or anything like that.
> I'm not the administrator - just the primary user.  The administrator (who 
> was not previously familiar with Torque/Maui) has been struggling with this 
> for a bit, and is rather busy with other duties, so I thought I'd check in 
> here to see if anybody had suggestions I could pass along.
> Here are some specifics, as far as I know them:
>   HP blade hardware
>   dual Intel Xeon X5670 processors
>   Infiniband interconnect (not an issue in this case?)
>   the CentOS equivalent of Red Hat 4.1.2-48 (not sure of what that is exactly)
>   Torque 3.0.2
>   mvapich2-1.7rc1
>   PGI7.2-5 compilers
>   WRF 3.3.1
> Any thoughts?  I've probably left out relevant information.  If so, please 
> ask for clarification.
> Thanks,
> Mike
> --
> Mike Zulauf
> Meteorologist, Lead Senior
> Asset Optimization
> Iberdrola Renewables
> 1125 NW Couch, Suite 700
> Portland, OR 97209
> Office: 503-478-6304  Cell: 503-913-0403

echo "sysadmin know better bash than english" | sed s/min/mins/ \
        | sed 's/better bash/bash better/' # Yelling in a CERN forum