torqueusers@supercluster.org
Subject: [torqueusers] problem with jobs sharing cores
From: Fotis Georgatos
Date: Sat Feb 11 12:00:43 2012
Hi Mike,
Last week I had to debug a problem that appears somewhat related;
in short, the MPI stack (Open MPI) was interfering with CPU affinity.
I was able to solve it in my case with the following line:
"mpiexec --report-bindings --cpus-per-rank 4 -np ..."
In your case I recommend checking the equivalent FAQ for your own MPI stack; for Open MPI it is:
http://www.open-mpi.org/faq/?category=tuning#using-paffinity-v1.4
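
To illustrate why affinity is a plausible suspect, here is a minimal Python sketch (a toy model, not anything taken from Torque or an MPI library). It assumes each job's launcher binds its ranks to cores counted from core 0, unaware of the other jobs on the node. The function name `bind_ranks` and the flag `coordinated` are made up for illustration.

```python
from collections import Counter

CORES_PER_NODE = 12  # dual hex-core node, as in the quoted message


def bind_ranks(jobs, ranks_per_job, coordinated):
    """Return a Counter mapping core index -> number of ranks bound to it."""
    load = Counter()
    for job in range(jobs):
        # Uncoordinated: every job starts numbering at core 0.
        # Coordinated: each job gets its own block of cores.
        first_core = job * ranks_per_job if coordinated else 0
        for rank in range(ranks_per_job):
            load[first_core + rank] += 1
    return load


colliding = bind_ranks(jobs=3, ranks_per_job=4, coordinated=False)
spread = bind_ranks(jobs=3, ranks_per_job=4, coordinated=True)
idle_when_colliding = CORES_PER_NODE - len(colliding)

print("uncoordinated:", dict(colliding), "idle cores:", idle_when_colliding)
print("coordinated:  ", dict(spread))
```

Under that assumption the uncoordinated case gives 4 cores with 3 ranks each and 8 idle cores, which matches the symptom in the quoted message. Whether this is what actually happens on your cluster is something only the real bindings can confirm.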
From time to time it is worth checking that your scheduler is actually
placing jobs on nodes the way you expect; this tool helps with that:
http://fotis.web.cern.ch/fotis/QTOP/
(tarball works fine in userspace, rpm & repo are available for sysadmins).
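
Independently of any tool, on Linux the cores a process is allowed to run on can be read from the `Cpus_allowed_list` field of `/proc/<pid>/status`. The following is a rough Python sketch of such a check. The helper names are my own and the process ids in the example are hypothetical, so treat it as a starting point rather than a finished diagnostic.

```python
from itertools import combinations


def parse_cpu_list(text):
    """Turn a kernel CPU list such as '0-3,8' into a set of core numbers."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cores.update(range(int(low), int(high) + 1))
        else:
            cores.add(int(part))
    return cores


def allowed_cores(pid):
    """Read Cpus_allowed_list from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    return set()


def shared_cores(affinities):
    """Map each pair of process ids to the cores both may run on."""
    clashes = {}
    for a, b in combinations(sorted(affinities), 2):
        common = affinities[a] & affinities[b]
        if common:
            clashes[(a, b)] = common
    return clashes


# Hypothetical example: pids 101 and 102 are pinned to the same four cores.
example = {
    101: parse_cpu_list("0-3"),
    102: parse_cpu_list("0-3"),
    103: parse_cpu_list("4-7"),
}
print(shared_cores(example))
```

On a real node you would build the dictionary with `allowed_cores(pid)` for the pids of the MPI ranks. Note that processes with no binding at all are allowed on every core, so an overlap only points to a problem when the lists are narrow, as in the example.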
enjoy,
Fotis
On 10/02/2012 01:20, [EMAIL PROTECTED] wrote:
> From:[EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Zulauf, Michael
> Sent: Thursday, February 09, 2012 12:30 PM
> To:[EMAIL PROTECTED]
> Subject: [torqueusers] problem with jobs sharing cores
>
> Hi all. . .
>
> I apologize if this message appears more than once - there was an issue with
> my email address and list registration (which I hope is now fixed), and so
> I'm having to resend this. . .
>
> Anyway, where I work, we've had a problem for a while that we haven't been
> able to resolve. I'm not certain of the cause - if it's related to Torque,
> or Maui, or something else. But here goes. . .
>
> We've got a small cluster of 16 nodes, each with dual hex-core processors.
> 12 cores per node, 192 cores total. The problem is that if I launch small
> jobs, where multiple jobs should be able to share a node without sharing
> cores, I instead get cores that are running more than one process, while
> other cores are idle. The primary executable is WRF (weather prediction
> model), but the problem occurs for other parallel codes. The codes have been
> built to utilize MPI (not OpenMP, or MPI/OpenMP).
>
> As an example, if I launch a series of jobs which request 4 cores each, I get
> 3 jobs assigned to each node. That should be fine, as each node has 12
> cores, and there should be no need to share cores. Instead, I get 4
> "overloaded" cores (each running 3 processes) and 8 idle cores. Obviously
> not an ideal situation. If I submit only a single small job, in which case
> it's alone on a node, then it runs great. Similarly, if I launch a large job
> which spans more than one node, it also works well - as long as it's not
> sharing nodes with other jobs. The problem only occurs (and always occurs)
> when parallel jobs share a node. BTW, the qsub command does not explicitly
> request specific cores, or anything like that.
>
> I'm not the administrator - just the primary user. The administrator (who
> was not previously familiar with Torque/Maui) has been struggling with this
> for a bit, and is rather busy with other duties, so I thought I'd check in
> here to see if anybody had suggestions I could pass along.
>
> Here are some specifics, as far as I know them:
> HP blade hardware
> dual Intel Xeon X5670 processors
> Infiniband interconnect (not an issue in this case?)
> the CentOS equivalent of Red Hat 4.1.2-48 (not sure of what that is exactly)
> Torque 3.0.2
> mvapich2-1.7rc1
> PGI7.2-5 compilers
> WRF 3.3.1
>
> Any thoughts? I've probably left out relevant information. If so, please
> ask for clarification.
>
> Thanks,
> Mike
>
> --
> Mike Zulauf
> Meteorologist, Lead Senior
> Asset Optimization
> Iberdrola Renewables
> 1125 NW Couch, Suite 700
> Portland, OR 97209
> Office: 503-478-6304 Cell: 503-913-0403
--
echo "sysadmin know better bash than english" | sed s/min/mins/ \
| sed 's/better bash/bash better/' # Yelling in a CERN forum
_______________________________________________
torqueusers mailing list
[EMAIL PROTECTED]
http://www.supercluster.org/mailman/listinfo/torqueusers