xgrid-users@lists.apple.com

Re: [Xgrid] Memory limits in xgrid controller charlie strauss Wed Jun 09 16:01:01 2010


On Jun 8, 2010, at 12:08 PM, Richard Reeve wrote:

Well, I'd tend to agree that it's not a huge problem if it does release it as needed, and unless the total space exceeds 2GB - then it fails to allocate space and dies horribly... anyway, thanks very much for checking that. A very different response to what I see on 10.5 - I think I may have finally discovered why I need to upgrade to snow leopard... shame though, I had got quite into querying the controller database.

Just change the permissions on the database? Not sure whether those would get reset when check_permissions gets run.

Now I'm going to have to go back to the more reliable but slow routine of using the attributes and log jobs to get my information back. If/when I can do that reliably I'll put it on your FAQ I guess...

please do.


Thanks again,

Richard.

On 8 Jun 2010, at 18:34, charlie strauss wrote:

After deleting the jobs the RSIZE went down to 235.


Is it necessarily true that having a large RSIZE is bad? Could this not just be opportunistic caching that it can release if needed?





On Jun 8, 2010, at 11:31 AM, charlie strauss wrote:

I tested this on a 10.6 xgrid controller using 10.5 agents and client, and I see something similar, though maybe not as severe.



With no jobs running, my xgridcontrollerd is idling along at 450M RSIZE in top. When I submit a job, the RSIZE spikes to anywhere from 800 to 2300M. For the first three job submissions, after the job runs the RSIZE drops instantly to 450M.

HOWEVER, repeated submissions cause a creep in the RSIZE.

After 14 job submissions, with the jobs all running concurrently, the RSIZE is now up to 1337M (leet!). Waiting 10 minutes after the last submission started running, I see that it is still using exactly 1337M.

all the jobs are running concurrently and are just sleep executables.


perl -we 'for (0..16) { print `xgrid -job submit -in foo -gid 0 /bin/sleep 5000` }'

foo is a directory with one file in it that is 200MB in size.


Now 1GB is not a lot of memory. But if you are correct that this keeps scaling until the controller crashes, then this is a bug.


I did one more test. I tried the same submission set without the -in foo 200MB file. After 16 further jobs were submitted, the RSIZE only increased to 1338.




On Jun 8, 2010, at 10:46 AM, Richard Reeve wrote:

<blushes>

Err, anyway, I completely agree - I'm struggling to believe I've never noticed this myself if it's a real effect - that's why it would be great if somebody could reproduce it.

It certainly seems to work exactly as you describe, and does seem to scale linearly with size*#jobs:

- create a job that will run for a long time with a large -in directory (in this case a couple of hundred MB, but I'm not sure it matters), and then
- submit it
- wait for a job id to return (and so the job is running), and then
- repeat the submission a few times (in this case 7), and

it brings down the controller! Easy...

The guy was running it on a test controller with only one (8 core, 16 task) machine running as an agent, and none of the jobs completed before the controller crashed. I'm not sure what memory measure he was using, but from my observations of running a few smaller (10s of MB) jobs on our real controller with sleep as the job, it doesn't matter: just running top on the controller and watching the xgridcontroller process, its memory usage keeps going up with every submission, and doesn't come down until the jobs are deleted from the controller after they've finished.

Cheers,

Richard.

0   - 2.39MB
1st - 228.21MB - 1.22GB
2nd - 453.85MB - 1.4GB
3rd - 679.40MB - 1.65GB
4th - 904.96MB - 1.87GB
5th - 1.1GB - 2.21GB
6th - 1.32GB - 2.28GB
7th - CRASH


On 8 Jun 2010, at 15:49, Charlie E. Strauss wrote:

First, let me congratulate you on the first edit since the wiki went public. It was your post that made me investigate directly accessing the database, so thanks for the inspiration.

It seems like such a show stopper that you'd think this would have shown up before if the behavior scales linearly. Sending 200MB files to 5 computers may be unusual, but sending say 20 megabytes to 50 computers might not be unusual.

I wonder if it just scales as the specification size x jobs?
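To make that guess concrete, here is a minimal sketch. It assumes, purely as the hypothesis under discussion, that the memory the controller holds on to is roughly (specification size) x (number of concurrent jobs). The function name and the model are illustrative and are not taken from xgrid itself.

```python
def retained_mb(spec_mb, jobs):
    """Hypothetical model: memory held by the controller grows as size x jobs."""
    return spec_mb * jobs


# The two workloads mentioned above.
big_files_few_agents = retained_mb(200, 5)     # 200MB files to 5 computers
small_files_many_agents = retained_mb(20, 50)  # 20MB files to 50 computers

print(big_files_few_agents, small_files_many_agents)
```

Under this model both workloads cost the controller the same amount, which is why a fairly ordinary submission pattern could run into the same problem.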


Do you think that your problem is in any way specific to whatever it is you are doing? If I tried simulating it like you say below, would this work? If I have a chance today I'll try it on a 10.6 controller. If not, I can't try it till next week.

It's certainly the case that these limits exist for single jobs, but what
we have done below is:

1. Create a 200MB file
2. Submit it as a file for a job which will run for a long time
- Monitor memory usage of xgrid controller during submission
3. Wait until it is registered as running
- Record memory usage of xgrid controller after job begins to run
4. Return to 2.
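The steps above could be scripted roughly as follows. This is only a sketch under stated assumptions: the submit command is the one quoted elsewhere in this thread, the controller is assumed to be reachable without extra options, the process is assumed to be named xgridcontrollerd as in the top output mentioned in the thread, and `is_running` is a hypothetical callable the caller must supply to decide when the latest job has started (for example by checking the controller by hand). The helper names are made up for illustration.

```python
import subprocess
import time


def parse_rss_mb(ps_output, name="xgridcontrollerd"):
    """Return resident size in MB of the first process whose command ends with name."""
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) == 2 and fields[1].strip().endswith(name):
            return int(fields[0]) / 1024.0  # ps reports rss in 1024-byte units
    return None


def controller_rss_mb():
    """Ask ps for the rss and command of every process, then pick out the controller."""
    out = subprocess.run(["ps", "-ax", "-o", "rss=", "-o", "comm="],
                         capture_output=True, text=True, check=True).stdout
    return parse_rss_mb(out)


def submit_sleep_job(in_dir):
    """Step 2: submit a long-running job with in_dir as the -in directory."""
    subprocess.run(["xgrid", "-job", "submit", "-in", in_dir,
                    "-gid", "0", "/bin/sleep", "5000"], check=True)


def run_test(in_dir, rounds, is_running,
             submit=submit_sleep_job, read_mb=controller_rss_mb, poll_s=5):
    """Steps 2 to 4: submit, wait until the job runs, record memory, repeat."""
    readings = []
    for _ in range(rounds):
        submit(in_dir)
        while not is_running():  # step 3: wait until registered as running
            time.sleep(poll_s)
        readings.append(read_mb())
    return readings
```

Something like `run_test("foo", 7, is_running=my_check)` would return one memory reading per submission, where `my_check` is whatever check you trust. Peak usage during submission is not captured here; that would need sampling while the submit call is in progress.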

Doing this returns ever increasing memory use on the controller. If this doesn't happen for other people, maybe we finally have to move the controller to 10.6, as we are still using 10.5, though I note from your nice wiki (thanks for that!) that we will then no longer be able to access the controller database from a job, which was quite a handy feature in 10.5...

Any reports of what happens on 10.6 would be welcome (or 10.5 if you don't
get this effect),

Cheers,

Richard.

On 7 Jun 2010, at 23:39, charlie strauss wrote:

I believe xgrid has the 32-bit memory limit.

Additionally, I have never gotten a single job submission with -in file transfer to work that was more than ~600MB (file size). Additionally, the client gives an error message on retrieval of specifications that are more than 130MB in size (original file size).

xml plist submissions are 4/3 larger than the original file size and
NextStep plist submissions are 2x the original file size.
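The 4/3 and 2x factors are what you would expect if the file data is base64-encoded in the XML plist and written as hexadecimal digits in the NextStep plist. That explanation is an assumption, not something confirmed in this thread, and the function names below are made up for illustration. A short sketch of the arithmetic:

```python
import base64


def xml_plist_data_size(n_bytes):
    """Size of n_bytes after base64 encoding (4 characters per 3 bytes, padded)."""
    return 4 * ((n_bytes + 2) // 3)


def nextstep_plist_data_size(n_bytes):
    """Size of n_bytes written as hexadecimal digits (2 characters per byte)."""
    return 2 * n_bytes


# Check the formulas against real encodings of some sample data.
payload = bytes(range(256)) * 12  # 3072 bytes
assert len(base64.b64encode(payload)) == xml_plist_data_size(len(payload))
assert len(payload.hex()) == nextstep_plist_data_size(len(payload))

mb = 1024 * 1024
print(xml_plist_data_size(600 * mb) / mb)       # a 600MB file as base64, in MB
print(nextstep_plist_data_size(600 * mb) / mb)  # the same file as hex, in MB
```

These figures cover only the encoded file data; any plist markup around it and any extra copies held in memory would come on top.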


What is not making sense here is that you seem to be seeing a problem on the controller, not the client. The controller is only going to transiently hold the submission in memory while it sends it to the agent and updates its database.


Perhaps if you submit the jobs too quickly it is trying to hold too many in memory at once? Try the following as a test: submit your very large jobs to the controller with a very long duration for the job (a sleep), then don't submit another job until the controller shows the agent is working. If it still fails then my guess is wrong.






On Jun 7, 2010, at 10:09 AM, Richard Reeve wrote:

Hi,

We've realised for a while that there's a memory limit on single xgrid jobs, and there are plenty of threads about it, but we've recently realised that there appears to be a memory limit on all current xgrid jobs - has anybody else found this or am I going mad? Is there anything we can do about it?

Cheers,

Richard.

An example from a clean new controller from a guy in our group:

So I started submitting jobs to the cluster with just one copy of the big 222.3MB file in the -in directory, and checked out the activity monitor - crude but thought would give it a go:

0   - 2.39MB
1st - 228.21MB - 1.22GB
2nd - 453.85MB - 1.4GB
3rd - 679.40MB - 1.65GB
4th - 904.96MB - 1.87GB
5th - 1.1GB - 2.21GB
6th - 1.32GB - 2.28GB
7th - CRASH

First column = job number
Second column = reported real memory usage of xgrid controller process by activity monitor after job has been sent to agent
Third column = maximum reported real memory usage of xgrid controller process during submission

A couple of things - going over 2GB didn't seem to crash it! But going too far over did - the max values aren't exact - could be higher, as the process kept jumping in activity monitor and I'd lose sight of it for a bit.
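As a quick check on the table above, the increase in resting memory per submission can be compared with the size of the file. This is a minimal sketch that assumes the unlabelled 4th value is in MB and that the GB figures convert at 1024MB per GB; both are guesses about how the numbers were reported.

```python
# Resting memory after each submission (second column of the table), in MB.
# Assumptions: 904.96 is MB, and 1GB = 1024MB for the rounded GB values.
resting_mb = [2.39, 228.21, 453.85, 679.40, 904.96, 1.1 * 1024, 1.32 * 1024]
file_mb = 222.3  # size of the file in the -in directory

# Growth between consecutive rows, i.e. what each new job added.
increments = [b - a for a, b in zip(resting_mb, resting_mb[1:])]
mean_increment = sum(increments) / len(increments)

for job, inc in enumerate(increments, start=1):
    print(f"job {job}: +{inc:.1f}MB")
print(f"mean increase per job: {mean_increment:.1f}MB (file is {file_mb}MB)")
```

If the per-job increase comes out close to the file size, that is consistent with the controller keeping about one copy of the specification per running job.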
_______________________________________________
Do not post admin requests to the list. They will be ignored.
Xgrid-users mailing list      ([EMAIL PROTECTED])
Help/Unsubscribe/Update your Subscription:
http://lists.apple.com/mailman/options/xgrid-users/cems%40lanl.gov

This email sent to [EMAIL PROTECTED]

Charlie Strauss
Bioscience Division
[EMAIL PROTECTED]
505 665 4838
Quidquid latine dictum sit, altum sonatur.

