BlueGene Homepage

Sample Loadleveler Batch Script (including how to monitor jobs)

List of NYBlue Predefined Partitions

NYBlue Predefined Partitions Naming Convention

Batch Job Submission on NYBlue/L

All jobs should be submitted to the compute nodes from the login node (i.e., the front end node) as batch jobs via IBM's LoadLeveler. We are currently running LoadLeveler version 3.4.3.3.

In your LoadLeveler batch script:

Class normal jobs (48 hour wall clock limit) must specify a particular predefined partition of compute nodes on which your job will be run.

Class short jobs (24 hour wall clock limit) must instead specify only the number of nodes on which to run the job. That number must be 1024, 2048, 3072, or 4096 (i.e., one, two, three, or four racks).
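The node-count rule for class short can be checked before a job is submitted. The following is a minimal sketch only; the function names `is_valid_short_size` and `racks_for_short_size` are hypothetical helpers, not part of LoadLeveler or of the NYBlue tools.

```shell
# Succeed only for node counts that class short accepts (whole racks of 1024 nodes).
is_valid_short_size() {
    case "$1" in
        1024|2048|3072|4096) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the number of racks that a valid class short request occupies.
# Returns non-zero without printing anything for an invalid node count.
racks_for_short_size() {
    is_valid_short_size "$1" || return 1
    echo $(( $1 / 1024 ))
}
```

For example, `racks_for_short_size 3072` prints 3, while `is_valid_short_size 512` fails because 512 nodes is not a whole number of racks.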

If you will be running class normal jobs, see these hyperlinks on the sidebar for details:

NYBlue Predefined Partitions Naming Convention
List of NYBlue Predefined Partitions
Sample LoadLeveler Batch Script

If you will be running class short jobs, see only the last of those three hyperlinks.
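As a rough orientation, the difference between the two classes might look like the job command file fragment below. This is a sketch under assumptions, not the site's sample script: it assumes the standard LoadLeveler Blue Gene keywords `job_type`, `bg_partition`, and `bg_size` are what NYBlue expects, and the wall clock value is purely illustrative. The partition name B512TC02 is borrowed from the example later on this page. The Sample LoadLeveler Batch Script page remains the authoritative reference, including for the executable and argument lines omitted here.

```shell
#!/bin/sh
# Sketch of a class normal job command file (values are illustrative).
# Lines starting with "# @" are LoadLeveler directives; to the shell they are comments.
# @ job_type = bluegene
# @ class = normal
# @ bg_partition = B512TC02
# @ wall_clock_limit = 12:00:00
# @ queue
#
# Assumed class short variant: replace the class and bg_partition lines with
#   class = short
#   bg_size = 1024
# and keep wall_clock_limit at or below 24:00:00.
```

A file like this would be submitted from the front end node with `llsubmit`.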

Notes For Class Normal Jobs Only

To see all predefined partitions currently available for a class normal LoadLeveler job, issue this command from the front end node (it should already be in your path):

readyblocks.pl

Then specify any one of the displayed predefined partitions in your LoadLeveler job control file.

Note: If there are no available partitions, there will be no output from the above command. In that case, select a partition following the procedure described in the List of NYBlue Predefined Partitions hyperlink on the sidebar. Your LoadLeveler job will be queued and will run when that partition becomes available.
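Choosing a partition from the command's output can be scripted. The sketch below assumes that readyblocks.pl prints one partition per line with the partition name as the first field; this page does not document the output format, so adjust the field selection to match what you actually see. The function name `first_ready_partition` is hypothetical.

```shell
# Read readyblocks.pl-style output on stdin and print the first partition name.
# Prints nothing and returns non-zero when there is no output at all,
# which corresponds to the "no available partitions" case.
first_ready_partition() {
    awk 'NF { print $1; found = 1; exit } END { exit !found }'
}

# Intended use on the front end node (left commented out here):
#   partition=$(readyblocks.pl | first_ready_partition) \
#       || echo "No partition is free; pick one from the predefined list instead."
```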

Note: There are times when readyblocks.pl will indicate that a predefined partition is available, yet a batch job specifying that partition will be queued rather than run immediately. In such a case the LoadLeveler scheduler is probably saving the partition for a higher priority job (i.e., a job that has been waiting longer).

Example: The output from llq -s fengpfs.18080.0 includes this:

BP "R310" is set aside for top-dog.
BP "R310" of Partition "B512TC02" is not available.

Here the scheduler is saving the requested partition B512TC02 for another job, perhaps a 1024-node "top dog" job that is next in line and needs both that partition and another one. If the user waits, the job will run. Alternatively, the user could cancel fengpfs.18080.0 and submit a new job specifying another 512-node partition that readyblocks.pl reports as free, in the hope that it is not also being saved. Type topdog to see the current top dog jobs.
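The check described in this note can be automated by searching the llq -s output for the "set aside for top-dog" message quoted above. This is a minimal sketch; `held_for_topdog` is a hypothetical helper name, and it matches only the exact wording shown in the example.

```shell
# Succeed when llq -s output (read from stdin) reports that a base partition
# is being reserved for a top dog job.
held_for_topdog() {
    grep -q 'is set aside for top-dog'
}

# Intended use on the front end node (left commented out here):
#   if llq -s fengpfs.18080.0 | held_for_topdog; then
#       echo "Partition is being saved: wait, or cancel and resubmit elsewhere."
#   fi
```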

Notes for Both Class Normal and Class Short Jobs

Bear in mind that the batch system uses backfilling and favors incoming jobs with smaller specified wall clock limits, so it is to your advantage to specify a wall clock limit that is not larger than what your job will actually need.
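One way to keep the limit tight is to start from a run-time estimate in minutes and convert it to a limit string. The sketch below assumes the limit is written in hh:mm:ss form, which LoadLeveler's wall_clock_limit keyword accepts as far as we know; `minutes_to_wall_clock` is a hypothetical helper name.

```shell
# Convert an estimated run time in whole minutes to an hh:mm:ss string
# suitable for a wall clock limit.
minutes_to_wall_clock() {
    printf '%02d:%02d:00\n' $(( $1 / 60 )) $(( $1 % 60 ))
}
```

For example, an estimate of 390 minutes gives 06:30:00, which is far more likely to be backfilled than a request for the full class maximum.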

The Sample LoadLeveler Batch Script hyperlink on the sidebar at left describes how to cancel a job and how to use llq -s to learn about the status of a job.

There is a maximum wall clock limit of 48 hours on all class normal LoadLeveler jobs and 24 hours on all class short jobs.

All jobs must specify the class to be used, and the class must be normal or short. See the Sample LoadLeveler Batch Script hyperlink on the sidebar at the left, which also discusses how to monitor LoadLeveler jobs and the meaning of the codes displayed when you monitor a batch job.

maxjobs and maxqueued both equal two for class short. The Notes section of the Sample LoadLeveler Batch Script hyperlink on the sidebar at the left explains the significance of this statement.

We are still testing LoadLeveler and working out the best configuration for it. Your patience is appreciated.


This site maintained by:
bgwebmaster@bnl.gov

One of ten national laboratories overseen and primarily funded by the Office of Science of the U.S. Department of Energy (DOE), Brookhaven National Laboratory conducts research in the physical, biomedical, and environmental sciences, as well as in energy technologies and national security. Brookhaven Lab also builds and operates major scientific facilities available to university, industry and government researchers. Brookhaven is operated and managed for DOE's Office of Science by Brookhaven Science Associates, a limited-liability company founded by Stony Brook University, the largest academic user of Laboratory facilities, and Battelle, a nonprofit, applied science and technology organization.