To prevent jobs from over-using compute nodes' memory, which can harm other users' jobs sharing the same nodes and degrade overall cluster performance, virtual memory limits will be enforced on all jobs after Sept 1, 2013.
Once enforcement begins, a job that exceeds its specified virtual memory limit (i.e., the h_data value in the job script) will be killed automatically by the scheduler. In that case, the job will need to be resubmitted with a larger memory request. This can be accomplished by increasing the value of h_data in the job script and re-submitting the job. When using queue scripts (such as job.q) at the command line, you can change the amount of memory requested for your job with the -d flag (see man queue).
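For example, a minimal job-script sketch (assuming the standard SGE-style directive syntax used on Hoffman2; the 4GB value, run time, and program name are placeholders for illustration) with an increased per-core memory request:

$ cat myjob.sh
#!/bin/bash
#$ -cwd
#$ -l h_data=4G,h_rt=8:00:00
./my_program

Here h_data=4G would replace a smaller value that caused the job to be killed; choose a value larger than the memory your job actually needs per core.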
Jobs that do not specify h_data will be assigned a default value. Again, if the default value is too low for the given job, the job will be killed by the scheduler.
Jobs requesting whole node(s) (via -l exclusive) will not be affected by the virtual memory limit. However, if the specified h_data value is too low, the job could be dispatched to a low-memory (e.g., 8GB) node. If the job then uses more than the total memory available on the node, the job (and possibly the node) will crash, and the user will still need to resubmit the job with a larger h_data value.
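As a sketch (the comma-separated resource list follows general SGE conventions, and the 16GB figure is only an illustrative assumption), an exclusive job that should not land on an 8GB node could be submitted as

$ qsub -l exclusive,h_data=16G myjob.sh

so that the scheduler only considers nodes with at least 16GB of memory.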
Note that h_data is a per-core value. Therefore, for shared-memory jobs, if (h_data)*(number of cores) is larger than a node's available total memory, the job will not start. To prevent this, request an h_data value and a number of cores such that their product fits within the total memory available on the Hoffman2 cluster nodes (see the output of qhost to determine how much memory is available on the different Hoffman2 compute nodes).
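As an illustration (the shared parallel-environment name and the 8-core/4GB figures are assumptions used only as an example), a shared-memory job might request

#$ -pe shared 8
#$ -l h_data=4G

The product is 8 cores * 4GB = 32GB, so this job can only start on a node with at least 32GB of total memory; check the qhost output to confirm such nodes exist.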
Initially, this virtual memory limit enforcement may interrupt jobs that have been over-using memory at run time in the past. In the long run, however, memory enforcement will significantly improve the cluster's stability and performance, benefiting all users.
To find out how much virtual memory your job uses, you may submit your job in exclusive mode (with the -l exclusive option, so that no other jobs will share the same node as yours). Once the job completes successfully, run the command
$ qacct -j job_ID
where job_ID is replaced with the actual job number. Look for the value of maxvmem in the output. You should then specify h_data such that (h_data)*(number of cores) is larger than maxvmem (but less than the total memory size of the compute node you intend to use).
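As a worked example (the numbers are hypothetical), if qacct reports a maxvmem of about 10GB for a job that will run on 4 cores, then h_data must satisfy 4*(h_data) > 10GB; requesting h_data=3G gives 12GB in total, which is sufficient as long as the target node has at least 12GB of memory.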
Please contact hpc@ucla.edu should you have any questions.