Batch LAPP-LAPTH


This section is meant only for LAPP and LAPTH users


MUST reference in your publications

In accordance with the principle adopted by the MUST steering committee, please include the following sentence in any publication whose results rely on the computing and storage facilities offered by MUST: "Ce travail a été réalisé grâce aux services offerts par le méso-centre de calcul MUST de l'Univ. Savoie Mont Blanc - CNRS/IN2P3".

Please do not forget to add a reference to MUST in your publications if you use the cluster computing facility to obtain your results: "This work has been done thanks to the facilities offered by the Univ. Savoie Mont Blanc - CNRS/IN2P3 MUST computing center".

Useful links

>>> LAPP user support

Your message will be handled by the LAPP computing department helpdesk (Request Tracker). Please avoid contacting a member of the IT department directly; use the following generic address instead: LAPP-LAPTh IT support

>>> LAPP IT Department documentation

Some documentation can also be found here: LAPP IT department documentation (restricted access)

>>> Available CPUs

You can check the load of the system on the MUST monitoring Web page.

>>> Available lapp_data disk space

This link is only accessible from LAPP network (or via VPN): http://lapp-quattor/Monitoring/lapp_data-quota.html

User interface machines (UI)

As a LAPP or LAPTH user you have access to the following interface machines:

To log in to a machine: ssh <user_name>@lappsl6.in2p3.fr or ssh <user_name>@lapthsl6.in2p3.fr

These machines are the user interfaces (UI) you have to log in to in order to prepare and submit your jobs to the computation farm. Their computing characteristics are identical to those of the computing machines.

Technical characteristics:

  • OS : SL6 - 64 bits
  • compilers : C, C++, Fortran 77/90
  • OPENMPI, MPICH1, MPICH2 (parallel computation)
  • libraries : Blas, Lapack

In case you need any other compiler and/or library, send an e-mail to LAPP-LAPTh IT support.

Developing/testing batch jobs

The environment on the lappsl6 and lapthsl6 machines is identical to that of the computing machines, so you can compile and test your jobs on these machines before submitting them to the computation farm.

 : parametrize your test jobs reasonably; the lappsl6 and lapthsl6 machines are not intended for heavy computation and are shared among many users.
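
For example, assuming the GNU compilers, a quick compile-and-test cycle on the UI could look like this (the source file name, link order and test input are hypothetical; a minimal sketch only):

 >> gfortran -O2 -o solve_test solve_test.f90 -llapack -lblas   # build against the Lapack/Blas libraries listed above
 >> ./solve_test small_input.dat                                # short, lightweight check before submitting to the farm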

Submitting batch jobs

>>> Very very very important hints

It is not possible to access your home directory (/home1 or /home3) from the computation machines.
If you have created a symbolic link to /lapp_data from your home directory, this link cannot be used in your batch jobs.
Always submit your jobs from your /lapp_data/... working directory.
If you specify an output file name in the qsub command (-e, -o options), this file must be in your /lapp_data/... working directory.
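
Putting these hints together, a typical submission could look like this (the subdirectory and file names are hypothetical):

 >> cd /lapp_data/<your_group>/<user_name>/myanalysis
 >> qsub -j oe -o /lapp_data/<your_group>/<user_name>/myanalysis/job.log myjob.sh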

>>> Batch setup

Before submitting jobs you need to initialize some Linux environment parameters:

  • A basic setup file can be found here : setup file

You simply need to define the LAPP_APP_SHARED parameter at the top of the file.

  • $PBS_O_HOME

If you don't specify in which directory your job should work, the default one will be $PBS_O_HOME.

  • $TMPDIR

If your job makes a lot of I/O accesses while running, it can be more efficient to read/write your files locally on the worker node instead of in your working directory. In this case you can use the WN local directory $TMPDIR to read/write your temporary data files.

 : if you use $TMPDIR, don't forget to retrieve your result files to your own directory at the end of the job, and then to clean $TMPDIR.
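
A job script following this pattern might look like the sketch below (the program and file names and the /lapp_data working directory are hypothetical):

 #!/bin/bash
 WORKDIR=/lapp_data/<your_group>/<user_name>/myanalysis   # hypothetical working directory
 
 cp $WORKDIR/input.dat $TMPDIR/               # copy the input to the worker node's local disk
 cd $TMPDIR
 $WORKDIR/myprog input.dat output.dat         # heavy I/O now happens locally
 
 cp $TMPDIR/output.dat $WORKDIR/              # retrieve the result file at the end of the job
 rm -f $TMPDIR/input.dat $TMPDIR/output.dat   # then clean $TMPDIR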

>>> How to submit a job : qsub

Use the qsub command to submit your job to the computation farm:

 >> qsub -V -j oe -o sOutputFileName  -M mailing_address -m aeb  -l walltime=01:00:00,mem=512mb   sBatchFileName.sh
 
 where: 
    sOutputFileName is the output log file name
    standard output and error are redirected to sOutputFileName via the "-j oe -o ..." options
    -M <your_email_address> -m aeb       ( a: abort, e: end, b: begin )
         if you wish to receive an e-mail at the beginning/end of the job or in case of error, you have to specify both the -M and -m options
    -l walltime is the maximum execution time (hours:minutes:seconds)
       mem is the requested memory
    sBatchFileName.sh is the file you want to execute on the computing machines
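
sBatchFileName.sh is an ordinary shell script. A minimal sketch (the executable name is hypothetical, and changing to $PBS_O_WORKDIR, the directory the job was submitted from, is a common PBS idiom rather than something this page prescribes):

 #!/bin/bash
 cd $PBS_O_WORKDIR       # move to the /lapp_data directory the job was submitted from
 ./myprog > myprog.out   # run the (hypothetical) executable prepared on the UI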

By default, walltime is set to 56 hours. It can be checked using the following command:

 >> qmgr -c 'list queue local' | grep walltime

If you specify a different walltime, your job will be stopped when its execution time exceeds the defined walltime.

The maximum time allocated to a job is 56 hours. Every job will be killed beyond this time.

>>> Job memory and CPU

Information about a job's CPU/memory consumption is not automatically appended to the log file. To receive this information by e-mail when the job is done, add the options -M <your_email_address> -m aeb when submitting the job.

No mail is sent to the user if a job is killed because it exceeded the walltime limit.

>>> Test queue : option -q flash

A specific job queue is dedicated to tests. The maximum walltime allocated to these jobs is 5 minutes.

To use this queue, add the following option to your qsub command: -q flash
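
For example, combining the options shown above:

 >> qsub -q flash -j oe -o test.log sBatchFileName.sh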

This queue is not intended to be used intensively for short jobs.

>>> Changing groups : option -W group_list

Some users are members of several groups and need to submit jobs from any of them.

To use a different group than the default one, add the following option to your qsub command: -W group_list=<desired group name>
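
For example ('mygroup' is a hypothetical group name; id -Gn lists the groups you belong to):

 >> id -Gn
 >> qsub -W group_list=mygroup sBatchFileName.sh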

>>> MPI jobs (multiprocessor jobs)

There are 3 MPI libraries available on the cluster, MPICH-1, MPICH-2 and OpenMPI.

 : OpenMPI is suitable for most cases and the easiest to use; you should try it first.

The maximum number of processors allocated to a job is 32.


To run an MPI job, you need to create a script including lines like the following:

 CPU_NEEDED=8   # number of CPUs (from 2 to 32) to be reserved for parallel execution
 
 $MPI_OPENMPI_PATH/bin/mpicc -o <your exe> <your source>   # if you wish to compile your code before running it
 
 $MPI_OPENMPI_PATH/bin/mpirun -np $CPU_NEEDED <your exe>   # run the executable on the number of CPUs you just asked for


 : If your MPI jobs are really I/O intensive, you can use a specific set of computing nodes to improve performance. These nodes are connected within a dedicated Infiniband network, and jobs can be submitted to these nodes by adding the option '-q localIB' to the usual 'qsub -lnodes=....' command.
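
As an illustration, a submission to these nodes could look like this (the node count uses the standard PBS -l nodes syntax and the value 4 is purely hypothetical; the script is the MPI script sketched above):

 >> qsub -q localIB -l nodes=4 mpi_job.sh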

>>> Jobs with licensed software

Several licensed software packages are currently available on the computation farm for LAPP & LAPTH users:

  • Ansys
  • Maple
  • Mathematica
  • Matlab
  • Abaqus (for Meca users)
  • Samcef (for Meca users)

They have been installed in the /grid_sw/software/softs_cc/ directory. The default binaries that can be used are:

  • /grid_sw/software/softs_cc/ansys_inc/v140/ansys/bin/ansys140
  • /grid_sw/software/softs_cc/Maple/current/bin/maple
  • /grid_sw/software/softs_cc/Mathematica/current/Executables/math
  • /grid_sw/software/softs_cc/Matlab/current/bin/matlab
  • /grid_sw/software/softs_cc/Abaqus/Commands/abaqus
  • /grid_sw/software/softs_cc/Samcef/Samv121/samcef
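
For example, a batch job running a Mathematica script non-interactively might look like the sketch below (the input/output file names are hypothetical; feeding the script on standard input is one common way to run math in batch mode):

 #!/bin/bash
 cd $PBS_O_WORKDIR
 /grid_sw/software/softs_cc/Mathematica/current/Executables/math < myscript.m > myscript.out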

An additional module to Mathematica for High Energy Physics (FeynCalc) has been installed in /grid_sw/software/softs_cc/Mathematica/6.0/AddOns/Applications/HighEnergyPhysics

 : For further information on all available versions, please refer to this web page.

 : If you wish to run another licensed software on MUST, please contact LAPP-LAPTh IT support and we will let you know if it is possible.

Monitoring batch jobs

>>> Job monitoring : showq

The showq command shows the list of all jobs submitted to the computation farm.

This command displays the list of running jobs, the batch farm activity and the list of idle jobs.

For each job, one can see the job identifier, the user name, the number of processors used by the job, and the start time and remaining time.

 >> showq

 ACTIVE JOBS ----------------------------------------------------------------------------
 JOBNAME            USERNAME      STATE     PROC     REMAINING        STARTTIME
 52800              plokiju       Running     8        9:37:03    Sat Jun 30 19:39:16
 53024              lhcb023       Running     1     1:02:27:48    Sun Jul  1 12:30:01
 ...
 
 121 Active Jobs     128 of  160 Processors Active (80.00%)           
                      32 of   32 Nodes Active      (100.00%)

 IDLE JOBS ------------------------------------------------------------------------------
 JOBNAME            USERNAME      STATE     PROC     WCLIMIT          QUEUETIME
 53203               lhcb023       Idle       1     1:12:00:00    Sun Jul  1 20:28:17
 53204               lhcb023       Idle       1     1:12:00:00    Sun Jul  1 20:28:18
 53205               lhcb023       Idle       1     1:12:00:00    Sun Jul  1 20:28:33

 : the showq display is updated periodically, so this command does not show the batch queue status in real time.

 : the rank of an idle job will vary dynamically over time according to the priorities calculated by the queue task manager.

Don't worry if your jobs don't appear instantly in the job queue, or if they start at the last rank of the idle jobs queue.

 : Due to a limitation in the scheduler's capacity to deal with a very large number of jobs, the size of the idle jobs queue is limited, so some jobs may appear as "blocked" when the cluster is overloaded. This does not mean that these jobs were rejected by the scheduler; they will be processed automatically (switched to idle status) as soon as some CPUs become available.

>>> How to cancel a job : canceljob

>> canceljob <job_id>

To get the job ID, use the showq command described above.
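
For example, to find and cancel one of your own jobs:

 >> showq | grep <user_name>   # find the job ID (JOBNAME column) of your job
 >> canceljob 52800            # the job ID 52800 comes from the showq example above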

>>> Check user's priority : diagnose

Users' priorities can be checked via the diagnose command:

>> diagnose -f

The output of this command is not really "human readable"...
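
To pick out just your own entry (assuming, as is usually the case, that your user name appears in the fairshare table):

 >> diagnose -f | grep <user_name>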
