Batch UdS


This section is meant only for users from Université Savoie Mont Blanc (USMB) laboratories (outside LAPP and LAPTH)

Another wiki page is available here: http://www.lama.univ-savoie.fr/wiki/index.php/MUST

LAMA tutorial, January 22nd

LOCIE tutorial (beginner level), 30 April 2009

SYMME tutorial, 8 March 2010

MUST reference in your publications

In line with the principle adopted by the MUST steering committee, please include the following sentence in any publication whose results rely on the computing and storage facilities offered by MUST. In French: "Ce travail a été réalisé grâce aux services offerts par le méso-centre de calcul MUST de l'Univ. Savoie Mont Blanc - CNRS/IN2P3".

In English: "This work has been done thanks to the facilities offered by the Univ. Savoie Mont Blanc - CNRS/IN2P3 MUST computing center".

Useful links

>>> USMB user support

Your message will be handled by the LAPP computing department helpdesk (Request Tracker). Please avoid contacting a member of the IT department directly; use the following generic address instead.

Send an e-mail to USMB user support

>>> MUST jobs monitoring

You can check the load of the system on the MUST monitoring Web page.

Getting an account to access computation farm

First, fill in the following electronic form: http://lappweb.in2p3.fr/informatique/Univ_Savoie/

Once done, print the PDF file that is generated. Read the document (the CNRS computing charter), sign it, and ask your laboratory director to sign it as well. Then send the form to Muriel Gougerot as specified at the bottom of the form.

lappuds3.in2p3.fr and lappuds4.in2p3.fr portal machines

Once your account is created, you will receive an e-mail with the login/password needed to connect to the portal machines.

  • to log in: ssh <user_name>@lappuds3.in2p3.fr or ssh <user_name>@lappuds4.in2p3.fr
  • your home directory will be /univ_home/<user_name>; the default allocated space is 10 GB.

These machines are portals you log in to in order to prepare and submit your jobs to the computation farm. Their computing characteristics are identical to those of the computing machines.

Technical characteristics:

  • OS: SL6 - 64 bits (32-bit compatible)
  • compilers: C, C++, Fortran 77/90
  • OpenMPI, MPICH1, MPICH2 (parallel computation)
  • libraries: BLAS, LAPACK
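
As an illustration, a C program calling BLAS/LAPACK routines can typically be compiled on the portal with something like the following (myprog.c is a hypothetical source file; the exact link flags may vary with the installation):

 >> gcc -o myprog myprog.c -llapack -lblas -lm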

If you need any other compiler and/or library, send an e-mail to USMB user support.

Password:
If you want to change your password:

  • run "openssl passwd"
  • enter the new password
  • copy the encrypted message displayed on the screen and send it to USMB user support
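
For illustration, a typical session looks like this (the encrypted string shown is a made-up placeholder; the actual format depends on the openssl version):

 >> openssl passwd
 Password:
 Verifying - Password:
 mxyQ7vDeGzkBs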

Developing / testing batch job

The lappuds3 and lappuds4 environment is identical to that of the computing machines, so you can compile and test your jobs on either machine before submitting them to the computation farm.

Note: keep your test jobs reasonably sized; lappuds3 and lappuds4 are not intended for heavy computation and are shared by many users.


  • $HOME

If you don't specify in which directory your job should work, the default one will be $HOME

  • $TMPDIR

If your job performs a lot of I/O while running, it can be better to read/write your files locally on the worker node instead of in your working directory. In this case you can use the worker node's local directory $TMPDIR to read/write your temporary data files.

Note: if you use $TMPDIR, don't forget to copy your result files back to your own directory at the end of the job and then to clean up $TMPDIR.
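
A minimal job script sketch following this pattern (the input/output file names and the program are hypothetical placeholders):

 #!/bin/bash
 # stage input data to the worker node's local disk
 cp $HOME/data/input.dat $TMPDIR/
 cd $TMPDIR
 # run the (hypothetical) program; all I/O happens locally
 $HOME/bin/my_program input.dat > result.out
 # retrieve the result before the job ends, then clean up $TMPDIR
 cp result.out $HOME/results/
 rm -f $TMPDIR/input.dat $TMPDIR/result.out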

Submitting batch jobs

>>> How to submit a job : qsub

Use qsub command to submit your job to computation farm:

 >> qsub -V -j oe -o sOutputFileName  -M mailing_address -m aeb  -l walltime=01:00:00,mem=512mb   sBatchFileName.sh
 
 where: 
    sOutputFileName is the output log file name
    standard output and error are redirected to sOutputFileName via the "-j oe -o ..." options
    -M <your_email_address> -m aeb       (a: abort, e: end, b: begin)
         to receive an e-mail on job error/end, you have to specify both the -M and -m options
    -l walltime is the maximum execution time (hours:minutes:seconds)
       mem is the requested memory
    sBatchFileName.sh is the file you want to execute on the computing machines
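
For reference, a minimal sBatchFileName.sh could look like this (the directory and program names are hypothetical placeholders):

 #!/bin/bash
 # move to the job's working directory (defaults to $HOME otherwise)
 cd $HOME/myjob
 # run the (hypothetical) executable
 ./my_program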


By default, walltime is set to 56 hours. It can be checked using the following command:

 >> qmgr -c 'list queue local' | grep walltime

If you specify a different walltime, your job will be stopped if its execution time exceeds it.

The maximum time allocated to a job is 56 hours. Every job is killed beyond this time.

>>> Job memory and CPU

The information about the job's CPU/memory consumption is not automatically appended to the log file. You have to add the -M <your_email_address> -m aeb options when submitting the job to receive this information by e-mail when the job is done.

Note: no mail is sent to the user if a job is killed because it exceeded the walltime limit.

>>> Test queue : option -q flash

A specific job queue is dedicated to tests. The maximum walltime allocated to these jobs is 5 minutes.

To use this queue, add the following option to your qsub command: -q flash

Note: this queue is not intended for intensive use by many short jobs.
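
For example, reusing the script from the section above:

 >> qsub -q flash sBatchFileName.sh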

>>> Changing groups : option -W group_list

Some users are members of several groups and need to submit jobs from any of them.

To use a different group than the default one, add the following option to your qsub command: -W group_list=<desired group name>
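
For example, for a (hypothetical) secondary group named lama:

 >> qsub -W group_list=lama sBatchFileName.sh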

>>> MPI jobs (multiprocessor jobs)

There are 3 MPI libraries available on the cluster, MPICH-1, MPICH-2 and OpenMPI.

Note: OpenMPI is suitable in most cases and the easiest to use; you should try it first.

The maximum number of processors allocated to a job is 32.


To run an MPI job, you need to create a script including lines like the following:

  • CPU_NEEDED=8  : number of CPUs (from 2 to 32) to be reserved for parallel execution
  • $MPI_OPENMPI_PATH/bin/mpicc -o <your exe> <your source>  : if you wish to compile your code before running it
  • $MPI_OPENMPI_PATH/bin/mpirun -np $CPU_NEEDED <your exe>  : to run the executable on the number of CPUs you just asked for

Then run the job with: qsub -lnodes=<same value as CPU_NEEDED in your script> <your script>
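
Putting it together, a sketch of such a script (hello.c and the output name are hypothetical):

 #!/bin/bash
 # run_mpi.sh - OpenMPI job sketch
 CPU_NEEDED=8
 # compile the (hypothetical) source with the OpenMPI wrapper compiler
 $MPI_OPENMPI_PATH/bin/mpicc -o hello hello.c
 # run on the reserved number of CPUs
 $MPI_OPENMPI_PATH/bin/mpirun -np $CPU_NEEDED ./hello

and submit it with:

 >> qsub -lnodes=8 run_mpi.sh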


Note: if your MPI jobs are really I/O intensive, you can use a specific set of computing nodes to improve performance. These nodes are connected by a dedicated Infiniband network, and jobs can be submitted to them by adding the option '-q localIB' to the usual 'qsub -lnodes=....' command.

>>> Jobs with licensed software

Four licensed software packages are currently available on the computation farm for USMB users:

  • Mathematica 5.2 and 6.0
  • Abaqus 6.7-1, 6.8-3, 6.9-EF1 and 6.13-4
  • Matlab 2017a (unlimited access for USMB users)
  • Ansys CFX 12.1 and 13.0

All have been installed in the /univ_home/UNIVSOFT/COMMON/ directory. The binaries to use are:

  • /univ_home/UNIVSOFT/COMMON/Mathematica/5.2/Executables/math
  • /univ_home/UNIVSOFT/COMMON/Mathematica/6.0/Executables/math
  • /univ_home/UNIVSOFT/COMMON/Abaqus/Commands/abq671
  • /univ_home/UNIVSOFT/COMMON/Abaqus/Commands/abq683
  • /univ_home/UNIVSOFT/COMMON/Abaqus/Commands/abq69ef1
  • /univ_home/UNIVSOFT/COMMON/Abaqus/Commands/abq6134 (same as /univ_home/UNIVSOFT/COMMON/Abaqus/Commands/abaqus which is the default)
  • /univ_home/UNIVSOFT/COMMON/Matlab/2017a/bin/matlab (with option -nodisplay)
  • /univ_home/UNIVSOFT/COMMON/Ansys/v121/CFX/bin/cfx5
  • /univ_home/UNIVSOFT/COMMON/Ansys/v130/CFX/bin/cfx5
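
For instance, a batch script running a (hypothetical) Matlab script in non-interactive mode could look like:

 #!/bin/bash
 # run Matlab without a display, reading commands from my_script.m (a placeholder name)
 /univ_home/UNIVSOFT/COMMON/Matlab/2017a/bin/matlab -nodisplay < my_script.m > my_script.log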


Note: if your laboratory has already purchased one of these packages and you wish to use it on the computation farm but do not yet have permission to do so, please contact USMB user support and we will configure your access rights accordingly.

Note: if you wish to run another version of the licensed software listed above, or if you need to run another licensed software package on MUST, please contact USMB user support.

Monitoring batch jobs

>>> Job monitoring : showq

The showq command shows the list of all jobs submitted to the computation farm.

This command displays the list of running jobs, the batch farm activity and the list of idle jobs.

For each job, one can get the job identifier, user name, number of processors used by the job, and its start and remaining time.

 >> showq

 ACTIVE JOBS ----------------------------------------------------------------------------
 JOBNAME            USERNAME      STATE     PROC     REMAINING        STARTTIME
 52800              plokiju       Running     8        9:37:03    Sat Jun 30 19:39:16
 53024              lhcb023       Running     1     1:02:27:48    Sun Jul  1 12:30:01
 ...
 
 121 Active Jobs     128 of  160 Processors Active (80.00%)           
                      32 of   32 Nodes Active      (100.00%)

 IDLE JOBS ------------------------------------------------------------------------------
 JOBNAME            USERNAME      STATE     PROC     WCLIMIT          QUEUETIME
 53203               lhcb023       Idle       1     1:12:00:00    Sun Jul  1 20:28:17
 53204               lhcb023       Idle       1     1:12:00:00    Sun Jul  1 20:28:18
 53205               lhcb023       Idle       1     1:12:00:00    Sun Jul  1 20:28:33

Note: the showq display is updated periodically, so this command does not show the batch queue status in real time.

Note: the rank of an idle job varies dynamically over time according to the priorities calculated by the queue task manager.

So don't worry if your jobs don't appear instantly in the job queue or if they start at the last rank of the idle jobs queue.

Note: due to a limitation of the scheduler's capacity to deal with very large numbers of jobs, the size of the idle jobs queue is limited, so some jobs may appear as "blocked" when the cluster is overloaded. This does not mean that the jobs were rejected by the scheduler; they will be processed automatically (switched to idle status) as soon as some CPUs become available.

>>> How to cancel a job : canceljob

>> canceljob <job_id>

To get the job ID, use the showq command defined above.
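
For example, to cancel the first job shown in the showq output above:

 >> canceljob 52800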

>>> Check user's priority : diagnose

Users' priorities can be checked via the diagnose command

>> diagnose -f

The output of this command is not really "human readable"...


Automated data transfer (based on ssh key exchange)

If your local machine runs a Linux OS, it is possible to transfer files between your machine and the MUST portal and worker nodes without having to enter a password. This process is based on an ssh key exchange between the machines.

>>> To transfer files from MUST portal and WN to USMB laboratories machines

>>> ssh key exchange : MUST => USMB laboratories

Log in to the lappuds portal and generate an ssh key

ssh-keygen -t rsa

When prompted for a passphrase, enter an empty passphrase

Generating public/private rsa key pair.
Enter file in which to save the key (/univ_home/user/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /univ_home/user/.ssh/id_rsa.
Your public key has been saved in /univ_home/user/.ssh/id_rsa.pub.
The key fingerprint is:
..... udsautre@lappuds4.in2p3.fr

Two keys are generated in your $HOME/.ssh directory: id_rsa is your private key and id_rsa.pub is your public key.
The permissions on these files should be set as follows:

  • ls -ls $HOME/.ssh

-rw-r--r-- 1 user autre-labo 221 Apr 10 00:08 id_rsa.pub
-rw------- 1 user autre-labo 883 Apr 10 00:08 id_rsa

If the permissions are wrong, use the following commands to reset them:

  • cd $HOME/.ssh
  • chmod 600 id_rsa
  • chmod 644 id_rsa.pub


To be able to transfer data files from MUST to your remote storage machine, you have to transfer your public key to this machine and register it in the $HOME/.ssh/authorized_keys file:

ssh user_USMB@USMB_storage_machine 'test -d ~/.ssh || mkdir ~/.ssh ; cat >> ~/.ssh/authorized_keys' < ~/.ssh/id_rsa.pub
where user_USMB is your user name on laboratory storage machine and USMB_storage_machine is the name of the laboratory storage machine
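
On most Linux systems, the ssh-copy-id utility can perform the same key installation in a single step (it asks for your password once, then registers the key):

ssh-copy-id -i ~/.ssh/id_rsa.pub user_USMB@USMB_storage_machine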


>>> How to copy a file : MUST => USMB laboratories

Once the ssh key exchange described above is done, no password will be required anymore to transfer files from MUST to your laboratory machines.
The command to use to transfer a file from the MUST portal and worker nodes is:

scp /MUST_directory/myFile user_USMB@USMB_storage_machine:/USMB_directory/myFile

where user_USMB is your user name on laboratory storage machine and USMB_storage_machine is the name of your laboratory storage machine

>>> To transfer files from USMB laboratories machines to MUST portal and WN

>>> ssh key exchange : USMB laboratories => MUST

Log in to your laboratory machine and generate an ssh key

ssh-keygen -t rsa

When prompted for a passphrase, enter an empty passphrase

Generating public/private rsa key pair.
Enter file in which to save the key (/USMB_home/user/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /USMB_home/user/.ssh/id_rsa.
Your public key has been saved in /USMB_home/user/.ssh/id_rsa.pub.
The key fingerprint is:
..... user@your_machine

Two keys are generated in your $HOME/.ssh directory: id_rsa is your private key and id_rsa.pub is your public key.
The permissions on these files should be set as follows:

  • ls -ls $HOME/.ssh

-rw-r--r-- 1 USMB_user USMB_group 221 Apr 10 00:08 id_rsa.pub
-rw------- 1 USMB_user USMB_group 883 Apr 10 00:08 id_rsa

If the permissions are wrong, use the following commands to reset them:

  • cd $HOME/.ssh
  • chmod 600 id_rsa
  • chmod 644 id_rsa.pub


To be able to transfer data files from your remote storage machine to MUST, you have to transfer your public key to the MUST portal and register it in the $HOME/.ssh/authorized_keys file:

ssh user_lappuds4@lappuds4.in2p3.fr 'test -d ~/.ssh || mkdir ~/.ssh ; cat >> ~/.ssh/authorized_keys' < ~/.ssh/id_rsa.pub
where user_lappuds4 is your user name on the MUST portal


>>> How to copy a file : USMB laboratories => MUST

Once the ssh key exchange described above is done, no password will be required anymore to transfer files from your laboratory machines to MUST. The command to use to transfer a file to the MUST portal and worker nodes is:

scp user_USMB@USMB_storage_machine:/USMB_directory/myFile /MUST_directory/myFile

where user_USMB is your user name on laboratory storage machine and USMB_storage_machine is the name of your laboratory storage machine

Using graphical tools

The NX software enables you to run graphical applications on the lappuds3 or lappuds4 machines, in an optimized way, even over a slow network connection (ADSL).

This tool requires the installation of a client module (available for Windows or Linux). This client replaces both an SSH client and an X emulator.

>>> Installing NX client

Download the software at http://www.nomachine.com/download.php and choose the package(s) matching your operating system and architecture under "NX Client Products"

  • On Windows, run the executable you downloaded and answer the questions.
  • On Linux, run installation as root with something like "rpm -i nxclient-x.x.x-x.x.rpm".

>>> Configuring NX client

>>> Tips and tricks

Tutorials

LAMA, Bourget du Lac: January 22nd
