Batch other partners


This section is meant only for users located outside the LAPP, LAPTH and Université Savoie Mont Blanc (USMB) laboratories.

MUST reference in your publications

In accordance with the principle adopted at a previous steering committee meeting, please make sure that the following sentence appears in any publication whose results rely on the computing and storage facilities offered by the Université Savoie Mont Blanc computing centre (méso-centre de calcul MUST). The French version is: "Ce travail a été réalisé grâce aux services offerts par le méso-centre de calcul MUST de l'Université Savoie Mont Blanc".

Please do not forget to add a reference to MUST in your publications if you used the cluster computing facility to obtain your results. The English version is: "This work has been done thanks to the facilities offered by the Université Savoie Mont Blanc MUST computing center".

Useful links

>>> MUST user support

Your message will be handled by the LAPP computing department helpdesk (Request Tracker). Please avoid contacting a member of the IT department directly; use the following generic address instead.

Send an e-mail to: MUST user support

>>> MUST jobs monitoring

You can check the load of the system on the MUST monitoring Web page.

Getting an account to access the computation farm

First, fill in the following electronic form: http://lappweb.in2p3.fr/informatique/Univ_Savoie/

Once done, print the PDF file that is generated. Read the document (the CNRS computing charter), sign it, and ask your laboratory director to sign it as well. Then send this form to Muriel Gougerot as specified at the bottom of the form.

lappuds4 portal machine

Once your account is created, you will receive an e-mail with the login/password to connect to the lappuds4 machine.

  • to log in: ssh <user_name>@lappuds4.in2p3.fr
  • your home directory will be /univ_home/<group_name>/<user_name>; the default allocated space ranges from 1 GB to 10 GB.

This machine is a portal you have to log in to in order to prepare and submit your jobs to the computation farm. Its computing characteristics are identical to those of the computing machines.

Technical characteristics:

  • OS: SL5 - 64 bits (32-bit compatible)
  • compilers: C, C++, Fortran 77/90
  • OpenMPI, MPICH1, MPICH2 (parallel computation)
  • libraries: BLAS, LAPACK

In case you need any other compiler and/or library, send an e-mail to MUST user support.

Password:
If you want to change your password:

  • run openssl passwd
  • enter the new password
  • copy the encrypted hash displayed on the screen and send it to MUST user support
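
For example (the hash shown below is illustrative only; yours will differ):

 >> openssl passwd
 Password:
 Verifying - Password:
 xH2tPVexzVmQ2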

Developing / testing batch job

The lappuds4 machine environment is identical to that of the computing machines, so you can compile and test your jobs on this machine before submitting them to the computation farm. A lappuds3 machine similar to lappuds4 is also available; you can use either of them.


Note: parametrize your test jobs reasonably; lappuds4 is not intended for heavy computation and is shared by many users.


  • $HOME

If you don't specify the directory in which your job should work, it defaults to $HOME.

  • $TMPDIR

If your job performs a lot of I/O while running, it can be better to read/write your files locally on the worker node rather than in your working directory. In this case, use the worker node (WN) local directory $TMPDIR to read/write your temporary data files.

Note: if you use $TMPDIR, don't forget to copy your result files back to your own directory at the end of the job and then to clean $TMPDIR, as shown in the sketch below.
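
A minimal sketch of this pattern (my_program, input.dat and results.dat are placeholder names):

 #!/bin/bash
 # copy the input data to the worker node local directory
 cp $HOME/mydata/input.dat $TMPDIR/

 # run the job with local I/O on the worker node
 cd $TMPDIR
 $HOME/bin/my_program input.dat > results.dat

 # copy the results back to the home directory, then clean up $TMPDIR
 cp $TMPDIR/results.dat $HOME/mydata/
 rm -f $TMPDIR/input.dat $TMPDIR/results.dat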

Submitting batch jobs

>>> How to submit a job : qsub

Use the qsub command to submit your job to the computation farm:

 >> qsub -V -j oe -o sOutputFileName  -M mailing_address -m aeb  -l walltime=01:00:00,mem=512mb   sBatchFileName.sh
 
 where: 
    sOutputFileName is the output log file name
    standard output and error are redirected to sOutputFileName via the "-j oe -o ..." options 
    -M <your_email_address> -m aeb       ( a: abort, e: end, b: begin )
         if you wish to receive an e-mail on error/end of job, you have to specify both the -M and -m options
    -l walltime is the maximum run time (hours:minutes:seconds)
       mem is the memory limit
    sBatchFileName.sh is the script you want to execute on the computing machines
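
sBatchFileName.sh itself is an ordinary shell script. A minimal hypothetical example (my_program is a placeholder for your own executable):

 #!/bin/bash
 # report where the job runs, then start the actual work
 echo "Job running on $(hostname), working directory: $PWD"
 $HOME/bin/my_program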


By default, the walltime is set to 56 hours. It can be checked using the following command:

 >> qmgr -c 'list queue local' | grep walltime

If you specify a different walltime, your job will be stopped as soon as its execution time exceeds it.

The maximum time allocated to a job is 56 hours. Every job is killed beyond this limit.

>>> Job memory and CPU

The information on a job's CPU/memory consumption is not appended automatically to the log file. To receive this information by mail when the job is done, add the -M <your_email_address> -m aeb options when submitting the job.

Note: no mail is sent to the user if a job is killed because it exceeded the walltime limit.

>>> Test queue : option -q flash

A specific job queue is dedicated to tests. The maximum walltime allocated to these jobs is 5 minutes.

To use this queue, add the following option to your qsub command: -q flash

Note: this queue is not intended to be used intensively for short jobs.
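
For example, assuming a short test script my_test.sh:

 >> qsub -q flash -j oe -o my_test.log my_test.sh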

>>> MPI jobs (multiprocessor jobs)

There are 3 MPI libraries available on the cluster, MPICH-1, MPICH-2 and OpenMPI.

Note: OpenMPI is suitable in most cases and the easiest to use; you should try it first.

The maximum number of processors allocated to a job is 32.


To run an MPI job, you need to create a script including lines like the following (a complete sketch is given after this list):

  • CPU_NEEDED=8  : number of CPUs (from 2 to 32) to be reserved for parallel execution
  • $MPI_OPENMPI_PATH/bin/mpicc -o <your exe> <your source>  : if you wish to compile your code before running it
  • $MPI_OPENMPI_PATH/bin/mpirun -np $CPU_NEEDED <your exe>  : to run the executable on the number of CPUs you just asked for

Then run the job with : qsub -lnodes=<same value as CPU_NEEDED in your script> <your script>
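
Putting it together, a sketch of a complete submission script (run_mpi.sh and hello_mpi.c are hypothetical names):

 #!/bin/bash
 # run_mpi.sh - compile an OpenMPI program, then run it on 8 CPUs
 CPU_NEEDED=8
 $MPI_OPENMPI_PATH/bin/mpicc -o hello_mpi hello_mpi.c
 $MPI_OPENMPI_PATH/bin/mpirun -np $CPU_NEEDED hello_mpi

It would then be submitted with:

 >> qsub -lnodes=8 run_mpi.sh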

>>> Jobs with licensed software

Three licensed software packages are currently available on the computation farm. They have been installed for the Université Savoie Mont Blanc laboratories and are routinely used by many students:

  • Mathematica 5.2 and 6.0
  • Abaqus 6.7-1, 6.8-3 and 6.9-EF1
  • Matlab 2008b and 2009a

Note: MUST does not provide any licence to use this software. It is up to your laboratory to provide one: a connection to your licence key server will be configured so that your jobs obtain your own licence when they start.


If you wish to run licensed software on MUST, please contact MUST user support and we will let you know whether it is possible.

>>> Other available software

Monitoring batch jobs

>>> Job monitoring : showq

The showq command shows the list of all jobs submitted to the computation farm.

This command displays the list of running jobs, the batch farm activity and the list of idle jobs.

For each job, one can see the job identifier, the user name, the number of processors used, the start time and the remaining time.

 >> showq

 ACTIVE JOBS ----------------------------------------------------------------------------
 JOBNAME            USERNAME      STATE     PROC     REMAINING        STARTTIME
 52800              plokiju       Running     8        9:37:03    Sat Jun 30 19:39:16
 53024              lhcb023       Running     1     1:02:27:48    Sun Jul  1 12:30:01
 ...
 
 121 Active Jobs     128 of  160 Processors Active (80.00%)           
                      32 of   32 Nodes Active      (100.00%)

 IDLE JOBS ------------------------------------------------------------------------------
 JOBNAME            USERNAME      STATE     PROC     WCLIMIT          QUEUETIME
 53203               lhcb023       Idle       1     1:12:00:00    Sun Jul  1 20:28:17
 53204               lhcb023       Idle       1     1:12:00:00    Sun Jul  1 20:28:18
 53205               lhcb023       Idle       1     1:12:00:00    Sun Jul  1 20:28:33

Note: the showq display is updated periodically, so this command does not show the batch queue status in real time.

Note: the rank of an idle job will vary dynamically over time according to the priorities computed by the queue task manager.

So don't worry if your jobs don't appear instantly in the job queue or if they start at the last rank of the idle jobs queue.

Note: because the scheduler cannot cope with too large a number of jobs, the size of the idle jobs queue is limited, so some jobs may appear as "blocked" when the cluster is overloaded. This does not mean that these jobs were rejected by the scheduler; they will be processed automatically (switched to idle status) as soon as some CPUs become available.

>>> How to cancel a job : canceljob

>> canceljob <job_id>

To get the job ID, use the showq command described above.

>>> Check user's priority : diagnose

Users' priorities can be checked via the diagnose command:

>> diagnose -f

The output of this command is not really "human readable"...


Automated data transfer (based on ssh key exchange)

If your local machine runs a Linux OS, it is possible to transfer files between your machine and the MUST portal and worker nodes without having to give a password. This mechanism is based on an ssh key exchange between the machines.

>>> To transfer files from MUST portal and WN to your laboratory machines

>>> ssh key exchange : MUST => your laboratory

Log in to the lappuds4 portal and generate an ssh key:

ssh-keygen -t rsa

When prompted for a passphrase, enter an empty passphrase:

Generating public/private rsa key pair.
Enter file in which to save the key (/univ_home/group/user/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /univ_home/group/user/.ssh/id_rsa.
Your public key has been saved in /univ_home/group/user/.ssh/id_rsa.pub.
The key fingerprint is:
..... udsautre@lappuds4.in2p3.fr

Two keys are generated in your $HOME/.ssh directory: id_rsa is your private key and id_rsa.pub is your public key.
The permissions of these files should be set as follows:

  • ls -ls $HOME/.ssh

-rw-r--r-- 1 user autre-labo 221 Apr 10 00:08 id_rsa.pub
-rw------- 1 user autre-labo 883 Apr 10 00:08 id_rsa

If the permissions are wrong, use the following commands to reset them:

  • cd $HOME/.ssh
  • chmod 600 id_rsa
  • chmod 644 id_rsa.pub


To be able to transfer data files from MUST to your remote storage machine, you have to copy your public key to this machine and register it in its $HOME/.ssh/authorized_keys file:

ssh user_lab@lab_storage_machine "test -d ~/.ssh || mkdir ~/.ssh; cat >> ~/.ssh/authorized_keys" < ~/.ssh/id_rsa.pub
where user_lab is your user name on the laboratory storage machine and lab_storage_machine is the name of the laboratory storage machine
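
You can check that the key exchange worked: the following command should print the remote host name without asking for a password:

 ssh user_lab@lab_storage_machine hostname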


>>> How to copy a file : MUST => your laboratory

Once the ssh key exchange described above is done, no password will be required anymore to transfer files from MUST to your laboratory machines.
The command to transfer a file from the MUST portal and worker nodes is:

scp /MUST_directory/myFile user_name@lab_storage_machine:/your_directory/myFile

where user_name is your user name on the laboratory storage machine and lab_storage_machine is the name of your laboratory storage machine

>>> To transfer files from your laboratory machines to MUST portal and WN

>>> ssh key exchange : your laboratory => MUST

Log in to your laboratory machine and generate an ssh key:

ssh-keygen -t rsa

When prompted for a passphrase, enter an empty passphrase:

Generating public/private rsa key pair.
Enter file in which to save the key ($HOME/.ssh/id_rsa): 
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in $HOME/.ssh/id_rsa.
Your public key has been saved in $HOME/.ssh/id_rsa.pub.
The key fingerprint is:
..... user@your_machine

Two keys are generated in your $HOME/.ssh directory: id_rsa is your private key and id_rsa.pub is your public key.
The permissions of these files should be set as follows:

  • ls -ls $HOME/.ssh

-rw-r--r-- 1 user_name group_name 221 Apr 10 00:08 id_rsa.pub
-rw------- 1 user_name group_name 883 Apr 10 00:08 id_rsa

If the permissions are wrong, use the following commands to reset them:

  • cd $HOME/.ssh
  • chmod 600 id_rsa
  • chmod 644 id_rsa.pub


To be able to transfer data files from your remote storage machine to MUST, you have to copy your public key to the MUST portal and register it in your $HOME/.ssh/authorized_keys file there:

ssh user_lappuds4@lappuds4.in2p3.fr "test -d ~/.ssh || mkdir ~/.ssh; cat >> ~/.ssh/authorized_keys" < ~/.ssh/id_rsa.pub
where user_lappuds4 is your user name on the MUST portal.
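
As before, you can check that the key exchange worked: the following command, run from your laboratory machine, should print the portal host name without asking for a password:

 ssh user_lappuds4@lappuds4.in2p3.fr hostname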


>>> How to copy a file : your laboratory => MUST

Once the ssh key exchange described above is done, no password will be required anymore to transfer files from your laboratory machines to MUST. The command to transfer a file to the MUST portal and worker nodes, run from your laboratory machine, is:

scp /your_directory/myFile user_lappuds4@lappuds4.in2p3.fr:/MUST_directory/myFile

where user_lappuds4 is your user name on the MUST portal, /your_directory/myFile is the file on your laboratory machine and /MUST_directory/myFile is the destination path on MUST

Tutorials & examples

A set of examples and exercises is available here: [1]
Note: as this tutorial was written for the Université Savoie Mont Blanc laboratories, you need to adapt the home directory names to your own to run the given examples.
