
Here are some recent instructions for running jobs on Rivanna (by Nick & Ishara):

1. Make sure that Prof. Keller has added you to both the spin and spinquest groups on Rivanna. Without both groups, you will not be able to access the system. There are two ways of accessing Rivanna (https://www.rc.virginia.edu/userinfo/rivanna/login/); you can follow either step (2) or step (3) below.

2. Web-based Access  (Click on "Launch OpenOndemand" > You will need your UVA computing ID and password to log in)

  • You can navigate to your "Files", "Jobs", "Clusters", etc. via the menu bar (see the above image)

[Image: Open OnDemand menu bar]

...

***

Using TensorFlow with BKM2002-Formulation


1.  Copy the sample files from the following Rivanna folder "/project/ptgroup/ANN_scripts/BKM-Formulation-Test/BKM-tf"
       cd  /project/ptgroup/ANN_scripts/BKM-Formulation-Test/BKM-tf

2. Run the following commands on your terminal
       module load anaconda/2020.11-py3.8
       module load singularity/3.7.1
       module load tensorflow/2.1.0-py37

       The following step needs to be run only once (it copies the relevant .sif file to your /home directory)
       cp $CONTAINERDIR/tensorflow-2.1.0-py37.sif /home/$USER

...

#!/usr/bin/env bash
#SBATCH -p standard
#SBATCH --output=result_%a.out
#SBATCH -c 1
#SBATCH -t 16:30:00
#SBATCH -A spinquest

 (make sure that the same module loads are included in your grid.slurm file)

3. Run the following command to submit the job
    ./jobscript.sh <Name_of_Job> <Number_of_Replicas>

example: 
./jobscript.sh CFF_BKM_tf_Test 10
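
The steps above ask that grid.slurm load the same modules as your terminal session. As a minimal sketch (the helper name check_modules is hypothetical and not part of the sample files), the following shell function reports any module from a given list that has no matching "module load" line in the file. It assumes each module is loaded on its own line, written exactly as "module load <name>" with single spaces.

```shell
#!/usr/bin/env bash
# check_modules FILE MODULE...
# Hypothetical helper: prints one "missing: ..." line for every MODULE that
# FILE does not load, and returns 1 if anything is missing (0 otherwise).
# Example call:
#   check_modules grid.slurm anaconda/2020.11-py3.8 tensorflow/2.1.0-py37
check_modules() {
    local file="$1"
    shift
    local missing=0
    local m
    for m in "$@"; do
        # -F treats the module name as a fixed string, so "." and "/" are literal
        if ! grep -qF -- "module load $m" "$file"; then
            echo "missing: module load $m"
            missing=1
        fi
    done
    return "$missing"
}
```

The same call works for any other job file if you pass that file name as the first argument.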

Using PyTorch with BKM2002-Formulation



1.  Copy the sample files from the following Rivanna folder "/project/ptgroup/ANN_scripts/BKM-Formulation-Test/BKM-PyTorch"
       cd  /project/ptgroup/ANN_scripts/BKM-Formulation-Test/BKM-PyTorch

2. Run the following commands on your terminal
       module load anaconda/2020.11-py3.8
       module load singularity/3.7.1
       module load pytorch/1.8.1

...

singularity run --nv /home/$USER/tensorflow-2.1.0-py37.sif /home/cee9hc/ANN_GPD_Calc_Test/Full_ML_fit_evaluation.py ${SLURM_ARRAY_TASK_ID}


       The following step needs to be run only once (it copies the relevant .sif file to your /home directory)
       cp $CONTAINERDIR/pytorch-1.8.1.sif /home/$USER

 (make sure that the same module loads are included in your grid.slurm file)

3. Run the following command to submit the job
   $ ./jobscript.sh <Name_of_Job> <Number_of_Replicas>

example:  $ ./jobscript.sh CFF_BKM_PyTorch_Test 10


Note:
If you download the code from GitHub to a Windows machine and then upload those files to Rivanna, you will need to run the following steps:

$ chmod u+x jobscript.sh
$ sed -i -e 's/\r$//' jobscript.sh
   and
$ sed -i -e 's/\r$//' <all_files>
   to strip the DOS (Windows) carriage returns and avoid DOS/Unix line-ending problems.

** If you copy the files from /project/ptgroup/ANN_scripts/BKM-Formulation-Test/BKM-PyTorch, you do not need these modification steps. **
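
The chmod and sed steps in this note can be applied to a whole directory at once. The sketch below is one way to do that, assuming GNU sed (as on Linux) and assuming that only .sh, .slurm, and .py files need converting; the function name fix_windows_upload is hypothetical.

```shell
#!/usr/bin/env bash
# fix_windows_upload [DIR]
# Hypothetical helper: strips trailing carriage returns from the script files
# in DIR (default: current directory) and makes the .sh files executable.
fix_windows_upload() {
    local dir="${1:-.}"
    local f
    for f in "$dir"/*.sh "$dir"/*.slurm "$dir"/*.py; do
        [ -f "$f" ] || continue      # skip patterns that matched no file
        sed -i -e 's/\r$//' "$f"     # same substitution as in the note
    done
    for f in "$dir"/*.sh; do
        [ -f "$f" ] || continue
        chmod u+x "$f"               # same permission change as in the note
    done
}
```

Data files such as .csv are deliberately left untouched here; add their extensions to the first loop only if you know they were altered by the Windows round trip.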

For more details, check Zulkaida's folder on the GitHub page: https://github.com/extraction-tools/ANN/tree/master/Zulkaida/BKM





Using TensorFlow with VA-Formulation


The following steps show an example of submitting a job that runs a neural-net fit to N kinematic settings in the data set (where N is an integer determined by the range of kinematic settings that you pass to the sbatch command when submitting the job).


1. Make sure that Prof. Keller has added you to both the spin and spinquest groups in Rivanna.
2. Copy the sample files from the following Rivanna folder "/project/ptgroup/ANN_scripts/VA-Formulation-Initial-Test"
      $ cd  /project/ptgroup/ANN_scripts/VA-Formulation-Initial-Test

Here is the list of files that you need to have in your working directory:

Definitions
                      BHDVCStf.py
                      Lorentz_Vector.py
                      TVA1_UU.py
Data file →  dvcs_xs_May-2021_342_sets.csv
Main file → Full_ML_fit_evaluation_Set2.py
Job submission file → Job.slurm

3. Change the path(s) in the following files
    3.1) Highlighted line in "Job.slurm" file (please see below) with the correct path of 'your files'
           [Image: Job.slurm with the path line highlighted]
   
   3.2) Similarly update the paths on "Full_ML_fit_evaluation_Set2.py" file
           Line numbers → 22, 31, 154
 

4. For a quick test, you can reduce the number of samples (in other words, the number of replicas) to a small value. It is set on line 115, for example 'numSamples = 10'. You can change numSamples to any number of replicas that you need.

5. Run the following commands on your terminal
       $ module load anaconda/2020.11-py3.8
       $ module load singularity/3.7.1
       $ module load tensorflow/2.1.0-py37
       $ cp $CONTAINERDIR/tensorflow-2.1.0-py37.sif /home/$USER
 (make sure that the same module loads are included in your Job.slurm file)

6. Run the following command
      $ sbatch --array=0-2 Job.slurm
   Note: The array range (0-2 in this example) selects the kinematic settings that you want to run in parallel (this is a parallelization of the local fits). As part of the output you will see Results#.csv files (where # is an integer), which contain the distributions of Compton Form Factors (CFFs) from each individual local fit.

Below is an example of the above steps (up to step #6):
[Image: terminal session showing the steps above]


7. After you submit your job:
   * You can view your jobs using the web browser (please see the following screenshot)
   [Image: Jobs page in the web browser]

   * You can find commands to check the status of your job, cancel job(s), other commands related to handling jobs using .slurm file etc. using the following page
      https://www.rc.virginia.edu/userinfo/rivanna/slurm/

8. At the end of your job, you will find several types of output files (please see the following screenshot)
    Results*.csv  → These files contain the CFF distributions corresponding to each kinematic setting
    best-netowrk*.hdf5 → These files are the 'best'/'optimum' neural-network files for each kinematic setting
    result_*.out → These files contain the output produced while the job was running for each kinematic setting
[Image: list of output files]
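
Once the array job finishes, it can be useful to confirm that every task produced its Results file. The sketch below assumes that the integer in Results#.csv is the Slurm array task ID (so --array=0-2 would yield Results0.csv through Results2.csv); that mapping and the function name missing_results are assumptions, not something the sample files guarantee.

```shell
#!/usr/bin/env bash
# missing_results FIRST LAST [DIR]
# Hypothetical helper: prints the name of every Results<N>.csv that is absent
# from DIR (default: current directory) for N = FIRST..LAST.
# Returns 1 if any file is missing, 0 if all are present.
missing_results() {
    local first="$1"
    local last="$2"
    local dir="${3:-.}"
    local i="$first"
    local status=0
    while [ "$i" -le "$last" ]; do
        if [ ! -f "$dir/Results$i.csv" ]; then
            echo "Results$i.csv"
            status=1
        fi
        i=$((i + 1))
    done
    return "$status"
}
```

For example, after "sbatch --array=0-2 Job.slurm" you could run "missing_results 0 2" in the output directory; a missing entry points you to the matching result_*.out file for that task.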


Note: The true CFF values that were used to generate these pseudo-data are given in 'https://github.com/extraction-tools/ANN/blob/master/Liliet/PseudoData2/dvcs_xs_May-2021_342_sets_with_trueCFFs.csv', only so that you can compare them with what you obtain from your neural net.

Important: Please consider that this is an example for running a neural-net fitting job on Rivanna for your reference.

...

3. Secure Shell Access:

  • You will need the UVA VPN in order to SSH to Rivanna. Follow this link and the instructions on that page to download and configure the VPN. There are three types of network access available: "UVA More Secure Network", "UVA Anywhere", and "High Security VPN". "UVA More Secure Network" is the preferred one, but "UVA Anywhere" also works if "UVA More Secure Network" is not available.
  • Open a UNIX terminal and connect to Rivanna with the command "ssh -Y mst3k@rivanna.hpc.virginia.edu" (replacing mst3k with your computing ID). The password is the same as your UVA NetBadge password. If you do not end up at the Rivanna command line, check that step (1) and the steps above were completed successfully.
  • To move the code and associated resources into your Rivanna directory, you can use secure copy from another terminal. The resources you need for running the code are the same as with the Colab notebook, except that on Rivanna it is necessary to run a pure Python file as opposed to a notebook. I have uploaded a Python version of the same code to GitHub. One additional resource that was not previously necessary is the bash script that runs the Rivanna job, called "Job1.slurm" in this example. You will need to make sure that the locations/paths in your program files are changed to contain your computing ID. Additionally, within the Python file you will want to change the learning rate and numSamples parameters to the specified values.
  • Run the job with the command "sbatch --array=0-14 Job1.slurm". For more information about this command see this link. This will run all the kinematic sets simultaneously for however many replicas you specified with the numSamples parameter.  For 1000 replicas, the process may take up to six or seven hours.  If you desire to run only a handful of kinematic sets for 1000 replicas that can be done much more quickly but requires some slight changes to the code (The replicas would be parallelized instead of the kinematic sets.  In fact, both could be parallelized, but Rivanna prevents you from running more than a few thousand jobs simultaneously).   Let me know if that is the case and I can edit the code for you.

4. Once the job is started, you will see a job ID in your Rivanna terminal. You can monitor the progress of your job by navigating to the "Jobs" page in your web browser (see fig 1 above).

5. The results of the replicas will be in your Rivanna home directory under the names Results(0-14).csv. These can be sent back to your local system for analysis with the scp command or downloaded via Open OnDemand.
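
As a sketch of the scp route mentioned in step 5, the hypothetical helper below prints (without running) a command that copies all Results*.csv files from your Rivanna home directory to a local folder. The host name is the one used in the ssh command above and mst3k is the placeholder computing ID; the function name fetch_results_cmd is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# fetch_results_cmd ID [DEST]
# Hypothetical helper: prints an scp command that would copy Results*.csv from
# the Rivanna home directory of computing ID to DEST (default: current directory).
# The remote pattern is quoted so that it is expanded on Rivanna, not locally.
fetch_results_cmd() {
    local id="$1"
    local dest="${2:-.}"
    printf "scp '%s@rivanna.hpc.virginia.edu:Results*.csv' '%s'\n" "$id" "$dest"
}

# Example: show the command for the placeholder ID, copying into ./results
fetch_results_cmd mst3k ./results
```

Run the printed command on your local machine (not on Rivanna), with the VPN connected and the destination folder already created.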