
Basic Exercises #3
This third set of exercises is designed to familiarise you with using batch jobs to perform some of the actions you have already done, but without an interactive session.
Submit A Simple Job
Submitting a job file when one is provided to you is a matter of copy and paste.
Either copy it from here, or locate it in the folder you copied in the previous exercise (you'll need to find it using ls and cd), or create a new file using vim - please name it workshop.job.
You will have to exit any interactive session. To check the server you are currently on you can use the hostname command.
Simple Job
Job File
#!/bin/bash
#SBATCH --job-name=Workshop_Basic
#SBATCH --partition=Workshop
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G # total memory per node
#SBATCH --output=output.%j.out # output file with job ID
#SBATCH --error=output.%j.err # error file (can be merged with output if you want)
# Print some useful Slurm environment variables to the output file
echo $SLURM_JOB_ID       # the ID of this job
echo $SLURM_SUBMIT_DIR   # the directory the job was submitted from
echo $SLURMD_NODENAME    # the node the job is running on
Submit the file by running:
sbatch workshop.job
And check the newly created output files.
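For example, you could list and view the new files like this (the job ID in the filename will match the number sbatch reported):
ls output.*
cat output.<jobid>.out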
Viewing the Queue
Given that the queue isn't too busy, our job should complete almost immediately.
Let's put a sleep command into the job to get it to wait 10 seconds before finishing. Use man sleep to find out how to use the command. Edit the job file to add the sleep command.
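For example, adding the following line at the end of the script (a minimal sketch) will pause the job for 10 seconds before it finishes:
sleep 10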
Once it has been submitted, run squeue --me and you should see your job either waiting or running in the queue, with details from the job script filling the queue information.
You can view the entire queue by omitting the --me flag.
squeue
This can be quite long and messy - consider piping to the less command, or filtering on the partition(s) relevant to you.
squeue | less
squeue -p workshops,short
File Operations
Simple file read
We can now start piecing together some of the skills we've learnt in previous exercises to perform file operations within a jobscript.
Copy the workshop.job file and name the copy workshop_io.job.
Re-using your penguin_names.txt, modify workshop_io.job to read in the file and print the penguin names into the output file. Change the job name to workshop_io instead of Workshop_Basic. Submit your job and check it was successful.
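A minimal sketch of the line to add, assuming penguin_names.txt sits in the directory you submit the job from (the job starts in that directory by default):
cat penguin_names.txt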
Dynamic File Read
Now modify the same jobscript so that it takes an argument fed from the command line and reads the given filename, like before.
When submitting jobs this way, what do you think you need to be careful of?
# Get the first argument passed to the jobscript
FILENAME=$1
cat "$FILENAME"
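Arguments placed after the jobscript name are passed straight through to the script, so (as a sketch) a submission could look like:
sbatch workshop_io.job penguin_names.txt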
Creating and Destroying
Create a new copy of workshop.job called workshop_cyberman.job, and update the job name accordingly.
When creating and deleting files and folders within a jobscript we need to be extra careful - particularly because using relative filepaths and wildcards can lead to huge losses of data.
Modify your workshop_cyberman.job to create the following file and folder structure in your lustre user space:
You should not be using relative paths - use full paths such as mkdir -p /mnt/lustre/users/its/anon123/example1/folder1, not mkdir example1.
example1/
--->folder1
--->folder2
    --->afile1
    --->afile2
--->folder3
    --->afile1
Run the job and double check the folder has been created and the structure looks like the above.
Modify your jobscript to then create the same structure but with the top level called example2, and then safely remove the files and folders. Use a sleep command of sufficient length to see it was successful (e.g. 60s).
It might be worthwhile adding some echo commands that report when the job has created the folders and when it has deleted them, as in the sketch below.
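A minimal sketch of the second part of workshop_cyberman.job (creating and then removing example2), assuming your lustre space is /mnt/lustre/users/its/anon123 and that afile1/afile2 sit inside folder2 and the remaining afile1 inside folder3, as shown above:
# Assumed lustre path - replace anon123 with your own username
echo "Creating the example2 structure"
mkdir -p /mnt/lustre/users/its/anon123/example2/folder1
mkdir -p /mnt/lustre/users/its/anon123/example2/folder2
mkdir -p /mnt/lustre/users/its/anon123/example2/folder3
touch /mnt/lustre/users/its/anon123/example2/folder2/afile1
touch /mnt/lustre/users/its/anon123/example2/folder2/afile2
touch /mnt/lustre/users/its/anon123/example2/folder3/afile1
echo "Folders and files created"
sleep 60
echo "Removing the example2 structure"
rm -r /mnt/lustre/users/its/anon123/example2
echo "Folders and files removed"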
Using Slurm Variables
Simple Use of Slurm Variables
Copy the workshop.job script and call it workshop_variables.job.
Modify the script to change directory to your ITS home directory, print the output of the current working directory command, and then change back to the jobscript's working directory, using the $HOME and $SLURM_SUBMIT_DIR variables.
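A minimal sketch of the lines to add:
cd $HOME                # move to your home directory
pwd                     # print the current working directory
cd $SLURM_SUBMIT_DIR    # return to the directory the job was submitted from
pwd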
Simple Host check
Sometimes a job can go wrong, either because of an error in your script, a compute node failure, or a central Sussex IT failure. In general it is good practice to write out the hostname of the compute node your job is running on. This way, in the event of a failure which you believe is a fault with the system, you can report which node your job was running on.
Modify your jobscript to write out the hostname of the compute node your job is sent to.
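One way to do this (a sketch - the $SLURMD_NODENAME variable from earlier would work equally well):
echo "Job running on: $(hostname)"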
Simple core count check
Modify your jobscript to ask for 4 cores and only 500MB of RAM per core. Output the number of cores your job was assigned (you'll need $SLURM_NTASKS and $SLURM_CPUS_PER_TASK) to the output file.
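A minimal sketch of the changed lines, assuming the 4 cores are requested as CPUs for a single task (note that --mem-per-cpu replaces the earlier --mem line, as the two options conflict):
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=500M
echo "Tasks: $SLURM_NTASKS, CPUs per task: $SLURM_CPUS_PER_TASK"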
Partitions
In order to make the HPC and scheduler as fair as possible, we use partitions to limit runtimes and number of cores which can be consumed by those jobs.
This is an attempt to prevent a user, on a quiet night, from hogging the entire cluster for 2 weeks…
There are two types of partitions:
- General (short, long, verylong): these partitions are available to everyone and limit CPU, RAM, GPUs and runtime.
- Private (gabrial, vision_lab, cisc, workshops): these are limited to particular research groups who purchased nodes for Artemis. Jobs submitted here will always run before jobs in the general partitions.
As such, if you have a long-running job it is recommended you implement checkpointing in your code. This will allow you to run competitively with more cores and simply resume from where the previous run ended.
Abuse of this method to submit 1000s of small jobs to overwhelm the scheduler logic counts as a policy violation and your account may be suspended.
Un-queuable Job
Copy and create a new jobscript called workshop_oops.job.
Request 129 cores.
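As a sketch, the change in workshop_oops.job might simply be:
#SBATCH --cpus-per-task=129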
Submit the job and check its status in the queue.
Unfortunately this job will never run, so we need to delete it.
In the output of the squeue command, your job ID will have been listed. Using that job ID, submit the command:
scancel <jobid>
Timed Jobs
For ecological reasons you might want to submit a job to run overnight, or after a specific time. An example might be a job which needs to run after another remote system has updated (e.g. downloading data after a public release).
Another option is to run at a time when the UK energy grid is, or is expected to be, at its greenest. Or your job may depend on another job having completed (technically errored, failed and killed jobs will also count as completed…).
Start Job After Timestamp
To do this you can set the --begin flag.
There are many formats for this, and it's important to note that all times are in UTC.
# ISO 8601 format (recommended)
sbatch --begin=2026-12-24T12:00:00 myscript.sh
# With space instead of 'T'
sbatch --begin="2026-12-24 12:00:00" myscript.sh
# Without seconds
sbatch --begin="2026-12-24 12:00" myscript.sh
# With slashes (alternative separator)
sbatch --begin="12/24/2026 12:00:00" myscript.sh
# Without year (assumes current or next year)
sbatch --begin="Dec 24 12:00" myscript.sh
# Full text format
sbatch --begin="December 24, 2026 12:00" myscript.sh
Submit a previous job while setting this command-line flag - set it to be 1 minute after you submit and check back to see if it completed when you expected.
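If you would rather not work out a timestamp by hand, Slurm also accepts relative times, for example:
sbatch --begin=now+1minute workshop.job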
You can look at past jobs that have completed with the sacct -j <jobid> command.
Start Job After another Job completes
Copy and create two new jobscript files called workshop_slow.job and workshop_waiting.job.
Modify the first jobfile to be a simple sleep job for 30 seconds. Have it print the start time to the output file using the date +"%T" command. The second file, workshop_waiting.job, can be any of your previous jobscripts. Add a similar time statement to its log file.
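A minimal sketch of the body of workshop_slow.job:
echo "Job started at $(date +"%T")"
sleep 30
echo "Job finished at $(date +"%T")"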
Now you can submit both jobs in sequence, with the requirement that the first job must complete before the second starts. Here we make use of awk so that only the job ID is returned.
jid1=$(sbatch workshop_slow.job | awk '{print $4}')
sbatch --dependency=afterok:$jid1 workshop_waiting.job
Check the status of the queue and the sequence of timestamps to confirm the order once all your jobs have completed.
Extension Exercises
These are extension exercises for those who wish to spend more time learning the more advanced features of the Slurm batch scheduler. Support for these is not provided, and success will be self-evident from reaching the requested result.
Each of these tasks is completable with the skills you have developed over these three exercises so far. However, you may need to investigate further with --help and man.
Sequential Fibonacci Sequence
Create a jobscript which calculates the Fibonacci sequence by matching the following conditions.
GPU Job
For this you will need to submit a job which reserves a single gpu_card, runs the nvidia-smi command and stores the results to a log file. This log file should also record the date and time the command was run, which node it was run on, and the directory the job was submitted from.
Environment Creator
Create a jobscript which will automatically load at least 30 non-conflicting modules from the same toolchain for a user, and then save that module set for the user. To reduce carbon footprint, this jobscript should only run after midnight on the day following its submission.