Welcome to the new Lisa getting started guide. Please let us know if you find anything that is unclear, incorrect, or if any links are broken.
As the name suggests, this guide helps you start using the Lisa cluster computer. It only treats the absolute basics of the Lisa cluster: connecting to Lisa, running commands on the interactive node and submitting a very simple job script.
However, to use the system efficiently requires more effort. Our User guide contains all you need to know in order to use the system efficiently. After you've mastered the basics explained in this Getting started guide, we strongly encourage you to read the User guide. In this Getting started guide, we also reference the User guide repeatedly as a source of more in-depth information.
We are currently transitioning from the PBS to the SLURM resource manager. This page explains the commands for the SLURM resource manager. If you want to use the old PBS resource manager, please use Getting started (PBS).
You can imagine a cluster computer as a collection of regular computers (known as nodes), tied together with network cables that are similar to the network cables in your home or office. Each node has its own CPU, memory and disk space, in addition to which they generally have access to a shared file system. On a cluster computer, you can run hundreds of computational tasks simultaneously.
Interacting with a cluster computer is different from a normal computer. Normal computers are mostly used interactively, i.e. you type a command or click with your mouse, and your computer instantly responds by e.g. running a program. Cluster computers are mostly used non-interactively.
A cluster computer such as Lisa has (at least) two types of nodes: login nodes and batch nodes. You connect to Lisa through the login node (see next section). This is an interactive node: similar to your own PC, it immediately responds to the commands you type. There are only a few login nodes on a cluster computer, and you only use them for light tasks: writing and compiling software, preparing your input data, writing job scripts (we'll get to that) etc. Since the login nodes are only meant for light tasks, many users can be on the same login node at the same time.
Your 'big' calculations will be done on the batch nodes. These perform what is known as batch jobs. A batch job is essentially a recipe of commands (put together in a job script) that you want the computer to execute. Calculations on the batch nodes are not performed right away. Instead, you submit your job script to the job queue. As soon as sufficient resources (i.e. batch nodes) are available for your job, the system will take your job from the queue, and send it to the batch nodes for execution.
In the sections below, we explain how you can log in to the login nodes of Lisa, execute some basic Linux commands there (interactively), write a job script, submit it to the job queue, and finally inspect your output once your job is finished.
To connect to Lisa, you need an application called a terminal. Linux and Mac systems come with a terminal application (for Linux: Applications => Accessories => Terminal; for Mac: Applications => Utilities => Terminal). On Windows, you will need to install a terminal application (such as PuTTY or MobaXterm) yourself.
For Linux/Mac: you can login by typing
ssh <username>@lisa.surfsara.nl
where username is the username we provided you with.
For Windows: find a tab 'Session' (or similar) and put lisa.surfsara.nl
in the host field. If there is a user or login field, you can put in your Lisa username there. As port number, enter 22 (this is generally the default if left empty). Finally, click Open/Connect (or something similar) to start a connection to Lisa.
The GPU nodes have their own login node. Users who have access to the GPU partition can login here directly by replacing the hostname lisa.surfsara.nl
by login-gpu.lisa.surfsara.nl
in the instructions above.
If all goes well, you should now be logged into the system. The first time you log in, SSH will show you the RSA fingerprint for Lisa and ask you if you want to continue (you can verify the fingerprint here). Enter yes to continue.
For more extensive login options, see the User manual.
Most PCs have a graphical user interface: you interact with your computer by clicking on files, applications, etc. Lisa is a Unix/Linux system, and you interact with Lisa primarily through a command line interface. Try a couple of basic commands:
- who: shows you the list of users that are currently logged in on the same node.
- date: shows you the current date and time.
- top: shows you the list of processes currently running on the node, including how much resource (CPU, memory, etc.) they use. Press q to return to the command line again.
- ls: shows the current files in your home directory (if you just logged in for the first time, it may well be empty).
- mkdir mydir: creates a directory with the name 'mydir'.
- cd mydir: changes to the directory 'mydir'.
- logout: logs you out of the Lisa system.

If the system is running a command that you want to interrupt, you can always use ctrl+c (try it with the top command).
If you have no experience using Unix or Linux systems, we suggest you read our Unix tutorial for Lisa. A more extensive (general) Unix tutorial can be found here. Note that some examples in the second tutorial (especially those about variables) are aimed at a different shell (csh) than the default shell on Lisa, meaning that commands are slightly different.
There are two ways to transfer files between Lisa and your PC: via the scp command in the terminal, or using an SFTP file browser.
If you open a terminal on your local machine (i.e. without logging in to Lisa), you can use scp to copy files. The syntax is
scp [source] [destination]
For example, to copy my_file from the current directory on your local machine to the directory destinationdir on Lisa:
scp my_file <username>@lisa.surfsara.nl:destinationdir
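To copy in the other direction, from Lisa to your local machine, you simply swap source and destination. For example (the file and directory names are just illustrations), running the following on your local machine copies my_file from destinationdir on Lisa to your current local directory; add the -r flag to copy an entire directory:
scp <username>@lisa.surfsara.nl:destinationdir/my_file .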
Alternatively, you can install and use an SFTP browser such as Filezilla (available for Windows, MacOS and Linux), Cyberduck (Windows, MacOS) or WinSCP (Windows). MobaXterm (Windows) also has an integrated SFTP browser. To connect, you need to set the hostname (lisa.surfsara.nl), connection protocol (SCP, SSH2, SFTP or similar) and port number (22) in your SFTP client.
For more info, see the User manual.
There are several file systems accessible on Lisa.
File system | Quota | Speed | Shared between nodes | Expiration | Backup |
---|---|---|---|---|---|
Home | 200 GB | Normal | Yes | None | Nightly incremental |
Scratch | N.A. (size see here) | Fast | No | End of job | No |
Scratch-shared | N.A. (size 3TB) | Normal | Yes | At most 14 days | No |
Projects | Varies per project | Normal | Yes | Project duration | No |
Archive | N.A. | Very slow | Only available on login nodes | None | Nightly |
When you log in on Lisa, by default you're on the home file system. This is the regular file system where you can store your job scripts, datasets, etc. In addition, we have various other file systems for different purposes. You can always access the home file system through the $HOME environment variable; for example, ls $HOME lists all the files and folders in your home directory. You can check how much free space you still have using the quota command.
Scratch is a local disk in each machine. As such, it is very fast. It is often recommended to copy e.g. your dataset to scratch at the start of your job, so that subsequent file reads are fast. It is also ideal for storing temporary files generated during your job. You may also store output files there, but scratch is wiped at the end of the job. Therefore, if you store output on scratch, make sure you copy your results from scratch to your home directory at the end of the job. You can access scratch through the "$TMPDIR" environment variable. For more information on efficient use of scratch, see this and this section of the User manual.
Scratch-shared is not local to the machine and therefore slower than scratch, but it can be convenient if you need temporary space to store data that all the nodes in your job need to be able to access. It is accessible at /nfs/scratch.
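As a minimal sketch (assuming you create your own subdirectory there; naming it after your username is just an example), you could stage a shared dataset on scratch-shared before submitting a multi-node job:
mkdir -p /nfs/scratch/$USER
cp -r $HOME/input_dir /nfs/scratch/$USER/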
The project file system can be used for storing larger datasets and for sharing data with the other members of your project. It can be accessed at /project/<username> or /project/<projectname>. You can check how much free space you have using df -h /project/[<username|projectname>]. To share files on the project file system, you need to make sure to write files with the correct file permissions. See the corresponding section in the User manual on how to do that.
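As an illustration only (the directory name is hypothetical, and the exact group setup for your project is described in the User manual), making a directory and its contents readable for your project group could look like this:
chmod -R g+rX /project/<projectname>/shared_data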
The archive file system is intended for long-term storage of large amounts of data. Most of this data is (eventually) stored on tape, and therefore accessing it may take a while. The archive is accessible at /archive/<username>. The archive system is designed to handle only large files efficiently. If you want to archive many smaller files, please compress them into a single tar file first, before copying it to the archive. Never store a large number of small files on the archive: they may be scattered across different tapes, and retrieving all those files at a later stage would put a large load on the archive. See this section of the User manual for more information on using the archive appropriately.
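For example (with hypothetical file and directory names), to bundle a directory of small files into a single compressed tar file and copy it to the archive:
tar -czf my_results.tar.gz my_results_dir
cp my_results.tar.gz /archive/<username>/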
A job script usually consists of the following parts (although in some cases you may not need 2, 3 and 5).
Your job script should start with the requirements of the job. All lines that start with #SBATCH will be interpreted by the batch system (SLURM) as job requirements. Most commonly, you specify the shell that should be used for interpreting your job script, the number of nodes, and the wall clock time for the job. For example, for a job that requires 1 node for 1.5 h, with a script that should be interpreted using the bash shell:
#!/bin/bash
#SBATCH -N 1
#SBATCH -t 01:30:00
The wall clock time is the time that the nodes remain allocated to you: if your job does not finish within the specified wall clock time, it will be terminated, so make sure you choose your wall clock time carefully. We suggest you choose it liberally based on your expected runtime, e.g. 1.5-2x higher.
Other requirements that you can set are e.g. the number of tasks per node that you want to start (for MPI jobs), the type of CPU that you require (Lisa is a heterogeneous system with 3 types of CPUs), the number of cores the CPU should have (currently, all CPUs in Lisa have 16 cores), and memory requirements (currently, there are nodes with 32 and 64 GB memory). For all options and how to use them, see the user guide.
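As an illustration, the header below additionally requests 16 tasks per node and a node with a given amount of memory, using the standard sbatch options --ntasks-per-node and --mem. The values are only examples; the exact options and node types available on Lisa are described in the user guide.
#!/bin/bash
#SBATCH -N 1
#SBATCH --ntasks-per-node=16
#SBATCH --mem=64G
#SBATCH -t 01:30:00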
To use software installed on Lisa, you generally have to load the corresponding module first. You can use the module avail command to see a list of all software modules on Lisa. Additionally, you can generally find the names of the modules on the corresponding software page.
To load a module, e.g. Python, you use
module load Python
This loads the default version of Python. To load a specific version, use e.g.
module load Python/3.5.0
More information on using modules can be found here.
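For example, to check which modules you currently have loaded and to unload a module again (standard module commands; the Python module is just the example used above):
module list
module unload Python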
The $TMPDIR environment variable points to a temporary folder on the local scratch disk. You can use this to copy input data to scratch, so that any subsequent file reads by your program are much quicker than when they are done from your home directory.
To copy a folder with input files to scratch for a single node job, use e.g.
cp -r $HOME/input_dir "$TMPDIR"
To copy a folder with input files to each of the scratch disks in a multi-node job, you can use the mpicopy tool.
module load mpicopy
module load openmpi
mpicopy $HOME/input_dir "$TMPDIR"
You may at this point also want to create a folder on the scratch disk to store output
mkdir "$TMPDIR"/output_dir
Now, you invoke the program that you actually want to run. For example, you have a python program 'my_program.py' that takes the input and output directories as arguments:
python my_program.py "$TMPDIR"/input_dir "$TMPDIR"/output_dir
Finally, you copy the results from the scratch disk back to your home folder
cp -r "$TMPDIR"/output_dir $HOME
WARNING: after your job finishes, the "$TMPDIR" on the scratch disk is removed. Don't forget to copy your results back to your home, or they will be lost!
#!/bin/bash
#Set job requirements
#SBATCH -N 1 --ntasks-per-node=16
#SBATCH -t 01:30:00

#Loading modules
module load python

#Copy input data to scratch and create output directory
cp -r $HOME/input_dir "$TMPDIR"
mkdir "$TMPDIR"/output_dir

#Run program
python my_program.py "$TMPDIR"/input_dir "$TMPDIR"/output_dir

#Copy output data from scratch to home
cp -r "$TMPDIR"/output_dir $HOME
Note that while this script illustrates the most common elements of a job script, it is probably not efficient: if the Python program only runs on a single core, many cores are left idle, wasting resources.
To write efficient jobs, see the chapter on writing job scripts in the user manual.
To submit a job described in the job script 'my_job.sh', use
sbatch my_job.sh
Upon submission, the system will report the ID that has been assigned to the job.
Using squeue you can check the current job queue to see which jobs are running, which jobs are still in the queue, and which jobs are blocked. Running just squeue will report the state of the entire job queue, while running it with the -u $USER parameter will display only the status of your own jobs.
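For example, to list only your own jobs:
squeue -u $USER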
Finally, to cancel a job, you can use
scancel <jobid>
where <jobid> is the job ID reported by sbatch when submitting the job. If you have lost this ID, it is also shown in the squeue output.
More advanced queue operations can be found on the SLURM website.
To learn how to start an interactive job or an array job, get information on a specific job, or monitor the resource usage (e.g. CPU usage) of a running job, please read the corresponding section in the User guide.
In an interactive session, the terminal shows two types of output streams: the standard output and standard error streams. Regular output (e.g. the result of a calculation) that a program wants to show in the terminal is generally written to the standard output stream, while error output (e.g. your program reports it is missing an argument) is commonly written to the standard error stream.
By default, SLURM will redirect all output to a file called <jobid>.out, which will be placed in the directory from where you submitted your job. This file contains both the standard output and standard error streams, combined. If you want to send these two streams to different files, you can achieve this by setting certain sbatch parameters; this is described in the official sbatch documentation. You can read these files in a text editor (or using cat <filename>) to see the output and errors produced by all the commands in your job script.
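As an illustration (the file names are arbitrary examples), the standard sbatch options --output and --error send the two streams to separate files; %j is replaced by the job ID:
#SBATCH --output=my_job_%j.out
#SBATCH --error=my_job_%j.err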
When you run a job on Lisa, the costs of this job are deducted from your budget. Whether this has consequences for you depends on your affiliation.
On Lisa, we have accounts and we have logins. A budget is tied to an account, and an account can have multiple logins (e.g. a senior researcher may have an account, and one or more of their PhD students have logins under this account). Thus, multiple logins may share the same budget.
The unit of accounting is the SBU (System Billing Unit). The cost of a job is equal to the number of wall clock hours the job actually takes (not the wall clock time estimated in the job script), multiplied by the number of cores that are used. For example, suppose you submitted a job script, reserving 6 nodes of 16 cores each, with an estimated wall clock time of 4 h, and your job finishes after 2 h and 30 minutes. The cost of your job will then be
6 x 16 x 2.5 = 240 SBU
Since accounting is done on the actual runtime, and not on the estimated wall clock time you put in the job script, there is no penalty in specifying a liberal wall clock time in your job script (other than that your job may start later, because it is more difficult to fit in the schedule).
The system is set up so that a job has exclusive access to the nodes. That means that you will be charged for all cores in the nodes that you reserved, even if your application uses just a single core.
You can check your account info and budget using the commands accinfo and accuse.
Contact us at helpdesk@surfsara.nl