Disclaimer: for this to work, you must be affiliated with ETH Zurich and have a nethz account.
So, you wanna play around a bit with machine learning. Or run a crazy physics simulation. But you only have a lame Macbook Air, and you wanna be able to sleep at night without being kept awake by its spinning fans. Solution? Just use ETH's Euler general-purpose supercomputer! Even though they write explicitly that Euler is “not a supercomputer”, supercomputer just sounds cooler than “general purpose” computer.
So-called shareholders (mostly lab groups and ETH departments) who invested money own a reserved percentage of Euler's computing power. However, there is also a slice reserved for students and everyone else. The best thing about it? It works without an application or any other bureaucracy. You can just log in and begin. Normally, your jobs will start pretty soon after you place them in the queue.
We will run an example from the scikit-learn website for demonstration. There will be some small changes, because we have no GUI on the cluster. But first, let’s log in. Important: you can only log in to Euler from within the ETH network or when connected via VPN.
ssh <your nethz-name>@euler.ethz.ch
You will be greeted with a disclaimer that you have to accept by typing Yes the first time. Then we can start by loading the python module.
module load python/2.7
# get sample script from scikit-learn
Now you’ll have to modify the script to make it run without X11. For this, we have to tell matplotlib to write to disk instead of trying to display the images directly. Modify the downloaded file at the top and the bottom to look like this:
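# at the top of the file,
# after the long introductory comment
from time import time
# those two lines must be inserted here, before pyplot is imported
# (Agg is a non-interactive backend that only writes files)
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np
# at the bottom of the file,
# replaced plt.show() by the following line
plt.savefig('plot.png')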
When you submit a job to the batch processing system, it inherits the current environment. As we already loaded the python module, we are now ready to go:
# submit job
bsub "python plot_image_denoising.py"
> Generic job.
> Job <9613719> is submitted to queue <normal.4h>.
# you can now check the status of your job
bjobs
> JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
> 9613719 muejo RUN normal.4h euler01 e1096 *oising.py Aug 25 00:34
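If you want to keep track of jobs from a script, the two output lines above can be picked apart with a few lines of Python. This is only a minimal sketch: parse_job_id and parse_bjobs_line are hypothetical helper names, not part of LSF, and the parsing assumes that the output looks exactly like the sample above, with all eight columns filled (a job that is still waiting may have an empty EXEC_HOST column, which would shift the fields).

```python
import re


def parse_job_id(bsub_output):
    """Return the job ID from a bsub confirmation line, or None if not found."""
    match = re.search(r"Job <(\d+)> is submitted to queue <[^>]+>", bsub_output)
    return int(match.group(1)) if match else None


def parse_bjobs_line(line):
    """Split one bjobs status line into a dict (assumes all 8 columns are filled)."""
    keys = ["jobid", "user", "stat", "queue",
            "from_host", "exec_host", "job_name", "submit_time"]
    # Split on whitespace at most 7 times, so the submit time
    # ("Aug 25 00:34") stays together as the last field.
    fields = line.split(None, 7)
    return dict(zip(keys, fields))


job_id = parse_job_id("Job <9613719> is submitted to queue <normal.4h>.")
status = parse_bjobs_line(
    "9613719 muejo RUN normal.4h euler01 e1096 *oising.py Aug 25 00:34")
print(job_id, status["stat"], status["queue"])
```

The text you feed in could come from subprocess.check_output, but that part is left out here because it only works on the cluster itself.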
After some time, a file called lsf.o<jobid> will appear in the directory, next to the plot.png our script generated. That’s it, we’re done!
- Your job will be killed after 1 hour. You can use the option -W hh:mm with bsub to let it run longer, but it will wait longer in the queue.
- The same holds for CPU cores (-n 2 uses 2 cores instead of 1) and memory (-R "rusage[mem=2048]" uses 2 GB per core). If I remember correctly, you can use up to 48 cores at the same time. If you submit two jobs requiring more than 48 cores in total, they will be run sequentially.
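The options above can be combined in one bsub call. If you submit many jobs, it may be handy to assemble the command in Python. The following is just a sketch: build_bsub_command is a hypothetical helper, and it only uses the three flags mentioned above (-W, -n and -R); the example values are made up for illustration.

```python
def build_bsub_command(command, wall_time=None, cores=None, mem_mb_per_core=None):
    """Assemble a bsub argument list (hypothetical helper, not part of LSF)."""
    args = ["bsub"]
    if wall_time is not None:
        args += ["-W", wall_time]                # run time limit as hh:mm
    if cores is not None:
        args += ["-n", str(cores)]               # number of CPU cores
    if mem_mb_per_core is not None:
        # memory per core in MB, e.g. rusage[mem=2048] for 2 GB
        args += ["-R", "rusage[mem={0}]".format(mem_mb_per_core)]
    args.append(command)                         # the job itself, as one string
    return args


cmd = build_bsub_command("python plot_image_denoising.py",
                         wall_time="04:00", cores=2, mem_mb_per_core=2048)
print(cmd)
```

On a login node, the resulting list could be handed to subprocess.call(cmd) to actually submit the job. Passing a list avoids having to quote the square brackets for the shell.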
Further information (only accessible from within ETH network):
If you need some special packages
You can install libraries locally in your home directory via pip.
mkdir -p $HOME/python/lib64/python2.7/site-packages
module load python/2.7
# now, install e.g. theano
python -m pip install --install-option="--prefix=$HOME/python" theano
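If Python does not find the freshly installed package, the local site-packages directory is probably not on the module search path. One way around this, sketched below, is to add it at the top of your script. This is an assumption on my side, not something from the Euler docs: add_local_site_packages is a hypothetical helper, and it assumes the packages ended up in the lib64 directory created above (depending on the package, pip may use lib instead).

```python
import os
import site
import sys


def add_local_site_packages(prefix=None, version="2.7"):
    """Put <prefix>/lib64/python<version>/site-packages on sys.path.

    Hypothetical helper; prefix defaults to ~/python, as used with pip above.
    Returns the directory that was added.
    """
    if prefix is None:
        prefix = os.path.join(os.path.expanduser("~"), "python")
    path = os.path.join(prefix, "lib64", "python" + version, "site-packages")
    # addsitedir appends the directory to sys.path and also
    # processes any .pth files it finds in there.
    site.addsitedir(path)
    return path


local_dir = add_local_site_packages()
print(local_dir)
```

Call it before importing the package in question, e.g. before import theano.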
Some packages depend on other Euler modules. For example, if you wanna use h5py, you have to load the hdf5 module before loading the python module:
module load hdf5
module load python/2.7.2
>>> import h5py
The sad thing is that only the Leonhard cluster, which is not publicly accessible, has GPU nodes… So if you need that, you will have to ask Cluster Support.
I am a student who does not know anything about scientific computing, let alone ETH's HPC infrastructure. This blog post just tries to provide some guidance for students who do not want to torture their Macbook too much. I do not provide any support, and I am not responsible if the admins get angry at you because you ran your stuff on a login node.
Find support here
For D-ITET students
There is an additional resource for D-ITET students. You can run jobs on the tardis machines (in the ETZ computer rooms) when they are idle. If you ever wondered, that is the reason for the “do not turn off” signs. The batch system is very similar to the one on Euler. Find more information in the D-ITET Computing Wiki.
This post was updated on Jan 26 2017 to reflect the decommissioning of the old Brutus cluster and its Wiki. The new Wiki is now to be found here (only from within the ETH network!).