PBS Python

This section describes a way to do parallel computing on a cluster using Parallel Python (Python 2.6) and PBS.

Parallel Computing on your computer

Parallel computing is a way to use simultaneously several cpus. There are different means to do it in Python, but the easier, as for me, is to use the module PP written by Vitalii Vanovschi. Here is an example of how to use it after installing it.

# Modules
import pp
import numpy as np

# Main object in order to use PP
job_server = pp.Server()

# Define a worker, ie the function that will be sent to do a "job"
def worker (arg):
    return np.sqrt(arg + 1)

# Send the jobs
jobs = []
for arg in range(10):
    jobs.append(job_server.submit(func=worker, args=(arg,), modules=('numpy as np',)))

# Retrieve the jobs
res = []
for job in jobs:
    res.append(job())

# Free the memory
job_server.destroy()

# Print the results
print res

In this example, the argument is quite simple. Yet, you may tackle more complex arguments like classes. In this case, the “unpickle problem” can occur. It means that PP can not access some objects because there were not copied in the memory space allocated to the job. A simple solution is to pass the object in question as an explicit argument in the “args” tuple. It is also important to check modules that are used and to indicate them to the submit method.

Using a cluster

A cluster is a set of computers connected to each other on a local network, to which you can share out the jobs. Suppose you have ten computers with 24 cpus each, then you can simulate a single 240 cpus computer. Generally speaking, a cluster is made up of a head (computer which is connected to the outside world and for instance reachable thanks to a ssh protocol) and of many nodes (computers that do the computations). The idea is the to launch a Python script on the head, that will send a many jobs to the nodes.

For this purpose, as you may not be the administrator of the cluster, you will probably need to install locally (means in your home) PP. If the nodes set up is not consistent with the head, you may also need to install locally frequently-used modules like numpy and scipy. In this case, you can use the command EXPORT PYTHONPATH=%PYTHONPATH:<my-homepath-to-pp>:<my-homepath-to-another-module> in your home/.bashrc in order to indicate to Python to look for modules in your home. If not, Python will not be able to load the locally-installed modules passed to the submit method.

As for me, the easier way to init a PP server is to use the auto-discovery function. For this purpose, you have to launch on each node:

python <path-to-pp>/ppserver.py -a &

Launching ppserver.py on each node is mandatory for the nodes to receive the jobs, while the -a option enable the auto-discovery mode. This makes it possible to init a PP server indicating to use all the available nodes (“*” argument) instead of listing the nodes IPs. Let us modify our example to explain it further.

# Script launched on the cluster head

# Modules
import pp
import numpy as np

# Main object in order to use PP
job_server = pp.Server(ppservers=("*",), ncpus=0)

# Define a worker, ie the function that will be sent to do a "job"
def worker (arg):
    return np.sqrt(arg + 1)

# Send the jobs
jobs = []
for arg in range(1000):
    jobs.append(job_server.submit(func=worker, args=(arg,), modules=('numpy as np',)))

# Retrieve the jobs
res = []
for job in jobs:
    res.append(job())

# Free the memory
job_server.destroy()

# Print the results
print res

The only thing that changed is the PP server initialization. First, we indicated to use available nodes (means nodes with ppserver.py -a launched). Secondly, we indicated to keep the head free of jobs. This is the role of “ncpus=0”. Indeed, the cluster head is aimed at managing the jobs, but not at doing hard computations. If not, the whole cluster may be slowed down.

Using PBS

PBS is a job scheduling software. It enables to share the computational resources of the cluster with people. In practice, you submit a task to PBS with a required amount of resources. Then, your task is queued until these resources are available. The three main command to use are:

  • qstat: gives information on the cluster (see the manual). For instance, qstat -f summarizes a lot of information.

  • showq: lists the tasks in the queue.

  • qsub: submit a task to the queue.

Here is an example of bash script to submit with qsub (qsub this_script.sh). This script defines the required resources, then launches Parallel Python on the assigned nodes (thus Parallel Python will use all the cpus of these nodes), runs the Python script and finally kills all the Parallel Python processes.

#!/bin/sh

# Job name
#PBS -N python_script1

# Queue (halfday, week or month)
#PBS -q week

# Nodes (nodes=[number of nodes]:ppn=[number of cpus]+nodes=... Maximum 4 tokens)
#PBS -l nodes=8:ppn=24

# Time of computation (cput=07:00:00 or walltime=07:00:00)
#PBS -l walltime=48:00:00

# Mails are sent when the job starts and when it terminates or aborts
#PBS -m bea

# Email address
#PBS -M my.mail@my.institute.com

# Standard output
#PBS -e python_job1.err
#PBS -o python_job1.log

# By default, PBS scripts execute in your home directory, not the
# directory from which they were submitted. The following line
# places you in the directory from which the job was submitted.
cd $PBS_O_WORKDIR

# Launch Parallel Python on the nodes
# a: enable auto-discovery service
# r: restart worker process after each task completion
# t: timeout to exit if no connections with clients exist
timeout=3600
mynodes=`cat $PBS_NODEFILE | uniq`
for node in $mynodes
do
    ssh -f $node python /home/maxime/pp-1.6.4/ppserver.py -r &
done

# Run the script (gives the nodes as argument)
python /home/maxime/script.py $mynodes

# Kill Parallel Python
for node in $mynodes
do
    ssh $node 'killall python -u maxime'
done

exit 0

With this bash example, you can explicitly indicate the nodes to use to Parallel Python: jobserver = pp.Server(ppservers=tuple(sys.argv1:), ncpus=0) instead of: jobserver = pp.Server(ppservers=(“*”,), ncpus=0). This is necessary when several Python tasks run simultaneously.

Oddities

There are some things I still do not really understand. The major one is the “to many open files error (24)”. It seems that PP does not really close all the sockets with the consequence that the error 24 can occur at the end of your script, when you try to save your results on the hard disk. This is quite annoying ! A simple solution is to use less nodes.

I would like to thanks Juan-Carlos, Cedric and Olivier for guiding me to the cluster usage.