-
Notifications
You must be signed in to change notification settings - Fork 2
Expand file tree
/
Copy pathrcc.Rmd
More file actions
707 lines (456 loc) · 26.4 KB
/
rcc.Rmd
File metadata and controls
707 lines (456 loc) · 26.4 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
# High Performance Computing at UQ
```{r, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(htmltools)
```
## UQ's Research Computing Centre ##
The Research Computing Centre (RCC) provides coordinated management and support of The University of Queensland’s sustained and substantial investment in eResearch. The RCC helps UQ researchers across disciplines make the most of the University’s eResearch technologies, such as High Performance Computing (HPC), data storage, data management, visualisation, workflow and videoconferencing.
Specifically for the MME, the RCC coordinates access and support for 3 HPCs and the Research Data Manager (RDM).
### HPC access ###
Go to https://www.qriscloud.org.au/ and register for an account. Try to log in with your staff account, which may require logging out of any UQ websites. This step is automated, and you will be granted access immediately.
You must apply separately for each HPC server you want to access, providing a "technical justification". You apply at https://www.qriscloud.org.au/.
Tinaroo does not appear on the QRIScompute request page, go [here](https://services.qriscloud.org.au/services/request/new/151270360cb54d0783bffd482b4651d2).
The 3 HPCs under the RCC banner are:
**FlashLite**
The technical justification for FlashLite is:
1. You need to read and write vast amounts of data, such as CMIP6 data.
2. You keep running out of memory on Tinaroo or Awoonga
3. David Abramson has asked us to demonstrate the BeeGFS filesystem caching enhancements
An example submission made by Phil Dyer. Access was approved very quickly.
> MathMarEcol Bioregionalisation on FlashLite
>
> My lab is working on large scale marine ecology and climate models. The input data is around 30-50TB. David Abramson has asked us to use FlashLite to demostrate the BeeGFS filesystem caching enhancements, which is only available on FlashLite.
**Tinaroo**
The technical justification for Tinaroo is:
1. Your code runs faster when you can use more nodes (MPI)
2. Need a GUI, such as with MATLAB
Using more nodes means your code runs across more than one computer. If you are just using multiple cores within your computer, then Awoonga is a better choice.
**Awoonga**
The technical justification for Awoonga is:
1. Your code is written to use multiple cores within your computer
2. You are doing embarrassingly parallel work or parameter sweeps, where each run is independent
The Zooplankton size spectrum modelling is embarrassingly parallel, because each grid cell can ignore surrounding grid cells. Computing GCMs would not be embarrassingly parallel, because you need to know the neighbouring grid cells to decide your next state, at every time step..
An example submission made by Phil Dyer, approved within 5 minutes.
> MathMarEcol bioregionalisation
>
> My code currently works in a shared memory mode within a single node. Most of the computations are embarrassingly parallel. I run parameter sweeps across different datasets as well, which can be run independently on separate nodes. I can make use of 20-40+ cores, and typically need about 5 to 10 GB per core within a run.
#### Network Access
You must be on the UQ intranet to access the servers. If you are connected to eduroam/uq cable/uq wifi, then you are already in the intranet.
Everywhere else you will need to install the UQ VPN software and log in. Follow the instructions at:
<https://my.uq.edu.au/information-and-services/information-technology/working-remotely/vpn-virtual-private-network>
#### The terminal
To access any of the HPC's, you will need a terminal program on your computer.
On mac, this is Terminal.
On windows, this is Command Prompt or Windows Powershell (not Windows Powershell ISE). If you don't have an up to date Windows 10 machine, install [PuTTy](https://www.chiark.greenend.org.uk/~sgtatham/putty/).
Some people like the Ubuntu app in windows 10, just remember the *mnt/c* syntax to get to the files on your computer.
### Logging in ###
Once you have your terminal open
```{sh, eval=FALSE}
ssh <hpc cluster login alias>
```
Your username and password are the same as your UQ login.
| Cluster | Login Nodes | Cluster Login Alias |
| --- | --- | --- |
| Awoonga | awoonga1.rcc.uq.edu.au | awoonga.rcc.uq.edu.au |
| FlashLite | flashlite1.rcc.uq.edu.au | flashlite.rcc.uq.edu.au |
| | flashlite2.rcc.uq.edu.au | |
| Tinaroo | tinaroo1.rcc.uq.edu.au | tinaroo.rcc.uq.edu.au |
| | tinaroo2.rcc.uq.edu.au | |
| Wiener | wiener.hpc.dc.uq.edu.au | wiener.hpc.dc.uq.edu.au |
### Submitting a Job ###
Add section on submitted a job via PBS to Awoonga/Flashlite
For the moment, you can get more information [here](http://www2.rcc.uq.edu.au/hpc/guides/index.html?default.html). NOTE: You need to VPN in if not on campus.
## SMP High Performance Computing ##
<http://research.smp.uq.edu.au/computing/getafix.html>
The School of Mathematics and Physics have their own HPC called Getafix. Getafix is a cluster and will assign your job to nodes using the SLURM program.
### Getafix Structure ###
Each computer in the cluster is called a node. Each node has it's own cpu's, memory and graphics card. Nodes all share a common network drive, so all your data is available on all the nodes. Most nodes in Getafix have about 24 cores, a few have more and a few have less. Each node also typically has about 250GB of memory.
One node is special, the login node. The login node works like any remote computer, you can run jobs on the login node, but out of respect for other users, you should only do setup operations and short tests on the login node.
### SLURM ###
SLURM is the program that balances all the jobs between users across the cluster. You specify what you want the cluster to do, and how much time, nodes, cpu's and memory you think you will need. SLURM then waits until the cluster has enough resources available to run your job and starts the job. If you try to use more resources than you asked for, your job will be killed. If you ask for lots of resources, it might take a while for enough nodes to be available, since smaller jobs will usually fill up the available resources as other jobs end.
### Other Resources ###
The cluster has access to shared resources, particularly shared storage and networking, so your data is available on all the nodes at the same location.
### Getting into Getafix
#### Request Access
Email ITS at help@its.uq.edu.au and request access to the SMP cluster which is called Getafix.
This can take a few days. Make sure you explain your connection to the School of Maths and Physics. For HDR students, you will often have both a staff and student account. Let ITS know which one you want to use to log in with.
#### Logging into Getafix
Once you have your terminal open and logged into the UQ VPN (if off campus), type:
```{r eval=FALSE, engine='bash'}
ssh <username>@getafix.smp.uq.edu.au
```
Where `<username>` is your staff or student login.
If a popup appears saying `"Authenticity can't be established"`. Say yes.
Enter your password.
If you see:
```{r, eval=FALSE, engine='bash'}
> ssh: Could not resolve hostname getafix.smp.uq.edu.au: This is usually
> a temporary error during hostname resolution and means that the local
> server did not receive a response from an authoritative server.
```
Then you either typed it incorrectly or are not on the UQ internal network, see *Network Access*.
If you really want R X11 plots to appear on your desktop, add `-X` to the login. This works on Mac and Linux. Windows may need an X11 server, which you can most easily get from the Ubuntu app. I strongly recommend
saving plots into files and copying them to your computer, X11 can be very slow to render.
```{r, eval=FALSE, engine='bash'}
ssh -X <username>@getafix.smp.uq.edu.au
```
#### Advanced
1. SSH config file
`\~/.ssh/config`:
```{r, eval=FALSE, engine='bash'}
Host my-getafix
ForwardX11Trusted yes #equivalent to adding -X
User uqpdyer
HostName 130.102.3.59
```
Now, you only have to type
```{r, eval=FALSE, engine='bash'}
ssh my-getafix
```
### Putting your data on Getafix
You need to have your data on Getafix. Where to put it?
```{r eval=FALSE}
ls /
```
#### Home
```{r eval=FALSE}
/home/username/
```
Which is often shortened to
```{r eval=FALSE}
~/
```
You have 40GB of data allowed in here. The home directory is often used for programs and configuration files. R libraries will install here by default, which is fine. Don't use the home directory for data sets, results and plots.
#### Data
```{r eval=FALSE}
/data/username/
```
You have 400GB of data allowed in here. Put your data sets and scripts in the `/data/username/` directory.
#### Data1, Data2 … Data6
Most of the time you use `/data/` but there are other data folders that
you can access on request.
```{r eval=FALSE}
/data1/username/
/data2/username/
```
`data1` and `data2` are older system drives that we are not using.
`data3` to `data6` are high speed access drives for very large file
reading and writing. These could be useful for work on GCM's.
### Sending your data to getafix
#### scp
scp is copy over ssh, it should work anywhere ssh works.
```{r, eval=FALSE, engine='bash'}
scp -r local_folder <username>@getafix.smp.uq.edu.au:/data/<username>/<folder>
```
#### WinSCP
A graphical interface for scp.
#### rsync
Rsync is smarter than scp, it only copies changes, so it can be faster
than scp.
usually installed on Linux, may not be available on Mac or Windows.
```{r, eval=FALSE, engine='bash'}
rsync local_folder <username>@getafix.smp.uq.edu.au:/data/<username>/<folder>
```
### Running jobs on Getafix
#### Modules
To allow different versions of software to be used by different users (like python 3.4 versus python 3.6), Getafix uses Linux modules. On Getafix, R is in a module, and won't work until you load the module. You
can use the default version of a module, or specify a particular version of a module if your job depends on it.
To see what modules are available, type:
```{r, eval=FALSE, engine='bash'}
module avail
```
When you see a module you want to use, load it with:
```{r, eval=FALSE, engine='bash'}
module load <module name>
```
I use the following modules:
```{r, eval=FALSE, engine='bash'}
module load R
module load pandoc #rendering rmarkdown into html and pdf
module load gdal #spatial packages like sf
module load geos #spatial packages like sf
module load proj #spatial packages like sf
```
#### R on Getafix
R on Getafix is pretty similar to R on your local computer, without RStudio.
#### Installing packages
Install packages to a local library in your home directory.
```{r, eval=FALSE, engine='bash'}
module load R
R
```
```{r, eval=FALSE}
install.packages("foreach")
install.packages("doParallel")
install.packages("sf")
```
Accept the default library location unless you have a reason not to.
If your package fails to install, check if there is a system package missing.
For example, to install `sf`, you need to load more modules:
```{r, eval=FALSE, engine='bash'}
module load R
module load gdal
module load geos
module load proj
R
```
Now `sf` should install.
```{r, eval=FALSE}
install.packages("sf")
```
If it still does not work, send an email to help@its.uq.edu.au, they have been helpful setting me up when something was missing or broken.
#### Parameters for Rmd files
You can pass parameters into Rmd files from the command line. You need to set up the \`params\` in the YAML header, see
<https://bookdown.org/yihui/rmarkdown/parameterized-reports.html>
Using Rscript, you can render a report using a specified set of parameters like this:
```{r, eval=FALSE, engine='bash'}
paramlist='list(nCores =24, gRange = 1:300 )'
Rscript -e 'knitr::opts_knit$set(progress = TRUE, verbose = TRUE); rmarkdown::render("/data/uqpdyer/demo.Rmd", params = '"${paramlist}"', output_format = "html_notebook", output_file = "outfile.html", output_dir = "/data/uqpdyer", envir = new.env())'
```
This is not as convenient as setting parameters in RStudio or within an R session. You have to be very careful about quotes, and the command line is hard to edit.
Sending parameters to an R script in bulk is much easier, but lacks the reproducability of Rmd.
##### Parameters for script files
Use the base `commandArgs` or the `optparse` package to use command line parameters in your script files
Then you can run the script from the command line like this:
```{r, eval=FALSE, engine='bash'}
Rscript script.R input1 input2
```
#### SLURM
So far, R has been running on the login node. Some things are easiest to do on the login node, like installing packages and transferring files. Computationally heavy tasks should be done on other nodes. To use the other nodes, you submit jobs with SLURM.
SLURM is a job scheduler. People request jobs by specifying the resources they need and the scripts to run, and SLURM queues the jobs until some nodes are available to run the job. You can also request an
interactive session, but it is more efficient to set up jobs and come back to get the results.
##### `sbatch`
The command `sbatch` is used to submit jobs, which are run when the resources become available.
You use SLURM by calling `sbatch`:
```{r, eval=FALSE, engine='bash'}
sbatch script.sh
```
where `script.sh` looks like
```{r, eval=FALSE, engine='bash'}
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=0-00:10
echo "hello world"
```
The `#SBATCH` lines at the top are parameters for the sbatch command, you can also include them in the command call:
```{r, eval=FALSE, engine='bash'}
sbatch script.sh --ntasks=1 --cpus-per-task=1 --mem-per-cpu=1G --time=0-00:10 script.sh
```
Then `script.sh` can simplify down to:
```{r, eval=FALSE, engine='bash'}
#!/bin/bash
echo "hello world"
```
When working with R, we run scripts or arbitrary code snippets with Rscript, but the `sbatch` command still expects Rscript to be wrapped in `script_for_r.sh`
``` shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=2G
#SBATCH --time=0-01:00
module load R
Rscript myRscript.R
```
So running `sbatch` looks the same:
```{r, eval=FALSE, engine='bash'}
sbatch script_for_r.sh
```
Here is a meaningful r script we can test out. Note that I have specified 20 cpu's in both `script_for_r.sh` and `myRscript.R`
```{r, eval=FALSE}
library(foreach)
library(doParallel)
cl <- makeCluster(2)
registerDoParallel(cl)
x <- iris[which(iris[,5] != "setosa"), c(1,5)]
trials <- 10000
ptime <- system.time({
r <- ~foreach~(icount(trials), .combine=cbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
ptime
stime <- system.time({
r <- ~foreach~(icount(trials), .combine=cbind) %do% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
stime
stopCluster(cl)
cl <- makeCluster(20)
registerDoParallel(cl)
ptime <- system.time({
r <- ~foreach~(icount(trials), .combine=cbind) %dopar% {
ind <- sample(100, 100, replace=TRUE)
result1 <- glm(x[ind,2]~x[ind,1], family=binomial(logit))
coefficients(result1)
}
})[3]
ptime
```
##### `salloc` and `srun`
If you really need to run something interactively, such as a Matlab GUI, make sure you log in with `-X` or `-Y` and then call:
``` 1
salloc --ntasks=1 --cpus-per-task=1 --mem-per-cpu=1G --time=0-00:10
srun --pty -x11 matlab
```
##### Many nodes
So far, we have used only 1 node. The memory and cpu's are all in the same node, and R behaves in exactly the same way as on your machine, only with more cores. You are clearly limited to a speed up of around 20 times with a single node.
To go beyond the single node limit, you have a few options.
1. Divide your job up
- If you can break your job up into many independent runs, then you can submit a job for each run, and collect all the different outputs after all the jobs have run.
- I have done this to test out the effects of changing combinations of parameters.
2. Rslurm
- The Rslurm package will generate a set of scripts that submit a collection of jobs. Rslurm does the work of dividing up your job for you.
- Take care with this package, the default settings will allocate far too many jobs and overwhelm Getafix.
3. MPI
- Message Passing Interface (MPI) is a library that shares information across nodes.
- MPI requires extra setup in your code, and you need to use `mpirun` or `srun` inside the sbatch script
```{r, eval=FALSE, engine='bash'}
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=0-00:10
mpirun echo "running with MPI"
srun echo "run with SRUN"
```
doMPI and Rmpi are packages for working with MPI in R.
TODO: I have not figured out MPI on the cluster yet, it didn't "just work"
4. future.batchTools
- The future.batchtools package lets you submit SLURM jobs from R. The benefit of using future.batchtools is that you write R code using future syntax, then specify a SLURM plan, and all futures at a level specified by the plan are sent off as jobs. The script waits until the job finishes.
- I recommend running a single node task with 1 cpu and a modest amount of ram, but a long running time, then all the heavy lifting is done by new jobs submitted by your script.
- Take care with this package, I haven't checked the defaults, but it might request too many resources like Rslurm.
##### R with parameters
Being able to pass in parameters is useful for dividing your job up.
Some examples of sbatch scripts that use R parameters
```{r, eval=FALSE, engine='bash'}
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=7G
#SBATCH --time=1-22:00
module load R
Rscript script.R $1 $2
```
```{r, eval=FALSE, engine='bash'}
sbatch run_script.sh input1 input2
sbatch run_script.sh input1 input3
```
`$1` becomes the first argument, `input1` both times, and `$2` becomes
the second argument, `input2` in the first run and `input3` in the
second line.
Rendering an Rmd file with parameters:
```{r, eval=FALSE, engine='bash'}
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem-per-cpu=7G
#SBATCH --time=1-22:00
module load R
module load pandoc #essential for knitr and rmarkdown
paramlist='list(nCores =24, gRange = 1:300 )'
Rscript -e 'knitr::opts_knit$set(progress = TRUE, verbose = TRUE); rmarkdown::render("/data/uqpdyer/demo.Rmd", params = '"${paramlist}"', output_format = "html_notebook", output_file = "outfile.html", output_dir = "/data/uqpdyer", envir = new.env())'
```
If you want to generate a collection of rendered Rmd files, you have to
write up a script for each set of parameters.
The other way to generate a set of Rmd files with different parameters,
in parallel, is to copy the file and change the parameters in each file.
Then you can use job arrays by naming the files from 1 to n.
The simplest approach I can think of to run many different Rmd files is
to have a shell script, that calls an R script, that renders an Rmd.
Parameters passed to the shell script go all the way down to the Rmd
file.
```{r, eval=FALSE}
args = commandArgs(trailingOnly = TRUE)
knitr::opts_knit$set(progress = TRUE, verbose = TRUE)
paramlist=list(nCores =as.numeric(args[1]), gRange = as.numeric(args[2]):as.numeric(args[3]))
rmarkdown::render("/data/uqpdyer/demo.Rmd", params = paramlist, output_format = "html_notebook",
output_file = "outfile.html", output_dir = "/data/uqpdyer", envir = new.env())
```
### Debugging Rmd files
Debugging regular R isn't great, but debugging Rmd files non-interactively is even worse.
Here are some tricks:
#### Cache on
Turn on caching for your Rmd file.
```{r caching_on}
knitr::opts_chunk$set(cache=TRUE)
```
#### Load in cache dir
The qwraps2 package on github has a function, `lazyload_cache`, which will pull in the saved cache so you can interact with it in RStudio.
<https://github.com/dewittpe/qwraps2/blob/master/R/lazyload_cache.R>
Once you call `lazyload_cache()` on the cache generated by knitr, your environment has all the variables seen up until a failure point.
## General HPC Guidance ##
### Getting Results back
The batch jobs are not interactive. You set them up to run and come back for the results. Usually you save a file to disk.
All printed output goes to a file `slurm-<job number>.out`.
For example, in SLURM you can change the name with
```{r, eval=FALSE, engine='bash'}
#!/bin/bash
#SBATCH -o output_file_%A_%a.out #%A is the job number, %a is the array number
#SBATCH -e error_file.err
```
### Data.frames
If you have R objects you want to see, save them to disk.
- saveRDS
R binary format, single object, you specify the name when you readRDS
- fwrite
data.table csv writer, fast, and gives you a csv
- feather
a new binary format backed by Hadley, fast, and compatible with
python
- fst
another new binary format, extremely fast and you can read in
subsets of the data.frame
- Archivist
puts everything into a database and gives a unique hash to every
object. Good for reproducibility, you can't mix up objects from
different runs later on because the hash always changes
### Plots
Always save plots to disk, unless you are using Rmd files, which will
put the plot into the output html anyway.
Use `ggsave()` for ggplots, and `dev.png()`, your plotting code, `dev.off()` for other kinds of R plots.
It is a good idea to set the width, height and dpi for all your plots.
### Playing nice
The cluster is a shared resource.
Try to be as precise as possible when allocating resources, so others can use what you don't need.
Be careful with any package that submits jobs for you (`Rslurm`,`future.batchtools`). Check the jobs are requesting a reasonable set of resources.
### Job Submission Tips ###
If you have trouble getting access to the resources you want, consider if you can split up your jobs into smaller parts. The queue system (PBSpro and SLURM) will normally favour smaller, shorter jobs. If you request a single job which needs, for example, 24 cores, then you will have a long (days) delay to start, because the HPC needs to find 24 cores on the same node, which may require existing jobs to finish. While your job is queued, other smaller jobs will jump in front of you and get time. If the cores don't need to talk to each other, consider submitting the job as an array job, with 24 individual runs of 1 core each.
### What HPC should I use? ###
The MME lab has users across all HPCs (Awoonga-Phil, Flashlite-Isaac, Tinarro-Patrick, Getafix-Patrick). Our staff/students will continue to use the system which works best for our individual needs.
Our experience is that getting time on Getafix is the easiest - apart from the end of session when a lot of undergraduates are completing assignments. However, Getafix is a bit more of a stand-alone cluster, which is run by SMP and does not have fast write access to the UQ Research Data Manager (RDM). The RCC HPC's all have fast write access to the RDM.
As it is a smaller community of users, the staff in charge of Getafix seem to be more flexible in terms of installing new modules if you request. Because the RCC is responsible for the HPC support for the rest of UQ, they need more rigorous protocols to install new modules but they will do so if asked.
### Parallel coding in R
#### Add a section on parallelising in SLURM/PBSpro
Include something on Epilogue and Prologue
And pro/cons of parallelising in/outside of R
#### parlapply, par*apply etc.
The parallel package gives you par\* versions of the apply family of functions.
Easy to drop in if you usually use the apply family from base R, you replace any `apply`, `lapply`, `sapply` with `parapply`, `parlapply`, `parsapply`.
#### foreach and %dopar%
The `foreach` package has a few uses.
First, it changes how you use `for` loops in R. Base R `for` loops usually work by changing a variable defined outside the loop. `foreach` is more like an `apply`, where a result is returned, so `foreach` works without needing to use side effects.
Second, because `foreach` can work without side effects, it can run the loop in parallel easily. There are a number of backends you can use, including `doParallel`, `doMPI`, `doFuture`. You can quickly change how parallel works by replacing `%do%` with `%dopar%` and changing the registerDo\* function at the start of your script.
If you already like using `foreach` over the apply family, moving to parallel is easy. The main downside is that `foreach` is not particularly efficient with resources. If you are mostly using the apply family, switching to `foreach` is much harder.
#### `future`
The `future` package is built around the idea of a promise. The promise is "Eventually the variable will have a result assigned to it, but not straight away". You replace `<-` with `%<-%` and the line of code becomes a promise, rather than something to do right away.
Generally, you don't need to change your code much, just add `%` around your `<-` at strategic points, but you can gain a lot of flexibility, and there are many backends.
`future.apply` gives you parallel versions of `apply`. `doFuture` mixes `future` and `foreach`.
`future.batchtools` spawns jobs on the cluster that return the results into your session. `future.callr` does local parallel calculations in separate R processes. `future` alone will do local multiprocessor calculations and use multiple nodes.
#### General Coding habits for going parallel
1. Put things into lists
- It is easy to loop over lists, and you can quickly parallelise the processing of a list.
2. Don't do graphics in parallel
- Generally speaking, you can specify ggplot objects in parallel, but if you try to print or save them in parallel, R crashes.
- You probably have to work very carefully to open multiple `dev.png()` for plotting base R plots.
3. Make functions small
- If you repeat something frequently, put it into a function. You can then put functions into packages. Try to keep functions small, then they are easier to use in parallel. Generally don't try to work in parallel within a function.
- If you want a function to use parallel processing, I strongly recommend you use either `future` or `foreach` with `%dopar%`. By default, they both drop back to sequential processing, but the user can set up parallel processing before calling the function to get speed improvements.
4. Look for loops, but don't use R base `for` loops
- Base R `for` loops cannot be used in parallel, because they rely on side effects.