The Office of Advanced Research Computing (OARC) at Rutgers provides a range of computing resources and training services to support research computing across the university. Amarel is OARC’s high-performance computing environment. It can be accessed from any personal computer, and it offers Rutgers scholars additional resources for pursuing computational projects.

Motivations for Use

Why might a humanist use, let alone need, high performance computing? One might assume the answer has something to do with “big data,” but this phrase can be misleading given the many different methods and data formats found in the digital humanities. A more useful question to ask is: when might a humanist’s needs outstrip their personal computer’s processing power?

Here are a few general types of scenarios in which you might find yourself needing something that neither your personal computer nor even cloud-based programming environments can provide:

  • Working with data in the form of many separate files
    • Example: You have 200,000 25KB text files, each representing a single page of a book or newspaper.
    • The Problem: Reading a large number of files into memory sequentially can bottleneck quickly, even when each file is very small on its own, due to hardware limitations on personal computers (even when the data itself sits on an efficient cloud storage drive). Running code in parallel – that is, dividing tasks among multiple processor cores simultaneously – can speed up independent tasks considerably, but most personal computers only have four cores.
    • Solution: Amarel infrastructure facilitates fast data transfer, and sessions can deploy with many cores to run repetitive tasks in parallel (see the sketch following this list). It is possible to allocate resources on Amarel that would cost several dollars an hour through other high-performance computing environments like AWS or Azure.
  • Working with data in the form of extremely large individual files
    • Example: You have several multi-million-row spreadsheets, each row representing the metadata for a single tweet, transaction, or network relation.
    • The Problem: Large individual files can exceed RAM capacity (4-8GB on most personal computers) or consume enough of it that the operations you wish to carry out time out.
    • Solution: In addition to the above benefits, Amarel sessions can deploy with much more capacious memory allotments.
  • Code that needs a stable environment to run for an extended length of time
    • Example: You are running a modeling algorithm several times, and each run will take several hours.
    • The Problem: Even if you are able to leave a computer to run code on its own for an extended length of time, personal computers expend a lot of computational power on background tasks that we don’t necessarily see but that can nonetheless interrupt jobs (security software is perhaps the biggest culprit here).
    • Solution: An Amarel session can run for up to three days and save results to a persistent /home directory for later retrieval or additional work.
  • Testing setup code or system operations
    • Example: You are installing several packages and want to ensure they and their dependencies interact correctly without jumbling your own local environment.
    • The Problem: You have a lot of stuff on your computer and don’t want to risk mixing up the versions you need for some programs when installing others. Or perhaps you’re trying out a new workflow and want to see how it goes before committing to the change.
    • Solution: A Linux command-line terminal or Linux desktop can be used to install or run software. That said, for testing purposes Amarel can be a bit more limited than basic free-tier virtual computers on AWS or Azure, which can run other operating systems and are more flexible when it comes to installation.
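
To make the first scenario concrete, here’s a minimal R sketch of reading many small text files in parallel with the base parallel package; the “texts” directory and file pattern are illustrative assumptions, so substitute your own corpus location.

library(parallel)

# Hypothetical corpus location: a directory of many small .txt files
files <- list.files("texts", pattern = "\\.txt$", full.names = TRUE)

# Divide the reads among all but one of the available threads
clust <- makeCluster(detectCores() - 1)
pages <- parLapply(clust, files, function(f) {
  paste(readLines(f, warn = FALSE), collapse = "\n")
})
stopCluster(clust)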

Amarel Set-up and Launching a Session

The first step to using Amarel is creating an account at https://oarc.rutgers.edu/access/. It only takes a few minutes to fill out all the fields, but it can take a day or two for the account to be processed.

The command line is the most efficient and flexible way to use Amarel and clusters like it. While command line familiarity can be useful to humanists for several reasons, it does have its own learning curve. Conveniently, Amarel also offers a more familiar Graphical User Interface (GUI) that removes this hurdle by allowing users to launch sessions from their browsers – so long as you are using a Rutgers campus internet connection or a VPN (more on that below). OARC’s extensive guide has a lot more to say about these aspects and much more.

Visit the OnDemand GUI at this webpage. This interface for Amarel offers four main options at present: an RStudio Server for R, a MATLAB GUI, a Linux Desktop, and several Python Jupyter Notebooks. If you are programming in R or MATLAB, the corresponding options are your best bet. Unless you have particular needs, use Jupyter Notebook 3 for Python. For any other uses, select the Amarel Desktop (Linux) option.

Specify the resources you’d like to request for each session before launching. For general use, choose “main” as the Partition. If applicable, choose which version of the program you wish to run. Set the number of hours the session will run, the number of cores, and the amount of memory (unless these are set to defaults, as with RStudio Servers). If you’re not sure what resources you’ll need, 1 hour, 4 cores, and 16GB of memory should suffice as a default.

When requesting resources for a session, there are some important community-oriented guidelines to follow whether you are using Amarel or any other shared computing environment.
– Only ask for what you need. Request the minimum necessary memory and processing for a session, and don’t request more time than you will need. This will also expedite your session’s start: comparatively low-resource allocations will take less than a minute.
– Always look before you leap. Test small examples to make sure code runs correctly and to estimate appropriate memory, processing, and time resources, especially if you intend to submit more than one job.

Clicking the “Launch” button will initiate the request; once the session is created, click the “Connect” button to enter your session in a new browser tab.

Sessions will terminate when their allotted time expires, so make sure to test first, allocate accordingly, and write data to your /home drive as necessary. You can always see how much time is left (or reconnect if you accidentally close your session’s browser tab) from the OnDemand console under “My Interactive Sessions.” You can also delete a session from the dashboard should it prove insufficient or should its tasks finish early; upon doing so, its resources will be released for other sessions.

Session Basics

The /home drive, your default storage location, provides 100GB of storage that is accessible to your user account across every type of session on Amarel. A temporary /scratch drive is available as well with much more storage space. If you’re dealing with big data, working in /scratch and moving results to /home upon completion will yield the best performance.
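
Here’s a minimal R sketch of that workflow; the project paths and the results object are illustrative assumptions.

# Hypothetical paths: do heavy reading and writing in /scratch,
# then copy the final output to persistent /home storage
scratch_dir <- "/scratch/<YOUR_RUID>/myproject"
dir.create(scratch_dir, recursive = TRUE, showWarnings = FALSE)

results <- data.frame(term = c("roosevelt", "taft"), count = c(412, 97))  # placeholder results
write.csv(results, file.path(scratch_dir, "results.csv"), row.names = FALSE)

file.copy(file.path(scratch_dir, "results.csv"),
          "/home/<YOUR_RUID>/results.csv", overwrite = TRUE)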

Moving data into Amarel storage can be done with the usual import features in RStudio Servers or Jupyter Notebooks. In RStudio, for example, simply click the “Upload” button in the “Files” tab in the lower-right quadrant to upload local files or zipped directories. To download output files, select the files in question, click the “More” button near “Upload,” and then click “Export.” You can also obtain repositories from GitHub, as the last section of this page demonstrates.

In desktop sessions, files can be downloaded and uploaded from/to cloud storage options like Box or Google Drive through a browser within the session. For very large numbers of data files, programmatic access via paid cloud services like AWS or Azure is preferable both as a means of access and as a means of storage.

Additional packages can be installed from the various language servers as usual. Note, however, that some packages with dependencies in other languages – such as the rJava package in R – may be unable to load. In an Amarel Linux Desktop, you’ll need to download software as zip files or build from source rather than install via package management systems.1
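
For instance, installing a CRAN package from an RStudio Server session works just as it would locally (doParallel here is only an example):

# Installs to a per-user library in your /home directory by default
install.packages("doParallel")
library(doParallel)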

From here, parallelization of code proceeds as it would for whichever language you’re using on any local computer.

Off-Campus Access

As noted above, to access Amarel you’ll need either a Rutgers campus internet connection or a VPN (Virtual Private Network). There are plenty of practical, not to mention personal, reasons why being tethered to campus might be a limitation. A VPN allows you to access a local network (e.g., the Rutgers campus network) securely from your home wifi. Like most universities, Rutgers has its own somewhat tedious two-factor VPN system that involves 1) setting up a mobile phone app for identity verification, 2) activating the VPN service, and 3) setting up a VPN client on your computer.

The VPN allows remote computers to access resources that are typically only available on the campus network by assigning them a Rutgers IP. This allows VPN users to access licensed library resources directly as if they were on campus, without going through the library website or logging into the proxy.

First, you’ll need to download Duo Mobile on your mobile device. Once this is done, proceed to Rutgers’ authentication setup portal; the process is fairly straightforward, but OIT has put together a very clear guide as well. The one step that might be confusing to a user unfamiliar with this process is that after adding your device you should change the “Ask me to choose an authentication method” drop-down menu selection to “Automatically send this device a Duo Push.”

Second, you’ll need to activate the VPN service for your particular RUID account on the services management page: just check the box next to “Remote Access VPN” and then click “Activate Services” below.

Finally, you can install the VPN client itself, Cisco AnyConnect. Log in, and the page will notify you that it is sending a Duo push notification; on your mobile device, tap “Approve.” The next window will contain a large blank space with a “Start AnyConnect” link. Click it to generate a blue download button for whichever operating system your computer runs. Open the downloaded installer (an .exe file on Windows), accept the license agreement, and install. OIT has instructions for these steps too if you need more detailed guidance.

Cisco AnyConnect should now show up either in your system tray – in the taskbar at the lower right in Windows or the menu bar at the upper right on a Mac – or in the Start Menu under “Recently Added.” AnyConnect will open as a tiny window in the corner of your screen. Type vpn.rutgers.edu into the text field and click “Connect.” You will be prompted to enter your user name (RUID), your password (your regular RUID password), and a “Second Password”: as the instructions here helpfully suggest, this should be “push” to send a Duo push notification to your phone.

And with that, you are finally connected! Once finished with your work, open AnyConnect again and click “Disconnect” to return to your regular wifi network. From here on out, all that’s necessary to connect is opening AnyConnect in your system tray (even if you had to use the Start Menu the first time), clicking “Connect,” entering your password, and approving a mobile Duo push.

An RStudio Server Example

This workflow is a bit different from a personal computer’s, so here’s a brief test case. Let’s say you want to run a code file stored in a Git repository rather than uploading a file of your own. Installing programs like Git can be a bit involved in a shared computing environment, but there’s a shortcut.

RStudio has a Terminal tab in the lower left in addition to the usual Console tab. Switch to Terminal, then enter the following line to download the repository for the Text Analysis with Newspapers workshop.

wget https://github.com/azleslie/ChronAmQuant/archive/master.zip

For other repositories, change the user name (azleslie) and repository name (ChronAmQuant) as necessary; the other parts of the URL should remain consistent.

Unzip the file we just downloaded; its contents will take the form of a <REPOSITORY_NAME>-master directory (here, ChronAmQuant-master).

unzip master.zip

And delete the zip when you’re done:

rm master.zip
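
If you’d rather not switch to the Terminal at all, base R can perform the same three steps from the Console; this is simply an equivalent of the commands above.

# Download, unzip, and remove the archive entirely from the R console
download.file("https://github.com/azleslie/ChronAmQuant/archive/master.zip",
              destfile = "master.zip")
unzip("master.zip")
file.remove("master.zip")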

Now switch back to the Console tab and change working directories into the repository we just downloaded.

setwd("/home/<YOUR_RUID>/ChronAmQuant-master")

In the Files tab in the lower-right quadrant of the RStudio Server you will now find a ChronAmQuant-master directory; click on it to open it and click on the file named “ChronAm Workshop 2 for Users.Rmd” to open it in the script editor.

How much faster is code run in parallel in a dedicated computing environment? This workshop provides a brief test case using the R parallel package. First run the initial setup code chunk. Then read in the following sample data from the workshop:

hits <- read.csv("Sample Data/theodore_roosevelt_sn85035720.csv")

If you scroll down to line 286, you’ll see a function using the R parallel package to calculate whether any of the 160 character strings in this data frame are approximate matches of each other. Alternatively, you can copy it from here.

library(parallel)  # if not already loaded by the setup chunk

unique_par <- function(input) {
  core_num <- detectCores() - 1  # use all but one available thread
  clust <- makeCluster(core_num, outfile = "")
  # Make the input data frame available to every worker
  clusterExport(clust, varlist = c("input"), envir = environment())
  # For each string, return "No" if any other string falls within an edit
  # distance of 80 (i.e., it has an approximate match), otherwise "Yes"
  result <- parLapply(clust, seq_along(input$Collocates), function(x) {
    if (length(which(adist(input$Collocates[x], input$Collocates) < 80)) > 1) {
      "No"
    } else {
      "Yes"
    }
  })
  stopCluster(clust)
  return(result)
}

If you are new to parallelizing code in R – something of an art form in itself – let me just make a few notes in passing.2 detectCores identifies the number of available logical processors or threads (usually 2x the number of requested cores), makeCluster defines the number of threads to use in the cluster for parallelization, and clusterExport makes particular objects in memory available to that cluster; these are necessary setup steps. parLapply is the parallelized version of lapply, with the one difference that it requires a cluster as its first argument. And don’t forget to stopCluster when finished.

Now run this function wrapped in a system.time function to see how long R takes to execute it:

system.time(hits$Unique <- unique_par(hits))

This should return around 5.9 seconds on 16 threads. Because this sample data is so small, it would actually be faster to run this with fewer threads – the overhead of managing all the processes inflates the time a bit. But with larger datasets, as the overhead cost begins to be dwarfed by the multiplied productivity of each additional process, you would want to remove the -1 from the core_num assignment (designed to protect personal computers from overextending themselves) to use the full resources of your session.
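
That change is a single line in the function above:

core_num <- detectCores()  # use every available thread in the session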

By comparison, here’s the non-parallel version of the same code, again wrapped in system.time.

# The same matching logic on a single thread with lapply
system.time(
  hits$Unique <- lapply(seq_along(hits$Collocates), function(x) {
    if (length(which(adist(hits$Collocates[x], hits$Collocates) < 80)) > 1) {
      "No"
    } else {
      "Yes"
    }
  })
)

Despite the disproportionate overhead cost of parallelizing code on such a small sample dataset, the non-parallel code will take a little over 11 seconds: almost twice as long. You’ll never be 16x as fast, but the larger the dataset, the greater the multiplier. For this example, when the size of the dataset is doubled, the parallelized code is 3x faster; when the dataset is quadrupled, the parallelized code is 3.6x faster. Now, if the dataset were 10x as large – 1,600 observations, by no means “big data” and yet enough for an operation like this to require several hours – Amarel would be even more valuable.


  1. See https://rutgers-oarc.github.io/training/guides/Cluster_User_Guide/#installing-your-own-software for instructions on how to install software dependencies. 
  2. Even in R, parallel is not the only option for executing code in parallel; for example, there is the combination of the doParallel and foreach packages. 
