Taught by Andrew Goldstone, Department of English, Rutgers University–New Brunswick

Wednesday, April 30, 2014
4:30 p.m.–6:30 p.m.
Alexander Library, Room 413
169 College Avenue, New Brunswick, NJ

With the increasing prominence of the digital humanities, humanists are once again asking themselves whether they can make use of the computer’s most fundamental capacity: its ability to count. This workshop introduces some of the methodological choices required for computational counting: what representations of data are suitable for machine processing? Once you have such a representation, how can you begin to analyze it? We will make these questions concrete through an introduction to R, which is both a programming language and a software environment for data analysis. We focus on the nature of computational thinking: the scholar’s work of representing and analyzing data on the computer is a process of highly disciplined expression. We will work together to analyze some samples of data (bibliographic data and word-use data), from loading files in the ubiquitous comma-separated value (CSV) format, to searching and tabulating data programmatically, to the “grammar” of basic visualization.

No programming experience required. Patience, however, helps.

Notes from the workshop

The following is a somewhat edited and amplified write-up (by Andrew Goldstone) of the workshop notes. The original slides from the workshop are also available. Working through the material below takes approximately two and a half hours.

Shall we count?

The aim of this workshop is to provide some foundations for thinking about counting as a way of answering some questions we have in our disciplines. Foundations, and not just practical instruction in tool-using, matter because there is a lot at stake in the decision to explore counting methods. James English writes about literary study:

Academic disciplines (and even interdisciplines or hybrids) are relational entities; they must define themselves by what they are not. And what literary studies is not is a “counting” discipline. This negative relation to numbers is traditional—foundational, even—and it has not been seriously challenged by the rise of interdisciplinarity….Literary studies has shouldered much of the burden of…defending qualitative models and strategies against the naïve or cynical quantitative paradigm that has become the doxa of higher-educational management. Under these institutional circumstances, antagonism toward counting has begun to feel like an urgent struggle for survival.1

Comma-separated values

This workshop focuses on tables. It’s worth thinking about what can and cannot be represented in a table, but the table is a ubiquitous model for keeping track of data. When you think of tables and computers, you might think of Excel spreadsheets. Unfortunately, the Excel format is both too complex and too opaque to allow for direct programmatic manipulation, except through the very awkward mechanisms of Excel’s own native languages and macros. That format is also proprietary and vulnerable to problems when it comes to sharing and archiving. I have found that for my own purposes, I have spent the most time working on tables in CSV or comma-separated values formats. Excel and any other spreadsheet program can save your spreadsheets as “CSV” or “Text CSV” (though, as you will see, this will entail a drastic, though productive, simplification of the format of the data). Here is a small table in CSV format:

firstname,surname,bornCountry
Alice,Munro,Canada
Mo,Yan,China
Tomas,Tranströmer,Sweden

The norms of CSV

  • plain-text file for tabular data
  • delimiter separates columns (usually , or a tab)
  • newline separates rows
  • names of columns in first row (optional)
  • tricky bits:
    • what if a data point contains a comma?
    • what if a data point contains a quotation mark?
    • what text-encoding should be used?
    • how do you know what rules have been followed? (There is RFC 4180, but no promises.)
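The first two tricky bits have a common (though not universal) answer: surround the data point with quotation marks, and double any quotation mark inside it. Here is a minimal sketch that checks whether R’s read.csv follows that convention. It uses R features explained later in these notes, so you may want to come back to it. The two-row table is invented for illustration, and the text parameter simply lets us hand read.csv a string instead of a file.

```r
# A made-up CSV table, written as a string instead of a file.
# One data point contains a comma; another contains quotation marks.
csv_text <- 'city,note
"Nitzkydorf, Banat","called ""home"""
Stockholm,plain'

# read.csv accepts the text itself through its text parameter.
places <- read.csv(text = csv_text, stringsAsFactors = FALSE)

places$city[1]  # the comma stays inside a single data point
places$note[1]  # the doubled quotation marks come back as single ones
nrow(places)    # 2: the header row is not counted as data
```

Other programs may follow other rules, which is why the last question in the list above matters.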

People as a table

Let’s look at some more elaborate CSV-format data. In the sample files, look for laureates.csv and open it in RStudio using the Open File command (or open it in a text editor; for more on text editors, see the notes for the workshop on digital text).2

id,firstname,surname,born,died,bornCountry,bornCountryCode,bornCity,diedCountry,diedCountryCode,diedCity,gender,year
892,Alice,Munro,1931-07-10,0000-00-00,Canada,CA,Wingham,,,,female,2013
880,Mo,Yan,0000-00-00,0000-00-00,China,CN,Gaomi,,,,male,2012
868,Tomas,Tranströmer,1931-04-15,0000-00-00,Sweden,SE,Stockholm,,,,male,2011
854,Mario,"Vargas Llosa",1936-03-28,0000-00-00,Peru,PE,Arequipa,,,,male,2010
844,Herta,Müller,1953-08-17,0000-00-00,Romania,RO,"Nitzkydorf, Banat",,,,female,2009
832,"Jean-Marie Gustave","Le Clézio",1940-04-13,0000-00-00,France,FR,Nice,,,,male,2008
817,Doris,Lessing,1919-10-22,2013-11-17,"Persia (now Iran)",IR,Kermanshah,"United Kingdom",UK,London,female,2007
808,Orhan,Pamuk,1952-06-07,0000-00-00,Turkey,TR,Istanbul,,,,male,2006
801,Harold,Pinter,1930-10-10,2008-12-24,"United Kingdom",UK,London,"United Kingdom",UK,London,male,2005

Notice the large number of conventional choices implied by this table: quotation marks to surround items with spaces; dates in YYYY-MM-DD format; “still living” represented as a 0000-00-00 date; arbitrary ID numbers; gender coded as male or female… None of these codings are described explicitly; CSV has very limited accommodation for metadata (just the column names). The rest of the metadata has to live in a separate file. Working in this format means keeping careful track of choices for how categories have been coded.

Text as a table

Here is part of a tabular representation of a scholarly article3:

WORDCOUNTS,WEIGHT
the,766
of,482
and,305
in,259
to,224
a,195
new,101
as,101
that,86
it,75

This so-called bag of words indicates only the number of times each type of word occurs in the article (according to JSTOR’s OCR), regardless of order. For this purpose, the CSV format is quite amenable. Notice what has been discarded: not just word order but punctuation, page layout, typography… (What dimensions of the page could be tabulated by extending the table?)

It is worth spending some time thinking about what can and cannot be accommodated in a data format like CSV. Let’s make a rough typology of some of the kinds of data we might be interested in counting up.

Data types

Simple: numerical

  • Whole numbers (integer scale). How many (books, people, words, genres…)?
  • Real numbers (interval scale). How much (distance, time, money…)? Special cases:
    • percentages or proportions (ratio scale). How much of the total (population, corpus of texts…)?
    • dates. When? (And does the day, month, year, decade, century… matter?)

Simple: categorical

  • Unordered. Which of… (languages, nations, genders(?))? Special cases:
    • binary or Boolean category: true or false, yes or no.
    • many categories (headwords in the dictionary, authors in the catalogue).
  • Ordinal. Which (letter of the alphabet, sales rank, “like, dislike, or neutral”)?

Categories may be represented by numbers, often in more than one way:

  • true: 1, false: 0
  • like: 1, neutral: 0, dislike: -1
  • like: 2, neutral: 1, dislike: 0
  • a: 1, b: 2, c: 3… (character encoding)
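The choice among such codings is not innocent, because arithmetic on the codes gives different answers. Here is a small sketch in R (whose notation is introduced later in these notes) that applies two of the codings above to the same responses. The responses are invented, and the names scheme_a and scheme_b are mine.

```r
# Invented survey responses, for illustration only.
responses <- c("like", "dislike", "neutral", "like")

# Two numeric codings of the same three categories.
scheme_a <- c(dislike = -1, neutral = 0, like = 1)
scheme_b <- c(dislike = 0, neutral = 1, like = 2)

# Look up each response's code by name; unname() drops the labels.
coded_a <- unname(scheme_a[responses])
coded_b <- unname(scheme_b[responses])

coded_a        # 1 -1 0 1
coded_b        # 2 0 1 2
mean(coded_a)  # 0.25
mean(coded_b)  # 1.25
```

The same four answers average to 0.25 under one coding and 1.25 under the other, so a number like this means nothing until you know how the categories were coded.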

Compound types

With these atoms of data, we can then make more complex forms:

The list / the series

As in a series, perhaps, of percentages:

17.5, 3.0, 11.0, 13.0, 8.1, 11.9, 11.0, 3.7, 3.2, 1.5

The list of lists / the table

The table can also be represented as a list of lists (of equal length):

firstname: Alice, Mo, Tomas
surname: Munro, Yan, Tranströmer
bornCountry: Canada, China, Sweden

firstname surname     bornCountry
Alice     Munro       Canada
Mo        Yan         China
Tomas     Tranströmer Sweden

You might also think of a table as a list of “cases,” one case per row, where each row is described by the same collection of data (first name, surname, country of birth).

And text?

For computational purposes, the central representation of text is in the form of a (looooong) list of characters (a “string”):

O, n, c, e, *space*, u, p, o, n, *space*, a, *space*,
t, i, m, e

But other representations exist:

  • the bag of words (to: 2, be: 2, or: 1, not: 1)
  • content analyses (automated, human, or semi-automated classifications of texts, which can then be tallied and analyzed in turn)
  • marked-up text

      <sp who="#Salinus"><speaker>Duke.</speaker>
      <p>Haplesse <name>Egeon</name> whom the fates
      haue markt...</p>
  • parsed trees (reflecting grammar—a grammar tree is a classic example of a data format which is impossible to encode in a single table)
  • page images (bonus activity: explain how an image can be represented in a table)
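To make the first two representations concrete, here is a sketch in R of one short string seen both as a list of characters and as a bag of words. It relies on functions (strsplit, table) and notation ([[1]]) that come up later or not at all in these notes, and it assumes that words are separated by single spaces, which real text rarely guarantees.

```r
phrase <- "to be or not to be"

# The string as a list of characters: split on the empty string.
chars <- strsplit(phrase, "")[[1]]
length(chars)  # 18 characters, spaces included

# The same string as a bag of words: split on spaces, then tabulate.
words <- strsplit(phrase, " ")[[1]]
bag <- table(words)

bag[["to"]]   # 2
bag[["not"]]  # 1
```

Once the words are tabulated, their order is gone: nothing in bag distinguishes this phrase from “be to not or be to.”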

Programming in a nutshell

Let’s get counting. This is going to require some programming. What is programming?

  1. A program is a formal description of a process for transforming data. Composing a program is a matter of expressing what you want to do in a constrained language.
  2. A computer performs calculations on numbers and stores the results of those calculations.
  3. If the inputs, outputs, and the formal description can be encoded as numbers, a program can be executed on a computer. At that point the formal description also looks like a recipe of instructions for the computer. In programming, one often switches back and forth between the expressive mode of description and the more machine-focused mode of instruction.

The R experience

The console

The console is the window with the > prompt. In this window, you type an expression, and R figures out its value (and sometimes: stores a value, draws a figure, reads a file from the disk, saves a file on the disk), and tells you. And that’s all.

The script

A script is a set of expressions in a text file, one after another. R goes through and figures out their value one by one. And that’s all.

First steps in the console

R is a parrot

The simplest kind of expression consists of a value. To figure out the value of a value, R doesn’t have to work very hard:

2
[1] 2

From here on, these notes show, first, the line you can type into R (indicated by the box with a grey background) and then, following it, a second box, with a white background, showing the response R gives. There’s no need to type in the response. (Ignore, for now, the strange [1] you see when R echoes these values back to you. It’s just trying to be friendly.)

"Shiver me timbers"
[1] "Shiver me timbers"

Notice that as you type a ", RStudio fills in the closing " automatically. This can be a little disconcerting but is a useful convenience. You can “overtype” the close quote as well. RStudio will do the same thing with parentheses and brackets.

R gets crabby easily

On the other hand, already we can provide a first introduction to some of the ways you can type things R does not understand. Doing this is a normal part of the work, and hitting glitches and making errors is an important part of a learning process. R is particularly bad at explaining to you why it has not accepted what you typed. It’s worth practicing making R crabby, so you can see that this experience is not the end of the world:

Shiver
Shiver me timbers
help
(
"Shiver

To escape from these cases, press ESC.

Some important features

The constrained environment of the interactive prompt (the > where you type a line and press return) is rigorously linear and serial. Once you’ve typed return, you can’t edit the line to fix mistakes. But you can quickly copy over into a new line what you previously typed:

  • Use the up and down arrows (or the RStudio History pane) to move through the history of past lines.
  • Use the tab key to fill in partly-typed words that are known to R.4
  • Use the help feature: help("paste") or ?paste displays help on the thing called paste.

R data kinds (“modes”)

Numbers

R does not enforce the difference between integers and non-integers very rigidly. Most of the time, in R you just think of “numbers” with and without decimal places.
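If you are curious, you can ask R which kind of number it thinks it has. This is a small sketch rather than something you need for the rest of the workshop; the class function and the L suffix (which requests a true integer) are not otherwise used in these notes.

```r
class(2)         # "numeric": a plain 2 is stored with room for decimals
class(2L)        # "integer": the L suffix asks for a true integer
2L == 2          # TRUE: the two kinds compare as equal
7 / 2            # 3.5: division does not round to a whole number
class(2L + 0.5)  # "numeric": mixing the kinds quietly gives a decimal number
```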

Strings

Text comes in strings, surrounded with "":

"Avast"
[1] "Avast"
"\"Avast,\" he said"
[1] "\"Avast,\" he said"
"Beware the \\"
[1] "Beware the \\"

Represent a newline with \n and a tab with \t. In all these cases, \ is a special “escape” character indicating that the next character has a special interpretation.
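When R echoes a string back, it shows the escapes as you typed them. To see what they stand for, one option is the cat function, which is not otherwise used in these notes. A short sketch, with a made-up string:

```r
header <- "firstname\tsurname\nAlice\tMunro\n"

header       # the console echoes the escapes as typed
cat(header)  # cat() prints the tabs and newlines themselves
nchar("\t")  # 1: an escape sequence counts as a single character
nchar("\\")  # 1: so does an escaped backslash
```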

Booleans

In R, a Boolean value may be TRUE or FALSE, T or F for short.

Factors

This special type, for representing categorical data, is discussed below.

Rithmetic

Try:

2 * 2
[1] 4
5/7
[1] 0.7143

You can use as many spaces or as few as you want.

Now try a logical expression using the operators ==, which means “is equal?”; !=, which means “is not equal?”; and > and <:

4 == 3
[1] FALSE
4 > 3
[1] TRUE
4 < 3
[1] FALSE
4 != 3
[1] TRUE

These expressions have Boolean values. Boolean values have their own arithmetic, defined by the operators familiar from catalogue searching: and, or, not. In R these are notated as follows:

(2 > 1) & (1 > 5)
[1] FALSE
(2 > 1) | (1 > 5)
[1] TRUE
!(1 > 5)
[1] TRUE

R functions

Functions map inputs to outputs. The syntax is:

function_name(input_expression)

for a function with one input or

function_name(input1, input2)

for a function with 2. And so on. The inputs can be any expression.

Here are some examples of functions applied to simple values:

sqrt(4)
[1] 2
nchar("Munro")
[1] 5
paste("Alice", "Munro")
[1] "Alice Munro"

You can guess what these functions do, but you could also look up the official explanation with help(nchar) or help(paste).

Here are some examples where a function takes an expression as an input—which might include another function—and so ad infinitum. (This ability to use one function’s output as input to another function—or even the same function—is central to the working of algorithms.)

sqrt(4 * 4)
[1] 4
sqrt(nchar("Four"))
[1] 2
paste(paste("Alice", "Munro"), "(Canada)")
[1] "Alice Munro (Canada)"

Functions in R have one other way of indicating inputs, called named parameters. They look like this:

paste("Munro", "Alice", sep = ", ")
[1] "Munro, Alice"
paste("Munro", "Alice", sep = "")
[1] "MunroAlice"

Here the third parameter is given the name sep, which has a special role in the paste function (what is it?).

Assignment

Computers do calculations and store the results. We’ve done calculations; what about storing?

In R, <- stores a value under a name which you can refer to (or change) later. The format is

name <- expression

R figures out the value of expression and stores it in name.5

Once you’ve stored a value under name, the value of that name is…the stored value. This sounds funny, but try a few examples:

x <- 108
x
[1] 108
x + 2
[1] 110
storage <- 10
storage <- storage - 10
My_Perfectly_Good_Name2012 <- "Mo Yan"

In the “Environment” pane of RStudio, you can watch the results of your assignments: the names and their associated values suddenly appear in the list.

Names can be short or long, but can’t have spaces in them, and they have to start with a letter.

R compound data types

Vectors (for a series of values)

What I described as a series is called a vector in R. Construct a vector with the special function c (concatenate):

xs <- c(2, 4, 8)
xs
[1] 2 4 8
bs <- c(T, F, T)
bs
[1]  TRUE FALSE  TRUE
people <- c("Munro", "Mo", "Transtromer")
people
[1] "Munro"       "Mo"          "Transtromer"
c(people, "Vargas Llosa")
[1] "Munro"        "Mo"           "Transtromer"  "Vargas Llosa"

Notice that vectors can hold not just numbers but strings or Booleans. (They aren’t allowed to hold a mixture. For that a different compound type exists, the list, which we won’t say much about.)
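What happens if you try to mix kinds anyway? R does not complain; it quietly converts everything to a single kind. A brief sketch, in which the list function and the [[ ]] notation are additions of mine that the rest of these notes do not rely on:

```r
mixed <- c(1, "two", TRUE)
mixed         # "1" "two" "TRUE": everything has become a string
class(mixed)  # "character"

c(1, TRUE)    # 1 1: a Boolean among numbers becomes 1 (FALSE would be 0)

kept <- list(1, "two", TRUE)
class(kept[[1]])  # "numeric": a list leaves each element as it was
class(kept[[3]])  # "logical"
```

This silent conversion is a common source of surprises, for example when a column of numbers read from a file contains one stray word.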

Subscripting

Once you have a series, how do you pick out parts of it? Choose an element or elements from a vector with []:

xs[2]
[1] 4
people[1]
[1] "Munro"

Again, any expression whose value is a meaningful subscript can go in the square brackets:

xs[1 + 1]  # a silly example
[1] 4

Sequences

One special kind of vector has so many uses that R has a special way of notating it. This is the sequence:

1:3
[1] 1 2 3
c(1:3, 6:8)
[1] 1 2 3 6 7 8

Now try an experiment. What is the value of these expressions?

people[1:2]
[1] "Munro" "Mo"   
people[c(1, 3)]
[1] "Munro"       "Transtromer"

If the subscript is itself a vector, then we get back a vector, not just a single element.

Logical subscripting

Now figure out what’s going on here:

people[bs]
[1] "Munro"       "Transtromer"

In the expression

v[logic_v]

if logic_v is made up of Boolean values and has the same length as v, it acts as a kind of mask applied to the series: only the elements of v corresponding to the TRUE values of logic_v are picked out.

Vector operations

All of the arithmetic discussed above has a vector version. So do many R functions. In general, the idea is to do the same thing to each element of the vector (operators apply “elementwise”):

c(1, 3, 5) + c(2, 4, 6)
[1]  3  7 11
c(T, F, F) | c(F, T, F)
[1]  TRUE  TRUE FALSE
paste(c("a", "b"), c("c", "d"))
[1] "a c" "b d"

If the Boolean arithmetic above seemed vague, you can generate “truth tables” that show you the workings of the Boolean operators6:

c(T, T, F, F) & c(T, F, T, F)  # truth table for AND
[1]  TRUE FALSE FALSE FALSE
c(T, T, F, F) | c(T, F, T, F)  # truth table for OR
[1]  TRUE  TRUE  TRUE FALSE
!c(T, F)  # truth table for NOT
[1] FALSE  TRUE

Another important operator: x %in% y checks whether x is found in the vector y. This is clear enough for the case where x is a single value:

"c" %in% c("b", "c", "d", "e")
[1] TRUE
"a" %in% c("b", "c", "d", "e")
[1] FALSE

Explain to yourself what happens when x is a vector of multiple values:

c("a", "b", "c") %in% c("b", "c", "d", "e")
[1] FALSE  TRUE  TRUE

Recycling

The last vectorial convenience has to do with operations that need vectors of the same length. R lets you supply a single value and then extends it to the length required by the context of an operation or a function. Compare:

c(1, 3, 5) + 1
[1] 2 4 6
c(1, 3, 5) + c(1, 1, 1)
[1] 2 4 6
paste("The", c("beginning", "end"))
[1] "The beginning" "The end"      
paste(c("The", "The"), c("beginning", "end"))
[1] "The beginning" "The end"      
c(1, 3, 5) == 3
[1] FALSE  TRUE FALSE
c(1, 3, 5) == c(3, 3, 3)
[1] FALSE  TRUE FALSE
choice <- xs > 3
xs > c(3, 3, 3)
[1] FALSE  TRUE  TRUE
choice
[1] FALSE  TRUE  TRUE
xs[choice]
[1] 4 8
xs[xs > 3]
[1] 4 8

That last example is a very common idiom of the R language. It expresses the following: “The elements of xs that are greater than 3” in a concise way. But now you can also see how the R machine actually works that out:

  1. Take 3 and repeat it enough times to make a vector the same length as xs.
  2. Compare xs to that repeated 3 vector, yielding a logical vector.
  3. Use the logical vector as a subscript for xs to pick out only some elements of that vector.
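The three steps can be spelled out by hand. This is only a sketch of the logic, not a claim about R’s internals; the rep and length functions and the names threes, mask, and by_hand are mine.

```r
xs <- c(2, 4, 8)

# Step 1: repeat 3 until it is as long as xs.
threes <- rep(3, length(xs))
# Step 2: compare elementwise, giving a logical vector.
mask <- xs > threes
# Step 3: use the logical vector as a subscript.
by_hand <- xs[mask]

by_hand                         # 4 8
identical(by_hand, xs[xs > 3])  # TRUE
```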

Recycling actually works on vectors of any length, though this feature is less often used than recycling a single element:

1:4 + 1:2
[1] 2 4 4 6
letters  # built-in variable of length 26
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p"
[17] "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
paste(letters, c("odd", "even"))
 [1] "a odd"  "b even" "c odd"  "d even" "e odd"  "f even" "g odd" 
 [8] "h even" "i odd"  "j even" "k odd"  "l even" "m odd"  "n even"
[15] "o odd"  "p even" "q odd"  "r even" "s odd"  "t even" "u odd" 
[22] "v even" "w odd"  "x even" "y odd"  "z even"

The data frame

Finally we come to the central compound data type in R, the data frame. The data frame represents tabular data. A data frame is a list of vectors not necessarily of the same type, but all of the same length. Let’s make one. The rather special data.frame() function takes named parameters and makes a data frame out of them, using the parameter names as the names of the columns.

laureates <- data.frame(
    firstname=c("Alice","Mo","Tomas"),
    surname=c("Munro","Yan","Tranströmer"),
    bornCountry=c("Canada","China","Sweden"),
    age_now=c(82,59,83))
laureates
  firstname     surname bornCountry age_now
1     Alice       Munro      Canada      82
2        Mo         Yan       China      59
3     Tomas Tranströmer      Sweden      83

To simplify the results of what goes on below, one slightly magical incantation should be added here.7

laureates <- data.frame(
    firstname=c("Alice","Mo","Tomas"),
    surname=c("Munro","Yan","Tranströmer"),
    bornCountry=c("Canada","China","Sweden"),
    age_now=c(82,59,83),
    stringsAsFactors=F)
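To see both that a data frame really is a list of equal-length vectors and what the incantation changes, you can inspect a smaller frame. This is a sketch with a two-row table of my own; sapply, which applies a function to each column, is not otherwise covered in these notes. In R 4.0.0 and later the default is already stringsAsFactors = FALSE, so on a recent installation the setting is harmless but redundant.

```r
small <- data.frame(
    surname = c("Munro", "Yan"),
    age_now = c(82, 59),
    stringsAsFactors = FALSE)

is.list(small)         # TRUE: a data frame is a list of columns
sapply(small, class)   # surname is "character", age_now is "numeric"
sapply(small, length)  # both columns have length 2

# Asking for factors explicitly shows what the incantation avoids.
with_factors <- data.frame(
    surname = c("Munro", "Yan"),
    stringsAsFactors = TRUE)
sapply(with_factors, class)  # surname is now "factor"
```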

Indexing by row and column

If we want to access parts of the data frame, a single subscript is no longer enough; now we need two subscripts to pick out rows and columns. In general:

data_frame[rows, columns]

gives us only those elements of data_frame in the rows specified by a subscript vector rows and the columns specified by columns. Try it out:

laureates[1, 1]
[1] "Alice"
laureates[1, 2]
[1] "Munro"

In addition to numbers, our subscripts can use the column names:

laureates[1, "firstname"]
[1] "Alice"
laureates[2, "surname"]
[1] "Yan"

And, just as with vectors, subscripts can be vectors, not just single indices:

laureates[3, c("firstname", "surname")]
  firstname     surname
3     Tomas Tranströmer

Fiddly note: If your data frame subscript expression picks out more than one column, its value is itself a data frame rather than a vector (even if you pick only one row). For the purposes of this lesson, this distinction is not important.
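If you ever do need to check which of the two you have, R can tell you. A minimal sketch, using a cut-down copy of the table under the hypothetical name toy; the drop parameter is a real feature of data frame subscripting that these notes do not otherwise use.

```r
toy <- data.frame(
    firstname = c("Alice", "Mo", "Tomas"),
    surname = c("Munro", "Yan", "Transtromer"),
    stringsAsFactors = FALSE)

two_cols <- toy[3, c("firstname", "surname")]
one_col <- toy[3, "surname"]

is.data.frame(two_cols)  # TRUE: two columns, so still a table
is.data.frame(one_col)   # FALSE: one column collapses to a plain vector
one_col                  # "Transtromer"

# drop = FALSE asks R to keep the table form even for a single column.
is.data.frame(toy[3, "surname", drop = FALSE])  # TRUE
```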

Exercise (1)

Write a single expression in terms of laureates to produce the full name of Canada’s laureate. If your answer is a vector of multiple elements, write a more complicated expression that yields a single string. You will have to use a function. The solution is in the answers section.

Omitted indices

Leaving a blank where an index would be means “I want all of ’em”:

laureates[3, ]
  firstname     surname bornCountry age_now
3     Tomas Tranströmer      Sweden      83
laureates[, 2]
[1] "Munro"       "Yan"         "Tranströmer"
laureates[, c("surname", "bornCountry")]
      surname bornCountry
1       Munro      Canada
2         Yan       China
3 Tranströmer      Sweden

This is useful in conjunction with logical indexing:

laureates[c(T, F, T), ]
  firstname     surname bornCountry age_now
1     Alice       Munro      Canada      82
3     Tomas Tranströmer      Sweden      83

A shorthand

Picking out a single column is so common that R lets you use a more concise syntax:

laureates[, "surname"]
[1] "Munro"       "Yan"         "Tranströmer"
laureates$surname
[1] "Munro"       "Yan"         "Tranströmer"

A single column is just…a regular old vector, which can be subscripted in turn:

laureates$surname[2]
[1] "Yan"

But this only works on a single column:

laureates$firstname, surname # error
laureates$(firstname, surname) # sigh
laureates$c(firstname, surname) # alas

Some less-made-up data

We’ve worked on miniature data long enough. Let’s work on a bigger table—though not much bigger. R has a special function devoted to reading CSV files from your hard drive. Its input is the name of the file and its output is a data frame.

laureates <- read.csv("laureates.csv",
                      stringsAsFactors=F)

Please take the magical stringsAsFactors=F incantation on faith for now. This is the first R operation you have seen which involves the hard drive; it is an example of “File I/O.” This is a fount of interesting errors and problems. One of the major flaws of R is its extremely obscure way of telling you about such problems. For example, if it can’t find the file you asked for (either because of a typo in the name, or because it’s looking in the wrong folder), you’ll see an error like:

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open file 'laureates.csv': No such file or directory
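A quick way to diagnose this error is to ask R where it is looking and whether it can see the file from there. The following is a sketch, not part of the original workshop; getwd and file.exists are standard R functions, and the if/else construction is a piece of R syntax these notes do not otherwise cover.

```r
path <- "laureates.csv"

getwd()                     # the folder where R looks for bare file names
found <- file.exists(path)  # TRUE if R can see the file from that folder

if (found) {
    laureates <- read.csv(path, stringsAsFactors = FALSE)
} else {
    message("Cannot see ", path, " from ", getwd())
}
```

If found is FALSE, either the name is mistyped or the file lives in a different folder from the one getwd() reports.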

One solution is to figure out the full path name of the file you want (i.e., where in your nested folders it lives). If you are used to using the Finder/Explorer to do this, then the simplest way8 to figure this out is with another special R function:

file.choose()

This opens a dialog box; navigate to where you have stored laureates.csv and click “open.” The value of the function now appears, and it is the full path of the file. On my system this looks like:

"/Users/agoldst/Documents/dhru/counting/laureates.csv"

You can now copy and paste this path, in quotes, into read.csv. On my system (but not on yours) this looks like:

laureates <- read.csv("/Users/agoldst/Documents/dhru/counting/laureates.csv",
                      stringsAsFactors=F)

Properties of the frame

We said before that a CSV file involves very minimal metadata. R stores the same metadata about a data frame. Find the names of the columns of a data frame with a function, names:

names(laureates)
 [1] "id"                "firstname"         "surname"          
 [4] "born"              "died"              "bornCountry"      
 [7] "bornCountryCode"   "bornCity"          "diedCountry"      
[10] "diedCountryCode"   "diedCity"          "gender"           
[13] "year"              "category"          "overallMotivation"
[16] "share"             "motivation"        "name"             
[19] "city"              "country"          

And the number of rows (which in this case is the number of laureates):

nrow(laureates)
[1] 106

The logic of the query

Now that we have a longer table (not that long, but still a bit long to see all at a glance), we want to slice and dice it. Indeed, one of the most important things we can do with a table of data is to choose parts of it. Combining what we have seen so far, we can use logical vectors, Boolean operators, and subscripting to pick out parts of a data frame in R.

This is an operation we all do all the time when we search library catalogues or databases. So think of this as a version of a search query. Only instead of clicking some menus and seeing results visually, we have the capacity to store and do further calculations on the results of our queries.

But let’s start with a query. What are the surnames of laureates born in Sweden?

laureates$bornCountry == "Sweden"
  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [11] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [21] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [31] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 [41] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [51] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [61] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [71] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
 [81] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [91] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[101] FALSE FALSE FALSE FALSE FALSE FALSE
swedes <- laureates$bornCountry == "Sweden"
laureates$surname[swedes]
[1] "Tranströmer"    "Johnson"        "Lagerkvist"    
[4] "Karlfeldt"      "von Heidenstam" "Lagerlöf"      

Let’s look at the full rows of the table for laureates born in Sweden, too:

laureates[swedes, ]
    id            firstname        surname       born       died
3  868                Tomas    Tranströmer 1931-04-15 0000-00-00
40 649               Eyvind        Johnson 1900-07-29 1976-08-25
63 622           Pär Fabian     Lagerkvist 1891-05-23 1974-07-11
78 604            Erik Axel      Karlfeldt 1864-07-20 1931-04-08
92 585   Carl Gustaf Verner von Heidenstam 1859-07-06 1940-05-20
98 579 Selma Ottilia Lovisa       Lagerlöf 1858-11-20 1940-03-16
   bornCountry bornCountryCode       bornCity diedCountry
3       Sweden              SE      Stockholm            
40      Sweden              SE Svartbjörnsbyn      Sweden
63      Sweden              SE          Växjö      Sweden
78      Sweden              SE         Karlbo      Sweden
92      Sweden              SE      Olshammar      Sweden
98      Sweden              SE       Mårbacka      Sweden
   diedCountryCode  diedCity gender year   category overallMotivation
3                              male 2011 literature                NA
40              SE Stockholm   male 1974 literature                NA
63              SE Stockholm   male 1951 literature                NA
78              SE Stockholm   male 1931 literature                NA
92              SE   Övralid   male 1916 literature                NA
98              SE  Mårbacka female 1909 literature                NA
   share
3      1
40     2
63     1
78     1
92     1
98     1
                                                                                                                                                    motivation
3                                                                    "because, through his condensed, translucent images, he gives us fresh access to reality"
40                                                                              "for a narrative art, far-seeing in lands and ages, in the service of freedom"
63 "for the artistic vigour and true independence of mind with which he endeavours in his poetry to find answers to the eternal questions confronting mankind"
78                                                                                                                         "The poetry of Erik Axel Karlfeldt"
92                                                           "in recognition of his significance as the leading representative of a new era in our literature"
98                                          "in appreciation of the lofty idealism, vivid imagination and spiritual perception that characterize her writings"
   name city country
3    NA   NA      NA
40   NA   NA      NA
63   NA   NA      NA
78   NA   NA      NA
92   NA   NA      NA
98   NA   NA      NA

That’s a lot of stuff, so from now we’ll only look at some of the columns while we’re picking out rows. Let’s try another query:

women <- laureates$gender == "female"
laureates[women, c("firstname", "surname", "year")]
              firstname    surname year
1                 Alice      Munro 2013
5                 Herta     Müller 2009
7                 Doris    Lessing 2007
10             Elfriede    Jelinek 2004
18              Wislawa Szymborska 1996
21                 Toni   Morrison 1993
23               Nadine   Gordimer 1991
69             Gabriela    Mistral 1945
72                Pearl       Buck 1938
81               Sigrid     Undset 1928
83               Grazia    Deledda 1926
98 Selma Ottilia Lovisa   Lagerlöf 1909

Not…many. We can now express more complex ideas, like “All the rows corresponding to women born in Sweden”:

laureates[women & swedes, c("firstname", "surname", 
    "year")]
              firstname  surname year
98 Selma Ottilia Lovisa Lagerlöf 1909

Or “all the rows corresponding to women or Swedes”:

laureates[women | swedes, c("firstname", "surname", 
    "year")]
              firstname        surname year
1                 Alice          Munro 2013
3                 Tomas    Tranströmer 2011
5                 Herta         Müller 2009
7                 Doris        Lessing 2007
10             Elfriede        Jelinek 2004
18              Wislawa     Szymborska 1996
21                 Toni       Morrison 1993
23               Nadine       Gordimer 1991
40               Eyvind        Johnson 1974
63           Pär Fabian     Lagerkvist 1951
69             Gabriela        Mistral 1945
72                Pearl           Buck 1938
78            Erik Axel      Karlfeldt 1931
81               Sigrid         Undset 1928
83               Grazia        Deledda 1926
92   Carl Gustaf Verner von Heidenstam 1916
98 Selma Ottilia Lovisa       Lagerlöf 1909

Or “all the rows corresponding to women who are not Swedes, but take only the first names and surnames”:

laureates[women & !swedes, c("firstname", "surname")]
   firstname    surname
1      Alice      Munro
5      Herta     Müller
7      Doris    Lessing
10  Elfriede    Jelinek
18   Wislawa Szymborska
21      Toni   Morrison
23    Nadine   Gordimer
69  Gabriela    Mistral
72     Pearl       Buck
81    Sigrid     Undset
83    Grazia    Deledda
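
To see the logic of these subscripts in isolation, here is a minimal sketch on a four-row toy data frame. The data frame people and everything in it are invented for illustration; they are not part of laureates.csv.

```r
# Toy data (invented for illustration, not the workshop's laureates.csv)
people <- data.frame(
    name = c("Ada", "Ben", "Cleo", "Dov"),
    gender = c("female", "male", "female", "male"),
    country = c("SE", "SE", "US", "FR"),
    stringsAsFactors = FALSE
)

# Each comparison yields one TRUE/FALSE value per row
is_woman <- people$gender == "female"
is_swede <- people$country == "SE"

# & keeps rows where both tests are TRUE, | where at least one is,
# and ! flips every value of a logical vector
both <- people[is_woman & is_swede, "name"]
either <- people[is_woman | is_swede, "name"]
women_not_swedes <- people[is_woman & !is_swede, "name"]

print(both)
print(either)
print(women_not_swedes)
```

The same pattern extends to any number of conditions: build each test as its own named logical vector, then combine the vectors in the row subscript.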

Exercise (2)

Write an expression whose value is a data frame containing the names and prize-years of all the laureates who died in a country other than the country of their birth. The solution is in the answers section.

Counting

Now let’s finally use the computer to count (explicitly; secretly, it’s been doing a lot of counting already). For counting in R, the workhorse is the table function. At its simplest, table takes a vector as an input and returns a tabulation9 showing how many times each value in the vector is repeated:

table(c("a", "b", "a", "c", "b"))

a b c 
2 2 1 

This is already a useful operation. Notice that because we’re always talking to R in vectors, we always start by counting everything. Instead of asking, “How many Swedish laureates?” we could ask, “How many laureates from each country?” This is more information, but in R it is more concise (because more general):

table(laureates$bornCountryCode)

   AT BE BG CA CH CL CN CO CZ DE DK EG ES FI FR GP GR GT HU IE IN IR 
 3  1  1  1  2  1  2  2  1  1  7  3  1  4  1 10  1  1  1  1  4  1  1 
IS IT JP LC LT MG MX NG NO PE PL PT RO RU SE TR TT UA UK US ZA 
 1  6  2  1  1  1  1  1  2  1  5  1  1  5  6  2  1  1  6  8  2 

…and division

The thing about counting is that we’re most often interested not in the question how many? but in how many, out of all of them? Now we can make use of our metadata, and R’s vectorized arithmetic.

table(laureates$bornCountryCode)/nrow(laureates) * 
    100

           AT     BE     BG     CA     CH     CL     CN     CO     CZ 
2.8302 0.9434 0.9434 0.9434 1.8868 0.9434 1.8868 1.8868 0.9434 0.9434 
    DE     DK     EG     ES     FI     FR     GP     GR     GT     HU 
6.6038 2.8302 0.9434 3.7736 0.9434 9.4340 0.9434 0.9434 0.9434 0.9434 
    IE     IN     IR     IS     IT     JP     LC     LT     MG     MX 
3.7736 0.9434 0.9434 0.9434 5.6604 1.8868 0.9434 0.9434 0.9434 0.9434 
    NG     NO     PE     PL     PT     RO     RU     SE     TR     TT 
0.9434 1.8868 0.9434 4.7170 0.9434 0.9434 4.7170 5.6604 1.8868 0.9434 
    UA     UK     US     ZA 
0.9434 5.6604 7.5472 1.8868 
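
As a small sketch of the same division on data you can check by hand, the following uses a made-up vector of country codes (codes is hypothetical, not a column of laureates). It also shows base R's prop.table function, which divides a table by its own total and so gives the same proportions without a separate count of rows.

```r
# Made-up country codes, short enough to tally by eye
codes <- c("FR", "US", "FR", "SE", "US", "FR", "DE", "FR")

counts <- table(codes)

# Division is vectorized: every count is divided by the same total
by_division <- counts / length(codes) * 100

# prop.table divides each cell by the sum of the whole table
by_prop_table <- prop.table(counts) * 100

print(round(by_prop_table, 1))
```

Dividing by the number of rows and dividing by the table's total agree only when every row lands in some cell of the table, which is the case here and in the laureates example.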

Exercise (3)

Write an expression for a tabulation of the number of men and women to win the Nobel in literature. The solution is in the answers section.

Cross-tabulation

Cross-tabulation means answering questions like, “How many of each gender were born in each country?”

table(laureates$bornCountryCode, laureates$gender)
    
     female male
          0    3
  AT      1    0
  BE      0    1
  BG      0    1
  CA      1    1
  CH      0    1
  CL      1    1
  CN      0    2
  CO      0    1
  CZ      0    1
  DE      0    7
  DK      1    2
  EG      0    1
  ES      0    4
  FI      0    1
  FR      0   10
  GP      0    1
  GR      0    1
  GT      0    1
  HU      0    1
  IE      0    4
  IN      0    1
  IR      1    0
  IS      0    1
  IT      1    5
  JP      0    2
  LC      0    1
  LT      0    1
  MG      0    1
  MX      0    1
  NG      0    1
  NO      0    2
  PE      0    1
  PL      1    4
  PT      0    1
  RO      1    0
  RU      0    5
  SE      1    5
  TR      0    2
  TT      0    1
  UA      0    1
  UK      0    6
  US      2    6
  ZA      1    1

Think of this as a tabulation of tabulations: first R splits up the table by bornCountryCode, then splits up the result by gender before giving us the count. Notice that the result is now not a single row of numbers but many rows (or, more precisely, a two-dimensional array—almost like a data frame).
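
Here is a hedged sketch of what one can do with such a two-dimensional table, using two invented vectors (country and gender below are toy data, not the laureates columns). A cell can be picked out by its row and column names, and base R's addmargins function appends row and column totals.

```r
# Toy data: one entry per imaginary person
country <- c("SE", "SE", "US", "US", "US", "FR")
gender <- c("female", "male", "female", "female", "male", "male")

# Two vectors in, a two-way table out
xt <- table(country, gender)
print(xt)

# Subscript a single cell by row name and column name
us_women <- xt["US", "female"]

# Add a "Sum" row and a "Sum" column of totals
with_totals <- addmargins(xt)
print(with_totals)
```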

Sorting

Let’s return to single (rather than cross) tabulations for a bit. After how many? comes which is most or which is least? Tables (and vectors, in fact) can be rearranged in order by the sort function.

laureate_countries <- table(laureates$bornCountryCode)
sort(laureate_countries)

AT BE BG CH CO CZ EG FI GP GR GT HU IN IR IS LC LT MG MX NG PE PT RO 
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
TT UA CA CL CN JP NO TR ZA    DK ES IE PL RU IT SE UK DE US FR 
 1  1  2  2  2  2  2  2  2  3  3  4  4  5  5  6  6  6  7  8 10 

The sorted result normally goes from least to most, but often the reverse is easier to read. For that, sort is invoked with a named parameter, decreasing:

sort(laureate_countries, decreasing = T)

FR US DE IT SE UK PL RU ES IE    DK CA CL CN JP NO TR ZA AT BE BG CH 
10  8  7  6  6  6  5  5  4  4  3  3  2  2  2  2  2  2  2  1  1  1  1 
CO CZ EG FI GP GR GT HU IN IR IS LC LT MG MX NG PE PT RO TT UA 
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
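
A minimal sketch of sorting a tabulation, on an invented vector of votes (hypothetical data): because a sorted table keeps its labels, names and ordinary vector subscripts are enough to pull out the front-runners.

```r
# Invented votes, with no ties in the tallies
votes <- c("tea", "coffee", "tea", "water", "coffee", "tea")
tally <- table(votes)

ascending <- sort(tally)                     # least to most
descending <- sort(tally, decreasing = TRUE) # most to least

# The labels travel with the counts, so the first two names
# of the descending table are the two most frequent values
top_two <- names(descending)[1:2]
print(top_two)
```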

Exercise (4)

Write an expression for the top three countries-of-death of the Nobel laureates. This is a trick question. The solution is in the answers section.

More, and messier, data

Let’s count something else, something we really couldn’t just eyeball. The Modernist Journals Project has provided text-formatted tables of item metadata for some of the periodicals they have digitized. I downloaded these tabulations for Poetry and The Crisis from http://sourceforge.net/projects/mjplab/files/ and included them in the sample data archive. But as MJP may update the data, please re-download from that link if you are using this data for research of any kind.

Let’s start just by seeing whether MJP has provided us with a nice CSV format we can use right off. The readLines function looks for a file on disk and returns it as a vector of lines. Normally you want the whole file, but for our purposes we can specify the named parameter n to get just the first few lines:

readLines("Poetry_2.everytitle.txt", n = 4)
[1] "creator | editor | translator | title | genre | pages | volume | issue | date | journal title | journal subtitle | issue name | journal editor | publisher | journal location | issue length (pp) | issue height (cm) | issue width (cm)"
[2] "Ficke, Arthur Davison |  |  | Poetry | poetry | 1-2 | 1 | 1 | 1912-10-01 | Poetry | A Magazine of Verse |  | Monroe, Harriet | Harriet Monroe | Chicago | 40 | 20 | 14.7      "                                                          
[3] "Moody, William Vaughan |  |  | I Am the Woman | poetry | 3-6 | 1 | 1 | 1912-10-01 | Poetry | A Magazine of Verse |  | Monroe, Harriet | Harriet Monroe | Chicago | 40 | 20 | 14.7      "                                                 
[4] "Pound, Ezra |  |  | To Whistler, American | poetry | 7-7 | 1 | 1 | 1912-10-01 | Poetry | A Magazine of Verse |  | Monroe, Harriet | Harriet Monroe | Chicago | 40 | 20 | 14.7      "                                                     

!?!!! This is a delimited text file of data, but it isn’t comma-delimited. Instead it uses | as the delimiter. A close reading of help(read.csv) and some experimentation yielded me the following command, which uses read.table, a variant of read.csv that can deal with files that use things other than commas:

poetry_titles <- read.table("Poetry_2.everytitle.txt", 
    sep = "|", strip.white = T, stringsAsFactors = F, 
    quote = "", header = T)

That’s probably the murkiest line of R code in this workshop.10

crisis_titles <- read.table("Crisis_2.everytitle.txt", 
    sep = "|", strip.white = T, stringsAsFactors = F, 
    quote = "", header = T)
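
To experiment with these parameters without the MJP files on hand, one can feed read.table a few lines directly through its text parameter. The following is a sketch under stated assumptions: the three lines are a cut-down imitation of the file's layout, with only four columns, and the second data row (Doe, Jane) is invented.

```r
# A miniature pipe-delimited "file" held in a character vector.
# The column set is abbreviated and the second data row is made up.
raw_lines <- c(
    "creator | title | genre | journal title",
    "Pound, Ezra | To Whistler, American | poetry | Poetry",
    "Doe, Jane | A Reader's Note | articles | Poetry"
)

items <- read.table(text = raw_lines,
    sep = "|",                 # fields are separated by |
    strip.white = TRUE,        # trim the spaces around each field
    stringsAsFactors = FALSE,  # keep strings as strings
    quote = "",                # treat ' and " as ordinary characters
    header = TRUE)             # the first line holds column names

print(names(items))
print(items$title)
```

The commas inside fields cause no trouble, since the delimiter is |, and the apostrophe in the invented title survives because quote = "" turns off quote matching. The header journal title comes back as journal.title, because read.table turns column headers into syntactically valid R names.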

A comparison

Now that we have the two tables of items from the two magazines, let’s begin to compare them. One of the most interesting metadata fields is the genre assigned by the TEI encoders to each item. Let’s compare genre proportions.

table(poetry_titles$genre)/nrow(poetry_titles)

 advertisements        articles          images         letters 
      0.0284599       0.2683039       0.0002295       0.0257058 
letters; poetry          poetry 
      0.0002295       0.6770714 
table(crisis_titles$genre)/nrow(crisis_titles)

advertisements       articles          drama        fiction 
     0.0697295      0.3383862      0.0009328      0.0233209 
        images        letters         poetry 
     0.4626866      0.0496735      0.0552705 

Combine and recount

From here on it will be easier to have one data frame that combines the two tables. Since they have the same columns, we can simply “stack” one on top of the other. rbind is R’s function for stacking data frames.

mags <- rbind(poetry_titles, crisis_titles)

Since the MJP data has scrupulously recorded the journal title for every item, it’s easy to cross-tabulate genres by journals:

table(mags$genre, mags$journal.title)
                 
                  Crisis Poetry
  advertisements     299    124
  articles          1451   1169
  drama                4      0
  fiction            100      0
  images            1984      1
  letters            213    112
  letters; poetry      0      1
  poetry             237   2950

(Fussily, R has renamed what the original file called journal title to journal.title.)
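
The stack-then-recount move can be sketched on two tiny data frames. Both poetry_toy and crisis_toy are invented stand-ins with made-up creators; the only thing they share with the real tables is that their column names match, which is what rbind needs.

```r
# Two toy data frames with identical column names (all values invented)
poetry_toy <- data.frame(
    creator = c("Author A", "Author B"),
    genre = c("poetry", "articles"),
    journal.title = c("Poetry", "Poetry"),
    stringsAsFactors = FALSE
)
crisis_toy <- data.frame(
    creator = c("Author C", "Author A"),
    genre = c("poetry", "poetry"),
    journal.title = c("Crisis", "Crisis"),
    stringsAsFactors = FALSE
)

# rbind stacks the rows of the second frame under those of the first
mags_toy <- rbind(poetry_toy, crisis_toy)

# One cross-tabulation now covers both journals
genre_by_journal <- table(mags_toy$genre, mags_toy$journal.title)
print(genre_by_journal)
```

Notice that the cross-tabulation fills in a zero for a combination that never occurs (articles in the toy Crisis), just as the real table does for drama in Poetry.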

Who’s in both?

Earlier on we saw the %in% operator. Here’s a chance to apply it (returning to our separate data frames for the two journals):

poetry_in_crisis <- poetry_titles$creator %in%
    crisis_titles$creator

That gives a logical vector which we can use as a subscript:

shared_auths <- poetry_titles$creator[poetry_in_crisis]

If you print shared_auths you will see repeated names, since, remember, the table is of item metadata. If we want a list of names where each occurs only once, we can use the unique function:

unique(shared_auths)
[1] ""                         "Lindsay, Nicholas Vachel"
[3] "Noyes, Alfred"            "Kreymborg, Alfred"       
[5] "Anonymous"                "Johnson, Fenton"         
[7] "Cleghorn, Sarah N."      

Whoops! Let’s tidy that up by getting rid of the results for blanks and Anon.

shared_auths <- unique(shared_auths[shared_auths != "" &
                          shared_auths != "Anonymous"])

Now we can take a look in our table mags in order to tally up the activities of these authors who contributed to both periodicals:

mags_shared <- mags[mags$creator %in% shared_auths, 
    ]
table(mags_shared$journal.title, mags_shared$genre, 
    mags_shared$creator)
, ,  = Cleghorn, Sarah N.

        
         articles fiction letters poetry
  Crisis        1       0       0      0
  Poetry        0       0       0      1

, ,  = Johnson, Fenton

        
         articles fiction letters poetry
  Crisis        0       3       0      4
  Poetry        0       0       0      5

, ,  = Kreymborg, Alfred

        
         articles fiction letters poetry
  Crisis        0       0       0      1
  Poetry        6       0       1     14

, ,  = Lindsay, Nicholas Vachel

        
         articles fiction letters poetry
  Crisis        0       1       1      0
  Poetry        2       0       0      6

, ,  = Noyes, Alfred

        
         articles fiction letters poetry
  Crisis        0       0       0      2
  Poetry        0       0       0      1

This, notice, is a three-way contingency table. R shows it to us as a series of two-way contingency tables, one for each “creator” of items.11
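
The %in%, subscript, and unique steps of this section can be retraced on two short invented vectors of creator names (poetry_creators and crisis_creators are hypothetical). The sketch also shows an alternative I did not use above: base R's set functions intersect and setdiff, which arrive at the same list of names.

```r
# Invented creator columns, including a blank and an "Anonymous"
poetry_creators <- c("", "Author A", "Author B", "Author A",
    "Anonymous", "Author C")
crisis_creators <- c("Author C", "", "Anonymous", "Author A", "Author D")

# One TRUE/FALSE per element of the left-hand vector
in_both <- poetry_creators %in% crisis_creators

# Subscript, drop the placeholders, then remove repeats
shared <- poetry_creators[in_both]
shared_clean <- unique(shared[!(shared %in% c("", "Anonymous"))])
print(shared_clean)

# The same names via set operations, which discard duplicates themselves
via_sets <- setdiff(intersect(poetry_creators, crisis_creators),
    c("", "Anonymous"))
print(via_sets)
```

The logical-vector route has one advantage for our purposes: in_both lines up with the rows of the data frame, so it can be reused to pull out whole rows rather than just names.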

From tables back to data frames

Now I’ve been evasive about just what kind of value the table function returns. It looks like tabular data, it’s called a “table,” but is it a data frame? No, it’s actually another member of R’s bestiary of complex types, namely, a table. Well that’s helpful. For practical purposes what matters is learning how to convert a table to a data frame so that we can do everything we know how to do to data frames. R provides a function for this, as.data.frame:

laur_country_tab <- table(laureates$bornCountryCode)
laureate_countries <- as.data.frame(laur_country_tab)

Now print out laureate_countries to see what this data frame looks like. You might notice that the column headers are the supremely undescriptive Var1, Freq. Assign new column names using the following syntax12:

names(laureate_countries) <- c("country", "count")

Now laureate_countries is a data frame with two columns containing the tabulated counts, which you can explore as we have been exploring the untabulated data.
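
Here is the whole round trip on a toy column, as a sketch (the animals data frame is invented). It records the default column names before replacing them, and then treats the result like any other data frame. The stringsAsFactors parameter of as.data.frame is used here so that the labels stay strings; leave it off and they become a factor.

```r
# Invented data: one row per pet sighting
animals <- data.frame(pet = c("cat", "dog", "cat", "bird", "cat"),
    stringsAsFactors = FALSE)

pet_tab <- table(animals$pet)

# Table to data frame; keep the labels as character strings
pet_counts <- as.data.frame(pet_tab, stringsAsFactors = FALSE)
default_names <- names(pet_counts)  # the undescriptive defaults
print(default_names)

names(pet_counts) <- c("pet", "count")

# Ordinary data frame subscripting now applies
cats <- pet_counts[pet_counts$pet == "cat", "count"]
frequent <- pet_counts[pet_counts$count > 1, ]
print(frequent)
```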

Visualization, grammatically

The last part of this workshop (which we didn’t get to do on April 30) introduces one more way to explore tabular data, and especially counted-up tabular data: visualization. Here are two principles for thinking about visualization in this context.

  1. A visualization transforms data inputs into graphical outputs. (Sound familiar?).

  2. A grammatical visualization consistently transforms dimensions of the data into aesthetic dimensions of the output.

R users can avail themselves of a very powerful software library for making grammatical visualizations, Hadley Wickham’s ggplot2. Once you’ve learned the basics of R data types and making data frames, you can start making plots with ggplot.

Loading the library

Much of what makes R useful is not part of the basic program you got when you installed R. Instead, you can obtain extra source code that extends R by adding new functions you can use in your own R code. ggplot is one such “R package.” It is easy to obtain: in RStudio, choose “Install Packages…” from the “Tools” menu, type in “ggplot2,” and click “Install.”

Once ggplot2 has been installed, it can be loaded with the following function call:

library("ggplot2")

Making a point (plot)

Let’s start by thinking through a simple point plot. Here’s some new data: a table in which each row gives the number of translations published in the United States in the given year, according to the UNESCO Index Translationum.

us_tx <- read.csv("us-trans.csv")
us_tx
   year translations
1  1979         1634
2  1980         1386
3  1981         1201
4  1982         1174
5  1983         1250
6  1984         1706
7  1985         1574
8  1986         1717
9  1987         1653
10 1988         1846
11 1989         1856
12 1990         1906
13 1991         2074
14 1992         2167
15 1993         2185
16 1994         1915
17 1995         2081
18 1996         1998
19 1997         1936
20 1998         2023
21 1999         1888
22 2000         1343
23 2001         1343
24 2002         1494
25 2003         1356
26 2004         1666
27 2005         2057
28 2006         2289
29 2007         2195
30 2008         1431

The plot grammar

Here is the “grammar” of the point plot:

  1. Years on the x-axis, from left to right
  2. Number of translations on the y-axis, with 0 on the bottom
  3. For each row of data, draw a point.

The code

ggplot’s qplot function requires a data frame and a specification of the plot grammar. The specification is done using named parameters to the function:

qplot(x=year,       # aesthetics (mapping)
      y=translations,
      geom="point", # geometry (shape)
      data=us_tx)   # data source

Our data frame has columns year and translations. So we tell qplot we want x to be year, and y to be translations. The last part of our specification was the decision to draw a point for each row of data. This is set by the geom="point" parameter. Finally, we tell qplot what data frame to work on using the data=us_tx parameter.13

Notice that qplot goes on to make a lot of further choices for us: it picks where to start and end the x and y axes and how many “ticks” to label along each axis; it adds a shaded grid to help you read off numbers from the chart; indeed, it’s made a choice about the size of the point it draws. qplot offers a bazillion parameters for adjusting all of these things, but one of its virtues, when it comes to starting out with counting, is that its default guesses are often pretty darn good. So you can wait to learn about how to tweak the visualization until you have gotten your bearings just getting the durn thing to make plots.

Conjugating the plot

Points don’t make the year-to-year trend particularly easy to see. From the grammatical perspective, we can think about other choices of “geom” without changing our decision about how to map x and y. I think of this as the grammar of conjugating a plot in different shapes.

Data over time are often shown with a line:

qplot(x=year,
      y=translations,
      geom="line",  # change the shape
      data=us_tx)

Since we’re counting things, we might also want to fill in the area below the line down to zero. This gives a “filled area” plot:

qplot(x=year,
      y=translations,
      geom="area", # change the shape yet again
      data=us_tx)

It’s worth thinking about what the different plots emphasize differently. One further shape possibility that might have occurred to you is a bar plot. As you might hope, to do this you pass geom="bar", but in this case one extra parameter to qplot is also needed:

qplot(x=year,
      y=translations,
      geom="bar",
      stat="identity",
      data=us_tx)

For now, just take this as a quirk of qplot: to say “I want a bar plot,” you have to say geom="bar",stat="identity".14
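
A caution for readers on a more recent ggplot2 than the one used in this workshop: as far as I know, newer releases have deprecated qplot, together with its stat= and position= arguments, so the calls above may produce warnings or ignore those arguments. The sketch below shows what I take to be the equivalent in the package's fuller syntax, where the data and mappings go to ggplot and the shape is added with a geom_ function. The three rows are copied from the top of the us_tx table; the name us_tx_sample is mine.

```r
library("ggplot2")

# First three rows of the translation counts shown earlier
us_tx_sample <- data.frame(
    year = c(1979, 1980, 1981),
    translations = c(1634, 1386, 1201)
)

# Data source and aesthetic mappings, stated once
base_plot <- ggplot(us_tx_sample, aes(x = year, y = translations))

# "Conjugate" the same mappings in different shapes
p_point <- base_plot + geom_point()
p_line <- base_plot + geom_line()
p_area <- base_plot + geom_area()
# geom_col draws bars whose heights are the y values themselves,
# which is the job stat="identity" does in the qplot call
p_bar <- base_plot + geom_col()

# Nothing is drawn until a plot is printed: type p_bar at the
# console, or call print(p_bar) in a script
```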

Scales, in general

It is fairly straightforward for us to figure out how to map a quantity like “number of translations” to a spatial dimension (y). Not all aesthetic mappings are so obvious. In particular, how do we map categorical data into the visual?

ggplot tries hard to do what you ask. If you tell it that x or y should be mapped from a categorical value, it will make its best guess.

So let’s return to our data frame counting up Nobel laureates by country of birth, laureate_countries, and consider:

qplot(x=country,y=count,geom="point",
      data=laureate_countries)

What is the grammar of this plot?

  1. ???
  2. laureate count on y axis
  3. point for each country

The “best guess”, here, was to arrange the country codes in alphabetical order along the x axis. This is not terrible, though given the number of countries, it’s pretty hard to read, except maybe to notice that France (FR) is champ. Even there, the point is so far from the x axis that it takes work to match the point to the country. That we could fix by using a different shape. Let’s try bars, not omitting the magical stat="identity":

qplot(x=country,y=count,
      geom="bar",
      stat="identity",
      data=laureate_countries)

That’s a little better, though not much; there are so many country codes that they get squashed together here. In RStudio, you can click the “zoom” button in the plot pane to see a bigger version of the plot. (Look for RStudio’s convenient buttons for saving plots as well.)

Dating

Now let’s do a little visualization of tallies from our periodical metadata set. A basic counting question: did the number of articles published in issues of Poetry change over time?

First we have to create the necessary data frame:

poetry_articles <- poetry_titles[poetry_titles$genre == 
    "articles", ]
art_series <- as.data.frame(table(poetry_articles$date))
names(art_series) <- c("date", "count")

Now we are in a position to plot the series:

qplot(x = date, y = count, geom = "bar", stat = "identity", 
    data = art_series)

Notice something strange has happened on the x axis. What type of data is art_series$date?

art_series$date[1]
[1] 1912-10-01
123 Levels: 1912-10-01 1912-11-01 1912-12-01 1913-01-01 ... 1922-12-01

As far as R knows, the date is a factor (which it has “cleverly” converted from its original string format). Now fortunately the convention used for notating dates here ensures that alphabetical order is also chronological order (why?), but qplot does not know that: as far as it’s concerned, art_series$date is a categorical variable.

R has a specialized data type for dates, however, and a function for turning strings in YYYY-MM-DD format into that type. Here’s how we do that, adding a new column to our art_series data frame:

art_series$converted_date <- as.Date(art_series$date)

Now try:

qplot(x = converted_date, y = count, geom = "bar", 
    stat = "identity", data = art_series)

qplot now understands that x is representing a date, and labels the axis more sensibly. Whether the plot tells us something intelligible about the changing editorial policies of Poetry magazine is another question.15
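
The difference between a factor that looks like a date and a real Date can be checked without any plotting. This is a minimal sketch on four invented issue dates (the vector is hypothetical, not taken from the MJP data):

```r
# Invented issue dates, tabulated as in the text
date_tab <- table(c("1912-10-01", "1912-11-01", "1912-10-01", "1913-01-01"))
series <- as.data.frame(date_tab)
names(series) <- c("date", "count")

# The labels of a table become a factor in the data frame
print(class(series$date))

# as.Date parses YYYY-MM-DD labels into R's Date type
series$converted_date <- as.Date(series$date)
print(class(series$converted_date))

# Dates support arithmetic that categories do not:
# the number of days between the earliest and latest issue
span <- diff(range(series$converted_date))
print(span)
```

This is why the axis improves: once x is a Date, distances along the axis mean elapsed time, whereas the levels of a factor are just evenly spaced labels.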

Counting in more than one dimension

We’ve already seen two- and even three-way contingency tables. How are these to be plotted? Let’s use our periodical data, first for Poetry alone and later the combined mags data frame on Poetry and The Crisis, to slot in one more piece of the visualization puzzle.

First, as before, we construct a data frame from a table, this time with counts by date and genre. Thus each row answers the question, “How many of this kind of item were published in Poetry on this date?”

poetry_genre_series <- as.data.frame(table(poetry_titles$date,
    poetry_titles$genre))
names(poetry_genre_series) <- c("date","genre",
    "count")
# Convert the string-format dates to R's Date type 
poetry_genre_series$conv_date <- as.Date(poetry_genre_series$date)
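
It is worth seeing what shape this conversion produces before plotting it. In the sketch below, on three invented items (dates and genres are toy vectors), the two-way table becomes a “long” data frame with one row for every combination of date and genre, including combinations that never occurred, which get a count of zero.

```r
# Three invented items: a date and a genre for each
dates <- c("1912-10-01", "1912-10-01", "1912-11-01")
genres <- c("poetry", "articles", "poetry")

# Two dates times two genres gives four rows, one per cell of the table
long <- as.data.frame(table(dates, genres))
names(long) <- c("date", "genre", "count")
print(long)

# The combination that never occurs is still present, with count 0
no_articles <- long$count[long$date == "1912-11-01" &
    long$genre == "articles"]
print(no_articles)
```

Those explicit zeros matter for the plots that follow: a line for a genre drops to zero in an issue where the genre is absent, rather than skipping over the gap.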

How to count this data in a plot? Thinking grammatically, we want to add a new dimension to our visual mappings: we will use the two spatial dimensions for time and counts, as before, but now we will indicate a categorical variable using another aspect of the visual: color. Let’s start with a point plot that shows how many items of each genre were published in each issue of Poetry:

  1. Issue dates on the x axis, left to right
  2. Item counts on the y axis, bottom to top
  3. Distinguish genres by color
  4. One point for each row of the table

qplot(x=conv_date,y=count,color=genre,geom="point",
      data=poetry_genre_series)

That tells us a few things, and reveals that qplot will produce a legend for us once we introduce a color= visual mapping. But it could be made easier to read. One possibility would be to use a connected line rather than points. geom="line" is what we need…but we also have to tell qplot which points to connect using the group= parameter.

qplot(x=conv_date,y=count,color=genre,group=genre,
      geom="line",data=poetry_genre_series)

This noisy plot helpfully indicates that Poetry did indeed consistently publish mostly poetry items over time, though it might also help you pick out some interesting issues to look at in which the generic mixture is unusual.

To aid that, however, the lines are not as informative as a visual grammar that allows you to make a clearer comparison between genres in each year. One possibility for that would be to use bars, but to stack the bars for genres on top of one another.

We already know to tell qplot geom="bar",stat="identity". Two other parameters have to change. To specify the color of the bars, one uses fill= rather than color=. (Graphics R fun.) To specify stacked bars, one uses position="stack", which is at least not totally opaque.

qplot(x = conv_date, y = count, fill = genre, geom = "bar", 
    stat = "identity", position = "stack", data = poetry_genre_series)
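
For readers whose ggplot2 no longer accepts stat= and position= in qplot (which, to my knowledge, is the case in recent releases), here is a sketch of the grouped line plot and the stacked bar plot in the package's fuller syntax. The data frame toy_series is invented; its columns simply imitate those of poetry_genre_series.

```r
library("ggplot2")

# Invented counts for two issues and two genres
toy_series <- data.frame(
    conv_date = as.Date(c("1912-10-01", "1912-10-01",
        "1912-11-01", "1912-11-01")),
    genre = c("articles", "poetry", "articles", "poetry"),
    count = c(2, 5, 1, 7)
)

# One line per genre: color distinguishes the lines, and group
# says which points are joined together
p_lines <- ggplot(toy_series,
        aes(x = conv_date, y = count, color = genre, group = genre)) +
    geom_line()

# Stacked bars: fill colors the interior of each bar segment, and
# position = "stack" piles the genres on top of one another
p_stacked <- ggplot(toy_series,
        aes(x = conv_date, y = count, fill = genre)) +
    geom_col(position = "stack")
```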

Small multiples

So far, we’ve gotten three columns of data onto a single plot. Though it’s possible to squeeze in more, ggplot gives us another option that is often more useful. This is the technique of small multiples: make multiple copies of the plot for different slices of the data.

So, we could redo our plot of genres in Poetry as a row of plots, one for each genre. Think of it as the visual equivalent of embedding in grammar:

  1. Make a plot for each genre, arranging the plots from left to right in alphabetical order of genre, in which:
    1. issue dates are on the x axis
    2. counts of items are on the y axis
    3. a bar is drawn for each issue date

ggplot refers to this as “faceting”:

qplot(x=conv_date,y=count,
      geom="bar",
      stat="identity", # as for single plot
      facets= . ~ genre, # faceting
      data=poetry_genre_series)

The bizarre expression . ~ genre is a special formula value. In this context, it means “one row, with plots for each value of genre.” For a vertical column of plots, you’d use genre ~ . instead.

But why stop there? If a two-way contingency table was represented as a single row of plots, we can represent a three-way contingency table as a table of plots. We have a combined data frame, mags, for Poetry and The Crisis. Let’s count up items by genre in the two journals:

genre_series <- as.data.frame(table(mags$date,
    mags$genre,mags$journal.title))
names(genre_series) <- c("date","genre","journal",
    "count")
## Convert the string-format dates to R's Date type 
genre_series$conv_date <- as.Date(genre_series$date)

Now we can make a collection of plots, with rows of plots for each genre and two columns of plots, one for each of the two journals:

qplot(x=conv_date,y=count,group=genre,
      facets=genre ~ journal,geom="bar",
      stat="identity",data=genre_series)

This particular graphic might or might not suggest some avenues for further investigation, though to me the main thing it shows is that sometimes a big plot showing all the data doesn’t tell you more than a smaller table might:

table(mags$genre, mags$journal.title)
                 
                  Crisis Poetry
  advertisements     299    124
  articles          1451   1169
  drama                4      0
  fiction            100      0
  images            1984      1
  letters            213    112
  letters; poetry      0      1
  poetry             237   2950

The overall generic mixtures of these two periodicals, and the importance of the image in the tallies of The Crisis, are perhaps the main facts of interest here. But perhaps the shifts over time suggest possibilities too.
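
The two faceting layouts of this section, a single row of plots and a full grid, can also be sketched in ggplot2's fuller syntax, where the formula is handed to facet_grid. This may be the safer form on recent ggplot2 releases, in which, as far as I know, qplot is deprecated. The data frame toy_mags is invented and only imitates the columns of genre_series.

```r
library("ggplot2")

# Invented counts: 2 dates x 2 genres x 2 journals = 8 rows
toy_mags <- data.frame(
    conv_date = as.Date(rep(c("1912-10-01", "1912-11-01"), times = 4)),
    genre = rep(rep(c("articles", "poetry"), each = 2), times = 2),
    journal = rep(c("Crisis", "Poetry"), each = 4),
    count = c(3, 4, 1, 2, 2, 1, 5, 7)
)

# A single row of plots, one per genre, for one journal only
poetry_only <- toy_mags[toy_mags$journal == "Poetry", ]
p_row <- ggplot(poetry_only, aes(x = conv_date, y = count)) +
    geom_col() +
    facet_grid(. ~ genre)

# A table of plots: one row per genre, one column per journal
p_grid <- ggplot(toy_mags, aes(x = conv_date, y = count)) +
    geom_col() +
    facet_grid(genre ~ journal)
```

The formula reads as rows ~ columns, with a lone . standing for “nothing to split on in this direction.”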

Counting on: further reading

Introductions to R

Navarro, Daniel. Learning Statistics with R. http://health.adelaide.edu.au/psychology/ccs/teaching/lsr/. Pts. 2–3. This (free draft) introductory statistics textbook for psychology students includes an especially lucid introduction to R.

Jockers, Matthew. Text Analysis with R for Students of Literature. Springer, forthcoming. This textbook in preparation introduces R with a focus on analyzing literary texts.

The R Project. An Introduction to R. This, from the creators of R, is often frustrating (and tends to assume quite a bit of programming and statistical experience).

Visualization

Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer, 2009. http://dx.doi.org/10.1007/978-0-387-98141-3. Rutgers Library has online access to this quite lucid exposition by ggplot’s author.

Wickham, Hadley. Online documentation for ggplot2. http://docs.ggplot2.org/. Reprehensibly sparse.

Wilkinson, Leland. The Grammar of Graphics. 2nd ed. Springer, 2005. http://link.springer.com/book/10.1007/0-387-28695-0. Rutgers Library has online access to this, the theoretical basis for ggplot.

Solutions to exercises

1. Canada’s laureate

paste(laureates[1, "firstname"], laureates[1, "surname"])
[1] "Alice Munro"

2. Exiles and émigrés

laureates[laureates$bornCountryCode
              != laureates$diedCountryCode,
          c("surname","year")]
           surname year
1            Munro 2013
2              Yan 2012
3      Tranströmer 2011
4     Vargas Llosa 2010
5           Müller 2009
6        Le Clézio 2008
7          Lessing 2007
8            Pamuk 2006
10         Jelinek 2004
11         Coetzee 2003
12         Kertész 2002
13         Naipaul 2001
14        Xingjian 2000
15           Grass 1999
16        Saramago 1998
17              Fo 1997
20              Oe 1994
21        Morrison 1993
22         Walcott 1992
23        Gordimer 1991
27         Brodsky 1987
28         Soyinka 1986
29           Simon 1985
32  García Márquez 1982
33         Canetti 1981
34          Milosz 1980
36          Singer 1978
38          Bellow 1976
41           White 1973
42            Böll 1972
45         Beckett 1969
47        Asturias 1967
48           Agnon 1966
51         Seferis 1963
53          Andric 1961
54           Perse 1960
57           Camus 1957
58         Jiménez 1956
66           Eliot 1948
68           Hesse 1946
69         Mistral 1945
76           Bunin 1933
79           Lewis 1930
80            Mann 1929
81          Undset 1928
84            Shaw 1925
86           Yeats 1923
91       Gjellerup 1917
95       Hauptmann 1912
96     Maeterlinck 1911
100        Kipling 1907
102    Sienkiewicz 1905
104       Bjørnson 1903

Note that this listing also takes in the living laureates (Alice Munro, for one): to judge from the output, their diedCountryCode is blank, and a blank never matches a country of birth. To restrict the result to laureates who have died, add the condition laureates$diedCountryCode != "" to the row subscript with &.

3. Women and men

table(laureates$gender)

female   male 
    12     94 

4. Countries of death. We’ll always have…

A table can be indexed like a vector. It turns out, however, that the number one “country” is a blank (the living laureates). To see the top three actual countries, we therefore ask for the first four entries:

sort(table(laureates$diedCountry), decreasing = T)[1:4]

                       France United Kingdom            USA 
            19             17              9              9 

Edited 4/30/14 by AG: added slides link.
Edited 5/19/14 by AG: added workshop notes.


  1. James English, “Everywhere and Nowhere: The Sociology of Literature After ‘the Sociology of Literature,’” NLH 41, no. 2 (Spring 2010): xii–xiii.

  2. Source: requests to api.nobelprize.org. See http://www.nobelprize.org/nobel_organizations/nobelmedia/nobelprize_org/developer. To construct the laureates.csv file, I used the R code in this gist.

  3. Source: a wordcounts CSV file from a https://constellate.org/ request.

  4. Words known to R include function names, variable names, and file paths (particularly handy). These terms are explained further on.

  5. Actually R may not bother to figure out the value until you actually use it, because R is (this is the technical term, really) lazy.

  6. If you want to get fancy, try outer(c(T,F),c(T,F),"&") and the same expression with "|". See help("outer").

  7. By default, string values are transformed into factors, the R type for representing categorical data. This is useful in some cases but for now the difference will be confusing. stringsAsFactors=F ensures that our strings stay stringy.

  8. The alternative approach is to set R’s “working directory” to the folder containing laureates.csv. Do this with the setwd() function (see help("setwd")).

  9. Slight hand-waving here, because table returns a data type we haven’t discussed yet. For practical applications we’re mostly going to see tables that look like data frames if you squint.

    The return type of table is an object of class table, which is a subclass of array. An array is a generalization of a vector to any number of indices; a vector is isomorphic to a 1-dimensional array, a matrix is a 2-dimensional array, etc. Good, I’m glad we cleared that up.

  10. sep specifies the delimiter. It can only be one character, but fortunately the strip.white parameter tells R to get rid of white space before and after delimiters. Because MJP has not escaped single or double quotes, R will choke on apostrophes unless we tell it to take each table entry literally rather than trying to look for pairs of ' everywhere. Thus quote="". stringsAsFactors we’ve seen before. Finally, read.table must explicitly be told that a header line is included with header=T.

  11. The possibility of three-way or n-way contingency tables is the reason a table is an array rather than a vector or data frame. The three-way table can be subscripted with expressions like t[i, j, k].

  12. It is also possible to get more literate column names by making the table with the xtabs function before converting to a data frame. In this case: xtabs(~ bornCountryCode,laureates). But there’s something funny here, which is best left for a later lesson.

    All right, if you really want, I’ll tell you, since it only took me twenty re-readings of the help files to figure it out. xtabs requires that you indicate which columns of the data frame to tabulate with a “one-sided formula” of the form ~ col1 + col2 + col3.... It sets the dimnames of the resulting array from the formula, and as.data.frame derives column and rownames from that. I hope you’re happy now.

  13. If you’ve been following carefully, you might be wondering how in the world x=year could be an acceptable function parameter. There is no variable named year, so how does R know what values to plot? Shouldn’t it be x="year" or x=us_tx$year or something like that? The answer is that qplot pretends that each of the columns of the data frame specified by the data parameter is a variable in its own right: when it goes to figure out the value of its x parameter, it will be able to look up year in the data frame. The R-speak way of saying this is that the parameters are “evaluated in the data frame.” (If you’ve been following very closely, you might even have a guess as to why R doesn’t give an object 'year' not found error before qplot has a chance to evaluate it in the data frame.)

  14. The reason for this lies in an aspect of the grammar I passed over. Sometimes we transform the data before mapping it into the visual dimension. This transformation intervening between the data and the visual mapping is known to ggplot as the “stat.” For example, sometimes we count up how often a given value occurs: this is what we have been doing with the table function, but qplot will do this for you automatically if you supply the raw, untallied data and then use the bin stat. In fact, since bar plots are so commonly used to show tallies of this kind, by default when you set geom="bar" qplot assumes stat="bin" and maps y to the tallies of x values.

    If this seems obscure, try qplot(x=gender,geom="bar",data=laureates). Notice that there’s no explicit y mapping.

    In the case of our yearly translation counts, however, we do not want the heights of the bars to correspond to the number of times a given count of translations occurs in the data! We just want the height of the bar to be equal to the number of translations. This is the simplest stat of all, the identity transformation, which leaves all values unchanged. But we have to set it explicitly using stat="identity". In fact, if you leave off this parameter, qplot will give you an error message that tries to tell you what I’ve just told you but with even less clarity.

  15. The unsightly striation (random white space between the bars here and there) is one of the few moments when ggplot’s defaults let you down, visually.