Taught by Andrew Goldstone, Department of English, Rutgers University–New Brunswick
Wednesday, April 30, 2014
4:30 p.m.–6:30 p.m.
Alexander Library, Room 413
169 College Avenue, New Brunswick, NJ
With the increasing prominence of the digital humanities, humanists are once again asking themselves whether they can make use of the computer’s most fundamental capacity: its ability to count. This workshop introduces some of the methodological choices required for computational counting: what representations of data are suitable for machine processing? Once you have such a representation, how can you begin to analyze it? We will make these questions concrete through an introduction to R, which is both a programming language and a software environment for data analysis. We focus on the nature of computational thinking: the scholar’s work of representing and analyzing data on the computer is a process of highly disciplined expression. We will work together to analyze some samples of data (bibliographic data and word-use data), from loading files in the ubiquitous comma-separated value (CSV) format, to searching and tabulating data programmatically, to the “grammar” of basic visualization.
No programming experience required. Patience, however, helps.
Notes from the workshop
The following is a somewhat edited and amplified write-up (by Andrew Goldstone) of the workshop notes. The original slides from the workshop are also available. To work through the below would take approximately two and a half hours.
Shall we count?
The aim of this workshop is to provide some foundations for thinking about counting as a way of answering some questions we have in our disciplines. Foundations, and not just practical instruction in tool-using, matter because there is a lot at stake in the decision to explore counting methods. James English writes about literary study:
Academic disciplines (and even interdisciplines or hybrids) are relational entities; they must define themselves by what they are not. And what literary studies is not is a “counting” discipline. This negative relation to numbers is traditional—foundational, even—and it has not been seriously challenged by the rise of interdisciplinarity…. Literary studies has shouldered much of the burden of…defending qualitative models and strategies against the naïve or cynical quantitative paradigm that has become the doxa of higher-educational management. Under these institutional circumstances, antagonism toward counting has begun to feel like an urgent struggle for survival.1
This workshop focuses on tables. It’s worth thinking about what can and cannot be represented in a table, but the table is a ubiquitous model for keeping track of data. When you think of tables and computers, you might think of Excel spreadsheets. Unfortunately, the Excel format is both too complex and too opaque to allow for direct programmatic manipulation, except through the very awkward mechanisms of Excel’s own native languages and macros. That format is also proprietary and vulnerable to problems when it comes to sharing and archiving. For my own purposes, I have found myself spending the most time working on tables in the CSV (comma-separated values) format. Excel and any other spreadsheet program can save your spreadsheets as “CSV” or “Text CSV” (though, as you will see, this entails a drastic, though productive, simplification of the format of the data). Here is a small table in CSV format:
firstname,surname,bornCountry
Alice,Munro,Canada
Mo,Yan,China
Tomas,Tranströmer,Sweden
The norms of CSV
- plain-text file for tabular data
- delimiter separates columns (usually , or a tab)
- newline separates rows
- names of columns in first row (optional)
- tricky bits:
- what if a data point contains a comma?
- what if a data point contains a quotation mark?
- what text-encoding should be used?
- how do you know what rules have been followed? (There is RFC 4180, but no promises.)
People as a table
Let’s look at some more elaborate CSV-format data. In the sample files, look for laureates.csv and open it in RStudio using the Open File command (or open it in a text editor. For more on text editors, see the notes for the workshop on digital text.)2
id,firstname,surname,born,died,bornCountry,bornCountryCode,bornCity,diedCountry,diedCountryCode,diedCity,gender,year
892,Alice,Munro,1931-07-10,0000-00-00,Canada,CA,Wingham,,,,female,2013
880,Mo,Yan,0000-00-00,0000-00-00,China,CN,Gaomi,,,,male,2012
868,Tomas,Tranströmer,1931-04-15,0000-00-00,Sweden,SE,Stockholm,,,,male,2011
854,Mario,"Vargas Llosa",1936-03-28,0000-00-00,Peru,PE,Arequipa,,,,male,2010
844,Herta,Müller,1953-08-17,0000-00-00,Romania,RO,"Nitzkydorf, Banat",,,,female,2009
832,"Jean-Marie Gustave","Le Clézio",1940-04-13,0000-00-00,France,FR,Nice,,,,male,2008
817,Doris,Lessing,1919-10-22,2013-11-17,"Persia (now Iran)",IR,Kermanshah,"United Kingdom",UK,London,female,2007
808,Orhan,Pamuk,1952-06-07,0000-00-00,Turkey,TR,Istanbul,,,,male,2006
801,Harold,Pinter,1930-10-10,2008-12-24,"United Kingdom",UK,London,"United Kingdom",UK,London,male,2005
Notice the large number of conventional choices implied by this table: quotation marks to surround items containing commas or spaces; dates in YYYY-MM-DD format; “still living” represented as a 0000-00-00 date; arbitrary ID numbers; gender coded as female or male… None of these codings are described explicitly; CSV has very limited accommodation for metadata (just the column names). The rest of the metadata has to live in a separate file. Working in this format means keeping careful track of choices for how categories have been coded.
Text as a table
Here is part of a tabular representation of a scholarly article3:
WORDCOUNTS,WEIGHT
the,766
of,482
and,305
in,259
to,224
a,195
new,101
as,101
that,86
it,75
This so-called bag of words indicates only the number of times each type of word occurs in the article (according to JSTOR’s OCR), regardless of order. For this purpose, the CSV format is quite amenable. Notice what has been discarded: not just word order but punctuation, page layout, typography… (What dimensions of the page could be tabulated by extending the table?)
It is worth spending some time thinking about what can and cannot be accommodated in a data format like CSV. Let’s make a rough typology of some of the kinds of data we might be interested in counting up.
- Whole numbers (integer scale). How many (books, people, words, genres…)?
- Real numbers (interval scale). How much (distance, time, money…)? Special cases:
- percentages or proportions (ratio scale). How much of the total (population, corpus of texts…)?
- dates. When? (And does the day, month, year, decade, century… matter?)
- Unordered. Which of… (languages, nations, genders(?))? Special cases:
- binary or Boolean category: true or false, yes or no.
- many categories (headwords in the dictionary, authors in the catalogue).
- Ordinal. Which (letter of the alphabet, sales rank, “like, dislike, or neutral”)?
Categories may be represented by numbers, often in more than one way:
- true: 1, false: 0
- like: 1, neutral: 0, dislike: -1
- like: 2, neutral: 1, dislike: 0
- a: 1, b: 2, c: 3… (character encoding)
With these atoms of data, we can then make more complex forms:
The list / the series
As in a series, perhaps, of percentages:
17.5, 3.0, 11.0, 13.0, 8.1, 11.9, 11.0, 3.7, 3.2, 1.5
The list of lists / the table
The table can also be represented as a list of lists (of equal length):
firstname: Alice, Mo, Tomas
surname: Munro, Yan, Tranströmer
bornCountry: Canada, China, Sweden

firstname  surname      bornCountry
Alice      Munro        Canada
Mo         Yan          China
Tomas      Tranströmer  Sweden
You might also think of a table as a list of “cases,” one case per row, where each row is described by the same collection of data (first name, surname, country of birth).
For computational purposes, the central representation of text is in the form of a (looooong) list of characters (a “string”):
O, n, c, e, *space*, u, p, o, n, *space*, a, *space*, t, i, m, e
But other representations exist:
- the bag of words (to: 2, be: 2, or: 1, not: 1)
- content analyses (automated, human, or semi-automated, classifications of texts which can then be tallied and analyzed in turn)
<sp who="#Salinus"><speaker>Duke.</speaker> <p>Haplesse <name>Egeon</name> whom the fates haue markt...</p>
- parsed trees (reflecting grammar—a grammar tree is a classic example of a data format which is impossible to encode in a single table)
- page images (bonus activity: explain how an image can be represented in a table)
Programming in a nutshell
Let’s get counting. This is going to require some programming. What is programming?
- A program is a formal description of a process for transforming data. Composing a program is a matter of expressing what you want to do in a constrained language.
- A computer performs calculations on numbers and stores the results of those calculations.
- If the inputs, outputs, and the formal description can be encoded as numbers, a program can be executed on a computer. At that point the formal description also looks like a recipe of instructions for the computer. In programming, one often switches back and forth between the expressive mode of description and the more machine-focused mode of instruction.
The R experience
The console is the window with the > prompt. In this window, you type an expression, and R figures out its value (and sometimes: stores a value, draws a figure, reads a file from the disk, saves a file on the disk), and tells you. And that’s all.
A script is a set of expressions in a text file, one after another. R goes through and figures out their value one by one. And that’s all.
First steps in the console
R is a parrot
The simplest kind of expression consists of a value. To figure out the value of a value, R doesn’t have to work very hard:
From here on, these notes show, first, the line you can type into R—indicated by the box with a grey background—and then, following it, a second box, with a white background, showing the response R gives. There’s no need to type in the response. (Ignore, for now, the strange [1] you see when R echoes these values back to you. It’s just trying to be friendly.)
"Shiver me timbers"
[1] "Shiver me timbers"
Notice that as you type a ", RStudio fills in the closing " automatically. This can be a little disconcerting but is a useful convenience. You can “overtype” the close quote as well. RStudio will do the same thing with parentheses and brackets.
R gets crabby easily
On the other hand, already we can provide a first introduction to some of the ways you can type things R does not understand. Doing this is a normal part of the work, and hitting glitches and making errors is an important part of a learning process. R is particularly bad at explaining to you why it has not accepted what you typed. It’s worth practicing making R crabby, so you can see that this experience is not the end of the world:
Shiver
Shiver me timbers
help
To escape from these cases, press the escape key (in a terminal, control-C also works).
Some important features
The constrained environment of the interactive prompt (the > where you type a line and press return) is rigorously linear and serial. Once you’ve typed return, you can’t edit the line to fix mistakes. But you can quickly copy over into a new line what you previously typed:
- Use the up and down arrows (or the RStudio History pane) to move through the history of past lines.
- Use the tab key to fill in partly-typed words that are known to R.4
- use the help feature: ?paste displays help on the thing called paste.
R data kinds (“modes”)
R does not enforce the difference between integers and non-integers very rigidly. Most of the time, in R you just think of “numbers” with and without decimal places.
Text comes in strings, surrounded with double quotation marks ("). To include a quotation mark inside a string, escape it with a backslash:
"\"Avast,\" he said"
[1] "\"Avast,\" he said"
"Beware the \\"
[1] "Beware the \\"
Represent a newline with \n and a tab with \t. In all these cases, \ is a special “escape” character indicating that the next character has a special interpretation.
In R, a Boolean value may be TRUE or FALSE, abbreviated T and F for short. One further special type, the factor, used for representing categorical data, is discussed below.
2 * 2
[1] 4
You can use as many spaces or as few as you want.
Now try a logical expression using the operators ==, which means “is equal?,” !=, which means “is not equal?,” and < and > for comparisons:
4 == 3
[1] FALSE
4 > 3
[1] TRUE
4 < 3
[1] FALSE
4 != 3
[1] TRUE
These expressions have Boolean values. Boolean values have their own arithmetic, defined by the operators familiar from catalogue searching: and, or, not. In R these are notated as follows:
(2 > 1) & (1 > 5)
[1] FALSE
(2 > 1) | (1 > 5)
[1] TRUE
!(1 > 5)
[1] TRUE
Functions map inputs to outputs. The syntax is function_name(input) for a function with one input, or function_name(input1, input2) for a function with two. And so on. The inputs can be any expression.
Here are some examples of functions applied to simple values:
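The example lines shown at the workshop are not preserved in these notes; the following stand-ins (sqrt, nchar, and toupper are all base R functions) illustrate the same point:

```r
# Illustrative stand-ins, not necessarily the workshop's originals:
sqrt(16)            # square root of a number
nchar("timbers")    # number of characters in a string
toupper("avast")    # convert a string to upper case
```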
You can guess what these functions do, but you could also look up the official explanation with the ? help feature (for example, ?sqrt).
Here are some examples where a function takes an expression as an input—which might include another function—and so ad infinitum. (This ability to use one function’s output as input to another function—or even the same function—is central to the working of algorithms.)
sqrt(4 * 4)
[1] 4
paste(paste("Alice", "Munro"), "(Canada)")
[1] "Alice Munro (Canada)"
Functions in R have one other way of indicating inputs, called named parameters. They look like this:
paste("Munro", "Alice", sep = ", ")
[1] "Munro, Alice"
paste("Munro", "Alice", sep = "")
[1] "MunroAlice"
Here the third parameter is given the name sep, which has a special role in the paste function (what is it?).
Computers do calculations and store the results. We’ve done calculations; what about storing?
The assignment arrow <- stores a value under a name which you can refer to (or change) later. The format is name <- expression: R figures out the value of expression and stores it under name.5
Once you’ve stored a value under a name, the value of that name is…the stored value. This sounds funny, but try a few examples:
x <- 108
x
[1] 108
x + 2
[1] 110
storage <- 10
storage <- storage - 10
My_Perfectly_Good_Name2012 <- "Mo Yan"
In the “Environment” pane of RStudio, you can watch the results of your assignments: the names and their associated values suddenly appear in the list.
Names can be short or long, but can’t have spaces in them, and they have to start with a letter.
R compound data types
Vectors (for a series of values)
What I described as a series is called a vector in R. Construct a vector with the special function c() (for “combine”):
xs <- c(2, 4, 8)
xs
[1] 2 4 8
bs <- c(T, F, T)
bs
[1]  TRUE FALSE  TRUE
people <- c("Munro", "Mo", "Transtromer")
people
[1] "Munro"       "Mo"          "Transtromer"
c(people, "Vargas Llosa")
[1] "Munro"        "Mo"           "Transtromer"  "Vargas Llosa"
Notice that vectors can hold not just numbers but strings or Booleans. (They aren’t allowed to hold a mixture. For that a different compound type exists, which we won’t say much about, the list).
Once you have a series, how do you pick out parts of it? Choose an element or elements from a vector with a subscript in square brackets: for example, xs[2] picks out the second element of xs. Again, any expression whose value is a meaningful subscript can go in the square brackets:
xs[1 + 1] # a silly example
[1] 4
One special kind of vector has so many uses R has a special way of notating it: this is the sequence, written with a colon between its first and last values:
1:3
[1] 1 2 3
c(1:3, 6:8)
[1] 1 2 3 6 7 8
Now try an experiment. What is the value of these expressions?
people[c(1, 2)]
[1] "Munro" "Mo"
people[c(1, 3)]
[1] "Munro"       "Transtromer"
If the subscript is itself a vector, then we get back a vector, not just a single element.
Now figure out what’s going on here:
people[c(TRUE, FALSE, TRUE)]
[1] "Munro"       "Transtromer"
In the expression v[logic_v], where logic_v is made up of Boolean values and has the same length as v, it acts as a kind of mask applied to the series: only the elements of v corresponding to the TRUE values of logic_v are picked out.
All of the arithmetic discussed above has a vector version. So do many R functions. In general, the idea is do the same thing to each element of the vector (operators apply “elementwise”):
c(1, 3, 5) + c(2, 4, 6)
 3 7 11
c(T, F, F) | c(F, T, F)
 TRUE TRUE FALSE
paste(c("a", "b"), c("c", "d"))
 "a c" "b d"
If the Boolean arithmetic above seemed vague, you can generate “truth tables” that show you the workings of the Boolean operators6:
c(T, T, F, F) & c(T, F, T, F) # truth table for AND
 TRUE FALSE FALSE FALSE
c(T, T, F, F) | c(T, F, T, F) # truth table for OR
 TRUE TRUE TRUE FALSE
!c(T, F) # truth table for NOT
 FALSE TRUE
Another important operator: x %in% y checks whether x is found in the vector y. This is clear enough for the case where x is a single value:
"c" %in% c("b", "c", "d", "e")
[1] TRUE
"a" %in% c("b", "c", "d", "e")
[1] FALSE
Explain to yourself what happens when x is a vector of multiple values:
c("a", "b", "c") %in% c("b", "c", "d", "e")
 FALSE TRUE TRUE
The last vectorial convenience has to do with operations that need vectors of the same length. R lets you supply a single value and then extends it to the length required by the context of an operation or a function. Compare:
c(1, 3, 5) + 1
 2 4 6
c(1, 3, 5) + c(1, 1, 1)
 2 4 6
paste("The", c("beginning", "end"))
 "The beginning" "The end"
paste(c("The", "The"), c("beginning", "end"))
 "The beginning" "The end"
c(1, 3, 5) == 3
 FALSE TRUE FALSE
c(1, 3, 5) == c(3, 3, 3)
 FALSE TRUE FALSE
choice <- xs > 3
choice
[1] FALSE  TRUE  TRUE
xs > c(3, 3, 3)
[1] FALSE  TRUE  TRUE
xs[choice]
[1] 4 8
xs[xs > 3]
[1] 4 8
That last example is a very common idiom of the R language. It expresses “the elements of xs that are greater than 3” in a concise way. But now you can also see how the R machine actually works that out:
- Take 3 and repeat it enough times to make a vector the same length as xs.
- Compare xs to that repeated-3 vector, yielding a logical vector.
- Use the logical vector as a subscript for xs to pick out only some elements of that vector.
Recycling actually works on vectors of any length, though this feature is less often used than recycling a single element:
1:4 + 1:2
 2 4 4 6
letters # built-in variable of length 26
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p"
[17] "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
paste(letters, c("odd", "even"))
 "a odd" "b even" "c odd" "d even" "e odd" "f even" "g odd"  "h even" "i odd" "j even" "k odd" "l even" "m odd" "n even"  "o odd" "p even" "q odd" "r even" "s odd" "t even" "u odd"  "v even" "w odd" "x even" "y odd" "z even"
The data frame
Finally we come to the central compound data type in R, the data frame. The data frame represents tabular data. A data frame is a list of vectors not necessarily of the same type, but all of the same length. Let’s make one. The rather special data.frame() function takes named parameters and makes a data frame out of them, using the parameter names as the names of the columns.
laureates <- data.frame(
    firstname=c("Alice", "Mo", "Tomas"),
    surname=c("Munro", "Yan", "Tranströmer"),
    bornCountry=c("Canada", "China", "Sweden"),
    age_now=c(82, 59, 83))
laureates
  firstname     surname bornCountry age_now
1     Alice       Munro      Canada      82
2        Mo         Yan       China      59
3     Tomas Tranströmer      Sweden      83
To simplify the results of what goes on below, one slightly magical incantation should be added here.7
laureates <- data.frame(
    firstname=c("Alice", "Mo", "Tomas"),
    surname=c("Munro", "Yan", "Tranströmer"),
    bornCountry=c("Canada", "China", "Sweden"),
    age_now=c(82, 59, 83),
    stringsAsFactors=F)
Indexing by row and column
If we want to access parts of the data frame, a single subscript is no longer enough; now we need two subscripts to pick out rows and columns. In general, data_frame[rows, columns] gives us only those elements of data_frame in the rows specified by a subscript vector rows and the columns specified by columns. Try it out:
In addition to numbers, our subscripts can use the column names:
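The workshop’s examples here were not preserved; something like the following fits, repeating the miniature laureates frame so the lines can be run on their own:

```r
# The miniature laureates frame, repeated so this example is self-contained
laureates <- data.frame(
    firstname=c("Alice", "Mo", "Tomas"),
    surname=c("Munro", "Yan", "Tranströmer"),
    bornCountry=c("Canada", "China", "Sweden"),
    age_now=c(82, 59, 83),
    stringsAsFactors=FALSE)
laureates[2, 2]          # row 2, column 2
laureates[2, "surname"]  # the same element, picked out by column name
```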
And, just as with vectors, subscripts can be vectors, not just single indices:
laureates[3, c("firstname", "surname")]
  firstname     surname
3     Tomas Tranströmer
Fiddly note: If your data frame subscript expression picks out more than one column, its value is itself a data frame rather than a vector (even if you pick only one row). For the purposes of this lesson, this distinction is not important.
Write a single expression in terms of laureates to produce the full name of Canada’s laureate. If your answer is a vector of multiple elements, write a more complicated expression that yields a single string. You will have to use a function. The solution is in the answers section.
Leaving a blank where an index would be means “I want all of ’em”:
laureates[3, ]
  firstname     surname bornCountry age_now
3     Tomas Tranströmer      Sweden      83
laureates[, "surname"]
[1] "Munro"       "Yan"         "Tranströmer"
laureates[, c("surname", "bornCountry")]
      surname bornCountry
1       Munro      Canada
2         Yan       China
3 Tranströmer      Sweden
This is useful in conjunction with logical indexing:
laureates[c(T, F, T), ]
  firstname     surname bornCountry age_now
1     Alice       Munro      Canada      82
3     Tomas Tranströmer      Sweden      83
Picking out a single column is so common that R lets you use a more concise syntax:
laureates[, "surname"]
[1] "Munro"       "Yan"         "Tranströmer"
laureates$surname
[1] "Munro"       "Yan"         "Tranströmer"
A single column is just…a regular old vector, which can be subscripted in turn:
But this only works on a single column:
laureates$firstname, surname # error
laureates$(firstname, surname) # sigh
laureates$c(firstname, surname) # alas
Some less-made-up data
We’ve worked on miniature data long enough. Let’s work on a bigger table—though not much bigger. R has a special function devoted to reading CSV files from your hard drive. Its input is the name of the file and its output is a data frame.
laureates <- read.csv("laureates.csv", stringsAsFactors=F)
Please take the magical stringsAsFactors=F incantation on faith for now. This is the first R operation you have seen which involves the hard drive; it is an example of “file I/O.” This is a fount of interesting errors and problems. One of the major flaws of R is its extremely obscure way of telling you about such problems. For example, if it can’t find the file you asked for (either because of a typo in the name, or because it’s looking in the wrong folder), you’ll see an error like:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
  cannot open file 'laureates.csv': No such file or directory
One solution is to figure out the full path name of the file you want (i.e., where in your nested folders it lives). If you are used to using the Finder/Explorer to do this, then the simplest way8 to figure this out is with another special R function:
file.choose()
This opens a dialog box; navigate to where you have stored laureates.csv and click “open.” The value of the function now appears, and it is the full path of the file. On my system this looks like:
[1] "/Users/agoldst/Documents/dhru/counting/laureates.csv"
You can now copy and paste this path, in quotes, into read.csv. On my system (but not on yours) this looks like:
laureates <- read.csv("/Users/agoldst/Documents/dhru/counting/laureates.csv", stringsAsFactors=F)
Properties of the frame
We said before that a CSV file involves very minimal metadata. R stores the same metadata about a data frame. Find the names of the columns of a data frame with the names() function:
names(laureates)
 "id" "firstname" "surname"  "born" "died" "bornCountry"  "bornCountryCode" "bornCity" "diedCountry"  "diedCountryCode" "diedCity" "gender"  "year" "category" "overallMotivation"  "share" "motivation" "name"  "city" "country"
And the number of rows (which in this case is the number of laureates):
nrow(laureates)
The logic of the query
Now that we have a longer table (not that long, but still a bit long to see all at a glance), we want to slice and dice it. Indeed, one of the most important things we can do with a table of data is to choose parts of it. Combining what we have seen so far, we can use logical vectors, Boolean operators, and subscripting to pick out parts of a data frame in R.
This is an operation we all do all the time when we search library catalogues or databases. So think of this as a version of a search query. Only instead of clicking some menus and seeing results visually, we have the capacity to store and do further calculations on the results of our queries.
But let’s start with a query. What are the surnames of laureates born in Sweden?
laureates$bornCountry == "Sweden"
 FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE  FALSE FALSE FALSE FALSE FALSE FALSE
swedes <- laureates$bornCountry == "Sweden"
laureates$surname[swedes]
 "Tranströmer" "Johnson" "Lagerkvist"  "Karlfeldt" "von Heidenstam" "Lagerlöf"
Let’s look at the full rows of the table for laureates born in Sweden, too:
laureates[swedes, ]
That’s a lot of stuff, so from now we’ll only look at some of the columns while we’re picking out rows. Let’s try another query:
Not…many. We now express more complex ideas, like “All the rows corresponding to women born in Sweden”:
Or “all the rows corresponding to women or Swedes”:
Or “all the rows corresponding to women or people who are not Swedes, but take only the first names and surnames”:
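The query expressions themselves did not survive into these notes; here is a plausible reconstruction using a miniature stand-in for the laureates table (the rows are invented for illustration; the real file has many more rows and columns):

```r
# A miniature stand-in for the full laureates table;
# the query patterns are what matter here
laureates <- data.frame(
    firstname=c("Alice", "Mo", "Tomas", "Doris"),
    surname=c("Munro", "Yan", "Tranströmer", "Lessing"),
    bornCountry=c("Canada", "China", "Sweden", "Persia (now Iran)"),
    gender=c("female", "male", "male", "female"),
    stringsAsFactors=FALSE)
# women born in Sweden (in this miniature: none)
laureates[laureates$gender == "female" & laureates$bornCountry == "Sweden", ]
# women or Swedes
laureates[laureates$gender == "female" | laureates$bornCountry == "Sweden", ]
# women or non-Swedes, keeping only the name columns
laureates[laureates$gender == "female" | laureates$bornCountry != "Sweden",
          c("firstname", "surname")]
```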
Write an expression whose value is a data frame containing the names and prize-years of all the laureates who died in a country other than the country of their birth. The solution is in the answers section.
Now let’s finally use the computer to count (explicitly; secretly, it’s been doing a lot of counting already). For counting in R, the workhorse is the table function. At its simplest, table takes a vector as an input and returns a tabulation9 showing how many times each value in the vector is repeated:
This is already a useful operation. Notice that because we’re always talking to R in vectors, we always start by counting everything. Instead of asking, “How many Swedish laureates?” we could ask, “How many laureates from each country?” This is more information, but in R it is more concise (because more general):
The thing about counting is that we’re most often interested not in the question how many? but in how many, out of all of them? Now we can make use of our metadata, and R’s vectorized arithmetic.
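As a sketch of the idea (with made-up values rather than the real laureate data): divide the tabulation by the total number of observations to get proportions.

```r
# Counting with table(), then proportions of the total
bornCountry <- c("Canada", "China", "Sweden", "Sweden", "France")
table(bornCountry)
table(bornCountry) / length(bornCountry)  # how much of the total
```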
Write an expression for a tabulation of the number of men and women to win the Nobel in literature. The solution is in the answers section.
Cross-tabulation means answering questions like, “How many of each gender were born in each country?”
Think of this as a tabulation of tabulations: first R splits up the table by bornCountryCode, then splits up the result by gender before giving us the count. Notice that the result is now not a single row of numbers but many rows (or, more precisely, a two-dimensional array—almost like a data frame).
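The cross-tabulation call did not survive into these notes; a minimal sketch with invented values:

```r
# table() with two vectors cross-tabulates every combination of values
bornCountryCode <- c("CA", "CN", "SE", "IR", "SE")
gender <- c("female", "male", "male", "female", "female")
table(bornCountryCode, gender)
```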
Let’s return to single (rather than cross) tabulations for a bit. After how many? comes which is most or which is least? Tables (and vectors, in fact) can be rearranged in order by the sort function. The sorted result normally goes from least to most, but often the reverse is easier to read. For that, sort is invoked with a named parameter, decreasing=TRUE:
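A sketch of sorting a tabulation, again with made-up counts:

```r
# Sort a tabulation from least to most, then most to least
counts <- table(c("CA", "SE", "SE", "FR", "SE"))
sort(counts)
sort(counts, decreasing=TRUE)
```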
Write an expression for the top three countries-of-death of the Nobel laureates. This is a trick question. The solution is in the answers section.
More, and messier, data
Let’s count something else, something we really couldn’t just eyeball. The Modernist Journals Project has provided text-formatted tables of item metadata for some of the periodicals they have digitized. I downloaded these tabulations for Poetry and The Crisis from http://sourceforge.net/projects/mjplab/files/ and included them in the sample data archive. But as MJP may update the data, please re-download from that link if you are using this data for research of any kind.
Let’s start just by seeing whether MJP has provided us with a nice CSV format we can use right off. The readLines function looks for a file on disk and returns it as a vector of lines. Normally you want the whole file, but for our purposes we can specify the named parameter n to get just the first few lines:
!?!!! This is a delimited text file of data, but it isn’t comma-delimited. Instead it uses | as the delimiter. A close reading of help(read.csv) and some experimentation yielded me the following command, which uses read.table, a variant of read.csv that can deal with files that use things other than commas:
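The exact command from the workshop is not preserved here. A sketch of the same move, which writes out a tiny pipe-delimited sample first so it can be run without the MJP files (the file name mjp_sample.txt is invented):

```r
# Write a miniature pipe-delimited file, then read it back with read.table
writeLines(c("id|journal title|genre",
             "1|Poetry|poetry",
             "2|Poetry|advertisements"),
           "mjp_sample.txt")
poetry <- read.table("mjp_sample.txt", sep="|", header=TRUE,
                     quote="", comment.char="", stringsAsFactors=FALSE)
poetry
```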
That’s probably the murkiest line of R code in this workshop.10
Now that we have the two tables of items from the two magazines, let’s begin to compare them. One of the most interesting metadata fields is the genre assigned by the TEI encoders to each item. Let’s compare genre proportions.
Combine and recount
From here on it will be easier to have one data frame that combines the two tables. Since they have the same columns, we can simply “stack” one on top of the other. rbind is R’s function for stacking data frames.
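A minimal sketch of the stacking step, with invented miniature tables standing in for the MJP data (the names poetry, crisis, and mags are assumptions based on the surrounding text):

```r
# Two frames with identical columns can be stacked with rbind
poetry <- data.frame(journal=c("Poetry", "Poetry"),
                     genre=c("poetry", "articles"),
                     stringsAsFactors=FALSE)
crisis <- data.frame(journal=c("The Crisis", "The Crisis"),
                     genre=c("fiction", "articles"),
                     stringsAsFactors=FALSE)
mags <- rbind(poetry, crisis)
nrow(mags)  # the rows of both tables together
```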
Since the MJP data has scrupulously recorded the journal title for every item, it’s easy to cross-tabulate genres by journals:
(Fussily, R has renamed what the original file called journal title to journal.title, since column names cannot contain spaces.)
Who’s in both?
Earlier on we saw the %in% operator. Here’s a chance to apply it (returning to our separate data frames for the two journals):
That gives a logical vector which we can use as a subscript:
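A sketch with invented creator values (the variable names in_both and shared_auths are assumptions matching the surrounding text):

```r
# Which creators of Poetry items also appear among Crisis creators?
poetry_creators <- c("Pound, Ezra", "Monroe, Harriet", "Anon.", "")
crisis_creators <- c("Du Bois, W. E. B.", "Pound, Ezra", "Anon.", "")
in_both <- poetry_creators %in% crisis_creators  # the logical vector
shared_auths <- poetry_creators[in_both]         # used as a subscript
shared_auths
```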
If you print shared_auths you will see repeated names, since, remember, the table is of item metadata. If we want a list of names where each occurs only once, we can use the unique function.
Whoops! Let’s tidy that up by getting rid of the results for blanks and Anon.
Now we can take a look in our table mags in order to tally up the activities of these authors who contributed to both periodicals:
This, notice, is a three-way contingency table. R shows it to us as a series of two-way contingency tables, one for each “creator” of items.11
From tables back to data frames
Now, I’ve been evasive about just what kind of value the table function returns. It looks like tabular data, it’s called a “table,” but is it a data frame? No, it’s actually another member of R’s bestiary of complex types, namely, a table. Well, that’s helpful. For practical purposes what matters is learning how to convert a table to a data frame so that we can do everything we know how to do to data frames. R provides a function for this, as.data.frame().
Now print out laureate_countries to see what this data frame looks like. You might notice that the column headers are the supremely undescriptive Var1 and Freq. Assign new column names using the following syntax12:
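A self-contained sketch of the conversion and renaming (the counts are made up; the name laureate_countries follows the text):

```r
# From a table back to a data frame, then renaming the Var1/Freq columns
counts <- table(c("Canada", "China", "Sweden", "Sweden"))
laureate_countries <- as.data.frame(counts, stringsAsFactors=FALSE)
names(laureate_countries)  # the default, undescriptive names
names(laureate_countries) <- c("country", "count")
laureate_countries
```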
laureate_countries is a data frame with two columns containing the tabulated counts, which you can explore as we have been exploring the untabulated data.
The last part of this workshop (which we didn’t get to do on April 30) introduces one more way to explore tabular data, and especially counted-up tabular data: visualization. Here are two principles for thinking about visualization in this context.
A visualization transforms data inputs into graphical outputs. (Sound familiar?)
A grammatical visualization consistently transforms dimensions of the data into aesthetic dimensions of the output.
R users can avail themselves of a very powerful software library for making grammatical visualizations, Hadley Wickham’s ggplot2. Once you’ve learned the basics of R data types and making data frames, you can start making plots with ggplot.
Loading the library
Much of what makes R useful is not part of the basic program you got when you installed R. Instead, you can obtain extra source code that extends R by adding new functions you can use in your own R code. ggplot is one such “R package.” It is easy to obtain: in RStudio, choose “Install Packages…” from the “Tools” menu, type in “ggplot2,” and click “Install.”
Once ggplot2 has been installed, it can be loaded with the following function call:
library("ggplot2")
Making a point (plot)
Let’s start by thinking through a simple point plot. Here’s some new data: a table in which each row gives the number of translations published in the United States in the given year, according to the UNESCO Index Translationum.
The plot grammar
Here is the “grammar” of the point plot:
- Years on the x-axis, from left to right
- Number of translations on the y-axis, with 0 on the bottom
- For each row of data, draw a point.
The qplot function requires a data frame and a specification of the plot grammar. The specification is done using named parameters to the function. Our data frame has columns year and translations. So we tell qplot we want x to be year, and y to be translations. The last part of our specification was the decision to draw a point for each row of data. This is set by the geom="point" parameter. Finally, we tell qplot what data frame to work on using the data= parameter.
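Put together, the call might look like this (us_tx is the data frame name used later in these notes; the stand-in values below are invented for illustration, not the real UNESCO counts):

```r
library(ggplot2)

# Invented stand-in values; the real counts come from the Index Translationum
us_tx <- data.frame(year = 1990:1994,
                    translations = c(1400, 1450, 1380, 1500, 1475))

# x and y mappings plus a geom: the full grammar of the point plot
qplot(x = year, y = translations, geom = "point", data = us_tx)
```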
qplot goes on to make a lot of further choices for us: it picks where to start and end the x and y axes and how many “ticks” to label along each axis; it adds a shaded grid to help you read off numbers from the chart; indeed, it’s made a choice about the size of the point it draws.
qplot offers a bazillion parameters for adjusting all of these things, but one of its virtues, when it comes to starting out with counting, is that its default guesses are often pretty darn good. So you can wait to learn about how to tweak the visualization until you have gotten your bearings just getting the durn thing to make plots.
Conjugating the plot
Points don’t make the year-to-year trend particularly easy to see. From the grammatical perspective, we can think about other choices of “geom” without changing our decision about how to map x and y. I think of this as the grammar of conjugating a plot in different shapes.
Data over time are often shown with a line:
Since we’re counting things, we might also want to fill in the area below the line down to zero. This gives a “filled area” plot:
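Assuming the same invented us_tx stand-in, the two conjugations might look like:

```r
library(ggplot2)
us_tx <- data.frame(year = 1990:1994,
                    translations = c(1400, 1450, 1380, 1500, 1475))

# Same x and y mappings, different geoms:
qplot(x = year, y = translations, geom = "line", data = us_tx)  # connected line
qplot(x = year, y = translations, geom = "area", data = us_tx)  # filled area
```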
It’s worth thinking about what the different plots emphasize differently. One further shape possibility that might have occurred to you is a bar plot. As you might hope, to do this you pass geom="bar", but in this case one extra parameter to qplot is also needed: stat="identity". For now, just take this as a quirk of qplot: to say “I want a bar plot,” you have to say geom="bar", stat="identity".
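A sketch of the bar-plot call, again with the invented us_tx stand-in (note that the stat= parameter belongs to the 2014-era qplot; later ggplot2 versions handle this differently):

```r
library(ggplot2)
us_tx <- data.frame(year = 1990:1994,
                    translations = c(1400, 1450, 1380, 1500, 1475))

# stat="identity" tells qplot to use the translations values as bar heights
qplot(x = year, y = translations, geom = "bar", stat = "identity",
      data = us_tx)
```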
Scales, in general
It is fairly straightforward for us to figure out how to map a quantity like “number of translations” to a spatial dimension (y). Not all aesthetic mappings are so obvious. In particular, how do we map categorical data into the visual?
ggplot tries hard to do what you ask. If you tell it that y should be mapped from a categorical value, it will make its best guess. So let’s return to our data frame counting up Nobel laureates by country of birth, laureate_countries, and consider:
What is the grammar of this plot?
- country codes on the x axis
- laureate count on the y axis
- a point for each country
The “best guess,” here, was to arrange the country codes in alphabetical order along the x axis. This is not terrible, though given the number of countries, it’s pretty hard to read, except maybe to notice that France (FR) is champ. Even there, the point is so far from the x axis that it takes work to match the point to the country. That we could fix by using a different shape. Let’s try bars, not omitting the magical stat="identity":
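A sketch of the bar version (the column names country and count, and the counts themselves, are stand-in assumptions; the real values come from tabulating laureates.csv):

```r
library(ggplot2)

# Hypothetical stand-in for the tabulated laureate_countries data frame
laureate_countries <- data.frame(country = c("DE", "FR", "US"),
                                 count = c(80, 90, 85))

qplot(x = country, y = count, geom = "bar", stat = "identity",
      data = laureate_countries)
```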
That’s a little better, though not much; there are so many country codes that they get squashed together here. In RStudio, you can click the “zoom” button in the plot pane to see a bigger version of the plot. (Look for RStudio’s convenient buttons for saving plots as well.)
Now let’s do a little visualization of tallies from our periodical metadata set. A basic counting question: did the number of articles published in issues of Poetry change over time?
First we have to create the necessary data frame:
Now we are in a position to plot the series:
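A sketch of both steps. The data frame name poetry, its values, and the column name articles are assumptions based on context; art_series and its date column are the names used below in these notes:

```r
library(ggplot2)

# Stand-in for the Poetry metadata (the real data comes from the MJP files)
poetry <- data.frame(date = c("1912-10-01", "1912-10-01", "1912-11-01"),
                     stringsAsFactors = FALSE)

# Tally items per issue date
art_series <- as.data.frame(table(poetry$date))
names(art_series) <- c("date", "articles")

# Plot the series, one bar per issue
qplot(x = date, y = articles, geom = "bar", stat = "identity",
      data = art_series)
```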
Notice something strange has happened on the x axis. What type of data is art_series$date? As far as R knows, the date is a factor (which it has “cleverly” converted from its original string format). Now, fortunately, the convention used for notating dates here ensures that alphabetical order is also chronological order (why?), but qplot does not know that: as far as it’s concerned, art_series$date is a categorical variable.
R has a specialized data type for dates, however, and a function for turning strings in YYYY-MM-DD format into that type. Here’s how we do that, adding a new column to our art_series data frame:
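The conversion might look like this sketch (the new column name pubdate, the stand-in values, and the column name articles are my assumptions, not necessarily the workshop's):

```r
# Stand-in for art_series; date holds YYYY-MM-DD strings (or a factor)
art_series <- data.frame(date = c("1912-10-01", "1912-11-01"),
                         articles = c(25, 23))

# as.character undoes any factor conversion; as.Date parses YYYY-MM-DD
# strings into R's Date type, which qplot knows how to place on an axis
art_series$pubdate <- as.Date(as.character(art_series$date))
```

Replotting with x mapped to the new Date-typed column then gives a sensibly labeled axis.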
qplot now understands that x represents a date, and labels the axis more sensibly. Whether the plot tells us something intelligible about the changing editorial policies of Poetry magazine is another question.15
Counting in more than one dimension
We’ve already seen two- and even three-way contingency tables. How are these to be plotted? Let’s use our combined mags data frame on Poetry and The Crisis to slot in one more piece of the visualization puzzle.
First, as before, we construct a data frame from a table, this time with counts by date and genre. Thus each row answers the question, “How many of this kind of item were published in Poetry on this date?”
How to count this data in a plot? Thinking grammatically, we want to add a new dimension to our visual mappings: we will use the two spatial dimensions for time and counts, as before, but now we will indicate a categorical variable using another aspect of the visual: color. Let’s start with a point plot that shows how many items were published per issue in the two journals:
- Issue dates on the x axis, left to right
- Item counts by journal on the y axis, bottom to top
- Distinguish genres by color
- One point for each row of the table
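Under the assumption that the per-date, per-genre counts live in a data frame I’ll call genre_series, with columns date, genre, and count (all hypothetical names and values), the point plot might be:

```r
library(ggplot2)

# Hypothetical stand-in for the counts-by-date-and-genre data frame
genre_series <- data.frame(date = rep(c("1912-10-01", "1912-11-01"), each = 2),
                           genre = rep(c("poetry", "prose"), times = 2),
                           count = c(20, 5, 18, 7))

# color= maps the categorical genre variable to point color
qplot(x = date, y = count, color = genre, geom = "point",
      data = genre_series)
```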
That tells us a few things, and reveals that qplot will produce a legend for us once we introduce a color= visual mapping. But it could be made easier to read. One possibility would be to use a connected line rather than points. geom="line" is what we need…but we also have to tell qplot which points to connect, using the group= parameter:
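A sketch, reusing the same hypothetical stand-in names (genre_series with columns date, genre, count):

```r
library(ggplot2)
genre_series <- data.frame(date = rep(c("1912-10-01", "1912-11-01"), each = 2),
                           genre = rep(c("poetry", "prose"), times = 2),
                           count = c(20, 5, 18, 7))

# group= tells qplot to connect the points within each genre separately
qplot(x = date, y = count, color = genre, group = genre, geom = "line",
      data = genre_series)
```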
This noisy plot helpfully indicates that Poetry did indeed consistently publish mostly poetry items over time, though it might also help you pick out some interesting issues to look at in which the generic mixture is unusual.
For that purpose, however, lines are not as informative as a visual grammar that allows you to make a clearer comparison between genres in each year. One possibility would be to use bars, but to stack the bars for genres on top of one another.
We already know to tell qplot geom="bar", stat="identity". Two other parameters have to change. To specify the color of the bars, one uses fill= rather than color=. (Graphics R fun.) To specify stacked bars, one uses position="stack", which is at least not totally opaque.
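Put together, with the same hypothetical stand-in data:

```r
library(ggplot2)
genre_series <- data.frame(date = rep(c("1912-10-01", "1912-11-01"), each = 2),
                           genre = rep(c("poetry", "prose"), times = 2),
                           count = c(20, 5, 18, 7))

# fill= (not color=) sets bar color; position="stack" stacks the genres
qplot(x = date, y = count, fill = genre, geom = "bar", stat = "identity",
      position = "stack", data = genre_series)
```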
So far, we’ve gotten three columns of data onto a single plot. Though it’s possible to squeeze in more, ggplot gives us another option that is often more useful. This is the technique of small multiples: make multiple copies of the plot for different slices of the data.
So, we could redo our plot of genres in Poetry as a row of plots, one for each genre. Think of it as the visual equivalent of embedding in grammar:
- Make a plot for each genre, arranged from left to right in alphabetical order, in which:
- years are on the x axis
- counts of items are on the y axis
- draw a bar for each year
ggplot refers to this as “faceting”:
The bizarre expression . ~ genre is a special formula value. In this context, it means “one row, with plots for each value of genre.” For a vertical column of plots, you’d use genre ~ . instead.
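The faceted call might look like this, again with the hypothetical genre_series stand-in:

```r
library(ggplot2)
genre_series <- data.frame(date = rep(c("1912-10-01", "1912-11-01"), each = 2),
                           genre = rep(c("poetry", "prose"), times = 2),
                           count = c(20, 5, 18, 7))

# facets = . ~ genre: one row of plots, one panel per genre
qplot(x = date, y = count, geom = "bar", stat = "identity",
      facets = . ~ genre, data = genre_series)
```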
But why stop there? If a two-way contingency table can be represented as a single row of plots, we can represent a three-way contingency table as a table of plots. We have a combined data frame, mags, for Poetry and The Crisis. Let’s count up items by genre in the two journals:
Now we can make a collection of plots, with rows of plots for each genre and two columns of plots, one for each of the two journals:
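A sketch of the two-way faceting (the data frame name mag_series, the journal column name, and all values here are assumptions; the real counts come from the combined mags data):

```r
library(ggplot2)

# Hypothetical stand-in for the counts by date, genre, and journal
mag_series <- data.frame(
  date    = rep(c("1912-10-01", "1912-11-01"), times = 4),
  genre   = rep(c("poetry", "prose"), each = 4),
  journal = rep(rep(c("Poetry", "The Crisis"), each = 2), times = 2),
  count   = c(20, 18, 2, 3, 5, 7, 10, 9))

# Rows of panels for genres, columns of panels for journals
qplot(x = date, y = count, geom = "bar", stat = "identity",
      facets = genre ~ journal, data = mag_series)
```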
This particular graphic might or might not suggest some avenues for further investigation, though to me the main thing it shows is that sometimes a big plot showing all the data doesn’t tell you more than a smaller table might:
The overall generic mixtures of these two periodicals, and the importance of the image in the tallies of The Crisis, are perhaps the main facts of interest here. But perhaps the shifts over time suggest possibilities too.
Counting on: further reading
Introductions to R
Navarro, Daniel. Learning Statistics with R. http://health.adelaide.edu.au/psychology/ccs/teaching/lsr/. Pts. 2–3. This (free draft) introductory statistics textbook for psychology students includes an especially lucid introduction to R.
Jockers, Matthew. Text Analysis with R for Students of Literature. Springer, forthcoming. This textbook in preparation introduces R with a focus on analyzing literary texts.
The R Project. An Introduction to R. This, from the creators of R, is often frustrating (and tends to assume quite a bit of programming and statistical experience).
Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer, 2009. http://dx.doi.org/10.1007/978-0-387-98141-3. Rutgers Library has online access to this quite lucid exposition by ggplot’s author.
Wickham, Hadley. Online documentation for ggplot2. http://docs.ggplot2.org/. Reprehensibly sparse.
Wilkinson, Leland. The Grammar of Graphics. 2nd ed. Springer, 2005. http://link.springer.com/book/10.1007/0-387-28695-0. Rutgers Library has online access to this, the theoretical basis for ggplot.
Solutions to exercises
1. Canada’s laureate
2. Exiles and émigrés
3. Women and men
4. Countries of death. We’ll always have…
A table can be indexed like a vector. It turns out, however, that the number one “country” is a blank (the living laureates). Hence the expression is:
Edited 4/30/14 by AG: added slides link.
Edited 5/19/14 by AG: added workshop notes.
James English, “Everywhere and Nowhere: The Sociology of Literature After ‘the Sociology of Literature,’” NLH 41, no. 2 (Spring 2010): xii–xiii.
Source: requests to api.nobelprize.org. See http://www.nobelprize.org/nobel_organizations/nobelmedia/nobelprize_org/developer. To construct the laureates.csv file, I used the R code in this gist.
Words known to R include function names, variable names, and file paths (particularly handy). These terms are explained further on.
Actually R may not bother to figure out the value until you actually use it, because R is (this is the technical term, really) lazy.
If you want to get fancy, try outer(c(T,F),c(T,F),"&") and the same expression with "|".
By default, string values are transformed into factors, the R type for representing categorical data. This is useful in some cases, but for now the difference will be confusing. stringsAsFactors=F ensures that our strings stay stringy.
The alternative approach is to set R’s “working directory” to the folder containing laureates.csv. Do this with the setwd function.
Slight hand-waving here, because table returns a data type we haven’t discussed yet. For practical applications we’re mostly going to see tables that look like data frames if you squint. The return type of table is an object of class table, which is a subclass of array. An array is a generalization of a vector to any number of indices; a vector is isomorphic to a 1-dimensional array, a matrix is a 2-dimensional array, etc. Good, I’m glad we cleared that up.
sep specifies the delimiter. It can only be one character, but fortunately the strip.white parameter tells R to get rid of white space before and after delimiters. Because MJP has not escaped single or double quotes, R will choke on apostrophes unless we tell it to take each table entry literally rather than trying to look for pairs of quotation marks (the quote parameter). stringsAsFactors we’ve seen before. Finally, read.table must explicitly be told that a header line is included, with header=T.
The possibility of three-way or n-way contingency tables is the reason a table is an array rather than a vector or data frame. The three-way table can be subscripted with expressions like t[i, j, k].
It is also possible to get more literate column names by making the table with the xtabs function before converting to a data frame. In this case: xtabs(~ bornCountryCode, laureates). But there’s something funny here, which is best left for a later lesson. All right, if you really want, I’ll tell you, since it only took me twenty re-readings of the help files to figure it out. xtabs requires that you indicate which columns of the data frame to tabulate with a “one-sided formula” of the form ~ col1 + col2 + col3.... It sets the dimnames of the resulting array from the formula, and as.data.frame derives column and row names from that. I hope you’re happy now.
If you’ve been following carefully, you might be wondering how in the world x=year could be an acceptable function parameter. There is no variable named year, so how does R know what values to plot? Shouldn’t it be x=us_tx$year or something like that? The answer is that qplot pretends that each of the columns of the data frame specified by the data parameter is a variable in its own right: when it goes to figure out the value of its x parameter, it will be able to look up year in the data frame. The R-speak way of saying this is that the parameters are “evaluated in the data frame.” (If you’ve been following very closely, you might even have a guess as to why R doesn’t give an object 'year' not found error before qplot has a chance to evaluate it in the data frame.)
The reason for this lies in an aspect of the grammar I passed over. Sometimes we transform the data before mapping it into the visual dimension. This transformation, intervening between the data and the visual mapping, is known to ggplot as the “stat.” For example, sometimes we count up how often a given value occurs: this is what we have been doing with the table function. qplot will do this for you automatically if you supply the raw, untallied data and then use the bin stat. In fact, since bar plots are so commonly used to show tallies of this kind, bin is the default stat when you set geom="bar"; you are not expected to set y to the tallies yourself. If this seems obscure, try qplot(x=gender,geom="bar",data=laureates). Notice that there’s no explicit y parameter at all. In the case of our yearly translation counts, however, we do not want the heights of the bars to correspond to the number of times a given count of translations occurs in the data! We just want the height of the bar to be equal to the number of translations. This is the simplest stat of all, the identity transformation, which leaves all values unchanged. But we have to set it explicitly using stat="identity". In fact, if you leave off this parameter, qplot will give you an error message that tries to tell you what I’ve just told you, but with even less clarity.
The unsightly striation—random white space between the bars here and there—is one of the few moments when ggplot’s defaults let you down, visually.?