Taught by Andrew Goldstone, Department of English, Rutgers University–New Brunswick
Wednesday, April 30, 2014
4:30 p.m.–6:30 p.m.
Alexander Library, Room 413
169 College Avenue, New Brunswick, NJ
With the increasing prominence of the digital humanities, humanists are once again asking themselves whether they can make use of the computer’s most fundamental capacity: its ability to count. This workshop introduces some of the methodological choices required for computational counting: what representations of data are suitable for machine processing? Once you have such a representation, how can you begin to analyze it? We will make these questions concrete through an introduction to R, which is both a programming language and a software environment for data analysis. We focus on the nature of computational thinking: the scholar’s work of representing and analyzing data on the computer is a process of highly disciplined expression. We will work together to analyze some samples of data (bibliographic data and word-use data), from loading files in the ubiquitous comma-separated value (CSV) format, to searching and tabulating data programmatically, to the “grammar” of basic visualization.
No programming experience required. Patience, however, helps.
Notes from the workshop
The following is a somewhat edited and amplified write-up (by Andrew Goldstone) of the workshop notes. The original slides from the workshop are also available. To work through the below would take approximately two and a half hours.
Shall we count?
The aim of this workshop is to provide some foundations for thinking about counting as a way of answering some questions we have in our disciplines. Foundations, and not just practical instruction in tool-using, matter because there is a lot at stake in the decision to explore counting methods. James English writes about literary study:
Academic disciplines (and even interdisciplines or hybrids) are relational entities; they must define themselves by what they are not. And what literary studies is not is a “counting” discipline. This negative relation to numbers is traditional—foundational, even—and it has not been seriously challenged by the rise of interdisciplinarity….Literary studies has shouldered much of the burden of…defending qualitative models and strategies against the naïve or cynical quantitative paradigm that has become the doxa of higher-educational management. Under these institutional circumstances, antagonism toward counting has begun to feel like an urgent struggle for survival.1
Comma-separated values
This workshop focuses on tables. It’s worth thinking about what can and cannot be represented in a table, but the table is a ubiquitous model for keeping track of data. When you think of tables and computers, you might think of Excel spreadsheets. Unfortunately, the Excel format is both too complex and too opaque to allow for direct programmatic manipulation, except through the very awkward mechanisms of Excel’s own native languages and macros. That format is also proprietary and vulnerable to problems when it comes to sharing and archiving. I have found that for my own purposes, I have spent the most time working on tables in CSV or comma-separated values formats. Excel and any other spreadsheet program can save your spreadsheets as “CSV” or “Text CSV” (though, as you will see, this will entail a drastic, though productive, simplification of the format of the data). Here is a small table in CSV format:
firstname,surname,bornCountry
Alice,Munro,Canada
Mo,Yan,China
Tomas,Tranströmer,Sweden
The norms of CSV
- plain-text file for tabular data
- delimiter separates columns (usually , or a tab)
- newline separates rows
- names of columns in first row (optional)
- tricky bits:
- what if a data point contains a comma?
- what if a data point contains a quotation mark?
- what text-encoding should be used?
- how do you know what rules have been followed? (There is RFC 4180, but no promises.)
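Looking ahead to the R introduced later in these notes, here is a minimal sketch of how one CSV reader handles the first two tricky bits. It assumes R's read.csv, which follows the common convention of wrapping a comma-containing value in quotation marks and doubling any quotation mark inside it; other programs may follow other rules. The note column and its contents are made up for illustration.

```r
# A tiny CSV held in a string; the "note" column is invented for this example.
csv_text <- 'id,bornCity,note
844,"Nitzkydorf, Banat","a ""quoted"" word"
808,Istanbul,plain'

# read.csv() can read literal text through its text= parameter.
tricky <- read.csv(text = csv_text, stringsAsFactors = FALSE)

tricky$bornCity  # the comma inside the quoted value stays within one value
tricky$note      # the doubled quotation marks come back as single ones
ncol(tricky)     # still three columns
```

The point is only that the quoting rules are a convention the reader and the writer of the file have to share.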
People as a table
Let’s look at some more elaborate CSV-format data. In the sample files, look for laureates.csv and open it in RStudio using the Open File command (or open it in a text editor; for more on text editors, see the notes for the workshop on digital text).2
id,firstname,surname,born,died,bornCountry,bornCountryCode,bornCity,diedCountry,diedCountryCode,diedCity,gender,year
892,Alice,Munro,1931-07-10,0000-00-00,Canada,CA,Wingham,,,,female,2013
880,Mo,Yan,0000-00-00,0000-00-00,China,CN,Gaomi,,,,male,2012
868,Tomas,Tranströmer,1931-04-15,0000-00-00,Sweden,SE,Stockholm,,,,male,2011
854,Mario,"Vargas Llosa",1936-03-28,0000-00-00,Peru,PE,Arequipa,,,,male,2010
844,Herta,Müller,1953-08-17,0000-00-00,Romania,RO,"Nitzkydorf, Banat",,,,female,2009
832,"Jean-Marie Gustave","Le Clézio",1940-04-13,0000-00-00,France,FR,Nice,,,,male,2008
817,Doris,Lessing,1919-10-22,2013-11-17,"Persia (now Iran)",IR,Kermanshah,"United Kingdom",UK,London,female,2007
808,Orhan,Pamuk,1952-06-07,0000-00-00,Turkey,TR,Istanbul,,,,male,2006
801,Harold,Pinter,1930-10-10,2008-12-24,"United Kingdom",UK,London,"United Kingdom",UK,London,male,2005
Notice the large number of conventional choices implied by this table: quotation marks to surround items with spaces or commas; dates in YYYY-MM-DD format; “still living” represented as a 0000-00-00 date; arbitrary ID numbers; gender coded as male or female… None of these codings is described explicitly; CSV has very limited accommodation for metadata (just the column names). The rest of the metadata has to live in a separate file. Working in this format means keeping careful track of choices for how categories have been coded.
Text as a table
Here is part of a tabular representation of a scholarly article3:
WORDCOUNTS,WEIGHT
the,766
of,482
and,305
in,259
to,224
a,195
new,101
as,101
that,86
it,75
This so-called bag of words indicates only the number of times each type of word occurs in the article (according to JSTOR’s OCR), regardless of order. For this purpose, the CSV format is quite amenable. Notice what has been discarded: not just word order but punctuation, page layout, typography… (What dimensions of the page could be tabulated by extending the table?)
It is worth spending some time thinking about what can and cannot be accommodated in a data format like CSV. Let’s make a rough typology of some of the kinds of data we might be interested in counting up.
Data types
Simple: numerical
- Whole numbers (integer scale). How many (books, people, words, genres…)?
- Real numbers (interval scale). How much (distance, time, money…)? Special cases:
- percentages or proportions (ratio scale). How much of the total (population, corpus of texts…)?
- dates. When? (And does the day, month, year, decade, century… matter?)
Simple: categorical
- Unordered. Which of… (languages, nations, genders(?))? Special cases:
- binary or Boolean category: true or false, yes or no.
- many categories (headwords in the dictionary, authors in the catalogue).
- Ordinal. Which (letter of the alphabet, sales rank, “like, dislike, or neutral”)?
Categories may be represented by numbers, often in more than one way:
- true: 1, false: 0
- like: 1, neutral: 0, dislike: -1
- like: 2, neutral: 1, dislike: 0
- a: 1, b: 2, c: 3… (character encoding)
Compound types
With these atoms of data, we can then make more complex forms:
The list / the series
As in a series, perhaps, of percentages:
17.5, 3.0, 11.0, 13.0, 8.1, 11.9, 11.0, 3.7, 3.2, 1.5
The list of lists / the table
The table can also be represented as a list of lists (of equal length):
firstname: Alice, Mo, Tomas
surname: Munro, Yan, Tranströmer
bornCountry: Canada, China, Sweden
firstname surname bornCountry
Alice Munro Canada
Mo Yan China
Tomas Tranströmer Sweden
You might also think of a table as a list of “cases,” one case per row, where each row is described by the same collection of data (first name, surname, country of birth).
And text?
For computational purposes, the central representation of text is in the form of a (looooong) list of characters (a “string”):
O, n, c, e, *space*, u, p, o, n, *space*, a, *space*,
t, i, m, e
But other representations exist:
- the bag of words (to: 2, be: 2, or: 1, not: 1)
- content analyses (automated, human, or semi-automated classifications of texts, which can then be tallied and analyzed in turn)
- marked-up text
<sp who="#Salinus"><speaker>Duke.</speaker> <p>Haplesse <name>Egeon</name> whom the fates haue markt...</p>
- parsed trees (reflecting grammar—a grammar tree is a classic example of a data format which is impossible to encode in a single table)
- page images (bonus activity: explain how an image can be represented in a table)
Programming in a nutshell
Let’s get counting. This is going to require some programming. What is programming?
- A program is a formal description of a process for transforming data. Composing a program is a matter of expressing what you want to do in a constrained language.
- A computer performs calculations on numbers and stores the results of those calculations.
- If the inputs, outputs, and the formal description can be encoded as numbers, a program can be executed on a computer. At that point the formal description also looks like a recipe of instructions for the computer. In programming, one often switches back and forth between the expressive mode of description and the more machine-focused mode of instruction.
The R experience
The console
The console is the window with the > prompt. In this window, you type an expression, and R figures out its value (and sometimes: stores a value, draws a figure, reads a file from the disk, saves a file on the disk), and tells you. And that’s all.
The script
A script is a set of expressions in a text file, one after another. R goes through and figures out their value one by one. And that’s all.
First steps in the console
R is a parrot
The simplest kind of expression consists of a value. To figure out the value of a value, R doesn’t have to work very hard:
2
[1] 2
From here on, these notes show, first, the line you can type into R (indicated by the box with a grey background) and then, following it, a second box, with a white background, showing the response R gives. There’s no need to type in the response. (Ignore, for now, the strange [1] you see when R echoes these values back to you. It’s just trying to be friendly.)
"Shiver me timbers"
[1] "Shiver me timbers"
Notice that as you type a ", RStudio fills in the closing " automatically. This can be a little disconcerting but is a useful convenience. You can “overtype” the close quote as well. RStudio will do the same thing with parentheses and brackets.
R gets crabby easily
On the other hand, already we can provide a first introduction to some of the ways you can type things R does not understand. Doing this is a normal part of the work, and hitting glitches and making errors is an important part of a learning process. R is particularly bad at explaining to you why it has not accepted what you typed. It’s worth practicing making R crabby, so you can see that this experience is not the end of the world:
Shiver
Shiver me timbers
help
(
"Shiver
To escape from these cases, press ESC.
Some important features
The constrained environment of the interactive prompt (the > where you type a line and press return) is rigorously linear and serial. Once you’ve typed return, you can’t edit the line to fix mistakes. But you can quickly copy over into a new line what you previously typed:
- Use the up and down arrows (or the RStudio History pane) to move through the history of past lines.
- Use the tab key to fill in partly-typed words that are known to R.4
- Use the help feature: help("paste") or ?paste displays help on the thing called paste.
R data kinds (“modes”)
Numbers
R does not enforce the difference between integers and non-integers very rigidly. Most of the time, in R you just think of “numbers” with and without decimal places.
Strings
Text comes in strings, surrounded with "":
"Avast"
[1] "Avast"
"\"Avast,\" he said"
[1] "\"Avast,\" he said"
"Beware the \\"
[1] "Beware the \\"
Represent a newline with \n and a tab with \t. In all these cases, \ is a special “escape” character indicating that the next character has a special interpretation.
Booleans
In R, a Boolean value may be TRUE or FALSE, T or F for short.
Factors
This special type, for representing categorical data, is discussed below.
Rithmetic
Try:
2 * 2
[1] 4
5/7
[1] 0.7143
You can use as many spaces or as few as you want.
Now try a logical expression using the operators ==, which means “is equal?”; !=, which means “is not equal?”; and > and <:
4 == 3
[1] FALSE
4 > 3
[1] TRUE
4 < 3
[1] FALSE
4 != 3
[1] TRUE
These expressions have Boolean values. Boolean values have their own arithmetic, defined by the operators familiar from catalogue searching: and, or, not. In R these are notated as follows:
(2 > 1) & (1 > 5)
[1] FALSE
(2 > 1) | (1 > 5)
[1] TRUE
!(1 > 5)
[1] TRUE
R functions
Functions map inputs to outputs. The syntax is:
function_name(input_expression)
for a function with one input or
function_name(input1, input2)
for a function with 2. And so on. The inputs can be any expression.
Here are some examples of functions applied to simple values:
sqrt(4)
[1] 2
nchar("Munro")
[1] 5
paste("Alice", "Munro")
[1] "Alice Munro"
You can guess what these functions do, but you could also look up the official explanation with help(nchar) or help(paste).
Here are some examples where a function takes an expression as an input—which might include another function—and so ad infinitum. (This ability to use one function’s output as input to another function—or even the same function—is central to the working of algorithms.)
sqrt(4 * 4)
[1] 4
sqrt(nchar("Four"))
[1] 2
paste(paste("Alice", "Munro"), "(Canada)")
[1] "Alice Munro (Canada)"
Functions in R have one other way of indicating inputs, called named parameters. They look like this:
paste("Munro", "Alice", sep = ", ")
[1] "Munro, Alice"
paste("Munro", "Alice", sep = "")
[1] "MunroAlice"
Here the third parameter is given the name sep, which has a special role in the paste function (what is it?).
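Named parameters turn up all over R. Two further illustrations, offered as a sketch: the round function takes a named parameter called digits, and a named parameter may be written in any position among the inputs.

```r
round(5/7)               # with no second input, round() gives a whole number
round(5/7, digits = 2)   # the named parameter asks for two decimal places

# Because it is identified by name, sep does not have to come last.
paste(sep = ", ", "Munro", "Alice")
```

The last line has the same value as paste("Munro", "Alice", sep = ", ") above.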
Assignment
Computers do calculations and store the results. We’ve done calculations; what about storing?
In R, <- stores a value under a name which you can refer to (or change) later. The format is
name <- expression
R figures out the value of expression and stores it in name.5
Once you’ve stored a value under name, the value of that name is…the stored value. This sounds funny, but try a few examples:
x <- 108
x
[1] 108
x + 2
[1] 110
storage <- 10
storage <- storage - 10
My_Perfectly_Good_Name2012 <- "Mo Yan"
In the “Environment” pane of RStudio, you can watch the results of your assignments: the names and their associated values suddenly appear in the list.
Names can be short or long, but can’t have spaces in them, and they have to start with a letter.
R compound data types
Vectors (for a series of values)
What I described as a series is called a vector in R. Construct a vector with the special function c (concatenate):
xs <- c(2, 4, 8)
xs
[1] 2 4 8
bs <- c(T, F, T)
bs
[1] TRUE FALSE TRUE
people <- c("Munro", "Mo", "Transtromer")
people
[1] "Munro" "Mo" "Transtromer"
c(people, "Vargas Llosa")
[1] "Munro" "Mo" "Transtromer" "Vargas Llosa"
Notice that vectors can hold not just numbers but strings or Booleans. (They aren’t allowed to hold a mixture. For that a different compound type exists, which we won’t say much about, the list).
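Strictly speaking, R does not refuse a mixture; it quietly converts every element to a single kind. A quick sketch (the name mixed is arbitrary):

```r
# A number, a string, and a Boolean go in...
mixed <- c(2, "Munro", TRUE)

mixed         # ...and three strings come out
mode(mixed)   # "character" is R's word for the string kind
```

This silent conversion is worth remembering when a column of numbers unexpectedly behaves like text.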
Subscripting
Once you have a series, how do you pick out parts of it? Choose an element or elements from a vector with []:
xs[2]
[1] 4
people[1]
[1] "Munro"
Again, any expression whose value is a meaningful subscript can go in the square brackets:
xs[1 + 1] # a silly example
[1] 4
Sequences
One special kind of vector has so many uses R has a special way of notating it: this is the sequence:
1:3
[1] 1 2 3
c(1:3, 6:8)
[1] 1 2 3 6 7 8
Now try an experiment. What is the value of these expressions?
people[1:2]
[1] "Munro" "Mo"
people[c(1, 3)]
[1] "Munro" "Transtromer"
If the subscript is itself a vector, then we get back a vector, not just a single element.
Logical subscripting
Now figure out what’s going on here:
people[bs]
[1] "Munro" "Transtromer"
In the expression
v[logic_v]
if logic_v is made up of Boolean values and has the same length as v, it acts as a kind of mask applied to the series: only the elements of v corresponding to the TRUE values of logic_v are picked out.
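Here is the mask idea as a small sketch, using the placeholder names v and logic_v from the description above with made-up values:

```r
v <- c(10, 20, 30, 40)
logic_v <- c(TRUE, FALSE, FALSE, TRUE)

v[logic_v]     # keeps the first and fourth elements
v[!logic_v]    # flipping the mask with ! keeps the others instead
sum(logic_v)   # TRUE counts as 1 in arithmetic, so this is how many are kept
```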
Vector operations
All of the arithmetic discussed above has a vector version. So do many R functions. In general, the idea is to do the same thing to each element of the vector (operators apply “elementwise”):
c(1, 3, 5) + c(2, 4, 6)
[1] 3 7 11
c(T, F, F) | c(F, T, F)
[1] TRUE TRUE FALSE
paste(c("a", "b"), c("c", "d"))
[1] "a c" "b d"
If the Boolean arithmetic above seemed vague, you can generate “truth tables” that show you the workings of the Boolean operators6:
c(T, T, F, F) & c(T, F, T, F) # truth table for AND
[1] TRUE FALSE FALSE FALSE
c(T, T, F, F) | c(T, F, T, F) # truth table for OR
[1] TRUE TRUE TRUE FALSE
!c(T, F) # truth table for NOT
[1] FALSE TRUE
Another important operator: x %in% y checks whether x is found in the vector y. This is clear enough for the case where x is a single value:
"c" %in% c("b", "c", "d", "e")
[1] TRUE
"a" %in% c("b", "c", "d", "e")
[1] FALSE
Explain to yourself what happens when x is a vector of multiple values:
c("a", "b", "c") %in% c("b", "c", "d", "e")
[1] FALSE TRUE TRUE
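Because the result is a logical vector with one entry per element of x, it can serve directly as a logical subscript. A sketch, with an illustrative list of names (the names surnames and wanted are arbitrary, and "Woolf" is included only as a value that is not found):

```r
surnames <- c("Munro", "Yan", "Pamuk", "Pinter")
wanted <- c("Pamuk", "Munro", "Woolf")

surnames %in% wanted             # one TRUE or FALSE for each element of surnames
surnames[surnames %in% wanted]   # the matches, in the order they have in surnames
```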
Recycling
The last vectorial convenience has to do with operations that need vectors of the same length. R lets you supply a single value and then extends it to the length required by the context of an operation or a function. Compare:
c(1, 3, 5) + 1
[1] 2 4 6
c(1, 3, 5) + c(1, 1, 1)
[1] 2 4 6
paste("The", c("beginning", "end"))
[1] "The beginning" "The end"
paste(c("The", "The"), c("beginning", "end"))
[1] "The beginning" "The end"
c(1, 3, 5) == 3
[1] FALSE TRUE FALSE
c(1, 3, 5) == c(3, 3, 3)
[1] FALSE TRUE FALSE
choice <- xs > 3
xs > c(3, 3, 3)
[1] FALSE TRUE TRUE
choice
[1] FALSE TRUE TRUE
xs[choice]
[1] 4 8
xs[xs > 3]
[1] 4 8
That last example is a very common idiom of the R language. It expresses the following: “The elements of xs that are greater than 3” in a concise way. But now you can also see how the R machine actually works that out:
- Take 3 and repeat it enough times to make a vector the same length as xs.
- Compare xs to that repeated 3 vector, yielding a logical vector.
- Use the logical vector as a subscript for xs to pick out only some elements of that vector.
Recycling actually works on vectors of any length, though this feature is less often used than recycling a single element:
1:4 + 1:2
[1] 2 4 4 6
letters # built-in variable of length 26
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p"
[17] "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
paste(letters, c("odd", "even"))
[1] "a odd" "b even" "c odd" "d even" "e odd" "f even" "g odd"
[8] "h even" "i odd" "j even" "k odd" "l even" "m odd" "n even"
[15] "o odd" "p even" "q odd" "r even" "s odd" "t even" "u odd"
[22] "v even" "w odd" "x even" "y odd" "z even"
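One caveat, as a sketch: when the longer length is not a whole multiple of the shorter one, R still recycles, but it issues a warning ("longer object length is not a multiple of shorter object length"). The name uneven is arbitrary.

```r
1:6 + 1:2             # 6 is a multiple of 2, so R recycles without comment

uneven <- 1:3 + 1:2   # 3 is not a multiple of 2: R warns, but still answers
uneven                # the shorter vector was reused as 1, 2, 1
```

A warning of this kind is usually a sign that two vectors you expected to line up do not.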
The data frame
Finally we come to the central compound data type in R, the data frame. The data frame represents tabular data. A data frame is a list of vectors not necessarily of the same type, but all of the same length. Let’s make one. The rather special data.frame() function takes named parameters and makes a data frame out of them, using the parameter names as the names of the columns.
laureates <- data.frame(
firstname=c("Alice","Mo","Tomas"),
surname=c("Munro","Yan","Tranströmer"),
bornCountry=c("Canada","China","Sweden"),
age_now=c(82,59,83))
laureates
firstname surname bornCountry age_now
1 Alice Munro Canada 82
2 Mo Yan China 59
3 Tomas Tranströmer Sweden 83
To simplify the results of what goes on below, one slightly magical incantation should be added here.7
laureates <- data.frame(
firstname=c("Alice","Mo","Tomas"),
surname=c("Munro","Yan","Tranströmer"),
bornCountry=c("Canada","China","Sweden"),
age_now=c(82,59,83),
stringsAsFactors=F)
Indexing by row and column
If we want to access parts of the data frame, a single subscript is no longer enough; now we need two subscripts to pick out rows and columns. In general:
data_frame[rows, columns]
gives us only those elements of data_frame in the rows specified by a subscript vector rows and the columns specified by columns. Try it out:
laureates[1, 1]
[1] "Alice"
laureates[1, 2]
[1] "Munro"
In addition to numbers, our subscripts can use the column names:
laureates[1, "firstname"]
[1] "Alice"
laureates[2, "surname"]
[1] "Yan"
And, just as with vectors, subscripts can be vectors, not just single indices:
laureates[3, c("firstname", "surname")]
firstname surname
3 Tomas Tranströmer
Fiddly note: If your data frame subscript expression picks out more than one column, its value is itself a data frame rather than a vector (even if you pick only one row). For the purposes of this lesson, this distinction is not important.
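If you are curious about the distinction anyway, here is a small sketch. It builds a cut-down copy of the table under the hypothetical name mini (so as not to disturb laureates) and uses is.data.frame to ask what kind of thing each subscript expression produces.

```r
# A two-column toy table; "mini" is a name chosen just for this example.
mini <- data.frame(
  firstname = c("Alice", "Mo", "Tomas"),
  surname = c("Munro", "Yan", "Transtromer"),
  stringsAsFactors = FALSE)

one_col <- mini[3, "surname"]                    # one row, one column
two_cols <- mini[3, c("firstname", "surname")]   # one row, two columns

is.data.frame(one_col)    # FALSE: a plain string
is.data.frame(two_cols)   # TRUE: a one-row data frame
```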
Exercise (1)
Write a single expression in terms of laureates to produce the full name of Canada’s laureate. If your answer is a vector of multiple elements, write a more complicated expression that yields a single string. You will have to use a function. The solution is in the answers section.
Omitted indices
Leaving a blank where an index would be means “I want all of ’em”:
laureates[3, ]
firstname surname bornCountry age_now
3 Tomas Tranströmer Sweden 83
laureates[, 2]
[1] "Munro" "Yan" "Tranströmer"
laureates[, c("surname", "bornCountry")]
surname bornCountry
1 Munro Canada
2 Yan China
3 Tranströmer Sweden
This is useful in conjunction with logical indexing:
laureates[c(T, F, T), ]
firstname surname bornCountry age_now
1 Alice Munro Canada 82
3 Tomas Tranströmer Sweden 83
A shorthand
Picking out a single column is so common that R lets you use a more concise syntax:
laureates[, "surname"]
[1] "Munro" "Yan" "Tranströmer"
laureates$surname
[1] "Munro" "Yan" "Tranströmer"
A single column is just…a regular old vector, which can be subscripted in turn:
laureates$surname[2]
[1] "Yan"
But this only works on a single column:
laureates$firstname, surname # error
laureates$(firstname, surname) # sigh
laureates$c(firstname, surname) # alas
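When you want more than one column, go back to the square brackets. The $ shorthand is still handy for building the row subscript, as in this sketch (mini is a hypothetical cut-down copy of the toy table, so that laureates is left alone, and over_80 is an arbitrary name):

```r
mini <- data.frame(
  firstname = c("Alice", "Mo", "Tomas"),
  surname = c("Munro", "Yan", "Transtromer"),
  age_now = c(82, 59, 83),
  stringsAsFactors = FALSE)

# Two columns at once: use brackets with a vector of column names.
mini[, c("firstname", "surname")]

# $ picks out one column, which we compare to 80 to make a logical vector...
over_80 <- mini$age_now > 80

# ...and that logical vector then chooses the rows.
mini[over_80, c("firstname", "surname")]
```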
Some less-made-up data
We’ve worked on miniature data long enough. Let’s work on a bigger table—though not much bigger. R has a special function devoted to reading CSV files from your hard drive. Its input is the name of the file and its output is a data frame.
laureates <- read.csv("laureates.csv",
stringsAsFactors=F)
Please take the magical stringsAsFactors=F incantation on faith for now. This is the first R operation you have seen which involves the hard drive; it is an example of “File I/O.” This is a fount of interesting errors and problems. One of the major flaws of R is its extremely obscure way of telling you about such problems. For example, if it can’t find the file you asked for (either because of a typo in the name, or because it’s looking in the wrong folder), you’ll see an error like:
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open file 'laureates.csv': No such file or directory
One solution is to figure out the full path name of the file you want (i.e., where in your nested folders it lives). If you are used to using the Finder/Explorer to do this, then the simplest way8 to figure this out is with another special R function:
file.choose()
This opens a dialog box; navigate to where you have stored laureates.csv and click “open.” The value of the function now appears, and it is the full path of the file. On my system this looks like:
"/Users/agoldst/Documents/dhru/counting/laureates.csv"
You can now copy and paste this path, in quotes, into read.csv. On my system (but not on yours) this looks like:
laureates <- read.csv("/Users/agoldst/Documents/dhru/counting/laureates.csv",
stringsAsFactors=F)
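Another way to diagnose a missing file, sketched here on the assumption that laureates.csv is supposed to sit in the folder R is currently working in: ask R where it is looking (getwd) and whether it can see the file (file.exists) before trying to read it. The name path is arbitrary.

```r
path <- "laureates.csv"   # change this to wherever your copy lives

getwd()                   # the folder R is currently looking in

if (file.exists(path)) {
  # The file is visible from here, so reading it should work.
  laureates <- read.csv(path, stringsAsFactors = FALSE)
} else {
  # A plainer message than the one R gives by default.
  message("No file called ", path, " in ", getwd())
}
```

The if and else here are new; they simply choose which of the two bracketed expressions to evaluate.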
Properties of the frame
We said before that a CSV file involves very minimal metadata. R stores the same metadata about a data frame. Find the names of the columns of a data frame with a function, names:
names(laureates)
[1] "id" "firstname" "surname"
[4] "born" "died" "bornCountry"
[7] "bornCountryCode" "bornCity" "diedCountry"
[10] "diedCountryCode" "diedCity" "gender"
[13] "year" "category" "overallMotivation"
[16] "share" "motivation" "name"
[19] "city" "country"
And the number of rows (which in this case is the number of laureates):
nrow(laureates)
[1] 106
The logic of the query
Now that we have a longer table (not that long, but still a bit long to see all at a glance), we want to slice and dice it. Indeed, one of the most important things we can do with a table of data is to choose parts of it. Combining what we have seen so far, we can use logical vectors, Boolean operators, and subscripting to pick out parts of a data frame in R.
This is an operation we all do all the time when we search library catalogues or databases. So think of this as a version of a search query. Only instead of clicking some menus and seeing results visually, we have the capacity to store and do further calculations on the results of our queries.
But let’s start with a query. What are the surnames of laureates born in Sweden?
laureates$bornCountry == "Sweden"
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[11] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[21] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[31] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[41] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[51] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[71] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[81] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[91] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
[101] FALSE FALSE FALSE FALSE FALSE FALSE
swedes <- laureates$bornCountry == "Sweden"
laureates$surname[swedes]
[1] "Tranströmer" "Johnson" "Lagerkvist"
[4] "Karlfeldt" "von Heidenstam" "Lagerlöf"
Let’s look at the full rows of the table for laureates born in Sweden, too:
laureates[swedes, ]
id firstname surname born died
3 868 Tomas Tranströmer 1931-04-15 0000-00-00
40 649 Eyvind Johnson 1900-07-29 1976-08-25
63 622 Pär Fabian Lagerkvist 1891-05-23 1974-07-11
78 604 Erik Axel Karlfeldt 1864-07-20 1931-04-08
92 585 Carl Gustaf Verner von Heidenstam 1859-07-06 1940-05-20
98 579 Selma Ottilia Lovisa Lagerlöf 1858-11-20 1940-03-16
bornCountry bornCountryCode bornCity diedCountry
3 Sweden SE Stockholm
40 Sweden SE Svartbjörnsbyn Sweden
63 Sweden SE Växjö Sweden
78 Sweden SE Karlbo Sweden
92 Sweden SE Olshammar Sweden
98 Sweden SE Mårbacka Sweden
diedCountryCode diedCity gender year category overallMotivation
3 male 2011 literature NA
40 SE Stockholm male 1974 literature NA
63 SE Stockholm male 1951 literature NA
78 SE Stockholm male 1931 literature NA
92 SE Övralid male 1916 literature NA
98 SE Mårbacka female 1909 literature NA
share
3 1
40 2
63 1
78 1
92 1
98 1
motivation
3 "because, through his condensed, translucent images, he gives us fresh access to reality"
40 "for a narrative art, far-seeing in lands and ages, in the service of freedom"
63 "for the artistic vigour and true independence of mind with which he endeavours in his poetry to find answers to the eternal questions confronting mankind"
78 "The poetry of Erik Axel Karlfeldt"
92 "in recognition of his significance as the leading representative of a new era in our literature"
98 "in appreciation of the lofty idealism, vivid imagination and spiritual perception that characterize her writings"
name city country
3 NA NA NA
40 NA NA NA
63 NA NA NA
78 NA NA NA
92 NA NA NA
98 NA NA NA
That’s a lot of stuff, so from now we’ll only look at some of the columns while we’re picking out rows. Let’s try another query:
women <- laureates$gender == "female"
laureates[women, c("firstname", "surname", "year")]
firstname surname year
1 Alice Munro 2013
5 Herta Müller 2009
7 Doris Lessing 2007
10 Elfriede Jelinek 2004
18 Wislawa Szymborska 1996
21 Toni Morrison 1993
23 Nadine Gordimer 1991
69 Gabriela Mistral 1945
72 Pearl Buck 1938
81 Sigrid Undset 1928
83 Grazia Deledda 1926
98 Selma Ottilia Lovisa Lagerlöf 1909
Not…many. We can now express more complex ideas, like “All the rows corresponding to women born in Sweden”:
laureates[women & swedes, c("firstname", "surname",
"year")]
firstname surname year
98 Selma Ottilia Lovisa Lagerlöf 1909
Or “all the rows corresponding to women or Swedes”:
laureates[women | swedes, c("firstname", "surname",
"year")]
firstname surname year
1 Alice Munro 2013
3 Tomas Tranströmer 2011
5 Herta Müller 2009
7 Doris Lessing 2007
10 Elfriede Jelinek 2004
18 Wislawa Szymborska 1996
21 Toni Morrison 1993
23 Nadine Gordimer 1991
40 Eyvind Johnson 1974
63 Pär Fabian Lagerkvist 1951
69 Gabriela Mistral 1945
72 Pearl Buck 1938
78 Erik Axel Karlfeldt 1931
81 Sigrid Undset 1928
83 Grazia Deledda 1926
92 Carl Gustaf Verner von Heidenstam 1916
98 Selma Ottilia Lovisa Lagerlöf 1909
Or “all the rows corresponding to women who are not Swedes, but take only the first names and surnames”:
laureates[women & !swedes, c("firstname", "surname")]
firstname surname
1 Alice Munro
5 Herta Müller
7 Doris Lessing
10 Elfriede Jelinek
18 Wislawa Szymborska
21 Toni Morrison
23 Nadine Gordimer
69 Gabriela Mistral
72 Pearl Buck
81 Sigrid Undset
83 Grazia Deledda
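The same kind of query works on the coding conventions noticed earlier, such as 0000-00-00 standing for “still living.” A sketch on four rows copied from the top of laureates.csv, stored under the hypothetical name mini so that laureates is not disturbed:

```r
mini <- data.frame(
  surname = c("Munro", "Lessing", "Pamuk", "Pinter"),
  died = c("0000-00-00", "2013-11-17", "0000-00-00", "2008-12-24"),
  stringsAsFactors = FALSE)

# TRUE wherever the death date is the "still living" placeholder
living <- mini$died == "0000-00-00"

mini$surname[living]    # coded as still living
mini$surname[!living]   # coded with a real death date
sum(living)             # how many are coded as living
```

Note that the query only knows about the coding, not about the world: it reports whatever the file said when it was made.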
Exercise (2)
Write an expression whose value is a data frame containing the names and prize-years of all the laureates who died in a country other than the country of their birth. The solution is in the answers section.
Counting
Now let’s finally use the computer to count (explicitly; secretly, it’s been doing a lot of counting already). For counting in R, the workhorse is the table function. At its simplest, table takes a vector as an input and returns a tabulation9 showing how many times each value in the vector is repeated:
table(c("a", "b", "a", "c", "b"))
a b c
2 2 1
This is already a useful operation. Notice that because we’re always talking to R in vectors, we always start by counting everything. Instead of asking, “How many Swedish laureates?” we could ask, “How many laureates from each country?” This is more information, but in R it is more concise (because more general):
table(laureates$bornCountryCode)
   AT BE BG CA CH CL CN CO CZ DE DK EG ES FI FR GP GR GT HU IE IN IR
 3  1  1  1  2  1  2  2  1  1  7  3  1  4  1 10  1  1  1  1  4  1  1
IS IT JP LC LT MG MX NG NO PE PL PT RO RU SE TR TT UA UK US ZA
 1  6  2  1  1  1  1  1  2  1  5  1  1  5  6  2  1  1  6  8  2
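The same function produces the bag of words described earlier. This is only a sketch: it splits a line on single spaces with strsplit, which is far cruder than what real word-counting on OCR text requires. The names line and words are arbitrary.

```r
line <- "to be or not to be"

# strsplit() returns a list with one entry per input string;
# [[1]] takes the first (here, the only) entry, a vector of words.
words <- strsplit(line, " ")[[1]]

words          # six words, still in order
table(words)   # the bag of words: order is gone, only the counts remain
```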
…and division
The thing about counting is that we’re most often interested not in the question how many? but in how many, out of all of them? Now we can make use of our metadata, and R’s vectorized arithmetic.
table(laureates$bornCountryCode)/nrow(laureates) *
100
           AT     BE     BG     CA     CH     CL     CN     CO     CZ
2.8302 0.9434 0.9434 0.9434 1.8868 0.9434 1.8868 1.8868 0.9434 0.9434
    DE     DK     EG     ES     FI     FR     GP     GR     GT     HU
6.6038 2.8302 0.9434 3.7736 0.9434 9.4340 0.9434 0.9434 0.9434 0.9434
    IE     IN     IR     IS     IT     JP     LC     LT     MG     MX
3.7736 0.9434 0.9434 0.9434 5.6604 1.8868 0.9434 0.9434 0.9434 0.9434
    NG     NO     PE     PL     PT     RO     RU     SE     TR     TT
0.9434 1.8868 0.9434 4.7170 0.9434 0.9434 4.7170 5.6604 1.8868 0.9434
    UA     UK     US     ZA
0.9434 5.6604 7.5472 1.8868
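R also has a built-in function, prop.table, that does this division for you. A minimal sketch on four made-up country codes (not the real data); round, with its digits input, trims the decimal places. The names countries and counts are arbitrary.

```r
countries <- c("SE", "SE", "US", "CA")   # made-up cases for illustration
counts <- table(countries)

prop.table(counts)                    # each count divided by the total
round(prop.table(counts) * 100, 1)    # as percentages, to one decimal place
```

Here SE accounts for two of the four cases, so its share is 0.5, or 50 percent.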
Exercise (3)
Write an expression for a tabulation of the number of men and women to win the Nobel in literature. The solution is in the answers section.
Cross-tabulation
Cross-tabulation means answering questions like, “How many of each gender were born in each country?”
table(laureates$bornCountryCode, laureates$gender)
female male
0 3
AT 1 0
BE 0 1
BG 0 1
CA 1 1
CH 0 1
CL 1 1
CN 0 2
CO 0 1
CZ 0 1
DE 0 7
DK 1 2
EG 0 1
ES 0 4
FI 0 1
FR 0 10
GP 0 1
GR 0 1
GT 0 1
HU 0 1
IE 0 4
IN 0 1
IR 1 0
IS 0 1
IT 1 5
JP 0 2
LC 0 1
LT 0 1
MG 0 1
MX 0 1
NG 0 1
NO 0 2
PE 0 1
PL 1 4
PT 0 1
RO 1 0
RU 0 5
SE 1 5
TR 0 2
TT 0 1
UA 0 1
UK 0 6
US 2 6
ZA 1 1
Think of this as a tabulation of tabulations: first R splits up the table by bornCountryCode
, then splits up the result by gender
before giving us the count. Notice that the result is now not a single row of numbers but many rows (or, more precisely, a two-dimensional array—almost like a data frame).
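Here is a small sketch of how a two-way table can be picked apart once you have it. The data frame `toy` and its contents are made up for illustration.

```r
# Made-up data frame standing in for laureates
toy <- data.frame(
  country = c("SE", "SE", "FR", "US", "US", "FR"),
  gender  = c("female", "male", "male", "female", "male", "male"),
  stringsAsFactors = FALSE
)

xt <- table(toy$country, toy$gender)

dim(xt)             # number of rows and number of columns
xt["US", "female"]  # one cell, picked out by row name and column name
rowSums(xt)         # adding up across genders recovers the single tabulation
```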
Sorting
Let’s return to single (rather than cross) tabulations for a bit. After how many? comes which is most or which is least? Tables (and vectors, in fact) can be rearranged in order by the sort
function.
laureate_countries <- table(laureates$bornCountryCode)
sort(laureate_countries)
AT BE BG CH CO CZ EG FI GP GR GT HU IN IR IS LC LT MG MX NG PE PT RO
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
TT UA CA CL CN JP NO TR ZA    DK ES IE PL RU IT SE UK DE US FR
 1  1  2  2  2  2  2  2  2  3  3  4  4  5  5  6  6  6  7  8 10
The sorted result normally goes from least to most, but often the reverse is easier to read. For that, sort
is invoked with a named parameter, decreasing
:
sort(laureate_countries, decreasing = T)
FR US DE IT SE UK PL RU ES IE    DK CA CL CN JP NO TR ZA AT BE BG CH
10  8  7  6  6  6  5  5  4  4  3  3  2  2  2  2  2  2  2  1  1  1  1
CO CZ EG FI GP GR GT HU IN IR IS LC LT MG MX NG PE PT RO TT UA
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
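A sorted table can be subscripted like any vector, which is handy when you only want the top of the list. A minimal sketch with made-up codes:

```r
# Made-up country codes
codes <- c("SE", "FR", "SE", "US", "FR", "SE", "DE")
code_counts <- table(codes)

# Largest counts first; TRUE is the full spelling of T
ranked <- sort(code_counts, decreasing = TRUE)

# The first two entries of the sorted table
ranked[1:2]

# Just the names of those entries
names(ranked)[1:2]
```

`T` is an ordinary variable that happens to hold `TRUE`, so spelling out `TRUE` is a little safer in code you plan to keep.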
Exercise (4)
Write an expression for the top three countries-of-death of the Nobel laureates. This is a trick question. The solution is in the answers section.
More, and messier, data
Let’s count something else, something we really couldn’t just eyeball. The Modernist Journals Project has provided text-formatted tables of item metadata for some of the periodicals they have digitized. I downloaded these tabulations for Poetry and The Crisis from http://sourceforge.net/projects/mjplab/files/ and included them in the sample data archive. But as MJP may update the data, please re-download from that link if you are using this data for research of any kind.
Let’s start just by seeing whether MJP has provided us with a nice CSV format we can use right off. The readLines
function looks for a file on disk and returns it as a vector of lines. Normally you want the whole file, but for our purposes we can specify the named parameter n
to get just the first few lines:
readLines("Poetry_2.everytitle.txt", n = 4)
[1] "creator | editor | translator | title | genre | pages | volume | issue | date | journal title | journal subtitle | issue name | journal editor | publisher | journal location | issue length (pp) | issue height (cm) | issue width (cm)"
[2] "Ficke, Arthur Davison | | | Poetry | poetry | 1-2 | 1 | 1 | 1912-10-01 | Poetry | A Magazine of Verse | | Monroe, Harriet | Harriet Monroe | Chicago | 40 | 20 | 14.7 "
[3] "Moody, William Vaughan | | | I Am the Woman | poetry | 3-6 | 1 | 1 | 1912-10-01 | Poetry | A Magazine of Verse | | Monroe, Harriet | Harriet Monroe | Chicago | 40 | 20 | 14.7 "
[4] "Pound, Ezra | | | To Whistler, American | poetry | 7-7 | 1 | 1 | 1912-10-01 | Poetry | A Magazine of Verse | | Monroe, Harriet | Harriet Monroe | Chicago | 40 | 20 | 14.7 "
!?!!! This is a delimited text file of data, but it isn’t comma-delimited. Instead it uses |
as the delimiter. A close reading of help(read.csv)
and some experimentation yielded me the following command, which uses read.table
, a variant of read.csv
that can deal with files that use things other than commas:
poetry_titles <- read.table("Poetry_2.everytitle.txt",
sep = "|", strip.white = T, stringsAsFactors = F,
quote = "", header = T)
That’s probably the murkiest line of R code in this workshop.10
crisis_titles <- read.table("Crisis_2.everytitle.txt",
sep = "|", strip.white = T, stringsAsFactors = F,
quote = "", header = T)
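To experiment with these parameters without the MJP files on hand, you can feed `read.table` a few lines directly through its `text` parameter. The lines below are made up for illustration. They imitate the pipe-delimited layout and include a blank field, an apostrophe, and double quotes.

```r
# A few made-up lines in the same pipe-delimited layout
toy_lines <- c(
  "creator | title | journal title",
  "Doe, Jane | A Poem | Toy Review",
  " | Don't Look | Toy Review",
  "Roe, Richard | \"Quoted\" Title | Toy Review"
)

toy_titles <- read.table(
  text = toy_lines,          # read from a character vector instead of a file
  sep = "|",                 # fields are separated by |
  strip.white = TRUE,        # drop the spaces around each field
  stringsAsFactors = FALSE,  # keep strings as strings
  quote = "",                # treat ' and " as ordinary characters
  header = TRUE              # the first line holds the column names
)

names(toy_titles)
toy_titles$title
```

Try removing `quote = ""` to see what goes wrong with the apostrophe.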
A comparison
Now that we have the two tables of items from the two magazines, let’s begin to compare them. One of the most interesting metadata fields is the genre assigned by the TEI encoders to each item. Let’s compare genre proportions.
table(poetry_titles$genre)/nrow(poetry_titles)
advertisements articles images letters
0.0284599 0.2683039 0.0002295 0.0257058
letters; poetry poetry
0.0002295 0.6770714
table(crisis_titles$genre)/nrow(crisis_titles)
advertisements articles drama fiction
0.0697295 0.3383862 0.0009328 0.0233209
images letters poetry
0.4626866 0.0496735 0.0552705
Combine and recount
From here on it will be easier to have one data frame that combines the two tables. Since they have the same columns, we can simply “stack” one on top of the other. rbind
is R’s function for stacking data frames.
mags <- rbind(poetry_titles, crisis_titles)
Since the MJP data has scrupulously recorded the journal title for every item, it’s easy to cross-tabulate genres by journals:
table(mags$genre, mags$journal.title)
Crisis Poetry
advertisements 299 124
articles 1451 1169
drama 4 0
fiction 100 0
images 1984 1
letters 213 112
letters; poetry 0 1
poetry 237 2950
(Fussily, R has renamed what the original file called journal title
to journal.title
.)
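A minimal sketch of the stack-then-count pattern, using two tiny made-up data frames (`toy_a` and `toy_b` are invented stand-ins for the two journals):

```r
# Two made-up item tables with the same columns
toy_a <- data.frame(genre = c("poetry", "poetry", "articles"),
                    journal.title = "Toy A",
                    stringsAsFactors = FALSE)
toy_b <- data.frame(genre = c("images", "articles"),
                    journal.title = "Toy B",
                    stringsAsFactors = FALSE)

# Stack the rows; the column names must match
toy_mags <- rbind(toy_a, toy_b)
nrow(toy_mags)

# Genres that never occur in one journal show up as zeros
table(toy_mags$genre, toy_mags$journal.title)
```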
Who’s in both?
Earlier on we saw the %in%
operator. Here’s a chance to apply it (returning to our separate data frames for the two journals):
poetry_in_crisis <- poetry_titles$creator %in%
crisis_titles$creator
That gives a logical vector which we can use as a subscript:
shared_auths <- poetry_titles$creator[poetry_in_crisis]
If you print shared_auths
you will see repeated names, since, remember, the table is of item metadata. If we want a list of names where each occurs only once, we can use the unique
function:
unique(shared_auths)
[1] "" "Lindsay, Nicholas Vachel"
[3] "Noyes, Alfred" "Kreymborg, Alfred"
[5] "Anonymous" "Johnson, Fenton"
[7] "Cleghorn, Sarah N."
Whoops! Let’s tidy that up by getting rid of the results for blanks and Anon.
shared_auths <- unique(shared_auths[shared_auths != "" &
shared_auths != "Anonymous"])
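The same steps can be tried on two short made-up vectors of names (none of these values come from the MJP data). This sketch also uses `%in%` a second time, as a compact way to drop several unwanted values at once.

```r
# Made-up creator names for two journals
first_names  <- c("Doe, Jane", "", "Roe, Richard", "Roe, Richard", "Anonymous")
second_names <- c("Roe, Richard", "", "Anonymous", "Poe, Pat")

# TRUE wherever a name from the first vector also appears in the second
in_both <- first_names %in% second_names

shared <- first_names[in_both]

# Drop blanks and "Anonymous", then remove repeats
shared <- unique(shared[!(shared %in% c("", "Anonymous"))])
shared
```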
Now we can take a look in our table mags
in order to tally up the activities of these authors who contributed to both periodicals:
mags_shared <- mags[mags$creator %in% shared_auths,
]
table(mags_shared$journal.title, mags_shared$genre,
mags_shared$creator)
, , = Cleghorn, Sarah N.
articles fiction letters poetry
Crisis 1 0 0 0
Poetry 0 0 0 1
, , = Johnson, Fenton
articles fiction letters poetry
Crisis 0 3 0 4
Poetry 0 0 0 5
, , = Kreymborg, Alfred
articles fiction letters poetry
Crisis 0 0 0 1
Poetry 6 0 1 14
, , = Lindsay, Nicholas Vachel
articles fiction letters poetry
Crisis 0 1 1 0
Poetry 2 0 0 6
, , = Noyes, Alfred
articles fiction letters poetry
Crisis 0 0 0 2
Poetry 0 0 0 1
This, notice, is a three-way contingency table. R shows it to us as a series of two-way contingency tables, one for each “creator” of items.11
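A three-way table can be subscripted with three indices, one per dimension. Here is a small sketch on made-up data (the `toy` data frame is invented for illustration):

```r
# Made-up item metadata
toy <- data.frame(
  journal = c("A", "A", "B", "B", "B"),
  genre   = c("poetry", "articles", "poetry", "poetry", "fiction"),
  creator = c("Doe, Jane", "Doe, Jane", "Doe, Jane",
              "Roe, Richard", "Roe, Richard"),
  stringsAsFactors = FALSE
)

t3 <- table(toy$journal, toy$genre, toy$creator)

dim(t3)                            # journals x genres x creators
t3["B", "poetry", "Roe, Richard"]  # a single cell
t3[, , "Doe, Jane"]                # one two-way "slice"
```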
From tables back to data frames
Now I’ve been evasive about just what kind of value the table
function returns. It looks like tabular data, it’s called a “table,” but is it a data frame? No, it’s actually another member of R’s bestiary of complex types, namely, a table. Well that’s helpful. For practical purposes what matters is learning how to convert a table to a data frame so that we can do everything we know how to do to data frames. R provides a function for this, as.data.frame
:
laur_country_tab <- table(laureates$bornCountryCode)
laureate_countries <- as.data.frame(laur_country_tab)
Now print out laureate_countries
to see what this data frame looks like. You might notice that the column headers are the supremely undescriptive Var1, Freq. Assign new column names using the following syntax12:
names(laureate_countries) <- c("country", "count")
Now laureate_countries
is a data frame with two columns containing the tabulated counts, which you can explore as we have been exploring the untabulated data.
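As a sketch of what "everything we know how to do to data frames" can look like after the conversion, here is the same pattern on a made-up vector of codes. The `stringsAsFactors = FALSE` parameter to `as.data.frame` keeps the tabulated values as strings.

```r
# Made-up country codes
codes <- c("SE", "FR", "SE", "US", "FR", "SE")

toy_counts <- as.data.frame(table(codes), stringsAsFactors = FALSE)
names(toy_counts) <- c("country", "count")

# Ordinary data frame operations now apply:
# keep only the rows with a count above 1
toy_counts[toy_counts$count > 1, ]

# reorder the rows from largest count to smallest
toy_counts[order(toy_counts$count, decreasing = TRUE), ]
```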
Visualization, grammatically
The last part of this workshop (which we didn’t get to do on April 30) introduces one more way to explore tabular data, and especially counted-up tabular data: visualization. Here are two principles for thinking about visualization in this context.
- A visualization transforms data inputs into graphical outputs. (Sound familiar?)
- A grammatical visualization consistently transforms dimensions of the data into aesthetic dimensions of the output.
R users can avail themselves of a very powerful software library for making grammatical visualizations, Hadley Wickham’s ggplot2. Once you’ve learned the basics of R data types and making data frames, you can start making plots with ggplot.
Loading the library
Much of what makes R useful is not part of the basic program you got when you installed R. Instead, you can obtain extra source code that extends R by adding new functions you can use in your own R code. ggplot is one such “R package.” It is easy to obtain: in RStudio, choose “Install Packages…” from the “Tools” menu, type in “ggplot2,” and click “Install.”
Once ggplot2 has been installed, it can be loaded with the following function call:
library("ggplot2")
Making a point (plot)
Let’s start by thinking through a simple point plot. Here’s some new data: a table in which each row gives the number of translations published in the United States in the given year, according to the UNESCO Index Translationum.
us_tx <- read.csv("us-trans.csv")
us_tx
year translations
1 1979 1634
2 1980 1386
3 1981 1201
4 1982 1174
5 1983 1250
6 1984 1706
7 1985 1574
8 1986 1717
9 1987 1653
10 1988 1846
11 1989 1856
12 1990 1906
13 1991 2074
14 1992 2167
15 1993 2185
16 1994 1915
17 1995 2081
18 1996 1998
19 1997 1936
20 1998 2023
21 1999 1888
22 2000 1343
23 2001 1343
24 2002 1494
25 2003 1356
26 2004 1666
27 2005 2057
28 2006 2289
29 2007 2195
30 2008 1431
The plot grammar
Here is the “grammar” of the point plot:
- Years on the x-axis, from left to right
- Number of translations on the y-axis, with 0 on the bottom
- For each row of data, draw a point.
The code
ggplot’s qplot
function requires a data frame and a specification of the plot grammar. The specification is done using named parameters to the function:
qplot(x=year, # aesthetics (mapping)
y=translations,
geom="point", # geometry (shape)
data=us_tx) # data source
Our data frame has columns year
and translations
. So we tell qplot
we want x to be year
, and y to be translations. The last part of our specification was the decision to draw a point for each row of data. This is set by the geom="point"
parameter. Finally, we tell qplot
what data frame to work on using the data=us_tx
parameter.13
Notice that qplot
goes on to make a lot of further choices for us: it picks where to start and end the x and y axes and how many “ticks” to label along each axis; it adds a shaded grid to help you read off numbers from the chart; indeed, it’s made a choice about the size of the point it draws. qplot
offers a bazillion parameters for adjusting all of these things, but one of its virtues, when it comes to starting out with counting, is that its default guesses are often pretty darn good. So you can wait to learn about how to tweak the visualization until you have gotten your bearings just getting the durn thing to make plots.
Conjugating the plot
Points don’t make the year-to-year trend particularly easy to see. From the grammatical perspective, we can think about other choices of “geom” without changing our decision about how to map x and y. I think of this as the grammar of conjugating a plot in different shapes.
Data over time are often shown with a line:
qplot(x=year,
y=translations,
geom="line", # change the shape
data=us_tx)
Since we’re counting things, we might also want to fill in the area below the line down to zero. This gives a “filled area” plot:
qplot(x=year,
y=translations,
geom="area", # change the shape yet again
data=us_tx)
It’s worth thinking about what the different plots emphasize differently. One further shape possibility that might have occurred to you is a bar plot. As you might hope, to do this you pass geom="bar"
, but in this case one extra parameter to qplot
is also needed:
qplot(x=year,
y=translations,
geom="bar",
stat="identity",
data=us_tx)
For now, just take this as a quirk of qplot
: to say “I want a bar plot,” you have to say geom="bar",stat="identity"
.14
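Depending on the installed version of ggplot2, `qplot` may reject the `stat` parameter, and recent releases mark `qplot` itself as deprecated. As a hedged alternative, the same grammar can be spelled out with the `ggplot` function and `geom_col` (available in ggplot2 2.2.0 and later), which draws bars whose heights are the y values themselves. This sketch assumes a reasonably recent ggplot2 is installed and retypes only the first five rows of the translation counts.

```r
library("ggplot2")

# The first five rows of the translation counts shown above
us_tx_head <- data.frame(
  year = 1979:1983,
  translations = c(1634, 1386, 1201, 1174, 1250)
)

# ggplot() names the data and the aesthetic mappings;
# geom_col() adds bars using the identity stat
p <- ggplot(us_tx_head, aes(x = year, y = translations)) +
  geom_col()
p
```

The three parts of the grammar are still there: a data source, a mapping of columns to x and y, and a choice of shape.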
Scales, in general
It is fairly straightforward for us to figure out how to map a quantity like “number of translations” to a spatial dimension (y). Not all aesthetic mappings are so obvious. In particular, how do we map categorical data into the visual?
ggplot tries hard to do what you ask. If you tell it that x
or y
should be mapped from a categorical value, it will make its best guess.
So let’s return to our data frame counting up Nobel laureates by country of birth, laureate_countries
, and consider:
qplot(x=country,y=count,geom="point",
data=laureate_countries)
What is the grammar of this plot?
- ???
- laureate count on y axis
- point for each country
The “best guess”, here, was to arrange the country codes in alphabetical order along the x axis. This is not terrible, though given the number of countries, it’s pretty hard to read, except maybe to notice that France (FR
) is champ. Even there, the point is so far from the x axis that it takes work to match the point to the country. That we could fix by using a different shape. Let’s try bars, not omitting the magical stat="identity"
:
qplot(x=country,y=count,
geom="bar",
stat="identity",
data=laureate_countries)
That’s a little better, though not much; there are so many country codes that they get squashed together here. In RStudio, you can click the “zoom” button in the plot pane to see a bigger version of the plot. (Look for RStudio’s convenient buttons for saving plots as well.)
Dating
Now let’s do a little visualization of tallies from our periodical metadata set. A basic counting question: did the number of articles published in issues of Poetry change over time?
First we have to create the necessary data frame:
poetry_articles <- poetry_titles[poetry_titles$genre ==
"articles", ]
art_series <- as.data.frame(table(poetry_articles$date))
names(art_series) <- c("date", "count")
Now we are in a position to plot the series:
qplot(x = date, y = count, geom = "bar", stat = "identity",
data = art_series)
Notice something strange has happened on the x axis. What type of data is art_series$date
?
art_series$date[1]
[1] 1912-10-01
123 Levels: 1912-10-01 1912-11-01 1912-12-01 1913-01-01 ... 1922-12-01
As far as R knows, the date is a factor (which it has “cleverly” converted from its original string format). Now fortunately the convention used for notating dates here ensures that alphabetical order is also chronological order (why?), but qplot
does not know that: as far as it’s concerned, art_series$date
is a categorical variable.
R has a specialized data type for dates, however, and a function for turning strings in YYYY-MM-DD
format into that type. Here’s how we do that, adding a new column to our art_series
data frame:
art_series$converted_date <- as.Date(art_series$date)
Now try:
qplot(x = converted_date, y = count, geom = "bar",
stat = "identity", data = art_series)
qplot
now understands that x
is representing a date, and labels the axis more sensibly. Whether the plot tells us something intelligible about the changing editorial policies of Poetry magazine is another question.15
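The difference between a factor of date-like labels and real dates can be checked without any plotting. A minimal sketch, using three date strings typed in by hand:

```r
# Date strings in YYYY-MM-DD format, deliberately out of order
date_strings <- c("1913-01-01", "1912-10-01", "1912-12-01")

# As a factor, the values are just labeled categories
date_factor <- factor(date_strings)
class(date_factor)

# as.Date gives R's date type, which supports date arithmetic
dates <- as.Date(date_factor)
class(dates)
diff(sort(dates))   # gaps between successive dates, in days

# Alphabetical and chronological order coincide for this format
identical(sort(date_strings), as.character(sort(dates)))
```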
Counting in more than one dimension
We’ve already seen two- and even three-way contingency tables. How are these to be plotted? Let’s use our combined mags
data frame on Poetry and The Crisis to slot in one more piece of the visualization puzzle.
First, as before, we construct a data frame from a table, this time with counts by date and genre. Thus each row answers the question, “How many of this kind of item were published in Poetry on this date?”
poetry_genre_series <- as.data.frame(table(poetry_titles$date,
poetry_titles$genre))
names(poetry_genre_series) <- c("date","genre",
"count")
# Convert the string-format dates to R's Date type
poetry_genre_series$conv_date <- as.Date(poetry_genre_series$date)
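It helps to look at what this conversion produces before plotting it. In this sketch on made-up data, the two-way table becomes a "long" data frame with one row for every combination of date and genre, including combinations that never occur.

```r
# Made-up item metadata: three items, two dates, two genres
toy <- data.frame(
  date  = c("1912-10-01", "1912-10-01", "1912-11-01"),
  genre = c("poetry", "articles", "poetry"),
  stringsAsFactors = FALSE
)

toy_series <- as.data.frame(table(toy$date, toy$genre))
names(toy_series) <- c("date", "genre", "count")
toy_series$conv_date <- as.Date(toy_series$date)

# Four rows: 2 dates x 2 genres, one of them with a count of zero
toy_series
```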
How to count this data in a plot? Thinking grammatically, we want to add a new dimension to our visual mappings: we will use the two spatial dimensions for time and counts, as before, but now we will indicate a categorical variable using another aspect of the visual: color. Let’s start with a point plot that shows how many items of each genre were published per issue of Poetry:
- Issue dates on the x axis, left to right
- Item counts on the y axis, bottom to top
- Distinguish genres by color
- One point for each row of the table
qplot(x=conv_date,y=count,color=genre,geom="point",
data=poetry_genre_series)
That tells us a few things, and reveals that qplot
will produce a legend for us once we introduce a color=
visual mapping. But it could be made easier to read. One possibility would be to use a connected line rather than points. geom="line"
is what we need…but we also have to tell qplot
which points to connect using the group=
parameter.
qplot(x=conv_date,y=count,color=genre,group=genre,
geom="line",data=poetry_genre_series)
This noisy plot helpfully indicates that Poetry did indeed consistently publish mostly poetry items over time, though it might also help you pick out some interesting issues to look at in which the generic mixture is unusual.
To aid that, however, the lines are not as informative as a visual grammar that allows you to make a clearer comparison between genres in each year. One possibility for that would be to use bars, but to stack the bars for genres on top of one another.
We already know to tell qplot
geom="bar",stat="identity"
. Two other parameters have to change. To specify the color of the bars, one uses fill=
rather than color=
. (Graphics R fun.) To specify stacked bars, one uses position="stack"
, which is at least not totally opaque.
qplot(x = conv_date, y = count, fill = genre, geom = "bar",
stat = "identity", position = "stack", data = poetry_genre_series)
Small multiples
So far, we’ve gotten three columns of data onto a single plot. Though it’s possible to squeeze in more, ggplot gives us another option that is often more useful. This is the technique of small multiples: make multiple copies of the plot for different slices of the data.
So, we could redo our plot of genres in Poetry as a row of plots, one for each genre. Think of it as the visual equivalent of embedding in grammar:
- Make a plot for each genre, arranged from left to right in alphabetical order, in which:
- years are on the x axis
- counts of items are on the y axis
- draw a bar for each year
ggplot refers to this as “faceting”:
qplot(x=conv_date,y=count,
geom="bar",
stat="identity", # as for single plot
facets= . ~ genre, # faceting
data=poetry_genre_series)
The bizarre expression . ~ genre
is a special formula value. In this context, it means “one row, with plots for each value of genre
.” For a vertical column of plots, you’d use genre ~ .
.
But why stop there? If a two-way contingency table was represented as a single row of plots, we can represent a three-way contingency table as a table of plots. We have a combined data frame, mags
, for Poetry and The Crisis. Let’s count up items by genre in the two journals:
genre_series <- as.data.frame(table(mags$date,
mags$genre,mags$journal.title))
names(genre_series) <- c("date","genre","journal",
"count")
## Convert the string-format dates to R's Date type
genre_series$conv_date <- as.Date(genre_series$date)
Now we can make a collection of plots, with rows of plots for each genre and two columns of plots, one for each of the two journals:
qplot(x=conv_date,y=count,group=genre,
facets=genre ~ journal,geom="bar",
stat="identity",data=genre_series)
This particular graphic might or might not suggest some avenues for further investigation, though to me the main thing it shows is that sometimes a big plot showing all the data doesn’t tell you more than a smaller table might:
table(mags$genre, mags$journal.title)
Crisis Poetry
advertisements 299 124
articles 1451 1169
drama 4 0
fiction 100 0
images 1984 1
letters 213 112
letters; poetry 0 1
poetry 237 2950
The overall generic mixtures of these two periodicals, and the importance of the image in the tallies of The Crisis, are perhaps the main facts of interest here. But perhaps the shifts over time suggest possibilities too.
Counting on: further reading
Introductions to R
Navarro, Daniel. Learning Statistics with R. http://health.adelaide.edu.au/psychology/ccs/teaching/lsr/. Pts. 2–3. This (free draft) introductory statistics textbook for psychology students includes an especially lucid introduction to R.
Jockers, Matthew. Text Analysis with R for Students of Literature. Springer, forthcoming. This textbook in preparation introduces R with a focus on analyzing literary texts.
The R Project. An Introduction to R. This, from the creators of R, is often frustrating (and tends to assume quite a bit of programming and statistical experience).
Visualization
Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer, 2009. http://dx.doi.org/10.1007/978-0-387-98141-3. Rutgers Library has online access to this quite lucid exposition by ggplot’s author.
Wickham, Hadley. Online documentation for ggplot2. http://docs.ggplot2.org/. Reprehensibly sparse.
Wilkinson, Leland. The Grammar of Graphics. 2nd ed. Springer, 2005. http://link.springer.com/book/10.1007/0-387-28695-0. Rutgers Library has online access to this, the theoretical basis for ggplot.
Solutions to exercises
1. Canada’s laureate
paste(laureates[1, "firstname"], laureates[1, "surname"])
[1] "Alice Munro"
2. Exiles and émigrés
laureates[laureates$bornCountryCode
!= laureates$diedCountryCode,
c("surname","year")]
surname year
1 Munro 2013
2 Yan 2012
3 Tranströmer 2011
4 Vargas Llosa 2010
5 Müller 2009
6 Le Clézio 2008
7 Lessing 2007
8 Pamuk 2006
10 Jelinek 2004
11 Coetzee 2003
12 Kertész 2002
13 Naipaul 2001
14 Xingjian 2000
15 Grass 1999
16 Saramago 1998
17 Fo 1997
20 Oe 1994
21 Morrison 1993
22 Walcott 1992
23 Gordimer 1991
27 Brodsky 1987
28 Soyinka 1986
29 Simon 1985
32 García Márquez 1982
33 Canetti 1981
34 Milosz 1980
36 Singer 1978
38 Bellow 1976
41 White 1973
42 Böll 1972
45 Beckett 1969
47 Asturias 1967
48 Agnon 1966
51 Seferis 1963
53 Andric 1961
54 Perse 1960
57 Camus 1957
58 Jiménez 1956
66 Eliot 1948
68 Hesse 1946
69 Mistral 1945
76 Bunin 1933
79 Lewis 1930
80 Mann 1929
81 Undset 1928
84 Shaw 1925
86 Yeats 1923
91 Gjellerup 1917
95 Hauptmann 1912
96 Maeterlinck 1911
100 Kipling 1907
102 Sienkiewicz 1905
104 Bjørnson 1903
3. Women and men
table(laureates$gender)
female male
12 94
4. Countries of death. We’ll always have…
A table can be indexed like a vector. It turns out, however, that the number one “country” is a blank (the living laureates). Hence the expression is:
sort(table(laureates$diedCountry), decreasing = T)[1:4]
                       France United Kingdom            USA
            19             17              9              9
Edited 4/30/14 by AG: added slides link.
Edited 5/19/14 by AG: added workshop notes.
-
James English, “Everywhere and Nowhere: The Sociology of Literature After ‘the Sociology of Literature,’” NLH 41, no. 2 (Spring 2010): xii–xiii.
-
Source: requests to api.nobelprize.org. See http://www.nobelprize.org/nobel_organizations/nobelmedia/nobelprize_org/developer. To construct the laureates.csv file, I used the R code in this gist.
-
Source: a wordcounts CSV file from a https://constellate.org/ request.
-
Words known to R include function names, variable names, and file paths (particularly handy). These terms are explained further on.
-
Actually R may not bother to figure out the value until you actually use it, because R is (this is the technical term, really) lazy.
-
If you want to get fancy, try outer(c(T,F),c(T,F),"&") and the same expression with "|". See help("outer").
-
By default, string values are transformed into factors, the R type for representing categorical data. This is useful in some cases but for now the difference will be confusing. stringsAsFactors=F ensures that our strings stay stringy.
-
The alternative approach is to set R’s “working directory” to the folder containing laureates.csv. Do this with the setwd() function (see help("setwd")).
-
Slight hand-waving here, because table returns a data type we haven’t discussed yet. For practical applications we’re mostly going to see tables that look like data frames if you squint.
The return type of table is an object of class table, which is a subclass of array. An array is a generalization of a vector to any number of indices; a vector is isomorphic to a 1-dimensional array, a matrix is a 2-dimensional array, etc. Good, I’m glad we cleared that up.
-
sep specifies the delimiter. It can only be one character, but fortunately the strip.white parameter tells R to get rid of white space before and after delimiters. Because MJP has not escaped single or double quotes, R will choke on apostrophes unless we tell it to take each table entry literally rather than trying to look for pairs of ' everywhere. Thus quote="". stringsAsFactors we’ve seen before. Finally, read.table must explicitly be told that a header line is included with header=T.
-
The possibility of three-way or n-way contingency tables is the reason a table is an array rather than a vector or data frame. The three-way table can be subscripted with expressions like t[i, j, k].
-
It is also possible to get more literate column names by making the table with the xtabs function before converting to a data frame. In this case: xtabs(~ bornCountryCode,laureates). But there’s something funny here, which is best left for a later lesson.
All right, if you really want, I’ll tell you, since it only took me twenty re-readings of the help files to figure it out. xtabs requires that you indicate which columns of the data frame to tabulate with a “one-sided formula” of the form ~ col1 + col2 + col3.... It sets the dimnames of the resulting array from the formula, and as.data.frame derives column and row names from that. I hope you’re happy now.
-
If you’ve been following carefully, you might be wondering how in the world x=year could be an acceptable function parameter. There is no variable named year, so how does R know what values to plot? Shouldn’t it be x="year" or x=us_tx$year or something like that? The answer is that qplot pretends that each of the columns of the data frame specified by the data parameter is a variable in its own right: when it goes to figure out the value of its x parameter, it will be able to look up year in the data frame. The R-speak way of saying this is that the parameters are “evaluated in the data frame.” (If you’ve been following very closely, you might even have a guess as to why R doesn’t give an object 'year' not found error before qplot has a chance to evaluate it in the data frame.)
-
The reason for this lies in an aspect of the grammar I passed over. Sometimes we transform the data before mapping it into the visual dimension. This transformation intervening between the data and the visual mapping is known to ggplot as the “stat.” For example, sometimes we count up how often a given value occurs: this is what we have been doing with the table function, but qplot will do this for you automatically if you supply the raw, untallied data and then use the bin stat. In fact, since bar plots are so commonly used to show tallies of this kind, by default when you set geom="bar" qplot assumes stat="bin" and maps y to the tallies of x values.
If this seems obscure, try qplot(x=gender,geom="bar",data=laureates). Notice that there’s no explicit y mapping.
In the case of our yearly translation counts, however, we do not want the heights of the bars to correspond to the number of times a given count of translations occurs in the data! We just want the height of the bar to be equal to the number of translations. This is the simplest stat of all, the identity transformation, which leaves all values unchanged. But we have to set it explicitly using stat="identity". In fact, if you leave off this parameter, qplot will give you an error message that tries to tell you what I’ve just told you but with even less clarity.
-
The unsightly striation—random white space between the bars here and there—is one of the few moments when ggplot’s defaults let you down, visually.