Friday, February 21, 2014
- Work-in-progress discussion, 2:00 pm–3:30 pm
Murray Hall, Room 107
510 George St., New Brunswick
- Network analysis workshop, 4:30 pm–6:30 pm
Alexander Library, Room 413
169 College Ave., New Brunswick
Taught by Hoyt Long and Richard So (University of Chicago).
Hoyt Long and Richard So join us to discuss their work in progress in their project Literary Networks: New Computational Methods in the Sociology of Culture and to introduce one of their key analytical techniques. They will discuss a precirculated paper (see below) and then lead a workshop introducing the network visualization and analysis tool Gephi and its application to literary-historical data. You are welcome to join either the work-in-progress discussion, the Gephi workshop, or both.
RSVP to Vishal Kamath at firstname.lastname@example.org.
For the work-in-progress discussion, the precirculated reading is available on Sakai: [Edit: precirculated reading no longer available here. Please check literarynetworks.uchicago.edu for the latest circulating version of “Who is Thomas Curtis Clark? Modernist Networks of Exclusion.”]
Hoyt Long is an Assistant Professor of Japanese Literature at the University of Chicago. His research and teaching center on modern Japan, with particular interests in regional literature, publishing history, media and communication, and environmental history. He also has an interest in the application of social-scientific methods to the study of how texts and ideas emerge and circulate within social and material systems.
Richard Jean So is an Assistant Professor of English at the University of Chicago. His teaching and research interests center on modern American literature in a transnational context. Within this area, he is interested in theories of cultural transnationalism, the history of media and communications, and the “Pacific” (which includes U.S., Asian American, and East Asian cultures) as a coherent area of study. He also does substantial work in the digital humanities. He is interested in the use of new computational and social scientific methods, such as text mining, to model a form of textual criticism that mediates between distant and close reading approaches.
Before the workshop
The computers in Alexander 413 will be set up for this workshop. If you wish to use your own machine instead, a little setup is required first:
- Download Gephi and run the installer. (Follow the installation instructions, unless you are a Mac user, in which case ignore the outdated instructions and see the next step).
- Try to run Gephi. Mac users should be prompted to install Java if necessary. If it runs, you’re set. If you encounter problems with Java and Gephi, you unfortunately have a difficult yak-shaving task ahead of you, and you may wish simply to use our lab machines for the workshop.
- In Gephi, install the following plugins (instructions for installing plugins):
- Force Atlas 3D
- Noverlap Layout
- Download the sample data files (zip archive) and unzip them somewhere you can find them.
Notes from the Workshop
Supplied by Hoyt Long, University of Chicago; somewhat expanded by Andrew Goldstone
Open Gephi and look over the layout of the interface. There are three panels: Overview, Data Laboratory, and Preview. Key menus: Workspace, Plugins, and Window. (Also, install plugins under Tools: Plugins.)
Importing an edgelist and adjusting display properties
Here, we will use SampleData1.csv (from the sample data).
Working in the overview
Create a new project (File: New Project); look over the various sections that make up each panel. In the Overview, you’ll find partition and layout on the left; some graph properties, filters, and statistics on the right. In Preview, you can adjust the visualization options. But first you need data.
Open the SampleData1.csv file (using File: Open). You have a choice of whether the network is "directed" or "undirected" (does it matter whether we say the edge runs from A to B or from B to A?). In this case, choose "Undirected." We will ignore the "Time Frame" option for now. (This option allows you to import several files, each one representing the same network at a different point in time. See wiki.gephi.org/index.php/Import_Dynamic_Data.) By default, the "create missing nodes" option is checked, which is what we want. (Gephi needs both a list of edges and a list of nodes, but it can create the list of nodes just by looking at all the endpoints of the edges.)
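To make the "create missing nodes" idea concrete, here is a minimal Python sketch of what reading an undirected edgelist involves. It is not Gephi's own code. The Source/Target headers and the sample rows are assumptions for illustration and may not match SampleData1.csv.

```python
import csv
import io

# Hypothetical two-column edgelist; the real sample file may use other headers.
EDGELIST = """Source,Target
Aldington,POE
Aldington,POE
Lowell,POE
Lowell,DIAL
"""

def read_edgelist(text):
    """Return (nodes, edges) from a Source,Target CSV, treating edges as undirected."""
    edges = []
    nodes = set()
    for row in csv.DictReader(io.StringIO(text)):
        a, b = row["Source"].strip(), row["Target"].strip()
        # Undirected: store endpoints in sorted order so (A, B) equals (B, A).
        edges.append(tuple(sorted((a, b))))
        # "Create missing nodes": every endpoint of an edge becomes a node.
        nodes.update((a, b))
    return sorted(nodes), edges

nodes, edges = read_edgelist(EDGELIST)
print(nodes)
print(edges)
```

The node list is never supplied separately; it falls out of the edge endpoints, which is why an edgelist alone is enough to get started.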
Initially, the nodes are randomly placed. Now we can choose a layout for the graph from the menu on the Layout pane. Try Force Atlas. Click "Run" to start the algorithm, then "Stop" once the layout has stabilized. If things glob together, try adjusting the options on the layout (attraction, repulsion, gravity) and running again. Now try Force Atlas 3D. (Here HL discussed the "science" and "art" of graph layouts.)
To zoom in and out, use the mouse scroll wheel; to pan the view, hold down the right mouse button and drag (or, on Mac, hold down Command as you drag). You can also adjust the zoom by clicking the tiny upward-pointing triangle on the lower right to reveal extra settings, including a "Zoom" slider. To reset the graph back to the center of the window, click the miniature magnifying glass icon near the lower left. As you hover the mouse cursor over a node, the edges connected to that node are highlighted.
You can also use the "Expansion" and "Contraction" layouts to spread out or compress the graph.
You can manually edit the layout by clicking on nodes and dragging them.
To see node labels, click the "T" icon third from the left of the row of buttons on the bottom. The label font and the text size can be set using the button and slider towards the right of that same row of buttons.
A further layout, Label Adjust, helps to move labels in the graph to make them easier to read.
For more on the view options, see The Gephi Visualization Tutorial.
Working in the Data Laboratory
Switch over to the Data Laboratory. This is a spreadsheet-style view of the data. There are two tables, Nodes and Edges. Note, in particular, the following features:
In the Edges table, in addition to columns listing the starting and ending points of each edge, there is also a Weight column: this is the number of publications by that poet in that journal. (Compare SampleData1.csv: notice that Gephi has automatically tallied up 3 listings of Aldington,POE into a single Aldington,POE edge with weight 3.)
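The tallying of repeated rows into a single weighted edge can be sketched in a few lines of Python. The Aldington,POE rows follow the example above; the Lowell row is a made-up addition for contrast.

```python
from collections import Counter

# Each tuple is one row of the edgelist: (poet, journal).
rows = [
    ("Aldington", "POE"),
    ("Aldington", "POE"),
    ("Aldington", "POE"),
    ("Lowell", "POE"),  # hypothetical extra row
]

def tally_weights(rows):
    """Collapse duplicate rows into one edge whose weight is the row count."""
    return Counter(rows)

weights = tally_weights(rows)
for (source, target), weight in weights.items():
    print(source, target, weight)
```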
You can add additional data of any kind to nodes and edges by using the "Add Column" function.
You can browse the data quickly using the "Filter" text box to show only nodes/edges which match a search term. For example, to see only POE publications, type POE in the Filter box and select "Target" from the adjacent menu.
Working in the Preview
Here you can create a more polished visualization by adjusting the visualization you set up in the Overview. The location of nodes is translated over from the Overview, but the labeling settings are not. Thus, experiment with adjusting label size, edge thickness, edge color, setting edges Curved or not. The little disk icon at the upper right of the "Preview Settings" can be used to save a set of settings you like. Click "Refresh" (bottom left) to draw the graph.
To save the image itself, click the "Export" button at the bottom left. PDF is for print-quality layouts; PNG is better for the web.
Warning: always save images that look useful!!
A note on saving projects
(From AG. When I reopen a saved project in Gephi, the graph window is sometimes blank in the Overview. It can be reactivated by choosing "Graph" from the Window menu.)
Load the network data for 1924-1925, using the steps just outlined, from SampleData2.csv. When you load this file, a new workspace will be created in the project; you can switch among workspaces (and rename them) using the "Workspace" menu and little buttons at the lower right. Choose the Force Atlas layout, run it, adjust the visualization, and add node labels.
More Advanced Visualization Techniques
Even with this (fairly small) dataset, it’s quite hard to make sense of what’s going on. We can’t even discriminate between poets and journals. So let’s learn how to do that.
What we need is to load some attributes (i.e., metadata) into our data laboratory. We will use the prepared file TypeAttributes.csv, which lists each node together with a label: "Poet" or "Journal." To use these in Gephi, the Ids will have to match exactly; notice the column headers as well.
Click "Import Spreadsheet" (button in the row of buttons at the top of the "Data Table" tab). Click the "…" button and choose the
Make sure to disable the "Force nodes to be created as new ones" option.
This adds some new nodes that aren’t actually in the network. Eliminate them by manually selecting all nodes that have no Label. Then right-click or (Mac) control-click on the selection and choose "Delete all."
Return to the Overview. Choose the Partition tab on the left. If the menu does not offer you "Type" as an option, try clicking the green-arrow refresh button. Now choose to partition by type and click Apply.
Bonus Exercise: try importing Facebook data through Netvizz app and playing around with it.
What we’re missing now is an understanding of the relative weight of these nodes. They are all the same size, and yet we know that this doesn’t adequately express how different they are on the dimension of quantity (i.e., how many publications).
Under the "Statistics" panel on the right, calculate the average weighted degree by clicking "Run" to the right of "Avg. Weighted Degree." (The weighted degree is the sum of the weights of all edges that touch a node—in other words, the total number of publications represented in the data for a given node, whether that node is a poet or a journal.) You’ll also see a plot of the degree distribution.
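As a rough illustration of the definition just given, the weighted degree and its average can be computed by hand. The edge weights below are invented for the example and are not taken from the sample data.

```python
from collections import defaultdict

# Hypothetical weighted edges: (poet, journal) -> number of publications.
weighted_edges = {
    ("Aldington", "POE"): 3,
    ("Lowell", "POE"): 2,
    ("Lowell", "DIAL"): 1,
}

def weighted_degree(weighted_edges):
    """Sum the weights of all edges touching each node."""
    totals = defaultdict(int)
    for (a, b), weight in weighted_edges.items():
        totals[a] += weight
        totals[b] += weight
    return dict(totals)

degrees = weighted_degree(weighted_edges)
average = sum(degrees.values()) / len(degrees)
print(degrees)
print(average)
```

Each edge contributes its weight to both of its endpoints, so a journal's weighted degree counts every publication it carried and a poet's counts every publication they placed.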
In the visualization, we can adjust the size of nodes to indicate their weighted degree. In the upper left panel, choose the "Ranking" tab. From the menu, choose "Weighted Degree." Click the inverted red diamond icon to visualize the weighted degree as node size. Click "Apply." You may now find it helpful to rerun the Force Atlas layout and the Label Adjust.
Sometimes it is helpful to ignore edges below a certain weight. For this, use the Filters tab in the right panel. Open the Edges folder and double-click "Edge Weight." You can now drag the ends of the slider to set an edge-weight range; click "Filter" to hide edges whose weights are outside the range.
You can now rerun the Average Degree and Avg. Weighted Degree statistics. Once this is done, you can hide the isolated nodes (nodes of degree zero). To do this, return to the Filters tab; click Attributes; click Range; and click Degree; now click the "Range (Degree)" line under "Queries," click "Parameters," and adjust the range to have a low end of 1 instead of zero. Click "Filter" and the isolated nodes disappear.
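The two filtering steps (an edge-weight range, then a degree range starting at 1) amount to the following logic. This is a sketch with invented data, not a description of how Gephi implements its filters.

```python
# Hypothetical nodes and weighted edges for illustration only.
all_nodes = ["Aldington", "DIAL", "Lowell", "POE"]
all_edges = {
    ("Aldington", "POE"): 3,
    ("Lowell", "POE"): 2,
    ("Lowell", "DIAL"): 1,
}

def filter_by_weight(weighted_edges, low, high):
    """Keep only edges whose weight lies inside [low, high]."""
    return {edge: w for edge, w in weighted_edges.items() if low <= w <= high}

def non_isolated(nodes, edges):
    """Keep only nodes that still touch at least one edge (degree >= 1)."""
    touched = {node for edge in edges for node in edge}
    return [node for node in nodes if node in touched]

kept_edges = filter_by_weight(all_edges, 2, 3)
kept_nodes = non_isolated(all_nodes, kept_edges)
print(kept_edges)
print(kept_nodes)
```

Here DIAL loses its only edge in the weight filter, so it becomes isolated and is dropped by the degree filter.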
Projection to unipartite and community detection
Save the project as you have it now and then duplicate the “1924-1925” workspace.
Find the Multimode Network Transformation pane on the right of the Overview; click Load Attributes, and choose "Type" from the "Attribute type" menu. Select the following combination: Left matrix: "Poet-Journal"; Right Matrix: "Journal-Poet." (For more on this bipartite-to-unipartite conversion, see this post by Shawn Graham.)
Check the "Remove Edges" and "Remove Nodes" options, then click Run. What remains is a network of poets (each pair of poets is connected by the number of shared publications in any journal).
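Multiplying a Poet-Journal matrix by a Journal-Poet matrix can be sketched without any matrix library. The version below assumes the projected weight between two poets is the sum, over shared journals, of the product of their publication counts; whether the plugin weights edges in exactly this way is an assumption, and the data is invented.

```python
from itertools import combinations

# Hypothetical bipartite data: poet -> {journal: number of publications}.
poet_journal = {
    "Aldington": {"POE": 3},
    "Lowell": {"POE": 2, "DIAL": 1},
    "Pound": {"DIAL": 2},
}

def project_poets(poet_journal):
    """Project the poet-journal network onto poets only.

    The weight for a pair (p, q) is the (p, q) entry of the product
    Poet-Journal x Journal-Poet: sum over journals of w(p, j) * w(q, j).
    """
    projected = {}
    for p, q in combinations(sorted(poet_journal), 2):
        shared = set(poet_journal[p]) & set(poet_journal[q])
        weight = sum(poet_journal[p][j] * poet_journal[q][j] for j in shared)
        if weight:
            projected[(p, q)] = weight
    return projected

poet_network = project_poets(poet_journal)
print(poet_network)
```

Poets who never share a journal get no edge at all, which is why the projected network is usually sparser in pairs but denser within each journal's circle.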
Rerun the Force Atlas layout and graphical layout options.
Now, under the Statistics pane on the right hand side of the Overview, find the "Modularity" line and click "Run." This uses a community detection algorithm on the network.
Now a new partition possibility is available; click the refresh button under the Partition tab, and choose "Modularity Class," then click Run. Now the graph is colored according to the communities found by the algorithm. Then partition the graph according to these communities.
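Community detection of this kind searches for a partition that scores well on modularity. The sketch below does not perform that search; it only computes the standard modularity score of a given partition for an unweighted, undirected graph, using a toy graph of two triangles joined by one edge. The node names and the partition are invented for illustration.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an unweighted, undirected graph.

    Q = sum over communities c of (L_c / m) - (d_c / (2m))^2, where m is the
    number of edges, L_c the number of edges inside c, and d_c the total
    degree of the nodes in c.
    """
    m = len(edges)
    internal = {}
    degree_sum = {}
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_sum[ca] = degree_sum.get(ca, 0) + 1
        degree_sum[cb] = degree_sum.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())

# Two triangles (a, b, c) and (d, e, f) joined by the single edge c-d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

print(modularity(toy_edges, two_groups))
print(modularity(toy_edges, one_group))
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the intuition behind the colored communities you see in the graph.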
Slicing your dataset
This part of the workshop shows how to isolate data from just some of the years in the dataset and produce those CSV files we have been loading into Gephi.
Open the SampleNetworkData.xlsx file in Excel.
In the PoemData worksheet, sort data by the “year” column. Scroll down to the particular year(s) you want, and highlight just the Journal ID columns for those year(s). Copy these to a new worksheet in the same Excel file.
This is our edgelist, but it is of no use to us yet because we need unique identifiers for the nodes instead of just numbers. (These identifiers have to be carefully chosen. Among the things that make Gephi choke: spaces between words; spaces in front of words; more than 2 columns. You also need to be careful that nodes that should be distinct, don’t end up with the same labels: e.g., using last names, but two people are named Johnson).
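The pitfalls listed above (stray spaces, and distinct people ending up with the same label) can be checked before the data ever reaches Gephi. The following is a minimal sketch; the helper names, the underscore convention, and the sample names are all hypothetical.

```python
import re
from collections import defaultdict

def make_id(name):
    """Trim leading/trailing spaces and replace inner whitespace with underscores."""
    return re.sub(r"\s+", "_", name.strip())

def duplicate_labels(labels_by_number):
    """Return identifiers that would be shared by more than one numeric ID."""
    owners = defaultdict(list)
    for number, label in labels_by_number.items():
        owners[make_id(label)].append(number)
    return {label: numbers for label, numbers in owners.items() if len(numbers) > 1}

# Hypothetical lookup table: numeric ID -> label chosen for the node.
labels = {1: "Johnson", 2: "Johnson", 3: " Amy Lowell"}

print(make_id(labels[3]))
print(duplicate_labels(labels))
```

Any label reported by duplicate_labels needs to be made distinct by hand (for example, by adding initials) before the edgelist is built.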
Add a row of headers at the top: the two columns we had need to be JournalId (no spaces!). Next to each of these, create a new column for the identifiers:
Now we need to look up the metadata from our other sheets in the spreadsheet using the following formulas:
For the first entry in the PoetName column, enter this formula:
For the first
$A2 here is where the item you are looking up is located;
PoetData!$A:$C is the sheet and columns that comprise your desired lookup range;
3 indicates the column where the desired replacement values will be; and
0 is a parameter that always stays the same.
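The four pieces described above amount to an exact-match table lookup. A minimal Python sketch of the same lookup follows, assuming a hypothetical PoetData table whose first column holds the numeric ID and whose third column holds the name; the rows and the middle column are invented.

```python
# Hypothetical stand-in for columns A:C of the PoetData sheet.
poet_data = [
    (101, "1892", "Aldington"),
    (102, "1874", "Lowell"),
]

def lookup(key, table, column_index):
    """Find the row whose first cell equals key; return the cell in column_index.

    column_index is 1-based, as in a spreadsheet. Only exact matches count.
    Returns None when the key is absent (a spreadsheet would show an error).
    """
    for row in table:
        if row[0] == key:
            return row[column_index - 1]
    return None

print(lookup(101, poet_data, 3))
print(lookup(999, poet_data, 3))
```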
Once you’ve entered the formulas, double-click on the bottom right corner of the cell to fill the rest of the column with these same formulas.
Create a new Excel spreadsheet and copy the newly created columns into it, using "Paste Values" so as to paste only the values, not the formulas.
Use File: Save As to save the spreadsheet as a CSV file. Now it’s ready to load into Gephi!
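For anyone who prefers scripting to Excel, the whole slicing procedure (pick the years, swap numbers for names, save as CSV) can be sketched in Python. The field names, the JournalName header, and the rows are assumptions for illustration and do not reflect the actual layout of SampleNetworkData.xlsx.

```python
import csv
import io

# Hypothetical rows standing in for the poem worksheet.
poems = [
    {"year": 1924, "poet_id": 101, "journal_id": 7},
    {"year": 1924, "poet_id": 102, "journal_id": 7},
    {"year": 1926, "poet_id": 101, "journal_id": 9},
]
# Hypothetical lookup tables standing in for the metadata sheets.
poet_names = {101: "Aldington", 102: "Lowell"}
journal_names = {7: "POE", 9: "DIAL"}

def slice_to_csv(poems, first_year, last_year):
    """Return CSV text for the edgelist of poems published in the given years."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["PoetName", "JournalName"])
    for poem in poems:
        if first_year <= poem["year"] <= last_year:
            writer.writerow([poet_names[poem["poet_id"]],
                             journal_names[poem["journal_id"]]])
    return buffer.getvalue()

csv_text = slice_to_csv(poems, 1924, 1925)
print(csv_text)
```

To produce a file rather than a string, write csv_text to a path of your choosing with open(path, "w", newline="").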
Edit 2/13/14 by AG: description edited; precirculated readings.
Edit 2/15/14 by AG: description edited.
Edit 2/17/14 by AG: location.
Edit 2/19/14 by AG: workshop setup instructions.
Edit 2/25/14 by AG: precirculated paper no longer up.
Edit 4/22/14 by AG: added Hoyt’s notes with my (hopefully not wrong) additional notes.