Wednesday, November 20, 2013
1:10 pm–3:10 pm
Alexander Library 413, 169 College Ave., New Brunswick, NJ
Taught by Andrew Goldstone, English Department.

This workshop aims to expand our horizons for thinking about how we handle text on our computers. In order to attain liberation from Word, we will explore the difference between text editors and word processors, discuss the ways computers represent text as content or form, and experiment with some key technologies for digital document preparation. We will dally with three related computer languages in rapid succession. We will begin with markdown, a minimal but versatile set of plain-text conventions. Then we will learn to convert markdown into equivalent HTML markup using Pandoc. Finally, we will introduce the LaTeX document-processing system, which elegantly typesets plain text into PDF files. No prior experience with any form of markup or other computer coding is required. This workshop is free but spaces are limited. Graduate students, faculty, and staff are welcome. Please email Vishal Kamath to register at vkamath at scarletmail.rutgers.edu.

Before the workshop

Suggested: bring a document you have written on your computer, in any format. At the workshop, all the software we will use will be available on the lab machines. If you prefer to bring your own laptop computer for some or all of the workshop, here is the software we will be using (all free and cross-platform):

  • Komodo Edit for the introduction to text editing. You may use any other text editor you prefer, of course.
  • Pandoc for markdown-to-HTML conversion.
  • TeX Live (Mac, Windows, Unix). (A large download.)

For more on what these things are all about, you may wish to read AG’s pages about markdown and LaTeX. The creator of markdown, John Gruber, has a page which allows you to experiment with markdown in your browser.

Empowerment Part II

The actual “empowerment” (modest but real) comes in getting a more detailed understanding of the way the systems we already use handle text, and in learning more ways to manipulate that text, beyond the confines of any single program. The business of plain-text-slinging, a minor craft on its own, nonetheless forms a natural starting point for thinking more deeply about analyzing digitized texts, expressing yourself in “code” of various kinds, and composing in the digital medium.

Downloads

In order to do the workshop on your own, first install Pandoc and LaTeX (links above). Komodo Edit is optional; any text editor will do, though I’ll occasionally refer to details in Komodo (menu items, etc.) that may be slightly different in other editors. See below for text editor suggestions.

The handout from the workshop (PDF)

Sample files (zip). Unzip this and place the resulting emp2 folder some place where you can find it (like your home directory). (Slightly modified from those used during the workshop.)

How does a computer represent text?

Text encoding

 H  e  l  l  o     w  o  r  l  d
48 65 6c 6c 6f 20 77 6f 72 6c 64

Markup

<p>Hello world</p>

Image

Text images could be bitmaps like this or curves, as in Postscript, PDF, SVG.

(image CC-BY Wendell Oskay)

Text encoding

ASCII

The American Standard Code for Information Interchange: Roman letters, numbers, punctuation and control characters correspond to 7-bit numbers (0 to 127); each character fits in a byte (8 bits).

Here is ascii.txt (in emp2):

I am a test file.
I have 2 lines

The numerical representation of these characters, in decimal, is:

    I   SP   a   m  SP   a  SP   t   e   s   t  SP   f   i   l   e
    73  32  97 109  32  97  32 116 101 115 116  32 102 105 108 101
    .   NL   I  SP   h   a   v   e  SP   2  SP   l   i   n   e   s
    46  10  73  32 104  97 118 101  32  50  32 108 105 110 101 115

It is more conventional to write numerical representations like this in hexadecimal, which has exactly two digits per byte. (The Wikipedia page on hexadecimal explains how base 16 numbers are written.) In hexadecimal the same numbers are written:

    I   SP   a   m  SP   a  SP   t   e   s   t  SP   f   i   l   e
    49  20  61  6d  20  61  20  74  65  73  74  20  66  69  6c  65
    .   NL   I  SP   h   a   v   e  SP   2  SP   l   i   n   e   s
    2e  0a  49  20  68  61  76  65  20  32  20  6c  69  6e  65  73

Notice that capital letters are distinct from lowercase, and spaces and new lines need their own bytes. The convention of hexadecimal 0a for newline is not universal, and linebreaks are often a conversion problem when moving text files between computers running different operating systems.
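
The byte tables above can be reproduced with a few lines of Python. This is only a sketch, not part of the workshop materials, and it assumes that ascii.txt holds exactly the two lines shown, with no final newline.

```python
# Assumption: ascii.txt holds exactly these two lines, with no final newline,
# matching the byte tables above.
text = "I am a test file.\nI have 2 lines"

# In ASCII every character becomes exactly one byte.
data = text.encode("ascii")

decimal = [str(byte) for byte in data]
hexadecimal = [format(byte, "02x") for byte in data]

print(" ".join(decimal))
print(" ".join(hexadecimal))
```

Every hexadecimal entry has exactly two digits, which is why that notation lines up so neatly byte by byte.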

Unicode

Problem: what about…all the other letters in the world?

L'anglocentrisme, ça m'énerve.

The Unicode standard. The core of the standard is the assignment of characters (“glyphs”) in all the world’s writing systems to codepoints.

Codepoints are notated U+xxxx where xxxx is a four-digit hexadecimal (base 16) number.

There are a lot of possible codepoints. (But not as many as I said in the workshop.) The four-digit codepoints U+xxxx allow for only 65536 distinct glyphs. The full Unicode specification has room for another million or so glyphs by allowing codepoints of up to six hexadecimal digits: the highest code point is actually U+10FFFF. See Wikipedia s.v. Unicode.

Some Unicode examples

(the text on this webpage makes use of Unicode to incorporate characters from non-Latin scripts using a single encoding)

ç   U+00E7  LATIN SMALL LETTER C WITH CEDILLA
a   U+0061  LATIN SMALL LETTER A
    U+0020  SPACE
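
If you have Python on your machine, its standard unicodedata module can look up the codepoint and official name of any character. The helper name describe below is my own invention for illustration; the rest is a minimal sketch.

```python
import sys
import unicodedata

def describe(char):
    """Return 'U+XXXX  NAME' for a single character."""
    return f"U+{ord(char):04X}  {unicodedata.name(char)}"

# \u00e7 is the c with cedilla from the example sentence above.
for char in "\u00e7a ":
    print(describe(char))

# The highest codepoint Python (and Unicode) allows.
print(f"U+{sys.maxunicode:X}")
```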

??????? ??? ????…

ἄ   U+1F04  GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA

???

水  U+6C34  CJK UNIFIED IDEOGRAPH-6C34
の  U+306E  HIRAGANA LETTER NO

UTF-8

The codepoints are encoded in various ways. In UTF-8, now the most common text encoding on the web, Unicode code points are mapped to bytes in a more compact way, and with an Anglocentric convenience: ASCII is preserved.

Here is unicode.txt:

I am a test file.
Voilà la deuxième ligne.

I   SP   a   m  SP   a  SP   t   e   s   t  SP   f   i   l   e
49  20  61  6d  20  61  20  74  65  73  74  20  66  69  6c  65
.   NL   V   o   i   l       à  SP   l   a  SP   d   e   u   x
2e  0a  56  6f  69  6c  c3  a0  20  6c  61  20  64  65  75  78
i        è   m   e  SP   l   i   g   n   e   .  
69  c3  a8  6d  65  20  6c  69  67  6e  65  2e

Notice that UTF-8 encodes the Latin alphabet just like ASCII (e.g. a is hexadecimal 61), but that the characters with diacritical marks, à and è, require two bytes in this encoding (c3a0 and c3a8 respectively).
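
The same observation can be checked in Python. This is a sketch only; the helper name utf8_hex is made up for the example, and the accented letters are written as escapes so the code does not depend on how this page itself is encoded.

```python
line = "Voil\u00e0 la deuxi\u00e8me ligne."  # \u00e0 is a-grave, \u00e8 is e-grave

def utf8_hex(char):
    """Return the UTF-8 bytes of one character as a hexadecimal string."""
    return char.encode("utf-8").hex()

for char in line:
    print(repr(char), utf8_hex(char))

print(len(line), "characters,", len(line.encode("utf-8")), "bytes")
```

The character count and the byte count differ by two: one extra byte for each accented letter.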

Activity: the text editor

  1. Open some programs for text: Notepad (Mac: TextEdit), Komodo Edit, Word. Type anything in each.
  2. What differences do you notice in the representation of text?
  3. What different capacities do you notice? Pay attention to the kinds of decisions the different programs make for you—and the kinds of things you have to be explicit about.

Editing: varia

Komodo Edit has a text encoding menu, by the way: look at the bottom of the window.

You might also notice that lines are not broken for you automatically. In a text editor, you must press Return to make a line break part of the text file. However, all text editors have an option to “soft wrap” text, i.e., wrap it for display without inserting newline characters (those 0a bytes). In Komodo this is the “Word Wrap” option in the “View” menu.
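
To see the difference between wrapping for display and actually breaking lines, here is a small Python sketch using the standard textwrap module. The sample sentence and the width of 40 are arbitrary choices of mine.

```python
import textwrap

# One long logical line: there is no newline character anywhere in it.
paragraph = ("In a text editor the line breaks you see on screen "
             "are not necessarily part of the file you save.")

# Hard wrapping changes the text: newline characters (0a bytes) are inserted.
hard_wrapped = textwrap.fill(paragraph, width=40)

print(hard_wrapped)
print(hard_wrapped.count("\n"), "newlines were added")
```

Soft wrap, by contrast, leaves the string in paragraph untouched and only changes how the editor draws it.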

Why plain text?

  1. “Human-readable”
  2. Fewer conversion problems (encodings still an issue)
  3. Text as data: programmatic manipulation
  4. Plain text as blueprint for multiple forms…
  5. All programming, and all the web, lives in the world of plain text
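
As a small illustration of “text as data,” here is a Python sketch that counts the words in the ascii.txt sample. The rule for what counts as a word (any run of letters or digits) is a simplifying assumption.

```python
import re
from collections import Counter

text = "I am a test file.\nI have 2 lines"

# Lowercase everything, then pull out runs of letters and digits.
words = re.findall(r"\w+", text.lower())
counts = Counter(words)

print(counts.most_common(3))
```

Nothing here depends on the program that produced the file: any plain text can be fed through the same few lines.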

Some editors I’ve heard of

I chose Komodo Edit for the workshop because it was the only text editor with the features I wanted that was available on PCs as well as Macs. I do not regularly use it, and there are many other good choices of text editor. None of the activities in this workshop requires any particular editor program; all of the following could be considered:

I personally use MacVim, but I found it challenging to learn and don’t recommend starting with it if you are new to the world of plain text and computer languages. I suggest TextWrangler or Notepad++ for those starting out writing plain text for any purpose.

Markdown

The principle: make plain text more expressive with some extra conventions.

  • still pretty easy for a human writer/reader to interpret
  • but systematic enough to be processed programmatically

Text conventions

*emphasis* or _emphasis_
**strong emphasis**
"(optionally) make quotes curly--and dashes nice--automatically"
extra white  space     doesn't      matter

Where white space does matter

Paragraphs are broken by blank lines.

This is the start of a new paragraph.
But this isn't.

Extra spaces at the end of a line  
make a "hard linebreak."

    Four spaces at the start of a line mark "code."
    This text is meant literally: *not styled*.

Structure

# Heading
## Subheading
### Subsubheading

> A block quotation, which can be
> spread over multiple lines if you like.

1. An enumerated list.
2. Another thing.
1. A third thing.

- An "unordered" list.
- Must you bullet point?

Hypertext

Links are written in two parts: the actual URL and the text of the link:

[this is a link to my home page](http://andrewgoldstone.com)
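
To get a feel for what “systematic enough to be processed programmatically” means, here is a toy converter in Python for three of the conventions above. It is emphatically not how pandoc works: the function name and the regular expressions are my own simplifications, and they will mishandle plenty of real text (underscores inside URLs, for instance).

```python
import re

def toy_markdown_to_html(text):
    """Convert a few inline markdown conventions to HTML (toy rules only)."""
    # Strong emphasis first, so its double asterisks are not read as single ones.
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)
    text = re.sub(r"_(.+?)_", r"<em>\1</em>", text)
    # Links: [text](url)
    text = re.sub(r"\[(.+?)\]\((.+?)\)", r'<a href="\2">\1</a>', text)
    return text

print(toy_markdown_to_html("*emphasis* or _emphasis_"))
print(toy_markdown_to_html("**strong emphasis**"))
print(toy_markdown_to_html("[this is a link to my home page](http://andrewgoldstone.com)"))
```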

Activity

  1. Create a new markdown document in your text editor. In Komodo Edit, either use “New From Template” and add a .md to the file name, or simply create a new document and then use the language menu at the bottom of the document window: it says “Text”; click, go to the “Other” submenu, and choose “Markdown.” You can use any text editor to edit markdown, but for the rest of this workshop you must make sure your file has .md at the end of its name (and not .txt).
  2. Try writing with markdown structures or try hand-converting one of your own documents into markdown.
  3. Save the file to emp2 with the name first.md.
  4. (If you prefer, you can look at the sample markdown file I have included in the emp2 folder, sample1.md. Feel free to substitute sample1.md for first.md in the examples below.)
  5. What do you notice about this process that is different from word processing?

Conversion

Markdown is designed to be converted into other formats (in fact it is now an input-format option on wordpress.com, as it has long been on tumblr.com).

  1. Open PowerShell (Windows: All Programs > Accessories > Windows PowerShell) or Terminal (Mac, in /Applications/Utilities).
  2. Change directories to the place you saved your markdown file using the cd command. Navigating your hard drive using cd may be tricky at first. On Windows, if you put emp2 in My Documents, you would type: cd '.\My Documents\emp2'; on a Mac, if you put emp2 in your home directory, just cd emp2 would work; but if you’d saved emp2 in Downloads, you’d need cd Downloads/emp2. Notice that Mac uses a forward slash, not the Windows backslash. Type return after the command.
  3. Type ls (and a return): this lists the contents of the folder you’ve changed into. Make sure first.md is listed.
  4. Run pandoc: pandoc first.md --smart -s -t html5 -o first.html
  5. Open the folder where you saved first.md. You’ll find a new file, first.html. Double-click the file to open it in a web browser.
  6. (You can also, if you wish, try this with sample1.md: pandoc sample1.md --smart -s -t html5 -o sample1.html. I have provided, in the converted folder in emp2, a sample1.html which contains the results of running that command on my machine.)

What do you see?

Excursus: running command-line programs

Each command on the command line has the same general shape: a command followed by a series of arguments and options separated by spaces. In the example command to pandoc above, we had:

pandoc first.md --smart -s -t html5 -o first.html
  • pandoc: the name of the program
  • first.md: an argument to the command, here the source text file. The extension to the file names its format (pandoc notices this).
  • -s: an option to the command (options start with one or two hyphens). This one means “stand-alone”: see the section on html structure below.
  • -t html5: an option with a parameter: this one means “output html5.” We could omit this, and let pandoc guess from the extension .html. But by default it outputs html as XHTML 1.0, not HTML5.
  • --smart: a long option name for typographic quotes and dashes.
  • -o first.html: the output file as an option with parameter. If the file already exists, pandoc will replace it completely, so be careful.

-o is standard for output, though some programs will instead just use the order of plain arguments: program inputfile outputfile. And other programs give you no choice about the output file name. Argument order usually matters, and option order usually doesn’t. Other command-line commands to try: cat, echo, ls, mkdir, cd, touch, pwd. (These are standard Unix commands; most are also available in Windows PowerShell, though touch is not built in there.)
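
For the curious, here is how a program might declare a command line of that shape, sketched with Python's standard argparse module. The program name convert is hypothetical, and this is not pandoc's own code (pandoc is written in Haskell); it only mimics the pattern of one plain argument plus several options.

```python
import argparse

parser = argparse.ArgumentParser(prog="convert")  # hypothetical program name
parser.add_argument("source")                     # a plain argument
parser.add_argument("-s", "--standalone", action="store_true")  # option, no parameter
parser.add_argument("--smart", action="store_true")             # long option name
parser.add_argument("-t", "--to")                 # option with a parameter
parser.add_argument("-o", "--output")             # option with a parameter

# The same options in two different orders.
first = parser.parse_args("first.md --smart -s -t html5 -o first.html".split())
second = parser.parse_args("-o first.html -t html5 first.md -s --smart".split())

print(first)
print(first == second)
```

Both orderings parse to the same settings, which is the sense in which option order usually doesn't matter.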

The pandoc website also has a good introductory tutorial on running the program.

Bonus activity: compare the results of pandoc first.md -o first.html and pandoc first.md -t html5 -o first.html to the results with -s in the browser and in Komodo Edit. The -s option adds an HTML header (you may notice text-encoding problems otherwise in the browser!). Without -t html5, you will notice some differences in the HTML source, though you shouldn’t see any differences in the appearance of the webpage.

HTML

Open first.html (or sample1.html) in Komodo Edit. Now what do you see?

HTML and markdown compared
  • Same principle: make plain text more expressive with conventions.
  • Most of the information is text “content.”
  • The rest—also text—is annotations.
  • But HTML is more explicit about structure (and more versatile).

Tagging

Annotations normally take the form of pairs of tags: <xyz>some annotated text</xyz>.

<em>Annotation for emphasis</em>
<strong>Annotation for strong emphasis</strong>

Extra white  space    still     doesn't         matter

<h1>Top-level heading</h1>
<h2>Subheading</h2>
<h3>Etc.</h3>

If you haven’t already, open up sample1.html in Komodo Edit to see examples of some of these tags.

Structure

HTML tells you about structure. Every paragraph is also tagged.

<p>I am a paragraph.</p> <p>I am a second paragraph, even if I am on the same line.</p>
<p>I am a paragraph using the self-closing <br>tag to break a line.</p>

(<br> is a rare example of a stand-alone tag. In XHTML it would have to be <br /> but HTML5 is more permissive.)

Now here is the doozy: HTML is strictly hierarchical.

<!DOCTYPE html>
<html>
    <head>
    ...
    </head>
    <body>
        <p>...</p>
        <p>...
            <em>...</em>
        ...</p>
    </body>
</html>

(The meaning of pandoc’s -s option is that it adds a suitable <head> so you don’t have to figure that part out yourself.)

What is wrong with the following?

<p>I aspire to be a <em>well-formed slice of HTML.</p>
<p>But apparently I have issues.</p>

And?

<p>I aspire to be a <em>well-formed slice of HTML.</p>
<p>But</em> apparently I have issues.</p>

This is quite right, on the other hand:

<p>I am an <em>emphasis with a <strong>strong emphasis</strong> inside</em>.</p>

I am an emphasis with a strong emphasis inside.
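
The nesting rule is mechanical enough to check with a short program. The Python sketch below uses the standard html.parser module and a stack of open tags. It is stricter than a real browser (HTML5 tolerates some omitted closing tags), and the class name, function name, and short list of stand-alone tags are my own choices for the example.

```python
from html.parser import HTMLParser

# Stand-alone tags that are never closed (a partial list, for illustration).
VOID = {"br", "hr", "img", "meta", "link", "input"}

class NestingChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []      # tags currently open, innermost last
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            inner = self.stack[-1] if self.stack else "nothing"
            self.problems.append(f"</{tag}> found while <{inner}> is innermost")

def check(html):
    """Return a list of nesting problems (empty if the tags are balanced)."""
    checker = NestingChecker()
    checker.feed(html)
    checker.close()
    return checker.problems + [f"<{tag}> never closed" for tag in checker.stack]

print(check("<p>I aspire to be a <em>well-formed slice of HTML.</p>"))
print(check("<p>I am an <em>emphasis with a <strong>strong emphasis</strong> inside</em>.</p>"))
```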

Block quotes

The hierarchy means that block quotes and lists require more annotations:

<blockquote>
    <p>A single-paragraph block.</p>
</blockquote>

<blockquote>
    <p>A two-paragraph block, first round.</p>
    <p>A two-paragraph block, second round.</p>
</blockquote>

Which represents something like:

A two-paragraph block, first round.

A two-paragraph block, second round.

Lists

<ol>
    <li>First</li>
    <li>Second</li>
    <li>And last.</li>
</ol>

<ul>
    <li>First</li>
    <li>Second</li>
    <li>And last.</li>
</ul>

The two constructs differ only in the outer ul/ol tags, but the browser renders them rather differently:

  1. First
  2. Second
  3. And last.
  • First
  • Second
  • And last.

Abstraction

There are also two purely abstract tags:

  • <div> (a block-level division, i.e. one that can contain paragraphs)
  • <span> (an inline element within a paragraph).

These are mainly useful in conjunction with more elaborate styling using CSS.

Within the < > of the opening tag, HTML allows attributes of the form name="value". A common abstract attribute is a unique id for an element:

<p id="par1">I am a paragraph.</p>

More useful is the href attribute of the “anchor” tag <a>.

<a href="http://andrewgoldstone.com">this is a link to my home page</a>

You can also specify an href to an id:

<a href="#par1">I refer back to the element with
id <code>par1</code> on this page.</a>

Heads and bodies

The HTML hierarchy includes not just the data of the document but (broadly speaking) metadata. All of the displayed elements of the document live within the <body>...</body> tags. The metadata lives in the head:

<head>
<title>The title of the page</title>
<meta charset="utf-8">
</head>

And even outside the <html>...</html> comes the document type declaration, <!DOCTYPE html> (which means: the rest of this is an HTML5 document).

Activity

Edit the first.html you created using any of the tags here, and save it as a new file second.html. Try loading the page in your browser.

Then try

pandoc --atx-headers second.html -o second.md

Open second.md in Komodo Edit. What does it look like?

For comparison, take a look at sample2.html in the emp2 folder. This is a modified version of sample1.html. The results of using pandoc to convert this back to markdown are in converted as sample2.md.

(Leaving off the --atx-headers option would get you another markdown header convention not described above, in which headers are “underlined” with hyphens or equals signs.)

Making websites

Websites consist of structured text in HTML plus presentational information in CSS. CSS is for another day (or see links to online guides below), but the point of CSS is that it gives rules for rendering HTML elements. (For example: put all paragraphs in 14-point font. Make the body background light gray. Etc.) More powerful websites do their data-crunching with programs behind the scenes and send their results to the “client” in HTML. Many sites also transform their HTML using little programs in JavaScript that the client’s web browser runs.

Conversion (2a)

Look in emp2 for the emp1.md file. Open this in Komodo Edit. This may look familiar… Now try

pandoc -s emp1.md -o emp1.html

Double-click emp1.html in Windows Explorer/MacOS Finder. Okay, no prob: a web page. What about:

pandoc -s -t slidy emp1.md -o emp1s.html

Try opening emp1s.html. If the web browser asks you, allow scripts to run. What has happened? (I have put the emp1.html and emp1s.html generated on my machine in converted for comparison.)

This sort of thing is what is afforded by separating the text content and structure from the mode of presentation. To read more about the many other formats that pandoc can convert markdown to, read the pandoc documentation.

(Incidentally, while there are many other programs that read and write markdown, this ability to convert to slidy and other HTML slideshow formats is particular to pandoc.)

Conversion (2b)

But we can go even further. Try:

pandoc -s emp1.md -o emp1.tex
pdflatex emp1

Now look for the newly-created emp1.pdf file. Open it up. What has happened?

(Notice that on the command line, pdflatex takes only a single argument, the name of the source file, without an extension. Its output is always named the same way as the source file, but with a .pdf extension. We’re about to do the rest of our TeXing inside a graphical program, so I’ll leave aside the details of running command line TeX. The point here is simply that somehow the same markdown source can also lead to a typeset PDF. As usual, look for sample emp1.tex and emp1.pdf files in the converted folder.)

Conversion (3)

Try:

pandoc -s sample1.md -o sample1.tex

Open TeXWorks (Windows; TeXShop on Mac) and open sample1.tex inside that program. What similarities can you see to the markdown source? You might have to scroll down to where you see \begin{document}

(For comparison, there is a sample1.tex inside converted in emp2.)

LaTeX

The language of the file you are now examining is LaTeX, which is itself a dialect of TeX. TeX is a typesetting language invented by Donald Knuth; LaTeX is the system built on top of it by Leslie Lamport and many others, beginning in the early 1980s. (The Wikipedia TeX page has a little more history). Most users of TeX these days use LaTeX.

About pronouncing the name

The name is significant in two ways. On the one hand, it is profoundly alienating to learn that you’re “supposed” to read the last letter of TeX as a Greek chi and pronounce it like the ch in loch. The name becomes a supernerd shibboleth—a feeling which is reinforced by the ritual snootishness between those who pronounce the first syllable of LaTeX “la” and those who say it like the word “lay.” All of that is depressing.

But the name is significant in another way. Remember that the standard for encoding text, ASCII, leaves no room for other alphabets and languages. The aspiration implicit in the name TeX—anticipating the sensibility of Unicode by more than a decade—is that computers can and should make sense of multiple scripts. Even more importantly, TeX was, according to Knuth, born out of a determination to get computers to typeset books well. TeX expresses the idea of making the computer handle text in the ways that people want—rather than forcing us to use text only in the ways that are most convenient for computers.

Continuing on from where we had to stop in the workshop…

Earlier we processed markdown into HTML using pandoc. Then we tried using pdflatex at the command line to make PDF files. Now we’re going to work on TeX in an all-in-one program, TeXWorks (Windows) or TeXShop (Mac). These programs include a text editor window very similar to Komodo (but specialized for TeX), and they let you run the TeX program without having to type commands at the command line. This is completely equivalent to using a separate editor and running command-line TeX. It’s mostly a matter of taste and convenience. You can switch back and forth at will! This is another virtue of using plain text formats: you can edit them in many different programs, and you’re not “locked in” once and for all.

Activity

Let’s examine a rudimentary LaTeX file.

  1. Open up sample3.tex from the emp2 sample files using the “Open” command in the “File” menu of TeXShop or TeXWorks.
  2. Choose “XeLaTeX” as your typesetting engine. In both TeXWorks and TeXShop, there is a pop-up menu near the top of the document window (it says “PDFLaTeX” by default in TeXWorks and “LaTeX” by default in TeXShop).
  3. Typeset! In TeXWorks, click the green “play” button at the top left of the document window; in TeXShop, click the “Typeset” button.
  4. Take a look at the resulting PDF. (This PDF is displayed within the TeXWorks/Shop program; but you can also open it up in another PDF viewer, like Adobe Reader or Preview. For comparison, see sample3.pdf in converted in emp2.)

What is different about this document-preparation process from word processing?

XeLaTeX is a newer “flavor” of LaTeX which emphasizes Unicode support and allows you to use any font on your system. See below for a note of explanation.

The language

The structure of a LaTeX document is not too different from an HTML document.

Document class declaration and preamble
\documentclass[12pt]{article}

Then follow a series of commands called the preamble. Its purpose is a bit like the HTML <head>. These do most of the work of tweaking the page format. When you start out, it’s easiest to use someone else’s preamble as a template. I’ve supplied one in template.tex. In the next activity, we’ll try that out.

Document body
\begin{document}

Here is where the text of the document goes, just like the HTML <body>.

\end{document}

The same principle

LaTeX is, yet once more, plain text with conventions to make the text more expressive. Unlike HTML, LaTeX is actually a programming language. But here’s the point: programming languages are plain text with expressive conventions. It’s just that some of the expression is done with another computer program that processes the text.

In LaTeX, the convention is: the conventional material is signalled by words beginning with a backslash \. These are TeX commands. The commands can have parameters, normally surrounded by balanced braces { }. Sometimes commands themselves come in pairs, like HTML’s paired tags. For example:

\begin{quote}
I am a blockquote too!
\end{quote}

But sometimes not:

\emph{I am emphasized text.}

Basic commands for text

Extra white  space     doesn't      matter.

One or more blank lines in the source makes a paragraph break.

If you wish to insert a linebreak of your own, use a double backslash \\.

``Double quotes'' and `single';
the apostrophe's easy;
and dashes---the em dash---and the en dash (for numbers, as in 1990--2000).
An ellipsis\ldots gets its own command (notice the space after disappears).

“Double quotes” and ‘single’;
the apostrophe’s easy;
and dashes—the em dash—and the en dash (for numbers, as in 1990–2000).
An ellipsis…gets its own command (notice the space after disappears).
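
To make the correspondence between source and output concrete, here is a Python sketch that imitates the visible result of these punctuation conventions. It is not how TeX does the job internally, and the function name and replacement table are assumptions made for the example.

```python
import re

# Longer sequences come first so that "---" is not read as "--" plus "-".
REPLACEMENTS = [
    ("---", "\u2014"),  # em dash
    ("--", "\u2013"),   # en dash
    ("``", "\u201c"),   # opening double quote
    ("''", "\u201d"),   # closing double quote
    ("`", "\u2018"),    # opening single quote
    ("'", "\u2019"),    # closing single quote or apostrophe
]

def imitate_tex_punctuation(source):
    """Turn LaTeX punctuation conventions into typographic characters."""
    # A command name swallows the spaces that follow it.
    result = re.sub(r"\\ldots(?![A-Za-z])\s*", "\u2026", source)
    for plain, typographic in REPLACEMENTS:
        result = result.replace(plain, typographic)
    return result

print(imitate_tex_punctuation("``Double quotes'' and `single';"))
print(imitate_tex_punctuation(r"An ellipsis\ldots gets its own command."))
```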

Here is an example of \emph{emphasis} and of
\emph{emphasis with a \emph{further} emphasis within it}.

\footnote{Note text. Numbers are automatic.}

And a special convention: one that does nothing.

% Everything in the line after a % sign is a "comment."

(In HTML comments are made by <!-- This is the comment. -->. Markdown has no comments.)

Sectioning

\section{Section name}, \subsection{...}, \subsubsection{...}

Environments

Documents are further structured by environments, which set text differently from ordinary paragraphs.

\begin{quote}
A blockquote.
\end{quote}

\begin{verse}
The poem must resist the intelligence \\
Almost successfully. Illustration:

A brune figure in winter evening resists \\
Identity.
\end{verse}

Activity

You are free to try editing sample3.tex and re-typesetting it. But let’s do something more realistic and write a new document on the basis of a template with a fuller preamble.

  1. Make a copy of template.tex. Call it anything you want, but make sure the name ends in .tex. Open it up in TeXWorks/TeXShop.
  2. Put in some text in the body of the new document—try out the LaTeX commands above.
  3. Find the name of a font you know you have on your system, and stick its name in place of Palatino in the \setmainfont{Palatino} command in the preamble.
  4. Bonus: try editing the titling parameters in the preamble. (Look for \author{...}.)
  5. Typeset.

Conclusion: ways of working

If you’ve completed all the activities, you have now tried editing text in three different plain-text languages. You’ve also learned that it’s possible to convert automatically between the formats using pandoc. The implication is that you have a choice when it comes to the format you yourself compose text in, and that you have a second, independent choice about the format you present or publish the text in.

So how to work on text outside of Word? The great advantage of markdown for composition is its terseness. I have found it a convenient format for writing in, especially in shorter forms (blog posts, handouts, quick slide shows). As a presentational format, it is very plain: its main use would be as an e-mail convention.

Markdown is really most useful as something you convert—as source code (to use the term loosely). It is a very easy way to compose for the web browser: I rarely write directly in HTML; normally I write markdown and use pandoc to convert to HTML when I’m finished editing. Usually I then see a few more changes I want to make, so I go through a few cycles of editing-pandoc-open in browser.

I also sometimes write markdown and convert to LaTeX, then typeset into PDF, especially for short documents (like the handout from the workshop). For longer documents I compose directly in LaTeX out of long habit, but it’s equally possible to compose for PDF or print in markdown (possibly with LaTeX mixed in; see the note below on doing this).

Extra notes

LaTeX flavors

The activities ask you to typeset using XeLaTeX rather than PDFLaTeX. You’ll find that if you try to use pdflatex on sample3.tex or template.tex as we did on that converted emp1.tex, you’ll get errors. Though it adds an extra wrinkle to your process, this business of switching to XeLaTeX makes it much easier to use multiple languages and to use any font you like with LaTeX, both of which are important for humanistic text composition.

The LaTeX language you write in is the same whether you use XeLaTeX or PDFLaTeX. The only part of the document that needs to be altered for different LaTeX flavors is the preamble. template.tex shows you a XeLaTeX preamble; for a PDFLaTeX example, see this sample file.

By default, pandoc -s sample1.md -o sample1.tex generates a document with a preamble for PDFLaTeX. Yet another option to pandoc makes a XeLaTeX-flavored document. Try:

pandoc -s --latex-engine=xelatex sample1.md -o sample1xe.tex

This makes a TeX file suitable for typesetting with XeLaTeX out of sample1.md. You can open sample1xe.tex in TeXWorks/TeXShop and typeset it as above.

Pandoc extra

Direct to PDF

Pandoc can actually also run LaTeX for you. This allows it to convert seamlessly from markdown to PDF (secretly it first makes markdown into LaTeX and then typesets it with LaTeX). Try:

pandoc sample1.md -o sample1direct.pdf

Extended markdown

Gruber’s original markdown syntax is a bit limited even for bare-bones composition. Pandoc understands a special markdown dialect with some extra bells and whistles. The Pandoc manual gives all the details, but for humanists the most notable addition is the footnote. The footnote convention has two parts. The footnote marker is annotated with brackets and a caret:

In my text, the place where the footnote marker
goes is here.[^note] Then there's some more text.

You can put any combination of letters, numbers, and hyphens after the caret (I chose note, but you can put 1, zoomwow1one, or anything you like). You can then place the text of the note anywhere else in your document by repeating the marker and adding a colon:

[^note]: This text will go in the footnote.

When it converts markdown to HTML, pandoc automatically numbers the footnotes and puts some extra markup in your document to stick footnotes at the bottom of the page and make the footnote markers internal hyperlinks. When it converts markdown to LaTeX, pandoc’s task is much simpler: it reunites the note text and the marker in a \footnote{} command.
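
The two-part footnote convention is easy to process mechanically. As a rough illustration (a toy, and not pandoc's actual method), this Python sketch pairs each marker with its note text and numbers the notes in order of first appearance. The function name and the plain [1] output style are inventions for the example.

```python
import re

def number_footnotes(markdown):
    """Return (body, notes): markers become [n], notes become a numbered list."""
    # Collect definitions of the form "[^label]: note text".
    definitions = dict(
        re.findall(r"^\[\^([\w-]+)\]:\s*(.*)$", markdown, flags=re.M)
    )
    # Remove the definition lines from the running text.
    body = re.sub(r"^\[\^[\w-]+\]:.*\n?", "", markdown, flags=re.M)

    order = []  # labels in order of first use

    def replace(match):
        label = match.group(1)
        if label not in order:
            order.append(label)
        return f"[{order.index(label) + 1}]"

    body = re.sub(r"\[\^([\w-]+)\]", replace, body)
    notes = [f"{n}. {definitions.get(label, '')}" for n, label in enumerate(order, 1)]
    return body, notes

source = (
    "In my text, the place where the footnote marker\n"
    "goes is here.[^note] Then there's some more text.\n"
    "\n"
    "[^note]: This text will go in the footnote.\n"
)
body, notes = number_footnotes(source)
print(body)
print(notes)
```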

Pandoc also has a markdown convention for citations, which tries to format citations using information from a bibliographic database. Unfortunately, the citation format is not sophisticated enough for humanistic scholarship. By contrast, in LaTeX you can use an extremely powerful set of citation commands (supplied by the biblatex package) in conjunction with a bibliographic database stored in yet another plain text format, the bibtex format. That too is a topic for another day.

Markdown mix-in

There is one more key aspect to markdown syntax I did not mention but which I use all the time. Let’s say you know you’re planning to convert your markdown to HTML. You can freely mix markdown and HTML in your text file, and pandoc will keep your HTML and convert the markdown to HTML around it. The following is perfectly legitimate markdown:

In this paragraph I use both *markdown emphasis conventions* and
an HTML <strong>strong emphasis</strong> tag pair.

The downside is that this markdown will not convert perfectly into LaTeX or other formats (you would have to take a detour through HTML first). But often I use markdown only as an expedient pre-text for HTML.

Pandoc allows you to do the same with TeX. Thus pandoc is quite happy to convert this:

In this paragraph I use *markdown emphasis conventions*,
a \LaTeX\ macro, and a \emph{command for emphasis}.

into LaTeX. (This feature means you can mix, for example, LaTeX citation commands into markdown. That is actually how I often write.) Again the conversion is now one-way, markdown to LaTeX, instead of moving seamlessly to other languages like HTML.

Finally: a bonus activity

  1. Save a Word document with some text in .docx format under the name test.docx.
  2. PowerShell or Terminal: mv test.docx test.zip.
  3. Back in Windows Explorer or the Finder, double-click test.zip.
  4. Inside the new test folder, look for a word folder. Inside that, look for document.xml. Try opening document.xml in Komodo Edit or another editor…can you find the text you typed?
  5. What is the nature of the docx format? Does it look like anything else you’ve seen? How is it different?
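
The same peek inside can be done programmatically. The Python sketch below builds a tiny stand-in archive in memory, so it runs without Word, and then reads word/document.xml back out of it. The stand-in is not a valid Word file (a real .docx has more parts and more elaborate XML), and the function names are made up for the example.

```python
import io
import re
import zipfile

def make_stand_in_docx(text):
    """Build a tiny zip archive shaped like a .docx (NOT a valid Word file)."""
    xml = (
        "<w:document><w:body><w:p><w:r>"
        f"<w:t>{text}</w:t>"
        "</w:r></w:p></w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()

def text_runs(docx_bytes):
    """Return the contents of every <w:t> element in word/document.xml."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        xml = archive.read("word/document.xml").decode("utf-8")
    # Match <w:t> with optional attributes, but not longer names such as <w:tbl>.
    return re.findall(r"<w:t(?:\s[^>]*)?>(.*?)</w:t>", xml)

print(text_runs(make_stand_in_docx("Hello world")))
```

With a real file you could try text_runs(open("test.docx", "rb").read()) and see how your sentences have been split into runs.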

More on markdown

andrewgoldstone.com/md/ for my notes;
daringfireball.net/projects/markdown/basics for an introduction by Markdown’s inventor;
johnmacfarlane.net/pandoc/README.html#pandocs-markdown for pandoc’s extended markdown
(also available through the command man pandoc_markdown).

More on HTML

www.htmldog.com/guides/html: puckish and good introductions to HTML and CSS.
developer.mozilla.org: lots of material, including introductions to HTML, CSS, and every other web technology under the sun. Uneven.

More on TeX

andrewgoldstone.com/tex has pointers to further reading (the exhortation “Typography or Death!” is optional).
Getting Started with TeX, LaTeX, and Friends by the TeX Users’ Group, the clearinghouse for things TeX.
LyX, from lyx.org, which I do not use, provides a graphical (“What You See Is What You Mean”) way to edit LaTeX documents.
Stack Exchange is now the best place to search for answers to TeX questions (and pose new ones).

Showing my work

The way I generated the workshop materials might serve as an example of slightly more complex pandoc usage. To see the markdown source for this page (and the slideshow equivalent), click here (this is a gist).

To generate the html, I run

pandoc annc.md notes.md --smart --base-header-level 3 -o notes.html

and copy and paste the result into WordPress. (Notice I do not use the -s option because WordPress supplies header elements to posts. annc.md contains markdown for the workshop announcement.)

To generate the slideshow, I run

pandoc notes.md --smart -t slidy -s --slide-level 2 -o slides.html

(Thanks to the verbosity of these revised notes, the generated slideshow is longer than the one I showed in the workshop, and has a few “slides” that spill off the screen.)

To generate the handout, I wrote this source and ran

pandoc handout.md --latex-engine=xelatex -V fontsize=10pt \
    -V mainfont='Garamond Premier Pro' -o handout.pdf

The pandoc manual explains the extra options I use (though note that Garamond Premier Pro is a font I had to buy, so check out the font options on your own system).

Edited 11/23/13 by AG: Added workshop notes.
Edited 12/19/13 by AG: Fixed some typos.