Wednesday, November 20, 2013
1:10 pm–3:10 pm
Alexander Library 413 169 College Ave., New Brunswick, NJ
Taught by Andrew Goldstone, English Department.
This workshop aims to expand our horizons for thinking about how we handle text on our computers. In order to attain liberation from Word, we will explore the difference between text editors and word processors, discuss the ways computers represent text as content or form, and experiment with some key technologies for digital document preparation. We will dally with three related computer languages in rapid succession. We will begin with markdown, a minimal but versatile set of plain-text conventions. Then we will learn to convert markdown into equivalent HTML markup using Pandoc. Finally, we will introduce the LaTeX document-processing system, which elegantly typesets plain text into PDF files. No prior experience with any form of markup or other computer coding is required. This workshop is free but spaces are limited. Graduate students, faculty, and staff are welcome. Please email Vishal Kamath to register at vkamath at scarletmail.rutgers.edu.
Before the workshop
Suggested: bring a document you have written on your computer, in any format. At the workshop, all the software we will use will be available on the lab machines. If you prefer to bring your own laptop computer for some or all of the workshop, here is the software we will be using (all free and cross-platform):
- Komodo Edit for the introduction to text editing. You may use any other text editor you prefer, of course.
- Pandoc for markdown-to-HTML conversion.
- TeX Live (Mac, Windows, Unix). (A large download.)
For more on what these things are all about, you may wish to read AG’s pages about markdown and LaTeX. The creator of markdown, John Gruber, has a page which allows you to experiment with markdown in your browser.
Empowerment Part II
The actual “empowerment” (modest but real) comes in getting a more detailed understanding of the way the systems we already use handle text, and in learning more ways to manipulate that text, beyond the confines of any single program. The business of plain-text-slinging, a minor craft on its own, nonetheless forms a natural starting point for thinking more deeply about analyzing digitized texts, expressing yourself in “code” of various kinds, and composing in the digital medium.
Downloads
In order to do the workshop on your own, first install Pandoc and LaTeX (links above). Komodo Edit is optional; any text editor will do, though I’ll occasionally refer to details in Komodo (menu items, etc.) that may be slightly different in other editors. See below for text editor suggestions.
The handout from the workshop (PDF)
Sample files (zip). Unzip this and place the resulting emp2 folder some place where you can find it (like your home directory). (Slightly modified from those used during the workshop.)
How does a computer represent text?
Text encoding
H e l l o w o r l d
48 65 6c 6c 6f 20 77 6f 72 6c 64
Markup
<p>Hello world</p>
Image
Text images could be bitmaps like this or curves, as in Postscript, PDF, SVG.
(image CC-BY Wendell Oskay)
Text encoding
ASCII
The American Standard Code for Information Interchange: Roman letters, numbers, punctuation and control characters correspond to 7-bit numbers (0 to 127); each character fits in a byte (8 bits).
Here is ascii.txt (in emp2):
I am a test file.
I have 2 lines
The numerical representation of these characters, in decimal, is:
I SP a m SP a SP t e s t SP f i l e
73 32 97 109 32 97 32 116 101 115 116 32 102 105 108 101
. NL I SP h a v e SP 2 SP l i n e s
46 10 73 32 104 97 118 101 32 50 32 108 105 110 101 115
It is more conventional to write numerical representations like this in hexadecimal, which has exactly two digits per byte. (The Wikipedia page on hexadecimal explains how base 16 numbers are written.) In hexadecimal the same numbers are written:
I SP a m SP a SP t e s t SP f i l e
49 20 61 6d 20 61 20 74 65 73 74 20 66 69 6c 65
. NL I SP h a v e SP 2 SP l i n e s
2e 0a 49 20 68 61 76 65 20 32 20 6c 69 6e 65 73
Notice that capital letters are distinct from lowercase, and spaces and new lines need their own bytes. The convention of hexadecimal 0a for newline is not universal, and linebreaks are often a conversion problem when moving text files between computers running different operating systems.
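If you would like to check the tables above yourself, here is a minimal sketch in Python (my choice of language for illustration; nothing in the workshop depends on it). The string is the content of ascii.txt typed in by hand, and the variable names are my own.

```python
# The two lines of ascii.txt, with an explicit newline (0a) between them.
text = "I am a test file.\nI have 2 lines"

# Encoding as ASCII yields exactly one byte per character.
data = text.encode("ascii")

# The same bytes written out in decimal and in two-digit hexadecimal.
decimal = [str(b) for b in data]
hexadecimal = [format(b, "02x") for b in data]

print(" ".join(decimal))
print(" ".join(hexadecimal))

# The same line break written the Windows way takes two bytes: 0d 0a.
windows_data = text.replace("\n", "\r\n").encode("ascii")
print(len(data), len(windows_data))
```

The last two lines illustrate the linebreak problem: the Windows convention adds a carriage-return byte (0d) before each 0a, so the "same" file is one byte longer per line break.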
Unicode
Problem: what about…all the other letters in the world?
L'anglocentrisme, ça m'énerve.
The Unicode standard. The core of the standard is the assignment of characters (“glyphs”) in all the world’s writing systems to codepoints.
Codepoints are notated U+xxxx where xxxx is a four-digit hexadecimal (base 16) number.
There are a lot of possible codepoints. (But not as many as I said in the workshop.) The four-digit codepoints U+xxxx allow for only 65536 distinct glyphs. The full Unicode specification has room for another million or so glyphs by extending the range past four hexadecimal digits: the highest code point is actually U+10FFFF, which takes 21 bits rather than 16. See Wikipedia s.v. Unicode.
Some Unicode examples
(the text on this webpage makes use of Unicode to incorporate characters from non-Latin scripts using a single encoding)
ç U+00E7 LATIN SMALL LETTER C WITH CEDILLA
a U+0061 LATIN SMALL LETTER A
U+0020 SPACE
??????? ??? ????…
ἄ U+1F04 GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA
???
水 U+6C34 CJK UNIFIED IDEOGRAPH-6C34
の U+306E HIRAGANA LETTER NO
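The codepoints and names in this list can be looked up programmatically. Here is a small sketch using Python's standard unicodedata module; the function name describe is my own, and I have written the characters as escapes so that the example does not depend on how this page itself is encoded.

```python
import unicodedata

def describe(char):
    """Return 'U+XXXX NAME' for a single character."""
    return "U+{:04X} {}".format(ord(char), unicodedata.name(char))

# The characters from the list above, written as codepoint escapes.
for char in ["\u00e7", "a", " ", "\u1f04", "\u6c34", "\u306e"]:
    print(describe(char))

# Codepoints run from U+0000 to U+10FFFF, so this is how many there are.
print(0x10FFFF + 1)
```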
UTF-8
The codepoints are encoded in various ways. In UTF-8, now the most common text encoding on the web, Unicode code points are mapped to bytes in a more compact way, and with an Anglocentric convenience: ASCII is preserved.
Here is unicode.txt:
I am a test file.
Voilà la deuxième ligne.
I SP a m SP a SP t e s t SP f i l e
49 20 61 6d 20 61 20 74 65 73 74 20 66 69 6c 65
. NL V o i l à SP l a SP d e u x
2e 0a 56 6f 69 6c c3 a0 20 6c 61 20 64 65 75 78
i è m e SP l i g n e .
69 c3 a8 6d 65 20 6c 69 67 6e 65 2e
Notice that UTF-8 encodes the Latin alphabet just like ASCII (e.g. a is hexadecimal 61), but that the characters with diacritical marks, à and è, require two bytes in this encoding (c3a0 and c3a8 respectively).
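Again, this is easy to verify. The following Python sketch encodes the second line of unicode.txt (typed in by hand, with the accented letters written as escapes) and shows how many bytes each kind of character takes. The variable names are mine.

```python
# The second line of unicode.txt.
text = "Voil\u00e0 la deuxi\u00e8me ligne."
data = text.encode("utf-8")

# More bytes than characters: each accented letter takes two bytes.
print(len(text), len(data))
print(" ".join(format(b, "02x") for b in data))

# Byte counts grow with the codepoint: ASCII, Latin with diacritics, CJK.
for char in ["a", "\u00e0", "\u00e8", "\u6c34"]:
    encoded = char.encode("utf-8")
    print(repr(char), len(encoded), encoded.hex())
```

The ideograph at the end needs three bytes, which is the other side of the Anglocentric convenience: the further a script is from ASCII, the more bytes UTF-8 spends on it.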
Activity: the text editor
- Open some programs for text: Notepad (Mac: TextEdit), Komodo Edit, Word. Type anything in each.
- What differences do you notice in the representation of text?
- What different capacities do you notice? Pay attention to the kinds of decisions the different programs make for you—and the kinds of things you have to be explicit about.
Editing: varia
Komodo Edit has a text encoding menu, by the way: look at the bottom of the window.
You might also notice that lines are not broken for you automatically. In a text editor, you must press Return to make a line break part of the text file. However, all text editors have an option to “soft wrap” text, i.e., wrap it for display without inserting newline characters (those 0a bytes). In Komodo this is the “Word Wrap” option in the “View” menu.
Why plain text?
- “Human-readable”
- Fewer conversion problems (encodings still an issue)
- Text as data: programmatic manipulation
- Plain text as blueprint for multiple forms…
- All programming, and all the web, lives in the world of plain text
Some editors I’ve heard of
I chose Komodo Edit for the workshop because it was the only text editor with the features I wanted that was available on PCs as well as Macs. I do not regularly use it, and there are many other good choices of text editors. None of the activities in this workshop requires any particular editor program; all of the following could be considered:
- Mac: TextWrangler (free), MacVim (challenging to use), emacs or Aquamacs (ditto), BBEdit (commercial), TextMate (commercial), TextEdit (very basic, on every system)
- Windows: Notepad++ (free), TextPad (commercial), UltraEdit (commercial), Notepad (very basic, on every system)
- Cross-platform: emacs, vim, Komodo Edit, Sublime Text 2 (commercial)
I personally use MacVim, but I found it challenging to learn and don’t recommend starting with it if you are new to the world of plain text and computer languages. I suggest TextWrangler or Notepad++ for those starting out writing plain text for any purpose.
Markdown
The principle: make plain text more expressive with some extra conventions.
- still pretty easy for a human writer/reader to interpret
- but systematic enough to be processed programmatically
Text conventions
*emphasis* or _emphasis_
**strong emphasis**
"(optionally) make quotes curly--and dashes nice--automatically"
extra white space doesn't matter
Where white space does matter
Paragraphs are broken by blank lines.
This is the start of a new paragraph.
But this isn't.
Extra spaces at the end of a line  
make a "hard linebreak."
Four spaces at the start of a line mark "code."
    This text is meant literally: *not styled*.
Structure
# Heading
## Subheading
### Subsubheading
> A block quotation, which can be
> spread over multiple lines if you like.
1. An enumerated list.
2. Another thing.
1. A third thing.
- An "unordered" list.
- Must you bullet point?
Hypertext
Links are written in two parts: the actual URL and the text of the link:
[this is a link to my home page](http://andrewgoldstone.com)
Activity
- Create a new markdown document in your text editor. In Komodo Edit, either use “New From Template” and add a .md to the file name, or simply create a new document and then use the language menu at the bottom of the document window: it says “Text”; click, go to the “Other” submenu, and choose “Markdown.” You can use any text editor to edit markdown, but for the rest of this workshop you must make sure your file has .md at the end of its name (and not .txt).
- Try writing with markdown structures or try hand-converting one of your own documents into markdown.
- Save the file to emp2 with the name first.md.
- (If you prefer, you can look at the sample markdown file I have included in the emp2 folder, sample1.md. Feel free to substitute sample1.md for first.md in the examples below.)
- What do you notice about this process that is different from word processing?
Conversion
Markdown is designed to be converted into other formats (in fact it is now an input-format option on wordpress.com, as it has long been on tumblr.com).
- Open PowerShell (Windows: All Programs > Accessories > Windows PowerShell) or Terminal (Mac, in /Applications/Utilities).
- Change directories to the place you saved your markdown file using the cd command. Navigating your hard drive using cd may be tricky at first. On Windows, if you put emp2 in My Documents, you would type: cd '.\My Documents\emp2'; on a Mac, if you put emp2 in your home directory, just cd emp2 would work; but if you’d saved emp2 in Downloads, you’d need cd Downloads/emp2. Notice that Mac uses a forward slash, not the Windows backslash. Type return after the command.
- Type ls (and a return): this lists the contents of the folder you’ve changed into. Make sure first.md is listed.
- Run pandoc:
pandoc first.md --smart -s -t html5 -o first.html
- Open the folder where you saved first.md. You’ll find a new file, first.html. Double-click the file to open it in a web browser.
- (You can also, if you wish, try this with sample1.md: pandoc sample1.md --smart -s -t html5 -o sample1.html. I have provided, in the converted folder in emp2, a sample1.html which contains the results of running that command on my machine.)
What do you see?
Excursus: running command-line programs
Each command on the command line has the same general shape: a command followed by a series of arguments and options separated by spaces. In the example command to pandoc above, we had:
pandoc first.md --smart -s -t html5 -o first.html
- pandoc: the name of the program
- first.md: an argument to the command, here the source text file. The extension to the file names its format (pandoc notices this).
- -s: an option to the command (options start with one or two hyphens). This one means “stand-alone”: see the section on html structure below.
- -t html5: an option with a parameter: this one means “output html5.” We could omit this, and let pandoc guess from the extension .html. But by default it outputs html as XHTML 1.0 (the XML reformulation of HTML 4), not HTML5.
- --smart: a long option name for typographic quotes and dashes.
- -o first.html: the output file as an option with parameter. If the file already exists, pandoc will replace it completely, so be careful.
-o is standard for output, though some programs will instead just use the order of plain arguments: program inputfile outputfile. And other programs give you no choice about the output file name. Argument order usually matters, and option order usually doesn’t. Other command-line commands to try: cat, echo, ls, mkdir, cd, touch, pwd. (These are standard Unix commands; all but touch are also supported by Windows PowerShell.)
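To see how a command line falls apart into its pieces, here is a sketch using Python's standard shlex module, which splits a string into words roughly the way a Unix shell does. This is only an illustration of the general shape: PowerShell has its own parsing rules, and the variable names are mine.

```python
import shlex

command = "pandoc first.md --smart -s -t html5 -o first.html"

# Split on spaces the way a Unix shell would.
words = shlex.split(command)
program, rest = words[0], words[1:]

print(program)   # the name of the program
print(rest)      # its arguments and options, in order

# Words that start with a hyphen are options; the others are plain
# arguments or parameters belonging to the option just before them.
options = [w for w in rest if w.startswith("-")]
print(options)

# Quoting keeps a name containing a space together as one argument.
print(shlex.split("cd 'My Documents/emp2'"))
```

The last line is the reason the cd example earlier needed quotation marks: without them, the folder name would be split at the space into two separate arguments.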
The pandoc website also has a good introductory tutorial on running the program.
Bonus activity: compare the results of pandoc first.md -o first.html and pandoc first.md -t html5 -o first.html to the results with -s in the browser and in Komodo Edit. The -s option adds an HTML header (you may notice text-encoding problems otherwise in the browser!). Without -t html5, you will notice some differences in the HTML source, though you shouldn’t see any differences in the appearance of the webpage.
HTML
Open first.html (or sample1.html) in Komodo Edit. Now what do you see?
HTML and markdown compared
- Same principle: make plain text more expressive with conventions.
- Most of the information is text “content.”
- The rest—also text—is annotations.
- But HTML is more explicit about structure (and more versatile).
Tagging
Annotations normally take the form of pairs of tags: <xyz>some annotated text</xyz>.
<em>Annotation for emphasis</em>
<strong>Annotation for strong emphasis</strong>
Extra white space still doesn't matter
<h1>Top-level heading</h1>
<h2>Subheading</h2>
<h3>Etc.</h3>
If you haven’t already, open up sample1.html in Komodo Edit to see examples of some of these tags.
Structure
HTML tells you about structure. Every paragraph is also tagged.
<p>I am a paragraph.</p> <p>I am a second paragraph, even if I am on the same line.</p>
<p>I am a paragraph using the self-closing <br>tag to break a line.</p>
(<br> is a rare example of a stand-alone tag. In XHTML it would have to be <br /> but HTML5 is more permissive.)
Now here is the doozy: HTML is strictly hierarchical.
<!DOCTYPE html>
<html>
<head>
...
</head>
<body>
<p>...</p>
<p>...
<em>...</em>
...</p>
</body>
</html>
(The meaning of pandoc’s -s option is that it adds a suitable <head> so you don’t have to figure that part out yourself.)
What is wrong with the following?
<p>I aspire to be a <em>well-formed slice of HTML.</p>
<p>But apparently I have issues.</p>
And?
<p>I aspire to be a <em>well-formed slice of HTML.</p>
<p>But</em> apparently I have issues.</p>
This is quite right, on the other hand:
<p>I am an <em>emphasis with a <strong>strong emphasis</strong> inside</em>.</p>
I am an emphasis with a strong emphasis inside.
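The hierarchy rule can be checked mechanically: every closing tag must match the most recently opened tag that is still open. Here is a rough sketch of such a check using Python's standard html.parser module. The class and function names are mine, the list of stand-alone tags is partial, and the check is stricter than real browsers, which quietly repair many of these mistakes.

```python
from html.parser import HTMLParser

# Tags that stand alone and never take a closing tag (a partial list).
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}

class NestingChecker(HTMLParser):
    """Record tags that close out of order or never close."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of tags opened but not yet closed
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # forms like <br /> open and close at once

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.problems.append("unexpected </{}>".format(tag))

def check_nesting(source):
    """Return a list of nesting problems (empty if well-formed)."""
    checker = NestingChecker()
    checker.feed(source)
    checker.close()
    unclosed = ["unclosed <{}>".format(t) for t in checker.open_tags]
    return checker.problems + unclosed

bad = ("<p>I aspire to be a <em>well-formed slice of HTML.</p>\n"
       "<p>But apparently I have issues.</p>")
good = ("<p>I am an <em>emphasis with a <strong>strong emphasis"
        "</strong> inside</em>.</p>")

print(check_nesting(bad))
print(check_nesting(good))
```

For the first snippet the check complains as soon as it meets a closing p tag while the em is still open; for the properly nested one it reports nothing.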
Block quotes
The hierarchy means that block quotes and lists require more annotations:
<blockquote>
<p>A single-paragraph block.</p>
</blockquote>
<blockquote>
<p>A two-paragraph block, first round.</p>
<p>A two-paragraph block, second round.</p>
</blockquote>
Which represents something like:
A two-paragraph block, first round.
A two-paragraph block, second round.
Lists
<ol>
<li>First</li>
<li>Second</li>
<li>And last.</li>
</ol>
<ul>
<li>First</li>
<li>Second</li>
<li>And last.</li>
</ul>
The two constructs differ only in the outer ul/ol tags, but the browser renders them rather differently:
1. First
2. Second
3. And last.
- First
- Second
- And last.
Abstraction
There are also two purely abstract tags:
- <div> (a block-level division, i.e. one that can contain paragraphs)
- <span> (an inline element within a paragraph).
These are mainly useful in conjunction with more elaborate styling using CSS.
Links and attributes
Within the < > of the opening tag, HTML allows attributes of the form name="value". A common abstract attribute is a unique id for an element:
<p id="par1">I am a paragraph.</p>
More useful is the href attribute of the “anchor” tag <a>.
<a href="http://andrewgoldstone.com">this is a link to my home page</a>
You can also specify an href to an id:
<a href="#par1">I refer back to the element with
id <code>par1</code> on this page.</a>
Heads and bodies
The HTML hierarchy includes not just the data of the document but (broadly speaking) metadata. All of the displayed elements of the document live within the <body>...</body> tags. The metadata lives in the head:
<head>
<title>The title of the page</title>
<meta charset="utf-8">
</head>
And even outside the <html>...</html> comes the document type declaration, <!DOCTYPE html> (which means: the rest of this is an HTML5 document).
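The charset line in the head is what saves the browser from guessing how to turn bytes back into characters. Here is a small Python sketch of what a wrong guess looks like. I use Latin-1 as a stand-in for whatever single-byte Western encoding a browser might fall back on; the actual fallback varies by browser and system.

```python
# "Voilà" as UTF-8 bytes: the à takes two bytes, c3 a0.
data = "Voil\u00e0".encode("utf-8")

# Read back with the right encoding, the text survives.
right = data.decode("utf-8")

# Read back one byte per character, each byte of à becomes its own
# character: c3 turns into Ã and a0 into a non-breaking space.
wrong = data.decode("latin-1")

print(right)
print(wrong)
print(len(right), len(wrong))
```

This is the kind of text-encoding problem you may have seen in the browser when converting without the -s option.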
Activity
Edit the first.html you created using any of the tags here, and save it as a new file second.html. Try loading the page in your browser.
Then try
pandoc --atx-headers second.html -o second.md
Open second.md in Komodo Edit. What does it look like?
For comparison, take a look at sample2.html in the emp2 folder. This is a modified version of sample1.html. The results of using pandoc to convert this back to markdown are in converted as sample2.md.
(Leaving off the --atx-headers option would get you another markdown header convention not described above, in which headers are “underlined” with hyphens or equals signs.)
Making websites
Websites consist of structured text in HTML plus presentational information in CSS. CSS is for another day (or see links to online guides below), but the point of CSS is that it gives rules for rendering HTML elements. (For example: put all paragraphs in 14-point font. Make the body background light gray. Etc.) More powerful websites do their data-crunching with programs behind the scenes and send their results to the “client” in HTML. Many sites also transform their HTML using little programs in JavaScript that the client web browser runs.
Conversion (2a)
Look in emp2 for the emp1.md file. Open this in Komodo Edit. This may look familiar… Now try
pandoc -s emp1.md -o emp1.html
Double-click emp1.html in Windows Explorer/MacOS Finder. Okay, no prob: a web page. What about:
pandoc -s -t slidy emp1.md -o emp1s.html
Try opening emp1s.html. If the web browser asks you, allow scripts to run. What has happened? (I have put the emp1.html and emp1s.html generated on my machine in converted for comparison.)
This sort of thing is what is afforded by separating the text content and structure from the mode of presentation. To read more about the many other formats that pandoc can convert markdown to, read the pandoc documentation.
(Incidentally, while there are many other programs that read and write markdown, this ability to convert to slidy and other HTML slideshow formats is particular to pandoc.)
Conversion (2b)
But we can go even further. Try:
pandoc -s emp1.md -o emp1.tex
pdflatex emp1
Now look for the newly-created emp1.pdf file. Open it up. What has happened?
(Notice that on the command line, pdflatex takes only a single argument, the name of the source file, without an extension. Its output is always named the same way as the source file, but with a .pdf extension. We’re about to do the rest of our TeXing inside a graphical program, so I’ll leave aside the details of running command line TeX. The point here is simply that somehow the same markdown source can also lead to a typeset PDF. As usual, look for sample emp1.tex and emp1.pdf files in the converted folder.)
Conversion (3)
Try:
pandoc -s sample1.md -o sample1.tex
Open TeXWorks (Windows; TeXShop on Mac) and open sample1.tex inside that program. What similarities can you see to the markdown source? You might have to scroll down to where you see \begin{document}…
(For comparison, there is a sample1.tex inside converted in emp2.)
LaTeX
The language of the file you are now examining is LaTeX, which is itself a dialect of TeX. TeX is a typesetting language invented by Donald Knuth; LaTeX is the system built on top of it by Leslie Lamport and many others, beginning in the early 1980s. (The Wikipedia TeX page has a little more history). Most users of TeX these days use LaTeX.
About pronouncing the name
The name is significant in two ways. On the one hand, it is profoundly alienating to learn that you’re “supposed” to read the last letter of TeX as a Greek chi and pronounce it like the ch in loch. The name becomes a supernerd shibboleth—a feeling which is reinforced by the ritual snootishness between those who pronounce the first syllable of LaTeX “la” and those who say it like the word “lay.” All of that is depressing.
But the name is significant in another way. Remember that the standard for encoding text, ASCII, leaves no room for other alphabets and languages. The aspiration implicit in the name TeX—anticipating the sensibility of Unicode by more than a decade—is that computers can and should make sense of multiple scripts. Even more importantly, TeX was, according to Knuth, born out of a determination to get computers to typeset books well. TeX expresses the idea of making the computer handle text in the ways that people want—rather than forcing us to use text only in the ways that are most convenient for computers.
Continuing on from where we had to stop in the workshop…
Earlier we processed markdown into HTML using pandoc. Then we tried using pdflatex at the command line to make PDF files. Now we’re going to work on TeX in an all-in-one program, TeXWorks (Windows) or TeXShop (Mac). These programs include a text editor window very similar to Komodo (but specialized for TeX), and they let you run the TeX program without having to type commands at the command line. This is completely equivalent to using a separate editor and running command-line TeX. It’s mostly a matter of taste and convenience. You can switch back and forth at will! This is another virtue of using plain text formats: you can edit them in many different programs, and you’re not “locked in” once and for all.
Activity
Let’s examine a rudimentary LaTeX file.
- Open up sample3.tex from the emp2 sample files using the “Open” command in the “File” menu of TeXShop or TeXWorks.
- Choose “XeLaTeX” as your typesetting engine. In both TeXWorks and TeXShop, there is a pop-up menu near the top of the document window (it says “PDFLaTeX” by default in TeXWorks and “LaTeX” by default in TeXShop).
- Typeset! In TeXWorks, click the green “play” button at the top left of the document window; in TeXShop, click the “Typeset” button.
- Take a look at the resulting PDF. (This PDF is displayed within the TeXWorks/Shop program; but you can also open it up in another PDF viewer, like Adobe Reader or Preview. For comparison, see sample3.pdf in converted in emp2.)
What is different about this document-preparation process from word processing?
XeLaTeX is a newer “flavor” of LaTeX which emphasizes Unicode support and allows you to use any font on your system. See below for a note of explanation.
The language
The structure of a LaTeX document is not too different from an HTML document.
Document class declaration and preamble
\documentclass[12pt]{article}
Then follows a series of commands called the preamble. Its purpose is a bit like the HTML <head>. These commands do most of the work of tweaking the page format. When you start out, it’s easiest to use someone else’s preamble as a template. I’ve supplied one in template.tex. In the next activity, we’ll try that out.
Document body
\begin{document}
Here is where the text of the document goes, just like the HTML <body>.
\end{document}
The same principle
LaTeX is, yet once more, plain text with conventions to make the text more expressive. Unlike HTML, LaTeX is actually a programming language. But here’s the point: programming languages are plain text with expressive conventions. It’s just that some of the expression is done with another computer program that processes the text.
In LaTeX, the convention is: the conventional material is signalled by words beginning with a backslash \. These are TeX commands. The commands can have parameters, normally surrounded by balanced braces { }. Sometimes commands themselves come in pairs, like HTML’s paired tags. For example:
\begin{quote}
I am a blockquote too!
\end{quote}
But sometimes not:
\emph{I am emphasized text.}
Basic commands for text
Extra white space doesn't matter.
One or more blank lines in the source makes a paragraph break.
If you wish to insert a linebreak of your own, use a double backslash \\.
``Double quotes'' and `single';
the apostrophe's easy;
and dashes---the em dash---and the en dash (for numbers, as in 1990--2000).
An ellipsis\ldots gets its own command (notice the space after disappears).
“Double quotes” and ‘single’;
the apostrophe’s easy;
and dashes—the em dash—and the en dash (for numbers, as in 1990–2000).
An ellipsis…gets its own command (notice the space after disappears).
Here is an example of \emph{emphasis} and of
\emph{emphasis with a \emph{further} emphasis within it}.
\footnote{Note text. Numbers are automatic.}
And a special convention: one that does nothing.
% Everything in the line after a % sign is a "comment."
(In HTML comments are made by <!-- This is the comment. -->. Markdown has no comments.)
Sectioning
\section{Section name}, \subsection{...}, \subsubsection{...}
Environments
Documents are further structured by environments, which set text differently from ordinary paragraphs.
\begin{quote}
A blockquote.
\end{quote}
\begin{verse}
The poem must resist the intelligence \\
Almost successfully. Illustration:
A brune figure in winter evening resists \\
Identity.
\end{verse}
Activity
You are free to try editing sample3.tex and re-typesetting it. But let’s do something more realistic and write a new document on the basis of a template with a fuller preamble.
- Make a copy of template.tex. Call it anything you want, but make sure the name ends in .tex. Open it up in TeXWorks/TeXShop.
- Put in some text in the body of the new document, and try out the LaTeX commands above.
- Find the name of a font you know you have on your system, and stick its name in place of Palatino in the \setmainfont{Palatino} command in the preamble.
- Bonus: try editing the titling parameters in the preamble. (Look for \author{...}.)
- Typeset.
Conclusion: ways of working
If you’ve completed all the activities, you have now tried editing text in three different plain-text languages. You’ve also learned that it’s possible to convert automatically between the formats using pandoc. The implication is that you have a choice when it comes to the format you yourself compose text in, and that you have a second, independent choice about the format you present or publish the text in.
So how to work on text outside of Word? The great advantage of markdown for composition is its terseness. I have found it a convenient format for writing in, especially in shorter forms (blog posts, handouts, quick slide shows). As a presentational format, it is very plain: its main use would be as an e-mail convention.
Markdown is really most useful as something you convert—as source code (to use the term loosely). It is a very easy way to compose for the web browser: I rarely write directly in HTML; normally I write markdown and use pandoc to convert to HTML when I’m finished editing. Usually I then see a few more changes I want to make, so I go through a few cycles of editing-pandoc-open in browser.
I also sometimes write markdown and convert to LaTeX, then typeset into PDF, especially for short documents (like the handout from the workshop). For longer documents I compose directly in LaTeX out of long habit, but it’s equally possible to compose for PDF or print in markdown (possibly with LaTeX mixed in; see the note below on doing this).
Extra notes
LaTeX flavors
The activities ask you to typeset using XeLaTeX rather than PDFLaTeX. You’ll find that if you try to use pdflatex on sample3.tex or template.tex as we did on that converted emp1.tex, you’ll get errors. Though it adds an extra wrinkle to your process, this business of switching to XeLaTeX makes it much easier to use multiple languages and to use any font you like with LaTeX, both important for humanistic text composition.
The LaTeX language you write in is the same whether you use XeLaTeX or PDFLaTeX. The only part of the document that needs to be altered for different LaTeX flavors is the preamble. template.tex shows you a XeLaTeX preamble; for a PDFLaTeX example, see this sample file.
By default, pandoc -s sample1.md -o sample1.tex generates a document with a preamble for PDFLaTeX. Yet another option to pandoc makes a XeLaTeX-flavored document. Try:
pandoc -s --latex-engine=xelatex sample1.md -o sample1xe.tex
This makes a TeX file suitable for typesetting with XeLaTeX out of sample1.md. You can open sample1xe.tex in TeXWorks/TeXShop and typeset it as above.
Pandoc extra
Direct to PDF
Pandoc can actually also run LaTeX for you. This allows it to convert seamlessly from markdown to PDF (secretly it first makes markdown into LaTeX and then typesets it with LaTeX). Try:
pandoc sample1.md -o sample1direct.pdf
Extended markdown
Gruber’s original markdown syntax is a bit limited even for bare-bones composition. Pandoc understands a special markdown dialect with some extra bells and whistles. The Pandoc manual gives all the details, but for humanists the most notable addition is the footnote. The footnote convention has two parts. The footnote marker is annotated with brackets and a caret:
In my text, the place where the footnote marker
goes is here.[^note] Then there's some more text.
You can put any combination of letters, numbers, and hyphens after the caret (I chose note, but you can put 1, zoomwow1one, or anything you like). You can then place the text of the note anywhere else in your document by repeating the marker and adding a colon:
[^note]: This text will go in the footnote.
When it converts markdown to HTML, pandoc automatically numbers the footnotes and puts some extra markup in your document to stick footnotes at the bottom of the page and make the footnote markers internal hyperlinks. When it converts markdown to LaTeX, pandoc’s task is much simpler: it reunites the note text and the marker in a \footnote{} command.
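Because the marker and the note text live in different places, it is easy to end up with a marker that has no note (or a note nobody refers to). The two-part convention is regular enough to check with a few lines of code. What follows is only a rough Python sketch with a function name of my own; the patterns are my approximation of the convention and not pandoc's actual parser.

```python
import re

def footnote_report(markdown):
    """List footnote labels that are referenced but never defined,
    and labels that are defined but never referenced."""
    # A definition is a marker at the start of a line, followed by a colon.
    definitions = set(
        re.findall(r"^\[\^([^\]\s]+)\]:", markdown, flags=re.MULTILINE))
    # A reference is any other marker (one not followed by a colon).
    references = set(re.findall(r"\[\^([^\]\s]+)\](?!:)", markdown))
    return {
        "undefined": sorted(references - definitions),
        "unused": sorted(definitions - references),
    }

sample = (
    "In my text, the place where the footnote marker\n"
    "goes is here.[^note] Then there's some more text.[^missing]\n"
    "\n"
    "[^note]: This text will go in the footnote.\n"
)
print(footnote_report(sample))
```

In the sample, the label missing is a hypothetical second marker I added to show what an undefined footnote looks like in the report.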
Pandoc also has a markdown convention for citations, which tries to format citations using information from a bibliographic database. Unfortunately, the citation format is not sophisticated enough for humanistic scholarship. By contrast, in LaTeX you can use an extremely powerful set of citation commands (supplied by the biblatex package) in conjunction with a bibliographic database stored in yet another plain text format, the bibtex format. That too is a topic for another day.
Markdown mix-in
There is one more key aspect to markdown syntax I did not mention but which I use all the time. Let’s say you know you’re planning to convert your markdown to HTML. You can freely mix markdown and HTML in your text file, and pandoc will keep your HTML and convert the markdown to HTML around it. The following is perfectly legitimate markdown:
In this paragraph I use both *markdown emphasis conventions* and
an HTML <strong>strong emphasis</strong> tag pair.
The downside is that this markdown will not convert perfectly into LaTeX or other formats (you would have to take a detour through HTML first). But often I use markdown only as an expedient pre-text for HTML.
Pandoc allows you to do the same with TeX. Thus pandoc is quite happy to convert this:
In this paragraph I use *markdown emphasis conventions*,
a \LaTeX\ macro, and a \emph{command for emphasis}.
into LaTeX. (This feature means you can mix, for example, LaTeX citation commands into markdown. That is actually how I often write.) Again the conversion is now one-way, markdown to LaTeX, instead of moving seamlessly to other languages like HTML.
Finally: a bonus activity
- Save a Word document with some text in .docx format under the name test.docx.
- PowerShell or Terminal: mv test.docx test.zip.
- Back in Windows Explorer or the Finder, double-click test.zip.
- Inside the new test folder, look for a word folder. Inside that, look for document.xml. Try opening document.xml in Komodo Edit or another editor…can you find the text you typed?
- What is the nature of the docx format? Does it look like anything else you’ve seen? How is it different?
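The same excavation can be done in code, which makes the point that even a Word file is, underneath, marked-up text you can manipulate programmatically. Here is a minimal Python sketch using only the standard library. The function name is mine; the long namespace string is the standard one for Word's XML vocabulary, and I am assuming a simple document in which the text sits in ordinary paragraphs.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace prefix used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_paragraphs(path_or_file):
    """Return the plain text of each paragraph in a .docx file."""
    # A .docx file is a zip archive; the main text is in one XML file.
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):        # <w:p> marks a paragraph
        # The typed text sits in <w:t> elements inside the paragraph.
        pieces = [node.text or "" for node in para.iter(W + "t")]
        paragraphs.append("".join(pieces))
    return paragraphs

# Usage, assuming test.docx is in the current folder:
# for line in docx_paragraphs("test.docx"):
#     print(line)
```

Notice that there is no need to rename the file to .zip first: the renaming only persuades the operating system to open the archive, while the program reads it as a zip file whatever it is called.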
More on markdown
andrewgoldstone.com/md/ for my notes;
daringfireball.net/projects/markdown/basics for an introduction by Markdown’s inventor;
johnmacfarlane.net/pandoc/README.html#pandocs-markdown for pandoc’s extended markdown
(also available through the command man pandoc_markdown).
More on HTML
www.htmldog.com/guides/html: puckish and good introductions to HTML and CSS.
developer.mozilla.org: lots of material, including introductions to HTML, CSS, and every other web technology under the sun. Uneven.
More on TeX
andrewgoldstone.com/tex has pointers to further reading (the exhortation “Typography or Death!” is optional).
Getting Started with TeX, LaTeX, and Friends by the TeX Users Group, the clearinghouse for things TeX.
LyX, from lyx.org, which I do not use, provides a graphical (“What You See Is What You Mean”) way to edit LaTeX documents.
Stack Exchange is now the best place to search for answers to TeX questions (and pose new ones).
Showing my work
The way I generated the workshop materials might serve as examples of slightly more complex pandoc usage. To see the markdown source for this page (and the slideshow equivalent), click here (this is a gist).
To generate the html, I run
pandoc annc.md notes.md --smart --base-header-level 3 -o notes.html
and copy and paste the result into WordPress. (Notice I do not use the -s option because WordPress supplies header elements to posts. annc.md contains markdown for the workshop announcement.)
To generate the slideshow, I run
pandoc notes.md --smart -t slidy -s --slide-level 2 -o slides.html
(Thanks to the verbosity of these revised notes, the generated slideshow is longer than the one I showed in the workshop, and has a few “slides” that spill off the screen.)
To generate the handout, I wrote this source and ran
pandoc handout.md --latex-engine=xelatex -V fontsize=10pt \
-V mainfont='Garamond Premier Pro' -o handout.pdf
The pandoc manual explains the extra options I use (though note that Garamond Premier Pro is a font I had to buy, so check out the font options on your own system).
Edited 11/23/13 by AG: Added workshop notes.
Edited 12/19/13 by AG: Fixed some typos.