Document preparation at home vs. the arXiv

I like to think that I am aware of some basic typography guidelines and that I pay attention to them when preparing my documents. For the most part, I care about line and page breaks and how they affect the readability of a sentence. A visually appealing document is important enough for me to consider reformulating certain passages if it helps the appearance (of course, without compromising the accuracy of the content). I am quite content with my two latest articles, "No eleventh conditional Ingleton inequality" and "Selfadhesivity in Gaussian conditional independence structures", in this regard.

I have only recently been able to do this easily and happily. In this article I want to document the two pitfalls I used to run into when preparing preprints for the arXiv.

The microtype package needs pdflatex

What made my first couple of submissions to the arXiv nightmarish was the use of microtype. This is a great LaTeX package which applies “micro typography” rules beyond what TeX normally does. This includes spending more effort on shrinking and enlarging spaces between words and letters to make lines look equally full. The other improvement I notice is that it pushes punctuation (periods, commas, hyphens) a little into the margin of the page. It looks so much better than treating those like any other character. It’s the sum of many small things that matters a lot.
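For readers who have not used the package, here is a minimal sketch of loading it. The document body is a placeholder of mine, and spelling out the two options is only for illustration: to my understanding both are already enabled by default when compiling to PDF with pdflatex.

```latex
\documentclass{article}
% protrusion: let punctuation hang slightly into the margin
% expansion: stretch or shrink glyphs a little to even out lines
\usepackage[protrusion=true,expansion=true]{microtype}
\begin{document}
A placeholder paragraph that should be long enough to span several
lines, so that the effect on line breaks and margins becomes visible
when the same source is compiled with and without the package.
\end{document}
```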

The problem is that a document using microtype comes out totally different on the arXiv: different line breaks, different numbers of lines in a paragraph. The accumulation of all these over the document can drastically change its appearance; even the number of pages can change. So what I did in my first couple of papers that used microtype was to iteratively introduce tildes, explicit breaks and penalties, and even slightly reformulate sentences, to get one LaTeX source which renders (up to a few line endings) the same on my computer and on the arXiv. This would usually take the entire evening. It’s like the typography version of a polyglot program.

For some time I thought this was due to a version incompatibility between the arXiv and my own installation. This caused quite some cognitive dissonance because I always thought TeX output is supposed to be incredibly stable.

Last year (on my birthday) I found out that this situation is actually documented in the arXiv FAQ. All that pain was caused by my own ignorance! The issue is that microtype requires the pdftex engine, which I use at home anyway through latexmk -pdf. But the arXiv by default uses latex and dvips. (This is why many papers are also offered as .ps for download; you don’t get this option if you insist on the pdflatex engine.) To indicate that your document must be compiled with pdflatex, it suffices to add the line

\pdfoutput=1

somewhere in the first five lines of your main LaTeX file. After adding this line, my documents are actually compiled using the same tools that I use at home and, sure enough, the result is exactly the same.

arXiv needs your papersize

Recently, I discovered a new kind of problem. At some point I decided to create my own little document class, tboege-preprint, which inherits from amsart but sets other defaults (11pt, a4paper), loads and configures the packages I commonly use and defines my usual macros. As a result, my preamble just consists of loading this class, the above \pdfoutput=1 line, a handful of per-paper packages and macros and the title of the paper.
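To make the idea concrete, here is a minimal sketch of what such a class file could look like. It is not my actual class: the name my-preprint, the choice of microtype as the only loaded package and the \RR macro are placeholders chosen for illustration.

```latex
% my-preprint.cls: hypothetical class file for illustration only
\NeedsTeXFormat{LaTeX2e}
\ProvidesClass{my-preprint}
% Forward any options given to \documentclass on to amsart
\DeclareOption*{\PassOptionsToClass{\CurrentOption}{amsart}}
\ProcessOptions\relax
% Load amsart with the defaults mentioned above: 11pt and A4 paper
\LoadClass[11pt,a4paper]{amsart}
% Commonly used packages would be loaded and configured here
\RequirePackage{microtype}
% The usual macros would follow (illustrative example)
\newcommand{\RR}{\mathbb{R}}
```

A paper would then start with \documentclass{my-preprint}, provided the .cls file sits next to the main file.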

When I submitted my first paper using this document class, the result on the arXiv was again very different from the document I had carefully prepared on my computer.

The first thing I determined was that this time the problem persists if I disable microtype everywhere (the document looks uglier, but equally so on both ends). Web searching only turned up an entry in the arXiv help which said that paper sizes used to be a problem when printing papers from the arXiv because they were all forced into the US letter format. This did not seem to relate to my problem, but what helped me solve it was the table of US letter vs. A4 sizes. US letter paper is slightly wider but shorter (215.9 mm × 279.4 mm, compared to 210 mm × 297 mm for A4). This was exactly the way in which my document was messed up.

I declare a4paper explicitly in my document class, but it appears that the arXiv has its own heuristics for determining the paper size, and if it does not detect a4paper, then it will give you letterpaper, no matter what you tell the TeX engine deep down in your document class. I don’t really understand how this works, but it is what I conclude from my experiments. Once I declare a4paper explicitly in my main LaTeX file (even though it is the default in my document class), the arXiv again produces the same document as the LaTeX environment on my computer.

Hence, my preamble now always reads

\documentclass[11pt, a4paper]{tboege-preprint}
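Putting both fixes together, the top of a main file could look roughly like the following sketch. It uses amsart as a stand-in for my own class so that it compiles anywhere; the tikz package, the \CI macro, the title and the author are placeholders.

```latex
\documentclass[11pt,a4paper]{amsart}
\pdfoutput=1
% Line 1 states the paper size in the main file itself;
% line 2 requests pdflatex and sits within the first five lines.
\usepackage{microtype} % a custom class could load this already
\usepackage{tikz}      % placeholder for per-paper packages
\newcommand{\CI}{\mathrel{\perp\!\!\!\perp}} % placeholder macro
\title{Placeholder title}
\author{A. N. Author}
\begin{document}
\maketitle
Placeholder body text.
\end{document}
```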

Preparing for upload

Finally, this is not a problem per se, but I thought I’d mention it. I tend to keep many notes in the source code of my papers: explicit results of computations, shell or Perl snippets, proof ideas for parts left to the reader, passages I decided to remove from the paper or future directions which are not interesting enough to include. However, I feel that the arXiv is not a place to “secretly publish” these comments. Thus, I keep a Makefile with the following rules inside:

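# Build the PDF with pdflatex; latexmk does its own dependency tracking,
# so the target is declared phony and latexmk is always invoked.
paper.pdf: paper.tex paper.bib
	latexmk -pdf paper.tex

.PHONY: paper.pdf

# Strip comments from the source and bundle everything for upload.
arxiv: paper.pdf
	perl -e 'local $$/; $$_ = <>; s/^\s*%[^\n]+\n[\r]?//mg; s/(?<!\\)%[^\n]+$$//mg; s/(\n[\r]?){3,}/\n/g; s/[ ]+$$//mg; print' \
		<paper.tex >paper-arxiv.tex
	cp paper.bbl paper-arxiv.bbl
	zip paper-arxiv.zip paper-arxiv.tex paper-arxiv.bbl *.tikz *.sty *.cls

.PHONY: arxiv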

The perl line reads my paper.tex, strips all the comments from it and then collapses runs of 3 or more newlines (left behind by chunks of removed comments) into just one newline. This requires some discipline on my part: I must never separate paragraphs by 3 or more newlines (or end a paragraph with exactly one comment line followed by an empty line), since otherwise two paragraphs are combined. This will show in the output on the arXiv, of course, so I am content with these restrictions. What the script takes care not to remove are the comments consisting only of a single % right before the end of a line. These are usually there for a reason: to prevent spaces from being inserted at a linebreak that is only there to make the source code more readable.
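As a small worked example of what the filter does, consider the following made-up fragment. The sentences, the check.pl script mentioned in the comment and the \CI macro are invented for illustration.

```latex
The bound is tight. % TODO: double-check the constant
% Computed with: perl check.pl 5
It holds in $50\%$ of the cases.
\newcommand{\CI}[2]{%
  #1 \perp #2%
}
```

Running the substitutions on this input should leave the following: the full-line comment disappears together with its line, the trailing comment is cut off along with the space before it, and both the escaped \% and the lone % characters at the line ends survive.

```latex
The bound is tight.
It holds in $50\%$ of the cases.
\newcommand{\CI}[2]{%
  #1 \perp #2%
}
```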

Bundling everything into a zip archive makes it convenient to upload to the arXiv. Now that I am aware of the above two gotchas, though, I hope that I do not have to iterate the “upload and check” loop as often anymore.