bah! Another static site generator

In this first post I want to talk a bit about a program I have written during the course of setting this site up. What I wanted was

a modest static website,
with a blog component,
generated out of a version control system,
no CGI, no application server, no database, no Javascript.

For the markup language, since I like writing email, I wanted something like Markdown, but the markup processor should

1. handle Unicode,
2. provide a good syntax highlighter, even for exotic (or hard to highlight) languages like Perl 6,
3. support nice mathematical formulas¹,
4. offer a little bit of extensibility for custom commands and access to the processor’s parse tree.

Yes, ma, I did consider some established software packages:

Jekyll: “Simple, blog-aware, static sites” sounds perfect. It supports kramdown which in turn supports math formulas, and it has a potent syntax highlighter, Rouge. Sadly, Rouge does not yet support Perl 6 and I don’t know Ruby to add it (or hook into the markdown parser, point 4, for that matter), so tough luck for me. And, just between you and me, I have a hunch that this Jekyll also has an ugly side…
Hugo: Quite a beast and it looked simple enough once I found the places which explain the small subset of things I actually needed. Again, there is the language barrier for hacking on it and Chroma, the syntax highlighter, does not support Perl 6. I would also prefer not to have to rely on clients loading MathJax to get math typeset. As far as I could tell, Hugo, as it is now, will only allow you to inject the MathJax javascripts into your site, and not render the math once and for all on the server. I don’t want to give up Javascript-cleanness only to typeset math. Under different circumstances, we could have been friends, Hugo.
jemdoc: This is indeed lightweight. I (hacked and) used it before to create gaussoids.de which was nice enough. It is “deliberately feature poor” and indeed too poor for my feature list above. Modifying it to do all the things I want, like non-image math support and advanced syntax highlighting, would barely be less work than reinventing the whole thing. Furthermore, jemdoc’s markup parser is based on regex substitution, which I don’t want to be seen with.
various others: of which I looked only at the vanishingly small fraction which I could imagine being able to modify: Their feature set was insufficient and/or the organization of site content inconvenient.

In summary, not only am I too lazy to learn new languages, I also don’t want to learn the intricacies of configuring the existing popular (and hence big) choices, and then I thought for some time that this could be an interesting program to write — or in the words of Tom St Denis:

I am too lazy to figure out someone else’s API. I’d rather invent my own simpler API and use that.

It was (still is) good coding practice.

And this is how bah was born, the program which now generates this website. Its name of course abbreviates “a blog and homepage”. The Duden has an entry for “bah” (that links to its sibling “bäh”) and lists the following meanings:

Interjection expressing aversion, disdain or schadenfreude,
onomatopoetic for the bleating of a sheep.

If you are curious whether bah fulfills at least the first meaning of “bah”, do keep on reading.

A closer look at `bah`

Markup language and processor

One of the criteria for my markup language was Markdown-alikeness, because it is convenient to write, and another was being able to access and modify the parse tree of the document to set up custom filters for the document. I learned that those processors are scarce and doubly scarce if you’re lazy about learning new languages.

I considered for example AsciiDoc, which is what jemdoc is allegedly based on. The reference implementation in Python doesn’t have anything resembling a parse tree in/out format, as far as I could see. An alternative implementation, Asciidoctor, however, does, but it’s in Ruby and Ruby-only…

My savior was pandoc. It has a very feature-rich variant of Markdown that allows fenced code blocks with classes, images with attributes, definition lists, tables, inline and block mathematics and other cool things². It handles UTF-8 and even turns --- or " into their typographically preferable — so-called “smart” — counterparts, as demonstrated by this sentence. More importantly, it can input and output a parse tree in its internal Haskelly format or JSON, and provides a --filter option to install parse tree filter scripts which pandoc will insert into its processing pipeline at the appropriate point. This means that I can modify the document using external filters in any language I like. Perl happens to have excellent support for that in the Pandoc::Filter module. This is the basis for overcoming the insufficiencies of Jekyll and Hugo.
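To make the filter mechanism concrete, here is a minimal sketch of what an external filter does. It is written in Python with only the standard library, not with Perl's Pandoc::Filter, so it is not the code bah uses. pandoc hands the filter the document as JSON on standard input and reads the modified JSON back from standard output. The names `walk` and `shout` are made up for illustration; the element shape (`{"t": ..., "c": ...}`) follows pandoc's JSON serialization.

```python
import json
import sys


def walk(node, action):
    """Recursively rebuild a pandoc JSON tree, applying `action` to elements.

    `action(element)` returns a replacement element, or None to keep the
    element and descend into its children.
    """
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:  # an AST element such as Para, Str, CodeBlock, Math
            replacement = action(node)
            if replacement is not None:
                return replacement
        return {key: walk(value, action) for key, value in node.items()}
    return node  # plain strings and numbers are returned unchanged


def shout(element):
    """Toy action: upper-case the text of every Str element."""
    if element["t"] == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return None


def main():
    document = json.load(sys.stdin)
    json.dump(walk(document, shout), sys.stdout)


# pandoc passes the target format as the first argument, so this guard
# only fires when the script is actually invoked as a filter.
if __name__ == "__main__" and len(sys.argv) > 1:
    main()
```

Saved as an executable script under a hypothetical name such as `shout.py`, this could be tried with `pandoc --filter ./shout.py post.md`. Real filters replace `shout` with an action that rewrites code blocks or math elements.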

So I based bah on pandoc and Pandoc Markdown.

Syntax highlighting

For some time I was stunned by the Perl 6 syntax highlighting situation that ruled out otherwise fit systems like the aforementioned Jekyll and Hugo. Indeed, even the Perl 6 Advent Calendar has to resort to posting a gist of an article to github and scraping the syntax highlighted code blocks from there back into the article.

Folk wisdom has it that »only perl can parse Perl« and that holds double for Perl 6. So you might ask: »Who even can highlight Perl 6 at all?« — Well, vim can, as can a bunch of other text editors that people use to write Perl 6³. — »But who ~~in their right mind~~ uses vim as a syntax highlighting engine?« — The answer is Perl, in its infinite TIMTOWTDIty.

That’s right. There is a module called Text::VimColor on CPAN that allows you to call out to vim with some text and a filetype and get a stream of your text with interlaced highlighting instructions back — or you can get straight HTML back which is what I’m using here. And I think that’s super cool. Couple this with the vim-perl6 syntax file and you got a capable highlighter for Perl 6 code snippets, and many, many other languages. Putting this behind a Pandoc::Filter, I can turn every fenced code block in my source Markdown document into a highlighted HTML code block.
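The shape of that transformation can be sketched as follows, again in Python rather than Perl and therefore only as an illustration, not as bah's actual filter. A pandoc `CodeBlock` element carries its attributes (identifier, classes, key-value pairs) and the code text; the sketch swaps it for a `RawBlock` of HTML. The highlighter is passed in as a function; `escape_only` is a hypothetical stand-in that adds no colors, where the real thing would call out to vim the way Text::VimColor does. The `highlight` CSS class is likewise an assumption.

```python
import html


def escape_only(code, filetype):
    """Stand-in highlighter: escapes the code but adds no colors.

    A real highlighter would return HTML with <span> markup for `filetype`.
    """
    return html.escape(code)


def highlight_code_block(element, highlighter=escape_only):
    """Turn a pandoc CodeBlock element into a RawBlock of HTML.

    Returns None for any other element and for blocks without a class,
    which tells the caller to leave the element untouched.
    """
    if element.get("t") != "CodeBlock":
        return None
    (identifier, classes, attributes), code = element["c"]
    if not classes:
        return None  # no language given, leave the block to pandoc
    filetype = classes[0]
    markup = '<pre class="highlight {}"><code>{}</code></pre>'.format(
        html.escape(filetype, quote=True), highlighter(code, filetype)
    )
    return {"t": "RawBlock", "c": ["html", markup]}
```

Treating the first class of the fenced block as the vim filetype is a design choice of this sketch; a mapping table would be needed wherever the Markdown class name and the vim filetype differ.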

The Perl 6 syntax file is not perfect, but usually pretty close. See for yourself:

#|«
Return a Unicode clock character that approximately represents the time
component of the given DateTime. There is one character for every half-hour
of an analogue clock, 24 in total. They start at C<U+1F550> (E<0x1F550>).
 
The mapping from non-half-hours to half-hours is specified via the
C<round> parameter which defaults to C<Closest>.
»
sub unitime (DateTime:D() $dt, Round :$round? = Closest --> Str) is export {
    my $half-hour = do given $round {
        # Minute with second and millisecond as fraction
        my $minute = $dt.minute + $dt.second / 60;
        when Up      { ceiling $minute / 30 }
        when Down    {   floor $minute / 30 }
        when Closest {   round $minute / 30 }
    }
    my $hour = $dt.hour + $half-hour div 2;
    $half-hour mod= 2;
    $hour = ($hour - 1) mod 12 + 1; # 0100 to 1230
    my $handle = $half-hour == 0 ?? ' OCLOCK' !! '-THIRTY';
    uniparse "CLOCK FACE %ENGLISH{$hour}$handle"
}

I can highlight every language my vim installation is capable of, for example Gambas which (nearly?) nobody can handle, not even github. Since every Gambas programmer just uses the Gambas IDE, there is not much motivation to support its syntax elsewhere, even though it isn’t all that difficult. Well, now I have a source of motivation for finishing my vim-gambas syntax file. The result isn’t pretty yet, but I’ll keep working on it as I have time.

'' Sort this bucket. This is Mergesort + Insertionsort. Only the last
'' instance (maximum index) of a particular key survives.
Static Private Sub _Sort(Entries As _Entry[]) As _Entry[]
  Dim aSorted As _Entry[]
  Dim hEnt As _Entry
  Dim iMid As Integer

  If Entries.Count < MergesortLimit Then ' Insertionsort
    aSorted = New _Entry[]
    For Each hEnt In Entries
      _Insert(aSorted, hEnt)
    Next
    Return aSorted
  Endif
  ' Mergesort
  iMid = Entries.Count / 2
  Return _Merge(_Sort(Entries.Copy(0, iMid)), _Sort(Entries.Copy(iMid, Entries.Count - iMid)))
End

Mathematics

Pandoc Markdown recognizes “maths”, either inline between $ signs or as a display block between $$ signs. Using $\LaTeX$ the other half of the week, I appreciate the consistency⁴.

However, the built-in math rendering in pandoc either calls out to external services, relies on Javascript, or embeds the typeset formulas as images — or it produces MathML which gets an honorary mention but isn’t portable. Pandoc::Filters come to the rescue again! What I do is intercept all the math blocks in the document and convert them on my own, using $\KaTeX$ via Node.js.

The $\KaTeX$ project prides itself, among other things, on

Server side rendering: KaTeX produces the same output regardless of browser or environment, so you can pre-render expressions using Node.js and send them as plain HTML.

In my opinion, that pride is completely justified and I can barely contain my amazement. As you can see above, I can even turn the $\KaTeX$ logo into a link. It scales when you zoom in or out of this page, because it is just HTML, and it still looks as nice as if it came straight out of pdflatex. The formulas are statically generated once and for everyone on my server. The only external resources I embed are the required fonts and CSS files from $\KaTeX$’s recommended CDN, and no client-side Javascript is involved. If I cared to host these resources myself, this site would be uMatrix-clean.
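The interception step can be sketched like this, in Python and under stated assumptions. A pandoc `Math` element holds its type (`InlineMath` or `DisplayMath`) and the TeX source; the sketch replaces it with a `RawInline` of pre-rendered HTML. The renderer assumes that the `katex` command-line tool from the KaTeX npm package is on the PATH, reads the expression from standard input and accepts `--display-mode`. How bah actually talks to Node.js is not described here, so this is one possible wiring, not the author's.

```python
import subprocess


def render_with_katex_cli(tex, display):
    """Render TeX to HTML by piping it through the `katex` command.

    Assumption: the CLI of the KaTeX npm package is installed and reads
    the expression from standard input.
    """
    command = ["katex"]
    if display:
        command.append("--display-mode")
    result = subprocess.run(
        command, input=tex, capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def prerender_math(element, render=render_with_katex_cli):
    """Replace a pandoc Math element by a RawInline holding static HTML.

    Returns None for every other element.
    """
    if element.get("t") != "Math":
        return None
    math_type, tex = element["c"]
    display = math_type["t"] == "DisplayMath"
    return {"t": "RawInline", "c": ["html", render(tex, display)]}
```

Because the renderer is a parameter, the element handling can be exercised without Node.js by passing a fake `render` function. The generated HTML still needs the KaTeX stylesheet and fonts on the page to look right.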

To flex, let me show you how nicely a result from my Master’s thesis can be reproduced using $\KaTeX$. For context, define the undirected simple graph $Q(n,k,p,q)$ for $n \ge k \ge p \ge q \ge 0$ as follows: its vertices are the $k$-faces of the $n$-cube and two such faces $\gamma, \delta$ are connected by an edge in the graph if and only if there is a $p$-dimensional face $\alpha$ which intersects $\gamma$ and $\delta$ each in at least $q$-dimensional faces. Then the following holds:

Theorem. The graph $Q(n,k,p,q)$ is transitive, hence regular. It is complete if and only if $n + q \le p + k$. The degree of any vertex can be calculated as follows:

$$\deg Q(n,k,p,q) = -1 + \sum_{m,j \; (\dagger)} \binom{k}{j}2^{k-j}\binom{n-k}{k-j}\binom{n-2k+j}{m}$$

where the sum extends over pairs $(m,j) \in [n-k] \times [k]$ which satisfy the feasibility and connectivity conditions

$$n-2k+j \ge m \quad \wedge \quad p \ge m + 2q - \min\{q, j\}. \tag{$\dagger$}$$
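As a sanity check on small cubes, the following Python sketch evaluates the degree formula and compares it with a brute-force count taken straight from the definition of $Q(n,k,p,q)$. It assumes that $[k]$ denotes $\{0, \dots, k\}$, which the text above does not spell out, and all function names are made up for illustration. Faces are encoded as tuples over `0`, `1` and `'*'`, where `'*'` marks a free coordinate.

```python
from itertools import combinations, product
from math import comb


def faces(n, k):
    """All k-faces of the n-cube as tuples over {0, 1, '*'}."""
    result = []
    for free in combinations(range(n), k):
        fixed = [i for i in range(n) if i not in free]
        for bits in product((0, 1), repeat=n - k):
            face = ["*"] * n
            for position, bit in zip(fixed, bits):
                face[position] = bit
            result.append(tuple(face))
    return result


def intersection_dim(a, b):
    """Dimension of the intersection of two faces, or -1 if it is empty."""
    dim = 0
    for x, y in zip(a, b):
        if x == "*" and y == "*":
            dim += 1
        elif x != "*" and y != "*" and x != y:
            return -1
    return dim


def degree_brute_force(n, k, p, q):
    """Count the neighbours of one k-face directly from the definition."""
    gamma, *others = faces(n, k)
    alphas = [a for a in faces(n, p) if intersection_dim(a, gamma) >= q]
    return sum(
        any(intersection_dim(a, delta) >= q for a in alphas) for delta in others
    )


def degree_formula(n, k, p, q):
    """Evaluate the sum over all pairs (m, j) that satisfy both conditions."""
    total = 0
    for m, j in product(range(n - k + 1), range(k + 1)):
        feasible = n - 2 * k + j >= m
        connected = p >= m + 2 * q - min(q, j)
        if feasible and connected:
            total += (
                comb(k, j)
                * 2 ** (k - j)
                * comb(n - k, k - j)
                * comb(n - 2 * k + j, m)
            )
    return total - 1
```

For example, `degree_formula(3, 2, 1, 0)` and `degree_brute_force(3, 2, 1, 0)` both give 5: the 3-cube has six 2-faces and $n + q \le p + k$ holds, so the graph is complete.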

Blog

The basic operation of bah is like you would imagine. It crawls a project directory and either copies files over or, if they’re Markdown files, converts them to HTML using the filters discussed above. Different parts of the site can have different Mustache templates holding the Markdown-converted content.
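The crawl-and-convert loop described above can be sketched in a few lines of Python. This is an illustration under assumptions, not bah's code: the `.md` suffix test, the function name `build` and the injected `convert` callable are all hypothetical, and templating is left out. In a real setup `convert` would run pandoc with the filters.

```python
import shutil
from pathlib import Path


def build(project, target, convert):
    """Mirror `project` into `target`, converting *.md files to *.html.

    `convert` takes Markdown text and returns HTML. `target` is expected
    to lie outside `project`. Returns the written paths relative to
    `target`.
    """
    project, target = Path(project), Path(target)
    written = []
    for source in sorted(project.rglob("*")):
        if not source.is_file():
            continue
        destination = target / source.relative_to(project)
        destination.parent.mkdir(parents=True, exist_ok=True)
        if source.suffix == ".md":
            destination = destination.with_suffix(".html")
            text = source.read_text(encoding="utf-8")
            destination.write_text(convert(text), encoding="utf-8")
        else:
            shutil.copy2(source, destination)  # copy everything else verbatim
        written.append(destination.relative_to(target).as_posix())
    return written
```

Passing the converter in as a function keeps the directory walk independent of pandoc, so it can be tried with a trivial stand-in such as `lambda text: "<p>" + text + "</p>"`.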

That holds for the static part of the site anyway⁵. There is also a blog part, on which you are right now. The blog is a bit more dynamic in that posts are scattered in a blog subdirectory outside the static part of the site and are rendered into files whose path depends on the month, year and post title found in the header of the source file. They are also categorized into tags. The /blog URL and its descendants provide lists of posts which fall into their buckets. There is a global RSS feed /blog/feed for the blog, as well as for every tag, e.g. you will find this article in the perl feed.
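How a post's header could determine its output path and its tag buckets is sketched below in Python. The concrete layout `blog/<year>/<month>/<slug>.html`, the header keys `title`, `date` and `tags`, and the slug rules are all assumptions for illustration; the text above only says that the path depends on month, year and title.

```python
import re
import unicodedata
from collections import defaultdict
from datetime import date


def slugify(title):
    """Reduce a title to lowercase ASCII words joined by hyphens."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_title = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")


def output_path(header):
    """Hypothetical layout: blog/<year>/<month>/<slug>.html"""
    day = header["date"]
    return "blog/{:04d}/{:02d}/{}.html".format(
        day.year, day.month, slugify(header["title"])
    )


def bucket_by_tag(headers):
    """Map each tag to the output paths of the posts carrying it."""
    buckets = defaultdict(list)
    for header in headers:
        for tag in header.get("tags", []):
            buckets[tag].append(output_path(header))
    return dict(buckets)
```

With a made-up header `{"title": "Hello, Wörld!", "date": date(2020, 1, 5), "tags": ["perl"]}`, `output_path` yields `blog/2020/01/hello-world.html`, and `bucket_by_tag` lists that path under `perl`. Each bucket is the kind of list a tag page or a per-tag feed would be rendered from.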

Tooling

Yes, bah even has “tooling”. The CSS file for the syntax highlighter is generated by a little script vim2css from the peachpuff color theme that comes with vim. I had to tweak it by hand a little, of course, but it’s better than starting from scratch. (You may notice that picking colors that go well together isn’t my forte.)

In the beginning, I mentioned that I wanted the live website to be generated out of a version control system. bah itself is completely agnostic of what the project directory is. The wrapper bah-git can be installed as a post-receive hook into a git repository holding the site. It maintains a checked-out version of the repository and calls bah whenever new commits come in. It also handles locking of the project, moving and chowning the build directory for the webserver and error recovery because bah itself merrily ignores these aspects.

For local testing, I wrote bah-watch which uses inotify on Linux to watch the site’s project directory for changes. On every change, it updates the build directory in a temporary location. The script has an embedded HTTP server which then serves the build directory. It would have been very annoying to write such a long post without this tool.
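The watch-and-rebuild idea can be sketched in portable Python. Note the swapped-in technique: instead of inotify, which is not in Python's standard library, this sketch polls the directory and compares snapshots of modification time and size. The names `snapshot`, `changed` and `watch` are hypothetical, and the embedded HTTP server is left out.

```python
import os
import time


def snapshot(root):
    """Map every file under `root` to its (mtime_ns, size)."""
    state = {}
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(directory, name)
            info = os.stat(path)
            state[path] = (info.st_mtime_ns, info.st_size)
    return state


def changed(before, after):
    """Paths that were added, removed or modified between two snapshots."""
    return sorted(
        path
        for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    )


def watch(root, rebuild, interval=1.0, rounds=None):
    """Call `rebuild()` whenever the tree under `root` changes.

    Polls every `interval` seconds; `rounds=None` means run forever.
    """
    previous = snapshot(root)
    done = 0
    while rounds is None or done < rounds:
        time.sleep(interval)
        current = snapshot(root)
        if changed(previous, current):
            rebuild()
        previous = current
        done += 1
```

Polling is cruder than inotify but works on any platform. For serving the result, Python's built-in `python -m http.server --directory <build dir>` would do for local testing.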

The end

In summary, this site relies on pandoc, vim, $\KaTeX$ and git, all champions of their respective discipline, glued together by perl, the champion of gluing things together. And, well, the webserver is nginx; there’d be nothing here without it, either.

And with that we’re at the end. I haven’t released bah yet and I’m not sure whether the world needs yet another static site generator. On the other hand, I’m quite proud of it, in that it does all the things I wanted it to do, and in my opinion it does them The Right Way, and thus better than all the alternatives. (If only releasing software weren’t such a pain…)

If you have any comments or inquiries, please direct them to me via email at post@$this-domain.de; my PGP fingerprint can be found on the Home page.

¹ By which I mean neither “best-effort” rendering to Unicode nor rendering to static images: not only do those not look nice, they also don’t scale with the rest of the text.
² Like footnotes or citations!
³ Update 1 Mar 2019: As I’ve learned in the meantime, pygments could do it all along but didn’t list Perl 6 on their language list.
⁴ At least for inline maths.
⁵ “Static” refers to their location here; all content is of course a priori static.