Rewriting my website in Perl

My website generator used to be a somewhat hacked-together shell script with some parts in AWK. While working on my first article about Tide^[a], I finally hit the limits of AWK trying to implement some simple syntax highlighting. Instead of complicating the script, I decided to rewrite it in a more appropriate language.

[a] Making another programming language

So I decided to learn Raku. Unfortunately, it wasn’t packaged for my distro, so I made do with Perl instead, which was conveniently already installed as a dependency of git and cloc.

I had read a book on Perl a really long time ago, but never really used it. So I had to learn by searching for ways of doing stuff. Unfortunately, PerlMonks has been offline for a number of years now, so I had to use Stack Overflow as a substitute. Fortunately, the Perl documentation is amazing. It clearly lists some ill-advised features as “deprecated” and lists the replacements.

Modern Perl is fine. A lot of the weirdness I remember reading about 10 years ago has been replaced by nicer features. For example, you used to have to do this on every function definition:

sub example_subroutine {
	my ($arg1, $arg2) = @_;
	...
}

Now, with signatures enabled (use v5.36 turns them on), you can just:

sub example_subroutine($arg1, $arg2) {
	...
}
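Here is a minimal sketch of what signatures give you beyond the shorter syntax, assuming Perl 5.36 or newer. The heading helper is made up for illustration and is not taken from my script. It shows a default value for a parameter and the argument count check that signatures perform at run time.

```perl
use v5.36;  # enables strict, warnings, say and subroutine signatures

# Hypothetical helper: $level falls back to 1 when the caller omits it.
sub heading($text, $level = 1) {
    return "<h$level>$text</h$level>";
}

say heading("Hello");        # prints <h1>Hello</h1>
say heading("Details", 2);   # prints <h2>Details</h2>

# Calling with too few arguments dies at run time instead of
# silently leaving $text undefined, as the old @_ unpacking would.
my $ok = eval { heading(); 1 };
say "heading() died: $@" unless $ok;
```

On older Perls the same syntax should be available through use feature 'signatures', where it was still marked experimental.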

The syntax is still completely insane, but it’s really good at what it was designed to do: replace hacky combinations of AWK, shell and sed. Which is what I had before.

I started out by reimplementing the old logic, and using diff to make sure the output was exactly the same as the old script. This actually found some bugs in the old script, some of which I fixed to make the diffs more useful. After two or three rewrites, I was happy enough with the new script to replace the original entirely.
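I did this comparison with plain diff, but the same check can be sketched in Perl using only core modules. This is not what I actually ran: the differing_files name and the directory layout are assumptions for illustration, and it only walks the old tree, so files that exist only in the new output are not reported.

```perl
use v5.36;
use File::Compare qw(compare);
use File::Find qw(find);

# Returns the relative paths under $old_dir whose counterpart under $new_dir
# is missing or has different contents. Both directory names are expected
# without a trailing slash.
sub differing_files($old_dir, $new_dir) {
    my @diffs;
    find({
        no_chdir => 1,   # keep $File::Find::name usable as a path
        wanted   => sub {
            my $path = $File::Find::name;
            return unless -f $path;
            # Strip "$old_dir/" to get the path relative to the tree root
            my $rel   = substr($path, length($old_dir) + 1);
            my $other = "$new_dir/$rel";
            # compare() returns 0 only when both files are identical
            push @diffs, $rel
                unless -f $other && compare($path, $other) == 0;
        },
    }, $old_dir);
    my @sorted = sort @diffs;
    return @sorted;
}
```

An empty list from differing_files("out-old", "out-new") would then mean that the rewrite reproduces every file of the old output byte for byte.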

The new script is simpler than the old one: first, because it’s implemented in just one language rather than two; second, because the crazy state machine that I had to use in AWK was replaced by normal control flow in Perl, since you can just read new lines at any point in the code, unlike AWK. It’s also 10x faster.
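To illustrate the control flow point, here is a rough sketch rather than an excerpt from gmiweb.pl. It converts a toy markup where a line of three backticks opens and closes a preformatted block. The convert and html_esc functions and the markup rules are assumptions made up for this example. The inner loop reads lines from the same filehandle, so no flag has to be carried from one input line to the next.

```perl
use v5.36;

# Hypothetical minimal escaper; the real script's helper is not shown here.
sub html_esc($text) {
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;');
    return $text =~ s/([&<>])/$entity{$1}/gr;
}

# Lines starting with three backticks fence a preformatted block;
# every other non-empty line becomes a paragraph.
sub convert($fh) {
    my $out = '';
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^`{3}/) {
            # Consume the whole block right here, reading further lines
            # from the same handle until the closing fence shows up.
            $out .= "<pre>\n";
            while (my $code = <$fh>) {
                last if $code =~ /^`{3}/;
                $out .= html_esc($code);   # $code still ends in a newline
            }
            $out .= "</pre>\n";
        }
        elsif (length $line) {
            $out .= '<p>' . html_esc($line) . "</p>\n";
        }
    }
    return $out;
}
```

The function takes any filehandle, so it can be fed STDIN or a handle opened on a string with open my $fh, '<', \$text.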

Then, finally, I could implement syntax highlighting. Look at this beautiful monstrosity:

$input =~ s`
	(?<comment>//.*$|/\*(.|\n)*?\*/)
	| (?<operator>[(){}\[\],.:;|]
	  | [!=<>+\-*/%]=?
	  | \b(and|or|is|in)\b)
	| \b(?<keyword>break|else|fn|for|if|import
	    | label|let|mut|pub|record|return|switch|while)\b
	| \b(?<literal>\b(false|true|\d+))\b
	| (?<string>"([^\\"]|\\.)*")
	| \b(?<type>bool|int|string|void
	    | ([a-z_]\w+\s*:)? [A-Z]\w*)\b
	| \b(?<function>([a-z_]\w+\s*:)? [a-z]\w*)(?=\s*\()
	| \b(?<variable>([a-z_]\w+\s*:)? [a-z]\w*)(?!\s*\()
`
	defined $+{comment} ? span("comment", $+{comment}) :
	defined $+{operator} ? span("operator", $+{operator}) :
	defined $+{keyword} ? span("keyword", $+{keyword}) :
	defined $+{literal} ? span("literal", $+{literal}) :
	defined $+{string} ? span("string", $+{string}) :
	defined $+{type} ? span("type", $+{type}) :
	defined $+{function} ? span("function", $+{function}) :
	defined $+{variable} ? html_esc($+{variable}) :
	die("missing replacement for $&")
`mnxge;

This is a single logical line.

Well, other than the absolutely insane syntax, this is actually a really elegant way to do token-based syntax highlighting. The regex matches tokens one at a time, and depending on the specific kind of token it is, wraps it in a span with the appropriate class, HTML-escaping the text along the way.
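A much smaller sketch of the same technique may make the moving parts easier to see. The token classes here are made up for a toy language with # comments, and the span and html_esc helpers are guesses at what such functions could look like, not the ones from my script. Each alternative is a named group, the /e modifier evaluates the replacement as code, and %+ tells the replacement which group matched.

```perl
use v5.36;

# Hypothetical helpers; the real versions are not shown in this article.
sub html_esc($text) {
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    return $text =~ s/([&<>"])/$entity{$1}/gr;
}

sub span($class, $text) {
    return qq(<span class="$class">) . html_esc($text) . '</span>';
}

# Matches one token at a time. Earlier alternatives win at a given position,
# so a number inside a comment or a string is never highlighted separately.
sub highlight($input) {
    return $input =~ s{
          (?<comment> \#.* )
        | (?<string>  "(?:[^\\"]|\\.)*" )
        | (?<number>  \b\d+\b )
        | (?<other>   [&<>] )          # stray characters that need escaping
    }{
        defined $+{comment} ? span("comment", $+{comment}) :
        defined $+{string}  ? span("string",  $+{string})  :
        defined $+{number}  ? span("literal", $+{number})  :
        html_esc($+{other})
    }xger;
}

say highlight('x = 42 # answer');
```

The /r modifier returns the modified copy instead of changing $input in place, which keeps highlight free of side effects. Text that no alternative matches passes through untouched.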

I actually never had much trouble reading regex. It’s the sort of thing you learn once and it’s fine. But I only knew about POSIX-style Extended RegExes. After looking at the Perl regex documentation, now I understand why people are scared of this stuff. There’s no way anyone can fit this into their head, like I can with POSIX EREs.

In any case, I quite like Perl. It is actually fun to work with, at least for small-to-medium scripts (gmiweb.pl is 517 LOC, which is medium by my count). I would never use it for anything I intended to publish, though.

Here’s the source code for your perusal.