Crimson: (Currently) A Lexer Generator


This is a follow-up to another blog post about compiling regular expressions. (Holy heck, it's almost been one whole year!) The plan back then was to create a tool that (1) can generate both lexers and parsers and (2) I know inside out; to put it bluntly, it's basically an attempt to avoid learning other tools like flex & bison and ANTLR4. Now that the lexer generator part is complete (and a few simple tests have confirmed that it handles the simplest use cases), I'd like to talk about the choices I've made.

Where's the code?

Here.

The code is in Nim, simply because I use Nim now. I started using Nim a few months ago by rewriting a piece-table experiment originally written in Python; despite the fact that people mostly treat Nim as a joke (I blame the current dev zeitgeist and its utter incompetence at discerning what's good and what's bad), it's genuinely a nice no-bullshit language that gets things done (much like D). If you ever want to skip the newest, shiniest "tech stack" and just make the thing you want to make, Nim would be a good choice.

The way it works

Basically:

  • It only supports regular languages.
    • A lexer does a different job than a parser, and nearly all programming languages have a lexical syntax that can be expressed as a regular language.
  • It parses the input file into a list of pairs: a name for the kind of token and its corresponding regular expression. For example, one may use the following for a language with two kinds of tokens, one for identifiers and one for integers:

    ID = /[a-zA-Z_][0-9a-zA-Z_]*/
    INT = /[1-9][0-9]*/
    
  • The regular expressions are compiled in the same manner as described in that blog post, and the compiled results are combined as if they had been one big regular expression all along.
    • Is there any alternative? My answer would be yes; one could instead convert the regular expressions into a directed-graph representation of a nondeterministic finite automaton (NFA).
  • The combined regular-expression program is written to the output along with a "template" that includes the virtual machine to run it. Currently the output is a Nim module that depends only on a few standard libraries and can be imported directly, just like any other Nim module. (example)
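To make the pipeline above concrete, here is a rough sketch of the same idea in Python, using the `re` module's named groups as a stand-in for Crimson's compiled regex program and VM. The spec syntax is the one shown above; everything else (function names, whitespace handling) is made up for illustration:

```python
import re

def parse_spec(text):
    """Parse lines like `ID = /[a-z]+/` into (name, pattern) pairs."""
    pairs = []
    for line in text.strip().splitlines():
        name, _, pattern = line.partition(" = ")
        pairs.append((name, pattern.strip().strip("/")))
    return pairs

def build_lexer(pairs):
    """Combine per-token regexes into one big alternation, as if they
    had been one regular expression all along."""
    combined = "|".join(f"(?P<{name}>{pattern})" for name, pattern in pairs)
    return re.compile(combined)

def lex(lexer, source):
    """Scan by matching at the cursor; with `re`, an earlier rule wins
    over a later one, which mimics typical lexer rule priority."""
    pos, tokens = 0, []
    while pos < len(source):
        if source[pos].isspace():          # skip whitespace between tokens
            pos += 1
            continue
        m = lexer.match(source, pos)
        if m is None:
            raise ValueError(f"lex error at offset {pos}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

spec = """\
ID = /[a-zA-Z_][0-9a-zA-Z_]*/
INT = /[1-9][0-9]*/"""
lexer = build_lexer(parse_spec(spec))
print(lex(lexer, "foo 42 bar9"))
# → [('ID', 'foo'), ('INT', '42'), ('ID', 'bar9')]
```

The real generator emits bytecode for its own VM instead of delegating to a regex library, but the shape of the front end is the same.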
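The NFA alternative can be sketched with a Thompson-style construction: each token's regex becomes a small graph, and a shared start state joins all of them with epsilon edges. The sketch below builds NFAs only for literal strings to keep it short; the state and edge representation is purely illustrative, not Crimson's:

```python
EPS = None  # label used for epsilon (empty-string) edges

class State:
    def __init__(self, accept=None):
        self.accept = accept          # token name if this is an accept state
        self.edges = []               # list of (label, target) pairs

def literal_nfa(text, token_name):
    """Build a linear NFA that accepts exactly `text`."""
    start = State()
    cur = start
    for ch in text:
        nxt = State()
        cur.edges.append((ch, nxt))
        cur = nxt
    cur.accept = token_name
    return start

def combine(nfas):
    """One shared start state with epsilon edges into every sub-NFA."""
    start = State()
    for nfa in nfas:
        start.edges.append((EPS, nfa))
    return start

def eps_closure(states):
    """All states reachable from `states` via epsilon edges alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for label, t in s.edges:
            if label is EPS and t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(start, text):
    """Simulate the NFA; return the matched token name, or None."""
    current = eps_closure({start})
    for ch in text:
        nxt = {t for s in current for label, t in s.edges if label == ch}
        current = eps_closure(nxt)
    for s in current:
        if s.accept:
            return s.accept
    return None

nfa = combine([literal_nfa("if", "IF"), literal_nfa("in", "IN")])
print(accepts(nfa, "if"))   # → IF
```

A real lexer would run this incrementally over the input (tracking the last accept state for maximal munch) rather than testing whole strings, but the graph structure is the core of the approach.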

What I'm not that happy with

  • How it only generates Nim code at the moment. I'm planning to add more backends (e.g. C and Scheme) in the future.
  • How the whole Unicode situation is handled. Currently everything (both the source input and the input to the generated code) is assumed to be UTF-8 and is converted to runes as needed with the std/unicode module. Since in Nim the elements of the string type are of type char, which is close to C's char, rather than Rune (which I assume is UTF-32), conversion between the two is needed, and I ended up making the generator emit quite messy code for it. Since that code runs every time the lexer runs, I'm slightly worried it might hurt overall performance.
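To illustrate the char-vs-rune mismatch, this is roughly the per-character work the generated code ends up doing when turning UTF-8 bytes into code points (a hand-rolled decoder for well-formed input, written here in Python; it is not Crimson's actual generated code):

```python
def decode_utf8(data: bytes):
    """Yield Unicode code points from well-formed UTF-8 bytes."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                      # 1-byte sequence (ASCII)
            cp, n = b, 1
        elif b < 0xE0:                    # 2-byte sequence (110xxxxx)
            cp, n = b & 0x1F, 2
        elif b < 0xF0:                    # 3-byte sequence (1110xxxx)
            cp, n = b & 0x0F, 3
        else:                             # 4-byte sequence (11110xxx)
            cp, n = b & 0x07, 4
        for cont in data[i + 1:i + n]:    # fold in continuation bytes (10xxxxxx)
            cp = (cp << 6) | (cont & 0x3F)
        yield cp
        i += n

print([hex(cp) for cp in decode_utf8("aß€".encode("utf-8"))])
# → ['0x61', '0xdf', '0x20ac']
```

The branching and bit-twiddling per input position is what makes me nervous about runtime cost, although for ASCII-heavy input the one-byte fast path dominates.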

Back

Last update: 2024.2.6