Parsing a simple language - Part 1

2022-11-10 by Christian Kjær Larsen

This is the first blog post in the compiler series. We will go through the steps of writing a recursive descent parser using parser combinators. The full source code for this section can be found here.

We will be using the excellent cats-parse parser combinator library.

This blog post will start by parsing simple constants. Then we will add handling of whitespace. Then we will incrementally add more constructs to the language. We will stop just before adding binary operators. The next part will start the implementation of the code generator.

Some knowledge of Scala is assumed in this part.

Crash course in parser combinators

A parser can be though of as a function that given an input string either returns a result and the unparsed part of the string or an error:

type Parser[Res] = String => Either[Error, (Res, String)]

For instance a parser that just reads the first character from the input is

val anyChar: Parser[Char] = { str =>
  if str.isEmpty
  then Error("Expected char")
  else (str.head, str.tail)
}

Combinators that combine these parsers can be created, and that is where the name “Parser combinator” comes from. For more background you can read the great original paper by Graham Hutton.

In cats-parse we have two types at our disposal: A parser of type Parser[T] consumes one or more characters before succeeding with a result of type T. A parser of type Parser0[T] consumes zero or more characters before succeeding with a result of type T.

Constructing parsers

We can construct parsers that consume input by using some of the basic building blocks.

Parser.anyChar: Parser[Char] // Read any character and succeed with the read character
Parser.string(s: String): Parser[Unit] // Read the string `s` and succeed with ()
Parser.char(c: Char): Parser[Unit] // Read the char `c` and succeed with ()
// And more

We can also create a parser that does not consume input, but just immediately succeeds with a value.

Parser.pure[T](a: T): Parser0[T]

We can run a parser p on an input string s with p.parseAll(s). Let’s try out some of the parsers

scala> Parser.string("foo").parseAll("foo")
val res0: Either[cats.parse.Parser.Error, Unit] = Right(())

scala> Parser.pure(1337).parseAll("")
val res1: Either[cats.parse.Parser.Error, Int] = Right(1337)

Combining parsers

To make more complicated parsers we combine the simple parsers with some of the combinators.

p1 *> p2 will first parse p1 and then parse p2 but discard the result from p1. p1 <* p2 will combine parse p1 and then p2 but discard the result from p2. p1 ~ p2 will also sequence the parsers but return both results in a tuple.

To pick from two parsers we have the | combinator. p1 | p2 first tries p1 and if that fails after trying the first character the result of p2 is returned otherwise the result of p1 is returned. If we need to try more than one character from p1 before determining which branch we need, then we use backtracking. In this case we would write p1.backtrack | p2.

Let’s try it out

scala> (Parser.anyChar ~ Parser.anyChar).parseAll("12")
val res0: Either[cats.parse.Parser.Error, (Char, Char)] = Right((1,2))

scala> (Parser.anyChar *> Parser.anyChar).parseAll("12")
val res1: Either[cats.parse.Parser.Error, Char] = Right(2)

If we need to repeat parsers then we can use p.rep0 and p.rep for zero or more or one or more repetitions of p. Both of them will return a list of results. So if p: Parser[T] then p.rep: Parser[List[T]]. We can write p.? for zero or one repetition of p.

Let’s try it out.

scala> Parser.anyChar.rep0.parseAll("1234")
val res0: Either[cats.parse.Parser.Error, List[Char]] = Right(List(1, 2, 3, 4))

scala> Parser.anyChar.?.parseAll("")
val res1: Either[cats.parse.Parser.Error, Option[Char]] = Right(None)

Transforming parsers

With the combinators we cannot really do any interesting things. We are limited to parsers that return characters and unit. If we want to make a parser that parses boolean we can use map to change the result inside the parser.

val trueParser = Parser.string("true").map(_ => true)
val falseParser = Parser.string("false").map(_ => false)
val boolParser: Parser[Boolean] = trueParser | falseParser

Let’s try it out

scala> boolParser.parseAll("true")
val res0: Either[cats.parse.Parser.Error, Boolean] = Right(true)

scala> boolParser.parseAll("false")
val res1: Either[cats.parse.Parser.Error, Boolean] = Right(false)

scala> boolParser.parseAll("foo")
val res2: Either[cats.parse.Parser.Error, Boolean] = Left(Error(...))

There is also p.as(v) as a short hand for p.map(_ => v) so that we can just write Parser.string("true").as(true) which is a bit more readable.

We now have enough building blocks to write a parser for a small part of our language.

Parsing constants

For the first iteration of our language we will only support primitives like integers, characters, booleans and the unit value. We can write the abstract syntax definition using a Scala 3 enum like:

enum Const {
  case Int(n: Long)
  case Ch(c: Char)
  case Bool(b: Boolean)
  case Unit
}

Examples of integers are 123, -42 and so on. Character literals are like 'a', '0' and so on. Booleans are true and false and unit is represented by ().

To parse constants we will implement a parser:

import cats.parse.Parser
val parseConst: Parser[Const] = ???

A parser that parses Unit and Bool could be

val parseConst: Parser[Const] =
  Parser.string("()").as(Const.Unit)
  | Parser.string("true").as(Const.Bool(true))
  | Parser.string("false").as(Const.Bool(true))

The parser library comes with a builtin parser for signed integers, which is what we need. We can simply use it to extend the constant handling to numbers.

import cats.parse.Numbers
val parseConst: Parser[Const] =
  ...
  | Numbers.signedIntString.map(x => Const.Int(x.toInt)))

Let’s try it out.

scala> parseConst.parseAll("true")
val res0: Either[cats.parse.Parser.Error, syntax.Const] = Right(Bool(true))

scala> parseConst.parseAll("-1234")
val res1: Either[cats.parse.Parser.Error, syntax.Const] = Right(Int(-1234))

scala> parseConst.parseAll("()")
val res2: Either[cats.parse.Parser.Error, syntax.Const] = Right(Unit)

scala> parseConst.parseAll("(1234)")
val res3: Either[cats.parse.Parser.Error, syntax.Const] = Left(Error(...))

That seems to work as expected.

Adding if and let

We are interested in real programs, and they should at least include variable bindings and if-expressions. In our language they will look like

let
  x = true
  y = 40
in
  if x then 10 else y

We add these constructs to the abstract syntax. A let expression is a list of bindings of names to expressions that are bound in the body. If-expressions are as you would expect from Scala 3.

type Name = String
type LetBinding = (Name, Exp)
enum Exp {
  case Var(x: Name)
  case C(c: Const)
  case Let(binding: List[LetBinding], body: Exp)
  case If(test: Exp, conseq: Exp, alt: Exp)
}

To parse the recursive structure of the syntax tree we need some special support from cats-parse. The recursive function lazily defers the parsing of subexpressions to avoid infinite loops in the parser. This is due to the eager evaluation in Scala.

def recursive[A](fn: Parser[A] => Parser[A]): Parser[A] = ???

We start by just adding a parser for the if-expression first.

val parseExp: Parser[Exp] = Parser.recursive[Exp] { expr =>
  val parseIf: Parser[Exp] =
    ((Parser.string("if") *> expr)
    ~ (Parser.string("then") *> expr)
    ~ (Parser.string("else") *> expr)).map {
      case ((test, conseq), alt) => Exp.If(test, conseq, alt)
    }

  parseConst | parseIf
}

We will leave parsing let-expressions and variables as an exercise.

Let’s try out the if parser.

scala> parseExp.parseAll("if true then 1 else 2")
val res0: Either[cats.parse.Parser.Error, Exp] = Left(...)

scala> parseExp.parseAll("iftruethen1else2")
val res1: Either[cats.parse.Parser.Error, Exp] = Right(If(...))

Whoops. Something is not right. We have not at all considered whitespace, but the parser will work if we just not include any whitespace. We will fix that in the next section.

Lexing

In our language we will not have significant whitespace. This means that all whitespace is ignored. This is in contrast to languages like Python or Haskell where indentation matters.

This means that the handling of whitespace will be quite simple. We have a function, token, that will change a regular parser into one that parses non-significant whitespace after success. We could have made the parser skip whitespace on both sides, but it’s nice to know that if you want different handling of whitespace in some cases, then whatever you parse next will not start by skipping whitespace.

val whitespace = Parser.charIn(" \r\n\t")

def token[T](p: Parser[T]): Parser[T] = p <* whitespace.rep0

where p.rep0 parses zero or more occurences of p.

Let’s check it out.

scala> token(Parser.string("foo")).parseAll("foo   \n")
val res1: Either[cats.parse.Parser.Error, Unit] = Right(())

scala> token(Parser.string("foo")).parseAll("  foo")
val res2: Either[cats.parse.Parser.Error, Unit] = Left(Error(0, NonEmptyList(OneOfStr(0,List(foo)))))

Which is exactly what we want. If we then wrap, say, Parser.string("if") and the constant parsers in token, then our parser should handle whitespace nicely. We can verify that parseExp then works as expected.

Handling comments

We would also like to have line comments in our language. We will use the # charater to mark the rest of the line as a comment. We can implement this behavior directly in our whitespace parser. This means that if we see a # while looking for whitespace we skip all input until the next newline. This can be achieved through the Parser.until0 combinator. It parses everything until matches. We will use it to skip until Parser.char('\n') matches.

val comment: P[Unit] =
  Parser.char('#') *> Parser.until0(Parser.char('\n') | Parser.string("\r\n")).void

val whitespace: Parser[Unit] = Parser.charIn(" \t\r\n").void | comment

Let’s check it out.

scala> parseExp.parseAll("if true # check condition \n then 1 else 2")
val res0: Either[cats.parse.Parser.Error, Exp] = Right(If(...))

Which is exactly what we want.

Fixing strange behavior

A valid program is letx=10inx which might be fine, but not super nice looking. We can fix this be requiring some whitespace or a symbol after keywords or integer literals.

We define a parser specifically for keywords where we use Parser.peek to look at the input stream without consuming characters. We then use Parser.not to require that parser to succeed if we don’t have one of these keyword characters.

val notAfterKeyword = Parser.charIn('a' to 'z')
  | Parser.charIn('A' to 'Z')
  | Parser.charIn('0' to '9')

def keyword(w: String) =
  token(Parser.string(w) <* Parser.not(Parser.peek(notAfterKeyword)))

So whenever we have a keyword (like if and let) in our parser, then we replace it with this new keyword parser.

Let’s check it out.

scala> parseExp.parseAll("iftrue then 1 else 2")
val res0: Either[Error, Exp] = Left(Error(...))

scala> parseExp.parseAll("if true then 1 else 2")
val res1: Either[Error, Exp] = Right(If(...))

Which is exactly what we want. We leave it as an exercise to fix it for integers. This concludes parsing constants, if and let-expressions. Now we can parse some programs, but not any interesting ones. We can try if the example that we wanted to parse actually works now

scala> parseExp.parseAll("""let x = true
     |                          y = 40
     |                      in
     |                         if x then 10 else y""")
val res4: Either[cats.parse.Parser.Error, Exp] = Right(Let(List((x,C(Bool(true), ...

We will leave it here for this part. We will pick it up next time where we will generate code for the language that we have implemented so far.

As a small appendum we will go through how we can save source positions so that we can refer to them in error messages during the compiler stages.

Bonus points: Saving source locations

The cats-parse includes a Caret type. This type represents a position in the input string. You can think of it as a cursor in a text editor. The caret both tracks the character offset, the line number and the column number from the beginning of the input.

To track these positions we annotate the abstract syntax tree with some more “stuff”.

enum Exp[A] {
  val ann: A
  ...
  case BinOp(..., ann: A)
  ...
}

In the parser case, this extra “stuff” will be a data type tracking the start and end of an expression.

final case class SourceLocation(begin: Caret, end: Caret) {
  def |+|(that: SourceLocation): SourceLocation =
    SourceLocation(begin, that.end)
}

with |+| defined to join source locations together.

This will be essential to giver proper error messages in various stages of the compilation.

We now change the expression parser to return an expression annotated with source locations

type LExp = Exp[SourceLocation]
val parseExp: Parser[LExp] = ???

cats-parse keeps track of the source positions while parsing, and exposes it through a parser

Parser.caret: Parser0[Caret]

New variant of the token parser that will use Parser.caret to get the start and end of the token and return the parsed token along with its position.

def locToken[T](p: Parser[T]): Parser[(T, SourceLocation)] =
  (Parser.caret.with1 ~ p ~ Parser.caret).map {
    case ((begin, res), end) =>
      (res, SourceLocation(begin, end))
  } <* whitespace.rep0

Note that to combine a Parser0 and a Parser we need to use with1.

We implement the keyword parser using the locToken combinator.

def locKeyword(w: String): Parser[SourceLocation] =
  locToken(
    Parser.string(w) <* Parser.not(Parser.peek(notAfterKeyword))).map(_(1)
  )

We can now change the constant parser to also track the source positition of the constants.

val parseConst: P[(Const, SourceLocation)] =
    locKeyword("false").map(loc => (Const.Bool(false), loc))
  | ...
  | locToken(Numbers.signedIntString).map { case (num, loc) => (Const.Int(num.toInt), loc) }

For the let parser the story is a bit more complicated. We use locKeyword to get the positition of the let, and then we can use the position of the body to find the end location of the entire let-expression.

val parseLet =
  (locKeyword("let")
   ~ (token(identifier).backtrack ~ (token(Parser.char('=')) *> expr)).rep
   ~ (keyword("in") *> expr)
  ).map { case ((start, bindings), body) =>
    Exp.Let(bindings.toList, body, start |+| body.ann)
  }

We leave it as an exercise to do it for the if-expressions. We can try it out

scala> parseExp.parseAll("if true then 1 else 2")
val res0: Either[Error, LExp] = Right(
  If(
    C(Bool(true),SourceLocation(Caret(0,3,3),Caret(0,7,7))),
    C(Int(1),SourceLocation(Caret(0,13,13),Caret(0,14,14))),
    C(Int(2),SourceLocation(Caret(0,20,20),Caret(0,21,21))),
    SourceLocation(Caret(0,0,0),Caret(0,21,21)))
  )

So true occurs from 3 to 7. 1 from 13 to 14, and so on. We can also easily check that the whitespace is not included.