Parsing a simple language - Part 1
This is the first blog post in the compiler series. We will go through the steps of writing a recursive descent parser using parser combinators. The full source code for this section can be found here.
We will be using the excellent cats-parse
parser combinator library.
This blog post will start by parsing simple constants. Then we will add handling of whitespace. Then we will incrementally add more constructs to the language. We will stop just before adding binary operators. The next part will start the implementation of the code generator.
Some knowledge of Scala is assumed in this part.
Crash course in parser combinators
A parser can be though of as a function that given an input string either returns a result and the unparsed part of the string or an error:
type Parser[Res] = String => Either[Error, (Res, String)]
For instance a parser that just reads the first character from the input is
val anyChar: Parser[Char] = { str =>
if str.isEmpty
Error("Expected char")
then else (str.head, str.tail)
}
Combinators that combine these parsers can be created, and that is where the name “Parser combinator” comes from. For more background you can read the great original paper by Graham Hutton.
In cats-parse
we have two types at our disposal: A parser of type Parser[T]
consumes one or more characters before succeeding with a result of type T
. A parser of type Parser0[T]
consumes zero or more characters before succeeding with a result of type T
.
Constructing parsers
We can construct parsers that consume input by using some of the basic building blocks.
Parser.anyChar: Parser[Char] // Read any character and succeed with the read character
Parser.string(s: String): Parser[Unit] // Read the string `s` and succeed with ()
Parser.char(c: Char): Parser[Unit] // Read the char `c` and succeed with ()
// And more
We can also create a parser that does not consume input, but just immediately succeeds with a value.
Parser.pure[T](a: T): Parser0[T]
We can run a parser p
on an input string s
with p.parseAll(s)
. Let’s try out some of the parsers
> Parser.string("foo").parseAll("foo")
scalaval res0: Either[cats.parse.Parser.Error, Unit] = Right(())
> Parser.pure(1337).parseAll("")
scalaval res1: Either[cats.parse.Parser.Error, Int] = Right(1337)
Combining parsers
To make more complicated parsers we combine the simple parsers with some of the combinators.
p1 *> p2
will first parse p1
and then parse p2
but discard the result from p1
. p1 <* p2
will combine parse p1
and then p2
but discard the result from p2
. p1 ~ p2
will also sequence the parsers but return both results in a tuple.
To pick from two parsers we have the |
combinator. p1 | p2
first tries p1
and if that fails after trying the first character the result of p2
is returned otherwise the result of p1
is returned. If we need to try more than one character from p1
before determining which branch we need, then we use backtracking. In this case we would write p1.backtrack | p2
.
Let’s try it out
> (Parser.anyChar ~ Parser.anyChar).parseAll("12")
scalaval res0: Either[cats.parse.Parser.Error, (Char, Char)] = Right((1,2))
> (Parser.anyChar *> Parser.anyChar).parseAll("12")
scalaval res1: Either[cats.parse.Parser.Error, Char] = Right(2)
If we need to repeat parsers then we can use p.rep0
and p.rep
for zero or more or one or more repetitions of p
. Both of them will return a list of results. So if p: Parser[T]
then p.rep: Parser[List[T]]
. We can write p.?
for zero or one repetition of p
.
Let’s try it out.
> Parser.anyChar.rep0.parseAll("1234")
scalaval res0: Either[cats.parse.Parser.Error, List[Char]] = Right(List(1, 2, 3, 4))
> Parser.anyChar.?.parseAll("")
scalaval res1: Either[cats.parse.Parser.Error, Option[Char]] = Right(None)
Transforming parsers
With the combinators we cannot really do any interesting things. We are limited to parsers that return characters and unit. If we want to make a parser that parses boolean we can use map
to change the result inside the parser.
val trueParser = Parser.string("true").map(_ => true)
val falseParser = Parser.string("false").map(_ => false)
val boolParser: Parser[Boolean] = trueParser | falseParser
Let’s try it out
> boolParser.parseAll("true")
scalaval res0: Either[cats.parse.Parser.Error, Boolean] = Right(true)
> boolParser.parseAll("false")
scalaval res1: Either[cats.parse.Parser.Error, Boolean] = Right(false)
> boolParser.parseAll("foo")
scalaval res2: Either[cats.parse.Parser.Error, Boolean] = Left(Error(...))
There is also p.as(v)
as a short hand for p.map(_ => v)
so that we can just write Parser.string("true").as(true)
which is a bit more readable.
We now have enough building blocks to write a parser for a small part of our language.
Parsing constants
For the first iteration of our language we will only support primitives like integers, characters, booleans and the unit value. We can write the abstract syntax definition using a Scala 3 enum like:
{
enum Const case Int(n: Long)
case Ch(c: Char)
case Bool(b: Boolean)
case Unit
}
Examples of integers are 123
, -42
and so on. Character literals are like 'a'
, '0'
and so on. Booleans are true
and false
and unit is represented by ()
.
To parse constants we will implement a parser:
import cats.parse.Parser
val parseConst: Parser[Const] = ???
A parser that parses Unit
and Bool
could be
val parseConst: Parser[Const] =
Parser.string("()").as(Const.Unit)
| Parser.string("true").as(Const.Bool(true))
| Parser.string("false").as(Const.Bool(true))
The parser library comes with a builtin parser for signed integers, which is what we need. We can simply use it to extend the constant handling to numbers.
import cats.parse.Numbers
val parseConst: Parser[Const] =
...
| Numbers.signedIntString.map(x => Const.Int(x.toInt)))
Let’s try it out.
> parseConst.parseAll("true")
scalaval res0: Either[cats.parse.Parser.Error, syntax.Const] = Right(Bool(true))
> parseConst.parseAll("-1234")
scalaval res1: Either[cats.parse.Parser.Error, syntax.Const] = Right(Int(-1234))
> parseConst.parseAll("()")
scalaval res2: Either[cats.parse.Parser.Error, syntax.Const] = Right(Unit)
> parseConst.parseAll("(1234)")
scalaval res3: Either[cats.parse.Parser.Error, syntax.Const] = Left(Error(...))
That seems to work as expected.
Adding if
and let
We are interested in real programs, and they should at least include variable bindings and if-expressions. In our language they will look like
let
true
x = 40
y = in
if x then 10 else y
We add these constructs to the abstract syntax. A let expression is a list of bindings of names to expressions that are bound in the body. If-expressions are as you would expect from Scala 3.
type Name = String
type LetBinding = (Name, Exp)
{
enum Exp case Var(x: Name)
case C(c: Const)
case Let(binding: List[LetBinding], body: Exp)
case If(test: Exp, conseq: Exp, alt: Exp)
}
To parse the recursive structure of the syntax tree we need some special support from cats-parse
. The recursive
function lazily defers the parsing of subexpressions to avoid infinite loops in the parser. This is due to the eager evaluation in Scala.
def recursive[A](fn: Parser[A] => Parser[A]): Parser[A] = ???
We start by just adding a parser for the if
-expression first.
val parseExp: Parser[Exp] = Parser.recursive[Exp] { expr =>
val parseIf: Parser[Exp] =
((Parser.string("if") *> expr)
~ (Parser.string("then") *> expr)
~ (Parser.string("else") *> expr)).map {
case ((test, conseq), alt) => Exp.If(test, conseq, alt)
}
| parseIf
parseConst }
We will leave parsing let
-expressions and variables as an exercise.
Let’s try out the if
parser.
> parseExp.parseAll("if true then 1 else 2")
scalaval res0: Either[cats.parse.Parser.Error, Exp] = Left(...)
> parseExp.parseAll("iftruethen1else2")
scalaval res1: Either[cats.parse.Parser.Error, Exp] = Right(If(...))
Whoops. Something is not right. We have not at all considered whitespace, but the parser will work if we just not include any whitespace. We will fix that in the next section.
Lexing
In our language we will not have significant whitespace. This means that all whitespace is ignored. This is in contrast to languages like Python or Haskell where indentation matters.
This means that the handling of whitespace will be quite simple. We have a function, token
, that will change a regular parser into one that parses non-significant whitespace after success. We could have made the parser skip whitespace on both sides, but it’s nice to know that if you want different handling of whitespace in some cases, then whatever you parse next will not start by skipping whitespace.
val whitespace = Parser.charIn(" \r\n\t")
def token[T](p: Parser[T]): Parser[T] = p <* whitespace.rep0
where p.rep0
parses zero or more occurences of p
.
Let’s check it out.
> token(Parser.string("foo")).parseAll("foo \n")
scalaval res1: Either[cats.parse.Parser.Error, Unit] = Right(())
> token(Parser.string("foo")).parseAll(" foo")
scalaval res2: Either[cats.parse.Parser.Error, Unit] = Left(Error(0, NonEmptyList(OneOfStr(0,List(foo)))))
Which is exactly what we want. If we then wrap, say, Parser.string("if")
and the constant parsers in token
, then our parser should handle whitespace nicely. We can verify that parseExp
then works as expected.
Handling comments
We would also like to have line comments in our language. We will use the #
charater to mark the rest of the line as a comment. We can implement this behavior directly in our whitespace
parser. This means that if we see a #
while looking for whitespace we
skip all input until the next newline. This can be achieved through the Parser.until0
combinator. It parses everything
until matches. We will use it to skip until Parser.char('\n')
matches.
val comment: P[Unit] =
Parser.char('#') *> Parser.until0(Parser.char('\n') | Parser.string("\r\n")).void
val whitespace: Parser[Unit] = Parser.charIn(" \t\r\n").void | comment
Let’s check it out.
> parseExp.parseAll("if true # check condition \n then 1 else 2")
scalaval res0: Either[cats.parse.Parser.Error, Exp] = Right(If(...))
Which is exactly what we want.
Fixing strange behavior
A valid program is letx=10inx
which might be fine, but not super nice looking. We can fix this be requiring some whitespace or a symbol after keywords or integer literals.
We define a parser specifically for keywords where we use Parser.peek
to look at the input stream without consuming
characters. We then use Parser.not
to require that parser to succeed if we don’t have one of these keyword characters.
val notAfterKeyword = Parser.charIn('a' to 'z')
| Parser.charIn('A' to 'Z')
| Parser.charIn('0' to '9')
def keyword(w: String) =
token(Parser.string(w) <* Parser.not(Parser.peek(notAfterKeyword)))
So whenever we have a keyword (like if
and let
) in our parser, then we replace it with this new keyword parser.
Let’s check it out.
> parseExp.parseAll("iftrue then 1 else 2")
scalaval res0: Either[Error, Exp] = Left(Error(...))
> parseExp.parseAll("if true then 1 else 2")
scalaval res1: Either[Error, Exp] = Right(If(...))
Which is exactly what we want. We leave it as an exercise to fix it for integers. This concludes parsing constants, if
and let
-expressions. Now we can parse some programs, but not any interesting ones. We can try if the example that we wanted to parse actually works now
> parseExp.parseAll("""let x = true
scala | y = 40
| in
| if x then 10 else y""")
val res4: Either[cats.parse.Parser.Error, Exp] = Right(Let(List((x,C(Bool(true), ...
We will leave it here for this part. We will pick it up next time where we will generate code for the language that we have implemented so far.
As a small appendum we will go through how we can save source positions so that we can refer to them in error messages during the compiler stages.
Bonus points: Saving source locations
The cats-parse
includes a Caret
type. This type represents a position in the input string. You can think of it as a cursor in a text editor. The caret both tracks the character offset, the line number and the column number from the beginning of the input.
To track these positions we annotate the abstract syntax tree with some more “stuff”.
[A] {
enum Expval ann: A
...
case BinOp(..., ann: A)
...
}
In the parser case, this extra “stuff” will be a data type tracking the start and end of an expression.
final case class SourceLocation(begin: Caret, end: Caret) {
def |+|(that: SourceLocation): SourceLocation =
SourceLocation(begin, that.end)
}
with |+|
defined to join source locations together.
This will be essential to giver proper error messages in various stages of the compilation.
We now change the expression parser to return an expression annotated with source locations
type LExp = Exp[SourceLocation]
val parseExp: Parser[LExp] = ???
cats-parse
keeps track of the source positions while parsing, and exposes it through a parser
Parser.caret: Parser0[Caret]
New variant of the token parser that will use Parser.caret
to get the start and end of the token and return the parsed token along with its position.
def locToken[T](p: Parser[T]): Parser[(T, SourceLocation)] =
(Parser.caret.with1 ~ p ~ Parser.caret).map {
case ((begin, res), end) =>
(res, SourceLocation(begin, end))
} <* whitespace.rep0
Note that to combine a Parser0
and a Parser
we need to use with1
.
We implement the keyword
parser using the locToken
combinator.
def locKeyword(w: String): Parser[SourceLocation] =
locToken(
Parser.string(w) <* Parser.not(Parser.peek(notAfterKeyword))).map(_(1)
)
We can now change the constant parser to also track the source positition of the constants.
val parseConst: P[(Const, SourceLocation)] =
locKeyword("false").map(loc => (Const.Bool(false), loc))
| ...
| locToken(Numbers.signedIntString).map { case (num, loc) => (Const.Int(num.toInt), loc) }
For the let
parser the story is a bit more complicated. We use locKeyword
to get the positition of the let
, and then we can use the position of the body to find the end location of the entire let
-expression.
val parseLet =
(locKeyword("let")
~ (token(identifier).backtrack ~ (token(Parser.char('=')) *> expr)).rep
~ (keyword("in") *> expr)
).map { case ((start, bindings), body) =>
.Let(bindings.toList, body, start |+| body.ann)
Exp}
We leave it as an exercise to do it for the if
-expressions. We can try it out
> parseExp.parseAll("if true then 1 else 2")
scalaval res0: Either[Error, LExp] = Right(
If(
C(Bool(true),SourceLocation(Caret(0,3,3),Caret(0,7,7))),
C(Int(1),SourceLocation(Caret(0,13,13),Caret(0,14,14))),
C(Int(2),SourceLocation(Caret(0,20,20),Caret(0,21,21))),
SourceLocation(Caret(0,0,0),Caret(0,21,21)))
)
So true
occurs from 3 to 7. 1
from 13 to 14, and so on. We can also easily check that the whitespace is not included.