In my previous post, I explained an s-expression-like language that handles die rolls mixed with arithmetic. In this continued tutorial we’re going to create the parser in python. In the end we should have something that works like this:

>>> parse("(/ (d 2 (* 14 2)) 3)")
['/', ['d', 2.0, ['*', 14.0, 2.0]], 3.0]

Before we get started, let’s formally define our grammar. There are a handful of useful notation systems, however I’m going derive ANTLR’s parser format and use it, for no real reason.

expression : number | '(' operator expression* ')';
number : ('0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9')+;
operator : '+'|'-'|'*'|'/'|'d';

Each line defines a rule. The | character means “or”. Characters surrounded in single-quotes are literal characters, the + character means one-or-more and the * means zero-or-more. This is what the expression rule looks like visually:

Railroad diagram of the expression rule

To implement this grammar, the first step is to take the expression string and turn it into a list of “tokens”. For ease we’ll just do it all at once so we’ll need something like this:

>>> tokenize("(/ (d 2 (* 14 2)) 3)")
['(', '/', '(', 'd', '2', '(', '*', '14', '2', ')', ')', '3', ')']

A naïve approach might be to take each non-space character as a token, but it soon becomes obvious that it won’t work:

>>> list(c for c in "(+ 19 2)" if not c.isspace())
['(', '+', '1', '9', '2', ')']

The number 19 split into the number 1 and 9, which isn’t correct. There are many ways to do this correctly but I think regular expressions are easiest with python’s re.findall function. Given a regular expression and a string, it will find all non-overlapping matches to the pattern. What is our pattern anyway? Well, excluding whitespace we want consecutive digits or characters in the set ()+-*/d. We can’t just remove whitespace first without introducing ambiguities—(+ 4 62) becomes (+462), which changes the meaning. Therefore we’ll include whitespace as part of our regular expression pattern and then remove it when we’re done. r'(\d+|\s+|[-+*/d()])' will work in python where \d is a digit and \s is a whitespace character.

>>> pattern = r'(\d+|\s+|[-+*/d()])'
>>> [t for t in re.findall(pattern, '(/ (d 2 (* 14 2)) 3)') if not t.isspace()]
['(', '/', '(', 'd', '2', '(', '*', '14', '2', ')', ')', '3', ')']
>>> [t for t in re.findall(pattern, '(/ (d 2 X) 3)') if not t.isspace()]
['(', '/', '(', 'd', '2', ')', '3', ')']

Unfortunately for our purposes, re.findall ignores non-matching patterns as seen in the second example above, so in keeping with our easy approach we can just use re.match to guarantee that our input is entirely made of matching tokens:

>>> pattern = r'^(\d+|\s+|[-+*/d()])*$'
>>> True if re.match(pattern, '(/ (d 2 (* 14 2)) 3)') else False
True
>>> True if re.match(pattern, '(/ (d 2 X) 3)') else False
False

Our finished product will look like this:

def tokenize(expr):
    token_pattern = r'(\d+|\s+|[-+*/d()])'
    if re.match('^%s*$' % token_pattern, expr):
        match = re.findall(token_pattern, expr)
        return [x for x in match if x and not x.isspace()]

The next step will be to change the string into an abstract syntax tree. In the end we should have something that works like this:

>>> expression(['(', '+', '(', 'd', '2', '10', ')', '5', ')'])
['+', ['d', 2.0, 10.0], 5.0]

We want a context where we can pop the tokens off the list, and I think a class is proper here. Also, it lets us tokenize automatically:

 1 class Parser:
 2     def __init__(self, expr):
 3         token_pattern = r'(\d+|\s+|[-+*/d()])'
 4         if re.match('^%s*$' % token_pattern, expr):
 5             match = re.findall(token_pattern, expr)
 6             self.tokens = [x for x in match if x and not x.isspace()]
 7         else:
 8             raise RuntimeError("Failed to tokenize expression '%s'" % expr)
 9 
10     def pop(self):
11         return self.tokens.pop(0) if self.tokens else None
12 
13     def peek(self):
14         return self.tokens[0] if self.tokens else None

You can play around with this and see that pop will remove and return the first token, and that peek will return the first token without removing it:

>>> p = Parser('(+ 19 4)')
>>> p.tokens
['(', '+', '19', '4', ')']
>>> p.pop()
'('
>>> p.tokens
['+', '19', '4', ')']
>>> p.peek()
'+'
>>> p.tokens
['+', '19', '4', ')']

Let’s start actually building the parser. We’re going to write a recursive descent parser, which is extremely straight-forward. The trick is to make a method for each rule and return the ast built by that rule. When presented with a choice in the rule, just use peek with an if statement. When presented with a * rule, use a while loop.

 1 def expression(self):
 2         if self.peek() == '(':
 3             self.pop() # '('
 4             ast = [self.operator()]
 5             while self.peek() != ')':
 6                 ast.append(self.expression())
 7             if self.pop() != ')':
 8                 raise RuntimeError("expected ')'")
 9             return ast
10         elif self.peek().isdigit():
11             return self.number()
12         else:
13             raise RuntimeError("expected number or '('")

Look closely at it and see how it resembles the expression rule in the grammar we defined.

The hard part is over, now we just have number and operator to do and we get this class as a result:

 1 class Parser:
 2     def __init__(self, expr):
 3         token_pattern = r'(\d+|\s+|[-+*/d()])'
 4         if re.match('^%s*$' % token_pattern, expr):
 5             match = re.findall(token_pattern, expr)
 6             self.tokens = [x for x in match if x and not x.isspace()]
 7         else:
 8             raise RuntimeError("Failed to tokenize expression '%s'" % expr)
 9 
10     def pop(self):
11         return self.tokens.pop(0) if self.tokens else None
12 
13     def peek(self):
14         return self.tokens[0] if self.tokens else None
15 
16     def expression(self):
17         if self.peek() == '(':
18             self.pop() # '('
19             ast = [self.operator()]
20             while self.peek() != ')':
21                 ast.append(self.expression())
22             if self.pop() != ')':
23                 raise RuntimeError("expected ')'")
24             return ast
25         elif self.peek().isdigit():
26             return self.number()
27         else:
28             raise RuntimeError("expected number or '('")
29 
30     def number(self):
31         if self.peek().isdigit():
32             return float(self.pop())
33         else:
34             raise RuntimeError("expected number")
35 
36     def operator(self):
37         if self.peek() in '-+/*d':
38             return self.pop()
39         else:
40             raise RuntimeError("expected '-', '+', '/', '*', or 'd'")

 

>>> Parser("(/ (d 2 (* 14 2)) 3)").expression()
['/', ['d', 2.0, ['*', 14.0, 2.0]], 3.0]

In my next blog post we’ll actually evaluate the expression.