Lark Parser Generator

github.com/asyncze/playcode
Date
Author @asyncze

Contents

Lark grammar #

A Lark parser is generated from the .lark grammar file, so first need to translate PlayCode's pseudo-grammar into the syntax used by Lark. It's very similar to EBNF: https://lark-parser.readthedocs.io/en/latest/grammar.html

Below is a subset of PlayCode's grammar for Lark. You can see the full up-to-date grammar here: https://github.com/asyncze/playcode/blob/main/pc.lark

assign_stmt : IDENTIFIER "=" expr

expr : term
     | term "+" term        -> add
     | term "-" term        -> sub

term : factor

factor : SIGNED_NUMBER      -> number
       | IDENTIFIER         -> identifier
       | "(" expr ")"

COMMENT : "--" /[^\n]*/

%import common.CNAME -> IDENTIFIER
%import common.SIGNED_NUMBER
%import common.WS

%ignore WS
%ignore COMMENT

I like the built-in patterns for numbers, strings, whitespace, and so on. The -> followed by add, sub, etc., are aliases to match for addition and subtraction expressions (and not inline asserts as in PlayCode).

Parsing the grammar file and creating a parser looks like this:

parser = Lark(open("pc.lark", "r").read(), start="assign_stmt", parser="lalr")

Simple enough. Then to parse input file to generate an AST.

file = open(sys.argv[1], "r").read()
tree = parser.parse(file)

Here's an example input (in file passed as sys.argv[1]) and the resulting AST:

x = 2
y = 2 + (x - 1)
Tree(Token('RULE', 'program'), [
    Tree(Token('RULE', 'assign_stmt'), [
        Token('IDENTIFIER', 'x'),
        Tree(Token('RULE', 'expr'), [
            Tree(Token('RULE', 'term'), [
                Tree('number', [Token('NUMBER', '2')])
            ])
        ])
    ]),
    Tree(Token('RULE', 'assign_stmt'), [
        Token('IDENTIFIER', 'y'),
        Tree('add', [
            Tree(Token('RULE', 'term'), [
                Tree('number', [Token('NUMBER', '2')])
            ]),
            Tree(Token('RULE', 'term'), [
                Tree(Token('RULE', 'factor'), [
                    Tree('sub', [
                        Tree(Token('RULE', 'term'), [
                            Tree('identifier', [Token('IDENTIFIER', 'x')])
                        ]),
                        Tree(Token('RULE', 'term'), [
                            Tree('number', [Token('NUMBER', '1')])
                        ])
                    ])
                ])
            ])
        ])
    ])
])

This is a much more verbose tree compared to before. Maybe a little too verbose? I could always do another pass over the AST to clean it up (a pre-visitor?).

AST visitor #

Building a visitor for the AST was surprisingly straightforward, basically just matching rule names, aliases, and returning the expected behavior.

def visitor(tree):
    match tree.data:
        case "assign_stmt":
            left, right = tree.children
            SYMBOL_TABLE[left.children[0].value] = visitor(right)
        case "expr":
            return visitor(tree.children[0])
        case "term":
            return visitor(tree.children[0])
        case "factor":
            return visitor(tree.children[0])
        case "add":
            left, right = tree.children
            return int(visitor(left)) + int(visitor(right))
        case "sub":
            left, right = tree.children
            return int(visitor(left)) - int(visitor(right))
        case "number":
            return int(tree.children[0])
        case "identifier":
            return SYMBOL_TABLE[str(tree.children[0])]
visitor(tree)
print(SYMBOL_TABLE)
# {'x': 2, 'y': 3}

Now, this is nice. You can basically write the typical calculator example in only a few lines with the correct grammar.