Lark Parser Generatorgithub.com/asyncze/playcode |
Date | ||
---|---|---|---|
Author | @asyncze |
A Lark parser is generated from the .lark
grammar file, so first need to translate PlayCode's pseudo-grammar into the syntax used by Lark. It's very similar to EBNF: https://lark-parser.readthedocs.io/en/latest/grammar.html
Below is a subset of PlayCode's grammar for Lark. You can see the full up-to-date grammar here: https://github.com/asyncze/playcode/blob/main/pc.lark
assign_stmt : IDENTIFIER "=" expr
expr : term
| term "+" term -> add
| term "-" term -> sub
term : factor
factor : SIGNED_NUMBER -> number
| IDENTIFIER -> identifier
| "(" expr ")"
COMMENT : "--" /[^\n]*/
%import common.CNAME -> IDENTIFIER
%import common.SIGNED_NUMBER
%import common.WS
%ignore WS
%ignore COMMENT
I like the built-in patterns for numbers, strings, whitespace, and so on. The ->
followed by add
, sub
, etc., are aliases to match for addition and subtraction expressions (and not inline asserts as in PlayCode).
Parsing the grammar file and creating a parser looks like this:
parser = Lark(open("pc.lark", "r").read(), start="assign_stmt", parser="lalr")
Simple enough. Then to parse input file to generate an AST.
file = open(sys.argv[1], "r").read()
tree = parser.parse(file)
Here's an example input (in file passed as sys.argv[1]
) and the resulting AST:
x = 2
y = 2 + (x - 1)
Tree(Token('RULE', 'program'), [
Tree(Token('RULE', 'assign_stmt'), [
Token('IDENTIFIER', 'x'),
Tree(Token('RULE', 'expr'), [
Tree(Token('RULE', 'term'), [
Tree('number', [Token('NUMBER', '2')])
])
])
]),
Tree(Token('RULE', 'assign_stmt'), [
Token('IDENTIFIER', 'y'),
Tree('add', [
Tree(Token('RULE', 'term'), [
Tree('number', [Token('NUMBER', '2')])
]),
Tree(Token('RULE', 'term'), [
Tree(Token('RULE', 'factor'), [
Tree('sub', [
Tree(Token('RULE', 'term'), [
Tree('identifier', [Token('IDENTIFIER', 'x')])
]),
Tree(Token('RULE', 'term'), [
Tree('number', [Token('NUMBER', '1')])
])
])
])
])
])
])
])
This is a much more verbose tree compared to before. Maybe a little too verbose? I could always do another pass over the AST to clean it up (a pre-visitor?).
Building a visitor for the AST was surprisingly straightforward, basically just matching rule names, aliases, and returning the expected behavior.
def visitor(tree):
match tree.data:
case "assign_stmt":
left, right = tree.children
SYMBOL_TABLE[left.children[0].value] = visitor(right)
case "expr":
return visitor(tree.children[0])
case "term":
return visitor(tree.children[0])
case "factor":
return visitor(tree.children[0])
case "add":
left, right = tree.children
return int(visitor(left)) + int(visitor(right))
case "sub":
left, right = tree.children
return int(visitor(left)) - int(visitor(right))
case "number":
return int(tree.children[0])
case "identifier":
return SYMBOL_TABLE[str(tree.children[0])]
visitor(tree)
print(SYMBOL_TABLE)
# {'x': 2, 'y': 3}
Now, this is nice. You can basically write the typical calculator example in only a few lines with the correct grammar.