A Handwritten Math Parser in 100 lines of Python
This repository comprises a handwritten parser for simple mathematical expressions
of the form
2*(3+4), written in 100 lines of Python code.
It exists solely for educational reasons.
How to Use it
Run
python3 compute.py '2*(3+4)'
and you should receive
14 as the result.
This is perhaps not so surprising, but you can also run
python graphviz.py '2*(3+4)' > graphviz_input
dot -Tpng graphviz_input -o output.png
to get a visual representation of the abstract syntax tree
(this requires having Graphviz installed).
Creating a hand-written parser for anything is completely unnecessary nowadays, as there are tools like
ANTLR that do all the heavy lifting for you.
Furthermore, this specific problem must have been solved millions of times
by undergrad computer science students all around the world.
However, it had not been solved by me until now,
as in my undergrad studies at TU Vienna we skipped the low-level work
and built a parser based on yacc/bison.
I really enjoyed doing this little side project,
since it takes you back to the roots of computer science
(this stuff dates back to 1969, according to Wikipedia),
and I like a lot how you end up with a beautiful and simple solution.
Be aware that I'm by no means an expert in compiler construction,
and someone who is would probably shudder at some of the things happening here,
but to me it was a great educational exercise.
The literature on this topic is very formal,
which makes it a bit hard for an uninitiated person to get into the topic.
In this description, I have tried to focus more on intuitive explanations.
However, to me it is quite clear that if you don't stick to the theory,
you will soon run into things that are hard to make sense of
if you cannot connect them to what is going on in the literature.
The problem is to bring algebraic expressions represented as a string
into a form that can be easily reused for doing something interesting with it,
such as computing the result of the expression or visualizing it nicely.
The allowed algebraic operations are
+,-,*,/ as well as the use of (nested) parentheses
( ... ).
The standard rules for operator precedence apply.
There are different ways to tackle this problem,
but in general LL(1) parsers have a reputation for being quite simple to implement.
An LL(1) parser is a top-down parser that keeps replacing elements on the parser stack
with the right-hand side of the currently matching grammar rule.
This decision is based on two pieces of information:
- The top symbol on the parser stack, which can be either a terminal or a non-terminal.
  A terminal is a token that appears in the input, such as
  +, while a non-terminal is the left-hand side of a grammar rule, such as
  Exp.
- The current terminal from the input stream that is being processed.
For example, if the current symbol on the stack is
S, the current input terminal is
a, and there is a rule in the grammar that allows
S -> a P
then
S should be replaced with
a P. Here, S and
P are non-terminals, and for the remainder of this document,
capitalized grammar elements are considered non-terminals,
and lower-case grammar elements, such as
a, are considered terminals.
To continue the example,
a on top of the stack is now matched to the input stream terminal
a and removed from the stack.
The process continues until the stack is empty (which means the parsing was successful)
or an error occurs (which means that the input stream does not conform to the grammar).
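The stack mechanics described above can be sketched in a few lines of Python. This is only an illustration, not code from this repository: the rule P -> b is hypothetical (the text only gives S -> a P) and was added so that the toy grammar terminates, and the names RULES and ll1_accepts are made up for this sketch.

```python
# Toy grammar: S -> a P (from the text) and P -> b (hypothetical, for illustration).
# Keys are (non-terminal on top of the stack, current input terminal).
RULES = {
    ("S", "a"): ["a", "P"],
    ("P", "b"): ["b"],
}


def ll1_accepts(terminals):
    """Return True if the list of terminals can be derived from S."""
    stack = ["S"]
    pos = 0
    while stack:
        top = stack.pop()
        current = terminals[pos] if pos < len(terminals) else None
        if top.isupper():
            # Non-terminal: replace it with the right-hand side of the matching rule.
            rhs = RULES.get((top, current))
            if rhs is None:
                return False
            # Push in reverse order so the left-most symbol ends up on top.
            stack.extend(reversed(rhs))
        elif top == current:
            # Terminal on the stack matches the input: consume it.
            pos += 1
        else:
            return False
    # Success requires an empty stack and a fully consumed input.
    return pos == len(terminals)
```

With these two rules, ll1_accepts(["a", "b"]) succeeds, while ["a"] or ["b"] alone are rejected.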
As there are usually multiple grammar rules to choose from, the information about which rule to apply
in which situation needs to be encoded somehow and is usually stored in a parsing table.
In our case, however, the grammar is so simple that this would almost be overkill, so instead
the parsing table is represented by some if-statements throughout the code.
Here is the starting point for our grammar:
(1) Exp -> Exp + Exp
(2) Exp -> Exp - Exp
(3) Exp -> Exp * Exp
(4) Exp -> Exp / Exp
(5) Exp -> ( Exp )
(6) Exp -> num
The grammar is rather self-explanatory.
It is however ambiguous, because it contains rules of the form
Exp -> Exp op Exp, where op is one of the four operators.
This means that it is not yet defined whether
2+3*4 should be interpreted as
2+3=5 followed by
5*4=20 or as
3*4=12 followed by
2+12=14.
By cleverly rewriting the grammar, the operator precedence can be encoded in the grammar.
(1) Exp -> Exp + Exp2
(2) Exp -> Exp - Exp2
(3) Exp -> Exp2
(4) Exp2 -> Exp2 * Exp3
(5) Exp2 -> Exp2 / Exp3
(6) Exp2 -> Exp3
(7) Exp3 -> ( Exp )
(8) Exp3 -> num
For the previous example
2+3*4 the following derivation would be used from now on:
Exp
(1) Exp + Exp2
(3) Exp2 + Exp2
(6) Exp3 + Exp2
(8) num + Exp2
(4) num + Exp2 * Exp3
(6) num + Exp3 * Exp3
(8) num + num * Exp3
(8) num + num * num
Compare this to the derivation of
2*3+4:
Exp
(1) Exp + Exp2
(3) Exp2 + Exp2
(4) Exp2 * Exp3 + Exp2
(6) Exp3 * Exp3 + Exp2
(8) num * Exp3 + Exp2
(8) num * num + Exp2
(6) num * num + Exp3
(8) num * num + num
We see that in both examples the order in which the rules for the operators
+ and * are applied is the same.
It is maybe a bit confusing that
+ appears first,
but if you look at the resulting parse tree you can convince yourself that
the result of
* flows as an input to
+ and therefore it needs to be computed first.
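To make the evaluation order concrete, here is a small sketch that writes the parse trees for 2+3*4 and 2*3+4 by hand as nested tuples and evaluates them bottom-up. The tuple representation and the names OPS and evaluate are assumptions made for this illustration only; they are not how the repository stores its tree.

```python
import operator

# Map each operator symbol to the function that computes it.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate(tree):
    """Evaluate a tree given as (operator, left, right) tuples with numbers as leaves."""
    if isinstance(tree, tuple):
        op, left, right = tree
        # The children are evaluated first, so the inner '*' is computed before '+'.
        return OPS[op](evaluate(left), evaluate(right))
    return tree


# In both trees '+' is the root and '*' is one of its children.
tree_a = ("+", 2, ("*", 3, 4))   # 2+3*4
tree_b = ("+", ("*", 2, 3), 4)   # 2*3+4
```

Evaluating tree_a gives 14 and tree_b gives 10: although + is derived first, * sits deeper in the tree and therefore gets computed first.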
Here, I used a left-most derivation of the input stream.
This means that you always try to replace the left-most symbol next
(which corresponds to the symbol on the top of the stack),
and not something in the middle of your parse tree.
This is what one of the Ls in
LL(1) actually stands for, so this is also how our parser will operate.
However, there is one more catch.
The grammar we came up with is now unambiguous, but it still cannot be parsed by an LL(1) parser,
because multiple rules start with the same non-terminal
and the parser would need to look ahead more than one token to figure out which rule to apply.
Indeed, for the example above you have to look ahead more than one rule
to figure out the derivation yourself.
As the name LL(1) indicates, LL(1) parsers only look ahead one symbol.
Fortunately, one can make the grammar LL(1)-parser-friendly by rewriting all the left recursions
in the grammar rules as right recursions.
(0) S -> Exp $
(1) Exp -> Exp2 Exp'
(2) Exp' -> + Exp2 Exp'
(3) Exp' -> - Exp2 Exp'
(4) Exp' -> ϵ
(5) Exp2 -> Exp3 Exp2'
(6) Exp2' -> * Exp3 Exp2'
(7) Exp2' -> / Exp3 Exp2'
(8) Exp2' -> ϵ
(9) Exp3 -> num
(10) Exp3 -> ( Exp )
Here,
ϵ means that the current symbol on the stack should just be popped off,
but not be replaced by anything else.
Furthermore, we added another rule
(0) that makes sure
that the parser understands when the input is finished.
Here,
$ stands for end of input.
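The rewriting of left recursions into right recursions follows a mechanical pattern: rules of the form A -> A α and A -> β become A -> β A', A' -> α A' and A' -> ϵ. The sketch below applies that pattern to a single non-terminal. It is an illustration under assumptions: the function name and the list-of-lists representation are made up here, an empty list stands for ϵ, and only immediate left recursion is handled.

```python
def eliminate_left_recursion(nonterminal, productions):
    """Rewrite immediate left recursion of one non-terminal as right recursion.

    productions is a list of right-hand sides, each a list of symbols.
    Returns a dict mapping non-terminals to their new right-hand sides.
    """
    # Right-hand sides that start with the non-terminal itself (A -> A alpha),
    # stored without the leading A.
    recursive = [rhs[1:] for rhs in productions if rhs and rhs[0] == nonterminal]
    # All remaining right-hand sides (A -> beta).
    others = [rhs for rhs in productions if not rhs or rhs[0] != nonterminal]
    if not recursive:
        return {nonterminal: productions}
    new = nonterminal + "'"
    return {
        # A -> beta A'
        nonterminal: [beta + [new] for beta in others],
        # A' -> alpha A' | epsilon (the empty list)
        new: [alpha + [new] for alpha in recursive] + [[]],
    }
```

Applied to Exp with the right-hand sides Exp + Exp2, Exp - Exp2 and Exp2, this yields Exp -> Exp2 Exp' together with the three Exp' rules, which corresponds to rules (1) to (4) above.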
Constructing the parsing table
While we are not going to use an explicit parsing table, we still need to know its contents
so that the parser can determine which rule to apply next.
To simplify the contents of the parsing table, I will use one little trick that I came across
while implementing the whole thing, and that is:
If there is only one grammar rule for a specific non-terminal,
just expand it without caring about what is on the input stream.
This is a bit different from what you find in the literature,
where you are told to expand non-terminals only if the current terminal permits it.
In our case, this means that the non-terminals
S, Exp and
Exp2 will be expanded no matter what.
For the other non-terminals, it is quite clear which rule to apply:
+ -> rule (2)
- -> rule (3)
* -> rule (6)
/ -> rule (7)
num -> rule (9)
( -> rule (10)
Note that the rules can only be applied when the current symbol on the stack matches the
left-hand side of the grammar rule.
For example, rule
(2) can only be applied if currently
Exp' is on the stack.
Since we also have some rules that can be expanded to
ϵ, we need to figure out when that should actually happen.
For this it is necessary to look at which terminals appear after a nullable non-terminal.
The nullable non-terminals in our case are
Exp' and Exp2'.
Exp' is followed by
) and $.
Exp2' is followed by
+, -, ) and
$.
So whenever we encounter
) or $ in the input stream while
Exp' is on top of the stack,
we just pop
Exp' off and move on.
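Although the implementation encodes these decisions as if-statements, it can help to see the table written out once as data. The following sketch collects the choices described above in two dictionaries. The names SINGLE_RULE, BY_TERMINAL and choose_rule are hypothetical and not part of the repository; the rule numbers refer to the LL(1) grammar with rules (0) to (10).

```python
# Non-terminals with exactly one rule are expanded regardless of the input.
SINGLE_RULE = {"S": 0, "Exp": 1, "Exp2": 5}

# (non-terminal on top of the stack, current input terminal) -> rule number
BY_TERMINAL = {
    ("Exp'", "+"): 2,
    ("Exp'", "-"): 3,
    ("Exp'", ")"): 4,    # epsilon: Exp' is followed by ) or $
    ("Exp'", "$"): 4,
    ("Exp2'", "*"): 6,
    ("Exp2'", "/"): 7,
    ("Exp2'", "+"): 8,   # epsilon: Exp2' is followed by +, -, ) or $
    ("Exp2'", "-"): 8,
    ("Exp2'", ")"): 8,
    ("Exp2'", "$"): 8,
    ("Exp3", "num"): 9,
    ("Exp3", "("): 10,
}


def choose_rule(top, current):
    """Return the number of the rule to apply, or raise ValueError if none fits."""
    if top in SINGLE_RULE:
        return SINGLE_RULE[top]
    try:
        return BY_TERMINAL[(top, current)]
    except KeyError:
        raise ValueError(f"No rule for {top!r} with input {current!r}") from None
```

For instance, choose_rule("Exp'", "$") selects the ϵ rule (4), while choose_rule("Exp3", "+") raises an error because no rule for Exp3 starts with +.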
Obtaining the Abstract Syntax Tree
The abstract syntax tree can be constructed on the fly during parsing.
The trick here is to only include those elements that are interesting
(in our case
num, +, -, *, /) and skip over all the elements that are
only there for grammatical reasons.
One thing that you might find worthwhile to try is to start with the concrete syntax tree
that includes all the elements of the grammar and kick out things that you find are unnecessary.
Keeping things visualized definitely helps with this.
A nice thing about LL(1) parsing is that you can just use the call stack for keeping track
of the current non-terminal.
So in the Python implementation, you will find for the non-terminal
Exp a function
parse_e() that does not do much else than first calling
parse_e2() and then calling
parse_ea() (which corresponds to
Exp').
A look at the function
parse_e3() shows us how to deal with terminals:
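def parse_e3(tokens):
    # Rule (9): a number is consumed directly from the input.
    if tokens[0].token_type == TokenType.T_NUM:
        return tokens.pop(0)
    # Rule (10): otherwise a parenthesized expression is expected.
    match(tokens, TokenType.T_LPAR)
    e_node = parse_e(tokens)
    match(tokens, TokenType.T_RPAR)
    return e_node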
Here, it is checked whether the current token from the input stream is a number.
If it is, we consume the input token directly without putting it on some intermediate stack.
This corresponds to rule
(9).
If it is not a number, it must be a
(, so we try to consume this instead
(the function
match() raises an exception if the expected and the incoming tokens are different).
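To show how the pieces described so far could fit together, here is a self-contained sketch of a recursive-descent parser for the LL(1) grammar. It is not the repository's code. The names parse_e, parse_e2, parse_ea, parse_e3, match and the TokenType members T_NUM, T_LPAR and T_RPAR come from the text; everything else (Node, lex, parse, compute, parse_e2a, the other TokenType members, and passing the left operand into parse_ea) is an assumption made for this illustration.

```python
import enum
import operator
import re


class TokenType(enum.Enum):
    T_NUM = 0
    T_PLUS = 1
    T_MINUS = 2
    T_MULT = 3
    T_DIV = 4
    T_LPAR = 5
    T_RPAR = 6
    T_END = 7


class Node:
    """Used both as a token and as an AST node (an assumption of this sketch)."""
    def __init__(self, token_type, value=None):
        self.token_type = token_type
        self.value = value
        self.children = []


SYMBOLS = {"+": TokenType.T_PLUS, "-": TokenType.T_MINUS, "*": TokenType.T_MULT,
           "/": TokenType.T_DIV, "(": TokenType.T_LPAR, ")": TokenType.T_RPAR}
OPERATIONS = {TokenType.T_PLUS: operator.add, TokenType.T_MINUS: operator.sub,
              TokenType.T_MULT: operator.mul, TokenType.T_DIV: operator.truediv}


def lex(text):
    """Split the input into tokens; only non-negative integers are supported."""
    tokens = []
    for lexeme in re.findall(r"\d+|\S", text):
        if lexeme.isdigit():
            tokens.append(Node(TokenType.T_NUM, int(lexeme)))
        elif lexeme in SYMBOLS:
            tokens.append(Node(SYMBOLS[lexeme], lexeme))
        else:
            raise ValueError(f"Invalid character: {lexeme!r}")
    tokens.append(Node(TokenType.T_END))  # plays the role of $
    return tokens


def match(tokens, token_type):
    """Consume the next token if it has the expected type, otherwise raise."""
    if tokens[0].token_type == token_type:
        return tokens.pop(0)
    raise ValueError(f"Expected {token_type}, got {tokens[0].token_type}")


def parse_e(tokens):                 # rule (1): Exp -> Exp2 Exp'
    return parse_ea(tokens, parse_e2(tokens))


def parse_ea(tokens, left):          # rules (2), (3): Exp' -> +/- Exp2 Exp'
    if tokens[0].token_type in (TokenType.T_PLUS, TokenType.T_MINUS):
        node = tokens.pop(0)
        node.children = [left, parse_e2(tokens)]
        return parse_ea(tokens, node)
    return left                      # rule (4): epsilon; errors surface in match()


def parse_e2(tokens):                # rule (5): Exp2 -> Exp3 Exp2'
    return parse_e2a(tokens, parse_e3(tokens))


def parse_e2a(tokens, left):         # rules (6), (7): Exp2' -> * or / Exp3 Exp2'
    if tokens[0].token_type in (TokenType.T_MULT, TokenType.T_DIV):
        node = tokens.pop(0)
        node.children = [left, parse_e3(tokens)]
        return parse_e2a(tokens, node)
    return left                      # rule (8): epsilon


def parse_e3(tokens):                # rules (9), (10)
    if tokens[0].token_type == TokenType.T_NUM:
        return tokens.pop(0)
    match(tokens, TokenType.T_LPAR)
    e_node = parse_e(tokens)
    match(tokens, TokenType.T_RPAR)
    return e_node


def parse(text):                     # rule (0): S -> Exp $
    tokens = lex(text)
    ast = parse_e(tokens)
    match(tokens, TokenType.T_END)
    return ast


def compute(node):
    """Evaluate the AST bottom-up."""
    if node.token_type == TokenType.T_NUM:
        return node.value
    left, right = (compute(child) for child in node.children)
    return OPERATIONS[node.token_type](left, right)
```

With this sketch, compute(parse('2*(3+4)')) evaluates to 14. Passing the already parsed left operand into parse_ea and parse_e2a is one way to keep - and / left-associative; only operators and numbers end up in the tree, while parentheses and the end marker are consumed without leaving a node behind.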