
Which Parsing Approach?


September 15 2020

Everyone knows that parsing is an important part of designing and implementing
programming languages, but it’s the equivalent of Brussels sprouts: good for
the diet, but a taste that only a select few enjoy. Unfortunately, I’ve come to
realise that our general distaste for parsing is problematic. While many of us
think that we’ve absorbed the advances of the 1960s into our collective
understanding, I fear that we have regressed, and that we are often making
inappropriate decisions about parsing. If that sounds accusatory, I
don’t mean it to be: I spent over 20 years assuming
that parsing is easy and that I didn’t need to understand it properly in
order to use it well. Alas, reality has been a cruel teacher, and in this
post I want to share some of the lessons I’ve been forced to slowly learn and
acknowledge.

Let’s start with the basics. A grammar encodes the syntax rules for a given language.
Parsing is the act of taking in an input (e.g.
a source file) and determining if, and how, it corresponds to a grammar. At its
most basic level, parsing just says “this input does/doesn’t correspond to the
grammar”. That’s rarely useful for programming languages, so we normally
execute semantic actions while parsing, allowing us to, for example,
build a parse tree that represents the input as a tree. If I have a
simple calculator grammar and the input 2-3*4 I might get back
a tree that looks like the following:

For the rest of this post, I’ll represent trees as “pretty printed text”, where
brackets allow us to succinctly express how the tree is structured. For example
the above tree is equivalent to (2-(3*4)). I’m going to assume
that “parsing” means “check correspondence to the grammar and build a parse
tree”. I’m also going to simplify other parsing jargon and nomenclature
whenever I can, to try and keep things somewhat understandable and the length
somewhat manageable.

Recursive descent

There are a bewildering number of ways that one can parse an input, so I’ll
start with what is probably the most common: a hand-written parser.
While that could mean just about anything, nearly everyone who writes a
half-decent hand-written parser, whether they know it or not, is writing a
recursive descent parser. The
idea is relatively simple: one writes a series of functions which examine the
input string at a given position and, if they match at that position,
advance the parse. For example, a first attempt
at a recursive descent parser in Python that can parse the simple
calculator language above might look as follows:

NUMBERS = ["0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
OPERATORS = ["-", "*"]

class Number:
  def __init__(self, n): self.n = n
  def __str__(self): return str(self.n)
class Mul:
  def __init__(self, lhs, rhs):
    self.lhs = lhs
    self.rhs = rhs
  def __str__(self): return "(%s*%s)" % (str(self.lhs), str(self.rhs))
class Sub:
  def __init__(self, lhs, rhs):
    self.lhs = lhs
    self.rhs = rhs
  def __str__(self): return "(%s-%s)" % (str(self.lhs), str(self.rhs))

def parse_expr(s, i):
  if s[i] not in NUMBERS:
    return
  # Consume as many consecutive digits as possible.
  j = i
  while j < len(s) and s[j] in NUMBERS:
    j += 1
  lhs = Number(s[i:j])
  if j == len(s) or s[j] not in OPERATORS:
    return (j, lhs)
  op = s[j]
  r = parse_expr(s, j + 1)
  if r is None:
    return
  (i, rhs) = r
  if op == "-":
    return (i, Sub(lhs, rhs))
  else:
    assert op == "*"
    return (i, Mul(lhs, rhs))

def parse(s):
  r = parse_expr(s, 0)
  # Fail unless the entire input was consumed.
  if r is None or r[0] < len(s):
    return "Syntax error"
  return r[1]

print(parse("2-3*4"))

The basic idea is fairly simple: we have a string ‘s’ that we’re parsing,
with the variable ‘i’ telling us how far we’ve parsed so far. If
it was able to parse part of the input starting at ‘i’ then the
parse_expr function returns a pair (i, tree) telling
us how far it parsed and giving us back the tree it created; if it failed it
returns None. When I parse 2-3*4 it prints:

(2-(3*4))

In other words, if we were to evaluate that tree, we’d get a result of -10 –
success! Admittedly, that has come at a cost: the recursive descent parser has
quite a lot of boiler-plate to ensure that it doesn’t do something silly and
that any syntax errors encountered cause parsing to stop. For example, if you
remove the final length check in the parse function, then 2abc will parse successfully
returning Number(2), ignoring the fact that
abc couldn’t be parsed! There are ways of reducing the boilerplate,
but if you write recursive descent parsers for a living, you have to learn to
live with it.

Unfortunately, if I try parsing 2*3-4 I get a surprising
result:

(2*(3-4))

We’ve all been taught from a young age that the grammar of mathematics requires
‘*’ to “bind harder” than ‘-’. Put more formally,
‘*’ is supposed to have a higher precedence than
‘-’. Unfortunately, my hand-written recursive descent parser has
given both operators the same precedence. If I was to evaluate that tree, I’d
get -2 instead of the 2 we’d have expected from the original expression.

Fortunately, there’s a fairly standard way to encode operator precedence
which, in the style of the above parser, can be
written as follows:

def parse_expr(s, i):
  r = parse_factor(s, i)
  if r is None:
    return
  (i, lhs) = r
  if i < len(s) and s[i] == "-":
    r = parse_expr(s, i + 1)
    if r is None:
      return
    (i, rhs) = r
    return (i, Sub(lhs, rhs))
  return (i, lhs)

def parse_factor(s, i):
  r = parse_term(s, i)
  if r is None:
    return None
  (i, lhs) = r
  if i < len(s) and s[i] == "*":
    r = parse_factor(s, i + 1)
    if r is None:
      return
    (i, rhs) = r
    return (i, Mul(lhs, rhs))
  return (i, lhs)

def parse_term(s, i):
  if s[i] not in NUMBERS:
    return
  j = i
  while j < len(s) and s[j] in NUMBERS:
    j += 1
  return (j, Number(s[i:j]))

def parse(s):
  r = parse_expr(s, 0)
  if r is None or r[0] < len(s):
    return "Syntax error"
  return r[1]

If I parse these expressions:

print(parse("2-3*4"))
print(parse("2*3-4"))

I can see that I get the expected output:

(2-(3*4))
((2*3)-4)

Success at last! Well, not quite, because if I parse 2-3-4 I
get another surprising result:

(2-(3-4))

Unfortunately, as this example shows, we’re incorrectly parsing operators as
right associative when they should be left associative. In
other words, when we see a sequence of subtractions, earlier subtractions should
be matched before later subtractions. Fixing this
might seem like it’s easy, but it’s not: the “obvious” way of implementing
left associativity in a recursive descent parser causes an infinite loop.
Fixing that is more involved than I want to get into here: see this
page for an approachable summary of solutions to this problem.
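
That said, to give a flavour of one common fix, the idea is to replace the
recursion with a loop, folding each new operand into the tree from the left.
A minimal sketch of a left associative parse_expr, reusing parse_factor and
Sub from the parser above, might look as follows:

def parse_expr(s, i):
  r = parse_factor(s, i)
  if r is None:
    return
  (i, lhs) = r
  # Keep matching "-" operators, building the tree up from the left.
  while i < len(s) and s[i] == "-":
    r = parse_factor(s, i + 1)
    if r is None:
      return
    (i, rhs) = r
    lhs = Sub(lhs, rhs)
  return (i, lhs)

With this version, 2-3-4 parses as ((2-3)-4).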

It might be tempting to see these problems as the result of an idiot (me)
writing a parser for a language they don’t sufficiently understand (mathematics). I hope
you can see that there’s also something deeper going on. The underlying problem is that
the grammar I wanted to write is ambiguous: 2-3*4 can be parsed as
equivalent to 2-(3*4) or (2-3)*4. It is often said that
recursive descent parsers are inherently unambiguous. While true, this
makes a virtue out of a vice: recursive descent parsers are unambiguous simply
because they ignore ambiguities. Put another way, whenever a recursive descent parser
encounters a point at run-time where an input can be parsed ambiguously, it arbitrarily
chooses one of the possibilities, and charges onwards as if the other
possibilities had never existed. Significantly, the parser author is not
notified that this happened. Since recursive descent parsers are just normal programs,
it’s unlikely that we’ll ever be able to make a static analysis that can
take such a parser and reliably tell us at compile-time all the points of ambiguity.

It is therefore probably not a coincidence that recursive descent parsers
have no real “theory”. Notably, they do not
have any known relationship to the class of grammars we understand best,
Context-Free Grammars (CFGs). For example, we do not know, in general, the
language which a recursive descent parser will accept: all we can do is
throw ever more inputs at it and observe if, and how, it parses them, never
knowing if another input will cause a surprising parse.

Over time, I’ve come to view recursive descent as the parsing equivalent of
assembly programming: maximum flexibility, maximum performance, and maximum
danger. Every non-trivial recursive descent parser I’ve seen converted
to another formalism has led to unexpected ambiguities being uncovered.
Sometimes this leads to incorrect
parses (as above), but it just as often leads to seemingly correct input not
being parsed at all. There are good reasons
for using recursive descent parsers (I’ll get to those later), but, in my opinion,
if another formalism can be used, it generally should be.

Generalised parsers

At the opposite end of the spectrum are what have come to be called
generalised parsers. There are various generalised parsing algorithms
(e.g. Earley, GLL,
and GLR) but, from the
perspective of this post, they’re all
equivalent. Each can parse any CFG (so they rest on a solid theory), even
ambiguous ones (so you don’t have to worry about contorting your grammar), and
they guarantee to tell you where all of the ambiguous points in the grammar are
at run-time (so you don’t have to worry about things being unexpectedly
mis-parsed).

These properties appear to make generalised parsing the solution to the
problems noted above with recursive descent parsers. However, this comes at a cost.
Consider the following grammar which, once again, parses the little subset of
mathematics we’re using as an example:

Expr: Expr "-" Expr
    | Expr "*" Expr
    | "INT"
    ;

Given that grammar, many readers will have spotted an obvious point of
ambiguity: 2-3*4 can be parsed as equivalent to
(2-3)*4 or 2-(3*4). Generalised parsers are
interesting because they generate both possibilities at run-time. It is
possible for such parsers to return a “parse forest” (i.e. showing all the
ambiguous possibilities), but that’s not very useful for programming languages:
we expect compilers to settle on a single meaning for the programs
we throw at them. We thus need to disambiguate the ambiguous possibilities so that
we end up with a single parse tree. An easy way of doing this is to assign
a precedence to a rule’s productions so that if, at one point in the
parse, more than one of its productions match, we can pick the one with the
highest precedence. For example, I might rewrite my grammar to look as follows:

Expr: Expr "-" Expr %precedence 10
    | Expr "*" Expr %precedence 20
    | "INT"
    ;

Assuming that “higher” precedences mean “bind tighter”, this will then parse
2-3*4 as equivalent to 2-(3*4).

My experience is that fewer people (including, from bitter experience,
me) spot a second ambiguity in the above grammar: 2-3-4 can be parsed as left
(i.e. (2-3)-4) or right (i.e. 2-(3-4)) associative
(because of rules such as Expr "-" Expr). Unfortunately,
precedences are not sufficient to disambiguate between those two possibilities:
one either needs to rewrite the grammar or use a different disambiguation
operator.
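
For example, one way of forcing left associativity is to rewrite the rule so
that the recursion can only occur on the left-hand side of the operator. A
sketch, foreshadowing the “classic solution” shown later in this post (where
Term would cover the next precedence level down):

Expr: Expr "-" Term
    | Term
    ;

Because an Expr can now only appear to the left of a “-”, an input such as
2-3-4 has exactly one parse: (2-3)-4.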

While the good news is that a generalised parser will reliably tell us at
run-time that it encountered an ambiguity, the bad news is that we
generally have to wait until we encounter an input that is parsed ambiguously
to discover that our grammar is ambiguous. There are some
decent heuristics
that will statically find many of the points of
ambiguity, but they are just that — heuristics.

Over time, I’ve come to view generalised parsing as equivalent to
dynamic typing: expressive and safe, but with more errors than necessary
being deferred to run-time. I spent years trying to write arbitrary CFGs but,
for complex grammars, I continually struggled to squeeze out all of the ambiguities. I did not encounter a user
who was happy with, or anything other than startled by, ambiguity errors:
it is rather odd to be told that your input is valid but can’t be parsed.
That said, I think generalised
parsers have a part to play in language composition, where
composing different grammars inherently leads to ambiguity. However, I no
longer believe that generalised parsing is a good fit for “normal” parsing.

Statically unambiguous parsing

There are several parsing approaches which statically rule out ambiguity,
bypassing one of the fundamental problems with generalised parsing. I’ll describe
the two best known: LL and LR. In essence, these approaches describe subsets
of the CFGs which provably contain only unambiguous grammars.
It’s common to describe grammars which adhere to one of these subsets as being
“a valid LL grammar” or similar.

However, as far as we know, it is not possible to define the complete subset
of unambiguous CFGs, so there are unambiguous grammars which do not fit into
these subsets. I thus find it easiest to think of these approaches as being
analogous to a static type system: they are sound (i.e. if a grammar is a valid
LL/LR grammar, it really is unambiguous) but not complete (some unambiguous
grammars aren’t valid LL/LR grammars).

LL parsing

Although less common than in the past, LL parsing still underlies
systems such as javacc. My personal
bias is that LL parsers are largely unappealing, because the lack of left
recursion makes expressing many standard programming language constructs as
awkward as with recursive descent parsers. However, as hinted at by that
commonality, LL grammars have one important feature: they naturally map to
recursive descent parsers (but not necessarily vice versa). One can therefore
ensure that a recursive descent parser is not accidentally steamrollering
over ambiguities by creating an LL grammar and faithfully mapping it to a recursive
descent parser.
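
As a sketch of what that faithful mapping looks like (the grammar and the
parse_exprtail helper here are illustrative, not taken from any particular
tool; parse_term and Sub are reused from the earlier parser), consider this
LL version of the subtraction part of our running example:

Expr: Term ExprTail ;
ExprTail: "-" Expr
        | /* empty */ ;

Each rule becomes exactly one function, and each alternative is selected by
peeking at the next token:

def parse_expr(s, i):
  # Expr: Term ExprTail
  r = parse_term(s, i)
  if r is None:
    return
  (i, lhs) = r
  return parse_exprtail(s, i, lhs)

def parse_exprtail(s, i, lhs):
  # ExprTail: "-" Expr | /* empty */
  if i < len(s) and s[i] == "-":
    r = parse_expr(s, i + 1)
    if r is None:
      return
    (i, rhs) = r
    return (i, Sub(lhs, rhs))
  return (i, lhs)

Note that, true to LL’s lack of left recursion, this parses ‘-’ right
associatively; building a left associative tree requires extra work in the
semantic actions.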

To my mind, the combination of LL and recursive descent parser has a small
but important niche: if
you really, truly need the highest possible performance and/or you want the
best possible error messages, this is probably the best route we know of. However, it
comes at a significant cost. For a realistic programming language grammar, it
will typically take many person months of effort to beat an automatically
generated parser. I therefore think this
approach only makes sense for a small number of projects (notably
industrial-strength compilers and IDEs).

LR parsing: prelude

The last of the major parsing approaches I’m going to look at is LR parsing.
In common with many people, I spent years trying to avoid LR parsing because I
had imbibed the common idea that LR parsing is a terrible thing.
Instead I threw myself into other parsing approaches, notably Earley parsers.

Then, in late 2008, while bored in meetings, I started writing extsmail, a program mostly intended to send
email via ssh. I thought it would be interesting to write
this in the style of a traditional Unix daemon, something I had not attempted
before. For the two configuration files extsmail needs, I therefore decided to use the
traditional Unix daemon parsing tool Yacc. Not only had I not used Yacc before,
I had neither used nor studied LR parsing — I suspected I would have
quite a task on my hands. I was rather surprised when it turned out to be
easy to write a grammar such as externals_parser.y.

However, I assumed that I had been lucky with these grammars, which are
rather simple, and went back to avoiding LR parsing. Having realised that
generalised parsing and ambiguity were causing me problems, I spent quite a while
dabbling with PEG parsing (which is recursive descent in disguise)
before eventually realising that was going to cause me different, but
no less severe, problems relative to generalised parsing.

Later I stumbled across Tim Wagner’s thesis on incremental parsing,
which became the pillar that Lukas Diekmann built upon to create Eco. Wagner’s
work uses LR parsing but I managed to involve myself a bit with Eco without
actually understanding how LR parsing worked. Then, in 2015, when we were experimenting
as a group with Rust, Lukas wrote the beginnings of an LR parser as an
experiment, and I quickly jumped in and made a few modifications. Without
really intending to, I started expanding the code until I realised I had
taken on maintainership of what clearly had the potential to become a full Rust LR parser. At
that point, I realised I actually needed to understand LR parsing.
I found the explanations lurking on the web a bit confusing at first, but the
algorithm was simple enough that soon enough I had a full, if basic, Rust LR parser
(which became grmtools).

Why am I telling you this long, probably tedious, personal history? Because
I want to emphasise that I went out of my way to avoid LR parsing, even though
I didn’t really know what I was avoiding or why. Even after I had used
LR parsing, and realised that it wasn’t the bogeyman I had expected, I still spent
several years trying alternatives. Not only is that embarrassing
to admit publicly, it also troubles me: how had I picked up a bias that took me
so long to overcome? I’ve gradually alighted upon a plausible explanation
for our community’s general dislike of LR parsing and, oddly enough, it relates
to undergraduate compiler courses. For reasons that probably made sense in the
1970s and 80s, many compiler courses spend significant, arguably excessive, time on
parsing — generally LR parsing. Students come in expecting to be taught
how to generate machine code in clever ways, but instead have to learn all sorts
of parsing background before they even get to the main LR algorithm. By that
point they are thoroughly sick of parsing generally and LR parsing in
particular. This is a self-inflicted wound by our subject, as we have
accidentally turned people away from a beautiful algorithm.

LR parsing

That dealt with, let’s drill into some of the technical details of LR
parsing. First, LR is strictly more powerful than LL. In other words, every
valid LL grammar is also a valid LR grammar (but not necessarily vice versa).
Second, LR grammars are the
largest practical subset of unambiguous CFGs that we currently know how to
statically define.

Let’s actually try out LR parsing by feeding the following grammar:

%start Expr
%%
Expr: Expr "-" Expr
    | Expr "*" Expr
    | "INT"
    ;

to Yacc. Doing so leads to the following being printed at compile-time:

expr1.y: yacc finds 4 shift/reduce conflicts

At this point, I know that some readers will have broken out in a cold sweat at
the mention of “shift/reduce conflict”. Don’t panic yet! At the moment, let’s
just think of this as the LR parser statically detecting an ambiguity (or four…)
and telling us that we should fix it somehow.

There are various ways of drilling into more details about those
ambiguities. In a shameless plug, I’ll use nimbleparse,
but most Yacc implementations have a way of giving more detailed information.
nimbleparse also needs a valid lexer, so if I feed it the
grammar above as well as this Lex file:

%%
- "-"
\* "*"
[0-9]+ "INT"

I get this output:

Shift/Reduce conflicts:
   State 5: Shift("*") / Reduce(Expr: "Expr" "-" "Expr")
   State 5: Shift("-") / Reduce(Expr: "Expr" "-" "Expr")
   State 6: Shift("*") / Reduce(Expr: "Expr" "*" "Expr")
   State 6: Shift("-") / Reduce(Expr: "Expr" "*" "Expr")

Stategraph:
0: [^ -> . Expr, {'$'}]
   Expr -> 1
   'INT' -> 2
1: [Expr -> Expr . '-' Expr, {'-', '*', '$'}]
   [Expr -> Expr . '*' Expr, {'-', '*', '$'}]
   [^ -> Expr ., {'$'}]
   '-' -> 3
   '*' -> 4
2: [Expr -> 'INT' ., {'-', '*', '$'}]
3: [Expr -> Expr '-' . Expr, {'-', '*', '$'}]
   'INT' -> 2
   Expr -> 5
4: [Expr -> Expr '*' . Expr, {'-', '*', '$'}]
   Expr -> 6
   'INT' -> 2
5: [Expr -> Expr . '-' Expr, {'-', '*', '$'}]
   [Expr -> Expr . '*' Expr, {'-', '*', '$'}]
   [Expr -> Expr '-' Expr ., {'-', '*', '$'}]
   '*' -> 4
   '-' -> 3
6: [Expr -> Expr . '-' Expr, {'-', '*', '$'}]
   [Expr -> Expr . '*' Expr, {'-', '*', '$'}]
   [Expr -> Expr '*' Expr ., {'-', '*', '$'}]
   '*' -> 4
   '-' -> 3

What this shows us is the stategraph (i.e. a statemachine) that our grammar has been
transformed into and the states in which the conflicts have occurred.

It is possible, with a bit of effort, to understand that stategraph and the
conflicts which have occurred. However, I’m not going to go into more detail here,
because most readers will probably already have guessed that it’s very
hard to make sense of conflicts on large grammars. I’d liken it, roughly
speaking, to resolving whole-program type inference errors: the errors reported are
correct, but don’t necessarily correspond to the points in your program/grammar
that you would consider need fixing.

While I’m sure that it’s possible to improve the way that conflicts are
reported, to my surprise, I’ve developed plenty of grammars without
being troubled much by conflicts. Indeed, the only time
I’ve ever had to try and understand conflicts is when a large existing
grammar needs updating to a new external specification, which is not common.
In most cases, I’m creating a new, or tweaking an
existing, small grammar. Then, just as with languages using type inference, I
find it best to save and compile after nearly every change. If this
does identify a conflict, I know what change caused it, and it then tends to
be fairly obvious what a plausible fix is. Not only do I not bother
worrying about which state in the stategraph is involved, I don’t even
bother checking whether the conflict(s) are shift/reduce, reduce/reduce,
or accept/reduce.

Honestly, I’ve only encountered one realistic counter-example which is
– wait for it – mathematical expressions. It is surprisingly
difficult to encode this as an LR grammar, because mathematics’ syntax rules
are complex, and nearly every naive grammar for them is ambiguous. Fortunately, because
it’s such a common example, solutions to this abound on the web. Here’s
the classic solution:

%start Expr
%%
Expr: Expr "-" Term
    | Term
    ;
Term: Term "*" Factor
    | Factor
    ;
Factor: "INT"
    ;

It has no conflicts, which means that Yacc has statically proven that it is
unambiguous! It handles precedence – 2-3*4 parses as
2-(3*4) – and associativity – 2-3-4 parses as
(2-3)-4 – correctly.
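
To see why the left-recursive rule forces left associativity, it can help to
sketch the only derivation of 2-3-4 that this grammar permits (eliding the
Term and Factor steps):

Expr => Expr "-" Term
     => (Expr "-" Term) "-" Term
     => ((Term) "-" Term) "-" Term
     => (2-3)-4

Since an Expr can only appear to the left of a “-”, the earlier subtraction
has no choice but to nest inside the later one’s left-hand side.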

Over time, I’ve come to view LR parsing as equivalent to static typing:
occasionally annoyingly restrictive, but providing enough static guarantees
to be worth the annoyance for important software. It’s important to remember that LR
isn’t magic: while it will stop you writing an ambiguous grammar, it won’t stop
you writing an incorrect grammar for the language you wish to parse.
For example, although LR will prevent you making a rule both left and right
associative, you still have to choose correctly whether it should be left or
right associative.
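
For example, both of the following rules are conflict-free, and swapping one
for the other silently changes the meaning of programs (a small sketch; Term
is as in the grammar above):

Expr: Expr "-" Term | Term ;   /* left associative: 2-3-4 is (2-3)-4 */
Expr: Term "-" Expr | Term ;   /* right associative: 2-3-4 is 2-(3-4) */

Yacc can prove each version unambiguous, but only you can say which one
matches the language you intend.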

Performance

People often worry about parsing performance in general, and LR parsing
performance in particular, although almost always without cause on modern
computers. For example, if I take Java’s grammar (which is unusually big, and
therefore slow to parse) and the LR parsing system I wrote (which has been
only moderately optimised) I can happily parse many tens of thousands of
lines of code per second on my 3 year old computer. Unless you’ve got
billions of lines of source code, or millions of users, this is surely fast
enough.

I suspect that parsing performance worries date back to the period when
parsing techniques were under heavy development. LR parsing was invented in
1965, a time when computers were painfully slow and resource poor. LR parsing
works by generating a statetable at compile-time that is then interpreted at
run-time. Those statetables were far too big to be practical on the machines
of the day, so two solutions were invented to solve this problem.
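
To make that split concrete, here is a minimal sketch in Python of the
run-time half of an LR parser. The statetable below is hypothetical and
hand-built for the tiny grammar Expr: Expr "-" "INT" | "INT" (a real parsing
system generates such a table from the grammar), but the driver loop has the
same shape whatever the table:

# Rule 1 is Expr: Expr "-" "INT"; rule 2 is Expr: "INT". For each rule we
# record its left-hand side and how many symbols its body consumes.
RULES = {1: ("Expr", 3), 2: ("Expr", 1)}
# ACTION maps (state, next token) to shift/reduce/accept; GOTO maps
# (state, nonterminal) to the state to move to after a reduce.
ACTION = {
  (0, "INT"): ("shift", 1),
  (1, "-"): ("reduce", 2), (1, "$"): ("reduce", 2),
  (2, "-"): ("shift", 3),  (2, "$"): ("accept",),
  (3, "INT"): ("shift", 4),
  (4, "-"): ("reduce", 1), (4, "$"): ("reduce", 1),
}
GOTO = {(0, "Expr"): 2}

def lr_parse(tokens):
  tokens = tokens + ["$"]  # "$" marks the end of the input
  stack = [0]              # a stack of states
  i = 0
  while True:
    act = ACTION.get((stack[-1], tokens[i]))
    if act is None:
      return "Syntax error"
    if act[0] == "shift":
      stack.append(act[1])
      i += 1
    elif act[0] == "reduce":
      (lhs, n) = RULES[act[1]]
      del stack[len(stack) - n:]            # pop one state per body symbol
      stack.append(GOTO[(stack[-1], lhs)])  # then follow the goto entry
    else:
      return "Success"

print(lr_parse(["INT", "-", "INT", "-", "INT"]))   # prints: Success

This toy ACTION table has ten entries; the tables for real programming
language grammars are vastly bigger, which is what drove the two solutions
described next.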

First, algorithmic subsets of LR (e.g. LALR, SLR) were invented that reduce
the size of statetables, at the cost of reducing the number of grammars that
they can accept (i.e. some LR grammars are not valid LALR grammars). In
practise, these subsets are annoying to use: they cause some seemingly
reasonable grammars to be rejected; and understanding why a grammar has been
rejected can require a deep understanding of the algorithm.

Second, since 1977 we have known that you can substantially shrink LR
statetables without restricting the grammars accepted. When combined with a
couple of other techniques to squeeze the statetable’s memory footprint, even
the most lowly modern machine can run an arbitrary LR parser at impressive
speeds.

Error recovery

When I’m programming, I make a truly embarrassing number of syntax errors. It
is vital that the parser I’m using accurately reports where I’ve
made such an error: most parsers, including LR parsers, do a decent enough
job in this regard. It is then nice if the parser recovers from
my error, allowing it to continue parsing.

The very best recursive descent parsers do a good job of error recovery. LL parsing systems also often do
a tolerable job for arbitrary LL grammars.

Unfortunately, it is fair to say that LR parsing systems such
as Yacc do a poor job. Yacc itself makes use of error tokens, but
the results are so bad that I find Yacc parsers with error recovery more
frustrating to use than those without. However, we can do
much better
for arbitrary LR grammars, and hopefully more LR parsers
will give good error messages in the future.

LR parsing: aesthetics

I’m now going to turn to a fuzzier factor: readability. Whether explicitly or
implicitly, people need to know the syntax rules
of the language they are using. Some programming language designers assume, or
hope, that giving users a few code examples is equivalent to telling them the
language’s syntax rules. This works for most cases, as we can largely rely on a shared
cultural understanding of “what a programming language looks like”, but
experienced programmers know the dangers of ignoring dank corners such as
operator precedence. At a deeper level, those who implement a compiler, or
even just an accurate syntax highlighter, need to know precisely what a
language’s syntax rules are. In my opinion, the readability of a parser is
an important factor in enabling accurate tooling for, and use of, a programming
language.

To my mind, of the various grammars and parsers presented in
this post, the easiest to read is the version for generalised parsers, because
it most closely matches the informal mathematical grammar I was taught as a
child. However, this readability comes at a cost: since the grammar is
potentially ambiguous, I sometimes misjudge which way a given input will
be parsed after disambiguation.

The hardest to read is, without a shadow of a doubt, the recursive
descent parser. It’s the longest, the most detailed, and the one lacking any
underlying theory to guide the reader.

The lack of left recursion in LL parsing makes many grammars awkward to
read. A surprising way of seeing that is via the fact that many (though
not all) LR grammars can be converted to LL semi-mechanically (see e.g.
this
translation of roughly the same LR grammar as used in this post to an
LL equivalent): the resulting LL grammar is rarely easier to read
after the conversion.
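
As a small illustration of the sort of transformation involved, the
left-recursive LR rule

Expr: Expr "-" Term
    | Term
    ;

has to become something like the following in LL form (a sketch; the helper
rule name ExprTail is mine):

Expr: Term ExprTail ;
ExprTail: "-" Term ExprTail
        | /* empty */ ;

The original reads directly as “an Expr is an Expr minus a Term”; after the
conversion, both that reading and the left associativity it expressed have
disappeared into a helper rule.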

LR grammars thus occupy an important niche. They’re often close
in readability to an arbitrary CFG; since left associativity is so common,
they’re nearly always easier to read than LL grammars; and, if you’ll allow a
little poetic license, they’re infinitely easier to read than
a recursive descent parser.

Of course, I’m clearly a bit biased, so perhaps these words from Guy Steele might
be more compelling:

[Be] sure that your language will parse. It seems stupid … to sit down and
start designing constructs and not worry about how they will fit together. You
can get a language that’s difficult if not impossible to parse, not just for a
computer, but for a person. I use Yacc constantly as a check of all my language
designs, but I very seldom use Yacc in the implementation. I use it as a
tester, to make sure it’s LR(1) … because if a language is LR(1) it’s more
likely that a person can deal with it.

Dynamic Languages Wizards
Series – Panel on Language Design

Summary

Having spent years trying nearly every other possible approach to
parsing, I now firmly believe that LR parsing is the best approach for the vast
majority of purposes: it has the strongest practical safety guarantees,
allows good grammar readability, and has decent performance.
In particular, I hope future
programming language authors take Guy Steele’s advice above, and make their
reference grammar LR compliant.

Personally, I’ve put my money where my mouth is: I’ve put a lot of work into grmtools,
a Yacc-compatible LR parsing system for Rust. grmtools isn’t yet perfect, or
complete, nor is it in any way fully optimised — but it’s more than good
enough for many purposes, and I intend to keep maintaining it for some
time to come. I hope it’s one small step towards encouraging people to
rediscover the beauty and utility of LR parsing.

Acknowledgements: Thanks to Lukas Diekmann and Naveneetha Vasudevan
for comments on drafts of this post. Thanks to Roberto Ierusalimschy and
Terence Parr for answering my queries. All opinions, and any errors or
infelicities, are very much due to me!


Footnotes

[1] Parsing Expression Grammars (PEG)s and “parser combinators” in some
functional languages are just recursive descent parsers in disguise.
[2] My favourite example of this is best expressed as a Parsing Expression Grammar (PEG):

r <- a / ab

or as a hand-written recursive descent parser:

def r(s, i):
    if i + 1 <= len(s) and s[i] == "a":
        return ...
    elif i + 2 <= len(s) and s[i:i+2] == "ab":
        return ...

Both of these parsers successfully parse the string ‘a’ but fail
to parse the string ‘ab’. As soon as ‘a’ is
matched, the rule succeeds, which leaves ‘b’ unmatched; neither
parser tries to match ‘ab’ directly.

[3] I believe that it’s still an open question as to how many distinct
disambiguation operators there need to be.
[4] In Converge I ended up cheating, encoding
some default disambiguation rules into the parser. When I did this I didn’t
really understand the problem that I’d encountered, nor did I realise that my
“solution” was not curing, but merely delaying, the problem. The only thing
worse than encountering an ambiguous parse is finding out that your
input has been disambiguated-by-default in the wrong way.
[5] To give a rough idea of scale: Rust’s
parser
is about 10KLoC and javac’s
parser
about 4.5KLoC.
[6] Yes, I wrote more
than
one. I no longer recommend it, because Earley’s original algorithm has a bug in
it, and descriptions of a/the fix seem either to be incorrect, or to destroy
the beauty of the algorithm.
[7] Michael Van De Vanter first pointed Wagner’s work out to me. However,
I didn’t appreciate it for what it was. I then forgot about it, and stumbled
across it “independently” at a later point, before somehow realising that it
was what Michael had already suggested. I have since learnt to listen to his advice
more carefully, and have benefited much from it!
[8] It’s also the basis of Tree-sitter, which
may be the best long-term argument I know of for programming languages
having an LR grammar!
[9] Perhaps I was lucky not to take a compilers course myself (my university did
not offer one at that point), as it meant I couldn’t develop the most
severe of allergic reactions to LR parsing.
[10] From least to most expressive we thus have: regular expressions, LL, LR,
unambiguous, CFG. In other words, regular expressions are a strict subset of
LL, LL a strict subset of LR, and so on. Probably the most complete description of
the hierarchy I know can be found on p89 of Alexander
Okhotin’s talk
(where arrows mean “more expressive” and “ordinary” means “CFG”).
Note that recursive descent does not
fit into this hierarchy at all — formally speaking, we know that it
accepts a disjoint set of languages relative to CFGs but, because PEGs have no
underlying theory that we know of, we are unable to define that set
more precisely.

Another interesting case is the ALL(*) algorithm
which underlies ANTLR. ALL(*) accepts a strict
superset of LL (including many ambiguous grammars), but is disjoint with LR
since ALL(*) doesn’t support left-recursion. However, ANTLR can remove direct
left-recursion before invoking ALL(*), so some grammars that might seem
impossible to parse with ALL(*) can in fact be parsed by it. Bearing in
mind that we’re talking about infinite sets, and that I don’t think we have a
formal proof of the following statement, I think it would be fair to say that
the ALL(*) subset of CFGs is bigger than the LR subset.

[11] There are bigger unambiguous subsets such as
LR-Regular (or “LRR”)
grammars. However, as far as I can tell, these are probably not practical.
For example, it is not
decidable whether an arbitrary grammar is LRR or not.
Marpa is an LRR-based
system which claims that later advances have made it practical, though I
haven’t investigated how that claim relates to Szymanski and Williams’ work.
[12] Berkeley Yacc actually implements LALR, but for this example it’s
indistinguishable from LR. I’ll discuss LALR a little later in this post.
[13] Although I’ve presented the conflicts as errors, in Yacc they’re actually
warnings, because it has “default conflict resolution” rules (see Section 5 of the Yacc
manual). In other words, Yacc is willing to take in an ambiguous grammar and
automatically disambiguate it to produce an unambiguous grammar. In general,
I do not recommend making use of this feature.
[14] Although it’s rarely remarked upon, the traditional splitting of “parsing” into
separate lexing and parsing phases is an important part of the ambiguity story.
Not only is it easy for the lexer to identify for as a keyword and
forest as an identifier, but the parser then only has to distinguish
between token types and not token values.
[15] Imagine a Haskell or RPython program where none of the functions have explicit types.
The problem when programming in such systems is that errors are often
reported far away from where they were caused. In other words,
I might make a
static type error in one function, but the type inferencer will detect the
resulting error in another function. While type error messages have become much better
over time, they can never match human expectations in all cases.
[16] The best conflict reports I’ve seen come from LALRPOP.
[17] Off-hand, I can only think of a single example: when Lukas tried
to adapt this Java 7
grammar
to Java 8. Until that point, grmtools didn’t have a way
of reporting details about conflicts because I hadn’t needed such a feature!

The Java specification used to pride itself on
presenting a simple, machine-proven, unambiguous grammar in an appendix. Unfortunately, at
some point, this grammar appears to have been dropped from the specification, and I
suspect that the newly introduced syntax has not been checked for possible
ambiguities. We quickly realised that a Java 8 grammar wasn’t important enough
to our work for us to invest the time in this, so I don’t know if it is
ambiguous or not.

[18] For the insatiably curious, the conflict types mean roughly:

  • shift/reduce: The LR parser can’t be sure whether it should advance the
    input by one token, or whether a parsing rule will have completed.
  • reduce/reduce: The LR parser can’t be sure which of two rules will have completed.
  • accept/reduce: The LR parser can’t be sure if the entire parse has
    completed or merely one rule has completed.

That final possibility is so rare that I’d forgotten it even exists before I
thought to fact-check this footnote!

[19] Roughly speaking, the fastest
supercomputer in the world at that point
ran about 10,000 times slower
than a decent desktop chip today.
[20] SLR is particularly restrictive. However, I’m not sure I’ve ever seen SLR used
in practise (though I know it was in the past), but LALR is still found in
Berkeley Yacc. Even though LALR is less restrictive than SLR, it can still
require real programming language grammars to be unpleasantly contorted in
places.
[21] Pager’s description is a little incomplete; it’s best paired with Xin Chen’s
thesis. From memory, neither mentions that the algorithm is
non-deterministic and can sometimes create unreachable states that can be
garbage collected to save a little more memory.
grmtool’s
implementation of this algorithm
goes into more detail on such matters and
also has the advantage of being runnable. However, Pager’s algorithm doesn’t quite
work properly if you use Yacc’s conflict resolution feature. At some point I should
implement the
IELR algorithm
to solve this problem.
[22] For example, encoding sparse tables (e.g. in Rust with the sparsevec crate), and packing
vectors of small integers (e.g. with the packedvec crate). It’s a
long time since I’ve thought about these aspects: from memory,
one can do even better than these techniques, but they’re already
effective enough that we didn’t feel the need to look further at that point.
[23] There is one major exception in C-ish syntaxes: missing curly brackets. The
resulting errors are typically reported many lines after the point that
a human would consider the cause of the problem.
[24] rustc gives the best syntax error messages of any compiler / parser I’ve ever
used.
[25] Recent years have reinforced a long-standing trend: programmers don’t like
learning languages with unfamiliar syntaxes. For better or worse, C-ish syntax is
likely to remain the dominant cultural force in programming languages for decades
to come.
[26] That doesn’t mean that the eventual compiler has to contain an LR
parser (though I’d start with an LR parser and only consider moving to
something else if I had millions of users), but the parser it does contain
should be entirely compliant with the reference LR grammar.

Unfortunately, for the foreseeable future, we will be stuck with
programming languages that have used other parsing formalisms: pity the poor
IDE author who has to deal with yet another language with only a recursive
descent parser instead of an LR grammar!

 
