Compiling a Lisp: Reader

Welcome back to the “Compiling a Lisp” series. This time I want to take a break
from compiling and finally add a reader. I’m getting frustrated
manually entering increasingly complicated ASTs, so I figure it is time. After
this post, we’ll be able to type in programs as text
and have our compiler generate ASTs for us! Magic. This will also add some nice
debugging tools for us. For example, imagine an interactive command line
utility in which we can enter Lisp expressions and the compiler prints out
human-readable assembly (and hex? maybe?). It could even run the code, too.
Check out this imaginary demo:

lisp> 1
; mov rax, 0x4
=> 1
lisp> (add1 1)
; mov rax, 0x4
; add rax, 0x4
=> 2
lisp>

Wow, what a thought.

The Reader interface

To make this interface as simple and testable as possible, I want the reader
interface to take in a C string and return an ASTNode *:

ASTNode *Reader_read(char *input);

We can add interfaces later to support reading from a FILE* or file
descriptor or something, but for now we’ll just use strings and line-based
input.

On success, we’ll return a fully-formed ASTNode*. But on error, well, hold
on. We can’t just return NULL. On many platforms, NULL is defined to be
0, which is how we encode the integer 0. On others, it could be defined to
be 0x55555555 or something equally silly. Regardless, its value might
overlap with our type encoding scheme in some unintended way.

This means that we have to go ahead and add another immediate object: an
Error object. We have some open immediate tag bits, so sure, why not. We can
also use this to signal runtime errors and other fun things. It’ll probably be
useful.
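
To make the NULL collision concrete, here is a small sketch. It assumes the integer encoding from the tag table (value shifted left by two, with 00 as the tag bits) and a platform where a null pointer converts to 0. The names encode_integer and collides_with_null are made up for this illustration and are not part of the series' code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef intptr_t word;
typedef uintptr_t uword;

// Assumed from the tag table: integers keep their value in the upper bits
// and use 00 as the low two tag bits.
enum { kIntegerShift = 2 };

// Hypothetical helper: encode a small integer as a tagged word.
uword encode_integer(word value) {
  // The value must fit in the bits left over after the tag.
  assert(value >= INTPTR_MIN / 4 && value <= INTPTR_MAX / 4);
  return (uword)value << kIntegerShift;
}

// Returns 1 if the encoded integer has the same bits as a null pointer
// (on platforms where a null pointer converts to 0).
int collides_with_null(word value) {
  return encode_integer(value) == (uword)NULL;
}
```

On such a platform, collides_with_null(0) is true, so a reader that returned NULL on failure could not tell "parse error" apart from "the integer 0".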

The Error object

Back to the object tag diagram. Below I have reproduced the tag diagram from
previous posts, but now with a new entry (denoted by <-). This new entry
shows the encoding for the canonical Error object.

High							     Low
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX00  Integer
0000000000000000000000000000000000000000000000000XXXXXXX00001111  Character
00000000000000000000000000000000000000000000000000000000X0011111  Boolean
0000000000000000000000000000000000000000000000000000000000101111  Nil
0000000000000000000000000000000000000000000000000000000000111111  Error <-
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX001  Pair
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX010  Vector
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX011  String
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX101  Symbol
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX110  Closure

If we wanted to, we could even add additional tag bits to the (currently all 0)
payload, to signal different kinds of errors. Maybe later. For now, we add a
tag constant and associated Object and AST functions:

const unsigned int kErrorTag = 0x3f; // 0b111111
uword Object_error() { return kErrorTag; }

bool AST_is_error(ASTNode *node) { return (uword)node == Object_error(); }
ASTNode *AST_error() { return (ASTNode *)Object_error(); }

That should be enough to get us going for now. Perhaps we could even convert
our Compile_ suite of functions to use this object instead of an int. It
would certainly be more informative. Maybe in a future post.
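
As a sketch of the "additional tag bits in the payload" idea, an error kind could live just above the low byte. Everything here except the 0x3f tag value is an assumption for illustration: the shift, the mask, the cap on kinds, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t uword;

enum {
  kErrorTag = 0x3f,     // 0b00111111, from the tag table
  kErrorTagMask = 0xff, // assumed: compare the whole low byte
  kErrorKindShift = 8,  // assumed: the kind lives just above the low byte
  kMaxErrorKind = 0xff, // assumed: arbitrary cap for this sketch
};

// Hypothetical: build an error object carrying a small kind code.
// Kind 0 yields the canonical error object, 0x3f.
uword make_error(unsigned kind) {
  assert(kind <= kMaxErrorKind);
  return ((uword)kind << kErrorKindShift) | kErrorTag;
}

// True for any error object, regardless of its kind.
bool is_error(uword value) { return (value & kErrorTagMask) == kErrorTag; }

// Extracts the kind code from an error object.
unsigned error_kind(uword value) {
  assert(is_error(value));
  return (unsigned)(value >> kErrorKindShift);
}
```

Note that with kinds in play, a check like AST_is_error would have to mask off the payload instead of comparing against a single value.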

Language syntax

Let’s get back to business and think about what we want our language to look
like. This is a Lisp series but really you could adapt your reader to read any
sort of syntax. No need for parentheses if you’re allergic.

I’m going to use this simple Lisp reader because it’s short and simple, so
we’ll have some parens.

First, our integers will look like integers in most languages — 0, 123,
-123.

You can add support for other bases if you like, but I don’t plan on it here.

Second, our characters will look like C characters — 'a', 'b', etc. Some
implementations opt for #'a but that has always looked funky to me.

Third, our booleans will be #t and #f. You’re also welcome to go ahead and
use symbols to represent the names, avoid special syntax, and have those
symbols evaluate to truthy and falsey values.

Fourth, the nil object will be (). We can also later bind the symbol nil to
mean (), too.

I’m going to skip error objects, because they don’t have any sort of
user-land meaning yet. They’re just used in compiler infrastructure right
now.

Fifth, pairs will look like (1 2 3), meaning
(cons 1 (cons 2 (cons 3 nil))). I don’t plan on adding support for dotted
pair syntax. Whitespace will be insignificant.
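
To picture the nesting, here is a standalone sketch of (1 2 3) built from cons cells. It uses a simplified Cell struct rather than the series' tagged ASTNode, and it lets NULL stand in for nil purely for brevity, which the real encoding deliberately avoids. All names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Simplified stand-in for a pair; NULL plays the role of nil in this sketch.
typedef struct Cell {
  int car;
  struct Cell *cdr;
} Cell;

// Allocates one pair holding car and pointing at the rest of the list.
Cell *cons(int car, Cell *cdr) {
  Cell *cell = malloc(sizeof *cell);
  assert(cell != NULL);
  cell->car = car;
  cell->cdr = cdr;
  return cell;
}

// Counts the cells in a nil-terminated list.
size_t list_length(const Cell *list) {
  size_t length = 0;
  for (; list != NULL; list = list->cdr) {
    length++;
  }
  return length;
}

// Frees every cell in the list.
void list_free(Cell *list) {
  while (list != NULL) {
    Cell *next = list->cdr;
    free(list);
    list = next;
  }
}
```

With these helpers, cons(1, cons(2, cons(3, NULL))) is the three-element list that the text (1 2 3) should read as.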

Sixth, symbols will look like any old ASCII identifier: hello, world,
fooBar. I’ll also include some punctuation in there, too, so we can use +
and - as symbols, for example. Or we could even go full Lisp and use
train-case identifiers.

I’m going to skip closures, since they don’t have a syntactic representation
— they are just objects known to the runtime. Vectors and strings don’t have
any implementation right now so we’ll add those to the reader later.

That’s it! Key points are: mind your plus and minus signs since they can appear
in both integers and symbols; don’t read off the end; have fun.

The Reader implementation

Now that we’ve rather informally specified what our language looks like, we can
write a small reader. We’ll start with the Reader_read function from above.

This function will just be a shell around an internal function with some more
parameters.

ASTNode *Reader_read(char *input) {
  word pos = 0;
  return read_rec(input, &pos);
}

This is because we need to carry around some more state to read through this
string. We need to know how far into the string we are. I chose to use an
additional word for the index. Some might prefer a char* instead. Up to
you.

With any recursive reader invocation, we should advance through all the
whitespace, because it doesn’t mean anything to us. For this reason, we have a
handy-dandy skip_whitespace function that reads through all the whitespace
and then returns the next non-whitespace character.

void advance(word *pos) { ++*pos; }

char next(char *input, word *pos) {
  advance(pos);
  return input[*pos];
}

char skip_whitespace(char *input, word *pos) {
  char c = '\0';
  for (c = input[*pos]; isspace(c); c = next(input, pos)) {
    ;
  }
  return c;
}
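
If you want to poke at the whitespace skipping in isolation, here is a self-contained variant. The name skip_whitespace_demo and the word typedef are assumptions for this sketch. It also casts to unsigned char before calling isspace, since the C library requires the argument to be representable as an unsigned char (or EOF).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

typedef intptr_t word; // assumed to match the series' word type

// Advances *pos past any whitespace and returns the first non-whitespace
// character, which may be the terminating NUL.
char skip_whitespace_demo(const char *input, word *pos) {
  assert(input != NULL && pos != NULL);
  // The cast keeps negative char values from reaching isspace.
  while (isspace((unsigned char)input[*pos])) {
    ++*pos;
  }
  return input[*pos];
}
```

Because the NUL terminator is not whitespace, the loop always stops at the end of the string at the latest.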

We can use skip_whitespace in the read_rec function to fetch the next
non-whitespace character. Then we’ll use that character (and sometimes the
following one, too) to determine what structure we’re about to read.

bool starts_symbol(char c) {
  switch (c) {
  case '+':
  case '-':
  case '*':
  case '>':
  case '=':
  case '?':
    return true;
  default:
    return isalpha(c);
  }
}

ASTNode *read_rec(char *input, word *pos) {
  char c = skip_whitespace(input, pos);
  if (isdigit(c)) {
    return read_integer(input, pos, /*sign=*/1);
  }
  if (c == '+' && isdigit(input[*pos + 1])) {
    advance(pos); // skip '+'
    return read_integer(input, pos, /*sign=*/1);
  }
  if (c == '-' && isdigit(input[*pos + 1])) {
    advance(pos); // skip '-'
    return read_integer(input, pos, /*sign=*/-1);
  }
  if (starts_symbol(c)) {
    return read_symbol(input, pos);
  }
  if (c == '\'') {
    advance(pos); // skip '\''
    return read_char(input, pos);
  }
  if (c == '#' && input[*pos + 1] == 't') {
    advance(pos); // skip '#'
    advance(pos); // skip 't'
    return AST_new_bool(true);
  }
  if (c == '#' && input[*pos + 1] == 'f') {
    advance(pos); // skip '#'
    advance(pos); // skip 'f'
    return AST_new_bool(false);
  }
  if (c == '(') {
    advance(pos); // skip '('
    return read_list(input, pos);
  }
  return AST_error();
}

Note that I put the integer cases above the symbol case because we want to
catch -123 as an integer instead of a symbol, and -a123 as a symbol instead
of an integer.

We’ll probably add more entries to starts_symbol later, but these should
cover the names we’ve used so far.

For each type of subcase (integer, symbol, list), the basic idea is the same:
while we’re still inside the subcase, add on to it.

For integers, this means multiplying and adding (concatenating digits, so to
speak):

ASTNode *read_integer(char *input, word *pos, int sign) {
  word result = 0;
  for (char c = input[*pos]; isdigit(c); c = next(input, pos)) {
    result *= 10;
    result += c - '0';
  }
  return AST_new_integer(sign * result);
}

It also takes a sign parameter so if we see an explicit -, we can negate the
integer.
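
Here is the digit accumulation on its own, detached from the AST types. For "123" the running result goes 1, then 1 * 10 + 2 = 12, then 12 * 10 + 3 = 123. The function name parse_digits and the word typedef are assumptions for this sketch, and like the reader above it does no overflow checking.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

typedef intptr_t word; // assumed to match the series' word type

// Reads decimal digits starting at input[*pos] and returns sign * value.
// Leaves *pos on the first non-digit. No overflow checking is done.
word parse_digits(const char *input, word *pos, int sign) {
  assert(input != NULL && pos != NULL);
  assert(sign == 1 || sign == -1);
  word result = 0;
  while (isdigit((unsigned char)input[*pos])) {
    // Shift the digits read so far one decimal place left, then add the
    // new digit.
    result = result * 10 + (input[*pos] - '0');
    ++*pos;
  }
  return sign * result;
}
```

The caller is expected to have already stepped past any leading + or -, the same way read_rec does before calling read_integer.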

For symbols, this means reading characters into a C string buffer:

const word ATOM_MAX = 32;

bool is_symbol_char(char c) {
  return starts_symbol(c) || isdigit(c);
}

ASTNode *read_symbol(char *input, word *pos) {
  char buf[ATOM_MAX + 1]; // +1 for NUL
  word length = 0;
  for (length = 0; length < ATOM_MAX && is_symbol_char(input[*pos]); length++) {
    buf[length] = input[*pos];
    advance(pos);
  }
  buf[length] = '\0';
  return AST_new_symbol(buf);
}

For simplicity’s sake, I avoided dynamic resizing. We only get at most symbols
of size 32. Oh well.

Note that symbols can also have trailing numbers in them, just not at the front
— like add1.
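
To see what the fixed-size buffer does with long names, here is a standalone sketch of the same copy loop. The names kAtomMax, is_symbol_start, is_symbol_part, and copy_symbol are made up for this example; the character set mirrors starts_symbol above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

enum { kAtomMax = 32 }; // same limit as ATOM_MAX in the post

// Characters allowed at the start of a symbol.
static bool is_symbol_start(char c) {
  switch (c) {
  case '+':
  case '-':
  case '*':
  case '>':
  case '=':
  case '?':
    return true;
  default:
    return isalpha((unsigned char)c) != 0;
  }
}

// Characters allowed after the first one: digits are fine here.
static bool is_symbol_part(char c) {
  return is_symbol_start(c) || isdigit((unsigned char)c);
}

// Copies at most kAtomMax symbol characters from input[*pos] into buf,
// NUL-terminates it, and returns the number of characters copied.
// buf must have room for kAtomMax + 1 bytes.
size_t copy_symbol(const char *input, size_t *pos, char *buf) {
  assert(input != NULL && pos != NULL && buf != NULL);
  size_t length = 0;
  while (length < kAtomMax && is_symbol_part(input[*pos])) {
    buf[length++] = input[*pos];
    ++*pos;
  }
  buf[length] = '\0';
  return length;
}
```

One consequence worth knowing: a 40-character name stops being copied after 32 characters, and the position is left pointing at the remaining 8, so the next read would see them as a separate symbol.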

For characters, we only have three potential input characters to look at:
quote, char, quote. No need for a loop:

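ASTNode *read_char(char *input, word *pos) {
  char c = input[*pos];
  // Reject an empty literal and also end of input, so we never read past
  // the terminating NUL.
  if (c == '\'' || c == '\0') {
    return AST_error();
  }
  advance(pos);
  if (input[*pos] != '\'') {
    return AST_error();
  }
  advance(pos);
  return AST_new_char(c);
}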

This means that input like '' or 'aa' will be an error.
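
The quote, char, quote check can be exercised without the AST types. In this sketch, parse_char_literal is a hypothetical name and -1 is an assumed stand-in for the Error object.

```c
#include <assert.h>
#include <stddef.h>

// Parses the remainder of a character literal such as 'a'. *pos must point
// just past the opening quote. Returns the character on success, or -1 on
// malformed input; *pos is only advanced on success.
int parse_char_literal(const char *input, size_t *pos) {
  assert(input != NULL && pos != NULL);
  char c = input[*pos];
  if (c == '\0' || c == '\'') {
    return -1; // end of input, or the empty literal ''
  }
  if (input[*pos + 1] != '\'') {
    return -1; // missing closing quote, as in 'aa'
  }
  *pos += 2; // consume the character and the closing quote
  return (unsigned char)c;
}
```

Checking for the NUL terminator first means the lookahead at *pos + 1 never reads past the end of the string.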

For booleans, we can tackle those inline because there’s only two cases and
they’re both trivial. Check for #t and #f. Done.

And last, for lists, it means we recursively build up pairs until we get to
nil:

ASTNode *read_list(char *input, word *pos) {
  char c = skip_whitespace(input, pos);
  if (c == ')') {
    advance(pos);
    return AST_nil();
  }
  ASTNode *car = read_rec(input, pos);
  assert(car != AST_error());
  ASTNode *cdr = read_list(input, pos);
  assert(cdr != AST_error());
  return AST_new_pair(car, cdr);
}

Note that we still have to skip whitespace in the beginning so that we catch
cases that have space either right after an opening parenthesis or right before
a closing parenthesis. Or both!
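
The version above asserts on malformed input. As an alternative sketch, here is a standalone list reader that reports failure to its caller instead. It is deliberately simplified: it only handles lists of non-negative integers, uses a plain Cell struct with NULL as nil, and every name in it is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

// Simplified pair; NULL plays the role of nil in this sketch.
typedef struct Cell {
  int car;
  struct Cell *cdr;
} Cell;

void list_free(Cell *list) {
  while (list != NULL) {
    Cell *next = list->cdr;
    free(list);
    list = next;
  }
}

// Reads the rest of a list of non-negative integers; *pos must point just
// past the opening '('. On success, stores the list in *out (NULL for the
// empty list) and returns true. On malformed input, including running into
// the end of the string, returns false.
bool read_int_list(const char *input, size_t *pos, Cell **out) {
  assert(input != NULL && pos != NULL && out != NULL);
  while (isspace((unsigned char)input[*pos])) {
    ++*pos;
  }
  if (input[*pos] == ')') {
    ++*pos;
    *out = NULL;
    return true;
  }
  if (!isdigit((unsigned char)input[*pos])) {
    return false;
  }
  int value = 0;
  while (isdigit((unsigned char)input[*pos])) {
    value = value * 10 + (input[*pos] - '0');
    ++*pos;
  }
  // Read the tail first, then put this element on the front of it.
  Cell *rest = NULL;
  if (!read_int_list(input, pos, &rest)) {
    return false;
  }
  Cell *cell = malloc(sizeof *cell);
  if (cell == NULL) {
    list_free(rest);
    return false;
  }
  cell->car = value;
  cell->cdr = rest;
  *out = cell;
  return true;
}
```

The shape is the same as read_list: skip whitespace, check for the closing parenthesis, otherwise read one element and recurse for the tail. The difference is that an unterminated list such as "(1 2" comes back as a failure rather than tripping an assert.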

That’s it — that’s the whole parser. Now let’s write some tests.

Tests

I added a new suite for reader tests. I figure it’s nice to have them grouped.
Here are some of the trickier tests from that suite that originally tripped me
up one way or another.

Negative integers originally parsed as symbols until I figured out I had to
flip the case order:

TEST read_with_negative_integer_returns_integer(void) {
  char *input = "-1234";
  ASTNode *node = Reader_read(input);
  ASSERT_IS_INT_EQ(node, -1234);
  AST_heap_free(node);
  PASS();
}

Oh, and the ASSERT_IS_INT_EQ and upcoming ASSERT_IS_SYM_EQ macros are
helpers that assert the type and value are as expected.

I also forgot about leading whitespace for a while:

TEST read_with_leading_whitespace_ignores_whitespace(void) {
  char *input = "   \t   \n  1234";
  ASTNode *node = Reader_read(input);
  ASSERT_IS_INT_EQ(node, 1234);
  AST_heap_free(node);
  PASS();
}

And also whitespace in lists:

TEST read_with_list_returns_list(void) {
  char *input = "( 1 2 0 )";
  ASTNode *node = Reader_read(input);
  ASSERT(AST_is_pair(node));
  ASSERT_IS_INT_EQ(AST_pair_car(node), 1);
  ASSERT_IS_INT_EQ(AST_pair_car(AST_pair_cdr(node)), 2);
  ASSERT_IS_INT_EQ(AST_pair_car(AST_pair_cdr(AST_pair_cdr(node))), 0);
  ASSERT(AST_is_nil(AST_pair_cdr(AST_pair_cdr(AST_pair_cdr(node)))));
  AST_heap_free(node);
  PASS();
}

And here’s some goofy symbol to make sure all these symbol characters work:

TEST read_with_symbol_returns_symbol(void) {
  char *input = "hello?+-*=>";
  ASTNode *node = Reader_read(input);
  ASSERT_IS_SYM_EQ(node, "hello?+-*=>");
  AST_heap_free(node);
  PASS();
}

And to make sure trailing digits in symbol names work:

TEST read_with_symbol_with_trailing_digits(void) {
  char *input = "add1 1";
  ASTNode *node = Reader_read(input);
  ASSERT_IS_SYM_EQ(node, "add1");
  AST_heap_free(node);
  PASS();
}

Nice.

Now, we could wrap up with the tests, but I did mention some fun features like
a REPL. Here’s a function repl that you could call from your main function
instead of running the tests.

int repl() {
  do {
    // Read a line
    fprintf(stdout, "lisp> ");
    char *line = NULL;
    size_t size = 0;
    ssize_t nchars = getline(&line, &size, stdin);
    if (nchars < 0) {
      fprintf(stderr, "Goodbye.\n");
      free(line);
      break;
    }

    // Parse the line
    ASTNode *node = Reader_read(line);
    free(line);
    if (AST_is_error(node)) {
      fprintf(stderr, "Parse error.\n");
      continue;
    }

    // Compile the line
    Buffer buf;
    Buffer_init(&buf, 1);
    int result = Compile_expr(&buf, node, /*stack_index=*/-kWordSize);
    AST_heap_free(node);
    if (result < 0) {
      fprintf(stderr, "Compile error.\n");
      Buffer_deinit(&buf);
      continue;
    }

    // Print the assembled code
    for (size_t i = 0; i < buf.len; i++) {
      fprintf(stderr, "%.02x ", buf.address[i]);
    }
    fprintf(stderr, "\n");

    Buffer_deinit(&buf);
  } while (true);
  return 0;
}

And we can trigger this mode by passing --repl-assembly:

int run_tests(int argc, char **argv) {
  GREATEST_MAIN_BEGIN();
  RUN_SUITE(object_tests);
  RUN_SUITE(ast_tests);
  RUN_SUITE(reader_tests);
  RUN_SUITE(buffer_tests);
  RUN_SUITE(compiler_tests);
  GREATEST_MAIN_END();
}

int main(int argc, char **argv) {
  if (argc == 2 && strcmp(argv[1], "--repl-assembly") == 0) {
    return repl();
  }
  return run_tests(argc, argv);
}

It uses all the machinery from the last couple posts and then prints out the
results in hex pairs. Interactions look like this:

sequoia% ./bin/compiling-reader --repl-assembly
lisp> 1
48 c7 c0 04 00 00 00
lisp> (add1 1)
48 c7 c0 04 00 00 00 48 05 04 00 00 00
lisp> 'a'
48 c7 c0 0f 61 00 00
lisp> Goodbye.
sequoia% 

Neat. A fun exercise for the reader might be going further and executing
the compiled code and printing the result, as above. The trickiest (because we
don’t have infrastructure for that yet) part of this will be printing the
result, I think.

Another fun exercise might be adding a mode to the compiler to print text
assembly to the screen, like a debugging trace. This should be straightforward
enough since we already have very specific opcode implementations.
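
As a rough starting point for that exercise, here is a sketch that goes the other way around: instead of printing text while emitting, it decodes the two instruction encodings that show up in the REPL session above (48 c7 c0 followed by a 32-bit immediate for mov rax, and 48 05 followed by a 32-bit immediate for add rax). It is not part of the series' compiler, the function names are made up, and it knows nothing about any other instruction.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Decodes a little-endian 32-bit immediate.
static uint32_t read_imm32(const unsigned char *p) {
  return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) |
         ((uint32_t)p[3] << 24);
}

// Writes a text form of the instruction at code[0..len) into out and returns
// the number of code bytes consumed, or 0 if the bytes are not one of the two
// encodings this sketch knows about.
size_t disassemble_one(const unsigned char *code, size_t len, char *out,
                       size_t out_size) {
  static const unsigned char kMovRaxImm32[] = {0x48, 0xc7, 0xc0};
  static const unsigned char kAddRaxImm32[] = {0x48, 0x05};
  assert(code != NULL && out != NULL && out_size > 0);
  if (len >= 7 && memcmp(code, kMovRaxImm32, sizeof kMovRaxImm32) == 0) {
    snprintf(out, out_size, "mov rax, 0x%" PRIx32, read_imm32(code + 3));
    return 7;
  }
  if (len >= 6 && memcmp(code, kAddRaxImm32, sizeof kAddRaxImm32) == 0) {
    snprintf(out, out_size, "add rax, 0x%" PRIx32, read_imm32(code + 2));
    return 6;
  }
  out[0] = '\0';
  return 0;
}
```

Calling it in a loop over the buffer, advancing by the returned byte count each time, would turn the hex for (add1 1) back into the two lines from the imaginary demo at the top of the post.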

Anyway, thanks for reading. Next time we’ll get back to compiling and tackle
let-expressions.

