sandbox/doc/part-0x05.rst

Write You a Forth, 0x05
-----------------------

:date: 2018-02-27 08:06
:tags: wyaf, forth

NB: Today's update was pretty large, so I don't show all of the code; this is
what ``git`` is for.

Today I need to start actually doing things with tokens. This requires two
things:

1. Some idea of what a word is, and
2. A dictionary of words

I started taking some notes on this previously, and I think there are a few
kinds of words that are possible:

1. Numbers (e.g. defining a variable)
2. Built-in functions
3. Lambda functions (that is, user-defined functions).

Stage 1 really only needs to incorporate #2, so that's what I'll focus on for
now. However, to prepare for the future, I'm going to define a ``Word`` base
class and inherit from there. This interface is going to need to be
stack-aware, so what I've done is define a ``System`` struct in ``system.h``::

        #ifndef __KF_CORE_H__
        #define __KF_CORE_H__

        #include "defs.h"
        #include "stack.h"

        typedef struct _System {
                Stack<KF_INT>        dstack;
                IO                  *interface;
        } System;


        #endif // __KF_CORE_H__

This will let me later add in support for the return stack and other things
that might be useful. Other ideas: adding something like an ``errno``
equivalent to indicate the last error, and storing a dictionary of words. This
will need some restructuring, though. I've already moved the I/O into the
system as well. This took some finangling in ``kforth.cc``; I'm eliding the
diff here because it's so long, but it's basically a ``sed -i -e
's/interface./sys->interface.``.

The Word interface
^^^^^^^^^^^^^^^^^^

Now I can start defining a Word.h. Maybe this is a case of when you have an
object-oriented language, everything looks like a class, but I decided to
design an abstract class for a Word and implement the first concrete class,
**Builtin**. What I came up with was::

    class Word {
    public:
            virtual ~Word() {};

The *eval* method takes a ``System`` structure and executes some function.
::

            virtual bool  eval(System *) = 0;

The dictionary is a linked list, so next is used to traverse the list.
::

            virtual Word *next(void) = 0;

The ``match`` method is used to determine whether this is the word being
referred to.
::

            virtual bool  match(struct Token *) = 0;

Finally, ``getname`` will fill in a ``char[MAX_TOKEN_SIZE]`` buffer with the
word's name.
::

            virtual void  getname(char *, size_t *) = 0;
    };

With the interface defined, I can implement ``Builtins`` (I've elided the
common interface from the listing below to focus on the implementation)::

        class Builtin : public Word {
        public:
                ~Builtin() {};
                Builtin(const char *name, size_t namelen, Word *head, bool (*fun)(System *));

        private:
                char		 name[MAX_TOKEN_LENGTH];
                size_t		 namelen;
                Word		*prev;
                bool		(*fun)(System *);
        };

The ``bool`` works as a first pass, but I think I'm going to have add some
notion of system conditions later on to denote why execution failed. One thing
that both ``pforth`` and ``gforth`` do that I don't yet do is to clear the
stack when there's an execution failure. At least, they clear the stack with an
unrecognised word. The implementation is pretty trivial::

        #include "defs.h"
        #include "parser.h"
        #include "system.h"
        #include "word.h"

        #include <string.h>


        Builtin::Builtin(const char *name, size_t namelen, Word *head, bool (*target)(System *))
                : prev(head), fun(target)
        {
                memcpy(this->name, name, namelen);
                this->namelen = namelen;
        }

        bool
        Builtin::eval(System *sys)
        {
                return this->fun(sys);
        }

        Word *
        Builtin::next()
        {
                return this->prev;
        }

        bool
        Builtin::match(struct Token *token)
        {
                return match_token(this->name, this->namelen, token->token, token->length);
        }

        void
        Builtin::getname(char *buf, size_t *buflen)
        {
                memcpy(buf, this->name, this->namelen);
                *buflen = namelen;
        }

Right. Now to do something with this.

The system dictionary
^^^^^^^^^^^^^^^^^^^^^

The dictionary's interface is minimal::

        // dict.h
        #ifndef __KF_DICT_H__
        #define __KF_DICT_H__

        #include "defs.h"
        #include "parser.h"
        #include "system.h"
        #include "word.h"

        typedef enum _LOOKUP_ : uint8_t {
                LOOKUP_OK = 0,	     // Lookup executed properly.
                LOOKUP_NOTFOUND = 1, // The token isn't in the dictionary.
                LOOKUP_FAILED = 2    // The word failed to execute.
        } LOOKUP;

        void	init_dict(System *);
        LOOKUP	lookup(struct Token *, System *);

        #endif // __KF_DICT_H__

There's a modicum of differentiation between "everything worked" and "no it
didn't," and by that I mean the lookup can tell you if the word wasn't found
or if there was a problem executing it.

I added a ``struct Word *dict`` field to the ``System`` struct, since we're
operating on these anyways. I guess it's best to start with the lookup function
first so that when I started adding builtins later it'll be easy to just
recompile and use them.
::

        LOOKUP
        lookup(struct Token *token, System *sys)
        {
                Word	*cursor = sys->dict;
                KF_INT	 n;

I seem to recall from *Programming a Problem-Oriented Language* that Chuck
Moore advocated checking whether a token was a number before looking it up
in the dictionary, so that's the approach I'll take::

                if (parse_num(token, &n)) {
                        if (sys->dstack.push(n)) {
                                return LOOKUP_OK;
                        }
                        return LOOKUP_FAILED;
                }

The remainder is pretty much bog-standard linked list traversal::

                while (cursor != nullptr) {
                        if (cursor->match(token)) {
                                if (cursor->eval(sys)) {
                                        return LOOKUP_OK;
                                }
                                return LOOKUP_FAILED;
                        }
                        cursor = cursor->next();
                }

                return LOOKUP_NOTFOUND;
        }

This needs to get hooked up into the interpreter now; this is going to require
some reworking of the ``parser`` function::

        static bool
        parser(const char *buf, const size_t buflen)
        {
                static size_t		offset = 0;
                static struct Token	token;
                static PARSE_RESULT	result = PARSE_FAIL;
                static LOOKUP		lresult = LOOKUP_FAILED;
                static bool		stop = false;

                offset = 0;

                // reset token
                token.token = nullptr;
                token.length = 0;

                while ((result = parse_next(buf, buflen, &offset, &token)) == PARSE_OK) {
                        lresult = lookup(&token, &sys);
                        switch (lresult) {
                        case LOOKUP_OK:
                                continue;
                        case LOOKUP_NOTFOUND:
                                sys.interface->wrln((char *)"word not found", 15);
                                stop = true;
                                break;
                        case LOOKUP_FAILED:
                                sys.interface->wrln((char *)"execution failed", 17);
                                stop = true;
                                break;
                        default:
                                sys.interface->wrln((char *)"*** the world is broken ***", 27);
                                exit(1);
                        }

                        if (stop) {
                                stop = false;
                                break;
                        }
                }

                switch (result) {
                case PARSE_OK:
                        return false;
                case PARSE_EOB:
                        sys.interface->wrbuf(ok, 4);
                        return true;
                case PARSE_LEN:
                        sys.interface->wrln((char *)"parse error: token too long", 27);
                        return false;
                case PARSE_FAIL:
                        sys.interface->wrln((char *)"parser failure", 14);
                        return false;
                default:
                        sys.interface->wrln((char *)"*** the world is broken ***", 27);
                        exit(1);
                }
        }


Now I feel like I'm at the part where I can start adding in functionality. The
easiest first builtin: addition. Almost impossible to screw this up, right?
::

        static bool
        add(System *sys)
        {
                KF_INT	a = 0;
                KF_INT	b = 0;
                if (!sys->dstack.pop(&a)) {
                        return false;
                }

                if (!sys->dstack.pop(&b)) {
                        return false;
                }

                b += a;
                return sys->dstack.push(b);
        }

Now this needs to go into the ``init_dict`` function::

        void
        init_dict(System *sys)
        {
                sys->dict = nullptr;
                sys->dict = new Builtin((const char *)"+", 1, sys->dict, add);
        }

And this needs to get added into the ``main`` function::

        int
        main(void)
        {
                init_dict(&sys);
        #ifdef __linux__
                Console interface;
                sys.interface = &interface;
        #endif
                sys.interface->wrbuf(banner, bannerlen);
                interpreter();
                return 0;
        }

The moment of truth
^^^^^^^^^^^^^^^^^^^

Hold on to your pants, let's see what breaks::

        $ ./kforth
        kforth interpreter
        <>
        ? 2 3 +
        ok.
        <5>

Oh hey, look, it actually works. Time to add a few more definitions for good
measure:

+ the basic arithmetic operators `-`, `*`, `/`
+ the classic `SWAP` and `ROT` words
+ `DEFINITIONS` to look at all the definitions in the language

These are all pretty simple, fortunately. A few things that tripped me up,
though:

+ The *a* and *b* names kind of threw me off because I fill *a* first. This
  means it's the last number on the stack; this didn't matter for addition,
  but in subtraction, it means I had to be careful to do ``b -= a`` rather
  than the other way around.

+ pforth and gfortran both support case insensitive matching, so I had to
  modify the token matcher::

        bool
        match_token(const char *a, const size_t alen,
                const char *b, const size_t blen)
        {
                if (alen != blen) {
                        return false;
                }

                for (size_t i = 0; i < alen; i++) {
                        if (a[i] == b[i]) {
                                continue;
                        }

                        if (!isalpha(a[i]) || !isalpha(b[i])) {
                                return false;
                        }

The XOR by 0x20 is just a neat trick for inverting the case of a letter.
::

                        if ((a[i] ^ 0x20) == b[i]) {
                                continue;
                        }

                        if (a[i] == (b[i] ^ 0x20)) {
                                continue;
                        }

                        return false;
                }
                return true;
        }

+ I forgot to include the case for ``PARSE_OK`` in the result switch statement
  in the ``parser`` function, so I could get a line of code evaluated but then
  it'd die with "the world is broken."

+ When I tried doing some division, I ran into some weird issues::

        $ ./kforth
        kforth interpreter
        <>
        ? 2 5040 /
        ok.
        <<3C><>>

It turns out that in ``write_num``, the case where *n = 0* results in nothing
happening, and therefore the buffer just being written. This is a dirty thing,
but I edge cased this::

        $ git diff io.cc
        diff --git a/io.cc b/io.cc
        index 77e0e2a..a86156b 100644
        --- a/io.cc
        +++ b/io.cc
        @@ -24,6 +24,10 @@ write_num(IO *interface, KF_INT n)
                                n++;
                        }
                }
        +       else if (n == 0) {
        +               interface->wrch('0');
        +               return;
        +       }

                while (n != 0) {
                        char ch = (n % 10) + '0';

May the compiler have mercy on my soul and whatnot.

For you sports fans keeping track at home, here's the classic bugs I've
introduced so far:

1. bounds overrun
2. missing case in a switch statement

But now here I am with the interpreter in good shape. Now I can start
implementing the builtins in earnest!

As before, see the tag `part-0x05 <https://github.com/kisom/kforth/tree/part-0x05>`_.