
Write You a Forth, 0x04
-----------------------

:date: 2018-02-23 19:20
:tags: wyaf, forth

So, I lied about words being next. When I thought about it some more, what I
really need to do is start adding the stack in and adding support for parsing
numerics. I'll start with the stack, because it's pretty straightforward.

I've added a new definition: ``constexpr uint8_t STACK_SIZE = 128``. This goes
in ``linux/defs.h``, and the ``#else`` in the top ``defs.h`` will set a
smaller stack size for other targets. I've also defined a type called ``KF_INT``
that, on Linux, is an ``int32_t``::

        diff --git a/defs.h b/defs.h
        index 4dcc540..e070d27 100644
        --- a/defs.h
        +++ b/defs.h
        @@ -3,6 +3,9 @@
        
        #ifdef __linux__
        #include "linux/defs.h"
        +#else
        +typedef int KF_INT;
        +constexpr uint8_t STACK_SIZE = 16;
        #endif
        
        constexpr size_t       MAX_TOKEN_LENGTH = 16;
        diff --git a/linux/defs.h b/linux/defs.h
        index 57cdaeb..3740f5a 100644
        --- a/linux/defs.h
        +++ b/linux/defs.h
        @@ -4,4 +4,7 @@
        #include <stddef.h>
        #include <stdint.h>
        
        +typedef int32_t KF_INT;
        +constexpr uint8_t      STACK_SIZE = 128;
        +
        #endif
        \ No newline at end of file

It seems useful to be able to adapt the kind of numbers supported; an AVR might do
better with 16-bit integers, for example.
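
As a sketch of what that could look like, a 16-bit target might get its own
branch. This is hypothetical and not in the repository: it assumes the
``__AVR__`` macro that avr-gcc predefines, and the sizes are only illustrative::

        #include <stddef.h>
        #include <stdint.h>

        #if defined(__AVR__)
        // Hypothetical AVR settings: 16-bit cells and a small stack.
        typedef int16_t KF_INT;
        constexpr uint8_t STACK_SIZE = 16;
        #else
        // The Linux settings from the diff above.
        typedef int32_t KF_INT;
        constexpr uint8_t STACK_SIZE = 128;
        #endif

        // Anything that sizes a buffer from KF_INT can check its
        // assumption at compile time instead of finding out at run time.
        static_assert(sizeof(KF_INT) <= 4, "number buffers assume at most 32 bits");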

``stack.h``
^^^^^^^^^^^

The stack is going to be templated, because later on we'll need a ``double``
stack for floating point and a return address stack. Being a template means
everything will go in ``stack.h``. This is a pretty simple implementation that's
CS 101 material; I've opted to have the interface return ``bool``\ s for
everything to indicate stack overflow, underflow, and out-of-bounds access::

        #ifndef __KF_STACK_H__
        #define __KF_STACK_H__

        #include "defs.h"

        template <typename T>
        class Stack {
        public:
                bool   push(T val);
                bool   pop(T &val);
                bool   get(size_t, T &);
                size_t size(void) { return this->arrlen; };
        private:
                T arr[STACK_SIZE];
                size_t arrlen = 0;
        };

        // push returns false if there was a stack overflow.
        template <typename T>
        bool
        Stack<T>::push(T val)
        {
                if ((this->arrlen + 1) > STACK_SIZE) {
                        return false;
                }

                this->arr[this->arrlen++] = val;
                return true;
        }

        // pop returns false if there was a stack underflow.
        template <typename T>
        bool
        Stack<T>::pop(T &val)
        {
                if (this->arrlen == 0) {
                        return false;
                }

                val = this->arr[this->arrlen - 1];
                this->arrlen--;
                return true;
        }

        // get returns false on invalid bounds.
        template <typename T>
        bool
        Stack<T>::get(size_t i, T &val)
        {
                if (i >= this->arrlen) {
                        return false;
                }

                val = this->arr[i];
                return true;
        }

        #endif // __KF_STACK_H__

I'll put a ``Stack<KF_INT>`` in ``kforth.cc`` later on. For now, this gives me
an interface for the numeric parser to push a number onto the stack.
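
To show how the ``bool`` returns are meant to be used, here is a minimal,
self-contained sketch. It uses a cut-down copy of the stack (``MiniStack``, a
hypothetical name, with the capacity as a template parameter instead of
``STACK_SIZE``) so that it compiles on its own, and ``add_top2`` is likewise
made up for illustration; none of this is the code in the repository::

        #include <stddef.h>
        #include <stdint.h>

        // MiniStack is a hypothetical stand-in for Stack<T>; the capacity
        // N replaces STACK_SIZE so the sketch is self-contained.
        template <typename T, size_t N>
        class MiniStack {
        public:
                bool push(T val)
                {
                        if (arrlen >= N) {
                                return false;   // overflow
                        }
                        arr[arrlen++] = val;
                        return true;
                }

                bool pop(T &val)
                {
                        if (arrlen == 0) {
                                return false;   // underflow
                        }
                        val = arr[--arrlen];
                        return true;
                }

                bool get(size_t i, T &val)
                {
                        if (i >= arrlen) {
                                return false;   // out of bounds
                        }
                        val = arr[i];
                        return true;
                }

                size_t size(void) { return arrlen; }
        private:
                T      arr[N];
                size_t arrlen = 0;
        };

        // add_top2 pops two values and pushes their sum. It reports
        // failure without touching the stack if there are fewer than
        // two values on it.
        template <size_t N>
        bool
        add_top2(MiniStack<int32_t, N> &s)
        {
                int32_t a, b;

                if (s.size() < 2) {
                        return false;
                }

                s.pop(a);
                s.pop(b);
                return s.push(a + b);
        }

Checking ``size()`` up front means a failed operation leaves the stack as it
was, which should make errors easier to report once real words exist.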

``parse_num``
^^^^^^^^^^^^^

It seems like the best place for this is in ``parser.cc`` --- though I might
move it into a token processor later. The declaration for this goes in
``parser.h``, and the body is in ``parser.cc``::

        // parse_num tries to parse the token as a signed base 10 number,
        // pushing it onto the stack if it parses.
        bool
        parse_num(struct Token *token, Stack<KF_INT> &s)
        {
                KF_INT	n = 0;
                uint8_t i = 0;
                bool    sign = false;

It turns out you can't parse a zero-length token as a number...
::

                if (token->length == 0) {
                        return false;
                }

I'll need to negate the number later if it's negative, so it's worth checking
the first character for a minus sign now.
::

                if (token->token[i] == '-') {
                        i++;
                        sign = true;
                }

Parsing is done by checking whether each character is within the range of the
ASCII digits, ``'0'`` to ``'9'``. If it is, the working number is multiplied by
10 and the digit added; if not, the token isn't a number. Later on, I might add
separate functions for processing base 10 and base 16 numbers, and decide which
to use based on a prefix (like ``0x``).
::

                while (i < token->length) {
                        if (token->token[i] < '0') {
                                return false;
                        }

                        if (token->token[i] > '9') {
                                return false;
                        }

                        n *= 10;
                        n += (uint8_t)(token->token[i] - '0');
                        i++;
                }

If it was a negative number, then the working number has to be negated::

                if (sign) {
                        n *= -1;
                }

Finally, return the result of pushing the number on the stack. One thing that
might come back to get me later is that this makes it impossible to tell if a
failure to parse the number is due to an invalid number or due to a stack
overflow. This will be a good candidate for revisiting later.
::

                return s.push(n);
        }
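
One way to untangle the two failure modes would be to split parsing from
pushing and return a status code. The following is only a sketch of that idea,
not what the tagged code does: ``ParseStatus``, ``parse_int``, and
``parse_num_status`` are hypothetical names, and it takes a plain pointer and
length rather than a ``struct Token`` so that it stands alone. It also rejects
a lone ``-``, which the loop above would accept as zero::

        #include <stddef.h>
        #include <stdint.h>

        // Hypothetical status codes; none of these names are in kforth.
        enum ParseStatus {
                PARSE_NUM_OK,
                PARSE_NUM_INVALID,      // not a base 10 number
                PARSE_NUM_OVERFLOW      // parsed, but the stack was full
        };

        // parse_int parses a signed base 10 number out of buf without
        // touching any stack. It returns false for empty input, a lone
        // '-', or any non-digit character. As in the original, there is
        // no check that the value fits in 32 bits.
        bool
        parse_int(const char *buf, size_t length, int32_t &n)
        {
                size_t   i = 0;
                bool     sign = false;
                uint32_t acc = 0;       // unsigned: wrapping is well defined

                if (length > 0 && buf[0] == '-') {
                        sign = true;
                        i++;
                }

                if (i == length) {
                        return false;
                }

                while (i < length) {
                        if (buf[i] < '0' || buf[i] > '9') {
                                return false;
                        }
                        acc = (acc * 10) + (uint32_t)(buf[i] - '0');
                        i++;
                }

                // Negate in unsigned arithmetic, then convert; this
                // assumes the usual two's-complement conversion.
                n = (int32_t)(sign ? 0u - acc : acc);
                return true;
        }

        // parse_num_status works with any S that has a
        // bool push(int32_t), such as a Stack<KF_INT> with 32-bit KF_INT.
        template <typename S>
        ParseStatus
        parse_num_status(const char *buf, size_t length, S &s)
        {
                int32_t n;

                if (!parse_int(buf, length, n)) {
                        return PARSE_NUM_INVALID;
                }
                return s.push(n) ? PARSE_NUM_OK : PARSE_NUM_OVERFLOW;
        }

A caller could then print a different message for ``PARSE_NUM_OVERFLOW`` than
for ``PARSE_NUM_INVALID``.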

``io.cc``
^^^^^^^^^^

Conversely, it'll be useful to write a number to an ``IO`` interface. It
*seems* more useful right now to just provide a number → I/O function, but
that'll be easily adapted to a number → buffer function later. This will add
a real function to ``io.h``, which will require a corresponding ``io.cc``
(which also needs to be added to the ``Makefile``)::

        #include "defs.h"
        #include "io.h"

        #include <string.h>

        void
        write_num(IO &interface, KF_INT n)
        {

Through careful scientific study, I have determined that the most digits a
32-bit integer needs is 10 (sans the sign!). This will absolutely need to be
changed if ``KF_INT`` is ever moved to 64-bit (or larger!) numbers. There's a
TODO in the actual source code that notes this. ::

                char buf[10];
                uint8_t i = 10;
                memset(buf, 0, 10);

Because this is going out to an I/O interface, I don't need to store the sign
in the buffer itself and can just print it and negate the number. Negating is
important; I ran into a bug earlier where I didn't negate it and my subtractions
below were correspondingly off.
::

                if (n < 0) {
                        interface.wrch('-');
                        n *= -1;
                }

The buffer has to be filled from the end to the beginning to do the inverse of
the parsing method::

                while (n != 0) {
                        char ch = (n % 10) + '0';
                        buf[i--] = ch;
                        n /= 10;
                }

But then it can be just dumped to the interface::

                interface.wrbuf(buf+i, 11-i);
        }

``kforth.cc``
^^^^^^^^^^^^^^

And now I come to the fun part: adding the stack in. After including ``stack.h``,
I've added a stack instance to the top of the file::

        // dstack is the data stack.
        static Stack<KF_INT>	dstack;

It's kind of useful to be able to print the stack::

        static void
        write_dstack(IO &interface)
        {
                KF_INT	tmp;
                interface.wrch('<');
                for (size_t i = 0; i < dstack.size(); i++) {
                        if (i > 0) {
                                interface.wrch(' ');
                        }

                        dstack.get(i, tmp);
                        write_num(interface, tmp);
                }
                interface.wrch('>');
        }

Surrounding the stack in angle brackets is a cool stylish sort of thing, I
guess. All this is no good if the interpreter isn't actually hooked up to the
number parser::

        // The new while loop in the parser function in kforth.cc:
        while ((result = parse_next(buf, buflen, &offset, &token)) == PARSE_OK) {
                interface.wrbuf((char *)"token: ", 7);
                interface.wrbuf(token.token, token.length);
                interface.wrln((char *)".", 1);

                if (!parse_num(&token, dstack)) {
                        interface.wrln((char *)"failed to parse numeric", 23);
                }

                // Temporary hack until the interpreter is working further.
                if (match_token(token.token, token.length, bye, 3)) {
                        interface.wrln((char *)"Goodbye!", 8);
                        exit(0);
                }
        }

But does it blend?
^^^^^^^^^^^^^^^^^^

Hopefully this works::

        ~/code/kforth (0) $ make
        g++ -std=c++14 -Wall -Werror -g -O0   -c -o linux/io.o linux/io.cc
        g++ -std=c++14 -Wall -Werror -g -O0   -c -o io.o io.cc
        g++ -std=c++14 -Wall -Werror -g -O0   -c -o parser.o parser.cc
        g++ -std=c++14 -Wall -Werror -g -O0   -c -o kforth.o kforth.cc
        g++  -o kforth linux/io.o io.o parser.o kforth.o
        ~/code/kforth (0) $ ./kforth
        kforth interpreter
        <>
        ? 2 -2 30 1000 -1010
        token: 2.
        token: -2.
        token: 30.
        token: 1000.
        token: -1010.
        ok.
        <2 -2 30 1000 -1010>
        ? bye
        token: bye.
        failed to parse numeric
        Goodbye!
        ~/code/kforth (0) $

So there's that. Okay, next time *for real* I'll do a vocabulary thing.

As before, see the tag `part-0x04 <https://github.com/kisom/kforth/tree/part-0x04>`_.

Part B
^^^^^^^

So I was feeling good about the work above until I tried to run this on my
Pixelbook::

	$ ./kforth
	kforth interpreter
	<>
	? 2
	token: 2.
	ok.
	<5>
	
WTF‽ I spent an hour debugging this, only to realise it was an out-of-bounds
write in ``write_num``. This led me to check the behaviour at the maximum and
minimum values of ``KF_INT``, which in turn led me to revise ``io.cc``::

	#include "defs.h"
	#include "io.h"
	
	#include <string.h>
	
	static constexpr size_t	nbuflen = 11;
	
	void
	write_num(IO &interface, KF_INT n)
	{
	
		// TODO(kyle): make the size of the buffer depend on the size of
		// KF_INT.
		char buf[nbuflen];
		uint8_t i = nbuflen;
		memset(buf, 0, i);
		bool neg = n < 0;
	
		if (neg) {
			interface.wrch('-');
			n = ~n;
		}
	
		while (n != 0) {
			char ch = (n % 10) + '0';
			if (neg && (i == nbuflen)) ch++;
			
This was the source of the actual bug: the old ``buf[i--]`` wrote to ``buf[i]``
with ``i`` == ``nbuflen``, one past the end of the buffer, which stomped over
the value of ``n``, which is stored on the stack, too.
::

			buf[i-1] = ch;
			i--;
			n /= 10;
		}
	
		uint8_t buflen = nbuflen - i % nbuflen;
		interface.wrbuf(buf+i, buflen);
	}

A couple of things here: first, the magic numbers were driving me crazy. At one
point I changed all but one of their uses and forgot the last; fixing that
didn't fix the problem, but now I'm doing the right thing (or the more right
thing) and using a ``constexpr``. Another thing is changing from ``n *= -1``
to ``n = ~n``. This requires the check for ``neg && (i == nbuflen)`` to add
one to get it right, but it handles the case where *n* = -2147483648::

	(gdb) p -2147483648 * -1
	$1 = 2147483648
	(gdb) p ~(-2147483648)
	$2 = 2147483647
	
Notice that *$1* will overflow an ``int32_t``. In practice it wraps back
around to -2147483648, which means negating it this way has no effect (and,
strictly speaking, signed overflow is undefined behaviour in C++). *~n + 1*
is the two's complement negation.

Finally, I made sure to wrap the buffer length so that we never try to write a
longer buffer than the one we have.
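
For comparison, here is a sketch of another way to handle the minimum value:
take the magnitude in unsigned arithmetic, where ``0 - n`` is well defined, so
no ``ch++`` fix-up is needed. (That fix-up doesn't carry, so as written a value
such as -10 comes out wrong, and zero produces no digits at all.) This is a
number → buffer function and not a drop-in ``write_num``; ``format_num`` and
``NBUFLEN`` are hypothetical names, and it assumes a 32-bit ``KF_INT``::

        #include <stddef.h>
        #include <stdint.h>
        #include <limits>

        // Room for a sign plus the most decimal digits an int32_t can
        // need: digits10 is 9, and one more digit covers 2147483647.
        constexpr size_t NBUFLEN = std::numeric_limits<int32_t>::digits10 + 2;

        // format_num writes n in base 10 into buf, which must hold at
        // least NBUFLEN bytes, and returns the number of bytes written.
        // The output is not NUL-terminated.
        size_t
        format_num(int32_t n, char *buf)
        {
                char     tmp[NBUFLEN];
                size_t   i = NBUFLEN;
                bool     neg = n < 0;
                // Unsigned negation wraps modulo 2^32, so this is the
                // magnitude even for -2147483648.
                uint32_t mag = neg ? 0u - (uint32_t)n : (uint32_t)n;

                // Fill from the end; do/while so that zero yields "0".
                do {
                        tmp[--i] = (char)('0' + (mag % 10));
                        mag /= 10;
                } while (mag != 0);

                if (neg) {
                        tmp[--i] = '-';
                }

                size_t len = NBUFLEN - i;
                for (size_t j = 0; j < len; j++) {
                        buf[j] = tmp[i + j];
                }
                return len;
        }

With something like this, ``write_num`` could shrink to a call to
``format_num`` followed by ``interface.wrbuf(buf, len)``.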

I feel dumb for making such a rookie mistake, but I suppose that's what
happens when you stop programming for a living. The updated code is under the
tag `part-0x04-update <https://github.com/kisom/kforth/tree/part-0x04-update>`_.