kaashif's blog

Programming, with some mathematics on the side

LeetCode on a Z80 CPU from 1976

2022-02-10

A while ago, I soldered together a Z80 homebrew computer and ran some programs on it, then wrote a blog post about it.

I did end up running some toy programs and some BASIC games like Super Star Trek, but that got me thinking - how hard would it be to solve a real algorithmic problem on it?

It turns out it wasn't hard at all, thanks to the Z80 development kit (z88dk) and the hexload BASIC program that allows running arbitrary binaries from BASIC, without needing to reprogram the ROM. That's very handy since I don't have a ROM programmer!

Comparing compiled Z80 assembly to modern x86 assembly for the same C program is pretty cool - the Z80 is a slightly extended 8080 instruction set, so you might expect the programs to look pretty similar, but x86 is really a whole different beast.

Step 0: Building the computer

Step 0 is building the computer itself - look at that blog post I linked above for some tips, but I really recommend checking out the RC2014 website for the latest advice.

The computer has 8K ROM, 32K RAM, runs at 7.3728MHz and communicates over serial at 115,200 baud. It's not exactly a speed demon, but it's surprising what such a "slow" computer can do - that's still 7 million clock cycles per second.

Step 1: Writing the program

The problem I chose was the eternal classic Two Sum. The problem is as follows:

Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target.

You may assume that each input would have exactly one solution, and you may not use the same element twice.

You can return the answer in any order.

This is a really simple problem with an easy O(n2) solution, and a slightly less easy O(n) solution. The quadratic solution is to loop over all pairs and check them: this requires no extra data structures and indeed no extra libraries, it's just looping over an array.

The linear solution is to construct a hash set of the numbers in the array then loop over the array, checking if the required summand to get to the target result is in the array.

The linear solution isn't all that complicated, but involves a lot of extra stuff (there's no C standard library hash set). Here's my quadratic solution:

#include <stdio.h>
#include <stdlib.h>

void two_sum(int target, int length, int *nums) {
    for (int i = 0; i < length-1; i++) {
        for (int j = i+1; j < length; j++) {
            if (nums[i] + nums[j] == target) {
                printf("%d, %d\n", i, j);
                return;
            }
        }
    }
}

int main() {
    printf("twosum\n");

    int target;
    scanf("%d", &target);

    int length;
    scanf("%d", &length);

    int *nums = calloc(length, sizeof(int));

    for (int i = 0; i < length; i++) {
        scanf("%d", nums+i);
    }

    two_sum(target, length, nums);

    free(nums);
}

The first thing to notice is that I'm just using C standard library functions like printf and calloc - z88dk lets you write "normal" C code almost as if you're writing code for a PC. The difference is that the z88dk standard library is mostly hand-written Z80 assembly, with over 250k lines of code.

Take a look at the z88dk source on GitHub, it's crazy the amount of work that's gone into this. They claim it's the largest repo of Z80 assembler online, and I believe them.

That means we literally can just compile the same program for Z80 and x86 and compare them!

Step 2: Compiling the program

Once you install z88dk, you'll have the zcc compiler. Make sure to run make install to install the compiler to /usr/local, and you can invoke it with zcc. Full instructions are here: https://github.com/z88dk/z88dk/wiki/installation

Assuming the source code above is in twosum.c, you can compile it with this:

$ zcc +rc2014 -subtype=basic -clib=new twosum.c -o twosum -create-app --c-code-in-asm

This uses the "new" C library. I couldn't get the recommended command from the hexload instructions page to work, which used -clib=sdcc_iy, it complained of a missing binary called zsdcpp - presumably a C preprocessor. Even after symlinking a C preprocessor to have that name, I was still getting errors, so I gave up and used the command above, which worked great.

This outputs a few .bin files and an .ihx file - we are interested in the .ihx file.

Step 3: Uploading the program to the computer

I have an RC2014 running Microsoft BASIC from 1978 and a serial connection. There's no way to run an HTTP server on my PC and download the files to the RC2014 or copy the file from a disk - there isn't even any non-volatile storage except the ROM, and certainly no network.

This is where hexload comes in: https://github.com/RC2014Z80/RC2014/tree/master/BASIC-Programs/hexload.

Hexload is a program that allows you to transfer binaries encoded in Intel HEX format (which is just hex encoded binary data with some metadata and other stuff, see the gritty details here) over a serial line to a computer running Microsoft BASIC, then run them.

The key mechanism behind this is that, after we write the binary to memory, BASIC has a command USR(x) that tells the CPU to jump to an address with a user program and start executing. That is, we boot to BASIC, load the program into memory with hexload, then run

PRINT USR(0)

to run our program. This is pretty exciting. Full instructions are on the hexload page linked above, including the slowprint.py program needed to make sure the program isn't printed too fast (which would cause characters to be dropped).

First, I connect the serial cable, power on the RC2014, then connect with screen to my USB FTDI serial cable (sudo since I couldn't be bothered to fix permissions on the device file):

$ sudo screen /dev/ttyUSB0 115200

Then we press the reset button on the clock board or on the backplane to show the boot prompt:

Z80 SBC By Grant Searle

Memory top?

According to the hexload instructions, I put in 35071:

Z80 SBC By Grant Searle

Memory top? 35071
Z80 BASIC Ver 4.7b
Copyright (C) 1978 by Microsoft
1916 Bytes free
Ok

Now we're ready to send a BASIC program over the wire. For that we can use slowprint.py.

First we send hexload over:

$ python3 slowprint.py < hexload.bas | sudo tee /dev/ttyUSB0

I use tee so we can see if there are any discrepancies between what's sent and received. A few times one or two characters were missed and there was an SN Error in BASIC. If you get that, reset the RC2014 and try again, it'll probably work next time.

Once that finishes, you'l be greeted with this:

Loading Data
Start Address: 8900
End Address:   89D7
USR(0) -> HexLoad
HEX LOADER by Filippo Bergamasco & feilipu for z88dk
:

And now we can start sending over our two sum program. Run this:

$ python3 slowprint.py < twosum.ihx | sudo tee /dev/ttyUSB0

The upload may take a while, but after it's done, the program should automatically start executing and we can input our example (the first example from LeetCode):

twosum

9
4
2
7
11
15
0, 1
 0
Ok

That is, the target was 9 and the input array was [2,7,11,15]. We got 0, 1, which is correct - indices 0 and 1 are 2 and 7 which do add up to 9.

Z80 vs x86

That's great, but what does the machine code look like compared to the devil we know - x86? And how big is it?

For comparison, I compiled twosum.c on my PC, optimizing binary size, with glibc and gcc 11:

$ cc twosum.c -g -Os -o twosum.x86
$ objdump -M intel -S twosum.x86

This prints the assembly we get from compiling twosum.c on a PC. The most interesting part to me is the two_sum function, since it should be rather similar - a few for loops, a branch, and a function call. How big could the differences be? I won't paste both binaries here in full, but there are a few interesting points.

Using the stack is horrific

On Z80, only push and pop can read or write directly to the stack. You may wonder why we can't just do:

ld hl, (sp)

to load whatever's at the stack pointer into the hl register, and the answer is that you just can't - the Z80 had to maintain binary compatibility with the Intel 8080 (an 8-bit CPU) and didn't have that much freedom in how it could extend the ld instruction. That's why, when we want to read something from the stack into hl, we end up doing this:

pop hl
push hl

which does a whole lot of reading and writing to the stack just to achieve the effect of reading the top of the stack into hl. Performance-focused Z80 programmers wrote assembler by hand and avoided doing things like this. One might even say that C was too slow since compilers weren't advanced enough to beat the best assembler programmers.

The x86 code didn't even use the stack at all - we have a full 16 8-byte registers available, all general purpose. There are even more non-general purpose registers available like the stack pointer, base pointer, instruction pointer, various SIMD registers, etc.

This is in comparison to the Z80 which only really has 4 16-bit general purpose registers, and not all of them can be used with all instructions.

As an example, say we want to compile this statement in C:

int j = i+1;

(part of the for loop in the two_sum function)

In x86 it's simple:

lea r9,[rax+0x1]

We say we're keeping j in r9 and i in rax, we write rax+1 to r9. Easy.

With the Z80 we're severely constrained in how many registers there are, and we have to keep i and j both on the stack.

pop  hl
push    hl
inc hl
push    hl

So i was at the top of the stack. we set hl to i with the pop/push dance, increment hl, and push hl to the stack, so we're keeping j at the top of the stack.

I can see how all these extra registers make life easier.

Optimising for size hurts speed, a lot

This may seem obvious, but on the Z80 this is true to an extreme. When you have a limited space, like 32K of memory, you have to work hard to fit BASIC and whatever your program is into memory. That means it's often required to call a function where x86 (with no concern about space) can just inline.

One example is the l_gint function, which you can see here: https://github.com/z88dk/z88dk/blob/87003c95c1f3d9be8d4704beff94010159989ec2/libsrc/_DEVELOPMENT/l/sccz80/9-common/l_gint.asm. It's so trivial as to be ridiculous, it's just

l_gint:

   ld a,(hl+)
   ld h,(hl)
   ld l,a

   ret

This just does some loads and increments hl. Why a function? Because this sequence of operations is really common, and it makes for a much smaller binary if these operations are called as a function rather than inlined. This is absolutely required, as the Z80's primitive instruction set already makes it incredibly difficult to get anything done in a reasonable number of instructions.

As an example, take this line of C:

if (nums[i] + nums[j] == target) {

This is just reading from some memory locations, adding, comparing and perhaps jumping. Easy. Yes, on x86:

mov    r8d,DWORD PTR [rdx+rcx*4]
add    r8d,DWORD PTR [rdx+rax*4]
cmp    r8d,edi
jne    1221 <two_sum+0x18>

We read the two values from memory, indexing into the array with the lovely CISC mov and add instructions, and jump past the printf if the sum isn't the target.

On Z80, it's a complete nightmare, and would be even worse without these l_gint-style functions to reduce the size:

    ld   hl,6    ;const
    call    l_gintspsp  ;
    ld  hl,4    ;const
    add hl,sp
    call    l_gint  ;
    add hl,hl
    pop de
    add hl,de
    ld  e,(hl)
    inc hl
    ld  d,(hl)
    push    de
    ld  hl,8    ;const
    call    l_gintspsp  ;
    ld  hl,4    ;const
    add hl,sp
    call    l_gint  ;
    add hl,hl
    pop de
    add hl,de
    call    l_gint  ;
    pop de
    add hl,de
    ex  de,hl
    ld  hl,10   ;const
    add hl,sp
    call    l_gint  ;
    call    l_eq
    jp  nc,i_8

All these function calls are slow, even if they save space. By the way, Z80 assembly is a complete disaster to write by hand. x86 assembly is much nicer in comparison.

It's important to note: the space optimizations work very well and although the code looks terrible for the Z80, it's actually much smaller:

$ du -h twosum.bin
8.0K    twosum.bin
$ du -h twosum.x86
20K twosum.x86

That's 8K for the Z80 code and 20K for the x86 code.

Conclusion

It is incredible that anyone ever wrote software for the Z80 in pure assembler. Hats off to Bill Gates and Paul Allen for actually starting a company based on software written for the Intel 8080, in assembler.

Like holy shit, Altair BASIC fit into 4K of memory and was an actual programming language kids could write programs in!

I could have maybe stomached writing software in assembler once more advanced 16-bit CPUs like the 8086 or 286 came around, but I'm afraid that writing real programs for the Z80 or even worse, 6502, seems really tedious.

I'm in shock that the much more powerful Motorola 68000 didn't become the standard.

Anyway, we've come a long way in 50 years, maybe in another 50 people will wonder how we coped with Python and Java.