This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019 Download

A Runtime for high-level languages on the 6502

A Runtime for high-level languages on the 6502
by on (#176655)
I've been working a bit on a compiler that targets the 6502 (yes, I know there are plenty of things that are already good enough, but its mostly just a fun project to work on for me). It supports operations on 1, 2, 3, and 4 byte values, including the usual 6502 operations, plus variable shift operations (including arithmetic shift right), multiply, divide, modulo, logical operations, and others.

David Wheeler's ideas were a big help in figuring out the runtime for a high level language. I decided to build a split stack machine.

There are 4 data stacks -- STACK0, STACK1, STACK2, and STACK3. STACK0 could be on the zero page to save code space and allow some ldy ZP, x instructions, and STACK1, STACK2, STACK3 can be elsewhere. There is only one stack pointer, stored in the X register. This allows stack access quite quickly using indexed absolute addressing. Multi-byte values are stored across multiple stacks. For example, the bytes of a 24-bit integer would be stored at STACK0+x, STACK1+x, and STACK2+x. Manipulating the stack pointer is also very cheap, it only takes a dex or inx.

As an optimization, the current top-of-stack value is stored contiguously in the zero page at TOS. This means that most operations, which conceptually pop 2 values from the stack, then push the result back on the stack, only have to modify TOS in-place and then increment x in order to calculate their values.

Here's an example, the 2-byte addition operation:

Code:
add2:                  ; 30 cycles total
        clc            ; 2 cycles
        lda STACK0, x  ; 4 cycles
        adc TOS        ; 3 cycles
        sta TOS        ; 3 cycles
        lda STACK1, x  ; 4 cycles
        adc TOS+1      ; 3 cycles
        sta TOS+1      ; 3 cycles
        inx            ; 2 cycles
        rts            ; 6 cycles


A generated program would look something like this:

Code:
main: ; calculates $1234 + $1337
        ; put $1234 on the stack
        lda #<$1234
        sta STACK0, x
        lda #>$1234
        sta STACK1, x

        ; put $1337 on the top of the stack
        lda #<$1337
        sta TOS
        lda #>$1337
        sta TOS+1

        jmp add2


There are a few things that I am considering to improve the performance. Currently, the stack usage is very wasteful. If you use 1 byte operands, then STACK1, STACK2, and STACK3 are unused. Because the 1 and 2 byte operations take less code space than the 3 and 4 byte operations, I could create, for example, 1 byte operations the operate on STACK1 instead of STACK0, and 2 byte operations that operate on STACK2 and 3 instead of 0 and 1. Packing multiple values into one stack slot could be powerful, but it would also be very difficult to write a compiler that can take full advantage of it. A good compiler would generate normal 6502 code for 1 byte instructions anyway, rather than using the stack.
Re: A Runtime for high-level languages on the 6502
by on (#176656)
Why would you even need those LDA/STAs in your main code? Just encode them as byte codes to push values on your stack.

I doubt 24-bit and 32-bit calculations are very useful, ever, in the context of Nesdev. 16-bit however is extremely useful so using 2 separate stack for LSB and MSB and avoid the overhead of separating 16-bit and 8-bit operations and just do everything on 16-bit and ignore the MSB when appropritate is actually a good idea, in my opinion.
Re: A Runtime for high-level languages on the 6502
by on (#176660)
It's true that 24 and especially 32 bit instructions aren't often useful, but I've sometimes wanted to use them. For example, storing a position on a large map. Super Mario Bros 3 uses a 24 bit number for each coordinate of a player's position. (Actually, more like 20 bits, because it only goes to 1/16 of a pixel precision).

I don't embed the constants because there is a trade-off on speed and code size. I could use a byte code interpreter, like Sweet16, and it could support all of the operations in one byte easily since a stack machine doesn't need to encode the operands in an instruction. This would be slower. However, they aren't mutually exclusive. I could sometime use an interpreter for code compactness, and direct subroutine calls for speed.
Re: A Runtime for high-level languages on the 6502
by on (#176662)
Bregalad wrote:
I doubt 24-bit and 32-bit calculations are very useful, ever, in the context of Nesdev.

Horizontal coordinate of a character in a long scrolling level: 0-7 for subpixel, 8-15 for pixel, 16-23 for screen.
Re: A Runtime for high-level languages on the 6502
by on (#176665)
Ok, but it's the ONLY purpose of more than 16-bit I can think of, and you probably don't want to do the whole collision detection in 24-bit, especially if it's not hand-optimized assembly, as this code will run every frame.
Re: A Runtime for high-level languages on the 6502
by on (#176671)
Yeah, collision detection with 24-bit coordinates typically disregards subpixels.
Re: A Runtime for high-level languages on the 6502
by on (#176777)
Bregalad wrote:
Why would you even need those LDA/STAs in your main code? Just encode them as byte codes to push values on your stack.

How does that work?
Re: A Runtime for high-level languages on the 6502
by on (#176848)
psycopathicteen wrote:
Bregalad wrote:
Why would you even need those LDA/STAs in your main code? Just encode them as byte codes to push values on your stack.

How does that work?


I imagine it would work like in Sweet16. Sweet16 is a virtual assembly language. Most instructions are a single byte, but the byte code $1X followed by a 2 byte number represents LOAD, with loads the two byte number into the register X. Similarly, this could have a byte code for pushing 1, 2, 3, or 4 byte values onto the stack directly.
Re: A Runtime for high-level languages on the 6502
by on (#176849)
russellsprouts wrote:
Similarly, this could have a byte code for pushing 1, 2, 3, or 4 byte values onto the stack directly.

Like in "The Princess and the pea"?
Re: A Runtime for high-level languages on the 6502
by on (#176867)
How is that different from loading and storing?
Re: A Runtime for high-level languages on the 6502
by on (#176889)
The main difference would be that a 16-bit lda #imm + sta $abs takes 8 cycles and 6 bytes total, while pea $val would only take 5 cycles + 3 bytes total.

Two complications (in my experience with pea): 1) the endian of the bytes pushed might not match what your existing code expects, 2) the values pushed are statically-defined at assemble time (i.e. instruction pushes a predetermined value, not what's in the accumulator); the "workaround" for the latter is to use self-modifying code so that the pea+operand bytes are in RAM and the latter can be changed dynamically.

Whether or not this provides you any cycle-count or space benefits over, stay, a classic lda/sta or lda/pha etc. is, obviously, indeterminable at this time.

P.S. -- Man, I haven't heard of Sweet16 in decades. Blast from the past hitting me hard. :-)
Re: A Runtime for high-level languages on the 6502
by on (#176958)
I thought only the 65816 had PEA.
Re: A Runtime for high-level languages on the 6502
by on (#176962)
Among the 6502 family, only the 65816 has pea as a hardware instruction. Perhaps I failed to make my point clear: an interpreted bytecode implemented on any processor could have an instruction with the same semantics (push 16 bits immediate), and it would be recognizable to those with 65816 experience as being pea.
Re: A Runtime for high-level languages on the 6502
by on (#176966)
I'm still confused on what Bregalad said.
Re: A Runtime for high-level languages on the 6502
by on (#177078)
I had an idea with implementing math with long integers on the 65816.

If you have the C code:

Code:
long a;
long b;
long c;

a = a + b + c;


It could be compiled to:

Code:
lda a_lo
clc
adc b_lo
tax
lda a_hi
adc b_hi
tay
txa
clc
adc c_lo
sta a_lo
tya
adc c_hi
sta a_hi


Basically using X and Y together as a 32-bit accumulator, and switching back and forth.
Re: A Runtime for high-level languages on the 6502
by on (#177091)
psycopathicteen wrote:
I'm still confused on what Bregalad said.

What I meant is that his example code could be changed from:
Code:
        ; put $1234 on the stack
        lda #<$1234
        sta STACK0, x
        lda #>$1234
        sta STACK1, x

        ; put $1337 on the top of the stack
        lda #<$1337
        sta TOS
        lda #>$1337
        sta TOS+1

        jmp add2


To something like
Code:
      jsr Pushconst
      .dw $1234
      jsr Pushconst
      .dw $1234
      jmp add2


Or even better, don't even use the "jsr" and have a bytecode interpreter, and then it would reduce to
Code:
     .db PUSHCONST
     .dw $1234
     .db PUSHCONST
     .dw $1234
     .db ADD2

Where constants like PUSHCONST and ADD2 are single byte instructions that gets intepreated.
Quote:
I imagine it would work like in Sweet16. Sweet16 is a virtual assembly language.

Note quite. Sweet16 is based on a register machine, not a stack machine.
Re: A Runtime for high-level languages on the 6502
by on (#177121)
Yes, Sweet16 is a register machine, but it is similar in that it uses an interpreted byte code.

The reason I didn't go for byte code is because of speed -- an interpreter would have a lot of overhead, especially if the instructions are different lengths because of constant loads. I did experiment with making a fast interpreter using self-modifying code. An interpreter would also have to implement branching and subroutine call instructions, while direct code can just use the 6502 branching instructions.

It is possible to have the best of both worlds, however. An interpreter could be bolted onto a system like I described. It would read the byte code and translate it into calls to the stack machine functions. The compiler could selectively compile functions into either byte code or function calls. Apparently, the PLASMA language for 6502 already does this -- each function can be compiled to byte code, calls to the interpreter, or even directly to assembly, depending on how you want to trade-off speed and size.