This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019

NES boot loader specification

by on (#67288)
I've just completed a preliminary version of the NES boot loader specification, along with implementations.
NES boot loader specification
NES boot loader usage
A boot loader is a tiny program which receives a larger program from a PC connected to the NES via RS-232 at 57600 bits per second. The larger program is loaded into zero-page and executed there, where it can then communicate with the PC to determine what to do next. The format and protocol include a checksum, but still allow a very small implementation that does no checking. The smallest I've come up with is 30 bytes. Other implementations are included on the usage page.
Code:
        ; NTSC version
        ldx #0          ; Number of bytes received
byte:   lda #$01
start:  bit $4017       ; Wait for start bit
        beq start
        lsr             ; A = 0
        nop
dbit:   ldy #3          ; Delay between bits
        lsr $4017       ; Read bit. First time reads 1 for start bit.
dly:    dey             ; Delay
        bne dly
        rol a           ; Move bit into shift register
        sta 0,x         ; Delay, and store received byte on final iter
        bcc dbit
        inx
        bne byte
        jmp $0007       ; Execute received code
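
A rough sketch of the arithmetic behind the bit loop above, for anyone checking the timing. The NTSC CPU clock figure (about 1.789773 MHz) and the per-instruction cycle counts are standard 6502/NES values rather than anything stated in this post, the function names are only illustrative, and it assumes no branch crosses a page boundary.

```python
# Sketch: why the NTSC bit loop comes out at 31 cycles per bit.
# Assumption (not stated in the post): NTSC CPU clock of ~1.789773 MHz.
NTSC_CPU_HZ = 1_789_773
BAUD = 57_600

def cycles_per_bit(cpu_hz=NTSC_CPU_HZ, baud=BAUD):
    """Ideal number of CPU cycles between two serial bits."""
    return cpu_hz / baud

def loop_cycles(delay_count=3):
    """Cycles for one pass of the dbit loop, using standard 6502 timings."""
    ldy_imm = 2                      # ldy #3
    lsr_abs = 6                      # lsr $4017
    # dey (2) + bne taken (3) on every pass except the last,
    # where bne falls through (2)
    delay = (delay_count - 1) * (2 + 3) + (2 + 2)
    rol_a = 2                        # rol a
    sta_zpx = 4                      # sta 0,x
    bcc_taken = 3                    # bcc dbit
    return ldy_imm + lsr_abs + delay + rol_a + sta_zpx + bcc_taken

ideal = cycles_per_bit()
actual = loop_cycles()
# Nine samples per byte: the start bit plus eight data bits.
drift = (ideal - actual) * 9
print(f"ideal {ideal:.3f}, loop {actual}, drift over one byte {drift:.2f} cycles")
```

Because the loader waits for a fresh start bit before every byte, the fraction-of-a-cycle error per bit does not accumulate from one byte to the next.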


EDIT: updated for slight format change.

by on (#67289)
So this could work with a Game Genie and any cartridge with battery backed SRAM? So then you can develop and run games which run entirely from the SRAM area, and take advantage of the CHR RAM built into the cartridge. Of course, you'd need to override the vectors.

But I'm a total hardware klutz, don't have soldering irons or random resistors lying around, and can't build the cable.

by on (#67316)
It would work as long as you could come up with a Game Genie patch that causes execution of SRAM. Run the game in a debugger to find when it first enables SRAM, then find a JMP/JSR instruction close thereafter and patch its high byte to $60, then have SRAM with page $60 filled with $EA (NOP), and the boot loader beginning on the next page. For vector overrides, you can do the same: patch the high bytes to $07, then put JMP instructions where they happen to point in page 7 of RAM. The above would nicely fit in three Game Genie patch slots as well.

The main snag is that you need to find some way to initially get the boot loader into SRAM. If you could have someone put it on an EPROM and replace the ROM with that, then you'd have a really cheap devcart. Of course if you're going to the trouble of replacing the ROM, you might as well put a Flash ROM there instead, as it's virtually the same amount of rewiring and chip cost.

by on (#67318)
Or we could make a tacit agreement to include a bootloader like this in our homebrews so that people can boot by keying a code into the title screen of a repro.

by on (#67321)
Nifty idea. It would also allow backing up/restoring battery-backed SRAM to a PC connected via second controller serial, without having to do any hotswapping. It would be desirable to support this in one's homebrew cartridge release, because it adds value with very little extra implementation cost.

You can even put such communication code in the title screen's main loop, where it merely checks for activity on D0 of the second controller. If found, it enters the boot loader. Then you have the PC send some $FF sync bytes before the program block, to give time for input to be detected and the boot loader to be started. This way you can boot the cartridge and begin sending a program, without having to do anything on the controller (this is how the Munchausen menu works in the recently-posted video).
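
To illustrate the sync-byte idea, here is a sketch of how a host tool might prepend sync bytes, and why $FF presumably suits the job. The sync count, the helper names, and the use of standard 8N1 framing (start bit 0, data LSB first, stop bit 1) are assumptions for illustration; the post does not specify them.

```python
def frame_8n1(byte):
    """Logical line levels for one byte: start bit (0), 8 data bits LSB first, stop bit (1)."""
    return [0] + [(byte >> i) & 1 for i in range(8)] + [1]

def with_sync(program, sync_count=16):
    """Prefix a program block with $FF sync bytes (the count is a made-up example value)."""
    return bytes([0xFF] * sync_count) + bytes(program)

# $FF is a lone start-bit pulse on an otherwise idle (1) line, so a receiver
# that starts listening mid-byte only sees idle until the next start bit.
print(frame_8n1(0xFF))                      # [0, 1, 1, 1, 1, 1, 1, 1, 1, 1]

stream = with_sync(b"\x01\x02\x03", sync_count=4)
print(stream.hex())                         # ffffffff010203
```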

Right now I'm working on a redone secondary loader that accepts small blocks of code, executes them, and can be re-entered. This allows easy uploading of data to any part of NES RAM, and execution of code to program things into Flash, configure an MMC chip, load CHR RAM, or whatever. I've implemented all this before, but not with this revamped boot loader design that I posted.

by on (#67322)
Would anyone be willing to modify a NES emulator to support connecting the 2nd controller port to some sort of virtual serial port (a named pipe on Windows or a Unix socket on Unix)? Kind of like how VMware Workstation will let you attach a guest's emulated serial port to some logical "device" on the host that implements simple character I/O (named pipes, sockets, file handles and real serial ports).

That way homebrew carts can test this proposed functionality...

Blargg, would you be willing to license your boot loader code very permissively (I'm thinking BSD or equivalent) so that we can place it into our homebrew carts without needing to GPL the entire cart?

I could see adding some attribution to the "credits" screen if desired.

by on (#67324)
byuu has probably done the closest to emulating serial in an emulator. He's made some sort of library for treating it the same as a serial port, so a PC-side program can communicate with the emulator as if it were the real thing.

And yeah, the boot loader code should be licensed modified BSD/MIT/zlib style for sure. No credit needed, just be sure to mention where someone interested in the code might find it by at least mentioning a name or something someone can search for.

I was hoping for more discussion of the boot loader itself, including its design and implementation, to iron out any problems before it gets put in cartridges. Once I implement some things with it I'll have a better idea of any problems.

by on (#67326)
The boot loader looks well thought out. It really reminds me of the compactness and power of the Apple ][ disk boot loader ($c600-$c6ff).

(briefly going off topic...)

I once had a text file of a heavily commented disassembly of the ROM boot loader and first two stages of DOS 3.3. The document explained all of the "tricks" used in the loaders, especially how the stage-1 DOS 3.3 loader will copy code from the disk II ROM (the GCR decoder IIRC).

I can't find it, nor any online copies (to cite), but I found this while searching for it:
http://home.comcast.net/~mjmahon/AppleCrateII.html

A 17-node Apple II parallel computer... wow.

by on (#67360)
OK, one possible breaking change. The fact that the code begins at 7 in zero-page, but in the program block it begins at 4 is bothering me. It makes it just a little bit harder to understand. I'm trying to work it out so that the program block has a header only. It's just that this might add a byte to one of the larger loaders. I know it matters little, but I'm still obsessing over it. If this change works out, the format has one less thing someone could object to. The format I'm aiming for is this: 4-byte signature, 8-bit checksum, 16-bit CRC, 249 bytes of user data.
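
The proposed layout can be checked with a little arithmetic. This sketch assumes the fields are stored back to back in the order listed and that the whole block is loaded starting at $0000 (as `sta 0,x` in the loader suggests); the field names are illustrative.

```python
# Field sizes are from the post; the names are only labels for this sketch.
FIELDS = [("signature", 4), ("checksum", 1), ("crc16", 2), ("user_data", 249)]

def layout(fields=FIELDS):
    """Return ({name: (offset, size)}, total_size) for fields laid out back to back."""
    offsets, pos = {}, 0
    for name, size in fields:
        offsets[name] = (pos, size)
        pos += size
    return offsets, pos

offsets, total = layout()
for name, (off, size) in offsets.items():
    print(f"${off:04X}  {name:<10} {size} bytes")
print("total:", total)
```

Under those assumptions the header occupies 7 bytes, so user data starts at $0007 both in the block and in zero page, and the whole block is exactly 256 bytes.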

by on (#67377)
OK, I made the above change. Sorry about breaking the spec already, but this removes some little conceptual snags and simplifies the specification.

The secondary loader that allows remote procedure calls is coming along very nicely.

by on (#67387)
Very nice work on the bootloader, Blargg. And I really like the idea of supporting a developer and getting a game and a programmable cart.

Is it possible to have a second stage boot loader that writes directly to CHR-RAM? If so, something 'official' would be nice. That way you could write little RAM games without having to worry about compressing tiles into your code space.

by on (#67388)
Yeah, the secondary one I'm rebuilding will be fabulous. First version I've been using is really fun to program from C. You basically get a clean API for accessing the NES, for example write_chr( addr, ptr, size ) and it writes that from your C program to the NES CHR. Internally it just does a generalized RPC, sending the NES code for a small routine that loads the CHR, along with the CHR data. When that returns, this secondary loader is running, waiting for the next RPC call. Just to be clear, this isn't for writing games or anything (the latency would be too great), just for manipulating the NES hardware/loading things from the host in a very streamlined fashion.

by on (#67534)
How's the RPC API coming along, blargg? :)

by on (#67537)
Why not do 256 bytes of data? 249-byte blocks are kind of a weird size, and I would think that the header and error-checking stuff will be discarded immediately after it checks out OK. XMODEM by comparison is 128 bytes of data + 2 bytes of CRC-16, which is super easy to handle, with no problems crossing page boundaries.
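
For reference, the CRC-16 that XMODEM-CRC appends to each 128-byte block can be sketched bitwise as below. This is the common CRC-16/XMODEM variant (polynomial $1021, initial value 0, high byte sent first); whether the boot loader specification uses the same polynomial is not assumed here, and the function names are illustrative.

```python
def crc16_xmodem(data, crc=0):
    """Bitwise CRC-16/XMODEM: polynomial 0x1021, initial value 0, no reflection."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def xmodem_payload(data):
    """128 data bytes followed by the 2-byte CRC, high byte first."""
    if len(data) != 128:
        raise ValueError("XMODEM data blocks are 128 bytes")
    crc = crc16_xmodem(data)
    return bytes(data) + bytes([crc >> 8, crc & 0xFF])

print(hex(crc16_xmodem(b"123456789")))      # 0x31c3, the standard check value
print(len(xmodem_payload(bytes(128))))      # 130
```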

I've been thinking about this lately, about hooking this up on the expansion port version of my Squeedo board. Given the choice between synchronous SPI and async UART, I'm definitely going to try synchronous. What I'm hoping would work is that the MCU could do an async bit-bang to be compatible with just the initial loader. After it gets the proper comms code loaded from that, it should be OK to use any kind of hardware whatsoever, right?

See any potential problems with this idea? Seems OK, as far as I can tell so far.

EDIT - Sorry, never mind what I said about the block size, I wasn't considering that it's only one block, heheh. Still seems a little odd, but any arbitrary amount is fine in that case.

I guess my biggest concern (with my hardware as I imagine it) is wondering wtf happens if a controller is in port 2 at the same time the MCU (or anything for that matter) is bit-banging the same lines on the expansion port. On the expansion port, though, it would be really easy to move to the other bits. I kind of wish the "standard" serial adapter didn't use D0. So I'll have to look for a work-around, probably.

Despite whatever issues I may or may not run into, a standard bootloader like this is a really great idea. I figured XMODEM would just be the standard (as it has been since before I was even born), but XMODEM has its faults (no filename or file size given, no auto-start transfers), so this could handle things a lot better, while still being standard enough.

by on (#67557)
I hacked out some NES->PC code, maybe it's of some use to someone. It's a little big but it sends up to 64k as fast as possible (8N1 @ 57600 baud, no gaps in between bytes) while generating a 16-bit checksum.
Code:

dw count        ; byte count, $0000 will send 64k
dw tcheck       ; temp checksum
dw checksum     ; final checksum
dw ptr          ; start address
db byte         ; holds read byte
db invert       ; set to $FF for direct connection, $00 for MAX232/FTDI


    ldy #0
    sty <tcheck        ; temp checksum
    sty >tcheck
startbit:
    ; pla, branch to here  6
    lda <count    ; 3
    beq skip    ; 2
    nop        ; 2
    dec <count    ; 5
    jmp here    ; 4
skip:            ; 3
    dec <count    ; 5
    dec >count    ; 5
here:            ; ----- 16
    nop        ; 2
    lda invert    ; 3
    sta $4016    ; 4 ---- 31


    lda (ptr),y    ; 5
    sta byte    ; 3
    inc <ptr    ; 5
    beq one        ; 2
    nop        ; 2
    jmp two        ; 4
one:            ; 3
    inc >ptr    ; 5
two:            ; ------ 13
    lda byte    ; 3
    eor invert    ; 3
    sta $4016    ; 4 ---- 31


    lda byte    ; 3
    clc        ; 2
    adc <tcheck    ; 3
    sta <tcheck    ; 3
    lda >tcheck    ; 3
    adc #0        ; 2
    sta >tcheck    ; 3
    lda byte    ; 3
    eor invert    ; 3
    lsr a        ; 2
    sta $4016    ; 4 ---- 31
   
    ; waste 3 cycles
    pha        ; 3
    ldx #7        ; 2
loop:
    pha    ; 3
    pla    ; 3
    pha    ; 3
    pla    ; 3
    pha    ; 3
    pla    ; 3
    nop    ; 2 -- 20
    lsr a        ; 2
    sta $4016    ; 4 -- 31
    dex        ; 2
    bne loop    ; 3

stopbit:
            ; 2 added cycles from bne
    lsr a        ; 2
    sta byte    ; 3 --- byte should be clear, maybe useful
    pla        ; 3
    lda <count    ; 3
    ora >count    ; 3 --- this is a done flag
    pha        ; 3
    lda invert    ; 3
    eor #1        ; 2   
    sta $4016    ; 4 -- 31

    pla        ; 3
    bne startbit
    lda <tcheck
    sta <checksum
    lda >tcheck
    sta >checksum
    rts
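
On the PC side, the checksum that the routine above accumulates can be reproduced to verify a transfer. From the `adc <tcheck` / `adc #0` sequence it appears to be a plain 16-bit sum of the bytes sent, wrapping modulo 65536. The function names and the transfer-time estimate are illustrative additions, not part of the posted code.

```python
def checksum16(data):
    """16-bit additive checksum: sum of all bytes, wrapping modulo 65536."""
    total = 0
    for b in data:
        total = (total + b) & 0xFFFF
    return total

def transfer_seconds(n_bytes, baud=57_600, bits_per_byte=10):
    """Time to send n_bytes back to back as 8N1 (start + 8 data + stop = 10 bits)."""
    return n_bytes * bits_per_byte / baud

# 300 bytes of $FF sum to 76500, which wraps to 10964 = $2AD4.
print(hex(checksum16(b"\xff" * 300)))       # 0x2ad4

# A full 64k dump with no gaps between bytes takes a little over 11 seconds.
print(f"{transfer_seconds(0x10000):.1f} s")
```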

by on (#67563)
I've almost got the RPC library ready for release. It's all fit together so well. The first version won't have live interaction, merely "recording" of routine calls into a file that you then send to the NES to "replay" them. You can read things back from the NES though. The next version will have full interaction, where you can make calls and get data back interactively. The nice thing about the recording approach is that it gives more flexibility in how you get it to the NES, and is simpler to code for.

Memblers wrote:
EDIT - Sorry, never mind what I said about the block size, I wasn't considering that it's only one block, heheh. Still seems a little odd, but any arbitrary amount is fine in that case.

Yeah, I guess the term "block" implies there's more than one. I guess I should avoid using that word.

See the Design Rationale section of the specification for more about each decision. I've tried to examine every aspect of this and choose a design that's a best fit for them all.

Quote:
I've been thinking about this lately, about hooking this up on the expansion port version of my Squeedo board. Given the choice between synchronous SPI and async UART, I'm definitely going to try synchronous. What I'm hoping would work is that the MCU could do an async bit-bang to be compatible with just the initial loader. After it gets the proper comms code loaded from that, it should be OK to use any kind of hardware whatsoever, right?

Yeah. Once your code starts executing at address $0007, it's free to do whatever it wants. Of course if there's not the standard serial connection, the code won't be able to do that much.

I'm thinking a second level of loader is really needed, one that hides how serial transfer is done, etc. This boot loader is mainly meant to reduce the amount of code needed in ROM on a system with the D0-based serial connection. It's not suited for other connections to the PC.

Quote:
I guess my biggest concern (with my hardware as I imagine it) is wondering wtf happens if a controller is in port 2 at the same time the MCU (or anything for that matter) is bit-banging the same lines on the expansion port. On the expansion port, though, it would be really easy to move to the other bits. I kind of wish the "standard" serial adapter didn't use D0. So I'll have to look for a work-around, probably.

One reason to use D0 is that every controller cable connects it. It's also easier to use for optimized serial code where it just does LSR $4017 to move the bit into carry. I'll have to think more about this, as you raise a good issue.

Quote:
Despite whatever issues I may or may not run into, a standard bootloader like this is a really great idea. I figured XMODEM would just be the standard

I had realized that XMODEM is meant for unreliable connections. At least here, 57600 has been very reliable, so error checking merely needs to catch that rare error (or bad cabling), rather than automatically retry.

The main goal of this boot loader was being really small yet still (optionally) robust. There could be a second level of loader that sits on this one, or is implemented directly by for example Squeedo, where you don't even have the first-level loader because you've got the space/easy reprogrammability. As long as you can bootstrap to a common environment, then it doesn't matter.

kyuusaku, heh, that's sort of like one of the send routines I have in the upcoming remote procedure call library. There's also an equivalent one that CRC-16s a byte as it's receiving another. Doing those in parallel is definitely the way to go.

EDIT: OK, I figured out how this fits in with Memblers' points about other hardware. I'll be adding this to the specification:

This boot loader is meant to allow control of a NES connected to the host via serial on the second controller port's D0 input. It isn't suitable for other host connection schemes, for example serial connected to a different input on the expansion connector.

Serial over D0 is very easy to wire up, since standard controller cables connect that pin, and the circuitry is just a few resistors and a transistor. The standard boot loader protocol allows it to be put on a cartridge easily, and it doesn't need to be updated, so the user doesn't need the ability to re-burn it. It could for example be included on homebrew cartridges as an extra feature. Anyone can build the serial cable, but getting the boot loader burned on a ROM is the biggest hurdle.

In normal use, a secondary loader will be sent, which then loads the actual program to be executed. The user doesn't directly write programs to send to the boot loader. This secondary loader can be implemented for other host connection cable types, so that the user has the same host tools regardless of the connection. This boot loader thus is only used for those connected with the simple serial cable described here.

[the further loading might be handled by something like the upcoming RPC library, or perhaps something simpler]

by on (#67566)
I was just talking this over with kevtris, and he came up with what I think is a great solution - have the initial bootloader use both D0 and D1. The only thing AFAIK that uses D1 is Famicom expansion controllers (where it is the 'normal' data-in). So between D0 and D1, the bootloader would then be compatible with controller port serial adapters, NES expansion port serial adapters, and Famicom expansion port serial adapters (Famicom has hard-wired controllers), all at the same time. Can't beat that for flexibility and compatibility.

by on (#67567)
My only concern is junk data on D1. Can we assume that it always reads back as 0 when not connected, and on a NES? If we're going for multiple bits, what about D3 and D4 as well? That way you could have a pass-through serial cable on a NES as well, that connected the controller normally to D0 and serial to D3 or D4. Code-wise doing multiple bits requires doing an LDA, AND, CMP anyway, so supporting arbitrary bits is easy enough (and if you use the fast-and-loose CPX $4017 approach, you automatically support serial on any of the 8 bits that are normally zero).

So what's really at the core of this is a set of input lines to use as a baseline bit-bang serial standard.

I'll have to see how this affects the tiny boot loader implementations. There's going to be some expansion, unfortunately. But oh well, one can always use these smaller ones in specialized situations where the serial interface is known. We just want a more capable one to put on homebrew cartridges etc.

by on (#67569)
Oh also, I believe the Famicom expansion D1 will be on $4016 (I think so, but I'm not sure). So it's a different bit and a different register. Still, not too bad though.

As for noise on the lines, kevtris said some NES games (such as Spelunker) already check this bit. So it must be stable, but I suppose the NES version could be ignoring it afterwards (not sure at the moment, but I doubt that it's been changed - this is probably common on famicom games).

by on (#67570)
blargg wrote:
My only concern is junk data on D1. Can we assume that it always reads back as 0 when not connected, and on a NES?

I seem to remember tracing the input code in Super Mario Bros. (JU), and it assumes so.

by on (#67571)
tepples wrote:
I seem to remember tracing the input code in Super Mario Bros. (JU), and it assumes so.

Several (U) games I checked in the past use both D0 and D1.

by on (#67675)
I've figured out which bits I will use, $4017.D1 for output to the NES, and OUT1 ($4016.D1) for input from the NES, instead of OUT0 (strobe). This should keep every kind of NES controller port device out of the way, and would work just as well on Famicom.

I also noticed on the PIC32, the UART and SPI peripherals share the same I/O pins. That works out pretty well here, it could use either mode just as easily.

by on (#67679)
We're going to have to re-assess this thing from scratch, at least if I'm going to be able to do much. I'm also working on something else now, having spent enough time on this (and the RPC, which I should be releasing in a couple of days). I'm not sure how well I can accommodate all these extensions, and I'm not even sure they make sense for this boot loader. But that's OK, I think re-assessment of how this fits in the grand scheme of things is good. So for now, this only supports $4017.D0 for serial input. I'll try to start another thread on this bigger picture today.

by on (#67752)
Argh, my tester's equipment is non-operational at the moment. Does anyone else even have a PC link cable and want to try out the remote procedure call library before I post what I've got, untested on anything except my setups?

by on (#67774)
I just built a cable tonight and got the tone.bin to execute.
Pretty cool stuff! :D I'd be happy to test out the RPC code.