This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019 Download

Koei bytecode

Koei bytecode
by on (#195625)
Here's something interesting I've been investigating for the last few days... All of Koei's NES games except for Mahjong Taikai contain a standard bytecode interpreter, and the games consist almost entirely of bytecode except for the interpreter itself, the code that runs in NMI (PPU transfers and music/sound), and some "syscall" routines that perform tasks like bank switching and PPU register setting.

There are two versions of the bytecode interpreter, one used by the MMC1 games (Nobunaga's Ambition, Genghis Khan, Romance of the Three Kingdoms, and Famicom Top Management) and one used by the MMC5 games. The only differences between the versions are that the MMC5 version uses the mapper's multiplication registers, and has an extra routine that seems to be an optimized version of the syscall dispatcher for syscalls that only take one argument for calling syscalls from native 6502 code (the standard dispatcher copies arguments from the bytecode stack to a set of static addresses in zero page; the fast dispatcher expects the caller to have done this itself)

The bytecode is essentially an ideal C machine: it's mostly stack-based with a frame pointer and two 16-bit/32-bit registers, one "left" (destination for arithmetic ops) and the other "right" (source for arithmetic ops). It's integer-only but supports all the standard C operators including data and function pointers, both signed and unsigned division/modulus, signed and unsigned bit shifts, and bitfield extraction (with optional sign extension) and insertion. Almost all operations have 16-bit and signed 32-bit versions. 8-bit data are either zero-extended or sign-extended to 16 bits when read into a register, and the stack is 16-bit aligned.

Each bytecode function begins with a JSR to the interpreter, followed by the stack frame size as a negative 16-bit word ($FC $FF for a routine with two local variables, etc.) followed by the bytecode. The interpreter is reentrant, and the same bytecode instruction is used to call either a bytecode function or an assembly language one. Function arguments are pushed on the stack (with the caller responsible for stack cleanup--there's a "call xxxx and then unwind stack by n" instruction) and return values appear to be returned in the "left" register (the bytecode "return" instruction clobbers "right" but preserves "left").

The bytecode is so much an ideal C machine that I'm quite sure that the games were written in C (strings in the ROMs also have C-style formatting) which was compiled to bytecode rather than native 6502 code to save ROM space. I only wonder whether Koei wrote the compiler and interpreter themselves, or whether it's a product they licensed that may have been used by other developers as well.
Re: Koei bytecode
by on (#195629)
This is awesome! Makes sense for something like a strategy game that doesn't require peak performance. Any plans on publishing more details, such as disassembled code?
Re: Koei bytecode
by on (#195631)
I'd love to see its instruction set.
Re: Koei bytecode
by on (#195639)
I would like also to see the details of the instruction set and the explanation of their meaning.
Re: Koei bytecode
by on (#195645)
I would be interested in digging into the SNES games to see if they use the same byte code system.
Re: Koei bytecode
by on (#195651)
Cool find.
Re: Koei bytecode
by on (#195652)
Oziphantom wrote:
I would be interested in digging into the SNES games to see if they use the same byte code system.


The fact that a couple of their games (ie Gemfire, Uncharted Waters) were released on both the NES and SNES, might support that idea. Porting would be a whole lot easier if both are running the same (or almost-the-same) interpreter.
Re: Koei bytecode
by on (#195655)
gauauu wrote:
Oziphantom wrote:
I would be interested in digging into the SNES games to see if they use the same byte code system.


The fact that a couple of their games (ie Gemfire, Uncharted Waters) were released on both the NES and SNES, might support that idea. Porting would be a whole lot easier if both are running the same (or almost-the-same) interpreter.


Koei's early games (1980s to early 1990s) were released on numerous computer and console platforms including Z80 (PC88), x86 (PC98, IBM PC), 68K (X68K, Amiga, Megadrive) and I think one or two 6809 machines. It makes sense that they would keep as much of the game code as possible in platform-neutral C. Apparently the Megadrive versions of Genghis Khan 2, etc. are much much snappier than the respective SNES versions, which would be explained if the former were compiled to native 68000 and the latter use bytecode.

ETA: I'm still chewing through the interpreter to figure out all the instructions. I've just found two instructions that implement C switch statements, one for non-contiguous cases and the other for contiguous cases.
Re: Koei bytecode
by on (#195681)
Making a KOEI player akin to say SCUMMVM might be nice as well. HD uncharted waters HIT ME
Re: Koei bytecode
by on (#195693)
Oziphantom wrote:
Making a KOEI player akin to say SCUMMVM might be nice as well. HD uncharted waters HIT ME


I think you misunderstand. This bytecode isn't like SCUMM at all. SCUMM is a domain-specific language for third-person point-and-click adventure games. This bytecode isn't domain-specific at all; it's a machine language for a virtual 16/32-bit CPU suitable to run an integer-only subset of C. The only bytecode instructions that are any higher-level than a single 8086 or 68000 instruction are the two switchtable instructions. The "syscalls" aren't part of the bytecode language per se; they get called from bytecode using the regular call instruction. The syscall dispatcher is just to copy arguments and return values between the bytecode stack and fixed zero-page addresses. It's part of the interpreter because it has to use the interpreter's stack pointer to find the arguments.
Re: Koei bytecode
by on (#195695)
Its a VM, be it like the CLR or SCUMM its still a VM.
The important part is,
Does it runs lots of 6502, then drop into the VM for the maths?
Or
It mostly runs in the VM and has a syscall for set CHR-BANK, Call Script, SetSprite,ShowDialog,ScrollScreen etc ?

If its the first probably not so useful for VMing but it probably gets you the simulation.
If the second then you run the VM, trap the syscalls and make your code do what they need and boom you have the game. Basically like SCUMMVM, although the syscalls are probably game specific?
Re: Koei bytecode
by on (#195699)
Oziphantom wrote:
Its a VM, be it like the CLR or SCUMM its still a VM.
The important part is,
Does it runs lots of 6502, then drop into the VM for the maths?
Or
It mostly runs in the VM and has a syscall for set CHR-BANK, Call Script, SetSprite,ShowDialog,ScrollScreen etc ?

If its the first probably not so useful for VMing but it probably gets you the simulation.
If the second then you run the VM, trap the syscalls and make your code do what they need and boom you have the game. Basically like SCUMMVM, although the syscalls are probably game specific?


The "syscalls" are also very low-level. Don't think "display character x's portait" or "print string at coordinates x, y", think "copy x bytes to VRAM upload buffer and wait for the NMI thread to upload it". Even if you trapped the syscalls you'd need to emulate every part of a NES except the 6502 to run the games, and it wouldn't get you anywhere in terms of improving their graphics. With SCUMMVM, because the language is high-level, you can add higher resolution/color depth graphics to a game or modernize its interface just by replacing some resources, but with the Koei games you'd need very deep game-specific hooks. It wouldn't really be any easier than doing the same to a NES game written in straight 6502.

What knowledge of the bytecode would help with is reverse-engineering the simulation algorithms and modifying the games.
Re: Koei bytecode
by on (#195890)
I think I'm ready to write up a technical description of the bytecode architecture:

REGISTERS AND BASIC ARCHITECTURE

The virtual CPU has three 16-bit address/pointer registers and two 16/32-bit data registers, all stored in zero page at the following addresses:

Code:
addr | size | register
-----+------+----------------
$02  |  2   | stack pointer
$04  |  2   | frame pointer
$06  |  2   | instruction pointer
$08  |  4   | left register
$0C  |  4   | right register

In addition, zero page address $00 is used throughout the interpreter as a 16-bit temporary and $10 as a 32-bit temporary.

The stack starts at $05FF (in all the games I've checked) and grows downward, and the stack pointer points to the LSB of the last item pushed (a descending full stack). Some games (e.g. Bandit Kings of Ancient China) check the stack pointer in their vblank handler and if it is below $0200 (i.e. the bytecode stack has started to overwrite the 6502 stack) they display "OVERFLOW" using sprites and lock up.

Bytecode can freely use the stack as temporary storage, but the frame pointer is controlled by the interpreter and only changes on function entry and exit. Local variables and function arguments are addressed relative to the frame pointer.

The left and right registers are so (unofficially) named because all arithmetic and logical instructions operate on them as such, storing the result in left. Division divides left by right and stores the quotient in left, bit shifts shift left by right places, etc. There are no arithmetic flags; instead, there are instructions corresponding to signed and unsigned versions of every comparison operator, all of which store a boolean result (0 or 1) in left, and conditional jumps consist of "jump if left != 0" and "jump if left == 0".

The virtual CPU supports 16-bit and 32-bit operations, with the result always the same size as the operands (e.g. 16-bit multiplication produces a possibly truncated 16-bit result) All 32-bit instructions begin with a $B7 prefix byte. 16-bit operations only use and only update the lower 16 bits of the left and right registers. 8-bit data in memory is always extended to 16 bits when loaded into a register, and the stack also only supports pushing and popping 16-bit or 32-bit sized data.
Re: Koei bytecode
by on (#195891)
FUNCTIONS AND STACK FRAMES

Bytecode is organized into functions, every one of which begins with a machine language JSR to the interpreter, followed by the frame size needed for the function's local variables (as a 16-bit negative number), followed by the actual bytecode. The first thing the interpreter does is pop two return addresses from the 6502 stack. The first one is the return address of the JSR to the interpreter, which is used to find the stack frame size and the bytecode. The second is the return address of the caller of the bytecode function, which is pushed onto the bytecode stack along with the bytecode stack pointer, frame pointer and instruction pointer. The frame pointer is then set to the current stack pointer, and the stack pointer to the new frame pointer minus the requested frame size (actually, since the size is stored as a negative number, it's technically the frame pointer plus the frame size)

The bytecode return instruction does the reverse of the interpreter prologue: it restores the stack pointer, frame pointer and instruction pointer from the stack frame, and jumps to the stored return address + 1.

In this way, bytecode functions and machine language routines can call each other freely, and bytecode function calls can be nested to any depth without growing the 6502 stack, only the bytecode stack. When a bytecode function calls a bytecode function, the return address that gets transferred from the 6502 stack to the bytecode stack is an address inside the interpreter itself.

There are a couple of apparent redundancies: As described above, the interpreter prologue stores the previous instruction pointer on the stack, but the bytecode call instruction also pushes the instruction pointer onto the stack. I have not discovered what the "return address" pushed by the call instruction is used for, if anything. Also, the interpreter prologue copies 8 bytes onto the stack (6502 return address, stack pointer, frame pointer, instruction pointer) but it subtracts 9 from the stack pointer rather than 8, leaving a one-byte "hole" which I haven't yet found the purpose of.

The following is a diagram of a stack frame at the point when the first bytecode instruction in a function is executed. Everything above the line was pushed by the calling bytecode function, while everything below the line was pushed/allocated by the interpreter prologue.

Code:
         ....
         argument2
 fp+13-> argument2
         argument1
 fp+11-> argument1
         caller instrptr  (normally the same as below, unused?)
 fp+9 -> caller instrptr  (normally the same as below, unused?)
 ----------------------------------------------------------------------
 fp+8 -> ????             (reserved by interpreter but unused?)
         old instrptr     (this one is restored on function return)
 fp+6 -> old instrptr     (this one is restored on function return)
         old frameptr
 fp+4 -> old frameptr
         old stackptr
 fp+2 -> old stackptr
         return address-1 (transferred from 6502 stack to bytecode stack)
  fp  -> return address-1 (transferred from 6502 stack to bytecode stack)
         space for local1
 fp-2 -> space for local1
         space for local2
 fp-4 -> space for local2
         ....
         space for localn
  sp  -> space for localn
Re: Koei bytecode
by on (#195893)
ADDRESSING MODES

Most instructions implicitly operate on left and right and store the result in left. The only instructions that have addressing modes for data are "load", "store", "push" and "add". The "store" instruction does not take immediate arguments (obviously), and the "add" instruction only takes immediate arguments (in addition to the implicit register-register mode which every arithmetic instruction has). So there are exactly four instructions for every allowed combination of addressing mode and data size: "load to left", "load to right", "push", and either "store left" or "add to left".

Quick frame-relative
The frame-pointer-relative address of one of 12 local variables or one of 4 function arguments is encoded in the lower 4 bits of the instruction, as follows:
Code:
code |  offset  | meaning*
-----+----------+-----------
 0   | frame-24 | local #12
 1   | frame-22 | local #11
 2   | frame-20 | local #10
 3   | frame-18 | local #9
 4   | frame-16 | local #8
 5   | frame-14 | local #7
 6   | frame-12 | local #6
 7   | frame-10 | local #5
 8   | frame-8  | local #4
 9   | frame-6  | local #3
 A   | frame-4  | local #2
 B   | frame-2  | local #1
 C   | frame+11 | argument #1
 D   | frame+13 | argument #2
 E   | frame+15 | argument #3
 F   | frame+17 | argument #4
*assuming that all local variables and arguments are 16-bit sized

Note that the interpreter doesn't look up these offsets from a lookup table; instead, to improve execution speed (since these are the most commonly used instructions) there are actually 16 versions of each instruction each with a hardcoded offset. This mode only applies to 16-bit data.

Quick immediate
An immediate value from 0 to 15 is encoded in the lower 4 bits of the instruction and zero-extended to 16 bits.

Near frame-relative
A signed offset to the frame pointer from -128 to +127 is stored in the byte following the instruction. This mode only applies to 16-bit data.

Far frame-relative
A signed offset to the frame pointer from -32768 to +32767 is stored in the 2 bytes following the instruction. This mode applies to 8-bit, 16-bit or 32-bit data; 8-bit data is zero-extended to 16 bits.

Immediate
An immediate value is stored in the 1, 2 or 4 bytes following the instruction. A byte-sized immediate value is sign-extended to 16 bits (unlike 8-bit data read from memory, which is always zero-extended). The exception is the stack cleanup instruction, which takes an unsigned 8-bit immediate argument. There are no 32-bit immediate push or add instructions.

Absolute
An absolute address is stored in the 2 bytes following the instruction. This mode applies to 8-bit, 16-bit or 32-bit data; 8-bit data is zero-extended to 16 bits. Jump and call instructions also take an absolute address.

IP relative
IP relative addressing is only used by branch instructions. There are two versions of each branch instruction: one which interprets the byte after the instruction as a positive number from 0 to 255 (branch ahead), and one which interprets it as a negative number from -256 to -1 (branch back).
Re: Koei bytecode
by on (#195910)
INSTRUCTION SET

Code:
code  | nickname       | args  | operation
------+----------------+-------+-------------------------------------
00-0F | LOADL quick    |   0   | left = <quick>
10-1F | LOADR quick    |   0   | right = <quick>
20-2F | STORE quick    |   0   | <quick> = left
30-3F | PUSH quick     |   0   | push <quick>
40-4F | LOADL qimm     |   0   | left = <qimm>
50-5F | LOADR qimm     |   0   | right = <qimm>
60-6F | PUSH qimm      |   0   | push <qimm>
70-7F | ADD qimm       |   0   | left = left + <qimm>
------+----------------+-------+-------------------------------------
code  | nickname       | args  | operation
------+----------------+-------+-------------------------------------
 80   | (illegal)      |  ---  | (illegal)
 81   | LOADL near     |   1   | left = <near>
 82   | LOADL far      |   2   | left = <far>
 83   | LOADR near     |   1   | right = <near>
 84   | LOADR far      |   2   | right = <far>
 85   | STORE near     |   1   | <near> = left
 86   | STORE far      |   2   | <far> = left
 87   | PUSH near      |   1   | push <near>
 88   | PUSH far       |   2   | push <far>
 89   | BYTE LOADL imm1|   1   | left = (int8)<imm1>
 8A   | LOADL imm2     |   2   | left = <imm2>
 8B   | BYTE LOADR imm1|   1   | right = (int8)<imm1>
 8C   | LOADR imm2     |   2   | right = <imm2>
 8D   | BYTE PUSH imm1 |   1   | push (int8)<imm1>
 8E   | PUSH imm2      |   2   | push <imm2>
 8F   | BYTE ADD imm1  |   1   | left = left + (int8)<imm1>
 90   | ADD imm2       |   2   | left = left + <imm2>
91-9F | (illegal)      |  ---  | (illegal)
------+----------------+-------+-------------------------------------
code  | nickname       | args  | operation
------+----------------+-------+-------------------------------------
 A0   | BYTE LOADL far |   2   | left = (uint8)<far>
 A1   | BYTE LOADR far |   2   | right = (uint8)<far>
 A2   | BYTE STORE far |   2   | (uint8)<far> = left
 A3   | BYTE PUSH far  |   2   | push (uint8)<far>
 A4   | LOADL abs      |   2   | left = <abs>
 A5   | BYTE LOADL abs |   2   | left = (uint8)<abs>
 A6   | LOADR abs      |   2   | right = <abs>
 A7   | BYTE LOADR abs |   2   | right = (uint8)<abs>
 A8   | STORE abs      |   2   | <abs> = left
 A9   | BYTE STORE abs |   2   | (uint8)<abs> = left
 AA   | PUSH abs       |   2   | push <abs>
 AB   | BYTE PUSH abs  |   2   | push (uint8)<abs>
 AC   | CALL abs       |   2   | call <abs>
 AD   | COPY imm2      |   2   | copy <imm2> bytes from *left to *right
 AE   | UNSTACK imm1   |   1   | stackptr = stackptr + (uint8)<imm1>
 AF   | UNSTACK imm2   |   2   | stackptr = stackptr + <imm2>
------+----------------+-------+-------------------------------------
code  | nickname       | args  | operation
------+----------------+-------+-------------------------------------
 B0   | DEREF          |   0   | left = *left
 B1   | POPSTORE       |   0   | pop right; *right = left
 B2   | NOP            |   0   | no operation
 B3   | PUSHL          |   0   | push left
 B4   | POPR           |   0   | pop right
 B5   | MULT           |   0   | left = left * right
 B6   | SDIV           |   0   | left = (int16)left / (int16)right
 B7   | LONG (prefix)  |  var. | (prefix for 32-bit instructions)
 B8   | UDIV           |   0   | left = (uint16)left / (uint16)right
 B9   | SMOD           |   0   | left = (int16)left % (int16)right
 BA   | UMOD           |   0   | left = (uint16)left % (uint16)right
 BB   | ADD            |   0   | left = left + right
 BC   | SUB            |   0   | left = left - right
 BD   | LSHIFT         |   0   | left = left << right
 BE   | URSHIFT        |   0   | left = (uint16)left >> right
 BF   | SRSHIFT        |   0   | left = (int16)left >> right
------+----------------+-------+-------------------------------------
code  | nickname       | args  | operation
------+----------------+-------+-------------------------------------
 C0   | CMPEQ          |   0   | left = (left == right)
 C1   | CMPNE          |   0   | left = (left != right)
 C2   | SCMPLT         |   0   | left = ((int16)left < (int16)right)
 C3   | SCMPLE         |   0   | left = ((int16)left <= (int16)right)
 C4   | SCMPGT         |   0   | left = ((int16)left > (int16)right)
 C5   | SCMPGE         |   0   | left = ((int16)left >= (int16)right)
 C6   | UCMPLT         |   0   | left = ((uint16)left < (uint16)right)
 C7   | UCMPLE         |   0   | left = ((uint16)left <= (uint16)right)
 C8   | UCMPGT         |   0   | left = ((uint16)left > (uint16)right)
 C9   | UCMPGE         |   0   | left = ((uint16)left >= (uint16)right)
 CA   | NOT            |   0   | left = !left
 CB   | MINUS          |   0   | left = -left
 CC   | COMPL          |   0   | left = ~left
 CD   | SWAP           |   0   | swap left <-> right
 CE   | (illegal)      |  ---  | (illegal)
 CF   | RETURN         |   0   | return from function
------+----------------+-------+-------------------------------------
code  | nickname       | args  | operation
------+----------------+-------+-------------------------------------
 D0   | INC            |   0   | ++left
 D1   | DEC            |   0   | --left
 D2   | LSHIFT1        |   0   | left = left << 1
 D3   | BYTE DEREF     |   0   | left = *(uint8 *)left
 D4   | BYTE POPSTORE  |   0   | pop right; *(uint8 *)right = left
 D5   | SWITCH offs,num|2+2+tbl| contiguous switch (see below)
 D6   | JUMP abs       |   2   | instrptr = <abs>
 D7   | JUMPT abs      |   2   | if (left != 0) instrptr = <abs>
 D8   | JUMPF abs      |   2   | if (left == 0) instrptr = <abs>
 D9   | SWITCH num     | 2+tbl | noncontiguous switch (see below)
 DA   | AND            |   0   | left = left & right
 DB   | OR             |   0   | left = left | right
 DC   | XOR            |   0   | left = left ^ right
 DD   | CALLPTR        |   0   | call *left
 DE   | LEAL far       |   2   | left = &<far>
 DF   | LEAR far       |   2   | right = &<far>
------+----------------+-------+-------------------------------------
code  | nickname       | args  | operation
------+----------------+-------+-------------------------------------
 E0   | SLOADBF sz,pos |  1+1  | signed bitfield extract (see below)
 E1   | ULOADBF sz,pos |  1+1  | unsigned bitfield extract (see below)
 E2   | STOREBF sz,pos |  1+1  | bitfield insert (see below)
 E3   | JUMP back      |   1   | instrptr += <back>-256
 E4   | JUMPT back     |   1   | if (left != 0) instrptr += <back>-256
 E5   | JUMPF back     |   1   | if (left == 0) instrptr += <back>-256
 E6   | JUMP ahead     |   1   | instrptr += <ahead>
 E7   | JUMPT ahead    |   1   | if (left != 0) instrptr += <ahead>
 E8   | JUMPF ahead    |   1   | if (left == 0) instrptr += <ahead>
 E9   | CALL abs,imm1  |  2+1  | call <abs>; stackptr += (uint8)<imm1>
 EA   | CALLPTR imm1   |   1   | call *left; stackptr += (uint8)<imm1>
EB-FF | (illegal)      |  ---  | (illegal)

Switch instructions
There are two switch instructions, one ($D5) for contiguous cases and the other ($D9) for noncontiguous cases. They work exactly like a C switch statement with the left register as the variable to test.

For the contiguous switch, the first 16-bit word after the instruction is the twos-complement negative of the smallest case value. The next 16-bit word is the number of cases. Then comes the default jump target, followed by the target for each in-range value of left. Example:
Code:
.byte $D5 ; contiguous switch instruction
.word -4, 5 ; smallest case 4, 5 cases
.word DefaultTarget ; jumps to this address if left was < 4 or > 8
.word Target4 ; jumps to this address if left was 4
.word Target5 ; jumps to this address if left was 5
.word Target6 ; jumps to this address if left was 6
.word Target7 ; jumps to this address if left was 7
.word Target8 ; jumps to this address if left was 8

For the non-contiguous switch, the first 16-bit word after the instruction is the number of cases. Then comes a table of comparison values followed by jump targets. After all the value-target pairs, the default target comes last. Example:
Code:
.byte $D9 ; noncontiguous switch instruction
.word 5 ; 5 cases
.word 1, Target1
.word 4, Target4
.word 9, Target9
.word 16, Target16
.word 25, Target25
.word TargetDefault

Load effective address
Instructions $DE and $DF load a register with the value of the argument + the frame pointer, i.e. with the effective (absolute) address of a local variable. They can be used to pass a pointer to a local variable as a function argument, or in conjunction with the bitfield instructions.

Bitfields
There are three bitfield instructions: one to extract (load from) a signed bitfield, one to extract an unsigned bitfield, and one to insert (store to) a bitfield. For all three bitfield instructions, the first byte after the instruction is the size of the bitfield in bits, and the following byte is the bit position counting from the LSB. The entire bitfield must fit into a 16-bit word (i.e. size + pos < 16)

The two extraction instructions work like the dereference instruction: the address of the bitfield is in left, and the bitfield also gets extracted to left. The insertion instruction takes the value to insert in left, and the address to insert to in right (like the pop-store instruction without the pop).
Re: Koei bytecode
by on (#195912)
This is really neat! Thanks for documenting it.
Re: Koei bytecode
by on (#195913)
32 BIT INSTRUCTION SET
Code:
code  | nickname       | args  | operation
------+----------------+-------+-------------------------------------
 B700 | (illegal)      |  ---  | (illegal)
 B701 | LONG MULT      |   0   | left = left * right
 B702 | LONG SDIV      |   0   | left = (int32)left / (int32)right
 B703 | LONG ADD       |   0   | left = left + right
 B704 | LONG SUB       |   0   | left = left - right
 B705 | LONG MINUS     |   0   | left = -left
 B706 | LONG CMPEQ     |   0   | left = (left == right)
 B707 | LONG CMPNE     |   0   | left = (left != right)
 B708 | LONG SCMPLT    |   0   | left = ((int32)left < (int32)right)
 B709 | LONG SCMPLE    |   0   | left = ((int32)left <= (int32)right)
 B70A | LONG SCMPGT    |   0   | left = ((int32)left > (int32)right)
 B70B | LONG SCMPGE    |   0   | left = ((int32)left >= (int32)right)
 B70C | LONG LOADL far |   2   | left = <far>
 B70D | LONG LOADR far |   2   | right = <far>
 B70E | LONG STORE far |   2   | <far> = left
 B70F | LONG PUSH far  |   2   | push <far>
------+----------------+-------+-------------------------------------
code  | nickname       | args  | operation
------+----------------+-------+-------------------------------------
 B710 | LONG LOADL abs |   2   | left = <abs>
 B711 | LONG LOADR abs |   2   | right = <abs>
 B712 | LONG STORE abs |   2   | <abs> = left
 B713 | LONG PUSH abs  |   2   | push <abs>
 B714 | LONG PUSHL     |   0   | push left
 B715 | LONG POPR      |   0   | pop right
 B716 | LONG DEREF     |   0   | left = *(int32 *)left
 B717 | LONG POPSTORE  |   0   | pop right; *(int32 *)right = left
 B718 | LONG LOADL imm4|   4   | left = <imm4>
 B719 | LONG LOADR imm4|   4   | right = <imm4>
 B71A | LONG SWAP      |   0   | swap left <-> right
 B71B | LONG INC       |   0   | ++left
 B71C | LONG DEC       |   0   | --left
 B71D | LONG BOOL      |   0   | left = ((int32)left != 0)
 B71E | LONG SMOD      |   0   | left = (int32)left % (int32)right
 B71F | LONG LSHIFT    |   0   | left = left << right
------+----------------+-------+-------------------------------------
code  | nickname       | args  | operation
------+----------------+-------+-------------------------------------
 B720 | LONG SRSHIFT   |   0   | left = (int32)left >> right
 B721 | LONG COMPL     |   0   | left = ~left
 B722 | LONG AND       |   0   | left = left & right
 B723 | LONG OR        |   0   | left = left | right
 B724 | LONG XOR       |   0   | left = left ^ right
 B725 | LONG SEXTEND   |   0   | left = (int32)(int16)left
 B726 | LONG UEXTEND   |   0   | left = (uint32)(uint16)left
 B727 | LONG NOP       |   0   | no operation
 B728 | LONG NOT (?)   |   0   | apparently bugged (see below)
 B729 | LONG UCMPLT    |   0   | left = ((uint32)left < (uint32)right)
 B72A | LONG UCMPLE    |   0   | left = ((uint32)left <= (uint32)right)
 B72B | LONG UCMPGT    |   0   | left = ((uint32)left > (uint32)right)
 B72C | LONG UCMPGE    |   0   | left = ((uint32)left >= (uint32)right)
 B72D | LONG URSHIFT   |   0   | left = (uint32)left >> right
 B72E | LONG UDIV      |   0   | left = (uint32)left / (uint32)right
 B72F | LONG UMOD      |   0   | left = (uint32)left % (uint32)right

Extension and bool conversion
16-bit instructions only use the lower halves of the registers and leave stale data in the upper halves, so it is necessary to sign-extend or zero-extend when doing arithmetic on mixtures of 16-bit and 32-bit data. Conversely, the bool conversion instruction is needed because the conditional jump instructions only check the (non-)zeroness of the lower half of left.

Instruction $B728
Instruction $B728 is probably supposed to be the long version of logical NOT ($CA) but actually has the same effect as instruction $B71D. None of the games I have inspected uses this instruction (the binary sequence $B7 $28 is nowhere to be found in the ROMs) so if it is a bug it seemingly went undetected for as long as Koei used this interpreter.
Re: Koei bytecode
by on (#195920)
This is really cool and I think it definitely at least deserves a page on the NESdev wiki for posterity. Thanks for documenting it.
Re: Koei bytecode
by on (#195927)
Wiki page seconded, it'll be easier to organize and maintain.
Re: Koei bytecode
by on (#195941)
Neat!

It seems more appropriate for Datacrystal or Zophar as to where to document, but yes to archival.
Re: Koei bytecode
by on (#195957)
This is so awesome! Thanks so much for doing all the hard work.

Looking forward to working on a disassembler or decompiler, unless someone else gets around to it first.
Re: Koei bytecode
by on (#196544)
I guess someone should write an open source interpreter too. :X
Re: Koei bytecode
by on (#196547)
^^^ I guess you just volunteered?
Quote:
Bitfields
There are three bitfield instructions: one to extract (load from) a signed bitfield, one to extract an unsigned bitfield, and one to insert (store to) a bitfield. For all three bitfield instructions, the first byte after the instruction is the size of the bitfield in bits, and the following byte is the bit position counting from the LSB. The entire bitfield must fit into a 16-bit word (i.e. size + pos < 16)
I was going to ask if the fields were little- or big-endian, but I suppose unless you're trying to make machines interoperable with the states of each other (hardly necessary) it doesn't matter.

Quote:
CA | NOT | 0 | left = !left
CB | MINUS | 0 | left = -left
CC | COMPL | 0 | left = ~left
with all three existing, I'm curious what NOT does. Is it LEFT[15:0] = (LEFT[15:0] ==0 ? 1 : 0)?

ed: fixed to only write 16 bits to answer.
Re: Koei bytecode
by on (#196548)
Myask wrote:
^^^ I guess you just volunteered?
Quote:
Bitfields
There are three bitfield instructions: one to extract (load from) a signed bitfield, one to extract an unsigned bitfield, and one to insert (store to) a bitfield. For all three bitfield instructions, the first byte after the instruction is the size of the bitfield in bits, and the following byte is the bit position counting from the LSB. The entire bitfield must fit into a 16-bit word (i.e. size + pos < 16)
I was going to ask if the fields were little- or big-endian, but I suppose unless you're trying to make machines interoperable with the states of each other (hardly necessary) it doesn't matter.


Little-endian. That's what I meant by "counting from the LSB". A bitfield with a position of 0 is aligned with the LSB of the word it's in, so no shifting is needed on insertion or extraction.
Re: Koei bytecode
by on (#196786)
Myask wrote:
Quote:
CA | NOT | 0 | left = !left
CB | MINUS | 0 | left = -left
CC | COMPL | 0 | left = ~left
with all three existing, I'm curious what NOT does. Is it LEFT = (LEFT[15:0] ==0 ? 1 : 0)?


Yes.
Re: Koei bytecode
by on (#196903)
I've been thinking about something similar for the 65816, but with the "word-code" being a list of addresses to jump to.
Re: Koei bytecode
by on (#196979)
Why not look at and extract the 16bit KOEI titles to see if they already have the above in 16bit?
Re: Koei bytecode
by on (#197014)
psycopathicteen wrote:
I've been thinking about something similar for the 65816, but with the "word-code" being a list of addresses to jump to.

That of course is what Forth normally does, at least in the indirect-threaded code (ITC) and direct-threaded code (DTC) models. On the 6502.org forum, Bruce Clark presents the idea of using the 16-bit S (stack-pointer) register as the instruction pointer in DTC Forth on the '816, at http://forum.6502.org/viewtopic.php?t=586 . Basically the program itself is on the hardware stack! The earlier idea of a two-instruction NEXT in ITC Forth on the '816, which he mentions at the top of the head post, is discussed at http://forum.6502.org/viewtopic.php?t=584 . Just when you get all comfy with some method, along comes Bruce with some wild idea to turn everything upside down and get you to consider a very different approach that may be a lot better in some situations.
Re: Koei bytecode
by on (#197052)
If you use the stack, you can easily do long addressing.

plx
plb
inc $0000,x
rts
Re: Koei bytecode
by on (#198068)
Oziphantom wrote:
Why not look at and extract the 16bit KOEI titles to see if they already have the above in 16bit?


Apparently their early SNES titles (e.g. Romance of the Three Kingdoms II) do use the same bytecode. Because the virtual machine has no support for an address space larger than 16 bits, they have to constantly copy bytecode and data overlays into RAM, simulating a bankswitched 64K machine.

I had a very brief look at a couple of later SNES Koei games (Uncharted Waters 2 and Genghis Khan 2) and they both look like compiler-generated 65816 code (tons of stack-relative addressing, etc.). So it looks like Koei switched to a different compiler at some point, one that produced native 65816 rather than bytecode. It's a bit interesting that the Famicom version of Genghis Khan 2 (Aoki Ookami to Shiroki Mejika: Genchou Hishi) is interpreted while the SNES version is native 65816.
Re: Koei bytecode
by on (#198330)
I'm now investigating the assembly language initialization/NMI/syscall code, or "BIOS" as Koei apparently called it. In the MMC5 games there appear to be three major versions of the BIOS: one used in Nobunaga's Ambition 2, Bandit Kings, Ishin no Arashi and Rot3K2 (with minor changes per game); one used in Uncharted Waters and L'Empereur; and a final version used in Gemfire, Genghis Khan 2 and Nobunaga's Ambition 3 (aka Lord of Darkness on the SNES and Genesis).

The US versions of Nobunaga's Ambition 2, Bandit Kings and Rot3K2 are actually kind of hybrids between BIOS version 1 and 2: the RAM addresses they use are version 1 (a ton of variables got shuffled in and out of zero page between the versions--I guess someone did some profiling :)) but a number of code changes from version 2 were backported in. The backported changes appear to be mainly sound related--version 1 uses the MMC5 channels exclusively for melodic sound effects, but of course those channels aren't usable on a stock NES, so they had to backport the version 2 sound engine which can share channels between BGM and effects in order to move those sound effects to the built-in APU channels.
Re: Koei bytecode
by on (#201416)
oh, lol... accidentally I'm doing the same re work for a couple of week already, but haven't seen this topic by now...
I was going to write some kind of documentation too, but seems this is not needed anymore. here is a complete info already, better than I can write myself. I even got here some fixes to my custom disasm scripts...

as a side note, I'd like to mention, KOEI's C compiler (strong evidence this is actually a C code is here https://tcrf.net/Aerobiz_Supersonic_(SNES)) has an ability to produce the native 6502 code as well as bytecode at least since 1988. not only the SNES version of uncharted waters 2 has this auto generated native code, but almost every NES title. usually, most of the native code you could see in the NES games by KOEI is autogenerated. except the interpretator itself, bios and some helpers. the rest is for sure autogenerated, maybe only with later manual polishing... looks like they wanted to make some portions of code work faster, or even did some profiling for most frequently used functions. seems they just couldn't compile all the code in native 6502 since it uses a lot more space. the game with maximum autogenerated native code is L'Empereur, almost half of scrips compiled as native code.

if someone interested, I can provide some practical tools to decompile these scrips (ida loader+scripts). doubt I will recreate some kind of KOEI's compilator. but at least I can provide a solution to reassemble the existing code with all links and functions. with using some asm macroses, you could rewrite all bytecode in a nice looking readable and editable sourses. maybe for those who wanted to translate some japaneseonly games (secrets and unused debug already found by me lol, the main reason I started to do all this thing myself)...
Re: Koei bytecode
by on (#201457)
CaH4e3 wrote:
if someone interested, I can provide some practical tools to decompile these scrips (ida loader+scripts). doubt I will recreate some kind of KOEI's compilator. but at least I can provide a solution to reassemble the existing code with all links and functions. with using some asm macroses, you could rewrite all bytecode in a nice looking readable and editable sourses. maybe for those who wanted to translate some japaneseonly games (secrets and unused debug already found by me lol, the main reason I started to do all this thing myself)...


I don't own or use IDA (I've written my own extensible disassembler instead) but I'm interested in seeing your decompiler script anyway.

Did you discover if the apparently redundant bytes in the stack frame are used for anything?
Re: Koei bytecode
by on (#201526)
so, the scripts mostly useless for you without ida. they firmly depends on my other custom disasm scripts. but I may try to make some demo what they do and what I can do with it...

and no, I haven't digeed so much in stack to check if there any oddities ;) i'm sure, any oddities, is just a genuine "bugs" or redundancy of the compiler. the bioscall always copy 22 bytes of stack in arguments buffer in any case, even if there is only one or two args (so maybe because of that they did some fast versions of bios call using only first arg to switch banks or something..)
Re: Koei bytecode
by on (#201563)
CaH4e3 wrote:
so, the scripts mostly useless for you without ida. they firmly depends on my other custom disasm scripts. but I may try to make some demo what they do and what I can do with it...

and no, I haven't digeed so much in stack to check if there any oddities ;) i'm sure, any oddities, is just a genuine "bugs" or redundancy of the compiler. the bioscall always copy 22 bytes of stack in arguments buffer in any case, even if there is only one or two args (so maybe because of that they did some fast versions of bios call using only first arg to switch banks or something..)


Even if I can't run your scripts without IDA, I can port them to my own disassembler :)

Well, I've found one use for the "hole" at frameptr+8. Your tip that there was also compiler-generated native code was just what I needed. Take a look at the routine at 1F/E24B in Ishin no Arashi, 1F/E37A in Rot3K2 or 1F/E2E3 in L'Empereur (the same routine exists in BKoAC and probably most/all of the games, but it's not always in the fixed bank. I guess the games that have it in the fixed bank are the ones with the most compiler-generated native code)

The purpose of this routine seems to be to create a bytecode-compatible stack frame for a native-code function, and to clean up that stack frame when that function exits (it does some 6502 stack manipulation to wedge its cleanup code after the RTS of the "wrapped" function). Anyway, this wrapper routine uses the byte at frameptr+8 to store the number of bytes it has copied from $0080 to the bytecode stack, in order to restore that number of bytes back to $0080 when the wrapped native-code function exits.

It looks like compiler-generated native code uses $0080~ as local variable storage, and each native-code function uses this wrapper to preserve its caller's local variables (if any) and to allocate stack space for the arguments it needs to pass to any functions that it calls.
Re: Koei bytecode
by on (#201613)
AWJ wrote:
Even if I can't run your scripts without IDA, I can port them to my own disassembler :)


enjoy https://pastebin.com/sMW8wLXS
me commented some portions of code, the rest is my own or system library functions...

all nes koei games has a full list of bitecode procedures and can be disassembled automatically using my scripts and ida ;). but the offsets are for my own loader, they has no correlation with real offsets except lower bits...

some opcodes I didn't reversed deeply and just nailed briefly, I took from your description (comments are given). bitfield opcodes never used in the nes bytecode (as I see), as well as the relative branches... so I just didn't care about it by now..

AWJ wrote:
Well, I've found one use for the "hole" at frameptr+8. Your tip that there was also compiler-generated native code was just what I needed. Take a look at the routine at 1F/E24B in Ishin no Arashi, 1F/E37A in Rot3K2 or 1F/E2E3 in L'Empereur (the same routine exists in BKoAC and probably most/all of the games, but it's not always in the fixed bank. I guess the games that have it in the fixed bank are the ones with the most compiler-generated native code)

The purpose of this routine seems to be to create a bytecode-compatible stack frame for a native-code function, and to clean up that stack frame when that function exits (it does some 6502 stack manipulation to wedge its cleanup code after the RTS of the "wrapped" function). Anyway, this wrapper routine uses the byte at frameptr+8 to store the number of bytes it has copied from $0080 to the bytecode stack, in order to restore that number of bytes back to $0080 when the wrapped native-code function exits.

It looks like compiler-generated native code uses $0080~ as local variable storage, and each native-code function uses this wrapper to preserve its caller's local variables (if any) and to allocate stack space for the arguments it needs to pass to any functions that it calls.


this is actually a native version of the bytecode exec procedure. it is used for every native procedure start. the same way as for bytecode procedures. but instead this functions executes the real 6502 asm and then returns back to the caller routine. it does the same thing as for bytecode procedures, prepares the stack and executes the native code instead of the bytecode. it gets the same 16-bit signed value to assign the local variables buffer for procedure, but has an extra parameter byte, which is used to backup a number of vm stack bytes to 80-buffer. it will be stored in your 8-byte stack offset. but, I haven't seen any functions used any value apart from 0 here. l'empereur and uncharted waters never used this byte for sure. it's always 0. so copy to 80 buffer never used there. I doubt it used anywhere else... so this is for sure redundant leftover of some planned but never used feature of the compiler to save some stack parameters..
Re: Koei bytecode
by on (#201663)
Thanks to your script I see what the return address at frameptr+9 is used for, too. It isn't used by the bytecode interpreter but it is used by the native-code function call wrapper (the routine that's approximately equivalent to bytecode opcode $E9). I guess they wanted to make the stack frames identical between bytecode and native-code functions to make debugging easier.

A funny thing about that routine is that the "stack adjust amount" parameter that it takes includes the return address! So even function calls with no arguments need to use a value of 2.

Quote:
this is actually a native version of the bytecode exec procedure. it is used for every native procedure start. the same way as for bytecode procedures. but instead this functions executes the real 6502 asm and then returns back to the caller routine. it does the same thing as for bytecode procedures, prepares the stack and executes the native code instead of the bytecode. it gets the same 16-bit signed value to assign the local variables buffer for procedure, but has an extra parameter byte, which is used to backup a number of vm stack bytes to 80-buffer. it will be stored in your 8-byte stack offset. but, I haven't seen any functions used any value apart from 0 here. l'empereur and uncharted waters never used this byte for sure. it's always 0. so copy to 80 buffer never used there. I doubt it used anywhere else... so this is for sure redundant leftover of some planned but never used feature of the compiler to save some stack parameters..


It's actually the other way around: on function entry it copies n bytes from $80 to the stack, and on function exit it copies them from the stack back to $80.

I thought that feature might be used for local variables that are pointers (and thus have to be in zero page) but I guess the compiler-generated native code just copies pointer variables from the stack to scratch space (e.g. $08-$14) upon use, rather than keeping them in zero page for the entirety of a function?

I've reverse engineered the entire MMC5 BIOS, all three versions (except a couple of version-1-only syscalls that I can't figure out and might actually be incomplete/nonfunctional) Are you interested in a description or have you already done it yourself? I've also reverse engineered the sound program if anyone is interested (the one used in Mahjong Taikai, Famicom Top Management and all the MMC5 games--the first three MMC1 games have a totally different sound program which I've barely looked at)
Re: Koei bytecode
by on (#201816)
AWJ wrote:
It's actually the other way around: on function entry it copies n bytes from $80 to the stack, and on function exit it copies them from the stack back to $80.

I thought that feature might be used for local variables that are pointers (and thus have to be in zero page) but I guess the compiler-generated native code just copies pointer variables from the stack to scratch space (e.g. $08-$14) upon use, rather than keeping them in zero page for the entirety of a function?

yeah, you right. but it seem just more like an attempt to have a separate set of bytecode registers for a native procedures which is overwrites those from procedural stack. if used, they may run two separate native procedures with the same set of registers... maybe for cases when some other inbetween procedure changes it somehow... but anyway, it's never used

AWJ wrote:
I've reverse engineered the entire MMC5 BIOS, all three versions (except a couple of version-1-only syscalls that I can't figure out and might actually be incomplete/nonfunctional) Are you interested in a description or have you already done it yourself? I've also reverse engineered the sound program if anyone is interested (the one used in Mahjong Taikai, Famicom Top Management and all the MMC5 games--the first three MMC1 games have a totally different sound program which I've barely looked at)


I nailed some functions for some games of my personal interest but maybe someone will like to see it too ;)
Re: Koei bytecode
by on (#201848)
After disassembling some native-code routines in Ishin no Arashi, I understand some things I didn't understand before. In the compiler-generated native code, zero page address $06-$07 (which in the interpreter is the instruction pointer) is used as a second frame pointer instead, and contains the value frameptr - fastlocals - 256. The "- 256" is because the 6502 doesn't support indexed addressing with a negative index. Local variables on the stack are addressed using ($06),Y where Y = #$FE for the first (word-sized) local, #$FC for the second local, etc. I was wondering why the stack frame setup did DEC $07 before jumping to the wrapped code and INC $07 after returning from it, but now I see why.

ETA: Bandit Kings of Ancient China does use the "fast local variables at $80" feature; here's an example of a function that uses it (output from my disassembler):

Code:
14/AAE0: 20 99 B1      jsr CreateStackFrame
14/AAE3: 04 26 FF      frame 4,-218
14/AAE6: A0 0B         ldy #$0B
14/AAE8: B1 04         lda (frameptr),y
14/AAEA: 85 80         sta $80
14/AAEC: C8            iny
14/AAED: B1 04         lda (frameptr),y
14/AAEF: 85 81         sta $81
14/AAF1: C8            iny
14/AAF2: B1 04         lda (frameptr),y
14/AAF4: 85 82         sta $82
14/AAF6: C8            iny
14/AAF7: B1 04         lda (frameptr),y
14/AAF9: 85 83         sta $83
14/AAFB: 8A            txa
14/AAFC: 48            pha
14/AAFD: 48            pha
14/AAFE: 48            pha
(snip...)
Re: Koei bytecode
by on (#201899)
AWJ wrote:
ETA: Bandit Kings of Ancient China does use the "fast local variables at $80" feature; here's an example of a function that uses it (output from my disassembler):


I see native code uses it as a vars/regs as well

Code:
BANK14:A61B          loc_20661B:                             ; CODE XREF: BANK14:A64B^Yj
BANK14:A61B                                                  ; BANK14:loc_20669C^Yj
BANK14:A61B 18                       CLC                     ; *a
BANK14:A61C A9 07                    LDA     #7              ; *a
BANK14:A61E 65 80                    ADC     byte_80         ; *a
BANK14:A620 85 80                    STA     byte_80         ; *a
BANK14:A622 8A                       TXA                     ; *a
BANK14:A623 65 81                    ADC     byte_81         ; *a
BANK14:A625 85 81                    STA     byte_81         ; *a
BANK14:A627
BANK14:A627          loc_206627:                             ; CODE XREF: BANK14:A618^Xj
BANK14:A627 A1 80                    LDA     (byte_80,X)     ; *a
BANK14:A629 A0 01                    LDY     #1              ; *a
BANK14:A62B 11 80                    ORA     (byte_80),Y     ; *a
BANK14:A62D F0 03                    BEQ     loc_206632      ; *a
BANK14:A62F 4C 4E A6                 JMP     loc_20664E      ; *a

it seems, most of native code may be decompiled the same way as bytecode, except maybe some optimizations... like they just unroll all bytecode with an native asm representing every single command, maybe with some opts
Re: Koei bytecode
by on (#203424)
Heehee... can you find the bugs in these implementations of the C library functions abs() and strlen()? Hints: stackptr+2 points to the first argument on the stack, and left holds the (16-bit) function return value (meaning left+2 and so forth are effectively scratch space)

Code:
abs:
    ldy #2
    lda (stackptr),y
    sta left
    iny
    lda (stackptr),y
    sta left+1
    bpl done
    eor #$FF
    sta left+1
    lda left
    eor #$FF
    clc
    adc #1
    sta left
done:
    rts


Code:
strlen:
    ldy #3
    lda (stackptr),y
    sta left+3
    dey
    lda (stackptr),y
    sta left+2
    ldy #0
    sty left
    sty left+1
loop:
    lda (left+2),y
    beq done
    iny
    bne loop
    inc left+1
    jmp loop
done:
    sty left
    rts


I'd be surprised if the abs() bug doesn't impact game logic in at least one game, most likely in something non-obvious like AI (not all the games contain all the library functions, but the games that have abs() do call it...)
Re: Koei bytecode
by on (#203443)
Abs does not increment higher nibble of negative value after inverting it. Strlen does not increment hi nibble of str pointer as well as the result. But seems they have not so much strings longer than 256 or they rarely need strlen counting...
Re: Koei bytecode
by on (#212979)
I missed this thread, very interesting stuff!

I recently reverse engineered parts of Super Robot Wars Gaiden Masōkishin, and it turns out that the game is entirely orchestrated by a bytecode interpreter. I couldn’t say if the bytecode is compiled from C, but I would guess not actually.

The instruction set includes all the primitives to facilitate that (the expected load, store, goto, call, return, etc), but a lot of the opcodes do very specific game engine stuff. So the bytecode is maybe more likely compiled from some simple bespoke scripting language. I specifically dug into it to alter the behavior of music playback/queuing, so I’ve not looked at enough bytecode to say anything with certainty. If anyone is interested to look deeper into this I’d be happy to clean up my notes and collaborate on a deeper reverse engineering effort!