This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019 Download

ppunroll - optimize PPU writes

ppunroll - optimize PPU writes
by on (#164034)
Hey there,

today I'll share a little command line tool, probably most useful if your project uses CHR RAM.

It does this:
Code:
;data.chr
00 00 00 00 BB BB BC 00

Code:
C:\>ppunroll data.chr
ldx #$00
stx $2007 ;00
stx $2007 ;00
stx $2007 ;00
stx $2007 ;00
ldy #$BB
sty $2007 ;BB
sty $2007 ;BB
iny
sty $2007 ;BC
stx $2007 ;00
;-O7
;size:29
;cycles:38


Personally, I use it for the animation of the main character in Banana Nana, freeing up a substantial amount of space for enemies in the sprite pattern table - while still leaving enough vblank time for animated background tiles.
ROM space does not seem to be much of an issue for homebrews nowadays, so why not!

Another use case: Let's say you have 70 cycles left in vblank...
Code:
C:\>ppunroll -Oc 70 data.chr
ldx #$00
jsr stx4  ;00 00 00 00
ldy #$BB
jsr sty2  ;BB BB
iny
sty $2007 ;BC
stx $2007 ;00
;-O1
;size:17
;cycles:62

The tool automatically chooses a lower optimization level in this case, which stays below 70 cycles, but generates smaller code (see subroutines.asm in the zip archive).

Be warned, the output may often not be optimal, but it should be usable at least. I'll see if I can improve the algorithm one of these days.

Let me know if you find any bugs.

ATTENTION: Check out russelsprouts's post below for his tool, which should generate better output!
Re: ppunroll - optimize PPU writes
by on (#164036)
miau wrote:
Code:
;data.chr
00 00 00 BB BB AA BC 00

Looks like you made a mistake here (doesn't match output).

miau wrote:
Download:
ppunroll-1.0 (C++ source and windows binary)

Might be better to use attachments for posterity.
Re: ppunroll - optimize PPU writes
by on (#164038)
Of course that had to happen! :)
Fixed and attached file instead.
Re: ppunroll - optimize PPU writes
by on (#164042)
Ah, I started working on this idea a few years back (thread), but never released a proper converter. Recently I even worked on a new encoder, which attempted to use other instructions (ASL, LSR, ROR, ROL, INX, DEX, INY, DEY) to save a little space, but the results were less than exciting so I dropped it. I'm still considering using this for animating the player's sprite in my current project, but I'm trying more conventional approaches first.
Re: ppunroll - optimize PPU writes
by on (#164094)
Heh, I didn't even consider bit shifts. Some interesting contributions in that thread. It would be nice to have a tool that generates a perfect sequence of instructions or comes very close at least, but it seems to be quite a challenge indeed. I feel I'm not good enough of a programmer to pull that off in any reasonable amount of time and it probably wouldn't be worth it. Yet, if anyone else feels up to the task, I'd be extremely interested to see what he/she comes up wih.

I needed something to integrate into my build chain and this tool does nicely so far. If there's a need to squeeze off a few bytes for the final build, once all graphics are set in stone, one can still optimize manually.
Re: ppunroll - optimize PPU writes
by on (#164228)
Isn't NES code that does this into (W)RAM basically what one wants to do for optimizing PPU bandwidth?
Re: ppunroll - optimize PPU writes
by on (#164230)
Yes, but WRAM isn't always available, and filling a buffer like this is somewhat slow, since you only write every 5th byte, but if you have some ROM available, you can still benefit from the faster VRAM transfers for commonly used data, such as patterns.
Re: ppunroll - optimize PPU writes
by on (#164269)
I thought about it, and I wrote a program that can find the shortest (in bytes) program that takes the least number of cycles. There is a linear algorithm, but unfortunately it has very high constants, so it is really slow -- probably impractical. I'll see if I can speed it up and release it. EDIT: Turns out the program does work on smallish inputs.

Basically, it performs a graph search to find the shortest path to writing all the bytes, where the length is decided by cycle count, and ties are broken by byte length. A node is a 4-tuple of index, a, x, and y, which represents the current state of the registers and the current index in the buffer. The start node is all registers uninitialized and an index of -1 (nothing written). The goal node is any node with index at the end of the buffer. An edge goes from one state to another if there is an operation to go from the state to the other. Each operation is something like "lda #00; sta PPUDATA" or "stx PPUDATA".

The graph is a DAG -- the index always increases by 1 when going from one node to another. DAGs have a O(V) shortest path algorithm -- visit each node in topographic order. The shortest path to the node is the shortest path from any incoming edge.

So, the algorithm looks like this:



Code:
Initialize a queue with the start node.
While there are nodes in the queue:
  pop a node from the queue.
  if the node's index is the last in the input buffer, check if it is the shortest.
  otherwise,
  Find all of the possible next nodes (nodes that are reachable with one operation. Up to 3 -- it only makes sense to consider
  the shortest node for each register. If the value is already in a register, just store it)
  For each possible next node:
    add the node to the queue if it isn't in it already.
    update the path to the node if it is cheaper to get to the node through the current node.


This algorithm is linear based on the number of nodes. To show that it is linear on the number of input bytes, consider the possible states at index i. One of the registers must be set to the input byte at i. Another can be uninitialized, or one of the other 255 values. The third can be uninitialized, or one of the remaining 254 values. (Note that the registers will never be equal). That gives 3 * 256 * 255 = 195840 possible states. This *is* a constant, so the algorithm is technically linear. In practice, the constant will be lower, so it is practical for < 1000 byte inputs.

I've attached the code. I'm on a Linux machine at the moment, but I'll attach a Windows binary later I've attached a Windows binary. It should only take a second for most inputs.
Re: ppunroll - optimize PPU writes
by on (#164601)
Awesome!!
I didn't get around to trying it, yet, but I've added a note to my original post so people will not miss it.
Re: ppunroll - optimize PPU writes
by on (#164618)
russellsprouts wrote:
I've attached the code.
These are wrong:
Quote:
Code:
    register_state rol() {
      return register_state(i+1, a << 1 | ((a >> 7) & 0x1), x, y, ai, xi, yi, ROL);
    }
    register_state ror() {
      return register_state(i+1, (a >> 1 & 0x7F) | ((a << 7) & 0x80), x, y, ai, xi, yi, ROR);
    }
—ROL and ROR rotate through the carry bit, not just within A.
Re: ppunroll - optimize PPU writes
by on (#164673)
You're right. In that case, I will have to keep track of the current state of the carry flag, along with the possibility that it is uninitialized. I'll post fixed code soon.

EDIT: the fixed code is above.
Re: ppunroll - optimize PPU writes
by on (#164692)
That's closer but still not quite right. :( ROL and ROR should update Carry the same as ASL and LSR.

Actually, now I'm just trying to elicit the behavior of "emitting a long series of ROL or ROR" with a byte sequence like FF 7F BF DF EF F7 FB FD FE FF and it's not cooperating, and I'm insufficiently C++-savvy to know why it seems to be not following the DAG down the path that involves the accumulator
Re: ppunroll - optimize PPU writes
by on (#164718)
You're right again, it slipped my mind when I was making the update. I've fixed the source code above. (But the Windows binary is still the original version.)

Quote:
FF 7F BF DF EF F7 FB FD FE FF

The reason it won't generate rotates in this example is because the rotate strategy takes 60 cycles and 41 bytes.
It always picks the fastest, breaking ties by number of bytes.
Quote:
Best: cycles: 58, bytes: 47 (inflation factor 4.7, 5.8 cycles per byte)
ldy #255
sty $2007 ; 255
ldx #127
stx $2007 ; 127
ldx #191
stx $2007 ; 191
ldx #223
stx $2007 ; 223
ldx #239
stx $2007 ; 239
ldx #247
stx $2007 ; 247
ldx #251
stx $2007 ; 251
ldx #253
stx $2007 ; 253
inx
stx $2007 ; 254
sty $2007 ; 255


However, if you remove the last two bytes so that there is no possibility of using an increment or decrement, you get the result you were looking for:
Quote:
Best: cycles: 48, bytes: 33 (inflation factor 4.125, 6 cycles per byte)
lda #255
sta $2007 ; 255
lsr
sta $2007 ; 127
ror
sta $2007 ; 191
ror
sta $2007 ; 223
ror
sta $2007 ; 239
ror
sta $2007 ; 247
ror
sta $2007 ; 251
ror
sta $2007 ; 253