This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019

What's the fastest way to write repeating data to the PPU?

by on (#39950)
Suppose you have a single byte value you'd like to write to successive locations in PPU memory. Assuming that PPUADDR ($2006) has already been set, bit 2 of PPUCTRL ($2000) = 0 (i.e. VRAM address increment = 1) and zpBuffer is a zero page variable, the code might look something like this:

Code:
   LDA zpBuffer
   STA PPUDATA
   STA PPUDATA
   ...
   STA PPUDATA
   STA PPUDATA


Now suppose you want to do the same thing, only this time with a repeating word value (2 bytes). Then you might throw the X register into the mix:
Code:
   LDA zpBuffer+0
   LDX zpBuffer+1
   STA PPUDATA
   STX PPUDATA
   ...
   STA PPUDATA
   STX PPUDATA


And if you really want to get crazy, there's even the Y register for those wacky repeating 3-byte sequences:
Code:
   LDA zpBuffer+0
   LDX zpBuffer+1
   LDY zpBuffer+2
   STA PPUDATA
   STX PPUDATA
   STY PPUDATA
   ...
   STA PPUDATA
   STX PPUDATA
   STY PPUDATA


My question is this: What's the fastest way (fewest cycles) to extend the above pattern beyond 3-byte sequences? The best I can think of is to use a single LDX and a single LDY to hold 2 of the values, then use as many LDA instructions (zero page) as necessary for the remaining values.
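
For illustration, here is a minimal sketch of that idea for a 4-byte pattern. The address chosen for zpBuffer is a hypothetical placeholder, and the cycle counts assume zero page loads and absolute stores:

```asm
PPUDATA  = $2007
zpBuffer = $00          ; hypothetical zero page address of the 4-byte pattern

   ; X and Y hold bytes 1 and 2 for the whole run
   LDX zpBuffer+1       ; 3 cycles, done once
   LDY zpBuffer+2       ; 3 cycles, done once

   ; One repetition: 22 cycles for 4 bytes
   LDA zpBuffer+0       ; 3 cycles
   STA PPUDATA          ; 4 cycles
   STX PPUDATA          ; 4 cycles
   STY PPUDATA          ; 4 cycles
   LDA zpBuffer+3       ; 3 cycles
   STA PPUDATA          ; 4 cycles
   ; ...repeat the six instructions above as many times as needed
```

Each byte beyond the two kept in X and Y costs one extra 3-cycle LDA per repetition.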

I originally thought it would be cool to set up the buffer at the bottom of the stack (say, $100-$10F), then repeatedly "pull" successive values into PPUDATA. But alas, it doesn't seem like it's possible to pull values directly into memory.

So, are there any tricks out there I might be overlooking?

by on (#39952)
Change the LDA ZP to LDA #IMM and use self-modifying code. I don't think you can get any faster than that. Just prep the code at a point that isn't time-critical. If the number of writes to VRAM varies, set up a system to keep track of where you put the RTS opcode, so that you can undo it and place a new one for a different length of the series of load/store opcodes, etc.
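
A rough sketch of how this could look, assuming the unrolled routine has already been copied into RAM (it has to live in RAM to be patched). The names ramCode and zpBuffer and their addresses are hypothetical:

```asm
PPUDATA  = $2007
zpBuffer = $00          ; hypothetical zero page source of the new data
ramCode  = $0300        ; hypothetical RAM address of the unrolled routine

; Assumed layout of the routine at ramCode, 5 bytes per data byte:
;   ramCode+0: LDA #imm     (opcode $A9, operand at ramCode+1)
;   ramCode+2: STA PPUDATA  ($8D $07 $20)
;   ramCode+5: LDA #imm     (opcode $A9, operand at ramCode+6)
;   ramCode+7: STA PPUDATA
;   ...and so on, ending with RTS

   ; Outside the time-critical part: patch the immediate operands
   LDA zpBuffer+0
   STA ramCode+1        ; operand of the first LDA #imm
   LDA zpBuffer+1
   STA ramCode+6        ; operand of the second LDA #imm

   ; Later, during vblank, with PPUADDR already set:
   JSR ramCode          ; each byte costs LDA #imm (2) + STA abs (4) = 6 cycles
```

Compared with LDA zero page (3 cycles), the immediate load saves one cycle per byte written, at the cost of the patching done beforehand.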

by on (#39953)
You, sir, are a madman. Thanks for that awesome suggestion!

by on (#39955)
Quote:
I originally thought it would be cool to set up the buffer at the bottom of the stack (say, $100-$10F), then repeatedly "pull" successive values into PPUDATA. But alas, it doesn't seem like it's possible to pull values directly into memory.

This is perfectly possible: the PLA instruction takes only one byte and only 4 cycles, which is interesting (okay, it takes the same number of cycles as LDA $100,X, but the index is incremented automatically).
But yeah, the fastest possible is self-modifying code. Also, you can replace the long runs of STAs with a loop, which avoids the main drawback of unrolled loops: they eat an insane amount of ROM space for a small speed gain.
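
A minimal sketch of the PLA approach, assuming the buffer starts at $0100, the real stack never grows down that far, and nothing pushes to the stack (no interrupt, no JSR) while the stack pointer is redirected. The variable savedS is a hypothetical name:

```asm
PPUDATA = $2007
savedS  = $10           ; hypothetical zero page variable

   TSX
   STX savedS           ; remember the real stack pointer
   LDX #$FF
   TXS                  ; PLA increments S first, so the next pull reads $0100

   PLA                  ; 4 cycles, reads $0100
   STA PPUDATA          ; 4 cycles
   PLA                  ; reads $0101
   STA PPUDATA
   ; ...one PLA/STA pair per buffered byte

   LDX savedS
   TXS                  ; restore the real stack pointer
```

At 8 cycles per byte this is slower than the register tricks above, but it handles arbitrary data rather than a short repeating pattern, and each pair is only 4 bytes of code.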

by on (#39959)
Battletoads did something like this for the Snake Pit level.

Image

You can see that the status bar is many scanlines lower than it is in other levels, since it is updating a LOT of tiles every frame.
You can also see that the PAL version has the status bar at the top. Updating that many tiles on PAL hardware is no big deal.

FCEU has glitches in the Snake Pit level on NTSC.

by on (#39960)
Oh wow, thanks for the explanation on that snake level thing, Dwedit. I've been playing a lot of Battletoads lately. My TV has pretty bad overscan, so I can never see my score. I can see it on that level though, and it's something that my wife and I thought was weird.