This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019

Stack-based VRAM update system
by on (#155848)
In another thread I started talking about a way to do VRAM updates that is completely stack-based, and can be used for the entire game. I decided to start this new thread mainly to share what I have designed so far, possibly helping people in search of a VRAM update system, but also to get some input from you guys and maybe make this better.

The first thing I have is this piece of code in the NMI handler, after the OAM DMA, that takes care of running all the updates:

Code:
   ;swap stack pointers
   tsx
   stx RealSP
   ldx FakeSP
   txs

UpdateVRAM:

   ;set the output address and jump to the byte copy code
   pla
   sta $2006
   pla
   sta $2006
   rts

RestoreSP:

   ;restore the stack pointer
   ldx RealSP
   txs

For each VRAM update, an output VRAM address is pulled from the stack and written to $2006, and the RTS jumps to some point in the middle of the following unrolled loop, depending on the number of bytes that need to be copied:

Code:
Copy32Bytes:

   pla
   sta $2007

Copy31Bytes:

   pla
   sta $2007

   (...)

Copy2Bytes:

   pla
   sta $2007

Copy1Byte:

   pla
   sta $2007

CopyNothing:

   jmp UpdateVRAM

This will copy a certain number of bytes and JMP back to the update loop, so the next update can be processed. I decided that 32 bytes is the ideal length for this unrolled loop because it allows you to update the whole palette or a whole row of name table entries. You'll most likely only need more than that for pattern updates, but there's nothing wrong with breaking those up into pairs of tiles (32 bytes = 2 tiles).

When does the update loop end then, if the program keeps JMPing back to UpdateVRAM? Well, in order to process the updates as fast as possible, I decided not to explicitly check for the end of the updates, but instead feed the RTS of the last iteration of the update loop with the address of RestoreSP. Yes, this means that the last address written to $2006 is a bogus one (if it makes you feel better, you can write $0000, like some people do anyway), but since it takes 16 cycles to set that address, that's equivalent to the 8 non-taken branches that would be necessary inside the loop to check for a flag that breaks out of the loop. I'm optimizing for the worst case here, so whenever there are more than 8 updates, it's cheaper to set the address unnecessarily than to check for a flag on every update.

What about PPU address increments? Same thing. Whenever the increment mode has to be changed, the RTS is fed with the address of a routine that changes the increment mode, before RTSing again to reach the actual byte copy code.
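
Such a routine could be as simple as this (a sketch; shadow2000 is assumed to be a RAM copy of the last value written to $2000, and bit 2 of that register selects the VRAM increment):

Code:
SetIncrement32:

   ;switch to +32 increments, then RTS again to pull the actual
   ;byte copy entry point from the stack
   lda shadow2000
   ora #%00000100
   sta shadow2000
   sta $2000
   rts

SetIncrement1:

   ;switch back to +1 increments
   lda shadow2000
   and #%11111011
   sta shadow2000
   sta $2000
   rts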

All decisions so far were made to make the most of the vblank time, which means there's a lot of work to do before vblank starts so that the stack is formatted correctly. Personally, I'd rather use indexed addressing to fill the update stack, to avoid having to manipulate the stack pointer mid-frame and to avoid having to write all the data backwards. If you want to build the update stack using PHA, you'll have to do the following steps in the opposite order.

For each update you need to "schedule", you first write the output address for the data. Then you check whether the current PPU address increment mode is the one you need: if it isn't, you write the address (minus 1, since it will be reached with an RTS) of the routine that changes the increment mode. After that comes the entry point into the byte copy code (again minus 1), selected according to how many bytes you want to transfer, and finally the data itself.

Once all updates have been written, you need a bogus VRAM address, which can be anything really, but you can use $0000 if that makes you more comfortable. Then you need the address of RestoreSP - 1, because that's what will allow the update loop to end. Even if there are no updates at all, you still need these last 4 bytes so the program doesn't crash.
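
To make the layout concrete, here's a sketch of how one 3-byte update plus the terminator could be written with indexed addressing (the $0100 offsets, BufferIndex, the Data label and the Copy3Bytes entry point are just placeholders; FakeSP must end up pointing one byte below the first buffered byte):

Code:
   ldx BufferIndex        ;next free position in the stack page

   lda #$23               ;output VRAM address, high byte first ($23C0)
   sta $0100,x
   lda #$C0
   sta $0101,x

   lda #<(Copy3Bytes-1)   ;RTSable entry point, low byte first
   sta $0102,x
   lda #>(Copy3Bytes-1)
   sta $0103,x

   lda Data+0             ;the data itself
   sta $0104,x
   lda Data+1
   sta $0105,x
   lda Data+2
   sta $0106,x

   ;(...more updates would go here...)

   lda #$00               ;terminator: bogus address...
   sta $0107,x
   sta $0108,x
   lda #<(RestoreSP-1)    ;...followed by the pointer to RestoreSP
   sta $0109,x
   lda #>(RestoreSP-1)
   sta $010A,x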

Another interesting advantage of this method is that you can cheat and use the RTS to jump to other types of byte copying routines besides the chain of PLA + STA. For example, to update attribute table data, I prefer to copy the bytes directly from my shadow attribute tables, instead of wasting time copying them over to the stack. In cases like this, you can simply use the address of the specialized byte copy code (minus 1) instead of the addresses from the regular look-up table. This allows you to have a constant NMI handler for the whole program, but doesn't prevent you from doing more specialized updates if you need to. This is useful if you need to switch banks and copy data directly from ROM, for example. As long as you JMP back to UpdateVRAM in the end, everything is fair game.
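
As an illustration, a specialized transfer for one row of attribute bytes could look something like this (a sketch; ShadowAttributes is an assumed 64-byte shadow of the attribute table, and the only "data" byte on the stack is the offset of the row to copy):

Code:
CopyAttributeRow:

   ;pull the row offset into the shadow attribute table
   pla
   tax

   lda ShadowAttributes+0,x
   sta $2007
   lda ShadowAttributes+1,x
   sta $2007
   lda ShadowAttributes+2,x
   sta $2007
   lda ShadowAttributes+3,x
   sta $2007
   lda ShadowAttributes+4,x
   sta $2007
   lda ShadowAttributes+5,x
   sta $2007
   lda ShadowAttributes+6,x
   sta $2007
   lda ShadowAttributes+7,x
   sta $2007

   jmp UpdateVRAM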

Does anyone have any comments or suggestions? The thing that's bothering me the most is the bogus VRAM address at the end, but like I said, in the worst case, it's cheaper to have that than checking a flag for every update.
Re: Stack-based VRAM update system
by on (#155853)
There's one important thing I forgot to talk about: when trying to buffer an update, there must be a way to tell whether there's still enough time left to process it. I'm not sure whether going by how full the buffer is would be enough, since the same buffer is used for addresses and data, and these take different amounts of time to process (let's not forget the increment mode change, which is an additional routine). The most precise thing to do would probably be to count cycles: initialize a variable with the total number of cycles that can be used for VRAM updates and subtract from it whenever a new update is buffered. A look-up table could hold the number of cycles necessary for byte transfers of different lengths.
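
Something along these lines, perhaps (just a sketch; CyclesLeft, TransferCycles and SkipUpdate are made-up names, and the table would hold the cost of each transfer length including the fixed per-update overhead):

Code:
   ;Y = length of the transfer about to be buffered
   lda CyclesLeft
   sec
   sbc TransferCycles,y   ;cycle cost of this transfer
   bcc SkipUpdate         ;would blow the vblank budget, postpone it
   sta CyclesLeft
   ;(...buffer the update as usual...)

SkipUpdate: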
Re: Stack-based VRAM update system
by on (#155854)
Reminds me of how I add routines to NMI. I wanted it to be easy to hook anything into NMI. I load $0100 and up with the addresses of the tasks I want to complete (minus 1), set the SP and RTS to the first routine. The next RTS will load the next task, and whatever the last task is, its RTS will jump to the routine that restores the stack pointer. I just avoid using the stack for anything in my "NMI" routines. If I really need to, I can swap the stack pointer inside the routine.
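
In skeleton form, it's something like this (a rough sketch; RealSP and the label names are arbitrary, and each $0100 entry holds a task's address minus 1, low byte first):

Code:
RunTasks:

   ;save the real stack pointer and point SP just below the task list
   tsx
   stx RealSP
   ldx #$FF
   txs
   rts                ;"returns" into the first task at $0100/$0101

   ;each task ends with RTS, which chains straight into the next entry;
   ;the last entry in the list is the address (minus 1) of the code below

RestoreTaskSP:

   ldx RealSP
   txs
   ;(...rest of the NMI handler...)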
Re: Stack-based VRAM update system
by on (#155859)
Movax12 wrote:
I wanted it to be easy to hook anything into NMI. I load $0100 and up with the addresses of the tasks I want to complete (minus 1), set the SP and RTS to the first routine. The next RTS will load the next task, and whatever the last task is, its RTS will jump to the routine that restores the stack pointer.

Cool idea, very clever and elegant! And you can still have VRAM addresses and data on the stack that the tasks can use, if you want to.

Doing the math though, it sounds like your way ends up being slightly slower, at least for the common case of simply copying variable amounts of bytes to a dynamic VRAM location. For each task, there's 6 cycles for the RTS, then inside the task you need 16 cycles to set the VRAM address, then another 6-cycle RTS to jump to the middle of the unrolled loop. That's 28 cycles of preparation before the transfer itself.

In the code I wrote above, there's 16 cycles of setting up the VRAM address, followed by the 6 cycles of the RTS that jumps to the unrolled loop. Then, after all bytes are copied, there's 3 cycles for the JMP, for a total of 25 cycles of overhead for each block of data. There's also what's lost with the dummy $2006 writes at the end, which adds a bit more overhead to the whole process, so it's not as elegant as your solution.

I was just thinking, and realized that the overhead for updating small amounts of data is quite annoying. Say, when updating a single metatile for a block that's been destroyed: that's 2 rows of 2 tiles followed by 1 attribute byte. These need 3 separate transfers (3 * 25 cycles of overhead = 75 cycles) and 5 bytes of data (5 * 8 cycles = 40 cycles). I thought that 115 cycles sounded like a lot to update just one metatile, but then it hit me: all metatile updates have the same structure, so instead of calling the generic byte copy routine 3 times I can make a routine specifically for updating single metatiles, that doesn't JMP back to the NMI handler until it's done with the 3 transfers. That brings the amount of time it takes to update the metatile down to 97 cycles, which is much more reasonable. So I guess I can have the simple byte copy routine for the more generic stuff, but I have the option to create new specialized routines whenever they seem necessary, so this solution is still very flexible even though the first couple of $2006 writes of each update are done outside the update routine itself. A little messy, but flexible. :lol:
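
For illustration, such a metatile routine could be something like this (a sketch; it's reached through the normal RTS mechanism, so UpdateVRAM has already set the address of the top row, and the two remaining addresses come from the stack along with the data):

Code:
UpdateMetatile:

   ;top row of the metatile (2 tiles), address already set by UpdateVRAM
   pla
   sta $2007
   pla
   sta $2007

   ;bottom row (2 tiles), 32 bytes down in the name table
   pla
   sta $2006
   pla
   sta $2006
   pla
   sta $2007
   pla
   sta $2007

   ;the attribute byte
   pla
   sta $2006
   pla
   sta $2006
   pla
   sta $2007

   jmp UpdateVRAM

Counting the 16 cycles UpdateVRAM spends on the first address and the 6-cycle RTS that reaches the routine, this adds up to the 97 cycles mentioned above.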

Quote:
I just avoid using the stack for anything in my "NMI" routines. If I really need to I can swap the stack pointer inside the routine.

Yeah, that doesn't bother me at all. The goal is to move as much processing as possible away from the PPU update code, which should eliminate the need for regular stack operations.
Re: Stack-based VRAM update system
by on (#155861)
I realized too late this is probably describing what Movax12 does, but I've already gone and written it.
Quote:
but instead feed the RTS of the last iteration of the update loop with the address to RestoreSP. Yes, this means that the last address written to $2006 is a bogus one

Couldn't you use the same trick for the end of the loop?
i.e:
Code:
Copy1Byte:

   pla
   sta $2007

CopyNothing:

   rts

3 cycles slower per case, but since the extra setup of $2006 at the end is 22 cycles, it'd require more than 7 different update streams before it matters. Then again, I suppose attribute bytes would do that. But, because the routine would no longer need to always start from UpdateVRAM, different addresses could be pushed for different types of streams like:

Code:
UpdateVRAMinc32:
   lda shadow2000
   ora #%00000100
   sta shadow2000
   sta $2000

   pla
   sta $2006
   pla
   sta $2006
   rts

So something that needed to change the increment type would do it, set the address, and RTS to the number of updates. Where yours would presumably have to jmp to UpdateVRAM, RTS to what would presumably change the increment type, then RTS again to the right number of updates. (Unless you had a change increment type above every number of updates duplicated as well.)
Quote:
so instead of calling the generic byte copy routine 3 times I can make a routine specifically for updating single metatiles, that doesn't JMP back to the NMI handler until it's done with the 3 transfers

That's more or less what I do for vertical attributes, since they need a new address so often. I have a series of updates in ROM much like your copy32bytes etc that sets $2006 and then writes $2007.
Code:
   pla
   sta $2006
   
   pla
   sta $2006
   
   pla
   sta $2007
Re: Stack-based VRAM update system
by on (#155864)
Kasumi wrote:
3 cycles slower per case, but since the extra setup of $2006 at the end is 22 cycles, it'd require more than 7 different update streams before it matters. Then again, I suppose attribute bytes would do that.

Yeah. Updating the attributes of a column of metatiles requires 4 separate transfers, and then there's obviously the name table data. Based on this I think it should be fairly common to have more than 8 updates per frame. And when there aren't, the overhead isn't such a big issue, since fewer updates means more free time, which means the wasted cycles aren't as important.

Quote:
But, because the routine would no longer need to always start from UpdateVRAM, different addresses could be pushed for different types of streams like:

There's that... Since I figured I could optimize specific types of updates (by not jumping back to UpdateVRAM all the time), the final update count could easily drop below 8, so this is indeed something to consider.

Quote:
That's more or less what I do for vertical attributes, since they need a new address so often. I have a series of updates in ROM much like your copy32bytes etc that sets $2006 and then writes $2007.

I see. Another trick to speed up vertical attribute updates is that you can update 2 bytes per address if you keep the increment 32 mode. Since 32 bytes is the same as 4 attribute rows, for each address you can write 1 attribute byte and the one that comes 4 rows down, so you only have to set the address 4 times for the whole column.
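
In code, each of those 4 address sets boils down to something like this (a sketch, assuming $2000 has already been switched to +32 increments):

Code:
   pla
   sta $2006
   pla
   sta $2006

   pla
   sta $2007      ;attribute byte at row N
   pla
   sta $2007      ;attribute byte at row N+4

   ;(...3 more of these for the remaining pairs of rows...)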
Re: Stack-based VRAM update system
by on (#155868)
Oh, this is pretty clever and cool. A good idea would be to align the "Copy32Bytes" etc. routines so that their address can be calculated directly from the # of bytes to copy, without using a lookup table. Does it allow you to update more data than more conventional VRAM update systems (i.e. is it faster?)
Re: Stack-based VRAM update system
by on (#155871)
Movax12 wrote:
I wanted it to be easy to hook anything into NMI. I load $0100 and up with the addresses of task I want to complete (minus 1), set the SP and RTS to the first routine.


Looking at this after some sleep, not sure if I was clear. During "game logic" I can call a macro that adds a new routine to this list in the stack page. In NMI, you swap the SP and start through the list with RTS. I actually have three lists. One is for vram updates - they are called as soon as possible in NMI. Then I have lower priority routines, and a third for things that don't need to be done in any hurry, like read the controller. And if you need to do something once, the routine can remove itself from the list quite easily (if it was added to the list last).

So, compared to having some static JSRs in NMI, and/or checking flags to decide what needs to be done (and ignoring the overhead in the "game logic" thread) this is quite fast.
Re: Stack-based VRAM update system
by on (#155877)
Bregalad wrote:
A good idea would be to align the "Copy32Bytes" etc. routines so that their address can be calculated directly from the # of bytes to copy, without using a lookup table.

Great idea! Since there are 4 bytes of instructions for copying each byte (PLA, STA $2007), the offset can be calculated with bit shifts. The only problem is that the fewer bytes you have, the higher the address, so in addition to being shifted, the number needs to be complemented. Actually, since RTS wants the address minus 1, we can use one's complement instead of two's complement:

Code:
   ;X = number of bytes to copy
   lda #>Copy32Bytes ;2
   pha ;5
   txa ;7
   asl ;9
   asl ;11
   eor #$ff ;13
   pha ;16 cycles

For this to work, Copy32Bytes must be at $XX80. That way, if I want to copy 32 bytes, the result will be: (32 * 4) XOR 255 = $XX7F, exactly 1 less than the start of the byte copy routine. If I want to copy 1 byte, the math is: (1 * 4) XOR 255 = $XXFB, exactly 1 less than the address of the last copy, which is at $XXFC-$XXFF. It's 2 cycles slower than using the jump table, but you save 64 or 66 bytes.

Quote:
Does it allow to update more data than more regular VRAM update systems (i.e. is it faster ?)

Well, the byte copy itself is as fast as it gets without having to resort to ZP or dynamic code, it's the overhead of going from one update to the next that slows things down. I tried to minimize that overhead as much as possible, but it does add up when you have lots of small transfers.

Movax12 wrote:
Looking at this after some sleep, not sure if I was clear.

It was pretty clear to me. :D Did I say anything that sounded like I didn't get it?
Re: Stack-based VRAM update system
by on (#155882)
OK, here's another approach, targeting maximum speed - Create 2 unrolled copy loops:

Code:
CopyLoop1:
   .rept 32
   pla
   sta $2007
   .endr
   
   pla
   sta $2006
   pla
   sta $2006
   rts

Code:
CopyLoop2:
   .rept 32
   pla
   sta $2007
   .endr
   
   jmp RestoreSP

The first one is used for all updates except the last, which uses the second. This means there's no overhead at all between updates (except for setting a new VRAM address, but that's nearly always necessary anyway), since one update jumps directly to the next. It's still customizable, since you can jump to routines other than these 2 loops (to change increment modes or fetch data from other sources).

The only drawback is that filling the buffer with indexed addressing becomes a little awkward, since you never know whether an update is the last one when you're buffering it. You'd have to save the index of the entry pointer of the most recent update so you can change it to use the second copy routine once all the logic is done and you know that no more updates will be buffered. Filling the buffer with PHA is more straightforward, since the first update you push is always the last one to be executed, so you can just start with the second copy routine and then use the first for all the others.
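
For the record, the fix-up itself is cheap, something like this once you know no more updates are coming (LastEntryIndex is a made-up variable holding the buffer position of the most recent update's entry pointer, with the buffer assumed to live in the stack page):

Code:
   ;CopyLoop1 and CopyLoop2 have identical PLA/STA bodies, so converting
   ;the saved entry point is just a matter of adding the distance
   ;between the two loops
   ldx LastEntryIndex
   lda $0100,x                  ;low byte of the RTSable entry point
   clc
   adc #<(CopyLoop2-CopyLoop1)
   sta $0100,x
   lda $0101,x                  ;high byte
   adc #>(CopyLoop2-CopyLoop1)
   sta $0101,x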

EDIT: It may seem wasteful to have a second copy routine, but when you consider that it eliminates the need for specialized routines in certain cases (not the metatiles I mentioned before, though; since metatile updates have fixed lengths, it's still faster to use a custom routine for those), this might actually save storage space.

I really liked Movax12's idea of one routine calling the next, so I just took it to the next level.

EDIT 2: Actually, if you don't mind the dummy $2006 writes, you can make do with just the first routine, and have the RTS break out of the loop, like in my original idea.
Re: Stack-based VRAM update system
by on (#155883)
tokumaru wrote:
Movax12 wrote:
Looking at this after some sleep, not sure if I was clear.

It was pretty clear to me. :D Did I say anything that sounded like I didn't get it?


Not really, but I wasn't sure if I got it after reading it in the morning.
Re: Stack-based VRAM update system
by on (#156173)
Thank you so much for this, tokumaru. I had originally planned on trying to read 2 bytes of the stack for the $2006 updates, reading another byte for how many bytes to copy, putting that in X, and then using a loop to copy X amount of bytes, and I was gonna try to do this with both palettes, 25 tiles in fixed positions, and then try to squeeze a few more tiles in. I was just waiting until I'd run out of VBlank time. This also made me stop being lazy and learn how the RTS trick works.
Re: Stack-based VRAM update system
by on (#163800)
I know that I'm totally resurrecting this, but I thought it was interesting enough that I looked into it.

Performance-wise, PLA takes 4 cycles. Because of this, using the stack is no faster than using any other memory. So while this method is cool, it would be better to devote non-stack memory to updating nametables UNLESS you're not using the stack very much, and you're not updating the screen with too much information.
Re: Stack-based VRAM update system
by on (#163801)
I think one of the advantages of the PLA/STA $2007 loop is that you can determine the point to jump in by shifting the negative number of bytes to the left by 2.
Re: Stack-based VRAM update system
by on (#163802)
thenewguy wrote:
So while this method is cool, it would be better to devote non-stack memory to updating nametables UNLESS you're not using the stack very much, and you're not updating the screen with too much information.

Or flip that around and get: Use the stack area for PPU update buffers, UNLESS you need most of it for other things (e.g. some recursive algorithm).
Re: Stack-based VRAM update system
by on (#163811)
To me, the biggest advantage of using the stack is free indexing. This makes it easy to transfer variable amounts of bytes, without having to waste time setting up indices. The fact that destination addresses and byte counts (in the form of RTSable pointers) also come from the stack makes things even faster.

One way things could be faster than this is if you used ZP for the buffer, but you couldn't index it when reading, or you'd lose the speed boost, so I don't see how that would work. If you used self-modifying code to read the correct number of bytes from the correct memory positions (something that would most likely require extra RAM on the cartridge), you'd actually be better off loading immediate values in the self-modifying code instead, which would result in even faster transfers.
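
To illustrate that last option (a sketch only; the code would have to live in RAM, with the game loop patching the immediate operands before the NMI runs it):

Code:
ImmediateCopy:

   ;each immediate operand (the byte right after the LDA opcode) gets
   ;overwritten with the data to upload: 2 + 4 = 6 cycles per byte
   lda #$00
   sta $2007
   lda #$00
   sta $2007

   (...)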

If anyone can come up with a practical way to read arbitrary amounts of bytes from arbitrary ZP locations without using indexing, I'd love to hear about it.
Re: Stack-based VRAM update system
by on (#163853)
tokumaru wrote:
If anyone can come up with a practical way to read arbitrary amounts of bytes from arbitrary ZP locations without using indexing, I'd love to hear about it.

Is an unrolled loop with .repeat not practical? If you need variable lengths, you can use an indirect JMP (or some other PC trick) to start at the byte you want to begin with.
Re: Stack-based VRAM update system
by on (#163857)
That only takes care of the byte count, not the position of the data. Sure, I can have an unrolled loop that copies 128 bytes from ZP to $2007, and I can JMP or RTS to start near the end of it and copy only 4 bytes, but it will ALWAYS be the SAME 4 bytes, because there's no indexing whatsoever, the addresses are hardcoded! That's useless for transferring variable amounts of data, because you need to transfer different blocks from the middle of the buffer, not always the last N bytes.
Re: Stack-based VRAM update system
by on (#163862)
Is hardcoding the addresses really a problem? Exactly how many different combinations of PPU string lengths do you need to handle? (This is rhetorical though; if your problem is really that specific, you need a specific solution, obviously.)
Re: Stack-based VRAM update system
by on (#163864)
rainwarrior wrote:
Is hardcoding the addresses really a problem?

Of course! Say that I have an 8-way scrolling engine, using 4 name tables, which I update by drawing 17-metatile rows and 16-metatile columns as necessary. Depending on the position of the camera, the 17 metatiles of a row will be distributed differently across 2 name tables... It might be 1 on the left and 16 on the right, 16 on the left and 1 on the right, or anything in between. The same goes for columns, with all combinations between 1 + 15 and 15 + 1.

So, for each row of metatiles, I need 4 transfers:

- left part, top half;
- left part, bottom half;
- right part, top half;
- right part, bottom half;

And for columns:

- top part, left half;
- top part, right half;
- bottom part, left half;
- bottom part, right half;

All those blocks of data vary in size depending on the position of the camera. I fail to see how I could read an arbitrary number of bytes from an arbitrary section of the buffer, without using indices.

A video update buffer is supposed to be dynamic, you don't know what's going to be put there or in what order... And the same type of data could have different lengths each time (like happens with rows and columns of metatiles, which can be distributed differently across the 4 name tables).

Sure, I could have separate buffers for each task (palette updates, metatile updates, etc.), each with its own transfer routine reading from hardcoded addresses. In a game that only scrolls in 1 direction, even rows and columns could be handled this way, since the number of bytes to copy would always be the same. But this is not versatile. If you don't need one type of update in a frame, you can't reuse the memory for another type of update, meaning you'd end up reserving more of the precious ZP memory than you'll actually need in any given frame.

And the 8-way scrolling issue I presented above is practically impossible to solve using hardcoded addresses unless you code a routine for each possible distribution of metatiles, times 2 (row + column) times 2 halves, which would mean an insane waste of ROM.
Re: Stack-based VRAM update system
by on (#163866)
I guess I can see a simpler engine settling for the hardcoded address approach, though. If you only scroll in one direction, 8 pixels at a time, there's nothing too bad about an unrolled loop that copies 30 or 32 tiles from a constant location. Or if you need to update up to 4 patterns in CHR-RAM per frame, you could have an unrolled loop to copy 64 bytes from another fixed location, which could be filled backwards, and different entry points in the unrolled loop would allow copying 4, 3, 2 or 1 pattern(s).
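
A sketch of that second idea, with made-up names (PatternBuffer is filled from the end, so fewer patterns just means entering the loop further down):

Code:
Copy4Patterns:

   lda PatternBuffer+0
   sta $2007
   lda PatternBuffer+1
   sta $2007

   (...)

Copy3Patterns:

   lda PatternBuffer+16
   sta $2007

   (...)

Copy2Patterns:

   lda PatternBuffer+32
   sta $2007

   (...)

Copy1Pattern:

   lda PatternBuffer+48
   sta $2007

   (...)

   rts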

The possible settings are far from arbitrary though, and the use of memory is not very flexible. For games with heavier VRAM usage (an extreme example would be Battletoads), I don't think this would be feasible.

To me that sounds like a paradox really, considering that a simple game is less likely to need a speed boost in VRAM transfers than a complex game with lots of dynamic updates.
Re: Stack-based VRAM update system
by on (#163869)
tepples wrote:
I think one of the advantages of the PLA/STA $2007 loop is that you can determine the point to jump in by shifting the negative number of bytes to the left by 2.

Can you elaborate on this? I'm not sure I understand, but I'd like to.

tokumaru wrote:
To me, the biggest advantage of using the stack is free indexing. This makes it easy to transfer variable amounts of bytes, without having to waste time setting up indices. The fact that destination addresses and byte counts (in the form of RTSable pointers) also come from the stack makes things even faster.

I'm not sure I understand. If you have a buffer in ram somewhere, you should be able to read it using lda $0X00,y at a cost of 4 cycles. Oh, I think I see what you're getting at, that you don't have to deal with y thus eliminating an ldy and a dey? Yeah, okay, I didn't think of that lol. That is pretty cool. I wonder if I have enough stack to use this now lol.

Quote:
If anyone can come up with a practical way to read arbitrary amounts of bytes from arbitrary ZP locations without using indexing, I'd love to hear about it.

Nope, just the annoying way of an unrolled loop.

However, if your copies are large, you could have two separate lists: one with AddressHi, AddressLo, and Length, the other in zero page with the data. NMI would read from the first list, but copy data from the second list using an unrolled loop from zero page. A read time of 3 cycles rather than 4. Or you could just pull from the stack or use a buffer in RAM. I like the easy solutions.
Re: Stack-based VRAM update system
by on (#163870)
tokumaru wrote:
To me that sounds like a paradox really.

I think the paradox comes from thinking about generic problems rather than specific problems. There's like a billion different ways to do multidirectional scrolling, or the myriad of other tasks you're rifling through here. You're trying to solve too many problems with the same code, and you're over-constrained. You can always add constraints until a problem becomes impossible. There's a lot of ways to make the ZP work, if you want it.

Personally, I don't think you really need ZP for this; how much ZP space is worth devoting to the NMI? The answer for most games would likely be smaller than what you can already push to the PPU by other means.
Re: Stack-based VRAM update system
by on (#163871)
thenewguy wrote:
tepples wrote:
I think one of the advantages of the PLA/STA $2007 loop is that you can determine the point to jump in by shifting the negative number of bytes to the left by 2.

Can you elaborate on this? I'm not sure I understand, but I'd like to.

He means that since each instance of PLA/STA $2007 assembles to 4 bytes of ROM, you can use bit shifting to calculate the address to jump to in order to copy a specific number of bytes. Want to copy 5 bytes? Jump to EndOfCopyCode - 5 * 4. Personally, I don't see this as much of an advantage; you could just as easily and probably just as fast use a lookup table of entry points.
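
For comparison, the table version I had in mind is just this (EntryLow/EntryHigh being hypothetical tables of entry points minus 1, one pair per possible length, assuming the buffer is built with PHA like in the earlier example):

Code:
   ;X = number of bytes to copy
   lda EntryHigh,x ;4
   pha ;7
   lda EntryLow,x ;11
   pha ;14 cycles

That's the 2-cycle difference I mentioned earlier (14 vs. 16 cycles).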

Quote:
Oh, I think I see what you're getting at, that you don't have to deal with y thus eliminating an ldy and a dey? Yeah, okay, I didn't think of that lol.

Yeah, that's the idea. Technically, the dey could be eliminated anyway if you used sequential addresses (i.e. LDA Buffer+0, y/LDA Buffer+1, y/LDA Buffer+2, y/etc.) in the unrolled loop, but Y would still need to be updated for each transfer, so that's something we can avoid by using the stack method. ROM usage is lower too, since LDA $XXXX, y/STA $2007 is 6 bytes while PLA/STA $2007 is only 4 bytes.
Re: Stack-based VRAM update system
by on (#163872)
rainwarrior wrote:
I think the paradox comes from thinking about generic problems rather than specific problems. There's like a billion different ways to do multidirectional scrolling, or the myriad of other tasks you're rifling through here. You're trying to solve too many problems with the same code, and you're over-constrained.

Well, this is part of my recent development methodology, which is actually working out pretty well for me. I've been trying to come up with the most generic solutions possible, and my programs are way less confusing than when I had a lot of specific routines for each little task. The previous iteration of my scrolling routine did in fact have separate buffers for everything, and dedicated transfer routines, but there was practically no flexibility in how the time was used (e.g. the routine that updated up to 8 arbitrary metatiles would always claim an entire update slot for itself even if there was only 1 metatile to update).

Now, since I have a completely dynamic buffer, I can stuff whatever type of update I want in it, no matter how small or how large, and as long as it fits in the buffer, I know it will be handled during the next vblank.

My current solution is good enough for my needs (my buffer is 212 bytes long, so if I were to use it all in a single update I could upload 208 bytes to VRAM - this is hardly ever the case though, because each transfer has an overhead of 4 bytes: 2 for the target address and 2 for the RTSable entry point). It's not like I'm desperately chasing after a faster method for updating VRAM; I was just curious whether someone could come up with a fast generic solution using ZP.

Quote:
Personally, I don't think you really need ZP for this; how much ZP space is worth devoting to the NMI? The answer for most games would likely be smaller than what you can already push to the PPU by other means.

That's another paradox right there! :lol:
Re: Stack-based VRAM update system
by on (#163895)
I was playing around with Itadaki Street, and it also uses a stack-based PPU transfer routine (first time I ever saw it, so it's pretty funny to stumble on this thread after that :P).

It's nowhere near as flexible as the one you devised though.