This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019

How many PPU updates in NMI?

by on (#155825)
I've tried to find the answer to this topic, but I didn't know what exact words I have to look for, so I'm just asking the question here:
Approximately how many PPU updates can you do during the NMI on an NTSC console? What was that value again?

Assume that your NMI does nothing but save and restore the registers, check a boolean variable to see whether it's allowed to do anything at all, set that variable to false at the end, and set the scroll every time at the end. How many tiles can I write per NMI before vblank ends? Was it two columns? Three?
Re: How many PPU updates in NMI?
by on (#155829)
With optimized ASM code, you can get 192 bytes easily, but in C code who knows.
Re: How many PPU updates in NMI?
by on (#155830)
NTSC vblank is 20 scanlines long. At 113.666 cycles per scanline, that's about 2273 cycles of vblank time. Subtract the time necessary for the sprite DMA and you get around 1730 cycles for other updates. The kinds of memory transfers you can do vary a lot, but the 3 fastest methods are the following:

1 - Unrolled stack dump:
Code:
PLA
STA $2007
PLA
STA $2007
PLA
STA $2007
PLA
STA $2007
(...)

To use this method you have to buffer the data you want to change on the stack, and use a jump table to jump to the middle of this unrolled loop at the correct location to copy the desired amount of bytes. It takes 8 cycles to copy each byte.

2 - ZP dump:
Code:
LDA $XX
STA $2007
LDA $XX
STA $2007
LDA $XX
STA $2007
LDA $XX
STA $2007
(...)

You can't use any indexing on this one, otherwise you lose the speed boost of using zero page, so this is only really useful when you're updating fixed-size data chunks (like patterns, which are 16 bytes each). It takes 7 cycles to copy each byte.

3 - Immediate values in generated code:
Code:
LDA #$XX
STA $2007
LDA #$XX
STA $2007
LDA #$XX
STA $2007
LDA #$XX
STA $2007
(...)

For this one you need to use a lot of RAM, because the instructions are generated on the fly. Each byte you're updating expands to 5 bytes worth of instructions. It takes 6 cycles to update each byte.

It's hard to estimate the overhead of setting up the transfers and such, but I'd guess it's around 20%, which leaves you with 1384 cycles for actual data transfers. Now divide this by the number of cycles it takes to update each byte and we get:

Method 1: 1384 / 8 = 173 bytes;

Method 2: 1384 / 7 = 197 bytes;

Method 3: 1384 / 6 = 230 bytes;

If you use C or even the typical ASM loop with DEY/BNE, these numbers will decrease significantly.
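
The arithmetic above can be reproduced with a few lines of C. This is only a sketch: the function names are made up for illustration, and the 1730-cycle budget and the 20% overhead are the estimates given in this post, not measured values. Integer division rounds down, which matches the byte counts listed above.

```c
#include <assert.h>

/* Cycles left for data after reserving a percentage for setup overhead.
   The budget and the percentage are the post's estimates, not measurements. */
unsigned long transfer_cycles(unsigned long budget, unsigned long overhead_percent)
{
    assert(overhead_percent <= 100);
    return budget * (100ul - overhead_percent) / 100ul;
}

/* Whole bytes that fit when each byte costs a fixed number of cycles. */
unsigned long bytes_per_vblank(unsigned long cycles, unsigned long cycles_per_byte)
{
    assert(cycles_per_byte > 0);
    return cycles / cycles_per_byte;
}
```

Calling `transfer_cycles(1730, 20)` gives the 1384 cycles used above, and `bytes_per_vblank(1384, 8)`, `bytes_per_vblank(1384, 7)` and `bytes_per_vblank(1384, 6)` give the three byte counts. Changing the overhead percentage shows how sensitive the result is to that guess.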
Re: How many PPU updates in NMI?
by on (#155831)
I think the estimation for the overhead is a little high. In the best case, you can set registers, then do an indirect jump into an unrolled copy loop. This makes the preparation more complicated, and more is done in non-vblank code to make the vblank code as tight as possible.
Re: How many PPU updates in NMI?
by on (#155834)
The last time I mentioned the PLA variant, tepples pointed out that a 16x partial unroll gets an average of 8.75 cy/byte transferred.
Re: How many PPU updates in NMI?
by on (#155836)
Dwedit wrote:
I think the estimation for the overhead is a little high.

Well, I did say it was hard to estimate... :) I was thinking of things like setting up the VRAM address, increment mode, breaking up larger updates into multiple calls to a smaller unrolled loop, looking up the values for the indirect jumps... I think that could add up to a lot of lost cycles, but we'd have to look at actual code to say for sure.

I've been thinking of a purely stack oriented method, that would get rid of a lot of the overhead. All the data would be in the stack, along with the VRAM addresses, the return addresses for accessing the middle of the unrolled code (eliminating the need to create pointers from a lookup table and do indirect JMPs). The code in the NMI would be only this:

Code:
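UpdateVRAM:
   pla          ;high byte of the VRAM address, or a negative end marker
   bmi +Done    ;negative byte: no more updates
   sta $2006
   pla          ;low byte of the VRAM address
   sta $2006
   rts          ;pull the entry point (address - 1) and jump into the unrolled code
+Done: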

A negative byte would be used to indicate that there are no more updates. I couldn't think of anything better than another pla + sta $2000 for setting the increment mode, so I omitted that. Maybe negative values could be used for other things besides ending the loop, like changing the increment mode. This way you wouldn't have to waste time setting the increment for each transfer.

And here's what the unrolled code would be like:

Code:
   pla
   sta $2007
   pla
   sta $2007
   pla
   sta $2007
   (...)
   pla
   sta $2007
   jmp UpdateVRAM

Anyway, all the preparation is done during non-vblank time, like you said, even getting the entry points from the look-up table and pushing them to the stack.
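
To make the data layout concrete, here is a rough C model of how the non-vblank code could queue updates in the order the NMI code pulls them. It is a sketch under assumptions: `UNROLL_END`, `MAX_RUN` and the function names are hypothetical, the buffer is an ordinary array rather than the real stack page, and the increment mode is left out just as in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address of the final "jmp UpdateVRAM" in the unrolled code. */
#define UNROLL_END 0x8100u
/* Each "pla / sta $2007" pair assembles to 1 + 3 = 4 bytes. */
#define PAIR_SIZE 4u
/* Hypothetical number of pairs in the unrolled code. */
#define MAX_RUN 32u

/* Appends one update in pull order: VRAM address high, VRAM address low,
   RTS target low, RTS target high, then the data bytes.
   Returns the new write position. */
size_t queue_update(uint8_t *buf, size_t pos, uint16_t vram_addr,
                    const uint8_t *data, unsigned len)
{
    uint16_t target;
    unsigned i;

    assert(len >= 1 && len <= MAX_RUN);
    assert((vram_addr & 0x8000u) == 0);  /* high byte must not look negative */
    /* RTS adds 1 to the pulled address, so store the entry point minus 1. */
    target = (uint16_t)(UNROLL_END - PAIR_SIZE * len - 1u);
    buf[pos++] = (uint8_t)(vram_addr >> 8);
    buf[pos++] = (uint8_t)(vram_addr & 0xFFu);
    buf[pos++] = (uint8_t)(target & 0xFFu);
    buf[pos++] = (uint8_t)(target >> 8);
    for (i = 0; i < len; i++)
        buf[pos++] = data[i];
    return pos;
}

/* Writes the negative byte that makes "bmi +Done" end the loop. */
size_t end_updates(uint8_t *buf, size_t pos)
{
    buf[pos++] = 0xFFu;
    return pos;
}
```

Queueing two bytes for VRAM address $2041 and then calling `end_updates` produces seven bytes: the address, the target $80F7 stored low byte first, the two data bytes, and the $FF terminator.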
Re: How many PPU updates in NMI?
by on (#155837)
Also, you can treat the prerender line as part of vblank time, as long as you do the full 2006/2005/2005/2006 sequence at the end, and finish before cycle 321 of the prerender line.
Re: How many PPU updates in NMI?
by on (#155839)
Dwedit wrote:
Also, you can treat the prerender line as part of vblank time, as long as you do the full 2006/2005/2005/2006 sequence at the end, and finish before cycle 321 of the prerender line.

And as long as you have rendering disabled during that scanline, right? I prefer not to turn rendering off during vblank, so I use the pre-render scanline just to set the scroll normally ($2000/$2005/$2005).
Re: How many PPU updates in NMI?
by on (#155856)
I reduce the overhead of my partially-unrolled loop by using null-terminated buffers:

Code:
   pla
loop:
   sta $2007
   pla
   sta $2007
   pla
   sta $2007
   pla
   sta $2007
   pla
   bne loop
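
The cost of a partially unrolled loop like this one can be worked out with a small helper. This is a sketch with a made-up function name. It assumes the standard 6502 timings (pla = 4 cycles, sta absolute = 4, taken bne = 3) and that the branch does not cross a page boundary.

```c
#include <assert.h>

/* Average CPU cycles per byte for a partially unrolled copy loop.
   per_byte: cycles for one load/store pair (pla + sta $2007 = 4 + 4).
   loop_overhead: extra cycles paid once per pass (here a taken bne = 3,
   assuming the branch does not cross a page boundary). */
double avg_cycles_per_byte(unsigned unroll, unsigned per_byte,
                           unsigned loop_overhead)
{
    assert(unroll > 0);
    return per_byte + (double)loop_overhead / unroll;
}
```

For the loop above, which copies four bytes per pass and pays only for the bne, `avg_cycles_per_byte(4, 8, 3)` comes to 8.75 cycles per byte.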
Re: How many PPU updates in NMI?
by on (#155857)
Method 4, unrolled loop...

Code:
lda UpdateBuffer ; (4 cycles)
sta $2007 ; (4 cycles)
lda UpdateBuffer + 1
sta $2007
lda UpdateBuffer + 2
sta $2007
...etc...

1730 / 8 = 216 bytes updated to the PPU.

Written in C:

Code:
*((volatile unsigned char*)0x2007) = *((unsigned char*)BUFFER);
*((volatile unsigned char*)0x2007) = *((unsigned char*)BUFFER+1);
*((volatile unsigned char*)0x2007) = *((unsigned char*)BUFFER+2);
//etc
Re: How many PPU updates in NMI?
by on (#155860)
dougeff wrote:
Method 4, unrolled loop...

I didn't mention this because it's as stiff as the ZP method (fixed number of bytes, from fixed addresses), while being as slow as the stack method, which is way more versatile. The only advantage I see with this method is not having to mess with the stack pointer.

You can however turn this into something as flexible as the stack method, while still maintaining the cost of 8 cycles per byte. You can use absolute indexed addressing (which also takes 4 cycles, as long as no page boundaries are crossed), and negative base addresses, like this:

Code:
   lda Buffer-32, x ;entry point for copying 32 bytes
   sta $2007
   lda Buffer-31, x ;entry point for copying 31 bytes
   sta $2007
   (...)
   lda Buffer-2, x ;entry point for copying 2 bytes
   sta $2007
   lda Buffer-1, x ;entry point for copying 1 byte
   sta $2007

And then you just have to manipulate the X register in order to read from anywhere in the 128-byte buffer. If you want to transfer 1 byte from the start of the buffer (position 0), load X with 1 + 0 = 1 and jump to the last LDA + STA pair, which will load the value at Buffer-1 + 1 = 0, the beginning of the buffer.

Want to load 4 bytes from the end of the buffer (positions 124, 125, 126 and 127)? Load X with 4 + 124 = 128 and jump to the middle of the unrolled loop with 4 bytes to go, so the bytes in the following addresses will be read:

Buffer-4 + 128 = 124
Buffer-3 + 128 = 125
Buffer-2 + 128 = 126
Buffer-1 + 128 = 127

It works exactly like the stack method, but instead of manipulating the stack you have to add the offset and the length to form the index.
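
The index arithmetic can be checked with a small C model. It is only a sketch: `MAX_RUN` and the function names are hypothetical, and positions are counted from the start of the buffer as in the examples above.

```c
#include <assert.h>

/* Hypothetical size of the unrolled code: entry points for 1..MAX_RUN bytes. */
#define MAX_RUN 32u

/* X register value for copying `length` bytes starting at `offset`. */
unsigned index_for(unsigned offset, unsigned length)
{
    assert(length >= 1 && length <= MAX_RUN);
    assert(offset + length <= 255);  /* X is an 8-bit register */
    return offset + length;
}

/* Buffer position read by the k-th load (k = 0 is the first) after entering
   at the entry point for `length` bytes. The operand of that load is
   Buffer-(length-k), indexed by X. */
unsigned position_read(unsigned x, unsigned length, unsigned k)
{
    assert(k < length);
    assert(x >= length);
    return x - (length - k);
}
```

With `index_for(124, 4)` returning 128, `position_read(128, 4, k)` for k = 0..3 returns 124, 125, 126 and 127, matching the second example. The first example is `index_for(0, 1)` = 1 and `position_read(1, 1, 0)` = 0.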
Re: How many PPU updates in NMI?
by on (#155863)
I just didn't like the idea of using up 200 bytes of the zero page.

Actually, I like your hybrid stack method as well. (that you just posted).

Would that use a JMP (indirect)?
Re: How many PPU updates in NMI?
by on (#155865)
dougeff wrote:
Would that use a JMP (indirect)?

Yes, that would be 1 cycle faster than pushing the address to the stack and using RTS, which is also a valid option. Either way you'd normally use a look-up table to get the correct address (or address - 1, in the case of the stack trick) for the amount of bytes you want to transfer.

BTW, I just assumed a 128-byte buffer, but the limit is dictated by the maximum number of bytes the unrolled loop can copy each time. You just can't let the negative base address cross a page boundary. So, an unrolled loop that copies at most 32 bytes can work with a buffer up to 256 - 32 = 224 bytes big, in order to completely avoid page crossing, because the buffer must begin somewhere after $XX1F.
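
The page-crossing rule can be verified by brute force. The sketch below uses made-up function names and checks the usual 6502 condition for absolute indexed reads: an extra cycle is spent when the low byte of the operand plus X carries into the next page.

```c
#include <assert.h>

/* Largest buffer an unrolled run of `max_run` loads can serve without any
   "lda Buffer-k, x" paying the page-crossing penalty. */
unsigned max_buffer_size(unsigned max_run)
{
    assert(max_run >= 1 && max_run <= 255);
    return 256u - max_run;
}

/* Returns 1 if some load crosses a page for a buffer at `buffer_addr` of
   `size` bytes, trying every offset and every length up to `max_run`. */
int any_page_cross(unsigned buffer_addr, unsigned size, unsigned max_run)
{
    unsigned length, offset, k;

    assert(size <= 255 && buffer_addr >= max_run);
    for (length = 1; length <= max_run && length <= size; length++)
        for (offset = 0; offset + length <= size; offset++)
            for (k = 1; k <= length; k++) {
                unsigned base = buffer_addr - k;  /* operand: Buffer-k */
                unsigned x = offset + length;     /* index register */
                if ((base & 0xFFu) + x > 0xFFu)
                    return 1;                     /* this load costs 5 cycles */
            }
    return 0;
}
```

For a 32-byte unrolled run, `max_buffer_size(32)` is 224. `any_page_cross(0x0320, 224, 32)` returns 0, while moving the same buffer one byte lower to $031F makes it return 1.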
Re: How many PPU updates in NMI?
by on (#155880)
Which would work well with putting the 6502 stack at $0100-$013F and the update buffer at $0140-$01FF.
Re: How many PPU updates in NMI?
by on (#155881)
Yay stack overflow, and the return address gets clobbered by the end of vram data
Re: How many PPU updates in NMI?
by on (#155884)
You obviously have to be careful whenever you use the stack in non-trivial ways. 256 bytes is way too much RAM for the conventional uses of temporary variable storage and remembering return addresses. That memory is begging to be used for other things.
Re: How many PPU updates in NMI?
by on (#155887)
Dwedit wrote:
Yay stack overflow

How much stack does a typical NES game use?
Re: How many PPU updates in NMI?
by on (#155888)
tepples wrote:
How much stack does a typical NES game use?

I don't have any numbers, but I seriously doubt many NES games have deep nested subroutine calls. I can see the stack being used as a buffer though, like for reading data from a switchable program bank for later processing. I doubt anyone needs more than 64 bytes.
Re: How many PPU updates in NMI?
by on (#155889)
My Chu Chu Rocket game used up to 37 bytes of stack space.
Dragon Warrior 1 uses 48 bytes of stack space.
Dragon Warrior 4 uses 116 bytes of stack space. (do we need a thread to list this stuff?)
Re: How many PPU updates in NMI?
by on (#155890)
My game currently seems to max out at about 55 bytes deep on the stack.

I don't think there's much of a "typical"; like it would be easy to adopt a programming style that either prefers shallow or deep stacks. 55 bytes is what I got by not considering it an issue and just doing what I felt like. If there was a problem in my game that I thought some limited recursion would solve, it'd probably go a lot deeper. If I had decided my stack limit was 64 bytes, I just wouldn't consider that kind of solution and I'd do something else in my game.

It's pretty easy to observe other games' stack usage in FCEUX:

In some very cursory tests, SMB seems to use about 21 bytes worth of stack. SMB3 went up to about 25. Batman 26. Alfred Chicken 31.

Metal Slader Glory seems to use only about 9 bytes of stack, and only reserved 32 bytes for it total?

Battletoads makes very strange use of its stack space. It seems to have a different setup for different levels. During transitions it re-initializes the whole stack space to 0. I think each gameplay mode might have a different organization of the stack; sometimes there appear to be a bunch of variables pushed to the top of the stack, and using a block in the middle somewhere for the "real" call stack.

Edit: Dwedit was looking at some other games at the same time. No I don't think we need a thread to list this extremely trivial bit of information about random games. :P