This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019 Download

Extreme CPU usage after a change (SOLVED)

Extreme CPU usage after a change (SOLVED)
by on (#93989)
I just made a change to two functions that are called exactly once or twice per frame. They basically just load the high/low byte of an address so I can load indirectly from it. All I did was update them to a slightly newer format.

But here's the kicker: Before I made the change, my entire game runs a frame in 8,000-12,000 cycles in one situation.

After the change? 29,765-33,712 in the same situation.

I'm fairly sure the extra time is not within the logic of the subroutines I changed, but in case you don't believe me, here two are:

   ldy #$00;Replace with 16 by 16 tileset number
   ;Loading all the addresses. Shouldn't need to change.
   lda meta16addrs0high,y
   sta <dress9
   lda meta16addrs0low,y
   sta <dress8
   lda meta16addrs1high,y
   sta <dressB
   lda meta16addrs1low,y
   sta <dressA
   lda meta16addrs2high,y
   sta <dressD
   lda meta16addrs2low,y
   sta <dressC
   lda meta16addrs3high,y
   sta <dressF
   lda meta16addrs3low,y
   sta <dressE

   ldx #$00;Replace with 16 by 16 tileset number
   ;Loading all the addresses. Shouldn't need to change.
   lda met16header
   sta <reserved0;Temp RAM
   lda met16header+1
   sta <reserved1;Temp RAM
   lda #$F6
   adc #$0A;Number of elements in the set*2
   bpl l16metaloop

   lda [reserved0],y
   sta <dress8
   lda [reserved0],y
   sta <dress9
   lda [reserved0],y
   sta <dressA
   lda [reserved0],y
   sta <dressB
   lda [reserved0],y
   sta <dressC
   lda [reserved0],y
   sta <dressD
   lda [reserved0],y
   sta <dressE
   lda [reserved0],y
   sta <dressF

The other subroutine I changed was more or less identical, except it loads the fifth byte stream for my 16x16 tiles and stores it.

Basically what I've done is added an extra layer of abstraction, so every bank can have its own variable number of sets. I load from the header which is in a fixed position in all banks, and then I can get the addresses of each set of elements of each tile set, which could be anywhere in the bank since those can be any size.

Note: I'm actually still on NROM, I'm just preparing for the eventual move.

I've looked at some things this could be. The new one uses two temp variables, and X. I looked around, and I can't see anything that expects X or these temp RAM variables to have a certain value. I first thought maybe this was called in the middle of some loop, and this function was making it run A LOT.

But in fact, all three are actually set to expected values almost immediately after these subroutines are called.

The only other thing I can think of is that this change pushed a piece of, like, every SINGLE of my loops passed a page boundary. I certainly don't have any single loops running the THOUSANDS of times this would require.

I realize this isn't much to go on, but I have no idea. This is the only change between the two versions. Has anybody had a similar experience? I'm looking for debugging tips, I guess. Maybe ways to detect when an instruction passes a cycle boundary or any other clever things you guys might think this could be.

This has actually happened to me before. A few large changes = skyrocketed CPU performance, but back then I wasn't even sure EXACTLY what was added. And, as I added more code it just went away.

Now I know exactly what I changed, and want to know what this is about.
Re: Extreme CPU usage after a change that does not even loop
by on (#93990)
Kasumi wrote:
I'm looking for debugging tips, I guess. Maybe ways to detect when an instruction passes a cycle boundary or any other clever things you guys might think this could be.

I can tell you the first thing I'd try is using the Execution Visualizer feature of NESICIDE. Mark the regions of the two functions and it'll give you min/cur/max cycles. It may not be perfect but it'd quickly show you whether the time is wasting in those or not.

EDIT: You don't need to compile the ROM in NESICIDE for the above to work. Just need to know the assembly addresses of the routines.

by on (#93992)
I'm looking at that "l16metaloop" loop...

What's a typical value for X as you run this code?

Edit: nevermind, I saw that it's basically just multiplication.

I suggest you either use a better method for multiplication, or you find some way to avoid it all together.

As for what a better method would be, try something like this:

"A * B = C"
Start by setting C = 0

if carry is set:
    C += B

Repeat 8 times (or less, if you don't expect to use all 8 bits of A)

by on (#93994)
Cpow: The execution of either one isn't variable, and not run more than once in a frame. I never did check the cycle count until your post, but the one posted is always 106. The one that's not is 46. (It does almost EXACTLY the same thing, except it only writes one address instead of 4).

Drag: It's always 0. Look above it. ldx #$00.

Eventually that will load the set number, but I haven't done that yet because there's only one set.

   .dw met16sets_bank0
   .dw meta16_00_00
   .dw meta16_00_01
   .dw meta16_00_02
   .dw meta16_00_03
   .dw meta16_00_04
   .dw meta16_01_00;This would be the next set
   .dw meta16_01_01
   .dw meta16_01_02
   .dw meta16_01_03
   .dw meta16_01_04
   .dw meta16_02_00;This would be the next set
   .dw meta16_02_01
   .dw meta16_02_02
   .dw meta16_02_03
   .dw meta16_02_04

   .incbin "Data/Sets/16Sets/16Set_00_00.bin"
   .incbin "Data/Sets/16Sets/16Set_00_01.bin"
   .incbin "Data/Sets/16Sets/16Set_00_02.bin"
   .incbin "Data/Sets/16Sets/16Set_00_03.bin"
   .incbin "Data/Sets/16Sets/16Set_00_04.bin"

That's what I'm doing right now. It allows each byte stream to be a variable number. All the parts of meta16_00 would be the same length, and all the parts of meta16_01 to be the same length, but a part of meta_16_00 might not be the same length as meta_16_01.

All the banks would have a header pointing to their own data. Set 0 in bank 0, and Set 0 in bank 1 would be different.

I could avoid the multiplication with an additional layer of abstraction, but right now the execution time isn't variable and the function is run exactly once a frame. I doubt there will be more than two or three sets in each bank, and right now my editor only supports a max of eight sets distributed among all banks.

I do have a subroutine that multiplies the way you describe, and it does have huge benefits for larger numbers.

But right now, I get the same CPU usage if I replace the entire loop with ldy #$00, which is what the result ends up being in a case of zero, which is the only case there is.

Another note: Though I DID change the format, both the old format and the new format were in the file at the same time during this "before and after" test. I basically ctrl+z'd all the changes of that on subroutine. 8,000. Control+y'd back to the updated subroutine. 29,000.

At this point, I guess it has to be some kind of strange nesasm thing. I'm gonna log all the instructions in a few frames, and see if I can find what's taking all the time.

Edit: I also checked that the bank with this subroutine didn't fill with these added bytes. I am aware nesasm fails to tell you sometimes when you overflow a bank, which might be the cause of some strange behavior.

by on (#94005)
I still recommend that you explore a different way of multiplying numbers, because it might not matter right now, but that loop is going to take longer and longer to calculate the correct index, the higher the bank number (The method I suggested should take the same amount of time no matter what). If you don't know where your high CPU usage is coming from, it could be from things similar to this, scattered around your code. It could also be from a complete freak accident too. :P

Either way, I don't know the structure of your data, nor do I know what all of your subroutines and functions need to do, so I can't really suggest much, aside from whatever I speculate on. :P

Kasumi wrote:
I guess it has to be some kind of strange nesasm thing.

Quick question; why is this assembler still used, despite all of the bugs and shortcomings it has? "Some strange problem with the assembler" shouldn't ever have to be something a developer needs to consider. o_O I'm not trying to beat you over the head about this or anything, I'm just legitimately curious as to why I see people still mentioning that they use it.

by on (#94008)
The function doesn't loop, but does your program somewhere have a loop that calls the function?

by on (#94016)
That multiply takes the same time no matter what, but I could just as well use a table and be faster than a multiply. It's a non issue for a subroutine that's called once per frame. Believe me the code is optimized where it needs to be. I've got four way scrolling, sonic slope platform co-op without dropping frames. It may drop frames during co-op eventually, but hopefully almost never for single player. I have a lot to add, but right now it hits ~14,000 in the worst cases for single player, four way scrolling, slope action.

Edit: As for why I use nesasm? At this point, damned if I want to switch this far into my project. For my next thing (if there ever is one) I will definitely use something else.

In any case, I think I figured it out. And naturally, it didn't really have much to do with this change. It was some logic I added a long time ago. It wasn't REALLY running that slow, per se, but it was was stopped waiting for the beginning of a frame AFTER I started starting counting cycles for the frame.

I have two variables that are changed by the NMI. One that says all the tiles have been updated by it, and another that says it has occurred.

I have a wait in place for both variables. The one that is changed when it occurs is used to start a frame. Immediately after the main loop detects it's minus, it sets it to plus. (As opposed to setting it to plus immediately before it's checked) I did this to avoid waiting for the NMI in a case where the program ALREADY took more than a frame. (This is also when I start my timer.) I didn't want my game to hit 30 FPS unless it was CONSISTENTLY was dropping frames. My worst cases are usually exactly one frame (like when scrolling to a new screen, or when landing on the ground from a jump), so this would allow it to continue at a good pace since the next frame would almost certainly be under time.

(I stop my timer when I reach the jmp that puts me back to the main NMI check)

However, when I stress tested this (Six players at once), I found out the game crashed. This is because it was possible for an NMI to trigger as it was readying tiles for the NMI to write to the PPU, but before it was done. So the NMI didn't write those tiles (or... it would crash) and continued on. But since an NMI occurred at some point during the frame, the beginning of the loop didn't stop it. And if the program reached the part of the program where tile updates started again, it would corrupt the stream because that part of the game assumed an NMI would have already occurred.

So I put in a second check before tiles are updated that makes sure the NMI has written the buffer and it can start messing with it.

This subroutine must have changed the timing of everything just enough that it always gets stopped at the safetile loop, and not the main loop. This also explains how I've hit this problem before, but it just sort of went away.

I could link both in another way, perhaps. I did it this way since the object logic could "pick up slack" while the next NMI was still to occur, without it getting stuck in the wait loop for safetiles. If I put it earlier, it would delay BOTH.

Anyway, thanks all for listening to me ramble. I really can't wait to show this game off.

by on (#94041)
bpl l16metaloop

"bpl" and "bmi" only work for accumulator instructions.

cpx #$ff
bne l16metaloop

by on (#94044)
You can use bpl/bmi after any instruction which affects flags.

Even TAY or TXA will update the zero and negative flag. INX and INY do as well.

Basically, almost every instruction updates the zero and negative flags.
The exceptions are store instructions, jump/call/return instructions, and instructions that set/clear specific flags.

by on (#94051)
psycopathicteen: Maybe you're thinking of bcc and bcs because dex/inx/dey/iny don't affect the carry?

by on (#94056)
Something that can change CPU usage even if you're barely modifying your code is the branch instructions that crosses a page-boundary and the indexed instructions that crosses a page-boundary.

Maybe your small change just made some instruction of your inner loop just cross a page boundary when it didn't before, and this creates a huge CPU load.

The best way to know this is to look at a trace, for example Nintendulator's, to see where most of your program is spending its time.

by on (#94063)
The problem is solved. See this post. I'll edit the topic title. I had thought about page boundary crosses, but that wasn't it this time. Sort of a similar problem, though.

I found it by comparing a log of the before and after. The only differences were when NMIs occurred, how long they got caught in loops, and of course the actual subroutines I changed. I synced the logs after each NMI/subtroutine call/wait loop and realized they were basically perfectly synced for 20 frames.

Then I noticed each one was always getting caught in the wait loop the other wasn't.

by on (#94065)
I didn't really understand your issue, but it seems you're close to release a good game so I can't wait to try it.