This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019

A simple optimization

A simple optimization
by on (#238528)
I'm sure this has probably been covered before, but wanted to share anyway.

I was analyzing my code to see what routines were using the most cycles overall. I was surprised to see that one of my most cycle-hungry routines was just a simple loop I was using to clear out my shadow OAM before reloading sprite data into it:

Code:
   LDX #0
   LDA #$FE
   .Loop:
      STA spriteTable, X
      INX
      BNE .Loop

   RTS


I wanted to see if I could reduce the number of cycles used here, and tried the following instead:

Code:
   LDX #31
   LDA #$FE
   .Loop:
      STA spriteTable, X
      STA spriteTable + 32, X
      STA spriteTable + 64, X
      STA spriteTable + 96, X
      STA spriteTable + 128, X
      STA spriteTable + 160, X
      STA spriteTable + 192, X
      STA spriteTable + 224, X
      DEX
      BPL .Loop

   RTS


I was really surprised to see that this uses roughly half as many cycles, for just a handful of extra bytes of code. I understood in theory that a loop which makes 256 comparisons would take more cycles than a loop that only makes 32, but I had never actually tried it and looked at the difference in speed.

I know this is probably old-hat to many of you, but I'm excited about it. It's enough of a difference to visibly reduce lag in some areas, and I'm very pleased with it.
Re: A simple optimization
by on (#238529)
gravelstudios wrote:
I'm sure this has probably been covered before, but wanted to share anyway.
I was really surprised to see that this uses roughly half as many cycles, for just a handful of extra bytes of code. I understood in theory that a loop which makes 256 comparisons would take more cycles than a loop that only makes 32, but I had never actually tried it and looked at the difference in speed.


Many short loops have roughly 50% overhead. If a loop has 50% overhead, unrolling it even to 2x will cut execution time by 25%. Going to 4x saves roughly another 17% of what remains (12.5 percentage points of the original time). The benefit of going from 4x to, say, 64x is fairly slight, but going from 1x to 2x is huge.
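To put rough numbers on that using the clear loop from the first post (back-of-the-envelope figures, ignoring page-crossing penalties):

Code:
; rolled:       STA abs,X (5) + INX (2) + BNE (3)  = 10 cycles per byte (5 useful + 5 overhead)
; unrolled 2x:  5 + (2 + 3) / 2                    = 7.5 cycles per byte  (25% faster)
; unrolled 4x:  5 + (2 + 3) / 4                    = 6.25 cycles per byte (37.5% faster)
; unrolled 8x:  5 + (2 + 3) / 8                    = ~5.6 cycles per byte (~44% faster, i.e. "roughly half")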
Re: A simple optimization
by on (#238530)
Meanwhile, I just helped Snes9x get a tiny performance boost by getting rid of an unrolled loop. If it's modern, unrolling doesn't help. If it's ancient, it does.
Re: A simple optimization
by on (#238534)
I would assume the compiler for just about any modern language could convert a for loop into the most optimized assembly code possible without too much fuss on the part of the programmer. There's really no reason to do anything convoluted.
Re: A simple optimization
by on (#238535)
LDX #28 and doing four DEX at a time will speed it up even more, as only the Y coordinate really needs to be cleared.
Re: A simple optimization
by on (#238537)
gravelstudios wrote:
I would assume the compiler for just about any modern language could convert a for loop into the most optimized assembly code possible without too much fuss on the part of the programmer. There's really no reason to do anything convoluted.

Unrolling loops is a well-known optimisation technique for 65xx CPUs (and sometimes other CPUs). An unrolled loop is sometimes more efficient than a rolled one, and other times the opposite is true. Every situation is different; needs vary.

I'm kind of confused by this statement, though. Where is a compiler involved? You wrote native 6502 code, so you were using an assembler, not a compiler. There is a substantial difference.

Not to mention, you didn't even disclose *what* assembler you were using, re: assumptions that an assembler/tool would know how/when to turn a rolled loop into an unrolled loop (hint: most assemblers will not do this for you, *ever*. If that surprises you, respectfully of course, pause for a moment and think about the ramifications of automatically changing someone's 6502 instructions around on them. You're working with the CPU at the lowest level possible from a programming perspective; this is not an "intermediate" higher-level language like C).
Re: A simple optimization
by on (#238538)
Unrolled loops are a good concept. If the assembler has a .repeat directive, this is often a good place to use it, too.
Code:
LDX #31
LDA #$FE
@loop:
  .repeat 8, I
    STA spriteTable + (32 * I), X
  .endrepeat
  DEX
  BPL @loop
RTS

Another general thought is to avoid overlapping work. In this case any sprite you clear that gets overwritten becomes a redundancy. Instead you might clear only what's left over after filling up the used portion of the sprites.

Though, once it's only taking up the slack in this way, it probably becomes less necessary to optimize it at all. Typically, if there are fewer sprites on the screen, there's more CPU time remaining, and when there are more sprites on the screen, less time will be spent in the clearing loop anyway, so the load might be balanced in such a way that this particular task never becomes a performance issue. (This is not always the case, but I think it is typical.)


Also, not that you need it for this situation, but you can still unroll a loop even if the number of steps is variable: C example (same idea applies in assembly)
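To sketch the same idea in 6502 terms (untested, and count/buffer are just placeholder names here): handle 8 bytes per pass while at least 8 remain, then finish the remainder one byte at a time.

Code:
; clear 'count' bytes (0-255) of 'buffer' with an 8x-unrolled body
ClearVariable:
   lda #$FE
   ldx count
@chunk:
   cpx #8
   bcc @tail            ; fewer than 8 bytes left
   .repeat 8
      dex
      sta buffer,x
   .endrepeat
   jmp @chunk
@tail:
   dex
   bmi @done            ; X was 0: nothing (more) to clear
   sta buffer,x
   jmp @tail
@done:
   rts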


Just as a funny analogue, the strange way that out of bounds areas typically look in 3D games tends to come from a similar reason to what I suggested for sprites. 3D games often don't clear the screen before they start drawing to it, because they expect it to be entirely covered up by the time everything else has drawn on top. It's another attempt to avoid redundancy.
Re: A simple optimization
by on (#238545)
Quote:
LDX #28 and doing four DEX at a time will speed it up even more, as only the Y coordinate really needs to be cleared.
I never thought about this. That's a really good idea.
Quote:
Where is a compiler involved? You wrote native 6502 code, so you were using an assembler, not a compiler.
I wasn't referring to my code here. I was referring to Dwedit's comment about Snes9x, which I assume is written in C++ or some other compiled language. I would assume that in C++ there's no need to unroll loops. The compiler will take care of optimizing it for you.
Quote:
Instead you might clear only what's left over after filling up the used portion of the sprites.
I tried doing something like that, but with the way I was already handling transferring sprite data from ROM to RAM and also sprite cycling, it ended up being more convoluted than it was worth. Maybe in a future project.
Re: A simple optimization
by on (#238546)
Thanks guys, I just sped my loop up 80%, and it makes quite a bit of difference in the boss battles where there are a lot of objects flying around at once.
Re: A simple optimization
by on (#238547)
So here is what I think will be the final version:
Code:
ClearSpriteTable:
   LDX #60            ;note: only overwriting Y coordinate to put sprites off screen.
   LDA #$FE
   .Loop:
      STA spriteTable, X
      STA spriteTable + 64, X
      STA spriteTable + 128, X
      STA spriteTable + 192, X
      DEX
      DEX
      DEX
      DEX
      BPL .Loop

   RTS
I can't believe I never thought about the fact that only the Y coordinate has to be changed to hide sprites. Thanks for pointing that out tepples. I think this version is my preferred balance between size and speed. It's turned a routine that was a cycle-eater into something trivial.
Re: A simple optimization
by on (#238551)
tepples wrote:
LDX #28 and doing four DEX at a time will speed it up even more, as only the Y coordinate really needs to be cleared.

Not only that, but I'd go even further and say only the Y coordinate of unused sprites needs to be cleared. If many sprites are on the screen, mazing them takes time, so you'll be very glad that the routine that clears the rest of them runs faster because there is less left to clear. This will not compensate for the larger number of sprites mazed (and the time it took to maze them), but it will still help keep the overall CPU load under control.
Re: A simple optimization
by on (#238552)
gravelstudios wrote:
I wasn't referring to my code here. I was referring to Dwedit's comment about Snes9x, which I assume is written in C++ or some other compiled language. I would assume that in C++ there's no need to unroll loops. The compiler will take care of optimizing it for you.

Ah, I see. Thanks for clarifying.

No, this is a common misconception about software today -- that somehow the compiler (or PL in general) always knows what's best. I think in 2019 compilers do a pretty good job of making intelligent choices based on tons of heuristics, but there are still cases where doing something manually (e.g. unrolling a loop) can result in more performant code. Godbolt (for C/C++) can sometimes come in handy for analysing stuff like this on a per-compiler and per-compiler-version basis, if the routine/thing is simple. I've run into this situation with Perl, Python, etc. as well ("edge cases" where doing something in a loop one way is worse than another way, but both are still worse than doing it unrolled). Every situation varies.
Re: A simple optimization
by on (#238554)
Bregalad wrote:
Not only that, but I'd go even further and say only the Y coordinate of unused sprites needs to be cleared.

I'm surprised it took this long for someone to mention this... But yeah, don't hide all the sprites beforehand, hide only the ones you haven't used once you're done generating OAM data.

Like Bregalad said, this will also help balance the time you spend on OAM manipulation on busy frames and not so busy ones... If you hide all sprites on a busy frame (lots of sprites), all the time you'll spend hiding them will be for nothing, since you'll end up bringing all sprites back, which will take even more time, increasing the chances of slowdowns.
Re: A simple optimization
by on (#238555)
The way my sprite copying/cycling routine works right now relies on all of the sprites being cleared off the screen first. I tried to patch it up to only move the sprites off-screen that aren't being used, but it would have involved reworking a lot of other code to make it work right, and I decided it wasn't worth the hassle. Now that I know more, maybe I will try to engineer things that way from the start on my next project.
Re: A simple optimization
by on (#238564)
Really? You are able to claim OAM slots for actual sprites, obviously... how is that any different from claiming them for "dummy" sprites which actually just put the Y coordinate off-screen? Whatever method you're using for selecting OAM slots for use, do the exact same thing for putting the remaining Y coordinates off-screen, until all 64 slots have been claimed.

You'd have to give up some of the optimizations suggested in this thread if you can't clear the slots sequentially, but you'd avoid that worst case where all slots are written to twice.
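In the simplest case, where slots are claimed sequentially, that finishing step can be as small as something like this (rough sketch; oamIndex is a hypothetical variable holding the offset of the first unused OAM byte, and the caller skips the call when all 64 slots were used):

Code:
HideRemainingSprites:
   ldx oamIndex         ; first unused OAM byte (always a multiple of 4)
   lda #$FE
@loop:
   sta spriteTable,x    ; only the Y coordinate needs to move off-screen
   inx
   inx
   inx
   inx
   bne @loop            ; offset 252 + 4 wraps to 0, which ends the loop
   rts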
Re: A simple optimization
by on (#238599)
I don't think you can unroll the clear/finish loop if you implement blinking using shuffle arrays. Your sprite data is going to be all over the place. But you can still clear only the sprites you did not use.

Code:
; Warning. Code is written in place, not tested. But I hope it explains what I intended
MAX_SEQUENCE = 2
spriteSequence1:
.byte 0, 12, 28, 44, ..., 252
spriteSequence2:
.byte 0, 44, 12, 28, ..., 252
spriteSequencesLo:
.byte <spriteSequence1, <spriteSequence2
spriteSequencesHi:
.byte >spriteSequence1, >spriteSequence2

BeginSprites:
lda #0
sta currentSpriteId
inc spriteSequenceIndex
ldx spriteSequenceIndex
cpx #MAX_SEQUENCE       ; compare the index itself (it's in X, not A)
bne @noSequenceReset    ; only wrap back to 0 once we run past the last sequence
    ldx #0
    stx spriteSequenceIndex
@noSequenceReset:
lda spriteSequencesLo,x
sta spriteSequence+0
lda spriteSequencesHi,x
sta spriteSequence+1
rts

AddSprite:
ldy currentSpriteId
; assuming spriteSequence is a pointer to one of the shuffle arrays
; and the array contains OAM byte offsets (e.g. 0, 12, 28), not sprite IDs
lda (spriteSequence),y
tax
lda spriteY
sta OAM,x
; ...
; Fill the rest
inc currentSpriteId
rts

FinishSprites:
ldy currentSpriteId
cpy #64
bcs @done           ; all 64 OAM slots are already used, nothing left to hide
lda (spriteSequence),y
tax
lda #$FF
sta OAM,x
inc currentSpriteId
jmp FinishSprites
@done:
rts


Edit: fixed incorrect spriteSequence#
Re: A simple optimization
by on (#238600)
Now that I think about it, if you have ROM to spare, you can unroll the clear loop completely and only clear the remaining sprites, even with cycling. Something like this:

Code:
FinishSprites:
   ; load the index of the first unused sprite
   ldy currentSpriteId
   cpy #64
   bcs @done                ; every OAM slot is already used, nothing left to hide
   ; RTS trick: push (entry point - 1); the RTS below then jumps into the unrolled code
   lda FinishSpritesJumpHi,y
   pha
   lda FinishSpritesJumpLo,y
   pha
@done:
   rts
   ; the .repeat just avoids copy-pasting the 64 entry points
   .repeat 64, I
      .ident(.concat("sprite", .string(I), "clear")):
      lda (spriteSequence),y
      tax
      lda #$FF
      sta OAM,x
      iny
   .endrep
   rts

FinishSpritesJumpLo:
   .repeat 64, I
      .byte .lobyte(.ident(.concat("sprite", .string(I), "clear")) - 1)
   .endrep
FinishSpritesJumpHi:
   .repeat 64, I
      .byte .hibyte(.ident(.concat("sprite", .string(I), "clear")) - 1)
   .endrep
Re: A simple optimization
by on (#238616)
Quote:
Now that I think about it, if you have ROM to spare, you can unroll the clear loop completely and only clear the remaining sprites, even with cycling.

I didn't read the code or anything, but generally speaking, completely unrolling loops tends to be a tremendous waste of ROM for only a very marginal speed gain. As mentioned before in this thread, something around 2 or 4 iterations per loop is enough to speed things up significantly, and any further unrolling won't make a significant difference.

EDIT: OK, I have now read the code. I think if you have 4 or fewer sprite sequences (i.e. the total length is 256 bytes or less), you should not use indirect addressing to access the sequences, as it is a pure waste of time. This by itself will save more cycles than any amount of unrolling.
Re: A simple optimization
by on (#238629)
Bregalad wrote:
EDIT: OK, I have now read the code. I think if you have 4 or fewer sprite sequences (i.e. the total length is 256 bytes or less), you should not use indirect addressing to access the sequences, as it is a pure waste of time. This by itself will save more cycles than any amount of unrolling.


I'm definitely not a good 6502 programmer yet. I'm still learning and always will be. What would be the alternative?

This is what I have now

Code:
; header. load pointer once
ldx spriteSequenceIndex ; ZP  3
lda spriteSequencesLo,x ; Abs 4+
sta spriteSequence+0    ; ZP  3
lda spriteSequencesHi,x ; Abs 4+
sta spriteSequence+1    ; ZP  3
; total 17+
...
; load value from sequence every sprite
lda (spriteSequence),y  ; Ind 5+
; 5 * 64 sprites = 320+

Total 320 + 17 = 337 cycles, regardless of the sequence #.
I summed the minimum possible cycles.

When I look at it, one alternative is branching at every iteration:

Code:
lda spriteSequenceIndex   ; ZP  3 (the index must be in A for the CMPs; X holds the sprite index)
cmp #0                    ; Imm 2
bne @no0                  ; branch taken 3+
   lda spriteSequence0,x
   jmp @end
@no0:
cmp #1                    ; Imm 2
bne @no1                  ; branch taken 3+
   lda spriteSequence1,x
   jmp @end
@no1:
cmp #2                    ; Imm 2
bne @no2                  ; branch taken 3+
   lda spriteSequence2,x
   jmp @end
@no2:
cmp #3                    ; Imm 2
bne @no3                  ; branch taken 3+
   lda spriteSequence3,x ; Abs 4+
@no3:
@end:
; Total 27

So when spriteSequenceIndex is 3, the total would be 27 * 64 sprites = 1728 cycles. Looks like indirect addressing wins.

Or use the RTS trick and run 4 different implementations:

Code:
; jump once at the beginning
ldx spriteSequenceIndex   ; ZP  3
jsr JumpToSequenceImplementation ; 6

JumpToSequenceImplementation:
lda SequenceImplementationHi,x ; Abs 4+
pha                            ; 3
lda SequenceImplementationLo,x ; Abs 4+
pha                            ; 3
rts                            ; 6
; Total: 29

; Every iteration
ldx currentSpriteId   ; ZP  3
lda spriteSequence3,x ; Abs 4+
; Total: 7 * 64 sprites = 448
; Grand total = 448 + 29 = 477

; or if we can preserve X
lda spriteSequence3,x ; Abs 4+
; Total: 4 * 64 sprites = 256
; Grand total = 256 + 29 = 285


So we can save 337 - 285 = 52 cycles. But we waste more ROM.

Is there a more optimized solution for that?
Re: A simple optimization
by on (#238635)
The idea of the "RTS trick" (also known as a jump table) makes perfect sense in this context: you have 4 highly optimized sprite mazing routines for all 4 orders. However, that's not what I had in mind. The idea was just to use normal indexed mode. The LDX $xxxx,Y or LDY $xxxx,X instructions would be more useful here than LDA ($xx),Y.
Just pasting your code modified to use what I had in mind.

Code:
; Warning. Code is written in place, not tested. But I hope it explains what I intended
MAX_SEQUENCE = 2
spriteSequence:               ; the sequences are stored back to back and indexed directly
spriteSequence1:
.byte 0, 12, 28, 44, ..., 252
spriteSequence2:
.byte 0, 44, 12, 28, ..., 252
spriteSequencesLo:
.byte <spriteSequence1, <spriteSequence2
spriteSequencesHi:
.byte >spriteSequence1, >spriteSequence2

BeginSprites:
lda #0
sta currentSpriteId
lda spriteSequenceIndex
clc
adc #64                 ; Size of a sequence
cmp #MAX_SEQUENCE*64    ;MAX_SEQUENCE can only be 2 or 3 for this to work, if it's 4 the CMP instruction should be removed, if it's more than 4 it won't work at all
bne @noSequenceReset    ; only wrap back to 0 once we've stepped past the last sequence
    lda #0
@noSequenceReset:
sta spriteSequenceIndex
tax
rts

AddSprite:
lda currentSpriteId
clc
adc spriteSequenceIndex
tay
; assuming spriteSequence is the base address of the shuffle arrays (stored back to back)
; and the arrays contain OAM byte offsets (e.g. 0, 12, 28), not sprite IDs
ldx spriteSequence,y
lda spriteY
sta OAM,x
; ...
; Fill the rest
inc currentSpriteId
rts

FinishSprites:
lda currentSpriteId
cmp #64
bcs FinishSprites_done     ; every sprite already used, nothing left to hide
clc
adc spriteSequenceIndex
tay                        ; Y = position within the current sequence
FinishSprites_loop:
ldx spriteSequence,y
lda #$FF
sta OAM,x
iny
inc currentSpriteId
lda currentSpriteId
cmp #64                    ; stop once all 64 sprites have been handled
bcc FinishSprites_loop
FinishSprites_done:
rts


Now, there is no right or wrong, it just depends on what you want to do. A thing I like to do when mazing sprites is to constantly keep the index in either X or Y, and only use the remaining 2 registers for all other operations. Your idea of storing and loading "currentSpriteId" in a zero-page variable instead of keeping it in a register is a source of inefficiency.

Also, I like to generate what you call the "sequences" on the fly; that is, the sequence is not stored anywhere in ROM, it's just the result of a computation. This has its limits, though.
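Just as one example of what I mean (an illustration, not necessarily how I'd do it in a real game; oamStart, oamOffset and spritesUsed are placeholder variables): rotate the OAM slot you start writing at every frame, so the "sequence" is nothing more than "start somewhere else this time and step by 4".

Code:
BeginSprites:
   lda oamStart
   clc
   adc #52              ; 13 sprites per frame; 13 is coprime to 64, so every slot gets its turn
   sta oamStart
   sta oamOffset        ; this frame's first sprite goes into this slot
   lda #0
   sta spritesUsed
   rts

AddSprite:
   ldx oamOffset
   lda spriteY
   sta OAM,x            ; Y coordinate
   ; ... tile, attribute and X coordinate go to OAM+1,x / OAM+2,x / OAM+3,x ...
   txa
   clc
   adc #4               ; next slot, wrapping around the 256-byte page
   sta oamOffset
   inc spritesUsed
   rts

Hiding the leftovers is then just a matter of continuing to step oamOffset in the same way, writing $FE to the Y coordinate until spritesUsed reaches 64.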
Re: A simple optimization
by on (#238643)
Bregalad wrote:
The idea of the "RTS trick" (also known as a jump table) makes perfect sense in this context: you have 4 highly optimized sprite mazing routines for all 4 orders. However, that's not what I had in mind. The idea was just to use normal indexed mode. The LDX $xxxx,Y or LDY $xxxx,X instructions would be more useful here than LDA ($xx),Y.
Just pasting your code modified to use what I had in mind.

Code:
lda spriteSequenceIndex
clc
adc #64                 ; Size of a sequence



Thank you for the explanation. It makes sense to me, except for this part: I'm assuming you wanted to multiply spriteSequenceIndex * 64 to get an offset, instead of adding 64 to [0, 1, 2, 3]. Is that correct?
Re: A simple optimization
by on (#238645)
Basically, the initial offset will cycle through 0 and 64 if you have 2 sequences; 0, 64 and 128 if you have 3 sequences; or 0, 64, 128 and 192 if you have 4 sequences. How this is achieved is not of much interest.

Another way to do this is to interleave the sequences: you add 2, 3 or 4 to the index within a sequence, and the sequences themselves start one byte apart from one another. This could be more efficient or not, depending on how it is coded.
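A rough sketch of what that interleaved layout looks like with 2 sequences (placeholder data; entry i of sequence s sits at spriteSequences + i*2 + s, and spriteSequenceIndex here is back to being 0 or 1 rather than a byte offset):

Code:
spriteSequences:
   .byte 0,  0               ; seq0[0], seq1[0]
   .byte 12, 44              ; seq0[1], seq1[1]
   .byte 28, 12              ; seq0[2], seq1[2]
   ; ... 64 pairs in total ...

FinishSpritesInterleaved:
   lda currentSpriteId
   asl                       ; two bytes per sprite entry (one per sequence)
   ora spriteSequenceIndex   ; +0 or +1 picks the sequence
   tay
@loop:
   cpy #128                  ; 64 entries * 2 sequences: stop once past the end
   bcs @done
   ldx spriteSequences,y
   lda #$FF
   sta OAM,x
   iny
   iny                       ; step over the other sequence's entry
   jmp @loop
@done:
   rts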
Re: A simple optimization
by on (#238671)
Quote:
Really? You are able to claim OAM slots for actual sprites, obviously... how is that any different from claiming them for "dummy" sprites which actually just put the Y coordinate off-screen? Whatever method you're using for selecting OAM slots for use, do the exact same thing for putting the remaining Y coordinates off-screen, until all 64 slots have been claimed.
First, let me say how much I appreciate everybody's feedback. When I first read this comment, it kind of ticked me off. I mean, I had worked really hard on the code that I had already written, my carefully crafted engine, and didn't like being told that there was still a lot of room for improvement. Then I slept on it and mulled it over a little, and went back and reworked all of the code that needed to be reworked so that this type of approach would work. So ultimately I gained a lot from it. Yes, my original code was doing a lot of redundant work in several areas that I was able to trim out. So I'm grateful, Tokumaru.
Re: A simple optimization
by on (#238675)
Cool, glad I could help! :D