This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019

Modularity/File Size vs. Efficiency

by on (#63756)
I'm just wondering what you would consider more important in making a game: Reducing file size or more efficiency.

by on (#63758)
Really, do whatever you need to fit the size and frame rate targets that you have set. If your game is 132 KiB, and your publisher wants you to get it down to 128 KiB so that it will fit in the smaller ROM chip, you might need to cut out some lookup tables. If you want to have a lot of big levels, space efficiency of the map encoding is important. If a lot of critters in one area are causing the CPU to take longer than 29,000 cycles to make a frame, you'll need to find some tradeoffs to get more speed.

Not everybody's targets are the same. Micronics games, for instance, appear to trade off frame rate for development time.

by on (#63759)
tepples wrote:
your publisher


Publisher? lolwut?

So basically it really doesn't matter as long as you get it done and it works? I guess I like that way of looking at it since I'm still working on my first game.

by on (#63760)
Most important: it's fun. Depending on what approach you take, you may need to work on reducing data size (if it's a game with large maps), improving efficiency (if it involves lots of objects on screen and fast action), or something else entirely (perhaps coming up with a design that can handle the complexity of a turn-based game).

by on (#63761)
Quote:
Publisher? lolwut?

tepples always talks as if we're commercial developers. He probably picked up that habit from GBAdev.
Not that we aren't all doing this at some point - tepples just says it out loud.

Personally I hate frame rate drops, so I'd rather have a "bigger" file (anyway, even if you make a VERY large 512k game, it's really small by today's standards).

by on (#63764)
Bregalad wrote:
tepples always talks as if we're commercial developers. He probably picked up that habit from GBAdev.
Not that we aren't all doing this at some point - tepples just says it out loud.

I wouldn't be saying it out loud if it weren't for Sivak and bunnyboy. (No, not that bunnyboy.) Then again, I meant "publisher" in a fairly broad sense. For example, if you are entering a 16K competition, the compo sponsor is your publisher, and you'll need to fit everything into 16384 bytes.

by on (#63774)
For me right now efficiency is the most important. My game does complicated things, and it needs to take a lot of space to do what it does at all. It sometimes needs to take even more to do it fast. The routine that runs the main character is 3.75 kilobytes or something like that.

However, I code for space on the side, since I also want more levels. So I take a page from the Atari programmers' book and turn jmps into branches whenever possible. (Each one only saves a byte, but saved bytes add up. I've already saved enough for at least a medium-sized level.)

Sometimes it's easy like this:

Code:
  bne label
  jmp label2


becomes

Code:
  bne label
  beq label2


Sometimes it's trickier, like realizing that no instruction changes the carry flag between a cmp, sec, sbc, etc. and a jmp, and then changing the jmp to a branch that branches on the state the carry flag will definitely be in.
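Here's a made-up illustration of that trickier case (labels are hypothetical):

Code:
  cmp #$10
  bcc label   ;taken when carry is clear
  jmp label2  ;carry is known to be set here

becomes

Code:
  cmp #$10
  bcc label
  bcs label2  ;always taken, since carry must be set; one byte shorter than jmp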

I do some other probably crazy things to save space. But mostly I'm about speed.

If I was doing something else I'd work on saving space, since I think it's kind of fun, actually. So yeah, depends on the game.

by on (#63775)
Kasumi, I swear every time you respond to one of my threads I learn something new about efficiency. Using a BEQ in place of a JMP is so blatantly obvious that I just completely didn't see it. Truth be told, that just gave me several ideas for ways I could make my code a little better.

by on (#63776)
blargg wrote:
Most important: it's fun.


Ding ding, blargg wins the prize. (I have no idea what the prize is, but he wins it)

by on (#63777)
I usually optimize for speed, but I can't completely forget about the size. Even though mappers allow you to have a lot of PRG-ROM, the addressing space of the 6502 is still pretty small, and only 32KB of PRG can be mapped at any given time, which means that the way your code and data are organized will have an impact on the efficiency of the program as well.

I often use branches instead of jumps when a flag (N, C, Z or V) is guaranteed to always have the same value at that point, even if I have to shuffle some instructions around to make sure that value is constant.

Another thing that helps, one byte at a time, is not doing CLC or SEC before additions and subtractions if the value of the carry is known, even if that means compensating the added/subtracted value. For example, if I want to add $20 to a variable but the carry is always set, I add $1F instead.
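As a made-up sketch of that $20 example (the variable name is just for illustration):

Code:
  ;carry is known to be set here
  lda var
  adc #$1F    ;var + $1F + carry(1) = var + $20, no clc needed
  sta var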

That's the kind of optimization I do when a block of code is already working without any "tricks" though, because sometimes if you do them while still prototyping the logic you might end up changing the way a flag behaves and screw up the rest of the logic, and that kind of bug can be hard to find.

by on (#63778)
Not setting or clearing the carry flag is very dangerous. I thought I was doing some clever optimizations, and they backfired for that reason.

by on (#63779)
That's why I only do that as the last step before considering a piece of code final, so that no changes can accidentally modify the carry.

by on (#63780)
Optimizations like doing BEQ BNE instead of BEQ JMP are pretty silly, unless it's in a macro that's used hundreds of times. Focus your efforts on things that yield hundreds of bytes of savings. Things like this are fragile (change the BEQ to something else and it breaks if you aren't watching). Others, like avoiding a CLC or SEC, are useful only in critical loops where a few cycles saved is significant. Everywhere else they just make the code more brittle. Given that assembly code is some of the hardest to debug, you don't need techniques like this making it even harder. I think most cases of this have ill effects, especially when people here use them in introductory NES code of all places.

tokumaru has it exactly right: If you're going to do things like this, do them last, and only when you can see that they will make a noticeable difference in speed or code size. Otherwise, you're de-optimizing clarity. Always remember that downside.

by on (#63785)
My current savings from using branches instead of jmps are 143ish bytes. Nothing to sneeze at. Space is space. That's a lot of animations, or a level or three. (In my formats)

Another thing I neglected to mention: tag the branches that used to be jmps with a comment. That way, if added code later pushes the branch target out of range, you just turn it back into a jmp, rather than following the first instinct of creating an extended branch, which costs both more cycles AND more bytes. (But of course this is avoided entirely if you convert jmps to branches last.)

I do agree it sometimes makes code less readable in the less obvious cases, and you should only do it when you're sure the logic in place is set in stone.

As for avoiding clc and sec, it can indeed save you a lot of cycles in loops. clc/sec before a loop instead of during every iteration is simple enough. My first 8-bit division routine took 214 cycles (max). After optimization, it takes 133 (max). (I used to divide a lot, so this was VERY important. I might be able to do it even faster if I bring back the tile type that used division.)
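For the curious, the textbook restoring-division loop shows both tricks at once - this is a generic sketch with made-up labels, not my actual routine:

Code:
  lda #0        ;remainder
  ldx #8
divloop:
  asl num       ;shift dividend left, a 0 enters bit 0
  rol a
  cmp den
  bcc nosub
  sbc den       ;cmp left the carry set, so no sec needed
  inc num       ;turn the 0 just shifted in into a quotient bit of 1
nosub:
  dex
  bne divloop
  ;quotient in num, remainder in A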

Other than in loops, I'll get rid of them if I see the opportunity, but I don't actively look for instances where I can avoid them. They're used so seldom that the byte (and cycle) savings often aren't worth it.

I'm not advocating being careless. Be REALLY sure before you make a change like any of these. I have stuff like this comment in my code:

Code:
velmask.highmid.pos:
   bne velmask.highmidstart2
   plp
   jmp velmask.end;bcc would probably work, but I'm scared to try it


I'm pretty sure all paths that lead to that label push a clear carry to the stack, but I'm not certain. If I REALLY need that byte later, I can always come back :) Never make a change you're not sure of just to save one byte - that's not what I'm advocating.

by on (#63786)
Kasumi wrote:
My current savings from using branches instead of jmps are 143ish bytes. Nothing to sneeze at. Space is space. That's a lot of animations, or a level or three.

It's nine tiles.

Quote:
Never make a change you're not sure of for one byte.

You're right. This is the NES, not the Atari 2600.

by on (#63787)
tepples wrote:
It's nine tiles.

Not even nine. I guess I need to work harder! :D Perhaps a tiny edit is in order.
Quote:
You're right. This is the NES, not the Atari 2600.

I agree, perhaps I just spend too much time browsing atariage.

by on (#63792)
Well, when it comes to optimisation, there are several tricks. Avoiding clearing/setting the carry before adds/subtracts is definitely something you want to do - it only saves one byte each time, but over a project you'll easily save hundreds of bytes that way.

I always keep the clc or sec instruction before the add/subtract, but commented out. That way, if I modify the code so that the state of the carry is no longer the same, I take it into account and don't introduce stupid bugs.

The same applies when you're loading something into a register that already holds that value. For example, you can save some bytes by removing instructions like "lda #$00" if you know A already contains $00. Again, you should keep the instruction, but commented out.

Other major optimisations are possible by making clever use of the N and C flags. For example, instead of ANDing a value with #$80, you can check the N flag. Instead of ANDing it with #$01, you can LSR it and check the C flag. Instead of ANDing it with #$40, you can either BIT it and check the V flag, or ASL it and check the N flag. Also, use "cmp #$80" to transfer the N flag (bit 7 of A) into C.
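For instance, the bit 0 test (variable name made up):

Code:
  lda flags
  and #$01
  beq bitclear

becomes

Code:
  lda flags
  lsr A       ;bit 0 falls into C
  bcc bitclear

which saves a byte (though note that A ends up shifted instead of masked).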

Another nice trick is, for example, if you want A to be #$06 if the carry flag is set and #$00 if clear (just a random example), do the following:
Code:
  bcc +
  lda #$06
  .db $cd       ;cmp $xxxx instruction
+ lda #$00
.....

This avoids a branch in the case where C is set, and saves bytes. $cd is easy to remember because you just have to think about a CD-ROM and that's it. It does however affect most flags, and the absorbed operand could accidentally read a mirror of $2007 and introduce graphical glitches that way (though that's not possible when the skipped instruction is lda #$xx (opcode $a9), which can only read $2001 mirrors; it's possible for other opcodes).

by on (#63803)
Kasumi wrote:
As for avoiding clc and sec, it can indeed save you a lot of cycles in loops. clc/sec before a loop instead of during every iteration is simple enough.

This is strategic use of such optimizations, which is usually a net benefit. It reduces clarity of the optimized routine, but frees up cycles so that the rest of your code can be clearer and not have to be as optimized.

My main point was that you can make the most impact with a given amount of your time by spending some/most of it at the end. At that time, you have the big picture and can consider high-level changes that make a big impact, or clarity-reducing low-level optimizations where they provide the most impact (like in a division routine).

by on (#63805)
Yes, the general form of this is: if a bit is already set as needed, don't set it again. So you run through the code and keep track of which bits will have a known state.

Regarding ADC #imm and SBC #imm, another trick is that if the carry is always set or always clear before the instruction, you don't need to change it. If carry is always set before an ADC #imm, just do ADC #imm-1. If it's always clear before the SBC, do SBC #imm-1. This compensates without having to clear/set the carry.
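The SBC side of this in code (hypothetical variable, an untested sketch):

Code:
  ;carry is known to be clear here
  lda var
  sbc #$0F    ;var - $0F - (1-C) = var - $10, no sec needed
  sta var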

by on (#63811)
Bregalad wrote:
The same applies when you're loading something into a register that already holds that value. For example, you can save some bytes by removing instructions like "lda #$00" if you know A already contains $00. Again, you should keep the instruction, but commented out.

Yep. This saves even more bytes than making jmps branches and avoiding sec/clc. I remember when I started I didn't understand the concept of "code takes up space". I used
Code:
  lda #$00
  sta $2007
  lda #$00;Could be commented out of course, but it's still a horrible way to do it
  sta $2007
  lda #$01
  sta $2007

;etc

to write an entire nametable. :oops: Now I always avoid loads when the number I need is already in the register. Saves two bytes and cycles every time, if it's always immediate.
Bregalad wrote:
Instead of ANDing it with #$01, you can LSR it and check the C flag.

I'm surprised I didn't really think of that! That's good stuff! I do a lot of bit checking, since each of my sprites has a lot of data to be stored in a few bytes. That's one I'll definitely see what I can do when my code is absolutely final. Only because I think some of the places I'd want to use this depend on the carry not being messed with for an actual branch. (and not a branch that could be a jmp :))
Bregalad wrote:
Another nice trick, is, for example, if you want A to be #$06 if the carry flag is set, and #$00 if clear (just random example), do the following :
Code:
  bcc +
  lda #$06
  .db $cd       ;cmp $xxxx instruction
+ lda #$00
.....


Edit: nvm, what am I thinking?

I've found my limit in how far I'd go for optimizations. I probably wouldn't do THAT, but it's safer than I first thought it was (hence the edit). It's nice to know about though (I totally wouldn't have thought of it) and there are at least three places I could use it off the top of my head. It's totally awesome if I ever wind up using it!

by on (#63812)
Some other slight optimizations:
Code:
ASL reg/mem ; clear bit 0
LSR reg/mem ; clear bit 7
INC/DEC mem     ; toggle bit 0
ASL reg     ; put bit 7 into carry, and bit 6 into N flag
CMP #$80    ; put bit 7 into carry, without modifying A
CMP #1      ; set carry if A is not zero
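The last one is handy for turning "A is nonzero" into a 0/1 value without branching (a small sketch):

Code:
  cmp #1      ;C = 1 if A != 0
  lda #0
  adc #0      ;A = 0 or 1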

by on (#63817)
The CMP #$80 and CMP #$01 trick is definitely awesome.

Another one is that if you have a jsr followed by a rts, replace it with a jmp. It saves one byte each time, but it can add up to a lot of bytes in a large program.
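In other words (hypothetical routine name):

Code:
  jsr update_sprite
  rts

becomes

Code:
  jmp update_sprite   ;its rts returns straight to our caller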

Another thing I like to do is to avoid branches. For example, if I want A = #$04 if C is clear, and A = $06 if C is set, I could do something like :
Code:
   bcc +
   lda #$06
   bcs ++
+  lda #$04
++ ....

But I find it less elegant than noticing that A = $04 + 2*C and coding it like this:
Code:
   lda #$00
   rol A
   asl A
   ora #$04
   ....

Note that this also shows you can sometimes use ORA instead of ADC - if you know the values being combined never have the same bits set - and don't have to deal with the carry.

Finally, whenever you use LU tables, be sure to have them pre-formatted as much as possible. I'd avoid doing stuff like this:
Code:
   lda LUTable,X
   asl A
   asl A
   clc
   adc #$03
   sta Wathever,Y

LUTable
   .db Val_1, Val_2, Val_3

But do it like this instead :
Code:
   lda LUTable,X
   sta Wathever,Y

LUTable
   .db Val_1*4+3, Val_2*4+3, Val_3*4+3


It can seem obvious put that way, but trust me, you don't always think of it.

Finally, it's usually good practice, if you have a table of pointers, to split it into separate high and low tables. Instead of doing this:
Code:
   asl A
   tax
   lda Adr,X
   sta PointerL
   lda Adr+1,X
   sta PointerH
   ldy #$00
   lda (Pointer),Y
   ....

Adr
   .dw Adr1, Adr2, Adr3, ...

You do it like this :
Code:
   tax
   lda AdrL,X
   sta PointerL
   lda AdrH,X
   sta PointerH
   ldy #$00
   lda (Pointer),Y
   ....

AdrL
   .db <Adr1, <Adr2, <Adr3

AdrH
   .db >Adr1, >Adr2, >Adr3

However, I admit I don't always do that, because it only saves 1 byte (the ASL A; possibly a second if the routine is called with the index already in X), and it's very annoying to split long tables manually to save just ONE byte.

However, if the data is small, all the high addresses are likely to be equal. If all the data pointed to by the AdrN labels fits in less than 256 bytes, you can take advantage of this (if your assembler supports align) and do this:
Code:
   tax
   lda AdrL,X
   sta PointerL
   lda #>Adr1  ;Same high address for all pointers
   sta PointerH
   ldy #$00
   lda (Pointer),Y
   ....

AdrL
   .db <Adr1, <Adr2, <Adr3


Then it's also likely that <Adr1 is equal to $00 (I don't know if there's a way to take advantage of that).
Also, I don't know if there is a way to do something like this for data larger than 256 bytes. Something evil would be a loop that seeks through all entries of the table up to the one requested, incrementing the high byte whenever an entry in the low table is smaller than the previous one. A variant would be to have an 8-bit size table (instead of a 16-bit pointer table) and add all the values up to the one requested to find the address. I don't know how many bytes this would save, and it'd definitely be slower. That's an interesting concept for NROM or CNROM games though.

by on (#63821)
Bregalad wrote:
Another one is that if you have a jsr followed by a rts, replace it with a jmp. It saves one byte each time, but it can add up to a lot of bytes in a large program.

Computer science has a name for this: tail call optimization. Not only does it save ROM, but it also saves two bytes of stack, giving you more breathing room if you're using half the stack page for a VRAM transfer buffer.

Quote:
Finally, whenever you use LU tables, be sure to have them pre-formatted as much as possible.

Unless you use the same table in several different ways. For example, one of my projects needs a table of X values of objects for positioning them on the background and a table of (X * 8 + 8) values for positioning sprites. You could put both tables in the source code, but then you would have to remember to change both unless you write a preprocessor that keeps them consistent.

Quote:
it's very annoying to split long tables manually to save just ONE byte.

True, another maintainability tradeoff unless you write a preprocessor.

Quote:
Also, I don't know if there is a way to do something like that but for data larger than 256 bytes. Something evil woud be to do a loop that seeks all entries of the table up to the one requested, and increment high byte when an entry in the low table is smaller than the previous entry. A variant of it would be to have a 8-bit size table (instead of a 16-bit pointer table) and add all values up to the one requested to find the address.

The map decoding in President works just like this. Each map has a start address (2 bytes) and a list of 32 length values (1 byte each). Each length value L[i] is the length in bytes of the object list data for objects whose leftmost X coordinate (in metatile units) is between i*16 and i*16+15. It's scanned once for each frame when the camera crosses a tile boundary, and it's actually not that bad in CPU time.

I'm working on doing something similar for encoding dictionaries in my text compression engine, except counting bits instead of bytes. The words in the dictionary are sorted by decreasing frequency, but only the starting address of one out of every 16 is stored in its entirety. The rest are stored as the number of bits in the Huffman-compressed spelling of that word in characters. This is one of the steps in compressing a novel by a factor of three.

by on (#63824)
Bregalad wrote:
Another thing I like to do is to avoid branches. For example, if I want A = #$04 if C is clear, and A = $06 if C is set, I could do something like :
Code:
   bcc +
   lda #$06
   bcs ++
+  lda #$04
++ ....

But I find it less elegant than noticing that A = $04 + 2*C and coding it like this:
Code:
    lda #$00
    rol A
    asl A
    ora #$04
    ....

Simpler and more general form of "If carry, A=xx, otherwise A=yy":
Code:
    lda xx
    bcs +
    lda yy
+:

Note that this works for xx and yy as immediate values, or values loaded from memory. If the condition is based on N or Z, load the value into A before you set the condition:
Code:
    lda xx
    cpx foo
    bne +
    lda yy
+:

by on (#63849)
Quote:
Computer science has a name for this: tail call optimization. Not only does it save ROM, but it also saves two bytes of stack, giving you more breathing room if you're using half the stack page for a VRAM transfer buffer.

Nice, I didn't know there was a name for it. Speaking of which, I forgot to mention that if a routine is called multiple times, including at least once at the "tail" of another routine, instead of having a JMP you can paste the routine directly after the other one and comment out the JMP, saving a total of 4 bytes compared to JSR-RTS.

Not to mention the same applies to routines that are called only once throughout the code - just paste them where they're called. Although I admit I still have some routines that are called only once, for clarity and maintainability. This technique might not save bytes if branches cross the routine in question, since they might have to be replaced by conditional JMPs.
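A sketch of what I mean, with all names made up:

Code:
update_player:
   jsr read_pad
   ;jmp draw_player   ;commented out: draw_player follows directly
draw_player:
   lda player_x
   sta oam_x
   rts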

@Blargg: Yeah, this is nice too; both your approach and mine are 6 bytes, but mine executes in constant time.

by on (#63864)
Bregalad wrote:
if a routine is called multiple times, including at least one time at the "tail" of another routine, instead of having a JMP, paste the routine directly after the other one and comment the JMP, effectively saving a total of 4 bytes as opposed to JSR-RTS.

I do this too occasionally. But of course, switching from a tail call to fallthrough works only if the routines are in the same translation unit, and if you're using a shuffler to detect buffer overflows, you have to take care to keep them in the same chunk.

Quote:
Not to mention the same would apply to all routines which are called only once trough the code - just paste them where they're called.

In other words, inlining, with the caveats you mentioned.

by on (#65004)
I know smkd is going to disagree with me, but I don't understand why developers waste their already very limited time being obsessive-compulsive about memory size to gain a trivial amount of memory they don't even use.

For instance, the draw_status_bar code in Super Mario World. They make an entire loop just to load 7 registers. All it does is save 5 extra bytes that they eventually didn't use, at the cost of effort and speed. That short of a loop actually takes more effort than writing it without the loop.

by on (#65005)
I wish one of the vast library of NES emulators supported profiling similar to AMD's CodeAnalyst and Intel's VTune on x86. On the same note, it could also keep track of RAM usage, access frequencies and stuff like that (and display the results visually). It would be optimal if it could also handle assembler-generated symbols.

I kinda want to write an emulator with emphasis on developers now. Problem however is that for it to be useful the emulation would have to be pretty accurate and thus it would take a long time to develop. Accuracy wouldn't matter that much for profiling though, as long as the CPU was stable...

Oh well, rambling away once again. :D

by on (#65006)
Quote:
I kinda want to write an emulator with emphasis on developers now. Problem however is that for it to be useful the emulation would have to be pretty accurate and thus it would take a long time to develop. Accuracy wouldn't matter that much for profiling though, as long as the CPU was stable...

Nestopia is open source.

by on (#65009)
mic_ wrote:
Quote:
I kinda want to write an emulator with emphasis on developers now. Problem however is that for it to be useful the emulation would have to be pretty accurate and thus it would take a long time to develop. Accuracy wouldn't matter that much for profiling though, as long as the CPU was stable...

Nestopia is open source.

Yeah, and way too complicated for me to want to try to figure it out. :wink:

by on (#65011)
Nestopia is programmed in "Extreme C++" (think Boost). That's one level more complex than plain C++. :)

by on (#65016)
It's not like one would have to read through the entire source. You could modify Cpu::ExecuteOp() in NstCpu.cpp to log every instruction to a file, along with the current CPU state and other information that you'd like to have. Your analysis tool could then be written completely separate from the emulator and use those logs as input.

by on (#65019)
thefox wrote:
I kinda want to write an emulator with emphasis on developers now.

The NESICIDE team wants you.

by on (#65022)
Here's a question related more to pages two and one of this topic.

adc #$00 always leaves the carry clear, right? And sbc #$00 always leaves the carry set? I feel like they do, and since I use 16-bit coordinates that never move more than #$FF in a frame, I have a lot of them around my code.

by on (#65023)
These instructions will leave the carry set:
Code:
  sec
  lda #$FF
  adc #$00


These instructions will leave the carry clear:
Code:
  clc
  lda #$00
  sbc #$00

by on (#65084)
psycopathicteen wrote:
I know smkd is going to disagree with me, but I don't understand why developers waste their already very limited time being obsessive-compulsive about memory size to gain a trivial amount of memory they don't even use.

For instance, the draw_status_bar code in Super Mario World. They make an entire loop just to load 7 registers. All it does is save 5 extra bytes that they eventually didn't use, at the cost of effort and speed. That short of a loop actually takes more effort than writing it without the loop.


Is anybody going to respond? Or did I make such a good point, nobody can respond?

by on (#65087)
The way you asked that question is very off-putting. I've boldfaced some of the relevant parts:
Quote:
I know smkd is going to disagree with me, but I don't understand why developers waste their already very limited time being obsessive-compulsive about memory size to gain a trivial amount of memory they don't even use.

For instance, the draw_status_bar code in Super Mario World. They make an entire loop just to load 7 registers. All it does is save 5 extra bytes that they eventually didn't use, at the cost of effort and speed. That short of a loop actually takes more effort than writing it without the loop.

You basically fill in all the blanks with your own assumptions, and those assumptions paint the developers as doing silly things. Examine your assumptions if the conclusion they lead to is absurd.

So, show us the code, then ask questions. "Why is it written like this instead of some other way? All I can imagine is that they were trying to save a few bytes, but they don't even seem to use the bytes they saved, so it doesn't make sense." See how that doesn't bring lots of derogatory remarks into the picture, and sets the stage for actually coming up with a good explanation for the code being the way it is?

I can't help but get the feeling that you decided on the conclusion in advance and simply want to validate it, rather than actually trying to understand the purpose and the circumstances that led to the code being the way it is. I don't know about others, but I don't appreciate reading posts here that just put down developers (this isn't the only place you've done so, either). And I don't care if it's someone else you're putting down; I don't like to be in the presence of one-sided attacks like the above.

by on (#65099)
I just want somebody to respond like "You know, you do make a good point!"

My point is that optimization isn't anywhere as difficult as people make it out to be, and you shouldn't worry about it.

by on (#65106)
psycopathicteen wrote:
My point is that optimization isn't anywhere as difficult as people make it out to be

But if you have a product to ship, and it works fast enough, holding up the ship date for optimization isn't indicated.

by on (#65112)
tepples wrote:
psycopathicteen wrote:
My point is that optimization isn't anywhere as difficult as people make it out to be

But if you have a product to ship, and it works fast enough, holding up the ship date for optimization isn't indicated.


What if you already programmed the game efficiently to begin with so optimization isn't required?

by on (#65115)
psycopathicteen wrote:
What if you already programmed the game efficiently to begin with so optimization isn't required?

I'm the kind that optimizes a lot while planning, so I usually don't have to tweak the code much. A side effect of that is that it takes longer for me to see results compared to someone that codes things the straightforward way to optimize later.

by on (#65118)
tokumaru wrote:
I'm the kind that optimizes a lot while planning, so I usually don't have to tweak the code much. A side effect of that is that it takes longer for me to see results compared to someone that codes things the straightforward way to optimize later.


I aim to do this too, unless it's a very complicated routine that would benefit from a simpler implementation first and then gradually optimising it. I'd want the simplest / straightforward code when debugging it.

psychopathicteen wrote:
What if you already programmed the game efficiently to begin with so optimization isn't required?


-with optimised code involving heavy macro use and loop unrolling, it'd be code size that holds the programmer back if they're trying to fit it in a small ROM.
-producing the most efficient loops and such the first time around isn't a reasonable expectation for complicated code. If the optimal code is not the easiest to debug, then a commercial developer wouldn't be expected to write it. Not if they have to produce and debug many pages of code every day. No one is going to write near-perfect code on the first attempt either. That, and having limited time to review previously written code, probably accounts for mediocre code in plenty of SNES games.

The draw_status_bar in SMW is a poor example, since it only runs once a frame and the speed/size difference is tiny. The same sort of loop is used several times throughout the game, and it's mainly once-a-frame stuff. It adds up to over 200 bytes, which isn't much, but if many routines are kept compact the savings add up fast. Definitely more than '5 bytes', and that 200+ byte figure is just for the few times the DMA loop thing got used. The original SMW isn't pressured for vblank time either, so using slower but more compact code was not a big deal.

by on (#65120)
There are plenty more than 200 #$FF's at the end of the SMW cartridge, but I guess they just didn't know how much memory they had left until they were done.

A strategy I use: during v-blank I usually have the accumulator in 8-bit mode and the index registers in 16-bit mode. When I need to write to a single register, I use the accumulator. If I need to write to two back-to-back registers like $2116 and $2117, I use X or Y. That saves memory, development time and speed all at the same time.
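For instance, setting the VRAM address this way (assuming 8-bit A / 16-bit X are already in effect; the value is arbitrary):

Code:
  ldx #$2000
  stx $2116   ;one 16-bit store writes $2116, then $2117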

by on (#65136)
psycopathicteen wrote:
What if you already programmed the game efficiently to begin with so optimization isn't required?

Then not enough attention will be put toward making the game fun. Sure, optimization can be job 1 in a port, where you've already refined the design of the game itself. But when one is designing and implementing a game from whole cloth, design and balance compete with code efficiency. See the lead of a wiki article about NES limitations.

by on (#65154)
psycopathicteen wrote:
[...] I usually have the accumulator in 8-bit mode, and index in 16-bit mode. When I need to write to a single register I use the accumulator. If I need to write to 2 back to back registers like $2116 & $2117, I use X or Y.

Yeah, same as what I use for all my SNES code, except short sections where a 16-bit A or 8-bit X/Y significantly speeds things up or reduces tedium. I try to avoid switching sizes because it's so likely to cause mismatches with routines and subtle bugs. I could have every routine set the register size on entry and restore it on exit, but that'd be tedious and inefficient. I can't really imagine any other way, given how many 8-bit quantities you have to deal with regularly (partly because the hardware registers so often deal in 8-bit quantities).

by on (#65155)
I tend to use REP/SEP a lot to change the accumulator size. At first I'd often forget some of them, which would result in incorrect object code from the assembler. But now it's such a habit of mine that I rarely get mismatches on the M-flag. I prefer to do 16-bit operations as much as possible when I'm working on 16-bit data.

by on (#65156)
Let your assembler remember for you:
Code:
.macro a16
    .A16 ; tell assembler new size of A
    rep #$20
.endmacro

.macro ai16
    .A16 ; tell assembler new size of A
    .I16 ; ...and of X/Y
    rep #$30
.endmacro

etc.
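
The matching macros for going back to 8-bit would presumably follow the same pattern (a sketch; SEP sets the flag bits that REP clears):

Code:
.macro a8
    .A8 ; tell assembler new size of A
    sep #$20
.endmacro

.macro ai8
    .A8
    .I8
    sep #$30
.endmacro

Then a routine can just say a16 ... a8 and the assembler always agrees with the CPU about operand sizes, so a forgotten immediate-width mismatch becomes an assemble-time problem instead of a runtime one.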

by on (#65157)
I didn't mean I forgot which immediate to use. I forgot to add the instruction altogether.

by on (#65176)
I've always thought if it feels natural to change modes, then you probably should.

Likewise, when I find a loop annoying to program, I write it unlooped. If I find unrolling a loop taking too long, I loop it.

by on (#65179)
Here's just an example taken from something I wrote recently. I didn't want to wait for the auto-read result in $4218 to become ready, so I used the NES-style joypad reading method, and I wanted to get all the buttons:

Code:
   sep      #$20
   ; Read joypad1, NES style
   lda      #1
   sta      $4016         ; strobe: latch the current button state
   stz      joyData
   stz      joyData+1
   lda      #0
   sta      $4016         ; end strobe, start shifting bits out
   ldx      #12           ; 12 button bits
-:
   lda      $4016
   lsr      a             ; next button bit -> carry
   rep      #$20
   rol      joyData       ; shift it into the 16-bit word
   sep      #$20
   dex
   bne      -

   rep      #$20
   lda      joyData       ; current state
   tax
   eor      joyMask       ; bits that changed since last frame
   stx      joyMask       ; remember current state for next frame
   and      joyMask       ; changed AND currently down = newly pressed
   sta      joyData       ; joyData now holds new presses only
   sep      #$20
   lda      joyMask+1
   and      #$D
   sta      joyMask+1     ; don't block Select

Using 8-bit operations would've been faster in some places, but I prefer to be in 16-bit mode when I'm operating on 16-bit data.

As far as loops go, I tend to favor them unless I absolutely need the extra speed. If I need it to copy more/less data I can just change the loop counter instead of having to add or remove instructions.

by on (#65184)
Unless you know the routine contributes significantly to overall time, yeah, use loops when they're more convenient, or when unrolling would make the routine large. Sometimes though a .repeat is more useful, because you can do complex expressions involving the index. Since discovering ca65's .repeat, I've been using it more.
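
For example, an unrolled word copy where .repeat computes each address from the index (a sketch; src, dst, and the 16-bit A assumption are mine):

Code:
.repeat 8, i
    lda src + i*2   ; 16-bit A assumed
    sta dst + i*2
.endrepeat

The index expression i*2 is evaluated at assembly time, so there's no txa:clc:adc:tax at runtime, just eight straight load/store pairs.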

by on (#65192)
I tend to not like loops, especially loops within loops. Lack of index registers, extra CPU calculation, all the txa:clc:adc:tax-ing. Loops within loops are even harder because you're already using both index regs for one loop; what are you going to use for the other loop?

by on (#65194)
Either save one of the index registers on the stack if you can, or use a DP variable.
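
For instance, saving the outer counter with PHX looks something like this (a sketch; the loop counts are arbitrary):

Code:
   ldx #8          ; outer counter
outer:
   phx             ; park it on the stack
   ldy #16         ; inner counter
inner:
   ; ... inner loop body, X free for other uses ...
   dey
   bne inner
   plx             ; get the outer counter back
   dex
   bne outer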

by on (#65202)
Quote:
I tend to not like loops, especially loops within loops. Lack of index registers, extra CPU calculation, all the txa:clc:adc:tax-ing. Loops within loops are even harder because you're already using both index regs for one loop; what are you going to use for the other loop?

Use a ZP variable for the outer loop and the index registers for the inner loop.

Another cool thing to do is something like this:
Code:
  lda #$06
_loop
  pha              ; save the loop counter on the stack
  ..... ; complex code comes here
  pla              ; restore the counter
  sec
  sbc #$01
  bne _loop
  .... ;end of loop

It avoids using index registers or ZP addresses, and you can still use the stack as extra temporary storage inside the loop. The only cons are that it's a bit slow and that you can't easily access your counter inside the loop.

by on (#65204)
Quote:
Code:
 ....
   sec
   sbc #$01
   bne _loop
   .... ;end of loop


Use DEA instead of SEC/SBC on 65C02/65C816/HuC6280 ;)

by on (#65207)
It's too bad DEC doesn't support stack-indexed addressing on the 65816 (unless I'm missing something).
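
Right: the 65816's DEC only has accumulator, direct page, absolute, and the X-indexed forms, so a counter kept on the stack has to take a round trip through A (a sketch with 8-bit A; 1,s is wherever your counter happens to sit):

Code:
   lda 1,s        ; counter at the top of the stack
   dec a          ; DEA
   sta 1,s        ; STA doesn't touch the flags...
   bne _loop      ; ...so this still tests the DEC result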