This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019 Download

Opinions on dynamic code generation

Opinions on dynamic code generation
by on (#43801)
So, there is a part of the game I'm working on that needs to multiply a number by another one that doesn't change while a level is played (only between levels). It's a simple 8x8=8 multiplication and with an unrolled loop it was quite fast. But I need to do this so many times during a frame that I figured I had to make the process as fast as I could.

Since I have no space for a look-up table that large, I made a routine that generates the optimal code to multiply any number by the number in question. Since the number I'm multiplying for is constant, the generated uses the accumulator as much as it can before using a RAM variable, and it knows when to add the doubling value to the accumulator without having to check any bits.

The results are very good, the generated code works flawlessly and it's much faster than the old routine, even if you consider the JMPs necessary to call it and return (it's used only in one place, so there is no need for JSR and RTS). I'm just not sure this solution is worth the 30 something bytes the dynamic code occupies, or even if it's worth the complexity it adds to the program. I can't think of any technical reason not to do this (if the RAM gets corrupted somehow, it would be just as catastrophic as, say, if a pointer got screwed up). Do you guys have any considerations on this subject? Reasons for using or not using dynamic code?

by on (#43816)
Could you post a little more about your method? I'm not sure I understand? Also, what's the cycle count for it?

You don't have 256bytes to spare in RAM for a dynamic LUT?

by on (#43818)
tomaitheous wrote:
You don't have 1/8 of your RAM for a dynamic LUT?

If a 32-byte dynamically recompiled code can do the work of a 256-byte table, reclaiming those extra 224 bytes can free up space for all sorts of things, especially on such a low memory environment as the NES.

by on (#43822)
How many different possible multipliers are there? If there are only a few, just have a routine for each constant.

by on (#43823)
1/8th? How do you know he's not using 8k SRAM at $6000?


Quote:
If a 32-byte dynamically recompiled code can do the work of a 256-byte table, reclaiming those extra 224 bytes can free up space for all sorts of things, especially on such a low memory environment as the NES.


Unless you need the speed over than space.

by on (#43831)
It sounds cool to have code that generate some other code in RAM for the NES; but I admit it's hard to see many cases where this is really usefull, but it sure is cool. I have already used techniques where some code is initialised from ROM to RAM and then modified afterwards, and code that generate an unrolled loop but not a routine that generate code on it's own.

For your case you could just use 256 bytes of RAM and generate a table for the multiplication, altough if you can waste less RAM that way then I guess it's good.

by on (#43837)
Generating code to run in RAM is farily common for games which want to do lots of writes during VBLANK. You just modify the second byte of LDA #XX instructions.

by on (#43839)
tomaitheous wrote:
You don't have 256bytes to spare in RAM for a dynamic LUT?


256 bytes of RAM is not something you nonchalantly assign to something on the NES. Though I do find myself doing that given the 5k of SRAM I can work with that's not used for saving games, I would really hesitate to use it for a look up table.

I think the idea of dynamic code generation is cool (especially for Vblank updates like Dwedit said), but consider the amount of complexity it adds to the program in terms of space consumption. Now that you may have to do more checks and stuff, how many bytes does that accumulate to? Is it more than if you wrote individual routines like Blargg suggested? If so, then you should do what Blargg suggested and write individual routines and save yourself the trouble. But I am curious as to what kind of complexity this really adds. Is it not just as simple as a JSR to RAM?

by on (#43842)
It's self-modifying code, other than changing JMP addresses (NMI/IRQ vectors pointing to RAM) I don't use it much. I've used unrolled self-modifying code for VRAM updates, for the speed. It can waste a lot of memory, but that's cheap and easy to come by, especially ROM.

by on (#43843)
tomaitheous wrote:
Could you post a little more about your method? I'm not sure I understand? Also, what's the cycle count for it?

I'll post the code at the end, and the cycle count varies too much depending on the numbers, but I'd think the generated code is at least twice as fast, but very often, more than that.

Quote:
You don't have 256bytes to spare in RAM for a dynamic LUT?

No, I don't.

blargg wrote:
How many different possible multipliers are there? If there are only a few, just have a routine for each constant.

I considered that. The code is used to calculate the index of a particular screen in the level map (y * width + x), so I considered using a different multiplication routine per level, pointed by the header of the level. However, as far as I can tell, I can't make anything faster than the generated codes, so I fail to see any advantage besides saving some RAM (with the disadvantage of using quite a bit of ROM).

Celius wrote:
256 bytes of RAM is not something you nonchalantly assign to something on the NES.

That sums up my opinion on using so much memory. This is a platformer, and uses only the stock 2KB of RAM, so wasting 256 bytes is simply out of the question.

Quote:
I think the idea of dynamic code generation is cool (especially for Vblank updates like Dwedit said)

I disagree. The amount of memory limits how many bytes you can update with this method (if you're not using extra RAM), so I fail so see why you'd want do much speed if the amount of data you can write is limited... you could probably write more data with slower code, because more data would be available. I do all my PPU writing wlike this, and it's pretty fast.

Memblers wrote:
It's self-modifying code, other than changing JMP addresses (NMI/IRQ vectors pointing to RAM) I don't use it much.

Yeah, before this, I didn't either.

Here is the code I used to test the idea (meaning it's not polished at all), if anyone is interested. It runs on the 6502 similator:

Code:
Temp = $00
LevelWidth = $34
MultiplicationRoutine = $100

   .org $8000

   ;this the multiplier
   lda #$34
   sta LevelWidth

   ;start generating code
   ldx #$00
   lda #$01
   cmp LevelWidth
   beq Done

DoShifts:
   lsr LevelWidth
   bcs ShiftsDone
   lda #$0A ;asl
   sta MultiplicationRoutine, x
   inx
   bcc DoShifts
ShiftsDone:

   lda LevelWidth
   beq Done

   ;write the code to shift the accumulator left
   lda #$0A ;asl
   sta MultiplicationRoutine, x
   inx

   ;write the code to store the accumulator
   lda #$85 ;sta
   sta MultiplicationRoutine, x
   inx
   lda #Temp
   sta MultiplicationRoutine, x
   inx

   ;write the code to rotate the accumulator right
   lda #$6A ;ror
   sta MultiplicationRoutine, x
   inx

DoAdditions:
   lsr LevelWidth
   bcc NoAddition
   ;write the code to add the accumulator to the stored value
   lda #$65 ;adc
   sta MultiplicationRoutine, x
   inx
   lda #Temp
   sta MultiplicationRoutine, x
   inx
NoAddition:
   lda LevelWidth
   beq Done
   ;write the code to shift the stored value left
   lda #$06 ;asl
   sta MultiplicationRoutine, x
   inx
   lda #Temp
   sta MultiplicationRoutine, x
   inx
   jmp DoAdditions

Done:
   ;write the code to return
   lda #$4C ;jmp
   sta MultiplicationRoutine, x
   inx
   lda #<MultiplicationDone
   sta MultiplicationRoutine, x
   inx
   lda #>MultiplicationDone
   sta MultiplicationRoutine, x


   ;test the generated code
   lda #$03
   jmp MultiplicationRoutine
MultiplicationDone:
   ;accumulator holds the result

After running the code, open the disassembly window and go to $0100 to see the generated code. I'm not sure I'm creating the absolute optimal code, but it's the best I could do so far.

by on (#43853)
Celius wrote:
256 bytes of RAM is not something you nonchalantly assign to something on the NES.


Umm.. who said anything about nonchalantly?


Quote:
(y * width + x)


So y* width is never going to result in a value greater than 8bit?

by on (#43855)
tomaitheous wrote:
Umm.. who said anything about nonchalantly?


Well I guess it's my bad. I sort of interpreted "You don't have 256bytes to spare in RAM for a dynamic LUT?" as "Don't you have 256bytes to spare in RAM for a dynamic LUT?", which would imply that it should be common to have 256 spare bytes, in which case you could nonchalantly use it.

I second tomaitheous's question: the result is 8 bits? If that's the case, then what are your max X/Y dimensions?

by on (#43857)
tomaitheous wrote:
who said anything about nonchalantly?

I must confess I have no idea what this word means, so I may have misinterpreted what Celius said.

Quote:
So y* width is never going to result in a value greater than 8bit?

Celius wrote:
I second tomaitheous's question: the result is 8 bits? If that's the case, then what are your max X/Y dimensions?

I know, it may seem strange that an index into a level map is only 8 bits, but you may remember that in my level compression scheme, the level map indexes blocks that are 256x256 pixels in size, and with 256 of those the levels can be quite big (larger than anything I've already seen on the NES). Once I reach this large block, I can decompose it into smaller ones to get tile and attribute data to write to the PPU, as well as collision data.

The limit is 128 for each dimension, I just have to watch out and never use more than 256 blocks in a level map. So, if a level is 128 blocks wide (32768 pixels), it can be at most 2 blocks tall (512 pixels).

by on (#43861)
Sorry, I didn't mean for that to sound like "what the heck?". And believe it or not, I know all about your compression scheme. I think it's pretty cool, actually. Mine is similar, but not as extreme (I have screens made up of 8x8 metatiles, so it's 2 steps behind your compression scheme). The only reason I asked that the index was 8 bits was because I was thinking you might be able to get away with precalculating a certain amount. I was thinking you might have something like a 64 by 4 level, in which case just precalculate 4 values at the beginning of the level and stick them in RAM.

I've dealt with a similar size limit before, though I ditched it for a constant size. Would you feel bad forcing your Y to be limited to something like 32 blocks? Because if you didn't, then you could use that space in RAM for a look up table 32 bytes in length rather than the code. Then just do a simple:

ldx YHigh
lda LUT,x
clc
adc XHigh

But I could understand if you felt bad forcing your limit to be 32 screens high, like if you wanted a big vertical level that's only 1 screen wide.

Actually, it's even possible that you could use those 32 values to do a simple calculation to get the desired values for a Y number larger than 32. I mean, if Y = 64, you can just shift the value for Y = 32 over once. But when you start needing to calculate for things like Y = 53, then you might as well just skip the complexity of that and do what you're doing.

EDIT: Oh, and nonchalantly is a really strange sounding word. It sounds like some sort of taco or burrito I think. It basically means "carelessly", or without concern (kind of lazy too). Like "I nonchalantly walked into work 2 hours late" or "I nonchalantly sat down in the business meeting and drank a can of beer, farting and burping the whole time." Does the meaning make more sense now?