This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019

Lag in the game logic even though it is very short
by on (#155618)
I've run into the problem that my game starts to lag even though, as far as I can see, the game logic is pretty short.

I created a very small sample program to demonstrate my problem. It's in the attachments.

When the game logic hits, this is basically the only thing that is done:
Code:
#define ARRAY_LENGTH(array) (sizeof(array) / sizeof((array)[0]))

extern unsigned char ScrollingPosition;
#pragma zpsym("ScrollingPosition")

static unsigned char PpuUpdate[255];
static unsigned char PpuUpdateIndex;

static unsigned char b;

static const unsigned char StatusBar[] =
{
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};

void PreparePpuUpdate(const unsigned char *data, unsigned char dataLength)
{
   for (b = 0; b < dataLength; ++b)
      PpuUpdate[PpuUpdateIndex++] = data[b];
}

void ProcessNextFrame(void)
{
   ScrollingPosition += 1;

   PpuUpdateIndex = 0;
   PreparePpuUpdate(StatusBar, ARRAY_LENGTH(StatusBar));
   PreparePpuUpdate(StatusBar, ARRAY_LENGTH(StatusBar));
   PreparePpuUpdate(StatusBar, ARRAY_LENGTH(StatusBar));
}


In the sample program, the PpuUpdate array isn't actually used for anything.
But in the actual game, this is the array that gets written into the PPU: I fill two status bars and, if applicable, a vertical column for the background update during scrolling.
And then I write it into the PPU during vblank to put the graphics on screen.

But the above program already takes two frames to finish the ProcessNextFrame function.
If you remove one of the PreparePpuUpdate lines, it runs at full speed.

But the problem is: At the moment, the function does nothing but fill the array, and this already exceeds the time per frame. So how am I ever supposed to include actual game logic, such as character behavior, enemy behavior, etc.? How will this be possible if the above code already lags?

Or maybe I'm also doing something wrong in the "Main.s" file.

Does anybody have an idea?
Re: Lag in the game logic even though it is very short
by on (#155620)
cc65 appears to be extremely inefficient when it comes to passing pointers as parameters - it maintains its own internal stack for function parameters and local variables, and it takes nearly half a scanline just to push a pointer. Overall, a single call to PreparePpuUpdate(), which appends 26 bytes to a buffer, is taking over 12,000 cycles to perform.

If you rewrite that function in assembly, you should be able to get it down to 600 cycles or less.
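
A rough frame-budget calculation shows why three calls lag while two do not. The sketch below assumes an NTSC NES with about 29,780 CPU cycles per frame (a figure not stated in the thread) and takes the 12,000 and 600 cycle estimates from the post above. `frames_needed` and `total_cycles` are hypothetical helper names, and the real budget is smaller still because the vblank handler also needs time.

```c
#include <assert.h>

/* Assumption: NTSC NES, roughly 29,780 CPU cycles per video frame
   (about 1.79 MHz CPU clock at about 60.1 frames per second). */
#define CYCLES_PER_FRAME 29780UL

/* Cost of `calls` invocations at `cycles_per_call` cycles each. */
unsigned long total_cycles(unsigned long calls, unsigned long cycles_per_call)
{
    return calls * cycles_per_call;
}

/* Number of whole frames needed to finish `cycles` of work (rounded up). */
unsigned long frames_needed(unsigned long cycles)
{
    assert(cycles > 0);
    return (cycles + CYCLES_PER_FRAME - 1) / CYCLES_PER_FRAME;
}
```

With these numbers, `frames_needed(total_cycles(3, 12000))` is 2 while `frames_needed(total_cycles(2, 12000))` is 1, which matches the behavior described in the first post. At about 600 cycles per call, three calls fit into one frame with plenty of room to spare.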
Re: Lag in the game logic even though it is very short
by on (#155622)
Thanks for the information. I was beginning to fear that the combination of the for-loop and incrementing the index variables was the reason it is so slow.

O.k., it sucks that it is like it is. But I might be able to work with it.
For a start, I'll just use the standard C function memcpy, even though I didn't want to use any actual C library functions.
I'll have to see what specific functions I'd have to write myself where a simple memcpy isn't enough anymore.

Does anybody know a workaround where I don't have to use Assembly or C standard functions?

I assume declaring the function as a parametrized macro isn't so good either because this would fill up my ROM space.
Re: Lag in the game logic even though it is very short
by on (#155624)
I'm very interested in figuring out why the above code took 12,000 cycles. I'm at work, so I can't look at it now. My gut tells me that you can write it in C, but in an 'assembly like' way, and it should work as well as assembly.
Since cc65 translates a passed pointer into a push to a stack and then a pull, and some juggling to index from the pointer (y)...
Re: Lag in the game logic even though it is very short
by on (#155627)
dougeff wrote:
My gut tells me that you can write it in C, but in an 'assembly like' way

What do you mean?

I tried to declare a global pointer variable and to use this inside the function instead of a parameter. Before calling the function, I would have assigned the pointer:
Code:
PreparePpuUpdateData = StatusBar;
PreparePpuUpdateDataSize = ARRAY_LENGTH(StatusBar);
PreparePpuUpdate();

But it wasn't faster either.
Re: Lag in the game logic even though it is very short
by on (#155628)
I'm away from my computer, so I can't write any test programs to identify how cc65 converts these kinds of functions to assembly.

I mean, do it in a way that it doesn't have to push any pointers to a stack. Somehow.
Re: Lag in the game logic even though it is very short
by on (#155633)
On the subject of "memcpy" being faster, the reason "memcpy" will do better than any naive version of it you can write in CC65 is that memcpy is implemented in assembly in the CRT. If you try to write a "memcpy" equivalent in CC65 C code it will always be slower.

Modern compilers often go one better and turn memcpy into an intrinsic, rather than a function, so they can optimize through the memcpy. Sometimes they can detect simple for loops and automatically convert to the intrinsic too.


I made this suggestion before, but it very often helps to avoid using the C stack; which means: avoid local variables, avoid parameters to function call. Here's a simple version:

Code:
// collection of generic "parameter" inputs
unsigned char temp_a;
unsigned char* temp_ptr;

// actual function takes no parameters, uses static parameters instead of stack
void PreparePpuUpdate()
{
   for (b = 0; b < temp_a; ++b)
      PpuUpdate[PpuUpdateIndex++] = temp_ptr[b];
}

// call like this (sort of like an assembly function call)
temp_ptr = data;
temp_a = dataLength;
PreparePpuUpdate();

You could create a macro to roll up that function call into a single line, if you need to cut down on the verbosity.

Might go even further by making the temp variables zeropage. To do this I think you have to put their definition in an assembly file (with a leading _) and extern them here with an additional line:

Code:
; assembly
.zeropage
_temp_a: .res 1
_temp_ptr: .res 2
.exportzp _temp_a, _temp_ptr ; make the symbols visible to the C side

// C
extern unsigned char temp_a;
extern unsigned char* temp_ptr;
#pragma zpsym ("temp_a");
#pragma zpsym ("temp_ptr");

I tend to have a collection of generic static temporaries with names like i, j, k, l defined like this.


And if you want a few more ideas to play with, register variables, and/or fastcall with small number of arguments may or may not help.

Not sure if this variation would be faster, would have to test each case, but this is an example of other kinds of things you can try:

Code:
#pragma regvars (on)
void fastcall PreparePpuUpdate(unsigned char data_len)
{
   register unsigned char i;
   for (i = 0; i < data_len; ++i)
      PpuUpdate[PpuUpdateIndex++] = temp_ptr[i];
}

There's a tipping point where I don't want to mess around with variations of C code guessing to try to find the compiler's "sweet spot" and I'd rather just spend the time writing in assembly, though. Sometimes you can learn things that basically apply everywhere (e.g. principles like "avoid the C stack") but sometimes all you're doing is trying to optimize a single function and it won't help you anywhere else (e.g. use of register variables is highly situation dependent). Up to you how far you want to experiment.


This is also a big pitfall of writing C on the NES. Sooner or later you run into something that just doesn't perform well enough despite looking like very sensible C code. At this point you can fight the compiler by trying variations of the code, bypass it by rewriting the function in assembly, or abandon your project entirely, I guess. Mostly I would recommend compiling and testing very frequently (try to measure performance often; while working on it I suggest writing $2001 with the greyscale bit at the end of the update) so that you have a good idea of what code you just changed that suddenly made it slow.
Re: Lag in the game logic even though it is very short
by on (#155637)
What's the point of coding in C if you have to do it in unintuitive ways, always minding the generated assembly code? Might just as well code directly in assembly, it will probably be easier/faster than going back and forth until you're satisfied with the code generated by the compiler.
Re: Lag in the game logic even though it is very short
by on (#155640)
There's a difference between "unintuitive" and "non-typical".

The stuff like statics for parameters is non-typical for C, but it's hardly unintuitive. There's a learning curve, but once you know the general pitfalls it's fairly intuitive, even if the code looks a little weird compared to typical C code.

Stuff that you really have to test to understand at all is unintuitive, like trying register variables to see if that helps, etc. I don't often find pursuing that kind of thing worthwhile when I can just do it in assembly.

You can't really expect every kind of typical code to work well on a limited platform like the NES. DRW has come across several cases where using pointers is dog-slow. CC65's C is very bad for moving arrays of data around like this. In particular, being able to use the absolute address of an array instead of a pointer variable can be a huge speed increase (though still slow compared to assembly). Just like you probably don't want to use multiply or divide in a performance critical section, you also probably don't want to copy an array through a pointer. That's a somewhat intuitive guideline, isn't it?

There's always another way to get the same task done. If one way turns out to be bad, try a different way. In this case using "memcpy" is a pretty easy alternative? (Which DRW already discovered.)
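
The memcpy alternative mentioned above could look roughly like the following. This is a minimal sketch that reuses the names from the first post; the bounds check is a hypothetical addition that was not in the original code, and no cycle counts were measured for it here.

```c
#include <assert.h>
#include <string.h>

static unsigned char PpuUpdate[255];
static unsigned char PpuUpdateIndex;

/* Appends dataLength bytes to the buffer with one library call
   instead of a byte-by-byte loop written in C. */
void PreparePpuUpdate(const unsigned char *data, unsigned char dataLength)
{
    /* Hypothetical guard, not in the original: never overrun the buffer. */
    assert((unsigned)PpuUpdateIndex + dataLength <= sizeof(PpuUpdate));

    memcpy(&PpuUpdate[PpuUpdateIndex], data, dataLength);
    PpuUpdateIndex += dataLength;
}
```

The call sites stay the same as in the first post, so only the body of the function changes. The pointer is still passed as a parameter once per call, but the per-byte work moves into the library routine.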
Re: Lag in the game logic even though it is very short
by on (#155643)
DRW wrote:
I'll just use the standard C function memcpy, even though I didn't want to use any actual C library functions.

Writing any code at all in C constantly uses functions from the CRT library. Like how if you write * the compiler actually generates a call to a multiply subroutine, any use of local variables or stack parameters calls a bunch of stack handling subroutines, etc. The only difference here is memcpy is explicitly called from C instead of only implicitly.

Also remember that nothing from the library gets added to your ROM unless you actually use it, so it's not harmful to have a CRT library that's got unused stuff in it.

It might also be worth your while to write a custom 8-bit memcpy variant in assembly that takes 2 pointer parameters (or even better uses 2 pointer ZP statics) and copies up to 255 bytes. Something simple like:
Code:
// C

extern unsigned char* ptr0;
extern unsigned char* ptr1;
#pragma zpsym ("ptr0")
#pragma zpsym ("ptr1")
extern void fastcall copy8(unsigned char len); // copies len bytes from ptr0 to ptr1 (regions should not overlap)

#define COPY8(param_dst,param_src,param_len) \
   ptr0=param_src; \
   ptr1=param_dst; \
   copy8(param_len);

// example call
COPY8(mydest,mysrc,25);

; assembly

.zeropage
_ptr0: .res 2
_ptr1: .res 2
.export _ptr0
.export _ptr1

.code
.export _copy8
_copy8:
   tay ; fastcall means len parameter is in A
   jmp @next
   @loop:
      dey
      lda (_ptr0), Y
      sta (_ptr1), Y
   @next:
      cpy #0
      bne @loop
   rts


(Could probably be optimized a bit more, but you get the idea.)
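
For testing on a PC, the same interface can be modeled in portable C. The sketch below is an assumption-laden stand-in, not the assembly routine itself: `ptr0` and `ptr1` are ordinary statics here, `copy_if` is a hypothetical caller, and the macro is wrapped in `do { ... } while (0)` so that it behaves as a single statement.

```c
#include <assert.h>

/* Static "parameters", standing in for the zero-page ptr0/ptr1. */
static const unsigned char *ptr0;   /* source */
static unsigned char *ptr1;         /* destination */

/* Portable model of the assembly loop: copies len bytes (0..255),
   walking from the last byte down to the first, like the DEY loop. */
static void copy8(unsigned char len)
{
    /* Hypothetical host-side guard: pointers must be set before copying. */
    assert(len == 0 || (ptr0 != 0 && ptr1 != 0));

    while (len != 0) {
        --len;
        ptr1[len] = ptr0[len];
    }
}

/* do { ... } while (0) makes the three steps act as one statement,
   so the macro stays correct inside an unbraced if/else. */
#define COPY8(param_dst, param_src, param_len) \
    do {                                       \
        ptr0 = (param_src);                    \
        ptr1 = (param_dst);                    \
        copy8(param_len);                      \
    } while (0)

/* Hypothetical caller: copies only when enabled, returns 1 if it copied. */
int copy_if(int enabled, unsigned char *dst,
            const unsigned char *src, unsigned char len)
{
    if (enabled)
        COPY8(dst, src, len);
    else
        return 0;
    return 1;
}
```

With a macro that expands to three separate statements, the `if`/`else` in `copy_if` would not compile, which is the main reason for the wrapper.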
Re: Lag in the game logic even though it is very short
by on (#155655)
rainwarrior wrote:
On the subject of "memcpy" being faster, the reason "memcpy" will do better than any naive version of it you can write in CC65 is that memcpy is implemented in assembly in the CRT. If you try to write a "memcpy" equivalent in CC65 C code it will always be slower.

That's of course clear. But I was surprised that you can call memcpy 50 times (probably even more) in the same frame and the game still runs at full speed. But when you write a C equivalent, the game lags when you even just call it three times.

rainwarrior wrote:
I made this suggestion before, but it very often helps to avoid using the C stack; which means: avoid local variables, avoid parameters to function call.

Yeah, that was the next thing that I thought about as well. (See above.) I tried it out, but it still wasn't faster.

Maybe I'm doing something wrong. If you replace the "ProcessNextFrame.c" in the above zip file with the below code, you can try it out:
Code:
#define ARRAY_LENGTH(array) (sizeof(array) / sizeof((array)[0]))

extern unsigned char ScrollingPosition;
#pragma zpsym("ScrollingPosition")

static unsigned char PpuUpdate[255];
static unsigned char PpuUpdateIndex;

static unsigned char b;

static const unsigned char StatusBar[] =
{
   0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
};

const unsigned char *GenericPointer;
unsigned char GenericSizeValue;

void PreparePpuUpdate_()
{
   for (b = 0; b < GenericSizeValue; ++b)
      PpuUpdate[PpuUpdateIndex++] = GenericPointer[b];
}

#define PreparePpuUpdate(data)\
{\
   GenericPointer = data;\
   GenericSizeValue = ARRAY_LENGTH(data);\
   PreparePpuUpdate_();\
}

void ProcessNextFrame(void)
{
   ScrollingPosition += 1;

   PpuUpdateIndex = 0;
   PreparePpuUpdate(StatusBar);
   PreparePpuUpdate(StatusBar);
   PreparePpuUpdate(StatusBar);
}


Only when I actually work with the constant value itself does it work at full speed again:
Code:
void PreparePpuUpdate()
{
   for (b = 0; b < ARRAY_LENGTH(StatusBar); ++b)
      PpuUpdate[PpuUpdateIndex++] = StatusBar[b];
}

But this is of course not an option since there can be any value that needs to be passed to PpuUpdate.
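
One possible middle ground is a macro that expands the loop at each call site, so every copy names its array directly. The sketch below is only an illustration: `APPEND_ARRAY` and `ScoreDigits` are hypothetical names, whether cc65 produces fast code for it is inferred from the constant-array observation above rather than measured, and each use costs additional ROM space for the repeated loop.

```c
#include <assert.h>

#define ARRAY_LENGTH(array) (sizeof(array) / sizeof((array)[0]))

static unsigned char PpuUpdate[255];
static unsigned char PpuUpdateIndex;
static unsigned char b;

/* Expands to a loop that names the array directly, so each use site
   indexes a fixed address instead of reading through a pointer.
   `array` must be a real array (not a pointer) for ARRAY_LENGTH to work. */
#define APPEND_ARRAY(array)                             \
    do {                                                \
        for (b = 0; b < ARRAY_LENGTH(array); ++b)       \
            PpuUpdate[PpuUpdateIndex++] = (array)[b];   \
    } while (0)

static const unsigned char StatusBar[26] = { 0 };
/* Hypothetical second data source, only here to show a different array. */
static const unsigned char ScoreDigits[4] = { 1, 2, 3, 4 };

void ProcessNextFrame(void)
{
    PpuUpdateIndex = 0;
    APPEND_ARRAY(StatusBar);
    APPEND_ARRAY(ScoreDigits);
    assert(PpuUpdateIndex == ARRAY_LENGTH(StatusBar) + ARRAY_LENGTH(ScoreDigits));
}
```

This only helps when the source is a named array known at compile time; data reached through a pointer at run time still needs the memcpy or assembly route.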

rainwarrior wrote:
Might go even further by making the temp variables zeropage. To do this I think you have to put their definition in an assembly file (with a leading _) and extern them here with an additional line:

There's an alternate version where you can declare the variables even in C:
Code:
#pragma bssseg(push, "ZEROPAGE")
unsigned char MyZeropageVariable;
#pragma bssseg(push, "BSS")
unsigned char AndNowWeAreInBssAgain;


I'm aware of the __fastcall__ keyword. However, if you write your function definition in C, it doesn't really change anything relevant; it's more for functions that you write in Assembly. I'll still use it just to be sure. But at the moment, something else is the bottleneck in my program, so I won't bother with these small optimizations until later.

I put #pragma regvars (on) on my todo list. I will check it out later when background buildup and sprite movement are implemented and work together in my game.

rainwarrior wrote:
Up to you how far you want to experiment.

I usually write the C code as I would in a PC program plus the general NES stuff (use global variables/static local variables instead of regular local ones etc., avoid function parameters if possible etc.). Only when something actually doesn't work with 60 frames per second anymore do I investigate and try to optimize. Furthermore, I have a bunch of items in my todo list that I need to change. But since I'm still building the basics of the game, this wasn't necessary yet.

rainwarrior wrote:
Sooner or later you run into something that just doesn't perform well enough despite looking like very sensible C code.

I'm still hoping that my game will be simple enough that this won't become an issue. But who knows? Maybe I will have to write an Assembly function here and there later.

rainwarrior wrote:
Mostly I would recommend compiling and testing very frequently

This is of course a given. I write a piece of code and see if it works.

rainwarrior wrote:
(try to measure performance often; while working on it I suggest writing $2001 with the greyscale bit at the end of the update) so that you have a good idea of what code you just changed that suddenly made it slow.

Since my program uses autoscrolling, I immediately see when something becomes slow.

tokumaru wrote:
What's the point of coding in C if you have to do it in unintuitive ways, always minding the generated assembly code? Might just as well code directly in assembly, it will probably be easier/faster than going back and forth until you're satisfied with the code generated by the compiler.

I agree with rainwarrior's suggestions. You actually just have to keep some things in mind, like declaring a bunch of global variables for your temporary stuff instead of using a local variable whenever you need it. It's still a ton easier than writing pure Assembly.

rainwarrior wrote:
Writing any code at all in C constantly uses functions from the CRT library. Like how if you write * the compiler actually generates a call to a multiply subroutine, any use of local variables or stack parameters calls a bunch of stack handling subroutines, etc. The only difference here is memcpy is explicitly called from C instead of only implicitly.

Sure, but that's the thing: Of course I use the compiler to transform C code into Assembly. But the C standard library is not just the C language anymore. It's a subjective collection of functions that the creators of C thought might be useful. And those are the kinds of functions that I want to write myself.
That a = 3* b; gets transformed to functions as well, sure, that's the way the compiler works.
But by using #include <string.h>, I explicitly use external functions in my own code. And I wanted to avoid this, especially for such basic stuff like memory copy. All the called functions in C shall be part of my own project.

I will get back to your copy function in a while. (Once I can actually move around a character on screen and that character interacts with the level, I'll take a break from implementing new stuff and try out all the code-related things that I've written down.)
Thanks for it.
Re: Lag in the game logic even though it is very short
by on (#155660)
DRW wrote:
rainwarrior wrote:
I made this suggestion before, but it very often helps to avoid using the C stack; which means: avoid local variables, avoid parameters to function call.

Yeah, that was the next thing that I thought about as well. (See above.) I tried it out, but it still wasn't faster.

Maybe I'm doing something wrong. If you replace the "ProcessNextFrame.c" in the above zip file with the below code, you can try it out...

I took a look, and in this case you're right, it doesn't help very much. It optimizes away a stack access, but it's still really bogged down in the inefficient way the compiler is doing pointer access, so... sorry, yeah, this suggestion doesn't actually make a significant difference here. (I also suspect that "register" wouldn't do any good either. The generated code for this kind of pointer copy loop is just really shitty.)

By the way, I highly recommend you break your cl65 line up into steps so you can take a look at the intermediate assembly. I had to do this to take a look at what it was producing:
Code:
cc65 -o ProcessNextFrame.s -t nes ProcessNextFrame.c
ca65 -o ProcessNextFrame.o -t nes ProcessNextFrame.s
ca65 -o Main.o -t nes Main.s
cl65 -o Test.nes -t nes -C NES.cfg Main.o ProcessNextFrame.o

; instead of

cl65 -o Test.nes -t nes -C NES.cfg Main.s ProcessNextFrame.c


DRW wrote:
There's an alternate version where you can declare the variables even in C:
Code:
#pragma bssseg(push, "ZEROPAGE")
unsigned char MyZeropageVariable;
#pragma bssseg(push, "BSS")
unsigned char AndNowWeAreInBssAgain;

Ah, interesting. Thanks for the tip.

DRW wrote:
rainwarrior wrote:
(try to measure performance often; while working on it I suggest writing $2001 with the greyscale bit at the end of the update) so that you have a good idea of what code you just changed that suddenly made it slow.

Since my program uses autoscrolling, I immediately see when something becomes slow.

The suggestion to write $2001 was to visually show where on the screen the CPU had reached when the update stopped. That way you get a rough estimate of how long the frame took; you don't have to wait for slowdown, you can see if the code you just added took 5 scanlines or 100 scanlines, even if you're still running at 60fps. (I actually like to use the emphasis bits + greyscale to color code different timings, though FCEUX doesn't support more than two emphasis states per frame yet; coming in its next version, available in beta builds...)

Sometimes something is really slow, but not quite slow enough to run over the end of the frame. If you're only looking for slowdown cases, you might add something small that becomes "the straw that broke the camel's back" and mistakenly think it was the primary cause of slowdown.
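
The greyscale marker described above could be sketched as follows. This is an illustration under stated assumptions: `mark_logic_done`, `restore_mask`, and `ppu_mask_shadow` are hypothetical names, the shadow value assumes background and sprites are both enabled, and on a PC build a plain variable stands in for the register so the logic can be checked without hardware.

```c
#include <assert.h>

#ifdef __CC65__
/* On the NES, PPUMASK lives at $2001. */
#define PPU_MASK (*(volatile unsigned char *)0x2001)
#else
/* Host build: a plain variable stands in for the register (testing only). */
static volatile unsigned char fake_ppu_mask;
#define PPU_MASK fake_ppu_mask
#endif

#define MASK_GREYSCALE 0x01   /* bit 0 of PPUMASK */
#define MASK_SHOW_BG   0x08   /* bit 3: show background */
#define MASK_SHOW_SPR  0x10   /* bit 4: show sprites */

/* Hypothetical shadow of the normal PPUMASK value; $2001 is write-only,
   so the program has to remember what it last wrote. */
static unsigned char ppu_mask_shadow = MASK_SHOW_BG | MASK_SHOW_SPR;

/* Call when the frame's game logic has finished: everything drawn
   below this point on screen turns grey. */
void mark_logic_done(void)
{
    PPU_MASK = ppu_mask_shadow | MASK_GREYSCALE;
}

/* Call from the vblank handler to restore normal colors for the next frame. */
void restore_mask(void)
{
    assert((ppu_mask_shadow & MASK_GREYSCALE) == 0);
    PPU_MASK = ppu_mask_shadow;
}
```

The height of the colored region then shows roughly how many scanlines the logic took, and the grey region shows how much of the frame is still free.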
Re: Lag in the game logic even though it is very short
by on (#155664)
rainwarrior wrote:
By the way, I highly recommend you break your cl65 line up into steps so you can take a look at the intermediate assembly.

As often, I'm actually doing this in my real program. But for the sample program, I shortened it to one line to have the whole contents as small as possible. But I'll remember it next time I need to post a complete project.

rainwarrior wrote:
The suggestion to write $2001 was to visually show where on the screen the CPU had reached when the update stopped.

Oh, you mean I shall do a mid-frame update. Yeah, that might help.
Re: Lag in the game logic even though it is very short
by on (#155682)
tokumaru wrote:
What's the point of coding in C if you have to do it in unintuitive ways, always minding the generated assembly code? Might just as well code directly in assembly, it will probably be easier/faster than going back and forth until you're satisfied with the code generated by the compiler.

Ability to share code between the NES and PC versions of a single game, perhaps? Or would it be better to write the game in assembly and then write a static translator from 6502 to C?
Re: Lag in the game logic even though it is very short
by on (#155683)
I would say the biggest advantage is that you can still do most of the stuff pretty easily. I take care not to use features that are too slow for the NES. But if I had to code completely in Assembly, then I wouldn't start to program a game in the first place.
Re: Lag in the game logic even though it is very short
by on (#155685)
I think fill-in-the-blank assembly would still be a nice thing to have. Then you concentrate on writing the game logic rather than the engine.
Re: Lag in the game logic even though it is very short
by on (#155704)
Quote:
if I had to code completely in Assembly, then I wouldn't start to program a game in the first place.


I think, once you've written a C function that works well on the NES, you can then reuse the function later (changing the variables to repurpose it), and not have to figure it out each time. It will get easier the more you do it, which is true with assembly too.
Re: Lag in the game logic even though it is very short
by on (#155706)
dougeff wrote:
true with assembly too.

Sure, but since I have been a programmer for 14 years now, I can write down my C code just like that. With Assembly, it takes me a much bigger amount of time to do less stuff.

And then there's the fact that in Assembly, you can make errors much more easily.
For example, on two different occasions, I had to do some step by step analysis because my code didn't work anymore.

The reason: I had written
LDA PPUCTRL_DEFAULT_VALUE
instead of
LDA #PPUCTRL_DEFAULT_VALUE

Something like that just doesn't happen in C.

And if some algorithm doesn't work in C, I can just paste the code into Visual Studio and conveniently debug it.
What's the alternative for Assembly code? Put it into fceux and watch the RAM. Great!
Re: Lag in the game logic even though it is very short
by on (#155711)
DRW wrote:
And if some algorithm doesn't work in C, I can just paste the code into Visual Studio and conveniently debug it.
What's the alternative for Assembly code? Put it into fceux and watch the RAM. Great!

Or you can have your assembler output a symbol table and then use a tool to convert it to the NL format that FCEUX needs. This will let you set breakpoints and see variable names in your code.
Re: Lag in the game logic even though it is very short
by on (#155752)
tepples wrote:
DRW wrote:
And if some algorithm doesn't work in C, I can just paste the code into Visual Studio and conveniently debug it.
What's the alternative for Assembly code? Put it into fceux and watch the RAM. Great!

Or you can have your assembler output a symbol table and then use a tool to convert it to the NL format that FCEUX needs. This will let you set breakpoints and see variable names in your code.

Or you can use NDX's source level debugging features. It's not great, but it works well enough. I can't remember how much I tested it with C, though...
Re: Lag in the game logic even though it is very short
by on (#155779)
O.k., I will have a look at it.

I assume a 6502 debugger like the C++ debugger in Visual Studio does not exist? You know, one where you debug your code inside the open code file. Where you can change code on the fly and where you can watch the value of each variable and constant by simply moving the mouse over it or writing its value into a specific window.
This debugger doesn't need to have anything to do with the NES since it's just about testing general functions.
Re: Lag in the game logic even though it is very short
by on (#155782)
I think NESICIDE is the only attempt at an NES IDE with a debugger so far: http://www.nesicide.com/

I haven't really used it, so I can't comment on how good it might be.

It's possible to write custom debugger add-ons for Visual Studio to interface with other systems besides the built-in one (e.g. a programmer I worked with made a debugging plugin to work with embedded Lua scripts in a game we were making). In theory you could modify an emulator to work with a Visual Studio plugin. I don't think anyone has done this yet, though.

Personally I'm fine with just updating FCEUX nl files with all the labels used in my program as part of the build process. If you can see a nearby label in the FCEUX debugger, it's really easy to find that label in your source file.
Re: Lag in the game logic even though it is very short
by on (#155783)
It wouldn't have to be an NES debugger. Just a generic 6502 debugger as convenient as Visual Studio, so that I can see if separate functions that I've written actually do everything correctly.