This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019

Duff's Device.

by on (#97069)
Dunno if it's the best code, I mostly just wanted to practice some 6502, but here is a Duff's Device solution for loading PPU data. It expects that the PPU address has already been set and that A holds the number of values to be copied from the (fake) stack.

Code:
.proc duff_copy ; reg.a has count
   
   tsx
   stx $E; save stack
   ldx $F; load fake stack
   txs
   
   ; split a into high low nybble:
   tax
   and #$0F
   asl
   asl
   sta 1                  ; low half  x 4
   txa
   lsr a
   lsr a
   lsr a
   lsr a                  
   sta 2                  ; high half
   
   lda #<jump_in
   sec
   sbc 1                  ; subtract low nibble to find entry point
                           ; only the low byte is adjusted, so the code must be placed so that <jump_in is at least 15 * 4 (which is 60)
   sta 3                  ; 3+4 is indirect jump into loop
   lda #>jump_in
   sta 4

   jmp (3)
   
   copy:
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
   jump_in:
      dec 2
      bpl   copy
      
      tsx
      stx   $F
      ldx $E
      txs
      
      rts      
.endproc


I don't like that the setup is so long; maybe there is room for improvement there.

by on (#97070)
You should pre-calculate whatever jump address and initial values you will use. Then you can load a few registers and jump right in.

by on (#97074)
I guess so.

Code:
.proc duff_copy ; reg.a has count

.segment "RODATA"

jump_in_table:
.word jump_in,jump_in-4,jump_in-8,jump_in-12,jump_in-16,jump_in-20,jump_in-24,jump_in-28,jump_in-32
.word jump_in-36,jump_in-40,jump_in-44,jump_in-48,jump_in-52,jump_in-56,jump_in-60

.segment "CODE"
   ; split a into high low nybble:
   
   tsx
   stx $E; save stack
   ldx $F; load fake stack
   txs
   
   
   tay
   lsr a
   lsr a
   lsr a
   lsr a                  
   sta 2
   tya
   and #$0F
   asl
   tax
   
   
   lda jump_in_table+1,x
   sta 4
   lda jump_in_table,x
   sta 3
   jmp (3)
   
   copy:
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
   jump_in:
      dec 2
      bpl   copy
      
      tsx
      stx   $F
      ldx $E
      txs
      
      rts      
.endproc


There is room for improvement, maybe right in the stack data, but here is one minor change. I didn't check whether it is actually faster, but it could be done before this routine.

by on (#97079)
Ah, the good old Battletoads technique to transfer data faster to VRAM. :D
The solution of using a jump table to change how many bytes are transferred is elegant, too, but here it seems the maximum number of bytes you can transfer is pretty low; you should probably add a loop around the long pla/sta chain. For example, if the chain transfers 16 ($10) bytes and you want to transfer $25 bytes, you just execute the last 5 iterations of the chain, then you do the entire loop 2 times.

Another idea would be to have the chain generated automatically in RAM; that way it doesn't waste ROM, and the jump address is simpler to compute, since you only need to compute the low byte, the high byte being constant.
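
A minimal sketch of what generating the chain in RAM could look like, assuming a free RAM page at $0700. The names RAM_CHAIN and build_chain and the trailing rts are illustrative choices, not something taken from the code above. It writes 16 copies of pla / sta PPU_DATA by opcode, so no copy of the chain has to live in ROM:

```asm
PPU_DATA  = $2007
RAM_CHAIN = $0700          ; assumed free, page-aligned RAM (hypothetical choice)

.proc build_chain          ; hypothetical routine name
   ldx #0
@loop:
   lda #$68                ; opcode for pla
   sta RAM_CHAIN,x
   lda #$8D                ; opcode for sta absolute
   sta RAM_CHAIN+1,x
   lda #<PPU_DATA          ; operand, low byte first
   sta RAM_CHAIN+2,x
   lda #>PPU_DATA
   sta RAM_CHAIN+3,x
   inx
   inx
   inx
   inx                     ; each pla/sta pair is 4 bytes
   cpx #16*4
   bne @loop
   lda #$60                ; opcode for rts, ends the chain
   sta RAM_CHAIN,x         ; x = 64 here
   rts
.endproc
```

Because the chain starts on a page boundary, the entry point for n bytes (n up to 16) is RAM_CHAIN + (16 - n) * 4, so only the low byte changes and the high byte stays $07. This sketch covers a single pass only; the outer loop would still have to be added.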

by on (#97082)
Bregalad wrote:
..you should probably add a loop around the long pla/sta chain. For example, if the chain transfers 16 ($10) bytes and you want to transfer $25 bytes, you just execute the last 5 iterations of the chain, then you do the entire loop 2 times.


It does do that.
I had read that Battletoads (I think it was) uses the PLA STA PPU_DATA pair, but I did come up with this code myself; I don't know how similar it is to code that exists. I do want to make it faster, but probably avoid placing code in RAM if possible. I think this approach is fast enough for me, at least for now.

It would be nice to just have a branch (always) instruction and modify the branch value, so I might go with RAM code.

by on (#97083)
Quote:
It does do that.

I apologize, I was just being blind ^_^

And yeah, jmp is basically your branch always instruction... However, in some cases you can optimize it and use a status flag which is in a known state to branch always.
For example, if you know C is always going to be clear at some point of your code, you can turn the jmp into a bcc, and its operand takes only a single byte.
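
A small sketch of that idea; the routine and its values are made up for illustration. After a cmp whose bcs was not taken, C is known to be clear, and lda does not change it, so bcc works as a two-byte branch always:

```asm
.proc pick_speed           ; hypothetical: A < 16 returns 1, otherwise returns 4
   cmp #16
   bcs fast                ; taken when A >= 16
   lda #1                  ; bcs fell through, so C is clear; lda leaves C alone
   bcc done                ; always taken: 2 bytes, where jmp done needs 3
fast:
   lda #4
done:
   rts
.endproc
```

This only works while the target is within branch range (-128 to +127 bytes) and nothing in between changes the flag.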

by on (#97086)
I kind of like this:

Code:
.segment "RODATA"
 
duff_copy_ram_code:
   tsx
   stx $E; save stack
   ldx $F; load fake stack
   txs
   
   ldx 2

   clv
   
   duff_branch_value:
   
   bvc jump_in
   
   copy:
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
   jump_in:
      dex
      bpl   copy
      
      tsx
      stx $F
      ldx $E
      txs
      
   
      rts      
duff_copy_ram_code_end:
   
.segment "CODE"


.proc duff_copy ; reg.a has count

   
   tay         ; this can be done before and saved to stack data but will use one more byte
   lsr a
   lsr a
   lsr a
   lsr a                  
   sta 2
   tya
   and #$0F
   asl
   asl
   sta 1
   
   lda duff_branch_value+1
   sec
   sbc 1
   sta $0700 + (duff_branch_value+1 - duff_copy_ram_code)
   
   jsr $0700

   rts
.endproc

.proc setup_duff_copy

   ldx #0
   @loop:
      lda duff_copy_ram_code,x
      sta $0700,x
      inx
      cpx # duff_copy_ram_code_end - duff_copy_ram_code
   bne @loop
   rts
.endproc


Calling code has something like this:

Code:
jsr setup_duff_copy
;start PPU transfer
tsx
stx $E ;save real stack before switching
ldx $F ;fake stack
txs
pla
sta PPU_ADDRESS
pla
sta PPU_ADDRESS
pla
   
tsx
stx $F ;save fake stack

ldx $E      ;real stack
txs
jsr duff_copy


A bit more tweaking, but I think I'm going to go with this, hope someone finds this interesting/useful.

edit: minor change

by on (#97089)
The code is fine, I have a very similar routine in my own code, but you really should stop using memory addresses ($700, $E, $F, etc) directly, and use symbols instead.

by on (#97091)
I like this thread, as it's a very worthy topic, but I just thought I'd clarify: Duff's Device is a C language construct; there is no Duff's Device in assembly. This is just loop unrolling.

The stuff we are discussing has not much to do with Duff's Device except Duff's Device is a clever way to write an unrolled loop in C.

Anyhow, sorry to be pedantic, just it seems like people think Duff invented the unrolled loop or something. Please continue.

by on (#97095)
Is it specific to C? I can be pedantic too, so I don't want to be wrong. I assumed it was just the idea of jumping into the middle of an unrolled loop to make up the remainder of transfers. (Reading Wikipedia..)

Wikipedia does say that it was a technique used in assembly and Duff simply implemented it in C, so I guess you are right, but it seems maybe one of those things that the definition changes with time to apply to other things.

thefox wrote:
..you really should stop using memory addresses ($700, $E, $F, etc) directly, and use symbols instead.


I will for sure, $0700 should be labelled, but I like using 0-$F as temp variables.. do they need a better name? (ex: temp1, temp2, temp3) I know what they are. This stems from using Tepples' nes.ini which has zeropage starting at $10 and using 0-$F as temp.. I liked the idea.

There are many more cycles to be saved.. I'll probably post better code sometime soon.

EDIT: new code:

Code:

RAM_SUB = $0700
   
.segment "CODE"

.proc PPU_Transfer

.segment "ZEROPAGE"
save_stack:   .res   1

.segment "CODE"

   tsx   
   stx save_stack      ; save real stack
   ldx #$FF               ; start pulling data from $0100
   txs
   clv
   bvc   :++               ; branch always
:   sta PPU_ADDRESS
   pla
   sta PPU_ADDRESS
   pla                     ; Size, high nybble
   tax
   pla                     ; value for branch in RAM, precalculated
   sta RAM_SUB + 1      
   jmp RAM_SUB
return_main:   
:   pla                   
   bpl   :--               ; if plus, valid ppu address, continue
   ldx save_stack      ; or not, exit
   txs
   rts
.endproc

 .segment "RODATA"
 
.proc duff_copy_ram_code
      
   bvc jump_in 
   
   copy:
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
      pla
      sta PPU_DATA
   jump_in:
      dex
      bpl   copy
      
      jmp PPU_Transfer::return_main
duff_copy_ram_code_end:
.endproc

.segment "CODE"

.proc setup_duff_copy

   ldx #0
   @loop:
      lda duff_copy_ram_code,x
      sta RAM_SUB,x
      inx
      cpx #duff_copy_ram_code::duff_copy_ram_code_end - duff_copy_ram_code
   bne @loop
   rts
.endproc


Calling code just does: jsr PPU_Transfer

(edit: small code fix..)

edit: Some testing: I can copy 160 bytes in about 1600 cycles, 5 separate transfers in the stack.

This is perfect, I think I'm done. I still need ideas on how to calculate the two bytes after the PPU address, though. The code expects the high nibble of the count, and then the offset into the loop (original branch operand - low nibble * 4). How should that be calculated in game code?
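
One possible way to do it in game code, as a sketch only. It assumes the chain has 16 pla/sta pairs of 4 bytes each, so the unmodified bvc operand is 64; make_transfer_header and CHAIN_BYTES are hypothetical names:

```asm
CHAIN_BYTES = 16 * 4       ; assumed chain size, equal to the unmodified bvc operand

; In:  A = number of bytes to transfer (0-255)
; Out: X = count / 16, the byte the transfer code pulls into X
;      A = CHAIN_BYTES - (count AND 15) * 4, the byte stored at RAM_SUB + 1
;      Y is clobbered
.proc make_transfer_header
   tay                     ; keep the original count
   lsr a
   lsr a
   lsr a
   lsr a
   tax                     ; X = number of full passes through the chain
   tya
   and #$0F
   asl a
   asl a                   ; A = remainder * 4, in the range 0..60
   eor #$FF
   sec
   adc #CHAIN_BYTES        ; A = CHAIN_BYTES - remainder * 4 (two's complement)
   rts
.endproc
```

For a count of $25 this gives X = 2 and A = $2C (64 - 5 * 4 = 44). The game code would then write X followed by A into the buffer right after the two PPU address bytes, and the data bytes after that.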

by on (#97103)
Movax12 wrote:
thefox wrote:
..you really should stop using memory addresses ($700, $E, $F, etc) directly, and use symbols instead.

I will for sure, $0700 should be labelled, but I like using 0-$F as temp variables.. do they need a better name? (ex: temp1, temp2, temp3) I know what they are. This stems from using Tepples' nes.ini which has zeropage starting at $10 and using 0-$F as temp.. I liked the idea.

What I tend to do is define procedure-local symbols at the start of each procedure:
Code:
.proc something_or_other
src = 0
srcLeft = 2
dst = 3
  lda #<SRAM_BGCHR
  sta src+0
  lda #>SRAM_BGCHR
  sta src+1
  lda #$10
  sta srcLeft
  ; etc.
  rts
.endproc

by on (#97111)
I see, thanks.

by on (#97119)
tepples wrote:
What I tend to do is define procedure-local symbols at the start of each procedure:

Or you could do the possibly more readable/maintainable:
Code:
.proc something_or_other
.struct
  src     .word
  srcLeft .byte
  dst     .word
.endstruct
  lda #<SRAM_BGCHR
  sta src+0
  lda #>SRAM_BGCHR
  sta src+1
  lda #$10
  sta srcLeft
  ; etc.
  rts
.endproc

Be careful not to run into this quirk though.

If you name the .struct you can also add an .assert to make sure its size is never more than 16 bytes, but then you'll have to use struct_name::foo to refer to the .struct members.
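
A sketch of the named version with the size check, assuming the block is declared as a ca65 .struct (the .word/.byte member syntax belongs to .struct). The name locals and the assert message are made up:

```asm
.proc something_or_other
.struct locals             ; hypothetical name
   src     .word
   srcLeft .byte
   dst     .word
.endstruct
; fail the build if the locals outgrow the 16 temp bytes at $00-$0F
.assert .sizeof(locals) <= 16, error, "locals must fit in 16 bytes"
   lda #$10
   sta locals::srcLeft     ; member offsets double as zero page addresses
   rts
.endproc
```

The cost is the extra locals:: prefix on every reference, which is the trade-off mentioned above.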