This page is a mirror of Tepples' nesdev forum mirror (URL TBD).
Last updated on Oct-18-2019

Split-screen vertical scrolling and precise delays

by on (#106501)
Split-screen horizontal scrolling is easy: just write two values to $2005, the first being the X scrolling offset, at any time during the scanline.
But it turns out that vertical scrolling is not that easy, especially if you want to do it without any mapper support.

In a recent project of mine I did that anyway. It has been tested on MMC1, VRC6 and MMC3 (in emulation only).
You can see a screenshot of that here:
Image

Here is the source code. It takes the form of two macros.

The first one prepares a PPU write which sets the scrolling offset to a particular value.
The second one commits the write. It must be performed at a specific time window during the h-blank, or you will get artifacts.

Code:
.macro PPU_WriteOffset
        ; A = NTA
        ; X = X-Offset
        ; Y = Y-offset
        asl
        asl       ; For first write to $2006: NTA*4
        sty $01
        tay
        ;sta $2006 ; t: yyy NN YYYYY XXXXX xxx
        ;          ;    *54 32 10--- ----- --- <- These are affected. * = set zero.
        ;          ;        NN                 <- UNIQUE DATA
        txa
        lsr
        lsr
        lsr
        sta $00
        lda $01
        asl
        asl
        and #$E0
        ora $00   ; Second write to $2006:   X/8 + 32*Y/8
        ;
        ; Cycle cost: 2+2+3+2+ 2*4 + 3+3+2+2+2+3 = 32
.endmacro
.macro PPU_WriteOffset_Finish
        sty $2006
        ldy $01
        sty $2005 ; t: yyy NN YYYYY XXXXX xxx
        ;              210 -- 76543 ----- ---  <- These are affected.
        ;         ;    yyy    YY               <- UNIQUE DATA
        ;
        ; At THIS point, we must be in h-blank
        ;
        stx $2005 ; t: yyy NN YYYYY XXXXX xxx
        ;              --- -- ----- 76543 210  <- These are affected.
        ;         ;                       xxx  <- UNIQUE DATA
       
        ;
        sta $2006 ; t: yyy NN YYYYY XXXXX xxx
        ;              --- -- --765 43210 ---  <- These are affected.
        ;
        ;          This last write is the only one that updates v = t!
        ;
        ; Cycle cost: 4*4+3 = 19
.endmacro
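
To make the bit shuffling easier to follow, here is a small Python model of the four writes. It is only a sketch and not part of the original source: the names PPUScroll, split_scroll_writes and decode are made up for illustration, and the register behaviour follows the t/v/x/w description that the macro comments refer to.

```python
class PPUScroll:
    """Minimal model of the PPU scroll registers t, v, fine x and write toggle w."""
    def __init__(self):
        self.t = 0
        self.v = 0
        self.x = 0
        self.w = 0

    def write_2005(self, d):
        if self.w == 0:
            # First write: coarse X into t bits 4-0, fine X into x.
            self.t = (self.t & ~0x001F) | (d >> 3)
            self.x = d & 7
        else:
            # Second write: fine Y into t bits 14-12, coarse Y into bits 9-5.
            self.t = (self.t & ~0x73E0) | ((d & 7) << 12) | ((d & 0xF8) << 2)
        self.w ^= 1

    def write_2006(self, d):
        if self.w == 0:
            # First write: t bits 13-8 from d, bit 14 cleared.
            self.t = (self.t & 0x00FF) | ((d & 0x3F) << 8)
        else:
            # Second write: low byte of t, then v = t.
            self.t = (self.t & 0x7F00) | (d & 0xFF)
            self.v = self.t
        self.w ^= 1


def split_scroll_writes(nta, x, y):
    """Values written by the two macros, in order: $2006, $2005, $2005, $2006."""
    first_2006 = (nta << 2) & 0xFF                # NTA*4
    second_2006 = ((y << 2) & 0xE0) | (x >> 3)    # X/8 + 32*(Y/8 low bits)
    return first_2006, y, x, second_2006


def decode(ppu):
    """Split v (and fine x) back into scroll fields."""
    v = ppu.v
    return {
        "fine_y": (v >> 12) & 7,
        "nametable": (v >> 10) & 3,
        "coarse_y": (v >> 5) & 0x1F,
        "coarse_x": v & 0x1F,
        "fine_x": ppu.x,
    }


ppu = PPUScroll()
a, y_val, x_val, b = split_scroll_writes(nta=2, x=77, y=133)
ppu.write_2006(a)      # sty $2006
ppu.write_2005(y_val)  # sty $2005 (Y offset)
ppu.write_2005(x_val)  # stx $2005 (X offset)
ppu.write_2006(b)      # sta $2006 (copies t to v)
fields = decode(ppu)
```

In this model, X=77 and Y=133 on nametable 2 end up as fine Y 5, coarse Y 16, coarse X 9 and fine X 5, with all three bits of fine Y intact.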


Related to this, here is a macro which delays execution for the given number of scanlines, plus or minus a configurable number of cycles. All numbers must be compile-time constants. The macro depends on a compile-time constant called "PAL", which must be 0 if you are compiling for an NTSC system and 1 if you are compiling for a PAL system.

Code:
.macro ScanlineDelay already_done, do_scanlines, delta
    .if PAL=0
        compensate_dmc_delay (already_done), ((do_scanlines)*341   /3 + (delta))
    .else
        compensate_dmc_delay (already_done), ((do_scanlines)*341*5/16 + (delta))
    .endif
.endmacro
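
The two expressions come from the scanline length of 341 PPU dots: an NTSC CPU cycle lasts 3 dots and a PAL CPU cycle lasts 16/5 dots. As a rough illustration, the same arithmetic in Python looks like this (the function name scanline_delay_cycles is hypothetical, and integer division is assumed to truncate as in the assembler expression):

```python
def scanline_delay_cycles(do_scanlines, delta=0, pal=False):
    """Total CPU cycles that ScanlineDelay asks for, as an integer."""
    if pal:
        # 341 dots per scanline, 16/5 dots per CPU cycle
        return do_scanlines * 341 * 5 // 16 + delta
    # 341 dots per scanline, 3 dots per CPU cycle
    return do_scanlines * 341 // 3 + delta


ntsc_one = scanline_delay_cycles(1)               # 113 (113.67 truncated)
ntsc_three = scanline_delay_cycles(3)             # 341, exact
pal_sixteen = scanline_delay_cycles(16, pal=True) # 1705, exact
```

A single scanline is never a whole number of cycles, so the delta argument is what lets the caller nudge the total by hand.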


Here is a version that does not need the number of scanlines to be a compile-time constant. (It still requires the PAL/NTSC variable.)

Code:
.pushseg
.segment "FUNC_SCANLINEDELAY"
; A = number of scanlines to delay. Must be >= 1.
; Uses $00,$01,$F8 as temp.
VariableScanlineDelay:
        @nearby_rts_14cyc = $C450 ; lda #1 + rts
        @nearby_rts       = rts12
        ; Some zeropage variable.
        @remain = $F8
        sta @remain             ;3
.if PAL=0
        @n_cases = 3
.else
        @n_cases = 16
.endif
@loop:
        lda #@n_cases           ;2
        cmp @remain             ;3
        Jcc @thatmuch           ;3
        ; n_cases >= remain
        ; remain <= n_cases
        lda @remain             ;-1+3
        tax                     ;2   
        lda #0                  ;2   
        sta @remain             ;3   
        ; X = remain, remain = 0     
        jmp @jump               ;3 -- 20 so far
@thatmuch:
        ; n_cases < remain
        ; remain > n_cases
        tax                     ;2
        lda @remain             ;3
        sec                     ;2
        sbc #@n_cases           ;2
        sta @remain             ;3 -- 20 so far
        ; X = n_cases, remain -= ncases
@jump:
        TableWrapCheck (@lo_ptr_table-1), @n_cases, "@lo_ptr_table causes page wrap"
        TableWrapCheck (@hi_ptr_table-1), @n_cases, "@hi_ptr_table causes page wrap"

        lda @lo_ptr_table-1,x   ;4 assuming no wrap
        sta $00                 ;3
        lda @hi_ptr_table-1,x   ;4 assuming no wrap
        sta $01                 ;3
        jmp ($00)               ;5 -- 39 so far
@continue:
        ;                       ;  -- 42 so far (JMP)
        lda @remain             ;3
        Jne @loop               ;3
        ;                       ;-1
        ; Overhead so far: 6+3-1 = 8 cycles (including JSR).
        ;          Add 6 for RTS = 14 cycles.
        rts

.segment "DATA_SCANLINEDELAY_POINTERS"
        .byte 0 ; We don't need a delay0 pointer.
@lo_ptr_table:
    .repeat @n_cases, I
        .byte <.ident (.sprintf("@delay%d", I+1))
    .endrepeat
@hi_ptr_table:
    .repeat @n_cases, I
        .byte >.ident (.sprintf("@delay%d", I+1))
    .endrepeat

    .repeat @n_cases, I
      .segment .sprintf("DELAY%d", I+1)
      .ident (.sprintf("@delay%d", I+1)):
        ScanlineDelay 48, (I+1), 0
        jmp @continue
    .endrepeat
.popseg


It will produce a number of segments, which you will have to include in your linker script.
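
The choice of 3 cases for NTSC and 16 for PAL appears to follow from the fact that 3 NTSC scanlines (341 cycles) and 16 PAL scanlines (1705 cycles) are whole numbers of CPU cycles. The Python sketch below mirrors how the loop splits the request into chunks. It assumes that each chunk, including the 48 cycles of loop overhead passed as already_done, takes exactly the number of cycles that ScanlineDelay computes, and it leaves out the 14 cycles of JSR/RTS overhead. All function names are hypothetical.

```python
def scanline_cycles(scanlines, pal=False):
    """Cycle count used by ScanlineDelay with delta = 0."""
    if pal:
        return scanlines * 341 * 5 // 16
    return scanlines * 341 // 3


def chunk_scanlines(total, pal=False):
    """Split the request the same way the @loop / @thatmuch code does."""
    n_cases = 16 if pal else 3
    chunks = []
    remain = total
    while remain != 0:
        if n_cases < remain:
            step = n_cases        # @thatmuch: X = n_cases, remain -= n_cases
            remain -= n_cases
        else:
            step = remain         # X = remain, remain = 0
            remain = 0
        chunks.append(step)
    return chunks


def total_delay_cycles(total, pal=False):
    """Sum of the per-chunk delays."""
    return sum(scanline_cycles(c, pal) for c in chunk_scanlines(total, pal))


chunks_7 = chunk_scanlines(7)        # [3, 3, 1]
cycles_7 = total_delay_cycles(7)     # 341 + 341 + 113
```

Because every full chunk is an exact number of cycles, only the final partial chunk is rounded, so the chunked total equals the value a single ScanlineDelay would have used for the whole request.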

Both of these depend on the compensate_dmc_delay macro, which produces code that sleeps for the given number of cycles regardless of whether a DMC sample is playing. The detection of DMC sample playback must be customized for your program.
http://bisqwit.iki.fi/src/6502-inline_dmc_delay_compensation.inc

That macro, in turn, depends on a delay_n macro, which sleeps for an exact number of cycles. There are several implementations of it, in separate files:
-- http://bisqwit.iki.fi/src/6502-inline_delay-keepy.inc This preserves Y, but clobbers A, X, C, Z+N and D+V.
-- http://bisqwit.iki.fi/src/6502-inline_delay-keepaxyc.inc This preserves A, X, Y and C, but clobbers Z+N and D+V.
-- http://bisqwit.iki.fi/src/6502-inline_delay-keepaxyczndv.inc This preserves A, X, Y, C, Z+N and D+V.
-- http://bisqwit.iki.fi/src/6502-inline_delay-keepy-a25.inc This preserves just Y, and requires the presence of a function called delay_a_25_clocks. You can find the implementation below.
-- http://bisqwit.iki.fi/src/6502-inline_delay-keep-ax33-xa30.inc This clobbers all registers besides S and I, and requires the presence of functions called delay_256a_x_33_clocks and delay_256x_a_30_clocks.
The delay code is optimized for size, and is surprisingly good: most values produce 5-7 bytes of code. The versions which preserve more registers produce larger code.
Delays of up to 5000 cycles are individually optimized for size; longer delays are built recursively by halving the delay.
None of the macros leaves the contents of any memory location changed. Many of them do read RAM locations. Some of them perform two successive writes to a RAM location, where the first changes the value and the second changes it back.
You can find a version for your own set of preserved registers by editing the filename in the URL, following the established pattern. In the future, the macro might receive the set of registers as a parameter.

You can find the implementation of the aforementioned three runtime delay functions below. Portions were written by Blargg and dclxvi.
Code:
;;;;;;;;;;;;;;;;;;;;;;;;
; Delays A:X clocks+overhead
; Time: 256*A+X+33 clocks (including JSR)
; Written by Joel Yliluoma. Clobbers A. Preserves X,Y. Has relocations.
;;;;;;;;;;;;;;;;;;;;;;;;
:       ; do 256 cycles.        ; 5 cycles done so far. Loop is 2+1+ 2+3+ 1 = 9 bytes.
        sbc #1                  ; 2 cycles - Carry was set from cmp
        pha                     ; 3 cycles
         lda #(256-25-10-2-4)   ; +2
         jsr delay_a_25_clocks
        pla                     ; 4 cycles
delay_256a_x_33_clocks:
        cmp #1                  ; +2; 2 cycles overhead
        bcs :-                  ; +2; 4 cycles overhead
        ; 0-255 cycles remain, overhead = 4
        txa                     ; +2; 6; +27 = 33
        ; 15 + JSR + RTS overhead for the code below. JSR=6, RTS=6. 15+12=27
        ;          ;    Cycles        Accumulator     Carry flag
        ;          ; 0  1  2  3  4       (hex)        0 1 2 3 4
        sec        ; 0  0  0  0  0   00 01 02 03 04   1 1 1 1 1
:       sbc #5     ; 2  2  2  2  2   FB FC FD FE FF   0 0 0 0 0
        Jcs :-     ; 4  4  4  4  4   FB FC FD FE FF   0 0 0 0 0
        lsr a      ; 6  6  6  6  6   7D 7E 7E 7F 7F   1 0 1 0 1
        Jcc :+     ; 8  8  8  8  8   7D 7E 7E 7F 7F   1 0 1 0 1
:       sbc #$7E   ;10 11 10 11 10   FF FF 00 00 01   0 0 1 1 1
        Jcc :+     ;12 13 12 13 12   FF FF 00 00 01   0 0 1 1 1
        Jeq :+     ;      14 15 14         00 00 01       1 1 1
        Jne :+     ;            16               01           1
:       rts        ;15 16 17 18 19   (thanks to dclxvi for the algorithm)
;;;;;;;;;;;;;;;;;;;;;;;; 
; Delays X:A clocks+overhead
; Time: 256*X+A+30 clocks (including JSR)
; Written by Joel Yliluoma. Clobbers A,X. Preserves Y. Has relocations.
;;;;;;;;;;;;;;;;;;;;;;;;
delay_256x_a_30_clocks:
        cpx #0                  ; +2
        Jeq delay_a_25_clocks   ; +3  (25+5 = 30 cycles overhead)
        ; do 256 cycles.        ;  4 cycles so far. Loop is 1+1+ 2+3+ 1+3 = 11 bytes.
        dex                     ;  2 cycles
        pha                     ;  3 cycles
         lda #(256-25-9-2-7)    ; +2
         jsr delay_a_25_clocks
        pla                        ; 4
        jmp delay_256x_a_30_clocks ; 3.
;;;;;;;;;;;;;;;;;;;;;;;;
; Delays A clocks + overhead
; Preserved: X, Y
; Time: A+25 clocks (including JSR)  (13+6+6)
;;;;;;;;;;;;;;;;;;;;;;;;
:       sbc #7          ; carry set by CMP
delay_a_25_clocks:
        cmp #7          ;2
        Jcs :-          ;2    do multiples of 7
        ;               ; Cycles          Accumulator            Carry           Zero
        lsr a           ; 0 0 0 0 0 0 0   00 01 02 03 04 05 06   0 0 0 0 0 0 0   ? ? ? ? ? ? ?
        Jcs :+          ; 2 2 2 2 2 2 2   00 00 01 01 02 02 03   0 1 0 1 0 1 0   1 1 0 0 0 0 0
:       Jeq @zero       ; 4 5 4 5 4 5 4   00 00 01 01 02 02 03   0 1 0 1 0 1 0   1 1 0 0 0 0 0
        lsr a           ; : : 6 7 6 7 6   :: :: 01 01 02 02 03   : : 0 1 0 1 0   : : 0 0 0 0 0
        Jeq :+          ; : : 8 9 8 9 8   :: :: 00 00 01 01 01   : : 1 1 0 0 1   : : 1 1 0 0 0
        Jcc :+          ; : : : : A B A   :: :: :: :: 01 01 01   : : : : 0 0 1   : : : : 0 0 0
@zero:  Jne :+          ; 7 8 : : : : C   00 01 :: :: :: :: 01   0 1 : : : : 1   1 1 : : : : 0
:       rts             ; 9 A B C D E F   00 01 00 00 01 01 01   0 1 1 1 0 0 1   1 1 1 1 0 0 0
; ^ (thanks to dclxvi for the algorithm)
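
The cycle table for delay_a_25_clocks can be cross-checked by tracing the branches by hand. The Python sketch below does that under two assumptions: no branch crosses a page (which the Jxx macros are meant to guarantee), and the usual 6502 timings apply (2 cycles for the implied and immediate instructions used here, 2 or 3 cycles for a branch that is not taken or taken, 6 for JSR and for RTS). The function name is hypothetical.

```python
def delay_a_25_cycles(a):
    """Cycles spent in 'jsr delay_a_25_clocks' for an 8-bit value in A."""
    cycles = 6                      # JSR
    while True:
        cycles += 2                 # cmp #7
        if a >= 7:
            cycles += 3             # Jcs taken
            a -= 7                  # sbc #7, carry was set by cmp
            cycles += 2
        else:
            cycles += 2             # Jcs not taken
            break
    carry = a & 1                   # first lsr a
    a >>= 1
    cycles += 2
    cycles += 3 if carry else 2     # Jcs :+ (lands on the next line either way)
    if a == 0:
        cycles += 3                 # Jeq @zero taken
        cycles += 2                 # @zero: Jne not taken, Z is still set
    else:
        cycles += 2                 # Jeq @zero not taken
        carry = a & 1               # second lsr a
        a >>= 1
        cycles += 2
        if a == 0:
            cycles += 3             # Jeq :+ taken, straight to rts
        else:
            cycles += 2             # Jeq :+ not taken
            if not carry:
                cycles += 3         # Jcc :+ taken, straight to rts
            else:
                cycles += 2         # Jcc :+ not taken
                cycles += 3         # @zero: Jne taken, Z is clear
    cycles += 6                     # RTS
    return cycles


timings = [delay_a_25_cycles(a) for a in range(8)]
```

Under those assumptions the trace gives A+25 cycles for every value of A, matching the header comment.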


These functions require macros called Jeq, Jcc, Jne, etc. (with a capital J). These are branch macros that check for page crossing. You can find the implementation of those macros below. They were written by Blargg.

Code:
.macro branch_check opc, dest
    opc dest
    .assert >* = >dest, warning, "branch_check: failed, crosses page"
.endmacro

.macro Jcc dest
        branch_check bcc, dest
.endmacro
.macro Jcs dest
        branch_check bcs, dest
.endmacro 
.macro Jeq dest
        branch_check beq, dest
.endmacro
.macro Jne dest
        branch_check bne, dest
.endmacro
.macro Jmi dest
        branch_check bmi, dest
.endmacro
.macro Jpl dest
        branch_check bpl, dest
.endmacro
.macro Jvc dest
        branch_check bvc, dest
.endmacro 
.macro Jvs dest 
        branch_check bvs, dest
.endmacro
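
The reason for the check is that a taken 6502 branch costs one extra cycle when its destination lies on a different page than the instruction following the branch, which would break the cycle counts above. A minimal Python illustration of the same comparison (function names are hypothetical):

```python
def same_page(next_pc, dest):
    """Mirror of the assertion '>* = >dest': compare the high address bytes."""
    return (next_pc >> 8) == (dest >> 8)


def branch_cycles(next_pc, dest, taken):
    """Cycles used by a relative branch; next_pc is the address after the branch."""
    if not taken:
        return 2
    return 3 if same_page(next_pc, dest) else 4


ok_branch = branch_cycles(0xC010, 0xC020, taken=True)     # 3
slow_branch = branch_cycles(0xC0FE, 0xC105, taken=True)   # 4
```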


Related to this, there is also a macro which checks whether a table spans two pages, and produces a link-time warning if access to the table would cross a page boundary.
Code:
.macro TableWrapCheck table, last_index, message
        .assert >(table) = >(table+(last_index)), warning, message
.endmacro
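
The same test, sketched in Python for illustration (the function names are hypothetical). An indexed read such as lda table,x takes 4 cycles, plus 1 when the effective address falls on a different page than the table base, which is why the table must not wrap in cycle-counted code.

```python
def table_wraps(table, last_index):
    """True if table and table+last_index lie on different 256-byte pages."""
    return (table >> 8) != ((table + last_index) >> 8)


def lda_abs_x_cycles(table, index):
    """Cycles for 'lda table,x' with X = index."""
    return 5 if table_wraps(table, index) else 4


safe = table_wraps(0xC000, 16)      # False
unsafe = table_wraps(0xC0FE, 3)     # True
```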
Re: Split-screen vertical scrolling and precise delays
by on (#106503)
Since you only scroll vertically, you should do this (or something equivalent) instead:

Code:
lda #whatever...
sta $2000
ldx #$00
stx $2005
lda VScroll
sta $2005
stx $2005
asl A
asl A
and #$E0
sta $2006


Aside from this, great work on the delay routines, and on this hack in general! I find it really amazing. The "save anywhere" function does crash the game for me, but this may be due to the fact that I use it in conjunction with my own lifebar hack.

EDIT:
Oh and what I said is for the 1st split. The 2nd split only needs two $2006 writes, period. Anything else is superfluous.
Re: Split-screen vertical scrolling and precise delays
by on (#106512)
Bregalad wrote:
Oh and what I said is for the 1st split. The 2nd split only needs two $2006 writes, period. Anything else is superfluous.

From the so called "skinny" documentation it would seem though that just writing twice to $2006 clears the top bit of the smooth Y scrolling position. This is unacceptable.
Also, I am using horizontal scrolling as well. The entire screen is scrolled 8 pixels to the right. This happens because of attribute table granularity issues. Admittedly, this part does not require writes to $2005.
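
The point about $2006 can be illustrated with a short Python sketch. It assumes the commonly documented behaviour that the first $2006 write loads t bits 13-8 and clears bit 14, and that fine Y lives in t bits 14-12. The function names are hypothetical.

```python
def t_after_two_2006_writes(t, high, low):
    """State of t (and v) after writing high then low to $2006."""
    t = (t & 0x00FF) | ((high & 0x3F) << 8)   # first write: bit 14 is cleared
    t = (t & 0x7F00) | (low & 0xFF)           # second write: low byte, v = t
    return t


def fine_y(t):
    """Fine Y scroll is held in bits 14-12."""
    return (t >> 12) & 7


# Even if every bit is set in both writes, the top bit of fine Y stays clear.
max_fine_y = max(
    fine_y(t_after_two_2006_writes(0x7FFF, high, 0xFF)) for high in range(256)
)
```

With only two $2006 writes, fine Y can therefore only take the values 0 to 3.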

Quote:
The "save anywhere" function does crash the game for me, but this may be due to the fact I use it in conjunction with my own lifebar hack.

If you send me a copy of your hack, I can try and see what causes the crash.
Re: Split-screen vertical scrolling and precise delays
by on (#106513)
Quote:
From the so called "skinny" documentation it would seem though that just writing twice to $2006 clears the top bit of the smooth Y scrolling position. This is unacceptable.

No, it is perfectly acceptable for the second split.

Quote:
Also, I am using horizontal scrolling as well. The entire screen is scrolled 8 pixels to the right

Then just OR the last $2006 write with 1 (while keeping the value in $2005.1 appropriate; not required, but better for the sake of consistency):

Code:
lda #whatever...
sta $2000
ldx #$08
stx $2005
lda VScroll
sta $2005
stx $2005
asl A
asl A
and #$E0
ora #$01
sta $2006


Oh, and my hack is available at RHDN.
Re: Split-screen vertical scrolling and precise delays
by on (#106514)
Bregalad wrote:
Quote:
From the so called "skinny" documentation it would seem though that just writing twice to $2006 clears the top bit of the smooth Y scrolling position. This is unacceptable.

No, it is perfectly acceptable for the second split.

I am actually doing three splits there. There is one "blank" scanline that you may not notice.
But you're right only on the assumption that, for the second split, I am scrolling to a Y scrolling offset whose fine Y does not include the &4 bit. :-) Which happens to hold for now.
However, I am using the same macro for all the splits, and there's no sense in making a different macro for different cases, since all of this is cycle-counted anyway (except perhaps for code size reasons).