NES

The transfer of data from the buffer into the PPU must also be as fast as possible, so 2 techniques are often employed to streamline the process:
1. RLE tile compression, encoded as commands for the loading code to follow
2. Unrolled loops for bulk transfers

The drawing routine should also follow a data format that tells the drawing code what to draw and how to draw it. This format should be as generic and flexible as possible. Most implementations use a chain of ‘strings’. Each ‘string’ tells the drawing code what to draw, where, and how. There are multiple ways to build these strings, but a common one is a compact stripe and plot based RLE. This format has 2 modes, depending on whether the draw code is working with a strip of tiles or plotting individual tiles.

Strips:
byte 1: command & length
byte 2: high byte of PPU address
byte 3: low byte of PPU address
byte 4-x: data

command & length format
DTCCCCCC
D = direction (0 = right, 1 = down)
T = type (0 = literal: read a new byte of data each time, 1 = run: copy the same byte multiple times)
C = count (2-63 for a literal to the right) [a command byte of 01 invokes the plot command, 00 ends the strings]

Length Value (n, hex)	Meaning
00	End of strings
01	Plot mode
02 - 3F	Literal to right: copy n + 1 bytes to video memory addresses increasing to the right
40 - 7F	Run to right: copy one byte n - 63 times to video memory addresses increasing to the right
80 - BF	Literal down: copy n - 127 bytes to video memory addresses increasing down
C0 - FF	Run down: copy one byte n - 191 times to video memory addresses increasing down
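
As a rough illustration of how the command byte in the table is put together, here is a minimal Python sketch that packs one strip on a development machine (it is not the NES-side loading code). The name `encode_strip` and the choice to reject headers that collide with 00 and 01 are assumptions made for this example. The count field is taken to be the copy length minus 1, which is what the n + 1, n - 63, n - 127 and n - 191 rules in the table all work out to.

```python
def encode_strip(address, tiles, down=False):
    """Pack one strip as: command byte, PPU address high, PPU address low, data.

    Hypothetical helper for offline tooling, not part of any NES library.
    """
    count = len(tiles)
    if not 1 <= count <= 64:
        raise ValueError("a strip holds 1 to 64 tiles")
    run = len(set(tiles)) == 1            # every tile identical: store it once
    command = 0x80 if down else 0x00      # D bit: 0 = right, 1 = down
    command |= 0x40 if run else 0x00      # T bit: 0 = literal, 1 = run
    command |= count - 1                  # C bits: copy length minus 1
    if command in (0x00, 0x01):
        # These two values mean "end of strings" and "plot mode" in the table.
        raise ValueError("00 and 01 are reserved, use plot mode instead")
    data = tiles[:1] if run else tiles
    return bytes([command, address >> 8, address & 0xFF, *data])


print(encode_strip(0x2000, [5, 5, 5, 5]).hex(" "))          # 43 20 00 05
print(encode_strip(0x2041, [1, 2, 3], down=True).hex(" "))  # 82 20 41 01 02 03
```

A 2-tile literal to the right would need the command byte 01, so in this sketch it is rejected and would have to be sent through plot mode.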

Plot:
byte 1: PLOT_TILES constant (length of 1)
byte 2: high byte of PPUADDR
byte 3: low byte of PPUADDR
byte 4: data
- Repeat bytes 2, 3, & 4 for each additional tile -
byte x: END_PLOT constant (negative number, i.e. bit 7 set)
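
To make the two modes concrete, the following Python sketch walks a chain of strings and records which tile ends up at which PPU address. It is a reference model for checking data offline, not the drawing routine itself. The name `decode_strings` is hypothetical, and three points are assumptions rather than something the format description spells out: END_PLOT is recognised as any byte with bit 7 set where an address high byte is expected, a new command byte follows END_PLOT, and "down" advances the address by 32 (one nametable row, matching the PPU's +32 increment mode).

```python
def decode_strings(data):
    """Return {ppu_address: tile} for every write described by the strings."""
    writes = {}
    i = 0
    while True:
        command = data[i]
        i += 1
        if command == 0x00:                   # end of strings
            return writes
        if command == 0x01:                   # plot mode: (high, low, tile) triples
            while data[i] < 0x80:             # assumed: END_PLOT has bit 7 set
                address = (data[i] << 8) | data[i + 1]
                writes[address & 0x3FFF] = data[i + 2]
                i += 3
            i += 1                            # skip the END_PLOT byte
            continue
        down = bool(command & 0x80)           # D bit
        run = bool(command & 0x40)            # T bit
        count = (command & 0x3F) + 1          # same result as the table's formulas
        address = (data[i] << 8) | data[i + 1]
        i += 2
        if run:
            tiles = [data[i]] * count         # one data byte, repeated
            i += 1
        else:
            tiles = list(data[i:i + count])   # one data byte per tile
            i += count
        step = 32 if down else 1              # assumed nametable row width of 32
        for offset, tile in enumerate(tiles):
            writes[(address + offset * step) & 0x3FFF] = tile


example = bytes([
    0x02, 0x20, 0x00, 0x0A, 0x0B, 0x0C,                # literal right, 3 tiles
    0xC3, 0x20, 0x40, 0x55,                            # run down, 4 copies
    0x01, 0x21, 0x00, 0x07, 0x23, 0xC0, 0x09, 0xFF,    # plot 2 tiles, END_PLOT
    0x00,                                              # end of strings
])
result = decode_strings(example)
print(len(result))   # 9
```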

The other technique is loop unrolling. The goal is to increase the program’s speed, at the cost of space, by eliminating the instructions that control the loop, such as the “end of loop” test on each iteration, and by reducing branch penalties.

6502 rolled loop

    ldx    #8              ;2      loop counter
-   jsr    SomeFunction    ;12     jsr + rts
    dex                    ;2
    bne    -               ;3 / 2  taken / not taken

Bytes: 8
Cycles: 137

6502 unrolled loop

    jsr    SomeFunction    ;12
    jsr    SomeFunction    ;12
    jsr    SomeFunction    ;12
    jsr    SomeFunction    ;12
    jsr    SomeFunction    ;12
    jsr    SomeFunction    ;12
    jsr    SomeFunction    ;12
    jsr    SomeFunction    ;12

Bytes: 24
Cycles: 96

In this example, this simple technique cuts the cycle count by about 30% (from 137 to 96).
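
The cycle totals above can be re-derived from the per-instruction counts in the listing comments. The short Python check below does that arithmetic. The constant and function names are made up for this example, the 12 cycles are assumed to cover the `jsr` plus the matching `rts` only, and the `bne` is assumed not to cross a page boundary.

```python
JSR_RTS = 12        # per call, as annotated in the listings (assumed jsr + rts)
LDX_IMM = 2         # ldx #8
DEX = 2
BNE_TAKEN = 3       # branch back to the top of the loop
BNE_NOT_TAKEN = 2   # final pass falls through


def rolled_cycles(iterations):
    """Cycles for the counted loop: setup, body, and loop-control overhead."""
    body = iterations * (JSR_RTS + DEX)
    branches = (iterations - 1) * BNE_TAKEN + BNE_NOT_TAKEN
    return LDX_IMM + body + branches


def unrolled_cycles(iterations):
    """Cycles when the call is simply repeated with no loop control."""
    return iterations * JSR_RTS


rolled = rolled_cycles(8)
unrolled = unrolled_cycles(8)
saving = (rolled - unrolled) / rolled
print(rolled, unrolled, f"{saving:.1%}")   # 137 96 29.9%
```

The loop-control overhead (everything except the calls) is 41 cycles for 8 iterations, which is where the saving comes from.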

My implementation:
PPULoading.asm