The transfer of data from the buffer into the PPU must also be
as fast as possible, so two techniques are often employed to
streamline the process:
1. RLE tile compression, encoded as commands for the loading
   code to follow
2. Unrolled loops for bulk transfers
The drawing routine should also follow a data format that tells
the drawing code what to draw and how to draw it. This format
should be as generic and flexible as possible. Most
implementations use a chain of ‘strings’. Each ‘string’ tells
the drawing code what to draw, where, and how. There are
multiple ways to encode these strings, but a common one is a
compact stripe- and plot-based RLE. This format has 2 modes,
depending on whether the draw code is working with a strip of
tiles or plotting individual tiles.
Strips:
byte 1: command & length
byte 2: high address of PPU Address
byte 3: low address of PPU Address
byte 4-x: data
Command & length format:
DTCCCCCC
D = direction (down / right)
T = type (run: copy the same byte multiple times /
    literal: read a new byte of data each time)
C = count (2-63) [a count of 1 invokes the plot command]
Length Value | Meaning
00           | End of strings
01           | Plot mode
02 - 3F      | Literal to right: copy n + 1 bytes to video memory addresses increasing to the right
40 - 7F      | Run to right: copy one byte n - 63 times to video memory addresses increasing to the right
80 - BF      | Literal down: copy n - 127 bytes to video memory addresses increasing down
C0 - FF      | Run down: copy one byte n - 191 times to video memory addresses increasing down
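As a rough illustration of how the command & length byte splits into its fields, here is a minimal Python sketch (a model, not the 6502 code). The name decode_command is made up for this example. The count is assumed to be the low six bits plus one, which is what the table's arithmetic (n + 1, n - 63, n - 127, n - 191) works out to; if your encoder stores the count directly, drop the + 1.

```python
def decode_command(cmd):
    """Split a command & length byte (DTCCCCCC) into (type, direction, count).

    decode_command is a hypothetical helper name used only in this sketch.
    """
    if cmd == 0x00:
        return ("end", None, 0)       # end of strings
    if cmd == 0x01:
        return ("plot", None, 0)      # plot mode, no count
    direction = "down" if cmd & 0x80 else "right"   # D bit (bit 7)
    kind = "run" if cmd & 0x40 else "literal"       # T bit (bit 6)
    # Assumption: count = low six bits + 1, matching the table's arithmetic.
    count = (cmd & 0x3F) + 1
    return (kind, direction, count)


print(decode_command(0x02))   # ('literal', 'right', 3)
print(decode_command(0xC5))   # ('run', 'down', 6)
```

Masking the two top bits instead of comparing against four ranges keeps the decoder to a couple of bit tests, which is also how the byte would typically be picked apart on the 6502.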
Plot:
byte 1: PLOT_TILES constant (length of 1)
byte 2: high address of PPUADDR
byte 3: low address of PPUADDR
byte 4: data
- Repeat bytes 2,3, & 4 -
byte x: END_PLOT constant (negative number)
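To show how the two modes fit together, here is a hedged Python sketch of a drawing loop that walks a chain of strings. It is a model, not the 6502 routine, and it rests on several assumptions: vram is a bytearray standing in for PPU memory, draw_strings is a hypothetical name, END_PLOT = $FF is an assumed value (the format only says it is negative, so the sketch tests bit 7), the chain is assumed to continue with the next command after END_PLOT, the count is taken as the low six bits plus one as in the table, and 'down' is modelled as +32, the width of one NES nametable row.

```python
END_STRINGS = 0x00   # from the table: ends the chain
PLOT_TILES = 0x01    # from the format: length of 1 selects plot mode
END_PLOT = 0xFF      # assumed value; the text only says "negative number"


def draw_strings(data, vram):
    """Apply a chain of strings to vram (hypothetical model of the draw code)."""
    i = 0
    while data[i] != END_STRINGS:
        cmd = data[i]
        i += 1
        if cmd == PLOT_TILES:
            # Repeated (high, low, tile) triples until a byte with bit 7 set.
            while not (data[i] & 0x80):
                addr = (data[i] << 8) | data[i + 1]
                vram[addr] = data[i + 2]
                i += 3
            i += 1                            # skip END_PLOT
            continue
        addr = (data[i] << 8) | data[i + 1]   # bytes 2 and 3: PPU address
        i += 2
        step = 32 if cmd & 0x80 else 1        # D bit: down (+32) or right (+1)
        count = (cmd & 0x3F) + 1              # assumption: table arithmetic
        if cmd & 0x40:                        # T bit set: run, one data byte
            tile = data[i]
            i += 1
            for _ in range(count):
                vram[addr] = tile
                addr += step
        else:                                 # literal: one data byte per tile
            for _ in range(count):
                vram[addr] = data[i]
                i += 1
                addr += step
    return vram


vram = bytearray(0x4000)
chain = bytes([
    0x02, 0x20, 0x00, 0x10, 0x11, 0x12,   # literal right: 3 tiles at $2000
    0xC3, 0x20, 0x40, 0x2A,               # run down: 4 copies of $2A from $2040
    PLOT_TILES, 0x23, 0xC0, 0x55, END_PLOT,   # plot one byte at $23C0
    END_STRINGS,
])
draw_strings(chain, vram)
```

A PPU address high byte never has bit 7 set ($00-$3F), so testing bit 7 in that position is enough to spot the end of a plot list without a separate length field.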
The bulk transfer code should also use unrolled loops. Unrolling
increases the program’s speed, at the cost of space, by
eliminating the instructions that control the loop, such as the
“end of loop” test on each iteration, and by reducing branch
penalties.
6502 rolled loop
    ldx #8              ;2
-   jsr SomeFunction    ;12
    dex                 ;2
    bne -               ;3 / 2
Bytes: 8
Cycles: 137

6502 unrolled loop
    jsr SomeFunction    ;12
    jsr SomeFunction    ;12
    jsr SomeFunction    ;12
    jsr SomeFunction    ;12
    jsr SomeFunction    ;12
    jsr SomeFunction    ;12
    jsr SomeFunction    ;12
    jsr SomeFunction    ;12
Bytes: 24
Cycles: 96
In this example, this simple technique cuts the cycle count by
~30% (from 137 to 96).
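The cycle totals can be checked with a little arithmetic. The Python sketch below assumes the 12 cycles annotated on each jsr cover the jsr plus the matching rts (6 cycles each), that the body of SomeFunction is not counted, and that the taken bne does not cross a page. The function names are made up for this example.

```python
JSR_RTS = 12         # cycles per call as annotated in the listings (assumed jsr 6 + rts 6)
LDX_IMM = 2          # ldx #8
DEX = 2              # dex
BNE_TAKEN = 3        # bne when the loop repeats (no page crossing assumed)
BNE_NOT_TAKEN = 2    # bne on the final iteration


def rolled_cycles(n):
    """Cycles for the rolled loop running n iterations."""
    return LDX_IMM + n * (JSR_RTS + DEX) + (n - 1) * BNE_TAKEN + BNE_NOT_TAKEN


def unrolled_cycles(n):
    """Cycles for n back-to-back jsr instructions."""
    return n * JSR_RTS


rolled = rolled_cycles(8)
unrolled = unrolled_cycles(8)
saving = (rolled - unrolled) / rolled
print(rolled, unrolled, round(saving * 100, 1))   # prints: 137 96 29.9
```

The fixed overhead per iteration (dex plus bne) is what unrolling removes, so the saving matters most when the work inside the loop is small, as it is for a tight PPU copy.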
My implementation:
PPULoading.asm