Getting clever with the Amiga blitter

July 3, 2023

Magicore Anomala is powered largely by the Amiga's blitter, allowing me to quickly clear the screen and draw hundreds of objects every frame at a full 60fps. It runs in parallel with the CPU and excels at copying or manipulating large blocks of data.

But the blitter does more than haul bits around. It can shift, mask, and logically combine up to three independent sources located anywhere in shared memory.

Today I'll show you how Magicore uses the copper and blitter to convert and copy a 24-bit RGB color palette into the Amiga's 12-bit color registers, every frame, using zero CPU cycles.

An example of using HSV to RGB conversion to produce a rainbow effect

Quick blitter intro

"Blit" stands for "block transfer"—in other words, transferring a block of memory from a source to a destination.

Jay Miner, the "father of the Amiga", insisted on calling the Amiga's blitter a "bimmer" (which evidently did not catch on). He wanted to distinguish it as a "bitmap image manipulator" because it could shift and combine up to three independent sources in many unique ways.

As you'll soon see, we can use the blitter (bimmer?) to do some clever bit shifting, masking, and ORing on a big chunk of memory—something the CPU would strain to do.

Our RGB data

I have a palette of 32 colors, each stored with 8-bit red, green, and blue values. To make the blit work, I store each color as GRB (green and red swapped) rather than RGB. In memory, each color is a longword that looks like 00GgRrBb, and we want to convert it to a 0RGB word (discarding the low 4 bits of each channel).
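
To make the target concrete, here is a minimal Python sketch of the same conversion done the slow way, one color at a time. The helper names `pack_grb8` and `grb8_to_rgb4` are made up for illustration and are not part of Magicore.

```python
def pack_grb8(r: int, g: int, b: int) -> int:
    """Pack three 8-bit channels into a 32-bit 00GgRrBb longword."""
    return (g << 16) | (r << 8) | b


def grb8_to_rgb4(color: int) -> int:
    """Convert a 00GgRrBb longword to a 12-bit 0RGB color word."""
    g = (color >> 20) & 0xF  # high nibble of the Gg byte
    r = (color >> 12) & 0xF  # high nibble of the Rr byte
    b = (color >> 4) & 0xF   # high nibble of the Bb byte
    return (r << 8) | (g << 4) | b


orange = pack_grb8(0xFF, 0x88, 0x10)
print(f"{orange:08x} -> {grb8_to_rgb4(orange):04x}")  # 0088ff10 -> 0f81
```

Three shifts, three masks, and two ORs per color is exactly the kind of work we'd like to hand to the blitter.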

I'm storing the palette as 8-bit colors because I'm doing some color effects like additive blending and converting HSV to RGB. These operations are much easier on byte-aligned values, especially for our 7MHz CPU.
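
As a rough illustration of why byte-aligned channels are convenient, here is a Python sketch of a saturating additive blend and a 32-entry rainbow built from HSV. This is not Magicore's code (the game does this in 68000 assembly); `add_blend` and `rainbow` are hypothetical names, and the HSV conversion leans on Python's standard `colorsys` module.

```python
import colorsys


def add_blend(c1: int, c2: int) -> int:
    """Add two 00GgRrBb colors channel by channel, clamping each at $ff."""
    out = 0
    for shift in (16, 8, 0):  # Gg, Rr, Bb bytes
        total = ((c1 >> shift) & 0xFF) + ((c2 >> shift) & 0xFF)
        out |= min(total, 0xFF) << shift
    return out


def rainbow(count: int = 32) -> list:
    """Build `count` fully saturated hues as 00GgRrBb longwords."""
    palette = []
    for i in range(count):
        r, g, b = colorsys.hsv_to_rgb(i / count, 1.0, 1.0)
        r8, g8, b8 = (round(x * 255) for x in (r, g, b))
        palette.append((g8 << 16) | (r8 << 8) | b8)
    return palette


print(f"{add_blend(0x00804020, 0x00C04020):08x}")  # 00ff8040
print(f"{rainbow()[0]:08x}")                       # 0000ff00 (pure red)
```

With one channel per byte, each add and clamp touches a whole byte and never has to worry about a carry spilling into the neighboring channel's nibble.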

The blit operation

Here is the blit operation to convert our 00GgRrBb to 0RGB:

    ; 00GgRrBb -> 0RGB
    ; A points to RrBb, B points to 00Gg
    ; C has no source, BLTCDAT loaded with $00f0 as a constant
    ; 1. Mask A (R0B0), shift A 4 bits (0R0B)
    ; 2. D = A + BC, i.e. 0R0B | (00Gg & $00f0)

    dc.w    $0001,$0000         ;wait for blitter
    dc.w    BLTCON0,$4df8       ;4: shift A, d: use ABD, f8: D = A + BC
    dc.w    BLTCON1,$0000
    dc.w    BLTCDAT,$00f0       ;C is always $00f0 to mask B
    dc.w    BLTAPTH,0           ;GRB8 color source +2 (RrBb)
    dc.w    BLTAPTL,0
    dc.w    BLTBPTH,0           ;GRB8 color source (00Gg)
    dc.w    BLTBPTL,0
    dc.w    BLTDPTH,0           ;Copperlist for color registers
    dc.w    BLTDPTL,0
    dc.w    BLTAFWM,$f0f0       ;mask out color low bits
    dc.w    BLTALWM,$f0f0
    dc.w    BLTAMOD,2
    dc.w    BLTBMOD,2
    dc.w    BLTDMOD,2
    dc.w    BLTSIZE,(32<<6)+1   ;blit 32 lines of 1 word each

The above is done using copper (coprocessor) instructions. Note that the copper can only write to the blitter registers when the CDANG bit in COPCON is set. Here is an equivalent using the CPU:

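_scr_blit_colors:
    ; Blit color registers
    ; Assumes a6 = custom chip base ($dff000) and that the blitter is idle
    ; (unlike the copper version, there is no blitter wait here)
    move.l      #$4df80000,BLTCON0(a6)      ;BLTCON0 = $4df8, BLTCON1 = $0000
    move.w      #$00f0,BLTCDAT(a6)          ;C is always $00f0 to mask B
    move.l      #CopColor+2,BLTDPTH(a6)     ;D = color value words in the copperlist
    lea         cfx_WorkingColors,a0
    move.l      a0,BLTBPTH(a6)              ;B = 00Gg
    addq        #2,a0
    move.l      a0,BLTAPTH(a6)              ;A = RrBb
    move.l      #$f0f0f0f0,BLTAFWM(a6)      ;BLTAFWM and BLTALWM = $f0f0
    moveq       #2,d0
    move.w      d0,BLTAMOD(a6)              ;skip the other word of each entry
    move.w      d0,BLTBMOD(a6)
    move.w      d0,BLTDMOD(a6)
    move.w      #(32<<6)+1,BLTSIZE(a6)      ;32 lines of 1 word each, starts the blit
    rts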
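
Both versions load the same magic numbers into BLTCON0 and BLTSIZE. As a sanity check, here is a small Python sketch that decodes $4df8 into its fields and derives the $f8 minterm from the logic equation D = A + BC. The function names are my own for illustration; the bit layout follows the blitter's documented BLTCON0 format (shift in the top nibble, the four channel enables next, the minterm in the low byte).

```python
def minterm(func) -> int:
    """Build the 8-bit minterm: bit (A*4 + B*2 + C) holds func(A, B, C)."""
    code = 0
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                if func(a, b, c):
                    code |= 1 << ((a << 2) | (b << 1) | c)
    return code


def decode_bltcon0(value: int) -> dict:
    """Split a BLTCON0 word into A shift, channel enables, and minterm."""
    return {
        "a_shift": (value >> 12) & 0xF,
        "use_a": bool(value & 0x0800),
        "use_b": bool(value & 0x0400),
        "use_c": bool(value & 0x0200),
        "use_d": bool(value & 0x0100),
        "minterm": value & 0xFF,
    }


def bltsize(height: int, width_words: int) -> int:
    """Pack a blit size: height in bits 15-6, width in words in bits 5-0."""
    return (height << 6) | width_words


print(hex(minterm(lambda a, b, c: a | (b & c))))  # 0xf8
print(decode_bltcon0(0x4DF8))
print(hex(bltsize(32, 1)))                        # 0x801
```

Decoding $4df8 this way gives a shift of 4, channels A, B, and D enabled, C disabled (its data register is loaded by hand instead), and minterm $f8.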

A more readable breakdown

Let's walk through what happens step by step:

A gets loaded with RrBb (from source memory)
B gets loaded with 00Gg (from source memory)
C gets loaded with 00f0 (as a constant)
A gets masked with f0f0. A is now R0B0
A gets shifted 4 bits. A is now 0R0B
The minterms kick in to combine A, B, and C:
1. B is ANDed with C, combining 00Gg with 00f0 to give us 00G0
2. A is ORed with BC, combining 0R0B with 00G0 to give us 0RGB
The result 0RGB gets written to destination D
A, B, and D each move 2 bytes forward. That is the end of the 1-word line, so each pointer then skips another 2 bytes (because of our modulo 2), which brings it to the next color entry

This all happens in 8 cycles. That's the power of the Amiga blitter!
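
The walkthrough above can be modelled in a few lines of Python. This is only a sketch of the data flow, not a cycle-accurate blitter emulation: memory is a flat list of 16-bit words, pointers are word indices, and the copperlist is assumed to be a plain run of MOVEs to COLOR00 through COLOR31 ($180 to $1be). The name `blit_palette` is made up for illustration.

```python
def blit_palette(src: list, dst: list, count: int = 32) -> None:
    """Model the palette blit: one 16-bit word per line, `count` lines."""
    a_ptr, b_ptr, d_ptr = 1, 0, 1  # A at +2 bytes (RrBb), B at +0 (00Gg), D at the value word
    for _ in range(count):
        a = src[a_ptr] & 0xF0F0    # first/last word mask: RrBb -> R0B0
        a >>= 4                    # A shift of 4: R0B0 -> 0R0B (the masked low nibble is 0, so nothing spills over)
        b = src[b_ptr]             # 00Gg
        c = 0x00F0                 # constant preloaded into BLTCDAT
        dst[d_ptr] = a | (b & c)   # minterm $f8: D = A + BC
        a_ptr += 2                 # 1 word of data + 1 word (2 bytes) of modulo
        b_ptr += 2
        d_ptr += 2


colors = [0x0088FF10, 0x00123456] + [0] * 30
src = []
for color in colors:
    src += [color >> 16, color & 0xFFFF]  # 00Gg word, then RrBb word

copper = []
for i in range(32):
    copper += [0x0180 + 2 * i, 0x0000]    # MOVE #0,COLORxx placeholders

blit_palette(src, copper)
print([f"{w:04x}" for w in copper[:4]])   # ['0180', '0f81', '0182', '0315']
```

Note how the modulo keeps D from ever touching the register words of the copper MOVEs: only the value words get overwritten.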

Copper vs. CPU

Above, I gave examples of performing this blit using either the copper or the CPU. Going by the standard 68000 instruction timings, the CPU routine adds up to roughly 200 cycles of register setup, which is still perfectly reasonable for a 68000 once per frame.

The copper version takes 0 CPU cycles, because it's the exact same set of instructions every frame. No logic needs to run to set up the copper instructions, except for the very first time. It's only a small optimization over the CPU version, but it feels cool!
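
That one-time setup mostly means patching the zero placeholders in the copperlist with real pointers, split into high and low words. Here is a hedged Python sketch of the idea; in the game this would be a few 68000 instructions at init time. The address is hypothetical and `patch_pointer` is a name I made up.

```python
BLTAPTH, BLTAPTL = 0x0050, 0x0052  # blitter source A pointer, high and low words


def patch_pointer(copper: list, index: int, address: int) -> None:
    """Write a 32-bit address into a PTH/PTL pair of copper MOVEs.

    `copper` is a flat list of 16-bit words and `index` is the position
    of the xPTH register word; each MOVE is (register, value).
    """
    copper[index + 1] = (address >> 16) & 0xFFFF  # value word of the xPTH move
    copper[index + 3] = address & 0xFFFF          # value word of the xPTL move


# Two MOVEs with zero placeholders, as in the copperlist above
copper = [BLTAPTH, 0x0000, BLTAPTL, 0x0000]
patch_pointer(copper, 0, 0x00012346)              # hypothetical address of the RrBb word
print([f"{w:04x}" for w in copper])               # ['0050', '0001', '0052', '2346']
```

Once the pointers are in place, the copper reloads them every frame on its own, which is why the blit needs no per-frame CPU work.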

A minor downside

Only one component at a time can access shared memory. If the blitter is running, the CPU has to wait its turn to access shared memory, whether that's reading or writing data or simply fetching its next instruction.

By default, the blitter gives every 4th DMA cycle to the CPU. Thankfully, this blit is quite small (only 32 words), so we're not tied up for long. For larger blits, you can try to structure your program so that your most expensive CPU instructions run during the blit. A single multiply or divide can take 70 or more CPU cycles while barely touching the bus, so that's a great time for the CPU to be doing it, rather than stalling on every instruction fetch.

If your Amiga has "Fast RAM", this is much less of an issue for code and data placed there, because Fast RAM is CPU-only and doesn't have to be shared among all the other chips.

Blitter fun

One of the most fun parts of working on Magicore is leveraging Amiga-specific features to make the game do cool stuff. The blitter is so powerful that I keep hoping to come across more unique use cases for it. Maybe it gets your imagination going, too.