I came across a cute little trick while looking to optimize my physics routine in Magicore Anomala.
In the routine, I need to check if each object is inbounds, and deactivate it if it goes out of bounds. Pretty typical situation, and we'd probably see it written as something like this:
int isInbounds(int position, int upperBound) {
    return position >= 0 && position < upperBound;
}
(You'd probably inline this, but we're making it a function for the readability of this example.)
The cute part is that if the position is negative, then the unsigned version of that value is some extremely large number. So, you can get away with doing this in one comparison instead of two:
int isInboundsFast(int position, int upperBound) {
    return (unsigned int)position < upperBound;
}
You just want to make sure that upperBound is not close to the maximum int size, but in most realistic cases, it won't be.
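As a quick sanity check, here's a minimal test harness (purely illustrative; the bound of 320 is an arbitrary value, not anything from the game) showing that the two versions agree, including for negative positions:

#include <assert.h>
#include <stdio.h>

int isInbounds(int position, int upperBound) {
    return position >= 0 && position < upperBound;
}

int isInboundsFast(int position, int upperBound) {
    return (unsigned int)position < upperBound;
}

int main(void) {
    /* A negative position casts to a huge unsigned value, so the single
       unsigned comparison rejects it just like the two signed checks do. */
    assert(isInbounds(-1, 320) == isInboundsFast(-1, 320));   /* both 0 */
    assert(isInbounds(0, 320) == isInboundsFast(0, 320));     /* both 1 */
    assert(isInbounds(319, 320) == isInboundsFast(319, 320)); /* both 1 */
    assert(isInbounds(320, 320) == isInboundsFast(320, 320)); /* both 0 */
    printf("all cases agree\n");
    return 0;
}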
If we throw these into Compiler Explorer, we get to see the output:
isInbounds:
        test    edi, edi
        setns   al
        cmp     edi, esi
        setl    cl
        and     cl, al
        movzx   eax, cl
        ret
isInboundsFast:
        xor     eax, eax
        cmp     edi, esi
        setb    al
        ret
Very cool! Clearly, isInboundsFast is more efficient. The compiler doesn't "optimize" the original any further because, again, our two checks aren't identical: we're taking a shortcut that will probably not have any realistic consequences, but it's not logically guaranteed.
Anyway, x86 Assembly is all but gibberish to me. In this household, we only stan CPUs with a sense of sophistication and class, like the Motorola 68000.
; d0 = position
; d1 = upper bound
is_inbounds:
        cmp.w   d1,d0
        bhs     .destroy
        rts
.destroy:
        ; Destroy the object
The instruction bhs stands for "branch if higher or same". Unlike bge (branch if greater or equal), which looks at the sign and overflow flags, bhs branches on the pure unsigned comparison: it takes the branch if our value, read as unsigned, is higher than or the same as the upper bound.
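In C terms (just an analogy for the cmp.w/bhs pair, not code from the game), the 16-bit version of the same one-comparison check looks like this:

#include <stdint.h>

/* Hypothetical 16-bit analogue of the cmp.w/bhs pair above. */
int is_inbounds_16(int16_t position, uint16_t upper_bound) {
    /* A position of -5, say, reads as 0xfffb when treated as unsigned,
       which is "higher or same" than any sane bound, so the check fails. */
    return (uint16_t)position < upper_bound;
}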
Cramming two 16-bit values into a single 32-bit register
I have to do bounds checking for both X and Y position, and I'm taking a few shortcuts there as well. My positions are 16-bit values, but the 68000 has 32-bit registers, meaning I can hold both of them in a single register. That ends up saving me a few cycles.
Consider this traditional method:
; a0 = pointer to object
; d2 = X upper bound
; d3 = Y upper bound
update_physics:
        ; Load positions into d0 and d1
        move.w  BOB_xpos(a0),d0         ; get X position in d0
        move.w  BOB_ypos(a0),d1         ; get Y position in d1
        ; Add velocity to position
        add.w   BOB_xvel(a0),d0
        add.w   BOB_yvel(a0),d1
        ; Check if new positions are inbounds
        cmp.w   d2,d0
        bhs     .destroy
        cmp.w   d3,d1
        bhs     .destroy
The above takes about 76 cycles. Here is my optimized version:
; a0 = pointer to object
; d2 = X upper bound, shifted left 16 bits to the upper word
; d3 = Y upper bound (not shifted)
update_physics:
        ; Load X/Y position into d0 (X upper, Y lower)
        move.l  BOB_xpos(a0),d0
        ; Add velocity to position
        add.l   BOB_xvel(a0),d0
        ; Check if new positions are inbounds
        cmp.l   d2,d0
        bhs     .destroy
        cmp.w   d3,d0
        bhs     .destroy
This version is only 56 cycles. We're saving ~20 cycles per loop, which adds up to thousands of cycles over the course of 200 objects.
A breakdown of how this works:
update_physics:
        ; Load X/Y position into d0 (X upper, Y lower)
        move.l  BOB_xpos(a0),d0
In memory, X position and Y position are adjacent 16-bit values. By using move.l, we're loading 32 bits of data into d0. So this makes d0 look like XXXX YYYY.
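If it helps to see that outside of assembly, here's roughly what the packed register holds, sketched in C (the shift-based packing is my illustration; on the 68000 the move.l just reads the two adjacent words directly, with X landing in the high word):

#include <stdint.h>

/* Illustrative packing of two 16-bit coordinates into one 32-bit value,
   mirroring what move.l leaves in d0: X in the high word, Y in the low word. */
static uint32_t pack_xy(uint16_t x, uint16_t y) {
    return ((uint32_t)x << 16) | y;
}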
        ; Add velocity to position
        add.l   BOB_xvel(a0),d0
Similarly, velocity is stored as two adjacent 16-bit values. We're adding velocity XXXX YYYY to our position XXXX YYYY, letting us update both positions in a single instruction.
The downside: if the Y velocity is negative, the addition will carry into the lowest subpixel of X. For instance, if our X/Y position is 0200 0500 and our velocity is 0000 ffff (a very slight negative Y velocity), then our resulting position will be 0201 04ff.
As you can see, the Y calculation spilled over into X. So, I only use this for attack patterns where the spill into the X subpixel is inconsequential.
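You can reproduce that spill with plain 32-bit arithmetic; this snippet just replays the numbers from the example above, nothing game-specific:

#include <assert.h>
#include <stdint.h>

int main(void) {
    uint32_t position = 0x02000500; /* X = 0200, Y = 0500 */
    uint32_t velocity = 0x0000ffff; /* X velocity = 0, Y velocity = -1 */
    position += velocity;
    /* Y correctly drops to 04ff, but the carry bumps X from 0200 to 0201. */
    assert(position == 0x020104ff);
    return 0;
}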
        ; Check if new positions are inbounds
        cmp.l   d2,d0
        bhs     .destroy
        cmp.w   d3,d0
        bhs     .destroy
Here, d2 is our X upper bound and d3 is our Y upper bound. The trick is that since d0 looks like XXXX YYYY, we have d2 shifted to the high word of the register, like XXXX 0000.
That means when we use cmp.l to compare d0 to d2, it's comparing all 32 bits. Importantly, that includes X. The lower 16 bits (Y) end up not affecting this comparison at all: since d2's low word is all zeroes, an X below the bound keeps d0 below d2 even with Y at ffff, and an X at or above the bound puts d0 at or above d2 no matter what Y is.
Finally, our Y position is in the lower 16 bits, so we can just use a cmp.w to compare the lower word like normal.
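Putting it all together, here's a rough C sketch of the idea (my own illustration, not a translation of the actual routine; the struct layout, names, and the write-back at the end are assumptions): pack X and Y with X in the high word, add the packed velocity, then do one 32-bit unsigned compare for X and one 16-bit unsigned compare for Y.

#include <stdint.h>

/* Hypothetical object layout: adjacent 16-bit fields, as in the assembly. */
struct bob {
    uint16_t xpos, ypos;
    uint16_t xvel, yvel;
    int alive;
};

static uint32_t pack(uint16_t hi, uint16_t lo) {
    return ((uint32_t)hi << 16) | lo;
}

/* x_bound_shifted is the X upper bound already shifted into the high word
   (like d2); y_bound is the plain Y upper bound (like d3). */
static void update_physics(struct bob *b, uint32_t x_bound_shifted, uint16_t y_bound) {
    uint32_t pos = pack(b->xpos, b->ypos); /* like move.l BOB_xpos(a0),d0 */
    pos += pack(b->xvel, b->yvel);         /* like add.l  BOB_xvel(a0),d0 */

    /* cmp.l d2,d0 / bhs: one unsigned compare checks X, and the low (Y)
       word can't change the outcome because the bound's low word is zero. */
    if (pos >= x_bound_shifted) { b->alive = 0; return; }

    /* cmp.w d3,d0 / bhs: compare just the low word for Y. */
    if ((uint16_t)pos >= y_bound) { b->alive = 0; return; }

    /* Write the updated position back (this part isn't shown in the
       assembly excerpt above). */
    b->xpos = (uint16_t)(pos >> 16);
    b->ypos = (uint16_t)pos;
}

As in the assembly, a negative Y velocity still carries into X's lowest bit here; the C version inherits exactly the same caveat.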
That's it for now—thanks for letting me share some of these meticulous little optimizations that make Assembly coding on old hardware so much fun.