# Assembly tricks for fast bounds checking

I came across a cute little trick while looking to optimize my physics routine in Magicore Anomala.

In the routine, I need to check if each object is inbounds, and deactivate it if it goes out of bounds. Pretty typical situation, and we'd probably see it written as something like this:

``````int isInbounds(int position, int upperBound) {
return position >= 0 && position < upperBound;
}``````

(You'd probably inline this, but we're making it a function for the readability of this example.)

The cute part is that if the position is negative, then the unsigned version of that value is some extremely large number. So, you can get away with doing this in one comparison instead of two:

``````int isInboundsFast(int position, int upperBound) {
return (unsigned int)position < upperBound;
}``````

You just want to make sure that `upperBound` is not close to the maximum int size, but in most realistic cases, it won't be.

If we throw these into Compiler Explorer, we get to see the output:

``````isInbounds:
test    edi, edi
setns   al
cmp     edi, esi
setl    cl
and     cl, al
movzx   eax, cl
ret
isInboundsFast:
xor     eax, eax
cmp     edi, esi
setb    al
ret``````

Very cool! Clearly, `isInboundsFast` is more efficient. The compiler doesn't "optimize" the original any further, because again, our two checks aren't identical—we're taking a shortcut that will probably not have any realistic consequences, but it's not logically guaranteed.

Anyway, x86 Assembly is all but gibberish to me. In this household, we only stan CPUs with a sense of sophistication and class, like the Motorola 68000.

``````; d0 = position
; d1 = upper bound
is_inbounds:
cmp.w   d1,d0
bhs     .destroy
rts

.destroy:
; Destroy the object``````

The instruction `bhs` stands for "branch if higher or same". Unlike `bge` (branch if greater or equal), `bhs` ignores signed/negative flags and branches if our pure, unsigned value is higher than the upper bound.

## Cramming two 16-bit values into a single 32-bit register

I have to do bounds checking for both X and Y position, and I'm taking a few shortcuts there as well. My positions are 16-bit values, but the 68000 has 32-bit registers, meaning I can hold both of them in a single register. That ends up saving me a few cycles.

``````; a0 = pointer to object
; d2 = X upper bound
; d3 = Y upper bound
update_physics:
; Load positions into d0 and d1
move.w  BOB_xpos(a0),d0 ; get X position in d0
move.w  BOB_ypos(a0),d1 ; get Y position in d1
; Check if new positions are inbounds
cmp.w   d2,d0
bhs     .destroy
cmp.w   d3,d1
bhs     .destroy``````

The above takes about 76 cycles. Here is my optimized version:

``````; a0 = pointer to object
; d2 = X upper bound, shifted left 16 bits to the upper word
; d3 = Y upper bound (not shifted)
update_physics:
; Load X/Y position into d0 (X upper, Y lower)
move.l  BOB_xpos(a0),d0
; Check if new positions are inbounds
cmp.l   d2,d0
bhs     .destroy
cmp.w   d3,d0
bhs     .destroy``````

This version is only 56 cycles. We're saving ~20 cycles per loop, which adds up to thousands of cycles over the course of 200 objects.

A breakdown of how this works:

``````update_physics:
; Load X/Y position into d0 (X upper, Y lower)
move.l  BOB_xpos(a0),d0``````

In memory, X position and Y position are adjacent 16-bit values. By using `move.l`, we're loading 32 bits of data into `d0`. So this, makes `d0` look like `XXXX YYYY`.

``````    ; Add velocity to position

Similarly, velocity is stored as two adjacent 16-bit values. We're adding velocity `XXXX YYYY` to our position `XXXX YYYY`, letting us update both positions in a single instruction.

The downside: If Y is negative, then the addition will cause a carry into the lowest subpixel of X. For instance, if our X/Y position is `0200 0500`, and our velocity is `0000 ffff` (very slight negative Y velocity), then our resulting position will be `0201 04ff`.

As you can see, the Y calculation spilled over into X. So, I only use this for attack patterns where the spill into the X subpixel is inconsequential.

``````    ; Check if new positions are inbounds
cmp.l   d2,d0
bhs     .destroy
cmp.w   d3,d0
bhs     .destroy``````

Here, `d2` is our X upper bound and `d3` is our Y upper bound. The trick is that since `d0` looks like `XXXX YYYY`, we have `d2` shifted to the high word of the register, like `XXXX 0000`.

That means when we use `cmp.l` to compare `d0` to `d2`, it's comparing all 32 bits. Importantly, that includes X. It turns out the lower 16 bits (Y) end up not affecting this comparison at all.

Finally, our Y position is in the lower 16 bits, so we can just use a `cmp.w` to compare the lower word like normal.

That's it for now—thanks for letting me share some of these meticulous little optimizations that make Assembly coding on old hardware so much fun.