I came across a cute little trick while looking to optimize my physics routine in Magicore Anomala.
In the routine, I need to check if each object is inbounds, and deactivate it if it goes out of bounds. Pretty typical situation, and we'd probably see it written as something like this:
int isInbounds(int position, int upperBound) {
    return position >= 0 && position < upperBound;
}
(You'd probably inline this, but we're making it a function for the readability of this example.)
The cute part is that if the position is negative, then the unsigned version of that value is some extremely large number. So, you can get away with doing this in one comparison instead of two:
int isInboundsFast(int position, int upperBound) {
    return (unsigned int)position < upperBound;
}
You just want to make sure that upperBound is not close to the maximum int size, but in most realistic cases, it won't be.
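As a quick sanity check, here's a minimal test harness (purely illustrative; the bound of 320 is an arbitrary value, not anything from the game) showing that the two versions agree, including for negative positions:

#include <assert.h>
#include <stdio.h>

int isInbounds(int position, int upperBound) {
    return position >= 0 && position < upperBound;
}

int isInboundsFast(int position, int upperBound) {
    return (unsigned int)position < upperBound;
}

int main(void) {
    /* A negative position casts to a huge unsigned value, so the single
       unsigned comparison rejects it just like the two signed checks do. */
    assert(isInbounds(-1, 320) == isInboundsFast(-1, 320));   /* both 0 */
    assert(isInbounds(0, 320) == isInboundsFast(0, 320));     /* both 1 */
    assert(isInbounds(319, 320) == isInboundsFast(319, 320)); /* both 1 */
    assert(isInbounds(320, 320) == isInboundsFast(320, 320)); /* both 0 */
    printf("all cases agree\n");
    return 0;
}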
If we throw these into Compiler Explorer, we get to see the output:
isInbounds:
        test    edi, edi
        setns   al
        cmp     edi, esi
        setl    cl
        and     cl, al
        movzx   eax, cl
        ret
isInboundsFast:
        xor     eax, eax
        cmp     edi, esi
        setb    al
        ret
Very cool! Clearly, isInboundsFast is more efficient. The compiler doesn't "optimize" the original any further because, again, our two checks aren't identical: we're taking a shortcut that will probably not have any realistic consequences, but it's not logically guaranteed.
Anyway, x86 Assembly is all but gibberish to me. In this household, we only stan CPUs with a sense of sophistication and class, like the Motorola 68000.
; d0 = position
; d1 = upper bound
is_inbounds:
        cmp.w   d1,d0
        bhs     .destroy
        rts
.destroy:
        ; Destroy the object
The instruction bhs stands for "branch if higher or same". Unlike bge (branch if greater or equal), which looks at the sign and overflow flags, bhs branches on the pure unsigned comparison: it takes the branch if our value, read as unsigned, is higher than or the same as the upper bound.
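In C terms (just an analogy for the cmp.w/bhs pair, not code from the game), the 16-bit version of the same one-comparison check looks like this:

#include <stdint.h>

/* Hypothetical 16-bit analogue of the cmp.w/bhs pair above. */
int is_inbounds_16(int16_t position, uint16_t upper_bound) {
    /* A position of -5, say, reads as 0xfffb when treated as unsigned,
       which is "higher or same" than any sane bound, so the check fails. */
    return (uint16_t)position < upper_bound;
}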
Cramming two 16-bit values into a single 32-bit register
I have to do bounds checking for both X and Y position, and I'm taking a few shortcuts there as well. My positions are 16-bit values, but the 68000 has 32-bit registers, meaning I can hold both of them in a single register. That ends up saving me a few cycles.
Consider this traditional method:
; a0 = pointer to object
; d2 = X upper bound
; d3 = Y upper bound
update_physics:
        ; Load positions into d0 and d1
        move.w  BOB_xpos(a0),d0         ; get X position in d0
        move.w  BOB_ypos(a0),d1         ; get Y position in d1
        ; Add velocity to position
        add.w   BOB_xvel(a0),d0
        add.w   BOB_yvel(a0),d1
        ; Check if new positions are inbounds
        cmp.w   d2,d0
        bhs     .destroy
        cmp.w   d3,d1
        bhs     .destroy
The above takes about 76 cycles. Here is my optimized version:
; a0 = pointer to object
; d2 = X upper bound, shifted left 16 bits to the upper word
; d3 = Y upper bound (not shifted)
update_physics:
        ; Load X/Y position into d0 (X upper, Y lower)
        move.l  BOB_xpos(a0),d0
        ; Add velocity to position
        add.l   BOB_xvel(a0),d0
        ; Check if new positions are inbounds
        cmp.l   d2,d0
        bhs     .destroy
        cmp.w   d3,d0
        bhs     .destroy
This version is only 56 cycles. We're saving ~20 cycles per loop, which adds up to thousands of cycles over the course of 200 objects.
A breakdown of how this works:
update_physics:
        ; Load X/Y position into d0 (X upper, Y lower)
        move.l  BOB_xpos(a0),d0
In memory, X position and Y position are adjacent 16-bit values. By using move.l, we're loading 32 bits of data into d0. So this makes d0 look like XXXX YYYY.
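If it helps to see that outside of assembly, here's roughly what the packed register holds, sketched in C (the shift-based packing is my illustration; on the 68000 the move.l just reads the two adjacent words directly, with X landing in the high word):

#include <stdint.h>

/* Illustrative packing of two 16-bit coordinates into one 32-bit value,
   mirroring what move.l leaves in d0: X in the high word, Y in the low word. */
static uint32_t pack_xy(uint16_t x, uint16_t y) {
    return ((uint32_t)x << 16) | y;
}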
        ; Add velocity to position
        add.l   BOB_xvel(a0),d0
Similarly, velocity is stored as two adjacent 16-bit values. We're adding velocity XXXX YYYY to our position XXXX YYYY, letting us update both positions in a single instruction.
The downside: if the Y velocity is negative, the addition will carry into the lowest subpixel of X. For instance, if our X/Y position is 0200 0500 and our velocity is 0000 ffff (a very slight negative Y velocity), then our resulting position will be 0201 04ff.
As you can see, the Y calculation spilled over into X. So, I only use this for attack patterns where the spill into the X subpixel is inconsequential.
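You can reproduce that spill with plain 32-bit arithmetic; this snippet just replays the numbers from the example above, nothing game-specific:

#include <assert.h>
#include <stdint.h>

int main(void) {
    uint32_t position = 0x02000500; /* X = 0200, Y = 0500 */
    uint32_t velocity = 0x0000ffff; /* X velocity = 0, Y velocity = -1 */
    position += velocity;
    /* Y correctly drops to 04ff, but the carry bumps X from 0200 to 0201. */
    assert(position == 0x020104ff);
    return 0;
}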
        ; Check if new positions are inbounds
        cmp.l   d2,d0
        bhs     .destroy
        cmp.w   d3,d0
        bhs     .destroy
Here, d2 is our X upper bound and d3 is our Y upper bound. The trick is that since d0 looks like XXXX YYYY, we have d2 shifted to the high word of the register, like XXXX 0000.
That means when we use cmp.l to compare d0 to d2, it's comparing all 32 bits. Importantly, that includes X. The lower 16 bits (Y) end up not affecting this comparison at all: since d2's low word is all zeroes, an X below the bound keeps d0 below d2 even with Y at ffff, and an X at or above the bound puts d0 at or above d2 no matter what Y is.
Finally, our Y position is in the lower 16 bits, so we can just use a cmp.w to compare the lower word like normal.
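Putting it all together, here's a rough C sketch of the idea (my own illustration, not a translation of the actual routine; the struct layout, names, and the write-back at the end are assumptions): pack X and Y with X in the high word, add the packed velocity, then do one 32-bit unsigned compare for X and one 16-bit unsigned compare for Y.

#include <stdint.h>

/* Hypothetical object layout: adjacent 16-bit fields, as in the assembly. */
struct bob {
    uint16_t xpos, ypos;
    uint16_t xvel, yvel;
    int alive;
};

static uint32_t pack(uint16_t hi, uint16_t lo) {
    return ((uint32_t)hi << 16) | lo;
}

/* x_bound_shifted is the X upper bound already shifted into the high word
   (like d2); y_bound is the plain Y upper bound (like d3). */
static void update_physics(struct bob *b, uint32_t x_bound_shifted, uint16_t y_bound) {
    uint32_t pos = pack(b->xpos, b->ypos); /* like move.l BOB_xpos(a0),d0 */
    pos += pack(b->xvel, b->yvel);         /* like add.l  BOB_xvel(a0),d0 */

    /* cmp.l d2,d0 / bhs: one unsigned compare checks X, and the low (Y)
       word can't change the outcome because the bound's low word is zero. */
    if (pos >= x_bound_shifted) { b->alive = 0; return; }

    /* cmp.w d3,d0 / bhs: compare just the low word for Y. */
    if ((uint16_t)pos >= y_bound) { b->alive = 0; return; }

    /* Write the updated position back (this part isn't shown in the
       assembly excerpt above). */
    b->xpos = (uint16_t)(pos >> 16);
    b->ypos = (uint16_t)pos;
}

As in the assembly, a negative Y velocity still carries into X's lowest bit here; the C version inherits exactly the same caveat.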
That's it for now—thanks for letting me share some of these meticulous little optimizations that make Assembly coding on old hardware so much fun.