How many ways can you write a loop in x86?

If you come from C-family languages, it is likely you know this structure:

for (int i = 0; i < n; i++) {
    do_something();
}

which would be literally translated by the C compiler into x86 as the following:

.loop_start:
    mov DWORD PTR [rbp-4], 0        ; i = 0
.loop_check:
    mov eax, DWORD PTR [rbp-4]
    cmp eax, DWORD PTR [rbp-20]     ; compare i and n
    jge .loop_end                   ; notice the condition is inverted here
.loop_body:
    call do_something
    add DWORD PTR [rbp-4], 1
    jmp .loop_check
.loop_end:

I said literally because the compiler tried to keep the order consistent with the C code as much as possible. To convert for-loop into a jump, the compiler inverts the loop condition and take it as the jump condition (Even x86 sets alias for this, where jnl - jump if not less is the same as jge - jump if greater or equal). And to loop back, the compiler creates an unconditional jump directing to the jump condition.

But in our mental model, the loop above will do_something() while the condition is still true. We can convert the x86 code into that way:

.loop_start:
    mov DWORD PTR [rbp-4], 0        ; i = 0
    jmp .loop_check
.loop_body:
    call do_something
    add DWORD PTR [rbp-4], 1
.loop_check:
    mov eax, DWORD PTR [rbp-4]
    cmp eax, DWORD PTR [rbp-20]     ; compare i and n
    jl .loop_body                   ; are we happy now?
.loop_end:

But let’s say we have full control of n as a counter, we don’t need extra variable i, and we are fine with iterating in reverse order. We can rewrite the loop condition into (n-- > 0) like the following:

.loop_start:
    jmp .loop_check
.loop_body:
    sub DWORD PTR [rbp-20], 1       ; notice that `sub` happens after
    call do_something
.loop_check:
    cmp DWORD PTR [rbp-20], 0
    jg .loop_body
    sub DWORD PTR [rbp-20], 1       ; ... so we have another one here
.loop_end:

We just successfully converted for (i = 0; i < n; i++) into while (n-- > 0)! If we certain that n is a signed integer, we can rewrite the condition as (--n >= 0):

.loop_start:
    jmp .loop_check
.loop_body:
    call do_something
.loop_check:
    dec DWORD PTR [rbp-20]          ; `cmp` is `sub` without results
    jge .loop_body
.loop_end:

So far, we have been using two jump instructions. This is because the loop condition also act as the condition to start the loop. If instead we calls do_something first then checks condition later (which is the same as choosing + rather than * in regex), then we can reduce the code into a single jump:

.loop_start:                        ; We removed the jump here
.loop_body:
    call do_something
.loop_check:
    dec DWORD PTR [rbp-20]
    jge .loop_body
.loop_end:

Surprisingly, more than just jmp and jcc, x86 also has other instructions for loops. The first is jcxz/jecxz/jrcxz, in case you would not like to mess with status flag (see felixcloutier.com - Jcc). Like the name suggest, it will only check if cx/ecx/rcx is zero or not. If we use it as the counter, then we got:

.loop_start:
    mov ecx, DWORD PTR [rbp-20]     ; We need to move `n` into counter
.loop_body:
    call do_something
.loop_check:
    dec ecx
    jecxz .loop_body
.loop_end:

The second one is loop (alone with its variants loope, loopne, see felixcloutier.com - LOOP/LOOPcc). It uses register cx/ecx/rcx as a counter. Technically, it is equivalent to dec+jecxz, which you can observe below:

.loop_start:
    mov ecx, DWORD PTR [rbp-20]
.loop_body:
    call do_something
.loop_check:
    loop .loop_body
.loop_end:

For both kind of instructions above, due to how the instruction is encoded, we can only jump within offsets -128 to +127 (8-bit offset, or short). If we want jump within 32-bit offset, the other jump variants are preferred.

What if we want our target address to be computed at run time? We can use jmp which supports absolute jump. See the example below (exclusive on x86-64):

.loop_start:
.loop_body:
    call do_something
.loop_check:
    dec DWORD PTR [rbp-20]
    jl .loop_end                    ; We must invert condition here
    lea rax, [rip + .loop_body]     ; RIP-relative addressing
    jmp rax                         ; then do absolute jump here
.loop_end:

But wait, there is also an older trick that you can use to get the instruction pointer:

.loop_start:
.loop_body:
    call do_something
.loop_check:
    dec DWORD PTR [rbp-20]
    jl .loop_end
    call .tmp
.tmp:
    pop rax                         ; Here, RIP is on top of the stack
    add rax, .loop_body - .tmp
    jmp rax
.loop_end:

Or we can be lazier if you don’t like to use intermediate register:

.loop_start:
.loop_body:
    call do_something
.loop_check:
    dec DWORD PTR [rbp-20]
    jl .loop_end
    call .tmp
.tmp:
    add [rsp], .loop_body - .tmp    ; Kudos to x86 instruction encoding
    jmp [rsp]
.loop_end:

Since it is on top of the stack, we can replace jmp with ret.

.loop_start:
.loop_body:
    call do_something
.loop_check:
    dec DWORD PTR [rbp-20]
    jl .loop_end
    call .tmp
.tmp:
    add [rsp], .loop_body - .tmp    ; We are modifying ret address here
    ret
.loop_end:

In fact, we can retroactively apply on lea by using an extra push:

.loop_start:
.loop_body:
    call do_something
.loop_check:
    dec DWORD PTR [rbp-20]
    jl .loop_end
    lea rax, [rip + .loop_body]
    push rax
    ret
.loop_end:

And we didn’t even talk about far jump yet. This is similar to counting number of ways you can craft a sentence: go to random page of a book, and you will see many sentences for the first time. The key is not in the code itself but rather in the grammar! Congratulation for getting this far, and I hope you learn something new about x86.

A big disclaimer though: using call and ret to replace jmp like above can offend the stack and cause security feature like shadow stack stops working (and kills your program too). You can read more at The Old New Thing - Evaluating tail call elimination in the face of return address protection, part 2 by Raymond Chen.