Generating code with LLMs
It's a bit sloppy
Fig 1. I asked for a giraffe dividing things by 6 in a kind of sloppy fashion and that is indeed what I got.
It’s been a while since I compared code I write with code an LLM writes, so I thought I’d try it again and see what the state of the art is. This is with Gemini 2.5pro - if someone wants to try with a different LLM, I’d really like to see the results.
For this test, I decided to implement unsigned division by 6 on 32 and 64-bit RISC-V processors as part of my rvint integer mathematical library for RISC-V. The algorithm I’m using is from the excellent book Hacker’s Delight, and the particular instruction set I’m targeting is processors without hardware multiply/divide instructions, or those with slow multiply/divide. I also want my code to specifically work on the RV32E processors with the reduced register file, so I can run the code my my ch32v003-based olimex retrocomputer.
First, I wrote the routine by hand, so I have something to compare the AI code generation with. I believe this code is optimal for the target architectures, but it may have problems with large input values because of the imprecision in the approximation for 1/6. Note: I also found a comment bug later - the line
srli a2, a0, 3 # a2 = (n >> 2)
should have (n » 3) in the comment.
################################################################################
# routine: div6u
#
# Unsigned fast division by 6 without using M extension.
# This routine is 64-bit on 64-bit CPUs and 32-bit on 32-bit CPUs.
# It uses a fast multiply/shift/add/correct algorithm.
# Suitable for use on RV32E architectures.
#
# input registers:
# a0 = unsigned dividend (32 or 64 bits)
#
# output registers:
# a0 = quotient (unsigned)
################################################################################
div6u:
srli a1, a0, 1 # a1 = (n >> 1)
srli a2, a0, 3 # a2 = (n >> 2)
add a1, a1, a2 # q = (n >> 1) + (n >> 2)
srli a2, a1, 4 # a2 = (q >> 4)
add a1, a1, a2 # q = q + (q >> 4)
srli a2, a1, 8 # a2 = (q >> 8)
add a1, a1, a2 # q = q + (q >> 8)
srli a2, a1, 16 # a2 = (q >> 16)
add a1, a1, a2 # q = q + (q >> 16)
.if CPU_BITS == 64
srli a2, a1, 32 # a2 = (q >> 32)
add a1, a1, a2 # q = q + (q >> 32)
.endif
srli a1, a1, 2 # q = q >> 2
# compute r = n - q*6
slli a2, a1, 2 # a2 = q * 4
slli a3, a1, 1 # a3 = q * 2
add a2, a2, a3 # a2 = q*6
sub a2, a0, a2 # a2 = r = n - q*6
# add correction
sltiu a2, a2, 6 # a2 = 1 if a2 < 6 (unsigned), else a2 = 0
xori a2, a2, 1 # flip the result
add a0, a1, a2 # q = q + (r > 5)
ret
.size div6u, .-div6u
Here’s the result of using Gemini 2.5pro. First, my prompt - note that I specifically told the AI to use the same algorithm as I used to implement my code.
Write an unsigned integer divide by 6 routine using the algorithm from Hacker’s Delight, 2nd edition, “integer division by constants” for 32 and 64 bit RISC-V processors without multiply and divide instructions (RV32I and RV64I instruction sets). In the 32-bit case, the code should also run on the RV32E instruction set. The input number to be divided by 6 will be provided in the a0 register, and the result should be placed in the a0 register. The CPU_BITS equate will be set to 32 or 64 bits depending on the processor architecture, so you can write conditional code wrapped in “.if CPU_BITS == 64” to do conditional assembly for different processors.This is the result:
# RISC-V Unsigned Integer Division by 6
#
# Author: Gemini
# Date: 2024-05-21
#
# Description:
# This routine implements unsigned integer division by the constant 6 for 32-bit
# and 64-bit RISC-V processors using only the base integer instruction set
# (RV32I/RV32E and RV64I). It does not use any multiply or divide instructions.
#
# Algorithm:
# The routine is based on the methods described in the book “Hacker’s Delight,”
# 2nd Edition, by Henry S. Warren, Jr., in Chapter 10, “Integer Division by Constants.”
#
# The division n/6 is calculated as floor(floor(n/2)/3).
#
# 1. The input n is first shifted right by 1 to compute y = floor(n/2).
#
# 2. The division y/3 is then performed using a shift-and-add multiplication
# by a “magic number” that approximates 4/3. The sequence is:
# t = y + (y >> 2)
# t = t + (t >> 4)
# t = t + (t >> 8)
# ...
# q_approx = t >> 2
# This produces an approximate quotient that is correct for some inputs but
# can be off by one for others (specifically, when y is a multiple of 3).
#
# 3. A correction step is applied. The remainder is calculated:
# rem = y - 3 * q_approx
# If the remainder is 3 or greater, the approximate quotient is incremented.
#
# Configuration:
# Set the CPU_BITS equate to 32 or 64 to assemble for the desired architecture.
# - CPU_BITS = 32: Assembles for RV32I and RV32E.
# - CPU_BITS = 64: Assembles for RV64I.
.equ CPU_BITS, 32 # Set to 32 or 64
.globl divu6
.text
# =============================================================================
# divu6: Unsigned integer division by 6
#
# ABI:
# Input: a0 - Unsigned integer dividend (n)
# Output: a0 - Unsigned integer quotient (floor(n/6))
# Clobbers: a1, t0, t1 (scratch registers)
# =============================================================================
divu6:
.if CPU_BITS == 32
# -------------------------------------------------------------------------
# 32-bit Implementation (RV32I / RV32E)
# -------------------------------------------------------------------------
# Step 1: Calculate y = n / 2
srli a1, a0, 1 # a1 = y = n >> 1
mv t1, a1 # Save y in t1 for later (remainder calculation)
# Step 2: Calculate approximate quotient for y / 3
# Based on multiplying by approx 4/3 and shifting right by 2.
# q_approx is calculated in a0. a1 is used as a temporary.
srli a0, t1, 2 # a0 = y >> 2
add a0, a0, t1 # a0 = t = y + (y >> 2)
srli a1, a0, 4 # a1 = t >> 4
add a0, a0, a1 # a0 = t = t + (t >> 4)
srli a1, a0, 8 # a1 = t >> 8
add a0, a0, a1 # a0 = t = t + (t >> 8)
srli a1, a0, 16 # a1 = t >> 16
add a0, a0, a1 # a0 = t = t + (t >> 16)
srli a0, a0, 2 # a0 = q_approx = t >> 2
# Step 3: Correction step
# q_final = q_approx + (y - 3*q_approx >= 3)
slli a1, a0, 1 # a1 = q_approx * 2
add a1, a1, a0 # a1 = q_approx * 3
sub a1, t1, a1 # a1 = rem = y - (q_approx * 3)
li t0, 2 # Load constant 2 for comparison
sltu t0, t0, a1 # t0 = 1 if 2 < rem (i.e., rem >= 3), else 0
add a0, a0, t0 # a0 = q_final = q_approx + correction
ret
.else
# -------------------------------------------------------------------------
# 64-bit Implementation (RV64I)
# -------------------------------------------------------------------------
# Step 1: Calculate y = n / 2
srli a1, a0, 1 # a1 = y = n >> 1
mv t1, a1 # Save y in t1 for later (remainder calculation)
# Step 2: Calculate approximate quotient for y / 3
# Same logic as 32-bit, but with an extra step for the upper 32 bits.
# q_approx is calculated in a0. a1 is used as a temporary.
srli a0, t1, 2 # a0 = y >> 2
add a0, a0, t1 # a0 = t = y + (y >> 2)
srli a1, a0, 4 # a1 = t >> 4
add a0, a0, a1 # a0 = t = t + (t >> 4)
srli a1, a0, 8 # a1 = t >> 8
add a0, a0, a1 # a0 = t = t + (t >> 8)
srli a1, a0, 16 # a1 = t >> 16
add a0, a0, a1 # a0 = t = t + (t >> 16)
srli a1, a0, 32 # a1 = t >> 32
add a0, a0, a1 # a0 = t = t + (t >> 32)
srli a0, a0, 2 # a0 = q_approx = t >> 2
# Step 3: Correction step
# q_final = q_approx + (y - 3*q_approx >= 3)
slli a1, a0, 1 # a1 = q_approx * 2
add a1, a1, a0 # a1 = q_approx * 3
sub a1, t1, a1 # a1 = rem = y - (q_approx * 3)
li t0, 2 # Load constant 2 for comparison
sltu t0, t0, a1 # t0 = 1 if 2 < rem (i.e., rem >= 3), else 0
add a0, a0, t0 # a0 = q_final = q_approx + correction
ret
.endif
The code uses the same number of instructions as my code, but for some reason it provides two complete implementations rather than just doing the conditional assembly around the two instructions that differ for 64-bit support. So the AI code is twice as long and contains a major amount of copypasta’d code. The AI-generated code makes does a divide by 2 followed by a divide by 3, whereas my code directly divides by 6. This has an interesting side effect; the AI generated code makes better use of RISC-V compressed instructions as can be seen in objdump output of the assembled code (div6u is my code, divu6 is AI-generated):
00000108 <div6u>:
108: 00155593 srli a1, a0, 0x1
10c: 00355613 srli a2, a0, 0x3
110: 95b2 c.add a1, a2
112: 0045d613 srli a2, a1, 0x4
116: 95b2 c.add a1, a2
118: 0085d613 srli a2, a1, 0x8
11c: 95b2 c.add a1, a2
11e: 0105d613 srli a2, a1, 0x10
122: 95b2 c.add a1, a2
124: 8189 c.srli a1, 0x2
126: 00259613 slli a2, a1, 0x2
12a: 00159693 slli a3, a1, 0x1
12e: 9636 c.add a2, a3
130: 40c50633 sub a2, a0, a2
134: 00663613 sltiu a2, a2, 0x6
138: 00164613 xori a2, a2, 0x1
13c: 00c58533 add a0, a1, a2
140: 8082 c.jr ra
00000142 <divu6>:
142: 00155593 srli a1, a0, 0x1
146: 832e c.mv t1, a1
148: 00235513 srli a0, t1, 0x2
14c: 951a c.add a0, t1
14e: 00455593 srli a1, a0, 0x4
152: 952e c.add a0, a1
154: 00855593 srli a1, a0, 0x8
158: 952e c.add a0, a1
15a: 01055593 srli a1, a0, 0x10
15e: 952e c.add a0, a1
160: 8109 c.srli a0, 0x2
162: 00151593 slli a1, a0, 0x1
166: 95aa c.add a1, a0
168: 40b305b3 sub a1, t1, a1
16c: 4289 c.li t0, 0x2
16e: 00b2b2b3 sltu t0, t0, a1
172: 9516 c.add a0, t0
174: 8082 c.jr ra
This results in the AI-generated code being 52 bytes long, whereas my code is 58 bytes long in the 32-bit case. This is a result of the divide by two and then divide by three approach the AI took vs. my direct divide by six!
I did also experiment with prompting Gemini without referring to the Hacker’s Delight book or the specific algorithm I wanted and I was unable to get it to generate code that was as good as what it generated above - the code always had conditional branches, loops, or used lots of registers. It appears that if you let the AI choose the algorithm it doesn’t do as good as if you tell it exactly what you want implemented.
Interestingly, on 64-bit machines, both the AI generated code and my code have an identical bug. This does not surprise me due to the properties of dividing by 3 (which is, after all, part of dividing by 6). I know that in my divide by 3 routine, I had to handle the case of numbers close to 0xffffffffffffffff differently because of the error accumulation of the series approximation to 1/3 for large inputs. It makes sense that this same error would occur near values half that in divide by 6 due to the shift right by 1 bit.
1: div6u: 0x0000000000000000 0x0000000000000000 pass
2: div6u: 0x0000000000000005 0x0000000000000000 pass
3: div6u: 0x0000000000000006 0x0000000000000001 pass
4: div6u: 0x000000000000000b 0x0000000000000001 pass
5: div6u: 0x000000000000007b 0x0000000000000014 pass
6: div6u: 0x000000007fffffff 0x0000000015555555 pass
7: div6u: 0x0000000080000000 0x0000000015555555 pass
8: div6u: 0x00000000ffffffff 0x000000002aaaaaaa pass
9: div6u: 0x0000000100000000 0x000000002aaaaaaa pass
10: div6u: 0x7fffffffffffffff 0x1555555555555554 0x1555555555555555 fail
11: div6u: 0x8000000000000000 0x1555555555555555 pass
12: div6u: 0xffffffffffffffff 0x2aaaaaaaaaaaaaaa passI ended up using a similar approach as to the one I used in my div3u routine, where I use a more precise correction (which requires more instructions) in the 64-bit code path. I used AI to come up with the new “magic number” approximation of floor(r * 11 / 64) for the correction step. Once the corrections were applied to the div6u routines, I just went with the version I wrote myself. I also prefer the register allocation in my version as it’s the same as my other divide routines, and sometimes I like to call routines without saving registers on the stack because I know what is clobbered when I’m writing extremely concise code. You can see an example of this in my AES-128 implementation for x86_64.
################################################################################
# routine: div6u
#
# Unsigned fast division by 6 without using M extension.
# This routine is 64-bit on 64-bit CPUs and 32-bit on 32-bit CPUs.
# It uses a fast multiply/shift/add/correct algorithm.
# Suitable for use on RV32E architectures.
#
# input registers:
# a0 = unsigned dividend (32 or 64 bits)
#
# output registers:
# a0 = quotient (unsigned)
################################################################################
div6u:
# Phase 1: Calculate approximate quotient q.
srli a1, a0, 1 # a1 = (n >> 1)
srli a2, a0, 3 # a2 = (n >> 3)
add a1, a1, a2 # q = (n >> 1) + (n >> 3)
srli a2, a1, 4 # a2 = (q >> 4)
add a1, a1, a2 # q = q + (q >> 4)
srli a2, a1, 8 # a2 = (q >> 8)
add a1, a1, a2 # q = q + (q >> 8)
srli a2, a1, 16 # a2 = (q >> 16)
add a1, a1, a2 # q = q + (q >> 16)
.if CPU_BITS == 64
srli a2, a1, 32 # a2 = (q >> 32)
add a1, a1, a2 # q = q + (q >> 32)
.endif
srli a1, a1, 2 # q = q >> 2
# Phase 2: Calculate remainder r = n - 6*q
slli a2, a1, 2 # a2 = q * 4
slli a3, a1, 1 # a3 = q * 2
add a2, a2, a3 # a2 = q * 6
sub a2, a0, a2 # a2 = r = n - q * 6
# Phase 3: Correction
.if CPU_BITS == 32
# For 32-bit, the error is at most 1, so a simple check is sufficient.
sltiu a3, a2, 6 # a3 = 1 if r < 6
xori a3, a3, 1 # a3 = 1 if r >= 6
add a0, a1, a3 # q = q + correction
.endif
.if CPU_BITS == 64
# For 64-bit, we use a fast magic number multiplication to find
# the correction amount, which is floor(r * 11 / 64).
# This single step is sufficient to produce the correct result.
slli a3, a2, 3 # a3 = r * 8
add a3, a3, a2 # a3 = r * 9
slli a4, a2, 1 # a4 = r * 2
add a3, a3, a4 # a3 = r * 11
srli a3, a3, 6 # a3 = floor((r * 11) / 64) -> correction amount
add a0, a1, a3 # a0 = q_approx + correction
.endif
ret
.size div6u, .-div6u
So ultimately I ended up using code I wrote myself as it was more compact in source code form. I used AI assistance for two parts - coming up with the 11/64 approximation to 1/6 as being sufficiently precise to correct two bit errors in the the test quotient for 64-bit numbers, and I used it to generate test cases. I used a prompt which was an improvement on the one I used last time I generated test cases so there were fewer issues with how the test code was laid out.
Update div-tests.s with tests for div6u. Use a similar approach as the test cases for div3u, div5u and div10u. The routines should test both 32 and 64 bit cases. The tests should test large values in particular because this is where the division approximation is most likely to break down. Interleave the div6u tests in between the div5u and div10u tests and variables.The last bit successfully avoided some of the annoyances I had last time. Also, unlike last time, this time around all the generated test cases were correct.
The approach I ultimately took is one I often take - I don’t use the AI to generate my main code, but I do find it a timesaver to generate boilerplate code (the tests) and to investigate small problems, like what’s a good approximation for a formula using a power-of-two divisor that generates one bit or less of error in the approximate quotient correction for 64-bit quantities.



Here's what claude 4.1 gave me with the same prompt:
```
# Unsigned integer divide by 6 for RISC-V RV32I/E and RV64I
# Input: a0 = dividend
# Output: a0 = quotient (dividend / 6)
#
# Algorithm: Multiplication by reciprocal using magic number
# For division by 6, we use the multiplicative inverse approach
# Magic number for 32-bit: 0xAAAAAAAB (with shift)
# Magic number for 64-bit: 0xAAAAAAAAAAAAAAAB (with shift)
.text
.globl udiv6
.align 2
udiv6:
.if CPU_BITS == 64
# 64-bit implementation
# Algorithm: q = (n * 0xAAAAAAAAAAAAAAAB) >> 65
# Since we can't multiply, we decompose the multiplication
# into shifts and adds
# Save original value
mv t0, a0
# Compute n * 0xAAAAAAAAAAAAAAAB
# 0xAAAAAAAAAAAAAAAB in binary has pattern: 1010...1011
# This is (2^64 * 2 + 2^64 - 2^64/3) / 3
# We can compute n * 0xAAA...AAB using:
# result = n + (n << 1) + (n << 3) + (n << 4) + ... pattern
# More efficient method: n/6 = (n * 0xAAAAAAAAAAAAAAAB) >> 65
# Since we need >> 65, we actually compute the high 64 bits of
# a 128-bit multiplication
# Approximation: n/6 ≈ (n >> 1) - (n >> 3) + (n >> 5) - (n >> 7) + ...
# Better approach: n/6 = (n * 43) >> 8 for small n
# For full range: use the fact that n/6 = (n + (n>>2) + (n>>4) + ...) >> 3
# Exact algorithm without multiply:
# q = n >> 1
srli t1, a0, 1 # t1 = n >> 1
# q = q >> 1
srli t2, t1, 1 # t2 = n >> 2
# q = q + (n >> 2)
add t1, t1, t2 # t1 = (n >> 1) + (n >> 2)
# Continue the pattern to build up the reciprocal
srli t2, t2, 2 # t2 = n >> 4
add t1, t1, t2 # t1 += n >> 4
srli t2, t2, 2 # t2 = n >> 6
add t1, t1, t2 # t1 += n >> 6
srli t2, t2, 2 # t2 = n >> 8
add t1, t1, t2 # t1 += n >> 8
srli t2, t2, 2 # t2 = n >> 10
add t1, t1, t2 # t1 += n >> 10
srli t2, t2, 2 # t2 = n >> 12
add t1, t1, t2 # t1 += n >> 12
srli t2, t2, 2 # t2 = n >> 14
add t1, t1, t2 # t1 += n >> 14
srli t2, t2, 2 # t2 = n >> 16
add t1, t1, t2 # t1 += n >> 16
srli t2, t2, 2 # t2 = n >> 18
add t1, t1, t2 # t1 += n >> 18
srli t2, t2, 2 # t2 = n >> 20
add t1, t1, t2 # t1 += n >> 20
# Continue for more precision (64-bit needs more terms)
srli t2, t2, 2 # t2 = n >> 22
add t1, t1, t2
srli t2, t2, 2 # t2 = n >> 24
add t1, t1, t2
srli t2, t2, 2 # t2 = n >> 26
add t1, t1, t2
srli t2, t2, 2 # t2 = n >> 28
add t1, t1, t2
srli t2, t2, 2 # t2 = n >> 30
add t1, t1, t2
srli t2, t2, 2 # t2 = n >> 32
add t1, t1, t2
# More terms for 64-bit precision
srli t2, a0, 34
add t1, t1, t2
srli t2, a0, 36
add t1, t1, t2
srli t2, a0, 38
add t1, t1, t2
srli t2, a0, 40
add t1, t1, t2
srli t2, a0, 42
add t1, t1, t2
srli t2, a0, 44
add t1, t1, t2
srli t2, a0, 46
add t1, t1, t2
srli t2, a0, 48
add t1, t1, t2
srli t2, a0, 50
add t1, t1, t2
srli t2, a0, 52
add t1, t1, t2
srli t2, a0, 54
add t1, t1, t2
srli t2, a0, 56
add t1, t1, t2
srli t2, a0, 58
add t1, t1, t2
srli t2, a0, 60
add t1, t1, t2
srli t2, a0, 62
add t1, t1, t2
# Final adjustment shift
srli a0, t1, 2 # Final shift to get quotient
.else
# 32-bit implementation (also works for RV32E)
# Uses only registers a0-a5, t0-t2 which are available in RV32E
# Algorithm: q = (n * 0xAAAAAAAB) >> 33
# We compute this without multiplication
# The pattern 0xAAAAAAAB = 10101010...10101011
# n/6 can be computed as sum of shifts
# More efficient for 32-bit:
# n/6 ≈ (n >> 1) - (n >> 3) + (n >> 5) - ... (alternating series)
# But we use the positive series for simplicity
# Compute n * (2^32 + 2^32/3) / 2^33
# Which equals (n + n/3) / 2
# Step 1: Compute n/3 approximation using series
# n/3 ≈ (n >> 2) + (n >> 4) + (n >> 6) + (n >> 8) + ...
srli t0, a0, 2 # t0 = n >> 2
srli t1, a0, 4 # t1 = n >> 4
add t0, t0, t1 # t0 = (n >> 2) + (n >> 4)
srli t1, a0, 6 # t1 = n >> 6
add t0, t0, t1 # t0 += n >> 6
srli t1, a0, 8 # t1 = n >> 8
add t0, t0, t1 # t0 += n >> 8
srli t1, a0, 10 # t1 = n >> 10
add t0, t0, t1 # t0 += n >> 10
srli t1, a0, 12 # t1 = n >> 12
add t0, t0, t1 # t0 += n >> 12
srli t1, a0, 14 # t1 = n >> 14
add t0, t0, t1 # t0 += n >> 14
srli t1, a0, 16 # t1 = n >> 16
add t0, t0, t1 # t0 += n >> 16
srli t1, a0, 18 # t1 = n >> 18
add t0, t0, t1 # t0 += n >> 18
srli t1, a0, 20 # t1 = n >> 20
add t0, t0, t1 # t0 += n >> 20
srli t1, a0, 22 # t1 = n >> 22
add t0, t0, t1 # t0 += n >> 22
srli t1, a0, 24 # t1 = n >> 24
add t0, t0, t1 # t0 += n >> 24
srli t1, a0, 26 # t1 = n >> 26
add t0, t0, t1 # t0 += n >> 26
srli t1, a0, 28 # t1 = n >> 28
add t0, t0, t1 # t0 += n >> 28
srli t1, a0, 30 # t1 = n >> 30
add t0, t0, t1 # t0 += n >> 30
# Step 2: Now t0 ≈ n/3, compute (n + n/3) / 2
add t0, t0, a0 # t0 = n + n/3
srli a0, t0, 1 # a0 = (n + n/3) / 2 = n/6
.endif
ret
```