Generating code with LLMs

It's a bit sloppy

Oct 01, 2025

Fig 1. I asked for a giraffe dividing things by 6 in a kind of sloppy fashion and that is indeed what I got.

It’s been a while since I compared code I write with code an LLM writes, so I thought I’d try it again and see what the state of the art is. This is with Gemini 2.5pro - if someone wants to try with a different LLM, I’d really like to see the results.

For this test, I decided to implement unsigned division by 6 on 32 and 64-bit RISC-V processors as part of my rvint integer mathematical library for RISC-V. The algorithm I’m using is from the excellent book Hacker’s Delight, and the particular instruction set I’m targeting is processors without hardware multiply/divide instructions, or those with slow multiply/divide. I also want my code to specifically work on the RV32E processors with the reduced register file, so I can run the code my my ch32v003-based olimex retrocomputer.

First, I wrote the routine by hand, so I have something to compare the AI code generation with. I believe this code is optimal for the target architectures, but it may have problems with large input values because of the imprecision in the approximation for 1/6. Note: I also found a comment bug later - the line

srli a2, a0, 3 # a2 = (n >> 2)

should have (n » 3) in the comment.

################################################################################
# routine: div6u
#
# Unsigned fast division by 6 without using M extension.
# This routine is 64-bit on 64-bit CPUs and 32-bit on 32-bit CPUs.
# It uses a fast multiply/shift/add/correct algorithm.
# Suitable for use on RV32E architectures.
#
# input registers:
# a0 = unsigned dividend (32 or 64 bits)
#
# output registers:
# a0 = quotient (unsigned)
################################################################################
div6u:
	srli	a1, a0, 1	# a1 = (n >> 1)
	srli	a2, a0, 3	# a2 = (n >> 2)
	add	a1, a1, a2	# q = (n >> 1) + (n >> 2)
	srli	a2, a1, 4	# a2 = (q >> 4)
	add	a1, a1, a2	# q = q + (q >> 4)
	srli	a2, a1, 8	# a2 = (q >> 8)
	add	a1, a1, a2	# q = q + (q >> 8)
	srli	a2, a1, 16	# a2 = (q >> 16)
	add	a1, a1, a2	# q = q + (q >> 16)
.if CPU_BITS == 64
	srli	a2, a1, 32	# a2 = (q >> 32)
	add	a1, a1, a2	# q = q + (q >> 32)
.endif
	srli	a1, a1, 2	# q = q >> 2

	# compute r = n - q*6
	slli	a2, a1, 2	# a2 = q * 4
	slli	a3, a1, 1	# a3 = q * 2
	add	a2, a2, a3	# a2 = q*6
	sub	a2, a0, a2	# a2 = r = n - q*6

	# add correction
	sltiu	a2, a2, 6	# a2 = 1 if a2 < 6 (unsigned), else a2 = 0
	xori	a2, a2, 1	# flip the result
	add	a0, a1, a2	# q = q + (r > 5)
	ret
.size	div6u, .-div6u

Here’s the result of using Gemini 2.5pro. First, my prompt - note that I specifically told the AI to use the same algorithm as I used to implement my code.

Write an unsigned integer divide by 6 routine using the algorithm from Hacker’s Delight, 2nd edition, “integer division by constants” for 32 and 64 bit RISC-V processors without multiply and divide instructions (RV32I and RV64I instruction sets). In the 32-bit case, the code should also run on the RV32E instruction set. The input number to be divided by 6 will be provided in the a0 register, and the result should be placed in the a0 register. The CPU_BITS equate will be set to 32 or 64 bits depending on the processor architecture, so you can write conditional code wrapped in “.if CPU_BITS == 64” to do conditional assembly for different processors.

This is the result:

# RISC-V Unsigned Integer Division by 6
#
# Author: Gemini
# Date: 2024-05-21
#
# Description:
# This routine implements unsigned integer division by the constant 6 for 32-bit
# and 64-bit RISC-V processors using only the base integer instruction set
# (RV32I/RV32E and RV64I). It does not use any multiply or divide instructions.
#
# Algorithm:
# The routine is based on the methods described in the book “Hacker’s Delight,”
# 2nd Edition, by Henry S. Warren, Jr., in Chapter 10, “Integer Division by Constants.”
#
# The division n/6 is calculated as floor(floor(n/2)/3).
#
# 1. The input n is first shifted right by 1 to compute y = floor(n/2).
#
# 2. The division y/3 is then performed using a shift-and-add multiplication
#    by a “magic number” that approximates 4/3. The sequence is:
#      t = y + (y >> 2)
#      t = t + (t >> 4)
#      t = t + (t >> 8)
#      ...
#      q_approx = t >> 2
#    This produces an approximate quotient that is correct for some inputs but
#    can be off by one for others (specifically, when y is a multiple of 3).
#
# 3. A correction step is applied. The remainder is calculated:
#      rem = y - 3 * q_approx
#    If the remainder is 3 or greater, the approximate quotient is incremented.
#
# Configuration:
# Set the CPU_BITS equate to 32 or 64 to assemble for the desired architecture.
#  - CPU_BITS = 32: Assembles for RV32I and RV32E.
#  - CPU_BITS = 64: Assembles for RV64I.

.equ CPU_BITS, 32 # Set to 32 or 64

.globl divu6
.text

# =============================================================================
# divu6: Unsigned integer division by 6
#
# ABI:
#   Input:      a0 - Unsigned integer dividend (n)
#   Output:     a0 - Unsigned integer quotient (floor(n/6))
#   Clobbers:   a1, t0, t1 (scratch registers)
# =============================================================================
divu6:

.if CPU_BITS == 32
    # -------------------------------------------------------------------------
    # 32-bit Implementation (RV32I / RV32E)
    # -------------------------------------------------------------------------
    # Step 1: Calculate y = n / 2
    srli a1, a0, 1          # a1 = y = n >> 1
    mv   t1, a1              # Save y in t1 for later (remainder calculation)

    # Step 2: Calculate approximate quotient for y / 3
    # Based on multiplying by approx 4/3 and shifting right by 2.
    # q_approx is calculated in a0. a1 is used as a temporary.
    srli a0, t1, 2          # a0 = y >> 2
    add  a0, a0, t1          # a0 = t = y + (y >> 2)
    srli a1, a0, 4          # a1 = t >> 4
    add  a0, a0, a1          # a0 = t = t + (t >> 4)
    srli a1, a0, 8          # a1 = t >> 8
    add  a0, a0, a1          # a0 = t = t + (t >> 8)
    srli a1, a0, 16         # a1 = t >> 16
    add  a0, a0, a1          # a0 = t = t + (t >> 16)
    srli a0, a0, 2          # a0 = q_approx = t >> 2

    # Step 3: Correction step
    # q_final = q_approx + (y - 3*q_approx >= 3)
    slli a1, a0, 1          # a1 = q_approx * 2
    add  a1, a1, a0          # a1 = q_approx * 3
    sub  a1, t1, a1          # a1 = rem = y - (q_approx * 3)
    li   t0, 2               # Load constant 2 for comparison
    sltu t0, t0, a1           # t0 = 1 if 2 < rem (i.e., rem >= 3), else 0
    add  a0, a0, t0          # a0 = q_final = q_approx + correction
    ret
.else
    # -------------------------------------------------------------------------
    # 64-bit Implementation (RV64I)
    # -------------------------------------------------------------------------
    # Step 1: Calculate y = n / 2
    srli a1, a0, 1          # a1 = y = n >> 1
    mv   t1, a1              # Save y in t1 for later (remainder calculation)

    # Step 2: Calculate approximate quotient for y / 3
    # Same logic as 32-bit, but with an extra step for the upper 32 bits.
    # q_approx is calculated in a0. a1 is used as a temporary.
    srli a0, t1, 2          # a0 = y >> 2
    add  a0, a0, t1          # a0 = t = y + (y >> 2)
    srli a1, a0, 4          # a1 = t >> 4
    add  a0, a0, a1          # a0 = t = t + (t >> 4)
    srli a1, a0, 8          # a1 = t >> 8
    add  a0, a0, a1          # a0 = t = t + (t >> 8)
    srli a1, a0, 16         # a1 = t >> 16
    add  a0, a0, a1          # a0 = t = t + (t >> 16)
    srli a1, a0, 32         # a1 = t >> 32
    add  a0, a0, a1          # a0 = t = t + (t >> 32)
    srli a0, a0, 2          # a0 = q_approx = t >> 2

    # Step 3: Correction step
    # q_final = q_approx + (y - 3*q_approx >= 3)
    slli a1, a0, 1          # a1 = q_approx * 2
    add  a1, a1, a0          # a1 = q_approx * 3
    sub  a1, t1, a1          # a1 = rem = y - (q_approx * 3)
    li   t0, 2               # Load constant 2 for comparison
    sltu t0, t0, a1           # t0 = 1 if 2 < rem (i.e., rem >= 3), else 0
    add  a0, a0, t0          # a0 = q_final = q_approx + correction
    ret
.endif

The code uses the same number of instructions as my code, but for some reason it provides two complete implementations rather than just doing the conditional assembly around the two instructions that differ for 64-bit support. So the AI code is twice as long and contains a major amount of copypasta’d code. The AI-generated code makes does a divide by 2 followed by a divide by 3, whereas my code directly divides by 6. This has an interesting side effect; the AI generated code makes better use of RISC-V compressed instructions as can be seen in objdump output of the assembled code (div6u is my code, divu6 is AI-generated):

00000108 <div6u>:
     108: 00155593     	srli	a1, a0, 0x1
     10c: 00355613     	srli	a2, a0, 0x3
     110: 95b2         	c.add	a1, a2
     112: 0045d613     	srli	a2, a1, 0x4
     116: 95b2         	c.add	a1, a2
     118: 0085d613     	srli	a2, a1, 0x8
     11c: 95b2         	c.add	a1, a2
     11e: 0105d613     	srli	a2, a1, 0x10
     122: 95b2         	c.add	a1, a2
     124: 8189         	c.srli	a1, 0x2
     126: 00259613     	slli	a2, a1, 0x2
     12a: 00159693     	slli	a3, a1, 0x1
     12e: 9636         	c.add	a2, a3
     130: 40c50633     	sub	a2, a0, a2
     134: 00663613     	sltiu	a2, a2, 0x6
     138: 00164613     	xori	a2, a2, 0x1
     13c: 00c58533     	add	a0, a1, a2
     140: 8082         	c.jr	ra

00000142 <divu6>:
     142: 00155593     	srli	a1, a0, 0x1
     146: 832e         	c.mv	t1, a1
     148: 00235513     	srli	a0, t1, 0x2
     14c: 951a         	c.add	a0, t1
     14e: 00455593     	srli	a1, a0, 0x4
     152: 952e         	c.add	a0, a1
     154: 00855593     	srli	a1, a0, 0x8
     158: 952e         	c.add	a0, a1
     15a: 01055593     	srli	a1, a0, 0x10
     15e: 952e         	c.add	a0, a1
     160: 8109         	c.srli	a0, 0x2
     162: 00151593     	slli	a1, a0, 0x1
     166: 95aa         	c.add	a1, a0
     168: 40b305b3     	sub	a1, t1, a1
     16c: 4289         	c.li	t0, 0x2
     16e: 00b2b2b3     	sltu	t0, t0, a1
     172: 9516         	c.add	a0, t0
     174: 8082         	c.jr	ra

This results in the AI-generated code being 52 bytes long, whereas my code is 58 bytes long in the 32-bit case. This is a result of the divide by two and then divide by three approach the AI took vs. my direct divide by six!

I did also experiment with prompting Gemini without referring to the Hacker’s Delight book or the specific algorithm I wanted and I was unable to get it to generate code that was as good as what it generated above - the code always had conditional branches, loops, or used lots of registers. It appears that if you let the AI choose the algorithm it doesn’t do as good as if you tell it exactly what you want implemented.

Interestingly, on 64-bit machines, both the AI generated code and my code have an identical bug. This does not surprise me due to the properties of dividing by 3 (which is, after all, part of dividing by 6). I know that in my divide by 3 routine, I had to handle the case of numbers close to 0xffffffffffffffff differently because of the error accumulation of the series approximation to 1/3 for large inputs. It makes sense that this same error would occur near values half that in divide by 6 due to the shift right by 1 bit.

1: div6u: 0x0000000000000000 0x0000000000000000 pass
2: div6u: 0x0000000000000005 0x0000000000000000 pass
3: div6u: 0x0000000000000006 0x0000000000000001 pass
4: div6u: 0x000000000000000b 0x0000000000000001 pass
5: div6u: 0x000000000000007b 0x0000000000000014 pass
6: div6u: 0x000000007fffffff 0x0000000015555555 pass
7: div6u: 0x0000000080000000 0x0000000015555555 pass
8: div6u: 0x00000000ffffffff 0x000000002aaaaaaa pass
9: div6u: 0x0000000100000000 0x000000002aaaaaaa pass
10: div6u: 0x7fffffffffffffff 0x1555555555555554 0x1555555555555555 fail
11: div6u: 0x8000000000000000 0x1555555555555555 pass
12: div6u: 0xffffffffffffffff 0x2aaaaaaaaaaaaaaa pass

I ended up using a similar approach as to the one I used in my div3u routine, where I use a more precise correction (which requires more instructions) in the 64-bit code path. I used AI to come up with the new “magic number” approximation of floor(r * 11 / 64) for the correction step. Once the corrections were applied to the div6u routines, I just went with the version I wrote myself. I also prefer the register allocation in my version as it’s the same as my other divide routines, and sometimes I like to call routines without saving registers on the stack because I know what is clobbered when I’m writing extremely concise code. You can see an example of this in my AES-128 implementation for x86_64.

################################################################################
# routine: div6u
#
# Unsigned fast division by 6 without using M extension.
# This routine is 64-bit on 64-bit CPUs and 32-bit on 32-bit CPUs.
# It uses a fast multiply/shift/add/correct algorithm.
# Suitable for use on RV32E architectures.
#
# input registers:
# a0 = unsigned dividend (32 or 64 bits)
#
# output registers:
# a0 = quotient (unsigned)
################################################################################
div6u:
	# Phase 1: Calculate approximate quotient q.
	srli	a1, a0, 1	# a1 = (n >> 1)
	srli	a2, a0, 3	# a2 = (n >> 3)
	add	a1, a1, a2	# q = (n >> 1) + (n >> 3)
	srli	a2, a1, 4	# a2 = (q >> 4)
	add	a1, a1, a2	# q = q + (q >> 4)
	srli	a2, a1, 8	# a2 = (q >> 8)
	add	a1, a1, a2	# q = q + (q >> 8)
	srli	a2, a1, 16	# a2 = (q >> 16)
	add	a1, a1, a2	# q = q + (q >> 16)
.if CPU_BITS == 64
	srli	a2, a1, 32	# a2 = (q >> 32)
	add	a1, a1, a2	# q = q + (q >> 32)
.endif
	srli	a1, a1, 2	# q = q >> 2

	# Phase 2: Calculate remainder r = n - 6*q
	slli	a2, a1, 2	# a2 = q * 4
	slli	a3, a1, 1	# a3 = q * 2
	add	a2, a2, a3	# a2 = q * 6
	sub	a2, a0, a2	# a2 = r = n - q * 6

	# Phase 3: Correction
.if CPU_BITS == 32
	# For 32-bit, the error is at most 1, so a simple check is sufficient.
	sltiu	a3, a2, 6	# a3 = 1 if r < 6
	xori	a3, a3, 1	# a3 = 1 if r >= 6
	add	a0, a1, a3	# q = q + correction
.endif
.if CPU_BITS == 64
	# For 64-bit, we use a fast magic number multiplication to find
	# the correction amount, which is floor(r * 11 / 64).
	# This single step is sufficient to produce the correct result.
	slli	a3, a2, 3	# a3 = r * 8
	add	a3, a3, a2	# a3 = r * 9
	slli	a4, a2, 1	# a4 = r * 2
	add	a3, a3, a4	# a3 = r * 11
	srli	a3, a3, 6	# a3 = floor((r * 11) / 64) -> correction amount
	add	a0, a1, a3	# a0 = q_approx + correction
.endif
	ret
.size	div6u, .-div6u

So ultimately I ended up using code I wrote myself as it was more compact in source code form. I used AI assistance for two parts - coming up with the 11/64 approximation to 1/6 as being sufficiently precise to correct two bit errors in the the test quotient for 64-bit numbers, and I used it to generate test cases. I used a prompt which was an improvement on the one I used last time I generated test cases so there were fewer issues with how the test code was laid out.

Update div-tests.s with tests for div6u. Use a similar approach as the test cases for div3u, div5u and div10u. The routines should test both 32 and 64 bit cases. The tests should test large values in particular because this is where the division approximation is most likely to break down. Interleave the div6u tests in between the div5u and div10u tests and variables.

The last bit successfully avoided some of the annoyances I had last time. Also, unlike last time, this time around all the generated test cases were correct.

The approach I ultimately took is one I often take - I don’t use the AI to generate my main code, but I do find it a timesaver to generate boilerplate code (the tests) and to investigate small problems, like what’s a good approximation for a formula using a power-of-two divisor that generates one bit or less of error in the approximate quotient correction for 64-bit quantities.

Discussion about this post

Kamaroyl

Oct 2, 2025

Here's what claude 4.1 gave me with the same prompt:

```

# Unsigned integer divide by 6 for RISC-V RV32I/E and RV64I

# Input: a0 = dividend

# Output: a0 = quotient (dividend / 6)

# Algorithm: Multiplication by reciprocal using magic number

# For division by 6, we use the multiplicative inverse approach

# Magic number for 32-bit: 0xAAAAAAAB (with shift)

# Magic number for 64-bit: 0xAAAAAAAAAAAAAAAB (with shift)

.text

.globl udiv6

.align 2

udiv6:

.if CPU_BITS == 64

# 64-bit implementation

# Algorithm: q = (n * 0xAAAAAAAAAAAAAAAB) >> 65

# Since we can't multiply, we decompose the multiplication

# into shifts and adds

# Save original value

mv t0, a0

# Compute n * 0xAAAAAAAAAAAAAAAB

# 0xAAAAAAAAAAAAAAAB in binary has pattern: 1010...1011

# This is (2^64 * 2 + 2^64 - 2^64/3) / 3

# We can compute n * 0xAAA...AAB using:

# result = n + (n << 1) + (n << 3) + (n << 4) + ... pattern

# More efficient method: n/6 = (n * 0xAAAAAAAAAAAAAAAB) >> 65

# Since we need >> 65, we actually compute the high 64 bits of

# a 128-bit multiplication

# Approximation: n/6 ≈ (n >> 1) - (n >> 3) + (n >> 5) - (n >> 7) + ...

# Better approach: n/6 = (n * 43) >> 8 for small n

# For full range: use the fact that n/6 = (n + (n>>2) + (n>>4) + ...) >> 3

# Exact algorithm without multiply:

# q = n >> 1

srli t1, a0, 1 # t1 = n >> 1

# q = q >> 1

srli t2, t1, 1 # t2 = n >> 2

# q = q + (n >> 2)

add t1, t1, t2 # t1 = (n >> 1) + (n >> 2)

# Continue the pattern to build up the reciprocal

srli t2, t2, 2 # t2 = n >> 4

add t1, t1, t2 # t1 += n >> 4

srli t2, t2, 2 # t2 = n >> 6

add t1, t1, t2 # t1 += n >> 6

srli t2, t2, 2 # t2 = n >> 8

add t1, t1, t2 # t1 += n >> 8

srli t2, t2, 2 # t2 = n >> 10

add t1, t1, t2 # t1 += n >> 10

srli t2, t2, 2 # t2 = n >> 12

add t1, t1, t2 # t1 += n >> 12

srli t2, t2, 2 # t2 = n >> 14

add t1, t1, t2 # t1 += n >> 14

srli t2, t2, 2 # t2 = n >> 16

add t1, t1, t2 # t1 += n >> 16

srli t2, t2, 2 # t2 = n >> 18

add t1, t1, t2 # t1 += n >> 18

srli t2, t2, 2 # t2 = n >> 20

add t1, t1, t2 # t1 += n >> 20

# Continue for more precision (64-bit needs more terms)

srli t2, t2, 2 # t2 = n >> 22

add t1, t1, t2

srli t2, t2, 2 # t2 = n >> 24

add t1, t1, t2

srli t2, t2, 2 # t2 = n >> 26

add t1, t1, t2

srli t2, t2, 2 # t2 = n >> 28

add t1, t1, t2

srli t2, t2, 2 # t2 = n >> 30

add t1, t1, t2

srli t2, t2, 2 # t2 = n >> 32

add t1, t1, t2

# More terms for 64-bit precision

srli t2, a0, 34

add t1, t1, t2

srli t2, a0, 36

add t1, t1, t2

srli t2, a0, 38

add t1, t1, t2

srli t2, a0, 40

add t1, t1, t2

srli t2, a0, 42

add t1, t1, t2

srli t2, a0, 44

add t1, t1, t2

srli t2, a0, 46

add t1, t1, t2

srli t2, a0, 48

add t1, t1, t2

srli t2, a0, 50

add t1, t1, t2

srli t2, a0, 52

add t1, t1, t2

srli t2, a0, 54

add t1, t1, t2

srli t2, a0, 56

add t1, t1, t2

srli t2, a0, 58

add t1, t1, t2

srli t2, a0, 60

add t1, t1, t2

srli t2, a0, 62

add t1, t1, t2

# Final adjustment shift

srli a0, t1, 2 # Final shift to get quotient

.else

# 32-bit implementation (also works for RV32E)

# Uses only registers a0-a5, t0-t2 which are available in RV32E

# Algorithm: q = (n * 0xAAAAAAAB) >> 33

# We compute this without multiplication

# The pattern 0xAAAAAAAB = 10101010...10101011

# n/6 can be computed as sum of shifts

# More efficient for 32-bit:

# n/6 ≈ (n >> 1) - (n >> 3) + (n >> 5) - ... (alternating series)

# But we use the positive series for simplicity

# Compute n * (2^32 + 2^32/3) / 2^33

# Which equals (n + n/3) / 2

# Step 1: Compute n/3 approximation using series

# n/3 ≈ (n >> 2) + (n >> 4) + (n >> 6) + (n >> 8) + ...

srli t0, a0, 2 # t0 = n >> 2

srli t1, a0, 4 # t1 = n >> 4

add t0, t0, t1 # t0 = (n >> 2) + (n >> 4)

srli t1, a0, 6 # t1 = n >> 6

add t0, t0, t1 # t0 += n >> 6

srli t1, a0, 8 # t1 = n >> 8

add t0, t0, t1 # t0 += n >> 8

srli t1, a0, 10 # t1 = n >> 10

add t0, t0, t1 # t0 += n >> 10

srli t1, a0, 12 # t1 = n >> 12

add t0, t0, t1 # t0 += n >> 12

srli t1, a0, 14 # t1 = n >> 14

add t0, t0, t1 # t0 += n >> 14

srli t1, a0, 16 # t1 = n >> 16

add t0, t0, t1 # t0 += n >> 16

srli t1, a0, 18 # t1 = n >> 18

add t0, t0, t1 # t0 += n >> 18

srli t1, a0, 20 # t1 = n >> 20

add t0, t0, t1 # t0 += n >> 20

srli t1, a0, 22 # t1 = n >> 22

add t0, t0, t1 # t0 += n >> 22

srli t1, a0, 24 # t1 = n >> 24

add t0, t0, t1 # t0 += n >> 24

srli t1, a0, 26 # t1 = n >> 26

add t0, t0, t1 # t0 += n >> 26

srli t1, a0, 28 # t1 = n >> 28

add t0, t0, t1 # t0 += n >> 28

srli t1, a0, 30 # t1 = n >> 30

add t0, t0, t1 # t0 += n >> 30

# Step 2: Now t0 ≈ n/3, compute (n + n/3) / 2

add t0, t0, a0 # t0 = n + n/3

srli a0, t0, 1 # a0 = (n + n/3) / 2 = n/6

.endif

ret

```

2 replies by Benard Mesander and others

2 more comments...

No posts

Benard’s Substack

Discussion about this post

Ready for more?