Implementing Division by Multiplying with Magic Numbers
Using the RISC-V Zmmul extension
Recently, I have begun implementing division in my rvint library of integer mathematical algorithms for RISC-V processors. Previously, I used series expansion techniques to implement division using only shifts and add instructions in branchless constant time routines. Now you have the option of also building the library to use multiplication instructions to do division.
Much has been written about the theory behind this technique, for example see this Stack Overflow article on “Why does GCC use multiplication by a strange number in implementing integer division?” In my article, rather than concerning myself with the mathematical “how and why” I will concentrate on how to implement things efficiently on the RISC-V architecture. The best reference I have found for this division technique is section 10.3 of Hacker’s Delight, 2nd edition.
The magic number technique is to multiply by the reciprocal of a number rather than doing the division. Since you only need the high bits of the result (we are doing integer math), this appears as a multiplication by a seemingly “magic” number containing the high bits of the reciprocal.
Why might you want to divide by doing multiplication in this way? Some possible motivations:
You may only have the Zmmul extension in your RISC-V processor’s instruction set; this extension only has the multiplication operations, and not the division and remainder instructions that are part of the full M extension.
The divider in your processor is slower; the multiplier is much faster, and you’d like to do division faster than the native divide instructions can do.
Hacker’s Delight starts with signed division by 3, 5, and 7 as these three algorithms contain sightly different edge cases, and a proof is provided that these three cases work for all integers that are not powers of 2.
Accordingly, let’s look at signed division by 3 first.
.include "config.s"
.include "mul-macs.s"
.if CONSTANT_TABLE
.section .srodata, "a", @progbits
.align 3
M_div3:
.quad 0x5555555555555556 # M = (2**63 + 2) / 3
.endif
.globl div3
.text
.if HAS_ZMMUL == 1
################################################################################
# routine: div3
#
# Signed fast division by 3 for processors with a multiply instruction
# Algorithm: "Magic Number" - Hacker's Delight 2nd ed. sec 10.3,
# Suitable for RV32I_Zmmul, RV64I_Zmmul
# Note: unless your core has the Zkt extension, this may not run in
# constant time, consult your vendor documentation.
#
# input: a0 = signed dividend
# output: a0 = signed quotient
################################################################################
div3:
.if CPU_BITS == 64
.if CONSTANT_TABLE
ld a2, M_div3
.else
# this option is best for constant time (no possibility of cache miss)
li a2, 0x5555555555555556
.endif
mulh a1, a0, a2
.else
li a2, 0x55555556 # M = magic number, (2**32+2)/3
mulhsu a1, a0, a2 # q = floor(M*n/2**32)
.endif
slti a2, a0, 0 # a2 = 1 if a0 < 0 (negative), else 0
add a0, a1, a2 # q = a0 = a1 + a2
ret
.else
# The following routines started with the Hacker's Delight 2nd edition
# Chapter 10 routines, but were extended to handle 64 bits which involved
# refining the series expansions and the correction steps. In some
# cases I managed to save an instruction or two in the correction step.
################################################################################
# routine: div3
#
# Signed fast division by 3.
# Algorithm: Abs(n) -> Unsigned Div -> Restore Sign.
# Suitable for RV32E, RV32I, RV64I
#
# input: a0 = signed dividend
# output: a0 = signed quotient
################################################################################
div3:
# 1. Preamble: Compute Sign Mask (t0) and Abs(n) (a0)
srai t0, a0, CPU_BITS-1 # t0 = -1 if n < 0, else 0
xor a0, a0, t0 # a0 = n ^ sign
sub a0, a0, t0 # a0 = (n ^ sign) - sign = abs(n)
# 2. Series Expansion (Approximates Q_abs = abs(n) / 3)
srli a1, a0, 2
srli a2, a0, 4
add a1, a2, a1 # q = (n >> 2) + (n >> 4)
srli a2, a1, 4
add a1, a2, a1 # q += q >> 4
srli a2, a1, 8
add a1, a2, a1 # q += q >> 8
srli a2, a1, 16
add a1, a2, a1 # q += q >> 16
.if CPU_BITS == 64
srli a2, a1, 32
add a1, a2, a1 # q += q >> 32
.endif
# 3. Calculate Remainder / Error
# R = abs(n) - 3*Q_est
mul3 a2, a1, a2 # a2 = 3 * Q_est
sub a2, a0, a2 # a2 = abs(n) - 3*Q_est (Remainder)
# 4. Branchless Correction
# Correction = (R * 11) >> 5
mul11 a3, a2, a0 # a3 = R * 11
srli a3, a3, 5
add a1, a1, a3 # a1 = Q_abs (corrected)
# 5. Postamble: Restore Sign
xor a0, a1, t0
sub a0, a0, t0
ret
.endif
.size div3, .-div3
While I included the entire routine for completeness, we will only be looking at the .if HAS_ZMMUL block. The .else block contains the series expansion approach described in this article. Which is best depends upon your particular processor and application.
Within the HAS_ZMMUL block, the code is conditionally assembled based upon two other conditionals: CPU_BITS which is either 64 for RV64 processors, or 32 for RV32 processors. Within the 64-bit case, there is another conditional CONSTANT_TABLE, which controls how the magic number constant is materialized.
Let’s examine the 32-bit path first. In Hacker’s Delight, the following algorithm is defined in the abstract RISC-like pseudo assembly language used in examples:
li M,0x5555556
mulhs q, M, n
shri t, n, 31
add q, q, twhere n is the input, M is magic number, t is temp, and q is the result (quotient).
This maps in a straightforward manner to RISC-V RV32 assembly:
li a2, 0x55555556 # M = magic number, (2**32+2)/3
mulhsu a1, a0, a2 # q = floor(M*n/2**32)
slti a2, a0, 0 # a2 = 1 if a0 < 0 (negative), else 0
add a0, a1, a2 # q = a0 = a1 + a2Note the use of mulhsu - this multiplies a signed by an unsigned quantity. As the magic number is positive, we can treat it as “unsigned”. The slti instruction slotted in directly after the multiply will avoid a pipeline stall in out-of-order superscalar cores (admittedly rare in RV32). The slti (set less than immediate) instruction more directly checks for negative values than the shift trick used in Hacker’s Delight.
Disassembling the generated object code, we can see the actual machine instructions generated by the assembler:
00000000 <div3>:
0: 55555637 lui a2, 0x55555
4: 55660613 addi a2, a2, 0x556
8: 02c525b3 mulhsu a1, a0, a2
c: 00052613 slti a2, a0, 0x0
10: 00c58533 add a0, a1, a2
14: 8082 c.jr raNo surprises here - the li pseudo-operation expands into two machine instructions used to materialize the magic number constant - the lui (load upper immediate) loads the upper 20 bits of the a2 register, and the addi (add immediate) instruction adds in the lower 12 bits of the magic number to the a2 register. The c.jr instruction is the “return” at the end of the subroutine.
On advanced RISC-V cores such as BOOM or SiFive U-series. the multiply instruction may run in ~3-4 cycles; all other instructions are single cycle. On embedded cores, the multiply may take ~4-32 cycles. If the processor has the Zkt extension, the multiply will run in a constant number of cycles; otherwise the number of cycles may vary by the numbers being multiplied. For cryptographic applications, one would either prefer an advanced core or require the Zkt extension (or use the series expansion version of the code, which runs in constant time).
So in the advanced core case, this routine would run in ~7-8 cycles, which is significantly faster than the ~22-25 cycles the series expansion approach takes on a 32-bit core. However, on an embedded core, the series expansion approach may be more performant.
Let’s look at the 64-bit path, first with CONSTANT_TABLE = 0. In this case the code simplifies to:
li a2, 0x5555555555555556
mulh a1, a0, a2
slti a2, a0, 0 # a2 = 1 if a0 < 0 (negative), else 0
add a0, a1, a2 # q = a0 = a1 + a2When we disassemble the output of the assembler to get the machine instructions we get:
0000000000000000 <div3>:
0: 05555637 lui a2, 0x5555
4: 55560613 addi a2, a2, 0x555
8: 0632 c.slli a2, 0xc
a: 55560613 addi a2, a2, 0x555
e: 0632 c.slli a2, 0xc
10: 55560613 addi a2, a2, 0x555
14: 0632 c.slli a2, 0xc
16: 55660613 addi a2, a2, 0x556
1a: 02c515b3 mulh a1, a0, a2
1e: 00052613 slti a2, a0, 0x0
22: 00c58533 add a0, a1, a2
26: 8082 c.jr raNote that loading the constant 0x5555555555555556 into register a2 takes 8 machine instructions! Otherwise, things are unsurprising. This code runs in constant time if the mulh runs in constant time. This code will run in ~13-14 cycles on an advanced core or perhaps in ~43 cycles on a simple core. This compares with ~24-27 cycles for the series expansion approach on RV64. So the “magic number” approach is not always the most efficient way to divide.
If CONSTANT_TABLE is set to 1, the constant 0x5555555555555556 is loaded from a table in the .srodata (small read only data) segment rather than being materialized inline. In this case, the generated machine code looks like this:
0: 00000637 lui a2, 0x0
4: 00063603 ld a2, 0x0(a2)
8: 02c515b3 mulh a1, a0, a2
c: 00052613 slti a2, a0, 0x0
10: 00c58533 add a0, a1, a2
14: 8082 c.jr raThe 8 instructions to materialize the 64-bit constant in the previous example are now instead two instructions used to load the constant from RAM. This lowers our cycle count to the ~7-9 cycle range on an advanced core - a nice speedup. If the lookup table needs to be pulled from a cache tier, that could add say 10 cycles or up to 100 cycles to pull from DRAM. So while the code is smaller, depending upon cache behavior, it may be far slower. But the code+data size is smaller.
Doing a signed divide by 5 is very similar to a signed divide by 3, but adds an additional complication - the magic number approximation requires an additional right shift to ensiure the error term to be < 1 bit for the entire range of signed integers. Here’s the pseudo-assembly from Hacker’s Delight:
li M, 0x66666667
mulhs q, M, n
shrsi q, q, 1
shri t, n, 31
add q, q, tHere’s the first part of signed divide by 5 (we concentrate on only the if HAS_ZMMUL case):
include "config.s"
.include "mul-macs.s"
.if CONSTANT_TABLE
.section .srodata, "a", @progbits
.align 3
M_div5:
.quad 0x6666666666666667
.endif
.globl div5
.text
.if HAS_ZMMUL == 1
################################################################################
# routine: div5
#
# Signed fast division by 5 for processors with a multiply instruction
# Algorithm: "Magic Number" - Hacker's Delight 2nd ed. sec 10.3,
# Suitable for RV32I_Zmmul, RV64I_Zmmul
# Note: unless your core has the Zkt extension, this may not run in
# constant time, consult your vendor documentation.
#
# input: a0 = signed dividend
# output: a0 = signed quotient
################################################################################
div5:
.if CPU_BITS == 64
.if CONSTANT_TABLE
ld a2, M_div5
.else
# this option is best for constant time (no possibility of cache miss)
li a2, 0x6666666666666667
.endif
mulh a1, a0, a2
slti a2, a0, 0 # a2 = 1 if a0 < 0 (negative), else 0
srai a1, a1, 1 # shift q right once - do after slti to avoid stall
.else
li a2, 0x66666667 # (2**33+3)/5
mulhsu a1, a0, a2 # q = floor(M*n/2**32)
slti a2, a0, 0 # a2 = 1 if a0 < 0 (negative), else 0
.endif
add a0, a1, a2 # q = a0 = a1 + a2
ret
.else
... series expansion code ...Extracting the 32-bit path from the above gives the following code:
li a2, 0x66666667 # (2**33+3)/5
mulhsu a1, a0, a2 # q = floor(M*n/2**32)
slti a2, a0, 0 # a2 = 1 if a0 < 0 (negative), else 0
add a0, a1, a2 # q = a0 = a1 + a2Note that we can avoid the right shift by 1 in the Hacker’s Delight book by using the RISC-V mulhsu (multiply high signed by unsigned) instruction which allows us one more bit of precision and so we can use one less instruction than in the Hacker’s Delight book with this microarchitectural optimization. The slti instruction immediately after the mulhsu runs in zero additional time on superscalar out-of-order cores as it can run in parallel with the multiply.
The shift cannot be avoided in the 64-bit case, so we have the following:
li a2, 0x6666666666666667
mulh a1, a0, a2
slti a2, a0, 0 # a2 = 1 if a0 < 0 (negative), else 0
srai a1, a1, 1 # shift q right once - do after slti to avoid stall
add a0, a1, a2 # q = a0 = a1 + a2Note that the slti is interchanged with the srai; this is because a high end processor could run the mulh and the slti at the same time. As in the div3 routine, the constant may be materialized inline or read from a table. Again interchanging the < 0 check and the right shift from the order in Hacker’s Delight prevents a pipeline stall on advanced cores and shortens our cycle count by one.
Dividing by 7 introduces a new problem for our algorithm - to obtain sufficient precision we must multiply by a negative magic number and then correct the product with an addition. Here’s the code for the HAS_ZMMUL case:
.include "config.s"
.include "mul-macs.s"
.if CONSTANT_TABLE
.section .srodata, "a", @progbits
.align 3
M_div7:
.quad 0x4924924924924925
.endif
.globl div7
.text
.if HAS_ZMMUL == 1
################################################################################
# routine: div7
#
# Signed fast division by 7 for processors with a multiply instruction
# Algorithm: "Magic Number" - Hacker's Delight 2nd ed. sec 10.3,
# Suitable for RV32I_Zmmul, RV64I_Zmmul
# Note: unless your core has the Zkt extension, this may not run in
# constant time, consult your vendor documentation.
#
# input: a0 = signed dividend
# output: a0 = signed quotient
################################################################################
div7:
.if CPU_BITS == 64
.if CONSTANT_TABLE
ld a2, M_div7
.else
li a2, 0x4924924924924925
.endif
mulh a1, a0, a2
slti a2, a0, 0 # Hides multiplier stall
srai a1, a1, 1 # Exact shift for 64-bit positive magic number
.else
li a2, 0x92492493 # (2**34+5)/7
mulhsu a1, a0, a2 # Signed * Unsigned entirely skips the need for 'add'
slti a2, a0, 0 # Hides multiplier stall
srai a1, a1, 2 # Exact shift for 32-bit mulhsu path
.endif
add a0, a1, a2 # Final sign adjustment correction
ret
.else
...This was very fun to write and differs significantly from the example pseudocode in Hacker’s Delight:
li M, 0x92492493
mulhs q, M, n
add q, q, n
shrsi q, q, 2
shri t, n, 31
add q, q, tThis code has an add after the mulhs in order to fix the overflow problem caused by the “negative” magic number. But look at the 32-bit code extracted from our RISC-V routine:
li a2, 0x92492493 # (2**34+5)/7
mulhsu a1, a0, a2 # Signed * Unsigned entirely skips the need for 'add'
slti a2, a0, 0 # Hides multiplier stall
srai a1, a1, 2 # Exact shift for 32-bit mulhsu path
add a0, a1, a2 # Final sign adjustment correctionBy using the mulhsu (multiply high signed by unsigned) routine we treat the magic number as an unsigned number avoiding the overflow. This means we do not need the add instruction after the multiply - saving one instruction from the “Hacker’s Delight” algorithm. Finally the slti (shri) and the srai (shrsi) are interchanged from the Hacker’s Delight version to avoid a pipeline stall and to allow the slti to run in parallel with the multiplication on cores which support that. These changes improve the RISC-V version by 1-2 cycles over the Hacker’s Delight version. The mulhsu trick works for all negative magic numbers and so isn’t just an optimization for divide by 7 - it works universally.
The 64-bit version of the algorithm also has improvements:
li a2, 0x4924924924924925
mulh a1, a0, a2
slti a2, a0, 0 # Hides multiplier stall
srai a1, a1, 1 # Exact shift for 64-bit positive magic number
add a0, a1, a2 # Final sign adjustment correctionIn this case, the magic number appears as a positive 64-bit number, so again we do not need an addition after the multiply, and again interchanging the two shifts avoids a pipeline stall on out-of-order superscalar cores. So again, we save 1-2 cycles over the Hacker’s Delight version of the code.
Over time, I will be implementing magic number versions of more of the division algorithms in the rvint library. A further option for additional optimization is looking at how clang is materializing 64-bit constants. For example in the divide by 3 case above it used 8 instructions to load the magic number 0x5555555555555556 into a register. Here’s an alternative that uses 5 instructions and 3 fewer clock cycles:
lui a2, 0x55555 # 4 bytes
addi a2, a2, 0x555 # 4 bytes (a2 = 0x55555555)
slli a3, a2, 32 # 4 bytes (a3 = 0x5555555500000000)
add a2, a2, a3 # 2 bytes (c.add a2, a3 -> a2 = 0x5555555555555555)
addi a2, a2, 1 # 2 bytes (c.addi a2, 1 -> a2 = 0x5555555555555556)clang doesn’t do this because it uses an additional register, which is harmless in this case, so it is possible to save a few more bytes and cycles.



How does using mulhsu avoid the shift for the 32-bit div5 case. The magic constant is 0x66666667 which is a positive number so mulh and mulhsu are equivalent.