RVINT - Integer Mathematical Algorithms for RISC-V
What I've been doing in my spare time.
Recently I finished up a round of improvements to my open source library of integer mathematical algorithms for RISC-V processors, especially those processors that are on the less-capable, more microcontroller end of the spectrum. I’ve implemented dozens of algorithms, and the total library size is on the order of 3K.
ubuntu@em-interesting-pascal:~/rvint/src$ llvm-size -t librvint.a
text data bss dec hex filename
1932 0 0 1932 78c div.o (ex librvint.a)
268 0 0 268 10c mul.o (ex librvint.a)
470 0 0 470 1d6 io.o (ex librvint.a)
46 0 0 46 2e sqrt.o (ex librvint.a)
134 0 0 134 86 gcd.o (ex librvint.a)
172 0 0 172 ac bits.o (ex librvint.a)
3022 0 0 3022 bce (TOTALS)
The library now can optionally make use of various extensions to the base instruction set, so while it should work on RV32E, RV32I, and RV64I with no additional extensions, it can now make use of Zba (addressing instructions), Zbb (bit manipulation), Zbs (single bit instructions), Zicond (conditional movement), and M (integer math - although the library mostly targets processors without this extension).
I believe all the algorithms work correctly on RV32E - although the test programs do not, as they are designed to be run on systems that support Linux syscalls. The use of RV32E means the code can run on a CH32V003 processor which is used in my RVPC.
I learned a lot writing this library, my first version only used the base instruction set, and I posted it to reddit r/RISCV and got lots of useful feedback which I’ve tried to incorporate.
Beyond learning more RISC-V assembly, I’ve explored some interesting algorithms that I hadn’t used before - in particular the fast division routines described in the amazing book Hacker’s Delight. These algorithms have some interesting properties - they are based upon a series approximation that in some cases is even faster than the “reciprocal multiplication” trick, and they are completely branchless - they run in a fixed number of cycles, and thus are suitable for cryptographic applications because they do not leak information about the numbers being divided. I’ve used branchless tricks elsewhere in the library because I’ve found I really enjoy figuring out how to solve problems without using any branching.
I also got to use the czero.nez instruction in a couple places. THIS MAKES ME SO HAPPY.
As an example, let’s look at a routine to divide the number in register a0 by 5 - this code runs in a fixed 15 cycles. It follows the same general approach as all the fast division routines I wrote - approximate the quotient with a fixed-point series expansion, compute the remainder, and correct the approximated quotient by looking at the remainder.
div5u:
srli a2, a0, 2
sub a1, a0, a2
srli a2, a1, 4
add a1, a1, a2
srli a2, a1, 8
add a1, a1, a2
srli a2, a1, 16
add a1, a1, a2
.if CPU_BITS == 64
srli a2, a1, 32
add a1, a1, a2
.endif
srli a1, a1, 2
# Calculate r = n - q*5
mul5 a2, a1, a2
sub a2, a2, a0
slti a3, a2, -4
add a0, a1, a3 # a0 = q + correctionThis code also makes use of the mul5 macro which multiplies numbers by 5 without using multiplication instructions:
.macro mul5 dest src scratch0
.if HAS_ZBA
sh2add \dest, \src, \src
.else
slli \scratch0, \src, 2
add \dest, \scratch0, \src
.endif
.endm Let’s take a look at how this works. It’s reasonable to be surprised that the above code divides an integer by five. First, recognize that dividing by five is the same as multiplying by 0.2.
We start by computing 0.75 times our input as that’s easy to do with shifts and adds:
srli a2, a0, 2
sub a1, a0, a2Math:
Next we do a series expansion to get this closer to 0.8:
The code repeatedly shifts right by multiples of 4 (4, 8, 16, 32) and adds the result to the accumulator. Note the series has to be extended by a term to get sufficient accuracy in the 64-bit CPU case.
srli a2, a1, 4 # q >> 4
add a1, a1, a2 # q = q + (q >> 4)
... (repeated for 8, 16, and 32 bits)Once the accumulator holds approximately 0.8n, the algorithm divides by 4 to get 0.2n.
srli a1, a1, 2 # a1 = q >> 2Math:
Because integer math truncates bits, the estimate might be slightly off (specifically, it might be 1 too small as we started with 0.75 and added terms to get to 0.8 - so we under-estimated the correct value by a small amount).
The algorithm calculates the "negative remainder" to check for errors. The “negative remainder” is used instead of the more usual positive remainder as it lets us save one RISC-V instruction (we avoid doing an xor). q below stands for quotient - the result of the division, and r is remainder.
Calculate 5q: It uses the
mul5macro to calculate q x 5.Calculate Difference: It computes
diff = 5q - n.If q is correct, n = 5q + r (where 0 <= r < 5).
Therefore,
diffshould be -r, which is in the range [-4, 0].
Check Threshold:
Code:
slti a3, a2, -4If
diff < -4(e.g., -5, -6...), it means the remainder was 5 or greater, implying the quotient q was too small. If q is too small, register a3 contains 1, otherwise it contains 0. This is used to avoid a conditional branch.
Correction:
Code:
add a0, a1, a3If the check is true (1), it adds 1 to the quotient in order to correct it.
We now have the quotient in a0. Quite nifty if I may say so myself. And fast, 15 cycles on RV32E, and one less if you have the Zba extension!
I like the Zba extension quite a bit. It allows you to do a primitive form of multiply-and-accumulate, as it allows you to shift a register left by 1, 2, or 3 bits, and then add another number to it in one instruction. Kind of reminds me of a more primitive form of the ARM barrel shifter. I actually haven’t found it that useful for forming addresses but I use it to do multiply and adds a lot. I wish there was a shift left by 4 bits form, I found several places where this would have been useful. I used these instructions a lot in my fast multiply macros.
I’m considering trying to shoehorn a reasonable soft floating point math implementation into the CH32V003 next, using some of the routines in rvint as infrastructure.
Rvint currently contains implementations of the following algorithms:
Division
On 32-bit processors (RV32I, RV32E)
32-bit by 32-bit signed and unsigned division with 32-bit result and remainder.
Fast 32-bit unsigned division by 3
Fast 32-bit unsigned division by 5
Fast 32-bit unsigned division by 6
Fast 32-bit unsigned division by 7
Fast 32-bit unsigned division by 9
Fast 32-bit unsigned division by 10
Fast 32-bit unsigned division by 11
Fast 32-bit unsigned division by 12
Fast 32-bit unsigned division by 13
Fast 32-bit unsigned division by 100
Fast 32-bit unsigned division by 1000
Fast 32-bit signed division by 3
Fast 32-bit signed division by 5
Fast 32-bit signed division by 6
Fast 32-bit signed division by 7
Fast 32-bit signed division by 9
Fast 32-bit signed division by 10
Fast 32-bit signed division by 11
Fast 32-bit signed division by 12
Fast 32-bit signed division by 13
Fast 32-bit signed division by 100
Fast 32-bit signed division by 1000
On 64-bit processors (RV64I)
64-bit by 64-bit signed and unsigned division with 64-bit result and remainder.
Fast 64-bit unsigned division by 3
Fast 64-bit unsigned division by 5
Fast 64-bit unsigned division by 6
Fast 64-bit unsigned division by 7
Fast 64-bit unsigned division by 9
Fast 64-bit unsigned division by 10
Fast 64-bit unsigned division by 11
Fast 64-bit unsigned division by 12
Fast 64-bit unsigned division by 13
Fast 64-bit unsigned division by 100
Fast 64-bit unsigned division by 1000
Fast 64-bit signed division by 3
Fast 64-bit signed division by 5
Fast 64-bit signed division by 6
Fast 64-bit signed division by 7
Fast 64-bit signed division by 9
Fast 64-bit signed division by 10
Fast 64-bit signed division by 11
Fast 64-bit signed division by 12
Fast 64-bit signed division by 13
Fast 64-bit signed division by 100
Fast 64-bit signed division by 1000
Math Macros
abs (absolute value) of 32 or 64 bit register.
Multiplication
On 32-bit processors (RV32E, RV32I):
32-bit by 32-bit signed and unsigned multiplication with 32-bit result.
32-bit by 32-bit signed and unsigned multiplication with 64-bit result.
On 64-bit processors (RV64i):
32-bit by 32-bit signed and unsigned multiplication with 64-bit result.
64-bit by 64-bit signed and unsigned multiplication with 64-bit result.
64-bit by 64-bit signed and unsigned multiplication with 128-bit result.
Fast Multiplication Macros
Multiply 32-bit or 64-bit register by 3.
Multiply 32-bit or 64-bit register by 5.
Multiply 32-bit or 64-bit register by 6.
Multiply 32-bit or 64-bit register by 9.
Multiply 32-bit or 64-bit register by 10.
Multiply 32-bit or 64-bit register by 11.
Multiply 32-bit or 64-bit register by 12.
Multiply 32-bit or 64-bit register by 13.
Multiply 32-bit or 64-bit register by 100.
Multiply 32-bit or 64-bit register by 1000.
Base Conversions & I/O Operations
These operations support 32-bit numbers on 32-bit architectures and 64-bit numbers on 64-bit architectures. All routines are RV32E, RV32I, and RV64I compatible. The decimal routines should even work on RV128I.
ASCII binary to binary.
ASCII unsigned decimal to binary.
ASCII signed decimal to two’s complement binary.
ASCII hexadecimal to binary.
binary to ASCII binary.
two’s complement binary to ASCII signed decimal.
unsigned binary to unsigned ASCII decimal.
binary to ASCII hexadecimal.
Square Root
32-bit integer square root on 32-bit processors.
64-bit integer square root on 64-bit processors.
Greatest Common Divisor
32-bit GCD of two unsigned 32-bit numbers on 32-bit processors.
64-bit GCD of two unsigned 64-bit numbers on 64-bit processors.
Least Common Multiple
32-bit LCM of two unsigned 32-bit numbers on 32-bit processors.
64-bit LCM of two unsigned 64-bit numbers on 64-bit processors.
Bit Operations
Count leading zeroes in 32-bit number on 32-bit processors.
Count leading zeroes in 64-bit number on 64-bit processors.
Count trailing zeroes in 32-bit number on 32-bit processors.
Count trailing zeroes in 64-bit number on 64-bit processors.
Here is your RVINT "Integer Math" Giraffe specification:
..
(oo) <-- Head initialized at origin (0,H)
/| |\
/ | | \
/ | | \ NECK HEIGHT CALCULATION:
/ | | \ Total Height (H) = 5400 mm
/ | | \ The neck is approx 1/3 total height.
| | | | Using routine 'div3u' (H / 3):
| | | | Neck Height = 1800 mm
| |__| |
| / \ |
| / \ | SPOT AREA CALCULATION:
| /| [][][]| \ Average spot is a square, side (S) = 45 mm.
|/ | [][][]| \Using routine 'mul32' (S * S):
/ | [][][]| \Spot Area = 2025 sq mm
| |
| |
| | LEG VERIFICATION:
| | | | We need to ensure it has the correct number of legs.
| | | | We pass the input 16 to the routine 'isqrt'.
| | | | Result = 4 legs. Verified.
| | | |
| | | |
_U _U _U _U <-- Feet grounded at Z=0.

