Why use rational numbers?
In a previous article I've written about pseudo-random floating-point numbers generation. Another solution I haven't spoken of consist of using rational numbers. These numbers are defined as a fraction \(\frac{a}{b}\) where \(a\) and \(b\) are integers. From my previous article you should remember that there was efficient ways to generate pseudo-random integers, but difficulties appear when trying to generate floating-point number due to the way they are represented in memory. Rational numbers allow to represent floating-point numbers while using only integers, hence it makes straightforward the generation of uniformly distributed pseudo-random floating-point numbers from the generation of pseudo-random integers. Unfortunately, C does not provide an implementation of rationals, hence their omission in my previous article. I had an incomplete implementation in my previous framework, and this gave me the motivation to come back to it.
Rational numbers are a subset of the real numbers, all real numbers can't be represented using only the rational numbers, but anyway we are limited by the underlying encoding we are dealing with. If we choose to represent the rationals with three integers \(b\), \(n\) and \(d\): \(b+\frac{n}{d}\), \(b\in \mathbb{N}\), \(n\in \mathbb{N}^+\), \(d\in \mathbb{N}^{*+}\), \(n\le d\), and if we choose to encode them using int64_t and uint64_t, we can represent more than \(2^{128}\) values (almost \(2^{192}\) in fact), way more than the \(2^{64}\) values available with a real number encoded using double, and even more than the \(2^{80}\) values available with a real number encoded using long double. Also, we could represent the rational with only two integers, using a fixed value for the denominator, for example the maximum value representable by the encoding type. It simplifies the implementation, at the cost of the number of representable values (for example if \(d\) is fixed to 10, 1/3 can't be represented exactly anymore) and consequently the accuracy, thus I'll use here the three integers approach.
The range of values that can be represented is the same as the one of the type we choose for \(b\) plus one on the positive side to account for the addition of \(\frac{n}{d}\). If it's int64_t it becomes: \([-2^{63}, 2^{63}]\), or [-9223372036854775808, 9223372036854775808]. This is far smaller than the range available when using, for example, double: from -1.7E+308 to 1.7E+308, but rational numbers provide an uniform distribution of values, which is a big advantage in cases like random numbers generation for example. Each value is exactly \(\frac{1}{D}\) away from its two immediate neighbours, where \(D\) is the largest value encoded using uint64_t: \(2^{64}-1\) or 18446744073709551615. In comparison, the largest value not greater or equal than 9223372036854775808.0 that can be encoded using a double is 9223372036854774784.0, a gap of 1024.0! So, double allows for an extremely large range of values, but its precision gets so bad for large values that its usefulness becomes questionable. Rational numbers instead are restricted to a smaller range but guaranty a high precision all along.
All these numbers are extremely large (or small) and difficult to apprehend. To put them in perspective lets compare them to physical values. Our galaxy is approximately 1,000,000,000,000,000,000 kilometer in diameter, that would fit inside the [-9,223,372,036,854,775,808, 9,223,372,036,854,775,808] range. The radius of a hydrogen atom is approximately 5E−14 kilometer, that's larger than 1/18,446,744,073,709,551,615. Put together, choosing kilometer as the unit, the rational numbers with the encoding introduced above is precise and large enough to represent any point in our galaxy with an error less than the size of a hydrogen atom. Wow!
At that point some may wonder why we bother with double anyway. First, rational numbers take more memory: a double takes 8 bytes, same as an int64_t but it takes 3 of them for a rational. Sure you could use smaller integer types than int64_t, but that would decrease dramatically the range and precision. Nowadays hardware have so huge quantity of memory it may seem irrelevant, but if you're dealing with very large amount of data, or tight data transfer speed, it may become a problem. Second, the double precision gets as bad for large values as it gets good for small values. The double value next to 0.0 is 2.3E-308, way smaller than 1/18,446,744,073,709,551,615. However, rationals' precision is already so huge that the need for more in real application is probably rare. Third, and probably the main reason, our computing hardware implement double, not rationals, so if you want to use the latter you have to do it at software level, meaning a huge cost in speed performance. I've looked for hardware implementing rational but couldn't find any. (If you know one, please let me know!)
In defense of floating-point numbers some may also argue that their lack of precision becomes sensible only for large values, whose need in real application is also questionable. This is however incorrect, floating-point precision problems do occur in real-life application. For example, it led to the failure of a Patriot missile during the Gulf War, or problems in videogame as shown in this video by SimonDev about procedural generation of planets. Looking for alternative to the floating-point representation of real numbers is indeed necessary. Rational numbers have also their own limitations and flaws and don't guarantee to avoid the same kind of problems encountered with floating-point encoding. However, it provides an alternative which could prove advantageous in specific cases. In the end, there is no definite good or bad way to do, as always. It only depends on what you are trying to do, need to be decided case by case, so the more tools you have in your toolbox the better you'll be prepared.
Then, what about my own use cases? Will I ever make a simulation of our galaxy at atomic level? Certainly not. A procedurally generated planet? Much more probable even if I have no plan in that direction for the near future. Haven't this already be done by others? Of course it does. Regardless, it just looks too much fun to do it myself!
The implementation.
Lets first define the specs. I want:
There is really nothing difficult from the math point of view, however there are many complications that arise during implementation due to overflows and types conversion. Overflow during integers artihmetic is undefined behaviour, thus the only solution is to detect it prior it's happening and make the calculation in a way that avoid it if possible. Using signed integers everywhere would simplify type conversion problems but the numerator and denominator being always positive would leave half of the representable values unused, I don't want that. Reciprocally, using unsigned integers everywhere means creating rationals only able to represent positive values, which is not acceptable. Using a flag for signedness instead raises probably as many problems as it solves, I don't want that neither.
When mixing different types in the same C statement, implicit conversion occurs as detailed in this link. I'll need only to consider int64_t and uint64_t, for which the relevant rules become:
I'll explicitly cast when necessary anyway as I don't expect everyone, starting with me, to know by heart the full spec. Relying on implicit rules is the way to confusion.
Data structure
The data structure is as simple as:
For the representation of NaN, to manage infinities and overflows, a rational with denominator equal to 0 seems to make sense, then I choose: CapyRatio const capyRatioNaN = {.base = 0, .num = 0, .den = 0};. It also leaves the possibility to use the base and num values to identify several types of NaN.
Rational number reduction.
The reduction of a rational number has two steps: correct the base to bring the numerator less than the denominator, then divide the numerator and denominator by their greatest common divisor. The base may overflow during the correction step, this must be taken care of by checking first by how much it can be increased. To find the greatest common divisor, I'll use the Stein's algorithm (not detailed here).
Conversion to double.
The conversion from rational to double is as simple as:
Conversion from double.
The conversion from double to rational can be obtained using fractions from the Farey sequence and binary search. First we calculate the integral and fractional values of the double using modf. Then we search for the approximation of the fractional part. We start with the two fractions \(a=\frac{0}{1}\) and \(b=\frac{1}{1}\), then we create the median fraction whose numerator is the sum of the numerators of \(a\) and \(b\), and the denominator is the sum of the denominators. If the median is lower than the decimals it replaces \(a\), if it's greater it replaces \(b\). And so on until \(a\) and \(b\) have converged (meaning their converted values to double are equal) or the denominators can't be added anymore without overflowing. A problem with this method is that it can take a very long time to converge. For example, consider the largest double less than 1.0:
The bounds and median of the binary search evolve as follow:
and goes on forever (well, not forever but way too long for practical use). One solution consists of limiting the denominator to an arbitrary maximum value. The larger the maximum value the better precision you get in the conversion but the slower it may take to converge. This solution is unsatisfying because it implies that the result of the conversion won't be as precise as it could be, and choosing that maximum value is just replacing the first problem with another one. A better solution is to jump several steps at once: instead of calculating the median as \(\frac{a_n+b_n}{a_d+b_d}\) we calculate it by solving \(\frac{a_n+kb_n}{a_d+kb_d}=x\) for \(k\in\mathbb{N}\), which gives \(k=\lfloor\frac{xa_d-a_n}{b_n-xb_d}\rfloor\), where \(a_n,a_d,b_n,b_d\) are the numerator and denominator of the bounding fractions, and \(x\) is the decimal part of the converted double. The calculation of \(k\) is performed after each component is converted to double, this means we have to accept the imprecision from double here as there is not much we can do about it. \(b_n-xb_d\) can't be equal to 0: it would mean \(x=\frac{b_n}{b_d}\), i.e. the algorithm has converged and already stopped. The only thing to take care of is when \(k\) is equal to 0. In that case we default to the calculation of the median as \(\frac{a_n+b_n}{a_d+b_d}\), equivalent to forcing \(k\) to 1. Jumping over steps results in performing the conversion in the previous example in one single step: 0 + 0 / 1 <= 0 + 9007199254740991 / 9007199254740992 <= 0 + 1 / 1, and is better illustrated when converting M_PI (the number in parenthesis is the size of the jump in number of steps):
Finally the function to convert from double to rational is as follow:
Edited on 2022/07/19 to correct a bug when (dec < 0.0).
Addition.
Given two rationals \(x\) and \(y\), their sum is equal to \(x_b+y_b+\frac{x_ny_d+y_nx_d}{x_dy_d}\). Nothing can be done for the overflow on \(x_b+y_b\), so I'll just raise an exception in that case, but we need to perform the addition of the fractions first (we'll see why later).
Let's first consider the case where \(x_dy_d\) does not overflow. Given that \(x\) and \(y\) are in reduced form we know that \(x_n\le x_d\) and \(y_n\le y_d\). Added to \(x_dy_d\le UINT64\_MAX\), we are sure that \(x_ny_d\) and \(y_nx_d\) do not overflow neither. Then, we only have to check for the eventual overflow of the sum \(x_ny_d+y_nx_d\). If it does overflow, it means the base must be increased by one, and the numerator can be brought back in range by substracting \(x_dy_d\) from it. If we call \(a\) and \(b\) the two components of the sum \(x_ny_d+y_nx_d\), the safe way to do it is as follow: \(a-(x_dy_d-b)\). From \(x_n\le x_d\) and \(y_n\le y_d\) we know that \(x_dy_d\ge a\) and \(x_dy_d\ge b\), so \(x_dy_d-b\) will stay in range, and because \(x_dy_d\le UINT64\_MAX\lt a+b\) we are sure that \(x_dy_d-b\lt a\), hence \(a-(x_dy_d-b)\) will stay in range.
Now, the case where \(x_dy_d\) does overflow. We can't calculate \(\frac{x_ny_d+y_nx_d}{x_dy_d}\) but instead we can calculate \(\frac{x_n+y_nx_d/y_d}{x_d}\) or \(\frac{x_ny_d/x_d+y_n}{y_d}\). What follows is independent of which one is chosen, so I'll abstract the choice (explained later) by using the notation: \(\frac{a+bc/d}{b}\). The calculation of \(bc/d\) is guaranted here to not overflow: by definition \(b\le UINT64\_MAX\), then \(1\le \frac{UINT64\_MAX}{b}\), then \(c\le \frac{c.UINT64\_MAX}{b}\), and as we also have by definition \(c\le d\) we get \(c\le \frac{d.UINT64\_MAX}{b}\), hence \(\frac{bc}{d}\le UINT64\_MAX\). But, how to calculate it correctly? If we do the multiplication first we may overflow, and if we do the division first we loose accuracy. It takes a short detour, very very short, just a few thousand years ago in Egypt!
The ancient Egytptian multiplication, also known as the Russian peasant multiplication, is a method to multiply two integers, only by dividing and multiplying by two and performing additions. The multiplication of \(a\) by \(b\) looks like this:
Why does this help? It does exactly the same thing as \(a*b\), and doesn't care of the division we are interested in. Well, the point is that with some modification during the calculation, it is possible to account for the division, which actually come in handy to avoid the overflow (as much as possible). The peasant multiplication progressively update the result by increment of updated \(a\). Instead of dividing before the multiplication, too early, or after, too late, we divide during the multiplication, each time \(a\) becomes larger than the divisor, keeping \(a\) as small as possible and minimizing the chance of overflow.
Let's see on some simple examples how it works. To simplify I'll consider the maximum value above which overflow occurs is 10. The multiplication of \(2*3/4=1.5\) is completely safe. It goes as follow:
Other example, \(6*2/4=3\). The multiplication \(6*2=12\) would overflow, and neither 6 or 2 are divisible by 4, hence dividing first would not give the correct result. Thanks to the modified peasant multiplication we can calculate the exact result without overflow from any of the temporary variables.
The code is as follow:
Back to \(\frac{a+bc/d}{b}\), which we now have rewritten as \(\frac{a+(b'+n'/d')}{b}=\frac{a+b'}{b}+\frac{n'}{bd'}\). I'll call \(\frac{n'}{bd'}\) the residual. Let's ignore it one second: the whole addition has become equal to \(x_b+y_b+\frac{a+b'}{b}\). The eventual overflow of \(a+b'\) can be canceled thanks to the division by \(b\) by incrementing the base of the result. That's why we need to calculate the sum of the numerators before the one of the bases: it may also be corrected here.
This leaves only the residual. If \(bd'\) can be calculated, it's just a new ratio that we'll need to add to the result by calling recursively the addition function. We are sure that this recursion ends because at each step the residual gets smaller, given that we always choose \(\frac{a+bc/d}{b}\) such as \(d\le b\). The proof goes as follow: \(d\le b \Rightarrow\frac{bc}{d}\ge c\ge 1\Rightarrow b'\ge 1\Rightarrow \frac{n'}{bd'}\lt \frac{c}{d}\). Final problem, here we know that \(bd\) overflows. Thanks to the reduction at the end of the Peasant multiplication/division \(d'\) may be smaller than \(d\) and avoid the overflow. If not there is also the possibility that \(\frac{n'}{b}\) is reducable, if so this also may save us from the overflow. And if all of that fails, then it means that \(\frac{n'}{bd'}\) can't be represented exactly within the accuracy of 1/UINT64_MAX. We have to approximate it, which can be safely calculated as follow with an error less than 1/UINT64_MAX: \(\frac{n'}{bd'}\simeq \frac{n'/\lfloor b/\lfloor UINT64\_MAX/d'\rfloor\rfloor}{UINT64\_MAX}\)
The code becomes:
Negation.
The negation of a rational is easy: \(-\left(b+\frac{n}{d}\right)=-b-\frac{n}{d}=(-b-1)+\frac{d-n}{d}\). The only two things we have to take care of is the eventual overflow of \(-b-1\), and that INT64_MAX=-INT64_MIN-1. The code is as follow:
Substraction.
For the substraction, I simply reuse the two previous functions: \(x-y=x+(-y)\). The code is as follow:
Comparison.
To compare two rationals we can first compare their base, if they are different the result is trivial. If they are equals we need to compare the numerators, which can be done by multiplying them in such a way they become fractions with same numerator or denumerator, which in turn can easily be compared. Unfortunately, with this multiplication method comes again complication due to overflow. Comparing the difference of the two rationals to 0 instead, as we have already solved any complication in the substraction function (or rather the underlying addition and negation). One more catch here: the difference of the rationals can overflow too, it is the difference of the fractions only that must be checked. The code becomes:
Absolute value.
The absolute value is trivial: if the rational is positive return it, else return its negative.
Multiplication.
The multiplication of \(x\) by \(y\) is equal to \(x_by_b+\frac{x_by_n}{y_d}+\frac{y_bx_n}{x_d}+\frac{x_ny_n}{x_dy_d}\). Calculating the 3 fractions separately instead of grouping them into one simplifies things, so that's what I'll do: calculate each component (\(x_by_b\), \(\frac{x_by_n}{y_d}\), \(\frac{y_bx_n}{x_d}\) and \(\frac{x_ny_n}{x_dy_d}\)) of the multiplication separately to get four rationals (plus a residual and three eventual corrective coefficients, see below), then add them all.
Same as for the addition of ratios, there are lot of overflow chances here, but it gets even worst due to the mix of signed and unsigned values. First, detecting the overflow on the multiplication of the bases \(x_by_b\) is a bit tricky due to the sign, the fractional part and the fact that \(INT64\_MIN\ne -INT64\_MAX\). If \(x_b=0\) or \(y_b=0\), the result is trivial. If \(x_b\gt 0\) and \(y_b\gt 0\), everything's working fine: there is overflow iff \(x_b\gt \lfloor\frac{INT64\_MAX}{y_b}\rfloor\). Other cases are processed as follow.
If \(x_b\gt 0\) and \(y_b\lt 0\), without taking account of the fractional part the multiplication of the bases only may indicates an incorrect overflow. Consider for example \((6+\frac{0}{1})*(-2+\frac{1}{2})\) with an overflow at -10. \(6*-2=-12\) does overflow, but the correct result with the fractional part is \(6*-1.5=-9\), which doesn't overflow. So what I'll do is rewrite \(x_by_b\) as \(x_b(y_b+1)-x_b\). \(x_b\gt 0\) guarantees \(-x_b\) exists, \(y_b\lt 0\) guarantees \(y_b+1\) exists. Then I check for overflow based on \(y_b+1\lt\lfloor\frac{INT64\_MIN}{x_b}\rfloor\). If the inequality is true, the multiplication is guaranteed to overflow, if not we don't know yet but we can move on to calculate it and the eventual overflow will be detected when adding the extra term \(-x_b\) (more on that later). The same goes for \(x_b\lt 0\) and \(y_b\gt 0\): I rewrite it as \(y_b(x_b+1)-y_b\) and use \(x_b+1\lt\lfloor\frac{INT64\_MIN}{y_b}\rfloor\).
If \(x_b\lt 0\) and \(y_b\lt 0\), the previous trick doesn't work any more: there is no term we can safely multiply by -1. One more step will save the day, if \(a\lt 0\), \(-a\) is not guaranteed to exist but \(-(a+1)\) is ! So we can use \(x_by_b=(x_b+1)(y_b+1)-(x_b+1)-(y_b+1)+1\) and check for \(-(x_b+1)\gt\lfloor\frac{INT64\_MAX}{-(y_b+1)}\rfloor\) (no overflow if \(y_b+1=0\)).
The second and third components can be calculated in the same manner, so I'll commonalise the code. We already know how to calculate them: it's the modified Peasant multiplication used in CapyRatioAdd. However they may be negative, while the modified Peasant multiplication handles only positive values. Easy, I'll ignore the sign, calculate, and if needed correct the result as follow: \(-(a+\frac{b}{c})=-(a+1)+\frac{c-b}{c}\).
The fourth component is resolved exactly as \(\frac{x_ny_d+y_nx_d}{x_dy_d}\) in the addition, with one half of the numerator equal to zero, which simplify things a bit.
As you can see we have already solved almost all the problems. There is only one new problem to tackle: how to add/substract the components without intermediate overflow. Given that adding a pair of positive and negative values cannot overflow, we can add the component by pairs of opposite sign values until there is only one value left or all the remaining values have same signedness (in which case we add them up).
The code becomes:
Inverse.
The inverse of \(x\) is equal to \(\frac{x_d}{x_bx_d+x_n}\). Lets first get rid of the sign problem, if \(x_b\lt 0\) I'll calculate the inverse of \(-x\) and then return the negative of the result. Now the only problem is the possible overflow in the calcualtion of \(x_bx_d+x_n\), i.e. \(x_bx_d+x_n\gt UINT64\_MAX\), equivalent to \(\frac{x_bx_d+x_n}{UINT64\_MAX}\gt 1\). This can be easily calculated thanks to the Peasant modified multiplication plus an addition. If there is no overflow, the result is calculated directly, else we need to approximate the results, which can be done with a dichotomic search. There is a trick though in the update of the bounds: we need a condition to compare the median to the searched value. This search value being the inverse of the input, we can multiply the median with the input, if it's greater than 1 the median is too large, else it's too small. The convergence is reached when multiplying the median by the input gives exactly 1, or when it isn't possible anymore to calculate the median without overflowing.
And that's it for the inverse. The code becomes:
Edited on 2023/12/12 to refactor the test (pmd.base + pmd.frac.base == 0) into (pmd.base == 0 && pmd.frac.base == 0).
Division.
For the division I simply reuse the two previous functions: \(\frac{a}{b}=a*(b)^{-1}\).
Unit tests.
Given the complexity of the algorithms introduced here, the probability for a bug is not low ! To avoid it as much as possible I've checked my code carefully with the following unit tests. I've considered the set of 28 test values made of the bases {INT64_MIN, INT64_MIN+1, -1, 0, 1, INT64_MAX-1, INT64_MAX} combined with the fractional parts {0/1, 1/UINT64_MAX, 1/2, (UINT64_MAX-1)/UINT64_MAX}. Then I've checked the addition and multiplication of each pair, as well as the negation and the inversion of each one, against truth tables (calculated manually and using Wolfram Alpha). For the negation I checked that \(x+(-x)=0\), and for the inversion \((x^{-1})^{-1}=x)\) and \(x*(x^{-1})=1\). Reduction and conversion from/to double were also tested on a different, smaller, set of values. Unit tests' result were as expected, within an inaccuracy lower than 8/UINT64_MAX for the inverse of the inverse, 6/UINT64_MAX for \(x*(x^{-1})=1\), and 2/UINT64_MAX for the other tests.
And that's all folks for this time ! There is a lot more about rationals to have fun with, and I'll probably come back to it one day. I'll also probably add a small example of pratical use in the near future, in a separate article which I'll link here.
Edit 2022/02/01: Check how CapyRatio can be used to solve the n-body problem in this article.