Part 1

4. A quantitative comparison of number systems

4.1. Definition of decimal precision

Accuracy is the inverse of error. If we have a pair of numbers x and y (non-zero and of the same sign), the distance between them in orders of magnitude is

$\mid \log_{10}( x / y )\mid$

decimal orders. This is the same measure that defines the dynamic range between the smallest and the largest representable positive numbers. An ideal distribution of ten numbers between 1 and 10 in a floating-point-like system would not be the evenly spaced integers from 1 to 10, but an exponentially spaced sequence:

$1, 10^{1/10}, 10^{2/10},..., 10^{9/10}, 10$

This is the decibel scale, long used by engineers to express ratios: 10 decibels is a tenfold ratio, and 30 dB means a ratio of

$10^3 = 1000$

A ratio of 1 dB is a factor of about 1.26. If you know a value to within 1 dB, you have 1 decimal digit of accuracy. If you know it to within 0.1 dB, you have 2 decimal digits of accuracy, and so on. The formula for decimal precision is

$\log_{10}(1/\mid \log_{10}(x/y)\mid)=-\log_{10}(\mid \log_{10}(x/y)\mid )$

where x and y are either the correct value and the computed value, when using a rounding system such as floats and posits, or the lower and upper bounds, when using a rigorous system such as intervals or valids.
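As a minimal sketch of the formula above, the decimal precision of a computed value against a reference value can be evaluated as follows. The function name `decimal_accuracy` is ours and is used only for illustration.

```python
import math

def decimal_accuracy(x, y):
    """Decimal precision -log10(|log10(x / y)|) for non-zero x, y of the same sign."""
    if x == 0 or y == 0 or (x > 0) != (y > 0):
        raise ValueError("x and y must be non-zero and of the same sign")
    distance = abs(math.log10(x / y))   # distance in decimal orders of magnitude
    if distance == 0:
        return math.inf                 # identical values: no error at all
    return -math.log10(distance)

# A 1 dB ratio (a factor of 10**0.1, about 1.26) corresponds to about 1 decimal digit.
one_db = decimal_accuracy(10 ** 0.1, 1.0)
# A 0.1 dB ratio corresponds to about 2 decimal digits.
tenth_db = decimal_accuracy(10 ** 0.01, 1.0)
```

Returning infinity for identical values is a design choice of this sketch: an exact result has no error, so its decimal precision is unbounded.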

4.2. Defining the float and posit comparison sets

We can build scale models of floats and posits that are only 8 bits long each. The advantage of this approach is that 256 values form a set small enough to test exhaustively, so we can compare all $256^2$ entries in the tables for addition, subtraction, multiplication, and division. A quarter-precision float has one sign bit, four exponent bits and three fraction bits, and follows all the rules of the IEEE 754 standard. The smallest positive number (a subnormal) is 1/512 and the largest positive number is 240, an asymmetric dynamic range of about 5.1 decimal orders. Fourteen bit patterns represent NaN.
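A minimal sketch that enumerates all 256 bit patterns of such a quarter-precision float. It assumes the usual IEEE-style exponent bias of 7; the bias is our assumption and is not stated in the text. Under that assumption the largest finite value is 240, fourteen patterns are NaN, and the smallest subnormal is $2^{-9}$, which is the value consistent with a dynamic range of about 5.1 decimal orders.

```python
import math

EXP_BITS, FRAC_BITS = 4, 3
BIAS = (1 << (EXP_BITS - 1)) - 1          # 7, the usual IEEE-style bias (assumed)

def decode_float8(bits):
    """Decode one 8-bit pattern: 1 sign bit, 4 exponent bits, 3 fraction bits."""
    sign = -1.0 if bits >> (EXP_BITS + FRAC_BITS) else 1.0
    exp = (bits >> FRAC_BITS) & ((1 << EXP_BITS) - 1)
    frac = bits & ((1 << FRAC_BITS) - 1)
    if exp == (1 << EXP_BITS) - 1:        # all-ones exponent: infinity or NaN
        return sign * math.inf if frac == 0 else math.nan
    if exp == 0:                          # subnormal numbers (and zero)
        return sign * frac * 2.0 ** (1 - BIAS - FRAC_BITS)
    return sign * (1 + frac / (1 << FRAC_BITS)) * 2.0 ** (exp - BIAS)

values = [decode_float8(b) for b in range(256)]
nan_count = sum(math.isnan(v) for v in values)
positive = [v for v in values if 0 < v < math.inf]
smallest, largest = min(positive), max(positive)
dynamic_range = math.log10(largest / smallest)   # in decimal orders
```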

The comparable 8-bit posit uses es = 1. Its positive numbers range from 1/4096 to 4096, a symmetric dynamic range of about 7.2 decimal orders. There are no NaN values. We can plot the decimal precision of the positive numbers in both sets, as shown in Fig. 7. Note that posits cover two more orders of magnitude of dynamic range than floats, with equal or greater accuracy for all values except those where floats are close to overflow or underflow. The jaggedness of both graphs is characteristic of systems that approximate a logarithmic representation with piecewise linear sequences. Float accuracy tapers off only on the left, in the region close to underflow; on the right the curve ends abruptly, because the remaining bit patterns are NaN values. Posit accuracy tapers off more symmetrically toward both ends.

Fig. 7. Comparison of the decimal precision of floats and posits
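The posit side of the comparison can be sketched in the same way. The decoder below follows the usual posit layout (sign bit, regime run, up to es exponent bits, then fraction bits); the function name `decode_posit` and the string-based bit handling are our own illustrative choices, not the authors' implementation.

```python
import math

def decode_posit(bits, n=8, es=1):
    """Decode an n-bit posit with es exponent bits into a Python float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return math.inf                      # the single "unsigned infinity" pattern
    sign = 1.0
    if bits >> (n - 1):                      # negative: decode the two's complement
        sign, bits = -1.0, (-bits) & mask
    body = format(bits, f"0{n}b")[1:]        # the n-1 bits after the sign bit
    run = len(body) - len(body.lstrip(body[0]))
    regime = run - 1 if body[0] == "1" else -run
    rest = body[run + 1:]                    # skip the bit that terminates the regime
    exp_field = rest[:es].ljust(es, "0")     # missing exponent bits count as 0
    frac_field = rest[es:]
    exponent = int(exp_field, 2) if es else 0
    fraction = int(frac_field, 2) / (1 << len(frac_field)) if frac_field else 0.0
    return sign * 2.0 ** (regime * (1 << es) + exponent) * (1 + fraction)

values = [decode_posit(b) for b in range(256)]
positive = [v for v in values if 0 < v < math.inf]
minpos, maxpos = min(positive), max(positive)
dynamic_range = math.log10(maxpos / minpos)  # in decimal orders
```

With n = 8 and es = 1 this reproduces the range from 1/4096 to 4096, about 7.2 decimal orders, and no bit pattern decodes to NaN.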

4.3. Comparison of single-argument operations

4.3.1. Reciprocal

For every possible input value x of the function 1/x, the result either lands exactly on another value in the set or is rounded, in which case we can measure the decimal error using the formula from section 4.1. For floats, the result can also overflow or be NaN. See Fig. 8.

Fig. 8. Quantitative comparison of floats and posits when computing the reciprocal

The curves on the right-hand graph show the size of the error when computing reciprocals; floats can also produce NaN. Posits outperform floats in the vast majority of cases, and this superiority holds throughout the range. The reciprocal of a subnormal float overflows, which gives an infinite error, and of course the reciprocal of NaN is NaN. Posits are closed under reciprocation.

4.3.2. Square root

The square root function neither overflows nor underflows. For negative arguments and for NaN the result is NaN. Remember that these are "scale models" of floats and posits; the advantage of posits grows as the precision increases. For 64-bit floats and posits, the posit error would be about 1/30 of the float error, instead of 1/2.

4.3.3. Square

Another common unary operation is $x^2$. Overflow and underflow are a common hazard when squaring floats. For almost half of the floats, squaring does not produce a meaningful result, whereas squaring a posit always yields another posit (the square of unsigned infinity is unsigned infinity).

Fig. 9. Quantitative comparison of floats and posits when computing $\sqrt x$

Fig. 10. Quantitative comparison of floats and posits when computing $x^2$

4.3.4. The base 2 logarithm

We also compared the coverage of the base 2 logarithm, that is, the percentage of cases in which $\log_2(x)$ can be represented exactly and, when it cannot, how many decimal digits are lost. Floats have only one advantage here: they can represent $\log_2(0)$ as $-\infty$ and $\log_2(\infty)$ as $\infty$, but this is more than compensated by the posits' richer vocabulary of integer powers of two.

Fig. 11. Quantitative comparison of floats and posits when computing $\log_2(x)$

The graph is similar to the one for the square root. About half of the cases yield NaN in both systems, but posits lose about half as much decimal precision. If you can compute $\log_2(x)$, you only need to multiply the result by a scaling factor to obtain $\ln(x)$, $\log_{10}(x)$, or the logarithm to any other base.
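The scaling-factor remark can be illustrated with a short sketch in ordinary double precision. The helper names are ours; the factors are simply $\ln 2$ and $\log_{10} 2$.

```python
import math

LN2 = math.log(2.0)          # ln(2): scale factor from log2 to ln
LOG10_2 = math.log10(2.0)    # log10(2): scale factor from log2 to log10

def ln_from_log2(x):
    """Natural logarithm obtained from a base 2 logarithm and one multiplication."""
    return math.log2(x) * LN2

def log10_from_log2(x):
    """Base 10 logarithm obtained from a base 2 logarithm and one multiplication."""
    return math.log2(x) * LOG10_2
```

The extra multiplication adds one more rounding, so a result obtained this way can differ from a directly computed logarithm in the last digit.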

4.3.5. Exponential

Similarly, if you can compute $2^x$, a scaling factor easily gives you $e^x$, $10^x$, and so on. Posits have one exceptional case: $2^x$ is NaN when the argument is $\pm\infty$.


Fig. 12. Quantitative comparison of floats and posits when computing $2^x$

The maximum decimal loss for posits may seem large, because $2^{maxpos}$ is rounded back to maxpos. In this example, only a few errors are as large as $\log_{10}(2^{4096}) \approx 1233$ decimal orders. Decide which is worse: losing more than a thousand decimal orders, or losing an infinite number of them? If you do not need such large numbers, posits still win, because the error for small values is much lower. In every case where posits lose a large number of decimal digits, the input argument is far beyond what floats can even express. The graphs show that posits are more robust in terms of the dynamic range over which the result is meaningful, and are more accurate throughout that range.

For the common unary operations

$1/x, \sqrt x, x^2, \log_2(x), 2^x$

posits are consistently more accurate than floats with the same number of bits, and produce meaningful results over a wider dynamic range. We now turn to the four basic arithmetic operations with two arguments: addition, subtraction, multiplication, and division.

4.4. Comparison of two-argument operations

We can use the scale-model number systems to study two-argument arithmetic operations such as addition, subtraction, multiplication, and division. To visualize the 65536 results, we draw 256×256 "closure plots" that show what proportion of the results are exact, inexact, overflow, underflow, or NaN.

4.4.1. Addition and subtraction


Since

$x - y = x + (-y)$

holds exactly for both floats and posits, there is no need to study subtraction separately. For addition, we compute the exact value

$z = x + y$

and compare it with the sum returned by each number system. The result may be inexact (it must then be rounded to the nearest finite non-zero number), it may overflow or underflow, or it may be an indeterminate form such as

$\infty - \infty$

which gives NaN. Each of these cases is marked with a color, so we can take in the whole addition table at a glance. For rounded results, the color changes gradually from black (exact) to violet (the maximum error for either posits or floats). Fig. 13 shows the closure plots for floats and posits. As with the unary operations, but with many more points, we can draw conclusions about the ability of each number system to give meaningful and accurate answers:

Fig. 13. Complete closure plots for addition of floats and posits

Fig. 14. Quantitative comparison of floats and posits for addition

It is obvious at a glance that posits have significantly more points on the addition plot where the result is exact. The broad black diagonal band in the float closure plot is much wider than it would be at higher precision, because it is the subnormal region, where floats are equally spaced like fixed-point numbers; such numbers make up a large proportion of all values only when we have just 8 bits.

4.4.2. Multiplication

We use a similar approach to compare how well floats and posits multiply. Unlike addition, multiplication can cause floats to underflow. The "gradual underflow" region (the subnormal numbers) provides some protection, as you can see in the center of the left plot in Fig. 15. Without this region, the blue underflow zone would be a full diamond. The multiplication plot for posits is much less colorful, which is a good thing. Only two pixels are marked as NaN, close to where the axes have their zero mark (the leftmost pixel at mid-height and the bottom pixel at mid-width). These are the products

$\pm\infty\cdot 0=NaN$

Floats have more cases in which the product is exact, but at a terrible price. As Fig. 15 shows, nearly 1/4 of all float products either overflow or underflow, and this proportion does not decrease at higher float precision.

Fig. 15. Complete closure plots for multiplication of floats and posits
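The float half of such a closure plot can be tallied with a short sketch. It assumes an 8-bit float with exponent bias 7 and uses a simplified classification of our own: any product beyond the largest finite value counts as overflow and any non-zero product below the smallest subnormal counts as underflow, ignoring round-to-nearest effects at the edges. The exact proportions in the figure may therefore differ slightly.

```python
import math
from collections import Counter

def decode_float8(bits):
    """1 sign, 4 exponent, 3 fraction bits; bias 7 is assumed (IEEE-style)."""
    sign = -1.0 if bits & 0x80 else 1.0
    exp, frac = (bits >> 3) & 0xF, bits & 0x7
    if exp == 0xF:
        return sign * math.inf if frac == 0 else math.nan
    if exp == 0:
        return sign * frac * 2.0 ** -9
    return sign * (1 + frac / 8) * 2.0 ** (exp - 7)

values = [decode_float8(b) for b in range(256)]
finite = {abs(v) for v in values if math.isfinite(v)}
max_float, min_float = max(finite), min(finite - {0.0})

def classify(x, y):
    """Classify the product of two 8-bit floats (simplified rules, see above)."""
    p = x * y   # exact in double precision: operands have at most 4 significant bits
    if math.isnan(p):
        return "nan"
    if math.isinf(p) or p == 0.0:
        return "exact"      # an infinite or zero operand gives an exact result
    if abs(p) > max_float:
        return "overflow"
    if abs(p) < min_float:
        return "underflow"
    return "exact" if abs(p) in finite else "inexact"

counts = Counter(classify(x, y) for x in values for y in values)
fractions = {kind: n / 65536 for kind, n in counts.items()}
```

Printing `fractions` gives the share of each category over all 65536 products; coloring a 256×256 image by `classify` would give a rough version of the left-hand plot.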

The worst-case rounding for posits occurs for

$maxpos \times maxpos$

which is again rounded back to maxpos. In these (very rare) cases the error is about 3.6 decimal orders. As Fig. 16 shows, posits are significantly better than floats at minimizing multiplication error.

Fig. 16. Quantitative comparison of floats and posits for multiplication

The closure plots for division are similar to those for multiplication, but with the regions permuted; to save space, they are not shown here. The quantitative results for division are almost the same as for multiplication.

4.5. Comparison of floats and posits in evaluating expressions

4.5.1. The 32-bit precision budget test

Benchmarks are usually built around minimum execution time, and often give little idea of how accurate the result is. In a different kind of benchmark we fix the precision budget, that is, the number of bits per variable, and try to get the maximum decimal precision in the result. Here is a sample expression that we can use to compare number systems with a budget of 32 bits per number:

$X=\left(\dfrac{27/10-e}{\pi-(\sqrt 2+\sqrt 3)}\right)^{67/16}=302.8827196\dotsb$

The rule is that we start with the best representations of the numbers $e$ and $\pi$ available in each number system, and of all the integers shown, and see how many decimal digits agree with the true value of X after performing the nine operations in the expression. Incorrect digits are highlighted in orange.

Although 32-bit IEEE floats have a decimal precision that varies between 7.3 and 7.6 decimal digits, the accumulation of rounding errors when computing X gives the answer 302.912⋯, with only three correct digits. This is one of the reasons users feel the need to use 64-bit floats everywhere: even simple expressions risk losing so much precision that the result may be useless.

32-bit posits have tapered decimal precision, which varies between 8.2 and 8.5 decimal digits for numbers with absolute value near 1. When computing X, they give the answer 302.88231⋯, which has twice as many correct digits. Keep in mind also that 32-bit posits have a dynamic range of more than 144 decimal orders, while 32-bit floats have a much smaller dynamic range of about 83 decimal orders. The extra accuracy of the result is therefore not achieved at the expense of dynamic range.
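The float side of this test can be approximated in plain Python. The sketch below emulates 32-bit arithmetic by rounding every constant and every intermediate result to binary32 with `struct`; this is our own emulation, and the last digits may differ from a native 32-bit evaluation, in particular for the final power. Posits are not covered, since Python has no built-in posit type.

```python
import math
import struct

def f32(v):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]

def evaluate_x(rnd):
    """Evaluate X, applying rnd() to each constant and after each of the nine operations."""
    e, pi = rnd(math.e), rnd(math.pi)
    numerator = rnd(rnd(27 / 10) - e)
    roots = rnd(rnd(math.sqrt(2)) + rnd(math.sqrt(3)))
    denominator = rnd(pi - roots)
    power = rnd(67 / 16)
    return rnd(rnd(numerator / denominator) ** power)

x32 = evaluate_x(f32)      # emulated 32-bit float result
x64 = evaluate_x(float)    # ordinary 64-bit float result, for reference
```

Comparing `x32` and `x64` with the true value 302.8827196⋯ shows how much of the 32-bit budget is lost to cancellation in the numerator and the denominator.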

4.5.2. A quadruple-precision test: Goldberg's thin triangle problem

Here is the classic "thin triangle" problem [1]: find the area of a triangle with sides a, b, c, when the two sides b and c are only 3 units in the last place (ULPs) longer than half the long side a (Fig. 17).

Fig. 17. Goldberg's thin triangle problem

The classic formula for the area A uses an intermediate variable s:

$s=\dfrac{a+b+c}{2};\quad A=\sqrt{s(s-a)(s-b)(s-c)}$

The danger in this formula is that s is very close to a, so computing

$(s-a)$

magnifies any rounding error dramatically. Let us try 128-bit (quadruple-precision) IEEE floats, for which

$a=7,b=c=7/2+3\times 2^{-111}$

(If the unit is a light-year, the short sides are longer than half the long side by only 1/200 of the diameter of a proton. Yet that is enough to raise the apex of the triangle by about the size of a doorway.) We also computed the value of A using 128-bit posits (es = 7). The results are below:

$$\begin{matrix} \text{True value:} & 3.14784204874900425235885265494550774498\dots \times 10^{-16} \\ \text{128-bit IEEE float:} & 3.\color{orange}{63481490842332134725920516158057682788}\dots \times 10^{-16}\\ \text{128-bit posit:} & 3.147842048749004252358852654945507744\color{orange}{39} \dots \times 10^{-16} \end{matrix}$$

Posits have up to 1.8 more decimal digits of accuracy than quadruple-precision floats over a vast dynamic range, from $2\times 10^{-270}$ to $5 \times 10^{269}$. This is ample to prevent the catastrophic error magnification in this particular case. It is also interesting to note that the posit answer would be more accurate than the float answer even if we converted it to a 16-bit posit at the end.
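Python has no quadruple-precision type, so the sketch below reproduces the effect with an analogous triangle in double precision: a = 7 and b = c = 7/2 plus 3 ULPs, where one ULP of 3.5 is $2^{-51}$. The analogue and the function names are our own assumptions; the reference area is computed with exact rational arithmetic and rounded only in the final square root.

```python
import math
from fractions import Fraction

def heron(a, b, c):
    """Textbook Heron formula, evaluated entirely in double precision."""
    s = (a + b + c) / 2
    return math.sqrt(s * (s - a) * (s - b) * (s - c))

def heron_exact(a, b, c):
    """Same formula with exact rational arithmetic up to the final square root."""
    a, b, c = Fraction(a), Fraction(b), Fraction(c)
    s = (a + b + c) / 2
    return math.sqrt(s * (s - a) * (s - b) * (s - c))

ulp = 2.0 ** -51            # spacing of doubles between 2 and 4
a = 7.0
b = c = 3.5 + 3 * ulp       # 3 ULPs longer than half the long side

naive = heron(a, b, c)
reference = heron_exact(a, b, c)
ratio = naive / reference   # how far the naive result is from the reference
```

In this analogue the rounded sum makes s - a come out as 4 ULP-units instead of 3, so the naive area is too large by roughly 15 percent, the same relative error as in the quadruple-precision float result above.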

4.5.3. Solving the quadratic equation

There is a classic trick for avoiding rounding errors when computing the roots $r_1, r_2$ of $ax^2+bx+c=0$ with the usual formula

$r_1,r_2=(-b\pm \sqrt {b^2-4ac})/(2a)$

when b is much larger than a and c, which leads to cancellation of the leading digits, since

$\sqrt {b^2-4ac}$

is very close to b. But instead of forcing programmers to memorize arcane tricks, it might be better if posits made the calculation safer when the plain textbook formula is used. Let us take a case of this kind and compare the results for 32-bit floats and 32-bit posits.

Table 5. The solution to the quadratic equation

The numerically unstable root is the one in which $-b$ and $\sqrt{b^2-4ac}$ nearly cancel, but note that the 32-bit posit still gives 6 correct digits instead of the 4 given by the float.
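For reference, the classic trick mentioned above can be sketched as follows, in ordinary double precision. The coefficients a = 1, b = 10^8, c = 1 are illustrative values chosen by us so that the cancellation is visible in 64-bit arithmetic; they are not the values used in the comparison in the text.

```python
import math

def roots_textbook(a, b, c):
    """Both roots from the textbook formula; one of them suffers cancellation."""
    d = math.sqrt(b * b - 4 * a * c)
    return (-b + d) / (2 * a), (-b - d) / (2 * a)

def roots_stable(a, b, c):
    """Classic rearrangement: compute the large root first, then use r1 * r2 = c / a."""
    d = math.sqrt(b * b - 4 * a * c)
    q = -(b + math.copysign(d, b)) / 2      # b and d have the same sign: no cancellation
    return c / q, q / a                     # (small root, large root) when b > 0

a, b, c = 1.0, 1e8, 1.0                     # illustrative coefficients (assumed)
naive_small, naive_large = roots_textbook(a, b, c)
stable_small, stable_large = roots_stable(a, b, c)
```

The small root is very close to -1e-8. The stable version recovers it to nearly full precision, while the textbook formula loses most of its digits for these coefficients.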

4.6. Comparison of floats and posits on the classic LINPACK benchmark

The basis for ranking supercomputers has long been the solution of an

$n\times n$

system of linear equations

$\mathbf Ax=b$

Specifically, the benchmark fills the matrix A with pseudorandom numbers between 0 and 1, and sets the vector b to the row sums of A. This means that the solution x should be a vector of all ones. The benchmark computes the norm of the residual

$\|\mathbf Ax-b\|$

to check correctness, although there is no hard rule for the number of digits that must be correct in the answer. The benchmark typically loses a few digits of accuracy, and it is usually run with 64-bit floats (not necessarily IEEE). Originally the benchmark used n=100, but this size became too small for the fastest supercomputers, so n was increased to 300, then to 1000, and finally (at the suggestion of the first author) the benchmark was made scalable; it reports the number of operations per second, on the assumption that the benchmark performs

$\frac {2}{3}n^3+2n^2$

multiplications and additions.
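As a small illustration of how this operation count turns a measured run time into a rate, here is a sketch; the function names and the idea of passing a wall-clock time in seconds are our own.

```python
def linpack_operations(n):
    """Nominal operation count 2/3 n^3 + 2 n^2 for an n-by-n LINPACK run."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2

def rate_per_second(n, seconds):
    """Operations per second reported for a run of size n that took `seconds`."""
    return linpack_operations(n) / seconds

ops_100 = linpack_operations(100)   # the original n = 100 problem
```

For n = 100 the nominal count is about 687 thousand operations, which shows why that size became too small for fast machines.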

When comparing posits and floats, we noticed a subtle flaw in the benchmark: in general the answer is not a vector of all ones, because of rounding errors in the row sums. This flaw can be eliminated by finding the entries of A that contribute a 1 bit to the sum beyond the representable precision, and setting that bit to 0. This guarantees that the row sums of A are representable without rounding, and that the answer x really is a vector of all ones. For the original 100×100 problem, 64-bit IEEE floats give an answer in which not one of the 100 entries is correct; they are close to 1 but never equal to 1. With posits, we can do a remarkable thing. Using 32-bit posits and the same algorithm, we compute the residual

$r = \mathbf Ax − b$

using the fused dot product. Then we solve

$\mathbf Ax'=r$

(using the already factored

$\mathbf A$

) and use $x'$ as a correction:

$x \leftarrow x-x'$

The result is an unprecedentedly accurate answer for the LINPACK benchmark:

$\{1, 1,...,1\}$

Should the LINPACK rules forbid the use of a new 32-bit number type that achieves a perfect result with zero error, or continue to insist on 64-bit floats, which cannot? That decision is up to those who maintain the benchmark. For those who need to solve systems of linear equations for real-world problems, rather than to compare supercomputer speeds, posits offer a staggering advantage.
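The refinement step described above can be sketched in plain Python. This is an illustration under several assumptions of ours: it uses ordinary 64-bit floats rather than 32-bit posits, exact rational arithmetic from `fractions` stands in for the fused dot product, and the 3×3 matrix is a small fixed example whose row sums are exactly representable.

```python
from fractions import Fraction

def lu_factor(a):
    """LU-factor a square matrix with partial pivoting; returns (lu, perm)."""
    n = len(a)
    lu = [row[:] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, perm

def lu_solve(lu, perm, b):
    """Solve A x = b using an existing factorization (forward, then back substitution)."""
    n = len(lu)
    y = [b[p] for p in perm]
    for i in range(n):
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y

def exact_residual(a, x, b):
    """r = A x - b, each dot product accumulated exactly and rounded only once."""
    return [float(sum(Fraction(aij) * Fraction(xj) for aij, xj in zip(row, x)) - Fraction(bi))
            for row, bi in zip(a, b)]

def solve_with_refinement(a, b):
    lu, perm = lu_factor(a)
    x = lu_solve(lu, perm, b)
    r = exact_residual(a, x, b)
    correction = lu_solve(lu, perm, r)      # solve A x' = r with the factored A
    return [xi - ci for xi, ci in zip(x, correction)]

A = [[0.5, 0.125, 0.25], [0.125, 0.625, 0.125], [0.25, 0.125, 0.75]]
b = [sum(row) for row in A]                 # row sums, exact for these entries
x = solve_with_refinement(A, b)
```

The key design point is that the factorization is computed once and reused for the correction, so the refinement costs only one extra residual and one extra pair of triangular solves.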

5. Conclusion

Posits beat floats at their own game: performing calculations with rounding while keeping rounding errors small. Posits have higher accuracy, a wider dynamic range, and better closure. They can be used to obtain better results than floats of the same width or (which may be an even greater competitive advantage) the same results with fewer bits. Because systems are limited by bandwidth, using smaller operands means higher speed and lower power consumption.
Because they work like floats and not like an interval system, they can be regarded as a drop-in replacement for floats, as demonstrated here. If an algorithm that uses floats has stood the test of time as stable and "good enough", it will work even better with posits. The fused operations available for posits provide a powerful means of preventing the accumulation of rounding errors, and in some cases allow 32-bit posits to be used safely instead of 64-bit floats in high-performance applications. In general this will speed up an application by a factor of 2-4, and will reduce power consumption and the cost of data storage. Hardware support for posits will give us the equivalent of one or two steps of Moore's law without any need to shrink transistors or increase cost. Unlike floats, posits give bitwise-reproducible results across different systems, sparing us the main shortcoming of the IEEE 754 standard. Posits are simpler and more elegant than floats, and reduce the amount of hardware needed to support them. Although floats are ubiquitous today, posits may soon make them obsolete.


1. David Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys (CSUR), 23(1):5-48, 1991. doi:10.1145/103162.103163.
2. John L Gustafson. The End of Error: Unum Computing, volume 24. CRC Press, 2015.
3. John L Gustafson. Beyond Floating Point: Next Generation Computer Arithmetic. Stanford Seminar:, 2016. full transcription
available at
4. John L Gustafson. A radical approach to computation with real numbers. Supercomputing
Frontiers and Innovations, 3(2):38-53, 2016. doi:
5. John L Gustafson. The Great Debate @ ARITH23.
, 2016. full transcription available at
6. W Ulrich Kulisch and Willard L Miranker. A new approach to scientific computation, volume 7. Elsevier, 2014.
7. More Sites. IEEE standard for floating-point arithmetic. IEEE Computer Society, 2008.
8. Isaac Yonemoto. Source