# Converting Integers to Floats Using Hyperfocus

A few years ago, due to some random chain of events, I ended up implementing a conversion from 128-bit integers to 64-bit floats. This would’ve turned out to be a complete waste of time, except that my final version is faster than the builtin conversion of every compiler I tested. In this blog post, I’ll explain what happened, how floats work, how this conversion works, and how it got a bit out of hand.

## Prologue

It was a beautiful day in August, a few summers ago, when a friend of mine was reviewing a pull request containing something like the following snippet:

```
f64::from_str(&some_u128.to_string()).unwrap()
```

This was used to convert a `u128` to a `f64`.
It did so by converting it to a string first, and parsing that as a float next.
Since this method isn’t great in various ways,
she went to her *secret underground lair Discord Server* to ask the
~~leet hackers~~ bored programmers there if anyone had a better way of doing this.

In an attempt to appear helpful but actually just trying to impress some of the people there,
I implemented a sort-of mostly correct version of this conversion by calculating the raw bit
representation of the `f64` and using `f64::from_bits`.

After rewriting it several times—a process which involved
fixing bugs for certain rarely used values like zero,
wrongly accusing `u128::to_string` of having bugs,
implementing rounding almost correctly,
wrongly accusing `f64::from_str` of having bugs,
implementing rounding slightly more correctly,
and trying to write comments but giving up because it’s undecipherable anyway—I finally
ended up with about 7 lines of unreadable bit twiddling code
that seemed to do the trick,
while implementing the correct rounding mode *almost* perfectly.

Everyone was slightly disgusted and slightly impressed.

Mission accomplished.

..

Eventually we came up with something even shorter:

```
x as f64
```

..

Uh.

Yeah.

So.

Somehow nobody involved even considered the possibility that `_ as f64` might just work.
Between `u128` being experimental until somewhat recently and no mainstream processor providing this conversion natively,
we had all simply assumed incorrectly that this was not a thing.

But it was.

And I felt tricked.

By Rustc.

While I was sitting there, struggling for hours *banging virtual rocks together* to light a spark to
make a campfire, Rustc was just sitting there with a lighter in its pocket.
“Hey you never asked,” Rustc laughed, “I figured you were just having fun.”

So, out of *pure spite* and a strong sense of needing to prove that *yes I am having fun damnit!*,
I spent the next several days and nights angrily writing many variants of my code, determined to take revenge on Rustc, somehow.

In this blog post, I’ll try to trace back my steps to implement this integer-to-floating-point conversion until we reach the same seven lines of unreadable code as I did in 2020.

But first, we’ll need to understand how floating point values are represented as bits.

## The bits of an `f64`

The 64 bits of a `f64` encode a floating point value following the IEEE 754 binary64 standard,
also known as a “double precision floating point value” or `double`.
Pretty much all processors that support floating point calculations support this format.

This format splits up the 64 bits into three parts: one sign bit, an 11 bit *exponent*, and a 52 bit *mantissa*:

```
f64 bits = 0 10001110110 1100100101000100001000100010001000011111010001010101
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent └─ mantissa
```
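
To make the three fields concrete, here’s a small sketch (mine, not from the original post) that pulls them out of an arbitrary `f64` using `f64::to_bits`:

```rust
/// Split an f64's raw bits into its (sign, exponent, mantissa) fields.
fn f64_fields(x: f64) -> (u64, u64, u64) {
    let bits = x.to_bits();
    (bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1))
}

fn main() {
    // 1.0 is 1.0 in binary: sign 0, exponent 1023, all mantissa bits zero.
    assert_eq!(f64_fields(1.0), (0, 1023, 0));
    // -6.25 is -110.01 in binary: sign 1, first 1 bit in the fours place (1023 + 2),
    // and the bits after that first 1 (1001) left-aligned in the 52-bit mantissa.
    assert_eq!(f64_fields(-6.25), (1, 1025, 0b1001 << 48));
}
```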

The sign bit indicates whether the number is positive (0) or negative (1).
The next 11 bits represent a number, the *exponent*, which gives the position of the first `1` bit of the binary floating point number:
1023 if the first `1` bit is immediately left of the decimal point (i.e. in the ones place),
1024 if it is in the second digit left of the decimal point (i.e. in the twos place),
and so on, and in the other direction too: 1022 if it is immediately right of the decimal point (in the halves place), etc.
The remaining 52 bits, called the *mantissa*, are the bits that come right after that first `1` bit of the number.

So, using the letter `m` to represent the 52 bits of the mantissa,
this is what the decoded binary floating point number looks like
for different values of the exponent:

```
:
1026 1mmm.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1025 1mm.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1024 1m.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1023 1.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1022 0.1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1021 0.01mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1020 0.001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
:
```

A concrete example: the value 9.75 in binary looks like 1001.11,
so it is represented as a `f64` with an exponent of 1026, and a (binary) mantissa of 001110000⋯0.

```
9.75 in decimal = 1001.11 in binary
│└─┬──┘
│ └─ mantissa: the bits after the first 1 bit
└─ position of the first 1 bit: 3
f64 bits = 0 10000000010 0011100000000000000000000000000000000000000000000000
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent (1023+3) └─ mantissa
```
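
Going the other way—assembling the bits by hand—is a one-liner; this little check (mine, not the post’s) confirms the encoding above:

```rust
fn main() {
    // 9.75 = 1001.11 in binary: the first 1 bit is in position 3, so the
    // exponent field is 1023 + 3, and the bits after it (00111) go at the
    // top of the 52-bit mantissa field.
    let exponent: u64 = 1023 + 3;
    let mantissa: u64 = 0b00111 << 47;
    assert_eq!(f64::from_bits((exponent << 52) | mantissa), 9.75);
}
```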

If we extend the table above all the way to the top, we get to extremely high numbers. The maximum exponent value of 2047 is used to represent the special ‘not a number’ and infinity values.

```
2047 Infinity or NaN
2046 1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm00000⋯00000.0
2045 1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm0000⋯00000.0
2044 1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm000⋯00000.0
2043 1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm00⋯00000.0
:                             960 zero bits omitted ─┘
```

Similarly, if we extend the table all the way to the bottom, we get to the extremely small numbers.
The minimum exponent value of zero is also special, and used for the extremely small *subnormal* numbers.
Going from an exponent of one to zero does not move the mantissa another bit to the right,
but instead turns the first `1` bit into a `0` bit.
This allows us to represent the number zero: an exponent of zero and a mantissa of zero.
In other words, `f64::from_bits(0)` equals `0.0`.

```
:
4 0.00000⋯0001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
3 0.00000⋯00001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
2 0.00000⋯000001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1 0.00000⋯0000001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
0 0.00000⋯0000000mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm (subnormals)
└─ 1010 zero bits omitted
```
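
Two quick consequences of this encoding, easy to check from Rust (my examples, not the post’s):

```rust
fn main() {
    // All-zero bits encode the value zero.
    assert_eq!(f64::from_bits(0), 0.0);
    // Exponent 0 with mantissa 1 is the smallest positive subnormal:
    // halving it is a tie between it and zero, and ties round to the even
    // mantissa, which is zero. So no smaller positive f64 exists.
    let tiniest = f64::from_bits(1);
    assert!(tiniest > 0.0);
    assert_eq!(tiniest / 2.0, 0.0);
}
```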

## Converting integers

Back to the original problem: converting a 128-bit unsigned integer (`u128`) to an `f64`:

```
pub fn u128_to_f64(x: u128) -> f64 {
    let bits: u64 = ???; // TODO
    f64::from_bits(bits)
}
```

The input being an integer means we never have to deal with subnormals, except for zero:

```
if x == 0 {
    return 0.0;
}
```

Edge-case handled. Now we can continue with the fun part: every other integer.

As the input is unsigned, we already know one bit: the sign bit is always zero.

One bit down, sixty-three left!

```
let sign = 0;
let exponent = ??; // TODO
let mantissa = ??; // TODO
let bits: u64 = (sign << 63) + (exponent << 52) + mantissa;
```

Next up: the exponent. For this we need to know the position of the highest `1` bit, and add 1023 because that’s how floats work.
Easy, just use `u128::leading_zeros` to count the zeros before the first `1` bit,
and subtract that from 127 to get the position of the `1` bit:

```
┌─ n leading zeros
│ ┌─ first 1 bit
┌─┴───┐ │
input = 000⋯000 1 110010010100010000100010001000100001111101000101010010101001011⋯000101011
│ │ │
bit 127 │ bit 0
bit (127 - n)
```

So,

```
let n = x.leading_zeros();
let exponent = (127 - n) + 1023;
```

Boom, 11 more bits done. 52 left: the mantissa.

For the mantissa, we need the 52 bits that followed the highest `1` bit.
If we left-shift away all the leading zeros and then also left-shift away the highest `1` bit,
what remains are the bits right after that first `1` bit.
Then all we need to do is right-shift the (128 − 52 =) 76 unwanted bits away:

```
┌─ n leading zeros
│ ┌─ first 1 bit ┌─ next 52 bits ┌─ remaining bits
┌─┴───┐ │ ┌──────────────────────┴───────────────────────────┐ ┌─┴─────────────────┐
input = 000⋯000 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011
│ │ │
┌───────┘ │ │
│ ┌───────┘ ┌───────┘
│ │ │
<< n = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │
┌─┘ ┌─┘
│ │
<< 1 = 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯0000
│ │
└───────────────────────────────┐ └───────────────────────────────┐
│ │
>> 76 = 000000000000000⋯000000000000000 1100100101000100001000100010001000011111010001010100
│ │
f64 bits = 0 10001110110 1100100101000100001000100010001000011111010001010100
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent └─ mantissa
```

So, just three shift operations:

```
let mantissa = x << n << 1 >> 76;
```

So.. That’s it? We’re done?

```
pub fn u128_to_f64(x: u128) -> f64 {
    if x == 0 {
        return 0.0;
    }
    let sign = 0;
    let n = x.leading_zeros();
    let exponent = (127 - n) + 1023;
    let mantissa = x << n << 1 >> 76;
    let bits = (sign << 63) + ((exponent as u64) << 52) + mantissa as u64;
    f64::from_bits(bits)
}

#[test]
fn test() {
    assert_eq!(u128_to_f64(1234), 1234.0);
}
```

```
$ cargo test
   Compiling stuff v0.1.0
    Finished test [unoptimized + debuginfo] target(s) in 0.43s
     Running unittests src/lib.rs

running 1 test
test test ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
```

🎉

Yes!

It works for one number! Therefore, our code must work correctly for all numbers. 😌

Okay, let’s try one more:

```
assert_eq!(u128_to_f64(123_456_789_123_456_789), 123_456_789_123_456_789.0);
```

```
$ cargo test
…
test result: ok
```

Works great! It works for two numbers, which means we can now vaguely gesture at mathematical induction, and happily conclude it works for all numbers. 😌

```
assert_eq!(u128_to_f64(123_456_789_123_456_789_123), 123_456_789_123_456_789_123.0);
```

```
$ cargo test
…
test test ... FAILED

failures:

---- test stdout ----
thread 'test' panicked at 'assertion failed: `(left == right)`
  left: `1.2345678912345678e20`,
 right: `1.234567891234568e20`', src/lib.rs:17:5
```

Oh.

Maybe not.

## Rounding

As it turns out, we shouldn’t just ignore the bits that don’t fit in the `f64`.
We cannot store all 128 bits in it, since we only have space for 53 bits:
the first `1` bit encoded by the exponent, and the next 52 bits stored in the mantissa.
Depending on the location of the first `1` bit,
there are up to (128 − 53 =) 75 bits that get lost.

The IEEE 754 standard tells us that we shouldn’t just ignore those bits,
effectively truncating the number after its 53rd significant bit,
but instead round the number to the closest number a `f64` can represent.

```
The number from our test:
123_456_789_123_456_789_123 = 00⋯001101011000101001110100111111001101100001101111100110110101010000011
└─────────────┬─────────────────────────────────────┘
└─ 53 significant bits
The (truncated) number our function returned:
123_456_789_123_456_778_240 = 00⋯001101011000101001110100111111001101100001101111100110100000000000000
└──────────┬─┘
remaining bits are gone ─┘
The right answer, by rounding to the nearest representable number:
123_456_789_123_456_794_624 = 00⋯001101011000101001110100111111001101100001101111100111000000000000000
├┘
the number was rounded up ─┘
```

The specification also tells us to break ties by rounding to an even mantissa: the one that ends in a zero bit.

So, in decimal: 9.25 rounds to 9, 9.75 rounds to 10, and 9.5 also rounds to 10, but 8.5 rounds to 8.

Or, in binary: 1001.01 rounds to 1001, 1001.11 rounds to 1010, and 1001.10 also rounds to 1010, but 1000.10 rounds to 1000.
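
This tie-to-even rule is not specific to our conversion; as a side demonstration (mine, not from the post), you can observe it in Rust’s builtin integer-to-float casts just above 2⁵³, where an `f64` can only represent every second integer:

```rust
fn main() {
    let p53 = 1u64 << 53; // 2^53: the last range where every integer still fits exactly
    // 2^53 + 1 is exactly halfway between 2^53 and 2^53 + 2;
    // 2^53 has the even (all-zero) mantissa, so the tie rounds down.
    assert_eq!((p53 + 1) as f64, p53 as f64);
    // 2^53 + 3 is halfway between 2^53 + 2 and 2^53 + 4;
    // this time the even mantissa belongs to the larger one, so the tie rounds up.
    assert_eq!((p53 + 3) as f64, (p53 + 4) as f64);
}
```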

Applying this to properly rounding the mantissa in our conversion, this means that we need to check whether the bits we’re throwing away are less than, equal to, or greater than 100⋯00 in binary. If we left-shift the original number to shift out the 53 significant bits, we are left with exactly the bits that we dropped before. That value will tell us whether to round up or down:

```
┌─ first 1 bit ┌─ next 52 bits ┌─ 75 dropped bits
│ ┌──────────────────────┴───────────────────────────┐ ┌─┴─────────────────────────┐
x << n = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │
┌──────────────────────────────────────────────────────┘ ┌────────┘
│ │
<< 53 = 101010010110000001111010100100101000100111101010110100000100010101100000000 000⋯000
> 100000000000000000000000000000000000000000000000000000000000000000000000000 000⋯000
│
└─ bigger, so we round up
```

This gives us the following code, if we also include the round-to-even tie-breaking part:

```
let mut mantissa = x << n << 1 >> 76;
let dropped_bits = x << n << 53;
if dropped_bits < 1 << 127 {
    // round down
} else if dropped_bits > 1 << 127 {
    // round up
    mantissa += 1;
} else {
    // tie! round to even
    if mantissa & 1 == 1 {
        mantissa += 1;
    }
}
```

Okay great, but now we have another problem: the mantissa might no longer fit in the 52 bits it needs to fit in, if it is rounded up beyond its maximum. If that happens, it means the entire resulting number has become one digit longer, similar to how rounding 999 up to 1000 increases it from three to four digits. So, in that case we’ll need to increase the exponent and clear out the remaining bits:

```
if mantissa == 1 << 52 {
    exponent += 1;
    mantissa = 0;
}
```
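
This carry can be seen in action (another side demonstration, not from the post) in the builtin `u64` conversion: `u64::MAX` is 64 one bits, the dropped bits round the all-ones mantissa up, and the overflow bumps the exponent so the result is exactly 2⁶⁴:

```rust
fn main() {
    // 2^64 - 1 rounds up, overflowing a mantissa of all ones,
    // which carries into the exponent: the result is exactly 2^64.
    assert_eq!(u64::MAX as f64, 18446744073709551616.0); // 2^64
    assert_eq!(u64::MAX as f64, 2.0f64.powi(64));
}
```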

Okay.

Does it work now?

```
pub fn u128_to_f64(x: u128) -> f64 {
    if x == 0 {
        return 0.0;
    }
    let sign = 0;
    let n = x.leading_zeros();
    let mut exponent = (127 - n) + 1023;
    let mut mantissa = x << n << 1 >> 76;
    let dropped_bits = x << n << 53;
    if dropped_bits < 1 << 127 {
        // round down
    } else if dropped_bits > 1 << 127 {
        // round up
        mantissa += 1;
    } else {
        // tie! round to even
        if mantissa & 1 == 1 {
            mantissa += 1;
        }
    }
    if mantissa == 1 << 52 {
        exponent += 1;
        mantissa = 0;
    }
    let bits = (sign << 63) + ((exponent as u64) << 52) + mantissa as u64;
    f64::from_bits(bits)
}

#[test]
fn test() {
    assert_eq!(u128_to_f64(1234), 1234.0);
    assert_eq!(u128_to_f64(123_456_789_123_456_789), 123_456_789_123_456_789.0);
    assert_eq!(u128_to_f64(123_456_789_123_456_788), 123_456_789_123_456_788.0);
    assert_eq!(u128_to_f64(123_456_789_123_456_789_123), 123_456_789_123_456_789_123.0);
}
```

```
$ cargo test
…
test test ... ok
```

🎉

Yes!

Testing the other 340282366920938463463374607431768211452 integers is left as an exercise to the reader.

## Cleaning up

Okay this is great and all, but the code could be a lot shorter if we clean things up a bit:

- Remove the useless `sign` variable.
- Inline the `bits` variable into the final expression.
- Extract `x << n` into its own variable, `y`, to avoid repetition.
- Simplify the if-else-if-else-if structure for rounding. It either increments the mantissa by one, or not. So this can be written as a single condition.

Poof, much better:

```
pub fn u128_to_f64(x: u128) -> f64 {
    if x == 0 {
        return 0.0;
    }
    let n = x.leading_zeros();
    let y = x << n;
    let mut exponent = (127 - n) + 1023;
    let mut mantissa = y << 1 >> 76;
    let dropped_bits = y << 53;
    if dropped_bits > 1 << 127 || (dropped_bits == 1 << 127 && mantissa & 1 == 1) {
        mantissa += 1;
    }
    if mantissa == 1 << 52 {
        exponent += 1;
        mantissa = 0;
    }
    f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}
```
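
As a quick sanity check (mine, not part of the post), we can compare this version against the builtin conversion on a few rounding-sensitive inputs. Note the mantissa-overflow threshold: an incremented 52-bit mantissa can reach at most `1 << 52`, which is the value the overflow check needs to catch.

```rust
pub fn u128_to_f64(x: u128) -> f64 {
    if x == 0 {
        return 0.0;
    }
    let n = x.leading_zeros();
    let y = x << n;
    let mut exponent = (127 - n) + 1023;
    let mut mantissa = y << 1 >> 76;
    let dropped_bits = y << 53;
    if dropped_bits > 1 << 127 || (dropped_bits == 1 << 127 && mantissa & 1 == 1) {
        mantissa += 1;
    }
    if mantissa == 1 << 52 {
        exponent += 1;
        mantissa = 0;
    }
    f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}

fn main() {
    // A tie (2^53 + 1), values whose all-ones mantissa overflows on round-up,
    // and the value from the failing test earlier.
    for x in [
        (1u128 << 53) + 1,
        u128::from(u64::MAX),
        u128::MAX,
        123_456_789_123_456_789_123,
    ] {
        assert_eq!(u128_to_f64(x), x as f64);
    }
}
```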

## Tricks

Now let’s do some cool tricks to ~~obfuscate~~ optimize things even further.

First, let’s look at the final `if`, which handles an overflowing mantissa when rounding up.
This `if` is important, because otherwise the 53rd bit
of the overflowed mantissa would end up getting added to the 53rd bit of the final bit representation of the float,
which is not part of the 52-bit mantissa, but instead is where the least significant bit of the exponent goes.

```
┌─ 53rd bit
│
max. mantissa = 000000000000 1111111111111111111111111111111111111111111111111111
+ 1 = 000000000001 0000000000000000000000000000000000000000000000000000
│ │ │
f64 bits = 0 10001110111 0000000000000000000000000000000000000000000000000000
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent! └─ mantissa (only 52 bits!)
```

So, to prevent the mantissa from overflowing outside of its bounds,
this `if` handles this case by removing that bit from the mantissa,
and adding a 1 to the exponent instead.
Makes sense.

Uh.

Wait.

Back up.

To prevent the 53rd bit from getting added to the 53rd bit.. we remove it and instead add it to the exponent.. which.. is stored at the 53rd bit.

Well that sounds useless.

Okay cool.

So, we remove that entire `if`, and get the exact same results as before:

```
pub fn u128_to_f64(x: u128) -> f64 {
    if x == 0 {
        return 0.0;
    }
    let n = x.leading_zeros();
    let y = x << n;
    let exponent = (127 - n) + 1023;
    let mut mantissa = y << 1 >> 76;
    let dropped_bits = y << 53;
    if dropped_bits > 1 << 127 || (dropped_bits == 1 << 127 && mantissa & 1 == 1) {
        mantissa += 1;
    }
    f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}
```

## More tricks

Now let’s look at the shift operations.
The `y << 1 >> 76` operation looks almost like it could just be `y >> 75`.
Except it can’t, because that `<< 1` is important to get rid of the most significant 1 bit.
If we didn’t remove that bit, it would end up in the 53rd bit of the mantissa,
which will end up in the exponent field, resulting in an off-by-one exponent:

```
┌─ first 1 bit ┌─ next 52 bits
│ ┌──────────────────────┴───────────────────────────┐
x = 000⋯000 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011
│ │ │
┌───────┘ │ │
│ ┌───────┘ ┌───────┘
│ │ │
<< n = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │ │
│ └─────────────────────────────┐ └─────────────────────────────┐
└─────────────────────────────┐ │ │
│ │ │
>> 75 = 00000000000000⋯00000000000000 1 1100100101000100001000100010001000011111010001010100
│ │ │
f64 bits = 0 10001110111 1100100101000100001000100010001000011111010001010100
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent └─ mantissa
```

But! We can deal with that!
We can introduce a counteracting off-by-one error to correct the exponent back
to what it should be by changing the `1023` constant to `1022`. :)

```
pub fn u128_to_f64(x: u128) -> f64 {
    if x == 0 {
        return 0.0;
    }
    let n = x.leading_zeros();
    let y = x << n;
    let exponent = (127 - n) + 1022;
    let mut mantissa = y >> 75;
    let dropped_bits = y << 53;
    if dropped_bits > 1 << 127 || (dropped_bits == 1 << 127 && mantissa & 1 == 1) {
        mantissa += 1;
    }
    f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}
```

## Avoiding 128-bit operations

Next up: the rounding `if`. I don’t like it.
It does several operations on 128-bit integers, which don’t fit in normal CPU registers.
I haven’t looked at the disassembly yet, but it smells expensive.
Unfortunately, we can’t fit it into a 64-bit integer,
as there are up to 75 relevant bits:

```
┌─ first 1 bit ┌─ next 52 bits ┌─ 75 dropped bits
│ ┌──────────────────────┴───────────────────────────┐ ┌─┴─────────────────────────┐
y = x << n = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │
┌──────────────────────────────────────────────────────┘ ┌────────┘
dropped_bits │ │
= y << 53 = 101010010110000001111010100100101000100111101010110100000100010101100000000 0000⋯000
└───┬─────────────────────────────────────────────────────────────────────┘ └────┬─┘
└─ does not fit in a 64 bit integer always zero ─┘
```

However, we *only* need to know whether `dropped_bits` is lower, equal, or higher than `1 << 127`.
If the highest bit is not set, we know it’s lower, and we don’t need to look at the other bits at all.
If the highest bit is set and *all* other bits are zero, it is equal.
If the highest bit is set and any of the other bits is set, it is higher.
So, we don’t care about the individual bits beyond the highest one.
We only need to know whether any of them was set.

```
0xxxxxxxxxxxxxxxx⋯xxxxxxxxxxxxxxx: round down
10000000000000000⋯000000000000000: round to even
1xxxxxxxxxxxxxxxx⋯xxxxxxxxxxxxxxx: round up
(x: don't care)
```

This means that we can binary-or (`|`) some of the bits together without losing the information we need.
We can use that to squash it all into 64 bits, so it fits in a single register.
Let’s split up the 75 relevant bits into a chunk of 64 bits and a chunk of 11 bits, and
binary-or them together:

```
┌─ the 75 dropped bits that we care about
┌─────────────────────────┴───────────────────────────────────────────────┐
dropped_bits = 101010010110000001111010100100101000100111101010110100000100010101100000000 000⋯000
│ ┌─────┼┘ ┌──────┘
└───┐ │ └───┼┐
│ 01100000000│ last 11 dropped bits
1010100101100000011110101001001010001001111010101101000001000101┘ first 64 dropped bits
---------------------------------------------------------------- binary or
squashed = 1010100101100000011110101001001010001001111010101101001101000101
> 1000000000000000000000000000000000000000000000000000000000000000
│
└─ bigger, so we round up
```

In code, that looks like this:

```
let dropped_bits: u128 = y << 53;
let dropped_bits_squashed: u64 = (dropped_bits >> 64) as u64 | (dropped_bits >> 53 & 0b11111111111) as u64;
```

`dropped_bits_squashed` now has the same most significant bit as `dropped_bits`,
and the other 127 bits were split into two chunks and binary-or’ed together, to preserve the information we need:
whether any of them was set.

If we inline the last usage of `dropped_bits`, we get `y << 53 >> 53`, which we can reduce to just `y` since we’re
cutting everything beyond the 11th bit off anyway:

```
let dropped_bits: u128 = y << 53;
let dropped_bits_squashed: u64 = (dropped_bits >> 64) as u64 | (y as u64 & 0b11111111111);
```

Now note that it would work fine if we had not picked 11, but 12, or 13, or any other number up to 63 bits from `y` instead.
It would mean that some of the bits from `y` would end up both in the left and right chunk, but that’s fine.
We only care about whether any of them was set.

Processors usually like working with powers of two and might even have an instruction for taking the lower 32 bits of a 64 bit number, so let’s take 32 bits instead of 11:

```
let dropped_bits: u128 = y << 53;
let dropped_bits_squashed: u64 = (dropped_bits >> 64) as u64 | (y as u64 & 0xFFFFFFFF);
```

Or, visually:

```
┌─ 75 dropped bits
┌─┴─────────────────────────┐
y = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │
┌──────────────────────────────────────────────────────┘ ┌────────┘
│ │
dropped_bits = 101010010110000001111010100100101000100111101010110100000100010101100000000 0000⋯000
: ┌──────┘ : ┌──────┘
└───┐ │ └───┼┐
│ 01010110100000100010101100000000│ last 32 dropped bits
1010100101100000011110101001001010001001111010101101000001000101┘ first 64 dropped bits
---------------------------------------------------------------- binary or
1010100101100000011110101001001011011111111010101111101101000101
> 1000000000000000000000000000000000000000000000000000000000000000
│
└─ bigger, so we round up
```

To make things ~~even more unreadable~~ shorter,
we can inline the other usage of the `dropped_bits` variable and merge `y << 53 >> 64` into a single `y >> 11`,
since the high 64 bits are dropped afterwards by the cast to `u64`:

```
let dropped_bits = (y >> 11 | y & 0xFFFFFFFF) as u64;
```

Adjusting the condition of the `if` for our new 64-bit variable, we now have:

```
pub fn u128_to_f64(x: u128) -> f64 {
    if x == 0 {
        return 0.0;
    }
    let n = x.leading_zeros();
    let y = x << n;
    let exponent = (127 - n) + 1022;
    let mut mantissa = y >> 75;
    let dropped_bits = (y >> 11 | y & 0xFFFFFFFF) as u64;
    if dropped_bits > 1 << 63 || (dropped_bits == 1 << 63 && mantissa & 1 == 1) {
        mantissa += 1;
    }
    f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}
```

## Branch-free rounding

It’s great that we’ve gotten rid of some potentially expensive 128-bit operations,
but that `if` still looks bad with that large condition,
due to the tie-breaking round-to-even stuff.

Is there a way we can get rid of it entirely? Making rounding *branch free*?

The body of the `if` statement only increments the mantissa by one,
so all we need is an expression that is `0` when the condition is false,
and `1` when the condition is true, and unconditionally add that to the mantissa.

If we didn’t have to take rounding ties into account,
we could replace the entire `if` by:

```
mantissa += dropped_bits >> 63;
```

This works, because `dropped_bits >> 63` is `1` when `dropped_bits` is equal to or higher than `1 << 63`,
and `0` when it is lower.

```
dropped_bits = 1010100101100000011110101001001010001001111010101101000001000000
│
└──────────────────────────────────────────────────────────────┐
│
>> 63 = 0000000000000000000000000000000000000000000000000000000000000001
│
one, so we round up ┘
```

Unfortunately, we need to deal with rounding ties to even.

To see for which cases exactly we need to correct, let’s put all the different cases in a table:

```
│     dropped_bits      │  mantissa  │ increment │
│ high bit │ other bits │ lowest bit │  mantissa │
├──────────┼────────────┼────────────┼───────────┤
│    0     │     -      │     0      │    no     │
│    0     │     -      │     1      │    no     │
│    1     │     0      │     0      │    no     │  !
│    1     │     0      │     1      │    yes    │
│    1     │    >0      │     0      │    yes    │
│    1     │    >0      │     1      │    yes    │
```

The first column shows the result of `dropped_bits >> 63`,
and the last column shows whether we actually need to increment the mantissa.
There’s only one case where they don’t match: the third row,
which is the case where the mantissa is even while breaking a tie.

We can attempt to correct by subtracting one from `dropped_bits` to flip the outcome:

```
Tie:
dropped_bits = 1000000000000000000000000000000000000000000000000000000000000000
- 1 = 0111111111111111111111111111111111111111111111111111111111111111
│
└─ flipped to zero, so we round down
Not a tie:
dropped_bits = 1010100101100000011110101001001010001001111010101101000001000000
- 1 = 1010100101100000011110101001001010001001111010101101000000111111
│
└─ still one, so we round up
```

We need to apply this correction when the high bit of `dropped_bits` is set,
and the low bit of `mantissa` is not set,
which we can conveniently express as `(dropped_bits >> 63) & !mantissa`.

This means we can now replace the entire `if` statement by:

```
mantissa += (dropped_bits - ((dropped_bits >> 63) & !mantissa)) >> 63;
```
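
To convince ourselves, here’s a tiny standalone check (mine, not from the post) of that expression against all six rows of the table:

```rust
/// The branch-free rounding increment: 1 to round up, 0 to round down.
fn increment(dropped_bits: u64, mantissa: u64) -> u64 {
    (dropped_bits - ((dropped_bits >> 63) & !mantissa)) >> 63
}

fn main() {
    let half = 1u64 << 63;
    assert_eq!(increment(half - 1, 0), 0); // high bit 0: round down
    assert_eq!(increment(half - 1, 1), 0); // high bit 0: round down
    assert_eq!(increment(half, 0), 0);     // tie, mantissa even: round down
    assert_eq!(increment(half, 1), 1);     // tie, mantissa odd: round up to even
    assert_eq!(increment(half + 1, 0), 1); // above the tie: round up
    assert_eq!(increment(half + 1, 1), 1); // above the tie: round up
}
```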

Putting it all together, we now only have one `if` left:

```
pub fn u128_to_f64(x: u128) -> f64 {
    if x == 0 {
        return 0.0;
    }
    let n = x.leading_zeros();
    let y = x << n;
    let exponent = (127 - n) + 1022;
    let mut mantissa = (y >> 75) as u64;
    let dropped_bits = (y >> 11 | y & 0xFFFFFFFF) as u64;
    mantissa += (dropped_bits - ((dropped_bits >> 63) & !mantissa)) >> 63;
    f64::from_bits(((exponent as u64) << 52) + mantissa)
}
```

## Removing the last branch

Do we need that last remaining `if`? What happens if we remove it?

`x.leading_zeros()` would result in `n = 128` when `x` is zero, which is fine,
but then `x << n` on the next line could panic.
Let’s swap that for `x.wrapping_shl(n)`, such that it never panics,
but instead just results in zero when `x` is zero and `n` is 128.
The calculation of the mantissa then all just results in zero, which is exactly correct.
Unfortunately, the calculation of the exponent does not result in a zero, so we’ll have to keep an `if` there.
Maybe if we squeeze it on one line at the end, no one will notice.
(And maybe the compiler will be able to turn it into a branch-free conditional move,
now that the `if` just selects between two integers.)

```
pub fn u128_to_f64(x: u128) -> f64 {
    let n = x.leading_zeros();
    let y = x.wrapping_shl(n);
    let mut mantissa = (y >> 75) as u64;
    let dropped_bits = (y >> 11 | y & 0xFFFFFFFF) as u64;
    mantissa += (dropped_bits - (dropped_bits >> 63 & !mantissa)) >> 63;
    let exponent = if x == 0 { 0 } else { 127 - n + 1022 };
    f64::from_bits(((exponent as u64) << 52) + mantissa)
}
```

Now all we need to do to make this function ~~even less readable~~ prettier,
is to fold some of the remaining constants, remove unnecessary parentheses,
and to only use single letter variables to make every line start with `let _ =`:

```
pub fn u128_to_f64(x: u128) -> f64 {
    let n = x.leading_zeros();
    let y = x.wrapping_shl(n);
    let a = (y >> 75) as u64;
    let b = (y >> 11 | y & 0xFFFFFFFF) as u64;
    let m = a + ((b - (b >> 63 & !a)) >> 63);
    let e = if x == 0 { 0 } else { 1149 - n as u64 };
    f64::from_bits((e << 52) + m)
}
```
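
One last sanity check (mine): despite its looks, this final version still agrees with the builtin conversion, including for zero, ties, and `u128::MAX`:

```rust
pub fn u128_to_f64(x: u128) -> f64 {
    let n = x.leading_zeros();
    let y = x.wrapping_shl(n);
    let a = (y >> 75) as u64;
    let b = (y >> 11 | y & 0xFFFFFFFF) as u64;
    let m = a + ((b - (b >> 63 & !a)) >> 63);
    let e = if x == 0 { 0 } else { 1149 - n as u64 };
    f64::from_bits((e << 52) + m)
}

fn main() {
    for x in [0u128, 1, 1234, (1 << 53) + 1, u64::MAX as u128, u128::MAX] {
        assert_eq!(u128_to_f64(x), x as f64);
    }
}
```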

Beautiful!

## Disassembly

If we compile this for x86-64, we’ll see that the resulting machine code is entirely branchless!

Good job, ~~Rustc~~ LLVM!

Look, no labels, no jumps:

```
u128_to_f64:
        bsr rax, rdi
        mov edx, 127
        cmovne rdx, rax
        xor rdx, 63
        add rdx, 64
        bsr rcx, rsi
        xor rcx, 63
        test rsi, rsi
        cmove rcx, rdx
        mov rax, rsi
        shld rax, rdi, cl
        mov rdx, rdi
        shl rdx, cl
        xor r8d, r8d
        test cl, 64
        cmovne rax, rdx
        cmovne rdx, r8
        mov r9, rax
        shr r9, 11
        shld rax, rdx, 53
        mov edx, edx
        or rdx, rax
        shl rcx, 52
        movabs r10, 0x47d0000000000000
        sub r10, rcx
        or rdi, rsi
        cmove r10, r8
        shr rax, 63
        add rdx, r9
        mov ecx, r9d
        not ecx
        and eax, ecx
        sub rdx, rax
        shr rdx, 63
        or rdx, r10
        movq xmm0, rdx
        ret
```

I don’t know about you, but if I was an x86 processor, I’d think this code is really pretty.

## Results

As it turns out, we could’ve saved a lot of effort by just doing:

```
pub fn u128_to_f64(x: u128) -> f64 {
x as f64
}
```

But that’s obviously no fun.

Unfortunately, we cannot measure fun. What we can measure, though, is speed.

Quickly throwing together a benchmark using `criterion` to test the builtin `as f64` conversion gives the following results on my computer:

An average of 7.7 nanoseconds per conversion. Not bad.

..

Okay. Moment of truth…

Let’s try the same benchmark on our own implementation…

..

Wow. 4.8 nanoseconds per iteration. That’s almost twice as fast!

😮

## Epilogue

Having beaten Rustc’s builtin `as f64` operator,
I had proven to Rustc that, no, I totally didn’t just waste my time by forgetting about its builtin conversions.
I could finally pretend like this was my goal all along.

The resulting dopamine rush left me craving for more, which resulted in me spending the next several days and nights
implementing conversions from every single integer type to the floating point types, and back.
I worked on benchmarks and tests, and started tracking the differences in the
generated machine code for several different architectures, just to try to find
the version of the Rust code that compiled down to the best and most performant code on all platforms.
In the end I also made some architecture-specific versions,
such as an implementation that makes use of native `f64` addition and subtraction instructions
to perform u128-to-f64 conversion another two times faster (2.3 nanoseconds),
by using some tricks that could fill their own blog post.

Eventually, tired but satisfied, I pushed it all to GitHub and slowly woke up from my hyperfocus fueled week-long fever dream, wondering when I had last slept or eaten. I ate, slept, and then never cared about integer-float conversion ever again.

Update (2022-05-31): My conversion functions have been merged into `rust/compiler-builtins` for all conversions between all integer and floating point types. The optimized versions for 128-bit integers have also been merged in the .NET runtime as part of the new `Int128` and `UInt128` support.