Converting Integers to Floats Using Hyperfocus
A few years ago, due to some random chain of events, I ended up implementing a conversion from 128-bit integers to 64-bit floats. This would’ve turned out to be a complete waste of time, except that my final version is faster than the builtin conversion of every compiler I tested. In this blog post, I’ll explain what happened, how floats work, how this conversion works, and how it got a bit out of hand.
Prologue
It was a beautiful day in August, a few summers ago, when a friend of mine was reviewing a pull request containing something like the following snippet:
f64::from_str(&some_u128.to_string()).unwrap()
This was used to convert a u128
to a f64
.
It did so by converting it to a string first, and parsing that as a float next.
Since this method isn’t great in various ways,
she went to her secret underground lair Discord Server to ask the leet hackers bored programmers there if anyone had a better way of doing this.
In an attempt to appear helpful but actually just trying to impress some of the people there,
I implemented a sort-of mostly correct version of this conversion by calculating the raw bit representation of the f64
and using
f64::from_bits
.
After rewriting it several times—a process which involved
fixing bugs for certain rarely used values like zero,
wrongly accusing u128::to_string
of having bugs,
implementing rounding almost correctly,
wrongly accusing f64::from_str
of having bugs,
implementing rounding slightly more correctly,
and trying to write comments but giving up because it’s undecipherable anyway—I finally
ended up with about 7 lines of unreadable bit twiddling code
that seemed to do the trick,
while implementing the correct rounding mode almost perfectly.
Everyone was slightly disgusted and slightly impressed.
Mission accomplished.
..
Eventually we came up with something even shorter:
x as f64
..
Uh.
Yeah.
So.
Somehow nobody involved even considered the possibility that _ as f64
might just work.
Between u128
being experimental until somewhat recently and no mainstream processor providing this conversion natively,
we had all simply assumed incorrectly that this was not a thing.
But it was.
And I felt tricked.
By Rustc.
While I was sitting there, struggling for hours banging virtual rocks together to light a spark to make a campfire, Rustc was just sitting there with a lighter in its pocket. “Hey you never asked,” Rustc laughed, “I figured you were just having fun.”
So, out of pure spite and a strong sense of needing to prove that yes I am having fun damnit!, I spent the next several days and nights angrily writing many variants of my code, determined to take revenge on Rustc, somehow.
In this blog post, I’ll try to trace back my steps to implement this integer-to-floating-point conversion until we reach the same seven lines of unreadable code as I did in 2020.
But first, we’ll need to understand how floating point values are represented as bits.
The bits of an f64
The 64 bits of a f64
encode a floating point value following the IEEE 754 binary64 standard,
also known as a “double precision floating point value” or double
.
Pretty much all processors that support floating point calculations support this format.
This format splits up the 64 bits into three parts: one sign bit, an 11-bit exponent, and a 52-bit mantissa:
f64 bits = 0 10001110110 1100100101000100001000100010001000011111010001010101
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent └─ mantissa
The sign bit indicates whether the number is positive (0) or negative (1).
The next 11 bits represent a number, the exponent, which gives the position of the first 1
bit of the binary floating point number.
1023 if the first 1
bit is immediately left of the decimal point (i.e. in the ones place),
1024 if it is in the second digit left of the decimal point (i.e. in the twos place),
and so on, and in the other direction too: 1022 if it is immediately right of the decimal point (in the halves place), etc.
The remaining 52 bits, called the mantissa, are the bits that come right after that first 1
bit of the number.
So, using the letter m
to represent the 52 bits of the mantissa,
this is what the decoded binary floating point number looks like
for different values of the exponent:
⋮
1026 1mmm.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1025 1mm.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1024 1m.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1023 1.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1022 0.1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1021 0.01mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1020 0.001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
⋮
A concrete example: the value 9.75 in binary looks like 1001.11,
so it is represented as a f64
with an exponent of 1026, and a (binary) mantissa of 001110000⋯0.
9.75 in decimal = 1001.11 in binary
│└─┬──┘
│ └─ mantissa: the bits after the first 1 bit
└─ position of the first 1 bit: 3
f64 bits = 0 10000000010 0011100000000000000000000000000000000000000000000000
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent (1023+3) └─ mantissa
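We can verify this decomposition in a few lines of Rust (just a sanity check, pulling the three fields out of the bits with the standard f64::to_bits):

```rust
fn main() {
    let bits = (9.75_f64).to_bits();
    let sign = bits >> 63;
    let exponent = (bits >> 52) & 0x7FF;
    let mantissa = bits & ((1u64 << 52) - 1);
    assert_eq!(sign, 0);
    assert_eq!(exponent, 1026); // 1023 + 3
    assert_eq!(mantissa, 0b00111 << 47); // 001110000⋯0, left-aligned in 52 bits
    println!("exponent = {exponent}, mantissa = {mantissa:052b}");
}
```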
If we extend the table above all the way to the top, we get to extremely high numbers. The maximum exponent value of 2047 is used to represent the special ’not a number’ and infinity values.
2047 Infinity or NaN
2046 1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm00000⋯00000.0
2045 1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm0000⋯00000.0
2044 1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm000⋯00000.0
2043 1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm00⋯00000.0
⋮ 960 zero bits omitted ─┘
Similarly, if we extend the table all the way to the bottom, we get to the extremely small numbers.
The minimum exponent value of zero is also special, and used for the extremely small subnormal numbers.
Going from an exponent of one to zero does not move the mantissa another bit to the right,
but instead turns the first 1
bit into a 0
bit.
This allows us to represent the number zero: an exponent of zero and a mantissa of zero.
In other words, f64::from_bits(0)
equals 0.0
.
⋮
4 0.00000⋯0001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
3 0.00000⋯00001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
2 0.00000⋯000001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1 0.00000⋯0000001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
0 0.00000⋯0000000mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm (subnormals)
└─ 1010 zero bits omitted
Converting integers
Back to the original problem: converting a 128-bit unsigned integer (u128
) to an f64
:
pub fn u128_to_f64(x: u128) -> f64 {
let bits: u64 = ???; // TODO
f64::from_bits(bits)
}
The input being an integer means we never have to deal with subnormals, except for zero:
if x == 0 {
return 0.0;
}
Edge-case handled. Now we can continue with the fun part: every other integer.
As the input is unsigned, we already know one bit: the sign bit is always zero.
One bit down, sixty-three left!
let sign = 0;
let exponent = ??; // TODO
let mantissa = ??; // TODO
let bits: u64 = (sign << 63) + (exponent << 52) + mantissa;
Next up: the exponent. For this we need to know the position of the highest 1
bit, and add 1023 because that’s how floats work.
Easy, just use u128::leading_zeros
to count the zeros before the first 1
bit, and subtract that from 127 to get the position of the 1
bit:
┌─ n leading zeros
│ ┌─ first 1 bit
┌─┴───┐ │
input = 000⋯000 1 110010010100010000100010001000100001111101000101010010101001011⋯000101011
│ │ │
bit 127 │ bit 0
bit (127 - n)
So,
let n = x.leading_zeros();
let exponent = (127 - n) + 1023;
Boom, 11 more bits done. 52 left: the mantissa.
For the mantissa, we need the 52 bits that followed the highest 1
bit.
If we left-shift away all the leading zeros and then also left-shift away the highest 1
bit,
what remains are the bits right after that first 1
bit.
Then all we need to do is right-shift the (128 − 52 =) 76 unwanted bits away:
┌─ n leading zeros
│ ┌─ first 1 bit ┌─ next 52 bits ┌─ remaining bits
┌─┴───┐ │ ┌──────────────────────┴───────────────────────────┐ ┌─┴─────────────────┐
input = 000⋯000 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011
│ │ │
┌───────┘ │ │
│ ┌───────┘ ┌───────┘
│ │ │
<< n = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │
┌─┘ ┌─┘
│ │
<< 1 = 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯0000
│ │
└───────────────────────────────┐ └───────────────────────────────┐
│ │
>> 76 = 000000000000000⋯000000000000000 1100100101000100001000100010001000011111010001010100
│ │
f64 bits = 0 10001110110 1100100101000100001000100010001000011111010001010100
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent └─ mantissa
So, just three shift operations:
let mantissa = x << n << 1 >> 76;
So.. That’s it? We’re done?
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let sign = 0;
let n = x.leading_zeros();
let exponent = (127 - n) + 1023;
let mantissa = x << n << 1 >> 76;
let bits = (sign << 63) + ((exponent as u64) << 52) + mantissa as u64;
f64::from_bits(bits)
}
#[test]
fn test() {
assert_eq!(u128_to_f64(1234), 1234.0);
}
$ cargo test
Compiling stuff v0.1.0
Finished test [unoptimized + debuginfo] target(s) in 0.43s
Running unittests src/lib.rs
running 1 test
test test ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
🎉
Yes!
It works for one number! Therefore, our code must work correctly for all numbers. 😌
Okay, let’s try one more:
assert_eq!(u128_to_f64(123_456_789_123_456_789), 123_456_789_123_456_789.0);
$ cargo test
…
test result: ok
Works great! It works for two numbers, which means we can now vaguely gesture at mathematical induction, and happily conclude it works for all numbers. 😌
assert_eq!(u128_to_f64(123_456_789_123_456_789_123), 123_456_789_123_456_789_123.0);
$ cargo test
…
test test ... FAILED
failures:
---- test stdout ----
thread 'test' panicked at 'assertion failed: `(left == right)`
left: `1.2345678912345678e20`,
right: `1.234567891234568e20`', src/lib.rs:17:5
Oh.
Maybe not.
Rounding
As it turns out, we shouldn’t just ignore the bits that don’t fit in the f64
.
We cannot store all 128 bits in it, since we only have space for 53 bits:
the first 1
bit encoded by the exponent, and the next 52 bits stored in the mantissa.
Depending on the location of the first 1
bit,
there are up to (128 − 53 =) 75 bits that get lost.
The IEEE 754 standard tells us that we shouldn’t just ignore those bits,
effectively truncating the number after its 53rd significant bit,
but instead round the number to the closest number a f64
can represent.
The number from our test:
123_456_789_123_456_789_123 = 00⋯001101011000101001110100111111001101100001101111100110110101010000011
└─────────────┬─────────────────────────────────────┘
└─ 53 significant bits
The (truncated) number our function returned:
123_456_789_123_456_778_240 = 00⋯001101011000101001110100111111001101100001101111100110100000000000000
└──────────┬─┘
remaining bits are gone ─┘
The right answer, by rounding to the nearest representable number:
123_456_789_123_456_794_624 = 00⋯001101011000101001110100111111001101100001101111100111000000000000000
├┘
the number was rounded up ─┘
The specification also tells us to break ties by rounding to an even mantissa: the one that ends in a zero bit.
So, in decimal: 9.25 rounds to 9, 9.75 rounds to 10, and 9.5 also rounds to 10, but 8.5 rounds to 8.
Or, in binary: 1001.01 rounds to 1001, 1001.11 rounds to 1010, and 1001.10 also rounds to 1010, but 1000.10 rounds to 1000.
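The builtin conversions follow this same rule, which we can see near 2⁵³, where f64 can only represent every other integer, making odd inputs exact ties (a quick demonstration, using the as conversion from the prologue):

```rust
fn main() {
    // Just above 2^53, f64 can only represent every other integer,
    // so odd inputs are exact rounding ties.
    let base = 1u64 << 53;
    assert_eq!((base + 1) as f64, base as f64);       // tie: rounds down to the even mantissa
    assert_eq!((base + 3) as f64, (base + 4) as f64); // tie: rounds up to the even mantissa
}
```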
Applying this to the mantissa in our conversion means we need to check whether the bits we’re throwing away are less than, equal to, or greater than 100⋯00 in binary. If we left-shift the original number to shift out the 53 significant bits, we are left with exactly the bits that we dropped before. That value will tell us whether to round up or down:
┌─ first 1 bit ┌─ next 52 bits ┌─ 75 dropped bits
│ ┌──────────────────────┴───────────────────────────┐ ┌─┴─────────────────────────┐
x << n = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │
┌──────────────────────────────────────────────────────┘ ┌────────┘
│ │
<< 53 = 101010010110000001111010100100101000100111101010110100000100010101100000000 000⋯000
> 100000000000000000000000000000000000000000000000000000000000000000000000000 000⋯000
│
└─ bigger, so we round up
This gives us the following code, if we also include the round-to-even tie-breaking part:
let mut mantissa = x << n << 1 >> 76;
let dropped_bits = x << n << 53;
if dropped_bits < 1 << 127 {
// round down
} else if dropped_bits > 1 << 127 {
// round up
mantissa += 1;
} else {
// tie! round to even
if mantissa & 1 == 1 {
mantissa += 1;
}
}
Okay great, but now we have another problem: the mantissa might no longer fit in the 52 bits it needs to fit in, if it is rounded up beyond its maximum. If that happens, it means the entire resulting number has become one digit longer, similar to how rounding 999 up to 1000 increases it from three to four digits. So, in that case we’ll need to increase the exponent and clear out the remaining bits:
if mantissa == 1 << 52 {
exponent += 1;
mantissa = 0;
}
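One input that exercises this overflow is u128::MAX: its 52 mantissa bits are all ones and the dropped bits round it up, so the whole number carries over to the next power of two. We can confirm that with the builtin conversion from the prologue:

```rust
fn main() {
    // u128::MAX has a mantissa of all ones; rounding up overflows it,
    // so the result is the next power of two: 2^128.
    assert_eq!(u128::MAX as f64, 2.0_f64.powi(128));
}
```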
Okay.
Does it work now?
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let sign = 0;
let n = x.leading_zeros();
let mut exponent = (127 - n) + 1023;
let mut mantissa = x << n << 1 >> 76;
let dropped_bits = x << n << 53;
if dropped_bits < 1 << 127 {
// round down.
} else if dropped_bits > 1 << 127 {
// round up
mantissa += 1;
} else {
// tie! round to even
if mantissa & 1 == 1 {
mantissa += 1;
}
}
if mantissa == 1 << 52 {
exponent += 1;
mantissa = 0;
}
let bits = (sign << 63) + ((exponent as u64) << 52) + mantissa as u64;
f64::from_bits(bits)
}
#[test]
fn test() {
assert_eq!(u128_to_f64(1234), 1234.0);
assert_eq!(u128_to_f64(123_456_789_123_456_789), 123_456_789_123_456_789.0);
assert_eq!(u128_to_f64(123_456_789_123_456_788), 123_456_789_123_456_788.0);
assert_eq!(u128_to_f64(123_456_789_123_456_789_123), 123_456_789_123_456_789_123.0);
}
$ cargo test
…
test test ... ok
🎉
Yes!
Testing the other 340282366920938463463374607431768211452 integers is left as an exercise to the reader.
Cleaning up
Okay this is great and all, but the code could be a lot shorter if we clean things up a bit:
- Remove the useless sign variable.
- Inline the bits variable into the final expression.
- Extract x << n into its own variable, y, to avoid repetition.
- Simplify the if-else-if-else-if structure for rounding. It either increments the mantissa by one, or not. So this can be written as a single condition.
Poof, much better:
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let n = x.leading_zeros();
let y = x << n;
let mut exponent = (127 - n) + 1023;
let mut mantissa = y << 1 >> 76;
let dropped_bits = y << 53;
if dropped_bits > 1 << 127 || (dropped_bits == 1 << 127 && mantissa & 1 == 1) {
mantissa += 1;
}
if mantissa == 1 << 52 {
exponent += 1;
mantissa = 0;
}
f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}
Tricks
Now let’s do some cool tricks to obfuscate optimize things even further.
First, let’s look at the final if
, which handles an overflowing mantissa when rounding up.
This if
is important, because otherwise the 53rd bit
of the overflowed mantissa would end up getting added to the 53rd bit of the final bit representation of the float,
which is not part of the 52-bit mantissa, but instead is where the least significant bit of the exponent goes.
┌─ 53rd bit
│
max. mantissa = 000000000000 1111111111111111111111111111111111111111111111111111
+ 1 = 000000000001 0000000000000000000000000000000000000000000000000000
│ │ │
f64 bits = 0 10001110111 0000000000000000000000000000000000000000000000000000
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent! └─ mantissa (only 52 bits!)
So, to prevent the mantissa from overflowing outside of its bounds,
this if
handles this case by removing that bit from the mantissa,
and adding a 1 to the exponent instead.
Makes sense.
Uh.
Wait.
Back up.
To prevent the 53rd bit from getting added to the 53rd bit.. we remove it and instead add it to the exponent.. which.. is stored at the 53rd bit.
Well that sounds useless.
Okay cool.
So, we remove that entire if
, and get the exact same results as before:
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let n = x.leading_zeros();
let y = x << n;
let exponent = (127 - n) + 1023;
let mut mantissa = y << 1 >> 76;
let dropped_bits = y << 53;
if dropped_bits > 1 << 127 || (dropped_bits == 1 << 127 && mantissa & 1 == 1) {
mantissa += 1;
}
f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}
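To double-check that claim, here’s the if-less function above (copied verbatim) run on the mantissa-overflow input from earlier; the carry out of the 52 mantissa bits lands in the exponent’s least significant bit all by itself:

```rust
pub fn u128_to_f64(x: u128) -> f64 {
    if x == 0 {
        return 0.0;
    }
    let n = x.leading_zeros();
    let y = x << n;
    let exponent = (127 - n) + 1023;
    let mut mantissa = y << 1 >> 76;
    let dropped_bits = y << 53;
    if dropped_bits > 1 << 127 || (dropped_bits == 1 << 127 && mantissa & 1 == 1) {
        mantissa += 1;
    }
    f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}

fn main() {
    // u128::MAX overflows the mantissa when rounding up; without the removed
    // `if`, the carry spills into the exponent bits on its own.
    assert_eq!(u128_to_f64(u128::MAX), u128::MAX as f64);
    assert_eq!(u128_to_f64(u128::MAX), 2.0_f64.powi(128));
}
```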
More tricks
Now let’s look at the shift operations.
The y << 1 >> 76
operation looks almost like it could just be y >> 75
.
Except it can’t, because that << 1
is important to get rid of the most significant 1 bit.
If we didn’t remove that bit, it would end up in the 53rd bit of the mantissa,
which would end up in the exponent field, resulting in an off-by-one exponent:
┌─ first 1 bit ┌─ next 52 bits
│ ┌──────────────────────┴───────────────────────────┐
x = 000⋯000 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011
│ │ │
┌───────┘ │ │
│ ┌───────┘ ┌───────┘
│ │ │
<< n = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │ │
│ └─────────────────────────────┐ └─────────────────────────────┐
└─────────────────────────────┐ │ │
│ │ │
>> 75 = 00000000000000⋯00000000000000 1 1100100101000100001000100010001000011111010001010100
│ │ │
f64 bits = 0 10001110111 1100100101000100001000100010001000011111010001010100
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent └─ mantissa
But! We can deal with that!
We can introduce a counteracting off-by-one error to correct the exponent back
to what it should be by changing the 1023
constant to 1022
. :)
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let n = x.leading_zeros();
let y = x << n;
let exponent = (127 - n) + 1022;
let mut mantissa = y >> 75;
let dropped_bits = y << 53;
if dropped_bits > 1 << 127 || (dropped_bits == 1 << 127 && mantissa & 1 == 1) {
mantissa += 1;
}
f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}
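A quick check (with a few arbitrary sample values of y, all with the top bit set, as x << n always has) that the lower bias exactly cancels the extra leading bit:

```rust
fn main() {
    // Both forms assume y has its top bit set, which x << n guarantees.
    for y in [1u128 << 127, u128::MAX, 0xDEAD_BEEF_u128 << 96] {
        let with_shift = (1023u64 << 52) + ((y << 1 >> 76) as u64); // old: drop leading 1, bias 1023
        let with_bias = (1022u64 << 52) + ((y >> 75) as u64);       // new: keep leading 1, bias 1022
        assert_eq!(with_shift, with_bias);
    }
}
```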
Avoiding 128-bit operations
Next up: the rounding if
. I don’t like it.
It does several operations on 128-bit integers, which don’t fit in normal cpu registers.
I haven’t looked at the disassembly yet, but it smells expensive.
Unfortunately, we can’t fit it into a 64-bit integer,
as there are up to 75 relevant bits:
┌─ first 1 bit ┌─ next 52 bits ┌─ 75 dropped bits
│ ┌──────────────────────┴───────────────────────────┐ ┌─┴─────────────────────────┐
y = x << n = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │
┌──────────────────────────────────────────────────────┘ ┌────────┘
dropped_bits │ │
= y << 53 = 101010010110000001111010100100101000100111101010110100000100010101100000000 0000⋯000
└───┬─────────────────────────────────────────────────────────────────────┘ └────┬─┘
└─ does not fit in a 64 bit integer always zero ─┘
However, we only need to know whether dropped_bits
is lower, equal, or higher than 1 << 127
.
If the highest bit is not set, we know it’s lower, and we don’t need to look at the other bits at all.
If the highest bit is set and all other bits are zero, it is equal.
If the highest bit is set and any of the other bits is set, it is higher.
So, we don’t care about the individual bits beyond the highest one.
We only need to know whether any of them was set.
0xxxxxxxxxxxxxxxx⋯xxxxxxxxxxxxxxx: round down
10000000000000000⋯000000000000000: round to even
1xxxxxxxxxxxxxxxx⋯xxxxxxxxxxxxxxx: round up
(x: don't care)
This means that we can binary-or (|
) some of the bits together without losing the information we need.
We can use that to squash it all into 64 bits, so it fits in a single register.
Let’s split up the 75 relevant bits into a chunk of 64 bits and a chunk of 11 bits, and
binary-or them together:
┌─ the 75 dropped bits that we care about
┌─────────────────────────┴───────────────────────────────────────────────┐
dropped_bits = 101010010110000001111010100100101000100111101010110100000100010101100000000 000⋯000
│ ┌─────┼┘ ┌──────┘
└───┐ │ └───┼┐
│ 01100000000│ last 11 dropped bits
1010100101100000011110101001001010001001111010101101000001000101┘ first 64 dropped bits
---------------------------------------------------------------- binary or
squashed = 1010100101100000011110101001001010001001111010101101001101000101
> 1000000000000000000000000000000000000000000000000000000000000000
│
└─ bigger, so we round up
In code, that looks like this:
let dropped_bits: u128 = y << 53;
let dropped_bits_squashed: u64 = (dropped_bits >> 64) as u64 | (dropped_bits >> 53 & 0b11111111111) as u64;
dropped_bits_squashed
now has the same most significant bit as dropped_bits
,
and the other 127 bits were split into two chunks and binary-or’ed together, to preserve the information we need:
whether any of them was set.
If we inline the last usage of dropped_bits
, we get y << 53 >> 53
, which we can reduce to just y
since we’re
cutting everything beyond the 11th bit off anyway:
let dropped_bits: u128 = y << 53;
let dropped_bits_squashed: u64 = (dropped_bits >> 64) as u64 | (y as u64 & 0b11111111111);
Now note that it would work fine if we had not picked 11, but 12, or 13, or any other number up to 63 bits from y
instead.
It would mean that some of the bits from y
would end up both in the left and right chunk, but that’s fine.
We only care about whether any of them was set.
Processors usually like working with powers of two and might even have an instruction for taking the lower 32 bits of a 64-bit number, so let’s take 32 bits instead of 11:
let dropped_bits: u128 = y << 53;
let dropped_bits_squashed: u64 = (dropped_bits >> 64) as u64 | (y as u64 & 0xFFFFFFFF);
Or, visually:
┌─ 75 dropped bits
┌─┴─────────────────────────┐
y = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │
┌──────────────────────────────────────────────────────┘ ┌────────┘
│ │
dropped_bits = 101010010110000001111010100100101000100111101010110100000100010101100000000 0000⋯000
: ┌──────┘ : ┌──────┘
└───┐ │ └───┼┐
│ 01010110100000100010101100000000│ last 32 dropped bits
1010100101100000011110101001001010001001111010101101000001000101┘ first 64 dropped bits
---------------------------------------------------------------- binary or
1010100101100000011110101001001011011111111010101111101101000101
> 1000000000000000000000000000000000000000000000000000000000000000
│
└─ bigger, so we round up
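To check that no information we care about is lost, we can compare the squashed 64-bit value against the original 128-bit comparison for a few sample values (hypothetical ones, chosen to hit all three cases):

```rust
fn main() {
    // Sample y values (top bit always set, as after x << n); the last two sit
    // exactly at and just above the rounding midpoint.
    let samples: [u128; 4] = [
        1 << 127,                   // no dropped bits set
        u128::MAX,                  // all dropped bits set
        (1 << 127) | (1 << 74),     // exactly the midpoint
        (1 << 127) | (1 << 74) | 1, // just above the midpoint
    ];
    for y in samples {
        let dropped_bits: u128 = y << 53;
        let full = dropped_bits.cmp(&(1 << 127));
        let squashed = ((dropped_bits >> 64) as u64 | (y as u64 & 0xFFFFFFFF))
            .cmp(&(1 << 63));
        assert_eq!(full, squashed); // same three-way result
    }
}
```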
To make things even more unreadable shorter,
we can inline the other usage of the dropped_bits
variable and merge y << 53 >> 64
into a single y >> 11
,
since the high 64 bits are dropped afterwards by the cast to u64
:
let dropped_bits = (y >> 11 | y & 0xFFFFFFFF) as u64;
Adjusting the condition of the if
for our new 64-bit variable, we now have:
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let n = x.leading_zeros();
let y = x << n;
let exponent = (127 - n) + 1022;
let mut mantissa = y >> 75;
let dropped_bits = (y >> 11 | y & 0xFFFFFFFF) as u64;
if dropped_bits > 1 << 63 || (dropped_bits == 1 << 63 && mantissa & 1 == 1) {
mantissa += 1;
}
f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}
Branch-free rounding
It’s great that we’ve gotten rid of some potentially expensive 128-bit operations,
but that if
still looks bad with that large condition,
due to the tie-breaking round-to-even stuff.
Is there a way we can get rid of it entirely, making rounding branch-free?
The body of the if
statement only increments the mantissa by one,
so all we need is an expression that is 0
when the condition is false,
and 1
when the condition is true, and unconditionally add that to the mantissa.
If we didn’t have to take rounding ties into account,
we could replace the entire if
by:
mantissa += dropped_bits >> 63;
This works, because dropped_bits >> 63
is 1
when equal or higher than 1 << 63
,
and 0
when it is lower.
dropped_bits = 1010100101100000011110101001001010001001111010101101000001000000
│
└──────────────────────────────────────────────────────────────┐
│
>> 63 = 0000000000000000000000000000000000000000000000000000000000000001
│
one, so we round up ┘
Unfortunately, we need to deal with rounding ties to even.
To see for which cases exactly we need to correct, let’s put all the different cases in a table:
│       dropped_bits      │  mantissa  │ increment │
│ high bit │  other bits  │ lowest bit │  mantissa │
├──────────┼──────────────┼────────────┼───────────┤
│    0     │       -      │     0      │    no     │
│    0     │       -      │     1      │    no     │
│    1     │       0      │     0      │    no     │ !
│    1     │       0      │     1      │    yes    │
│    1     │      >0      │     0      │    yes    │
│    1     │      >0      │     1      │    yes    │
The first column shows the result of dropped_bits >> 63
,
and the last column shows whether we actually need to increment the mantissa.
There’s only one case where they don’t match: the third row,
which is the case where the mantissa is even while breaking a tie.
We can attempt to correct by subtracting one from dropped_bits
to flip the outcome:
Tie:
dropped_bits = 1000000000000000000000000000000000000000000000000000000000000000
- 1 = 0111111111111111111111111111111111111111111111111111111111111111
│
└─ flipped to zero, so we round down
Not a tie:
dropped_bits = 1010100101100000011110101001001010001001111010101101000001000000
- 1 = 1010100101100000011110101001001010001001111010101101000000111111
│
└─ still one, so we round up
We need to apply this correction when the high bit of dropped_bits
is set,
and the low bit of mantissa
is not set,
which we can conveniently express as (dropped_bits >> 63) & !mantissa
.
This means we can now replace the entire if
statement by:
mantissa += (dropped_bits - ((dropped_bits >> 63) & !mantissa)) >> 63;
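We can check this expression against the table row by row (a standalone sketch, not part of the conversion function):

```rust
fn main() {
    // Rows of the table above: (dropped_bits, mantissa, expected increment).
    let cases: [(u64, u64, u64); 4] = [
        (0x4000_0000_0000_0000, 1, 0), // high bit 0: round down
        (1 << 63, 0, 0),               // tie, even mantissa: don't increment
        (1 << 63, 1, 1),               // tie, odd mantissa: round up to even
        ((1 << 63) | 1, 0, 1),         // other bits set too: round up
    ];
    for (dropped_bits, mantissa, expected) in cases {
        let increment = (dropped_bits - ((dropped_bits >> 63) & !mantissa)) >> 63;
        assert_eq!(increment, expected);
    }
}
```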
Putting it all together, we now only have one if
left:
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let n = x.leading_zeros();
let y = x << n;
let exponent = (127 - n) + 1022;
let mut mantissa = (y >> 75) as u64;
let dropped_bits = (y >> 11 | y & 0xFFFFFFFF) as u64;
mantissa += (dropped_bits - ((dropped_bits >> 63) & !mantissa)) >> 63;
f64::from_bits(((exponent as u64) << 52) + mantissa)
}
Removing the last branch
Do we need that last remaining if
? What happens if we remove it?
x.leading_zeros()
would result in n = 128
when x
is zero, which is fine,
but then x << n
on the next line could panic.
Let’s swap that for x.wrapping_shl(n)
, such that it never panics,
but instead just results in zero when x
is zero and n
is 128.
The mantissa calculation then just results in zero, which is exactly correct.
Unfortunately, the calculation of the exponent does not result in a zero, so we’ll have to keep an if
there.
Maybe if we squeeze it on one line at the end, no one will notice.
(And maybe the compiler will be able to turn it into a branch-free conditional move,
now that the if
just selects between two integers.)
pub fn u128_to_f64(x: u128) -> f64 {
let n = x.leading_zeros();
let y = x.wrapping_shl(n);
let mut mantissa = (y >> 75) as u64;
let dropped_bits = (y >> 11 | y & 0xFFFFFFFF) as u64;
mantissa += (dropped_bits - (dropped_bits >> 63 & !mantissa)) >> 63;
let exponent = if x == 0 { 0 } else { 127 - n + 1022 };
f64::from_bits(((exponent as u64) << 52) + mantissa)
}
Now all we need to do to make this function even less readable prettier,
is to fold some of the remaining constants, remove unnecessary parentheses,
and to only use single letter variables to make every line start with let _ =
:
pub fn u128_to_f64(x: u128) -> f64 {
let n = x.leading_zeros();
let y = x.wrapping_shl(n);
let a = (y >> 75) as u64;
let b = (y >> 11 | y & 0xFFFFFFFF) as u64;
let m = a + ((b - (b >> 63 & !a)) >> 63);
let e = if x == 0 { 0 } else { 1149 - n as u64 };
f64::from_bits((e << 52) + m)
}
Beautiful!
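And as a final check (using the builtin conversion as the reference), the function above agrees bit-for-bit with as f64 on a handful of edge cases: zero, small values, rounding ties, and mantissa overflow:

```rust
pub fn u128_to_f64(x: u128) -> f64 {
    let n = x.leading_zeros();
    let y = x.wrapping_shl(n);
    let a = (y >> 75) as u64;
    let b = (y >> 11 | y & 0xFFFFFFFF) as u64;
    let m = a + ((b - (b >> 63 & !a)) >> 63);
    let e = if x == 0 { 0 } else { 1149 - n as u64 };
    f64::from_bits((e << 52) + m)
}

fn main() {
    let values: [u128; 8] = [
        0,
        1,
        1234,
        (1 << 53) + 1, // rounding tie, rounds down
        (1 << 53) + 3, // rounding tie, rounds up
        123_456_789_123_456_789_123,
        u64::MAX as u128,
        u128::MAX, // mantissa overflow, rounds up to 2^128
    ];
    for x in values {
        assert_eq!(u128_to_f64(x).to_bits(), (x as f64).to_bits());
    }
}
```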
Disassembly
If we compile this for x86-64, we’ll see that the resulting machine code is entirely branchless!
Good job, Rustc LLVM!
Look, no labels, no jumps:
u128_to_f64:
bsr rax, rdi
mov edx, 127
cmovne rdx, rax
xor rdx, 63
add rdx, 64
bsr rcx, rsi
xor rcx, 63
test rsi, rsi
cmove rcx, rdx
mov rax, rsi
shld rax, rdi, cl
mov rdx, rdi
shl rdx, cl
xor r8d, r8d
test cl, 64
cmovne rax, rdx
cmovne rdx, r8
mov r9, rax
shr r9, 11
shld rax, rdx, 53
mov edx, edx
or rdx, rax
shl rcx, 52
movabs r10, 0x47d0000000000000
sub r10, rcx
or rdi, rsi
cmove r10, r8
shr rax, 63
add rdx, r9
mov ecx, r9d
not ecx
and eax, ecx
sub rdx, rax
shr rdx, 63
or rdx, r10
movq xmm0, rdx
ret
I don’t know about you, but if I was an x86 processor, I’d think this code is really pretty.
Results
As it turns out, we could’ve saved a lot of effort by just doing:
pub fn u128_to_f64(x: u128) -> f64 {
x as f64
}
But that’s obviously no fun.
Unfortunately, we cannot measure fun. What we can measure, though, is speed.
Quickly throwing together a benchmark using criterion
to test the builtin as f64
conversion gives the following results on my computer:
An average of 7.7 nanoseconds per conversion. Not bad.
..
Okay. Moment of truth…
Let’s try the same benchmark on our own implementation…
..
Wow. 4.8 nanoseconds per iteration. That’s almost twice as fast!
😮
Epilogue
Having beaten Rustc’s builtin as f64
operator,
I had proven to Rustc that, no, I totally didn’t just waste my time by forgetting about its builtin conversions.
I could finally pretend like this was my goal all along.
The resulting dopamine rush left me craving for more, which resulted in me spending the next several days and nights
implementing conversions from every single integer type to the floating point types, and back.
I worked on benchmarks and tests, and started tracking the differences in the
generated machine code for several different architectures, just to try to find
the version of the Rust code that compiled down to the best and most performant code on all platforms.
In the end I also made some architecture-specific versions,
such as an implementation
that makes use of native f64
addition and subtraction instructions
to perform u128-to-f64 conversion another two times faster (2.3 nanoseconds),
by using some tricks that could fill their own blog post.
Eventually, tired but satisfied, I pushed it all to GitHub and slowly woke up from my hyperfocus fueled week-long fever dream, wondering when I had last slept or eaten. I ate, slept, and then never cared about integer-float conversion ever again.
Update (2022-05-31): My conversion functions have been merged into
rust/compiler-builtins
for all conversions between all integer and floating point types. The optimized versions for 128-bit integers have also been merged into the .NET runtime as part of the new Int128
and UInt128
support.