Converting Integers to Floats Using Hyperfocus
A few years ago, due to some random chain of events, I ended up implementing a conversion from 128-bit integers to 64-bit floats. This would’ve turned out to be a complete waste of time, except that my final version is faster than the builtin conversion of every compiler I tested. In this blog post, I’ll explain what happened, how floats work, how this conversion works, and how it got a bit out of hand.
Prologue
It was a beautiful day in August, a few summers ago, when a friend of mine was reviewing a pull request containing something like the following snippet:
f64::from_str(&some_u128.to_string()).unwrap()
This was used to convert a u128
to a f64
.
It did so by converting it to a string first, and parsing that as a float next.
Since this method isn’t great in various ways,
she went to her secret underground lair Discord Server to ask the leet hackers bored programmers there if anyone had a better way of doing this.
In an attempt to appear helpful but actually just trying to impress some of the people there,
I implemented a sort-of mostly correct version of this conversion by calculating the raw bit representation of the f64
and using
f64::from_bits
.
After rewriting it several times—a process which involved
fixing bugs for certain rarely used values like zero,
wrongly accusing u128::to_string
of having bugs,
implementing rounding almost correctly,
wrongly accusing f64::from_str
of having bugs,
implementing rounding slightly more correctly,
and trying to write comments but giving up because it’s undecipherable anyway—I finally
ended up with about 7 lines of unreadable bit twiddling code
that seemed to do the trick,
while implementing the correct rounding mode almost perfectly.
Everyone was slightly disgusted and slightly impressed.
Mission accomplished.
..
Eventually we came up with something even shorter:
x as f64
..
Uh.
Yeah.
So.
Somehow nobody involved even considered the possibility that _ as f64
might just work.
Between u128
being experimental until somewhat recently and no mainstream processor providing this conversion natively,
we had all simply assumed incorrectly that this was not a thing.
But it was.
And I felt tricked.
By Rustc.
While I was sitting there, struggling for hours banging virtual rocks together to light a spark to make a campfire, Rustc was just sitting there with a lighter in its pocket. “Hey you never asked,” Rustc laughed, “I figured you were just having fun.”
So, out of pure spite and a strong sense of needing to prove that yes I am having fun damnit!, I spent the next several days and nights angrily writing many variants of my code, determined to take revenge on Rustc, somehow.
In this blog post, I’ll try to trace back my steps to implement this integer-to-floating-point conversion until we reach the same seven lines of unreadable code as I did in 2020.
But first, we’ll need to understand how floating point values are represented as bits.
The bits of an f64
The 64 bits of a f64
encode a floating point value following the IEEE 754 binary64 standard,
also known as a “double precision floating point value” or double
.
Pretty much all processors that support floating point calculations support this format.
This format splits up the 64 bits into three parts: one sign bit, an 11-bit exponent, and a 52-bit mantissa:
f64 bits = 0 10001110110 1100100101000100001000100010001000011111010001010101
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent └─ mantissa
The sign bit indicates whether the number is positive (0) or negative (1).
The next 11 bits represent a number, the exponent, which gives the position of the first 1
bit of the binary floating point number.
1023 if the first 1
bit is immediately left of the decimal point (i.e. in the ones place),
1024 if it is in the second digit left of the decimal point (i.e. in the twos place),
and so on, and in the other direction too: 1022 if it is immediately right of the decimal point (in the halves place), etc.
The remaining 52 bits, called the mantissa, are the bits that come right after that first 1
bit of the number.
So, using the letter m
to represent the 52 bits of the mantissa,
this is what the decoded binary floating point number looks like
for different values of the exponent:
⋮
1026 1mmm.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1025 1mm.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1024 1m.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1023 1.mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1022 0.1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1021 0.01mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1020 0.001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
⋮
A concrete example: the value 9.75 in binary looks like 1001.11,
so it is represented as a f64
with an exponent of 1026, and a (binary) mantissa of 001110000⋯0.
9.75 in decimal = 1001.11 in binary
│└─┬──┘
│ └─ mantissa: the bits after the first 1 bit
└─ position of the first 1 bit: 3
f64 bits = 0 10000000010 0011100000000000000000000000000000000000000000000000
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent (1023+3) └─ mantissa
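We can verify this decomposition in a few lines of Rust (just a sanity check, pulling the three fields out of the bits with the standard f64::to_bits):

```rust
fn main() {
    let bits = (9.75_f64).to_bits();
    let sign = bits >> 63;
    let exponent = (bits >> 52) & 0x7FF;
    let mantissa = bits & ((1u64 << 52) - 1);
    assert_eq!(sign, 0);
    assert_eq!(exponent, 1026); // 1023 + 3
    assert_eq!(mantissa, 0b00111 << 47); // 001110000⋯0, left-aligned in 52 bits
    println!("exponent = {exponent}, mantissa = {mantissa:052b}");
}
```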
If we extend the table above all the way to the top, we get to extremely high numbers. The maximum exponent value of 2047 is used to represent the special ’not a number’ and infinity values.
2047 Infinity or NaN
2046 1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm00000⋯00000.0
2045 1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm0000⋯00000.0
2044 1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm000⋯00000.0
2043 1mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm00⋯00000.0
⋮ 960 zero bits omitted ─┘
Similarly, if we extend the table all the way to the bottom, we get to the extremely small numbers.
The minimum exponent value of zero is also special, and used for the extremely small subnormal numbers.
Going from an exponent of one to zero does not move the mantissa another bit to the right,
but instead turns the first 1
bit into a 0
bit.
This allows us to represent the number zero: an exponent of zero and a mantissa of zero.
In other words, f64::from_bits(0)
equals 0.0
.
⋮
4 0.00000⋯0001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
3 0.00000⋯00001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
2 0.00000⋯000001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
1 0.00000⋯0000001mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm
0 0.00000⋯0000000mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm (subnormals)
└─ 1010 zero bits omitted
Converting integers
Back to the original problem: converting a 128-bit unsigned integer (u128
) to an f64
:
pub fn u128_to_f64(x: u128) -> f64 {
let bits: u64 = ???; // TODO
f64::from_bits(bits)
}
The input being an integer means we never have to deal with subnormals, except for zero:
if x == 0 {
return 0.0;
}
Edge-case handled. Now we can continue with the fun part: every other integer.
As the input is unsigned, we already know one bit: the sign bit is always zero.
One bit down, sixty-three left!
let sign = 0;
let exponent = ??; // TODO
let mantissa = ??; // TODO
let bits: u64 = (sign << 63) + (exponent << 52) + mantissa;
Next up: the exponent. For this we need to know the position of the highest 1
bit, and add 1023 because that’s how floats work.
Easy, just use u128::leading_zeros
to count the zeros before the first 1
bit, and subtract that from 127 to get the position of the 1
bit:
┌─ n leading zeros
│ ┌─ first 1 bit
┌─┴───┐ │
input = 000⋯000 1 110010010100010000100010001000100001111101000101010010101001011⋯000101011
│ │ │
bit 127 │ bit 0
bit (127 - n)
So,
let n = x.leading_zeros();
let exponent = (127 - n) + 1023;
Boom, 11 more bits done. 52 left: the mantissa.
For the mantissa, we need the 52 bits that followed the highest 1
bit.
If we left-shift away all the leading zeros and then also left-shift away the highest 1
bit,
what remains are the bits right after that first 1
bit.
Then all we need to do is right-shift the (128 − 52 =) 76 unwanted bits away:
┌─ n leading zeros
│ ┌─ first 1 bit ┌─ next 52 bits ┌─ remaining bits
┌─┴───┐ │ ┌──────────────────────┴───────────────────────────┐ ┌─┴─────────────────┐
input = 000⋯000 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011
│ │ │
┌───────┘ │ │
│ ┌───────┘ ┌───────┘
│ │ │
<< n = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │
┌─┘ ┌─┘
│ │
<< 1 = 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯0000
│ │
└───────────────────────────────┐ └───────────────────────────────┐
│ │
>> 76 = 000000000000000⋯000000000000000 1100100101000100001000100010001000011111010001010100
│ │
f64 bits = 0 10001110110 1100100101000100001000100010001000011111010001010100
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent └─ mantissa
So, just three shift operations:
let mantissa = x << n << 1 >> 76;
So.. That’s it? We’re done?
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let sign = 0;
let n = x.leading_zeros();
let exponent = (127 - n) + 1023;
let mantissa = x << n << 1 >> 76;
let bits = (sign << 63) + ((exponent as u64) << 52) + mantissa as u64;
f64::from_bits(bits)
}
#[test]
fn test() {
assert_eq!(u128_to_f64(1234), 1234.0);
}
$ cargo test
Compiling stuff v0.1.0
Finished test [unoptimized + debuginfo] target(s) in 0.43s
Running unittests src/lib.rs
running 1 test
test test ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s
🎉
Yes!
It works for one number! Therefore, our code must work correctly for all numbers. 😌
Okay, let’s try one more:
assert_eq!(u128_to_f64(123_456_789_123_456_789), 123_456_789_123_456_789.0);
$ cargo test
…
test result: ok
Works great! It works for two numbers, which means we can now vaguely gesture at mathematical induction, and happily conclude it works for all numbers. 😌
assert_eq!(u128_to_f64(123_456_789_123_456_789_123), 123_456_789_123_456_789_123.0);
$ cargo test
…
test test ... FAILED
failures:
---- test stdout ----
thread 'test' panicked at 'assertion failed: `(left == right)`
left: `1.2345678912345678e20`,
right: `1.234567891234568e20`', src/lib.rs:17:5
Oh.
Maybe not.
Rounding
As it turns out, we shouldn’t just ignore the bits that don’t fit in the f64
.
We cannot store all 128 bits in it, since we only have space for 53 bits:
the first 1
bit encoded by the exponent, and the next 52 bits stored in the mantissa.
Depending on the location of the first 1
bit,
there are up to (128 − 53 =) 75 bits that get lost.
The IEEE 754 standard tells us that we shouldn’t just ignore those bits,
effectively truncating the number after its 53rd significant bit,
but instead round the number to the closest number a f64
can represent.
The number from our test:
123_456_789_123_456_789_123 = 00⋯001101011000101001110100111111001101100001101111100110110101010000011
└─────────────┬─────────────────────────────────────┘
└─ 53 significant bits
The (truncated) number our function returned:
123_456_789_123_456_778_240 = 00⋯001101011000101001110100111111001101100001101111100110100000000000000
└──────────┬─┘
remaining bits are gone ─┘
The right answer, by rounding to the nearest representable number:
123_456_789_123_456_794_624 = 00⋯001101011000101001110100111111001101100001101111100111000000000000000
├┘
the number was rounded up ─┘
The specification also tells us to break ties by rounding to an even mantissa: the one that ends in a zero bit.
So, in decimal: 9.25 rounds to 9, 9.75 rounds to 10, and 9.5 also rounds to 10, but 8.5 rounds to 8.
Or, in binary: 1001.01 rounds to 1001, 1001.11 rounds to 1010, and 1001.10 also rounds to 1010, but 1000.10 rounds to 1000.
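The builtin conversions follow this same rule, which we can see near 2⁵³, where f64 can only represent every other integer, making odd inputs exact ties (a quick demonstration, using the as conversion from the prologue):

```rust
fn main() {
    // Just above 2^53, f64 can only represent every other integer,
    // so odd inputs are exact rounding ties.
    let base = 1u64 << 53;
    assert_eq!((base + 1) as f64, base as f64);       // tie: rounds down to the even mantissa
    assert_eq!((base + 3) as f64, (base + 4) as f64); // tie: rounds up to the even mantissa
}
```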
Applying this to the mantissa in our conversion means we need to check whether the bits we’re throwing away are less than, equal to, or greater than 100⋯00 in binary. If we left-shift the original number to shift out the 53 significant bits, we are left with exactly the bits that we dropped before. That value will tell us whether to round up or down:
┌─ first 1 bit ┌─ next 52 bits ┌─ 75 dropped bits
│ ┌──────────────────────┴───────────────────────────┐ ┌─┴─────────────────────────┐
x << n = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │
┌──────────────────────────────────────────────────────┘ ┌────────┘
│ │
<< 53 = 101010010110000001111010100100101000100111101010110100000100010101100000000 000⋯000
> 100000000000000000000000000000000000000000000000000000000000000000000000000 000⋯000
│
└─ bigger, so we round up
This gives us the following code, if we also include the round-to-even tie-breaking part:
let mut mantissa = x << n << 1 >> 76;
let dropped_bits = x << n << 53;
if dropped_bits < 1 << 127 {
// round down
} else if dropped_bits > 1 << 127 {
// round up
mantissa += 1;
} else {
// tie! round to even
if mantissa & 1 == 1 {
mantissa += 1;
}
}
Okay great, but now we have another problem: the mantissa might no longer fit in the 52 bits it needs to fit in, if it is rounded up beyond its maximum. If that happens, it means the entire resulting number has become one digit longer, similar to how rounding 999 up to 1000 increases it from three to four digits. So, in that case we’ll need to increase the exponent and clear out the remaining bits:
if mantissa == 1 << 52 {
exponent += 1;
mantissa = 0;
}
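One input that exercises this overflow is u128::MAX: its 52 mantissa bits are all ones and the dropped bits round it up, so the whole number carries over to the next power of two. We can confirm that with the builtin conversion from the prologue:

```rust
fn main() {
    // u128::MAX has a mantissa of all ones; rounding up overflows it,
    // so the result is the next power of two: 2^128.
    assert_eq!(u128::MAX as f64, 2.0_f64.powi(128));
}
```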
Okay.
Does it work now?
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let sign = 0;
let n = x.leading_zeros();
let mut exponent = (127 - n) + 1023;
let mut mantissa = x << n << 1 >> 76;
let dropped_bits = x << n << 53;
if dropped_bits < 1 << 127 {
// round down.
} else if dropped_bits > 1 << 127 {
// round up
mantissa += 1;
} else {
// tie! round to even
if mantissa & 1 == 1 {
mantissa += 1;
}
}
if mantissa == 1 << 52 {
exponent += 1;
mantissa = 0;
}
let bits = (sign << 63) + ((exponent as u64) << 52) + mantissa as u64;
f64::from_bits(bits)
}
#[test]
fn test() {
assert_eq!(u128_to_f64(1234), 1234.0);
assert_eq!(u128_to_f64(123_456_789_123_456_789), 123_456_789_123_456_789.0);
assert_eq!(u128_to_f64(123_456_789_123_456_788), 123_456_789_123_456_788.0);
assert_eq!(u128_to_f64(123_456_789_123_456_789_123), 123_456_789_123_456_789_123.0);
}
$ cargo test
…
test test ... ok
🎉
Yes!
Testing the other 340282366920938463463374607431768211452 integers is left as an exercise to the reader.
Cleaning up
Okay this is great and all, but the code could be a lot shorter if we clean things up a bit:
- Remove the useless sign variable.
- Inline the bits variable into the final expression.
- Extract x << n into its own variable, y, to avoid repetition.
- Simplify the if-else-if-else-if structure for rounding. It either increments the mantissa by one, or not. So this can be written as a single condition.
Poof, much better:
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let n = x.leading_zeros();
let y = x << n;
let mut exponent = (127 - n) + 1023;
let mut mantissa = y << 1 >> 76;
let dropped_bits = y << 53;
if dropped_bits > 1 << 127 || (dropped_bits == 1 << 127 && mantissa & 1 == 1) {
mantissa += 1;
}
if mantissa == 1 << 52 {
exponent += 1;
mantissa = 0;
}
f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}
Tricks
Now let’s do some cool tricks to obfuscate optimize things even further.
First, let’s look at the final if
, which handles an overflowing mantissa when rounding up.
This if
is important, because otherwise the 53rd bit
of the overflowed mantissa would end up getting added to the 53rd bit of the final bit representation of the float,
which is not part of the 52-bit mantissa, but instead is where the least significant bit of the exponent goes.
┌─ 53rd bit
│
max. mantissa = 000000000000 1111111111111111111111111111111111111111111111111111
+ 1 = 000000000001 0000000000000000000000000000000000000000000000000000
│ │ │
f64 bits = 0 10001110111 0000000000000000000000000000000000000000000000000000
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent! └─ mantissa (only 52 bits!)
So, to prevent the mantissa from overflowing outside of its bounds,
this if
handles this case by removing that bit from the mantissa,
and adding a 1 to the exponent instead.
Makes sense.
Uh.
Wait.
Back up.
To prevent the 53rd bit from getting added to the 53rd bit.. we remove it and instead add it to the exponent.. which.. is stored at the 53rd bit.
Well that sounds useless.
Okay cool.
So, we remove that entire if
, and get the exact same results as before:
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let n = x.leading_zeros();
let y = x << n;
let exponent = (127 - n) + 1023;
let mut mantissa = y << 1 >> 76;
let dropped_bits = y << 53;
if dropped_bits > 1 << 127 || (dropped_bits == 1 << 127 && mantissa & 1 == 1) {
mantissa += 1;
}
f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}
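To double-check that claim, here’s the if-less function above (copied verbatim) run on the mantissa-overflow input from earlier; the carry out of the 52 mantissa bits lands in the exponent’s least significant bit all by itself:

```rust
pub fn u128_to_f64(x: u128) -> f64 {
    if x == 0 {
        return 0.0;
    }
    let n = x.leading_zeros();
    let y = x << n;
    let exponent = (127 - n) + 1023;
    let mut mantissa = y << 1 >> 76;
    let dropped_bits = y << 53;
    if dropped_bits > 1 << 127 || (dropped_bits == 1 << 127 && mantissa & 1 == 1) {
        mantissa += 1;
    }
    f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}

fn main() {
    // u128::MAX overflows the mantissa when rounding up; without the removed
    // `if`, the carry spills into the exponent bits on its own.
    assert_eq!(u128_to_f64(u128::MAX), u128::MAX as f64);
    assert_eq!(u128_to_f64(u128::MAX), 2.0_f64.powi(128));
}
```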
More tricks
Now let’s look at the shift operations.
The y << 1 >> 76
operation looks almost like it could just be y >> 75
.
Except it can’t, because that << 1
is important to get rid of the most significant 1 bit.
If we didn’t remove that bit, it would end up in the 53rd bit of the mantissa,
which would end up in the exponent field, resulting in an off-by-one exponent:
┌─ first 1 bit ┌─ next 52 bits
│ ┌──────────────────────┴───────────────────────────┐
x = 000⋯000 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011
│ │ │
┌───────┘ │ │
│ ┌───────┘ ┌───────┘
│ │ │
<< n = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │ │
│ └─────────────────────────────┐ └─────────────────────────────┐
└─────────────────────────────┐ │ │
│ │ │
>> 75 = 00000000000000⋯00000000000000 1 1100100101000100001000100010001000011111010001010100
│ │ │
f64 bits = 0 10001110111 1100100101000100001000100010001000011111010001010100
│ └─┬───────┘ └────────────┬─────────────────────────────────────┘
sign ─┘ └─ exponent └─ mantissa
But! We can deal with that!
We can introduce a counteracting off-by-one error to correct the exponent back
to what it should be by changing the 1023
constant to 1022
. :)
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let n = x.leading_zeros();
let y = x << n;
let exponent = (127 - n) + 1022;
let mut mantissa = y >> 75;
let dropped_bits = y << 53;
if dropped_bits > 1 << 127 || (dropped_bits == 1 << 127 && mantissa & 1 == 1) {
mantissa += 1;
}
f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}
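A quick check (with a few arbitrary sample values of y, all with the top bit set, as x << n always has) that the lower bias exactly cancels the extra leading bit:

```rust
fn main() {
    // Both forms assume y has its top bit set, which x << n guarantees.
    for y in [1u128 << 127, u128::MAX, 0xDEAD_BEEF_u128 << 96] {
        let with_shift = (1023u64 << 52) + ((y << 1 >> 76) as u64); // old: drop leading 1, bias 1023
        let with_bias = (1022u64 << 52) + ((y >> 75) as u64);       // new: keep leading 1, bias 1022
        assert_eq!(with_shift, with_bias);
    }
}
```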
Avoiding 128-bit operations
Next up: the rounding if
. I don’t like it.
It does several operations on 128-bit integers, which don’t fit in normal cpu registers.
I haven’t looked at the disassembly yet, but it smells expensive.
Unfortunately, we can’t fit it into a 64-bit integer,
as there are up to 75 relevant bits:
┌─ first 1 bit ┌─ next 52 bits ┌─ 75 dropped bits
│ ┌──────────────────────┴───────────────────────────┐ ┌─┴─────────────────────────┐
y = x << n = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │
┌──────────────────────────────────────────────────────┘ ┌────────┘
dropped_bits │ │
= y << 53 = 101010010110000001111010100100101000100111101010110100000100010101100000000 0000⋯000
└───┬─────────────────────────────────────────────────────────────────────┘ └────┬─┘
└─ does not fit in a 64 bit integer always zero ─┘
However, we only need to know whether dropped_bits
is lower, equal, or higher than 1 << 127
.
If the highest bit is not set, we know it’s lower, and we don’t need to look at the other bits at all.
If the highest bit is set and all other bits are zero, it is equal.
If the highest bit is set and any of the other bits is set, it is higher.
So, we don’t care about the individual bits beyond the highest one.
We only need to know whether any of them was set.
0xxxxxxxxxxxxxxxx⋯xxxxxxxxxxxxxxx: round down
10000000000000000⋯000000000000000: round to even
1xxxxxxxxxxxxxxxx⋯xxxxxxxxxxxxxxx: round up
(x: don't care)
This means that we can binary-or (|
) some of the bits together without losing the information we need.
We can use that to squash it all into 64 bits, so it fits in a single register.
Let’s split up the 75 relevant bits into a chunk of 64 bits and a chunk of 11 bits, and
binary-or them together:
┌─ the 75 dropped bits that we care about
┌─────────────────────────┴───────────────────────────────────────────────┐
dropped_bits = 101010010110000001111010100100101000100111101010110100000100010101100000000 000⋯000
│ ┌─────┼┘ ┌──────┘
└───┐ │ └───┼┐
│ 01100000000│ last 11 dropped bits
1010100101100000011110101001001010001001111010101101000001000101┘ first 64 dropped bits
---------------------------------------------------------------- binary or
squashed = 1010100101100000011110101001001010001001111010101101001101000101
> 1000000000000000000000000000000000000000000000000000000000000000
│
└─ bigger, so we round up
In code, that looks like this:
let dropped_bits: u128 = y << 53;
let dropped_bits_squashed: u64 = (dropped_bits >> 64) as u64 | (dropped_bits >> 53 & 0b11111111111) as u64;
dropped_bits_squashed
now has the same most significant bit as dropped_bits
,
and the other 127 bits were split into two chunks and binary-or’ed together, to preserve the information we need:
whether any of them was set.
If we inline the last usage of dropped_bits
, we get y << 53 >> 53
, which we can reduce to just y
since we’re
cutting everything beyond the 11th bit off anyway:
let dropped_bits: u128 = y << 53;
let dropped_bits_squashed: u64 = (dropped_bits >> 64) as u64 | (y as u64 & 0b11111111111);
Now note that it would work fine if we had not picked 11, but 12, or 13, or any other number up to 63 bits from y
instead.
It would mean that some of the bits from y
would end up both in the left and right chunk, but that’s fine.
We only care about whether any of them was set.
Processors usually like working with powers of two and might even have an instruction for taking the lower 32 bits of a 64-bit number, so let’s take 32 bits instead of 11:
let dropped_bits: u128 = y << 53;
let dropped_bits_squashed: u64 = (dropped_bits >> 64) as u64 | (y as u64 & 0xFFFFFFFF);
Or, visually:
┌─ 75 dropped bits
┌─┴─────────────────────────┐
y = 1 1100100101000100001000100010001000011111010001010100 10101001011⋯000101011 000⋯000
│ │
┌──────────────────────────────────────────────────────┘ ┌────────┘
│ │
dropped_bits = 101010010110000001111010100100101000100111101010110100000100010101100000000 0000⋯000
: ┌──────┘ : ┌──────┘
└───┐ │ └───┼┐
│ 01010110100000100010101100000000│ last 32 dropped bits
1010100101100000011110101001001010001001111010101101000001000101┘ first 64 dropped bits
---------------------------------------------------------------- binary or
1010100101100000011110101001001011011111111010101111101101000101
> 1000000000000000000000000000000000000000000000000000000000000000
│
└─ bigger, so we round up
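To check that no information we care about is lost, we can compare the squashed 64-bit value against the original 128-bit comparison for a few sample values (hypothetical ones, chosen to hit all three cases):

```rust
fn main() {
    // Sample y values (top bit always set, as after x << n); the last two sit
    // exactly at and just above the rounding midpoint.
    let samples: [u128; 4] = [
        1 << 127,                   // no dropped bits set
        u128::MAX,                  // all dropped bits set
        (1 << 127) | (1 << 74),     // exactly the midpoint
        (1 << 127) | (1 << 74) | 1, // just above the midpoint
    ];
    for y in samples {
        let dropped_bits: u128 = y << 53;
        let full = dropped_bits.cmp(&(1 << 127));
        let squashed = ((dropped_bits >> 64) as u64 | (y as u64 & 0xFFFFFFFF))
            .cmp(&(1 << 63));
        assert_eq!(full, squashed); // same three-way result
    }
}
```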
To make things even more unreadable shorter,
we can inline the other usage of the dropped_bits
variable and merge y << 53 >> 64
into a single y >> 11
,
since the high 64 bits are dropped afterwards by the cast to u64
:
let dropped_bits = (y >> 11 | y & 0xFFFFFFFF) as u64;
Adjusting the condition of the if
for our new 64-bit variable, we now have:
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let n = x.leading_zeros();
let y = x << n;
let exponent = (127 - n) + 1022;
let mut mantissa = y >> 75;
let dropped_bits = (y >> 11 | y & 0xFFFFFFFF) as u64;
if dropped_bits > 1 << 63 || (dropped_bits == 1 << 63 && mantissa & 1 == 1) {
mantissa += 1;
}
f64::from_bits(((exponent as u64) << 52) + mantissa as u64)
}
Branch-free rounding
It’s great that we’ve gotten rid of some potentially expensive 128-bit operations,
but that if
still looks bad with that large condition,
due to the tie-breaking round-to-even stuff.
Is there a way we can get rid of it entirely, making rounding branch-free?
The body of the if
statement only increments the mantissa by one,
so all we need is an expression that is 0
when the condition is false,
and 1
when the condition is true, and unconditionally add that to the mantissa.
If we didn’t have to take rounding ties into account,
we could replace the entire if
by:
mantissa += dropped_bits >> 63;
This works, because dropped_bits >> 63
is 1
when equal or higher than 1 << 63
,
and 0
when it is lower.
dropped_bits = 1010100101100000011110101001001010001001111010101101000001000000
│
└──────────────────────────────────────────────────────────────┐
│
>> 63 = 0000000000000000000000000000000000000000000000000000000000000001
│
one, so we round up ┘
Unfortunately, we need to deal with rounding ties to even.
To see for which cases exactly we need to correct, let’s put all the different cases in a table:
│       dropped_bits      │  mantissa  │ increment │
│ high bit │  other bits  │ lowest bit │  mantissa │
├──────────┼──────────────┼────────────┼───────────┤
│    0     │       -      │     0      │    no     │
│    0     │       -      │     1      │    no     │
│    1     │       0      │     0      │    no     │ !
│    1     │       0      │     1      │    yes    │
│    1     │      >0      │     0      │    yes    │
│    1     │      >0      │     1      │    yes    │
The first column shows the result of dropped_bits >> 63
,
and the last column shows whether we actually need to increment the mantissa.
There’s only one case where they don’t match: the third row,
which is the case where the mantissa is even while breaking a tie.
We can attempt to correct by subtracting one from dropped_bits
to flip the outcome:
Tie:
dropped_bits = 1000000000000000000000000000000000000000000000000000000000000000
- 1 = 0111111111111111111111111111111111111111111111111111111111111111
│
└─ flipped to zero, so we round down
Not a tie:
dropped_bits = 1010100101100000011110101001001010001001111010101101000001000000
- 1 = 1010100101100000011110101001001010001001111010101101000000111111
│
└─ still one, so we round up
We need to apply this correction when the high bit of dropped_bits
is set,
and the low bit of mantissa
is not set,
which we can conveniently express as (dropped_bits >> 63) & !mantissa
.
This means we can now replace the entire if
statement by:
mantissa += (dropped_bits - ((dropped_bits >> 63) & !mantissa)) >> 63;
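We can check this expression against the table row by row (a standalone sketch, not part of the conversion function):

```rust
fn main() {
    // Rows of the table above: (dropped_bits, mantissa, expected increment).
    let cases: [(u64, u64, u64); 4] = [
        (0x4000_0000_0000_0000, 1, 0), // high bit 0: round down
        (1 << 63, 0, 0),               // tie, even mantissa: don't increment
        (1 << 63, 1, 1),               // tie, odd mantissa: round up to even
        ((1 << 63) | 1, 0, 1),         // other bits set too: round up
    ];
    for (dropped_bits, mantissa, expected) in cases {
        let increment = (dropped_bits - ((dropped_bits >> 63) & !mantissa)) >> 63;
        assert_eq!(increment, expected);
    }
}
```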
Putting it all together, we now only have one if
left:
pub fn u128_to_f64(x: u128) -> f64 {
if x == 0 {
return 0.0;
}
let n = x.leading_zeros();
let y = x << n;
let exponent = (127 - n) + 1022;
let mut mantissa = (y >> 75) as u64;
let dropped_bits = (y >> 11 | y & 0xFFFFFFFF) as u64;
mantissa += (dropped_bits - ((dropped_bits >> 63) & !mantissa)) >> 63;
f64::from_bits(((exponent as u64) << 52) + mantissa)
}
Removing the last branch
Do we need that last remaining if
? What happens if we remove it?
x.leading_zeros()
would result in n = 128
when x
is zero, which is fine,
but then x << n
on the next line could panic.
Let’s swap that for x.wrapping_shl(n)
, such that it never panics,
but instead just results in zero when x
is zero and n
is 128.
The mantissa calculation then just results in zero, which is exactly correct.
Unfortunately, the calculation of the exponent does not result in a zero, so we’ll have to keep an if
there.
Maybe if we squeeze it on one line at the end, no one will notice.
(And maybe the compiler will be able to turn it into a branch-free conditional move,
now that the if
just selects between two integers.)
pub fn u128_to_f64(x: u128) -> f64 {
let n = x.leading_zeros();
let y = x.wrapping_shl(n);
let mut mantissa = (y >> 75) as u64;
let dropped_bits = (y >> 11 | y & 0xFFFFFFFF) as u64;
mantissa += (dropped_bits - (dropped_bits >> 63 & !mantissa)) >> 63;
let exponent = if x == 0 { 0 } else { 127 - n + 1022 };
f64::from_bits(((exponent as u64) << 52) + mantissa)
}
Now all we need to do to make this function even less readable prettier,
is to fold some of the remaining constants, remove unnecessary parentheses,
and to only use single letter variables to make every line start with let _ =
:
pub fn u128_to_f64(x: u128) -> f64 {
let n = x.leading_zeros();
let y = x.wrapping_shl(n);
let a = (y >> 75) as u64;
let b = (y >> 11 | y & 0xFFFFFFFF) as u64;
let m = a + ((b - (b >> 63 & !a)) >> 63);
let e = if x == 0 { 0 } else { 1149 - n as u64 };
f64::from_bits((e << 52) + m)
}
Beautiful!
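And as a final check (using the builtin conversion as the reference), the function above agrees bit-for-bit with as f64 on a handful of edge cases: zero, small values, rounding ties, and mantissa overflow:

```rust
pub fn u128_to_f64(x: u128) -> f64 {
    let n = x.leading_zeros();
    let y = x.wrapping_shl(n);
    let a = (y >> 75) as u64;
    let b = (y >> 11 | y & 0xFFFFFFFF) as u64;
    let m = a + ((b - (b >> 63 & !a)) >> 63);
    let e = if x == 0 { 0 } else { 1149 - n as u64 };
    f64::from_bits((e << 52) + m)
}

fn main() {
    let values: [u128; 8] = [
        0,
        1,
        1234,
        (1 << 53) + 1, // rounding tie, rounds down
        (1 << 53) + 3, // rounding tie, rounds up
        123_456_789_123_456_789_123,
        u64::MAX as u128,
        u128::MAX, // mantissa overflow, rounds up to 2^128
    ];
    for x in values {
        assert_eq!(u128_to_f64(x).to_bits(), (x as f64).to_bits());
    }
}
```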
Disassembly
If we compile this for x86-64, we’ll see that the resulting machine code is entirely branchless!
Good job, Rustc LLVM!
Look, no labels, no jumps:
u128_to_f64:
bsr rax, rdi
mov edx, 127
cmovne rdx, rax
xor rdx, 63
add rdx, 64
bsr rcx, rsi
xor rcx, 63
test rsi, rsi
cmove rcx, rdx
mov rax, rsi
shld rax, rdi, cl
mov rdx, rdi
shl rdx, cl
xor r8d, r8d
test cl, 64
cmovne rax, rdx
cmovne rdx, r8
mov r9, rax
shr r9, 11
shld rax, rdx, 53
mov edx, edx
or rdx, rax
shl rcx, 52
movabs r10, 0x47d0000000000000
sub r10, rcx
or rdi, rsi
cmove r10, r8
shr rax, 63
add rdx, r9
mov ecx, r9d
not ecx
and eax, ecx
sub rdx, rax
shr rdx, 63
or rdx, r10
movq xmm0, rdx
ret
I don’t know about you, but if I was an x86 processor, I’d think this code is really pretty.
Results
As it turns out, we could’ve saved a lot of effort by just doing:
pub fn u128_to_f64(x: u128) -> f64 {
x as f64
}
But that’s obviously no fun.
Unfortunately, we cannot measure fun. What we can measure, though, is speed.
Quickly throwing together a benchmark using criterion
to test the builtin as f64
conversion gives the following results on my computer:
An average of 7.7 nanoseconds per conversion. Not bad.
..
Okay. Moment of truth…
Let’s try the same benchmark on our own implementation…
..
Wow. 4.8 nanoseconds per iteration. That’s almost twice as fast!
😮
Epilogue
Having beaten Rustc’s builtin as f64
operator,
I had proven to Rustc that, no, I totally didn’t just waste my time by forgetting about its builtin conversions.
I could finally pretend like this was my goal all along.
The resulting dopamine rush left me craving for more, which resulted in me spending the next several days and nights
implementing conversions from every single integer type to the floating point types, and back.
I worked on benchmarks and tests, and started tracking the differences in the
generated machine code for several different architectures, just to try to find
the version of the Rust code that compiled down to the best and most performant code on all platforms.
In the end I also made some architecture-specific versions,
such as an implementation
that makes use of native f64
addition and subtraction instructions
to perform u128-to-f64 conversion another two times faster (2.3 nanoseconds),
by using some tricks that could fill their own blog post.
Eventually, tired but satisfied, I pushed it all to GitHub and slowly woke up from my hyperfocus fueled week-long fever dream, wondering when I had last slept or eaten. I ate, slept, and then never cared about integer-float conversion ever again.
Update (2022-05-31): My conversion functions have been merged into
rust/compiler-builtins
for all conversions between all integer and floating point types. The optimized versions for 128-bit integers have also been merged into the .NET runtime as part of the new Int128
and UInt128
support.