C++: why does the max of a 64-bit double have 308 digits?

Problem description

In my environment (Win10 64-bit, VC++ 2019, 32-bit project), sizeof(double) is 8 bytes, so I expected the max value to be about 1.84e19 (2^64). But std::numeric_limits<double>::max() is around 1.79e308.

Why are they so different?


Wikipedia has more detail: Double-precision floating-point format

The bits are laid out as follows: 1 sign bit (bit 63), 11 exponent bits (bits 62 to 52), and 52 fraction bits (bits 51 to 0).

The real value assumed by a given 64-bit double-precision datum with a given biased exponent e and a 52-bit fraction is:

$$(-1)^{\text{sign}} \times (1.b_{51}b_{50}\ldots b_{0})_2 \times 2^{e-1023}$$

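So if you type 1.999*2^1023 into Google, it gives about 1.796794e+308. That is close to the true maximum, (2 - 2^-52) × 2^1023 ≈ 1.797693e+308, which has every fraction bit set and the largest normal exponent.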

Tags: c++

Solution


Floating point types work differently than integer types.

Integer types map every bit pattern, from all 0s to all 1s, directly onto one value in their min-max range. So they can represent any whole number exactly, if it's in range.

Floating-point types are essentially two numbers: a fraction (mantissa) and an exponent, plus a sign bit.

So the number 1,000,000 could be represented by fraction 1 and exponent 6 (1 × 10^6). A double does the same thing in base 2 rather than base 10.

So a floating-point number can represent a massive range of numbers (the range is limited by the exponent range) with variable accuracy (the accuracy is limited by the fraction range). As numbers get larger, the fraction still has only a fixed number of digits, so the gap between adjacent representable values grows and small differences get lost. For a double, the gap near 10,000 is still only about 1.8e-12, so 10,000.0001 is stored very accurately. But at 2^53 (about 9.0e15) the gap is 2, so adding 1 to 2^53 just gives 2^53 back.

As per IEEE 754, a double has 1 sign bit, 11 bits of exponent and 52 bits of fraction.

So when you pair a 52-bit fraction with an 11-bit exponent, you get a very large range: the largest exponent is 1023, and 2^1023 alone is already about 9e307. But that doesn't mean every number between the smallest and largest value can be represented exactly, unlike with an integer type.

