# IEEE754 floating-point numbers
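Reference: What Every Computer Scientist Should Know About Floating-Point Arithmetic

- A floating-point number has three parts: sign (Sign), exponent (Exponent), and fraction (Fraction).
- IEEE754 floating-point numbers are not uniformly distributed. They represent only a finite set of real numbers, so there are gaps on the number line.
- Most real numbers cannot be represented exactly. For example, the decimal number 0.1 has no exact IEEE754 representation.
- For normal numbers, the leading 1 in 1.xxxx is implicit and is not stored.
- Zero is signed (S = 0 or 1, E = 0, Fraction = 0). Under IEEE-754, `atan2(y, x)` with x < 0 returns $\pi$ for y = +0 and $-\pi$ for y = -0. (The `atan2` functions in Intel Fortran and Matlab handle signed zeros differently.)
- Subnormal numbers (very small, close to 0) have E = 0 and a nonzero Fraction.
- Internal computation carries extra bits beyond the stored fraction so that results can be rounded correctly. Three such bits (guard, round, sticky) are commonly used.
- There are four rounding modes: to nearest (ties to even), toward zero, toward $+\infty$, and toward $-\infty$.
- Floating-point exceptions include overflow and underflow. Underflow is usually harmless, while overflow needs special attention. One source of overflow is conversion from another type (e.g., a very large integer converted to float, or a double converted to float). Another textbook example is hypot, which computes $\sqrt{x^2+y^2}$, and the length of a multi-dimensional vector, $\sqrt{\sum{x_i^2}}$. Computations of this kind must always guard against overflow in the intermediate squares.
- Comparisons: inf > 1 returns 1; NaN > 1, NaN < 1, and NaN == 1 all return 0.
- Subnormal numbers are generally handled in one of two ways: flush to zero or gradual underflow. Subnormals have a significant performance cost, and compiler options can turn flush to zero on or off. Gradual underflow mostly helps in delicate computations, such as numerical differentiation.

## Single precision (float, 32 bits)

Representable range and encoding parameters:

| Item | Description |
| --- | --- |
| Bias | 127 |
| E range | [1…254]; 0 and 255 are reserved |
| Range | $2^{-126}$ to $2^{127}$ |

Some special single-precision values (obtained through `std::numeric_limits<float>`):

| Value | Binary representation |
| --- | --- |
| 0 | `00000000000000000000000000000000` |
| -0 | `10000000000000000000000000000000` |
| 1 | `00111111100000000000000000000000` |
| -1 | `10111111100000000000000000000000` |
| eps | `00110100000000000000000000000000` |
| 1+eps | `00111111100000000000000000000001` |
| min | `00000000100000000000000000000000` |
| max | `01111111011111111111111111111111` |
| denorm_min | `00000000000000000000000000000001` |
| infinity | `01111111100000000000000000000000` |
| sNaN | `01111111101000000000000000000000` |
| qNaN | `01111111110000000000000000000000` |

## Double precision (double, 64 bits)

| Item | Description |
| --- | --- |
| Bias | 1023 |
| E range | [1…2046]; 0 and 2047 are reserved |
| Range | $2^{-1022}$ to $2^{1023}$ |

Some special double-precision values (obtained through `std::numeric_limits<double>`):

| Value | Binary representation |
| --- | --- |
| 0 | `0000000000000000000000000000000000000000000000000000000000000000` |
| -0 | `1000000000000000000000000000000000000000000000000000000000000000` |
| 1 | `0011111111110000000000000000000000000000000000000000000000000000` |
| -1 | `1011111111110000000000000000000000000000000000000000000000000000` |
| eps | `0011110010110000000000000000000000000000000000000000000000000000` |
| 1+eps | `0011111111110000000000000000000000000000000000000000000000000001` |
| min | `0000000000010000000000000000000000000000000000000000000000000000` |
| max | `0111111111101111111111111111111111111111111111111111111111111111` |
| infinity | `0111111111110000000000000000000000000000000000000000000000000000` |
| sNaN | `0111111111110100000000000000000000000000000000000000000000000000` |
| qNaN | `0111111111111000000000000000000000000000000000000000000000000000` |

## Floating-point addition and subtraction flow

## Utility program

```cpp
/************************************
 * Test the IEEE floating-point standard, bit representations, etc.
 ************************************/
#include <bitset>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <limits>
#include <sstream>
#include <string>
#include <type_traits>

using namespace std;

// Print the raw bytes of x as a bit string, most significant byte first.
template <typename R>
std::ostream& dump_bits(const R& x, std::ostream& os = std::cout) {
    const unsigned char* u8 = reinterpret_cast<const unsigned char*>(&x);
    string s;
    // assume little-endian
    for (int i = static_cast<int>(sizeof(R)) - 1; i >= 0; --i) {
        std::bitset<8> b(u8[i]);
        s += b.to_string();
    }
    os << s;
    return os;
}

// Print x in hexadecimal floating-point notation.
template <typename R>
std::ostream& dump_hex(const R& x, std::ostream& os = std::cout) {
    os << std::hexfloat;
    os << x;
    os << std::defaultfloat;
    return os;
}

template <typename T>
void print_limits() {
    using flimits = numeric_limits<T>;
    cout << "radix:\t" << flimits::radix << "\n";
    cout << "min_exponent:\t" << flimits::min_exponent << "\n";
    cout << "max_exponent:\t" << flimits::max_exponent << "\n";
    cout << "digits:\t" << flimits::digits << "\n";
    cout << "digits10:\t" << flimits::digits10 << "\n";
    cout << "epsilon:\t" << flimits::epsilon() << "\n";
    cout << "inf:\t" << flimits::infinity() << "\n";
    cout << "qNan:\t" << flimits::quiet_NaN() << "\n";
    cout << "sNan:\t" << flimits::signaling_NaN() << "\n";
    cout << "min:\t" << flimits::min() << "\n";
    cout << "max:\t" << flimits::max() << "\n";
}

template <typename T>
void print_bits_and_hex() {
    static_assert(std::is_same_v<T, float> || std::is_same_v<T, double> ||
                  std::is_same_v<T, long double>);
    using flimits = numeric_limits<T>;
    // Print name, bit pattern, and hexfloat form on one line.
    auto dump = [](std::ostream& os, string name, const T& x) -> std::ostream& {
        os << name << "\t";
        dump_bits(x, os);
        os << " ";
        os << std::hexfloat;
        os << x;
        os << std::defaultfloat;
        os << "\n";
        return os;
    };
    dump(cout, "infinity", flimits::infinity());
    dump(cout, "sNaN", flimits::signaling_NaN());
    dump(cout, "qNaN", flimits::quiet_NaN());
    dump(cout, "0", T(0.0));
    dump(cout, "-0", T(-0.0));
    dump(cout, "1", T(1));
    dump(cout, "-1", T(-1));
    dump(cout, "eps", flimits::epsilon());
    dump(cout, "1+eps", flimits::epsilon() + T(1));
    dump(cout, "min", flimits::min());
    dump(cout, "max", flimits::max());
    dump(cout, "denorm_min", flimits::denorm_min());
}

int main(int argc, char** argv) {
    cout << R"(Single-precision (float) limits)" << "\n";
    print_limits<float>();
    cout << R"(Double-precision (double) limits)" << "\n";
    print_limits<double>();
    cout << R"(Single-precision (float) bit patterns)" << "\n";
    print_bits_and_hex<float>();
    cout << R"(Double-precision (double) bit patterns)" << "\n";
    print_bits_and_hex<double>();
    cout << R"(Extended-precision (long double) bit patterns)" << "\n";
    print_bits_and_hex<long double>();

    // inf can be compared with other values
    cout << ((numeric_limits<float>::infinity() > 1.0f) ? "inf>1" : "inf<=1") << "\n";
    // arithmetic on inf yields inf
    cout << numeric_limits<float>::infinity() / 2.0f << "\n";
    // comparing NaN with any value yields false
    cout << (numeric_limits<float>::quiet_NaN() > 1.0f) << "\n";
    cout << (numeric_limits<float>::quiet_NaN() < 1.0f) << "\n";
    // zero has specially defined behavior: -0 compares equal to +0
    cout << (-0.0f == 0.0f) << "\n";
    return (0);
}
// Compile: g++ -std=c++17 cxx_ex3.cpp
```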