c++,floating-point,precision,floating-accuracy,numerical-methods

Will it be more stable (or faster, or at all different) in general to implement a simple function to perform log(n) iterated multiplies if the exponent is known to be an integer? The result of exponentiation by squaring for integer exponents is in general less accurate than pow, but...
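A minimal sketch of exponentiation by squaring, written here in Python to keep it self-contained (the structure is identical in C++):

```python
def ipow(base: float, exp: int) -> float:
    """Raise base to an integer power using O(log n) multiplies."""
    if exp < 0:
        return 1.0 / ipow(base, -exp)
    result = 1.0
    while exp:
        if exp & 1:        # low bit set: fold the current square into the result
            result *= base
        base *= base       # square for the next bit
        exp >>= 1
    return result

print(ipow(3.0, 5))    # 243.0
```

Each multiply can round, so for large exponents the accumulated error can exceed that of a correctly rounded pow, which is the trade-off the answer alludes to.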

c,floating-point,floating-accuracy,ieee-754

I am going to assume IEEE 754 binary floating point arithmetic, with float 32 bit and double 64 bit. In general, there is no advantage to doing the calculation in double, and in some cases it may make things worse through doing two rounding steps. Conversion from float to double...

c++,std,equality,floating-accuracy

std::equal_to uses == to perform the comparison. If you want to compare with a tolerance, you'll have to write that yourself. (Or use a library.)

c++,double,floating-accuracy,double-precision,atof

The problem is that it is in general impossible to represent fractional decimal values exactly using binary floating point numbers. For example, 0.1 is represented as 1.000000000000000055511151231257827021181583404541015625E-1 when using double (you can use this online analyzer to determine the values). When computing with these rounded values the number of necessary...
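The quoted value is easy to reproduce: in Python, constructing a Decimal directly from the float exposes the exact binary value that the literal 0.1 rounds to:

```python
from decimal import Decimal

# Decimal(float) converts losslessly, revealing the stored binary value
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```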

You can avoid it by using BigDecimal BigDecimal d1 = new BigDecimal("100.00"); BigDecimal d2 = new BigDecimal("0.1"); for(int i = 0; i < 100; i++) { d1 = d1.subtract(d2); System.out.println(d1); } produces 99.90 99.80 99.70 99.60 99.50 99.40 99.30 99.20 ... ...
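The same idea in Python's decimal module (a sketch equivalent to the Java BigDecimal loop above, shortened to three steps):

```python
from decimal import Decimal

d1 = Decimal("100.00")
d2 = Decimal("0.1")
for _ in range(3):
    d1 = d1 - d2       # exact decimal subtraction, no binary rounding
    print(d1)          # 99.90, then 99.80, then 99.70
```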

c,floating-point,floating-accuracy,floating-point-precision

I imagine speed is of some concern, or else you could just try the floating point-based estimate and adjust if it turned out to be too small. In that case, one can sacrifice tightness of the estimate for speed. In the following, let dst_base be 2^w, src_base be b, and...

prolog,floating-accuracy,eclipse-clp,interval-arithmetic,clpr

Several issues conspire here to create confusion: contrary to what is claimed, the three constants in the example do not have exact representations as double floats; it is not true that the initial example involves no rounding; and the seemingly correct result in the first example is actually due to a fortunate rounding...

python,pandas,floating-accuracy

I think I can guess what is happening: In [481]: df=pd.DataFrame( { 'x':[0,0,.1,.2,0,0] } ) In [482]: df2 = pd.rolling_sum(df,window=2) In [483]: df2 Out[483]: x 0 NaN 1 0.000000e+00 2 1.000000e-01 3 3.000000e-01 4 2.000000e-01 5 2.775558e-17 It looks OK, except for the last one, right? In fact, the rounding...
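The residue can be reproduced without pandas. A rolling sum is typically maintained incrementally (add the value entering the window, subtract the one leaving), so the final window is left holding the rounding error:

```python
s = 0.0
s += 0.1    # 0.1 enters the window
s += 0.2    # 0.2 enters: 0.1 + 0.2 gives 0.30000000000000004
s -= 0.1    # 0.1 leaves
s -= 0.2    # 0.2 leaves; the window holds [0, 0] but s is not 0.0
print(s)    # 2.7755575615628914e-17, one ulp of 0.2
```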

limit,floating-accuracy,floating-point-precision,maxima

(1) use horner to rearrange the expression so that it can be evaluated more accurately. (%i1) display2d : false; (%o1) false (%i2) horner (x^5-6*x^4+14*x^3-20*x^2+24*x-16, x); (%o2) x*(x*(x*((x-6)*x+14)-20)+24)-16 (%i3) subst (x=1.999993580023622, %); (%o3) -1.77635683940025E-15 (2) use bigfloat (variable precision) arithmetic. (%i4) subst (x=1.999993580023622b0, x^5-6*x^4+14*x^3-20*x^2+24*x-16); (%o4) -1.332267629550188b-15 (%i5) fpprec : 50 $...
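The same comparison can be made in ordinary double arithmetic (Python here; the Maxima session above uses the same IEEE 754 doubles):

```python
x = 1.999993580023622

# term-by-term evaluation
naive = x**5 - 6*x**4 + 14*x**3 - 20*x**2 + 24*x - 16

# Horner form: fewer operations and smaller intermediate values
horner = x*(x*(x*((x - 6)*x + 14) - 20) + 24) - 16

print(naive, horner)   # both tiny; the true value is about -1.33e-15
```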

floating-point,numbers,floating-accuracy

This is an age old issue to do with floating point arithmetic. You can read more about it in lots of posts like this: Elegant workaround for JavaScript floating point number problem The typical simple solution to it is to treat decimals as integers then to divide them back to...
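A sketch of the scaled-integer workaround (Python, but the idea is language-independent):

```python
print(0.1 + 0.2)        # 0.30000000000000004 in binary floating point

# treat the decimals as integers (here, tenths), then divide back once
tenths = 1 + 2          # exact integer arithmetic
result = tenths / 10
print(result)           # 0.3
```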

c++,algorithm,math,floating-point,floating-accuracy

Inaccurate operands vs inaccurate results I wonder whether similar problems will show up if I use mathematical functions. Actually, the problem that could prevent log2(8) to be 3 does not exist for basic operations (including *). But it exists for the log2 function. You are confusing two different issues: double...

c++,floating-point,floating-accuracy

In a comment below the question: I would expect to receive the value 1.16 and I get a compilation error on this You would expect 1.16 because you think you have the value 1.42. You don't. You have the double nearest to 1.42, and while it is close enough to...

c++,algorithm,floating-accuracy,radix

The internal representation of double is not decimal; it is already binary, and it is defined by an international standard, IEEE 754. So double is 64 bits and consists of 3 parts: the sign (1 bit), the exponent (11 bits) and the significand (52 bits). Roughly speaking, this separation makes it possible to store both...

c++,c,math,optimization,floating-accuracy

Location of problem The difference in the result between the two different levels of optimization occurs when calculating radLon2-radLon1. The result of this calculation is shown here. -O0 radLon2-radLon1 = 0xC00921FB54442D19 -O1 radLon2-radLon1 = 0xC00921FB54442D18 The difference is in the least significant bit bringing the -O0 result past the real...

d,floating-accuracy,floating-point-precision

This is probably because 0.6 can't be represented purely in a floating point number. You write 0.6, but that's not exactly what you get - you get something like 0.599999999. When you divide that by 0.1, you get something like 5.99999999, which converts to an integer of 5 (by rounding...
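The same effect is easy to demonstrate in any IEEE 754 language; in Python:

```python
q = 0.6 / 0.1      # neither 0.6 nor 0.1 is exact in binary
print(q)           # 5.999999999999999
print(int(q))      # 5 -- conversion truncates toward zero
print(round(q))    # 6 -- rounding first gives the expected answer
```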

mysql,floating-point,floating-accuracy,floating-point-conversion

Decisions and consequences These are the consequences of your decision to use a floating-point data type. Floats are not precise, and that means: yes, you can end up with a>a = true. For instance, your fourth row: mysql> SELECT * FROM t WHERE id=4; +------+--------+ | id | rating...

c++,c,floating-point,floating-accuracy,floating-point-precision

Can I get away with setting the rounding mode of the FP environment appropriately and performing a simple cast (e.g. (double) N)? Yes, as long as the compiler implements the IEEE 754 (most of them do at least roughly). Conversion from one floating-point format to the other is one...

split,floating-point,floating-accuracy

(using double-precision for the examples) I was wondering if it is possible to split somehow x = (I-1.0) + (f + 1.0), namely without floating point rounding error. There is no chance of obtaining such a split for all values. Take 0x1.0p-60, the integral part of which is 0.0 and...

No it's not the same for the reason you mentioned. Here's an example: double x = 894913.3; System.out.println(x * 0.0000001); // prints 0.08949133 System.out.println(x / 10000000); // prints 0.08949133000000001 Using a BigDecimal, we can see the difference between the two values: System.out.println(new BigDecimal(0.0000001)); System.out.println(new BigDecimal((double)10000000)); Output: 9.99999999999999954748111825886258685613938723690807819366455078125E-8 10000000 ...

visual-studio,visual-c++,floating-accuracy

The behavior is correct, the float type can store only 7 significant digits. The rest are just random noise. You need to fix the bug in your code, you are either displaying too many digits, thus revealing the random noise to a human, or your math model is losing too...

c++,vector,duplicates,duplicate-removal,floating-accuracy

First sort your vector using std::sort. Then use std::unique with a custom predicate to remove the duplicates. std::unique(v.begin(), v.end(), [](double l, double r) { return std::abs(l - r) < 0.01; }); // treats any numbers that differ by less than 0.01 as equal Live demo...

java,floating-accuracy,floating-point-precision

Simple answer is - yes. This example definitely returns false: public boolean alwaysFalse(){ float a=Float.MIN_VALUE; float b=Float.MAX_VALUE; float c = a / b; return a == c * b; } Updated More general answer is that there are two cases when false happens in your method: 1) when significand if...

c,floating-accuracy,floating-point-precision,approximation

Truncation vs. rounding. Due to subtle rounding effect of FP arithmetic, the product nol * noc may be slightly less than an integer value. Conversion to int results in fractional truncation. Suggest rounding before conversion to int. #include <math.h> int result = (int) roundf(nol * noc); ...
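A Python sketch of the same fix (C's roundf corresponds to round here; the value 0.29 / 0.01 is a stand-in for nol * noc):

```python
x = 0.29 / 0.01        # slightly below 29 because both operands were rounded
print(x)               # 28.999999999999996
print(int(x))          # 28 -- fractional truncation
print(int(round(x)))   # 29 -- round before converting
```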

floating-point,double,floating-accuracy,ieee-754

In Java, a - b is never equal to 0 if a != b. This is because Java mandates IEEE 754 floating point operations which support denormalized numbers. From the spec: In particular, the Java programming language requires support of IEEE 754 denormalized floating-point numbers and gradual underflow, which make...
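Gradual underflow can be checked directly; in Python, which uses the same IEEE 754 doubles as Java:

```python
tiny = 5e-324          # smallest positive subnormal double
a, b = 2 * tiny, tiny
assert a != b
diff = a - b
assert diff != 0.0     # the difference is itself representable (subnormal)
assert diff == tiny
print("gradual underflow: a - b is nonzero whenever a != b")
```

With flush-to-zero arithmetic (no denormals), the same subtraction could underflow to 0.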

c++,floating-accuracy,floating-point-precision

It is indeed a floating-point issue: a typical 64-bit double only gives 53 bits, or about 15 decimal digits, of precision. You might (or might not) get more precision from long double. Or, for integers up to about 10^19, you could use uint64_t. Otherwise, there's no standard type with more...

python,math,scipy,floating-accuracy

In this particular case, given those particular numbers, the quad approach will actually be more accurate. The CDF itself can be computed quickly and accurately, of course, but look at the actual numbers: >>> scipy.stats.norm.cdf(6), scipy.stats.norm.cdf(5) (0.9999999990134123, 0.99999971334842808) When you're differencing two very similar quantities, you lose accuracy. Similar problems...
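The cancellation is visible without scipy; a sketch using math.erfc, which computes the tail directly (like scipy's sf):

```python
import math

def norm_cdf(x):
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def norm_sf(x):
    # upper-tail probability computed directly, no subtraction from 1
    return 0.5 * math.erfc(x / math.sqrt(2.0))

# P(5 < X < 6) two ways
naive = norm_cdf(6) - norm_cdf(5)   # differencing two values both close to 1
tail = norm_sf(5) - norm_sf(6)      # differencing the small tails
print(naive, tail)
```

The two results agree to only around 9-10 digits: the naive difference has lost the rest to cancellation, while the tail form keeps nearly full precision.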

c++,floating-point,floating-accuracy,sin

In C++, sin has an overload float sin(float f). And overload resolution is done on argument type, rather than return type. To force the use of double sin(double d) you need to cast the argument: sin(static_cast<double>(x)). (2) vs (3): the FP standard allows implementations to store intermediate results with greater...

python,mysql,orm,sqlalchemy,floating-accuracy

Per our discussion in the comments: sa.types.Float(precision=[precision here]) instead of sa.Float allows you to specify precision; however, sa.Float(Precision=32) has no effect. See the documentation for more information.

java,floating-point,floating-accuracy

Any COBOL programmer would be able to answer this immediately. The point of the question is that if you add the big numbers first, you lose precision when you come to add the small numbers. Add the small numbers first....
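A concrete sketch in Python doubles: a thousand 1.0s vanish individually against 1e16 (each is half an ulp and rounds away), but survive if accumulated first:

```python
values = [1e16] + [1.0] * 1000

big_first = 0.0
for v in values:                 # 1e16 is added first; each 1.0 then rounds away
    big_first += v

small_first = 0.0
for v in sorted(values):         # the ones accumulate to 1000 before the big add
    small_first += v

print(big_first)     # 1e+16
print(small_first)   # 1.0000000000001e+16
```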

It depends on what you're trying to avoid, but probably not. If you're trying to avoid catastrophic cancellation (where 10^100 + 1 - 10^100 results in 0 instead of 1), using a wider FP type will help a little bit but not very much. If the numbers are a lot...

python,floating-point,ctypes,floating-accuracy

Jan Rüegg is right - this is just how floats work. If you're wondering why this only shows up with c_float, it's because a c_float prints as "c_float({!r})".format(self.value). self.value is a double-precision Python float. Python's float type prints the shortest representation that converts to the floating point number, so although float(0.2)...

arrays,matlab,date,floating-accuracy

You can give year, month, day, ... in numeric format to the function datenum. Datenum accepts vectors for one or several of its arguments, and if the numbers are too big (for example, 120 minutes), datenum knows what to do with it. So by supplying the minutes vector in 20-minute...

floating-point,language-lawyer,c99,floating-accuracy

No such guarantee is made. Some math libraries attempt to satisfy this constraint for math library functions only, but that behavior is not required by any standard, and it is quite rare (even most libraries that attempt to provide it have compromises and bugs). For functions that are not in...

c++,algorithm,binary-search,floating-accuracy

The problem was I was setting minimal volume as high in the binary search, but I should use the maximal volume. The second problem was I was not passing maximal radius ^ 3 to the binary search function. Thanks for help

opengl,floating-accuracy,opentk

I don't think that glReadPixels is to blame here. I think that the Z buffer precision is the issue. By default, you typically have a 24 bit fixed-point depth buffer. Maybe it helps if you use a 32 bit floating point depth buffer, but you probably need an FBO for...

c++,floating-point,floating-accuracy,floating-point-precision

Converting from float to double preserves the value. For this reason, in the first snippet, d contains exactly the approximation to float precision of 112391/100000. The rational 112391/100000 is stored in the float format as 9428040 / 2^23. If you carry out this division, the result is exactly 1.12390995025634765625: the...

double,cross-platform,precision,floating-accuracy,deterministic

The C# standard, ECMA-334, says in section 11.1.6 Floating point types "Floating-point operations can be performed with higher precision than the result type of the operation.". Given that sort of rule, you cannot be sure of getting exactly the same answer everywhere. "can" means some implementations may use higher precision...

c++,c,floating-point,floating-accuracy

You compute temp = 3604.0f accurately at each iteration. The problem arises when you try adding 3604.0f to something else and round the result to the nearest float. floats store an exponent and a 24-bit significand (23 bits stored explicitly), meaning any result with 1-bits more than 24 places apart is going to get...

c++,floating-point,rotation,geometry,floating-accuracy

I think the problem is that you're writing the rotated result back to the input array. p'x = cos(theta) * (px-ox) - sin(theta) * (py-oy) + ox p'y = sin(theta) * (p'x-ox) + cos(theta) * (py-oy) + oy Try doing the rotation out of place, or use temporary variables and...

go,floating-point,floating-accuracy

As per spec: Constant expressions are always evaluated exactly; intermediate values and the constants themselves may require precision significantly larger than supported by any predeclared type in the language. Since 912 * 0.01 is a constant expression, it is evaluated exactly. Thus, writing fmt.Println(912 * 0.01) has the same effect...

sql,floating-point,floating-accuracy,floating-point-precision,kognitio-wx2

But you are using the full precision of the 32-bit floating point number. Remember that these are binary floating point values, not decimal. When you view it in decimal, you get a bunch of trailing zeros, and it looks very nice and clean. But if you were to view those...

c++,floating-point,floating-accuracy

If you're unwilling to simplify the problem as specified in the comments, because what you posted was only an example of what you want, or otherwise, the custom class option is a decent way to go. I wrote up a very basic implementation of the kind of behavior you want....

Actually, using float.Epsilon may not make any significant difference here. float.Epsilon is the smallest possible float greater than zero (roughly 1.401298E-45), which does not mean that it's the smallest difference between any two arbitrary floats. Since floating-point math is imprecise, the difference between two seemingly equal numbers can be much...


numeric,floating-accuracy,numerical-methods

In general if you want precise results in physical based rendering you don't want to use floats or doubles since they have massive rounding problems and thus introduce errors in your simulation. If you need or want to stick with floats/double you probably should rescale around zero. The reason is...

c,floating-point,floating-accuracy

This is most likely because you have some locale where the decimal separator is , instead of ., and since you are very likely not checking the return value of scanf() "which is what most books do and almost every tutorial", then the variable is not being initialized and what...

objective-c,floating-accuracy,nsdecimalnumber

All numbers, including NSDecimalNumber, have only limited accuracy. Floats and doubles are binary representations of numbers: they carry 24 and 53 significant binary digits respectively (about 7 and 16 decimal digits). That means that some numbers cannot be represented precisely. For example, consider 1 / 3 - to...
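Decimal types hit the same wall on 1/3; a quick illustration with Python's decimal module (28 significant digits by default):

```python
from decimal import Decimal, getcontext

getcontext().prec = 28
third = Decimal(1) / Decimal(3)   # rounded to 28 decimal digits
print(third)                      # 0.3333333333333333333333333333
print(third * 3)                  # 0.9999999999999999999999999999, not 1
```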

c++,c,floating-point,floating-accuracy,floating-point-precision

why by using an union object and there defining the same memory as int32 and float32 i get different solution? The only reason the float/int union even makes sense is by virtue of the fact that both float and int share a storage size of 32-bits. What you are missing...

floating-point,floating-accuracy

After a request for clarification, the question is about IEEE 754, independently of a programming language. In this context, obtaining the result 2.4196151872870495e-72 for the division being considered, in “round-to-nearest”, is purely and simply incorrect. The correct result is 2.41961518728705e-72, according to the definition found in the question: [...] every...

php,mysql,database,floating-accuracy,sqldatatypes

Notice how the DOUBLE(30, 27) made sense for 16-17 significant digits, then went haywire? FLOAT has 24 bits of precision. That is enough for about 7 significant digits. In particular, for latitude and longitude, that is precise enough to get within 1.7 meters (5.6 feet). For most lat/lng applications, that...
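The single-precision loss is easy to measure by round-tripping a coordinate through a 32-bit float (Python sketch; the longitude is a made-up example value):

```python
import struct

def to_float32(x: float) -> float:
    # round-trip through IEEE 754 single precision, as a MySQL FLOAT would store it
    return struct.unpack("f", struct.pack("f", x))[0]

lng = -73.985428                   # hypothetical longitude
stored = to_float32(lng)
print(abs(stored - lng))           # a few millionths of a degree lost
```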

java,floating-point,int,floating-accuracy,floating-point-conversion

number2 should be promoted to float type Number 2 got promoted to float. and I should get a 0.96 You should get 1.0, because: 2 / 12 * 6 = 2 / 2 = 1 0.16 * 6 = 0.96 ... and that's why. It has nothing to do...

javascript,floating-point,double,floating-accuracy

Re when inaccuracies may occur: Barring getting deep into the technical details of IEEE-754 double-precision floating point and having your code tailor itself to the specific characteristics of the two values it's operating on, you won't know when a result will be inaccurate; any operation, including simple addition, can result...

c,formatting,floating-accuracy

You have initialized a binary floating point variable with a decimal real value. Binary floating point cannot represent exactly all real decimal values. Single precision binary floating point is good for precise representation of approximately 6 significant decimal figures, and 123.449997 has 9; so you have exceeded the promised...

c++,floating-point,precision,floating-accuracy,numerical-methods

You are correct. The precision from 1.0 to 2.0 will be uniform across the surface, like you are using fixed point. The precision from -0.5 to 0.5 will be highest around the center point, and lower near the edges (but still quite good). The precision from 0.0 to 1.0 will...
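This can be checked with math.ulp (Python 3.9+), which gives the spacing between adjacent doubles at a point:

```python
import math

# within the binade [1.0, 2.0) the spacing is constant, like fixed point
print(math.ulp(1.0) == math.ulp(1.999))   # True
print(math.ulp(1.0) == 2.0 ** -52)        # True

# toward zero the spacing shrinks: highest absolute precision near the center
print(math.ulp(0.001) < math.ulp(0.9))    # True
```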

c++,templates,comparison,floating-accuracy

This question seems to be asking two things: How can I do floating point comparisons without using operator==, and how can I modify the behaviour of a template depending on the type passed to it. One answer to the second question is to use type traits. The code below demonstrates...

java,math,floating-accuracy,pow

because it casts long to double System.out.println((double)999999999999999999L); outputs: 1.0E18 and System.out.println((long)(double)999999999999999999L); outputs: 1000000000000000000 why is that ? ...

c,floating-point,floating-accuracy,kernighan-and-ritchie

The first loop determines the number of bits contributing to the significand by finding the least power of 2 such that adding 1 to it (using floating-point arithmetic) fails to change its value. If that's the nth power of two, then the significand uses n bits, because with n bits you...
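The first loop's logic can be sketched in Python, which uses the same IEEE 754 doubles:

```python
# find the least power of two p such that p + 1.0 rounds back to p
p = 1.0
bits = 0
while p + 1.0 != p:
    p *= 2.0
    bits += 1
print(bits)   # 53: the double significand holds 53 bits
```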

c#,algorithm,floating-point,floating-accuracy

Do the following Calculate (int)(0.29/0.01) = 28 //want 29 here Next, calculate back i * 0.01 for i between 28-1 and 28+1 and pick up the one that is correct....

matlab,optimization,matrix-multiplication,floating-accuracy,numerical

Do you use the format function inside your script? It looks like you used format rat somewhere. You can always use Matlab's eps function, which returns the precision used inside Matlab. The absolute value of -1/18014398509481984 is smaller than this, according to my Matlab R2014B: format long a = abs(-1/18014398509481984)...