c,if-statement,compiler-errors,floating-point,floating-point-precision

Because 0.5 has an exact representation in IEEE-754 binary formats (like binary32 and binary64). 0.5 is a negative power of two. 0.6 on the other hand is not a power of two and it cannot be represented exactly in float or double.

math,floating-point,fixed-point

In decimal, your fixed-point example is actually: 2 * 4.5 2 * 45 (after multiplying by 10) = 90 90 / 10 = 9 (after dividing the 10 back out) In binary, the same thing is being done, but just with powers of 2 instead of powers of 10 (as...

The float val is stored as 307.02999 or something like that intval just truncate the number, no rounding, hence 30702.

I would suggest generating a bound double and then converting to float: return Double.valueOf(random.nextDouble(Float.MIN_VALUE, Float.MAX_VALUE)).floatValue(); The nextDouble method has been replaced in Java 8 with a method to produce a stream of doubles. So in Java 8 you would use the following equivalent: DoubleStream randomDoubles = new Random().doubles(Float.MIN_VALUE, Float.MAX_VALUE); Double.valueOf(randomDoubles.findAny().getAsDouble()).floatValue();...

python,arrays,numpy,floating-point,floating-point-precision

The type of your diff-array is the type of H1 and H2. Since you are only adding many 1s you can convert diff to bool: print diff.astype(bool).sum() or much more simple print (H1 == H2).sum() But since floating point values are not exact, one might test for very small differences:...

swift,url,post,floating-point,decimal

stringByAddingPercentEscapesUsingEncoding(NSUTF8StringEncoding) returns optional. It needed to be unwrapped.

python,performance,floating-point,cpython,python-internals

A comment in the Python source code for float objects acknowledges that: Comparison is pretty much a nightmare This is especially true when comparing a float to an integer, because, unlike floats, integers in Python can be arbitrarily large and are always exact. Trying to cast the integer to a...

c++,floating-point,type-conversion

Overflow in the conversion from floating-point to integer is undefined behavior. You cannot rely on it being done with a single assembly instruction or with an instruction that overflows for the exact set of values for which you would like the overflow flag to be set. The assembly instruction cvttsd2si,...

assembly,floating-point,x86-64,sse

You should take some time to study the instruction set reference, so you at least get a rough idea what kind of possibilities you have. Also, you should read the appropriate ABI docs for the calling convention. That said, the answer to your first question is float return values should...

There are a number of platforms that don't take much care with their math library on which exp2(x) is simply implemented as exp(x * log(2)) or vice-versa. These implementations do not deliver good accuracy (or especially good performance), but they are fairly common. On platforms that do this, one function...

c#,floating-point,readprocessmemory

int dataRead = ReadAddress("solitaire", "solitaire.exe+97074 2c 10"); Gives you (hopefully) 4 bytes in an int. After you've read this int, you need to convert it to float, which means byte[] bytesOfTheNumber = BitConverter.GetBytes(dataRead); converting the integer number into a byte array first, and then float theFloatYouWant = BitConverter.ToSingle(bytesOfTheNumber, 0); using...

When you divide two ints you perform integer division, which, in this case will result in 22/64 = 0. Only once this is done are you creating a float. And the float representation of 0 is 0.0. If you want to perform floating point division, you should cast before dividing:...

False: 0f * NAN == NAN 0f * INFINITY == NAN and ... 0f * -1f == -0f (negative 0f), with 0f == -0f :-) (on Intel, VC++, and probably on any platform that uses IEEE 754-1985 floating points) Example on ideone (that uses GCC on some Intel compatible platform...

This would be one way std::string temp = "00000000000000000000000101111100"; assert(temp.length() == 32); int n = 0; for(int i = 0; i < temp.length(); ++i) { n |= (temp[i] - 48) << i; } float f = *(float *)&n; ...

c,arrays,struct,floating-point,malloc

There are a lot of issues with your code. I would advise you to practice more with C basics before attempting to do this. Here is approximation of what you might have wanted to achieve with your code: #include <stdio.h> #include <string.h> // This structure can hold array of floats...

You can see why here: http://stackoverflow.com/a/20910712/1073171 Thus, you can fix this code by changing your defines to either: #define MaxInt32 (int32)0x7FFFFFFF #define MinInt32 (int32)0x80000000 Or else: #define MaxInt32 (int32)2147483647 #define MinInt32 (int32)(-2147483647 - 1) The reasoning's are given in the answer I linked above. If you're using GCC, you could...

python,python-2.7,floating-point,division

repr shows more digits (I'm guessing just enough to reproduce the same float): >>> print result 100.0 >>> print repr(result) 99.99999999999997 >>> result 99.99999999999997 >>> print step 4.7619047619 >>> print repr(step) 4.761904761904762 >>> step 4.761904761904762 ...

linux-kernel,floating-point,syscall

On most modern processors, simple double addition such as 3.14 + 5.26 would be done by hardware commands, just like integer addition.

python,list,floating-point,integer

If ordering isn't important, just use type and a set comprehension. input = [1, 1.0, '1', -1, 1] output = [x[0] for x in {(y, type(y)) for y in input}] print output You can use the same idea to preserve ordering. output = [] output_filter = set() for y in...

c++,linux,floating-point,vectorization

Floating point CPU registers can be larger than the floating point type you're working with. This is especially true with float which is typically only 32 bits. A calculation will be computed using all the bits, then the result will be rounded to the nearest representable value before being stored...

javascript,floating-point,parsefloat

The answer is in your question; toFixed(2): var MoHCost = parseFloat($('#pucMoH').text()).toFixed(2); var SelectedVal = parseFloat($(this).val()).toFixed(2); var SelectedSum = (MoHCost * SelectedValInt).toFixed(2); ...

debugging,android-ndk,floating-point,arm

The ARM ARM has this to say about all the exception-trapping bits in FPSCR: [...]. This bit is RW only if the implementation supports the trapping of floating-point exceptions. In an implementation that does not support floating-point exception trapping, this bit is RES0. The Tegra K1 SoC in the SHIELD...

The example implicitly declares a binary floating point format, which has an arbitrary precision exponent, but only 2 bits in the mantissa. All numbers are of format 1.xx * 2^n. When one performs floating point addition, one must de-normalize or scale the arguments to have the same exponent. 0.25 =...

ios,swift,floating-point,floating-point-precision,floating-point-conversion

This is due to the way the floating-point format works. A Float is a 32-bit floating-point number, stored in the IEEE 754 format, which is basically scientific notation, with some bits allocated to the value, and some to the exponent (in base 2), as this diagram from the single-precision floating-point...

One way to do this is to use the toFixed method off a Number combined with parseFloat. Eg, var number = 1501.0099999999999909; var truncated = parseFloat(number.toFixed(5)); console.log(truncated); toFixed takes in the number of decimal points it should be truncated to. To get the output you need, you would only need...

assembly,floating-point,x86,sse,avx

Here's an alternate implementation which uses fewer instructions: static inline void maxminmag_test(__m128d & a, __m128d & b) { __m128d cmp = _mm_add_pd(a, b); // test for mean(a, b) >= 0 __m128d amin = _mm_min_pd(a, b); __m128d amax = _mm_max_pd(a, b); __m128d minmag = _mm_blendv_pd(amin, amax, cmp); __m128d maxmag = _mm_blendv_pd(amax,...

Use given code to get only 22.0 string value = "float;#22.0000000000000"; var number = value.Split('#')[1]; double num = double.Parse(number); Console.WriteLine(num.ToString("0.0")); Result: 22.0 mentioned code should be work. item.GetFormattedValue("ColumnName") OR try this post https://social.msdn.microsoft.com/Forums/office/en-US/fdef03db-9678-46cf-8ff7-03551f4b8466/how-to-convert-a-decimal-number-field-to-c?forum=sharepointdevelopmentprevious...

Since x = -10, x**i will alternate between positive and negative high values, and so will -(x**i) which is what is calulated when you write -x**i .np.exp(inf) = inf and np.exp(-inf) = 0 so for high enough numbers, you're alternating between infinity and 0. You probably wanted to write np.exp((-x)**i),...

c,gcc,floating-point,c11,sigfpe

After running the code, this was under cygwin, gdb dumped the trace. $ cat sigfpe.exe.stackdump Exception: STATUS_INTEGER_DIVIDE_BY_ZERO at rip=00100401115 rax=0000000000000000 rbx=000000000022CB20 rcx=0000000000000001 rdx=0000000000000000 rsi=000000060003A2F0 rdi=0000000000000000 r8 =0000000000000000 r9 =0000000000000000 r10=0000000000230000 r11=0000000000000002 r12=0000000000000000 r13=0000000000000001 r14=000000000022CB63 r15=000000000022CB64 rbp=000000000022CAD0 rsp=000000000022CAA0 program=C:\cygwin64\home\luser\sigfpe.exe, pid...

Your string to float conversions are working exactly as advertised, although obviously not exactly as you expect :-) IEEE754 single precision floating point values (as used in float) only have a precision of about seven decimal digits. If you want more precision, use double, which provides about fifteen decimal digits,...

To store 6.0 with fstp, you first need 6.0 in a floating point register. The easiest way to get it there is to load it from memory.. I think that sort of misses the point in this case. Anyway you can use a normal integer mov to store it, just...

Because of how floats and decimals are represented, the casts could be causing different results. For: double value1 = 0.999939f * 0.987792f you are multiplying two floats and then casting it to a double. The resulting float representation is the one being converted to a double. For: double value2 =...

python,string,replace,floating-point,integer

(?<=\d)\.0+\b You can simple use this and replace by empty string via re.sub. See demo. https://regex101.com/r/hI0qP0/22 import re p = re.compile(r'(?<=\d)\.0+\b') test_str = "15.0+abc-3" subst = "" result = re.sub(p, subst, test_str) ...

jsf,jsf-2,floating-point,converter

In a nutshell: floating-point-gui.de If you value precision, you should be using java.math.BigDecimal instead of float/double/java.lang.Float/java.lang.Double....

Here is the general method (maybe there is a more clever or mathematical way): Multiply by 2 repeatedly until it has no more decimal component. Take the resulting number and multiply by 2^-n where n is the number of iterations it took to get there. Your example: 3.5625 3.5625 *...

floating-point,julia-lang,multiplication

Some options: Use the inbuilt Rational type. The most accurate and fastest way would be 16//100 * 16//100 If you're using very big numbers these might overflow, in which case you can use BigInts instead, big(16)//big(100) * big(16)//big(100) (you don't actually need to wrap them all in bigs, as the...

Looks right. You can actually get rid of function calls for platforms which implement IEEE 754 (Intel's, Power's and ARMs do) because the special floating point values can be determined without calls. bool equiv(double x, double y) { return (x == y && (x || (1 / x == 1...

You are using the wrong format for scanf and printf - it should be "%f", not "%.f".

binary,floating-point,decimal,floating-point-precision,significant-digits

The precision is fixed, which is exactly 53 binary digits for double-precision (or 52 if we exclude the implicit leading 1). This comes out to about 15 decimal digits. The OP asked me to elaborate on why having exactly 53 binary digits means "about" 15 decimal digits. To understand this...

c++,algorithm,floating-point,logarithm

This is needed in order (li + ls) / 2 to work correctly. For example: 0.999 - 0.9981 = 0.0009 < 0.001 but: (0.999 + 0.9981) / 2 = 0.99855 On the other hand: (0.9999 + 0.9998) / 2 = 0.99985 which rounds up to 1, when rounding to 3rd...

python,numpy,floating-point,blas

Floating point calculations are not always reproducible. You may get reproducible results for floating calculations across different machines if you use the same executable image, inputs, libraries built with the same compiler and identical compiler settings (switches). However if you use a dynamically linked library you may get different results,...

c,gcc,floating-point,floating-point-precision

I'm surprised this does not work because I explicitly define the associativity of my operations when it matters. For example in the following code I but parentheses where it matters. This is exactly what -fassociative-math does: it ignores the ordering defined by your program (which is just as defined...

c,floating-point,char,type-conversion

Although it's not the same as "converting a char into a float", from various hints in your question I think what you really want is this: operand = operand - '0'; This converts the (usually) ASCII value in operand into the value that it represents, from 0 - 9. Typically,...

java,string,floating-point,formatting,type-conversion

System.out.println(String.format("%.1g%n", 0.9425)); System.out.println(String.format("%.1g%n", 0.9525)); System.out.println(String.format( "%.1f", 10.9125)); returns: 0.9 1 10.9 Use the third example for your case...

c++,floating-point,language-lawyer,floating-point-precision,minimum

If std::numeric_limits<F>::is_iec559 is true, then the guarantees of the IEEE 754 standard apply to floating point type F. Otherwise (and anyway), minimum permitted values of symbols such as DBL_DIG are specified by the C standard, which, undisputably for the library, “is incorporated into [the C++] International Standard by reference”, as...

Both are well defined behaviour. Quoting from http://en.cppreference.com tan If the argument is ±0, it is returned unmodified. If the argument is ±∞, NaN is returned and FE_INVALID is raised. If the argument is NaN, NaN is returned. atan If the argument is ±0, it is returned unmodified. If the...

java,floating-point,formatting

Either I misread the question or it changed while I was typing, my original answer suggesting DecimalFormat won't do what you're looking for. Regular expressions will: final String unformatted = Double.toString(6381.407812500002); final String formatted = unformatted.replaceAll("00+\\d*$", ""); System.out.println(formatted); Yields: 6381.4078125 You asked for any more than two zeroes and all...

go,floating-point,type-conversion,floating-point-conversion

The problem here is the representation of constants and floating point numbers. Constants are represented in arbitrary precision. Floating point numbers are represented using the IEEE 754 standard. Spec: Constants: Numeric constants represent values of arbitrary precision and do not overflow. Spec: Numeric types: float64 the set of all IEEE-754...

The problem is due to the fact that the intermediate calculations are being performed in a higher precision, and the rules for when to round back to float precision are different in each case. According to the docs By default, in code for x86 architectures the compiler uses the coprocessor's...

python,while-loop,binary,floating-point,decimal

I don't think this code is written all that well, but here's a rough idea. The first while loop: while ((2**p)*x)%1 != 0: ... is figuring out how many places in binary to the right of the decimal point will the result be. To use a familiar analogy let's work...

python,arrays,list,numpy,floating-point

Just access the first item of the list/array, using the index access and the index 0: >>> list_ = [4] >>> list_[0] 4 >>> array_ = np.array([4]) >>> array_[0] 4 This will be an int since that was what you inserted in the first place. If you need it to...

c,casting,floating-point,printf

7 / 3 without a cast is integer division, resulting in the integer 2, since any remainder is discarded, so it is equivalent of calling printf("%0.2f\n", 2); Now, why does that print 0 instead of 2? Well, you told it it is a float in the format string ("%0.2f"). So...

python,python-2.7,floating-point,bigfloat

As an example, here's how you might compute f1 to a precision of 256 bits using the bigfloat library. >>> from bigfloat import BigFloat, precision >>> with precision(256): ... x = BigFloat('1e-19') ... y = BigFloat('9e9') ... z = BigFloat('1e-18') ... f1 = x * x * y / ((1...

c,floating-point,floating-point-conversion

This is the actual article. The blog post is referring to the results in section 7 (in other words, “they” are not comparing anything in the blog post, “they” are regurgitating the information from the actual article, omitting crucial details): Implementations of Dragon4 or Grisu3 can be found in implementations...

python,ruby,haskell,floating-point,erlang

This happens because all the languages are using the same numerical representation for non-integer numbers: IEEE 754 floating point numbers with, most likely, the same level of precision. (Either 32-bit "floats" or 64-bit "doubles", depending on how your system and languages are configured.) Floating point numbers are the default choice...

c++,math,floating-point,fortran,benchmarking

I found some details in some random papers I happened to be browsing: Design issues in division and other floating-point operations by Oberman, S.F. Reconfigurable Custom Floating-Point Instructions by Jin et. al These papers plot the relative frequency of multiply, add, subtract, divide, and square root....

If you add a positive and a negative number that are very close to each other, producing a denormal, then the underflow flag is set (if underflow exception is not masked). Link to intel article denormals and underflow. Example code (using Microsoft compiler): double a,b,c; a = 2.2250738585072019e-308; b =...

I found the problem and the solution reading the comments :) The problem: I was retrieving data from a webservice via curl and json_encoding the result. At this point, json_encode converts bigints into floats. I solved it passing JSON_BIGINT_AS_STRING argument to json_encode. It's documented in PHP documentation: http://php.net/manual/es/function.json-decode.php...

The same as you would compare floating point numbers in any other language. Take the absolute value of the difference of the numbers and compare it against your acceptable delta. let delta: CGFloat = 0.00001 let a: CGFloat = 3.141592 let b: CGFloat = 3.141593 if abs(a-b) < delta {...

Floating-point numbers don't have decimal places. They have binary places, and the two are not commensurable. Any attempt to modify a floating-point variable to have a specific number of decimal places is doomed to failure. You have to do the rounding to a specified number of decimal places after conversion...

Probably your system has a different format for fractional numbers than what your file was written with. By default, float.Parse will use your system's locale settings to decide on this. To manually specify a format, you can use another overload: float.Parse(splittedLine[2], CultureInfo.InvariantCulture); ...

floating-point,floating-point-conversion

I post my code, which is not optimized, but works. If you have further suggestion please... let me know, especially for some optimization. void cast(uint64_t __x, uint64_t* __y, uint64_t __exs, uint64_t __mxs, uint64_t __eys, uint64_t __mys) { /* Input : __x -> number to be casted __y -> casted value...

floating-point,double,ieee-754,floating-point-conversion,significant-digits

First, for this question it is better to use the total significand sizes 24 and 53. The fact that the leading bit is not represented is just an aspect of the encoding. If you are interested only in a vague explanation, one decimal digits contains exactly log2(10) (about 3.32) bits...

c++,visual-studio,floating-point

Yes, it is negative infinity. To be sure, you could test it against: float.isNegativeInfinity...

java,floating-point,number-rounding

One way to do this is to convert the price to cents, divide by 5, round to, multiply by 5 again and convert back to dollars: double rounded = Math.round(num * 100.0 / 5.0) * 5.0 / 100.0; ...

python,powershell,floating-point,decimal

Change print "You are %d years old." % age to print "You are %g years old." % age By the way, eval is not necessary there, the floatconversion suffices....

assembly,floating-point,x86,fpu

Let's start with the first question. It isn't explicitly stated, but I think we can assume we're dealing with little-endian here (every PC you'll get to use today will use that). Thus, if you execute FLD m64p on that memory location, the floating point stack will contain these bytes in...

php,floating-point,numbers,integer

It is pretty simple. modulo (%) is the operator you want, it determines if there would be a remainder if x is divided by y... for example (3 % 2 = 1) and (4 % 2 = 0). This has been asked before too - pretty common question - you...

c,floating-point,double,double-precision

Double are represented as m*2^e where m is the mantissa and e is the exponent. Doubles have 11 bits for the exponent. Since the exponent can be negative there is an offset of 1023. That means that the real calculation is m*2^(e-1023). The largest 11 bit number is 2047. The...

php,mysql,floating-point,floating-point-precision,decimal-point

Understand that float is not a precise decimal number but a value that's close to it. The way it's stored is very efficient at the cost of not being exact. Without knowing your exact needs, one possible solution is to store the numbers as floats and round them to one...

c++,matlab,floating-point,ieee-754

Firstly, if your numerical method depends on the accuracy of sin to the last bit, then you probably need to use an arbitrary precision library, such as MPFR. The IEEE754 2008 standard doesn't require that the functions be correctly rounded (it does "recommend" it though). Some C libms do provide...

You want to use the JB and JA (jump below/above) instructions instead of JL/JG. The COMISS instruction sets the flags as if it were two unsigned integers being compared. This makes the effect on the flags simpler. The COMISS instruction's effect on flags is documented in the Intel 64 and...

There is such a thing as "floating point promotion" of float to double per [conv.fpprom]. A prvalue of type float can be converted to a prvalue of type double. The value is unchanged. This conversion is called floating point promotion. The answers to the linked question are correct. This promotion...

javascript,floating-point,rounding,floor

function formatNumber(x) { // convert it to a string var s = "" + x; // if x is integer, the point is missing, so add it if (s.indexOf(".") == -1) { s += "."; } // make sure if we have at least 2 decimals s += "00";...

python,floating-point,logarithm,natural-logarithm

To move to log space, use log. To move back again use exp. The rules in log space are different - eg. to perform multiplication is to add in logspace. >>> from math import log, exp >>> log(0.0000003) -15.01948336229021 >>> exp(-15.01948336229021) 3.0000000000000015e-07 >>> log(0.0000003) + log(0.0000003) -30.03896672458042 >>> exp(-30.03896672458042) 9.000000000000011e-14...

c#,matlab,file,binary,floating-point

You can use BinaryWriter to write the data to a file very easily: foreach (var value in samples32array) { writer.Write(value); } Now BinaryWriter is guaranteed to use little-endian format, so in your Matlab call, you should specify a machinefmt value of l to explicitly read it in little-endian format too....

c,pointers,floating-point,int,type-conversion

This is not correct. int and float are not guaranteed to have the same alignment. Remember: Casting a value and casting a pointer are different scenarios. Casting a pointer changes the way to refer to the type value, which can almost certainly result in a mis-alignment in most of the...

c,assembly,floating-point,printf,nasm

Your dist is a single 32-bit float, but printf needs a double 64-bit float. You have to transform it: %macro print_dist2 1 section .data .str db %1,0 ; %1 is macro call first actual parameter section .text sub esp, 8 ; Space for a 64-bit double floating point number fld...

The cast does not apply to the entire expression 123/33. You're casting the value 123 to a float, which causes any further operations to use floating-point math.

split,floating-point,floating-accuracy

(using double-precision for the examples) I was wondering if it is possible to split somehow x = (I-1.0) + (f + 1.0), namely without floating point rounding error. There is no chance of obtaining such a split for all values. Take 0x1.0p-60, the integral part of which is 0.0 and...

c,floating-point,printf,addition

According to IEEE 754 encoding, many numbers will have small changes to allow them to be stored.Also, the number of significant digits can change slightly since it is a binary representation, not a decimal one. Single precision (float) gives you 23 bits of significand, 8 bits of exponent, and 1...

Flash does not like floating points. At all. Avoid them as much as you can.

It contains a modf function, that is like a swiss-knife, that allows you to define truncatef and roundf functions: # let truncatef x = snd (modf x);; val truncatef : float -> float = <fun> # truncatef 3.14;; - : float = 3. round can be also expressed with modf...

I'd be really tempted in this case to go with the simplest possible option. set Output [open "Output3.txt" w] set FileInput [open "Input.txt" r] set Desire 174260 while {[gets $FileInput line] >= 0} { lassign [regexp -inline -all {\S+} $line] key x1 y2 z1 if {$key == $Desire} { puts...

You cant multiply or devise float and decimal.Decimal() types, what I would suggest is multiplying by Decimal('0.0023') and Decimal('0.5'): for a in range(0,size) et = Decimal('0.0023')*ralist[rows[a][2]] * ( Decimal('0.5')*(rows[a][3] + rows[a][4]) + 17.8 ) * ( rows[a][3] - rows[a][4])**(0.5) eto_values.insert(a,et) ...

That is a good start. You can reduce one more addition: // Perform complex multiplication on the input and accumulate with the output pDstR[i] += fmsub(fSrc1R, fSrc2R, fSrc1I*fSrc2I); pDstI[i] += fmadd(fSrc1R, fSrc2I, fSrc2R*fSrc1I); Here you can make use of another fmadd in the calculation of the imaginary part: pDstI[i] =...

floating-point,ocaml,floating-point-conversion

float_of_string should be able to parse them: # float_of_string "0x1.199999999999Ap1";; - : float = 2.2 However as Alain Frisch noted on OCaml's bug tracker the support actually depends on the underlying libc and won't currently work on the MSVC ports....

To avoid the confusion between int and float and to avoid the issues with arithmaic operations between the two types, I propose scan() the input as float/double. Use sprintf() to convert that as a string. Tokenize the string using strtok() based on the delimiter . (decimal point). First part (token)...

Floating types are defined by §5.2.4.2.2 of the C standard (draft N1570 at least): The characteristics of floating types are defined in terms of a model that describes a representation of floating-point numbers and values that provide information about an implementation’s floating-point arithmetic.21) The following parameters are used to define...

python,arrays,list,numpy,floating-point

No numerical errors are being introduced when you convert the array to a list, it's simply a difference in how the floating values are represented in lists and arrays. Calling list(a) means you get a list of the NumPy float types (not Python float objects). When printed, the shell prints...

python,floating-point,ctypes,floating-accuracy

Jan Rüegg is right - this is just how floats work. If you're wondering why this only shows up with c_float, it's because c_floats print as "c_float({!r}).format(self.value). self.value is a double-precision Python float. Python's float type prints the shortest representation that converts to the floating point number, so although float(0.2)...