© Tony Lawrence, aplawrence.com

## Understanding Floating Point Formats

See also Understanding Packed BCD numbers

Under ordinary circumstances, you don't have to know or care how numbers are represented within your programs. However, when you are transferring data files that contain numbers, you will have to convert if the storage formats are not identical. If the numbers are just integers, that's fairly easy because the only differences will be the length and the byte order: how many bytes the number takes up, and whether it is stored lsb or msb (least significant byte or most significant byte first). Once you know that, conversion is trivial.
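As a small illustration of the integer case, here is a Python sketch (Python rather than the Perl used later in this article, and the value `0x12345678` is just a made-up example) that writes the same 4-byte integer in both byte orders and shows what happens if you read it back with the wrong one.

```python
import struct

# Hypothetical example value standing in for a 4-byte integer from a data file.
value = 0x12345678

little = struct.pack("<I", value)   # least significant byte first
big = struct.pack(">I", value)      # most significant byte first

print(little.hex())                 # 78563412
print(big.hex())                    # 12345678

# Reading the bytes with the order they were written in recovers the number.
assert struct.unpack("<I", little)[0] == value

# Reading them with the other order gives a different number entirely.
wrong = struct.unpack(">I", little)[0]
print(hex(wrong))                   # 0x78563412
```

Once the length and the byte order are known, the conversion really is just a matter of choosing the right format string.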

Floating point numbers are a whole other game. For example, in December of 1983, I had to convert some Tandy Basic programs and data files to Xenix MBASIC. The Basic programs themselves were fairly challenging, but the data files were even more so. Tandy stored floating point numbers in what they called "XS128 notation" (Excess 128 is what they really meant) and MBASIC used packed BCD. At the time, I had never given a single thought to how floating point numbers are stored. As you surely realize, this was long before you could ask Google to find you something like MAD 3401 IEEE Floating-Point Notes, and the availability of computer-oriented books was not anything like it is today. I was on my own, with only "od -cx", my wits, and pure stubbornness to go on. There was an explanation in the manuals, but it was typical geek-babble and it made my head hurt.

It took me several hours of painful work to understand what I needed to do, and a few hours more to write programs to do it, but the project got done. I haven't had to do anything like that since then, and you may never have had to at all, but that doesn't mean that neither of us ever will. So rather than you getting a headache from trying to puzzle it out (because there's still a lot of techno-babble out there), I'll get you started.

The first thing you need to know is that your machine may give different results than mine. It probably won't unless you are using something odd, but if it does, don't panic: the theory is still the same; you just have a slightly different implementation. Here's a Perl program that is going to show us what's going on (you do not need to understand this script):

    #!/usr/bin/perl
    use strict;
    use warnings;

    showbits(0);
    for (my $x = 1; $x < 16384; $x *= 2) {
        showbits($x);
    }
    showbits("5.75");
    showbits("-.1");

    # Print a number, its stored bytes in hex, and the bits split into
    # sign, exponent and value.
    sub showbits {
        my $x = shift;
        # "f>" packs a single precision float most significant byte first
        # (needs Perl 5.10 or later). Plain "f" uses the machine's native
        # order, which reverses the bytes on little-endian hardware.
        my $string = pack("f>", $x);
        my $hex    = uc(unpack("H*", $string));
        my $bits   = unpack("B32", $string);
        print "$x\t$hex\t";
        print substr($bits, 0, 1), " ", substr($bits, 1, 8), " ", substr($bits, 9), "\n";
    }

We're looking at single precision floating point numbers here. Double precision uses the same scheme, just more bits. Here's what the output looks like:

    0       00000000   0 00000000 00000000000000000000000
    1       3F800000   0 01111111 00000000000000000000000
    2       40000000   0 10000000 00000000000000000000000
    4       40800000   0 10000001 00000000000000000000000
    8       41000000   0 10000010 00000000000000000000000
    16      41800000   0 10000011 00000000000000000000000
    32      42000000   0 10000100 00000000000000000000000
    64      42800000   0 10000101 00000000000000000000000
    128     43000000   0 10000110 00000000000000000000000
    256     43800000   0 10000111 00000000000000000000000
    512     44000000   0 10001000 00000000000000000000000
    1024    44800000   0 10001001 00000000000000000000000
    2048    45000000   0 10001010 00000000000000000000000
    4096    45800000   0 10001011 00000000000000000000000
    8192    46000000   0 10001100 00000000000000000000000
    5.75    40B80000   0 10000001 01110000000000000000000
    -.1     BDCCCCCD   1 01111011 10011001100110011001101

The first column is the number itself, and the second is what the stored format looks like in hex. After that come the actual bits; I've separated them in this odd way for a very good reason (which will become clear later). The value "5.75" is stored as "01000000101110000000000000000000" or "40B80000" (hex).

You might easily guess that the first bit is the sign bit. I think that's what I first grokked back in 1983 too. The next 8 bits are used for the exponent, and the last 23 are the value (the mantissa). As you will no doubt notice, the value bits from 0 to 8192 are all empty, so I must be crazy and there's no point in reading this trash any further.

Well, actually there is. There's a hidden bit there that isn't stored but is always assumed. If you are really compulsive and counted the bits, you see that only 23 value bits are there (1 + 8 + 23 = 32 stored bits, or 4 bytes). The hidden bit makes the value 24 bits and is always 1; zero is a special case, covered near the end. So, if we add the hidden bit, the bits would look like:

    0       0 00000000 100000000000000000000000
    1       0 01111111 100000000000000000000000
    2       0 10000000 100000000000000000000000
    4       0 10000001 100000000000000000000000
    8       0 10000010 100000000000000000000000
    16      0 10000011 100000000000000000000000
    32      0 10000100 100000000000000000000000
    64      0 10000101 100000000000000000000000
    128     0 10000110 100000000000000000000000
    256     0 10000111 100000000000000000000000
    512     0 10001000 100000000000000000000000
    1024    0 10001001 100000000000000000000000
    2048    0 10001010 100000000000000000000000
    4096    0 10001011 100000000000000000000000
    8192    0 10001100 100000000000000000000000
    5.75    0 10000001 101110000000000000000000
    -.1     1 01111011 110011001100110011001101

But remember, it's what I showed above that is really there.
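If you would rather poke at the stored bits from Python than from Perl, the following sketch does the same split into sign, exponent and value bits. The function name `split_fields` is made up for this example, and the format `">f"` forces most-significant-byte-first order so the sign bit always comes out first.

```python
import struct

def split_fields(x):
    """Return (sign, exponent, mantissa) bit strings for x stored as a 4-byte float."""
    # Pack as big-endian single precision, then reread the same bytes as an integer.
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(n, "032b")
    # 1 sign bit, 8 exponent bits, 23 stored value bits (the hidden bit is not here).
    return bits[0], bits[1:9], bits[9:]

for x in (1.0, 2.0, 5.75, -0.1):
    print(x, *split_fields(x))
```

Only the 23 stored bits come back; the hidden bit has to be added by whoever decodes the number.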

One more thing: there's an implied decimal point (strictly speaking, a binary point) after that hidden bit. To get the value of bits after the point, start dividing by two: so the first bit after the (implied) decimal point is .5, the next is .25 and so on. We don't have to worry about any of that for the powers of two, because obviously those are whole numbers and the bits will be all 0. But down at the 5.75 we see that at work:

First, looking at the exponent for 5.75, we see that it is 129. Subtracting 127 gives us 2. So 1.0111 times 2^2 becomes 101.11 (simply shift the point 2 places to the right to multiply by 4). So now we have 101 binary, which is 5, plus .5 plus .25 (.11) or 5.75 in total. Too quick?

Taking it in detail:

Exponent: 10000001, which is 129 (use the Javascript Bit Twiddler if you like). Subtracting 127 leaves us with 2.

Mantissa: 01110000000000000000000

Add in the implied bit and we have 101110000000000000000000, with implied decimal point that's 1.01110000000000000000000

Multiply that by 2^2 to get 101.110000000000000000000

That is 4 + 1 + .5 + .25 or 5.75
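The steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the name `decode` is invented here, the fields are passed as bit strings, and zero and the reserved exponent values are not handled.

```python
def decode(sign, exponent, mantissa, bias=127):
    """Decode bit strings for an ordinary single precision number."""
    value = 1.0                  # the hidden bit, sitting just left of the point
    weight = 0.5                 # the first stored bit is worth .5
    for bit in mantissa:
        if bit == "1":
            value += weight
        weight /= 2              # then .25, .125 and so on
    # Undo the bias, then scale by that power of two.
    value *= 2.0 ** (int(exponent, 2) - bias)
    return -value if sign == "1" else value

print(decode("0", "10000001", "01110000000000000000000"))  # 5.75
print(decode("0", "10001010", "00000000000000000000000"))  # 2048.0
```

The `bias` parameter is there only so the same sketch can be tried with a different excess value.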

Look at 2048. The exponent is 128 + 8 + 2 or 138; subtract 127 and we get 11. Use the Bit Twiddler if you don't see that. The mantissa is all 0's, which with the implied bit makes this 1.00000000000000000000000 times 2^11. What's 2^11? It's 2048, of course.

Now the -.1. This actually can't be stored precisely, but the method is still the same. The exponent is 64 + 32 + 16 + 8 + 2 + 1 or 123. Subtract 127 and we get -4, which means the decimal point moves 4 places to the left, making our value .000110011001100110011001101. Now you understand why it's stored after adding 127: it's so we can end up with negative exponents. If we calculate out the binary, that's .0625 + .03125 + .00390625 and on to ever smaller numbers which get us very, very close to .1 (but off slightly). The sign bit was set, so it's a -.1
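To see how close "very, very close" is, this Python sketch rounds -0.1 to single precision and prints the exact value that ends up stored. The variable names are made up; the only assumption is that `struct`'s `">f"` format stores an IEEE single, which is the usual case.

```python
import struct
from decimal import Decimal

# Round -0.1 to a 4-byte float and read it back.
stored = struct.unpack(">f", struct.pack(">f", -0.1))[0]

print(Decimal(stored))      # the exact decimal expansion of what was stored
print(stored == -0.1)       # False: it is near -0.1 but not equal to it

# The first four set bits of .000110011... are worth 1/16, 1/32, 1/256 and 1/512.
partial = 1/16 + 1/32 + 1/256 + 1/512
print(partial)              # 0.099609375
```

Each further pair of bits adds a smaller correction, and the sum creeps toward .1 without ever landing on it.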

The Tandy (and DEC VAX, by the way) "excess 128" exponent storage simply changes the range of positive versus negative exponents - other than that, it works just like this.
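A rough Python sketch of what the bias does to the exponent byte follows. It models only the "add the excess" step; `stored_exponent` is an invented name, and nothing here claims to reproduce the full Tandy or VAX layouts, which may differ in other details such as byte order.

```python
def stored_exponent(true_exponent, bias):
    """Biased ("excess") form of an exponent, as written to an 8-bit field."""
    stored = true_exponent + bias
    if not 0 <= stored <= 255:
        raise ValueError("exponent does not fit in 8 bits")
    return stored

# Same true exponents, stored with excess 127 and with excess 128.
for e in (-4, 0, 2, 11):
    print(e, format(stored_exponent(e, 127), "08b"), format(stored_exponent(e, 128), "08b"))
```

Either way the stored byte is never negative, which is the whole point of the trick.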

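Finally, there are reserved exponent values. An exponent of all 0's with a mantissa of all 0's is 0 (there is no hidden bit in that case). An exponent of all 1's with a mantissa of all 0's is infinity, which is what you get when a result is too large for the format to hold, or when you divide a non-zero number by zero. An exponent of all 1's with any other mantissa is NaN (Not A Number), the result of something meaningless like dividing zero by zero. Results too small to hold end up as 0, or as a "denormal" (an exponent of all 0's, a non-zero mantissa, and no hidden bit).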
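The reserved patterns are easy to look at from Python. In this sketch `bits32` is a made-up helper, and `math.inf` and `math.nan` are used directly because Python raises an exception on division by zero rather than returning them.

```python
import math
import struct

def bits32(x):
    """Show x as single precision bits: sign, exponent, mantissa."""
    (n,) = struct.unpack(">I", struct.pack(">f", x))
    b = format(n, "032b")
    return f"{b[0]} {b[1:9]} {b[9:]}"

print(bits32(0.0))        # 0 00000000 00000000000000000000000
print(bits32(math.inf))   # 0 11111111 00000000000000000000000
print(bits32(math.nan))   # exponent all 1's, mantissa not all 0's
```

The exact mantissa bits of a NaN can vary, so the sketch does not promise a particular pattern there.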

That's it. Take a look at the link at the beginning if you want to go a little deeper, but this is probably all you need to get started.


2003-09-01 Tony Lawrence

"Tandy stored floating point numbers in what they called "XS128 notation" (Excess 128 is what they really meant) and MBASIC used packed BCD."

Tandy used excess 128 because Microsoft used it in their 8 bit BASIC interpreters. Excess 128 worked well with 8 bit CPU's (like the Z80 in Tandy's TRS-80 boxes and the 6510 in the Commodore 64) because the conversion between ASCII and floating binary format could be carried out with simple binary adds, shifts and rotates -- operations that were relatively quick on these processors. The alternative, IEEE floating point format, was best implemented on 16 bit processors, where integer multiply and divide operations were common.

However, excess 128 format had a significant flaw, in that it didn't always accurately represent the fractional component of a number. Despite having 7 significant display digits when returned to ASCII format, excess 128 often caused computation errors that could drive a programmer batty. For example, it was not uncommon for an expression like 1.80-1.79 to result in .00999937 or something similar, a predictable result of trying to convert a base-10 fraction in a base-2 representation. I recall many different schemes that were concocted to work around this nonsense. It was what prompted me to turn to BCD routines in my assembly language programs.

Speaking of packed (or compressed) BCD, the MOS Technology 65xx processor family used in many 1970's and 1980's home computers (e.g., the Commodore 64, Atari, Apple II) could be made to do arithmetic in BCD by setting the decimal bit in the processor's status register with the SED mnemonic. If decimal mode was cleared (the CLD mnemonic), all arithmetic was performed on unsigned 8 bit numbers and a carry would occur during addition if the result exceeded an 8 bit value (the width of the .A or accumulator register). No half-carry was considered.

However, if decimal mode was set, a half-carry would automatically occur if, during addition, the value of the low order nybble went past $09 (this is MOS Technology notation for 0x09) -- $09 + $01 in BCD resulted in $10, not $0A as would be the case in binary mode. A full carry occurred if the accumulator went past $99 (it would roll back to $00 and the processor carry flag would be set). A similar action in reverse occurred if decimal mode subtraction was carried out.

There was a booby-trap in the C-64 that could catch the unwary when the processor was in decimal mode: if either an IRQ or NMI occurred, the kernel ROM handlers would push the processor status register onto the stack, thus preserving the decimal mode setting. However, the interrupt handler failed to return the processor to binary mode prior to continuing, which could result in a crash or at least bizarre behavior when some part of the interrupt handler used an ADC (add with carry) or SBC (subtract with borrow) operation -- the result would not be what was expected. The solution was to either disable IRQ's while decimal mode was in use or modify the interrupt vectors to point to code that would clear decimal mode before executing the main interrupt handler (the restoration of the status register from the stack at the end of the interrupt handler automatically placed the processor back into decimal mode). Of course, there was no way to disable an NMI, so if one occurred during a BCD routine, oh well!

BCD has the advantage of exactly converting from the ASCII representation of decimal numbers to the machine format and back. However, it takes more bytes to represent a given number, and BCD multiplication and division, as well as transcendental functions, tend to execute more slowly than their binary equivalents. Compromise, compromise...

--BigDumbDinosaur
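The decimal-mode addition described in the comment above can be sketched in Python. This is an illustration of packed BCD addition in general, not an emulation of the 65xx: `bcd_add` is an invented name, both inputs are assumed to be valid BCD bytes, and processor flags other than carry are ignored.

```python
def bcd_add(a, b, carry_in=0):
    """Add two packed BCD bytes (two decimal digits each); return (result, carry)."""
    low = (a & 0x0F) + (b & 0x0F) + carry_in
    half_carry = 0
    if low > 9:                  # low nybble went past 9: wrap it, carry into the high digit
        low -= 10
        half_carry = 1
    high = (a >> 4) + (b >> 4) + half_carry
    carry = 0
    if high > 9:                 # result went past $99: wrap and set the carry
        high -= 10
        carry = 1
    return (high << 4) | low, carry

print(hex(bcd_add(0x09, 0x01)[0]))   # 0x10
print(bcd_add(0x99, 0x01))           # (0, 1)
```

In plain binary the first sum would have been 0x0A, which is the difference the decimal flag makes.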

I think you are confused somewhere. XS128 is the same concept as IEEE floating point, and they both are unable to accurately represent numbers. The more bits you can give to the mantissa, the closer you can get, but neither accurately represents something as simple as -.1 (as in the examples of the article).

--TonyLawrence

No confusion. Of course, both excess 128 and IEEE are closely related due to their reliance on binary exponentiation to represent large numbers in a small space. In fact, 4 byte IEEE has exactly the same number of significant decimal digits as excess 128, which is also a 4 byte notation. Excess 128 handles negativity in a different fashion, though. My above wording, after rereading it, does seem to imply that IEEE was more accurate than excess 128 -- that wasn't my intention.

Excess 128 notation was devised as a way to avoid the use of negative 8 bit numbers (i.e., values where bit 7 was set) in twos complement arithmetic. Signed arithmetic operations on 8 bit processors are inefficient and produce only 7 bits of significance, obviously. Plus most 8 bit CPU's do not have a specific means of handling overflow into the sign bit -- they simply set a flag in the status register and leave it up to the programmer to handle the overflow.

Contrast that to a 16 bit processor, in which twos complement produces 15 bit significance, or the 31 bits of a 32 bit CPU. Also, all modern 16 and 32 bit CPU's have the means to deal with sign overflow. Hence excess 128 tends to be faster on older 8 bit CPU's, whereas IEEE format comes into its own on 16, 32 and 64 bit processors.

--BigDumbDinosaur

In the IEEE standards for floating-point numbers (IEEE 754 and 854), single precision (32-bit) uses 8-bit exponents in excess 127 notation (i.e., the bias is 127). Double precision (64-bit) uses 11-bit exponents in excess 1023.

--FredFoobar
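The double precision layout mentioned in the comment above can be checked the same way as single precision. In this Python sketch `split_double` is a made-up name; it splits the 64 bits into 1 sign bit, 11 exponent bits and 52 stored value bits.

```python
import struct

def split_double(x):
    """Return (sign, exponent, mantissa) bit strings for x stored as an 8-byte float."""
    (n,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = format(n, "064b")
    return bits[0], bits[1:12], bits[12:]

sign, exponent, mantissa = split_double(5.75)
print(sign, exponent, mantissa)
print(int(exponent, 2) - 1023)   # 2, the same true exponent as in single precision
```

The true exponent and the leading value bits match the single precision case; only the field widths and the bias change.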

---September 16, 2004

Sat Feb 19 17:55:27 2005:46 anonymous

Gracias, this is very informative and I've bookmarked this page. What did catch me for a while is that .0001 binary was written 0.625 instead of 0.0625 and I missed the small decimal point before the one in -.1. So for five minutes I went crazy wondering where the extra 10 factor came from. Now I understand it's -0.1=-0.0625+-0.03125, etc.

Sat Dec 6 07:03:23 2008:4858 qseep

Why are people so hung up on binary floating point not being able to represent 1/10? It also can't represent 1/3, an even simpler number - and neither can the decimal system. Most floating point calculations are for graphics or scientific calculations (financial calculations are fixed-point), and most floating point results are never converted to decimal. Yes, there are all sorts of accuracy gotchas with floating point - but I think representing 1/10 is the least of them. Now if we used a real base, like 12, we could represent 1/3 as .4, no problem. :)

Sat Dec 6 16:08:34 2008:4863 BigDumbDinosaur

"Why are people so hung up on binary floating point not being able to represent 1/10?"

It isn't a problem until some accountant's spreadsheet tells him that 1.80 - 1.79 = .0099999.

Sat Dec 6 17:34:27 2008:4865 qseep

I understand that would be a problem. But instead of resorting to the less efficient BCD format, it's easier to do calculations on cents instead of dollars internally. Thus 180 cents - 179 cents = 1 cent. You just draw a dot before the last two digits when you show it on screen.

Sat Dec 6 17:55:52 2008:4866 TonyLawrence

:-)

No...

I'm sorry for laughing, but you just don't understand at all. This isn't a matter of where the decimal point is - as the article explains, that's always done with an exponent. This is a matter of precision.

I suggest re-reading more carefully, perhaps with pencil and paper.

Sat Dec 6 20:42:39 2008:4876 qseep

Yes, it is a matter of precision. The accountant enters a value of 1.80. They enter it as a series of ASCII characters, right? That's a decimal representation. So before storing it internally you can move the decimal point two spaces to the right. You represent the value internally as 180 cents, in binary. It could be an integer 180, or the floating point value 180.0. Either one has no loss of precision from the original value. Likewise, 1.79 can be represented internally as the integer 179, or the floating-point value 179.0. When you subtract, you get 1 if using integers, or 1.0 if using floating-point. There's no chance for getting an answer of 0.99999.

Actually, floating-point is never useful in financial calculations, because of the loss of precision. It's better to use integers or fixed-point arithmetic. Choosing floating-point over fixed-point implies that sometimes you won't care about the ones place - e.g., 1E25+1 is the same as 1E25 in single- or double-precision IEEE.

Let's say apples are selling 5 for 79 cents, and you buy one apple. Then you can use fixed-point arithmetic to divide 79.0 by five. Yes, you will not get exactly 15.8, due to the inability of binary representations to represent non-binary fractions. But it doesn't matter because you're going to round the answer up to 16 cents. The store is going to charge you exactly 16 cents, not 15.8.
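The dollars-versus-cents question in this exchange can be tried directly. The Python sketch below rounds each amount to single precision before subtracting; `as_single` is a made-up helper, and it assumes `struct`'s `">f"` format is an IEEE single.

```python
import struct

def as_single(x):
    """Round x to single precision, the way a 4-byte float would store it."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

dollars = as_single(1.80) - as_single(1.79)
cents = as_single(180.0) - as_single(179.0)

print(dollars)      # close to, but not exactly, 0.01
print(cents)        # 1.0
print(180 - 179)    # 1
```

Whole numbers of this size fit entirely in the 24 value bits, so 180.0 and 179.0 are stored exactly, while 1.80 and 1.79 are not.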

Wed Jan 21 10:15:41 2009:5208 hp

qseep - You *will* lose precision if you go the binary floating point way (try it!!!). It does not help that you move the decimal point two places to the right, since the numbers are always stored in exponential form. Thus, 1.79 is stored as .179E1 and 179.0 as .179E3. Same mantissa, no gain in precision. In either case, you'll end up with an answer like .999999 for the mentioned example.

Fri Feb 27 14:49:35 2009:5530 ScottHanson

I was wrestling with the floating-point format until I came across your web page. You managed to explain the concept in simple enough terms that even I could understand. Thank you.

Sat Mar 13 18:10:14 2010:8211 anonymous

Can you help with this?

A 15-bit floating point number representation has 1 bit for the sign of the number, 5 bits for the exponent and 9 bits for the mantissa (which is normalized). Numbers in the exponent are in signed magnitude representation. No bias is used and there are no implied bits. Show the representation for the smallest positive number this machine can represent. What is its value?

Sign Exponent Mantissa

Sat Mar 13 18:17:24 2010:8212 TonyLawrence

No, I'm not going to help you with your homework, other than to say that you need to read the article above.

Sat Mar 13 19:27:14 2010:8213 TonyLawrence

You know what annoys me the most about people like that?

It's that they don't even bother to try rephrasing the question - it's just pasted from their homework. That indicates either extreme laziness or total stupidity (you don't even understand well enough to rephrase).

I wonder how far people like this can actually get before it all falls down?

Sun Jan 9 18:09:09 2011:9220 JoeB

Ha... Just had to comment since I had to convert some old DEC VAX files and had to write a routine that converted the different 32 bit floating point formats.

It took me quite a bit longer than you. How I wish I had found your page earlier!

Sun Jan 9 19:40:37 2011:9221 TonyLawrence

But you still got 'er done, right?

:)

Fri Oct 28 04:37:50 2011:10082 BillFraser

I've been looking for something like this for some time in order to demonstrate how integers are stored into real formats and what happens when you exceed the largest integer that can be stored sequentially - thank you for posting this and your excellent explanation.

One comment: when I ran your script using perl v5.12.2 for MSWin32-x86-multi-thread, the unpack("H*", $string) returned the bytes in reverse order (0000803F instead of 3F800000 for 1), so I un-reversed them:

    $yy = "";
    for ($z = 8; $z > 0; $z -= 2) {
        $yy .= substr($y, $z - 2, 2);
    }

HTH

Fri Oct 28 10:09:53 2011:10084 TonyLawrence

Ahh - interesting! I see that my machines no need the same fix.

Fri Oct 28 10:18:03 2011:10085 TonyLawrence

I don't know what my fingers were thinking :-)

I mean they DO need the same fix. When I wake up a little, I'll figure out why.


Understanding Floating Point Formats Copyright © September 2003 Tony Lawrence
