Have you ever stopped to wonder how your computer handles numbers under the hood? What seems like a simple task—adding, subtracting, or dividing—is actually constrained by strict rules and limitations. Unlike humans, computers operate in a world of finite precision and binary logic. In this post, we explore the foundational ideas of binary numbers and floating-point representation, and how they shape the digital universe.
Precision Has Limits: Welcome to Finite Math
In everyday life, we write huge numbers without thinking twice. But inside a computer, memory is limited. Every number is stored using a fixed number of bits, meaning only a finite number of values can be represented.
Imagine being allowed to write only three decimal digits. You could store anything from 000 to 999, but what about 1001, or -5, or 3.14159? These are outside the allowable set. This limitation leads to two types of errors:
- Overflow: The result is too large (e.g., 600 + 600 = 1200, but 1200 can’t be stored).
- Underflow: The result is too small, often approximated as zero.
Even fundamental properties like the associative law or distributive law can break down in finite-precision arithmetic. This is why numerical computing is both an art and a science.
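A quick way to see one of those laws fail is to try it in ordinary double-precision arithmetic; the values below are chosen only to make the rounding visible:

```python
# The associative law fails in double-precision arithmetic:
# the order of additions decides which rounding error survives.
a, b, c = 1e16, -1e16, 1.0

print((a + b) + c)  # 1.0 -- the huge terms cancel exactly, then c is added
print(a + (b + c))  # 0.0 -- c is rounded away when added to -1e16 first
```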
Binary, Octal, and Hex—The Languages of Machines
Humans use base-10 numbers, but computers prefer base-2: binary. In binary, only two digits exist: 0 and 1.
Here’s how the number 2001 looks in different systems:
System | Representation |
---|---|
Decimal (10) | 2001 |
Binary (2) | 11111010001 |
Octal (8) | 3721 |
Hex (16) | 7D1 |
To keep things unambiguous, notations like 0x7D1 (hex) or 11111010001₂ (binary) are often used.
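To sanity-check a conversion, Python’s built-in bin(), oct(), and hex() functions reproduce the representations from the table above (with 0b, 0o, and 0x prefixes):

```python
n = 2001
print(bin(n))  # 0b11111010001
print(oct(n))  # 0o3721
print(hex(n))  # 0x7d1
```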
How to Convert Between Systems
- Binary ↔ Octal: Group bits in sets of three.
- Binary ↔ Hex: Group bits in sets of four.
- Decimal ↔ Binary: Use subtraction of powers of 2, or repeated division by 2.
For instance, converting 1492 to binary:
1492 ÷ 2 = 746 remainder 0
746 ÷ 2 = 373 remainder 0
...
Continue until quotient is 0, then read remainders bottom-up
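The same procedure fits in a few lines of Python; to_binary is just an illustrative name for this sketch:

```python
def to_binary(n: int) -> str:
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if n == 0:
        return "0"
    bits = []
    while n > 0:
        n, remainder = divmod(n, 2)
        bits.append(str(remainder))   # remainders appear least-significant digit first
    return "".join(reversed(bits))    # reading them bottom-up reverses the order

print(to_binary(1492))  # 10111010100
```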
Representing Negative Numbers in Binary
There are several methods to store negative numbers:
- Signed Magnitude: First bit is the sign (0 = +, 1 = –).
- One’s Complement: Flip all bits for negative.
- Two’s Complement: Flip all bits, then add 1.
- Excess-N: Add a bias (e.g., +128 for 8-bit numbers).
Two’s complement is most common today. It avoids having “+0” and “–0” and simplifies arithmetic. But even this method can’t symmetrically represent all positive and negative values due to the even number of bit combinations.
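A minimal sketch of the two’s-complement rule for an 8-bit word; masking to 8 bits is equivalent to flipping the bits and adding 1, and the function name and sample values are only for illustration:

```python
def twos_complement(value: int, bits: int = 8) -> str:
    """Return the two's-complement bit pattern of `value` in a word `bits` wide."""
    return format(value & ((1 << bits) - 1), f"0{bits}b")

print(twos_complement(5))     # 00000101
print(twos_complement(-5))    # 11111011  (flip 00000101 to 11111010, then add 1)
print(twos_complement(-128))  # 10000000  (there is no +128: the asymmetry mentioned above)
```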
Binary Arithmetic and Overflow
In binary math:
0 + 0 = 0
0 + 1 = 1
1 + 0 = 1
1 + 1 = 0 (with carry)
Overflow detection is critical. If the carry into the sign bit doesn’t match the carry out of the sign bit, something went wrong.
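Here is a sketch of that check for 8-bit addition; add8 is a hypothetical helper written for this post, not a standard routine:

```python
def add8(a: int, b: int):
    """Add two 8-bit two's-complement values and flag signed overflow."""
    mask = 0xFF
    a, b = a & mask, b & mask
    raw = a + b
    result = raw & mask
    carry_out = raw >> 8                    # carry out of the sign bit (bit 7)
    carry_in = ((a ^ b ^ result) >> 7) & 1  # carry into the sign bit
    return result, carry_in != carry_out    # mismatch means signed overflow

print(add8(100, 100))  # (200, True): +100 + +100 exceeds +127, so the sign bit is wrong
print(add8(3, 5))      # (8, False)
```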
Floating-Point Numbers: Riding the Precision Wave
When dealing with massive or tiny numbers (like 9 × 10⁻²⁸ or 2 × 10³³), we turn to floating-point representation, essentially scientific notation for machines:
n = f × 10^e
Here, f is the fraction (or mantissa) and e is the exponent. Computers use the same idea, but with base 2 instead of base 10.
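Python exposes that base-2 split directly through math.frexp, which returns f and e such that x = f × 2^e:

```python
import math

f, e = math.frexp(6.0)
print(f, e)              # 0.75 3, because 6.0 == 0.75 * 2**3
print(math.ldexp(f, e))  # 6.0 -- ldexp puts the pieces back together
```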
Normalized vs. Denormalized
- Normalized: The first bit after the binary point is assumed to be 1 (saving space).
- Denormalized: Used to handle underflows gracefully by sacrificing precision.
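In Python you can watch that graceful underflow happen; sys.float_info.min is the smallest normalized double, and below it values become denormalized instead of jumping straight to zero:

```python
import sys

smallest_normal = sys.float_info.min  # about 2.2250738585072014e-308
print(smallest_normal / 2)            # a denormalized (subnormal) value, not 0.0
print(5e-324)                         # the smallest positive denormal a double can hold
print(5e-324 / 2)                     # only now does the value underflow to 0.0
```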
IEEE 754: The Floating-Point Bible
In 1985, the IEEE standardized how floating-point numbers should work (IEEE 754). Today, most CPUs follow this specification. Key features include:
Format | Size | Exponent Bias | Fraction Bits | Range |
---|---|---|---|---|
Single Precision | 32-bit | 127 | 23 bits | ~±10³⁸ |
Double Precision | 64-bit | 1023 | 52 bits | ~±10³⁰⁸ |
Special values include:
- Infinity (exp = all 1s, frac = 0)
- NaN (“Not a Number”: the result of undefined ops like ∞/∞)
- Zero (exp = 0, frac = 0, with sign bit determining +0 or –0)
- Denormalized numbers: Smooth transition toward 0 when precision can’t be maintained.
IEEE 754 also ensures consistent rounding and error handling across platforms.
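As a sketch, the three fields from the single-precision row of the table above can be pulled out with Python’s struct module; float_fields is just an illustrative name:

```python
import struct

def float_fields(x: float):
    """Split a value into IEEE 754 single-precision sign, exponent, and fraction fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]  # reinterpret the 32 bits as an integer
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # stored with a bias of 127
    fraction = bits & 0x7FFFFF       # 23 explicit fraction bits
    return sign, exponent, fraction

print(float_fields(1.0))           # (0, 127, 0): unbiased exponent 0, implicit leading 1
print(float_fields(float("inf")))  # (0, 255, 0): exponent all 1s, fraction 0
print(float_fields(0.0))           # (0, 0, 0)
```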
Real vs. Floating-Point Numbers
Real numbers form a continuum. Floating-point numbers do not. Only a finite number can be represented, and between each pair, there may be a vast ocean of unrepresentable values.
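The gap is easy to observe: 0.1 has no exact binary representation, so Python stores the nearest representable double, and the error surfaces as soon as you compare sums:

```python
from decimal import Decimal

print(0.1 + 0.2 == 0.3)  # False: each term is only the nearest representable double
print(Decimal(0.1))      # 0.1000000000000000055511151231257827021181583404541015625
```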
Still, floating-point arithmetic gives us:
- Huge dynamic range
- Predictable rounding behavior
- Special handling for edge cases (0, ∞, NaN)
It’s not perfect, but it’s an elegant compromise between speed, precision, and practicality.
Final Thoughts
Understanding how computers handle binary and floating-point numbers reveals the engineering trade-offs behind even the simplest calculations. The next time you see a rounding error or a floating-point glitch in your code, remember: it’s not a bug—it’s a fundamental part of how machines see the world.