Bits, bytes, and words: the big picture

Author

Clayton Cafiero and Surya Malik

Published

2025-10-12

Bit

The most fundamental unit of information in a computer is the bit (short for binary digit), which can hold a value of 0 or 1.
Physically, a bit is typically implemented as a low or high voltage.

Byte

A byte is an ordered sequence of eight bits.
The byte is the smallest addressable unit of memory in most architectures.
Even if only a single bit is required, the memory system allocates at least one byte. For example, the decimal value 300 is 100101100 in binary. This requires nine bits to store. Since memory addresses whole bytes, at least two bytes are required to store this value.
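
The byte count in the example above can be checked with a short C sketch. The helper names `bits_needed` and `bytes_needed` are illustrative (they are not from any library), and the code assumes 8-bit bytes.

```c
#include <stdint.h>

// Number of bits needed to write v in binary (0 is treated as needing 1 bit).
unsigned bits_needed(uint32_t v){
    unsigned n = 1;
    while ((v >>= 1) != 0) n++;   // one more bit for every halving that stays nonzero
    return n;
}

// Whole bytes needed to hold that many bits (rounding up), assuming 8-bit bytes.
unsigned bytes_needed(uint32_t v){
    return (bits_needed(v) + 7) / 8;
}
```

For the value 300, `bits_needed` gives 9 and `bytes_needed` gives 2, matching the reasoning above.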

Word

A word is the “natural” data size for a given processor.
A word matches the width of the CPU registers and datapath.

In a 32-bit ARMv7 processor, a word is 32 bits (4 bytes).
In a 64-bit ARMv8 processor, a word in this sense is 64 bits (8 bytes), although ARM’s own documentation still uses “word” to mean 32 bits.

Processors often require that words be stored at memory addresses that are multiples of the word size. For example, a 4-byte word should be placed at an address divisible by four. Misaligned accesses may be slower or may even cause exceptions on some systems. Alignment ensures efficient memory access.
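
As a minimal sketch of the alignment rule, the following C helpers test whether an address is aligned and round an address up to the next aligned one. The names `is_aligned` and `align_up` are hypothetical, and both assume the alignment is a power of two (as word sizes normally are).

```c
#include <stdint.h>

// Nonzero if addr is a multiple of align; align is assumed to be a power of two.
int is_aligned(uintptr_t addr, uintptr_t align){
    return (addr & (align - 1)) == 0;
}

// Round addr up to the next multiple of align (power of two assumed).
uintptr_t align_up(uintptr_t addr, uintptr_t align){
    return (addr + align - 1) & ~(align - 1);
}
```

With a 4-byte word, address 0x1000 is aligned, 0x1002 is not, and `align_up(0x1001, 4)` yields 0x1004.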

Binary numbers

Binary numbers are represented using positional notation, similar to decimal numbers but with base two rather than base ten. Each position in such a representation corresponds to a power of two.

The value of an n-bit (unsigned) binary number is a sum of powers of two, where each coefficient b_{i} is the bit (0 or 1) in position i, counting from zero at the rightmost bit:

\text{value} = (b_{n-1} \times 2^{n-1}) + (b_{n-2} \times 2^{n-2}) + \ldots + (b_{1} \times 2^{1}) + (b_{0} \times 2^{0}).

Example:

1101 \text{ (binary)} = (1 \times 8) + (1 \times 4) + (0 \times 2) + (1 \times 1) = 13 \text{ (decimal)}.
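
The positional formula can be evaluated in code. The sketch below reads a string of binary digits from left to right; `binary_value` is an illustrative name, and the function assumes the input holds at most 32 binary digits.

```c
#include <stdint.h>

// Evaluate a string of '0'/'1' characters as an unsigned binary number.
// Each step doubles the running value (moving the earlier bits one position
// left) and adds the new bit, which is equivalent to summing b_i * 2^i.
// Reading stops at the first character that is not '0' or '1'.
uint32_t binary_value(const char *bits){
    uint32_t value = 0;
    for (const char *p = bits; *p == '0' || *p == '1'; p++)
        value = value * 2 + (uint32_t)(*p - '0');
    return value;
}
```

For instance, `binary_value("1101")` evaluates to 13, as in the worked example.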

Range of unsigned values

An n-bit unsigned integer can represent values from 0 to 2^{n} - 1.

Examples:

8-bit unsigned: [0, 255]
16-bit unsigned: [0, 65{,}535]
32-bit unsigned: [0, 4{,}294{,}967{,}295]
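
The upper bounds listed above follow directly from 2^{n} - 1. A small C sketch, with the hypothetical helper `unsigned_max`, computes the bound for any width from 1 to 64 bits.

```c
#include <stdint.h>

// Largest value an n-bit unsigned integer can hold, 2^n - 1, for 1 <= n <= 64.
uint64_t unsigned_max(unsigned n){
    // Shifting a 64-bit value by 64 is undefined in C, so n == 64 is special-cased.
    return (n >= 64) ? UINT64_MAX : ((UINT64_C(1) << n) - 1);
}
```

Calling it with 8, 16, and 32 reproduces the three bounds in the list.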

Signed integers

Unsigned binary representation, by definition, cannot represent negative values.
Several historical methods were developed to represent signed integers, but modern systems use two’s complement. The most intuitive scheme, called sign-magnitude, uses the leftmost bit to indicate sign (0 = \text{positive}, \; 1 = \text{negative}), with the remaining bits giving the magnitude. This approach is not used for integers in modern computers: it has two possible representations for zero (+0 and -0, even though zero is neither positive nor negative), which wastes a bit pattern, and addition needs extra logic to treat the sign bit separately.

Some early computers used one’s complement, in which forming the additive inverse was done by flipping all the bits of the positive representation.

Example (4-bit): +5 = 0101, -5 = 1010.
Problem: Two encodings for zero! In four bits, both 0000 (+0) and 1111 (-0) represent zero.
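
The bit flip is easy to try in C. This is only a sketch of 4-bit one’s complement negation on a bit pattern held in a wider type; `ones_complement_negate4` is a made-up name.

```c
#include <stdint.h>

// One's complement negation of a 4-bit pattern: flip all four bits.
// The mask 0xF discards the upper bits of the wider C type.
uint8_t ones_complement_negate4(uint8_t x){
    return (uint8_t)(~x & 0xF);
}
```

Negating 0101 gives 1010, and negating 0000 gives 1111, which is the second encoding of zero.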

Nowadays, signed integers are represented using two’s complement. Under two’s complement representation, the additive inverse of a number is formed by inverting all bits of the positive number and then adding one.

Example (4-bit):

+6 = 0110
-6 = 1001 + 1 = 1010

Range of two’s complement

An n-bit two’s complement number represents values from -2^{\,n-1} to 2^{\,n-1} - 1.

4-bit: [-8, +7]
32-bit: [-2{,}147{,}483{,}648, +2{,}147{,}483{,}647]
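
The asymmetric range can be computed from the two formulas. The helpers `twos_min` and `twos_max` below are illustrative names, and the sketch assumes widths from 1 to 64 bits.

```c
#include <stdint.h>

// Smallest n-bit two's complement value, -2^(n-1), for 1 <= n <= 64.
int64_t twos_min(unsigned n){
    return (n >= 64) ? INT64_MIN : -(INT64_C(1) << (n - 1));
}

// Largest n-bit two's complement value, 2^(n-1) - 1, for 1 <= n <= 64.
int64_t twos_max(unsigned n){
    return (n >= 64) ? INT64_MAX : (INT64_C(1) << (n - 1)) - 1;
}
```

For n = 4 these give -8 and 7. There is one more negative value than positive because zero takes one of the patterns whose sign bit is 0.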

Why two’s complement?

There’s only one representation for zero.
Addition and subtraction use the same adder circuits as unsigned numbers.
The carry out from the most significant bit can be ignored.
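
To illustrate the second and third points, the sketch below adds two 8-bit patterns once, then reads the result both as unsigned and as two’s complement. The helpers `add8` and `as_signed8` are hypothetical names; the adder itself does not know which interpretation is intended.

```c
#include <stdint.h>

// Add two 8-bit patterns the way an 8-bit adder would: the sum is kept
// modulo 2^8, so any carry out of the most significant bit is dropped.
uint8_t add8(uint8_t a, uint8_t b){
    return (uint8_t)(a + b);
}

// Read an 8-bit pattern as a two's complement value without relying on
// implementation-defined conversions.
int as_signed8(uint8_t bits){
    return (bits & 0x80) ? (int)bits - 256 : (int)bits;
}
```

The pattern 0xFA is 250 unsigned and -6 signed. Adding 0x09 gives 0x03 either way: 259 modulo 256 is 3, and -6 + 9 is 3.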

Code Snippet
#include <stdint.h>
#include <stdio.h>

int8_t negate8(int8_t x){
    return (int8_t)((~x) + 1);   // two's complement negation in 8 bits
}

int main(void){
    int8_t a = 53;               // 0b0011_0101
    int8_t na = negate8(a);      // should be -53
    printf("a=%d neg(a)=%d\n", a, na);
    return 0;
}

Overflow and carry out

Carry out occurs when a binary addition produces a carry beyond the most significant bit. Carry out signals an out-of-range result only for unsigned arithmetic, although the adder produces it for signed and unsigned operands alike.

Example (8-bit unsigned):

11111111 + 00000001 = 00000000.

In this example, we have decimal 255 (binary 11111111) plus one. But we can’t represent decimal 256 with eight unsigned bits. The carries propagate from right to left, and then there’s a carry out at the end. We have an extra bit beyond the range of representation, and the eight bits that remain read as zero.

Overflow occurs when the result of signed arithmetic is too large or too small to fit in the available bits; that is, when the sign bit is wrong. Let’s consider a similar example, but with 8-bit signed (two’s complement) representation.

11111111 + 00000001 = 00000000

Here we have carry out, but the result is OK, because under this representation, this is equivalent to -1 + 1 = 0.

Here’s an example of overflow:

01111111 + 00000001 = 10000000.

This is decimal (127 + 1) which would equal 128, but decimal 128 can’t be represented with eight bits, signed. So the result here, under two’s complement, is -128. The sign is wrong! Overflow happens when adding two numbers of the same sign yields a result of the opposite sign.

Carry out is a raw arithmetic by-product (extra bit beyond range). Overflow is a semantic error in signed arithmetic interpretation (sign bit incorrect for result).
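
The distinction can be made concrete by computing both flags for a single 8-bit addition. This is a sketch only: the struct `add8_flags` and the function `add8_with_flags` are hypothetical helpers, loosely modeled on the carry and overflow flags that many processors expose.

```c
#include <stdint.h>

// Flags produced by one 8-bit addition (hypothetical helper type).
struct add8_flags {
    uint8_t sum;      // low eight bits of the result
    int carry;        // 1 if there was a carry out of bit 7
    int overflow;     // 1 if the signed (two's complement) result is wrong
};

struct add8_flags add8_with_flags(uint8_t a, uint8_t b){
    unsigned wide = (unsigned)a + (unsigned)b;   // up to nine significant bits
    struct add8_flags r;
    r.sum = (uint8_t)wide;
    r.carry = (wide >> 8) & 1;                   // the ninth bit
    // Overflow: the sign bit of the sum differs from the sign bit of both operands.
    r.overflow = (((a ^ r.sum) & (b ^ r.sum)) >> 7) & 1;
    return r;
}
```

Adding 0xFF and 0x01 sets carry but not overflow, while adding 0x7F and 0x01 sets overflow but not carry, matching the two examples above.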

Note

In C and many modern languages, unsigned arithmetic is explicitly defined as modulo 2^{n}. Signed overflow, by contrast, is undefined behavior in C.

Detecting overflow

If two positive numbers yield a negative result, overflow has occurred.
If two negative numbers yield a positive result, overflow has occurred.
If the operands have different signs, overflow cannot occur.

Code Snippet
#include <stdint.h>
#include <stdio.h>

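// Returns 1 if a + b does not fit in 32-bit two's complement.
// The sum is formed in unsigned arithmetic, which wraps modulo 2^32;
// adding the int32_t values directly would be undefined behavior on overflow.
int add_overflows(int32_t a, int32_t b){
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t s = ua + ub;
    return (((ua ^ s) & (ub ^ s)) >> 31) != 0;  // same signs in, different sign out
}

int main(void){
    int32_t x = 1<<30;   // large positive
    int32_t y = 1<<30;   // large positive
    // Wrapped sum; converting back to int32_t is implementation-defined
    // but yields the two's complement value on common compilers.
    int32_t s = (int32_t)((uint32_t)x + (uint32_t)y);
    printf("s=%d overflow=%d\n", s, add_overflows(x, y)); // expect overflow=1
    return 0;
}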

Subtraction

Subtraction is performed by taking the additive inverse of the subtrahend and adding.

Example: 5 - 3 is equivalent to 5 + (-3).

Code Snippet
#include <stdint.h>
#include <stdio.h>

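int32_t sub_via_add(int32_t a, int32_t b){
    // Negate b by inverting its bits and adding one, then add. The work is
    // done in unsigned arithmetic so that wraparound is well defined in C.
    return (int32_t)((uint32_t)a + (~(uint32_t)b + 1u));  // two's complement subtraction
}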

int main(void){
    int32_t a = 5, b = 3;
    printf("a-b=%d sub_via_add=%d\n", a-b, sub_via_add(a,b));
    return 0;
}

Endianness

Endianness refers to the ordering of bytes when multi-byte data is stored in memory.

Little-endian: least significant byte stored at the lowest memory address. x86 is always little-endian, and ARM systems run little-endian by default.
Big-endian: most significant byte stored at the lowest address.
(Apologies to Jonathan Swift.)

Example: Storing the 32-bit word 0x12345678 (hex) at address 0x1000.

Little-endian:
- 0x1000: 78
- 0x1001: 56
- 0x1002: 34
- 0x1003: 12
Big-endian:
- 0x1000: 12
- 0x1001: 34
- 0x1002: 56
- 0x1003: 78
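
The byte layouts above can be inspected from C. The sketch below copies a word’s bytes out of memory in address order, uses that to guess the host’s byte order, and also writes a word in big-endian order with shifts so the result does not depend on the host. All three function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

// Copy the bytes of a 32-bit word into out[0..3] in the order the host
// stores them in memory (lowest address first).
void bytes_in_memory(uint32_t word, uint8_t out[4]){
    memcpy(out, &word, 4);
}

// Returns 1 if the host stores the least significant byte first.
int host_is_little_endian(void){
    uint8_t b[4];
    bytes_in_memory(0x12345678u, b);
    return b[0] == 0x78;
}

// Write a word in big-endian order regardless of the host, using shifts.
void store_big_endian(uint32_t word, uint8_t out[4]){
    out[0] = (uint8_t)(word >> 24);   // most significant byte first
    out[1] = (uint8_t)(word >> 16);
    out[2] = (uint8_t)(word >> 8);
    out[3] = (uint8_t)word;           // least significant byte last
}
```

On a little-endian machine, `bytes_in_memory(0x12345678, ...)` yields 78 56 34 12, while `store_big_endian` always yields 12 34 56 78.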

Note

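Endianness is important for interpreting memory dumps, file formats, and network protocols (most Internet protocols transmit multi-byte values in big-endian order, often called network byte order).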

Characters, ASCII, and Unicode

Computers must also represent non-numeric data such as symbols used in text.
This is achieved by mapping characters to integers.

ASCII (American Standard Code for Information Interchange) is a 7-bit encoding for 128 characters, including letters, digits, punctuation, and control codes. Examples: ‘A’ = 65, ‘a’ = 97, newline = 10. Extended ASCII uses eight bits to allow 256 characters. However, different systems have defined different extended sets. (This is one possible reason why you might see odd characters or gobbledygook—some text is intended to be interpreted under one system but instead is interpreted under a different system.)
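
Because characters are just integers, C lets a program treat them as such. The sketch below assumes an ASCII-compatible execution character set (true on most current systems, but not guaranteed by the C standard); `ascii_code` and `ascii_to_upper` are illustrative names.

```c
// Return the integer code of a character. With an ASCII-compatible
// character set (an assumption), 'A' is 65 and 'a' is 97.
int ascii_code(char c){
    return (unsigned char)c;
}

// Convert an ASCII lowercase letter to uppercase; other characters pass through.
// Lowercase and uppercase letters differ by 32 ('a' - 'A') in ASCII.
char ascii_to_upper(char c){
    return (c >= 'a' && c <= 'z') ? (char)(c - ('a' - 'A')) : c;
}
```

Under that assumption, `ascii_code('\n')` is 10 and `ascii_to_upper('a')` is `'A'`.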

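Unicode defines a much larger set of characters for virtually all writing systems. However, Unicode text can be stored using several encodings (such as UTF-8, UTF-16, and UTF-32), some of which use a variable number of bytes per character, so we won’t investigate this further.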

For more on Unicode, see: The Joy of Unicode (appendix) in “An Introduction to Programming and Computer Science with Python, second edition” (Cafiero, 2025).

No generative AI was used in writing this material. This was written the old-fashioned way.

Reuse

CC BY-NC-SA 4.0