Memory interface
Memory is external to the CPU core
Unlike registers, which reside inside the CPU and operate at CPU speed, memory is external to the processor core, and its access times are slower than those of registers. Main memory access time is measured in nanoseconds, whereas register access can be measured in picoseconds. Nevertheless, every instruction must be fetched from memory, LOAD instructions load data from memory into registers, and STORE instructions store data from registers into memory.
Memory requires an interface to communicate with the datapath, and this interface must be synchronized with clock signals to ensure correct behavior.
This separation creates a memory bottleneck. The CPU can execute instructions faster than memory can supply them. This motivates the use of caching (another topic we’ll explore later).
Many instructions require memory access, so this bottleneck affects nearly everything the CPU does.
The memory interface
How the CPU communicates with memory
Memory operates through a standardized interface consisting of:
- address lines, which specify which memory location to access
- data lines (on a bus), which transport the value being read or written, and
- control signals, which indicate the operation to perform.
Addressing
The width of address lines determines the addressable memory space. For example, if address lines are 32 bits wide, the CPU can address up to 2^{32} distinct memory locations. Now, 2^{32} = 4{,}294{,}967{,}296, so 32-bit addresses are sufficient to address up to 4 GB of memory. Memory addresses for instruction fetch come from the program counter (or branch or jump instruction). Where offsets are used, or where memory addresses are incremented, the ALU performs the necessary arithmetic.
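The address-space arithmetic above is easy to verify with a short sketch (the function name is illustrative):

```python
def addressable_locations(address_bits):
    """Number of distinct addresses with the given address width."""
    return 2 ** address_bits

print(addressable_locations(32))            # 4294967296
print(addressable_locations(32) // 2**30)   # 4  (GB, with byte addressing)
```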
Data width and the bus
The width of the memory bus matches the word size for a given processor, e.g., 32 bits for a 32-bit processor, 64 bits for a 64-bit processor. The bus is bidirectional, that is, data flows to memory on writes, and from memory on reads, on the same bus.
Control signals
Control signals provided by the CPU indicate the operation to perform: read or write.
- Read enable signal tells memory to read and return data.
- Write enable signal tells memory to store data.
- Chip select signal activates a particular memory chip or region. In older systems, or systems in which memory is distributed across multiple chips, this is used to select specific chips. In SoC (system-on-a-chip) designs, it selects specific regions of memory.
A small example
To read from address 0x1000:
- The CPU sets address lines to 0x1000.
- The CPU asserts a read enable signal.
- Memory places the value at address 0x1000 on the data lines.
- The CPU reads the data from the data bus.
The memory bus
The memory bus is the communication pathway between the CPU and memory. It consists of:
- the address bus, which is unidirectional, from CPU to memory, and which carries memory addresses;
- the data bus, which is bidirectional and carries data from the CPU to memory (on writes) and from memory to the CPU (on reads); and
- the control bus, which is unidirectional and carries control signals from the CPU to memory.
The memory bus is a shared resource among all memory operations.
In a single-cycle datapath, only one memory access can occur per cycle (either instruction fetch or data access, not both).
Address decoding and memory access
When the CPU provides a memory address, the memory system must select the correct physical location to read from or write to. This process is called address decoding.
Address decoding uses decoders, combinational circuits that convert a binary address into a one-hot signal that activates a specific memory row or word.
For example, consider decoding a 4-bit address. If the address were 0b0101 (decimal 5), a 4-to-16 one-hot decoder would produce the output 0000 0000 0010 0000, activating line five (counting from zero) of 16. Decoding larger addresses doesn't work quite the same way: with 32-bit addresses, we don't build a single 32-to-4,294,967,296 one-hot decoder (which would be silly). Instead, the memory interface uses high-order address bits to choose which chip, bank, or memory page is active. Each memory chip or region has its own smaller internal decoders that handle the bits relevant to that chip or region. So decoding is hierarchical and distributed, and decoding of a 32-bit address is partitioned among several levels of decoders. The decoding process ensures that only one memory location is selected at a time, thus preventing conflicts.
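The one-hot and hierarchical decoding just described can be sketched as follows (the function names and the 12/20-bit split are illustrative assumptions, not a real memory map):

```python
def one_hot_decode(address, bits):
    """Return a one-hot list: exactly one active line out of 2**bits."""
    lines = [0] * (2 ** bits)
    lines[address] = 1   # only the addressed line is driven active
    return lines

def split_address(address, low_bits):
    """Hierarchical decoding: high-order bits pick a chip/bank; the
    low-order bits are decoded inside the selected chip."""
    chip = address >> low_bits
    offset = address & ((1 << low_bits) - 1)
    return chip, offset

# 4-bit address 0b0101 (decimal 5) activates line 5 of 16:
print(one_hot_decode(0b0101, 4))
# A 32-bit address split: top 12 bits select a bank, low 20 bits are
# decoded inside the selected 1 MB region.
print(split_address(0x12345678, 20))  # -> (0x123, 0x45678) in hex
```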
Memory organization
Memory is typically organized as a two-dimensional array of cells, with row and column addresses together identifying a given cell. Each cell stores a word, e.g., one word could be 32 bits.
For a memory with n addressable locations and w-bit words, the data width is w bits, and the address width is \lceil \log_2 n \rceil bits.
Example: 1024-word memory with 32-bit words.
- Address: 10 bits (to address 1024 = 2^{10} locations)
- Data: 32 bits per location
- Total storage: 1024 \times 32 = 32{,}768 bits = 4 KB.
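The example's arithmetic can be reproduced with a small sketch (the function name is an assumption for illustration):

```python
import math

def memory_geometry(n_words, word_bits):
    """Address width and total storage for n_words of word_bits each."""
    address_bits = math.ceil(math.log2(n_words))  # ceil(log2 n) bits
    total_bits = n_words * word_bits
    return address_bits, total_bits

addr_bits, total = memory_geometry(1024, 32)
print(addr_bits)            # 10 address bits
print(total)                # 32768 bits
print(total // 8 // 1024)   # 4 KB
```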
Word-addressable vs byte-addressable
Most modern processors are byte-addressable: every byte has a unique address. However, memory is often accessed in multi-byte words, for example four bytes for a 32-bit word.
As an example, the 32-bit word at address 0x1000 occupies bytes 0x1000, 0x1001, 0x1002, and 0x1003. Under this scheme, addresses of successive words increment by the word size (as with incrementing by four when updating the program counter).
Address decoding must handle alignment, ensuring that multi-byte words start at addresses divisible by the word size.
Read and write operations
Memory read cycle
The memory read operation retrieves data from a specified address.
- The CPU sets address lines to the target memory location.
- The CPU asserts a read enable signal.
- Address decoding occurs (hierarchical and distributed) to select the target memory cell.
- Memory places data on the data bus.
- CPU reads the data from the bus.
- The CPU deasserts the read enable signal.
In a single-cycle datapath this must occur within one clock cycle, thus memory access time must be less than the clock period, and data must be stable before the next clock edge.
For example, when reading an instruction from memory:
- the program counter provides the address,
- the control unit asserts a read enable signal,
- the address decoding selects the target memory cell,
- memory places the fetched instruction on the data bus,
- the control unit reads the instruction from the bus, and the instruction register latches the instruction on the clock edge.
For example, consider a LOAD instruction: LOAD R1, [R2, #8]. This instruction tells the CPU to read the value from memory at address R2 + 8 and store it in register R1.
This instruction itself is fetched from memory. Once fetched, the instruction is decoded, and the ALU computes the effective address (base plus offset). The CPU control asserts a memory read enable signal, and the address decoding selects the target memory cell. Memory places the value at address R2 + 8 on the data lines. The CPU reads the data from the data bus. The CPU asserts a register write enable signal, and the register file writes the value to register R1.
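The effective-address computation in this walkthrough can be sketched with a toy register file and memory (all names and values here are hypothetical):

```python
def effective_address(base, offset):
    """The ALU computes base + offset for LOAD/STORE addressing."""
    return base + offset

# LOAD R1, [R2, #8]: read memory at R2 + 8 into R1.
regs = {"R1": 0, "R2": 0x100}     # toy register file
memory = {0x108: 42}              # toy data memory (address -> value)

regs["R1"] = memory[effective_address(regs["R2"], 8)]
print(regs["R1"])  # 42
```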
Memory write cycle
The memory write operation stores data to a specified address.
- The CPU sets address lines to the target memory location.
- The CPU places data on the data bus.
- The CPU asserts a write enable signal.
- Address decoding selects the target memory cell.
- Memory latches the data into the selected location.
- The CPU deasserts write enable to complete the operation.
Data must be stable before and after the write enable signal, and the write must be complete before the next instruction.
When writing data to memory (STORE instruction):
- the ALU computes the effective address (base plus offset),
- the register file provides the value to store,
- the control unit asserts a write enable signal,
- the address decoding selects the target memory cell,
- memory stores the value at the computed address, and
- the control unit deasserts the write enable signal.
For example, consider a STORE instruction: STORE R4, [R3, #12]. This instruction tells the CPU to store the value in register R4 at address R3 + 12.
This instruction itself is fetched from memory. Once fetched, the instruction is decoded, the ALU computes the effective address (base plus offset), and the register file provides the value to store. The CPU puts the data from R4 onto the data bus. The CPU control asserts a memory write enable signal, and the address decoding selects the target memory cell. Memory writes the value to the selected location, and the CPU deasserts the write enable signal.
Memory read enable and write enable signals are mutually exclusive: they cannot both be asserted in the same cycle. A memory location cannot be read from and written to simultaneously.
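The read and write cycles, including the mutual exclusivity of the enable signals, can be modeled with a toy word-addressable memory (illustrative only; real memory is driven by bus and clock signals, not method calls):

```python
class Memory:
    """Toy word-addressable memory with read/write enable signals."""
    def __init__(self, n_words):
        self.cells = [0] * n_words

    def access(self, address, read_enable=False, write_enable=False, data=None):
        if read_enable and write_enable:
            raise ValueError("read and write enables are mutually exclusive")
        if write_enable:
            self.cells[address] = data   # memory latches the data
            return None
        if read_enable:
            return self.cells[address]   # memory drives the data bus
        return None                      # no operation this cycle

mem = Memory(16)
mem.access(5, write_enable=True, data=0xDEADBEEF)    # STORE
print(hex(mem.access(5, read_enable=True)))           # 0xdeadbeef
```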
Timing, clock signals, and synchronization
Memory operations are synchronized to the system clock to ensure correct behavior. Typically, a rising edge triggers reads and writes. However, every memory element (register, flip-flop, latch, or memory cell interface) has two critical timing parameters: setup time and hold time.
Setup time is the minimum amount of time that data must be stable before the rising clock edge. That is, data must arrive in time.
Hold time is the minimum amount of time that data must remain stable after the clock edge.
If these timing constraints are violated, the system may read incorrect or stale data, write garbage to memory, skip instructions, or execute the wrong instructions.
In the context of memory devices (SRAM, DRAM, flash, etc.), setup and hold times are specified with respect to address, data, and control signals and a clock (or strobe signal). The memory interface must satisfy these timing margins—meaning it provides addresses, data, and control signals early enough before the next clock and doesn’t release them too soon afterward.
Memory access time
The memory access time is the delay between asserting the address/control signals and data becoming available (for reads) or being latched (for writes).
For a single-cycle datapath, the clock period must be long enough to accommodate:
- PC update,
- instruction fetch (memory read),
- instruction decode,
- ALU operation (for computing effective address),
- data memory access (if needed), and
- register writeback.
This makes single-cycle designs slow, since the clock must wait for the longest possible instruction path. This is another motivation for multi-cycle datapaths.
Synchronous vs asynchronous memory
Most modern CPUs use synchronous memory, in which all operations are triggered by clock edges. This is the most common approach, and it makes designing and debugging easier. Virtually all general-purpose computers use fully synchronous DRAM memory (SDRAM, DDR, LPDDR).
Some systems use asynchronous memory, in which operations are triggered by signal changes (no clock). While this can be faster, timing analysis is more complex. It is rarely used in modern systems, apart from embedded systems and certain microcontrollers. Historically, this approach was used in systems built around the Zilog Z80 microprocessor, the Intel 8088/8086, the Motorola 68000, certain MIPS processors, and some game consoles (e.g., the Game Boy).
Decoupling memory and CPU
When the CPU needs to read from or write to memory, the transfer happens in two parts:
- the address of the memory location is placed in the memory address register (MAR), and
- the data is placed in the memory data register (MDR).
These are special-purpose registers which serve as buffers between the CPU and memory.
On a read, the MDR receives data from memory and holds it until the CPU needs it. On a write, the MDR holds data from the CPU that will be written to memory.
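A minimal sketch of MAR/MDR buffering on a read, assuming a plain list stands in for main memory (the class and attribute names are hypothetical):

```python
class CPUMemoryPort:
    """Toy model of the MAR/MDR pair between CPU and memory."""
    def __init__(self, memory):
        self.memory = memory   # a list standing in for main memory
        self.mar = 0           # memory address register
        self.mdr = 0           # memory data register

    def read(self, address):
        self.mar = address                  # 1. address goes into the MAR
        self.mdr = self.memory[self.mar]    # 2. memory fills the MDR
        return self.mdr                     # 3. CPU consumes the MDR value

port = CPUMemoryPort([10, 20, 30])
print(port.read(2))  # 30
```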
Memory address alignment
Most modern architectures require that multi-byte accesses be aligned to their word size:
- 2-byte access (16-bit): address must be even (divisible by 2)
- 4-byte access (32-bit): address must be divisible by 4
- 8-byte access (64-bit): address must be divisible by 8
Why? Alignment simplifies address-decoding hardware, and allows for single-cycle access (no need to read multiple locations to get separate bytes within a word). This can increase memory throughput.
Unaligned accesses may cause exceptions (on strict architectures), require multiple memory cycles (on permissive architectures), and this typically results in slower performance.
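The alignment rules above amount to a simple divisibility check, sketched here:

```python
def is_aligned(address, access_bytes):
    """True if a multi-byte access starts on a natural boundary."""
    return address % access_bytes == 0

print(is_aligned(0x1000, 4))  # True
print(is_aligned(0x1002, 4))  # False (would fault or need extra cycles)
print(is_aligned(0x1002, 2))  # True
```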
Endianness and memory layout
In his 1726 novel, Gulliver's Travels, Jonathan Swift tells of two factions among the Lilliputians: the "Big-endians" and "Little-endians." Big-endians broke boiled eggs at the larger end, while the Little-endians broke them at the smaller end. This seemingly trivial disagreement escalated into numerous rebellions and accusations of heresy! That's the origin of the terms "big-endian," "little-endian," "bi-endian," and "endianness."
It turns out that there are two ways to organize multi-byte data in memory. In little-endian architectures, the least significant byte is stored at the lowest address. In big-endian architectures, the most significant byte is stored at the lowest address.
The x86 architecture is little-endian. ARM is technically bi-endian, but most modern OSs running on ARM (macOS, Linux, Android, iOS) run little-endian. Sun SPARC and the Motorola 68000 are big-endian (but are no longer in wide use).
Endianness doesn’t determine the order of bits within a byte, but it does determine the order of bytes within a word. Within each individual byte, the bit order is the same. Bits within a byte are always ordered the same way, with bit 7 being the most significant bit (MSB), and bit 0 being the least significant bit (LSB). Endianness only refers to the order of bytes within a word.
For example, consider a 32-bit word at memory address 0x1000. This consists of four bytes at addresses 0x1000, 0x1001, 0x1002, and 0x1003.
Under the little-endian scheme, the least significant byte is stored at address 0x1000, and the most significant byte is stored at address 0x1003. Under the big-endian scheme, the most significant byte is stored at address 0x1000, and the least significant byte is stored at address 0x1003.
Little-endian:

| Address | 0x1000 | 0x1001 | 0x1002 | 0x1003 |
|---|---|---|---|---|
| Data | 0x78 | 0x56 | 0x34 | 0x12 |

Result: 0x12345678
Big-endian:

| Address | 0x1000 | 0x1001 | 0x1002 | 0x1003 |
|---|---|---|---|---|
| Data | 0x12 | 0x34 | 0x56 | 0x78 |

Result: 0x12345678 (same).
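Python's struct module makes these byte orders easy to observe; this sketch packs the example value both ways:

```python
import struct

value = 0x12345678

little = struct.pack("<I", value)  # "<I": little-endian unsigned 32-bit
big    = struct.pack(">I", value)  # ">I": big-endian unsigned 32-bit

print(little.hex())  # 78563412 -> LSB at the lowest address
print(big.hex())     # 12345678 -> MSB at the lowest address

# Both byte sequences decode back to the same value:
print(hex(struct.unpack("<I", little)[0]))  # 0x12345678
print(hex(struct.unpack(">I", big)[0]))     # 0x12345678
```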
Naturally, address decoding must account for endianness when accessing individual bytes within a word.
Side note: network protocols such as TCP/IP, etc., define “network byte order” as big-endian.
© 2025 Clayton Cafiero.
No generative AI was used in writing this material. This was written the old-fashioned way.