9. Multicycle Design
The single-cycle CPU is rather inefficient if we think about it. The clock cycle has to be long enough for the slowest instruction, so every instruction waits as long as one that has to go all the way through memory.
Why not make different instructions take different amounts of time (and by that, a different number of clock cycles)?
Remember the five phases of execution? Well, all instructions have IF, ID, and EX, but only some write to memory/registers.
For now, assume lw is 5 cycles.
For a multi-cycle CPU, the clock period only has to accommodate a single phase. We can now chop instructions into phases and make the clock faster, which means less time is wasted by faster instructions (reduces latency).
However, in a multi-cycle design, the slowest stage limits the clock rate, so balanced stages are desired. For example, a slow memory operation can be split across multiple clock cycles.
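To make the point about balanced stages concrete, here is a minimal sketch. The per-phase delays are made-up numbers chosen for illustration, not measurements from any real design.

```python
# Hypothetical per-phase delays in nanoseconds (illustrative values only).
# Order: IF, ID, EX, MEM, WB. The memory phase is the slow one here.
unbalanced = [1.0, 0.5, 1.0, 2.0, 0.5]

# Same work, but the slow memory phase is split into two 1 ns phases.
balanced = [1.0, 0.5, 1.0, 1.0, 1.0, 0.5]


def clock_period(phase_delays_ns):
    """In a multi-cycle design the clock must fit the slowest single phase."""
    return max(phase_delays_ns)


# Single-cycle: one clock period has to cover every phase back to back.
single_cycle_period = sum(unbalanced)

unbalanced_period = clock_period(unbalanced)
balanced_period = clock_period(balanced)

# Time for an instruction that uses every phase (such as lw).
unbalanced_lw_time = len(unbalanced) * unbalanced_period
balanced_lw_time = len(balanced) * balanced_period

print(single_cycle_period, unbalanced_period, balanced_period)
print(unbalanced_lw_time, balanced_lw_time)
```

With these assumed delays, the unbalanced design needs a 2 ns clock and spends 10 ns on an instruction that uses all five phases, which is worse than the 5 ns single-cycle version. Splitting the memory phase brings the clock down to 1 ns and the same instruction down to 6 ns.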
Each phase of execution has its own functional unit. Between phases, we need registers to hold onto the data for the next phase.
To measure performance, we use CPI (cycles per instruction) and IPC (instructions per cycle). CPI measures the average number of cycles it takes to complete one instruction, while IPC is just the reciprocal of CPI. The CPI for a single-cycle CPU is 1.
Every program is different, and every program has a different instruction mix. An example is below.
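As you can see, the CPI of a program is just the weighted average of the cycle counts of its instructions, weighted by how often each one appears in the mix.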
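The weighted average can be sketched as follows. The instruction mix and the cycle counts below are hypothetical. Only the 5 cycles for a load comes from these notes, and the mix was chosen only so that the result lands on the 3.95 figure used later.

```python
# Hypothetical instruction mix, in percent of executed instructions.
mix_percent = {"load": 20, "store": 10, "r_type": 45, "branch": 20, "jump": 5}

# Cycles per instruction class. Only the load count (5) is from the notes;
# the other counts are assumptions for this sketch.
cycles = {"load": 5, "store": 4, "r_type": 4, "branch": 3, "jump": 3}


def average_cpi(mix_percent, cycles):
    """Weighted average of cycle counts, weighted by instruction frequency."""
    if sum(mix_percent.values()) != 100:
        raise ValueError("instruction mix must add up to 100 percent")
    weighted_sum = sum(mix_percent[kind] * cycles[kind] for kind in mix_percent)
    return weighted_sum / 100


cpi = average_cpi(mix_percent, cycles)
ipc = 1 / cpi  # IPC is the reciprocal of CPI

print(f"CPI = {cpi:.2f}, IPC = {ipc:.3f}")
```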
If we have $n$ instructions and a cycle time of $t$ seconds, $\text{Total Time} = n \cdot \text{CPI} \cdot t$.
Let's take 500 mega instructions. For a single-cycle CPU, with a cycle time of 5ns, the total time would be 2.5 seconds. For a multi-cycle CPU, with a CPI of 3.95 and a cycle time of 1ns, the total time would be 1.975 seconds.
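The arithmetic above can be checked with a few lines. The function name is just a label for this sketch; the numbers are the ones from the example.

```python
def total_time_seconds(n_instructions, cpi, cycle_time_s):
    """Total Time = n * CPI * t."""
    return n_instructions * cpi * cycle_time_s


n = 500e6  # 500 mega instructions

single_cycle = total_time_seconds(n, cpi=1.0, cycle_time_s=5e-9)
multi_cycle = total_time_seconds(n, cpi=3.95, cycle_time_s=1e-9)

print(f"single-cycle: {single_cycle:.3f} s")
print(f"multi-cycle:  {multi_cycle:.3f} s")
print(f"speedup:      {single_cycle / multi_cycle:.2f}x")
```

Note that the multi-cycle CPU wins here only because its clock is five times faster while its CPI is less than five times higher.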
The CPU
Here's the CPU.
And all of the signals with it.
How it works. The fetch phase reads the instruction from memory and increments the PC. The decode phase has the controller decode the instruction and makes the ALU add the PC to a potential branch offset (just in case the instruction turns out to be a branch).
R-Type
For an R-type, the execute phase for add s0, s1, s2, for example, makes the ALU add registers A and B. The result is then written back into the register file.
I-Type
The execute phase for lw s0, 4(s1), for example, makes the ALU add register A and Imm to calculate the effective address. The ALU result is used as the address and memory is read. Then the data read from memory is written into the register file.
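The five cycles of lw can be walked through in a small sketch. This is a simplified model, not the actual datapath: memory is a dictionary, the instruction is stored already decoded as a tuple, and the names `run_lw`, `a`, `alu_out`, and `data` are assumptions standing in for the registers that hold values between phases.

```python
def run_lw(pc, regs, mem):
    """Step one lw instruction through five cycles.

    Hypothetical model: mem maps an address either to a data word or to a
    decoded instruction tuple ("lw", rd, rs1, imm).
    Returns the new PC and the list of phases that ran.
    """
    phases = []

    # Cycle 1, fetch: read the instruction from memory and increment the PC.
    instr = mem[pc]
    pc += 4
    phases.append("fetch")

    # Cycle 2, decode: read the base register into A and pick out Imm.
    op, rd, rs1, imm = instr
    if op != "lw":
        raise ValueError("this sketch only handles lw")
    a = regs[rs1]
    phases.append("decode")

    # Cycle 3, execute: the ALU adds A and Imm to get the effective address.
    alu_out = a + imm
    phases.append("execute")

    # Cycle 4, memory: the ALU result is used as the address and memory is read.
    data = mem[alu_out]
    phases.append("memory")

    # Cycle 5, writeback: the loaded data is written into the register file.
    regs[rd] = data
    phases.append("writeback")

    return pc, phases


# lw s0, 4(s1) with s1 = 0x100 and the word 42 stored at address 0x104.
regs = {"s0": 0, "s1": 0x100}
mem = {0: ("lw", "s0", "s1", 4), 0x104: 42}
new_pc, phases = run_lw(0, regs, mem)
print(new_pc, phases, regs["s0"])
```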
The execute phase for beq s0, s1, label makes the ALU subtract registers A and B. If the result is zero, the branch is taken.
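A branch can be sketched the same way. Two things here are assumptions rather than facts from these notes: the branch takes three cycles, and the offset is measured from the address of the branch instruction itself (some ISAs measure it from the next instruction instead).

```python
def run_beq(pc, regs, rs1, rs2, offset):
    """Step one beq through an assumed three cycles.

    Returns (next_pc, cycles_used).
    """
    # Cycle 1, fetch: remember the branch's own address, then increment the PC.
    old_pc = pc
    pc = pc + 4

    # Cycle 2, decode: read both registers into A and B. The ALU is idle,
    # so it precomputes the branch target "just in case".
    a = regs[rs1]
    b = regs[rs2]
    target = old_pc + offset  # assumption: offset is relative to the branch

    # Cycle 3, execute: the ALU subtracts A and B. A zero result means the
    # registers are equal, so the precomputed target is loaded into the PC.
    zero = (a - b) == 0
    if zero:
        pc = target

    return pc, 3


regs = {"s0": 7, "s1": 7, "s2": 9}
taken = run_beq(0x20, regs, "s0", "s1", offset=16)      # s0 == s1
not_taken = run_beq(0x20, regs, "s0", "s2", offset=16)  # s0 != s2
print(taken, not_taken)
```

Computing the target during decode is what lets the execute cycle use the ALU for the comparison only.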
J-Type
The execute phase for j target basically just does the jump: the target address is loaded into the PC.
Wait, how are control signals generated on each cycle? Well, in a single-cycle design the signals don't change during an instruction, so the controller is a combinational circuit. In a multi-cycle design the signals change during each instruction, and having different signals on each clock cycle means the controller is a sequential circuit (it needs to remember what it did before). To describe this behaviour we need state machines.
Finite-State Machines
This is a single cycle state machine.
This is a multi cycle state machine.
Here is a more descriptive example of a multi-cycle state machine.
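A multi-cycle controller can also be sketched in code. The state names, the signal names, and the cycle counts for everything except lw are hypothetical; they follow a common textbook layout rather than the exact machine in these notes.

```python
# Which state follows decode depends on the instruction (hypothetical names).
AFTER_DECODE = {
    "lw": "MEM_ADDR",
    "add": "EXECUTE_R",
    "beq": "BRANCH",
    "j": "JUMP",
}

# Every other state has exactly one successor.
NEXT = {
    "MEM_ADDR": "MEM_READ",
    "MEM_READ": "MEM_WRITEBACK",
    "MEM_WRITEBACK": "FETCH",
    "EXECUTE_R": "ALU_WRITEBACK",
    "ALU_WRITEBACK": "FETCH",
    "BRANCH": "FETCH",
    "JUMP": "FETCH",
}

# Control signals asserted in each state (hypothetical signal names).
# The outputs depend only on the current state, which is why the
# controller has to remember where it is.
OUTPUTS = {
    "FETCH": {"IRWrite", "PCWrite"},
    "DECODE": set(),
    "MEM_ADDR": set(),
    "MEM_READ": set(),
    "MEM_WRITEBACK": {"RegWrite"},
    "EXECUTE_R": set(),
    "ALU_WRITEBACK": {"RegWrite"},
    "BRANCH": {"Branch"},
    "JUMP": {"PCWrite"},
}


def next_state(state, opcode):
    """One clock edge of the controller."""
    if state == "FETCH":
        return "DECODE"
    if state == "DECODE":
        return AFTER_DECODE[opcode]
    return NEXT[state]


def states_for(opcode):
    """List the states one instruction passes through, one per clock cycle."""
    state = "FETCH"
    visited = []
    while True:
        visited.append(state)
        state = next_state(state, opcode)
        if state == "FETCH":
            return visited


for opcode in ("lw", "add", "beq", "j"):
    path = states_for(opcode)
    print(opcode, len(path), "cycles:", " -> ".join(path))
```

A single-cycle controller would need none of this: with one state, the signals are a pure function of the opcode.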
Performance
Response time is the length of time from start to finish. Throughput is the amount of work you can do in a span of time.
The CPU's job is to run instructions, so we can do each instruction faster (reduce latency) or do more instructions at once (increase throughput).
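Both measures can be computed from the figures used earlier in these notes (500 mega instructions, 5 ns single-cycle clock, 1 ns multi-cycle clock with a CPI of 3.95). The function names are just labels for this sketch.

```python
def average_latency_s(cpi, cycle_time_s):
    """Average response time of one instruction, in seconds."""
    return cpi * cycle_time_s


def throughput_ips(n_instructions, total_time_s):
    """Instructions completed per second."""
    return n_instructions / total_time_s


n = 500e6

single_latency = average_latency_s(cpi=1.0, cycle_time_s=5e-9)
multi_latency = average_latency_s(cpi=3.95, cycle_time_s=1e-9)

single_throughput = throughput_ips(n, total_time_s=2.5)
multi_throughput = throughput_ips(n, total_time_s=1.975)

print(f"latency:    {single_latency * 1e9:.2f} ns vs {multi_latency * 1e9:.2f} ns")
print(f"throughput: {single_throughput / 1e6:.0f} vs {multi_throughput / 1e6:.0f} million instructions/s")
```

Because neither design overlaps instructions, throughput here is simply the reciprocal of the average latency.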