11. Pipelining
A Lesson on Laundry
Luis, Artur, Stephen, and Ray each need to do a load of laundry. The washer takes 60 minutes, the dryer takes 60 minutes, and folding takes 60 minutes. They can only do laundry on Saturday from 9am to 6pm.
Let's do it sequentially.
Oh no, we can only do 1/3 of a load an hour! We could buy more machines, but that's expensive!
What if we pipelined the laundry:
Now we can do 2/3 of a load an hour with only one set of machines (all four loads finish in six hours)!
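To make those throughput numbers concrete, here is a minimal Python sketch (my own illustration, assuming four loads and three one-hour stages) comparing loads per hour for the sequential and pipelined schedules:

```python
STAGE_HOURS = 3   # wash, dry, fold: one hour each
LOADS = 4         # one load per person

# Sequential: each load finishes all three stages before the next one starts.
sequential_hours = LOADS * STAGE_HOURS            # 12 hours for 4 loads

# Pipelined: once the first load is done (3 hours), another load finishes
# every hour, so the last load is done LOADS - 1 hours after the first.
pipelined_hours = STAGE_HOURS + (LOADS - 1)       # 6 hours for 4 loads

print(f"sequential: {LOADS / sequential_hours:.2f} loads/hour")  # 0.33 = 1/3
print(f"pipelined:  {LOADS / pipelined_hours:.2f} loads/hour")   # 0.67 = 2/3
```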
Upgrading the Multi-Cycle CPU
We can apply this concept to a multi-cycle CPU!
Pipelining doesn't help with the latency of a single instruction, but it does help with the throughput of the entire workload.
Different tasks operating on different resources can execute simultaneously.
More stages mean more potential speedup, but too many stages is counterproductive.
With this we can get the CPI down closer to 1.
For $n$ instructions, the average CPI is roughly $CPI=\frac{n+3}{n}$; as $n$ goes to infinity, the CPI approaches one. The extra cycles come from waiting a few cycles at the start for the pipeline to fill up and reach full overlapping.
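A quick sketch of that formula (the $+3$ models the startup cycles spent filling the pipeline) shows the fill cost being amortized as $n$ grows:

```python
# CPI = (n + 3) / n: the 3 fill cycles matter less and less as n grows.
for n in (1, 4, 10, 100, 1_000_000):
    print(f"n = {n:>9}: CPI = {(n + 3) / n:.4f}")
```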
Issues
Structural Hazards
Attempting to use the same resource two different ways simultaneously. E.g.: You get home soaking wet and need to dry your clothes while someone is using the dryer.
In a CPU with a single memory, we can't fetch a new instruction while another instruction is reading a word from memory.
Structural hazards arise from a lack of resources. We can eliminate the hazard by adding more resources, or we can stall until the resource is available.
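A back-of-the-envelope sketch of what the two options cost (the 30% load/store mix and one-cycle penalty below are assumptions of mine, not the lecture's numbers):

```python
# Assume a pipeline that would otherwise achieve CPI = 1, and that 30% of
# instructions are loads/stores that contend with instruction fetch for the
# single memory, each costing one stall cycle.
mem_fraction = 0.30
stall_penalty = 1

cpi_stall = 1 + mem_fraction * stall_penalty   # stall until the memory is free
cpi_more_resources = 1.0                       # separate instruction/data memories

print(f"single memory, stalling: CPI = {cpi_stall:.2f}")          # 1.30
print(f"separate memories:       CPI = {cpi_more_resources:.2f}") # 1.00
```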
Data Hazards
Attempting to use an item before it is ready. E.g.: only one sock of a pair is found during folding, and the other is still in the dryer! Folding has to wait!
In a CPU, an instruction depends on the result of a prior instruction that is still in the pipeline.
Solution 1: wait (stall the pipeline until the result is ready).
Solution 2: modify the register file a little bit, so it is written in the first half of the cycle and read in the second half; then a result being written back can be read in the same cycle.
Or, the best option, solution 3: forward the ALU output (pass the result from a later stage back to an earlier one) as soon as it is available.
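Here is a minimal sketch (my own model, not from the lecture) that counts stall cycles for a short dependent sequence on a classic 5-stage IF/ID/EX/MEM/WB pipeline, assuming registers are read in ID and written in WB, the register file is split as in solution 2, and ALU results are forwarded as in solution 3:

```python
# Each instruction is (text, destination register, source registers).
program = [
    ("add x1, x2, x3", "x1", ["x2", "x3"]),
    ("sub x4, x1, x5", "x4", ["x1", "x5"]),   # needs x1 right away -> data hazard
    ("or  x6, x1, x4", "x6", ["x1", "x4"]),   # needs x4 right away -> data hazard
]

def stall_cycles(program, forwarding):
    """Count stall cycles on an in-order 5-stage pipeline (IF ID EX MEM WB)."""
    ready = {}    # register -> earliest ID cycle at which a consumer may read it
    decode = 0    # ID cycle of the previous instruction
    stalls = 0
    for _text, dest, srcs in program:
        earliest = decode + 1                             # one instruction per cycle
        needed = max([ready.get(r, 0) for r in srcs], default=0)
        decode = max(earliest, needed)                    # stall in ID if needed
        stalls += decode - earliest
        # Without forwarding, the value is usable once the producer's WB overlaps
        # a later ID (3 cycles after this ID); with forwarding, the next ID is fine.
        ready[dest] = decode + 1 if forwarding else decode + 3
    return stalls

print("stalls without forwarding:", stall_cycles(program, forwarding=False))  # 4
print("stalls with forwarding:   ", stall_cycles(program, forwarding=True))   # 0
```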
Control Hazards
Attempting to make a decision before the condition is evaluated. E.g.: if the clothes are still dirty after washing, I must wash them again.
In a CPU: branch instructions.
Solution 1: Wait.
Solution 2: predict that the branch is never taken.
It's a gamble, but if we are right, there is no stall.
And if we are wrong, we just fix it, since the wrongly fetched instructions haven't changed anything yet.
If our prediction is correct only 10% of the time, we can just predict the opposite and be correct 90% of the time. The worst case is when we are correct only 50% of the time.
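A tiny sketch of the cost of that gamble (the branch fraction and misprediction penalty below are illustrative assumptions, not the lecture's numbers):

```python
def effective_cpi(branch_fraction, accuracy, penalty):
    """Base CPI of 1 plus the stall cycles paid on mispredicted branches."""
    return 1 + branch_fraction * (1 - accuracy) * penalty

for accuracy in (0.10, 0.50, 0.90):
    cpi = effective_cpi(branch_fraction=0.2, accuracy=accuracy, penalty=3)
    print(f"accuracy {accuracy:.0%}: effective CPI = {cpi:.2f}")
# 10% and 90% accuracy are equally exploitable (just invert the 10% predictor);
# 50% is the worst case because there is nothing to invert.
```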
Law of Diminishing Returns
Let's say we have this program and it takes $102010$ns to run.
What if we improve the multiplication hardware: include only what we need, combine some registers, do the 3 steps simultaneously, and have the ALU do only what is required. Now each multiplication takes $160$ns, so multiplication itself is 6 times faster.
However, our program now takes $22010$ns, so our overall program speedup is only $\frac{102010}{22010} \approx 4.6$, not 6.
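Working backwards from those two totals (the split below is implied by the numbers, not stated explicitly in the notes), the multiplications must account for $96000$ns of the original $102010$ns run:

$$\text{new time} = \frac{96000}{6} + (102010 - 96000) = 16000 + 6010 = 22010\ \text{ns}, \qquad \text{speedup} = \frac{102010}{22010} \approx 4.6$$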
What happens as we decrease the proportion of execution time spent multiplying?
So, once the proportion of time spent multiplying gets small, even a big improvement to multiplication yields almost no overall gain.
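Here is a small sketch of that effect in general form (this is Amdahl's law; the fractions below are illustrative, my own choices): the overall speedup shrinks as the improved portion's share of execution time shrinks.

```python
def overall_speedup(fraction, local_speedup):
    """Speedup of the whole program when `fraction` of its time is sped up."""
    return 1 / ((1 - fraction) + fraction / local_speedup)

for fraction in (0.94, 0.50, 0.10):
    print(f"{fraction:.0%} of time affected -> {overall_speedup(fraction, 6):.2f}x overall")
# 94% -> ~4.6x, 50% -> ~1.7x, 10% -> ~1.09x: the smaller the affected share,
# the less a 6x local improvement matters.
```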
Optimization is a balancing act: as you solve bottlenecks, new ones appear. You can improve things up to a point, and then there are diminishing returns.
If we look at pipelining, we see the same thing.
This is a regular multi-cycle CPU's CPI.
Its CPU time is $2660$ns.
Let's use pipelining, assuming every branch is predicted incorrectly (the worst case).
Now the CPU time is $871$ns, which is a speedup of $3.05$.
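As a quick check that those figures agree:

$$\text{speedup} = \frac{2660\ \text{ns}}{871\ \text{ns}} \approx 3.05$$

Even with every stage overlapped, hazards and pipeline fill keep us short of the ideal speedup, which is the same diminishing-returns story as above.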