### 18-447

Computer Architecture Lecture 14: Out-of-Order Execution (Dynamic Instruction Scheduling)

> Prof. Onur Mutlu Carnegie Mellon University Spring 2014, 2/19/2014

### Announcements

- Lab due Friday (Feb 21)
- Homework 3 due next Wednesday (Feb 26)
- Exam coming up (before Spring Break)

## Reminder: Lab Late Day Policy Adjustment

- Your total late days have increased to 7
- Each late day beyond all exhausted late days costs you 15% of the full credit of the lab

## Reminder: A Note on Testing Your Code

- Testing is critical in developing any system
- You are responsible for creating your own test programs and ensuring your designs work for all possible cases
- That is how real life works also...
  - No one gives you all possible test cases, workloads, users, etc. beforehand

### Lab 2 Grade Distribution



### Lab 2 Statistics

- MAX 99.62
- MIN 62.74
- MEDIAN 92.59
- MEAN 89.26
- STD 10.21

### HW 2 Grade Distribution




### HW 2 Statistics

- MAX 100
- MIN 0
- MEDIAN 92.98
- MEAN 81.98
- STD 24.82

# Readings for Past Few Lectures (I)

- P&H Chapter 4.9-4.11
- Smith and Sohi, "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, 1995
  - More advanced pipelining
  - Interrupt and exception handling
  - Out-of-order and superscalar execution concepts
- McFarling, "Combining Branch Predictors," DEC WRL Technical Report, 1993.
- Kessler, "The Alpha 21264 Microprocessor," IEEE Micro 1999.

## Readings for Past Few Lectures (II)

- Smith and Pleszkun, "Implementing Precise Interrupts in Pipelined Processors," IEEE Trans on Computers 1988 (earlier version in ISCA 1985).

## Readings Specifically for Today

- Smith and Sohi, "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, 1995
  - More advanced pipelining
  - Interrupt and exception handling
  - Out-of-order and superscalar execution concepts
- Kessler, "The Alpha 21264 Microprocessor," IEEE Micro 1999.

## Readings for Next Lecture

- SIMD Processing
- Basic GPU Architecture
- Other execution models: VLIW, Dataflow
- Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 2008.
- Fatahalian and Houston, "A Closer Look at GPUs," CACM 2008.
- Stay tuned for more readings...

## Maintaining Precise State

- Reorder buffer
- History buffer
- Future register file
- Checkpointing
- Readings
  - Smith and Pleszkun, "Implementing Precise Interrupts in Pipelined Processors," IEEE Trans on Computers 1988 and ISCA 1985.
  - Hwu and Patt, "Checkpoint Repair for Out-of-order Execution Machines," ISCA 1987.

## Registers versus Memory

- So far, we considered mainly registers as part of state
- What about memory?
- What are the fundamental differences between registers and memory?
  - Register dependences are known statically; memory dependences are determined dynamically
  - Register state is small; memory state is large
  - Register state is not visible to other threads/processors; memory state is shared between threads/processors (in a shared memory multiprocessor)

### Maintaining Speculative Memory State: Stores

- Handling out-of-order completion of memory operations
  - UNDOing a memory write more difficult than UNDOing a register write. Why?
  - One idea: Keep store address/data in reorder buffer
    - How does a load instruction find its data?
  - Store/write buffer: Similar to reorder buffer, but used only for store instructions
    - Program-order list of un-committed store operations
    - When store is decoded: Allocate a store buffer entry
    - When store address and data become available: Record in store buffer entry
    - When the store is the oldest instruction in the pipeline: Update the memory address (i.e. cache) with store data
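
The store buffer steps above can be sketched roughly as follows. This is a minimal illustration, not the design of any particular processor: the class name `StoreBuffer`, its method names, and the use of a `dict` to stand in for the cache are all assumptions made for the example. It also assumes that every store in the buffer is older than the load being serviced and that older stores' addresses are already known when the load looks them up.

```python
from collections import deque


class StoreBuffer:
    """Program-order list of uncommitted stores (hypothetical sketch)."""

    def __init__(self, memory):
        self.memory = memory      # dict: address -> value, stands in for the cache
        self.entries = deque()    # oldest store at the left

    def allocate(self):
        """At decode: reserve an entry; address and data are not known yet."""
        entry = {"addr": None, "data": None}
        self.entries.append(entry)
        return entry

    def record(self, entry, addr, data):
        """When the store's address and data become available."""
        entry["addr"] = addr
        entry["data"] = data

    def load(self, addr):
        """A load searches the youngest matching store first, then memory."""
        for entry in reversed(self.entries):
            if entry["addr"] == addr:
                return entry["data"]
        return self.memory.get(addr, 0)

    def commit_oldest(self):
        """When the oldest store is the oldest instruction: update memory."""
        entry = self.entries.popleft()
        self.memory[entry["addr"]] = entry["data"]

    def squash(self):
        """On an exception: drop uncommitted stores; memory was never touched."""
        self.entries.clear()


memory = {0x10: 1}
sb = StoreBuffer(memory)
entry = sb.allocate()
sb.record(entry, 0x10, 42)
forwarded = sb.load(0x10)       # value comes from the store buffer
before_commit = memory[0x10]    # memory still holds the old value
sb.commit_oldest()
after_commit = memory[0x10]     # memory now holds the store data
```

Because memory is written only at commit, undoing a speculative store amounts to discarding its buffer entry, which is why `squash` never has to restore anything.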

Out-of-Order Execution (Dynamic Instruction Scheduling)

## An In-order Pipeline



- Problem: A true data dependency stalls dispatch of younger instructions into functional (execution) units
- Dispatch: Act of sending an instruction to a functional unit

### Can We Do Better?

What do the following two pieces of code have in common (with respect to execution in the previous design)?

| First code sequence | Second code sequence |
|------------------|------------------|
| IMUL R3 ← R1, R2 | LD R3 ← R1 (0)   |
| ADD R3 ← R3, R1  | ADD R3 ← R3, R1  |
| ADD R1 ← R6, R7  | ADD R1 ← R6, R7  |
| IMUL R5 ← R6, R8 | IMUL R5 ← R6, R8 |
| ADD R7 ← R9, R9  | ADD R7 ← R9, R9  |

Answer: First ADD stalls the whole pipeline!

- ADD cannot dispatch because one of its source registers (R3) is not yet available
- Later independent instructions cannot get executed
- How are the above code portions different?
  - Answer: Load latency is variable (unknown until runtime)
  - What does this affect? Think compiler vs. microarchitecture

# Preventing Dispatch Stalls

- Multiple ways of doing it
- You have already seen THREE:
  - **1**.
  - **2**.
  - **3**.
- What are the disadvantages of the above three?
- Any other way to prevent dispatch stalls?
  - Actually, you have briefly seen the basic idea before
    - Dataflow: fetch and "fire" an instruction when its inputs are ready
  - Problem: in-order dispatch (scheduling, or execution)
  - Solution: out-of-order dispatch (scheduling, or execution)

## Out-of-order Execution (Dynamic Scheduling)

- Idea: Move the dependent instructions out of the way of independent ones
  - Rest areas for dependent instructions: Reservation stations
- Monitor the source "values" of each instruction in the resting area
- When all source "values" of an instruction are available, "fire" (i.e. dispatch) the instruction
  - Instructions dispatched in dataflow (not control-flow) order
- Benefit:
  - Latency tolerance: Allows independent instructions to execute and complete in the presence of a long latency operation

## In-order vs. Out-of-order Dispatch

In order dispatch + precise exceptions:



- IMUL R3 ← R1, R2
- ADD R3 ← R3, R1
- ADD R1 ← R6, R7
- IMUL R5 ← R6, R8
- ADD R7 ← R3, R5

Out-of-order dispatch + precise exceptions:



16 vs. 12 cycles

## Enabling OoO Execution

- 1. Need to link the consumer of a value to the producer
  - Register renaming: Associate a "tag" with each data value
- 2. Need to buffer instructions until they are ready to execute
  - Insert instruction into reservation stations after renaming
- 3. Instructions need to keep track of readiness of source values
  - Broadcast the "tag" when the value is produced
  - Instructions compare their "source tags" to the broadcast tag
     → if match, source value becomes ready
- 4. When all source values of an instruction are ready, need to dispatch the instruction to its functional unit (FU)
  - Instruction wakes up if all sources are ready
  - If multiple instructions are awake, need to select one per FU
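
The select step in item 4 can be sketched as below. The slide does not say which awake instruction should win when several compete for the same functional unit; oldest-first is an assumption chosen here for illustration, and `Entry` and `select` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    age: int      # position in program order; smaller means older
    fu: str       # functional unit the instruction needs, e.g. "ADD" or "MUL"
    ready: bool   # True once all source values are available (instruction is awake)


def select(entries):
    """Return {fu: entry}, choosing the oldest awake entry for each functional unit."""
    chosen = {}
    for e in sorted(entries, key=lambda e: e.age):
        if e.ready and e.fu not in chosen:
            chosen[e.fu] = e
    return chosen


window = [
    Entry(age=0, fu="ADD", ready=False),  # still waiting on a source
    Entry(age=1, fu="ADD", ready=True),
    Entry(age=2, fu="ADD", ready=True),   # awake, but loses to the older ADD
    Entry(age=3, fu="MUL", ready=True),
]
picked = {fu: e.age for fu, e in select(window).items()}
print(picked)  # {'ADD': 1, 'MUL': 3}
```

Note that the oldest instruction in the window (age 0) is not selected because it is not awake; a younger, ready instruction dispatches ahead of it, which is the out-of-order part.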

## Tomasulo's Algorithm

- OOO with register renaming invented by Robert Tomasulo
  - Used in IBM 360/91 Floating Point Units
  - Read: Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic Units," IBM Journal of R&D, Jan. 1967.
- What is the major difference today?
  - Precise exceptions: IBM 360/91 did NOT have this
  - Patt, Hwu, Shebanow, "HPS, a new microarchitecture: rationale and introduction," MICRO 1985.
  - Patt et al., "Critical issues regarding HPS, a high performance microarchitecture," MICRO 1985.
- Variants used in most high-performance processors
  - Initially in Intel Pentium Pro, AMD K5
  - Alpha 21264, MIPS R10000, IBM POWER5, IBM z196, Oracle UltraSPARC T4, ARM Cortex A15

# Two Humps in a Modern Pipeline



[Figure: pipeline stages labeled, from front to back: in order, out of order, in order]

- Hump 1: Reservation stations (scheduling window)
- Hump 2: Reordering (reorder buffer, aka instruction window or active window)

### General Organization of an OOO Processor



 Smith and Sohi, "The Microarchitecture of Superscalar Processors," Proc. IEEE, Dec. 1995.

### Tomasulo's Machine: IBM 360/91



# Register Renaming

- Output and anti dependencies are not true dependencies
  - WHY? The same register refers to values that have nothing to do with each other
  - They exist because there are not enough register IDs (i.e. names) in the ISA
- The register ID is renamed to the reservation station entry that will hold the register's value
  - Register ID  $\rightarrow$  RS entry ID
  - Architectural register ID  $\rightarrow$  Physical register ID
  - After renaming, RS entry ID used to refer to the register
- This eliminates anti- and output- dependencies
  - Approximates the performance effect of a large number of registers even though ISA has a small number
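
To make the effect of renaming concrete, here is a small sketch applied to the first five-instruction sequence from the "Can We Do Better?" slide. It is a simplification: tags come from an unlimited supply (`T0`, `T1`, ...) rather than from a finite pool of reservation station entries, and the function name `rename` and the tuple layout are assumptions made for the example.

```python
from itertools import count


def rename(program):
    """program: list of (dest, src1, src2) architectural register names.

    Returns the instructions with each destination replaced by a fresh tag and
    each source replaced by the tag of its latest in-flight producer, if any.
    """
    fresh = count()
    rat = {}  # register alias table: architectural register -> latest tag

    def lookup(reg):
        # A register with no in-flight writer keeps its architectural name.
        return rat.get(reg, reg)

    renamed = []
    for dest, src1, src2 in program:
        s1, s2 = lookup(src1), lookup(src2)  # read sources before renaming dest
        tag = f"T{next(fresh)}"              # new name for this result
        rat[dest] = tag
        renamed.append((tag, s1, s2))
    return renamed


program = [
    ("R3", "R1", "R2"),  # IMUL R3 <- R1, R2
    ("R3", "R3", "R1"),  # ADD  R3 <- R3, R1
    ("R1", "R6", "R7"),  # ADD  R1 <- R6, R7
    ("R5", "R6", "R8"),  # IMUL R5 <- R6, R8
    ("R7", "R9", "R9"),  # ADD  R7 <- R9, R9
]
renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, the two writes to R3 target different tags (output dependence gone), the write to R1 no longer collides with the earlier read of R1 (anti dependence gone), and the true dependence from the IMUL to the first ADD survives as a reference to tag `T0`.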

## Tomasulo's Algorithm: Renaming

### Register rename table (register alias table)



# Tomasulo's Algorithm

- If reservation station available before renaming
  - Instruction + renamed operands (source value/tag) inserted into the reservation station
  - Only rename if reservation station is available
- Else stall
- While in reservation station, each instruction:
  - Watches common data bus (CDB) for tag of its sources
  - When tag seen, grab value for the source and keep it in the reservation station
  - When both operands available, instruction ready to be dispatched
- Dispatch instruction to the Functional Unit when instruction is ready
- After instruction finishes in the Functional Unit
  - Arbitrate for CDB
  - Put tagged value onto CDB (tag broadcast)
  - Register file is connected to the CDB
    - Register contains a tag indicating the latest writer to the register
    - If the tag in the register file matches the broadcast tag, write broadcast value into register (and set valid bit)
  - Reclaim rename tag
    - no valid copy of tag in system!
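
The tag broadcast step can be sketched as follows. This is an illustrative model only: `Operand`, `broadcast`, and `ready` are hypothetical names, the operand values are made up, and CDB arbitration is not modeled (one broadcast per call).

```python
class Operand:
    """A source operand or register: either a valid value or a tag to wait for."""

    def __init__(self, valid, tag=None, value=None):
        self.valid, self.tag, self.value = valid, tag, value


def broadcast(tag, value, reservation_stations, register_file):
    """Put (tag, value) on the CDB; every waiting operand with a matching tag captures it."""
    for rs in reservation_stations:
        for op in rs["sources"]:
            if not op.valid and op.tag == tag:
                op.valid, op.value = True, value
    for reg in register_file.values():
        # The register holds the tag of its latest writer only, so a producer
        # whose destination was renamed again later does not update it.
        if not reg.valid and reg.tag == tag:
            reg.valid, reg.value = True, value


def ready(rs):
    """An instruction can dispatch once all of its sources are valid."""
    return all(op.valid for op in rs["sources"])


# MUL R3 <- R1, R2 was given tag "x"; ADD R5 <- R3, R4 (tag "a") waits for it.
register_file = {"R3": Operand(False, tag="x"), "R5": Operand(False, tag="a")}
add_rs = {"sources": [Operand(False, tag="x"), Operand(True, value=4)]}

was_ready = ready(add_rs)                       # False: still waiting on tag "x"
broadcast("x", 2, [add_rs], register_file)      # MUL finishes and broadcasts
now_ready = ready(add_rs)                       # True: the ADD captured the value
```

In hardware the two loops are parallel comparisons against every waiting source tag, which is what makes the broadcast expensive as the number of reservation station entries grows.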

### An Exercise

```
MUL R3 ← R1, R2
ADD R5 ← R3, R4
ADD R7 ← R2, R6
ADD R10 ← R8, R9
MUL R11 ← R7, R10
ADD R5 ← R5, R11
```



- Assume ADD (4 cycle execute), MUL (6 cycle execute)
- Assume one adder and one multiplier
- How many cycles
  - in a non-pipelined machine
  - in an in-order-dispatch pipelined machine with imprecise exceptions (no forwarding and full forwarding)
  - in an out-of-order dispatch pipelined machine with imprecise exceptions (full forwarding)

### Exercise Continued

Instruction sequence (sources → destination):

- MUL R1, R2 → R3
- ADD R3, R4 → R5
- ADD R2, R6 → R7
- ADD R8, R9 → R10
- MUL R7, R10 → R11
- ADD R5, R11 → R5

Pipeline structure: F D E W (E can take multiple cycles)

- MUL takes 6 cycles
- ADD takes 4 cycles
- How many cycles total without data forwarding?
- How many cycles total with data forwarding?

### Exercise Continued



### Exercise Continued

- MUL R3 ← R1, R2
- ADD R5 ← R3, R4
- ADD R7 ← R2, R6
- ADD R10 ← R8, R9
- MUL R11 ← R7, R10
- ADD R5 ← R5, R11

[Timing diagram: fetch (F), decode (D), numbered execute cycles, and writeback (W) for each instruction]

Tomasulo's algorithm + full forwarding: 20 cycles
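
The 20-cycle total can be checked with a small timing model. The sketch below rests on several assumptions that the slide does not spell out: one instruction is fetched per cycle starting at cycle 1, F, D and W each take one cycle, functional units are pipelined, a dependent instruction may begin executing in the cycle after its producer's last execute cycle (full forwarding), and there is no contention for reservation stations or the common data bus. The names `LATENCY`, `PROGRAM` and `schedule` are hypothetical.

```python
LATENCY = {"MUL": 6, "ADD": 4}

# (opcode, destination, sources) for the six instructions of the exercise
PROGRAM = [
    ("MUL", "R3", ("R1", "R2")),
    ("ADD", "R5", ("R3", "R4")),
    ("ADD", "R7", ("R2", "R6")),
    ("ADD", "R10", ("R8", "R9")),
    ("MUL", "R11", ("R7", "R10")),
    ("ADD", "R5", ("R5", "R11")),
]


def schedule(program):
    """Return (first_execute_cycle, writeback_cycle) for each instruction."""
    last_exec = {}  # register -> last execute cycle of its latest producer
    timeline = []
    for i, (op, dest, srcs) in enumerate(program):
        decode = i + 2  # fetched in cycle i+1, decoded in cycle i+2
        operands_ready = max(last_exec.get(s, 0) + 1 for s in srcs)
        start = max(decode + 1, operands_ready)  # wait only for true dependences
        end = start + LATENCY[op] - 1
        last_exec[dest] = end  # later readers of dest depend on this producer
        timeline.append((start, end + 1))
    return timeline


timeline = schedule(PROGRAM)
total_cycles = max(writeback for _, writeback in timeline)
print(total_cycles)  # 20
```

Under these assumptions the critical path is ADD R10, then MUL R11, then the final ADD R5; the first MUL and the ADD that depends on it finish well before the end.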

### How It Works







Cycle 2: MUL R1, R2 → R3

- reads its sources from the RAT
- writes to its destination in the RAT (renames its destination)
  - allocates a reservation station entry
  - allocates a tag for its destination register
- places its sources in the reservation station entry that is allocated

End of cycle 2:

[Figure: contents of the RAT (valid, tag, value) and the reservation stations]

MUL at X becomes ready to execute: both of its sources are valid in reservation station X. (What if multiple instructions become ready at the same time?)



Cycle 4: ADD R2, R6 → R7 gets renamed and placed into RS

End of cycle 4:




End of cycle 7:

[Figure: contents of the RAT (valid, tag, value) and the reservation stations]

All 6 instructions renamed.

- Note what happened to R5

Cycle 8:

- MUL at X and ADD at b broadcast their tags and values
- RS entries waiting for these tags capture the values and set the valid bits accordingly
  - What is needed in HW to accomplish this? CAM on tags that are broadcast, for all RS entries and sources
- RAT entries waiting for these tags also capture the values and set the valid bits accordingly

## An Exercise, with Precise Exceptions





- Assume ADD (4 cycle execute), MUL (6 cycle execute)
- Assume one adder and one multiplier
- How many cycles
  - in a non-pipelined machine
  - in an in-order-dispatch pipelined machine with reorder buffer (no forwarding and full forwarding)
  - in an out-of-order dispatch pipelined machine with reorder buffer (full forwarding)

### Out-of-Order Execution with Precise Exceptions

- Idea: Use a reorder buffer to reorder instructions before committing them to architectural state
- An instruction updates the register alias table (essentially a future file) when it completes execution
- An instruction updates the architectural register file when it is the oldest in the machine and has completed execution
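
The in-order commit rule in the last bullet can be sketched as below. This is a minimal model under stated assumptions: `ReorderBuffer` and its methods are hypothetical names, the values are made up, and exceptions, stores and the register alias table are left out.

```python
from collections import deque


class ReorderBuffer:
    """Completed results wait here until every older instruction has committed."""

    def __init__(self):
        self.entries = deque()  # program order, oldest at the left
        self.arch_regs = {}     # architectural register file

    def allocate(self, dest):
        """At decode/rename: reserve an entry in program order."""
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """When execution finishes (possibly out of order)."""
        entry["value"], entry["done"] = value, True

    def commit(self):
        """Retire from the head while the oldest instruction has completed."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.arch_regs[entry["dest"]] = entry["value"]
            retired.append(entry["dest"])
        return retired


rob = ReorderBuffer()
mul = rob.allocate("R3")   # older, long-latency instruction
add = rob.allocate("R7")   # younger, independent instruction

rob.complete(add, 8)       # the ADD finishes first
first = rob.commit()       # nothing retires: the MUL is still the oldest
rob.complete(mul, 2)
second = rob.commit()      # both retire, in program order
```

The younger ADD completes early but cannot change architectural state until the MUL ahead of it has committed, which is what keeps the state precise if the MUL raises an exception.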

### Out-of-Order Execution with Precise Exceptions



[Figure: pipeline stages labeled, from front to back: in order, out of order, in order]

- Hump 1: Reservation stations (scheduling window)
- Hump 2: Reordering (reorder buffer, aka instruction window or active window)

## Enabling OoO Execution, Revisited

- 1. Link the consumer of a value to the producer
  - Register renaming: Associate a "tag" with each data value
- 2. Buffer instructions until they are ready
  - Insert instruction into reservation stations after renaming
- 3. Keep track of readiness of source values of an instruction
  - Broadcast the "tag" when the value is produced
  - Instructions compare their "source tags" to the broadcast tag
     → if match, source value becomes ready
- 4. When all source values of an instruction are ready, dispatch the instruction to functional unit (FU)
  - Wakeup and select/schedule the instruction

## Summary of OOO Execution Concepts

- Register renaming eliminates false dependencies, enables linking of producer to consumers
- Buffering enables the pipeline to move for independent ops
- Tag broadcast enables communication (of readiness of produced value) between instructions
- Wakeup and select enables out-of-order dispatch