# Computer Architecture: Memory Interference and QoS (Part II)

Prof. Onur Mutlu Carnegie Mellon University

## Memory Interference and QoS Lectures

- These slides are from a lecture delivered at INRIA (July 8, 2013)
- Similar slides were used at the ACACES 2013 course
   <u>http://users.ece.cmu.edu/~omutlu/acaces2013-memory.html</u>

## QoS-Aware Memory Systems (Wrap Up)

Onur Mutlu onur@cmu.edu July 9, 2013 INRIA



## Slides for These Lectures

- Architecting and Exploiting Asymmetry in Multi-Core
  - <u>http://users.ece.cmu.edu/~omutlu/pub/onur-INRIA-lecture1-asymmetry-jul-2-2013.pptx</u>
- A Fresh Look At DRAM Architecture
  - http://www.ece.cmu.edu/~omutlu/pub/onur-INRIA-lecture2-DRAM-jul-4-2013.pptx
- QoS-Aware Memory Systems
  - http://users.ece.cmu.edu/~omutlu/pub/onur-INRIA-lecture3memory-gos-jul-8-2013.pptx
- QoS-Aware Memory Systems and Waste Management
  - <u>http://users.ece.cmu.edu/~omutlu/pub/onur-INRIA-lecture4-</u> <u>memory-qos-and-waste-management-jul-9-2013.pptx</u>

## Videos for Similar Lectures

- Basics (of Computer Architecture)
  - http://www.youtube.com/playlist?list=PL5PHm2jkkXmidJOd59 REog9jDnPDTG6IJ
- Advanced (Longer versions of these lectures)
  - <u>http://www.youtube.com/playlist?list=PLVngZ7BemHHV6N0ej</u> <u>HhwOfLwTr8Q-UKXj</u>

### Designing QoS-Aware Memory Systems: Approaches

- Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism
  - QoS-aware memory controllers [Mutlu+ MICRO'07] [Moscibroda+, Usenix Security'07] [Mudu+ ISCA'08, Top Picks'09] [Kim+ HPCA 10] [Kim+ MICRO'10, Top Picks'11] [Ebrahimi+ ISCA'11, MICRO'11] [Ausavarungnirun+, ISCA'12][Subramanian+, HPCA'13]
  - QoS-aware interconnects [Das+ MICRO'09, ISCA'10, Top Picks '11] [Grot+ MICRO'09, ISCA'11, Top Picks '12]
  - QoS-aware caches
- Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping
  - Source throttling to control access to memory system [Ebrahimi+ ASPLOS'10, ISCA'11, TOCS'12] [Ebrahimi+ MICRO'09] [Nychis+ HotNets'10] [Nychis+ SIGCOMM'12]
  - □ QoS-aware data mapping to memory controllers [Muralidhara+ MICRO'11]
  - QoS-aware thread scheduling to cores [Das+ HPCA'13]

## ATLAS Pros and Cons

- Upsides:
  - Good at improving overall throughput (compute-intensive threads are prioritized)
  - Low complexity
  - Coordination among controllers happens infrequently
- Downsides:
  - □ Lowest/medium ranked threads get delayed significantly → high unfairness

# TCM: Thread Cluster Memory Scheduling

Yoongu Kim, Michael Papamichael, <u>Onur Mutlu</u>, and Mor Harchol-Balter, <u>"Thread Cluster Memory Scheduling:</u> <u>Exploiting Differences in Memory Access Behavior"</u> <u>43rd International Symposium on Microarchitecture</u> (MICRO), pages 65-76, Atlanta, GA, December 2010. <u>Slides (pptx) (pdf)</u>

TCM Micro 2010 Talk

## Previous Scheduling Algorithms are Biased

24 cores, 4 memory controllers, 96 workloads



No previous memory scheduling algorithm provides both the best fairness and system throughput **SAFARI** 

9

# Throughput vs. Fairness



### Single policy for all threads is insufficient

# Achieving the Best of Both Worlds



### Thread Cluster Memory Scheduling [Kim+ MICRO'10]

- 1. Group threads into two *clusters*
- 2. Prioritize non-intensive cluster
- 3. Different policies for each cluster



higher

# **Clustering Threads**

**<u>Step1</u>** Sort threads by MPKI (misses per kiloinstruction)



# Prioritization Between Clusters

### Prioritize non-intensive cluster



- Increases system throughput
  - Non-intensive threads have greater potential for making progress
- Does not degrade fairness
  - Non-intensive threads are "light"
  - Rarely interfere with intensive threads

# Non-Intensive Cluster

### Prioritize threads according to MPKI



- Increases system throughput
  - Least intensive thread has the greatest potential for making progress in the processor

# Intensive Cluster

Periodically shuffle the priority of threads



- Is treating all threads equally good enough?
- **BUT:** Equal turns ≠ Same slowdown

# Case Study: A Tale of Two Threads

**Case Study:** Two intensive threads contending

1. random-access 2. streaming

- Which is slowed down more easily?



random-access thread is more easily slowed down

#### Why are Threads Different? random-access streaming stuck → req rea activated row rows Bank 1 Bank 2 Bank 3 Bank 4 Memory

• All requests parallel

• High **bank-level parallelism** 

All requests → Same row
High row-buffer locality

### Vulnerable to interference



### How to quantify difference between threads?



- **1. Round-Robin** shuffling **←** What can go wrong?
- 2. Niceness-Aware shuffling

**GOOD:** Each thread prioritized once



- **1. Round-Robin** shuffling **←** What can go wrong?
- 2. Niceness-Aware shuffling

Most prioritized

**GOOD:** Each thread prioritized once







# TCM Outline





Fairness

# TCM: Quantum-Based Operation



# TCM: Scheduling Algorithm

1. <u>Highest-rank</u>: Requests from higher ranked threads prioritized

- Non-Intensive cluster > Intensive cluster
- Non-Intensive cluster: lower intensity → higher rank
- Intensive cluster: rank shuffling

2. <u>Row-hit</u>: Row-buffer hit requests are prioritized

3. <u>Oldest</u>: Older requests are prioritized

# **TCM: Implementation Cost**

### **Required storage at memory controller** (24 cores)

| Thread memory behavior | Storage  |
|------------------------|----------|
| MPKI                   | ~0.2kb   |
| Bank-level parallelism | ~0.6kb   |
| Row-buffer locality    | ~2.9kb   |
| Total                  | < 4kbits |

• No computation is on the critical path

# Previous Work

**FRFCFS** [Rixner et al., ISCA00]: Prioritizes row-buffer hits

– Thread-oblivious 
 Low throughput & Low fairness

STFM [Mutlu et al., MICRO07]: Equalizes thread slowdowns

Non-intensive threads not prioritized 
 Low throughput

**PAR-BS** [Mutlu et al., ISCA08]: Prioritizes oldest batch of requests while preserving bank-level parallelism

Non-intensive threads not always prioritized 
 Low throughput

**ATLAS** [Kim et al., HPCA10]: Prioritizes threads with less memory service

Most intensive thread starves 
 Low fairness

# TCM: Throughput and Fairness

24 cores, 4 memory controllers, 96 workloads



TCM, a heterogeneous scheduling policy, provides best fairness and system throughput

# TCM: Fairness-Throughput Tradeoff

### When configuration parameter is varied...



TCM allows robust fairness-throughput tradeoff

# **Operating System Support**

- ClusterThreshold is a tunable knob
  - OS can trade off between fairness and throughput

- Enforcing thread weights
  - OS assigns weights to threads
  - TCM enforces thread weights within each cluster

# Conclusion

 No previous memory scheduling algorithm provides both high system throughput and fairness

- **Problem:** They use a single policy for all threads

- TCM groups threads into two *clusters* 
  - 1. Prioritize *non-intensive* cluster → throughput
  - 2. Shuffle priorities in *intensive* cluster  $\rightarrow$  fairness
  - 3. Shuffling should favor *nice* threads  $\rightarrow$  fairness
- TCM provides the best system throughput and fairness

## TCM Pros and Cons

- Upsides:
  - Provides both high fairness and high performance
- Downsides:
  - Scalability to large buffer sizes?
  - Effectiveness in a heterogeneous system?

# Staged Memory Scheduling

Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel Loh, and <u>Onur Mutlu</u>, "Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems" 39th International Symposium on Computer Architecture (ISCA), Portland, OR, June 2012.



## SMS: Executive Summary

- Observation: Heterogeneous CPU-GPU systems require memory schedulers with large request buffers
- Problem: Existing monolithic application-aware memory scheduler designs are hard to scale to large request buffer sizes
- Solution: Staged Memory Scheduling (SMS)
   decomposes the memory controller into three simple stages:
   1) Batch formation: maintains row buffer locality
  - 2) Batch scheduler: reduces interference between applications
  - 3) DRAM command scheduler: issues requests to DRAM
- Compared to state-of-the-art memory schedulers:
  - SMS is significantly simpler and more scalable
  - SMS provides higher performance and fairness

# SMS: Staged Memory Scheduling



# SMS: Staged Memory Scheduling



# Putting Everything Together



# Complexity

### Compared to a row hit first scheduler, SMS consumes\*

- 66% less area
- 46% less static power

- Reduction comes from:
  - Monolithic scheduler  $\rightarrow$  stages of simpler schedulers
  - Each stage has a simpler scheduler (considers fewer properties at a time to make the scheduling decision)
  - Each stage has simpler buffers (FIFO instead of out-of-order)
  - Each stage has a portion of the total buffer size (buffering is distributed across stages)

### Performance at Different GPU Weights



# Performance at Different GPU Weights



 At every GPU weight, SMS outperforms the best previous scheduling algorithm for that weight

### Stronger Memory Service Guarantees

Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and <u>Onur Mutlu</u>, "MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems" Proceedings of the <u>19th International Symposium on High-Performance Computer Architecture</u> (HPCA), Shenzhen, China, February 2013. <u>Slides (pptx)</u>

### Strong Memory Service Guarantees

- Goal: Satisfy performance bounds/requirements in the presence of shared main memory, prefetchers, heterogeneous agents, and hybrid memory
- Approach:
  - Develop techniques/models to accurately estimate the performance of an application/agent in the presence of resource sharing
  - Develop mechanisms (hardware and software) to enable the resource partitioning/prioritization needed to achieve the required performance levels for all applications
  - □ All the while providing high system performance

### **MISE:**

# Providing Performance Predictability in Shared Main Memory Systems

### Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, Onur Mutlu





### Unpredictable Application Slowdowns



# An application's performance depends on which application it is running with

### Need for Predictable Performance

- There is a need for predictable performance
  - When multiple applications share resources
  - Especially if some applications require performance

Our Goal: Predictable performance in the presence of memory interference

- Example 2: In server systems
  - Different users' jobs consolidated onto the same server
  - Need to provide bounded slowdowns to critical jobs

# Outline

- 1. Estimate Slowdown
  Key Observations
  Implementation
  MISE Model: Putting it All Together
  Evaluating the Model
- 2. Control Slowdown
  - Providing Soft Slowdown Guarantees
     Minimizing Maximum Slowdown

# $Slowdown = \frac{Performance Alone}{Performance Shared}$

### Key Observation 1

### For a memory bound application, Performance $\infty$ Memory request service rate



### Key Observation 2

Request Service Rate <sub>Alone</sub> (RSR<sub>Alone</sub>) of an application can be estimated by giving the application highest priority in accessing memory

> Highest priority  $\rightarrow$  Little interference (almost as if the application were run alone)



### 1. Run alone





### SAFARI

### Memory Interference-induced Slowdown Estimation (MISE) model for memory bound applications

# $Slowdown = \frac{Request Service Rate Alone (RSRAlone)}{Request Service Rate Shared (RSRShared)}$



### Memory phase slowdown dominates overall slowdown

# Key Observation 3

Non-memory-bound application

**Compute Phase** 

Memory Interference-induced Slowdown Estimation (MISE) model for non-memory bound applications

Slowdown =  $(1 - \alpha) + \alpha \frac{\text{RSR}_{\text{Alone}}}{\text{RSR}_{\text{Shared}}}$ 

With interference

Dnly memory fraction (a) slows down with interference

### **SAFARI**

# Measuring $RSR_{Shared}$ and $\alpha$

- Request Service Rate <sub>Shared</sub> (RSR<sub>Shared</sub>)
  - Per-core counter to track number of requests serviced
  - At the end of each interval, measure

# $RSR_{Shared} = \frac{Number of Requests Serviced}{Interval Length}$

- Memory Phase Fraction (*a*)
  - Count number of stall cycles at the core
  - Compute fraction of cycles stalled for memory

### SAFARI

### Estimating Request Service Rate Alone (RSRAlone)

- Divide each interval into shorter epochs
- At the beginning of each epoch
  - Memory controller randomly picks an application as the highest priority application
     How: Periodically give each application
- At the sneet ap internal, for each application metimotey

 $RSR_{Alone} = \frac{Number of Requests During High Priority Epochs}{Number of Cycles Application Given High Priority}$ 

# Inaccuracy in Estimating RSR<sub>Alone</sub>



### SAFARI

### Accounting for Interference in RSR<sub>Alone</sub> Estimation

 Solution: Determine and remove interference cycles from RSR<sub>Alone</sub> calculation

 $RSR_{Alone} = \frac{Number of Requests During High Priority Epochs}{Number of Cycles Applicatio n Given High Priority Interference Cycles}$ 

- A cycle is an interference cycle if
  - a request from the highest priority application is waiting in the request buffer *and*
  - another application's request was issued previously

# Outline

- 1. Estimate Slowdown
  - Key Observations
  - Implementation
  - MISE Model: Putting it All Together
     Evaluating the Model
- 2. Control Slowdown
  - Providing Soft Slowdown Guarantees
     Minimizing Maximum Slowdown



#### SAFARI

### Previous Work on Slowdown Estimation

Previous work on slowdown estimation

STFM (Stall Time Fair Memory) Scheduling [Mutlu+, MICRO '07]

- FST (Fairness via Source Throttling) [Ebrahimi+, ASPLOS '10]
- Per-thread Cycle Accounting [Du Bois+, HiPEAC `13]
- Basic Idea:



Count number of cycles application receives interference

### **SAFARI**

### Two Major Advantages of MISE Over STFM

- Advantage 1:
  - □ STFM estimates alone performance while an application is receiving interference → Hard
  - MISE estimates alone performance while giving an application the highest priority → Easier
- Advantage 2:
  - STFM does not take into account compute phase for non-memory-bound applications
  - $\square$  MISE accounts for compute phase  $\rightarrow$  Better accuracy

# Methodology

- Configuration of our simulated system
  - 4 cores
  - 1 channel, 8 banks/channel
  - DDR3 1066 DRAM
  - 512 KB private cache/core
- Workloads
  - SPEC CPU2006
  - 300 multi programmed workloads

### Quantitative Comparison



### Comparison to STFM



#### SAFARI

# Providing "Soft" Slowdown Guarantees

### Goal

- 1. Ensure QoS-critical applications meet a prescribed slowdown bound
- 2. Maximize system performance for other applications
- Basic Idea
  - Allocate just enough bandwidth to QoS-critical application
  - Assign remaining bandwidth to other applications

### MISE-QoS: Mechanism to Provide Soft QoS

- Assign an initial bandwidth allocation to QoS-critical application
- Estimate slowdown of QoS-critical application using the MISE model
- After every N intervals
  - □ If slowdown > bound B +/-  $\varepsilon$ , increase bandwidth allocation
  - □ If slowdown < bound B +/-  $\varepsilon$ , decrease bandwidth allocation
- When slowdown bound not met for N intervals
   Notify the OS so it can migrate/de-schedule jobs

# Methodology

- Each application (25 applications in total) considered the QoS-critical application
- Run with 12 sets of co-runners of different memory intensities
- Total of 300 multiprogrammed workloads
- Each workload run with 10 slowdown bound values
- Baseline memory scheduling mechanism
  - Always prioritize QoS-critical application
    - [Iyer+, SIGMETRICS 2007]
  - Other applications' requests scheduled in FRFCFS order
     [Zuray leff + UC Dataset 1007, Diverse 1007, 2000]

[Zuravleff +, US Patent 1997, Rixner+, ISCA 2000]

### A Look at One Workload



### Effectiveness of MISE in Enforcing QoS

### Across 3000 data points



MISE-QoS correctly predicts whether or not the bound is met for 95.7% of workloads

### Performance of Non-QoS-Critical Applications



### Other Results in the Paper

- Sensitivity to model parameters
   Robust across different values of model parameters
- Comparison of STFM and MISE models in enforcing soft slowdown guarantees
  - MISE significantly more effective in enforcing guarantees
- Minimizing maximum slowdown
  - MISE improves fairness across several system configurations

# Summary

- Uncontrolled memory interference slows down applications unpredictably
- Goal: Estimate and control slowdowns
- Key contribution
  - MISE: An accurate slowdown estimation model
  - Average error of MISE: 8.2%
- Key Idea
  - Request Service Rate is a proxy for performance
  - Request Service Rate <sub>Alone</sub> estimated by giving an application highest priority in accessing memory
- Leverage slowdown estimates to control slowdowns
  - Providing soft slowdown guarantees
  - Minimizing maximum slowdown

## **MISE:**

# Providing Performance Predictability in Shared Main Memory Systems

### Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, Onur Mutlu





Memory Scheduling for Parallel Applications

Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin, Chang Joo Lee, <u>Onur Mutlu</u>, and Yale N. Patt, <u>"Parallel Application Memory Scheduling"</u> *Proceedings of the <u>44th International Symposium on Microarchitecture</u> (<i>MICRO*), Porto Alegre, Brazil, December 2011. <u>Slides (pptx)</u>

### Handling Interference in Parallel Applications

- Threads in a multithreaded application are inter-dependent
- Some threads can be on the critical path of execution due to synchronization; some threads are not
- How do we schedule requests of inter-dependent threads to maximize multithreaded application performance?
- Idea: Estimate limiter threads likely to be on the critical path and prioritize their requests; shuffle priorities of non-limiter threads to reduce memory interference among them [Ebrahimi+, MICRO'11]
- Hardware/software cooperative limiter thread estimation:
  - Thread executing the most contended critical section
  - Thread that is falling behind the most in a *parallel for* loop

# Aside: Self-Optimizing Memory Controllers

Engin Ipek, <u>Onur Mutlu</u>, José F. Martínez, and Rich Caruana, <u>"Self Optimizing Memory Controllers: A Reinforcement Learning Approach"</u> *Proceedings of the <u>35th International Symposium on Computer Architecture</u> (<i>ISCA*), pages 39-50, Beijing, China, June 2008. <u>Slides (pptx)</u>

### Why are DRAM Controllers Difficult to Design?

- Need to obey DRAM timing constraints for correctness
  - There are many (50+) timing constraints in DRAM
  - tWTR: Minimum number of cycles to wait before issuing a read command after a write command is issued
  - tRC: Minimum number of cycles between the issuing of two consecutive activate commands to the same bank
  - ...
- Need to keep track of many resources to prevent conflicts
  - Channels, banks, ranks, data bus, address bus, row buffers
- Need to handle DRAM refresh
- Need to optimize for performance (in the presence of constraints)
  - Reordering is not simple
  - Predicting the future?

### Many DRAM Timing Constraints

| Latency                               | Symbol           | DRAM cycles | Latency                                        | Symbol           | DRAM cycles |
|---------------------------------------|------------------|-------------|------------------------------------------------|------------------|-------------|
| Precharge                             | <sup>t</sup> RP  | 11          | Activate to read/write                         | <sup>t</sup> RCD | 11          |
| Read column address strobe            | CL               | 11          | Write column address strobe                    | CWL              | 8           |
| Additive                              | AL               | 0           | Activate to activate                           | $^{t}RC$         | 39          |
| Activate to precharge                 | $^{t}RAS$        | 28          | Read to precharge                              | $^{t}RTP$        | 6           |
| Burst length                          | $^{t}BL$         | 4           | Column address strobe to column address strobe | $^{t}CCD$        | 4           |
| Activate to activate (different bank) | $^{t}RRD$        | 6           | Four activate windows                          | $^{t}FAW$        | 24          |
| Write to read                         | <sup>t</sup> WTR | 6           | Write recovery                                 | $^{t}WR$         | 12          |

Table 4. DDR3 1600 DRAM timing specifications

From Lee et al., "DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems," HPS Technical Report, April 2010.

### More on DRAM Operation and Constraints

- Kim et al., "A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM," ISCA 2012.
- Lee et al., "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture," HPCA 2013.



#### Table 2. Timing Constraints (DDR3-1066) [43]

| Phase | Commands                                                                                                       | Name              | Value           |  |
|-------|----------------------------------------------------------------------------------------------------------------|-------------------|-----------------|--|
| 1     | $\begin{array}{l} \text{ACT} \rightarrow \text{READ} \\ \text{ACT} \rightarrow \text{WRITE} \end{array}$       | tRCD              | 15ns            |  |
|       | $\mathrm{ACT} \to \mathrm{PRE}$                                                                                | tRAS              | 37.5ns          |  |
| 2     | $\begin{array}{l} \text{READ} \rightarrow \textit{data} \\ \text{WRITE} \rightarrow \textit{data} \end{array}$ | tCL<br>tCWL       | 15ns<br>11.25ns |  |
| ·     | data burst                                                                                                     | tBL               | 7.5ns           |  |
| 3     | $\text{PRE} \rightarrow \text{ACT}$                                                                            | tRP               | 15ns            |  |
| 1&3   | $ACT \rightarrow ACT$                                                                                          | tRC<br>(tRAS+tRP) | 52.5ns          |  |

# Self-Optimizing DRAM Controllers

- Problem: DRAM controllers difficult to design → It is difficult for human designers to design a policy that can adapt itself very well to different workloads and different system conditions
- Idea: Design a memory controller that adapts its scheduling policy decisions to workload behavior and system conditions using machine learning.
- Observation: Reinforcement learning maps nicely to memory control.
- Design: Memory controller is a reinforcement learning agent that dynamically and continuously learns and employs the best scheduling policy.

### Self-Optimizing DRAM Controllers



Figure 2: (a) Intelligent agent based on reinforcement learning principles; (b) DRAM scheduler as an RL-agent

# Self-Optimizing DRAM Controllers

 Engin Ipek, <u>Onur Mutlu</u>, José F. Martínez, and Rich Caruana, <u>"Self Optimizing Memory Controllers: A Reinforcement Learning</u> <u>Approach</u>" *Proceedings of the <u>35th International Symposium on Computer Architecture</u>* 

(ISCA), pages 39-50, Beijing, China, June 2008.



Figure 4: High-level overview of an RL-based scheduler.

### Performance Results





Figure 15: Performance comparison of FR-FCFS and RL-based memory controllers on systems with 6.4GB/s and 12.8GB/s peak DRAM bandwidth

# QoS-Aware Memory Systems: The Dumb Resources Approach

### Designing QoS-Aware Memory Systems: Approaches

- Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism
  - QoS-aware memory controllers [Mutlu+ MICRO'07] [Moscibroda+, Usenix Security'07] [Mutlu+ ISCA'08, Top Picks'09] [Kim+ HPCA'10] [Kim+ MICRO'10, Top Picks'11] [Ebrahimi+ ISCA'11, MICRO'11] [Ausavarungnirun+, ISCA'12] [Subramanian+, HPCA'13]
  - QoS-aware interconnects [Das+ MICRO'09, ISCA'10, Top Picks '11] [Grot+ MICRO'09, ISCA'11, Top Picks '12]
  - QoS-aware caches
- Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping
  - Source throttling to control access to memory system [Ebrahimi+ ASPLOS'10, ISCA'11, TOCS'12] [Ebrahimi+ MICRO'09] [Nychis+ HotNets'10]
  - □ QoS-aware data mapping to memory controllers [Muralidhara+ MICRO'11]
  - QoS-aware thread scheduling to cores [Das+ HPCA'13]

### Fairness via Source Throttling

Eiman Ebrahimi, Chang Joo Lee, <u>Onur Mutlu</u>, and Yale N. Patt, <u>"Fairness via Source Throttling: A Configurable and High-Performance</u> <u>Fairness Substrate for Multi-Core Memory Systems"</u> <u>15th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems</u> (ASPLOS), pages 335-346, Pittsburgh, PA, March 2010. <u>Slides (pdf)</u>



## Many Shared Resources



### The Problem with "Smart Resources"

- Independent interference control mechanisms in caches, interconnect, and memory can contradict each other
- Explicitly coordinating mechanisms for different resources requires complex implementation
- How do we enable fair sharing of the entire memory system by controlling interference in a coordinated manner?

# An Alternative Approach: Source Throttling

- Manage inter-thread interference at the cores, not at the shared resources
- Dynamically estimate unfairness in the memory system
- Feed back this information into a controller
- Throttle cores' memory access rates accordingly
  - Whom to throttle and by how much depends on performance target (throughput, fairness, per-thread QoS, etc)
  - E.g., if unfairness > system-software-specified target then throttle down core causing unfairness & throttle up core that was unfairly treated
- Ebrahimi et al., "Fairness via Source Throttling," ASPLOS'10, TOCS'12.

### Fairness via Source Throttling (FST) [ASPLOS'10]



# System Software Support

- Different fairness objectives can be configured by system software
  - Keep maximum slowdown in check
    - Estimated Max Slowdown < Target Max Slowdown</p>
  - Keep slowdown of particular applications in check to achieve a particular performance target
    - Estimated Slowdown(i) < Target Slowdown(i)</p>
- Support for thread priorities
  - Weighted Slowdown(i) =
     Estimated Slowdown(i) x Weight(i)

# Source Throttling Results: Takeaways

- Source throttling alone provides better performance than a combination of "smart" memory scheduling and fair caching
  - Decisions made at the memory scheduler and the cache sometimes contradict each other
- Neither source throttling alone nor "smart resources" alone provides the best performance
- Combined approaches are even more powerful
   Source throttling and resource-based interference control

### Designing QoS-Aware Memory Systems: Approaches

- Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism
  - QoS-aware memory controllers [Mutlu+ MICRO'07] [Moscibroda+, Usenix Security'07] [Mutlu+ ISCA'08, Top Picks'09] [Kim+ HPCA'10] [Kim+ MICRO'10, Top Picks'11] [Ebrahimi+ ISCA'11, MICRO'11] [Ausavarungnirun+, ISCA'12] [Subramanian+, HPCA'13]
  - QoS-aware interconnects [Das+ MICRO'09, ISCA'10, Top Picks '11] [Grot+ MICRO'09, ISCA'11, Top Picks '12]
  - QoS-aware caches
- Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping
  - Source throttling to control access to memory system [Ebrahimi+ ASPLOS'10, ISCA'11, TOCS'12] [Ebrahimi+ MICRO'09] [Nychis+ HotNets'10] [Nychis+ SIGCOMM'12]
  - □ QoS-aware data mapping to memory controllers [Muralidhara+ MICRO'11]
  - QoS-aware thread scheduling to cores [Das+ HPCA'13]

# Memory Channel Partitioning

Sai Prashanth Muralidhara, Lavanya Subramanian, <u>Onur Mutlu</u>, Mahmut Kandemir, and Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via <u>Application-Aware Memory Channel Partitioning"</u> <u>A4th International Symposium on Microarchitecture</u> (MICRO), Porto Alegre, Brazil, December 2011. <u>Slides (pptx)</u>

MCP Micro 2011 Talk

### Another Way to Reduce Memory Interference

### Memory Channel Partitioning

 Idea: System software maps badly-interfering applications' pages to different channels [Muralidhara+, MICRO'11]



- Separate data of low/high intensity and low/high row-locality applications
- Especially effective in reducing interference of threads with "medium" and "heavy" memory intensity
  - 11% higher performance over existing systems (200 workloads)

### Memory Channel Partitioning (MCP) Mechanism



- 2. Classify applications into groups
- 3. Partition channels between application groups
- 4. Assign a preferred channel to each application
- 5. Allocate application pages to preferred channel



Hardware

# 2. Classify Applications



# Summary: Memory QoS

- Technology, application, architecture trends dictate new needs from memory system
- A fresh look at (re-designing) the memory hierarchy
  - Scalability: DRAM-System Codesign and New Technologies
  - QoS: Reducing and controlling main memory interference: QoS-aware memory system design
  - Efficiency: Customizability, minimal waste, new technologies
- QoS-unaware memory: uncontrollable and unpredictable
- Providing QoS awareness improves performance, predictability, fairness, and utilization of the memory system

### Summary: Memory QoS Approaches and Techniques

- Approaches: Smart vs. dumb resources
  - Smart resources: QoS-aware memory scheduling
  - Dumb resources: Source throttling; channel partitioning
  - Both approaches are effective in reducing interference
  - No single best approach for all workloads
- Techniques: Request/thread scheduling, source throttling, memory partitioning
  - □ All approaches are effective in reducing interference
  - Can be applied at different levels: hardware vs. software
  - No single best technique for all workloads
- Combined approaches and techniques are the most powerful

• Integrated Memory Channel Partitioning and Scheduling [MICRO'11]

# Computer Architecture: Memory Interference and QoS (Part II)

Prof. Onur Mutlu Carnegie Mellon University