  * Debugging is also difficult (non-deterministic behavior)
  * Dekker's algorithm (see the sketch below, before Lecture 29)
    * Inconsistency -- the two processors did NOT see the same order of operations to memory
  * Sequential consistency
    * Multiple correct global orders
  * Invalidates the block on a write
  * Has an exclusive state
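
The inconsistency noted above under Dekker's algorithm is easiest to see in the two-flag entry handshake the algorithm builds on. Below is a minimal C sketch of just that handshake (the full algorithm also uses a turn variable, omitted here; names are illustrative, not from the slides): on a sequentially consistent machine at most one thread can enter the critical section, but if each processor's load of the other flag is allowed to bypass its own buffered store, both loads can return 0 and both threads enter.

<code c>
/* Minimal sketch of the two-flag handshake underlying Dekker's algorithm.
 * Thread and variable names are illustrative. */
volatile int flag0 = 0, flag1 = 0;

void thread0_enter(void) {
    flag0 = 1;                 /* announce intent to enter */
    if (flag1 == 0) {
        /* critical section: thread 0 believes thread 1 is not competing.
         * Safe only if the store to flag0 becomes visible to thread 1
         * before thread 1's load of flag0 -- i.e., under sequential
         * consistency. With store buffering, both threads may read 0. */
    }
}

void thread1_enter(void) {
    flag1 = 1;
    if (flag0 == 0) {
        /* critical section */
    }
}
</code>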
+ | |||
+ | ===== Lecture 29 (4/10 Fri.) ===== | ||
+ | * MSI coherent protocol | ||
+ | * The problem: unnecessary broadcasts of invalidations | ||
+ | * MESI coherent protocol | ||
+ | * Add the exclusive state: this is the only cache copy and it is a clean state to MSI | ||
+ | * Multiple invalidation tradeoffs | ||
+ | * Problem: memory can be unnecessarily updated | ||
+ | * A possible owner state (MOESI) | ||
+ | * Tradeoffs between snooping and directory based coherence protocols | ||
+ | * Slide 31 has a good summary | ||
+ | * Directory: data structures | ||
+ | * Bit vectors vs. linked lists | ||
+ | * Scalability of directories | ||
+ | * Size? Latency? Thousand of nodes? Best of both snooping and directory? | ||
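
As a concrete illustration of the bit-vector option, here is a rough C sketch of a full-bit-vector directory entry and the invalidations it generates on a write. The structure and function names are made up for this sketch, and the network sends are stubbed out.

<code c>
/* Rough sketch of a full-bit-vector directory entry (names are illustrative). */
#include <stdint.h>
#include <stdbool.h>

#define NUM_NODES 64                /* one presence bit per node per block */

typedef struct {
    uint64_t sharers;               /* bit i set => node i may cache the block */
    bool     dirty;                 /* true => exactly one node holds it modified */
    uint8_t  owner;                 /* valid only when dirty */
} dir_entry_t;

static void send_invalidation(int node) { (void)node; /* network send omitted */ }

/* On a write request, invalidate every other sharer and record the new owner.
 * A bit vector costs NUM_NODES bits per block (the scalability concern);
 * a linked list of sharers saves space but serializes the invalidations. */
void directory_handle_write(dir_entry_t *e, int requester) {
    for (int node = 0; node < NUM_NODES; node++)
        if (node != requester && (e->sharers & (1ULL << node)))
            send_invalidation(node);
    e->sharers = 1ULL << requester;
    e->dirty   = true;
    e->owner   = (uint8_t)requester;
}
</code>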
+ | |||
+ | | ||
+ | ===== Lecture 30 (4/13 Mon.) ===== | ||
+ | * In-memory computing | ||
+ | * Design goals of DRAM | ||
+ | * DRAM structures | ||
+ | * Banks | ||
+ | * Capacitors and sense amplifiers | ||
+ | * Trade-offs b/w number of sense amps and cells | ||
+ | * Width of bank I/O vs. row size | ||
+ | * DRAM operations | ||
+ | * ACTIVATE, READ/WRITE, and PRECHARGE | ||
+ | * Trade-offs | ||
+ | * Latency | ||
+ | * Bandwidth: Chip vs. rank vs. bank | ||
+ | * What's the benefit of having 8 chips? | ||
+ | * Parallelism | ||
+ | * RowClone | ||
+ | * What are the problems? | ||
+ | * Copying b/w two rows that share the same sense amplifier | ||
+ | * System software support | ||
+ | * Bitwise AND/OR | ||
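
To make the command sequence concrete, here is a small C sketch of what a memory controller issues to one bank for a read, depending on whether the requested row is already latched in the sense amplifiers (the row buffer). The function and field names are invented for this sketch, and the real timing constraints (tRCD, tRP, tCAS, etc.) are omitted.

<code c>
/* Sketch of the per-bank ACTIVATE / READ / PRECHARGE sequence for a read.
 * Names are illustrative; a real controller also enforces timing constraints. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static void issue_PRECHARGE(void)        { puts("PRECHARGE"); }
static void issue_ACTIVATE(uint32_t row) { printf("ACTIVATE row %u\n", row); }
static void issue_READ(uint32_t col)     { printf("READ col %u\n", col); }

typedef struct {
    bool     row_open;   /* a row is currently latched in the sense amplifiers */
    uint32_t open_row;
} bank_state_t;

void bank_read(bank_state_t *bank, uint32_t row, uint32_t col) {
    if (bank->row_open && bank->open_row == row) {
        issue_READ(col);            /* row hit: row buffer already holds the row */
        return;
    }
    if (bank->row_open)
        issue_PRECHARGE();          /* close the open row, restore the cells */
    issue_ACTIVATE(row);            /* sense the requested row into the row buffer */
    issue_READ(col);                /* move the requested column over the bank I/O */
    bank->row_open = true;
    bank->open_row = row;
}

int main(void) {
    bank_state_t bank = { false, 0 };
    bank_read(&bank, 7, 3);         /* row miss: ACTIVATE + READ */
    bank_read(&bank, 7, 4);         /* row hit: READ only */
    bank_read(&bank, 9, 0);         /* row conflict: PRECHARGE + ACTIVATE + READ */
    return 0;
}
</code>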
+ | |||
+ | ===== Lecture 31 (4/15 Wed.) ===== | ||
+ | |||
+ | * Application slowdown | ||
+ | * Interference between different applications | ||
+ | * Applications' performance depends on other applications that they are running with | ||
+ | * Predictable performance | ||
+ | * Why are they important? | ||
+ | * Applications that need predictibility | ||
+ | * How to predict the performance? | ||
+ | * What information are useful? | ||
+ | * What need to be guarantee? | ||
+ | * How to estimate the performance when running with others? | ||
+ | * Easy, just measure the performance while it is running. | ||
+ | * How to estimate the performance when the application is running by itself. | ||
+ | * Hard if there is no profiling. | ||
+ | * The relationship between memory service rate and the performance. | ||
+ | * Key assumption: applications are memory bound | ||
+ | * Behavior of memory-bound applications | ||
+ | * With and without interference | ||
+ | * Memory phase vs. compute phase | ||
+ | * MISE | ||
+ | * Estimating slowdown using request service rate | ||
+ | * Inaccuracy when measuring request service rate alone | ||
+ | * Non-memory-bound applications | ||
+ | * Control slowdown and provide soft guarantee | ||
+ | * Taking into account of the shared cache | ||
+ | * MISE model + cache resource management | ||
+ | * Aug tag store | ||
+ | * Separate tag store for different cores | ||
+ | * Cache access rate alone and shared as the metric to estimate slowdown | ||
+ | * Cache paritiioning | ||
+ | * How to determine partitioning | ||
+ | * Utility based cache partitioning | ||
+ | * Others | ||
+ | * Maximum slowdown and fairness metric | ||
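
For a memory-bound application, the core MISE estimate is just a ratio of request service rates, with the "alone" rate sampled by periodically giving the application highest priority at the memory controller. The tiny C sketch below shows only that ratio; the numbers and names are made up.

<code c>
/* MISE-style slowdown estimate for a memory-bound application:
 * slowdown ~= request service rate alone / request service rate shared.
 * Values below are invented for illustration. */
#include <stdio.h>

static double estimate_slowdown(double rate_alone, double rate_shared) {
    return rate_alone / rate_shared;
}

int main(void) {
    double rate_alone  = 8.0e8;   /* requests/s while prioritized (proxy for running alone) */
    double rate_shared = 5.0e8;   /* requests/s while sharing memory with other applications */
    printf("estimated slowdown = %.2fx\n", estimate_slowdown(rate_alone, rate_shared));
    return 0;
}
</code>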

===== Lecture 32 (4/20 Mon.) =====

  * Heterogeneous systems
    * Asymmetric cores: different types of cores on the chip
      * Each of these cores is optimized for different workloads/requirements/goals
    * Multiple special-purpose processors
    * Flexible and can adapt to workload behavior
    * Disadvantages: complex and high overhead
    * Examples: CPU-GPU systems, heterogeneity in execution models
  * Heterogeneous resources
    * Example: reliable and non-reliable DRAM in the same system
  * Key problems in modern systems
    * Memory system
    * Efficiency
    * Predictability
    * Asymmetric design can help solve these problems
  * Serialized code sections
    * Bottleneck in multicore execution
    * Parallelizable vs. serial portion
    * Accelerating critical sections
    * Cache ping-ponging
    * Synchronization latency
  * Symmetric vs. asymmetric design
    * Large cores + small cores
    * Core asymmetry
  * Amdahl's law with heterogeneous cores (see the sketch below)
  * Parallel bottlenecks
    * Resource contention
      * Depends on which applications are running together
  * Accelerated critical sections
    * Ship critical sections to large cores
    * Small modifications and low overhead
    * False serialization might become the bottleneck
    * Can reduce parallel throughput
    * Effect on private cache misses and shared cache misses
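
For the Amdahl's law point, a rough model (a sketch with made-up parameters, not the exact formulation from the slides): run the serial fraction on the single large core and the parallel fraction across the small cores, with everything normalized to the performance of one small core.

<code c>
/* Amdahl's-law-style speedup model for an asymmetric multicore (illustrative):
 * the serial fraction runs on one large core, the parallel fraction on the
 * small cores. Times are normalized to a single small core running everything. */
#include <stdio.h>

static double asymmetric_speedup(double serial_frac,  /* fraction of work that is serial */
                                 double perf_large,   /* large-core perf relative to a small core */
                                 double num_small)    /* small cores used for the parallel part */
{
    double serial_time   = serial_frac / perf_large;
    double parallel_time = (1.0 - serial_frac) / num_small;
    return 1.0 / (serial_time + parallel_time);
}

int main(void) {
    /* Example: 10% serial, large core 2x a small core, 16 small cores. */
    printf("speedup = %.2fx\n", asymmetric_speedup(0.10, 2.0, 16.0));
    return 0;
}
</code>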

===== Lecture 33 (4/27 Mon.) =====

  * Interconnects
    * Connecting multiple components together
    * Goals: scalability, flexibility, performance, and energy efficiency
    * Metrics: performance, bandwidth, bisection bandwidth, cost, energy efficiency, system performance, contention, latency
    * Saturation point
    * Saturation throughput
  * Topology
    * How to wire components together; affects routing, throughput, latency
    * Bus: all nodes connected to a single shared link
      * Hard to increase frequency and bandwidth, poor scalability, but simple
    * Point-to-point
      * Low contention and potentially low latency. Costly, not scalable, and hard to wire.
    * Crossbar
      * No contention. Requests between different source/destination pairs can be serviced concurrently. Costly.
    * Multistage logarithmic network
      * Indirect network, low contention, multiple requests can be serviced concurrently. More scalable than a crossbar.
      * Circuit switching
      * Omega network, delta network
    * Butterfly network
      * Intermediate switches between sources and destinations
    * Switching vs. topology
    * Ring
      * Each node connected to two other nodes, forming a ring
      * Low overhead, high latency, not as scalable
      * Unidirectional ring and bi-directional ring
    * Hierarchical rings
      * Layers of rings: more scalable, lower latency
      * Bridge routers connect multiple rings together
    * Mesh
      * 4 input and output ports
      * More bisection bandwidth and more scalable
      * Easy to lay out
      * Path diversity
      * Routers are more complex
    * Tree
      * Another hierarchical topology
      * Specialized topology
      * Good for local traffic
      * Fat tree: higher levels have more bandwidth
      * CM-5 fat tree
        * Fat tree with 4x2 switches
    * Hypercube
      * N-dimensional cubes
      * Caltech Cosmic Cube
      * Very complex
  * Routing algorithm
    * How a message gets sent from source to destination
    * Static or adaptive
    * Handling contention
      * Buffering helps handle contention, but adds complexity
    * Three types of routing algorithms
      * Deterministic: always takes the same path (see the sketch below)
      * Oblivious: takes different paths without taking the state of the network into account
        * For example, Valiant's algorithm
      * Adaptive: takes different paths, taking the state of the network into account
        * Non-minimal adaptive routing vs. minimal adaptive routing
    * Minimal path: a path with the minimum number of hops
  * Buffering and flow control
    * How packets are buffered within the network
    * Handling oversubscription
      * Source throttling
    * Bufferless vs. buffered crossbars
      * Buffer overflow
    * Bufferless deflection routing
      * Deflect packets when there is contention
      * Hot-potato routing
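
As an example of the deterministic case, here is a small C sketch of dimension-order (X-then-Y) routing on a 2D mesh, a common deterministic and minimal routing algorithm; the coordinates and port names are illustrative, not taken from the slides.

<code c>
/* Dimension-order (X-then-Y) routing on a 2D mesh: every packet between the
 * same source and destination always takes the same, minimal path. */
#include <stdio.h>

typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } port_t;

port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y) {
    if (cur_x != dst_x)               /* first correct the X dimension */
        return (dst_x > cur_x) ? EAST : WEST;
    if (cur_y != dst_y)               /* then correct the Y dimension */
        return (dst_y > cur_y) ? NORTH : SOUTH;
    return LOCAL;                     /* arrived: eject to the local node */
}

int main(void) {
    /* Route hop by hop from (0,0) to (2,1). */
    int x = 0, y = 0;
    while (!(x == 2 && y == 1)) {
        port_t p = xy_route(x, y, 2, 1);
        if (p == EAST)       x++;
        else if (p == WEST)  x--;
        else if (p == NORTH) y++;
        else if (p == SOUTH) y--;
        printf("hop to (%d,%d)\n", x, y);
    }
    return 0;
}
</code>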
+ | |||
+ | |||
+ | |