readings

Differences

This shows you the differences between two versions of the page.

readings [2013/02/16 00:59]
yoonguk
readings [2013/04/29 16:53] (current)
yoonguk [Lecture 33 (4/29 Mon.)]
Line 151: Line 151:
 ===== Lecture 14 (2/18  Mon.) =====
 **Required:**
-  * {{00004607.pdf|Smith, J. E., & PleszkunAR. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}}
+  * {{00476078.pdf|Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}}
  
 **Recommended:**
 +  * {{00004607.pdf|Smith, J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}}
   * {{p18-hwu.pdf|Hwu, W. W., & Patt, Y. N. (1987). Checkpoint repair for out-of-order execution machines. Proceedings of the 14th annual international symposium on Computer architecture.}}
 +
 +**Mentioned during lecture:**
 +  * {{patt_hwu_shebanow_-_1985_-_hps_a_new_microarchitecture_rationale_and_introduction.pdf|Patt, Y. N., Hwu, W. M., & Shebanow, M. (1985). HPS, a new microarchitecture: rationale and introduction. Proceedings of the 18th annual workshop on Microprogramming.}}
 +  * {{tomasulo_-_1967_-_an_efficient_algorithm_for_exploiting_multiple_arithmetic_units.pdf|Tomasulo, R. M. (1967). An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development.}}
 +  * {{p109-patt.pdf|Patt, Y. N., Melvin, S. W., Hwu, W. M., & Shebanow, M. C. (1985). Critical issues regarding HPS, a high performance microarchitecture. Proceedings of the 18th annual workshop on Microprogramming.}}
 +  * {{p248-sazeides.pdf |Sazeides, Y., & Smith, J. E. (1997). The predictability of data values. Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture.}}
 +===== Lecture 15 (2/20  Wed.) =====
 +**Required:**
 +  * {{04523358.pdf|Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}}
 +  * {{p50-fatahalian.pdf|Fatahalian, K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}}
 +
 +**Recommended:**
 +  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler, R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}}
 +  * {{hinton_et_al._-_2001_-_the_microarchitecture_of_the_pentium_4_processor.pdf|Hinton, G., Sager, D., Upton, M., Boggs, D., Carmean, D., Kyker, A., & Roussel, P. (2001). The Microarchitecture of the Pentium 4 Processor. Intel Technology Journal.}}
 +  * {{00491460.pdf|Yeager, K. C. (1996). The MIPS R10000 Superscalar Microprocessor. IEEE Micro.}}
 +  * {{05389044.pdf|Tendler, J. M., Dodson, J. S., Fields, J. S., Le, H., & Sinharoy, B. (2002). POWER4 system microarchitecture. IBM J. Res. Dev.}}
 +
 +
 +**Mentioned during lecture:**
 +  * {{p181-moshovos.pdf|Moshovos, A., Breach, S. E., Vijaykumar, T. N., & Sohi, G. S. (1997). Dynamic speculation and synchronization of data dependences. Proceedings of the 24th annual international symposium on Computer architecture.}}
 +  * {{p142-chrysos.pdf|Chrysos, G. Z., & Emer, J. S. (1998). Memory dependence prediction using store sets. Proceedings of the 25th annual international symposium on Computer architecture.}}
 +  * {{p34-gurd.pdf|Gurd, J. R., Kirkham, C. C., & Watson, I. (1985). The Manchester prototype dataflow computer. Commun. ACM.}}
 +  * {{00048862.pdf|Arvind, K., & Nikhil, R. S. (1990). Executing a Program on the MIT Tagged-Token Dataflow Architecture. IEEE Trans. Comput.}}
 +  * {{p297-hwu.pdf|Hwu, W., & Patt, Y. N. (1986). HPSm, a high performance restricted data flow architecture having minimal functionality. Proceedings of the 13th annual international symposium on Computer architecture.}}
 +  * {{01447203.pdf|Flynn, M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}}
 +  * {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher, J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}}
 +
 +===== Lecture 16 (2/25  Mon.) =====
 +**Required:**
 +  * P&H Chapter 5.4
 +  * Hamacher et al. Chapter 8.8
 +
 +===== Lecture 17 (2/27  Wed.) =====
 +**Required:**
 +  * P&H Chapter 5.4
 +  * Hamacher et al. Chapter 8.8
 +
 +**Recommended:**
 +  * {{denning_-_1970_-_virtual_memory.pdf|Denning, P. J. (1970). Virtual Memory. ACM Computing Surveys, 2(3).}}
 +  * {{00710872.pdf|Jacob, B., & Mudge, T. (1998). Virtual memory in contemporary microprocessors. IEEE Micro.}}
 +
 +===== Lecture 18 (3/1  Fri.) =====
 +**Required:**
 +  * P&H Chapter 5.4
 +  * Hamacher et al. Chapter 8.8
 +
 +**Recommended:**
 +  * {{denning_-_1970_-_virtual_memory.pdf|Denning, P. J. (1970). Virtual Memory. ACM Computing Surveys, 2(3).}}
 +  * {{00710872.pdf|Jacob, B., & Mudge, T. (1998). Virtual memory in contemporary microprocessors. IEEE Micro.}}
 +
 +
 +===== Lecture 19 (3/18 Mon.) =====
 +**Required:**
 +  * {{04523358.pdf|Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}}
 +  * {{p50-fatahalian.pdf|Fatahalian, K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}}
 +
 +**Mentioned during lecture:**
 +  * {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher, J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}}
 +  * {{russell_-_1978_-_the_cray-1_computer_system.pdf|Russell, R. M. (1978). The CRAY-1 computer system. Commun. ACM.}}
 +  * {{00526924.pdf|Peleg, A., & Weiser, U. (1996). MMX technology extension to the Intel architecture. IEEE Micro.}}
 +
 +
 +===== Lecture 20 (3/20 Wed.) =====
 +**Required:**
 +  * P&H Chapters 5.1-5.3 (cache chapters)
 +  * Hamacher et al. Chapters 8.1-8.7 (cache/memory chapters)
 +  * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes, M. V. (1965). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}}
 +
 +**Mentioned during lecture:**
 +  * {{30470407.pdf|Fung, W. W. L., Sham, I., Yuan, G., & Aamodt, T. M. (2007). Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture.}}
 +  * {{p253-suleman.pdf |Suleman, M. A., Mutlu, O., Qureshi, M. K., & Patt, Y. N. (2009). Accelerating critical section execution with asymmetric multi-core architectures. Proceedings of the 14th international conference on Architectural support for programming languages and operating systems.}}
 +  * {{01447203.pdf|Flynn, M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}}
 +  * {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher, J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}}
 +  * {{smith_-_1982_-_decoupled_accessexecute_computer_architectures.pdf|Smith, J. E. (1982). Decoupled access/execute computer architectures. Proceedings of the 9th annual symposium on Computer Architecture.}}
 +  * {{p289-smith.pdf|Smith, J. E. (1984). Decoupled access/execute computer architectures. ACM Trans. Comput. Syst.}}
 +  * {{p199-smith.pdf|Smith, J. E., Dermer, G. E., Vanderwarn, B. D., Klinger, S. D., & Rozewski, C. M. (1987). The ZS-1 central processor. Proceedings of the second international conference on Architectural support for programming languages and operating systems.}}
 +  * {{00030730.pdf|Smith, J. E. (1989). Dynamic instruction scheduling and the Astronautics ZS-1. IEEE Computer.}}
 +  * {{kung_-_1982_-_why_systolic_architectures.pdf|Kung, H. T. (1982). Why Systolic Architectures? IEEE Computer.}}
 +  * {{annaratone_et_al._-_1986_-_warp_architecture_and_implementation.pdf|Annaratone, M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. S. (1986). Warp architecture and implementation. Proceedings of the 13th annual international symposium on Computer architecture.}}
 +  * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone, M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture, implementation, and performance. IEEE Transactions on Computers.}}
 +
 +===== Lecture 21 (3/25 Mon.) =====
 +**Required:**
 +  * P&H Chapters 5.1-5.3 (cache chapters)
 +  * Hamacher et al. Chapters 8.1-8.7 (cache/memory chapters)
 +  * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes, M. V. (1965). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}}
 +
 +**Mentioned during lecture:**
 +  * {{01675827.pdf|Fisher, J. A. (1981). Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Trans. Comput.}}
 +  * {{2fbf01205185.pdf|Hwu, W.-M. W., Mahlke, S. A., Chen, W. Y., Chang, P. P., Warter, N. J., Bringmann, R. A., Ouellette, R. G., et al. (1993). The superblock: an effective technique for VLIW and superscalar compilation. J. Supercomput.}}
 +  * {{p45-mahlke.pdf|Mahlke, S. A., Lin, D. C., Chen, W. Y., Hank, R. E., & Bringmann, R. A. (1992). Effective compiler support for predicated execution using the hyperblock. Proceedings of the 25th annual international symposium on Microarchitecture.}}
 +  * {{melvin_patt_-_1995_-_enhancing_instruction_scheduling_with_a_block-structured_isa.pdf|Melvin, S., & Patt, Y. (1995). Enhancing instruction scheduling with a block-structured ISA. Int. J. Parallel Program.}}
 +  * {{hao_et_al._-_1996_-_increasing_the_instruction_fetch_rate_via_block-structured_instruction_set_architectures.pdf|Hao, E., Chang, P.-Y., Evers, M., & Patt, Y. N. (1996). Increasing the instruction fetch rate via block-structured instruction set architectures. Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture.}}
 +  * {{00877947.pdf|Huck, J., Morris, D., Ross, J., Knies, A., Mulder, H., & Zahir, R. (2000). Introducing the IA-64 architecture. IEEE Micro.}}
 +
 +
 +===== Lecture 22 (3/27 Wed.) =====
 +**Required:**
 +  * P&H Chapters 5.1-5.3 (cache chapters)
 +  * Hamacher et al. Chapters 8.1-8.7 (cache/memory chapters)
 +  * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes, M. V. (1965). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}}
 +
 +**Mentioned during lecture:**
 +  * {{liptay_-_1968_-_structural_aspects_of_the_system360_model_85_ii_the_cache.pdf|Liptay, J. S. (1968). Structural aspects of the system/360 model 85: II the cache. IBM Syst. J.}}
 +
 +===== Lecture 23 (3/29 Fri.) =====
 +**Required:**
 +  * P&H Chapters 5.1-5.3 (cache chapters)
 +  * Hamacher et al. Chapters 8.1-8.7 (cache/memory chapters)
 +  * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes, M. V. (1965). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}}
 +**Mentioned during lecture:**
 +  * {{26080167.pdf|Qureshi, M. K., Lynch, D. N., Mutlu, O., & Patt, Y. N. (2006). A Case for MLP-Aware Cache Replacement. Proceedings of the 33rd annual international symposium on Computer Architecture.}}
 +  * {{05388441.pdf|Belady, L. A. (1966). A study of replacement algorithms for a virtual-storage computer. IBM Syst. J.}}
 +
 +===== Lecture 24 (4/1 Mon.) =====
 +**Required:**
 +  * {{26080167.pdf|Qureshi, M. K., Lynch, D. N., Mutlu, O., & Patt, Y. N. (2006). A Case for MLP-Aware Cache Replacement. Proceedings of the 33rd annual international symposium on Computer Architecture.}}
 +**Mentioned during lecture:**
 +  * {{jouppi_-_1990_-_improving_direct-mapped_cache_performance_by_the_addition_of_a_small_fully-associative_cache_and_prefetch_buffers.pdf|Jouppi, N. P. (1990). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th annual international symposium on Computer Architecture.}}
 +  * {{p74-rau.pdf|Rau, B. R. (1991). Pseudo-randomly interleaved memory. Proceedings of the 18th annual international symposium on Computer architecture.}}
 +  * {{p169-seznec.pdf|Seznec, A. (1993). A case for two-way skewed-associative caches. Proceedings of the 20th annual international symposium on computer architecture.}}
 +  
 +
 +===== Lecture 25 (4/3 Wed.) =====
 +**Required:**
 +  * {{26080167.pdf|Qureshi, M. K., Lynch, D. N., Mutlu, O., & Patt, Y. N. (2006). A Case for MLP-Aware Cache Replacement. Proceedings of the 33rd annual international symposium on Computer Architecture.}}
 +
 +**Mentioned during lecture:**
 +  * {{p81-kroft.pdf|Kroft, D. (1981). Lockup-free instruction fetch/prefetch cache organization. Proceedings of the 8th annual symposium on Computer Architecture.}}
 +
 +===== Lecture 26 (4/8 Mon.) =====
 +**Required:**
 +  * None
 +
 +**Recommended:**
 +  * {{p6-bell.pdf |Bell, G., & Strecker, W. D. (1998). Retrospective: what have we learned from the PDP-11 -- what we have learned from VAX and Alpha. 25 years of the international symposia on Computer architecture (selected papers).}}
 +  * {{liu_et_al._-_2012_-_raidr_retention-aware_intelligent_dram_refresh.pdf|Liu, J., Jaiyen, B., Veras, R., & Mutlu, O. (2012). RAIDR: Retention-Aware Intelligent DRAM Refresh. Proceedings of the 39th International Symposium on Computer Architecture.}}
 +
 +**Mentioned during lecture:**
 +  * {{p422-bloom.pdf|Bloom, B. H. (1970). Space/time trade-offs in hash coding with allowable errors. Commun. ACM.}}
 +  * {{kim_et_al._-_2012_-_a_case_for_exploiting_subarray-level_parallelism_salp_in_dram.pdf|Kim, Y., Seshadri, V., Lee, D., Liu, J., & Mutlu, O. (2012). A case for exploiting subarray-level parallelism (SALP) in DRAM. Proceedings of the 39th International Symposium on Computer Architecture.}}
 +  * {{seshadri_et_al._-_2012_-_the_evicted-address_filter_a_unified_mechanism_to_address_both_cache_pollution_and_thrashing.pdf|Seshadri, V., Mutlu, O., Kozuch, M. A., & Mowry, T. C. (2012). The evicted-address filter: a unified mechanism to address both cache pollution and thrashing. Proceedings of the 21st international conference on Parallel architectures and compilation techniques.}}
 +  * {{tldram_hpca13.pdf|Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur Mutlu. (2013). Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture. Proceedings of the 19th International Symposium on High-Performance Computer Architecture.}}
 +  * (To be published.) Jamie Liu, Ben Jaiyen, Yoongu Kim, Chris Wilkerson, and Onur Mutlu. (2013). An Experimental Study of Data Retention Behavior in Modern DRAM Devices: Implications for Retention Time Profiling Mechanisms. Proceedings of the 40th International Symposium on Computer Architecture.
 +  * {{tr-hps-2010-002.pdf|Lee, C. J., Narasiman, V., Ebrahimi, E., Mutlu, O., & Patt, Y. N. (2010). DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems. TR-HPS-2010-002, UT Austin.}}
 +
 +===== Lecture 27 (4/10 Wed.) =====
 +** Required: **
 +  * None
 +
 +** Recommended: **
 +  * {{p6-bell.pdf|Bell, G., & Strecker, W. D. (1998). Retrospective: what have we learned from the PDP-11—what we have learned from VAX and Alpha. 25 years of the international symposia on Computer architecture (selected papers).}}
 +  * {{p1-bell.pdf|Bell, G., & Strecker, W. D. (1976). Computer structures: What have we learned from the PDP-11? Proceedings of the 3rd annual symposium on Computer architecture.}}
 +
 +** Mentioned during lecture: **
 +  * {{moscibroda.pdf|Moscibroda, T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.}}
 +  * {{30470146.pdf|Mutlu, O., & Moscibroda, T. (2007). Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (pp. 146–160).}}
 +  * {{3174a063.pdf|Mutlu, O., & Moscibroda, T. (2008). Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. Proceedings of the 35th Annual International Symposium on Computer Architecture.}}
 +  * {{4299a065.pdf|Kim, Y., Papamichael, M., Mutlu, O., & Harchol-Balter, M. (2010). Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.}}
 +  * {{muralidhara_et_al._-_2011_-_reducing_memory_interference_in_multicore_systems_via_application-aware_memory_channel_partitioning.pdf|Muralidhara, S. P., Subramanian, L., Mutlu, O., Kandemir, M., & Moscibroda, T. (2011). Reducing memory interference in multicore systems via application-aware memory channel partitioning. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}}
 +  * {{p335-ebrahimi.pdf|Ebrahimi, E., Lee, C. J., Mutlu, O., & Patt, Y. N. (2010). Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems. Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems.}}
 +  * {{p362-ebrahimi.pdf|Ebrahimi, E., Miftakhutdinov, R., Fallin, C., Lee, C. J., Joao, J. A., Mutlu, O., & Patt, Y. N. (2011). Parallel application memory scheduling. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}}
 +
 +===== Lecture 28 (4/12 Fri.) =====
 +** Required: **
 +  * {{mutlu_et_al._-_2003_-_runahead_execution_an_alternative_to_very_large_instruction_windows_for_out-of-order_processors.pdf|Mutlu, O., Stark, J., Wilkerson, C., & Patt, Y. N. (2003). Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. Proceedings of the 9th International Symposium on High-Performance Computer Architecture.}}
 +  * {{04147648.pdf|Srinath, S., Mutlu, O., Kim, H., & Patt, Y. N. (2007). Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture.}}
 +
 +** Recommended: **
 +  * {{24400233.pdf|Mutlu, O., Kim, H., & Patt, Y. N. (2005). Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns. Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture.}}
 +  * {{01603492.pdf|Mutlu, O., Kim, H., & Patt, Y. N. (2006). Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance. IEEE Micro.}}
 +  * {{21260119.pdf|Armstrong, D. N., Kim, H., Mutlu, O., & Patt, Y. N. (2004). Wrong Path Events: Exploiting Unusual and Illegal Program Behavior for Early Misprediction Detection and Recovery. Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture.}}
 +
 +===== Lecture 29 (4/15 Mon.) =====
 +** Required: **
 +  * None
 +** Mentioned during lecture: **
 +  * {{p176-baer.pdf|Baer, J.-L., & Chen, T.-F. (1991). An effective on-chip preloading scheme to reduce data access penalty. Proceedings of the 1991 ACM/IEEE conference on Supercomputing.}}
 +  * {{jouppi_-_1990_-_improving_direct-mapped_cache_performance_by_the_addition_of_a_small_fully-associative_cache_and_prefetch_buffers.pdf|Jouppi, N. P. (1990). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th annual international symposium on Computer Architecture.}}
 +  * {{mowry_lam_gupta_-_1992_-_design_and_evaluation_of_a_compiler_algorithm_for_prefetching.pdf|Mowry, T. C., Lam, M. S., & Gupta, A. (1992). Design and evaluation of a compiler algorithm for prefetching. Proceedings of the fifth international conference on Architectural support for programming languages and operating systems.}}
 +
 +===== Lecture 30 (4/22 Mon.) =====
 +** Required: **
 +  * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}}
 +  * {{lamport_-_1979_-_how_to_make_a_multiprocessor_computer_that_correctly_executes_multiprocess_programs.pdf|Lamport, L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.}}
 +  * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/culler-mesi.pdf|C&S, Chapters 5.1 & 5.3]]
 +  * P&H, Chapter 5.8
 +** Recommended: **
 +  * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/hill_551_560.pdf|Hill, Jouppi, Sohi. "Multiprocessors and Multicomputers," pp. 551-560 in Readings in Computer Architecture.]]
 +  * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/hill_309_314.pdf|Hill, Jouppi, Sohi. "Dataflow and Multithreading," pp. 309-314 in Readings in Computer Architecture.]]
 +  * {{01447203.pdf|Flynn, M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}}
 +  * {{papamarcos_patel_-_1984_-_a_low-overhead_coherence_solution_for_multiprocessors_with_private_cache_memories.pdf|Papamarcos, M. S., & Patel, J. H. (1984). A low-overhead coherence solution for multiprocessors with private cache memories. Proceedings of the 11th annual international symposium on Computer architecture.}}
 +** Mentioned during lecture: **
 +  * {{p176-baer.pdf|Baer, J.-L., & Chen, T.-F. (1991). An effective on-chip preloading scheme to reduce data access penalty. Proceedings of the 1991 ACM/IEEE conference on Supercomputing.}}
 +  * {{04147648.pdf|Srinath, S., Mutlu, O., Kim, H., & Patt, Y. N. (2007). Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture.}}
 +  * {{joseph_grunwald_-_1997_-_prefetching_using_markov_predictors.pdf|Joseph, D., & Grunwald, D. (1997). Prefetching using Markov predictors. Proceedings of the 24th annual international symposium on Computer architecture.}}
 +  * {{p279-cooksey.pdf|Cooksey, R., Jourdan, S., & Grunwald, D. (2002). A stateless, content-directed data prefetching mechanism. Proceedings of the 10th international conference on Architectural support for programming languages and operating systems.}}
 +  * {{04798232.pdf|Ebrahimi, E., Mutlu, O., & Patt, Y. N. (2009). Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. High Performance Computer Architecture, 2009.}}
 +  * {{p186-chappell.pdf|Chappell, R. S., Stark, J., Kim, S. P., Reinhardt, S. K., & Patt, Y. N. (1999). Simultaneous subordinate microthreading (SSMT). Proceedings of the 26th annual international symposium on Computer architecture.}}
 +  * {{p2-zilles.pdf|Zilles, C., & Sohi, G. (2001). Execution-based prediction using speculative slices. Proceedings of the 28th annual international symposium on Computer architecture.}}
 +  * {{p40-luk.pdf|Luk, C.-K. (2001). Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. Proceedings of the 28th annual international symposium on Computer architecture.}}
 +  * {{p172-zilles.pdf|Zilles, C. B., & Sohi, G. S. (2000). Understanding the backward slices of performance degrading instructions. Proceedings of the 27th annual international symposium on Computer architecture.}}
 +  * {{mutlu_et_al._-_2003_-_runahead_execution_an_alternative_to_very_large_instruction_windows_for_out-of-order_processors.pdf|Mutlu, O., Stark, J., Wilkerson, C., & Patt, Y. N. (2003). Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. Proceedings of the 9th International Symposium on High-Performance Computer Architecture.}}
 +  * {{jouppi_-_1990_-_improving_direct-mapped_cache_performance_by_the_addition_of_a_small_fully-associative_cache_and_prefetch_buffers.pdf|Jouppi, N. P. (1990). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th annual international symposium on Computer Architecture.}}
 +
 +===== Lecture 31 (4/24 Wed.) =====
 +** Required: **
 +  * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}}
 +  * {{lamport_-_1979_-_how_to_make_a_multiprocessor_computer_that_correctly_executes_multiprocess_programs.pdf|Lamport, L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.}}
 +  * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/culler-mesi.pdf|C&S, Chapters 5.1 & 5.3]]
 +  * P&H, Chapter 5.8
 +** Recommended: **
 +  * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/hill_551_560.pdf|Hill, Jouppi, Sohi. "Multiprocessors and Multicomputers," pp. 551-560 in Readings in Computer Architecture.]]
 +  * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/hill_309_314.pdf|Hill, Jouppi, Sohi. "Dataflow and Multithreading," pp. 309-314 in Readings in Computer Architecture.]]
 +  * {{01447203.pdf|Flynn, M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}}
 +  * {{papamarcos_patel_-_1984_-_a_low-overhead_coherence_solution_for_multiprocessors_with_private_cache_memories.pdf|Papamarcos, M. S., & Patel, J. H. (1984). A low-overhead coherence solution for multiprocessors with private cache memories. Proceedings of the 11th annual international symposium on Computer architecture.}}
 +** Mentioned during lecture: **
 +  * [[http://www.cs.utexas.edu/users/EWD/transcriptions/EWD01xx/EWD123.html|Dijkstra (1965). Cooperating Sequential Processes.]]
 +  * {{p124-goodman.pdf|Goodman, J. R. (1983). Using cache memory to reduce processor-memory traffic. Proceedings of the 10th annual international symposium on Computer architecture (pp. 124–131).}}
 +  * {{01675013.pdf|Censier, L. M., & Feautrier, P. (1978). A New Solution to Coherence Problems in Multicache Systems. IEEE Trans. Comput.}}
 +
 +===== Lecture 32 (4/26 Fri.) =====
 +** Required: **
 +  * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}}
 +  * {{lamport_-_1979_-_how_to_make_a_multiprocessor_computer_that_correctly_executes_multiprocess_programs.pdf|Lamport, L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.}}
 +  * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/culler-mesi.pdf|C&S, Chapters 5.1 & 5.3]]
 +  * P&H, Chapter 5.8
 +** Recommended: **
 +  * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/hill_551_560.pdf|Hill, Jouppi, Sohi. "Multiprocessors and Multicomputers," pp. 551-560 in Readings in Computer Architecture.]]
 +  * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/hill_309_314.pdf|Hill, Jouppi, Sohi. "Dataflow and Multithreading," pp. 309-314 in Readings in Computer Architecture.]]
 +  * {{01447203.pdf|Flynn, M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}}
 +  * {{papamarcos_patel_-_1984_-_a_low-overhead_coherence_solution_for_multiprocessors_with_private_cache_memories.pdf|Papamarcos, M. S., & Patel, J. H. (1984). A low-overhead coherence solution for multiprocessors with private cache memories. Proceedings of the 11th annual international symposium on Computer architecture.}}
 +** Mentioned during lecture: **
 +  * {{p168-patel.pdf|Patel, J. H. (1979). Processor-memory interconnections for multiprocessors. Proceedings of the 6th annual symposium on Computer architecture.}}
 +  * {{p196-moscibroda.pdf|Moscibroda, T., & Mutlu, O. (2009). A case for bufferless routing in on-chip networks. Proceedings of the 36th annual international symposium on Computer architecture.}}
 +  * {{p27-gottlieb.pdf|Gottlieb, A., Grishman, R., Kruskal, C. P., McAuliffe, K. P., Rudolph, L., & Snir, M. (1982). The NYU Ultracomputer -- designing a MIMD, shared-memory parallel machine (Extended Abstract). Proceedings of the 9th annual symposium on Computer Architecture.}}
 +  * {{p22-seitz.pdf|Seitz, C. L. (1985). The cosmic cube. Commun. ACM.}}
 +  * {{p278-glass.pdf|Glass, C. J., & Ni, L. M. (1992). The turn model for adaptive routing. Proceedings of the 19th annual international symposium on Computer architecture.}}
 +
 +===== Lecture 33 (4/29 Mon.) =====
 +** Required: **
 +  * None
 +
 +** Mentioned during lecture: **
 +  * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}}
 +  * {{grochowski_et_al._-_2004_-_best_of_both_latency_and_throughput.pdf|Grochowski, E., Ronen, R., Shen, J., & Wang, H. (2004). Best of Both Latency and Throughput. Proceedings of the IEEE International Conference on Computer Design (pp. 236–243).}}
 +  * {{tendler_et_al._-_2002_-_power4_system_microarchitecture.pdf|Tendler, J. M., Dodson, J. S., Fields, J. S., Le, H., & Sinharoy, B. (2002). POWER4 system microarchitecture. IBM J. Res. Dev.}}
 +  * {{01289290.pdf|Kalla, R., Sinharoy, B., & Tendler, J. M. (2004). IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro.}}
 +  * {{kongetira_aingaran_olukotun_-_2005_-_niagara_a_32-way_multithreaded_sparc_processor.pdf|Kongetira, P., Aingaran, K., & Olukotun, K. (2005). Niagara: A 32-Way Multithreaded Sparc Processor. IEEE Micro.}}
 +  * {{p253-suleman.pdf|Suleman, M. A., Mutlu, O., Qureshi, M. K., & Patt, Y. N. (2009). Accelerating critical section execution with asymmetric multi-core architectures. Proceedings of the 14th international conference on Architectural support for programming languages and operating systems.}}
 +  * {{p441-suleman.pdf|Suleman, M. A., Mutlu, O., Joao, J. A., Khubaib, & Patt, Y. N. (2010). Data marshaling for multi-core architectures. Proceedings of the 37th annual international symposium on Computer architecture.}}
 +  * {{p223-joao.pdf|Joao, J. A., Suleman, M. A., Mutlu, O., & Patt, Y. N. (2012). Bottleneck identification and scheduling in multithreaded applications. Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems.}}
readings.1360994379.txt.gz · Last modified: 2013/02/16 00:59 by yoonguk