## Introduction to: 18-740 "Computer Architecture" and 18-640 "Foundations of Computer Architecture"

Prof. Onur Mutlu (18-740) and Prof. John Paul Shen (18-640), August 31, 2015



18-640/18-740 Lecture 0



#### 18-740 Instructor: Onur Mutlu

- Associate Professor @ Carnegie Mellon University ECE/CS
- PhD from UT-Austin 2006, BS from Michigan 2000
- Past experience @ Microsoft Research, Intel, AMD
- omutlu@gmail.com (Best way to reach me)
- http://www.ece.cmu.edu/~omutlu
- http://users.ece.cmu.edu/~omutlu/projects.htm
- Research and Education in
  - Computer architecture and systems, bioinformatics
  - Memory and storage systems, emerging technologies
  - Many-core systems, heterogeneous systems, core design
  - Interconnects
  - Hardware/software interaction and co-design (PL, OS, Architecture)
  - Predictable and QoS-aware systems
  - Hardware fault tolerance and security
  - Algorithms and architectures for genome analysis


#### 18-740 Teaching Assistant: Nandita

- Nandita Vijaykumar
  - PhD student with Onur Mutlu
  - BE from PES Inst. of Technology 2011
  - Past experience @ AMD
  - nandita@cmu.edu



- Office hours and locations will be posted online
  - http://www.ece.cmu.edu/~ece740/f15
- Reach all of us at
  - 740-official@ece.cmu.edu


#### 18-640 Cast of Characters

- Instructor: John Paul Shen (SV)
- Academic Services Assistants:
  - Zara Collier (PGH)
  - Stephanie Scott (SV)
- Teaching Assistants:
  - Priyank Sanghavi (PGH): covers PGH & SV sections
  - Guanglin Xu (PGH): covers GZ & PGH sections
  - Adeola Bannis (SV): covers SV section
  - Revathy Shunmugam (PGH): covers PGH & GZ sections
- JIE Course Coordinator: Ziyan He (GZ)









#### My Personal Background:

#### Academia

- Carnegie Mellon University
  - First Half
    - Sabbatical at Stanford
  - Second Half
    - Sabbatical at Intel

#### Computer Aided Design

Computer Architecture

#### Industry

- Intel, Research Lab
  - Microarchitecture Lab → Microprocessor Design
- Nokia, Research Center
  - North America Lab → Mobile Computing System




#### 18-640/18-740 Computer Architecture

### Lecture 1: "Introduction To Computer Architecture"

John Paul Shen August 31, 2015

- Required Reading Assignment:
  - Chapters 1 and 2 of Shen and Lipasti (SnL).
- Recommended References:
  - "Amdahl's and Gustafson's Laws Revisited" by Andrzej Karbowski. (2008)
  - "High Performance Reduced Instruction Set Processors" by Tilak Agerwala and John Cocke. (1987)



8/31/2015 (©J.P. Shen)

18-640/18-740 Lecture 1

Carnegie Mellon University

#### Lecture 1: "Introduction to Computer Architecture"

- A. Instruction Set Architecture (ISA)
  - a. Hardware / Software Interface
  - b. Dynamic / Static Interface (DSI)
- B. Historical Perspective on Computing
  - a. Major Epochs
  - b. Processor Performance Iron Law (#1)
  - c. Course Coverage
- C. "Economics" of Computer Architecture
  - a. Amdahl's Law and Gustafson's Law
  - b. Moore's Law and Bell's Law


#### Anatomy of Engineering Design

- Specification: Behavioral description of "What does it do?"
- Synthesis: Search for possible solutions; pick the best one. A creative process.
- Implementation: Structural description of "How is it constructed?"
- Analysis: Validate whether the design meets the specification.
  - "Does it do the right thing?" + "How well does it perform?"


#### Lecture 1: "Introduction to Computer Architecture"

#### A. Instruction Set Architecture (ISA)

- a. Hardware / Software Interface
- b. Dynamic / Static Interface (DSI)








[Gerrit Blaauw & Fred Brooks, 1981]

#### Art and Science of Instruction Set Processor Design

#### ARCHITECTURE (ISA) programmer/compiler view

- Functional programming model to application/system programmers
- Opcodes, addressing modes, architected registers, IEEE floating point

#### IMPLEMENTATION (µarchitecture) processor designer view

- Logical structure or organization that performs the ISA specification
- Pipelining, functional units, caches, physical registers, buses, branch predictors

#### REALIZATION (Chip) chip/system designer view

- Physical structure that embodies the implementation
- Gates, cells, transistors, wires, dies, packaging




#### Computer Architecture: Dynamic-Static Interface

- Architectural state requirements:
  - Support sequential instruction execution semantics.
  - Support precise servicing of exceptions & interrupts.
- Buffering needed between arch and uarch states:
  - Allow uarch state to deviate from arch state.
  - Able to undo speculative uarch state if needed.

[Figure: layered stack with labels PROGRAM; Windows 7; Visual C++; x86 Machine Primitives; Dynamic/Static Interface (DSI) = (ISA) ARCHITECTURE; Von Neumann Machine; Logic Gates & Memory; Transistors & Devices; Quantum Physics. Side labels: 'static', Exposed to SW, Architectural State; Hidden in HW, Microarchitecture State.]

DSI = ISA = a contract between the program and the machine.
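The buffering idea on this slide can be sketched in a few lines. This is a toy model, not a description of any real processor: the `Machine` class, its register names, and the `commit`/`squash` methods are all hypothetical names chosen for illustration. It shows speculative results held in hidden (uarch) state that either become architectural state or are discarded.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Toy model: architectural registers plus a buffer of speculative writes."""
    arch_regs: dict = field(default_factory=lambda: {"r1": 0, "r2": 0})
    spec_writes: list = field(default_factory=list)  # hidden uarch state

    def execute_speculatively(self, reg: str, value: int) -> None:
        # Results land in the buffer first, so uarch state may run ahead
        # of (deviate from) the architectural state.
        self.spec_writes.append((reg, value))

    def commit(self) -> None:
        # Retire buffered writes in program order; only now does software
        # see them as architectural state.
        for reg, value in self.spec_writes:
            self.arch_regs[reg] = value
        self.spec_writes.clear()

    def squash(self) -> None:
        # Undo speculation (e.g. after a mispredicted branch or an exception):
        # discard the buffer, leaving the architectural state untouched.
        self.spec_writes.clear()


m = Machine()
m.execute_speculatively("r1", 42)
m.squash()                        # speculation turned out to be wrong
after_squash = dict(m.arch_regs)
m.execute_speculatively("r2", 7)
m.commit()                        # speculation turned out to be right
after_commit = dict(m.arch_regs)
print(after_squash, after_commit)
```

Software only ever observes `arch_regs`, which is the sense in which the ISA acts as a contract: the buffer can hold anything as long as committed state follows sequential semantics.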







#### Lecture 1: "Introduction to Computer Architecture"

#### B. Historical Perspective on Computing

- a. Major Epochs
- b. Processor Performance Iron Law (#1)
- c. Course Coverage






#### Historical Perspective on the Last Five Decades

- The Decade of the 1960's: "Computer Architecture Foundations"
  - Von Neumann computation model, programming languages, compilers, OS's
  - Commercial Mainframe computers, Scientific numerical computers
- The Decade of the 1970's: "Birth of Microprocessors"
  - Programmable controllers, bit-sliced ALU's, single-chip processors
  - Emergence of Personal Computers (PC)
- The Decade of the 1980's: "Quantitative Architecture"
  - Instruction pipelining, fast cache memories, compiler considerations
  - Widely available Minicomputers, emergence of Personal Workstations
- The Decade of the 1990's: "Instruction-Level Parallelism"
  - Superscalar, speculative microarchitectures, aggressive compiler optimizations
  - Widely available low-cost desktop computers, emergence of Laptop computers
- The Decade of the 2000's: "Mobile Computing Convergence"
  - Multi-core architectures, system-on-chip integration, power constrained designs
  - Convergence of smartphones and laptops, emergence of Tablet computers


#### Intel 4004, circa 1971



#### The first single chip CPU

- 4-bit processor for a calculator.
- 1K data memory
- 4K program memory
- 2,300 transistors
- 16-pin DIP package
- 740kHz (eight clock cycles per CPU cycle of 10.8 microseconds)
- ~100K OPs per second

Molecular Expressions: Chipshots


#### Intel Itanium 2, circa 2002



#### Performance leader in floating-point apps

- 64-bit processor
- 3 MByte of cache!!
- 221 million transistors
- 1 GHz, issue up to 8 instructions per cycle

In ~30 years, about 100,000-fold growth in transistor count!

http://cpus.hp.com/images/die_photos/McKinley_die.jpg
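The growth claim can be checked with the two transistor counts quoted on these slides. The sketch below assumes the "circa" years 1971 and 2002 as endpoints; the variable names are illustrative only.

```python
import math

transistors_4004 = 2_300            # Intel 4004, circa 1971
transistors_itanium2 = 221_000_000  # Intel Itanium 2, circa 2002
years = 2002 - 1971                 # assumed endpoints from the slide titles

growth = transistors_itanium2 / transistors_4004
doublings = math.log2(growth)
months_per_doubling = years * 12 / doublings

print(f"growth factor: {growth:,.0f}x")
print(f"doublings: {doublings:.1f}")
print(f"implied doubling period: {months_per_doubling:.1f} months")
```

The ratio comes out a little under 100,000, and the implied doubling period is somewhat under two years.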


[John Crawford, Intel, 1993]

#### Performance Growth in Perspective

- Doubling every 18 months (1982-2000):
  - total of 3,200X
  - Cars travel at 176,000 MPH; get 64,000 miles/gal.
  - Air travel: L.A. to N.Y. in 5.5 seconds (MACH 3200)
  - Wheat yield: 320,000 bushels per acre
- Doubling every 24 months (1971-2001):
  - total of 36,000X
  - Cars travel at 2,400,000 MPH; get 600,000 miles/gal.
  - Air travel: L.A. to N.Y. in 0.5 seconds (MACH 36,000)
  - Wheat yield: 3,600,000 bushels per acre

#### Unmatched by any other industry!!
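As a rough check on the totals above, steady doubling can be compounded directly. The helper name `growth_factor` is hypothetical, and the sketch assumes exact doubling over whole calendar years.

```python
def growth_factor(start_year: int, end_year: int, months_per_doubling: float) -> float:
    """Cumulative growth when a quantity doubles every `months_per_doubling` months."""
    months = (end_year - start_year) * 12
    return 2 ** (months / months_per_doubling)


fast = growth_factor(1982, 2000, 18)  # 12 doublings
slow = growth_factor(1971, 2001, 24)  # 15 doublings
print(f"{fast:,.0f}x  {slow:,.0f}x")
```

Exact compounding gives 4,096X and 32,768X, the same order of magnitude as the rounded 3,200X and 36,000X quoted on the slide.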


#### Convergence of Key Enabling Technologies

- CMOS VLSI:
  - Submicron feature sizes: $0.3\,\mu\text{m} \rightarrow 0.25\,\mu\text{m} \rightarrow 0.18\,\mu\text{m} \rightarrow 0.13\,\mu\text{m} \rightarrow 90\,\text{nm} \rightarrow 65\,\text{nm} \rightarrow 45\,\text{nm} \rightarrow 32\,\text{nm} \dots$
  - Metal layers: $3 \rightarrow 4 \rightarrow 5 \rightarrow 6 \rightarrow 7$ (copper) $\rightarrow 12 \dots$
  - Power supply voltage: $5\,\text{V} \rightarrow 3.3\,\text{V} \rightarrow 2.4\,\text{V} \rightarrow 1.8\,\text{V} \rightarrow 1.3\,\text{V} \rightarrow 1.1\,\text{V} \dots$
- CAD Tools:
  - Interconnect simulation and critical path analysis
  - Clock signal propagation analysis
  - Process simulation and yield analysis/learning
- Microarchitecture:
  - Superpipelined and superscalar machines
  - Speculative and dynamic microarchitectures
  - Simulation tools and emulation systems
- Compilers:
  - Extraction of instruction-level parallelism
  - Aggressive and speculative code scheduling
  - Object code translation and optimization
















#### Lecture 1: "Introduction to Computer Architecture"

#### C. "Economics" of Computer Architecture

- a. Amdahl's Law and Gustafson's Law
- b. Moore's Law and Bell's Law




#### "Economics" of Computer Architecture

- Exercise in engineering tradeoff analysis
  - Find the fastest/cheapest/power-efficient/etc. solution
  - Optimization problem with 10s to 100s of variables
- All the variables are changing
  - At non-uniform rates
  - With inflection points
  - Only one guarantee: Today's right answer will be wrong tomorrow
- Two persistent high-level "forcing functions":
  - Application Demand (PROGRAM)
  - Technology Supply (MACHINE)


#### Four Foundational "Laws" of Computer Architecture

#### Application Demand (PROGRAM)

- Amdahl's Law (1967)
  - Speedup through parallelism is limited by the sequential bottleneck
- Gustafson's Law (1988)
  - With unlimited data set size, parallelism speedup can be unlimited

#### Technology Supply (MACHINE)

- Moore's Law (1965)
  - (Transistors/Die) increases by 2x every 18 months
- Bell's Law (1971)
  - (Cost/Computer) decreases by 2x every 36 months


#### Amdahl's Law

- Speedup = (Execution time on a single CPU) / (Execution time on N parallel processors)
  - $t_s/t_p$ (serial time is for the **best** serial algorithm)
- h = fraction of time in serial code (h = 1 - f)
- f = fraction that is vectorizable or parallelizable
- N = max speedup for f
- Overall speedup:

$$Speedup = \frac{1}{(1-f) + \frac{f}{N}}$$
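A minimal sketch of the formula, using the slide's symbols f and N. The function name `amdahl_speedup` and the sample value f = 0.9 are assumptions made for illustration.

```python
def amdahl_speedup(f: float, n: float) -> float:
    """Overall speedup when a fraction f of the work is sped up by a factor n."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 / ((1.0 - f) + f / n)


# With f = 0.9, the serial 10% caps the speedup near 1 / (1 - f) = 10,
# no matter how large n becomes.
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

The loop shows the diminishing return: each increase in n buys less, because the (1 - f) term in the denominator does not shrink.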


#### Amdahl's Law Illustrated

- Speedup = time<sub>without enhancement</sub> / time<sub>with enhancement</sub>
- If an enhancement speeds up a fraction f of a task by a factor of N:
  - time<sub>new</sub> = time<sub>orig</sub>·((1-f) + f/N)
  - S<sub>overall</sub> = 1 / ((1-f) + f/N)

[Figure: time<sub>orig</sub> and time<sub>new</sub> bars, each showing the (1 - f) portion]



#### From Amdahl's Law to Gustafson's Law

- Amdahl's Law works on a fixed problem size
  - This is reasonable if your only goal is to solve a problem faster.
  - What if you also want to solve a larger problem?
    - Gustafson's Law (Scaled Speedup)
- Gustafson's Law is derived by fixing the parallel execution time (Amdahl fixed the problem size -> fixed serial execution time)
  - For many practical situations, Gustafson's law makes more sense
    - Have a bigger computer, solve a bigger problem.
- "Amdahl's Law turns out to be too pessimistic for high-performance computing."


#### Gustafson's Law

- Fix the execution time of the computation on a single processor as
  - s + p = serial part + parallelizable part = 1
- Speedup(N) = $(s + p)/(s + p/N) = 1/(s + (1 - s)/N) = 1/((1-p) + p/N)$, which is Amdahl's law
- Now let 1 = (a + b) = execution time of the computation on N processors (fixed),
   where a = sequential time and b = parallel time on any of the N processors
  - Time for the same computation on a single (sequential) processor = $a + (b \times N)$, so Speedup = $(a + b \times N)/(a + b)$
  - Let $\alpha = a/(a+b)$ be the sequential fraction of the parallel execution time
  - Speedup<sub>scaled</sub>(N) = $(a + b \times N)/(a + b) = a/(a+b) + (b \times N)/(a+b) = \alpha + (1-\alpha)N$
  - If $\alpha$ is very small, the scaled speedup is approximately N, i.e. linear speedup.
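The two speedup expressions on this slide can be compared side by side. This is a sketch under stated assumptions: the function names are hypothetical, and the serial fraction 0.05 is an arbitrary sample value. Note that the two fractions are measured against different baselines (single-processor time for Amdahl, N-processor time for Gustafson), so the comparison is illustrative rather than like-for-like.

```python
def amdahl_speedup(s: float, n: int) -> float:
    """Fixed-size speedup; s is the serial fraction of the single-processor time."""
    return 1.0 / (s + (1.0 - s) / n)


def gustafson_speedup(alpha: float, n: int) -> float:
    """Scaled speedup; alpha is the serial fraction of the N-processor time."""
    return alpha + (1.0 - alpha) * n


for n in (4, 64, 1024):
    print(n,
          round(amdahl_speedup(0.05, n), 1),
          round(gustafson_speedup(0.05, n), 1))
```

With the problem size fixed, the speedup saturates below 1/s = 20; with the problem scaled to fill the machine, it keeps growing almost linearly in N.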




#### The Two "Gordon" Laws of Computer Architecture

#### Gordon Moore's Law (1965)

- (Transistors/Die) increases by 2X every 18 months
- Constant price, increasing performance
- Has held for 40+ years, and will continue to hold

#### Gordon Bell's Law (1971)

- (Cost/Computer) decreases by 2X every 36 months (~ 10X per decade)
- Constant performance, decreasing price
- Corollary of Moore's Law, creation of new computer categories
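The "~10X per decade" figure follows from the 36-month halving period, and the same arithmetic applies to the 18-month doubling period. A small sketch, with the hypothetical helper name `factor_per_decade`:

```python
def factor_per_decade(months_per_doubling: float) -> float:
    """Factor accumulated over 10 years of steady doubling (or halving)."""
    return 2 ** (120 / months_per_doubling)


moore = factor_per_decade(18)  # transistors per die grow by about this factor
bell = factor_per_decade(36)   # cost per computer shrinks by about this factor
print(f"Moore: ~{moore:.0f}x per decade, Bell: ~{bell:.1f}x per decade")
```

Under these stated rates, cost falls by roughly 10X per decade while transistor count per die rises by roughly 100X.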

"In a decade you can buy a computer for less than its sales tax today." – Jim Gray We have all been living on this exponential curve and assume it...












