

COSC 6385

Computer Architecture

Introduction and Organizational Issues

Edgar Gabriel

Spring 2016

Organizational issues (I)

• Classes:

– Tuesday, 2.30pm – 4.00pm, SEC 202

– Thursday, 2.30pm – 4.00pm, SEC 202

• Evaluation (as planned right now):

– 1 homework: 25%

– 3 quizzes: 75% (25% each)

• In case of questions:

– email: [email protected]

– Tel: (713) 743 3358

– Office hours: PGH 524, Monday, 11am-11.45am or by appointment

• All slides available on the website:

– http://www.cs.uh.edu/~gabriel/courses/cosc6385_s16/

– Videos of some lectures will be posted on the course web page


Organizational Issues (III)

• TAs for the course:

– Youcef Barigou, email: [email protected]

• Dates for the quizzes:

– 1st quiz: Thursday, Feb 18

– 2nd quiz: Thursday, March 24

– 3rd quiz: Thursday, April 28

• Homework

– Announced: Tuesday, Feb 16

– Due on: Tuesday, March 8

Contents

• Textbook:

  John L. Hennessy, David A. Patterson:
  “Computer Architecture – A Quantitative Approach”,
  5th Edition, Morgan Kaufmann Publishers


Contents (II)

• Most of chapters 1 – 5

– Memory Hierarchy Design

– Instruction Level Parallelism

– Data Level Parallelism

– Thread Level Parallelism

• Appendix A, B, C

– Instruction Set Principles

– Review of Memory Hierarchies

– Pipelining

• Selected sections regarding storage systems

• Selected literature on multi-core processors

Why learn about Computer Architecture?

for ( i=0; i<n; i++ ) {
    c[i] = a[i] + b[i];
}

• Every loop iteration requires 3 memory operations

  – 2 loads

  – 1 store

• For a micro-processor running at a frequency of 2 GHz, this loop requires

      3 * 4 Bytes * 2 * 10^9 1/s = 24 GBytes/s

  of memory bandwidth to satisfy one Floating Point Unit (FPU)

• Most modern processors have 2 FPUs and two or more Integer Units,
  which could work in parallel
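To make this demand concrete, here is a minimal sketch (my illustration, not part of the slides) that times the vector add over arrays far larger than any cache and reports the sustained bandwidth; the array size N, the float element type, and the use of gettimeofday() are assumptions chosen for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define N (16 * 1024 * 1024)   /* 64 MB per array: far larger than any cache */

int main(void) {
    float *a = malloc(N * sizeof(float));
    float *b = malloc(N * sizeof(float));
    float *c = malloc(N * sizeof(float));
    struct timeval t0, t1;
    int i;

    for (i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; c[i] = 0.0f; }

    gettimeofday(&t0, NULL);
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];             /* 2 loads + 1 store per iteration */
    gettimeofday(&t1, NULL);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    double bytes = 3.0 * N * sizeof(float);  /* 3 memory ops * 4 Bytes each */
    printf("c[0] = %.1f, sustained bandwidth: %.2f GB/s\n",
           c[0], bytes / sec / 1e9);    /* printing c[0] keeps the loop live */

    free(a); free(b); free(c);
    return 0;
}

Measuring well below 24 GB/s here is exactly the point of the slide: the memory system, not the FPU, limits this loop.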


Memory technology (www.kingston.com/newtech)

• Memory bandwidth:

      SB_max = SB_Bus * f_Bus * Op/Cycle

  with

      SB_max   : max. memory bandwidth
      SB_Bus   : bandwidth of the memory bus (64 bit = 8 Bytes)
      f_Bus    : frequency of the memory bus
      Op/Cycle : memory operations per bus cycle (1 for SDRAM, 2 for DDR)

Memory bandwidth

Name          Frequency of memory bus (MHz)   Max. bandwidth
PC100 SDRAM   100                              800 MB/s
PC133 SDRAM   133                              1.1 GB/s
PC1600 DDR    100                              1.6 GB/s
PC2100 DDR    133                              2.1 GB/s
PC2700 DDR    166                              2.7 GB/s
PC3200 DDR    200                              3.2 GB/s
PC3700 DDR    233                              3.7 GB/s
PC4200 DDR    266                              4.2 GB/s
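A quick worked check of the formula (my arithmetic): PC100 SDRAM delivers 8 Bytes * 100 MHz * 1 Op/Cycle = 800 MB/s, and PC3200 DDR delivers 8 Bytes * 200 MHz * 2 Op/Cycle = 3.2 GB/s, matching the table entries.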


Memory modules (cont.)

• DDR2 and DDR3: further evolution of the DDR technology

– DDR3-xxx: denotes data transfer rate

– PC3-xxx: denotes theoretical bandwidth

Name        Frequency of memory bus (MHz)   Data rate (MT/s)   Peak transfer rate (GB/s)
DDR3-800    100                              800                 6.4
DDR3-1066   133                             1066                 8.5
DDR3-1333   166                             1333                10.6
DDR3-1600   200                             1600                12.8
DDR3-1866   233                             1866                14.9
DDR3-2133   266                             2133                17.0
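As a worked example of the two naming schemes (my calculation, using the data above): DDR3-1333 transfers 1333 MT/s * 8 Bytes ≈ 10.6 GB/s, so the corresponding module is sold as PC3-10600.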

Memory hierarchies

Level                         Size          Access time [cycles]
Backup (tape)                 TB, PB, EB
Primary data storage (disk)   ~ 1 TB        > 10^6
Main memory                   ~ 1-8 GB      100 - 1000
Caches                        ~ 1-32 MB     2 - 50
Registers                     < 256 words   1 - 2
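The access times in this table can be made visible with a small sketch (my illustration, assuming nothing beyond standard C): chasing pointers through a randomly shuffled cyclic permutation defeats hardware prefetching, so the time per access jumps each time the working set outgrows a level of the hierarchy:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

/* Times a pointer chase through a random cyclic permutation of n
   indices.  Each load depends on the previous one, so the time per
   step approximates the access latency of whichever level of the
   hierarchy the working set fits into. */
static double ns_per_access(size_t n, long steps) {
    size_t *next = malloc(n * sizeof(size_t));
    size_t i, j, tmp;
    struct timeval t0, t1;
    volatile size_t p = 0;

    for (i = 0; i < n; i++) next[i] = i;
    /* Sattolo's algorithm: shuffle into a single n-element cycle */
    for (i = n - 1; i > 0; i--) {
        j = (size_t)rand() % i;
        tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    gettimeofday(&t0, NULL);
    for (long s = 0; s < steps; s++) p = next[p];
    gettimeofday(&t1, NULL);
    free(next);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    return sec / (double)steps * 1e9;   /* nanoseconds per access */
}

int main(void) {
    size_t n;
    /* 8 KB ... 256 MB working sets: registers aside, each cache
       boundary shows up as a jump in ns/access */
    for (n = 1024; n <= 32 * 1024 * 1024; n *= 8)
        printf("%10zu elements (%8zu KB): %6.1f ns/access\n",
               n, n * sizeof(size_t) / 1024, ns_per_access(n, 20000000L));
    return 0;
}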


Memory hierarchies

• Do I have to care about memory hierarchies?

• Example: Matrix-multiply of two dense matrices

– “Trivial” code

for ( i=0; i<dim; i++ ) {
    for ( j=0; j<dim; j++ ) {
        for ( k=0; k<dim; k++ ) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Matrix-multiply

• Performance of the trivial implementation on a 2.2 GHz AMD Opteron
  with 2 GB main memory and a 1 MB 2nd-level cache:

Matrix dimension   Execution time [sec]   Performance [MFLOPS]
256x256            0.118                  284
512x512            2.05                   130
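The MFLOPS column can be reproduced with a harness along these lines (my reconstruction, not the course's benchmark code; the flattened 1-D indexing and gettimeofday() are illustration choices). A dim x dim multiply performs 2 * dim^3 floating point operations, so MFLOPS = 2 * dim^3 / (time * 10^6):

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

int main(void) {
    int dim = 512;                        /* matrix dimension, as in the table */
    int i, j, k;
    double *a = malloc((size_t)dim * dim * sizeof(double));
    double *b = malloc((size_t)dim * dim * sizeof(double));
    double *c = calloc((size_t)dim * dim, sizeof(double));
    struct timeval t0, t1;

    for (i = 0; i < dim * dim; i++) { a[i] = 1.0; b[i] = 2.0; }

    gettimeofday(&t0, NULL);
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            for (k = 0; k < dim; k++)
                c[i*dim + j] += a[i*dim + k] * b[k*dim + j];
    gettimeofday(&t1, NULL);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
    double flop = 2.0 * dim * dim * dim;  /* one multiply and one add per k */
    printf("%dx%d: %.3f sec, %.0f MFLOPS (c[0] = %.0f)\n",
           dim, dim, sec, flop / sec / 1e6, c[0]);
    free(a); free(b); free(c);
    return 0;
}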


Matrix-multiply (II)

• Peak floating point performance of the processor:

      2 * (2.2 * 10^9) floating point operations/sec = 4.4 * 10^9 = 4.4 GFLOPS

  (the factor 2 is the number of floating point units, 2.2 * 10^9 the
  frequency of the processor; assuming that each FPU can finish one
  operation per cycle, this gives the theoretical floating point peak
  performance of the processor)

• Where are the missing FLOPS between theoretical peak and achieved
  performance?

  – Memory wait time

Blocked code

for ( i=0; i<dim; i+=block ) {
    for ( j=0; j<dim; j+=block ) {
        for ( k=0; k<dim; k+=block ) {
            for ( ii=i; ii<(i+block); ii++ ) {
                for ( jj=j; jj<(j+block); jj++ ) {
                    for ( kk=k; kk<(k+block); kk++ ) {
                        c[ii][jj] += a[ii][kk] * b[kk][jj];
                    }
                }
            }
        }
    }
}
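A note on picking the block size (a back-of-the-envelope estimate, assuming 8-byte doubles and the 1 MB 2nd-level cache mentioned earlier): the three sub-blocks touched in the inner loops occupy roughly 3 * block^2 * 8 Bytes, so blocks up to about 200x200 would just fit in 1 MB. The measurements that follow show the sweet spot well below that (block = 32 for 256x256, block = 8 for 512x512), since registers, TLB reach, and other data compete for the cache.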


Performance of the blocked code

Matrix      block   Execution time   Performance   "trivial"
dimension           [sec]            [MFLOPS]      [MFLOPS]
256x256       4     0.065            513           284
              8     0.046            726
             16     0.051            657
             32     0.043            777
             64     0.049            677
            128     0.113            296
512x512       4     0.686            391           130
              8     0.422            635
             16     0.447            599
             32     0.501            535
             64     1.00             266
            128     0.994            269


Top 500 List


Trends: Cores and Threads per Chip

[Charts omitted: cores and threads per chip across processor generations]

Source: SICS Multicore Day ’14
Slide source: Andy Semin, “Intel processors and platforms roadmap for energy efficient HPC solutions”,
http://academy.hpc-russia.ru/files/intel_hpc_public.pptx


„Big Core“ – „Small Core“

Intel® Xeon® Processor                         Intel® Xeon Phi™ Coprocessor
-------------------------------------------    -------------------------------------------
Simply aggregating more cores generation       Optimized for highest compute per watt
after generation is not sufficient
Performance per core/thread must increase      Willing to trade performance per core/
each generation, be as fast as possible        thread for aggregate performance
Power envelopes should stay flat or go         Power envelopes should also stay flat or
down each generation                           go down every generation
Balanced platform (Memory, I/O, Compute)       Optimized for highly parallel workloads
Cores, Threads, Caches, SIMD                   Cores, Threads, Caches, SIMD

Different Optimization Points,
Common Programming Models and Architectural Elements
(For illustration only)

Slide source: Andy Semin, “Intel processors and platforms roadmap for energy efficient HPC solutions”,
http://academy.hpc-russia.ru/files/intel_hpc_public.pptx
