cosc 6385 computer architecture introduction and ...gabriel/courses/cosc6385_s16/ca_01_intro.pdf ·...


COSC 6385

Computer Architecture

Introduction and Organizational Issues

Edgar Gabriel

Spring 2016

Organizational issues (I)

• Classes:

– Tuesday, 2.30pm – 4.00pm, SEC 202

– Thursday, 2.30pm – 4.00pm, SEC 202

• Evaluation as planned right now

– 1 homework: 25%

– 3 quizzes: 75% (25% each)

• In case of questions:

– email: [email protected]

– Tel: (713) 743 3358

– Office hours: PGH 524, Monday, 11am-11.45am or by appointment

• All slides available on the website:

– http://www.cs.uh.edu/~gabriel/courses/cosc6385_s16/

– Videos of some lectures will be posted on the course web page


Organizational Issues (III)

• TA for the course:

– Youcef Barigou, email: [email protected]

• Dates for the quizzes:

– 1st quiz: Thursday, Feb 18

– 2nd quiz: Thursday, March 24

– 3rd quiz: Thursday, April 28

• Homework

– Announced: Tuesday, Feb 16

– Due on: Tuesday, March 8

Contents

• Textbook:

John L. Hennessy, David A. Patterson

“Computer Architecture – A Quantitative Approach”

5th Edition, Morgan Kaufmann Publishers


Contents (II)

• Most of chapters 1 – 5

– Memory Hierarchy Design

– Instruction Level Parallelism

– Data Level Parallelism

– Thread Level Parallelism

• Appendix A, B, C

– Instruction Set Principles

– Review of Memory Hierarchies

– Pipelining

• Selected sections regarding storage systems

• Selected literature on multi-core processors

Why learn about Computer Architecture?

for (i=0; i<n; i++ ) {
    c[i] = a[i] + b[i];
}

• Every loop iteration requires 3 memory operations

– 2 loads

– 1 store

• For a microprocessor with a frequency of 2 GHz, this loop requires

$3 \times 4\,\text{Bytes} \times 2 \times 10^{9}\,\text{s}^{-1} = 24\,\text{GBytes/s}$

to satisfy one Floating Point Unit (FPU), i.e. 3 memory operations of 4 Bytes each in every cycle

• Most modern processors have 2 FPUs and two or more Integer Units which could work in parallel


Memory technology (www.kingston.com/newtech)

• Memory Bandwidth

$SB_{max} = SB_{BUS} \cdot f_{BUS} \cdot Op/Cycle$

with

$SB_{max}$: max. memory bandwidth

$SB_{BUS}$: Bandwidth of the memory bus (64 Bit = 8 Bytes)

$f_{BUS}$: Frequency of the memory bus

Memory bandwidth

Name          Frequency of memory bus (MHz)   max. bandwidth
PC100 SDRAM   100                             800 MB/s
PC133 SDRAM   133                             1.1 GB/s
PC1600 DDR    100                             1.6 GB/s
PC2100 DDR    133                             2.1 GB/s
PC2700 DDR    166                             2.7 GB/s
PC3200 DDR    200                             3.2 GB/s
PC3700 DDR    233                             3.7 GB/s
PC4200 DDR    266                             4.2 GB/s


Memory modules (cont.)

• DDR2 and DDR3: further evolution of the DDR technology

– DDR3-xxx: denotes data transfer rate

– PC3-xxx: denotes theoretical bandwidth

Name        Frequency of memory bus (MHz)   Data rate (MT/s)   Peak transfer rate (GB/s)
DDR3-800    100                             800                6.4
DDR3-1066   133                             1066               8.5
DDR3-1333   166                             1333               10.6
DDR3-1600   200                             1600               12.8
DDR3-1866   233                             1866               14.9
DDR3-2133   266                             2133               17.0
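The relation between the columns can be written out for the first entry as a worked example. It assumes the same 64-bit (8 Byte) memory bus as before, and that DDR3 delivers 8 transfers per memory clock cycle, which is consistent with the ratio of data rate to frequency in the table.

```latex
\begin{align*}
\text{Data rate} &= 8 \times 100\,\text{MHz} = 800\,\text{MT/s} \\
\text{Peak transfer rate} &= 800\,\text{MT/s} \times 8\,\text{Bytes} = 6400\,\text{MB/s} = 6.4\,\text{GB/s}
\end{align*}
```

Under this reading, DDR3-800 denotes the data rate and the matching module name PC3-6400 denotes the theoretical bandwidth in MB/s.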

Memory hierarchies

Level                         Size          Access time [cycles]
Backup (tape)                 TB, PB, EB
Primary data storage (disk)   ~ 1 TB        > $10^{6}$
Main memory                   ~ 1-8 GB      100 - 1000
Caches                        ~ 1-32 MB     2 - 50
Register                      < 256 Words   1 - 2


Memory hierarchies

• Do I have to care about memory hierarchies?

• Example: Matrix-multiply of two dense matrices

– “Trivial” code

for ( i=0; i<dim; i++ ) {
    for ( j=0; j<dim; j++ ) {
        for ( k=0; k<dim; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}

Matrix-multiply

• Performance of the trivial implementation on a 2.2 GHz AMD Opteron with 2 GB main memory and 1 MB 2nd-level cache

Matrix dimension   Execution time [sec]   Performance [MFLOPS]
256x256            0.118                  284
512x512            2.05                   130


Matrix-multiply (II)

• Peak floating point performance of the processor

$2 \times (2.2 \times 10^{9})$ floating point operations/sec $= 4.4 \times 10^{9}$ FLOPS $= 4.4$ GFLOPS

– 2: number of floating point units

– $2.2 \times 10^{9}$: frequency of the processor, assuming that each FPU can finish one operation per cycle

– 4.4 GFLOPS: theoretical floating point peak performance of the processor

• Where are the missing FLOPS between theoretical peak and achieved performance?

– Memory wait time
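To put the gap in numbers, the measured performance of the trivial code (284 MFLOPS for 256x256, 130 MFLOPS for 512x512) can be related to the 4.4 GFLOPS peak:

```latex
\begin{align*}
\frac{284\,\text{MFLOPS}}{4400\,\text{MFLOPS}} &\approx 6.5\,\% \\
\frac{130\,\text{MFLOPS}}{4400\,\text{MFLOPS}} &\approx 3.0\,\%
\end{align*}
```

The trivial implementation thus reaches well under a tenth of the theoretical peak on this processor.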

Blocked code

for ( i=0; i<dim; i+=block ) {
    for ( j=0; j<dim; j+=block ) {
        for ( k=0; k<dim; k+=block) {
            for (ii=i; ii<(i+block); ii++) {
                for (jj=j; jj<(j+block); jj++) {
                    for (kk=k; kk<(k+block);kk++) {
                        c[ii][jj] += a[ii][kk] * b[kk][jj];
                    }
                }
            }
        }
    }
}


Performance of the blocked code

Matrix dimension   Block   Execution time [sec]   Performance [MFLOPS]   “trivial” [MFLOPS]
256x256            4       0.065                  513                    284
                   8       0.046                  726
                   16      0.051                  657
                   32      0.043                  777
                   64      0.049                  677
                   128     0.113                  296
512x512            4       0.686                  391                    130
                   8       0.422                  635
                   16      0.447                  599
                   32      0.501                  535
                   64      1.00                   266
                   128     0.994                  269


Top 500 List


Trends: Cores and Threads per Chip


Source: SICS Multicore Day ’14

Slide source: Andy Semin ‘Intel processors and platforms roadmap for energy efficient HPC solutions’

http://academy.hpc-russia.ru/files/intel_hpc_public.pptx



Slide source: Andy Semin ‘Intel processors and platforms roadmap for energy efficient HPC solutions’














„Big Core“ – „Small Core“

Intel® Xeon® Processor

• Simply aggregating more cores generation after generation is not sufficient

• Performance per core/thread must increase each generation, be as fast as possible

• Power envelopes should stay flat or go down each generation

• Balanced platform (Memory, I/O, Compute)

• Cores, Threads, Caches, SIMD

Intel® Xeon Phi™ Coprocessor

• Optimized for highest compute per watt

• Willing to trade performance per core/thread for aggregate performance

• Power envelopes should also stay flat or go down every generation

• Optimized for highly parallel workloads

• Cores, Threads, Caches, SIMD

Different Optimization Points

Common Programming Models and Architectural Elements

For illustration only













Slide source: Jeremy Purches. ‘Inside Kepler’

https://eventbooking.stfc.ac.uk/uploads/mew23/purches.pdf

Slide source: A. Ramirez et al., ‘Are mobile processors ready for HPC?’

http://www.montblanc-project.eu/sites/default/files/publications/Are%20mobile%20processors%20ready%20for%20HPC.pdf




Slide source: A. Ramirez et al., ‘Are mobile processors ready for HPC?’

http://www.montblanc-project.eu/sites/default/files/publications/Are%20mobile%20processors%20ready%20for%20HPC.pdf

Slide source: ‘ARM Processors and Architectures’

https://www.arm.com/files/ppt/ARM_Processors_and_Architectures_-_Uni_Program_.pptx









