COSC 6385
Computer Architecture
Introduction and Organizational Issues
Edgar Gabriel
Spring 2016
Organizational issues (I)
• Classes:
– Tuesday, 2.30pm – 4.00pm, SEC 202
– Thursday, 2.30pm – 4.00pm, SEC 202
• Evaluation as planned right now
– 1 homework: 25%
– 3 quizzes: 75% (25% each)
• In case of questions:
– email: [email protected]
– Tel: (713) 743 3358
– Office hours: PGH 524, Monday, 11am-11.45am or by appointment
• All slides available on the website:
– http://www.cs.uh.edu/~gabriel/courses/cosc6385_s16/
– Videos of some lectures will be posted on the course web page
Organizational Issues (III)
• TAs for the course:
– Youcef Barigou, email: [email protected]
• Dates for the quizzes:
– 1st quiz: Thursday, Feb 18
– 2nd quiz: Thursday, March 24
– 3rd quiz: Thursday, April 28
• Homework
– Announced: Tuesday, Feb 16
– Due on: Tuesday, March 8
Contents
• Textbook:
John L. Hennessy,
David A. Patterson
“Computer Architecture –
A Quantitative Approach”
5th Edition
Morgan Kaufmann Publishers
Contents (II)
• Most of chapters 1 – 5
– Memory Hierarchy Design
– Instruction Level Parallelism
– Data Level Parallelism
– Thread Level Parallelism
• Appendix A, B, C
– Instruction Set Principles
– Review of Memory Hierarchies
– Pipelining
• Selected sections regarding storage systems
• Selected literature on multi-core processors
Why learn about Computer Architecture?
    for (i=0; i<n; i++ ) {
        c[i] = a[i] + b[i];
    }
• Every loop iteration requires 3 memory operations
– 2 loads
– 1 store
• For a microprocessor with a frequency of 2 GHz, this loop requires a memory bandwidth of
$3 \cdot 4\,\text{Bytes} \cdot 2 \cdot 10^{9}\,\text{s}^{-1} = 24\,\text{GBytes/s}$
to satisfy one Floating Point Unit (FPU)
• Most modern processors have 2 FPUs and two or more Integer Units which could work in parallel
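The bandwidth estimate for the loop above can be sketched as a small helper. This is only an illustration: the function and parameter names are hypothetical, and it assumes 4-byte operands and one loop iteration per clock cycle, which is what the factors 3, 4 and 2 * 10^9 on the slide imply.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes per second the memory system must deliver so that a loop with
 * mem_ops_per_iter loads/stores of bytes_per_op bytes each can complete
 * iters_per_sec iterations every second.
 * Names are illustrative and do not come from the slides. */
double required_bandwidth(int mem_ops_per_iter, size_t bytes_per_op,
                          double iters_per_sec)
{
    assert(mem_ops_per_iter >= 0 && iters_per_sec >= 0.0);
    return (double)mem_ops_per_iter * (double)bytes_per_op * iters_per_sec;
}
```

With the slide's numbers, `required_bandwidth(3, 4, 2e9)` evaluates to 2.4e10 bytes/s, i.e. 24 GBytes/s for a single FPU.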
Memory technology (www.kingston.com/newtech)
• Memory Bandwidth
$SB_{max} = SB_{Bus} \cdot f_{Bus} \cdot \text{Op}/\text{Cycle}$
with
– $SB_{max}$: max. memory bandwidth
– $SB_{Bus}$: bandwidth of the memory bus (64 Bit = 8 Bytes)
– $f_{Bus}$: frequency of the memory bus
Memory bandwidth
Name | Frequency of memory bus (MHz) | Max. bandwidth
PC100 SDRAM | 100 | 800 MB/s
PC133 SDRAM | 133 | 1.1 GB/s
PC1600 DDR | 100 | 1.6 GB/s
PC2100 DDR | 133 | 2.1 GB/s
PC2700 DDR | 166 | 2.7 GB/s
PC3200 DDR | 200 | 3.2 GB/s
PC3700 DDR | 233 | 3.7 GB/s
PC4200 DDR | 266 | 4.2 GB/s
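The rows of this table follow from the bandwidth formula (bus width times bus frequency times operations per cycle). A minimal sketch, assuming one transfer per bus cycle for SDRAM and two for DDR; the function name and the choice of MB/s (1 MB = 10^6 bytes) as the unit are illustrative:

```c
#include <assert.h>

/* SB_max = SB_Bus * f_Bus * Op/Cycle
 * bus_bytes:     width of the memory bus in bytes (8 for a 64-bit bus)
 * bus_mhz:       frequency of the memory bus in MHz
 * ops_per_cycle: transfers per bus cycle (assumed 1 for SDRAM, 2 for DDR)
 * Returns the peak bandwidth in MB/s. */
double peak_bandwidth_mbs(int bus_bytes, double bus_mhz, int ops_per_cycle)
{
    assert(bus_bytes > 0 && bus_mhz > 0.0 && ops_per_cycle > 0);
    return bus_bytes * bus_mhz * ops_per_cycle;
}
```

For example, `peak_bandwidth_mbs(8, 100.0, 1)` gives 800 MB/s (PC100 SDRAM) and `peak_bandwidth_mbs(8, 200.0, 2)` gives 3200 MB/s (PC3200 DDR).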
Memory modules (cont.)
• DDR2 and DDR3: further evolution of the DDR technology
– DDR3-xxx: denotes data transfer rate
– PC3-xxx: denotes theoretical bandwidth
Name | Frequency of memory bus (MHz) | Data rate (MT/s) | Peak transfer rate (GB/s)
DDR3-800 | 100 | 800 | 6.4
DDR3-1066 | 133 | 1066 | 8.5
DDR3-1333 | 166 | 1333 | 10.6
DDR3-1600 | 200 | 1600 | 12.8
DDR3-1866 | 233 | 1866 | 14.9
DDR3-2133 | 266 | 2133 | 17.0
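The relation between the two naming schemes (data rate versus theoretical bandwidth) can be sketched as follows. The function name is hypothetical, and the 8-byte width is an assumption carried over from the 64-bit memory bus mentioned earlier:

```c
#include <assert.h>

/* Peak transfer rate in MB/s for a module with the given data rate in
 * mega-transfers per second, assuming each transfer moves 8 bytes
 * (64-bit module data bus). */
unsigned peak_transfer_mbs(unsigned data_rate_mts)
{
    const unsigned bus_bytes = 8; /* assumed 64-bit bus */
    assert(data_rate_mts > 0);
    return data_rate_mts * bus_bytes;
}
```

For DDR3-800 this yields 6400 MB/s, i.e. the 6.4 GB/s in the first row, which is why such a module is also labeled PC3-6400.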
Memory hierarchies
Level | Size | Access time [cycles]
Backup (tape) | TB, PB, EB |
Primary data storage (disk) | ~ 1 TB | > $10^{6}$
Main memory | ~ 1-8 GB | 100 - 1000
Caches | ~ 1-32 MB | 2 - 50
Registers | < 256 words | 1 - 2
Memory hierarchies
• Do I have to care about memory hierarchies?
• Example: Matrix-multiply of two dense matrices
– “Trivial” code
    for ( i=0; i<dim; i++ ) {
        for ( j=0; j<dim; j++ ) {
            for ( k=0; k<dim; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
Matrix-multiply
• Performance of the trivial implementation on a 2.2 GHz AMD Opteron with 2 GB main memory and 1 MB 2nd level cache
Matrix dimension | Execution time [sec] | Performance [MFLOPS]
256x256 | 0.118 | 284
512x512 | 2.05 | 130
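The MFLOPS column can be reproduced from the execution time, since a dim x dim matrix multiply performs one multiplication and one addition per innermost iteration. A minimal sketch with a hypothetical function name:

```c
#include <assert.h>

/* A dim x dim matrix multiply executes dim^3 multiply-add pairs,
 * i.e. 2 * dim^3 floating point operations. Returns the achieved
 * rate in millions of floating point operations per second. */
double matmul_mflops(int dim, double seconds)
{
    assert(dim > 0 && seconds > 0.0);
    double flops = 2.0 * dim * dim * dim;
    return flops / seconds / 1.0e6;
}
```

`matmul_mflops(256, 0.118)` comes out at roughly 284 and `matmul_mflops(512, 2.05)` at roughly 131, in line with the measured values in the table.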
Matrix-multiply (II)
• Peak floating point performance of the processor
$2 \cdot (2.2 \cdot 10^{9})$ floating point operations/sec $= 4.4 \cdot 10^{9}$ FLOPS $= 4.4$ GFLOPS
– 2: number of floating point units
– $2.2 \cdot 10^{9}$: frequency of the processor, assuming that each FPU can finish one operation per cycle
– 4.4 GFLOPS: theoretical floating point peak performance of the processor
• Where are the missing FLOPS between theoretical peak and achieved performance?
– Memory wait time
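To put the gap into numbers, the peak-performance formula and the achieved fraction of peak can be sketched as two small helpers. The function names are illustrative; the one-operation-per-cycle figure is the slide's own assumption:

```c
#include <assert.h>

/* Theoretical peak in floating point operations per second:
 * number of FPUs * clock frequency * operations per FPU per cycle. */
double peak_flops(int num_fpus, double freq_hz, double ops_per_cycle)
{
    assert(num_fpus > 0 && freq_hz > 0.0 && ops_per_cycle > 0.0);
    return num_fpus * freq_hz * ops_per_cycle;
}

/* Achieved performance as a fraction of the theoretical peak. */
double fraction_of_peak(double achieved_flops, double peak)
{
    assert(peak > 0.0);
    return achieved_flops / peak;
}
```

With `peak_flops(2, 2.2e9, 1.0)` as the peak, the 284 MFLOPS measured for the 256x256 trivial code correspond to roughly 6.5% of peak, and the 130 MFLOPS for 512x512 to roughly 3%.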
Blocked code
    /* assumes dim is a multiple of block */
    for ( i=0; i<dim; i+=block ) {
        for ( j=0; j<dim; j+=block ) {
            for ( k=0; k<dim; k+=block) {
                for (ii=i; ii<(i+block); ii++) {
                    for (jj=j; jj<(j+block); jj++) {
                        for (kk=k; kk<(k+block); kk++) {
                            c[ii][jj] += a[ii][kk] * b[kk][jj];
                        }
                    }
                }
            }
        }
    }
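A self-contained variant of both loop nests is sketched below so that the two versions can be compared directly. It is not the code used for the measurements: it assumes `double` elements stored row-major in flat arrays, the function names are hypothetical, and the tile bounds are clamped to `dim` so that dimensions that are not a multiple of the block size also work.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: c += a * b for row-major dim x dim matrices. */
void matmul_trivial(size_t dim, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < dim; i++)
        for (size_t j = 0; j < dim; j++)
            for (size_t k = 0; k < dim; k++)
                c[i * dim + j] += a[i * dim + k] * b[k * dim + j];
}

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Blocked version: same arithmetic, but the three inner loops work on
 * tiles of at most block x block elements, so data loaded into the
 * cache is reused before it is evicted. */
void matmul_blocked(size_t dim, size_t block,
                    const double *a, const double *b, double *c)
{
    assert(block > 0);
    for (size_t i = 0; i < dim; i += block)
        for (size_t j = 0; j < dim; j += block)
            for (size_t k = 0; k < dim; k += block) {
                /* Clamp tile ends for dimensions not divisible by block. */
                size_t i_end = min_size(i + block, dim);
                size_t j_end = min_size(j + block, dim);
                size_t k_end = min_size(k + block, dim);
                for (size_t ii = i; ii < i_end; ii++)
                    for (size_t jj = j; jj < j_end; jj++)
                        for (size_t kk = k; kk < k_end; kk++)
                            c[ii * dim + jj] +=
                                a[ii * dim + kk] * b[kk * dim + jj];
            }
}
```

Both functions accumulate into `c`, so the caller is expected to zero it first. For every element of `c` the products are added in the same order of `k` in both versions, so the results should match exactly, not just up to rounding.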
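Performance of the blocked code
Matrix dimension | Block | Execution time [sec] | Performance [MFLOPS] | “Trivial” [MFLOPS]
256x256 | 4 | 0.065 | 513 | 284
256x256 | 8 | 0.046 | 726 | 284
256x256 | 16 | 0.051 | 657 | 284
256x256 | 32 | 0.043 | 777 | 284
256x256 | 64 | 0.049 | 677 | 284
256x256 | 128 | 0.113 | 296 | 284
512x512 | 4 | 0.686 | 391 | 130
512x512 | 8 | 0.422 | 635 | 130
512x512 | 16 | 0.447 | 599 | 130
512x512 | 32 | 0.501 | 535 | 130
512x512 | 64 | 1.00 | 266 | 130
512x512 | 128 | 0.994 | 269 | 130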
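Since the best block size differs between the two matrix dimensions, it is usually determined by measurement. A minimal sketch of picking the fastest configuration from such measurements; the struct and function names are hypothetical, and the sample data are the 256x256 MFLOPS values from the table above:

```c
#include <assert.h>
#include <stddef.h>

/* One measured configuration: block size and achieved MFLOPS. */
struct measurement {
    int block;
    double mflops;
};

/* Returns the index of the entry with the highest MFLOPS rate. */
size_t best_block_index(const struct measurement *m, size_t count)
{
    assert(m != NULL && count > 0);
    size_t best = 0;
    for (size_t i = 1; i < count; i++)
        if (m[i].mflops > m[best].mflops)
            best = i;
    return best;
}

/* Measured rates for the 256x256 case, copied from the table. */
static const struct measurement runs_256[] = {
    {4, 513.0}, {8, 726.0}, {16, 657.0},
    {32, 777.0}, {64, 677.0}, {128, 296.0}
};
```

For `runs_256` the function selects the entry with block size 32 (777 MFLOPS), about 2.7 times the 284 MFLOPS of the trivial code.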
Top 500 List
Trends: Cores and Threads per Chip
Source: SICS Multicore Day’ 14
Slide source: Andy Semin ‘Intel processors and platforms roadmap for energy efficient HPC solutions’
http://academy.hpc-russia.ru/files/intel_hpc_public.pptx
“Big Core” – “Small Core”
Intel® Xeon® Processor | Intel® Xeon Phi™ Coprocessor
Simply aggregating more cores generation after generation is not sufficient | Optimized for highest compute per watt
Performance per core/thread must increase each generation, be as fast as possible | Willing to trade performance per core/thread for aggregate performance
Power envelopes should stay flat or go down each generation | Power envelopes should also stay flat or go down every generation
Balanced platform (Memory, I/O, Compute) | Optimized for highly parallel workloads
Cores, Threads, Caches, SIMD | Cores, Threads, Caches, SIMD
Different Optimization Points
Common Programming Models and Architectural Elements
For illustration only
Slide source: Andy Semin ‘Intel processors and platforms roadmap for energy efficient HPC solutions’
http://academy.hpc-russia.ru/files/intel_hpc_public.pptx
Slide source: Jeremy Purches. ‘Inside Kepler’
https://eventbooking.stfc.ac.uk/uploads/mew23/purches.pdf
Slide source: A. Ramirez et al., ‘Are mobile processors ready for HPC?’
http://www.montblanc-project.eu/sites/default/files/publications/Are%20mobile%20processors%20ready%20for%20HPC.pdf
Slide source: ‘ARM Processors and Architectures’
https://www.arm.com/files/ppt/ARM_Processors_and_Architectures_-_Uni_Program_.pptx