TRANSCRIPT
Overview of Intel® Core 2 Architecture and Software Development Tools
June 2009
Overview of Architecture & Tools
We will discuss:
• What lecture materials are available
• What labs are available
• What target courses could be impacted
• Some high-level discussion of the underlying technology
Objectives
After completing this module, you will:
Be aware of and have access to several hours' worth of multi-core (MC) topics including Architecture, Compiler Technology, Profiling Technology, OpenMP, & Cache Effects
Be able to create exercises on how to avoid common threading hazards associated with some MC systems – such as poor cache utilization, false sharing and threading load imbalance
Be able to create exercises on how to use selected compiler directives & switches to improve behavior on each core
Be able to create exercises on how to take advantage of the VTune analyzer to quickly identify load imbalance issues, poor cache reuse and false sharing issues
Agenda
• Multi-core Motivation
• Tools Overview
• Taking advantage of Multi-core
• Taking advantage of parallelism within each core (SSEx)
• Avoiding Memory/Cache effects
Why is the Industry moving to Multi-core?
In order to increase performance and reduce power consumption: it is much more efficient to run several cores at a lower frequency than one single core at a much higher frequency.
Power and Frequency
[Figure: Power (W) vs. Frequency (GHz) curve for a single-core architecture, roughly 0–3.4 GHz. Dropping the frequency gives a large drop in power, and the lower frequency allows headroom for a 2nd core.]
Agenda
• Multi-core Motivation
• Tools Overview
• Taking advantage of Multi-core
• Taking advantage of parallelism within each core (SSEx)
• Avoiding Memory/Cache effects
Processor-independent optimizations
/Od   Disables optimizations
/O1   Optimizes for binary size and for speed: server code
/O2   Optimizes for speed (default): vectorization on Intel 64
/O3   Optimizes for data cache: loopy floating-point code
/Zi   Creates symbols for debugging
/Ob0  Turns off inlining, which can sometimes help the analysis tools do a more thorough job
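As a minimal illustration (the source file name is hypothetical), an optimized build that keeps symbols and disables inlining for analysis might look like this on the Windows command line:

  icl /O2 /Zi /Ob0 myapp.c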
AutoVectorization optimizations
/QaxSSE2      Optimizes for the Intel Pentium 4 and compatible Intel processors.
/QaxSSE3      Optimizes for the Intel(R) Core(TM) processor family with Streaming SIMD Extensions 3 (SSE3) instruction support.
/QaxSSE3_ATOM Can generate MOVBE instructions and can optimize for the Intel(R) Atom(TM) processor and Intel(R) Centrino(R) Atom(TM) Processor Technology.
/QaxSSSE3     Optimizes for the Intel(R) Core(TM)2 processor family with SSSE3.
/QaxSSE4.1    Optimizes for the Intel(R) 45nm Hi-k next generation Intel Core(TM) microarchitecture with support for SSE4 Vectorizing Compiler and Media Accelerator instructions.
/QaxSSE4.2    Can generate Intel(R) SSE4 Efficient Accelerated String and Text Processing instructions supported by Intel(R) Core(TM) i7 processors, as well as Intel(R) SSE4 Vectorizing Compiler and Media Accelerator, SSSE3, SSE3, SSE2, and SSE instructions, and can optimize for the Intel(R) Core(TM) processor family.
Intel has a long history of providing auto-vectorization switches: support for new processor instructions is added while backward support for older instructions is maintained.
Developers should keep an eye on new developments in order to leverage the power of the latest processors.
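As a hedged example (the file name is hypothetical), targeting SSE4.2 while keeping a baseline code path for other processors might look like:

  icl /O2 /QaxSSE4.2 myloops.c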
More Advanced optimizations
/Qipo      Interprocedural optimization performs a static, topological analysis of your application. With /Qipo (-ipo), the analysis spans all of your source files built with /Qipo (-ipo); in other words, code generation in module A can be improved by what is happening in module B. It may also enable other optimizations such as auto-parallelization and auto-vectorization.
/Qparallel Enables the auto-parallelizer to generate multi-threaded code for loops that can be safely executed in parallel.
/Qopenmp   Enables the compiler to generate multi-threaded code based on the OpenMP* directives.
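As a sketch (the file name is hypothetical), these switches can be combined on one command line; on Linux the equivalents are -ipo, -parallel and -openmp:

  icl /O2 /Qipo /Qparallel vectorsum.c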
Lab 1 - AutoParallelization
Objective: Use auto-parallelization on a simple code to gain experience with the compiler's auto-parallelization feature
Follow the VectorSum activity in the student lab doc
Try auto-parallel compilation on the lab called VectorSum
Extra credit: parallelize manually and see whether you can beat the auto-parallel option – see the OpenMP section for constructs to try, and the sketch below for a starting point
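A minimal OpenMP sketch for the extra credit (array names and size are illustrative, not taken from the lab code):

  #include <omp.h>
  #define N 1000000
  float a[N], b[N], c[N];

  void vector_sum(void)
  {
      /* Each thread adds a contiguous chunk of the index range */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          c[i] = a[i] + b[i];
  }

Build with /Qopenmp so the pragma takes effect.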
Parallel Studio to find where to parallelize
Parallel Studio will be used in several labs to find appropriate locations to add parallelism to the code. Parallel Amplifier specifically is used to find hotspot information – where in your code the application spends most of its time.
Parallel Amplifier does not require instrumenting your code in order to find hotspots, but compiling with symbol information (/Zi) is a good idea.
Compiling with /Ob0 turns off inlining and sometimes gives a more thorough analysis in Parallel Studio.
Parallel Amplifier Hotspots
What does hotspot analysis show?
What about drilling down?
The call stack
The call stack shows the callee/caller relationship among functions in the code
Found potential parallelism
Lab 2 – Mandelbrot Hotspot Analysis
Objective: Use sampling to find some parallelism in the Mandelbrot application
Follow the Mandelbrot activity called Mandelbrot Sampling in the student lab doc
Identify candidate loops that could be parallelized
Agenda
• Multi-core Motivation
• Tools Overview
• Taking advantage of Multi-core
  – High level overview – Intel® Core Architecture
• Taking advantage of parallelism within each core (SSEx)
• Avoiding Memory/Cache effects
Intel® Core 2 Architecture
Mobile Platform Optimized
• 1-4 Execution Cores
• 3/6 MB L2 Cache Sizes
• 64 Byte L2 cache line
• 64-bit
Desktop Platform Optimized
• 2-4 Execution Cores
• 2x3, 2x6 MB L2 Cache Sizes
• 64 Byte L2 cache line
• 64-bit
Server Platform Optimized
• 4 Execution Cores
• 2x6 MB L2 Caches
• 64 Byte L2 cache line
• DP/MP support
• 64-bit
This is a snapshot in time during Penryn, Yorkfield and Harpertown.
Software developers should know the number of cores, the cache line size and the cache sizes to tackle the Cache Effects materials.
Memory Hierarchy (approximate access latencies):
• L1 Cache: ~1 cycle
• L2 Cache: ~1–10 cycles
• Main Memory: ~100s of cycles
• Magnetic Disk: ~1000s of cycles
High Level Architectural view
[Diagram: Intel Core 2 Duo Processor and Intel Core 2 Quad Processor, each attached to Memory via a 64B cache line. Legend: A = Architectural State, E = Execution Engine & Interrupt, C = 2nd Level Cache, B = Bus Interface.]
The dual core has a shared cache; the quad core has both shared and separated caches.
Intel® Core™ Microarchitecture – Memory Sub-system
With a separated cache, when CPU1 and CPU2 touch the same data, the L2 cache line must be shipped between the caches over the Front Side Bus (FSB). Shipping an L2 cache line costs roughly half of an access to memory.
Intel® Core™ Microarchitecture – Memory Sub-system
Advantages of Shared Cache – using Advanced Smart Cache® Technology: because CPU1 and CPU2 share the L2, there is no need to ship the cache line over the Front Side Bus (FSB).
Intel® Core™ Microarchitecture – Memory Sub-system
False Sharing: a performance issue in programs where cores write to different memory addresses BUT in the same cache line. Known as ping-ponging – the cache line is shipped back and forth between the cores.
[Diagram: over time, Core 0 writes X[0] while Core 1 writes X[1]; since X[0] and X[1] sit in one cache line, each write invalidates the other core's copy and the line bounces between the cores.]
False sharing is not an issue with a shared cache; it is an issue with separated caches.
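A minimal C sketch of the hazard and the usual fix (names, the two-thread count and the pad size are illustrative; the pad should match the 64-byte cache line noted above):

  #include <omp.h>

  /* Both counters share one 64-byte cache line: writes ping-pong */
  long hot[2];

  /* Padding gives each counter its own cache line */
  struct padded { long count; char pad[64 - sizeof(long)]; };
  struct padded cold[2];

  void tally(long n)
  {
      #pragma omp parallel num_threads(2)
      {
          int me = omp_get_thread_num();
          for (long i = 0; i < n; i++)
              cold[me].count++;  /* each core writes only its own line;
                                    incrementing hot[me] instead would
                                    demonstrate the ping-pong effect */
      }
  }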
Agenda
• Multi-core Motivation
• Tools Overview
• Taking advantage of Multi-core
• Taking advantage of parallelism within each core (SSEx)
• Avoiding Memory/Cache effects
Super Scalar Execution
[Diagram: multiple execution units within one core – FP, SIMD and INT.]
• Multiple execution units allow SIMD parallelism
• Many instructions can be retired in a clock cycle
• Multiple operations are executed within a single core at the same time
History of SSE Instructions
• Intel SSE (1999): 70 instructions – single-precision vectors, streaming operations
• Intel SSE2 (2000): 144 instructions – double-precision vectors; 8/16/32/64/128-bit vector integer
• Intel SSE3 (2004): 13 instructions – complex data
• Intel SSSE3 (2006): 32 instructions – decode
• Intel SSE4.1 (2007): 47 instructions – video accelerators, graphics building blocks, advanced vector instructions
Will be continued by:
• Intel SSE4.2 (XML processing, end 2008)
• See http://download.intel.com/technology/architecture/new-instructions-paper.pdf
Long history of new instructions; most require using packing & unpacking instructions.
SSE Data Types & Speedup Potential
• SSE: 4x floats
• SSE-2 through SSE-4: 2x doubles; 16x bytes, 8x 16-bit shorts, 4x 32-bit integers, 2x 64-bit integers, 1x 128-bit integer
Potential speedup (in the targeted loop) is roughly the same as the amount of packing, i.e. for floats, speedup ~ 4X.
Goal of SSE(x)
Scalar processing (traditional mode): one instruction produces one result – X + Y = one sum.
SIMD processing with SSE(2,3,4): one instruction produces multiple results – [x3 x2 x1 x0] + [y3 y2 y1 y0] = [x3+y3 x2+y2 x1+y1 x0+y0].
• Uses full width of XMM registers
• Many functional units
• Choice of many instructions
• Not all loops can be vectorized
• Can't vectorize most function calls
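For illustration, a hand-written intrinsics sketch of the four-wide add pictured above (function and variable names are hypothetical; the compiler's vectorizer emits equivalent code for vectorizable loops):

  #include <xmmintrin.h>  /* SSE intrinsics */

  /* Adds four pairs of floats with one ADDPS instruction */
  void add4(const float *x, const float *y, float *sum)
  {
      __m128 vx = _mm_loadu_ps(x);             /* load x0..x3 */
      __m128 vy = _mm_loadu_ps(y);             /* load y0..y3 */
      _mm_storeu_ps(sum, _mm_add_ps(vx, vy));  /* sum[i] = x[i] + y[i] */
  }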
Lab 3 – IPO assisted Vectorization
Objective: Explore how inlining a function can dramatically improve performance by allowing vectorization of a loop containing a function call (a sketch of the pattern appears below)
Open the SquareChargeCVectorizationIPO folder and use "nmake all" to build the project from the command line
To add switches to the make environment, use nmake all CF="/QxSSE3", for example
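A hypothetical two-file sketch of the pattern this lab explores (names are illustrative, not from the lab sources): with square() defined in a separate file, the call inside the loop blocks vectorization unless /Qipo lets the compiler inline across modules.

  /* square.c */
  float square(float v) { return v * v; }

  /* main.c */
  float square(float v);  /* defined in square.c */

  void square_all(const float *in, float *out, int n)
  {
      /* Without /Qipo this cross-module call blocks vectorization;
         with /Qipo the compiler can inline square() and vectorize */
      for (int i = 0; i < n; i++)
          out[i] = square(in[i]);
  }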
Agenda
• Multi-core Motivation
• Tools Overview
• Taking advantage of Multi-core
• Taking advantage of parallelism within each core (SSEx)
• Avoiding Memory/Cache effects
Cache effects
Cache effects can sometimes impact the speed of an application by as much as 10X or even 100X.
To take advantage of the cache hierarchy in your machine, you should use and re-use data already in cache as much as possible.
Avoid accessing memory in non-contiguous locations – especially in loops.
You may need to consider a loop interchange to access data in a more efficient manner.
Loop Interchange
Very important for the vectorizer!
Before interchange (the innermost, fastest loop index k strides non-contiguously through b):

  for (i = 0; i < NUM; i++)
    for (j = 0; j < NUM; j++)
      for (k = 0; k < NUM; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];

After interchange (the fastest loop index j gives unit-stride access):

  for (i = 0; i < NUM; i++)
    for (k = 0; k < NUM; k++)
      for (j = 0; j < NUM; j++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];

Non-unit-stride skipping through memory can cause cache thrashing – particularly for array sizes of 2^n.
Unit Stride Memory Access (C/C++)
[Diagram: array b (elements b[k][j]) and array a (elements a[i][k]). For b, j is the fastest incremented index and gives consecutive memory access; for a, the next-fastest loop index k also walks consecutive memory.]
Poor Cache Utilization - with Eggs
[Diagram: a refrigerator, a table of egg cartons, and a pan ready to fry eggs.]
• Carton represents a cache line
• Refrigerator represents main memory
• Table represents cache
• A request for an egg not already on the table brings a new carton of eggs from the refrigerator, but the user only fries one egg from each carton
• When the table fills up, old cartons are evicted and most eggs are wasted
The table starts full – a previous user had used all the eggs on the table. This user requests one specific egg, then a 2nd specific egg, then a 3rd egg – and a carton is evicted.

Good Cache Utilization - with Eggs
• A request for one egg brings a new carton of eggs from the refrigerator
• The user then specifically requests eggs from the carton already on the table
• The user fries all eggs in a carton before requesting an egg from the next carton
The user requests eggs 1-8, then eggs 9-16, and eventually asks for all the eggs. Carton eviction doesn't hurt us because we've already fried all the eggs in the cartons on the table – just like the previous user.
Lab 4 – Matrix Multiply Cache Effects
Objective: Explore the impact of poor cache utilization on performance with Parallel Studio, and explore how to manipulate loops to achieve significantly better cache utilization & performance (a cache-blocking sketch to try appears below)
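One manipulation to try, sketched under illustrative sizes (N and the block size BS are hypothetical tuning values, not from the lab): cache blocking keeps a tile of b resident in cache while it is reused.

  #define N  1024
  #define BS 64   /* illustrative block size; tune to your cache */
  float a[N][N], b[N][N], c[N][N];

  void matmul_blocked(void)
  {
      /* Operate on BS x BS tiles so each tile of b stays in cache */
      for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
          for (int jj = 0; jj < N; jj += BS)
            for (int i = ii; i < ii + BS; i++)
              for (int k = kk; k < kk + BS; k++)
                for (int j = jj; j < jj + BS; j++)
                  c[i][j] += a[i][k] * b[k][j];
  }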
BACKUP