TRANSCRIPT
Overview of Intel® Core 2 Architecture and Software Development Tools
June 2009
Overview of Architecture & Tools
We will discuss:
• What lecture materials are available
• What labs are available
• What target courses could be impacted
• Some high-level discussion of the underlying technology
Objectives
After completing this module, you will:
Be aware of and have access to several hours' worth of multi-core (MC) topics including Architecture, Compiler Technology, Profiling Technology, OpenMP, & Cache Effects
Be able to create exercises on how to avoid common threading hazards associated with some MC systems – such as poor cache utilization, false sharing and threading load imbalance
Be able to create exercises on how to use selected compiler directives & switches to improve behavior on each core
Be able to create exercises on how to take advantage of the VTune analyzer to quickly identify load imbalance issues, poor cache reuse and false sharing issues
Agenda
• Multi-core Motivation
• Tools Overview
• Taking advantage of Multi-core
• Taking advantage of parallelism within each core (SSEx)
• Avoiding Memory/Cache effects
Why is the Industry moving to Multi-core?
In order to increase performance and reduce power consumption: it is much more efficient to run several cores at a lower frequency than one single core at a much higher frequency.
Power and Frequency
[Figure: Power (W) vs. Frequency (GHz) curve for a single-core architecture, roughly 0–3.4 GHz. Dropping the frequency gives a large drop in power, and the lower frequency allows headroom for a 2nd core.]
Agenda
• Multi-core Motivation
• Tools Overview
• Taking advantage of Multi-core
• Taking advantage of parallelism within each core (SSEx)
• Avoiding Memory/Cache effects
Processor-independent optimizations
/Od   Disables optimizations
/O1   Optimizes for binary size and for speed: server code
/O2   Optimizes for speed (default): vectorization on Intel 64
/O3   Optimizes for data cache: loopy floating-point code
/Zi   Creates symbols for debugging
/Ob0  Turns off inlining, which can sometimes help the analysis tools do a more thorough job
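As a minimal illustration (the source file name is hypothetical), an optimized build that keeps symbols and disables inlining for analysis might look like this on the Windows command line:

  icl /O2 /Zi /Ob0 myapp.c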
AutoVectorization optimizations
/QaxSSE2      Optimizes for the Intel Pentium 4 and compatible Intel processors.
/QaxSSE3      Optimizes for the Intel(R) Core(TM) processor family with Streaming SIMD Extensions 3 (SSE3) instruction support.
/QaxSSE3_ATOM Can generate MOVBE instructions and can optimize for the Intel(R) Atom(TM) processor and Intel(R) Centrino(R) Atom(TM) Processor Technology.
/QaxSSSE3     Optimizes for the Intel(R) Core(TM)2 processor family with SSSE3.
/QaxSSE4.1    Optimizes for the Intel(R) 45nm Hi-k next generation Intel Core(TM) microarchitecture with support for SSE4 Vectorizing Compiler and Media Accelerator instructions.
/QaxSSE4.2    Can generate Intel(R) SSE4 Efficient Accelerated String and Text Processing instructions supported by Intel(R) Core(TM) i7 processors, as well as Intel(R) SSE4 Vectorizing Compiler and Media Accelerator, SSSE3, SSE3, SSE2, and SSE instructions, and can optimize for the Intel(R) Core(TM) processor family.
Intel has a long history of providing auto-vectorization switches: support for new processor instructions is added while backward support for older instructions is maintained.
Developers should keep an eye on new developments in order to leverage the power of the latest processors.
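As a hedged example (the file name is hypothetical), targeting SSE4.2 while keeping a baseline code path for other processors might look like:

  icl /O2 /QaxSSE4.2 myloops.c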
More Advanced optimizations
/Qipo      Interprocedural optimization performs a static, topological analysis of your application. With /Qipo (-ipo), the analysis spans all of your source files built with /Qipo (-ipo); in other words, code generation in module A can be improved by what is happening in module B. It may also enable other optimizations such as auto-parallelization and auto-vectorization.
/Qparallel Enables the auto-parallelizer to generate multi-threaded code for loops that can be safely executed in parallel.
/Qopenmp   Enables the compiler to generate multi-threaded code based on the OpenMP* directives.
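As a sketch (the file name is hypothetical), these switches can be combined on one command line; on Linux the equivalents are -ipo, -parallel and -openmp:

  icl /O2 /Qipo /Qparallel vectorsum.c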
Lab 1 - AutoParallelization
Objective: Use auto-parallelization on a simple code to gain experience with the compiler's auto-parallelization feature
Follow the VectorSum activity in the student lab doc
Try auto-parallel compilation on the lab called VectorSum
Extra credit: parallelize manually and see whether you can beat the auto-parallel option – see the OpenMP section for constructs to try, and the sketch below for a starting point
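A minimal OpenMP sketch for the extra credit (array names and size are illustrative, not taken from the lab code):

  #include <omp.h>
  #define N 1000000
  float a[N], b[N], c[N];

  void vector_sum(void)
  {
      /* Each thread adds a contiguous chunk of the index range */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          c[i] = a[i] + b[i];
  }

Build with /Qopenmp so the pragma takes effect.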
Parallel Studio to find where to parallelize
Parallel Studio will be used in several labs to find appropriate locations to add parallelism to the code. Parallel Amplifier specifically is used to find hotspot information – where in your code the application spends most of its time.
Parallel Amplifier does not require instrumenting your code in order to find hotspots, but compiling with symbol information (/Zi) is a good idea.
Compiling with /Ob0 turns off inlining and sometimes gives a more thorough analysis in Parallel Studio.
Parallel Amplifier Hotspots
What does hotspot analysis show?
What about drilling down?
The call stack
The call stack shows the callee/caller relationship among functions in the code
Found potential parallelism
Lab 2 – Mandelbrot Hotspot Analysis
Objective: Use sampling to find some parallelism in the Mandelbrot application
Follow the Mandelbrot activity called Mandelbrot Sampling in the student lab doc
Identify candidate loops that could be parallelized
Agenda
• Multi-core Motivation
• Tools Overview
• Taking advantage of Multi-core
  – High level overview – Intel® Core Architecture
• Taking advantage of parallelism within each core (SSEx)
• Avoiding Memory/Cache effects
Intel® Core 2 Architecture
Mobile Platform Optimized
• 1-4 Execution Cores
• 3/6 MB L2 Cache Sizes
• 64 Byte L2 cache line
• 64-bit
Desktop Platform Optimized
• 2-4 Execution Cores
• 2x3, 2x6 MB L2 Cache Sizes
• 64 Byte L2 cache line
• 64-bit
Server Platform Optimized
• 4 Execution Cores
• 2x6 MB L2 Caches
• 64 Byte L2 cache line
• DP/MP support
• 64-bit
This is a snapshot in time during Penryn, Yorkfield and Harpertown.
Software developers should know the number of cores, the cache line size and the cache sizes to tackle the Cache Effects materials.
Memory Hierarchy (approximate access latencies):
• L1 Cache: ~1 cycle
• L2 Cache: ~1–10 cycles
• Main Memory: ~100s of cycles
• Magnetic Disk: ~1000s of cycles
High Level Architectural view
[Diagram: Intel Core 2 Duo Processor and Intel Core 2 Quad Processor, each attached to Memory via a 64B cache line. Legend: A = Architectural State, E = Execution Engine & Interrupt, C = 2nd Level Cache, B = Bus Interface.]
The dual core has a shared cache; the quad core has both shared and separated caches.
Intel® Core™ Microarchitecture – Memory Sub-system
With a separated cache, when CPU1 and CPU2 touch the same data, the L2 cache line must be shipped between the caches over the Front Side Bus (FSB). Shipping an L2 cache line costs roughly half of an access to memory.
Intel® Core™ Microarchitecture – Memory Sub-system
Advantages of Shared Cache – using Advanced Smart Cache® Technology: because CPU1 and CPU2 share the L2, there is no need to ship the cache line over the Front Side Bus (FSB).
Intel® Core™ Microarchitecture – Memory Sub-system
False Sharing: a performance issue in programs where cores write to different memory addresses BUT in the same cache line. Known as ping-ponging – the cache line is shipped back and forth between the cores.
[Diagram: over time, Core 0 writes X[0] while Core 1 writes X[1]; since X[0] and X[1] sit in one cache line, each write invalidates the other core's copy and the line bounces between the cores.]
False sharing is not an issue with a shared cache; it is an issue with separated caches.
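A minimal C sketch of the hazard and the usual fix (names, the two-thread count and the pad size are illustrative; the pad should match the 64-byte cache line noted above):

  #include <omp.h>

  /* Both counters share one 64-byte cache line: writes ping-pong */
  long hot[2];

  /* Padding gives each counter its own cache line */
  struct padded { long count; char pad[64 - sizeof(long)]; };
  struct padded cold[2];

  void tally(long n)
  {
      #pragma omp parallel num_threads(2)
      {
          int me = omp_get_thread_num();
          for (long i = 0; i < n; i++)
              cold[me].count++;  /* each core writes only its own line;
                                    incrementing hot[me] instead would
                                    demonstrate the ping-pong effect */
      }
  }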
Agenda
• Multi-core Motivation
• Tools Overview
• Taking advantage of Multi-core
• Taking advantage of parallelism within each core (SSEx)
• Avoiding Memory/Cache effects
Super Scalar Execution
[Diagram: multiple execution units within one core – FP, SIMD and INT.]
• Multiple execution units allow SIMD parallelism
• Many instructions can be retired in a clock cycle
• Multiple operations are executed within a single core at the same time
History of SSE Instructions
• Intel SSE (1999): 70 instructions – single-precision vectors, streaming operations
• Intel SSE2 (2000): 144 instructions – double-precision vectors; 8/16/32/64/128-bit vector integer
• Intel SSE3 (2004): 13 instructions – complex data
• Intel SSSE3 (2006): 32 instructions – decode
• Intel SSE4.1 (2007): 47 instructions – video accelerators, graphics building blocks, advanced vector instructions
Will be continued by:
• Intel SSE4.2 (XML processing, end 2008)
• See http://download.intel.com/technology/architecture/new-instructions-paper.pdf
Long history of new instructions; most require using packing & unpacking instructions.
SSE Data Types & Speedup Potential
• SSE: 4x floats
• SSE-2 through SSE-4: 2x doubles; 16x bytes, 8x 16-bit shorts, 4x 32-bit integers, 2x 64-bit integers, 1x 128-bit integer
Potential speedup (in the targeted loop) is roughly the same as the amount of packing, i.e. for floats, speedup ~ 4X.
Goal of SSE(x)
Scalar processing (traditional mode): one instruction produces one result – X + Y = one sum.
SIMD processing with SSE(2,3,4): one instruction produces multiple results – [x3 x2 x1 x0] + [y3 y2 y1 y0] = [x3+y3 x2+y2 x1+y1 x0+y0].
• Uses full width of XMM registers
• Many functional units
• Choice of many instructions
• Not all loops can be vectorized
• Can't vectorize most function calls
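For illustration, a hand-written intrinsics sketch of the four-wide add pictured above (function and variable names are hypothetical; the compiler's vectorizer emits equivalent code for vectorizable loops):

  #include <xmmintrin.h>  /* SSE intrinsics */

  /* Adds four pairs of floats with one ADDPS instruction */
  void add4(const float *x, const float *y, float *sum)
  {
      __m128 vx = _mm_loadu_ps(x);             /* load x0..x3 */
      __m128 vy = _mm_loadu_ps(y);             /* load y0..y3 */
      _mm_storeu_ps(sum, _mm_add_ps(vx, vy));  /* sum[i] = x[i] + y[i] */
  }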
Lab 3 – IPO assisted Vectorization
Objective: Explore how inlining a function can dramatically improve performance by allowing vectorization of a loop containing a function call (a sketch of the pattern appears below)
Open the SquareChargeCVectorizationIPO folder and use "nmake all" to build the project from the command line
To add switches to the make environment, use nmake all CF="/QxSSE3", for example
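A hypothetical two-file sketch of the pattern this lab explores (names are illustrative, not from the lab sources): with square() defined in a separate file, the call inside the loop blocks vectorization unless /Qipo lets the compiler inline across modules.

  /* square.c */
  float square(float v) { return v * v; }

  /* main.c */
  float square(float v);  /* defined in square.c */

  void square_all(const float *in, float *out, int n)
  {
      /* Without /Qipo this cross-module call blocks vectorization;
         with /Qipo the compiler can inline square() and vectorize */
      for (int i = 0; i < n; i++)
          out[i] = square(in[i]);
  }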
Agenda
• Multi-core Motivation
• Tools Overview
• Taking advantage of Multi-core
• Taking advantage of parallelism within each core (SSEx)
• Avoiding Memory/Cache effects
Cache effects
Cache effects can sometimes impact the speed of an application by as much as 10X or even 100X.
To take advantage of the cache hierarchy in your machine, you should use and re-use data already in cache as much as possible.
Avoid accessing memory in non-contiguous locations – especially in loops.
You may need to consider a loop interchange to access data in a more efficient manner.
Loop Interchange
Very important for the vectorizer!
Before interchange (the innermost, fastest loop index k strides non-contiguously through b):

  for (i = 0; i < NUM; i++)
    for (j = 0; j < NUM; j++)
      for (k = 0; k < NUM; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];

After interchange (the fastest loop index j gives unit-stride access):

  for (i = 0; i < NUM; i++)
    for (k = 0; k < NUM; k++)
      for (j = 0; j < NUM; j++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];

Non-unit-stride skipping through memory can cause cache thrashing – particularly for array sizes of 2^n.
Unit Stride Memory Access (C/C++)
[Diagram: array b (elements b[k][j]) and array a (elements a[i][k]). For b, j is the fastest incremented index and gives consecutive memory access; for a, the next-fastest loop index k also walks consecutive memory.]
Poor Cache Utilization - with Eggs
[Diagram: a refrigerator, a table of egg cartons, and a pan ready to fry eggs.]
• Carton represents a cache line
• Refrigerator represents main memory
• Table represents cache
• A request for an egg not already on the table brings a new carton of eggs from the refrigerator, but the user only fries one egg from each carton
• When the table fills up, old cartons are evicted and most eggs are wasted
The table starts full – a previous user had used all the eggs on the table. This user requests one specific egg, then a 2nd specific egg, then a 3rd egg – and a carton is evicted.

Good Cache Utilization - with Eggs
• A request for one egg brings a new carton of eggs from the refrigerator
• The user then specifically requests eggs from the carton already on the table
• The user fries all eggs in a carton before requesting an egg from the next carton
The user requests eggs 1-8, then eggs 9-16, and eventually asks for all the eggs. Carton eviction doesn't hurt us because we've already fried all the eggs in the cartons on the table – just like the previous user.
Lab 4 – Matrix Multiply Cache Effects
Objective: Explore the impact of poor cache utilization on performance with Parallel Studio, and explore how to manipulate loops to achieve significantly better cache utilization & performance (a cache-blocking sketch to try appears below)
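One manipulation to try, sketched under illustrative sizes (N and the block size BS are hypothetical tuning values, not from the lab): cache blocking keeps a tile of b resident in cache while it is reused.

  #define N  1024
  #define BS 64   /* illustrative block size; tune to your cache */
  float a[N][N], b[N][N], c[N][N];

  void matmul_blocked(void)
  {
      /* Operate on BS x BS tiles so each tile of b stays in cache */
      for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
          for (int jj = 0; jj < N; jj += BS)
            for (int i = ii; i < ii + BS; i++)
              for (int k = kk; k < kk + BS; k++)
                for (int j = jj; j < jj + BS; j++)
                  c[i][j] += a[i][k] * b[k][j];
  }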
BACKUP