Lecture 1 - Parallel Computing Architectures
ece.uprm.edu/~wrivera/icom6025/lecture1.pdf
TRANSCRIPT
Dr. Wilson Rivera
ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department
University of Puerto Rico
Lecture 1: Parallel Computing Architectures
Outline
• Goal: Understand parallel computing fundamental concepts
  – HPC challenges
  – Flynn's Taxonomy
  – Memory Access Models
  – Multi-core Processors
  – Graphics Processor Units
  – Cluster Infrastructures
  – Cloud Infrastructures
HPC Challenges
• Optimization of plasma heating systems for fusion experiments
• Physics of high-temperature superconducting cuprates
• Global simulation of CO2 dynamics
• Fundamental instability of supernova shocks
• Protein structure and function for cellulose-to-ethanol conversion
• Next-generation combustion devices burning alternative fuels
Slide source: Thomas Zaharia
HPC Challenges
[Chart, courtesy AIRBUS France: available computational capacity (Flop/s), from 1 Giga (10^9) through 1 Tera (10^12), 1 Peta (10^15), and 1 Exa (10^18) toward 1 Zeta (10^21), versus the capability achieved during one overnight batch (number of load cases run), 1980-2030. Milestones along the curve: RANS low speed, RANS high speed, unsteady RANS, HS design data set, CFD-based loads & HQ, CFD-based noise simulation, LES, aero optimisation & CFD-CSM, full MDO, and real-time CFD-based in-flight simulation. "Smart" use of HPC power: algorithms, data mining, knowledge.]
HPC Challenges
High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL
https://computation.llnl.gov/casc/projects/.../climate_2007F.pdf
Flynn's Taxonomy
                        Single Data    Multiple Data
Single Instruction        SISD             SIMD
Multiple Instructions     MISD             MIMD
Flynn's Taxonomy
• Single Instruction, Multiple Data (SIMD)
  – All processing units execute the same instruction at any given clock cycle
  – Best suited for problems with a high degree of regularity, such as image processing
  – Good examples:
    • SSE (Streaming SIMD Extensions) and SSE2
    • Intel MIC (Xeon Phi)
    • Graphics Processing Units (GPUs)
  – A data-parallel loop such as the one sketched below maps directly onto this model
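As a concrete illustration (not from the original slides), here is a minimal C sketch of SIMD execution using SSE intrinsics: a single _mm_add_ps instruction adds four packed single-precision floats at once, i.e., the same instruction applied to multiple data elements. The array contents are arbitrary example values.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    /* Each iteration issues one SIMD add operating on 4 floats at a time. */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* single instruction, multiple data */
        _mm_storeu_ps(&c[i], vc);          /* store 4 results */
    }

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```

On x86-64 the same effect is usually obtained by letting the compiler auto-vectorize a plain loop; the intrinsics are shown only to make the one-instruction-many-elements idea explicit.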
Flynn's Taxonomy
• Multiple Instruction, Multiple Data (MIMD)
  – Every processing unit may be executing a different instruction stream and working with a different data stream
    • Clusters and multi-core computers
    • In practice, MIMD architectures may also include SIMD execution sub-components
Memory Access Models
• Shared memory
• Distributed memory
• Hybrid distributed-shared memory
Shared Memory
[Diagram: several CPUs, each with an L2 cache, sharing a single memory and I/O over a bus interconnect]
Shared Memory
• Multiple processors can operate independently but share the same memory resources
  – so that changes in a memory location effected by one processor are visible to all other processors (see the OpenMP sketch below)
• Two main classes based upon memory access times
  – Uniform Memory Access (UMA), e.g. Symmetric Multi-Processors (SMPs)
  – Non-Uniform Memory Access (NUMA)
• Main disadvantage is the lack of scalability between memory and CPUs
  – Adding more CPUs geometrically increases traffic on the shared memory-CPU path
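As an illustrative sketch (not part of the original slides), this is the programming model OpenMP exposes in C: all threads run independently but read and write the same array, so a value written by one thread is visible to the others. Compile with an OpenMP-enabled compiler (e.g. with -fopenmp).

```c
#include <stdio.h>
#include <omp.h>

#define N 16

int main(void)
{
    int data[N];

    /* All threads share the same 'data' array; each writes its own elements. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        data[i] = omp_get_thread_num();

    /* The writes made by the worker threads are visible here after the loop. */
    for (int i = 0; i < N; i++)
        printf("data[%d] was written by thread %d\n", i, data[i]);

    return 0;
}
```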
Shared Memory
• The memory hierarchy tries to exploit locality
  – Cache hit: the access is served from cache memory (cheap)
  – Cache miss: the access must go to main memory (expensive)
  – Loop ordering strongly affects locality, as the sketch below illustrates
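To make the cache-hit/cache-miss distinction concrete, here is a small C sketch (illustrative only; the matrix size N is an arbitrary choice): summing a row-major 2D array row by row walks consecutive addresses and mostly hits in cache, while summing it column by column strides through memory and mostly misses.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 4096

/* Row-by-row traversal: consecutive addresses, mostly cache hits. */
static double sum_row_major(const double *a)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i * N + j];
    return s;
}

/* Column-by-column traversal: stride-N accesses, mostly cache misses. */
static double sum_col_major(const double *a)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i * N + j];
    return s;
}

int main(void)
{
    double *a = malloc((size_t)N * N * sizeof *a);
    if (!a) return 1;
    for (long k = 0; k < (long)N * N; k++)
        a[k] = 1.0;

    /* Both functions compute the same value; timing them exposes the
       difference between cache-friendly and cache-hostile access patterns. */
    printf("row-major sum: %.0f\n", sum_row_major(a));
    printf("col-major sum: %.0f\n", sum_col_major(a));

    free(a);
    return 0;
}
```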
Distributed Memory
[Diagram: several nodes, each with its own CPU, L2 cache, and local memory (M), connected through a network with I/O]
Distributed Memory
• Processors have their own local memory
• When a processor needs access to data in another processor's memory, it is usually the task of the programmer to explicitly define how and when data is communicated (a minimal message-passing sketch follows below)
• Examples: Cray XT4, clusters, clouds
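For illustration (not part of the original slides), a minimal MPI sketch in C of the explicit communication described above: rank 0 decides what to send and when, and rank 1 must post a matching receive. Compile with an MPI wrapper such as mpicc and run with two processes (e.g. mpirun -np 2).

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0.0, 0.0, 0.0, 0.0};

    if (rank == 0) {
        for (int i = 0; i < 4; i++)
            buf[i] = i + 1.0;
        /* The programmer states explicitly what is sent, to whom, and when. */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The receiving process must explicitly post a matching receive. */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received: %.1f %.1f %.1f %.1f\n",
               buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```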
Hybrid (Distributed-Shared) Memory
[Diagram: several shared-memory nodes connected through a network]
In practice we have hybrid memory access
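A hedged sketch of how such hybrid machines are commonly programmed (assuming an MPI library and OpenMP support; this particular combination is not prescribed by the slides): MPI moves data between distributed-memory nodes, while OpenMP threads share memory within each node.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each MPI process owns its slice of the work (distributed memory). */
    double local_sum = 0.0;

    /* Threads inside the process share the node's memory (shared memory). */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0;

    /* Partial results are combined across nodes over the network. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.0f (from %d ranks)\n", global_sum, size);

    MPI_Finalize();
    return 0;
}
```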
Parallel computing trends
• Multi-core processors
  – Instead of building processors with faster clock speeds, modern computer systems are being built using chips with an increasing number of processor cores
• Graphics Processor Units (GPUs)
  – General-purpose computing, and in particular data-parallel high performance computing
• Dynamic approach to cluster computing provisioning
  – Instead of offering a fixed software environment, the application provides information to the scheduler about what type of resources it needs, and the nodes are automatically provisioned for the user at run time
  – Platform ISF Adaptive Cluster, Moab Adaptive Operating Environment
• Large-scale commodity computer data centers (cloud)
  – Amazon EC2, Eucalyptus, Google App Engine
Multi-cores and Moore’s Law
Circuit complexity doubles every 18 months
Power wall (2004)
Source: The National Academies Press, Washington, DC, 2011
Source: Intel
Power Wall
• The transition to multi-core processors is not a breakthrough in architecture; it is a result of the need to build power-efficient chips
Power Density Limits Serial Performance
Many-cores (Graphics Processor Units)
• Graphics Processor Units (GPUs)
  – Throughput-oriented devices designed to provide high aggregate performance for independent computations
    • They prioritize high-throughput processing of many parallel operations over low-latency execution of a single task
  – GPUs do not use independent instruction decoders
    • Instead, groups of processing units share an instruction decoder; this maximizes the number of arithmetic units per die area
Multi-Core vs. Many-Core
• Multi-core processors (minimize latency)
  – MIMD
  – Each core optimized for executing a single thread
  – Lots of big on-chip caches
  – Extremely sophisticated control
• Many-core processors (maximize throughput)
  – SIMD
  – Cores optimized for aggregate throughput
  – Lots of ALUs
  – Simpler control
CPUs: Latency Oriented Design
• Large caches
  – Convert long-latency memory accesses into short-latency cache accesses
• Sophisticated control
  – Branch prediction for reduced branch latency
  – Data forwarding for reduced data latency
• Powerful ALUs
  – Reduced operation latency
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign
[Diagram: CPU die area dominated by cache and control logic alongside a few powerful ALUs, connected to DRAM]
GPUs: Throughput Oriented Design
• Small caches
  – To boost memory throughput
• Simple control
  – No branch prediction
  – No data forwarding
• Energy-efficient ALUs
  – Many, long-latency but heavily pipelined for high throughput
• Require a massive number of threads to tolerate latencies
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign
[Diagram: GPU die area dominated by many small ALUs, with little cache or control logic, connected to DRAM]
Multi-Core vs. Many-Core
[Chart: peak throughput in GFLOPS versus time, roughly 2002-2009, for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) and Intel CPUs (3 GHz dual-core Pentium 4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon, Westmere); y-axis from 0 to 1400 GFLOPS]
Intel® Xeon® Processor E7-8894 v4
• 24 cores
• 48 threads
• 2.40 GHz
• 14 nm
• 60 MB cache
• $8k (July 2017)
NVIDIA TITAN Xp
• 3840 cores
• 1.6 GHz
• Pascal architecture
• Peak = 12 TF/s
• $1.5K
Cluster Hardware Configuration
[Diagram: a head node and compute nodes 1 through n connected through a switch, with local storage and external storage]
© Wilson Rivera
Cluster Head Node
• Head node
  – Two network interface cards (NICs): one connecting to the public network and the other connecting to the internal cluster network
  – Local storage is attached to the head node for administrative purposes such as accounting management and maintenance services
Cluster Interconnection Network
• The interconnection of the cluster depends upon both application and budget constraints
  – Small clusters typically have PC-based nodes connected through a Gigabit Ethernet network
  – Large-scale production clusters may be made of 1U or 2U servers, or blade servers, connected through either
    • a Gigabit Ethernet network (server farm), or
    • a high performance computing network (high performance computing cluster)
      – InfiniBand
      – Quadrics
      – Myrinet
      – Omni-Path (Intel)
Cluster Storage
• Storage Area Network (SAN)
  – Storage devices appear as locally attached to the operating system
• Network Attached Storage (NAS)
  – Distributed file-based protocols
    • Parallel Virtual File System (PVFS)
    • General Parallel File System (GPFS)
    • Hadoop Distributed File System (HDFS)
    • Lustre
    • CernVM-FS
Cluster Software
[Diagram: cluster software stack comprising the operating system, cluster infrastructure services, and cluster tools and libraries, including a resource manager, scheduler, monitor, analyzer, communication libraries, compilers, and optimization tools]
© Wilson Rivera
Top500.org
History of Performance
Slide source: Exascale Computing and Big Data
[Chart: Top500 performance history, from 100 Mflop/s up to 100 Pflop/s, showing the SUM, N=1, and N=500 curves]
Projected Performance
#1 TAIHULIGHT @ CHINA
• June 2017
• National Supercomputing Center in Wuxi
• SW26010 processors developed by NRCPC
• 40,960 nodes
• 10,649,600 cores
• Peak = 125 PF/s
• Rmax = 93 PF/s
• 15,371 kW
Cloud Computing
• Cloud computing allows scaling on demand without building or provisioning a data center
  – Computing resources available on demand (self-service)
  – Charging only for resources utilized (pay-as-you-go)
• Worldwide revenue from public IT cloud services exceeded $21.5 billion in 2010
  – It was projected to reach $72.9 billion in 2015
  – A compound annual growth rate (CAGR) of 27.6%
http://www.idc.com/prodserv/idc_cloud.jsp
Cloud versus Grid
• Grids
  – Sharing and coordination of distributed resources
  – Grid middleware: Globus, UNICORE, gLite
• Clouds
  – Leverage virtualization to maximize resource utilization
  – Cloud middleware: IaaS, PaaS, SaaS
Layered cloud model
From: K. Chen, Wright University
Cloud Layers
– Infrastructure as a Service (IaaS)
  • Flexible in terms of the applications to be hosted
  • Amazon EC2, RackSpace, Nimbus, Eucalyptus
– Platform as a Service (PaaS)
  • Application domain-specific platforms
  • Google App Engine, MS Azure, Heroku
– Software as a Service (SaaS)
  • Service domain-specific
  • Salesforce, NetSuite
Cloud Economics
• Pay by use instead of provisioning for peak
[Diagram, from K. Chen, Wright University: resources versus time for a static data center, where fixed capacity above demand leaves unused resources, compared with a data center in the cloud, where capacity tracks demand]
Cloud Economics
• Setup:
  – A peak period needs 10 servers to process requests
  – Assume your service is going to run for 1 year
• Private cluster: one-time investment
  – Servers: $1,500 x 10 = $15,000
  – Power/AC costs about $200/year/server => $2,000
  – Administrator: $50,000
• Public cloud:
  – Rush hours: 10 hours/day, needing 10 nodes/hour
  – Other hours: the remaining 14 hours need 2 nodes/hour
  – Total: 128 node-hours x $0.10/node-hour = $12.80/day
  – One-year cost = $4,672 (the arithmetic is worked out below)
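Spelling out the public-cloud arithmetic from the setup above (using the rates given on the slide):

\[
10\ \text{h/day} \times 10\ \text{nodes} + 14\ \text{h/day} \times 2\ \text{nodes} = 128\ \text{node-hours/day}
\]
\[
128 \times \$0.10 = \$12.80/\text{day}, \qquad \$12.80 \times 365\ \text{days} = \$4{,}672/\text{year}
\]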
Cloud Economics
• Amazon EC2 pricing
• Google engine pricing
• Hadoop sizing
• How much to rent a supercomputer?
  – 8-core VM
  – 30 GB of RAM (each core 3.75 GB)
  – $1.16/hour
  – 600,000 cores
  – 75,000 VMs
  – $87,000/hour
  – $2 million per day (see the check below)
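A quick check of the rental arithmetic above (using the per-VM rate listed on the slide):

\[
600{,}000\ \text{cores} \div 8\ \text{cores/VM} = 75{,}000\ \text{VMs}
\]
\[
75{,}000 \times \$1.16/\text{hour} = \$87{,}000/\text{hour}, \qquad \$87{,}000 \times 24\ \text{hours} \approx \$2\ \text{million/day}
\]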
Data analytics Ecosystem
Slide source: Exascale Computing and Big Data
Summary
• Parallel computing infrastructure trends
  – Multi-core processors
    • A result of the need to build power-efficient chips
  – Graphics Processor Units
    • Throughput-oriented devices designed to provide high aggregate performance for independent computations
  – Cluster infrastructures
    • Head node; interconnection; storage; software
  – Cloud infrastructures
    • Physical resources; virtual resources; infrastructure services; application services
Scientific Computing Terminology
• HPC System: a "High Performance Computing" computer, i.e., computers connected through a high-speed interconnect and configured for scientific computing.
• Interconnect: the wiring, chips, and software that connect computing components.
• Node (blade, sled, etc.): an independent computing unit of an HPC system. A node has its own operating system (OS) and memory. The physical cases of a node are often called blades and sleds.
• Chassis: nodes are often aggregated into a chassis (with a backplane) to share electrical power, cooling, and a local interconnect.
Terminology (continued)
• Chip or Die: self-contained circuits on a single piece of media of size roughly 20 mm x 20 mm, containing up to about 1 billion transistors.
• Socket: provides a connection between a chip and a motherboard.
• CPU (or processor): a Central Processing Unit, consisting of a chip or die (often called a processor).
• Core: modern CPUs contain multiple cores. A core is an execution unit within the CPU that can execute a code's instructions independently while other cores execute a different code's instructions.
• Hyper-Threading: a single core can have additional circuitry that allows two or more instruction streams (threads) to proceed through a single core "simultaneously". Hyper-Thread is an Intel trademark for 2 threads; the Xeon Phi coprocessor supports 4 threads.