Lecture 1 - Parallel Computing Architectures
ece.uprm.edu/~wrivera/icom6025/lecture1.pdf
TRANSCRIPT
Dr. Wilson Rivera
ICOM 6025: High Performance Computing
Electrical and Computer Engineering Department
University of Puerto Rico
Lecture 1: Parallel Computing Architectures
Outline
• Goal: Understand parallel computing fundamental concepts
  – HPC challenges
  – Flynn's Taxonomy
  – Memory Access Models
  – Multi-core Processors
  – Graphics Processor Units
  – Cluster Infrastructures
  – Cloud Infrastructures
HPC Challenges
• Optimization of plasma heating systems for fusion experiments
• Physics of high-temperature superconducting cuprates
• Global simulation of CO2 dynamics
• Fundamental instability of supernova shocks
• Protein structure and function for cellulose-to-ethanol conversion
• Next-generation combustion devices burning alternative fuels
Slide source: Thomas Zaharia
HPC Challenges
[Chart, courtesy AIRBUS France: available computational capacity (Flop/s), from 1 Giga (10^9) through 1 Tera (10^12), 1 Peta (10^15), and 1 Exa (10^18) toward 1 Zeta (10^21), versus the capability achieved during one overnight batch (number of load cases run), 1980-2030. Milestones along the curve: RANS low speed, RANS high speed, unsteady RANS, HS design data set, CFD-based loads & HQ, CFD-based noise simulation, LES, aero optimisation & CFD-CSM, full MDO, and real-time CFD-based in-flight simulation. "Smart" use of HPC power: algorithms, data mining, knowledge.]
HPC Challenges
High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL
https://computation.llnl.gov/casc/projects/.../climate_2007F.pdf
Flynn's Taxonomy
                        Single Data    Multiple Data
Single Instruction        SISD             SIMD
Multiple Instructions     MISD             MIMD
Flynn's Taxonomy
• Single Instruction, Multiple Data (SIMD)
  – All processing units execute the same instruction at any given clock cycle
  – Best suited for problems with a high degree of regularity, such as image processing
  – Good examples:
    • SSE (Streaming SIMD Extensions) and SSE2
    • Intel MIC (Xeon Phi)
    • Graphics Processing Units (GPUs)
  – A data-parallel loop such as the one sketched below maps directly onto this model
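As a concrete illustration (not from the original slides), here is a minimal C sketch of SIMD execution using SSE intrinsics: a single _mm_add_ps instruction adds four packed single-precision floats at once, i.e., the same instruction applied to multiple data elements. The array contents are arbitrary example values.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    /* Each iteration issues one SIMD add operating on 4 floats at a time. */
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* single instruction, multiple data */
        _mm_storeu_ps(&c[i], vc);          /* store 4 results */
    }

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```

On x86-64 the same effect is usually obtained by letting the compiler auto-vectorize a plain loop; the intrinsics are shown only to make the one-instruction-many-elements idea explicit.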
Flynn's Taxonomy
• Multiple Instruction, Multiple Data (MIMD)
  – Every processing unit may be executing a different instruction stream and working with a different data stream
    • Clusters and multi-core computers
    • In practice, MIMD architectures may also include SIMD execution sub-components
Memory Access Models
• Shared memory
• Distributed memory
• Hybrid distributed-shared memory
Shared Memory
[Diagram: several CPUs, each with an L2 cache, sharing a single memory and I/O over a bus interconnect]
Shared Memory
• Multiple processors can operate independently but share the same memory resources
  – so that changes in a memory location effected by one processor are visible to all other processors (see the OpenMP sketch below)
• Two main classes based upon memory access times
  – Uniform Memory Access (UMA), e.g. Symmetric Multi-Processors (SMPs)
  – Non-Uniform Memory Access (NUMA)
• Main disadvantage is the lack of scalability between memory and CPUs
  – Adding more CPUs geometrically increases traffic on the shared memory-CPU path
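As an illustrative sketch (not part of the original slides), this is the programming model OpenMP exposes in C: all threads run independently but read and write the same array, so a value written by one thread is visible to the others. Compile with an OpenMP-enabled compiler (e.g. with -fopenmp).

```c
#include <stdio.h>
#include <omp.h>

#define N 16

int main(void)
{
    int data[N];

    /* All threads share the same 'data' array; each writes its own elements. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        data[i] = omp_get_thread_num();

    /* The writes made by the worker threads are visible here after the loop. */
    for (int i = 0; i < N; i++)
        printf("data[%d] was written by thread %d\n", i, data[i]);

    return 0;
}
```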
Shared Memory
• The memory hierarchy tries to exploit locality
  – Cache hit: the access is served from cache memory (cheap)
  – Cache miss: the access must go to main memory (expensive)
  – Loop ordering strongly affects locality, as the sketch below illustrates
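To make the cache-hit/cache-miss distinction concrete, here is a small C sketch (illustrative only; the matrix size N is an arbitrary choice): summing a row-major 2D array row by row walks consecutive addresses and mostly hits in cache, while summing it column by column strides through memory and mostly misses.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 4096

/* Row-by-row traversal: consecutive addresses, mostly cache hits. */
static double sum_row_major(const double *a)
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i * N + j];
    return s;
}

/* Column-by-column traversal: stride-N accesses, mostly cache misses. */
static double sum_col_major(const double *a)
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i * N + j];
    return s;
}

int main(void)
{
    double *a = malloc((size_t)N * N * sizeof *a);
    if (!a) return 1;
    for (long k = 0; k < (long)N * N; k++)
        a[k] = 1.0;

    /* Both functions compute the same value; timing them exposes the
       difference between cache-friendly and cache-hostile access patterns. */
    printf("row-major sum: %.0f\n", sum_row_major(a));
    printf("col-major sum: %.0f\n", sum_col_major(a));

    free(a);
    return 0;
}
```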
Distributed Memory
[Diagram: several nodes, each with its own CPU, L2 cache, and local memory (M), connected through a network with I/O]
Distributed Memory
• Processors have their own local memory
• When a processor needs access to data in another processor's memory, it is usually the task of the programmer to explicitly define how and when data is communicated (a minimal message-passing sketch follows below)
• Examples: Cray XT4, clusters, clouds
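For illustration (not part of the original slides), a minimal MPI sketch in C of the explicit communication described above: rank 0 decides what to send and when, and rank 1 must post a matching receive. Compile with an MPI wrapper such as mpicc and run with two processes (e.g. mpirun -np 2).

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0.0, 0.0, 0.0, 0.0};

    if (rank == 0) {
        for (int i = 0; i < 4; i++)
            buf[i] = i + 1.0;
        /* The programmer states explicitly what is sent, to whom, and when. */
        MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The receiving process must explicitly post a matching receive. */
        MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received: %.1f %.1f %.1f %.1f\n",
               buf[0], buf[1], buf[2], buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```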
Hybrid (Distributed-Shared) Memory
[Diagram: several shared-memory nodes connected through a network]
In practice we have hybrid memory access
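A hedged sketch of how such hybrid machines are commonly programmed (assuming an MPI library and OpenMP support; this particular combination is not prescribed by the slides): MPI moves data between distributed-memory nodes, while OpenMP threads share memory within each node.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each MPI process owns its slice of the work (distributed memory). */
    double local_sum = 0.0;

    /* Threads inside the process share the node's memory (shared memory). */
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0;

    /* Partial results are combined across nodes over the network. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %.0f (from %d ranks)\n", global_sum, size);

    MPI_Finalize();
    return 0;
}
```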
Parallel computing trends
• Multi-core processors
  – Instead of building processors with faster clock speeds, modern computer systems are being built using chips with an increasing number of processor cores
• Graphics Processor Units (GPUs)
  – General-purpose computing, and in particular data-parallel high performance computing
• Dynamic approach to cluster computing provisioning
  – Instead of offering a fixed software environment, the application provides information to the scheduler about what type of resources it needs, and the nodes are automatically provisioned for the user at run time
  – Platform ISF Adaptive Cluster, Moab Adaptive Operating Environment
• Large-scale commodity computer data centers (cloud)
  – Amazon EC2, Eucalyptus, Google App Engine
Multi-cores and Moore’s Law
Circuit complexity doubles every 18 months
Power wall (2004)
Source: The National Academies Press, Washington, DC, 2011
Source: Intel
Power Wall
• The transition to multi-core processors is not a breakthrough in architecture; it is a result of the need to build power-efficient chips
Power Density Limits Serial Performance
Many-cores (Graphics Processor Units)
• Graphics Processor Units (GPUs)
  – Throughput-oriented devices designed to provide high aggregate performance for independent computations
    • They prioritize high-throughput processing of many parallel operations over low-latency execution of a single task
  – GPUs do not use independent instruction decoders
    • Instead, groups of processing units share an instruction decoder; this maximizes the number of arithmetic units per die area
Multi-Core vs. Many-Core
• Multi-core processors (minimize latency)
  – MIMD
  – Each core optimized for executing a single thread
  – Lots of big on-chip caches
  – Extremely sophisticated control
• Many-core processors (maximize throughput)
  – SIMD
  – Cores optimized for aggregate throughput
  – Lots of ALUs
  – Simpler control
CPUs: Latency Oriented Design
• Large caches
  – Convert long-latency memory accesses into short-latency cache accesses
• Sophisticated control
  – Branch prediction for reduced branch latency
  – Data forwarding for reduced data latency
• Powerful ALUs
  – Reduced operation latency
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign
[Diagram: CPU die area dominated by cache and control logic alongside a few powerful ALUs, connected to DRAM]
GPUs: Throughput Oriented Design
• Small caches
  – To boost memory throughput
• Simple control
  – No branch prediction
  – No data forwarding
• Energy-efficient ALUs
  – Many, long-latency but heavily pipelined for high throughput
• Require a massive number of threads to tolerate latencies
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012, SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign
[Diagram: GPU die area dominated by many small ALUs, with little cache or control logic, connected to DRAM]
Multi-Core vs. Many-Core
[Chart: peak throughput in GFLOPS versus time, roughly 2002-2009, for NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) and Intel CPUs (3 GHz dual-core Pentium 4, 3 GHz Core2 Duo, 3 GHz quad-core Xeon, Westmere); y-axis from 0 to 1400 GFLOPS]
Intel® Xeon® Processor E7-8894 v4
• 24 cores
• 48 threads
• 2.40 GHz
• 14 nm
• 60 MB cache
• $8k (July 2017)
NVIDIA TITAN Xp
• 3840 cores
• 1.6 GHz
• Pascal architecture
• Peak = 12 TF/s
• $1.5K
Cluster Hardware Configuration
[Diagram: a head node and compute nodes 1 through n connected through a switch, with local storage and external storage]
© Wilson Rivera
Cluster Head Node
• Head node
  – Two network interface cards (NICs): one connecting to the public network and the other connecting to the internal cluster network
  – Local storage is attached to the head node for administrative purposes such as accounting management and maintenance services
Cluster Interconnection Network
• The interconnection of the cluster depends upon both application and budget constraints
  – Small clusters typically have PC-based nodes connected through a Gigabit Ethernet network
  – Large-scale production clusters may be made of 1U or 2U servers, or blade servers, connected through either
    • a Gigabit Ethernet network (server farm), or
    • a high performance computing network (high performance computing cluster)
      – InfiniBand
      – Quadrics
      – Myrinet
      – Omni-Path (Intel)
Cluster Storage
• Storage Area Network (SAN)
  – Storage devices appear as locally attached to the operating system
• Network Attached Storage (NAS)
  – Distributed file-based protocols
    • Parallel Virtual File System (PVFS)
    • General Parallel File System (GPFS)
    • Hadoop Distributed File System (HDFS)
    • Lustre
    • CernVM-FS
Cluster Software
[Diagram: cluster software stack comprising the operating system, cluster infrastructure services, and cluster tools and libraries, including a resource manager, scheduler, monitor, analyzer, communication libraries, compilers, and optimization tools]
© Wilson Rivera
Top500.org
History of Performance
Slide source: Exascale Computing and Big Data
[Chart: Top500 performance history, from 100 Mflop/s up to 100 Pflop/s, showing the SUM, N=1, and N=500 curves]
Projected Performance
#1 TAIHULIGHT @ CHINA
• June 2017
• National Supercomputing Center in Wuxi
• SW26010 processors developed by NRCPC
• 40,960 nodes
• 10,649,600 cores
• Peak = 125 PF/s
• Rmax = 93 PF/s
• 15,371 kW
Cloud Computing
• Cloud computing allows scaling on demand without building or provisioning a data center
  – Computing resources available on demand (self-service)
  – Charging only for resources utilized (pay-as-you-go)
• Worldwide revenue from public IT cloud services exceeded $21.5 billion in 2010
  – It was projected to reach $72.9 billion in 2015
  – A compound annual growth rate (CAGR) of 27.6%
http://www.idc.com/prodserv/idc_cloud.jsp
Cloud versus Grid
• Grids
  – Sharing and coordination of distributed resources
  – Grid middleware: Globus, UNICORE, gLite
• Clouds
  – Leverage virtualization to maximize resource utilization
  – Cloud middleware: IaaS, PaaS, SaaS
Layered cloud model
From: K. Chen, Wright University
Cloud Layers
– Infrastructure as a Service (IaaS)
  • Flexible in terms of the applications to be hosted
  • Amazon EC2, RackSpace, Nimbus, Eucalyptus
– Platform as a Service (PaaS)
  • Application domain-specific platforms
  • Google App Engine, MS Azure, Heroku
– Software as a Service (SaaS)
  • Service domain-specific
  • Salesforce, NetSuite
Cloud Economics
• Pay by use instead of provisioning for peak
[Diagram, from K. Chen, Wright University: resources versus time for a static data center, where fixed capacity above demand leaves unused resources, compared with a data center in the cloud, where capacity tracks demand]
Cloud Economics
• Setup:
  – A peak period needs 10 servers to process requests
  – Assume your service is going to run for 1 year
• Private cluster: one-time investment
  – Servers: $1,500 x 10 = $15,000
  – Power/AC costs about $200/year/server => $2,000
  – Administrator: $50,000
• Public cloud:
  – Rush hours: 10 hours/day, needing 10 nodes/hour
  – Other hours: the remaining 14 hours need 2 nodes/hour
  – Total: 128 node-hours x $0.10/node-hour = $12.80/day
  – One-year cost = $4,672 (the arithmetic is worked out below)
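Spelling out the public-cloud arithmetic from the setup above (using the rates given on the slide):

\[
10\ \text{h/day} \times 10\ \text{nodes} + 14\ \text{h/day} \times 2\ \text{nodes} = 128\ \text{node-hours/day}
\]
\[
128 \times \$0.10 = \$12.80/\text{day}, \qquad \$12.80 \times 365\ \text{days} = \$4{,}672/\text{year}
\]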
Cloud Economics
• Amazon EC2 pricing
• Google engine pricing
• Hadoop sizing
• How much to rent a supercomputer?
  – 8-core VM
  – 30 GB of RAM (each core 3.75 GB)
  – $1.16/hour
  – 600,000 cores
  – 75,000 VMs
  – $87,000/hour
  – $2 million per day (see the check below)
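A quick check of the rental arithmetic above (using the per-VM rate listed on the slide):

\[
600{,}000\ \text{cores} \div 8\ \text{cores/VM} = 75{,}000\ \text{VMs}
\]
\[
75{,}000 \times \$1.16/\text{hour} = \$87{,}000/\text{hour}, \qquad \$87{,}000 \times 24\ \text{hours} \approx \$2\ \text{million/day}
\]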
Data analytics Ecosystem
Slide source: Exascale Computing and Big Data
Summary
• Parallel computing infrastructure trends
  – Multi-core processors
    • A result of the need to build power-efficient chips
  – Graphics Processor Units
    • Throughput-oriented devices designed to provide high aggregate performance for independent computations
  – Cluster infrastructures
    • Head node; interconnection; storage; software
  – Cloud infrastructures
    • Physical resources; virtual resources; infrastructure services; application services
Scientific Computing Terminology
• HPC System: a "High Performance Computing" computer, i.e., computers connected through a high-speed interconnect and configured for scientific computing.
• Interconnect: the wiring, chips, and software that connect computing components.
• Node (blade, sled, etc.): an independent computing unit of an HPC system. A node has its own operating system (OS) and memory. The physical cases of a node are often called blades and sleds.
• Chassis: nodes are often aggregated into a chassis (with a backplane) to share electrical power, cooling, and a local interconnect.
Terminology (continued)
• Chip or Die: self-contained circuits on a single piece of media of size roughly 20 mm x 20 mm, containing up to about 1 billion transistors.
• Socket: provides a connection between a chip and a motherboard.
• CPU (or processor): a Central Processing Unit, consisting of a chip or die (often called a processor).
• Core: modern CPUs contain multiple cores. A core is an execution unit within the CPU that can execute a code's instructions independently while other cores execute a different code's instructions.
• Hyper-Threading: a single core can have additional circuitry that allows two or more instruction streams (threads) to proceed through a single core "simultaneously". Hyper-Thread is an Intel trademark for 2 threads; the Xeon Phi coprocessor supports 4 threads.