Evolution of High Performance Cluster
Architectures
David E. Culler
http://millennium.berkeley.edu/
NPACI 2001 All Hands Meeting
Much has changed since “NOW”
NOW0 HP+medusa FDDI
NOW1 SS+ATM/Myrinet
NOW 110 UltraSparc +Myrinet
inktomi.berkeley.edu
Millennium Cluster Editions
The Basic Argument
• performance cost of engineering lag
  – miss the 2x per 18 months
  – => rapid assembly of leading-edge HW and SW building blocks
  – => availability through fault masking, not inherent reliability
• emergence of the “killer switch”
• opportunities for innovation
  – move data between machines as fast as within a machine
  – protected user-level communication
  – large-scale management
  – fault isolation
  – novel applications
Clusters Took Off
• scalable internet services
  – only way to match growth rate
• changing supercomputer market
• web hosting
Engineering the Building Block
• argument came full circle in ~98
• wide array of 3U, 2U, 1U rack-mounted servers
– thermals and mechanicals
– processing per square-foot
– 110 AC routing a mixed blessing
– component OS & drivers
• became the early entry to the market
Emergence of the Killer Switch
• ATM, Fibre Channel, FDDI “died”
• ServerNet bumps along
• IBM, SGI do the proprietary thing
• little Myrinet just keeps going
  – quite nice at this stage
• SAN standards shootout
  – NGIO + FutureIO => InfiniBand
  – specs entire stack from PHY to API
    » nod to IPv6
  – big, complex, deeply integrated, DBC
• Gigabit Ethernet steamroller...
  – limited by TCP/IP stack, NIC, and cost
Opportunities for Innovation
Unexpected Breakthru: layer-7 switches
• fell out of modern switch design
  – process packets in chunks
• vast # of simultaneous connections
• many line-speed packet filters per port
• can be made redundant
• => multi-gigabit cluster “front end”
  – virtualize IP address of services
  – move service within cluster
  – replicate it, distribute it
[Figure: redundant layer-7 switches in front of the cluster’s network switch provide high-level transforms, fail-over, and load management]
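The “front end” idea can be sketched as a routing decision: one virtual IP for the whole cluster, with each request dispatched by inspecting its URL at layer 7. The backend addresses, URL prefixes, and round-robin policy below are illustrative assumptions, not the switch’s actual logic.

```python
# Hypothetical layer-7 routing table: URL prefix -> replica set.
BACKENDS = {
    "/search": ["10.0.0.11", "10.0.0.12"],  # replicated service
    "/mail":   ["10.0.0.21"],
}
DEFAULT = ["10.0.0.31"]
_rr = {}  # per-prefix round-robin counters

def pick_backend(request_line):
    """Route on the URL path (layer 7), round-robin within replicas."""
    path = request_line.split()[1]          # e.g. "GET /search?q=x HTTP/1.0"
    for prefix, nodes in BACKENDS.items():
        if path.startswith(prefix):
            break
    else:
        prefix, nodes = "*", DEFAULT
    _rr[prefix] = _rr.get(prefix, -1) + 1
    return nodes[_rr[prefix] % len(nodes)]
```

Moving a service within the cluster, or replicating it, then amounts to editing the table; the virtual IP the clients see never changes.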
e-Science
any useful app should be a service
Protected User-level messaging
• Virtual Interface Architecture (VIA) emerged
  – primitive & complex relative to academic prototypes
  – industrial compromise
  – went dormant
• incorporated in InfiniBand
  – big one to watch
• potential breakthrough
  – user-level TCP, UDP with IP
  – NIC storage over IP
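A toy model of the queue-pair idea behind VIA (and InfiniBand): the application posts descriptors to per-connection send/receive queues in user space, and the “NIC” moves data directly between posted buffers with no kernel trap. All class and method names here are invented for illustration.

```python
from collections import deque

class VirtualInterface:
    """Toy VIA-style endpoint: user-space send/recv descriptor queues."""
    def __init__(self):
        self.send_q, self.recv_q = deque(), deque()
        self.completions = deque()

    def post_recv(self, buf):
        self.recv_q.append(buf)           # pre-post a buffer for incoming data

    def post_send(self, data, peer):
        self.send_q.append((data, peer))  # no syscall: just a queue write

    def nic_poll(self):
        """What the NIC does: copy between user buffers, post completions."""
        while self.send_q:
            data, peer = self.send_q.popleft()
            buf = peer.recv_q.popleft()   # would stall if the peer under-posted
            buf[:len(data)] = data
            self.completions.append(("send", len(data)))
            peer.completions.append(("recv", len(data)))
```

The protection story in the real architecture is that each queue pair is registered with the NIC, so one process cannot post into another’s queues.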
Management
• workstation -> PC transition a step back
– boot image distribution, OS distribution
– network troubleshoot and service
• multicast proved a powerful tool
• emerging health monitoring and control
– HW level
– service level
– OS level still a problem
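The multicast monitoring idea can be sketched as follows: each node periodically announces a small state packet, and any listener folds the packets into a cluster-wide view, aging out nodes that go quiet. The packet format, metric names, and timeout below are assumptions, not Ganglia’s actual protocol.

```python
import json, time

def heartbeat(node, load, disk_free, now=None):
    """The packet a node would multicast (modeled as a JSON blob)."""
    return json.dumps({"node": node, "load": load,
                       "disk_free": disk_free, "t": now or time.time()})

def fold(view, packet, now=None, timeout=30.0):
    """Merge one packet into the view; drop nodes silent past the timeout."""
    msg = json.loads(packet)
    view[msg["node"]] = msg
    now = now or time.time()
    return {n: m for n, m in view.items() if now - m["t"] <= timeout}
```

Because the channel is multicast, adding a second monitor (or a redundant one) costs nothing on the sending side.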
Rootstock
[Figure: a Rootstock server at UC Berkeley feeds local Rootstock servers at each site over the Internet]
Ganglia and REXEC
[Figure: %rexec –n 2 –r 3 indexer asks a vexecd (Policy A; Policy B: minimum $) to choose “Nodes A, B”; rexecd daemons on Nodes A–D coordinate over a cluster IP multicast channel]
Also: bWatch; BPROC: Beowulf Distributed Process Space; VA Linux Systems: VACM (VA Cluster Manager)
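A sketch of vexecd’s role in the figure above: given the cluster state and a policy, it picks the nodes where the n requested processes will land. “Minimum $” is modeled here as cheapest-first; the node records and their cost field are invented for illustration.

```python
def vexecd(nodes, n, policy):
    """Pick n nodes for a rexec request according to the named policy."""
    if policy == "minimum $":
        ranked = sorted(nodes, key=lambda m: m["cost"])
    elif policy == "minimum load":
        ranked = sorted(nodes, key=lambda m: m["load"])
    else:
        raise ValueError("unknown policy: %s" % policy)
    return [m["name"] for m in ranked[:n]]

# Hypothetical cluster state a vexecd might hold.
cluster = [
    {"name": "A", "cost": 3, "load": 0.1},
    {"name": "B", "cost": 1, "load": 0.9},
    {"name": "C", "cost": 2, "load": 0.2},
    {"name": "D", "cost": 4, "load": 0.0},
]
```

Under this model, %rexec –n 2 with the “minimum $” policy lands the two processes on the two cheapest nodes, whatever their load.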
Network Storage
• state-of-practice still NFS + local copies
• local disk replica management lacking
• NFS doesn’t scale
  – major source of naive user frustration
• limited structured parallel access
• SAN movement only changing the device interface
• Need cluster content distribution, caching, parallel access and network striping
see: GPFS, CFS, PVFS, HPSS, GFS, PPFS, CXFS, HAMFS, Petal, NASD...
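Of the missing pieces, network striping is the easiest to sketch: the data is cut into fixed-size stripe units dealt round-robin across storage nodes, so parallel clients can hit different servers at once. The stripe size and the in-memory “nodes” are illustrative only.

```python
STRIPE = 4  # bytes per stripe unit (tiny, for illustration)

def stripe_write(data, nodes):
    """Deal stripe units round-robin across node buffers."""
    for i in range(0, len(data), STRIPE):
        nodes[(i // STRIPE) % len(nodes)].append(data[i:i + STRIPE])

def stripe_read(nodes):
    """Reassemble by visiting nodes round-robin, one unit per turn."""
    out = bytearray()
    i = 0
    while True:
        node = nodes[i % len(nodes)]
        pos = i // len(nodes)
        if pos >= len(node):
            break
        out += node[pos]
        i += 1
    return bytes(out)
```

Aggregate bandwidth then grows with the node count rather than being capped by one NFS server.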
Distributed Persistent Data Structure Alternative
[Figure: clustered services link a DDS library presenting a distributed hash table API; storage “bricks” (single-node durable hash tables) are reached over a redundant, low-latency, high-throughput system area network]
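A minimal sketch of the DDS-library idea in the figure: the client-side library hashes each key to a home brick and writes two replicas, so a get survives the loss of one brick. The brick count, hash choice, and replication factor of two are assumptions for illustration.

```python
import hashlib

class DDSHashTable:
    """Client-side sketch: hash keys to bricks, replicate writes."""
    def __init__(self, n_bricks=6, replicas=2):
        self.bricks = [dict() for _ in range(n_bricks)]
        self.replicas = replicas

    def _homes(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return [(h + i) % len(self.bricks) for i in range(self.replicas)]

    def put(self, key, value):
        for b in self._homes(key):        # write every replica
            self.bricks[b][key] = value

    def get(self, key):
        for b in self._homes(key):        # any surviving replica will do
            if key in self.bricks[b]:
                return self.bricks[b][key]
        raise KeyError(key)
```

The service code above the library sees only put/get; partitioning, replication, and fail-over live below the API line.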
Scalable Throughput
[Figure: max throughput (ops/s) vs. # of DDS bricks on log-log axes; reads and writes scale near-linearly out to 128 bricks, with data points (128, 61432) and (128, 13582)]
“Performance Available” Storage
[Figure: static parallel aggregation, with disk→aggregator pairs fixed, vs. adaptive parallel aggregation, with the pairs fed through a distributed queue]
[Figure: two plots of % of peak I/O rate (0–100%), one vs. number of nodes (0–15) and one vs. number of nodes perturbed (0–15), comparing adaptive vs. static aggregation]
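The message of the perturbation plots can be reproduced with a toy model: under static partitioning the run finishes when the slowest node finishes, while with a distributed queue a perturbed node simply takes fewer work units. The node speeds and unit counts below are invented.

```python
def static_time(units, speeds):
    """Static partition: equal shares; finish when the slowest is done."""
    share = units / len(speeds)
    return max(share / s for s in speeds)

def adaptive_time(units, speeds):
    """Distributed queue: each free node grabs the next unit."""
    clock = [0.0] * len(speeds)
    q = units
    while q > 0:
        i = min(range(len(speeds)), key=lambda j: clock[j])
        clock[i] += 1.0 / speeds[i]   # node i processes one unit
        q -= 1
    return max(clock)
```

With one node running at a quarter speed, the static scheme is pinned to the straggler while the adaptive scheme degrades only slightly.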
Application Software
• very little movement towards harnessing architectural potential
• application as service
  – process stream of requests (not shell or batch)
– grow & shrink on demand
– replication for availability
» data and functionality
– tremendous internal bandwidth
• outer-level optimizations, not algorithmic
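The “grow & shrink on demand” bullet can be sketched as a resize rule driven by request backlog; the thresholds and worker bounds here are arbitrary assumptions.

```python
class Service:
    """Toy request-stream service that resizes its worker pool."""
    def __init__(self, min_workers=1, max_workers=8):
        self.min_w, self.max_w = min_workers, max_workers
        self.workers = min_workers

    def resize(self, queue_depth):
        """Grow when requests back up, shrink when nearly idle."""
        if queue_depth > 4 * self.workers and self.workers < self.max_w:
            self.workers += 1
        elif queue_depth < self.workers and self.workers > self.min_w:
            self.workers -= 1
        return self.workers
```

The same hook is where replication for availability would attach: a replacement worker is just one more grow step on a different node.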
Time is NOW
• finish the system area network
• tackle the cluster I/O problem
• come together around management tools
• get serious about application services