distributed systems -...
TRANSCRIPT
CSCE 4600/5640
Distributed Systems
D S FDr. Song FuComputer Science & Engineering
University of North TexasUniversity of North Texas
December 3, 2012
Titan - Cray XK7Titan Cray XK7
• # 1 supercomputer (top500.org)p p ( p g)• 27.1 petaflops, 18,688 nodes• Hybrid – GPU accelerator (NVIDIA K20x)y ( )• Processors: 299k Opteron 6274 (2.2 GHz) cores,
18.7k NVIDIA GPUs – 560k cores• Memory: 710 terabytes• OS: Linux• Interconnect: Gemini• Power: 8.21 megawatts
12/3/2012 CSCE4600/5640: Distributed Systems 2
Clusters & Grids
Edinburgh
Cambridge
Newcastle
Oxford
Glasgow
ManchesterBelfast
DL For high-performance scientific & engineering computing
Tier0/1 facility
Tier2 facility
10 Gbps link
Tier3 facility
Cardiff
Soton
LondonRAL Hinxton computing
Largest distributed system– 250+ teraflops computation
2.5 Gbps link
622 Mbps link
Other linkBeowulf Clusters – 30+ petabyte storage– 40Gb/s networking
Compute Cloudsp Characteristics
– Service-oriented computing– Transparent to users– Easy to use
36 data centers Google maintains over
y– Full control by owners
450,000 servers power required range
upwards of 20 megawattsBusinesses, from startups to
Cloudupwards of 20 megawatts, (cost in the order of US$2 million per month in electricity charges)
4+ billion phones by 2010 [Source: Nokia]
Web 2.0-enabled PCs, TVs, t
startups to enterprises
electricity charges)etc.
12/3/2012 CSCE4600/5640: Distributed Systems 4
Wireless Sensor Networks
12/3/2012 CSCE4600/5640: Distributed Systems 5
Networking and Parallel ComputingNetworking and Parallel Computing
• Computer networking– Hardware that connects computers– Software that sends/receives messages from one
t t th hi h i ht b diff tcomputer to another, which might be on different networks (end to end delivery)
– Goal is to transmit messages reliably and efficientlyg y y
• Parallel Computing– Multiple homogeneous processors in “one” computerp g p p– Shared or distributed memory– Goal is to execute a program faster by division of labor
12/3/2012 CSCE4600/5640: Distributed Systems 6
Distributed ComputingDistributed Computing• Networked computers that could be far apart
– rely on computer networking
• Communicate and coordinate by sending messages
• Goal is to share (access/provide) distributedhi hi h li biliresources; to achieve high reliability
• Issues:– Concurrent execution of processes– No global clock for coordination
More components more independent failures12/3/2012 CSCE4600/5640: Distributed Systems 7
– More components, more independent failures
A Distributed Systemy
12/3/2012 CSCE4600/5640: Distributed Systems 8
Examples of Distributed SystemsExamples of Distributed Systems• Berkeley Open Infrastructure for Network Computing
(BOINC)(BOINC) • SETI@home: Search for Extra-Terrestrial Intelligence. • Mobile Computing -- computers movep g p• Ubiquitous Computing -- computers embedded
everywhere• Issues:• Issues:
– discovery of resources in different host environments– dynamic reconfiguration– limited connectivity– privacy and security guarantees to the user and the host
environment
12/3/2012 CSCE4600/5640: Distributed Systems 9
Distributed Operating SystemsDistributed Operating Systems• Users not aware of multiplicity of machines
– Access to remote resources similar to access to local resources
D Mi i d b f i• Data Migration – move data by transferring entire file, or transferring only those portions of the file necessary for the immediate taskthe file necessary for the immediate task
• Computation Migration – transfer the computation rather than the data across thecomputation, rather than the data, across the system
12/3/2012 CSCE4600/5640: Distributed Systems 10
Design IssuesDesign Issues• Clusters – a collection of semi-autonomous machines that
i lacts as a single system
• Transparency – the distributed system should appear as a conventional, centralized system to the user
• Fault tolerance – the distributed system should continue yto function in the face of failure
• Scalability – as demands increase the system shouldScalability as demands increase, the system should easily accept the addition of new resources to accommodate the increased demand
12/3/2012 CSCE4600/5640: Distributed Systems 11
Data PartitionData PartitionSIMD (Single Instruction, Multiple Data)
Serial Form
while(…){…} Distribute Data
Serial Formof code
Serial Form Serial Form Serial Form
12/3/2012 12Execute all data streams simultaneously
Functional PartitionFunctional PartitionMultiple Instruction Multiple Data (MIMD)
Serial Formf dof code
The Parts
A Partition
12/3/2012 CSCE4600/5640: Distributed Systems 13
Remote Procedure Call
int main(…) {…func(a1, a2, …, an);…
int main(…) {…func(a1, a2, …, an);……
}
void func(p1, p2, …, pn) {…
…}
void func(p1, p2, …, pn) {}
p p p…
}
Conventional Procedure Call Remote Procedure Call
12/3/2012 CSCE4600/5640: Distributed Systems 14
Distributed Object InterfacejProcess Process
RL lObject Interface
RemoteObject Interface
LocalObject Interface
e g Corba DCOM
Performance
e.g. Corba, DCOM,SOAP, …
Remote ObjectClientLocal Objects Remote Object
ClientLocal Objects
Remote ObjectServer
Remote ObjectServer
(a) Single Interface to Objects (b) Interface to Local and Remote Objects15
Migration and Load Balancingg gpp
pp
Machine A
p
pp
p
p
p
Machine A
p
p
ppp
pMachine A
Machine B
pp
p
p Machine A
Machine Bp
pp
p
p
Machine C
pp
Machine C
pp
(a) Before Balancing (b) After Balancing
CSCE4600/5640: Distributed Systems 16
p A thread or process
SynchronizationSynchronization• Distributed synchronization
– No shared memory no semaphores– New approaches use logical clocks & event
ordering• Transactions
– Multiple operations with a commit or abort– Became a mature technology in DBMSgy
• Concurrency controlTwo phase locking
12/3/2012 CSCE4600/5640: Distributed Systems 17
– Two-phase locking
Failure DetectionFailure Detection• Detecting hardware failure is difficult• To detect a link failure, a handshaking protocol can be
used• Assume Site A and Site B have established a link• Assume Site A and Site B have established a link
– At fixed intervals, each site will exchange an I-am-upmessage indicating that they are up and running
• If Site A does not receive a message within the fixed interval, it assumes either (a) the other site is not up or (b) the message was lostthe message was lost
• Site A can now send an Are-you-up? message to Site B• If Site A does not receive a reply, it can repeat the
12/3/2012 CSCE4600/5640: Distributed Systems 18
p y, pmessage or try an alternate route to Site B
Failure Detection (cont.)Failure Detection (cont.)• If Site A does not ultimately receive a reply from Site B, it
l d f f il h dconcludes some type of failure has occurred
• Types of failures:Types of failures:- Site B is down- The direct link between A and B is down- The alternate link from A to B is down- The message has been lost
• However, Site A cannot determine exactly why the failure has occurred
12/3/2012 CSCE4600/5640: Distributed Systems 19
ReconfigurationReconfiguration• When Site A determines a failure has occurred, it must
fi hreconfigure the system:
1. If the link from A to B has failed, this must be broadcast1. If the link from A to B has failed, this must be broadcast to every site in the system
2. If a site has failed, every other site must also be notified indicating that the services offered by the failed site are no longer availablelonger available
• When the link or the site becomes available again, this
12/3/2012 CSCE4600/5640: Distributed Systems 20
information must again be broadcast to all other sites
Cloud ComputingCloud ComputingCloud computing is a model for enabling convenient, on-
d d k h d l f fi bldemand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned pp ) p y pand released with minimal management effort or service provider interaction. (from NIST)
12/3/2012 21
Essential Cloud CharacteristicsEssential Cloud Characteristics
• On-demand self-serviceOn demand self service
• Ubiquitous network access
• Location independent resource pooling
• Rapid elasticity• Rapid elasticity
• Measured services
12/3/2012 CSCE4600/5640: Distributed Systems 22
Three Delivery ModelsThree Delivery Models• Cloud Software as a Service (SaaS)
– Use provider’s applications over a network • Cloud Platform as a Service (PaaS)
– Deploy customer-created applications to a cloud • Cloud Infrastructure as a Service (IaaS)
– Rent processing, storage, network capacity, and other fundamental computing resources
T b id d “ l d” h b d l d• To be considered “cloud” they must be deployed on top of cloud infrastructure that has the key characteristics
12/3/2012 CSCE4600/5640: Distributed Systems 23
Software as a Service (SaaS)Enterprise Software Revolution
Software as a Service (SaaS)
• Most of SaaS applications are implemented based on available modules in theavailable modules in the way that new business logics are realized. For example: salesforce.com
12/3/2012 CSCE4600/5640: Distributed Systems 24
Platform as a Service (PaaS)Platform as a Service (PaaS)• Platform providing all the facilities necessary to
support the complete process of building and delivering web applications
12/3/2012 CSCE4600/5640: Distributed Systems 25
Infrastructure as a Service (IaaS)Infrastructure as a Service (IaaS)• Infrastructure as a Service
is sometimes referred to asis sometimes referred to as Hardware as a Service (HaaS).
• The service provider owns the equipment and is q presponsible for housing, running and maintaining it
• The client typically pays on a per-use basis
12/3/2012 CSCE4600/5640: Distributed Systems 26
Four Cloud Deployment ModelsFour Cloud Deployment Models• Private cloud
– enterprise owned or leased
• Public cloud– Sold to the public, mega-scale infrastructure
• Community cloud– shared infrastructure for specific community
• Hybrid cloud– composition of two or more clouds
12/3/2012 CSCE4600/5640: Distributed Systems 27
Foundational Elements of CloudsFoundational Elements of Clouds• Primary technologies
– Virtualization– Service Oriented Architectures– Distributed Computing– Broadband Networks
Browser as a platform– Browser as a platform– Free and Open Source Software
• Other technologies• Other technologies– Autonomic systems– Web Web
12/3/2012 CSCE4600/5640: Distributed Systems 28
The NIST Cloud Definition FrameworkThe NIST Cloud Definition FrameworkHybrid Clouds
DeploymentCommunityCommunity
CloudCloudPrivate Private CloudCloud
Public CloudPublic CloudDeploymentModels
Service Software as a Platform as a Infrastructure as aServiceModels
E ti l
Software as a Service (SaaS)
Platform as a Service (PaaS)
Infrastructure as a Service (IaaS)
On Demand Self-ServiceEssentialCharacteristics
Resource Pooling
Broad Network Access Rapid Elasticity
Measured Service
Common Characteristics Vi t li ti S i O i t ti
Homogeneity
Massive Scale Resilient Computing
Geographic DistributionCharacteristics
Low Cost Software
Virtualization Service Orientation
Advanced Security