distributed systems -...

29
CSCE 4600/5640 Distributed Systems D S F Dr. Song Fu Computer Science & Engineering University of North Texas University of North Texas December 3, 2012

Upload: others

Post on 09-Feb-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

CSCE 4600/5640

Distributed Systems

D S FDr. Song FuComputer Science & Engineering

University of North TexasUniversity of North Texas

December 3, 2012

Page 2: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Titan - Cray XK7Titan Cray XK7

• # 1 supercomputer (top500.org)p p ( p g)• 27.1 petaflops, 18,688 nodes• Hybrid – GPU accelerator (NVIDIA K20x)y ( )• Processors: 299k Opteron 6274 (2.2 GHz) cores,

18.7k NVIDIA GPUs – 560k cores• Memory: 710 terabytes• OS: Linux• Interconnect: Gemini• Power: 8.21 megawatts

12/3/2012 CSCE4600/5640: Distributed Systems 2

Page 3: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Clusters & Grids

Edinburgh

Cambridge

Newcastle

Oxford

Glasgow

ManchesterBelfast

DL For high-performance scientific & engineering computing

Tier0/1 facility

Tier2 facility

10 Gbps link

Tier3 facility

Cardiff

Soton

LondonRAL Hinxton computing

Largest distributed system– 250+ teraflops computation

2.5 Gbps link

622 Mbps link

Other linkBeowulf Clusters – 30+ petabyte storage– 40Gb/s networking

Page 4: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Compute Cloudsp Characteristics

– Service-oriented computing– Transparent to users– Easy to use

36 data centers Google maintains over

y– Full control by owners

450,000 servers power required range

upwards of 20 megawattsBusinesses, from startups to

Cloudupwards of 20 megawatts, (cost in the order of US$2 million per month in electricity charges)

4+ billion phones by 2010 [Source: Nokia]

Web 2.0-enabled PCs, TVs, t

startups to enterprises

electricity charges)etc.

12/3/2012 CSCE4600/5640: Distributed Systems 4

Page 5: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Wireless Sensor Networks

12/3/2012 CSCE4600/5640: Distributed Systems 5

Page 6: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Networking and Parallel ComputingNetworking and Parallel Computing

• Computer networking– Hardware that connects computers– Software that sends/receives messages from one

t t th hi h i ht b diff tcomputer to another, which might be on different networks (end to end delivery)

– Goal is to transmit messages reliably and efficientlyg y y

• Parallel Computing– Multiple homogeneous processors in “one” computerp g p p– Shared or distributed memory– Goal is to execute a program faster by division of labor

12/3/2012 CSCE4600/5640: Distributed Systems 6

Page 7: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Distributed ComputingDistributed Computing• Networked computers that could be far apart

– rely on computer networking

• Communicate and coordinate by sending messages

• Goal is to share (access/provide) distributedhi hi h li biliresources; to achieve high reliability

• Issues:– Concurrent execution of processes– No global clock for coordination

More components more independent failures12/3/2012 CSCE4600/5640: Distributed Systems 7

– More components, more independent failures

Page 8: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

A Distributed Systemy

12/3/2012 CSCE4600/5640: Distributed Systems 8

Page 9: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Examples of Distributed SystemsExamples of Distributed Systems• Berkeley Open Infrastructure for Network Computing

(BOINC)(BOINC) • SETI@home: Search for Extra-Terrestrial Intelligence. • Mobile Computing -- computers movep g p• Ubiquitous Computing -- computers embedded

everywhere• Issues:• Issues:

– discovery of resources in different host environments– dynamic reconfiguration– limited connectivity– privacy and security guarantees to the user and the host

environment

12/3/2012 CSCE4600/5640: Distributed Systems 9

Page 10: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Distributed Operating SystemsDistributed Operating Systems• Users not aware of multiplicity of machines

– Access to remote resources similar to access to local resources

D Mi i d b f i• Data Migration – move data by transferring entire file, or transferring only those portions of the file necessary for the immediate taskthe file necessary for the immediate task

• Computation Migration – transfer the computation rather than the data across thecomputation, rather than the data, across the system

12/3/2012 CSCE4600/5640: Distributed Systems 10

Page 11: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Design IssuesDesign Issues• Clusters – a collection of semi-autonomous machines that

i lacts as a single system

• Transparency – the distributed system should appear as a conventional, centralized system to the user

• Fault tolerance – the distributed system should continue yto function in the face of failure

• Scalability – as demands increase the system shouldScalability as demands increase, the system should easily accept the addition of new resources to accommodate the increased demand

12/3/2012 CSCE4600/5640: Distributed Systems 11

Page 12: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Data PartitionData PartitionSIMD (Single Instruction, Multiple Data)

Serial Form

while(…){…} Distribute Data

Serial Formof code

Serial Form Serial Form Serial Form

12/3/2012 12Execute all data streams simultaneously

Page 13: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Functional PartitionFunctional PartitionMultiple Instruction Multiple Data (MIMD)

Serial Formf dof code

The Parts

A Partition

12/3/2012 CSCE4600/5640: Distributed Systems 13

Page 14: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Remote Procedure Call

int main(…) {…func(a1, a2, …, an);…

int main(…) {…func(a1, a2, …, an);……

}

void func(p1, p2, …, pn) {…

…}

void func(p1, p2, …, pn) {}

p p p…

}

Conventional Procedure Call Remote Procedure Call

12/3/2012 CSCE4600/5640: Distributed Systems 14

Page 15: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Distributed Object InterfacejProcess Process

RL lObject Interface

RemoteObject Interface

LocalObject Interface

e g Corba DCOM

Performance

e.g. Corba, DCOM,SOAP, …

Remote ObjectClientLocal Objects Remote Object

ClientLocal Objects

Remote ObjectServer

Remote ObjectServer

(a) Single Interface to Objects (b) Interface to Local and Remote Objects15

Page 16: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Migration and Load Balancingg gpp

pp

Machine A

p

pp

p

p

p

Machine A

p

p

ppp

pMachine A

Machine B

pp

p

p Machine A

Machine Bp

pp

p

p

Machine C

pp

Machine C

pp

(a) Before Balancing (b) After Balancing

CSCE4600/5640: Distributed Systems 16

p A thread or process

Page 17: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

SynchronizationSynchronization• Distributed synchronization

– No shared memory no semaphores– New approaches use logical clocks & event

ordering• Transactions

– Multiple operations with a commit or abort– Became a mature technology in DBMSgy

• Concurrency controlTwo phase locking

12/3/2012 CSCE4600/5640: Distributed Systems 17

– Two-phase locking

Page 18: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Failure DetectionFailure Detection• Detecting hardware failure is difficult• To detect a link failure, a handshaking protocol can be

used• Assume Site A and Site B have established a link• Assume Site A and Site B have established a link

– At fixed intervals, each site will exchange an I-am-upmessage indicating that they are up and running

• If Site A does not receive a message within the fixed interval, it assumes either (a) the other site is not up or (b) the message was lostthe message was lost

• Site A can now send an Are-you-up? message to Site B• If Site A does not receive a reply, it can repeat the

12/3/2012 CSCE4600/5640: Distributed Systems 18

p y, pmessage or try an alternate route to Site B

Page 19: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Failure Detection (cont.)Failure Detection (cont.)• If Site A does not ultimately receive a reply from Site B, it

l d f f il h dconcludes some type of failure has occurred

• Types of failures:Types of failures:- Site B is down- The direct link between A and B is down- The alternate link from A to B is down- The message has been lost

• However, Site A cannot determine exactly why the failure has occurred

12/3/2012 CSCE4600/5640: Distributed Systems 19

Page 20: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

ReconfigurationReconfiguration• When Site A determines a failure has occurred, it must

fi hreconfigure the system:

1. If the link from A to B has failed, this must be broadcast1. If the link from A to B has failed, this must be broadcast to every site in the system

2. If a site has failed, every other site must also be notified indicating that the services offered by the failed site are no longer availablelonger available

• When the link or the site becomes available again, this

12/3/2012 CSCE4600/5640: Distributed Systems 20

information must again be broadcast to all other sites

Page 21: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Cloud ComputingCloud ComputingCloud computing is a model for enabling convenient, on-

d d k h d l f fi bldemand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned pp ) p y pand released with minimal management effort or service provider interaction. (from NIST)

12/3/2012 21

Page 22: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Essential Cloud CharacteristicsEssential Cloud Characteristics

• On-demand self-serviceOn demand self service

• Ubiquitous network access

• Location independent resource pooling

• Rapid elasticity• Rapid elasticity

• Measured services

12/3/2012 CSCE4600/5640: Distributed Systems 22

Page 23: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Three Delivery ModelsThree Delivery Models• Cloud Software as a Service (SaaS)

– Use provider’s applications over a network • Cloud Platform as a Service (PaaS)

– Deploy customer-created applications to a cloud • Cloud Infrastructure as a Service (IaaS)

– Rent processing, storage, network capacity, and other fundamental computing resources

T b id d “ l d” h b d l d• To be considered “cloud” they must be deployed on top of cloud infrastructure that has the key characteristics

12/3/2012 CSCE4600/5640: Distributed Systems 23

Page 24: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Software as a Service (SaaS)Enterprise Software Revolution

Software as a Service (SaaS)

• Most of SaaS applications are implemented based on available modules in theavailable modules in the way that new business logics are realized. For example: salesforce.com

12/3/2012 CSCE4600/5640: Distributed Systems 24

Page 25: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Platform as a Service (PaaS)Platform as a Service (PaaS)• Platform providing all the facilities necessary to

support the complete process of building and delivering web applications

12/3/2012 CSCE4600/5640: Distributed Systems 25

Page 26: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Infrastructure as a Service (IaaS)Infrastructure as a Service (IaaS)• Infrastructure as a Service

is sometimes referred to asis sometimes referred to as Hardware as a Service (HaaS).

• The service provider owns the equipment and is q presponsible for housing, running and maintaining it

• The client typically pays on a per-use basis

12/3/2012 CSCE4600/5640: Distributed Systems 26

Page 27: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Four Cloud Deployment ModelsFour Cloud Deployment Models• Private cloud

– enterprise owned or leased

• Public cloud– Sold to the public, mega-scale infrastructure

• Community cloud– shared infrastructure for specific community

• Hybrid cloud– composition of two or more clouds

12/3/2012 CSCE4600/5640: Distributed Systems 27

Page 28: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

Foundational Elements of CloudsFoundational Elements of Clouds• Primary technologies

– Virtualization– Service Oriented Architectures– Distributed Computing– Broadband Networks

Browser as a platform– Browser as a platform– Free and Open Source Software

• Other technologies• Other technologies– Autonomic systems– Web Web

12/3/2012 CSCE4600/5640: Distributed Systems 28

Page 29: Distributed Systems - BIUu.cs.biu.ac.il/~ariel/download/ds590/resources/misc/DistributedSystems.pdf · • Fault tolerance – the distributed system should continue to function in

The NIST Cloud Definition FrameworkThe NIST Cloud Definition FrameworkHybrid Clouds

DeploymentCommunityCommunity

CloudCloudPrivate Private CloudCloud

Public CloudPublic CloudDeploymentModels

Service Software as a Platform as a Infrastructure as aServiceModels

E ti l

Software as a Service (SaaS)

Platform as a Service (PaaS)

Infrastructure as a Service (IaaS)

On Demand Self-ServiceEssentialCharacteristics

Resource Pooling

Broad Network Access Rapid Elasticity

Measured Service

Common Characteristics Vi t li ti S i O i t ti

Homogeneity

Massive Scale Resilient Computing

Geographic DistributionCharacteristics

Low Cost Software

Virtualization Service Orientation

Advanced Security