distributed systems -...

CSCE 4600/5640

Distributed Systems

D S FDr. Song FuComputer Science & Engineering

University of North TexasUniversity of North Texas

December 3, 2012

Titan - Cray XK7Titan Cray XK7

• # 1 supercomputer (top500.org)p p ( p g)• 27.1 petaflops, 18,688 nodes• Hybrid – GPU accelerator (NVIDIA K20x)y ( )• Processors: 299k Opteron 6274 (2.2 GHz) cores,

18.7k NVIDIA GPUs – 560k cores• Memory: 710 terabytes• OS: Linux• Interconnect: Gemini• Power: 8.21 megawatts

12/3/2012 CSCE4600/5640: Distributed Systems 2

Clusters & Grids

Edinburgh

Cambridge

Newcastle

Oxford

Glasgow

ManchesterBelfast

DL For high-performance scientific & engineering computing

Tier0/1 facility

Tier2 facility

10 Gbps link

Tier3 facility

Cardiff

Soton

LondonRAL Hinxton computing

Largest distributed system– 250+ teraflops computation

2.5 Gbps link

622 Mbps link

Other linkBeowulf Clusters – 30+ petabyte storage– 40Gb/s networking

Compute Cloudsp Characteristics

– Service-oriented computing– Transparent to users– Easy to use

36 data centers Google maintains over

y– Full control by owners

450,000 servers power required range

upwards of 20 megawattsBusinesses, from startups to

Cloudupwards of 20 megawatts, (cost in the order of US$2 million per month in electricity charges)

4+ billion phones by 2010 [Source: Nokia]

Web 2.0-enabled PCs, TVs, t

startups to enterprises

electricity charges)etc.


Wireless Sensor Networks


Networking and Parallel ComputingNetworking and Parallel Computing

• Computer networking– Hardware that connects computers– Software that sends/receives messages from one

t t th hi h i ht b diff tcomputer to another, which might be on different networks (end to end delivery)

– Goal is to transmit messages reliably and efficientlyg y y

• Parallel Computing– Multiple homogeneous processors in “one” computerp g p p– Shared or distributed memory– Goal is to execute a program faster by division of labor


Distributed ComputingDistributed Computing• Networked computers that could be far apart

– rely on computer networking

• Communicate and coordinate by sending messages

• Goal is to share (access/provide) distributedhi hi h li biliresources; to achieve high reliability

• Issues:– Concurrent execution of processes– No global clock for coordination

More components more independent failures12/3/2012 CSCE4600/5640: Distributed Systems 7

– More components, more independent failures

A Distributed Systemy


Examples of Distributed SystemsExamples of Distributed Systems• Berkeley Open Infrastructure for Network Computing

(BOINC)(BOINC) • SETI@home: Search for Extra-Terrestrial Intelligence. • Mobile Computing -- computers movep g p• Ubiquitous Computing -- computers embedded

everywhere• Issues:• Issues:

– discovery of resources in different host environments– dynamic reconfiguration– limited connectivity– privacy and security guarantees to the user and the host

environment


Distributed Operating SystemsDistributed Operating Systems• Users not aware of multiplicity of machines

– Access to remote resources similar to access to local resources

D Mi i d b f i• Data Migration – move data by transferring entire file, or transferring only those portions of the file necessary for the immediate taskthe file necessary for the immediate task

• Computation Migration – transfer the computation rather than the data across thecomputation, rather than the data, across the system


Design IssuesDesign Issues• Clusters – a collection of semi-autonomous machines that

i lacts as a single system

• Transparency – the distributed system should appear as a conventional, centralized system to the user

• Fault tolerance – the distributed system should continue yto function in the face of failure

• Scalability – as demands increase the system shouldScalability as demands increase, the system should easily accept the addition of new resources to accommodate the increased demand


Data PartitionData PartitionSIMD (Single Instruction, Multiple Data)

Serial Form

while(…){…} Distribute Data

Serial Formof code

Serial Form Serial Form Serial Form

12/3/2012 12Execute all data streams simultaneously

Functional PartitionFunctional PartitionMultiple Instruction Multiple Data (MIMD)

Serial Formf dof code

The Parts

A Partition


Remote Procedure Call

int main(…) {…func(a1, a2, …, an);…

int main(…) {…func(a1, a2, …, an);……

}

void func(p1, p2, …, pn) {…

…}

void func(p1, p2, …, pn) {}

p p p…

}

Conventional Procedure Call Remote Procedure Call


Distributed Object InterfacejProcess Process

RL lObject Interface

RemoteObject Interface

LocalObject Interface

e g Corba DCOM

Performance

e.g. Corba, DCOM,SOAP, …

Remote ObjectClientLocal Objects Remote Object

ClientLocal Objects

Remote ObjectServer

Remote ObjectServer

(a) Single Interface to Objects (b) Interface to Local and Remote Objects15

Migration and Load Balancingg gpp

pp

Machine A

p

pp

p

p

p

Machine A

p

p

ppp

pMachine A

Machine B

pp

p

p Machine A

Machine Bp

pp

p

p

Machine C

pp

Machine C

pp

(a) Before Balancing (b) After Balancing

CSCE4600/5640: Distributed Systems 16

p A thread or process

SynchronizationSynchronization• Distributed synchronization

– No shared memory no semaphores– New approaches use logical clocks & event

ordering• Transactions

– Multiple operations with a commit or abort– Became a mature technology in DBMSgy

• Concurrency controlTwo phase locking


– Two-phase locking

Failure DetectionFailure Detection• Detecting hardware failure is difficult• To detect a link failure, a handshaking protocol can be

used• Assume Site A and Site B have established a link• Assume Site A and Site B have established a link

– At fixed intervals, each site will exchange an I-am-upmessage indicating that they are up and running

• If Site A does not receive a message within the fixed interval, it assumes either (a) the other site is not up or (b) the message was lostthe message was lost

• Site A can now send an Are-you-up? message to Site B• If Site A does not receive a reply, it can repeat the


p y, pmessage or try an alternate route to Site B

Failure Detection (cont.)Failure Detection (cont.)• If Site A does not ultimately receive a reply from Site B, it

l d f f il h dconcludes some type of failure has occurred

• Types of failures:Types of failures:- Site B is down- The direct link between A and B is down- The alternate link from A to B is down- The message has been lost

• However, Site A cannot determine exactly why the failure has occurred


ReconfigurationReconfiguration• When Site A determines a failure has occurred, it must

fi hreconfigure the system:

1. If the link from A to B has failed, this must be broadcast1. If the link from A to B has failed, this must be broadcast to every site in the system

2. If a site has failed, every other site must also be notified indicating that the services offered by the failed site are no longer availablelonger available

• When the link or the site becomes available again, this


information must again be broadcast to all other sites

Cloud ComputingCloud ComputingCloud computing is a model for enabling convenient, on-

d d k h d l f fi bldemand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned pp ) p y pand released with minimal management effort or service provider interaction. (from NIST)

12/3/2012 21

Essential Cloud CharacteristicsEssential Cloud Characteristics

• On-demand self-serviceOn demand self service

• Ubiquitous network access

• Location independent resource pooling

• Rapid elasticity• Rapid elasticity

• Measured services


Three Delivery ModelsThree Delivery Models• Cloud Software as a Service (SaaS)

– Use provider’s applications over a network • Cloud Platform as a Service (PaaS)

– Deploy customer-created applications to a cloud • Cloud Infrastructure as a Service (IaaS)

– Rent processing, storage, network capacity, and other fundamental computing resources

T b id d “ l d” h b d l d• To be considered “cloud” they must be deployed on top of cloud infrastructure that has the key characteristics


Software as a Service (SaaS)Enterprise Software Revolution

Software as a Service (SaaS)

• Most of SaaS applications are implemented based on available modules in theavailable modules in the way that new business logics are realized. For example: salesforce.com


Platform as a Service (PaaS)Platform as a Service (PaaS)• Platform providing all the facilities necessary to

support the complete process of building and delivering web applications


Infrastructure as a Service (IaaS)Infrastructure as a Service (IaaS)• Infrastructure as a Service

is sometimes referred to asis sometimes referred to as Hardware as a Service (HaaS).

• The service provider owns the equipment and is q presponsible for housing, running and maintaining it

• The client typically pays on a per-use basis


Four Cloud Deployment ModelsFour Cloud Deployment Models• Private cloud

– enterprise owned or leased

• Public cloud– Sold to the public, mega-scale infrastructure

• Community cloud– shared infrastructure for specific community

• Hybrid cloud– composition of two or more clouds


Foundational Elements of CloudsFoundational Elements of Clouds• Primary technologies

– Virtualization– Service Oriented Architectures– Distributed Computing– Broadband Networks

Browser as a platform– Browser as a platform– Free and Open Source Software

• Other technologies• Other technologies– Autonomic systems– Web Web


The NIST Cloud Definition FrameworkThe NIST Cloud Definition FrameworkHybrid Clouds

DeploymentCommunityCommunity

CloudCloudPrivate Private CloudCloud

Public CloudPublic CloudDeploymentModels

Service Software as a Platform as a Infrastructure as aServiceModels

E ti l

Software as a Service (SaaS)

Platform as a Service (PaaS)

Infrastructure as a Service (IaaS)

On Demand Self-ServiceEssentialCharacteristics

Resource Pooling

Broad Network Access Rapid Elasticity

Measured Service

Common Characteristics Vi t li ti S i O i t ti

Homogeneity

Massive Scale Resilient Computing

Geographic DistributionCharacteristics

Low Cost Software

Virtualization Service Orientation

Advanced Security

distributed systems -...

Documents