Replication, Concluded and New Trends Part I Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems April 24, 2008



2

Reminders

Project demos Friday, May 9
  Demo must be on the mod cluster, on at least 4 machines
  Email to schedule early next week
  Please aim to be “code complete” by May 1 or so: integration testing is important!

Project report and final code due Monday, May 12
  Includes experiments showing scalability of querying, crawling, or indexing

Final exam May 12, 6-8 PM, here in Towne 315

3

Recall: Central Issues in Replication

What to replicate
Where to replicate
How to maintain consistency (and how fresh data needs to be)
How to route requests to replicas

4

Choosing Which Replica to Use

Round-robin
  Simply allocate each request to the next server, incrementing next at each point
  Does this make sense over the entire Internet?

Load balancing
  Requires some sort of load-monitoring code at each server (e.g., how many threads running, queues available, etc.)
  Server feeds this information into the coordinator
  What are the dangers of doing this?

Topologically aware
  Tries to allocate requests to the “nearest” server, according to network performance
  This is what content distribution networks like Akamai aim for
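The three policies can be sketched in a few lines of Python. This is only an illustration, not how any production coordinator works: the `Replica` class, its `load` and `rtt_ms` fields, and the server names are all hypothetical.

```python
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Replica:
    name: str      # hypothetical server name
    load: int      # e.g., running threads, as reported by the server
    rtt_ms: float  # hypothetical measured round-trip time to the client

def round_robin(replicas):
    """Return a function that hands out replicas in strict rotation."""
    rotation = cycle(replicas)
    return lambda: next(rotation)

def least_loaded(replicas):
    """Load balancing: pick the replica reporting the lowest load."""
    return min(replicas, key=lambda r: r.load)

def nearest(replicas):
    """Topologically aware: pick the replica with the lowest measured RTT."""
    return min(replicas, key=lambda r: r.rtt_ms)

replicas = [
    Replica("east", load=12, rtt_ms=80.0),
    Replica("west", load=3, rtt_ms=150.0),
    Replica("local", load=9, rtt_ms=5.0),
]
next_server = round_robin(replicas)
rr_order = [next_server().name for _ in range(4)]
```

Note that the three policies can disagree on the same data: here round-robin starts with "east", load balancing picks "west", and the topology-aware choice is "local".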

5

Physically Mapping the Requests

Internal routing (transparent to client)
  One machine masquerades as the server
  It forwards requests to a particular machine
  This is commonly done by search engines, CNN, etc.

Domain Name Server tricks
  When a client looks up the server, DNS gives it a different IP address than other machines get, so it transparently talks to a different machine
  Typically, content distribution networks do this

Let’s look at Akamai as an example…

6

Akamai: A Real-World Content Distribution Network

Goal is to produce replicas of data from “big” web sites
  e.g., images from CNN; QuickTime movie trailers from Apple
  Similar services from Cisco, Digital Island, Exodus, etc.

Basic model:
  Providers pay Akamai to distribute their content
  Akamai asks ISPs if they can install boxes in their networks
    18,000 servers in 1,000 networks in 69 countries
  HTTP responses include a “base” container, plus references to embedded content from “nearby” Akamai boxes
    Idea: HTML changes frequently; images, videos do not

Newer service, EdgeComputing, also creates JSP pages in the end nodes

7

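Bird’s Eye View of Akamai Supplying QuickTime Content

[Figure: a client, the QT server, Akamai DNS, and a replica at 123.45.67.89 inside the Akamai infrastructure]
  1: Client to QT server: GET /index.htm
  2: QT server to client: OK index.htm, with href = qt.akamai.tech.net
  3: Client to Akamai DNS: QUERY qt.akamai.tech.net
  4: Akamai DNS to client: IP 123.45.67.89 (“nearby” Akamai box)
  5: Client to replica 123.45.67.89: GET /my.qt
  6: Replica to client: OK my.qt
  Other labels in the figure: “Occasional updates” (twice), “Network topology information”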
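The six numbered steps in the figure can be walked through with a toy simulation. Everything here is a stand-in: the dictionaries, function names, client network names, and the second IP address are hypothetical, and only `qt.akamai.tech.net`, `/index.htm`, `/my.qt`, and `123.45.67.89` come from the slide.

```python
# Hypothetical table: which Akamai box the DNS server considers "nearby"
# for each client network. Only 123.45.67.89 appears on the slide.
NEARBY_REPLICA = {
    "penn-campus": "123.45.67.89",
    "west-coast-isp": "98.76.54.32",
}

ORIGIN_PAGES = {"/index.htm": '<a href="http://qt.akamai.tech.net/my.qt">trailer</a>'}
REPLICA_CONTENT = {"/my.qt": b"<movie bytes>"}

def origin_get(path):
    """Steps 1-2: the origin serves the HTML container, which points at Akamai."""
    return ORIGIN_PAGES[path]

def akamai_dns_query(hostname, client_network):
    """Steps 3-4: the same hostname resolves differently per client network."""
    if hostname != "qt.akamai.tech.net":
        raise KeyError(hostname)
    return NEARBY_REPLICA[client_network]

def replica_get(ip, path):
    """Steps 5-6: the chosen replica serves the embedded content."""
    return REPLICA_CONTENT[path]

html = origin_get("/index.htm")
ip_a = akamai_dns_query("qt.akamai.tech.net", "penn-campus")
ip_b = akamai_dns_query("qt.akamai.tech.net", "west-coast-isp")
movie = replica_get(ip_a, "/my.qt")
```

The point of the sketch is that the client never chooses a replica itself: it asks for one fixed hostname, and the answer it gets back depends on where it is asking from.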

8

What Akamai’s Secret Algorithms Do

Replica maintenance
  Not every data item needs to be replicated everywhere: determine what items should be placed at which replicas
  This is based on ideas from peer-to-peer that we’ll discuss later in the course

Estimating the closest replica
  This is incredibly hard: how do you figure out which replica is the fastest to access, unless there is a replica at every ISP?
  Studies have shown that Akamai and other CDNs aren’t perfect [Johnson et al.]
  But it’s still typically “good enough”

Today Akamai also tries to deliver some kinds of dynamic content, using database-like technology

9

Replication, Summarized

Replication is generally controlled by the content producer (the server)
  Considerations about where to place replicas for maximum effect, and how to maintain correct semantics of the application
  We’ve seen Akamai as a real example

Can also have something similar, but client-side-driven, potentially with support from intermediate points in the network…

10

Caching

Caching is, in essence, a lazy, request-driven version of replication

Generally needs to be fully transparent
  Implemented over standard HTTP
  Replication can often use proprietary protocols

However: servers want control over what gets cached
  Why is this important?

Caching can be done at either endpoint, or somewhere in between…
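One standard way a server exercises this control is the HTTP `Cache-Control` response header. The sketch below is deliberately simplified: the function name is made up, and it looks only at the `no-store`, `private`, and `max-age` directives, whereas real HTTP caching rules cover many more cases.

```python
def shared_cache_lifetime(cache_control):
    """Return seconds a shared cache may keep a response, or None if it should not.

    Simplified sketch: considers only no-store, private, and max-age.
    """
    directives = {}
    for part in cache_control.split(","):
        part = part.strip().lower()
        if not part:
            continue
        name, _, value = part.partition("=")
        directives[name] = value
    if "no-store" in directives or "private" in directives:
        return None
    if "max-age" in directives:
        try:
            return int(directives["max-age"])
        except ValueError:
            return None
    return None  # no explicit lifetime: this sketch declines to cache

lifetime = shared_cache_lifetime("public, max-age=3600")
```

A proxy using this check would store the response for an hour in the example, and would pass through anything marked `no-store` or `private` without keeping a copy.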

11

Why Caching Works on the Web

Web tends to have a Zipfian distribution of requests
  A straight line in log-log scale
  A few items are very frequent, many are somewhat frequent, a huge number are infrequent (“heavy tail”)

Graphs: www.useit.com/alertbox/zipf.html
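A quick calculation shows why this distribution favors caching. It assumes an idealized Zipf law with exponent exactly 1, where the item of popularity rank i is requested with probability proportional to 1/i; real traces only approximate this, and the item counts below are made up.

```python
def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n, the normalizing constant for Zipf with exponent 1."""
    return sum(1.0 / i for i in range(1, n + 1))

def top_k_request_share(k, n):
    """Fraction of requests that go to the k most popular of n items."""
    return harmonic(k) / harmonic(n)

# Hypothetical sizes: cache the 1,000 most popular of 1,000,000 documents.
share = top_k_request_share(1_000, 1_000_000)
```

Under these assumptions, holding just 0.1% of the documents would serve roughly half of all requests.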

12

Where Caching is Done on the Web

Typically, the browser does its own local caching
  Netscape, IE, Mozilla, etc. maintain a large (10s of MB) cache of documents and images

Some organizations do their own caching using a proxy server
  Large ISPs, AOL, many companies

At the server side, certain objects are frequently cached in memory to speed up responses

13

Proxy Servers

At an organizational level, may route all requests through a gateway
  Sometimes this is a firewall, other times not
  Sometimes it’s application-transparent, other times it must be specified

“Proxy server” is a middleman
  Takes requests from client, makes requests of server
  Reads response from server, forwards to client
  May perform shared caching of requests
  Very common for large ISPs, businesses, etc.

14

Cache Hits and Misses

Misses are due to several possible factors:
  First-time request: compulsory miss
  Uncacheable item (e.g., HTTPS, item marked as uncacheable)
  Item expired
  Insufficient space in cache: capacity miss
  (Structure of cache means item will be evicted: conflict miss)
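To make the categories concrete, here is a toy shared cache that labels each request with its outcome. It is a sketch under several assumptions: the class and its `fetch` callback are hypothetical, capacity is counted in items (at least 1) rather than bytes, eviction is simply oldest-first, HTTPS is the only uncacheable case, and conflict misses are not modeled.

```python
class TinyProxyCache:
    """Toy shared cache keyed by URL."""

    def __init__(self, capacity):
        self.capacity = capacity  # number of items; assumed >= 1
        self.store = {}           # url -> (body, expires_at); dicts keep insertion order
        self.seen = set()         # every cacheable URL ever requested

    def get(self, url, now, fetch):
        """Return (body, outcome); fetch(url) must return (body, ttl_seconds)."""
        if url.startswith("https://"):
            body, _ttl = fetch(url)
            return body, "miss: uncacheable"
        entry = self.store.get(url)
        if entry is not None and now < entry[1]:
            return entry[0], "hit"
        if url not in self.seen:
            outcome = "miss: compulsory"
        elif entry is not None:
            outcome = "miss: expired"
        else:
            outcome = "miss: capacity"  # seen before, but evicted for space
        self.seen.add(url)
        body, ttl = fetch(url)
        self.store.pop(url, None)               # drop any stale copy
        if len(self.store) >= self.capacity:
            del self.store[next(iter(self.store))]  # evict the oldest insertion
        self.store[url] = (body, now + ttl)
        return body, outcome

def fake_fetch(url):
    """Stand-in for the origin server: every response lives 10 seconds."""
    return f"body of {url}", 10

cache = TinyProxyCache(capacity=1)
outcomes = [
    cache.get("http://a", 0, fake_fetch)[1],
    cache.get("http://a", 5, fake_fetch)[1],
    cache.get("http://a", 20, fake_fetch)[1],
    cache.get("http://b", 21, fake_fetch)[1],
    cache.get("http://a", 22, fake_fetch)[1],
    cache.get("https://bank", 23, fake_fetch)[1],
]
```

With a one-item cache, the request sequence exercises each category in turn: a compulsory miss, a hit, an expired item, another compulsory miss that evicts the first URL, a capacity miss when that URL returns, and an uncacheable HTTPS request.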

15

Cache Replacement Considerations

Cost of fetching
Cost of storage
How often it’s been used
Probability it will be accessed again
When last modified
When likely to expire

16

Cache Item Replacement Algorithms

Algorithms look similar to those in other disciplines (OS virtual memory, microprocessor caching):
  Least Recently Used
  Least Frequently Used
  Others
    Size: remove largest item in cache
    Hybrid of LFU/LRU and size
    etc.

This problem is easier than in other disciplines. Why?
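As one concrete instance, Least Recently Used can be sketched with Python’s `collections.OrderedDict`. The class name and the URLs are illustrative, and capacity is counted in items here; a web cache would more likely count bytes.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts whichever entry was touched longest ago."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # least recently used entry comes first

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("/a.html", "A")
cache.put("/b.gif", "B")
cache.get("/a.html")       # /a.html becomes the most recently used
cache.put("/c.jpg", "C")   # over capacity: /b.gif is evicted
remaining = list(cache.items)
```

LFU or a size-based policy would keep the same interface and change only which entry `put` chooses to evict.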

17

Summary

Replication and caching are very common in today’s web
  Improve performance
  Replication also provides greater availability

Replication is generally done with server knowledge; caching is done (roughly) transparently

Both rely on frequent requests

18

The Future

Let’s take a brief look at current trends – what might be a few years down the pike…

Let’s step back and look at the fundamental assumptions we’ve been making in our distributed architectures
  What basic assumptions do we make about our nodes in any of our basic architectures (client-server, P2P, etc.)?
  What basic assumptions do we make about the data?
  What basic assumptions do we generally make about distributing computation?

19

Smaller Systems

What lies in the future?
  RF tags, cameras, camera phones, temperature sensors, etc.
  … all interconnected by some form of networking

Sensor networks: the latest rage in distributed systems research

http://robotics.eecs.berkeley.edu/~pister/SmartDust/

20

What Can We Do with Sensor Networks?

Environmental monitoring:
  temperature in different parts of a building
  air quality
  monitoring equipment and staff in hospitals
  etc.

Law enforcement:
  Video feeds and anomalous behavior (J. Shi)

Research studies:
  Study ocean temperature, currents
  Monitor status of eggs in endangered birds’ nests

Fun:
  Record sporting events or performances from every angle (video & audio)
  New immersive environments

21

Why They’re Hard

Many, many devices

Power and resource constraints
  Most of these devices are wireless, tiny, battery-powered
  Can only transmit data every so often!!

High rate of failure and error
  Use redundancy to overcome this

Very limited intelligence
  Many sensors can’t run sophisticated code
  Can we use the large amount of parallelism to compensate?

Very local knowledge
  Know about a few nodes within proximity

22

The Problems of Focus

Languages: how do we express what we want to do with sensor networks?
  Surprisingly effective: subset of SQL for monitoring data from relatively simple sensors
  Why would this be?

Robustness: need to combine info from many sensors to account for individual errors

Routing: need to aggregate data in a power-efficient way

Streams: data is an infinitely long sequence; how do we deal with that?
  Summarization data structures (data is roughly according to this distribution)
  Operations over “sliding windows”
  Again, SQL is the basis of a lot of work!
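A sliding-window operation keeps only the last few readings, so it works on a stream of any length. The sketch below computes a running average over the most recent `window` values; the function name and the temperature readings are made up for illustration.

```python
from collections import deque

def sliding_average(stream, window):
    """Yield the mean of the most recent `window` readings after each new reading."""
    recent = deque(maxlen=window)  # the oldest reading is dropped automatically
    total = 0.0
    for reading in stream:
        if len(recent) == window:
            total -= recent[0]     # this value is about to leave the window
        recent.append(reading)
        total += reading
        yield total / len(recent)

temps = [20.0, 22.0, 24.0, 30.0]
averages = list(sliding_average(temps, window=2))
```

Because it is a generator holding at most `window` values, memory use stays constant no matter how long the sensor keeps reporting.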

23

Will Sensor Networks Make It to the Real World?

A definite “yes”…
  … because they’re already deployed: RFIDs, cameras, traffic monitors
  But they aren’t yet general-purpose
  Implications for privacy?

We don’t yet understand how to write applications for sensor networks
  Want to insulate the programmer from low-level considerations
  Want to satisfy performance and resource constraints
  What is the right level of describing an application? Web services? Something else?
  Can database-style languages get us most of the way?

Sensor Net Research at Penn (Ives, Guha, Lee, Loo; Mihaylov, Liu, Jacob)

The Internet is now heavily based on “streaming” data, remote devices that do sensing
  Sometimes it’s “motes”, other times it’s routers, monitoring software on servers, etc.
  It’s very complicated to program for all of these devices

Can we build apps that let us integrate and monitor relevant data, without worrying about device specifics?

The key idea: use query languages (think XQuery or SQL) as the basic way of requesting sensor data
  Extend with ideas from data integration, to support heterogeneous sensors, combining sensor data with databases, etc.

24

ASPEN: Sensors as Distributed Data

Goal: extensible monitoring of streaming data sources
  Programmer specifies computation, and the processing and data “flow” to where it’s most effective
  A smart optimizer knows about devices and connectivity

Programming is data-centric, not device-centric
  Everything is abstracted as tables

View of the system continuously refreshed

Basic Approach 1/5

Hide physical connectivity and location details from programmer: group data sources into abstract relations

Mic(lat, long, time, sample)

Video(lat, long, time, frame)

Basic Approach 2/5

Represent each sensor as the source of a stream of time-varying tuples

Mic(lat, long, time, sample)
  (385300,770200,1,―), (385300,770200,2,―), …
  (385302,770200,1,―), (385302,770200,2,┘), …
  (385300,770201,1,―), (385300,770201,2,┘), …
  (385301,770202,1,┘), (385301,770202,2,┐), …
  (385300,770200,1,―), (385300,770200,2,―), …

Video(lat, long, time, frame)
  (385301,770201,1,[image]), (385301,770201,2,[image]), …
  (385302,770201,1,[image]), (385302,770201,2,[image]), …
  (385303,770201,1,[image]), (385303,770201,2,[image]), …
  (385301,770202,1,[image]), (385301,770202,2,[image]), …
  (385301,770202,1,[image]), (385301,770202,2,[image]), …
  (385302,770202,1,[image]), (385302,770202,2,[image]), …

Basic Approach 3/5

“Show me all of the video frames between [38°53.01’,77°02.01’] and [38°53.03’,77°02.01’] with a [image]”

“How many video frames with a [image] are also near a microphone sample with sound?”

… Can also combine with lookups in tables to do data integration
  e.g., “Show me video frames with a [image] that fall within the coordinates of the conference room in RoomTable?”
  e.g., “Find the ssn of Bob Smith, use this to look up his transponder ID, and show me video near him”

Support queries based on properties of the data, independent of the devices
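The first query can be approximated as a filter over Video tuples. This is a sketch, not ASPEN’s query language: the class and function names are hypothetical, frames are stood in for by strings, and the predicate on frame content is left out because the image it refers to did not survive in the transcript. It also assumes, from comparing the query with the tuples on the earlier slide, that 385301 encodes 38°53.01’.

```python
from typing import NamedTuple

class VideoTuple(NamedTuple):
    lat: int    # assumed encoding: 385301 stands for 38°53.01'
    long: int   # assumed encoding: 770201 stands for 77°02.01'
    time: int
    frame: str  # placeholder for the image data

def frames_between(stream, lat_lo, lat_hi, long_lo, long_hi):
    """Select video tuples whose coordinates fall inside a bounding box."""
    return [t for t in stream
            if lat_lo <= t.lat <= lat_hi and long_lo <= t.long <= long_hi]

video = [
    VideoTuple(385301, 770201, 1, "f1"),
    VideoTuple(385302, 770201, 1, "f2"),
    VideoTuple(385303, 770201, 1, "f3"),
    VideoTuple(385301, 770202, 1, "f4"),
]
hits = frames_between(video, 385301, 385303, 770201, 770201)
```

The filter mentions only properties of the data (coordinates), never which camera produced a tuple, which is the device independence the slide describes.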

Basic Approach 4/5

Support logical views: “abstract sensors” integrating data from different types of lower-level sensors

Video tuples:
  (385301,770201,1,[image]), (385301,770201,2,[image])
  (385302,770201,1,[image]), (385302,770201,2,[image])
  (385303,770201,1,[image]), (385303,770201,2,[image])
  (385301,770202,1,[image]), (385301,770202,2,[image])
  (385301,770202,1,[image]), (385301,770202,2,[image])
  (385302,770202,1,[image]), (385302,770202,2,[image])

Mic tuples:
  (385300,770200,1,―), (385300,770200,2,―), …
  (385302,770200,1,―), (385302,770200,2,┘), …
  (385300,770201,1,―), (385300,770201,2,┘), …
  (385301,770202,1,┘), (385301,770202,2,┐), …
  (385300,770200,1,―), (385300,770200,2,―), …

AVObservations(lat, long, time, frame, sample) :- video(lat, long, time, frame), mic(lat2, long2, time, sample)
  where dist(lat, long, lat2, long2) < 5m and sample > ― and frame > [image]

Basic Approach 5/5

Support logical views: “abstract sensors” integrating data from different types of lower-level sensors

Video tuples shown:
  (385303,770201,1,[image]), (385303,770201,2,[image])
  (385301,770202,1,[image]),

Mic tuples shown:
  (385302,770200,2,┘), …
  (385300,770201,2,┘), …
  (385301,770202,1,┘), (385301,770202,2,┐), …

AVObservations tuples shown:
  (385303,770201,1,[image],┘), (385301,770202,1,[image],┘), (385303,770201,2,[image],┘), (385303,770201,2,[image],┘), (385303,770201,2,[image],┐)

AVObservations(lat, long, time, frame, sample) :- video(lat, long, time, frame), mic(lat2, long2, time, sample)
  where dist(lat, long, lat2, long2) < 5m and sample > ― and frame > [image]
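The AVObservations view is essentially a join of the two streams on time, restricted by distance and by a threshold on the sample. A rough Python rendering follows, with several loudly stated assumptions: samples are numbers rather than the icons on the slide, `dist` is plain Euclidean distance in raw coordinate units rather than meters, the thresholds are arbitrary, and the condition on `frame` is omitted because its right-hand side is missing from the transcript.

```python
import math
from typing import NamedTuple

class Video(NamedTuple):
    lat: int
    long: int
    time: int
    frame: str     # placeholder for image data

class Mic(NamedTuple):
    lat: int
    long: int
    time: int
    sample: float  # assumption: loudness as a number, not an icon

def dist(lat, long, lat2, long2):
    """Hypothetical stand-in for dist(): Euclidean distance in raw coordinate units."""
    return math.hypot(lat - lat2, long - long2)

def av_observations(video, mic, max_dist, min_sample):
    """Join video and mic tuples with equal time that are close and loud enough."""
    return [
        (v.lat, v.long, v.time, v.frame, m.sample)
        for v in video
        for m in mic
        if v.time == m.time
        and dist(v.lat, v.long, m.lat, m.long) < max_dist
        and m.sample > min_sample
    ]

video = [Video(385303, 770201, 1, "f3"), Video(385301, 770202, 1, "f4")]
mic = [Mic(385301, 770202, 1, 0.9), Mic(385300, 770200, 1, 0.1)]
view = av_observations(video, mic, max_dist=2, min_sample=0.5)
```

A nested-loop join like this is fine for a handful of tuples; deciding where in the network to run such a join efficiently is the optimizer’s job in the ASPEN design described above.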

Challenges We Are Addressing

Data integration has been based on static data
  Adapt mappings, queries to stream data, including timing, synchronization, link properties, …

Optimization of queries is hard in the simplest case, and here we need to do it in distributed fashion with limited knowledge
  Distribute computation to the network, and to the devices with the “right” position and “right” capabilities

31

32

Next Time

Bigger systems: The grid and cloud computing

Higher-level networks: Semantics and the Web