Replication(2)
Some Interesting Observations
Top 1% of all documents account for 20% - 35% of proxy requests
Top 10% account for 45% - 55% of requests
It takes 25% to 40% of all documents to account for 70% of requests
It takes 70% to 80% of all documents to account for 90% of requests
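Figures like these can be computed from a proxy request log. The following is a minimal sketch, assuming the log is simply a list of requested document names; the function name `share_of_requests` and the sample log are hypothetical and are not part of the original slides.

```python
from collections import Counter

def share_of_requests(requests, top_fraction):
    """Fraction of all requests that go to the most popular
    `top_fraction` of distinct documents."""
    # Request counts per document, most popular first.
    counts = sorted(Counter(requests).values(), reverse=True)
    # Number of documents in the "top" group (at least one).
    top_n = max(1, round(len(counts) * top_fraction))
    return sum(counts[:top_n]) / len(requests)

# Hypothetical log: document "a" is requested far more often than the rest.
log = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
print(share_of_requests(log, 0.25))  # share taken by the top 25% of documents
```

The skew in popularity is what makes caching worthwhile: a cache holding only a small fraction of the documents can still answer a large fraction of the requests.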
Web Caching
As an example, we use the web to illustrate caching and other related issues
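[Figure: a browser sends a request through a Web proxy cache to the Web server, and the response returns along the same path; a second browser exchanges request and response with the Web server directly]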
Web Browser Caching
Web browsers have their own caches. When a page is downloaded from a site, it is put into the browser cache.
This is especially useful when the back button is pressed.
If a new copy is needed, a "refresh" can be done.
No page stays permanently in the cache because room is limited. A replacement algorithm is needed to determine which cached page should be purged.
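The slides do not say which replacement algorithm a browser uses. As an illustration only, here is a minimal sketch of one common choice, least recently used (LRU); the class name `LRUPageCache` and its methods are hypothetical.

```python
from collections import OrderedDict

class LRUPageCache:
    """Tiny page cache that purges the least recently used page when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._pages = OrderedDict()  # url -> page body, oldest use first

    def get(self, url):
        if url not in self._pages:
            return None  # cache miss
        self._pages.move_to_end(url)  # mark as most recently used
        return self._pages[url]

    def put(self, url, body):
        self._pages[url] = body
        self._pages.move_to_end(url)
        if len(self._pages) > self.capacity:
            self._pages.popitem(last=False)  # purge the least recently used page

cache = LRUPageCache(capacity=2)
cache.put("/a", "A")
cache.put("/b", "B")
cache.get("/a")       # "/a" is now more recently used than "/b"
cache.put("/c", "C")  # cache is full, so "/b" is purged
```

Other policies (for example, purging the largest or the least frequently used page) fit the same interface; only the choice of victim in `put` changes.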
Web Browser Caching
Client pull
• The server provides the content with instructions on when the client should ask for a refreshed copy of the content or if the content should be cached.
Server push
• The server transmits page information to the screen. The browser application displays the information and leaves the connection to the server open.
• With an open connection, the server can continue to push updated pages for your screen to display on an ongoing basis. You can close the connection by closing the page.
• The server is in control
Browser caches are different from proxy caches (discussed next).
Web Caching
Proxy caches (also called proxy servers)
Intercept HTTP requests from clients
• Serve the object if it is in the cache
• If not, go to the object's home server
– On behalf of the user, get the object and possibly deposit it in the cache before returning it to the user
• Usually deployed at the edges of a network
– Wide-area bandwidth savings, improved response time, and increased availability of static web-based objects
A browser may have to be configured to point to the proxy server.
Usually a web cache is purchased and installed by an ISP, e.g., a university.
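The request path through a proxy cache can be sketched as follows. This is a simplified illustration, not a real proxy: the class `ProxyCache`, the callback `fetch_from_origin`, and the counter `origin_requests` are hypothetical names, and freshness checks are left out.

```python
class ProxyCache:
    """Serve an object from the cache if present, otherwise fetch it
    from the home server on behalf of the user and keep a copy."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # callable: url -> object body
        self._store = {}                 # url -> cached object body
        self.origin_requests = 0         # how often the home server was contacted

    def handle(self, url):
        if url in self._store:
            return self._store[url]      # cache hit: no wide-area traffic
        self.origin_requests += 1
        body = self._fetch(url)          # cache miss: go to the home server
        self._store[url] = body          # deposit in the cache before returning
        return body

# Hypothetical home server, modeled as a plain function.
proxy = ProxyCache(lambda url: "contents of " + url)
first = proxy.handle("/index.html")   # miss: fetched from the home server
second = proxy.handle("/index.html")  # hit: served from the cache
```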
Push-Based Approach
Server tracks all proxies that have requested objects
If a web page is modified, notify each proxy
Notification types
• Indicate object has changed [invalidate]
• Send new version of object [update]
How to decide between invalidates and updates?
• Pros and cons?
• One approach: send updates for more frequently accessed objects, invalidate for the rest
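The "updates for hot objects, invalidates for the rest" policy can be sketched as below. The threshold value and the names `choose_notification` and `notify_proxies` are assumptions made for illustration; the slides do not specify how "frequently accessed" is measured.

```python
def choose_notification(access_count, hot_threshold=10):
    """Send a full update for frequently accessed objects, an invalidate otherwise."""
    return "update" if access_count >= hot_threshold else "invalidate"

def notify_proxies(obj, new_body, access_counts, proxies, hot_threshold=10):
    """Build one notification per proxy that has requested `obj`.
    An update carries the new version; an invalidate carries no body."""
    kind = choose_notification(access_counts.get(obj, 0), hot_threshold)
    payload = new_body if kind == "update" else None
    return [(proxy, kind, obj, payload) for proxy in proxies]

# Hypothetical access statistics kept by the server.
counts = {"/news.html": 50, "/about.html": 2}
hot = notify_proxies("/news.html", "<new page>", counts, ["proxy1", "proxy2"])
cold = notify_proxies("/about.html", "<new page>", counts, ["proxy1"])
```

An update costs more bandwidth per notification but saves the proxy a later fetch, which pays off only if the object is likely to be requested again soon.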
[Figure: the Web server pushes notifications to the proxy]
Push-Based Approaches
Advantages
• Provide tight consistency [minimal stale data]
• Proxies can be passive
Disadvantages
• Need to maintain state at the server
– Recall that HTTP is stateless
– Need mechanisms beyond HTTP
• State may need to be maintained indefinitely
– Not resilient to server crashes
These disadvantages are the reason why push-based approaches are not used
Pull-Based Approaches
The proxy is entirely responsible for maintaining consistency
The proxy periodically polls the server to see if the object has changed
• Use If-Modified-Since HTTP messages: a proxy uses this type of message to tell a remote server to return a copy of the object only if it has been modified.
Key question: When should a proxy poll?
• Server-assigned Time-to-Live (TTL) values
– No guarantee that the object will not change in the interim
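The server side of an If-Modified-Since exchange can be sketched as follows. This is a simplified model rather than a real HTTP server: the function `handle_conditional_get` and its tuple return value are hypothetical, while `format_datetime` and `parsedate_to_datetime` are the standard-library helpers for HTTP-style dates.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime

def handle_conditional_get(last_modified, body, if_modified_since=None):
    """Return (status, headers, body) for a GET that may carry If-Modified-Since.
    `last_modified` must be a UTC-aware datetime."""
    last_modified = last_modified.replace(microsecond=0)  # HTTP dates have 1 s resolution
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since is not None:
        if last_modified <= parsedate_to_datetime(if_modified_since):
            return 304, headers, b""  # Not Modified: the proxy's copy is still valid
    return 200, headers, body         # send the full, current copy

t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
# First fetch: no validator yet, so the proxy gets the full object.
status1, headers1, body1 = handle_conditional_get(t0, b"v1")
# Poll: the proxy echoes back the Last-Modified value it stored.
status2, _, body2 = handle_conditional_get(t0, b"v1", headers1["Last-Modified"])
# The object changes one hour later; the next poll returns the new copy.
status3, _, body3 = handle_conditional_get(t0 + timedelta(hours=1), b"v2",
                                           headers1["Last-Modified"])
```

A 304 reply carries no body, so a poll for an unchanged object costs only a small message exchange.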
[Figure: the proxy polls the Web server and receives a response]
Pull-Based Approach: Intelligent Polling
Proxy can dynamically determine the refresh interval
Compute based on past observations
• Start with a conservative refresh interval
• Increase interval if object has not changed between two successive polls
• Decrease interval if object is updated between two polls
• Adaptive: No prior knowledge of object characteristics needed
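The adjustment rule above can be sketched as below. The multiplicative factor and the bounds are assumptions chosen for illustration; the slides only say that the interval grows when the object is unchanged and shrinks when it was updated.

```python
def next_interval(current, changed, min_interval=60, max_interval=86400, factor=2):
    """Return the next polling interval in seconds.
    Shrink the interval if the object changed since the last poll,
    grow it otherwise, and keep the result within fixed bounds."""
    if changed:
        return max(min_interval, current / factor)
    return min(max_interval, current * factor)

# Start conservatively and adapt to what each poll observes.
interval = 60
for changed in [False, False, False, True]:
    interval = next_interval(interval, changed)
```

After three unchanged polls the interval has grown from 60 to 480 seconds, and the single observed update halves it to 240.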
Pull-Based Approach
Advantages
• Server remains stateless
• Resilient to both server and proxy failures
Disadvantages
• Weaker consistency guarantees (objects can change between two polls and proxy will contain stale data until next poll)
• High message overhead
A Hybrid Approach: Leases
Lease: Duration of time for which server agrees to notify proxy of modification
Issue lease on first request, send notifications until expiry
• Need to renew lease upon expiry
Smooth tradeoff between state and messages exchanged
• Zero duration => polling, infinite leases => server-push
Efficiency depends on the lease duration
Limited use
[Figure: the client reads through the proxy; the proxy sends "Get + lease req" to the server, the server answers "Reply + lease" and later sends "Invalidate/update"]
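The server-side bookkeeping for leases can be sketched as follows. This is a minimal illustration with hypothetical names (`LeaseServer`, `get`, `modify`); time is passed in explicitly as a number so that the behavior is easy to follow.

```python
class LeaseServer:
    """Track, per (proxy, object), until when the server has promised notifications."""

    def __init__(self, lease_duration):
        self.lease_duration = lease_duration
        self._leases = {}  # (proxy, obj) -> expiry time

    def get(self, proxy, obj, now):
        """Handle 'Get + lease req': grant or renew a lease and return its expiry."""
        expiry = now + self.lease_duration
        self._leases[(proxy, obj)] = expiry
        return expiry

    def modify(self, obj, now):
        """The object changed: return the proxies that still hold a live lease.
        Expired leases are dropped, so server state stays bounded."""
        self._leases = {k: e for k, e in self._leases.items() if e > now}
        return sorted(p for (p, o) in self._leases if o == obj)

server = LeaseServer(lease_duration=10)
server.get("proxy1", "/x", now=0)        # lease valid until t=10
server.get("proxy2", "/x", now=5)        # lease valid until t=15
early = server.modify("/x", now=8)       # both leases are live
late = server.modify("/x", now=12)       # proxy1's lease has expired

# With a zero-length lease nobody is ever notified, so proxies must poll.
polling = LeaseServer(lease_duration=0)
polling.get("proxy1", "/x", now=0)
nobody = polling.modify("/x", now=0)
```

The lease duration is the tuning knob: longer leases mean fewer renewal messages but more state held at the server.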
Cooperative Caching
Caching infrastructure can have multiple web proxies
• Proxies can be arranged in a hierarchy or other structures
Proxies can cooperate with one another
• Answer client requests
• Propagate server notifications
Uses a combination of HTTP and ICP (Internet Cache Protocol).
• ICP can be used by one cache to quickly ask another cache if it has an object.
• HTTP is used to actually retrieve the object.
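The division of labor between ICP and HTTP can be sketched with an in-process simulation. No real network protocol is used here: `icp_query` and `http_get` are hypothetical stand-ins for the two message types, and `CoopProxy` is an illustrative name.

```python
class CoopProxy:
    """Proxy that asks its siblings before going to the origin server."""

    def __init__(self, name, fetch_from_origin):
        self.name = name
        self.store = {}                  # url -> cached object body
        self.siblings = []               # other cooperating proxies
        self._fetch = fetch_from_origin  # callable: url -> object body

    def icp_query(self, url):
        """Stand-in for an ICP query: a cheap hit/miss answer, no object transfer."""
        return url in self.store

    def http_get(self, url):
        """Stand-in for an HTTP request that actually retrieves the object."""
        return self.store[url]

    def handle(self, url):
        """Return (body, source), where source names where the object came from."""
        if url in self.store:
            return self.store[url], "local"
        for sibling in self.siblings:
            if sibling.icp_query(url):        # quick check via "ICP"
                body = sibling.http_get(url)  # retrieval via "HTTP"
                self.store[url] = body
                return body, sibling.name
        body = self._fetch(url)               # no sibling has it: go to the origin
        self.store[url] = body
        return body, "origin"

origin = lambda url: "contents of " + url
p1, p2 = CoopProxy("p1", origin), CoopProxy("p2", origin)
p1.siblings, p2.siblings = [p2], [p1]
r1 = p1.handle("/a")  # fetched from the origin
r2 = p2.handle("/a")  # found at sibling p1
r3 = p2.handle("/a")  # now cached locally at p2
```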
Problems
Caching proxies serve only their parents and not all Internet users.
Content providers (say, Web servers) cannot rely on existence and correct implementation of caching proxies.
Accounting issues with caching proxies
• Example: www.cnn.com needs to know the number of hits to the advertisements displayed on the web page.
Content Distribution Networks (CDN)
Business Model: A content provider such as www.cnn.com or Yahoo pays a CDN company (such as Akamai) to get its content to the requesting users with short delays.
A CDN provides a mechanism for
• Replicating content on multiple servers in the Internet
• Providing clients with a means to determine the servers that can deliver the content fastest.
Terminology
Content: Any publicly accessible combination of text, images, applets, frames, MP3, video, flash, virtual reality objects, etc.
Content Provider: Any individual, organization, or company that has content that it wishes to make available to users.
Origin Server: Content provider’s server, where the content is first uploaded.
Surrogate Server (sometimes called edge server): Content distributor’s server, where the replicated content is kept.
Players of the game
Content Provider: Yahoo, MSNBC, CNN
H/W and S/W Vendor: Cisco, Lucent, Inktomi, CacheFlow
Content Distributor: Akamai, Digital Island, AT&T
Hosting Provider: Exodus
[Figure: arrows between the players are labeled "Sells servers", "Send content", and "Install servers"]
CDN: Distribution
The CDN company places hundreds of CDN servers in Internet hosting centers.
The CDN replicates its customers’ content in the CDN servers. Whenever a customer updates its content (e.g., a web page), the CDN redistributes the fresh content to the CDN servers.
The CDN provides a mechanism so that when a user requests content, the content is provided by the CDN server that can most rapidly deliver the content to the user. This can be the closest CDN server to the user (perhaps in the same ISP as the user) or may be a CDN server with a congestion-free path to the user.
CDN: Distribution
[Figure: Akamai CDN. The origin server in North America pushes content to a CDN distribution node, which pushes content to CDN servers in Asia, Europe, and South America]
CDN: Functional Components
• Distribution Service
• Redirection Service
• Accounting and Billing System
CDN: Distribution Service
The content provider determines which of its objects it wants the CDN to distribute.
The content provider tags and then pushes this content to a CDN node, which in turn replicates and pushes the content to all its CDN servers.
CDN: Distribution Service
When a browser in a user’s host is instructed to retrieve a specific object (specified using a URL), how does the browser determine whether it should retrieve the object from the origin server or from one of the CDN servers?
As an example, suppose the hostname of the content provider is www.cnn.com
Suppose the hostname of the CDN company is www.akamai.com
CDN: Redirection
Users get an html document from www.cnn.com; this could be index.html
The file index.html uses a modified URL for content that has been replicated.
Example: If the gif files are what has been replicated, then <img src="http://cnn.com/af/x.gif"> may be modified as follows:
<img src="http://a73.g.akamaitech.net/7/23/cnn.com/af/x.gif">
The browser needs to resolve aXYZ.g.akamaitech.net hostname for replicated content.
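The URL modification in the example above can be sketched with a regular expression. This is an illustration only and not the tool a CDN actually uses: the function name `rewrite_for_cdn` is hypothetical, and the prefix `a73.g.akamaitech.net/7/23` is simply copied from the example.

```python
import re

def rewrite_for_cdn(html, origin_host, cdn_prefix):
    """Point src="http://<origin_host>/...gif" references at the CDN by
    prepending `cdn_prefix` to the original host and path."""
    pattern = re.compile(
        r'(src=")http://' + re.escape(origin_host) + r'(/[^"]*\.gif")'
    )
    return pattern.sub(
        lambda m: m.group(1) + "http://" + cdn_prefix + "/" + origin_host + m.group(2),
        html,
    )

page = '<img src="http://cnn.com/af/x.gif"> <a href="http://cnn.com/story.html">story</a>'
rewritten = rewrite_for_cdn(page, "cnn.com", "a73.g.akamaitech.net/7/23")
```

Only the image reference changes; the link to story.html still points at the origin server, which matches the idea that only replicated objects are redirected.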
CDN: Redirection
DNS is configured so that all queries about g.akamaitech.net that arrive at a DNS server are sent to an authoritative DNS server for g.akamaitech.net. This is referred to as an Akamai DNS server (authoritative DNS server).
When the Akamai DNS server receives the query, it extracts the IP address of the requesting browser.
Using this IP address and the information that it has about the Internet (called a map), the Akamai DNS server returns the IP address of an Akamai server (surrogate server) to the requesting browser. The server is chosen according to a policy, e.g., select the server that is the fewest hops away.
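The "fewest hops" policy can be sketched as a lookup in a small map. The map layout, the function name `resolve`, and all addresses (taken from the reserved documentation ranges) are assumptions made for illustration; a real map is far larger and is built from measurements.

```python
import ipaddress

def resolve(client_ip, network_map):
    """Return the surrogate that is the fewest hops away from the client's
    network, or None if the map has no entry covering the client."""
    addr = ipaddress.ip_address(client_ip)
    for network, hops_by_surrogate in network_map.items():
        if addr in ipaddress.ip_network(network):
            # Pick the surrogate IP with the smallest hop count.
            return min(hops_by_surrogate, key=hops_by_surrogate.get)
    return None

# Hypothetical map: client network -> {surrogate IP: hop count}.
network_map = {
    "192.0.2.0/24":    {"203.0.113.10": 3, "203.0.113.20": 9},
    "198.51.100.0/24": {"203.0.113.10": 8, "203.0.113.20": 2},
}
near_first = resolve("192.0.2.55", network_map)
near_second = resolve("198.51.100.7", network_map)
```

The same hostname therefore resolves to different surrogate addresses depending on where the query comes from.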
CDN Redirection
The Akamai DNS server IP address is now in the cache of the local DNS server.
• This implies that it is not always necessary to go to the root DNS server.
The TTL associated with the IP address of an Akamai server (surrogate) is relatively small.
• This is done for performance reasons.
Akamai content distribution servers are caches
CDN Redirection
What if content is not there?
If the requested content is not found, then the surrogate will ask other surrogates within a specified region for it.
If the requested content is still not found or is stale, then a request is made to the original web site.
CDN Redirection
[Figure: redirection example. The client sends GET www.cnn.com/index.html to CNN.com and receives index.html, which contains <img src="http://www.cdn.com/cnn/images/1.gif">. The local DNS server sends the DNS query "cdn.com ?" to the authoritative DNS server for cdn.com and receives 64.236.24.28. The client then sends GET /cnn/images/1.gif to 64.236.24.28 and receives 1.gif. A further arrow from CNN.com is labeled PUT /images/*.gif.]
CDN Selection
The tricky issue is selecting which local content server to use for a particular request
• Want to spread load evenly
• Want minimal impact if a server is added or removed.
In Akamai, each surrogate server sends measurement results to the Network Operations Command Center (NOCC).
• Measurement results include number of active TCP connections, HTTP request arrival rate, bandwidth availability, etc.
This information is used by the Akamai DNS server.
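The slides do not say how the two goals (even load, minimal impact when a server is added or removed) are met. One technique commonly used for this purpose is consistent hashing; the sketch below is an illustration under that assumption, with hypothetical names (`HashRing`, `server_for`) and server labels, and it ignores the load measurements mentioned above.

```python
import bisect
import hashlib

def _point(key):
    """Map a string to a position on the hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent hashing: each server owns many points on a ring, and a URL
    is served by the owner of the first point at or after the URL's position."""

    def __init__(self, servers, replicas=100):
        # Several virtual points per server spread the load more evenly.
        self._ring = sorted(
            (_point(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._points = [p for p, _ in self._ring]

    def server_for(self, url):
        i = bisect.bisect(self._points, _point(url)) % len(self._ring)
        return self._ring[i][1]

urls = [f"/images/{n}.gif" for n in range(200)]
before = HashRing(["s1", "s2", "s3"])
after = HashRing(["s1", "s2", "s3", "s4"])
# URLs whose server changed when s4 was added.
moved = [u for u in urls if before.server_for(u) != after.server_for(u)]
```

Every URL in `moved` is now assigned to the new server `s4`; no URL is reshuffled between the existing servers, which is the "minimal impact" property.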
Accounting Mechanism
Accounting mechanisms collect and track information related to request routing, distribution and delivery.
Information is gathered in real time and put into log files for each CDN component.
This gets sent to the Network Operations Command Center (NOCC).
Full Site Delivery vs. Partial Site Delivery
Full Site Delivery: All the contents are delivered by the CDN (including HTML, images, and other objects).
Partial Site Delivery: Only images, streaming media and other bandwidth-intensive objects are delivered by the CDN.
CDNs and Content
Content suitable for CDNs
• Images
• Streaming media
• Java applets
• Static information
Content not suitable
• Dynamic information
• Personalized information
Summary
We have examined replication and issues related to the design and implementation of a replicated system.
Many choices and tradeoffs to consider