© 2013 a. haeberlen, z. ives welcome to cis 455 / 555 – internet and web systems zachary g. ives...

26
© 2013 A. Haeberlen, Z. Ives Welcome to CIS 455 / 555 – Internet and Web Systems Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems January 14, 2015

Upload: stone-casey

Post on 14-Dec-2015

215 views

Category:

Documents


1 download

TRANSCRIPT

© 2013 A. Haeberlen, Z. Ives

Welcome to CIS 455 / 555 –

Internet and Web Systems

Zachary G. IvesUniversity of Pennsylvania

CIS 455 / 555 – Internet and Web Systems

January 14, 2015

© 2011-14 A. Haeberlen, Z. Ives 2

What this Course Is About

• How do we build services like Google, Akamai, iTunes, Facebook, EBAY, …?

• What are the principles behind them?(This is NOT a course on building Web sites! See CIS 450/550…)

• How do “cloud computing,” P2P, and Web services relate?

• The main themes of the course:

• Distributed systems concepts, with emphasis on data, scalability and interoperability (including “the cloud”)

• Data representation fundamentals, with emphasis on XML

• Information retrieval concepts, including ranking and indexing

• It’s a course that involves building software using the principles learned, evaluating it, and programming in teams

© 2011-14 A. Haeberlen, Z. Ives 3

How Does this Relate to Other CIS Courses?

NETS 212

• Cloud service layers

• Key/value stores, in particular

• MapReduce, Spark, and data-parallel programming basics

CIS 450/550

• Data representation and management

• Relational querying with SQL; XML querying with XQuery

• DBMS-backed web sites

• 455/555 focuses on data with respect to interoperability

CIS 350/573: software engineering and mashups

CIS 505: focuses on distributed systems and algorithms• CIS 505 is less project-oriented than CIS 555

• CIS 555 covers Web services, cloud architectures in more detail

© 2011-14 A. Haeberlen, Z. Ives 4

Some Things We’ll Look at

•What are the principles behind building systems that work on the Internet?

• How do these relate to many of today’s hot technologies?• Web servers, DHTML, Servlets, JSP, …

• XML

• Web services

• Peer-to-peer

• Application servers

• Cloud computing environments

• Content distribution networks

• Web search

• Mash-ups

• The cloud

• …

© 2011-14 A. Haeberlen, Z. Ives 5

Staff

• Instructor: Zack Ives, zives@cis

• Office: 576 Levine North

• Office hours W 1:30-2:30 (and by arrangement)

• TAs:

• Avani Deshpande Akshay Hegde

• Mounica Maddela Shruthi Gorantala

• Shenga Ding

• Piazza: piazza.com/upenn/spring2015/cis455555

• Will have custom homework submission platform (coming soon)

© 2011-14 A. Haeberlen, Z. Ives 6

Textbooks

• Distributed Systems: Principles and Paradigms, 2nd ed, Tanenbaum and van Steen

• We’ll read from the book ~50% of the time

• Frequent supplementary handouts

• Excerpts from several books

• Many recent research papers

• Your first one, which you should read by Wed:http://research.microsoft.com/en-us/um/people/blampson/33-Hints/Acrobat.pdf

(linked off the CIS 555 page)

© 2011-14 A. Haeberlen, Z. Ives 7

Prerequisites, Workload, etc.

Necessary skills:

• Ability to code in Java: there is a substantial implementation project

• Good debugging skills – this will be the biggest time sink!

• The ability to work as a team with classmates (towards the end)

• A willingness to learn how to read API documentation

• Some exposure to threads and concurrent programming

• A willingness to “push the envelope”

Workload:

• Several programming/debugging-based homework assignments

• A substantial term project with experimental evaluation and a report

• Two midterms

Payoff:• Lots of practical development and debugging experience

• A good working knowledge of the fundamentals behind scalable systems

• A working “academic clone of Google,” hosted on Amazon EC2!

WARNING: this course should be considered 1.5 CU!

© 2011-14 A. Haeberlen, Z. Ives 8

A Disclaimer…

• This remains a “bleeding edge” course!• Goal 0: an understanding of scalable distributed data-centric

systems

• Goal 1: a look under the covers of today’s hottest topics – in lectures and in projects

• Goal 2: a level of comfort in managing large, complex software development with others’ code

• Part of this means doing a substantial implementation project

• As in the real world: learning APIs, dealing with inadequate tools

• Most of you will find this a struggle! You’ll spend many hours debugging!

• We will be using some immature technology• Not everything will have been validated ahead of time

• We’ll do the best we can to smooth over the bugs!

• We hope it will be a fun course, though…

… And an interesting one!

© 2011-14 A. Haeberlen, Z. Ives 9

A Bit of Context for the Course

© 2011-14 A. Haeberlen, Z. Ives 10

What Exactly Is the Web?

• The Web consists of HTTP servers that publish HTML, XML, and a few other content types• These are hyperlinked via URLs (a subset of URIs)

• Plus there are a huge number of web clients

• The Web is built on a number of Internet protocols:• DNS, TCP, IP

• Other Internet services use other protocols• SMTP, IMAP, POP, AIM, FTP, …

• Streaming media, music swapping protocols, …

• Web services, custom applications may actually also use HTTP in ways it wasn’t designed for

© 2011-14 A. Haeberlen, Z. Ives

11

The Internet is Built in Layers

IPv4, IPv6 Unicast, (multicast)

TCP (session-based)

UDP (sessionless)

WiFi, ZigBee, Ethernet, WiMax

Lightweight streaming, etc.

SSH, FTP,HTTP, IM, P2P,

Web Services, distrib

transactions, …

Link

IP

Transport

Session

Middleware

Your Application

… …

© 2011-14 A. Haeberlen, Z. Ives 12

What Is an Internet System?

• Not just a web server or web application…

• An application built over the Internet, whose functionality is distributed across more than one machine

• Typically, at least in a client-server or server-to-server fashion, but may have many more participants

• Typically, data and/or code must be exchanged in distributed fashion for the functioning of the application

• Often, the data must be partitioned, replicated, translated, etc. (“shards” in Google-speak)

• Often, the code is written in multiple different environments, languages, etc.

• Often, there are concerns about handling failures, firewalls, attacks, …

© 2011-14 A. Haeberlen, Z. Ives 13

Why Are Internet System Topics Interesting?

• Understanding what’s underneath today’s Web

• How does it work?

• What are its shortcomings?

• What are its strengths?

• Understanding distributed algorithms

• Using the right approach when designing new protocols and web systems

• Being able to anticipate what’s actually possible in the future

© 2011-14 A. Haeberlen, Z. Ives 14

Example: Web Search, a Cloud Service

Index Servers

Crawlers

Search Interface Servers

queries HTML forms;

results

query results

Web

Pages

pages

keywords +

locations

clientclientclient

Uses a model ofdocument/word

similarity to rankmatches

© 2011-14 A. Haeberlen, Z. Ives 15

Example: Social Networking (Facebook / Twitter), a Cloud

Service

Recommender

Users &

entities

User PageServers

clicks pages & notifications

suggestions

common properties,

usage logs, …

clientclientclient

updates, posts

© 2011-14 A. Haeberlen, Z. Ives 16

Example: Enterprise (or Web) Information Integration

XML sources

Mediator

System

queries results in

“mediated schema”

clientclientclient

Relational

sources HTML sources

XQuery

+ XPath

over

XML XMLSQL

ODBC

results HTTP POST

HTML

Maps all data into a single format and virtual

schema

© 2011-14 A. Haeberlen, Z. Ives 17

Example: SETI@home

Problem Partitioning

clientclientclient

Breaks computation intomany parts and

distributes them tothe clients Data

Aggregation

New sub-problems Computed

subresults

© 2011-14 A. Haeberlen, Z. Ives 18

Example: P2P File Sharing

client

client

client

client

request

request

request data

data

data

Processes name-basedrequests for data; each

node can make requests,forward requests,

return data

© 2011-14 A. Haeberlen, Z. Ives 19

What are the Hard Problems?

• Disclaimer: most of the hard problems AREN’T solved (or solvable) – and there often isn’t any single BEST solution

Much of systems design is about finding the right compromise for each specific problem

• We can divide them into:• Scalability

• Availability / reliability

• Consistency

• Interoperability

• Location and resource discovery

© 2011-14 A. Haeberlen, Z. Ives 20

Scalability

• How do we support a large number of clients or requests?

• Distribute work!

• Challenges:

• Coordination – takes significant overhead in the general case

• Load balancing – avoid having bottlenecks

• Parts of the solution:

• Client-server, multi-tier, P2P architectures

• Restricted programming models, e.g., MapReduce

• Data partitioning, replication, remote procedure calls, …

© 2011-14 A. Haeberlen, Z. Ives 21

Availability/Reliability

• How do we ensure the system is “up” when we want it to be, and doing the “right” thing?

• Replication and redundancy

• Security measures against attacks

• Ability to undo/redo

• Challenges:

• Keeping things consistent

• Performance vs. security

• Acknowledgments

• Parts of the solution:

• Data partitioning, replication, …

• Logging, transactions, …

• Redundant hardware, multiple sites, …

• Quorum and consensus algorithms

© 2011-14 A. Haeberlen, Z. Ives 22

Consistency / Consensus

• Replication, distribution, and failures make it difficult to keep a unified, consistent view of the world – how do we combat this?

• Locking, concurrency control, and invalidation schemes

• Clock synchronization

• Challenges:

• Locking has huge performance overhead

• Network partitions, disconnected operation

• Parts of the solution:

• Optimistic concurrency control, 2-phase locking

• Distributed clock sync

• Conflict resolvers

© 2011-14 A. Haeberlen, Z. Ives 23

Interoperability

• How do we coordinate the efforts of components that have different data formats and/or source languages, and are on different machines?

• Standardization!

• Challenges:• Everything has a different semantics!

• Parts of the solution:• Standard data formats: XML, XML schemas

• “Schema mediation” and data translation

• Remote procedure calls: CORBA, XML-RPC, …

© 2011-14 A. Haeberlen, Z. Ives 24

Location & Resource Discovery

• How do you find what you’re looking for?

• Naming

• Declarative queries over standard schemas

• Advertisements

• Challenges:

• Naming has implicit semantics

• What do you do when you don’t know what to call something?

• Parts of the solution:

• Directory systems – DNS, LDAP, etc.

• Resource discovery and advertising protocols

• Overlay networks, sharding schemes

• Standardized schemas

© 2011-14 A. Haeberlen, Z. Ives 25

Our First Focus: Single Machines, aka Servers

• How do you handle large numbers of concurrent users?

• Processes

• Threads

• Events

• Hybrids (e.g., thread pools)

• Staged architectures

© 2011-14 A. Haeberlen, Z. Ives 26

Next Time…

•We’ll look under the covers of an HTTP server

• Key ideas in building scalable systems

• Principles of HTTP and web servers

• Management of concurrent sessions

• To read by next Wednesday:

• Lampson and Saltzer paperhttp://research.microsoft.com/en-us/um/people/blampson/33-Hints/Acrobat.pdf

• Tanenbaum Ch. 3.1

• If necessary: Review Tanenbaum “Modern OS,” Ch. 2.3 or a similar OS book on interprocess communication