1Jim Binkley
World Wide Web
TCP/IP class
outline
◆ intro - the big picture (elephant in breadbox)
◆ HTML - Hypertext Markup Language
– we can hopefully ignore this as you know it
◆ HTTP - Hypertext Transfer Protocol
◆ etc. (short, but there is a lot of etc)
– elephant makes room for rhino, hippo, and giraffe
intro
◆ what is the World Wide Web?
◆ an information system that links data from different protocols under one umbrella
◆ it allows pages to be linked together so that you can jump from one to another in a non-continuous way (hypertext over the Internet) (...end of linear thinking ...)
◆ it allows display of graphics (2D images) plus small doses of audio and even video -- tolerates heterogeneous datatypes and very well may slice bread
◆ no, it is NOT the Internet, just one more meta-network. It is a loose collection of technologies though.
key technologies/buzzwords
◆ html and http
◆ html - Hypertext Markup Language
– you probably know some of this (or all?)
◆ http - Hypertext Transfer Protocol
– end to end transport on top of TCP
– uses MIME like SMTP/email, FTP-like error messages
– http used to transfer html and/or other file types
◆ URLs - addresses
history in 1 slide
◆ 1989, Tim Berners-Lee at CERN (European Laboratory for Particle Physics) proposed World Wide Web protocols
◆ W3 consortium now “leading” effort, includes CERN, MIT, INRIA, see http://www.w3c.org
◆ early browser called Mosaic, done at NCSA (National Center for Supercomputing Applications), 1993
◆ then netscape, then browser wars (netscape vs. IE)
◆ plus a blizzard of possible add-on technologies to extend the web on the client or server sides
– java/CGI&perl/JavaScript/dynamic html/plug-ins, blah, blah
this slide is wrong? - standards
◆ html 4.0, see http://www.w3.org for updates
– ORA book on HTTP, 3rd edition, 1998
– not the whole picture of course given netscape vs IE hooks
◆ rfc2616, http/1.1
◆ many other possible documents including security-related (SSL) + email/MIME RFCs
intro - concepts
◆ web client supports > 1 protocol for fetching documents, HTTP (native web), ftp, gopher, USENET, WAIS, telnet(?), etc.
◆ HTML page == formatted ( graphics+text+links )
◆ key here is tying graphics and text together, along with hypertext links; i.e., a discontinuous jump to other material anywhere on net
◆ link = to ftp, to telnet, to more HTML hypertext, to arbitrary program at web server
basic client/server architecture
[figure: browser (netscape/IE as CLIENT) sends "http get file" over tcp to port 80 of web server/httpd; server reads file from file system (~jrb/index.html + jrb.png) and returns MIME type + .html/.png file to client tcp port]
◆ web servers serve html, web browsers may do more than http/html
hypertext links - URLs
◆ a link will normally include a WWW network address for a page or something...
◆ called a URL, Uniform Resource Locator
◆ syntax = protocol://dns name[:tcp port]/file
◆ examples:
http://www.foo.com/index.html
ftp://zymurgy.cs.pdx.edu
file:/some/where/local.txt
telnet://somewhere.mud.edu:8000
gopher://some.gopher.server.edu/
news:alt.fan.cecil-adams (note: no dns name)
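The URL syntax can be taken apart mechanically. A minimal sketch using Python's standard urllib.parse; the helper name describe_url and the DEFAULT_PORTS table are illustrative assumptions, not part of the slides:

```python
from urllib.parse import urlsplit

# Well-known default TCP ports for some of the protocols named on the slide.
DEFAULT_PORTS = {"http": 80, "ftp": 21, "telnet": 23, "gopher": 70}


def describe_url(url):
    """Split a URL into protocol, dns name, tcp port, and file parts."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,  # None when the URL has no dns name
        "port": port,
        "file": parts.path,
    }


if __name__ == "__main__":
    for example in (
        "http://www.foo.com/index.html",
        "telnet://somewhere.mud.edu:8000",
        "news:alt.fan.cecil-adams",
    ):
        print(example, "->", describe_url(example))
```

The news: example comes back with host None, which matches the "no dns name" note.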
this is a GREAT idea
◆ in the history of GREAT ideas ...
◆ and so simple
web browser may speak more than http; e.g., ftp client too
[figure: web client speaks HTTP protocol to HTTP server (www.cs.pdx.edu, .html docs), ftp protocol to ftp server (ftp.cs.pdx.edu/foo.txt), and nntp to news server]
and it may do email, make coffee, change baby diapers, etc...
file format extensibility
◆ web has native set of file formats including HTML docs, and graphics but clients and servers can be extended to handle other file formats (and other code ...)
◆ fetch of .ps file may invoke GNU ghostscript postscript viewer
◆ client/server communicate file format info via mail MIME encoding format
◆ client may be taught to invoke external application to handle new format
file heterogeneity
◆ MIME encodings exist for many kinds of data or can make new one up on fly
◆ e.g., windows pc netscape viewer can be taught to invoke powerpoint for “.ppt” files
◆ can invoke audio/video viewers for sound and video files
◆ client may not know how to display a server-side file, should just do download
client/server code extensibility
◆ for many reasons, desirable to extend both client/server functionality
– just a few examples of MANY possible technologies
◆ server-side
– common-gateway-interface with perl
– basically API between web server and some program to pass parameters coming in over the web
– could invoke database OR whatever
client-side extensibility
◆ may wish better gui/formatting than with just plain HTML OR
◆ wish to offload work from busy server (server scalability issues)
◆ can use java/JavaScript, etc
– java can be used on server-side for that matter
◆ our goal here is NOT to explore these issues (basically just http ...)
intro - summary
◆ platform independent (HTML)
◆ protocol opaque (HTTP/ftp/gopher, etc.)
◆ ties docs together over net with hypertext links (HTML/links)
◆ 2-d graphics (HTML)
◆ can tolerate file heterogeneity (MIME)
◆ client/server extensions via various programming languages/techniques
intro - HTML
◆ HTML is a language that consists of ways of “marking up” text and including pictures and links
◆ the markup symbols are called tags and are not displayed at the viewer, rather they are interpreted as suggestions as to how to format the display
◆ clients format HTML as best they can - interpretation is not the same from client to client
◆ tags include ways to include pictures in GIF/JPEG/png format, links, paragraphs, lists, GUI objects like buttons and fill-in fields (forms)
◆ note: html really is NOT a networking protocol, just a display language somewhat akin to postscript/NROFF/TeX
HTML example
◆ <p> blah blah blah </p>
<p> blegh <b>foo!</b> </p>
◆ when we fetch and display the HTML:
blah blah blah
blegh foo!
HTML < SGML
◆ HTML is subset of SGML, Standard Generalized Markup Language
◆ SGML used by US DOD, ISO developed
◆ software exists for SGML
◆ key is that HTML is simplified over SGML
◆ another key: “trust the client”
◆ tradeoff: platform independence versus authoring control
religion and html
◆ some want absolute control over how their data is displayed, want “physical” control
◆ some want platform independence, want “logical” suggestions where client does best job it can according to local circumstances
◆ <strong> Do It My Way! </strong>
◆ <b> Do It My Way! </b>
some basic html tags
element   type        description
A         container   src/dest of link
B         container   bold text
LINK      empty       link from this doc
BR        empty       line break
H1...H6   container   heading level
IMG       empty       image
LI        empty       list item
UL        container   unordered list
P         empty       paragraph
HR        empty       horizontal rule
html example - the basic skeleton
<html>
<head>
<title> Simple Web Page </title>
<link rev="MADE" href="mailto:[email protected]">
</head>
<body>
THE BODY GOES HERE
</body>
</html>
html body - the inside
<h1> Simple Web Page - first level header </h1>
Here is a picture of my friend, Bev Kramlich, hope she never hears about this. <P>
<img src="bevk.png"> Bev Kramlich <P>
<h2> A Second level header. Plus Interesting Web Places to Visit </h2>
<!-- <b> you didn’t see this </b> -->
<UL>
<LI> <A href="http://www.NCSA.uiuc.edu/SDG/People/robm/sg.html">A Typical System Administrator</A>
</UL>
<hr> <address> [email protected] </address>
the result...
[figure: the page above as displayed by a browser]
HTTP protocol - encapsulation
[figure: packet layout: link | ip | tcp | http | data]
http on top of tcp on top of IP
data == ASCII text, html, image, typed with MIME type
HTTP - Hypertext Transfer Protocol
◆ protocol web clients use to talk to “web” servers (use http/fetch html)
◆ TCP-based, typically to server port 80
◆ simple request/response protocol
◆ client makes request, tells server what it can handle for file types
◆ server responds with MIME type + data file, type info usually gained from file suffix (foo.png)
◆ commands done in ASCII, errors in ASCII
HTTP, cont.
◆ commands called “methods”, but for the most part, just a variation on “get file”
◆ server status and errors similar to error strings found in ftp/email
– 200 - successful
– 300 - not done yet; e.g., 301 is moved permanently
– 400 - client error; e.g., 403 forbidden (server refuses)
– 500 - server error; 503 service unavailable at the moment
◆ http 1.0 being replaced by http 1.1
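The status ranges are decided by the first digit of the three-digit code. A small sketch of that grouping; the function name status_class is an illustrative assumption, and only the four classes listed on the slide are covered:

```python
# Status classes as listed on the slide, keyed by the hundreds digit.
STATUS_CLASSES = {
    2: "successful",
    3: "not done yet",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Return the slide's name for the class a status code falls in."""
    return STATUS_CLASSES.get(code // 100, "unknown")


if __name__ == "__main__":
    for code in (200, 301, 403, 503):
        print(code, status_class(code))
```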
protocol overview
client                                   server
TCP connect                      ->      TCP/socket accept (port 80)
HTTP get file,
accept filetypes x, y, ...       ->
                                 <-      server status + header info
                                 <-      MIME type, return file data
<read and display data>
TCP close
note: typically DNS before TCP ...
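The client side of this exchange fits in a few lines of socket code. A hedged sketch, assuming an HTTP/1.0 server that closes the connection after the response; the names build_get_request and fetch, and the default Accept list, are illustrative:

```python
import socket


def build_get_request(path, accept=("text/html", "image/png")):
    """Build an HTTP/1.0 GET request that lists the file types we accept."""
    lines = [
        f"GET {path} HTTP/1.0",
        "Accept: " + ", ".join(accept),
        "",  # blank line ends the request headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host, path, port=80, timeout=10.0):
    """DNS lookup + TCP connect, send the request, read until the server closes."""
    # create_connection resolves the name (the DNS step) and then connects.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: response is complete
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch("localhost", "/foo.html") against a local web server would return the raw status line, headers, and body as one byte string.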
HTTP 1.0 Methods (3)
◆ GET file - used for fetching most HTML documents, file is URL minus protocol/DNS portions.
– may use conditional If-Modified-Since time; get is done only if object is newer than time
– also used for one form of cgi-bin forms (“get”)
◆ HEAD file - get server-side header info about file but not file itself. Used for link test, cache test.
◆ POST cgi-bin/file - another way to do forms
– theoretically used to annotate/append/“post” message or send record to database
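The conditional GET boils down to one date comparison on the server. A minimal sketch, assuming the header carries an RFC 1123 style date; the function name should_send_body is illustrative, and a header that cannot be parsed is simply ignored:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def should_send_body(last_modified, if_modified_since=None):
    """True if the object is newer than the client's If-Modified-Since time."""
    if not if_modified_since:
        return True  # unconditional GET
    try:
        since = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return True  # unparseable date: fall back to a normal GET
    if since.tzinfo is None:
        since = since.replace(tzinfo=timezone.utc)
    return last_modified > since


if __name__ == "__main__":
    modified = datetime(1994, 11, 16, 21, 18, 37, tzinfo=timezone.utc)
    print(should_send_body(modified, "Tue, 15 Nov 1994 00:00:00 GMT"))
```

When the function returns False a server would answer with status 304 (not modified) and no body.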
HTTP methods, cont.
◆ but other methods have been proposed; e.g.,
◆ PUT - put new URL and overwrite old one
◆ DELETE
◆ question is: how to authorize remote file access; i.e., how to make it secure so can do PUT/DELETE, therefore less available
◆ original designers hoped that annotation of pages would be possible (yellow-sticky analogy along with hypertext) - unsuccessful idea at this point
protocol trace - GET method
request:
% telnet localhost 80
GET /foo.html HTTP/1.0 <cr> <cr>
header:
HTTP/1.0 200 OK
Date: Tuesday, 22-Nov-94 18:10:58 GMT
Server: NCSA/1.3
MIME-version: 1.0
Content-type: text/html
Last-modified: Wednesday, 16-Nov-94 21:18:37 GMT
Content-length: 1115
body:
<HTML> <TITLE> Joe FooBar’s Home ... </TITLE>
HTML...etc., etc...
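A response like the one in the trace splits into a status line, header lines, and a body at the first blank line. A minimal parsing sketch; the function name parse_response is illustrative, and it assumes a well-formed response with CRLF line endings and a reason phrase:

```python
def parse_response(raw):
    """Split raw response bytes into (status code, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")  # blank line separates header/body
    lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = lines[0].split(" ", 2)  # e.g. "HTTP/1.0 200 OK"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


if __name__ == "__main__":
    raw = (
        b"HTTP/1.0 200 OK\r\n"
        b"Server: NCSA/1.3\r\n"
        b"Content-type: text/html\r\n"
        b"\r\n"
        b"<HTML> ... </HTML>"
    )
    print(parse_response(raw))
```

Header names are lowercased because they are case-insensitive; the trace itself mixes "Content-type" and "MIME-version" styles.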
server-side note (file mapping)
◆ in previous slide /foo.html is mapped on server side to server documents file tree
◆ e.g., with server on UNIX, file tree may be /usr/local/httpd/htdocs/index.html
◆ GET / -> root is mapped to index.html
◆ /foo.html would be in above htdocs directory
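The mapping from request path to server file can be sketched as below. The document root is the one from the slide; the function name map_path and the handling of ".." are assumptions about what a careful server would do, not a description of any particular httpd:

```python
import posixpath

DOC_ROOT = "/usr/local/httpd/htdocs"  # document tree from the slide


def map_path(url_path, doc_root=DOC_ROOT):
    """Map the path in 'GET <path>' onto a file under the document root."""
    # Normalizing an absolute path collapses "." and ".." so the result
    # cannot climb above "/", i.e. above the document root.
    clean = posixpath.normpath("/" + url_path.lstrip("/"))
    if clean == "/" or url_path.endswith("/"):
        clean = posixpath.join(clean, "index.html")  # directory -> index.html
    return doc_root + clean


if __name__ == "__main__":
    print(map_path("/"))
    print(map_path("/foo.html"))
```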
user home page - UNIX server
◆ on UNIX server, users may have home pages
◆ http://foo.org/~bob
◆ cd ~bob; mkdir public_html;
◆ make it world readable (and searchable, since it is a directory), chmod 755 public_html
◆ make your home page public_html/index.html
◆ make it world readable too
◆ so http://foo.org/~bob as URL is mapped to the file ~bob/public_html/index.html by web server
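The ~user mapping follows the same pattern as the document-root mapping. A hedged sketch: the home directory base /home and the function name map_user_path are assumptions for illustration (a real server would look up the user's home directory), while public_html and index.html come from the slide:

```python
import posixpath


def map_user_path(url_path, home_base="/home", user_dir="public_html"):
    """Map /~user/... onto <home>/user/public_html/...; None if not a ~user URL."""
    if not url_path.startswith("/~"):
        return None
    user, _, rest = url_path[2:].partition("/")
    if not user or user in (".", ".."):
        return None  # refuse names that would escape the home base
    clean = posixpath.normpath("/" + rest)
    if clean == "/" or rest.endswith("/"):
        clean = posixpath.join(clean, "index.html")
    return posixpath.join(home_base, user, user_dir) + clean


if __name__ == "__main__":
    print(map_user_path("/~bob"))
```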
protocol trace - HEAD method
% telnet localhost 80
HEAD / HTTP/1.0 <cr> <cr>
HTTP/1.0 200 OK
Date: Tuesday, 22-Nov-94 18:13:45 GMT
Server: NCSA/1.3
MIME-version: 1.0
Content-type: text/html
Last-modified: Wednesday, 16-Nov-94 21:18:37 GMT
Content-length: 1115
<connection closed>
some example MIME types
MIME type                 viewer action
text/plain                no formatting
text/html                 display as HTML (.html)
application/postscript    fireup ps viewer (.ps)
application/powerpoint    fireup powerpoint (.ppt)
image/jpeg                jpeg image, inline display (.jpeg)
image/gif                 gif image, inline display (.gif)
audio/basic               u-law format, fireup audio playback (.au)
video/mpeg                short “movie”, fireup mpeg player (.mpeg)
audio/x-midi              MIDI file format (.mid)
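On the server side, this table amounts to a lookup from file suffix to MIME type. A minimal sketch built from the suffixes listed here; the .txt entry, the application/octet-stream fallback, and the name content_type_for are assumptions added for illustration:

```python
import posixpath

# Suffix -> MIME type, taken from the table; ".txt" is an added assumption.
MIME_TYPES = {
    ".txt": "text/plain",
    ".html": "text/html",
    ".ps": "application/postscript",
    ".ppt": "application/powerpoint",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".au": "audio/basic",
    ".mpeg": "video/mpeg",
    ".mid": "audio/x-midi",
}


def content_type_for(filename, default="application/octet-stream"):
    """Pick the Content-type header value from the file suffix."""
    _, ext = posixpath.splitext(filename)
    return MIME_TYPES.get(ext.lower(), default)


if __name__ == "__main__":
    print(content_type_for("index.html"))
```

An unknown suffix falls through to the default, which a client would typically just download rather than display.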
MIME type extensibility
◆ on server, add types to server config file, server associates file extension with MIME type
◆ on client, teach client about local viewer apps; e.g., windows NCSA mosaic.ini file
[Viewers]
TYPE10="audio/x-midi"
audio/x-midi="mplayer %ls"
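The client-side half can be sketched by reading a viewer table in the style of the mosaic.ini fragment. This is an illustration only: it assumes each TYPEn entry names a MIME type, that a second entry keyed by that type gives the command, and that %ls stands for the downloaded file; load_viewers and viewer_command are made-up names:

```python
import configparser

INI_TEXT = """[Viewers]
TYPE10="audio/x-midi"
audio/x-midi="mplayer %ls"
"""


def load_viewers(text):
    """Return {mime type: command template} from a [Viewers] section."""
    parser = configparser.ConfigParser(interpolation=None)  # keep "%ls" literal
    parser.read_string(text)
    section = parser["Viewers"]
    viewers = {}
    for key, value in section.items():
        if key.startswith("type"):  # configparser lowercases option names
            mime = value.strip('"')
            command = section.get(mime)
            if command:
                viewers[mime] = command.strip('"')
    return viewers


def viewer_command(viewers, mime, filename):
    """Build the argument list for the external viewer, or None if unknown."""
    template = viewers.get(mime)
    if template is None:
        return None
    return [filename if arg == "%ls" else arg for arg in template.split()]


if __name__ == "__main__":
    table = load_viewers(INI_TEXT)
    print(viewer_command(table, "audio/x-midi", "song.mid"))
```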
http 1.1 - just an introduction
◆ RFC 2616 - fundamental redefinition of HTTP 1.0
– 176 pages long ...
◆ host request header - next slide
◆ must support persistent connections
– one TCP connection, many itty-bitty image files
– not one connection per file
– good for TCP and good for the Inet
host request
◆ client may send:
GET /pub/WWW/TheProject.html HTTP/1.1
Host: www.w3.org
◆ (absolute URL or path) + host info (may include port)
◆ may help eliminate wasteful binding of IP addresses to ONE server, since this info is now available to server (not buried in stack)
◆ can now bind multiple names to one IP address
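With the Host header, one server on one IP address can pick a document tree per name. A minimal sketch; the host names, directory paths, and the function name pick_doc_root are hypothetical:

```python
# Hypothetical names sharing one IP address, each with its own document tree.
VHOSTS = {
    "www.foo.org": "/srv/foo/htdocs",
    "www.bar.org": "/srv/bar/htdocs",
}


def pick_doc_root(request_text, vhosts=VHOSTS, default=None):
    """Choose a document root from the Host header of a raw request."""
    for line in request_text.split("\r\n")[1:]:  # skip the request line
        if not line:
            break  # blank line: end of headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip().lower().partition(":")[0]  # drop optional port
            return vhosts.get(host, default)
    return default


if __name__ == "__main__":
    request = "GET /index.html HTTP/1.1\r\nHost: www.bar.org\r\n\r\n"
    print(pick_doc_root(request))
```

An HTTP/1.0 request without a Host header falls back to the default, which is why the server previously needed one IP address per site.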
etc. section - a few more tricks
◆ proxies
◆ cgi-bin, quick overview
◆ security
◆ server-side scalability IS A PROBLEM
◆ there is no end to this ...
– use the web to learn about the web
– after all, WWW put the Internet on the map
http - proxy extension
◆ Internet-capable server can act as proxy for clients not on Internet - useful in firewall situations
◆ client simply sends http request with real URL (ftp/http/gopher, whatever) encapsulated in http request
◆ server proxies as real client to internet
◆ sends info back to client
◆ server can cache results - useful for Internet-wide efficiency
◆ can do gopher, http, ftp, can’t do telnet of course
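The proxy's job is: read the full URL out of the request line, fetch it as a client, and remember the answer. A toy sketch with an in-memory cache; the class name CachingProxy is illustrative, the fetch function is supplied by the caller, and cache expiry is left out:

```python
class CachingProxy:
    """Toy proxy: full URL in the request line, results cached by URL."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable(url) -> bytes, acts as the real client
        self._cache = {}

    def handle(self, request_line):
        # A proxied request carries the real URL, e.g.
        # "GET ftp://zymurgy.cs.pdx.edu/foo.txt HTTP/1.0"
        method, url, _version = request_line.split()
        if method != "GET":
            return self._fetch(url)  # only GET results are cached here
        if url not in self._cache:
            self._cache[url] = self._fetch(url)  # miss: go to the Internet
        return self._cache[url]  # hit: answer locally


if __name__ == "__main__":
    calls = []

    def fake_fetch(url):
        calls.append(url)
        return b"data for " + url.encode("ascii")

    proxy = CachingProxy(fake_fetch)
    proxy.handle("GET http://www.foo.com/index.html HTTP/1.0")
    proxy.handle("GET http://www.foo.com/index.html HTTP/1.0")
    print(len(calls))
```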
proxy picture
[figure: client -> proxy server (with disk cache) -> remote Inet server (http: ftp:); e.g., ftp get]
cgi-bin: server-side extensibility
◆ cgi - Common Gateway Interface
◆ server-side app lives in /<server-path>/cgi-bin, invoked by web server.
◆ can be coded in C, perl, C++, shellscript, java
◆ needs to be able to write to stdout, read environment variables or stdin
◆ conventions exist (GET/POST) for passing parameters from form to cgi applet
◆ cgi app can send more HTML back to client as output, which may in turn have more form tags/cgi references
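A cgi app following these conventions can be very small. A sketch in Python (not one of the languages listed above, used only for illustration): parameters arrive in the QUERY_STRING environment variable for GET or on stdin for POST, and the reply goes to stdout. The form field "name" and the function names are assumptions:

```python
import os
import sys
from html import escape
from urllib.parse import parse_qs


def read_params(environ, stdin):
    """Collect form parameters per the CGI GET/POST conventions."""
    if environ.get("REQUEST_METHOD", "GET").upper() == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)  # POST: parameters arrive on stdin
    else:
        raw = environ.get("QUERY_STRING", "")  # GET: parameters in environment
    return {key: values[0] for key, values in parse_qs(raw).items()}


def render(params):
    """Produce the header block, a blank line, then HTML for the client."""
    name = escape(params.get("name", "stranger"))
    body = f"<html><body><p>hello, {name}</p></body></html>"
    return "Content-type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    sys.stdout.write(render(read_params(os.environ, sys.stdin)))
```

Escaping the parameter before echoing it matters because the value comes straight from a stranger's form.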
forms + cgi-bin apps
◆ one can invoke “forms” on the client-side
◆ forms consist of a limited set of GUI objects, text fill areas, fill-in fields, select menus, buttons
◆ all expressed as HTML tags in the HTML src
◆ when form is complete, user “sends” via embedded URL to backend cgi-bin app located at http server
◆ server-side cgi-bin app processes form
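What the browser "sends" is the form's fields encoded as name=value pairs, either appended to the URL (GET) or carried in the request body (POST). A sketch of both encodings; the cgi-bin path, field names, and function names are hypothetical:

```python
from urllib.parse import urlencode


def form_get_url(action, fields):
    """GET form: parameters ride in the URL after a '?'."""
    return action + "?" + urlencode(fields)


def form_post_request(path, fields):
    """POST form: parameters ride in the request body."""
    body = urlencode(fields)
    return (
        f"POST {path} HTTP/1.0\r\n"
        "Content-type: application/x-www-form-urlencoded\r\n"
        f"Content-length: {len(body)}\r\n"
        "\r\n"
        f"{body}"
    )


if __name__ == "__main__":
    fields = {"name": "Joe FooBar", "q": "a&b"}
    print(form_get_url("/cgi-bin/lookup", fields))
    print(form_post_request("/cgi-bin/lookup", fields))
```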
cgi-bin architecture
[figure: client (form shown here) sends GET/POST + params to web server; web server passes them over a pipe to cgi-bin app; form output returns to client]
security
◆ end/end exists, authentication and/or encryption
◆ plaintext password and/or IP address authentication exist
– not ideal for the usual reasons
◆ SSL offers easy server-side encryption
– client-side authentication less-easy
– leads to issues of Public Key Infrastructure
◆ beware: download of code from strangers
◆ privacy issues; e.g., cookies which are ASCII state stored by server at client
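Cookies really are just ASCII name=value state: the server hands one out in a Set-Cookie header and the client sends it back in a Cookie header on later requests. A small sketch with Python's standard http.cookies module; the cookie names and values are made up:

```python
from http.cookies import SimpleCookie


def set_cookie_header(name, value):
    """Server side: header line that stores state at the client."""
    cookie = SimpleCookie()
    cookie[name] = value
    return cookie.output()


def read_cookies(cookie_header_value):
    """Server side: recover the state the client sent back."""
    cookie = SimpleCookie()
    cookie.load(cookie_header_value)
    return {name: morsel.value for name, morsel in cookie.items()}


if __name__ == "__main__":
    print(set_cookie_header("session", "abc123"))
    print(read_cookies("session=abc123; theme=dark"))
```

The privacy concern follows directly: whatever the server stores this way is replayed to it on every later visit.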
server-side scalability is challenging issue
◆ 1 server - 100 million clients want ONE PAGE RIGHT NOW!
◆ intranet solutions include:
– round-robin DNS
– NAT-like remapping of local addresses, 1 to many
◆ Internet solutions
– try to determine “nearest” server and bounce request (e.g., use BGP routing info)
– try to build large web of smart servers and clever rewrite/caching schemes at application layer
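Round-robin DNS is the simplest of these to picture: one name maps to several server addresses and the name server rotates the list on each answer, so successive clients land on different machines. A toy sketch; the class name, host name, and addresses are hypothetical:

```python
from collections import deque


class RoundRobinResolver:
    """Toy name server that rotates the address list on every lookup."""

    def __init__(self, records):
        self._records = {name: deque(addrs) for name, addrs in records.items()}

    def resolve(self, name):
        addrs = self._records[name]
        answer = list(addrs)  # clients typically try the first address
        addrs.rotate(-1)      # next lookup starts with the next server
        return answer


if __name__ == "__main__":
    dns = RoundRobinResolver(
        {"www.foo.org": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]}
    )
    for _ in range(4):
        print(dns.resolve("www.foo.org")[0])
```

This spreads load evenly only on average: it knows nothing about server health or how busy each machine is.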