Copyright © Ellis Horowitz 1999-2012

Lecture

The Internet and Web Basics


The Internet and the WWW are Different

• The Internet is a global digital infrastructure that connects hundreds of millions of computers and people
• The World Wide Web is a mechanism that unifies the retrieval and display of a subset of data on the Internet
• An intranet is a local/global information structure that connects an organization internally. Intranets today often make use of Web technologies
• An extranet is a private network that uses the public telecommunication system to securely share part of a business's information or operations with suppliers, vendors, partners, customers, or other businesses.


How Big is the Internet - https://www.isc.org/solutions/survey

Date   | Host Count
-------+------------
Jul 11 | 849,869,781
Jan 11 | 818,374,269
Jul 10 | 768,913,036
Jan 10 | 732,740,444
Jul 09 | 681,064,561
Jan 09 | 625,226,456
Jul 08 | 570,937,778
Jan 08 | 541,677,360
Jul 07 | 489,774,269
Jan 07 | 433,193,199
Jul 06 | 439,286,364
Jan 06 | 394,991,609
Jul 05 | 353,284,187
Jan 05 | 317,646,084
Jul 04 | 285,139,107
Jan 04 | 233,101,481
Jan 03 | 171,638,297
Jul 02 | 162,128,493
Jan 02 | 147,344,723
Jul 01 | 125,888,197
Jan 01 | 109,574,429
Jul 00 |  93,047,785
Jan 00 |  72,398,092
Jul 99 |  56,218,000
Jan 99 |  43,230,000
Jul 98 |  36,739,000
Jan 98 |  29,670,000
Jul 97 |  19,540,000
Jan 97 |  16,146,000
Jul 96 |  12,881,000
Jan 96 |   9,472,000
Jul 95 |   6,642,000
Jan 95 |   4,852,000
Jul 94 |   3,212,000
Jan 94 |   2,217,000
Jul 93 |   1,776,000

Hosts are/were doubling every 18 months

See the survey background at: http://www.isc.org/solutions/survey

It counts the number of IP addresses that have been assigned a name. The survey queries the domain name system for the name assigned to every possible IP address. But rather than sending a query to every one of the 4.3 billion possible IP addresses, the survey starts with a list of all network numbers that have been delegated within the IN-ADDR.ARPA domain. See http://www.isc.org/solutions/survey/background for details
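
The doubling claim can be checked against the table. The sketch below is only an illustration: the function name is made up, and it assumes smooth exponential growth between the two chosen survey dates. Under that assumption the 1993-2000 rows give a doubling time of roughly 15 months, while the 2000-2011 rows give roughly 39 months, which is why "are/were" is the right hedge.

```python
import math

def doubling_time_months(count_start, count_end, months_elapsed):
    """Months per doubling, assuming steady exponential growth in between."""
    doublings = math.log2(count_end / count_start)
    return months_elapsed / doublings

# Host counts copied from the survey table above.
early = doubling_time_months(1_776_000, 72_398_092, 78)     # Jul 93 -> Jan 00
late = doubling_time_months(72_398_092, 849_869_781, 138)   # Jan 00 -> Jul 11

print(f"1993-2000: about {early:.1f} months per doubling")
print(f"2000-2011: about {late:.1f} months per doubling")
```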


Key Internet Technologies

• Packet switching - permits multiple pairs of computers to communicate over a shared network
  – Messages/files are broken into segments of varying size, called packets.
  – Each packet is labeled with source and destination addresses
  – The receiver must re-assemble the packets in the proper order
  – The inventor(s) of packet switching are in dispute, see http://query.nytimes.com/gst/fullpage.html?res=9F0CE3DA1139F93BA35752C1A9679C8B63
• IP Addresses
  – An IPv4 address is a 32-bit number, from 0 to about 4.3 billion
  – These numbers are written as four sets of eight bits each, network.subnetwork.subnetwork.computer
• TCP/IP protocol (see ahead)
• Domain Name System (see ahead)
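
The packet-switching steps listed above can be sketched in a few lines. This is a toy model, not a real protocol: the dictionary layout, the fixed packet size, and the "seq" field are assumptions added so the receiver has something to sort by, and 192.0.2.10 is just a placeholder destination.

```python
import random

def packetize(message, src, dst, size=8):
    """Break a message into packets labeled with source and destination."""
    return [
        {"src": src, "dst": dst, "seq": seq, "data": message[start:start + size]}
        for seq, start in enumerate(range(0, len(message), size))
    ]

def reassemble(packets):
    """Put packets back in sequence order and join their payloads."""
    ordered = sorted(packets, key=lambda packet: packet["seq"])
    return "".join(packet["data"] for packet in ordered)

message = "Packets may arrive out of order."
packets = packetize(message, src="128.125.7.29", dst="192.0.2.10")
random.shuffle(packets)  # simulate packets taking different routes
print(reassemble(packets) == message)
```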


IPv4 Address

• A 32-bit number divided into four 8-bit numbers, e.g. a.b.c.d
• There are five classes of IP addresses (A, B, and C hold ordinary host addresses)
  – class A - 16 million hosts on 127 networks
  – class B - 65,000 hosts on 16,000 networks
  – class C - 254 hosts on 2 million networks
  – class D - reserved for multicast
  – class E - reserved for IETF for its research purposes
• USC has a class B license, 128.125.x.y
• We are running out of IP addresses
  – CIDR* addressing, now in common practice, makes more IP addresses available
  – see http://www.webopedia.com/TERM/C/CIDR.html

Class | Leftmost bits | Network | Local Address (host)
------+---------------+---------+---------------------
A     | 0             | 7 bits  | 24 bits
B     | 10            | 14 bits | 16 bits
C     | 110           | 21 bits | 8 bits
D     | 1110          | 28 bits (Multicast address)

*Classless Inter-Domain Routing
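
As a rough illustration of the leftmost-bits table above, the class of an address can be read off its first octet. The function names here are invented for the example.

```python
def ipv4_to_int(dotted):
    """Pack a dotted-quad string such as '128.125.7.29' into one 32-bit integer."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def address_class(dotted):
    """Classify an address by the leading bits of its first octet."""
    first_octet = ipv4_to_int(dotted) >> 24
    if first_octet & 0b1000_0000 == 0b0000_0000:
        return "A"  # leading bit 0
    if first_octet & 0b1100_0000 == 0b1000_0000:
        return "B"  # leading bits 10
    if first_octet & 0b1110_0000 == 0b1100_0000:
        return "C"  # leading bits 110
    if first_octet & 0b1111_0000 == 0b1110_0000:
        return "D"  # leading bits 1110
    return "E"      # leading bits 1111

print(address_class("128.125.7.29"))  # the USC network mentioned above
```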


IP Address (IPv4)

IPv4 Address Ranges

Class | Leftmost bits | Start address | End address
------+---------------+---------------+----------------
A     | 0xxx          | 0.0.0.0       | 127.255.255.255
B     | 10xx          | 128.0.0.0     | 191.255.255.255
C     | 110x          | 192.0.0.0     | 223.255.255.255
D     | 1110          | 224.0.0.0     | 239.255.255.255
E     | 1111          | 240.0.0.0     | 255.255.255.255

Private Address Ranges

Class | Private start address | Private end address
------+-----------------------+--------------------
A     | 10.0.0.0              | 10.255.255.255
B     | 172.16.0.0            | 172.31.255.255
C     | 192.168.0.0           | 192.168.255.255
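
A small sketch of how the private ranges above might be tested in code, using Python's standard ipaddress module. The three ranges are written in slash notation (/8, /12, /16), which covers exactly the start and end addresses in the table; the helper name is an assumption.

```python
import ipaddress

# The three private ranges from the table, expressed as prefix blocks.
PRIVATE_RANGES = [
    ipaddress.ip_network("10.0.0.0/8"),      # 10.0.0.0 - 10.255.255.255
    ipaddress.ip_network("172.16.0.0/12"),   # 172.16.0.0 - 172.31.255.255
    ipaddress.ip_network("192.168.0.0/16"),  # 192.168.0.0 - 192.168.255.255
]

def is_private(dotted):
    """True if the address falls inside one of the private ranges above."""
    address = ipaddress.ip_address(dotted)
    return any(address in network for network in PRIVATE_RANGES)

print(is_private("192.168.0.10"), is_private("128.125.7.29"))
```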


Classless Inter-Domain Routing 

• An IP addressing scheme that replaces the older system based on classes A, B, and C.
  – A single IP address can be used to designate many unique IP addresses.
  – A CIDR IP address looks like a normal IP address except that it ends with a slash followed by a number, called the IP network prefix. For example: 172.200.0.0/16
  – The IP network prefix specifies how many addresses are covered by the CIDR address, with lower numbers covering more addresses. CIDR currently uses prefixes anywhere from 13 to 27 bits
  – For example, in the CIDR address 206.13.01.48/25, the "/25" indicates the first 25 bits are used to identify the unique network, leaving the remaining bits to identify the specific host.

CIDR Block Prefix | # Equivalent Class C | # of Host Addresses
------------------+----------------------+--------------------
/27               | 1/8th of a Class C   | 32 hosts
/26               | 1/4th of a Class C   | 64 hosts
/25               | 1/2 of a Class C     | 128 hosts
/24               | 1 Class C            | 256 hosts
/23               | 2 Class C            | 512 hosts
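
The prefix arithmetic in the table can be reproduced with a short sketch. It uses Python's standard ipaddress module; the example address is written as 206.13.1.48 (without the leading zero, which some Python versions reject), and block_size is a helper invented for the illustration.

```python
import ipaddress

def block_size(prefix_length):
    """Number of IPv4 addresses covered by a prefix of the given length."""
    return 2 ** (32 - prefix_length)

# strict=False lets us pass a host address and get its enclosing block.
network = ipaddress.ip_network("206.13.1.48/25", strict=False)
print(network, network.num_addresses)

for prefix_length in (27, 26, 25, 24, 23):
    print(f"/{prefix_length}: {block_size(prefix_length)} addresses")
```

A lower prefix number leaves more host bits, so each step down doubles the block.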


IPv6

• IPv4 uses a 32-bit address space
• IPv6 uses a 128-bit address space
• IPv6 supports:
  – a total of more than 3 x 10^38 addresses – OR –
  – a total of 6 x 10^23 addresses for every square meter on the Earth’s surface
• Currently Internet routers must support both IPv4 and IPv6
• The conversion to IPv6 was slowed by the use of NAT routers (Network Address Translation)
  – A NAT router listens to outbound data packets from local devices and reroutes these packets to the global Internet, while rewriting the IP address and port number in these packets.
  – Each outgoing packet stream is assigned a unique port number. Incoming packets are scanned for the port numbers, and if these port numbers match an existing communication stream or a port number in a fixed routing table, the destination IP address and port number are rewritten and the packet is forwarded to the internal device.
• References (IPv4)
  – http://compnetworking.about.com/library/weekly/aa042400b.htm (IP Tutorial)
• References (IPv6):
  – http://www.ipv6.org/
  – http://www.pcsupportadvisor.com/nasample/c0655.pdf (Understanding IPv6 by David Morton)
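
The NAT behavior described on this slide can be pictured as a pair of lookup tables. The class below is a hypothetical, much-simplified model (no protocols, timeouts, or fixed routing entries); the class name, the starting port 40000, and the addresses are all assumptions for illustration.

```python
import itertools

class NatRouter:
    """Toy NAT: maps (private address, port) pairs to unique public ports."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self._next_port = itertools.count(first_port)
        self._outbound = {}  # (private_ip, private_port) -> public_port
        self._inbound = {}   # public_port -> (private_ip, private_port)

    def translate_outbound(self, private_ip, private_port):
        """Rewrite the source of an outgoing packet stream."""
        key = (private_ip, private_port)
        if key not in self._outbound:
            public_port = next(self._next_port)
            self._outbound[key] = public_port
            self._inbound[public_port] = key
        return self.public_ip, self._outbound[key]

    def translate_inbound(self, public_port):
        """Find the internal device for an incoming packet, or None if unknown."""
        return self._inbound.get(public_port)

nat = NatRouter("203.0.113.5")
print(nat.translate_outbound("192.168.0.10", 5000))
print(nat.translate_outbound("192.168.0.11", 5000))
print(nat.translate_inbound(40001))
```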


IPv4 vs. IPv6

• IPv4 is 32 bits divided into four 8-bit segments, separated by dots. Each segment is a number between 0 and 255
• IPv6 is 128 bits divided into eight 16-bit segments, separated by colons. Each segment is a number between 0 and 2^16-1. The number is written as 4 hexadecimal digits
• IPv6 uses 16 octets (128-bit addresses) instead of 4 (32-bit addresses)
• Here is an example
  – 2001:0011:abcd:0000:0000:0000:0023:4567
  – 2001 is the address type.
  – The 0011:abcd defines the subnet. A /48 subnet is typical. That means that the first 48 bits of the IPv6 addresses you get are fixed, and you are free to assign values to the other 80 bits for each of the devices in your network. In some cases a /64 subnet is assigned to you where the first 64 bits are fixed.
  – The ISP will route all traffic for destinations where the address begins with 2001:0011:abcd or 2001:0011:abcd:0000 to a single internet connection. The connection must route these addresses further to internal devices.
  – The difference between IPv4 and IPv6 is that instead of one IP address which can point to only one device, we get a truckload full of IP addresses which we have to route ourselves.
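
The example address above can be explored with Python's standard ipaddress module. This is only a sketch of the notation and the /48 arithmetic; the variable names are arbitrary.

```python
import ipaddress

address = ipaddress.IPv6Address("2001:0011:abcd:0000:0000:0000:0023:4567")
print(address.exploded)    # all eight 16-bit segments, 4 hex digits each
print(address.compressed)  # leading zeros dropped, one zero run written as ::

# A /48 fixes the first 48 bits and leaves 128 - 48 = 80 bits to assign.
site = ipaddress.ip_network("2001:0011:abcd::/48")
print(address in site)
print(site.num_addresses == 2 ** 80)
```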


TCP/IP

• TCP/IP is a two-layer program.
  – The higher layer, Transmission Control Protocol, manages the assembling of a message or file into smaller packets that are transmitted over the Internet and received by a TCP layer that reassembles the packets into the original message.
  – The lower layer, Internet Protocol, handles the address part of each packet so that it gets to the right destination.
    • Each gateway computer (router) on the network checks this address to see where to forward the message. Even though some packets from the same message are routed differently than others, they'll be reassembled at the destination.
• TCP/IP solves several problems of network reliability
  – if a router is overrun with packets, it discards them
  – if a packet is lost, TCP arranges for it to be sent again
    • the receiver acknowledges receipt to the source
    • the sender starts a timer and if no acknowledgement is received it automatically resends the packet
    • the sender’s timer uses a different time depending upon the distance to the destination and current internet traffic
  – it reorders the packets into proper sequence
  – it eliminates duplicate packets
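
The reliability mechanisms listed above can be sketched as follows. This is a deliberately simplified model, not how real TCP is implemented: it sends one packet at a time (stop-and-wait), replaces the timer with a set of packets that are "lost" on their first attempt, and all names are invented for the example.

```python
def send_reliably(payloads, lost_on_first_try):
    """Send packets one at a time, resending any whose ACK never arrives."""
    delivered = []      # (sequence number, payload) pairs the receiver accepted
    transmissions = 0   # packets put on the wire, including resends
    for seq, payload in enumerate(payloads):
        attempt = 0
        while True:
            attempt += 1
            transmissions += 1
            if seq in lost_on_first_try and attempt == 1:
                continue  # no ACK before the timer expires, so send again
            delivered.append((seq, payload))  # receiver acknowledges receipt
            break
    return delivered, transmissions

def receive(packets):
    """Drop duplicate packets and return payloads in sequence order."""
    unique = {}
    for seq, payload in packets:
        unique.setdefault(seq, payload)
    return [unique[seq] for seq in sorted(unique)]

delivered, transmissions = send_reliably(["GET", " /index", ".html"], {1})
print(transmissions)
print(receive([(2, "c"), (0, "a"), (2, "c"), (1, "b")]))
```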


TCP Stack



TCP/IP is a Suite of Protocols

• Routing Protocols include
  – IP (Internet Protocol) actual transmission of data
  – ICMP (Internet Control Message Protocol) handles messages for IP
  – RIP (Routing Information Protocol) determines best routing
  – OSPF (Open Shortest Path First) an alternative routing method
• Network Address Protocols include
  – ARP (Address Resolution Protocol) determines the unique numeric addresses of machines on the network
  – DNS (Domain Name System) determines numeric addresses from machine names
  – RARP (Reverse Address Resolution Protocol) determines the addresses of machines on the network, but in the reverse direction from ARP
• User based services
  – BootP (Bootstrap Protocol) boots a networked machine by reading info from a server
  – FTP (File Transfer Protocol) allows transfer of files across the network
  – Telnet, used to remotely log in to another machine
• Gateway based services
  – EGP (Exterior Gateway Protocol) governs the transfer of routing information for external networks
  – GGP (Gateway-to-Gateway Protocol) handles routing of information between gateways
  – IGP (Interior Gateway Protocol) handles routing of info for internal networks


Layering of TCP/IP Protocols

Layer             | Protocols
------------------+----------------------------------------
application layer | HTTP, FTP, TELNET, NFS/RPC, DNS, SNMP
transport layer   | TCP, UDP
network layer     | IP
data link layer   |

The Open Systems Interconnect (OSI) Reference Model includes 7 layers: application, presentation, session, transport, network, data link and physical.

(Note: use WireShark, a network protocol analyzer, to show packets at each layer. See http://www.wireshark.org/)


HTTP/HTTPS Protocol Stacks


From  HTTP: The Definitive Guide, by David Gourley, Brian Totty


Internet Domain Names

• The Domain Name System is a mapping to/from IP addresses to domain names
  – defined in RFC 1034, 1035, see e.g. http://www.faqs.org/rfcs/rfc1035.html
  – Invented in 1983 by Paul Mockapetris, see http://en.wikipedia.org/wiki/Domain_name_system
• There are 13 top level root name servers, see
  – www.dns.net/dnsrd/tld.html
• ICANN is the organization in charge of maintaining the DNS system, see
  – www.icann.com


Top Level Domain Names

• Top level domains were originally divided into the following logical categories
  – com  commercial and industrial organizations
  – edu  educational institutions
  – gov  non-military, government affiliated organizations
  – mil  military organizations
  – net  network operations
  – org  other organizations and user groups
• New top level domains have been added
  – .biz, .info, .name, .museum, .coop, .aero, .pro, .xxx
    • www.internic.net/faqs/new-tlds.html
• In Oct. 2009 ICANN agreed to accept internationalized domain names, encoded as Unicode:
  – see http://www.icann.org/en/topics/idn/fast-track/
• In 2011 ICANN agreed to offer Generic Top Level domains:
  – see http://www.icann.org/en/tlds/select.htm or the movie at http://newgtlds.icann.org/announcements-and-media/video/overview-en


Domain Names Outside the US

• Countries append their 2-letter country code. Two-letter codes are maintained as an ISO 3166 standard. Here is a sample

AFGHANISTAN        AF
ALBANIA            AL
ALGERIA            DZ
AMERICAN SAMOA     AS
ANDORRA            AD
ANGOLA             AO
ANGUILLA           AI
ANTARCTICA         AQ
ANTIGUA & BARBUDA  AG
ARGENTINA          AR
ARMENIA            AM
ARUBA              AW
AUSTRALIA          AU
AUSTRIA            AT
AZERBAIJAN         AZ
BAHAMAS            BS
BAHRAIN            BH
BANGLADESH         BD
BARBADOS           BB
BELARUS            BY
BELGIUM            BE
BELIZE             BZ
BENIN              BJ
BERMUDA            BM
BHUTAN             BT
BOLIVIA            BO
BOSNIA AND HERZ.   BA
BOTSWANA           BW
BOUVET ISLAND      BV
BRAZIL             BR


DNS Resolution

• The DNS protocol is an important part of the web's infrastructure
• Every time you visit a website, your computer performs a DNS lookup
• Complex pages often require multiple DNS lookups before they start loading, so your computer may be performing hundreds of lookups a day
• DNS latency is mainly due to
  – The round-trip time to make the request and get the response, due to network congestion, overloaded servers, denial-of-service attacks
  – Cache misses, which cause recursive querying of other name servers
• Google has introduced a Public DNS
  – Configure your network to use 8.8.8.8 and 8.8.4.4
  – Google handles more than 70 billion requests a day!
  – Google also has IPv6 addresses
    • 2001:4860:4860::8888 and 2001:4860:4860::8844
  – http://code.google.com/speed/public-dns/docs/intro.html
• Another alternative is opendns.com
  – They have a global network of DNS resolvers to speed resolution
  – The base service is free, but upgrades cost
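
The effect of cache hits and misses on DNS latency, noted above, can be illustrated with a toy resolver. This is a sketch under simplifying assumptions: a dictionary stands in for the other name servers, cached entries never expire (real resolvers honor a time-to-live), and the class name is invented. The single record is the pollux.usc.edu address quoted in this lecture's nslookup example.

```python
class CachingResolver:
    """Toy resolver: answers from a local cache, asks upstream only on a miss."""

    def __init__(self, upstream):
        self._upstream = upstream   # function: name -> address (the slow path)
        self._cache = {}
        self.upstream_queries = 0   # how many cache misses occurred

    def resolve(self, name):
        if name not in self._cache:
            self.upstream_queries += 1
            self._cache[name] = self._upstream(name)
        return self._cache[name]

# Stand-in for the rest of the DNS hierarchy.
RECORDS = {"pollux.usc.edu": "128.125.7.29"}

resolver = CachingResolver(RECORDS.__getitem__)
first = resolver.resolve("pollux.usc.edu")   # miss: goes upstream
second = resolver.resolve("pollux.usc.edu")  # hit: served from the cache
print(first, second, resolver.upstream_queries)
```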


DNS Resolution

• The chart shows the times spent loading a page, where black represents DNS resolution, gray represents connection waiting, yellow represents connection, red is JavaScript parsing, and blue is JavaScript execution.
• There are 13 calls to the DNS resolver and 5 of them are serial lookups, accounting for several seconds of the total 11 seconds spent loading the page


http://code.google.com/speed/public-dns/docs/performance.html


Internet Statistics

Conclusion: the .net and .com categories are the largest, followed by Japan, Italy and Brazil

Distribution of Top-Level Domain Names by Host Count, July 2011, at http://ftp.isc.org/www/survey/reports/2011/01/bynum.txt

Above shows 99 million .com sites out of a total of 135 million, or roughly 73% of the total

See http://www.domaintools.com/internet-statistics


 Who Controls Internet Domain Names

• Granting of domain names is done by a registrar
• Registrars must be approved by ICANN, www.icann.org, the Internet Corporation for Assigned Names and Numbers
• Currently there are more than 100 registrars assigning domain names for *.com, *.org, and *.net
• All domain name registrars share their information with the domain name registry, which for com/net/org is Network Solutions, see:
  – http://www.networksolutions.com/


Internet Domain Name Registrars

• There are three key software systems a registrar must implement
  – A Whois service checks if a domain name is already registered and returns registration information
    • This is a client/server application that interfaces with NSI’s database; a read-only operation
  – A Shared Registry System reserves a name for the registrar
    • This is a client/server application that interfaces with NSI’s database; a write operation
  – A local database maintains customer accounts, domain names, etc.


Internet Traffic

• How efficiently is the Internet working now?
  – http://www.internettrafficreport.com/
  – http://netflow.internet2.edu/

Internet2 is a project to develop new technologies for high-performance computer networking. It is led by a consortium of 206 universities. While it was developed specifically for research and educational purposes, the involvement of research, commercial and government organizations also aims to distribute these technologies into the wider community.

The tables below show the type and amount of traffic

Data Transfers are 41%
HTTP is approx 39%
HTTPS is approx 48% of encrypted traffic


Useful Routines

• traceroute indicates the path of a packet from source to destination
    pollux.usc.edu(19):/usr/sbin/traceroute doc.ic.ac.uk
  Try tracert on Windows
• ping sends a packet and waits for a response; determines if the site is up
    pollux.usc.edu(35):/usr/sbin/ping mit.edu
• nslookup will return the IP address given the domain name, and vice-versa
    nslookup pollux.usc.edu returns
    Name: pollux.usc.edu
    Addresses: 128.125.7.29


How the Internet Functions Today

Wide Area Backbone, e.g. AT&T, SPRINT
    Regional Provider, e.g. Los Nettos
    Regional Provider
    Regional Provider, e.g. Earthlink
        Local   Local   Local   Local   Local


Defining the World Wide Web

• A wide-area hypertext, multimedia information retrieval system that provides access to a large universe of documents
• A uniform way of accessing and viewing some information on the Internet
• The WWW
  – creates a world in which information has a reference by which it can be accessed
  – subsumes the capabilities of ftp, gopher, wais and news


Graphical View of the WWW

[Figure: a browser computer connected through the Internet and an intranet to four Web servers, each backed by its own data source]


 Major Technology Components

• Client/server architecture
  – where client programs interact with web servers
• Network protocol
  – HTTP, Hypertext Transfer Protocol, is the language understood by browsers and web servers
  – designed to move quickly from document to document
• Addressing system (Uniform Resource Locators)
  – http://domain/directory/file.html
• Markup Language
  – every web server understands and every browser displays
  – includes support for HyperText and multimedia


Client/Server Architecture Model

Multiplatform browsers (clients): Terminals, PCs, Macs, X Windows

Mechanisms: Addressing scheme (URL) + Protocols (HTTP, etc.) + Format Negotiation (MIME)

Servers for each protocol: HTTP server, FTP server, Gopher server, NNTP server


The WWW Server

• Web browsers and Web servers communicate according to a protocol known as HTTP (HyperText Transfer Protocol)
  – The current HTTP protocol is version 1.1
• The Web server is a software system running on a machine that is often also called the Web server; don’t confuse them
• A web server can
  – receive and reply to HTTP requests
  – retrieve documents from specified directories
  – run programs in specified directories
  – handle limited forms of security
• A web server does not
  – know about the contents of a document, links in a document, images in a document or whether a particular file, e.g. a *.gif file, is in the correct format
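
To make the request/reply role of a web server concrete, here is a minimal sketch that handles one HTTP request held in a string. It is not a real server: there are no sockets, the "specified directory" is a plain dictionary, only GET is supported, and the function name is made up. Note that the handler returns the stored bytes without inspecting them, in line with the last bullet above.

```python
def handle_request(raw_request, documents):
    """Parse the request line and build an HTTP/1.1 response as bytes."""
    request_line = raw_request.split("\r\n", 1)[0]
    method, path, _version = request_line.split(" ")
    if method != "GET":
        status, body = "405 Method Not Allowed", b""
    elif path in documents:
        status, body = "200 OK", documents[path]  # contents are not examined
    else:
        status, body = "404 Not Found", b""
    head = f"HTTP/1.1 {status}\r\nContent-Length: {len(body)}\r\n\r\n"
    return head.encode("ascii") + body

documents = {"/index.html": b"<p>hello</p>"}
request = "GET /index.html HTTP/1.1\r\nHost: example.org\r\n\r\n"
print(handle_request(request, documents).decode("ascii"))
```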


Uniform Resource Locator (URL)

• A mechanism whereby an Internet resource can be specified in a single line of ASCII text
• See RFC 1738: http://www.faqs.org/rfcs/rfc1738.html

URL                                     Refers to:
file://pub/xt.ps                        a PostScript file in directory pub on your local machine
ftp://usc.edu/docs/sweng.txt            file sweng.txt in directory docs on usc.edu, an anonymous ftp site
http://nunki.usc.edu/mydocs/book.doc    a file in directory mydocs on machine nunki.usc.edu, a WWW site
news:comp.compilers                     the newsgroup computers.compilers
mailto:[email protected]                an e-mail address


General Description of a URL

1. Scheme followed by a colon
   http:, ftp:, gopher:, news:, mailto:, wais:, telnet:
2. Double slash (only for http, ftp, gopher, wais) //
3. Internet domain name, e.g., pollux.usc.edu
4. Port number (this field is optional; e.g., pollux.usc.edu:8081)
   Standard or default port numbers:
     ftp     21      gopher        70
     telnet  23      http          80
     smtp    25      nntp         119
     imap   143      secure nntp  563
     pop3   110      secure pop3  995
5. Path, e.g., /pub/docs
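
The five parts listed above can be pulled out of a URL with Python's standard urllib.parse module. The sketch below is illustrative: describe is an invented helper, and DEFAULT_PORTS copies only a few entries from the port list above.

```python
from urllib.parse import urlsplit

# A few default ports from the list above, used when the URL omits the port.
DEFAULT_PORTS = {"ftp": 21, "telnet": 23, "gopher": 70, "http": 80}

def describe(url):
    """Return (scheme, domain name, port, path) for a URL."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port, parts.path

print(describe("http://pollux.usc.edu:8081/pub/docs"))
print(describe("ftp://usc.edu/docs/sweng.txt"))
```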


URL Character Set

• RFC 1738, Dec. 1994, defines the URL character set as: "...Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes], and reserved characters used for their reserved purposes may be used unencoded within a URL."
• However, HTML supports the ISO-8859-1 (ISO-Latin) character set
  – HTML 4.x extends the character set to all of Unicode
• Therefore, in URLs an escape mechanism is used: % followed by two hex digits
• Characters that should be encoded include:
  %, /, ., .., #, ?, ;, :, $, +, @, &, =
• Here are some encoded values for so-called “unsafe” characters

  Character  Encoding      Character  Encoding
  ~          %7E           |          %7C
  SPACE      %20           \          %5C
  %          %25           ^          %5E
  &          %26           [          %5B
  =          %3D           ]          %5D
  ?          %3F           #          %23
  {          %7B           >          %3E
  }          %7D           <          %3C
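
The escape mechanism above is simply the character's byte value written in hexadecimal. The sketch below shows this with a hypothetical percent_encode helper alongside the standard library's quote and unquote. One caveat: Python's quote follows the later RFC 3986 and leaves "~" unencoded, so it will not reproduce the %7E row of the table.

```python
from urllib.parse import quote, unquote

def percent_encode(character):
    """Encode a character as % followed by two hex digits per byte."""
    return "".join(f"%{byte:02X}" for byte in character.encode("utf-8"))

for character in " %&=?{}|\\^[]#<>~":
    print(repr(character), percent_encode(character))

# Library equivalents for whole strings.
print(quote("a b&c=d", safe=""))
print(unquote("%7Euser/my%20file"))
```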


 Markup Languages

• HTML - hypertext markup language, specifies document layout and the specification of hypertext links to text, graphics and other types of objects
• Browsers display text and graphics using the markup as guidance
• However, HTML is not like a word processing program, e.g. Microsoft Word or WordPerfect, and not like a page description language, e.g. PostScript
  – as a result, translation into HTML can produce a result that does not look exactly like the original


 What is HyperText?

• Regular text, with the additional feature of links to related documents
• As you read documents and follow links, you traverse a “web” of interconnections

[Figure: the opening of the Gettysburg Address by A. Lincoln ("Fourscore and seven years ago, our fathers brought forth upon this continent a new nation ...") with links from the text to related documents: the Emancipation Proclamation, the Declaration of Independence, and War Between the States by Eric Barnes, McGraw-Hill]


The WWW Data Model

[Figure: example pages connected by hyperlinks and search submissions: a Yahoo search page (dynamic page), a University of Wisconsin home page with a faculty search, Professor John Smith’s home page (My Courses, My Research), an index of material on trains, and a description of a specific train (static pages)]

A directed graph where nodes are documents, edges are links


Graph Structure and the Web

• Nodes = static web pages (~1+ billion)
• Edges = static hyperlinks (~10 billion)
• It’s a sparse graph: ~7 links/page on average
• Some Questions
  – is the web connected? can we always traverse from one page to any other?
  – can link connectivity improve the results of search engines?
  – if we watch the web graph change over time, what does that tell us about social processes?
• Reference: http://www9.org/w9cdrom/160/160.html


Some Graph Algorithms

• Weakly connected component (WCC)
  – A maximal subgraph of a directed graph such that for every pair of vertices u, v in the subgraph, there is an undirected path between u and v (a path that may follow edges in either direction)
• Strongly connected component (SCC)
  – A maximal subgraph of a directed graph such that for every pair of vertices u, v in the subgraph, there is a directed path from u to v and a directed path from v to u.
• Linear-time algorithms exist for both of the above
• A graph's diameter
  – The length of the "longest shortest path" between any two graph vertices, OR the largest number of vertices which must be traversed in order to travel from one vertex to another when paths which backtrack, detour, or loop are excluded from consideration
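
As an illustration of the two component definitions above, here is a sketch that computes both on a small made-up graph. The SCC routine uses Kosaraju's two-pass method, one of several linear-time algorithms; the slides do not say which algorithm the course has in mind. Graphs are plain dictionaries mapping a node to its successors.

```python
from collections import defaultdict, deque

def weakly_connected_components(graph):
    """Components when every edge is treated as undirected."""
    undirected = defaultdict(set)
    for u, successors in graph.items():
        undirected[u]  # make sure isolated nodes are present
        for v in successors:
            undirected[u].add(v)
            undirected[v].add(u)
    seen, components = set(), []
    for start in list(undirected):
        if start in seen:
            continue
        seen.add(start)
        component, queue = set(), deque([start])
        while queue:
            u = queue.popleft()
            component.add(u)
            for v in undirected[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(component)
    return components

def strongly_connected_components(graph):
    """Kosaraju: order nodes by DFS finish time, then search the reversed graph."""
    nodes = set(graph)
    for successors in graph.values():
        nodes.update(successors)
    order, visited = [], set()
    for root in nodes:  # pass 1: iterative DFS recording finish order
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                order.append(node)  # all successors done: node is finished
                stack.pop()
    reverse = defaultdict(list)
    for u, successors in graph.items():
        for v in successors:
            reverse[v].append(u)
    assigned, components = set(), []
    for root in reversed(order):  # pass 2: latest-finished nodes first
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = set(), [root]
        while stack:
            u = stack.pop()
            component.add(u)
            for v in reverse.get(u, ()):
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        components.append(component)
    return components

# a -> b -> c -> a is a cycle; d is only reachable; e only points in.
web = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": [], "e": ["d"]}
print(weakly_connected_components(web))
print(strongly_connected_components(web))
```

In the example, all five pages form one weakly connected component, but only a, b, and c can all reach one another, so they form the one non-trivial strongly connected component.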


Challenges of Scale

• A typical algorithm to compute the diameter of a graph requires a number of steps ~(nodes * edges), or ~(pages * links)
• For 1 billion pages, 10 billion links, and 0.10 microseconds/step we need ~10^12 (1 trillion) seconds or about 10 million days
• Results of a May 1999 crawl at Alta Vista
  – 220 million pages after duplicates are eliminated
  – Giant WCC has ~186 million pages
  – Giant SCC has ~56 million pages
  – Cannot browse your way from any page to any other
  – Next biggest SCC has ~150K pages
  – Other crawls produce similar results
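
The running-time estimate above is worth multiplying out. With the stated figures the product is 10^19 steps, which at 0.10 microseconds per step comes to about 10^12 seconds; that is consistent with the "about 10 million days" figure, or roughly 30,000 years. The variable names below are just for the worked example.

```python
pages = 1_000_000_000           # ~1 billion nodes
links = 10_000_000_000          # ~10 billion edges
seconds_per_step = 0.10e-6      # 0.10 microseconds

steps = pages * links                   # ~(nodes * edges)
total_seconds = steps * seconds_per_step
total_days = total_seconds / 86_400     # seconds in a day
total_years = total_days / 365.25

print(f"{steps:.0e} steps")
print(f"{total_seconds:.1e} seconds")
print(f"{total_days:,.0f} days, about {total_years:,.0f} years")
```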


Reachability Question

• How many pages are reachable from a random page?
• Start at page p
  – get its neighbors and put them on a list
  – follow the neighbors, repeating the process, watching for loops and marking dead ends
• Keep track of the number of pages reached from p, as a function of the distance d
• Experiment: start at 1,000 random pages and for each build BFS profiles
• Results:
  – either dies quickly (~100 pages reached)
  – or explodes and reaches ~100 million pages
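
The procedure above is a breadth-first search (BFS). A minimal sketch of building one BFS profile, on a small made-up graph, might look like this; the function name and the graph are assumptions for illustration.

```python
from collections import deque

def bfs_profile(graph, start):
    """Count the pages first reached at each link distance d from start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for neighbor in graph.get(page, ()):
            if neighbor not in distance:  # already-seen pages guard against loops
                distance[neighbor] = distance[page] + 1
                queue.append(neighbor)
    profile = {}
    for d in distance.values():
        profile[d] = profile.get(d, 0) + 1
    return profile

links = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": [], "e": ["d"]}
print(bfs_profile(links, "a"))  # keeps expanding until it hits the dead end d
print(bfs_profile(links, "d"))  # dies immediately: d has no outgoing links
```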


 Web Anatomy

One can pass from any node of IN through SCC to any node of OUT. Hanging off IN and OUT are TENDRILS containing nodes that are reachable from portions of IN, or that can reach portions of OUT, without passage through SCC. It is possible for a TENDRIL hanging off from IN to be hooked into a TENDRIL leading into OUT, forming a TUBE -- a passage from a portion of IN to a portion of OUT without touching SCC


Early History of the WWW

• 1989-1990  Tim Berners-Lee conceives the WWW at CERN in Geneva
• 11/90  Berners-Lee releases WWW prototype on NeXT computer
• 01/92  Release of source code for line mode browser, lynx and HTTP
• 03/93  Mosaic browser from NCSA is released
• 09/93  WWW internet traffic now measures 1% of NSF backbone
• 12/94  Netscape Navigator 1.0 is released
         World Wide Web Consortium formed
• 08/95  Microsoft Windows 95 and Internet Explorer 1.0 released
• 12/95  Java is released
• 12/04  Firefox 1.0 is released
• 09/08  Google Chrome 1.0 is released

See http://www.w3.org/History.html and Tim Berners-Lee’s presentation at the 10th anniversary, http://www.w3.org/2004/Talks/w3c10-HowItAllStarted/?n=1


Recent WWW Developments

• Browsers continue to be enhanced
  – Microsoft develops Internet Explorer 6-9, and 10 (beta on Windows 8), including support for ActiveX, Active Server Pages, and .NET, and special tools such as Expression Web
  – Netscape opens the source for Navigator, producing Netscape 7.x-8.1, followed by Mozilla and Firefox
  – Netscape browser “killed” by AOL on 12/28/2007.
  – Opera available on Windows, Mac OS X and Java-based cell phones and PDAs, as Opera Mini
  – Apple Safari (WebKit) available on Mac OS X, Windows, smartphones and tablets (Apple iPhone, iPod Touch, iPad, Android, Nokia Symbian)
  – Google releases Chrome browser, 2008 (WebKit)
• Other interesting technologies
  – multimedia streaming, e.g. Adobe Flash, Microsoft Silverlight, and now HTML5/H.264 (discussed later)
  – Application servers, e.g. IBM's WebSphere, BEA Weblogic


 WWW Consortium 

• Founded in 10/94, headed by Tim Berners-Lee, http://www.w3.org
• Goal: “to lead the World Wide Web to its full potential by developing common protocols that promote its evolution and ensure its interoperability.”
• Many of the technologies guided by the WWW consortium will be discussed this semester:
  – HTML, Style Sheets, Document Object Model, international character sets, HTTP, XML, etc.