
SEARCHING: Finding Needles in the World’s Biggest Haystack

- or - The Page Match & Page Rank Problems

We will look at how a search engine carries out an impossible task:

• Read your search word or phrase;

• Search all 40 billion pages on the Internet;

• List all the pages that contain some or all of your phrase;

• Rank these matches so the best matches are first;

• And do this in less than two seconds.

1. The PageMatch Problem investigates how your browser takes your Internet search string and passes it to a search engine, which returns an answer incredibly fast. We will see that this process involves a number of hidden tricks and indexes.

2. The PageRank Problem considers how it is possible, once many matching web pages have been found, to select the very few that correspond most closely to your search string.

Goals for this lecture

1. To see the motivation behind creating the Internet and the World Wide Web;

2. To see the difference between a browser and a search engine;

3. To realize that a search engine doesn’t wait for your search string and then start checking one page after another on the web;

4. To realize that it’s possible to create an index of the web, just like the index of a book;

5. To understand how Web crawling is done to create an index;

6. To recognize how information searches were done before the Internet.

Reading Assignment

Read Chapter 2, pages 10–23, “9 Algorithms that Changed the Future”.

What’s the Internet?

Once, computers were simply stand-alone devices.

A computer could be used to create information and store it on its own hard drive.

I could use my computer to write a letter to you, but to send it, I’d have to print it out and put it in an envelope.

Or, if you were in the same office, I could copy my letter onto a removable disk, carry it to your computer, and load it onto your hard disk, so you could view it.

People very quickly invented ways of connecting all the computers in an office using cables, so that information could be moved from one hard disk to another by simple keyboard commands.

Gradually, it became clear that it was a good idea to connect all the computers in the whole world this way.

This required modifying computers to include an Internet port which could be connected by a thin cable to a local network, over which information could be sent and received. All of this is hardware.

It required building a network of Internet Servers, so that every message could be passed from its sender computer, through a sequence of servers, until it reached its recipient computer.

It required coming up with a universal system of numeric addresses, called the Internet Protocol or IP, so that each message included an exact address that the servers could read.

It required setting up Domain Name Servers (DNS) so that humans could specify an easy-to-remember address such as facebook.com or fsu.edu, which would be automatically translated to a numeric IP address.

Up to this point, the Internet simply connected computers.

This was good for scientists, who often wanted to access a big, far away supercomputer without having to get on an airplane, or package their data in an envelope and mail it to the computer.

Instead, using the Internet, a scientist could start on a desktop computer, and log in to the supercomputer using programs called ssh or putty.

They could run a big scientific program on the supercomputer.

Then they could bring the data back to their desktop computer using a file transfer program called ftp or sftp.

In fact, scientists still use the Internet this way. One of the problems we are having today with hackers is that the Internet was set up to do research.

What’s the World Wide Web?

The Internet had been set up to make it easy for a person on one computer to send information to another computer, which was like sending a letter.

But sometimes someone wanted to make an announcement to many people they knew (so a lot of individual letters), or perhaps even to any interested person (even people unknown to the author). This was more like a newspaper or poster.

This idea eventually brought about the creation of the World Wide Web (WWW), by Tim Berners-Lee at the European Center for Nuclear Research (CERN).

WWW took advantage of the existence of the Internet, and its addresses and server system, to create something that had not been imagined before.

The web began with the idea that any computer user might want to share some information with people around the world.

To make this possible, the user simply had to set aside a special location or directory or folder on their computer, often called public_html.

Any documents placed inside this location would be presumed to be intended to be viewed by any interested person.

To access the documents, however, required knowledge of the address of the computer. The addresses had a specific format, starting with the prefix http://www and then listing information that specified the country, type of institution, department of the institution, person, directory and filename.

Such an address can be a lot to remember. A fairly short example is: https://www.sc.fsu.edu/undergraduate/courses

These documents were usually fairly short, and so were called web pages.

Thus, to publicize a new class to any interested computer user, an instructor could create a document called newclass.txt and place it in the public_html directory of a computer:

Announcing a new class in Computational Thinking!

This fall, the Department of Scientific Computing

will be offering a class called "Computational Thinking".

For details, stop by room 444 Dirac Science Library.

Or check out the syllabus at

http://people.sc.fsu.edu/~jpeterson/teaching

As the WWW became better known, people began to take advantage of it, posting interesting and useful information in their public areas.

But the crazy complicated address system was a huge drawback. No one could find the information without knowing the address, and the address was so long and complicated that it wasn’t possible to try to guess where things were.

The first improvement came with the invention of links, which allowed a person writing a web page to make a bridge or simple connection to other web pages, saving readers from having to know more addresses.

A web page using links would allow a user to jump to another web page simply by clicking on highlighted or underlined or colored text, rather than having to enter that address separately:

Announcing a new class in Computational Thinking!

This fall, the Department of Scientific Computing

will be offering a class called "Computational Thinking".

For details, stop by room 445 Dirac Science Library.

Or check out the SYLLABUS.

--------

Now that interesting web pages existed on the web, and it wasn’t so hard to find them, a new kind of program was developed for computers, called a browser.

One early browser was called Mosaic, which transformed into Netscape, and then into Firefox.

The browser’s job was to display a web page on the computer screen, to allow the user to specify an address for a new web page, to effortlessly jump to a new web page if the user clicked on a link on the current web page, and to jump back if the user changed their mind.

Because it still wasn’t obvious how to find the most interesting information, browsers often included lists of recommended web pages for users to try.

Some browsers included a simple feature which would allow the user to ask a question, and get back a recommendation for a good web page to investigate.

Because people added, modified, and deleted web pages every day, the web was very dynamic, and the most interesting places to examine changed from day to day.

The simple programs inside a browser for guiding users relied on lists created by a staff who could not keep up with the rapid growth in the number of web pages and users.

Browser companies began desperately researching new methods of satisfying users, who wanted good information on what was new on the Internet, and how to get there.

The problem was that the web changed too rapidly for human analysis, but how could a computer be of any use in finding and evaluating web pages?

Browsers and Search Engines

Originally, Internet browsers tried to handle user searching themselves, or simply included pointers to a few places that listed resources.

Then users demanded a much improved search facility.

It seemed sensible to split up the work, to separate the browser from the search engine. The browser would take care of moving to any particular web page, but the search engine would be a separate program that figured out which web page was the right place to go to next, in response to the user’s question.

In some ways, the browser was like a car which could go anywhere, while the search engine was like a GPS system, which could direct the browser to a desired location.

Ever since those days, there have been two clearly distinct but cooperating programs for navigating on the web.

Browsers like Mosaic/Netscape/Firefox, Google Chrome, Microsoft Internet Explorer/Edge, Apple Safari, and Opera allow people to view and retrieve information on the Internet.

Search engines, like Google Search, Microsoft Bing, DuckDuckGo, Yahoo! Search, and Ask.com, are used from within a browser, by typing a search string into a search box. The search engine then seeks to list web pages that pertain to that search string, after which the user can ask the browser to display information from a selected site.

Winners and losers in the browser wars

The search engine wars

In 2002, Google, Yahoo, and Microsoft each had about 30% of the search engine market share, but Google made some dramatic improvements to its search engine, and drove Yahoo and Microsoft down to less than 20% share each (and dropping).

To most of us, the web is simply an enormous pile of

• documents;

• photographs, images, pictures;

• email, messages, chats;

• songs;

• magazines;

• videos, movies;

• programs;

• advertisements, commercials and promotional videos;

• retail stores and ticket outlets;

• social media sites.

The Web may seem to be able to connect us to everything; but if we don’t know where something is, we might as well not have it.

So as the Internet has grown in size (number of things connected) and complexity (kinds of things connected), it has become increasingly important to develop and improve ways of:

• finding things you are looking for;

• recommending new things you might be interested in;

• learning enough about your interests to guess what other things you might want to see.

Specialized search engines

You are probably familiar with some specialized search programs that help you find information on the web, such as Netflix, Orbitz, Yelp, Mapquest, and Amazon.

These programs work well because they have a very limited vocabulary. You often select choices from a list, or enter simple information into boxes, like your credit card number and address.

For such a program, it’s easy to imagine a simple procedure for getting the desired information from the user, and then checking a list of song titles, travel dates, or shoe sizes, until a desired match is found, and then finishing up the billing and shipping arrangements.

While such programs seem to carry out an elaborate sequence of steps, in general the number of choices is not so large, and it’s easy to determine when you’ve found a match (the right size, the right price, the correct dates).

But a real search engine has a much more difficult task!

Suppose you are looking for a list of the rulers of Russia, or advice on how to whistle, or instructions on how to get rid of a wasp infestation.

As soon as the browser realizes you’re doing a search, it hands the information to the search engine, as a list of one or more words or search strings.

The search engine doesn’t speak English, French, or Mandarin. It has no idea what your words actually mean.

If your search word is apple, should it concentrate on fruit, on the computer company, on the recording label for the Beatles, on Fiona Apple, or on a movie named “The Apple”? What about Applebee’s restaurant?

What makes matters worse is that if you search on “Apple”, there seem to be more than a billion web pages on which that word appears!

What happens between your question and the answers?

Your browser is running on the computer right in front of you.

The information you need is somewhere else, perhaps far away.

We assume you’ve got a connection, wireless or wired, to your local network. Your request goes out from the local network to the Internet... and a few seconds later, your answer appears: a list of hundreds or thousands of “hits”, showing about 15 or 20 on the first page, each including the location and an extract from the matching text.

It seems like magic; in fact, it’s impossible for a search engine to receive a search string, check every word of every web page on the Internet, and return the good matches to you, unless you are willing to wait days for an answer.

But we got the answer in two seconds!

It seems impossible for this to work!

Why does this seem like an impossible task?

For your browser to access just one web page:

• the browser converts the web address to a numeric IP (Internet Protocol) address;

• it has to set up a connection to that address;

• once the connection is made, it has to request a copy of the page from the remote computer;

• it has to wait for the remote computer to find the file locally;

• it has to wait for the information in the web page to be copied across the Internet into a local file.

Each single web page access can take on the order of 1 second.

There are more than 40 billion web pages on the Internet;

To copy every web page, one page at a time, would take 40 billion seconds ≈ 1,270 years.
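The estimate can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the one-second-per-page figure quoted above, fetches pages strictly one after another, and uses a 365-day year. Under those assumptions the result comes out close to 1,270 years.

```python
# Back-of-the-envelope estimate of a one-page-at-a-time copy of the web.
PAGES = 40_000_000_000                  # page count quoted in the text
SECONDS_PER_PAGE = 1                    # assumed average access time, from the text
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # ignores leap years

total_seconds = PAGES * SECONDS_PER_PAGE
years = total_seconds / SECONDS_PER_YEAR
print(f"{total_seconds:,} seconds is about {years:,.0f} years")
```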

Browser users expect rapid response, and browser developers try every trick they can think of to keep their users happy.

Although a typical web page access might take a second, sometimes there is a longer delay because the network is busy, or the path to the web page is unusually long, or the server that controls the web page is very slow.

Since users often refer to a single web page several times in a single session, browsers added a cache, that is, they saved copies of the most recent web pages that have been visited. When you request a page, rather than going immediately to the web, the browser first checks to see if it already has a copy in its cache.

This Cache Trick usually works fine. But you can see it break down sometimes. If a web page has been updated since you first referred to it, your browser may continue to show you the out-of-date version. This is why most browsers include a Refresh button, which really means throw away the cached page and get a fresh copy from the web!
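The Cache Trick, including the way it can show a stale page, can be sketched in Python. Everything here is hypothetical: the address, the `remote_pages` dictionary standing in for the web, and the helper `fetch_from_web` standing in for a slow download. Real browser caches are more elaborate; this only illustrates the idea.

```python
# Hypothetical stand-in for the web: address -> current page text.
remote_pages = {"http://example.edu/news": "version 1"}
fetch_count = 0

def fetch_from_web(address):
    """Pretend to download a page (assumed helper, not a real browser API)."""
    global fetch_count
    fetch_count += 1
    return remote_pages[address]

cache = {}

def get_page(address, refresh=False):
    """Return a page, using the cached copy unless a refresh is requested."""
    if refresh or address not in cache:
        cache[address] = fetch_from_web(address)
    return cache[address]

first = get_page("http://example.edu/news")             # downloaded from the "web"
remote_pages["http://example.edu/news"] = "version 2"   # the page changes remotely
stale = get_page("http://example.edu/news")             # cache hit: out of date
fresh = get_page("http://example.edu/news", refresh=True)  # the Refresh button
print(first, "|", stale, "|", fresh, "| downloads:", fetch_count)
```

The second request never touches the network, which is why it is fast and also why it returns the old version.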

What really happens is a little more complicated . . .

It’s natural to assume that if you search for polar bears, then the search engine... really searches every web page on the Web, and comes back with a list of all the pages that contain those words.

That is impossible. So what does happen?

In order to find your matching pages, the search engine did a great deal of work long before your fingers touched the keyboard:

• web crawling: making a map of the Web;

• examining links: making an index of every word.

Let’s see how careful advance work makes your search happen so fast!

Step 1: Making a map

A search engine prepares in advance for your question, whatever it is.

It starts this process by locating all the web pages it can find. This is called web crawling. It’s hard to do right, because there’s no map of the Internet, and sites appear and disappear every minute.

Imagine the Web as a network of stops in the New York subway system. The search engine needs to take random rides on this subway system, and notice every place it stops, and how it got there, and whether anything has changed since the last visit. All of this goes to updating a map of the Web.

A web crawler, often called a spider, is actually a program. A search engine has many web crawlers exploring the web. From time to time, a crawler visits the main FSU web page.

It notes all the links on the FSU web page, some of which changed since the last visit. It makes as many observations as possible in the local FSU web page directory, sending this information back to the search engine.

Once it has extracted all the information it can from the main FSU web page, it decides somewhat randomly where to explore next.

From their starting sites, the swarm of spiders follow links across the web, recording what they see and sending it back to the main search engine site, where all the information is combined into an updated snapshot of the web, its web sites, the web pages, and the links in those web pages.
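A minimal sketch of a single crawler, assuming a made-up miniature web stored in a dictionary (`LINKS`, with invented addresses) instead of real network requests. For simplicity it visits pages in first-found, first-visited order rather than choosing the next page "somewhat randomly" as described above; the resulting map is the same either way.

```python
from collections import deque

# Hypothetical miniature web: each address maps to the links found on that page.
LINKS = {
    "fsu.edu": ["fsu.edu/academics", "fsu.edu/news"],
    "fsu.edu/academics": ["fsu.edu", "sc.fsu.edu"],
    "fsu.edu/news": ["fsu.edu"],
    "sc.fsu.edu": ["fsu.edu/academics"],
    "unlinked.example": [],   # nothing links here, so the crawler never finds it
}

def crawl(start):
    """Follow links outward from `start`; return a map of page -> links seen."""
    web_map = {}
    frontier = deque([start])          # pages discovered but not yet visited
    while frontier:
        page = frontier.popleft()
        if page in web_map:            # already visited
            continue
        links = LINKS.get(page, [])    # "download" the page and note its links
        web_map[page] = links
        frontier.extend(link for link in links if link not in web_map)
    return web_map

web_map = crawl("fsu.edu")
for page, links in web_map.items():
    print(page, "->", links)
```

Note that the page nobody links to never appears in the map, which is one reason a newly posted page can stay invisible to a search engine for a while.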

How does a spider start its travels over the Web?

Some parts of the web are well known, and don’t change their locations. These are servers, computers devoted to a special task. These are good places for an Internet crawler to begin its exploration.

A web server does nothing but send web pages to your browser while you’re surfing the Internet.

An e-mail server is a computer that works as a virtual post office, receiving and storing mail messages, and interacting with you when you check your mailbox.

Facebook servers do nothing but handle all the activities of their users, and they are packed with information that a crawler wants to see.

If you have a web site, you may be surprised or unhappy to realize that a spider will look at your information from time to time and send a summary of it back to Microsoft or Google or Yahoo.

Usually, if you have a web site, anybody can look at your information, but still, it can be unsettling to realize that some viewers are actually vacuuming up your information for their own use.

Because of user complaints and privacy concerns, some major web sites and social media sites try to restrict access by these crawlers.

Creating a map of the web is part of the search engine’s plan to efficiently respond to your requests.

When you send a question to the search engine, it relies first on this map, rather than on the actual web. That means that, sometimes, the map and the web differ.

The map may include a web page that has since actually disappeared. For this reason, your search request will sometimes point to what seems like an interesting web page, but when you try to view it, it says “Can’t find this page!” and you’re left thinking, “What? You just told me to look at it!”

Also, if you put a web page up, it will not be visible to any search engine for some time, until a web crawler runs into it and adds it to the map.

These failures are two consequences of the otherwise very reliable Map Trick.

Step 2: Making an index of the web

So now we have some idea of how it’s possible to map the web using crawlers.

But even if we know where every web page is, we still have the impossible task of checking them all against the user’s search string.

Luckily, Google solved this problem, using other information gathered by the web crawlers.

When you do a Google search, you are NOT actually searching the web, but rather Google’s index of the web.

Google’s index is well over 100,000,000 gigabytes of information, requiring one million computing hours to build, and is constantly updated.

Now we want to understand what it means to index the web.

When you issue a web search query, it is processed in two stages:

• matching searches for all matching pages using the web map;

• ranking orders those matching pages so the best appear first.

The query “London bus timetable” seeks matches, and then ranks them.

Library search: Search engines have always existed

One way to realize how search engine indexing makes the page match task possible is to think about what used to happen, in the good old days, when you went to the library to work on a term paper.

You might start by wandering through the library, looking for the books you need, but after a while, you realize this is hopeless.

Then you find a librarian and say:

“I need a list of the rulers of Russia.”

The librarian doesn’t know where your information is either. But the librarian knows how you can find it.

The librarian says: “Search the card catalog index!”

The old card catalog index contained hundreds of thousands of search phrases: topics, author names, events, all in alphabetical order.

Every time a new book came into the library, the librarians prepared cards for every topic covered by the book, and added these to the index.

With great work, and over a long time, the card catalog was built up to be a labor saving device that allowed you to “instantly” (well, within a minute or two) discover the “addresses” (library call numbers) of every book in the library that might pertain to your topic.

Searching the card catalog under Russia, Rulers we might see a sequence of cards like:

Lieven, Dominic   “Russia’s rulers under the old regime”   DK253.L54
Lev, Timofeev     “Russia’s secret rulers”                 DK510.73.T56
Warnes, David     “Chronicle of the Russian Tsars”         DK37.6W37
Gooding, John     “Rulers and subjects”                    DK189.G86

Perhaps the title by Warnes seems the best match for our interest. In that case, we need to make a note of the information DK37.6W37 because this is the Library of Congress catalog number for the book.

This number amounts to an address for the book, so once we have a map of the library, we now know where to search next.

So search phrase plus index gets us the location of the information.

Books have an index too!

Most nonfiction books include an index, which lists names and topics in alphabetical order, and the pages on which these are discussed.

Once we get the book in our hands, we’re holding 300 or 400 pages of dense text; we probably don’t want to read or even scan the whole book to find a list of Russia’s rulers!

Instead, we rely on the fact that books include an index at the back, making it possible to rapidly locate the pages on which key topics are presented.

So we turn to the back of the book, and luckily, find “Rulers, List: 217” and turn to page 217 and finally have what we want:

Rurik I            862-879
Oleg of Novgorod   879-912
Igor I             913-945
Olga of Kiev       945-962
...                ...

Now Strozier library claims to hold about 3.3 million books.

A typical book might have 300 pages.

So we can estimate there are a total of 3,300,000 x 300 ≈ 1 billion pages.

Our search led us to one page in one book, so essentially we picked one page out of a billion in about 15 minutes.

This means that in every single second, we essentially skipped over a million pages of useless text in our search for the right one.

Library search rate ≈ one million pages per second.
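The rate follows directly from the numbers above: 3.3 million books, an assumed 300 pages per book, and a 15-minute search. A short Python check of the arithmetic:

```python
BOOKS = 3_300_000          # Strozier library holdings quoted in the text
PAGES_PER_BOOK = 300       # "typical" book length assumed in the text
SEARCH_SECONDS = 15 * 60   # a 15-minute search

total_pages = BOOKS * PAGES_PER_BOOK
rate = total_pages / SEARCH_SECONDS
print(f"{total_pages:,} pages / {SEARCH_SECONDS} s = {rate:,.0f} pages per second")
```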

Twice, in our library search, we referred to an index as an efficient way to rapidly narrow our search.

The library index listed books in the library that might pertain to our topic.

The book index listed pages in the book that might pertain to our topic.

The indexes saved us from having to search every book, or search every page.

Of course, the reason we had a fast search was that other people (librarians and publishers) had already done the hard work of preparing accurate indexes for us.

Web search is similar to library search

Searching the web is similar to searching the library:

• There are a huge number of items;

• Each item contains many pieces of information;

• Each item has an address.

We need:

• a way to specify a search topic;

• a list of items that might include this topic;

• an address for each item;

• a map that tells us how to reach any address;

• a way to reach the address;

• a way to view the item;

• a way to locate our search topic within the item.

Search Engines as Fast as Possible by Tech Quickie

https://www.youtube.com/watch?v=ADvI44Sap3g

Map + Match + Rank

We have seen that when you enter a search phrase in your browser, the browser passes this request to a search engine. The search engine doesn’t actually go out and check the web, but rather looks at an index of the web, created from information gathered by web crawlers as they constructed a map of the web.

The stages involved in this process include:

1. map the web, its web pages and topics;

2. match the search words to web pages;

3. rank the matching web pages by importance.

Let’s assume that the web map has been built, and that it’s time to think about how to rapidly match web pages to an arbitrary user search word. In the next lecture we will see how the matches can be ranked.

An index can be used to speed up the search for occurrences of some word, phrase, or topic.

An index essentially says “If you’re looking for this topic, check out these places.”

It should be clear that this idea, which worked so well for libraries and books, can be extended to the web and web pages.

An index takes a lot of work to prepare, but once it’s created, it makes searching far faster.

Google search index

Example. Consider a very simplified example of the web where there are three web pages.

Indexing these web pages is very similar to indexing a book. We will consider every word important. So we start by listing alphabetically the unique words we find. Each word will be followed by a “1” if it occurs in page 1, a “2” if in page 2, and so on.

Let’s go through this painful indexing process now.

Our resulting index should look like this. It is a single file that contains all the words that appear, and which web pages use them:

Word     Web pages containing word
a        3
cat      1 3
dog      2 3
mat      1 2
on       1 2
sat      1 3
stood    2 3
the      1 2 3
while    3
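The indexing process can be sketched in Python. The three page texts below are an assumption: they are not reproduced in these notes, so they have been reconstructed to agree with the index above and with the word positions used later in this lecture. The function name `build_index` is also just an illustrative choice.

```python
# Page texts are an assumption, reconstructed to be consistent with the index.
PAGES = {
    1: "the cat sat on the mat",
    2: "the dog stood on the mat",
    3: "the cat stood while a dog sat",
}

def build_index(pages):
    """Map each word to the set of page numbers on which it appears."""
    index = {}
    for page_number, text in pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(page_number)
    return index

index = build_index(PAGES)
for word in sorted(index):              # alphabetical, like the table
    print(word, *sorted(index[word]))
```

Printing the words in sorted order reproduces the table line by line.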

A search engine could answer questions with just this single index file.

If we enter the search string dog, the search engine can quickly find the corresponding line in the index (this is quick, because a computer takes advantage of alphabetical ordering even more efficiently than we do). Then the search engine can report that “dog” appears in web pages 2 and 3.

Of course, a real search engine would include a bit of the text surrounding the occurrence of “dog” but that’s an added feature that we don’t need to worry about right now.

If we enter the search string cat, our search engine will tell us that this word occurs in pages 1 and 3.
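One way a computer can exploit alphabetical ordering is binary search: look at the middle of the sorted list, discard the half that cannot contain the word, and repeat. The sketch below uses Python's standard `bisect` module on the nine-line index from the example. The names `INDEX_LINES` and `lookup` are illustrative, and a real search engine's index structure is certainly more sophisticated than a sorted list.

```python
from bisect import bisect_left

# The example index, kept in alphabetical order: (word, pages containing it).
INDEX_LINES = [
    ("a", [3]), ("cat", [1, 3]), ("dog", [2, 3]), ("mat", [1, 2]),
    ("on", [1, 2]), ("sat", [1, 3]), ("stood", [2, 3]),
    ("the", [1, 2, 3]), ("while", [3]),
]
WORDS = [word for word, _ in INDEX_LINES]

def lookup(word):
    """Binary-search the sorted word list; return the pages, or [] if absent."""
    i = bisect_left(WORDS, word)
    if i < len(WORDS) and WORDS[i] == word:
        return INDEX_LINES[i][1]
    return []

print("dog:", lookup("dog"))
print("cat:", lookup("cat"))
print("zebra:", lookup("zebra"))
```

With binary search, doubling the size of the index adds only one more comparison, which is why even an enormous sorted index can be searched quickly.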

Often, our search string involves more than one word.

For example, if we are interested in finding a web page that discusses the problems of having both a cat and a dog, then we don’t want pages that have cats and pages that have dogs, we want pages that contain both keywords at the same time.

So let’s enter a search phrase dog cat.

The search engine can determine that “dog” occurs on pages 2 and 3. Then the search engine can find “cat” on pages 1 and 3.

Now it must put these two facts together, realizing that “dog” and “cat” both appear only on page 3, and this is the single result returned.

So finding two search words together on the same page takes two searches, three words would require three searches, and so on.
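Putting the facts together amounts to intersecting the page lists, one lookup per search word. A minimal sketch using Python sets, with the example index typed in directly (the function name `match_all` is an illustrative choice):

```python
# The example index: word -> set of pages containing it.
INDEX = {
    "a": {3}, "cat": {1, 3}, "dog": {2, 3}, "mat": {1, 2}, "on": {1, 2},
    "sat": {1, 3}, "stood": {2, 3}, "the": {1, 2, 3}, "while": {3},
}

def match_all(query):
    """Return the pages that contain every word of the query."""
    words = query.split()
    if not words:
        return []
    pages = set(INDEX.get(words[0], set()))    # first lookup
    for word in words[1:]:                     # one more lookup per extra word
        pages &= INDEX.get(word, set())        # keep only pages seen in both
    return sorted(pages)

print(match_all("dog cat"))
print(match_all("the mat"))
```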

Notice an important fact: to answer these questions, we did not have to have access to the original web pages. We needed that when we made the index, but now a single index file allows us to answer questions about all three pages.

Now the World Wide Web has 40 billion web pages. Suppose we can create a similar (but huge!) index file for them. Then, to answer simple matching questions about all the web pages, we only have to search one place, the index file, not the web itself.

This is one clue to how a search task that should take more than a thousand years can be cut down to 2 seconds.

With our index for this example can we handle connected keywords?

In most search engines, it is possible to enter a phrase, using quotation marks, such as “cat sat”. In that case, you are not just asking that both words appear somewhere on the same page, but that they appear immediately together, in that order. That is, we want the bigram cat sat.

The index file we created for the tiny web can tell us that both cat and sat occur on pages 1 and 3, but not whether they occur together.

It might seem that the solution is to look up cat and then go to those web pages and find whether sat occurs in the right position.

This is not acceptable! It would require downloading every web page that had cat, and reading the entire web page to see whether sat was the next word. There is no way to guarantee a fast answer.

The answer to this problem is to include word locations in the index.

Suppose in version #2 of our index file:

• we label every word in each page with its position;

• we record every occurrence of a word in the page along with its position.

So if the occurs twice in a web page, each occurrence is listed, along with its position.

New index including word location

Word     Page-Position
a        3-5
cat      1-2 3-2
dog      2-2 3-6
mat      1-6 2-6
on       1-4 2-4
sat      1-3 3-7
stood    2-3 3-3
the      1-1 1-5 2-1 2-5 3-1
while    3-4

The first number indicates the web page the word occurs on and the second number indicates its position on that page.
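Building this second version of the index only requires remembering each word's position while reading a page. In the sketch below, the three page texts are an assumption (they are reconstructed from the page-position entries above), and `build_position_index` is an illustrative name.

```python
# Page texts are an assumption, reconstructed from the page-position entries.
PAGES = {
    1: "the cat sat on the mat",
    2: "the dog stood on the mat",
    3: "the cat stood while a dog sat",
}

def build_position_index(pages):
    """Map each word to a list of (page, position) pairs, positions from 1."""
    index = {}
    for page_number, text in pages.items():
        for position, word in enumerate(text.split(), start=1):
            index.setdefault(word, []).append((page_number, position))
    return index

position_index = build_position_index(PAGES)
for word in sorted(position_index):
    print(word, *(f"{page}-{pos}" for page, pos in position_index[word]))
```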

Now suppose we are given the search phrase cat sat.

We look up the word cat, and see that it occurs on page 1 as word 2, and on page 3 as word 2.

sat is on page 1 as word 3, right after cat, so we have a hit on page 1.

sat is on page 3 as word 7, not immediately following cat there, and so that counts as a miss.

By making our index more intelligent, we can now answer any phrase inquiry about our web pages, and we still don’t need to access the web to do this.

We call this The Word Location Trick.
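The Word Location Trick can be sketched as follows: for every occurrence of the first word, check that each later word of the phrase occurs on the same page at the next position, then the one after, and so on. The index is the example's page-position table typed in as Python data; `phrase_pages` is an illustrative name, not anything a real search engine exposes.

```python
# The example page-position index: word -> list of (page, position).
POSITION_INDEX = {
    "a": [(3, 5)],
    "cat": [(1, 2), (3, 2)],
    "dog": [(2, 2), (3, 6)],
    "mat": [(1, 6), (2, 6)],
    "on": [(1, 4), (2, 4)],
    "sat": [(1, 3), (3, 7)],
    "stood": [(2, 3), (3, 3)],
    "the": [(1, 1), (1, 5), (2, 1), (2, 5), (3, 1)],
    "while": [(3, 4)],
}

def phrase_pages(phrase):
    """Return pages where the words occur consecutively, in the given order."""
    words = phrase.split()
    if not words:
        return []
    # Candidate starting points: every occurrence of the first word.
    starts = set(POSITION_INDEX.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        occurrences = set(POSITION_INDEX.get(word, []))
        # Keep a start only if this word sits `offset` places after it.
        starts = {(page, pos) for page, pos in starts
                  if (page, pos + offset) in occurrences}
    return sorted({page for page, _ in starts})

print(phrase_pages("cat sat"))
print(phrase_pages("on the mat"))
```

As in the text, the query cat sat hits page 1 and misses page 3, and no web page is ever downloaded to decide this.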

Exercise. Use the following table of page numbers and word locations to answer the following questions.

Word      Page-Position
a         1-7
animal    1-2 2-6 3-5
be        2-4
cheetah   1-8 2-1 3-1
earth     1-4
fastest   1-1 2-5 3-3
is        1-5 3-2
land      3-4
may       2-2
not       1-6 2-3
on        1-3

1. On what page(s) does the word earth occur?

2. On what page(s), if any, does the bigram on earth occur? If it doesn’t occur, enter “0”.

3. On what page(s), if any, does the trigram fastest land animal occur? If it doesn’t occur, enter “0”.

Reading assignment

Read Chapter 3, pages 24–37, “9 Algorithms that Changed the Future”.

Socrative Quiz Searching Quiz1

CTISC1057

True or false.

1. Safari is a search engine.

2. Google Chrome is a browser.

3. You have to use a browser to get to a search engine.

4. When you type in keywords in a browser, then the search engine checks every word of every web page and returns your results.

5. A spider or a crawler is an electronic robot which is inside your computer.

6. A spider starts its Web search at a popular Web site or server.

7. Currently, Yahoo is the most popular search engine in the U.S.

8. When you type in keywords in a browser, then the search engine is not actually searching the Web but an index of the Web.

9. The Internet and the World Wide Web were developed at the same time.

10. Links on a web page provide a bridge to other web pages.

Goals for this lecture

1. Last time we began to look at how an index of the web can be made. We want to expand this idea;

2. To understand the importance of the location of keywords on a Web page;

3. To realize that Web pages are typically written in a language called HTML;

4. To see how HTML tags can indicate what a web page is about;

5. To understand how web pages are ranked after we find the ones matching the search phrase.

When is one match for two search words better than another?

Suppose we were interested in learning the cause of malaria. We might naturally search on malaria cause, although we probably don’t insist that those two words occur exactly together, so we don’t need quotes.

Suppose the search engine discovered two web pages with both match words. We can see the first web page is a better match.

What clues could a search engine use?

The answer is “Nearness” because close words are typically a better match.

Here we see part of our index file, with keywords highlighted.

Word      Page-Position
by        1-1
cause     1-6 2-2
common    1-5
...
malaria   1-8 2-19
many      2-13
of        2-10 2-14
...
the       1-3 1-24 2-7 2-11

Using our word location index, the search engine can see that on page 1, malaria and cause are separated by just 1 word (positions 6 and 8), versus 16 words on page 2 (positions 2 and 19).

Although both pages match both keywords, page 1 may be the better match because the keywords occur much closer to each other.

Notice that the computer does not understand what it is reading! It could do the same kind of analysis for keywords and text written in Italian, or in ancient Mayan.

It may look like an intelligent action, but it’s based on a very simple idea: physically close keywords suggest a better match.

The Nearness Trick is thus also useful for our upcoming page ranking task.
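The Nearness Trick can be sketched in a few lines of Python. This is only an illustration, not how any real search engine is written: the dictionary layout and the function name closest_gap are assumptions, and the entries are copied from the partial index above. The gap is the difference between two positions, so a gap of 2 means one word in between, and a gap of 17 means 16 words in between.

```python
# Partial word-location index from the malaria example:
# word -> list of (page, position) pairs.
index = {
    "cause": [(1, 6), (2, 2)],
    "malaria": [(1, 8), (2, 19)],
}


def closest_gap(index, word_a, word_b):
    """Return {page: smallest position difference} for pages holding both words."""
    gaps = {}
    for page_a, pos_a in index.get(word_a, []):
        for page_b, pos_b in index.get(word_b, []):
            if page_a == page_b:
                gap = abs(pos_a - pos_b)
                gaps[page_a] = min(gap, gaps.get(page_a, gap))
    return gaps


gaps = closest_gap(index, "malaria", "cause")
# Pages sorted so that the page with the nearest keywords comes first.
ranked_pages = sorted(gaps, key=gaps.get)
```

Notice that the function only compares numbers; it never needs to know what the words mean.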

We already know two ways to specify multiple key words to a search engine:

• quoted, we prefer that the words appear together, in that order;

• unquoted, the words can be in any order and far apart.

It turns out that most search engines automatically prefer situations in which the keywords are close. However, one interesting feature in Google Search allows us to specify how close we want the words to be.

If we use one asterisk between two quoted keywords, then we are asking for pages where the keywords appear in that order, separated by exactly one word:

"string1" * "string2"

Nearness can also help us avoid spam pages.

One reason that search engines also prefer matches in which multiple keywords are close is to avoid being trapped by spamdexes. A spamdex is an artificial web page that simply contains a grab bag of keywords, without any information. You could make such a web page by posting a dictionary, minus the definitions, for instance. We will look at this more closely later in the course.

A search engine looking for hair loss remedy or tap dance lessons or perpetual motion machines will find matches (but no information!) on a spamdex page, and the spamdex operator will pick up some money by displaying ads to the annoyed user.

Thus, even if the user doesn’t request that the keywords be close, search engines avoid matching pages that fail the proximity test.

Indexing should notice a web page title

Web pages are actually a little more complicated than the simple text files we have used so far as examples.

Web pages are written in HyperText Markup Language (HTML). HTML allows the author to vary the font type and size, to include tables, lists and figures, and to indicate the structure of the document.

A web page author can specify a title for the web page using specific HTML tags to indicate where the title starts and stops.

If a search engine is looking for the key word malaria, doesn’t it make a huge difference if the actual title of a page is “Malaria”?

An example of title searching using HTML

Here are three web pages which include title information.

The pages as we see them.

The pages as the browser and search engine see them.

When written in HTML the Web page typically has a title between the special tags <titleStart> and <titleEnd>. (The actual HTML tags, <title> and </title>, are slightly different.) An intelligent search engine takes advantage of noticing the title!

Our improved tiny web index will notice and include HTML tags.

Since the spiders see the actual text that generates the Web page, the first “word” to index is the HTML tag <titleStart>, the next words are the ones in the title, then the HTML tag <titleEnd>, then the HTML tag for starting the body, followed by the body of the page and finally the HTML tag for ending.

For example, for the page entitled “My Cat” we have the following indexing.

<titleStart>   my   cat   <titleEnd>
     1          2    3         4

<bodyStart>   the   cat   sat
     5         6     7     8

on   the   mat   <bodyEnd>
 9    10    11       12

Word           Page - Position
a              3-10
cat            1-3  1-7  3-7
dog            2-3  2-7  3-11
mat            1-11  2-11
my             1-2  2-2  3-2
on             1-9  2-9
pets           3-3
sat            1-8  3-12
stood          2-8  3-8
the            1-6  1-10  2-6  2-10  3-6
while          3-9
<titleStart>   1-1  2-1  3-1
<titleEnd>     1-4  2-4  3-4
<bodyStart>    1-5  2-5  3-5
<bodyEnd>      1-12  2-12  3-13
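As a rough sketch of how such an index could be produced, the Python below walks through each page word by word and records the page number and position of everything it sees, tags included. The function name build_index is an assumption, and the text of pages 2 and 3 is reconstructed from the index entries, not copied from the original pages.

```python
# Page texts as a spider would see them. Page 1 is the "My Cat" page;
# pages 2 and 3 are reconstructed from the index (an assumption).
pages = {
    1: "<titleStart> my cat <titleEnd> <bodyStart> the cat sat on the mat <bodyEnd>",
    2: "<titleStart> my dog <titleEnd> <bodyStart> the dog stood on the mat <bodyEnd>",
    3: "<titleStart> my pets <titleEnd> <bodyStart> the cat stood while a dog sat <bodyEnd>",
}


def build_index(pages):
    """Map every word (and tag) to a list of (page, position) pairs."""
    index = {}
    for page_number, text in pages.items():
        # Positions start at 1, and tags count as words.
        for position, word in enumerate(text.split(), start=1):
            index.setdefault(word, []).append((page_number, position))
    return index


index = build_index(pages)
```

Looking up index["dog"] then gives the same three entries as the table row for dog.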

Now suppose a user searches for dog. A page in which dog is in the title is probably a stronger match.

Each time a page is found containing the word dog, the engine can check whether this word is actually part of the web page title. It does this by comparing the positions of <titleEnd> and dog and <titleStart>. If the keyword falls between the two title markers, then this web page is more highly related than if it occurs elsewhere.

By looking at our index, we see the following cases:

Page   titleStart   dog   titleEnd   Start < dog < End?
2      1            3     4          yes
2      1            7     4          no
3      1            11    4          no

This technique is The Metaword Trick.
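The title check amounts to one comparison of positions. Here is a minimal Python sketch; the function names are assumptions, and the index entries are the ones for dog and the two title tags from the tiny index.

```python
# Entries copied from the tiny web index: word -> list of (page, position).
index = {
    "dog": [(2, 3), (2, 7), (3, 11)],
    "<titleStart>": [(1, 1), (2, 1), (3, 1)],
    "<titleEnd>": [(1, 4), (2, 4), (3, 4)],
}


def positions_on_page(index, word, page):
    """All positions of a word on one page."""
    return [pos for (p, pos) in index.get(word, []) if p == page]


def in_title(index, word, page):
    """True if the word occurs between <titleStart> and <titleEnd> on the page."""
    starts = positions_on_page(index, "<titleStart>", page)
    ends = positions_on_page(index, "<titleEnd>", page)
    if not starts or not ends:
        return False  # no title markers recorded for this page
    return any(starts[0] < pos < ends[0]
               for pos in positions_on_page(index, word, page))
```

With this index, in_title(index, "dog", 2) is true because position 3 lies between 1 and 4, while in_title(index, "dog", 3) is false because position 11 does not.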

Maps and Indexes have solved the pagematch problem

From what we have seen, the impossible problem of quickly responding to a request to find keywords in all the webpages in the world has become the possible problem of intelligently searching a single index file.

Just as with card catalogs and an index at the back of a book, the creation of an index file for the web takes a great deal of time, and space.

Google, for instance, has created enormous collections of computer servers whose job is to collect all the information on all the web pages and create, update, and analyze the corresponding index file. This means that the index file is actually always out of date (like Google Street View) but regularly updated piece by piece.

Computer “Tricks”

The search engine may seem to be intelligent - it’s answering your questions, after all. But actually, we have simply figured out a number of tricks that make it possible to come up with reasonable approximations of good answers:

• The Cache Trick

• The Map Trick

• The Index Trick

• The Nearness Trick

• The Word Location Trick

• The MetaWord Trick

These are examples of computational thinking in action: given a problem, how can we use the strengths of a computer (ability to store information, to gather and remember new information, look up information quickly, repeat operations) to simulate the abilities of a human (read all the web pages, and find the matching ones)?

Exercise. Consider the following two sample web pages with HTML tags. Index the pages by completing the given table.

Page 1

<titleStart> Seminoles Football <titleEnd> <bodyStart> Visit Seminoles.com and get FSU news from the official athletic site <bodyEnd>

Page 2

<titleStart> Florida State Seminoles Football Schedule <titleEnd> <bodyStart> Florida State Seminoles Football scores schedule stats roster players news and more <bodyEnd>

Word            Page - Position
and
athletic
Florida
football
from
FSU
get
more
news
official
players
roster
schedule
scores
Seminoles
Seminoles.com
site
State
stats
the
visit
<titleStart>
<titleEnd>
<bodyStart>
<bodyEnd>

Exercise. Use the index below to answer the following questions.

Word           Page - Position
a              1-5  1-14  2-16
basketball     1-9  2-3
be             2-15
can            2-14
chance         1-15
college        2-2
deep           2-8
for            1-7  1-16  2-7
fsu            1-8  2-4  2-15
has            2-5
in             1-20  2-19
madness        1-3  1-19  2-21
major          2-17
march          1-2  1-18  2-20
ncaa           2-9
pieces         2-6
player         2-18
possibility    1-6
run            2-11
season         2-23
some           1-8  1-17
still          1-4  1-13
tallahassee    1-21
there’s        1-12  1-10  2-6  2-10
this           2-22
tournament     2-10
<titleStart>   1-1  2-1
<titleEnd>     1-10  2-12
<bodyStart>    1-11  2-13
<bodyEnd>      1-22  2-12  2-24

1. What is the title of the first page?

2. Does the bigram “march madness” appear on the first page? On the second page?

3. Does the word “tournament” appear in the title of the second page?

4. Does the bigram “march madness” appear in the title of the second page?

5. Does the word “tallahassee” appear on both pages?

6. How many words are in the title of the second page? (excluding <titleStart> and <titleEnd>)

7. Does the bigram “major player” occur on either page? If so, which one?

8. Does the bigram “fsu season” occur on either page? If so, which one?

After finding matches, we need to do ranking!

Even with the tricks we have described, it is common for a search engine to discover thousands or millions of matching web pages.

The mapping and matching algorithms are only the beginning of the process of responding to your web search.

Next it will be necessary to consider the page rank algorithm, which considers all the matching pages that have been found, and sorts them in order of importance, so that even with millions of matches, most users know their best choice is a match on one of the first few pages.

The Page Rank Problem

We have seen that search engines use Web Mapping and Web Indexing to find matches to search words.

• Before a search engine sees any user search words, it has already done a lot of work in preparation.

• So a search engine must send out a constant stream of web crawlers, which randomly explore the Web, reporting where they are, how they got there, and what they find.

• The information from web crawlers is used to create and update an enormous index of the web:

– directions to reach every server, folder, directory that contains web pages;

– a list of every web page;

– a list of every word and its location in the web page;

– a list of hyperlinks in every web page.

• Instead of the search engine storing a copy of the entire web, it really transforms the web into an enormous and complicated index file.

• Having a local index means that, when you send in a key word, the search engine can very quickly look it up locally, rather than having to touch the web at all.

• The index will provide a list of all the pages on the web that have your keyword somewhere in their text.

• Since there are 40 billion web pages, it’s easy for almost any key word to result in millions of matches.

• Most of those matches are probably not very useful, and there’s no way you would have the patience to search through them one by one for the truly relevant ones.

• When you use a search engine, you often find the best matches on the first page. That’s because, after the search engine found all the matches, it went through them and made a very good guess as to which pages were the best. But the search engine is a computer program. It can’t actually understand the pages it is handling. How can it decide what to put on page 1?

• We saw two simple tricks (Nearness and MetaWord Tricks) that could help with ranking web pages.

• The Nearness Trick assumes that if you specified two keywords like malaria and cause, then matching pages should be preferred where these words were closer together in the text.

• The Metaword Trick assumes that the person who wrote the web page included special editorial comments, using the HTML language. In particular, if your keyword turns out to appear in the title of a web page, that makes it likely to be a better match.

• These two tricks are useful, but when you’re dealing with millions of matching pages, we need much more powerful tools in order to quickly pick the best matches.

Sorting a list of a million web sites is different from sorting a list of numbers or names.

Sorting a large set of objects by importance is called ranking.

If the search engine were a human, and familiar with the question we asked, then it could carefully read every matching web page, and sort them in order of usefulness, showing us the best matches first.

Asking a computer to do web page ranking seems to be another example of an impossible task, since the search engine can’t actually understand the web pages.

But automatic ranking is also a vital task, because we can’t afford to pay people to read and rank web pages (nor can we wait that long!), and the quality of pages on the web varies from marvelous to ridiculous.

Is ranking without comprehension possible?

The search engine doesn’t understand what the search words or the web pages mean - it doesn’t even know English.

How can reliable, reputable, reasonable web pages be selected over the ocean of misinformed or irrelevant matter on the web?

Early search engines actually did try having people read and rate individual web pages. But such a rating system is very expensive, requires hiring experts in every possible field, and must be updated daily as web pages change and new ones are added.

Nonetheless, we will come up with a solution, and it will work well even if the pages are written in Polish, Esperanto, or the Martian language!

Example. Ranking restaurants.

We ourselves sometimes choose between items about which we seem to know nothing, and we don’t simply flip a coin. We have our own set of tricks to use.

Suppose your job has sent you to work in the country of Vulgonia for a week. Let’s assume you don’t speak a word of Vulgonian.

You go to the downtown area hoping to get something to eat, and you see two restaurants are open. One says BLATNOSKI LOBSOPPY and the other says DINGLE MARKSWART. There are menus in the window, but you can’t read them. The windows are too steamy to see inside.

You stand outside the two restaurants for ten minutes, and make up your mind. You are confident you are going to a good restaurant. How is this possible?

It looks like the restaurant on the left is much more popular than the one on the right. You automatically assume the left one is better, based on observing the choices of other people.

You are basing your choice on its popularity. Search engines use a similar trick called the Authority Trick.

So we can see that, at least in some simple situations, we may find that we need to make a choice without knowing what we’re choosing.

If there are already other people making choices, then a reasonable strategy is to prefer the most popular choice.

A refinement to this strategy arises if you have some way of judging the people making the choices. If you feel some people are more reliable, or more knowledgeable, or have more in common with you, you might weight their choices more strongly, giving their choices more authority.

An example of this is when you take a movie recommendation of a friend over that of someone you just met.

The Authority Trick suggests that you can sometimes make good choices without knowing what you’re choosing. But a computer can’t count people going to a restaurant. Can we see a way to extend this idea of the authority trick to our web page ranking problem?

Example. Ranking papers in mathematics

Let’s suppose that, by an incredible mistake, you’ve been asked to give some advice about a famous problem in higher mathematics, a language you probably don’t speak!

This famous mathematical problem is called the Riemann hypothesis.

Ever since Riemann stated his hypothesis in 1859, mathematicians have been investigating and puzzling over how to determine whether it is true that the Riemann zeta function ζ(s) will never produce a zero result except for a special set of input numbers.

This has resulted in many papers, of varying quality, being written about the Riemann hypothesis.

Now suppose a friend has suddenly become interested in this Riemann hypothesis, and has asked you to recommend just one paper to read.

Knowing nothing about this problem, you go to the library and find 20 mathematical papers on this topic.

You can’t just hand your friend all 20 papers!

Can you make a recommendation that at least looks intelligent?

The 20 mathematical papers you found each include a bibliography, that is, a list of other papers that the author referred to while writing this paper.

Even if you don’t know much mathematics, you still can recognize a bibliography; that means that you not only have 20 papers, you also have 20 lists of what are probably good, useful papers.

Suppose you notice many of the papers cite one or both of these papers:

• Conrey, J. Brian (2003), The Riemann Hypothesis, Notices of the American Mathematical Society.

• Dudek, Adrian W. (2014). On the Riemann hypothesis and the difference between primes. International Journal of Number Theory 11 (03).

Now that you know that people who think about the Riemann hypothesis often cite papers by Conrey or Dudek, you could reasonably go to your friend and suggest that those papers would be an excellent starting point.

You naturally assumed that all the citations in a paper are “votes” or “likes” for other papers, so that if you could keep track of the papers with the most votes, you probably had a few winners.

The popularity idea isn’t perfect. A paper could also be frequently cited because it is controversial or disputed or wrong.

Your collection of papers might also include many references to:

• Smaley, Ricardo (2012), Riemann was wrong!, Ruritanian Mathematics Journal.

Without mathematical training, and reading this paper, we can’t judge whether it is a correct reference or not, but the fact that so many people have referred to it suggests that it is nonetheless an interesting reference, and one that you might mention to your friend.

The point is that the existence of bibliographies means we can make some guesses about influential and important papers, without reading them.

And that suggests that a computer program can sometimes use similar clues to make what look like intelligent decisions.

Ranking: Counting followers of Twitter users

One of the most prominent features of social media is the ability to “friend” or “follow” or “like” or “connect” with other users.

This automatically creates a ranking among users: those with the most friends, followers, likes, or connections are seen to be most important, and are naturally regarded by new users as worth following as well.

On Twitter, for instance, Taylor Swift has more followers than Barack Obama, so, at least as far as Twitter users are concerned, a popular singer has far more authority or importance than the U.S. president.

Now you can start to see how a computer can learn about us humans; simply by counting followers on Twitter, a computer program would gain some correct notions of fame and influence.

In our example of the Riemann hypothesis, the most authority was given to the paper which received the most citations from others.

In the Twitter example, the most authority was given to the person who had the largest number of followers.

But if we think about it, we can use this idea to try to estimate the importance of web pages as well.

Most web pages include links, referring the reader to other pages that have related information.

These links are created by humans, who have made a choice about which web pages to link.

Thus, a link is a sort of vote, recorded on one web page, for another web page.

Ranking: Counting links to web pages

Webpage links

Most web pages include links, allowing a reader to get more information about a specific topic.

This is often because a user begins with a very general question, and wants help in gradually focussing the question to get the exact information desired.

A high school student might be interested in whether FSU has a foreign study program in its German Department. One way to get there involves moving through a series of web pages, each of which includes a link to the next one.

At each step, the user is exploring (sometimes making a mistake, and moving backwards!). The links between pages allow the user to slowly explore the information and usually make it to the right stuff.

How is a link put into a web page?

We have mentioned that the HyperText Markup Language (HTML) includes some tools for formatting a web page, so that it has a title, and can make lists, include images, and various fonts. HTML also specifies how a writer can insert a link into a web page.

For example, near the top of the FSU main web page, there is an item ACADEMICS which is a hyperlink.

Actually, ACADEMICS is the visible part of the link, what the user sees. There is also an invisible part, what is called the Uniform Resource Locator (URL), which is simply the web address. For ACADEMICS, this web address is www.fsu.edu/academics/

When you click on the word ACADEMICS on the FSU web page, what happens is the same as if you typed the URL www.fsu.edu/academics/ into your browser. The hyperlink just makes this easy for you.

The ACADEMICS hyperlink is set up by inserting the following text into the FSU main web page, which simply lists the URL to be associated with the visible text:

<a href = "www.fsu.edu/academics/"> ACADEMICS </a>

Any number of links can be included in a web page, and they can point to any place on the web that the author of a web page thinks might be useful.
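To show that a program can pick the links out of a page without understanding it, here is a small sketch using the HTMLParser class from Python’s standard library. The class name LinkCollector and the function extract_links are made up for this illustration; a real web crawler is far more elaborate.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, visible text) pairs from the <a href=...> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # URL of the link we are currently inside, if any
        self._text = []     # pieces of visible text seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


links = extract_links('<a href = "www.fsu.edu/academics/"> ACADEMICS </a>')
```

For the ACADEMICS example, the result is a single pair: the invisible URL and the visible text.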

So we can think about a link as a sort of vote by one web page for another web page. We need to think about how to count these votes.

Compare web page A, which includes 50 links to other web pages, with web page B, to which 50 different web pages all link.

There’s no reason to think web page A is very important, but web page B has gotten the attention of 50 different web writers; perhaps there’s something worth seeing on web page B!

Links TO a web page are important, not links FROM it.

So a web site that contains no links might still be very important, whereas a web site that has no links to it seems to be pretty useless.

We’ll call this initial idea for ranking web pages the Hyperlink Trick.

Links: Inlinks are web pages that vote for you

Example.

As a simplified example of the Hyperlink Trick, suppose you search for a scrambled egg recipe and the search engine finds two matching pages: Ernie’s recipe and Bert’s recipe.

Which recipe should the search engine recommend most strongly?

The search engine looks at how many inlinks each website has.

A search engine can’t read or understand the recommending pages.

But it can certainly count the number of links coming in to either of the two recipes.

The fact that Bert’s page has more links is at least a suggestion that people (who can read and understand Web pages!) found Bert’s page more useful, or his recipe better tasting.

So in the absence of any better information, a search engine could take the number of incoming links to a Web page as a rough indication of the rank or value or authority of that page.

Incoming links can be reported by web crawlers

We have already seen that we needed an army of programs called web crawlers to wander around the web constantly, gathering information about network connections, web sites, web pages, search phrases inside of web pages and the locations of those search phrases.

This goes to making up a map of the web, and an index of search words.

Now that we see that incoming links are important, we can simply ask our web crawlers to include this information in their searches. As they “read” a web page, they notice every link that points to another page, and they tell the search engine “Web page A links to web page B”.

Now our search engine can track the incoming links to every page.
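Counting incoming links from such crawler reports takes only a few lines. The sketch below assumes each report is a pair (source page, target page); the page names and the number of links are invented for illustration.

```python
from collections import Counter


def count_inlinks(reports):
    """reports: (source_page, target_page) pairs sent back by web crawlers."""
    distinct = set(reports)  # the same link reported twice still counts once
    return Counter(target for _source, target in distinct)


# Hypothetical reports: three different pages link to Bert, one links to Ernie.
reports = [
    ("page1", "ernie"),
    ("page2", "bert"),
    ("page3", "bert"),
    ("page4", "bert"),
    ("page2", "bert"),   # duplicate report, ignored
]
inlinks = count_inlinks(reports)
best_first = [page for page, _count in inlinks.most_common()]
```

Here best_first lists Bert’s page ahead of Ernie’s, purely because more pages link to it.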

Is ranking by link count good enough?

It’s easy to see some problems with such a simple ranking system.

1. If all the Web pages pointing to Bert’s page said “This recipe is terrible!”, the search engine would still give Bert’s page a higher ranking than Ernie’s;

2. If Ernie knew how the search engine works, he could quickly write 10 new Web pages that praise his recipe, so now he ranks higher than Bert;

3. High school students and film critics both make top ten lists of movies, and there are many more high school students than film critics.

Example. Returning to our example of Bert & Ernie’s webpages for scrambled eggs, we now look at a webpage that links to Bert’s page and another which links to Ernie’s page.

Each page has one link, but are these links equal?

John MacCormick is not a famous chef, but Alice Waters is.

If we, being humans, know that John MacCormick is not a famous chef, but Alice Waters is, then we are likely to assume that it’s safe to prefer Bert’s recipe, because Alice Waters’s recommendation has more authority.

We would like to modify our ranking procedure. Instead of only counting the number of hyperlinks to pages, we’d like to include somehow a measure of the authority of the Web page that is making the recommendation, that is, the hyperlink to Bert or Ernie’s page.

In this way, our ranking procedure can take advantage of The Authority Trick. Hyperlinks from pages with high “authority” will result in a higher ranking than links from pages with low authority.

But how can a computer determine authority?

Let’s consider combining the Hyperlink Trick with the Authority Trick.

Let’s start by assuming there are a total of 102 web pages that point to either John MacCormick or Alice Waters, and let’s assign each of these web pages an authority of 1.

Now suppose John MacCormick has 2 hyperlinks pointing to his web page, while Alice Waters has 100.

We might give MacCormick an authority of 2, and Alice Waters an authority of 100, as though the lower level web pages were voting for them.

Then we might suppose that any recommendation (hyperlink) by Alice Waters should add 100 “authority points” to that web page, while a recommendation by John MacCormick would only be worth 2 points.

This means that Ernie’s recipe has an authority score of 2, and Bert’s 100.

Google would know that Alice Waters’ site has more authority than John MacCormick’s site because her site would have a higher ranking.
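The two-level vote count in this example can be written out as a short Python sketch. The page names (fan0, fan1, ..., maccormick, waters) and the function names are placeholders chosen for the illustration; the numbers 2, 100, and 102 come from the example.

```python
def inlink_map(links):
    """Turn {page: pages it links to} into {page: pages that link to it}."""
    incoming = {}
    for source, targets in links.items():
        for target in targets:
            incoming.setdefault(target, []).append(source)
    return incoming


def authority(page, incoming):
    # A page nobody links to keeps the starting authority of 1;
    # otherwise each incoming link is worth one point.
    return max(1, len(incoming.get(page, [])))


def recipe_score(recipe, incoming):
    # Each recommending page adds its own authority to the recipe's score.
    return sum(authority(rec, incoming) for rec in incoming.get(recipe, []))


# 102 low-level pages: 2 point to MacCormick, 100 point to Waters.
links = {f"fan{i}": ["maccormick"] for i in range(2)}
links.update({f"fan{i}": ["waters"] for i in range(2, 102)})
links["maccormick"] = ["ernie_recipe"]
links["waters"] = ["bert_recipe"]

incoming = inlink_map(links)
ernie = recipe_score("ernie_recipe", incoming)
bert = recipe_score("bert_recipe", incoming)
```

This handles exactly two levels of links, which is enough for the recipe example but not for the whole web.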

Count the links to the pages that link to the pages that link to the pages . . .

Adding up links and links to links and so on almost works...

The ideas of using hyperlinks and assigning authority are good ones. We might try to implement these ideas by starting every web page with one authority point. Then, each hyperlink in a page would cause that page’s authority points to be added to the linked page’s authority points.

Then we just have to do this for all web pages and we’re done, right?

Unfortunately, this idea won’t quite work. A problem arises if we encounter a cycle, a sequence of hyperlinks forming a loop.

In this case, you can start on page A, jump to B, then E, and back to A.

The pages A, B, and E form a cycle.

This means that our method for assigning authority points will fail. To see this, let’s imagine trying to compute the rankings for this case.

C and D have no links pointing to them, so they get a score of 1.

C and D point to A, so A gets a score of 2.

A points to B, and B points to E, so they get scores of 2 as well.

Are we done? No, A’s score is out of date now! E points back to A, so E’s 2 points must be added to A’s score.

We update A to 4. But then we must update B and E...and A again. And there must be many cases of cycles like this over the entire Web.
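A few lines of Python make the trouble visible. This sketch assumes the link structure described in the example (C and D link to A; A links to B, B to E, and E back to A) and updates the pages in the order A, B, E, over and over. The function name naive_scores is invented for the illustration.

```python
# Link structure of the five-page example: page -> pages it links to.
links = {"A": ["B"], "B": ["E"], "C": ["A"], "D": ["A"], "E": ["A"]}


def naive_scores(links, order, passes):
    """Repeatedly set each page's score to the sum of the scores linking to it."""
    incoming = {page: [] for page in links}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)

    # Pages with no incoming links get 1 point; the others are not yet known.
    scores = {page: (0 if incoming[page] else 1) for page in links}
    history = []
    for _ in range(passes):
        for page in order:
            scores[page] = sum(scores[src] for src in incoming[page])
        history.append(dict(scores))
    return history


history = naive_scores(links, order=["A", "B", "E"], passes=3)
a_scores = [snapshot["A"] for snapshot in history]
```

After each pass the score of A has grown again (2, then 4, then 6), so the computation never settles on a final answer.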

To save our idea, we need the Random Surfer Trick.

The Random Surfer will follow a trail of hyperlinks, but only for a while!

The random surfer simulates a person surfing the Internet.

A starting page is picked at random. If this page has any hyperlinks, the surfer picks one at random and moves to that new page. If that page has hyperlinks, another random choice leads to another page, and so on.

If a page has no links, a new page is chosen at random.

The random surfer never moves backward.

Even if the current page has links, the surfer is allowed occasionally to instead make a jump to a random page, as though bored.

In some sense, the random surfer models user behavior.

The random surfer model takes into account the quantity (the Hyperlink Trick) and the quality (the Authority Trick) of incoming links at each page.

Randomly surfing the Internet seems an odd way of trying to understand the authority index we are seeking.

However, if we do this experiment many, many times, then you should be able to see that a web page that is pointed to by many links will be more likely to be visited often by the random surfer.

On the other hand, we will never get stuck in an infinite loop, because we always restart the process after a certain number of steps.

So the Random Surfer Trick estimates the authority index by wandering through the Internet, and noticing which pages it visits most often.

Example of Random Surfing

Here the surfer starts at page A, and moves to another page following a randomly selected link (darker arrow). Three such steps reach page B.

From page B, the surfer jumps (dashed line) to page C, then links to page D, then another page, then another random jump.

From there, the surfer takes two linking steps and stops.

It turns out that if you let the random surfer wander around the web like this, then you have solved the authority index problem.

This is because, in a natural way, the importance or authority of a web page is related to the number of times the random surfer visited that web page.

More precisely, if we make the authority index a percentage, then the authority of a web page is the percentage of visits that were made to that page.

The web has lots of cycles that could trap someone who can only move along links. But the surfer gets bored easily, and jumps around, escaping the cycle traps.

A simulation on an internet of 16 pages

We have recorded the number of times each page was viewed by the random surfer over 1000 steps.

Authority = (number of visits to this page / number of steps) × 100 (to convert to percent)
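The random surfer is easy to simulate. The Python sketch below is only an illustration under stated assumptions: it uses the small five-page web with the cycle A, B, E (not the 16-page web), and the chance of a bored jump, 0.15, is a value picked for the example. The function name random_surfer is likewise made up.

```python
import random

# Five-page web with a cycle: page -> pages it links to.
links = {"A": ["B"], "B": ["E"], "C": ["A"], "D": ["A"], "E": ["A"]}


def random_surfer(links, steps, jump_probability=0.15, seed=0):
    """Return {page: authority in percent} estimated from one random walk."""
    rng = random.Random(seed)          # fixed seed so the run can be repeated
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)        # the starting page is picked at random
    for _ in range(steps):
        visits[current] += 1
        outgoing = links[current]
        if not outgoing or rng.random() < jump_probability:
            current = rng.choice(pages)      # no links, or bored: random jump
        else:
            current = rng.choice(outgoing)   # follow a randomly chosen link
    # Authority = visits to this page / number of steps, times 100.
    return {page: 100.0 * count / steps for page, count in visits.items()}


authority = random_surfer(links, steps=100_000)
```

The percentages add up to 100, and pages A, B, and E, which sit on the cycle and receive links, collect far more visits than C and D, which can only be reached by a random jump.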

Rank = Hyperlink + Authority + Random Surfer

1. Hyperlink Trick: a page with many incoming links should receive a high ranking. Indeed, the more incoming links a page has, the more likely the random surfer will visit it.

2. Authority Trick: an incoming link from a highly authoritative page should improve a page’s rank more than a link from a less authoritative page. Indeed, a popular page will be visited more often than an unpopular one, and so there will be more opportunities for the surfer to arrive from the popular page than the less popular one.

3. Random Surfer Trick: a random surfer in the form of a computer program starts at a page at random and follows hyperlinks moving to a new page. The surfer occasionally jumps to a random page, eliminating the problem of getting stuck in loops.

Now our search engine is complete

A good search engine needs:

1. a map of the web, its pages, and links;

2. a page match algorithm that finds pages that match the user’s search words;

3. a page rank algorithm that sorts matching pages so the most authoritative are listed first.

It only took half a second to carry out 1,000,000 steps of the random surfer procedure in our earlier example using 16 pages.

Since the World Wide Web has 40 billion pages, it will obviously take much longer to compute a complete authority index list.

However, if several computers carry out a separate random surfer analysis, the results can be combined. Since Google has about two million computer processors available, the task suddenly becomes much more doable.
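One simple way to combine separate runs, sketched here as an assumption and not as a description of what Google does, is to add up the visit counts from every computer and then convert the totals to percentages. The visit counts in the example are invented.

```python
from collections import Counter


def combine_visit_counts(runs):
    """runs: one {page: number of visits} dictionary per computer."""
    total = Counter()
    for visits in runs:
        total.update(visits)           # adds the counts page by page
    steps = sum(total.values())
    return {page: 100.0 * count / steps for page, count in total.items()}


# Two hypothetical computers, each reporting 100 steps of random surfing.
runs = [{"A": 30, "B": 70}, {"A": 50, "B": 50}]
combined = combine_visit_counts(runs)
```

In this example page A received 80 of the 200 visits, so its combined authority is 40 percent.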

Moreover, Web pages don’t change very fast, so results can be computed every week or so.

So a good search engine will have available an up-to-date ranking of all Web pages, before a user has made any search requests.

In 1998, Larry Page and Sergey Brin announced their PageRank algorithm, built into Google search. Results were noticeably better and faster, and the “first page” results were often exactly what users were seeking; Google soon became the dominant search engine. Google and its competitors continue to improve their search engines.

Example. Is Google’s PageRank algorithm foolproof? The Case of J. C. Penney.

The Web began as a way for scientific researchers to communicate but now it involves commercial services and advertising. Getting a company’s advertisement pages onto the first page of search results means big money.

People do a lot of shopping online; instead of visiting stores, they look for items by using the search engine on their browser. When the search results come back, most shoppers only look at the first page of results, and 1/3 of the time they go for the very first result. That means being the first result in a search can make big money for an online company.

The following is excerpted from “The dirty little secrets of search”, by David Segal, which appeared in the New York Times, February 12, 2011.

Pretend for a moment that you are Google’s search engine. Someone typesthe word “dresses” and hits enter. What will be the very first result? Thereare, of course, a lot of possibilities. Macy’s comes to mind. Maybe a specialtychain, like J. Crew or the Gap. Perhaps a Wikipedia entry on the history of

hemlines.

O.K., how about the word “bedding”? Bed Bath & Beyond seems a candi-date. Or Wal-Mart, or perhaps the bedding section of Amazon.com.

You could imagine a dozen contenders for each of these searches. But inthe last several months, one name turned up, with uncanny regularity, in theNo. 1 spot for each and every term: J. C. Penney.

The company bested millions of sites - and not just in searches for dresses, bedding, and area rugs. This striking performance lasted for months, most crucially through the holiday season, when there is a huge spike in online shopping. Type in “Samsonite carry on luggage”, for instance, and Penney for months was first on the list, ahead of Samsonite.com.

Google’s stated goal is to sift through every corner of the Internet and find the most important, relevant Web sites. Does the collective wisdom of the Web really say that Penney has the most essential site when it comes to dresses? And bedding? And area rugs? And dozens of other words and phrases?

The New York Times asked an expert, Doug Pierce, to study this question. What he found suggests that the Google search often represents layer upon layer of intrigue.

If you own a Web site about Chinese cooking, your site’s Google ranking will improve as other sites link to it. Even links that have nothing to do with Chinese cooking can bolster your profile. And here’s where the strategy that aided Penney comes in. Someone paid to have thousands of links placed on hundreds of sites scattered around the web, which lead directly to JCPenney.com.
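To see why paid links can work, here is a small Python sketch that ranks a made-up mini-web twice: once with only honest links, and once after 20 extra pages are added whose only purpose is to link to one store. It uses a common textbook form of PageRank with a damping factor of 0.85; the page names, the number of spam pages, and the `pagerank` function are assumptions for illustration, not Google's actual ranking code.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank by repeated averaging (power iteration)."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base share, then shares passed along links.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            # A page with no outgoing links spreads its rank over all pages.
            targets = links[p] or pages
            share = damping * rank[p] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

# Hypothetical mini-web: two stores, each with one honest incoming link.
honest = {
    "store_a": [],
    "store_b": [],
    "blog_1": ["store_a"],
    "blog_2": ["store_b"],
}

# The same web plus 20 unrelated pages that all link to store_b.
farm = dict(honest)
for i in range(20):
    farm[f"spam_{i}"] = ["store_b"]

before = pagerank(honest)
after = pagerank(farm)
print("before:", round(before["store_a"], 3), round(before["store_b"], 3))
print("after: ", round(after["store_a"], 3), round(after["store_b"], 3))
```

In the honest web the two stores tie. Once the extra links are added, store_b pulls far ahead even though none of the new pages has anything to do with what the store sells.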

Mr. Pierce found 2,015 pages with phrases like “casual dresses”, “evening dresses”, “little black dress” or “cocktail dress”. Click on any of these phrases and you are bounced to the main page for dresses on JCPenney.com.

Some of these sites are related to clothing, but many are not. There are links to JCPenney.com’s dresses page on sites about diseases, cameras, cars, dogs, aluminum sheets, travel, snoring, diamond drills, bathroom tiles...

Google warns against using such tricks to improve search engine ratings. The penalty for getting caught is a pair of virtual concrete shoes: the company sinks in Google’s results.

On a Wednesday in 2011, JCPenney was the subject of Google’s “corrective action”.

At 7pm, JCPenney was the No. 1 result for “Samsonite carry on luggage”. Two hours later, it was No. 71.

At 7pm, Penney was No. 1 in searches of “living room furniture”. By 9pm, it had sunk to No. 68.

Penney fired its search engine consulting firm, and announced that it was “disappointed” with Google’s actions.

Google engineer Matt Cutts emphasized that there are 200 million domain names and a mere 24,000 employees at Google. “Spammers never stop”, he said.

What if you want to create your own Web page?

A web page is just a text file, which can be created with any text editor.

However, unlike a typical text file, a web page includes special tags that essentially say “This is the title”, or “This part should be in italics” or “This begins a list” or “This is a link to another page.”

The rules for using and interpreting these tags are part of HTML, the HyperText Markup Language.

The browser uses the tags in order to format the web page, so that you only see what looks like a text file.
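As a rough illustration of tags at work, the Python sketch below stores a tiny made-up web page in a string and uses the standard library's `html.parser` module to list the tags it contains, the link addresses, and the text a reader would actually see. The page contents, the address `https://example.com/dogs.html`, and the class name `TagLister` are invented for this example.

```python
from html.parser import HTMLParser

# A tiny hypothetical web page: title, italics, a list, and a link.
PAGE = """<html>
<head><title>My Cat</title></head>
<body>
<p>The cat sat <i>on the mat</i>.</p>
<ul><li>First item</li><li>Second item</li></ul>
<a href="https://example.com/dogs.html">A page about dogs</a>
</body>
</html>"""

class TagLister(HTMLParser):
    """Collect opening tags, link targets, and visible text from a page."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        if tag == "a":
            # The href attribute holds the address the link points to.
            self.links.extend(value for name, value in attrs if name == "href")

    def handle_data(self, data):
        # Skip the whitespace between tags; keep the words a reader sees.
        if data.strip():
            self.text.append(data.strip())

lister = TagLister()
lister.feed(PAGE)
lister.close()
print(lister.tags)
print(lister.links)
print(lister.text)
```

The `title` tag marks the title, `i` marks italics, `ul` and `li` mark a list and its items, and `a` marks a link; a browser reads these same tags to decide how to display the text.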

To understand how a web page works, you can look at the version that includes the tag information. In Firefox, for instance, you would go to the Developer menu and choose Page Source.

If the web page is very fancy, you may be surprised at how complicated the HTML version is!

Mozilla, the maker of the Firefox browser, has an online editor for learning to create a Web page using HTML. You can find it at https://thimble.mozilla.org/

It is a good place to learn by taking a template and changing it.

The simplest template creates a poster. A sample is given and you can modify it however you like. Other projects are included, like creating automatic excuses for late assignments!

Socrative Quiz Searching Quiz2

CTISC1057

Use the table below to answer the first 5 questions.

Word           Page - Position

a              3-10
cat            1-3   1-7   3-7
dog            2-3   2-7   3-11
mat            1-11  2-11
my             1-2   2-2   3-2
on             1-9   2-9
pets           3-3
sat            1-8   3-12
stood          2-8   3-8
the            1-6   1-10  2-6   2-10  3-6
while          3-9
<titleStart>   1-1   2-1   3-1
<titleEnd>     1-4   2-4   3-4
<bodyStart>    1-5   2-5   3-5
<bodyEnd>      1-12  2-12  3-13

1. The word mat occurs on all three pages.

2. The word pets is in only one of the three titles.

3. The word sat is adjacent to cat but not to dog.

4. The word my only occurs in the title of the pages and not in the body.

5. What page(s) is the word “mat” on?

6. When we use Google Search to search the WWW, we are really not searching the Web but rather an index of the Web.

7. A web page which has the search words in the title is probably more important than one which doesn’t.

8. A hyperlink on a web page allows the user to jump from one web address to another without actually knowing the address of the second.

9. Links FROM a web page give more authority to the page than links TO the web page.

10. Google’s page ranking algorithm is foolproof.
