
The WBPN model: a proposed design approach to maximize website visibility to search engines

Full research article submitted to ITAL for review

First author: Weideman, M, Dr.
Head: Research Planning and Capacity Building
Faculty of Business Informatics, Cape Technikon
Cape Town, SOUTH AFRICA
[email protected]
tel: 27 21 9135515  fax: 27 21 9134801

Second author: Binedell, M, Mr.
Student, B.Tech (IT)
Faculty of Business Informatics, Cape Technikon
Cape Town, SOUTH AFRICA


ABSTRACT

This article presents a literature survey and a proposed model for the design of electronically visible websites. The intent of the research was to determine the views and guidelines of field experts, and to propose a design approach to guide prospective website authors.

Many users rely on search engines for the retrieval of user-specific information from the Internet. To attract new visitors, websites must be indexed and ranked by search engines. There are many website design factors that influence a website’s visibility to search engines. In the conclusion a model for the design of a search engine friendly website is proposed.


INTRODUCTION

The Internet offers unprecedented freedom to any prospective author who needs to publish a collection of data, images, sound or any other kind of information that can be presented in electronic format. Such an author needs only basic knowledge of HTML or an authoring program, a link to the Internet through an Internet service provider, and some time to construct a website.

The result of this scenario is twofold. On the one hand, the Internet undoubtedly contains the richest, fastest-changing source of information in almost any field imaginable, waiting for the adept searcher. On the other hand, this information is not categorized, is sometimes of questionable origin and, above all, is often difficult to find.

Search engines generate by far the most Internet traffic, as users scurry around trying to find up to date and useful information in this confusing maze of data. These search engines constantly need to find new webpages and index their contents to provide the user with fresh answers to their information needs. However, it has been claimed that search engines do not always produce relevant, unbiased results. The focus of this article is on a webpage design approach to ensure that search engines have an easy task in finding and indexing webpages. The resultant model is referred to as the WBPN model – the Weideman Binedell Positive Negative approach.

RELATED LITERATURE

The Internet

The Internet was established in 1969 by the U.S. Department of Defense. The purpose was to share research among military, industry and university resources, and to provide a system for sustaining communication among military units in the event of a nuclear attack. The system therefore allowed traffic to be routed around the network via many alternative routes, rather than a single path. This network of computers expanded rapidly, and by 1977 many international connections had been made.

In 1990 the physicist Tim Berners-Lee developed the World-Wide Web (WWW or the web) at CERN in Switzerland. Based on hypertext (a system of embedding links in text that point to other text), the idea of the WWW was to allow distributed teams to share research on common projects electronically, thus creating a web of human knowledge.

By 1991 several tools had been developed to help sift through the large amount of information on the Internet. One popular tool was Gopher, developed at the University of Minnesota. Gopher was a uniform system of menus that allowed users to browse through and retrieve files stored on different computers (Poulter, 1997). The WWW has since replaced Gopher as the medium used for publishing information on the Internet (Berners-Lee et al, 1994). The WWW is the fastest growing and most easily navigable section of the Internet and has expanded across all physical and logical boundaries.


Websites

A website is a collection of information organized in files commonly referred to as webpages. Webpages are electronic documents that contain any combination of text, sound and graphics. Webpages are stored on a web server and are requested by a client program, commonly referred to as an Internet browser. The first graphical browser for the WWW, called Mosaic, was developed in 1993 (D'Angelo et al, 1998).

Green (2000) differentiates between two types of webpages: static and dynamic. Static webpages are manually produced by a web designer and display the same generic information to every visitor. Dynamic webpages are computer generated: information, customised to the requirements of the user, is retrieved from a database and displayed on a blank webpage template. Dynamic webpages are only present on the web temporarily.

Intrinsically, webpages are the common thread that binds the Internet community together. They are the vehicles that convey the thoughts, sounds and images of their authors; they contain the information the inhabitants of this community are constantly searching for; and they are printed, e-mailed and shared constantly by users. The Internet and websites have become synonymous in everyday vocabulary.

Search Engines

The lack of a default search engine for the WWW prompted the development of a multitude of search engines during the mid-1990s (see Table 1). Green (2000) describes two types of search engines: web directories and crawler-based search engines.

All crawler-based search engines have three primary components (Green, 2000):
- a crawler (robot, spider): a program dispatched by a search engine to search the web for new webpages and index the hypertext on those pages,
- a search engine database, in which the indexed webpages are stored, and
- interrogation/retrieval software used to query the database.

The retrieval software matches a user’s search query to words that appear on webpages and displays the list of results on screen. These results are ranked according to algorithms developed by the search engine.

According to Sullivan (2003), the major crawler-based search engines and search result providers today are Google, AllTheWeb, MSN Search, AOL Search, Ask Jeeves, HotBot, Lycos, Teoma and Inktomi.

1994  Lycos
      WebCrawler
      Yahoo!           Web directory. Since 2002, crawler-based results are provided by Google.
1995  AltaVista        Launched as the largest search engine on the web.
      Excite
      Infoseek         Renamed "Go" in 1999.
1996  HotBot
      Inktomi          Provides search results to other search engines. The Inktomi index cannot be interrogated directly.
      LookSmart        Web directory.
1997  AOL Search
      Northern Light   Became the search engine with the biggest index (indexing 16% of the Web).
1998  Ask Jeeves       Launched as "the first natural language search engine".
      Direct Hit       Ranking of results is based on websites that users have visited. Closed in 2002.
      Google           Introduced the "PageRank" ranking algorithm. Currently the most popular search engine in use.
      MSN Search       Owned by Microsoft.
      Open Directory   Web directory.
      Overture
      Real Names
1999  AllTheWeb        Search result provider.
      FAST             Launched with the largest ever search engine index – over 200 million webpages.
2000  Teoma
2001  WiseNut

Table 1: Timeline of search engine development (Green, 2000; Sullivan, 2003)

The Role of the Internet in the Library

It is evident from the literature that the role of the library and the librarian in education is changing under the influence of technology. The title of a collection of essays by White underlines this fact: "At the Crossroads: Librarians on the Information Superhighway" (White 1995). Marcella found that more than three-quarters of the respondents to a mailed survey on information needs in the UK claimed that they would use a public library at least occasionally to satisfy their information needs (Marcella and Baxter 1999:175).

However, rapid advancements in the IT/IS field have made it difficult for libraries to keep up with the constant demand for new texts in these fields, a problem compounded by budget constraints and updating logistics. It was a natural move to merge the functions of the library and the Internet as resource providers. Some authors viewed the incorporation of the Internet into the service offered by a library as a challenge (Radcliff et al, 1993:17). Others found that the largest application for hypertext systems in academic libraries in the UK was networked document retrieval (Furner-Hines et al, 1995:31), indicating that this merging had been taking place. It would, however, be naïve to expect a total merge between the traditional IT department and the library at all institutions of higher education. IT departments normally serve other sections as well: the academics, as a vehicle in education, and the administration, as support for registration and financial matters, to name but two. In this situation, the Internet can be seen as an enabling technology rather than as a choice to be made.

Voorbij claims that libraries should support their users by assisting them in searching for information on the WWW (Voorbij 1999:598). However, it was found that librarians, traditionally a certain source of answers for the information-seeking learner, experienced problems in dealing with the Internet as a medium. Subject librarians did not always find what they needed using the searching facilities on the Internet. Basu has shown that 29.6% of a study group of reference librarians found the information they needed between 21% and 30% of the time, while only 14.8% of them found it more than 50% of the time (Basu 1995:38). Problems associated with the use of this method to find information included lack of time and the quality and quantity of data returned (Basu 1995:38,39). Another study showed that some librarians were unable to supply answers to the questions put to them owing to a lack of sophistication on their side (Devlin and Burke 1997:104). Most authors referred to the paradigm shift: librarians had to start acquiring new competencies to be able to survive (O'Leary 2000:21,22), and information professionals had to pro-actively search the web to filter out unwanted information for the users (Burton 1999:104). Corcoran claims that "…traditional business models of library operation are defunct…", and that although the number of requests for information had decreased, the complexity level had increased, implying that the expertise level of the service provider must also rise (Corcoran et al, 2000:29,30). Sherman quoted Kahle, who states that for every one individual entering a library per day, 20 web searches are being done, implying that the web is beginning to challenge the value of all libraries put together (Sherman 1999:61). Two authors describe in detail how three high-profile libraries (Los Alamos National Laboratory, Shell Research and Hewlett-Packard) have adapted dramatically to be able to deliver a cutting-edge service (Pack and Pemberton 1999a; Pack and Pemberton 1999b; Pemberton and Pack 1999).

Website Visibility

Commercial usage of the Internet became widespread with the appearance of the WWW. In 1993, 1.5% of web servers were on .com domains (the domain intended for commercial entities). By 1997, this figure had risen to 60% (Brin et al, 1998). In 1999, Lawrence found that 83% of web servers contained commercial content, and that the majority of these websites were built to attract new business (Lawrence et al, 1999).

The success of an online enterprise depends largely on the number of potential customers that visit the site. There are many ways to publish and promote a website; the options range from traditional or online advertising to exchanging links with partner websites or including the URL on office stationery (Thelwall, 2000, 2001). For many sites, search engines will be a significant source of new visitors. Green (2000) argues that search engines are amongst the most popular sites on the web. Lawrence et al (1999) rank search engines among the top sites accessed on the web and estimate that 85% of Internet users use search engines to find information.

A website owner can obtain search engine recognition by simply inviting search engines to index the website. Being listed in a search engine index is, however, no guarantee that a user will be able to find the website. Websites that are not ranked highly (say, within the top 10 to 20 results displayed by the search engine) are less likely to be visited (Courtois et al, 1999; Notess, 1999). Users tend to examine only the first page of search results, and once they find a good match for their search they tend not to look further down the list. Most search engines display only the 10 most relevant results on the first page. Thus, exclusion from the top 10 results means that only a small number of search engine users will actually see a link to the website (Introna et al, 2000; Henzinger et al, 2002).

Information Seeking

A large amount of research has been done on the way Internet users approach the retrieval of information from the Internet. One recent study found that women in IT professions make use of all four identified searching modes (undirected viewing, conditioned viewing, informal search and formal search) (Choo et al, 2003). A tendency towards high-volume searching of sports-related results around the dates of world-class events has resulted in the implementation of very specific architecture and design for high-volume websites (Dantzig 2003). It is claimed that approximately 80% of all website traffic originates from search engines.

Internet searching has become part of our lives. It has become the way in which we find the information needed to progress a task, and it is the cause of much frustration when relevant information proves difficult to find.

VISIBILITY ISSUES

META Tags

META tags are optional components of an HTML document that provide information on particular document characteristics but are not visible when the page is viewed in a browser (Tunender et al, 1998). The word "META" in "META tags" is derived from "metadata", meaning data about data. META tags have been included in the HTML specification from version 2.0 in order to facilitate indexing of a webpage by automatic programs such as search engine crawlers (Thelwall, 2000).

Although a variety of META tags exist, only two have been widely used for search engine optimization: the “description” and “keywords” META tags. The description tag should contain a brief description of the content of the webpage, while the keywords tag should be populated with a list of words or phrases associated with the content of the webpage. See Figure 1 for an example of these tags in use.

Figure 1: META tag examples
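Typically, such a head section takes the following form (the page topic and keyword values here are hypothetical and illustrative only):

    <head>
      <!-- brief summary that many search engines display in their result listings -->
      <meta name="description" content="Reviews and specifications of cross-country mountain bikes.">
      <!-- comma-separated terms associated with the content of the page -->
      <meta name="keywords" content="mountain bike, cross-country, bicycle reviews, XC racing">
    </head>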

The use of the keywords META tag in webpages has led to some problems. The early search engines of 1996 read the content of this tag and associated the keywords it contained with the regular text on the page. Abuse of the tag soon became widespread: website authors would use misleading words or simply repeat some words excessively in an attempt to improve the relevancy of their webpages. Terms such as keyword stacking and stuffing, tiny text and spam became commonplace (Thurow, 2003). This made the tag an unreliable indication of site content. In 1997, support for the tag was dropped by most of the major search engines, and newer search engines never added support at all (Sullivan, 2003). At the time of writing, only Inktomi still indexes the keywords tag.

Tunender tested AltaVista, Excite, Infoseek and Lycos in 1998 and found that none of the search engines were actively indexing the META tag (Tunender et al, 1998). Research conducted by SiteMetrics in 1998 showed that 32% of commercial websites used META tags. In the same year, Qin et al found that only 24.4% of a sample of webpages contained META tags, many of them used inappropriately. In a survey of website home pages in 1999, Lawrence et al found that only 34.2% of the pages contained META tags. In a similar study in 2000, Thelwall recorded the use of the keywords tag at 35% and that of the description tag at 33%. A more recent work found that 24.2% of a sample of webpages made effective use of the keywords META tag, 12.1% had the tag in the coding of the webpage but did not use it correctly or effectively, and 63.7% did not use it at all (Weideman, 2002).

Sullivan argues that, since it enjoys so little support, it is not worth the effort to write keywords META tags for new webpages. It is not necessary, however, to remove the existing keywords META tags from old webpages. He further claims that only two META tags, the “description” and the “robots” tags, are still useful for website owners. These tags are still supported by most of the major search engines (Sullivan, 2003).

The description META tag allows website owners to control, to some degree, how webpages are described by search engines. The text in the description META tag, if present, is used by many search engines as the description for the webpage displayed in their listings. A combination of text from the description tag and relevant text from the body of the webpage is also used. Table 2 indicates how the major search engines compile the descriptions that are displayed for webpages.

The robots tag can be used to specify that a particular page should not be indexed by a search engine. Most major search engines support this tag. A more efficient way to block indexing, however, is to use a "robots.txt" file. Robots.txt is a simple text file placed in the root directory of the website that indicates which parts of the site crawlers may index and which should be ignored. The use of a robots.txt file means that it is not necessary to add the META robots tag to every page.
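As a simple sketch (the directory names are hypothetical), a robots.txt file that excludes two directories from all crawlers, together with the equivalent page-level robots META tag, could read as follows:

    # robots.txt, placed in the root directory of the website
    User-agent: *
    Disallow: /temp/
    Disallow: /print/

    <!-- page-level alternative, placed in the head of an individual webpage -->
    <meta name="robots" content="noindex, nofollow">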

There are many other META tags, such as the “author” and “date” META tags, but these are ignored by the major search engines (Sullivan, 2003).

AllTheWeb   Text from the webpage; the "description" tag is used for a second description.
AltaVista   "Description" tag and text from the webpage.
Google      "Description" tag or text from the webpage.
Inktomi     "Description" tag or text from the webpage.
Lycos       "Description" tag.
Teoma       "Description" tag, or "description" tag and text from the webpage.
WiseNut     Text from the webpage; the "description" tag is not supported.

Table 2: Search engine webpage description compilation (Sullivan, 2003)

The Title Tag

The HTML title tag is placed within the head section of a webpage and should contain a brief but descriptive title for the page. Its contents also appear in the title (top) bar of the browser when the webpage is viewed. Furthermore, it provides the default text that is used when a webpage is added to the list of "Bookmarks" or "Favorites" in a browser.

The title tag is widely considered to be one of the most important webpage elements taken into account by search engines when pages are ranked (Thelwall, 2000; Tunender et al, 1998; Sullivan, 2003). The research results of Tunender et al (1998) indicate that the retrieval of a website can be improved by better utilization of the title tag. The most efficient retrieval is likely to be achieved when a searchable webpage contains multiple keywords within the title tag. Most search engines also use the text in the title tag as the title for the webpage on search results pages. One research project found that 32 out of a sample of 33 academic webpages made effective use of the title tag (Weideman, 2002).

Keyword-Rich Page Content

The ranking algorithms of most search engines consider both the position and the frequency of keywords within a webpage. A document is likely to rank higher if it contains more instances of a particular keyword and if the keyword appears earlier in the document (Introna et al, 2000; Sullivan, 2003).

Sullivan recommends that website owners select relevant keywords for each webpage. Keywords should consist of two or more words, since too many sites will be relevant for a single word. Keywords should appear in the HTML title tag, in the page heading and in the first paragraphs of the webpage, and should be relevant to the content of the page (see the sketch below). In the past, placing page headings or keywords in HTML heading tags (H1, H2, etc.) could improve rankings; currently, however, this is considered to contribute little to the relevancy of page text. Website content should be structured in such a way that each webpage focuses on only one particular topic.
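A minimal sketch of this keyword placement, assuming a hypothetical page targeting the phrase "cross-country mountain bikes", could look as follows:

    <html>
    <head>
      <!-- target phrase in the title tag -->
      <title>Cross-country mountain bikes: reviews and buying guide</title>
    </head>
    <body>
      <!-- ...repeated in the page heading -->
      <h1>Cross-country mountain bikes</h1>
      <!-- ...and early in the opening paragraph -->
      <p>Choosing cross-country mountain bikes involves weighing frame weight,
         suspension travel and gearing against the intended terrain.</p>
    </body>
    </html>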

Tunender et al (1998) found that the frequency of a term’s appearance is more important than the uniqueness of the term. A character string that appears multiple times on a webpage is more likely to be indexed than one that is unique.

Some websites deliberately try to manipulate their placement in the rankings of search engines; the resulting pages are called spam (Henzinger et al, 2002). Text spam is a common technique that involves modifying the text on a webpage in such a way that the search engine rates the page as more relevant than the human reader would expect. One method is to overfill a webpage with repetitions of a small set of keywords, called keyword stuffing (Thurow, 2003). For example, a website might repeat keywords in a small font at the bottom of a webpage or in letters that are the same color as the background of the page. Another method is to add irrelevant but frequently requested words to a page to make sure that the page is retrieved by common searches. Henzinger et al use the example of pornographic sites that sometimes add the names of famous personalities to their pages so that these pages are returned when users search for such personalities.


Spamming has become a widespread problem and a major threat to the quality of search engine rankings. As a result, search engines are consistently developing and refining techniques to detect and circumvent spam (Henzinger et al, 2002).

Frames

A frames-based webpage is composed of individual webpages, called frames, that are blended together according to the instructions of a master page, the "frameset" page (Sullivan, 2003). Frames are normally used on a website to enhance navigation or to keep certain elements fixed on the screen while the contents of the main window move. Frames-based webpages pose some problems for search engines (Thelwall, 2000; Tunender et al, 1998). Firstly, some search engine crawlers cannot interpret the frame layout instructions on frameset pages (Sullivan, 2003). Since frameset pages generally contain no other information, search engines will not be able to index any content or follow hyperlinks to other webpages. This can cause an entire website to become invisible to a search engine. Sullivan suggests that websites add additional information to frameset pages for search engines to index, as well as hyperlinks to the rest of the site. The "noframes" HTML tag can be used for this purpose, as illustrated below.
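A minimal sketch of such a frameset page (the file names are hypothetical) shows how the noframes tag can carry indexable text and links for crawlers and browsers that do not process frames:

    <html>
    <head><title>Cross-country mountain bikes</title></head>
    <frameset cols="20%,80%">
      <frame src="menu.html">
      <frame src="content.html">
      <!-- text and links below are used by browsers and crawlers that cannot handle frames -->
      <noframes>
        <body>
          <p>Reviews and specifications of cross-country mountain bikes.</p>
          <p><a href="content.html">Enter the site without frames</a></p>
        </body>
      </noframes>
    </frameset>
    </html>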

Secondly, if a search engine crawler can interpret the frameset, there is the possibility that an individual frame will be viewed outside the proper frame context. One author claims that only the frameset page can be reliably pointed to:

“The individual frames can be pointed to, but unless the server automatically redirects the request, the frame will be displayed as a frame on its own, orphaned from the frameset. Some such pages are designed to function on their own, but many are not, commonly not including any navigation aids. As a result frames pages are ignored or partially ignored by many search engines or inappropriately pointed to by those that do index them.” Thelwall (2000).

This author recommends that websites maintain a duplicate, equally high-quality non-frames version of the site. There is, however, still the risk of search engines directing users to an incomplete frames page instead of the equivalent non-frames page. Sullivan explains how special scripts can be used on individual frames to reinstate the frameset context whenever a frame is orphaned from the frameset; a sketch of such a script follows.
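One common form of such a script (a generic sketch, not the specific code Sullivan describes; the frameset file name is hypothetical) checks whether the frame is being viewed on its own and, if so, reloads the full frameset:

    <script type="text/javascript">
      // If this frame has been loaded as the top-level window (i.e. orphaned
      // from its frameset), reload the frameset page to restore navigation.
      if (window == window.top) {
        window.top.location.replace("frameset.html");
      }
    </script>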

Multimedia

Multimedia content on webpages includes clipart, digital photographs, sound, video and other media. Almost 70% of the WWW is non-textual (Green, 2000). This is not surprising, since humans process information in visual format more readily than in textual format.

Excessive use of multimedia has been widely discouraged, partially because multimedia files in general take longer to download than HTML text. One author proposed a set of guidelines for successful webpage design and recommended that no more than three images be used per webpage. Broadband capacity has, however, expanded over the years and websites can now publish more multimedia content without radically compromising the speed of webpage delivery (D’Angelo et al, 1998).


One problem multimedia content presents to search engine crawlers is that most of them cannot interpret the information contained in multimedia files. Webpages that are heavy with multimedia content are especially compromised since search engines will find little or no text on the page to index. However, a number of products are available that can efficiently recognize patterns inside image files to “see” what the actual image shows (Sullivan, 2003). Further research is required to determine to what extent search engines are currently effectively utilizing these technologies.

All the major search engines provide multimedia search facilities, ranging from searching for photos and images to retrieving video and audio files. Green (2000) argues that the importance of these search engines will continue to grow. Multimedia search engines will typically consider the keywords surrounding multimedia page elements as well as the keywords on the rest of the page to determine whether a multimedia file is relevant for a particular search term. The information contained in the HTML tags that are used to deliver the multimedia content, such as the “image” tag, is also considered (Sullivan, 2003).

Dynamic Content

Dynamic webpages are generated on request by a computer script. The script typically requests data from a database and presents the information on a blank webpage template. The content of these pages is customized to the requirements of the user. Dynamic webpages, often found in e-commerce websites, are only temporarily present on the web and form part of what Green (2000) refers to as the "invisible web".

Sullivan explains that the scripts on dynamic webpages can interfere with the operation of search engine crawlers. As a result, most search engines do not index dynamic webpages. The URL of a dynamic webpage contains special symbols, such as "?" or "&", or references to special scripting files or folders. Many crawlers are programmed to detect and ignore webpages whose URLs contain any of these elements.
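As a hypothetical example, a crawler that ignores URLs containing "?" and "&" would skip the first address below but could index the second, statically mirrored equivalent:

    http://www.example.com/catalogue.asp?category=12&item=345   (dynamic; likely to be ignored)
    http://www.example.com/catalogue/item345.html                (static mirror; indexable)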

Finally, Sullivan recommends the use of static webpages instead of dynamic pages where possible, since static webpages can be indexed by all search engines. One method involves creating a mirror of the dynamic content of a website in static pages; many software tools are available to assist with this (Sullivan, 2003).

Frequent Updates

URLs become outdated when webpages are moved to a new location, when files are renamed or when they no longer exist on the web server. Evidence suggests that outdated URLs abound on the web: it was found that more than 5% of the webpages returned by search engines no longer exist (Benbow, 1998; Lawrence et al, 1998; Lawrence et al, 1999).

Because of the frequency of webpage updates, search engines are forced to re-index the web regularly. As a result, large search engine indices become increasingly difficult to maintain. Search engines may even reach a point beyond which it is not economical for them to expand their coverage. Lawrence found that the mean age of a webpage in a search engine index is 186 days, while the median age is 57 days. This suggests that new or modified webpages may not be indexed for several months (Lawrence et al 1999). Webmasters may find that search engine indices are not reflecting updates to webpages, or that search engines return references to pages that have been moved or no longer exist.


To ensure that webpages are re-indexed on a regular basis, Sullivan suggests that website owners subscribe to paid inclusion programs, offered by most search engines. With paid inclusion, search engines agree to revisit and re-index webpages regularly.

To alleviate the problem of outdated URLs, Benbow (1998) suggests that a website provide webpages that refer users to the new location of moved content. The error messages that a web server displays when outdated URLs are encountered can also be modified to provide additional information, such as links that may help visitors find what they are looking for.
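A minimal sketch of such a referring page (the file names and addresses are hypothetical) uses a meta refresh to send visitors from the old address to the new one while still showing an explanatory link:

    <!-- saved at the old URL, e.g. /old-products.html -->
    <html>
    <head>
      <title>This page has moved</title>
      <!-- redirect the browser to the new location after 5 seconds -->
      <meta http-equiv="refresh" content="5; url=http://www.example.com/products.html">
    </head>
    <body>
      <p>This page has moved to
         <a href="http://www.example.com/products.html">a new location</a>.</p>
    </body>
    </html>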

Link Popularity

Link popularity refers to the number of hyperlink references (called inlinks) made from other pages to a certain webpage. A common complaint of search engine users has been that the information returned by search engines does not satisfy their information need. As a result, most search engines now incorporate popularity ratings for pages in their ranking algorithms (Thelwall, 2000). One of the first to do so was Google, with the PageRank methodology developed in 1998 (Brin et al, 1998). PageRank provides a measure of the importance of a page based upon the number of inlinks and the popularity of the linking pages, which is measured in the same way. It is based on the assumption that a page with many links pointing to it is more likely to contain high-quality information than one with few links pointing to it. Google has since developed the Google Toolbar, a utility that makes it possible to search the Google database from a toolbar in a web browser. The Toolbar displays the PageRank score of the webpage currently being visited by the user.
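In the original formulation (Brin et al, 1998), the PageRank PR(A) of a page A is defined recursively in terms of the pages T1…Tn that link to A, where C(Ti) is the number of outgoing links on page Ti and d is a damping factor (typically set to 0.85):

    PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + ... + PR(Tn)/C(Tn) )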

Link popularity algorithms not only consider the number of inlinks, but also the quality and the context of those links (Sullivan, 2003). A link from a webpage that is in turn linked to by many other pages carries a higher ranking than a link from a webpage that is linked to by no other pages. Furthermore, link context means that search engines examine the text surrounding links to determine the relevance of the page being linked to. It should also be noted that the major search engines do not discount internal linkage, that is, linking within the same domain.

Some consider link popularity to be a more accurate assessment of a page's quality, since it is based entirely on the editorial judgment expressed by those who publish links to other websites. It is also difficult for website owners to manipulate link popularity ratings (Sullivan, 2003). Lawrence et al (1999), however, argue that search engines are increasingly presenting a biased view of the information available on the web. With popularity ranking, popular webpages tend to become more popular, whereas new, unlinked webpages may never be shown in search engine listings. This may prevent the widespread visibility of new high-quality information on the web.

Some websites attempt to artificially boost their popularity ratings by exchanging links with other websites (Thelwall, 2000). Link farms, in which users exchange links and create large volumes of artificial links, have seen a rise in popularity (Thurow 2003). Sullivan argues that popularity ratings can be improved with effective link building. Simply increasing the number of inbound links, however, might in itself not improve a website's rating. Since most search engines also consider the importance of inbound links, a few links from high-quality, content-related webpages are likely to produce better results than a high number of low-quality links. Sullivan suggests that website authors use search engines to locate other sites with which links can be exchanged. This can be done by searching on the site's target keywords, reviewing the top results returned, and requesting those websites to link to it. This method of link building is likely to improve the quality of inbound links, since the sites that are linked to contain related content and already rank well for the website's target search terms.

Doorway Pages

These are webpages containing some relevant content, but whose main purpose is to link to the "real" webpage where the actual content of value is stored. Many types of doorway pages with different purposes exist. In general, doorway pages are optimized to satisfy search engine crawlers, but are not pleasing to human readers. They are also referred to as gateway pages, portal pages or entry pages.

One type of doorway page is a webpage that consists entirely of links. These pages often have thousands of links, even multiple links to the same webpages, and are developed in an effort to manipulate the link analysis algorithms of search engines (Henzinger et al, 2002). These pages are most effective when the link analysis uses raw counts of incoming links to determine a page’s importance. Algorithms that consider the quality of links, such as PageRank, are not particularly vulnerable to this technique (Henzinger et al, 2002; Sullivan, 2003).

Another type of doorway page is a webpage that targets a particular search term (Sullivan, 2003). The title tag, META tags and body copy of these pages are adjusted in a way that the website owner hopes will improve the ranking of the page. This type of doorway page is often used in conjunction with cloaking: upon retrieval, the doorway page is presented to the search engine crawler, while a different webpage, intended for the human viewer, is presented to the human visitor.

Svendsen argues that this type of doorway page can be used effectively to achieve good rankings for specific search terms. Normal page text can be replaced with search terms that are more likely to be used by users (for example “car” instead of “auto”). Site designers can also optimize a webpage for search engine retrieval without interfering with the design and usability of the page (Svendsen, 2002). Sullivan warns that website owners must limit the number of doorway pages published on a website. The higher the percentage of doorway pages in relation to other webpages, the more likely search engines will consider the doorway pages as spam.

Cloaking

When a website presents a search engine crawler with content that is entirely different from the content presented to a human visitor, it is referred to as cloaking, or the bait-and-switch technique. As a result, the search engine is deceived as to the content of a webpage and the page may receive a higher relevancy ranking than the human reader would expect (Henzinger et al, 2002).

Many arguments support the use of cloaking. Cloaking can be used to assist search engines by presenting them with a simple, text-only version of a dynamic or frame-based webpage or a page that is otherwise heavy with multimedia content (Henzinger et al, 2002). Cloaking is also used to protect the HTML META data and other search engine optimization features that cause a webpage to rank well from being discovered and copied by competitors. Cloaking remains problematic however, because it is a deceptive tactic and is often used in an attempt to distort search engine rankings (Sullivan, 2003).


One specific form of abuse is to present a search engine with a webpage that contains keywords that have no relevance to the actual content that is presented to the human visitor. Another example is to take a high-ranking webpage of another website and to present it to a search engine as your own page, while redirecting human visitors to a different webpage (Sullivan, 2003). Doorway pages that target particular search terms are often cloaked in an attempt to improve search engine rankings.

To provide results that are relevant, most search engines disregard or permanently exclude webpages that contain any elements aimed at misleading the relevancy algorithms of the search engine. As a result, all the major search engines, with the exception of AltaVista and FAST, discourage the use of webpage cloaking and may penalize websites that utilize this technique (Sullivan, 2003).

Paid Inclusion

This scheme guarantees the inclusion of pages from a website in search engine database listings, in greater depth than would normally occur. Paid inclusion programs are provided in exchange for a fee. All the major search engines, except Google, offer these programs.

Sullivan explains that paid inclusion provides no guarantee that a webpage will rank well for particular keywords. The ranking of the webpage still depends on the relevancy algorithms of the search engine. Paid inclusion does, however, assure website owners that important webpages will always be indexed – webpages that search engines may not be able to retrieve under normal conditions. Listings are also refreshed regularly – often the refresh frequency is proportional to the fee paid.

Through a paid inclusion program, a new website may be visited by a crawler within one or two days, whereas with conventional registration methods it may take several weeks for the webpages to be indexed. Some paid inclusion programs also provide useful traffic reporting, such as the number of clicks brought to a website and the search terms associated with those clicks.

Paid Placement

Paid placement programs guarantee top search engine ranking in exchange for a fee. Paid placement is also referred to as "keyword buying", because the ranking is achieved by buying one or more desired search terms. Most major search engines provide paid placement programs.

Listings containing paid content are normally separated from editorial results and highlighted in some way. Most paid placement programs are relatively expensive. Generally, a website is charged per click for any traffic that is generated to the site, hence the term PPC (pay per click) also used to describe this scheme.

Index Submission

Search engines typically use two methods to find webpages to index: by following links from previously registered websites and by allowing users to register the addresses of unknown sites (Lawrence et al, 1999). Search engine registration is easy and usually free (Thelwall, 2000). Software packages and web services are available that will simultaneously submit a website to a number of search engines and automatically generate the required electronic formats (Introna, 2000).


Tunender et al (1998) studied the effect of search engine registration on the retrieval of a website. The results were inconsistent, with some search engines indexing only particular pages and others indexing no pages at all; overall, search engine registration resulted in little harvesting of site content. Turner et al (1998) also suggest that the speed with which search engines index sites following registration could be improved. Sullivan argues that search engines may be deliberately less responsive in an effort to promote paid inclusion programs.

Tunender et al argue that retrieval can be improved by registering several pages within a site, rather than only a single page (typically the home page). Registering pages individually is time consuming and will be required whenever new pages are added to the site, but it ensures that search engines have access to every part of the site.

Nearly a quarter of the websites surveyed by Thelwall (2000) were not registered in any of the five major search engines that were tested at the time. This suggests that website designers misunderstand the importance of search engines, or lack knowledge of how to get sites registered and keep them registered. Thelwall (2000) found that, because of the size of the web, search engines are forced to be selective about the pages that they choose to index.

CONCLUSION

A proposed website design approach

The literature provides several website design guidelines aimed at improving a site's visibility to search engines. The motivation for this research project stemmed from the fact that none of these provided a complete picture of the positive and negative design factors influencing website visibility.

From these guidelines, the model summarized in Figure 2 for the design of a search engine friendly website was established. It is called the WBPN model after the two authors and the way it operates: Weideman-Binedell-Positive-Negative. Website authors should apply it by ensuring that as many positive elements and as few negative elements as possible are included in their design. At this stage the relative positions of points on each scale do not carry any weight – a factor further from the midpoint is not more or less important than any other. Future research could investigate the relative weights of these factors.


Figure 2: A proposed website design approach

There is consensus in the literature that search engines primarily consider the text that appears on a webpage to find matches for a search query. For this reason, the body of a webpage must consist predominantly of descriptive, keyword-rich text. Each webpage must also have a concise and meaningful page title, both in the body and in the HTML title tag. The literature suggests that website owners should choose keywords that accurately describe the content of the webpage, and use those keywords close to the top of the page (in the page title, page heading and opening paragraphs). Graphics and other multimedia elements are attractive to users but provide little content for search engines to index; excessive use of these elements must be avoided.

Widespread abuse has led to a decline in the importance of the keywords META tag. At present, this tag enjoys so little support that its inclusion in webpages is no longer recommended. Use of the "description" tag is, however, still encouraged. Frames-based webpages pose problems for most search engines and must be avoided. Websites that consist primarily of dynamically generated pages, and sites that are frequently updated, can benefit from the paid inclusion programs offered by most search engines. Paid inclusion promises more comprehensive and in-depth retrieval of problematic webpages, and listings are refreshed more regularly. Several other techniques to improve the retrieval of dynamic webpages are discussed in the literature review.

It is difficult for a website author to influence a page's link popularity rating, since search engines calculate this rating from data that is often not controlled by the author. With organized link exchanges a website might succeed in improving its rating. Webmasters are, however, encouraged rather to focus their efforts on writing relevant, keyword-rich text for webpages.

Excessive use of keywords, cloaking, doorway pages and any other techniques aimed at misleading search engines must be avoided. Search engines are constantly improving technologies to detect and remove spam, and most warn that offenders will be permanently removed from search engine databases and/or blacklisted.

Website authors are further encouraged to manually submit the URLs of the home page, the site map and a selection of key webpages to the major search engines for registration.

Final word

On the World-Wide Web, few things are more important than attracting visitors. But getting users to find your website can be especially challenging, considering the vast amounts of information already available on the Internet. Search engines promise to generate large volumes of traffic to a website, provided that the site ranks well for particular keywords. In this regard, the design of a website plays a crucial role: optimizing a website for search engine retrieval can make the difference between regularly achieving top search engine rankings and being virtually lost in cyberspace.

It is expected that by following the WBPN model, website owners can improve the overall ranking of their webpages. It should be emphasized, however, that this is only a theoretical framework. Empirical research is required to test and evaluate this framework under real circumstances. Future research by these authors will concentrate on this facet.

REFERENCES

Basu, G., Using Internet for reference: myths vs. realities. Computers in Libraries, 15 no. 2 (1995): 38-39.

Benbow, S. M. P., File Not Found: The Problems of Changing URLs for the World-Wide Web. Internet Research: Electronic Networking Applications and Policy 8, no. 3 (1998): 247-250.

Berners-Lee, T., Cailliau, R., Luotonen, A., Nielsen, H. F., Secret, A., The World-Wide Web. Communications of the ACM 37, no. 8 (1994): 76-82.

Brin, S., Page, L., The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 30 (1998): 107-117.

Burton, P.F., Information professionals and the world wide web. Online & CD-ROM Review. 23 no. 2 (1999): 103-104.

Choo, W.C. and Marton, C., Information seeking on the Web by women in IT professions. Internet Research: Electronic Networking Applications and Policy 13, no. 4 (2003): 267-280.

Corcoran, M., Dagar, L. and Stratigos, A., The changing roles of information professionals. Online, March/April (2000): 28-34.


Courtois, M. P., Berry, M.W., Results Ranking in Web Search Engines. Online, (May/June 1999): 39-46.

D’Angelo, J., Little, S. K., Successful Web Pages: What Are They and Do They Exist? Information Technology and Libraries 17, no. 2 (1998): 71-81.

Dantzig, P.M., Architecture and design of high volume websites. Proceedings of The 7th World Multi-conference on Systemics, Cybernetics and Informatics, (July 2003), Orlando, USA.

Devlin, B., and Burke, M., Internet: the ultimate reference tool? Internet Research, 7 no. 2 (1997): 101-108.

Furner-Hines, J., and Willett, P., The use of the world-wide web in UK academic libraries. Aslib Proceedings, 47 no. 1 (1995): 23-32.

Green, D., The Evolution of Web Searching. Online Information Review 24, no. 2 (2000): 124-137.

Henzinger, M. R., Motwani, R. Silverstein, C., Challenges in Web Search Engines. SIGIR Forum 36, no. 2 (2002).

Introna, L., Nissenbaum, H., Defining the Web: The Politics of Search Engines. Computer 33, no. 1 (2000).

Lawrence, S., Giles, C. L., Searching the World-Wide Web. Science, 280 (1998).

Lawrence, S., Giles, C. L., Accessibility of Information on the Web. Nature, 400 (1999).

Marcella, R., Baxter, G., The information needs and the information seeking behaviour of a national sample of the population in the United Kingdom, with special reference to needs related to citizenship. Journal of Documentation, 55 no. 2 (1999): 159-183.

Notess, G. R., Rising Relevance in Search Engines. Online, (May/June 1999): 84-86.

O’Leary, M., New roles come of age. Online, 24 no. 2 (2000): 21-25.

Pack, T. and Pemberton, J., The cutting-edge library at the Los Alamos national laboratory. Online, March/April (1999a): 34-42.

Pack, T. and Pemberton, J. The cutting-edge library at Shell research. Online. July/August (1999b): 28-33.

Pemberton, J. and Pack, T., The cutting-edge library at Hewlett-Packard. Online, September/October (1999): 30-36.

Poulter, A., The Design of World-Wide Web Search Engines: a Critical Review. Program 31, no. 2 (1997): 131-145.


Qin, J., Wesley, K., Web Indexing with META Fields: A Survey of Web Objects in Polymer Chemistry. Information Technology and Libraries 17, no. 3 (1998).

Radcliff, C., Du Mont, M. and Gatten, J., Internet and reference services: implications for academic libraries. Library Review, 42 no. 1 (1993): 15-19.

Sherman, C. The future of web search. Online, May/June (1999): 54-61.

Sullivan, D. Search Engines. http://www.searchenginewatch.com/links/article.php/2156221 [01 November 2003].

Svendsen, M. deMib. Doorway Pages. Search Engine Strategies Conference (2002), Munich.

Thelwall, M., Commercial Web Sites: Lost in Cyberspace? Internet Research: Electronic Networking Applications and Policy 10, no. 2 (2000): 150-159.

Thelwall, M., Commercial Web Sites Links. Internet Research: Electronic Networking Applications and Policy 11, no. 2 (2001): 114-124.

Thurow, S., Search Engine Visibility. New Riders, Indiana (2003).

Tunender, H., Jane, E., How to Succeed in Promoting Your Web Site: The Impact of Search Engine Registration on Retrieval of a World-Wide Web Site. Information Technology and Libraries 17, no. 3 (1998).

Turner, P., Brackbill, L., Rising to the Top: Evaluating the Use of the HTML META Tag to improve retrieval of World-Wide Web Documents through Internet Search Engines. LRTS 42, no. 4 (1998).

Voorbij, H.J., Searching scientific information on the Internet: a Dutch academic user survey. Journal of the American Society for Information Science, 50 no. 7 (1999): 598-615.

Weideman, M., Effective application of metadata in South African HEI websites to enhance visibility to search engines. Proceedings of WWW2002 (Sept 2002), Bellville, South Africa.

White, H.S., At the crossroads: librarians on the information superhighway. Englewood, CO: Libraries Unlimited (1995).
