intranet search engine

Chapter 1

Introduction

Although search over World Wide Web pages has recently received much academic and

commercial attention, surprisingly little research has been done on how to search the web pages

within diverse intranets, large or small. Intranets contain the information associated with the

internal workings of an organization. The intranet creates new challenges for information

retrieval. The amount of information on the intranet is growing rapidly, as is the number of

new users inexperienced in the art of intranet research. Earlier work that compared intranets and

the Internet from the viewpoint of keyword search has pointed to several reasons why the search

problem is quite different in these two domains. In this project, we address the problem of

providing quality answers to navigational queries over the intranet. As intranets grow, providing

access to more and more documents, their value grows. The larger the collection, the harder and

harder it becomes to find that important presentation, contract, or HR form. Enterprise

Information Portals provide a starting point to intranets, and a search engine helps locate

information, including archives and unstructured data. Search engines need to be tuned and

indexed to provide the best answers.

Our approach is based on crawler identification of navigational pages, intelligent generation

of term variants to associate with each page, and the construction of separate indices exclusively

devoted to answering navigational queries.

This chapter outlines the aims of the project and the motivation behind its implementation.

1.1 What is an Intranet?

An intranet is a private computer network that uses Internet protocols and network

connectivity to securely share any part of an organization's information or operational systems

with its employees.

1.1.1 Features of Intranet

Sometimes the term refers only to the organization's internal website, but often it is a more

extensive part of the organization's computer infrastructure, and private websites are an

important component and focal point of internal communication and collaboration.

An intranet is built from the same concepts and technologies used for the Internet, such as

clients and servers running on the Internet Protocol Suite (TCP/IP). Any of the well-known

Internet protocols may be found in an intranet, such as HTTP (web services), SMTP (email),

and FTP (file transfer).

Intranets differ from extranets in that the former are generally restricted to employees of the

organization while extranets may also be accessed by customers, suppliers, or other approved

parties.

Intranets are being used to deliver tools and applications, e.g., collaboration (to facilitate

working in groups and teleconferencing) or sophisticated corporate directories, sales and

customer relationship management tools, project management, etc., to advance productivity.

Intranets are also being used as corporate culture-change platforms. For example, large

numbers of employees discussing key issues in an intranet forum application could lead to

new ideas in management, productivity, quality, and other corporate issues.

Just one example of improved usability from taking advantage of managed diversity: an

intranet search engine can take advantage of weighted keywords to increase precision. Weights

are impossible on the open Internet, since every site about widgets will claim to have the highest

possible relevance weight for the keyword "widget." On an intranet, even a light touch of

information management should ensure that authors assign weights reasonably fairly and that

they use, say, a controlled vocabulary correctly to classify their pages.

An intranet is a network of computers that can be accessed only by an authorized set of users

within an organization. Its purpose is typically to share information and computing resources

among employees within an organization. The term “search engine” is often used generically to

describe both crawler-based search engines and human-powered directories. These two types of

search engines gather their listings in radically different ways. Crawler-based search engines,

such as Google, create their listings automatically. Human-powered directories, such as the Open

Directory, depend on humans for their listings. The search looks for matches only in the

descriptions submitted. In this case, changes to any of the web pages have no effect

on the listing. The only exception is that a good site, with good content, might be more likely

to be reviewed. There are two types of intranet search, namely desktop-based and web-based.

Desktop-based search addresses the whole spectrum of electronic information that might be found in an

organization, including video, images, databases, etc.

Figure 1.1: A model of Intranet

1.2 Scope of project

This project can be used by the various clients who want to search for shared documents

scattered all over the intranet.

1.3 Requirement Specifications

1.3.1 Functional Requirements

Query Box:

There should be a query box where an existing user types in the name of the file

that he or she is searching for.

Search Button:

This button initiates the search operation over the intranet with the text typed by the user.

Result Box:

The results obtained for the user's query are displayed along with the names of the

systems on which the files are stored.

1.3.2 Non-Functional Requirements

Security:

Only files stored on systems that we have prior permission to access

are displayed, thus preventing unauthorized access.

Database:

Integrity should be maintained and all the constraints should be satisfied.

Platform Independence:

Written using 100 percent Pure Java Code.

1.3.3 Software Requirements

The following software has been used for the project.

Windows Platform:

Microsoft Windows has been used as the platform for coding.

NetBeans:

The NetBeans IDE has been used for developing the project's Java code.

1.3.4 Hardware Requirements

PC with 2 GB Hard disk and 256 MB RAM

RJ-45 LAN cables and LAN connectors

1.4 The Difference between Intranet and Internet Design

The intranet and the public website on the open Internet are two different information spaces

and should have two different user interface designs. It is tempting to try to save design

resources by reusing a single design, but it is a bad idea to do so because the two types of site

differ along several dimensions:

Users differ. Intranet users are your own employees who know a lot about the company, its

organizational structure, and special terminology and circumstances. The Internet site is

used by customers who will know much less about the company and also care less about

it.

The tasks differ. The intranet is used for everyday work inside the company, including

some quite complex applications; the Internet site is mainly used to find out information

about your products.

The type of information differs. The intranet will have many draft reports, project

progress reports, human resource information, and other detailed information, whereas

the Internet site will have marketing information and customer support information.

The amount of information differs. Typically, an intranet has between ten and a

hundred times as many pages as the same company's public website. The difference is

due to the extensive amount of work-in-progress that is documented on the intranet and

the fact that many projects and departments never publish anything publicly even though

they have many internal documents.

Bandwidth and cross-platform needs differ. Intranets often run between a hundred and

a thousand times faster than most Internet users' Web access which is stuck at low-band

or mid-band, so it is feasible to use rich graphics and even multimedia and other

advanced content on intranet pages. Also, it is sometimes possible to control what

computers and software versions are supported on an intranet, meaning that designs need

to be less cross-platform compatible (again allowing for more advanced page content).

Most basically, the intranet and the website are two different information spaces. They

should look different in order to let employees know when they are on the internal net and when

they have ventured out to the public site. Different looks will emphasize the sense of place and

thus facilitate navigation. Also, making the two information spaces feel different will facilitate

an understanding of when an employee is seeing information that can be freely shared with the

outside and when the information is internal and confidential.

http://www.useit.com/alertbox/9703a.html

An intranet design should be much more task-oriented and less promotional than an Internet

design. An organization should only have a single intranet design, so users only have to learn it

once. Therefore it is acceptable to use a much larger number of options and features on an

intranet since users will not feel intimidated and overwhelmed as they would on the open

Internet where people move rapidly between sites. An intranet will need a much stronger

navigational system than an Internet site because it has to encompass a larger amount of

information. In particular, the intranet will need a navigation system to facilitate movement

between servers, whereas a public website only needs to support within-site navigation.

Chapter 2

Problem definition

Today's age is better known as the 'Information Age'. The world runs on information.

According to the Data Warehousing Institute, the data available today doubles every 6

months. A great deal of information resides on the private LANs or intranets of organizations, so

considerable manpower is needed to extract the right information from data scattered across the intranet. An

obvious reason for poor enterprise search is that a high performing text retrieval algorithm

developed in the laboratory cannot be applied without extensive engineering to the enterprise

search problem because of the complexity of typical enterprise information spaces.

As organizations produce more and more information, there is a need to sort the data and

information systematically and make it available to intranet users on request, so

that they can decide what is necessary and take the appropriate action. Our project lends a

helping hand in this regard by putting the information present on the intranet at the user's fingertips.

Our aim is to provide an effective, efficient, and systematic search engine that works for a local area

network. In other words, our Intranet Search Engine is effective in terms of search, efficient in terms

of time, and systematic in its presentation of results.

2.1 The Need

1. The need to respect fine-grained individual access-control rights, typically at the document

level; thus two users issuing the same search/navigation request may see differing sets of

documents due to the differences in their privileges.

2. The need to index and search a large variety of file types (formats), such as PDF, Microsoft

Word and PowerPoint files, etc.

3. The need to seamlessly and scalably combine structured (e.g. relational) as well as

unstructured information in a document for search, as well as for organizational purposes

(clustering, classification, etc.) and for personalization.

An effective search tool on an intranet can make an enormous difference to its usability. A

good search engine ensures that users find what they're looking for, first time, regardless of the

format or location of the information. This means that a wide variety of information can be

effectively dispersed and made available to staff, without the need for complex navigation

systems or filing conventions.

Our project aims to help the user search and access text information. The search will be a

content-based search. As stated earlier, a large amount of information is available for the user to

access on the intranet. However, only the specific information required needs to be searched, sorted, and

presented to the user in a systematic manner, thus increasing the availability of useful

information. Access will be given only to data that is shared,

thus preventing unauthorized access.

2.2 Objectives of project

The objective is to implement a centrally managed intranet search engine that helps a client search for

files over the intranet. The client can execute search operations as needed. The files, if present,

are displayed on the same search form. This project thereby provides an

enhanced capability for searching over the intranet.

Chapter 3

Mechanization of Search Engine

An intranet search engine works in much the same way as Web-wide search engines. The

search engine locates the documents, extracts the text, and stores it in an index file, making an

entry for each word. When an end-user or employee types a word into a form and clicks the

Search button, the browser sends it to the server. The search engine receives the search query,

looks for matching words in the index file, gathers related document information, sorts the

documents by relevance, formats the results appropriately, and sends the page back to

the user. Several indexing aspects require attention from the intranet site manager. Indexing

integrates content from many sources: pages on internal sites, content management systems etc.
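
As a rough illustration of the index-and-lookup cycle described above, the following minimal Java sketch builds an in-memory index with one entry per word and answers single-word queries. The class name TinyIndex, its methods, and the sample file names are hypothetical and chosen only for this example; a real engine would also extract text from files, rank the results, and persist the index.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Minimal in-memory inverted index: word -> names of documents containing it. */
class TinyIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Splits the text into lowercase words and records the document under each word. */
    void add(String docName, String text) {
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(docName);
            }
        }
    }

    /** Returns the documents containing the word, sorted by name (empty if none). */
    Set<String> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(), Collections.emptySet());
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("leave-form.txt", "HR leave application form");
        index.add("budget.txt", "Annual budget presentation");
        System.out.println(index.lookup("form")); // prints [leave-form.txt]
    }
}
```

Keeping a sorted set per word makes the result order deterministic, which is convenient for display but is only one of several reasonable choices.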

3.1 Processes of Search Operation

There are various processes and entities involved in finding the results for the query the

user has entered.

3.1.1 Gathering

The index should be kept current. As soon as the new content is published, it should be

indexed. Publishing or content management systems can notify the indexer of new data;

otherwise, index the frequently changing areas more often. If the search engine cannot respond to

queries when updating, use mirrored servers or switch search engines.

3.1.2 Indexing

In addition to HTML, XML, and text, intranet search engines deal with binary file formats

such as PDF, MS Office formats (including Word, Excel, and PowerPoint), WordPerfect, and

others. The index should store the entire content of every file, even very long documents. It

should keep every word and the word position in the document, for later phrase searching and

match highlighting.
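
To show why word positions matter, here is a small sketch of a positional index, assuming whitespace-separated words and exact phrase matching. The names PositionalIndex, add, and phraseSearch are illustrative and not taken from the project; the stored positions are the same information a result page would need for match highlighting.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Positional index sketch: word -> (document -> positions of the word). */
class PositionalIndex {
    private final Map<String, Map<String, List<Integer>>> index = new HashMap<>();

    /** Records the position of every word of the document. */
    void add(String doc, String text) {
        String[] words = text.toLowerCase().trim().split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            index.computeIfAbsent(words[pos], k -> new TreeMap<>())
                 .computeIfAbsent(doc, k -> new ArrayList<>())
                 .add(pos);
        }
    }

    /** Returns documents in which the phrase words occur at consecutive positions. */
    List<String> phraseSearch(String phrase) {
        String[] words = phrase.toLowerCase().trim().split("\\s+");
        List<String> hits = new ArrayList<>();
        Map<String, List<Integer>> first = index.getOrDefault(words[0], Collections.emptyMap());
        for (Map.Entry<String, List<Integer>> e : first.entrySet()) {
            for (int start : e.getValue()) {
                if (matchesFrom(e.getKey(), words, start)) {
                    hits.add(e.getKey());
                    break; // one match per document is enough
                }
            }
        }
        return hits;
    }

    /** Checks that word i of the phrase sits at position start + i in the document. */
    private boolean matchesFrom(String doc, String[] words, int start) {
        for (int i = 1; i < words.length; i++) {
            List<Integer> positions = index.getOrDefault(words[i], Collections.emptyMap()).get(doc);
            if (positions == null || !positions.contains(start + i)) {
                return false;
            }
        }
        return true;
    }
}
```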

Intranets generally include various levels of security and access controls, and the index

should store this information, so it can show only the accessible content in the search results. For

high-security content, it is a good idea to create a separate index file to avoid co-mingling private

and public text.
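
A minimal sketch of this filtering step, assuming the index stores, for each document, the set of users allowed to read it. AccessFilter, visibleTo, and the layout of the access-control map are illustrative names rather than part of the project's code, and documents without an entry are hidden by default.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Filters search hits so that each user sees only accessible documents. */
class AccessFilter {
    /**
     * Keeps the hits whose access-control entry contains the user.
     * A document with no entry is treated as inaccessible (deny by default).
     */
    static List<String> visibleTo(String user, List<String> hits, Map<String, Set<String>> acl) {
        List<String> visible = new ArrayList<>();
        for (String doc : hits) {
            if (acl.getOrDefault(doc, Collections.emptySet()).contains(user)) {
                visible.add(doc);
            }
        }
        return visible;
    }
}
```

With such a filter, two users issuing the same query can receive different result lists, which matches the access-control need stated in Chapter 2.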

Figure 3.1: Components of Search Engine

3.1.3 Crawling

The general algorithm involves backtracking to the root directory and reaching new web

pages via their links. The process continues until the entire website (intranet) is indexed.

In addition, our crawler is able to recognize duplicate pages and discard them.
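
The crawl can be sketched as a breadth-first traversal from the root. The example below is a simplification under stated assumptions: pages are modelled as an in-memory map instead of being fetched over HTTP, and a duplicate is any page whose text exactly equals that of a page already seen. TinyCrawler, Page, and crawl are hypothetical names, not the project's actual classes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Breadth-first crawler over an in-memory site that skips duplicate pages. */
class TinyCrawler {
    /** A page is modelled as its text plus the URLs it links to. */
    static class Page {
        final String text;
        final List<String> links;

        Page(String text, String... links) {
            this.text = text;
            this.links = Arrays.asList(links);
        }
    }

    /** Crawls from the root and returns the URLs that should be indexed. */
    static List<String> crawl(String root, Map<String, Page> site) {
        List<String> toIndex = new ArrayList<>();
        Set<String> visitedUrls = new HashSet<>();
        Set<String> seenContent = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(root);
        visitedUrls.add(root);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            Page page = site.get(url);
            if (page == null) {
                continue; // dead link: nothing to index
            }
            if (seenContent.add(page.text)) {
                toIndex.add(url); // first time this content is seen
            }
            for (String link : page.links) {
                if (visitedUrls.add(link)) {
                    queue.add(link); // each URL is queued at most once
                }
            }
        }
        return toIndex;
    }
}
```

The visited set also prevents endless loops when pages link back to each other.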

3.1.4 Searching agent

The searching agent is a tool that resides on the client side and is triggered by the server with a

keyword, so it searches only the client on which it resides and returns the result to the server.

Each client has its own searching agent.

When a new search arrives at the server, the server first searches its own index (database). If a match

is found, it is returned as the response; if not, the server triggers the searching agents on all clients and

collects their replies. When a user enters a query into a search engine (typically by using key words), the

engine examines its index and provides a listing of best-matching pages/files according to its

criteria, usually with a short summary containing the document's title and sometimes parts of the

text. Most search engines support the use of the Boolean operators AND, OR and NOT to further

specify the search query. Boolean operators are for literal searches that allow the user to refine

and extend the terms of the search. The engine looks for the words or phrases exactly as entered.

Natural language queries allow the user to type a question in the same form one would ask it to a

human.
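
Boolean operators map naturally onto set operations over the lists of documents that match each term. The following sketch assumes the per-term result sets are already available (for example, from an index lookup); BooleanQuery and its method names are illustrative only.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Boolean query operators expressed as set operations on matching documents. */
class BooleanQuery {
    /** a AND b: documents present in both sets. */
    static Set<String> and(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    /** a OR b: documents present in either set. */
    static Set<String> or(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    /** a NOT b: documents in a that are not in b. */
    static Set<String> andNot(Set<String> a, Set<String> b) {
        Set<String> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<String> leave = new TreeSet<>(Arrays.asList("hr-form.doc", "policy.pdf"));
        Set<String> form = new TreeSet<>(Arrays.asList("hr-form.doc", "tax-form.doc"));
        System.out.println(and(leave, form));    // leave AND form
        System.out.println(or(leave, form));     // leave OR form
        System.out.println(andNot(form, leave)); // form NOT leave
    }
}
```

Each method copies its first argument, so the input sets are left unchanged and can be reused across queries.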

Figure 3.2: Basic Information Retrieval Process

3.2 Analysis of Search Engine

The search engine needs to be analyzed so that the software can be optimized as far as

possible. For this purpose, we need to understand its pros and cons.

3.2.1 Pros:

Search engines provide access to a fairly large portion of the publicly available pages on

the Internet and intranet, which are themselves growing exponentially.

Search engines are the best means devised yet for searching the internet and intranet.

Stranded in the middle of this global electronic library of information without either a card

catalog or any recognizable structure, how else are you going to find what you're looking for?

3.2.2 Cons:

On the down side, the sheer number of words indexed by search engines increases the

likelihood that they will return hundreds of thousands of responses to simple search requests.

Remember, they will return lengthy documents in which your keyword appears only once.

Additionally, many of these responses will be irrelevant to your search.

Chapter 4

Evaluation of Intranet Search Engine

Any intranet search engine should be developed according to the requirements of the environment

in which it will be used. According to our studies, however, some generic functions are needed for the

deployment of almost any intranet search engine.

4.1 Important Features for Intranet Search

Search functionality is divided into several parts: the search form and query options, the

search engine retrieval and relevance ranking, and results display.

1. Search Functionality

When the user clicks the Search button, the search form sends a query to the search engine

server. It looks for the words in the index file. Some search engines use stemming to locate

singular and plural forms of words. Once it locates the matches, the search engine gets

information about the associated documents, such as URL and titles. It sorts the documents by

relevance, as defined by an internal set of rules, by frequency of matched terms in the

documents, phrases, and location in the document.
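
The combination of stemming and frequency-based ranking can be sketched as follows. This is a deliberately naive example: the stemmer only strips a trailing "s", and relevance is just the number of times the stemmed term occurs, whereas real engines also weigh phrases and the location of matches. SimpleRanker and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Ranks documents by how often a (naively stemmed) term occurs in them. */
class SimpleRanker {
    /** Very naive stemmer: strips a trailing "s" so singular and plural forms match. */
    static String stem(String word) {
        String w = word.toLowerCase();
        return (w.length() > 3 && w.endsWith("s")) ? w.substring(0, w.length() - 1) : w;
    }

    /** Counts the occurrences of the stemmed term in the text. */
    static int termFrequency(String term, String text) {
        String target = stem(term);
        int count = 0;
        for (String w : text.split("[^A-Za-z0-9]+")) {
            if (!w.isEmpty() && stem(w).equals(target)) {
                count++;
            }
        }
        return count;
    }

    /** Returns the names of matching documents, most frequent matches first. */
    static List<String> rank(String term, Map<String, String> docs) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            if (termFrequency(term, e.getValue()) > 0) {
                names.add(e.getKey());
            }
        }
        // Higher frequency first; ties are broken alphabetically by document name.
        names.sort(Comparator
                .comparingInt((String n) -> -termFrequency(term, docs.get(n)))
                .thenComparing(Comparator.naturalOrder()));
        return names;
    }
}
```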

2. Search Results Pages

Search results are not a place to surprise users with experimental interfaces. It is best to

conform to the basic conventions of Web search results, with a listing of documents showing

titles and descriptions. The Internet can be used to identify useful features.

3. Search Problems and No-Matches Pages

Searches fail for various reasons:

The user forgets to type anything in the search field.

The user is searching for text that is not in the scope of the index.

The user is using a term that is not used in the index.

The user has made a spelling or typing mistake.

The user is doing a search in which all the query requirements are not met (for example,

one word was matched but the other was not).

To avoid common search failures, create a page that explains these errors and helps users

understand what is within the scope of the search engine. If a taxonomy or hierarchy exists,

display it on the page to allow users to drill down through the category.

4. Search Log Analysis

Search logs are a great window into the minds of intranet users. If the search log tracks the

query and the number of matches, this is good. This makes it possible to count the 25 or 100

most popular search terms and to make sure these topics are adequately covered. It is also

possible to track the most common terms that do not find matches and to address these problems.
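
A small sketch of such an analysis, assuming each log entry records the query text and the number of matches it returned. The class SearchLogStats, its Entry type, and the sample queries are invented for illustration; the log format of a real engine will differ.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Summarises a search log: most popular queries and most common failed queries. */
class SearchLogStats {
    /** One log entry: the normalised query and the number of matches it returned. */
    static class Entry {
        final String query;
        final int matches;

        Entry(String query, int matches) {
            this.query = query.toLowerCase().trim();
            this.matches = matches;
        }
    }

    /** Counts queries; when onlyFailed is true, only zero-match queries are counted. */
    static Map<String, Integer> counts(List<Entry> log, boolean onlyFailed) {
        Map<String, Integer> counts = new HashMap<>();
        for (Entry e : log) {
            if (!onlyFailed || e.matches == 0) {
                counts.merge(e.query, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the n most frequent queries; ties are broken alphabetically. */
    static List<String> top(Map<String, Integer> counts, int n) {
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort(Comparator
                .comparingInt((String t) -> -counts.get(t))
                .thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }
}
```

Calling top(counts(log, false), 25) would give the 25 most popular terms, and top(counts(log, true), 25) the most common terms that found nothing.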

5. The Indexer

Full-text indexing literally creates a virtual copy of the entire website. This option remains

feasible because it covers only the intranet. With a full-text index, content can be examined more

closely, which should yield more precise results. The first step is to initiate the creation of an

index; this index will contain location information for each and every word in all of your

documents. The index is created outside the files and does not affect them in any way.

Indexed documents are typically specified according to directory and extension. There can either

be one index for all of the files, or several separate indexes, each for a different project. The

indexes are updated automatically when new documents are created or existing documents are

changed. However, any change to the table's structure, such as configuration data, requires a

complete rebuild of the full-text index. Once there is an index, it can be used to locate, view,

and retrieve information. Using the indexes created, the search query can be used to locate the

required information in your documents. Results are displayed almost instantly despite the index's

relatively large size, which demonstrates the speed advantage of implementing indexes.

4.2 Multi-level Approach

Here we develop a multi-level approach that comprises four levels.

4.2.1 Data gathering

Most organizations have legacy data in formats other than HTML, e.g. Adobe's PDF,

MS Office, FrameMaker, Lotus Notes, PostScript, and plain ASCII text. The spider should at least

be able to correctly interpret and index the most frequently used or the most important of these

formats. If meta-information and XML tags are likely to show up within the documents, the

spider must be able to interpret such tags, and it would also be useful if RDF-formatted

information could be gathered intelligently. If USENET newsgroups need to be indexed, the

spider must be able to crawl through them. That also goes for client-side image maps,

CGI scripts, ASP-generated pages, pages using frames, and Lotus Domino servers. Although

frames are frequently used within many companies, spiders, which generally work their way

round the net by picking up and following hypertext links, may not be able to correctly interpret the

different syntax used for framed pages. These links could end up ignored. Spidering Domino

servers using HTTP requests requires the search engine to be able to intelligently filter

out the many collapsed/expanded versions of the same page, or the index will quickly be filled

with duplicates. Another, and arguably better, way would be to access Domino servers via the

provided APIs.

Another situation that is likely to require access via APIs rather than having to crawl through

HTTP is when Content Management (CM) systems are used. In CM tools, the actual content of a

page is stored separately from the page layout information. Since pages are rendered dynamically

only when requested by a user (via her browser), the spider may not be able to pick up the link

information that is embedded in the page code. Without those links, the spider will not be able to

find the information. Even if the information is found and indexed correctly it might be difficult

for the search engine to understand how to display a search result since the information that has

been indexed may belong to several dynamic pages. This is an area not yet fully explored by

search engine vendors and proposed solutions should be investigated carefully.

Intelligent robots are able to detect copies or replicas of already indexed data while crawling

and advanced search engines can index “active” sites, e.g. sites that update frequently, more

often than sites that are more “passive”. If this is not supported, some manual means of

determining time-to-live should be provided. There should be some means of restricting the

robot from entering certain areas of the net, including any desired domain, sub-net, server,

directory, or file level. Also, check if search depth can be set to avoid loops when indexing

dynamically generated pages. Support for proxy servers and password handling can be useful, as

can the ability to not only follow links but also detect directories and thus find files not linked to

from other pages. The spider should be easy to set up and start. Check how the URLs from which

to start are specified as well as if the users may add URLs.

Finally, the Robot Exclusion Protocol provides a way for the webmaster to tell the robot not

to index a certain part of a server. This should be used to avoid indexing temporary files, caches,

test or backup copies, as well as classified information such as password files.
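
How a spider might honour such exclusions can be sketched as below. This is a simplified reading of a robots.txt file, not a complete implementation of the protocol: it only considers records for all robots ("User-agent: *"), treats each Disallow value as a plain path prefix, and ignores other directives. RobotRules and isAllowed are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified robots.txt check: prefix-based Disallow rules for "User-agent: *". */
class RobotRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Collects the Disallow prefixes from records that apply to all robots. */
    RobotRules(String robotsTxt) {
        boolean applies = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (line.startsWith("#") || colon < 0) {
                continue; // comment, blank line, or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                applies = value.equals("*");
            } else if (field.equals("disallow") && applies && !value.isEmpty()) {
                disallowed.add(value);
            }
        }
    }

    /** Returns false if the path starts with any disallowed prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

Note that the protocol is advisory, so it keeps well-behaved robots away from temporary or backup areas but is not an access-control mechanism by itself.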

4.2.2 Index

Although a good index alone does not make a good search engine, the index is an essential

part of a search tool. One of the most important issues is keeping the index up-to-date, and the

best way to do that is to allow real-time updates. There is a big difference between indexing the

full text or just a portion. Though partial indexing saves disk space it may prevent people from

finding what they are looking for. The portion of text being indexed also affects the data that is

presented as the search result. Some tools only show the first few lines while others may

generate an automatic abstract or use meta-information.

If the organization consists of several sub-domains, users might only want to search their

specific sub-domain. Allowing the index to be divided into multiple collections might then speed

up the search. It may also prove useful to be able to split the index into several collections even

though they are kept at one physical location. For example, one may want separate collections

for separate topics or business areas.

Some tools support linguistic features such as automatic truncation or stemming of the search

terms, where the latter is a more sophisticated form that usually performs better. If the

organization is located in non-English speaking countries the ability to correctly handle national

characters becomes important. Also, note that some products cannot handle numbers. If number

searching is required, e.g. serial numbers, this limitation should be taken into consideration.

Should words that occur too frequently be removed from the index? Some engines have

automatically generated stop-lists, while others require the administrator to remove such words

manually.
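
One plausible way to generate a stop-list automatically is to flag words that occur in more than a chosen fraction of all documents. The sketch below assumes this document-frequency rule; the threshold parameter maxFraction and the names StopListBuilder and build are assumptions made for the example, not a description of any particular product.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Builds a stop-list from words that appear in too many documents. */
class StopListBuilder {
    /** Returns the words that occur in more than maxFraction of the documents. */
    static Set<String> build(List<String> docs, double maxFraction) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Count each word once per document, however often it occurs there.
            Set<String> unique = new HashSet<>(Arrays.asList(doc.toLowerCase().split("[^a-z0-9]+")));
            unique.remove("");
            for (String word : unique) {
                docFreq.merge(word, 1, Integer::sum);
            }
        }
        Set<String> stopList = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() > maxFraction * docs.size()) {
                stopList.add(e.getKey());
            }
        }
        return stopList;
    }
}
```

An administrator would still want to review the generated list, since a frequent word such as a product name may be exactly what users search for.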

Search engines are of little use if an overview of the indexed data is wanted, unless they are

able to categorize the data and present that data as a table of contents. Automatic categorization

may also be used to focus in on the right sub-topic after having received too many documents. If

information about when a particular URL is due for indexing is available, it is useful to make it

accessible to the user.

4.2.3 Search features

The user query and the search result interfaces are often sadly confusing and unpredictable, and

it has been argued that the text-search community would greatly benefit from a more consistent terminology.

Since we do not yet have this concordance, evaluation of the search features must be done with

great care. Different vendors use different names for the same feature, or the same name for

different features.

Though Boolean-type search language is often offered, most users do not feel comfortable

with Boolean expressions. Instead, studies have shown that the average user only enters 1.5

keywords. Due to the vocabulary problem, the user is likely to receive many irrelevant

documents as a result of a one-keyword search. Natural language queries have been shown to

yield more search terms and better search results, even when performed by skilled IR personnel.

Apart from Boolean operators, a number of more or less sophisticated options (e.g. full text

search, fuzzy search, require/exclude, case sensitivity, field search, stemming, phrase

recognition, thesaurus, or query-by-example) are usually offered. One feature to look for in

particular is proximity search, which lets the user search for words that appear relatively close

together in a document. Proximity search capability has been noted to have a positive influence

on precision.
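
A proximity test can be sketched by comparing word positions. In this example "close together" is taken to mean that the positions of the two words differ by at most maxDistance; the names Proximity, positions, and near are hypothetical, and a real engine would read the positions from its index instead of re-scanning the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Checks whether two words occur close together in a text. */
class Proximity {
    /** Returns the positions at which the target word occurs. */
    static List<Integer> positions(String[] words, String target) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (words[i].equals(target)) {
                result.add(i);
            }
        }
        return result;
    }

    /** True if some occurrence of a lies within maxDistance words of some occurrence of b. */
    static boolean near(String text, String a, String b, int maxDistance) {
        String[] words = text.toLowerCase().split("[^a-z0-9]+");
        for (int i : positions(words, a.toLowerCase())) {
            for (int j : positions(words, b.toLowerCase())) {
                if (Math.abs(i - j) <= maxDistance) {
                    return true;
                }
            }
        }
        return false;
    }
}
```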

Many organizations prefer to have a common “company look” on all their intranet pages.

This requires customization that may include anything from changing a logo to replacing entire

pages or chunks of code. Again, this is an aspect irrelevant to public search services but

something an intranet search engine might benefit from. Sometimes a built-in option allows the

user to choose a simple or an advanced interface. It should also be possible to customize the

result page.

The user could be given the opportunity to select the level of output, e.g., by specifying

compact or summary. Further, search terms may be highlighted in the retrieved text, the

individual word count can be shown, or the last modification date of the documents may be

displayed. It can also be possible to restrict the search to a specific domain or server, or to search

previously retrieved documents only. For the latter, relevance feedback is a very important way

to improve results and increase user satisfaction.

Ranking is usually done according to relevancy of some form. However, the true meaning of

the ranking is normally hidden to the user, and only presented as a number or percentage on the

result page. More sophisticated ways to communicate this important information to the user have

been developed, but not many of the commercially available products have yet incorporated such

features. However, the possibility to switch between relevancy and date is often supported.

Dividing the results into specific categories might help the user to interpret the returned result.

Finally, ensure the product comes with good and extensive online user documentation.

4.2.4 Operation and maintenance

Hosting a search service requires considerations not necessary when using a public search

engine. For example, operations and maintenance issues are of no importance to public search

engine evaluations, but for an internal search service, they are of course highly interesting.

Start by checking if the product is available on many platforms or if it requires the

organization to invest in new and unfamiliar hardware. If the intranet consists of one server only,

a spider is not needed, but as the web grows, crawling capabilities become essential. A spider

allows the net to grow without forcing the webmasters to install indexing software locally. An

intranet search engine often runs on a single machine and is operated and maintained by people

with knowledge about servers, but not necessarily experts in spider technology. This suggests

that a good intranet spider should be designed specifically for an intranet and not just be a ported

version of an Internet spider. Still, the spider and the index must be able to handle large amounts

of data without letting response times degrade or the users will be upset. For example, a product

that can take advantage of multi-processor hardware scales better as the intranet grows. The

product should therefore have been tested to handle an intranet of the intended size. Running the

spider should not interfere with how the index is operated. Both these components need to be

active simultaneously.

It was found that great differences existed in how straightforward the products were to install,

set up, and operate. Some required an external HTTP server while others had a built-in web server.

The latter were consistently less complicated to install. However, installation is probably

something done once while indexing and searching is done daily. This ratio suggests that

indexing and searching features should be weighed higher than installation routines.

It is difficult to estimate data collection time since it depends on the network, but during the

test installation, this activity should be clocked. Also, try to determine how query response times

grow with the size of the index. If a separate index is wanted in every city, state, or country

where the organization is represented, ensure the product supports this kind of distributed

operation, and check whether any bandwidth-saving technique is used.

Having technical support locally is an advantage if the local support also has local

competence. If questions have to be sent to a lab elsewhere, the advantage is lost. An important

feature is the ability to automatically detect links to pages that have been moved or removed. If

dead links cannot be detected automatically, the links should at least be easy to remove,


preferably by the end-user. Allowing end-users to add links is a feature that will off-load the

administrator. Functions like email notification to an operator, should any of the main processes

die, and good logging and monitoring capabilities, are features to look for. We found that

products with a graphical administrator interface were easier and more intuitive to handle, though

the ability to operate the engine from the command line may sometimes be desired. It should also

be possible to administer the product remotely via any standard browser. Documentation should

be comprehensive and adequate.
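
The automatic detection of moved or removed pages mentioned above can be sketched as follows. This is a minimal illustration, not the behaviour of any particular product: `find_dead_links` and the `fetch_status` callable are hypothetical names, and the status lookup is passed in so that the sketch runs without a network.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def find_dead_links(
    links_by_page: Dict[str, Iterable[str]],
    fetch_status: Callable[[str], int],
) -> List[Tuple[str, str, int]]:
    """Return (page, link, status) for every link whose target is gone.

    fetch_status is a hypothetical callable that returns an HTTP status
    code for a URL; a real deployment might wrap an HTTP client here.
    """
    dead = []
    cache: Dict[str, int] = {}  # avoid re-checking a URL linked from many pages
    for page, links in links_by_page.items():
        for link in links:
            if link not in cache:
                cache[link] = fetch_status(link)
            status = cache[link]
            if status in (404, 410):  # 404 Not Found, 410 Gone
                dead.append((page, link, status))
    return dead


# Usage with a stand-in for the network.
fake_statuses = {"/hr/forms": 200, "/old/policy": 404}
report = find_dead_links(
    {"/index": ["/hr/forms", "/old/policy"]},
    lambda url: fake_statuses.get(url, 404),
)
```

A report like this could be mailed to the operator or used to drop the dead entries from the index.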

Finally, consider the price - is it a fixed fee or is it correlated to the size of the intranet? In

addition, what kind of support is offered and at what cost? Sometimes installation and training

are included in the price. How long the products have been available and how often they are

updated are important factors that indicate the stability of the product, and it is also important to

ask about future plans and directions.


Chapter 5

Deploying an Effective Intranet Search Engine

A search engine is often the first method used to find a page, and yet, most users suffer

frustration and failure. More still are put off by the complexity of the search engine, and the

confusing manner in which the results are displayed. An effective search tool on an intranet can

make an enormous difference to its usability. In fact, usability expert Jakob Nielsen found that

“Poor search was the greatest single cause of reduced usability across intranets”. A good search

engine ensures that users find what they're looking for, first time, regardless of the format or

location of the information. This means that a wide variety of information can be effectively

dispersed and made available to staff, without the need for complex navigation systems or filing

conventions. Most intranets evolve over time, and adding search functionality need not be a daunting

task. A search tool can be implemented quickly, and then refined as the intranet grows and the

needs of the organization change. It is important to recognize that every intranet is different, with

its own objectives, requirements and environment.

A good search engine must:

- Be easy to use.

- Assist users to find the correct information.

- Display results in a meaningful way.

- Help authors to improve the site.

5.1 Use of Search Engine

Search engines are best at finding unique keywords, phrases, quotes, and information buried

in the full-text of web pages. Because they index word by word, search engines are also useful in

retrieving large numbers of documents. If you want a wide range of responses to specific queries, use a

search engine.

Today, the line between search engines and subject directories is blurring. Search engines no

longer limit themselves to a search mechanism alone. Across the Web, they are partnering with


subject directories, or creating their own directories, and returning results gathered from a variety

of other guides and services as well.

Selecting a Search Engine

Before deciding on the type of search engine, we need to determine our technical

requirements. Once this is complete, we can research the currently available engines and build

an effective search engine that caters to our needs.

5.2 Data Sources & File Types

Once we have our objectives clearly defined, we can work out which file formats and data

sources our search engine will need to support. The next step is to list every file type used in

creating the information that we want to share on our intranet. These usually fall into one of

three categories:

1. Unstructured formats

File formats that contain primarily text-based information. These include text files, word

processor files, PDFs, emails and formats used to create most documents. There is no real

structure to these file formats and few relationships exist between elements within them.

2. Semi-structured formats

File formats that contain a mixture of text-based and database information, with a basic

structure. These include file types such as HTML, spreadsheets, and XML. There may be

relationships between elements within these files; however, they are not as rigidly defined

as they are in structured formats, and there may be sections of textual information where

no structure exists.

3. Structured formats

File formats where the information is contained in a well-defined structure, such as a

relational database. Many enterprise systems have a structured architecture, such as ERP

and CRM systems, as well as many legacy databases. An effective intranet search engine

should be able to support a large number of files that will be in the intranet data

repository.
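
The inventory of file types described above can be sketched as a small lookup table. The extension-to-category mapping below is illustrative only and is an assumption of this sketch (the report does not prescribe one); `categorize` and `inventory` are hypothetical helper names.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable

# Illustrative mapping only; extend it to match the formats in your repository.
CATEGORY_BY_EXTENSION = {
    ".txt": "unstructured",
    ".doc": "unstructured",
    ".pdf": "unstructured",
    ".eml": "unstructured",
    ".html": "semi-structured",
    ".xls": "semi-structured",
    ".xml": "semi-structured",
    ".mdb": "structured",
    ".sqlite": "structured",
}


def categorize(path: str) -> str:
    """Map a file path to one of the three categories, or 'unknown'."""
    extension = PurePosixPath(path).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(extension, "unknown")


def inventory(paths: Iterable[str]) -> Counter:
    """Count how many files fall into each category."""
    return Counter(categorize(p) for p in paths)


counts = inventory(["hr/leave.PDF", "it/faq.html", "sales/crm.sqlite", "README"])
```

Files that land in the "unknown" bucket show which formats still need a decision before a search engine is chosen.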


5.3 Processing of Query

For most intranets, there will be a wide spectrum of users, from very basic all the way

through to highly technical power users. The search function needs to cater for all of these

people, with a simple yet powerful interface that provides options for advanced searching if

required. There should be three steps to the search process, and a range of features work to

streamline each of these steps:

(1) Entering the Query, or asking the initial question,

(2) Getting the Search Results, or receiving the list of found documents back from the search

engine, and

(3) Finding the Right Answer, or examining and refining the search results to find the

information you were looking for.

Step 1: Entering the Query.

When users enter a query, they should have the option of a natural language approach; that

is, simply entering the question as they would ask it, such as “What is the cost of double-deck

refrigerators?” There should also be the option to build queries using Boolean operators, so that

users who know exactly what they want can be extremely specific with their search, for example

“returns~ within 10 words of refrigerator but not freezer”. One of our definite aims is to build a

search engine whose simple user interface is intuitive for basic users while also providing

powerful advanced search functionality for more experienced users. A good search engine

should also enable you to group logical chunks of information together so that searches can be

conducted on specific areas of interest.
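
To make the Boolean example above concrete, here is a minimal sketch of how a proximity condition combined with a NOT condition might be evaluated against one document. The function names are hypothetical, the tokenizer is deliberately naive, and the `~` suffix from the example query is not modeled.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def near(tokens: List[str], first: str, second: str, max_distance: int) -> bool:
    """True if `first` and `second` occur within `max_distance` words of each other."""
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


def matches_example_query(text: str) -> bool:
    """Evaluate: returns WITHIN 10 WORDS OF refrigerator AND NOT freezer."""
    tokens = tokenize(text)
    return near(tokens, "returns", "refrigerator", 10) and "freezer" not in tokens
```

A production engine would answer such queries from positional information stored in its index rather than by scanning each document.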

Step 2: Getting the Search Results.

If there is specifically defined data, such as legal documents, a high degree of precision may

be required to identify and return specific information. In other situations, however, it may be

better to return a wider range of documents for a given query. The accuracy we require depends

on the role of the search engine and the nature of the data. If we want to make a large volume

of data available on our intranet, providing a fast search engine is important. Otherwise, users find

it frustrating to wait for the search engine to bring back the search results. With smaller amounts

of data this will be less of a concern; it all depends on the volume of data that we intend to make

available on the intranet.


Any good search engine should use some form of intelligent relevancy determination. This is

where the search engine, based on the query entered, makes a judgment about which results will

be the most relevant, and ranks them accordingly.
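
One very simple form of relevancy determination counts how often the query words occur in each document and ranks the documents by that count. The sketch below illustrates the idea; the extra weight given to title hits (`TITLE_WEIGHT`) is an assumed value chosen for illustration, and commercial engines use considerably more elaborate formulas.

```python
import re
from typing import Dict, List, Tuple

TITLE_WEIGHT = 3  # assumed value: a hit in the title counts more than one in the body


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query: str, title: str, body: str) -> int:
    """Weighted count of query-term occurrences in the title and body."""
    title_tokens = tokenize(title)
    body_tokens = tokenize(body)
    return sum(
        TITLE_WEIGHT * title_tokens.count(term) + body_tokens.count(term)
        for term in tokenize(query)
    )


def rank(query: str, documents: Dict[str, Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Return (doc_id, score) pairs for matching documents, best first."""
    scored = [
        (doc_id, score(query, title, body))
        for doc_id, (title, body) in documents.items()
    ]
    hits = [(doc_id, s) for doc_id, s in scored if s > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


docs = {
    "leave.html": ("Leave policy", "How to apply for annual leave and sick leave."),
    "travel.html": ("Travel expenses", "Submit claims before you leave on a trip."),
    "canteen.html": ("Canteen menu", "Lunch is served from noon."),
}
results = rank("leave policy", docs)
```

Here the page titled "Leave policy" ranks first because both query words appear in its title, while the canteen page is left out because it contains neither word.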

Step 3: Finding the Right Answer.

The search process doesn’t stop once the user receives the list of results. They then need to

refine and manipulate the results list until they find exactly what they were looking for. There are

many features that can assist in this task, some of which include:

Document summary information

The display of useful document attributes such as file type, file size, date last changed,

relevancy rating and the number of ‘hits’ (key words found) in the document. The display

of an extract of the document, say several lines above and below the first hit, is helpful for

determining the context in which the document has been returned.

Re-sorting

The ability to re-sort the results list using different criteria, such as title, number of hits,

relevancy, date changed, file type, or any other criterion that makes sense for your

organization.

Hit-to-hit navigation

The provision of navigation buttons enabling users to go directly to the first hit in the

returned document, and thereon to the next or previous hit as required. This means users

avoid having to read through pages and pages of a document before finding the relevant

section, making it much more efficient.

Hit highlighting

A familiar concept from searching the web, hit highlighting is when the key words, or

‘hits’, in a document are highlighted in a different colour. This feature is often not

available in an intranet search engine, but it really should be, as combined with hit-to-hit

navigation it enables users to immediately see the relevant sections of the document.

Fast preview

The ability to preview large non-HTML documents in a basic HTML format, without the

need for downloading the whole document. This function enables users to view a few lines

above and below each hit, and then to expand up or down to continue reading.

Search within

The ability to search within the current set of results, to further narrow them.

Although these are just some of the features available in intranet search engines, they are the

main features required to ensure that users have the best overall experience. Others that may be

relevant to your organization include intelligent agents that automatically advise users when

relevant content appears in the data repository, and the ability to save or export search results.
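
Two of the features above, the document extract around the first hit and hit highlighting, can be sketched together. This is only an illustration: `highlight` and `extract` are hypothetical names, and the `**` marker stands in for whatever markup (for example an HTML element with a background colour) the result page would actually use.

```python
import re
from typing import List


def highlight(text: str, terms: List[str], marker: str = "**") -> str:
    """Wrap every whole-word occurrence of a query term in `marker`."""
    if not terms:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", text)


def extract(lines: List[str], terms: List[str], context: int = 1) -> List[str]:
    """Return the lines around the first hit, with all hits highlighted."""
    lowered = [t.lower() for t in terms]
    for i, line in enumerate(lines):
        words = re.findall(r"\w+", line.lower())
        if any(t in words for t in lowered):
            start = max(0, i - context)
            return [highlight(l, terms) for l in lines[start:i + context + 1]]
    return []  # no hit in this document


page = ["Intro", "Claim travel expenses online.", "Footer"]
snippet = extract(page, ["expenses"], context=1)
```

The same list of hit positions could drive hit-to-hit navigation by recording the index of every matching line instead of stopping at the first.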

5.4 Designing the interface

Take extra time and effort when designing your search pages. They should be clear, easy, and

above all, simple. Don’t bother with an ‘advanced search’ facility: your users won’t understand

it.

Behind the scenes

Make your search engine quietly work for the user, to correct their mistakes, and to help

them find the right page. While much of the work of deploying a search engine goes on behind

the scenes, the design of the user interface greatly influences how successful the system will be.

While the interface design must be consistent with the rest of your online material, we

recommend the following guidelines:

5.4.1 Search Page

Keep it simple

There are two key elements on a search page: a field to enter the search terms, and a

‘search’ button. There is no reason to make the page any more complex than this.

Provide hints

A list of tips and examples on the main search page helps users when they first use the

search engine. This list should be written in plain English, and should cover the common

issues and questions.

No advanced searching

Normal users have enough difficulty with search engines without confronting them with a

complex set of ‘advanced search’ methods. Users want to quickly find a single page, and

therefore we must design our interface to meet this need.


Always ‘and’

Few users understand the concept of ‘Boolean operators’. Instead, they expect that when

they type in three words, they will be given only those documents that contain all three.

Furthermore, typing in more words should provide fewer hits, not more.

The search engine must therefore default to ‘and-ing’ the words together. In fact,

eliminate support for Boolean operators altogether, unless there is a clear case that they

will be of value to your users.
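
The effect of defaulting to ‘and’ can be shown with a toy inverted index, in which each word maps to the set of documents containing it and a query is answered by intersecting those sets. This is a minimal sketch under the assumption of whitespace tokenization; the function names and sample documents are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every word to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all_terms(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return only the documents that contain every word of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())  # intersection implements 'and'
    return result


index = build_index({
    "annual": "annual leave form",
    "sick": "sick leave form",
    "travel": "travel claim form",
})
```

With this index, "form" matches all three documents, "leave form" matches two, and "annual leave form" matches one, so adding words narrows the result as users expect.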

Place the cursor

When the search page is opened, the cursor should already be in the search field (this is

known as ‘setting the focus’). This allows the user to simply type in their words and hit

Enter. It’s a small point, but it took only days for our users to specifically ask us for it.

5.4.2 Result Page

Make it attractive

A results page should encourage users, not frighten them off with tiny text, difficult

layouts, and hard-to-read fonts. We expect users to spend time browsing through the list

of results, so it is worth spending some extra time making the pages easy on the eye.

Keep it simple

There are only three things that we need to present for each hit: title (a hyperlink to the

actual page), page summary and ranking. Why, for example, would the user want to

know the size of the page in kilobytes? The less we say for each hit, the easier it is for the

user to scan through the list and find the page they want.

Make the description meaningful

Ideally, each hit should provide a useful description of the page, obtained from the ‘meta’

tags within the page. If this information is not available, we shall provide a brief extract,

highlighting where the search terms are used.

We should also ensure that the extract always shows some useful text and not the standard

headings found on every page (how many listings have you seen that start with ‘[Home]

[Contents] [Index] …’?).
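
A minimal sketch of this description logic, using Python's standard `html.parser` module: take the meta description when it exists, otherwise fall back to the start of the visible text. It assumes that the repeated navigation headings sit inside a `<nav>` element, which may not hold for every intranet page, and for brevity it does not highlight the search terms in the fallback extract. `describe` is a hypothetical helper name.

```python
from html.parser import HTMLParser
from typing import List, Optional

SKIPPED_TAGS = ("script", "style", "nav", "title")  # not useful in an extract


class _PageText(HTMLParser):
    """Collect the meta description and the visible body text of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.description: Optional[str] = None
        self.chunks: List[str] = []
        self._skip_depth = 0  # greater than zero while inside a skipped element

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "description":
            self.description = attributes.get("content")
        elif tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def describe(html: str, max_chars: int = 160) -> str:
    """Prefer the meta description; otherwise use the start of the visible text."""
    parser = _PageText()
    parser.feed(html)
    if parser.description:
        return parser.description.strip()
    return " ".join(parser.chunks)[:max_chars]
```

For pages without a `<nav>` element, the list of skipped elements would need to be adapted to however the site marks up its standard headings.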


Behind the scenes

Effort should be spent ‘behind the scenes’ to improve the effectiveness of your search

engine. Most engines have capabilities that, when implemented carefully, will help users to find

the pages they are looking for.

These features must operate transparently, so that the user is not even aware of their impact.

They should simply find the search engine both easy to use and effective.

Fuzzy searching, stemming, and more

Our selected search engine provided a number of powerful searching capabilities:

Fuzzy searching, or ‘sounds-like’

There were three closely-related options which were essentially designed to find terms which

‘sounded like’ those entered by the user. In this way, it becomes possible to handle spelling

mistakes and other inconsistencies.
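
The internals of the selected engine's ‘sounds-like’ options are not described here, so the sketch below swaps in a different, simpler technique: string similarity from Python's standard `difflib` module rather than a phonetic algorithm. It shows how a misspelt term might be mapped to the closest indexed terms; `suggest`, the sample vocabulary, and the cutoff of 0.8 are assumptions made for illustration.

```python
import difflib
from typing import List


def suggest(term: str, vocabulary: List[str], cutoff: float = 0.8) -> List[str]:
    """Return up to three indexed terms that closely resemble `term`, best first."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)


vocabulary = ["refrigerator", "freezer", "returns", "policy"]
suggestions = suggest("refridgerator", vocabulary)
```

The search can then be retried silently with the best suggestion, or the suggestion can be offered to the user.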

Stemming

This feature takes the terms entered by the user, and tries other combinations of endings. For

example, searching for ‘walks’ would also find ‘walk’, ‘walking’, and ‘walked’.

We found this to be very effective, and it eliminated differences in singular versus plural uses

of terms in our pages.
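
The idea behind stemming can be illustrated with a deliberately tiny suffix stripper. This is a sketch only and not the algorithm used by the selected engine: real stemmers such as Porter's handle many more endings and exceptions, and the three suffixes and the minimum stem length below are assumptions.

```python
SUFFIXES = ("ing", "ed", "s")  # deliberately tiny list, for illustration only
MIN_STEM_LENGTH = 3  # assumed guard so that short words are left alone


def stem(word: str) -> str:
    """Strip one common ending so that related word forms share a stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


stems = {stem(w) for w in ["walk", "walks", "walking", "walked"]}
```

If both the indexed words and the query words are passed through `stem`, all four forms above reduce to the same stem and match each other. A stripper this crude also mangles some words (it turns "class" into "clas"), which is one reason production engines use more careful rules.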

There are a wide variety of other tools available in modern search engines, beyond those

mentioned above. Our evaluation and study showed that just because a feature exists, it

doesn’t mean it will help the users.

Weightings and rankings

The order in which results are displayed by a search engine is the product of a number of

complex weighting and ranking factors behind the scenes. These vary from engine to engine.

They also have a big impact on how effective the search engine is.

The main aim would be to understand our search engine, and configure it (if required) to

meet our specific requirements. The key is to have the search engine work in a ‘transparent’ and

understandable way.


Figure 6.1: Search Engine User Interface


Chapter 6

Conclusion and Future work

7.1 Conclusion

We have discussed the concept of an intranet search engine. In this project, the mechanisms

of intranets and search engines were thoroughly examined. Developing a search engine for an

intranet requires thorough research into the organization’s needs. In brief, we learned the

following lessons as a result of this project:

- Spend a lot of time identifying your needs, and researching the right search engine.

Choosing the wrong search engine is a costly mistake that is not easy to rectify half way

through a project.

- Keep the interface simple. The search page should have a field to type in and a ‘search’

button. Complex interfaces and advanced searches will confuse users: by default, your

search engine should simply do what the users expect.

- Take the time to configure the intelligence ‘under the hood’. The search engine should

quietly assist the user to find the desired page (via synonyms, fuzzy searching, and so

forth).

- Track the usage of your search engine, and use this to assess how well it is working. You

should be gathering enough information to allow you to refine the engine’s configuration

to better meet user needs.
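
As a sketch of the usage tracking recommended in the last lesson, the snippet below summarizes a search log into the most frequent queries and the queries that returned nothing. The log format, a list of (query, number of results) pairs, and the name `summarize_log` are assumptions; an actual engine will have its own log layout.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def summarize_log(
    entries: Iterable[Tuple[str, int]], top: int = 5
) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Return the most common queries and the most common zero-result queries.

    `entries` holds (query, number_of_results) pairs from a hypothetical log.
    """
    all_queries: Counter = Counter()
    failed_queries: Counter = Counter()
    for query, result_count in entries:
        normalized = " ".join(query.lower().split())  # fold case and spacing
        all_queries[normalized] += 1
        if result_count == 0:
            failed_queries[normalized] += 1
    return all_queries.most_common(top), failed_queries.most_common(top)


log = [("Leave form", 12), ("leave  form", 12), ("canteen menue", 0), ("payslip", 3)]
popular, failed = summarize_log(log)
```

Frequent zero-result queries point to missing content, missing synonyms, or common misspellings that the engine's configuration should absorb.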


7.2 Future work

The following modifications and upgrades can be integrated into this project to make it a

better search engine.

(a) Enable better query understanding

Intelligence can be built in to find the correct word and to correct typing errors. Search engines

today still lack the intelligence to understand the semantics, rather than just the syntax, of a

search query.

(b) A ranking algorithm

Ranks are based on the number of occurrences of the query words in the content and title. Thus

the results are accurate based on content. However, this alone is insufficient when the content

searched is not purely document-based, as in the case of the Internet.

(c) Multimedia Search Engine

The current version of our Intranet search engine is only capable of searching documents in

text format. This version could be enhanced by supporting searches for various types of files

including images, audio, video etc.


