search engines - text mining in action


Search Engines

Text Mining in Action

Himanshu Joshi Roll no. 1114025

Search Engines – Text Mining in action by Himanshu Joshi


Contents

Introduction to Text Mining
    What is Text Mining?
    Why Did I Choose Text Mining?
Comparison with Data Mining
    Similarities
    Dissimilarities
Internet Industry
    History
    The Uprising of Google
    Social Media and Micro-blogging
Search Engines
    What is a Search Engine?
    Types of Search Engines
        Web Search Engines
        Vertical Search Engines
        Semantic Search Engines
Application of Text Mining to Various Types of Search Engines
    Process of Retrieval in Web Search Engine
    Usage of Text Mining in Search Engines
        Text categorization (faceted search systems)
        Contextualized clustering
    Concepts in Action
        Text Categorization
        Contextualized Clustering
    Usage of Text Mining in Semantic Search and Natural Language Processing
Conclusion and Learning



Introduction to Text Mining

What is Text Mining?

Text mining is a burgeoning field that attempts to glean meaningful information from natural
language text [1]. In simple terms, it is a way to extract meaning from text. The meaning that is
extracted serves a particular purpose, depending on why the text is being mined.

Compared to data stored in databases and tables, data stored as text is much more difficult to
analyze by computer, because the algorithms needed to extract it can be highly sophisticated.
However, the payoff is good, as text is the simplest and most common means of data exchange.

Text mining usually involves the process of structuring the input text (usually parsing, along with the

addition of some derived linguistic features and the removal of others, and subsequent insertion into a

database), deriving patterns within the structured data, and finally evaluation and interpretation of the

output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and

interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity

extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity

relation modeling (i.e., learning relations between named entities) [2].

Why Did I Choose Text Mining?

With the advent of technology and the popularity of micro-blogging and social media, the patterns of
the internet are changing. Search engines are moving towards providing more relevant data, semantic
search engines are emerging, and emotions are freely expressed on social websites such as Twitter
and Facebook. In the coming years, searching a website like Twitter or Facebook will provide more
sensible, user-provided content than conventional web search engines like Google, Bing or Yahoo.

1 Text Mining – Ian H. Witten, Computer Science, University of Waikato, Hamilton, New Zealand

2 Text Mining – Wikipedia, The Free Encyclopedia



Being a technocrat and the creator of a search engine, I do not want to miss these developments,
and have therefore chosen this topic to learn more about the changes.

Comparison with Data Mining

Similarities

Just as data mining looks for patterns in data, text mining looks for patterns in a body of text.
Both techniques therefore attempt to extract context from their source material. Another important
similarity is that the information to be extracted should be potentially useful. There is no point
in mining for data that is of little or practically no importance to the user. The definition of
“potentially useful” is, however, different for data mining and text mining. For data mining, the
information extracted should be comprehensible, i.e. it should help to explain the data. In text
mining, however, the data itself is comprehensible without the help of machines. Still, the strong
similarity between text mining and data mining remains.

Dissimilarities

Even though the two techniques are conceptually similar, there are some differences between them.
The information to be extracted in data mining is hidden and unknown, and could hardly be extracted
without the automatic techniques of data mining. By contrast, the data in text mining is still
useful on its own and can easily be comprehended without sophisticated techniques and technology.
What is missing from the data is the context.

Internet Industry

History

The Internet was the result of some visionary thinking by people in the early 1960s who saw great

potential value in allowing computers to share information on research and development in scientific

and military fields. The Internet, then known as ARPANET, was brought online in 1969 under a contract

let by the renamed Advanced Research Projects Agency (ARPA) which initially connected four major



computers at universities in the southwestern US (UCLA, Stanford Research Institute, UCSB, and the

University of Utah). The early Internet was used by computer experts, engineers, scientists, and

librarians. There was nothing friendly about it. There were no home or office personal computers in

those days, and anyone who used it, whether a computer professional or an engineer or scientist or

librarian, had to learn to use a very complex system. E-mail was adapted for ARPANET by Ray Tomlinson

of BBN in 1972. He picked the @ symbol from the available symbols on his teletype to link the username

and address.

The Internet matured in the 70's as a result of the TCP/IP architecture first proposed by Bob Kahn at BBN

and further developed by Kahn and Vint Cerf at Stanford and others throughout the 70's. It was adopted

by the Defense Department in 1980 replacing the earlier Network Control Protocol (NCP) and universally

adopted by 1983.

In 1986, the National Science Foundation funded NSFNet as a cross country 56 Kbps backbone for the

Internet. They maintained their sponsorship for nearly a decade, setting rules for its non-commercial

government and research uses.

As the commands for e-mail, FTP, and telnet were standardized, it became a lot easier for non-technical

people to learn to use the nets. It was not easy by today's standards by any means, but it did open up

use of the Internet to many more people in universities in particular. Other departments besides the

libraries, computer, physics, and engineering departments found ways to make good use of the nets--to

communicate with colleagues around the world and to share files and resources.

While the number of sites on the Internet was small, it was fairly easy to keep track of the resources of

interest that were available. But as more and more universities and organizations--and their libraries--

connected, the Internet became harder and harder to track. There was more and more need for tools to

index the resources that were available [3].

This is where web directories like Yahoo!, Excite and DMOZ came into the picture. However, users
had to find the category they were looking for in such directories themselves. Moreover, such
directories were not updated in real time.

To address these problems, people started building what are known today as search engines.
AltaVista, Ask Jeeves, Google and others began as attempts to solve them.

3 A Brief History of The Internet by Walt Howe (http://www.walthowe.com/navnet/history.html)


The Uprising of Google

Google began in January 1996 as a research project by Larry Page and Sergey Brin when they were both

PhD students at Stanford University in California.

While conventional search engines ranked results by counting how many times the search terms

appeared on the page, the two theorized about a better system that analyzed the relationships between

websites. They called this new technology PageRank, where a website's relevance was determined by

the number of pages, and the importance of those pages, that linked back to the original site.
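The link-based idea described above can be illustrated with a short sketch. This is not Google's production algorithm; it is a textbook-style power iteration. The damping factor of 0.85, the iteration count, the function name `pagerank` and the four example sites are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate page importance from link structure.

    `links` maps each page to the list of pages it links to.
    All names and parameter values here are illustrative assumptions.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on, split evenly over its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three sites link to c.example, which links to a.example.
links = {
    "a.example": ["c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph, c.example collects links from three sites and so ranks highest, while a.example outranks b.example because its only inbound link comes from an important page.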

A small search engine called "RankDex" from IDD Information Services designed by Robin Li was, since

1996, already exploring a similar strategy for site-scoring and page ranking. The technology in RankDex

would be patented and used later when Li founded Baidu in China.

Page and Brin originally nicknamed their new search engine "BackRub", because the system checked

backlinks to estimate the importance of a site.

Eventually, they changed the name to Google, originating from a misspelling of the word "googol", the

number one followed by one hundred zeros, which was picked to signify that the search engine wants to

provide large quantities of information for people. Originally, Google ran under the Stanford University

website, with the domain google.stanford.edu.

The domain name for Google was registered on September 15, 1997, and the company was

incorporated on September 4, 1998. It was based in a friend's (Susan Wojcicki) garage in Menlo Park,

California. Craig Silverstein, a fellow PhD student at Stanford, was hired as the first employee.

In May 2011, unique visitors to Google surpassed 1 billion for the first time, an 8.4 percent
increase from a year earlier, when it had 931 million unique visitors.

Google specializes in searching the internet for links, images, videos and other multimedia files.
It has made its mark as the search engine widely regarded as returning the best search results.

Social Media and Micro-blogging

A recent trend that is building up is that of social media and micro-blogging. The amount of content
uploaded to the internet every second runs into terabytes, and the databases of search engines need
to be updated very quickly. With the advent of websites like Twitter and Facebook, relevant data is
shifting towards social media platforms. Search engines based on Twitter are quickly making their
presence felt on the internet, and Facebook is also planning a search engine to find public content
across its platform.

With these developments, data mining techniques such as text mining and web mining have become
highly relevant and heavily relied upon.

Search Engines

What is a Search Engine?

A search engine is a website that allows a user to search for content on the internet. The data to
be searched can be in any form: text, links, images or even natural-language-processed data. A
search engine essentially uses a web crawler or spider, a program that crawls web pages on the
internet and extracts meaningful data from them using complex algorithms and concepts of text
mining. The extracted data is stored on servers and is retrieved when the user enters a query on
the search engine's website.

Types of Search Engines

There are many types of search engines. A few of them are listed here:

Web Search Engines

A web search engine is designed to search for information on the World Wide Web and FTP servers.
The search results are generally presented in a list of results often referred to as SERPs, or
"search engine results pages". The information may consist of web pages, images and other types of
files. Some search engines also mine data available in databases or open directories. Unlike web
directories, which are maintained only by human editors, search engines also maintain real-time
information by running an algorithm on a web crawler [4].

Vertical Search Engines

A vertical search engine, as distinct from a general web search engine, focuses on a specific segment of

online content. The vertical content area may be based on topicality, media type, or genre of content.

4 Web Search Engine – Wikipedia, The Free Encyclopedia



Common verticals include shopping, the automotive industry, legal information, medical information,

and travel. In contrast to general Web search engines, which attempt to index large portions of the

World Wide Web using a web crawler, vertical search engines typically use a focused crawler that

attempts to index only Web pages that are relevant to a pre-defined topic or set of topics.

Some vertical search sites focus on individual verticals, while other sites include multiple vertical

searches within one search engine.

Vertical search offers several potential benefits over general search engines:

• Greater precision due to limited scope
• Leverage domain knowledge including taxonomies and ontologies
• Support specific unique user tasks [5]

Semantic Search Engines

Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual

meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed

system, to generate more relevant results.

There are two major forms of search: Navigational and Research. In navigational search, the user is using

the search engine as a navigation tool to navigate to a particular intended document. Semantic Search is

not applicable to navigational searches. In Research Search, the user provides the search engine with a

phrase which is intended to denote an object about which the user is trying to gather/research

information. There is no particular document which the user knows about that s/he is trying to get to.

Rather, the user is trying to locate a number of documents which together will give him/her the

information s/he is trying to find. Semantic Search lends itself well here.

Rather than using ranking algorithms such as Google's PageRank to predict relevancy, Semantic Search

uses semantics, or the science of meaning in language, to produce highly relevant search results.
In most cases, the goal is to deliver the information queried by a user rather than have a user
sort through a list of loosely related keyword results [6].

5 Vertical Search Engines – Wikipedia, the free encyclopedia (http://en.wikipedia.org/wiki/Vertical_search)

Application of Text Mining to Various Types of Search Engines

We now move on to investigate the functioning of a search engine. To keep the report simple (the
area of research would otherwise become too wide), we concentrate primarily on web search engines
and semantic search engines. A vertical search engine more or less inherits its ingredients from a
web search engine and can thus be skipped.

The functionality of a search engine is broadly based on two data mining techniques:

a. Web mining or link mining, wherein the spiders collect data from various websites. The data is
passed to the text mining tool to extract meaning from it, and to the web mining tool to extract
other potential information from it (such as web links, images etc.).

b. Text mining, for fetching the data when a user submits a query.

Our study here is primarily concerned with text mining, and thus we will assume that the web search
spider has collected the data and has passed it to the text mining tool.

Process of Retrieval in Web Search Engine

The process flow of data retrieval in a web search engine runs through three stages: stop word
removal, stemming and database search.

6 Semantic Search – Wikipedia, the free encyclopedia (http://en.wikipedia.org/wiki/Semantic_search)


The process of retrieval might look easy, but it is probably the most difficult process for a
search engine. The task is difficult for the following reasons:

• Overwhelming information in typical user query results. The data to be mined is huge, and the
number of combinations to be mined makes the task harder still.
• Results are only partly related to each other. It is quite possible that the two results that
appear at rank 1 and rank 2 in the search engine results page (SERP) are not linked at all. An
example is given in a later section.
• Many users investigate only the two or three top-ranked documents. A search engine therefore
needs an algorithm that puts the most relevant documents at the top. This essentially means that
both retrieval and indexing need to be as efficient as possible.
• Traditional lists of ranked documents do not seem to be sufficient for exploratory search tasks.

Stop Word Removal

• All commonly used words such as 'a', 'an' and 'the' are removed from the input string.
• If the search query is given in quotation marks (""), stop words are not removed, so that a phrase search can be performed.

Stemming

• Similar words are 'stemmed' down to their root. The root is mostly a noun or a verb. E.g. cat, catty and catlike are all stemmed down to the word 'cat'.
• Words with similar 'stems' are treated as synonyms. This process is called conflation.

Database Search

• The database is mined. First the complete phrase is searched for, and then the individual words.
• The data is retrieved and shown to the user.
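The three stages above can be sketched as follows. This is only an illustration under stated assumptions: the stop word list is abbreviated, `stem` is a toy suffix stripper rather than a real stemming algorithm such as Porter's, the in-memory `documents` dictionary stands in for the search engine's database, and all function names are hypothetical.

```python
import re

# Abbreviated stop word list (assumption; real engines use longer lists).
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "to", "and"}
# Toy suffix rules (assumption); longer suffixes are tried first.
SUFFIXES = ("like", "ing", "ty", "es", "s")


def stem(word):
    """Strip one known suffix, keeping a root of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text, keep_stop_words=False):
    """Lowercase, tokenize, optionally drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not keep_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return [stem(t) for t in tokens]


def contains_sequence(tokens, phrase):
    """True if `phrase` occurs as a contiguous run inside `tokens`."""
    n = len(phrase)
    return n > 0 and any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def search(query, documents):
    """Return ids of phrase matches first, then ids of single-word matches."""
    query = query.strip()
    # A quoted query is a phrase search: stop words are kept.
    is_phrase = len(query) >= 2 and query[0] == '"' and query[-1] == '"'
    terms = normalize(query, keep_stop_words=is_phrase)
    phrase_hits, word_hits = [], []
    for doc_id, text in documents.items():
        doc_terms = normalize(text, keep_stop_words=is_phrase)
        if contains_sequence(doc_terms, terms):
            phrase_hits.append(doc_id)
        elif not is_phrase and any(t in doc_terms for t in terms):
            word_hits.append(doc_id)
    return phrase_hits + word_hits


# Hypothetical stand-in for the indexed database.
documents = {
    "doc1": "The catlike robot chased a ball",
    "doc2": "A black cat sat on the mat",
    "doc3": "Dogs chase the ball in a park",
}
print(search("the black cats", documents))
print(search('"a black cat"', documents))
```

The unquoted query is reduced to the stems "black" and "cat", so the document containing the whole phrase is listed before the one that matches only through the stem of "catlike". The quoted query keeps its stop word and matches only the exact phrase.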



Usage of Text Mining in Search Engines

Text mining is primarily used in two areas in web search engines. These are:

Text categorization (faceted search systems)

Faceted search (sometimes known as faceted browsing or faceted navigation), is a technique for

accessing and exploring a collection of information (database, catalogue, repository). It presents

the user with a faceted (layered, categorized, grouped) classification, allowing them to explore

by filtering available information. A faceted search system allows each item in the

catalogue/repository/database to be assigned multiple classifications, enabling the

classifications to be ordered in multiple ways, rather than in a single, pre-determined,

taxonomic order.

Facets can be derived manually from analysis of the item, or from pre-existing fields in the

item's metadata such as author, descriptor, language, and format. The former enables facets to
be derived and sourced via a range of user and content research methods. The latter permits
existing items in a catalogue/repository/database to have this extra metadata extracted,
mapped and presented as a navigation facet, without extra data input being needed [7].

Key benefits of faceted search:

• Enhanced feedback - users receive an overview of their search results broken down by category
that they can then use for refining their search.
• Informed choices - users know in advance how many items are available in each category, so they
can search first in categories more likely to bring them a successful result. Categories with zero
items in a given field are usually not shown; hence the user is very unlikely to encounter a 'no
results' outcome.
• Users can select their own searching path or hierarchy based on the information presented to
them and can add or remove filters or facets at will.
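A minimal sketch of these benefits, assuming a tiny hypothetical catalogue whose items already carry `language` and `format` metadata. The item titles and the function names `apply_filters` and `facet_counts` are invented for illustration and do not describe any particular faceted search product.

```python
from collections import Counter

# Hypothetical catalogue; each item carries pre-existing metadata fields.
items = [
    {"title": "Search Basics", "language": "English", "format": "PDF"},
    {"title": "Web Crawling", "language": "English", "format": "HTML"},
    {"title": "Fouille de textes", "language": "French", "format": "PDF"},
    {"title": "Clustering Results", "language": "English", "format": "PDF"},
]


def apply_filters(items, filters):
    """Keep only items whose metadata matches every selected facet value."""
    return [it for it in items if all(it.get(f) == v for f, v in filters.items())]


def facet_counts(items, facet):
    """Count items per facet value; values with zero items never appear."""
    return dict(Counter(it[facet] for it in items if facet in it))


# Overview shown to the user before any filter is chosen.
print(facet_counts(items, "language"))

# After choosing one facet value, the remaining facets are recounted.
french = apply_filters(items, {"language": "French"})
print(facet_counts(french, "format"))

# Filters can be combined, added or removed in any order.
print([it["title"] for it in apply_filters(items, {"language": "English", "format": "PDF"})])
```

Because the counts are recomputed from the filtered items, a value such as HTML simply disappears from the format facet once only French items remain, which is why the user rarely runs into an empty result.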

7 Faceted Search | EIFL (http://www.eifl.net/faceted-search)


Contextualized clustering

Contextualized clustering is a method of grouping similar search results together, clustering
documents or web pages according to the terms found in them. The core advantage of providing
contextualized results is the ease with which the user can find the relevant content he is looking
for. This provides a more meaningful search experience. A newly developed system, HOBSearch, makes
use of suffix tree clustering to overcome many of the weaknesses of traditional clustering
approaches. By using result snippets rather than full documents, HOBSearch both speeds up
clustering substantially and manages to tailor the clustering to the topics indicated in the user's
query. An inherent problem with clustering, though, is the choice of cluster labels.

Concepts in Action

Text Categorization

Most of the examples I will quote here are from the search engine which I have created: Molu – The

Search Spider (http://www.themolu.com).

Text categorization can be applied in various ways in a search engine. Some of them are shown below:

The above screenshot is from an upcoming version of the website. The concept lets the user drill a
search down to a single website, narrowing the scope to the websites he trusts.


Using a data mining technique called faceted categorization, the data item (a news item in this
case) is assigned two properties: a date and a relevance. Each has its own hierarchy according to
popularity and other factors. When either one is invoked, the relevant search results are mined and
returned to the user. This concept is widely applied in search engines these days. The best example
is Flickr.com, where the user can view the search results by date, relevance or interestingness.
Flickr was also among the first companies to apply this concept.

Contextualized Clustering

Search engines today have become intelligent. Using text mining techniques, they group the results
into different categories and present them to the user. The method of clustering is simple to
describe but difficult to apply. The density of tags (the distinct words extracted after stop word
removal and stemming) in a particular record, i.e. how many words of similar origin match in a
particular page, decides the group into which that page will fall.
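The tag-density idea can be sketched as below. This is a deliberately simplified stand-in and not the method of any particular search engine: stemming is omitted, the stop word list is abbreviated, a tag qualifies as a group label only if it occurs in at least two results, and the function names and sample snippets are all assumptions.

```python
import re
from collections import Counter

# Abbreviated stop word list (assumption).
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "to", "and", "for", "on"}


def tags(text):
    """Extract the tags of a page: lowercase words minus stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def cluster_by_tag_density(pages, query_terms):
    """Assign each page to the shared tag with the highest density in it."""
    page_tags = {pid: Counter(tags(text)) for pid, text in pages.items()}
    # Count in how many pages each tag occurs.
    doc_freq = Counter(tag for counts in page_tags.values() for tag in counts)
    # Candidate labels: tags shared by at least two pages, excluding the query itself.
    labels = {t for t, df in doc_freq.items() if df >= 2 and t not in query_terms}

    clusters = {}
    for pid, counts in page_tags.items():
        total = sum(counts.values())
        # Density = occurrences of the tag divided by all tags on the page.
        best = max(
            labels & set(counts),
            key=lambda t: (counts[t] / total, t),
            default="other",
        )
        clusters.setdefault(best, []).append(pid)
    return clusters


# Hypothetical result snippets for the ambiguous query "jaguar".
pages = {
    "p1": "Jaguar car dealer offers new car models",
    "p2": "Jaguar car review and car prices",
    "p3": "The jaguar is a big cat living in the rainforest",
    "p4": "Jaguar cat habitat and cat diet",
    "p5": "Jaguar operating system release notes",
}
print(cluster_by_tag_density(pages, {"jaguar"}))
```

Results about the car and about the animal end up in separate groups, and a result that shares no tag with any other falls into a catch-all group. The sketch also hints at why choosing good cluster labels is hard: here the label is simply whichever shared word happens to be densest.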

A very nice implementation of the technique can be found at Yippy (http://search.yippy.com/). The
search engine classifies the results by groups (city, software, india etc.), sources (Bing, Google
etc.), sites (.com, .org etc.) and time of update.


Some search engines also present the results in a tree-based form, making them much easier for the
user to work with:



Usage of Text Mining in Semantic Search and Natural Language Processing

Semantic search requires extensive application of text mining in order to return the accurate
results the user is looking for. The major difference between a semantic or natural language
processing search engine and a web search engine is that a semantic search engine needs to
understand the meaning of the search query and answer accordingly. A web search engine only needs
to take the query as a whole and return results that match it.

A Natural Language Linguistic Extractor (NLLE) automatically identifies the concepts structuring the

texts. Each significant word is a semantic chain. One word suffices to find all the documents containing

that word and its equivalents using plain English (or French, Spanish, etc.).

For instance, a query on the word "election" will retrieve documents containing the words

"campaigning", "ballot" and "vote", even if the word "election" does not occur explicitly in the source

document.
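The "election" example can be sketched as a simple query expansion. This is only an assumption about how such matching might be approximated: a real linguistic extractor derives related concepts automatically, whereas the `CONCEPTS` table here is written by hand, and the function names and sample documents are hypothetical.

```python
import re

# Hand-written concept table (assumption); it reuses the words from the example above.
CONCEPTS = {
    "election": {"election", "campaigning", "ballot", "vote"},
}


def expand(term):
    """Return the query term together with its known equivalents."""
    return CONCEPTS.get(term, {term})


def semantic_search(term, documents):
    """Return ids of documents containing the term or any of its equivalents."""
    wanted = expand(term.lower())
    hits = []
    for doc_id, text in documents.items():
        words = set(re.findall(r"[a-z]+", text.lower()))
        if wanted & words:
            hits.append(doc_id)
    return hits


# Hypothetical documents; none of them contains the word "election".
documents = {
    "d1": "Voters cast their ballot on Tuesday",
    "d2": "The candidate spent months campaigning",
    "d3": "The match ended in a draw",
}
results = semantic_search("election", documents)
print(results)
```

A plain keyword match on "election" would return nothing for these documents, while the expanded query finds the two that discuss the topic in other words.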

The number of semantic search engines is increasing every day. However, no search engine today can
be called 100% semantic. The search engine that comes closest to being semantic is Wolfram Alpha.

Upon searching “Who is Mahatma Gandhi?” in the search engine, the following results were obtained:



The results are intelligent and better than those of a normal web search. This is a very good
example of how the “Information Extraction” technique has been used in selecting the output.

Another question put to the search engine was “Why is sky blue in colour?” As expected, the result
was a plain answer rather than a cluster of web search results pointing to web pages containing the
keywords “Sky”, “blue” and “colour”.

Using semantic technologies and mathematical computation power, Wolfram Alpha was able to calculate
and predict a mathematical function correctly:



Conclusion and Learning

Through this study, I came to understand the working of a search engine better. The amount of data
posted on the internet every day is huge. To make sense of this data and make it available to
users, search engines must be able to index it quickly and in a sensible manner. Techniques like
web mining and text mining come in handy in such a scenario.

Users are moving towards customized search and semantic search. With the advent of semantic search
engines like Hakia, True Knowledge and Wolfram Alpha, the scope for text mining has increased
further. The results of mining are now used to extract the meaning of the query rather than just
its context, as was the case before.

The results are improving day by day, and search engines are far more advanced than they were some
10 years ago. This industry is definitely one to watch over the next 10 years.
