search engines - text mining in action
DESCRIPTION
With advances in technology and the popularity of micro-blogging and social media, usage patterns on the internet are changing. Search engines are moving towards providing more relevant data, semantic search engines are emerging, and emotions are freely expressed on social websites such as Twitter and Facebook. In the coming years, searching a website like Twitter or Facebook may therefore surface more meaningful, user-provided content than conventional web search engines such as Google, Bing or Yahoo. This paper covers the basics of how text mining is used across various search engines.
Search Engines
Text Mining in Action
Himanshu Joshi Roll no. 1114025
Search Engines – Text Mining in action by Himanshu Joshi
Contents
Introduction to Text Mining
  What is Text Mining?
  Why Did I Choose Text Mining?
Comparison with Data Mining
  Similarities
  Dissimilarities
Internet Industry
  History
  The Uprising of Google
  Social Media and Micro-blogging
Search Engines
  What is a Search Engine?
  Types of Search Engines
    Web Search Engines
    Vertical Search Engines
    Semantic Search Engines
Application of Text Mining to Various Types of Search Engines
  Process of Retrieval in Web Search Engine
  Usage of Text Mining in Search Engines
    Text categorization (faceted search systems)
    Contextualized clustering
  Concepts in Action
    Text Categorization
    Contextualized Clustering
  Usage of Text Mining in Semantic Search and Natural Language Processing
Conclusion and Learning
Introduction to Text Mining
What is Text Mining?
Text mining is a burgeoning new field that attempts to glean meaningful information from natural
language text [1]. In simple terms, it is a way to extract meaning from text. How that meaning is
used depends on the purpose for which the text is being mined.
Compared with data stored in databases and tables, data stored as text is much more difficult to
analyze by computer, because the algorithms needed to extract it can be highly sophisticated. The
payoff is worthwhile, however, because text is the simplest and most common medium of data exchange.
Text mining usually involves structuring the input text (usually by parsing it, adding some derived
linguistic features, removing others, and inserting the result into a database), deriving patterns
within the structured data, and finally evaluating and interpreting the output. 'High quality'
output in text mining usually refers to some combination of relevance, novelty, and interestingness.
Typical text mining tasks include text categorization, text clustering, concept/entity extraction,
production of granular taxonomies, sentiment analysis, document summarization, and entity relation
modeling (i.e., learning relations between named entities) [2].
Why Did I Choose Text Mining?
With advances in technology and the popularity of micro-blogging and social media, usage patterns on
the internet are changing. Search engines are moving towards providing more relevant data, semantic
search engines are emerging, and emotions are freely expressed on social websites such as Twitter
and Facebook. In the coming years, searching a website like Twitter or Facebook may therefore surface
more meaningful, user-provided content than conventional web search engines such as Google, Bing or Yahoo.
[1] Ian H. Witten, Text Mining, Computer Science, University of Waikato, Hamilton, New Zealand
[2] Text Mining, Wikipedia, The Free Encyclopedia
As a technology enthusiast and the creator of a search engine, I do not want to miss these developments,
and have therefore chosen this topic to learn more about the changes.
Comparison with Data Mining
Similarities
Just as data mining looks for patterns in data, text mining looks for patterns in a body of text. Both
techniques therefore attempt to extract context from their source material. Another important
similarity between the two is that the information to be extracted should be
potentially useful. There is no point in mining for information that is of little or no importance
to the user. The definition of “potentially useful” is, however, different for data mining and text mining.
For data mining, it means that the information extracted should be comprehensible, i.e. it
helps to explain the data. In text mining, by contrast, the data itself is already comprehensible without the
help of machines. Even so, the two techniques remain closely similar.
Dissimilarities
Even though the two techniques are conceptually similar, there are some differences between them.
The information to be extracted in data mining is hidden and unknown, and could hardly be extracted
without the automatic techniques of data mining. By contrast, the data in text mining is still
useful on its own and can easily be comprehended without sophisticated techniques
and technology. What is missing from the data is the context.
Internet Industry
History
The Internet was the result of some visionary thinking by people in the early 1960s who saw great
potential value in allowing computers to share information on research and development in scientific
and military fields. The Internet, then known as ARPANET, was brought online in 1969 under a contract
let by the renamed Advanced Research Projects Agency (ARPA) which initially connected four major
computers at universities in the southwestern US (UCLA, Stanford Research Institute, UCSB, and the
University of Utah). The early Internet was used by computer experts, engineers, scientists, and
librarians. There was nothing friendly about it. There were no home or office personal computers in
those days, and anyone who used it, whether a computer professional or an engineer or scientist or
librarian, had to learn to use a very complex system. E-mail was adapted for ARPANET by Ray Tomlinson
of BBN in 1972. He picked the @ symbol from the available symbols on his teletype to link the username
and address.
The Internet matured in the 1970s as a result of the TCP/IP architecture first proposed by Bob Kahn at BBN
and further developed by Kahn and Vint Cerf at Stanford and others throughout the 1970s. It was adopted
by the Defense Department in 1980, replacing the earlier Network Control Protocol (NCP), and universally
adopted by 1983.
In 1986, the National Science Foundation funded NSFNet as a cross country 56 Kbps backbone for the
Internet. They maintained their sponsorship for nearly a decade, setting rules for its non-commercial
government and research uses.
As the commands for e-mail, FTP, and telnet were standardized, it became a lot easier for non-technical
people to learn to use the nets. It was not easy by today's standards by any means, but it did open up
use of the Internet to many more people in universities in particular. Other departments besides the
libraries, computer, physics, and engineering departments found ways to make good use of the nets--to
communicate with colleagues around the world and to share files and resources.
While the number of sites on the Internet was small, it was fairly easy to keep track of the resources of
interest that were available. But as more and more universities and organizations--and their libraries--
connected, the Internet became harder and harder to track. There was more and more need for tools to
index the resources that were available. [3]
This is where web directories like Yahoo!, Excite and DMOZ came into the picture. However, users had
to find the category they were looking for in such directories themselves. Moreover, such directories
were not updated in real time.
To address these problems, people built what are known today as search engines. AltaVista, Ask
Jeeves, Google and others all began as attempts to solve them.
[3] Walt Howe, A Brief History of the Internet (http://www.walthowe.com/navnet/history.html)
The Uprising of Google
Google began in January 1996 as a research project by Larry Page and Sergey Brin when they were both
PhD students at Stanford University in California.
While conventional search engines ranked results by counting how many times the search terms
appeared on the page, the two theorized about a better system that analyzed the relationships between
websites. They called this new technology PageRank, where a website's relevance was determined by
the number of pages, and the importance of those pages, that linked back to the original site.
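As a rough illustration of the idea described above, the sketch below scores pages on a tiny made-up link graph by repeatedly passing each page's score along its outgoing links. It follows the commonly published iterative formulation and is not Google's production system; the function name `pagerank`, the damping value of 0.85, the iteration count and the four-page graph `toy_web` are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank for a graph given as {page: [pages it links to]}."""
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with the same score for every page.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score equally among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: a, b and d all link to c; c links back to a.
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph, page c is linked to by three pages and ends up with the highest score, while page a, which is linked to only by the important page c, outranks b and d, which no page links to.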
A small search engine called "RankDex" from IDD Information Services, designed by Robin Li, had been
exploring a similar strategy for site-scoring and page ranking since 1996. The technology in RankDex
was later patented and used when Li founded Baidu in China.
Page and Brin originally nicknamed their new search engine "BackRub", because the system checked
backlinks to estimate the importance of a site.
Eventually, they changed the name to Google, originating from a misspelling of the word "googol", the
number one followed by one hundred zeros, which was picked to signify that the search engine was
intended to provide large quantities of information for people. Originally, Google ran under the
Stanford University website, with the domain google.stanford.edu.
The domain name for Google was registered on September 15, 1997, and the company was
incorporated on September 4, 1998. It was based in the garage of a friend, Susan Wojcicki, in Menlo
Park, California. Craig Silverstein, a fellow PhD student at Stanford, was hired as the first employee.
In May 2011, the number of unique visitors to Google surpassed 1 billion for the first time, an 8.4
percent increase on the 931 million unique visitors of a year earlier.
Google specializes in searching the internet for links, images, videos and other multimedia files, and
has built its reputation on the quality of its search results.
Social Media and Micro-blogging
A recent trend is the growth of social media and micro-blogging. The amount of content uploaded to
the internet every second runs into terabytes, and the databases of search engines need to be
updated very quickly. With the advent of websites like Twitter and Facebook, there is a shift
in the presence of relevant data towards social media platforms. Search engines based on Twitter are
quickly making their presence felt on the internet, and Facebook is also planning a search engine to
find public content across its platform.
With these developments, data mining techniques such as text mining and web mining are becoming
heavily relied upon and increasingly relevant.
Search Engines
What is a Search Engine?
A search engine is a website that allows a user to search for content on the internet.
The data to be searched can be in any form: text, links, images or even natural-language-processed
data. A search engine essentially uses a web crawler, or spider (a program that visits web pages
on the internet), and extracts meaningful data from those pages using complex algorithms and
text mining concepts. The extracted data is stored on servers and retrieved when a user enters a
query on the search engine's website.
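The store-then-retrieve cycle described above is often implemented with an inverted index, which maps each word to the pages containing it. The sketch below is a minimal in-memory version under stated assumptions: the pages in `crawled_pages` are made-up stand-ins for what a crawler would have fetched, no network access takes place, and the helper names `tokenize`, `build_index` and `lookup` are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every word to the set of page URLs in which it occurs."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def lookup(index, query):
    """Return the pages that contain every word of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Made-up pages standing in for the output of a crawler.
crawled_pages = {
    "http://example.com/a": "Text mining extracts meaning from text",
    "http://example.com/b": "Data mining finds patterns in tables",
    "http://example.com/c": "Search engines crawl the web",
}
index = build_index(crawled_pages)
print(sorted(lookup(index, "mining")))
print(sorted(lookup(index, "text mining")))
```

Building the index once at crawl time means a query only has to intersect a few small sets instead of scanning every stored page.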
Types of Search Engines
There are many types of search engines. A few of them are listed here:
Web Search Engines
A web search engine is designed to search for information on the World Wide Web and FTP servers. The
search results are generally presented as a list, often referred to as SERPs, or "search engine
results pages". The information may consist of web pages, images and other types of files.
Some search engines also mine data available in databases or open directories. Unlike web directories,
which are maintained only by human editors, search engines also maintain real-time information by
running an algorithm on a web crawler [4].
Vertical Search Engines
A vertical search engine, as distinct from a general web search engine, focuses on a specific segment of
online content. The vertical content area may be based on topicality, media type, or genre of content.
[4] Web Search Engine, Wikipedia, The Free Encyclopedia
Common verticals include shopping, the automotive industry, legal information, medical information,
and travel. In contrast to general Web search engines, which attempt to index large portions of the
World Wide Web using a web crawler, vertical search engines typically use a focused crawler that
attempts to index only Web pages that are relevant to a pre-defined topic or set of topics.
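To make the idea of a focused crawler concrete, the sketch below keeps only those pages whose text is sufficiently about a pre-defined topic. It is a deliberately simple stand-in: relevance is assumed to be the fraction of words that belong to a hand-picked topic vocabulary, whereas real focused crawlers typically use trained classifiers. The names `topic_relevance` and `select_for_index`, the threshold of 0.1, the travel vocabulary and the sample pages are all hypothetical.

```python
import re


def topic_relevance(text, topic_terms):
    """Fraction of the words in the text that belong to the topic vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in topic_terms)
    return hits / len(tokens)


def select_for_index(pages, topic_terms, threshold=0.1):
    """Keep only the pages whose relevance reaches the threshold."""
    return [
        url
        for url, text in pages.items()
        if topic_relevance(text, topic_terms) >= threshold
    ]


# Hypothetical vocabulary for a travel vertical and two made-up pages.
travel_terms = {"flight", "hotel", "travel", "booking"}
fetched_pages = {
    "http://example.com/trip": "Cheap flight and hotel booking for travel",
    "http://example.com/cake": "Recipe for chocolate cake with butter",
}
print(select_for_index(fetched_pages, travel_terms))
```

Only the travel page passes the filter, so a vertical engine built this way would index a much smaller and more precise collection than a general web crawler.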
Some vertical search sites focus on individual verticals, while other sites include multiple vertical
searches within one search engine.
Vertical search offers several potential benefits over general search engines:
• Greater precision due to limited scope
• Leveraging of domain knowledge, including taxonomies and ontologies
• Support for specific, unique user tasks [5]
Semantic Search Engines
Semantic search seeks to improve search accuracy by understanding searcher intent and the contextual
meaning of terms as they appear in the searchable dataspace, whether on the Web or within a closed
system, to generate more relevant results.
There are two major forms of search: Navigational and Research. In navigational search, the user is using
the search engine as a navigation tool to navigate to a particular intended document. Semantic Search is
not applicable to navigational searches. In Research Search, the user provides the search engine with a
phrase which is intended to denote an object about which the user is trying to gather/research
information. There is no particular document which the user knows about that s/he is trying to get to.
Rather, the user is trying to locate a number of documents which together will give him/her the
information s/he is trying to find. Semantic Search lends itself well here.
Rather than using ranking algorithms such as Google's PageRank to predict relevancy, Semantic Search
uses semantics, or the science of meaning in language, to produce highly relevant search results. In most
[5] Vertical Search Engines, Wikipedia, the free encyclopedia (http://en.wikipedia.org/wiki/Vertical_search)
cases, the goal is to deliver the information queried by a user rather than have a user sort through a list
of loosely related keyword results. [6]
Application of Text Mining to Various Types of Search Engines
We now move on to investigate how a search engine functions. To keep the report simple (the area of
research would otherwise become too wide), our study concentrates on web search engines and semantic
search engines. A vertical search engine largely inherits its ingredients from a web search engine
and is therefore not covered separately.
The functionality of a search engine broadly rests on two data mining techniques:
a. Web mining or link mining, in which the spiders collect data from various websites. The data is
passed to the text mining tool to extract meaning from it, and to the web mining tool to
extract other potentially useful information from it (such as web links, images etc.).
b. Text mining, for fetching the data when a user submits a query.
Our study is primarily concerned with text mining, so we will assume that the web search spider has
already collected the data and passed it to the text mining tool.
Process of Retrieval in Web Search Engine
The process flow for retrieving data in a web search engine is shown below:
[6] Semantic Search, Wikipedia, the free encyclopedia (http://en.wikipedia.org/wiki/Semantic_search)
Retrieval might look like an easy process, but it is probably the most difficult one for a search
engine, for the following reasons:
• Typical user queries return an overwhelming amount of information. The data to be mined is huge,
and the number of combinations to be considered makes the task harder still.
• Results are only partly related to each other. Two results that appear at rank 1 and rank 2 on
the search engine results page (SERP) may well be unrelated. An example is given in a later section.
• Many users look at only the top two or three ranked documents, so a search engine needs an
algorithm that puts the most relevant documents at the top. Both retrieval and indexing therefore
need to be as efficient as possible.
• Traditional ranked lists of documents do not seem to be sufficient for exploratory search tasks.
Stop Word Removal
• All commonly used words such as 'a', 'an' and 'the' are removed from the input string.
• If the search query is enclosed in quotation marks (""), stop words are kept so that the exact phrase can be searched.
Stemming
• Similar words are 'stemmed' down to their root, which is usually a noun or a verb. E.g. cat, catty and catlike are all stemmed down to the word 'cat'.
• Words with the same stem are treated as synonyms. This process is called conflation.
Database Search
• The database is searched, first for the complete phrase and then for the individual words.
• The data is retrieved and shown to the user.
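The three steps above can be sketched in a few lines of Python. This is an illustrative sketch only: the stop word list and the suffix list are tiny hand-picked assumptions, `stem` is a naive suffix stripper rather than an established algorithm such as Porter's, and the function names and the documents in `docs` are made up for the example.

```python
import re

# Assumed, deliberately tiny stop word list.
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "to", "and"}
# Assumed suffixes, tried in this order.
SUFFIXES = ("like", "ing", "ty", "ed", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def prepare_query(query):
    """Return ("phrase", words) for quoted queries, else ("words", stems)."""
    query = query.strip()
    words = re.findall(r"[a-z0-9]+", query.lower())
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        # Quoted query: keep stop words and skip stemming.
        return "phrase", words
    return "words", [stem(w) for w in words if w not in STOP_WORDS]


def contains_sequence(tokens, terms):
    """True if the terms occur next to each other, in order, in the tokens."""
    n = len(terms)
    return n > 0 and any(
        tokens[i:i + n] == terms for i in range(len(tokens) - n + 1)
    )


def search(query, documents):
    """List whole-phrase matches first, then single-word matches."""
    mode, terms = prepare_query(query)
    phrase_hits, word_hits = [], []
    for doc_id, text in documents.items():
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        if mode == "words":
            # Documents get the same treatment as an unquoted query.
            tokens = [stem(t) for t in tokens if t not in STOP_WORDS]
        if contains_sequence(tokens, terms):
            phrase_hits.append(doc_id)
        elif mode == "words" and any(term in tokens for term in terms):
            word_hits.append(doc_id)
    return phrase_hits + word_hits


docs = {
    "d1": "The cat sat on the mat",
    "d2": "A catlike mat for the house",
    "d3": "Dogs chase the ball",
}
print(search("catty mats", docs))   # ['d2', 'd1']
print(search('"the cat"', docs))    # ['d1']
```

For the unquoted query, "catty" and "mats" are conflated to "cat" and "mat", so d2 matches as a whole phrase and is listed before d1, which matches on the individual words only. The quoted query keeps "the" and matches only the document containing the exact phrase.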
Usage of Text Mining in Search Engines
Text mining is primarily used in two areas in web search engines. These are:
Text categorization (faceted search systems)
Faceted search (sometimes known as faceted browsing or faceted navigation) is a technique for
accessing and exploring a collection of information (database, catalogue, repository). It presents
the user with a faceted (layered, categorized, grouped) classification, allowing them to explore
by filtering the available information. A faceted search system allows each item in the
catalogue/repository/database to be assigned multiple classifications, enabling the
classifications to be ordered in multiple ways, rather than in a single, pre-determined,
taxonomic order.
Facets can be derived manually from analysis of the item, or from pre-existing fields in the
item's metadata such as author, descriptor, language, and format. The former enables facets to
be derived and sourced via a range of user and content research methods. The latter permits
existing items in a catalogue/repository/database to have this extra metadata extracted,
mapped and presented as a navigation facet, without extra data input being needed. [7]
Key benefits of faceted search:
• Enhanced feedback - users receive an overview of their search results broken down by
category, which they can then use to refine their search.
• Informed choices - users know in advance how many items are available in each
category, so they can search first in the categories more likely to bring them a successful
result. Categories with zero items in a given field are usually not shown; hence the user
is very unlikely to encounter a 'no results' outcome.
• Users can select their own searching path or hierarchy based on the information
presented to them and can add or remove filters or facets at will.
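A small sketch may help to show how facets drawn from item metadata give users both counts and filters. The catalogue entries, field names and the helpers `facet_counts` and `apply_filters` below are hypothetical and chosen only for illustration.

```python
from collections import Counter


def facet_counts(items, facet):
    """Count how many items fall under each value of the given facet."""
    return Counter(item[facet] for item in items if facet in item)


def apply_filters(items, filters):
    """Keep the items that match every selected facet value."""
    return [
        item
        for item in items
        if all(item.get(facet) == value for facet, value in filters.items())
    ]


# Made-up catalogue whose metadata fields act as facets.
catalogue = [
    {"title": "Text Mining Basics", "author": "Rao",
     "language": "English", "format": "book"},
    {"title": "Web Crawlers", "author": "Rao",
     "language": "English", "format": "article"},
    {"title": "Fouille de textes", "author": "Martin",
     "language": "French", "format": "book"},
]

# Overview shown to the user before any filter is applied.
print(facet_counts(catalogue, "format"))

# The user narrows the results by language, then sees the updated counts.
english_items = apply_filters(catalogue, {"language": "English"})
print(facet_counts(english_items, "author"))
```

Because the counts are computed from the items that remain, a value with no matching items simply never appears among the choices, which is why the user rarely ends up with a 'no results' page.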
[7] Faceted Search, EIFL (http://www.eifl.net/faceted-search)
Contextualized clustering
Contextualized clustering is a method of grouping similar search results together, clustering
documents or web pages according to the terms found in them. The core advantage
of contextualized results is that users can more easily find the relevant content they are
looking for, which makes for a more meaningful search experience. A newly
developed system, HOBSearch, makes use of suffix tree clustering to overcome many of the
weaknesses of traditional clustering approaches. Using result snippets rather than full
documents, HOBSearch both speeds up clustering substantially and manages to tailor the
clustering to the topics indicated in the user's query. An inherent problem with clustering, though, is
the choice of cluster labels.
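Suffix tree clustering groups results that share phrases. The sketch below does not build a suffix tree and is not HOBSearch's implementation; under that stated simplification, it imitates the central idea by grouping result snippets that share the same two-word phrase and using the phrase itself as the cluster label. The function name `shared_phrase_clusters`, its parameters and the sample snippets are assumptions made for the example.

```python
import re
from collections import defaultdict


def shared_phrase_clusters(snippets, phrase_len=2, min_docs=2):
    """Group snippets that share a phrase; the phrase labels the cluster."""
    phrase_to_docs = defaultdict(set)
    for doc_id, snippet in snippets.items():
        tokens = re.findall(r"[a-z0-9]+", snippet.lower())
        # Record every run of phrase_len consecutive words in the snippet.
        for i in range(len(tokens) - phrase_len + 1):
            phrase = " ".join(tokens[i:i + phrase_len])
            phrase_to_docs[phrase].add(doc_id)
    # Keep only the phrases shared by at least min_docs snippets.
    return {
        phrase: doc_ids
        for phrase, doc_ids in phrase_to_docs.items()
        if len(doc_ids) >= min_docs
    }


# Made-up result snippets for the ambiguous query "jaguar".
snippets = {
    1: "Jaguar car dealers and prices",
    2: "Used jaguar car listings",
    3: "Jaguar cat habitat in the rainforest",
    4: "The jaguar cat is a big predator",
}
clusters = shared_phrase_clusters(snippets)
for label, doc_ids in clusters.items():
    print(label, sorted(doc_ids))
```

Working on short snippets keeps the number of phrases small, and using the shared phrase as the label is one simple answer to the labelling problem, although such labels are not always meaningful to a reader.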
Concepts in Action
Text Categorization
Most of the examples I will quote here are from the search engine which I have created: Molu – The
Search Spider (http://www.themolu.com).
Text categorization can be applied in various ways in a search engine. Some of them are shown below:
The above screenshot is from an upcoming version of the website. The concept lets the user drill down
to the results from a single website, narrowing the scope to the websites he trusts.
Using a data mining technique called faceted categorization, the data item (news, in this case) is
assigned two properties: a date and a relevance. Each has its own hierarchy according to
popularity and other factors. When either of them is invoked, the relevant search results are mined and
returned to the user. This concept is widely applied in search engines these days. The best example
is Flickr.com, where the user can view search results by Date/Relevance/Interestingness. Flickr
was also among the first companies to apply this concept.
Contextualized Clustering
Search engines today have become intelligent. Using text mining techniques, they group the results
into different categories and present them to the user. The method of clustering is simple to describe but
difficult to apply. The density of tags (the distinct words extracted after stop word removal and stemming)
in a particular record (how many words of similar origin match in a particular page) decides the group
into which that page will fall.
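The role of tag density can be illustrated with a short sketch. It assumes that each group is described by a pre-defined set of tags and that a page joins the group whose tags make up the largest share of the page's own tags; the group names, the tag sets and the helpers `tag_density` and `assign_group` are hypothetical, and a production engine would more likely discover its groups from the results themselves.

```python
def tag_density(page_tags, group_tags):
    """Share of the page's tags that also belong to the group."""
    if not page_tags:
        return 0.0
    matches = sum(1 for tag in page_tags if tag in group_tags)
    return matches / len(page_tags)


def assign_group(page_tags, groups):
    """Return the group with the highest tag density, or None if none match."""
    best_group, best_score = None, 0.0
    for name, group_tags in groups.items():
        score = tag_density(page_tags, group_tags)
        if score > best_score:
            best_group, best_score = name, score
    return best_group


# Hypothetical groups, each described by a handful of stemmed tags.
groups = {
    "city": {"mumbai", "delhi", "street", "map"},
    "software": {"download", "version", "install", "code"},
}

# Tags assumed to remain after stop word removal and stemming.
page_tags = ["install", "version", "delhi", "code"]
print(assign_group(page_tags, groups))
```

Three of the four tags on this page belong to the software group and only one to the city group, so the page is placed under software. A page with no matching tags is left unassigned.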
A very nice implementation of the technique can be found at Yippy (http://search.yippy.com/). The
search engine classifies the results by group (city, software, india etc.), source (Bing, Google
etc.), site (.com, .org etc.) and time of update.
Some search engines also present the results in tree based form making it much easier for the user to
use them:
Usage of Text Mining in Semantic Search and Natural Language Processing
Semantic search requires extensive use of text mining in order to return the accurate results the user
is looking for. The major difference between a semantic or natural language processing search engine
and a web search engine is that a semantic search engine needs to understand the
meaning of the search query and answer accordingly, whereas a web search engine only needs to take
the query as a whole and return results that match it.
A Natural Language Linguistic Extractor (NLLE) automatically identifies the concepts structuring the
texts. Each significant word is a semantic chain. One word suffices to find all the documents containing
that word and its equivalents using plain English (or French, Spanish, etc.).
For instance, a query on the word "election" will retrieve documents containing the words
"campaigning", "ballot" and "vote", even if the word "election" does not occur explicitly in the source
document.
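The "election" example can be imitated with simple query expansion. In the sketch below, the concept map `CONCEPTS` is written by hand purely for illustration, whereas a real linguistic extractor would derive such word relationships automatically; the function names and the sample documents are likewise assumptions.

```python
import re

# Hand-written concept map; a real system would derive this automatically.
CONCEPTS = {
    "election": {"election", "campaigning", "ballot", "vote"},
}


def expand_query(term, concepts):
    """Return the term together with its known equivalents."""
    term = term.lower()
    return concepts.get(term, {term})


def semantic_search(term, documents, concepts):
    """Return the documents containing the term or any of its equivalents."""
    wanted = expand_query(term, concepts)
    hits = []
    for doc_id, text in documents.items():
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        if tokens & wanted:
            hits.append(doc_id)
    return hits


# Made-up documents; none of them contains the word "election".
news = {
    "n1": "Voters cast their ballot early",
    "n2": "The candidate spent the week campaigning",
    "n3": "Rainfall was heavy this week",
}
print(semantic_search("election", news, CONCEPTS))
```

A plain keyword match on "election" would return nothing for these documents, while the expanded query retrieves the two that discuss the topic in other words.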
The number of semantic search engines is increasing every day. However, no search engine
today can be called 100% semantic. The search engine that comes closest is
Wolfram Alpha.
Upon searching “Who is Mahatma Gandhi?” in the search engine, the following results were returned:
The results are intelligent and better than those of a normal web search. This is a very good example of how
the “Information Extraction” technique has been used in selecting the output.
Another question put to the search engine was “Why is sky blue in colour?” As expected, the result
was a plain answer rather than a cluster of web search results pointing to web pages containing the
keywords “Sky”, “blue” and “colour”.
Using semantic technologies and mathematical computation power, Wolfram Alpha was able to calculate
and predict a mathematical function correctly:
Conclusion and Learning
Through this study, I gained a better understanding of how a search engine works. The amount of data
posted to the internet every day is clearly huge. To make sense of
this data and make it available to users, search engines must be able to index it quickly and in a
sensible manner. Techniques like web mining and text mining come in handy in such a scenario.
Users are moving towards customized search and semantic search. With the advent of semantic search
engines like Hakia, True Knowledge and Wolfram Alpha, the scope for text mining has increased further.
Mining results are now used to derive the meaning of a query, and not just its context
as was the case before.
Results are improving day by day, and search engines are far more advanced than they were
some 10 years ago. This industry is definitely one to watch over the next 10 years.