A Specialised Search Engine for Neuroscience Web Pages
Fatma Y. ELDRESI (MPhil)
Systems Analysis / Programming Specialist, AGOCO
Part-time lecturer at the University of Garyounis
NeuroSearch
Contents
• Introduction
• Components of NeuroSearch and its architecture
• Implementation
• Software lifecycle: (1) WebCrawler Engine, (2) Indexer Engine, (3) Query Engine, (4) Re-Crawler Engine (specialised crawler)
• Challenges
• Testing
• Conclusions
Introduction
What is a Search Engine?
A server or a collection of servers dedicated to indexing internet web pages, storing the results and returning lists of pages which match particular queries.
Conventional search engines generate their indexes in different ways:
• Google uses a spider
• Yahoo uses a directory
• “NeuroSearch” uses a spider together with the advance knowledge
Introduction cont..
Defining the problem
• Users face many challenges in choosing the relevant keywords.
• Professionals sometimes fail in their search and get disappointing results, because the retrieved pages are sometimes (A) not related to, or (B) different from, what they are looking for.
The Objective
• Create a specialised search engine (i.e. one with advance knowledge) to read web documents
• Index and update all the content on the local server
• Answer the queries from the local database
• Update the system at regular intervals
Why is a specialised search engine needed?
The Web has a non-centralised organisation, with a huge mixed collection of information that is updated continuously and has no standard format, and its pages are extensively linked.
Therefore, establishing standard measures for relevance is a very challenging task.
Components of “NeuroSearch”
It has two components:
1- Search/Crawler Engine
2- Query Engine
Components explained
Crawler Engine:
• Spider
• Indexer
• Re-crawler
Query Engine:
• Retriever
“NeuroSearch” Architecture Model
[Diagram: Users, Search Engine Interface, Query Engine, Index, Indexer, WebCrawler and Re-Crawler, World Wide Web (WWW)]
Implementation and Case Study
• Creating the database using Access DB.
• Implementing all parts of “NeuroSearch” using the Java language and SQL.
NeuroSearch Database
[Diagram: the database holds WebCrawler data, Indexer data, Query data, Re-crawler data and Advance Knowledge data]
The advance knowledge: case study in Neuroscience (Vision)
Phase 1: NeuroSearch uses advance knowledge about neuroscience (vision) as a case study.
Phase 2: Taking vision as the domain knowledge, data mining is then used to construct the keywords and the relations between them.
Phase 3: This knowledge is stored in the database and categorised by numbers; related knowledge is categorised too and stored in the database in data-network form.
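One way to picture the numbered categories and the keyword network is the minimal Java sketch below. It is an in-memory stand-in only: the class name, method names and the sample keywords are assumptions for illustration, and the real system keeps this data in database tables whose layout the slides do not show.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for the advance knowledge tables.
final class AdvanceKnowledgeSketch {
    // keyword -> category number (the numbering scheme is assumed)
    private final Map<String, Integer> category = new LinkedHashMap<>();
    // keyword -> related keywords (undirected edges of the keyword network)
    private final Map<String, Set<String>> related = new LinkedHashMap<>();

    void addKeyword(String keyword, int categoryNumber) {
        String k = keyword.toLowerCase();
        category.put(k, categoryNumber);
        related.computeIfAbsent(k, x -> new LinkedHashSet<>());
    }

    // Records that two keywords are related, in both directions.
    void relate(String a, String b) {
        String x = a.toLowerCase();
        String y = b.toLowerCase();
        related.computeIfAbsent(x, k -> new LinkedHashSet<>()).add(y);
        related.computeIfAbsent(y, k -> new LinkedHashSet<>()).add(x);
    }

    boolean isKnown(String keyword) {
        return category.containsKey(keyword.toLowerCase());
    }

    Set<String> relatedTo(String keyword) {
        Set<String> found = related.get(keyword.toLowerCase());
        return found == null ? Collections.<String>emptySet() : found;
    }

    public static void main(String[] args) {
        AdvanceKnowledgeSketch knowledge = new AdvanceKnowledgeSketch();
        knowledge.addKeyword("eye", 1);      // sample data, not taken from the real database
        knowledge.addKeyword("retina", 1);
        knowledge.relate("eye", "retina");
        System.out.println(knowledge.relatedTo("eye"));
    }
}
```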
Software lifecycle
The Crawler Engine consists of:
1. WebCrawler/Spider Engine
2. Indexer Engine
3. Re-Crawler (specialised)
WebCrawler (Spider)
1) This web crawler is a general one that can download any kind of web page. It does so as follows:
2) It fetches a URL, retrieves all of its web pages and saves them on the local drive.
3) In addition, the WebCrawler has to pass through the proxy firewall (i.e. on the Newcastle University LAN) before downloading any web site.
4) The crawler performs a breadth-first search, which means it collects a list of all the links that are on the current page before it follows any of the links to a new page.
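The breadth-first order can be sketched in Java as below. This is an illustration under assumptions, not the NeuroSearch code: the map `linksOf` is a hypothetical stand-in for "download the page and extract its links", and `maxPages` is an added safety limit.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

final class BreadthFirstCrawlSketch {
    // Returns the pages in the order a breadth-first crawler would visit them.
    static List<String> crawlOrder(String seed, Map<String, List<String>> linksOf, int maxPages) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        seen.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            visited.add(url); // a real crawler would save the page to the local drive here
            // Collect every link on the current page before following any of them.
            for (String link : linksOf.getOrDefault(url, Collections.<String>emptyList())) {
                if (seen.add(link)) { // add() returns false for links already queued
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> web = new HashMap<>();
        web.put("a", Arrays.asList("b", "c"));
        web.put("b", Arrays.asList("d"));
        web.put("c", Arrays.asList("d", "a"));
        System.out.println(crawlOrder("a", web, 100)); // prints [a, b, c, d]
    }
}
```

The queue is what makes the search breadth-first: links found on a page wait behind all links found earlier, so every page at one depth is visited before any page at the next depth.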
WebCrawler: the real challenges
Challenge 1: connecting to the WWW and accessing private websites.
Solution 1: the crawler has to let its socket connect to the proxy server first.
Challenge 2: connecting this socket onward to the WWW.
Solution 2: the GET method. In straightforward socket use, GET takes just the file name. In this case, however, the GET command has to take the full URL.
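The difference between the two GET forms can be sketched in Java as follows. The class and method names, the use of HTTP/1.0 and the proxy host and port parameters are assumptions for illustration; the slides do not show the actual request code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

final class ProxyGetSketch {
    // Socket connected straight to the web server: GET needs only the file name (path).
    static String directRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\nHost: " + host + "\r\n\r\n";
    }

    // Socket connected to a proxy: GET has to carry the full URL.
    static String proxyRequest(String host, String path) {
        return "GET http://" + host + path + " HTTP/1.0\r\nHost: " + host + "\r\n\r\n";
    }

    // Connects to the proxy first, then asks it for the full URL.
    // proxyHost and proxyPort are hypothetical values supplied by the caller.
    static String fetchViaProxy(String proxyHost, int proxyPort, String host, String path)
            throws IOException {
        try (Socket socket = new Socket(proxyHost, proxyPort)) {
            OutputStream out = socket.getOutputStream();
            out.write(proxyRequest(host, path).getBytes(StandardCharsets.ISO_8859_1));
            out.flush();
            InputStream in = socket.getInputStream();
            ByteArrayOutputStream response = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) { // read until the server closes the connection
                response.write(buffer, 0, n);
            }
            return response.toString("ISO-8859-1");
        }
    }

    public static void main(String[] args) {
        System.out.print(directRequest("example.org", "/index.html"));
        System.out.print(proxyRequest("example.org", "/index.html"));
    }
}
```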
Indexer Engine
1) First, it searches the web page using its advance knowledge. The web page is deleted if it is not related to the case study subject.
2) If it is related to the case study subject (neuroscience), the indexer collects the following information from the document:
3) All the keywords it contains, how many times each is repeated, the title and the contents. It then saves them in the database for later display in the query results and for other calculations.
4) The ranking method
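The relevance check and keyword counting can be sketched in Java as below. This is a minimal sketch under assumptions: whole-word, case-insensitive matching of single-word keywords is a guess at the matching rule, and the keyword list stands in for the advance knowledge stored in the database.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class IndexerSketch {
    // Counts how many times each advance-knowledge keyword occurs in the page text.
    // Keywords that never occur are left out of the result.
    static Map<String, Integer> keywordCounts(String pageText, List<String> keywords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String[] words = pageText.toLowerCase().split("[^a-z0-9]+");
        for (String keyword : keywords) {
            String k = keyword.toLowerCase();
            int n = 0;
            for (String word : words) {
                if (word.equals(k)) {
                    n++;
                }
            }
            if (n > 0) {
                counts.put(k, n);
            }
        }
        return counts;
    }

    // A page with no known keyword is treated as unrelated and would be deleted.
    static boolean isRelated(Map<String, Integer> counts) {
        return !counts.isEmpty();
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("eye", "retina", "cortex"); // sample keywords
        Map<String, Integer> counts =
                keywordCounts("The eye: Eye movements and the retina.", keywords);
        System.out.println(counts + " related=" + isRelated(counts)); // {eye=2, retina=1} related=true
    }
}
```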
Query Engine
• It has an interface to accept keywords from the user.
• It gives the user two choices: display only the most relevant results, or the whole result set, which includes the related results.
• It searches for the query keywords in the index database and returns the results in HTML format.
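A query lookup of this kind can be sketched in Java as below. The `Entry` class is a hypothetical stand-in for a row of the index database, and ordering the hits by keyword repetitions is an assumption made for the example, not the ranking method used by NeuroSearch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class QueryEngineSketch {
    // Hypothetical stand-in for one row of the index database.
    static final class Entry {
        final String url;
        final String title;
        final Map<String, Integer> keywordCounts;

        Entry(String url, String title, Map<String, Integer> keywordCounts) {
            this.url = url;
            this.title = title;
            this.keywordCounts = keywordCounts;
        }
    }

    // Escapes the characters that are special in HTML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Finds the indexed pages that contain the keyword and renders them as an HTML list.
    static String search(String keyword, List<Entry> index) {
        final String k = keyword.toLowerCase();
        List<Entry> hits = new ArrayList<>();
        for (Entry e : index) {
            if (e.keywordCounts.getOrDefault(k, 0) > 0) {
                hits.add(e);
            }
        }
        // Assumed ordering: pages that repeat the keyword most often come first.
        hits.sort((a, b) -> Integer.compare(b.keywordCounts.get(k), a.keywordCounts.get(k)));
        StringBuilder html = new StringBuilder("<ol>\n");
        for (Entry e : hits) {
            html.append("<li><a href=\"").append(escape(e.url)).append("\">")
                .append(escape(e.title)).append("</a></li>\n");
        }
        return html.append("</ol>").toString();
    }

    public static void main(String[] args) {
        List<Entry> index = Arrays.asList(
                new Entry("a.html", "Page A", Collections.singletonMap("eye", 1)),
                new Entry("b.html", "Page B", Collections.singletonMap("eye", 5)));
        System.out.println(search("eye", index));
    }
}
```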
Query Result: This is indeed an advantage over other conventional search engines
Re-Crawling
1- The re-crawler is a WebCrawler specialised in any subject created in the advance knowledge in the database. It achieves this by reading the URLs from the index database using SQL.
2- Its interface allows special users to decide whether to continue crawling a website or cancel it.
3- This part of the software aims to update the index with newly found links. This makes it easier to search and crawl websites related to any “advance knowledge” subject.
Testing phase
The test phase requires checking the first 10 ranked query results of “NeuroSearch” against the first 10 results of the same queries from another search engine, such as Google.
20 tests for each category:
• general keywords
• specific keywords
• abbreviation keywords
• combined keywords
• abbreviation & combined keywords
Total of 1000 tests
Testing cont..
Ranking query test results in General Keywords:

                  Google                   NeuroSearch
First 10 results  Rank  Keyword  Repeated  Rank  Keyword  Repeated  Related-keyword  Repeated  Quality/percentage
1                 0     0        0         10    1        3         53               3         37%
2                 10    1        3         10    1        3         51               3         27%
3                 0     0        0         10    1        3         37               3         36%
4                 0     0        0         10    1        3         37               3         33.6%
5                 0     0        0         10    1        3         34               3         36.7%
6                 0     0        0         10    1        3         29               3         38.4%
7                 0     0        0         10    1        3         28               3         38.1%
8                 0     0        0         10    1        3         28               3         38%
9                 0     0        0         10    1        3         28               3         24.9%
10                0     0        0         10    1        3         28               3         13.8%

Average %: Google 10%, 10%; NeuroSearch 100%, 100%
Table 1: (Query 1) Ranking query test result in General Keywords: (Eye)
Testing cont..
The Average Ranking Performance Engine Query test results (Category based)
Error bar = +/- 1 standard deviation
[Bar chart: y-axis "Ranking performance" (0 to 100), x-axis categories 1 to 5; data labels 6.33, 36.66, 1.99, 48.99, 80.96]
Chart 1 Average of Keywords performance for Category Based test results of the (Google)
The Average Keyword Performance Engine Query test results (Category based)
Error bar = +/- 1 standard deviation
[Bar chart: y-axis "Ranking performance" (0 to 100), x-axis categories 1 to 5; data labels 92.33, 88.49, 92.99, 79.49, 98.16; legend: NeuroSearch]
Chart 2 Average of Keywords performance for Category Based test results of the (NeuroSearch)
Analysing the search engines' ranking results by category
Independent Samples T-Test: Google Search Engine * NeuroSearch Search Engine

Category                                       T-value  Sig. (2-tailed)  df (degrees of freedom)  Result
General keywords                               -16.920  .000             9                        Statistically significant
Specific keywords                              -4.394   .000             19                       Statistically significant
Abbreviation keywords                          -63.50   .000             19                       Statistically significant
Combined keywords                              -3.387   .003             19                       Statistically significant
Abbreviations, combined and specific keywords  -2.904   .009             19                       Statistically significant

Table 4. The Average Ranking Engines Performance Query test results Category based
Analysing the Average Ranking Engines Performance Query test results (Category based)
t-test result analysis
• The t-test is used to compare two groups' scores on the same variable.
• The difference is statistically significant in every category (p value < .05).
• This indicates that NeuroSearch has a statistically significantly higher mean score than Google across all categories' ranking results (100 against 52.35).
• The negative t-values reflect the direction of the difference: the Google mean is lower than the NeuroSearch mean in every category.
Visual representation
[Bar chart "Average Ranking Engines performance queries based": axis "Ranking Performance" (0 to 100); Google 52.35, NeuroSearch 100]
Chart 3 Average of Categories Based Engines ranking performance
[Bar chart "Average Keywords Engines performance queries based": axis "Average of Keywords" (0 to 100); data labels 90.29 and 34.98; legend: Google, NeuroSearch]
Chart 4 Average of the keywords in the documents in the query test results (category-based queries): engines' performance
Conclusion
• Although the “NeuroSearch” search engine uses a simple algorithm to judge page quality compared with other conventional search engines,
• “NeuroSearch” proves to be very powerful in obtaining relevant results,
• particularly if its advance knowledge is built by a specialist (domain knowledge), e.g. oil, medical, arts, etc.
Ready for Questions!!!