QUERY
STORAGE SYSTEM
QUERY COMMAND
QUERY RESULT
PROCESS IN MEMORY
WEBBOT SEARCH ENGINE
DYN-SORT ALGORITHM
CLASSIC SEARCH ENGINE
SEARCH RESULT CACHE
DOCUMENT
DOC-WORD
WORD-DOC
STEMMING
ALTERNATIVE
IGNORE
SPIDER
DOCUMENT RATE ALGORITHM
RAW WEBPAGE DATA
URL DOWNLOAD
SPIDER
STORAGE SYSTEM
THE ARCHITECTURE OF WEB SPIDER
SPIDER CONTROL CENTRE
HTML/META PARSER
META PARSER PLUGIN
REMOTE URL DATA
DOMAIN URL DATA
EXTRA URL DATA
SEED URL
DOMAIN / FETCH INFO
URL DOWNLOAD
SPIDER
URL DOWNLOAD
SPIDER
MULTIPLE THREAD
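The repeated URL DOWNLOAD SPIDER boxes under MULTIPLE THREAD suggest a pool of download workers fed from the seed URLs. A minimal sketch of that idea, assuming a thread pool and a stub downloader (`fetch`, `crawl`, and the `workers` parameter are hypothetical names, not taken from the slides):

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stub downloader (assumption): a real spider would issue an HTTP GET here."""
    return f"<html>raw data for {url}</html>"


def crawl(seed_urls, workers=3):
    """Download every URL with a pool of worker threads.

    Returns a dict mapping each URL to its raw page data.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pages = pool.map(fetch, seed_urls)  # results come back in input order
        return dict(zip(seed_urls, pages))


if __name__ == "__main__":
    seeds = ["http://www.test.com/INDEX.HTML",
             "http://www.university.com/INDEX.HTML"]
    for url, raw in crawl(seeds).items():
        print(url, len(raw))
```

How the SPIDER CONTROL CENTRE schedules work and how DOMAIN, REMOTE, and EXTRA URL DATA are separated is not stated on the slide, so the sketch leaves that out.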
THE ARCHITECTURE OF DATA STORAGE
TREE BASED DATA STORAGE STRUCTURE
EXAMPLE URLS:
http://www.test.com/path1/path2/DATA1.HTML
http://www.test.com/path1/path2/DATA2.HTML
http://www.test.com/path1/DATA1.HTML
http://www.test.com/INDEX.HTML
http://www.university.com/INDEX.HTML
http://portal/expat.com/download/INDEX.HTML
TREE NODES:
TOP-LEVEL DOMAINS: com, org
DOMAIN AND HOST NAMES: test, portal, university, expat, www
PATH DIRECTORIES: path1, path2, download
FILES: DATA1.HTML, DATA2.HTML, INDEX.HTML
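One way to read the tree is that each URL is stored under its host labels in reverse order (com, then test, then www), followed by its path segments. A minimal sketch under that assumption; the function name `storage_path` is hypothetical, and the reversed-host ordering is inferred from the node labels rather than stated on the slide:

```python
from urllib.parse import urlsplit


def storage_path(url):
    """Map a URL to a list of tree nodes.

    Assumption: host labels are reversed (www.test.com -> com, test, www)
    and the URL path segments follow as deeper levels.
    """
    parts = urlsplit(url)
    host_labels = parts.hostname.split(".")[::-1]
    path_segments = [seg for seg in parts.path.split("/") if seg]
    return host_labels + path_segments


if __name__ == "__main__":
    print("/".join(storage_path("http://www.test.com/path1/path2/DATA1.HTML")))
    # prints com/test/www/path1/path2/DATA1.HTML
```

With this layout, all pages of one domain share a common subtree, so a directory-level index can be built per domain.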
REVERSE INDEX BUILD IN ONE DIRECTORY
FILES: UID1.db, UID2.db, UIDn.db; UID1.xml, UID2.xml, UIDn.xml; UID1.ddb, UID2.ddb, UIDn.ddb; DICT.db; RIDX.db
STEPS:
0. *.db AND *.xml DATA ARE FETCHED BY THE SPIDER AND STORED IN THE SAME DIRECTORY [1]
1. GET *.ddb FROM *.db (DICT DATABASE OF EACH URL) [2]
2. COMBINE THE SEPARATE DICTS INTO ONE WHOLE DICT (DICT.db) IN THE SAME DIRECTORY
3. BUILD THE REVERSE DICT DATABASE (RIDX.db) FOR THE WHOLE DIRECTORY [3]
4. DELETE USELESS *.db AND *.ddb FILES TO RELEASE DISK SPACE [4]
NOTES:
[1] *.db: URL raw data. *.xml: the URL data's reference info, including real URL, download time, current status, media type, etc.
[2] *.ddb: the URL data's DICT; its structure is like WORD-DOC.
[3] RIDX.db: reverse index database built from DICT.db; its structure is like DOC-WORD.
[4] If we plan to support the web page cache function, we cannot delete the *.db data.
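The GET *.ddb, COMBINE, and REVERSE stages can be sketched in memory as follows. This is an illustration only: the slides do not say how weights are computed or how the .db files are laid out, so the sketch assumes whitespace tokenization, a term count as the weight, and plain Python containers in place of the database files. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def build_ddb(raw_text):
    """Per-URL dict (stand-in for UIDn.ddb): word -> weight.

    Assumption: the weight is simply the number of occurrences.
    """
    return Counter(raw_text.lower().split())


def combine_dicts(ddbs):
    """Merge the per-URL dicts into (uid, word, weight) records (stand-in for DICT.db)."""
    return [(uid, word, weight)
            for uid, ddb in sorted(ddbs.items())
            for word, weight in sorted(ddb.items())]


def build_ridx(dict_records):
    """Invert the DICT records into word -> [(uid, weight)] (stand-in for RIDX.db)."""
    ridx = defaultdict(list)
    for uid, word, weight in dict_records:
        ridx[word].append((uid, weight))
    return dict(ridx)


if __name__ == "__main__":
    raw = {"UID1": "web spider web", "UID2": "spider index"}  # stand-in for *.db
    ddbs = {uid: build_ddb(text) for uid, text in raw.items()}
    ridx = build_ridx(combine_dicts(ddbs))
    print(ridx["spider"])  # [('UID1', 1), ('UID2', 1)]
```

Once RIDX has been built, the raw data and per-URL dicts are no longer needed for lookup, which is why the final stage can delete them unless page caching is required.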
THE GLOBAL REVERSE INDEX MAP
CHARACTER SEP DIRECTORY   RELATIVE WORD DATABASE   ACTUAL WORD
a/b/b                     reviate                  abbreviate
a/b/b                     roachment                abbroachment
a/c/t                     able                     actable
a/c/t                     amer                     actamer
b/a/d                     ~                        bad
b/a/d                     arian                    badarian
b/a/d                     ass                      badass
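The mapping from an actual word to its character-separated directory and relative word database can be sketched as below. The depth of three characters and the use of `~` for an empty remainder are read off the example words (bad maps to b/a/d and ~); the function name `index_location` and the `depth` parameter are hypothetical, and the handling of words shorter than three characters is not shown in the source.

```python
def index_location(word, depth=3):
    """Split a word into (directory, relative database name).

    The first `depth` characters each become one directory level;
    the remainder names the word database, with "~" for an empty remainder.
    """
    prefix, rest = word[:depth], word[depth:]
    return "/".join(prefix), rest or "~"


if __name__ == "__main__":
    for w in ["abbreviate", "abbroachment", "actable", "actamer",
              "bad", "badarian", "badass"]:
        print(w, index_location(w))
```

For example, `index_location("abbreviate")` gives `("a/b/b", "reviate")` and `index_location("bad")` gives `("b/a/d", "~")`.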
THE GLOBAL REVERSE INDEX BUILD PROCESS
DIRECTORY TREE NODES: A, B, C, D, E, F
Recurse Sequence: A, B, C, D, E, F
GlobalRIDX.db
WORD  URL      WEIGHT
A     F/E/C/A
B     F/E/C/B
C     F/E/C
D     F/E/D
E     F/E
F     F
DICT.db
URL   WORD  WEIGHT
A     WA    7
A     WB    2
B     WC    5
C     WB    3
RIDX.db
WORD  URL   WEIGHT
WB    A     2
WB    C     3
WA    A     7
WC    B     5
NOTE: DICT.db AND RIDX.db DO NOT USE THE SAME GROUP OF DEMO DATA AS GlobalRIDX.db
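The recurse sequence A, B, C, D, E, F is what a post-order (children first) walk produces if the listed paths (F/E/C/A, F/E/C/B, F/E/C, F/E/D, F/E, F) describe a directory tree rooted at F. A minimal sketch under that assumption, merging each directory's local RIDX rows into one global map; the `TREE` layout, the function names, and the sample rows are hypothetical illustrations, not the author's implementation:

```python
# Directory tree implied by the paths: F -> E -> {C -> {A, B}, D}
TREE = {"F": ["E"], "E": ["C", "D"], "C": ["A", "B"],
        "A": [], "B": [], "D": []}


def recurse_sequence(node, tree):
    """Post-order walk: visit every child directory before the directory itself."""
    order = []
    for child in tree[node]:
        order.extend(recurse_sequence(child, tree))
    order.append(node)
    return order


def build_global_ridx(root, tree, local_ridx):
    """Merge per-directory (word, url, weight) rows into word -> [(url, weight)]."""
    global_ridx = {}
    for directory in recurse_sequence(root, tree):
        for word, url, weight in local_ridx.get(directory, []):
            global_ridx.setdefault(word, []).append((url, weight))
    return global_ridx


if __name__ == "__main__":
    print(recurse_sequence("F", TREE))  # ['A', 'B', 'C', 'D', 'E', 'F']
    # Hypothetical local RIDX rows, keyed by the directory that holds them.
    local = {"A": [("WA", "A", 7), ("WB", "A", 2)],
             "B": [("WC", "B", 5)],
             "C": [("WB", "C", 3)]}
    print(build_global_ridx("F", TREE, local)["WB"])  # [('A', 2), ('C', 3)]
```

Walking children first means a parent directory is only processed after all of its subdirectories have contributed their rows.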
FOUR CLUSTER MODULES AND THEIR MAIN TASKS
HA/LOAD CLUSTER: DISPATCH CONNECT REQUEST
HA/STORAGE CLUSTER: DISPATCH FILE READ AND WRITE REQUEST
QUERY CLUSTER: SYNCHRONIZATION, QUERY MULTIPLE WORDS
SPIDER/INDEXOR CLUSTER: SYNCHRONIZATION, CRAWL AND INDEX DATA
NODES: B1, B2; Q1, Q2, QN; S1, S2, SN; D1
QUERY RESPONSE: VS/TUN, VS/DR
M = MANAGE CONSOLE
USER QUERY
HA CENTRE
LOAD BALANCE CENTRE
QUERY CENTRE
STORAGE CENTRE
INDEXOR CENTRE
SPIDER CENTRE
TASK CENTRE
SPIDER CRAWL
GIGA SWITCH
QUERYOR
STORAGOR
Dyn-SORTER
SPIDER
QUERY STRING ANALYSIS
CACHOR
FEEDBACK
Internet Global Database
WEBBOT SEARCH ENGINE
INDEXOR
Indexed Data