Toward Real-Time Indexing on Internet2
Dr. Franz Kurfess and Foaad Khosmood
California Polytechnic State University, Fall 2004
Toward Real-Time Indexing on Internet2 / Khosmood and Kurfess, Cal Poly, Fall 2004
Introduction
How can Internet searching improve with new Internet2 capabilities?
In this presentation:
• Background concepts
• Why Internet2?
• Web searching fundamentals
• 4 Concepts: Techniques, Advantages and Disadvantages
  I. “Push all the Way”
  II. “Push half way”
  III. Source-level pre-indexing
  IV. A P2P overlay network
Background Concepts: Indexing
What is indexing?
• Indexing is creating an indexed list of objects in order to make finding a particular set of objects easier.
Two popular methods for indexing:
• Push: the “source” of a new piece of information contacts the index and places that piece of information in the correct place. Examples: SQL databases, file systems.
• Pull: a process searches and examines all available objects, organizes them and creates an indexed table. Examples: classic Windows file searching, Web searching with Google or Yahoo.
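The difference between push and pull can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: the `Index` class, the `push` and `pull_all` functions, and the sample documents are all hypothetical.

```python
from collections import defaultdict


class Index:
    """A toy keyword index: word -> set of document names (hypothetical)."""

    def __init__(self):
        self.entries = defaultdict(set)

    def add(self, name, text):
        for word in text.lower().split():
            self.entries[word].add(name)

    def lookup(self, word):
        return sorted(self.entries.get(word.lower(), set()))


def push(index, name, text):
    """Push: the source places its own new object in the index."""
    index.add(name, text)


def pull_all(index, sources):
    """Pull: a separate process examines every available object."""
    for name, text in sources.items():
        index.add(name, text)


# Hypothetical sample content.
sources = {"a.html": "real time indexing", "b.html": "web crawling"}

pull_index = Index()
pull_all(pull_index, sources)  # the indexer scans everything it can find

push_index = Index()
push(push_index, "c.html", "indexing on Internet2")  # the source reports itself
```

In the pull case a new document stays invisible until the next scan; in the push case it is findable as soon as the source reports it.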
Background Concepts: Real Time
What is meant by real time?
• In this context, it means that information is retrievable as soon as it is mechanically possible to retrieve it.
• It does not necessarily mean instant; it is sometimes called online search.
• If you have a set of computers, S, that you can search, and you place a new object, x, inside one of the computers in the set, then we define real-time search as a search that can begin the instant the object is placed and return x’s location through its normal search means.
Background Concepts: Polling
Polling versus interrupting
• The analogy is borrowed from electronic engineering, microcomputer architecture and operating systems, among others.
• Polling is where one process periodically checks for the presence of a signal.
• Interrupts are when the signal makes its own presence known and “interrupts” the process. No periodic checking is necessary.
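A minimal Python sketch of the two styles, assuming a hypothetical in-memory `Signal` object (real interrupts are a hardware or operating-system facility; a registered callback is used here only as a stand-in):

```python
class Signal:
    """A toy signal source (hypothetical) that supports both styles."""

    def __init__(self):
        self.raised = False
        self.handlers = []

    def on_raise(self, handler):
        # Interrupt style: register a handler that the signal itself calls.
        self.handlers.append(handler)

    def raise_signal(self):
        self.raised = True
        for handler in self.handlers:
            handler()


def poll(signal, max_checks):
    """Polling: check repeatedly; return the number of checks used, or None."""
    for check in range(1, max_checks + 1):
        if signal.raised:
            return check
    return None


events = []
signal = Signal()
signal.on_raise(lambda: events.append("interrupted"))

checks_before = poll(signal, max_checks=3)  # nothing raised yet, so None
signal.raise_signal()                       # the handler runs right away
checks_after = poll(signal, max_checks=3)   # found on the first check
```

The polling side spends checks even when nothing has happened, while the handler runs only when the signal is actually raised.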
Background Concepts
Polling/interrupting analogy as applied to web searching:
• Pull methods like WWW search engines could be said to be polling.
  • They crawl the entire known Internet and create many indices.
  • Users simply access the indices.
  • They are not real-time.
• Push methods exist only for limited domains.
  • Blogs, for example, are collections of information indexed by date and author. They are indexed in real time when authors submit their text.
  • Instant messaging program interfaces are also displays of text information indexed by arrival time. They can be said to be real-time, although not instant!
Push Methods
Why are they only available in limited domains? Why isn’t there near real-time WWW searching using push methods now?
• No push searching protocol was implemented along with HTTP, and now it’s too late to convert all web servers in the world.
• No universal architecture exists.
• No centralized database is agreed upon.
• It would have to rely on source-side (HTTP server) resources to complete the task.
• It could be subject to abuse by intentional misreporting upstream.
Why Internet2?
• Speed: Some of the proposed solutions are just not practical at slower speeds.
• Community: A smaller and more homogeneous community (universities and research institutions) allows a better and faster chance for adoption of new protocols.
  • Resources could be shared.
  • Agreements could be obtained much more easily.
• Abuse: The non-commercial nature minimizes incentives for abuse.
Web Searching Fundamentals
Existing architecture = Google/Yahoo-style polling method.
• Data is distributed randomly around the network/Internet.
• A central process polls data periodically.
• Central storage keeps all data for cross-tabulation purposes.
• The user accesses tabulation results.
Web Indexing Diagram
[Diagram: existing web indexing architecture used by Google/Yahoo, showing data servers, an indexer / retriever, and clients]
Web Searching Fundamentals
Timing analysis for existing web search paradigms (N-node network):
Task Name | Abbreviation | Description | Value
Downloading | DL | Time it takes to download an HTML page and cache it. | N(T_DL)
Parsing and Indexing | P | Time it takes to parse a cached page and extract keywords. | N(T_P)
Storage Time | S | Time it takes to store the results in a database. | N(T_S)
Retrieval Time | R | Time it takes to retrieve one set of results out of the database per user request. | T_R
Total | T | Total time for indexing and retrieving one piece of new information. | N(T_DL + T_P + T_S) + T_R
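The total in the last row can be evaluated directly. The sketch below is illustrative only: the function name and every numeric value are hypothetical and chosen just to show the arithmetic, not measurements from the project.

```python
def crawl_total_time(n, t_dl, t_p, t_s, t_r):
    """Crawl-based total: N times (download + parse + store), plus one retrieval."""
    return n * (t_dl + t_p + t_s) + t_r


# Hypothetical per-page times in seconds for a one-million-node network.
total = crawl_total_time(n=1_000_000, t_dl=0.5, t_p=0.1, t_s=0.05, t_r=0.2)
days = total / 86_400  # seconds per day
```

With these made-up values the total is about 650,000 seconds, roughly 7.5 days, and almost all of it comes from the factor N rather than from retrieval.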
New Document Diagram
[Diagram: web servers on the Internet (N), one web server holding the new document, master indexer, storage, and client]
1. New document is placed on the web server.
Crawler Finds New Document
2. Indexer crawls the web and downloads the document. (DL)
Indexer Parses New Document
3. Indexer parses (P) and stores (S) the results in the DB storage.
User Retrieves Stored Search Results
4. User retrieves search results. (R)
Web Searching Fundamentals
Advantages
• Fast response to end user.
• Abuse by owners of information is minimized because they have little control over ranking algorithms.
Disadvantages
• Indexing time is slow / non-real-time (at Google, retrieved information reflects the state of the web about 30 days ago).
• Central authority can be restrictive.
• All popular search information and databases are owned by two or three large companies.
• Bound to have limited CPU and storage resources.
4 Proposals for Better Internet2 Searching
This project aims to:
• Fully describe and analyze the proposals
• Implement a proof of concept for each
• Come up with new protocol and standard recommendations
The proposals:
1. Push all the way
2. Push half-way
3. Source-level pre-indexing
4. P2P model for web searching
Push All the Way: Description
• Web server triggers an upload mechanism for new content.
  • For true real-time performance, any change in a web-accessible file triggers the mechanism.
  • For more practical, near real-time performance, a periodic process checks for changes in web-accessible files and uploads any changed file it finds.
• Essentially: new information is written to two destinations, one local, one remote.
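The near real-time variant (a periodic check that uploads whatever has changed) might look like the following sketch. It is not the project's proof-of-concept software: change detection by content hash is an assumption, the `upload` callback is a hypothetical stand-in for the real transfer to the indexer, and files are kept in an in-memory dict so the example is self-contained.

```python
import hashlib


def find_changes(files, seen_hashes):
    """Return names of files whose content hash differs from the last check.

    `files` maps a file name to its current bytes; `seen_hashes` is updated
    in place so the next call reports only newer changes.
    """
    changed = []
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if seen_hashes.get(name) != digest:
            seen_hashes[name] = digest
            changed.append(name)
    return changed


def push_changes(files, seen_hashes, upload):
    """Send every changed file to the remote destination via `upload`."""
    for name in find_changes(files, seen_hashes):
        upload(name, files[name])


uploaded = []                    # stands in for the remote indexer
seen = {}
site = {"index.html": b"hello"}  # hypothetical web-accessible content


def record(name, content):
    uploaded.append(name)


push_changes(site, seen, record)  # first run: the file is new, so it is sent
push_changes(site, seen, record)  # nothing changed: nothing is sent
site["index.html"] = b"hello, Internet2"
push_changes(site, seen, record)  # content changed: it is sent again
```

Each call to `push_changes` plays the role of one run of the periodic process; a true real-time version would call the same upload step from a file-change trigger instead.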
[Diagram: web servers on the Internet (N), one web server holding the new document, master indexer, storage, and client]
1. New document is placed on the web server.
2. Document is “pushed” up to the main indexer immediately. No need to wait for crawling. (DL)
3. Indexer parses (P) and stores (S) the results in the DB storage.
4. User retrieves search results. (R)
Push All the Way: Timing
Google / Yahoo timing from before: N(T_DL + T_P + T_S) + T_R
• The ideal “push all the way” case assumes powerful processing units capable of handling a high number of simultaneous upload requests.
• Ideal case: T_DL + T_P + T_S + T_R
  • Faster by roughly a factor of N (there is no N term)
  • Real-time performance
• Worst case: some cost will be incurred from multiple processing requests arriving at the same time: (N/k)(T_DL + T_P + T_S) + T_R
  • k is the number of simultaneous requests that are serviceable.
  • Assuming worst case = every node ready to update.
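The ideal and worst cases for this proposal can be compared numerically. The function names and the sample values in this sketch are hypothetical and serve only to illustrate how k moves the result between the two extremes.

```python
def push_ideal(t_dl, t_p, t_s, t_r):
    """Ideal case: one download, parse, store and retrieval, with no N factor."""
    return t_dl + t_p + t_s + t_r


def push_worst(n, k, t_dl, t_p, t_s, t_r):
    """Worst case: all N nodes update at once, k requests served at a time."""
    return (n / k) * (t_dl + t_p + t_s) + t_r


# Hypothetical per-page times in seconds.
times = dict(t_dl=0.5, t_p=0.1, t_s=0.05, t_r=0.2)

ideal = push_ideal(**times)
serial = push_worst(n=200, k=1, **times)      # only one request at a time
parallel = push_worst(n=200, k=200, **times)  # every request served at once
```

With k = N the worst case collapses to the ideal case, and with k = 1 it equals the crawl-based total, so the capacity of the upload destination decides where a real deployment lands.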
Push All the Way: Analysis
Advantages
• Can be RT or near-RT performance
• Caching is 100% up to date
• Crawling is eliminated
• Little incentive for abuse: the ranking algorithm is still run remotely
Disadvantages
• Requires all web servers to adopt a self-reporting standard.
• Relies on full and correct functioning and diligence of web servers.
• Traffic will be extremely high at the upload destination.
Push All the Way: Proof of Concept
• Difficult to test with a large number of nodes.
• Developed software that triggers upon a change in web server content and uploads the new page to the indexing server.
• Testing concluded with 5 real nodes.
• Testing planned with about 200 virtual nodes using simulator software.
Push Half-Way: Description
• Web server triggers an upload mechanism.
• But not all servers upload to the same place.
  • Use multiple databases, parsers and indexers, each responsible for a subset of the N nodes.
  • Divide the whole set of N into X different parts. Each of the X sections accepts uploads from all the nodes it is responsible for.
• Still need a central indexer. It receives results from multiple sources and puts them in the master database. But this process takes less time because there are very few nodes and the information is already parsed and indexable.
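The slides do not say how the N nodes are divided among the X regional indexers. One possible scheme, shown here purely as an assumption, is to hash each server name so that every server can work out its own upload point without a lookup table. The host names are hypothetical.

```python
import hashlib


def regional_indexer_for(server, x):
    """Map a server name to one of x regional indexers, numbered 0 .. x-1."""
    digest = hashlib.sha256(server.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % x


def partition(servers, x):
    """Group servers by the regional indexer each one would upload to."""
    parts = {i: [] for i in range(x)}
    for server in servers:
        parts[regional_indexer_for(server, x)].append(server)
    return parts


# Hypothetical host names: 200 nodes split across 10 regional indexers.
servers = [f"www{i}.example.edu" for i in range(200)]
parts = partition(servers, x=10)
```

A geographic or administrative split (for example, one indexer per campus) would work equally well; the only requirement from the description is that each node belongs to exactly one part.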
[Diagram: web servers on the Internet (N), one web server holding the new document, regional “half-way” indexers, master indexer, storage, and client]
1. New document is placed on the web server.
2. Document is immediately “pushed” up to the regional indexer.
3. Regional databases are periodically merged with the master indexer DB by polling.
4. Data is stored and retrieved by the end user as before.
Push Half-Way: Timing
Google / Yahoo timing from before: N(T_DL + T_P + T_S) + T_R
• Ideal case: T_DL + T_P + T_S + X(T_M) + T_R
  • X is the number of half-way indexers.
  • T_M is the time it takes to merge one half-way indexer database with the master database. It represents download time from the half-way indexer to the master indexer AND storage time in the master indexer.
  • The volume of the download and store operations is (N/X)(T_DL + T_S) per half-way indexer, or simply N(T_DL + T_S) for all the half-way indexers.
  • Substituting X(T_M) = N(T_DL + T_S) gives (N + 1)(T_DL + T_S) + T_P + T_R, so for large N the ideal case approaches N(T_DL + T_S) + T_P + T_R.
• Worst case: (N/(Xk))(T_DL + T_P + T_S) + N(T_DL + T_S) + T_R
  • k is the number of simultaneous requests that are serviceable.
  • Assuming worst case = every node ready to update.
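A short numeric check of the half-way timing, assuming (as the slide states) that merging one regional database costs N/X downloads and stores. Function names and values are hypothetical.

```python
def half_way_ideal(n, x, t_dl, t_p, t_s, t_r):
    """Ideal case: one push to a regional indexer plus X regional merges."""
    t_m = (n / x) * (t_dl + t_s)  # time to merge one regional database
    return t_dl + t_p + t_s + x * t_m + t_r


def half_way_worst(n, x, k, t_dl, t_p, t_s, t_r):
    """Worst case: every node updates at once, k requests per regional indexer."""
    return (n / (x * k)) * (t_dl + t_p + t_s) + n * (t_dl + t_s) + t_r


# Hypothetical per-page times in seconds; 200 nodes, 10 regional indexers.
times = dict(t_dl=0.5, t_p=0.1, t_s=0.05, t_r=0.2)

ideal = half_way_ideal(n=200, x=10, **times)
worst = half_way_worst(n=200, x=10, k=4, **times)
```

In both cases the N × (download + store) merge term dominates, which is why this proposal is not real-time even though the first hop is pushed.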
Push Half-Way: Analysis
Advantages
• Reduced load on central indexer
• Parsing is done by regional indexers
• Should be faster than the Yahoo/Google model
• Crawling eliminated
• Little incentive for abuse
Disadvantages
• Not RT: the meta-indexing portion is still done the old-fashioned way.
• Requires multiple upload points. This information must be communicated to web servers.
• Still requires diligence and self-reporting standards by web servers.
Push Half-Way: Proof of Concept
• Difficult to test with a large number of nodes.
• Difficult to set up regional indexers.
• Was able to test with 5 nodes + 2 regional indexers.
• Simulation with 200 nodes and 10 indexers to be concluded.
Source-level pre-indexing: Description
• “Source” is the web server where the content originates.
• Web servers parse and index their own content and store it locally (= mini-database).
• Central indexer accesses the results, not the actual content.
• Central indexer merges the mini-databases into its own giant database.
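A minimal sketch of the idea, assuming the mini-database is a simple keyword-to-URL mapping. The slides do not define its format, so the structure, the function names and the URLs below are all hypothetical.

```python
from collections import defaultdict


def build_mini_db(pages):
    """Source side: parse local pages into keyword -> set of URLs."""
    mini_db = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            mini_db[word].add(url)
    return dict(mini_db)


def merge_into(master, mini_db):
    """Central side: merge an already-parsed mini-database; no parsing here."""
    for word, urls in mini_db.items():
        master.setdefault(word, set()).update(urls)


# Two hypothetical web servers, each indexing its own content.
db_a = build_mini_db({"http://a.example/1": "real time search"})
db_b = build_mini_db({"http://b.example/2": "P2P search"})

master = {}
merge_into(master, db_a)
merge_into(master, db_b)
```

The central indexer never sees the page text, only the keyword entries, which is where both the CPU saving and the opening for misreporting come from.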
Pre-Indexing: Timing
Google / Yahoo timing from before: N(T_DL + T_P + T_S) + T_R
• Pre-indexing worst case: N(T_DL + T_S + T_M) + T_P + T_R
  • T_M represents the time it takes to merge the mini-database with the indexer, not including storage time.
  • In general, T_M << T_P; T_M is almost negligible.
• Ideal case: N(T_DL + T_S) + T_P + T_R
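The pre-indexing formulas can be set against the crawl-based total to see where the saving comes from. All names and numbers in this sketch are hypothetical.

```python
def crawl_total(n, t_dl, t_p, t_s, t_r):
    """Crawl-based total: parsing is paid N times at the central indexer."""
    return n * (t_dl + t_p + t_s) + t_r


def pre_index_worst(n, t_dl, t_s, t_m, t_p, t_r):
    """Worst case: each of the N mini-databases also costs a merge time."""
    return n * (t_dl + t_s + t_m) + t_p + t_r


def pre_index_ideal(n, t_dl, t_s, t_p, t_r):
    """Ideal case: merge time is negligible."""
    return n * (t_dl + t_s) + t_p + t_r


# Hypothetical per-page times in seconds for 1000 nodes.
n, t_dl, t_p, t_s, t_r = 1000, 0.5, 0.1, 0.05, 0.2

baseline = crawl_total(n, t_dl, t_p, t_s, t_r)
ideal = pre_index_ideal(n, t_dl, t_s, t_p, t_r)
saving = baseline - ideal
```

The saving in the ideal case is (N - 1) × T_P, so the size of the speed-up depends on how large parsing time is relative to download and storage time.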
Pre-Indexing: Analysis
Advantages
• At least an order of magnitude faster than Google/Yahoo
• Less CPU load on indexing servers
Disadvantages
• Not RT or near RT
• Requires web servers to cooperate and create the mini DB.
• Requires a new standard for representation and access of the mini DB.
• Subject to abuse because parsing is done locally and the source controls the reported content.
Pre-Indexing: Proof of Concept
• Easily tested with 5 nodes
• 200-node simulation TBD
P2P Overlay Network: Description
• Idea is very similar to the P2P file-swapping concept.
• Instead of swapping files, we’re swapping reference links to data on some computer.
• Web servers parse and index their own content and store it locally (= mini-database).
  • Consists of an indexed set of objects along with location and strength.
  • Location is initially all local, but may include a set of “cached” searches with remote locations associated with them.
• In addition, web servers also accept requests for promotion or demotion of the strength of their objects.
• Search strings are forwarded along an overlay network, maintained by each node.
• Results are returned and sorted based on strength.
• Results are accessible from any client on the network.
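The forwarding and sorting steps can be sketched as a breadth-first flood over the overlay. This is an illustration under stated assumptions, not the project's P2P software: the flooding strategy, the data layout, the `min_strength` filter, and the node names and URLs are all hypothetical.

```python
from collections import deque


def p2p_search(overlay, local_index, start, term, min_strength=0.0):
    """Forward `term` along the overlay and collect hits, strongest first.

    overlay:     node -> list of neighbour nodes
    local_index: node -> {term: [(location, strength), ...]}
    """
    visited = {start}
    queue = deque([start])
    results = []
    while queue:
        node = queue.popleft()
        # Each node answers from its own mini-database.
        for location, strength in local_index.get(node, {}).get(term, []):
            if strength >= min_strength:
                results.append((location, strength))
        # Then it forwards the search to neighbours not yet reached.
        for neighbour in overlay.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return sorted(results, key=lambda hit: hit[1], reverse=True)


# Hypothetical four-node overlay; node A is where the client connects.
overlay = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"]}
local_index = {
    "C": {"internet2": [("http://c.example/doc", 0.4)]},
    "D": {"internet2": [("http://d.example/new", 0.9)]},
}

hits = p2p_search(overlay, local_index, "A", "internet2")
strong_hits = p2p_search(overlay, local_index, "A", "internet2", min_strength=0.5)
```

A real deployment would stream results back as they arrive rather than wait for the whole traversal, and would need some way to authenticate strength values.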
P2P Overlay Network
[Diagram: a client and eight web servers connected by a P2P overlay network; the new document resides on one web server]
1. New document is placed on a web server.
2. Client initiates a search.
3. P2P message passing software produces a path to the document.
P2P: Timing
Google / Yahoo timing from before: N(T_DL + T_P + T_S) + T_R
P2P worst case: X(T_message) + N(2T_message + T_R) + T_sort
• X is the number of nodes originally connected to the searching initiator. X << N.
• T_message is the time required to send a P2P protocol message to another node. While extremely small, this value is not negligible.
• T_sort is the time it takes to sort the incoming data. This value is difficult to define because in P2P searches the data is constantly incoming. Under this model, one rarely waits until the search is “complete.” A suitable result should be presented as soon as it is obtained.
• There is no caching of content, meaning you could be waiting for days for a complete set of results.
• Through filtering and setting of minimum-strength thresholds per search, we can greatly mitigate the worst case: i.e., N will be much smaller in a typical case.
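The worst-case expression can be evaluated to see how strongly it depends on the number of nodes actually queried. The function name and all values below are hypothetical, including the assumption that filtering cuts the nodes reached from 1000 to 50.

```python
def p2p_worst(n, x, t_message, t_r, t_sort):
    """Worst case: X initial messages, then for each of N nodes two messages
    (request and reply) plus a local retrieval, then one sort."""
    return x * t_message + n * (2 * t_message + t_r) + t_sort


# Hypothetical times in seconds.
unfiltered = p2p_worst(n=1000, x=5, t_message=0.01, t_r=0.2, t_sort=1.0)
filtered = p2p_worst(n=50, x=5, t_message=0.01, t_r=0.2, t_sort=1.0)
```

Because the N term dominates, any filter that shrinks the set of nodes reached reduces the worst case almost proportionally.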
P2P: Analysis
Advantages
• Potential to be fast and reflexive, although not real-time in this form, since new data placement does not trigger anything.
• No central indexer.
• Most resources are distributed.
Disadvantages
• Requires resident P2P clients to agree on an algorithm for strength determination.
• Requires some additional CPU resources for web servers.
• Extremely vulnerable to abuse because false messages can be sent without authentication.
P2P: Proof of Concept
• Very simple P2P software has been developed and tested.
• Not yet tied to web content and web server local indexing.
• Must use simulation nodes because it’s difficult to get I2 servers to agree to run a resident P2P program.
• Plans for a 200-node simulation are in progress.
Conclusions
• Some models with (near) real-time search capabilities may have significant advantages over existing search engines:
  • up-to-date results
  • no central crawler, indexer, database
• These models also may have significant disadvantages:
  • bandwidth, processing power
  • lack of protocols
  • coordination of distributed resources
  • trustworthiness