
Toward Real-Time Indexing on Internet2

Dr. Franz Kurfess and Foaad Khosmood

California Polytechnic State University, Fall 2004

Introduction

How can Internet searching improve with new Internet2 capabilities?

In this presentation:
• Background concepts
• Why Internet2?
• Web searching fundamentals
• 4 concepts: techniques, advantages and disadvantages
  I. “Push all the way”
  II. “Push half-way”
  III. Source-level pre-indexing
  IV. A P2P overlay network


Background Concepts: Indexing

What is indexing?
• Indexing is creating an ordered list of objects in order to make finding a particular set of objects easier.

Two popular methods for indexing:
• Push: the “source” of a new piece of information contacts the index and places that piece of information in the correct place. Examples: SQL databases, file systems.
• Pull: a process searches and examines all available objects, organizes them and creates an indexed table. Examples: classic Windows file searching, Web searching with Google or Yahoo.
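The difference between the two methods can be sketched in a few lines of Python. This is only an illustration: the `Index` class and the `push` and `pull` helpers are hypothetical names, not part of the presentation, and the "index" is a simple keyword-to-name map.

```python
from collections import defaultdict


class Index:
    """Toy keyword index: maps each word to the names of objects containing it."""

    def __init__(self):
        self.entries = defaultdict(set)

    def add(self, name, text):
        for word in text.lower().split():
            self.entries[word].add(name)

    def find(self, word):
        # .get() avoids creating empty entries for unknown words.
        return sorted(self.entries.get(word.lower(), set()))


def push(index, name, text):
    """Push: the source itself places the new object in the index."""
    index.add(name, text)


def pull(index, sources):
    """Pull: a separate process examines every available object and indexes it."""
    for name, text in sources.items():
        index.add(name, text)
```

With push, one call per new object keeps the index current; with pull, the index is only as fresh as the last full pass over `sources`.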


Background Concepts: Real Time

What is meant by real time?
• In this context, it means that information is retrievable as soon as it is mechanically possible to retrieve it.
• It does not necessarily mean instant; it is sometimes called online search.
• Given a set of computers S that you can search, suppose you place a new object x on one of the computers in the set. We define real-time search as a search that can begin the instant the object is placed and return x’s location through its normal search means.


Background Concepts: Polling

Polling versus interrupting
• The analogy is borrowed from electronic engineering, microcomputer architecture and operating systems, among others.
• Polling is where a process periodically checks for the presence of a signal.
• Interrupts are when the signal makes its own presence known and “interrupts” the process. No periodic checking is necessary.
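As a rough illustration of the two styles, the sketch below models a signal that can either be checked repeatedly (polling) or call registered handlers itself (interrupt style). The `Signal` class and `poll` function are hypothetical names chosen for this example; a real polling loop would also sleep between checks.

```python
class Signal:
    """Toy signal source supporting both polling and interrupt-style callbacks."""

    def __init__(self):
        self.raised = False
        self.handlers = []

    def on_raise(self, handler):
        # Interrupt style: register a handler that the signal itself will call.
        self.handlers.append(handler)

    def raise_signal(self):
        self.raised = True
        for handler in self.handlers:
            handler()


def poll(signal, max_checks):
    """Polling style: check repeatedly; return the number of checks used, or None."""
    for check in range(1, max_checks + 1):
        if signal.raised:
            return check
    return None
```

The polling side spends checks even when nothing has happened, while the interrupt side does no work until `raise_signal` is called.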


Background Concepts

Polling/interrupting analogy as applied to web searching:
• Pull methods like WWW search engines could be said to be polling.
  • They crawl the entire known Internet and create many indices. Users simply access the indices. They are not real-time.
• Push methods exist only for limited domains.
  • Blogs, for example, are collections of information indexed by date and author. They are indexed in real time when authors submit their text.
  • Instant messaging program interfaces are also displays of text information indexed by arrival time. They can be said to be real-time, although not instant!


Push Methods

Why are they only available in limited domains? Why isn’t there near real-time WWW searching using push methods now?
• No push searching protocol was implemented along with HTTP, and now it’s too late to convert all web servers in the world.
• No universal architecture exists. No centralized database is agreed upon.
• It would have to rely on source-side (HTTP server) resources to complete the task.
• It could be subject to abuse by intentional misreporting upstream.


Why Internet2?

• Speed: Some of the proposed solutions are just not practical at slower speeds.
• Community: A smaller and more homogeneous community (universities and research institutions) gives new protocols a better and faster chance of adoption.
  • Resources could be shared. Agreements could be obtained much more easily.
• Abuse: The non-commercial nature minimizes incentives for abuse.


Web Searching Fundamentals

• Existing architecture = Google/Yahoo-style polling method.
• Data is distributed randomly around the network/Internet.
• A central process polls data periodically.
• Central storage keeps all data for cross-tabulation purposes.
• The user accesses the tabulation results.


Web Indexing Diagram

[Diagram: existing web indexing architecture used by Google/Yahoo. Servers hold the data, a central indexer/retriever sits between the servers and the clients.]


Timing analysis for existing web search paradigms (N-node network)

| Task Name | Abbreviation | Description | Value |
|---|---|---|---|
| Downloading | DL | Time it takes to download an HTML page and cache it. | N(T_DL) |
| Parsing and Indexing | P | Time it takes to parse a cached page and extract keywords. | N(T_P) |
| Storage Time | S | Time it takes to store the results in a database. | N(T_S) |
| Retrieval Time | R | Time it takes to retrieve one set of results out of the database per user request. | T_R |
| Total | T | Total time for indexing and retrieving one piece of new information. | N(T_DL + T_P + T_S) + T_R |
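The total in the table can be evaluated directly. The function below is a minimal sketch of that formula; the function name and the numeric values used with it are hypothetical and chosen only for illustration, not measurements from the presentation.

```python
def crawl_total_time(n_nodes, t_dl, t_p, t_s, t_r):
    """Total time N(T_DL + T_P + T_S) + T_R for the crawl-based (pull) model."""
    return n_nodes * (t_dl + t_p + t_s) + t_r


# Hypothetical per-page costs in seconds, for a 1000-node network.
total = crawl_total_time(1000, t_dl=0.5, t_p=0.25, t_s=0.25, t_r=1.0)  # 1001.0
```

Because every per-page cost is multiplied by N, the time before a new document becomes searchable grows linearly with the size of the network.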


New Document Diagram

[Diagram: web servers on the Internet (N), a web server holding a new document, master indexer, storage, client.]

1. New document is placed on the web server.


Crawler Finds New Document

[Diagram: web server, master indexer, storage, client.]

2. Indexer crawls the web and downloads the document (DL).


Indexer Parses New Document

[Diagram: web server, master indexer, storage, client.]

3. Indexer parses (P) and stores (S) the results in the DB storage.


User Retrieves Stored Search Results

[Diagram: web server, master indexer, storage, client.]

4. User retrieves search results (R).


Advantages
• Fast response to end user.
• Abuse by owners of information is minimized because they have little control over ranking algorithms.

Disadvantages
• Indexing is slow and not real-time (at Google, retrieved information reflects the state of the web about 30 days ago).
• Central authority can be restrictive. All popular search information and databases are owned by two or three large companies.
• Bound to have limited CPU and storage resources.


4 Proposals for Better Internet2 Searching

This project aims to:
• Fully describe and analyze the proposals
• Implement a proof of concept for each
• Come up with new protocol and standard recommendations

The proposals:
1. Push all the way
2. Push half-way
3. Source-level pre-indexing
4. P2P model for web searching


Push All the Way: Description

• The web server triggers an upload mechanism for new content.
  • For true real-time performance, any change in a web-accessible file triggers the mechanism.
  • For more practical, near real-time performance, a periodic process checks for changes in web-accessible files and uploads any changed file it finds.
• Essentially, new information is written to two destinations: one local, one remote.
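The periodic variant described above might look roughly like the following sketch. It assumes changes are detected by comparing content hashes and that uploading is done by a caller-supplied `upload` function; both are assumptions made for illustration, since the presentation does not specify the detection or transport mechanism, and all names here are hypothetical.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Fingerprint of a file's content, used to detect changes."""
    return hashlib.sha256(data).hexdigest()


def find_changes(files: dict, known_digests: dict) -> list:
    """Return names of files that are new or changed since the last check.

    `files` maps file name -> bytes; `known_digests` is updated in place.
    """
    changed = []
    for name, data in files.items():
        digest = content_digest(data)
        if known_digests.get(name) != digest:
            known_digests[name] = digest
            changed.append(name)
    return changed


def push_changes(files, known_digests, upload):
    """One periodic check: upload every changed file to the remote indexer."""
    changed = find_changes(files, known_digests)
    for name in changed:
        upload(name, files[name])  # second destination; the local copy already exists
    return changed
```

Running `push_changes` on a timer gives the near real-time behavior; calling it from a file-change hook instead would approximate the true real-time case.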


[Diagram: a new document on the web server, with master indexer, storage and client.]



[Diagram: web server, master indexer, storage, client.]

2. Document is “pushed” up to the main indexer immediately. No need to wait for crawling. (DL)


[Diagram: web server, master indexer, storage, client.]

3. Indexer parses (P) and stores (S) the results in the DB storage.


[Diagram: web server, master indexer, storage, client.]

4. User retrieves search results (R).


Push All the Way: Timing

Google / Yahoo timing from before: N(T_DL + T_P + T_S) + T_R

• The ideal “push all the way” case assumes powerful processing units capable of handling a high number of simultaneous upload requests.
• Ideal case: T_DL + T_P + T_S + T_R
  • The indexing term no longer scales with N, so it is orders of magnitude faster for large N.
  • Real-time performance.
• Worst case: some cost is incurred when multiple processing requests arrive at the same time.
  (N/k)(T_DL + T_P + T_S) + T_R
  • k is the number of simultaneous requests that are serviceable.
  • The worst case assumes every node is ready to update.
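The two cases can be compared numerically with a small sketch. The function names and the sample values are hypothetical and serve only to show how the formulas behave.

```python
def push_all_ideal(t_dl, t_p, t_s, t_r):
    """Ideal case: T_DL + T_P + T_S + T_R (no dependence on N)."""
    return t_dl + t_p + t_s + t_r


def push_all_worst(n_nodes, k, t_dl, t_p, t_s, t_r):
    """Worst case: (N/k)(T_DL + T_P + T_S) + T_R, every node updating at once."""
    return (n_nodes / k) * (t_dl + t_p + t_s) + t_r


# Hypothetical costs in seconds; 1000 nodes, 10 simultaneous uploads serviceable.
ideal = push_all_ideal(0.5, 0.25, 0.25, 1.0)            # 2.0
worst = push_all_worst(1000, 10, 0.5, 0.25, 0.25, 1.0)  # 101.0
```

Even the worst case improves on the crawl-based model by a factor of about k in the indexing term.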


Push All the Way: Analysis

Advantages
• Can be RT or near RT performance.
• Caching is 100% up to date.
• Crawling is eliminated.
• Little incentive for abuse: the ranking algorithm is still run remotely.

Disadvantages
• Requires all web servers to adopt a self-reporting standard.
• Relies on full and correct functioning and diligence of web servers.
• Traffic will be extremely high for the uploading destination.


Push All the Way: Proof of Concept

• Difficult to test with a large number of nodes.
• Developed software that triggers upon a change in web server content and uploads the new page to the indexing server.
• Testing concluded with 5 real nodes.
• Testing planned with about 200 virtual nodes using simulator software.


Push Half-Way: Description

• The web server triggers an upload mechanism, but not all servers upload to the same place.
• Use multiple databases, parsers and indexers, each responsible for a subset of the N nodes.
• Divide the whole set of N nodes into X parts. Each of the X sections accepts uploads from all the nodes it is responsible for.
• A central indexer is still needed. It receives results from multiple sources and puts them in the master database.
  • This process takes less time because there are very few nodes to collect from (the X regional indexers) and the information is already parsed and indexable.
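One way to divide the N nodes into X parts is sketched below. Assigning servers to regions by a stable hash of the server name is an assumption made for this example; the presentation does not say how the subsets are chosen, and the function names and host names are hypothetical.

```python
import hashlib


def region_for(server_name: str, num_regions: int) -> int:
    """Map a web server to one of `num_regions` regional indexers via a stable hash."""
    digest = hashlib.sha256(server_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_regions


def partition(servers, num_regions):
    """Group servers by the regional indexer responsible for them."""
    regions = {r: [] for r in range(num_regions)}
    for server in servers:
        regions[region_for(server, num_regions)].append(server)
    return regions
```

Each web server then only needs to know its own upload point, which it can compute locally from its name and X.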


[Diagram: a new document on the web server, a regional “half-way” indexer, master indexer, storage, client.]


[Diagram: web server, regional indexer, master indexer, storage, client.]

2. Document is immediately “pushed” up to the regional indexer.



[Diagram: web server, regional indexer, master indexer, storage, client.]

3. Regional databases are periodically merged with the master indexer DB by polling.



[Diagram: web server, regional indexer, master indexer, storage, client.]

4. Data is stored and retrieved by the end user as before.



Push Half-Way: Timing

Google / Yahoo timing from before: N(T_DL + T_P + T_S) + T_R

• Ideal case: T_DL + T_P + T_S + X(T_M) + T_R
  • X is the number of half-way indexers.
  • T_M is the time it takes to merge one half-way indexer database with the master database: the download time from the half-way indexer to the master indexer plus the storage time in the master indexer.
  • The volume of the download and store operations is (N/X)(T_DL + T_S) per half-way indexer, or simply N(T_DL + T_S) for all the half-way indexers.
• Substituting X(T_M) = N(T_DL + T_S), the ideal case approaches, for large N: N(T_DL + T_S) + T_P + T_R
• Worst case: (N/(Xk))(T_DL + T_P + T_S) + N(T_DL + T_S) + T_R
  • k is the number of simultaneous requests that are serviceable.
  • The worst case assumes every node is ready to update.
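The formulas above can be evaluated with a short sketch. It assumes, as the slide does, that merging all half-way databases costs N(T_DL + T_S) in total; the function names and the sample values are hypothetical.

```python
def half_way_ideal(n_nodes, t_dl, t_p, t_s, t_r):
    """T_DL + T_P + T_S + X*T_M + T_R, with X*T_M taken as N(T_DL + T_S)."""
    merge_all = n_nodes * (t_dl + t_s)
    return t_dl + t_p + t_s + merge_all + t_r


def half_way_worst(n_nodes, x, k, t_dl, t_p, t_s, t_r):
    """(N/(Xk))(T_DL + T_P + T_S) + N(T_DL + T_S) + T_R."""
    upload = (n_nodes / (x * k)) * (t_dl + t_p + t_s)
    merge_all = n_nodes * (t_dl + t_s)
    return upload + merge_all + t_r


# Hypothetical costs in seconds; 1000 nodes, 10 regional indexers, k = 10.
ideal = half_way_ideal(1000, 0.5, 0.25, 0.25, 1.0)          # 752.0
worst = half_way_worst(1000, 10, 10, 0.5, 0.25, 0.25, 1.0)  # 761.0
```

In both cases the merge term N(T_DL + T_S) dominates, which is why this model is not real-time: only the parsing cost has been taken off the central path.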


Push Half-Way: Analysis

Advantages
• Reduced load on the central indexer.
• Parsing is done by regional indexers.
• Should be faster than the Yahoo/Google model.
• Crawling eliminated.
• Little incentive for abuse.

Disadvantages
• Not RT: the meta-indexing portion is still done the old-fashioned way.
• Requires multiple upload points. This information must be communicated to web servers.
• Still requires diligence and self-reporting standards by web servers.


Push Half-Way: Proof of Concept

• Difficult to test with a large number of nodes.
• Difficult to set up regional indexers.
• Was able to test with 5 nodes + 2 regional indexers.
• Simulation with 200 nodes and 10 indexers to be concluded.


Source-level pre-indexing: Description

“Source” is the web server where the content originates.

Web servers parse and index their own content, store it locally = mini-database.

Central indexer accesses the results, not the actual content.

Central indexer merges the mini-databases into its own giant database.
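A minimal sketch of this division of labor follows. It assumes the mini-database is a plain keyword-to-URL map and that merging is a set union; the actual representation would be defined by the new standard the proposal calls for, so the function names and data layout here are hypothetical.

```python
from collections import defaultdict


def build_mini_db(pages):
    """Runs on the web server: map each keyword to the local pages containing it."""
    mini_db = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            mini_db[word].add(url)
    return mini_db


def merge_into_master(master, mini_db):
    """Runs on the central indexer: merge one mini-database.

    Only the pre-built index is read; no page content is downloaded or parsed.
    `master` is expected to be a defaultdict(set).
    """
    for word, urls in mini_db.items():
        master[word] |= urls
    return master
```

The central indexer's work per server reduces to a download of the mini-database and a merge, which is the source of the timing gain discussed next.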


Pre-Indexing: Timing

Google / Yahoo timing from before: N(T_DL + T_P + T_S) + T_R

Pre-indexing worst case: N(T_DL + T_S + T_M) + T_P + T_R
• T_M represents the time it takes to merge the mini-database with the indexer, not including storage time.
• In general, T_M << T_P; T_M is almost negligible.

Ideal case: N(T_DL + T_S) + T_P + T_R
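A quick numeric check of the two pre-indexing formulas, as a sketch with hypothetical function names and sample values:

```python
def pre_index_worst(n_nodes, t_dl, t_s, t_m, t_p, t_r):
    """Worst case: N(T_DL + T_S + T_M) + T_P + T_R."""
    return n_nodes * (t_dl + t_s + t_m) + t_p + t_r


def pre_index_ideal(n_nodes, t_dl, t_s, t_p, t_r):
    """Ideal case (T_M negligible): N(T_DL + T_S) + T_P + T_R."""
    return n_nodes * (t_dl + t_s) + t_p + t_r


# Hypothetical costs in seconds for a 1000-node network.
worst = pre_index_worst(1000, 0.5, 0.25, 0.125, 0.25, 1.0)  # 876.25
ideal = pre_index_ideal(1000, 0.5, 0.25, 0.25, 1.0)         # 751.25
```

Parsing (T_P) is paid once at the source instead of N times at the center, but the download and storage terms still scale with N.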


Pre-Indexing: Analysis

Advantages
• At least an order of magnitude faster than Google/Yahoo.
• Less CPU load on indexing servers.

Disadvantages
• Not RT or near RT.
• Requires web servers to cooperate and create the mini DB.
• Requires a new standard for representation and access of the mini DB.
• Subject to abuse because parsing is done locally and the web server controls what content is reported.


Pre-Indexing: Proof of Concept

• Easily tested with 5 nodes.
• 200-node simulation TBD.


P2P Overlay Network: Description

• The idea is very similar to the P2P file-swapping concept. Instead of swapping files, we’re swapping reference links to data on some computer.
• Web servers parse and index their own content, store it locally = mini-database.
  • The mini-database consists of an indexed set of objects along with location and strength.
  • Location is initially all local, but may include a set of “cached” searches with remote locations associated with them.
• In addition, web servers also accept requests for promotion or demotion of the strength of their objects.
• Search strings are forwarded along an overlay network, maintained by each node.
• Results are returned and sorted based on strength. Results are accessible from any client on the network.
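The description above could be prototyped along the following lines. This is a toy sketch under several assumptions not stated in the presentation: searches are flooded breadth-first to all reachable neighbors, strength is a plain integer, and everything runs in one process. The `Node` class and `search` function are hypothetical names.

```python
from collections import deque


class Node:
    """Toy overlay node with a local mini-database and a list of neighbors."""

    def __init__(self, name):
        self.name = name
        self.index = {}      # keyword -> list of [location, strength]
        self.neighbors = []  # this node's view of the overlay network

    def add_object(self, keyword, location, strength=1):
        self.index.setdefault(keyword, []).append([location, strength])

    def adjust_strength(self, keyword, location, delta):
        # Promotion (delta > 0) or demotion (delta < 0) of a local object.
        for entry in self.index.get(keyword, []):
            if entry[0] == location:
                entry[1] += delta


def search(start, keyword, min_strength=0):
    """Forward the search along the overlay and return hits sorted by strength."""
    seen, queue, hits = {start.name}, deque([start]), []
    while queue:
        node = queue.popleft()
        hits += [(loc, s) for loc, s in node.index.get(keyword, []) if s >= min_strength]
        for neighbor in node.neighbors:
            if neighbor.name not in seen:
                seen.add(neighbor.name)
                queue.append(neighbor)
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

The `min_strength` parameter corresponds to a per-search threshold that filters out weak results before they are returned.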


P2P Overlay Network

[Diagram: a client and eight web servers connected in an overlay network; one web server holds a new document.]

1. New document is placed on a web server.


P2P Overlay Network

[Diagram: a client and eight web servers connected in an overlay network.]

2. Client initiates a search.
3. P2P message-passing software produces a path to the document.


P2P: Timing

Google / Yahoo timing from before: N(T_DL + T_P + T_S) + T_R

P2P worst case: X(T_message) + N(2T_message + T_R) + T_sort

• X is the number of nodes originally connected to the searching initiator. X << N.
• T_message is the time required to send a P2P protocol message to another node. While extremely small, this value is not negligible.
• T_sort is the time it takes to sort the incoming data. This value is difficult to define because in P2P searches the data is constantly incoming. Under this model, one rarely waits until the search is “complete.” A suitable result should present itself as soon as it is obtained.
• There is no caching of content, meaning you could be waiting for days for a complete set of results.

Through filtering and setting of minimum-strength thresholds per search, we can substantially mitigate the worst case: N will be much smaller in a typical case.
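The worst-case expression can be evaluated as follows. The function name and the sample values are hypothetical; in particular, T_sort is treated as a single number here even though the presentation notes it is hard to define.

```python
def p2p_worst(x, n_nodes, t_message, t_r, t_sort):
    """Worst case: X(T_message) + N(2*T_message + T_R) + T_sort."""
    return x * t_message + n_nodes * (2 * t_message + t_r) + t_sort


# Hypothetical costs in seconds: 4 direct neighbors, 1000 nodes reached.
full = p2p_worst(4, 1000, t_message=0.125, t_r=1.0, t_sort=2.0)     # 1252.5
filtered = p2p_worst(4, 50, t_message=0.125, t_r=1.0, t_sort=2.0)   # 65.0
```

The second call illustrates the effect of thresholds: if filtering reduces the number of nodes that must respond from 1000 to 50, the N-dependent term shrinks proportionally.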


P2P: Analysis

Advantages
• Potential to be fast and reflexive, although not real-time in this form since new data placement does not trigger anything.
• No central indexer.
• Most resources are distributed.

Disadvantages
• Requires resident P2P clients to agree on an algorithm for strength determination.
• Requires some additional CPU resources for web servers.
• Extremely vulnerable to abuse because false messages can be sent without authentication.


P2P: Proof of Concept

• A very simple P2P program has been developed and tested.
• Not yet tied to web content and web server local indexing.
• Must use simulation nodes because it’s difficult to get I2 servers to agree to run a resident P2P program.
• Plans for a 200-node simulation are in progress.


Conclusions

• Some models with (near) real-time search capabilities may have significant advantages over existing search engines:
  • up-to-date results
  • no central crawler, indexer, database
• These models may also have significant disadvantages:
  • bandwidth, processing power
  • lack of protocols
  • coordination of distributed resources
  • trustworthiness
