
Building Custom Crawlers for Oracle Secure Enterprise Search
An Oracle White Paper
January 2007


Contents

Executive Overview
Introduction
The Plug-in Architecture
    Crawler Plug-in
    Identity Plug-in
    Authorization Plug-in
        Query Filter Plug-in
        Result Filter Plug-in
Part One - General Considerations for Connector Writers
    Do I Need an Identity and/or Authorization Plug-in?
    What Parts of My Data Do I Want to Index?
    Who Crawls the Data, and How Does the Crawler Authenticate?
    What Constitutes a Document?
    How Will I Know When Data Has Changed?
    How Will I Get Access Control Lists for My Documents?
Part Two - Building a Connector
    Crawler Plug-In
        Interface CrawlerPluginManager
        Interface CrawlerPlugin
        Interface QueueService
        Interface DocumentContainer
        Interface DocumentMetadata
    Identity Plug-In
    Authorization Plug-In
Appendix A - Sample Custom Crawler
    About the Sample Crawler
    Building the Sample Crawler
    Debugging
    Source code for SimpleFileCrawlerPlugin.java


EXECUTIVE OVERVIEW

This paper discusses the planning and coding process for building custom connectors for Oracle Secure Enterprise Search (SES). It is aimed at the developers and system architects who wish to connect SES to new data sources.

INTRODUCTION

Oracle Secure Enterprise Search (SES) is a product that enables you to find information within your corporate intranet by keyword or contextual searches.

To do this, it must first collect the content from diverse sources and text-index it.

The process of collecting this information is known as crawling, and is performed by a crawler.

The crawled information is indexed, and users have access to that index via a query application.

Oracle provides a number of crawlers out-of-the-box, and also provides an API for customers, partners and others to develop secure crawlers to access datasources not easily crawlable by the standard set of crawlers.

The Plug-in Architecture

In order to implement the secure crawling and accessing of private information, Secure Enterprise Search version 10.1.8 introduced a comprehensive plug-in architecture, which allows developers to tailor the critical security components to the characteristics of the particular datastore to which they are developing access.

The following is a brief summary of the plug-in types used by Secure Enterprise Search.

Crawler Plug-in

This is a crawler designed to pick up documents from a datasource type which is not supported by any of the standard datasources. Its task is to fetch documents, together with Access Control (AC) information – basically, the list of users and/or groups which have access to each document. If no AC information is supplied, the data will be protected by a source-level Access Control List which applies to all documents, or will be public.


Identity Plug-in

The identity plug-in is concerned with users and groups. It performs the following main tasks:

Authenticates a user at login by accepting a username/password pair

Provides a list of groups (or roles) for a specified user

Indicates whether a specified user exists

Authorization Plug-in

The authorization plug-in can contain either or both of the following two components:

Query Filter Plug-in

The Query Filter Plug-in is responsible for getting a list of security attributes for the currently logged in user. In practice, this usually means returning a list of groups (or security roles) of which the user is a member.

Result Filter Plug-in

The Result Filter Plug-in implements “Query Time Authorization” (QTA). The filter is called on a “per document” basis during hitlist generation, to tell the query engine whether the current user is authorized to see that particular document. ResultFilter replaces QueryTimeFilter as the preferred method for implementing QTA in 10.1.8.


[Figure: Secure Enterprise Search plug-in architecture - how the crawler, identity and authorization plug-ins fit into the overall Secure Enterprise Search architecture.]

All plug-ins are written in the Java language. They must conform to the APIs provided by Oracle. They are implemented as JAR files which must be placed in the appropriate directory within the Secure Enterprise Search file hierarchy.

PART ONE - GENERAL CONSIDERATIONS FOR CONNECTOR WRITERS

The following is a list of items that should be considered before any detailed work is done on designing or building a plug-in or set of plug-ins.

Do I need an identity and/or authorization plug-in?

In order to implement security with my new data source, do I need to implement an identity plugin, or can I use an existing plug-in for a directory such as Active Directory or Oracle Internet Directory?

Or perhaps all the data I’m going to be indexing should be publicly available, in which case I don’t need an identity plugin at all.

Generally speaking, if your users log in via an already supported identity system in order to use your datasource, you will not need a new identity plugin.

If they do NOT use a supported system, but your company does use such a system for other purposes, then it may still not be necessary to develop a custom identity plugin.

For example, let’s assume your company uses Oracle Internet Directory for single sign-on purposes to many applications. But you want to develop a custom plugin to fetch data from another system – we’ll call it the Acme Business Content (ABC) system – which is NOT protected by OID or SSO. ABC has its own database of users and performs its own login checking.

We’ll further assume that the usernames on OID are the same as the usernames on ABC. Since we trust OID for all other


login purposes, there is no reason we shouldn’t trust it for ABC authentication too. So we effectively delegate responsibility for authentication to OID, and when we do searches we assume that the user – say “JohnSmith” – who authenticated with OID is the same “JohnSmith” who is allowed to see documents in ABC. SES is trusted by ABC to properly authenticate the users via OID.

If usernames in ABC are different from usernames in OID, our crawler plug-in will have to map the names appropriately when it crawls a document and obtains a list of ABC users who have access to a particular doc. By some mechanism it will have to convert this list of ABC users into a list of OID users.

One complication to this scenario is when the list of users is the same, but ABC has a concept of groups (or roles) which differs from those in OID. In this situation, we might need to use an authorization Query Filter (QF) plug-in. The QF plug-in would be responsible for checking which groups the current logged-on user is a member of. So at indexing time, we would store a list of ABC groups in the Access Control List for the document. Then, at query time, SES would ask the QF plug-in which groups the user is a member of – perhaps “SALES”, “MARKETING” and “AMERICAS”. The query would then be run with a search condition attached which said the equivalent of “ALLOWED_USER=username OR ALLOWED_GROUPS IN (“SALES”, “MARKETING”, “AMERICAS”)”

Finally, we have to decide whether we need a Result Filter (RF) plug-in. An RF plugin is needed in one of the following situations:

1. There is a requirement for secure searching, but no ACLs are available during indexing. All security must be applied on a row-by-row basis at the time of hitlist generation. Note this only makes sense when security granularity is high. If I have access to half the documents in the system, then on average SES has to examine 20 hits to give me a hitlist of 10 results. On the other hand, if I only have access to 1% of the documents, SES would have to examine 1000 hits for the same result.

2. ACLs are available at indexing time, but it is important to re-check the security information at query time, in case access has been removed from a document since it was last indexed.


What Parts Of My Data Do I Want To Index?

Given an application or datasource, there may be some parts that should be indexed by SES, and some parts which should not – either because the information is not appropriate or useful for indexing, or because it’s temporary or transient in nature.

If I’m indexing an email system, perhaps I don’t want to index information in “Trash” folders.

If I’m indexing an application, do I want to index the menu structure of the application in order to aid navigation to particular points in the application? Or am I only interested in indexing the actual information stored in the system?

Who Crawls the Data, and how does the Crawler Authenticate?

A secure crawler generally has to run in some sort of “superuser” mode. The crawler needs read access to all of the documents to be indexed. This can be achieved in one of two ways.

1. A username and password for a privileged user are provided as parameters to the crawler, from the admin screens. The crawler logs on to the data source using these credentials, and fetches information as that privileged user.

2. “Service to Service” authentication (S2S). The data source trusts SES and gives it unrestricted access to all information, knowing that SES will enforce security at query time. The SES instance authenticates itself (proves that it is the actual trusted SES instance) by means of an authentication key (or password) which is known to both systems, or to SES and an intermediate identity manager.

These two methods are essentially very similar – a password proves that the SES instance is authorized. The main difference is that in the first method, there is no need for the data source to “know” that it is being crawled by SES.

What Constitutes a Document?

This is perhaps one of the harder decisions to make. SES has the concept of a document, which is one “unit of indexing”. It doesn’t necessarily have to be a document in the traditional sense of a Word document or complete PowerPoint presentation. It might, for example, be all the collected


information about a single person from an HR system. When you do a search, a document is what is returned as one entry in the hitlist.

For some systems, what constitutes a document will be simple and obvious. For other systems it will be far less so.

Associated with this is the consideration of Display URLs. A Display URL is provided by a crawler plugin as a metadata item, along with the actual data to be indexed. A Display URL is the URL which appears in the hitlist for the user to click on, and will normally take the user directly to the source of the document information. Display URLs must be unique for each document.
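For a source like the hypothetical ABC system mentioned earlier, a natural way to guarantee uniqueness is to build the Display URL from the record’s primary key. The sketch below is illustrative only: the URL pattern, the record object and its getId() accessor are assumptions, while qs and the enqueue() call follow the usage shown in the Appendix A sample, where the string passed to QueueService.enqueue() becomes the Display URL returned later by DocumentMetadata.getDisplayURL().

    // Illustrative only: the URL pattern and record.getId() are hypothetical.
    // qs is the QueueService handle; crawlDepth is the current crawl depth.
    String displayUrl = "http://abc.mycompany.com/abc/viewDoc?id=" + record.getId();
    qs.enqueue( displayUrl, null, crawlDepth + 1 );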

The key to creating a “rich” document for better searching is metadata extraction. It is important that as much metadata (business object attributes) as possible is gathered from the application and mapped appropriately to SES metadata fields; a short illustrative sketch follows the list below. The core document metadata attributes used by SES are:

Title

Author

Description

Keywords

Display URL

Last Modified Date

Source Hierarchy
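As a concrete illustration, a plug-in might map its source’s business-object fields onto these attributes roughly as follows. This is only a sketch: the AbcRecord type and its getters are hypothetical stand-ins for your own source API, while the DocumentMetadata calls (addAttribute, setContentType) are those used by the sample crawler in Appendix A; setSourceHierarchy() is also available for the source hierarchy attribute.

    // Sketch only: AbcRecord and its getters are hypothetical.
    private void populateMetadata( DocumentMetadata dm, AbcRecord record )
    {
        dm.setContentType( "text/html" );                          // MIME type of the content to be submitted
        dm.addAttribute( "Title",            record.getTitle() );
        dm.addAttribute( "Author",           record.getOwner() );
        dm.addAttribute( "Description",      record.getSummary() );
        dm.addAttribute( "Keywords",         record.getTags() );
        dm.addAttribute( "LastModifiedDate", record.getModifiedDate() );
        // Attribute names not already known to SES simply become custom
        // search attributes for this source
        dm.addAttribute( "AbcDepartment",    record.getDepartment() );
    }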

How will I Know when Data has Changed?

Or won’t I? Crawlers normally perform “incremental crawls” which only provide documents which are new or have been modified since the last crawl. It is possible for a crawler to declare that it does not support incremental crawling, in which case it will perform a complete recrawl of all data each time it is invoked. Obviously, this should be avoided where possible for performance reasons.

Hence we need to know which documents are new or have changed since the last crawl. Depending on the source of the data, this information may be available by a simple lookup in some central directory, or it may require a “surface” crawl of all documents to check the date and time of last modification. Where this information is simply not available, it may be


necessary to do a complete crawl, and rely on the duplicate detection mechanisms within SES to ignore documents which are already indexed.
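Where the source can report a last-modified time, the check itself is trivial. The helper below is a minimal sketch: it assumes the plug-in manager has captured the last crawl time and forced-recrawl flag that SES passes to its init() method and has handed them to the plug-in; for other sources, substitute whatever change tracking the repository offers (audit tables, change logs, and so on).

    // Sketch only: lastCrawlTime and forceRecrawl are assumed to have been
    // captured by the plug-in manager and passed to the plug-in.
    private boolean needsRecrawl( java.io.File file, long lastCrawlTime, boolean forceRecrawl )
    {
        if( forceRecrawl )
            return true;                                 // "Process All Documents" was requested
        return file.lastModified() > lastCrawlTime;      // only re-submit files changed since the last crawl
    }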

How will I get Access Control Lists for my Documents?

When each document is passed back to SES, the crawler is responsible for creating an Access Control List (ACL). This is simply a list of all the users and/or groups who should have read access to that document. This will be stored by SES in the index, and used to establish which documents may be retrieved by any particular user as part of the search. A critical part of the design process is to figure out how this information is to be obtained.
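As a sketch of what this looks like in code, using the same SDK calls as the Appendix A sample but with illustrative principal names, the crawler builds a DocumentAcl and attaches it to the document’s metadata. The API summary in Part Two also lists addDenyPrincipal() for explicit denials.

    // Sketch only: the principal names are examples; "nickname" is the
    // authentication attribute used when SES is connected to Oracle Internet
    // Directory, and will vary with the identity directory in use.
    // gs is the GeneralService handle, dm the document's DocumentMetadata.
    DocumentAcl acl = gs.newDocumentAcl();
    acl.addPrincipal( "jsmith", "nickname", DocumentAcl.USER );   // a user who may read the document
    acl.addPrincipal( "SALES",  "nickname", DocumentAcl.GROUP );  // a group who may read the document
    dm.setACLInfo( acl );                                         // store the ACL in the document metadata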

PART TWO – BUILDING A CONNECTOR

Introduction

This section of the document will introduce you to the actual APIs needed to construct the various plug-ins.

Crawler Plug-In

A crawler plug-in is usually implemented as two (or more) classes in a package. These are:

A Crawler Plugin Manager

The Crawler Plugin itself

Any supporting classes needed

The Crawler Plugin Manager (“manager” from here on) is a class which implements the oracle.search.sdk.crawler.CrawlerPluginManager interface.

Interface CrawlerPluginManager

The purpose of the manager class is to describe the capabilities and requirements of the crawler plug-in itself, and to create an instance of that class. It should:


Describe the parameters needed to invoke the crawler (method getPluginParameters)

Describe the security model used by the crawler (none, source level or document level – method getDefaultAccessPolicy)

Provide administrative information such as version, plugin name and description (methods getPluginName, getPluginVersion, getPluginDescription, getBaseAPIVersion)

Create the document processing queue and seed it with starting values where appropriate (method init)

Instantiate and return to SES an instance of the crawler plugin itself (method getCrawlerPlugin)

Additionally, the init() method is called to start the manager. It receives the parameter values, some administrative information (a forced-recrawl flag, the time of the last crawl and the number of threads in use) and a GeneralService object, used for such things as logging information and messages.

cleanup() is called at the end of the crawl (either through completion or through forced termination).

getCrawlerPlugin() is passed a CrawlingThreadService object by SES. A method in this object will be used by the CrawlerPlugin to submit documents back to SES for indexing.

Interface CrawlerPlugin

This is the main crawler code. The manager creates an instance of this class and passes it to SES. SES then invokes the plugin and processes documents that are submitted in the form of DocumentContainer objects.

The crawl() method is called from SES and starts processing the queue. Typically this will have been seeded by the manager’s init() method with one or more starting values. In the case of a hierarchical crawl, the crawler will fetch values from this queue, process those queue entries then push new derived values back onto the queue.

Each document to be indexed is placed in a DocumentContainer object and submitted for indexing via the CrawlingThreadService method submitForProcessing(). Remember that the CrawlingThreadService object was passed to the manager when SES called getCrawlerPlugin().
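A minimal crawl() loop therefore looks something like the following sketch, condensed from the Appendix A sample (a complete version appears there). It assumes the same service handles as the sample (gs, qs and the stop flag); isContainer(), enqueueChildren() and submitDocument() stand for source-specific helpers and are hypothetical names.

    // Dequeue items seeded by the manager's init(), expand container items
    // back onto the queue, and submit leaf documents to SES.
    public void crawl() throws PluginException
    {
        while( !stop )
        {
            DocumentMetadata dm = qs.getNextItem();       // next queued item, or null when the queue is empty
            if( dm == null )
                break;

            DocumentContainer dc = gs.newDocumentContainer();
            dc.setMetadata( dm );

            if( isContainer( dm ) )
                enqueueChildren( dm );                    // push newly discovered URLs back onto the queue
            else
                submitDocument( dc );                     // set content and status, then submitForProcessing()
        }
    }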


Interface QueueService

The queue service is a utility provided to the plugin. The plugin is not required to use it, but it is generally very useful, especially when a hierarchy of documents is to be crawled, as it avoids the need to code recursion into the crawler itself. Functions include:

Get the complete list of previously-crawled documents. This is vital if the crawler is to have the ability to remove deleted documents and the source itself cannot provide such deletion information (method enqueueExistingURLs)

Get a list of previously failed documents – or more accurately, a list of documents returning a certain status code (method enqueueExistingURLs, with statusCode as an argument).

Check whether a document is already in the queue (method isEnqueued)

Enqueue a document (method enqueue)

Fetch the next document from the queue (method getNextItem)

Interface DocumentContainer

This is a container for a document, which will be submitted to SES for indexing. It includes the document content, document metadata (see DocumentMetadata) and status information. Key functions are listed below, with a short sketch following the list:

Setting the document metadata (method setMetaData)

Setting the document content. This may be set to a java.io.Reader stream, or to a java.io.InputStream. The latter is usually used for binary documents (method setDocument)

Setting the document status. The most commonly used status values are STATUS_OK_FOR_INDEX and STATUS_OK_BUT_NO_INDEX. The latter would be used, for example, when submitting a directory which we want to keep for future reference, but do not wish to index (method setDocumentStatus)
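Putting those calls together, submitting one document looks roughly like the sketch below. The calls are those listed above and used in the Appendix A sample; the string content is purely illustrative, and dm and cts are assumed to be the dequeued DocumentMetadata and the CrawlingThreadService handle.

    // Sketch only: for binary content, pass a java.io.InputStream to setDocument() instead of a Reader.
    DocumentContainer dc = gs.newDocumentContainer();
    dc.setMetadata( dm );                                           // metadata must be set before submission
    dc.setDocument( new java.io.StringReader( "document text to index" ) );
    dc.setDocumentStatus( DocumentContainer.STATUS_OK_FOR_INDEX );  // or STATUS_OK_BUT_NO_INDEX for a directory
    cts.submitForProcessing( dc );                                  // hand the document to SES for indexing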

Interface DocumentMetadata

Metadata is stored in a document container. Metadata for a document consists of a set of attributes. Some attributes are


specifically named in the various methods used (such as ContentType, CrawlDepth and DisplayURL); others will be provided as name/value pairs. There is no restriction on the attributes – a source can use attributes already known to SES (such as LastModifiedDate and Author) or it can create its own attributes. Key methods:

Set the Access Control Information for a document (method setACLInfo)

Set an arbitrary attribute (method setAttribute)

Set the content type of the document (method setContentType)

The following table summarizes the crawler plug-in API:

Interface Summary

CrawlerPlugin
Implemented by the plug-in writer. The crawl() method is the heart of the plug-in.
Key methods: crawl()

CrawlerPluginManager
Implemented by the plug-in writer. Responsible for plug-in registration and for materializing the plug-in instance to be used by the crawler.
Key methods: init(), getCrawlerPlugin(), getPluginParameters()

CrawlingThreadService
Entry point for submitting documents to the crawler.
Key methods: submitForProcessing(DocumentContainer target)

DataSourceService
An optional service used for managing the data source.
Key methods: delete(url), indexNow(), registerGlobalLOV()

DocumentAcl
Object for holding document access control principals. Save it to the DocumentMetadata object.
Key methods: addPrincipal(), addDenyPrincipal()

DocumentContainer
A holder for the document. Note that metadata and document status must be set in order to submit the document.
Key methods: setMetadata(DocumentMetadata), setDocument(InputStream), setDocument(Reader), setDocumentStatus()

DocumentMetadata
Object for storing document metadata and access control information.
Key methods: setACLInfo(DocumentAcl), setAttribute(), setContentType(), setSourceHierarchy()

GeneralService
Entry point for getting DataSourceService, QueueService, and LoggingService. Factory for creating DocumentAcl, DocumentMetadata, DocumentContainer, and LovInfo objects.

Logger
Logging interface to output messages to the crawler log file.
Key methods: error(), fatal(), info(), warn()

LovInfo
Object for holding a search attribute list of values (LOV).
Key methods: addAttributeValue(name, value)

ParameterValues
An interface for the plug-in to read the values of data source parameters.

QueueService
An optional service for storing pending document URLs.
Key methods: enqueue(), getNextItem()

Class Summary

ParameterInfo
A class for describing the general properties of a parameter. The plug-in manager returns a list of ParameterInfo objects through getPluginParameters().

Exception Summary

PluginException
An exception thrown by the plug-in to report an error. This will shut down the crawler if isFatalException() is true.

ProcessingException
An exception thrown by the crawler to the plug-in to indicate trouble processing the plug-in's request. If the error is fatal, the crawler will try to shut down; otherwise it is up to the plug-in whether to continue with the next document.

Identity Plug-In

An identity plugin is responsible for the following tasks:

Authenticating a user, given a username and password

Checking whether a user or group name is valid

Providing a list of supported attributes for users and groups

Providing a list of groups of which a user is a member

Like a crawler plugin, the identity plugin is provided as a package containing a manager class, the main identity plugin class, and any supporting classes.

Here is a summary of the API

Interface Summary

IdentityPlugin
Implemented by the plug-in writer.
Key methods: authenticate(), getAllGroups(), validateUser(), validateGroup()

IdentityPluginManager
Implemented by the plug-in writer. Responsible for plug-in registration and for materializing the plug-in instance.
Key methods: init(), getIdentityPlugin(), getPluginParameters()

Authorization Plug-In

An authorization plugin consists of:

An authorization plugin manager

(Optionally) A query filter plugin

(Optionally) A result filter plugin

The authorization plugin manager is much like the other plugin managers we’ve seen, but also has the responsibility for declaring whether there is a Query Filter plugin available, or a Result Filter plugin, or both.

Interface Summary

QueryFilterPlugin
Given the name of a security attribute, returns a list of values for that attribute. For example, given the name GROUP it might return all groups of which the currently logged-on user is a member (the username is passed to the manager in the method getQueryFilterPlugin()).
Key methods: getSecurityValues()

ResultFilterPlugin
Responsible for checking the currently logged-in user's access rights to a set of documents. This might be a selection of documents returned as a query result, or a group of documents found during a browse operation.
Key methods: filterDocuments(), filterBrowseFolders(), pruneSource()

AuthorizationPluginManager
Responsible for plug-in registration and for materializing the query filter and/or result filter plug-in instances.
Key methods: init(), getQueryFilterPlugin(), getPluginParameters()

APPENDIX A- SAMPLE CUSTOM CRAWLER

About the Sample Crawler

This is a simple example crawler which demonstrates the basic principles of a custom crawler. It crawls files on the file system of the machine where SES is installed. This functionality is already covered by the standard file crawler, but this example is intended to provide a basis for developing customers’ own data sources. For illustration purposes, it creates a hard-coded Author attribute and Access Control List for each document.


Building the Sample Crawler

To build this sample, you can use your favorite Java IDE, or use the command-line Java utilities provided in the SES instance, in %SES_HOME%\jdk\bin (assuming SES_HOME is set to your SES installation directory). Create a directory %SES_HOME%\search\lib\plugins\app\crawler\simplefile and copy the two .java files into it. Ensure that %SES_HOME%\jdk\bin is in your path, then create a .bat file (Windows) or shell script (Linux/Unix) in the search\lib\plugins directory containing the following commands:

javac -classpath ..\search.jar app\crawler\simplefile\*.java

jar cf simplefilecrawler.jar app\crawler\simplefile\*.class
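On Linux or Unix, the equivalent shell commands would use forward slashes; this variant is offered as a convenience and assumes the same relative layout, with the script run from the search/lib/plugins directory:

javac -classpath ../search.jar app/crawler/simplefile/*.java

jar cf simplefilecrawler.jar app/crawler/simplefile/*.class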

Run this file to create the jar file. Then go to the SES admin pages, under Global Settings -> Source Types. Hit create, and enter the details as follows:

Name: Simple File Crawler

Description: Sample crawler for local files

Class Name: app.crawler.simplefile.SimpleFileCrawlerPluginManager

Jar File Name: simplefilecrawler.jar

Hit “Next”, then “Finish”. You will now be able to go to Home -> Sources and create a new source of type “Simple File Crawler”


Note: The crawler code is set up as a secure crawler, and assumes the following:

1. Your SES instance is connected to an OID directory

2. You are using “nickname” as the authentication attribute.

3. There is a user called “test001”

If these assumptions do not hold true, change the code in getDefaultAccessPolicy() in SimpleFileCrawlerPluginManager.java according to the comments, and/or comment out the two lines in processItem() in SimpleFileCrawlerPlugin.java.
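The two lines in question are the ACL lines in processItem(), shown here as they appear in the source; commenting them out causes documents to be submitted without a document-level ACL:

    // In SimpleFileCrawlerPlugin.processItem() - comment out for public documents:
    theAcl.addPrincipal( "test001", "nickname", DocumentAcl.USER );
    dm.setACLInfo( theAcl );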

AS RELEASED, ONLY THE USER test001 WILL HAVE ACCESS TO CRAWLED DOCUMENTS.

Debugging

The first step in debugging crawler code is to edit the file %SES_HOME%\search\data\crawler.dat and change the logLevel to 2 (second to last line in the file). This will ensure


that extra diagnostic output is produced in the crawler log file, and that logger.debug() calls from your own code will produce output. It is not normally easy to run crawler code from an IDE debugger, so you will normally have to rely on such output for debugging your code.

Remember when running a crawler repeatedly to make sure that your Schedule has been edited to set the recrawl policy to “Process All Documents”. You may also want to set the Source Crawling Parameters to have “Number of Crawler Threads = 1” as this often makes the crawler flow and output easier to understand.


Source code for SimpleFileCrawlerPlugin.java

package app.crawler.simplefile;

import java.io.File;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.StringReader;
import java.io.StringWriter;

import java.net.MalformedURLException;
import java.net.URL;

import oracle.search.sdk.crawler.CrawlerPlugin;
import oracle.search.sdk.crawler.CrawlingThreadService;
import oracle.search.sdk.crawler.DocumentContainer;
import oracle.search.sdk.crawler.DocumentMetadata;
import oracle.search.sdk.crawler.DocumentAcl;
import oracle.search.sdk.crawler.GeneralService;
import oracle.search.sdk.crawler.Logger;
import oracle.search.sdk.crawler.PluginException;
import oracle.search.sdk.crawler.ProcessingException;
import oracle.search.sdk.crawler.QueueService;

/**
 * An implementation of <code>CrawlerPlugin</code> that crawls a file system.
 * See the documentation for <code>SimpleFileCrawlerPluginManager</code> for general
 * documentation on this crawler's functionality.
 * <p/>
 * The crawler is given a list of FILE protocol seed URLs for the starting
 * directories. It will then recurse from here fetching all files.
 * <p/>
 * If the filenames contain spaces, a space character must be used in the seed rather
 * than the <code>%20</code> which might be expected.
 * <p/>
 *
 * @author roger.ford
 * @version 1.0, 12/05/06
 * @see oracle.search.sdk.crawler.CrawlerPlugin
 * @since 1.0
 */
public class SimpleFileCrawlerPlugin implements CrawlerPlugin
{
    /** Service handle to access services */
    private GeneralService gs;

    /** Service handle for crawl related tasks */
    private CrawlingThreadService cts;

    /** Service handle for the crawler queue */
    private QueueService qs;

    /** Service handle for the logger */
    private Logger logger;

    /** Has a stop been requested? */
    private boolean stop;

    /** List of file suffixes to ignore */
    private String[] ignoreList;

    /** Constructor */
    public SimpleFileCrawlerPlugin( GeneralService gs, CrawlingThreadService cts, String[] ignoreList )
    {
        this.gs = gs;
        this.cts = cts;
        this.qs = gs.getQueueService();
        this.logger = gs.getLoggingService();
        this.ignoreList = ignoreList;
    }

    /**
     * Process documents.
     *
     * @throws oracle.search.sdk.crawler.PluginException
     */
    public void crawl() throws PluginException
    {
        while( !stop )  // loop until stopCrawling has been called or we run out
        {
            try
            {
                // get the next item to process from the queue, or break if no more items
                DocumentMetadata dm = qs.getNextItem();
                if( dm == null )
                    break;  // nothing left to do

                String displayUrl = dm.getDisplayURL();
                DocumentContainer dc = gs.newDocumentContainer();
                dc.setMetadata( dm );
                logger.debug( "About to process " + displayUrl );

                // create URL object for accessing resource
                URL url = new URL( displayUrl );

                // if it's a directory, process as a collection of resources
                logger.info( "*** file name " + url.getPath() );
                File file = new File( url.getFile() );
                if( file.isDirectory() )
                {
                    processCollection( file, dc );
                }
                else
                {
                    // check ignore list
                    if( ! inExclusionList( url, ignoreList ) )
                    {
                        processItem( url, dc );
                    }
                    else
                    {
                        logger.info( "File " + url.getPath() + " excluded by suffix exclusion list" );
                    }
                }
            }
            catch( Exception e )
            {
                logger.error( e );
            }
        }
    }

    /**
     * Process a collection of documents (i.e., a directory).
     * All the files and directories found are pushed onto the queue without further checking.
     *
     * @param parent directory, as a <code>File</code>
     * @param dc     directory, as a <code>DocumentContainer</code>
     */
    private void processCollection( File parent, DocumentContainer dc )
    {
        try
        {
            // Get the directory contents
            String[] children = parent.list();
            for( int i = 0; i < children.length; i++ )
            {
                // enqueue all the child resources
                File f = new File( parent, children[i] );
                int depth = dc.getMetadata().getCrawlDepth() + 1;
                qs.enqueue( fileToURL( f ).toExternalForm(), null, depth );
            }
        }
        catch( Exception e )
        {
            logStackTrace( e );
            submit( dc, DocumentContainer.STATUS_SERVER_ERROR );
        }
    }

    /**
     * Process an individual document.
     *
     * @param item document, as a <code>URL</code>
     * @param dc   document, as a <code>DocumentContainer</code>
     */
    private void processItem( URL item, DocumentContainer dc )
    {
        try
        {
            DocumentMetadata dm = dc.getMetadata();

            // Do any necessary metadata processing - here's some static examples
            dm.addAttribute( "Author", "Anne Author" );

            DocumentAcl theAcl = gs.newDocumentAcl();

            // Create a username entry in the ACL. The first argument is the username.
            // The second will vary according to the directory in use.
            // The third is either .USER, .GROUP, .OWNER or .UNKNOWN.

            // THIS WILL ONLY WORK IN THIS FORM IF YOUR SES INSTANCE IS CONNECTED TO OID
            // AND HAS A USERNAME test001

            // comment out these two lines for public documents
            theAcl.addPrincipal( "test001", "nickname", DocumentAcl.USER );
            dm.setACLInfo( theAcl );

            logger.info( "Submitting document " + item.toExternalForm() );
            InputStream stream = null;
            try
            {
                stream = item.openStream();
                dc.setDocument( stream );
                submit( dc, DocumentContainer.STATUS_OK_FOR_INDEX );
            }
            catch( Exception e )
            {
                logStackTrace( e );
                submit( dc, DocumentContainer.STATUS_SERVER_ERROR );
            }
            finally
            {
                if( stream != null )
                {
                    try
                    {
                        stream.close();
                    }
                    catch( Exception e )
                    {
                        logger.error( e );
                    }
                }
            }
        }
        catch( Exception e )
        {
            logStackTrace( e );
            submit( dc, DocumentContainer.STATUS_SERVER_ERROR );
        }
    }

    /**
     * Submit a document for processing, using the specified status code.
     *
     * @param dc     document, as a <code>DocumentContainer</code>
     * @param status document processing status code
     */
    private void submit( DocumentContainer dc, int status )
    {
        try
        {
            dc.setDocumentStatus( status );
            cts.submitForProcessing( dc );
        }
        // Catch a document filtering error, since we don't want this to stop the crawl
        catch( ProcessingException e )
        {
            if( e.getMessage().indexOf( "EQG-30065" ) >= 0 )
            {
                try
                {
                    dc.setDocumentStatus( DocumentContainer.STATUS_FILTER_ERROR );
                    dc.setDocument( (InputStream) null );
                    cts.submitForProcessing( dc );
                }
                catch( ProcessingException pe )
                {
                    logStackTrace( pe );
                }
            }
            else
            {
                logStackTrace( e );
            }
        }
    }

    /**
     * Initialize variables prior to starting a crawl.
     *
     * @throws oracle.search.sdk.crawler.PluginException
     */
    public void init()
    {
        stop = false;
    }

    /**
     * Request the crawler to stop. Crawling will stop once processing completes
     * on the active document at the time of the stop request.
     *
     * @throws oracle.search.sdk.crawler.PluginException
     */
    public void stopCrawling()
    {
        stop = true;
    }

    /**
     * Log the stack trace of a thrown exception.
     *
     * @param t <code>Throwable</code> to be logged
     */
    private void logStackTrace( Throwable t )
    {
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter( sw );
        t.printStackTrace( pw );
        logger.error( sw.getBuffer().toString() );
    }

    /**
     * Check whether a URL's file suffix appears in the exclusion list.
     *
     * @param url        URL of the file
     * @param suffixList List of suffixes to disallow
     * @return true or false
     */
    private boolean inExclusionList( URL url, String[] suffixList )
    {
        for( int i = 0; i < suffixList.length; i++ )
        {
            String path = url.getPath();
            logger.info( "check suffix >" + suffixList[i] + "< against path >"
                         + path.substring( path.lastIndexOf( "." ) + 1 ) + "<" );
            if( suffixList[i].equals( path.substring( path.lastIndexOf( "." ) + 1 ) ) )
            {
                logger.info( "got a match" );
                return true;
            }
        }
        return false;
    }

    /**
     * Convert a file reference to a <code>URL</code>, normalizing for operating
     * system specific representations, such as the path separator.
     *
     * @param file input file reference
     * @return URL version of a file reference
     * @throws java.net.MalformedURLException
     */
    private URL fileToURL( File file ) throws MalformedURLException
    {
        String path = file.getAbsolutePath();
        String sep = System.getProperty( "file.separator" );
        if( ( sep != null ) && ( sep.length() == 1 ) )
            path = path.replace( sep.charAt( 0 ), '/' );
        if( ( path.length() > 0 ) && ( path.charAt( 0 ) != '/' ) )
            path = '/' + path;
        return new URL( "file", null, path );
    }
}


Building Custom Crawlers for Oracle Secure Enterprise Search
January 2007
Author: Roger Ford
Contributing Authors: Meeten Bhavsar, Thomas Chang, Muralidhar Krishnaprasad

Oracle Corporation
World Headquarters
500 Oracle Parkway
Redwood Shores, CA 94065
U.S.A.

Worldwide Inquiries:
Phone: +1.650.506.7000
Fax: +1.650.506.7200
www.oracle.com

This Document Is For Informational Purposes Only And May Not Be Incorporated Into A Contract or Agreement.

Oracle is a registered trademark of Oracle Corporation. Various product and service names referenced herein may be trademarks of Oracle Corporation. All other product and service names mentioned may be trademarks of their respective owners.

Copyright © 2007 Oracle Corporation
All rights reserved.