Using the FAST Search for SharePoint Pipeline to Improve Search

Improving search using the pipeline in FAST Search for SharePoint
SurfRay.com, ideaeng.com
Miles Kehoe, author of Professional Microsoft Search
www.enterprisesearchblog.com, @miles_kehoe, mileskehoe

Upload: surfray

Post on 02-Jul-2015

DESCRIPTION

Miles Kehoe, Enterprise Search Guru, presents FAST Search for SharePoint and ways to improve search in your organization by utilizing the document processing pipeline in FAST Search for SharePoint.

TRANSCRIPT

Page 1: Using the Fast Search for SharePoint Pipeline to Improve Search

Improving search using the pipeline in FAST Search for SharePoint

SurfRay.com
ideaeng.com

Miles Kehoe
Author of: Professional Microsoft Search
www.enterprisesearchblog.com
@miles_kehoe
mileskehoe

Page 2: Using the Fast Search for SharePoint Pipeline to Improve Search

Agenda

• Introductions

• When FS4SP makes sense

• What is the FS4SP indexing pipeline?

• Why is it important to you?

• How do you use it?

• Wrap Up

Page 3: Using the Fast Search for SharePoint Pipeline to Improve Search

About Me

• Founder of New Idea Engineering Inc.

• Working with enterprise search since 1989

• Co-author of Professional Microsoft Search (Wrox)

• Author of several blogs:

- Enterprisesearchblog.com

- SearchComponentsOnline.com

• Search nerd

Page 4: Using the Fast Search for SharePoint Pipeline to Improve Search

When to use FS4SP

Large datasets

• SharePoint Search indexes up to about 100M documents

• FS4SP is virtually unlimited (650M in tests)

• Rows and Columns concept

Need to fine-tune index & search

• Pipeline

• Need custom relevance profiles

• Need to fine-tune queries for relevance

Page 5: Using the Fast Search for SharePoint Pipeline to Improve Search

What is the FS4SP indexing pipeline?

Standard sequence of ‘stages’ from crawl to index

• Format conversion & language detection

• Lemmatization / Stemming

• Entity extraction

• Map crawled properties to managed properties

Unique to FAST: the ability to insert custom processing

• ‘Must’ be just before mapper

• C# is supported, but any code using STDIN/STDOUT is OK

• Time critical!

A great way to fix up messy data!

Page 6: Using the Fast Search for SharePoint Pipeline to Improve Search

Pipeline Architecture

[Diagram: FS4SP Pipeline. Components: Data Sources, Crawler, Content Processor, Indexer, Query Processor, User Queries. Index flow through the Content Processor stages: Format Conversion, Language Detection, Entity Extraction, Lemmatization, …, Custom Extensibility, Mapper.]

Page 7: Using the Fast Search for SharePoint Pipeline to Improve Search

Why is the pipeline important to you?

Sometimes content IS messy:

• URLs with abbreviations

• Additional metadata is in external sources

• Geo-tag documents

Diagnose problems in the indexing process:

• Identify bad or missing metadata

Page 8: Using the Fast Search for SharePoint Pipeline to Improve Search

Examples where the pipeline can save you

Cryptic URLs

• With URLs like www.myco.com/mkt/prodmgmt/products.aspx

• I can add specific metadata to the document

‘marketing’ (because of ‘mkt’) and ‘product management’ (because of ‘prodmgmt’)
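The cryptic-URL case can be sketched in a few lines. This is only an illustration of the idea, not code from the talk: the abbreviation table and the function name `tags_from_url` are assumptions, and a real stage would maintain its own table.

```python
from urllib.parse import urlparse

# Hypothetical abbreviation table; a real deployment would maintain its own.
ABBREVIATIONS = {
    "mkt": "marketing",
    "prodmgmt": "product management",
}


def tags_from_url(url):
    """Return readable tags for known abbreviations found in the URL path."""
    # urlparse only separates host from path when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    segments = urlparse(url).path.lower().split("/")
    return [ABBREVIATIONS[s] for s in segments if s in ABBREVIATIONS]


print(tags_from_url("www.myco.com/mkt/prodmgmt/products.aspx"))
# ['marketing', 'product management']
```

A dictionary lookup like this costs almost nothing per document, which matters because the pipeline is performance-critical.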

Adding valuable metadata:

• When I find a user name in a document I can look up and return the phone number and email

• When I find a city name I can geo-tag with latitude and longitude
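Geo-tagging can be sketched the same way. The gazetteer below is a hypothetical in-memory table with approximate, illustrative coordinates, and `geo_tags` is an assumed name. A production stage would load a fuller table or query an external source.

```python
import re

# Hypothetical gazetteer; coordinates are approximate and for illustration only.
GAZETTEER = {
    "san francisco": (37.77, -122.42),
    "copenhagen": (55.68, 12.57),
}


def geo_tags(text):
    """Return {city: (latitude, longitude)} for gazetteer cities named in text."""
    lowered = text.lower()
    found = {}
    for city, coords in GAZETTEER.items():
        # Word boundaries avoid matching a city name inside a longer word.
        if re.search(r"\b" + re.escape(city) + r"\b", lowered):
            found[city] = coords
    return found


print(geo_tags("Our Copenhagen office opened in 2010."))
# {'copenhagen': (55.68, 12.57)}
```

The user-name lookup for phone number and email would follow the same pattern, with a directory in place of the gazetteer.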

Debugging the indexing process

• When things are not as they seem I can diagnose problems in the indexing process

Page 9: Using the Fast Search for SharePoint Pipeline to Improve Search

How do you use the pipeline?

Pipeline configuration files in \FASTSearch\etc

• PipelineConfig.xml

• PipelineExtensibility.xml

For each Document Processor node:

• Create an entry for a new ‘processor’

• Add your new processor name to the <pipelines> node

• Restart the ‘FAST processor server’ from CMD: psctrl reset

• Submit a single known test document

• Check your results

Page 10: Using the Fast Search for SharePoint Pipeline to Improve Search

Config Files

Page 11: Using the Fast Search for SharePoint Pipeline to Improve Search

Adding a Processor Stage

On each FAST document processor node:

• Edit %FASTSEARCH%\etc\pipelineconfig.xml

    <processor name="Spy1" type="general" hidden="0">
      <load module="processors.Spy" class="Spy"/>
      <config>
        <param name="SpyDumpFile" value="var/log/spy.txt" type="str"/>
        <param name="FileStringCutOffLen" value="32768" type="int"/>
      </config>
      <inputs></inputs>
    </processor>

• In the ‘Document Conversion’ section, add the new pipeline stage to run (in the Office 14 pipeline):

    <processor name="Spy1" />

• Reset (each) document processor node:

    psctrl reset

Page 12: Using the Fast Search for SharePoint Pipeline to Improve Search

FS4SP Pipeline Extensibility

Page 13: Using the Fast Search for SharePoint Pipeline to Improve Search

How do you create a custom stage?

Edit file %FASTSEARCH%\etc\pipelineconfig.xml as above

Edit file %FASTSEARCH%\etc\PipelineExtensibility.xml

    <PipelineExtensibility>
      <Run command="YourCode.EXE %(input)s %(output)s">
        <Input>
          <CrawledProperty propertyName="author" propertySet="GUID" varType="31" />
        </Input>
        <Output>
          <CrawledProperty propertyName="mytags" propertySet="GUID" varType="31"/>
          <CrawledProperty propertyName="phone" propertySet="GUID" varType="31"/>
        </Output>
      </Run>
    </PipelineExtensibility>

Restart content servers from the command-line prompt:

    psctrl reset
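A minimal sketch of what the command behind `%(input)s %(output)s` might look like follows, written in Python for brevity even though only .NET languages are officially supported. It rests on several assumptions: that the input file is a `<Document>` element holding the `<CrawledProperty>` elements declared under `<Input>`, and that the output file uses the same shape. The `PHONE_BOOK` table, the `process` function, and the phone number are hypothetical. `GUID` is the placeholder from the slide and must be replaced with a real property set.

```python
import sys
import xml.etree.ElementTree as ET

PROPERTY_SET = "GUID"  # placeholder from the slide; substitute the real property set

# Hypothetical directory standing in for an external lookup.
PHONE_BOOK = {"miles kehoe": "555-0100"}


def process(document):
    """Build the output <Document> from the crawled properties in document."""
    author = ""
    for prop in document.findall("CrawledProperty"):
        if prop.get("propertyName") == "author":
            author = (prop.text or "").strip()

    output = ET.Element("Document")
    phone = PHONE_BOOK.get(author.lower())
    if phone:
        # Emit only properties declared under <Output> in the configuration.
        element = ET.SubElement(
            output, "CrawledProperty",
            propertySet=PROPERTY_SET, propertyName="phone", varType="31",
        )
        element.text = phone
    return output


def main(argv):
    # The Run command supplies two file paths: %(input)s and %(output)s.
    input_path, output_path = argv[1], argv[2]
    result = process(ET.parse(input_path).getroot())
    ET.ElementTree(result).write(output_path, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

Keeping the transformation in `process`, separate from the file handling in `main`, makes it easy to test the logic on a single known document before touching the pipeline.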

Page 14: Using the Fast Search for SharePoint Pipeline to Improve Search

Pipeline is performance-critical

Pipeline runs in ‘sandbox’ environment

• NOT the same type of ‘sandbox’ in O365

• File I/O only allowed in C:\users\<fast service user>\AppData\LocalLow

• Maximum of 10 seconds to live

• Permissions restricted regardless of FAST Service user permissions

• Each Document Processor (DP) is an individual instance

• Only one item passes through a DP at a time

• If each document takes 1 second then 10 DPs can process at best 10 docs/sec

• Consider 1 sec for each of 100K docs: about 3 hours across 10 DPs (about 28 hours on a single DP)!
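The arithmetic behind these figures can be checked with a short calculation. It assumes documents spread evenly across document processors and ignores every other stage, so it is a best case. The function name `indexing_hours` is an assumption for illustration.

```python
def indexing_hours(doc_count, seconds_per_doc, processors):
    """Best-case wall-clock hours, assuming an even spread across DPs."""
    total_seconds = doc_count * seconds_per_doc / processors
    return total_seconds / 3600


print(round(indexing_hours(100_000, 1.0, 10), 1))  # 2.8 hours with 10 DPs
print(round(indexing_hours(100_000, 1.0, 1), 1))   # 27.8 hours with a single DP
```

Cutting a custom stage from 1 second to 0.1 second per document shortens the same crawl by a factor of ten, which is why per-document cost deserves attention before deployment.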

Page 15: Using the Fast Search for SharePoint Pipeline to Improve Search

Pipeline Hints

MS only supports:

• Single custom stage (in PipelineConfig.xml)

• .NET languages (C#, etc)

But:

• A custom stage can appear in multiple places in PipelineConfig.xml, even with different parameters

• Theoretically any executable that handles STDIN/STDOUT will do

• VC#/VC++/VBScript/CMD files seem to work

• Web service calls are supported

Page 16: Using the Fast Search for SharePoint Pipeline to Improve Search

Using web services in Sandbox

[Diagram: several pipeline stages exchanging XML with a web service; labels: Web Service, Stage (×4), XML, XML Config.]
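Because a stage has at most 10 seconds to live, a web service call from a stage should carry its own, much shorter, timeout and fail soft. The following is a sketch under assumptions, not the talk's implementation: the service URL, its `name` query parameter, the JSON response, and the `lookup` function are all hypothetical.

```python
import json
import urllib.parse
import urllib.request


def lookup(service_url, name, timeout_seconds=2.0, opener=urllib.request.urlopen):
    """Query a hypothetical JSON web service for name; return {} on any failure."""
    url = service_url + "?name=" + urllib.parse.quote(name)
    try:
        with opener(url, timeout=timeout_seconds) as response:
            return json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # Network errors and timeouts are OSError subclasses;
        # malformed JSON or bad encoding raises a ValueError subclass.
        return {}
```

Returning an empty result instead of raising lets the document continue through the pipeline without the extra metadata, rather than failing the stage. The `opener` parameter exists only so the function can be exercised without a live service.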

Page 17: Using the Fast Search for SharePoint Pipeline to Improve Search

Ontolica FAST Management

Ontolica FAST Management provides clear, easy-to-use configuration directly from within the SharePoint admin GUI. Replace XML configuration files, manual file deployments, and tricky PowerShell configuration with easy management consoles.

Key Features:

• Backup, Manage, & Deploy Configurations

• Manage FAST Relevance Profiles

• Upload & Manage Pipeline Extensions

• Create & Manage JDBC Connections

• FAST Webcrawler Configuration

• Manage FAST Server Processes from Central Admin

Page 18: Using the Fast Search for SharePoint Pipeline to Improve Search

Additional Resources

• This slide deck live at http://slidesha.re/sCGAaP

• SP2010 ES/FS4SP Blog (Eric Belisle) - http://fs4sp.blogspot.com/

• Enterprise Search Blog (NIE) - http://www.enterprisesearchblog.com/

• Search Unleashed (Len Ocsouza) - http://searchunleashed.wordpress.com/

• ESW Blog - http://www.enterprisesearchwiki.com/wp/

• TechNet/MSDN/Microsoft

• And of course: SurfRay.com (Robert Piddocke & Josh Noble)