oslo enterprise meetup may 12th 2010 - jan høydahl

DESCRIPTION

Presentation held at Oslo Enterprise MeetUp in May, pitched towards an audience who come from the FAST ESP side and have some existing FAST knowledge. Check out one of my other presentations if you're most familiar with Lucene/Solr.

TRANSCRIPT

Page 1: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

cominvent as

Migrating FAST to Solr

By Jan Høydahl

Oslo Enterprise Search MeetUp May 2010

Enterprise Search Specialists

Page 2: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Jan Høydahl

● IT architect - search, telecom, mobile

● Helped build FAST's Global Services as first engineer

● Founder of Cominvent AS

● Search consultant 10 years

Page 3: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Page 4: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Consulting

– Cominvent delivers independent search consulting
– Focus on Apache Lucene/Solr & Microsoft FAST ESP

Idea –> architecture –> implementation

Page 5: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Commercial Support (Solr/Lucene)

– When community & mailing list support is not enough...
– Paid support agreement for Apache Solr/Lucene
– In cooperation with Lucid Imagination

– Read more: http://www.cominvent.com/support/

Page 6: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Training

– Cominvent AS delivers public and on-site training
– Certified Solr Training Partner for Lucid Imagination
– Certified FAST ESP Training Partner

– Read more: http://www.cominvent.com/training/

Photo: fluidpowerzone.com

Page 7: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Solr course

Page 8: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Page 9: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

FAST & Solr are very similar...

Page 10: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Areas of usage

Page 11: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Common features

Page 12: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Common features

Page 13: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Introduction to...

...for FAST people

Page 14: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Apache Solr - characteristics

(Commercially friendly)

Search server

Page 15: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Apache Solr - characteristics

Modular

Community

Lightweight

Contributions & patches

Page 16: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Solr-user community growth

[Chart: Solr-user growth. X-axis: Month (2006 Jan to 2010 Feb); y-axis: Messages (0 to 1600)]

Page 17: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Lucene/Solr deployments

– More: http://wiki.apache.org/solr/PublicServers

Thanks to Lucid Imagination for logo collection

Page 18: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

XML/HTTP

Page 19: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Solr Architecture

Page 20: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

The Apache Software Foundation

Page 21: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Other ASF Lucene sub-projects

– Lucene Java library

– Rich document extraction

– Crawling web pages

– Machine learning
• Classification/clustering
• Collaborative filtering...

Page 22: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Introduction to...

...for Solr people

Page 23: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

FAST ESP – characteristics & key strengths

Connectors

Security

Page 24: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

FAST ESP – characteristics & key strengths

Page 25: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

FAST ESP – characteristics & key strengths

[Diagram labels: Format conversion, Language detection, Entities, Linguistic normalization, Ontology, Custom plug-in, Alert, Search, Taxonomy, Sentiment]

PARIS (Reuters) - Venus Williams raced into the second round of the $11.25 million French Open Monday, brushing aside Bianka Lamade, 6-3, 6-3, in 65 minutes. The Wimbledon and U.S. Open champion, seeded second, breezed past the German on a blustery center court to become the first seed to advance at Roland Garros. "I love being here, I love the French Open and more than anything I'd love to do well here," the American said. A first round loser last year, Williams is hoping to progress beyond the quarter-finals for the first time in her career.

– Very strong document processing framework

Page 26: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

FAST ESP architecture

Page 27: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

The migration...

Page 28: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Migration objectives

– Possible objectives include:
• Lower maintenance cost
• Deeper in-house competency
• Less dependent on external consultants
• Ownership and visibility of source code
• Shorter time to market for new features
• Bugs fixed faster – or even fix ourselves
• Larger community, mailing lists that work!
• More choice in external consultants
• Contribute back to Open Source
• Lower HW footprint

Page 29: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Migration steps

– Knowledge gathering & Training
– Review current features & arch
• Want to keep all features? Add new?
– Migration areas:
• Index profile
• Content
• Feeding
• Document Processing
• Querying
• Search middleware?
• Admin & Operational

– What to do in Application space vs Search space?

Page 30: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Feature comparison ESP – Solr (similarities)

Feature | ESP | Solr
Full-text, boolean, range search, sorting, sub-second, facets, did-you-mean, synonyms | Yes | Yes
Scaling for QPS | Add rows | Add rows
Scaling for document volume | Add columns | Add shards
Synonyms | Index/query side | Index/query side
GEO search | Yes | Yes (1.5)
Boolean query language | Yes (FQL) | Yes (Lucene or (e)DisMax)
APIs | HTTP, Java, .NET, C++, PHP | HTTP, Java, .NET, Ruby, Python, PHP, Perl, JS

Page 31: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Feature comparison ESP – Solr (differences)

Feature | ESP | Solr
Admin server | Yes | No (coming 1.5)
Processes | Many (C++, Java, Python) | One WAR in Java app-server, 100% Java
Navigators / Facets | Index-time | Query-time
Did-you-mean | Dictionary based | Dictionary or index based
Feeding | API only | HTTP POST or API
Document processing | Pipeline (py) | Simple pipeline (Java, JS, Groovy, Jython, JRuby..)
Multi field querying | Composite fields | DisMax handler

Page 32: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Feature comparison ESP – Solr (differences)

Feature | ESP | Solr
Relevancy tuning | Rank profiles, term boosting | Dynamic function queries and boost functions
XRANK | XRANK operator | Function Queries
Freshness boost | Freshness in rank profile | Function Queries
Boost GEO distance | Rank profile and special | Function Queries
Major schema or software updates | Cold update, use stage environment | Stage new content into new Solr core
Pluggability | Docprocs, QT/RP (limited), clients | Everything :) Request Handlers, Query Parsers, Docprocs, Rank, Spell, tokenizer++

Page 33: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Feature comparison ESP – Solr (differences)

Feature | ESP | Solr
Lemmatization | Can be licensed for many languages | Can be licensed from 3rd party
Query syntax (boolean) | and(a:foo, b:bar) | a:foo AND b:bar
Query syntax (integer range) | i:range(0, 100) | i:[0 TO 100]
Query syntax (date range) | d:range(2000-01-01T00:00:00, 2010-03-03T12:00:00) | d:[2000-01-01T00:00:00Z TO NOW]
Query params | query= offset= hits= spell=1 | q= start= rows= spellcheck=true
What fields to return | view=viewname | fl=title,price,body...
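The request parameter row of the comparison can be turned into a small translation helper. This is only a sketch: the helper name and the choice to silently drop unknown parameters are assumptions, and the query string itself is passed through untouched, so FQL expressions would still need to be rewritten into Lucene or DisMax syntax separately.

```ruby
require 'uri'

# Name mapping taken from the comparison table:
# ESP query= / offset= / hits=  ->  Solr q= / start= / rows=
ESP_TO_SOLR = {
  'query'  => 'q',
  'offset' => 'start',
  'hits'   => 'rows'
}.freeze

# Translate a hash of ESP request parameters into Solr request parameters.
# Hypothetical helper: parameters without a known counterpart are dropped.
def esp_to_solr_params(esp_params)
  solr = {}
  esp_params.each do |name, value|
    if name == 'spell'
      # ESP spell=1 corresponds to Solr spellcheck=true
      solr['spellcheck'] = (value.to_s == '1').to_s
    elsif ESP_TO_SOLR.key?(name)
      solr[ESP_TO_SOLR[name]] = value.to_s
    end
  end
  solr
end

params = esp_to_solr_params('query' => 'oslo', 'offset' => 10, 'hits' => 20, 'spell' => 1)
puts URI.encode_www_form(params)
```

A helper like this fits naturally in the application or middleware layer, so front ends can keep sending their old parameter names during the migration.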

Page 34: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Feature comparison ESP – Solr (differences)

Feature | ESP | Solr
Search XML hierarchy | Yes, scope search | No
Reports | Built in analytics | Use 3rd party log analysis such as Splunk.com

Page 35: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Your existing FAST system - overview

Search middleware?

Your web-app

Graphics diagram: www.microsoft.com

Page 36: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Migrating index profile

– ESP index profile -> Solr schema.xml
– Set up field types, use defaults or create your own
– Set up the static fields. ESP:

– Solr equivalent:

– No need for generic*, use dynamic fields:

Page 37: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Migrating index profile

– Composite fields?
• Solr can use <copyField> to copy multiple fields into one, e.g. as we did to map many attributes into one field
• However, to rank with a different boost for each field, Solr does not need a composite field. Use the DisMax query handler instead. Very powerful!
– No need to edit the schema to add new fields. Using dynamic fields, it is easy to e.g. introduce a color facet for cars or an Mpixels facet for digital cameras
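On the feeding side, dynamic fields could be used roughly as sketched here. The suffix convention (_s, _i, _f) and both helper names are assumptions for illustration; they only work if schema.xml declares matching dynamicField patterns such as *_s, *_i and *_f.

```ruby
# Hypothetical naming convention: pick a dynamic field suffix from the Ruby type.
# Requires matching <dynamicField> patterns (*_s, *_i, *_f) in schema.xml.
def dynamic_field_name(name, value)
  suffix = case value
           when Integer then 'i'
           when Float   then 'f'
           else              's'
           end
  "#{name}_#{suffix}"
end

# Build a Solr document hash: static fields are kept as they are, free-form
# attributes are renamed to dynamic field names.
def to_solr_doc(static_fields, attributes)
  doc = static_fields.dup
  attributes.each { |name, value| doc[dynamic_field_name(name, value)] = value }
  doc
end

car    = to_solr_doc({ 'id' => 'car-1', 'title' => 'Family car' }, 'color' => 'red')
camera = to_solr_doc({ 'id' => 'cam-1', 'title' => 'Compact camera' }, 'mpixels' => 12.1)
p car
p camera
```

New attributes then show up as new facetable fields without touching the schema, which is the point the slide makes about the color and Mpixels facets.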

Page 38: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

DisMax query example

– This Solr query can replace use of a composite field
• qt=dismax
• q=oslo
• qf=title^0.7 highpriorityfields^1.5 mediumpriorityfields^0.6 lowpriorityfields^0.2 recallfields^0.0 body^0.0
• bf=recip(rord(creationDate),1,1000,1000)
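The parameters above have to be URL-encoded when sent over HTTP. A minimal sketch of assembling the request with Ruby's standard library follows; the base URL and the helper name are assumptions, and the field names are the illustrative ones from the slide.

```ruby
require 'uri'

# Assumed Solr location (same host and port as the other examples in this deck).
SOLR_SELECT = 'http://localhost:8080/solr/select'

# Per-field boosts from the slide.
FIELD_BOOSTS = {
  'title'                => 0.7,
  'highpriorityfields'   => 1.5,
  'mediumpriorityfields' => 0.6,
  'lowpriorityfields'    => 0.2,
  'recallfields'         => 0.0,
  'body'                 => 0.0
}.freeze

# Build a DisMax request URL: qf carries "field^boost" pairs separated by
# spaces, bf carries the boost function.
def dismax_url(user_query, boosts, boost_function)
  params = {
    'qt' => 'dismax',
    'q'  => user_query,
    'qf' => boosts.map { |field, boost| "#{field}^#{boost}" }.join(' '),
    'bf' => boost_function
  }
  "#{SOLR_SELECT}?#{URI.encode_www_form(params)}"
end

puts dismax_url('oslo', FIELD_BOOSTS, 'recip(rord(creationDate),1,1000,1000)')
```

Keeping the boosts in a hash makes it easy to tune relevancy in one place, much like editing a rank profile in ESP.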

Page 39: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Migrating content

– If using FAST ContentAPI to push programmatically
• Use Solr's clients (Java, .NET, Ruby, Python, PHP...)
– If feeding FastXML using FileTraverser
• Feed as Solr XML using HTTP POST or a POST client
– If you feed custom XML with XMLMapper
• Have a look at DIH's import and mapping features

Page 40: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Push Feeding example

– Feed XML using HTTP POST:
• curl "http://localhost:8080/solr/update?commit=true" -H "Content-Type: text/xml" --data-binary @mydoc.xml
– Ruby example:
• > gem sources -a http://gemcutter.org
  > sudo gem install rsolr
  require 'rsolr'
  solr = RSolr.connect :url=>'http://localhost:8080/solr'
  documents = [{:id=>1, :price=>1.00}, {:id=>2, :price=>10.50}]
  solr.add documents
  solr.commit
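For the HTTP POST route, the file passed to curl has to be in Solr's XML update format, an add element wrapping doc elements with named field children. A minimal sketch of generating that body from Ruby hashes follows; the helper names are hypothetical, and a real feeder would also need to handle multi-valued fields and dates.

```ruby
XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

# Escape the characters that are special in XML text and attribute values.
def xml_escape(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Render documents (hashes of field name => value) as a Solr <add> message.
def solr_add_xml(documents)
  docs = documents.map do |doc|
    fields = doc.map do |name, value|
      %(<field name="#{xml_escape(name)}">#{xml_escape(value)}</field>)
    end
    "<doc>#{fields.join}</doc>"
  end
  "<add>#{docs.join}</add>"
end

puts solr_add_xml([{ :id => 1, :price => 1.00 }, { :id => 2, :price => 10.50 }])
```

The printed string can be saved as mydoc.xml and posted with the curl command shown on the slide.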

Page 41: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Pull: DataImportHandler (DIH)

Page 42: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Querying examples

– http://localhost:8080/solr/select?q=car&fl=id,title

– Ruby
• res = solr.select :q=>'roses', :fq=>['red','white']
  res['response']['docs'].each do |doc|
    puts doc['title']
  end
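Without a client library, the same query can be expressed with Ruby's standard library alone. This sketch builds the select URL, repeating fq once per filter, and reads titles out of a JSON response (wt=json). The filter field color:red and the sample response body are made up for illustration, and no HTTP request is actually sent.

```ruby
require 'json'
require 'uri'

# Build a /select URL; each filter becomes its own fq parameter.
def select_url(base, q, filters = [], fields = [])
  params = [['q', q]]
  filters.each { |fq| params << ['fq', fq] }
  params << ['fl', fields.join(',')] unless fields.empty?
  params << ['wt', 'json']
  "#{base}/select?#{URI.encode_www_form(params)}"
end

# Extract the title of every document in a Solr JSON response body.
def titles(response_body)
  JSON.parse(response_body)['response']['docs'].map { |doc| doc['title'] }
end

puts select_url('http://localhost:8080/solr', 'car', ['color:red'], %w[id title])

# Made-up response body in the shape Solr returns for wt=json.
sample = '{"response":{"numFound":1,"start":0,"docs":[{"id":"1","title":"Red car"}]}}'
puts titles(sample)
```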

Page 43: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Migrating document processing

– Solr lacks a sophisticated pipeline with entity extraction etc. Alternatives:
• Do extraction in Application space (Ruby)
• Write own stage in Solr pipeline for simple cases
• Integrate to do more advanced stuff
– Matchers/extractors
• LingPipe NamedEntityExtractor inside of OpenPipeline
– Synonyms:
• Use Solr's synonym handling index/query side
– Custom stages:
• Write a Solr UpdateProcessor (in Java, Jython etc)
– Got a LOT of custom FAST docproc stages?
• Have a look at SESAT's PY ProcServer for Solr (GPL)

Page 44: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Migrating linguistics (lemmatization)

– Solr ships with stemming instead of lemmatization
– Stemming has limitations
• Biler, bilen, bilene (cars, the car, the cars) -> bil
  BUT
• Bøker, bøkene (books, the books) -> bøk; boka, bok (the book, book) -> bok
– KStem is better. Free with LucidWorks for Solr
– If you need singular/plural handling only
• Free dictionaries? Check lucene-hunspell

– Lemmatization can be licensed from 3rd party such as Basistech, who also has language identification & entity extraction

– Language identification also from Sematext
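The stemming limitation on this slide can be reproduced with a toy suffix stripper. This is not the algorithm Solr ships with; the suffix list and the minimum stem length are assumptions chosen only to mirror the Norwegian examples above.

```ruby
# Toy suffix list, longest first. NOT Solr's stemmer, just an illustration.
SUFFIXES = %w[ene en er a].freeze

# Strip the first matching suffix, keeping at least three characters of stem.
def naive_stem(word)
  w = word.downcase
  SUFFIXES.each do |suffix|
    stem_length = w.length - suffix.length
    return w[0, stem_length] if w.end_with?(suffix) && stem_length >= 3
  end
  w
end

%w[biler bilen bilene bøker bøkene boka bok].each do |word|
  puts "#{word} -> #{naive_stem(word)}"
end
```

All forms of "bil" collapse to one stem, but the vowel change in the plural of "bok" leaves two different stems, so a search for "bok" would miss "bøker". A lemmatizer maps both to the same dictionary form.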

Page 45: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Basistech Rosette for Lucene

– High-end linguistics capabilities for 19 languages

– Language Identification
– Segmentation and tokenization
– Lemmatization
– Noun decompounding
– Part-of-speech tagging
– Entity extraction

– Easily integrated with Lucene/Solr

– More: http://www.basistech.com/lucene/

Page 46: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Migrating search middleware

– Using FAST Unity?
• Consider migrating middleware logic such as external source querying and federation to SESAT (AGPL)
– Using Comperio Front?
• Ask Comperio for Solr engine support
• Or migrate custom Q&R formats
– Or is plain Solr enough?
• Solr has built-in support for shards
• A shard query will query multiple shards and merge the results into one
• Add custom processing as Query Components in Solr
• Check contrib & patches!
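A shard query is an ordinary select request with an extra shards parameter listing the cores to search, each given as host:port/path without the http:// prefix. A minimal sketch of building such a request follows; the host names and the helper name are hypothetical.

```ruby
require 'uri'

# Build a distributed query URL. The coordinator is the Solr instance that
# receives the request, fans it out to the listed shards and merges the results.
def sharded_select_url(coordinator, shards, q)
  params = { 'q' => q, 'shards' => shards.join(',') }
  "#{coordinator}/select?#{URI.encode_www_form(params)}"
end

# Hypothetical hosts, for illustration only.
shards = ['solr1.example.com:8080/solr', 'solr2.example.com:8080/solr']
puts sharded_select_url('http://solr1.example.com:8080/solr', shards, 'oslo')
```

The shards list can also be set as a default in a request handler configuration, so front ends do not need to know the topology.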

Page 47: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Migrating Front ends

– Using a middleware with Solr support? Lucky you!
– If not, consider introducing one now. Look at (Java):

– If you decide to migrate from FAST Java/.NET APIs
• Choose SolrJ or SolrNET
• Query language differences: &fq= instead of filter()
• Solr facets do not require sessions/state as FAST's do
– Migrate FAST's «views» into named ReqHandler configs
– Multilingual: need to handle title_no, title_en etc... :(

Page 48: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Migrating Web Crawler

– Solr has no built-in web crawler
• Instead you can choose from several integrations
– The Apache Nutch crawler
• Proven with hundreds of millions of pages
• http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
– Apache Droids
• Still in the incubator, but aims at becoming a full crawler
• http://incubator.apache.org/droids/
– Heritrix + Solr (example in Solr 1.4 book)
– OpenPipeline has a (very) simple crawler
– Lucene Connectors Framework
• Preparing crawler support

Page 49: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Migrating Connectors

– Solr handles these sources internally through DIH:
• Database, RSS, Web-services, Local filesystem
– Additionally through Lucene Connectors Framework:
• EMC Documentum, FileNet, JDBC, LiveLink, Patriarch (Memex), Meridio, SharePoint, RSS
• New connectors should be written for LCF
– Another option:
• SharePoint, IMAP, Documentum, Vignette, Filesystem

Page 50: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Operations

– Solr has no admin-server (coming in 1.5)
– Possible to run multiple Tomcat instances on the same server
– Multiple cores in the same Tomcat – easier migration
– No built-in query reports, use 3rd party tools
– No built-in monitoring, have a look at

– Log analysis? Check out

Page 52: Oslo Enterprise MeetUp May 12th 2010 - Jan Høydahl

Thank You

www.cominvent.com

www.twitter.com/cominvent

This presentation licensed under CC-by-sa license

You must attribute Cominvent with name and link

linkedin.com/in/janhoy