Solr Indexing and Analysis Tricks

Solr Indexing and Analysis Tricks @ErikHatcher Senior Solutions Architect, LucidWorks

Upload: lucenerevolution

Post on 11-May-2015


DESCRIPTION

Presented by Erik Hatcher, Senior Solutions Architect, LucidWorks. This session will introduce and demonstrate several techniques for enhancing the search experience by augmenting documents during indexing. First we'll survey the analysis components available in Solr, and then we'll delve into using Solr's update processing pipeline to modify documents on the way in. The session will build on Erik's "Poor Man's Entity Extraction" blog at http://www.searchhub.org/2013/06/27/poor-mans-entity-extraction-with-solr/

TRANSCRIPT

Page 1: Solr Indexing and Analysis Tricks

Solr Indexing and Analysis Tricks @ErikHatcher Senior Solutions Architect, LucidWorks

Page 2: Solr Indexing and Analysis Tricks

•  Lucene/Solr committer
•  Apache Software Foundation member
•  Co-founder, Senior Solutions Architect, and Janitor at LucidWorks
•  Creator of Blacklight
•  Co-author of "Ant in Action" and "Lucene in Action"

Erik Hatcher's Relevant Professional Bio

Page 3: Solr Indexing and Analysis Tricks

This session will introduce and demonstrate several techniques for enhancing the search experience by augmenting documents during indexing. First we'll survey the analysis components available in Solr, and then we'll delve into using Solr's update processing pipeline to modify documents on the way in. The session will build on Erik's "Poor Man's Entity Extraction" blog at http://www.searchhub.org/2013/06/27/poor-mans-entity-extraction-with-solr/

Abstract

Page 4: Solr Indexing and Analysis Tricks

•  acronyms: a searchable/filterable/facetable (but not stored) field containing all three+ letter CAPS acronyms
•  key_phrases: a searchable/filterable/facetable (but also not stored) field containing any key phrases matching a provided list
•  links: a stored field containing http(s) links
•  extracted_locations: lat/long points that are mentioned in the document content, indexed as geographically savvy points, stored on the document for easy use in maps or elsewhere

Poor Man’s Entity Extraction

Page 5: Solr Indexing and Analysis Tricks

The DUB airport is at 53.421389,-6.27 See also: http://en.wikipedia.org/wiki/Dublin_Airport

example_data.txt

Page 6: Solr Indexing and Analysis Tricks

End results

Page 7: Solr Indexing and Analysis Tricks

•  Analyzers and Query Parsers
   –  Analysis != query parsing
      •  Query parsers generally analyze “chunks” of the query expression and combine the results in various ways
   –  Synergy, working in conjunction
•  Query parsing
   –  q=Lucene Revolution in Dubhlinn
   –  q="Lucene Revolution"
   –  q=lucene [AND/OR] revolution
   –  On which field(s)? Which query parser?
•  Analysis
   –  +((content:lucen) (content:revolut) (content:dublin)) [from edismax]

Challenges and needs

Page 8: Solr Indexing and Analysis Tricks

•  copyField content => acronyms
   –  Note that the destination of a copyField generally should not be stored (stored="false")
•  "caps" field type
   –  PatternCaptureGroupFilterFactory with pattern="((?:[A-Z]\.?){3,})"
      •  "The Dublin airport, DUB, is at…" => DUB
•  Results can be suitable for faceting, searching, and boosting, but they are not "stored" values (only indexed terms)

Extracting with copyField
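
To see what the caps pattern keeps, the sketch below applies the same regular expression to whitespace-separated tokens in plain JavaScript. It only approximates what PatternCaptureGroupFilterFactory does inside a real analysis chain: the helper name extractAcronyms and the whitespace tokenization are assumptions made for illustration, not part of the original schema.

```javascript
// Same pattern as the "caps" field type: three or more capital letters,
// each optionally followed by a period (e.g. DUB or U.S.A.).
var CAPS_PATTERN = /((?:[A-Z]\.?){3,})/g;

// Hypothetical helper: split on whitespace (an assumption; in Solr the
// tokenizer is defined by the field type) and keep every pattern match.
function extractAcronyms(text) {
  var acronyms = [];
  var tokens = text.split(/\s+/);
  for (var i = 0; i < tokens.length; i++) {
    var matches = tokens[i].match(CAPS_PATTERN);
    if (matches !== null) {
      for (var j = 0; j < matches.length; j++) {
        acronyms.push(matches[j]);
      }
    }
  }
  return acronyms;
}

console.log(extractAcronyms("The Dublin airport, DUB, is at..."));
```

"The" and "Dublin" each contain only one capital letter, so only DUB survives, matching the slide's example.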

Page 9: Solr Indexing and Analysis Tricks

•  An update processor can manipulate (add, modify, delete) document fields
   –  Field values can be stored
•  update.chain=script
   –  With post.jar:
      •  java -Dauto -Dparams=update.chain=script -jar post.jar
   –  Or make the update chain the default
      •  <updateRequestProcessorChain default="true"…

// basic lat/long pattern matching, e.g. "38.1384683,-78.4527887"
var location_regexp = /(-?\d{1,2}\.\d{2,7},-?\d{1,3}\.\d{2,7})/g;
var extracted_locations = getMatches(location_regexp, content);
doc.setField("extracted_locations", extracted_locations);

Extracting with ScriptUpdateProcessor
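
The slide calls a getMatches helper without showing it. A minimal sketch of what such a helper might look like follows, run against the example_data.txt content from earlier in the deck. The body of getMatches and the link pattern are assumptions for illustration; only the location pattern comes from the slide.

```javascript
// Hypothetical stand-in for the getMatches helper the slide calls:
// returns the first capture group of every match of the regex.
function getMatches(regexp, value) {
  var result = [];
  var match;
  if (!regexp.global) {
    // Without the g flag exec() never advances, so return at most one match.
    match = regexp.exec(value);
    return match === null ? [] : [match[1]];
  }
  regexp.lastIndex = 0; // start from the beginning even if the regex was reused
  while ((match = regexp.exec(value)) !== null) {
    result.push(match[1]);
    // Guard against zero-length matches, which would otherwise loop forever.
    if (match[0].length === 0) {
      regexp.lastIndex++;
    }
  }
  return result;
}

var content = "The DUB airport is at 53.421389,-6.27 " +
    "See also: http://en.wikipedia.org/wiki/Dublin_Airport";

// Pattern from the slide.
var location_regexp = /(-?\d{1,2}\.\d{2,7},-?\d{1,3}\.\d{2,7})/g;
// Assumed pattern for the links field; the slides do not show the original.
var link_regexp = /(https?:\/\/\S+)/g;

console.log(getMatches(location_regexp, content));
console.log(getMatches(link_regexp, content));
```

Inside a real update script, content would come from the incoming document and the results would be written back with doc.setField, as on the slide.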

Page 10: Solr Indexing and Analysis Tricks

var analyzer = req.getCore().getLatestSchema()
    .getFieldTypeByName("<field type>")
    .getAnalyzer();
doc.setField("token_ss", getAnalyzerResult(analyzer, null, content));

Analysis in ScriptUpdateProcessor

Page 11: Solr Indexing and Analysis Tricks

function getAnalyzerResult(analyzer, fieldName, fieldValue) {
  var result = [];
  var token_stream = analyzer.tokenStream(fieldName,
      new java.io.StringReader(fieldValue));
  // The term attribute is updated in place on each incrementToken() call
  var term_att = token_stream.getAttribute(
      Packages.org.apache.lucene.analysis.tokenattributes.CharTermAttribute);

  // reset() must be called before the first incrementToken()
  token_stream.reset();
  while (token_stream.incrementToken()) {
    result.push(term_att.toString());
  }

  token_stream.end();
  token_stream.close();

  return result;
}

getAnalyzerResult

Page 12: Solr Indexing and Analysis Tricks

•  http://localhost:8983/solr/collection1/analysis/field
   –  ?analysis.fieldvalue=Dubhlinn
   –  &analysis.fieldtype=just_synonyms
      •  => dublin

Using analysis externally
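
As a small illustration of calling the analysis handler from outside Solr, the sketch below assembles the request URL shown on the slide. The helper name buildAnalysisUrl is hypothetical, and the code only builds the URL; it does not send the request or parse the response.

```javascript
// Hypothetical helper: builds a field-analysis URL for a given field type.
// Parameter names (analysis.fieldvalue, analysis.fieldtype) are from the slide.
function buildAnalysisUrl(baseUrl, fieldType, fieldValue) {
  return baseUrl + "/analysis/field" +
      "?analysis.fieldvalue=" + encodeURIComponent(fieldValue) +
      "&analysis.fieldtype=" + encodeURIComponent(fieldType);
}

var url = buildAnalysisUrl(
    "http://localhost:8983/solr/collection1", "just_synonyms", "Dubhlinn");
console.log(url);
```

Encoding the values keeps multi-word input such as "Lucene Revolution" intact as a single parameter value.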

Page 13: Solr Indexing and Analysis Tricks