Enterprise search with Solr Minh Tran

Posted on 25-May-2015
DESCRIPTION

This presentation introduces what Apache Solr can do and how to apply it to your project.

TRANSCRIPT

  • 1. Enterprise search with Solr
    Minh Tran

2. Why does search matter?
Then:
Most of the data encountered was created for the web
Heavy use of a site's search function was considered a failure of navigation
Now:
Navigation is not always relevant
Users have less patience to browse
Users are used to navigating via a search box
Confidential
2
3. What is Solr?
Open source enterprise search platform based on the Apache Lucene project.
REST-like HTTP/XML and JSON APIs
Powerful full-text search, hit highlighting, faceted search
Database integration and rich document (e.g., Word, PDF) handling
Dynamic clustering, distributed search and index replication
Loose schema to define types and fields
Written in Java 5, deployable as a WAR
4. Public Websites using Solr
Mature product powering search for public sites like Digg, CNet, Zappos, and Netflix
See here for more information: http://wiki.apache.org/solr/PublicServers
5. Architecture
Diagram: the Solr core sits behind an admin interface, an HTTP request servlet, and an update servlet. Request handlers (standard, disjunction-max, custom) answer queries through an XML response writer; an XML update interface feeds the update handler. Inside the core: caching, config, schema, analysis, and concurrency, built on Lucene, with index replication.
6. Starting Solr
We need to set these properties for Solr:
solr.solr.home: the Solr home folder, which contains conf/solrconfig.xml
solr.data.dir: the folder that contains the index folder
Or configure a JNDI lookup of java:comp/env/solr/home to point to the Solr directory.
For example (Jetty):
java -Dsolr.solr.home=./solr -Dsolr.data.dir=./solr/data -jar start.jar
For other web servers, set these values as Java system properties
7. Web Admin Interface
9. How Solr Sees the World
An index is built from one or more Documents
A Document consists of one or more Fields
A Field consists of a name, content, and metadata telling Solr how to handle the content.
You can tell Solr what kind of data a field contains by specifying its field type
10. Field Analysis
Field analyzers are used both during ingestion, when a document is indexed, and at query time
An analyzer examines the text of fields and generates a token stream. Analyzers may be a single class or may be composed of a series of tokenizer and filter classes.
Tokenizers break field data into lexical units, or tokens; filters then transform the tokens
Examples:
Setting all letters to lowercase
Eliminating punctuation and accents, mapping words to their stems, and so on
ram, Ram and RAM would all match a query for ram
11. Schema.xml
The schema.xml file is located in ../solr/conf
The schema file starts with the <schema> tag
Solr supports one schema per deployment
The schema can be organized into three sections:
Types
Fields
Other declarations
12. Example for TextField type
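The slide's XML did not survive transcription; a schema.xml text fieldType in Solr 1.4-era syntax, wired with the filters described on the next slide, might look like this (a sketch, not the slide's exact example):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- drop common words listed in stopwords.txt -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- handle dashes, case transitions, etc. -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1"/>
    <!-- lowercase all terms -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Porter stemming; protwords.txt lists terms to leave unstemmed -->
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <!-- remove duplicate tokens at the same position -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```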
13. Filter explanation
StopFilterFactory: tokenize on whitespace, then remove any common words
WordDelimiterFilterFactory: handle special cases with dashes, case transitions, etc.
LowerCaseFilterFactory: lowercase all terms.
EnglishPorterFilterFactory: stem using the Porter stemming algorithm.
E.g.: runs, running, ran all reduce to the elemental root "run"
RemoveDuplicatesTokenFilterFactory: remove any duplicates
14. Field Attributes
Indexed:
Indexed fields are searchable and sortable.
You can also run Solr's analysis process on indexed fields, which can alter the content to improve or change results.
Stored:
The contents of a stored field are saved in the index.
This is useful for retrieving and highlighting the contents for display, but is not necessary for the actual search.
For example, many applications store pointers to the location of contents rather than the actual contents of a file.
15. Field Definitions
Field Attributes: name, type, indexed, stored, multiValued, omitNorms
Dynamic Fields, in the spirit of Lucene!
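The field declarations themselves were stripped from this slide; typical schema.xml definitions exercising these attributes might look like this (the field names are illustrative, not the slide's originals):

```xml
<!-- a stored, indexed field -->
<field name="title" type="text" indexed="true" stored="true"/>
<!-- a multi-valued field; omitNorms saves space when norms are not needed -->
<field name="cat" type="string" indexed="true" stored="true" multiValued="true" omitNorms="true"/>
<!-- a dynamic field: any field name ending in _i is treated as an integer -->
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
```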
16. Other declaration
url: the url field is the unique identifier; it determines whether a document is being added or updated
defaultSearchField: the field Solr uses in queries when no field is prefixed to a query term
For example: q=title:Solr searches the title field explicitly; if you entered q=Solr instead, the default search field would apply
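The markup for these declarations was lost in transcription; in schema.xml they take roughly this form (the defaultSearchField value here is a guess):

```xml
<uniqueKey>url</uniqueKey>
<defaultSearchField>text</defaultSearchField>
```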
17. Indexing data
Use curl to interact with Solr: http://curl.haxx.se/download.html
Here are the different data formats:
Solr's native XML
CSV (Character Separated Value)
Rich documents through Solr Cell
JSON format
Direct database and XML import through Solr's DataImportHandler
18. Add / Update documents
HTTP POST an <add> message to add / update documents. The XML tags were lost in transcription; the field names below are illustrative reconstructions around the slide's surviving values:
<add>
  <doc>
    <field name="id">05991</field>
    <field name="name">Apache Solr</field>
    <field name="description">An intro...</field>
    <field name="cat">search</field>
    <field name="cat">lucene</field>
    <field name="features">Solr is a full...</field>
  </doc>
</add>
19. Delete documents
Delete by id:
<delete><id>05591</id></delete>
Delete by query (multiple documents):
<delete><query>manufacturer:microsoft</query></delete>
20. Commit / Optimize
<commit/> tells Solr that all changes made since the last commit should be made available for searching.
<optimize/> does the same as commit, and also merges all index segments. It restructures Lucene's files to improve performance for searching.
Optimization is generally good to do when indexing has completed
If there are frequent updates, you should schedule optimization for low-usage times
An index does not need to be optimized to work properly. Optimization can be a time-consuming process.
21. Index XML documents
Use the command line tool post.jar for POSTing raw XML to a Solr server
Other options (shown with their default values):
-Ddata=[files|args|stdin] (default: files)
-Durl=http://localhost:8983/solr/update
-Dcommit=yes
Example:
java -jar post.jar *.xml
java -Ddata=args -jar post.jar "42"
java -Ddata=stdin -jar post.jar
java -Dcommit=no -Ddata=args -jar post.jar "*:*"
22. Index XML file using HTTP POST
The curl command does this with --data-binary and an appropriate Content-type header reflecting that it's XML.
Example: using HTTP POST to send the XML data over the network to the Solr server:
curl http://localhost:9090/solr/update -H "Content-type:text/xml;charset=utf-8" --data-binary @ipod_other.xml
23. Index CSV usingremote streaming
Uploading a local CSV file can be more efficient than sending it over the network via HTTP. Remote streaming must be enabled for this method to work.
Set enableRemoteStreaming="true" in solrconfig.xml.
Examples:
java -Ddata=args -Durl=http://localhost:9090/solr/update -jar post.jar ""
curl http://localhost:9090/solr/update/csv -F "stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv" -F "commit=true" -F "optimize=true" -F "stream.contentType=text/plain;charset=utf-8"
curl "http://localhost:9090/solr/update/csv?overwrite=false&stream.file=d:/Study/Solr/apache-solr-1.4.1/example/exampledocs/books.csv&commit=true&optimize=true"
24. Index rich documents with Solr Cell
Solr uses Apache Tika, a framework wrapping many different format parsers like PDFBox, POI, and others
Example:
curl "http://localhost:9090/solr/update/extract?literal.id=doc1&commit=true" -F "[email protected]"
curl "http://localhost:9090/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F [email protected] (index html)
Capture <div> tags separately, then map that field to a dynamic field named foo_t:
curl "http://localhost:9090/solr/update/extract?literal.id=doc2&captureAttr=true&defaultField=text&fmap.div=foo_t&capture=div" -F [email protected] (index pdf)
25. Updating a Solr Index with JSON
The JSON request handler needs to be configured in solrconfig.xml
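The configuration line was stripped in transcription; the solrconfig.xml entry would look something like this (the class name is the 3.x-era JSON update handler, assumed rather than taken from the slide):

```xml
<requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler" startup="lazy"/>
```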

Example:
curl "http://localhost:8983/solr/update/json?commit=true" --data-binary @books.json -H "Content-type:application/json"
26. Searching
Spellcheck
Editorial results replacement
Scaling index size with distributed search
27. Default Query Syntax
Lucene query syntax [; sort specification]
mission impossible; releaseDate desc
+mission +impossible actor:cruise
mission impossible actor:cruise
title:spiderman^10 description:spiderman
description:"spiderman movie"~10
+HDTV +weight:[0 TO 100]
Wildcard queries: te?t, te*t, test*
28. Default Parameters
Query Arguments for HTTP GET/POST to /select
29. Search Results
http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
<response>
  <responseHeader><int name="status">0</int><int name="QTime">1</int></responseHeader>
  <result numFound="..." start="0">
    <doc>
      <str name="name">Apple 60 GB iPod with Video</str>
      <float name="price">399.0</float>
    </doc>
    <doc>
      <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
      <float name="price">479.95</float>
    </doc>
  </result>
</response>
30. Query response writers
Query responses are written by the response writer whose registered name matches the 'wt' request parameter.
The standard XML writer is the default and is used if 'wt' is not specified in the request
E.g.:
http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true
31. Caching
An IndexSearcher's view of an index is fixed
Aggressive caching is possible
Consistency for multi-query requests
filterCache: unordered sets of document ids matching a query
resultCache: ordered subsets of document ids matching a query
documentCache: the stored fields of documents
userCaches: application-specific, for custom query handlers
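These caches are sized in solrconfig.xml; a typical Solr 1.4-era entry looks like this (the values are illustrative defaults, not tuning advice):

```xml
<filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
```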
32. Configuring Relevancy




<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
  </analyzer>
</fieldType>
33. Faceted Browsing Example
34. Faceted Browsing
Diagram: Search(Query, Filter[], Sort, offset, n) runs the query "computer" sorted by price asc. The unordered set of all results is a DocSet; the ordered section of results returned in the Query Response is a DocList. Facet counts are computed with intersectionSize() between the query's DocSet and a cached DocSet per facet value, e.g. proc_manu:Intel = 594, proc_manu:AMD = 382, price:[0 TO 500] = 247, price:[500 TO 1000] = 689, manu:Dell = 104, manu:HP = 92, manu:Lenovo = 75, plus filters such as computer_type:PC and memory:[1GB TO *].
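The intersectionSize() step above can be sketched with plain hash sets standing in for Solr's bitset-backed DocSets; the doc ids and resulting counts below are made up for illustration, not taken from the diagram:

```java
import java.util.*;

// Facet counting sketch: each facet value's matching docs are cached as an
// unordered DocSet (modeled here as a Set of doc ids); a facet count is the
// size of its intersection with the base query's DocSet.
public class FacetSketch {
    static int intersectionSize(Set<Integer> a, Set<Integer> b) {
        // iterate the smaller set, probe the larger one
        Set<Integer> smaller = a.size() <= b.size() ? a : b;
        Set<Integer> larger = (smaller == a) ? b : a;
        int n = 0;
        for (int doc : smaller) {
            if (larger.contains(doc)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // hypothetical DocSet for the base query "computer"
        Set<Integer> base = new HashSet<>(Arrays.asList(1, 2, 3, 4, 5, 6));
        // cached DocSets per facet value
        Map<String, Set<Integer>> facets = new LinkedHashMap<>();
        facets.put("manu:Dell", new HashSet<>(Arrays.asList(1, 2, 9)));
        facets.put("manu:HP", new HashSet<>(Arrays.asList(3, 7)));
        facets.put("manu:Lenovo", new HashSet<>(Arrays.asList(4, 5, 6, 8)));
        for (Map.Entry<String, Set<Integer>> e : facets.entrySet()) {
            System.out.println(e.getKey() + " = " + intersectionSize(base, e.getValue()));
        }
    }
}
```

Solr keeps these per-filter DocSets in the filterCache, so repeated facet requests reuse cached sets instead of re-running each filter query.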
35. Index optimization
36. High Availability
Diagram: a DB and an admin terminal send updates to the Solr Master (the Updater); index replication copies the index to a pool of Solr Searchers; a load balancer spreads HTTP search requests from app servers (dynamic HTML generation) across the searchers; admin queries go to the master.
37. Distributed and replicated Solr architecture
38. Index by using SolrJ
39. Query with SolrJ
40. Distributed and replicated Solr architecture (cont.)
At this time, applications must still handle the process of sending the documents to individual shards for indexing
The size of an index that a machine can hold depends on the machine's configuration (RAM, CPU, disk, and so on), the query and indexing volume, document size, and the search patterns
Typically the number of documents a single machine can hold is in the range of several million up to around 100 million documents.
41. Advanced Functionality
Structured data store: import data with the Data Import Handler (JDBC, HTTP, File, URL)
Support for other programming languages (.NET, PHP, Ruby, Perl, Python, ...)
Support for NoSQL databases like MongoDB, Cassandra?
42. Other open source search servers
Sphinx
Elasticsearch
43. Resources
  • http://wiki.apache.org/solr/UpdateCSV
  • http://wiki.apache.org/solr/ExtractingRequestHandler
  • http://lucene.apache.org/tika/
  • http://wiki.apache.org/solr/
  • Solr 1.4 Enterprise Search Server (book)
44. Resources (cont.)
  • http://www.ibm.com/developerworks/java/library/j-solr2/
  • http://www.ibm.com/developerworks/java/library/j-solr1/
  • http://en.wikipedia.org/wiki/Solr
  • Apache Conf Europe 2006 - Yonik Seeley
  • LucidWorks Solr Reference Guide