Solr

by NNNN (周世恩) Code & Coffee 2013/11/1

Post on 27-Jan-2015


DESCRIPTION

Code & Beer Topic: Apache Solr

TRANSCRIPT

Page 1: Solr

by NNNN (周世恩) Code & Coffee 2013/11/1

Page 2: Solr

What is Solr?

Page 3: Solr

What is Lucene?

Page 4: Solr

• Full-featured text search

• High performance

• Index size: roughly 20-30% of the size of the indexed text

• Small RAM requirements (~1 MB)

• Powerful, Accurate and Efficient Search Algorithms

• 100% in Java(^^)

Page 5: Solr

Lucene(cont.)

• Multiple Analyzer / Tokenizer

• Fields Searching

• Merge results

• Flexible faceting, highlighting, joins and result grouping

• Typo-tolerant suggesters (which you have to build yourself, of course)

• Customizable ranking models (VSM, BM25)
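To make the "ranking model" bullet concrete, here is a minimal Python sketch of BM25 scoring over pre-tokenized documents. It is not Lucene's implementation: the function name `bm25_scores`, the sample documents, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration, and the IDF uses the commonly seen `log(1 + ...)` variant.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # a term absent from the document contributes nothing
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical sample documents, already tokenized by whitespace.
docs = [
    "solr is a search platform".split(),
    "lucene is a search library written in java".split(),
    "coffee and code".split(),
]
scores = bm25_scores(["solr", "search"], docs)
```

The first document matches both query terms and is shorter, so it scores highest; the third matches nothing and scores zero.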

Page 8: Solr

Where is index file stored?

• Memory

• File System

• HDFS

• Set the FileSystem config to HDFS

Page 9: Solr

#Note

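• Only fields that are indexed can be searched

• A field can be stored only, without being indexed

• Supports multiple field types (Long, Int, String, Text...)

• The tokenizer (analyzer) has to be decided at indexing time

• Supports searching and indexing at the same time?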
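The indexed/stored distinction in these notes can be illustrated with a toy inverted index in Python. This is a sketch of the idea only, not how Lucene stores data; `ToyIndex`, `analyze`, the schema layout, and the sample document are all hypothetical names invented for this example.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase, then split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


class ToyIndex:
    def __init__(self, schema):
        # schema maps field name -> {"indexed": bool, "stored": bool}
        self.schema = schema
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids
        self.stored = {}                  # doc id -> stored field values

    def add(self, doc_id, doc):
        self.stored[doc_id] = {}
        for field, value in doc.items():
            opts = self.schema[field]
            if opts["indexed"]:
                # The analyzer used here fixes which terms can match later.
                for term in analyze(str(value)):
                    self.postings[(field, term)].add(doc_id)
            if opts["stored"]:
                self.stored[doc_id][field] = value

    def search(self, field, text):
        """Return the stored fields of documents containing every term."""
        terms = analyze(text)
        if not terms:
            return []
        ids = set.intersection(
            *(self.postings.get((field, t), set()) for t in terms)
        )
        return [self.stored[i] for i in sorted(ids)]


index = ToyIndex({
    "title": {"indexed": True, "stored": True},
    "body": {"indexed": True, "stored": False},   # searchable, not returned
    "url": {"indexed": False, "stored": True},    # returned, not searchable
})
index.add(1, {
    "title": "Intro to Solr",
    "body": "Solr builds on Lucene",
    "url": "http://example.com/solr",
})
hits = index.search("body", "lucene")
no_hits = index.search("url", "example")
```

Searching `body` finds the document but returns only `title` and `url`; searching `url` finds nothing because that field was never indexed.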

Page 10: Solr

#Note

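• Shake well before use

• Know from the start which fields you need

• This reduces the chance of having to rebuild the index (with an RDB you only need to run one command)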

Page 11: Solr

A Lucene index is made up of many files; lose one of them and it's game over (GG)

Page 12: Solr

What is Solr?

Page 13: Solr

An awesome, enterprise-grade, free

Page 14: Solr

Search Platform

Page 15: Solr

Everything Lucene offers is already included

Page 16: Solr

Solr adds even more....

• A nice-looking admin interface!

• REST-like API (easy to integrate with other apps)

• Dynamic clustering

• Database integration

• Geospatial search (Google Maps?)

• Tunable cache sizes

Page 17: Solr

Remember the advantages of the cloud...

• Highly reliable

• Scalable

• Fault tolerant

• Distributed indexing

• Replication

• Load-balanced

• Automated failover and recovery(?)

Page 18: Solr
Page 19: Solr

Commonly used config files

• schema.xml (defines each field)

• solrconfig.xml (defines the URI of each handler)

• jetty.xml (!)

• solr.xml (defines how many cores there are)

Page 20: Solr

Real-time indexing?

Page 21: Solr

Near Real-time indexing

• Documents are available for search almost immediately after being indexed...

• A commit is still required before the documents count as searchable (....)

https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching
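As a rough illustration of "index, then commit", the Python sketch below builds (but does not send) a JSON update request that asks Solr for a soft commit. The host, port, and core name `collection1` are assumptions based on the stock example setup, `build_update_request` is a hypothetical helper, and whether `/update` accepts JSON this way depends on the Solr version and handler configuration.

```python
import json
import urllib.parse
import urllib.request


def build_update_request(base_url, docs, soft_commit=True):
    """Build (but do not send) a JSON update request for a Solr core."""
    # softCommit makes documents searchable without a full flush to disk;
    # commit=true asks for a regular (hard) commit instead.
    params = {"softCommit": "true"} if soft_commit else {"commit": "true"}
    url = base_url.rstrip("/") + "/update?" + urllib.parse.urlencode(params)
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local example server and core name; adjust for a real setup.
req = build_update_request(
    "http://localhost:8983/solr/collection1",
    [{"id": "doc-1", "title": "Near real-time search"}],
)
# Against a running Solr, send it with: urllib.request.urlopen(req)
```

Until some commit happens, newly added documents stay invisible to searches, which is the point of the slide's second bullet.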

Page 22: Solr

Searching

• Query

“id: {id} AND name:{Name} OR title:{text}”

• Highlighting

• Projection

• Sorting(asc, desc)

• Output format: JSON, CSV, XML

• Others: spellcheck, Wildcard Query, +-*/
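The search options on this slide map onto request parameters of the select handler. The Python sketch below only assembles the URL; the host, port, core name `collection1`, field names, and the helper `build_select_url` are assumptions for illustration, while `q`, `fl`, `sort`, `hl`, `hl.fl`, and `wt` are the standard Solr parameter names.

```python
import urllib.parse


def build_select_url(base_url, query, fields=None, sort=None,
                     highlight_fields=None, fmt="json"):
    """Assemble a Solr /select URL from the common search options."""
    params = [("q", query), ("wt", fmt)]          # query and output format
    if fields:
        params.append(("fl", ",".join(fields)))   # projection
    if sort:
        params.append(("sort", sort))             # e.g. "name asc"
    if highlight_fields:
        params.append(("hl", "true"))             # highlighting on
        params.append(("hl.fl", ",".join(highlight_fields)))
    return base_url.rstrip("/") + "/select?" + urllib.parse.urlencode(params)


# Assumed local example server, core name, and field names.
url = build_select_url(
    "http://localhost:8983/solr/collection1",
    "id:42 AND name:solr OR title:search",
    fields=["id", "title"],
    sort="title desc",
    highlight_fields=["title"],
)
```

Switching `fmt` to `"xml"` or `"csv"` selects the other output formats listed above.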

Page 23: Solr

Sample Output

Page 24: Solr

Import Data From DB

• Set it up by editing solrconfig.xml

http://wiki.apache.org/solr/DataImportHandler

Page 25: Solr

The difference between Solr and an RDB

• Solr is for indexed text or large numbers of unstructured documents.

• Solr is optimized for searching, not for storage and retrieval of individual records.

http://stackoverflow.com/questions/5814050/solr-or-database

Page 26: Solr

Distributed Search cluster

• Run Solr on many machines and pick one of them to tie the others together

• Does this need to be set in the config?
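One option in a manually sharded Solr 4.x setup is to list the shards per request with the `shards` parameter on whichever node acts as coordinator. The Python sketch below assembles such a URL; the host names, ports, core name, and the helper `build_distributed_url` are hypothetical, and a real deployment may prefer defaults in the handler configuration or SolrCloud instead.

```python
import urllib.parse


def build_distributed_url(coordinator, shard_hosts, query):
    """Build a /select URL that asks one node to query several shards."""
    params = {
        "q": query,
        # Shards are listed as host:port/path entries, comma-separated,
        # without the "http://" scheme.
        "shards": ",".join(shard_hosts),
        "wt": "json",
    }
    return coordinator.rstrip("/") + "/select?" + urllib.parse.urlencode(params)


# Hypothetical two-node setup; the first node also acts as coordinator.
distributed_url = build_distributed_url(
    "http://solr1:8983/solr/collection1",
    ["solr1:8983/solr/collection1", "solr2:8983/solr/collection1"],
    "title:search",
)
```

The coordinating node forwards the query to each listed shard and merges the results before responding.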

Page 27: Solr

Distributed Solr Cluster & Load balancer

http://wiki.apache.org/solr/SolrReplication

Page 29: Solr

#Note

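• You can build the index directory ahead of time with a Java application that embeds Lucene indexing, then hand it to Solr

• Before Solr performs an update, make sure no other application is writing to the index; otherwise it's game over (GG)

• While indexing, whether with Solr or Lucene, do not delete the write lock carelessly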

Page 30: Solr

Live Demo

The gods once said this is dangerous

Page 31: Solr

Download the latest Solr release
$ cd solr-4.4.0/example
$ java -Xmx2048m -jar start.jar

Page 32: Solr

The End

Page 33: Solr

Useful tools

• Luke (for inspecting an index)

• Apache Tika

• Apache Hadoop

• Apache Tomcat

Page 34: Solr

BBQ(Bonus)

• Customize tokenizer

• Document Boosting

• Field Boosting

• Field aliasing / renaming
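Two of these bonus items can be expressed purely as query parameters. The Python sketch below assumes the edismax query parser, where `qf` takes `field^boost` pairs, and uses the `alias:field` form of `fl` for renaming; the field names, boost values, host, and the helper `build_boosted_url` are made up for illustration.

```python
import urllib.parse


def build_boosted_url(base_url, text, field_boosts, aliases):
    """Build a /select URL with per-field boosts and renamed output fields."""
    # qf: which fields to search and how much weight each one gets.
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    # fl with "alias:field" returns the field under a different name.
    fl = ",".join(f"{alias}:{field}" for alias, field in aliases.items())
    params = {"q": text, "defType": "edismax", "qf": qf, "fl": fl, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urllib.parse.urlencode(params)


# Hypothetical fields: matches in "title" count twice as much as in "body".
boosted_url = build_boosted_url(
    "http://localhost:8983/solr/collection1",
    "near real time",
    {"title": 2.0, "body": 1.0},
    {"headline": "title"},
)
```

Document boosting and custom tokenizers are index-time concerns and are not covered by this sketch.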