What's New in Lucene/Solr, presented by Grant Ingersoll at SolrExchange DC

Post on 27-Jan-2015


What’s new in Lucene and Solr?
Grant Ingersoll
CTO, LucidWorks
Lucene/Solr Committer

Sink or Swim?

Search is good for…
• Traditional: Fast, fuzzy text matching across a large document collection
• De-normalized data
  – “light” relational
• Top N problems
  – Key-value (top 1)
  – Recommendations, “Good enough” classification, clustering
• Faceting, slicing and dicing of numerical/enumerated data
• Spatial, spell checking, record linkage, highlighting
• NoSQL

What’s New?

• Community

• Lucene

• Solr

Relax, You’re Among Friends
• Large, diverse search community with many non-traditional search engine usages
  – Object stores, Record linkage, Social, mobile -> web
• “The Apache Way”
  – Meritocracy – Those who do, decide!
• Always Be Testing
  – Randomized system tests are all the rage
  – http://vimeo.com/32087114
• Patches Welcome!

Acceleration!

Coming Soon: Lucene and Solr 4.8

Java 1.7

Lucene: Speed and Memory
• Native Near Real Time (NRT) support
  – Per segment
  – FieldCache can be controlled to only load new segments
  – Soft commit: faster without fsync, allows quicker update visibility
• DWPT (DocumentsWriterPerThread)
  – Faster, more consistent indexing speed
• Faster fuzzy & wildcard query processing
• Automatic compression of stored fields and term vectors
• String -> BytesRef
  – Much improved data structure
  – … means less memory and less garbage collection effort

Lucene: Flexibility
• Flexible Index Formats
  – New posting list codecs: Block, Simple Text, HDFS, etc.
  – Pulsing codec: improves performance of primary key searches by inlining docs, positions, and payloads, which saves disk seeks
• Pluggable Scoring
  – Decoupled from TF/IDF
  – Built-in alternatives include BM25, DFR, and others
    • http://en.wikipedia.org/wiki/Okapi_BM25
    • http://terrier.org/docs/v3.5/dfr_description.html
  – Add your own
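To make the BM25 alternative concrete, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is not Lucene's implementation; the function name `bm25_scores`, the toy documents, and the parameter values `k1=1.2` and `b=0.75` (commonly used defaults) are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    Textbook Okapi BM25 with a smoothed IDF; an illustrative sketch only.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF that stays positive even for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus, already tokenized.
docs = [
    "solr is a search server built on lucene".split(),
    "lucene is a search library".split(),
    "bananas are yellow".split(),
]
scores = bm25_scores(["lucene", "search"], docs)
```

Both matching documents contain each query term once, so the shorter one scores higher because of the length normalization controlled by `b`.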

FS(A|T)
• Keys:
  – byte[] – write-once
  – Linear time build of min. automata
  – Compression, Reverse lookups
  – Weights (used for auto-suggest)
  – Pluggable Algebra
• Uses:
  – Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
  – FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
  – http://slidesha.re/vKtpVA, http://bit.ly/Pkjyu0
  – “Smaller Representation of Finite State Automata”
    • Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118-192.

Grab Bag
• Lots of new suggesters
  – Available in Solr
• Doc Values
  – Column-oriented store
  – Numeric and binary variants are updatable (coming to Solr soon)
• Overhauled term vectors APIs
  – Now look a lot like Terms

Solr 4: New Features
• Search/Faceting/Relevance
  – New Relevance Function Queries (tf, df, others)
  – Pivot Faceting
  – Pseudo-join
  – Improved Spatial (more later)
  – Full support for Lucene Codecs, pluggable scoring
• Indexing
  – New Update Processors, including scripting option
  – Near real time
• Schema and Config APIs + Schemaless
• Cursors (aka Deep Paging)
• Admin UI
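As a rough illustration of cursor-based deep paging, the sketch below follows Solr's `cursorMark` convention: start with `*`, sort on a field that includes the unique key, and keep requesting pages until the returned `nextCursorMark` stops changing. The helper names `iterate_with_cursor` and `fake_fetch_page` are hypothetical, and the in-memory fetcher stands in for a real HTTP call so the example runs without a Solr server.

```python
def iterate_with_cursor(fetch_page, query, sort="id asc", rows=2):
    """Yield every document by following cursor marks until they stop changing."""
    cursor = "*"  # documented starting value for the cursorMark parameter
    while True:
        params = {"q": query, "sort": sort, "rows": rows, "cursorMark": cursor}
        response = fetch_page(params)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            break  # same mark returned twice: no more results
        cursor = next_cursor


def fake_fetch_page(params):
    """Hypothetical in-memory stand-in for a request to a Solr select handler.

    Real cursor marks are opaque strings; here the mark is simply an offset.
    """
    ids = ["a", "b", "c", "d", "e"]
    start = 0 if params["cursorMark"] == "*" else int(params["cursorMark"])
    page = ids[start:start + params["rows"]]
    return {
        "response": {"docs": [{"id": i} for i in page]},
        "nextCursorMark": str(start + len(page)),
    }


all_ids = [doc["id"] for doc in iterate_with_cursor(fake_fetch_page, "*:*")]
```

Passing the fetcher in as a function keeps the paging loop separate from the transport, so the same loop could be pointed at a real client.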

Geospatial improvements
• Index shapes other than points (circles, polygons, etc.)
• More complex interactions than point in a circle
• Indexing:
  – "geo":"43.17614,-90.57341"
  – "geo":"Circle(4.56,1.23 d=0.0710)"
  – "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
  – fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
  – fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
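The filter strings above contain quotes, spaces, and parentheses, so they need URL encoding before they go into a request. The sketch below builds the `fq` value from the slide and encodes it with Python's standard library; the helper names `intersects_filter` and `select_query_string` are hypothetical, and no request is actually sent.

```python
from urllib.parse import parse_qs, urlencode


def intersects_filter(field, shape):
    """Build a filter of the form field:"Intersects(shape)" as shown on the slide."""
    return f'{field}:"Intersects({shape})"'


def select_query_string(q, filters):
    """Encode a main query plus any number of fq filters into a query string."""
    params = [("q", q)] + [("fq", f) for f in filters]
    return urlencode(params)


# Bounding box and polygon taken from the searching examples above.
bbox = intersects_filter("geo", "-74.093 41.042 -69.347 44.558")
polygon = intersects_filter(
    "geo", "POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
)
query_string = select_query_string("*:*", [bbox])

# Decoding the string recovers the original filter text.
decoded = parse_qs(query_string)
```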

Scaling Solr
• Distributed/sharded indexing & search
  – Auto distributes updates and queries to appropriate shards
  – Near Real Time (NRT) indexing capable
  – Document routing extensions
• Dynamically scalable
  – New SolrCloud instances add indexing and query capacity
  – Supports re-balancing (shard-splitting)
• Reliable
  – No single point of failure
  – Transactions logged
  – Robust, automatic recovery
• http://wiki.apache.org/solr/SolrCloud
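The idea behind automatically distributing updates to the appropriate shard can be sketched as hashing each document id into a fixed hash space that is divided evenly among the shards. This is a simplified illustration, not SolrCloud's router: the function name `shard_for` is hypothetical and CRC32 is used only as a convenient stand-in hash from the standard library.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Map a document id to a shard index by splitting a 32-bit hash space evenly."""
    # zlib.crc32 returns an unsigned 32-bit integer in Python 3 (stand-in hash).
    h = zlib.crc32(doc_id.encode("utf-8"))
    # Each shard owns one contiguous slice of the range [0, 2**32).
    return h * num_shards // 2**32


# The same id always lands on the same shard, so updates and lookups agree.
assignments = {f"doc{i}": shard_for(f"doc{i}", 4) for i in range(100)}
```

Because each shard owns a contiguous range, splitting a shard only needs to divide that one range rather than rehash every document.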

Solr as NoSQL
• Non-traditional data stores

• Not designed for SQL type queries

• Distributed fault tolerant architecture

• Document oriented, data format agnostic (JSON, XML, CSV, binary)

Go Deep!

APIs
• New APIs for Schema and Solr Config
  – XML becoming more of an implementation detail

• Managed Schema mode

• Data-driven schema (aka schemaless)

• Synonyms, stopwords, request handlers

Beyond Solr: LucidWorks Open Source
• Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr): https://github.com/LucidWorks/banana

• Data Quality Toolkit: https://github.com/LucidWorks/data-quality

• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash

Summary
• Lucene/Solr 4.x:
  – Faster
  – More Flexible
  – Easier than ever scaling
  – More reliable than ever

• Go forth and rank!

Resources
• Me
  – grant@lucidworks.com
  – @gsingers on Twitter
• LucidWorks
  – http://www.lucidworks.com
  – http://www.lucidworks.com/support-services/ask-the-experts/
