What's new in Lucene/Solr, presented by Grant Ingersoll at SolrExchange DC

What’s new in Lucene and Solr?
Grant Ingersoll
CTO, LucidWorks
Lucene/Solr Committer

Sink or Swim?

Search is good for…
• Traditional: Fast, fuzzy text matching across a large document collection
• De-normalized data
  – “light” relational
• Top N problems
  – Key-value (top 1)
  – Recommendations, “Good enough” classification, clustering
• Faceting, slicing and dicing of numerical/enumerated data
• Spatial, spell checking, record linkage, highlighting
• NoSQL

What’s New?

• Community

• Lucene

• Solr

Relax, You’re Among Friends
• Large, diverse search community with many non-traditional search engine usages
  – Object stores, Record linkage, Social, mobile -> web
• “The Apache Way”
  – Meritocracy – Those who do, decide!
• Always Be Testing
  – Randomized system tests are all the rage
  – http://vimeo.com/32087114
• Patches Welcome!

Acceleration!

Coming Soon: Lucene and Solr 4.8

Java 1.7

Lucene: Speed and Memory
• Native Near Real Time (NRT) support
  – Per segment
  – FieldCache can be controlled to only load new segments
  – Soft commit -- faster without fsync, allows quicker update visibility
• DWPT (Document Writer per Thread)
  – Faster, more consistent index speed
• Faster fuzzy & wildcard query processing
• Automatic compression of stored fields and term vectors
• String -> BytesRef
  – Much improved data structure
  – … means less memory and less garbage collection effort

Lucene: Flexibility
• Flexible Index Formats
  – New posting list codecs: Block, Simple Text, HDFS, etc.
  – Pulsing codec: improves performance of primary key searches, inlining docs, positions, and payloads, saves disk seeks
• Pluggable Scoring
  – Decoupled from TF/IDF
  – Built in alternatives include BM25 & DFR, and others
    • http://en.wikipedia.org/wiki/Okapi_BM25
    • http://terrier.org/docs/v3.5/dfr_description.html
  – Add your own
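To make the BM25 alternative mentioned above concrete, here is a minimal sketch of the Okapi BM25 formula in Python. It is not Lucene's implementation: the function name, the argument layout, and the defaults k1=1.2 and b=0.75 are assumptions chosen for illustration, and the idf variant (with 1 added inside the logarithm to keep it non-negative) is one common choice among several.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25 (illustrative sketch).

    doc_freqs maps a term to the number of documents that contain it.
    """
    term_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        freq = term_counts.get(term, 0)
        if freq == 0:
            continue  # a term absent from the document contributes nothing
        n = doc_freqs.get(term, 0)
        # Inverse document frequency, kept non-negative by the "1 +" inside the log.
        idf = math.log(1.0 + (num_docs - n + 0.5) / (n + 0.5))
        # Term-frequency saturation with document-length normalization.
        norm = freq + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * freq * (k1 + 1.0) / norm
    return score
```

With the same term frequency, a shorter document scores higher than a longer one, which is the length normalization controlled by b; k1 controls how quickly repeated occurrences of a term stop adding to the score.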

FS(A|T)
• Keys:
  – byte[] – write-once
  – Linear time build of min. automata
  – Compression, Reverse lookups
  – Weights (used for auto-suggest)
  – Pluggable Algebra
• Uses:
  – Term Dictionary, TokenStreams, Japanese, synonyms, spelling, others
  – FuzzyQuery is 100x faster -- http://bit.ly/hgO65c
• More:
  – http://slidesha.re/vKtpVA, http://bit.ly/Pkjyu0
  – “Smaller Representation of Finite State Automata”
    • Proc. of the 16th Inter. Conf. on Implementation and Application of Automata, CIAA'2011, vol. 6807, 2011, pp. 118-192.
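The idea behind the faster FuzzyQuery is to walk the term dictionary once and abandon whole branches that can no longer match, instead of computing an edit distance against every term. The sketch below illustrates that idea with a deliberately simpler stand-in: a plain trie plus one row of the Levenshtein table per trie node. It is not Lucene's FST or Levenshtein automaton, and the names build_trie and fuzzy_search are hypothetical.

```python
def build_trie(words):
    """Build a nested-dict trie; the key None marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = word
    return root


def fuzzy_search(trie, query, max_edits):
    """Return sorted (word, distance) pairs within max_edits of query."""
    results = []
    # Row 0 of the Levenshtein table: distance from the empty prefix.
    first_row = list(range(len(query) + 1))
    if None in trie and first_row[-1] <= max_edits:
        results.append((trie[None], first_row[-1]))

    def walk(node, ch, prev_row):
        # Extend the table by one row for the trie character ch.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1,          # skip a query character
                           prev_row[i] + 1,         # skip the trie character
                           prev_row[i - 1] + cost)) # match or substitute
        if None in node and row[-1] <= max_edits:
            results.append((node[None], row[-1]))
        # Prune: if every cell exceeds the budget, no descendant can match.
        if min(row) <= max_edits:
            for nxt, child in node.items():
                if nxt is not None:
                    walk(child, nxt, row)

    for ch, child in trie.items():
        if ch is not None:
            walk(child, ch, first_row)
    return sorted(results)
```

For example, with the terms lucene, lucent, solr, solar and search in the trie, a query for "lucene" with one allowed edit returns lucene (distance 0) and lucent (distance 1); the branches for the other terms are cut off after a few characters.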

Grab Bag
• Lots of new suggesters
  – Available in Solr
• Doc Values
  – Column oriented store
  – Numeric and binary variants are updatable (coming to Solr soon)
• Overhauled term vectors APIs
  – Now look a lot like Terms

Solr 4: New Features
• Search/Faceting/Relevance
  – New Relevance Function Queries (tf, df, others)
  – Pivot Faceting
  – Pseudo-join
  – Improved Spatial (more later)
  – Full support for Lucene Codecs, pluggable scoring
• Indexing
  – New Update Processors, including scripting option
  – Near real time
• Schema and Config APIs + Schemaless
• Cursors (aka Deep Paging)
• Admin UI
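As an illustration of the cursor feature listed above, the following sketch shows the client-side loop for deep paging. It relies on Solr's cursorMark request parameter and the nextCursorMark value in the response; everything else is an assumption for illustration, in particular the fetch_page callable (a hypothetical function that sends the parameters to a /select handler and returns the parsed JSON response).

```python
def iterate_with_cursor(fetch_page, base_params):
    """Yield every matching document by following Solr cursor marks.

    fetch_page is a caller-supplied (hypothetical) function: it takes a dict
    of request parameters and returns the parsed JSON response as a dict.
    base_params should carry q, rows and a sort that includes the uniqueKey
    field, for example {"q": "*:*", "rows": 100, "sort": "id asc"}.
    """
    params = dict(base_params)
    cursor = "*"  # "*" asks Solr to start from the beginning
    while True:
        params["cursorMark"] = cursor
        response = fetch_page(params)
        for doc in response["response"]["docs"]:
            yield doc
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:
            break  # an unchanged cursor means there are no more results
        cursor = next_cursor
```

Unlike paging with start and rows, the cost of each request does not grow with how deep into the result set the client already is, which is why cursors suit full result-set exports.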

Geospatial improvements
• Index shapes other than points (circles, polygons, etc)
• More complex interactions than point in a circle
• Indexing:
  – "geo":"43.17614,-90.57341"
  – "geo":"Circle(4.56,1.23 d=0.0710)"
  – "geo":"POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"
• Searching:
  – fq=geo:"Intersects(-74.093 41.042 -69.347 44.558)"
  – fq=geo:"Intersects(POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30)))"
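The indexing and searching fragments above can be assembled from a client roughly as follows. This is a sketch under assumptions: the field name geo and the shape strings come from the slide, while the document ids, the q=*:* main query and the helper name intersects_filter are made up for illustration. Nothing is sent to a server; the code only builds the JSON update body and the URL-encoded query string.

```python
import json
from urllib.parse import urlencode

# Documents whose "geo" field holds a point, a circle and a polygon.
# The ids are hypothetical; the shape strings are the ones from the slide.
docs = [
    {"id": "1", "geo": "43.17614,-90.57341"},
    {"id": "2", "geo": "Circle(4.56,1.23 d=0.0710)"},
    {"id": "3", "geo": "POLYGON((-10 30, -40 40, -10 -20, 40 20, 0 0, -10 30))"},
]
update_body = json.dumps(docs)  # JSON array suitable for an update request body


def intersects_filter(field, shape):
    """Build a filter query matching documents whose shape intersects `shape`."""
    return '{}:"Intersects({})"'.format(field, shape)


params = {
    "q": "*:*",
    "fq": intersects_filter("geo", "-74.093 41.042 -69.347 44.558"),
}
query_string = urlencode(params)  # percent-encodes quotes, colons and parentheses
```

Encoding the parameters with urlencode matters here because the filter contains quotes, spaces and parentheses that are not safe to place in a URL as-is.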

Scaling Solr
• Distributed/sharded indexing & search
  – Auto distributes updates and queries to appropriate shards
  – Near Real Time (NRT) indexing capable
  – Document routing extensions
• Dynamically scalable
  – New SolrCloud instances add indexing and query capacity
  – Supports re-balancing (shard-splitting)
• Reliable
  – No single point of failure
  – Transactions logged
  – Robust, automatic recovery
• http://wiki.apache.org/solr/SolrCloud
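The routing idea above (each update is sent to the shard that owns the document) can be sketched with a simple hash. This is only an illustration under assumptions: it uses CRC32 and a modulo as stand-ins rather than SolrCloud's actual hash function and per-shard hash ranges, and the function name shard_for is hypothetical. The "prefix!id" convention mimics the routing extension, where documents sharing a prefix are kept together.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard index for a document id (illustrative stand-in only).

    If the id has the form "prefix!rest", only the prefix is hashed, so all
    documents with the same prefix land on the same shard.
    """
    route_key = doc_id.split("!", 1)[0]  # whole id when there is no "!"
    return zlib.crc32(route_key.encode("utf-8")) % num_shards
```

Because the mapping is deterministic, any node can compute where a document belongs, which is what lets updates and queries be forwarded without a central coordinator. A plain modulo would reshuffle most documents when the shard count changes; splitting hash ranges, as in shard-splitting, avoids that.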

Solr as NoSQL
• Non-traditional data stores
• Not designed for SQL type queries
• Distributed fault tolerant architecture
• Document oriented, data format agnostic (JSON, XML, CSV, binary)

Go Deep!

APIs
• New APIs for Schema and Solr Config
  – XML becoming more of an implementation detail
• Managed Schema mode
• Data-driven schema (aka schemaless)
• Synonyms, stopwords, request handlers

Beyond Solr: LucidWorks Open Source
• Effortless AWS deployment and monitoring: http://www.github.com/lucidworks/solr-scale-tk
• Logstash for Solr: https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr): https://github.com/LucidWorks/banana
• Data Quality Toolkit: https://github.com/LucidWorks/data-quality
• Coming Soon for Big Data: Hadoop, Pig, Hive 2-way support w/ Lucene and Solr, different file formats, pipelines, Logstash

Summary
• Lucene/Solr 4.x:
  – Faster
  – More Flexible
  – Easier than ever scaling
  – More reliable than ever
• Go forth and rank!

Resources
• Me
  – grant@lucidworks.com
  – @gsingers on Twitter
• LucidWorks
  – http://www.lucidworks.com
  – http://www.lucidworks.com/support-services/ask-the-experts/
