hibernate search 5: adding full-text query super-powers to your jpa!

Hibernate Search 5: Adding Full-Text Query Super-Powers to Your

Sanne GrinoveroPrincipal Software Engineer at Red Hat

@SanneGrinovero

Organised and Sponsored by

● What's new in Hibernate● What is “full-text” and how it can help you● Get you up to speed with Hibernate Search

Who am I?Engineering team: Hibernate

– Hibernate Search project lead

– Hibernate OGM team

– Hibernate ORM● Query parser and performance

– Infinispan● Contributor for fun and need● Designed some recent improvements● Driving some of the Hibernate and

Apache Lucene integrations– @SanneGrinovero on Twitter

Hibernate ORM

● Hibernate 5 is coming● Small performance improvements

every week● Jandex work● Metamodel work● New query parser

Hibernate Validator

● Bean Validation spec work● Improved Java8 support

– See version 5.2 (work in progress)

Hibernate OGM

● Just released the first stable version

● Lots of new features coming

Hibernate Search

● Just released version 5.0.0.Final in December– (and 5.0.1.Final this week)

● Let's see...

The Search problem

● Who searches, doesn't know what he Searches:– Please, don't ask the user to give up the primary key

– Doesn't know the exact content of the document either

Hibernate model example

The stupid search engine

The dumb search engine

Does it work? How about these:

String author = “Fabrizio De André”String title = “Nuvole barocche”List<Product> list = s.createQuery( “ ...? “ );

String author = “De André, Fabrizio”String title = “Nuvole barocche”List<Product> list = s.createQuery( “ ...? “ );

String author = “De Andre, Fabrizio”String title = “Nuvole barocche”List<Product> list = s.createQuery( “ ...? “ );

More requirements

● Unique search input– Might contain either/both author, title names

– Both entities might be composed of multiple terms

● Relevance– Produts matching both should be listed on top

– Exact word matches should be scored better● So you need approximate word matches?

● How about typos?

List<Products> list = s.createQuery(“ ...? “).setParameter(“F. de André nuvole barocche”).list();

● Mixed case, accents● Relative order of terms, distance● Abbreviations, typos● Match on multiple fields● 18,800 results in 0,41 seconds

More useful stuff:

● Similarity:– hibernate ~ hybernat

● Proximity, synonyms, abbreviations:– 'JPA' or 'Java Persistence API'

● Boosting, field adjustments:– A match in the title is more “worth” than in the text content?

● Stemming (is language specific!)

Would you still use Google..

If it returned matches in alphabetical order?

“hibernate search”About 3.580.000 results (0.04 seconds)

So what..● The database is not a good fit

– SQL is not appropriate for the task

– Still SQL is very handy for other tasks

● Need to use the best tool for each task:– Relational databases

– filesystems

– Fulltext search engines

– key-value stores

● Need to integrate them, keep data integrity and consistency

Apache Lucene

● Open source Apache™ top level project,● Very advanced implementation● Main language is Java, ports exists in other languages● Rich open source ecosystem around it● Many products embed it

Some notable users of Lucene(on top of many JBoss projects)

Apache Lucene

● Similarity● Sinonyms● Stemming● Stopwords● TermVectors● MoreLikeThis● Faceted Search● Speed!

● ... and full-text

Similarity

● N-Grams (edit distance)● Phonetic (Soundex™)● Any custom...

Lucene: Synonyms (or close)

● Can be applied at “index time”● at “query time”● Requires a vocabulary

– WordNet

newspaper ⁓ daily ⁓ journal

Journal ??⁓ newspaper

Lucene: Stemming

continuait ⁓ continucontinuation ⁓ continu

continué ⁓ continucontinuelle ~ continuel

Lucene: Stopwords

● Removes terms which are frequently used: not suited as search keywords – might depend on your domain!

a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your

Apache Lucene: Index

● It requires an Index– On filesystem

– In memory

– ...

● Made of immutable segments– Optimized for search speed, not for updates

● A world of strings and frequencies

So, back to our database..

● The index structure is deeply different than a relational database – not everything is possible

● You need to keep the data in sync– In case they are not, which one should be trusted more?

● How do queries look like?● What do queries return?

Different worlds

● A Lucene Document, is unstructured (schemaless), something close to

Map<String,String>● An Hibernate model is structured to be functional as

representation of your business model● Entities returned by an EntityManager or Hibernate

Session are managed, to keep the database in sync● A bridge is needed

The data mismatch

The architectural mismatch

Quickstart Hibernate Search

● Add hibernate-search dependency:

<groupId>org.hibernate</groupId>

<artifactId>hibernatesearchorm</artifactId>

<version>5.0.0.Final</version>

</dependency>

● Any other configuration is optional:

– Where to store indexes– Extension modules, custom analyzers– Performance tuning– Advanced mapping– Clustering

● JGroups● Infinispan● JMS

Quickstart Hibernate Search@Entitypublic class Essay { @Id public Long getId() { return id; }

public String getSummary() { return summary; } @Lob public String getText() { return text; } @ManyToOne public Author getAuthor() { return author; }...

@Entity @Indexedpublic class Essay { @Id public Long getId() { return id; }

public String getSummary() { return summary; } @Lob public String getText() { return text; } @ManyToOne public Author getAuthor() { return author; }...

@Entity @Indexedpublic class Essay { @Id public Long getId() { return id; } @Field public String getSummary() { return summary; } @Lob public String getText() { return text; } @ManyToOne public Author getAuthor() { return author; }...

@Entity @Indexedpublic class Essay { @Id public Long getId() { return id; } @Field public String getSummary() { return summary; } @Lob @Field @Boost(0.8) public String getText() { return text; } @ManyToOne public Author getAuthor() { return author; }...

@Entity @Indexedpublic class Essay { @Id public Long getId() { return id; } @Field public String getSummary() { return summary; } @Lob @Field @Boost(0.8) public String getText() { return text; } @ManyToOne @IndexedEmbedded public Author getAuthor() { return author; }...

String[] productFields = {"summary", "author.name"};

Query luceneQuery = // query builder or any Lucene Query

FullTextEntityManager ftEm = Search.getFullTextEntityManager(entityManager);

FullTextQuery query = ftEm.createFullTextQuery( luceneQuery, Product.class );

List<Product> items = query.setMaxResults(100).getResultList();

int totalNbrOfResults = query.getResultSize();

TotalNbrOfResults= 8.320.000(0.20 seconds)

Creating a Lucene Query with the DSL

Results

● Managed POJO: updates are applied to both Lucene and database

● JPA pagination, known APIs:– .setMaxResults( 20 ).setFirstResult( 100 );

● Type restrictions, polymorphic fulltext queries:– .createQuery( luceneQuery, A.class, B.class, ..);

● Projection

● Result mapping

Filters: declarative, stacking, reusable

Filters

FullTextQuery ftQuery = s // s is a FullTextSession

.createFullTextQuery( query, Product.class )

.enableFullTextFilter( "minorsFilter" )

.enableFullTextFilter( "specialDayOffers" )

.setParameter( "day", “20110218” )

.enableFullTextFilter( "inStockAt" )

.setParameter( "location", "londonhighbury" );

List<Product> results = ftQuery.list();

Advanced text analysis@Entity @Indexed

@AnalyzerDef(name = "frenchAnalyzer", tokenizer =

@TokenizerDef(factory=StandardTokenizerFactory.class),filters = {

@TokenFilterDef(factory = LowerCaseFilterFactory.class),

@TokenFilterDef(factory = SnowballPorterFilterFactory.class,

params = {@Parameter(name = "language", value = "French")})

public class Book {

@Field(index=Index.TOKENIZED, store=Store.NO)

@Analyzer(definition = "frenchAnalyzer")

private String title;

More...

● @Boost & @DynamicBoost● @AnalyzerDiscriminator● @DateBridge(resolution=Resolution.MINUTE)● @ClassBridge & @FieldBridge● @Similarity● Automatic Index optimization● Sharding, sharding filters

Faceted Search

“More Like This” queries

@Spatial: geospatial distance filtering and sorting

Queue-based clustering(via filesystem)

Index stored in Infinispan

Infinispan

● Open source highly scalable data grid platform– Distribution or Replication

– Sync or Async

– Transactional

– Persists contents using a CacheLoader● Write-through or write-behind● Shared or per cluster node

– Hibernate second-level cache

– State of the art eviction strategies

– Technology used by WildFly and JBoss EAP for many clustering features

Hibernate & Infinispan

● Infinispan uses Hibernate Search:– Query

– Lucene Directory

● Hibernate:– Infinispan 2nd level cache

● Hibernate Search:– Infinispan based DirectoryProvider

– Queues (JMS, JGroups) for clustered writing

Also integrated / used in:

● WildFly

● CapeDwarf

● Hibernate OGM

21/01/15

Hibernate and Infinispan sharing a powerful Search Engine

What's new in Hibernate Search 5.0 ?

● Apache Lucene 4.x– Extremely demanded

– Big transition

– Performance!

● MoreLikeThis queries● Experimental OSGi support

(testing Apache Karaf)

● Stable API● Improved integrations

– JBoss Modules

– Extension points

Let's see some coding?

Anyone up to participate in the project?

About products

● The products● JBoss Enterprise Application

Platform

● Red Hat JBoss Data Grid http://red.ht/data-grid

● JBoss Web Framework Kit

● The projects

Hibernate● Hibernate Search● Hibernate OGM● Apache Lucene● Infinispan

@Hibernate@SanneGrinovero

http://hibernate.orghttp://in.relation.tohttp://jboss.org

Organised and Sponsored by

hibernate search 5: adding full-text query super-powers to your jpa!

search problem

stupid search engine

dumb search engine

nuvole barocchelist

fabriziostring title

string author

apache lucene integrations

andrstring title

Technology

performance tuning with jpa 2.1 and hibernate (geecon prague...

jpa query guide (v5.1)

módulo ii padrão de persistência jpa · jpa transitive...

· hibernate configuration and mapping what orm and its...

spring + jpa + hibernateumeshk29.yolasite.com/resources/4 -...

introducing hibernate ogm: porting jpa applications to...

10 hibernate jpa

hibernate using jpa

mapping objects with jpa - computer science · mapping...

introduction... · introduction while much of the work in...

guia hibernate y jpa vehialpes

mastering jpa performance - java forum stuttgart · das...

spring + jpa + hibernate...jpa – java persistence api jee...

7 – introduction à jpa et hibernate auteurs:...

hibernate query languagehibernate query language and native...

introduction to jpa and hibernate including examples

hibernate ogm - jpa for infinispan and nosql

using jpa applications in the era of nosql: introducing...

formation jpa avancé / hibernate gratuite par ippon 2014

hibernate with jpa