OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve content

Upload: alkacon-software-gmbh

Post on 14-May-2015

DESCRIPTION

OpenCms 8.5 integrates Apache Solr. And not only for full text search, but as a powerful query engine as well. Imagine you want to show a list of "all resources of type news, that have changed since yesterday, where property X has the value Y" on your web page. Sure, there are API methods in OpenCms to load resources based on the type, on the date of change, or on the value of a specific property. But for many common use case combinations, there is no single API call. This means if you create a collector, you often end up sorting out the results of the initial API query in code. In this session, Rüdiger will show how Apache Solr has been integrated in OpenCms 8.5. He will explain how to create improved front-end full text search functions with advanced options like faceting and spell check suggestions. And he will explain how to use Solr to directly read resources from the OpenCms VFS, allowing query combinations that combine resource attributes, properties and content in a powerful new way.

TRANSCRIPT

Page 1: OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve content

Rüdiger Kurz, Alkacon Software

WORKSHOP TRACK

Using Apache Solr to retrieve content

25.09.2012

Project Collaboration

Agenda

1. What is Solr?
2. Benefits
3. Searching
4. Indexing
5. Configuration

Retrieving data fast

● Apache Solr is hopefully not able to answer this question!
● But it will return the results in less than a second

What is Apache Solr?

● Solr is an enterprise search platform from the Apache Lucene project
● Solr is highly scalable, providing distributed search and index replication
● Solr powers the search and navigation features
● Major features include
  ● Powerful full-text search
  ● Hit highlighting
  ● Faceted search
  ● Rich document (e.g., Word, PDF) handling

What is faceted search?

● Faceted search is the dynamic clustering of items or search results into categories that let users drill into search results (or even skip searching entirely)
● Each facet displayed typically shows the number of hits that match that category
● Users can then "drill down" by applying specific constraints to the search results
● Faceted search is also called faceted browsing, faceted navigation, guided navigation and sometimes parametric search
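The idea of facet counts and drill-down can be illustrated without a running Solr server. The following is a minimal sketch in plain Java, not the OpenCms or Solr implementation: the `Hit` class, the sample paths, and both method names are hypothetical and exist only for illustration. It groups a list of hits by resource type (the facet), counts the hits per facet value, and then applies one value as a constraint.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class FacetCountSketch {

    /** Hypothetical stand-in for a search hit: a path plus its resource type. */
    public static final class Hit {
        final String path;
        final String type;

        public Hit(String path, String type) {
            this.path = path;
            this.type = type;
        }
    }

    /** Facet counts: how many hits match each value of the "type" facet. */
    public static Map<String, Integer> countByType(List<Hit> hits) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Hit hit : hits) {
            counts.merge(hit.type, 1, Integer::sum);
        }
        return counts;
    }

    /** Drill-down: apply one facet value as a constraint on the result list. */
    public static List<Hit> drillDown(List<Hit> hits, String type) {
        return hits.stream()
            .filter(hit -> hit.type.equals(type))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample data, invented for this sketch
        List<Hit> hits = List.of(
            new Hit("/flowers/rose.html", "v8flower"),
            new Hit("/flowers/tulip.html", "v8flower"),
            new Hit("/texts/intro.html", "v8textblock"),
            new Hit("/index.html", "containerpage"));

        System.out.println(countByType(hits));
        System.out.println(drillDown(hits, "v8flower").size());
    }
}
```

In a real setup Solr computes these counts on the server side; the sketch only shows what the numbers next to each facet value mean.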

What is Faceted Search?

● "Resource types" is a facet, a way of categorizing the results
● containerpage, v8flower, v8textblock, … are constraints, or facet values
● The breadcrumb trail shows what constraints have already been applied and allows their removal
● The facet count shows how many results match each value
● The tag bar shows other facet values of the found document that can be applied
● Regular search results

Benefits

Database as bottleneck

● Databases are proprietary
● They require elaborate infrastructure
● SQL queries are hard to formulate
● SQL queries against a database are slower than search queries
● A large number of SQL statements turns the database into a bottleneck
● Even lower-traffic sites slow down when too many statements are executed on the DB layer

Overall performance starts to degrade

Content retrieval so far

● OpenCms stores the content in an RDBMS
● To access values of an XML content you have to perform the following steps:

Step                      Data involved
1. Read the resource      Resource (dates, refs, attr)
2. Read binary content    Content (blob)
3. Un-marshal content     Marshaled XML
4. Access with getters    Java Access Bean

The new way of content retrieval

● "Read" the whole resource content with a single query
● Simpler data structure by storing documents
● New flexibility through the power of the Solr query syntax
● Best performance based on an optimized index
● HTTP interface for external applications
● Secure, scalable and cost-effective access
● Reduced DB traffic and increased performance

OpenCms 8.5 Solr Integration

Searching

Search with Solr in OpenCms

● Querying OpenCms content using the power of Solr's query syntax

1. Send an HTTP request to the Solr request handler
2. Use the new Solr Collector
3. Call the Java API search method

OpenCms Solr handler

● The REST-like interface of Solr lets you access indexed documents over HTTP without any knowledge of CMS-specific syntax
● A permission check is performed by OpenCms, making sure no secured documents are returned
● Solr-based UI frameworks like "Ajax Solr" can be used on your website without development costs
● It provides an open interface for external applications, e.g. mobile applications

Examples: REST / JAVA / Collector

1. REST: request the Solr handler over HTTP

http://localhost:8080/opencms/opencms/handleSolrSelect?fq=type:v8flower

2. Collector: load contents in a JSP with the byQuery collector

<cms:contentload collector="byQuery" param="type:v8flower">
    <cms:contentaccess var="content" />
    ${content.value.Title}
</cms:contentload>

3. Java: call the search method of the Solr index

// Look up the Solr index by name and run the query with the current CmsObject
CmsObject cms = getCmsObject();
String query = "fq=type:v8flower";
CmsSearchManager manager = OpenCms.getSearchManager();
CmsSolrIndex index = manager.getIndexSolr("Solr Online");
CmsSolrResultList results = index.search(cms, query);
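When the handler URL is assembled in code rather than typed by hand, the filter query has to be URL-encoded. The following is a minimal sketch using only the Java standard library. The handler address is the one from the REST example; the class name, the method name, and the use of the standard Solr `rows` parameter are assumptions made for illustration, and whether the OpenCms handler honors `rows` should be checked against the OpenCms documentation.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SolrSelectUrlSketch {

    /** Handler address taken from the REST example on the slide. */
    static final String HANDLER =
        "http://localhost:8080/opencms/opencms/handleSolrSelect";

    /**
     * Builds a request URL with an encoded filter query.
     * "rows" is a standard Solr parameter; its support by the handler is assumed here.
     */
    public static String selectUrl(String filterQuery, int rows) {
        String encoded = URLEncoder.encode(filterQuery, StandardCharsets.UTF_8);
        return HANDLER + "?fq=" + encoded + "&rows=" + rows;
    }

    public static void main(String[] args) {
        // The colon in "type:v8flower" is percent-encoded as %3A
        System.out.println(selectUrl("type:v8flower", 10));
    }
}
```

The resulting string can be passed to any HTTP client, which is what makes the handler usable from external applications.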

Live Demo

Indexing

Indexed data

● Data indexed by default (hard coded)
● Field configuration (opencms-search.xml)
● XSD field mapping (Content definition)
● Implement a custom field configuration (Java)

Solr schema

● The schema file contains all of the details about which fields your documents can contain
● OpenCms uses an adjusted version of the schema.xml that is contained within the Apache Solr standard distribution: WEB-INF/solr/conf/schema.xml
● If you want to add a new custom field or field type for documents, you can modify this file

Advantages of field types

● Types are checked during the indexing process
● Typed fields enable easy range queries, even for dates, which makes a developer's life considerably easier
● Custom types can be added, e.g. key/value tuples or some special JSON fields
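A range query on a date field can be written as `field:[from TO to]`, with both bounds as UTC timestamps in ISO-8601 form. The following is a minimal sketch that builds such a clause with `java.time`; the class and method names are hypothetical, and `lastmodified` is used as the field name on the assumption that it is indexed as a date field, as the default field list suggests.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class DateRangeQuerySketch {

    /** Builds an inclusive range clause of the form field:[from TO to]. */
    public static String dateRange(String field, Instant from, Instant to) {
        // Instant.toString() yields UTC ISO-8601; truncating drops fractional seconds
        String lower = from.truncatedTo(ChronoUnit.SECONDS).toString();
        String upper = to.truncatedTo(ChronoUnit.SECONDS).toString();
        return field + ":[" + lower + " TO " + upper + "]";
    }

    public static void main(String[] args) {
        // Everything modified within the last 24 hours
        Instant now = Instant.now();
        Instant yesterday = now.minus(1, ChronoUnit.DAYS);
        System.out.println(dateRange("lastmodified", yesterday, now));
    }
}
```

Because the field is typed as a date, Solr compares the bounds as points in time rather than as strings, which is what makes this kind of query reliable.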

Default indexed data

● id - Structure id used as unique identifier for a document (the structure id of the resource)
● path - Full root path (the root path of the resource, e.g. /sites/default/flower_en/.content/article.html)
● path_hierarchy - The full path as a path-tokenized field (field type: text_path)
● parent-folders - Parent folders (multi-valued field containing an entry for each parent path)
● type - Type name (the resource type name)
● res_locales - Existing locale nodes for XML content and all available locales in case of binary files
● created - The creation date (the date when the resource itself was created)
● lastmodified - The date last modified (the last modification date of the resource itself)
● contentdate - The content date (the date when the resource's content was modified)
● released - The release and expiration date of the resource
● content - A general content field that holds all extracted resource data (all languages, type text_general)
● contentblob - The serialized extraction result, stored to improve the extraction performance while indexing
● category - All categories as general text
● category_exact - All categories as exact strings for faceting
● text_<locale> - Extracted textual content optimized for language-specific search
● timestamp - The time when the document was last indexed
● *_prop - All properties of a resource as searchable and stored text (<Property_Definition_Name>_prop)
● *_exact - All properties of a resource as exact, not stored strings (<Property_Definition_Name>_exact)
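With these default fields, a combined request such as "all resources of type news that have changed since yesterday, where property X has the value Y" becomes a single query string. The following is a minimal sketch that assembles it in plain Java. The class and method names, the type name `news`, and the property `X` are placeholders; the field names `type`, `lastmodified`, and the `_prop` suffix come from the list above, and `NOW-1DAY` is Solr date math.

```java
import java.util.ArrayList;
import java.util.List;

public class CombinedQuerySketch {

    /** Maps a property definition name to its default index field (<name>_prop). */
    public static String propertyField(String propertyName) {
        return propertyName + "_prop";
    }

    /** Quotes a value as a phrase, escaping backslashes and double quotes. */
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Resources of one type, modified within the last day, with property = value. */
    public static String changedSinceYesterday(String type, String property, String value) {
        List<String> clauses = new ArrayList<>();
        clauses.add("type:" + type);
        clauses.add("lastmodified:[NOW-1DAY TO NOW]");
        clauses.add(propertyField(property) + ":" + quote(value));
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(changedSinceYesterday("news", "X", "Y"));
    }
}
```

The resulting string could be used as the `fq` value of the request handler or as the `param` of the byQuery collector, which replaces the manual filtering that a classic collector would need.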

XSD field mapping

● Additional field mappings for XML contents can now be configured directly within the XSD schema
  ● Without modifying opencms-search.xml
  ● No restart of the servlet container required

<searchsetting element="DisplayDate" searchcontent="false">
    <solrfield targetfield="myDisplayDateField" sourcefield="*_dt" />
</searchsetting>
<searchsetting element="Teaser">
    <solrfield targetfield="ateaser">
        <mapping type="item" default="Homepage n.a.">Homepage</mapping>
        <mapping type="property-search">search.special</mapping>
        <mapping type="dynamic" class="my.DynamicMapping">special</mapping>
    </solrfield>
</searchsetting>

Configuration

Enable Solr in OpenCms

● Solr is enabled by default on a fresh installation of OpenCms 8.5, but disabled after updating an existing system to OpenCms 8.5
● To enable Solr after updating, you must create a Solr home directory in the WEB-INF folder of your OpenCms application
● Copy the solr/ folder from the OpenCms standard distribution as a starting point for your configuration
● All search configuration is done as usual in the opencms-search.xml below WEB-INF/config
● Adding the following lines will enable the embedded server

<opencms><search>
    <solr enabled="true"/> […]
</search></opencms>

Search index configuration

● You can add a custom Solr index with the familiar OpenCms search configuration syntax
● NOTE: class attributes are needed for the index and its field configuration

<index class="org.opencms.search.solr.CmsSolrIndex">
    <name>Solr Online</name>
    <rebuild>auto</rebuild>
    <project>Online</project>
    <locale>all</locale>
    <configuration>solr_fields</configuration>
    <sources>
        <source>solr_source</source>
    </sources>
</index>

Create field configuration (1/3)

● To convert an existing field configuration:

1. Copy a <fieldconfiguration> node
2. Change / set the class attribute
3. Optionally add a type attribute for fields

<fieldconfiguration class="org.opencms.search.solr.CmsSolrFieldConfiguration">
    <name>example</name>
    <description>Converted Lucene Index</description>
    <fields>
        <field name="meta" store="false" index="true" type="en">
            <mapping type="property">Title</mapping>
            <mapping type="property">Description</mapping>
        </field>
    </fields>
</fieldconfiguration>

Create field configuration (2/3)

● As the value for the type attribute of a field definition inside the opencms-search.xml you can use the name of any dynamic field defined in the schema.xml
● For example:

Value of type attribute    Solr field type
i                          int
dt                         date
txt                        text_general
en                         text_en
es                         text_es
fr                         text_fr

Create field configuration (3/3)

● As previously said, the field names are defined in the schema.xml of Solr (<solr_name>); now we define additional fields inside the opencms-search.xml (<opencms_name>)
● How does that work?

String fieldName = <opencms_name>_txt;
if (existsInSolrSchema(fieldName)) {
    fieldName = <opencms_name>;
} else if (isTypeAttributeSet()) {
    fieldName = <opencms_name>_<type>;
}
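The naming rule can be sketched as a small runnable method. This is an illustration, not the OpenCms source: the class name, the method name, and the representation of the schema as a set of explicit field names are hypothetical. It also assumes that the schema check is meant to apply to the plain `<opencms_name>`, whereas the pseudocode passes the `_txt` name to `existsInSolrSchema`.

```java
import java.util.Set;

public class FieldNameSketch {

    /**
     * Resolves the Solr field name for a field configured in opencms-search.xml.
     *
     * @param openCmsName  the name attribute of the configured field
     * @param type         the optional type attribute (a dynamic field suffix), may be null
     * @param schemaFields explicit field names from schema.xml (stand-in for the real schema)
     */
    public static String resolve(String openCmsName, String type, Set<String> schemaFields) {
        if (schemaFields.contains(openCmsName)) {
            // Assumption: an explicit schema field with the same name wins
            return openCmsName;
        }
        if (type != null && !type.isEmpty()) {
            // A type attribute selects the matching dynamic field, e.g. meta_en
            return openCmsName + "_" + type;
        }
        // Default: general text dynamic field
        return openCmsName + "_txt";
    }

    public static void main(String[] args) {
        Set<String> schema = Set.of("content", "path", "type");
        System.out.println(resolve("meta", "en", schema));
        System.out.println(resolve("content", "en", schema));
        System.out.println(resolve("meta", null, schema));
    }
}
```

Under these assumptions, the field `meta` with `type="en"` from the converted configuration would be indexed as `meta_en`.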

Live Demo

Future steps with IKS and Stanbol

● With Solr and VIE integrated into OpenCms, we are well prepared to start using Apache Stanbol
● Stanbol is a top-level Apache project
● Stanbol guarantees a quality standard
● Stanbol opens the perspective of sustainability
● We are looking to integrate Stanbol into OpenCms 9

Live Demo

Integration Conclusion

● Permission checked search (secure)
● Solr Request handler (accessible)
● Solr Collector (integrated)
● Result highlighting (user-friendly)
● Configuration opportunities (flexible)
● Search field mapping (sensitive)
● Type based field schema (type-safe)
● Lucene conversion (compatible)

Thank you very much for your attention!

Rüdiger Kurz
Alkacon Software GmbH

http://www.alkacon.com
http://www.opencms.org
http://www.iks-project.eu
http://stanbol.apache.org

Any Questions?