OpenCms Days 2012 - OpenCms 8.5: Using Apache Solr to retrieve content
DESCRIPTION
OpenCms 8.5 integrates Apache Solr, not only for full-text search but also as a powerful query engine. Imagine you want to show a list of "all resources of type news that have changed since yesterday, where property X has the value Y" on your web page. Sure, there are API methods in OpenCms to load resources based on the type, on the date of change, or on the value of a specific property. But for many common combinations of these criteria there is no single API call. This means that if you write a collector, you often end up filtering the results of the initial API query in code. In this session, Rüdiger will show how Apache Solr has been integrated into OpenCms 8.5. He will explain how to create improved front-end full-text search functions with advanced options like faceting and spell-check suggestions. And he will explain how to use Solr to read resources directly from the OpenCms VFS, allowing queries that combine resource attributes, properties and content in a powerful new way.
TRANSCRIPT
Rüdiger Kurz, Alkacon Software
WORKSHOP TRACK
Using Apache Solr to retrieve content
25.09.2012
Project Collaboration
1. What is Solr?
2. Benefits
3. Searching
4. Indexing
5. Configuration
Agenda
● Apache Solr is hopefully not able to answer this question!
● BUT it will return the results in less than a second
Retrieving data fast
● Solr is an enterprise search platform from the Apache Lucene project
● Solr is highly scalable, providing distributed search and index replication
● Solr powers the search and navigation features of many of the world's largest internet sites
● Major features include
● Powerful full-text search
● Hit highlighting
● Faceted search
● Rich document (e.g., Word, PDF) handling
What is Apache Solr?
● Faceted search is the dynamic clustering of items or search results into categories
● These categories let users drill into search results (or even skip searching entirely)
● Each facet displayed typically shows the number of hits that match that category
● Users can then “drill down” by applying specific constraints to the search results
● Faceted search is also called faceted browsing, faceted navigation, guided navigation and sometimes parametric search
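To make the terms concrete, here is a minimal Java sketch of facet counts and drill-down, computed in memory. It is an illustration only: Solr computes facet counts from its index, not like this. The Doc class and the sample paths are hypothetical; the type names are the ones shown in the screenshot slide.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Illustration only: shows what "facet count" and "drill down" mean.
class FacetSketch {

    // Hypothetical stand-in for an indexed document: a path and a resource type.
    static final class Doc {
        final String path;
        final String type;

        Doc(String path, String type) {
            this.path = path;
            this.type = type;
        }
    }

    // Facet counts: how many results fall under each value of the "type" facet.
    static Map<String, Long> facetCounts(List<Doc> docs) {
        return docs.stream()
            .collect(Collectors.groupingBy(d -> d.type, TreeMap::new, Collectors.counting()));
    }

    // Drill-down: apply one facet value as a constraint on the result list.
    static List<Doc> drillDown(List<Doc> docs, String type) {
        return docs.stream()
            .filter(d -> d.type.equals(type))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Doc> results = List.of(
            new Doc("/flower_en/rose.html", "v8flower"),
            new Doc("/flower_en/tulip.html", "v8flower"),
            new Doc("/index.html", "containerpage"));
        System.out.println(facetCounts(results));
        System.out.println(drillDown(results, "v8flower").size());
    }
}
```

Each map entry corresponds to one facet value with its count; calling drillDown with one of the keys corresponds to clicking that value in the facet list.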
What is faceted search?
What is Faceted Search?
“Resource types” is a facet, a way of categorizing the results
containerpage, v8flower, v8textblock, … are constraints, or facet values
The breadcrumb trail shows which constraints have already been applied and allows their removal
The facet count shows how many results match each value
The tag bar shows other facet values of the found document that can be applied
Regular search results
Benefits
● Databases are proprietary
● They require elaborate infrastructure
● SQL queries are hard to formulate
● SQL queries against the DB are slower than search queries
● A large number of SQL statements turns the DB into a bottleneck
● Even low-traffic sites slow down when too many statements are executed on the DB layer
Overall performance starts to degrade
Database as bottleneck
● OpenCms stores the content in an RDBMS
● To access the values of an XML content you have to perform the following steps:
Content retrieval so far
1. Read the resource → Resource (dates, refs, attr)
2. Read binary content → Content (blob)
3. Un-marshal content → Marshaled XML
4. Access with getters → Java Access Bean
● “Read” the whole resource content with a single query
● Simpler data structures by storing documents
● New flexibility through the power of the Solr query syntax
● Best performance based on an optimized index
● HTTP interface for external applications
● Secure, scalable and cost-effective access
● Reduced DB traffic and increased performance
The new way of content retrieval
OpenCms 8.5 Solr Integration
Searching
● Querying OpenCms content using the power of Solr’s query syntax
1. Send an HTTP request to the Solr request handler
2. Use the new Solr Collector
3. Call the Java API search method
Search with Solr in OpenCms
● The REST-like interface of Solr lets you access indexed documents over HTTP without any knowledge of CMS-specific syntax
● OpenCms performs a permission check, making sure that no protected documents are returned
● Solr-based UI frameworks like “Ajax Solr” can be used on your website without development costs
● Provides an open interface for external applications, e.g. mobile applications
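As a minimal sketch of how an external application might address the handler, the following Java class builds a request URL with properly encoded parameters. The handler URL and the fq parameter come from the example slide; rows is a standard Solr parameter, and that the OpenCms handler passes it through is an assumption. The class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Sketch: builds a URL for the OpenCms Solr handler; it does not send the request.
class SolrHandlerUrl {

    // Handler URL as shown on the slides; adjust host and context path to your installation.
    static final String HANDLER = "http://localhost:8080/opencms/opencms/handleSolrSelect";

    // Appends the parameters as a query string, URL-encoding every name and value.
    static String build(Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(encode(e.getKey()) + "=" + encode(e.getValue()));
        }
        return HANDLER + "?" + query;
    }

    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("fq", "type:v8flower"); // filter query from the slides
        params.put("rows", "10");          // assumption: standard Solr paging parameter
        System.out.println(build(params));
    }
}
```

A map holds only one value per parameter name; a request with several fq parameters would need a list of name/value pairs instead.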
OpenCms Solr handler
Examples: REST / JAVA / Collector
1. REST
http://localhost:8080/opencms/opencms/handleSolrSelect?fq=type:v8flower
2. Collector
<cms:contentload
collector="byQuery"
param="type:v8flower">
<cms:contentaccess var="content" />
${content.value.Title}
</cms:contentload>
3. Java
CmsObject cms = getCmsObject();
String query = "fq=type:v8flower";
CmsSearchManager manager = OpenCms.getSearchManager();
CmsSolrIndex index = manager.getIndexSolr("Solr Online");
CmsSolrResultList results = index.search(cms, query);
Live Demo
Indexing
● Data indexed by default (hard coded)
● Field configuration (opencms-search.xml)
● XSD field mapping (Content definition)
● Implement a custom field configuration (Java)
Indexed data
● The schema file contains all of the details about which fields your documents can contain
● OpenCms uses an adjusted version of the schema.xml that is contained in the Apache Solr standard distribution: WEB-INF/solr/conf/schema.xml
● If you want to add a new custom field or field type for documents, you can modify this file
Solr schema
● Types are checked during the indexing process
● Typed fields enable easy range queries, even for dates, which makes a developer's life much easier
● Custom types can be added, e.g. key/value tuples or special JSON fields
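A range query on a date field can be sketched as follows. This is a minimal example, assuming the lastmodified field is indexed as a Solr date field; the class and method names are hypothetical. Solr expects dates in ISO-8601 UTC form and uses the syntax field:[from TO to] for inclusive ranges.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

// Sketch: formats an inclusive Solr range clause for a date field.
class DateRangeQuery {

    // Solr date fields use ISO-8601 in UTC, e.g. 2012-09-24T00:00:00Z.
    static String solrDate(Instant t) {
        return DateTimeFormatter.ISO_INSTANT.format(t.truncatedTo(ChronoUnit.SECONDS));
    }

    // Inclusive range clause: field:[from TO to]
    static String range(String field, Instant from, Instant to) {
        return field + ":[" + solrDate(from) + " TO " + solrDate(to) + "]";
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // Everything modified within the last 24 hours, with explicit dates.
        System.out.println(range("lastmodified", now.minus(1, ChronoUnit.DAYS), now));
        // Solr's date math expresses the same idea without client-side dates.
        System.out.println("lastmodified:[NOW-1DAY TO NOW]");
    }
}
```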
Advantages of field types
● id - Structure ID of the resource, used as the unique identifier of a document
● path - Full root path of the resource (e.g. /sites/default/flower_en/.content/article.html)
● path_hierarchy - The full path as a tokenized path (field type: text_path)
● parent-folders - Parent folders (multi-valued field containing an entry for each parent path)
● type - Type name (the resource type name)
● res_locales - Existing locale nodes for XML content, and all available locales in case of binary files
● created - The creation date (the date when the resource itself was created)
● lastmodified - The date last modified (the last modification date of the resource itself)
● contentdate - The content date (the date when the resource's content was last modified)
● released - The release and expiration date of the resource
● content - A general content field that holds all extracted resource data (all languages, type text_general)
● contentblob - The serialized extraction result, stored to improve extraction performance while indexing
● category - All categories as general text
● category_exact - All categories as exact strings, for faceting
● text_<locale> - Extracted textual content, optimized for language-specific search
● timestamp - The time when the document was last indexed
● *_prop - All properties of a resource as searchable and stored text (<Property_Definition_Name>_prop)
● *_exact - All properties of a resource as exact, not stored strings (<Property_Definition_Name>_exact)
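The default fields are enough to express the motivating query from the session description ("all resources of type news that have changed since yesterday, where property X has the value Y") as a set of filter queries. The following Java sketch assembles them. The type name news, the property X and the value Y are placeholders taken from that description, and the helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: assembles filter queries (fq) from the default indexed fields.
class FilterQueries {

    // Searchable property field, following the <Property_Definition_Name>_prop convention.
    static String propField(String propertyName) {
        return propertyName + "_prop";
    }

    // Wraps a value in double quotes, escaping embedded backslashes and quotes.
    static String phrase(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // One fq clause per criterion; Solr intersects all filter queries.
    static List<String> newsChangedSinceYesterday(String property, String value) {
        List<String> fq = new ArrayList<>();
        fq.add("type:news");                       // placeholder resource type
        fq.add("lastmodified:[NOW-1DAY TO NOW]");  // Solr date math
        fq.add(propField(property) + ":" + phrase(value));
        return fq;
    }

    public static void main(String[] args) {
        newsChangedSinceYesterday("X", "Y").forEach(System.out::println);
    }
}
```

Because *_prop is a text field, the last clause matches on analyzed text; for an exact match on the whole property value, the corresponding *_exact field would be the candidate.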
Default indexed data
● Additional field mappings for XML contents can now be configured directly within the XSD schema
● Without modifying opencms-search.xml
● No restart of the servlet container required
XSD field mapping
<searchsetting element="DisplayDate" searchcontent="false">
<solrfield targetfield="myDisplayDateField" sourcefield="*_dt" />
</searchsetting>
<searchsetting element="Teaser">
<solrfield targetfield="ateaser">
<mapping type="item" default="Homepage n.a.">Homepage</mapping>
<mapping type="property-search">search.special</mapping>
<mapping type="dynamic" class="my.DynamicMapping">special</mapping>
</solrfield>
</searchsetting>
Configuration
● On a fresh installation of OpenCms 8.5, Solr is enabled by default; after updating an existing system to OpenCms 8.5, Solr is disabled
● To enable Solr after an update, you must create a Solr home directory in the WEB-INF folder of your OpenCms application
● Copy the solr/ folder from the OpenCms standard distribution as a starting point for your configuration
● All search configuration is done as usual in opencms-search.xml below WEB-INF/config
● Adding the following lines enables the embedded Solr server
Enable Solr in OpenCms
<opencms><search>
<solr enabled="true"/> […]
</search></opencms>
● You can add a custom Solr index with the known OpenCms search configuration syntax
● NOTE: class attributes are needed for the index and its field configuration
Search index configuration
<index
class="org.opencms.search.solr.CmsSolrIndex">
<name>Solr Online</name>
<rebuild>auto</rebuild>
<project>Online</project>
<locale>all</locale>
<configuration>solr_fields</configuration>
<sources>
<source>solr_source</source>
</sources>
</index>
● To convert a Lucene field configuration:
1. Copy a <fieldconfiguration> node
2. Change / set the class attribute
3. Optionally add a type attribute to the fields
Create field configuration (1/3)
<fieldconfiguration
class="org.opencms.search.solr.CmsSolrFieldConfiguration">
<name>example</name>
<description>Converted Lucene Index</description>
<fields>
<field name="meta" store="false" index="true" type="en">
<mapping type="property">Title</mapping>
<mapping type="property">Description</mapping>
</field>
</fields>
</fieldconfiguration>
● As the value of the type attribute of a field definition in opencms-search.xml, you can use the suffix of any dynamic field defined in schema.xml
● For example:
Create field configuration (2/3)
i - type="int"
dt - type="date"
txt - type="text_general"
en - type="text_en"
es - type="text_es"
fr - type="text_fr"
● As said before, Solr field names (<solr_name>) are defined in schema.xml; additional fields (<opencms_name>) are now defined in opencms-search.xml
● How does that work?
Create field configuration (3/3)
String fieldName = <opencms_name>_txt;
if (existsInSolrSchema(fieldName)) {
fieldName = <opencms_name>;
} else if (isTypeAttributeSet()) {
fieldName = <opencms_name>_<type>;
}
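The pseudocode above can be turned into a small runnable Java sketch. It rests on one interpretation that the slide does not confirm: the schema check is meant for the plain <opencms_name>, so a field that is declared explicitly in schema.xml keeps its name. The class, the method and the set of schema fields are hypothetical stand-ins, not the OpenCms implementation.

```java
import java.util.Set;

// Sketch of the field name resolution, under the assumption stated in the lead-in.
class FieldNameResolver {

    // Hypothetical stand-in for the field names declared explicitly in schema.xml.
    private final Set<String> schemaFields;

    FieldNameResolver(Set<String> schemaFields) {
        this.schemaFields = schemaFields;
    }

    // type is the optional type attribute from opencms-search.xml (e.g. "en", "dt"), or null.
    String resolve(String opencmsName, String type) {
        if (schemaFields.contains(opencmsName)) {
            return opencmsName;              // declared in schema.xml: use the name as is
        }
        if (type != null && !type.isEmpty()) {
            return opencmsName + "_" + type; // maps to the dynamic field *_<type>
        }
        return opencmsName + "_txt";         // default: general text dynamic field
    }

    public static void main(String[] args) {
        FieldNameResolver resolver = new FieldNameResolver(Set.of("content"));
        System.out.println(resolver.resolve("content", null));
        System.out.println(resolver.resolve("meta", "en"));
        System.out.println(resolver.resolve("teaser", null));
    }
}
```

With this reading, the field meta with type="en" from the configuration example would be indexed as meta_en.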
Live Demo
● Having Solr and VIE integrated into OpenCms, we are well prepared to start using Apache Stanbol
● Stanbol is a top-level Apache project
● Stanbol guarantees a quality standard
● Stanbol opens the perspective of sustainability
● We are looking to integrate Stanbol into OpenCms 9
Future steps with IKS and Stanbol
Live Demo
● Permission checked search (secure)
● Solr Request handler (accessible)
● Solr Collector (integrated)
● Result highlighting (user-friendly)
● Configuration opportunities (flexible)
● Search field mapping (sensitive)
● Type based field schema (type-safe)
● Lucene conversion (compatible)
Integration Conclusion
Rüdiger Kurz
Alkacon Software GmbH
http://www.alkacon.com
http://www.opencms.org
http://www.iks-project.eu
http://stanbol.apache.org
Thank you very much for your attention!
Any Questions?